Second Brain Chronicles

238 Apple Books Into Booklore Via a Categorisation Script

304 files. No consistent naming. No trustworthy metadata. And a target system with its own opinions about how its API should work.

Apple Books had been accumulating files for years — D&D sourcebooks, LEGO instructions, music theory PDFs, novels, reference material, random downloads. All sitting in its local storage, unsearchable, uncategorised, invisible to the rest of my system. You know the feeling: you’re certain you own a book, you can picture the cover, but you can’t prove it exists because Apple’s walled garden doesn’t let you search your own library in any meaningful way. I wanted them in Booklore, where they’d be indexed and searchable alongside the 3,054 entries already there.

The stakes were simple: if this doesn’t work, those 304 files stay buried in Apple’s opaque storage forever. Not deleted — worse. Present but invisible. Books I own that I can’t find when I need them.

The plan seemed straightforward: categorise by filename patterns, dedupe against existing entries, create missing libraries, import. The plan did not account for Apple Books, macOS, or bash itself having strong feelings about the process.

The Categorisation

Regex matching against filenames. Not glamorous, but 304 files don’t justify a machine learning pipeline.

D&D books have “Player’s Handbook,” “Monster Manual,” “Dungeon Master” in the name. LEGO files are numbered instruction PDFs. Music books mention “guitar,” “theory,” “chord.” Everything else got sorted into Books or Reference based on whether it looked like a novel or a technical document.

| Category | Count | Pattern |
| --- | --- | --- |
| D&D | 44 | Sourcebook keywords, publisher names |
| LEGO | 11 | Numbered instruction PDFs |
| Music | 7 | Instrument/theory keywords |
| Books | 65 | Narrative titles, author names |
| Reference | 111 | Technical keywords, manual patterns |
| Skipped | 27 | Numeric IDs, UUIDs, no usable filename |
| Duplicates | 39 | Fuzzy match against 3,054 existing entries |

27 files had nothing to match against — numeric IDs, UUIDs, personal documents with filenames like 8A3F2D1E.pdf. Those got skipped. Better to skip than miscategorise.

Deduplication ran fuzzy title matching against Booklore’s existing library. 39 duplicates caught. Not bad for naive string-distance comparison.
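
For illustration, here is a minimal sketch of what a naive string-distance check can look like. It is not the script from this post: the choice of Levenshtein distance, the normalisation step, the 15% threshold, and the function names levenshtein, normalise and is_duplicate are all assumptions.

```shell
#!/usr/bin/env bash
# Sketch only: Levenshtein distance and the 15% threshold are assumed,
# not taken from the original script.

# Edit distance between two strings, using two-row dynamic programming.
levenshtein() {
  local a=$1 b=$2
  local la=${#a} lb=${#b}
  local i j cost del ins sub min
  local -a prev curr
  for ((j = 0; j <= lb; j++)); do prev[j]=$j; done
  for ((i = 1; i <= la; i++)); do
    curr[0]=$i
    for ((j = 1; j <= lb; j++)); do
      cost=1
      [[ ${a:i-1:1} == "${b:j-1:1}" ]] && cost=0
      del=$((prev[j] + 1))
      ins=$((curr[j-1] + 1))
      sub=$((prev[j-1] + cost))
      min=$del
      ((ins < min)) && min=$ins
      ((sub < min)) && min=$sub
      curr[j]=$min
    done
    prev=("${curr[@]}")
  done
  echo "${prev[lb]}"
}

# Drop the extension, lowercase, and keep only letters and digits.
normalise() {
  local s=${1%.*}
  printf '%s' "$s" | tr '[:upper:]' '[:lower:]' | tr -cd '[:alnum:]'
}

# Succeeds when the distance is at most 15% of the longer title.
is_duplicate() {
  local a b dist longest
  a=$(normalise "$1")
  b=$(normalise "$2")
  dist=$(levenshtein "$a" "$b")
  longest=${#a}
  ((${#b} > longest)) && longest=${#b}
  ((longest > 0 && dist * 100 <= longest * 15))
}
```

Called as is_duplicate "Monster Manual.pdf" "monster_manual (1).epub", the check succeeds because the normalised titles differ by one character. A pure-bash distance function is slow against thousands of entries, so a real run would probably want an exact-match pass first.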

Full categorisation regex patterns
# D&D — sourcebook keywords and publisher names
"(player.*handbook|dungeon.*master|monster.*manual|wizards.*coast|forgotten.*realms|sword.*coast|ravenloft|eberron|spelljammer|planescape|dragonlance|critical.*role|d&d|dnd|volo|xanathar|tasha|mordenkainen|fizbans|strixhaven)"

# LEGO — numbered instruction PDFs
"^[0-9]{4,6}.*\\.pdf$"

# Music — instrument and theory keywords
"(guitar|piano|bass|drum|music.*theory|chord|scale|tab|songbook|sheet.*music|ukulele|banjo)"

# Reference — technical keywords and manual patterns
"(manual|guide|handbook|reference|tutorial|documentation|specification|whitepaper|technical|programming|protocol|standard|rfc|api)"

# Books — everything else that has a plausible title
# (matched last as the fallback for files that passed dedup)
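
The patterns above could be wired into a first-match-wins function along these lines. This is a sketch under assumptions: only "Books last" is stated in this post, so the rest of the evaluation order is a guess, and SKIP_RE is a hypothetical stand-in for however the unusable filenames were actually detected.

```shell
#!/usr/bin/env bash
# Patterns copied from the list above. SKIP_RE is hypothetical.
DND_RE='(player.*handbook|dungeon.*master|monster.*manual|wizards.*coast|forgotten.*realms|sword.*coast|ravenloft|eberron|spelljammer|planescape|dragonlance|critical.*role|d&d|dnd|volo|xanathar|tasha|mordenkainen|fizbans|strixhaven)'
LEGO_RE='^[0-9]{4,6}.*\.pdf$'
MUSIC_RE='(guitar|piano|bass|drum|music.*theory|chord|scale|tab|songbook|sheet.*music|ukulele|banjo)'
REFERENCE_RE='(manual|guide|handbook|reference|tutorial|documentation|specification|whitepaper|technical|programming|protocol|standard|rfc|api)'
SKIP_RE='^[0-9a-f-]{8,}\.[a-z0-9]+$'   # assumed: bare hex, UUID, or numeric-ID names

# Print the category for one filename; the first matching pattern wins.
categorise() {
  local name
  name=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  if   [[ $name =~ $SKIP_RE ]];      then echo "Skipped"
  elif [[ $name =~ $DND_RE ]];       then echo "D&D"
  elif [[ $name =~ $LEGO_RE ]];      then echo "LEGO"
  elif [[ $name =~ $MUSIC_RE ]];     then echo "Music"
  elif [[ $name =~ $REFERENCE_RE ]]; then echo "Reference"
  else                                    echo "Books"
  fi
}
```

Order matters here: "Player's Handbook" also contains "handbook", so the D&D check has to run before Reference or the sourcebooks land in the wrong library.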

Three Bash Gotchas in One Script

This is where the story stops being about books and starts being about macOS platform assumptions.

Gotcha 1: The Shebang

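# Original shebang: macOS system bash 3.2, where declare -A fails
#!/bin/bash
# Fixed shebang: Homebrew bash 5 (default path on Apple Silicon)
#!/opt/homebrew/bin/bash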

The script used associative arrays (declare -A), which require bash 4+. macOS ships bash 3.2 from 2007. Apple won’t upgrade it because bash 4+ is GPLv3 and Apple won’t ship GPLv3 software. I had Homebrew’s bash 5 installed, but the shebang pointed at /bin/bash — the system’s ancient version.

The error message for “associative arrays in bash 3” is not “this feature requires bash 4.” It’s declare: -A: invalid option. Helpful.
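
One way to get a clearer failure is a version guard at the top of the script. The sketch below is not from the original script; supports_assoc_arrays is a hypothetical helper, and the only built-in it relies on is bash's BASH_VERSINFO array.

```shell
#!/usr/bin/env bash
# Hypothetical guard: fail with a readable message instead of
# "declare: -A: invalid option".
supports_assoc_arrays() {
  # Optional first argument overrides the detected major version.
  local major=${1:-${BASH_VERSINFO[0]}}
  ((major >= 4))
}

if ! supports_assoc_arrays; then
  echo "This script needs bash 4+, found $BASH_VERSION" >&2
  exit 1
fi

# Safe to use associative arrays from here on.
declare -A category_counts=()
category_counts["D&D"]=44
category_counts["LEGO"]=11
```

Using #!/usr/bin/env bash picks up whichever bash comes first on PATH, which is Homebrew's only if PATH is set up that way, so the guard is still worth having.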

Gotcha 2: The Pipe Subshell

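# This looks right but the counter stays at zero
count=0
find . -name "*.pdf" | while IFS= read -r file; do
  ((count++))   # increments a copy inside the pipeline's subshell
done

# Process substitution: counter persists in parent scope
count=0
while IFS= read -r file; do
  ((count++))   # increments the variable in the current shell
done < <(find . -name "*.pdf")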

A while read loop piped from find runs in a subshell. Variable increments inside the loop don’t persist to the parent scope. The counter that was supposed to track successful imports stayed at zero through the entire run. Everything was importing correctly — the script just couldn’t count.

What I expected: A running total of imports printed at the end.

What actually happened: “Successfully imported: 0” after importing 238 files.

Process substitution (< <(find ...)) runs the loop in the current shell. Same logic, same output, but the variables survive.
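
The difference is easy to reproduce in isolation. The following is a self-contained sketch, not part of the import script: it builds a throwaway directory with mktemp, and the -print0 / read -d '' pairing is an addition so that filenames containing spaces survive.

```shell
#!/usr/bin/env bash
# Throwaway demo directory with three PDFs, one with a space in its name.
demo_dir=$(mktemp -d)
touch "$demo_dir/a.pdf" "$demo_dir/b.pdf" "$demo_dir/c d.pdf"

# Piped loop: the increments happen in a subshell and are lost.
piped_count=0
find "$demo_dir" -name '*.pdf' -print0 | while IFS= read -r -d '' file; do
  piped_count=$((piped_count + 1))
done

# Process substitution: the loop runs in the current shell.
subst_count=0
while IFS= read -r -d '' file; do
  subst_count=$((subst_count + 1))
done < <(find "$demo_dir" -name '*.pdf' -print0)

echo "piped: $piped_count, process substitution: $subst_count"
rm -rf "$demo_dir"
```

With bash's default settings this prints "piped: 0, process substitution: 3".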

Each gotcha was a different class of assumption the OS makes about your code — and each one failed silently enough to waste real debugging time.

Gotcha 3: The epub Directory Bundle

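The epubs in Apple Books' local storage are directory bundles, not single files. A standard epub is a zip archive with a specific internal structure, but these copies sit on disk already unpacked, each one a folder whose name ends in .epub. Finder displays such a bundle as a single document; find sees the directory it really is.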

find . -type f -name "*.epub"  # Finds nothing — silently
find . -type d -name "*.epub"  # Finds all epubs

find -type f missed every epub in the collection. Silently. No error, no warning, just an incomplete file list. I didn’t notice until the import counts didn’t match — 238 imported but the library was missing titles I knew were there.

What macOS thinks an epub is: A directory with a .epub extension containing META-INF/, OEBPS/, and mimetype entries. Technically correct. Practically infuriating when you’re trying to find files.

Added -type d -name "*.epub" as a second pass and the missing books appeared.
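
The two passes can also be folded into one. This is a sketch rather than the command the script ended up with; find_epubs and count_epubs are hypothetical names, and -prune is there so find does not descend into a bundle once it has matched it.

```shell
#!/usr/bin/env bash
# Match .epub entries whether they are plain files or directory bundles.
# Output is NUL-delimited so names with spaces stay intact.
find_epubs() {
  find "$1" -name '*.epub' \( -type f -o -type d \) -prune -print0
}

# Count the matches under a directory.
count_epubs() {
  local n=0 path
  while IFS= read -r -d '' path; do
    n=$((n + 1))
  done < <(find_epubs "$1")
  echo "$n"
}
```

Running count_epubs on the Apple Books storage directory and comparing the result with a -type f count is a quick way to see how many titles the first pass would have dropped.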

The API’s Opinions

The Booklore API had its own ideas about data formats. Creating a library requires the path as an object, not a string:

// What I sent first (400 Bad Request)
{ "path": "/books/D&D" }

// What Booklore actually wants
{ "path": { "path": "/books/D&D" } }

Found this through trial and error after the first four API calls returned 400. The documentation doesn’t clarify this. (Why would it? That would be too easy.)
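
A small helper keeps the nesting in one place. This is a sketch: library_path_payload is a hypothetical name, it builds only the path object shown above, and any other fields or the endpoint the request goes to are not covered in this post, so they are left out.

```shell
#!/usr/bin/env bash
# Build the nested path object for a given library path.
# Escapes backslashes and double quotes so the output stays valid JSON.
library_path_payload() {
  local esc=${1//\\/\\\\}
  esc=${esc//\"/\\\"}
  printf '{ "path": { "path": "%s" } }\n' "$esc"
}

library_path_payload "/books/D&D"
```

The output can be passed to curl with --data and a Content-Type: application/json header once the endpoint is known.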

The Numbers

| Metric | Value |
| --- | --- |
| Input files | 304 |
| Categories | 5 (D&D, LEGO, Music, Books, Reference) |
| Duplicates caught | 39 |
| Skipped (unclassifiable) | 27 |
| Successfully imported | 238 |
| Libraries created | 5 |
| Library rescans (HTTP 204) | 7 |
| Container recreations | 0 (for once) |
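
These figures reconcile: 238 imported + 39 duplicates + 27 skipped = 304 input files, and the five category counts (44 + 11 + 7 + 65 + 111) sum to 238. A check like that can run at the end of a script. The sketch below is an assumed addition, with reconcile as a hypothetical function name.

```shell
#!/usr/bin/env bash
# Succeeds only when every input file is accounted for.
reconcile() {
  local input=$1 imported=$2 duplicates=$3 skipped=$4
  local accounted=$((imported + duplicates + skipped))
  if ((accounted != input)); then
    echo "Mismatch: $accounted accounted for, $input expected" >&2
    return 1
  fi
  echo "OK: $accounted of $input files accounted for"
}

reconcile 304 238 39 27
```

Fed the broken counter from Gotcha 2 (0 imported), the same call fails loudly instead of printing a quietly wrong total.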

The session ran long enough to hit multiple AI context compactions — the point where the conversation history gets compressed to fit the context window. The script didn’t care. It held state across every reset. Bash scripts don’t forget their variables when the conversation around them gets compressed. In a session where AI context kept evaporating and being rebuilt from summaries, the bash script was the one artifact that remembered everything — every counter, every path, every category assignment. The dumbest technology in the stack was the most reliable state machine.

The Pattern

Each of the three bash gotchas represents a different class of platform assumption:

  1. The shebang — assuming the system provides a modern tool (it provides a 2007 version)
  2. The pipe subshell — assuming shell syntax behaves like other languages (it has its own scoping rules)
  3. The epub bundle — assuming files are files (macOS disagrees for certain extensions)

None of these are bugs. They’re all documented, defensible design decisions. They’re also invisible until they bite you, and the error messages range from misleading (bash 3) to nonexistent (silent epub skip). In every case, I only found the problem because a number didn’t add up. The error messages didn’t catch it. The numbers did.

