RRHub audit reports stopped arriving two days ago. SketchScript reports kept coming. Same system, same cron job, same time every day. One worked, one didn’t, and nothing told me.
What OpenClaw Does
OpenClaw monitors Pickaxe chatbot studios. It pulls subscriber data from Kit, conversation history from Pickaxe, feeds everything to an LLM, and produces audit reports — findings, recommendations, usage patterns. Two studios: SketchScript and RRHub (MyCityZen). Runs daily on a VPS cron job, sends reports back to the vault via SCP, pings Telegram when it’s done.
That’s the design. Here’s what was actually happening.
The Sequence
SketchScript runs first. It hits the Anthropic Claude API, burns through most of the 30,000 token/minute rate limit, and finishes successfully. About two minutes later, RRHub tries the same API. 429. Rate limited.
No retry logic. No per-studio error isolation. RRHub’s API failure crashed the entire process — the exception wasn’t caught, so the script died mid-run. SketchScript’s report had already been written, so it looked like everything was fine. RRHub just… stopped existing.
There was a secondary bug hiding underneath: the Kit tag in the config pointed to a tag that didn’t exist — a typo from when the product was renamed. That one would have surfaced eventually, but the rate limit killed things before it ever mattered.
The Fix I Didn’t Write
The obvious fix is retry logic — back off, wait, try again, twenty lines of Python. But once I was looking at the architecture, the whole VPS setup started looking like overhead. The analysis task is structured data extraction with a template. It doesn’t need Claude-level reasoning — a 7B model handles it fine. And if the model runs locally, the entire dependency chain collapses: no API credits, no rate limits, no SCP to move files around, no Telegram notifications (the reports land directly in the vault), no environment variables for API keys.
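For the record, those twenty-ish lines I didn't write would look something like this sketch. Everything here is illustrative: `RateLimitError` stands in for whatever 429 exception the API client raises, and the delays are generic exponential backoff, not tuned values.

```python
import time

class RateLimitError(Exception):
    """Stand-in for the API client's 429 exception."""

def with_backoff(fn, retries=5, base_delay=2.0, sleep=time.sleep):
    """Call fn, retrying on RateLimitError with exponential backoff:
    wait 2s, 4s, 8s, ... between attempts, re-raising after the last."""
    for attempt in range(retries):
        try:
            return fn()
        except RateLimitError:
            if attempt == retries - 1:
                raise
            sleep(base_delay * (2 ** attempt))
```

Twenty lines, and RRHub would have survived SketchScript burning the rate limit. It just wasn't the fix worth writing.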
So instead of adding retry logic to save the existing architecture, I replaced the architecture.
What Changed
The analyzer. claude_analyzer.py became ollama_analyzer.py. Instead of hitting the Anthropic API, it talks to Ollama’s HTTP API on localhost:11434. Model: qwen2.5-coder:7b, 32,768-token context window, temperature 0.3.
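A minimal sketch of that swap, using only the standard library. The function names are illustrative; the request shape (`/api/generate` with `stream: false` and an `options` dict) follows Ollama's documented API, with the model and settings from above.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default endpoint

def build_payload(prompt: str) -> dict:
    """Request body for a single non-streaming completion."""
    return {
        "model": "qwen2.5-coder:7b",
        "prompt": prompt,
        "stream": False,
        "options": {"temperature": 0.3, "num_ctx": 32768},
    }

def analyze(prompt: str) -> str:
    """POST the prompt to the local Ollama server, return the raw completion."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(build_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

No API key, no environment variables: if the Ollama daemon is up, the call works.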
The analysis prompts needed simplifying for the smaller model — explicit JSON output format, structured field requirements. The old prompts assumed a model that could infer structure from loose instructions. The new ones spell it out.
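A spelled-out prompt under those constraints might look like this sketch. The field names here are illustrative, not the actual prompt; the point is the shape: an exact JSON skeleton the 7B model fills in rather than loose instructions it has to interpret.

```python
AUDIT_PROMPT = """You are auditing a chatbot studio's subscriber and conversation data.
Return ONLY valid JSON, no prose, with exactly these fields:
{{
  "findings": ["list of strings"],
  "recommendations": ["list of strings"],
  "free_users": 0,
  "generation_success_rate": 0.0
}}

Data (may be truncated):
{data}
"""

def render_prompt(data: str) -> str:
    """Fill the raw studio data into the fixed-structure audit prompt."""
    return AUDIT_PROMPT.format(data=data)
```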
The data pipeline. SketchScript’s raw data came back at 412,332 characters — nowhere near fitting in a 7B context window. Added truncation at 60,000 characters before analysis, which lands at roughly 15K tokens and leaves room for the prompt template and output.
The error isolation. Each studio now runs in its own try/except. If SketchScript fails, RRHub still runs. If RRHub fails, SketchScript’s report is already written. The bug that killed the whole system for two days is structurally impossible now.
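The isolation pattern is the important part, and it is short. A sketch, where `run_one` and `log` are stand-ins for the real pipeline and logger:

```python
def run_daily_audit(studios, run_one, log):
    """Run each studio's audit independently; one failure can't kill the rest."""
    results = {}
    for name in studios:
        try:
            run_one(name)
            results[name] = "ok"
        except Exception as exc:  # deliberately broad: isolate any per-studio failure
            results[name] = f"failed: {exc}"
            log(f"{name} audit failed: {exc}")
    return results
```

Under the old design, the first exception unwound the whole process; here the worst case is one missing report and one logged failure.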
The delivery. vault_sync.py went from SCP-to-remote-VPS to a local file copy into the Obsidian vault inbox. Reports land directly. No network hop, no SSH, no Telegram.
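The delivery step reduces to a few lines of `shutil`. The vault path below is a placeholder, not the real one:

```python
import shutil
from pathlib import Path

VAULT_INBOX = Path.home() / "Obsidian" / "Vault" / "inbox"  # placeholder path

def deliver(report_path: Path, inbox: Path = VAULT_INBOX) -> Path:
    """Copy a finished report straight into the vault inbox. No SCP, no SSH."""
    inbox.mkdir(parents=True, exist_ok=True)
    return Path(shutil.copy(report_path, inbox))
```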
The scheduler. VPS cron job became a macOS launchd plist (com.openclaw.daily), fires at 07:00 every morning.
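A launchd plist for that schedule looks roughly like this. The label and the 07:00 trigger are from the setup above; the interpreter and script path are placeholders:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN"
  "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
    <key>Label</key>
    <string>com.openclaw.daily</string>
    <key>ProgramArguments</key>
    <array>
        <string>/usr/bin/python3</string>
        <string>/path/to/openclaw/run_daily.py</string>
    </array>
    <key>StartCalendarInterval</key>
    <dict>
        <key>Hour</key>
        <integer>7</integer>
        <key>Minute</key>
        <integer>0</integer>
    </dict>
</dict>
</plist>
```

Unlike cron, launchd's `StartCalendarInterval` also runs a missed job at next wake if the machine was asleep at 07:00.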
The Before and After
| Component | VPS (before) | Local (after) |
|---|---|---|
| Scheduler | cron | launchd |
| Data collection | Pickaxe API | Pickaxe CLI |
| Analysis | Claude API (Anthropic) | Ollama (qwen2.5-coder:7b) |
| Delivery | SCP + Telegram | shutil.copy to vault |
| Error isolation | None | Per-studio try/except |
| API cost | Per-token | Zero |
| Network dependencies for analysis | Anthropic API | None |
Verification
Both studios ran successfully. SketchScript’s audit analyzed 5 conversations, found 3 free users, 100% generation success rate, 2 findings. The 412K-character dataset truncated cleanly to 60K before analysis. Used approximately 15,373 tokens on the local model.
RRHub produced a report for the first time in two days.
The Damage Report
| Metric | Value |
|---|---|
| Days RRHub was silently failing | 2 |
| Root cause | 30K token/min rate limit, no retry, no error isolation |
| Hidden secondary bug | Kit tag in config pointed to a non-existent tag (product rename typo) |
| Lines of retry logic written | 0 (replaced the architecture instead) |
| API cost going forward | $0 |
| Network dependencies for analysis | 0 (was 2: Anthropic API + SCP) |
| Raw data truncated | 412,332 chars to 60,000 chars |
| Local model tokens used | ~15,373 |
The Minimum Viable Fix Wasn’t
The minimum viable fix was retry logic. But the rate limit failure was a symptom of a deeper mismatch: I was paying for cloud infrastructure and API tokens to run a task that doesn’t need either.
Sometimes the right response to “this broke” is “why is this here at all?” The rate limit forced that question. The answer turned out to be: no good reason anymore.
The 7B model produces slightly less polished prose in the audit reports, but the findings, recommendations, and structured data are all there. For a daily monitoring task I skim over coffee, that tradeoff isn’t even close.
Why Two Days Passed Without Anyone Noticing
The worst part wasn’t the rate limit — it was the two days. SketchScript kept succeeding, so the system looked healthy. The cron job ran, a report appeared, Telegram pinged. Everything looked fine.
The absence of a report is invisible unless you’re specifically checking for it. “I got a SketchScript report today” doesn’t trigger “but where’s the RRHub report?” — not when you’re scanning Telegram notifications between other things.
Per-studio error isolation doesn’t just prevent cascade failures. It makes failures visible. When each studio’s success or failure is independent, you can check for the presence of each report individually. A missing report is a signal, not a gap you have to notice.
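That presence check is cheap to automate. A sketch, assuming reports land in the vault inbox with the studio name and an ISO date in the filename; that naming convention is an assumption, not the actual one:

```python
from datetime import date
from pathlib import Path

def missing_reports(inbox: Path, studios, day: date) -> list:
    """Return the studios with no report file for the given day - the check
    that would have flagged RRHub on day one instead of day three."""
    stamp = day.isoformat()
    return [s for s in studios if not any(inbox.glob(f"{s}*{stamp}*"))]
```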