Second Brain Chronicles

One Bot Starved the Other. So I Fired the Cloud.

RRHub audit reports stopped arriving two days ago. SketchScript reports kept coming. Same system, same cron job, same time every day. One worked, one didn’t, and nothing told me.

What OpenClaw Does

OpenClaw monitors Pickaxe chatbot studios. It pulls subscriber data from Kit, conversation history from Pickaxe, feeds everything to an LLM, and produces audit reports — findings, recommendations, usage patterns. Two studios: SketchScript and RRHub (MyCityZen). Runs daily on a VPS cron job, sends reports back to the vault via SCP, pings Telegram when it’s done.

That’s the design. Here’s what was actually happening.

The Sequence

SketchScript runs first. It hits the Anthropic Claude API, burns through most of the 30,000 token/minute rate limit, and finishes successfully. About two minutes later, RRHub tries the same API. 429. Rate limited.

No retry logic. No per-studio error isolation. RRHub’s API failure crashed the entire process — the exception wasn’t caught, so the script died mid-run. SketchScript’s report had already been written, so it looked like everything was fine. RRHub just… stopped existing.

There was a secondary bug hiding underneath: the config referenced a Kit tag that didn’t exist — a typo left over from when the product was renamed. That one would have surfaced eventually, but the rate limit killed the run before it ever mattered.

The Fix I Didn’t Write

The obvious fix is retry logic — back off, wait, try again, twenty lines of Python. But once I was looking at the architecture, the whole VPS setup started looking like overhead. The analysis task is structured data extraction with a template. It doesn’t need Claude-level reasoning — a 7B model handles it fine. And if the model runs locally, the entire dependency chain collapses: no API credits, no rate limits, no SCP to move files around, no Telegram notifications (the reports land directly in the vault), no environment variables for API keys.
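
For scale, the fix I skipped really is that small. A sketch of the retry wrapper I’d have written — the exception class and backoff values here are placeholders, not the Anthropic client’s actual API:

```python
import time

class RateLimitError(Exception):
    """Stand-in for the API client's 429 exception."""

def with_retry(fn, max_attempts=5, base_delay=30):
    """Call fn, retrying with exponential backoff on rate limits."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # out of attempts; let the caller see the failure
            time.sleep(base_delay * 2 ** attempt)  # 30s, 60s, 120s, ...
```

Waiting 30–120 seconds is plenty for a per-minute token window to reset, which is why this would have papered over the problem indefinitely.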

So instead of adding retry logic to save the existing architecture, I replaced the architecture.

What Changed

The analyzer. claude_analyzer.py became ollama_analyzer.py. Instead of hitting the Anthropic API, it talks to Ollama’s HTTP API on localhost:11434. Model: qwen2.5-coder:7b, 32,768-token context window, temperature 0.3.
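
The call site ends up roughly like this — a sketch against Ollama’s /api/generate endpoint using the settings above (function names are mine, not from the repo):

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"

def build_payload(prompt: str) -> bytes:
    """Request body for Ollama's /api/generate endpoint."""
    return json.dumps({
        "model": "qwen2.5-coder:7b",
        "prompt": prompt,
        "stream": False,  # one complete response, not a chunk stream
        "options": {"temperature": 0.3, "num_ctx": 32768},
    }).encode()

def analyze(prompt: str) -> str:
    """Send one analysis prompt to the local model, return its text."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=build_payload(prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

No API key, no auth header, no billable tokens — the whole request never leaves the machine.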

The analysis prompts needed simplifying for the smaller model — explicit JSON output format, structured field requirements. The old prompts assumed a model that could infer structure from loose instructions. The new ones spell it out.
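
An illustrative skeleton of the new prompt style (the field names are examples, not the production schema): everything the model must emit is spelled out, nothing is left to inference.

```python
# Explicit-schema prompt for the 7B model; {data} is filled per studio.
AUDIT_PROMPT = """You are auditing a chatbot studio. Analyze the data below.

Respond with ONLY valid JSON in exactly this shape:
{{
  "findings": ["..."],
  "recommendations": ["..."],
  "usage_patterns": ["..."]
}}

Data:
{data}
"""
```
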

The data pipeline. SketchScript’s raw data came back at 412,332 characters — nowhere near fitting in a 7B context window. Added truncation at 60,000 characters before analysis, which lands at roughly 15K tokens and leaves room for the prompt template and output.
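
The truncation itself is one guard, sketched here with the numbers from above:

```python
# 60,000 chars ≈ 15K tokens at ~4 chars/token, leaving room in the
# 32K-token window for the prompt template and the model's output.
MAX_CHARS = 60_000

def truncate_for_analysis(raw: str) -> str:
    """Cap raw studio data so it fits the local model's context."""
    return raw if len(raw) <= MAX_CHARS else raw[:MAX_CHARS]
```
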

The error isolation. Each studio now runs in its own try/except. If SketchScript fails, RRHub still runs. If RRHub fails, SketchScript’s report is already written. The bug that killed the whole system for two days is structurally impossible now.
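
Structurally it’s just a loop with the try/except inside it instead of around it — a minimal sketch, with the function names assumed:

```python
import logging

def run_all_studios(studios, run_audit):
    """Run each studio's audit independently; one failure can't starve the rest."""
    results = {}
    for name in studios:
        try:
            run_audit(name)
            results[name] = "ok"
        except Exception:
            # Log and move on -- the other studio still gets its report.
            logging.exception("audit failed for %s", name)
            results[name] = "failed"
    return results
```

The old code was equivalent to one try/except around the whole loop — the first uncaught exception ended the run for every studio after it.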

The delivery. vault_sync.py went from SCP-to-remote-VPS to a local file copy into the Obsidian vault inbox. Reports land directly. No network hop, no SSH, no Telegram.

The scheduler. VPS cron job became a macOS launchd plist (com.openclaw.daily), fires at 07:00 every morning.
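
The plist is boilerplate; a sketch of the shape (the script path is a placeholder):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN"
  "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
    <key>Label</key>
    <string>com.openclaw.daily</string>
    <key>ProgramArguments</key>
    <array>
        <string>/usr/bin/python3</string>
        <string>/path/to/openclaw/run_daily.py</string>
    </array>
    <key>StartCalendarInterval</key>
    <dict>
        <key>Hour</key>
        <integer>7</integer>
        <key>Minute</key>
        <integer>0</integer>
    </dict>
</dict>
</plist>
```

Dropped into ~/Library/LaunchAgents and loaded with launchctl, it fires at 07:00 like the cron job did — minus the VPS.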

The Before and After

| | VPS (before) | Local (after) |
|---|---|---|
| Scheduler | cron | launchd |
| Data collection | Pickaxe API | Pickaxe CLI |
| Analysis | Claude API (Anthropic) | Ollama (qwen2.5-coder:7b) |
| Delivery | SCP + Telegram | shutil.copy to vault |
| Error isolation | None | Per-studio try/except |
| API cost | Per-token | Zero |
| Network dependencies for analysis | Anthropic API | None |

Verification

Both studios ran successfully. SketchScript’s audit analyzed 5 conversations, found 3 free users, 100% generation success rate, 2 findings. The 412K-character dataset truncated cleanly to 60K before analysis. Used approximately 15,373 tokens on the local model.

RRHub produced a report for the first time in two days.

The Damage Report

| Metric | Value |
|---|---|
| Days RRHub was silently failing | 2 |
| Root cause | 30K token/min rate limit, no retry, no error isolation |
| Hidden secondary bug | Kit tag in config pointed to a non-existent tag (product rename typo) |
| Lines of retry logic written | 0 (replaced the architecture instead) |
| API cost going forward | $0 |
| Network dependencies for analysis | 0 (was 2: Anthropic API + SCP) |
| Raw data truncated | 412,332 chars → 60,000 chars |
| Local model tokens used | ~15,373 |

The Minimum Viable Fix Wasn’t

The minimum viable fix was retry logic. But the rate limit failure was a symptom of a deeper mismatch: I was paying for cloud infrastructure and API tokens to run a task that doesn’t need either.

Sometimes the right response to “this broke” is “why is this here at all?” The rate limit forced that question. The answer turned out to be: no good reason anymore.

The 7B model produces slightly less polished prose in the audit reports, but the findings, recommendations, and structured data are all there. For a daily monitoring task I skim over coffee, that tradeoff isn’t even close.

Why Two Days Passed Without Anyone Noticing

The worst part wasn’t the rate limit — it was the two days. SketchScript kept succeeding, so the system looked healthy. The cron job ran, a report appeared, Telegram pinged. Everything looked fine.

The absence of a report is invisible unless you’re specifically checking for it. “I got a SketchScript report today” doesn’t trigger “but where’s the RRHub report?” — not when you’re scanning Telegram notifications between other things.

Per-studio error isolation doesn’t just prevent cascade failures. It makes failures visible. When each studio’s success or failure is independent, you can check for the presence of each report individually. A missing report is a signal, not a gap you have to notice.

