Government data is public. Getting at it is not.
Spain’s national statistics institute (INE) has an API. The official gazette (BOE) also has an API. Both return XML, both have documentation that assumes you already know the schema, and both require enough setup that most people just Google the number they need and move on.
This week’s experiment: what if an AI assistant could query both directly?
## What I Tried
I built two MCP servers in a single session — one wrapping the INE statistics API (70+ datasets: CPI, employment, demographics, housing, tourism), another wrapping the BOE legislation API plus a local corpus of 12,052 laws across 18 jurisdictions.
The corpus came from legalize-es, an existing open-source collection. I pulled it in as a Git submodule rather than forking or copying. The two servers share a base package with caching, retry logic, an XML parser, and Spanish-specific type validation.
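A minimal sketch of what the shared retry logic might look like (the function name and parameters are my assumptions, not the actual package API): retry a flaky async call with exponential backoff before giving up on the upstream API.

```typescript
// Hypothetical retry helper, sketched for illustration. Retries an async
// operation a fixed number of times, doubling the delay between attempts.
async function withRetry<T>(
  fn: () => Promise<T>,
  attempts = 3,
  baseDelayMs = 200,
): Promise<T> {
  let lastError: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      // Exponential backoff: baseDelayMs, 2x, 4x, ...
      await new Promise((resolve) => setTimeout(resolve, baseDelayMs * 2 ** i));
    }
  }
  throw lastError;
}
```

Sharing this kind of plumbing in one base package means a fix to backoff behavior lands in both servers at once.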
The whole thing took about fifty minutes.
## Grep Won
The INE server worked on the first try — live API calls returned real CPI data without drama.
The BOE server was trickier. The legislation API returns XML with nested structures that don’t map cleanly to what you’d want an AI to reason about. Parsing decisions ate more time than the API integration itself.
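To illustrate the kind of flattening involved (the field names here are guesses at a BOE-like schema, not the real one): collapse nested metadata into the flat record an assistant can actually reason about.

```typescript
// Hypothetical shapes — the real BOE XML is considerably richer than this.
interface BoeRaw {
  metadatos: { identificador: string; titulo: string };
  analisis?: { materias?: { materia: string[] } };
}

interface BoeFlat {
  id: string;
  title: string;
  subjects: string[];
}

// Map the nested structure to a flat record, defaulting optional
// branches to an empty list rather than surfacing undefined.
function flatten(doc: BoeRaw): BoeFlat {
  return {
    id: doc.metadatos.identificador,
    title: doc.metadatos.titulo,
    subjects: doc.analisis?.materias?.materia ?? [],
  };
}
```

Most of the time went into decisions like these: which nested branches matter, and what a sensible default is when they're absent.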
The corpus search was where it got interesting. My first instinct was to reach for vector embeddings — semantic search, the whole setup. I stopped myself. Twelve thousand laws is not that many documents, and a grep-based approach with an in-memory index turned out to need zero infrastructure, build in seconds, and return results fast enough that the latency was indistinguishable from a network call. No vector database, no embedding model, no index to maintain.
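The grep-style approach can be sketched in a few lines (the interface and function names are mine, not the server's actual tool names): load every law into memory once, then answer queries with a case-insensitive substring scan.

```typescript
// Hypothetical in-memory corpus search, sketched for illustration.
interface Law {
  id: string;
  jurisdiction: string;
  text: string;
}

// Linear scan over the corpus. At ~12k documents this returns in
// milliseconds — no index to build, no infrastructure to maintain.
function searchCorpus(corpus: Law[], query: string, limit = 20): Law[] {
  const needle = query.toLowerCase();
  const hits: Law[] = [];
  for (const law of corpus) {
    if (law.text.toLowerCase().includes(needle)) {
      hits.push(law);
      if (hits.length >= limit) break; // stop early once we have enough
    }
  }
  return hits;
}
```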
One technical snag: the MCP SDK expects Zod v3 types internally. My project used a newer version. A compatibility import fixed it, but it cost ten minutes of confusion before I found the mismatch.
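One way this kind of mismatch resolves — and this is an assumption about the exact import path, so check your SDK and Zod versions — is that newer Zod releases expose the v3-compatible API under a subpath:

```typescript
// Assumption: a v3-compatible entry point, so tool schemas are built
// with the types the MCP SDK expects internally.
import { z } from "zod/v3";

const cpiParams = z.object({
  series: z.string(),              // hypothetical INE series identifier
  startDate: z.string().optional(),
});
```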
## The Deeper Thing
The instinct to over-engineer search is strong. “Twelve thousand documents” sounds like a semantic search problem — it isn’t. The threshold where you actually need vector embeddings is higher than most people assume, and every layer of infrastructure you add is a layer you maintain.
The transferable version: before adding complexity to a search problem, check whether the corpus fits in memory. If it does, start with the dumbest thing that works.
## The Numbers
| Metric | Value | Why It Matters |
|---|---|---|
| MCP servers built | 2 | INE (statistics) + BOE (legislation) |
| Total session time | ~50 min | Architecture through verified live calls |
| Laws in corpus | 12,052 | Across 18 Spanish jurisdictions |
| INE datasets accessible | 70+ | CPI, employment, demographics, housing, tourism |
| BOE tools | 10 | 8 live API + 2 corpus search |
| Vector databases needed | 0 | Grep won |
## Next
- Does the INE API rate-limit under real usage, or is the documentation just cautious?
- What’s the right granularity for corpus search — full document, article, or section?
- Still not sure about the collaboration angle. The play is to open-source this alongside a similar project for another EU country and let the ecosystem emerge. Whether that happens depends on conversations that haven’t happened yet.