A No-Vector RAG That Works: The Architecture, Decision by Decision
New to self-hosting AI? Start at the Self-Hosted AI: Start Here hub. If you want the build steps rather than the design reasoning, the knowledge-base setup guide is the how-to; this article is the why.
Quick Take
- The retrieval system behind my local agents is a folder of Markdown, one JSON index, and about a hundred lines of BM25. No vector database, no embeddings, no GPU.
- Every layer is a deliberate choice with a reason and a source, not a default. This article walks each one.
- On a small, curated, technical corpus this is not a compromise: I benchmarked the heavy stack and it bought nothing.
- It is also small enough to be a product: point it at your own folders and it is a working local RAG in an afternoon.
Most retrieval-augmented-generation writeups show you a framework and a vector database. This one shows you the opposite: what you can delete and still answer questions well. The system serves a DGX Spark running local models, and it has carried real operational load for months. The interesting part is not that it is small. It is that each thing it lacks was considered and rejected on evidence.
A note on scope first. RAG just means: find the few documents relevant to a question and put them in front of the model, because the model only knows its training data plus what you hand it. Everything below is about the “find” step, done as cheaply as correctness allows.
Layer 1: the data is plain Markdown, on purpose
The corpus is *.md files across several roots: published blog posts, working documentation, podcast notes, and the operations namespace that holds the grid’s canonical facts and playbooks. Each file may carry a YAML frontmatter block with title and tags. That is the entire schema.
The decision: plain text, no database of record. The reason: the files are the source of truth, editable in any editor, diffable in git, readable without tooling. A database would add a migration story, a backup story, and a layer between me and my own notes. Binary formats (.pdf, .docx) are deliberately out of scope; if you need them, convert to Markdown in front of the indexer rather than teaching the indexer to parse them. The cost of this choice is that everything downstream has to be cheap enough to rebuild from scratch, which turns out to be a feature.
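To make "that is the entire schema" concrete, here is a minimal standard-library sketch of reading one such file. It is an illustration, not the actual indexer: the function names are hypothetical, and the parser assumes only flat `key: value` lines and inline lists such as `tags: [a, b]`. Anything richer needs a real YAML parser.

```python
from pathlib import Path


def split_frontmatter(text: str) -> tuple[dict, str]:
    """Return (metadata, body) for one Markdown file.

    Assumption: frontmatter is flat `key: value` lines plus inline lists.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text  # no frontmatter block at all
    try:
        end = next(i for i, line in enumerate(lines[1:], start=1)
                   if line.strip() == "---")
    except StopIteration:
        return {}, text  # unterminated block: treat the whole file as body
    meta = {}
    for line in lines[1:end]:
        key, sep, value = line.partition(":")
        if not sep:
            continue  # not a key/value line; skip it
        value = value.strip()
        if value.startswith("[") and value.endswith("]"):
            meta[key.strip()] = [v.strip().strip("'\"")
                                 for v in value[1:-1].split(",") if v.strip()]
        else:
            meta[key.strip()] = value.strip("'\"")
    return meta, "\n".join(lines[end + 1:])


def iter_markdown(roots: list[Path]):
    """Yield (path, (metadata, body)) for every *.md file under the roots."""
    for root in roots:
        for path in sorted(root.rglob("*.md")):
            yield path, split_frontmatter(path.read_text(encoding="utf-8"))
```

A file with no frontmatter comes back as an empty metadata dict and an untouched body, which matches "each file may carry" a block rather than must.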
Layer 2: the index is a single JSON file
A small Python indexer walks the roots, parses frontmatter, and writes one file: /data/knowledge-index.json. It holds per-document metadata (title, tags, a short summary, backlinks) and an inverted tag map. It is written atomically (temp file, then rename) so a live reader never sees a half-written index during the daily rebuild.
The decision: one human-readable JSON file, not a search server or an embedded database. The reason: at a few hundred documents, the entire index loads into memory in milliseconds, and because it is plain JSON I can open it, diff two rebuilds, and see exactly what changed. There is no daemon to keep alive, no port, no schema version. Recovery is deleting /data/knowledge-index.json and rerunning the indexer.
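The temp-file-then-rename write is a few lines in Python. This is a sketch of the general technique, not the actual indexer code: the function name is hypothetical and the shape of the `index` dict is left open.

```python
import json
import os
import tempfile
from pathlib import Path


def write_index_atomically(index: dict, target: Path) -> None:
    """Write JSON so a concurrent reader sees the old file or the new one."""
    # The temp file lives in the target's directory: os.replace is only
    # atomic when source and destination are on the same filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=target.parent,
                                    prefix=target.name, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            json.dump(index, tmp, indent=2, sort_keys=True)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the bytes to disk before the swap
        os.replace(tmp_name, target)  # the atomic rename
    except BaseException:
        # Clean up the partial temp file, then re-raise the original error.
        try:
            os.unlink(tmp_name)
        except FileNotFoundError:
            pass
        raise
```

Sorting keys and indenting costs a little size but keeps two rebuilds diffable, which is the point of choosing JSON in the first place.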
Layer 3: retrieval is Okapi BM25 over the full body
This is the core, and the place most systems reach for embeddings instead. Okapi BM25 ranks a document by how often the query terms appear in it, weighting rare terms more than common ones (inverse document frequency) and normalizing for length so a long file does not win just for being long. The implementation is about a hundred lines of standard library: tokenize, count, score with k1=1.5, b=0.75. Title and tag terms are boosted by repeating them into the document’s term bag.
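For readers who have not seen BM25 written out, a compact standard-library sketch follows. Only k1=1.5, b=0.75, and the repeat-to-boost trick come from the description above. The tokenizer, the boost factor of 3, the non-negative IDF variant, and all names are assumptions made for illustration.

```python
import math
import re
from collections import Counter

K1, B = 1.5, 0.75  # the parameter values stated in the article


def tokenize(text: str) -> list[str]:
    # Assumption: lowercase alphanumerics, keeping "_" so drop_caches survives.
    return re.findall(r"[a-z0-9_]+", text.lower())


class BM25:
    def __init__(self, docs: dict[str, dict], boost: int = 3):
        # docs maps an id to {"title": str, "tags": [str], "body": str}.
        self.bags = {}
        for doc_id, d in docs.items():
            boosted = d.get("title", "") + " " + " ".join(d.get("tags", []))
            # Boost title and tag terms by repeating them in the term bag.
            terms = tokenize(d.get("body", "")) + tokenize(boosted) * boost
            self.bags[doc_id] = Counter(terms)
        self.lengths = {i: sum(bag.values()) for i, bag in self.bags.items()}
        self.avg_len = sum(self.lengths.values()) / max(len(self.bags), 1) or 1.0
        # Document frequency: in how many documents each term appears.
        self.df = Counter(t for bag in self.bags.values() for t in bag)

    def idf(self, term: str) -> float:
        n, df = len(self.bags), self.df.get(term, 0)
        return math.log(1 + (n - df + 0.5) / (df + 0.5))  # rare terms weigh more

    def search(self, query: str, top_k: int = 5) -> list[tuple[str, float]]:
        terms = set(tokenize(query))
        scores = {}
        for doc_id, bag in self.bags.items():
            # Length normalization: long documents need more hits to score.
            norm = K1 * (1 - B + B * self.lengths[doc_id] / self.avg_len)
            score = sum(self.idf(t) * bag[t] * (K1 + 1) / (bag[t] + norm)
                        for t in terms if t in bag)
            if score > 0:
                scores[doc_id] = score
        return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
```

Usage is `BM25(docs).search("drop_caches memory")`, which returns `(doc_id, score)` pairs, best first. Documents that share no term with the query are omitted rather than returned with a zero score.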
The decision: sparse keyword retrieval, not dense vector similarity. The reason: this corpus is precise and technical. The queries that matter contain file paths, flags, error strings, command names. For that text, exact lexical matching is a stronger signal than semantic similarity, which tends to blur the very tokens you are searching for. The literature agrees: BM25 is hard to beat on keyword-heavy and technical collections, while dense embeddings win mainly on paraphrase-heavy natural-language prose, which is why serious systems on mixed corpora run a hybrid of both (sparse vs dense, when each wins; what actually breaks in production). Dropping vectors also costs little even in general: one study needed eight keyword results to match the recall of seven embedding results, a rounding error against the cost of running a vector database (RAG without embeddings).
The one upgrade that mattered: per-section chunking
The first version indexed each document as one bag of words. That let a single keyword hit drag a large file to the top even when the relevant passage was buried, and it returned the whole document rather than the part you needed. The current version splits each file into sections on its headings and indexes each section as its own unit, so the inverse-document-frequency math and the length normalization operate at section granularity. Results collapse to the best section per document and return its heading and anchor, so an agent lands on the right paragraph.
The splitter is fence-aware: it ignores # lines inside fenced code blocks, so a shell comment like # stop the service is never mistaken for a Markdown heading. That single bug, untreated, would have shredded every code-heavy playbook into nonsense sections.
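A fence-aware splitter can be sketched in a few dozen lines. This is an illustration under simplifying assumptions, not the actual splitter: a fence is any line starting with ``` or ~~~, it closes on the next line starting with the same marker, and the anchor slug format is invented here.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")
FENCE = re.compile(r"^\s*(```|~~~)")


def slugify(heading: str) -> str:
    # Hypothetical anchor format: lowercase words joined by hyphens.
    return re.sub(r"[^a-z0-9]+", "-", heading.lower()).strip("-")


def split_sections(body: str) -> list[dict]:
    """Split Markdown into heading sections, ignoring '#' inside fences."""
    sections = [{"heading": "", "anchor": "", "lines": []}]  # preamble
    fence = None  # the marker that opened the current fence, if any
    for line in body.splitlines():
        m = FENCE.match(line)
        if m:
            if fence is None:
                fence = m.group(1)      # entering a code block
            elif m.group(1) == fence:
                fence = None            # leaving it
        elif fence is None:
            h = HEADING.match(line)
            if h:                       # a real heading starts a new section
                title = h.group(2)
                sections.append({"heading": title,
                                 "anchor": slugify(title), "lines": []})
                continue
        sections[-1]["lines"].append(line)
    return [
        {"heading": s["heading"], "anchor": s["anchor"],
         "text": "\n".join(s["lines"]).strip()}
        for s in sections
        if s["heading"] or any(l.strip() for l in s["lines"])
    ]
```

Each returned section can then be indexed as its own unit, with the heading and anchor carried along so a hit points at the passage rather than the file.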
Two small ranking priors
Two corpus-specific nudges live in the retriever so that every consumer gets them, not just the command-line tool. First, a modest boost for the operational playbooks and canonical docs, so a how-to outranks a related blog essay when someone asks an operations question. Second, a tiny symptom-to-keyword expansion: a query for “out of memory” also matches “OOM”, “page cache”, “drop_caches”, because operators search by symptom while the docs are written in nouns. Both are a handful of lines, both are easy to read, and neither requires a model.
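Both priors fit in one small module. The sketch below is illustrative: the expansion terms are the ones named above, while the path prefixes, the 1.2 multiplier, and the function names are hypothetical placeholders.

```python
# Symptom phrase -> extra keywords appended to the query.
SYMPTOM_EXPANSIONS = {
    "out of memory": ["oom", "page cache", "drop_caches"],
}

# Hypothetical locations of the playbooks and canonical docs.
BOOSTED_PREFIXES = ("ops/playbooks/", "ops/canonical/")
PRIOR = 1.2  # hypothetical "modest boost"; tune against real queries


def expand_query(query: str) -> str:
    """Append known keywords when the query contains a symptom phrase."""
    lowered = query.lower()
    extra = [kw for symptom, kws in SYMPTOM_EXPANSIONS.items()
             if symptom in lowered for kw in kws]
    return " ".join([query, *extra])


def apply_priors(results: list[tuple[str, float]]) -> list[tuple[str, float]]:
    """Multiply scores of operational docs, then re-sort best first."""
    boosted = [
        (path, score * PRIOR if path.startswith(BOOSTED_PREFIXES) else score)
        for path, score in results
    ]
    return sorted(boosted, key=lambda r: r[1], reverse=True)
```

Keeping both functions in the retriever, rather than in one client, is what lets every consumer inherit them.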
Layer 4: tagging is a cached local-LLM pass
Tags are useful for filtering and for cheap relevance signal, but writing them by hand does not scale and a model is good at it. So the indexer asks the local production model to suggest tags for any untagged document, then caches the result.
The decision: generate tags with the model, but cache them and keep them out of the source files. The reason: the cache means the fast daily reindex stays fully tagged with no model call at all; the model is only consulted for genuinely new files on a full pass. Keeping generated tags in the index and a side cache, rather than writing them into every source file, keeps code and note repositories clean. There is a scar here worth naming: the tagger originally called a model on a second inference engine, and when that engine went dormant, tagging silently failed and half the corpus lost its tags. The fix was to point it at the always-on production model and to canonicalize near-duplicate tags (so firewall and networking do not fragment into two buckets). The lesson generalizes: a background enrichment step must degrade safely when its dependency is gone.
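One way to get "cached, and safe when the model is gone" is sketched below. Everything in it is an assumption for illustration: the cache is keyed by a hash of the body, `suggest` stands in for the call to the local model, and passing `suggest=None` models the fast daily pass that never calls a model.

```python
import hashlib
from typing import Callable, Optional


def tags_for(path: str, body: str, cache: dict,
             suggest: Optional[Callable[[str], list[str]]]) -> list[str]:
    """Return tags for a document, calling the model only on a cache miss."""
    key = hashlib.sha256(body.encode("utf-8")).hexdigest()
    entry = cache.get(path)
    if entry and entry["hash"] == key:
        return entry["tags"]            # unchanged file: no model call
    stale = entry["tags"] if entry else []
    if suggest is None:
        return stale                    # fast pass: never call the model
    try:
        tags = sorted({t.strip().lower() for t in suggest(body) if t.strip()})
    except Exception:
        # Dependency is down: keep the old tags instead of wiping them.
        return stale
    cache[path] = {"hash": key, "tags": tags}
    return tags
```

The important branch is the `except`: a dormant engine leaves the previous tags in place, so the failure described above would show up as stale tags on a few files rather than missing tags on half the corpus.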
Layer 5: serving is a local MCP tool
A small FastMCP server exposes the index to agents as a query_knowledge tool. MCP (Model Context Protocol) is the standard plug that lets a model call a tool; the agents (a local Qwen, opencode, others) call it like any function and get back ranked sections.
The decision: serve over MCP via standard input/output, not as a network service. The reason: stdio means the knowledge tool has no open port and no network attack surface; it runs as a child of whatever agent invoked it. The server caches the BM25 statistics and rebuilds them only when the index file’s modification time changes, so a day’s queries pay the build cost once. And it is defensive: if the BM25 module ever fails to import, it falls back to a naive scorer rather than taking down a tool the agents depend on. There is a measured precedent for this whole shape: a 2026 result found that an agent calling a keyword-search tool reaches over ninety percent of full vector-RAG quality with no standing vector database (Keyword search is all you need).
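The two defensive behaviors, rebuilding only when the modification time changes and falling back when the BM25 module is missing, can be sketched without any MCP wiring. The module name `kb_bm25`, the `build_scorer` function, and the `docs`, `title`, and `summary` keys are hypothetical; the real server and index layout may differ.

```python
import json
import os

try:
    from kb_bm25 import build_scorer  # hypothetical name for the BM25 module
except ImportError:
    def build_scorer(index: dict):
        """Naive fallback: count query-term occurrences in title and summary."""
        def score(query: str) -> list[tuple[str, int]]:
            terms = query.lower().split()
            hits = []
            for path, doc in index.get("docs", {}).items():
                text = f"{doc.get('title', '')} {doc.get('summary', '')}".lower()
                n = sum(text.count(t) for t in terms)
                if n:
                    hits.append((path, n))
            return sorted(hits, key=lambda h: h[1], reverse=True)
        return score


class CachedRetriever:
    def __init__(self, index_path: str):
        self.index_path = index_path
        self._mtime = None
        self._scorer = None

    def query(self, text: str):
        mtime = os.stat(self.index_path).st_mtime_ns
        if mtime != self._mtime:  # index was rebuilt: reload and rebuild stats
            with open(self.index_path, encoding="utf-8") as f:
                self._scorer = build_scorer(json.load(f))
            self._mtime = mtime
        return self._scorer(text)
```

A tool function registered with the MCP server would then just call `retriever.query(...)`. The per-query cost is one `stat` call until the daily rebuild swaps the file.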
Why no vectors, said plainly
Because I measured. Before trusting the design, I benchmarked keyword scoring, full-body BM25, dense embeddings, hybrid fusion, and a reranker against this exact corpus. BM25 won or tied at zero added memory, and the vector stack bought nothing. Getting to an honest number was its own adventure, including two benchmarks I accidentally rigged by choosing the test queries myself; the full path is in I Rigged My Own RAG Benchmark.
So is this the best of all possible designs? For a small, curated, technical corpus and a one-person sovereign setup, it is very close. The conditions under which BM25 wins all hold, and there is no second service, no embedding recompute, no cloud call. It is not the universal best, and the honest ceiling is worth stating: if you searched mostly by paraphrase, across languages, or for fuzzy concepts rather than exact terms, a hybrid of BM25 plus embeddings plus a reranker would pull ahead. Measure your corpus before you buy that stack.
How it compares to the off-the-shelf tools
Plenty of open-source projects solve “search my documents for an LLM”, and almost all of them are vector-first and heavier:
- txtai: a self-contained embeddings database with RAG built in. The closest single-package option, but vector-centric with more dependencies.
- LlamaIndex, Haystack, RAGFlow: full frameworks with pipelines and many index types. Powerful, and overkill for a folder of Markdown.
- Khoj: the closest in spirit, a self-hosted personal knowledge base over Markdown and Org, but it still uses embeddings underneath.
- bm25s, Meilisearch: stronger pure-keyword retrievers than the hundred lines here, if you outgrow them.
What this design offers by staying tiny: no dependency beyond the Python standard library, one human-readable index file, full-body BM25 with section anchors, and a retriever that is native to the agents over MCP and shared across all of them. None of the big frameworks ships that exact combination, which is the reason to keep this and not adopt theirs.
Could it be a product?
It nearly is one. The indexer, the BM25 retriever, and the MCP server are a complete minimal RAG application; the only thing tying it to my machine is a list of folder paths. Lift that into a config file, package it, and it is “a zero-dependency, no-vector RAG over your Markdown, native to your agents”. The niche at the minimalist end is genuinely open: every popular tool starts by assuming you want embeddings. If you want to build it, the Sovereign AI Blog MCP source is a clean reference for the serving half.
What I would not change, and what I would
Settled: plain text, single JSON, BM25, per-section chunks, stdio MCP, cached tagging. Each earned its place against a measured alternative.
Two items that were open when I first wrote this are now closed. The tag vocabulary was canonicalized, so plural and hyphen variants of one concept (agent and agents, health-check and healthcheck) stop fragmenting retrieval into separate buckets. And oversized sections now sub-chunk on paragraph boundaries, keeping fenced code blocks whole, instead of leaving one giant heading-section to blur the length math; the number of sections over five hundred tokens dropped by three quarters, and the few that remain are single indivisible blocks that should not be split.
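The sub-chunking rule (split on paragraph boundaries, never inside a fence) can be sketched as follows. Assumptions for illustration: whitespace-separated words stand in for tokens, fences use ``` only, and the function names are hypothetical.

```python
def split_blocks(text: str) -> list[str]:
    """Split on blank lines, except inside fenced code, which stays whole."""
    blocks, current, in_fence = [], [], False
    for line in text.splitlines():
        if line.lstrip().startswith("```"):
            in_fence = not in_fence
        if not line.strip() and not in_fence:
            if current:                     # blank line ends a paragraph
                blocks.append("\n".join(current))
                current = []
        else:
            current.append(line)            # includes blank lines in fences
    if current:
        blocks.append("\n".join(current))
    return blocks


def sub_chunk(text: str, max_tokens: int = 500) -> list[str]:
    """Greedily pack whole blocks into chunks of at most max_tokens."""
    chunks, current, size = [], [], 0
    for block in split_blocks(text):
        n = len(block.split())              # rough token count
        if current and size + n > max_tokens:
            chunks.append("\n\n".join(current))
            current, size = [], 0
        current.append(block)               # an oversized block stays whole
        size += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

A single block larger than the limit is emitted as its own chunk rather than cut, which corresponds to the "single indivisible blocks" left over after the change.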
What stays open is the one worth leaving open. The day a corpus arrives that is genuinely paraphrase-heavy, the right move is to add a dense index alongside BM25 and fuse them, not to replace what works. The architecture is built to allow that without a rewrite, which is the last design decision worth naming: keep the cheap thing cheap, and leave a clean seam for the expensive thing you have not needed yet.
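The seam can be small. One common way to fuse a sparse and a dense ranking is reciprocal rank fusion, shown below as an assumption: this article does not say which fusion method it would use, and the function name and k=60 (a commonly used default) are illustrative.

```python
def reciprocal_rank_fusion(rankings: list[list[str]],
                           k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked id lists: each list contributes 1 / (k + rank) per id."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

Because it works on ranks rather than raw scores, the BM25 side needs no changes: the existing retriever supplies one list, a future dense index supplies the other, and neither has to know how the other scores.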
For the build steps, see the knowledge-base setup guide. For the vector-store version of the same idea and the retrieval bugs that came with it, see A Second Brain for a Local Model.
Figure: No-Vector RAG, end to end. Markdown to agent answer.