I Rigged My Own RAG Benchmark. Twice.
New to this stack? The Self-Hosted Knowledge Base guide covers the plain-text architecture this article stress-tests, and the Start Here hub walks the bigger decisions first.
Quick Take
- The 2026 RAG playbook is unanimous: add a vector database, fuse it with keyword search, top it with a reranker. I tested all three before building any of them.
- My first benchmark used a model as judge and called it a tie. My second used hard numbers but queries I wrote myself, and vectors won in a landslide. Both verdicts were things I had built, not found.
- When I let the model write the queries instead, plain BM25 beat my baseline and matched vector search for about 20 MB of added memory instead of 300 to 500. The expensive stack bought nothing.
- The lesson outlived the benchmark: an LLM judge with no answer key measures nothing, and whoever picks the test queries picks the winner.
I almost bolted a vector database onto a system that did not need one. My own benchmark told me to, and I nearly listened. A week later the same benchmark told me the opposite, just as confidently. Both answers were wrong, and both times the person rigging the test was me.
The system is small. 386 Markdown files: operational notes, fix writeups, benchmark logs, every dead end worth keeping. Local agents read it over MCP. The retriever is deliberately stupid. It counts how often the query words appear in each document’s title, tags, path, and summary, and returns the top scorers. No embeddings, no vector store, no GPU. It has worked for months, which is exactly the kind of comfort that should make an engineer nervous, because working is not the same as good. So I went looking for the bad news: is the dumb retriever costing me quality I could buy back?
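A minimal sketch of a retriever of this kind, to make "deliberately stupid" concrete. The field names (`title`, `tags`, `path`, `summary`), the tokenizer, and the tie-breaking rule are assumptions for illustration, not the production code.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def baseline_score(query, doc):
    """Count how often each distinct query word appears in the metadata fields."""
    # Assumed schema: title, tags, path, summary. The document body is never read.
    fields = " ".join([doc["title"], " ".join(doc["tags"]), doc["path"], doc["summary"]])
    counts = Counter(tokenize(fields))
    return sum(counts[word] for word in set(tokenize(query)))


def baseline_search(query, docs, k=5):
    """Return the paths of the top-k documents with a non-zero score."""
    scored = [(baseline_score(query, d), d["path"]) for d in docs]
    scored = [s for s in scored if s[0] > 0]
    # Highest score first; ties broken by path so the order is deterministic.
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [path for _, path in scored[:k]]
```

There is no term weighting and no length normalization here: a rare word and a common word count the same, and nothing outside the metadata can match.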
What the playbook says to build
Ask the internet how to do retrieval well in 2026 and you get one answer with small variations. Embed every document with a sentence-transformer so you can search by meaning instead of by spelling, and keep the vectors in a database. Do not throw away keyword search; run both and fuse the rankings with something like reciprocal rank fusion, because each catches what the other drops. Put a reranker on top, a heavier model that rescores the finalists. And do not grade your own homework: generate a synthetic test set and measure, the way frameworks like RAGAS do.
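For readers who have not met reciprocal rank fusion, here is a small sketch. Each document earns 1 / (k + rank) from every ranking it appears in, and the sums decide the fused order. The constant k = 60 is the value commonly used since the original RRF paper; the function name and the tie-breaking rule are my own choices.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one ranked list."""
    scores = defaultdict(float)
    for ranking in rankings:
        # Ranks are 1-based: the top result contributes 1 / (k + 1).
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


keyword_ranking = ["a", "b", "c"]
dense_ranking = ["b", "d", "a"]
fused = reciprocal_rank_fusion([keyword_ranking, dense_ranking])
# fused == ["b", "a", "d", "c"]
```

Only ranks go in, never raw scores, which is why the method can combine a keyword scorer and a cosine similarity without putting them on a common scale.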
None of that is wrong. All of it assumes a corpus that is large, messy, and curated by nobody. Mine is small, hand-tagged, and read by the same person who wrote it. Advice built for one world does not automatically survive in another, and the only way to find out is to measure. So I measured, and that is where I embarrassed myself.
Test one: the judge that knew nothing
The eval literature loves a model as judge, so I started there. I gave my local 35B model two result sets for each of ten queries, the keyword baseline and a proper full-body BM25 retriever, and asked which was better. I shuffled which set was A and which was B so it could not simply favor a side.
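The shuffling step can be sketched as below. The function names and the "A"/"B"/"tie" verdict format are assumptions; the call to the judge model itself is left out on purpose.

```python
import random


def blind_pair(baseline_results, candidate_results, rng):
    """Assign the two result sets to slots A and B at random.

    Returns what the judge sees and a private key for decoding its verdict.
    """
    if rng.random() < 0.5:
        shown = {"A": baseline_results, "B": candidate_results}
        key = {"A": "baseline", "B": "candidate"}
    else:
        shown = {"A": candidate_results, "B": baseline_results}
        key = {"A": "candidate", "B": "baseline"}
    return shown, key


def unblind(verdict, key):
    """Map the judge's 'A' or 'B' back to a retriever; anything else is a tie."""
    return key.get(verdict, "tie")
```

Randomizing the slots removes position bias. It does nothing about the missing answer key, which is the real problem with this test.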
Five to four, one tie. A coin landing on its edge. The tidy reading was “no difference, the baseline is fine,” and I came close to writing it down.
It was noise wearing the costume of a verdict. A judge with no answer key is not scoring relevance, it is scoring its own gut, and when both lists look plausible the gut shrugs and picks one. The model was not dim. Nothing in the setup ever said which document was the correct one, so there was nothing for it to be right about. I had taken a careful measurement with an unmarked ruler.
Test two: rigging it the other way
So I gave the thing an answer key. Fourteen queries, each tied to the single document that should come back, scored on two numbers that cannot argue with you: did the right document land in the top five, and how near the top. No judge.
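Those two numbers are Recall@5 and mean reciprocal rank. A minimal sketch for the one-gold-document-per-query case follows; the dictionary layout (query mapped to ranked ids, query mapped to gold id) is an assumption.

```python
def recall_at_k(results_per_query, gold, k=5):
    """Fraction of queries whose gold document appears in the top k results."""
    hits = sum(1 for q, ranked in results_per_query.items() if gold[q] in ranked[:k])
    return hits / len(results_per_query)


def mean_reciprocal_rank(results_per_query, gold):
    """Average of 1 / rank of the gold document; a miss contributes 0."""
    total = 0.0
    for q, ranked in results_per_query.items():
        if gold[q] in ranked:
            total += 1.0 / (ranked.index(gold[q]) + 1)
    return total / len(results_per_query)


results = {"q1": ["a", "b"], "q2": ["c", "d", "e"], "q3": ["x"]}
gold = {"q1": "a", "q2": "d", "q3": "z"}
# q1 hits at rank 1, q2 at rank 2, q3 misses:
# recall_at_k -> 2/3, mean_reciprocal_rank -> (1 + 0.5 + 0) / 3 = 0.5
```

Neither function asks a model for an opinion, so two runs on the same results always give the same score.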
Then I did the clever move that ruined it. I wrote every query in plain language that stepped around the document’s own tags and title, on the theory that this is where meaning-search earns its keep. Someone types “the model ran out of memory after a restart.” The document is tagged oom and page-cache. Not a word in common. Keyword search should be blind here, and vectors should stroll in and win.
They did. The baseline put the right document in the top five twice out of fourteen. Dense vectors managed ten. On the eight queries with no shared words at all, the baseline scored a flat zero. It looked like a rout, and I enjoyed it for about an hour.
Then I reread my own queries and saw the con. I had hand-picked them to share no words with their answers. That is not an experiment, it is a trap built to a blueprint and sprung on the retriever I had already decided should lose. Test one had quietly given keyword search a home crowd by letting tag words leak into the questions. Test two stripped those words out on purpose and handed the home crowd to vectors. Two tests, opposite thumbs on the scale, both of them mine.
Test three: let the model ask the questions
The fix was the one step from the playbook I had skipped: stop choosing the queries. The whole point of a synthetic test set is to get your hands off the scale. So I pulled random documents and, for each, asked the model to write the query a real person with that problem would type, with no nudge toward or away from any particular word. Thirty queries. Two runs, different random seeds.
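A sketch of that procedure, with the model call injected as a plain function. The prompt wording, the `generate` callable, and the `body` and `path` field names are all hypothetical; the only points carried over from the text are random sampling, a fixed seed per run, and no steering toward or away from particular words.

```python
import random

# Hypothetical prompt: it describes the task and says nothing about tags or titles.
PROMPT = (
    "Here is a note from a knowledge base.\n\n{body}\n\n"
    "Write the search query a real person with this problem would type. "
    "Reply with the query only."
)


def build_test_set(docs, generate, n=30, seed=0):
    """Sample n documents at random and ask the model for one query per document.

    `generate` is any function that takes a prompt string and returns text,
    for example a thin wrapper around a local model endpoint.
    """
    rng = random.Random(seed)  # the seed makes a run repeatable
    sample = rng.sample(docs, n)
    return [
        {"query": generate(PROMPT.format(body=doc["body"])).strip(), "gold": doc["path"]}
        for doc in sample
    ]
```

The gold label comes for free: the document a query was generated from is the one that should come back.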
The neutral set showed how fake my trap had been. Across the generated queries, only three to six percent shared no words with the document’s tags or title. Real questions overlap the topic nearly every time, because the topic is where the words come from. My 57-percent-no-overlap benchmark had been measuring, with great precision, a situation that barely exists.
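The overlap check is cheap to run on any test set. A minimal sketch, assuming a simple lowercase tokenizer and an optional stopword set; how the original measurement tokenized or filtered words is not stated, so treat both as assumptions.

```python
import re


def words(text):
    """Set of lowercase alphanumeric tokens in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def zero_overlap_share(pairs, stopwords=frozenset()):
    """Fraction of queries that share no word with the gold document's title or tags.

    `pairs` is a list of (query, title, tags) tuples.
    """
    zero = 0
    for query, title, tags in pairs:
        doc_words = (words(title) | words(" ".join(tags))) - stopwords
        if not ((words(query) - stopwords) & doc_words):
            zero += 1
    return zero / len(pairs)


pairs = [
    ("the model ran out of memory after a restart", "", ["oom", "page-cache"]),
    ("page cache cold after reboot", "Page cache warmup", ["page-cache"]),
]
# The first query shares nothing with its tags; the second shares "page" and "cache".
share = zero_overlap_share(pairs)  # 0.5
```

For the hand-written set in test two, the same arithmetic is 8 zero-overlap queries out of 14, which is where the 57 percent comes from.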
And the ranking went honest. Across both seeds the baseline landed the right document in the top five 23 and 19 times out of 30. Full-body BM25 landed 27 and 26. Dense vectors landed 24 and 26. Hybrid fusion landed 27 and 23. BM25 led or tied every run. Vectors were good, never better, and never free.
The three tests on one card, and why only the last one is worth trusting:
| Test | Who chose the queries | Metric | Verdict it gave | The thumb on the scale |
|---|---|---|---|---|
| 1. Model as judge | Me, tag words leaked in | None, no answer key | 5 to 4, one tie | Judge graded its own gut; keyword got a home crowd |
| 2. Gold labels | Me, written to dodge the tags | Recall@5 and MRR | Vectors 10/14, baseline 2/14 | 57% of queries shared no words with their answer, a case that barely exists |
| 3. Model-generated | The model, no nudging | Recall@5, two seeds | BM25 led or tied every run | None: only 3 to 6% zero-overlap, like real questions |
Why the boring option won
What each option actually costs on this corpus, scored on the honest test:
| Retriever | Right doc in top 5 (two seeds) | Added memory | Speed | Earns its keep when |
|---|---|---|---|---|
| Keyword baseline | 23 and 19 of 30 | none | instant | never, it is the floor |
| Full-body BM25 | 27 and 26 of 30 | ~20 MB | 0.2 ms | curated and tagged, any size |
| Dense vectors | 24 and 26 of 30 | 300 to 500 MB | a model call | large, messy, untagged text |
| Hybrid fusion | 27 and 23 of 30 | dense plus fusion | a model call | big corpus, recall is the bottleneck |
| Reranker | a net loss here | dense plus a heavier model | two model calls | noisy candidate lists need reordering |
BM25 is the grown-up version of what my baseline was groping toward. It scores the whole document with real inverse-document-frequency weighting and length normalization instead of squinting at a 300-character summary. A hundred lines of standard library, no model, no GPU. Across all 386 documents it holds 20 megabytes of state and answers a query in two tenths of a millisecond.
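To show how little is involved, here is a compact standard-library sketch of Okapi BM25. It is not the retriever described above: the tokenizer, the defaults k1 = 1.5 and b = 0.75, and the non-negative IDF variant are common choices I am assuming, and it expects at least one non-empty document.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


class BM25:
    def __init__(self, docs, k1=1.5, b=0.75):
        """docs maps a document id to its full text."""
        self.k1, self.b = k1, b
        self.ids = list(docs)
        self.tf = [Counter(tokenize(docs[i])) for i in self.ids]
        self.lengths = [sum(tf.values()) for tf in self.tf]
        self.avg_len = sum(self.lengths) / len(self.lengths)
        # Document frequency: in how many documents each term occurs.
        df = Counter(term for tf in self.tf for term in tf)
        n = len(self.ids)
        # IDF variant that stays non-negative even for very common terms.
        self.idf = {t: math.log(1 + (n - d + 0.5) / (d + 0.5)) for t, d in df.items()}

    def search(self, query, k=5):
        terms = tokenize(query)
        scored = []
        for doc_id, tf, length in zip(self.ids, self.tf, self.lengths):
            # Length normalization: long documents need more hits to score the same.
            norm = self.k1 * (1 - self.b + self.b * length / self.avg_len)
            score = sum(
                self.idf[t] * tf[t] * (self.k1 + 1) / (tf[t] + norm)
                for t in terms
                if t in tf
            )
            if score > 0:
                scored.append((score, doc_id))
        scored.sort(key=lambda s: (-s[0], s[1]))
        return [doc_id for _, doc_id in scored[:k]]
```

The two pieces the keyword baseline lacks are both visible: `idf` makes a rare word worth more than a common one, and `norm` stops long documents from winning on volume alone.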
The vector stack I had been tempted by wanted a sentence-transformer and PyTorch resident in memory, 300 to 500 megabytes, to return results no better on this corpus. Not because embeddings are weak, but because my corpus is small, curated, and tagged, and two of every three tags appear on exactly one document, so a tag match all but names the file. The tags already carry the meaning that embeddings would add. On a pile of well-kept notes, the keeping is the retrieval. A vector index would have been a second rent on a room I had already furnished.
The same arithmetic sinks hybrid search and rerankers here. They earn their cost on large, noisy, untagged corpora. On a small tidy one they bill you in latency, memory, and dependencies and return a rounding error, or in the reranker’s case a small loss, because it keeps overruling answers that were already correct.
The two things I am keeping
An LLM judge with no answer key measures nothing. It feels rigorous and it hands you a clean number, and the number is weather. If you cannot name the right answer before the test runs, you do not have a test.
Whoever picks the queries picks the winner. That is the one that stings, because I did it twice in opposite directions and only caught it because the two rigged runs disagreed so loudly that one of them had to be lying. The defense is the dull discipline the eval people keep preaching: generate the queries, and never curate them toward the result you are hoping for.
What I actually changed
I put BM25 into the retriever all my local agents share, with the old scorer left in as a fallback for the day the import breaks. It runs under plain Python, so every agent gets the upgrade with nothing new to install. I did not add the vector database. I shelved it with evidence instead of a feeling, which is a far better place to leave a decision.
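The fallback can be as simple as a guarded import. A sketch under assumptions: the module name `kb_bm25`, its `search` function, and the stand-in `keyword_fallback` scorer are hypothetical placeholders, not the real layout.

```python
import importlib


def keyword_fallback(query, docs, k=5):
    """Stand-in for the old scorer: count query words in each document's text."""
    query_words = set(query.lower().split())
    scored = [
        (sum(word in query_words for word in text.lower().split()), doc_id)
        for doc_id, text in docs.items()
    ]
    scored = [s for s in scored if s[0] > 0]
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [doc_id for _, doc_id in scored[:k]]


def load_search(module_name="kb_bm25"):
    """Return (search_function, label), preferring the BM25 module if it loads."""
    try:
        module = importlib.import_module(module_name)
        return module.search, "bm25"
    except (ImportError, AttributeError):
        # Missing module or missing search(): keep answering with the old scorer.
        return keyword_fallback, "fallback"
```

Returning a label alongside the function makes it easy to log which scorer actually answered, so a silent fallback does not go unnoticed.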
I also started logging real queries, because everything above is still synthetic. Queries the model invents are more honest than the ones I cherry-picked, but they are not my actual traffic. In a few weeks I will have a record of what my agents really ask, and I will run this a fourth time against reality. If that data says the corpus has finally grown messy enough to want vectors, I will add them, and I will have a measurement to point at instead of a benchmark I massaged until it agreed with me.
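Logging of this kind needs very little. A minimal sketch that appends one JSON object per query to a file; the file layout and field names (`ts`, `query`, `results`) are assumptions.

```python
import json
import time


def log_query(path, query, results):
    """Append one query and the document ids returned for it as a JSON line."""
    record = {"ts": time.time(), "query": query, "results": results}
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")


def load_queries(path):
    """Read the log back as a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

An append-only log like this records what was asked and what came back, but not which result was the right one, so gold labels for a replay still have to come from somewhere else.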
Sources for the standard playbook I tested against: RAGAS for the synthetic-eval discipline, Okapi BM25 for the ranking function that won, Sentence-Transformers for the dense embeddings I did not end up needing, and reciprocal rank fusion for the hybrid step.
For the architecture underneath all this, see the knowledge base build. For the road I did not take, a second brain that does run on a vector store and the bugs that came with it, see A Second Brain for a Local Model. The current production stack lives on the stack page.