
BM25: the keyword ranking that is hard to beat

BM25 (Best Matching 25) is a classic keyword ranking function. It scores how well a document matches a query from two signals: how often the query words appear in the document, and how rare those words are across the whole collection. Rare words count for more, and each repeat of a word earns less extra credit than the one before. It is the strong lexical-search baseline that fancier dense vector retrieval often fails to beat, especially on small, curated, well-tagged collections.

At a glance

  • What it is: a keyword ranking function that scores query-to-document match
  • What it rewards: words that appear in the document, weighted up when they are rare
  • Where it shines: small, curated, well-tagged collections with exact-word queries
  • Why it matters: the lexical baseline dense embeddings have to beat, and often do not

Flow

How BM25 scores a document

Two signals decide the score: how often the query words appear here, and how rare those words are across the whole collection.

1. Query words: the terms a reader actually typed
2. Count them in the document: more matches score higher, but with diminishing returns
3. Weight by rarity: a rare word counts for more than a common one
4. Rank documents by score: the highest-scoring documents come back first

What does BM25 actually measure?

BM25 (Best Matching 25) answers one question: given a search query, which documents match it best? It scores each document from two signals. The first is how often the query words appear in that document, so a page that mentions your term repeatedly scores higher than one that mentions it once. The second is how rare each word is across the whole collection, so a rare, specific word counts for far more than a common one that turns up everywhere. There is also a brake on the first signal: once a word has appeared many times, each further mention adds less, so a document cannot win by sheer repetition.
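For readers who want the arithmetic, the description above corresponds to the usual Okapi BM25 formula. The notation below is the common textbook form rather than anything specific to this page: the symbols f, |D|, avgdl, N, n and the tuning parameters k_1 and b are not named in the text above, and the IDF shown is one widely used variant among several.

```latex
\[
\operatorname{score}(D, Q) = \sum_{t \in Q} \operatorname{IDF}(t)\,
  \frac{f(t, D)\,(k_1 + 1)}{f(t, D) + k_1\left(1 - b + b\,\frac{|D|}{\mathrm{avgdl}}\right)}
\]
\[
\operatorname{IDF}(t) = \ln\!\left(1 + \frac{N - n(t) + 0.5}{n(t) + 0.5}\right)
\]
```

Here f(t, D) is how often term t occurs in document D, |D| is the document's length in words, avgdl is the average document length, N is the number of documents, and n(t) is how many of them contain t. The fraction is the frequency signal with its brake (it can never exceed k_1 + 1), IDF is the rarity signal, and b controls how strongly long documents are penalised. Values around k_1 = 1.2 and b = 0.75 are common defaults.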

That is the whole idea. No model, no training, no vectors. It looks at the words the reader typed, counts them, weights them by rarity, and ranks. The result is cheap to compute, easy to read, and easy to debug, because you can always trace exactly why a document scored the way it did.
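To make the traceability point concrete, here is a minimal sketch in Python. It assumes whitespace tokenisation, the common defaults k1 = 1.2 and b = 0.75, and one widely used IDF variant; the names tokenize and bm25_explain are made up for this example and do not come from any library.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase + whitespace split stands in for a real tokenizer.
    return text.lower().split()


def bm25_explain(query, docs, k1=1.2, b=0.75):
    """For each document, return {query term: that term's share of the score}."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs

    # Document frequency: in how many documents does each word appear?
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))

    breakdowns = []
    for toks in tokenized:
        tf = Counter(toks)
        # Longer-than-average documents get a larger denominator.
        length_norm = 1 - b + b * len(toks) / avg_len
        contributions = {}
        for term in tokenize(query):
            if tf[term] == 0:
                continue  # a word that is absent contributes nothing
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            contributions[term] = idf * tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
        breakdowns.append(contributions)
    return breakdowns


docs = ["the cat sat on the mat", "a dog ran across the yard"]
breakdown = bm25_explain("the cat", docs)
scores = [sum(parts.values()) for parts in breakdown]
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)

for i in ranking:
    print(docs[i], round(scores[i], 3), breakdown[i])
```

The per-term breakdown is the debugging story in miniature: in the first document the rare word "cat" contributes more than the two occurrences of the common word "the", and the second document scores only through "the".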

Why does it keep beating the fancy approach?

The fashionable alternative is dense retrieval, which turns both the query and the documents into embeddings (vectors of numbers) and compares them by meaning rather than by shared words. That helps when a reader and a document say the same thing in different words. It is genuinely better at synonyms.

But matching by meaning has a cost, and it has failure modes of its own. On a collection that is small, curated, and already well tagged, the documents tend to use the same words a reader would search for, and that is exactly the home ground where keyword scoring is strongest. I found this out the hard way: on my own tagged corpus, plain keyword scoring like BM25 matched the dense embeddings, so I rolled the embeddings back and kept the simpler layer. The fancy method has to earn its place by beating the baseline, and here it did not.

So BM25 is worth knowing for two reasons. It is often the right answer on its own, for small, well-organised collections. And even when you do reach for embeddings, it is the honest baseline you measure them against, so you know whether the extra machinery is actually paying for itself.
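Measuring against the baseline can be as simple as scoring both systems on the same judged queries. The sketch below uses recall@k as the metric, which is one reasonable choice and not necessarily the one used for the corpus described above. The query ids, document ids, and both rankings are placeholders invented to show the mechanics; they are not measurements.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant_ids:
        return 0.0
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)


def mean_recall_at_k(run, judgments, k):
    """Average recall@k over every judged query; a missing query scores 0."""
    per_query = [
        recall_at_k(run.get(query_id, []), relevant, k)
        for query_id, relevant in judgments.items()
    ]
    return sum(per_query) / len(per_query)


# Placeholder data: which documents a human judged relevant to each query.
judgments = {"q1": {"d1", "d4"}, "q2": {"d2"}}

# Placeholder rankings, as each system might return them (best first).
bm25_run = {"q1": ["d1", "d3", "d4"], "q2": ["d2", "d5", "d1"]}
dense_run = {"q1": ["d4", "d2", "d1"], "q2": ["d5", "d2", "d3"]}

bm25_recall = mean_recall_at_k(bm25_run, judgments, k=2)
dense_recall = mean_recall_at_k(dense_run, judgments, k=2)
print(f"BM25 recall@2:  {bm25_recall:.2f}")
print(f"dense recall@2: {dense_recall:.2f}")
```

If the dense number does not clearly exceed the BM25 number on your own judged queries, the extra machinery has not yet paid for itself.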

Check it yourself

printf 'the cat sat on the mat\n' > doc1.txt; printf 'a dog ran across the yard\n' > doc2.txt; rg --count-matches cat doc1.txt doc2.txt

This is lexical matching at its simplest: ripgrep prints doc1.txt:1 and lists nothing for doc2.txt, which contains no match. BM25 adds the rarity weighting and length handling on top of exactly this counting idea, then ranks the documents by the combined score.
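The step from a raw count to BM25's frequency signal can be sketched in a few lines of Python. This assumes the common default k1 = 1.2 and a document of exactly average length, so the length term drops out; saturated_tf is a name made up for this illustration.

```python
def saturated_tf(count, k1=1.2):
    """Frequency part of BM25 for a document of average length."""
    return count * (k1 + 1) / (count + k1)


counts = list(range(1, 11))
values = [saturated_tf(c) for c in counts]
# Extra credit earned by each additional mention of the word.
gains = [later - earlier for earlier, later in zip(values, values[1:])]

for count, value in zip(counts, values):
    print(f"mentions={count:>2}  credit={value:.3f}")
```

Each extra mention still raises the credit, but by less than the previous one, and the credit never reaches k1 + 1. That ceiling is why repeating a word many times cannot win the ranking on its own.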

BM25 is strong when

  • The collection is small, curated, and already well tagged
  • Readers search with the same words the documents use
  • You want a baseline that is cheap, transparent, and easy to debug
  • You need an honest number to measure dense retrieval against

BM25 struggles when

  • The answer is phrased in different words than the query (synonyms)
  • Meaning matters more than shared words; that is where embeddings earn their keep
  • Documents are very long and uneven, where length tuning starts to matter
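The length tuning mentioned in the last point usually means the parameter b in the standard BM25 formula. As a rough sketch, assuming the common defaults k1 = 1.2 and b = 0.75 and a made-up collection with an average length of 500 words, the same three mentions earn very different credit in a short note and a long report. The function name tf_component is hypothetical.

```python
def tf_component(count, doc_len, avg_len, k1=1.2, b=0.75):
    """Frequency part of BM25, including the length adjustment."""
    length_norm = 1 - b + b * doc_len / avg_len
    return count * (k1 + 1) / (count + k1 * length_norm)


AVG_LEN = 500  # assumed average document length, in words

# Same word, same three mentions, very different document lengths.
short_doc = tf_component(3, doc_len=50, avg_len=AVG_LEN)
long_doc = tf_component(3, doc_len=5000, avg_len=AVG_LEN)

# With b = 0 the length adjustment is switched off entirely.
short_doc_b0 = tf_component(3, doc_len=50, avg_len=AVG_LEN, b=0.0)
long_doc_b0 = tf_component(3, doc_len=5000, avg_len=AVG_LEN, b=0.0)

print(f"b=0.75  short={short_doc:.3f}  long={long_doc:.3f}")
print(f"b=0     short={short_doc_b0:.3f}  long={long_doc_b0:.3f}")
```

With b = 0.75 the short document gets far more credit for the same three mentions; with b = 0 the two are treated identically. On a collection with very uneven lengths, where b sits between those extremes is a choice worth testing rather than assuming.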

Reviewed: June 2026