BM25: the keyword ranking that is hard to beat

BM25 (Best Matching 25) is a classic keyword ranking function. It scores how well a document matches a query from two signals: how often the query words appear in the document, and how rare those words are across the whole collection. Rare words count for more, and a word stops earning extra credit once it appears many times. It is the strong lexical-search baseline that fancier dense vector retrieval often fails to beat, especially on small, curated, well-tagged collections.

What does BM25 actually measure?

BM25 (Best Matching 25) answers one question: given a search query, which documents match it best? It scores each document from two signals. The first is how often the query words appear in that document, so a page that mentions your term repeatedly scores higher than one that mentions it once. The second is how rare each word is across the whole collection, so a rare, specific word counts for far more than a common one that turns up everywhere. There is also a brake on the first signal: once a word has appeared many times, each further mention adds less, so a document cannot win by sheer repetition.
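Written out, the scoring looks roughly like this. This is the commonly cited Okapi BM25 form, not something specific to this page, and the symbols are my labels: f(q, D) is how often term q appears in document D, N is the number of documents, n(q) is how many of them contain q, |D| is the document's length in words, and avgdl is the average document length. The usual formula also adjusts for document length (the b term), a detail the description above leaves out. The values of k_1 and b are tunable; 1.2 and 0.75 are typical defaults, assumed here rather than prescribed.

```latex
\[
\operatorname{score}(D, Q) = \sum_{q \in Q} \operatorname{IDF}(q)\,
  \frac{f(q, D)\,(k_1 + 1)}{f(q, D) + k_1\left(1 - b + b\,\frac{|D|}{\mathrm{avgdl}}\right)}
\]
\[
\operatorname{IDF}(q) = \ln\!\left(\frac{N - n(q) + 0.5}{n(q) + 0.5} + 1\right)
\]
```

The IDF factor is the rarity signal: it grows as n(q) shrinks. The fraction is the frequency signal with its brake: as f(q, D) grows, the fraction approaches k_1 + 1 and never exceeds it, so repetition alone cannot push a score up without limit.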

That is the whole idea. No model, no training, no vectors. It looks at the words the reader typed, counts them, weights them by rarity, and ranks. The result is cheap to compute, easy to read, and easy to debug, because you can always trace exactly why a document scored the way it did.
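To make the "trace exactly why" point concrete, here is a minimal sketch in Python that scores documents and keeps the per-word contributions. It assumes a commonly used form of BM25 with typical default parameters (k1 = 1.2, b = 0.75) and a naive whitespace tokenizer. The function names and the toy documents are mine, invented for illustration; a real search engine does much more (stemming, stop words, an inverted index).

```python
import math
from collections import Counter

K1 = 1.2   # assumed saturation parameter (a typical default)
B = 0.75   # assumed length-normalisation parameter (a typical default)


def tokenize(text):
    """Naive lowercase whitespace tokenizer (an assumption for this sketch)."""
    return text.lower().split()


def bm25_explain(query, docs, k1=K1, b=B):
    """Return one (total_score, {term: contribution}) pair per document.

    Expects a non-empty list of non-empty documents.
    """
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs
    # Document frequency: how many documents contain each term at least once.
    df = Counter(term for toks in tokenized for term in set(toks))
    query_terms = set(tokenize(query))

    results = []
    for toks in tokenized:
        tf = Counter(toks)
        parts = {}
        for term in query_terms:
            if tf[term] == 0:
                continue  # a word the document lacks contributes nothing
            # Rarity signal: the fewer documents contain the term, the larger this is.
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            # Frequency signal with a brake: it levels off as tf grows.
            norm = k1 * (1 - b + b * len(toks) / avg_len)
            parts[term] = idf * tf[term] * (k1 + 1) / (tf[term] + norm)
        results.append((sum(parts.values()), parts))
    return results


# Hypothetical toy collection, only to show the shape of the output.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the zebra is a rare animal",
]
for doc, (score, parts) in zip(docs, bm25_explain("the zebra", docs)):
    print(f"{score:.3f}  {doc!r}  {parts}")
```

The per-term dictionary is the debugging story in miniature: for the third document it shows that "zebra", which appears in only one document, carries nearly all of the score, while "the", which appears in every document, adds very little.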

Why does it keep beating the fancy approach?

The fashionable alternative is dense retrieval, which turns both the query and the documents into embeddings (vectors of numbers) and compares them by meaning rather than by shared words. That helps when a reader and a document say the same thing in different words. It is genuinely better at synonyms.

But meaning matching has a cost, and it is not free of failure either. On a collection that is small, curated, and already well tagged, the documents tend to use the same words a reader would search for, and that is exactly the home ground where keyword scoring is strongest. I found this out the hard way: on my own tagged corpus, plain keyword scoring like BM25 matched the dense embeddings, so I rolled the embeddings back and kept the simpler layer. The fancy method has to earn its place by beating the baseline, and here it did not.

So BM25 is worth knowing for two reasons. It is often the right answer on its own, for small, well-organised collections. And even when you do reach for embeddings, it is the honest baseline you measure them against, so you know whether the extra machinery is actually paying for itself.
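One way to run that comparison is to score both rankers on the same set of queries with known relevant documents. The sketch below uses recall@k, which is my choice of metric rather than anything prescribed here, and every query ID, document ID, and ranking in it is a made-up placeholder, not a result from my corpus.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant_ids:
        return 0.0
    top_k = set(ranked_ids[:k])
    return len(top_k & set(relevant_ids)) / len(relevant_ids)


def mean_recall_at_k(rankings, judgments, k):
    """Average recall@k over every query that has relevance judgments."""
    scores = [recall_at_k(rankings[q], judgments[q], k) for q in judgments]
    return sum(scores) / len(scores)


# Hypothetical placeholders: which documents are relevant to each query,
# and the order in which each ranker returned documents.
judgments = {"q1": {"d1", "d4"}, "q2": {"d2"}}
baseline_rankings = {"q1": ["d1", "d4", "d3"], "q2": ["d2", "d5", "d1"]}
candidate_rankings = {"q1": ["d1", "d3", "d4"], "q2": ["d5", "d2", "d1"]}

for name, rankings in [("baseline", baseline_rankings),
                       ("candidate", candidate_rankings)]:
    print(name, mean_recall_at_k(rankings, judgments, k=2))
```

With real data, the baseline rankings would come from BM25 and the candidate rankings from the embedding search. If the candidate does not clearly win on your own queries, the simpler layer stays.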
