How does Elo rating work?
Elo does not measure a model against a fixed answer key. Instead, two models are shown the same prompt, a human picks the better output without knowing which model produced which, and each model’s score moves based on that result. Beating a higher-rated opponent earns more points than beating a lower-rated one, and losing to a lower-rated opponent costs more than losing to a higher-rated one. Run thousands of these blind matchups and the ratings settle into an order that reflects human preference.
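To make the update rule concrete, here is a minimal sketch of the classic Elo update for a single matchup. The text above does not say which formula or constants a given arena uses, so the logistic expected-score formula, the scale of 400, and the K-factor of 32 are assumptions borrowed from the traditional chess convention. Some arenas may instead fit all votes jointly rather than updating one match at a time.

```python
def expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Probability that A is preferred over B under the classic Elo model.

    The scale of 400 is the traditional convention, assumed here.
    """
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


def update(rating_a: float, rating_b: float, score_a: float,
           k: float = 32.0, scale: float = 400.0) -> tuple[float, float]:
    """Return new ratings for A and B after one blind matchup.

    score_a is 1.0 if the voter preferred A, 0.0 if they preferred B,
    and 0.5 for a tie. k (assumed 32) controls how far one vote moves a rating.
    """
    exp_a = expected_score(rating_a, rating_b, scale)
    delta = k * (score_a - exp_a)
    # Whatever A gains, B loses, so the pool's total rating is unchanged.
    return rating_a + delta, rating_b - delta


# An underdog win moves the ratings more than a favorite win.
underdog_after, _ = update(1000.0, 1200.0, score_a=1.0)
favorite_after, _ = update(1200.0, 1000.0, score_a=1.0)
underdog_gain = underdog_after - 1000.0
favorite_gain = favorite_after - 1200.0
print(f"underdog gains {underdog_gain:.1f}, favorite gains {favorite_gain:.1f}")
```

With these assumed constants, the lower-rated model gains roughly three times as many points for an upset as the higher-rated model gains for an expected win.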
Because the score is built from comparisons, an Elo number only means something relative to the pool it was computed in. There is no absolute scale, so the same rating in two different arenas does not indicate the same quality. Ratings also drift as more votes come in, both because the math refines its estimate and because the set of competing models changes.
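One way to see why there is no absolute scale: under the classic Elo formula (assumed here, along with the scale of 400), the predicted preference depends only on the difference between two ratings. The sketch below shifts every rating in a hypothetical two-model pool by the same amount and shows that the prediction does not change. The model names and rating values are made up for illustration.

```python
def expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Probability that A is preferred over B under the classic Elo model."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


# Hypothetical pool; the names and numbers are illustrative only.
pool = {"model_a": 1250.0, "model_b": 1100.0}

# The same pool with every rating moved up by an arbitrary constant.
shifted = {name: rating + 500.0 for name, rating in pool.items()}

p_original = expected_score(pool["model_a"], pool["model_b"])
p_shifted = expected_score(shifted["model_a"], shifted["model_b"])

# Only the 150-point gap matters, so both predictions agree.
print(p_original, p_shifted)
```

Since any constant offset gives the same predictions, the absolute level of a rating is set by convention within each pool, which is why numbers from different arenas cannot be lined up directly.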
When does it matter, and when not?
It matters when quality is subjective and hard to pin to a single metric, which is common in generative tasks like expressive speech or open-ended text. When no fixed test set captures what you care about, letting humans vote in a blind arena is often the most honest ranking you can get, and Elo turns those votes into a clean order.
It matters less when you need a stable, absolute number, for instance to track one model over time or to set a release threshold. Because Elo shifts with the pool and the vote count, it is a snapshot of a population, not a fixed grade. For those cases, pair it with an absolute benchmark score so you have both the relative ranking and a number that holds still.