Elo rating: ranking models by head-to-head wins

Elo rating is a relative skill score, originally from chess, that is updated from head-to-head outcomes. In a blind A/B arena it ranks models by how often humans prefer one over another, not by any absolute score.

At a glance

Where it comes from: Chess, where it ranks players by their match results
What it scores: Relative skill, computed from head-to-head comparisons
How it is used here: Blind A/B arenas rank models by human preference
Its catch: Only meaningful relative to the pool, and it drifts with votes

How does Elo rating work?

Elo does not measure a model against a fixed answer key. Instead, two models are shown the same prompt, a human picks the better output without knowing which is which, and each model’s score moves based on that result. Beating a higher-rated opponent earns more points than beating a weaker one, and losing to a weaker opponent costs more than losing to a stronger one. Run thousands of these blind matchups and the ratings settle into an order that reflects human preference.
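The update described above can be sketched in a few lines. This is a minimal illustration of the conventional chess-style Elo formula, not the method of any particular arena: the 400-point scale, the K-factor of 32, the function names, and the example ratings are all assumptions chosen for illustration.

```python
def expected_score(rating_a, rating_b, scale=400.0):
    """Expected score of A against B under the logistic Elo model.

    scale=400 is the conventional chess value (an assumption here).
    """
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


def update(winner_rating, loser_rating, k=32.0):
    """Return the new (winner, loser) ratings after one blind vote.

    k=32 is a commonly used K-factor (an assumption here); it caps how
    many points a single vote can move.
    """
    # The less expected the win was, the larger the adjustment.
    delta = k * (1.0 - expected_score(winner_rating, loser_rating))
    return winner_rating + delta, loser_rating - delta


# Hypothetical ratings: a 1400-rated model beats a 1600-rated one (an upset).
upset_winner, upset_loser = update(1400.0, 1600.0)
upset_gain = upset_winner - 1400.0

# The reverse case: the 1600-rated model beats the 1400-rated one.
routine_winner, routine_loser = update(1600.0, 1400.0)
routine_gain = routine_winner - 1600.0

print(f"upset win gains {upset_gain:.1f} points")
print(f"routine win gains {routine_gain:.1f} points")
```

In this sketch the upset is worth roughly three times as many points as the routine win, and whatever the winner gains the loser gives up, so the pool's total rating stays constant.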

Because the score is built from comparisons, an Elo number only means something relative to the pool it was computed in. There is no absolute scale, so the same number in two different arenas does not indicate the same quality. As more votes come in, ratings also drift, both as the math refines its estimate and as the set of competing models changes.
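One way to see why there is no absolute scale: under the conventional logistic Elo formula (assumed here, with the usual 400-point scale), the predicted outcome depends only on the gap between two ratings. The arena names and rating values below are hypothetical.

```python
def expected_score(rating_a, rating_b, scale=400.0):
    """Expected score of A against B; scale=400 is the conventional value."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


# Two hypothetical arenas whose ratings sit at different levels but whose
# top two models are separated by the same 100-point gap.
arena_one = expected_score(1200.0, 1100.0)
arena_two = expected_score(1700.0, 1600.0)

print(f"arena one: {arena_one:.3f}")
print(f"arena two: {arena_two:.3f}")
```

Both calls give the same expected score, so a 1700 in the second arena says nothing about how that model would fare against the 1200 in the first. Only the gaps within one pool carry information.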

When does it matter, and when not?

It matters when quality is subjective and hard to pin to a single metric, which is common in generative tasks like expressive speech or open-ended text. When no fixed test set captures what you care about, letting humans vote in a blind arena is often the most honest ranking you can get, and Elo turns those votes into a clean order.

It matters less when you need a stable, absolute number, for instance to track one model over time or to set a release threshold. Because Elo shifts with the pool and the vote count, it is a snapshot of a population, not a fixed grade. When you need that stability, pair it with an absolute benchmark score so you have both the relative ranking and a number that holds still.

Elo rating

  • Comes from pairwise votes, relative to the other models

Benchmark score

  • An absolute number from a fixed test set

Reviewed: June 2026