Perplexity: how surprised a model is by text

Perplexity is a measure of how well a language model predicts a piece of text, often described as how surprised the model is by each next word. Lower perplexity means the model predicted the text better. It is commonly used to compare models or to check whether a change degraded one.

At a glance

What it is: A measure of how well a model predicts text
Which way is good: Lower is better; the model was less surprised
What it is for: Comparing models, or checking a change did not degrade one
What it is not: A direct score of whether answers are correct or useful

What does perplexity measure?

Perplexity measures how well a model predicts text. The intuition is surprise: as the model reads along, it assigns probabilities to what could come next, and perplexity captures how little probability it gave, on average, to what actually came. A model that keeps giving high probability to the right next piece is rarely surprised, and its perplexity is low. A model that is constantly caught off guard has high perplexity. So lower is better, which is the backwards-feeling part most people trip on at first.

It is a single number summarising prediction over a stretch of text, which is what makes it handy: no human has to read and grade anything, and you can compute it cheaply and repeatedly.
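As a rough illustration, perplexity is usually computed by exponentiating the average negative log-probability the model assigned to each token that actually appeared. The sketch below assumes you already have those per-token probabilities; the function name `perplexity` and the sample numbers are made up for illustration and do not come from any particular model.

```python
import math


def perplexity(token_probs):
    """Perplexity from the probability the model gave each actual next token."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    # Average "surprise" per token: the negative log of each probability.
    avg_neg_log_prob = -sum(math.log(p) for p in token_probs) / len(token_probs)
    # Exponentiating turns the average surprise into a perplexity.
    return math.exp(avg_neg_log_prob)


# Hypothetical probabilities, invented for this example.
confident = perplexity([0.9, 0.8, 0.95, 0.85])  # model rarely surprised
surprised = perplexity([0.1, 0.05, 0.2, 0.1])   # model often caught off guard

print(f"confident model: {confident:.2f}")
print(f"surprised model: {surprised:.2f}")
```

One way to read the number: a model that spreads its guess evenly over four equally likely options at every step scores a perplexity of 4, so the figure behaves like an effective number of choices the model was torn between.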

What is perplexity good and bad for?

Its honest use is comparison on the same text. Run two models, or a model before and after a change, over identical text, and the one with lower perplexity predicted it better. That makes perplexity a common sanity check after quantizing a model: if the compressed version’s perplexity barely moved, the compression probably did not wreck it.
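A before-and-after check of this kind might look like the sketch below. It assumes both models were scored on the same text with the same tokenization, so the two lists of per-token log-probabilities line up one to one. The helper names, the sample log-probabilities, and the 5% tolerance are all hypothetical choices for illustration, not a standard threshold.

```python
import math


def perplexity_from_logprobs(logprobs):
    """Perplexity from natural-log probabilities of the actual tokens."""
    return math.exp(-sum(logprobs) / len(logprobs))


def relative_change(baseline_logprobs, candidate_logprobs):
    """Fractional change in perplexity of the candidate versus the baseline."""
    if len(baseline_logprobs) != len(candidate_logprobs):
        # Different lengths suggest the two runs did not score identical text.
        raise ValueError("both models must be scored on the same tokens")
    base = perplexity_from_logprobs(baseline_logprobs)
    cand = perplexity_from_logprobs(candidate_logprobs)
    return (cand - base) / base


# Made-up log-probabilities for the same four tokens (assumption).
original_logprobs = [-0.20, -1.10, -0.45, -0.80]
quantized_logprobs = [-0.22, -1.15, -0.47, -0.83]

TOLERANCE = 0.05  # hypothetical: accept up to a 5% rise in perplexity

change = relative_change(original_logprobs, quantized_logprobs)
print(f"perplexity change: {change:+.1%}")
print("looks fine" if change <= TOLERANCE else "investigate further")
```

A small rise only suggests the compressed model still predicts this text about as well as the original; it says nothing about other texts or about answer quality.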

Its limits matter just as much. Perplexity rewards predicting text, not being correct, helpful, or good at following instructions. A model can predict fluent text and still be wrong, so a low number is not a promise of a good answer. And the number is only comparable on the same text, so two perplexity figures measured on different texts cannot be meaningfully compared with each other. Treat it as one signal, not a verdict.

Perplexity is good at

  • Comparing two models on the same text, lower being the better predictor
  • Spotting whether a quantized model drifted far from the original
  • Giving a quick, cheap signal that needs no human to grade

Perplexity is poor at

  • Telling you whether an answer is actually correct or useful
  • Comparing across different texts, where the number is not comparable
  • Measuring how good a model is at following instructions or using tools

Reviewed: June 2026