Perplexity: how surprised a model is by text

Perplexity is a measure of how well a language model predicts a piece of text, often described as how surprised the model is by each next word. Lower perplexity means the model predicted the text better. It is commonly used to compare models or to check whether a change degraded one.

What does perplexity measure?

Perplexity measures how well a model predicts text. The intuition is surprise: as the model reads along, it assigns probabilities to what might come next, and perplexity captures how little probability it gave to what actually appeared. A model that keeps putting high probability on the right next piece is rarely surprised, and its perplexity is low. A model that is constantly caught off guard has high perplexity. So lower is better, which is the backwards-feeling part most people trip on at first.

It is a single number summarising prediction over a stretch of text, which is what makes it handy: no human has to read and grade anything, and you can compute it cheaply and repeatedly.
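
To make the "single number" concrete, here is a minimal sketch of the usual calculation: take the probability the model gave to each piece of text that actually appeared, average the negative logs, and exponentiate. The function name and the probabilities below are made up for illustration; a real model would supply its own per-token probabilities.

```python
import math

def perplexity(token_probs):
    """Perplexity from the probability the model gave each token that actually appeared."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    # Average surprise: mean negative log-probability per token.
    avg_neg_log_prob = -sum(math.log(p) for p in token_probs) / len(token_probs)
    # Exponentiating turns average surprise back into an "effective number of choices".
    return math.exp(avg_neg_log_prob)

# Hypothetical probabilities, not taken from any real model.
confident = perplexity([0.9, 0.8, 0.95, 0.85])  # rarely surprised
unsure = perplexity([0.25, 0.25, 0.25, 0.25])   # like a four-way guess every time

print(f"confident model: {confident:.2f}")
print(f"unsure model:    {unsure:.2f}")
```

A model that spreads its bets evenly over four options at every step lands at a perplexity of about 4, which is why the number is often read as "how many choices the model was effectively torn between".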

What is perplexity good and bad for?

Its honest use is comparison on the same text. Run two models, or a model before and after a change, over identical text, and the one with lower perplexity predicted it better. That makes perplexity a common sanity check after quantizing a model: if the compressed version’s perplexity barely moved, the compression probably did not wreck it.
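
The before-and-after check described above might look like the following sketch. It assumes you already have per-token log-probabilities (natural log) from both versions of the model over the same text; the values and helper names here are invented for illustration, and what counts as "barely moved" is a judgment call rather than a fixed threshold.

```python
import math

def perplexity_from_logprobs(logprobs):
    """Perplexity from natural-log probabilities of the tokens that actually appeared."""
    return math.exp(-sum(logprobs) / len(logprobs))

def relative_increase(before, after):
    """Fractional change in perplexity, e.g. 0.02 means 2% worse."""
    return (after - before) / before

# Made-up log-probabilities for the same six tokens of the same text.
original_logprobs = [-0.10, -0.35, -0.05, -1.20, -0.40, -0.15]
quantized_logprobs = [-0.11, -0.37, -0.05, -1.26, -0.42, -0.16]

# The comparison only means something if both runs scored identical text.
assert len(original_logprobs) == len(quantized_logprobs)

ppl_original = perplexity_from_logprobs(original_logprobs)
ppl_quantized = perplexity_from_logprobs(quantized_logprobs)
change = relative_increase(ppl_original, ppl_quantized)

print(f"original:  {ppl_original:.3f}")
print(f"quantized: {ppl_quantized:.3f}")
print(f"relative increase: {change:.1%}")
```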

Its limits matter just as much. Perplexity rewards predicting text, not being correct, helpful, or good at following instructions. A model can predict fluent text and still be wrong, so a low number is not a promise of a good answer. And the number is only comparable on the same text, so two perplexity figures measured on different text tell you nothing about which model is better. Treat it as one signal, not a verdict.
