What does perplexity measure?
Perplexity measures how well a model predicts text. The intuition is surprise: as the model reads along, it assigns probabilities to what might come next, and perplexity captures how little probability it gave to what actually came. A model that keeps putting high probability on the right next piece is rarely surprised, and its perplexity is low. A model that is constantly caught off guard has high perplexity. So lower is better, which is the backwards-feeling part most people trip on at first.
It is a single number summarising prediction over a stretch of text, which is what makes it handy: no human has to read and grade anything, and you can compute it cheaply and repeatedly.
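To make the single number concrete, here is a minimal sketch in Python. It assumes the standard definition of perplexity as the exponential of the average negative log-probability the model gave to each token that actually occurred. The function name `perplexity` and the probability lists are made up for illustration and do not come from any particular library or model.

```python
import math

def perplexity(token_probs):
    """Perplexity from the probabilities a model assigned to the tokens that actually came next."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    # Average negative log-probability per token (natural log).
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    # Exponentiating turns the average surprise back into a single score.
    return math.exp(avg_nll)

# Illustrative, invented probabilities for the same four-token stretch of text.
rarely_surprised = [0.9, 0.8, 0.95, 0.85]
often_surprised = [0.1, 0.05, 0.2, 0.1]

print(perplexity(rarely_surprised) < perplexity(often_surprised))
```

One way to read the result: a model that gives every correct token a probability of 0.25 scores a perplexity of about 4, roughly as if it were picking among four equally likely options at each step.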
What is perplexity good and bad for?
Its honest use is comparison on the same text. Run two models, or a model before and after a change, over identical text, and the one with lower perplexity predicted it better. That makes perplexity a common sanity check after quantizing a model: if the compressed version’s perplexity barely moved, the compression probably did not wreck it.
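The before-and-after check can be sketched as follows, assuming you already have per-token natural-log probabilities from both versions of the model on the same tokenized text. The helper names and the numbers are hypothetical, and what counts as "barely moved" is left to your judgment; the code only reports the relative change.

```python
import math

def perplexity_from_logprobs(logprobs):
    """Perplexity from per-token natural-log probabilities of the actual text."""
    return math.exp(-sum(logprobs) / len(logprobs))

def relative_change(baseline_logprobs, candidate_logprobs):
    """Fractional change in perplexity of the candidate versus the baseline."""
    # The comparison only makes sense if both models scored the same tokens.
    if len(baseline_logprobs) != len(candidate_logprobs):
        raise ValueError("both models must be scored on the same tokens")
    base = perplexity_from_logprobs(baseline_logprobs)
    cand = perplexity_from_logprobs(candidate_logprobs)
    return (cand - base) / base

# Invented log-probabilities for one short stretch of text.
original = [-0.20, -1.10, -0.05, -0.70]
quantized = [-0.22, -1.15, -0.06, -0.72]

change = relative_change(original, quantized)
print(f"perplexity changed by {change:.1%}")
```

A small positive change means the compressed model predicted the text slightly worse. In practice you would run this over a much longer evaluation text than four tokens, since a short sample makes the number noisy.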
Its limits matter just as much. Perplexity rewards predicting text, not being correct, helpful, or good at following instructions. A model can predict fluent text and still be wrong, so a low number is not a promise of a good answer. And the number is only comparable on the same text, split into tokens the same way, so two perplexity figures from different sources tell you nothing on their own. Treat it as one signal, not a verdict.