What happens in each phase?
When you send a prompt, the model first does prefill: it reads the whole prompt in one pass and, for every token, computes and stores the keys and values it will need later. Those land in the key-value (KV) cache. Because the entire prompt is available at once, prefill can do a great deal of math in parallel, so it is limited mostly by how much compute the hardware has.
Then comes decode. The model generates the answer one token at a time, and each new token depends on every token before it, so this part cannot be parallelised the same way. Each step reads the model’s weights and the growing cache from memory to produce a single token. That makes decode limited by memory speed rather than raw compute. A long answer is mostly decode time.
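To see why the two phases hit different limits, here is a rough back-of-envelope sketch. Everything in it is an assumption made for illustration: the function name `estimate_step`, the rule of thumb of about 2 FLOPs per parameter per token, and the hardware and model figures, which do not describe any real device. It also ignores attention-specific math and any overlap between compute and memory traffic.

```python
# Back-of-envelope estimate of what limits one forward pass.
# All figures are illustrative assumptions, not measurements.

def estimate_step(tokens_in_step, n_params, kv_cache_bytes,
                  peak_flops, mem_bandwidth, bytes_per_weight=2):
    """Return (seconds, limiting_resource) for one forward pass."""
    # Rule of thumb (assumed): about 2 FLOPs per parameter per token.
    flops = 2 * n_params * tokens_in_step
    # Weights are read once per pass, however many tokens share that pass.
    bytes_moved = n_params * bytes_per_weight + kv_cache_bytes
    compute_s = flops / peak_flops
    memory_s = bytes_moved / mem_bandwidth
    if compute_s > memory_s:
        return compute_s, "compute"
    return memory_s, "memory"

# Hypothetical model and hardware.
N_PARAMS = 7e9          # parameters
PEAK_FLOPS = 100e12     # FLOPs per second
MEM_BANDWIDTH = 1e12    # bytes per second

# Prefill: a 2000-token prompt processed in one pass, cache still empty.
prefill_s, prefill_bound = estimate_step(2000, N_PARAMS, 0,
                                         PEAK_FLOPS, MEM_BANDWIDTH)
# Decode: one new token per pass, reading an assumed 1 GB cache.
decode_s, decode_bound = estimate_step(1, N_PARAMS, 1e9,
                                       PEAK_FLOPS, MEM_BANDWIDTH)

print(f"prefill: {prefill_s:.3f} s, limited by {prefill_bound}")
print(f"decode:  {decode_s:.3f} s per token, limited by {decode_bound}")
```

Under these assumed numbers, prefill spends its time on arithmetic, while each decode step spends about 15 ms moving weights and cache through memory to emit a single token.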
Why does this split matter to you?
Because one tokens-per-second number can quietly average two unlike things. The delay before the first token appears is prefill, and it grows with prompt length. The steady rhythm after that is decode, and its total time grows with answer length. A box that prefills quickly can still decode slowly, or the reverse, so an honest benchmark reports them apart. It also explains where speed tricks apply: techniques that speed up generation target the decode phase, because that is the one bound by memory rather than compute. Knowing which phase you are watching turns a vague “it feels slow” into a specific thing you can measure and fix.
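Reporting the two phases apart can be as simple as timestamping tokens as they arrive. The sketch below is a minimal example, not any particular tool's method: `measure_stream` is a hypothetical helper, and it assumes your client hands you tokens one at a time as an iterable. The clock is injectable so the example can run on made-up timestamps.

```python
import time
from typing import Callable, Iterable


def measure_stream(tokens: Iterable[str],
                   clock: Callable[[], float] = time.perf_counter) -> dict:
    """Time a token stream, separating first-token delay from decode rate."""
    start = clock()
    first = None
    last = start
    count = 0
    for _ in tokens:
        now = clock()
        if first is None:
            first = now          # the first token marks the end of the wait
        last = now
        count += 1
    if first is None:
        raise ValueError("stream produced no tokens")

    decode_s = last - first      # time spent on tokens after the first
    total_s = last - start
    return {
        "tokens": count,
        "time_to_first_token_s": first - start,
        "decode_tokens_per_s": (
            (count - 1) / decode_s if decode_s > 0 else float("nan")),
        "blended_tokens_per_s": (
            count / total_s if total_s > 0 else float("nan")),
    }


# Made-up timestamps: 4 s before the first token, then one token per second.
ticks = iter([0.0, 4.0, 5.0, 6.0, 7.0, 8.0])
report = measure_stream(["a", "b", "c", "d", "e"], clock=lambda: next(ticks))
print(report)
```

With these invented timestamps the blended figure is 0.625 tokens per second, which hides both the 4-second wait for the first token and the steady 1 token per second that follows.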