Prefill and decode: the two halves of inference

Prefill and decode are the two phases of running a language model. Prefill processes the entire prompt in one pass and fills the key-value (KV) cache; decode then generates the answer one token at a time, reusing that cache. Prefill is limited mainly by compute and decode by memory speed, and the two behave so differently that a single tokens-per-second figure can hide which one you are measuring.

What happens in each phase?

When you send a prompt, the model first does prefill: it reads the whole prompt in one pass and, for every token, computes and stores the keys and values it will need later. Those land in the key-value (KV) cache. Because the entire prompt is available at once, prefill can do a great deal of math in parallel, so it is limited mostly by how much compute the hardware has.
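To get a feel for how much prefill writes into the cache, here is a rough size estimate. It is a sketch under stated assumptions: the function name `kv_cache_bytes` and the model shape in the example are hypothetical, and it assumes a standard transformer that stores one key vector and one value vector per token, per layer, per KV head.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Estimate the bytes needed to cache keys and values for num_tokens tokens."""
    # Factor of 2: one key vector and one value vector per token, layer, and KV head.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return num_tokens * per_token


# Hypothetical model shape, chosen only for illustration (16-bit values).
size = kv_cache_bytes(num_tokens=4096, num_layers=32, num_kv_heads=8, head_dim=128)
print(f"{size / 2**20:.0f} MiB")  # prints "512 MiB"
```

The cache grows linearly with the number of tokens, which is why a long prompt costs memory as well as prefill time.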

Then comes decode. The model generates the answer one token at a time, and each new token depends on the one before it, so this part cannot be parallelised the same way. Each step reads the model’s weights and the growing cache from memory to produce a single token. That makes decode limited by memory speed rather than raw compute. A long answer is mostly decode time.
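The memory limit on decode can be turned into a back-of-the-envelope ceiling. The sketch below assumes a single sequence, a dense model whose weights are all read once per step, and no overlap or caching effects; the function name and the example figures are hypothetical, not measurements of any real hardware.

```python
def decode_tokens_per_second_bound(weight_bytes, kv_cache_bytes, bandwidth_bytes_per_s):
    """Upper bound on decode speed if reading memory were the only cost."""
    # Each decode step reads the weights and the current cache to emit one token.
    bytes_read_per_token = weight_bytes + kv_cache_bytes
    return bandwidth_bytes_per_s / bytes_read_per_token


# Hypothetical numbers: 16 GB of weights, 0.5 GB of cache, 1 TB/s memory bandwidth.
bound = decode_tokens_per_second_bound(16e9, 0.5e9, 1e12)
print(f"{bound:.1f} tokens/s at most")  # prints "60.6 tokens/s at most"
```

Real decode speed falls below this ceiling, and the ceiling itself drops as the cache grows during a long answer.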

Why does this split matter to you?

Because one tokens-per-second number can quietly average two unlike things. The delay before the first token appears is prefill, and it grows with prompt length. The steady rhythm after that is decode, and its total time grows with answer length. A box that prefills quickly can still decode slowly, or the reverse, so an honest benchmark reports the two separately. It also explains where speed tricks apply: techniques that speed up generation target the decode phase, because that is the one bound by memory rather than compute. Knowing which phase you are watching turns a vague “it feels slow” into a specific thing you can measure and fix.
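One way to report the two phases separately is to timestamp tokens as they arrive from a streaming response. The sketch below is illustrative only: `measure_phases` and `fake_stream` are hypothetical names, the stream is simulated with sleeps, and against a real server the time to first token would also include network and queueing delay.

```python
import time


def measure_phases(token_stream, clock=time.perf_counter):
    """Consume an iterator of tokens and time prefill and decode separately."""
    start = clock()
    first_token_at = None
    last_token_at = None
    count = 0
    for _ in token_stream:
        now = clock()
        if first_token_at is None:
            first_token_at = now
        last_token_at = now
        count += 1
    if count == 0:
        raise ValueError("stream produced no tokens")
    decode_time = last_token_at - first_token_at
    # Decode rate counts only tokens after the first; None for a one-token answer.
    decode_tps = (count - 1) / decode_time if count > 1 and decode_time > 0 else None
    return {
        "time_to_first_token_s": first_token_at - start,
        "decode_tokens_per_s": decode_tps,
        "tokens": count,
    }


def fake_stream(prefill_s=0.05, per_token_s=0.01, n=5):
    """Simulated model: one long pause (prefill), then evenly spaced tokens (decode)."""
    time.sleep(prefill_s)
    for i in range(n):
        if i:
            time.sleep(per_token_s)
        yield f"tok{i}"


print(measure_phases(fake_stream()))
```

Reporting time to first token alongside decode tokens per second keeps a slow prefill from hiding behind a fast decode, or the reverse.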
