Prefill and decode: the two halves of inference

Prefill and decode are the two phases of running a language model. Prefill processes the entire prompt in one pass and fills the key-value (KV) cache; decode then generates the answer one token at a time, reusing that cache. Prefill is limited mostly by compute and decode by memory speed, and the two behave so differently that a single tokens-per-second figure can hide which one you are measuring.

At a glance

  • Prefill: reads the whole prompt at once and fills the key-value (KV) cache
  • Decode: generates the answer one token at a time, reusing the cache
  • What limits prefill: compute, since it does a lot of math in parallel
  • What limits decode: memory speed, since it streams the weights for every token

Flow

One request, two phases

Prefill ingests the prompt in a single pass and builds the cache. Decode then emits tokens one by one, reading that cache each step. Most of the wait you feel on a long answer is decode.

1. Prompt arrives: the full text you sent
2. Prefill: reads it all at once, fills the key-value (KV) cache
3. Decode: one token, then the next, reusing the cache

What happens in each phase?

When you send a prompt, the model first does prefill: it reads the whole prompt in one pass and, for every token, computes and stores the keys and values it will need later. Those land in the key-value (KV) cache. Because the entire prompt is available at once, prefill can do a great deal of math in parallel, so it is limited mostly by how much compute the hardware has.

Then comes decode. The model generates the answer one token at a time, and each new token depends on every token before it, including the ones just generated, so this part cannot be parallelised the same way. Each step reads the model’s weights and the growing cache from memory to produce a single token. That makes decode limited by memory speed rather than raw compute. A long answer is mostly decode time.
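To make the two phases concrete, here is a minimal sketch of a single attention head with a KV cache, in plain Python. Everything in it is an assumption made for illustration: the function names, the 2-dimensional weights, and the token vectors are hypothetical, and a real model runs prefill as one batched matrix operation across many layers and heads rather than a Python loop.

```python
import math

def project(x, w):
    """Multiply row vector x by matrix w (given as a list of rows)."""
    return [sum(xi * wij for xi, wij in zip(x, col)) for col in zip(*w)]

def attend(q, keys, values):
    """Softmax attention of one query over every cached key and value."""
    scale = math.sqrt(len(q))
    scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(dim)]

def prefill(prompt, wq, wk, wv):
    """Compute keys and values for the whole prompt; return (output, cache)."""
    cache = {"k": [project(x, wk) for x in prompt],
             "v": [project(x, wv) for x in prompt]}
    out = attend(project(prompt[-1], wq), cache["k"], cache["v"])
    return out, cache

def decode_step(x, cache, wq, wk, wv):
    """Append one token's key and value, then attend over the whole cache."""
    cache["k"].append(project(x, wk))
    cache["v"].append(project(x, wv))
    return attend(project(x, wq), cache["k"], cache["v"])

# Hypothetical weights and token vectors, for illustration only.
WQ = [[1.0, 0.0], [0.0, 1.0]]
WK = [[0.5, 0.1], [0.2, 0.7]]
WV = [[1.0, 2.0], [3.0, 4.0]]
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Prefill two tokens, then decode the third using the cache.
_, cache = prefill(tokens[:2], WQ, WK, WV)
step_out = decode_step(tokens[2], cache, WQ, WK, WV)

# Recomputing everything from scratch gives the same result.
full_out, _ = prefill(tokens, WQ, WK, WV)
```

The decode step only computes one new key and value, yet it still has to read every cached entry, which is why the cache saves math but not memory traffic.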

Why does this split matter to you?

Because one tokens-per-second number can quietly average two unlike things. The delay before the first token appears is mostly prefill, and it grows with prompt length. The steady rhythm after that is decode, and its total time grows with answer length. A machine that prefills quickly can still decode slowly, or the reverse, so an honest benchmark reports the two separately. The split also explains where speed tricks apply: techniques that speed up generation target the decode phase, because that is the one bound by memory rather than compute. Knowing which phase you are watching turns a vague “it feels slow” into a specific thing you can measure and fix.
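One way to report the two phases apart is to record when each token arrives and split the timeline at the first token. The sketch below assumes you already have those timestamps; the function name `split_latency` and the example trace are hypothetical, not part of any particular benchmark tool.

```python
def split_latency(request_sent, token_times):
    """Return time-to-first-token, decode rate, and the blended rate.

    request_sent: timestamp in seconds when the prompt was sent.
    token_times: timestamps in seconds at which each output token arrived.
    """
    if not token_times:
        raise ValueError("need at least one token timestamp")
    nan = float("nan")
    # Everything before the first token is counted as prefill (plus any queueing).
    ttft = token_times[0] - request_sent
    # Decode covers the tokens that arrive after the first one.
    decode_seconds = token_times[-1] - token_times[0]
    decode_tokens = len(token_times) - 1
    decode_tps = decode_tokens / decode_seconds if decode_seconds > 0 else nan
    # The single blended figure that mixes both phases together.
    total_seconds = token_times[-1] - request_sent
    blended_tps = len(token_times) / total_seconds if total_seconds > 0 else nan
    return {"ttft_s": ttft, "decode_tps": decode_tps, "blended_tps": blended_tps}

# Hypothetical trace: first token after 2.0 s, then one token every 0.1 s.
arrivals = [2.0 + 0.1 * i for i in range(11)]
report = split_latency(0.0, arrivals)
```

In this made-up trace the first token takes 2 seconds and decode then runs at about 10 tokens per second, but the blended figure comes out under 4 tokens per second, which describes neither phase well.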

Prefill

  • Processes the entire prompt in a single parallel pass
  • Limited by raw compute, lots of math at once
  • Sets the time-to-first-token you wait at the start
  • Longer prompts make it cost more

Decode

  • Emits the answer one token at a time, in sequence
  • Limited by memory speed, streaming weights each step
  • Sets the steady tokens-per-second after the first token
  • Longer answers make it cost more
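The compute-versus-memory contrast in the two lists can be turned into a back-of-envelope estimate. The sketch below rests on loud assumptions: prefill is modelled as roughly 2 floating-point operations per parameter per prompt token, decode as one full read of the weights per token, and the model size and hardware figures are hypothetical round numbers. It ignores KV-cache reads, attention cost, and batching, so treat the results as rough lower bounds only.

```python
def estimate_phase_times(params, bytes_per_param, prompt_tokens,
                         peak_flops, mem_bandwidth):
    """Rough lower bounds for prefill time and per-token decode time."""
    # Prefill: about 2 floating-point operations per parameter per token,
    # limited by how many operations per second the hardware can do.
    prefill_s = 2 * params * prompt_tokens / peak_flops
    # Decode: each step reads all the weights from memory once,
    # limited by memory bandwidth in bytes per second.
    decode_s_per_token = params * bytes_per_param / mem_bandwidth
    return prefill_s, decode_s_per_token

# Hypothetical model and hardware, chosen only to keep the arithmetic simple.
prefill_s, decode_s = estimate_phase_times(
    params=7e9,          # 7 billion parameters
    bytes_per_param=2,   # 16-bit weights
    prompt_tokens=1000,
    peak_flops=1e14,     # 100 TFLOP/s
    mem_bandwidth=1e12,  # 1 TB/s
)
decode_tps_ceiling = 1 / decode_s
```

With these assumed numbers, a 1000-token prompt prefills in roughly 0.14 seconds, while decode cannot exceed about 71 tokens per second however much compute is available, because each token waits on a full pass over the weights.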

Reviewed: June 2026