
KV cache: the memory that grows with your context

The key-value (KV) cache is the working memory that holds a model's internal view of every token it has processed so far. Keeping it means the model reuses past work instead of recomputing the whole prompt for each new token. Its size grows with context length, which makes it the usual reason a longer prompt runs out of memory.

At a glance

What it is: Stored internal state for every token processed so far
Why it exists: So each new token reuses past work instead of recomputing it
What grows it: Context length; more tokens held means more cache
Why it matters: It is the usual culprit when a long context OOMs

How the KV cache grows during generation

The cache starts at the prompt size and adds an entry per generated token. The green band is your free memory; when the growing cache leaves none, you OOM.

3. Free headroom (keep this above zero): what is left for the cache to grow into before the next allocation fails
2. Generated tokens: each new token adds to the cache, so it grows as it writes
1. Prompt tokens: the cache is seeded from the input you sent
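
The three layers above can be turned into simple arithmetic. The sketch below is illustrative only: the function names are hypothetical, the per-token cache size is an assumed input rather than a measured one, and it ignores allocator block sizes and fragmentation, which real runtimes add on top.

```python
def headroom_bytes(prompt_tokens, generated_tokens, bytes_per_token, cache_budget_bytes):
    """Free headroom left after the prompt and the generated tokens are cached.

    A negative result means the cache no longer fits in its budget.
    """
    used = (prompt_tokens + generated_tokens) * bytes_per_token
    return cache_budget_bytes - used


def tokens_until_oom(prompt_tokens, bytes_per_token, cache_budget_bytes):
    """How many tokens can be generated before headroom would go below zero."""
    capacity_tokens = cache_budget_bytes // bytes_per_token
    # If the prompt alone exceeds the budget, nothing can be generated.
    return max(0, capacity_tokens - prompt_tokens)


# Assumed numbers for illustration: 128 KiB of cache per token, 2 GiB set aside.
BYTES_PER_TOKEN = 128 * 1024
CACHE_BUDGET = 2 * 1024**3

remaining = tokens_until_oom(12_000, BYTES_PER_TOKEN, CACHE_BUDGET)
print(f"A 12,000-token prompt leaves room for {remaining} generated tokens")
```

With these assumed numbers the budget holds 16,384 tokens in total, so a longer prompt directly reduces how much the model can write before the headroom is gone.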

What does the KV cache actually store?

When a model reads your prompt, each attention layer computes two vectors for every token: a key, which later tokens match against, and a value, which carries the content they retrieve. The key-value (KV) cache keeps those around. Without it, generating the next token would mean reprocessing the entire prompt, plus everything generated so far, at every single step, so each step gets more expensive as the text grows and the total work grows at least quadratically with length. With it, the prompt is processed once and each new token only adds its own entry. That is why local generation is usable at all on long inputs.
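
To make the idea concrete, here is a toy, single-layer, single-head sketch in plain Python. It is not how any particular runtime implements its cache: `project` is a hypothetical stand-in for the learned key, value, and query projections, and the `KVCache` class is invented for illustration. It shows only that appending one key and one value per token gives the same attention output as recomputing all of them at every step.

```python
import math


def project(token_vec, scale):
    """Hypothetical stand-in for a learned projection: just scales the vector."""
    return [scale * x for x in token_vec]


def attend(query, keys, values):
    """Softmax attention for one query over every stored key/value pair."""
    norm = 1.0 / math.sqrt(len(query))
    scores = [norm * sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total for i in range(dim)]


def step_without_cache(tokens):
    """Recompute keys and values for every token, then attend from the last one."""
    keys = [project(t, 0.5) for t in tokens]
    values = [project(t, 2.0) for t in tokens]
    return attend(project(tokens[-1], 1.0), keys, values)


class KVCache:
    """Keeps the keys and values of every token seen so far."""

    def __init__(self):
        self.keys = []
        self.values = []

    def step(self, token):
        # Only the new token is projected; older entries are reused as stored.
        self.keys.append(project(token, 0.5))
        self.values.append(project(token, 2.0))
        return attend(project(token, 1.0), self.keys, self.values)


def outputs_match(tokens):
    """Check that cached decoding equals full recomputation at every step."""
    cache = KVCache()
    for i, token in enumerate(tokens):
        cached = cache.step(token)
        full = step_without_cache(tokens[: i + 1])
        if not all(math.isclose(a, b) for a, b in zip(cached, full)):
            return False
    return True
```

Calling `outputs_match([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])` exercises both paths on three toy token vectors. Note that the cache holds one key and one value per token, which is exactly the memory that grows with context.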

Why is it the thing that runs out of memory?

Because it grows. Every token you hold in context, prompt plus everything generated, adds to the cache, so its size scales with context length and with how many requests you run at once. The model weights are a fixed cost once loaded; the KV cache is the moving one. When a prompt that fit yesterday OOMs today, the cache is almost always why. The lever is context length: shorten the maximum context and the cache has less room to grow into, and the headroom comes back. Budget for the cache, not just the weights.
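
A rough way to budget for this is the standard back-of-the-envelope estimate for a transformer with an unquantized cache: two tensors (keys and values) per layer, each holding one vector per KV head per token. The sketch below assumes that layout; the function name and the model shape used in the example are hypothetical, and real runtimes may differ through cache quantization, sliding-window attention, or paged allocation overhead.

```python
GIB = 1024**3


def kv_cache_bytes(num_layers, num_kv_heads, head_dim, context_tokens,
                   batch_size=1, bytes_per_element=2):
    """Estimate KV cache size in bytes.

    The leading 2 counts keys and values. bytes_per_element=2 assumes a
    16-bit cache (fp16 or bf16).
    """
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_element
    return per_token * context_tokens * batch_size


# Hypothetical model shape, chosen only for illustration.
shape = dict(num_layers=32, num_kv_heads=8, head_dim=128)

one_request = kv_cache_bytes(context_tokens=8192, **shape)
four_requests = kv_cache_bytes(context_tokens=8192, batch_size=4, **shape)
shorter_context = kv_cache_bytes(context_tokens=4096, **shape)

print(f"8192 tokens, 1 request:  {one_request / GIB:.2f} GiB")
print(f"8192 tokens, 4 requests: {four_requests / GIB:.2f} GiB")
print(f"4096 tokens, 1 request:  {shorter_context / GIB:.2f} GiB")
```

Under these assumptions the cache costs 128 KiB per token, so it scales linearly with both context length and the number of concurrent requests, and halving the maximum context halves the cache.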

The KV cache helps by

  • Skipping recomputation of the whole prompt at every step
  • Making each new token cheap once the prompt is processed
  • Turning long generation from at-least-quadratic total work into a small, slowly growing cost per token
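
The savings in the list above can be counted directly. The sketch below counts how many token positions are pushed through the model to generate a given number of new tokens, with and without a cache. It is a simplified counting model with hypothetical function names, not a FLOP estimate: even with a cache, each new token still attends over every cached entry, so attention cost per step continues to grow with context.

```python
def token_passes_without_cache(prompt_tokens, new_tokens):
    """Token positions processed when every step re-reads the whole context.

    Step i (0-based) reprocesses the prompt plus the i tokens generated so far.
    """
    return sum(prompt_tokens + i for i in range(new_tokens))


def token_passes_with_cache(prompt_tokens, new_tokens):
    """Token positions processed when past keys and values are reused.

    The prompt is processed once, which yields the first new token; each
    further token needs only the previously generated token as input.
    """
    if new_tokens == 0:
        return 0
    return prompt_tokens + (new_tokens - 1)


without = token_passes_without_cache(1000, 100)
with_cache = token_passes_with_cache(1000, 100)
print(f"without cache: {without} token passes, with cache: {with_cache}")
```

For a 1,000-token prompt and 100 generated tokens, the uncached count is 104,950 token passes against 1,099 with the cache, and the gap widens as either number grows.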

It costs you

  • Memory that grows with context length, often the OOM culprit
  • Headroom you must budget for, not just the model weights
  • A reason a run that fit yesterday fails on a longer prompt today

Reviewed: June 2026