The key-value (KV) cache is the working memory that holds a model's internal view of every token it has processed so far. Keeping it means the model reuses past work instead of recomputing the whole prompt for each new token. Its size grows with context length, which makes it the usual reason a longer prompt runs out of memory.
At a glance
What it is: Stored internal state for every token processed so far
Why it exists: So each new token reuses past work instead of recomputing it
What grows it: Context length: more tokens held means more cache
Why it matters: It is the usual culprit when a long context OOMs
Figure: How the KV cache grows during generation
The cache starts at the prompt size and adds an entry per generated token. The top layer is your free memory; when the growing cache leaves none, you OOM.
Layers, top to bottom:
3. Free headroom (keep this above zero): what is left for the cache to grow into before the next allocation fails
2. Generated tokens: each new token adds to the cache, so it grows as the model writes
1. Prompt tokens: the cache is seeded from the input you sent
What does the KV cache actually store?
When a model reads your prompt, each layer produces an internal representation of
every token, split into two parts the model later looks things up by: keys and
values. The key-value (KV) cache keeps those around. Without it, generating each
new token would mean reprocessing the entire prompt, plus everything generated so
far, from scratch, so the total work grows roughly quadratically with the length
of the text. With it, the prompt is processed once and each new token only adds
its own entry. That is why local generation is usable at all on long inputs.
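To make the difference concrete, the sketch below counts how many token positions pass through the model while generating. It is a toy accounting exercise, not a real inference loop: the function name `tokens_processed` and the example sizes are made up for illustration, and it ignores the attention cost inside each step.

```python
def tokens_processed(prompt_len: int, new_tokens: int, use_cache: bool) -> int:
    """Count token positions run through the model to generate `new_tokens`.

    Hypothetical helper for illustration only; it models the bookkeeping,
    not the actual math of a transformer.
    """
    total = 0
    for step in range(new_tokens):
        if use_cache:
            # First step processes the prompt once; every later step only
            # processes the single token produced by the previous step.
            total += prompt_len if step == 0 else 1
        else:
            # No cache: every step reprocesses the prompt plus everything
            # generated so far.
            total += prompt_len + step
    return total


without_cache = tokens_processed(prompt_len=1000, new_tokens=100, use_cache=False)
with_cache = tokens_processed(prompt_len=1000, new_tokens=100, use_cache=True)
print(without_cache, with_cache)
```

For a 1,000-token prompt and 100 generated tokens, the uncached count is 104,950 positions against 1,099 with the cache.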
Why is it the thing that runs out of memory?
Because it grows. Every token you hold in context, prompt plus everything
generated, adds to the cache, so its size scales with context length and with how
many requests you run at once. The model weights are a fixed cost once loaded;
the KV cache is the moving one. When a setup that fit yesterday OOMs today on a
longer prompt, the cache is almost always why. The lever is context length: lower
the maximum context and the cache is capped at a smaller size, which gives the
headroom back.
Budget for the cache, not just the weights.
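As a rough budgeting aid, the cache size can be estimated from the model's shape. The sketch below assumes a standard transformer that stores one key and one value vector per layer, per KV head, per token. The function names and the example configuration (32 layers, 8 KV heads, head dimension 128, 16-bit values) are hypothetical, so substitute the numbers for your own model. Runtimes that quantize, page, or otherwise compress the cache will differ from this estimate.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       bytes_per_value: int = 2) -> int:
    """Bytes of cache added by one token (assumed layout, see lead-in)."""
    # The leading 2 accounts for storing both a key and a value.
    return 2 * layers * kv_heads * head_dim * bytes_per_value


def kv_cache_bytes(tokens: int, layers: int, kv_heads: int, head_dim: int,
                   bytes_per_value: int = 2, batch: int = 1) -> int:
    """Estimated cache size for `tokens` of context across `batch` requests."""
    per_token = kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_value)
    return per_token * tokens * batch


def max_context_tokens(free_bytes: int, layers: int, kv_heads: int,
                       head_dim: int, bytes_per_value: int = 2,
                       batch: int = 1) -> int:
    """Largest per-request context that fits in `free_bytes` of headroom."""
    per_token = kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_value)
    return free_bytes // (per_token * batch)


# Hypothetical model shape, chosen only to give round numbers.
shape = dict(layers=32, kv_heads=8, head_dim=128)

size = kv_cache_bytes(tokens=8192, **shape)
print(f"cache at 8192 tokens: {size / 1024**3:.2f} GiB")

limit = max_context_tokens(free_bytes=4 * 1024**3, **shape)
print(f"tokens that fit in 4 GiB of headroom: {limit}")
```

Under these assumptions each token costs 128 KiB, so 8,192 tokens take 1 GiB and doubling the number of concurrent requests halves the context each one can hold.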
The KV cache helps by
Skipping recomputation of the whole prompt at every step
Making each new token cheap once the prompt is processed
Turning long generation from quadratic reprocessing into incremental, per-token work
It costs you
Memory that grows with context length, often the OOM culprit
Headroom you must budget for, not just the model weights
A reason a run that fit yesterday fails on a longer prompt today