KV cache: the memory that grows with your context

The key-value (KV) cache is the working memory that holds a model's internal view of every token it has processed so far. Keeping it means the model reuses past work instead of recomputing the whole prompt for each new token. Its size grows with context length, which makes it the usual reason a longer prompt runs out of memory.

What does the KV cache actually store?

When a model reads your prompt, each attention layer computes two vectors for every token: a key, which later tokens match against, and a value, which they read back when the match is strong. The key-value (KV) cache keeps those around. Without it, generating each new token would mean reprocessing the entire prompt plus everything generated so far, so the attention work per step would grow roughly with the square of the text length instead of linearly. With it, the prompt is processed once, and each new token only adds its own entry and attends over the stored ones. That is why local generation is usable at all on long inputs.
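To make the reuse concrete, here is a minimal sketch of a single attention head with a cache, in plain Python. It is a toy under stated assumptions, not how any particular runtime implements it: `ToyKVCache`, `project`, and the scaling factors standing in for learned projection matrices are all hypothetical names and values chosen for illustration.

```python
import math


def attend(query, keys, values):
    """Softmax attention of one query over every key/value seen so far."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    peak = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(dim)]


def project(token_vec, factor):
    # Assumption: a plain scaling stands in for a learned projection matrix.
    return [factor * x for x in token_vec]


class ToyKVCache:
    """Hypothetical single-head cache: one key and one value per token."""

    def __init__(self):
        self.keys = []
        self.values = []

    def step(self, token_vec):
        # Only the new token is projected; older entries are reused as stored.
        self.keys.append(project(token_vec, 0.5))
        self.values.append(project(token_vec, 2.0))
        return attend(project(token_vec, 1.0), self.keys, self.values)


def recompute_from_scratch(tokens):
    # The no-cache path: re-project every token on every step.
    keys = [project(t, 0.5) for t in tokens]
    values = [project(t, 2.0) for t in tokens]
    return attend(project(tokens[-1], 1.0), keys, values)


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
cache = ToyKVCache()
for token in tokens:
    cached_out = cache.step(token)
full_out = recompute_from_scratch(tokens)
```

Both paths give the same output for the last token. The difference is that the cached path projects each token once, while the from-scratch path redoes every earlier token on every step. The price is visible too: `cache.keys` and `cache.values` hold one entry per token and never shrink.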

Why is it the thing that runs out of memory?

Because it grows. Every token you hold in context, prompt plus everything generated, adds to the cache, so its size scales with context length and with how many requests you run at once. The model weights are a fixed cost once loaded; the KV cache is the moving one. When a prompt that fit yesterday OOMs today, the cache is almost always why. The lever is context length: shorten the maximum context and the cache has less room to grow into, and the headroom comes back. Budget for the cache, not just the weights.
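A rough way to budget for it: each token stores one key and one value per layer, so the size is a product of a few model dimensions and the number of tokens held. The sketch below assumes plain attention with a full-precision-style cache and no quantization, paging, or sliding window. The function name and the example model shape (32 layers, 8 KV heads, head dimension 128, 2 bytes per value) are hypothetical, picked only to make the arithmetic easy to follow.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, context_tokens,
                   batch_size=1, bytes_per_value=2):
    """Estimate KV cache size in bytes for a hypothetical model shape.

    The leading 2 counts the key and the value stored for each token.
    """
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * context_tokens * batch_size


GIB = 2 ** 30

# Hypothetical shape: 32 layers, 8 KV heads, head_dim 128, 16-bit values.
short_ctx = kv_cache_bytes(32, 8, 128, context_tokens=8192)
long_ctx = kv_cache_bytes(32, 8, 128, context_tokens=32768)
batched = kv_cache_bytes(32, 8, 128, context_tokens=8192, batch_size=4)

print(f"8k context:        {short_ctx / GIB:.1f} GiB")
print(f"32k context:       {long_ctx / GIB:.1f} GiB")
print(f"8k context, x4:    {batched / GIB:.1f} GiB")
```

For this made-up shape, every token costs 128 KiB, so 8,192 tokens come to 1 GiB and 32,768 tokens to 4 GiB. Four concurrent 8k requests cost the same 4 GiB as one 32k request, which is why both context length and concurrency belong in the budget.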
