
PagedAttention: paging the KV cache to stop wasting memory

PagedAttention is the technique vLLM uses to manage the key-value (KV) cache in small fixed-size pages rather than one contiguous block per request. By allocating cache in pages on demand, it cuts the wasted memory that contiguous reservations leave behind, so more requests fit in the same pool.

At a glance

What it is: Storing the key-value (KV) cache in fixed-size pages, not one block
Where it comes from: The serving engine vLLM, which built its memory manager around it
What it fixes: Wasted memory from reserving one contiguous block per request
What you get: More concurrent requests in the same memory pool
Comparison

How the KV cache is laid out

| | Contiguous reservation | PagedAttention |
|---|---|---|
| Layout per request | One block sized for the worst case | Small fixed-size pages added as needed |
| Unused tail of a request | Reserved and wasted | Never allocated until needed |
| Requests that fit at once | Fewer; padding eats the pool | More; freed pages return to a shared set |

What problem does PagedAttention solve?

Every request a model serves carries a key-value (KV) cache: the stored attention keys and values for every token seen so far, which grows as the context grows. The naive way to hold it is one contiguous block per request, sized for the longest the request might get. That wastes memory twice over. First, a request that ends short leaves most of its block reserved and unused. Second, pinning each request to its own block fragments the pool, so even when enough memory is free in total, no single gap may be large enough for the next request.
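To put rough numbers on both kinds of waste, here is a minimal sketch in Python. The worst-case length, the request lengths, and the gap sizes are all made-up values chosen for illustration, not measurements from any real deployment.

```python
# Illustrative numbers only: none of these come from a real workload.
MAX_TOKENS = 2048                            # worst-case length each request reserves
actual_lengths = [120, 900, 300, 2048, 60]   # tokens each request really used


def contiguous_waste(lengths, max_tokens):
    """Return (reserved, used, wasted_fraction), counted in token slots."""
    reserved = len(lengths) * max_tokens
    used = sum(lengths)
    return reserved, used, (reserved - used) / reserved


reserved, used, frac = contiguous_waste(actual_lengths, MAX_TOKENS)
print(f"reserved={reserved} used={used} wasted={frac:.1%}")

# Fragmentation: the free gaps add up to more than one reservation,
# but no single gap is big enough to hold a contiguous block.
free_gaps = [1000, 1500, 800]                # hypothetical free runs, in token slots
total_free = sum(free_gaps)
fits_contiguously = any(gap >= MAX_TOKENS for gap in free_gaps)
print(f"total_free={total_free} fits_contiguously={fits_contiguously}")
```

With these assumed lengths, roughly two thirds of the reserved slots are never touched, and the next request is turned away even though 3,300 slots are free in total.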

PagedAttention, the technique vLLM is built around, borrows the idea of paging from operating systems. Instead of one big block, the cache is stored in small fixed-size pages, handed out only as a request actually needs them. There is no worst-case reservation sitting idle, and when a request finishes, its pages go back to a shared set for the next one to use.
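The bookkeeping behind this can be sketched in a few lines of Python. This is a toy model under stated assumptions, not vLLM's implementation: the class name `PagedKVAllocator`, its methods, and the page size are all hypothetical, and it tracks only which page each token lands in, not the tensors themselves.

```python
class PagedKVAllocator:
    """Toy page allocator: hands out fixed-size pages on demand (hypothetical API)."""

    def __init__(self, num_pages, page_size):
        self.page_size = page_size
        self.free_pages = list(range(num_pages))  # shared set of unused page ids
        self.page_tables = {}                     # request id -> list of page ids
        self.lengths = {}                         # request id -> tokens stored so far

    def append_token(self, req_id):
        """Reserve a slot for one more token; return (page_id, offset_in_page)."""
        table = self.page_tables.setdefault(req_id, [])
        n = self.lengths.get(req_id, 0)
        if n == len(table) * self.page_size:      # no page yet, or last page is full
            if not self.free_pages:
                raise MemoryError("KV pool exhausted")
            table.append(self.free_pages.pop())
        self.lengths[req_id] = n + 1
        return table[n // self.page_size], n % self.page_size

    def release(self, req_id):
        """Return a finished request's pages to the shared set."""
        self.free_pages.extend(self.page_tables.pop(req_id, []))
        self.lengths.pop(req_id, None)


alloc = PagedKVAllocator(num_pages=4, page_size=16)
for _ in range(20):                  # 20 tokens need 2 pages of 16 slots
    alloc.append_token("req-a")
print(len(alloc.page_tables["req-a"]), len(alloc.free_pages))
alloc.release("req-a")               # all pages go back to the shared set
print(len(alloc.free_pages))
```

The per-request page table is what lets the pages sit anywhere in the pool: a token's position maps to a page id and an offset, so nothing has to be contiguous.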

What does it buy, and what does it not?

The win is packing. With the waste removed, more requests fit in the same memory, which is exactly the pressure point on a box where the model, its cache, and the operating system all draw from one pool. Higher concurrency for free, more or less.
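A back-of-the-envelope comparison shows the size of the packing gain. The pool size, worst-case length, page size, and typical request length below are assumptions picked for illustration. The count is a snapshot at one moment, since real requests keep growing as they generate.

```python
import math

# Illustrative values only.
POOL_SLOTS = 8192     # token slots available for KV cache
MAX_TOKENS = 2048     # worst-case length a contiguous scheme reserves per request
PAGE_SIZE = 16        # token slots per page
TYPICAL_LEN = 300     # tokens a typical request currently holds

# Contiguous: every request costs a full worst-case block.
contiguous_fit = POOL_SLOTS // MAX_TOKENS

# Paged: every request costs only the pages it has filled so far.
pages_in_pool = POOL_SLOTS // PAGE_SIZE
pages_per_request = math.ceil(TYPICAL_LEN / PAGE_SIZE)
paged_fit = pages_in_pool // pages_per_request

print(f"contiguous: {contiguous_fit} requests, paged: {paged_fit} requests")
```

The only waste left in the paged case is the partly filled last page of each request, which is at most one page minus one slot.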

What it does not do is make memory appear. PagedAttention packs the pool more tightly; it does not enlarge it. The weights still cost what they cost, and a context long enough still exhausts the budget no matter how neatly it is paged. It also does not speed up a single request on its own. Read it as a memory-efficiency trick that raises how much you can serve at once, not a cure for running out of room or a substitute for more memory.
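The limit is easy to check with arithmetic. The sketch below uses the usual per-token cache size (keys plus values, for every layer and KV head) with a hypothetical model shape and cache budget. The layer count, head count, head dimension, and budget are assumptions, not figures for any particular model.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value):
    """Bytes of KV cache one token occupies: keys and values for every layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


# Hypothetical model shape and budget, for illustration only.
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8,
                               head_dim=128, bytes_per_value=2)  # 16-bit values
cache_budget = 8 * 1024**3                 # 8 GiB left for cache after the weights
max_tokens_in_pool = cache_budget // per_token

context_len = 100_000                      # one very long request
fits = context_len <= max_tokens_in_pool

print(f"{per_token} bytes/token, pool holds {max_tokens_in_pool} tokens, fits={fits}")
```

Paging changes how those token slots are shared among requests, not how many of them the budget can hold.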

PagedAttention helps with

  • Wasted memory from over-reserving cache for each request
  • Fitting more concurrent requests in the same memory budget
  • Returning a finished request's pages to a shared pool for reuse
  • Serving many varied-length prompts without padding to the longest

It does not change

  • The total memory you have; it packs the pool, it does not grow it
  • How big the model's weights are; that is a separate cost
  • The fact that a long enough context can still exhaust the pool
  • Raw single-request speed, which depends on the hardware and model

Reviewed: June 2026