PagedAttention: paging the KV cache to stop wasting memory

PagedAttention is the technique vLLM uses to manage the key-value (KV) cache in small fixed-size pages rather than one contiguous block per request. By allocating cache in pages on demand, it cuts the wasted memory that contiguous reservations leave behind, so more requests fit in the same pool.

At a glance

What it is: Storing the key-value (KV) cache in fixed-size pages, not one block

Where it comes from: The serving engine vLLM, which built its memory manager around it

What it fixes: Wasted memory from reserving one contiguous block per request

What you get: More concurrent requests in the same memory pool

What problem does PagedAttention solve?

Every request a model serves carries a key-value (KV) cache: the stored attention keys and values of past tokens, which grows as the context grows. The naive way to hold it is one contiguous block per request, sized for the longest sequence the request might reach. That wastes memory twice over. A request that ends short leaves most of its block reserved and unused, and pinning each request to its own block fragments the pool, so even when enough memory is free in total, no single gap may be large enough for the next request.
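To put rough numbers on both kinds of waste, here is a minimal sketch. The request lengths, the 2048-token maximum, and the gap sizes are made-up values for illustration, and the helper names are hypothetical, not part of any library.

```python
def contiguous_waste(actual_lengths, max_len):
    """Slots reserved vs. used when every request reserves max_len slots up front."""
    reserved = max_len * len(actual_lengths)
    used = sum(actual_lengths)
    return reserved, used, 1 - used / reserved


def fits_contiguous(free_gaps, need):
    """A contiguous reservation needs one single gap of at least `need` slots."""
    return any(gap >= need for gap in free_gaps)


# Hypothetical final lengths (in tokens) of five requests.
lengths = [120, 300, 45, 2048, 510]
reserved, used, wasted = contiguous_waste(lengths, max_len=2048)
print(reserved, used, round(wasted, 3))  # 10240 3023 0.705

# Hypothetical free gaps left between live blocks, in token slots.
gaps = [600, 700, 500]
print(sum(gaps) >= 1024, fits_contiguous(gaps, 1024))  # True False
```

In this toy case about 70% of the reserved slots are never written, and 1800 free slots still cannot host a 1024-slot reservation because they are split across three gaps.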

PagedAttention, the technique vLLM is built around, borrows the idea of paging from operating systems. Instead of one big block, the cache is stored in small fixed-size pages, handed out only as a request actually needs them. There is no worst-case reservation sitting idle, and when a request finishes, its pages go back to a shared set for the next one to use.
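The bookkeeping behind that idea can be sketched in a few lines. This is a toy model, not vLLM's implementation: the class `PagedKVPool`, its method names, and the page size of 16 are assumptions chosen for illustration, and it tracks only which page holds which token, not the keys and values themselves.

```python
class PagedKVPool:
    """Toy page allocator: one shared free list plus a page table per request."""

    def __init__(self, num_pages, page_size):
        self.page_size = page_size
        self.free_pages = list(range(num_pages))  # shared pool of physical pages
        self.page_tables = {}  # request id -> list of physical page ids
        self.lengths = {}      # request id -> number of tokens stored

    def append_token(self, req_id):
        """Reserve a slot for one more token; take a new page only when needed."""
        table = self.page_tables.setdefault(req_id, [])
        n = self.lengths.get(req_id, 0)
        if n == len(table) * self.page_size:  # no page yet, or last page is full
            if not self.free_pages:
                raise MemoryError("KV pool exhausted")
            table.append(self.free_pages.pop())
        self.lengths[req_id] = n + 1
        # Logical position n maps to (physical page, offset inside that page).
        return table[n // self.page_size], n % self.page_size

    def release(self, req_id):
        """Return all of a finished request's pages to the shared pool."""
        self.free_pages.extend(self.page_tables.pop(req_id, []))
        self.lengths.pop(req_id, None)


pool = PagedKVPool(num_pages=4, page_size=16)
for _ in range(20):
    pool.append_token("a")   # 20 tokens -> 2 pages
for _ in range(5):
    pool.append_token("b")   # 5 tokens -> 1 page
print(len(pool.free_pages))  # 1
pool.release("a")
print(len(pool.free_pages))  # 3
```

A request's pages need not be adjacent in memory; the page table is what lets the attention computation find them. The only slack left is the unused tail of each request's last page, which is at most one page minus one token.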

What does it buy, and what does it not?

The win is packing. With the waste removed, more requests fit in the same memory, which is exactly the pressure point on a box where the model, its cache, and the operating system all draw from one pool. The result is higher concurrency from the same hardware, more or less for free.
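A back-of-the-envelope comparison shows the size of the packing gain. The pool size, page size, and per-request token count below are hypothetical, the function names are illustrative, and the paged figure assumes every request stops at the same length, which real traffic does not.

```python
import math


def max_requests_contiguous(pool_slots, max_len):
    """Each request reserves max_len token slots regardless of its real length."""
    return pool_slots // max_len


def max_requests_paged(pool_slots, page_size, tokens_per_request):
    """Each request holds only the pages its tokens actually occupy."""
    pages_total = pool_slots // page_size
    pages_each = math.ceil(tokens_per_request / page_size)
    return pages_total // pages_each


# Hypothetical pool of 8192 token slots; requests that really use 300 tokens.
print(max_requests_contiguous(8192, max_len=2048))                      # 4
print(max_requests_paged(8192, page_size=16, tokens_per_request=300))   # 26
```

The gap shrinks as real request lengths approach the reserved maximum: if every request truly used all 2048 tokens, both schemes would fit the same four requests.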

What it does not do is make memory appear. PagedAttention packs the pool more tightly; it does not enlarge it. The weights still cost what they cost, and a context long enough still exhausts the budget no matter how neatly it is paged. It also does not speed up a single request on its own. Read it as a memory-efficiency trick that raises how much you can serve at once, not a cure for running out of room or a substitute for more memory.
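The ceiling itself is simple arithmetic. The sketch below uses the standard per-token cache size (one key and one value vector per layer per KV head) with a hypothetical model shape and memory budget; the numbers are assumptions, and it ignores activations and other runtime overhead.

```python
def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_elem):
    """Bytes of KV cache one token occupies: a key and a value per layer per KV head."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem


def max_cached_tokens(total_bytes, weight_bytes, per_token_bytes):
    """Upper bound on cached tokens once the weights are resident."""
    return max(0, total_bytes - weight_bytes) // per_token_bytes


GiB = 1024 ** 3

# Hypothetical model: 32 layers, 8 KV heads, head dimension 128, 16-bit cache.
per_token = kv_bytes_per_token(n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_elem=2)
print(per_token)  # 131072 bytes, i.e. 128 KiB per token

# Hypothetical budget: 24 GiB total, 16 GiB taken by weights.
print(max_cached_tokens(24 * GiB, 16 * GiB, per_token))  # 65536
```

Paging decides how close the live requests can get to that 65536-token bound; it does not move the bound. A single context longer than that would not fit however it is paged.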
