What problem does PagedAttention solve?
Every request a model serves carries a key-value (KV) cache: the stored attention keys and values for every token processed so far, which grows as the context grows. The naive way to hold it is one contiguous block per request, sized for the longest the request could become. That wastes memory twice over. A request that ends short leaves most of its block reserved and unused, and pinning each request to its own block fragments the pool, so even when enough memory is free in total, no single gap may be large enough for the next request.
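To put numbers on the two kinds of waste, here is a minimal sketch. The request lengths, the 2048-token reservation, and the gap sizes are all made up for illustration, and the helper names are hypothetical; sizes are counted in token slots rather than bytes.

```python
def reserved_but_unused(actual_lens, max_len):
    """Slots reserved up front that the requests never filled."""
    return sum(max_len - n for n in actual_lens)


def fits_contiguously(free_gaps, need):
    """A contiguous block needs a single gap at least as large as the request."""
    return any(gap >= need for gap in free_gaps)


# Hypothetical workload, chosen only for illustration.
actual_lens = [100, 300, 2000, 50]  # tokens each request really ended up using
max_len = 2048                      # worst-case reservation per request

idle_slots = reserved_but_unused(actual_lens, max_len)
reserved_slots = max_len * len(actual_lens)

# Fragmentation: plenty free in total, but scattered across small gaps.
free_gaps = [300, 500, 400]
total_free = sum(free_gaps)
blocked = not fits_contiguously(free_gaps, need=1000)

print(idle_slots, reserved_slots)  # 5742 of 8192 reserved slots sit idle
print(total_free, blocked)         # 1200 slots free, yet a 1000-slot block does not fit
```

In this made-up case about 70% of the reserved slots are never used, and a request that needs 1000 slots is turned away even though 1200 are free.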
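PagedAttention, the technique vLLM is built around, borrows the idea of paging from operating systems. Instead of one big block, the cache is stored in small fixed-size pages, handed out only as a request actually needs them. The pages of one request do not have to sit next to each other in memory; a per-request table records which pages it holds, so any free page will do. There is no worst-case reservation sitting idle (at most the unfilled tail of a request's last page), and when a request finishes, its pages go back to a shared pool for the next one to use.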
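The bookkeeping behind that idea can be sketched in a few lines. This is a toy model, not vLLM's implementation: the class name PagePool, its methods, and the 16-token page size are assumptions made for illustration, and it tracks only page ids, not the tensors stored in them.

```python
class PagePool:
    """Toy page allocator: one shared free list plus a page table per request."""

    def __init__(self, num_pages, page_size=16):
        self.page_size = page_size          # tokens per page (assumed value)
        self.free = list(range(num_pages))  # ids of pages nobody holds
        self.tables = {}                    # request id -> list of page ids
        self.lengths = {}                   # request id -> tokens stored

    def append_token(self, req):
        """Record one more cached token, taking a new page only when needed."""
        table = self.tables.get(req, [])
        n = self.lengths.get(req, 0)
        if n == len(table) * self.page_size:  # no page yet, or last page is full
            if not self.free:
                raise MemoryError("page pool exhausted")
            table.append(self.free.pop())     # any free page will do
            self.tables[req] = table
        self.lengths[req] = n + 1

    def release(self, req):
        """Return all of a finished request's pages to the shared pool."""
        self.free.extend(self.tables.pop(req, []))
        self.lengths.pop(req, None)


pool = PagePool(num_pages=4, page_size=16)
for _ in range(20):
    pool.append_token("a")  # 20 tokens -> 2 pages
for _ in range(5):
    pool.append_token("b")  # 5 tokens -> 1 page
free_while_running = len(pool.free)

pool.release("a")
free_after_release = len(pool.free)
print(free_while_running, free_after_release)  # 1 3
```

Request "a" holds two pages for 20 tokens and "b" holds one for 5; neither reserved anything for a length it never reached, and releasing "a" makes its pages immediately available to any other request.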
What does it buy, and what does it not?
The win is packing. With the waste removed, more requests fit in the same memory, which is exactly the pressure point on a box where the model, its cache, and the operating system all draw from one pool. The result is higher concurrency at little cost: the main overhead is the extra lookup through a request's page table whenever attention reads the cache.
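A rough way to see the size of the packing win is to count how many requests fit in a fixed pool under each scheme. The sketch below is back-of-the-envelope arithmetic under simplifying assumptions: every request ends at the same length, the pool size, lengths, and 16-token page size are invented, and scheduling is ignored.

```python
import math


def max_concurrent_contiguous(pool_slots, max_len):
    """Each request reserves a worst-case block up front."""
    return pool_slots // max_len


def max_concurrent_paged(pool_slots, actual_len, page_size):
    """Each request holds only the pages its actual length requires."""
    pages_total = pool_slots // page_size
    pages_per_request = math.ceil(actual_len / page_size)
    return pages_total // pages_per_request


# Hypothetical numbers, chosen only for illustration.
pool_slots = 8192  # token slots available for KV cache
max_len = 2048     # worst-case length a request is allowed to reach
actual_len = 200   # length requests actually reach in this example
page_size = 16     # tokens per page (assumed)

contiguous = max_concurrent_contiguous(pool_slots, max_len)
paged = max_concurrent_paged(pool_slots, actual_len, page_size)
print(contiguous, paged)  # 4 39
```

The gap is this large only because the example requests stop far short of the worst case. When requests routinely run near the maximum length, the two counts converge.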
What it does not do is make memory appear. PagedAttention packs the pool more tightly; it does not enlarge it. The weights still cost what they cost, and a long enough context still exhausts the budget no matter how neatly it is paged. Nor does it speed up a single request on its own. Read it as a memory-efficiency technique that raises how much you can serve at once, not a cure for running out of room or a substitute for more memory.
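The ceiling is easy to estimate. Each cached token stores one key and one value vector per layer, so its cost is 2 x layers x KV heads x head dimension x bytes per element. The sketch below applies that to a hypothetical model shape and an assumed 8 GiB cache budget; none of the numbers describe a specific model or machine.

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_elem):
    """Keys and values (factor of 2) for every layer and KV head."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem


def max_cached_tokens(budget_bytes, per_token_bytes, page_size):
    """Tokens the budget can hold when memory is handed out in whole pages."""
    pages = budget_bytes // (per_token_bytes * page_size)
    return pages * page_size


# Hypothetical model shape and budget, chosen only for illustration.
per_token = kv_bytes_per_token(layers=32, kv_heads=8, head_dim=128, bytes_per_elem=2)
budget = 8 * 1024**3  # 8 GiB left for KV cache after weights and everything else

capacity = max_cached_tokens(budget, per_token, page_size=16)
fits_100k = 100_000 <= capacity
print(per_token, capacity, fits_100k)  # 131072 65536 False
```

At 128 KiB per token, 8 GiB holds 65,536 tokens in total across all requests. A single 100,000-token context does not fit, paged or not; only a larger budget or a smaller per-token cost changes that.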