What is continuous batching?
A serving engine runs many requests through the GPU together because the hardware is most efficient when it has a full batch of work. The naive version fixes the batch at the start: it waits for every request in the batch to finish before starting the next group. The trouble is that requests finish at different times. A short answer is done in a moment, a long one keeps going, and the short request’s slot sits empty in the meantime. The accelerator is paying rent on idle seats.
Continuous batching instead revisits the batch at every token step. As soon as a request finishes, a waiting request takes its slot, so the batch stays as full as the queue allows and nothing waits for the slowest member. This is why a single box can serve a steady stream of users at high total throughput: as long as requests keep arriving, the expensive hardware rarely idles.
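To make the difference concrete, here is a toy simulation that counts token steps under both schemes. It is a sketch under simplifying assumptions, not how any particular engine is implemented: every request is already queued, each running request produces exactly one token per step, every step costs the same, and the function names and request lengths are made up for illustration.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Token steps when each fixed batch runs until its longest request ends."""
    steps = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        steps += max(batch)  # short requests leave their slots idle until then
    return steps


def continuous_batching_steps(lengths, batch_size):
    """Token steps when a freed slot is refilled before the next step."""
    queue = deque(lengths)   # tokens still to generate, per waiting request
    running = []             # tokens still to generate, per occupied slot
    steps = 0
    while queue or running:
        # Refill any free slots from the waiting queue.
        while queue and len(running) < batch_size:
            running.append(queue.popleft())
        # One token step: every running request emits one token.
        running = [left - 1 for left in running]
        # Finished requests give up their slots.
        running = [left for left in running if left > 0]
        steps += 1
    return steps


# Hypothetical workload: one long request among short ones, two slots.
lengths = [2, 10, 2, 2, 2, 2]
static_steps = static_batching_steps(lengths, batch_size=2)
continuous_steps = continuous_batching_steps(lengths, batch_size=2)
print(static_steps, continuous_steps)  # 14 10
```

Both runs generate the same 20 tokens. The fixed batches need 14 steps because a slot sits empty while the 10-token request finishes; the continuous version needs 10 because the short requests cycle through the other slot the whole time.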
Will it make my single request faster?
No, and it helps to be honest about that. Continuous batching raises total throughput, the sum of tokens across everyone being served. If you are the only user with one request running, there is no one to batch alongside you, so the benefit is close to nothing. This is the gap behind a lot of confusing numbers: a server quoting a big tokens-per-second figure is usually measuring many requests at once, while one agent on a desk experiences single-stream speed. Both numbers are real. They answer different questions, and continuous batching only moves the first one.
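The two numbers can be separated with a little arithmetic. The sketch below assumes an idealized server where a token step takes the same time no matter how many requests are in the batch; in practice step time tends to grow with batch size, so treat the aggregate figure as an upper bound. The 20 ms step time and the function name are hypothetical.

```python
def tokens_per_second(active_requests, step_ms):
    """Return (per_stream, aggregate) tokens/s under a fixed step time."""
    per_stream = 1000 / step_ms               # each request gets one token per step
    aggregate = active_requests * per_stream  # summed over every request in the batch
    return per_stream, aggregate


# Hypothetical step time of 20 ms.
print(tokens_per_second(1, 20))   # (50.0, 50.0)
print(tokens_per_second(32, 20))  # (50.0, 1600.0)
```

Under these assumptions the headline figure climbs from 50 to 1600 tokens per second as the batch fills, while each individual request still sees 50.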