Continuous batching: keeping the accelerator busy

Continuous batching is a serving technique where the engine packs many requests into one batch and updates that batch every step: as soon as one request finishes, a waiting one takes its slot, instead of holding the whole batch until the slowest member is done. It keeps the accelerator busy and is the main reason a server can hold high total throughput across many users at once.

What is continuous batching?

A serving engine runs many requests through the GPU together because the hardware is most efficient when it has a full batch of work. The naive version, static batching, fixes the batch at the start: it waits for every request in the batch to finish before starting the next group. The trouble is that requests finish at different times. A short answer is done in a moment, a long one keeps going, and the short request’s slot sits empty in the meantime. The operator is paying rent on idle seats.

Continuous batching updates the batch at every token step. The instant a request finishes, a waiting request slides into its place, so the batch stays as full as the queue allows. Nothing waits for the slowest member. This is why a single box can serve a steady stream of users at high total throughput: the expensive hardware almost never idles.
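
To make the difference concrete, here is a toy simulation, not a real serving engine. It assumes each active request produces exactly one token per step, that every step costs the same, and that response lengths are known up front (a real engine only learns them when a request ends). The function names and the sample lengths are made up for illustration.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Steps needed when each batch runs until its slowest request finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        # The whole group is held until its longest member is done.
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """Steps needed when a finished request's slot is refilled before the next step."""
    queue = deque(lengths)
    active = []  # tokens still to generate for each request in the batch
    steps = 0
    while queue or active:
        # Fill any free slots from the waiting queue.
        while queue and len(active) < batch_size:
            active.append(queue.popleft())
        steps += 1
        # Every active request emits one token; requests on their last token leave.
        active = [remaining - 1 for remaining in active if remaining > 1]
    return steps


# Hypothetical workload: short and long answers alternating, two slots.
lengths = [2, 8, 2, 8, 2, 8, 2, 8]
batch_size = 2

static = static_batching_steps(lengths, batch_size)
continuous = continuous_batching_steps(lengths, batch_size)

total_tokens = sum(lengths)
print(f"static:     {static} steps, {total_tokens / (static * batch_size):.0%} of slots used")
print(f"continuous: {continuous} steps, {total_tokens / (continuous * batch_size):.0%} of slots used")
```

On this workload the static version takes 32 steps and the continuous version takes 22 for the same 40 tokens, because each short request's slot is handed to the next request in the queue instead of waiting on its long neighbor.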

Will it make my single request faster?

No, and it helps to be honest about that. Continuous batching raises total throughput, the sum of tokens across everyone being served. If you are the only user with one request running, there is no one to batch alongside you, so the benefit is close to nothing. This is the gap behind a lot of confusing numbers: a server quoting a big tokens-per-second figure is usually measuring many requests at once, while one agent on a desk experiences single-stream speed. Both numbers are real. They answer different questions, and continuous batching only moves the first one.
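
The two numbers can be told apart with a little arithmetic. The sketch below is a deliberate simplification: it assumes every step takes the same time no matter how full the batch is (on real hardware, step time tends to grow somewhat as the batch fills), and the figures of 30 steps per second and 32 slots are invented for illustration, not measurements.

```python
def throughput(concurrent_requests, batch_size, steps_per_second):
    """Return (aggregate, per_request) tokens per second.

    Simplifying assumptions: each active request emits one token per step,
    and a step takes the same time regardless of how full the batch is.
    """
    active = min(concurrent_requests, batch_size)
    aggregate = active * steps_per_second
    per_request = steps_per_second if active else 0
    return aggregate, per_request


# Hypothetical server: 32 batch slots, 30 steps per second.
one_user = throughput(concurrent_requests=1, batch_size=32, steps_per_second=30)
full_house = throughput(concurrent_requests=32, batch_size=32, steps_per_second=30)

print("one user:  ", one_user)    # aggregate and single-stream are the same number
print("full batch:", full_house)  # aggregate is 32x larger, single-stream is unchanged
```

Under these assumptions the full batch reports 960 tokens per second in total while each individual request still sees 30. That is the big server number and the single-stream number side by side.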
