Continuous batching: keeping the accelerator busy

Continuous batching is a serving technique where the engine packs many requests into one batch and updates that batch every step: as soon as one request finishes, a waiting one takes its slot, instead of holding the whole batch until the slowest member is done. It keeps the accelerator busy and is the main reason a server can hold high total throughput across many users at once.

At a glance

What it is: Adding waiting requests to the batch and dropping finished ones at every token step
What it fixes: The GPU idling while it waits for the slowest request
What it raises: Total throughput across many concurrent requests
What it does not raise: Single-stream speed when only one request is running
Comparison

Static versus continuous batching

When a request finishes
  • Static batch: Its slot sits empty until the whole batch ends
  • Continuous batch: A waiting request takes the slot immediately

Accelerator use
  • Static batch: Drops as fast requests finish early
  • Continuous batch: Stays high, slots keep refilling

Best for
  • Static batch: Requests that all finish together
  • Continuous batch: Mixed lengths and steady concurrent traffic

What is continuous batching?

A serving engine runs many requests through the GPU together because the hardware is most efficient when it has a full batch of work. The naive version fixes the batch at the start: it waits for every request in the batch to finish before starting the next group. The trouble is that requests finish at different times. A short answer is done in a moment, a long one keeps going, and the short request’s slot sits empty in the meantime. The accelerator is paying rent on idle seats.

Continuous batching updates the batch at every token step. The instant a request finishes, a waiting request slides into its place, so the batch stays as full as the queue allows. Nothing waits for the slowest member. This is why a single box can serve a steady stream of users at high total throughput: as long as requests are waiting in the queue, the expensive hardware rarely idles.
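
The difference between the two schedules can be made concrete with a toy simulation. This is a minimal sketch, not how any particular engine is implemented. It assumes every step produces one token per occupied slot, that a step takes the same time however many slots are filled, and it ignores prompt processing. The function names and the request lengths are made up for illustration.

```python
from collections import deque


def static_batching(lengths, slots):
    """Fixed batches: each group runs until its longest request is done."""
    steps = 0
    busy = 0  # slot-steps that actually produced a token
    for start in range(0, len(lengths), slots):
        batch = lengths[start:start + slots]
        steps += max(batch)   # the whole batch waits for its slowest member
        busy += sum(batch)    # shorter requests leave their slots idle
    return steps, busy


def continuous_batching(lengths, slots):
    """Refill free slots from the queue before every token step."""
    queue = deque(lengths)
    running = []  # tokens still to generate for each active request
    steps = 0
    busy = 0
    while queue or running:
        while queue and len(running) < slots:
            running.append(queue.popleft())
        steps += 1
        busy += len(running)
        # Requests on their last token finish; the rest carry on.
        running = [left - 1 for left in running if left > 1]
    return steps, busy


# Hypothetical workload: tokens to generate per request, 4 slots available.
lengths = [2, 8, 3, 8, 2, 1, 8, 4]
slots = 4

for name, schedule in [("static", static_batching),
                       ("continuous", continuous_batching)]:
    steps, busy = schedule(lengths, slots)
    print(f"{name}: {steps} steps, {busy / (steps * slots):.0%} of slot-steps used")
# static: 16 steps, 56% of slot-steps used
# continuous: 12 steps, 75% of slot-steps used
```

Both schedules generate the same 36 tokens. The continuous one simply gets through them in fewer steps because finished slots are refilled instead of left empty.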

Will it make my single request faster?

No, and it helps to be honest about that. Continuous batching raises total throughput, the sum of tokens across everyone being served. If you are the only user with one request running, there is no one to batch alongside you, so the benefit is close to nothing. This is the gap behind a lot of confusing numbers: a server quoting a big tokens-per-second figure is usually measuring many requests at once, while one agent on a desk experiences single-stream speed. Both numbers are real. They answer different questions, and continuous batching only moves the first one.
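
The two numbers can be put side by side with some simple arithmetic. This is a rough sketch under an idealized assumption: each step emits one token for every active request, and the step time does not grow as more requests join the batch (in practice it usually grows somewhat). The function name and the 20 ms step time are hypothetical values chosen for illustration, not measurements.

```python
def tokens_per_second(active_requests, step_ms):
    """Return (single-stream speed, total throughput) in tokens per second.

    Idealized model: one token per active request per step, with a step
    time that stays constant regardless of batch size.
    """
    per_stream = 1000 / step_ms            # what one request experiences
    total = active_requests * per_stream   # what the server can quote
    return per_stream, total


# Hypothetical step time of 20 ms.
for active in (1, 32):
    per_stream, total = tokens_per_second(active, step_ms=20)
    print(f"{active:>2} active: {per_stream:.0f} tok/s per stream, {total:.0f} tok/s total")
#  1 active: 50 tok/s per stream, 50 tok/s total
# 32 active: 50 tok/s per stream, 1600 tok/s total
```

In this model the per-stream figure is identical in both rows. Only the total changes, and the total is the number continuous batching helps keep high.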

Continuous batching helps

  • A server with many users sending requests at once
  • Mixed request lengths, where some finish far sooner than others
  • Total tokens per second across all in-flight requests

It does little for

  • One person, one request at a time, with nothing else queued
  • The latency of a single short prompt with no other load
  • Memory pressure: every concurrent request still takes up its own key-value (KV) cache memory
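
The memory point in the last bullet can be estimated with a back-of-the-envelope calculation. This sketch assumes a standard transformer that stores one key and one value vector per layer, per KV head, per token, and that each request reserves cache for its full length up front. The model dimensions, the 16 GiB budget, and the function names are hypothetical and do not describe any specific model or engine.

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Cache bytes held per token: a key and a value vector in every layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


def max_concurrent_requests(cache_budget_bytes, tokens_per_request, per_token_bytes):
    """How many requests fit if each reserves cache for its full length."""
    return cache_budget_bytes // (tokens_per_request * per_token_bytes)


# Hypothetical model: 32 layers, 8 KV heads of dimension 128, 16-bit values.
per_token = kv_cache_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)
print(per_token)  # 131072 bytes, i.e. 128 KiB per token

# Hypothetical budget: 16 GiB left for cache, 4096 tokens per request.
print(max_concurrent_requests(16 * 2**30, 4096, per_token))  # 32
```

However well the scheduler refills slots, the batch cannot grow past what the cache budget allows.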

Reviewed: June 2026