Learn

Batch size: how many requests run together

Batch size is how many requests or sequences a model processes together in one pass. A larger batch puts more work through the GPU at once, raising total throughput, but it costs more memory because every request in the batch needs its own working space and key-value (KV) cache.

At a glance

What it is
How many requests or sequences run together in one pass
What it buys
Higher total throughput by keeping the GPU busier
What it costs
More memory: each request adds its own working state
Why it is a knob
Set too high it OOMs, set too low it wastes the GPU

What is batch size?

Batch size is how many requests, or sequences, a model works through together in a single pass. A GPU (graphics processing unit) is happiest doing the same maths across many numbers at once, so feeding it one request at a time leaves it partly idle. Group several requests into a batch and the same pass does more useful work, which raises total throughput, the number of tokens per second across everyone you are serving. For a server handling many users, a healthy batch size is where a lot of the performance lives.

Why is it a knob you can get wrong?

Because every request in the batch needs its own working space and its own key-value (KV) cache, so memory use climbs with batch size. Set it too high and you OOM (hit an out-of-memory error), often before the first real load arrives. Set it too low and the GPU idles while throughput suffers. There is also a latency angle: a single request can finish sooner when it is not queued behind a big batch, so the best setting depends on whether you are optimising for one fast answer or many answers at once. On a shared-memory box like the DGX Spark, where the model competes with the operating system for one pool, batch size is one of the first things to turn down when memory is tight.

A bigger batch

  • Pushes more total tokens per second through the GPU
  • Keeps the GPU busy instead of idling between single requests
  • Suits serving many users where total throughput matters

A smaller batch

  • Uses less memory, leaving headroom against an OOM
  • Can lower latency for a single request that is not waiting on others
  • Is the safer default when you are unsure of your memory budget

Related terms

← All terms Reviewed: June 2026