What is batch size?
Batch size is how many requests, or sequences, a model works through together in a single pass. A GPU (graphics processing unit) is happiest doing the same maths across many numbers at once, so feeding it one request at a time leaves it partly idle. Group several requests into a batch and the same pass does more useful work, which raises total throughput, the number of tokens per second across everyone you are serving. For a server handling many users, a healthy batch size is where a lot of the performance lives.
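To make the throughput effect concrete, here is a toy sketch in Python. It assumes each decoding pass costs a fixed overhead plus a small extra cost per sequence, and that each pass produces one token per sequence in the batch. The function names and the 20 ms and 0.5 ms figures are made up for illustration; they are not measurements from any real GPU.

```python
# Toy cost model (an assumption, not a benchmark): every pass pays a fixed
# overhead plus a small per-sequence cost, and emits one token per sequence.

def step_time_ms(batch_size, fixed_ms=20.0, per_seq_ms=0.5):
    """Time for one pass over the whole batch, in milliseconds."""
    return fixed_ms + per_seq_ms * batch_size


def tokens_per_second(batch_size, fixed_ms=20.0, per_seq_ms=0.5):
    """Total tokens per second across every sequence in the batch."""
    return batch_size * 1000.0 / step_time_ms(batch_size, fixed_ms, per_seq_ms)


if __name__ == "__main__":
    for b in (1, 8, 32):
        print(f"batch={b:>2}  step={step_time_ms(b):.1f} ms  "
              f"throughput={tokens_per_second(b):.1f} tok/s")
```

Under these assumptions total throughput rises as the batch grows because the fixed overhead is shared, while each individual pass also gets a little slower.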
Why is it a knob you can get wrong?
Because every request in the batch needs its own working space and its own key-value (KV) cache, so memory use climbs roughly in proportion to batch size, and faster still when the sequences are long. Set it too high and you OOM (hit an out-of-memory error), often before the first real load arrives. Set it too low and the GPU sits partly idle and throughput suffers. There is also a latency angle: a single request can finish sooner when it is not queued behind, or sharing each pass with, a big batch, so the best setting depends on whether you are optimising for one fast answer or many answers at once. On a shared-memory box like the DGX Spark, where the GPU and CPU draw on one pool and the model competes with the operating system for it, batch size is one of the first things to turn down when memory is tight.
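A rough way to see the memory side is to estimate the KV cache alone. The sketch below, in Python, uses the common approximation that each token stores one key and one value vector per layer. The model shape (32 layers, 8 KV heads, head dimension 128, 2 bytes per value) is a hypothetical example, and the function names are illustrative. It ignores model weights, activations, and any paging or quantisation a real serving engine may apply.

```python
# Back-of-envelope KV cache estimate. The default model shape is hypothetical;
# substitute the numbers for the model you actually serve.

def kv_cache_bytes(batch_size, seq_len, n_layers=32, n_kv_heads=8,
                   head_dim=128, bytes_per_elem=2):
    """Approximate KV cache size in bytes for a batch of equal-length sequences."""
    # The factor of 2 covers one key vector and one value vector per token.
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return batch_size * seq_len * per_token


def max_batch_size(free_bytes, seq_len, **model_shape):
    """Largest batch whose KV cache fits in free_bytes (0 if none fits)."""
    per_sequence = kv_cache_bytes(1, seq_len, **model_shape)
    return free_bytes // per_sequence


if __name__ == "__main__":
    gib = 1024 ** 3
    for b in (1, 8, 32):
        print(f"batch={b:>2}  KV cache ~ {kv_cache_bytes(b, 4096) / gib:.1f} GiB")
    print("fits in 10 GiB:", max_batch_size(10 * gib, 4096))
```

With this example shape, one 4096-token sequence needs about 0.5 GiB of cache, so a batch of 8 needs about 4 GiB. The memory left after loading the weights sets a hard ceiling on batch size.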