Throughput: how many tokens per second

Throughput is the rate at which a serving system produces output, usually measured in tokens per second. It tells you how quickly a long response finishes and how many requests a server can carry at once. It is distinct from latency, which is about how long you wait before output starts.

What is throughput?

Throughput is the rate at which a model produces output, almost always counted in tokens per second. If a reply is two hundred tokens long and the system runs at fifty tokens per second, the body of that reply takes about four seconds to stream out. Throughput is the number you feel on a long answer: higher means the text finishes sooner. There are two senses worth keeping apart: per-request throughput, the speed at which a single reply streams, and aggregate throughput, the total tokens per second a server produces across all the requests it is serving at once.
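The arithmetic above can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this example, and the eight-request server is a hypothetical figure, not a measurement.

```python
def stream_seconds(output_tokens: int, tokens_per_second: float) -> float:
    """Time to stream the body of a reply at a steady per-request rate."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return output_tokens / tokens_per_second


def aggregate_tokens_per_second(per_request_rates: list[float]) -> float:
    """Total tokens per second across all requests being served at once."""
    return sum(per_request_rates)


# The example from the text: a 200-token reply at 50 tokens per second.
print(stream_seconds(200, 50.0))  # 4.0

# Hypothetical server: 8 concurrent requests, each streaming at 30 tokens/s.
# Each user sees 30 tokens/s; the server as a whole produces 240 tokens/s.
print(aggregate_tokens_per_second([30.0] * 8))  # 240.0
```

The second call shows why the two senses need separate names: the same server can be described as "30 tokens per second" or "240 tokens per second" depending on which one is meant.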

How is it different from latency?

Latency is the wait before anything happens: the time to the first token. Throughput is the speed after that, while tokens are flowing. The two can move in opposite directions. Batching many requests together usually raises total throughput because the hardware stays busy, but it can also raise the latency any single request sees while it waits its turn. A server tuned for one is not automatically good at the other.
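One way to see the split is to take the arrival times of a single request's tokens and separate the wait from the flow. The sketch below assumes you already have those timestamps; the function name and the two sample timelines are invented for illustration.

```python
def ttft_and_throughput(
    request_start: float, token_times: list[float]
) -> tuple[float, float]:
    """Split one request's timeline into latency and throughput.

    request_start: when the request was sent, in seconds.
    token_times: arrival time of each output token, in seconds, in order.
    Returns (time to first token, tokens per second after the first token).
    """
    if len(token_times) < 2:
        raise ValueError("need at least two tokens to measure a rate")
    ttft = token_times[0] - request_start
    # Every token after the first arrives inside this window.
    streaming_window = token_times[-1] - token_times[0]
    if streaming_window <= 0:
        raise ValueError("token_times must be increasing")
    return ttft, (len(token_times) - 1) / streaming_window


# Hypothetical request A: starts quickly, then streams slowly.
quick_start = ttft_and_throughput(0.0, [0.5, 1.5, 2.5, 3.5, 4.5])
# Hypothetical request B: waits longer, then streams quickly.
slow_start = ttft_and_throughput(0.0, [2.0, 2.25, 2.5, 2.75, 3.0])

print(quick_start)  # (0.5, 1.0)
print(slow_start)   # (2.0, 4.0)
```

Request A wins on latency and loses on throughput; request B is the reverse. Neither number alone tells you which experience was better.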

How do you read a throughput claim?

A throughput figure without its conditions is close to meaningless. Was it one request or a hundred? What prompt length, what output length, what hardware? The same model can legitimately produce very different numbers under different load. When you see tokens per second quoted, the right reflex is to ask how it was measured before you compare it to anything.
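A simple habit that follows from this is to store a throughput number together with the conditions it was measured under, and to compare two numbers only when those conditions match. The sketch below is one possible shape for that record; the class, its field names, and the sample values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ThroughputClaim:
    """A tokens-per-second figure plus the conditions it was measured under."""

    tokens_per_second: float
    concurrent_requests: int
    prompt_tokens: int
    output_tokens: int
    hardware: str

    def conditions(self) -> tuple[int, int, int, str]:
        """Everything except the headline number."""
        return (
            self.concurrent_requests,
            self.prompt_tokens,
            self.output_tokens,
            self.hardware,
        )


def comparable(a: ThroughputClaim, b: ThroughputClaim) -> bool:
    """Two claims can be compared only if they were measured the same way."""
    return a.conditions() == b.conditions()


# Made-up figures: the same hardware, measured with 1 request and with 100.
single = ThroughputClaim(50.0, 1, 512, 200, "example-gpu")
batched = ThroughputClaim(2400.0, 100, 512, 200, "example-gpu")

print(comparable(single, batched))  # False
```

Exact equality is deliberately strict here. In practice you might allow small differences in prompt or output length, but the point stands: the headline number is only one field of the record.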
