Throughput: how many tokens per second

Throughput is the rate at which a serving system produces output, usually measured in tokens per second. It tells you how quickly a long response finishes and how many requests a server can carry at once. It is distinct from latency, which is about how long you wait before output starts.

At a glance

What it is: Tokens generated per second once output is flowing
Why it matters: It sets how fast a long answer completes
Two flavours: Per-request speed, and total tokens across all requests
Not the same as: Latency, the wait before the first token

Comparison

Throughput versus latency

                     Throughput                     Latency
Question it answers  How fast do tokens come out?   How long until the first token?
Unit                 Tokens per second              Seconds (or milliseconds)
Felt most when       The answer is long             The answer is just starting

What is throughput?

Throughput is the rate at which a model produces output, almost always counted in tokens per second. If a reply is two hundred tokens long and the system runs at fifty tokens per second, the body of that reply takes about four seconds to stream out. Throughput is the number you feel on a long answer: higher means the text finishes sooner. There are two senses worth keeping apart: the speed of a single request, and the total tokens a server pushes across all requests it is serving at once.
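
The arithmetic above can be sketched in a few lines. This is only an illustration: the function names are hypothetical, and the aggregate figure assumes eight concurrent requests that each stream at the same rate, which a real server may not sustain.

```python
def stream_seconds(output_tokens: int, tokens_per_second: float) -> float:
    """Time to stream the body of a reply, ignoring the wait for the first token."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return output_tokens / tokens_per_second


def aggregate_tokens_per_second(per_request_rates: list[float]) -> float:
    """Total server throughput: the sum of the rates of all concurrent requests."""
    return sum(per_request_rates)


# The example from the text: 200 tokens at 50 tokens per second.
print(stream_seconds(200, 50))  # 4.0

# Assumed scenario: eight requests, each streaming at 50 tokens per second.
print(aggregate_tokens_per_second([50.0] * 8))  # 400.0
```

The first function is the per-request sense of throughput; the second is the server-wide sense. A quoted figure should say which of the two it is.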

How is it different from latency?

Latency is the wait before anything happens: the time to the first token. Throughput is the speed after that, while tokens are flowing. The two can move in opposite directions. Batching many requests together usually raises total throughput because the hardware stays busy, but it can also raise the latency any single request sees while it waits its turn. A server tuned for one is not automatically good at the other.
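
A toy model can make the batching trade-off concrete. Everything here is an assumption chosen for illustration, not a measurement: each decoding step is taken to cost a fixed overhead plus a small amount per request in the batch, and the server is assumed to wait for a batch to fill before starting.

```python
def step_seconds(batch_size: int, base: float = 0.020, per_request: float = 0.001) -> float:
    """Assumed cost of one decoding step that emits one token per request."""
    return base + per_request * batch_size


def per_request_tps(batch_size: int) -> float:
    """Tokens per second seen by a single request inside the batch."""
    return 1.0 / step_seconds(batch_size)


def total_tps(batch_size: int) -> float:
    """Tokens per second summed over every request in the batch."""
    return batch_size / step_seconds(batch_size)


def worst_case_queue_wait(batch_size: int, arrival_gap: float = 0.05) -> float:
    """Extra wait for the earliest request if the server waits for the batch to fill."""
    return (batch_size - 1) * arrival_gap


for batch in (1, 4, 16):
    print(
        f"batch={batch:2d}  total={total_tps(batch):6.1f} tok/s  "
        f"per-request={per_request_tps(batch):5.1f} tok/s  "
        f"queue wait up to {worst_case_queue_wait(batch):.2f} s"
    )
```

Under these assumed costs, total throughput climbs with batch size while each request streams a little slower and may wait longer before it starts.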

How do you read a throughput claim?

A throughput figure without its conditions is close to meaningless. Was it measured on one request or a hundred? At what prompt length, what output length, and on what hardware? The same model can legitimately produce very different numbers under different load. When you see tokens per second quoted, ask how it was measured before you compare it to anything.
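
One way to keep a figure attached to its conditions is to record them together. The sketch below is a hypothetical structure, not a standard format: the class, field names, and the "example-gpu" label are made up for illustration, and the timestamps are invented rather than measured.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ThroughputReport:
    hardware: str
    concurrent_requests: int
    prompt_tokens: int
    output_tokens: int
    time_to_first_token_s: float
    decode_tokens_per_second: float

    def same_conditions(self, other: "ThroughputReport") -> bool:
        """Two figures are only comparable if they were measured the same way."""
        mine = (self.hardware, self.concurrent_requests, self.prompt_tokens, self.output_tokens)
        theirs = (other.hardware, other.concurrent_requests, other.prompt_tokens, other.output_tokens)
        return mine == theirs


def summarize(request_start_s: float, token_times_s: list[float], *,
              hardware: str, concurrent_requests: int, prompt_tokens: int) -> ThroughputReport:
    """Split one request's timeline into latency (first token) and throughput (the rest)."""
    if len(token_times_s) < 2:
        raise ValueError("need at least two token timestamps")
    first, last = token_times_s[0], token_times_s[-1]
    if last <= first:
        raise ValueError("token timestamps must increase")
    return ThroughputReport(
        hardware=hardware,
        concurrent_requests=concurrent_requests,
        prompt_tokens=prompt_tokens,
        output_tokens=len(token_times_s),
        time_to_first_token_s=first - request_start_s,
        # Tokens after the first one, divided by the time they took to arrive.
        decode_tokens_per_second=(len(token_times_s) - 1) / (last - first),
    )


# Invented timeline: first token at 0.5 s, then one token every 0.5 s.
report = summarize(0.0, [0.5, 1.0, 1.5, 2.0, 2.5],
                   hardware="example-gpu", concurrent_requests=1, prompt_tokens=128)
print(report)
```

Calling `same_conditions` before comparing two reports is a cheap guard against setting a single-request number beside a heavily batched one.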

Raises throughput

  • Batching several requests so the hardware stays busy
  • A faster memory bus feeding the compute units
  • Speculative decoding, when it accepts most draft tokens
  • A smaller weight format that moves less data per token
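
The memory-related items above can be illustrated with a rough back-of-envelope bound. It rests on a simplifying assumption that is not stated in the text: that single-request decoding is limited by reading every model weight from memory once per token, ignoring compute time, caches, and any other traffic. The bandwidth and model size below are hypothetical round numbers.

```python
def decode_tps_upper_bound(memory_bandwidth_bytes_per_s: float,
                           parameter_count: float,
                           bytes_per_parameter: float) -> float:
    """Rough ceiling on single-request tokens per second under the stated assumption."""
    bytes_per_token = parameter_count * bytes_per_parameter
    return memory_bandwidth_bytes_per_s / bytes_per_token


BANDWIDTH = 1e12   # assumed: 1 TB/s of memory bandwidth
PARAMS = 7e9       # assumed: a 7-billion-parameter model

two_byte = decode_tps_upper_bound(BANDWIDTH, PARAMS, 2.0)  # 2 bytes per weight
one_byte = decode_tps_upper_bound(BANDWIDTH, PARAMS, 1.0)  # 1 byte per weight

print(f"2-byte weights: at most about {two_byte:.0f} tokens per second")
print(f"1-byte weights: at most about {one_byte:.0f} tokens per second")
```

In this simplified picture, halving the bytes per weight or doubling the bandwidth doubles the ceiling, while adding memory capacity changes nothing, which matches the first item in the list that follows.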

Does not raise throughput

  • A bigger memory pool on its own; capacity is not speed
  • Cutting latency, which is a different measurement
  • More requests than the hardware can actually feed
  • Reporting one number while ignoring how it was measured

Reviewed: June 2026