What is throughput?
Throughput is the rate at which a model produces output, almost always counted in tokens per second. If a reply is two hundred tokens long and the system runs at fifty tokens per second, the body of that reply takes about four seconds to stream out. Throughput is the number you feel on a long answer: higher means the text finishes sooner. There are two senses worth keeping apart: the speed of a single request, and the total tokens per second a server pushes across all the requests it is serving at once.
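The arithmetic above can be written down as a small sketch. The function names are made up for illustration, and the second function assumes a simplified model in which a server's total throughput is just the sum of the rates of the requests in flight.

```python
def stream_seconds(output_tokens: int, tokens_per_second: float) -> float:
    """Time to stream the body of a reply at a steady per-request rate."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return output_tokens / tokens_per_second


def aggregate_tokens_per_second(per_request_rates: list[float]) -> float:
    """Total server throughput, assuming it is the sum of all in-flight request rates."""
    return sum(per_request_rates)


# The example from the text: 200 tokens at 50 tokens per second.
print(stream_seconds(200, 50))  # 4.0

# Hypothetical load: 8 requests, each streaming at 20 tokens per second.
print(aggregate_tokens_per_second([20.0] * 8))  # 160.0
```

In the second case each user sees 20 tokens per second while the server as a whole reports 160, which is why the two senses should not be mixed.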
How is it different from latency?
Latency is the wait before anything happens: the time to the first token. Throughput is the speed after that, while tokens are flowing. The two can move in opposite directions. Batching many requests together usually raises total throughput because the hardware stays busy, but it can also raise the latency any single request sees while it waits its turn. A server tuned for one is not automatically good at the other.
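One way to see how the two numbers combine is to treat the total time for a reply as the time to first token plus the streaming time. The sketch below does that for two setups. All figures are invented for illustration, not measurements, and the function name is hypothetical.

```python
def response_seconds(ttft_seconds: float, output_tokens: int,
                     tokens_per_second: float) -> float:
    """Total time for one reply: wait for the first token, then stream the rest."""
    return ttft_seconds + output_tokens / tokens_per_second


# Illustrative numbers only (assumed, not measured).
# Unbatched: one request at a time, short wait, fast stream.
unbatched = response_seconds(ttft_seconds=0.2, output_tokens=200, tokens_per_second=50)
# Batched: eight requests share the hardware, longer wait, slower stream each.
batched = response_seconds(ttft_seconds=1.0, output_tokens=200, tokens_per_second=30)

print(f"unbatched reply: {unbatched:.1f} s, server total: {1 * 50} tokens/s")
print(f"batched reply:   {batched:.1f} s, server total: {8 * 30} tokens/s")
```

Under these assumed numbers the batched server moves almost five times as many tokens per second in total, yet each individual reply takes longer to arrive.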
How do you read a throughput claim?
A throughput figure without its conditions is close to meaningless. Was it one request or a hundred? What prompt length, what output length, what hardware? The same model honestly produces very different numbers under different load. When you see tokens per second quoted, the right reflex is to ask how it was measured before you compare it to anything.
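The questions above can be turned into a simple habit: record the conditions next to the number, and compare two figures only when the conditions match. This is a minimal sketch of that idea. The class, its fields, and the hardware label are hypothetical, and a real benchmark report would likely track more settings than these.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ThroughputClaim:
    """A tokens-per-second figure together with the conditions it was measured under."""
    tokens_per_second: float
    concurrent_requests: int
    prompt_tokens: int
    output_tokens: int
    hardware: str

    def conditions(self) -> tuple:
        """Everything except the headline number."""
        return (self.concurrent_requests, self.prompt_tokens,
                self.output_tokens, self.hardware)


def comparable(a: ThroughputClaim, b: ThroughputClaim) -> bool:
    """Two claims are comparable only if they were measured under the same conditions."""
    return a.conditions() == b.conditions()


# Hypothetical claims: same model and hardware, different load.
single = ThroughputClaim(50.0, concurrent_requests=1, prompt_tokens=512,
                         output_tokens=200, hardware="example-gpu")
loaded = ThroughputClaim(240.0, concurrent_requests=8, prompt_tokens=512,
                         output_tokens=200, hardware="example-gpu")

print(comparable(single, loaded))  # False
print(comparable(single, single))  # True
```

The two invented claims differ only in how many requests were in flight, and that alone is enough to make a direct comparison misleading.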