Latency: the wait before output starts

Latency is the delay between sending a request and getting a response. For a chat model the headline figure is time to first token (TTFT): how long you wait before any output appears. It is distinct from throughput, which measures how fast tokens come out once they start.

What is latency?

Latency is the time between asking and getting an answer. For an interactive model the figure that matters most is time to first token (TTFT): how long you sit looking at nothing before the first word appears. After that point a separate number takes over, the rate at which the rest of the tokens stream out. Low latency is what makes a model feel responsive even when the full answer is long, because the wait that annoys people is the silent one at the start.
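The two numbers described above can be separated with a few lines of arithmetic. The sketch below assumes you have already recorded a timestamp when the request was sent and one timestamp per arriving token; the names `StreamTiming` and `summarize_stream` are made up for this illustration and are not part of any particular client library.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class StreamTiming:
    ttft_s: float        # time to first token: the silent wait at the start
    tokens_per_s: float  # streaming rate once output has begun
    total_s: float       # request sent -> last token received


def summarize_stream(request_sent_at: float,
                     token_arrivals: Sequence[float]) -> StreamTiming:
    """Split one streamed response into TTFT and streaming rate.

    Both arguments are timestamps in seconds from the same clock
    (for example, values read from time.perf_counter()).
    """
    if not token_arrivals:
        raise ValueError("need at least one token arrival time")
    first, last = token_arrivals[0], token_arrivals[-1]
    streaming_time = last - first
    # Rate is measured over the tokens that came after the first one.
    rate = (len(token_arrivals) - 1) / streaming_time if streaming_time > 0 else 0.0
    return StreamTiming(ttft_s=first - request_sent_at,
                        tokens_per_s=rate,
                        total_s=last - request_sent_at)


# Illustrative numbers only: half a second of silence, then steady streaming.
timing = summarize_stream(0.0, [0.5, 0.75, 1.0, 1.25, 1.5])
print(timing)
```

Keeping the two figures apart matters because they are felt differently: `ttft_s` is the stretch with nothing on screen, while `tokens_per_s` only describes what happens after output has begun.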

Why does prompt length matter?

Before a model can produce its first token, it has to read your entire prompt. That pass is called prefill, and it scales with how many input tokens you sent. A short question starts answering almost at once. A prompt with a long document pasted in front of it makes you wait while the model works through all of it first. This is why “the model got slow” is often really “the prompt got long”. If part of the prompt was sent before, a cached prefix can skip that work and cut the wait.
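As a rough back-of-the-envelope illustration of that point, the sketch below treats prefill time as proportional to the number of prompt tokens that still have to be read, and treats a cached prefix as work that is skipped entirely. Both are simplifying assumptions, and the function name and the rate of 4,000 tokens per second are invented for the example rather than measured from any real system.

```python
def estimate_prefill_seconds(prompt_tokens: int,
                             cached_prefix_tokens: int,
                             prefill_tokens_per_s: float) -> float:
    """Estimate the prefill part of time to first token.

    Assumes (for illustration only) that prefill cost grows linearly
    with the tokens that are not already covered by a cached prefix.
    """
    if not 0 <= cached_prefix_tokens <= prompt_tokens:
        raise ValueError("cached prefix must be between 0 and the prompt length")
    if prefill_tokens_per_s <= 0:
        raise ValueError("prefill rate must be positive")
    uncached_tokens = prompt_tokens - cached_prefix_tokens
    return uncached_tokens / prefill_tokens_per_s


# Hypothetical numbers: an 8,000-token prompt at 4,000 tokens/s of prefill.
cold = estimate_prefill_seconds(8000, 0, 4000.0)     # nothing cached
warm = estimate_prefill_seconds(8000, 6000, 4000.0)  # 6,000-token prefix cached
print(f"cold start: {cold:.2f}s, with cached prefix: {warm:.2f}s")
```

Real systems will not follow a straight line this neatly, but the shape of the result is the one the text describes: the wait tracks the amount of prompt that has to be read fresh.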

How does it trade against throughput?

Latency and throughput are different measurements, and tuning for one can hurt the other. Packing many requests into a batch raises total throughput but can make any single request wait longer in the queue, which raises its latency. There is no single “fast”; there is fast to start and fast to finish, and you choose which one your workload cares about.
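A toy model can make the batching trade-off concrete. The sketch below assumes requests arrive at a fixed interval, the server waits until a batch is full, and each batch costs a fixed overhead plus a small per-request cost. All of these are assumptions chosen for illustration, as are the names and numbers; it ignores queue build-up and is not a description of how any particular serving stack schedules work.

```python
from typing import NamedTuple


class BatchResult(NamedTuple):
    throughput_rps: float   # requests completed per second of compute
    worst_latency_s: float  # wait seen by the first request in a batch


def batch_tradeoff(batch_size: int,
                   arrival_gap_s: float,
                   overhead_s: float,
                   per_request_s: float) -> BatchResult:
    """Toy model of batching: higher throughput, longer individual wait."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    # The first request to arrive waits for the rest of the batch to show up.
    fill_wait_s = (batch_size - 1) * arrival_gap_s
    # One batch step: fixed overhead shared by all requests, plus per-request work.
    step_s = overhead_s + per_request_s * batch_size
    return BatchResult(throughput_rps=batch_size / step_s,
                       worst_latency_s=fill_wait_s + step_s)


# Hypothetical workload: one request per second, 0.5s overhead, 0.125s per request.
for size in (1, 8):
    result = batch_tradeoff(size, arrival_gap_s=1.0, overhead_s=0.5, per_request_s=0.125)
    print(size, result)
```

With these made-up numbers the larger batch completes more requests per second of compute, yet the first request in each batch waits far longer, which is the "fast to finish" versus "fast to start" choice in miniature.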

