Latency: the wait before output starts

Latency is the delay between sending a request and getting a response. For a chat model the headline figure is time to first token (TTFT): how long you wait before any output appears. It is distinct from throughput, which measures how fast tokens come out once they start.

At a glance

What it is: The delay before a response arrives
Headline measure: Time to first token (TTFT)
Why it matters: It is the wait that makes a system feel slow or snappy
Not the same as: Throughput, the speed once tokens are flowing

Where the wait goes before the first token

Latency is everything that happens before output begins. When the system is not heavily loaded, most of it is prefill: the model reading your whole prompt. A longer prompt means a longer wait.

1. Queue: the request waits its turn behind others
2. Prefill: the model reads the whole prompt; this step grows with prompt length
3. First token: output begins; this moment is the time to first token (TTFT)

What is latency?

Latency is the time between asking and getting an answer. For an interactive model the figure that matters most is time to first token (TTFT): how long you sit looking at nothing before the first word appears. After that point a separate number takes over, the rate at which the rest of the tokens stream out. Low latency is what makes a model feel responsive even when the full answer is long, because the wait that annoys people is the silent one at the start.
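The two numbers described above can be taken from the same stream with one timer. The sketch below is a minimal illustration, not any particular provider's API: `measure_stream` accepts any iterable of tokens, and `fake_stream` is a hypothetical stand-in for a model that stays silent for a while and then emits tokens at a steady pace.

```python
import time
from typing import Iterable, Iterator, Tuple


def measure_stream(tokens: Iterable[str]) -> Tuple[float, float, int]:
    """Return (ttft_seconds, tokens_per_second_after_first, token_count)."""
    start = time.perf_counter()
    first_at = None
    count = 0
    for _ in tokens:
        if first_at is None:
            # The silent wait ends the moment the first token arrives.
            first_at = time.perf_counter()
        count += 1
    end = time.perf_counter()
    if first_at is None:
        raise ValueError("stream produced no tokens")
    ttft = first_at - start
    streaming_time = end - first_at
    # Generation rate only counts tokens that arrived after the first one.
    if count > 1 and streaming_time > 0:
        rate = (count - 1) / streaming_time
    else:
        rate = 0.0
    return ttft, rate, count


def fake_stream(wait: float, gap: float, n: int) -> Iterator[str]:
    """Hypothetical model: silent for `wait` seconds, then one token per `gap`."""
    time.sleep(wait)
    for i in range(n):
        if i:
            time.sleep(gap)
        yield f"tok{i}"


if __name__ == "__main__":
    ttft, rate, count = measure_stream(fake_stream(wait=0.2, gap=0.02, n=20))
    print(f"TTFT {ttft:.3f}s, then {rate:.1f} tokens/s over {count} tokens")
```

With a real client, the same function could wrap whatever iterator yields streamed tokens, as long as the timer starts before the request is sent.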

Why does prompt length matter?

Before a model can produce its first token, it has to read your entire prompt. That pass is called prefill, and its cost grows with the number of input tokens you sent. A short question starts answering almost at once. A prompt with a long document pasted in front of it makes you wait while the model works through all of it first. This is why “the model got slow” is often really “the prompt got long”. If the beginning of the prompt was sent before, a cached prefix lets the model skip that part of prefill, which cuts the wait.
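A back-of-the-envelope model makes the effect of prompt length and a cached prefix concrete. This is a rough sketch under stated assumptions: prefill cost is treated as linear in the number of uncached tokens (a simplification), and the default rate of 5,000 tokens per second is an invented figure for illustration, not a measurement of any system. The function and parameter names are hypothetical.

```python
def estimate_ttft(
    prompt_tokens: int,
    cached_prefix_tokens: int = 0,
    queue_seconds: float = 0.0,
    prefill_tokens_per_second: float = 5000.0,  # illustrative assumption
) -> float:
    """Estimate time to first token as queue wait plus prefill of uncached tokens."""
    if not 0 <= cached_prefix_tokens <= prompt_tokens:
        raise ValueError("cached prefix must be between 0 and the prompt length")
    if prefill_tokens_per_second <= 0:
        raise ValueError("prefill rate must be positive")
    # Only tokens outside the cached prefix still need to be read.
    uncached = prompt_tokens - cached_prefix_tokens
    return queue_seconds + uncached / prefill_tokens_per_second


if __name__ == "__main__":
    print(estimate_ttft(200))                                 # short question
    print(estimate_ttft(20_000))                              # long pasted document
    print(estimate_ttft(20_000, cached_prefix_tokens=19_000)) # same, mostly cached
```

Under these made-up numbers the short question waits about 0.04 seconds, the long document about 4 seconds, and the mostly cached version about 0.2 seconds. The absolute values mean nothing; the ratio between them is the point.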

How does it trade against throughput?

Latency and throughput are different measurements, and tuning for one can hurt the other. Packing many requests into a batch raises total throughput but can make any single request wait longer in the queue, which raises its latency. There is no single “fast”; there is fast to start and fast to finish, and you choose which one your workload cares about.
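The batching trade-off can be seen in a toy calculation. The sketch below assumes an idealized server that waits until a batch is full, then processes the whole batch in a fixed time regardless of its size, with requests arriving at a steady interval. Real schedulers are more sophisticated, and every name and number here is hypothetical.

```python
from typing import Tuple


def batch_tradeoff(
    batch_size: int,
    arrival_gap_seconds: float,
    batch_seconds: float,
) -> Tuple[float, float]:
    """Return (worst_fill_wait_seconds, requests_per_second) for a toy batcher.

    Assumptions: requests arrive every `arrival_gap_seconds`, the server starts
    only when the batch is full, and a batch takes `batch_seconds` at any size.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    if arrival_gap_seconds < 0 or batch_seconds <= 0:
        raise ValueError("arrival gap must be >= 0 and batch time must be > 0")
    # The earliest request in a batch waits for the other slots to fill.
    worst_fill_wait = (batch_size - 1) * arrival_gap_seconds
    # More requests finish per processing pass, so capacity rises.
    requests_per_second = batch_size / batch_seconds
    return worst_fill_wait, requests_per_second


if __name__ == "__main__":
    for size in (1, 4, 8):
        wait, capacity = batch_tradeoff(size, arrival_gap_seconds=0.05, batch_seconds=0.5)
        print(f"batch={size}: extra wait up to {wait:.2f}s, capacity {capacity:.0f} req/s")
```

In this model, going from a batch of 1 to a batch of 8 multiplies capacity by eight while adding up to about 0.35 seconds of queue wait for the first request in each batch: better throughput, worse latency.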

Lowers latency

  • A shorter prompt, since prefill reads every input token
  • Less queue contention from other requests
  • A reused (cached) prefix so part of prefill is skipped
  • Faster compute for the prefill pass

Does not lower latency

  • Raising throughput, which is a separate measurement
  • Heavy batching, which can add queue wait per request
  • A bigger memory pool by itself
  • A faster generation rate, which acts after the first token

Reviewed: June 2026