What is latency?
Latency is the time between asking and getting an answer. For an interactive model the figure that matters most is time to first token (TTFT): how long you sit looking at nothing before the first word appears. After that point a separate number takes over, the rate at which the rest of the tokens stream out. Low latency is what makes a model feel responsive even when the full answer is long, because the wait that annoys people is the silent one at the start.
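The two numbers can be measured separately. Here is a minimal sketch, assuming the tokens arrive through some Python iterable; `measure_stream` and its injectable `clock` parameter are names invented for this illustration, not part of any particular client library.

```python
import time
from typing import Callable, Iterable, Optional, Tuple


def measure_stream(
    tokens: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[Optional[float], float, int]:
    """Return (ttft_seconds, tokens_per_second_after_first, token_count).

    `tokens` is any iterable that yields tokens as they arrive. The clock is
    injectable so the function can be checked with scripted timestamps.
    """
    start = clock()          # the silent wait begins here
    first_at = None
    last_at = start
    count = 0
    for _ in tokens:
        last_at = clock()    # arrival time of this token
        if first_at is None:
            first_at = last_at
        count += 1
    if first_at is None:
        return None, 0.0, 0  # nothing arrived, so there is no TTFT to report
    ttft = first_at - start
    streaming_time = last_at - first_at
    # The rate counts only tokens after the first, over the time they took.
    rate = (count - 1) / streaming_time if streaming_time > 0 else 0.0
    return ttft, rate, count
```

With scripted timestamps of 0.0, 2.0, 2.5 and 3.0 seconds for three tokens, this returns a TTFT of 2.0 seconds and a streaming rate of 2.0 tokens per second. The two figures are independent: a long TTFT says nothing about how fast the rest will stream.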
Why does prompt length matter?
Before a model can produce its first token, it has to read your entire prompt. That pass is called prefill, and its cost grows with the number of input tokens you sent. A short question starts answering almost at once. A prompt with a long document pasted in front of it makes you wait while the model works through all of it first. This is why “the model got slow” is often really “the prompt got long”. If the beginning of the prompt matches one that was sent before, a cached prefix lets the model skip that part of the work and cuts the wait.
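A back-of-the-envelope version of this can be written down. The sketch below assumes prefill cost is linear in the number of uncached tokens, plus a fixed overhead. That is a simplification, and the default rate and overhead are made-up illustrative numbers, not measurements of any real system. `estimate_ttft` is a hypothetical helper.

```python
def estimate_ttft(
    prompt_tokens: int,
    cached_prefix_tokens: int = 0,
    prefill_tokens_per_s: float = 5000.0,  # assumed rate, for illustration only
    overhead_s: float = 0.2,               # assumed fixed cost per request
) -> float:
    """Toy estimate of time to first token, in seconds.

    Only the tokens that are not covered by a cached prefix need prefill.
    """
    if prompt_tokens < 0 or cached_prefix_tokens < 0:
        raise ValueError("token counts must be non-negative")
    uncached = max(prompt_tokens - cached_prefix_tokens, 0)
    return overhead_s + uncached / prefill_tokens_per_s
```

Under these assumed numbers, a 50-token question comes out at about 0.21 seconds and a 50,000-token prompt at about 10.2 seconds. If 45,000 of those tokens are a cached prefix, the estimate drops to about 1.2 seconds. The model itself did not change between the three cases, only the amount of prompt it had to read.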
How does it trade against throughput?
Latency and throughput are different measurements, and tuning for one can hurt the other. Latency is how long a single request waits; throughput is how much work the system completes per second across all requests. Packing many requests into a batch raises total throughput but can make any single request wait longer in the queue, which raises its latency. There is no single “fast”; there is fast to start and fast to finish, and you choose which one your workload cares about.
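The batching trade-off shows up even in a toy model. The sketch below assumes requests arrive at a steady interval, the server waits until a batch is full, and a batch costs a fixed amount plus a small amount per request. `batch_stats` and all of its default numbers are hypothetical, chosen only to make the shape of the trade-off visible.

```python
from typing import Tuple


def batch_stats(
    batch_size: int,
    arrival_gap_s: float = 0.05,       # assumed time between arriving requests
    fixed_cost_s: float = 0.5,         # assumed cost of running any batch
    per_request_cost_s: float = 0.01,  # assumed extra cost per request in it
) -> Tuple[float, float]:
    """Return (requests_per_second, worst_case_latency_seconds).

    Throughput is the rate while a batch is being processed. The worst-case
    latency belongs to the first request in a batch: it waits for the other
    batch_size - 1 requests to arrive, then for the whole batch to finish.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batch_time = fixed_cost_s + per_request_cost_s * batch_size
    throughput = batch_size / batch_time
    queue_wait = (batch_size - 1) * arrival_gap_s
    return throughput, queue_wait + batch_time
```

With these assumed costs, a batch size of 1 gives roughly 2 requests per second at 0.51 seconds of latency, while a batch size of 16 gives roughly 24 requests per second but a worst-case latency of 1.41 seconds. Total throughput went up by about twelve times, and the unluckiest request waited almost three times as long.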