Context window: how much a model can read at once

The context window is the maximum amount of text, measured in tokens, that a model can take into account at once. Everything the model sees for a single request has to fit inside it: your prompt, any material you retrieved and pasted in, and the model's own answer as it is written. When the total would exceed the window, something has to be dropped or the request fails.

At a glance

  • What it is: The most text a model can consider at once, counted in tokens
  • What counts against it: Your prompt, retrieved material, and the model's own output, all together
  • The catch: Bigger costs more, and models attend worse to the middle of a long one
  • Better default: A short, curated context usually beats a full one
Stack: what has to fit inside the window

Everything the model sees for one request shares the same token budget. Your prompt, the material you retrieved, and the answer being written all draw from it; whatever is left over is the room for the reply. The layers, from top to bottom:

  4. Room left for the reply (keep this above zero): what is free; fill the window and the answer gets cut short
  3. The model's own output: the answer counts against the same budget as it is written
  2. Retrieved material: documents or context you pasted in, the part that balloons
  1. System and user prompt: your instructions and question, the fixed part

What has to fit inside it?

A model does not have a memory of your conversation in any human sense. For each request it reads a block of text, all at once, and writes a reply. The context window is how big that block is allowed to be, counted in tokens (the small word-pieces a model reads, not characters or whole words). Everything the model takes into account shares that one budget: the system prompt, your question, any documents you retrieved and pasted in, and the answer itself as it is generated. A 32,000-token window does not mean 32,000 tokens of input plus a free reply. The reply comes out of the same 32,000.
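The shared budget is plain arithmetic. Here is a minimal sketch; the function name and the prompt and retrieval sizes are invented for illustration, and only the 32,000-token window comes from the text above.

```python
def room_for_reply(window: int, prompt_tokens: int, retrieved_tokens: int) -> int:
    """Tokens left for the answer once the input has been counted."""
    return window - prompt_tokens - retrieved_tokens


# A 32,000-token window with a 500-token prompt and 30,000 tokens of
# retrieved material leaves only 1,500 tokens for the reply.
print(room_for_reply(32_000, 500, 30_000))  # 1500

# Over-filling the input drives the result below zero: the request cannot fit.
print(room_for_reply(32_000, 500, 32_000))  # -500
```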

That is the part newcomers miss. Pour a large document set into the input and the model has less room left to answer; push far enough and the reply gets cut off mid-sentence, or the request is rejected before it starts.

Why is a bigger window not simply better?

Two reasons, and both push the other way.

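The first is cost. The longer the context, the more the key-value (KV) cache grows, the more memory the request holds, and the slower the first token arrives, because the whole input has to be processed before any output is produced. The cost follows the tokens you actually send, so a habit of filling the window makes every request expensive, not just the occasional big one.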
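To get a feel for the memory side, here is a rough estimate of KV-cache size for one request. It is a sketch under stated assumptions: a standard transformer that stores one key and one value vector per layer, per KV head, per token. The model shape (32 layers, 8 KV heads, head size 128, 2 bytes per value) is hypothetical and not taken from any particular model.

```python
def kv_cache_bytes(tokens: int, layers: int, kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """Approximate KV-cache size for one sequence.

    The factor of 2 is one key plus one value per layer, head, and token.
    """
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


# Hypothetical model shape, chosen only for illustration.
shape = dict(layers=32, kv_heads=8, head_dim=128, bytes_per_value=2)

for tokens in (2_000, 32_000):
    gib = kv_cache_bytes(tokens, **shape) / 2**30
    print(f"{tokens:>6} tokens -> {gib:.2f} GiB")
```

Under these assumptions the cache grows linearly with context length: 16 times the tokens holds 16 times the memory for that one request.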

The second is quality. Models attend worse to the middle of a very long context than to its start or its end. This is the lost-in-the-middle effect: the line that answers the question can be sitting right there, buried halfway down a wall of retrieved text, and the model glides past it. A full window is often a worse context than a short, curated one that contains only what the question needs.
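One way to work with this effect is to order retrieved passages so the strongest ones sit at the edges of the context. The sketch below assumes the passages arrive already ranked from most to least relevant. The function name is hypothetical, and the interleaving scheme is one simple choice rather than a prescribed method.

```python
def order_for_edges(ranked: list[str]) -> list[str]:
    """Place the best-ranked passages at the start and end of the context.

    `ranked` is assumed to be sorted from most to least relevant.
    Even-indexed items fill the front, odd-indexed items fill the back
    in reverse, so the weakest passages end up in the middle.
    """
    front = ranked[0::2]
    back = ranked[1::2]
    return front + back[::-1]


passages = ["best", "second", "third", "fourth", "weakest"]
print(order_for_edges(passages))
# ['best', 'third', 'weakest', 'fourth', 'second']
```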

So the working rule is the opposite of the instinct. The skill is not fitting as much as possible into the window. It is choosing the little that belongs there and leaving the rest out.

Check it yourself

curl -s localhost:30001/v1/models

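Read the max_model_len field on the model's entry in the response: that number is the context window, in tokens, that the served model was started with. It is the ceiling every request shares. vLLM, for example, reports this field; a server that omits it does not expose its window this way.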
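The same check can be sketched in Python. It assumes the endpoint returns an OpenAI-style model list whose entries carry a max_model_len field, as described above; entries without the field map to None. The helper names, the sample payload, and the model id are invented for illustration.

```python
import json
from urllib.request import urlopen


def context_windows(payload: dict) -> dict:
    """Map each served model id to its max_model_len (None if absent)."""
    return {m.get("id"): m.get("max_model_len") for m in payload.get("data", [])}


def fetch_context_windows(base_url: str = "http://localhost:30001") -> dict:
    """Query /v1/models on a running server and extract the windows."""
    with urlopen(f"{base_url}/v1/models") as response:
        return context_windows(json.load(response))


# Offline example with an invented payload; call fetch_context_windows()
# against a live server instead.
sample = {"object": "list", "data": [{"id": "example-model", "max_model_len": 32000}]}
print(context_windows(sample))  # {'example-model': 32000}
```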

Do

  • Put the few passages that actually answer the question into the context
  • Count tokens, not characters, so you know how close to the ceiling you are
  • Keep the most important material near the start or the end, not buried in the middle
  • Trim retrieved chunks before you paste them, so the answer still has room
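The trimming step can be sketched as a greedy pass that keeps ranked chunks until the input budget is spent, holding back a reserve for the reply. Everything here is illustrative: the function and parameter names are hypothetical, and count_tokens defaults to a whitespace split only so the example runs without dependencies. In practice, pass in the tokenizer of the model you are serving, since real token counts differ.

```python
from typing import Callable


def whitespace_count(text: str) -> int:
    """Stand-in counter (NOT a real tokenizer): whitespace-separated words."""
    return len(text.split())


def fit_chunks(ranked_chunks: list[str], window: int, prompt_tokens: int,
               reply_reserve: int,
               count_tokens: Callable[[str], int] = whitespace_count) -> list[str]:
    """Keep the highest-ranked chunks that fit, leaving room for the reply."""
    budget = window - prompt_tokens - reply_reserve
    kept = []
    for chunk in ranked_chunks:
        cost = count_tokens(chunk)
        if cost > budget:
            continue  # skip what does not fit; a smaller chunk may still fit
        kept.append(chunk)
        budget -= cost
    return kept


chunks = ["alpha beta gamma", "one two three four five six", "x y"]
print(fit_chunks(chunks, window=20, prompt_tokens=5, reply_reserve=8))
# ['alpha beta gamma', 'x y']
```

Reserving the reply budget first, then spending what remains on retrieved material, keeps the answer from being the part that gets squeezed.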

Don't

  • Pour a whole document set in and hope the model finds the relevant line
  • Assume a longer window is always better; it costs more, and the model attends worse to the middle
  • Forget that the model's own answer eats from the same budget as the input
  • Treat the window as free; every extra token is more memory and more latency

Reviewed: June 2026