The context window is the maximum amount of text, measured in tokens, that a model can take into account at once. Everything the model sees for a single request has to fit inside it: your prompt, any material you retrieved and pasted in, and the model's own answer as it is written. When the total would exceed the window, something has to be dropped or the request fails.
At a glance
What it is: The most text a model can consider at once, counted in tokens
What counts against it: Your prompt, retrieved material, and the model's own output, all together
The catch: Bigger costs more, and models attend worse to the middle of a long one
Better default: A short, curated context usually beats a full one
Stack
What has to fit inside the window
Everything the model sees for one request shares the same token budget. Your prompt, the material you retrieved, and the answer being written all draw from it; the top band, layer 4, is the room left for the reply.
4. Room left for the reply (keep this above zero): what is free; fill the window and the answer gets cut short
3. The model's own output: the answer counts against the same budget as it is written
2. Retrieved material: documents or context you pasted in, the part that balloons
1. System and user prompt: your instructions and question, the fixed part
What has to fit inside it?
A model does not have a memory of your conversation in any human sense. For each
request it reads a block of text, all at once, and writes a reply. The context
window is how big that block is allowed to be, counted in tokens (the small
word-pieces a model reads, not characters or whole words). Everything the model
takes into account shares that one budget: the system prompt, your question, any
documents you retrieved and pasted in, and the answer itself as it is generated.
A 32,000-token window does not mean 32,000 tokens of input plus a free reply. The
reply comes out of the same 32,000.
That is the part newcomers miss. Pour a large document set into the input and the
model has less room left to answer; push far enough and the reply gets cut off
mid-sentence, or the request is rejected before it starts.
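The arithmetic behind this is simple enough to write down. A minimal sketch, assuming you already have token counts for each part; the function name and the example numbers are illustrative and not taken from any particular library:

```python
def room_for_reply(window: int, prompt_tokens: int, retrieved_tokens: int) -> int:
    """Tokens left for the answer once the input has been counted.

    A negative result means the input alone overflows the window.
    """
    return window - (prompt_tokens + retrieved_tokens)


# Hypothetical token counts against a 32,000-token window.
window = 32_000
light = room_for_reply(window, prompt_tokens=500, retrieved_tokens=6_000)
heavy = room_for_reply(window, prompt_tokens=500, retrieved_tokens=31_000)

print(light)  # 25500: plenty of room for the reply
print(heavy)  # 500: the reply is likely to be cut short
```

The same 500-token prompt leaves very different room depending on how much retrieved material rides along with it.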
Why is a bigger window not simply better?
Two reasons, and both push the other way.
The first is cost. The longer the context, the larger the key-value (KV) cache
grows, the more memory the request holds, and the slower the first token arrives,
because the whole input has to be processed before generation starts. Fill a long
window by default and you pay that cost on every request, not just the ones that
need it.
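To get a feel for the memory side, here is a back-of-the-envelope estimate of KV cache size. It is a sketch under stated assumptions: standard attention that stores one key and one value vector per layer, per KV head, per token, in 16-bit precision, with no cache quantization or allocator overhead. The model dimensions below are hypothetical and chosen only for illustration.

```python
def kv_cache_bytes(n_tokens: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """Approximate KV cache size for one request.

    The leading 2 accounts for storing both a key and a value vector.
    """
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * n_tokens


# Hypothetical model shape: 32 layers, 8 KV heads, head dimension 128, fp16.
for n_tokens in (4_000, 32_000):
    size = kv_cache_bytes(n_tokens, n_layers=32, n_kv_heads=8, head_dim=128)
    print(f"{n_tokens:>6} tokens -> {size / 2**30:.2f} GiB")
```

Under these assumptions each token costs 128 KiB of cache, so the estimate grows linearly with context length: roughly 0.49 GiB at 4,000 tokens and 3.91 GiB at 32,000.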
The second is quality. Models attend worse to the middle of a very long context
than to its start or its end. This is the lost-in-the-middle effect: the line
that answers the question can be sitting right there, buried halfway down a wall
of retrieved text, and the model glides past it. A full window is often a worse
context than a short, curated one that contains only what the question needs.
So the working rule is the opposite of the instinct. The skill is not fitting as
much as possible into the window. It is choosing the little that belongs there
and leaving the rest out.
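Once the selection is made, placement still matters because of the lost-in-the-middle effect. One possible ordering is sketched below. It assumes the chunks arrive sorted most relevant first; the function name is hypothetical, and this is one heuristic among several rather than an established method.

```python
def place_at_edges(chunks_by_relevance: list[str]) -> list[str]:
    """Reorder chunks so the most relevant sit at the start and the end.

    Chunks are dealt alternately to a front pile and a back pile, then the
    back pile is reversed, which leaves the least relevant in the middle.
    """
    front: list[str] = []
    back: list[str] = []
    for i, chunk in enumerate(chunks_by_relevance):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]


ranked = ["A", "B", "C", "D", "E"]  # A is most relevant, E least
print(place_at_edges(ranked))  # ['A', 'C', 'E', 'D', 'B']
```

The best chunk opens the context, the second best closes it, and the weakest lands in the middle, where a miss costs least.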
Check it yourself
curl -s localhost:30001/v1/models
Read the max_model_len field in the response: that number is the context window, in tokens, that the served model was started with. It is the ceiling every request shares.
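The same check can be scripted. The sketch below assumes the endpoint returns an OpenAI-style list, with a `data` array whose entries carry `max_model_len`; that shape is an assumption, so adjust the lookup if your server responds differently. The sample payload and model id are made up for illustration.

```python
import json
import urllib.request


def context_window(models_payload: dict, index: int = 0) -> int:
    """Pull max_model_len out of a parsed /v1/models response."""
    return int(models_payload["data"][index]["max_model_len"])


def fetch_models(base_url: str = "http://localhost:30001") -> dict:
    """GET /v1/models and parse the JSON body (needs the server running)."""
    with urllib.request.urlopen(f"{base_url}/v1/models") as resp:
        return json.load(resp)


# Hypothetical response body, trimmed to the fields used here.
sample = {"object": "list", "data": [{"id": "example-model", "max_model_len": 32000}]}
print(context_window(sample))  # 32000
```

Against a live server, `context_window(fetch_models())` would return the ceiling to budget against.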
Do
Put the few passages that actually answer the question into the context
Count tokens, not characters, so you know how close to the ceiling you are
Keep the most important material near the start or the end, not buried in the middle
Trim retrieved chunks before you paste them, so the answer still has room
Don't
Pour a whole document set in and hope the model finds the relevant line
Assume a longer window is always better; it costs more and reads the middle worse
Forget that the model's own answer eats from the same budget as the input
Treat the window as free; every extra token is more memory and more latency
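Several of the items above (count tokens, trim what you retrieve, leave the answer room) combine into a single selection step. A minimal sketch, assuming chunks are already sorted by relevance. All names here are hypothetical, and the whitespace counter is only a crude stand-in: real counts should come from the served model's own tokenizer.

```python
from typing import Callable


def rough_token_count(text: str) -> int:
    # Crude stand-in: whitespace words, not real tokens.
    return len(text.split())


def pack_chunks(chunks_by_relevance: list[str], window: int, prompt_tokens: int,
                reply_reserve: int,
                count_tokens: Callable[[str], int] = rough_token_count) -> list[str]:
    """Greedily keep the most relevant chunks that fit the remaining budget.

    reply_reserve is held back so the answer still has room.
    """
    budget = window - prompt_tokens - reply_reserve
    kept: list[str] = []
    for chunk in chunks_by_relevance:
        cost = count_tokens(chunk)
        if cost > budget:
            continue  # too big for what is left; a later, smaller chunk may fit
        kept.append(chunk)
        budget -= cost
    return kept


chunks = ["alpha beta gamma", "one two three four five six", "x y"]
print(pack_chunks(chunks, window=20, prompt_tokens=5, reply_reserve=8))
# ['alpha beta gamma', 'x y']
```

With 7 tokens of budget after the prompt and the reserve, the six-word chunk is skipped once the first has been kept, and the reply's share is never touched.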