Context window: how much a model can read at once

The context window is the maximum amount of text, measured in tokens, that a model can take into account at once. Everything the model sees for a single request has to fit inside it: your prompt, any material you retrieved and pasted in, and the model's own answer as it is written. When the total would exceed the window, something has to be dropped or the request fails.

At a glance

What it is: The most text a model can consider at once, counted in tokens

What counts against it: Your prompt, retrieved material, and the model's own output, all together

The catch: A bigger window costs more to fill, and models attend worse to the middle of a long context

Better default: A short, curated context usually beats a full one

What has to fit inside it?

A model does not have a memory of your conversation in any human sense. For each request it reads a block of text, all at once, and writes a reply. The context window is how big that block is allowed to be, counted in tokens (the small word-pieces a model reads, not characters or whole words). Everything the model takes into account shares that one budget: the system prompt, your question, any documents you retrieved and pasted in, and the answer itself as it is generated. A 32,000-token window does not mean 32,000 tokens of input plus a free reply. The reply comes out of the same 32,000.

That is the part newcomers miss. Pour a large document set into the input and the model has less room left to answer; push far enough and the reply gets cut off mid-sentence, or the request is rejected before it starts.
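
The shared budget is plain arithmetic, and a minimal sketch makes it concrete. The function names and the 32,000-token window below are illustrative assumptions, not part of any particular API, and whether an oversized request is truncated or rejected depends on the service you call.

```python
def room_for_reply(window: int, input_tokens: int) -> int:
    """Tokens left for the answer once the input is counted against the window."""
    return max(window - input_tokens, 0)


def check_request(window: int, input_tokens: int, reply_tokens: int) -> str:
    """Classify a request by whether input plus reply fit in the one shared budget."""
    if input_tokens >= window:
        return "rejected"   # the input alone fills or exceeds the window
    if input_tokens + reply_tokens > window:
        return "truncated"  # the reply would run out of room partway through
    return "fits"


WINDOW = 32_000  # hypothetical window size, matching the example in the text

print(room_for_reply(WINDOW, 30_500))        # 1500
print(check_request(WINDOW, 30_500, 2_000))  # truncated
```

With 30,500 tokens of input, only 1,500 tokens remain for the answer, so a reply that needs 2,000 would be cut short.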

Why is a bigger window not simply better?

Two reasons, and both push the other way.

The first is cost. The key-value (KV) cache grows with every token in the context, so a longer context means more memory held by the request. The model also has to process the whole prompt before it can emit anything, so the first token arrives later. You pay this on every request that fills the window, not only on the occasional oversized one.
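
To get a feel for the memory side, here is a rough back-of-the-envelope estimate of KV cache size. It assumes a standard transformer that stores one key vector and one value vector per token, per layer. The model shape used below (32 layers, 32 key-value heads, head size 128, 2 bytes per value) is a hypothetical example, not a description of any specific model.

```python
def kv_cache_bytes(tokens: int, layers: int, kv_heads: int, head_dim: int,
                   bytes_per_value: int = 2) -> int:
    """Approximate KV cache size in bytes for a single request.

    Each token stores a key and a value vector (the factor of 2) in every
    layer, and each vector has kv_heads * head_dim entries.
    """
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
    return tokens * per_token


# Hypothetical model shape, chosen only to illustrate the linear growth.
for n in (2_000, 32_000):
    gib = kv_cache_bytes(n, layers=32, kv_heads=32, head_dim=128) / 2**30
    print(f"{n:>6} tokens -> about {gib:.1f} GiB of KV cache")
```

The estimate grows linearly with the number of tokens, so under these assumptions filling a window sixteen times larger holds sixteen times the cache memory for that request.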

The second is quality. Models attend worse to the middle of a very long context than to its start or its end. This is the lost-in-the-middle effect: the line that answers the question can be sitting right there, buried halfway down a wall of retrieved text, and the model glides past it. A full window is often a worse context than a short, curated one that contains only what the question needs.

So the working rule is the opposite of the instinct. The skill is not fitting as much as possible into the window. It is choosing the little that belongs there and leaving the rest out.
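
Once the few relevant passages are chosen, their order still matters because of the lost-in-the-middle effect. One possible approach, sketched here under the assumption that the passages are already ranked from most to least relevant, is to alternate them between the front and the back so the weakest ones end up in the middle. The function name `place_at_edges` is hypothetical.

```python
def place_at_edges(ranked: list[str]) -> list[str]:
    """Reorder passages ranked best-first so the best sit at the start and end.

    Even-ranked items (1st, 3rd, 5th, ...) fill the front in order; odd-ranked
    items (2nd, 4th, ...) fill the back in reverse, so the 2nd-best is last.
    """
    front = ranked[0::2]
    back = ranked[1::2]
    return front + back[::-1]


ranked = ["best", "second", "third", "fourth", "fifth"]
print(place_at_edges(ranked))
# ['best', 'third', 'fifth', 'fourth', 'second']
```

The least relevant passage lands in the middle, where a missed read costs the least.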

Do

Put the few passages that actually answer the question into the context

Count tokens, not characters, so you know how close to the ceiling you are

Keep the most important material near the start or the end, not buried in the middle

Trim retrieved chunks before you paste them, so the answer still has room
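
The counting and trimming advice above can be sketched as a small packing routine. This is an illustration under stated assumptions: `rough_token_count` is a hypothetical stand-in that counts whitespace-separated words and is not a real tokenizer, so in practice you would pass in the counter that matches your model. The chunks are assumed to arrive ranked best-first.

```python
from typing import Callable


def rough_token_count(text: str) -> int:
    """Stand-in counter (assumption): words, NOT real model tokens."""
    return len(text.split())


def pack_chunks(ranked_chunks: list[str], window: int, prompt_tokens: int,
                reply_reserve: int,
                count_tokens: Callable[[str], int] = rough_token_count,
                ) -> tuple[list[str], int]:
    """Keep the best-ranked chunks that fit after reserving room for the reply."""
    budget = window - prompt_tokens - reply_reserve
    kept: list[str] = []
    used = 0
    for chunk in ranked_chunks:
        cost = count_tokens(chunk)
        if used + cost > budget:
            continue  # too big for what is left; a smaller later chunk may fit
        kept.append(chunk)
        used += cost
    return kept, used


chunks = ["one two three", "four five six seven", "eight"]
kept, used = pack_chunks(chunks, window=10, prompt_tokens=2, reply_reserve=3)
print(kept, used)  # ['one two three', 'eight'] 4
```

Reserving the reply's share first means the answer's room is never spent on retrieved text.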

Don't

Pour a whole document set in and hope the model finds the relevant line

Assume a longer window is always better; it costs more and reads the middle worse

Forget that the model's own answer eats from the same budget as the input

Treat the window as free; every extra token is more memory and more latency
