Attention: how a model decides what matters

Attention is the mechanism inside a transformer that lets each token weigh how much every other token in the sequence matters to its own meaning. It is how the model relates a pronoun to the noun it refers to, or pulls context from far back in a prompt. The cost is that relating every token to every other grows quickly as the context gets longer, which is why long contexts are expensive in memory and time.

At a glance

What it is: The step where each token weighs how much the others matter
What it gives you: The ability to relate distant words, not just neighbours
The cost: Relating all tokens grows fast as context gets longer
Where it lives: Inside every transformer; it is the architecture's core trick

What attention does for one token

While processing a single token, the model weighs every other token in the context. The strongest links carry the meaning; the work of computing them all is what makes long context costly.

3. Links to every other token too: all weighed, most weak; this is the cost
2. Links to relevant earlier tokens: the noun a pronoun points back to, say
1. The token being processed: needs to know what it means here

What does attention do?

When a model reads a sequence, every token needs to know what it means in this particular place, not in general. Attention is how it works that out. While processing one token, the model looks at every other token in the context and assigns each a weight: how much does this one matter to me right now? The pronoun “it” leans heavily on the noun it refers to. A word near the end can draw on a detail near the start. The model learned these weightings during training; nobody hand-wrote the rules.

That is the whole intuition. Attention is weighing. It is the mechanism that lets meaning travel across a sequence instead of staying stuck with neighbours, and it is the reason a transformer reads context as well as it does.
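
The weighing described above can be sketched in a few lines. This is a minimal illustration of scaled dot-product attention for a single token, not a production implementation. The vectors are hand-picked toy values (an assumption made so the example runs; a real model learns them during training), and the helper names `softmax` and `attend` are chosen here for illustration.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Weigh every token in the context for one query token."""
    scale = math.sqrt(len(query))
    # One score per token: how well does its key match the query?
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    # The output is the weighted mix of the value vectors.
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, output

# Toy, hand-picked vectors (assumptions); a real model learns these in training.
tokens = ["the", "cat", "sat", "it"]
keys = [[0.1, 0.0], [2.0, 1.0], [0.0, 0.5], [0.3, 0.3]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [0.2, 0.2]]
query_for_it = [2.0, 1.0]  # chosen so the query for "it" aligns with the key for "cat"

weights, output = attend(query_for_it, keys, values)
for token, weight in zip(tokens, weights):
    print(f"{token:>4}: {weight:.2f}")
```

In this toy setup the largest weight lands on "cat", which mirrors the pronoun example: the token being processed leans hardest on the token whose key best matches its query, while every other token still receives a small weight.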

Why does it cost you memory and speed?

Because to weigh every token against every other, the model has to do work for every pairing, and the number of pairings grows with the square of the context length. Double the context and you do not double the attention work; with standard full attention you roughly quadruple it. This is a big part of why a long prompt runs slower and eats more memory, and why the model caches its running state as it generates: this is the key-value cache, which grows with every token of context.
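
A rough back-of-the-envelope sketch makes the growth concrete. It assumes standard full attention (no sparse or windowed variants) and a hypothetical model shape; the layer count, head count, head size, and 2 bytes per stored value are illustrative assumptions, not figures from any particular model.

```python
def attention_pairs(context_length):
    """Token pairings scored by full attention: every token against every token."""
    return context_length * context_length

def kv_cache_bytes(context_length, layers, kv_heads, head_dim, bytes_per_value=2):
    """Approximate key-value cache size for one sequence.

    Two tensors (keys and values) per layer, each holding
    kv_heads * head_dim numbers per token.
    """
    return 2 * layers * kv_heads * head_dim * bytes_per_value * context_length

# Hypothetical model shape, chosen only for illustration.
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128

for n in (4_096, 8_192, 16_384):
    pairs = attention_pairs(n)
    cache_gib = kv_cache_bytes(n, LAYERS, KV_HEADS, HEAD_DIM) / 2**30
    print(f"{n:>6} tokens: {pairs:>12,} pairings, {cache_gib:.2f} GiB cache")
```

Under these assumptions, each doubling of the context quadruples the number of pairings while the cache doubles: the pairings grow with the square of the length, the cache grows in step with it.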

It is also why the attention implementation, the backend, is an operator concern. Different backends compute the same maths at different speeds, and not every one runs on every chip. Picking one that actually supports your hardware can be the difference between a model that serves and a model that dies on the first batch.
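
One way an operator might handle this is to try backends in order of preference and fall back to the first one the hardware supports. The sketch below is illustrative only: the backend names, chip names, and capability table are all hypothetical, and the real support matrix has to come from the documentation of whatever serving stack you run.

```python
# Hypothetical backend names and capability table, for illustration only.
# Real backends and their hardware requirements come from your serving stack's docs.
BACKEND_SUPPORT = {
    "fast_fused": {"chip_gen_b", "chip_gen_c"},
    "memory_efficient": {"chip_gen_a", "chip_gen_b", "chip_gen_c"},
    "reference": {"cpu", "chip_gen_a", "chip_gen_b", "chip_gen_c"},
}

# Preference order: fastest first, most portable last.
PREFERENCE = ["fast_fused", "memory_efficient", "reference"]

def pick_backend(chip, preference=PREFERENCE, support=BACKEND_SUPPORT):
    """Return the first preferred backend that lists this chip as supported."""
    for name in preference:
        if chip in support.get(name, set()):
            return name
    raise ValueError(f"no attention backend supports chip {chip!r}")

print(pick_backend("chip_gen_a"))
```

Failing loudly when nothing matches is a deliberate choice in this sketch: it surfaces an unsupported chip at startup rather than on the first batch.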

What attention buys

  • Relating words far apart, so meaning survives across a long sentence
  • Context, so the same word can mean different things in different places
  • The core capability that makes a transformer good at language

What it costs

  • Work that grows quickly as the context length grows
  • Memory for the running state, which the key-value cache holds
  • Backend choices: which attention implementation runs on your hardware

Reviewed: June 2026