Attention: how a model decides what matters

Attention is the mechanism inside a transformer that lets each token weigh how much every other token in the sequence matters to its own meaning. It is how the model relates a pronoun to the noun it refers to, or pulls context from far back in a prompt. The cost is that the work of relating every token to every other grows faster than the context itself, roughly with the square of its length, which is why long contexts are expensive in memory and time.

What does attention do?

When a model reads a sequence, every token needs to know what it means in this particular place, not in general. Attention is how it works that out. While processing one token, the model looks at the other tokens in the context (in a left-to-right generative model, the ones that came before it) and assigns each a weight: how much does this one matter to me right now? The pronoun “it” leans heavily on the noun it refers to. A word near the end can draw on a detail near the start. The model learned these weightings during training; nobody hand-wrote the rules.
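The weighing described above can be sketched in a few lines. This is a minimal illustration of scaled dot-product attention for a single token, not how any particular model implements it. The function names `softmax` and `attend`, the four-token sentence, and the two-dimensional vectors are all assumptions made up for the example; a real model learns its vectors during training and uses many more dimensions.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Return (weights, blended_value) for one token attending over a context."""
    scale = math.sqrt(len(query))
    # One score per context token: how well does its key match this query?
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    # The output is the weighted average of the context tokens' value vectors.
    dim = len(values[0])
    blended = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, blended

# Toy, hand-picked vectors (hypothetical); chosen so "it" matches "cat" best.
tokens = ["the", "cat", "sat", "it"]
keys = [[0.1, 0.0], [2.0, 1.0], [0.0, 0.5], [0.3, 0.3]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [0.2, 0.8]]
query_for_it = [2.0, 1.0]  # the query vector for the token "it"

weights, blended = attend(query_for_it, keys, values)
for token, weight in zip(tokens, weights):
    print(f"{token:>4}: {weight:.2f}")
```

The largest weight lands on "cat" because its key lines up most closely with the query for "it". The weights always sum to 1, so the result is a blend of the context rather than a hard pick of one token.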

That is the whole intuition. Attention is weighing. It is the mechanism that lets meaning travel across a sequence instead of staying stuck with neighbours, and it is the reason a transformer reads context as well as it does.

Why does it cost you memory and speed?

Because to weigh every token against every other, the model has to do work for all those pairings, and that work grows quickly as the context gets longer. In standard attention the number of pairings grows with the square of the context length, so doubling the context roughly quadruples the attention work rather than doubling it. This is a big part of why a long prompt runs slower and eats more memory. It is also why the model keeps a running state as it generates: the key-value cache, which stores the keys and values of tokens already processed so they are not recomputed, and which grows with context.
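Some back-of-the-envelope arithmetic makes the two growth rates concrete. The sketch below assumes full attention over a single sequence and a key-value cache stored at 16 bits per value. The model shape (32 layers, 8 key-value heads, head dimension 128) is hypothetical and chosen only for illustration, as are the function names. Causal masking roughly halves the pair count, but the growth stays quadratic.

```python
def attention_pairs(context_len):
    """Token pairings scored by full attention, per layer and per head."""
    return context_len * context_len

def kv_cache_bytes(context_len, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Approximate key-value cache size for one sequence.

    The leading 2 covers keys plus values; bytes_per_value=2 assumes
    16-bit storage.
    """
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_value

# Hypothetical model shape, for illustration only.
shape = dict(n_layers=32, n_kv_heads=8, head_dim=128)

for n in (4096, 8192):
    pairs = attention_pairs(n)
    cache_gib = kv_cache_bytes(n, **shape) / 2**30
    print(f"context {n}: {pairs:,} pairings, {cache_gib:.2f} GiB of KV cache")
```

Going from 4096 to 8192 tokens quadruples the pairings but only doubles the cache. Compute is the part that grows quadratically. The cache grows linearly, but it has to stay in memory for the whole generation.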

It is also why the attention implementation, the backend, is an operator concern. Different backends compute the same maths at different speeds, and not every one runs on every chip. Picking one that actually supports your hardware can be the difference between a model that serves and a model that dies on the first batch.
