What does attention do?
When a model reads a sequence, every token needs to know what it means in this particular place, not in general. Attention is how it works that out. While processing one token, the model looks at every other token in the context and assigns each a weight: how much does this one matter to me right now? The pronoun “it” leans heavily on the noun it refers to. A word near the end can draw on a detail near the start. The model learned these weightings during training; nobody hand-wrote the rules.
That is the whole intuition. Attention is weighing. It is the mechanism that lets meaning travel across a sequence instead of staying stuck with neighbours, and it is the reason a transformer reads context as well as it does.
Why does it cost you memory and speed?
Because to weigh every token against every other, the model has to do work for all those pairings, and that work grows quickly as the context gets longer. Double the context and you do not double the attention work, you do more than that. This is a big part of why a long prompt runs slower and eats more memory, and why the running state has to be cached as you generate, the key-value cache that grows with context.
It is also why the attention implementation, the backend, is an operator concern. Different backends compute the same maths at different speeds, and not every one runs on every chip. Picking one that actually supports your hardware can be the difference between a model that serves and a model that dies on the first batch.