Transformer: the architecture behind modern models

A transformer is a neural-network architecture that processes a sequence of tokens by letting each token weigh how much every other token matters to it, a mechanism called attention. That ability to relate distant words to each other, in parallel, is what makes it good at language. Almost every modern large language model is some variant of a transformer.

At a glance

What it is: The neural-network design behind most modern language models
Core idea: Attention, where every token weighs how much every other token matters
Why it won: It relates distant words and processes them in parallel
Where you meet it: Almost any model you run locally is a transformer variant

How a transformer reads a sequence

Tokens go in, each layer lets every token attend to the others to build context, and the model predicts the next token.

  1. Tokens in: the prompt, split into pieces
  2. Attention layers relate the tokens: each token weighs the others to gather context
  3. Next token predicted: appended, then the sequence runs again
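
The loop in step 3 can be sketched in plain Python. This is only an illustration under stated assumptions: `generate` and `toy_next_token` are hypothetical names, and `toy_next_token` is a trivial stand-in for the model (it just counts upward), not a transformer. What the sketch shows is the shape of the flow: predict one token, append it, run the longer sequence again.

```python
from typing import Callable, List, Optional


def generate(prompt_tokens: List[int],
             next_token: Callable[[List[int]], int],
             max_new_tokens: int,
             stop_token: Optional[int] = None) -> List[int]:
    """Repeatedly predict one token and append it to the sequence."""
    tokens = list(prompt_tokens)          # copy so the caller's list is untouched
    for _ in range(max_new_tokens):
        new = next_token(tokens)          # the model sees the whole sequence so far
        tokens.append(new)                # the prediction becomes part of the input
        if stop_token is not None and new == stop_token:
            break                         # stop early on an end-of-sequence token
    return tokens


def toy_next_token(tokens: List[int]) -> int:
    """Hypothetical stand-in for a model: returns the last token plus one, mod 10."""
    return (tokens[-1] + 1) % 10


print(generate([3, 4], toy_next_token, max_new_tokens=4))
```

In a real model, `next_token` would run every layer over the sequence and pick from a probability distribution over the vocabulary; the surrounding loop has the same structure.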

What is a transformer, without the maths?

A transformer is the shape of the network underneath most modern language models. Its defining move is attention: when the model processes a sequence of tokens, it lets each token look at every other token and decide how much each one matters to its own meaning. The word “it” can reach back across a sentence to find the noun it refers to. A token near the end can pull context from a token near the start. The model learns these relationships rather than being told them.
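
For readers who would rather see the weighing than read about it, here is a minimal sketch in plain Python. It is an illustration, not how any particular model implements attention: the names `softmax` and `attend` and the toy vectors are made up for this example. A real transformer first passes each token through learned projections to produce separate queries, keys and values, and runs many attention heads in every layer; here the same toy vectors are used for all three.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)                               # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values):
    """For each query, weigh every key, then mix the values by those weights."""
    d = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Score this token against every token (dot product, scaled by sqrt(d)).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # The token's new representation is a weighted blend of all the values.
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


# Three toy tokens as 2-d vectors (made-up numbers). Tokens 0 and 2 are identical.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
outputs, weights = attend(tokens, tokens, tokens)
for i, row in enumerate(weights):
    print(i, [round(w, 3) for w in row])
```

Each printed row is one token's set of weights over all three tokens. Language models usually also mask out later tokens so that a token can only attend to earlier ones; that step is left out here to keep the sketch short.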

What made the design win over earlier approaches is that, during training and when reading a prompt, it does this for the whole sequence at once, in parallel, instead of crawling left to right one step at a time. (Generation still emits one token at a time, as in the flow above.) That parallelism is what let these models be trained at the scale that produced the capabilities you see today.

Why does the architecture matter to you?

Because its strengths and costs are baked into every model you run. The attention that makes a transformer good at language also makes long context expensive: in standard attention, the work of relating every token to every other grows roughly with the square of the context length, which is part of why a long prompt eats memory and slows down. And the architecture predicts plausible next tokens, which is exactly why it is fluent and exactly why it can hallucinate.
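
A little arithmetic makes the cost of long context concrete. The sketch below assumes plain, full attention in which every token is compared with every token; `attention_pairs` is a hypothetical helper name, and real implementations (causal masking, optimised kernels, sparse or windowed variants) change the constants and sometimes the growth rate.

```python
def attention_pairs(context_length: int) -> int:
    """Number of token-to-token comparisons in one full attention pass."""
    return context_length * context_length


for n in (1_000, 2_000, 4_000, 8_000):
    print(f"{n:>6} tokens -> {attention_pairs(n):>12,} comparisons")
```

Under this assumption, doubling the context length quadruples the number of comparisons, per layer and per attention head.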

You do not need the equations to operate one well. You need the intuition: it relates words by weighing them, it does so in parallel, and the bill for that arrives as memory and compute when your context grows.

What a transformer is good at

  • Relating words far apart in a sequence, not just neighbours
  • Processing a whole sequence in parallel rather than one step at a time
  • Scaling up cleanly: more layers and data tend to mean more capability

What it does not give you free

  • Cheap long context; attention work grows roughly with the square of the context length
  • Built-in truth; it predicts plausible text, not verified fact
  • A small memory footprint; the running state has to be held somewhere
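
One common reading of "running state" is the cache of keys and values that many implementations keep for every token already in the context (often called a KV cache). The sketch below is a rough estimate under that assumption; `kv_cache_bytes` is a hypothetical helper, and the layer count, head count and head size are made-up illustrative numbers, not the specification of any real model.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   context_length: int, bytes_per_value: int = 2) -> int:
    """Estimate cache size: one key and one value vector per token, head and layer."""
    keys_and_values = 2
    return (keys_and_values * layers * kv_heads * head_dim
            * context_length * bytes_per_value)


# Illustrative, made-up configuration with 16-bit (2-byte) values.
size = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128, context_length=8192)
print(f"{size / 1024**3:.2f} GiB")
```

Under these assumptions the estimate scales in a straight line with context length, and it sits on top of the memory needed for the model's weights.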

Reviewed: June 2026