FlashAttention: attention that fits in fast memory

FlashAttention is a memory-efficient algorithm for the attention step in a transformer. Instead of building the full attention-score matrix in main GPU memory, it works in small tiles inside fast on-chip memory and never materialises the whole matrix. The math is the same, but it uses far less memory and makes fewer trips to slow memory, which is what makes long contexts practical.

At a glance

What it is: A tiled, memory-efficient way to compute attention
What it avoids: Writing the full attention-score matrix to slow GPU memory
Why it matters: Long contexts become affordable in memory and time
Where you meet it: A backend kernel inside serving engines, not a knob you set

What problem does FlashAttention solve?

Attention is the step where every token looks at every other token. Done the direct way, that means building a square score matrix whose size grows with the square of the context length, writing it to the GPU’s main memory, and reading it back. For a long prompt that matrix is enormous, and the slow part is not the arithmetic but shuttling all that data to and from memory.
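To put a number on that square growth, here is a small back-of-envelope sketch. It assumes 16-bit scores (2 bytes each) and counts one matrix for one attention head and one sequence; the function name and the example context lengths are illustrative, not taken from any particular engine.

```python
def score_matrix_bytes(context_length, bytes_per_score=2):
    """Bytes needed to hold one full context_length x context_length score matrix."""
    return context_length * context_length * bytes_per_score


# Assumption: 2-byte (16-bit) scores, one head, one sequence.
for tokens in (4_096, 32_768):
    mib = score_matrix_bytes(tokens) / 2**20
    print(f"{tokens:>6} tokens -> {mib:,.0f} MiB per head")
```

Under these assumptions a 4,096-token context needs 32 MiB per head, while a 32,768-token context needs 2 GiB per head: eight times the tokens, sixty-four times the memory.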

FlashAttention does the same calculation in tiles. It loads a small block of queries, keys and values into the fast memory that sits right next to the compute units, computes that block's scores there, folds them into a running result (a running maximum and sum keep the softmax correct as new blocks arrive), and moves on. The full score matrix never exists in main memory. The answer is the same up to floating-point rounding; you just stop paying to store and re-read a matrix you did not need to keep.
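The "running result" idea can be sketched in plain Python for a single query. This is a teaching sketch, not the real kernel: actual FlashAttention runs as GPU code over blocks of queries as well as keys and values. The function names, the `block_size` parameter, and the use of Python lists are all assumptions made for illustration.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def naive_attention_row(q, keys, values):
    """Reference: hold every score for this query at once, then softmax."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [dot(q, k) * scale for k in keys]  # the full row of scores
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(dim)]


def tiled_attention_row(q, keys, values, block_size=2):
    """Same result, visiting keys and values one block at a time."""
    scale = 1.0 / math.sqrt(len(q))
    dim = len(values[0])
    running_max = float("-inf")  # largest score seen so far
    running_sum = 0.0            # softmax denominator so far
    acc = [0.0] * dim            # unnormalised weighted sum of values

    for start in range(0, len(keys), block_size):
        k_block = keys[start:start + block_size]
        v_block = values[start:start + block_size]
        scores = [dot(q, k) * scale for k in k_block]  # only this block's scores

        # Rescale earlier totals so they are relative to the new maximum.
        new_max = max(running_max, max(scores))
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        acc = [a * correction for a in acc]

        for s, v in zip(scores, v_block):
            w = math.exp(s - new_max)
            running_sum += w
            for j in range(dim):
                acc[j] += w * v[j]
        running_max = new_max

    return [a / running_sum for a in acc]
```

At no point does the tiled version hold more than `block_size` scores, yet its output agrees with the naive version to within floating-point rounding.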

Is it a setting I turn on?

Mostly not. FlashAttention lives inside the serving engine as a kernel, a piece of hand-tuned GPU code the engine picks when your hardware and build support it. On a desk you benefit from it without configuring it. The one place it becomes visible is when it does not load: a kernel built for the wrong architecture can fail, sometimes quietly, and attention falls back to a slower path or breaks. The practical habit is to test a small request after any engine or driver change rather than assume the fast path is active. The win is real, but it depends on the kernel actually matching the silicon underneath it.
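One way to make that habit routine is a tiny smoke test run after each upgrade. The sketch below is engine-agnostic and entirely hypothetical: `send_request` stands in for whatever function you already use to call your engine, and the prompt and time budget are placeholder values to tune against your own baseline. It cannot prove which attention kernel is active; it only flags a request that fails, returns nothing, or is much slower than you expect.

```python
import time


def smoke_test(send_request, prompt="Say OK.", max_seconds=10.0):
    """Send one small request and report failure, empty output, or slowness.

    send_request is a hypothetical callable: it takes a prompt string and
    returns the reply text from your serving engine.
    """
    start = time.perf_counter()
    try:
        reply = send_request(prompt)
    except Exception as exc:  # any failure counts as a failed smoke test
        elapsed = time.perf_counter() - start
        return {"ok": False, "reason": f"request failed: {exc}", "seconds": elapsed}
    elapsed = time.perf_counter() - start

    if not reply or not reply.strip():
        return {"ok": False, "reason": "empty reply", "seconds": elapsed}
    if elapsed > max_seconds:
        return {"ok": False, "reason": "slower than expected", "seconds": elapsed}
    return {"ok": True, "reason": "reply received in time", "seconds": elapsed}
```

Recording the `seconds` value from a known-good setup gives you a baseline, so a quiet fallback to a slower path shows up as a jump in latency rather than going unnoticed.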

FlashAttention helps with

  • The memory cost of attention at long context length
  • Speed, by cutting trips to slow GPU memory
  • Holding more requests in flight before memory runs out

It does not change

  • The model's weights or its answers; the math is the same
  • The size of the key-value (KV) cache, which is a separate cost
  • Whether your kernel is built for your hardware; it can fail to load

Reviewed: June 2026