FlashAttention: attention that fits in fast memory

FlashAttention is a memory-efficient algorithm for the attention step in a transformer. Instead of building the full attention-score matrix in main GPU memory, it works in small tiles inside fast on-chip memory and never materialises the whole matrix. The math is the same, but it uses far less memory and makes fewer trips to slow memory, which is what makes long contexts practical.

What problem does FlashAttention solve?

Attention is the step where every token looks at every other token. Done the direct way, that means building a square score matrix whose size grows with the square of the context length, writing it to the GPU’s main memory, and reading it back. For a long prompt that matrix is enormous, and the slow part is not the arithmetic but shuttling all that data to and from memory.
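To put rough numbers on that growth, here is a small back-of-the-envelope sketch. It assumes 2 bytes per score (16-bit floats) and counts one matrix for a single attention head on a single sequence. Both are illustrative assumptions, and the function name is made up for this example.

```python
def score_matrix_bytes(context_len: int, bytes_per_score: int = 2) -> int:
    """Bytes needed to hold one full context_len x context_len score matrix.

    Assumption: 2 bytes per score (fp16), one head, one sequence.
    """
    return context_len * context_len * bytes_per_score


for n in (1_024, 8_192, 65_536):
    mib = score_matrix_bytes(n) / 2**20
    print(f"{n:>6} tokens -> {mib:,.0f} MiB per head")
```

Making the context 8 times longer makes the matrix 64 times larger, which is why the direct approach stops being practical well before the arithmetic itself becomes the bottleneck.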

FlashAttention does the same calculation in tiles. It loads a small block of keys and values into the fast memory that sits right next to the compute units, does the work there, keeps a running result that it corrects as each new block arrives, and moves on. The full score matrix never exists in main memory. You get the identical answer; you just stop paying to store and re-read a matrix you did not need to keep.
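The "running result" idea can be sketched in plain Python for a single query. This is a simplified illustration, not the actual GPU kernel: it runs on the CPU, handles one query at a time, and the names `naive_attention`, `tiled_attention`, and `block` are chosen for this example. What it does show is the core trick: keep a running maximum, a running sum, and a running weighted total, and rescale them whenever a new block raises the maximum.

```python
import math


def _score(q, k):
    # Scaled dot product between one query and one key.
    return sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q))


def naive_attention(q, keys, values):
    """Reference version: compute every score first, then softmax and mix."""
    scores = [_score(q, k) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]


def tiled_attention(q, keys, values, block=2):
    """Same output, but at most `block` scores exist at any one time."""
    running_max = -math.inf
    running_sum = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), block):
        k_blk = keys[start:start + block]
        v_blk = values[start:start + block]
        scores = [_score(q, k) for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale everything accumulated so far to the new maximum.
        scale = math.exp(running_max - new_max)
        running_sum *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(scores, v_blk):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]
```

Called with the same inputs, the two functions agree up to floating-point rounding whatever block size is used. The real kernel also tiles over queries and keeps its blocks in on-chip memory, which is where the speed comes from.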

Is it a setting I turn on?

Mostly not. FlashAttention lives inside the serving engine as a kernel: a piece of hand-tuned GPU code that the engine selects when your hardware and build support it. On a desktop machine you benefit from it without configuring anything. The one place it becomes visible is when it fails to load: a kernel built for the wrong GPU architecture can fail, sometimes quietly, and attention either falls back to a slower path or breaks. The practical habit is to run a small test request after any engine or driver change rather than assume the fast path is active. The win is real, but it depends on the kernel actually matching the silicon underneath it.
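One way to make that habit routine is a tiny smoke test. The sketch below is engine-agnostic on purpose: `generate` is a hypothetical callable you would write yourself to send one prompt to your engine and return its text, and the prompt and time limit are placeholder values. It cannot prove the fast kernel is active. It only catches outright failures and requests that are suspiciously slow.

```python
import time


def smoke_test(generate, prompt="Reply with the single word: ready",
               max_seconds=30.0):
    """Run one small request and report whether it returned text in time.

    `generate` is a hypothetical wrapper around your own engine: it takes a
    prompt string and returns the generated text.
    """
    start = time.perf_counter()
    try:
        text = generate(prompt)
    except Exception as exc:  # a kernel that fails to load may surface here
        return {"ok": False,
                "seconds": time.perf_counter() - start,
                "error": repr(exc)}
    seconds = time.perf_counter() - start
    ok = isinstance(text, str) and text.strip() != "" and seconds <= max_seconds
    return {"ok": ok, "seconds": seconds, "error": None}
```

Recording the `seconds` value before and after an upgrade gives a crude signal: a request that suddenly takes much longer on the same prompt is a hint that attention may have fallen back to a slower path.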
