What problem does FlashAttention solve?
Attention is the step where every token looks at every other token. Done the direct way, that means building a square score matrix whose size grows with the square of the context length, then writing it to the GPU’s main memory and reading it back. For a long prompt that matrix is enormous, and the slow part is not the arithmetic but shuttling all that data to and from memory.
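To put rough numbers on "enormous", here is a back-of-the-envelope sketch. It assumes 16-bit scores (2 bytes each) and counts a single attention head for a single sequence; the function name and the example context lengths are illustrative, not taken from any engine.

```python
def score_matrix_bytes(context_len: int, bytes_per_score: int = 2) -> int:
    """Bytes needed for one head's full context_len x context_len score matrix.

    Assumes 2 bytes per score (16-bit floats); change bytes_per_score otherwise.
    """
    return context_len * context_len * bytes_per_score


for n in (1_024, 8_192, 32_768):
    mib = score_matrix_bytes(n) / 2**20
    print(f"{n:>6} tokens -> {mib:,.0f} MiB per head")
```

Under those assumptions, 1,024 tokens needs 2 MiB per head, 8,192 tokens needs 128 MiB, and 32,768 tokens needs 2,048 MiB. Multiplying the context by 8 multiplies the matrix by 64, and a real model repeats this for every head in every layer.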
FlashAttention does the same calculation in tiles. It loads a small block of keys and values into the fast memory that sits right next to the compute units, does the work there, keeps a running result, and moves on. The full score matrix never exists in main memory. You get the same answer, up to floating-point rounding; you just stop paying to store and re-read a matrix you did not need to keep.
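The "running result" idea can be sketched in a few lines of NumPy. This is only an illustration of the arithmetic, not the real kernel: it runs on the CPU, handles one head with no causal mask, and tiles only over keys and values. The function names and the block size of 64 are assumptions made for the example.

```python
import numpy as np


def naive_attention(q, k, v):
    """Direct method: builds the full (n_queries x n_keys) score matrix."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # stabilise exp
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v


def tiled_attention(q, k, v, block=64):
    """Same result, but only one (n_queries x block) slice of scores exists at a time."""
    n_q, d = q.shape
    scale = 1.0 / np.sqrt(d)
    running_max = np.full((n_q, 1), -np.inf)  # largest score seen so far, per query
    running_sum = np.zeros((n_q, 1))          # softmax denominator so far
    acc = np.zeros((n_q, v.shape[-1]))        # unnormalised output so far

    for start in range(0, k.shape[0], block):
        k_blk = k[start:start + block]
        v_blk = v[start:start + block]
        s = (q @ k_blk.T) * scale             # scores for this tile only

        new_max = np.maximum(running_max, s.max(axis=-1, keepdims=True))
        correction = np.exp(running_max - new_max)  # rescale earlier partial results
        p = np.exp(s - new_max)

        running_sum = running_sum * correction + p.sum(axis=-1, keepdims=True)
        acc = acc * correction + p @ v_blk
        running_max = new_max

    return acc / running_sum


rng = np.random.default_rng(0)
q = rng.standard_normal((16, 32))
k = rng.standard_normal((200, 32))
v = rng.standard_normal((200, 32))
print(np.allclose(naive_attention(q, k, v), tiled_attention(q, k, v)))
```

The correction factor is what makes the tiling safe: each time a new tile raises the running maximum, the earlier partial sums are rescaled so the final division gives the same softmax as the direct method.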
Is it a setting I turn on?
Mostly not. FlashAttention lives inside the serving engine as a kernel, a piece of hand-tuned GPU code the engine picks when your hardware and build support it. On a desktop machine you benefit from it without configuring it. The one place it becomes visible is when it does not load: a kernel built for the wrong architecture can fail, sometimes quietly, and attention either falls back to a slower path or breaks outright. The practical habit is to test a small request after any engine or driver change rather than assume the fast path is active. The win is real, but it depends on the kernel actually matching the silicon underneath it.
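One way to make that habit routine is a tiny smoke test. The sketch below is deliberately engine-agnostic: `generate` is a hypothetical callable that you would wire to whatever client your serving engine exposes, and the prompt and 10-second limit are placeholder assumptions. It can catch a broken or grossly slow path, but it cannot by itself prove that the fast kernel is the one running.

```python
import time
from typing import Callable


def smoke_test(
    generate: Callable[[str], str],  # hypothetical: wraps your engine's client
    prompt: str = "Say hello.",
    max_seconds: float = 10.0,       # assumed limit; tune to your own baseline
) -> dict:
    """Send one small request and report whether it answered, and how fast."""
    start = time.perf_counter()
    reply = generate(prompt)
    elapsed = time.perf_counter() - start
    ok = bool(reply.strip()) and elapsed <= max_seconds
    return {"ok": ok, "seconds": elapsed, "reply": reply}


# Example with a stand-in function instead of a real engine call.
result = smoke_test(lambda p: "hello")
print(result["ok"])
```

Recording the elapsed time from a known-good setup gives you a baseline, so a request that suddenly takes several times longer after an update is a hint that attention may have dropped to a slower path.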