Memory bandwidth: the real speed limit

Memory bandwidth is the rate at which data moves between the accelerator and its memory, measured in gigabytes per second. It is distinct from memory capacity, which is how much fits. When a model generates one token at a time, it must stream its weights from memory every step, so bandwidth, not raw compute, is usually what caps how fast it runs.

Why is bandwidth the real ceiling?

When a model generates text, the decode phase reads the model’s weights out of memory for every single token it produces. That is a lot of data to move, over and over. The path between the accelerator and its memory can carry only so many gigabytes per second, so no matter how fast the compute units are, they spend much of their time waiting for weights to arrive. The speed you actually get is set by how fast the data moves, which is memory bandwidth.
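The reasoning above reduces to a simple division: if every token requires reading the weights once, the token rate cannot exceed bandwidth divided by the bytes read per token. The sketch below is a back-of-the-envelope estimate, not a benchmark. The function names and the example figures (a 7-billion-parameter model at 2 bytes per parameter, 1000 GB/s of bandwidth) are illustrative assumptions, and the estimate ignores KV-cache reads, batching, and compute time.

```python
GB = 1e9  # decimal gigabyte, as used on spec sheets


def weight_bytes(num_params: float, bytes_per_param: float) -> float:
    """Total size of the weights in bytes."""
    return num_params * bytes_per_param


def max_decode_tokens_per_s(bandwidth_bytes_per_s: float,
                            bytes_read_per_token: float) -> float:
    """Upper bound on decode speed when every token reads the weights once."""
    if bytes_read_per_token <= 0:
        raise ValueError("bytes_read_per_token must be positive")
    return bandwidth_bytes_per_s / bytes_read_per_token


# Hypothetical example: 7e9 parameters at 2 bytes each, 1000 GB/s of bandwidth.
model_bytes = weight_bytes(7e9, 2)
ceiling = max_decode_tokens_per_s(1000 * GB, model_bytes)
print(f"{ceiling:.1f} tokens/s upper bound")  # prints "71.4 tokens/s upper bound"
```

Real throughput lands below this ceiling; the point of the estimate is that adding compute does not raise it.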

This is why two machines with the same amount of memory can generate text at very different speeds. Capacity tells you what fits. Bandwidth tells you how fast it runs once it fits. They are easy to confuse because both are quoted in the spec sheet, but they answer different questions, and for generation speed it is bandwidth that holds the answer.
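To make the two questions concrete, here is a minimal sketch comparing two hypothetical machines with identical capacity and different bandwidth. The machine names, the 64 GB capacity, the 200 and 800 GB/s bandwidths, and the 40 GB model size are all made-up figures chosen for easy arithmetic, not real products.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Machine:
    name: str
    capacity_gb: float      # how much fits
    bandwidth_gb_s: float   # how fast it can be read

    def fits(self, model_gb: float) -> bool:
        """Capacity question: does the model fit in memory at all?"""
        return model_gb <= self.capacity_gb

    def decode_ceiling(self, model_gb: float) -> float:
        """Bandwidth question: tokens/s upper bound if each token reads all weights."""
        return self.bandwidth_gb_s / model_gb


MODEL_GB = 40.0  # hypothetical model size in memory
machines = [
    Machine("box_a", capacity_gb=64.0, bandwidth_gb_s=200.0),
    Machine("box_b", capacity_gb=64.0, bandwidth_gb_s=800.0),
]

for m in machines:
    print(f"{m.name}: fits={m.fits(MODEL_GB)}, "
          f"ceiling={m.decode_ceiling(MODEL_GB):.1f} tokens/s")
```

Both machines answer "yes" to the capacity question, yet under these assumptions the second has four times the generation-speed ceiling of the first.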

What changes the bandwidth pressure?

The number of bytes the model has to read per token is the main lever. A dense model reads every parameter on every token, so it is the heaviest possible load on bandwidth for its size. A mixture-of-experts (MoE) model activates only a fraction of its parameters per token, so it reads far less data per step and generates much faster than a dense model of the same total size. Quantizing to a smaller weight format helps too, since each parameter takes fewer bytes to move. None of this changes the raw compute available; it changes how much data has to travel, which is the thing that was actually slowing you down. When you hear that a box is “bandwidth-bound”, this is what is meant: its compute units sit idle while they wait for weights to arrive.
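The effect of each lever can be sketched by counting the bytes read per token. The parameter counts, weight formats, and the 1000 GB/s bandwidth below are hypothetical, chosen only to show the arithmetic. The sketch also assumes an MoE model reads only its active parameters and that a quantized weight costs exactly its nominal size, which ignores routing overhead, shared layers, and quantization scale factors.

```python
BANDWIDTH_GB_S = 1000.0  # hypothetical accelerator bandwidth


def gb_read_per_token(active_params_billions: float, bytes_per_param: float) -> float:
    """Gigabytes of weights read per token (billions of params x bytes each)."""
    return active_params_billions * bytes_per_param


def decode_ceiling(gb_per_token: float, bandwidth_gb_s: float = BANDWIDTH_GB_S) -> float:
    """Tokens/s upper bound when decode is limited by weight reads."""
    return bandwidth_gb_s / gb_per_token


# (active parameters in billions, bytes per parameter); all figures illustrative.
scenarios = {
    "dense 70B, 16-bit": (70.0, 2.0),
    "MoE 70B total / 12B active, 16-bit": (12.0, 2.0),
    "dense 70B, 4-bit": (70.0, 0.5),
}

for label, (active_b, bytes_pp) in scenarios.items():
    gb = gb_read_per_token(active_b, bytes_pp)
    print(f"{label}: {gb:.0f} GB/token, ceiling {decode_ceiling(gb):.1f} tokens/s")
```

In all three scenarios the bandwidth and compute are identical; only the data moved per token differs. Note that the MoE model still needs enough capacity to hold all of its parameters, even though it reads only a fraction of them per token.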
