Memory bandwidth: the real speed limit

Memory bandwidth is the rate at which data moves between the accelerator and its memory, measured in gigabytes per second. It is distinct from memory capacity, which is how much fits. When a model generates one token at a time, it must stream its weights from memory every step, so bandwidth, not raw compute, is usually what caps how fast it runs.

At a glance

What it is: How fast data moves to and from memory, in gigabytes per second
Not the same as: Capacity, which is how much memory you have, not how fast it is
Why it matters: It often caps token generation speed more than compute does
Worst case for it: A dense model where every parameter is read on every token

Why is bandwidth the real ceiling?

When a model generates text, the decode phase reads the model’s weights out of memory for every single token it produces. That is a lot of data to move, over and over. The path between the accelerator and its memory can carry only a fixed number of gigabytes per second, so no matter how fast the compute units are, they spend much of their time waiting for weights to arrive. The speed you actually get is set by how fast the data moves, which is memory bandwidth.
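The reasoning above can be turned into a rough ceiling: if every token requires streaming the weights once, tokens per second cannot exceed bandwidth divided by the bytes read per token. The sketch below is a back-of-envelope estimate, not a benchmark. The model size, bytes per parameter, and the two bandwidth figures are hypothetical values chosen for illustration, and the function names are made up for this example.

```python
def weight_size_gb(params_billions: float, bytes_per_param: float) -> float:
    """Size of the weights in GB, taking 1 GB as 1e9 bytes."""
    return params_billions * bytes_per_param


def decode_ceiling(bandwidth_gb_s: float, gb_read_per_token: float) -> float:
    """Upper bound on tokens/s when each token streams gb_read_per_token from memory."""
    return bandwidth_gb_s / gb_read_per_token


# Hypothetical dense model: 8 billion parameters stored at 2 bytes each.
weights_gb = weight_size_gb(params_billions=8, bytes_per_param=2)

# Two hypothetical machines with the same capacity but different bandwidth.
for name, bandwidth_gb_s in [("box A", 200.0), ("box B", 1000.0)]:
    ceiling = decode_ceiling(bandwidth_gb_s, weights_gb)
    print(f"{name}: at most {ceiling:.1f} tokens/s")
```

Real throughput lands below this bound, since the estimate ignores KV cache reads, activations, and any overhead in the runtime.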

This is why two machines with the same amount of memory can generate text at very different speeds. Capacity tells you what fits. Bandwidth tells you how fast it runs once it fits. They are easy to confuse because both are quoted in the spec sheet, but they answer different questions, and for generation speed it is bandwidth that holds the answer.

What changes the bandwidth pressure?

The main lever is how many bytes of weights have to be read for each token. A dense model reads every parameter on every token, so it is the heaviest possible load on bandwidth. A mixture-of-experts (MoE) model activates only a fraction of its parameters per token, so it reads far less data per step and decodes faster than a dense model of the same total size. Quantizing to a smaller weight format helps too, since each parameter takes fewer bytes to move. None of this changes the raw compute available; it changes how much data has to travel, which is the thing that was actually slowing you down. When you hear that a box is “bandwidth-bound”, this is what it means: the compute units are idle, waiting on memory.
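To make the comparison concrete, here is a minimal sketch that estimates the data read per token for a dense model, the same model quantized, and an MoE model of the same total size. All parameter counts, bit widths, and the bandwidth figure are hypothetical, and the MoE line simplifies by counting only the active parameters.

```python
def gb_per_token(active_params_billions: float, bits_per_param: int) -> float:
    """GB of weights read for one token, taking 1 GB as 1e9 bytes."""
    return active_params_billions * bits_per_param / 8


# Hypothetical 40-billion-parameter model in three configurations.
configs = {
    "dense, 16-bit": gb_per_token(40, 16),
    "dense, 4-bit": gb_per_token(40, 4),
    "MoE, 16-bit, 5B of 40B active": gb_per_token(5, 16),
}

BANDWIDTH_GB_S = 400.0  # hypothetical memory bandwidth

for name, gb in configs.items():
    ceiling = BANDWIDTH_GB_S / gb  # same compute, different data volume
    print(f"{name}: {gb:.0f} GB per token, at most {ceiling:.0f} tokens/s")
```

Compute is identical across all three rows; only the volume of data moved per token changes, and the ceiling moves with it.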

Bandwidth sets

  • How fast tokens come out during one-at-a-time decode
  • Whether a dense model runs at usable speed at all
  • How much speed you gain from a smaller weight format

Capacity sets

  • How big a model you can load in the first place
  • How long a context the key-value (KV) cache can hold
  • How many requests you can keep in flight at once
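The capacity side can be sketched the same way. The example below assumes a common transformer layout in which each token stores one key and one value vector per layer per KV head; the layer count, head count, head size, and memory figures are hypothetical, and the estimate leaves out activations and runtime overhead.

```python
def kv_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                       bytes_per_value: int) -> int:
    """Bytes of KV cache added per token: one key and one value per layer per head."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value


def max_cached_tokens(capacity_gb: float, weights_gb: float, kv_bytes: int) -> int:
    """Tokens of KV cache that fit in the memory left after loading the weights."""
    free_bytes = int((capacity_gb - weights_gb) * 1e9)
    return max(free_bytes, 0) // kv_bytes


# Hypothetical model shape and machine.
per_token = kv_bytes_per_token(n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_value=2)
budget = max_cached_tokens(capacity_gb=24.0, weights_gb=16.0, kv_bytes=per_token)

print(f"KV cache per token: {per_token} bytes")
print(f"Tokens that fit across all in-flight requests: {budget}")
```

That token budget is shared: it can go to one long context or be split across many shorter concurrent requests, which is why capacity governs both.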

Reviewed: June 2026