Why is bandwidth the real ceiling?
When a model generates text, the decode step reads the model’s weights out of memory for every single token it produces. That is a lot of data to move, over and over. If the path between the accelerator and its memory can only carry so many gigabytes per second, then no matter how fast the compute units are, they spend their time waiting for weights to arrive. As a rough rule, tokens per second cannot exceed memory bandwidth divided by the bytes read per token. The speed you actually get is set by how fast the data moves, which is memory bandwidth.
This is why two machines with the same amount of memory can generate text at very different speeds. Capacity tells you what fits. Bandwidth tells you how fast it runs once it fits. They are easy to confuse because both are quoted in the spec sheet, but they answer different questions, and for generation speed it is bandwidth that holds the answer.
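To put rough numbers on the capacity-versus-bandwidth distinction, here is a minimal back-of-envelope sketch. It assumes single-sequence decode in which all weights are streamed from memory once per token, and it ignores KV-cache reads, on-chip caching, and any overhead, so real throughput would land below these ceilings. The function name, the 40 GB model, and the two bandwidth figures are hypothetical, chosen only for illustration.

```python
GB = 1e9  # decimal gigabytes, as spec sheets usually quote them


def decode_ceiling_tokens_per_s(bytes_read_per_token: float,
                                bandwidth_bytes_per_s: float) -> float:
    """Upper bound on decode speed when each token requires streaming
    bytes_read_per_token from memory at bandwidth_bytes_per_s."""
    if bytes_read_per_token <= 0:
        raise ValueError("bytes_read_per_token must be positive")
    return bandwidth_bytes_per_s / bytes_read_per_token


# Hypothetical setup: the same 40 GB of weights fits on both machines,
# but their memory bandwidth differs by 4x.
model_bytes = 40 * GB
bandwidth_a = 200 * GB  # bytes per second (assumed)
bandwidth_b = 800 * GB  # bytes per second (assumed)

ceiling_a = decode_ceiling_tokens_per_s(model_bytes, bandwidth_a)
ceiling_b = decode_ceiling_tokens_per_s(model_bytes, bandwidth_b)

print(f"machine A ceiling: {ceiling_a:.1f} tokens/s")
print(f"machine B ceiling: {ceiling_b:.1f} tokens/s")
```

Both machines hold the model equally well, yet under these assumptions the second has four times the ceiling, purely because of bandwidth.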
What changes the bandwidth pressure?
The main lever is how many bytes of weights have to be read for each token. A dense model reads every parameter on every token, so it is the heaviest possible load on bandwidth for its size. A mixture-of-experts (MoE) model activates only a fraction of its parameters per token, so it reads far less data per step and can run much faster for the same total size. Quantizing to a smaller weight format helps too, since each parameter takes fewer bytes to move. None of this changes the raw compute available; it changes how much data has to travel, which is the thing that was actually slowing you down. When you hear that a box is “bandwidth-bound”, this is the trap it has fallen into: the compute units sit idle, waiting for weights to arrive.
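The effect of these levers can be sketched with the same kind of estimate: bytes read per token is roughly the number of active parameters times the bytes per parameter. Everything below is an illustrative assumption, not a measurement: the 400 GB/s bandwidth, the 70-billion-parameter size, the 10 billion active parameters for the MoE case, and the helper names. Real quantized formats also carry some extra bytes for scales, and KV-cache traffic is ignored.

```python
def weight_bytes_per_token(active_params: float, bytes_per_param: float) -> float:
    """Approximate weight traffic per generated token."""
    return active_params * bytes_per_param


def decode_ceiling(active_params: float, bytes_per_param: float,
                   bandwidth_bytes_per_s: float) -> float:
    """Bandwidth-limited upper bound on tokens per second."""
    return bandwidth_bytes_per_s / weight_bytes_per_token(active_params, bytes_per_param)


BANDWIDTH = 400e9  # hypothetical: 400 GB/s

# (active parameters per token, bytes per parameter); all values assumed.
configs = {
    "dense, 16-bit weights": (70e9, 2.0),
    "dense, 4-bit weights": (70e9, 0.5),  # ignores quantization scale overhead
    "MoE, 16-bit weights": (10e9, 2.0),   # same 70e9 total, 10e9 active per token
}

for name, (active, bytes_per_param) in configs.items():
    ceiling = decode_ceiling(active, bytes_per_param, BANDWIDTH)
    print(f"{name}: about {ceiling:.1f} tokens/s at most")
```

The compute available is identical in all three cases; only the amount of data moved per token changes, and the ceiling moves with it.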