HBM: the fast memory on big GPUs

HBM (High-Bandwidth Memory) is a stacked memory design used on high-end graphics processors (GPUs) to deliver very high memory bandwidth. How fast a model's weights can be streamed out of memory largely sets token speed, so HBM is why data-centre cards run dense models quickly. A DGX Spark uses a different, lower-bandwidth memory and is not an HBM machine.

What is HBM and why does bandwidth matter?

HBM stands for High-Bandwidth Memory. It is a memory design where the chips are stacked into a tall block and wired with a very wide path to the processor. The result is high memory bandwidth: a lot of data moved per second.

Bandwidth matters because of how a model runs. To produce each token, the processor has to read the model’s weights out of memory. The faster it can read them, the faster the tokens come out. For a dense model, which touches every weight on every token, bandwidth is the main thing that sets your speed. This is why data-centre accelerators with HBM stream big dense models much faster than a desktop card does.
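The relationship between bandwidth and token speed can be put as rough arithmetic. The sketch below assumes that every token needs one full read of the weights and that nothing else limits speed. The function name and the example figures are illustrative assumptions, not measurements or specifications of any real device.

```python
def max_decode_tokens_per_second(bandwidth_gb_s, params_billion, bytes_per_param):
    """Rough ceiling on tokens/s for a dense model when weight reads are the bottleneck."""
    if bandwidth_gb_s <= 0 or params_billion <= 0 or bytes_per_param <= 0:
        raise ValueError("all inputs must be positive")
    # Billions of parameters times bytes per parameter gives gigabytes of weights.
    weights_gb = params_billion * bytes_per_param
    # Assumption: each token requires one full pass over the weights.
    return bandwidth_gb_s / weights_gb


# Illustrative round numbers only, not specifications of any real device.
for label, bandwidth in [("higher-bandwidth memory", 3000.0),
                         ("lower-bandwidth memory", 300.0)]:
    rate = max_decode_tokens_per_second(bandwidth, params_billion=70, bytes_per_param=1.0)
    print(f"{label}: at most about {rate:.1f} tokens/s")
```

Under these assumptions, ten times the bandwidth gives ten times the ceiling for the same model. Real throughput sits below this ceiling, because the estimate ignores compute time and any other memory traffic.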

How does a DGX Spark differ?

A DGX Spark does not use HBM. It uses a shared, system-class memory pool, which trades raw bandwidth for capacity and a small chassis. So the Spark can hold a very large model that might not fit in the memory of an HBM card, but it reads that model out of memory more slowly.

In practice, a Spark is best suited to mixture-of-experts models, which activate only a fraction of their weights per token and so ask less of the memory bus. A dense model, which reads every weight on every token, is the case where the missing HBM bandwidth is felt most.
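The difference between dense and mixture-of-experts models can be sketched with the same kind of arithmetic. The example below assumes that only the active fraction of the weights is read per token. The bandwidth figure, model size, and active fraction are hypothetical values chosen for easy numbers, not properties of a DGX Spark or of any particular model.

```python
def gb_read_per_token(total_params_billion, active_fraction, bytes_per_param):
    """Gigabytes of weights read to produce one token."""
    if not 0 < active_fraction <= 1:
        raise ValueError("active_fraction must be in (0, 1]")
    # Only the active share of the weights is streamed for each token.
    return total_params_billion * active_fraction * bytes_per_param


def max_tokens_per_second(bandwidth_gb_s, gb_per_token):
    """Rough ceiling on tokens/s when weight reads are the only bottleneck."""
    return bandwidth_gb_s / gb_per_token


BANDWIDTH_GB_S = 250.0  # hypothetical lower-bandwidth memory pool

# Same total size in memory, very different traffic per token.
dense_gb = gb_read_per_token(100, active_fraction=1.0, bytes_per_param=1.0)
moe_gb = gb_read_per_token(100, active_fraction=0.1, bytes_per_param=1.0)

print(f"dense: {dense_gb:.0f} GB per token, "
      f"ceiling {max_tokens_per_second(BANDWIDTH_GB_S, dense_gb):.1f} tokens/s")
print(f"mixture-of-experts: {moe_gb:.0f} GB per token, "
      f"ceiling {max_tokens_per_second(BANDWIDTH_GB_S, moe_gb):.1f} tokens/s")
```

Both models occupy the same capacity in this sketch, but the mixture-of-experts model moves a tenth of the data per token, so its bandwidth ceiling is ten times higher on the same memory.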
