HBM: the fast memory on big GPUs

HBM (High-Bandwidth Memory) is a stacked memory design used on high-end graphics processors (GPUs) to deliver very high memory bandwidth. Generating each token means reading the model's weights out of memory, so memory bandwidth largely sets token speed, and HBM is why data-centre cards run dense models quickly. A DGX Spark uses a different, lower-bandwidth memory and is not an HBM machine.

At a glance

What it is: Stacked, very high-bandwidth memory on high-end GPUs
Why it matters: Bandwidth sets token speed; HBM moves weights very fast
On a DGX Spark: Not HBM; it uses lower-bandwidth shared memory
Where you meet it: Data-centre accelerators built for dense large models

Comparison

Memory bandwidth, by where the model runs

Memory type
  • HBM data-centre GPU: Stacked High-Bandwidth Memory
  • DGX Spark (shared memory): Shared system-class memory

Bandwidth
  • HBM data-centre GPU: Very high
  • DGX Spark (shared memory): Lower; capacity is the strength, not speed

Best at
  • HBM data-centre GPU: Dense large models that stream every weight
  • DGX Spark (shared memory): Large mixture-of-experts models that activate a fraction of their weights per token

What is HBM and why does bandwidth matter?

HBM stands for High-Bandwidth Memory. It is a memory design where the chips are stacked into a tall block and wired with a very wide path to the processor. The result is high memory bandwidth: a lot of data moved per second.
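
To see why a wide path matters, here is a minimal sketch of the usual peak-bandwidth arithmetic: the number of data pins times the rate of each pin, divided by eight to turn bits into bytes. The function name and the example figures are illustrative assumptions, not the spec of any particular product.

```python
def peak_bandwidth_gb_s(bus_width_bits: int, gbit_per_s_per_pin: float) -> float:
    """Peak bandwidth in GB/s: bits moved per second across all pins, divided by 8."""
    if bus_width_bits <= 0 or gbit_per_s_per_pin <= 0:
        raise ValueError("bus width and per-pin rate must be positive")
    return bus_width_bits * gbit_per_s_per_pin / 8


# Illustrative figures only (assumed for this example).
wide_path = peak_bandwidth_gb_s(bus_width_bits=1024, gbit_per_s_per_pin=6.4)
narrow_path = peak_bandwidth_gb_s(bus_width_bits=256, gbit_per_s_per_pin=8.0)

print(f"wide path:   {wide_path:.1f} GB/s")
print(f"narrow path: {narrow_path:.1f} GB/s")
```

Even with a slower rate per pin, the wide path moves more data per second, which is the trade the stacked design makes.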

That number matters because of how a model runs. To produce each token, the processor has to read the model’s weights out of memory. The faster it can read them, the faster the tokens come out. For a dense model, which touches every weight on every token, bandwidth is usually what limits speed when generating for a single user. This is why data-centre accelerators with HBM stream big dense models so much faster than a desktop card.
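
The reasoning above can be turned into a rough ceiling: if every weight is read once per token, tokens per second cannot exceed bandwidth divided by the size of the weights. The sketch below assumes that simple model and ignores caches, batching, and compute time; the function name and all figures are hypothetical.

```python
def max_tokens_per_second(params_billion: float,
                          bytes_per_param: float,
                          bandwidth_gb_s: float) -> float:
    """Upper bound on decode speed for a dense model that reads every weight per token."""
    if params_billion <= 0 or bytes_per_param <= 0 or bandwidth_gb_s <= 0:
        raise ValueError("all inputs must be positive")
    # Billions of parameters times bytes per parameter gives gigabytes of weights.
    weight_gb = params_billion * bytes_per_param
    return bandwidth_gb_s / weight_gb


# Hypothetical 70-billion-parameter dense model stored at 2 bytes per weight.
high_bw = max_tokens_per_second(70, 2, bandwidth_gb_s=2800)
low_bw = max_tokens_per_second(70, 2, bandwidth_gb_s=280)

print(f"high-bandwidth memory: at most {high_bw:.1f} tokens/s")
print(f"low-bandwidth memory:  at most {low_bw:.1f} tokens/s")
```

Real throughput sits below this ceiling, but the ratio between the two cases follows the ratio of their bandwidths.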

How does a DGX Spark differ?

A DGX Spark does not use HBM. It uses a shared, system-class memory pool, which trades raw bandwidth for capacity and a small chassis. So the Spark can hold a very large model in memory, where a single HBM card may not have the capacity, but it reads that model out of memory more slowly.

The practical shape: a Spark is happiest with mixture-of-experts models, which activate only a fraction of their weights per token and so ask less of the memory bus. A dense model that reads every weight every token is the case where the missing HBM bandwidth is felt most.
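
A small sketch of why mixture-of-experts models ask less of the memory bus: only the active share of the weights is read per token, so the same bandwidth goes further. It assumes the same simple read-every-active-weight model as a rough bound, and the function names and numbers are made up for illustration.

```python
def gb_read_per_token(total_params_billion: float,
                      active_fraction: float,
                      bytes_per_param: float) -> float:
    """Gigabytes of weights read per token; active_fraction is 1.0 for a dense model."""
    if not 0 < active_fraction <= 1:
        raise ValueError("active_fraction must be in (0, 1]")
    return total_params_billion * active_fraction * bytes_per_param


def bound_tokens_per_second(bandwidth_gb_s: float, gb_per_token: float) -> float:
    """Bandwidth-limited ceiling on tokens per second."""
    return bandwidth_gb_s / gb_per_token


# Hypothetical 100-billion-parameter models at 1 byte per weight, on 200 GB/s memory.
dense_gb = gb_read_per_token(100, active_fraction=1.0, bytes_per_param=1)
moe_gb = gb_read_per_token(100, active_fraction=0.1, bytes_per_param=1)

dense_tps = bound_tokens_per_second(200, dense_gb)
moe_tps = bound_tokens_per_second(200, moe_gb)

print(f"dense: {dense_gb:.0f} GB per token, at most {dense_tps:.1f} tokens/s")
print(f"MoE:   {moe_gb:.0f} GB per token, at most {moe_tps:.1f} tokens/s")
```

Both models occupy the same capacity, yet the mixture-of-experts one has a much higher ceiling because it streams far fewer bytes per token.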

HBM buys you

  • High memory bandwidth, which streams weights fast and lifts token speed
  • Headroom for dense models that touch every weight per token
  • The speed edge a data-centre card holds over a desktop one

HBM does not

  • Live in a DGX Spark, which uses lower-bandwidth shared memory
  • Add capacity by itself; bandwidth and capacity are separate specs
  • Help a model that does not fit in the first place
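
The last two points can be sketched as two separate checks: capacity decides whether a model fits at all, and only then does bandwidth decide how fast it can run. This is a rough sketch under the simple assumption that only the weights count toward memory use; the `Device` class and both devices are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Device:
    name: str
    capacity_gb: float      # how much the memory holds
    bandwidth_gb_s: float   # how fast it can be read


def decode_ceiling(device: Device, weight_gb: float) -> Optional[float]:
    """Return the bandwidth-limited tokens/s ceiling, or None if the weights do not fit."""
    if weight_gb > device.capacity_gb:
        return None  # bandwidth is irrelevant when the model does not fit
    return device.bandwidth_gb_s / weight_gb


# Two made-up devices: one fast but small, one slow but large.
fast_small = Device("fast, small", capacity_gb=80, bandwidth_gb_s=2000)
slow_large = Device("slow, large", capacity_gb=120, bandwidth_gb_s=250)

for weight_gb in (50, 100):
    for device in (fast_small, slow_large):
        ceiling = decode_ceiling(device, weight_gb)
        result = "does not fit" if ceiling is None else f"at most {ceiling:.1f} tokens/s"
        print(f"{weight_gb} GB model on {device.name}: {result}")
```

The fast device wins on the model it can hold and cannot run the larger one at all, which is why the two specs have to be read separately.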

Reviewed: June 2026