HBM (High-Bandwidth Memory) is a stacked memory design used on high-end graphics processors (GPUs) to deliver very high memory bandwidth. When a model generates text, token speed depends largely on how fast its weights can be streamed out of memory, which is why data-centre cards with HBM run dense models quickly. A DGX Spark uses a different, lower-bandwidth memory and is not an HBM machine.
At a glance
What it is: Stacked, very high-bandwidth memory on high-end GPUs
Why it matters: Bandwidth sets token speed; HBM moves weights very fast
On a DGX Spark: Not HBM; it uses lower-bandwidth shared memory
Where you meet it: Data-centre accelerators built for dense large models
Comparison
Memory bandwidth, by where the model runs

| | HBM data-centre GPU | DGX Spark (shared memory) |
| --- | --- | --- |
| Memory type | Stacked High-Bandwidth Memory | Shared system-class memory |
| Bandwidth | Very high | Lower; capacity is the strength, not speed |
| Best at | Dense large models that stream every weight | Large mixture-of-experts models that activate a fraction |
What is HBM and why does bandwidth matter?
HBM stands for High-Bandwidth Memory. It is a memory design in which the memory
chips are stacked vertically and connected to the processor over a very wide
path. The result is high memory bandwidth: a lot of data moved per second.
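To make the "very wide path" idea concrete, here is a rough sketch of how peak bandwidth follows from path width and per-pin data rate. The function name and both sets of figures are illustrative assumptions, not specifications for any particular product.

```python
def peak_bandwidth_gb_s(bus_width_bits: int, gbit_per_s_per_pin: float) -> float:
    """Peak bandwidth in GB/s: width (bits) x per-pin rate (Gbit/s) / 8 bits per byte."""
    return bus_width_bits * gbit_per_s_per_pin / 8


# Illustrative figures only (assumed, not taken from any datasheet).
wide = peak_bandwidth_gb_s(bus_width_bits=1024, gbit_per_s_per_pin=6.4)
narrow = peak_bandwidth_gb_s(bus_width_bits=256, gbit_per_s_per_pin=6.4)

print(f"wide path:   {wide:.1f} GB/s")
print(f"narrow path: {narrow:.1f} GB/s")
```

At the same per-pin rate, a path four times as wide moves four times as much data per second, which is the lever a stacked design pulls.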
Bandwidth matters because of how a model runs. To produce each token, the
processor has to read the model’s weights out of memory. The faster it can read
them, the faster the tokens come out. For a dense model, which touches every
weight on every token, bandwidth is usually what limits your speed. This is why
data-centre accelerators with HBM stream big dense models so much faster than a
desktop card.
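The reasoning above can be turned into a back-of-the-envelope ceiling: if every weight is read once per token, token speed cannot exceed bandwidth divided by model size. The sketch below assumes exactly that and ignores compute time, caches, and batching. The function name, the model size, and both bandwidth figures are hypothetical.

```python
def max_tokens_per_second(bandwidth_gb_s: float,
                          params_billion: float,
                          bytes_per_param: float) -> float:
    """Upper bound on token speed when every weight is read once per token."""
    # 1e9 parameters x bytes per parameter = gigabytes read per token.
    gb_read_per_token = params_billion * bytes_per_param
    return bandwidth_gb_s / gb_read_per_token


# Hypothetical dense model: 70 billion parameters stored at 1 byte each.
fast_memory = max_tokens_per_second(bandwidth_gb_s=3000, params_billion=70, bytes_per_param=1)
slow_memory = max_tokens_per_second(bandwidth_gb_s=300, params_billion=70, bytes_per_param=1)

print(f"high-bandwidth memory:  at most {fast_memory:.1f} tokens/s")
print(f"lower-bandwidth memory: at most {slow_memory:.1f} tokens/s")
```

Real throughput lands below this ceiling, but the ratio between the two cases tracks the ratio of their bandwidths.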
How does a DGX Spark differ?
A DGX Spark does not use HBM. It uses a shared, system-class memory pool, which
trades raw bandwidth for capacity and a small chassis. So a Spark can hold a
very large model without paying for that much HBM capacity, but it reads that
model out of memory more slowly.
The practical shape: a Spark is happiest with mixture-of-experts models, which
activate only a fraction of their weights per token and so ask less of the memory
bus. A dense model that reads every weight every token is the case where the
missing HBM bandwidth is felt most.
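A small sketch of why the two model types behave so differently on the same memory: capacity is driven by total weights, while per-token memory traffic is driven by the weights actually activated. The `Model` class, the parameter counts, and the bandwidth figure are all hypothetical, and the estimate ignores compute time and caches.

```python
from dataclasses import dataclass


@dataclass
class Model:
    name: str
    total_params_billion: float    # decides how much memory the model occupies
    active_params_billion: float   # weights actually read for each token
    bytes_per_param: float = 1.0

    @property
    def footprint_gb(self) -> float:
        return self.total_params_billion * self.bytes_per_param

    @property
    def read_per_token_gb(self) -> float:
        return self.active_params_billion * self.bytes_per_param


def max_tokens_per_second(model: Model, bandwidth_gb_s: float) -> float:
    """Bandwidth-bound ceiling on token speed for the given model."""
    return bandwidth_gb_s / model.read_per_token_gb


# Hypothetical models of equal total size; only the active fraction differs.
dense = Model("dense", total_params_billion=100, active_params_billion=100)
moe = Model("mixture-of-experts", total_params_billion=100, active_params_billion=10)

BANDWIDTH_GB_S = 250  # assumed figure for a lower-bandwidth shared memory pool

for m in (dense, moe):
    ceiling = max_tokens_per_second(m, BANDWIDTH_GB_S)
    print(f"{m.name}: {m.footprint_gb:.0f} GB in memory, "
          f"{m.read_per_token_gb:.0f} GB read per token, "
          f"at most {ceiling:.1f} tokens/s")
```

Both models need the same capacity, but the mixture-of-experts one asks a tenth as much of the memory bus per token, so its ceiling is ten times higher on the same hardware.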
HBM buys you
High memory bandwidth, which streams weights fast and lifts token speed
Headroom for dense models that touch every weight per token
The speed edge a data-centre card holds over a desktop one
HBM does not
Live in a DGX Spark, which uses lower-bandwidth shared memory
Add capacity by itself; bandwidth and capacity are separate specs