Unified memory: one pool the processor and GPU share
Unified memory is a memory architecture in which the central processor (CPU) and the graphics processor (GPU) share one pool of system memory instead of each owning its own. There is no separate dedicated video memory: model weights, request state and the operating system all draw from the same pool. On the NVIDIA GB10 in a DGX Spark, the 128 GB of memory is shared this way, so a model can use whatever part of the pool the operating system and other programs leave free.
At a glance
What it is: One memory pool shared by the processor (CPU) and graphics processor (GPU)
Why it matters: The whole pool can hold a model, so a desk machine fits a very large one
On a DGX Spark: The NVIDIA GB10's 128 GB is shared; there is no separate video memory
The newcomer trap: A GPU tool showing no dedicated video memory is normal here, not a fault
Comparison
Where the model's memory comes from
|                        | Discrete GPU                                          | Unified memory (DGX Spark)                          |
|------------------------|-------------------------------------------------------|-----------------------------------------------------|
| Memory layout          | Separate video memory on the card, plus system memory | One pool, shared by processor and graphics processor |
| Dedicated video memory | A fixed amount soldered to the card                   | None; the GPU tool reports zero, and that is normal |
| How big a model fits   | Capped by the card's video memory                     | Capped by the shared pool the OS leaves free        |
Why is there no dedicated video memory?
On an ordinary desktop the graphics processor (GPU) has its own video memory
soldered to the card, separate from the system memory the processor (CPU) and
the operating system use. Unified memory does away with that split. There is one
pool of memory, and both the processor and the GPU read and write the same
bytes. On the NVIDIA GB10 inside a DGX Spark the 128 GB of memory is shared this
way, so nothing is reserved off to the side as video memory.
This is the part that ambushes newcomers. Open a GPU memory tool on such a
machine and it shows no dedicated video memory. That is not a fault and nothing
is missing. There simply is no separate card memory to report, because the GPU
draws from the same pool as everything else. The number to watch is the size of
that shared pool, not a video-memory line that was never going to be there.
Why does this let a desk machine hold a very large model?
A model’s weights have to sit in memory the GPU can reach before it can serve a
single token. On a discrete card that ceiling is whatever video memory the card
ships with, often far less than the system memory beside it. With unified memory
the ceiling is the whole shared pool. That is why a quiet desk machine can hold a
model of 70 to 120 billion parameters, provided its weights are stored at a
precision low enough to fit: there is no small card memory acting as the
bottleneck, only the one large pool.
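To put rough numbers on that ceiling, here is a minimal sketch of the usual back-of-envelope estimate: weight size is parameter count times bytes per parameter. The function name `weights_gb` and the bit widths tried are illustrative assumptions, not figures from this page, and the estimate ignores the KV cache, activations and runtime overhead, all of which add to the total.

```python
# Rough size of a model's weights: parameter count times bytes per parameter.
# Ignores the KV cache, activations and runtime overhead.

GB = 1e9  # decimal gigabytes, used here for rough comparison with "128 GB"


def weights_gb(params_billion: float, bits_per_param: int) -> float:
    """Return the approximate weight footprint in decimal GB."""
    total_bytes = params_billion * 1e9 * bits_per_param / 8
    return total_bytes / GB


if __name__ == "__main__":
    pool_gb = 128  # shared pool size quoted in the text
    for params in (70, 120):
        for bits in (16, 8, 4):  # assumed storage precisions
            size = weights_gb(params, bits)
            verdict = "under" if size < pool_gb else "over"
            print(f"{params}B at {bits}-bit: {size:.0f} GB ({verdict} the {pool_gb} GB pool)")
```

Under these assumptions a 70-billion-parameter model at 16 bits per weight comes to about 140 GB and overshoots the pool, while the same model at 8 or 4 bits per weight fits with room to spare.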
The catch is the flip side of the same fact. Because the operating system, your
browser and any background service draw from that pool too, the memory they hold
is memory the model cannot use. The pool is large, but it is shared, so the honest
question is not “how much video memory do I have” but “how much of the shared
pool is free right now”. Run it dry and you still hit an out-of-memory (OOM)
error, the same wall as anywhere else, reached from one pool instead of two.
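The competition for the pool can be sketched as simple subtraction. Every figure below (the per-program usage and the model footprint) is a hypothetical example chosen for illustration, and the function names are made up for this sketch.

```python
# Sketch: how much of a shared pool is left for a model once other users are counted.
# All usage figures are hypothetical examples, not measurements.


def free_for_model_gb(pool_gb: float, other_usage_gb: dict) -> float:
    """Pool size minus what the OS, browser and services already hold."""
    return pool_gb - sum(other_usage_gb.values())


def fits(model_gb: float, pool_gb: float, other_usage_gb: dict) -> bool:
    """True if the model fits in what the other users leave free."""
    return model_gb <= free_for_model_gb(pool_gb, other_usage_gb)


if __name__ == "__main__":
    pool_gb = 128  # shared pool size quoted in the text
    light = {"operating system": 6, "background services": 2}
    heavy = {"operating system": 6, "background services": 2, "browser": 12}
    model_gb = 115  # hypothetical model footprint

    print(fits(model_gb, pool_gb, light))  # 120 GB left for the model
    print(fits(model_gb, pool_gb, heavy))  # 108 GB left for the model
```

The same model fits in the first case and not in the second, although the pool itself never changed size; only the other tenants did.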
Check it yourself
free -h
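On a unified-memory box the Mem row describes the one pool that holds your model, and the available column estimates how much of it a new workload can still claim. A discrete GPU's video memory would not show here at all, since it sits on the card; here there is no separate card memory to hide.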
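For use inside a script, the same figure can be read from `/proc/meminfo`, the Linux kernel file that supplies the `MemAvailable` value behind that column. This is a minimal sketch that assumes a Linux system; the helper names `parse_meminfo` and `available_gib` are made up for illustration.

```python
from pathlib import Path


def parse_meminfo(text: str) -> dict:
    """Map each /proc/meminfo field name to its numeric value (kB for most fields)."""
    fields = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def available_gib(text: str) -> float:
    """MemAvailable converted from kB (1024-byte units) to GiB."""
    return parse_meminfo(text)["MemAvailable"] / (1024 * 1024)


if __name__ == "__main__":
    meminfo = Path("/proc/meminfo")
    if meminfo.exists():  # present on Linux only
        print(f"{available_gib(meminfo.read_text()):.1f} GiB available in the shared pool")
```

Comparing that number against a model's expected footprint before loading it is one way to catch an out-of-memory error early.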
Why it helps
The whole pool can back a model, so a desk machine fits one of 70 to 120 billion parameters
No copying weights across a card boundary before the GPU can use them
Capacity scales with how much memory the box has, not with a fixed card spec
What it will not change
Memory bandwidth; capacity is not speed, and bandwidth sets how fast tokens come
The competition for the pool: the OS and a stray browser hold memory the model cannot use
The out-of-memory (OOM) wall, which still arrives when the shared pool runs dry
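The bandwidth point in the list above can be made concrete with a common rule of thumb: when a dense model generates one token at a time, each token requires reading roughly all the weights once, so memory bandwidth divided by weight size gives an upper bound on tokens per second. The sketch below assumes that simplification; the bandwidth value is a hypothetical placeholder, not a quoted GB10 figure, and real throughput will be lower.

```python
# Rule-of-thumb ceiling on decode speed for a dense model serving one stream:
# each new token reads roughly every weight once, so speed <= bandwidth / size.


def decode_tokens_per_second_bound(weights_gb: float, bandwidth_gb_per_s: float) -> float:
    """Upper bound on tokens per second; real throughput will be lower."""
    if weights_gb <= 0:
        raise ValueError("weights_gb must be positive")
    return bandwidth_gb_per_s / weights_gb


if __name__ == "__main__":
    bandwidth = 250.0  # GB/s, hypothetical placeholder value
    for size in (35.0, 70.0, 120.0):  # hypothetical weight footprints in GB
        bound = decode_tokens_per_second_bound(size, bandwidth)
        print(f"{size:.0f} GB of weights: at most {bound:.1f} tokens/s")
```

The larger the model that the pool lets you load, the lower this ceiling falls, which is why capacity and speed have to be judged separately.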