Unified memory: one pool the processor and GPU share

Unified memory is a memory architecture in which the central processor (CPU) and the graphics processor (GPU) share one pool of system memory instead of each owning its own. There is no separate dedicated video memory: weights, request state and the operating system all draw from the same pool. On the NVIDIA GB10 in a DGX Spark, the 128 GB of memory is shared this way, so a model can draw on the whole pool, minus whatever the operating system and other processes are holding.

At a glance

What it is: One memory pool shared by the processor (CPU) and graphics processor (GPU)
Why it matters: The whole pool can hold a model, so a desk machine fits a very large one
On a DGX Spark: The NVIDIA GB10's 128 GB is shared; there is no separate video memory
The newcomer trap: A GPU tool showing no dedicated video memory is normal here, not a fault

Comparison

Where the model's memory comes from

                       | Discrete GPU                                          | Unified memory (DGX Spark)
-----------------------|-------------------------------------------------------|-----------------------------------------------------------------------
Memory layout          | Separate video memory on the card, plus system memory | One pool, shared by processor and graphics processor
Dedicated video memory | A fixed amount soldered to the card                   | None; a GPU tool reports no dedicated video memory, and that is normal
How big a model fits   | Capped by the card's video memory                     | Capped by the shared pool the OS leaves free

Why is there no dedicated video memory?

On an ordinary desktop the graphics processor (GPU) has its own video memory soldered to the card, separate from the system memory the processor (CPU) and the operating system use. Unified memory does away with that split. There is one pool of memory, and both the processor and the GPU read and write the same bytes. On the NVIDIA GB10 inside a DGX Spark the 128 GB of memory is shared this way, so nothing is reserved off to the side as video memory.

This is the part that ambushes newcomers. Open a GPU memory tool on such a machine and it shows no dedicated video memory. That is not a fault and nothing is missing. There simply is no separate card memory to report, because the GPU draws from the same pool as everything else. The number to watch is the size of that shared pool, not a video-memory line that was never going to be there.

Why does this let a desk machine hold a very large model?

A model’s weights have to sit in memory the GPU can reach before the model can serve a single token. On a discrete card that ceiling is whatever video memory the card ships with, often far less than the system memory beside it. With unified memory the ceiling is the whole shared pool. That is why a quiet desk machine can hold a model of 70 to 120 billion parameters, provided the weights are stored at a precision that fits in 128 GB: there is no small card memory acting as the bottleneck, only the one large pool.
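The sizing argument above comes down to simple arithmetic, sketched here under stated assumptions. The function names and the bit widths (16, 8 and 4 bits per weight) are illustrative choices, not figures from this page, and the sketch counts weights only: request state and the operating system need room in the same pool on top of this.

```python
# Rough sizing sketch: bytes needed for model weights alone.
# Assumptions (not from the source text): bits-per-weight values are
# illustrative, and "128 GB" is taken as 128 * 10**9 bytes.

POOL_BYTES = 128e9  # the shared pool described in the text


def weight_bytes(params: float, bits_per_weight: int) -> float:
    """Approximate bytes for the weights: parameters * bits / 8."""
    return params * bits_per_weight / 8


def fits_in_pool(params: float, bits_per_weight: int,
                 pool_bytes: float = POOL_BYTES) -> bool:
    """True if the weights alone are no larger than the pool.

    This ignores request state, the OS and other processes, so a True
    here is necessary for the model to load, not sufficient.
    """
    return weight_bytes(params, bits_per_weight) <= pool_bytes


if __name__ == "__main__":
    for params in (70e9, 120e9):
        for bits in (16, 8, 4):
            gb = weight_bytes(params, bits) / 1e9
            verdict = "fits" if fits_in_pool(params, bits) else "does not fit"
            print(f"{params / 1e9:.0f}B parameters at {bits}-bit: "
                  f"{gb:.0f} GB of weights, {verdict} in 128 GB")
```

Under these assumptions a 70 billion parameter model at 16 bits per weight needs 140 GB and does not fit, while the same model at 8 bits needs 70 GB and does. The pool size sets the ceiling, but the precision of the stored weights decides which side of it a model lands on.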

The catch is the flip side of the same fact. Because the operating system, your browser and any background service draw from that pool too, the memory they hold is memory the model cannot use. The pool is large, but it is shared, so the honest question is not “how much video memory do I have” but “how much of the shared pool is free right now”. Run it dry and you still hit an out-of-memory (OOM) error, the same wall as anywhere else, reached from one pool instead of two.

Check it yourself

free -h

On a unified-memory box this one pool is what holds your model, so read the “available” column of the Mem row rather than looking for a video-memory figure. A discrete GPU's video memory would not show here at all, since it sits on the card; here there is no separate card memory to hide.
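The same check can be scripted. The sketch below assumes a Linux system, where /proc/meminfo exposes a MemAvailable field (the figure that free reports as “available”). The helper names and the SAMPLE text are hypothetical, made up for illustration rather than captured from a DGX Spark.

```python
# Sketch: read how much of the shared pool is available right now.
# Assumes Linux; /proc/meminfo reports most fields in kB (1024-byte units).


def parse_meminfo(text: str) -> dict:
    """Parse /proc/meminfo-style text into {field name: value in bytes}.

    Fields without a "kB" unit (plain counts) are kept as-is.
    """
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if not sep or not parts:
            continue  # skip lines that are not "Name: value [unit]"
        value = int(parts[0])
        if len(parts) > 1 and parts[1] == "kB":
            value *= 1024
        fields[name.strip()] = value
    return fields


def available_bytes(path: str = "/proc/meminfo") -> int:
    """Bytes the kernel estimates are available for new workloads."""
    with open(path) as f:
        return parse_meminfo(f.read())["MemAvailable"]


# Made-up sample text for illustration only, not real machine output.
SAMPLE = """MemTotal:       1000000 kB
MemFree:         200000 kB
MemAvailable:    600000 kB
HugePages_Total:       0
"""

if __name__ == "__main__":
    print(parse_meminfo(SAMPLE)["MemAvailable"], "bytes available in the sample")
    try:
        print(f"{available_bytes() / 2**30:.1f} GiB available on this machine")
    except (OSError, KeyError):
        print("No usable /proc/meminfo here; this check assumes Linux.")
```

Comparing this number with the size of the weights before loading a model gives an early warning, instead of discovering the shortfall as an out-of-memory error.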

Why it helps

  • The whole pool can back a model, so a desk machine fits a model of 70 to 120 billion parameters
  • No copying weights across a card boundary before the GPU can use them
  • Capacity scales with how much memory the box has, not with a fixed card spec

What it will not change

  • Memory bandwidth; capacity is not speed, and bandwidth sets how fast tokens come
  • The competition for the pool: the OS and a stray browser hold memory the model cannot use
  • The out-of-memory (OOM) wall, which still arrives when the shared pool runs dry
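The point that capacity is not speed can be made concrete with a common rule of thumb, sketched here as an assumption rather than a measurement: when generating one token at a time, a dense model reads roughly all of its weights once per token, so memory bandwidth divided by the weight size gives a rough ceiling on tokens per second. The function name and the numbers below are hypothetical placeholders, not specifications of any machine.

```python
# Sketch of a bandwidth-bound ceiling on token generation speed.
# Assumption: each generated token requires reading about all the weights
# once. Real throughput also depends on batching, caching and model design.


def decode_tokens_per_second_ceiling(bandwidth_bytes_per_s: float,
                                     bytes_read_per_token: float) -> float:
    """Upper bound on tokens/s if memory reads are the only cost."""
    if bandwidth_bytes_per_s <= 0 or bytes_read_per_token <= 0:
        raise ValueError("bandwidth and bytes per token must be positive")
    return bandwidth_bytes_per_s / bytes_read_per_token


if __name__ == "__main__":
    # Placeholder values for illustration only.
    bandwidth = 200e9      # hypothetical: 200 GB/s of memory bandwidth
    weights = 50e9         # hypothetical: 50 GB of weights read per token
    ceiling = decode_tokens_per_second_ceiling(bandwidth, weights)
    print(f"Ceiling: about {ceiling:.1f} tokens per second")
```

With these placeholder values the ceiling is about 4 tokens per second, however much of the pool is still free. A larger pool lets a bigger model load; it does not make that model read its weights any faster.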

Reviewed: June 2026