Unified memory: one pool the processor and GPU share

Unified memory is a memory architecture in which the central processor (CPU) and the graphics processor (GPU) share one pool of system memory instead of each owning its own. There is no separate dedicated video memory: weights, request state and the operating system all draw from the same pool. On the NVIDIA GB10 in a DGX Spark, the 128 GB of memory is shared this way, so a model can draw on the whole pool, less whatever the operating system and other programs are holding.

At a glance

What it is: one memory pool shared by the processor (CPU) and graphics processor (GPU)

Why it matters: the whole pool can hold a model, so a desk machine can fit a very large one

On a DGX Spark: the NVIDIA GB10's 128 GB is shared; there is no separate video memory

The newcomer trap: a GPU tool showing no dedicated video memory is normal here, not a fault

Why is there no dedicated video memory?

On an ordinary desktop the graphics processor (GPU) has its own video memory soldered to the card, separate from the system memory the processor (CPU) and the operating system use. Unified memory does away with that split. There is one pool of memory, and both the processor and the GPU read and write the same bytes. On the NVIDIA GB10 inside a DGX Spark the 128 GB of memory is shared this way, so nothing is reserved off to the side as video memory.

This is the part that ambushes newcomers. Open a GPU memory tool on such a machine and it shows no dedicated video memory. That is not a fault and nothing is missing. There simply is no separate card memory to report, because the GPU draws from the same pool as everything else. The number to watch is the size of that shared pool, not a video-memory line that was never going to be there.

Why does this let a desk machine hold a very large model?

A model’s weights have to sit in memory the GPU can reach before the model can serve a single token. On a discrete card that ceiling is whatever video memory the card ships with, often far less than the system memory beside it. With unified memory the ceiling is the whole shared pool. That is why a quiet desk machine can hold a model of 70 to 120 billion parameters, provided the weights are stored compactly enough to fit in 128 GB: there is no small card memory acting as the bottleneck, only the one large pool.
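The ceiling can be checked with simple arithmetic. The sketch below is illustrative only: it counts the weights alone, treats a GB as 10^9 bytes, and the bits-per-weight values (16, 8, 4) are assumed storage precisions that the text does not specify. The function name `weights_gb` is made up for this example.

```python
POOL_GB = 128  # size of the shared pool on the GB10, as stated in the text


def weights_gb(params_billion: float, bits_per_weight: int) -> float:
    """Approximate size of the weights alone, in GB (10^9 bytes).

    One billion parameters at 8 bits each is one GB, so the result is
    simply billions of parameters times bytes per parameter.
    """
    return params_billion * bits_per_weight / 8


if __name__ == "__main__":
    # Assumed precisions; the source does not say how the weights are stored.
    for params in (70, 120):
        for bits in (16, 8, 4):
            size = weights_gb(params, bits)
            verdict = "fits" if size < POOL_GB else "does not fit"
            print(f"{params}B at {bits}-bit: {size:.0f} GB, {verdict} in {POOL_GB} GB")
```

A result close to 128 GB should be read as a warning rather than a pass, since request state, the operating system and other programs need room in the same pool.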

The catch is the flip side of the same fact. Because the operating system, your browser and any background service draw from that pool too, whatever memory they hold is memory the model cannot use. The pool is large, but it is shared, so the honest question is not “how much video memory do I have” but “how much of the shared pool is free right now”. Run it dry and you still hit an out-of-memory (OOM) error, the same wall as anywhere else, reached from one pool instead of two.
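One way to ask "how much of the shared pool is free right now" is to read the system's own memory counters instead of a GPU tool. The sketch below assumes the machine runs Linux and that the kernel's `MemAvailable` estimate in `/proc/meminfo` is a reasonable proxy for the free part of the shared pool. The helper names `parse_meminfo` and `free_pool_gib` are made up for this example.

```python
from pathlib import Path


def parse_meminfo(text: str) -> dict[str, int]:
    """Map each /proc/meminfo field name to its numeric value.

    Most fields are reported in kB; a few counters carry no unit.
    """
    fields: dict[str, int] = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def free_pool_gib(text: str) -> float:
    """Return MemAvailable converted from kB to GiB."""
    return parse_meminfo(text)["MemAvailable"] / (1024 ** 2)


if __name__ == "__main__":
    meminfo = Path("/proc/meminfo")
    if meminfo.exists():  # present on Linux only
        available = free_pool_gib(meminfo.read_text())
        print(f"{available:.1f} GiB of the pool is available")
```

Comparing that figure with the size of the weights before loading a model gives an early warning of an OOM error, though it is only an estimate taken at one moment.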
