VRAM (video random-access memory) is the memory attached to a graphics processor (GPU), where a model's weights and live request data sit while it runs. How much you have caps how big a model and how long a context you can hold. On a unified-memory box like the DGX Spark there is no separate VRAM: the GPU and the operating system draw from one shared pool.
At a glance
What it is: The memory a GPU uses for model weights and live request data
Why it matters: It caps how big a model and how long a context you can run
On a DGX Spark: No dedicated VRAM; one pool is shared with the OS
When you run out: You get an OOM (an out-of-memory error)
Comparison
| Where the model's memory lives | Discrete GPU | Unified memory (DGX Spark) |
| --- | --- | --- |
| Model memory | Dedicated VRAM on the card | Shared system pool |
| Does the OS compete? | No, the OS uses separate system memory | Yes, OS and model draw from one pool |
| Your headroom | Fixed by the card you bought | Whatever the OS leaves free |
Why does VRAM decide what you can run?
Everything a model needs while it runs has to fit in memory the GPU can reach:
the weights, the key-value (KV) cache that grows with context length, and some
scratch space. VRAM is that memory. A 70B-parameter model in a small weight format
(say, 4 bits per weight) needs about 35 GB for the weights alone, before you
have served a single token, so the size of the model you
can run is set, first of all, by how much VRAM you have. Run a longer context
and the KV cache grows into the same space, which is why “it fit yesterday” and
“it OOMs today” can both be true.
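The arithmetic behind that paragraph can be sketched in a few lines. This is a rough estimate, not a measurement: the layer count, KV-head count, and head dimension below are illustrative assumptions for a 70B-class model, and the 40 GB of free memory and 2 GB of scratch space are made-up figures chosen to show the effect. Real runtimes add their own overhead.

```python
GB = 1e9  # decimal gigabytes


def weight_bytes(n_params, bits_per_weight):
    """Memory needed for the weights alone."""
    return n_params * bits_per_weight / 8


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len,
                   bytes_per_value=2, batch_size=1):
    """KV cache size: one key and one value vector per layer, KV head, and token."""
    return (2 * n_layers * n_kv_heads * head_dim
            * context_len * bytes_per_value * batch_size)


def fits(free_bytes, n_params, bits_per_weight, context_len, scratch_bytes, **shape):
    """True if weights + KV cache + scratch fit in the memory that is free."""
    total = (weight_bytes(n_params, bits_per_weight)
             + kv_cache_bytes(context_len=context_len, **shape)
             + scratch_bytes)
    return total <= free_bytes


# Illustrative assumptions, not the specs of any particular model or machine.
SHAPE = dict(n_layers=80, n_kv_heads=8, head_dim=128)
FREE = 40 * GB     # hypothetical free memory
SCRATCH = 2 * GB   # hypothetical scratch allowance

short_ok = fits(FREE, 70e9, 4, 8_192, SCRATCH, **SHAPE)   # 8k-token context
long_ok = fits(FREE, 70e9, 4, 32_768, SCRATCH, **SHAPE)   # 32k-token context
print(short_ok, long_ok)
```

Under these assumptions the weights take 35 GB either way. The 8k context adds under 3 GB and fits; the 32k context adds almost 11 GB and does not. Same model, same machine, different outcome.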
What changes on a unified-memory box?
A discrete GPU has its own VRAM soldered to the card, separate from the system
RAM (random-access memory) the operating system uses. A DGX Spark does not work that way. There is one
pool of memory, and the GPU and the OS both draw from it. That is a feature: it
is how the box fits a large model in a small chassis. But it has a catch: the
memory your browser and background services hold is memory your model cannot use. So
“how much VRAM do I have” becomes “how much of the shared pool is free right
now”, and the answer moves while you work.
The practical upshot: treat the number from nvidia-smi as a live budget, not a
fixed spec. The ceiling is fixed. What is free under it is not.
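One way to watch that live budget from a script is to read the kernel's own estimate of available memory. The sketch below assumes a Linux system where the shared pool is ordinary system memory, so the MemAvailable field in /proc/meminfo is a reasonable proxy for what a model could claim. That equivalence is an assumption, not something a GPU runtime guarantees, and the helper names are hypothetical.

```python
def parse_mem_available_kib(meminfo_text):
    """Return the MemAvailable value (in KiB) from /proc/meminfo-style text."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            # Line format: "MemAvailable:   12345678 kB"
            return int(line.split()[1])
    raise ValueError("MemAvailable field not found")


def free_pool_gb(path="/proc/meminfo"):
    """Read the current estimate of available memory, in decimal gigabytes."""
    with open(path) as f:
        kib = parse_mem_available_kib(f.read())
    return kib * 1024 / 1e9
```

Calling `free_pool_gb()` before loading a model, and again after opening a browser or starting a background service, shows the budget moving while the ceiling stays put.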