
VRAM: the memory that decides what you can run

VRAM (video random-access memory) is the memory attached to a graphics processor (GPU), where a model's weights and live request data sit while it runs. How much you have caps how big a model and how long a context you can hold. On a unified-memory box like the DGX Spark there is no separate VRAM: the GPU and the operating system draw from one shared pool.

At a glance

  • What it is: the memory a GPU uses for model weights and live request data
  • Why it matters: it caps how big a model and how long a context you can run
  • On a DGX Spark: no dedicated VRAM; one pool is shared with the OS
  • When you run out: you get an OOM (an out-of-memory error)
Comparison

Where the model's memory lives

| | Discrete GPU | Unified memory (DGX Spark) |
|---|---|---|
| Model memory | Dedicated VRAM on the card | Shared system pool |
| Does the OS compete? | No, the OS uses separate system memory | Yes, OS and model draw from one pool |
| Your headroom | Fixed by the card you bought | Whatever the OS leaves free |

Why does VRAM decide what you can run?

Everything a model needs while it runs has to fit in memory the GPU can reach: the weights, the key-value (KV) cache that grows with context length, and some scratch space. VRAM is that memory. A 70B model quantized to 4 bits per weight needs about 35 GB for the weights alone, before you have served a single token, so the size of the model you can run is set, first of all, by how much VRAM you have. Run a longer context and the KV cache grows into the same space, which is why “it fit yesterday” and “it OOMs today” can both be true.
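To make the arithmetic concrete, here is a minimal sketch of the two largest terms: weights and KV cache. The layer count, KV head count and head dimension are hypothetical values chosen only for illustration, and the sketch assumes 16-bit cache entries and ignores scratch space.

```python
def weight_bytes(n_params: float, bits_per_weight: float) -> float:
    """Bytes needed to hold the weights alone."""
    return n_params * bits_per_weight / 8


def kv_cache_bytes(tokens: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """Bytes of KV cache for `tokens` cached positions.

    Each position stores one key vector and one value vector
    (hence the factor 2) per layer and per KV head.
    """
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * tokens


GB = 1e9

# 70B parameters at 4 bits per weight.
weights = weight_bytes(70e9, 4)

# Hypothetical model shape, for illustration only.
shape = dict(n_layers=80, n_kv_heads=8, head_dim=128)
short_ctx = kv_cache_bytes(8_192, **shape)
long_ctx = kv_cache_bytes(32_768, **shape)

print(f"weights:        {weights / GB:.1f} GB")
print(f"KV cache @ 8k:  {short_ctx / GB:.1f} GB")
print(f"KV cache @ 32k: {long_ctx / GB:.1f} GB")
```

Under these assumptions, quadrupling the context quadruples the cache, while the weights stay the same size.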

What changes on a unified-memory box?

A discrete GPU has its own VRAM soldered to the card, separate from the system RAM (random-access memory) the operating system uses. A DGX Spark does not work that way. There is one pool of memory, and the GPU and the OS both draw from it. That is a feature: it is how the box fits a large model in a small chassis. But it has a catch: the memory your browser and background services hold is memory your model cannot use. So “how much VRAM do I have” becomes “how much of the shared pool is free right now”, and the answer moves while you work.

The practical upshot: treat the free-memory figure from nvidia-smi as a live budget, not a fixed spec. The ceiling (total) is fixed. What is free under it is not.
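On a Linux unified-memory box, one way to sample that live budget is the MemAvailable line in /proc/meminfo. The sketch below assumes that figure is a fair stand-in for what the model can still claim from the shared pool, which may not hold exactly on every system. The `fits` helper and its 4 GiB reserve are hypothetical, and the sample text stands in for the real file.

```python
def mem_available_bytes(meminfo_text: str) -> int:
    """Return the MemAvailable figure from /proc/meminfo text, in bytes."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            # Line format: "MemAvailable:   12345678 kB" (units of 1024 bytes).
            return int(line.split()[1]) * 1024
    raise ValueError("MemAvailable not found")


def fits(required_bytes: float, available_bytes: float,
         reserve_bytes: float = 4 * 1024**3) -> bool:
    """True if the allocation fits while leaving a reserve for the OS.

    The 4 GiB default is an arbitrary illustrative value, not a recommendation.
    """
    return required_bytes + reserve_bytes <= available_bytes


# Sample text standing in for open("/proc/meminfo").read().
sample = """MemTotal:       125000000 kB
MemFree:         80000000 kB
MemAvailable:    90000000 kB
"""

free_now = mem_available_bytes(sample)
print(fits(35e9, free_now))    # a 35 GB model against the sample pool
print(fits(100e9, free_now))   # a 100 GB model against the same pool
```

Because the pool is shared, the same check can pass now and fail a minute later, so it is worth running right before loading a model.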

Check it yourself

nvidia-smi --query-gpu=memory.total,memory.used,memory.free --format=csv

Total is your ceiling, used is what the model and everything else hold right now, and free is the headroom you have before the next allocation OOMs.
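If you want these numbers in a script, a small parser is enough. This sketch assumes you append `noheader,nounits` to the `--format=csv` option so each GPU prints one bare row of MiB values. Some systems may report a placeholder instead of a number for these fields; the parser treats anything non-numeric as unknown. The function name is hypothetical.

```python
import csv
import io


def parse_memory_csv(text: str) -> list[dict]:
    """Parse rows of "total, used, free" in MiB, one row per GPU.

    Expects output produced with --format=csv,noheader,nounits.
    Fields that are not plain integers become None.
    """
    rows = []
    for fields in csv.reader(io.StringIO(text), skipinitialspace=True):
        if not fields:
            continue  # skip blank lines
        values = []
        for field in fields:
            field = field.strip()
            values.append(int(field) if field.isdigit() else None)
        rows.append(dict(zip(("total", "used", "free"), values)))
    return rows


# Sample text standing in for real command output, which could be captured with
# subprocess.run([...], capture_output=True, text=True).stdout
sample_output = "24576, 1024, 23552\n"

for gpu in parse_memory_csv(sample_output):
    print(gpu)
```

A `None` in the result means the tool did not give a usable figure, in which case a system-level memory reading is the fallback.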

More VRAM lets you

  • Load a larger model without quantizing it down
  • Hold a longer context, since the KV cache has room to grow
  • Keep more requests in flight at once
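The last two points come down to the same division: whatever is left after the weights is split among KV caches. A rough sketch, assuming every request uses its full context, the cache costs a fixed number of bytes per token, and scratch space is ignored; the per-token figure is a hypothetical value.

```python
def max_requests(free_bytes: float, weight_bytes: float,
                 kv_bytes_per_token: int, context_tokens: int) -> int:
    """Full-length requests whose KV caches fit beside the weights."""
    headroom = free_bytes - weight_bytes
    if headroom <= 0:
        return 0  # the weights alone do not fit
    return int(headroom // (kv_bytes_per_token * context_tokens))


KV_PER_TOKEN = 327_680   # hypothetical: bytes of KV cache per token
WEIGHTS = 35e9           # 70B parameters at 4 bits per weight

for pool_gb in (30, 48, 96):
    n = max_requests(pool_gb * 1e9, WEIGHTS, KV_PER_TOKEN, 8_192)
    print(f"{pool_gb} GB free -> {n} requests at 8k context")
```

Real serving stacks allocate the cache in blocks and rarely fill every context, so treat the result as a planning figure and not a guarantee.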

It will not fix

  • A slow memory bus; capacity is not bandwidth, and bandwidth sets your speed
  • A model that is simply too big even after quantization
  • Wasteful batching that squanders the headroom you already have
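The bandwidth point can be put in numbers with a common rule of thumb: when generating one token at a time for a single request, a dense model reads roughly all of its weights once per token, so memory bandwidth divided by weight size gives a ceiling on tokens per second. This is a rough upper bound under those assumptions, and the bandwidth figures below are hypothetical.

```python
def decode_ceiling_tokens_per_s(bandwidth_bytes_per_s: float,
                                bytes_read_per_token: float) -> float:
    """Upper bound on single-request decode speed when memory-bound."""
    return bandwidth_bytes_per_s / bytes_read_per_token


WEIGHTS = 35e9  # 70B parameters at 4 bits per weight

# Hypothetical bandwidths in bytes per second, for illustration only.
for label, bandwidth in (("slower bus", 250e9), ("faster bus", 1000e9)):
    ceiling = decode_ceiling_tokens_per_s(bandwidth, WEIGHTS)
    print(f"{label}: at most {ceiling:.1f} tokens/s")
```

Both machines in this sketch hold the same model, which is the point: extra capacity decides whether it loads, not how fast it answers.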


Reviewed: June 2026