Out of memory (OOM) is what happens when a model asks for more memory than is free at that moment: the allocation fails and the process is killed. On your own hardware it is the most common wall you will hit, and almost always a tuning problem, not a hardware one.
At a glance
What it is: A failed memory allocation that kills the running process
Why it bites locally: Weights, the key-value (KV) cache and scratch can exceed one shared pool
First thing to try: Free other memory, then shorten the context length
How worried to be: Low. It is recoverable and almost always tunable
Stack
What fills the memory budget
Weights, the key-value (KV) cache and scratch share one pool. The green band is your free headroom; when they leave none of it, the next allocation fails, and that is the OOM.
4. Free headroom (keep this above zero): what is left, and it is shared with the OS
3. Scratch: transient working space per request
2. Key-value (KV) cache: grows with context length, the usual culprit
1. Weights: the model itself, a fixed size once loaded
What actually runs out?
A model needs room for three things at once: its weights, the working state of
the request (the KV cache, a key-value store of past tokens that grows with
context length), and a bit of scratch space. When the sum of those is larger
than the free memory, the allocation fails and the process is killed. No
ceremony, no warning shot.
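The three consumers can be added up before a run. The following is a minimal sketch, not a definitive sizing tool: the model shape, the 24 GiB of free memory and the flat 1 GiB scratch allowance are all illustrative assumptions, and the names ModelShape, kv_cache_bytes and fits are made up for this example. The KV cache term uses the usual estimate of one key and one value vector per token, per layer, per KV head.

```python
from dataclasses import dataclass

GIB = 1024 ** 3


@dataclass
class ModelShape:
    # Illustrative numbers only; read the real ones from your model's config.
    n_params: int            # total parameter count
    n_layers: int
    n_kv_heads: int
    head_dim: int
    bytes_per_weight: float  # 2.0 for 16-bit weights, roughly 0.5 for 4-bit
    bytes_per_kv: int = 2    # 16-bit cache entries (assumption)


def weights_bytes(m):
    return m.n_params * m.bytes_per_weight


def kv_cache_bytes(m, context_len, batch=1):
    # Factor 2: one key and one value vector per token, per layer, per KV head.
    return 2 * m.n_layers * m.n_kv_heads * m.head_dim * context_len * batch * m.bytes_per_kv


def fits(m, context_len, free_bytes, scratch_bytes=1 * GIB, batch=1):
    """Return (does it fit, bytes needed). The scratch figure is a rough guess."""
    needed = weights_bytes(m) + kv_cache_bytes(m, context_len, batch) + scratch_bytes
    return needed <= free_bytes, needed


shape = ModelShape(n_params=8_000_000_000, n_layers=32, n_kv_heads=8,
                   head_dim=128, bytes_per_weight=2.0)

for ctx in (4096, 32768, 131072):
    ok, needed = fits(shape, ctx, free_bytes=24 * GIB)
    print(f"context {ctx:>6}: needs {needed / GIB:5.1f} GiB -> {'fits' if ok else 'OOM'}")
```

The weights term stays constant across the loop; only the KV cache term moves with the context length, which is why a longer prompt can tip a previously working setup over the edge.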
On a normal desktop the model lives in dedicated GPU (graphics processing unit)
memory, the VRAM, and the operating system lives in system RAM (random-access
memory), so the two rarely fight. On a unified-memory box like the
DGX Spark they share one pool. That is the part that ambushes newcomers: a
forgotten second model, or a browser doing what browsers do, can quietly eat the
headroom your inference run needed. The OOM looks like it came from nowhere. It
did not. It came from the other thing.
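On a shared pool, the figure that matters is what the operating system reports as available right now, not the size of the machine. A minimal sketch, assuming a Linux host that exposes /proc/meminfo with a MemAvailable entry; parse_meminfo and available_gib are hypothetical helper names for this example.

```python
import os


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (bytes where a kB unit is given)."""
    info = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if not fields:
            continue
        value = int(fields[0])
        # Most entries carry a kB unit; a few (such as HugePages_Total) are plain counts.
        if len(fields) > 1 and fields[1] == "kB":
            value *= 1024
        info[key.strip()] = value
    return info


def available_gib(path="/proc/meminfo"):
    """Memory the kernel estimates is available for new allocations, in GiB."""
    with open(path) as f:
        return parse_meminfo(f.read())["MemAvailable"] / 1024 ** 3


if __name__ == "__main__" and os.path.exists("/proc/meminfo"):
    print(f"available now: {available_gib():.1f} GiB")
```

Running it once before and once after closing the browser or the forgotten second model shows how much headroom "the other thing" was holding.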
How do you recognise an OOM?
You see | It usually means
CUDA out of memory / torch ... OutOfMemoryError | The model could not fit weights or cache in memory
The process dies with Killed and no stack trace | The OS OOM-killer stepped in to save the system from itself
It worked yesterday, fails today on a longer prompt | The KV cache grew with context length
Fails only on restart, fine after | Stale memory from the previous run was never handed back
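The "Killed and no stack trace" case can be spotted from a launcher script. This is a sketch under assumptions: it is POSIX-only, describe_exit is a hypothetical helper, and the child process here is a stand-in that kills itself rather than a real inference run. Python's subprocess module reports death by signal as a negative return code, so SIGKILL shows up as -9. That is consistent with the OOM-killer but does not prove it, since anything that sends SIGKILL looks the same.

```python
import signal
import subprocess
import sys


def describe_exit(returncode):
    """Turn a subprocess return code into a hint about how the process ended."""
    if returncode == 0:
        return "exited cleanly"
    if returncode == -signal.SIGKILL:
        # Negative return code = killed by that signal number (POSIX only).
        # A hint, not proof: the kernel log is where an OOM kill is recorded.
        return "killed by SIGKILL (possibly the OOM-killer; check the kernel log)"
    if returncode < 0:
        return f"killed by signal {-returncode}"
    return f"exited with error code {returncode}"


# Stand-in for an inference run: a child process that sends itself SIGKILL.
child = [sys.executable, "-c", "import os, signal; os.kill(os.getpid(), signal.SIGKILL)"]
result = subprocess.run(child)
print(describe_exit(result.returncode))
```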
OOM is a budgeting problem, not a defeat. Once you can read which of the three
consumers blew the budget, the error stops being a wall and becomes a number you
tune. You will still hit it now and then. You will just stop taking it
personally.
Check it yourself
watch -n1 nvidia-smi
Watch the memory-used column climb toward the limit as the context grows. The OOM is the moment a new allocation no longer fits under it.
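The same reading can be taken from a script, which helps when you want to log headroom alongside the context length. A minimal sketch, assuming nvidia-smi is on the PATH; parse_gpu_memory and headroom_mib are hypothetical names. Some platforms report a non-numeric placeholder instead of a number, so the parser returns None for those rows rather than guessing.

```python
import subprocess

# Query flags for machine-readable output: one "used, total" line per GPU, in MiB.
QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_memory(output):
    """Parse 'used, total' lines into (used_mib, total_mib) tuples, or None if not numeric."""
    rows = []
    for line in output.strip().splitlines():
        used, total = (field.strip() for field in line.split(","))
        try:
            rows.append((int(used), int(total)))
        except ValueError:
            # Placeholder such as [N/A] instead of a number.
            rows.append(None)
    return rows


def headroom_mib():
    """Free MiB per GPU (total minus used), or None where the driver gave no number."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return [None if row is None else row[1] - row[0] for row in parse_gpu_memory(out)]
```

Calling headroom_mib() before and after sending a long prompt gives the size of the jump, which is easier to compare across runs than a number read off a refreshing screen.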
Do
Free what else is running first: a second model, a notebook, a spare browser window
Shorten the max context length to shrink the KV cache, the usual culprit
Lower the batch size or the fraction of memory the server pre-reserves
Quantize to a smaller weight format once the free knobs are exhausted
Don't
Panic-buy a bigger GPU before trying the free knobs
Assume the model is broken; it ran out of room, it did not break
Change five settings at once so you cannot tell which one helped
Ignore the context length when a run that worked yesterday now fails
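Shortening the context is easier to do deliberately if you turn the budget around and ask how many tokens fit. A rough sketch: max_context_len is a hypothetical helper, every number in the example is an illustrative assumption, and the per-token cost uses the common estimate of one key and one value vector per layer and KV head. Real servers add their own overhead, so treat the result as an upper bound to tune down from, one setting at a time.

```python
GIB = 1024 ** 3


def max_context_len(free_bytes, weights_bytes, scratch_bytes,
                    n_layers, n_kv_heads, head_dim, bytes_per_kv=2, batch=1):
    """Largest context length whose KV cache fits in what remains after weights and scratch."""
    # Bytes of KV cache added by each extra token of context (keys + values).
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_kv * batch
    left = free_bytes - weights_bytes - scratch_bytes
    return max(0, left // per_token)


# Illustrative inputs: 24 GiB free, 16 GB of weights, 1 GiB of scratch (all assumptions).
limit = max_context_len(
    free_bytes=24 * GIB,
    weights_bytes=16_000_000_000,
    scratch_bytes=1 * GIB,
    n_layers=32, n_kv_heads=8, head_dim=128,
)
print(f"KV cache fits up to about {limit} tokens of context")
```

Doubling the batch size halves the answer, which is why batch size and context length appear next to each other in the list of knobs.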