OOM: when a model runs out of memory

Out of memory (OOM) is what happens when a model asks for more memory than is free at that moment: the allocation fails and the process is killed. On your own hardware it is the most common wall you will hit, and almost always a tuning problem, not a hardware one.

At a glance

What it is: A failed memory allocation that kills the running process
Why it bites locally: Weights, the key-value (KV) cache and scratch can exceed one shared pool
First thing to try: Free other memory, then shorten the context length
How worried to be: Low. It is recoverable and almost always tunable
What fills the memory budget

Weights, the key-value (KV) cache and scratch share one pool. Layer 4, at the top of the stack, is your free headroom; when the other three leave none of it, the next allocation fails, and that is the OOM.

4. Free headroom (keep this above zero): what is left, and it is shared with the OS
3. Scratch: transient working space per request
2. Key-value (KV) cache: grows with context length, the usual culprit
1. Weights: the model itself, a fixed size once loaded

What actually runs out?

A model needs room for three things at once: its weights, the working state of the request (the KV cache, a key-value store of past tokens that grows with context length), and a bit of scratch space. When the sum of those is larger than the free memory, the allocation fails and the process is killed. No ceremony, no warning shot.
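The budget above can be put into numbers. This is a minimal sketch, assuming a standard transformer where the KV cache stores one key and one value vector per layer, per KV head, per token. The model dimensions (32 layers, 8 KV heads, head size 128, 16-bit cache) and the 16 GiB weights, 1 GiB scratch and 20 GiB free figures are made up for illustration, not taken from any particular model or machine.

```python
GIB = 1024 ** 3


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_tokens,
                   batch_size=1, bytes_per_value=2):
    """Rough KV cache size; the leading 2 is one key plus one value tensor."""
    return (2 * n_layers * n_kv_heads * head_dim
            * context_tokens * batch_size * bytes_per_value)


def fits(weights_bytes, kv_bytes, scratch_bytes, free_bytes):
    """Return (fits?, headroom in bytes); negative headroom means OOM."""
    needed = weights_bytes + kv_bytes + scratch_bytes
    return needed <= free_bytes, free_bytes - needed


# Hypothetical model and machine, for illustration only.
kv_long = kv_cache_bytes(32, 8, 128, context_tokens=32768)
kv_short = kv_cache_bytes(32, 8, 128, context_tokens=8192)

ok, headroom = fits(16 * GIB, kv_long, 1 * GIB, 20 * GIB)
print(ok, headroom / GIB)   # False -1.0

ok, headroom = fits(16 * GIB, kv_short, 1 * GIB, 20 * GIB)
print(ok, headroom / GIB)   # True 2.0
```

Same weights, same machine: only the context length changed, and the run went from one GiB short to two GiB spare.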

On a normal desktop the model lives in dedicated GPU (graphics processing unit) memory, the VRAM, and the operating system lives in system RAM (random-access memory), so the two rarely fight. On a unified-memory box like the DGX Spark they share one pool. That is the part that ambushes newcomers: a forgotten second model, or a browser doing what browsers do, can quietly eat the headroom your inference run needed. The OOM looks like it came from nowhere. It did not. It came from the other thing.
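On a shared pool, it helps to see what the rest of the system has already taken before you launch. A minimal sketch, assuming a Linux machine: it reads the `MemAvailable` line from `/proc/meminfo`, which the kernel reports in kB (units of 1024 bytes). The sample text and the function names are illustrative, and the real file is only read if it exists.

```python
import os


def mem_available_bytes(meminfo_text):
    """Parse the MemAvailable line of /proc/meminfo; None if it is missing."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            # Format: "MemAvailable:   12345678 kB" (kB here means 1024 bytes).
            return int(line.split()[1]) * 1024
    return None


def read_mem_available(path="/proc/meminfo"):
    """Read the live value on Linux; None on systems without the file."""
    if not os.path.exists(path):
        return None
    with open(path) as f:
        return mem_available_bytes(f.read())


# Made-up sample, so the parser can be tried anywhere.
SAMPLE = "MemTotal:       8388608 kB\nMemAvailable:   2097152 kB\n"
print(mem_available_bytes(SAMPLE) / 1024 ** 3)   # 2.0
```

Run `read_mem_available()` before and after closing the browser or the forgotten second model, and the difference is the headroom you got back.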

How do you recognise an OOM?

You see | It usually means
CUDA out of memory / torch ... OutOfMemoryError | The model could not fit weights or cache in memory
The process dies with Killed and no stack trace | The OS OOM-killer stepped in to save the system from itself
It worked yesterday, fails today on a longer prompt | The KV cache grew with context length
Fails only on restart, fine after | Stale memory from the previous run was never handed back
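The first two rows can be told apart mechanically. A rough sketch, assuming a POSIX system and that you launched the model with Python's `subprocess`, which reports death by signal as a negative return code (a shell reports 128 plus the signal number instead). The function name and the labels it returns are invented for this example. A SIGKILL is consistent with the OOM-killer but does not prove it; the kernel log has the final word.

```python
import signal


def classify_failure(returncode, stderr_text):
    """Guess which row of the table a failed run belongs to."""
    text = stderr_text.lower()
    # The allocator reported the failure itself, so there is a message.
    if "out of memory" in text or "outofmemoryerror" in text:
        return "allocator_oom"
    # Killed from outside with SIGKILL: no message, no stack trace.
    if returncode in (-signal.SIGKILL, 128 + signal.SIGKILL):
        return "sigkill"
    return "other"


print(classify_failure(1, "RuntimeError: CUDA out of memory."))  # allocator_oom
print(classify_failure(-9, ""))                                   # sigkill
print(classify_failure(1, "FileNotFoundError: model.bin"))        # other
```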

OOM is a budgeting problem, not a defeat. Once you can read which of the three consumers blew the budget, the error stops being a wall and becomes a number you tune. You will still hit it now and then. You will just stop taking it personally.

Check it yourself

watch -n1 nvidia-smi

Watch the memory-used column climb toward the limit as the context grows. The OOM is the moment a new allocation no longer fits under it.
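If you would rather log the numbers than watch them, the same tool has a query mode. A minimal sketch, assuming `nvidia-smi` is on the PATH; with these flags it prints used and total memory in MiB, one line per GPU. The function names are illustrative. Some systems report a placeholder such as `[N/A]` instead of a number, so the parser returns `None` for any row it cannot read rather than guessing.

```python
import subprocess

QUERY = ["nvidia-smi",
         "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"]


def parse_memory_csv(output):
    """Turn 'used, total' lines into (used_mib, total_mib) tuples, or None."""
    rows = []
    for line in output.strip().splitlines():
        fields = [field.strip() for field in line.split(",")]
        if len(fields) == 2 and all(field.isdigit() for field in fields):
            rows.append((int(fields[0]), int(fields[1])))
        else:
            rows.append(None)   # not numeric, e.g. "[N/A]"
    return rows


def gpu_memory_mib():
    """Run nvidia-smi once and parse its output (requires an NVIDIA driver)."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_memory_csv(result.stdout)


# Made-up sample output, so the parser can be tried without a GPU.
print(parse_memory_csv("1024, 24576\n"))   # [(1024, 24576)]
```

Call `gpu_memory_mib()` in a loop while you send longer and longer prompts, and you get the climb as data instead of a feeling.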

Do

  • Free what else is running first: a second model, a notebook, a spare browser window
  • Shorten the max context length to shrink the KV cache, the usual culprit
  • Lower the batch size or the fraction of memory the server pre-reserves
  • Quantize to a smaller weight format once the free knobs are exhausted
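To see what the last knob buys you, the weight size is roughly the parameter count times the bits per weight. A back-of-the-envelope sketch: the 8-billion-parameter model is hypothetical, and real quantized formats store extra scaling data alongside the weights, so actual files come out somewhat larger than this.

```python
def weights_bytes(n_params, bits_per_weight):
    """Approximate weight size, ignoring per-block scales and other metadata."""
    return n_params * bits_per_weight // 8


N_PARAMS = 8_000_000_000   # hypothetical model size, for illustration only

for bits in (16, 8, 4):
    print(bits, "bit:", weights_bytes(N_PARAMS, bits) / 1e9, "GB")
# 16 bit: 16.0 GB
# 8 bit: 8.0 GB
# 4 bit: 4.0 GB
```

Quantizing shrinks the fixed layer of the stack; it does nothing for a KV cache that is too large, which is why it sits last in the list.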

Don't

  • Panic-buy a bigger GPU before trying the free knobs
  • Assume the model is broken; it ran out of room, it did not break
  • Change five settings at once so you cannot tell which one helped
  • Ignore the context length when a run that worked yesterday now fails

Reviewed: June 2026