OOM: when a model runs out of memory

Out of memory (OOM) is what happens when a model asks for more memory than is free at that moment: the allocation fails and the process errors out or is killed. On your own hardware it is the most common wall you will hit, and it is almost always a tuning problem, not a hardware one.

At a glance

What it is	A failed memory allocation that kills the running process
Why it bites locally	Weights, the key-value (KV) cache and scratch can exceed one shared pool
First thing to try	Free other memory, then shorten the context length
How worried to be	Low. It is recoverable and almost always tunable

What actually runs out?

A model needs room for three things at once: its weights, the working state of the request (the KV cache, which holds the attention keys and values for every token seen so far and so grows with context length), and some scratch space for intermediate results. When the sum of those is larger than the free memory, the allocation fails and the process either exits with an error or is killed outright. No ceremony, no warning shot.
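The three-part budget can be sketched as simple arithmetic. This is a rough estimate, not a measurement: the model shape below (parameter count, layers, KV heads, head dimension) and the 1 GiB scratch allowance are illustrative assumptions, and the KV-cache formula assumes a standard transformer that stores one key and one value vector per layer, per KV head, per token.

```python
from dataclasses import dataclass

GIB = 1024 ** 3


@dataclass
class ModelShape:
    # All values are illustrative assumptions, not taken from a specific model card.
    n_params: float          # total parameter count
    n_layers: int
    n_kv_heads: int
    head_dim: int
    bytes_per_weight: float  # 2 for fp16/bf16, roughly 0.5 for 4-bit quantization
    bytes_per_kv: int = 2    # assumes the KV cache is kept in fp16/bf16


def weight_bytes(m):
    return m.n_params * m.bytes_per_weight


def kv_cache_bytes(m, context_len, batch=1):
    # Factor of 2: both a key and a value are stored for every token.
    return 2 * m.n_layers * m.n_kv_heads * m.head_dim * m.bytes_per_kv * context_len * batch


def total_bytes(m, context_len, scratch_bytes=1 * GIB, batch=1):
    # scratch_bytes is a guessed allowance; real scratch use depends on the runtime.
    return weight_bytes(m) + kv_cache_bytes(m, context_len, batch) + scratch_bytes


shape = ModelShape(n_params=8e9, n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_weight=2)
for ctx in (4096, 32768, 131072):
    print(f"context {ctx:>6}: {total_bytes(shape, ctx) / GIB:.1f} GiB")
```

The weights term is fixed once the model is loaded, while the KV-cache term scales linearly with context length, which is why a longer prompt can tip a previously working setup over the edge.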

On a normal desktop the model lives in dedicated GPU (graphics processing unit) memory, the VRAM, and the operating system lives in system RAM (random-access memory), so the two rarely fight. On a unified-memory box like the DGX Spark they share one pool. That is the part that ambushes newcomers: a forgotten second model, or a browser doing what browsers do, can quietly eat the headroom your inference run needed. The OOM looks like it came from nowhere. It did not. It came from the other thing.
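One way to check the headroom before loading a model, assuming a Linux system that exposes `/proc/meminfo`, is to read the kernel's `MemAvailable` estimate. The helper names and the 20 GiB requirement are hypothetical. The sketch also assumes that on a shared pool this figure roughly tracks what a model can claim; on a machine with a discrete GPU it describes system RAM only, not VRAM.

```python
from pathlib import Path


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: bytes or plain count}."""
    info = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if not parts:
            continue
        value = int(parts[0])
        # Most fields are reported in kB; a few (e.g. HugePages_Total) are plain counts.
        if len(parts) > 1 and parts[1] == "kB":
            value *= 1024
        info[name.strip()] = value
    return info


def has_headroom(meminfo, needed_bytes):
    """True if the kernel's MemAvailable estimate covers the requested bytes."""
    return meminfo.get("MemAvailable", 0) >= needed_bytes


if __name__ == "__main__":
    path = Path("/proc/meminfo")
    if path.exists():  # Linux only
        info = parse_meminfo(path.read_text())
        needed = 20 * 1024 ** 3  # hypothetical requirement: 20 GiB
        print(f"available: {info.get('MemAvailable', 0) / 1024 ** 3:.1f} GiB")
        print("fits" if has_headroom(info, needed) else "not enough headroom")
```

Running this before and after closing a browser or a second model makes the "other thing" visible as a number.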

How do you recognise an OOM?

You see	It usually means
`CUDA out of memory` / `torch ... OutOfMemoryError`	The model could not fit weights or cache in memory
The process dies with `Killed` and no stack trace	The OS OOM-killer stepped in to save the system from itself
It worked yesterday, fails today on a longer prompt	The KV cache grew with context length
Fails right after a restart, works on a later try	Stale memory from the previous run had not been handed back yet
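The first two rows of the table can be told apart programmatically. The sketch below is a heuristic, assuming a POSIX system and a job launched through Python's `subprocess` module; the function names are hypothetical, and a SIGKILL can come from sources other than the OOM-killer, so the result is a hint rather than a diagnosis.

```python
import signal
import subprocess
import sys


def classify_failure(returncode, stderr_text):
    """Map a finished process to a likely cause, following the table above."""
    if returncode == 0:
        return "ok"
    if "out of memory" in stderr_text.lower() or "OutOfMemoryError" in stderr_text:
        return "framework allocation failed (weights or cache did not fit)"
    # On POSIX, subprocess reports death-by-signal as a negative return code.
    # (A shell would show the same event as exit status 137, i.e. 128 + 9.)
    if returncode == -signal.SIGKILL:
        return "killed by SIGKILL, possibly the OS OOM-killer"
    return "other failure"


def run_and_classify(cmd):
    proc = subprocess.run(cmd, capture_output=True, text=True)
    return classify_failure(proc.returncode, proc.stderr)


if __name__ == "__main__":
    # Hypothetical stand-in for a real inference command.
    print(run_and_classify([sys.executable, "-c", "print('hello')"]))
```

When the SIGKILL branch fires, the kernel log is the place to confirm whether the OOM-killer was actually responsible.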

OOM is a budgeting problem, not a defeat. Once you can read which of the three consumers blew the budget, the error stops being a wall and becomes a number you tune. You will still hit it now and then. You will just stop taking it personally.
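To make "a number you tune" concrete, the budget can be inverted: given the free memory, subtract weights and scratch, and divide what is left by the KV-cache cost per token. This is a minimal sketch; the function name and every figure in the example are hypothetical.

```python
GIB = 1024 ** 3


def max_context_tokens(free_bytes, weight_bytes, scratch_bytes, kv_bytes_per_token):
    """Largest context length whose KV cache fits after weights and scratch are paid for."""
    remaining = free_bytes - weight_bytes - scratch_bytes
    if remaining <= 0:
        # Weights and scratch alone exceed the budget; a shorter context will not help.
        return 0
    return int(remaining // kv_bytes_per_token)


# Hypothetical numbers: 24 GiB free, 15 GiB of weights, 1 GiB of scratch,
# and 128 KiB of KV cache per token.
print(max_context_tokens(24 * GIB, 15 * GIB, 1 * GIB, 128 * 1024))
```

A result of zero points at the weights as the consumer that blew the budget, which calls for a smaller or more heavily quantized model rather than a shorter prompt.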
