Voxtral Stage 1 OOM on GB10: Why --enforce-eager Is Not Enough
The Docker container died two minutes after launch with a CUDA OOM error that made no sense.
Quick Take
- Voxtral’s Stage 1 TTS engine tried to allocate a 65536-token KV-cache it could never fit
- --enforce-eager only applies to the APIServer, not to the StageEngineCore subprocesses
- One line added to voxtral-start.sh fixed the crash and kept the podcast pipeline running
The Crash Log That Didn’t Add Up
Last week the podcast pipeline went down when the container exited with:
(StageEngineCoreProc pid=364) torch.AcceleratorError: CUDA error: out of memory
(APIServer pid=1) RuntimeError: Orchestrator initialization failed:
StageEngineCoreProc died during READY (exit code 1)
I checked /proc/meminfo before the run: 115 GB free on the DGX Spark. The error screamed “you’re out of GPU memory,” yet the numbers told a different story. The system had plenty of headroom, so the problem wasn’t the hardware; it was the configuration.
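For context, that 115 GB figure came from a plain host-side check; the GB10 uses one unified memory pool for both CPU and GPU, so /proc/meminfo is a reasonable proxy for what the engine can claim. A minimal version of that pre-flight check:

```bash
# Host-side memory headroom before launching the container.
# On GB10 the CPU and GPU share the same unified memory pool.
grep -E 'MemTotal|MemFree|MemAvailable' /proc/meminfo
free -h
```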
Why Stage 1 Couldn’t Breathe
Voxtral uses a two-stage engine for TTS:
- Stage 0 handles text and language modeling with a short context window
- Stage 1 synthesizes audio and therefore needs a much larger context window
The critical detail is how vllm-omni allocates memory for each stage. Each StageEngineCoreProc builds its own KV-cache based on the max_seq_len setting for that stage.
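To see why max_seq_len dominates the allocation, here is a back-of-envelope sizing in shell arithmetic. The layer, head, and batch numbers are placeholders, not Voxtral’s actual Stage 1 dimensions; the point is that the KV-cache grows linearly with max_seq_len, so 65536 tokens costs 16x what 4096 does.

```bash
#!/usr/bin/env bash
# Rough KV-cache sizing:
#   2 (K and V) * layers * kv_heads * head_dim * bytes_per_elem
#     * max_seq_len * concurrent_seqs
# All dimensions below are illustrative placeholders, not Voxtral's config.
layers=32; kv_heads=8; head_dim=128; bytes_per_elem=2; concurrent_seqs=16
for seq_len in 65536 4096; do
  gib=$(( 2 * layers * kv_heads * head_dim * bytes_per_elem * seq_len * concurrent_seqs / 1024**3 ))
  echo "max_seq_len=${seq_len} -> ~${gib} GiB of KV-cache"
done
```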
Here’s the catch: --enforce-eager only applies to the APIServer process, not to the subprocesses. In practice, the log showed:
(APIServer pid=1) WARNING: Enforce eager set, disabling torch.compile and CUDAGraphs
(StageEngineCoreProc pid=188) config: enforce_eager=False
(StageEngineCoreProc pid=364) ...OOM...
Stage 1 tried to allocate a KV-cache for 65536 audio tokens with CUDA graphs enabled and eager mode disabled. That allocation exceeded the available memory even though the system had 115 GB free.
The issue surfaced after a restart cycle. The first successful run used a profiling fallback that trimmed the KV-cache:
base.py:150 Available KV cache memory: 80.79 GiB (profiling fallback)
Later starts bypassed the fallback and went straight to the normal profiling path, which over-allocated for 65536 tokens.
The One-Line Fix That Worked
The fix is to cap Stage 1’s context window with --max-model-len 4096:
docker run -d --gpus all --name voxtral --network host \
-e HF_HOME=/ai/models -v /ai/models:/ai/models \
voxtral-vllm \
--model mistralai/Voxtral-4B-TTS-2603 \
--omni --trust-remote-code --enforce-eager \
--max-model-len 4096 \
--served-model-name voxtral \
--port 8001 --host 0.0.0.0
--max-model-len 4096 sets the maximum sequence length the model will accept for a single request. In our podcast pipeline, every audio chunk is split into ≤90-character sentences, so no single request ever approaches 4096 tokens. This cap prevents Stage 1 from over-allocating memory while keeping the TTS quality intact.
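After restarting with the cap, the quickest confirmation is to grep the container logs for the strings that mattered during the failure (phrasing taken from the log excerpts above; the container name matches the run command):

```bash
# Confirm Stage 1 came up with a sane KV-cache and no OOM after the restart.
docker logs voxtral 2>&1 | grep -E "Available KV cache memory|out of memory"
```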
Bash Bugs That Almost Hid the Real Problem
While fixing the OOM, I found two Bash issues in run_podcast.sh that were masking the real error.
First, the script called sudo bash /data/scripts/voxtral-start.sh, but bash wasn’t in sudoers. The Docker commands inside voxtral-start.sh already use NOPASSWD via /usr/bin/docker, so the sudo bash wrapper was unnecessary. Changing it to a direct call fixed the permission error.
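Sketched as a before/after, with everything else in run_podcast.sh left untouched:

```bash
# Before: escalates to bash, and bash is not covered by the sudoers rules.
sudo bash /data/scripts/voxtral-start.sh

# After: invoke the script directly; the docker commands inside it already
# run under the NOPASSWD rule for /usr/bin/docker.
bash /data/scripts/voxtral-start.sh
```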
Second, the script checked if SGLang was running with:
SGLANG_RUNNING=$(docker ps ... | grep -c sglang || echo 0)
When grep -c finds zero matches, it exits with code 1 and prints “0”. The || echo 0 then appends another “0”, so $SGLANG_RUNNING becomes “0\n0”. The subsequent integer comparison if [[ $SGLANG_RUNNING -gt 0 ]] fails because Bash can’t convert “0\n0” to an integer. Replacing || echo 0 with || true keeps grep’s own “0” as the only output and lets the comparison work.
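Here is the same fix in context. The docker ps filter that the original line elides is shown as a plain docker ps so the snippet stays runnable; treat that part as an assumption.

```bash
# Broken: on zero matches grep -c prints "0" AND exits 1, so || echo 0 adds a
# second "0" and the captured value becomes "0<newline>0".
SGLANG_RUNNING=$(docker ps | grep -c sglang || echo 0)

# Fixed: keep grep's own "0" and only swallow the non-zero exit status.
SGLANG_RUNNING=$(docker ps | grep -c sglang || true)

if [[ $SGLANG_RUNNING -gt 0 ]]; then
  echo "SGLang is already running"
fi
```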
What I Actually Use
- DGX Spark with NVIDIA GB10 (Blackwell, SM 12.1): the only hardware that can run Voxtral Stage 1 without melting the VRAM
- voxtral-vllm image with vllm-omni 0.19.0rc2 and vllm 0.19.1: handles the two-stage TTS pipeline without leaking memory
- Mistral Small 4: the model that powers the TTS without needing a cloud API call
Why This OOM Is Harder Than It Looks
The real reason this is hard to diagnose is that the failure surface and the root cause live in different processes. The CUDA OOM error appears in StageEngineCoreProc, which is a subprocess of the APIServer. The flag that should fix it (--enforce-eager) was passed to the APIServer and never inherited by the subprocesses, so StageEngineCoreProc runs with enforce_eager=False even though the parent log says Enforce eager set, disabling torch.compile and CUDAGraphs. That mismatch is the bug.
--enforce-eager is the right flag conceptually; it just lands in the wrong place in the process tree. Capping --max-model-len 4096 works because it limits the KV-cache size the subprocess will try to allocate, regardless of whether the subprocess inherits eager mode or not. The cap is a constraint on the allocation; the eager flag is a hint about how to compute. They address different layers of the problem.
The general lesson worth keeping: when an OOM appears in a subprocess and the system has plenty of free memory, check whether the parent-process flags actually propagated. ps -ef plus cat /proc/<pid>/cmdline on each child is the lowest-effort verification. If a flag you set on the parent does not appear in the child’s command line, your fix is not yet applied to the process that actually fails.
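A minimal version of that verification, assuming the stage processes are findable by name in the host’s ps output:

```bash
# Dump the command line each StageEngineCoreProc child actually received.
# If a flag you passed to the parent is missing here, it never reached the
# process that is failing.
ps -ef | grep -i stageenginecore | grep -v grep | awk '{print $2}' |
while read -r pid; do
  echo "== pid ${pid} =="
  tr '\0' ' ' < "/proc/${pid}/cmdline"
  echo
done
```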
What to monitor afterward: the (StageEngineCoreProc pid=N) config: enforce_eager=... line in container logs is the canonical signal. If enforce_eager=False ever shows up after a restart with --max-model-len capped, the cap is doing the load-bearing work; if it shows enforce_eager=True, an upstream vllm-omni fix has propagated the flag and the cap may no longer be needed.
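The one-liner for that signal, with the string copied from the log excerpt earlier and the container name assumed to be voxtral:

```bash
# Post-restart check: what enforce_eager each stage engine actually got.
docker logs voxtral 2>&1 | grep "config: enforce_eager"
```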
Status Update (2026-05-04)
The --max-model-len 4096 cap from this post is now the canonical Voxtral start configuration. Both /data/scripts/voxtral-start.sh and the dashboard’s Voxtral start action pass the flag by default, so the value is no longer something operators have to remember. The vllm-omni bug that keeps enforce_eager from propagating to the subprocesses has not been fixed upstream as of this date, so the cap is still doing the load-bearing work. The “what to monitor” line about (StageEngineCoreProc pid=N) config: enforce_eager=... remains the canonical signal for whether the upstream fix has finally landed; until that line shows enforce_eager=True after a restart, leave the cap in place.