Voxtral Stage 1 OOM on GB10: Why --enforce-eager Is Not Enough
The Docker container died two minutes after launch with a CUDA OOM error that made no sense.
Quick Take
- Voxtral’s Stage 1 TTS engine tried to allocate a 65536-token KV-cache it could never fit
- --enforce-eager only applies to the APIServer, not to the StageEngineCore subprocesses
- One line added to voxtral-start.sh fixed the crash and kept the podcast pipeline running
The Crash Log That Didn’t Add Up
Last week the podcast pipeline went down when the container exited with:
(StageEngineCoreProc pid=364) torch.AcceleratorError: CUDA error: out of memory
(APIServer pid=1) RuntimeError: Orchestrator initialization failed:
StageEngineCoreProc died during READY (exit code 1)
I checked /proc/meminfo before the run: 115 GB free on the DGX Spark. The error screamed “you’re out of GPU memory,” yet the numbers told a different story. The system had plenty of headroom, so the problem wasn’t the hardware; it was the configuration.
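For context, that 115 GB figure came from a plain host-side check; the GB10 uses one unified memory pool for both CPU and GPU, so /proc/meminfo is a reasonable proxy for what the engine can claim. A minimal version of that pre-flight check:

```bash
# Host-side memory headroom before launching the container.
# On GB10 the CPU and GPU share the same unified memory pool.
grep -E 'MemTotal|MemFree|MemAvailable' /proc/meminfo
free -h
```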
Why Stage 1 Couldn’t Breathe
Voxtral uses a two-stage engine for TTS:
- Stage 0 handles text and language modeling with a short context window
- Stage 1 synthesizes audio and therefore needs a much larger context window
The critical detail is how vllm-omni allocates memory for each stage. Each StageEngineCoreProc builds its own KV-cache based on the max_seq_len setting for that stage.
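To see why max_seq_len dominates the allocation, here is a back-of-envelope sizing in shell arithmetic. The layer, head, and batch numbers are placeholders, not Voxtral’s actual Stage 1 dimensions; the point is that the KV-cache grows linearly with max_seq_len, so 65536 tokens costs 16x what 4096 does.

```bash
#!/usr/bin/env bash
# Rough KV-cache sizing:
#   2 (K and V) * layers * kv_heads * head_dim * bytes_per_elem
#     * max_seq_len * concurrent_seqs
# All dimensions below are illustrative placeholders, not Voxtral's config.
layers=32; kv_heads=8; head_dim=128; bytes_per_elem=2; concurrent_seqs=16
for seq_len in 65536 4096; do
  gib=$(( 2 * layers * kv_heads * head_dim * bytes_per_elem * seq_len * concurrent_seqs / 1024**3 ))
  echo "max_seq_len=${seq_len} -> ~${gib} GiB of KV-cache"
done
```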
Here’s the catch: --enforce-eager only applies to the APIServer process, not to the subprocesses. In practice, the log showed:
(APIServer pid=1) WARNING: Enforce eager set, disabling torch.compile and CUDAGraphs
(StageEngineCoreProc pid=188) config: enforce_eager=False
(StageEngineCoreProc pid=364) ...OOM...
Stage 1 tried to allocate a KV-cache for 65536 audio tokens with CUDA graphs enabled and eager mode disabled. That allocation exceeded the available memory even though the system had 115 GB free.
The issue surfaced after a restart cycle. The first successful run used a profiling fallback that trimmed the KV-cache:
base.py:150 Available KV cache memory: 80.79 GiB (profiling fallback)
Later starts bypassed the fallback and went straight to the normal profiling path, which over-allocated for 65536 tokens.
The One-Line Fix That Worked
The fix is to cap Stage 1’s context window with --max-model-len 4096:
docker run -d --gpus all --name voxtral --network host \
-e HF_HOME=/ai/models -v /ai/models:/ai/models \
voxtral-vllm \
--model mistralai/Voxtral-4B-TTS-2603 \
--omni --trust-remote-code --enforce-eager \
--max-model-len 4096 \
--served-model-name voxtral \
--port 8001 --host 0.0.0.0
--max-model-len 4096 sets the maximum sequence length the model will accept for a single request. In our podcast pipeline, every audio chunk is split into ≤90-character sentences, so no single request ever approaches 4096 tokens. This cap prevents Stage 1 from over-allocating memory while keeping the TTS quality intact.
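After restarting with the cap, the quickest confirmation is to grep the container logs for the strings that mattered during the failure (phrasing taken from the log excerpts above; the container name matches the run command):

```bash
# Confirm Stage 1 came up with a sane KV-cache and no OOM after the restart.
docker logs voxtral 2>&1 | grep -E "Available KV cache memory|out of memory"
```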
Bash Bugs That Almost Hid the Real Problem
While fixing the OOM, I found two Bash issues in run_podcast.sh that were masking the real error.
First, the script called sudo bash /data/scripts/voxtral-start.sh, but bash wasn’t in sudoers. The Docker commands inside voxtral-start.sh already use NOPASSWD via /usr/bin/docker, so the sudo bash wrapper was unnecessary. Changing it to a direct call fixed the permission error.
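Sketched as a before/after, with everything else in run_podcast.sh left untouched:

```bash
# Before: escalates to bash, and bash is not covered by the sudoers rules.
sudo bash /data/scripts/voxtral-start.sh

# After: invoke the script directly; the docker commands inside it already
# run under the NOPASSWD rule for /usr/bin/docker.
bash /data/scripts/voxtral-start.sh
```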
Second, the script checked if SGLang was running with:
SGLANG_RUNNING=$(docker ps ... | grep -c sglang || echo 0)
When grep -c finds zero matches, it exits with code 1 and prints “0”. The || echo 0 then appends another “0”, so $SGLANG_RUNNING becomes “0\n0”. The subsequent integer comparison if [[ $SGLANG_RUNNING -gt 0 ]] fails because Bash can’t convert “0\n0” to an integer. Replacing || echo 0 with || true keeps grep’s own “0” as the only output and lets the comparison work.
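Here is the same fix in context. The docker ps filter that the original line elides is shown as a plain docker ps so the snippet stays runnable; treat that part as an assumption.

```bash
# Broken: on zero matches grep -c prints "0" AND exits 1, so || echo 0 adds a
# second "0" and the captured value becomes "0<newline>0".
SGLANG_RUNNING=$(docker ps | grep -c sglang || echo 0)

# Fixed: keep grep's own "0" and only swallow the non-zero exit status.
SGLANG_RUNNING=$(docker ps | grep -c sglang || true)

if [[ $SGLANG_RUNNING -gt 0 ]]; then
  echo "SGLang is already running"
fi
```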
What I Actually Use
- DGX Spark with NVIDIA GB10 (Blackwell, SM 12.1): the only hardware that can run Voxtral Stage 1 without melting the VRAM
- voxtral-vllm image with vllm-omni 0.19.0rc2 and vllm 0.19.1: handles the two-stage TTS pipeline without leaking memory
- Mistral Small 4: the model that powers the TTS without needing a cloud API call
Why This OOM Is Harder Than It Looks
The real reason this is hard to diagnose is that the failure surface and the root cause live in different processes. The CUDA OOM error appears in StageEngineCoreProc, which is a subprocess of the APIServer. The flag that should fix it (--enforce-eager) was passed to the APIServer and never inherited by the subprocesses, so StageEngineCoreProc runs with enforce_eager=False even though the parent log says Enforce eager set, disabling torch.compile and CUDAGraphs. That mismatch is the bug.
--enforce-eager is the right flag conceptually; it just lands in the wrong place in the process tree. Capping --max-model-len 4096 works because it limits the KV-cache size the subprocess will try to allocate, regardless of whether the subprocess inherits eager mode or not. The cap is a constraint on the allocation; the eager flag is a hint about how to compute. They address different layers of the problem.
The general lesson worth keeping: when an OOM appears in a subprocess and the system has plenty of free memory, check whether the parent-process flags actually propagated. ps -ef plus cat /proc/<pid>/cmdline on each child is the lowest-effort verification. If a flag you set on the parent does not appear in the child’s command line, your fix is not yet applied to the process that actually fails.
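A minimal version of that verification, assuming the stage processes are findable by name in the host’s ps output:

```bash
# Dump the command line each StageEngineCoreProc child actually received.
# If a flag you passed to the parent is missing here, it never reached the
# process that is failing.
ps -ef | grep -i stageenginecore | grep -v grep | awk '{print $2}' |
while read -r pid; do
  echo "== pid ${pid} =="
  tr '\0' ' ' < "/proc/${pid}/cmdline"
  echo
done
```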
What to monitor afterward: the (StageEngineCoreProc pid=N) config: enforce_eager=... line in container logs is the canonical signal. If enforce_eager=False ever shows up after a restart with --max-model-len capped, the cap is doing the load-bearing work; if it shows enforce_eager=True, an upstream vllm-omni fix has propagated the flag and the cap may no longer be needed.
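The one-liner for that signal, with the string copied from the log excerpt earlier and the container name assumed to be voxtral:

```bash
# Post-restart check: what enforce_eager each stage engine actually got.
docker logs voxtral 2>&1 | grep "config: enforce_eager"
```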
Status Update (2026-05-04)
The --max-model-len 4096 cap from this post is now the canonical Voxtral start configuration. Both /data/scripts/voxtral-start.sh and the dashboard’s Voxtral start action pass the flag by default, so the value is no longer something operators have to remember. The vllm-omni bug that keeps enforce_eager from propagating to the subprocesses has not been fixed upstream as of this date, so the cap is still doing the load-bearing work. The “what to monitor” line about (StageEngineCoreProc pid=N) config: enforce_eager=... remains the canonical signal for whether the upstream fix has finally landed; until that line shows enforce_eager=True after a restart, leave the cap in place.