Voxtral-TTS Blocker on GB10: The Three-Line vllm-omni Patch
The vllm-omni container hung for hours on GB10, but the real crash came three lines earlier.
Quick Take
- Blackwell SM 12.1 breaks transformers 5.x unless you patch the init order
- `--enforce-eager` keeps torch.compile from melting your GPU
- The missing `text_config` attribute was hiding in plain sight
The silent hang that wasn’t
Last week the container printed nothing for 3.5 hours after pulling the model. The logs showed no traceback and no error: just the startup banner, then silence. The GPU fans spun normally and the container stayed up, but every request returned HTTP 500. Attaching strace showed the process stuck in a futex wait, a classic deadlock symptom. The real culprit, though, was an AttributeError buried 12 stack frames deep.
```python
class VoxtralTTSConfig(PretrainedConfig):
    def __init__(self, **kwargs):
        # Patched order: set text_config BEFORE the parent init runs
        self.text_config = kwargs.pop("text_config", None)
        super().__init__(**kwargs)  # transformers 5.5.4 now calls validate_token_ids()
```
Why does the unpatched version break? transformers 5.5.4 added a call to validate_token_ids() inside PretrainedConfig.__init__. That method reads self.text_config, but upstream VoxtralTTSConfig only sets that attribute after calling super().__init__(). The first access therefore raises AttributeError, the child process hangs waiting for a log line that never prints, and the parent thinks the container is still booting.
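The failure mode is plain Python and does not need transformers to reproduce. A minimal sketch with stand-in classes: ParentConfig simulates PretrainedConfig calling validate_token_ids() inside its __init__; nothing here is imported from transformers, and the class names are mine.

```python
class ParentConfig:
    def __init__(self, **kwargs):
        # Simulates transformers 5.5.4 calling validate_token_ids()
        # from inside PretrainedConfig.__init__.
        self.validate_token_ids()

    def validate_token_ids(self):
        # Reads an attribute the subclass may not have set yet.
        _ = self.text_config


class BrokenConfig(ParentConfig):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)  # parent reads self.text_config -> AttributeError
        self.text_config = kwargs.pop("text_config", None)


class PatchedConfig(ParentConfig):
    def __init__(self, **kwargs):
        self.text_config = kwargs.pop("text_config", None)  # set first
        super().__init__(**kwargs)  # parent now sees the attribute


try:
    BrokenConfig(text_config={})
except AttributeError as e:
    print("broken:", e)

cfg = PatchedConfig(text_config={})
print("patched, text_config =", cfg.text_config)
```

Swapping the two lines in the subclass is the entire fix, which is why the real patch is only three lines.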
Why vllm-omni needs the patch on GB10
GB10 runs SM 12.1, which PyTorch 2.10 does not officially support. The driver falls back to PTX emulation, so torch.compile can trigger subtle bugs. That’s why --enforce-eager exists: it disables compilation and keeps the runtime stable.
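You can check whether an installed wheel actually ships kernels for your SM before deciding whether --enforce-eager is still needed. A small sketch; the helper and the target prefix are illustrative, and in a real check you would pass it the result of torch.cuda.get_arch_list():

```python
def needs_enforce_eager(arch_list, target_prefix="sm_121"):
    # If no compiled kernel matches the device's SM, CUDA falls back
    # to PTX JIT, which is where torch.compile gets flaky on GB10.
    return not any(a.startswith(target_prefix) for a in arch_list)


# Example: a wheel compiled only through older architectures
print(needs_enforce_eager(["sm_80", "sm_90", "sm_100"]))  # True -> keep --enforce-eager
```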
```shell
docker run --rm --gpus all \
  -e HF_HOME=/ai/models \
  -v /ai/models:/ai/models \
  voxtral-vllm:latest \
  --model mistralai/Voxtral-4B-TTS-2603 \
  --omni \
  --port 8001 \
  --host 0.0.0.0 \
  --trust-remote-code \
  --enforce-eager
```
The `--trust-remote-code` flag is required because Voxtral's config parser relies on custom code and config fields shipped with the model, which transformers refuses to load by default. Without it the container exits before the model loads.
The three-line fix that opened the gate
- Create `/data/config/voxtral/patch_voxtral_config.py` with the reordered `__init__`.
- Add a Dockerfile layer that copies the patch into `/usr/local/lib/python3.11/site-packages/vllm_omni/patches/`.
- Rebuild the image and redeploy.
```dockerfile
FROM voxtral-vllm:base
COPY patch_voxtral_config.py /usr/local/lib/python3.11/site-packages/vllm_omni/patches/voxtral_config.py
```
After the rebuild the smoke test produced a 134 KB 16-bit mono 24 kHz WAV in 75 seconds. No more hangs, no more AttributeErrors.
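The smoke-test output is easy to verify programmatically with the stdlib wave module. A sketch; check_smoke_wav and the expected (channels, bits, rate) tuple are my own convention, not part of vllm-omni:

```python
import wave


def check_smoke_wav(path):
    """Return (channels, bits_per_sample, sample_rate) of a WAV file."""
    with wave.open(path, "rb") as w:
        return (w.getnchannels(), w.getsampwidth() * 8, w.getframerate())


# A healthy Voxtral smoke test should yield 16-bit mono 24 kHz:
# assert check_smoke_wav("smoke.wav") == (1, 16, 24000)
```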
What to watch for next
The patch only works if your snapshot contains the HF-style `config.json` with a `text_config` key. If you see `LocalEntryNotFoundError` or `RuntimeError: 'VoxtralTTSConfig' object has no attribute 'text_config'`, double-check that the model was downloaded via vllm-omni's downloader, not the legacy `mistral_inference` folder.
Also remember that GB10's SM 12.1 requires PyTorch built with `TORCH_CUDA_ARCH_LIST=12.1a` if you ever rebuild from source. The official wheels fall back to PTX, which is slower and occasionally flaky.
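Distinguishing the two snapshot layouts can be scripted. A sketch; the helper name is mine, and the argument should point at the snapshot's config.json:

```python
import json
from pathlib import Path


def snapshot_has_text_config(config_path):
    # HF-style snapshots carry a config.json with a text_config key;
    # legacy mistral_inference folders do not.
    p = Path(config_path)
    if not p.is_file():
        return False
    try:
        cfg = json.loads(p.read_text())
    except json.JSONDecodeError:
        return False
    return "text_config" in cfg
```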
What I Actually Use
- vllm/vllm-openai:latest for the ARM64 base image
- mistralai/Voxtral-4B-TTS-2603 for the TTS model
- GB10 (Blackwell, 128 GB Unified Memory) for the GPU
How to watch the patched config in production
After applying the three-line fix the next question is monitoring: how do you notice if the patched config drifts back into the original failure mode? Three signals are worth scraping.
First, container startup latency. The pre-fix container hung for 3.5 hours on the original failure. The fixed container produces a smoke-test WAV in 75 seconds. Anything between those two extremes after a redeploy is a signal that the patch is not loaded correctly. A simple Prometheus probe that times the first successful TTS request after restart catches drift in the patch-application path.
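A sketch of the classification logic behind such a probe. The thresholds are illustrative, taken from this incident (healthy baseline ~75 s, hang regime ~3.5 h); tune them to your own hardware:

```python
def classify_startup(first_ok_seconds):
    """Map time-to-first-successful-TTS-request to a health signal."""
    if first_ok_seconds <= 120:   # near the 75 s healthy baseline
        return "healthy"
    if first_ok_seconds >= 3600:  # approaching the 3.5 h hang regime
        return "hung"
    return "drifted"              # in between: patch likely not applied


print(classify_startup(75))   # healthy
print(classify_startup(900))  # drifted
```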
Second, the text_config attribute presence on warm-up. Add a one-line check to the container entrypoint that asserts hasattr(config, 'text_config') after model load and before serving the first request. If the assert fails the container exits early with an actionable error, rather than hanging silently for hours.
Third, GB10 SM 12.1 toolchain version. The compute-capability mismatch that triggered this whole class of bug will eventually be fixed upstream in PyTorch. When that happens --enforce-eager becomes unnecessary and torch.compile can be re-enabled for whatever throughput gain it provides. Until then the patch must stay; checking PyTorch release notes for SM 12.1 support quarterly is the lowest-effort way to know when this article becomes obsolete.
This fix is current as of mid-2026. If you are reading it in 2027 or later, check whether SM 12.1 is officially supported before assuming you still need the patch.
If you are reading this fix in 2026 or later, recheck three things before applying it. First, the upstream PyTorch tracker for SM 12.1 support; the patch becomes unnecessary the day official wheels include the architecture. Second, the vllm-omni release notes, since the patched init order may have been merged upstream, in which case the local patch should be removed to avoid double-application. Third, the Voxtral model snapshot ID, since model-side config-schema changes can shift where text_config is expected. Each of those checks takes minutes; skipping them and applying a stale patch costs hours.
Status update (2026-05-04): one of the three parts is now fixed upstream
The vllm-omni init-order patch (the third part of the three-part fix above, the patch_voxtral_config.py workaround) is no longer needed against current vllm-omni main. PR #3065, merged on 2026-04-25, reorders the assignments into the same shape my local patch produced and moves the file to a new location under transformers_utils/configs/. The fix was authored upstream by yuanheng-zhao, independent of this article; I am pointing at it because it landed and operators should know.
The other two parts of the fix from this post are unaffected:
- `--enforce-eager` is still required because official PyTorch support for SM 12.1 has not landed.
- The smoke-test discipline (time the first TTS request after restart, watch for `text_config` attribute presence) is still the right operational signal.
What changed in the workflow:
- On a post-2026-04-25 vllm-omni snapshot, drop the local patch step from the Dockerfile; the upstream code already has the corrected init order.
- The local `patch_voxtral_config.py` in `/data/config/voxtral/` is now a no-op that exits cleanly if it cannot find the pre-#3065 file path. It is kept in the tree as documentation for anyone still running an older snapshot.
The quarterly recheck criteria still hold for the SM 12.1 toolchain piece, since that is upstream-PyTorch and has not changed yet.