Voxtral 4B Open-Checkpoint: The Encoder is Gated
The Voxtral 4B open checkpoint accepts the `ref_audio` parameter in its API. Send a clip, and the engine crashes hard enough to need a `docker restart`. The encoder weights, the part that turns reference audio into a speaker embedding, live exclusively in Mistral’s hosted product. The model card mentions voice cloning. The validator passes the parameter through. Nothing tells you the work won’t actually happen until your orchestrator thread is dead.
Quick Take
- Voxtral-4B-TTS-2603 ships only the decoder and 20 preset voice embeddings
- The audio encoder is gated behind Mistral La Plateforme, not the open weights
- Sending `ref_audio` to the local engine raises `RuntimeError` and kills the orchestrator
- The `instructions` field is silently dropped on the same code path
- For self-hosted voice cloning, evaluate Fish Speech S2 Pro, VoxCPM, or Qwen3-TTS
What I expected from the open checkpoint
The HuggingFace model card for `mistralai/Voxtral-4B-TTS-2603` lists voice cloning as a feature. The relevant line: “Voice reference input: Accepts 10-second audio reference for adaptation.” The API surface served by vllm-omni (the recommended runtime) accepts a `ref_audio` parameter that takes a URL, a base64 data URL, or a `file://` path. So far this looks like everything I need to clone the CIPHERFOX and HEXABELLA voices for the podcast.
I built the pipeline assuming that path worked. It doesn’t.
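For reference, this is the request shape I had wired up, the shape the model card and the API surface together imply should work. A minimal sketch: the host, port, and clip path are my deployment details, not part of Voxtral’s contract.

```python
import base64
import requests

# Hypothetical local reference clip; any short WAV illustrates the shape.
with open("ref_clips/cipherfox_10s.wav", "rb") as f:
    ref_b64 = base64.b64encode(f.read()).decode("ascii")

resp = requests.post(
    "http://localhost:8000/v1/audio/speech",
    json={
        "model": "mistralai/Voxtral-4B-TTS-2603",
        "input": "Welcome back to the show.",
        # The gated path: a base64 data URL, one of the three accepted forms.
        "ref_audio": f"data:audio/wav;base64,{ref_b64}",
        "response_format": "wav",
    },
    timeout=120,
)
resp.raise_for_status()  # on the open checkpoint, this path never succeeds
```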
The crash signature
Here is what happens when you send a normal-looking request with a 3-second reference clip. The endpoint accepts the request. The validator returns no error. The tokenizer then raises:
File ".../vllm_omni/model_executor/models/voxtral_tts/voxtral_tts_audio_tokenizer.py", line 985, in encode_waveforms
RuntimeError: encode_waveforms requires encoder weights which are not
available in the open-source checkpoint.
From the client it looks worse. A `BadRequest` carrying `Orchestrator thread crashed` lands first, because the request raises inside the engine core; by the time you read it, the `EngineCore` is already gone. Subsequent requests return `EngineDeadError` from `vllm/v1/engine/exceptions.py`. The fix is `docker restart voxtral` and a new request. Do not probe `ref_audio` on a production container: it kills the running engine, not just the failing request.
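If you have to exercise this path from automation anyway, the only recovery is the restart. A minimal client-side guard, assuming a container named `voxtral` and an endpoint at `localhost:8000` (both are my deployment details, not anything vllm-omni mandates):

```python
import subprocess
import requests

def speak(payload: dict) -> bytes:
    resp = requests.post("http://localhost:8000/v1/audio/speech",
                         json=payload, timeout=120)
    if resp.status_code == 400 and "Orchestrator thread crashed" in resp.text:
        # The EngineCore is already gone; only a container restart recovers it.
        subprocess.run(["docker", "restart", "voxtral"], check=True)
        raise RuntimeError("request killed the engine; container restarted")
    resp.raise_for_status()
    return resp.content
```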
Why the validator lets you in but the tokenizer kills you
The serving layer in vllm-omni runs a per-model validator before queuing the request. For Voxtral, that validator is `_validate_voxtral_tts_request` at `entrypoints/openai/serving_speech.py:866`. It says exactly this:

```python
# Voxtral TTS requires either a preset voice or ref_audio for voice cloning.
if request.voice is None and request.ref_audio is None:
    return "Either 'voice' (preset speaker) or 'ref_audio' (voice cloning) must be provided"
```
So either path is legal at the API surface. The split happens in `_build_voxtral_prompt` at line 1373, which routes `voice` to `SpeechRequest(input=text, voice=voice)` and `ref_audio` to `SpeechRequest(input=text, ref_audio=ref_audio)`. The `ref_audio` path then hits `encode_waveforms`, which expects encoder weights that are not in the open release. The validator never checks for encoder presence.
A fairer design would have the validator return a 400 with "voice cloning requires encoder weights not present in this checkpoint, use a preset voice". The current design accepts the request at the validator, queues it, then crashes the engine. That difference is the whole article.
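For concreteness, here is roughly what that check could look like, sketched in the shape of the validator quoted above. The `has_encoder_weights()` helper is hypothetical; vllm-omni exposes no such function today, which is the point of the complaint.

```python
# Sketch only: the second check does not exist upstream, and
# has_encoder_weights() is a hypothetical helper.
def _validate_voxtral_tts_request(request, model) -> str | None:
    if request.voice is None and request.ref_audio is None:
        return ("Either 'voice' (preset speaker) or 'ref_audio' "
                "(voice cloning) must be provided")
    if request.ref_audio is not None and not has_encoder_weights(model):
        # Fail at the API surface with a 400 instead of inside the engine.
        return ("voice cloning requires encoder weights not present in "
                "this checkpoint, use a preset voice")
    return None
```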
The same code path silently drops instructions
While I was tracing the crash, I noticed something else. The `instructions` parameter, which other vllm-omni TTS backends honor as a style hint, is silently dropped for Voxtral. Compare two validators in the same file:
```shell
docker exec voxtral grep -A1 "'instructions' is not supported" \
  /usr/local/lib/python3.12/dist-packages/vllm_omni/entrypoints/openai/serving_speech.py
# VoxCPM rejects: returns "'instructions' is not supported for VoxCPM"
# Voxtral validator does not check this field at all
```
Then check whether the model code uses instructions anywhere:
```shell
docker exec voxtral grep -rn "instructions" \
  /usr/local/lib/python3.12/dist-packages/vllm_omni/model_executor/models/voxtral_tts/
# (zero matches)
```
The field is parsed by the API schema, accepted by the validator, and never reaches the model. If you have `instructions: "Speak as a skeptical engineer..."` in your config, that line is a no-op. The output sounds identical with or without it. Drop the field from your config to remove the false signal it provides.
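A quick way to convince yourself, assuming deterministic decoding (fixed seed or greedy; if sampling is on, compare durations or spectrograms rather than raw bytes): render the same line twice, with and without the field, and diff the output. The endpoint and model id match my setup.

```python
import requests

def render(extra: dict) -> bytes:
    body = {
        "model": "mistralai/Voxtral-4B-TTS-2603",
        "input": "Testing the instructions field.",
        "voice": "casual_male",  # preset path, the one that works
        "response_format": "wav",
        **extra,
    }
    resp = requests.post("http://localhost:8000/v1/audio/speech",
                         json=body, timeout=120)
    resp.raise_for_status()
    return resp.content

baseline = render({})
styled = render({"instructions": "Speak as a skeptical engineer."})
print("identical output:", baseline == styled)  # True here: the field is dropped
```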
What I actually have, locally
Once both gated features are off the table, the open checkpoint gives me:
- 20 preset voices total. English: `casual_male`, `casual_female`, `cheerful_female`, `neutral_male`, `neutral_female`. The other 15 are non-English (`de_*`, `fr_*`, `es_*`, `pt_*`, `it_*`, `nl_*`, `hi_*`, `ar_*`).
- Whole-turn rendering up to 4096 tokens, roughly two minutes of audio per pass.
- Native 24 kHz mono output in WAV, PCM, FLAC, MP3, AAC, or Opus.
- No cloning, no `instructions`, no `task_type=VoiceDesign`.
That is enough to build a screen-reader-class TTS. It is not enough to build a podcast-class one. The preset voices have decent acoustic quality, but no per-character emotional range, and you cannot teach them anything by example.
What ships full encoder weights instead
Three open-source TTS models ship full encoder weights and are supported by the same vllm-omni runtime:
| Alternative | Encoder Weights | Voice Cloning | License Notes |
|---|---|---|---|
| Fish Speech S2 Pro | Open | Yes | Apache 2.0 |
| VoxCPM | Open | Yes | Apache 2.0 |
| Qwen3-TTS | Open | Yes | Apache 2.0 |
| Voxtral 4B (open) | Decoder only | No | Apache 2.0 (decoder), encoder gated |
All three support `task_type=Base` with `ref_audio` and `ref_text`. Their input processors live next to Voxtral’s in `vllm_omni/model_executor/stage_input_processors/`. The runtime, deployment shape, and OpenAI-compatible API contract are the same as the Voxtral path I already had wired up. The migration cost is one Dockerfile, one config, and one round of voice curation per character.
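A sketch of the cloning request shape those backends support, per the `task_type=Base` path above. The model id is hypothetical shorthand and the endpoint mirrors my Voxtral setup; treat the exact field layout as an assumption to verify per backend, not the definitive vllm-omni contract.

```python
import requests

resp = requests.post(
    "http://localhost:8000/v1/audio/speech",
    json={
        "model": "fishaudio/fish-speech-s2-pro",  # hypothetical model id
        "input": "Same line, cloned voice this time.",
        "task_type": "Base",
        "ref_audio": "file:///refs/hexabella_10s.wav",   # reference clip
        "ref_text": "Transcript of the reference clip.",  # its transcript
        "response_format": "wav",
    },
    timeout=120,
)
resp.raise_for_status()
with open("hexabella_cloned.wav", "wb") as f:
    f.write(resp.content)
```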
The editorial part
I respect Mistral’s right to gate parts of their work behind a paid product. The Voxtral 4B weights they did publish are still useful: the decoder is well-engineered, the preset voices are clean, and the runtime integration is solid. What I do not respect is shipping a model card that advertises voice cloning, exposing a `ref_audio` API parameter, accepting that parameter through the validator, and then letting the engine crash on the assumption that nobody will read the source code to find out the encoder is missing.
That is not sovereign-stack-friendly behavior. The fix is two changes:
- The model card should label the open checkpoint as decoder-only, no voice cloning.
- The vllm-omni validator should reject `ref_audio` with a clean 400 when the encoder weights are absent, not crash the engine.
Until either lands, treat “Voxtral self-hosted” as “preset-voice-only, decoder-only, no cloning”. Any blog post telling you otherwise is reading the model card without checking the source.
Status, mid-2026
The vllm-omni handler in current `main` still has the same validator-tokenizer split. No upstream PR addresses open-checkpoint detection. Mistral has not labeled the model card as decoder-only. The encoder remains paywalled.
If you are reading this in 2027 or later, check three things before assuming this still applies (a probe sketch follows the list):
- The model card on HuggingFace: look for “decoder-only” or “no voice cloning” labels.
- The `_validate_voxtral_tts_request` function in vllm-omni: search for `encode_waveforms` and `encoder_weights` references.
- A fresh `ref_audio` test on a throwaway container: the crash signature is unmistakable.
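The probe, as a minimal sketch. Run it against a disposable container only; on today’s open checkpoint it kills the engine. The URL and clip path are assumptions for the probe, not part of the pipeline.

```python
import requests

resp = requests.post(
    "http://localhost:8000/v1/audio/speech",
    json={
        "model": "mistralai/Voxtral-4B-TTS-2603",
        "input": "probe",
        "ref_audio": "file:///tmp/any_short_clip.wav",  # any valid clip works
    },
    timeout=60,
)
# Open checkpoint today: a 400 wrapping "Orchestrator thread crashed",
# then EngineDeadError on every request after it. A clean 400 mentioning
# encoder weights would mean the upstream fix finally landed.
print(resp.status_code, resp.text[:200])
```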
What I Actually Use
- `mistralai/Voxtral-4B-TTS-2603`, decoder-only, 20 preset voices
- vllm-omni `0.19.0rc2.dev199+gd435fe070`, OpenAI-compatible audio endpoint
- DGX Spark with GB10 Blackwell, 128 GB unified memory, ARM v9.2-A
- `casual_male` for CIPHERFOX, `casual_female` for HEXABELLA, no cloning
For the perf trade-off this constraint forces (whole-turn render is 30 to 38 percent faster than chunked), see Part 2. The two ffmpeg footguns I hit while building the pipeline that revealed this gap are documented separately as Part 3 and Part 4.
Related in this series
This article is Part 1 of Voxtral Pipeline Discoveries (May 2026):
- Part 1 (this article): the encoder is gated.
- Part 2: Voxtral Chunk Strategy. 30 to 38 percent render-time savings with whole-turn rendering.
- Part 3: FFmpeg `volume` filter with `eval=frame`. A 4-second silent intro bug.
- Part 4: Per-Segment Loudness for Multi-Speaker TTS. Two `loudnorm` footguns from the same pipeline.
[Figure: Voxtral 4B open-checkpoint capability surface. What you get vs what you expected.]