Eight engineering fixes deep, three weeks of patches, two failure modes on the same engine. The Voxtral open checkpoint has no path to release-quality podcast audio. The drama of staying with it anyway, and the three engines I plan to spike next.

Voxtral Capped at 3/10: Picking the Next Open TTS

Episode 1 V6 came back from spot-listen at zero out of ten. Three turns, three different lengths, same verdict: flat, fast, read-aloud. Episode 1 V1, the earliest run with short sentences, had landed at three out of ten weeks earlier. Its failure mode was different: human-paced delivery, but hallucinated "ähm, ähm" filler between every clause. Eight fix articles in between, two patch threads to upstream vLLM, one full audit-rewrite layer on top of the script, and the engine still has two non-overlapping failure modes with no working configuration between them.

This is the article where I stop patching Voxtral and pick a different engine. It documents how the decision crystallized, what the May 2026 open-TTS landscape actually looks like, and the three engines I plan to spike on the DGX Spark next.

Quick Take

  • Voxtral open-checkpoint has two non-overlapping failure modes. Turns under 100 chars produce filler hallucinations, turns over 350 chars flatten into staccato. No turn-length sweet spot reaches release quality. Capped at 3/10 best case.
  • The instructions parameter is silently ignored. The ref_audio parameter crashes the engine because the encoder weights stayed gated in Mistral’s hosted product. No speed knob exists. None of these have changed between 2026-05-07 and today.
  • TTS Arena V2 ranks Fish Audio S2 Pro as the top open-weights model at ELO 1128, behind five closed engines (Realtime TTS 1.5 Max at 1208, Gemini 3.1 Flash TTS at 1206, StepAudio 2.5 TTS at 1187, ElevenLabs v3 at 1178, Inworld TTS 1 Max at 1164). Arena measures general preference, not podcast multi-speaker dialog fitness.
  • Filtered for podcast (multi-speaker, expressivity, voice clone, Blackwell SM12.1 compatibility, open weights), the top three are different: VibeVoice, Higgs Audio v2, IndexTTS-2.
  • The spike plan: deploy all three on the Spark, render the same Turn 0 of the V5 polish script, listen side by side, pick a winner. Roughly one day of bring-up per engine.

Eight fixes deep, then the verdict came down

The Voxtral story on this blog goes back to late April 2026. The original bet was simple: Mistral had a fresh TTS checkpoint, it ran on GB10 Blackwell, the API was OpenAI-compatible, and the licensing was open. The first article in the series was Voxtral Stage 1 OOM on GB10: Why --enforce-eager Is Not Enough on 2026-04-25, fixing the immediate "container will not start" problem. The second was the 3.5-hour deadlock that was really an AttributeError on 2026-05-03, a Python init-order bug that masqueraded as a Blackwell GPU hang. The third was the three-line vllm-omni patch for Blackwell on the same day, the upstream fix that made Voxtral actually run on SM12.1.

The pipeline-side fixes followed. Mono 24 kHz baseline and three compression pitfalls on 2026-05-03 set the audio-production baseline. The FFmpeg volume filter eval=frame fix on 2026-05-07 chased down a four-second silent-intro bug. Per-segment loudnorm and the 3-second lookahead bug, the same day, fixed dialogue-rhythm issues caused by dynamic-mode loudnorm. The Voxtral chunk-strategy article (38 percent faster), also from 2026-05-07, was the only throughput-positive piece: whole-turn rendering beat chunked rendering for the same content.
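The eval=frame bug class is worth a concrete sketch. FFmpeg's volume filter defaults to eval=once, which evaluates the volume expression a single time at stream start; a time-dependent ramp expression therefore latches at its t=0 value. The filter string below is a plausible reconstruction of that bug shape, not the actual production filter chain:

```python
def intro_volume_filter(ramp_seconds: float = 4.0) -> str:
    """Build an ffmpeg volume-filter string with per-frame evaluation.

    With the default eval=once, an expression like min(1,t/4) is evaluated
    a single time at t=0 and latches at 0, silencing the whole stream.
    eval=frame re-evaluates it every frame so the ramp actually ramps.
    The ramp expression is illustrative, not the production settings.
    """
    return f"volume='min(1,t/{ramp_seconds})':eval=frame"
```

Passed to ffmpeg as `-af "$(...)"`, the same expression that produced a permanently silent intro under eval=once becomes a fade-in under eval=frame.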

The decisive article was Voxtral 4B Open-Checkpoint: The Encoder is Gated on 2026-05-07. That one was not a fix. That one was a structural finding. The model card promises voice cloning. The API validator accepts the ref_audio parameter. The tokenizer raises RuntimeError because the encoder weights live exclusively in Mistral’s hosted product. The decoder ships open; the encoder does not. That is when I knew the ceiling existed. I kept going anyway because I had already invested three weeks and because the strategy article on the next model choices for the DGX Spark (2026-05-11) still listed F5-TTS and Kokoro as my fallback path. Those recommendations are wrong, and this article is also the correction.

The Episode 1 V6 production run finished on 2026-05-07. The script was polished from V1 through V5 over four manual iterations. audit_rewrite_v6.py then split the longest monologues with HEXABELLA interjections and applied a global punctuation-normalization pass to produce 239 chunked turns for TTS. The render took 70 minutes. The mix landed in output/2026-05-07-starter-v2/episode.mp3. The verdict came five days later, after I sat down and actually listened to three specific turns end to end: Turn 0 (635 chars), Turn 10 (372 chars), Turn 14 (406 chars). All three rated zero out of ten on the “would I release this” axis.
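The chunking step that produced those 239 turns can be sketched independently of audit_rewrite_v6.py, which also inserts HEXABELLA interjections and normalizes punctuation. A minimal greedy pack at sentence boundaries, assuming the 350-char limit from the failure-mode table (function name and limit are illustrative):

```python
import re

def chunk_turn(text: str, max_chars: int = 350) -> list[str]:
    """Greedily pack whole sentences into chunks of at most max_chars.

    Illustrative sketch only. A single sentence longer than max_chars
    is emitted as-is; the real audit-rewrite layer splits those by hand.
    """
    # Split after sentence-ending punctuation, keeping each sentence whole.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

The greedy pack keeps every chunk under the staccato threshold while never cutting mid-sentence, which is the property the filler-hallucination regime punishes.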

V1 had been three out of ten on the same axis. V2 through V5 sat in the same range. V6 was the run where I tried to fix the V1 failure mode (filler hallucinations) by going to longer turns and discovered the opposite failure mode (flat staccato). The trade-off is structural to the open-checkpoint preset, not something more script polish can fix.

The two failure modes, in a table

| Turn-length regime | Voxtral output | Score |
|---|---|---|
| Short (≤100 chars per turn) | Human-paced delivery, but hallucinated "ähm, ähm" filler between clauses | 3/10 |
| Mid (200-300 chars, the sweet spot documented in working notes) | Mixed: some clean, some flat, some still ähm-prone | 3/10 |
| Long (≥350 chars per turn) | No filler sounds, but rapid staccato delivery with no prosodic variation | 0/10 |

The thirteen turns in the V7 chunked script that still exceed 350 chars are not the only problem. Even at 200-300 chars, where I had documented a “sweet spot” in working notes after the V3 render, the engine produced material I rated 3/10. The failure modes are not on a single quality axis with a tuning knob between them. They are two different things going wrong for two different reasons.

I confirmed nothing changed between 2026-05-07 and today by checking mistralai/Voxtral-4B-TTS-2603 on HuggingFace (zero commits in the window), the mistralai/ organization (no new Voxtral checkpoints, no encoder release), vLLM-Omni v0.20.0 release notes (generic TTS speedups, no new parameters exposed), and the Mistral docs page for Voxtral-TTS-2603 (still documents only voice preset plus streaming). Both hard limits stand.

What TTS Arena says, and what it does not say

The TTS Arena V2 leaderboard on HuggingFace ranks engines by user-preference ELO. The HF blog post on the Arena methodology describes the protocol: pairwise blind comparison, one prompt at a time, user picks the better of two clips. Top of the public leaderboard as of this writing, cross-referenced with the Artificial Analysis TTS leaderboard:

| Rank | Model | ELO | Type |
|---|---|---|---|
| 1 | Realtime TTS 1.5 Max | 1208 | Closed |
| 2 | Gemini 3.1 Flash TTS | 1206 | Closed |
| 3 | StepAudio 2.5 TTS | 1187 | Closed |
| 4 | ElevenLabs v3 | 1178 | Closed |
| 5 | Inworld TTS 1 Max | 1164 | Closed |
| Top open | Fish Audio S2 Pro | 1128 | Open |

The open-weights gap to the closed top is roughly 80 ELO points, which under the Elo model corresponds to about a 61% head-to-head win rate for the top closed engine over the top open one. Real, but not catastrophic.
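The win-rate figure follows directly from the standard Elo expected-score formula:

```python
def elo_win_prob(rating_a: float, rating_b: float) -> float:
    """Expected head-to-head win probability of A over B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

# Top closed (Realtime TTS 1.5 Max, 1208) vs top open (Fish Audio S2 Pro, 1128):
# an 80-point gap maps to roughly a 61% win rate.
```

An equal-rating pair lands at exactly 50%, which is the sanity check on the formula.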

Arena measures something narrower than what a podcast needs, though. The preference test is “which of these two clips sounds better” on a single sentence. It does not measure: dialog multi-speaker scaffolding, prosody continuity across a 30-minute episode, voice cloning from a reference, or any kind of expressivity knob a producer can actually steer. Fish Audio S2 Pro at rank one for open is optimized for multilingual cloning of a single narrator, which is not the shape of my problem.

When I filter the candidate set on “things a podcast workflow actually needs,” the ranking changes.

The podcast-specific filter

The criteria that matter for an English-only, two-host plus cameos podcast running on DGX Spark / GB10 Blackwell:

  1. Multi-speaker dialog scaffolding. Built for two or more distinct voices in the same generation context.
  2. Expressivity controls. Either explicit emotion knob, prosody-from-reference, or natural-language style instructions that the engine actually respects.
  3. Voice cloning. From a short reference, 5 to 60 seconds. Without it, two consecutive episodes can drift in voice character because the engine has no anchor.
  4. Speed control. Explicit knob or duration target. The Voxtral failure mode of “too fast for humans” is a direct result of not having this.
  5. Armv9.2-A and Blackwell SM12.1 compatibility. The Spark uses CUDA 13 wheels for PyTorch; most TTS engines ship for x86 CUDA 12. Bring-up risk is real and engine-specific. The aarch64 CUDA 13 PyTorch wheels recipe is the reference path. The vLLM SM121 support issue tracks the upstream side.
  6. Open weights, permissive license. Apache 2.0 or MIT preferred. Custom non-commercial restrictions complicate any path from blog hosting to supporter-funded distribution.

Two engines in the Arena top dozen pass criteria 1 through 4 cleanly. Two more pass with caveats. Everything else fails on multi-speaker, voice clone, or expressivity.

The three to spike

VibeVoice (community fork, 1.5B / 7B / Realtime-0.5B)

Sources: vibevoice-community/VibeVoice on GitHub, DGX Spark setup discussion on HuggingFace, context on the Microsoft pullback in September 2025.

The only TTS engine I found that is explicitly architected for the long-form multi-speaker podcast use case. The model card from Microsoft’s original release describes it as “designed for up to 4 distinct speakers in a single generation of up to 90 minutes” using next-token diffusion. It was an ICLR 2026 oral. The HuggingFace microsoft/VibeVoice-Realtime-0.5B model card has a verified setup thread for DGX Spark with CUDA 13 and aarch64, which is the only TTS engine in my candidate list with a documented Spark deploy.

Caveats are real. Microsoft pulled the original repository in September 2025 over concerns about deepfake misuse, and the released weights ship with an audible disclaimer and a watermark. The community fork at vibevoice-community/VibeVoice is the practical path; it strips neither the disclaimer nor the watermark, which is the right answer for legal hygiene but may require post-processing for production audio. License is MIT on the weights.

VibeVoice has no explicit speed parameter. Pace emerges from the script structure and the reference voice rather than being directly steerable. Voice cloning works from short references but is conditioned implicitly through the speaker prompt. For my use case, the multi-speaker architecture is the strongest single fit; the lack of an explicit speed knob is the open risk.

Higgs Audio v2 (Boson AI, 3B base)

Sources: boson-ai/higgs-audio on GitHub, Higgs Audio v2 announcement on the Boson AI blog (75.7% win rate over gpt-4o-mini-tts on EmergentTTS-Eval Emotions, 55.7% on Questions), the Blackwell support issue #39.

Best documented raw expressivity in the current open-weights field. Voice clone works from 3 to 10 seconds of reference audio. The architecture is dual-codec: content and style tokens are routed through separate codecs, which is the technical reason it can transfer emotion from a reference rather than just timbre. Pretrained on 10 million hours of audio across about 50 languages.

License is derivative-Llama (commercial use allowed under standard Llama terms). Blackwell support is the open issue: the official Docker image does not yet support SM 12.0 or 12.1. The bring-up path is rebuilding against CUDA 13 and PyTorch nightly aarch64 wheels using the recipe linked above, which is well-documented but adds a half-day of work.

For my use case, this is the expressivity ceiling. If VibeVoice produces multi-speaker dialog that still sounds flat, Higgs v2 is the next escalation. The bring-up cost is the friction.

IndexTTS-2 (Bilibili, September 2025)

Sources: index-tts/index-tts on GitHub, the arXiv paper documenting the duration-control mechanism, and the license ambiguity documented in issue #228.

The killer feature is explicit duration control, which the arXiv paper claims makes it the first autoregressive zero-shot TTS model to expose one. This directly addresses Voxtral's "too fast for humans" failure mode. The other architectural feature is decoupled timbre and emotion: you can supply one reference clip for the voice character and a different reference clip for the mood, with Qwen3-fine-tuned natural-language emotion instructions on top.
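A duration knob only helps if you can compute a sane target. A minimal sketch, assuming an average speaking rate; the 14 chars/second figure is a rough assumption for conversational speech, not a measured value, and the function is mine, not the IndexTTS-2 API:

```python
def human_paced_duration(text: str, chars_per_second: float = 14.0) -> float:
    """Rough target duration in seconds for a turn at a human speaking pace.

    chars_per_second is an assumed average rate, to be calibrated against
    a reference recording before feeding any real duration-control knob.
    """
    return round(len(text) / chars_per_second, 1)

# Turn 0 of the V5 script is 635 chars: at 14 chars/s that is a ~45 s target,
# versus the rushed staccato Voxtral produced for the same text.
```

Calibrating chars_per_second against one human-read reference clip would turn this from a guess into a usable target.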

License is the most awkward of the three. The root LICENSE says Apache 2.0, but the INDEX_MODEL_LICENSE file requires written authorization from Bilibili for commercial use, and issue #228 documents the contradiction without resolution. Personal use and non-commercial blog content are fine. Lightning bounty distribution or any direct monetization carries license risk until clarified.

No documented DGX Spark deploys exist for IndexTTS-2 yet, which means generic aarch64 and SM12.1 PyTorch bring-up work is required. Expect a similar half-day to Higgs v2.

For my use case, IndexTTS-2 is the fallback that specifically solves the speed problem if VibeVoice and Higgs v2 both miss on pace control. It also gets used for any cameo segment where I want fine-grained emotion direction.

What I am not spiking, and why

Several engines look promising in lay TTS coverage but fail on closer inspection.

The previous open-TTS recommendation in the DGX Spark model choices article of “Kokoro plus F5-TTS” reflects what the leaderboard discourse looked like in early May 2026. The podcast-specific filter shifts the answer once you apply it, and that is what this article corrects.

The spike plan

Three days of focused work: roughly a half-day of bring-up per engine, plus listening time.

Day 1 morning, VibeVoice community fork. Clone, install dependencies, render Turn 0 of the V5 polish script (the 635-character cold-open monologue that Voxtral V6 rendered as flat staccato), listen, score against V6. Day 1 afternoon, render the same turn from V1 (the short-sentence version) and confirm whether VibeVoice eliminates the ähm-hallucinations of Voxtral V1.

Day 2 morning, Higgs Audio v2. Build the Blackwell-compatible container using the aarch64 CUDA 13 wheels recipe, deploy, render the same Turn 0, score, listen comparison.

Day 2 afternoon, IndexTTS-2. Same bring-up class as Higgs. Render Turn 0 with explicit duration control set to a human-paced target. Confirm whether the duration knob actually does what the paper claims.

Day 3 morning, side-by-side comparison. Three turns rendered by three engines. A/B/C listen with the V6 Voxtral baseline as the reference. Pick the winner on quality, document the second-place fallback, write up the result.

The criteria for picking the winner: clearest dialog turn-taking, no filler hallucinations, human-paced delivery, voice consistency across the same generation. If two engines tie on quality, the tiebreaker is bring-up reproducibility and license clarity.
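The winner-picking rule, including the tiebreaker order, is simple enough to write down. The scores below are placeholders to show the shape, not spike results:

```python
def pick_winner(results: list[dict]) -> str:
    """Pick the spike winner: highest mean listen score across the rendered
    turns; ties broken by bring-up reproducibility, then license clarity
    (both coarse 0-2 ratings, higher is better)."""
    best = max(
        results,
        key=lambda r: (sum(r["scores"]) / len(r["scores"]), r["bringup"], r["license"]),
    )
    return best["engine"]
```

Encoding the tiebreakers as a sort key keeps the Day 3 decision mechanical instead of a mood call after three days of listening.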

What this means for Episode 1 and the next episodes

Episode 1 does not ship in its current form. The V7 plan (audit-rewrite of the V5 polish, re-render on Voxtral) is dead because the engine ceiling is the bottleneck, not the script. Episode 1 reruns once the TTS spike picks a winner, on the same V5 script. The audit-rewrite logic in audit_rewrite_v7.py likely needs revisiting once we know what the new engine prefers. Some engines tolerate long monologues; some hate them. The script chunking strategy is engine-specific.

Episodes 2 onward inherit whichever engine wins the spike. The pipeline structure (script polish then audit-rewrite then TTS then mix) stays the same, with the TTS service swapped out. The mix stage already has the per-segment loudnorm fixes and the ffmpeg eval=frame patch from earlier in May, both of which are engine-agnostic.

Spike Day-1 results: VibeVoice clears the Voxtral bar, ceiling 7/10

Day 1 of the three-day spike ran on 2026-05-13. Eleven VibeVoice renders (5 CIPHER solo, 2 HEXA solo, 4 dialog combos) on the same V5 polish source text, with audio embedded inline for direct A/B against the Voxtral V6 baseline. Top score 7/10; no sample reached 8. The best dialog combo (Frank-Emma) was "richtiger Dialog" (a proper dialog) in the operator's verbatim verdict. Plus a strategic reframe: Carter and Grace, who failed the host auditions, refit cleanly as the cool-clinical CLAWI and VIBE cameo voices.

Full sample matrix with embedded audio, the operator’s per-sample verdicts, and the four-character cast refit in TTS Spike Day 1: VibeVoice Sample Matrix on DGX Spark. Day-2 (Higgs Audio v2) and Day-3 (IndexTTS-2) follow.

How the TTS candidate weights got pulled

Update 2026-05-13: pulling the spike-candidate weights (VibeVoice Realtime / 1.5B / Large, Higgs Audio v2, IndexTTS-2) plus the LLM stack overnight surfaced three concrete HuggingFace failure modes. Xet protocol over IPv6 unreachable on DGX Spark, httpx read-timeout too short for multi-GB shards, and hf download returning exit zero while leaving .incomplete blobs behind. The wrapper that handles all three plus exponential backoff and filesystem-level validation is /data/scripts/ops/hf-pull in cipherfox/sovereign-ops. Full postmortem in Why hf download Lies to You at 22 GB on DGX Spark. For the actual spike runs documented below, every model pull went through that wrapper.

Validation against artificialanalysis.ai (2026-05-13 update)

After this article shipped, the reader-driven correction pass on the arena.ai leaderboard article surfaced artificialanalysis.ai’s TTS leaderboard. Voxtral is on that leaderboard at ELO 1056, roughly tied with Kokoro-82M v1.0. Fish Audio S2 Pro tops the open-weight column at 1128. Top closed models cluster around 1180-1208 (Realtime TTS 1.5 Max, Gemini 3.1 Flash TTS, ElevenLabs v3).

That 1056 ELO is not the catastrophic verdict this article delivers, and the two are not in conflict. AA's leaderboard measures preference on isolated single-sentence prompts, the same methodological limit arena.ai has for LLMs. It does not measure 30-second podcast monologue prosody, multi-speaker dialog turn-taking, or HEXABELLA-translator persona coherence. Voxtral at ELO 1056 reads as "competent on average single-sentence English TTS"; that squares with the 3/10 I gave V1, where short clips sounded fine sentence by sentence until the filler hallucinations crept in. The 0/10 verdict on V6 is the use-case-specific failure for long-form podcast content, which the AA leaderboard cannot evaluate.

My three spike candidates are not on the AA TTS leaderboard at all. VibeVoice, Higgs Audio v2, and IndexTTS-2 are either too new or have not been submitted. This is meaningful: there is no third-party benchmark to anchor the spike result against. My A/B/C listen on Turn 0 of the V5 script becomes the primary signal. I will record the verdict numerically and submit the winner to AA once the spike completes, partly so the next operator does not have to repeat the same blind A/B/C from scratch.

Why I am not switching to Fish Audio S2 Pro despite the ELO ranking. AA’s open-weight top is single-narrator multilingual cloning. That is a different shape from podcast multi-speaker dialog with cameo agents. The two-leaderboard lesson applies here too: top-of-leaderboard for a different use case is not the same as best fit for mine.

Caveat on Voxtral’s ranking. ELO 1056 is the average user’s experience on isolated phrases. For a self-hosted operator running multi-character podcast scripts on a DGX Spark with the open-checkpoint constraints (instructions ignored, ref_audio crashed, no speed knob), the realized quality is the floor I documented in the two failure modes table. Average leaderboard performance and use-case-specific ceiling are two different metrics, and confusing them is exactly the kind of leaderboard misread the two-leaderboards article is about.

What I am watching for next

A few things will move the picture again before the spike runs.

Mistral encoder release. The path that fixes Voxtral specifically is the encoder weights being released, which would unlock ref_audio and let me hold all my existing Voxtral pipeline scaffolding intact. There is no public roadmap signal from Mistral. I will recheck the HuggingFace organization weekly.

VibeVoice 1.5 or 2.0 release. The community fork is actively developed. If a major release adds an explicit speed parameter, the case for VibeVoice as the primary becomes harder to beat.

Higgs v2 Blackwell support upstream. The open GitHub issue on SM12.1 support is being tracked by the Boson AI team. If they ship a Spark-compatible container before my spike runs, the bring-up cost drops from a half-day to thirty minutes.

IndexTTS-2 license clarification. Bilibili has not yet responded to issue #228. If the commercial-use ambiguity gets resolved cleanly, IndexTTS-2 moves up the priority list because its duration control is unique in the field.

The spike runs after the next blog-pipeline cycle finishes. The follow-up article that documents which engine I actually picked, and the Episode 1 rerun with the new TTS stack, follows around 2026-05-19 if the spike stays on schedule.

What I Am Trying Next

  • VibeVoice community fork as the leading candidate. Fits multi-speaker podcast use case directly. Proven DGX Spark setup path exists.
  • Higgs Audio v2 as the expressivity ceiling option. Needs Blackwell build work but documented win rate on emotion benchmarks is the strongest in open weights.
  • IndexTTS-2 as the speed-control fallback. Only open engine with explicit duration knob. License requires care for any monetization path.
  • Voxtral 4B open-checkpoint stays parked. No path to release quality without encoder weights from Mistral. Will not invest more script-side polish work to chase a 3/10 ceiling.
Illustration: Voxtral Capped at 3/10: Picking the Next Open TTS