TTS Spike Day 2: My Ears, the Vendor, and the Arena Disagree on Qwen3-TTS
Day 2 of the TTS spike was on the calendar as Higgs Audio v2. A model that was never on the shortlist took the slot instead. The Qwen team’s Qwen3-TTS 1.7B landed on the desk, rendered the exact six-turn CIPHER/HEXA dialog VibeVoice ran on Day 1, and won the operator’s ears outright at 8/10. Then the cross-check against the public leaderboards refused to agree with itself. The ears, the vendor’s own benchmark, and the blind-vote arena each named a different winner, and not by a hair. This is the log of that three-way split, and of the thing it taught: the three tests are not three readings of one quantity. They measure three different things and print them on the same-looking scale.
Quick Take
- Qwen3-TTS 1.7B scored 8/10 by ear on the production dialog, the highest in the spike, over the VibeVoice 7/10 ceiling and the Voxtral 0/10 floor.
- Qwen’s own long-form WER ranks Qwen > VibeVoice > Higgs. The Artificial Analysis blind arena ranks them the other way up: VibeVoice at Elo 970, both Qwen3-TTS entries dead last on a 74-model board.
- They disagree because they measure different constructs: intelligibility (WER), isolated naturalness (arena), and fit to this cast on this script (ear). A model can win one and lose another with no contradiction.
- Verdict for this podcast: Qwen3-TTS, because the ear-test is the only one of the three that judged the actual job, on the only checkpoint this stack can run. The arena result stays on the record as the reason not to overclaim it.
- Higgs and IndexTTS-2 deferred. Final pick waits on a full-episode render, because a snippet hides the seams.
The challenger that skipped the line
Qwen3-TTS was not one of the three spike candidates. It got pulled in because the LLM side of this stack had just moved to Qwen3.6, the weights were already local, and a 1.7B speech model costs almost nothing to try. It also has a property the others do not: at 1.7B it loads into the spare unified memory beside the resident LLM, no service swap, no API. For a stack whose whole point is running offline, that alone earns it a serious listen.
CustomVoice has no multi-speaker call, so the render goes turn by turn. Each line is synthesised solo with a preset timbre (aiden, an American male, for CIPHERFOX; sohee, a warm female, for HEXABELLA) and a one-line emotion instruction, then the turns are concatenated. Same text, same six turns, same roles as the Day-1 VibeVoice runs, so it drops into that matrix rather than starting a new one. Note the cost of solo rendering: there is zero cross-turn prosody, every line is generated blind to the one before it. Hold that thought.
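The turn-by-turn render can be sketched roughly as below. This is a minimal illustration, not the script used in the spike: `synthesize` is a hypothetical stand-in for whatever call produces one line of audio from text, a preset voice, and an emotion instruction, and `Turn`, `render_dialog`, the 24 kHz sample rate, and the 0.3 s gap are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical signature: (text, voice, instruction) -> mono samples as floats.
# It stands in for the real TTS call, which this post does not show.
Synthesize = Callable[[str, str, str], List[float]]


@dataclass
class Turn:
    voice: str        # preset timbre, e.g. "aiden" or "sohee"
    instruction: str  # one-line emotion instruction
    text: str         # the line to speak


def render_dialog(
    turns: List[Turn],
    synthesize: Synthesize,
    sample_rate: int = 24_000,  # assumed; use the engine's real output rate
    gap_seconds: float = 0.3,   # assumed pause inserted between turns
) -> List[float]:
    """Render each turn on its own, then join the clips with short silences."""
    silence = [0.0] * int(sample_rate * gap_seconds)
    audio: List[float] = []
    for index, turn in enumerate(turns):
        # Each call sees only its own line, so no prosody carries across turns.
        clip = synthesize(turn.text, turn.voice, turn.instruction)
        if index > 0:
            audio.extend(silence)
        audio.extend(clip)
    return audio
```

The structure makes the limitation visible: nothing from turn N is passed into the call for turn N+1, so any continuity between lines has to come from the text and the emotion instruction alone.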
Qwen3-TTS 1.7B is an 8/10. Better quality, natural pacing, and the US-English accent fits these voices better than the UK-English options. I also like how it carries the emotions.
Highest score in the spike. It wins on the two axes VibeVoice kept losing: natural tempo, rather than the “too fast for humans” failure that sank Voxtral, and emotional contrast across turns without any tone-tag coaxing. And it wins them as the smallest model in the field. That last point compounds Day 1: VibeVoice-Large 7B already failed the “bigger checkpoint, better voice” test, and a 1.7B model beating a 7B one on delivery is the second nail in that idea.
One clause in the verdict is quieter than the rest and turns out to be load-bearing: “the US-English accent fits these voices better.” That is not a quality judgment. It is a casting judgment, and it is the seam the leaderboards fall straight through.
Three judges, three verdicts
The ears. Eight for Qwen3-TTS, seven for the best VibeVoice, on the real script with the real voices, one careful listen.
The vendor’s ruler. Qwen’s Qwen3-TTS technical report gives long-form word error rate (WER, lower is better), the rate at which an ASR pass mis-hears the generated speech, for all three engines on one test set:
| Long-form WER, English (lower is better) | Score | In this listen? |
|---|---|---|
| Qwen3-TTS 25Hz 1.7B | 1.225 | yes, the 8/10 above |
| VibeVoice | 1.780 | yes, the 7/10 ceiling |
| Higgs Audio v2 | 6.917 | not yet (deferred) |
It agrees with the ears, Qwen ahead of VibeVoice. But read the Higgs number before trusting the table. Higgs is the field’s emotion leader on its own EmergentTTS results, and it posts the worst long-form stability here, an error rate almost four times VibeVoice’s and more than five times Qwen’s. That is not noise, it is the trade-off itself: the harder a model pushes expressive range, the more it tends to drift over long spans. The deferred Higgs render now has a prediction to falsify, not just a slot to fill. Two caveats sit on the whole table regardless. First, WER scores intelligibility, not aliveness, which is the axis the ear actually graded. Second, the numbers are from Qwen’s own report. A vendor grading its own homework is evidence, not a referee.
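For readers who have not computed WER by hand, here is a minimal sketch of the metric: word-level edit distance between a reference transcript and what the ASR pass heard, divided by the reference length. It assumes plain whitespace tokenisation and lowercasing only. The report's actual text normalisation, ASR model, and test set are not reproduced here, so this shows what is being counted and cannot regenerate the table.

```python
from typing import List


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref: List[str] = reference.lower().split()
    hyp: List[str] = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Row 0: turning zero reference words into j hypothesis words
    # takes j insertions.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        # Column 0: turning i reference words into nothing takes i deletions.
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)
```

Nothing in this calculation hears tempo, emotion, or accent. A flat but perfectly articulated read scores the same as a lively one, which is the first caveat above in concrete form.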
The blind crowd. The Artificial Analysis Speech Arena ranks TTS by blind pairwise votes, thousands of them: two anonymous clips of the same line, pick the more human. The order inverts.
| Artificial Analysis Speech Arena (blind votes) | Elo | Rank (of 74) |
|---|---|---|
| VibeVoice 7B | 970 | 68 |
| Qwen3 TTS Flash | 934 | 73 |
| Qwen3 TTS | 923 | 74 |
VibeVoice 7B sits near the bottom at 68 of 74, both Qwen3-TTS rows are at the floor, and Higgs and IndexTTS-2 are not on the board at all. The largest, most independent test says the reverse of the ears. But look at what it is actually scoring: the two Qwen rows are the hosted Flash and API variants, not the local 12Hz 1.7B CustomVoice checkpoint that earned the 8. Same family, different weights, and the offline one, the only one this stack would ever run, is on no leaderboard at all.
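To put the Elo column in head-to-head terms, the sketch below uses the textbook logistic Elo formula with a 400-point scale. That is an assumption: the arena may fit its ratings differently, so treat the output as a rough reading of the gap and not as the arena's own statistic.

```python
from typing import Tuple


def elo_expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Probability that A is preferred over B under the logistic Elo model."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


def elo_update(
    rating_a: float, rating_b: float, a_won: bool, k: float = 32.0
) -> Tuple[float, float]:
    """Apply one blind vote; k is an assumed step size, not the arena's."""
    expected_a = elo_expected_score(rating_a, rating_b)
    delta = k * ((1.0 if a_won else 0.0) - expected_a)
    return rating_a + delta, rating_b - delta


# Elo values from the arena table above.
vibevoice, qwen3_tts = 970.0, 923.0
p_vibevoice_preferred = elo_expected_score(vibevoice, qwen3_tts)
```

Under that assumption the 47-point gap works out to VibeVoice being preferred in roughly 57% of direct pairings against the hosted Qwen3 TTS entry. The rank order is inverted relative to the ears, and the formula gives a sense of how strong the crowd's preference is between these two specific rows.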
Why three honest tests disagree
Line up the conditions and the paradox collapses into construct validity, the dull name for a sharp problem: each test measures a different thing and labels it with the same word, “best”.
- The arena measures isolated naturalness. It votes on single lines in each model’s default voice. It cannot hear prosody held across a six-turn argument, and it cannot see casting, because every model speaks in its house voice. The operator’s whole ear-win runs through casting: Qwen3-TTS let him pick a US-male timbre that matches CIPHER, where VibeVoice’s strongest male read British. Change that one variable and part of the 8-versus-7 gap may close. The arena structurally cannot register the variable at all.
- WER measures stability, not life. It rewards a model that never slurs and penalises one that takes expressive risks, which is exactly why the emotion leader sits at the bottom of it.
- The ear measured fit to one job. Specific voices, US-English target, emotional contrast across a real technical dialog. The most specific test in the set, and the smallest sample, N=1.
Sample size cuts both ways at once. The arena is thousands of votes, so it is the better estimate of the general question, which model sounds most human to a stranger. The ear is one listen, so it is the worse estimate of that question and the better estimate of the only question this project is asking.
Which judge to trust
Averaging them is the worst move available, a single number true of nothing. Match the test to the decision instead.
- Shipping TTS to a broad, unknown audience on stock voices? The blind arena is the right judge. Between these two, it says VibeVoice.
- Casting a fixed two-host show, your own scripts, US-English, running offline on your own silicon? The right judge is an ear-test on exactly that, on the checkpoint you can actually run. It says Qwen3-TTS, and the arena never tested that checkpoint to begin with.
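The case against averaging is easy to show with the numbers already in this post. The sketch below picks a winner per judge and then computes the naive mean across judges. The dictionary layout and names are invented for illustration, and only the two engines heard so far are included.

```python
from statistics import mean
from typing import Dict

# Numbers copied from this post; only the two engines heard so far.
# Caveat: the arena figure for Qwen is the hosted "Qwen3 TTS" row,
# not the local checkpoint that was scored by ear.
SCORES: Dict[str, Dict[str, float]] = {
    "ear": {"Qwen3-TTS": 8.0, "VibeVoice": 7.0},
    "wer": {"Qwen3-TTS": 1.225, "VibeVoice": 1.780},
    "arena": {"Qwen3-TTS": 923.0, "VibeVoice": 970.0},
}
LOWER_IS_BETTER = {"wer"}


def winner(judge: str) -> str:
    """Return the model this judge prefers, respecting its score direction."""
    scores = SCORES[judge]
    pick = min if judge in LOWER_IS_BETTER else max
    return pick(scores, key=scores.get)


winners = {judge: winner(judge) for judge in SCORES}

# The naive average mixes a 10-point score, an error rate, and an Elo rating.
# It is dominated by the Elo values and describes none of the three tests.
naive_mean = {
    model: mean(SCORES[judge][model] for judge in SCORES)
    for model in ("Qwen3-TTS", "VibeVoice")
}
```

Two judges pick Qwen3-TTS and one picks VibeVoice, while the naive mean simply echoes the arena because Elo is the largest number in the pile. Any normalisation scheme would fix the units but still blend three different questions into one answer.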
This podcast is the second case on every axis, including the one the leaderboards quietly ignore: the arena’s stronger Qwen entry is a hosted API, and a sovereign stack does not pipe a hosted API into its own pipeline. So Qwen3-TTS leads here, and the arena ranking stays on the page as the exact reason the claim is narrow. It beats VibeVoice for this cast, on this script, to these ears, on hardware you control. Not in general. Between these two, the general-purpose edge is VibeVoice’s, and that is fine, because this was never a general question.
The engine, talking about itself
Benchmarks aside, the real test is whether it carries a whole episode, and that full render is running now. Here is the cold open plus one cameo from it, produced end to end by Qwen3-TTS on the Spark, two hosts and a guest:
The guest is an in-joke the engine made possible. When the jargon stacks up, the co-host hands off to a third voice to translate it. That voice was a cameo persona called VIBE back when the stack ran on Mistral’s command-line tool. With the move to Qwen3.6, and now Qwen3-TTS voicing it, the persona got renamed to QWEN. Picking the timbre took a second pass: the first preset voice I gave the cameo read too plain for a guest spot, so I swapped it for a warmer, more characterful one that actually fit the part. So it is a Qwen model, read by a Qwen text-to-speech voice, playing a character named QWEN, explaining why the toolchain is hard. The show is the stack narrating itself.
The honest open ends
Two things keep Day 2 short of the finale.
First, the seam. The Qwen render concatenated solo turns with no cross-turn prosody, and it still won, which means either its per-line delivery is strong enough to paper over the joins or six turns is too short to expose them. A forty-minute episode will not be so forgiving. The operator’s own hedge after the cross-check was blunter than the score: maybe it was the text. So the final pick rides on a full-episode render of the same script, not a cold open, produced through a switchable pipeline so VibeVoice and Qwen3-TTS can be compared at episode length on identical content.
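One way such a switchable pipeline could be laid out is a small engine registry, so the same script is rendered by whichever engine is selected by name. This is a hypothetical sketch: the `Engine` signature, `register`, and `render_episode` are invented for illustration, and the real engines would be wrapped behind that signature by adapter code not shown here.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical engine interface: (speaker, text) -> mono samples as floats.
Engine = Callable[[str, str], List[float]]
Script = List[Tuple[str, str]]  # (speaker, line) pairs

ENGINES: Dict[str, Engine] = {}


def register(name: str) -> Callable[[Engine], Engine]:
    """Decorator that files an engine under a name the pipeline can select."""
    def decorator(engine: Engine) -> Engine:
        ENGINES[name] = engine
        return engine
    return decorator


def render_episode(script: Script, engine_name: str) -> List[float]:
    """Render the same script with whichever registered engine is selected."""
    try:
        engine = ENGINES[engine_name]
    except KeyError:
        raise ValueError(
            f"unknown engine {engine_name!r}; registered: {sorted(ENGINES)}"
        ) from None
    audio: List[float] = []
    for speaker, line in script:
        audio.extend(engine(speaker, line))
    return audio
```

Keeping the script and the render loop identical across engines is the point: any difference heard at episode length then comes from the engine and not from the content or the assembly.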
Second, two engines are still owed a hearing: Higgs Audio v2, whose WER number now predicts the shape of its failure in advance, and IndexTTS-2, whose explicit duration control is the one feature aimed straight at the pacing problem that started this whole spike. Both get rendered before the spike closes.
The number that ends Day 2 is not 8/10. It is the reminder that one model can place first, last, and first again depending on who holds the stopwatch, and that the only honest report names the stopwatch and says why it was the right one for the job in front of you.
Cross-references
- Day 1, the VibeVoice sample matrix and the 7/10 bar Qwen3-TTS had to clear: TTS Spike Day 1: VibeVoice Sample Matrix
- Why the spike happened at all, the Voxtral verdict that triggered it: Voxtral Capped at 3/10: Picking the Next Open TTS
- The LLM stack the TTS choice runs alongside: Spark Arena Rank 4 Made Me Add Qwen3.6 to My DGX Spark
- How this blog measures its own claims instead of trusting a single number: The Quality Gate That Rewards Fabrication