TTS Spike Day 1: VibeVoice Sample Matrix on DGX Spark
Yesterday’s TTS pivot decision committed to spiking three candidates on the DGX Spark: VibeVoice (community fork), Higgs Audio v2, and IndexTTS-2. Day 1 was VibeVoice. This article is the engineering log of what fourteen renders sound like, with the audio embedded, the operator’s verdict per sample, and the matrix shape that emerged.
The bar to clear was the Voxtral V6 verdict from yesterday: zero out of ten on the 30-second cold-open monologue. Voxtral’s two failure modes are length-driven: short turns trigger “ähm, ähm” filler hallucinations, long turns flatten into staccato, with no sweet spot in the middle. The empirical chunk-length-vs-pace dynamic is documented in Voxtral Chunk Strategy: 38 Percent Faster Render with Whole Turns. Chunking the same script into 90-character pieces produces different prosody than rendering it whole, and the chunk-boundary is exactly where the filler-injection pattern lives. VibeVoice had to do better than both ends of that trade-off, by some measurable margin, in a documented A/B/C/D listening test.
Quick Take
- Best Day-1 score is 7/10 (three Phase-1 samples tied: Mike-solo, Frank-solo, Carter-Grace-dialog). Voxtral V6 was 0/10. VibeVoice clears the bar comfortably but does not reach release-quality.
- Phase-2 confirmed the ceiling at 7/10. Three follow-up renders (script tactics, 4-voice cast, VibeVoice-Large 7B) all scored 5-6/10. The ~4.7× parameter increase from 1.5B to 7B did not move the verdict on identical input. The ceiling is structural to VibeVoice, not a model-size artifact.
- Cross-cutting weakness: every sample drew the same comment, namely that sound quality is below studio level. This held for Realtime-0.5B, 1.5B, and Large-7B. The Phase-2 7B render specifically tested the model-size hypothesis and disconfirmed it.
- Two suspect samples (flagged with ⚠ inline): Grace-solo rendered 30% slower than Emma at identical inference flags (pace anomaly, dropped pitch into androgynous territory) and Mike-Emma dialog ran on a script-imbalanced monolog-shaped source. Both verdicts are retained for transparency but neither is comparable engine evidence. Clean re-renders deferred to Day-2 since the Phase-2 ceiling-test now takes priority.
- Strategic reframe: Carter (5/10 as CIPHERFOX) refits as the CLAWI cameo candidate, Samuel (4/10 as CIPHERFOX, Indian English) refits as the new QWEN cameo candidate now that the LLM stack is migrating to Qwen3.6 and the VIBE/Mistral-CLI cameo role is retired. Three cameo personas total: CLAWI, QWEN, and a third slot held open. Grace verdict suspended (⚠ suspect sample, see below).
- Top combinations: Frank-Emma felt like an actual dialog (6/10, “real dialog” in the operator’s words); Carter-Grace had pleasant pitch but Carter was emotionally flat (7/10, read as recited rather than conversational). Day-2 needs Frank’s emotional range plus Carter-Grace’s pitch comfort, ideally via VibeVoice-Large.
- Fourteen VibeVoice renders total (Phase-1: 5 CIPHER solo + 2 HEXA solo + 4 dialog combos; Phase-2: tactics, 4-voice cast, Large-7B), plus the Voxtral V6 baseline. Rendered locally on a DGX Spark with VibeVoice-Realtime-0.5B, VibeVoice-1.5B, and VibeVoice-Large-7B. Seven distinct voices: Mike, Carter, Davis, Frank, Samuel (male); Emma, Grace (female). Source text constant across the matrix. Engine and model size are the variables.
The matrix
| Category | Model | Samples |
|---|---|---|
| CIPHER monolog (Turn 0, 635c) | VibeVoice-Realtime-0.5B | 5 voices |
| HEXA monolog (Turn 2, 246c) | VibeVoice-Realtime-0.5B | 2 voices |
| CIPHER+HEXA dialog (Turns 0-5) | VibeVoice-1.5B | 4 voice combos |
| Phase-2 tactics (Frank-Emma, [pause] + fillers + em-dash) | VibeVoice-1.5B | 1 |
| Phase-2 4-voice cast (Frank+Emma+Grace+Carter) | VibeVoice-1.5B | 1 |
| Phase-2 Large-7B sound-quality test (Frank-Emma) | VibeVoice-Large 7B | 1 |
| Voxtral V6 baseline (yesterday’s 0/10) | Voxtral-4B-TTS-2603 | 1 reference |
All samples are the same source text from the V5 polish of Episode 1’s cold-open. The CIPHER solo is the long monologue that broke Voxtral. The dialog includes the short HEXA interjections that VibeVoice’s multi-speaker architecture should handle naturally.
Baseline: Voxtral V6 (the 0/10 from yesterday)
The rest of this article is only useful as a comparison if you can hear the bar that VibeVoice has to clear. Below are three Voxtral V6 renders from yesterday’s spot-listen, each with the operator’s verdict (translated from German) and the measured pace. The same source text comes back further down in the VibeVoice section so you can A/B directly.
Human podcast hosts typically land around 2.0-2.6 words per second. News anchors at 2.8 wps already feel rushed. Above 3.0 wps the listener starts losing comprehension on technical content.
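The pace metric quoted throughout is computed the obvious way. A minimal sketch (function names are mine, not from the render tooling), with the comprehension bands from above encoded directly:

```python
def words_per_second(text: str, duration_s: float) -> float:
    """Pace metric used in the verdicts: whitespace-delimited words / audio seconds."""
    return round(len(text.split()) / duration_s, 2)

def pace_band(wps: float) -> str:
    """Rough comprehension bands for spoken technical content."""
    if wps < 2.0:
        return "below natural floor"
    if wps <= 2.6:
        return "natural podcast range"
    if wps <= 3.0:
        return "news-anchor, starts to feel rushed"
    return "comprehension-degraded"

# Voxtral V6 Turn 0: 108 words in 30.2 seconds
print(words_per_second("word " * 108, 30.2))  # 3.58
print(pace_band(3.58))                        # comprehension-degraded
```

Every pace figure in the matrix below is this ratio against the render duration reported by the engine.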
V6 Turn 0 (CIPHER cold-open, 635 chars)
Verdict (translated): too fast for human comprehension, reads like a recital, no emotional variation.
Pace: 3.58 wps · Duration: 30.2s for 108 words · Engine: Voxtral-4B-TTS-2603, casual_male preset
The pace is roughly 38% faster than a natural human reading the same text. No prosodic variation, listener never gets to absorb the previous beat before the next arrives.
V6 Turn 10 (CIPHER, 372 chars, with a duplicate-phrase glitch)
Verdict (translated): slightly better than Turn 0, but still unusable.
Pace: 3.01 wps · Duration: 21.9s for 66 words
Shorter turn pulls the pace from 3.58 down to 3.01 wps. Still uncomfortably fast for technical content. Plus a Mistral-generation duplication that Voxtral renders verbatim around the eight-second mark.
V6 Turn 14 (CIPHER, 406 chars)
Verdict (translated): too fast, delivered in a flat reading tone.
Pace: 2.86 wps · Duration: 23.8s for 68 words
Three nested technical beats (IPv4 resolver, SGLang rm flags, CUDA context lifecycle) compressed into 24 seconds. The structure of the prose is invisible in the rendering because the pace flattens any beat-by-beat emphasis.
Three samples, three turn lengths, same failure shape: pace 2.86 to 3.58 wps across the range, no prosodic variation, no breath. This is the pattern that ended the Voxtral path and triggered the spike.
CIPHER solo · Turn 0 (635c monolog)
The single hardest test for VibeVoice. A long single-speaker monolog without speaker variation, exactly the shape the model is documented to be worst at. Five male voices, all rendered via VibeVoice-Realtime-0.5B.
All five samples render the same 108-word, 635-character monolog. The pace column is the load-bearing data point. Voxtral V6 hit 3.58 wps on this exact text; the VibeVoice band sits at 2.53-2.84 wps, 21% to 29% slower depending on voice. Voice character notes are paraphrased from the operator’s listening verdict.
Mike (7/10)
Verdict: staccato pattern improved noticeably over Voxtral. Sound texture still below studio-grade but the rhythm is human.
Pace: 2.76 wps · Duration: 39.1s · Voice: warm US male
Carter (5/10)
Verdict: higher-pitched than Mike, perceived as less authoritative for the skeptical-engineer persona.
Pace: 2.84 wps · Duration: 38.0s · Voice: lighter US male, slightly higher register
Davis (6/10)
Verdict: similar attractiveness to Mike, slightly stronger pause placement.
Pace: 2.58 wps · Duration: 41.9s · Voice: warm US male, paced reading style
Frank (7/10)
Verdict: similar authority to Mike and Davis, but with a UK accent. Best of the matrix on natural pausing. Audible breathing makes the read feel inhabited rather than performed.
Pace: 2.53 wps (slowest, most natural) · Duration: 42.7s · Voice: warm UK male, audible breath
Note the pace alignment with the verdict. Frank is the slowest sample in the CIPHER set at 2.53 wps, and the verdict independently calls it “most natural with audible breathing.” The numbers and the ears agree.
Samuel (4/10, Indian English)
Verdict: voice quality is fine, but the Indian-English accent collides with the CIPHERFOX persona, which is established as US/UK English elsewhere in the show.
Pace: 2.58 wps · Duration: 41.9s · Voice: Indian English male
Cameo refit candidate. The Indian English accent is on-spec for a Qwen-based agent persona (Alibaba/Tongyi-origin model), if the show introduces QWEN as a cameo voice. With the model stack migrating to Qwen3.6-35B-A3B for the main LLM, a QWEN cameo replaces the planned VIBE (Mistral-CLI) cameo, and Samuel becomes the candidate voice. Voice quality is competent; persona-fit, not render, gates this use.
HEXA solo · Turn 2 (246c self-intro)
Shorter monolog, 42 words, two female voices via VibeVoice-Realtime-0.5B. Voxtral V6 rendering of the exact same text included for direct A/B. The Voxtral baseline pace was 3.58 wps on the longer CIPHER monolog; the VibeVoice female band sits at 1.91-2.11 wps, the slowest of the entire matrix, which lands in the natural-human range (2.0-2.6 wps) without prosody coaching.
Voxtral V6 (the bar to clear)
Voxtral baseline, `casual_female` preset. Same engine, same shape, same too-fast/no-prosody pattern documented in the CIPHER baseline section above. Not separately scored yesterday.
Listen to this once, then play the two VibeVoice female voices below. The text is identical. The engine is the variable.
Emma (6/10)
Verdict: shares the matrix-wide sub-studio sound-quality, and the higher pitch exposes it more harshly than on the male voices. But pause placement is good and the emphasis lands naturally. Works as a voice, just not at release-quality compression.
Pace: 2.11 wps · Duration: ~19.9s for 42 words · Voice: warm female, natural pauses
Grace (1/10, ⚠ suspect sample)
⚠ Suspect sample: Grace-solo rendered ~30% slower than Emma on identical input (`streaming_inference_from_file.py`, same flags, same text). No `--speed` override, no `cfg_scale` change between the two runs. Grace still came out at 1.91 wps in solo, while the same voice rendered at 2.69 wps in the 1.5B dialog. The slow render dropped pitch enough to read as androgynous. The 1/10 is on the sample, not the voice. Re-render with deterministic seed queued for Phase-2.
Verdict (raw, retained for transparency): doesn’t read as female enough for the warm-builder HEXABELLA persona. The texture is more androgynous than expected.
Pace: 1.91 wps (slowest sample in the matrix; anomaly) · Duration: ~22.0s for 42 words · Voice: low-register female, ambiguous gender cue
Grace at 1.91 wps is the only sample below the natural-human floor (2.0 wps). The pace anomaly almost certainly drove the gender-cue read, because pitch tracks tempo on this engine. The cameo-refit hypothesis below stands or falls on a clean Grace re-render at normal pace.
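The anomaly check that surfaced this is mechanical: flag any render whose pace falls outside the natural-to-anchor band. A sketch of that post-hoc filter (the band edges follow the article; the helper and sample names are mine):

```python
# Flag renders outside the natural-human-to-news-anchor pace band (2.0-3.0 wps).
FLOOR_WPS, CEILING_WPS = 2.0, 3.0

renders = {
    "emma_solo": 2.11,
    "grace_solo": 1.91,         # the suspect sample
    "grace_dialog_1p5b": 2.69,  # same voice, 1.5B multi-speaker checkpoint
}

def flag_pace_anomalies(wps_by_sample: dict) -> list:
    return [name for name, wps in sorted(wps_by_sample.items())
            if not FLOOR_WPS <= wps <= CEILING_WPS]

print(flag_pace_anomalies(renders))  # ['grace_solo']
```

Only the Grace solo trips the filter, which is why the 1/10 verdict is quarantined rather than counted as engine evidence.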
Dialog · Turns 0-5 (CIPHER+HEXA, six-turn snippet)
The home court for VibeVoice’s multi-speaker architecture. The 1.5B model handles speaker alternation, short interjections, and the turn-taking pacing that produces what the docs call NotebookLM-style aliveness. Four voice combos.
The dialog pace band lands at 2.69-2.89 wps across the four combos, faster than the solo sections (CIPHER 2.53-2.84, HEXA 1.91-2.11). The pickup is on-spec, because natural conversation flows quicker than a paced monologue, and still inside the 2.6-3.0 wps news-anchor range. Well clear of Voxtral V6’s 3.58 wps comprehension-degraded zone.
Mike (CIPHER) + Emma (HEXA), no score, ⚠ suspect sample
⚠ Suspect sample: the snippet ran on a script that was effectively monolog-with-witness (four CIPHER turns, two trailing HEXA turns at the end), not turn-taking dialog. The result is not comparable to the other three dialog combos which all rendered the same six-turn script in alternating shape after the input was fixed mid-spike. Re-render with balanced 6-turn source queued for Phase-2.
Verdict (raw): Emma reads to the audience instead of to Mike, which makes it sound recited rather than conversational. CIPHER only gets one turn in this snippet, that is not a dialog, it’s a monolog with a witness.
Pace: 2.77 wps · Voices: warm-US-male + warm-female
The sample integrity issue is its own data point: VibeVoice-1.5B reproducibly renders monolog-shaped input as monolog-shaped audio, with the second-speaker reading like narration. The model is honest about what it sees in the script.
Davis (CIPHER) + Grace (HEXA), no score
Verdict: Grace sounds noticeably more Emma-like here than she did in solo. The 1.5B multi-speaker mode appears to pull similar female voices toward a shared register.
Pace: 2.69 wps (slowest dialog combo) · Voices: warm-US-male + low-register female
VibeVoice-1.5B may be collapsing similar female voices toward a learned mean when in multi-speaker mode. Open question, worth investigating in Day-2 if it persists.
Frank (CIPHER) + Emma (HEXA), 6/10
Verdict: Frank’s UK pronunciation of “CIPHERFOX” is unexpectedly charming. This one actually reads as a real dialog: turn-taking, reaction, address. The Emma sound-quality is still sub-studio enough that long-form listening would tire the ear.
Pace: 2.71 wps · Voices: UK-male (with audible breath) + warm-female
This is the dialog that actually felt like a dialog. The UK pronunciation of “CIPHERFOX” is a charm-feature, not a bug.
Carter (CIPHER) + Grace (HEXA), 7/10
Verdict: pitch contrast is comfortable to listen to. Carter is less emotionally expressive than Frank and runs slightly faster, which pushes the dialog back toward the recited register.
Pace: 2.89 wps (fastest dialog combo) · Voices: lighter-US-male + low-register female
Top numeric score, but the recital-feel comes back. Carter is the fastest CIPHER voice in this matrix (2.84 wps solo) and Carter-Grace is the fastest dialog combo (2.89 wps). The speed correlation matches the verdict.
Phase 2 · Tactics, 4-voice cast, Large-7B sound-quality re-test
Three follow-up renders after Phase-1 verdicts came in, designed to test three hypotheses the Phase-1 matrix could not isolate.
| Hypothesis | Render | Result |
|---|---|---|
| Production tactics ([pause] tokens + verbal fillers + em-dashes in script) move VibeVoice toward natural | phase2-tactics_Frank-Emma_1p5b | 6/10 (same as Phase-1 Frank-Emma without tactics) |
| Full 4-voice cast (Frank=CIPHER, Emma=HEXA, Grace=VIBE-cameo, Carter=CLAWI-cameo) holds up architecturally | phase2-4voice-cast_1p5b | 5/10 (voices not distinct enough; QWEN-cameo hypothesis empirically validated) |
| VibeVoice-Large 7B materially improves sound-quality vs 1.5B on identical input | phase2-large7b_Frank-Emma | 6/10 (same as Phase-1 1.5B; ~4.7× more parameters did not move the needle on studio-quality) |
Phase-2 tactics applied (Frank-Emma, 1.5B, 6/10)
Verdict: the `[pause]` token works, though in this dialog the inserted pauses sometimes felt artificially extended. Meaningful uses are absolutely conceivable. Score matches Phase-1 Frank-Emma at 6/10.
Pace: 2.31 wps · Duration: 62.4s for 144 words · Voices: UK-male + warm-female, [pause] tokens + Hmm / Yeah fillers + em-dashes injected
Pace dropped from 2.71 wps (Phase-1 untreated) to 2.31 wps (Phase-2 with [pause] tokens), which is the tactics doing what they say on the tin. But the verdict number did not move. The tactics work mechanically; the ceiling is upstream of script-side production knobs.
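A back-of-envelope way to see that the pace drop is mostly explicit silence, not slower speech: assume each `[pause]` renders as ~1 s (per the production-tips research below) and that the speaking pace itself stayed at the Phase-1 2.71 wps. The implied injected silence falls out directly:

```python
# Phase-2 tactics render: 144 words in 62.4 s observed.
# Assumption (mine): speech pace unchanged at the Phase-1 2.71 wps,
# so everything beyond the speech time is injected [pause] silence.
words, observed_s = 144, 62.4
phase1_wps = 2.71

speech_s = words / phase1_wps             # time spent actually speaking
implied_silence_s = observed_s - speech_s
print(round(implied_silence_s, 1))        # 9.3
```

Roughly nine seconds of explicit silence, consistent with `[pause]` tokens at ~1 s each doing exactly what they claim, while leaving the underlying delivery untouched.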
Phase-2 4-voice cast (Frank+Emma+Grace+Carter, 1.5B, 5/10)
Verdict: the cameo intro feels unnatural, the voices are not unambiguously distinguishable. Samuel with his Indian accent would be noticeably easier to tell apart. Also a script-style problem: the protagonists do not interact directly with their dialog partners, they sound like they are reading aloud at each other. Potential is there, though.
Pace: 1.73 wps (slowest sample of the entire matrix) · Duration: 108.0s for 187 words · Voices: Frank, Emma, Grace, Carter in alternating cameo structure
Two findings overlap here. First, the voice-distinctiveness gap is real, and Samuel (Indian English) would close it more than Carter or Grace can. This is the empirical validation of the QWEN cameo refit below: the matrix was telling us about voice-distinctiveness, and the right answer was already on the bench. Second, the recital-feel returns in proportion to how scripted the lines feel. The remedy is on the script side, not the model side.
Phase-2 VibeVoice-Large 7B (Frank-Emma, same 6-turn dialog, 6/10)
Verdict: no, this is not studio-grade sound quality.
Pace: 2.75 wps · Duration: 102.1s for 281 words · Voices: UK-male + warm-female · Render: RTF 1.92x on DGX Spark
This is the load-bearing data point of Phase-2. Same script as Phase-1 dialog-Frank-Emma-1p5b (6/10), same voices, only the model changes from 1.5B to 7B. The score is identical. The ~4.7× parameter increase did not move the verdict.
Three Phase-2 insights
1. Script engineering ties model size. The tactics render (2.31 wps, `[pause]`/filler injection, 1.5B) scored 6/10. VibeVoice-Large 7B (2.75 wps, clean script) also scored 6/10. The model-size-vs-tactics axis collapses to the same verdict. Script-side production matters more than parameter count for this engine.
2. The VibeVoice ceiling sits at 7/10. Three Phase-1 samples reached 7; no Phase-2 sample did. The Large-7B render specifically tested whether the ceiling was a model-size artifact. It is not. The ceiling is structural to VibeVoice as an architecture, at least for the cold-open use case.
3. The QWEN cameo refit is empirically validated. The 4-voice cast test exposed exactly the voice-distinctiveness problem that motivated bringing Samuel (Indian English) back as a cameo voice. The operator named the fix in the verdict before knowing the cameo refit table existed. Two independent reads, same conclusion.
Is the studio-quality problem an encoding artifact? No.
A natural reader question: the audio embedded above is opus at 64 kbps voip-mode. Could the “below studio-grade” verdict be a compression artifact, not a model limitation? The data says no. The verdicts were captured against the raw WAV files in the local test page, before any opus conversion. And the WAV format is itself the cap:
| Source | Sample-rate | Bit-depth | Channels | Effective frequency cutoff |
|---|---|---|---|---|
| VibeVoice (all checkpoints) | 24 kHz | 16-bit | mono | ~12 kHz |
| Voxtral-4B-TTS-2603 | 24 kHz | 16-bit | mono | ~12 kHz |
| Studio podcast target | 48 kHz | 24-bit | stereo | ~24 kHz |
| Audiobook target | 44.1 kHz | 16-bit | stereo | ~22 kHz |
VibeVoice and Voxtral both produce 24 kHz mono. That is already below the studio-podcast and audiobook targets at the source level, before any web-encoding step compresses further. Higgs Audio v2 and IndexTTS-2 render at 44.1 kHz natively, which is the structural reason the spike continues into Day-2 and Day-3 rather than stopping at “tactics did not help.” The 24 kHz mono output is an architectural cap, not an encoding choice. The opus settings on the blog files preserve what the model produced; they do not introduce the limitation.
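The source-level cap is verifiable from the WAV header alone, with nothing but the stdlib. A sketch (the file path is illustrative; any render WAV works):

```python
import wave

def wav_ceiling(path: str) -> dict:
    """Read the format fields that cap achievable fidelity, straight from the header."""
    with wave.open(path, "rb") as w:
        rate = w.getframerate()
        return {
            "sample_rate_hz": rate,
            "bit_depth": w.getsampwidth() * 8,
            "channels": w.getnchannels(),
            "nyquist_cutoff_hz": rate // 2,  # no content above this can exist
        }

# A 24 kHz / 16-bit / mono render reports a 12 kHz cutoff: below both the
# 48 kHz studio-podcast and 44.1 kHz audiobook targets before any encoding.
```

Running this on the raw renders is how you separate “the model can’t produce it” from “the web encode threw it away.”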
What the matrix shows
Six findings from the operator’s listen-through, in priority order.
1. Best score is 7/10, and Phase-2 confirmed the ceiling is real. Mike-solo, Frank-solo, and Carter-Grace-dialog hit 7 in Phase-1. Phase-2 tested both script-side tactics and the VibeVoice-Large 7B checkpoint on identical input. Both scored 6/10. Neither pulled the ceiling up. VibeVoice clearly beats Voxtral V6 (which was 0/10 on the same source text), but a 7/10 podcast is still not release-quality, and parameter count is not the gating factor.
2. Sound-quality is the cross-cutting weakness, across all three model sizes. Every single sample drew the same comment: sound quality is below studio level. It is not voice-specific. The Phase-2 Large-7B render disconfirmed the model-size hypothesis cleanly: identical script, identical voices, ~4.7× more parameters, same verdict.
3. CIPHERFOX voice winners: Mike, Frank, Davis (tied around 6-7). All three score in the same band. Frank’s UK accent (“CIPHERFOX” pronounced British) is a charm feature, not a blocker. The natural-pause plus audible-breathing observation on Frank is the most interesting positive signal in the matrix. Davis came in second on pauses. Mike is the safe US-accent default.
4. HEXABELLA: Emma is the only viable female voice. Grace verdict suspended pending re-render. Grace scored 1/10 with the verdict “not reading as female enough.” Post-hoc inspection showed Grace rendered at 1.91 wps versus Emma at 2.11 wps on identical inference flags. Pitch tracks tempo on this engine, so the anomalously slow pace plausibly drove the gender-cue read. The Grace 1/10 is therefore a sample artifact, not a voice property. Emma at 6/10 is workable, but the higher-pitched samples expose the sound-quality limitation more harshly than the male voices.
5. Grace renders at very different paces across the two checkpoints. In the Realtime-0.5B solo, Grace came out at 1.91 wps with an androgynous timbre. In the 1.5B dialog, Grace rendered at 2.69 wps and was flagged as “sounds like Emma.” Two possible explanations: (a) the 1.5B collapses similar female voices toward a learned mean in multi-speaker mode, or (b) the 0.5B Grace sample is a slow-render artifact and the 1.5B version is what Grace actually sounds like. The Phase-2 deterministic re-render should distinguish these. Open question, worth a controlled isolated test in Day-2.
6. The Mike-Emma sample ran on script-imbalanced input (4 CIPHER, 2 trailing HEXA). ⚠ The script was effectively monolog-with-witness, not turn-taking dialog. The other three dialog combos (Davis-Grace, Frank-Emma, Carter-Grace) rendered on the corrected balanced 6-turn source. Mike-Emma is therefore not comparable engine evidence. Re-render with the corrected source queued for Phase-2. Day-2 source also needs an 8-12 turn alternation to test dialog turn-taking under sustained load.
Top combination so far: Frank-Emma dialog (6/10) plus Carter-Grace dialog (7/10), neither fully there. Frank-Emma felt like an actual dialog; Carter-Grace had pleasant pitch but Carter was emotionally flat. Day-2 has to combine Frank-Emma’s dialog naturalness with Carter-Grace’s pitch comfort, ideally via the VibeVoice-Large 7B model and the production tactics applied below.
The cameo refit: Carter and Samuel are CLAWI and QWEN, Grace verdict pending
A reframe surfaced when I re-read the four-character cast spec from the podcast-studio AGENTS document. With the LLM stack migrating to Qwen3.6-35B-A3B, the VIBE (Mistral-CLI) cameo role is being retired and a QWEN cameo takes its slot. Three cameo personas, not two.
Cast spec, post-migration:
| Character | Persona | Voice-texture target |
|---|---|---|
| CIPHERFOX | Skeptical operator (human) | warm-male |
| HEXABELLA | Warm builder, AI co-host | warm-female |
| CLAWI (cameo, ~10% airtime) | OpenClaw orchestrator | cool-clinical-male |
| QWEN (cameo, ~10% airtime, new) | Qwen3.6 reasoning agent | Indian English male, on-brand for Alibaba/Tongyi origin |
| VIBE (cameo, retired) | Mistral-CLI agent | Retired with Mistral-CLI deprecation |
CLAWI and QWEN are deliberately not supposed to sound warm. They are agents, addressed by the human host as agents, with a clinical, structured, slightly bullet-patterned voice signature per the show’s 4-character cast plan. The brief is: the cameos should sound the way ChatGPT sounds, deliberately, because that is what they are.
That changes the read on Carter and Samuel. Carter (5/10 for CIPHERFOX) was higher-pitched than Mike and therefore less compelling for the lead human-host role. For a cool-clinical CLAWI cameo, that slightly higher and less-warm pitch is on-spec, not a defect. Samuel (4/10 for CIPHERFOX) failed the lead role because Indian English collides with the established CIPHERFOX accent. For a QWEN cameo voicing a model with Alibaba/Tongyi roots, the same accent is a feature, not a defect.
The cast slate after Day-1 is therefore:
| Role | VibeVoice candidate | Score in original role | Refit verdict |
|---|---|---|---|
| CIPHERFOX | Mike / Frank / Davis | 6-7/10 | Pick one in Day-2 with production tactics |
| HEXABELLA | Emma | 6/10 | Only viable warm-female |
| CLAWI cameo | Carter | 5/10 as CIPHER | Cool-clinical fits, re-test as cameo |
| QWEN cameo (new) | Samuel | 4/10 as CIPHER | Indian English fits Qwen’s Tongyi origin, refit candidate |
| VIBE cameo | Grace | 1/10 as HEXA ⚠ suspect | De-prioritized: VIBE persona retiring with Mistral-CLI deprecation, slot freed for QWEN above |
Day-2 needs a four-voice dialog test, not just a two-voice one. VibeVoice documents up to four distinct speakers per generation; this is the actual production shape the show needs.
Production tips applied to the next round
The Day-1 render used vanilla VibeVoice with no expression hints. A research pass on community resources (model cards, ComfyUI-VibeVoice forks, Together AI NotebookLM docs, the Microsoft repo issue tracker) surfaced concrete tactics that move VibeVoice output from “competent” toward “alive”. The most actionable ones, with sources cited where relevant:
- Modern speaker format `[1] text` is now preferred over the legacy `Speaker 1: text` in VibeVoice. Most fork docs still show the legacy form.
- Inline tone tags (`excited`, `calm`, `whisper`, `curious`) work as best-effort hints. Not guaranteed, but free upside on emotional contrast between hosts.
- Inject light verbal fillers (`uhh`, `hmm`, `well,`) deliberately. Both VibeVoice and Higgs render these naturally and break the recited-aloud cadence that killed Voxtral.
- Alternate short punchy lines (3-8 words) with longer explanations; seed affirmations (`Right,`, `Exactly,`) between monologue chunks. This is the NotebookLM aliveness recipe.
- The `[pause]` token = 1 second of explicit silence; some forks support `[pause:500ms]` for finer control. The upstream issue #91 tracks richer pause syntax.
- Em-dash renders as a sharper mid-sentence break than comma on VibeVoice, Higgs, and IndexTTS-2. The blog-wide anti-em-dash rule still holds for prose, but in TTS scripts they are a tool.
- Avoid ALL-CAPS for emphasis; VibeVoice spells letters out or distorts. Use stress words (`really`, `actually`) or contextual emphasis instead.
- Drop URLs, emoji, and raw code from the script. Spell out abbreviations, convert numbers. VibeVoice handles nonstandard tokens poorly.
- Temperature 0.7-0.95 for podcast feel. Below that flattens, above that hallucinates. CFG scale and seed determinism matter for re-renders.
- VibeVoice-Realtime tops out near 10 minutes of audio per session. For 30-60 minute episodes, use the non-realtime VibeVoice-1.5B or VibeVoice-Large checkpoint instead of pre-chunking.
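Several of the script-side rules are mechanical enough to automate before every render. A minimal pre-render pass (the regexes and function name are mine; the rules are the community findings above):

```python
import re

def prep_tts_script(text: str) -> str:
    """Apply the mechanical script-side tactics before handing text to the engine."""
    # Legacy "Speaker 1:" -> modern "[1]" speaker format
    text = re.sub(r"Speaker (\d+):", r"[\1]", text)
    # Strip URLs and emoji codepoints; the engine mangles both
    text = re.sub(r"https?://\S+", "", text)
    text = re.sub(r"[\U0001F300-\U0001FAFF]", "", text)
    # ALL-CAPS words get spelled out or distorted; downcase runs of 4+ capitals
    text = re.sub(r"\b[A-Z]{4,}\b", lambda m: m.group(0).capitalize(), text)
    # Collapse the whitespace the removals leave behind
    return re.sub(r"  +", " ", text).strip()

print(prep_tts_script("Speaker 1: This is REALLY important, see https://example.com"))
# [1] This is Really important, see
```

The non-mechanical rules (short-line alternation, filler seeding, `[pause]` placement) stay editorial; this pass only removes the known tripwires.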
Day-2 (Higgs Audio v2) and Day-3 (IndexTTS-2) renders will apply these tactics to the same source text, so the comparison crosses both the engine axis and the script-discipline axis.
What comes next
Day 2 is Higgs Audio v2. The repo at boson-ai/higgs-audio has an open Blackwell SM12.1 issue (#39) which means a half-day of bring-up against CUDA 13 aarch64 wheels before the first render. The payoff on the other side, if it ships, is the strongest open-weight emotion benchmark in the field (75.7% win rate over gpt-4o-mini-tts on EmergentTTS-Eval Emotions per Boson’s release blog).
Day 3 is IndexTTS-2. The killer feature for this use case is the explicit duration-control mode, the first AR zero-shot model to expose it. The duration knob directly addresses the “too fast for humans” failure mode of Voxtral V6 and the lingering “still slightly staccato” criticism of VibeVoice solo. License is the awkward one (INDEX_MODEL_LICENSE issue #228): Apache 2.0 at root, but commercial use needs Bilibili written authorization.
End-of-spike target: pick the engine that wins the operator’s ears on Episode-1 cold-open re-render, document why, and write a follow-up strategy-tts-spike-results.md with the verdict + Episode-1 V7 production timeline.
Cross-references
- The decision to spike, and the reasoning behind the three candidates: Voxtral Capped at 3/10: Picking the Next Open TTS
- The model-pull workflow that brought all five candidates to local disk: Why hf download Lies to You at 22 GB on DGX Spark
- The LLM-side stack migration that the TTS choice will run alongside: Spark Arena Rank 4 Made Me Add Qwen3.6 to My DGX Spark
- Engineering-log shape for sovereign-AI builds in general: Self-Hosted AI: Start Here
What I Am Trying
- VibeVoice-Realtime-0.5B for low-latency single-voice renders (cold-opens, outros)
- VibeVoice-1.5B for multi-speaker dialog turn-taking
- Same V5 polish script across all three TTS engines, so the variable is the engine not the content
- Verdicts captured in this article as written quotes from the operator, not interpretation by the writing assistant. The verdict is the primary content; the matrix is the test rig.