#tts

8 articles

All articles tagged "tts" — self-hosted AI fixes, setups, and architecture notes.

TTS Spike Day 1: VibeVoice Sample Matrix on DGX Spark

Eleven VibeVoice renders, one Voxtral baseline, the operator's ears. The first day of the three-day TTS spike that follows the V6=0/10 verdict. Engineering-log shape, with the actual audio embedded.

strategy podcast voxtral

Voxtral Capped at 3/10: Picking the Next Open TTS

Eight engineering fixes deep, three weeks of patches, two failure modes on the same engine. The Voxtral open checkpoint has no path to release-quality podcast audio. The drama of staying with it anyway, and the three engines I plan to spike next.

fix devops voxtral

Per-Segment Loudnorm and the 3-Second Lookahead Bug

Per-block ffmpeg loudnorm averages multiple speakers to one gain, leaving the quieter voice quieter. Dynamic-mode loudnorm eats the first 3 seconds of audio.

fix mistral podcast voxtral

Voxtral 4B Open-Checkpoint: The Encoder is Gated

Voxtral 4B advertises voice cloning, accepts ref_audio in the API, then crashes the engine because the encoder weights live only in Mistral's hosted product.

strategy devops podcast voxtral

Voxtral Chunk Strategy: 38 Percent Faster Render with Whole Turns

Rendering a 367-character podcast turn as one Voxtral call takes 21 seconds. Split into 90-character chunks, it takes 35 seconds. Same words, same voice, and the chunked render needs roughly two-thirds more wallclock.

fix devops voxtral

Voxtral-TTS Blocker on GB10: The Three-Line vllm-omni Patch

How a silent AttributeError nearly killed our TTS pipeline, and why three lines of code fixed it forever.

fix devops mistral podcast voxtral

The 3.5-Hour Deadlock That Was Really an AttributeError

How a three-line Python init order bug masqueraded as a Blackwell GPU hang, and why checking raw logs beat all hardware theories.

fix devops podcast voxtral

Voxtral Stage 1 OOM on GB10: Why --enforce-eager Is Not Enough

How a single flag killed my self-hosted TTS stack, and how I fixed it without losing a second of audio.