#tts | Sovereign AI Blog

May 18, 2026

Per-Segment Loudnorm and the 3-Second Lookahead Bug

Per-block ffmpeg loudnorm averages multiple speakers to one gain, leaving the quieter voice quieter. Dynamic-mode loudnorm eats the first 3 seconds of audio.

May 13, 2026

strategy podcast

TTS Spike Day 1: VibeVoice Sample Matrix on DGX Spark

Eleven VibeVoice renders, one Voxtral baseline, the operator's ears. The first day of the TTS spike that follows the V6=0/10 verdict. Engineering-log shape, with the actual audio embedded. Day 2 went to a late entrant, Qwen3-TTS.

May 12, 2026

strategy podcast voxtral

Voxtral Capped at 3/10: Picking the Next Open TTS

Eight engineering fixes deep, three weeks of patches, two failure modes on the same engine. The Voxtral open checkpoint has no path to release-quality podcast audio. The drama of staying with it anyway, and the three engines I plan to spike next.

May 7, 2026

fix mistral podcast voxtral

Voxtral 4B Open-Checkpoint: The Encoder is Gated

Voxtral 4B advertises voice cloning, accepts ref_audio in the API, then crashes the engine because the encoder weights live only in Mistral's hosted product.

May 7, 2026

strategy devops podcast voxtral

Voxtral Chunk Strategy: 38 Percent Faster Render with Whole Turns

Rendering a 367-character podcast turn as one Voxtral call takes 21 seconds. Split into 90-character chunks, the same turn takes 35 seconds. Same words, same voice, and the whole-turn render needs roughly 38 percent less wallclock.