Reading path
I want self-hosted text-to-speech
Open TTS on a desk GPU, end to end: the honest state of the art, the constraints nobody warns you about, and how to get a podcast-grade pipeline running.
5 articles, in reading order
- Voxtral Capped at 3/10: Picking the Next Open TTS
Start with the honest state of the art: where open TTS actually tops out, and why the model keeps moving.
- Voxtral 4B Open-Checkpoint: The Encoder is Gated
The constraint nobody warns you about: the open checkpoint's encoder is gated, so no voice cloning.
- Voxtral-TTS Blocker on GB10: The Three-Line vllm-omni Patch
Getting it to run at all on GB10: the Blackwell blocker and the three-line vllm-omni patch that fixes it.
- Voxtral Chunk Strategy: 38 Percent Faster Render with Whole Turns
Making it usable: the whole-turn chunk strategy that made renders 38 percent faster.
- Per-Segment Loudnorm and the 3-Second Lookahead Bug
Production audio: loudness normalization for a multi-speaker podcast pipeline.