Reading path

I want self-hosted text-to-speech

Open TTS on a desk GPU, end to end: the honest state of the art, the constraints nobody warns you about, and getting a podcast-grade pipeline running.

5 articles, in reading order

Voxtral Capped at 3/10: Picking the Next Open TTS
Start with the honest state of the art: where open TTS actually tops out, and why the model keeps moving.
Voxtral 4B Open-Checkpoint: The Encoder is Gated
The constraint nobody warns you about: the open checkpoint's encoder is gated, so no voice cloning.
Voxtral-TTS Blocker on GB10: The Three-Line vllm-omni Patch
Getting it to run at all on GB10: the Blackwell blocker and the fix.
Voxtral Chunk Strategy: 38 Percent Faster Render with Whole Turns
Making it usable: the whole-turn chunk strategy that made rendering 38 percent faster.
Per-Segment Loudnorm and the 3-Second Lookahead Bug
Production audio: loudness normalization for a multi-speaker podcast pipeline.
