Reading path

I want self-hosted text-to-speech

Open TTS on a desk GPU, end to end: the honest state of the art, the constraints nobody warns you about, and how to get a podcast-grade pipeline running.

5 articles, in reading order

  1. Voxtral Capped at 3/10: Picking the Next Open TTS

    Start with the honest state of the art: where open TTS actually tops out, and why the model keeps moving.

  2. Voxtral 4B Open-Checkpoint: The Encoder is Gated

    The constraint nobody warns you about: the open checkpoint's encoder is gated, so no voice cloning.

  3. Voxtral-TTS Blocker on GB10: The Three-Line vllm-omni Patch

    Getting it to run at all on GB10: the Blackwell blocker and the fix.

  4. Voxtral Chunk Strategy: 38 Percent Faster Render with Whole Turns

    Making it usable: the whole-turn chunk strategy that made renders 38 percent faster.

  5. Per-Segment Loudnorm and the 3-Second Lookahead Bug

    Production audio: loudness normalization for a multi-speaker podcast pipeline.
