TTS: turning written text into spoken audio

TTS (text-to-speech) is the conversion of written text into spoken audio by a model. Modern systems are neural: they predict an audio representation from the text and a chosen voice, then turn that representation into a waveform you can play. It is the parent topic that other speech terms (intelligibility, prosody, voice cloning) sit underneath.

How does TTS work?

A neural TTS model takes your written text and a choice of voice, then predicts an intermediate audio representation from them, often a spectrogram or a sequence of audio tokens. A second stage, the vocoder or audio decoder, turns that representation into an actual waveform. From the outside it looks like one step: text in, sound out. Inside it is usually a pipeline.

The model has to do more than spell the words out loud. It decides where to pause, which syllables to stress, how fast to go, and how the pitch rises and falls. Those choices are what separate a usable voice from one that sounds like a screen reader. Different models put their effort in different places, which is why two systems can read the same sentence and feel completely different.

Why does TTS matter?

TTS is the layer that lets a system speak instead of only printing text. For a sovereign setup that means you can run narration, voice replies, or audio versions of written content on your own hardware, without sending text to a cloud voice API. The term is an umbrella that covers a lot, so when people say “the TTS is good” it is worth asking what they mean.

Judging TTS splits into two questions that do not always agree. One is whether the words come out correctly and reliably, which you can measure. The other is whether the voice sounds alive and human, which is far more subjective. A model can nail the first and still fail the second, so it pays to test both before you trust a system in production.
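
The measurable half can be sketched in a few lines. One common approach, assumed here and not prescribed by anything above, is to transcribe the generated audio with a speech recognizer and compare the transcript to the original text using word error rate (WER). The sketch below covers only the comparison; the transcript is taken as given, and the function name word_error_rate is made up for this example.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    reference:  the text that was sent to the TTS system
    hypothesis: a transcript of the audio it produced (obtained elsewhere)
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,                          # reference word missing
                cur[j - 1] + 1,                       # extra word in transcript
                prev[j - 1] + (ref_word != hyp_word), # match or substitution
            ))
        prev = cur
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One dropped word out of four reference words.
    print(word_error_rate("the cat sat down", "the cat sat"))
```

A low score says the words came out right. It says nothing about whether the voice sounds alive, which still needs people listening.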
