WER: how often the words come out wrong

WER (word error rate) is a way to score a TTS model's intelligibility. You run the generated audio through a speech recognizer, compare what it transcribes against the text you asked for, and count the fraction of words that were wrong (substituted, dropped, or added). A lower WER means the words came out clearly; it says nothing about whether the voice sounds human.

At a glance

  • What it is: the fraction of words a recognizer gets wrong from the synthesised audio
  • Which way is good: lower is better; zero would mean every word landed cleanly
  • What it measures: intelligibility and stability, meaning the right words came out reliably
  • What it misses: naturalness; a model can win WER and still sound robotic

How does WER work?

WER closes a loop. You give the TTS model a sentence, record what it speaks, then feed that audio into a speech recognizer and read back its transcript. You line the transcript up against the original text and count three kinds of mistake: words that were swapped for the wrong word, words that were dropped, and words that were inserted. The total number of errors divided by the number of words in the original text is the rate. Because inserted words count as errors, the rate can go above 100% when the transcript contains many extra words.
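The counting step can be sketched in Python as a word-level edit distance. This is a minimal sketch under stated assumptions, not the scoring code of any particular benchmark: the names `wer_counts` and `wer` are made up for illustration, the recognizer's transcript is assumed to be available already as a string, and the only normalization is lowercasing and splitting on whitespace (real pipelines often also strip punctuation and normalize numbers).

```python
from typing import List, Tuple


def wer_counts(reference: str, hypothesis: str) -> Tuple[int, int, int, int]:
    """Return (substitutions, deletions, insertions, reference_word_count)."""
    # Assumption: lowercasing + whitespace split is the only normalization.
    ref: List[str] = reference.lower().split()
    hyp: List[str] = hypothesis.lower().split()

    # cost[i][j] = (total_errors, subs, dels, ins) for aligning ref[:i] with hyp[:j]
    cost = [[(0, 0, 0, 0)] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        cost[i][0] = (i, 0, i, 0)  # every reference word dropped
    for j in range(1, len(hyp) + 1):
        cost[0][j] = (j, 0, 0, j)  # every transcript word inserted

    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            if ref[i - 1] == hyp[j - 1]:
                cost[i][j] = cost[i - 1][j - 1]  # words match: no new error
                continue
            t, s, d, n = cost[i - 1][j - 1]
            substituted = (t + 1, s + 1, d, n)
            t, s, d, n = cost[i - 1][j]
            dropped = (t + 1, s, d + 1, n)
            t, s, d, n = cost[i][j - 1]
            inserted = (t + 1, s, d, n + 1)
            # Tuples compare on total_errors first, so this keeps a cheapest alignment.
            cost[i][j] = min(substituted, dropped, inserted)

    _, subs, dels, inss = cost[-1][-1]
    return subs, dels, inss, len(ref)


def wer(reference: str, hypothesis: str) -> float:
    """Total errors divided by the number of words in the reference text."""
    subs, dels, inss, n_ref = wer_counts(reference, hypothesis)
    if n_ref == 0:
        raise ValueError("reference text has no words")
    return (subs + dels + inss) / n_ref


asked_for = "the quick brown fox jumps over the lazy dog"
transcribed = "the quick brown box jumps over lazy dog today"
print(wer_counts(asked_for, transcribed))     # (1, 1, 1, 9)
print(round(wer(asked_for, transcribed), 3))  # 0.333
```

In the example, "fox" heard as "box" is a substitution, the missing second "the" is a dropped word, and the trailing "today" is an inserted word: three errors over nine reference words.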

Because the whole thing is mechanical, WER is cheap to run at scale and gives you a single number to sort models by. The catch is that it inherits the recognizer’s own blind spots. If the recognizer mishears a fine but unusual pronunciation, that counts against the TTS model even though a human would have understood it.

Why does WER matter, and where does it stop?

WER is the floor you want every voice to clear. A model that scores badly is dropping or garbling words, and no amount of pleasant tone fixes a sentence you cannot follow. So WER is a good gate: it catches the systems that are unstable or unintelligible before you waste time on anything else.

What WER does not tell you is whether the voice sounds alive. A flat, robotic reading of every word in the right order scores beautifully. That is the trap: a model can win on WER and still feel lifeless, so you pair it with measures of naturalness and prosody before deciding a voice is actually good.
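One way to picture using WER as a gate and then pairing it with naturalness is the sketch below. Everything in it is hypothetical: the `VoiceScore` record, the `shortlist` function, the 0.05 cutoff, and the example numbers are invented for illustration, and the naturalness value stands in for whatever listener rating or score you actually collect.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class VoiceScore:
    name: str
    wer: float          # lower is better
    naturalness: float  # higher is better; hypothetical rating on a 1-5 scale


def shortlist(scores: List[VoiceScore], max_wer: float = 0.05) -> List[VoiceScore]:
    """Drop voices that fail the WER gate, then rank the rest by naturalness."""
    # max_wer is an assumed cutoff, not a recommended value.
    passed = [s for s in scores if s.wer <= max_wer]
    return sorted(passed, key=lambda s: s.naturalness, reverse=True)


# Made-up numbers, for illustration only.
candidates = [
    VoiceScore("flat", wer=0.01, naturalness=2.1),     # best WER, sounds robotic
    VoiceScore("lively", wer=0.03, naturalness=4.2),   # clears the gate, sounds alive
    VoiceScore("garbled", wer=0.20, naturalness=4.5),  # pleasant but unintelligible
]
print([s.name for s in shortlist(candidates)])  # ['lively', 'flat']
```

The voice with the lowest WER does not come out on top, and the voice with the highest naturalness never makes the list because it fails the intelligibility gate first.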

WER

  • Counts wrong words via a speech recognizer, fully mechanical
  • Rewards clear, stable, intelligible speech

A naturalness score

  • Asks whether the voice sounds alive and human
  • Subjective, and a clear voice can still score badly

Reviewed: June 2026