How does prosody work?
When a TTS model reads a sentence, getting the words right is only half the job. It also has to decide which syllables to lean on, where the pitch climbs and falls, how long to hold a pause, and how quickly to move. Those choices ride on top of the words and shape how the line lands. A question that ends with a flat pitch sounds wrong even if every word is correct.
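To make those choices concrete, the sketch below spells them out by hand in SSML, the W3C markup that some TTS engines accept. It is only an illustration: a model with good prosody makes these decisions on its own, the sentence and attribute values are invented for the example, and support for each tag varies by engine.

```python
import xml.etree.ElementTree as ET


def build_ssml() -> str:
    """Build a small SSML fragment that marks pace, pitch, pause, and stress."""
    # Fragment only: a full SSML document also needs version and namespace
    # attributes on <speak>, omitted here for brevity.
    speak = ET.Element("speak")

    # Pace and pitch: slow down and lift the pitch over the opening words.
    opening = ET.SubElement(speak, "prosody", {"rate": "slow", "pitch": "+10%"})
    opening.text = "You did"

    # Pause: hold briefly before the key word.
    ET.SubElement(speak, "break", {"time": "300ms"})

    # Stress: lean on the word that carries the surprise.
    stressed = ET.SubElement(speak, "emphasis", {"level": "strong"})
    stressed.text = "what?"

    return ET.tostring(speak, encoding="unicode")


ssml = build_ssml()
print(ssml)
```

Changing any one of these values gives a different reading of the same three words, which is the sense in which prosody rides on top of the text.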
Good prosody depends on understanding the sentence, not just reading the words aloud. The same text can be read a dozen plausible ways, so the model has to infer from meaning and context where the emphasis belongs. In "I didn't say that," stressing "I" implies someone else said it, while stressing "that" implies I said something different. Weak models default to one flat reading; stronger ones vary the rhythm in a way that tracks what the sentence is actually saying.
Why does prosody matter?
Prosody is the layer people actually react to. A voice with perfect word accuracy but no rhythm sounds like a screen reader, and listeners tune out fast. The same content with natural stress and pacing feels like a person talking. For anything someone has to listen to for more than a few seconds, prosody is the difference between usable and unbearable.
It is also the hardest part to score. Intelligibility reduces to a clean number, such as a word error rate from transcribing the output, but prosody mostly has to be judged by ear, and what sounds right shifts with context and language. That makes it easy to overlook when you lean only on automated metrics, which is exactly why a model can look strong on paper and still sound dead.
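A minimal sketch of why a word-level metric misses this, assuming intelligibility is scored as word error rate (WER) on a transcript of the synthesized audio. The transcripts below are hypothetical stand-ins for speech recognizer output, and `word_error_rate` is an illustrative helper, not part of any particular toolkit.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word dropped
                curr[j - 1] + 1,     # extra word inserted
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


reference = "you did what"

# Hypothetical transcripts of three synthesized readings of the same line.
flat_reading = "you did what"     # monotone, lands like a statement
lively_reading = "you did what"   # rising pitch, lands like a question
garbled_reading = "you did hot"   # one word actually mispronounced

print(word_error_rate(reference, flat_reading))
print(word_error_rate(reference, lively_reading))
print(word_error_rate(reference, garbled_reading))
```

The flat and lively readings transcribe to the same words, so they get the same score. Only the garbled reading is penalized, and nothing in the number says which of the first two a listener would rather hear.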