Speaker similarity: did it sound like the target voice?

Speaker similarity, often written SIM, is a voice-cloning metric that compares the generated voice to the reference speaker. It is the cosine similarity between the two voices' speaker embeddings, so a higher score means the output sounds more like the target.

How does speaker similarity work?

To compute SIM, you run both the reference recording and the generated clip through a speaker-embedding model, which turns each voice into a vector that captures its timbre and identity. You then take the cosine similarity between the two vectors, a value between -1 and 1. The smaller the angle between them, the higher the score, and the more the synthetic voice resembles the original speaker.
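The cosine step can be sketched in a few lines of Python. This is a minimal illustration, not a full SIM pipeline: the embedding model is left out, and the four-dimensional vectors below are made-up stand-ins for real speaker embeddings, which typically have hundreds of dimensions. The names cosine_similarity, reference_embedding, close_clone_embedding, and drifted_clone_embedding are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


# Made-up toy vectors standing in for the output of a speaker-embedding model.
reference_embedding = [0.9, 0.1, 0.3, 0.2]
close_clone_embedding = [0.8, 0.2, 0.3, 0.1]    # points in nearly the same direction
drifted_clone_embedding = [0.1, 0.9, 0.2, 0.7]  # points in a different direction

sim_close = cosine_similarity(reference_embedding, close_clone_embedding)
sim_drifted = cosine_similarity(reference_embedding, drifted_clone_embedding)

print(f"close clone SIM:   {sim_close:.3f}")
print(f"drifted clone SIM: {sim_drifted:.3f}")
```

Because the score depends only on the angle, scaling an embedding up or down does not change it. Only the direction of the vector matters.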

Because it works on embeddings rather than the raw waveform, SIM largely ignores what was said and focuses on who appears to be saying it. That is why it pairs naturally with word error rate (WER). SIM answers whether the output sounds like the target, WER answers whether the words are correct, and a good clone needs both.
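To make the pairing concrete, here is a rough sketch of the WER side, assuming the common definition of WER as word-level edit distance divided by the number of reference words. The function name word_error_rate, the example transcripts, and the SIM value of 0.92 are all hypothetical. In practice the hypothesis text would come from a speech recognizer run on the generated clip.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must not be empty")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


target_text = "the quick brown fox"
clone_transcript = "the quick brown box"  # hypothetical recognizer output

wer = word_error_rate(target_text, clone_transcript)
sim = 0.92  # hypothetical SIM score for the same clip

print(f"SIM={sim:.2f}, WER={wer:.2f}")
```

In this made-up case the clip would score well on identity while still getting one word in four wrong, which is the kind of failure that SIM alone would not reveal.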

When does it matter, and when not?

It matters most for voice cloning, where the whole point is to reproduce a specific person. If you are building a custom voice from a few reference samples, a high SIM is the evidence that the output actually carries that identity rather than drifting toward a generic voice.

It matters less when you are not trying to match anyone in particular. For a stock narrator voice or a synthetic persona with no reference, there is nothing to be similar to, so SIM has little to say. Even where it applies, read it alongside WER, since a voice can sound like the target yet still mangle the words.
