Speaker similarity: did it sound like the target voice?

Speaker similarity, often written SIM, is a voice-cloning metric that compares a generated voice to its reference speaker. It is the cosine similarity between the two speaker embeddings, so a higher score means the output sounds more like the target.

At a glance

  • What it is: A metric for how close a cloned voice is to its reference
  • How it is measured: Cosine similarity between speaker embeddings
  • Direction: Higher is better; a higher score means a closer match to the target
  • Its partner metric: WER (word error rate), which checks whether the words came out right

How does speaker similarity work?

To compute SIM you run both the reference recording and the generated clip through a speaker-embedding model, which turns each voice into a vector that captures its timbre and identity. You then take the cosine similarity between the two vectors. The smaller the angle between them, the higher the score, and the more the synthetic voice resembles the original speaker.
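The cosine step can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name and the toy four-dimensional vectors are assumptions made for the example, and real speaker embeddings come from a speaker-embedding model and typically have hundreds of dimensions.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero vectors")
    return dot / (norm_a * norm_b)


# Hypothetical toy vectors standing in for real speaker embeddings.
reference_embedding = [0.9, 0.1, 0.3, 0.2]
close_clone_embedding = [0.8, 0.2, 0.3, 0.1]    # points almost the same way
drifted_clone_embedding = [0.1, 0.9, 0.2, 0.7]  # points somewhere else

sim_close = cosine_similarity(reference_embedding, close_clone_embedding)
sim_drifted = cosine_similarity(reference_embedding, drifted_clone_embedding)

print(f"close clone SIM:   {sim_close:.3f}")
print(f"drifted clone SIM: {sim_drifted:.3f}")
```

The clone whose vector points in nearly the same direction as the reference scores higher. Because the result depends only on direction, scaling either vector leaves the score unchanged.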

Because it works on speaker embeddings rather than the raw waveform, SIM is largely insensitive to what was said and focuses on who appears to be saying it. That is why it pairs naturally with word error rate: SIM answers whether the clip sounds like the target, and WER answers whether the words are correct. A good clone needs both.
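For the WER half of that pairing, a minimal sketch follows. It uses the standard definition of WER as word-level edit distance (substitutions, deletions, and insertions) divided by the number of reference words. The function name and sample sentences are assumptions for illustration, and the hypothesis string is assumed to be a transcript of the generated clip produced by a speech recognizer.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref_words = reference.lower().split()
    hyp_words = hypothesis.lower().split()
    if not ref_words:
        raise ValueError("reference text must contain at least one word")

    # previous[j] is the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1      # reference word missing from hypothesis
            insertion = current[j - 1] + 1  # extra word in hypothesis
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref_words)


# Hypothetical example: the intended text versus a transcript of the clip.
intended_text = "the cat sat on the mat"
transcript = "the cat sat on a mat"
print(f"WER: {word_error_rate(intended_text, transcript):.3f}")
```

Unlike SIM, lower is better here. A clip would be judged on both numbers together: a high SIM with a high WER sounds like the right person saying the wrong words.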

When does it matter, and when not?

It matters most for voice cloning, where the whole point is to reproduce a specific person. If you are building a custom voice from a few reference samples, a high SIM is evidence that the output actually carries that identity rather than drifting toward a generic voice.

It matters less when you are not trying to match anyone in particular. For a stock narrator voice or a synthetic persona with no reference, there is nothing to be similar to, so SIM has little to say. Even where it applies, read it alongside WER, since a voice can sound like the target yet still mangle the words.

Speaker similarity (SIM)

  • Did the voice sound like the target speaker?

Word error rate (WER)

  • Did the words come out correct?

Reviewed: June 2026