How does speaker similarity work?
To compute SIM, you run both the reference recording and the generated clip through a speaker-embedding model, which turns each voice into a vector that captures its timbre and identity. You then take the cosine similarity between the two vectors. The smaller the angle between them, the higher the score, and the more closely the synthetic voice resembles the original speaker.
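The comparison step can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name cosine_similarity and the three-dimensional vectors are placeholders chosen for this example, and a real speaker-embedding model would produce vectors with far more dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Hypothetical embeddings standing in for the output of a speaker-embedding model.
reference_embedding = [0.20, 0.70, 0.10]
generated_embedding = [0.25, 0.65, 0.05]

sim = cosine_similarity(reference_embedding, generated_embedding)
print(f"SIM = {sim:.3f}")
```

Because the score depends only on the angle between the vectors, scaling either embedding by a positive constant leaves it unchanged.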
Because it works on embeddings rather than the raw waveform, SIM largely ignores what was said and focuses on who appears to be saying it. That is exactly why it pairs naturally with word error rate (WER). SIM answers whether the clip sounds like the target, WER answers whether the words are correct, and a good clone needs both.
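To make the pairing concrete, here is a rough sketch that computes WER as word-level edit distance divided by the number of reference words, then checks it together with a SIM score. The function names word_error_rate and passes_both are invented for this example, and the thresholds min_sim and max_wer are arbitrary placeholders, not recommended values. The transcript of the generated clip is assumed to come from a separate speech recognizer.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must not be empty")
    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def passes_both(sim, wer, min_sim=0.75, max_wer=0.05):
    """True only if the clip is close to the target voice AND the words are right.

    The default thresholds are hypothetical placeholders for illustration.
    """
    return sim >= min_sim and wer <= max_wer


wer = word_error_rate("the cat sat on the mat", "the cat sat on a mat")
print(f"WER = {wer:.3f}")
print(passes_both(sim=0.90, wer=wer))
```

A clip with a high SIM still fails this check if the transcript is wrong, which is the case the pairing is meant to catch.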
When does it matter, and when not?
It matters most for voice cloning, where the whole point is to reproduce a specific person. If you are building a custom voice from a few reference samples, a high SIM is the evidence that the output actually carries that identity rather than drifting toward a generic voice.
It matters less when you are not trying to match anyone in particular. For a stock narrator voice or a synthetic persona with no reference, there is nothing to be similar to, so SIM has little to say. Even where it applies, read it alongside WER, since a voice can sound like the target yet still mangle the words.