What does mixed precision mean?
A model does an enormous amount of arithmetic, and not every part of it needs the same accuracy. Mixed precision uses that fact. Most of the work runs in a small numeric format, which is faster to move through memory and quicker to multiply. The parts that are sensitive to rounding, where small errors would pile up and push the model off course, stay in a larger, more accurate format.
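To make the "small errors pile up" point concrete, here is a minimal sketch in Python with NumPy. It is an illustration under stated assumptions, not how any particular framework does it: the helper name sum_with_accumulator and the choice of 10,000 copies of 0.01 are invented for this example. The inputs are stored in a 16-bit format in both cases; only the format of the running total changes.

```python
import numpy as np

def sum_with_accumulator(values, acc_dtype):
    """Add values one at a time, keeping the running total in acc_dtype.

    Hypothetical helper for illustration only.
    """
    acc = acc_dtype(0)
    for v in values:
        # Each partial sum is rounded to the accumulator's format.
        acc = acc_dtype(acc + acc_dtype(v))
    return float(acc)

# Inputs live in the small format either way (assumed example data).
values = np.full(10_000, 0.01, dtype=np.float16)

# Everything in 16 bits: once the total grows large enough, adding 0.01
# is smaller than the rounding step and the sum stops moving.
low = sum_with_accumulator(values, np.float16)

# Mixed: 16-bit inputs, 32-bit running total.
mixed = sum_with_accumulator(values, np.float32)

print("float16 accumulator:", low)
print("float32 accumulator:", mixed)
```

The true total is close to 100. The 32-bit accumulator lands near it, while the 16-bit accumulator falls well short, which is the kind of sensitive step that mixed precision keeps in the larger format.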
The word precision here means how many bits a number gets, and so how finely it can represent a value. A smaller format saves memory and time but rounds more coarsely, so each stored value can sit further from the true one. Mixed precision is simply the decision not to use one format for everything, but to spend the accuracy where it earns its keep.
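As a rough illustration of bits versus fineness, the sketch below uses NumPy's np.finfo to print the size and rounding step of three common floating-point formats. The helper survives_addition and the test value 0.0001 are assumptions made up for this example.

```python
import numpy as np

# Bits per number and the gap between 1.0 and the next representable value.
for dtype in (np.float16, np.float32, np.float64):
    info = np.finfo(dtype)
    print(f"{info.bits:2d} bits  step near 1.0 = {info.eps}")

def survives_addition(dtype, small):
    """Return True if adding `small` to 1.0 changes the result in `dtype`.

    Hypothetical helper for illustration only.
    """
    one = dtype(1.0)
    return bool(one + dtype(small) != one)

# In 16 bits the small value is rounded away; in 32 bits it is kept.
print("float16 keeps 0.0001:", survives_addition(np.float16, 0.0001))
print("float32 keeps 0.0001:", survives_addition(np.float32, 0.0001))
```

Fewer bits means a larger step between neighboring values, so a change smaller than that step simply disappears.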
How is it different from quantization?
The two get confused because both involve smaller number formats. Quantization shrinks a model’s stored weights into a compact, low-precision encoding so the whole thing takes less space. Mixed precision is about the running computation: different parts of the same run use different formats at the same time.
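The contrast can be sketched side by side. This is a simplified illustration in Python with NumPy, not a description of any specific library: the symmetric 8-bit scheme, the variable names, and the tiny random 4x4 weight matrix are all assumptions chosen for brevity. The mixed-precision half is simulated by rounding the operands to 16 bits and then doing the multiply-and-add in 32 bits.

```python
import numpy as np

rng = np.random.default_rng(0)
weights = rng.standard_normal((4, 4)).astype(np.float32)  # assumed toy weights
inputs = rng.standard_normal(4).astype(np.float32)        # assumed toy input

# Quantization: change how the weights are STORED.
# Symmetric 8-bit scheme: one scale factor maps floats onto integers.
scale = float(np.abs(weights).max()) / 127.0
q_weights = np.round(weights / scale).astype(np.int8)
dequantized = q_weights.astype(np.float32) * scale  # approximate originals

# Mixed precision: change the formats used WHILE COMPUTING.
# Operands are rounded to 16 bits; products are accumulated in 32 bits.
w16 = weights.astype(np.float16)
x16 = inputs.astype(np.float16)
mixed_out = np.matmul(w16.astype(np.float32), x16.astype(np.float32))

print("stored bytes, float32 vs int8:", weights.nbytes, q_weights.nbytes)
print("max quantization error:", float(np.abs(dequantized - weights).max()))
print("mixed-precision output dtype:", mixed_out.dtype)
```

The first half is about the size of what sits on disk or in memory; the second is about which format each step of the arithmetic uses.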
You can use both. A quantized model can still run with mixed precision during inference. The practical rule is the same either way: smaller formats are faster and lighter, but they round more, so you keep the larger format where the model is fragile and measure the output rather than assume it held up.
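One simple way to "measure the output" is to run the same computation in the larger format as a reference and compare. The sketch below does this in Python with NumPy for a single matrix multiply; the helper max_abs_error, the 64x64 random data, and the idea of using worst-case absolute difference as the metric are assumptions for illustration. A real evaluation would compare on the task's own quality metric.

```python
import numpy as np

def max_abs_error(reference, candidate):
    """Largest absolute difference between two results.

    Hypothetical helper; both arrays are compared in float64.
    """
    ref = np.asarray(reference, dtype=np.float64)
    cand = np.asarray(candidate, dtype=np.float64)
    return float(np.abs(ref - cand).max())

rng = np.random.default_rng(0)
a = rng.standard_normal((64, 64)).astype(np.float32)  # assumed toy data
b = rng.standard_normal((64, 64)).astype(np.float32)

reference = a @ b                                      # larger format
candidate = a.astype(np.float16) @ b.astype(np.float16)  # smaller format

err = max_abs_error(reference, candidate)
print("worst-case difference:", err)
```

Whether a given difference is acceptable depends on the model and the task, which is why it has to be measured rather than assumed.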