Mixed precision: more than one number format in one run

Mixed precision is the practice of using more than one numeric format inside a single model or run: most of the work happens in a small, fast format, while the parts that need accuracy stay in a larger one. The result is faster, lighter computation that keeps the precision where it actually matters.

At a glance

What it is: Using more than one number format in one model or run
Why do it: Speed and memory savings without losing accuracy where it counts
The trade: Small format for the bulk, larger format for the sensitive parts
Related but distinct: Quantization shrinks stored weights; mixed precision assigns different formats to different parts of a run
One run, two precisions

The bulk of the work runs in a small, fast format. The parts that are sensitive to rounding stay in a larger format, which is the accuracy you deliberately keep.

The stack, from top to bottom:

  3. Net result: faster and lighter, accuracy preserved where it matters
  2. Sensitive parts (larger format): kept accurate to avoid drift and instability
  1. Bulk compute (small, fast format): most matrix work runs here for speed and memory

What does mixed precision mean?

A model does an enormous amount of arithmetic, and not every part of it needs the same accuracy. Mixed precision uses that fact. Most of the work runs in a small numeric format, which is faster to move through memory and quicker to multiply. The parts that are sensitive to rounding, where small errors would pile up and push the model off course, stay in a larger, more accurate format.
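The "errors pile up" point can be illustrated with a running total. The sketch below is a minimal illustration, not a real training or inference loop: it assumes IEEE 754 16-bit and 32-bit floats as the small and larger formats, emulates them with Python's standard `struct` module, and uses a made-up workload of 10,000 copies of 0.01. The helper names (`to_half`, `to_single`, `sum_all_small`, `sum_mixed`) are hypothetical.

```python
import struct


def to_half(x: float) -> float:
    """Round x to the nearest 16-bit (IEEE 754 half precision) value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_single(x: float) -> float:
    """Round x to the nearest 32-bit (IEEE 754 single precision) value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def sum_all_small(values):
    """Small format everywhere: values and running total are both 16-bit."""
    total = 0.0
    for v in values:
        total = to_half(total + to_half(v))
    return total


def sum_mixed(values):
    """Mixed: values stay 16-bit, but the running total is kept in 32-bit."""
    total = 0.0
    for v in values:
        total = to_single(total + to_half(v))
    return total


values = [0.01] * 10_000  # made-up workload; the true sum is about 100
all_small = sum_all_small(values)
mixed = sum_mixed(values)

print(f"16-bit running total: {all_small}")
print(f"32-bit running total: {mixed}")
```

Under these assumptions the 16-bit total stops growing once each new 0.01 is smaller than half the gap between neighbouring 16-bit values, so it ends far short of 100. Keeping only the running total in the larger format is enough to land close to the true sum, which is the kind of split mixed precision relies on.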

The word precision here means how many bits a number gets, and so how finely it can represent a value. A 16-bit float, for example, holds roughly three significant decimal digits, where a 32-bit float holds about seven. A smaller format saves memory and time but rounds harder. Mixed precision is simply the decision to spend accuracy where it earns its keep rather than use one format for everything.
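To make "rounds harder" concrete, here is a small sketch that stores the same value in a 16-bit and a 32-bit float. It assumes IEEE 754 half and single precision as the two formats and emulates them with Python's standard `struct` module; the helper names are hypothetical.

```python
import struct


def to_half(x: float) -> float:
    """Round x to the nearest 16-bit (IEEE 754 half precision) value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_single(x: float) -> float:
    """Round x to the nearest 32-bit (IEEE 754 single precision) value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


value = 1.0004

# Just above 1.0, neighbouring 16-bit values are 2**-10 (about 0.00098) apart,
# so 1.0004 falls closer to 1.0 than to the next representable value.
as_half = to_half(value)

# Neighbouring 32-bit values near 1.0 are 2**-23 apart, so the detail survives.
as_single = to_single(value)

print(as_half)    # 1.0
print(as_single)  # very close to 1.0004
```

The 0.0004 is simply lost in the smaller format. A single loss like this is harmless in most places; it matters where many such losses would add up.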

How is it different from quantization?

The two get confused because both involve smaller number formats. Quantization shrinks a model’s stored weights into a compact, low-precision encoding so the whole thing takes less space. Mixed precision is about the running computation: different parts of the same run use different formats at the same time.

You can use both. A quantized model can still run with mixed precision during inference. The practical rule is the same either way: smaller formats are faster and lighter, but they round more, so you keep the larger format where the model is fragile and measure the output rather than assume it held up.
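The difference between the two, and how they combine, can be sketched in a few lines. This is an illustration under stated assumptions, not how any particular library does it: quantization is shown as symmetric per-tensor int8 encoding (one simple scheme among many), the small compute format is an emulated 16-bit float, and an ordinary Python float stands in for the larger format. All function names and sample numbers are made up.

```python
import struct


def to_half(x: float) -> float:
    """Round x to the nearest 16-bit (IEEE 754 half precision) value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def quantize_int8(weights):
    """Quantization: re-encode stored weights as int8 codes plus one scale."""
    largest = max(abs(w) for w in weights)
    scale = largest / 127 if largest > 0 else 1.0
    codes = [max(-127, min(127, round(w / scale))) for w in weights]
    return codes, scale


def dot_mixed(codes, scale, inputs):
    """Mixed precision: small-format operands, larger-format running total."""
    total = 0.0  # Python float (64-bit) plays the role of the larger format
    for code, x in zip(codes, inputs):
        w = to_half(code * scale)  # dequantize into the small compute format
        total += w * to_half(x)    # accumulate in the larger format
    return total


weights = [0.5, -1.25, 0.75, 2.0]  # illustrative values only
inputs = [1.0, 2.0, -1.0, 0.5]

codes, scale = quantize_int8(weights)
reference = sum(w * x for w, x in zip(weights, inputs))
approx = dot_mixed(codes, scale, inputs)

print(codes)              # what gets stored
print(reference, approx)  # full-precision result vs. quantized, mixed run
```

`quantize_int8` changes what is stored; `dot_mixed` changes how the run is computed. Either can be used without the other, and the gap between `reference` and `approx` is the thing to measure.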

Mixed precision buys you

  • Faster compute, since small formats move and multiply quicker
  • Lower memory use than running everything in a large format
  • Accuracy held where rounding errors would otherwise compound

It will not

  • Make a model that is simply too big fit; that is quantization's job
  • Come for free; the wrong split can still hurt quality
  • Replace measuring; you confirm output quality, you do not assume it
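One way to act on the last point is to run the same computation twice, once in a trusted larger format and once with the mixed split, and compare. The sketch below assumes a toy dot product as the "model", emulated 16-bit and 32-bit floats via Python's `struct` module, and a tolerance of 1% that is purely hypothetical; a real check would use your own model, data, and quality metric.

```python
import struct


def to_half(x: float) -> float:
    """Round x to the nearest 16-bit (IEEE 754 half precision) value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_single(x: float) -> float:
    """Round x to the nearest 32-bit (IEEE 754 single precision) value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def dot_reference(weights, inputs):
    """Reference run: everything in Python's 64-bit floats."""
    return sum(w * x for w, x in zip(weights, inputs))


def dot_mixed(weights, inputs):
    """Mixed run: 16-bit operands, 32-bit running total."""
    total = 0.0
    for w, x in zip(weights, inputs):
        total = to_single(total + to_half(w) * to_half(x))
    return total


def relative_error(reference: float, candidate: float) -> float:
    """Size of the difference relative to the reference value."""
    if reference == 0:
        return abs(candidate)
    return abs(candidate - reference) / abs(reference)


TOLERANCE = 1e-2  # hypothetical threshold; choose one that suits the task

# Made-up positive data so the check is not dominated by cancellation.
weights = [0.1 + 0.01 * (i % 10) for i in range(256)]
inputs = [0.5 + 0.05 * (i % 7) for i in range(256)]

error = relative_error(dot_reference(weights, inputs), dot_mixed(weights, inputs))
acceptable = error < TOLERANCE

print(f"relative error: {error:.2e}, acceptable: {acceptable}")
```

The point of the sketch is the shape of the check, not the numbers: keep a reference, measure the gap, and decide against an explicit threshold.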

Reviewed: June 2026