Quantization: shrinking a model to a smaller number format

Quantization means storing a model's weights in a smaller numeric format, using fewer bits per number, so the model takes less memory and usually runs faster. The cost is a loss of precision: the stored numbers are approximations, which can slightly change the model's output. Done well, the savings are large and the quality loss is small.

At a glance

What it is: Storing weights in fewer bits per number to save memory

Why you do it: A model fits in less memory and usually runs faster

What it costs: Some precision; outputs can shift a little, and sometimes a capability is lost

Common formats: Smaller floating-point and integer formats, named by their bit width

What is quantization?

A model is a big pile of numbers, the weights. By default each one is stored in a relatively large format, which is accurate but heavy: it takes a lot of memory and a lot of bandwidth to read. Quantization stores those numbers in a smaller format, using fewer bits each. The numbers become approximate, like rounding, but there are far fewer bits to hold and to move.
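To make the rounding concrete, here is a minimal sketch of one simple scheme: symmetric, per-tensor quantization to signed 8-bit integers. The function names `quantize` and `dequantize` and the sample weights are illustrative assumptions, not part of any particular library, and real methods are usually more elaborate (per-channel or per-group scales, calibration data, and so on).

```python
def quantize(weights, bits=8):
    """Map floats onto a signed integer grid with `bits` bits.

    Assumption: one shared scale for the whole list (symmetric, per-tensor).
    """
    qmax = 2 ** (bits - 1) - 1                    # 127 for 8 bits
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round each weight to the nearest grid point and clamp to the valid range.
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate floats from the stored integers."""
    return [v * scale for v in q]


weights = [0.42, -1.0, 0.27, 0.1]                 # made-up example values
q, scale = quantize(weights, bits=8)
restored = dequantize(q, scale)

print(q)                                          # [53, -127, 34, 13]
for original, approx in zip(weights, restored):
    print(f"{original:+.4f} -> {approx:+.4f}")
```

Only the small integers and a single scale need to be stored. The restored values sit within half a grid step of the originals, which is the "like rounding" approximation described above.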

The payoff is direct. A model in a smaller format takes less memory, so a model that did not fit may now fit, and it usually runs faster because there is less data to read on every step. This is how a large model squeezes onto a single box that could not hold it at full precision.
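The memory saving is simple arithmetic: number of weights times bits per weight, divided by 8 to get bytes. The sketch below assumes a hypothetical model with 7 billion parameters and counts only the weights; it ignores the small overhead of stored scales as well as activations and any other runtime memory.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Approximate weight storage in gigabytes (1 GB = 10**9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


params = 7_000_000_000  # hypothetical 7-billion-parameter model

for bits in (32, 16, 8, 4):
    print(f"{bits:>2}-bit: {weight_memory_gb(params, bits):.1f} GB")
# 32-bit: 28.0 GB
# 16-bit: 14.0 GB
#  8-bit: 7.0 GB
#  4-bit: 3.5 GB
```

Halving the bit width halves the weight footprint, which is why a model that does not fit at 16 bits may fit at 8 or 4.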

What does quantization cost you?

Precision. The weights are approximations now, so the output can shift a little from the full-precision version. With a good method that shift is small and often hard to notice. With a careless one it shows up as worse answers, and some formats can drop part of a model entirely, such as a vision component, so the quantized version quietly loses a capability.

So treat the bit width as a dial, not a free win. Going smaller buys memory and speed; you pay in precision. The honest move is to measure the quantized model on your own task before trusting it, not to assume the savings came for nothing.
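The dial can be seen on a toy example. The sketch below rounds the same random weights at 8, 4, and 2 bits with a simple symmetric scheme and reports how far the weights, and a single dot product computed from them, drift from the full-precision values. The weights, inputs, and helper names are made up for illustration; this stands in for, and does not replace, measuring a real quantized model on your own task.

```python
import random


def quantize_dequantize(weights, bits):
    """Round weights to a signed `bits`-bit grid and map them back to floats."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    return [max(-qmax, min(qmax, round(w / scale))) * scale for w in weights]


def mean_abs_error(a, b):
    """Average absolute difference between two equal-length lists."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


rng = random.Random(0)                      # fixed seed so runs are repeatable
weights = [rng.gauss(0.0, 1.0) for _ in range(1024)]
inputs = [rng.gauss(0.0, 1.0) for _ in range(1024)]

# Full-precision "output": one dot product standing in for a model step.
reference = sum(w * x for w, x in zip(weights, inputs))

errors = {}
for bits in (8, 4, 2):
    approx = quantize_dequantize(weights, bits)
    errors[bits] = mean_abs_error(weights, approx)
    output = sum(w * x for w, x in zip(approx, inputs))
    print(f"{bits}-bit: weight error {errors[bits]:.4f}, "
          f"output shift {abs(output - reference):.4f}")
```

The weight error grows as the bit width shrinks. How much of that reaches the final answers depends on the model and the task, which is the reason to measure rather than assume.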
