Quantization: shrinking a model to a smaller number format
Quantization means storing a model's weights in a smaller numeric format, using fewer bits per number, so the model takes less memory and usually runs faster. The cost is a loss of precision: the numbers become approximations, which can slightly change the model's output. Done well, the savings are large and the quality loss is small.
At a glance
What it is: Storing weights in fewer bits per number to save memory
Why you do it: A model fits in less memory and usually runs faster
What it costs: Some precision; outputs can shift a little, and sometimes a capability is lost
Common formats: Smaller floating-point and integer formats, named by their bit width
Comparison
Full precision versus a quantized model
Memory used by weights
Full precision: Largest; each number takes the most bits
Quantized: Smaller; fewer bits per number
Speed
Full precision: Baseline; more data to move
Quantized: Usually faster; less data to move
Output quality
Full precision: The reference; nothing approximated
Quantized: Close, with a small approximation error
What is quantization?
A model is a big pile of numbers, the weights. By default each one is stored in
a relatively large format, such as 16-bit or 32-bit floating point, which is
accurate but heavy: it takes a lot of memory and a lot of bandwidth to read.
Quantization stores those numbers in a smaller format, using fewer bits each.
The numbers become approximate, as if rounded, but there are far fewer bits to
hold and to move.
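The rounding idea can be made concrete with a minimal sketch. It assumes one simple scheme, symmetric rounding to signed integers with a single scale for the whole list; real formats vary and often use a scale per block or per channel. The names quantize and dequantize and the sample weights are made up for illustration.

```python
def quantize(weights, bits=8):
    """Map floats to signed integer codes of the given bit width.

    Assumed scheme: symmetric, one scale shared by every weight.
    """
    qmax = 2 ** (bits - 1) - 1                    # 127 for 8 bits
    largest = max(abs(w) for w in weights)
    scale = largest / qmax if largest > 0 else 1.0
    # Round each weight to the nearest step and clamp to the valid range.
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate floats from the integer codes."""
    return [c * scale for c in codes]


weights = [0.12, -0.53, 0.99, -1.27, 0.004]       # illustrative values
codes, scale = quantize(weights, bits=8)
restored = dequantize(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print("codes:    ", codes)
print("restored: ", restored)
print("max error:", max_error)
```

Each code fits in 8 bits instead of 16 or 32. In this scheme the error per weight is at most half a scale step, which is the approximation the text describes.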
The payoff is direct. A model in a smaller format takes less memory, so a model
that did not fit may now fit, and it usually runs faster because there is less
data to read on every step. This is how a large model squeezes onto a single
box that could not hold it at full precision.
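The memory side of that payoff is simple arithmetic: parameters times bits per weight, divided by eight, gives bytes. The sketch below works it through for a hypothetical 7-billion-parameter model, a size chosen only for illustration. It counts the weights alone and ignores the small overhead that real quantized formats add for scales and similar metadata.

```python
def weight_memory_gib(num_params, bits_per_weight):
    """Memory needed for the weights alone, in GiB."""
    total_bytes = num_params * bits_per_weight / 8
    return total_bytes / 2**30


NUM_PARAMS = 7_000_000_000  # hypothetical model size, for illustration only

for bits in (32, 16, 8, 4):
    print(f"{bits:>2} bits: {weight_memory_gib(NUM_PARAMS, bits):6.2f} GiB")
```

Under these assumptions the weights need roughly 13 GiB at 16 bits and a little over 3 GiB at 4 bits, which is the difference between not fitting and fitting on a smaller machine.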
What does quantization cost you?
Precision. The weights are approximations now, so the output can shift a little
from the full-precision version. With a good method that shift is small and
often hard to notice. With a careless one it shows up as worse answers, and some
formats can drop part of a model entirely, such as a vision component, so the
quantized version quietly loses a capability.
So treat the bit width as a dial, not a free win. Going smaller buys memory and
speed; you pay in precision. The honest move is to measure the quantized model
on your own task before trusting it, not to assume the savings came for nothing.
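To see the dial in action, here is a toy sketch of that measurement. Everything in it is a stand-in: the "model" is a single dot product, the weights and inputs are random, and fake_quantize is a hypothetical helper that rounds weights to a signed grid with one shared scale. A real check would run the actual quantized model on your own task and compare a task metric.

```python
import random


def fake_quantize(weights, bits):
    """Round weights to a signed grid of the given bit width, then map back to floats."""
    qmax = 2 ** (bits - 1) - 1
    largest = max(abs(w) for w in weights)
    scale = largest / qmax if largest > 0 else 1.0
    return [max(-qmax, min(qmax, round(w / scale))) * scale for w in weights]


def predict(weights, features):
    """Stand-in model: a single dot product."""
    return sum(w * x for w, x in zip(weights, features))


rng = random.Random(0)  # fixed seed so the comparison is repeatable
weights = [rng.uniform(-1, 1) for _ in range(256)]
task_inputs = [[rng.uniform(-1, 1) for _ in range(256)] for _ in range(50)]

# Full-precision outputs are the reference to compare against.
reference = [predict(weights, x) for x in task_inputs]

mean_shift = {}
for bits in (8, 4, 2):
    approx_weights = fake_quantize(weights, bits)
    outputs = [predict(approx_weights, x) for x in task_inputs]
    mean_shift[bits] = sum(abs(r - o) for r, o in zip(reference, outputs)) / len(reference)
    print(f"{bits} bits: mean output shift {mean_shift[bits]:.4f}")
```

The shift grows as the bit width shrinks. How much shift is acceptable depends on the task, which is why the comparison has to be made on your own data.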
Quantization helps with
Fitting a larger model into limited memory
Running faster by moving less data per step
Leaving headroom for a longer context
Serving more requests at once on the same box
It will not fix
A model that is simply too big even after shrinking
Slow memory bandwidth, though it does ask less of it
Quality, which it spends rather than improves
Lost capabilities if a format drops part of the model, such as vision