Learn

AWQ: activation-aware weight quantization

AWQ (Activation-aware Weight Quantization) is a post-training quantization method that rewrites a model's weights at lower precision while protecting the weights that carry the most signal, identified by looking at the activations the model produces in practice. The result is a smaller, faster model that holds onto accuracy better than naive rounding would.

At a glance

What it is
A post-training method that quantizes weights while protecting the important ones
Stands for
Activation-aware Weight Quantization
The idea
Use the activations to find which weights matter, and round those more gently
Why you meet it
A common format for serving large models at lower precision locally

What does “activation-aware” mean?

All quantization rewrites weights at lower precision to save memory and speed up decode. The hard part is doing it without making the model noticeably worse. Round every weight the same and you lose accuracy, because some weights carry far more of the model’s signal than others.

AWQ (Activation-aware Weight Quantization) is the method that pays attention to this. It looks at the activations, the values that flow through the model when it actually runs, to work out which weights matter most. Those weights get protected, rounded more gently, while the rest are rounded harder. The “activation-aware” part is exactly that: the activations tell it where to be careful. The payoff is a model that holds accuracy better than naive rounding at the same bit width.

Where does it fit when you are picking a model?

AWQ is a post-training method, so it is applied to a finished model and ships as a ready-to-serve build, often at INT4 (4-bit integer) precision. You will see it named in model cards and launch configs alongside other formats, and many local runtimes load it directly.

One caution that AWQ shares with every quantization method: smaller weights do not make a model small enough on their own. A very large model at AWQ INT4 can still have a footprint several times your memory budget, which means it does not fit at all, never mind how good the quantization is. Quantization changes the size; it does not repeal the size. Check the footprint against your hardware before the quality.

Activation-aware (AWQ)

  • Looks at activations to find the weights that matter most
  • Protects those weights, rounds the rest harder
  • Holds accuracy better at the same bit width
  • Needs a calibration pass to see the activations

Naive rounding

  • Rounds every weight the same way
  • Treats important and unimportant weights alike
  • Loses more accuracy at the same bit width
  • Needs no calibration, but pays for it in quality

Related terms

← All terms Reviewed: June 2026