What does “activation-aware” mean?
All quantization rewrites weights at lower precision to save memory and speed up decode. The hard part is doing it without making the model noticeably worse. Round every weight the same and you lose accuracy, because some weights carry far more of the model’s signal than others.
AWQ (Activation-aware Weight Quantization) is the method that pays attention to this. It looks at the activations, the values that flow through the model when it actually runs, to work out which weights matter most. Those weights get protected, rounded more gently, while the rest are rounded harder. The “activation-aware” part is exactly that: the activations tell it where to be careful. The payoff is a model that holds accuracy better than naive rounding at the same bit width.
Where does it fit when you are picking a model?
AWQ is a post-training method, so it is applied to a finished model and ships as a ready-to-serve build, often at INT4 (4-bit integer) precision. You will see it named in model cards and launch configs alongside other formats, and many local runtimes load it directly.
One caution that AWQ shares with every quantization method: smaller weights do not make a model small enough on their own. A very large model at AWQ INT4 can still have a footprint several times your memory budget, which means it does not fit at all, never mind how good the quantization is. Quantization changes the size; it does not repeal the size. Check the footprint against your hardware before the quality.