GPTQ: a post-training quantization method

GPTQ is a post-training quantization method: it takes an already-trained model and rewrites its weights at lower precision, using a small calibration dataset to choose the rounding so that accuracy loss stays low. The weight format it produces is widely supported, and some other quantized builds pack their weights in a GPTQ-compatible layout so existing runtimes can load them.

What kind of quantization is GPTQ?

Quantization rewrites a model’s weights at lower precision so the model takes less memory and decodes faster. There are two broad ways to do it. You can build the lower precision into training, or you can quantize a model that is already finished. GPTQ is the second kind: post-training quantization.
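To make "rewriting weights at lower precision" concrete, here is a minimal sketch of plain round-to-nearest quantization. This is the simplest baseline, not GPTQ itself. The 4-bit width, the one-scale-per-row scheme, and the function names are assumptions chosen for illustration, and the byte counts are simple arithmetic assuming a 16-bit baseline.

```python
import numpy as np

def quantize_rtn(weights, bits=4):
    """Symmetric round-to-nearest quantization with one scale per row."""
    qmax = 2 ** (bits - 1) - 1  # 7 for 4-bit codes
    # Pick each row's scale so its largest weight lands on the last grid point.
    scale = np.abs(weights).max(axis=1, keepdims=True) / qmax
    codes = np.clip(np.round(weights / scale), -qmax - 1, qmax).astype(np.int8)
    return codes, scale

def dequantize(codes, scale):
    """Map integer codes back to approximate floating-point weights."""
    return codes.astype(np.float32) * scale

rng = np.random.default_rng(0)
weights = rng.normal(size=(8, 64)).astype(np.float32)

codes, scale = quantize_rtn(weights, bits=4)
restored = dequantize(codes, scale)

# Storage estimate: 2 bytes per weight at 16-bit precision, versus
# two 4-bit codes per byte plus one 16-bit scale per row.
fp16_bytes = weights.size * 2
int4_bytes = weights.size // 2 + scale.size * 2

print("largest rounding error:", float(np.abs(weights - restored).max()))
print("bytes at 16-bit:", fp16_bytes, "bytes at 4-bit:", int4_bytes)
```

Each weight moves to the nearest point on a coarse grid, so the per-weight error is at most half a grid step. Round-to-nearest treats every weight independently and never looks at data, which is the gap a calibrated method tries to close.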

That matters because post-training is cheap. You do not retrain the model. You run a small calibration dataset through it and use the activations you observe to pick the rounding carefully, layer by layer, nudging the weights that are not yet quantized to absorb the error from the ones already rounded, so the accuracy loss stays modest. The whole pass is fast compared with training and needs only a slice of data, which is why GPTQ became a common way to ship quantized weights people can actually run at home.
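The calibration idea can be sketched for a single linear layer. What follows is a simplified, unblocked illustration in the spirit of GPTQ, not a production implementation: it rounds one input column at a time and spreads each column's rounding error over the columns still waiting, guided by statistics of the calibration inputs. The function names, the damping value, the fixed per-row grid, and the synthetic data are all assumptions made for the example.

```python
import numpy as np

def to_grid(w, scale, bits=4):
    """Snap values to the nearest point of a symmetric integer grid."""
    qmax = 2 ** (bits - 1) - 1
    return np.clip(np.round(w / scale), -qmax - 1, qmax) * scale

def row_scales(W, bits=4):
    """One scale per output row, taken from the original weights."""
    return np.abs(W).max(axis=1, keepdims=True) / (2 ** (bits - 1) - 1)

def round_to_nearest(W, bits=4):
    """Baseline: round every weight independently, ignoring the data."""
    return to_grid(W, row_scales(W, bits), bits)

def calibrated_rounding(W, X, bits=4, damp=0.01):
    """W: (out, in) weights. X: (n_samples, in) calibration inputs."""
    W = W.astype(np.float64)  # astype returns a working copy
    scale = row_scales(W, bits)
    # Second-order statistics of the calibration inputs, lightly damped
    # so the matrix stays safely invertible.
    H = X.T @ X
    H += damp * np.mean(np.diag(H)) * np.eye(H.shape[0])
    # Upper Cholesky factor U of the inverse, so that inv(H) = U.T @ U.
    U = np.linalg.cholesky(np.linalg.inv(H)).T
    Q = np.zeros_like(W)
    for i in range(W.shape[1]):
        q = to_grid(W[:, i:i + 1], scale, bits)[:, 0]
        Q[:, i] = q
        # Push this column's rounding error onto the remaining columns.
        err = (W[:, i] - q) / U[i, i]
        W[:, i:] -= np.outer(err, U[i, i:])
    return Q

rng = np.random.default_rng(0)
mix = rng.normal(size=(32, 32))
X = rng.normal(size=(256, 32)) @ mix  # correlated synthetic inputs
W = rng.normal(size=(16, 32))

Q_rtn = round_to_nearest(W)
Q_cal = calibrated_rounding(W, X)

def output_error(Q):
    """How far the quantized layer's outputs drift on the calibration data."""
    return float(np.linalg.norm(X @ (W - Q).T))

print("round-to-nearest output error:", output_error(Q_rtn))
print("calibrated output error:      ", output_error(Q_cal))
```

The quantity being compared is the error in the layer's outputs on the calibration data, not the error in the individual weights. When input features are correlated, compensating later columns typically lowers that output error relative to round-to-nearest, though the sketch makes no guarantee for any particular data.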

Why does the format keep coming up?

Two reasons. First, support is broad. Many local inference runtimes can load GPTQ weights, so it is a safe target if you want something that just runs.

Second, the layout gets reused. Some builds produced by other quantization methods pack their weights in a GPTQ-compatible format on purpose, so existing runtimes load them without new code. That means you can point a server at such a build and set its quantization flag to gptq even when the underlying method was something else. The name on the box and the recipe inside it are not always the same thing, which is worth remembering when you compare two quantized models.
