
GPTQ: a post-training quantization method

GPTQ is a post-training quantization method: it takes an already-trained model and rewrites its weights at lower precision, using a small calibration dataset to choose the rounding so that accuracy loss stays low. It is a widely supported format, and some other quantized builds pack their weights in a GPTQ-compatible layout so existing runtimes can load them.

At a glance

What it is: A post-training method that quantizes a model's weights to lower precision
How it limits damage: A calibration pass that chooses the rounding to minimise error
When it runs: After training, on the finished model, not during it
Why you meet it: A widely supported format, and a layout other quant builds reuse

What kind of quantization is GPTQ?

Quantization rewrites a model’s weights at lower precision so the model takes less memory and, because less data has to move for every token, usually decodes faster. There are two broad ways to do it. You can build the lower precision into training, or you can quantize a model that is already finished. GPTQ is the second kind: post-training quantization.
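
To make "rewriting weights at lower precision" concrete, here is a minimal sketch of plain round-to-nearest quantization in NumPy. It is not GPTQ: it uses 8-bit integers with a single scale for the whole tensor because that is easy to store in a NumPy array, whereas real quantized builds often use fewer bits and pack them more tightly. The function names are made up for this illustration.

```python
import numpy as np

def quantize_int8(w):
    """Symmetric round-to-nearest quantization to int8, one scale per tensor."""
    scale = float(np.abs(w).max()) / 127.0          # largest weight maps to +/-127
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Map the stored integers back to approximate float weights."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
weights = rng.normal(size=(256, 256)).astype(np.float32)   # stand-in for a layer

q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

print("float32 bytes:", weights.nbytes)
print("int8 bytes:   ", q.nbytes)
print("max abs error:", float(np.abs(weights - restored).max()))
```

The integer copy takes a quarter of the memory, and each restored weight lands within about half a quantization step of the original. That rounding error is the damage a calibration-based method tries to spend wisely.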

The post-training part matters because it is cheap. You do not retrain the model. You run a small calibration dataset through it and use the activations you observe to pick the rounding carefully, one layer at a time, nudging the weights not yet quantized to absorb the error already introduced, so the accuracy loss stays modest. The whole pass is fast compared with training and needs only a slice of data, which is why GPTQ became a common way to ship quantized weights people can actually run at home.
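
The idea of letting calibration data steer the rounding can be sketched for a single linear layer. The code below is a simplified illustration in the spirit of GPTQ, not the reference implementation: it quantizes one input column at a time and spreads each column's rounding error over the columns still to come, using second-order statistics of the calibration inputs. The function names, the 4-bit symmetric grid, the per-row scale, the damping value, and the random data are all assumptions made for the example.

```python
import numpy as np

def quantize_rtn(w, scale, bits=4):
    """Round-to-nearest onto a symmetric integer grid, returned as floats."""
    qmax = 2 ** (bits - 1) - 1
    return np.clip(np.round(w / scale), -qmax - 1, qmax) * scale

def quantize_with_compensation(W, X, bits=4, damp=0.01):
    """Quantize W (d_out, d_in) column by column, compensating on calibration X (n, d_in)."""
    W = W.astype(np.float64).copy()                  # work on a copy
    d_in = W.shape[1]
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(W).max(axis=1) / qmax             # one scale per output row

    H = X.T @ X                                      # input second-moment matrix
    H += damp * np.mean(np.diag(H)) * np.eye(d_in)   # damping keeps H invertible
    Hinv = np.linalg.inv(H)

    Q = np.zeros_like(W)
    for i in range(d_in):
        Q[:, i] = quantize_rtn(W[:, i], scale, bits)
        err = (W[:, i] - Q[:, i]) / Hinv[i, i]
        # Shift the not-yet-quantized columns to absorb this column's error.
        W[:, i + 1:] -= np.outer(err, Hinv[i, i + 1:])
        # Drop column i from the inverse before moving on.
        Hinv[i + 1:, i + 1:] -= np.outer(Hinv[i + 1:, i], Hinv[i, i + 1:]) / Hinv[i, i]
    return Q

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 16)) @ rng.normal(size=(16, 16))   # correlated calibration inputs
W = rng.normal(size=(8, 16))                                  # stand-in layer weights

scale = np.abs(W).max(axis=1, keepdims=True) / 7
Q_rtn = quantize_rtn(W, scale)
Q_cal = quantize_with_compensation(W, X)

err_rtn = np.linalg.norm(X @ (W - Q_rtn).T)
err_cal = np.linalg.norm(X @ (W - Q_cal).T)
print("output error, round-to-nearest:", err_rtn)
print("output error, with calibration:", err_cal)
```

Both results sit on the same 4-bit grid; the only difference is which grid point each weight lands on. On correlated inputs like these the compensated version usually shows a lower output error than plain rounding, though that is not guaranteed for every matrix.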

Why does the format keep coming up?

Two reasons. First, support is broad. Many local inference runtimes can load GPTQ weights, so it is a safe target if you want something that just runs.

Second, the layout gets reused. Some builds produced by other quantization methods pack their weights in a GPTQ-compatible format on purpose, so existing runtimes load them without new code. That means you can point a server at such a build and set its quantization flag to gptq even when the underlying method was something else. The name on the box and the recipe inside it are not always the same thing, which is worth remembering when you compare two quantized models.

Post-training (like GPTQ)

  • Quantizes a model that is already trained
  • Needs only a small calibration set, not a full retrain
  • Fast and cheap to apply
  • Accuracy loss depends on how well the calibration data represents real inputs; there is no retraining to recover it

Quantization-aware training

  • Builds quantization into the training run itself
  • Needs the full training pipeline and data
  • Expensive: you pay for training again
  • Can recover more accuracy because the model learns around the rounding

Reviewed: June 2026