What kind of quantization is GPTQ?
Quantization stores a model’s weights at lower precision so the model takes less memory and, when memory bandwidth is the bottleneck, generates tokens faster. There are two broad ways to do it. You can build the lower precision into training (quantization-aware training), or you can quantize a model that is already finished. GPTQ is the second kind: post-training quantization.
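To make the idea concrete, here is a minimal sketch of the simplest possible scheme: plain round-to-nearest onto a 4-bit integer grid with a single scale. This is not GPTQ itself, only the baseline it improves on. The function names and the 7-billion-parameter figure are made up for illustration, and the memory estimate ignores the small overhead of storing scales.

```python
import numpy as np

def quantize_int4(w):
    """Round-to-nearest onto a signed 4-bit grid (-8..7) with one shared scale."""
    scale = np.abs(w).max() / 7           # largest magnitude maps to level 7
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from integer levels."""
    return q.astype(np.float32) * scale

def weight_bytes(n_params, bits):
    """Storage for the weights alone, ignoring scales and other metadata."""
    return n_params * bits / 8

weights = np.array([0.70, -0.32, 0.11, 0.0], dtype=np.float32)
q, scale = quantize_int4(weights)
restored = dequantize(q, scale)

print(q)                                   # integer levels stored on disk
print(restored)                            # what the runtime computes with
print(weight_bytes(7_000_000_000, 16))     # hypothetical 7B model at 16-bit
print(weight_bytes(7_000_000_000, 4))      # same model at 4-bit
```

The restored weights are close to the originals but not identical. That gap is the accuracy cost, and how you choose the rounding decides how large it gets.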
That matters because post-training quantization is cheap compared with training. You do not retrain the model. You run a small calibration dataset through it and use the activations you observe to choose the rounding layer by layer, adjusting the weights that are not yet quantized to absorb the error from the ones already rounded, so the accuracy loss stays modest. The whole pass is fast compared with training and needs only a slice of data, which is why GPTQ became a common way to ship quantized weights people can actually run at home.
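The calibration step can be sketched for a single linear layer. What follows is a simplified illustration in the spirit of GPTQ-style error compensation, not the reference implementation: real implementations add refinements such as block-wise updates and per-group scales. All names, shapes, and the damping value are assumptions chosen for the example, and the calibration inputs are synthetic.

```python
import numpy as np

def round_to_grid(w, scale, bits=4):
    """Plain round-to-nearest onto a signed integer grid."""
    qmax = 2 ** (bits - 1) - 1
    return np.clip(np.round(w / scale), -qmax - 1, qmax) * scale

def quantize_with_calibration(W, X, scale, bits=4, damp=0.01):
    """Quantize W (out x in) column by column, using calibration inputs X (n x in).

    After each column is rounded, its rounding error is pushed onto the
    columns that are still unquantized, weighted by second-order statistics
    of the calibration inputs.
    """
    W = W.astype(np.float64).copy()
    H = X.T @ X                                            # input statistics (in x in)
    H = H + damp * np.mean(np.diag(H)) * np.eye(H.shape[0])  # damping for stability
    U = np.linalg.cholesky(np.linalg.inv(H)).T             # upper factor of H^-1
    Q = np.zeros_like(W)
    for i in range(W.shape[1]):
        Q[:, i] = round_to_grid(W[:, i], scale, bits)
        err = (W[:, i] - Q[:, i]) / U[i, i]
        W[:, i:] -= np.outer(err, U[i, i:])                # compensate later columns
    return Q

# Synthetic layer and correlated calibration inputs (illustrative only).
rng = np.random.default_rng(0)
X = 0.5 * rng.normal(size=(256, 32)) + rng.normal(size=(256, 1))
W = rng.normal(size=(16, 32))
scale = np.abs(W).max(axis=1) / 7                          # one scale per output row

Q_rtn = round_to_grid(W, scale[:, None])
Q_cal = quantize_with_calibration(W, X, scale)

def output_error(Q):
    """How far the quantized layer's outputs drift on the calibration data."""
    return np.linalg.norm(X @ W.T - X @ Q.T)

err_rtn = output_error(Q_rtn)
err_cal = output_error(Q_cal)
print("round-to-nearest output error:", err_rtn)
print("calibrated output error:      ", err_cal)
```

Both versions use the same 4-bit grid, so the storage cost is identical. The only difference is which grid point each weight lands on, and the calibrated version picks them to keep the layer's outputs close to the original rather than each weight in isolation.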
Why does the format keep coming up?
Two reasons. First, support is broad. Many local inference runtimes can load GPTQ weights, so it is a safe target if you want something that just runs.
Second, the layout gets reused. Some builds produced by other quantization methods pack their weights in a GPTQ-compatible format on purpose, so existing runtimes load them without new code. That means you can point a server at such a build and set its quantization flag to gptq even when the underlying method was something else. The name on the box and the recipe inside it are not always the same thing, which is worth remembering when you compare two quantized models.