
INT4: four-bit integers, the small end of quantization

INT4 (4-bit integer) stores each model weight as a 4-bit integer, mapped back to a real value by a scale and offset kept for each group of weights. At four bits per weight it needs about a quarter of the weight memory of a 16-bit model, which is what lets large models fit on a single box. The accuracy cost is real and depends heavily on how the quantization was done.

At a glance

What it is: Each weight stored as a 4-bit integer with a scale factor
Why use it: About a quarter of the weight memory of 16-bit
The catch: Accuracy loss is real and method-dependent
How it maps back: A per-group scale and offset turn integers into real values
Comparison

INT4 versus 16-bit weights

Property          16-bit (BF16/FP16)   INT4
Bits per weight   Sixteen              Four
Weight memory     Baseline             About a quarter
Accuracy risk     Low                  Higher; depends on the method

What is INT4?

INT4 (4-bit integer) stores each weight as a small whole number, four bits wide, which can hold sixteen distinct values. On its own that is far too coarse, so the format also keeps a scale factor (and often an offset) for each small group of weights. To use a weight, the engine multiplies the integer by its group’s scale, and adds the offset if there is one, to recover an approximate real value. The payoff is memory: four bits per weight, plus a small overhead for the scales, is about a quarter of 16-bit. That is often the difference between a large model loading on one box and not loading at all.
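The mapping described above can be sketched in a few lines of plain Python. This is a minimal illustration, not how any particular engine does it: the function names are made up for the example, the scale is a simple min-max fit, and the nibble order in `pack_nibbles` is an assumption.

```python
def quantize_group(weights):
    """Turn one group of floats into 4-bit codes (0..15) plus a scale and offset."""
    lo, hi = min(weights), max(weights)
    # Sixteen levels means fifteen steps; fall back to 1.0 if the group is constant.
    scale = (hi - lo) / 15 or 1.0
    codes = [min(15, max(0, round((w - lo) / scale))) for w in weights]
    return codes, scale, lo


def dequantize_group(codes, scale, offset):
    """Recover approximate real values: integer times scale, plus offset."""
    return [c * scale + offset for c in codes]


def pack_nibbles(codes):
    """Store two 4-bit codes per byte, low nibble first (an assumed order)."""
    padded = codes + [0] * (len(codes) % 2)
    return bytes(padded[i] | (padded[i + 1] << 4) for i in range(0, len(padded), 2))


group = [-0.30, -0.10, 0.05, 0.45]
codes, scale, offset = quantize_group(group)
restored = dequantize_group(codes, scale, offset)
packed = pack_nibbles(codes)  # four weights fit in two bytes
```

Each restored value lands within half a step (`scale / 2`) of the original. That half step is the rounding error the rest of this page is about.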

Why is the method what matters?

Four bits is not much room, so how you choose the integers and scales decides whether the model survives the squeeze. Two INT4 quantizations of the same model can score very differently on the same task. That is why “INT4” alone tells you the size but not the quality. The size is a fact about the file. The quality is a fact about the method, and you only learn it by measuring.
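A small sketch makes the point concrete. It quantizes the same made-up weights two ways, once with a single scale for everything and once with a scale per group of sixteen, then compares the rounding error. Group size is only one of the choices a real method makes, the data and function names here are invented for illustration, and reconstruction error is a rough proxy: it does not replace measuring on your own task.

```python
def quantize_roundtrip(weights, group_size):
    """Quantize to 4 bits per group (min-max scale), then map straight back."""
    restored = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        lo, hi = min(group), max(group)
        scale = (hi - lo) / 15 or 1.0  # avoid a zero scale for constant groups
        restored.extend(round((w - lo) / scale) * scale + lo for w in group)
    return restored


def mean_abs_error(original, restored):
    return sum(abs(a - b) for a, b in zip(original, restored)) / len(original)


# Synthetic weights: small values in [-0.05, 0.05], plus one large outlier.
weights = [((i * 37) % 101 - 50) / 1000 for i in range(64)]
weights[5] = 4.0

one_scale = quantize_roundtrip(weights, group_size=len(weights))
per_group = quantize_roundtrip(weights, group_size=16)

print("one scale for all :", mean_abs_error(weights, one_scale))
print("scale per 16      :", mean_abs_error(weights, per_group))
```

Both results are four bits per weight. With one scale, the outlier stretches the range so far that every small weight collapses onto the same level. With per-group scales, only the outlier's own group pays that price.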

When do you reach for INT4?

Reach for INT4 when a model will not fit in the memory you have in a larger format, or when you want to free room for a longer context. It is the small end of the practical range. Below it, accuracy usually falls off a cliff. Treat a 4-bit model as guilty until your own evaluation proves it innocent.

INT4 helps with

  • Fitting a large model that would not load at 16-bit
  • Leaving more of the pool free for context and the cache
  • Running on hardware without specialised low-precision floats
  • Cutting weight memory to roughly a quarter

INT4 does not

  • Guarantee quality; a bad quantization can wreck a model
  • Keep a floating-point range, since it is integer-based
  • Shrink the key-value (KV) cache, a separate budget
  • Help if the model is too big even at four bits
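The budget behind these two lists is plain arithmetic, sketched below. Every number is a hypothetical chosen for illustration: the parameter count, the layer and head counts, the 64 GB pool, the group size of 128, and the 32 bits of scale-plus-offset per group are all assumptions, not the specification of any real model or format. The KV cache is assumed to stay at 16 bits per value.

```python
def weight_bytes(n_params, bits_per_weight, group_size=None, metadata_bits_per_group=0):
    """Weight memory, including per-group scale/offset overhead if grouped."""
    total_bits = n_params * bits_per_weight
    if group_size:
        n_groups = -(-n_params // group_size)  # ceiling division
        total_bits += n_groups * metadata_bits_per_group
    return total_bits // 8


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_tokens, bytes_per_value=2):
    """Keys and values (the factor of 2) for every layer, head, and token."""
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_value


# Hypothetical model and memory pool.
N_PARAMS = 70_000_000_000
POOL = 64 * 10**9

fp16 = weight_bytes(N_PARAMS, 16)
int4 = weight_bytes(N_PARAMS, 4, group_size=128, metadata_bits_per_group=32)
kv = kv_cache_bytes(n_layers=80, n_kv_heads=8, head_dim=128, n_tokens=32_768)

for name, w in [("16-bit", fp16), ("INT4", int4)]:
    verdict = "fits" if w + kv <= POOL else "does not fit"
    print(f"{name}: weights {w / 1e9:.1f} GB + KV cache {kv / 1e9:.1f} GB, {verdict}")
```

The KV cache line is identical in both rows: quantizing the weights changes only the first term. The per-group metadata also shows why the saving is "about" a quarter rather than exactly a quarter.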

Reviewed: June 2026