NVFP4: NVIDIA's 4-bit float format

NVFP4 is NVIDIA's 4-bit floating-point (FP4) quantization format, which stores model weights in 4-bit floats so a large model fits in far less memory while keeping acceptable quality.

How does NVFP4 work?

NVFP4 is a 4-bit floating-point number format from NVIDIA. Instead of storing each model weight in 16 bits, it stores the weight in 4 bits, roughly a quarter of the size. A 4-bit float can represent only a handful of distinct values, so NVFP4 also stores a shared scale factor for each small group of weights. The scale maps the group's actual range onto those few values, which helps preserve quality at such a low bit width.
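To make "only a handful of distinct values" concrete, here is a minimal sketch that decodes every 4-bit code. It assumes the E2M1 layout (1 sign bit, 2 exponent bits, 1 mantissa bit, exponent bias 1) that NVIDIA describes for NVFP4 elements. The function name `decode_e2m1` is made up for this illustration and is not part of any library.

```python
def decode_e2m1(code: int) -> float:
    """Decode a 4-bit code, assuming 1 sign, 2 exponent and 1 mantissa bit."""
    sign = -1.0 if (code >> 3) & 1 else 1.0
    exponent = (code >> 1) & 0b11
    mantissa = code & 1
    if exponent == 0:
        # Subnormal: no implicit leading 1, so the magnitude is 0 or 0.5.
        magnitude = 0.5 * mantissa
    else:
        # Normal: implicit leading 1 and an assumed exponent bias of 1.
        magnitude = (1.0 + 0.5 * mantissa) * 2.0 ** (exponent - 1)
    return sign * magnitude


# The eight non-negative codes; the other eight are their negatives.
positive_values = [decode_e2m1(code) for code in range(8)]
print(positive_values)  # [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
```

With so few magnitudes available, and none above 6, raw weights cannot be stored directly. Each group's scale factor is what stretches or shrinks this small grid to cover the weights in that group.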

Quantizing to NVFP4 happens after training: you take the full-precision weights and convert them to 4-bit values. The model then runs from the smaller 4-bit weights, with the hardware handling the format during inference.
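The conversion step can be sketched in plain Python. This is a simplified illustration, not NVIDIA's implementation. It assumes blocks of 16 weights that share one scale and the eight E2M1 magnitudes as the target grid. It keeps the scale as an ordinary Python float, whereas the real format stores it in 8 bits, and it breaks ties toward the smaller value. The names `FP4_GRID`, `quantize_block`, `dequantize_block` and `quantize` are hypothetical.

```python
# Magnitudes a 4-bit E2M1 float can represent (assumed element format).
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK_SIZE = 16  # assumed number of weights that share one scale


def quantize_block(weights):
    """Return (scale, codes) for one block; codes are signed grid values."""
    amax = max(abs(w) for w in weights)
    # Map the largest magnitude in the block onto the largest grid value.
    scale = amax / FP4_GRID[-1] if amax > 0 else 1.0
    codes = []
    for w in weights:
        target = abs(w) / scale
        # Nearest grid value; ties go to the smaller one (a simplification).
        nearest = min(FP4_GRID, key=lambda g: abs(g - target))
        codes.append(-nearest if w < 0 else nearest)
    return scale, codes


def dequantize_block(scale, codes):
    """Recover approximate weights from a scale and its grid values."""
    return [scale * c for c in codes]


def quantize(weights):
    """Split a flat list of weights into blocks and quantize each one."""
    blocks = [weights[i:i + BLOCK_SIZE] for i in range(0, len(weights), BLOCK_SIZE)]
    return [quantize_block(block) for block in blocks]


scale, codes = quantize_block([6.0, -2.4, 0.2, 1.3])
print(scale, codes)                     # 1.0 [6.0, -2.0, 0.0, 1.5]
print(dequantize_block(scale, codes))   # [6.0, -2.0, 0.0, 1.5]
```

The round trip shows where the quality cost comes from: -2.4 comes back as -2.0 and 0.2 comes back as 0.0. Because each block gets its own scale, a single large weight only coarsens the grid for its own block rather than for the whole tensor.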

Why does it matter?

The whole point is fitting a big model into memory you actually have. Cutting weights from 16 bits to 4 shrinks the footprint enough that models which would not otherwise load suddenly do, at a quality cost small enough to accept for most work.

That is exactly how a roughly 119B-parameter Mistral model fits in the 128 GB of unified memory on the DGX Spark. At 16 bits per weight a model that large would not fit, and NVFP4 is what makes it practical to serve on a single Spark.
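The arithmetic behind that claim is easy to check. The sketch below counts weight storage only, ignoring activations, the KV cache and runtime overhead, and treats 1 GB as 10^9 bytes. The extra 0.5 bits per weight for scales assumes one 8-bit scale per 16 weights. The helper `weight_memory_gb` is a hypothetical name.

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


PARAMS = 119e9    # the roughly 119B-parameter model from the text
BUDGET_GB = 128   # unified memory on the DGX Spark

fp16_gb = weight_memory_gb(PARAMS, 16)
fp4_gb = weight_memory_gb(PARAMS, 4)
# Assumption: one 8-bit scale per 16 weights adds 0.5 bits per weight.
nvfp4_gb = weight_memory_gb(PARAMS, 4 + 8 / 16)

print(f"16-bit:            {fp16_gb:.1f} GB, fits: {fp16_gb < BUDGET_GB}")
print(f"4-bit, no scales:  {fp4_gb:.1f} GB, fits: {fp4_gb < BUDGET_GB}")
print(f"4-bit with scales: {nvfp4_gb:.1f} GB, fits: {nvfp4_gb < BUDGET_GB}")
```

Under these assumptions the 16-bit weights need about 238 GB, well over the 128 GB budget, while the 4-bit weights need roughly 60 to 67 GB, which leaves room for everything else the model needs at inference time.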
