NVFP4: NVIDIA's 4-bit float format

NVFP4 is NVIDIA's 4-bit floating-point (FP4) quantization format, which stores model weights in 4-bit floats so a large model fits in far less memory while keeping acceptable quality.

At a glance

What it is: NVIDIA's 4-bit floating-point (FP4) number format
What it does: Stores model weights in 4 bits instead of 16
Why it matters: A large model fits in limited memory at acceptable quality
Concrete example: A ~119B Mistral fits in 128 GB unified memory on the DGX Spark

How does NVFP4 work?

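NVFP4 is a 4-bit floating-point number format from NVIDIA. Instead of storing each model weight in 16 bits, it packs the same weight into 4 bits, a quarter of the size. Each 4-bit value is a tiny float (one sign bit, two exponent bits, one mantissa bit) rather than a plain integer, and each small block of weights shares a higher-precision scale factor that records the local range of that block. The shared scales add a little storage on top of the 4 bits per weight, but they are what helps preserve quality at such a low bit width.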

Quantizing to NVFP4 happens after training: you take the full-precision weights and convert them. The model then runs from the smaller 4-bit weights, with the hardware handling the format during inference.
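To make the conversion step concrete, here is a minimal sketch of block-scaled 4-bit float quantization in plain Python. It is an illustration, not NVIDIA's implementation: the names `quantize_block`, `dequantize_block`, and `quantize` are hypothetical, the block size of 16 and the value grid are assumptions based on the published NVFP4 layout, and the scale is kept as an ordinary Python float where real NVFP4 stores it in 8 bits alongside a per-tensor scale.

```python
# Magnitudes a 4-bit float with 1 sign, 2 exponent and 1 mantissa bit can hold.
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK_SIZE = 16  # assumed number of weights sharing one scale factor


def quantize_block(block):
    """Return (scale, codes) for one block of weights.

    Each code is a signed value from E2M1_VALUES; code * scale
    approximates the original weight.
    """
    max_abs = max(abs(w) for w in block)
    # Map the largest magnitude in the block onto 6.0, the top of the grid.
    scale = max_abs / 6.0 if max_abs > 0 else 1.0
    codes = []
    for w in block:
        target = abs(w) / scale
        # Round to the nearest representable magnitude.
        nearest = min(E2M1_VALUES, key=lambda v: abs(v - target))
        codes.append(nearest if w >= 0 else -nearest)
    return scale, codes


def dequantize_block(scale, codes):
    """Reconstruct approximate weights from a scale and its codes."""
    return [c * scale for c in codes]


def quantize(weights):
    """Split a flat list of weights into blocks and quantize each one."""
    return [
        quantize_block(weights[i:i + BLOCK_SIZE])
        for i in range(0, len(weights), BLOCK_SIZE)
    ]
```

Because every block gets its own scale, a block of small weights and a block of large weights both use the full 4-bit grid, which is the reason a per-block scale helps quality more than a single scale for the whole tensor would.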

Why does it matter?

The whole point is fitting a big model into memory you actually have. Cutting weights from 16 bits to 4 shrinks the footprint enough that models which would not otherwise load suddenly do, at a quality cost small enough to accept for most work.

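That is exactly how a roughly 119B-parameter Mistral fits in the 128 GB of unified memory on the DGX Spark. At 16 bits per weight, 119B parameters need about 238 GB for the weights alone, far more than the Spark has; at 4 bits they need about 60 GB plus a small overhead for scale factors. NVFP4 is what makes the model practical to serve on a single Spark.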
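The arithmetic behind that example can be checked with a few lines of Python. This is a back-of-the-envelope sketch: `weight_memory_gb` is a hypothetical helper, it counts weights only (no KV cache, activations, or layers kept at higher precision), it uses 1 GB = 10^9 bytes, and the 4.5 bits per weight figure assumes one 8-bit scale shared by every 16 weights.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Approximate weight storage in GB (10**9 bytes), weights only."""
    return num_params * bits_per_weight / 8 / 1e9


PARAMS = 119e9          # roughly 119B parameters, as in the example above
SPARK_MEMORY_GB = 128   # unified memory on the DGX Spark

fp16_gb = weight_memory_gb(PARAMS, 16)
fp4_gb = weight_memory_gb(PARAMS, 4)
# Assumed overhead: one 8-bit scale per block of 16 weights adds 0.5 bit each.
nvfp4_gb = weight_memory_gb(PARAMS, 4 + 8 / 16)

for name, gb in [("16-bit", fp16_gb), ("4-bit", fp4_gb), ("4-bit + scales", nvfp4_gb)]:
    verdict = "fits" if gb < SPARK_MEMORY_GB else "does not fit"
    print(f"{name:>15}: {gb:7.1f} GB -> {verdict} in {SPARK_MEMORY_GB} GB")
```

Under these assumptions the 16-bit weights come to about 238 GB, while the 4-bit weights with scales come to about 67 GB, leaving headroom within 128 GB for everything else the model needs at run time.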

NVFP4 (4-bit)

  • Each weight stored in 4 bits, much smaller footprint

Full-precision (16-bit)

  • Each weight stored in 16 bits, larger and heavier

Reviewed: June 2026