How does NVFP4 work?
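NVFP4 is a 4-bit floating-point number format from NVIDIA. Instead of storing each model weight in 16 bits, it stores it in 4, roughly a quarter of the size. Each 4-bit value is a tiny float (1 sign bit, 2 exponent bits, 1 mantissa bit), which on its own covers only a very narrow range. To compensate, weights are grouped into small blocks of 16 values that share a scale factor stored in FP8, so every block gets its own range. That per-block scaling is what helps preserve quality at such a low bit width, at the cost of a little extra storage for the scales.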
Quantizing to NVFP4 happens after training: you take the full-precision weights and convert them. The model then runs from the smaller 4-bit weights, with the hardware handling the format during inference.
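To make the conversion step concrete, here is a minimal sketch of block-scaled 4-bit quantization in plain Python. It assumes the commonly described NVFP4 layout: blocks of 16 values, each value stored as a 4-bit E2M1 float, with one shared scale per block. The function names are hypothetical, and the scale is kept as an ordinary Python float for readability, whereas the real format stores it in FP8 and adds a per-tensor scale on top. Treat it as an illustration of the idea, not as NVIDIA's implementation.

```python
# Magnitudes a 4-bit E2M1 float can represent (1 sign, 2 exponent, 1 mantissa bit).
# The list index equals the low 3 bits of the 4-bit code.
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK_SIZE = 16  # assumed number of weights sharing one scale


def quantize_block(block):
    """Quantize one block of floats; returns (scale, list of 4-bit codes)."""
    amax = max(abs(x) for x in block)
    # Map the largest magnitude in the block onto the largest grid value (6.0).
    # Assumption: the scale stays a Python float here instead of being rounded to FP8.
    scale = amax / E2M1_GRID[-1] if amax > 0 else 1.0
    codes = []
    for x in block:
        target = abs(x) / scale
        # Pick the nearest representable magnitude.
        idx = min(range(len(E2M1_GRID)), key=lambda i: abs(E2M1_GRID[i] - target))
        sign = 1 if x < 0 else 0
        codes.append((sign << 3) | idx)  # sign in the top bit, magnitude below
    return scale, codes


def dequantize_block(scale, codes):
    """Recover approximate floats from a scale and its 4-bit codes."""
    out = []
    for c in codes:
        mag = E2M1_GRID[c & 0b0111] * scale
        out.append(-mag if c & 0b1000 else mag)
    return out


# Hypothetical example block of 16 weights.
weights = [i * 0.1 - 0.8 for i in range(BLOCK_SIZE)]
scale, codes = quantize_block(weights)
restored = dequantize_block(scale, codes)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

Because the scale is chosen per block, a block of small weights and a block of large weights both use the full 4-bit grid, which is the reason for grouping in the first place. The widest gap in the grid is between 4 and 6, so the rounding error of any value is at most one scale unit.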
Why does it matter?
The whole point is fitting a big model into memory you actually have. Cutting weights from 16 bits to 4 shrinks the footprint enough that models which would not otherwise load suddenly do, at a quality cost small enough to accept for most work.
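That is exactly how a roughly 119B-parameter Mistral fits in the 128 GB of unified memory on the DGX Spark. At 16 bits per weight, 119B parameters need about 238 GB for the weights alone, well past what the machine has. At 4 bits they need roughly 60 GB plus a small overhead for the scales, which leaves room for everything else the model needs at runtime. NVFP4 is what makes it practical to serve on a single Spark.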
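The arithmetic behind that claim is easy to check. The sketch below estimates weight storage only, under stated assumptions: every parameter is quantized (real checkpoints often keep some layers in higher precision), one 8-bit scale is stored per block of 16 weights, and runtime memory such as the KV cache and activations is ignored. The helper name is hypothetical.

```python
def weight_footprint_gb(num_params, bits_per_weight, block_size=None, scale_bits=0):
    """Approximate weight storage in GB (1 GB = 10**9 bytes)."""
    bits = num_params * bits_per_weight
    if block_size:
        # One shared scale per block adds a small per-weight overhead.
        bits += (num_params / block_size) * scale_bits
    return bits / 8 / 1e9


PARAMS = 119e9          # roughly 119B parameters
MEMORY_GB = 128         # unified memory on the DGX Spark

fp16_gb = weight_footprint_gb(PARAMS, 16)
# Assumed layout: 4-bit values plus an 8-bit scale per 16 weights (4.5 bits per weight).
nvfp4_gb = weight_footprint_gb(PARAMS, 4, block_size=16, scale_bits=8)

print(f"16-bit weights: {fp16_gb:.1f} GB, fits: {fp16_gb < MEMORY_GB}")
print(f"NVFP4 weights:  {nvfp4_gb:.1f} GB, fits: {nvfp4_gb < MEMORY_GB}")
```

Under these assumptions the 16-bit weights come to about 238 GB and the NVFP4 weights to about 67 GB, so only the 4-bit version fits with room left over for runtime memory.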