NVFP4 Quantization Explained (For Engineers Who Skipped the Paper)
NVFP4 is a 4-bit floating-point format with a 2-bit exponent and a 1-bit mantissa, plus a sign bit. The format is designed for activation and weight storage during inference on Blackwell-class hardware, where the hardware has native FP4 multiply-accumulate support. The headline practical effect on a DGX Spark is that NVFP4-quantized models occupy roughly 1/4 the disk footprint of FP16 baselines and run noticeably faster on the FlashInfer kernel paths.
Quick Take
- Format: sign + 2-bit exponent + 1-bit mantissa = 4 bits per value, FP-class number line (range varies, precision degrades gracefully).
- Hardware support: Blackwell GPUs (including the GB10 in the DGX Spark) have native FP4 multiply-accumulate instructions. Older GPUs emulate FP4 via INT4 paths, which loses the speed advantage.
- Practical disk savings on Mistral Small 4: 119B params at FP16 ≈ 238 GB; at NVFP4 ≈ 60 GB. The Spark fits NVFP4-quantized Mistral comfortably; FP16 baseline does not.
- Throughput on Spark for Mistral NVFP4: ~35 tok/s with EAGLE, ~12-15 baseline. The format helps; speculative decoding helps more.
- The trade vs INT4: NVFP4 preserves more dynamic range than INT4 in the activation tail. This matters for vision and for certain prose qualities. INT4 is sometimes faster but loses fidelity in ways NVFP4 does not.
- The trade vs MXFP4: MXFP4 has the same bit-count but uses a different scaling group. Spark has native support for both; the choice is workload-dependent.
The 4-bit floating point question
Most engineers who have not done quantization work have a mental model of “INT4 = small and fast, FP16 = big and accurate.” The reality in 2026 is more granular. There are at least four 4-bit formats in active production use on accelerator hardware: INT4 (integer), NVFP4 (floating-point with 2-bit exponent), MXFP4 (floating-point with shared exponents per group), and FP4 generic (a placeholder format that vendors specialize in different ways).
The split between integer and floating-point at 4 bits matters because floating-point preserves dynamic range in the tails of the distribution. An INT4 quantization of a weight tensor where the tail values are far from the mean will quantize the tail values to the same bucket as the mean, losing the tail information. An FP4 quantization preserves the order-of-magnitude structure of the tail because the format has an exponent field.
For vision models (where the activation distributions have heavy tails), the FP4 advantage over INT4 is measurable in output quality. (See Mistral vs Qwen 3.6: The Zero That Was a Broken Ruler for the worked example where the choice of quantization decided whether the vision tower survived.)
What NVFP4 actually looks like
The bit layout for a single NVFP4 value is:
[ sign (1) | exponent (2) | mantissa (1) ]
Two bits of exponent give four possible exponent values (effectively, four orders of magnitude). One bit of mantissa gives two precision steps per exponent. The result is a representable number line that looks like:
... -8, -4, -2, -1, -0.5, -0.25, ... 0, ... 0.25, 0.5, 1, 2, 4, 8 ...
(Exact representable values depend on the exponent bias used by the format specification.)
The point: NVFP4 has logarithmic spacing. Small values near zero are densely represented; large values in the tails are sparsely represented. This matches the typical distribution of neural-network weights and activations, where most values cluster near zero and a small number of outlier values determine the tail.
In contrast, INT4 has linear spacing across the range. Sixteen evenly-spaced values from -8 to +7. The tail values either fit (and the center values get poor precision) or the center values fit (and the tail values get clipped). Per-tensor scaling factors mitigate this but do not solve it.
Why this matters on the Spark specifically
The Spark’s GB10 silicon has native FP4 multiply-accumulate. The kernel selection in vLLM’s FlashInfer backend can target this path directly for NVFP4-quantized weights. On Blackwell with the right kernel, FP4 multiplication runs at roughly the same throughput as FP8 multiplication, while moving 2x less data through the memory hierarchy. The arithmetic intensity is higher; the memory bandwidth bottleneck is partly relaxed.
On older NVIDIA hardware (Ampere, Hopper) the FP4 path is emulated through INT4 with software-managed scaling. The emulation works but loses most of the speed advantage. NVFP4 is a Blackwell-and-later format in practice; on earlier hardware the choice is between INT4 (native) and FP8 (native, but 2x larger).
For the Spark with Qwen 3.6 PrismaQuant and Mistral NVFP4 (FP4), the two quantization choices coexist in the same stack. PrismaQuant 4.75bit uses compressed-tensors mixed-precision (NVFP4+MXFP8+BF16 per-layer sensitivity allocation); Mistral NVFP4 is pure FP4 across the model. The per-model choice was driven by which quantization the upstream model release supports best, not by an abstract preference.
What NVFP4 preserves and what it loses
Preserved well: the rough magnitude of every weight. The model’s broad-strokes capability is intact. Conversational quality, factual recall on common topics, and basic reasoning all survive NVFP4 quantization with measurable but small degradation.
Preserved well: vision-tower activations. The Pixtral-lineage vision encoder on Mistral Small 4 survives NVFP4 quantization. This is the load-bearing observation for the dual-model stack on the Spark; without surviving vision, image-reading workloads would route to cloud.
Lost somewhat: numerical precision on edge cases. Workloads that require multi-digit numerical accuracy (financial calculations, scientific reasoning) degrade more on NVFP4 than on FP8 baselines. The degradation is bounded but real.
Lost somewhat: prose register variety. Some operators report that NVFP4-quantized models have more uniform prose register than the FP16 baseline. The effect is subtle and hard to measure cleanly. My own experience with Mistral NVFP4 on the Spark is that the prose is good but the model has stylistic tics (em-dashes, certain repetitive sentence structures) that the FP16 baseline does not have as strongly. Whether this is the quantization or the base model is hard to disentangle.
When to pick NVFP4 over the alternatives
Pick NVFP4 when:
- You are running on Blackwell or later hardware
- Your workload includes vision and you need the dynamic-range preservation
- You are optimizing for memory bandwidth, not raw FLOPS
- The upstream release ships an NVFP4 variant
Pick compressed-tensors mixed-precision (PrismaQuant, AutoRound) when:
- You are optimizing for raw single-stream throughput on language-only workloads
- The model release has invested in a sensitivity-driven calibration (INT4 dominant layers, BF16 for outliers)
- Vision is not required for this model in your stack
Pick MXFP4 when:
- The release is MXFP4-native (gpt-oss-120b, some recent Mistral variants)
- You can spare the unified memory budget for the slightly larger MXFP4 format
- Your hardware path is well-tuned for MXFP4 specifically
Pick FP8 when:
- The model is small enough that 2x larger weights still fit
- You want the precision floor that 4-bit formats cannot match
- The fine-tuning workflow benefits from FP8’s better gradient preservation
Where this fits
For the practical multi-format comparison, see Mistral vs Qwen 3.6 vs GLM-5 on a Single DGX Spark. For the broader inference-mental-model context, see The Unified-Memory Inference Mental Model. For the spelled-out architectural argument that drives quantization choice in this stack, see The Sovereign AI Stack in 2026.
The kernel-level follow-up
The follow-up article on this topic walks through the actual FlashInfer kernel selection logic that decides whether a given vLLM call uses the native NVFP4 path or falls back to a software emulation. Follow via RSS or Nostr (links in footer) to catch it when it ships.