MXFP4 (micro-scaled 4-bit floating point) is a way of storing model weights in four bits each, where every small block of weights shares a single separate scale factor. The shared scale lets a block stretch to fit its own range, recovering most of the accuracy a flat four-bit format throws away. It is an open industry standard, not a single vendor's format.
At a glance
What it is: A four-bit floating-point weight format with a per-block shared scale
Why the scale: Each block stretches to its own range, keeping more accuracy
Standard, not proprietary: An open format backed by several hardware vendors
The catch: Full speed needs kernels built for the exact GPU architecture
Comparison
Flat four-bit versus micro-scaled four-bit
| | Flat 4-bit | MXFP4 (micro-scaled) |
| --- | --- | --- |
| Bits per weight | Four | Four |
| Scale factor | One range for everything | One per small block of weights |
| Accuracy kept | Less; outliers get clipped | More; each block fits its own range |
What is MXFP4?
MXFP4 stands for micro-scaled 4-bit floating point. Each weight is stored in
just four bits, but the format adds one trick: every small block of weights,
thirty-two of them in the published standard, shares a single separate scale factor.
That scale lets each block stretch to fit its own range of values. A flat
four-bit format has to cover everything with one range and clips the outliers;
the per-block scale buys most of that lost accuracy back. The “MX” is the
micro-scaling; the “FP4” is the four-bit float.
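The block-and-scale idea can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation. It assumes the four-bit float can represent the magnitudes 0, 0.5, 1, 1.5, 2, 3, 4 and 6 (the E2M1 layout), that the shared scale is a power of two picked from the largest value in the block, and that ties round toward the smaller magnitude. All function names are made up for this example.

```python
import math

# Magnitudes a 4-bit E2M1 float can hold (1 sign, 2 exponent, 1 mantissa bit).
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def round_to_fp4(x):
    """Round x to the nearest grid value, saturating at +/-6.
    Ties go to the smaller magnitude (a simplification for this sketch)."""
    mag = min(FP4_GRID, key=lambda g: abs(g - abs(x)))
    return math.copysign(mag, x)

def power_of_two_scale(values):
    """Assumed rule: a power-of-two scale taken from the largest magnitude."""
    amax = max(abs(v) for v in values)
    if amax == 0.0:
        return 1.0
    return 2.0 ** (math.floor(math.log2(amax)) - 2)  # 2 = largest E2M1 exponent

def quantize(values, scale):
    """Quantize, then dequantize: the values the format would reconstruct."""
    return [round_to_fp4(v / scale) * scale for v in values]

def quantize_flat(values):
    """One scale for the whole tensor."""
    return quantize(values, power_of_two_scale(values))

def quantize_blockwise(values, block_size=32):
    """One scale per block of `block_size` weights."""
    out = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        out.extend(quantize(block, power_of_two_scale(block)))
    return out

def max_abs_error(original, reconstructed):
    return max(abs(a - b) for a, b in zip(original, reconstructed))

small = [0.01, -0.02, 0.015, 0.03] * 8  # one block of 32 small weights
large = [10.0, -20.0, 15.0, 30.0] * 8   # one block of 32 large weights
weights = small + large

flat = quantize_flat(weights)
blocked = quantize_blockwise(weights)

print("flat, small block:   ", flat[:4])
print("blocked, small block:", blocked[:4])
print("error on small block, flat:   ", max_abs_error(small, flat[:32]))
print("error on small block, blocked:", max_abs_error(small, blocked[:32]))
```

On this toy input the flat version rounds every small weight to zero, because its single scale is set by the large weights, while the blocked version gives the small block a scale of its own. A flat format could instead pick a scale that suits the small weights, in which case the large ones would be clipped; either way, one range cannot serve both.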
Why does MXFP4 matter on a small box?
Four-bit weights are how a large model fits into a modest memory budget, and
MXFP4 makes those four bits accurate enough to be worth using. It is an open
industry standard rather than one company’s invention, backed by several hardware
makers, so models can be trained in it directly instead of being squeezed down
afterward. The honest caveat is speed: silicon that supports a format is
not the same as fast code compiled for your exact GPU architecture. A model in
MXFP4 can still crawl on a brand-new chip until the right kernels ship, which is a
kernel problem, not a format problem.
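The memory side of this is simple arithmetic, sketched below. The sketch assumes each block of thirty-two weights carries one eight-bit shared scale, which spreads a small overhead across the block, and it counts the weights only, not activations or any runtime cache. The parameter count is hypothetical and chosen purely for illustration.

```python
def bits_per_weight(element_bits=4, scale_bits=8, block_size=32):
    """Average storage per weight once the shared scale is spread over its block."""
    return element_bits + scale_bits / block_size

def weight_gib(num_params, bits):
    """Size of the weights alone, in GiB."""
    return num_params * bits / 8 / 2**30

NUM_PARAMS = 20_000_000_000  # hypothetical model size, not a real model

for label, bits in [("16-bit", 16), ("flat 4-bit", 4), ("MXFP4", bits_per_weight())]:
    size = weight_gib(NUM_PARAMS, bits)
    print(f"{label:>10}: {bits:5.2f} bits/weight, {size:6.1f} GiB")
```

Under these assumptions the shared scale adds a quarter of a bit per weight, so the micro-scaled format costs only slightly more memory than a flat four-bit one.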
MXFP4 helps with
Fitting a large model into a small memory budget
Keeping accuracy that a flat four-bit format would lose
Running on hardware that supports the format natively
MXFP4 will not fix
A model that is still too big even at four bits
Speed when the fast kernels are not compiled for your GPU yet
Capability the model never had; quantization preserves, it does not add