FP8 (8-bit floating point) stores each number in a model, such as a weight, in eight bits rather than the usual sixteen. It keeps a sign, an exponent and a mantissa, so it covers a wide range of values at coarse precision. Halving the bits per weight roughly halves the memory the weights need and can speed up inference on hardware with FP8 math units.
At a glance
What it is: A floating-point format using eight bits per number
Why use it: Weights take about half the memory of 16-bit, often faster too
Two layouts: E4M3 (more precision) and E5M2 (more range)
The cost: Some accuracy loss; usually small for inference
Comparison
FP8 versus 16-bit weights

| | 16-bit (BF16/FP16) | FP8 |
| --- | --- | --- |
| Bits per weight | Sixteen | Eight |
| Weight memory | Baseline | About half |
| Precision | Higher | Coarser, usually fine for inference |
What is FP8?
FP8 (8-bit floating point) packs each number in a model into eight bits: one
sign bit, with the remaining seven split between an exponent and a mantissa.
Because it keeps an exponent, it still covers a wide range of magnitudes, as
16-bit formats do, just at coarser steps. The practical win is memory. A weight
that took sixteen bits now takes eight, so the weights of a model take roughly
half the room. On hardware with FP8 math units, the matrix multiplies can run
faster too.
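The halving is plain arithmetic, and a short sketch makes it concrete. The helper name `weight_memory_gib` and the 7-billion-parameter count are assumptions chosen for illustration, not figures from any particular model. The sketch counts the weights only; real checkpoints may also store scale factors or keep some layers in 16-bit, which is one reason the saving is "roughly" half.

```python
def weight_memory_gib(num_params: int, bits_per_weight: int) -> float:
    """Memory needed for the weights alone, in GiB (2**30 bytes)."""
    return num_params * bits_per_weight / 8 / 2**30


# Hypothetical parameter count, picked only to make the numbers readable.
params = 7_000_000_000

bf16 = weight_memory_gib(params, 16)
fp8 = weight_memory_gib(params, 8)

print(f"16-bit: {bf16:.1f} GiB, FP8: {fp8:.1f} GiB, ratio: {fp8 / bf16:.2f}")
# prints: 16-bit: 13.0 GiB, FP8: 6.5 GiB, ratio: 0.50
```

Activations, caches and runtime buffers are not included here, so total memory use at inference time shrinks by less than half.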
E4M3 or E5M2?
There are two common layouts and the names tell you the split. E4M3 spends four
bits on the exponent and three on the mantissa: more precision, less range.
E5M2 spends five on the exponent and two on the mantissa: more range, less
precision. Inference weights usually want E4M3 for the extra precision, while
the wider range of E5M2 matters more in training, for example for gradients. You
rarely pick by hand: the quantization tool and the serving engine choose for you.
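To see the range-versus-precision trade in numbers, the sketch below decodes every 8-bit pattern under each layout. It assumes the commonly used conventions: E4M3 with bias 7, no infinities and a single NaN pattern per sign, and E5M2 with bias 15 and IEEE-style infinities and NaNs. Other variants exist, and the function name `decode_fp8` and its `ieee_specials` flag are made up for this example.

```python
import math


def decode_fp8(byte: int, exp_bits: int, man_bits: int, ieee_specials: bool) -> float:
    """Decode one 8-bit pattern laid out as sign | exponent | mantissa."""
    assert 1 + exp_bits + man_bits == 8 and 0 <= byte <= 0xFF
    sign = -1.0 if byte >> 7 else 1.0
    exp = (byte >> man_bits) & ((1 << exp_bits) - 1)
    man = byte & ((1 << man_bits) - 1)
    bias = (1 << (exp_bits - 1)) - 1
    exp_max = (1 << exp_bits) - 1
    man_max = (1 << man_bits) - 1

    if ieee_specials and exp == exp_max:
        # Assumed E5M2 behaviour: top exponent means infinity or NaN.
        return sign * math.inf if man == 0 else math.nan
    if not ieee_specials and exp == exp_max and man == man_max:
        # Assumed E4M3 behaviour: only the all-ones pattern is NaN.
        return math.nan
    if exp == 0:
        # Subnormal: no implicit leading one.
        return sign * man * 2.0 ** (1 - bias - man_bits)
    return sign * (1 + man / (1 << man_bits)) * 2.0 ** (exp - bias)


def finite_values(exp_bits: int, man_bits: int, ieee_specials: bool) -> list:
    """All finite values the layout can represent, sorted ascending."""
    decoded = (decode_fp8(b, exp_bits, man_bits, ieee_specials) for b in range(256))
    return sorted(v for v in decoded if math.isfinite(v))


for name, e, m, ieee in [("E4M3", 4, 3, False), ("E5M2", 5, 2, True)]:
    positives = [v for v in finite_values(e, m, ieee) if v > 0]
    step_above_one = min(v for v in positives if v > 1.0) - 1.0
    print(name, "max:", max(positives),
          "smallest positive:", min(positives),
          "step just above 1.0:", step_above_one)
```

Under these assumptions E4M3 tops out at 448 with a step of 0.125 just above 1.0, while E5M2 reaches 57344 with a step of 0.25: one more mantissa bit halves the step, one more exponent bit stretches the range.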
When is FP8 the right call?
FP8 is a middle step. It gives you a clean halving of weight memory with a small
accuracy cost, while staying in floating point. If you need to shrink further,
4-bit formats go smaller at a larger accuracy risk. Either way the rule is the
same: quantize, then measure the result on your own task before you trust it.
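As a rough illustration of "quantize, then measure", the sketch below rounds a list of stand-in weights to an E4M3 grid and reports the rounding error. Several things here are assumptions: the E4M3 variant (bias 7, maximum 448, no infinities), the single per-tensor scale, the simple nearest-value rounding, and the random Gaussian weights, which stand in for a real tensor. Real tools may scale per channel or per block and round differently. Weight error is only a proxy; the check that matters is your own evaluation on your own task.

```python
import bisect
import random


def e4m3_grid() -> list:
    """Non-negative finite E4M3 values (assumed: bias 7, max 448, no infinities)."""
    grid = []
    for exp in range(16):
        for man in range(8):
            if exp == 15 and man == 7:
                continue  # reserved for NaN in the assumed variant
            if exp == 0:
                grid.append(man * 2.0 ** -9)  # subnormals, including zero
            else:
                grid.append((1 + man / 8) * 2.0 ** (exp - 7))
    return grid  # built in ascending order


GRID = e4m3_grid()


def round_to_grid(x: float) -> float:
    """Round to the nearest grid value, saturating at the largest one."""
    mag = min(abs(x), GRID[-1])
    i = bisect.bisect_left(GRID, mag)
    candidates = GRID[max(i - 1, 0): i + 1]
    nearest = min(candidates, key=lambda g: abs(g - mag))
    return nearest if x >= 0 else -nearest


def fake_quantize(weights: list) -> list:
    """Scale so the largest magnitude hits the grid maximum, round, scale back."""
    amax = max(abs(w) for w in weights)
    scale = amax / GRID[-1] if amax > 0 else 1.0
    return [round_to_grid(w / scale) * scale for w in weights]


random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(10_000)]  # stand-in data
restored = fake_quantize(weights)

rms_err = (sum((a - b) ** 2 for a, b in zip(weights, restored)) / len(weights)) ** 0.5
rms_ref = (sum(a * a for a in weights) / len(weights)) ** 0.5
print(f"relative RMS error: {rms_err / rms_ref:.4f}")
```

A small number here says the weights survived rounding, not that the model still answers well. Run the quantized model on a held-out set from your task and compare against the 16-bit baseline before relying on it.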