FP8: eight bits per weight, two layouts : Learn

FP8 (8-bit floating point) is a way to store each model number in eight bits rather than the usual sixteen. It keeps a sign, an exponent and a mantissa, so it covers a wide range of values at coarse precision. Halving the bits per weight roughly halves the memory the weights need and can speed inference on hardware with FP8 math units.

What is FP8?

FP8 (8-bit floating point) packs a model number into eight bits: one sign bit plus an exponent and a mantissa split between the rest. Because it keeps an exponent, it covers a wide range of magnitudes the way 16-bit formats do, just at coarser steps. The practical win is memory. A weight that took sixteen bits now takes eight, so the weights of a model take roughly half the room. On hardware with FP8 math units, the matrix multiplies can run faster too.

E4M3 or E5M2?

There are two common layouts and the names tell you the split. E4M3 spends four bits on the exponent and three on the mantissa: more precision, less range. E5M2 spends five on the exponent and two on the mantissa: more range, less precision. Inference weights usually want E4M3 for the extra precision, while the wider range of E5M2 shows up more in training-style work. You rarely pick by hand: the quantization tool and the serving engine choose for you.

When is FP8 the right call?

FP8 is a middle step. It gives you a clean halving of weight memory with a small accuracy cost, while staying in floating point. If you need to shrink further, 4-bit formats go smaller at a larger accuracy risk. The honest rule holds: quantize, then measure the result on your own task before you trust it.

FP8: eight bits per weight, two layouts

At a glance

FP8 versus 16-bit weights

What is FP8?

E4M3 or E5M2?

When is FP8 the right call?

Check it yourself

FP8 helps with

FP8 does not

Related terms

At a glance

FP8 versus 16-bit weights

What is FP8?

E4M3 or E5M2?

When is FP8 the right call?

Check it yourself

FP8 helps with

FP8 does not

Related terms

Go deeper