FP8: eight bits per weight, two layouts

FP8 (8-bit floating point) stores each model weight in eight bits rather than the usual sixteen. It keeps a sign bit, an exponent and a mantissa, so it covers a wide range of values at coarse precision. Halving the bits per weight roughly halves the memory the weights need, and it can speed up inference on hardware with FP8 math units.

At a glance

What it is: A floating-point format using eight bits per number
Why use it: Weights take about half the memory of 16-bit, often faster too
Two layouts: E4M3 (more precision) and E5M2 (more range)
The cost: Some accuracy loss; usually small for inference
Comparison

FP8 versus 16-bit weights

                   16-bit (BF16/FP16)    FP8
Bits per weight    Sixteen               Eight
Weight memory      Baseline              About half
Precision          Higher                Coarser, usually fine for inference

What is FP8?

FP8 (8-bit floating point) packs each model weight into eight bits: one sign bit, with the other seven split between an exponent and a mantissa. Because it keeps an exponent, it covers a wide range of magnitudes the way 16-bit formats do, just at coarser steps. The practical win is memory. A weight that took sixteen bits now takes eight, so the weights of a model take roughly half the room. On hardware with FP8 math units, the matrix multiplies can run faster too.
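The halving is plain arithmetic, and a minimal sketch makes it concrete. The 7-billion-parameter count below is a hypothetical example, not a reference to any particular model, and the helper name `weight_memory_gb` is made up for illustration. It counts the weights only: real checkpoints may carry extra scale factors or keep some layers at 16 bits, which is why the saving is "roughly" half.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Memory for the weights alone, in gigabytes (10**9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9

# Hypothetical model size, chosen only to make the numbers round.
params = 7_000_000_000

bf16_gb = weight_memory_gb(params, 16)
fp8_gb = weight_memory_gb(params, 8)

print(f"16-bit weights: {bf16_gb:.1f} GB")  # 16-bit weights: 14.0 GB
print(f"FP8 weights:    {fp8_gb:.1f} GB")   # FP8 weights:    7.0 GB
```

Activations, the KV cache and runtime overhead are not included, so the total footprint shrinks by less than half.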

E4M3 or E5M2?

There are two common layouts and the names tell you the split. E4M3 spends four bits on the exponent and three on the mantissa: more precision, less range (the largest finite value is 448 in the common finite-only variant). E5M2 spends five on the exponent and two on the mantissa: more range, less precision (the largest finite value is 57,344). Inference weights usually want E4M3 for the extra precision, while the wider range of E5M2 is more often used in training, typically for gradients. You rarely pick by hand: the quantization tool and the serving engine choose for you.
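To see the trade-off in numbers, the sketch below decodes every possible byte under each layout using only the Python standard library. It assumes the commonly used conventions: an exponent bias of 7 for E4M3 and 15 for E5M2, E5M2 reserving its top exponent for infinity and NaN, and E4M3 in its finite-only variant with a single NaN pattern per sign. The function names are illustrative, not from any library.

```python
import math

def decode_fp8(byte, exp_bits, man_bits, has_inf):
    """Decode one FP8 byte (0..255) into a Python float."""
    sign = -1.0 if byte >> 7 else 1.0
    exp = (byte >> man_bits) & ((1 << exp_bits) - 1)
    man = byte & ((1 << man_bits) - 1)
    bias = (1 << (exp_bits - 1)) - 1
    top_exp = (1 << exp_bits) - 1
    top_man = (1 << man_bits) - 1
    if has_inf and exp == top_exp:
        # E5M2 convention: top exponent encodes infinity or NaN.
        return sign * math.inf if man == 0 else math.nan
    if not has_inf and exp == top_exp and man == top_man:
        # Finite-only E4M3 convention: all ones is the only NaN.
        return math.nan
    if exp == 0:
        # Subnormal: no implicit leading one.
        return sign * man * 2.0 ** (1 - bias - man_bits)
    return sign * (1 + man / (1 << man_bits)) * 2.0 ** (exp - bias)

def finite_values(exp_bits, man_bits, has_inf):
    """All distinct finite values the layout can represent, sorted."""
    vals = {decode_fp8(b, exp_bits, man_bits, has_inf) for b in range(256)}
    return sorted(v for v in vals if math.isfinite(v))

layouts = {"E4M3": (4, 3, False), "E5M2": (5, 2, True)}
for name, layout in layouts.items():
    vals = finite_values(*layout)
    step = vals[vals.index(1.0) + 1] - 1.0
    print(f"{name}: largest finite = {vals[-1]}, step just above 1.0 = {step}")
```

Under these assumptions E4M3 tops out at 448.0 with a step of 0.125 just above 1.0, while E5M2 reaches 57344.0 with a coarser step of 0.25: the extra exponent bit buys range and the extra mantissa bit buys precision.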

When is FP8 the right call?

FP8 is a middle step. It gives you a clean halving of weight memory with a small accuracy cost, while staying in floating point. If you need to shrink further, 4-bit formats go smaller at a larger accuracy risk. The honest rule holds: quantize, then measure the result on your own task before you trust it.
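As a first sanity check before any task-level evaluation, you can measure how much a set of weights changes after a round trip through FP8. The sketch below is a rough illustration in pure Python, not how a real quantization tool works: it assumes finite-only E4M3 with a bias of 7, a single per-tensor scale, and a simple nearest-value search (ties go to the smaller magnitude here, whereas real hardware typically rounds ties to even). The random weights and every name are hypothetical. A small round-trip error does not replace measuring quality on your own task.

```python
import math
import random

def e4m3_grid():
    """All finite non-negative E4M3 values (finite-only variant), sorted."""
    vals = set()
    for byte in range(128):              # sign bit clear
        exp, man = byte >> 3, byte & 0b111
        if exp == 15 and man == 7:
            continue                     # NaN encoding, skip
        if exp == 0:
            vals.add(man * 2.0 ** -9)    # subnormals
        else:
            vals.add((1 + man / 8) * 2.0 ** (exp - 7))
    return sorted(vals)

GRID = e4m3_grid()

def round_to_e4m3(x):
    """Round x to the nearest representable value, saturating at the maximum."""
    mag = min(abs(x), GRID[-1])
    nearest = min(GRID, key=lambda g: abs(g - mag))
    return math.copysign(nearest, x)

def roundtrip_error(weights):
    """Relative L2 error after scaling into range, rounding, and scaling back."""
    scale = max(abs(w) for w in weights) / GRID[-1]   # per-tensor scale (assumption)
    restored = [round_to_e4m3(w / scale) * scale for w in weights]
    num = math.sqrt(sum((w - r) ** 2 for w, r in zip(weights, restored)))
    den = math.sqrt(sum(w * w for w in weights))
    return num / den

random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(1000)]  # made-up weights
err = roundtrip_error(weights)
print(f"relative round-trip error: {err:.4f}")
```

With three mantissa bits, each normal value moves by at most one sixteenth of its magnitude, so the aggregate error stays in the low percent range for weights like these.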

Check it yourself

python -c "import torch; print(torch.float8_e4m3fn, torch.float8_e5m2)"

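PyTorch 2.1 and later expose both FP8 layouts as real dtypes; the fn suffix on the E4M3 name marks the finite-only variant, which has no infinity encoding. E4M3 spends bits on precision, E5M2 spends them on range.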

FP8 helps with

  • Cutting weight memory to roughly half of a 16-bit model
  • Faster matrix math on hardware with FP8 units
  • Fitting a larger model or longer context in the same pool
  • Keeping a floating-point range, unlike a pure integer format

FP8 does not

  • Match 16-bit accuracy exactly; expect a small drop
  • Speed up math on hardware with no FP8 units; any gain there is memory only
  • Shrink the key-value (KV) cache when applied to weights; the cache is a separate budget
  • Replace measuring: confirm quality on your own task

Reviewed: June 2026