Learn

FP32: the full-precision number format

FP32 (32-bit Floating Point) is a number format that stores each value in 32 bits. It is the traditional full-precision format for training and computation: the most accurate of the common formats and the heaviest, since every weight takes four bytes. Local inference almost always uses a narrower format to fit the model and decode faster.

At a glance

What it is
A 32-bit floating-point number format, four bytes per value
Stands for
32-bit Floating Point
Its strength
The most precise of the common formats, full precision
Its cost
The heaviest, so large models are rarely served in it
Stack

How wide each format stores one weight

FP32 is the widest and most accurate, but every weight costs four bytes. Serving a large model means stepping down to a narrower format to fit the memory budget, the green band.

3
4-bit and 8-bit formats (INT4, FP8) smaller and faster to decode, where local serving usually lands
2
16-bit formats (FP16, BF16) half the memory, the common training and serving precision
1
FP32, full precision, 32 bits per weight most accurate, heaviest, rarely used to serve large models

What is FP32 and why is it the baseline?

FP32 (32-bit Floating Point) stores each number in 32 bits, which is four bytes. It is the traditional full-precision format: enough range and resolution that, for most model work, you treat its results as the reference. Training a model often happens at or near this precision, and when people want to know how much a smaller format costs in accuracy, they compare against FP32.

The catch is size. Four bytes per weight adds up fast. A model with tens of billions of weights in FP32 runs to hundreds of gigabytes before you have served a single token. On a box that shares one memory pool between the operating system and the model, that leaves no room for anything else. Precision you cannot fit is precision you cannot use.

Why do you rarely serve in FP32?

Because almost nothing about local inference rewards it. The narrower formats give you most of the model’s quality for a fraction of the memory, and they decode faster too, since every token has to stream the weights through the chip and fewer bytes means less to move. A 16-bit format halves the footprint; quantization down to INT4 (4-bit integer) or FP8 (8-bit floating point) cuts it much further.

So FP32 stays mostly where it earns its keep: in training, and as the honest yardstick you measure a quantized model against. When you are deciding what to run at home, the real question is how far below full precision you can drop before the model gets measurably worse, not whether you can afford FP32. You usually cannot.

FP32 (full precision)

  • Most accurate of the common formats
  • Four bytes per weight, the heaviest
  • Used in training and as a reference baseline
  • Rarely used to serve a large model locally

Narrower formats

  • Trade a little accuracy for a lot less memory
  • Two bytes (16-bit) down to half a byte (4-bit) per weight
  • Faster decode, since fewer bytes stream through the chip
  • Where local inference almost always lands

Related terms

← All terms Reviewed: June 2026