What is FP32 and why is it the baseline?
FP32 (32-bit floating point) stores each number in 32 bits, which is four bytes. It is the traditional full-precision format: enough range and resolution that, for most model work, you treat its results as the reference. Training a model often happens at or near this precision, and when people want to know how much a smaller format costs in accuracy, they compare against FP32.
The catch is size. Four bytes per weight adds up fast. A model with 70 billion weights needs about 280 GB in FP32 for the weights alone, before you have served a single token. On a box that shares one memory pool between the operating system and the model, that leaves no room for anything else. Precision you cannot fit is precision you cannot use.
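The four-bytes-per-weight arithmetic is easy to check. Here is a minimal sketch in Python; the function name `fp32_weight_gb` and the example model sizes are illustrative assumptions, and it counts the weights only, not activations, caches, or runtime overhead.

```python
# FP32 stores each weight in 32 bits, which is 4 bytes.
FP32_BYTES_PER_WEIGHT = 4


def fp32_weight_gb(num_weights: int) -> float:
    """Size of the weights alone in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_weights * FP32_BYTES_PER_WEIGHT / 1e9


if __name__ == "__main__":
    # Hypothetical model sizes, chosen only to show how the total scales.
    for billions in (7, 30, 70):
        size_gb = fp32_weight_gb(billions * 1_000_000_000)
        print(f"{billions}B weights in FP32: {size_gb:.0f} GB")
```

Compare the result against the memory you actually have free, not the total installed, since the operating system draws from the same pool.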
Why do you rarely serve in FP32?
Because almost nothing about local inference rewards it. The narrower formats give you most of the model’s quality for a fraction of the memory, and they decode faster too, since every token has to stream the weights through the chip and fewer bytes means less to move. A 16-bit format halves the footprint; FP8 (8-bit floating point) cuts it to a quarter, and quantization down to INT4 (4-bit integer) brings it to roughly an eighth.
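To put rough numbers on both effects, the sketch below compares weight footprints across formats and derives a crude ceiling on decode speed from memory bandwidth. It rests on assumptions that are not from the text: the helper names are made up for illustration, the 400 GB/s bandwidth and the 8-billion-weight model are hypothetical, INT4 is counted as exactly half a byte per weight (real quantized files also store scale factors), and the ceiling ignores compute, caches, and everything other than reading the weights once per token.

```python
# Nominal storage width of each format, in bits per weight.
BITS_PER_WEIGHT = {"FP32": 32, "FP16": 16, "FP8": 8, "INT4": 4}


def weight_bytes(num_weights: int, fmt: str) -> float:
    """Bytes needed for the weights alone in the given format."""
    return num_weights * BITS_PER_WEIGHT[fmt] / 8


def decode_ceiling_tokens_per_s(num_weights: int, fmt: str, bandwidth_gb_s: float) -> float:
    """Rough upper bound on tokens per second if every token reads all weights once."""
    return bandwidth_gb_s * 1e9 / weight_bytes(num_weights, fmt)


if __name__ == "__main__":
    num_weights = 8_000_000_000   # hypothetical model size
    bandwidth_gb_s = 400.0        # hypothetical memory bandwidth
    for fmt in BITS_PER_WEIGHT:
        size_gb = weight_bytes(num_weights, fmt) / 1e9
        ceiling = decode_ceiling_tokens_per_s(num_weights, fmt, bandwidth_gb_s)
        print(f"{fmt}: {size_gb:.0f} GB, at most ~{ceiling:.1f} tokens/s")
```

Treat the tokens-per-second figure as a ceiling to reason with, not a prediction: measured throughput will be lower.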
So FP32 stays mostly where it earns its keep: in training, and as the honest yardstick you measure a quantized model against. When you are deciding what to run at home, the real question is how far below full precision you can drop before the model gets measurably worse, not whether you can afford FP32. You usually cannot.
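The yardstick idea can be illustrated on plain numbers. The sketch below is a toy, not a model evaluation: it rounds some random stand-in "weights" to FP32, rounds those again to FP16 using Python's standard `struct` module, and reports the worst relative error against the FP32 reference. The helper names and the value range are assumptions made for the example; judging a real quantized model means comparing its outputs or benchmark scores against the FP32 version.

```python
import random
import struct


def to_fp32(x: float) -> float:
    """Round a Python float (64-bit) to the nearest FP32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest FP16 (half-precision) value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def max_relative_error(reference, candidate):
    """Largest |candidate - reference| / |reference| over all nonzero references."""
    return max(abs(c - r) / abs(r) for r, c in zip(reference, candidate) if r != 0.0)


random.seed(0)
# Stand-in weights with magnitudes in [0.01, 1.0], kept away from zero so
# every value stays in FP16's normal range.
weights_fp32 = [
    to_fp32(random.choice((-1.0, 1.0)) * random.uniform(0.01, 1.0))
    for _ in range(10_000)
]
weights_fp16 = [to_fp16(w) for w in weights_fp32]

print("max relative error, FP16 vs FP32:", max_relative_error(weights_fp32, weights_fp16))
```

A small per-weight error like this does not by itself say how much worse the model gets; it only shows the shape of the comparison, with FP32 as the reference.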