What is FP16?
FP16, the 16-bit floating-point format also called half precision, stores a number in 16 bits: 1 sign bit, 5 exponent bits, and 10 fraction bits. Like any floating-point format it divides those bits between range and precision, and FP16 leans toward precision: within the values it can represent, it pins them down quite finely for a 16-bit format, to roughly three significant decimal digits. The catch is the range. The largest finite FP16 value is 65504 and the smallest positive value is about 6e-8, far narrower than a full 32-bit float, so a calculation that strays past those limits overflows to infinity or underflows to zero.
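To make those limits concrete, here is a minimal sketch using only Python's standard library, whose struct module can pack IEEE 754 half-precision values with the "e" format code. The helper names fp16_from_bits and to_fp16 are made up for this illustration. One assumption to note: struct raises an error on out-of-range inputs, so the sketch maps that case to infinity to imitate what FP16 arithmetic on hardware typically does.

```python
import math
import struct


def fp16_from_bits(bits: int) -> float:
    """Decode a 16-bit pattern as an IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<H", bits))[0]


def to_fp16(x: float) -> float:
    """Round x to the nearest FP16 value (hypothetical helper).

    struct refuses values beyond the FP16 range, so we emulate the
    overflow-to-infinity behaviour by hand in that case.
    """
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        return math.copysign(math.inf, x)


# Limits read straight from the bit layout (1 sign, 5 exponent, 10 fraction).
FP16_MAX = fp16_from_bits(0x7BFF)            # largest finite value
FP16_MIN_NORMAL = fp16_from_bits(0x0400)     # smallest normal value, 2**-14
FP16_MIN_SUBNORMAL = fp16_from_bits(0x0001)  # smallest positive value, 2**-24

print("max finite:        ", FP16_MAX)
print("smallest normal:   ", FP16_MIN_NORMAL)
print("smallest subnormal:", FP16_MIN_SUBNORMAL)

# Too large: overflows to infinity. Too small: collapses to zero.
print("70000 ->", to_fp16(70000.0))
print("1e-8  ->", to_fp16(1e-8))

# Inside the range, values are rounded to the nearest representable step.
print("1.0001 ->", to_fp16(1.0001))
```

Within the representable range the rounding error is small, which is the precision side of the trade; the first two conversions show the range side.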
The payoff is the same as for any half-size format: a model's weights in FP16 take half the memory of the 32-bit version, which means a larger model or a longer context fits in the same space. On a lot of consumer hardware FP16 is a long-standing native format, so it is the half-precision format you are most likely to meet on a desktop card.
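The halving is plain arithmetic, sketched below. The 7-billion-parameter count is a hypothetical figure chosen for illustration, and the calculation covers weights only; activations, caches, and runtime overhead are left out.

```python
def weight_bytes(num_params: int, bits_per_param: int) -> int:
    """Bytes needed to store the weights alone at a given bit width."""
    return num_params * bits_per_param // 8


params = 7_000_000_000  # hypothetical model size, not from the text

fp32_gb = weight_bytes(params, 32) / 1e9
fp16_gb = weight_bytes(params, 16) / 1e9

print(f"FP32 weights: {fp32_gb:.1f} GB")
print(f"FP16 weights: {fp16_gb:.1f} GB")
```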
When does the range problem bite?
It bites when values get extreme, which happens more in training than in plain inference, but it can surface anywhere a sum grows large or a gradient grows tiny. That weakness is exactly why BF16, the 16-bit Brain Floating Point format, was designed: it keeps the 8 exponent bits, and so the wide range, of a 32-bit float and cuts the fraction to 7 bits to fit in 16, the mirror image of FP16’s choice. For serving large models BF16’s range usually wins, so newer stacks tend to default to it. FP16 is still widely supported and perfectly usable, especially on hardware where it is the native path, but if you see overflow warnings or unexpected infinities on a numerically rough workload, the format’s narrow range is the first suspect.
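The mirror-image trade can be seen side by side in a small standard-library sketch. Python has no built-in BF16 type, so the to_bf16 helper below is an emulation written for this illustration: it takes the 32-bit float encoding and keeps only the top 16 bits, rounding to nearest even. The helper names are hypothetical, and NaN and values beyond the 32-bit float range are deliberately not handled.

```python
import math
import struct


def to_fp16(x: float) -> float:
    """Round x to the nearest FP16 value, emulating overflow to infinity."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        return math.copysign(math.inf, x)


def to_bf16(x: float) -> float:
    """Emulate BF16 by keeping the top 16 bits of the float32 encoding.

    Rounds to nearest, ties to even. Assumes x is finite and fits in a
    32-bit float; NaN handling is omitted to keep the sketch short.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # round-to-nearest-even bias
    bits &= 0xFFFF0000                   # drop the low 16 fraction bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


for value in (70000.0, 1e-8, 1.001):
    print(f"{value!r:>10}  FP16 -> {to_fp16(value)!r:<22} BF16 -> {to_bf16(value)!r}")
```

The large and tiny inputs survive in BF16 but not in FP16, while the input near 1 keeps more of its fraction in FP16 than in BF16.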