What makes NF4 different from a plain 4-bit number?
NF4 (4-bit NormalFloat) stores each weight in 4 bits, so only sixteen distinct values are available to represent it. The clever part is where those sixteen levels sit. Trained model weights are not spread evenly; they cluster around zero in a roughly bell-shaped distribution. A naive 4-bit format that spaces its levels evenly spends many of them on ranges that hold few weights. NF4 instead derives its levels from the quantiles of a normal distribution, so more of the sixteen slots land where the weights actually are, and the rounding error is typically smaller for the same 4 bits.
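To make the level placement concrete, here is a minimal sketch that builds two sets of sixteen levels on [-1, 1]: one evenly spaced, one taken from equal-probability quantiles of a standard normal. The quantile construction is a simplified stand-in chosen for illustration, not the actual NF4 table (the real code book is built somewhat differently and, for instance, includes an exact zero). The function names and the 0.5 cutoff are assumptions made for this example.

```python
from statistics import NormalDist


def uniform_levels(k=16):
    """Return k evenly spaced levels covering [-1, 1]."""
    return [-1 + 2 * i / (k - 1) for i in range(k)]


def quantile_levels(k=16):
    """Return k levels at equal-probability quantiles of N(0, 1), rescaled to [-1, 1].

    Simplified illustration only; this is NOT the real NF4 code book.
    """
    nd = NormalDist()
    z = [nd.inv_cdf((i + 0.5) / k) for i in range(k)]  # ascending order
    top = max(abs(v) for v in z)
    return [v / top for v in z]


def count_within(levels, radius):
    """Count how many levels fall inside [-radius, radius]."""
    return sum(1 for v in levels if abs(v) <= radius)


def gaps(levels):
    """Distances between neighbouring levels (levels must be sorted)."""
    return [b - a for a, b in zip(levels, levels[1:])]


uniform = uniform_levels()
shaped = quantile_levels()

# Levels in the central half of the range: 8 uniform vs 10 shaped.
print("levels with |x| <= 0.5:",
      count_within(uniform, 0.5), "uniform vs",
      count_within(shaped, 0.5), "shaped")

# The shaped levels are tighter than uniform near zero and looser in the tails.
print("uniform gap:         ", round(gaps(uniform)[0], 4))
print("shaped gap near zero:", round(min(gaps(shaped)), 4))
print("shaped gap in tails: ", round(max(gaps(shaped)), 4))
```

The shaped set gives up resolution in the tails, where few weights live, to buy finer steps near zero, where most of them are.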
Why is it tied to fine-tuning?
NF4 became well known through QLoRA, where it serves as the storage format for the frozen base model in quantized fine-tuning. The idea: load a large model with its weights squeezed into NF4, keep them locked, and train only a small set of extra parameters on top, a low-rank adapter (LoRA). The base never changes, so storing it in 4 bits is fine, and the memory you save is what lets a model that would not otherwise fit be tuned on modest hardware. The adapter trains in higher precision, but it is tiny compared with the base, so it adds little to the memory bill.
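A rough back-of-the-envelope calculation shows why this split works. The sketch below compares the storage for one weight matrix at 16 bits and at 4 bits, and the storage for a low-rank adapter kept at 16 bits. The layer shape (4096 by 4096) and the rank (16) are hypothetical values picked for illustration, and the estimate deliberately ignores the per-block scaling constants a 4-bit format also stores, as well as optimizer state and activations.

```python
def bytes_for(num_params, bits_per_param):
    """Storage in bytes for num_params values at the given bit width."""
    return num_params * bits_per_param // 8


# Hypothetical sizes, chosen only for illustration.
d_in, d_out = 4096, 4096
rank = 16

base_params = d_in * d_out               # frozen base matrix
adapter_params = rank * (d_in + d_out)   # two low-rank factors: d_in x rank and rank x d_out

base_16bit = bytes_for(base_params, 16)
base_4bit = bytes_for(base_params, 4)
adapter_16bit = bytes_for(adapter_params, 16)

MIB = 1024 * 1024
print(f"base at 16 bits:    {base_16bit / MIB:.2f} MiB")     # 32.00 MiB
print(f"base at 4 bits:     {base_4bit / MIB:.2f} MiB")      # 8.00 MiB
print(f"adapter at 16 bits: {adapter_16bit / MIB:.2f} MiB")  # 0.25 MiB
print(f"adapter share of parameters: {adapter_params / base_params:.2%}")
```

Under these assumptions the frozen matrix shrinks to a quarter of its 16-bit size, while the trainable adapter holds under one percent as many parameters as the base, which is why keeping it in higher precision costs so little.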
The trade is the one every quantization makes: fewer bits per weight means less precision, accepted because the memory saving is what makes the work possible at all. NF4’s shaping is an attempt to lose as little as possible at that bit count.