What does a GGUF file actually contain?
A GGUF file is a container. Inside it are the model’s weights, almost always stored in a quantized form (a reduced-precision encoding that trades a little accuracy for a lot less size), plus the metadata a runtime needs: the tokenizer, the layer shapes, the architecture name, and assorted settings. The point is that everything travels together. You copy one file, point a runtime at it, and it runs. There is no separate config to hunt down and no folder of loose tensors to keep in sync.
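To make the "one container" idea concrete, here is a minimal sketch that reads the fixed-size header at the very start of a GGUF file. It assumes the little-endian layout used by GGUF version 2 and later: the four magic bytes `GGUF`, a 32-bit version, a 64-bit tensor count, and a 64-bit count of metadata key/value pairs. The function name `read_gguf_header` is made up for this illustration, and the sketch stops at the header; it does not decode the metadata or the tensors that follow.

```python
import struct

GGUF_MAGIC = b"GGUF"   # first four bytes of every GGUF file
HEADER_SIZE = 24       # 4 (magic) + 4 (version) + 8 (tensors) + 8 (metadata pairs)


def read_gguf_header(data: bytes) -> dict:
    """Parse the fixed header of a GGUF file (assumes version >= 2, little-endian)."""
    if len(data) < HEADER_SIZE:
        raise ValueError("too short to hold a GGUF header")
    if data[:4] != GGUF_MAGIC:
        raise ValueError("not a GGUF file: bad magic bytes")
    # "<IQQ" = little-endian uint32, uint64, uint64, read starting after the magic.
    version, tensor_count, kv_count = struct.unpack_from("<IQQ", data, 4)
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": kv_count,
    }


if __name__ == "__main__":
    # Synthetic header for demonstration only; the numbers are invented.
    fake = GGUF_MAGIC + struct.pack("<IQQ", 3, 291, 24)
    print(read_gguf_header(fake))
```

On a real file you would pass in the first 24 bytes, for example `read_gguf_header(open(path, "rb").read(24))`, where `path` points at whatever model file you have. Everything a runtime needs beyond those counts (tokenizer, architecture name, settings) lives in the metadata pairs that come right after this header.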
The format grew up around llama.cpp, the project that runs models on ordinary CPUs and modest GPUs. That lineage is why GGUF shows up wherever people want a model to run without a heavy server stack behind it.
When is GGUF the right choice, and when is it not?
Reach for GGUF when you want a model that just runs: on a laptop, on a CPU, on a small GPU, offline, with a single binary. The same model is usually published at several precision levels, so you pick how much accuracy to trade for size and speed. That flexibility is the whole appeal.
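The size side of that trade is simple arithmetic: parameters times bits per weight, divided by eight. The sketch below is a rough estimate under stated assumptions, not a prediction of any published file. The helper `estimate_size_gb` and the 7-billion-parameter model are hypothetical. The bits-per-weight figures come from the block layouts of three llama.cpp tensor types (Q8_0 and Q4_0 each store 32 weights plus one 16-bit scale per block). Real files mix tensor types and carry metadata, so actual sizes will differ somewhat.

```python
# Bits per weight, derived from each type's block layout.
BITS_PER_WEIGHT = {
    "F16": 16.0,   # plain 16-bit floats, no quantization
    "Q8_0": 8.5,   # (32 * 8 + 16) bits per block of 32 weights
    "Q4_0": 4.5,   # (32 * 4 + 16) bits per block of 32 weights
}


def estimate_size_gb(n_params: float, quant: str) -> float:
    """Rough weight size in gigabytes (1e9 bytes), ignoring metadata and mixed types."""
    bits = BITS_PER_WEIGHT[quant]
    return n_params * bits / 8 / 1e9


if __name__ == "__main__":
    n_params = 7e9  # hypothetical 7-billion-parameter model
    for quant in BITS_PER_WEIGHT:
        print(f"{quant}: ~{estimate_size_gb(n_params, quant):.1f} GB")
```

For the hypothetical 7-billion-parameter model this works out to roughly 14 GB at F16, about 7.4 GB at Q8_0, and about 3.9 GB at Q4_0, which is the difference between needing a workstation and fitting on a laptop.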
It is not the format for every job. Servers built for high throughput on large GPUs, like vLLM or SGLang, are designed around other checkpoint formats; where they accept a GGUF file at all, that path tends to be experimental and not the one they are tuned for. And GGUF is an inference format: you run models from it, you do not normally train or fine-tune them in it. Match the file to the runtime you actually plan to use, not the other way round.