
GGUF: one file that holds a quantized model

GGUF is a binary container format for language models, used by llama.cpp and the tools built on it. A single GGUF file holds the model's weights, usually in a quantized (reduced-precision) form, alongside the metadata a runtime needs to load and run it. One file: copy it, run it.

At a glance

What it is: A single-file container for a model and its metadata
Who reads it: llama.cpp and the runtimes built on top of it
Why it spread: One portable file, usually quantized, easy to copy and run
What it is not: Not a model architecture and not a quantization method itself

What does a GGUF file actually contain?

A GGUF file is a container. Inside it are the model’s weights, almost always stored in a quantized form (a reduced-precision encoding that trades a little accuracy for a lot less size), plus the metadata a runtime needs: the architecture name, the tokenizer, the name, shape, and data type of every tensor, and hyperparameters such as the context length. The point is that everything travels together. You copy one file, point a runtime at it, and it runs. There is no separate config to hunt down and no folder of loose tensors to keep in sync.
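
To make the container idea concrete, here is a minimal sketch that reads only the fixed-size header at the very start of a file. It assumes the little-endian layout used from GGUF version 2 onward: the four magic bytes `GGUF`, a 32-bit version, a 64-bit tensor count, and a 64-bit count of metadata key-value pairs. The function names are illustrative and are not part of llama.cpp or any library.

```python
import struct

GGUF_MAGIC = b"GGUF"
HEADER_SIZE = 24  # 4 magic bytes + uint32 version + two uint64 counts


def read_gguf_header(data: bytes) -> dict:
    """Parse the fixed header from the first 24 bytes of a GGUF file.

    Assumes a little-endian file in the version 2+ layout.
    """
    if len(data) < HEADER_SIZE:
        raise ValueError("need at least 24 bytes to read the header")
    if data[:4] != GGUF_MAGIC:
        raise ValueError("magic bytes are not 'GGUF'")
    # "<" = little-endian, no padding; I = uint32, Q = uint64
    version, tensor_count, kv_count = struct.unpack_from("<IQQ", data, 4)
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": kv_count,
    }


def read_gguf_header_from_file(path: str) -> dict:
    """Convenience wrapper: read just the header bytes from a file on disk."""
    with open(path, "rb") as f:
        return read_gguf_header(f.read(HEADER_SIZE))
```

Calling `read_gguf_header_from_file` on a model file (the path is whatever file you have locally) returns the version and the two counts. The metadata key-value pairs, the tensor descriptions, and the weight data follow the header; parsing them is beyond this sketch.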

The format grew up around llama.cpp, a project that runs models on ordinary CPUs and modest graphics cards. That lineage is why GGUF shows up wherever people want a model to run without a heavy server stack behind it.

When is GGUF the right choice, and when is it not?

Reach for GGUF when you want a model that just runs: on a laptop, on a CPU, on a small GPU, offline, with a single binary. The same model is usually published at several precision levels, so you pick how much accuracy to trade for size and speed. That flexibility is the whole appeal.
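
The size side of that trade can be estimated with simple arithmetic. The sketch below assumes every weight is stored in one block format, using the block layouts of two llama.cpp quantization types: Q4_0 packs 32 weights into 18 bytes (4.5 bits per weight) and Q8_0 packs 32 weights into 34 bytes (8.5 bits per weight). The 7-billion-parameter count is a hypothetical example, and real files differ somewhat because they also carry metadata and often mix tensor types.

```python
WEIGHTS_PER_BLOCK = 32

# Bytes used to store one block of 32 weights in each format.
BYTES_PER_BLOCK = {
    "F16": 64,   # 32 weights x 2 bytes, no quantization
    "Q8_0": 34,  # 32 one-byte values + one 2-byte scale
    "Q4_0": 18,  # 32 four-bit values (16 bytes) + one 2-byte scale
}


def estimated_weight_bytes(n_params: int, fmt: str) -> int:
    """Rough size of the weight data alone, ignoring metadata."""
    # Round up so a partial final block still counts as a full block.
    blocks = -(-n_params // WEIGHTS_PER_BLOCK)
    return blocks * BYTES_PER_BLOCK[fmt]


def bits_per_weight(fmt: str) -> float:
    """Average storage cost per weight, including the per-block scale."""
    return BYTES_PER_BLOCK[fmt] * 8 / WEIGHTS_PER_BLOCK


if __name__ == "__main__":
    n_params = 7_000_000_000  # hypothetical model size
    for fmt in BYTES_PER_BLOCK:
        size_gb = estimated_weight_bytes(n_params, fmt) / 1e9
        print(f"{fmt}: {bits_per_weight(fmt)} bits/weight, about {size_gb:.2f} GB")
```

Under these assumptions the same model shrinks from roughly 14 GB at 16-bit precision to under 4 GB at Q4_0, which is often the difference between fitting in a laptop's memory and not.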

It is not the format for every job. Servers built for high throughput on large GPUs, like vLLM or SGLang, are designed around other checkpoint formats; where they can load a GGUF file at all, that support tends to be limited or experimental rather than the optimized path. And GGUF is an inference format: you run models from it, you do not train or fine-tune them in it. Match the file to the runtime you actually plan to use, not the other way round.

GGUF is good for

  • Running a model on a CPU or a modest GPU with llama.cpp
  • Handing someone one file that just runs
  • Picking a precision level per file, from light to near-lossless
  • Offline, no-dependency, single-binary deployment

GGUF is poor for

  • Maximum throughput on a big GPU, where other servers win
  • Runtimes built around other formats, like vLLM or SGLang, where GGUF support is limited or experimental
  • Training or fine-tuning; it is an inference format
  • Editing weights in place; it is built to be loaded, not mutated

Reviewed: June 2026