
Distillation: training a small model to mimic a big one

Distillation is a training technique in which a small student model is trained to mimic a larger teacher model, learning from the teacher's outputs rather than from raw data alone. The result is a smaller, cheaper model that keeps much of the larger model's behaviour and can run on hardware the original would never fit on.

At a glance

What it is: Training a small student model to copy a larger teacher model
Why it matters: Much of the quality at a size you can actually run
What you trade: Some capability for a far smaller, faster model
Common giveaway: A model name with a parent's name plus 'distilled' in it
Flow

How distillation moves capability downhill

A large, hard-to-run teacher hands its behaviour to a small student. The student is the thing you actually serve: smaller, faster, and runnable on your own hardware.

1. Teacher (large model): high quality, too big to run comfortably
2. Distillation (student learns from teacher's outputs): the student is trained to match the teacher
3. Student (small model you run): most of the behaviour, a fraction of the size

How does distillation work?

Train a big model and you get something capable but heavy, often too heavy to run on a single box at home. Distillation moves that capability into a smaller model you can actually serve.
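To make "too heavy" concrete, here is a rough back-of-the-envelope sketch of how much memory the weights alone need. The parameter counts (70 billion for a teacher, 8 billion for a student) and the function name are assumptions chosen for illustration, not figures from any particular model, and the estimate ignores activations, the KV cache, and runtime overhead.

```python
def weight_footprint_gib(num_params: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in GiB.

    Ignores activations, KV cache, and runtime overhead.
    """
    bytes_total = num_params * bits_per_weight / 8
    return bytes_total / 2**30


# Hypothetical sizes, for illustration only.
models = {"teacher": 70e9, "student": 8e9}

for name, params in models.items():
    for bits in (16, 4):  # 16-bit weights vs. a 4-bit quantized copy
        gib = weight_footprint_gib(params, bits)
        print(f"{name:8s} {bits:2d}-bit weights: {gib:6.1f} GiB")
```

Under these assumptions the student is the only one of the two whose 16-bit weights fit comfortably in the memory of a single consumer GPU.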

The recipe: take the large model as a teacher, and train a smaller student to reproduce the teacher’s outputs. Instead of learning only from raw data, the student learns from how the teacher responds, for example from the probabilities the teacher assigns to every candidate next token rather than from a single correct answer. That is a richer signal than the data alone. The student ends up smaller and faster while keeping a good share of the teacher’s behaviour. It is not magic and it is not lossless. You trade some capability for a model that fits.
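One common way to express "reproduce the teacher's outputs" as a training objective is a divergence between the teacher's and the student's next-token distributions, softened by a temperature. The sketch below is a minimal standard-library version of that idea, not the recipe of any specific model: the function names, the temperature of 2.0, and the toy logits are all assumptions for illustration. A real training loop would compute this in a tensor framework and backpropagate through the student.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax of temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    log_total = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - log_total for x in scaled]


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions.

    Scaled by temperature**2, a common convention that keeps gradient
    magnitudes comparable across temperatures.
    """
    log_p = log_softmax(teacher_logits, temperature)  # teacher
    log_q = log_softmax(student_logits, temperature)  # student
    kl = sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
    return kl * temperature**2


# Toy logits over a three-token vocabulary (made up for illustration).
teacher = [4.0, 2.0, 0.5]
close_student = [3.5, 2.2, 0.4]  # roughly agrees with the teacher
far_student = [0.5, 2.0, 4.0]    # ranks the tokens the other way round

print("close student loss:", distillation_loss(teacher, close_student))
print("far student loss:  ", distillation_loss(teacher, far_student))
```

The loss is zero when the student's distribution matches the teacher's exactly and grows as they diverge, which is the sense in which the student is "trained to match the teacher".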

Where does it show up for an operator?

In two places you will meet often. First, the models you run. Many of the small, runnable models in a given weight class are distilled from a larger parent, which is why a name sometimes carries the parent’s name plus the word "distilled". That is your hint about where the behaviour came from.

Second, speculative decoding. The small draft model there is sometimes a distilled version of the main model, trained specifically to mimic its token distribution so that more of its guesses are accepted. Same idea, narrower job. Either way, the value is the same: capability that started in a model you could not run, packed into one you can. Quantize the student on top and the footprint drops again.
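To show why a draft that mimics the main model matters, here is a simplified sketch of one round of greedy speculative decoding. It is an illustration under stated assumptions: `target` and `draft` are hypothetical toy functions standing in for real models, tokens are plain integers, and acceptance is an exact-match check. Production systems that sample use a probabilistic acceptance rule instead, and verify all drafted positions in a single batched pass of the main model.

```python
from typing import Callable, List, Sequence, Tuple

NextToken = Callable[[Sequence[int]], int]


def speculative_step(
    target: NextToken, draft: NextToken, context: List[int], k: int = 4
) -> Tuple[List[int], int]:
    """One round of greedy speculative decoding.

    Returns (new_tokens, number_of_draft_tokens_accepted).
    """
    # 1. The cheap draft model proposes k tokens one after another.
    proposed: List[int] = []
    for _ in range(k):
        proposed.append(draft(context + proposed))

    # 2. The main model checks each proposal in order.
    accepted: List[int] = []
    for tok in proposed:
        expected = target(context + accepted)
        if tok != expected:
            accepted.append(expected)  # keep the main model's token, stop
            return accepted, len(accepted) - 1
        accepted.append(tok)

    # 3. Every proposal matched, so the main model adds one more token.
    accepted.append(target(context + accepted))
    return accepted, k


# Hypothetical toy "models": the draft agrees with the target except
# when the previous token is 5.
def target(ctx: Sequence[int]) -> int:
    return (ctx[-1] * 3 + 1) % 7


def draft(ctx: Sequence[int]) -> int:
    return 0 if ctx[-1] == 5 else target(ctx)


tokens, n_accepted = speculative_step(target, draft, [1], k=4)
print(tokens, n_accepted)
```

The output is always what the main model would have produced on its own; the draft only changes how many tokens each round yields. The better the draft mimics the main model, the more proposals are accepted per round.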

Distilled small model

  • Trained to mimic the larger model's outputs
  • Small enough to run on modest hardware
  • Faster decode and lower memory
  • Keeps much, not all, of the behaviour

Original large model

  • Trained on the raw corpus directly
  • Often too big to run on one local box
  • Slower and heavier to serve
  • The full capability the student copies from


Reviewed: June 2026