Distillation: training a small model to mimic a big one

Distillation is a training technique where a small student model is trained to mimic a larger teacher model, learning from the teacher's outputs rather than from raw data alone. The result is a smaller, cheaper model that keeps much of the larger model's behaviour, which is what lets it run on hardware the original would never fit on.

How does distillation work?

Train a big model and you get something capable but heavy. Often too heavy to run on a single box at home. Distillation moves that capability into a smaller model you can actually serve.

The recipe: take the large model as the teacher and train a smaller student to reproduce the teacher’s outputs. Instead of learning only from raw data, the student learns from how the teacher responds. That is a richer signal than the data alone, because the teacher’s output shows how it weighs the alternatives, not just which answer is correct. The student ends up smaller and faster while keeping a good share of the teacher’s behaviour. It is not magic and it is not lossless. You trade some capability for a model that fits.
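To make "learning from how the teacher responds" concrete, here is a minimal sketch of one common formulation: the student is penalised by the KL divergence between the teacher's and the student's temperature-softened output distributions. The text above does not say which loss is used, so treat this as an illustrative assumption. The function names, the logits, and the temperature of 2.0 are all hypothetical, and many language-model distillations train on teacher-generated text instead.

```python
import math

def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; a higher temperature flattens them."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) over temperature-softened distributions.

    Assumption: this soft-label loss stands in for whatever objective a
    real distillation run uses; it is a sketch, not a specific recipe.
    """
    p = softmax(teacher_logits, temperature)  # teacher's view of the options
    q = softmax(student_logits, temperature)  # student's view of the options
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical logits over three candidate outputs.
teacher = [4.0, 2.5, 0.5]
matching_student = [4.0, 2.5, 0.5]   # agrees with the teacher everywhere
confused_student = [0.5, 2.5, 4.0]   # ranks the options the other way round

print(distillation_loss(matching_student, teacher))
print(distillation_loss(confused_student, teacher))
```

The loss is zero when the student reproduces the teacher's distribution and grows as the two diverge. A raw label would only say that the first option is correct. The teacher's distribution also says the second option is far more plausible than the third, which is the extra signal the student gets to learn from.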

Where does it show up for an operator?

In two places you will meet often. First, the models you run. Many of the small, runnable models in a given weight class are distilled from a larger parent, which is why a name sometimes carries the parent’s name plus the word "distilled". That is your hint about where the behaviour came from.

Second, speculative decoding. The small draft model there is often a distilled version of the main model, trained specifically to mimic its token distribution so its guesses are usually right. Same idea, narrower job. Either way, the value is the same: capability that started in a model you could not run, packed into one you can. Quantize the student on top and the footprint drops again.
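Why the draft model's guesses need to be "usually right" is easier to see in a toy. The sketch below shows one round of the greedy variant of speculative decoding: the draft proposes a few tokens, the main model checks them, and the first wrong guess is replaced by the main model's own token. The two "models" are hypothetical stand-in functions over a fixed sentence, not real networks. A real system verifies all drafted positions in a single batched pass and usually works with probabilities rather than exact matches.

```python
# Hypothetical toy "models": each maps the tokens so far to the next token.
SENTENCE = ["the", "cat", "sat", "on", "the", "mat"]

def target_next(tokens):
    """Stand-in for the large main model: always gives the reference token."""
    return SENTENCE[len(tokens)]

def draft_next(tokens):
    """Stand-in for the small draft model: agrees except at position 3."""
    return "in" if len(tokens) == 3 else SENTENCE[len(tokens)]

def speculative_step(prefix, draft, target, k=4):
    """One round of greedy speculative decoding.

    Returns (tokens_to_append, number_of_drafted_tokens_accepted).
    """
    # 1. The draft model cheaply guesses k tokens, one after another.
    drafted = []
    for _ in range(k):
        drafted.append(draft(prefix + drafted))

    # 2. The target model checks each guess in order.
    accepted = []
    for tok in drafted:
        expected = target(prefix + accepted)
        if tok != expected:
            # First wrong guess: keep the target's token and stop the round.
            accepted.append(expected)
            return accepted, len(accepted) - 1
        accepted.append(tok)
    return accepted, len(accepted)

tokens, n_accepted = speculative_step([], draft_next, target_next, k=4)
print(tokens, n_accepted)
```

Here the round yields `['the', 'cat', 'sat', 'on']` with 3 of the 4 drafted tokens accepted. The output always matches what the main model would have produced on its own, so the draft only affects speed: the closer it mimics the main model, the more tokens each round accepts.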

[Figure: How distillation moves capability downhill. Labels: original large model, distilled small model.]
