How does distillation work?
Train a big model and you get something capable but heavy. Often too heavy to run on a single box at home. Distillation moves that capability into a smaller model you can actually serve.
The recipe: take the large model as a teacher, and train a smaller student to reproduce the teacher’s outputs. Instead of learning only from raw data, the student learns from how the teacher responds: either the text it generates or its full probability distribution over possible next tokens. That is a richer signal than the data alone, because it shows which alternatives the teacher considered plausible, not just which answer was correct. The student ends up smaller and faster while keeping a good share of the teacher’s behaviour. It is not magic and it is not lossless. You trade some capability for a model that fits.
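To make the recipe concrete, here is a minimal sketch of one common form of the training signal: the student is penalised for how far its softened output distribution sits from the teacher’s. The function names, the temperature value and the example logits are all assumptions chosen for illustration, not taken from any particular framework, and real training code would run this over batches on a GPU.

```python
import math

def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; a higher temperature flattens them."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL divergence from teacher to student on temperature-softened outputs."""
    p = softmax(teacher_logits, temperature)  # what the teacher believes
    q = softmax(student_logits, temperature)  # what the student believes
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    # Multiplying by T^2 is a common convention that keeps the gradient
    # scale comparable across temperatures.
    return kl * temperature ** 2

# Made-up logits over a three-token vocabulary (illustrative only).
teacher = [4.0, 2.0, 0.5]
near = distillation_loss(teacher, [3.8, 2.1, 0.4])  # student mostly agrees
far = distillation_loss(teacher, [0.5, 2.0, 4.0])   # student disagrees
print(near < far)
```

The loss is zero when the student matches the teacher exactly and grows as the two distributions drift apart. In practice this term is often mixed with the ordinary loss on the raw data.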
Where does it show up for an operator?
In two places you will run into often. First, the models you run. Many of the small, runnable models in a given weight class are distilled from a larger parent, which is why a name sometimes carries the parent’s name plus the word distilled. That is your hint about where the behaviour came from.
Second, speculative decoding. The small draft model there is often a distilled version of the main model, trained specifically to mimic its token distribution so its guesses are usually right and the main model only has to check them. Same idea, narrower job. Either way, the value is the same: capability that started in a model you could not run, packed into one you can. Quantize the student on top and the footprint drops again.
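Why the draft model’s accuracy matters in speculative decoding is easier to see in code. The sketch below shows a simplified, greedy version of the check: drafted tokens are kept for as long as the main model would have picked the same token, and the first disagreement is replaced by the main model’s choice. The names accept_draft and toy_target are hypothetical, the "main model" is a lookup into a fixed sentence, and a real system verifies all drafted positions in a single forward pass rather than one call per token.

```python
def accept_draft(draft_tokens, target_next_token, prefix):
    """Keep drafted tokens while the main model agrees with them.

    target_next_token is a hypothetical callable: given the tokens so far,
    it returns the token the main model would choose next.
    """
    accepted = []
    context = list(prefix)
    for token in draft_tokens:
        expected = target_next_token(context)
        if token != expected:
            accepted.append(expected)  # take the main model's token and stop
            break
        accepted.append(token)
        context.append(token)
    return accepted

# Toy stand-in for the main model: it always continues one fixed sentence.
SENTENCE = ["the", "cat", "sat", "on", "the", "mat"]

def toy_target(context):
    return SENTENCE[len(context)]

# The draft gets two tokens right, then guesses "in" where the main model says "on".
print(accept_draft(["cat", "sat", "in"], toy_target, ["the"]))
# prints ['cat', 'sat', 'on']
```

Every drafted token that survives the check is a token the main model did not have to generate on its own, which is why a draft model that mimics the main model closely pays off.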