What is inference?
A model goes through two very different phases in its life. First it is trained: a slow, expensive process, run once, that adjusts its internal numbers (the weights) until it behaves the way its makers wanted. After that, the weights are fixed. Inference is everything that comes next: you hand the trained model an input, it runs a forward pass over those fixed weights, and it gives you an output. Every prompt you send to a self-hosted model is one act of inference.
The short version: training makes the model, inference uses it. When you download a model and run it, you are downloading the result of someone else’s training and doing your own inference on top of it.
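To make the split concrete, here is a minimal sketch of inference as a forward pass over fixed weights. The toy model (a single linear layer) and the names WEIGHTS, BIAS, and forward are hypothetical, chosen only for illustration. A real model has far more layers and parameters, but the principle is the same: the weights are read, never written.

```python
# Hypothetical toy "model": one linear layer whose weights were set by training.
WEIGHTS = [[0.5, -1.0],
           [2.0, 0.25]]
BIAS = [0.1, -0.2]


def forward(x):
    """One forward pass, y = W x + b. Reads the weights; never modifies them."""
    return [
        sum(w * xi for w, xi in zip(row, x)) + b
        for row, b in zip(WEIGHTS, BIAS)
    ]


# Snapshot the weights, run inference, then confirm nothing was learned.
weights_before = [row[:] for row in WEIGHTS]
output = forward([1.0, 2.0])
weights_unchanged = WEIGHTS == weights_before

print(output)
print("weights unchanged:", weights_unchanged)
```

Calling forward again with a different input gives a different output, but the weights stay exactly as training left them.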
Why does inference speed vary so much?
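Inference does no learning, but it is far from free. A large language model produces its output one token at a time, and each of those steps is dominated by reading the weights and the running context out of memory, so memory bandwidth often matters more than raw compute. Model size sets how much there is to read at each step. Batch size and context length set how much extra work rides along with it: a larger batch shares one read of the weights across several requests, and a longer context means more stored state to read back.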
That is why two setups running the same model can report very different speeds: they may differ in memory bandwidth, batch size, or context length. It is also why quantizing a model to a smaller weight format can make inference faster: there is simply less to move.
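A rough back-of-envelope calculation shows how far memory bandwidth alone can move the numbers. The sketch below assumes every generated token requires reading all the weights once, and it ignores the running context, batching, and compute time, so the results are upper bounds rather than predictions. The model size (7 billion parameters) and the two bandwidth figures are hypothetical values picked for illustration, as are the function names.

```python
def weight_bytes(n_params, bits_per_weight):
    """Total size of the weights in bytes."""
    return n_params * bits_per_weight / 8


def max_tokens_per_second(n_params, bits_per_weight, bandwidth_bytes_per_s):
    """Upper bound on single-request speed, assuming each token needs
    one full read of the weights and nothing else costs any time."""
    return bandwidth_bytes_per_s / weight_bytes(n_params, bits_per_weight)


N_PARAMS = 7e9           # hypothetical 7-billion-parameter model
SLOW_BANDWIDTH = 100e9   # hypothetical 100 GB/s memory
FAST_BANDWIDTH = 1000e9  # hypothetical 1000 GB/s memory

# Same model, same 16-bit weights, different memory: a 10x gap.
fp16_slow = max_tokens_per_second(N_PARAMS, 16, SLOW_BANDWIDTH)
fp16_fast = max_tokens_per_second(N_PARAMS, 16, FAST_BANDWIDTH)

# Same slow memory, weights quantized from 16 bits to 4 bits: 4x less to read.
q4_slow = max_tokens_per_second(N_PARAMS, 4, SLOW_BANDWIDTH)

print(f"16-bit, slow memory: {fp16_slow:.1f} tokens/s at most")
print(f"16-bit, fast memory: {fp16_fast:.1f} tokens/s at most")
print(f" 4-bit, slow memory: {q4_slow:.1f} tokens/s at most")
```

Under these assumptions the 16-bit weights take 14 GB, so the slower memory caps out at about 7 tokens per second, while the 4-bit version on the same hardware has a ceiling four times higher. Real speeds will be lower, since the estimate leaves out everything except the weight reads.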