Inference: running a trained model to get output

Inference is running an already-trained model to get output: you give it an input, it computes a result, you read it back. It is distinct from training, which is the expensive one-time process that produced the model's weights. When you self-host a model and send it a prompt, you are doing inference.

At a glance

What it is: Using a trained model to turn an input into an output

Not the same as: Training, the separate process that produced the weights

What you run locally: Inference; the weights are downloaded already trained

What sets its speed: Memory bandwidth, model size, batch size, and context length

What is inference?

A model goes through two very different phases in its life. First it is trained: a slow, expensive process, run once, that adjusts its internal numbers (the weights) until it behaves the way its makers wanted. After that, the weights are fixed. Inference is everything that comes next: you hand the trained model an input, it runs a forward pass over those fixed weights, and it gives you an output. Every prompt you send to a self-hosted model is one act of inference.
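As a minimal sketch of what "a forward pass over fixed weights" means, the toy model below is a single weighted sum. The names `WEIGHTS`, `BIAS`, and `infer` and all the numbers are hypothetical, chosen for illustration; a real model has billions of weights and many layers, but the shape of the operation is the same: the weights are read, never changed.

```python
# Hypothetical fixed weights, standing in for the result of someone else's training.
# A tuple is used so the weights cannot be modified by accident.
WEIGHTS = (0.5, -1.0, 2.0)
BIAS = 0.25


def infer(features):
    """Run one forward pass: a weighted sum of the input plus a bias.

    The function only reads WEIGHTS and BIAS; nothing is learned or updated.
    """
    if len(features) != len(WEIGHTS):
        raise ValueError("input must have one value per weight")
    return sum(w * x for w, x in zip(WEIGHTS, features)) + BIAS


# One act of inference: input in, output out.
output = infer([1.0, 2.0, 3.0])
print(output)  # 4.75
```

Calling `infer` again with the same input gives the same output, because nothing about the model changed between calls.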

The short version: training makes the model, inference uses it. When you download a model and run it, you are downloading the result of someone else’s training and doing your own inference on top of it.

Why does inference speed vary so much?

Inference does no learning, but it is far from free. For a large language model, each generation step is dominated by reading the weights and the running context out of memory, so memory bandwidth often matters more than raw compute. Model size sets how much weight data there is to read. Batch size and context length set how much extra work rides along with each step: more sequences in a batch and longer contexts both mean more running context to read.
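To give a feel for how batch size and context length add to the data read at each step, here is a rough sketch. It assumes a transformer-style model that stores its running context as a key-value cache, which is common but not stated above. The function name `kv_cache_bytes` and the model dimensions (32 layers, 8 key-value heads, head size 128, 2 bytes per value) are hypothetical, picked only to make the arithmetic concrete.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, batch_size,
                   bytes_per_value=2):
    """Estimate the size of the running context (key-value cache) in bytes.

    Each token stores one key and one value vector per layer, hence the
    factor of 2. This ignores any framework overhead or padding.
    """
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token * context_len * batch_size


GIB = 2 ** 30

# Hypothetical model shape; only context length and batch size vary.
for context_len, batch_size in [(4096, 1), (4096, 4), (32768, 1)]:
    size = kv_cache_bytes(32, 8, 128, context_len, batch_size)
    print(f"context={context_len:>6} batch={batch_size}: {size / GIB:.1f} GiB")
```

Under these assumptions the cache grows linearly with both context length and batch size, so it can end up comparable in size to the weights themselves.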

That is why two setups with the same model can report very different speeds, and why quantizing a model to a smaller weight format can make inference faster: there is simply less to move.
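The "less to move" argument can be put into rough numbers. The sketch below assumes that generating one token requires reading every weight once and that memory bandwidth is the only limit; it ignores the running context, compute time, and the extra scale metadata that real quantized formats carry. The parameter count (7 billion) and bandwidth (100 GB/s) are hypothetical, and the result is a ceiling, not a prediction.

```python
def weight_bytes(param_count, bits_per_weight):
    """Bytes needed to store the weights at a given precision."""
    return param_count * bits_per_weight / 8


def max_tokens_per_second(param_count, bits_per_weight, bandwidth_bytes_per_s):
    """Upper bound on single-stream speed if every token reads all weights once."""
    return bandwidth_bytes_per_s / weight_bytes(param_count, bits_per_weight)


PARAMS = 7_000_000_000   # hypothetical model size
BANDWIDTH = 100e9        # hypothetical memory bandwidth, bytes per second

fp16 = max_tokens_per_second(PARAMS, 16, BANDWIDTH)
int4 = max_tokens_per_second(PARAMS, 4, BANDWIDTH)

print(f"16-bit weights: at most {fp16:.1f} tokens/s")
print(f" 4-bit weights: at most {int4:.1f} tokens/s")
```

With these numbers, going from 16-bit to 4-bit weights raises the ceiling about fourfold, purely because a quarter as many bytes are read per token. Measured speedups are usually smaller, since the simplifications above leave out real costs.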
