Inference: running a trained model to get output

Inference is running an already-trained model to get output: you give it an input, it computes a result, you read it back. It is distinct from training, which is the expensive one-time process that produced the model's weights. When you self-host a model and send it a prompt, you are doing inference.

At a glance

What it is: Using a trained model to turn an input into an output
Not the same as: Training, the separate process that produced the weights
What you run locally: Inference; the weights are downloaded already trained
What sets its speed: Memory bandwidth, model size, batch size, and context length
Where inference sits

Training happens once and produces the weights. Inference is the part you run over and over, every time you send the model a prompt.

1. Training (done by someone, once): huge compute, produces the model weights
2. You download the trained weights: the finished model, ready to use
3. Inference (you, every request): feed input, model computes, read output

What is inference?

A model goes through two very different phases in its life. First it is trained: a slow, expensive process, run once, that adjusts its internal numbers (the weights) until it behaves the way its makers wanted. After that, the weights are fixed. Inference is everything that comes next: you hand the trained model an input, it runs a forward pass over those fixed weights, and it gives you an output. For a language model, that means one forward pass per generated token, so every prompt you send to a self-hosted model is inference from the first token of the reply to the last.
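To make "a forward pass over fixed weights" concrete, here is a minimal sketch in Python. It assumes a toy model with a single linear layer; the names `WEIGHTS`, `BIAS`, and `forward` and all the numbers are made up for illustration and do not come from any real model or library.

```python
# Toy stand-in for a trained model: one linear layer, y = W x + b.
# The values are invented for illustration; tuples keep them read-only.
WEIGHTS = ((0.5, -1.0), (2.0, 0.25))
BIAS = (0.1, -0.2)


def forward(x):
    """Run one forward pass. Reads WEIGHTS and BIAS, never modifies them."""
    return [
        sum(w * xi for w, xi in zip(row, x)) + b
        for row, b in zip(WEIGHTS, BIAS)
    ]


if __name__ == "__main__":
    before = (WEIGHTS, BIAS)
    # Each call is one act of inference on a different input.
    for model_input in ([1.0, 2.0], [0.0, 4.0]):
        print(model_input, "->", forward(model_input))
    # The weights are exactly as "training" left them.
    assert (WEIGHTS, BIAS) == before
```

A real model has billions of weights and many layers, but the shape of the operation is the same: inputs go in, the weights are read, outputs come out, and nothing is written back.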

The short version: training makes the model, inference uses it. When you download a model and run it, you are downloading the result of someone else’s training and doing your own inference on top of it.

Why does inference speed vary so much?

Inference does no learning, but it is far from free. For a large language model generating text token by token, the work is dominated by reading the weights and the running context out of memory, so memory bandwidth often matters more than raw compute. Model size sets how much there is to read on each step. Batch size sets how many requests share that read, and context length sets how much cached context has to be read alongside it.
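A back-of-envelope estimate shows how these four factors interact. The sketch below assumes generation is purely limited by memory bandwidth, that the weights are read once per step and shared by the whole batch, and that each sequence in the batch reads its own cached context. The function name and every number in the example are hypothetical; real throughput will be lower because compute and other overheads are ignored.

```python
def decode_tokens_per_second(weight_bytes, context_bytes_per_seq,
                             batch_size, bandwidth_bytes_per_s):
    """Rough upper bound on generation speed when memory-bandwidth bound.

    Assumption: each step reads all weights once (shared by the batch)
    plus the cached context of every sequence, and emits one token per
    sequence.
    """
    bytes_per_step = weight_bytes + batch_size * context_bytes_per_seq
    steps_per_second = bandwidth_bytes_per_s / bytes_per_step
    return steps_per_second * batch_size


if __name__ == "__main__":
    # Hypothetical setup: 14 GB of weights, 1 GB of cached context per
    # sequence, 1 TB/s of memory bandwidth.
    weights, context, bandwidth = 14e9, 1e9, 1e12
    for batch in (1, 8):
        rate = decode_tokens_per_second(weights, context, batch, bandwidth)
        print(f"batch={batch}: about {rate:.0f} tokens/s in total")
```

Under these assumptions a larger batch raises total throughput, because more requests share each read of the weights, while a longer context lowers it, because each step has more cached data to read.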

That is why two setups with the same model can report very different speeds, and why quantizing a model to a smaller weight format can make inference faster: there is simply less to move.
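The effect of quantization on "how much there is to move" is simple arithmetic. The sketch below assumes a hypothetical 7-billion-parameter model and 1 TB/s of memory bandwidth, and it ignores both the cached context and the small extra metadata that real quantized formats store alongside the weights, so treat the figures as rough ceilings, not measurements.

```python
def weight_bytes(num_params, bits_per_weight):
    """Approximate size of the weights in bytes at a given precision."""
    return num_params * bits_per_weight / 8


def max_tokens_per_second(num_params, bits_per_weight, bandwidth_bytes_per_s):
    """Bandwidth-bound ceiling for one request: one full weight read per token."""
    return bandwidth_bytes_per_s / weight_bytes(num_params, bits_per_weight)


if __name__ == "__main__":
    params = 7e9       # hypothetical model size
    bandwidth = 1e12   # hypothetical memory bandwidth, bytes per second
    for bits in (16, 8, 4):
        size_gb = weight_bytes(params, bits) / 1e9
        ceiling = max_tokens_per_second(params, bits, bandwidth)
        print(f"{bits}-bit: {size_gb:.1f} GB of weights, "
              f"ceiling about {ceiling:.0f} tokens/s")
```

Halving the bits per weight halves the bytes read per token, which doubles the ceiling. Measured gains are usually smaller, since the ceiling leaves out compute and context reads.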

Inference

  • Runs every time you send a prompt
  • Modest hardware can do it, depending on model size
  • Read-only on the weights; nothing is learned
  • Speed is the metric you tune for, in tokens per second

Training

  • Happens once to produce the model
  • Needs large clusters and a lot of time
  • Writes the weights; this is where learning happens
  • Cost and data quality are the metrics that matter

Reviewed: June 2026