Temperature: the randomness knob on a model

Temperature is a sampling setting that scales how much randomness goes into picking each next token. Low temperature makes the model favour its most likely choice, so output is focused and close to repeatable. High temperature flattens the odds so less likely words get picked more often, which reads as more creative and more erratic. At temperature zero the model always takes its most likely token, so it is effectively deterministic: the same prompt returns the same answer.

At a glance

What it is: A knob that scales the randomness of each next-token choice
Low temperature: Focused, repeatable, conservative; near zero is deterministic
High temperature: Creative and varied, but more prone to drift and nonsense
Where to set it: Per request, in the inference engine or the API call
Comparison

Low versus high temperature

                  | Low (near zero)                                      | High
Word choice       | Almost always the most likely token                  | Long-shot tokens get picked more often
Same prompt twice | Same or near-identical answer                        | Different answer each time
Good for          | Code, extraction, retrieval, anything you must trust | Brainstorming, prose variety, ideas

What does temperature actually change?

At every step a language model produces a list of candidate next tokens, each with a probability. Temperature decides how literally the model takes those probabilities. At a low temperature the model leans hard on its top pick, so it stays on the obvious, safe path. At a high temperature the gaps between candidates shrink, so a word the model thought unlikely still gets its turn now and then. That is the whole effect: it does not make the model smarter or dumber; it changes how willing the model is to gamble on its less favoured words.
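The usual way this scaling is done is to divide the model's raw scores (logits) by the temperature before turning them into probabilities with a softmax. Here is a minimal sketch of that idea, not any particular engine's implementation. The function name and the three scores are made up for illustration.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities, scaled by temperature."""
    if temperature <= 0:
        # Engines usually treat 0 as "always pick the top token" instead of dividing by it.
        raise ValueError("temperature must be positive in this sketch")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max so exp() cannot overflow
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for three candidate tokens, best first.
logits = [2.0, 1.0, 0.0]

for t in (0.2, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(f"temperature {t}: {[round(p, 3) for p in probs]}")
```

At 0.2 nearly all the probability sits on the top token. At 2.0 the three candidates are much closer together, which is the "gaps shrink" effect described above.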

The visible result is variety. Low temperature gives you focused, near-repeatable text. High temperature gives you range, and with it the risk that the model wanders somewhere wrong and says it with the same confidence as everything else.
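To see that variety without a model, you can draw repeatedly from a toy distribution and count what comes out. This is a sketch under assumptions: the three tokens and their scores are invented, and the seeded generator is only there to make the demo repeatable.

```python
import math
import random
from collections import Counter

def sample_token(tokens, logits, temperature, rng):
    """Draw one token from the temperature-scaled distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    # random.choices accepts unnormalised weights, so no division by the total is needed.
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(tokens, weights=weights, k=1)[0]

def tally(temperature, draws=1000, seed=0):
    """Count how often each token is drawn across repeated samples."""
    tokens = ["the", "a", "one"]  # hypothetical candidates
    logits = [2.0, 1.0, 0.0]      # hypothetical scores, best first
    rng = random.Random(seed)
    return Counter(sample_token(tokens, logits, temperature, rng) for _ in range(draws))

print("low  (0.1):", tally(0.1))
print("high (2.0):", tally(2.0))
```

At the low setting almost every draw is the top token. At the high setting all three show up regularly, including the one the scores rated least likely.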

What should you set it to?

There is no universally correct value, only a correct value for the job. If the output feeds something that has to be trusted or reproduced (a retrieval lookup, a code generation step, a fact extraction), push temperature down. Near zero the model becomes effectively deterministic: the same prompt returns the same answer, and a result you can reproduce is a result you can debug.

If the job is creative and there is no single right answer (draft titles, prose options, idea lists), raise it. You trade reproducibility for range. The honest operator habit is to default low and raise it only where variety is the point, not the other way round.
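One way to make "default low, raise on purpose" hard to forget is to write it down as a small lookup. The task names and both numbers below are illustrative assumptions, not recommended values; tune them for your own model.

```python
# Illustrative values only; neither number comes from a standard.
DEFAULT_TEMPERATURE = 0.0    # anything that must be trusted or reproduced
CREATIVE_TEMPERATURE = 0.9   # only where variety is the point

# Hypothetical task labels; use whatever names your pipeline already has.
CREATIVE_TASKS = {"brainstorm", "headline", "prose_options"}

def temperature_for(task):
    """Return the low default unless the task is explicitly marked creative."""
    return CREATIVE_TEMPERATURE if task in CREATIVE_TASKS else DEFAULT_TEMPERATURE

for task in ("extraction", "code", "headline"):
    print(task, temperature_for(task))
```

An unknown task falls through to the low default, so a new step only gets a high temperature when someone adds it to the creative set deliberately.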

Check it yourself

curl -s localhost:8000/v1/completions -H 'content-type: application/json' -d '{"model":"local","prompt":"Write one word:","temperature":0,"max_tokens":5}'

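Run it twice at temperature 0 and the outputs should match; some servers can still show small differences from batching or floating-point effects. Raise temperature and the two runs start to diverge. Adjust the URL and model name to match your own server.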
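If you would rather script the comparison than eyeball it, here is a rough Python equivalent using only the standard library. It assumes the same endpoint and model name as the curl command, and it assumes the server replies in the OpenAI-style shape with the text under choices[0].text; check your server's response format before relying on it. The function names are made up for this sketch.

```python
import json
import urllib.request

# Assumption: same local endpoint as the curl command; change it for your server.
URL = "http://localhost:8000/v1/completions"

def build_payload(temperature, prompt="Write one word:", model="local", max_tokens=5):
    """Build the same JSON body the curl command sends."""
    return {"model": model, "prompt": prompt, "temperature": temperature, "max_tokens": max_tokens}

def complete(temperature, url=URL):
    """POST one completion request and return the generated text."""
    body = json.dumps(build_payload(temperature)).encode("utf-8")
    request = urllib.request.Request(url, data=body, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(request, timeout=30) as response:
        reply = json.load(response)
    # Assumption: OpenAI-style response, {"choices": [{"text": "..."}]}.
    return reply["choices"][0]["text"]

def runs_match(temperature, runs=2):
    """Send the same prompt several times and report whether every output is identical."""
    outputs = {complete(temperature) for _ in range(runs)}
    return len(outputs) == 1
```

With the server running, compare runs_match(0) against runs_match(1.0). The first should normally come back True; the second often will not, though a very short completion can still repeat by chance.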

Turn it down when

  • You need the same answer twice, like in a retrieval or test step
  • You are extracting facts or generating code that must be correct
  • A confident wrong answer would cost you more than a dull right one

Turn it up when

  • You want variety across runs, like draft headlines or ideas
  • The task is creative and there is no single right answer
  • Repeated runs feel too samey and you want the model to roam

Reviewed: June 2026