Temperature is a sampling setting that scales how much randomness goes into picking each next token. Low temperature makes the model favour its most likely choice, so output is focused and close to repeatable. High temperature flattens the odds so less likely words get picked more often, which reads as more creative and more erratic. At temperature zero the model is effectively deterministic: the same prompt returns the same answer.
At a glance
What it is: A knob that scales the randomness of each next-token choice
Low temperature: Focused, repeatable, conservative; near zero is deterministic
High temperature: Creative and varied, but more prone to drift and nonsense
Where to set it: Per request, in the inference engine or the API call
Comparison
Low versus high temperature
                  | Low (near zero)                                      | High
Word choice       | Almost always the most likely token                  | Long-shot tokens get picked more often
Same prompt twice | Same or near-identical answer                        | Different answer each time
Good for          | Code, extraction, retrieval, anything you must trust | Brainstorming, prose variety, ideas
What does temperature actually change?
At every step a language model produces a list of candidate next tokens, each
with a probability. Temperature decides how literally it takes those
probabilities. At a low temperature the model leans hard on its top pick, so it
stays on the obvious, safe path. At a high temperature the gaps between
candidates shrink, so a word the model thought unlikely still gets its turn now
and then. That is the whole effect: it does not make the model smarter or dumber;
it changes how willing the model is to gamble on its less favoured words.
The visible result is variety. Low temperature gives you focused, near-repeatable
text. High temperature gives you range, and with it the risk that the model
wanders somewhere wrong and says it with the same confidence as everything else.
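The scaling described above is usually done by dividing the model's raw scores (logits) by the temperature before turning them into probabilities. The sketch below shows that common formulation, not any particular engine's code: the three candidate tokens and their scores are made up for illustration, and real engines often combine temperature with other filters such as top-k or top-p.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Convert raw scores to probabilities after dividing by temperature (> 0)."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0; treat 0 as a greedy pick instead")
    scaled = [score / temperature for score in logits]
    peak = max(scaled)  # subtract the max so exp() cannot overflow
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(tokens, logits, temperature, rng=random):
    """Pick one token; temperature 0 is handled as 'always take the top score'."""
    if temperature == 0:
        best = max(range(len(logits)), key=lambda i: logits[i])
        return tokens[best]
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(tokens, weights=probs, k=1)[0]


# Hypothetical candidates and scores, for illustration only.
tokens = ["cat", "dog", "axolotl"]
logits = [2.0, 1.0, 0.1]

for t in (0.2, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

Printing the three rows side by side shows the top candidate's share shrinking as temperature rises while the long shot's share grows, which is the flattening the text describes.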
What should you set it to?
There is no universally correct value, only a correct value for the job. If the
output feeds something that has to be trusted or reproduced (a retrieval lookup,
a code generation step, a fact extraction), push temperature down. Near zero the
model becomes effectively deterministic, which means the same prompt returns the
same answer, and a result you can reproduce is a result you can debug.
If the job is creative and there is no single right answer (draft titles, prose
options, idea lists), raise it. You trade reproducibility for range. The honest
operator habit is to default low, then raise it only where variety is the point,
not the other way round.
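One way to make the default-low habit hard to forget is to encode it, so a task gets a higher temperature only if someone has explicitly listed it. This is a minimal sketch: the task names and the numeric values are illustrative assumptions, not recommended settings.

```python
# Every task runs at the low default unless it is explicitly listed below.
DEFAULT_TEMPERATURE = 0.0

# Hypothetical creative tasks and values, for illustration only.
CREATIVE_TEMPERATURES = {
    "draft_titles": 0.8,
    "brainstorm": 0.9,
}


def temperature_for(task):
    """Return the temperature for a task, falling back to the low default."""
    return CREATIVE_TEMPERATURES.get(task, DEFAULT_TEMPERATURE)


print(temperature_for("extraction"))  # not listed, so it gets the default
print(temperature_for("brainstorm"))  # listed, so it gets the raised value
```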
Check it yourself
curl -s localhost:8000/v1/completions -H 'content-type: application/json' -d '{"model":"local","prompt":"Write one word:","temperature":0,"max_tokens":5}'
Run it twice at temperature 0 and the outputs should match. Raise the temperature and the two runs start to diverge. Adjust the URL and model name to your own server.
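The same check can be scripted so the comparison is done for you. The sketch below reuses the URL, model name, and request body from the curl command; it assumes the server replies in the OpenAI-style completions shape, with the text under `choices[0].text`, which your server may not do. The function names are hypothetical.

```python
import json
import urllib.request


def complete(prompt, temperature,
             url="http://localhost:8000/v1/completions", model="local"):
    """Send one completion request and return the generated text."""
    payload = {
        "model": model,
        "prompt": prompt,
        "temperature": temperature,
        "max_tokens": 5,
    }
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),  # a request body makes this a POST
        headers={"content-type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    # Assumption: OpenAI-style response; change this line if your server differs.
    return body["choices"][0]["text"]


def outputs_match(fetch, prompt, temperature, runs=2):
    """Call fetch several times and report whether every output was identical."""
    outputs = [fetch(prompt, temperature) for _ in range(runs)]
    return len(set(outputs)) == 1, outputs
```

With a server running, `outputs_match(complete, "Write one word:", 0)` repeats the curl experiment, and passing a higher temperature shows the runs drifting apart. Taking `fetch` as a parameter keeps the comparison logic testable without a live server.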
Turn it down when
You need the same answer twice, as in a retrieval or test step
You are extracting facts or generating code that must be correct
A confident wrong answer would cost you more than a dull right one
Turn it up when
You want variety across runs, such as draft headlines or ideas
The task is creative and there is no single right answer
Repeated runs feel too samey and you want the model to roam