Sampling: how a model picks the next word

Sampling is how an inference engine chooses the next token from the probability distribution the model produces at each step. Greedy sampling always takes the single most likely token. Other strategies, like top-p (nucleus) sampling, draw at random from a shortlist of the most likely tokens, so output varies between runs. The sampling settings, together with temperature, decide how predictable or adventurous the text is.

At a glance

What it is: Choosing one next token from the model's probability distribution
Greedy: Always take the single most likely token; steady, repeatable
Top-p (nucleus): Draw from the smallest set of top tokens that covers a chosen probability share
Why it matters: It sets how varied, and how reproducible, the output is

Flow

From distribution to one token

The model hands the engine a ranked list of likely next tokens. The sampling rule narrows that list, then one token is chosen and the loop repeats.

1. Model output: ranked next-token odds. Every candidate comes with a probability.
2. Sampling rule trims the list. Greedy keeps one, top-p keeps a shortlist.
3. One token chosen, then repeat. It is appended to the text and the next step begins.
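The three steps can be sketched as a small loop. This is an illustration under stated assumptions, not how any particular engine is written: `TOY_MODEL`, `greedy_rule`, `keep_all_rule`, and `generate` are hypothetical names, and the lookup table stands in for a real model.

```python
import random

# Hypothetical stand-in for a model: maps the last token to
# next-token probabilities. A real model computes these on the fly.
TOY_MODEL = {
    "<start>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.3, "<end>": 0.2},
    "a": {"cat": 0.4, "dog": 0.4, "<end>": 0.2},
    "cat": {"<end>": 1.0},
    "dog": {"<end>": 1.0},
}


def greedy_rule(probs):
    """Sampling rule that keeps only the single most likely token."""
    best = max(probs, key=probs.get)
    return {best: 1.0}


def keep_all_rule(probs):
    """Sampling rule that trims nothing, so every candidate stays eligible."""
    return dict(probs)


def generate(rule, rng, max_steps=10):
    tokens = ["<start>"]
    for _ in range(max_steps):
        probs = TOY_MODEL[tokens[-1]]   # step 1: model output
        shortlist = rule(probs)         # step 2: sampling rule trims the list
        candidates = list(shortlist)
        weights = [shortlist[t] for t in candidates]
        token = rng.choices(candidates, weights=weights, k=1)[0]  # step 3
        if token == "<end>":
            break
        tokens.append(token)            # append, then the loop repeats
    return tokens[1:]                   # drop the "<start>" marker


print(generate(greedy_rule, random.Random(0)))
print(generate(keep_all_rule, random.Random(0)))
```

Passing the rule in as a function keeps the loop identical for every strategy; only step 2 changes.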

How does a model choose the next word?

A language model does not output a sentence. It outputs, one step at a time, a ranked list of possible next tokens, each with a probability. Sampling is the rule that turns that list into a single choice. The simplest rule is greedy: take the most likely token, every time. That is steady and repeatable, but it can also march the model into dull repetition, because the safest word is not always the best one.
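A minimal sketch of the greedy rule, and of how it can get stuck, assuming a made-up probability table (`TOY_NEXT`) in which the word "very" is most likely to be followed by "very" again. The table and function names are hypothetical and chosen only to make the repetition visible.

```python
# Hypothetical next-token probabilities, keyed by the previous token.
TOY_NEXT = {
    "it": {"is": 0.9, "was": 0.1},
    "is": {"very": 0.6, "good": 0.4},
    "very": {"very": 0.55, "good": 0.45},
    "good": {"<end>": 1.0},
}


def greedy_next(probs):
    """Greedy rule: return the single most likely token."""
    return max(probs, key=probs.get)


def greedy_generate(start, max_tokens=6):
    tokens = [start]
    while len(tokens) < max_tokens:
        nxt = greedy_next(TOY_NEXT[tokens[-1]])
        if nxt == "<end>":
            break
        tokens.append(nxt)
    return tokens


print(" ".join(greedy_generate("it")))  # prints "it is very very very very"
```

Each individual step picked the safest token, yet the sequence never reaches "good" before the length limit.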

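The common alternative is top-p, also called nucleus sampling. Instead of always taking the top token, the engine keeps the smallest set of top candidates whose probabilities add up to at least a chosen share p, then draws one from that set at random, weighted by those probabilities. With p = 0.9, for example, the engine keeps adding candidates from the top of the ranking until they cover 90% of the probability, and everything below that cutoff is ignored. A handful of other knobs exist, but top-p plus temperature covers most of what an operator touches.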
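Top-p can be sketched in a few lines of plain Python. The function names (`top_p_filter`, `sample_top_p`) and the example probabilities are assumptions made for illustration; real engines work on tensors over the full vocabulary, but the logic has the same shape.

```python
import random


def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose probabilities sum to at least p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, total = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        total += prob
        if total >= p:
            break
    # Renormalise so the shortlist is itself a probability distribution.
    return {token: prob / total for token, prob in kept}


def sample_top_p(probs, p, rng):
    """Draw one token from the top-p shortlist, weighted by probability."""
    shortlist = top_p_filter(probs, p)
    tokens = list(shortlist)
    weights = [shortlist[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


# Made-up distribution for one step.
probs = {"cat": 0.5, "dog": 0.3, "fish": 0.15, "axolotl": 0.05}

print(top_p_filter(probs, p=0.75))   # keeps "cat" and "dog" only
print(sample_top_p(probs, p=0.75, rng=random.Random(0)))
```

With a very small p the shortlist shrinks to the top token alone, so the rule behaves like greedy; with p = 1.0 nothing is trimmed.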

Why does the sampling choice matter?

Because it sets the trade-off between reproducible and natural. Draw at random and the same prompt can give a different answer each run, which reads as more human and suits creative work. Take the top token every time and you get output you can test and reproduce, which suits code, extraction, and any retrieval step that has to return the same thing twice. Repeatable is not the same as correct, though: greedy output can be wrong in exactly the same way on every run.
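The difference in reproducibility can be shown by repeating one step many times and collecting the distinct results. This is a sketch with a made-up distribution (`PROBS`) and hypothetical helper names; the random generator is seeded only so that the demonstration itself behaves the same each time it is run.

```python
import random

# Made-up next-token distribution for a single step.
PROBS = {"blue": 0.4, "grey": 0.3, "clear": 0.2, "purple": 0.1}


def greedy_pick(probs):
    """Always return the most likely token."""
    return max(probs, key=probs.get)


def random_pick(probs, rng):
    """Draw one token at random, weighted by probability."""
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]


rng = random.Random(7)  # fixed seed: an assumption to keep this demo repeatable

greedy_runs = {greedy_pick(PROBS) for _ in range(50)}
sampled_runs = {random_pick(PROBS, rng) for _ in range(50)}

print("greedy produced:", greedy_runs)    # a single token, every run
print("sampling produced:", sampled_runs)  # several different tokens
```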

The practical habit: if a result has to be debugged or reproduced, lean toward the deterministic end. If it has to feel fresh, open the sampling up. When you measure matters too: take measurements from an engine before it has warmed up and you measure the warmup rather than the model.

Greedy / low randomness

  • Same prompt gives the same output, which is easy to test
  • Good for code, extraction, and retrieval steps
  • Can fall into dull loops, repeating a safe phrase

Top-p / higher randomness

  • Output varies run to run, which reads as more natural
  • Good for prose, ideas, and anything that should feel fresh
  • More room to wander into a confident wrong answer

Reviewed: June 2026