Sampling: how a model picks the next word

Sampling is how an inference engine chooses the next token from the probability distribution the model produces at each step. Greedy decoding always takes the single most likely token, so strictly speaking nothing is drawn at random. Other strategies, like top-p (nucleus) sampling, draw at random from a shortlist of the most likely tokens, so output varies between runs. The choice of strategy, together with temperature, decides how predictable or adventurous the text is.

At a glance

What it is: Choosing one next token from the model's probability distribution

Greedy: Always take the single most likely token; steady, repeatable

Top-p (nucleus): Draw from the smallest set of top tokens whose probabilities cover a chosen share

Why it matters: It sets how varied, and how reproducible, the output is

How does a model choose the next word?

A language model does not output a sentence. It outputs, one step at a time, a probability for every token in its vocabulary, which you can read as a ranked list of possible next tokens. Sampling is the rule that turns that list into a single choice. The simplest rule is greedy: take the most likely token, every time. That is steady and repeatable, but it can also march the model into dull repetition, because the most likely word at each step does not always lead to the best text overall.
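To make the greedy rule concrete, here is a minimal sketch. The tokens and probabilities are made up for illustration, and `greedy_pick` is a hypothetical helper name, not part of any particular engine.

```python
# Toy next-token distribution; tokens and probabilities are invented for illustration.
probs = {"the": 0.42, "a": 0.27, "this": 0.18, "that": 0.09, "zebra": 0.04}

def greedy_pick(distribution):
    """Return the token with the highest probability (hypothetical helper)."""
    return max(distribution, key=distribution.get)

print(greedy_pick(probs))  # the
```

Given the same distribution, this rule returns the same token every time, which is where the repeatability comes from.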

The common alternative is top-p, also called nucleus sampling. Instead of always taking the top token, the engine keeps the smallest set of top candidates whose probabilities add up to a chosen share (for example, 0.9), then draws one from that set at random, with likelier tokens drawn more often. A handful of other knobs exist, but top-p plus temperature covers most of what an operator touches.
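The top-p rule can be sketched in a few lines. This is an illustration under stated assumptions, not any engine's actual implementation: the distribution is invented, `top_p_filter` and `top_p_sample` are hypothetical names, and the draw is weighted by the renormalized probabilities of the kept tokens, which is the usual formulation of nucleus sampling.

```python
import random

def top_p_filter(distribution, p):
    """Keep the smallest set of top tokens whose probabilities sum to at least p,
    then renormalize so the kept probabilities sum to 1."""
    ranked = sorted(distribution.items(), key=lambda kv: kv[1], reverse=True)
    kept, total = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        total += prob
        if total >= p:
            break
    return {token: prob / total for token, prob in kept}

def top_p_sample(distribution, p, rng=random):
    """Draw one token from the nucleus, weighted by renormalized probability."""
    nucleus = top_p_filter(distribution, p)
    return rng.choices(list(nucleus), weights=list(nucleus.values()), k=1)[0]

# Toy distribution; tokens and probabilities are invented for illustration.
probs = {"the": 0.42, "a": 0.27, "this": 0.18, "that": 0.09, "zebra": 0.04}

nucleus = top_p_filter(probs, 0.8)
print(list(nucleus))  # ['the', 'a', 'this']
print(top_p_sample(probs, 0.8))  # one of the three tokens above; varies between runs
```

With a share of 0.8, the first two tokens cover only 0.69, so a third is needed to cross the threshold. The unlikely tail ("that", "zebra") is cut off entirely, which is the point of the shortlist.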

Why does the sampling choice matter?

Because it sets the trade-off between reproducible and natural. Draw at random and the same prompt gives a different answer each run, which reads as more human and suits creative work. Take the top token every time and you get output you can test and compare across runs, which suits code, extraction, and any retrieval step that has to return the same thing twice. Repeatable is not the same as correct, though: greedy output is only as good as the model's top choice.
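The difference in repeatability can be seen on a toy distribution. This is a sketch only: the numbers are invented, `greedy_run` and `sampled_run` are hypothetical names, and the fixed seed is an assumption about what a given engine exposes, shown here with Python's own random number generator.

```python
import random

# Toy distribution; tokens and probabilities are invented for illustration.
probs = {"the": 0.42, "a": 0.27, "this": 0.18, "that": 0.09, "zebra": 0.04}
tokens, weights = list(probs), list(probs.values())

def greedy_run(steps):
    """Pick the top token at every step; no randomness involved."""
    return [max(probs, key=probs.get) for _ in range(steps)]

def sampled_run(steps, seed=None):
    """Draw each token at random, weighted by probability.
    A fixed seed makes the random draws repeatable; None does not."""
    rng = random.Random(seed)
    return rng.choices(tokens, weights=weights, k=steps)

print(greedy_run(5) == greedy_run(5))                    # True
print(sampled_run(5, seed=7) == sampled_run(5, seed=7))  # True
print(sampled_run(5))                                    # varies from run to run
```

Where a seed is available, it offers a middle ground: varied output that can still be replayed while debugging.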

The practical habit: if a result has to be debugged or reproduced, lean toward the deterministic end. If it has to feel fresh, open the sampling up. And remember that when you measure matters too: take measurements from an engine before it has warmed up and you measure the warmup rather than the model.
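Temperature is the usual dial for moving between the deterministic end and the open end. A minimal sketch, assuming the standard formulation in which the model's raw scores (logits) are divided by the temperature before being turned into probabilities; the scores are invented and `softmax_with_temperature` is a hypothetical name.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities.
    Lower temperature sharpens the distribution; higher temperature flattens it."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; use greedy selection instead of zero")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max before exponentiating, for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5, -1.0]  # invented scores for four candidate tokens

cool = softmax_with_temperature(logits, 0.5)     # leans deterministic
neutral = softmax_with_temperature(logits, 1.0)  # the model's own distribution
warm = softmax_with_temperature(logits, 2.0)     # opens the sampling up

print(cool[0] > neutral[0] > warm[0])  # True: the top token's share shrinks as temperature rises
```

As the temperature approaches zero, nearly all the probability lands on the top token and sampling behaves like greedy selection.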
