Beam search: keeping several drafts at once

Beam search is a way of choosing a model's output where, instead of picking one next token, the decoder keeps a fixed number of partial sequences (the beams) and at every step extends and re-ranks them, keeping only the best few. It trades speed and variety for a more globally likely sequence.

How does beam search work?

Most chat models commit to their output one token at a time. Beam search refuses to commit that early. It keeps a fixed number of partial sequences (the beams) alive; that number is the beam width. At each step it extends every beam with its possible next tokens, scores the resulting candidates by their cumulative probability, and keeps only the top few, as many as the beam width allows. By carrying several drafts forward it can recover from a token that looked good locally but led somewhere worse, which a greedy decoder that keeps only one draft cannot.
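The extend, score, and prune loop can be sketched in a few lines of Python. This is a minimal illustration, not a production decoder: `TOY_MODEL`, `next_token_probs`, and `beam_search` are hypothetical names, the probabilities are invented for the example, and the score is a plain sum of log-probabilities with no length normalisation.

```python
import math

# Hypothetical toy "model": maps a prefix to next-token probabilities.
# The numbers are made up so that the locally best first token ("A")
# leads to a worse complete sequence than the runner-up ("B").
TOY_MODEL = {
    (): {"A": 0.6, "B": 0.4},
    ("A",): {"x": 0.55, "y": 0.45},
    ("B",): {"z": 0.9, "w": 0.1},
}


def next_token_probs(prefix):
    """Return next-token probabilities, or {} if the sequence is finished."""
    return TOY_MODEL.get(tuple(prefix), {})


def beam_search(beam_width, max_steps):
    """Return the surviving beams as (log_probability, tokens), best first."""
    beams = [(0.0, ())]  # start with one empty draft, log(1) = 0
    for _ in range(max_steps):
        candidates = []
        for score, tokens in beams:
            probs = next_token_probs(tokens)
            if not probs:
                # Finished draft: carry it forward unchanged.
                candidates.append((score, tokens))
                continue
            # Extend this draft with every possible next token.
            for token, p in probs.items():
                candidates.append((score + math.log(p), tokens + (token,)))
        # Re-rank all candidates and keep only the best few.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:beam_width]
    return beams


greedy = beam_search(beam_width=1, max_steps=2)
wide = beam_search(beam_width=2, max_steps=2)
print(greedy[0][1])  # ('A', 'x'), probability 0.6 * 0.55 = 0.33
print(wide[0][1])    # ('B', 'z'), probability 0.4 * 0.9 = 0.36
```

With a beam width of 1 the search is greedy decoding: it commits to "A" and ends on a sequence with probability 0.33. With a width of 2 the "B" draft survives the first step and wins with 0.36.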

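The payoff is a sequence whose overall likelihood is usually higher than what greedy decoding finds, although the search is still approximate and can miss the single most probable sequence. The cost is work: the number of candidates to extend and score at every step grows with the beam width, so a wider beam is slower and uses more memory than drawing a single token.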
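A rough count makes the cost concrete. The sketch below assumes every live beam is scored against the whole vocabulary at each step and that no beam finishes early; `candidates_scored` and the vocabulary size of 50,000 are illustrative assumptions, not figures from any particular model.

```python
def candidates_scored(beam_width, vocab_size, steps):
    """Count candidate extensions scored over a whole decode.

    Assumes one empty draft at step 1, then `beam_width` live drafts
    at every later step, each scored against the full vocabulary.
    """
    if steps < 1:
        return 0
    first_step = vocab_size
    later_steps = (steps - 1) * beam_width * vocab_size
    return first_step + later_steps


print(candidates_scored(beam_width=1, vocab_size=50_000, steps=10))  # 500000
print(candidates_scored(beam_width=4, vocab_size=50_000, steps=10))  # 1850000
```

Under these assumptions the work grows roughly in proportion to the beam width, and so does the memory needed to hold the state for each live draft.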

When would you use it, and when not?

Beam search earns its keep where there is a “correct” target to converge on: translation, summarisation, structured extraction, anything where you want the most probable faithful rendering rather than a creative one. Because it is deterministic, the same input yields the same output, which is handy when you need repeatable results.

For open-ended chat it is usually the wrong tool. Optimising hard for the most likely sequence tends to produce flat, generic, sometimes looping text, which is why sampling with temperature and top-p is the default for conversational models. The rule of thumb: beam search when there is one right answer to find, sampling when variety is the point.
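For contrast, here is a minimal sketch of picking one token by sampling with temperature and top-p. It assumes the model's raw scores arrive as a dict of logits; `sample_top_p` and the example logits are hypothetical, and real decoders work on tensors rather than dicts.

```python
import math
import random


def sample_top_p(logits, temperature=1.0, top_p=0.9, rng=random):
    """Draw one token from the smallest set of tokens covering `top_p` mass.

    `logits` maps token -> raw score. `temperature` must be > 0: values
    below 1 sharpen the distribution, values above 1 flatten it.
    """
    # Temperature-scaled softmax, shifted by the max for numerical stability.
    scaled = {tok: score / temperature for tok, score in logits.items()}
    top = max(scaled.values())
    exps = {tok: math.exp(s - top) for tok, s in scaled.items()}
    total = sum(exps.values())
    ranked = sorted(((e / total, tok) for tok, e in exps.items()), reverse=True)

    # Keep the most probable tokens until their cumulative mass reaches top_p.
    nucleus, cumulative = [], 0.0
    for p, tok in ranked:
        nucleus.append((p, tok))
        cumulative += p
        if cumulative >= top_p:
            break

    # Sample from the kept tokens in proportion to their probabilities.
    tokens = [tok for _, tok in nucleus]
    weights = [p for p, _ in nucleus]
    return rng.choices(tokens, weights=weights, k=1)[0]


example_logits = {"the": 2.0, "a": 1.0, "zebra": -3.0}
rng = random.Random(0)
draws = {sample_top_p(example_logits, top_p=0.9, rng=rng) for _ in range(200)}
```

With these logits "the" and "a" together cover more than 90% of the probability mass, so "zebra" is cut from the nucleus and never drawn. Unlike beam search, repeated calls with a different seed can return different tokens, which is where the variety comes from.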
