How does beam search work?
Most chat models commit to one token at a time: they pick the next token, append it, and never revisit the choice. Beam search refuses to commit that early. It keeps a fixed number of partial sequences alive (that number is the beam width), and at each step it extends every one of them, scores the results (typically by cumulative log-probability), and keeps only the top-scoring candidates, as many as the beam width allows. By carrying several drafts forward it can recover from a token that looked good locally but led somewhere worse, which a greedy one-token choice cannot.
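The payoff is usually a sequence with a higher overall likelihood than greedy decoding would find, though that is not guaranteed because good continuations can still be pruned. The cost is work: a wider beam means more candidates to extend and score at every step, so it is slower and uses more memory than generating a single sequence.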
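The extend, score, and prune loop described above can be sketched in a few lines of Python. This is a minimal illustration, not a production decoder: `TOY_MODEL` and `next_token_probs` are hypothetical stand-ins for a real language model, and the probabilities are invented so that the locally best first token leads to a worse sequence.

```python
import math

# Hypothetical toy "model": maps a prefix to next-token probabilities.
# The numbers are made up for illustration only.
TOY_MODEL = {
    (): {"a": 0.6, "b": 0.4},
    ("a",): {"x": 0.55, "y": 0.45},
    ("b",): {"x": 0.9, "y": 0.1},
}


def next_token_probs(prefix):
    """Return the toy next-token distribution for a prefix (empty if unknown)."""
    return TOY_MODEL.get(tuple(prefix), {})


def beam_search(beam_width, steps):
    """Return the surviving (sequence, log-probability) pairs, best first."""
    beams = [((), 0.0)]  # start from the empty sequence with log-prob 0
    for _ in range(steps):
        candidates = []
        for seq, score in beams:
            # Extend every live sequence by every possible next token.
            for token, prob in next_token_probs(seq).items():
                candidates.append((seq + (token,), score + math.log(prob)))
        if not candidates:
            break  # nothing left to extend
        # Keep only the beam_width highest-scoring candidates.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams


if __name__ == "__main__":
    for width in (1, 2):
        seq, logp = beam_search(width, steps=2)[0]
        print(width, seq, round(math.exp(logp), 2))
```

With a beam width of 1 the sketch reduces to greedy decoding: it commits to "a" (0.6) and ends with ("a", "x") at probability 0.33. With a beam width of 2 it keeps "b" alive for one more step and finds ("b", "x") at probability 0.36.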
When would you use it, and when not?
Beam search earns its keep where there is a “correct” target to converge on: translation, summarisation, structured extraction, anything where you want the most probable faithful rendering rather than a creative one. Because it is deterministic, the same input yields the same output, which is handy when you need repeatable results.
For open-ended chat it is usually the wrong tool. Optimising hard for the most likely sequence tends to produce flat, generic, sometimes looping text, which is why sampling with temperature and top-p is the default for conversational models. The rule of thumb: beam search when there is one right answer to find, sampling when variety is the point.
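For contrast, here is a rough sketch of the sampling approach mentioned above, combining temperature scaling with top-p (nucleus) filtering. The function name `sample_top_p`, its default values, and the example logits are assumptions chosen for illustration; real inference libraries implement the same idea on tensors, with their own parameter names.

```python
import math
import random


def sample_top_p(logits, temperature=0.8, top_p=0.9, rng=random):
    """Sample one token from a dict of token -> raw score (logit)."""
    # Temperature below 1 sharpens the distribution, above 1 flattens it.
    scaled = {tok: score / temperature for tok, score in logits.items()}
    # Softmax, subtracting the max for numerical stability.
    peak = max(scaled.values())
    exps = {tok: math.exp(s - peak) for tok, s in scaled.items()}
    total = sum(exps.values())
    ranked = sorted(
        ((tok, e / total) for tok, e in exps.items()),
        key=lambda pair: pair[1],
        reverse=True,
    )
    # Keep the smallest set of top tokens whose cumulative probability
    # reaches top_p, then sample within that set.
    nucleus, cumulative = [], 0.0
    for tok, prob in ranked:
        nucleus.append((tok, prob))
        cumulative += prob
        if cumulative >= top_p:
            break
    tokens, weights = zip(*nucleus)
    return rng.choices(tokens, weights=weights, k=1)[0]


if __name__ == "__main__":
    example_logits = {"the": 2.0, "a": 1.5, "one": 1.0, "zebra": -3.0}
    print([sample_top_p(example_logits) for _ in range(5)])
```

Unlike beam search, repeated calls with the same input can return different tokens, which is the source of the variety that conversational use tends to want.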