Speculative decoding: draft fast, verify in one pass

Speculative decoding is a generation technique where a small fast draft model proposes several next tokens and the large model verifies them in a single parallel pass, accepting the run up to the first mismatch.

How does speculative decoding work?

A small, cheap draft model runs ahead and guesses the next several tokens. The big model then checks that whole guessed run in one parallel forward pass, which typically takes about as long as generating a single token because every position is scored at once. It accepts the proposed tokens up to the first position where its own choice disagrees, substitutes its own token there, and continues from that point.
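The draft-then-verify round described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a production implementation: both models are hypothetical callables that map a token prefix to their greedy next token, decoding is greedy, and the verification is written as a loop for readability even though a real system would score all positions in one batched forward pass. The names `speculative_step`, `toy_target`, and `toy_draft` are invented for this example.

```python
from typing import Callable, List

Token = int
# Assumed interface: a "model" maps a token prefix to its greedy next token.
NextToken = Callable[[List[Token]], Token]


def speculative_step(prefix: List[Token], draft: NextToken,
                     target: NextToken, k: int) -> List[Token]:
    """Run one draft-then-verify round and return the tokens to append."""
    # 1. The draft model proposes k tokens, one after another.
    proposed: List[Token] = []
    for _ in range(k):
        proposed.append(draft(prefix + proposed))

    # 2. The target model gives its own choice at each of the k + 1 positions.
    #    (Conceptually one parallel pass; looped here for clarity.)
    verified = [target(prefix + proposed[:i]) for i in range(k + 1)]

    # 3. Accept proposals up to the first mismatch.
    accepted: List[Token] = []
    for guess, truth in zip(proposed, verified):
        if guess != truth:
            break
        accepted.append(guess)

    # 4. Add the target's own token at the next position: a correction after
    #    a mismatch, or an extra token if every proposal was accepted.
    accepted.append(verified[len(accepted)])
    return accepted


# Toy models (hypothetical): the target counts upward modulo 10,
# and the draft agrees except right after the token 4.
def toy_target(prefix: List[Token]) -> Token:
    return (prefix[-1] + 1) % 10


def toy_draft(prefix: List[Token]) -> Token:
    return 0 if prefix[-1] == 4 else (prefix[-1] + 1) % 10


result = speculative_step([1], toy_draft, toy_target, k=4)
```

In this toy run the draft proposes 2, 3, 4, 0. The target agrees with the first three, rejects the 0, and supplies 5 in its place, so the round yields four tokens for a single target pass.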

Because the big model always has the final say, the output matches what it would have produced on its own: token for token under greedy decoding, and in distribution when sampling with the standard acceptance rule. The draft affects only speed, not what gets written. When the draft guesses well, you get several tokens per big-model pass; when it guesses badly, you fall back to roughly normal speed, less the small cost of running the draft.
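To make the "same output, fewer passes" claim concrete, the sketch below compares plain token-by-token decoding against speculative decoding on toy models and counts big-model passes. Everything here is an assumption made for illustration: the models are hypothetical callables returning a greedy next token, decoding is greedy, and each verification round is counted as one pass even though the toy code calls the target several times inside it.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]  # assumed model interface


def plain_decode(target: NextToken, prompt: List[int],
                 n: int) -> Tuple[List[int], int]:
    """Generate n tokens one at a time; one target pass per token."""
    out = list(prompt)
    for _ in range(n):
        out.append(target(out))
    return out, n


def speculative_decode(draft: NextToken, target: NextToken,
                       prompt: List[int], n: int,
                       k: int) -> Tuple[List[int], int]:
    """Generate n tokens with k drafted tokens per verification round."""
    out = list(prompt)
    passes = 0
    while len(out) - len(prompt) < n:
        # Draft k tokens ahead of the current output.
        proposed: List[int] = []
        for _ in range(k):
            proposed.append(draft(out + proposed))
        # One verification round, counted as a single target pass.
        passes += 1
        accepted = 0
        while (accepted < k and
               proposed[accepted] == target(out + proposed[:accepted])):
            accepted += 1
        out += proposed[:accepted]
        out.append(target(out))  # the target's own token at the next position
    return out[:len(prompt) + n], passes


def target(prefix: List[int]) -> int:       # toy target: count up modulo 10
    return (prefix[-1] + 1) % 10


def good_draft(prefix: List[int]) -> int:   # agrees except right after a 4
    return 0 if prefix[-1] == 4 else (prefix[-1] + 1) % 10


def bad_draft(prefix: List[int]) -> int:    # never agrees with the target
    return -1


plain_out, plain_passes = plain_decode(target, [0], n=12)
good_out, good_passes = speculative_decode(good_draft, target, [0], n=12, k=3)
bad_out, bad_passes = speculative_decode(bad_draft, target, [0], n=12, k=3)
```

All three runs produce the same 12 tokens. In this toy setup the good draft needs 4 target passes instead of 12, while the draft that never matches needs 12, the same as plain decoding, plus the wasted draft work.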

Why does it matter?

This is the umbrella technique behind most of the “free” speedups in modern serving. You pay a little extra compute for the draft model and the parallel check, and in return you cut the number of expensive big-model passes.
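A rough way to reason about this trade-off is to estimate how many tokens each big-model pass yields. The sketch below assumes, as a simplification, that every drafted token is accepted independently with the same probability `alpha`; real acceptance rates vary by position and by text. Under that assumption, a round with `k` drafted tokens yields 1 + alpha + alpha^2 + ... + alpha^k tokens on average, because the big model always contributes one token of its own. The function name is hypothetical.

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Expected tokens per target pass, assuming i.i.d. acceptance rate alpha."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if k < 0:
        raise ValueError("k must be non-negative")
    if alpha == 1.0:
        # Every drafted token is accepted, plus the target's extra token.
        return float(k + 1)
    # Closed form of the geometric sum 1 + alpha + ... + alpha**k.
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


estimate = expected_tokens_per_pass(0.8, 4)
```

With `alpha = 0.8` and `k = 4` the estimate is about 3.36 tokens per pass, while `alpha = 0` gives exactly 1, which is plain token-by-token decoding.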

EAGLE and DFlash are specific schemes for building that draft. They differ in how the draft is trained and configured, but both sit under speculative decoding and share the same guarantee: the big model still decides the final tokens.
