EAGLE: a feature-predicting draft head

EAGLE is a speculative-decoding scheme in which the draft is a small extra head trained to predict the base model's own features for the next token (the hidden state just before the output layer), which raises how often its proposed tokens are accepted.

How does EAGLE work?

EAGLE is a specific way to build the draft inside speculative decoding. Instead of running a fully separate small model, it adds a lightweight extra head trained to predict the base model’s own features for the next token, meaning the hidden state just before the output layer. Because the draft is anchored to the big model’s internal representations, its proposed tokens tend to match what the big model would pick.
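To make the idea of a feature-predicting head concrete, here is a minimal toy sketch in plain Python. It is an illustration under stated assumptions, not EAGLE's actual architecture: the function name `draft_step`, the single linear layer, and the tiny hand-written weights are all hypothetical. The sketch assumes the head receives the base model's latest feature together with the embedding of the most recent token, and that the base model's own output layer turns the predicted feature into a token.

```python
from typing import List, Tuple

Vector = List[float]
Matrix = List[Vector]


def matvec(matrix: Matrix, vector: Vector) -> Vector:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def argmax(values: Vector) -> int:
    """Index of the largest value (first one wins on ties)."""
    return max(range(len(values)), key=lambda i: values[i])


def draft_step(
    feature: Vector,          # base model's latest feature (hidden state)
    token_embedding: Vector,  # embedding of the most recent token
    head_weights: Matrix,     # hypothetical draft head: one linear layer
    lm_head: Matrix,          # base model's output layer, reused as-is
) -> Tuple[Vector, int]:
    # Assumption: the head sees the feature and the token embedding side by side.
    predicted_feature = matvec(head_weights, feature + token_embedding)
    # The base model's output layer maps the predicted feature to token logits.
    logits = matvec(lm_head, predicted_feature)
    return predicted_feature, argmax(logits)


# Toy numbers only: 2-dimensional features, a 3-token vocabulary.
feature = [1.0, 0.0]
token_embedding = [0.0, 1.0]
head_weights = [[0.5, 0.0, 0.0, 0.5], [0.0, 0.5, 0.5, 0.0]]
lm_head = [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]]
predicted_feature, proposed_token = draft_step(
    feature, token_embedding, head_weights, lm_head
)
```

The point of the sketch is the data flow: the draft never builds its own representation from scratch, so its proposals are shaped by the same features and output layer the big model uses.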

Higher agreement means a higher acceptance rate, so more of each draft run gets accepted per verification pass. The big model still verifies everything and still owns the final output, so the result is unchanged. EAGLE only tries to make the draft a closer match.
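The draft-then-verify loop can be sketched as follows. This is a simplified illustration that assumes greedy decoding and toy stand-in "models" (`toy_base`, `close_draft`, and `poor_draft` are made-up arithmetic functions, not real networks), and it checks positions one by one where a real system would verify the whole draft run in a single batched pass. It shows the two claims from the text: the output matches plain decoding whatever the draft proposes, and a closer draft needs fewer verification passes.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]  # context -> next token id


def plain_greedy(base_next: NextToken, prompt: List[int], max_new_tokens: int) -> List[int]:
    """Reference: the base model decodes one token at a time."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(base_next(tokens))
    return tokens


def speculative_greedy(
    base_next: NextToken,
    draft_next: NextToken,
    prompt: List[int],
    max_new_tokens: int,
    draft_len: int = 4,
) -> Tuple[List[int], int]:
    """Return (tokens, number of verification passes)."""
    tokens = list(prompt)
    target_len = len(prompt) + max_new_tokens
    passes = 0
    while len(tokens) < target_len:
        # 1. The draft proposes a short run of tokens.
        proposal: List[int] = []
        for _ in range(draft_len):
            proposal.append(draft_next(tokens + proposal))
        # 2. The base model verifies the run (counted as one pass).
        passes += 1
        accepted: List[int] = []
        for proposed in proposal:
            expected = base_next(tokens + accepted)
            if proposed != expected:
                accepted.append(expected)  # base model's token replaces the first miss
                break
            accepted.append(proposed)
        else:
            # Every proposal matched, so the pass also yields one extra base token.
            accepted.append(base_next(tokens + accepted))
        tokens.extend(accepted)
    return tokens[:target_len], passes


def toy_base(context: List[int]) -> int:
    return (sum(context) * 7 + 3) % 11


def close_draft(context: List[int]) -> int:
    # Agrees with the base model except at every fifth position.
    guess = toy_base(context)
    return guess if len(context) % 5 else (guess + 1) % 11


def poor_draft(context: List[int]) -> int:
    return 0  # ignores the context entirely


prompt = [1, 2, 3]
reference = plain_greedy(toy_base, prompt, 20)
close_out, close_passes = speculative_greedy(toy_base, close_draft, prompt, 20)
poor_out, poor_passes = speculative_greedy(toy_base, poor_draft, prompt, 20)
print(close_out == reference, poor_out == reference)
print(close_passes, poor_passes)
```

Every token that enters `tokens` is one the base model itself would have chosen at that position, which is why the draft's quality changes only the pass count, never the text. With sampling instead of greedy decoding, real systems use a probabilistic acceptance rule to get the same guarantee; that case is not modeled here.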

Why does it matter?

A better-aligned draft is the lever that turns speculative decoding from “sometimes faster” into “reliably faster”. That is the appeal of EAGLE: it raises acceptance without changing what the model says.
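A rough way to see why acceptance is the lever: if each drafted token is accepted independently with the same probability (a simplifying assumption that real workloads only approximate), the expected number of tokens produced per verification pass for a draft run of length k is (1 - a^(k+1)) / (1 - a), where a is the acceptance rate. The function name and parameters below are illustrative.

```python
def expected_tokens_per_pass(acceptance_rate: float, draft_len: int) -> float:
    """Expected tokens emitted per verification pass.

    Assumes each drafted token is accepted independently with the same
    probability, and that every pass yields at least one base-model token.
    """
    if not 0.0 <= acceptance_rate <= 1.0:
        raise ValueError("acceptance_rate must be between 0 and 1")
    if draft_len < 0:
        raise ValueError("draft_len must be non-negative")
    if acceptance_rate == 1.0:
        # Whole run accepted, plus the extra token from the verification pass.
        return float(draft_len + 1)
    return (1.0 - acceptance_rate ** (draft_len + 1)) / (1.0 - acceptance_rate)


for rate in (0.5, 0.8, 0.9):
    print(rate, round(expected_tokens_per_pass(rate, draft_len=4), 3))
```

Under this model, moving the acceptance rate from 0.5 to 0.8 with a four-token draft raises the expected yield from about 1.94 to about 3.36 tokens per pass. Wall-clock speedup is smaller than that ratio because drafting and verification have their own cost.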

The catch is that the gain depends on conditions. Acceptance can shift with the content being generated, and the scheme can be sensitive to the serving build. On this stack EAGLE regressed on one SGLang nightly, so it is worth measuring on your own build rather than assuming the win holds.
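One hedged way to run that measurement is a small harness that times the same prompts with and without the draft. The sketch below deliberately avoids any real serving API: `baseline` and `speculative` are hypothetical callables you would write yourself to send one request to your own build and return the number of tokens generated.

```python
import time
from statistics import median
from typing import Callable, List

Generate = Callable[[str], int]  # prompt -> number of generated tokens


def tokens_per_second(generate: Generate, prompts: List[str], repeats: int = 3) -> float:
    """Median throughput over several full passes through the prompt set."""
    rates = []
    for _ in range(repeats):
        start = time.perf_counter()
        total_tokens = sum(generate(prompt) for prompt in prompts)
        elapsed = time.perf_counter() - start
        rates.append(total_tokens / elapsed)
    return median(rates)


def speedup(baseline: Generate, speculative: Generate, prompts: List[str], repeats: int = 3) -> float:
    """Values above 1.0 mean the speculative configuration was faster."""
    return tokens_per_second(speculative, prompts, repeats) / tokens_per_second(
        baseline, prompts, repeats
    )
```

Because acceptance shifts with content, the prompt set should resemble real traffic, and the comparison is worth repeating after each serving-build upgrade.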
