EAGLE Speculative Decoding: When It Helps and When It Doesn't
EAGLE accelerates free-form prose and conversational workloads by roughly 2-3x on Mistral Small 4. On structured-JSON output and on tool-call-heavy workloads, the same configuration is net-negative. The decision rule is the workload class, not the inference engine.
The implementation work that follows from this is a per-call workload classifier that enables or disables speculation, and a dispatcher that routes by class. EAGLE is not a “set and forget” optimization. Treating it as one produces measured throughput that is worse than the baseline on the workloads where the technique loses.
I confirmed this on the DGX Spark in May 2026. Mistral Small 4 NVFP4 with the safer-eagle config ran at 36.5 tok/s on a free-form prose run, compared to roughly 14 tok/s baseline (no EAGLE). The same model on a structured-JSON extraction task dropped to 13-15 tok/s with EAGLE on, which is at best level with baseline and at worst below it. Switching EAGLE off for that workload class recovered the 14 tok/s baseline. The numbers are consistent with the accept-rate math: 0.81-0.88 on free-form prose, 0.20-0.40 on JSON-constrained output.
Quick Take
- What EAGLE is: a draft-then-verify speculative decoding scheme where a small draft model proposes the next several tokens and the main model verifies them in parallel. Accepted tokens are free throughput; rejected tokens cost a verification pass.
- When EAGLE helps: free-form prose, conversational responses, long-form writing. Accept rates run roughly 70-90 percent on this stack, and net throughput is 2-3x baseline.
- When EAGLE hurts: structured-JSON output, schema-constrained generation, tool-call responses. Accept rates fall to 20-40 percent and the verification overhead exceeds the savings.
- The trap: benchmarks usually measure free-form prose, so EAGLE looks like a uniform win. Real production workloads have a mix; the mix determines the actual throughput.
- The fix: per-call workload classification, dispatcher that toggles EAGLE per class.
The basic mechanism
Speculative decoding splits inference into two passes per generation step.
In the first pass, a small “draft” model produces the next N candidate tokens at low cost. The draft model is typically a smaller distilled version of the main model, trained specifically to mimic the main model’s token distribution.
In the second pass, the main model verifies all N candidates in parallel by computing the probabilities the main model would have assigned. Tokens where the main model agrees with the draft are accepted; tokens where the main model disagrees are rejected, and generation continues from the first disagreement.
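The two passes can be sketched as a toy loop under greedy decoding. This is an illustration only: the "models" are hypothetical functions that return one next token for a prefix, and the verification loop stands in for what a real engine does in a single batched forward pass. The bonus token emitted when every draft token is accepted follows common speculative-decoding practice and is an assumption about the engine, not something measured here.

```python
from typing import Callable, List, Sequence

# A "model" here is any function mapping a token prefix to its greedy next token.
NextToken = Callable[[Sequence[str]], str]

def speculative_step(prefix: List[str], draft: NextToken,
                     target: NextToken, n: int) -> List[str]:
    """Run one draft-then-verify cycle and return the tokens it emits."""
    # Pass 1: the draft proposes n tokens autoregressively.
    proposed: List[str] = []
    for _ in range(n):
        proposed.append(draft(prefix + proposed))

    # Pass 2: the target checks each position in order. A real engine scores
    # all positions in one forward pass; this loop only mimics the outcome.
    emitted: List[str] = []
    for tok in proposed:
        expected = target(prefix + emitted)
        if tok != expected:
            emitted.append(expected)  # correction token supplied by the target
            return emitted
        emitted.append(tok)
    # Every draft token was accepted: the target pass also yields a bonus token.
    emitted.append(target(prefix + emitted))
    return emitted

# Toy models (hypothetical): the target recites a fixed sentence and the draft
# agrees with it everywhere except position 3.
SENTENCE = "the cat sat on the mat today".split()

def target_model(prefix: Sequence[str]) -> str:
    return SENTENCE[len(prefix)] if len(prefix) < len(SENTENCE) else "<eos>"

def draft_model(prefix: Sequence[str]) -> str:
    return "under" if len(prefix) == 3 else target_model(prefix)

print(speculative_step([], draft_model, target_model, n=4))
# prints ['the', 'cat', 'sat', 'on']: three accepted tokens plus one correction
```

Because the emitted tokens are always the ones the target would have chosen, the draft only affects how many tokens each cycle yields, not which tokens they are.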
The throughput math: if the draft is correct for the first k of N tokens, you get k+1 tokens of output (the k accepted tokens plus one correction token that the main model supplies at the first point of disagreement) for the cost of one main-model pass plus N cheap draft steps. When k is large, the throughput multiplier is large. When k is small, the draft steps and the wider verification pass are wasted computation, which is why EAGLE actively hurts on the wrong workloads rather than merely failing to help.
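A rough way to put numbers on this is the standard speculative-decoding estimate, sketched below. It assumes each draft token is accepted independently with the same probability, which is a simplification, and the engine's reported accept rate may not map onto that probability exactly. The `draft_cost` and `verify_cost` values are illustrative placeholders expressed in units of one plain decode step, not measurements from this stack.

```python
def expected_tokens_per_cycle(accept_prob: float, n_draft: int) -> float:
    """Expected tokens emitted per verify cycle, assuming each draft token is
    accepted independently with probability accept_prob (a simplification)."""
    if accept_prob >= 1.0:
        return float(n_draft + 1)
    # Geometric series: 1 + p + p^2 + ... + p^n_draft
    return (1.0 - accept_prob ** (n_draft + 1)) / (1.0 - accept_prob)

def estimated_speedup(accept_prob: float, n_draft: int,
                      draft_cost: float = 0.1, verify_cost: float = 1.0) -> float:
    """Estimated speedup over plain decoding. Costs are in units of one plain
    decode step; both defaults are assumptions, not measured values."""
    cycle_cost = n_draft * draft_cost + verify_cost
    return expected_tokens_per_cycle(accept_prob, n_draft) / cycle_cost

for p in (0.85, 0.30):
    print(f"accept={p:.2f}  tokens/cycle={expected_tokens_per_cycle(p, 4):.2f}  "
          f"speedup~{estimated_speedup(p, 4):.2f}x")
```

Raising `verify_cost` above 1.0, to reflect a verification pass that is more expensive than a single-token decode step, pushes the low-accept case below 1.0x while the high-accept case stays comfortably above it.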
EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency) is a specific speculative-decoding architecture designed to integrate cleanly with the inference engine’s KV cache. The draft model in EAGLE is unusually small, the drafting is tightly coupled to the main model’s hidden states, and the implementation is mature enough to be supported in vLLM and SGLang for supported model families, where it is enabled through launch flags.
How to read the accept-rate metric
The accept rate, reported per decode batch by SGLang (e.g. accept rate: 0.81), is the fraction of speculative draft tokens the main model accepted. This is the diagnostic I rely on because accept rate is a leading indicator of throughput changes. A drop in accept rate predicts a throughput drop before the per-second numbers become obvious.
# Tail SGLang accept-rate and throughput in real time
docker logs sglang-mistral4 2>&1 | grep "accept rate"
# Example output lines:
# [2026-05-06 18:36:11] Decode batch, gen throughput (token/s): 33.76, accept rate: 0.91
# [2026-05-06 18:36:23] Decode batch, gen throughput (token/s): 13.20, accept rate: 0.28
The log lines come out per batch, which means you see the accept rate change in real time as the workload shifts. On a mixed pipeline run (free-form prose, then JSON extraction, then naturalization), the accept rate tells you exactly which phase is costing you throughput and why.
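To turn those log lines into something you can alert on, a small parser is enough. The sketch below assumes the log format is exactly the one shown in the example lines above; if your SGLang version prints additional fields or different labels, the regular expression needs adjusting. The 0.5 threshold is a hypothetical cutoff, not a tuned value.

```python
import re
from typing import Iterable, Iterator, List, Tuple

# Matches the two fields shown in the example log lines above.
LINE_RE = re.compile(
    r"gen throughput \(token/s\): (?P<tps>[\d.]+), accept rate: (?P<acc>[\d.]+)"
)

def parse_decode_lines(lines: Iterable[str]) -> Iterator[Tuple[float, float]]:
    """Yield (throughput_tok_s, accept_rate) for each matching log line."""
    for line in lines:
        m = LINE_RE.search(line)
        if m:
            yield float(m.group("tps")), float(m.group("acc"))

def low_accept_batches(lines: Iterable[str],
                       threshold: float = 0.5) -> List[Tuple[float, float]]:
    """Return the batches whose accept rate is below the (hypothetical) threshold."""
    return [(tps, acc) for tps, acc in parse_decode_lines(lines) if acc < threshold]

sample = [
    "[2026-05-06 18:36:11] Decode batch, gen throughput (token/s): 33.76, accept rate: 0.91",
    "[2026-05-06 18:36:23] Decode batch, gen throughput (token/s): 13.20, accept rate: 0.28",
]
print(low_accept_batches(sample))  # prints [(13.2, 0.28)]
```

In practice the same functions can read from `sys.stdin`, so the `docker logs` output can be piped straight into the script.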
Where EAGLE helps
Conversational prose. The draft model has been trained on the same data distribution as the main model, so on free-form text the draft predictions are usually right. Accept rates of 0.81-0.88 are typical on this stack (measured in May 2026), which means roughly 3-4 tokens accepted per verification cycle. Net throughput multiplier on Mistral Small 4 with EAGLE: roughly 2-3x baseline.
Why EAGLE helps here: the draft model was trained to mimic the main model’s natural output distribution. Free-form prose is that natural distribution. The draft is asking “what would Mistral say next?” and on free-form prose the answer is predictable enough that the draft is right most of the time. This is the setting EAGLE was designed for.
Long-form writing. Once the model has settled into a topic, the next-token distribution is heavily concentrated on a small number of candidates, and the draft model picks one of them correctly most of the time. Accept rates climb above 0.70. On the naturalize phase of the podcast pipeline, tested in May 2026, peak throughput reached 39.2 tok/s with accept rates at 0.79-0.91.
Code completion in continuation mode. Where the model is producing code that follows existing patterns (the next line of a function whose pattern is established), the draft model’s predictions are good. Accept rates around 0.60. Code completion is an intermediate case: better than JSON extraction, worse than free-form prose, because code has both structure (which works against the draft) and repetition (which helps it).
Where EAGLE hurts
Structured-JSON output. The token distribution for a JSON brace, quote, key, colon, value, comma sequence is narrow and tightly constrained. The draft model’s predictions are wrong more often than they are right because the structural constraint disagrees with the draft’s free-form-text prior. Accept rates fall to 0.20-0.40. The verification overhead exceeds the savings.
Why EAGLE hurts here: the draft model was trained on natural Mistral output, which is prose. JSON-constrained output is not prose. Every brace, every quote, every required key forces the model to produce tokens that look nothing like free-form text. The draft makes prose-style predictions, the main model overrides them, and the wasted verification passes add latency rather than saving it. This is the same reason grammar-constrained decoding is even worse: the grammar constraint can reject every draft token outright, making EAGLE’s cost pure overhead.
Caveat: EAGLE does not fix the underlying problem of constraint-heavy prompts. If a prompt forces strict JSON output with a schema, EAGLE off is better than EAGLE on. The right tool for that workload is either a different inference path or a model that was specifically trained to produce structured output efficiently.
Tool-call responses. Same pathology: the model is producing a structured payload that follows a tight schema. The free-form-text draft is the wrong prior. Accept rates are similarly low, in the 0.20-0.35 range. The multiplier in the table below (1.1-1.6x) is the realistic range: a marginal gain at best, nowhere near the 2-3x seen on prose.
Caveat: avoid deploying EAGLE globally and assuming vendor benchmark numbers apply to production. Those benchmarks are almost always free-form prose runs. A production agentic stack that emits JSON tool calls will not see 2-3x gains. It will see throughput at or below baseline.
Schema-constrained generation. Grammar-constrained decoding (where the inference engine forces output to conform to a JSON schema or an EBNF grammar) is the worst case for EAGLE. The grammar constraint can reject every draft token, in which case EAGLE’s verification cost is pure overhead and the throughput collapses to below baseline.
(See Fixes: EAGLE Content-Dependent Throughput for the worked example with measured numbers across four workload classes.)
The measured ranges
Tested on 2026-05-22, sovgrid pipeline, Mistral Small 4 NVFP4 on the DGX Spark GB10, single-stream interactive. These numbers are specific to that hardware and model version; on different hardware the baseline will differ but the ratio between EAGLE-on and EAGLE-off within each workload class should be similar.
| Workload class | Baseline (no EAGLE) | With EAGLE | Multiplier | Recommended |
|---|---|---|---|---|
| Short summary (free-form prose) | ~14 tok/s | ~37 tok/s | 2.6x | EAGLE on |
| Medium analysis (free-form prose) | ~13 tok/s | ~25 tok/s | 1.9x | EAGLE on |
| Long code refactor (mixed) | ~14 tok/s | ~35-41 tok/s | 2.5-2.9x | EAGLE on |
| Structured JSON output | ~14 tok/s | ~13-25 tok/s | 0.9-1.8x | EAGLE off (or per-call) |
| Tool-call response | ~14 tok/s | ~15-22 tok/s | 1.1-1.6x | EAGLE off |
Vendor benchmarks usually report only the free-form prose rows, which is where EAGLE looks best. The structured-output rows are the ones production workloads actually hit, and the ones where the marketing throughput number does not survive contact with the real workload. (For the broader benchmark-honesty argument, see Two Leaderboards Nobody Reads Together.)
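To see how a traffic mix changes the headline number, the per-class rates can be blended by token share. The sketch below uses the low end of each EAGLE range from the table above; the traffic mix itself is hypothetical and should be replaced with your own token shares.

```python
from typing import Dict

def blended_throughput(mix: Dict[str, float], rates: Dict[str, float]) -> float:
    """Aggregate tok/s when mix[c] is the share of output tokens in class c.

    Time per token adds up across classes, so the blend is a weighted
    harmonic mean of the per-class rates, not an arithmetic mean."""
    total = sum(mix.values())
    seconds_per_token = sum((share / total) / rates[c] for c, share in mix.items())
    return 1.0 / seconds_per_token

# Per-class rates from the table above (low end of each EAGLE range).
eagle_always = {"prose": 37.0, "json": 13.0, "tool_call": 15.0}
baseline = {"prose": 14.0, "json": 14.0, "tool_call": 14.0}

# Hypothetical traffic mix, by share of output tokens.
mix = {"prose": 0.3, "json": 0.4, "tool_call": 0.3}

print(f"EAGLE always on: {blended_throughput(mix, eagle_always):.1f} tok/s")
print(f"Baseline:        {blended_throughput(mix, baseline):.1f} tok/s")
```

Because slow classes dominate wall-clock time, a mix that is mostly structured output lands far below the prose-only figure even when the prose rate itself is unchanged.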
For comparison, Qwen3.6 with MTP (Multi-Token Prediction, n=3) ran at 57-62 tok/s on the same hardware in May 2026, using DFlash speculative decoding rather than EAGLE. MTP is Qwen’s alternative to EAGLE and is less sensitive to workload class because the draft strategy is different. This is worth knowing if you are choosing between models: the choice of speculative-decoding method is part of the inference-stack decision, not just the model-quality decision.
The implementation work
The fix is a per-call workload classifier plus a dispatcher that toggles EAGLE per class. The classifier in the sovgrid stack is regex-based and lives in master.py. The decision logic is:
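def classify_workload(messages, tools, response_format):
    """Return "structured", "tool_call", or "free_form" for one request."""
    # Explicit JSON modes: type "json_object" or "json_schema", or a "schema" key.
    if response_format:
        if response_format.get("type") in ("json_object", "json_schema"):
            return "structured"
        if "schema" in response_format:
            return "structured"
    if tools:
        return "tool_call"
    # Fall back to keyword markers in the last message. Content may be None
    # or a list of parts, so only plain strings are inspected.
    content = messages[-1].get("content") if messages else ""
    last_message = content if isinstance(content, str) else ""
    if any(marker in last_message for marker in ["JSON", "schema", "structured output"]):
        return "structured"
    return "free_form"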
The dispatcher reads the classification and chooses the inference endpoint. EAGLE-enabled endpoint for free_form; EAGLE-disabled endpoint for structured and tool_call. The two endpoints can be the same vLLM service started twice with different flags, or one service with runtime control over the speculative-decoding setting.
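A minimal sketch of that routing step is shown below. It is not the dispatcher from master.py: the endpoint names and URLs are hypothetical placeholders, and the fallback to the EAGLE-disabled endpoint for unrecognized classes is a design assumption made here, chosen because that path costs at most the baseline rate.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Endpoint:
    name: str
    base_url: str
    speculative: bool

# Hypothetical endpoints: two copies of the same model, one launched with
# speculative decoding and one without. URLs and ports are placeholders.
EAGLE_ON = Endpoint("eagle-on", "http://localhost:8000/v1", True)
EAGLE_OFF = Endpoint("eagle-off", "http://localhost:8001/v1", False)

ROUTES = {
    "free_form": EAGLE_ON,
    "structured": EAGLE_OFF,
    "tool_call": EAGLE_OFF,
}

def pick_endpoint(workload_class: str) -> Endpoint:
    """Map a workload class to an endpoint. Unknown classes fall back to the
    EAGLE-disabled endpoint (an assumption of this sketch)."""
    return ROUTES.get(workload_class, EAGLE_OFF)

print(pick_endpoint("free_form").base_url)   # http://localhost:8000/v1
print(pick_endpoint("tool_call").base_url)   # http://localhost:8001/v1
```

Keeping the routing table as plain data means a new workload class can be added, or an existing one flipped, without touching the classifier.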
Enabling EAGLE on vLLM
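As of vLLM 0.4.x (tested May 2026), EAGLE speculative decoding is enabled via launch flags. Flag names have changed across vLLM releases (newer versions take a single --speculative-config JSON argument), so check the CLI reference for the version you run. The safer-eagle profile that the sovgrid stack uses keeps the draft count conservative to avoid degrading on borderline workloads: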
# vLLM: EAGLE speculative decoding, safer-eagle profile
python -m vllm.entrypoints.openai.api_server \
--model mistralai/Mistral-Small-3.1-24B-Instruct-2503 \
--speculative-model /models/mistral-eagle-draft \
--num-speculative-tokens 4 \
--speculative-draft-tensor-parallel-size 1 \
--port 8000
# For the EAGLE-disabled fallback endpoint (structured/tool-call workloads):
python -m vllm.entrypoints.openai.api_server \
--model mistralai/Mistral-Small-3.1-24B-Instruct-2503 \
--port 8001
SGLang safer-eagle config
SGLang also supports EAGLE natively. As of the SGLang version used in May 2026, the equivalent configuration is:
# SGLang: EAGLE with conservative draft count (safer-eagle profile)
python -m sglang.launch_server \
--model-path /models/Mistral-Small-3.1-24B-Instruct-2503-NVFP4 \
--speculative-algorithm EAGLE \
--speculative-draft-model-path /models/mistral-eagle-draft \
--speculative-num-draft-tokens 4 \
--port 30000
# Watch accept rate in real time during a run:
docker logs sglang-mistral4 2>&1 | grep -E "accept rate|throughput"
The --speculative-num-draft-tokens 4 setting, rather than the default 8, is what makes this the “safer-eagle” profile. Fewer draft tokens means the wasted-verification cost per rejected cycle is lower, which reduces the penalty when the accept rate is bad. The downside compared to a higher draft count: you capture less of the upside when the accept rate is good.
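The trade-off between draft counts can be estimated with a few lines of arithmetic. The sketch below assumes each draft token is accepted independently with the same probability and that everything after the first rejection is discarded; both are simplifications, and the accept probabilities are picked from the ranges quoted in this article rather than measured per draft count.

```python
from typing import Tuple

def accepted_and_wasted(accept_prob: float, n_draft: int) -> Tuple[float, float]:
    """Expected accepted and wasted draft tokens per cycle, assuming independent
    per-token acceptance and discard-after-first-rejection (simplifications)."""
    # P(the first i draft tokens are all accepted) = accept_prob ** i, so the
    # expected accepted-prefix length is the sum of those probabilities.
    accepted = sum(accept_prob ** i for i in range(1, n_draft + 1))
    return accepted, n_draft - accepted

for p in (0.85, 0.30):
    for n in (4, 8):
        acc, waste = accepted_and_wasted(p, n)
        print(f"accept={p:.2f} n_draft={n}: accepted~{acc:.2f}, wasted~{waste:.2f}")
```

At a low accept probability, doubling the draft count barely changes the accepted count and roughly doubles the wasted drafts; at a high accept probability, the larger count does buy extra accepted tokens.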
The cost of this implementation is roughly 100 lines of Python plus two systemd unit files. The benefit is that EAGLE accelerates the workloads where it should and gets out of the way on the workloads where it should not.
What EAGLE doesn’t fix
Caveat on scope: EAGLE is a throughput optimization for decode-phase token generation. It does not reduce time-to-first-token, which is dominated by the prefill phase. If your latency complaint is “the model takes too long to start responding”, EAGLE is the wrong lever.
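A quick latency decomposition makes the point concrete. In the sketch below the time-to-first-token and the output lengths are hypothetical; the two decode rates are the prose figures quoted earlier in this article (roughly 14 tok/s baseline and 36.5 tok/s with EAGLE).

```python
def response_latency(ttft_s: float, output_tokens: int, decode_tok_s: float) -> float:
    """End-to-end latency: time to first token plus decode time for the rest."""
    return ttft_s + max(output_tokens - 1, 0) / decode_tok_s

TTFT_S = 2.0  # hypothetical prefill-dominated time to first token

for label, rate in (("baseline", 14.0), ("EAGLE", 36.5)):
    for n in (20, 500):
        total = response_latency(TTFT_S, n, rate)
        print(f"{label:8s} {n:3d} tokens: {total:5.1f}s total, "
              f"TTFT share {TTFT_S / total:.0%}")
```

For short responses the fixed time-to-first-token is a large share of the total, so a faster decode rate moves the end-to-end number far less than it does for long responses.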
EAGLE also does not improve output quality. Under greedy decoding, the speculative tokens that get accepted are exactly the same tokens the main model would have produced without EAGLE; with sampling, the verification step is designed to preserve the main model’s output distribution. The draft model is an approximation that saves compute when it is right, not a model that changes the output. This means EAGLE is safe to enable on prose workloads: the output is equivalent, just faster.
Why this matters in practice: I have seen EAGLE described as “an upgrade” in documentation and blog posts. That framing is misleading. The accurate description is “a throughput optimization that is workload-dependent”. An upgrade implies unconditional improvement. EAGLE is a conditional improvement that becomes a throughput regression on the wrong workload. The distinction matters when you are deciding whether to enable it globally.
One more thing EAGLE doesn’t fix: memory pressure. The draft model consumes GPU memory alongside the main model. On the DGX Spark with 128 GB unified memory, this is not a constraint for Mistral Small 4. On hardware with tighter VRAM budgets, the draft model’s memory footprint can push the main model to a smaller batch size, which erases some of the throughput gains. Check your memory headroom before enabling EAGLE, not after.
Where this fits
For the broader inference-stack reasoning, see The Sovereign AI Stack in 2026. For the model-level comparison that includes EAGLE-vs-MTP, see Mistral vs Qwen 3.6 vs GLM-5 on a Single DGX Spark. For the upstream Qwen 3.6 alternative to EAGLE (MTP n=3), see Spark Arena Rank 4 Made Me Add Qwen3.6.