EAGLE Throughput Is Content-Dependent: Same Run, 14 to 31 Tokens Per Second
I have been chasing podcast-script quality on my self-hosted stack and ran into a measurement that does not match the usual mental model of “tokens per second is a hardware property.” Same Mistral-Small-4 119B NVFP4 in the same SGLang container with the same EAGLE speculative-decoding config. Same GB10 box. Inside one pipeline run, throughput went from 14 to 31 tokens per second and back. Nothing on the hardware side changed. The only thing that changed was the kind of text the model was generating.
This post documents what I saw, the live SGLang logs that prove it, and the practical takeaway for anyone running long-context speculative decoding on consumer hardware.
The setup
The pipeline generates a 45-minute podcast script in two phases:
- Initial generation. Two long-context calls to Mistral, each with a heavy system prompt that includes a numerical “output contract”: exact word-count targets, host word-share ratios, allowed turn-count ranges, hedge-opener minimums, mandatory cameo-segment counts.
- Naturalize. A second pass that rewrites short batches of turns to add casual hedges and contractions. The system prompt for this pass is short. The user message just says “rewrite these turns to sound more spoken.”
Same SGLang container, same model, same GPU memory state. The only difference between phase one and phase two is the shape of the input.
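To make “the shape of the input” concrete, here is a hypothetical sketch of the two request shapes against SGLang’s OpenAI-compatible endpoint. The port, model name, and prompt text are placeholders of mine, not the pipeline’s real values.

```python
import requests

# Hypothetical sketch of the two call shapes. URL, model name, and prompts
# are placeholders, not the pipeline's actual values.
URL = "http://localhost:30000/v1/chat/completions"

contract_call = {
    # Phase 1: heavy numerical output contract baked into the system prompt.
    "model": "mistral-small-4",
    "messages": [
        {"role": "system", "content": (
            "Write Part 1 of the episode. Hit the exact word-count target, "
            "balance host word-share to 50/50, include the required cameo "
            "segments, meet the hedge-opener minimum, and return a JSON "
            "array of turns matching the schema."
        )},
        {"role": "user", "content": "Episode outline: ..."},
    ],
}

naturalize_call = {
    # Phase 2: short instruction, small batch of turns, no global constraints.
    "model": "mistral-small-4",
    "messages": [
        {"role": "system", "content": "You edit podcast dialogue."},
        {"role": "user", "content": "Rewrite these turns to sound more spoken: ..."},
    ],
}

for payload in (contract_call, naturalize_call):
    reply = requests.post(URL, json=payload, timeout=600).json()
    print(reply["choices"][0]["message"]["content"][:120])
```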
What I expected
Mistral-Small-4 with EAGLE speculative decoding has historically run at about 25 to 35 tokens per second on this hardware. EAGLE accept rates of 0.7 to 0.9 are normal. I was expecting both phases to land somewhere in that range, with the naturalize phase a bit slower if anything because of the round-trip overhead from many small requests.
What I measured
SGLang’s scheduler logs every decode batch with its own throughput and EAGLE accept-rate stats. The container exposes them via `docker logs`:

```
[2026-05-06 18:36:11] Decode batch, gen throughput (token/s): 33.76, accept rate: 0.91
[2026-05-06 18:36:23] Decode batch, gen throughput (token/s): 29.04, accept rate: 0.72
```
I tailed those during the run and aggregated them in 60-second windows. Here is what the actual data looked like:
| Phase | Sample window | Avg tok/s | Accept rate | Max tok/s |
|---|---|---|---|---|
| Initial gen, Part 1 | 21:02 - 21:09 | 13.9 | 0.59 | 18.1 |
| Initial gen, Part 2 retry | 21:14 - 21:15 | 16.1 | 0.71 | 17.5 |
| Naturalize | 21:16 - 21:18 | 30.1 | 0.79 | 39.2 |
Same model, same GPU, same hour. Throughput more than doubled when the input changed.
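For reference, the 60-second aggregation was nothing fancy. Below is a minimal sketch of the idea: the container name and log format come from the lines above, while the regex and the bucketing are my own quick version and may need adjusting for other SGLang builds.

```python
import re
import subprocess
from collections import defaultdict
from datetime import datetime
from statistics import mean

# Match the decode-batch lines shown above; tied to that exact log format.
LINE = re.compile(
    r"\[(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\].*"
    r"gen throughput \(token/s\): (?P<tps>[\d.]+), accept rate: (?P<ar>[\d.]+)"
)

result = subprocess.run(
    ["docker", "logs", "sglang-mistral4"],
    capture_output=True, text=True,
)

# Bucket every decode-batch line into 60-second windows keyed by minute.
windows = defaultdict(lambda: {"tps": [], "ar": []})
for line in (result.stdout + result.stderr).splitlines():
    m = LINE.search(line)
    if not m:
        continue
    ts = datetime.strptime(m["ts"], "%Y-%m-%d %H:%M:%S")
    bucket = ts.replace(second=0)
    windows[bucket]["tps"].append(float(m["tps"]))
    windows[bucket]["ar"].append(float(m["ar"]))

for bucket in sorted(windows):
    w = windows[bucket]
    print(f"{bucket:%H:%M}  avg {mean(w['tps']):5.1f} tok/s  "
          f"accept {mean(w['ar']):.2f}  max {max(w['tps']):5.1f} tok/s")
```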
The puzzle
If hardware were the bottleneck, throughput would not vary by a factor of two within the same run. If model size were the bottleneck, ditto. Memory bandwidth sat at 96 percent utilization the whole time, so the GPU was working flat out in both phases. So what changed?
The answer is in the EAGLE accept rate. EAGLE generates draft tokens with a small model and verifies them with the big model. If the draft predicts the big model’s next token correctly, you get that token “for free.” With three speculative steps and four draft tokens per step, an accept rate of 0.9 means roughly 3.6 tokens accepted per verify cycle; at 0.6 it is 2.4. The ratio of those two roughly tracks the throughput gap I observed. The real gap is somewhat larger, which makes sense: a rejected draft token also throws away everything speculated after it, so low accept rates hurt more than linearly.
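If you want to poke at that arithmetic, here is a toy version. Neither function below is SGLang’s real accounting: the linear one is the same simplification as above, and the chain one treats the reported accept rate as a per-token probability where a single rejection discards everything drafted after it.

```python
# Toy accept-rate models for intuition only; not SGLang's accounting.
DRAFT_TOKENS = 4  # drafted tokens considered per verify cycle (assumed)

def linear_accepted(rate: float) -> float:
    # The simplification used in the text: accepted tokens scale linearly.
    return rate * DRAFT_TOKENS

def chain_accepted(rate: float) -> float:
    # Chain model: position i only survives if all earlier drafts did,
    # so expected accepted tokens = rate + rate**2 + ... + rate**k.
    return sum(rate ** i for i in range(1, DRAFT_TOKENS + 1))

for rate in (0.9, 0.6):
    print(f"accept rate {rate}: linear {linear_accepted(rate):.1f}, "
          f"chain {chain_accepted(rate):.1f} tokens per cycle")
```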
Lower accept rate means EAGLE is wasting work, and the big model is doing more solo decoding. The draft model in my setup is the official Mistral EAGLE checkpoint. It was trained on Mistral’s natural output distribution. So the question becomes: what kind of output is “Mistral-natural” and what is not?
The diagnosis
The initial-generation phase was producing text under heavy structural constraints: hit exactly 6,750 words, balance host word-share to 50 percent, include exactly one VIBE cameo and one CLAWI cameo, ensure 25 hedge-openers across the episode, output it all as a JSON array with a specific schema. All that pressure shows up in the actual tokens. The model generates with one eye on the constraint set, and that produces output that looks less like typical Mistral prose. The draft model has not seen that distribution at training time and starts mispredicting.
The naturalize phase is the opposite. The system prompt is short. The instruction is “rewrite these turns to sound more spoken.” There is no global word-count, no JSON schema, no balance ratio. Mistral writes the way it normally writes, and the draft model predicts well.
The recovery was visible in real time. During the Part 2 retry, whose prompt asks the model to “deepen and elaborate” rather than “satisfy these specs,” the accept rate climbed from 0.55 to 0.71 within a single generation. Loosen the constraint, get the speed back.
The trade-off
Slower throughput is not automatically worse. Here is the wall-time math.
Without the output contract, my v2 run produced 4,453 words against a target of 6,750. That is 66 percent of target, way under the floor, requiring a retry round-trip that adds two to three minutes.
With the output contract, my v3 run produced 3,064 words on Part 1 first try, 91 percent of that part’s target. Part 2 needed one retry to hit 95 percent. Total wall time was about the same. The fast version that misses the target costs more in retries than the slow version that hits the target on the first try.
So the contract is worth it. But you have to know what you are buying.
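If you want to run the same trade-off on your own numbers, here is a rough estimator. The tokens-per-word ratio and the retry overhead are assumptions on my part, not measurements, and the example comparison is illustrative rather than a replay of my runs.

```python
# Rough wall-time estimator for the contract-vs-no-contract trade-off.
TOKENS_PER_WORD = 1.4    # assumed average for English prose
RETRY_OVERHEAD_S = 150   # assumed per-retry round-trip cost (~2.5 minutes)

def wall_time_min(words: int, tok_per_s: float, retries: int = 0) -> float:
    # Each retry is modeled as another full generation pass plus overhead.
    one_pass_s = words * TOKENS_PER_WORD / tok_per_s
    return ((1 + retries) * one_pass_s + retries * RETRY_OVERHEAD_S) / 60

# Illustrative: a loose prompt at ~30 tok/s that misses the target and
# needs one retry, vs. a strict contract at ~14 tok/s that lands first try.
print(f"loose + 1 retry : {wall_time_min(6750, 30.0, retries=1):.0f} min")
print(f"strict, first try: {wall_time_min(6750, 14.0, retries=0):.0f} min")
```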
What this means in practice
Three things to take from this if you are running speculative decoding on your own hardware:
One, your throughput number is not a constant. Quote it as a range tied to the type of generation you are doing. “30 tokens per second on free-form prose, 14 tokens per second on structured output” is honest. “30 tokens per second” hides half the picture.
Two, EAGLE accept rate is a load-bearing metric. If your throughput tanks and your accept rate drops with it, the problem is your prompt forcing the model out of its draft-friendly distribution, not your hardware. Look at the prompt, not the GPU.
Three, structured contracts are not free, but the wall-time math can still work out. If a stricter prompt makes the output land on first try, the perf cost is paid back through fewer retries. Measure end-to-end, not just per-token.
The full live-monitoring pattern that exposed this is one line: `docker logs sglang-mistral4 | grep "throughput"` (add `-f` to follow the log live). Worth tailing during long runs to see what your actual content does to your actual hardware.
What I am still figuring out
A few questions this raised that I do not have clean answers for yet:
- Whether you can fine-tune the EAGLE draft model on your own structured-output distribution and recover the speed without giving up the contract. This would matter for production agentic pipelines that emit JSON or function calls all the time.
- Whether mid-prompt content type predicts accept rate well enough to dispatch to a different inference path automatically.
- Whether the same effect shows up on Mixture-of-Experts routing, or only on EAGLE.
For now the practical move is the boring one: when speed matters more than format, prompt loosely. When format matters more than speed, prompt strictly. And measure both.