The Leaderboard Said 239 Tokens a Second. My DGX Spark Said 71.
A community benchmark site lists tuned serving recipes for the NVIDIA DGX Spark, each with a throughput number next to it. Two of them, for the exact model I run in production, looked like a free speed-up: Qwen3.6-35B-A3B with DFlash speculative decoding at k=6 was listed at 138 tokens per second, and the same model with native MTP (multi-token prediction) speculation at 239 tokens per second. My production setup, the same model with DFlash at k=3, sits around 71 tokens per second by my own measurement. A 3.4x gap is worth an evening.
So I spent the evening. I benchmarked all of them on my own Spark, against my own prompts, with the same harness for each. Almost none of the headline numbers reproduced, and the reasons they did not are more useful than the numbers would have been.
The baseline, measured honestly
Production is rdtand/Qwen3.6-35B-A3B-PrismaQuant-4.75bit-vllm with the z-lab/Qwen3.6-35B-A3B-DFlash draft model at k=3, served on vLLM. I measure single-stream decode by sending one completion request for 256 tokens with a real technical prompt, then dividing output tokens by wall time. Three runs:
run 1: 256 tok in 3.59s = 71.3 tok/s
run 2: 256 tok in 3.91s = 65.5 tok/s
run 3: 256 tok in 3.29s = 77.8 tok/s
avg: 71.5 tok/s
That is the number every other config has to beat. It is not the leaderboard’s number, and that difference is the first lesson.
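The measurement described above is small enough to sketch in full. This is a minimal version, not the exact harness behind these numbers: it assumes an OpenAI-compatible `/v1/completions` endpoint that reports `usage.completion_tokens`, and the function names are illustrative. The last lines recompute the three runs from their raw token counts and wall times.

```python
import json
import time
import urllib.request


def tokens_per_second(tokens: int, seconds: float) -> float:
    """Decode throughput for one run: output tokens over wall time."""
    return tokens / seconds


def average(values: list[float]) -> float:
    """Plain mean of the per-run throughputs."""
    return sum(values) / len(values)


def measure_once(base_url: str, model: str, prompt: str, max_tokens: int = 256) -> float:
    """Send one completion request and return tokens per second.

    Assumption: the server exposes an OpenAI-compatible /v1/completions
    route and returns usage.completion_tokens in the response body.
    """
    body = json.dumps({"model": model, "prompt": prompt, "max_tokens": max_tokens}).encode()
    request = urllib.request.Request(
        base_url + "/v1/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    start = time.perf_counter()
    with urllib.request.urlopen(request) as response:
        payload = json.load(response)
    elapsed = time.perf_counter() - start
    return tokens_per_second(payload["usage"]["completion_tokens"], elapsed)


# The three runs reported above, recomputed from tokens and seconds.
runs = [tokens_per_second(256, seconds) for seconds in (3.59, 3.91, 3.29)]
print([round(r, 1) for r in runs], round(average(runs), 1))
```

Because the timer wraps the whole request, the figure includes prompt processing as well as decode, so it slightly understates pure decode speed. That is fine for comparing configs as long as every config is measured the same way.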
The image was not the lever, and I already knew that
My first suspicion was the container. The published recipes specify vllm-node-tf5, and I serve from dgx-vllm-eugr-nightly, which I swapped in months ago because the tf5 build pulls an NCCL (NVIDIA’s multi-GPU communication library) component from a source my network blocks. Two different builds of vLLM are an easy thing to blame for a missing 2x.
Except I had already ruled it out. In the companion write-up on these same numbers I pulled the exact tf5 image as a prebuilt artifact and ran the identical config on it. It produced about 73 tokens per second, within run-to-run noise of the 71.5 above, and the same draft acceptance. The image was never the lever. So the gap to the leaderboard’s 138 and 239 is not a build difference, which leaves the harder and more honest answer: these numbers are not reproducible on this Spark in this environment, the same conclusion that earlier round reached for the site’s headline 95.11 figure. Every config I tested landed in the 60 to 85 range depending on the config and the prompt, and 70 to 73 on a real technical prompt is the practical ceiling for this model on this box. A recipe is a model plus flags plus an image plus a harness plus a prompt, and when someone hands you only the first three and a single number, the number is narrower than it looks.
MTP does not beat a tuned DFlash for one user
The 239 recipe is native MTP, where the model’s built-in multi-token-prediction head proposes the draft tokens instead of a separate draft model. The whole difference between my production and the recipe is one flag, set in /data/scripts/vllm/qwen36-PRODUCTION.sh:
# Production (DFlash): a separate draft model proposes tokens
--speculative-config '{"method":"dflash","model":"z-lab/Qwen3.6-35B-A3B-DFlash","num_speculative_tokens":3}'
# The 239 recipe (MTP): the model's own built-in head proposes them, no draft model
--speculative-config '{"method":"mtp","num_speculative_tokens":3}'
The checkpoint ships the head: I confirmed mtp.fc.weight and the mtp.layers block are present in the weights, so the recipe is legitimate. It ran. On my box, single-stream, it gave 62.2 tokens per second against my production’s 71.5. The tuned DFlash setup won.
This is not a contradiction of the leaderboard so much as a different question. Speculative decoding helps most when verification has spare compute to absorb, which happens under concurrency. The MTP recipe is marked solo-only on the site, but the gain it is built for shows up when several requests share each verification step. For one user typing one prompt, a well-chosen draft model at the right k is already close to the ceiling, and MTP’s extra machinery does not pay for itself.
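I only measured a single stream that evening. For anyone who wants to test the concurrency question on their own box rather than take it on faith, here is a minimal sketch of an aggregate-throughput measurement. The `send` callable is a placeholder for one real request (for example, a function that posts a completion and returns the output token count); `fake_send` exists only so the sketch runs without a server.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def aggregate_tokens_per_second(send, concurrency: int) -> float:
    """Run `concurrency` requests in parallel; return total tokens over wall time.

    `send` is any zero-argument callable that performs one request and
    returns the number of output tokens it received.
    """
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        futures = [pool.submit(send) for _ in range(concurrency)]
        total_tokens = sum(future.result() for future in futures)
    elapsed = time.perf_counter() - start
    return total_tokens / elapsed


def fake_send() -> int:
    """Placeholder request: sleeps briefly and pretends 256 tokens came back."""
    time.sleep(0.05)
    return 256


for n in (1, 4):
    print(n, round(aggregate_tokens_per_second(fake_send, n)))
```

Reporting both the aggregate figure and the aggregate divided by `concurrency` keeps the two questions apart: how much the box serves in total, and how fast each individual user sees tokens arrive.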
The acceptance metrics tell the same story from the inside. Over one run, the draft proposed 831 tokens and the model accepted 501 of them, a 60 percent acceptance rate, about 1.8 accepted tokens per step at k=3. That is healthy. It is also content-dependent in a way the single leaderboard number hides: a predictable prompt (continue a counting sequence) ran at 85.6 tokens per second, while the same engine on a dense technical prompt ran at 66.1. Speculation rewards text the draft can guess. Your real workload decides your real speed.
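The arithmetic behind those acceptance figures is short enough to write down. This sketch assumes every verification step proposes exactly k draft tokens, which is how 831 proposals at k=3 become 277 steps; the function and key names are illustrative.

```python
def acceptance_stats(proposed: int, accepted: int, k: int) -> dict:
    """Derive acceptance metrics from raw draft counters.

    Assumption: each verification step proposes exactly k draft tokens.
    """
    steps = proposed / k
    return {
        "acceptance_rate": accepted / proposed,    # share of draft tokens kept
        "steps": steps,                            # verification passes run
        "accepted_per_step": accepted / steps,     # draft tokens kept per pass
        "wasted_drafts": proposed - accepted,      # draft tokens thrown away
    }


stats = acceptance_stats(proposed=831, accepted=501, k=3)
for name, value in stats.items():
    print(f"{name}: {value:.2f}")
```

The rate comes out at about 0.60 and the per-step figure at about 1.81, matching the numbers quoted above. The 330 discarded drafts are the cost side of the same ledger.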
k changes throughput and nothing else
Be precise about what the k in “DFlash k=6” controls, because it is easy to assume a higher number means a different output. It does not. Speculative decoding is lossless by construction: the verification step guarantees the emitted tokens are exactly the ones the main model would have produced on its own. Under greedy decoding, k=3, k=6, and no speculation at all generate identical text; with sampling, they draw from the same distribution. k is a throughput dial, not a behavior dial.
What it trades is acceptance against waste. A higher k proposes tokens further into the future, and those further tokens drift from what the main model would pick, so they get rejected and their draft compute is thrown away. My production comment from months ago reads “k=3 is best, k=6 wastes draft on the low-accept tail,” and the k=6 test never got far enough to check it: it would not even initialize on my box. With the recipe’s memory setting at 0.8 and a padding penalty that wasted up to 60 percent of the KV cache (the key-value cache, the model’s working memory), the engine ran out of room during startup and the core failed to come up. The leaderboard’s 138 for that config is, on my image, a crash.
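Both halves of that argument, identical output and growing waste, can be seen in a toy model. The sketch below is not vLLM’s implementation: `target_next` and `draft_next` are made-up stand-ins for the main and draft models, decoding is greedy, and verification is written as a sequential loop where a real engine would use one batched forward pass.

```python
def target_next(context: list[int]) -> int:
    """Stand-in for the main model: a deterministic greedy next-token rule."""
    return (sum(context) * 31 + len(context) * 7) % 50


def draft_next(context: list[int]) -> int:
    """Stand-in for the draft: agrees with the target except at every 4th position."""
    guess = target_next(context)
    return (guess + 1) % 50 if len(context) % 4 == 0 else guess


def generate(prompt: list[int], n_tokens: int, k: int) -> tuple[list[int], int, int]:
    """Greedy speculative decoding; returns (new tokens, drafts proposed, drafts accepted)."""
    out = list(prompt)
    proposed = accepted = 0
    while len(out) - len(prompt) < n_tokens:
        # Draft phase: the helper guesses k tokens ahead on its own.
        draft_context = list(out)
        guesses = []
        for _ in range(k):
            guesses.append(draft_next(draft_context))
            draft_context.append(guesses[-1])
        proposed += k
        # Verify phase: keep guesses only while they match the main model.
        for guess in guesses:
            truth = target_next(out)
            if guess != truth:
                out.append(truth)  # first mismatch: emit the main model's token, drop the rest
                break
            out.append(guess)
            accepted += 1
        else:
            out.append(target_next(out))  # every guess matched: one extra token from the main model
    return out[len(prompt):len(prompt) + n_tokens], proposed, accepted


prompt = [1, 2, 3]
baseline, _, _ = generate(prompt, 40, k=0)  # k=0 is plain decoding, one token per step
for k in (3, 6):
    text, proposed, accepted = generate(prompt, 40, k)
    print(f"k={k} identical={text == baseline} wasted_drafts={proposed - accepted}")
```

Every token appended to `out` is either a guess that equals the main model’s choice or the main model’s choice itself, which is why the text cannot change with k. In this toy, k=6 discards more drafts than k=3, but the draft errs on a fixed schedule, so the size of that gap is an artifact of the example and says nothing about real acceptance curves.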
The capability hiding in a quant
The most useful finding had nothing to do with speed. While setting up a comparison I checked the FP8 (eight-bit) build of the same model, Qwen/Qwen3.6-35B-A3B-FP8, and its config carried a vision_config, image and video token ids, a preprocessor, and a full set of model.visual.* weights. The FP8 checkpoint is a vision-language model. It reads images and video.
My production quant, the 4.75-bit PrismaQuant, is text-only. The quantization dropped the vision tower to save space. Same base model, same name on paper, but one quant can see and the other cannot. The FP8 ran at 62.8 tokens per second on text, slower than production because it is a larger 35 GB base, but it brings a modality the smaller quant threw away. If you ever want your local model to look at a screenshot or a dashboard, the capability is a quant choice, not a model choice, and it is not on any throughput leaderboard.
Update (2026-06-11). The premise above, that the smaller quant threw vision away and the FP8 is the only build that can see, no longer holds for production. The daily driver has since moved off the 4.75-bit PrismaQuant to a 4-bit Intel/Qwen3.6-35B-A3B-int4-mixed-AutoRound checkpoint, and that one keeps the full model.visual.* tower: 333 vision tensors, patch embed, deepstack blocks, all of it. The only reason it served text-only was a leftover --language-model-only flag carried over from the PrismaQuant launcher. Drop that flag, add --mm-processor-cache-type shm, and it reads images. Verified live by handing it the dashboard from this very stack: it returned the header text, the red and green status dots, and the service rows, at the fast quant’s footprint of 21 GB rather than the FP8’s 35 GB. It does this with DFlash speculative decoding still active, so there is no text-speed versus vision trade after all. The lesson survives in a sharper form. Dropping the vision tower was a property of that specific PrismaQuant build, not of four-bit quantization in general. “Can it see?” is a per-checkpoint fact you read off the weight map, not something you infer from the bit-width. The full build-and-benchmark story, where this same Qwen also out-scores a 120B I spent a day standing up, is in gpt-oss vs Qwen on a single Spark, and the head-to-head of all three quants on speed, accuracy, and vision is in the quant teardown.
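Reading that fact off the weight map takes a few lines. This is a minimal sketch under two assumptions: the checkpoint is a sharded Hugging Face directory with a `model.safetensors.index.json` file, and the vision and MTP tensors use the `model.visual.` and `mtp.` prefixes seen above. The demo builds a tiny fake checkpoint so the code runs without a download; apart from `mtp.fc.weight`, its tensor names are made up.

```python
import json
import tempfile
from pathlib import Path


def tensor_names(checkpoint_dir: str) -> list[str]:
    """List tensor names from a sharded checkpoint's safetensors index."""
    index = Path(checkpoint_dir) / "model.safetensors.index.json"
    return list(json.loads(index.read_text())["weight_map"])


def count_with_prefix(names: list[str], prefix: str) -> int:
    """Count tensors whose name starts with the given prefix."""
    return sum(1 for name in names if name.startswith(prefix))


def describe(checkpoint_dir: str) -> dict:
    """Summarize which optional parts a checkpoint actually ships."""
    names = tensor_names(checkpoint_dir)
    config = json.loads((Path(checkpoint_dir) / "config.json").read_text())
    return {
        "has_vision_config": "vision_config" in config,
        "vision_tensors": count_with_prefix(names, "model.visual."),
        "mtp_tensors": count_with_prefix(names, "mtp."),
    }


# Demo on a fake checkpoint directory (illustrative tensor names only).
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "config.json").write_text(json.dumps({"vision_config": {}}))
    (root / "model.safetensors.index.json").write_text(json.dumps({"weight_map": {
        "model.visual.patch_embed.proj.weight": "model-00001.safetensors",
        "model.language_model.embed_tokens.weight": "model-00001.safetensors",
        "mtp.fc.weight": "model-00002.safetensors",
    }}))
    report = describe(tmp)
print(report)
```

A checkpoint with a `vision_config` but zero `model.visual.` tensors is the case to watch for: the config promises a capability the weights no longer carry.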
Why I dropped the model the guide recommended
The most promising upgrade on paper was not a Qwen tweak at all. vLLM’s own DGX Spark guide names a 100-to-130B mixture-of-experts model in NVFP4 (NVIDIA’s four-bit format) as the current sweet spot for the box, and the concrete example it gives is NVIDIA’s Nemotron-3-Super-120B-A12B-NVFP4. A model roughly three times the size of my Qwen that still fits the Spark’s 128 GB of unified memory and runs at a usable speed is exactly the kind of free upgrade worth chasing. I started downloading it.
Then I checked the license, and stopped at 5 GB in. NVIDIA ships Nemotron under license:other, which resolves to the NVIDIA Open Model License. That license is self-hostable and commercial-friendly, but it is not a free software license: it carries NVIDIA-specific terms and is not OSI-approved open source. The Qwen checkpoints I run are Apache 2.0, and so are vLLM and the other engines in the stack.
Here is why that one line settled it. Self-hosting answers sovereignty: do the weights run on my own metal, with no cloud in the loop and no remote killswitch? Nemotron passes that test cleanly. It does not answer openness, which is a separate question: what does the license actually grant, and can anyone fork and redistribute it without a vendor’s permission? Apache 2.0 says yes, asking only that you keep the license and attribution notices. The NVIDIA Open Model License attaches vendor-specific terms. A model can be sovereign and still not open source, and Nemotron is exactly that case. For this project the two have to agree, so the recommended model was out before a single benchmark, on a license check that cost less than the download it cancelled. Faster is not the only axis, and on this stack it is not the deciding one.
The trade-offs in plain terms, for anyone new to this
If the jargon above lost you, here is the same thing without it. None of these choices change what the model says. They change how fast it says it and what it can look at.
Speculative decoding is a speed trick. A small, fast helper guesses the next few words, and the big slow model checks all the guesses in a single step instead of writing each word one at a time. When the guesses are right, you get several words for the price of one. When they are wrong, the guesses are thrown away and you paid a little extra for nothing. It never changes the output, only the speed. The two flavors I tested differ in where the guesser lives. DFlash is a separate small model you load next to the big one, which you can tune on its own. MTP is a guesser baked into the big model’s own weights, so there is nothing extra to manage. On a single user, my tuned DFlash was faster. MTP is built to shine when many people hit the model at once.
The number k is how many words ahead the helper guesses. Guess too far and the far guesses are almost always wrong, so the work is wasted. There is a sweet spot, three for me, and pushing it to six did not help and ran out of memory on my setup.
Quantization is compression. The full model is huge, so it gets squeezed to fit and run faster, and like any compression you trade size against fidelity. The heavy 4.75-bit squeeze I run is the smallest and fastest, but this particular build threw away the model’s ability to see images. The lighter FP8 squeeze is bigger and a touch slower, but it kept the eyes. Smaller and faster, or larger and able to look at a screenshot: that is the real choice, and it is the specific quantized checkpoint, not the model name, that decides it.
Here is the whole evening as one table:
| Approach | In plain words | Upside | Downside | Best for |
|---|---|---|---|---|
| DFlash speculation | a small helper guesses ahead, tunable | free speed, you can tune it | one more model to load | a single user, well tuned |
| Native MTP | the model guesses its own next words | nothing extra to manage | did not win for one user here | many users at the same time |
| 4.75-bit quant | heavy compression | smallest and fastest | drops vision, slight quality loss | pure text, maximum speed |
| FP8 quant | lighter compression | keeps image and video, higher fidelity | larger, a little slower | when you want the model to see |
The reason to run it yourself is that the right row is different for different people. A solo coder wants the first and third rows. A team serving many requests at once, or someone who needs the model to read screenshots, picks differently. The leaderboard only ever shows one row and calls it the answer.
What I kept
Production did not change. After an evening of swaps the winner was the setup I started with, which is its own kind of result: the tuned DFlash at k=3 is already at the single-user ceiling for this model on this image. The work was not wasted, because now the ceiling is measured rather than assumed, and the next time a leaderboard quotes a number I will reproduce it on my own box before I believe it.
The recipes are real and the people who tune them are doing careful work. A published throughput figure is just narrower than it looks. It is true for one container image, one measurement harness, one prompt distribution, and one concurrency level. Change those and the number can move by as much as a factor of three. The honest version of a benchmark is the one you ran yourself, on the box you actually serve from, against the prompts your users actually send.
Recipes referenced: the Spark-Arena benchmark recipes for Qwen3.6-35B-A3B on DGX Spark, the docai.hu write-up on Qwen3.6 MTP throughput on GB10, and the vLLM DGX Spark serving guide.