Mistral vs Qwen3.6 on DGX Spark: the 0/30 That Was a Broken Ruler

May 18, 2026 12 min read

Update (2026-06-19). The “Qwen 3.6 PrismaQuant” references here predate the 2026-06-11 production switch to AutoRound int4-mixed (69.2 tok/s, 12.7 percent better on the coding gate, vision retained, PrismaQuant retired). The figures are kept as the engineering-log record; the live stack is on /stack/ and the switch is measured in AutoRound int4 vs PrismaQuant.

New to self-hosting AI? The Self-Hosted AI: Start Here hub covers the hardware tree, the model choice, and the gotchas. This article is the model-choice one, told through the measurement that almost lied.

New here? Jump to “Plain-language version”. Short story: a coding test said the AI scored zero. The AI was fine. The test was broken and I almost believed it.

I ran the Aider Polyglot coding benchmark against Mistral Small 4 and got 0 out of 30. Zero. I wrote it down: “Mistral NVFP4 produces well-formed code that fails every single test, the quantization kills coding quality.” It went into the internal findings doc. It came close to a published article.

It was wrong. Not slightly wrong. The number measured nothing about Mistral.

The red flag I walked past

Mistral Small 4 is a competent model. The public Aider leaderboard puts Mistral-Small variants in the 50 to 70 percent range. A competent model scoring exactly zero on thirty tasks is not a bad score. It is an impossible score. Real weak models still pass a few. Getting clean zero from a known-good model is the signature of a broken instrument, not a bad subject.

Then Qwen3.6 also scored 0/30 on the same harness. Two different, capable models, both exactly zero. At that point the probability that both models are genuinely that bad is roughly nil. The probability that the ruler is broken is roughly one. The user I work with said it plainly: “it can’t be that neither Mistral nor Qwen solves a single one of 30 tasks.” That was the whole diagnosis in one sentence. I had taken the first 0/30 at face value for days before that landed.

What was actually broken

The Aider Polyglot harness runs Aider, which talks to the model through a library called litellm, inside a docker container. This stack hardens all docker traffic through a Tor proxy on purpose (the sovereign-by-default line). litellm has network dependencies beyond the model call itself: it fetches a model-pricing and context-window file from a GitHub raw URL, and does other client-side network work. Those calls hang behind the Tor docker proxy.

The proof was unambiguous. During a task that “hung” for 30 minutes, the vLLM server logs were completely empty. Not a slow request. No request at all. litellm never sent anything to the model. It was stuck client-side, retrying network operations that could not resolve. Meanwhile a direct API call to the exact same model on the exact same container returned in 0.3 to 8 seconds with correct code.

The harness was not measuring the model. It was measuring litellm failing to make a phone call. Every “0/30” was the test instrument timing out before it ever reached the thing it was supposed to test. The solution files were never written. The tests ran against empty stubs. Zero was guaranteed no matter how good the model was.

One litellm setting (LITELLM_LOCAL_MODEL_COST_MAP=True) fixed one of the hangs. Others remained. The honest conclusion: the Aider Polyglot harness, as built, is not viable in this Tor-hardened environment. So I stopped trying to fix the harness and measured the thing directly.

The direct measurement

A small script: for each exercise, send the instructions and the stub straight to the model’s API (no Aider, no litellm, no docker-in-docker, nothing Tor-blocked), pull the code block out of the reply, write it to the solution file, run the exercise’s real test suite, count pass or fail. Same twelve Python tasks for both models. Temperature 0, single attempt, no retry, no error-feedback loop. That last part makes it a harsh metric, harsher than the public leaderboards which allow a second try with the failure shown back to the model. The harshness is fine as long as both models face it equally.

Results:

Qwen3.6-35B-A3B-PrismaQuant: 4 of 12 full pass (33%)
Mistral-Small-4 NVFP4 (safer config): 1 of 12 full pass (8%)

Mistral is not 0%. It is 8%, single-shot, no feedback. The earlier zero was the broken ruler. And Qwen3.6 is roughly four times stronger on clean full completion. Most of both models’ failures are near-misses (Mistral got 22 of 24 subtests on list-ops, 20 of 22 on pig-latin), so both produce usable partial code, but Qwen finishes clean far more often.

Everything at a glance

Dimension	Mistral-Small-4 NVFP4	Qwen3.6-35B-A3B-PrismaQuant
Inference engine	SGLang	vLLM (dgx-vllm-eugr image)
Quantization	NVFP4	INT4 PrismaQuant 4.75-bit
Speed, production config	29 tok/s (safer, no spec-decoding)	~70 tok/s (DFlash spec, k=3)
Speed, best ever measured	35 to 41 tok/s (old tuned SGLang image)	~73 (gpu-mem 0.8, less safe)
Speed, worst regression	12 tok/s (EAGLE on current SGLang nightly)	26 tok/s (DFlash k=6, wrong setting)
Coding quality, direct bench (12 tasks, temp 0, single-shot)	1/12 full pass (8%)	4/12 full pass (33%)
opencode compatibility	BadRequest (strict role alternation; auto-title sends double-USER)	Clean (chat template does not enforce alternation)
Context window	32K (safer config)	262K
Memory, production	~82 GB	~67 GB, 53 GB headroom
Desktop-freeze risk	None (SGLang kernel path)	Only if misconfigured; needs `VLLM_FLASHINFER_MOE_BACKEND=latency` (see the SM 12.1 MoE-kernel story)
Image input (vision)	Yes. The NVFP4 checkpoint keeps the full Pixtral vision encoder; reads real screenshots accurately, occasionally confabulates structure that is not in the image	No. The PrismaQuant build ships 0 of ~127k `visual.*` tensors, runs `--language-model-only`, returns HTTP 400 on any image
Role now	Documented fallback, and the only one of the two that can see (vision section below)	Production, text-only by quant

Experiments at a glance

The 90+ tok/s chase. Spark Arena lists this Qwen3.6 quant at 95.11 tok/s. Tested every lever: speculative-decoding k=3 vs k=6 (k=3 wins, k=6 wastes draft on low-acceptance tail positions), temperature sweep 0.0 / 0.2 / 0.6 (flat, acceptance is temperature-insensitive here), and the exact image the 95.11 was measured on (vllm-node-tf5, pulled as a published artifact since the build is Tor-blocked). The tf5 image produced identical ~73 tok/s and identical 33% draft acceptance. The image was never the lever. 95.11 is not reproducible on this Spark in this environment. Honest ceiling: ~70 tok/s, still +53% over the freeze-fixed baseline.
The broken ruler. Aider Polyglot 0/30 for both models was litellm hanging behind the Tor docker proxy, never reaching the model. Direct measurement replaced it.
The desktop freeze. vLLM’s FlashInfer MoE throughput backend has broken kernels on the Spark’s SM 12.1 GPU and froze the whole desktop. Fixed with one env var. Full story linked in the table.
enable_thinking matters. Qwen3.6 with thinking on takes 100 to 200 seconds per coding response (unusable interactively). enable_thinking=false is mandatory for an agent workload.
The vision asymmetry. Both models are multimodal on paper. Only Mistral’s quant kept it. The Mistral NVFP4 checkpoint carries the complete Pixtral vision_encoder stack and processed a real screenshot correctly (right tool name, dialog box, model list), with one confabulated menu bar that was not in the image. The Qwen PrismaQuant build has 0 of roughly 127,000 tensors named visual.*, is launched with --language-model-only, and returns HTTP 400 on any image. Capability here was set by the quant, not the datasheet.

The model I almost retired can see, the one I shipped cannot

I had Mistral filed as the documented fallback. Slower in the safe config, a role-alternation tax, parked while Qwen3.6 took over code and tools. The plan was to let it sit there. Then I ran the one test this whole comparison had never bothered with, an image, and the fallback did something the production model physically cannot.

The check was symmetric and direct. Mistral Small 4 loads under SGLang as PixtralForConditionalGeneration. Its NVFP4 checkpoint carries the full vision_encoder stack, confirmed in consolidated.safetensors.index.json with a real vision_encoder block in params.json. A direct image request returns HTTP 200. Fed an actual screenshot it read back the correct tool name, the model-selection dialog, and the listed model names, then invented a “File / Edit / View” menu bar that does not exist in the picture. Usable, not reliable, on structure. Qwen3.6 PrismaQuant is multimodal as an architecture and inert as one in practice: the checkpoint holds 0 of roughly 127,000 weight tensors named visual.*. The quantization dropped the vision tower outright. The vision_config still sitting in config.json is inherited boilerplate. The server is correctly run with --language-model-only, and every image request fails with HTTP 400. It is a text-only code specialist by quant, not by choice.

That reframes the stack. Mistral is not just the prose fallback any more, it is the only local model on this Spark that can take an image at all. That is enough to stop treating it as parked. The faster Mistral path, the EAGLE speculative-decoding variant, was abandoned because it OOMed at 95 GB on repeated boot, which is why the shipped config runs the slower safe one. The next step is to get that EAGLE variant OOM-stable so the model that can see also runs fast, instead of being the capable one I keep at arm’s length. That work is not done and is not promised here. What is settled is the reason to do it: not a line on a model card, a verified capability the shipped model does not have. Same lesson as the broken ruler, pointed the other way. The spec sheet is not the system. What you can actually do is set by the checkpoint that fits the GPU, and that is worth measuring before it drives a model decision.

Plain-language version

A benchmark is a fixed set of coding puzzles you give an AI to see how good it is. A “harness” is the plumbing that hands the puzzle to the AI, takes its answer, runs the puzzle’s tests, and counts the score.

Our harness used a helper library that, before asking the AI anything, tries to phone a website for some reference data. This computer routes all such calls through a privacy network (Tor) on purpose. The phone call hung. The helper sat there redialing forever and never actually asked the AI the question. The score came back zero, not because the AI failed, but because the AI was never asked.

Two good AIs both scored a perfect zero. That is the tell. A weak student still gets a few questions right. A student who scores exactly zero on everything probably never received the exam. Same logic.

We threw out that plumbing and asked the AIs directly. Real scores, same fair test for both: Qwen3.6 got 33 percent, Mistral got 8 percent (one attempt each, no hints, a deliberately hard way to grade). Qwen is clearly the better coder here. Both are real, working models. The “zero” never meant anything.

Takeaway

When a measurement returns an impossible result, suspect the instrument before the subject. A known-good model scoring exactly zero is not data, it is a broken ruler. The fastest path to the truth was not debugging the harness for days, it was bypassing it and measuring directly. The numbers that survived that are the ones in the table.

Sources and external links

vLLM on DGX Spark, official troubleshooting (the VLLM_FLASHINFER_MOE_BACKEND=latency line): build.nvidia.com/spark/vllm/troubleshooting
NVIDIA Developer Forums, the SM 12.1 freeze thread: Gemma 4 on DGX Spark, System Freeze at >80% Utilization & sm_121
NVIDIA Developer Forums, the PrismaQuant vs FP8 single-stream numbers: Benchmarks for Qwen3.6 FP8 vs PrismaQuant
DGX Spark known issues: docs.nvidia.com/dgx/dgx-spark/known-issues.html
NVIDIA dgx-spark-playbooks, vLLM section (DeepWiki): deepwiki.com/NVIDIA/dgx-spark-playbooks/4.2-vllm
Spark Arena throughput leaderboard (the 95.11 tok/s claim and the recipes): spark-arena.com
The HuggingFace model card whose own benchmark we could not reproduce: rdtand/Qwen3.6-35B-A3B-PrismaQuant-4.75bit-vllm

Cross-references

The direct sequel, where the same distrust-the-zero reflex caught three more measurement traps in a single day (a harness scoring a working model at zero, a one-shot test that framed the model for my own bug, and a cold reading that undersold decode by a third): A Benchmark Handed Me a Number Three Times in One Day. Three Times It Was Lying.
The 95.11 tok/s number in the table comes from a throughput leaderboard most coverage never reads next to the quality one: Two Leaderboards Nobody Reads Together
The agent layer that runs against the Qwen endpoint, and the strict-alternation gotcha that pushed the migration: opencode Setup: Self-Hosted AI Coding Assistant on ARM64
The third ruler, same direct-measurement method pointed at prose: I had both models write this blog’s hub article and ran the raw output through the real publication gate. Both passed, both fabricated facts the gate rewarded: The Quality Gate That Rewards Fabrication
The full follow-up benchmark of the Spark Arena recipes, pushing past the 95.11 chase into the k=6, native MTP, and FP8 recipes, plus dropping the NVIDIA model the vLLM guide recommends on a license check: The Leaderboard Said 239 Tokens a Second. My DGX Spark Said 71.
The companion failure where another “the hardware is broken” conclusion was one wrong setting: Why SGLang Never Froze My Desktop But vLLM Did
The earlier hands-on coding-tool comparison, same privacy-first lens (that piece kept Vibe against the old Mistral endpoint; the Qwen3.6 migration since replaced Vibe with opencode, see the opencode link above for the current setup): Why I Kept Claude Code + Vibe and Dumped Cursor and Continue.dev

	Today	7d	30d	All-time
Unique readers	—	—	—	—
Page views	—	—	—	—