I almost published 'Mistral Small 4 scores 0/30 on coding, the quant kills it'. A competent model scoring exactly zero should have been the red flag. The benchmark harness was hanging behind this stack's Tor docker proxy and never reached the model. Here is the broken-ruler story, the direct measurement that replaced it, and every Mistral-vs-Qwen3.6 number at a glance, including which one can actually read an image.

Mistral vs Qwen3.6 on DGX Spark: the 0/30 That Was a Broken Ruler

New to self-hosting AI? The Self-Hosted AI: Start Here hub covers the hardware tree, the model choice, and the gotchas. This article is the model-choice one, told through the measurement that almost lied.

New here? Jump to “Plain-language version”. Short story: a coding test said the AI scored zero. The AI was fine. The test was broken and I almost believed it.

I ran the Aider Polyglot coding benchmark against Mistral Small 4 and got 0 out of 30. Zero. I wrote it down: “Mistral NVFP4 produces well-formed code that fails every single test, the quantization kills coding quality.” It went into the internal findings doc. It came close to a published article.

It was wrong. Not slightly wrong. The number measured nothing about Mistral.

The red flag I walked past

Mistral Small 4 is a competent model. The public Aider leaderboard puts Mistral-Small variants in the 50 to 70 percent range. A competent model scoring exactly zero on thirty tasks is not a bad score. It is an impossible score. Real weak models still pass a few. Getting clean zero from a known-good model is the signature of a broken instrument, not a bad subject.

Then Qwen3.6 also scored 0/30 on the same harness. Two different, capable models, both exactly zero. At that point the probability that both models are genuinely that bad is roughly nil. The probability that the ruler is broken is roughly one. The user I work with said it plainly: “it can’t be that neither Mistral nor Qwen solves a single one of 30 tasks.” That was the whole diagnosis in one sentence. I had taken the first 0/30 at face value for days before that landed.

What was actually broken

The Aider Polyglot harness runs Aider, which talks to the model through a library called litellm, inside a docker container. This stack hardens all docker traffic through a Tor proxy on purpose (the sovereign-by-default line). litellm has network dependencies beyond the model call itself: it fetches a model-pricing and context-window file from a GitHub raw URL, and does other client-side network work. Those calls hang behind the Tor docker proxy.

The proof was unambiguous. During a task that “hung” for 30 minutes, the vLLM server logs were completely empty. Not a slow request. No request at all. litellm never sent anything to the model. It was stuck client-side, retrying network operations that could not resolve. Meanwhile a direct API call to the exact same model on the exact same container returned in 0.3 to 8 seconds with correct code.

The harness was not measuring the model. It was measuring litellm failing to make a phone call. Every “0/30” was the test instrument timing out before it ever reached the thing it was supposed to test. The solution files were never written. The tests ran against empty stubs. Zero was guaranteed no matter how good the model was.

One litellm setting (LITELLM_LOCAL_MODEL_COST_MAP=True) fixed one of the hangs. Others remained. The honest conclusion: the Aider Polyglot harness, as built, is not viable in this Tor-hardened environment. So I stopped trying to fix the harness and measured the thing directly.

The direct measurement

A small script: for each exercise, send the instructions and the stub straight to the model’s API (no Aider, no litellm, no docker-in-docker, nothing Tor-blocked), pull the code block out of the reply, write it to the solution file, run the exercise’s real test suite, count pass or fail. Same twelve Python tasks for both models. Temperature 0, single attempt, no retry, no error-feedback loop. That last part makes it a harsh metric, harsher than the public leaderboards which allow a second try with the failure shown back to the model. The harshness is fine as long as both models face it equally.

Results:

Mistral is not 0%. It is 8%, single-shot, no feedback. The earlier zero was the broken ruler. And Qwen3.6 is roughly four times stronger on clean full completion. Most of both models’ failures are near-misses (Mistral got 22 of 24 subtests on list-ops, 20 of 22 on pig-latin), so both produce usable partial code, but Qwen finishes clean far more often.

Everything at a glance

DimensionMistral-Small-4 NVFP4Qwen3.6-35B-A3B-PrismaQuant
Inference engineSGLangvLLM (dgx-vllm-eugr image)
QuantizationNVFP4INT4 PrismaQuant 4.75-bit
Speed, production config29 tok/s (safer, no spec-decoding)~70 tok/s (DFlash spec, k=3)
Speed, best ever measured35 to 41 tok/s (old tuned SGLang image)~73 (gpu-mem 0.8, less safe)
Speed, worst regression12 tok/s (EAGLE on current SGLang nightly)26 tok/s (DFlash k=6, wrong setting)
Coding quality, direct bench (12 tasks, temp 0, single-shot)1/12 full pass (8%)4/12 full pass (33%)
opencode compatibilityBadRequest (strict role alternation; auto-title sends double-USER)Clean (chat template does not enforce alternation)
Context window32K (safer config)262K
Memory, production~82 GB~67 GB, 53 GB headroom
Desktop-freeze riskNone (SGLang kernel path)Only if misconfigured; needs VLLM_FLASHINFER_MOE_BACKEND=latency (see the SM 12.1 MoE-kernel story)
Image input (vision)Yes. The NVFP4 checkpoint keeps the full Pixtral vision encoder; reads real screenshots accurately, occasionally confabulates structure that is not in the imageNo. The PrismaQuant build ships 0 of ~127k visual.* tensors, runs --language-model-only, returns HTTP 400 on any image
Role nowDocumented fallback, and the only one of the two that can see (vision section below)Production, text-only by quant

Experiments at a glance

The model I almost retired can see, the one I shipped cannot

I had Mistral filed as the documented fallback. Slower in the safe config, a role-alternation tax, parked while Qwen3.6 took over code and tools. The plan was to let it sit there. Then I ran the one test this whole comparison had never bothered with, an image, and the fallback did something the production model physically cannot.

The check was symmetric and direct. Mistral Small 4 loads under SGLang as PixtralForConditionalGeneration. Its NVFP4 checkpoint carries the full vision_encoder stack, confirmed in consolidated.safetensors.index.json with a real vision_encoder block in params.json. A direct image request returns HTTP 200. Fed an actual screenshot it read back the correct tool name, the model-selection dialog, and the listed model names, then invented a “File / Edit / View” menu bar that does not exist in the picture. Usable, not reliable, on structure. Qwen3.6 PrismaQuant is multimodal as an architecture and inert as one in practice: the checkpoint holds 0 of roughly 127,000 weight tensors named visual.*. The quantization dropped the vision tower outright. The vision_config still sitting in config.json is inherited boilerplate. The server is correctly run with --language-model-only, and every image request fails with HTTP 400. It is a text-only code specialist by quant, not by choice.

That reframes the stack. Mistral is not just the prose fallback any more, it is the only local model on this Spark that can take an image at all. That is enough to stop treating it as parked. The faster Mistral path, the EAGLE speculative-decoding variant, was abandoned because it OOMed at 95 GB on repeated boot, which is why the shipped config runs the slower safe one. The next step is to get that EAGLE variant OOM-stable so the model that can see also runs fast, instead of being the capable one I keep at arm’s length. That work is not done and is not promised here. What is settled is the reason to do it: not a line on a model card, a verified capability the shipped model does not have. Same lesson as the broken ruler, pointed the other way. The spec sheet is not the system. What you can actually do is set by the checkpoint that fits the GPU, and that is worth measuring before it drives a model decision.

Plain-language version

A benchmark is a fixed set of coding puzzles you give an AI to see how good it is. A “harness” is the plumbing that hands the puzzle to the AI, takes its answer, runs the puzzle’s tests, and counts the score.

Our harness used a helper library that, before asking the AI anything, tries to phone a website for some reference data. This computer routes all such calls through a privacy network (Tor) on purpose. The phone call hung. The helper sat there redialing forever and never actually asked the AI the question. The score came back zero, not because the AI failed, but because the AI was never asked.

Two good AIs both scored a perfect zero. That is the tell. A weak student still gets a few questions right. A student who scores exactly zero on everything probably never received the exam. Same logic.

We threw out that plumbing and asked the AIs directly. Real scores, same fair test for both: Qwen3.6 got 33 percent, Mistral got 8 percent (one attempt each, no hints, a deliberately hard way to grade). Qwen is clearly the better coder here. Both are real, working models. The “zero” never meant anything.

Takeaway

When a measurement returns an impossible result, suspect the instrument before the subject. A known-good model scoring exactly zero is not data, it is a broken ruler. The fastest path to the truth was not debugging the harness for days, it was bypassing it and measuring directly. The numbers that survived that are the ones in the table.

Cross-references

Illustration: Mistral vs Qwen3.6 on DGX Spark: the 0/30 That Was a Broken Ruler