Mistral Small 4 vs Qwen 3.6 vs GLM-5.1 on a Single DGX Spark
Qwen 3.6 PrismaQuant wins on coding throughput and tool-call cleanliness. Mistral Small 4 NVFP4 holds the creative-prose and verified-vision slot as a kept-in-reserve fallback. GLM-5.1 at 754B total parameters does not fit on a single Spark, and the reason it does not fit is the most useful lesson in this comparison.
Quick Take
- Coding agent primary: Qwen 3.6 PrismaQuant 4.75bit. 57 to 62 tok/s sustained under DFlash speculative decoding on my own pipeline, 73.4 percent SWE-Bench Verified, 97 percent ToolCall-15. Up from the 45 tok/s baseline I shipped in the original Spark Arena measurement.
- Creative prose, vision, and safer fallback: Mistral Small 4 NVFP4. 36.5 tok/s with safer-eagle (switch.sh default since 2026-05-22, EAGLE regression resolved on current SGLang nightly); 29 tok/s no-EAGLE available as documented rollback baseline. Vision tower survives quantization. German prose strong. Reactivated by the switch.sh mutex when the workload calls for it.
- GLM-5.1 on single Spark: 754B total parameters with 40B active per token. AWQ INT4 footprint is roughly 377 GB, almost three times the Spark’s 128 GB unified memory budget. Single-Spark deployment is not possible.
- The honest pattern: two models on disk, one hot at a time, mutex enforced. The switch.sh script flips between Qwen on vLLM port 30001 and Mistral on the SGLang stack, with a Watchtower disable-label that stopped a 385-restart cycle. Pick the right tool per session, not per machine.
- The trap: picking by parameter count. GLM-5.1 at 754B total is technically the most capable model on this list and the most useless on this hardware.
The cold table
Three models. One Spark. All numbers from measurements I or the public Spark Arena leaderboard have run, except where labelled as vendor-published.
| Dimension | Mistral Small 4 NVFP4 | Qwen 3.6 PrismaQuant 4.75bit | GLM-5.1 (AWQ INT4) |
|---|---|---|---|
| Architecture | Dense | MoE | MoE (massive) |
| Total parameters | 119B | 35B (3B active) | 754B (40B active per token) |
| Disk footprint | ~60 GB | 22 GB | ~377 GB |
| Single-Spark fit | ✅ | ✅ | ❌ does not fit |
| Memory bandwidth bottleneck | dense, bandwidth-bound | sparse, throughput-friendly | n/a |
| Single-stream interactive | 36.5 tok/s safer-eagle (switch.sh default); 29 tok/s no-EAGLE rollback | 57 to 62 tok/s with DFlash | n/a |
| Verified vision (in chosen quant) | yes | no (stripped by PrismaQuant) | n/a |
| SWE-Bench Verified | ~58% (Devstral lineage) | 73.4% | (vendor-published SOTA on SWE-Bench Pro, not directly comparable) |
| Tool-call cleanliness | needs alternating-roles patch | 97% out of box | n/a |
| License | Apache 2.0 (Voxtral encoder gated) | Apache 2.0 (no gating) | MIT |
| German prose | strong | weak (irrelevant for English work) | unknown |
| Speculative decoding | EAGLE (safer-eagle default since 2026-05-22; regression resolved on current nightly) | MTP n=3 with DFlash | n/a |
| Best for | creative prose, vision, fallback | coding, tool-calling, agent stacks | not a single-Spark choice |
The single biggest decision the table makes for you is “does my workload fit one model or two.” Most one-operator stacks split cleanly between “code and agent tools” and “creative prose, marketing, image-reading.” If your workload splits, run both on disk and flip with a mutex. If it does not, pick one.
Why Qwen 3.6 is the primary now
The architecture is MoE with 3B active parameters per token of a 35B total. The Spark’s unified-memory architecture is the right shape for this kind of sparse activation. The original Spark Arena measurement at rank 4 had Qwen 3.6 PrismaQuant sustaining around 45 tok/s per-token throughput, which was already enough to make it the primary. (See Spark Arena Rank 4 Made Me Add Qwen3.6 for the receipt with the full configuration.)
The number on my own pipeline today is materially higher. With DFlash speculative decoding enabled and the production config (mem-fraction 0.5, k=3, 15-request stable), Qwen 3.6 sustains 57 to 62 tok/s on the operator workload. The configuration is in /data/scripts/vllm/qwen36-launch.sh, extracted verbatim from the verified-running container.
The 73.4 percent SWE-Bench Verified score is the operational difference between an agent that fixes the GitHub issue first try and an agent that writes plausible-looking code you then debug for an hour. The gap from Mistral’s ~58 percent is roughly fifteen points, which in agent-tooling terms is the difference between “useful” and “theatrical.”
Tool-call cleanliness is the unobvious win. Mistral on SGLang needs a side-car proxy to work around the alternating-roles BadRequestError; see Fixes: OpenClaw Mistral Alternating Roles for the worked example. Qwen 3.6 with the standard vLLM stack reports 97 percent ToolCall-15 accuracy on a single Spark, no patches needed. One fewer proxy in the stack is one fewer thing to maintain through every vLLM version bump.
Why Mistral Small 4 stays installed (and is the safer fallback)
Mistral is dense, which means every parameter activates on every token, which means the Spark’s memory bandwidth is the bottleneck. The current switch.sh default for Mistral is safer-eagle at 36.5 tok/s decode, verified 2026-05-22 on the current SGLang nightly after the earlier EAGLE regression (which had dropped decode to 12.5 tok/s on the previous nightly build) resolved. The no-EAGLE safer config at 29 tok/s remains documented as a rollback baseline if the regression reappears.
The vision tower is the under-appreciated reason Mistral stays on disk. The NVFP4 quantization preserves the Pixtral-lineage vision capability; the PrismaQuant 4.75bit quantization of Qwen 3.6 drops the vision tower entirely. (See Mistral vs Qwen 3.6: The Zero That Was a Broken Ruler for the verified-the-hard-way version of this finding. The verification used a real screenshot HTTP-200 round-trip, not a vendor data sheet.) Image-reading workloads route to Mistral by default.
German prose is the last reason. The blog is English-only, but the consulting practice and several European customers write in German, and Mistral handles German with the same fluency it handles French and English. Qwen 3.6’s German output is weak enough that I would not ship it to a German-speaking customer without a human pass.
The reframe from May 2026 is that Mistral is no longer the primary. It is the kept-in-reserve fallback that the operator flips to when the workload calls for vision or creative prose. The next section is the mutex that makes that flip safe.
Why GLM-5.1 does not fit, and what that teaches you
GLM-5.1 is the most capable model in this list by total parameter count, and it is also genuinely impressive on architecture. The model shipped open-weights on 2026-04-07 under MIT license, with the API live since 2026-03-27. The 754B total parameters with 40B active per token, the 200K context window, and the vendor-published SWE-Bench Pro state-of-the-art claim put it in a different capability class than the other two models in this comparison.
The AWQ INT4 quantization is roughly 377 GB on disk. The Spark’s unified memory budget is 128 GB. Three times over.
Z.ai’s published deployment guides target a 4× H200 SXM5 cluster at FP8 (754 GB) or a 4× H200 or 5× A100 80GB cluster at AWQ INT4 (377 GB). There is no single-Spark configuration. There is no extreme quantization in any public release that would compress the model into the Spark’s memory budget without degrading capability past usefulness, and synthesizing one would be a research project rather than a deployment.
The lesson is that parameter count alone is not a capability ranking on a fixed hardware budget. The right question is not “which model is largest” but “which model is largest and still fits the hardware envelope I have decided to operate.” GLM-5.1 is impressive on a four-Spark cluster, or on a single 8-H200 box, or on any of the cloud APIs that route to Z.ai’s hosted inference. It is irrelevant on a single Spark. (For the broader argument about how benchmark numbers do not survive contact with hardware reality, see Two Leaderboards Nobody Reads Together.)
The vendor-published “8-hour autonomous execution” claim from Z.ai’s launch material is worth flagging honestly. It is a marketing number tied to a benchmark harness, not an operational guarantee on real engineering work. Same caution as every other vendor SOTA claim on this site: cite the configuration or do not cite the number.
The switch.sh mutex pattern
The way to run this stack is not “pick one model” but “two on disk, one hot, mutex enforced.” Both PrismaQuant 4.75bit Qwen at 22 GB and NVFP4 Mistral at 60 GB sit on the same SSD. Only one is loaded into unified memory at a time, because hot-loading both means unified memory contention and Spark instability.
The mutex is /data/scripts/llm/switch.sh qwen|mistral|none|status. Termux-friendly. The script handles the systemctl start/stop pair, the Watchtower disable-label that stopped a 385-restart cycle on vllm-qwen36 and sglang-mistral4, and a sanity check that confirms which model is currently hot. The script delegates to qwen36-switch-from-mistral.sh and start-mistral.sh for the actual container lifecycle.
The vLLM FlashInfer-MoE freeze receipt is worth knowing if you are running Qwen on Spark with the default backend. vLLM’s FlashInfer MoE throughput path bricks on SM 12.1 in a way that triggers a unified-memory cascade that pulls the desktop session down with it. The fix is VLLM_FLASHINFER_MOE_BACKEND=latency. SGLang’s path never used the bricked kernel and so Mistral never froze on that exact failure mode. (See Fixes: vLLM MoE Throughput sm121 Desktop Freeze for the debug log.)
The systemd unit vllm-qwen36.service exists but is deliberately not enabled at boot. Mutual exclusion with Mistral is an operator job, not a systemd default, because picking the wrong default at boot would either lock the operator into Qwen for vision work that needs Mistral or have both services race for memory at startup.
How to read this against your own workload
Three reader profiles get three answers.
If your workload is coding-agent-heavy (opencode, Aider, MCP tool calls, structured output), run Qwen 3.6 PrismaQuant as the primary and stop. The vision capability you lose is rarely needed in coding workflows, and the throughput gain at 57 to 62 tok/s is large enough to be felt in every interaction. If you later add a vision workload, add Mistral as a secondary then. The switch.sh mutex is the path.
If your workload is content-creation-heavy (long-form blog posts, marketing copy, image-reading for accessibility, German-language work), Mistral is the primary and Qwen is the secondary. The decode-speed penalty against Qwen is not felt in content workflows where the bottleneck is editorial review, not generation speed.
If your workload is mixed, run both on disk and flip with the mutex. The operational complexity of a switch is the price of getting the right model on each call. The complexity is real; the throughput-and-capability gain is larger than the cost of a one-line switch.sh invocation.
If your workload genuinely needs a 754B-class model, you are not on a single Spark. You are on a four-Spark cluster, a multi-H200 box, or a hosted API. That is a different article.
Where this article fits in the larger comparison
This piece is the model-stack-level comparison. The hardware-stack-level comparison is DGX Spark vs M3 Ultra Mac Studio and the cost model that contextualizes both is Self-Hosted AI vs Cloud APIs: Real Total Cost. The reference architecture that combines all the choices is the hub article Sovereign AI Stack 2026 Reference Architecture.
The strategic context for why Qwen 3.6 replaced Mistral as the primary in May 2026 is in Spark Arena Rank 4 Made Me Add Qwen3.6. The verified vision-asymmetry finding that changed how I think about open-weights quantization is in Mistral vs Qwen 3.6: The Zero That Was a Broken Ruler.
What to expect next
The DFlash-stable-state measurements are now the running baseline in production. The EAGLE regression resolved on the 2026-05-22 SGLang nightly, pushing Mistral’s default back up to 36.5 tok/s. The next open question is whether that result holds across nightly builds or reverts; the no-EAGLE rollback baseline stays documented as the safety net.
Follow on Nostr (link in footer) or subscribe to the RSS feed at /rss.xml for the next benchmark when it lands.