Spark Arena Rank 4 Made Me Add Qwen3.6 to My DGX Spark
Mistral Small 4 NVFP4 (119B) has been the brain of my DGX Spark for two months. It runs every blog-pipeline prompt, every coding session, every podcast script rewrite. It works. It also has tics I have grown tired of: maritime image prompts that rhyme with each other for ten articles in a row, em-dashes in prose, a Voxtral encoder gated by Mistral so voice cloning is locked away from open users, and an alternating-roles bug that needs a side-car proxy before it talks OpenAI-compatible cleanly.
The plan: I am not deleting Mistral. I will add a second model that beats it on the metrics that matter for an agent stack, and keep Mistral installed for the workloads where it is still better. The intended new primary on my single DGX Spark is Qwen3.6-35B-A3B PrismaQuant 4.75bit (Alibaba, Apache 2.0, released April 16, 2026, quantized by Rob Tand for vLLM). It clears 73.4% on SWE-Bench Verified with only 3B active parameters of 35B total, fits in 22 GB of unified memory, and the Spark Arena leaderboard ranks it at 95.11 tokens per second decode on a single Spark, which is the fourth-fastest entry on that board across all sizes and quantizations.
That is 2.7x my measured Mistral throughput on the same hardware, on paper. Mistral will stay installed and answer the creative-writing calls until Gemma-4-31b or a successor proves better there. opencode will replace vibe and OpenClaw as the CLI driver. This article is the model-stack plan with the receipts. The implementation runs over the next two days. The day-2 measurements get published as a follow-up around 2026-05-25 so the throughput claims here get verified on my own pipeline, not just Spark Arena.
Quick Take
- Qwen3.6-35B-A3B PrismaQuant 4.75bit becomes the new code-and-tools primary on the DGX Spark. Mistral Small 4 stays installed as the creative-writing fallback because nothing open has clearly beaten it on prose yet
- Qwen3.6 scores 73.4% on SWE-Bench Verified vs Mistral’s ~58-65%, has 97% ToolCall-15 accuracy without alternating-roles patches, is multimodal, and ships Apache 2.0 with no Voxtral-style encoder gating
- Throughput. Mistral Small 4 with EAGLE: 35 tok/s avg, 13-41 range by workload, from my own SGLang vibe benchmark. Qwen3.6 PrismaQuant: 95.11 tok/s on Spark Arena rank 4, single Spark, vLLM INT4. That is 2.7x faster than Mistral average and beats gpt-oss-120b on dual Spark (75.96 tok/s)
- PrismaQuant is 22 GB on disk vs 60 GB for Mistral NVFP4, leaving room for Qwen-Image-2512, Kokoro, F5-TTS, and a parked Mistral all co-resident
- opencode replaces vibe as the CLI. Real flaws (1 GB RAM, default Grok telemetry until 1.2.23, churning codebase). Plan B is Aider on git-safety, kept hot
- Gemma-4-31b is the next creative-writing upgrade candidate (Arena rank 7 at score 1423) if Mistral’s tics outweigh its prose strength
- Two-day prep, side-by-side install, day-2 measurements published in follow-up article
Mistral Small 4 vs Qwen3.6-35B-A3B: the side-by-side I actually care about
Before any “is the swap worth it” answer, the comparison in cold facts. Both Apache 2.0, both multimodal, both run on a single DGX Spark.
| | Mistral Small 4 NVFP4 (current) | Qwen3.6-35B-A3B PrismaQuant (planned) |
|---|---|---|
| Total params | 119B dense | 35B MoE |
| Active per token | 119B (all) | 3B (sparse) |
| Disk footprint | ~60 GB | 22 GB |
| Single-Spark interactive throughput | Measured: 35 tok/s avg, short summary 37, long code 35-41, structured JSON 13-25, baseline w/o EAGLE 12-15 (benchmark) | 95.11 tok/s (Spark Arena rank 4, vLLM INT4) |
| RAM during inference | ~94 GB (measured) | ~35 GB resident under load |
| Vendor peak-throughput claim | 131 tok/s, peak 166 (Mistral marketing, batched) | not yet claimed at that scale |
| SWE-Bench Verified | ~58-65% (Devstral lineage, no official Mistral-Small-4 number published) | 73.4% |
| SWE-Bench Multilingual | unpublished | 67.2% |
| MCPMark (tool integration) | unpublished | 37.0% (vs Gemma 4-31B at 18.1%) |
| ToolCall-15 accuracy | needs alternating-roles patches via side-car proxy | 97% single Spark, 100% dual |
| Multimodal | Pixtral lineage built in | Vision encoder built in |
| Speculative decoding | not standardized | MTP n=3 stable, n=4 regresses (known) |
| License caveats | Voxtral encoder gated, blocks open voice cloning | none observed |
| German prose | strong | known weak (irrelevant for my English-only blog and podcast) |
| Release date | March 2026 | April 16, 2026 |
The “is the bigger model smarter or faster” intuition fails on both axes here. Mistral is bigger by parameter count, but Qwen3.6 is faster on real single-stream interactive workloads on this hardware. The reason is architectural: Mistral is dense, so every forward pass pushes all 119B parameters through the Spark’s 273 GB/s memory bandwidth, which is the actual bottleneck on GB10. Qwen3.6 is MoE with 3B active per token, so each token pass moves a tiny fraction of the model through memory and the same hardware sustains a higher token rate.
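A back-of-envelope check on the MoE side of that claim, assuming decode is purely bound by streaming the active weights through the 273 GB/s bus. This is a rough ceiling under stated assumptions: it ignores KV-cache traffic, shared layers that may sit at other precisions, and MTP overhead.

```python
# Back-of-envelope decode ceiling for the MoE case, assuming each token
# requires streaming the active weights once through unified memory.
# Everything beyond the 273 GB/s and 3B-active figures is an assumption.
bandwidth_gb_s = 273          # GB10 unified memory bandwidth
active_params = 3e9           # Qwen3.6-35B-A3B active parameters per token
bits_per_weight = 4.75        # PrismaQuant average precision

bytes_per_token = active_params * bits_per_weight / 8   # ~1.78 GB read per token
ceiling_tok_s = bandwidth_gb_s * 1e9 / bytes_per_token

print(f"{bytes_per_token / 1e9:.2f} GB per token, ceiling ~{ceiling_tok_s:.0f} tok/s")
# ~1.78 GB per token -> ~153 tok/s ceiling. The measured 95.11 tok/s sits
# under that, which is what a bandwidth-bound sparse model should look like.
```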
Mistral’s marketing throughput claim of 131 tok/s with peak 166 comes from a batched, throughput-optimized setup (Sebastien on Medium, Mistral platform docs). Those numbers are real for parallel-request workloads. On a single-stream interactive session (one user, one agent, opencode-style turn-by-turn), my own measurements published in the SGLang vibe performance benchmark and the Mistral SGLang setup article show 35 tok/s average, with the range driven by generation type: short summaries 37, medium analysis 25, long code refactors 35-41. The third article in this series, EAGLE content-dependent throughput, goes one level deeper and explains why throughput is not a hardware constant: structured JSON output forces Mistral out of EAGLE’s draft-friendly distribution and drops it to 13-25 tok/s on the same hardware in the same session.
This matters as a lesson independent of this model swap: when an article cites tok/s without specifying single-stream vs batched, and without specifying generation type (free prose vs structured output), assume the most optimistic case and trust your own measured numbers instead. My measured numbers are public; the vendor’s batched peak is not what you get at the prompt.
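For the day-2 follow-up I will measure the same way as before: single stream, one request at a time, decode rate taken from streamed chunks. A minimal sketch of that probe against any OpenAI-compatible endpoint; the port and model name are placeholders for my own setup, and chunk counting only approximates token counting when speculative decoding batches deltas.

```python
# Single-stream decode-throughput probe against an OpenAI-compatible endpoint.
# Port and model name are placeholders for my local setup.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30001/v1", api_key="none")

def decode_tok_s(prompt: str, model: str = "qwen3.6-35b-a3b-prismaquant") -> float:
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=512,
        stream=True,
    )
    first = last = None
    chunks = 0
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            now = time.perf_counter()
            first = first if first is not None else now
            last = now
            chunks += 1
    # Inter-chunk rate: time-to-first-token is excluded, so this is pure decode.
    # With MTP or EAGLE a single delta can carry more than one token, so this
    # slightly undercounts; good enough for comparing two endpoints.
    return (chunks - 1) / (last - first) if chunks > 1 else 0.0

print(decode_tok_s("Refactor this function to use pathlib instead of os.path: ..."))
```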
Does the swap actually pay off?
This is the question I had to be honest about before committing. Three reasons it does, one reason it does not, one I do not yet know.
Reason 1, coding capability. 73.4% vs ~60% SWE-Bench Verified is roughly the gap between “the agent fixes the GitHub issue first try” and “the agent writes plausible-looking code I then debug for an hour.” Ten to fifteen points on SWE-Bench Verified is operationally enormous. It is the difference between opencode being useful and being theatre.
Reason 2, tool-call cleanliness. My current stack runs an OpenClaw side-car proxy in front of Mistral specifically to work around the alternating-roles BadRequestError that SGLang hits on the Mistral protocol. Qwen3.6 reports 97% ToolCall-15 accuracy on a single Spark out of the box. Removing one custom proxy from the stack is not just an aesthetic win, it is one less thing to maintain when the next vLLM update lands.
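For context on what that proxy actually does today, the core of the workaround is collapsing consecutive same-role messages before the request reaches the Mistral endpoint, since strict user/assistant alternation otherwise triggers the BadRequestError. An illustrative sketch, not the real side-car, which also has to handle system and tool roles.

```python
# Sketch of the alternating-roles workaround: merge consecutive messages that
# share a role so strict user/assistant alternation is never violated.
# Illustrative only; the real proxy does more than this.
def merge_consecutive_roles(messages: list[dict]) -> list[dict]:
    """Collapse user/user or assistant/assistant runs into single messages."""
    merged: list[dict] = []
    for msg in messages:
        if merged and merged[-1]["role"] == msg["role"]:
            merged[-1]["content"] += "\n\n" + msg["content"]
        else:
            merged.append({"role": msg["role"], "content": msg["content"]})
    return merged
```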
Reason 3, RAM freedom. Mistral NVFP4 occupies ~60 GB of unified memory. With Qwen-Image-2512 (24 GB FP8) and Kokoro plus F5-TTS (~5 GB combined), I am at 89 GB and there is no room for ComfyUI to spawn worker batches. The pipeline routine is “stop the LLM, start ComfyUI, stop ComfyUI, start the LLM” and each cycle costs minutes. With Qwen3.6 PrismaQuant at 22 GB, the same three services total 51 GB and I have 77 GB of headroom. ComfyUI and the LLM can be co-resident; no sequential swap needed.
Reason against, throughput. There is no reason against. Qwen3.6 PrismaQuant is dramatically faster. When I first drafted this article I called the swap a 2.5x slowdown based on Mistral’s vendor benchmark of 131 tok/s. Wrong direction entirely. The actual Spark Arena measurement, rank 4 on the public leaderboard, puts Qwen3.6-35B-A3B PrismaQuant 4.75bit at 95.11 tok/s on a single Spark with vLLM INT4. My own measured Mistral throughput on the same hardware is 35 tok/s average, 41 best-case for long code with EAGLE, 13-25 for structured JSON output. Qwen3.6 PrismaQuant runs 2.7x faster than Mistral’s average and 2.3x faster than Mistral’s best case. It also beats gpt-oss-120b on dual Spark (75.96 tok/s) by 25%, despite using one Spark instead of two. The throughput row in the side-by-side table is the most one-sided one in the comparison. If you operate the Spark as a multi-tenant batched service with parallel-request load, Mistral’s NVFP4 batching efficiency might still recover some of this on aggregate request volume; that is not my setup.
Unknown, creative writing quality. Qwen3.6 is not on the Arena creative-writing leaderboard top 15 yet (the Qwen3.5-397B variant is at rank 9 with score 1411). Mistral Small 4’s prose quality has known tics (em-dashes, “essentially,” maritime image-prompt loop) but is otherwise serviceable. The two-day prep includes a creative-writing diff pass on five real blog-pipeline outputs against both endpoints before I commit anything.
The intended handling is dispatcher-style in master.py: every action will declare whether it is code or creative. Code calls will route to Qwen3.6 PrismaQuant on the new vLLM endpoint (planned port 30001). Creative calls (image-prompt generation, description rewrites, audit_rewrite_v6 podcast pass) will keep routing to Mistral on the SGLang endpoint that is already running today on port 30000, until I have measured evidence that something open beats Mistral on prose. Gemma-4-31b is the leading candidate for the creative-writing upgrade if Mistral’s tics ever outweigh its prose strength: Arena rank 7 with score 1423 on the creative-writing leaderboard, 31B dense and ~31 GB at FP8, Apache 2.0, fits comfortably alongside everything else. I am not downloading Gemma yet because there is no measured reason to, and Mistral is already on disk.
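A minimal sketch of that dispatcher as I plan to wire it; the ports, model identifiers, and the kind values are my own assumptions about master.py, not anything shipped yet.

```python
# Planned code-vs-creative routing in master.py. Ports, model identifiers,
# and the `kind` values are assumptions about my own wiring.
from openai import OpenAI

ENDPOINTS = {
    "code":     ("http://localhost:30001/v1", "qwen3.6-35b-a3b-prismaquant"),
    "creative": ("http://localhost:30000/v1", "mistral-small-4"),
}

def dispatch(kind: str, messages: list[dict], **kwargs):
    """Route a pipeline action to the model that owns its workload."""
    base_url, model = ENDPOINTS[kind]  # KeyError on anything else is deliberate
    client = OpenAI(base_url=base_url, api_key="none")
    return client.chat.completions.create(model=model, messages=messages, **kwargs)

# Image-prompt generation stays on Mistral; refactors go to Qwen3.6.
# dispatch("creative", [{"role": "user", "content": "Write an image prompt for ..."}])
# dispatch("code", [{"role": "user", "content": "Refactor update_blog_from_gitea.py to ..."}])
```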
SWE-Bench Verified, ranked by what actually fits
This is the table the listicles do not give you. Score is from official model cards or independent reproductions. The “Single Spark” column is what matters when you have one box.
| Model | License | SWE-Bench Verified | Total / Active params | Single Spark? |
|---|---|---|---|---|
| Claude Opus 4.6 (closed) | proprietary | 80.8% | unknown | n/a |
| Qwen3.6-35B-A3B (PrismaQuant 4.75bit) | Apache 2.0 | 73.4% | 35B / 3B | ✅ 22 GB on disk |
| Qwen3-Coder-Next FP8 | Apache 2.0 | 74.2% | 80B / 3B | ✅ 89 GB on disk |
| DeepSeek R1 (agentic) | MIT | ~65.8% | 671B / 37B | ❌ too big |
| GLM-4.5 | MIT | 64.2% | 355B / 32B | ❌ too big |
| GLM-5.1 | MIT | (no published SWE-V) | 754B / ~8 experts active | ❌ ~377GB at MXFP4 |
| gpt-oss-120b | Apache 2.0 | 62.4% | 116.8B / 5.1B | ✅ MXFP4, ~65GB |
| Mistral Small 4 (current) | Apache 2.0 | ~58% reported | 119B dense | ✅ NVFP4, 60 GB |
Why Qwen3.6 over Qwen3-Coder-Next, despite Qwen3-Coder-Next scoring 0.8 points higher on SWE-Bench Verified? Three reasons: (1) Qwen3.6 is half the size on disk, (2) Qwen3.6 has a 3.5-point lead on SWE-Bench Multilingual which matters because my own codebase is multi-language Python plus TypeScript plus Astro plus Bash, (3) Qwen3.6 has explicit MCPMark numbers showing 37% tool integration accuracy versus 18% for gemma-4-31b, and tool integration is the whole point of an opencode plus MCP stack. The 0.8 SWE-V difference is benchmark noise. The 3.5-multilingual and 19-MCPMark differences are not.
gpt-oss-120b plus opencode is a known-broken combo
Before I committed to a model, I dug into GitHub Issue #7185 on the opencode repo. Title: “When use gpt-oss-120B by vLLM locally, opencode doesn’t call the tools.” Quote from the report:
“only content of thinking in response, with no tools calling (even if model thinks it should call tools from thinking content) and no other response”
This is exactly the failure mode the vLLM 0.17 MXFP4 patches thread on the NVIDIA developer forum warned about: gpt-oss-120b on TP=1 (single Spark) “exhibits FP4 quantization errors affecting structured reasoning tokens.” The model thinks fine, the model emits thoughts, but the JSON tool-call schema breaks. For a coding agent that lives or dies by read_file, edit, and bash tool calls, this is fatal.
You can work around it with TP=2 on dual Spark. I have one Spark. Qwen3-Coder-Next FP8 has no such quirk: tool calls work cleanly, tested by the ztolley/dgx-spark-qwen3-coder-next-compose reference stack, which bundles Aider polyglot and Aider refactor benchmark runners as proof.
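The smoke test I run against any candidate endpoint is correspondingly blunt: one request with a toy tool schema, then check whether the response carries structured tool_calls at all or only prose. A sketch, with the endpoint, model name, and tool definition as placeholders.

```python
# Blunt tool-call smoke test for an OpenAI-compatible endpoint: does the
# model return structured tool_calls, or only prose/thinking content?
# Endpoint, model name, and the toy tool are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30001/v1", api_key="none")

tools = [{
    "type": "function",
    "function": {
        "name": "read_file",
        "description": "Read a file from the repository",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
}]

resp = client.chat.completions.create(
    model="qwen3.6-35b-a3b-prismaquant",
    messages=[{"role": "user", "content": "Open README.md and summarize it."}],
    tools=tools,
)
msg = resp.choices[0].message
if msg.tool_calls:
    print("PASS:", msg.tool_calls[0].function.name, msg.tool_calls[0].function.arguments)
else:
    # The gpt-oss-120b failure mode from Issue #7185: thinking text, no calls.
    print("FAIL: no tool_calls, content was:", (msg.content or "")[:200])
```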
opencode is the right CLI, with caveats I am tracking
vibe is dead to me. opencode replaces it, and Hacker News has opinions worth listening to before you commit.
The headline numbers are real: 120,000 GitHub stars, 800 contributors, 5 million monthly developers, number one on Hacker News on March 20, 2026. It supports 75+ LLM providers through the Models.dev integration, including any OpenAI-compatible endpoint, which is the whole point: vLLM exposes one, my Spark serves it, opencode talks to it.
But the same Hacker News thread is full of receipts on what opencode does wrong. Five comments I am taking seriously:
“OpenCode is permissive by default… tries to pull its config from the web.” (rbehrends, HN)
“sends all your prompts to Grok’s free tier by default… Grok trains on submitted information.” (heavyset_go, HN. This was the session-title-generation feature, fixed in version 1.2.23 after public outcry, but the fact it shipped at all is the lesson.)
“uses 1GB of RAM or more… resource inefficient (often uses 1GB+…) for a TUI.” (logicprog, HN)
“constantly releasing at extremely high cadence, don’t even spend time to test or fix things.” (logicprog, HN)
“20k commits, almost 700k lines of code, only four months old… no coherent architecture.” (siddboots, HN)
I am still moving to opencode. The TypeScript bloat is annoying but not blocking. The default-cloud-telemetry incident was real and was fixed; I will verify the fix in my install and audit the config to confirm no remote-config-pull is enabled. The release cadence concern is genuine, so I will pin to a known-good version and update deliberately, not auto-upgrade.
Compare this honestly against the alternatives. Aider is 39,000 stars, 4.1 million installs, 15 billion tokens processed per week, the oldest tool in the category and the one with the cleanest git-commit discipline. Claude Code is locked to Anthropic’s API and bills per token, scoring 80.8% on SWE-Bench (best in class) but at a cost that scales with use and a vendor risk that bit opencode users earlier this year when Anthropic briefly blocked third-party access.
The trade is real: Aider for stability and git safety, Claude Code for raw capability if budget and vendor lock-in are acceptable, opencode for provider freedom and openness. I am picking opencode because the vendor freedom is the whole point of running a sovereign stack in the first place. If I wanted vendor lock-in I would not have a DGX Spark in my basement.
Tokens per second I actually expect
The number that decides whether this stack is usable, not just defensible.
The verified Spark Arena numbers as of this writing, all decode-mode, sorted by throughput (single Spark unless noted):
| Rank | Model | Runtime | Quant | tok/s |
|---|---|---|---|---|
| 4 | Qwen3.6-35B-A3B PrismaQuant 4.75bit | vLLM | INT4 | 95.11 |
| 5 | Qwen3.6-35B-A3B-int4-AutoRound | vLLM | INT4 | 92.34 |
| 6 | Qwen3.6-35B-A3B-NVFP4 | vLLM | NVFP4 | 77.07 |
| 7 | gpt-oss-120b | vLLM | MXFP4 (2 nodes!) | 75.96 |
| 8 | Qwen3.6-35B-A3B PrismaQuant 4.75bit | vLLM | INT4 (second run) | 73.44 |
| 9 | Qwen3-Coder-Next-int4-AutoRound | vLLM | INT4 | 73.33 |
Three observations from this. First, the PrismaQuant 4.75bit beats NVFP4 of the same base model by 23% in throughput. Second, the PrismaQuant 4.75bit on a single Spark beats gpt-oss-120b on a dual-Spark cluster by 25%; you get more interactive speed from one Spark with the right quant than from two Sparks with the wrong one. Third, the runs at rank 4 (95.11) and rank 8 (73.44) are both the same model on the same runtime, which suggests configuration matters and the high number reflects the optimal setup (MTP n=3, flashinfer NVFP4, gpu-memory-utilization 0.90).
The PrismaQuant 4.75-bit variant ships with speculative decoding via MTP (multi-token prediction) at n=3 enabled by default; n=4 regresses on this model family. Realistic three-stage projection for my own deployment:
- Stage 1 (PrismaQuant 4.75bit INT4, week 1): target ~95 tok/s decode matching Spark Arena rank 4, 97% tool-call accuracy from the FP8 forum reports
- Stage 2 (tuned config, MTP n=3 confirmed, week 2-3): sustain ~95, optimize cold-prompt latency below 5 seconds
- Stage 3 (vLLM 0.20+, EAGLE-3 layered on, weeks ahead): NVIDIA claims 2.5x improvement on key workloads since DGX Spark launch from quantization plus speculative decoding combined; sustained 150+ tok/s by year-end is plausible if the pattern holds
For comparison, Claude Code via the Anthropic API responds at roughly 80-120 tok/s on Sonnet. Stage 1 of my setup at 95 tok/s is already in that range. Stage 2 confirms it. Stage 3 would beat it. The “running open-source locally feels slower than the cloud” assumption was true 18 months ago. It is not true on this hardware with this model anymore.
The comparison to the outgoing Mistral Small 4 is no longer subtle once you put both numbers side by side on the same hardware. Mistral’s vendor docs cite 131 tok/s peak on Spark, but that figure is from a batched, throughput-optimized configuration. My own measured single-stream interactive throughput, published in the SGLang vibe performance benchmark, is 35 tok/s average with EAGLE speculative decoding, broken down as 37 tok/s for short summaries, 25 tok/s for medium analysis, 35-41 tok/s for long code refactors, and 13-25 tok/s for structured JSON output (per EAGLE content-dependent throughput). Without EAGLE the baseline collapses to 12-15 tok/s, documented in the Mistral SGLang setup article. Spark Arena measures Qwen3.6-35B-A3B PrismaQuant at 95.11 tok/s on the same class of hardware. That is 2.7x Mistral’s average and 2.3x Mistral’s best-case workload. The “I am paying speed for capability” framing I started with was inverted from reality; the swap is a major speed win on every measured workload. The only caveat: if you run a multi-tenant batch service with parallel-request load, Mistral’s NVFP4 batching efficiency partially recovers on aggregate throughput. Single-stream interactive is not close.
Text-to-image: Qwen-Image-2512 retires FLUX
The text-to-image OSS leaderboard moved hard since FLUX.1 dominated. Top three open-source on Arena.ai:
| Rank | Model | License | Arena score |
|---|---|---|---|
| 3 | qwen-image-2512 | Apache 2.0 | 1131 |
| 4 | z-image-turbo | Apache 2.0 | 1084 |
| 8 | flux-2-klein-4b | Apache 2.0 | 1028 |
Ranks 1 and 2 (Tencent Hunyuan, FLUX.2-dev) are proprietary or non-commercial. Qwen-Image-2512 needs 48GB VRAM in BF16, 24GB in FP8. The Spark has 128GB unified, so both quantizations fit with room for the LLM container co-resident.
Per an independent three-week review, Qwen-Image-2512 “nails text rendering” (the historic FLUX weakness) and runs about 11x faster than FLUX.1-dev at comparable quality. Generation time is around 5 seconds per 1024x1024 at FP8 with 28 steps on a 4090; the Spark should match or beat that given memory bandwidth.
HiDream-I1, which I considered six weeks ago, has fallen out of the top 13. T2I moves fast.
Speech: Kokoro for TTS, F5-TTS for cloning
Update 2026-05-12. After Episode 1 V6 landed at 0/10 on spot-listen, the Kokoro plus F5-TTS recommendation in this section is superseded. Both engines are single-narrator zero-shot and miss on multi-speaker dialog. The pivot article Voxtral Capped at 3/10: Picking the Next Open TTS replaces this recommendation with VibeVoice, Higgs Audio v2, and IndexTTS-2 as the spike candidates, applying a podcast-specific filter on top of the raw TTS Arena ranking.
Voxtral has been frustrating. Mistral gated the encoder weights, so ref_audio crashes the engine and the instructions parameter is silently ignored. Voice cloning is locked away from open users. I documented that in my Voxtral expressivity report last week.
Kokoro 82M has a working DGX Spark ARM64 setup in NVIDIA’s developer forum, with full GPU acceleration via CUDA 12.2. 67 English voice packs out of the box. Sub-300ms generation for normal-length text. F5-TTS is the realistic voice-clone path: Apache 2.0, sub-7-second processing, best MOS-WER balance among current open models. My podcast is English-only, so multi-language support is not a selection criterion.
Both fit alongside the LLM container easily. Total RAM commitment for the new stack with Qwen3.6-35B-A3B PrismaQuant (22 GB on disk, plus KV cache and overhead, call it ~35 GB resident under load) plus Qwen-Image-2512 FP8 (~24 GB) plus Kokoro and F5-TTS (combined ~5 GB) sums to ~64 GB on the 128 GB Spark. All four services can be co-resident with 64 GB of headroom left. That is the big architectural change versus the Mistral setup, where the LLM alone consumed 60 GB and ComfyUI had to be sequenced with LLM stop → ComfyUI start → ComfyUI stop → LLM start for every image-generation batch. With the new stack, that swap dance is gone.
Framework: vLLM 0.17 with MXFP4 patches, SGLang on the bench
The framework choice is the part most operators get wrong. The default answer “use SGLang because structured outputs” was true a year ago. As of vLLM 0.17.0:
- BF16 to MXFP4 online quantization for MoE experts, attention layers, and lm_head via Marlin backend
- SM121 device support for CUTLASS MoE kernels
- Marlin MoE 256-thread kernel shared-memory race-condition fix
- Native OpenAI-compatible API, which opencode and Aider both speak
Setup flags that matter for Qwen3.6-35B-A3B PrismaQuant on a single Spark, copied directly from the model card:
vllm serve rdtand/Qwen3.6-35B-A3B-PrismaQuant-4.75bit-vllm \
--trust-remote-code \
--max-model-len 32768 \
--gpu-memory-utilization 0.90 \
--enable-auto-tool-choice \
--tool-call-parser qwen3_xml \
--speculative-config '{"method":"mtp","num_speculative_tokens":3}'
# Required environment variable
VLLM_USE_FLASHINFER_NVFP4=1
The tool-call-parser qwen3_xml flag is the secret handshake. Without it, opencode sees raw XML where it expects JSON and the agent halts. With it, tool calls work the first time. The num_speculative_tokens=3 is critical because the PrismaQuant variant explicitly regresses at n=4. Context length of 32k is the documented sweet spot for everyday coding work; 40k is feasible but starts wasting KV cache memory. SGLang stays installed as the fallback for structured-output edge cases and as the rollback path: the Mistral SGLang container on port 30000 keeps running for 30 days as the rollback target, then gets disabled.
Two days of preparation, not thirty minutes
The previous time I rushed a model swap, it cost me a week of pipeline bugs. The preparation list this time is two days of focused work, not optional.
Day 1, morning: audit. List every place Mistral Small 4 is called today: vibe-coding CLI, OpenClaw, the blog-pipeline image-prompt generator (call1 and call2 in update_blog_from_gitea.py), the description rewriter, the audit_rewrite_v6 podcast pass, the MCP search-rerank call, the dashboard cheatsheet generator. Each entry is a smoke-test target.
Day 1, afternoon: side-by-side install. vLLM 0.19.1+ container on port 30001, Qwen3.6-35B-A3B PrismaQuant weights downloaded (~22 GB on disk), VLLM_USE_FLASHINFER_NVFP4=1 set, speculative-config MTP n=3 enabled, tool-call-parser qwen3_xml configured. opencode installed, pinned to a known-good version, config audited to confirm no remote-config-pull and no Grok telemetry. Health-check from a curl loop. Mistral SGLang stays running on 30000 untouched.
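The health-check loop amounts to polling the endpoint until the model list answers and one short completion succeeds. A sketch of that readiness check, with the port and timeout as my planned values rather than anything vLLM prescribes.

```python
# Readiness check for the new vLLM endpoint: wait until /v1/models answers,
# then require one short completion before declaring the install healthy.
# Port and timeout are my planned values, not defaults.
import time
import requests

BASE = "http://localhost:30001/v1"   # planned vLLM port; Mistral SGLang stays on 30000

def wait_until_ready(timeout_s: int = 600) -> str:
    """Poll /v1/models until it answers, then run one short completion."""
    deadline = time.time() + timeout_s
    model_id = None
    while time.time() < deadline and model_id is None:
        try:
            data = requests.get(f"{BASE}/models", timeout=5).json().get("data", [])
            if data:
                model_id = data[0]["id"]
        except requests.RequestException:
            pass
        if model_id is None:
            time.sleep(10)
    if model_id is None:
        raise RuntimeError("vLLM endpoint never became ready")

    resp = requests.post(f"{BASE}/chat/completions", timeout=120, json={
        "model": model_id,
        "messages": [{"role": "user", "content": "Reply with the single word: ready"}],
        "max_tokens": 8,
    })
    resp.raise_for_status()
    print("healthy:", resp.json()["choices"][0]["message"]["content"])
    return model_id

wait_until_ready()
```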
Day 2, morning: five-test-case suite. Five prompts that exercise different modes: a code refactor, an image prompt with hard-forbidden motifs, an EEAT scoring call, a podcast script rewrite, an MCP tool-call. Each runs against both endpoints. I diff the outputs. I do not look at benchmarks for this. I look at what the model actually produces for my prompts. The creative-writing diff matters most because that is the one Qwen3.6 has not been independently benchmarked on against Mistral.
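The diff pass itself is mechanical; the judgment stays manual. A sketch of the harness that produces paired output files per prompt, with the test-case directory, ports, and model identifiers as placeholders for my pipeline.

```python
# Run the same five prompts against both endpoints and write paired output
# files for a manual diff. Directory, ports, and model ids are placeholders.
from pathlib import Path
from openai import OpenAI

ENDPOINTS = {
    "mistral": ("http://localhost:30000/v1", "mistral-small-4"),
    "qwen":    ("http://localhost:30001/v1", "qwen3.6-35b-a3b-prismaquant"),
}
PROMPTS = sorted(Path("testcases").glob("*.md"))  # refactor, image prompt, EEAT, podcast, MCP

out = Path("diffs")
out.mkdir(exist_ok=True)
for prompt_file in PROMPTS:
    prompt = prompt_file.read_text()
    for name, (base_url, model) in ENDPOINTS.items():
        client = OpenAI(base_url=base_url, api_key="none")
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0.7,
        )
        (out / f"{prompt_file.stem}.{name}.txt").write_text(resp.choices[0].message.content or "")
# Then: diff diffs/<case>.mistral.txt diffs/<case>.qwen.txt, read, judge by hand.
```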
Day 2, afternoon: write the overuse-phrases file. Mistral has its own overuse-phrases catalogue I built up over months: em-dash compulsion, “essentially” overuse, watchmaker imagery in image prompts. Qwen3.6 will have its own tics. The file starts empty today and gets populated by reviewing the five-test-case outputs. Without this step, I am inheriting Mistral’s word-list against a model that has different problems.
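The file only earns its keep if something counts against it. A sketch of the counter I plan to run over the Qwen outputs; the one-phrase-per-line file format is my own convention, nothing standardized.

```python
# Count occurrences of known overused phrases in a batch of model outputs.
# overuse_phrases.txt is my own one-phrase-per-line convention; it starts
# empty for Qwen3.6 and grows as the five-test-case outputs get reviewed.
from collections import Counter
from pathlib import Path

phrases = [p.strip().lower() for p in Path("overuse_phrases.txt").read_text().splitlines() if p.strip()]
counts: Counter[str] = Counter()

for output in Path("diffs").glob("*.qwen.txt"):
    text = output.read_text().lower()
    for phrase in phrases:
        counts[phrase] += text.count(phrase)

for phrase, n in counts.most_common():
    if n:
        print(f"{n:4d}  {phrase}")
```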
The two days are not optional. The previous time I rushed a swap, the unknown-unknowns cost me five days of debugging. Two days of prep buy that back.
Why this moment matters for self-hosted AI
There is a wider context. April 2026 was when Anthropic opened Claude Cowork to third-party platforms, which sounded like good news but also reminded everyone how much control vendors hold. January 2026 was when a developer in the EU shipped “Sovereign Claude Code” running fully against local Ollama. India’s regulators started pushing for sovereign hosting of Anthropic models. The “self-hosted Claude Code alternative” search term went from niche to mainstream in three months.
We are at a specific point in time where the open models are good enough (Qwen3.6-35B-A3B at 73.4% SWE-Bench is seven points behind Claude Opus 4.6 at 80.8%, not seventy), the hardware to run them is on a desk (DGX Spark at $4,000), and the CLI tooling to drive them is open (opencode and Aider, both Apache or MIT). The combination did not exist 18 months ago. It does today.
Running this stack is no longer a hobbyist statement. It is a working alternative to the Anthropic-subscription path with about 90% of the capability and 0% of the vendor risk. The 90% number is the SWE-Bench gap. The 0% is what makes the swap worth two days of prep.
The DGX Spark forum is moving faster than this article
While I was writing this, the NVIDIA DGX Spark / GB10 forum shipped four model releases in 48 hours, all dated May 10 to 11, 2026. The release I am building around (Qwen3.6-35B-A3B PrismaQuant) is itself only 25 days old, which raises the obvious question of how much bleeding-edge risk I am taking on.
Three reasons I am comfortable committing now anyway:
- The FP8 base checkpoint has 239 replies and 18,686 views on the NVIDIA forum within four weeks of release. That is not bleeding-edge anymore; that is “freshly stabilized.” Independent users report 97% ToolCall-15 accuracy at default settings.
- PrismaQuant uses stock vLLM 0.11+ with no custom patches required. The mixed NVFP4/MXFP8/BF16 precision scheme is handled entirely by the compressed-tensors library that ships with vLLM. No bespoke build steps, no fork-and-rebase risk.
- The author of PrismaQuant documents a 15-minute wall-clock reproduction time on a DGX Spark for the entire quantization pipeline (probe, cost, activation, export). If the upstream model is updated, my own re-quant takes a coffee break, not a week.
Three other things in the forum I am tracking but not betting on yet:
- Qwen3.6-27B released (59 replies, 11,443 views). Smaller, fits even more comfortably. Candidate for Stage 2 if PrismaQuant 4.75bit has quality regressions I can measure.
- Qwen3.5 27B optimization thread, starting at 30+ tok/s TP=1. Direct single-Spark throughput confirmation for the 27B size class with vLLM. Useful reference data.
- DeepSeek-V4 released, with a separate thread on a DeepSeek-V4-Flash hybrid-quant 128GB recipe ported from antirez’s MLX work to vLLM on GB10. If that recipe genuinely fits the 685B-class V4-Flash in 128 GB unified, the recommendation changes within a week.
I budgeted one swap per month going forward. PrismaQuant Qwen3.6 is this month’s swap.
How the model downloads actually ran
Update 2026-05-13: pulling the 22 GB rdtand/Qwen3.6-35B-A3B-PrismaQuant-4.75bit-vllm weights to disk turned out to be its own engineering story. Three failure modes hit on the same overnight run: Xet protocol over IPv6 (unreachable on DGX Spark), httpx read-timeout too short for 3.5 GB safetensor shards, and hf download returning exit zero with .incomplete blobs behind it. The wrapper that catches all three plus exponential backoff and filesystem-level validation is /data/scripts/ops/hf-pull in cipherfox/sovereign-ops. Full postmortem and the wrapper design in Why hf download Lies to You at 22 GB on DGX Spark. For any DGX Spark model pull from this point forward, use hf-pull <repo-id> instead of bare hf download.
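The cheapest of the three checks, the one that catches exit-zero-with-.incomplete-blobs, is a filesystem scan after the pull. A minimal sketch of just that check; the full hf-pull wrapper does more, and the cache path shown is the default Hugging Face location, which may differ on your install.

```python
# Post-download sanity check for the silent-failure case: hf download can
# exit 0 while leaving *.incomplete blobs in the cache. The cache path is
# the default Hugging Face location and may differ on your install.
import sys
from pathlib import Path

def verify_snapshot(repo_id: str, cache_dir: Path = Path.home() / ".cache/huggingface/hub") -> bool:
    repo_dir = cache_dir / f"models--{repo_id.replace('/', '--')}"
    if not repo_dir.exists():
        print(f"missing: {repo_dir}", file=sys.stderr)
        return False
    incomplete = list(repo_dir.rglob("*.incomplete"))
    for blob in incomplete:
        print(f"incomplete blob: {blob}", file=sys.stderr)
    return not incomplete

if not verify_snapshot("rdtand/Qwen3.6-35B-A3B-PrismaQuant-4.75bit-vllm"):
    sys.exit(1)
```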
Validation against artificialanalysis.ai (2026-05-13 update)
After this article shipped, the reader-driven correction pass on the arena.ai leaderboard article surfaced artificialanalysis.ai, which is the closest existing leaderboard that combines quality, speed, and price in one view. Cross-checking the picks in this article against their Intelligence Index produced a tighter ranking than my original SWE-Bench-only frame.
| Model | AA Intelligence | AA Speed | AA Price ($/MTok blended) | This article’s pick |
|---|---|---|---|---|
| Kimi K2.6 | 54 (open-weight #1) | n/a | n/a | not considered |
| Gemma-4-31B | 39 | 36 tok/s | $0.00 | Creative-writing backup |
| gpt-oss-120B | 33 | 214 tok/s | $0.26 | rejected (opencode tool-call breakage on TP=1) |
| Qwen3.6-35B-A3B | 32 | 199 tok/s | $0.84 | chosen primary |
| Qwen3-Coder-Next | 28 | 134 tok/s | $0.56 | rejected (0.8 pt SWE-V less, 8 pt less multilingual) |
| Mistral Small 4 | 19 | 143 tok/s | $0.26 | being replaced |
Three confirmations and one new entry on the watch list:
Qwen3.6 over Mistral is right on quality, not just throughput. AA scores Qwen3.6 at 32 vs Mistral Small 4 at 19. The thirteen-point gap roughly matches the SWE-Bench gap I cited (73.4% vs ~58-65%). The swap is a quality upgrade, not just a speed trade.
Gemma-4-31B genuinely beats Qwen3.6 on quality. AA Intelligence 39 vs 32. The article keeps Gemma as the creative-writing upgrade candidate; the data now confirms that framing. The speed cost (36 vs 199 tok/s) is what keeps Gemma off the primary slot for code work, where each second matters more.
gpt-oss-120B beats Qwen3.6 on AA Intelligence (33 vs 32) AND speed (214 vs 199 tok/s) AND price ($0.26 vs $0.84). AA’s metrics do not capture the opencode tool-call breakage on TP=1 (GitHub Issue #7185, FP4 quantization on SM 12.1). The rejection still stands but the trade-off is sharper than the original article framed. If a future vLLM release closes the SM 12.1 FP4 bug for gpt-oss on single-Spark, that swap becomes worth revisiting.
Kimi K2.6 is the open-weight Intelligence king at 54, a 22-point lead over Qwen3.6. Released after this article’s original draft, not on the Spark Arena leaderboard yet, not benchmarked on DGX Spark by anyone I could find. Adding it to the watch list below as the most interesting future migration target.
Caveat on AA’s pricing column. The $0.84 for Qwen3.6 is one cloud-hosting provider’s blended price. Self-hosting on the DGX Spark, the per-token cost is roughly $0.04 once hardware amortizes (see the two-leaderboards article for the math). AA’s price column is cloud-oriented; it is not the price you pay running this stack on your own hardware.
What I am watching for next
A few things will move the picture again.
Quantization advances on MoE. GLM-5.1 and DeepSeek-V4 (full) do not fit today. If the DeepSeek-V4-Flash hybrid-quant recipe holds up, that “too big” exclusion list shrinks fast. Spark Arena adds entries weekly.
Spark Arena coverage of coding-specialist models. Today’s spark-arena.com leaderboard is heavy on general models. Aider-polyglot and SWE-Bench numbers per quantization on the Spark would settle the model question concretely. That data is starting to appear in NVIDIA developer forum threads but not yet on the leaderboard.
opencode stability over the next two quarters. The HN critiques are real. If the codebase stabilizes and the release cadence calms down, opencode becomes the obvious choice. If it does not, Aider remains the safer default for production work. I will track the opencode GitHub issues for the next 90 days and switch if quality regresses.
Voice cloning open-source progress. F5-TTS is the front-runner but Fish Speech and Higgs-TTS are closing in with Apache 2.0 licenses. If one of them ships English prosody clearly better than F5, the choice flips. (Update 2026-05-12: the choice flipped within 24 hours, see Voxtral Capped at 3/10: Picking the Next Open TTS for VibeVoice, Higgs Audio v2, and IndexTTS-2 as the new candidates.)
The plan is not a one-shot migration. It is the first ratchet up. Qwen3.6 takes over code and tools, Mistral keeps the creative-writing endpoints, opencode replaces vibe, and I will publish the day-2 five-test-case measurements as a follow-up article around 2026-05-25 so the throughput claims in this article get verified against my own pipeline, not just Spark Arena. If the diffs say Mistral still wins on prose, Mistral stays on creative until Gemma or a newer model proves better. If they say Qwen3.6 is good enough on prose too, Mistral becomes pure rollback. Either way the dispatcher pattern in master.py routes each call to the right model and the user (me) never sees the seam.
What I Actually Use
- Qwen3.6-35B-A3B PrismaQuant 4.75bit on vLLM 0.19.1+ with MTP n=3 speculative decoding as the new code-and-tools primary, single Spark, 95.11 tok/s verified on Spark Arena rank 4 (vLLM INT4), 22 GB on disk
- Mistral Small 4 NVFP4 stays running on SGLang for creative-writing endpoints (descriptions, image-prompts, podcast rewrites) until Gemma or a successor proves measurably better on prose
- Gemma-4-31b queued as the creative-writing upgrade candidate, Arena rank 7 creative-writing at score 1423, downloaded only when day-2 diffs say it is worth it
- opencode as the CLI replacing vibe and OpenClaw, pinned to a known-good version with telemetry audited off, Aider kept hot as Plan B for git-safety
- Qwen-Image-2512 in ComfyUI replacing FLUX.1-schnell, co-resident with the LLM instead of sequenced
- Kokoro for English TTS, F5-TTS for voice cloning, both replacing Voxtral