#vllm | Sovereign AI Blog

Gemma-4-31B NVFP4 on a Single DGX Spark: When the Quantization Is the Bottleneck

Gemma-4-31B in NVIDIA's NVFP4 format fits a single DGX Spark and is a strong reasoner. But on Blackwell sm_121 the default FP4 kernel path is broken, and a dense 31B is bandwidth-bound at around 4 tok/s no matter what you do. I measured the baseline, the Marlin fix, and the honest conclusion: the real speedup is a model swap, not a flag.

GLM-4.7-Flash on a Single DGX Spark: the Repo Says AWQ, the Model Says MLA

GLM-4.7-Flash is a 30B-A3B MoE coding model that fits a single 128GB DGX Spark with room to spare. Bringing it up on Blackwell sm_121 meant working through two failures caused by details every published recipe gets wrong: the 'AWQ' build is actually compressed-tensors, and the model speaks MLA, so flash_attn is illegal. Here is the working recipe, the single-stream decode number nobody reports, and what it does to my coding agent.

Jun 12, 2026

dgx-spark, comparison, benchmarking, strategy

I Built OpenAI's gpt-oss-120b on a Single DGX Spark. My 35B Qwen Out-Coded It.

gpt-oss-120b pulls nearly four million downloads a month, so I assumed it was a one-command experience. Getting it to serve on a DGX Spark took a frozen box, a 25GB image pull strangled by a Tor proxy, and a 43-minute kernel compile. Then the measurement: on my own coding tasks the 120B scored 56 percent where the 35B Qwen I already run scored 100. Here is the full teardown, with every number measured on the box and the failed measurements thrown out, not published.

Three Quants of One 35B Qwen on a DGX Spark. The Fastest Build Was the Only One That Could Still See.

Same model, same box, three ways to shrink it: Intel's AutoRound int4, a 4.75-bit PrismaQuant, and FP8. I measured all three on decode speed, coding accuracy, and vision, with one ruler per axis and the failed runs thrown out. AutoRound won every column that mattered, and the surprise was vision: the leanest build kept its eyes while the others went blind or broke. Here is the teardown.

I Ran NVIDIA's 120B Nemotron on a Single DGX Spark. It Is Smart, Slow, and Surprisingly Good at One Job

NVIDIA's Nemotron-3-Super-120B-A12B is tuned for Blackwell and ships an NVFP4 build that fits a single 128GB DGX Spark. I measured it where almost nobody else does: single-stream, on one GB10. The result is 23.7 tok/s, a competent but painfully verbose coder, and a genuinely strong retrieval agent. Here is the full teardown, with the published benchmarks fact-checked against what the box actually did.