#benchmarking | Sovereign AI Blog

Jun 25, 2026

Gemma-4-31B NVFP4 on a Single DGX Spark: When the Quantization Is the Bottleneck

Gemma-4-31B in NVIDIA's NVFP4 format fits a single DGX Spark and is a strong reasoner. But on Blackwell sm_121 the default FP4 kernel path is broken, and a dense 31B is bandwidth-bound at around 4 tok/s no matter what you do. I measured the baseline, the Marlin fix, and the honest conclusion: the real speedup is a model swap, not a flag.

Jun 25, 2026

strategy dgx-spark vllm mcp

GLM-4.7-Flash on a Single DGX Spark: the Repo Says AWQ, the Model Says MLA

GLM-4.7-Flash is a 30B-A3B MoE coding model that fits a single 128GB DGX Spark with room to spare. Bringing it up on Blackwell sm_121 took two failures that every published recipe gets wrong: the 'AWQ' build is actually compressed-tensors, and the model speaks MLA, so flash_attn is illegal. Here is the working recipe, the single-stream decode number nobody reports, and what it does to my coding agent.

Jun 19, 2026

sovereign-ai self-hosted engineering-honesty rag

I Rigged My Own RAG Benchmark. Twice.

I almost bolted a vector database onto a knowledge base that did not need one. My own benchmark told me to, then told me the opposite, both times with total confidence, both times because I had quietly chosen the test queries. Here is the path from the standard 2026 RAG playbook to the number that finally told the truth, and why the retriever that won is a hundred lines of standard library.

Jun 12, 2026

engineering-honesty dgx-spark qwen

A Benchmark Handed Me a Number Three Times in One Day. Three Times It Was Lying.

Standing up two large models on a DGX Spark, my own measurements tried to deceive me three separate ways: a harness that scored a working model at zero, a one-shot test that framed the model for a bug that was mine, and a cold reading that undersold decode speed by 35 percent. None of the wrong numbers were random. Each had a cause, a tell, and a fix. Here is the field guide.

Jun 12, 2026

dgx-spark comparison strategy vllm

I Built OpenAI's gpt-oss-120b on a Single DGX Spark. My 35B Qwen Out-Coded It.

gpt-oss-120b pulls nearly four million downloads a month, so I assumed it was a one-command experience. Getting it to serve on a DGX Spark took a frozen box, a 25GB image pull strangled by a Tor proxy, and a 43-minute kernel compile. Then the measurement: on my own coding tasks the 120B scored 56 percent where the 35B Qwen I already run scored 100. Here is the full teardown, with every number measured on the box and the failed measurements thrown out, not published.

Jun 12, 2026

dgx-spark comparison qwen vllm

Three Quants of One 35B Qwen on a DGX Spark. The Fastest Build Was the Only One That Could Still See.

Same model, same box, three ways to shrink it: Intel's AutoRound int4, a 4.75-bit PrismaQuant, and FP8. I measured all three on decode speed, coding accuracy, and vision, with one ruler per axis and the failed runs thrown out. AutoRound won every column that mattered, and the surprise was vision: the leanest build kept its eyes while the others went blind or broke. Here is the teardown.

Jun 11, 2026

strategy dgx-spark mcp vllm

I Ran NVIDIA's 120B Nemotron on a Single DGX Spark. It Is Smart, Slow, and Surprisingly Good at One Job

NVIDIA's Nemotron-3-Super-120B-A12B is tuned for Blackwell and ships an NVFP4 build that fits a single 128GB DGX Spark. I measured it where almost nobody else does: single-stream, on one GB10. The result is 23.7 tok/s, a competent but painfully verbose coder, and a genuinely strong retrieval agent. Here is the full teardown, with the published benchmarks fact-checked against what the box actually did.

Jun 10, 2026

strategy agents opencode

Agent-bench: stop trusting install counts, start measuring your agent's tools

I built a small, dependency-free harness that answers one question with numbers instead of vibes: does this enhancement make my agent measurably better, on my models, on my tasks? Here is the method, what I found, and why deterministic gates are the whole point.

Jun 10, 2026

strategy qwen dgx-spark

Smaller, Faster, Still Smart? AutoRound int4 vs PrismaQuant for a Self-Hosted Coding Model

I run Qwen3.6-35B at 4.75-bit for coding. A 4.0-bit AutoRound build promised more speed. Fewer bits usually means a dumber model, so I measured both halves: decode throughput and coding quality, the latter through my own agent-bench harness. The result settled it. Here is the duel, the bandwidth math, and why the bit count was the wrong thing to fear.

Jun 10, 2026

strategy agents

Caveman: does the 75% token-saving skill survive contact with a self-hosted model?

caveman has ~200k installs and claims a 75% token reduction. I measured it on two local models and three frontier Claude models (Sonnet 4.6, Opus 4.8, Fable 5). The math does not work out the way the claim says it does.

Jun 10, 2026

strategy mcp agents

Does Serena help a self-hosted coding model? I benchmarked it

Serena is one of the most-installed coding MCP servers. I tested it against two local models (Qwen3.6-35B and Mistral-Small-4) on three refactor tasks with deterministic gates. The short answer is more interesting than yes or no.
