Sovereign AI engineering log

Running AI at home.
No cloud. No compromises.

A self-hosted AI stack built and documented in public. Every article comes from a real system, a real error, or a real fix. Nothing is speculative.

For people running, scoping, or considering self-hosted LLMs on DGX Spark, Strix Halo, multi-3090, or similar hardware.

host: DGX Spark
cloud: none
posts: 154

Latest Articles

engineering-honesty, benchmarking

A Benchmark Handed Me a Number Three Times in One Day. Three Times It Was Lying.

Standing up two large models on a DGX Spark, my own measurements tried to deceive me three separate ways: a harness that scored a working model at zero, a one-shot test that framed the model for a bug that was mine, and a cold reading that undersold decode speed by 35 percent. None of the wrong numbers were random. Each had a cause, a tell, and a fix. Here is the field guide.

dgx-spark, comparison

I Built OpenAI's gpt-oss-120b on a Single DGX Spark. My 35B Qwen Out-Coded It.

gpt-oss-120b pulls nearly four million downloads a month, so I assumed it was a one-command experience. Getting it to serve on a DGX Spark took a frozen box, a 25GB image pull strangled by a Tor proxy, and a 43-minute kernel compile. Then the measurement: on my own coding tasks the 120B scored 56 percent where the 35B Qwen I already run scored 100. Here is the full teardown, with every number measured on the box and the failed measurements thrown out, not published.

dgx-spark, comparison

Three Quants of One 35B Qwen on a DGX Spark. The Fastest Build Was the Only One That Could Still See.

Same model, same box, three ways to shrink it: Intel's AutoRound int4, a 4.75-bit PrismaQuant, and FP8. I measured all three on decode speed, coding accuracy, and vision, with one ruler per axis and the failed runs thrown out. AutoRound won every column that mattered, and the surprise was vision: the leanest build kept its eyes while the others went blind or broke. Here is the teardown.

strategy, dgx-spark

I Ran NVIDIA's 120B Nemotron on a Single DGX Spark. It Is Smart, Slow, and Surprisingly Good at One Job

NVIDIA's Nemotron-3-Super-120B-A12B is tuned for Blackwell and ships an NVFP4 build that fits a single 128GB DGX Spark. I measured it where almost nobody else does: single-stream, on one GB10. The result is 23.7 tok/s, a competent but painfully verbose coder, and a genuinely strong retrieval agent. Here is the full teardown, with the published benchmarks fact-checked against what the box actually did.
