I Ran NVIDIA's 120B Nemotron on a Single DGX Spark. It Is Smart, Slow, and Surprisingly Good at One Job

June 11, 2026 11 min read

NVIDIA released Nemotron-3-Super-120B-A12B in March 2026 as an open-weight reasoning model, and the part that made me curious is the NVFP4 build: a 4-bit, Blackwell-native quantization that compresses a 120B mixture-of-experts model down to about 75GB on disk. That number matters, because 75GB fits inside the 128GB of unified memory on a single DGX Spark. So the obvious question for this stack: NVIDIA tuned this thing for its own Blackwell silicon, it has a million-token context, and people are calling it a coding model. Is it actually good on the one box I have?

I measured it the way almost nobody publishes: single-stream, one GB10, the same harness I use for everything else. The answer is layered, and two of the most-repeated claims about this model do not survive contact with the hardware.

Verdict at a glance


What it is	120B total / 12B active MoE, NVFP4, NVIDIA Open Model License (open weights, not OSI)
Single-Spark decode	23.7 tok/s single-stream, roughly a third of my Qwen3.6 coding model
As a coding agent	competent (gets the hard rename right, 2/2) but unusable: 3.5 to 6 minutes per task
As a RAG agent	genuinely strong: correct tool chaining, grounded synthesis, no hallucination
Context	ran 256k on a single Spark, the model supports up to 1M
Spark gotcha	NVIDIA’s recipe sets gpu-memory-utilization 0.90; on unified memory that fails at startup, 0.80 works
Do I run it?	Not as a coding model. It earns a look only for long-context retrieval where latency does not matter.

The number nobody else reports: single-stream on one GB10

Every Nemotron-3-Super throughput figure I could find online, as of June 2026, is a server number. NVIDIA cites roughly 1,200 tok/s on an 8xH100 box with TensorRT-LLM and continuous batching. A community benchmark on 8x RTX PRO 6000 Blackwell hit 3,215 tok/s in a burst with speculative decoding. API providers serve it at 117 to 528 tok/s depending on who you ask, per Artificial Analysis. Those are all real, and all useless for my question, because they measure datacenter hardware running many requests at once. What I need is single-stream: one request at a time with nothing batched behind it, which is what an agent on a desk actually experiences.

On one DGX Spark, serving one request at a time, vLLM with the official Blackwell recipe (NVFP4 marlin backend, fp8 KV cache, native multi-token-prediction at k=3), the model decodes at 23.7 tok/s. For comparison, my daily Qwen3.6-35B coding model on the same box does 69 tok/s. Nemotron is about a third of the speed.

The reason is not the quantization, it is the active-parameter count. Single-stream decode on the GB10 is bound by memory bandwidth, around 273 GB/s, and every token has to move the active weights across that bus. Nemotron activates 12B parameters per token; my Qwen activates roughly 3B. Four times the active weight, four times the bandwidth per token, and you land near a third of the speed once NVFP4 claws a little back. The 120B headline number is misleading: what you pay for at decode time is the 12B that fire, not the 120B that sit idle.
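A back-of-envelope version of that bandwidth argument, as a rough sketch rather than a measurement. It assumes 0.5 bytes per active parameter for a 4-bit format and ignores quantization scale factors, any layers kept at higher precision, KV-cache and state reads, and the multi-token-prediction path. The function name decode_ceiling is made up for illustration.

```shell
# Rough upper bound on single-stream decode speed for a bandwidth-bound box.
# Assumption: every token streams all active weights across the memory bus once.
# decode_ceiling <bandwidth GB/s> <active params, billions> <bytes per param>
decode_ceiling() {
  awk -v bw="$1" -v params="$2" -v bytes="$3" \
    'BEGIN { printf "%.1f\n", bw / (params * bytes) }'
}

# Nemotron on the GB10: 273 GB/s, 12B active, assumed 0.5 bytes/param (4-bit)
decode_ceiling 273 12 0.5   # prints 45.5 (tok/s ceiling under these assumptions)
```

The measured 23.7 tok/s sits at roughly half of that idealized ceiling, which is unsurprising given everything the estimate leaves out. Plugging in another model's active-parameter count and weight format gives a comparable ceiling for it.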

Smart, but you will not wait for it

Speed alone does not condemn a model. So I ran it through the deterministic gates I use for every agent: the agent-bench harness, which scores success on a compiler exit code and a frozen checklist rather than another LLM’s opinion. The hardest task renames one save method on a UserRepository class while leaving an unrelated Logger.save untouched, the task that separates careful agents from text-replacers.

Nemotron got it right, twice out of two runs. Correct four-file diff, the unrelated method left alone, the project still type-checks. As a reasoner it is clearly competent, which lines up with the published capability scores: SWE-bench Verified around 55%, HumanEval+ in the low 90s. I did not independently reproduce those suites, but my one discriminating coding task agrees with the direction: this model can code.

The problem is what it costs. The first run took six minutes, 3,127 output tokens, and 23 tool round-trips. The second was its best case: three and a half minutes, 1,785 tokens, 15 calls. My Qwen does the same task in under a minute with about 600 tokens. The gap is multiplicative, because three penalties stack: three times slower decode, roughly four times more output since the model narrates its reasoning at length, and more tool round-trips per task. In practice that means you type a request, leave, make coffee, and come back to a correct answer you no longer care about. For an interactive coding agent, that is not a slow tool, it is a broken workflow. The capability is real and the latency makes it irrelevant for the job everyone wants to use it for.
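To see how much of that wall clock is pure token generation, here is a small sketch that divides the output-token counts above by the measured decode speeds. The helper name decode_seconds is hypothetical, and the Qwen figure uses the approximate 600 tokens quoted above. Whatever remains of each run would be prefill, tool execution, and round-trip overhead, which this arithmetic does not break down.

```shell
# Seconds spent purely generating tokens: output tokens / decode speed.
# decode_seconds <output tokens> <tok/s>
decode_seconds() {
  awk -v tokens="$1" -v rate="$2" 'BEGIN { printf "%.1f\n", tokens / rate }'
}

decode_seconds 3127 23.7   # Nemotron, first run:  prints 131.9
decode_seconds 1785 23.7   # Nemotron, second run: prints 75.3
decode_seconds 600 69      # Qwen, about 600 tokens: prints 8.7
```

Even before counting tool round-trips, generation alone costs Nemotron one to two minutes per task against under ten seconds for Qwen.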

The one thing it is genuinely good at

Here is the result that changed my read. I pointed opencode at Nemotron with my real stack wired in: the sovereign-ai blog-search MCP and the knowledge-base MCP both live, and asked it to explain what the agent-bench project measures using only what the tools retrieve.

It chained two tools correctly. First query_knowledge to search the knowledge base for “agent-bench”, then get_article with the slug it discovered, to pull the full text. Then it answered, accurately and entirely grounded in the retrieved document: agent-bench runs A/B experiments recording tool calls, tokens, wallclock, and three quality KPIs, and it uses deterministic gates because their verdict never changes for the same input. That is exactly what the source says, with no invented detail, following the “tools only” instruction.

That is a strong retrieval-augmented generation showing. Correct tool selection, sensible chaining, faithful synthesis. Combined with the long context, it points at the actual use case for this model on a Spark: not editing code in a tight loop, but reading a large corpus and answering from it, where you ask once and wait, and where the answer’s fidelity matters more than the seconds it took.

Fact-checking the claims against the box

The web is confident about two things that my single Spark contradicts.

Claim: it needs multi-GPU servers, not consumer hardware. One widely-cited writeup states the model “was not designed for single-GPU consumer hardware” and lists a minimum of 3x A100-80GB or 2x H100-80GB. That is true at full precision. It is false for the NVFP4 build, which is the entire point of the NVFP4 build. I ran it on one GB10 with 128GB of unified memory. The quantization is what moves a 120B model onto a single desktop-class device, and any “needs a rack” claim that ignores the quant is measuring the wrong artifact.

Claim: 128K context. The same writeup states a 128K window. NVIDIA’s own serving recipe sets max-model-len to 1,000,000, and I launched the model at 262,144 and confirmed it through the running endpoint. The 128K figure understates the model by nearly an order of magnitude. The architecture is the reason it can stretch that far: Nemotron-3-Super is a Mamba-2 plus attention hybrid, and the Mamba layers carry a constant-size state instead of a KV cache that grows with every token, which is exactly why a million-token context is feasible on hardware that would choke a pure-attention model at a fraction of that.

So the honest correction is not a nitpick. The two things that make this model interesting on a Spark, that it fits at all and that its context is enormous, are the two things the popular framing gets wrong.

Downloading it became a saga of its own

Getting the weights was harder than running them, and worth recording because the failure modes are reusable. Three of them, in the order they bit.

First, throttling. The model is gated on Hugging Face under the NVIDIA Open Model License, though in practice the files download without an accepted token. Unauthenticated, the hub crawled at 88 seconds per file; with a token in the environment the same files came down at 12 seconds per file, a sevenfold difference for one environment variable.

Second, a stall that no retry could fix: a download process hung in uninterruptible IO while holding the cache lock under .locks/, so every fresh attempt waited forever on a lock that a dead process owned. The fix was to kill it by PID and remove the stale lock file, not to keep restarting, because restarts were exactly what multiplied the mess.

Third, size surprises. A plain hf download of openai/gpt-oss-120b for a later comparison pulled past 130GB, because that repository ships two formats, the sharded safetensors that vLLM needs and a separate single-file Metal build it does not. Nemotron itself behaved, landing at about 75GB across 17 shards, and I verified every shard by hashing each blob against its content-addressed filename: 17 of 17 intact. If you pull these models, budget the disk and verify the hashes rather than trusting that a green progress bar means a correct file.

# verify each shard against its HF content-addressed blob name
for f in snapshots/*/*.safetensors; do
  blob="$(readlink -f "$f")"   # resolve the snapshot symlink to the blob file
  [ "$(basename "$blob")" = "$(sha256sum "$blob" | cut -d' ' -f1)" ] \
    && echo "ok $(basename "$f")" || echo "CORRUPT $(basename "$f")"
done
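For the stalled download described above, a sketch of how one might spot a process stuck in uninterruptible IO before touching any lock file. It assumes Linux procps ps output of the form "pid stat command"; the helper name d_state_pids is hypothetical, and the cache path in the comment is the Hugging Face default, which may differ on a customized setup.

```shell
# Filter ps-style lines down to PIDs in uninterruptible sleep (state D).
# Expects "<pid> <stat> <command>" per line, e.g. from: ps -eo pid=,stat=,comm=
d_state_pids() {
  awk '$2 ~ /^D/ { print $1 }'
}

# Demo on sample data; on a live box, pipe the real ps output in instead.
printf '10 S bash\n11 D python3\n12 Ss sshd\n' | d_state_pids   # prints 11

# Once the stuck PID is confirmed and gone, leftover lock files live under
# the cache's .locks/ directory (assumed default location):
#   ls ~/.cache/huggingface/hub/.locks/
```

Checking the process state first avoids deleting a lock that a healthy download still holds.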

The one launch failure you will hit too

NVIDIA’s serving recipe sets --gpu-memory-utilization 0.90. On a Spark that runs anything else, that flag kills the launch before a single weight loads:

ValueError: Free memory on device cuda:0 (106.68/121.69 GiB) on startup
is less than desired GPU memory utilization

The reason is the unified memory architecture. On a discrete GPU, 0.90 means ninety percent of dedicated VRAM that nothing else touches. On the GB10, CPU and GPU share one 128GB pool, so your containers, your browser, even a running download all eat into the same budget the flag tries to claim. In my case, idle services held about 15GB, which left 106GB free against the 109GB that 0.90 requested. Lowering the flag to 0.80 fixed it on the first try. If you adapt any datacenter recipe for a Spark, treat the memory-utilization flag as the first thing to question, not the last.
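The arithmetic behind that failure can be sketched with the two numbers from the error message. This assumes the startup check compares utilization times total device memory against the memory free at launch, which is what the message suggests; the helper names requested_gib and max_utilization are made up for illustration.

```shell
# GiB that a given --gpu-memory-utilization value asks for on this device.
# requested_gib <total GiB> <utilization>
requested_gib() {
  awk -v total="$1" -v util="$2" 'BEGIN { printf "%.1f\n", total * util }'
}

# Largest utilization (floored to 2 decimals) that fits in the free memory.
# max_utilization <free GiB> <total GiB>
max_utilization() {
  awk -v free="$1" -v total="$2" \
    'BEGIN { printf "%.2f\n", int(free / total * 100) / 100 }'
}

requested_gib 121.69 0.90      # prints 109.5, more than the 106.68 GiB free
requested_gib 121.69 0.80      # prints 97.4, which fits
max_utilization 106.68 121.69  # prints 0.87
```

The 0.87 is only a ceiling for that instant. Free memory on a shared pool moves as other processes start and stop, so leaving headroom, as 0.80 does, is the safer setting.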

Open weights are not open source

The label on the box matters for a project built on a sovereignty principle, so it is worth being precise. Nemotron-3-Super is released under the NVIDIA Open Model License. That license permits commercial use and public deployment, and it gives you ownership of the outputs, which is genuinely permissive. It is not, however, an OSI-approved open-source license the way Apache 2.0 is, which is what my Qwen and Mistral models use. Calling it an “open-source coding model,” as more than one writeup does in its title while quoting the NVIDIA Open Model License in its body, blurs a real distinction. Open weights mean you can run and ship it. Open source means the license meets a specific freedom standard, and this one does not.

For internal measurement, like this article, the distinction is moot. For putting a model in front of readers as the engine behind a sovereign-AI blog, it is the whole question, and it is a deliberate choice rather than a default. I measured Nemotron happily. Whether it ever serves the public here is a values decision, not a benchmark one.

So, who is this for

Not for a coding agent on a Spark. The latency is disqualifying, and my Qwen3.6 at 69 tok/s does the same edits in a fraction of the time. The 120B headline buys you 12B of active compute per token and a verbosity tax on top, and on a bandwidth-bound desktop that is a bad trade for interactive work.

It earns exactly one look: a long-context retrieval and analysis model, where its strong RAG behavior and million-token window do work that Qwen’s 256k cannot, and where you are content to ask a question and come back in a few minutes for a faithful, grounded answer. That is a real niche. It is just a much narrower one than “NVIDIA’s open-source 120B coding model” suggests.

The bigger lesson survives the model. On a bandwidth-bound box, the best model is not the largest one that fits. It is the smallest one that does the job, because every active parameter is latency you pay on every single token. Nemotron fits on a Spark. Fitting was never the question; affording it on every keystroke was.

Reproduce it and the honest caveats

The single-stream decode used a fixed prompt, prefill subtracted, median of three, at temperature 0. The agent-bench runs are small, N of a few, and the coding capability scores are NVIDIA’s published numbers, not mine. One real confound: a large concurrent download was competing for the same memory bus during the speed measurement, which likely shaved a few tok/s off the 23.7, though nowhere near enough to explain the roughly 45 tok/s gap to Qwen. The model is nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 on the official vllm/vllm-openai container; the harness is agent-bench. For the bandwidth reasoning, see the AutoRound quant duel; for why single numbers lie without a method, the broken-ruler story.

Measured on a single DGX Spark, single-stream, with the same deterministic-gate harness used across this engineering log.
