gpt-oss-120b pulls nearly four million downloads a month, so I assumed it was a one-command experience. Getting it to serve on a DGX Spark took a frozen box, a 25GB image pull strangled by a Tor proxy, and a 43-minute kernel compile. Then the measurement: on my own coding tasks the 120B scored 56 percent where the 35B Qwen I already run scored 100. Here is the full teardown, with every number measured on the box and the failed measurements thrown out, not published.

I Built OpenAI's gpt-oss-120b on a Single DGX Spark. My 35B Qwen Out-Coded It.

OpenAI’s gpt-oss-120b pulls 3,924,278 downloads a month on Hugging Face. A number like that reads as “solved.” Download the weights, point an engine at them, watch the tokens stream. That is the experience on a normal datacenter GPU. It is not the experience on a DGX Spark, and the gap between the download counter and the reality is the whole story.

I run a Qwen3.6-35B as my daily coding model on this box. The pitch for gpt-oss-120b was simple: it is more than three times the size, it is OpenAI’s, and the working hypothesis was that the bigger model would be the better agent. So I spent a day standing it up properly and then measured whether the hypothesis was true.

It was not. Here is what happened, and what the box actually did.

What gpt-oss-120b is, and what it is not

It is a mixture-of-experts model: 117B parameters total, but only 5.1B active per token, quantized to MXFP4 (a four-bit format). The “120B” is storage, not thinking power per token. My Qwen3.6-35B activates 3B per token. In actual compute per token the two are in the same ballpark, which is the first thing the parameter count hides.

MXFP4 earns a sentence, because it is the whole reason this model is supposed to be gentle on a small box. It is a “microscaling” four-bit float: each weight is one sign bit, two exponent bits, and one mantissa bit, and every block of 32 weights shares a single 8-bit scale factor, which buys back most of the accuracy a flat four-bit format would throw away. It is an open industry standard backed by NVIDIA, AMD, Microsoft and others, and OpenAI trained gpt-oss in it natively rather than quantizing after the fact. Now the irony that sets up the entire build below: MXFP4 is a hardware feature of exactly the Blackwell generation this Spark’s GB10 belongs to. The format was designed to scream on this silicon. The software that feeds the silicon just had not been compiled for this particular two-month-old chip yet, which is the gap the rest of this post lives in.
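To make the format concrete, here is a minimal Python sketch of decoding one MXFP4 block as described above: 32 four-bit E2M1 codes sharing one 8-bit power-of-two scale. The function names are illustrative and not taken from any engine, and the reserved NaN scale encoding is rejected rather than modeled.

```python
# E2M1 magnitudes indexed by the low three bits (2 exponent bits, 1 mantissa bit).
E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK_SIZE = 32   # weights that share one scale factor
SCALE_BIAS = 127  # the 8-bit scale is a pure exponent: 2 ** (byte - 127)


def decode_e2m1(nibble: int) -> float:
    """Decode one 4-bit code: top bit is the sign, low three bits the magnitude."""
    sign = -1.0 if nibble & 0b1000 else 1.0
    return sign * E2M1_MAGNITUDES[nibble & 0b0111]


def decode_block(nibbles: list[int], scale_byte: int) -> list[float]:
    """Decode a block of 32 codes with their shared scale byte."""
    if len(nibbles) != BLOCK_SIZE:
        raise ValueError(f"expected {BLOCK_SIZE} codes, got {len(nibbles)}")
    if scale_byte == 255:
        raise ValueError("scale byte 255 is reserved for NaN; not handled in this sketch")
    scale = 2.0 ** (scale_byte - SCALE_BIAS)
    return [decode_e2m1(n) * scale for n in nibbles]


def bits_per_weight() -> float:
    """Four bits per weight plus the 8-bit scale amortized over the block."""
    return 4 + 8 / BLOCK_SIZE
```

The amortized cost works out to 4.25 bits per weight, which is why the shared scale is nearly free in storage terms.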

It is text-only. No image input. Hold that thought, because it matters at the end.

The day it froze the box

The first attempt ran unattended and came back to a dead machine. Two reboots and a hard power-off later, the dead container told the story. gpt-oss had loaded its weights fine, then walked into the torch.compile plus CUDA-graph capture phase on a cold 120B mixture-of-experts and hung there for 44 minutes, pinning the GPU the entire time.

On a normal server a wedged GPU job kills a process. On a DGX Spark the GPU and the desktop share the same unified memory, so a job that monopolizes the GPU starves the GNOME compositor and the whole machine appears frozen. There is no “the inference is busy but the desktop stays smooth” on this hardware. That is the Spark tax nobody warns you about.

The fix for the hang itself is one flag, --enforce-eager, which skips the compile-and-capture phase. But that only exposed the next wall.
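As a hedged sketch of what that launch looks like, the helper below assembles a `vllm serve` command line with `--enforce-eager` set. The helper name and the 0.6 memory fraction are my illustrative assumptions, not the exact values used on this box.

```python
import shlex


def build_serve_command(
    model: str,
    gpu_memory_utilization: float,
    enforce_eager: bool = True,
) -> list[str]:
    """Assemble a vLLM launch command as an argv list."""
    cmd = [
        "vllm", "serve", model,
        "--gpu-memory-utilization", str(gpu_memory_utilization),
    ]
    if enforce_eager:
        # Skip torch.compile and CUDA-graph capture; run in eager mode instead.
        cmd.append("--enforce-eager")
    return cmd


# Illustrative values only: 0.6 is an assumed cap, not a measured setting.
print(shlex.join(build_serve_command("openai/gpt-oss-120b", 0.6)))
```

Eager mode usually gives up some throughput in exchange for skipping the capture phase, so it is a trade and not a free fix.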

Why you have to build anything at all

Here is the part that looks insane if you have only ever run models the easy way.

Running a model is two separate things. The model is the pile of numbers, which I downloaded. The engine is the software plus little chunks of GPU code called kernels that do the actual math. On common hardware the engine is also ready-made: someone already compiled the kernels for that chip, so you pip install a finished package and go.

GPU kernels are not portable. They have to be compiled for the exact chip architecture, the same way a Windows binary will not run on a Mac. The DGX Spark uses NVIDIA’s GB10 (Blackwell-class, “SM121A”, and ARM-based), a design two months old at the time of writing. The efficient MXFP4 path gpt-oss needs (the “CUTLASS” kernels) has not been pre-compiled and shipped for this chip in any downloadable package. The off-the-shelf images fall back to a slower MXFP4 backend called MARLIN, which on this box balloons the model to 118GB after load and ignores the memory-utilization cap entirely. On a 128GB box that leaves nothing for the desktop.

So you build the right image: start from NVIDIA’s CUDA base, compile the CUTLASS and FlashInfer kernels from source for arch 12.1a, and seal the result. The compile pinned every core for 43 minutes. That is the frontier-hardware tax. You bought a GPU so new the software world has not shipped binaries for it, so you compile from source, which is the thing systems programmers do routinely and the thing the download counter never mentions.

The payoff was immediate and worth the day: on the purpose-built image the engine selected CUTLASS, loaded the native FlashInfer attention for SM12x, and sat at 61GB used instead of the 118GB balloon. The model that would not fit suddenly fit with room to spare. And the tax is paid exactly once. The image is sealed and tagged, every later launch reuses it, and only the weights still have to load. The 43 minutes buy a permanent artifact, not a one-off run.

Getting the image past a Tor proxy

One more wall, and a good one. Building the image needs a 25GB NVIDIA base container, and on this box Docker pulls are routed through a Tor proxy for hardening. A 25GB image over Tor does not finish. It stalls, the daemon sits idle, and a daemon restart does not fix it because the proxy is the bottleneck, not the daemon.

I did not touch the proxy. Everything else on the box (the Hugging Face downloads, plain curl) goes direct and fast, so I fetched the base image with skopeo (a registry client that is not the Docker daemon) straight to a tar, then ran docker load on it. Six minutes instead of an hour. The lesson worth keeping: when one tool is wedged, find the one that takes the fast road, and do not assume the slow path is the only path.
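The two-step fetch can be sketched as follows. The image reference and tar path in the usage line are placeholders, since the exact base image tag is not recorded here, and the helper names are hypothetical.

```python
import subprocess


def fetch_commands(image_ref: str, tar_path: str) -> list[list[str]]:
    """Return the skopeo pull and docker load commands for one image."""
    return [
        # skopeo talks to the registry itself, so the Docker daemon's proxy is not involved.
        ["skopeo", "copy", f"docker://{image_ref}", f"docker-archive:{tar_path}:{image_ref}"],
        # Import the saved archive into the local Docker image store.
        ["docker", "load", "--input", tar_path],
    ]


def run_all(commands: list[list[str]]) -> None:
    """Run each command in order and stop at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)


# Placeholder reference and path, for illustration only.
for cmd in fetch_commands("registry.example/base:tag", "/tmp/base.tar"):
    print(" ".join(cmd))
```

Tagging the archive with the same reference during the copy means the loaded image keeps the name the build file expects.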

There was also a self-inflicted lesson in here. Twice during the build I called it hung because the CPU sat at zero. It was not hung. It was unpacking a 19GB base layer, which is disk-bound and CPU-idle and looks identical to a deadlock. On unfamiliar hardware, measure the resource the work actually uses before declaring a hang. CPU-idle plus heavy disk writes is progress, not a stall.
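A minimal sketch of that check, assuming a Linux box: read the sectors-written counter from `/proc/diskstats` alongside CPU usage, and only call it a hang when both are quiet. The thresholds and names are arbitrary assumptions for illustration.

```python
from dataclasses import dataclass

SECTOR_BYTES = 512  # /proc/diskstats counts 512-byte sectors


@dataclass
class Sample:
    cpu_busy_fraction: float       # 0.0 to 1.0 over the sampling window
    disk_write_bytes_per_s: float  # derived from two diskstats readings


def bytes_written(diskstats_text: str, device: str) -> int:
    """Total bytes written to a device, parsed from /proc/diskstats text."""
    for line in diskstats_text.splitlines():
        parts = line.split()
        if len(parts) > 9 and parts[2] == device:
            return int(parts[9]) * SECTOR_BYTES  # column 10: sectors written
    raise KeyError(device)


def classify(sample: Sample, cpu_idle_below: float = 0.05,
             disk_busy_above: float = 50e6) -> str:
    """Label a window as progress or a possible hang (thresholds are assumed)."""
    if sample.cpu_busy_fraction >= cpu_idle_below:
        return "cpu-bound progress"
    if sample.disk_write_bytes_per_s >= disk_busy_above:
        return "disk-bound progress"
    return "possibly hung"
```

Sampling `bytes_written` twice a few seconds apart and dividing by the interval gives the write rate to feed into `classify`.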

Test A: speed against the leaderboard

Spark Arena publishes a verified gpt-oss-120b run on a DGX Spark at 58.82 tok/s. With the model finally serving, I measured decode throughput from vLLM’s own metrics, which I treat as the authoritative source: a third-party benchmark tool had given me three different, untrustworthy numbers for the same model, one of several measurement traps from that day.
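One way to take that kind of reading is to scrape the server's Prometheus metrics twice and difference the generated-token counter. This is a sketch, not the exact procedure used here: the counter name `vllm:generation_tokens_total` is an assumption to verify against your own `/metrics` output, and the result equals decode speed only when a single request is generating for the whole window.

```python
GENERATION_COUNTER = "vllm:generation_tokens_total"  # assumed name; check /metrics


def counter_total(metrics_text: str, name: str) -> float:
    """Sum every sample of one counter in Prometheus text exposition format."""
    total = 0.0
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        series, _, value = line.rpartition(" ")
        if series.split("{", 1)[0] == name:
            total += float(value)
    return total


def tokens_per_second(before: str, after: str, seconds: float,
                      name: str = GENERATION_COUNTER) -> float:
    """Tokens generated between two scrapes, divided by the elapsed time."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return (counter_total(after, name) - counter_total(before, name)) / seconds
```

Reading the engine's own counter avoids the client-side timing and tokenizer mismatches that can make external benchmark tools disagree with each other.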

My decode rate: 59.5 tok/s. That reproduces the published 58.82 almost exactly, and the small difference is run-to-run noise. It is also a genuine ceiling, not a number you tune past. I checked by locking the GPU clock to its 3003MHz maximum, and the rate moved by a fraction of a token per second. The GPU was healthy, cool, drawing under 50W, and not throttling. MoE decode here is memory-bandwidth-bound, not compute-bound. Every token streams the active expert weights out of the Spark’s unified LPDDR5X, and that bandwidth sets the wall. More clock does not help, because the bottleneck is moving 5.1B parameters’ worth of weights per token, not the arithmetic. The leaderboard number is honest, and on this box it is the speed of light.
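A back-of-envelope version of the bandwidth argument, as a sketch under stated assumptions: it uses 273 GB/s for the Spark's memory bandwidth (NVIDIA's published figure, not something measured here) and treats all 5.1B active parameters as MXFP4 at 4.25 bits each, which ignores the layers kept at higher precision and the KV-cache reads.

```python
def bytes_per_token(active_params: float, bits_per_weight: float) -> float:
    """Weight bytes that must be read from memory to produce one token."""
    return active_params * bits_per_weight / 8


def bandwidth_ceiling(bandwidth_bytes_per_s: float, active_params: float,
                      bits_per_weight: float) -> float:
    """Upper bound on decode tokens per second if only weight reads cost time."""
    return bandwidth_bytes_per_s / bytes_per_token(active_params, bits_per_weight)


ASSUMED_BANDWIDTH = 273e9  # bytes/s, spec-sheet figure (assumption)
ACTIVE_PARAMS = 5.1e9      # active parameters per token
MXFP4_BITS = 4.25          # 4-bit weight plus amortized shared scale

ceiling = bandwidth_ceiling(ASSUMED_BANDWIDTH, ACTIVE_PARAMS, MXFP4_BITS)
measured = 59.5
print(f"idealized ceiling: {ceiling:.1f} tok/s, measured fraction: {measured / ceiling:.2f}")
```

Under those assumptions the idealized bound is about 100 tok/s, so the measured rate sits near 59 percent of it. That gap is plausible for a bound that leaves out attention weights, cache traffic, and kernel overhead, and it is consistent with the clock change making no difference.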

Test B: does the 120B actually out-code my 35B?

This is the question the whole day was for. I run a small, deterministic-gate benchmark: the agent gets a coding task, a gate script decides pass or fail, no model grades another model. Baseline arm, no tool intervention, on my own tasks.
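For a sense of what a deterministic gate can look like, here is a minimal sketch for a rename task. It is not the gate from my harness: the function name, the `*.py` file filter, and the pass criteria are illustrative assumptions.

```python
import re
from pathlib import Path


def rename_gate(root: Path, old: str, new: str) -> bool:
    """Pass only if the old identifier is gone everywhere and the new one exists."""
    old_pat = re.compile(rf"\b{re.escape(old)}\b")
    new_pat = re.compile(rf"\b{re.escape(new)}\b")
    texts = [path.read_text() for path in root.rglob("*.py")]
    no_old_left = not any(old_pat.search(text) for text in texts)
    new_present = any(new_pat.search(text) for text in texts)
    return no_old_left and new_present
```

The word-boundary match means a longer identifier that merely contains the old name does not count as a leftover, which matters for tasks where a similar symbol must stay untouched. A stricter gate would also run the project's tests after the edit.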

| task | gpt-oss-120b | Qwen3.6-35B |
| --- | --- | --- |
| rename a function across 2 files | 67% | 100% |
| rename across 16 files | 0% | 100% |
| rename a symbol without touching a same-named one | 67% | 100% |
| three structured chat tasks | 0% / 100% / 100% | 100% each |
| overall | 56% (18 runs) | 100% (56 runs) |
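As a consistency check on the gpt-oss column, the sketch below recomputes the overall figure from the per-task rates. It assumes the 18 runs split evenly as three per task across six tasks, which is an inference from the percentages and not a figure recorded above.

```python
RUNS_PER_TASK = 3  # assumption: 18 runs spread evenly over six tasks

# Passes per task implied by the per-task percentages under that assumption.
PASSES = {
    "rename a function across 2 files": 2,   # 67%
    "rename across 16 files": 0,             # 0%
    "rename without touching a same-named symbol": 2,  # 67%
    "structured chat task 1": 0,             # 0%
    "structured chat task 2": 3,             # 100%
    "structured chat task 3": 3,             # 100%
}


def pass_rate(passes: int, runs: int) -> float:
    """Pass rate as a percentage."""
    return 100.0 * passes / runs


total_runs = RUNS_PER_TASK * len(PASSES)
overall = pass_rate(sum(PASSES.values()), total_runs)
print(f"{sum(PASSES.values())}/{total_runs} runs passed, {overall:.0f}%")
```

With only three runs per task under this assumption, each per-task percentage moves in steps of 33 points, so the per-task rows are coarse and the overall figure is the more stable number.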

The 120B I fought all day to stand up passes 56 percent of the work. The 35B I already run passes everything. For reference, NVIDIA’s Nemotron-3-Super-120B matched Qwen’s accuracy on the hardest task but took roughly eight times as long per run. The harness, gates, and raw per-run logs live in the agent-bench project.

The failure signature is consistent and worth naming: gpt-oss’s failures are almost all the same shape, a short reply with zero tool calls, the model answering briefly and never attempting the edit. When it does drive the tools it succeeds. Part of that 44 percent may be a harness-integration quirk rather than pure incapability, and I flag that honestly. But it is also the real experience of wiring gpt-oss in as an agent on this stack, which is exactly the question “should I switch my daily driver to it.” Charitably or not, it does not beat Qwen.
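A sketch of how that signature can be counted from run logs, assuming each run's final assistant message is stored in the OpenAI-style chat format with an optional `tool_calls` list. The 400-character cutoff for a "short" reply is an arbitrary illustration, not a threshold from my harness.

```python
SHORT_REPLY_CHARS = 400  # assumed cutoff for what counts as a brief answer


def classify_reply(message: dict) -> str:
    """Label one assistant message by whether it attempted any tool call."""
    if message.get("tool_calls"):
        return "attempted"
    content = message.get("content") or ""
    if len(content) <= SHORT_REPLY_CHARS:
        return "short_reply_no_tools"
    return "long_reply_no_tools"


def signature_counts(messages: list[dict]) -> dict[str, int]:
    """Tally the labels across a batch of runs."""
    counts: dict[str, int] = {}
    for message in messages:
        label = classify_reply(message)
        counts[label] = counts.get(label, 0) + 1
    return counts
```

Separating "never tried" from "tried and failed" is what lets a harness quirk be told apart from a capability gap.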

What the public benchmarks say

My gate result is not an outlier. The instinct is “117B must crush 35B,” but on Artificial Analysis the two sit close on the Agentic Index, and that is not a glitch. It is the lesson. Active parameters drive per-token reasoning (5.1B versus 3B, the same ballpark), and the extra 80B or so of gpt-oss is stored knowledge the Agentic Index does not reward. Qwen3.6 is tuned for tool use and coding; gpt-oss-120b is a strong general and reasoning model on a different axis. Two independent measurements, a public eval and my own deterministic gates, agree that the bigger model is not the better agent here.

The kicker: the model that wins can also see

While confirming gpt-oss has no vision at all, I checked whether the Qwen I run could get its own vision back. The production quant carries a full vision tower; one stale launch flag was hiding it. Drop the flag and the 35B that already out-codes the 120B also reads a screenshot. I handed it the dashboard from this very stack and it returned the header text and the status dots, correctly.

So the final tally on this hardware: the model that wins on coding also sees, and the 120B that loses cannot. Which Qwen quant keeps that vision, and how the three of them compare on speed and accuracy, is its own quant teardown.

What I run now

A frozen box, two reboots, a Tor-strangled pull beaten with skopeo, a 43-minute compile, and a self-inflicted watchdog kill at the finish line. At the end of all of it, gpt-oss-120b runs cleanly at the hardware’s bandwidth ceiling, and the measurement says keep Qwen.

That is not a disappointing ending. It is the point. The day did not buy a better model. It bought a number in place of a hypothesis, and the number says the 120B is not the better agent on the work I do. Build it, measure it, and ship the negative result with the same prominence as a positive one. The download counter measures curiosity, not reproducibility on your stack, and nearly four million pulls a month did not put one line of this gauntlet anywhere I could find it before I walked into it myself.
