The Gap Is Widening and I'm Staying Anyway

June 22, 2026 12 min read

There is a version of this site that does not survive contact with the data, and I want to name it before I defend the version that does. That version says: hold on a little longer, the open models are catching the closed ones, your desk box will sit at the frontier soon, just wait. It is a capability bet, and it is a bet I would lose. The honest reading of the numbers in June 2026 is that the open-closed gap is not closing on a schedule, and the most credible voices saying so are not the cloud’s salespeople. They are the people who run open-model labs for a living.

So I have to do the thing this series exists to do, which is concede the strongest objection in full before I answer it. The objection here is not from a critic of self-hosting. It is from inside the open camp, which is exactly why it lands.

The debate, read honestly

By 2026 the question “open versus closed” stopped being a yes or no and became a question about a threshold. Where is the line past which the open model is good enough that you stop renting, and is that line moving toward you or away from you? The data points in two directions at once, and an essay that only quotes the direction it likes is doing marketing with a chart in it.

In one direction, convergence. By Stanford’s AI Index, the quality gap between the best closed and best open models on one public benchmark fell from about 8.0% to roughly 1.7% across a single year, 2024 into 2025. That is a real collapse, and it is the number I leaned on in the earlier essay when I argued that a forkable weight is a real hedge because the substitute is close. It is still true. A gap of 1.7% on a public ruler is close enough that a replacement checkpoint is usually a quarter away, not a decade.

In the other direction, the gap is widening. And the people pointing at that direction are not the ones I can wave off.

Conceding the gap, in full

Nathan Lambert runs an open-model lab. He is not a closed-source partisan, he is a working participant in the open ecosystem, and his read is blunt: “the open-closed gap is more likely to grow than shrink”. His picture of open’s realistic future is not parity, it is niche specialization. Open models win where you can fine-tune them tightly to a narrow task, not where you line them up head to head against the latest frontier release and ask which is smarter in general.

Epoch puts a number on the same direction. By their open-closed capability index, the lag widened from 3 months to 4 months, and they note the figure is likely understated, because open models hillclimb the public benchmarks everyone can see while underperforming on the private ones they cannot train against. So the published gap is the optimistic gap. The real one is probably worse.

I do not get to dismiss either of these. They contradict the convergence story I just told, and the honest position is that both are measuring something true. The public-benchmark quality gap narrowed. The capability lead, measured by when an open model matches a closed one, widened. Those are not the same axis, and the axis that moved against me, reasoning, test-time compute, long-horizon agentic work, is exactly the axis where capital compounds. The frontier is buying its lead on the dimension a desk box is worst at.

If sovereignty were a capability bet, this is where I would have to fold. The leaderboard in June 2026 is entirely closed at the top. My box will never sit there. A reader who came here for “open is catching up, hold the line” should take the Epoch number and walk, because that reader is right and I cannot argue them out of it.

But that was never the bet. And the proof that it was never the bet is sitting on my own desk, where the bigger, faster, higher-scoring models lose to the small one every single day.

The lived receipt: bigger models lose here

Watch what happens when I actually run the frontier-adjacent models on the machine, against the small open checkpoint that does my real work.

I ran NVIDIA’s Nemotron-3-Super-120B on a single DGX Spark and fact-checked the vendor claims. It is a far bigger model than my Qwen daily driver. It ran at about 23.7 tokens per second, roughly a third of Qwen’s speed. On a coding gate it scored 17 out of 17, a clean sweep. And then it made 0 tool calls. Zero. A model that can write correct code in isolation but cannot pick up a tool and act is, for an agentic desk workflow, furniture. Expensive furniture, and it scored perfectly on the part of the exam that does not count. The 17 out of 17 is real and the 0 is also real, and the 0 is the one that decides whether the model is useful to me.

I ran GPT-OSS-120B on the same box and measured it the same way. Faster, about 59.5 tokens per second, comfortably quicker than Nemotron. On the agentic benchmark it scored 56%. Better than zero, and still well short of the small model that scores enough to ship work. Bigger parameter count, faster throughput, and it loses on the only axis I care about, which is whether it can do the loop of read, decide, call a tool, and continue.

Then there is the switch I made under everything. When I moved the Qwen checkpoint from one quantization to an AutoRound build, I kept the same served name and the same port. The rest of the stack did not notice. No client changed a line. That is the receipt that matters most, because it shows what the bet actually rests on. The frontier could ship a model twice as good tomorrow and it would not change the served name on my box, it would not touch the port, and it would not interrupt a single running job. My system does not poll the leaderboard. It serves an endpoint I control.

Put the three together and a pattern falls out that the capability framing cannot explain. The slower model wins. The smaller model wins. The lower-scored-on-paper model wins. They win because “frontier capability” and “useful on my desk” are different quantities, and the desk stack is optimized for the second one. None of these results would surprise anyone who has stopped confusing the two.

The hams made this bet a century ago

None of this is new. Amateur radio operators have been making the exact bet I am making, and losing the exact race I am losing, for about a hundred years. They never matched the commercial broadcasters on transmitter power or reach. They could not. The big stations had the towers, the wattage, the licenses, and the budgets, and with each decade of commercial buildout that gap got wider, not narrower. A ham at a kitchen table was never going to out-broadcast a network. That was permanent, and everyone involved knew it.

And yet, when hurricanes and earthquakes and floods knocked out the commercial and cellular infrastructure, the amateur networks kept carrying emergency communication when the big networks went dark. There are organized volunteer services built for precisely this, like ARES, the Amateur Radio Emergency Service, standing by for the day the professional grid fails. The hams did not bet on out-powering the networks. They bet on still being on the air when the networks were not.

That is the control-and-resilience bet, drawn cleanly, a full century before anyone argued about open weights versus the frontier. The capability gap was real, permanent, and growing, and it was also beside the point. The question was never whose signal was strongest. It was whose signal was still there. A leaderboard measures power and reach. A disaster measures who is left transmitting. My desk box is a transmitter that answers to me, and the bet is the same one the hams have been quietly winning, in the worst week of someone’s life, for a hundred years.

Separating control from capability

Here is the mistake, and both sides make it. The evangelist says self-hosting is winning because open is catching up. The critic says self-hosting is losing because open is falling behind. They are arguing about the same axis, capability at the frontier, and they have both quietly agreed that this axis is what sovereignty is about. It is not. That shared premise is the error, and once you drop it the whole debate I just walked through stops being load-bearing for the decision I made.

I keep one definition of sovereign and it has nothing to do with the frontier. A system is sovereign if you can keep operating it after every external dependency in the stack changes its mind about you. Read that again and find the word “best” in it. It is not there. There is no clause about parity, no clause about leaderboards, no clause that says your model must beat the latest closed release. The test is whether you keep operating, not whether you win.

A capability bet is a bet on a number that someone else controls and that moves every few months, almost always away from you. A control bet is a bet on a property of your own setup that the leaderboard cannot touch: that the weights are on your disk, that the endpoint answers to you, that when the upstream lab goes closed or the cloud changes its terms, the checkpoint you already hold keeps running. The Epoch number can widen every quarter from here to forever and not falsify a single thing I have claimed, because I never claimed my box would close that gap. I claimed it would still answer to me when the gap moved, and a gap you do not run in is one you cannot lose.

This is why the two columns in the diagram at the top can both be true at once. “A capability bet” loses the moment a better closed model ships, and a better closed model always ships. “A control bet” is not exposed to that event at all. The desk receipts are the demonstration: a model can be slower (23.7 against Qwen’s roughly triple that), bigger, even perfect on a narrow gate (17 out of 17) and still lose on my desk, because the thing I optimized for was never the frontier. It was the loop landing, the tool call connecting, the work shipping, on terms I set and can keep setting.

There is an honest caveat, and hiding it would make this the authority theatre the series refuses. The control bet is comparative, not absolute. I am not free of the upstream, the open weights came from a lab I do not control, and if every open lab went closed and froze new releases tomorrow, my niche-specialized desk model would slowly age out against a frontier that kept moving. Control buys me the ability to keep operating through a change of terms. It does not buy me immortality against a decade of frontier progress I opted out of. The bet is that for the work I actually do, below the frontier, that decade does not arrive before the next forkable checkpoint does. That is a bet, with a failure mode, named.

What I am actually claiming

I am not claiming the open models are catching up on schedule, because the most credible read says they are not, and the people saying so run open labs. I am not claiming my desk box is frontier-competitive, because it is not and will not be. I am not claiming the gap is closing, because Epoch says it widened from 3 months to 4 and is probably understating it, and I believe them.

What I am claiming is narrower and survives all of that. Sovereignty was never a capability bet. It is a control bet, and the two are different quantities that both camps keep collapsing into one. The proof is on my desk, where the bigger model at 17 out of 17 makes 0 tool calls and loses, where the faster 59.5-tokens-per-second model scores 56% and loses, where the slower small model at a third of nobody’s frontier wins because winning here means landing the loop, and where a quantization swap under the same served name and port changed everything about the model and nothing about the system. A leaderboard I do not control got better. My endpoint, which I do control, did not flinch.

So the gap is widening, and I am staying, and those two sentences do not fight each other once you stop confusing capability with control. The widening gap is a fact about a race I am not running. Staying is a position in a different game entirely, the one about who can keep operating when the terms change. I conceded the strongest objection in full. It happened to be aimed at a bet I never placed.

This is essay four of a series. The prior one was about how I moved the dependency without removing it, which is the same move read from a different angle: a relocated dependency you can see is a control gain even when it is not a capability gain. The series spine, and why each essay concedes its strongest objection before it answers, lives on the philosophy page. The structured, complete version of the argument is the forthcoming book, for which these essays are the public workshop. Read the series in order on the philosophy page if you want to watch the control bet get built one concession at a time.

Comparison

The bet I am actually making

Both columns can be true at once. That is the whole point.

A capability bet

A control bet

What it wins on

Your box beating the frontier.

Your box still running after the frontier moves.

What kills it

A better closed model ships. It always does.

Nothing the leaderboard does can touch it.

How you check it

Benchmark scores against the latest release.

Can you keep operating when the tap closes.

What you measure

Tokens per second, points on a gate.

Tool calls landed, work shipped, terms you set.