A Benchmark Handed Me a Number Three Times in One Day. Three Times It Was Lying.

June 12, 2026 9 min read

A while back I almost published a sentence that read “Mistral Small 4 scores zero on coding, the quant kills it.” It was wrong. The benchmark harness was hanging behind this stack’s Tor proxy and never reached the model, so it scored an empty transcript. A competent model scoring exactly zero should have stopped me cold, and it nearly did not. I wrote that one up as the broken ruler.

I thought that was the lesson learned. Then I spent a day standing up two more large models on the same box, and my own measurements tried to lie to me three more times, in three different ways. None of the wrong numbers were random noise. Each one had a specific cause, a specific tell that should have flagged it, and a specific fix. This is the field guide I wish I had taped to the monitor.

The one rule underneath all three: a number that a benchmark hands you is not the same thing as a number you can trust. The gap between those two is where a 140-article blog quietly dies, one plausible-looking wrong figure at a time.

The three traps, and the tell that caught each:

Trap	The number it gave	Root cause	The tell	The rule it left
1. Working model scored at zero, twice	0% on every task	measured in reconstructed containers, not the prod launcher	a model I run daily cannot score a clean zero	measure with the actual production launcher
2. Test framed the model for my bug	“vision breaks tool-calling”	a malformed request gave the model no tools to call	the result contradicted what the vendor built it to do	never publish from a single unclean test
3. Cold number undersold the truth	43 tok/s	sampled before speculative decoding warmed up	43 against a trusted 69, a roughly 38% collapse with no cause	warm the engine, measure over a real window

Trap one: the harness that scored a working model at zero, twice

I built a clean little matrix to compare three quantizations of the same model on coding accuracy. I ran it. Every cell came back zero percent. I tuned it and ran it again. Zero percent, across the board, a second time.

Here is the tell I should have trusted immediately: I run one of those exact quants as my daily driver and watch it complete real refactors every day. A model I personally know to be competent does not score a clean zero on three tasks. When the instrument disagrees that hard with lived reality, the instrument is the thing to doubt, not reality.

The cause was not the models. I had measured them inside hand-rolled containers I assembled for the matrix, and those containers differed from my production launchers in some small way I never fully isolated. The difference was enough to send the model into a reasoning loop on complex prompts: it would think for thirty thousand tokens, never call a tool, and time out. Zero. My real production launcher, given the identical task, scored that same model at one hundred percent.

The fix is a rule now: measure with the actual production launcher, never a reconstructed approximation of it. A benchmark is a model plus an image plus flags plus a harness plus a prompt, and if you rebuild four of those five from memory you are measuring your reconstruction, not the model. The reconstructed-from-memory result went in the bin. The production-launcher result is the one that stands.
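One way to make that rule mechanical is to fingerprint all five ingredients and diff the benchmark run against the production one before trusting a score. What follows is a minimal sketch, not the agent-bench code: `RunConfig`, its field names, and every value in the example are hypothetical placeholders.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class RunConfig:
    """The five things a benchmark result depends on (hypothetical field names)."""
    model: str
    image: str
    flags: tuple
    harness: str
    prompt_sha: str


def fingerprint(cfg: RunConfig) -> str:
    """Short, stable hash of the whole configuration."""
    blob = json.dumps(asdict(cfg), sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()[:12]


def differing_fields(a: RunConfig, b: RunConfig) -> list:
    """Names of the fields on which two configurations disagree."""
    da, db = asdict(a), asdict(b)
    return [name for name in da if da[name] != db[name]]


# Placeholder values, for illustration only.
prod = RunConfig(
    model="model-q4",
    image="prod-image:1",
    flags=("--flag-a",),
    harness="bench-v1",
    prompt_sha="abc123",
)
rebuilt = replace(prod, image="rebuilt-image:1", flags=("--flag-a", "--flag-b"))

if fingerprint(prod) != fingerprint(rebuilt):
    print("not the production setup, differs in:", differing_fields(prod, rebuilt))
```

Storing the fingerprint next to every score makes a reconstructed run impossible to mistake for a production one later, even when the difference is too small to notice by eye.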

Trap two: the test that framed the model for my bug

While I was at it, I checked whether one quant could still process images, and whether vision and tool-calling could work at the same time. My first test said no: with vision active, the model emitted what looked like a tool call as plain text instead of an actual structured call. Clean story, almost wrote it: “turning vision on breaks tool-calling.”

It was wrong, and the wrongness was mine. The web told me plainly that this model family is built for agentic use and vision together, which should have been my first stop, not my last. So I reproduced the test properly, with a correctly formed request that actually included the tools the model was supposed to call. With that fixed, vision-on returned a clean, structured tool call, exactly the way it is supposed to. The original “plain text” was the model improvising because my malformed request never gave it any tools to call. I had handed it a broken prompt and blamed it for the result.

This is the most insidious of the three, because the wrong conclusion was technical, specific, and citable. It would have looked authoritative on the page. The fix is uncomfortable and worth saying out loud: never publish a finding from a single unclean test, especially one that contradicts what the vendor designed the thing to do. If your result says a tool is broken in a way its makers clearly built it not to be, the prior should be that your test is broken. Reproduce with a validated harness before you write the word “breaks.”
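A cheap guard against this trap is to validate the request before blaming the model. The sketch below assumes an OpenAI-style chat-completions payload, where tools arrive as a `tools` list of `{"type": "function", "function": {...}}` entries. Which request shape the original harness used is not stated here, so treat that as an assumption, and `request_problems` as a hypothetical helper name.

```python
def request_problems(payload: dict) -> list:
    """Return reasons this request cannot fairly test tool-calling."""
    tools = payload.get("tools")
    if not isinstance(tools, list) or not tools:
        return ["no tools supplied: the model has nothing to call"]

    problems = []
    for i, tool in enumerate(tools):
        fn = tool.get("function") if isinstance(tool, dict) else None
        if not isinstance(fn, dict) or not fn.get("name"):
            problems.append(f"tools[{i}] has no function name")
    return problems


def assert_fair_tool_test(payload: dict) -> None:
    """Refuse to run a tool-calling test on a request that offers no tools."""
    problems = request_problems(payload)
    if problems:
        raise ValueError("unclean test, fix the harness first: " + "; ".join(problems))
```

Calling `assert_fair_tool_test` at the top of every tool-calling case turns "the model answered in plain text" into a harness error whenever the request itself was the problem.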

Trap three: the cold number that undersold the truth by a third

The third lie was the quietest. I measured the decode speed of my production model right after it launched and got 43 tokens per second. A clean, specific, plausible number. I had every reason to write it down.

I did not, because it failed a sanity check against a figure I already trusted. This model had been measured at around 69 tokens per second by a separate method weeks earlier, the measurement that put it into production in the first place. A drop to 43 is not noise, it is a collapse of roughly thirty-eight percent, and nothing about the setup had changed to justify it. So I went looking instead of publishing.
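The sanity check itself is one line of arithmetic: the drop relative to the trusted figure, compared against a tolerance. In this sketch the 10% tolerance is an arbitrary assumption, not a value used anywhere on this blog, and the function names are hypothetical.

```python
def relative_drop(measured: float, trusted: float) -> float:
    """Fraction by which `measured` falls short of `trusted` (negative if higher)."""
    if trusted <= 0:
        raise ValueError("trusted reference must be positive")
    return (trusted - measured) / trusted


def needs_investigation(measured: float, trusted: float, tolerance: float = 0.10) -> bool:
    """True when the new number strays further from the trusted one than tolerance allows."""
    return abs(relative_drop(measured, trusted)) > tolerance


drop = relative_drop(43.0, 69.0)
print(f"drop vs trusted figure: {drop:.1%}")
print("investigate before publishing:", needs_investigation(43.0, 69.0))
```

Fed the cold 43 against the trusted 69, the gate trips on a drop of more than a third, far outside any tolerance that could pass for noise.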

The cause was speculative decoding, the trick where a small draft model proposes several tokens and the big model verifies them in a batch. The speedup only materializes when the draft’s guesses are accepted often enough to pay for the verification step, and that acceptance rate is not constant. It climbs as the engine settles into a steady request pattern and the draft model finds its rhythm. My cold measurement took its sample in the first handful of requests, before the draft path was warm, and caught the engine at its worst. Sampling there measures the warmup, not the model, the same way timing a sprinter’s first two steps out of the blocks tells you nothing about their top speed. Warm it properly, measure over a real window of requests, and it climbed back to 69, matching the older number from the other method almost exactly.
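A warm measurement can be sketched as: throw away the first batch of requests, then time a fixed window. Everything here is an assumption for illustration. `generate` is a hypothetical callable that sends one request and returns the number of tokens produced, and the warmup and window sizes are placeholders to tune per engine. The result is end-to-end tokens per wall-clock second; isolating pure decode speed would need the engine's own timing data.

```python
import time
from typing import Callable


def warm_tokens_per_second(
    generate: Callable[[], int],
    warmup_requests: int = 20,
    measured_requests: int = 50,
    clock: Callable[[], float] = time.perf_counter,
) -> float:
    """Tokens per second over a window that starts only after warmup."""
    # Warmup: run the requests, discard their timings and token counts.
    for _ in range(warmup_requests):
        generate()

    # Measured window: total tokens over total elapsed time.
    total_tokens = 0
    start = clock()
    for _ in range(measured_requests):
        total_tokens += generate()
    elapsed = clock() - start

    if elapsed <= 0:
        raise ValueError("measurement window too short to time")
    return total_tokens / elapsed
```

Dividing total tokens by total time over the window, rather than averaging per-request rates, keeps one unusually fast or slow request from dominating the figure.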

That last detail is the whole point. Two independent rulers agreeing is how you earn the right to trust a number. My warm measurement and the older production measurement landed within a fraction of a token of each other, so 69 is real. The cold 43 agreed with nothing, so it went in the bin with the others.

The deeper problem: you own more than one ruler, and they disagree

Underneath all three traps is a fact people gloss over. There is no single canonical “tokens per second.” I had three different measurement tools available, and on the same unchanged model they gave me materially different numbers. One of them, a popular throughput benchmark, gave me 3.6 tokens per second on one run and 30 on the next for an unchanged setup. It is not trustworthy on this hardware, so I stopped using it entirely.
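Whether a ruler deserves trust can be checked before it measures anything that matters: run it several times on an unchanged setup and look at the spread. In this sketch the 1.10 limit is an arbitrary assumption and the function names are hypothetical.

```python
def spread_ratio(samples: list) -> float:
    """Largest repeat divided by smallest repeat on an unchanged setup."""
    if len(samples) < 2:
        raise ValueError("need at least two repeats to judge a ruler")
    if min(samples) <= 0:
        raise ValueError("throughput samples must be positive")
    return max(samples) / min(samples)


def is_repeatable(samples: list, max_ratio: float = 1.10) -> bool:
    """True when repeated runs agree to within `max_ratio`."""
    return spread_ratio(samples) <= max_ratio


print(f"spread: {spread_ratio([3.6, 30.0]):.1f}x")
print("usable ruler:", is_repeatable([3.6, 30.0]))
```

A ruler whose repeats differ by more than eight times on the same setup fails any reasonable limit, which is the case for dropping it rather than averaging its output.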

The discipline that survives is narrow and absolute: pick one ruler per axis, measure every contender with that one ruler, and only ever compare same-ruler numbers. A speed from tool A next to a speed from tool B is not a comparison, it is a category error dressed up as a table. When I publish that one quant is thirteen percent faster than another, both figures come from the same ruler, run the same day, on the same box. The cross-ruler agreement on the winner is a confidence check, not a data source. The moment you let two rulers into the same column, the column is fiction.
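The same-ruler rule is easy to enforce in code by tagging every number with the tool that produced it and refusing to compare across tags. A minimal sketch, with `Measurement` and `percent_faster` as hypothetical names:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Measurement:
    subject: str              # what was measured, e.g. a quant label
    ruler: str                # which tool produced the number
    tokens_per_second: float


def percent_faster(a: Measurement, b: Measurement) -> float:
    """How much faster `a` is than `b`, in percent. Same ruler only."""
    if a.ruler != b.ruler:
        raise ValueError(
            f"cross-ruler comparison refused: {a.ruler!r} vs {b.ruler!r}"
        )
    if b.tokens_per_second <= 0:
        raise ValueError("baseline speed must be positive")
    return (a.tokens_per_second / b.tokens_per_second - 1.0) * 100.0
```

Making the mismatch an exception instead of a warning means a mixed column cannot be built by accident. A second ruler can still be consulted separately to confirm which contender wins.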

The tape on the monitor

The whole day compresses into four questions, and every number now has to survive all of them before it is allowed near a sentence:

Was it measured through the actual production launcher, not a reconstruction of it?
Is it in a plausible range against something I already trust, and if not, can I name the reason?
Was the engine warm, and was the measurement window wide enough to swallow the warmup?
Does a second, independent ruler agree with the conclusion this number is supposed to carry?

A no on any of the four sends the run to the bin, not to the blog. Trap one fails the first question, trap two fails the second, trap three fails the third, and the number that finally got published is the one that passed the fourth.
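The four questions translate directly into a gate that a run has to clear before its number is used. This is a sketch with hypothetical names; the answers still come from a human checking each question honestly, and the code only refuses to forget one.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Run:
    via_production_launcher: bool   # question 1
    plausible_or_explained: bool    # question 2
    warm_and_wide_window: bool      # question 3
    second_ruler_agrees: bool       # question 4


def failed_gates(run: Run) -> list:
    """Names of the questions this run answered 'no' to, in question order."""
    return [name for name, ok in asdict(run).items() if not ok]


def publishable(run: Run) -> bool:
    """A single 'no' sends the run to the bin."""
    return not failed_gates(run)


# A run shaped like the cold 43 tok/s measurement: right launcher, nothing else.
cold_run = Run(
    via_production_launcher=True,
    plausible_or_explained=False,
    warm_and_wide_window=False,
    second_ruler_agrees=False,
)
print("publishable:", publishable(cold_run), "failed:", failed_gates(cold_run))
```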

Why I would rather ship a hole than a wrong number

Each of these three caught numbers was specific and believable. A reader would have nodded at any of them. That is exactly why they are dangerous: a wrong number does not announce itself, it sits quietly in a table looking like all the right ones, and it spends the trust the other hundred-odd articles earned.

So the working rule on this blog is to throw runs out aggressively. Every result gets a sanity gate before it is allowed near a sentence: is this in a plausible range against something I already trust, and if not, why not? A model I know is competent scoring zero, a tool the vendor built for a job failing at that job, a speed that collapses for no reason, all three were caught by the same reflex of refusing to believe a number just because a machine produced it. When I could not measure something cleanly, like a single-stream speed for one of the models this round, I left the hole and said so, rather than reaching for the nearest plausible figure.

The discarded run is not wasted work. On a blog whose only real asset is that the numbers are real, the discarded run is the product. Catching the lie is the job. The clean number is just what is left over after you have done it.

This day’s three traps sit on the same box and the same discipline as the gpt-oss-120b teardown and the quant comparison, and they are the sequel to the original broken ruler that taught me to distrust a zero. The benchmark code, gates, and raw logs live in the agent-bench project.
