Three Coding Leaderboards, Three Blind Spots: What HackerNoon and WhatLLM Don't Tell Self-Hosters
Numbers are a snapshot from 2026-05-20. Both sources change over time: the HackerNoon benchmark was last refreshed on 2026-03-08, and WhatLLM.org refreshes weekly. Verify against the live pages before quoting any specific pass rate or Quality Index. The framing below is durable; the specific numbers are not.
On this page:
- The Quote That Travels Faster Than Its Data
- What a “Coding Leaderboard” Actually Measures
- HackerNoon: 5 Models x 6 Languages x 100 LeetCode
- The Sonnet Surprise (And the Opus Inverse)
- WhatLLM.org: Top-10 Coding, and What That Ranking Is Made Of
- Two Tests, Zero Overlap
- Cloud Dominance Is a Methodology Bias
- The Self-Host Column That Doesn’t Exist
- Spark Arena Has Throughput, Not Correctness
- Other Coding Benchmarks That Should Be on the Radar
- Qwen3.6 Max Preview Is Not Qwen3.6 PrismaQuant
- A Reading Protocol for Self-Hosters
- What’s Missing Across All Sources
- Glossary and Sources
The Quote That Travels Faster Than Its Data
“All are pretty solid choices, but they do have their specialties. There are comparisons for instance with programming languages, and there are also raw benchmarks, but you need to look into if what they are testing against is what your problem needs (i.e. programming language, API, database etc). Currently I use Sonnet for most of the non-thinking work.”
a forum reply we recently saw
This is the everyday default reflex of a working developer. They tried a few models, settled on one for the bulk of their day, and now reach for it without thinking. The reply is friendly, honest, and useful right up to its last sentence. “Sonnet for most of the non-thinking work” is the bit that travels. It gets quoted in Slack, lifted into a coworker’s bookmarks, repeated as advice. It is also load-bearing on an assumption that never gets stated: that the recommendation ports across stacks, languages, and problems. That assumption is what this article tests.
This is not a takedown of Sonnet. Sonnet is a fine model. It is a takedown of recommendations that arrive without a workload description. A coding model that wins on a Python LeetCode set may be the wrong call for a Rust ops script that has to compile against three target triples, or an Oracle SQL audit where the dialect details decide whether the query runs at all. The three public leaderboards we are about to walk through already say so in their own data. The interesting part is that they disagree with each other in ways that change the answer.
We did this exercise once before, on a different axis. In the two-leaderboards piece we put arena.ai (cloud quality, Elo-ranked) next to spark-arena.com (self-host throughput, tokens per second on real hardware) and showed that they measure two different things on two different machines. The missing third axis there was total cost of ownership. The same triangulation pattern applies here, just rotated ninety degrees. The axis this time is coding ability per language, and the missing column is the one no public board publishes for self-hosted models on the box you actually own.
A model recommendation without a workload description is taste, not advice.
What a “Coding Leaderboard” Actually Measures
Every public board that claims to rank coding models is really claiming to measure one of three axes. Quality: does the answer compile, pass tests, and do what was asked. Throughput: how fast the tokens come out once the model starts talking. Specialty: does the answer hold up in this specific language, framework, or database, not just in the average case across all of them. The punchline up front: no public board covers all three. Each one picks a side, and the side it picks decides what its ranking means.
| Axis | Asks | Boards that try | Boards that don’t |
|---|---|---|---|
| Quality | Is the answer right | arena.ai, WhatLLM coding | spark-arena.com |
| Throughput | How fast does it answer | spark-arena.com | arena.ai, WhatLLM |
| Specialty | Right for this language | HackerNoon (6 languages), MultiPL-E, Aider Polyglot | WhatLLM aggregate, arena.ai |
If you have not lived inside coding benchmarks before, here is the smallest vocabulary you need to read the rest of this article without bouncing off the acronyms.
- Pass@1. The fraction of problems a model solves on its first attempt with no retries.
- LiveCodeBench. A contamination-resistant code-generation benchmark that refreshes its problem set monthly, so the questions have not had time to land in any training corpus.
- Quality Index. A single composite number from Artificial Analysis that bundles several benchmark scores together. WhatLLM.org uses it as a top-line ranking.
- Elo. A relative rating system made famous by chess. arena.ai applies it to pairwise human comparisons to rank model quality.
- Pass rate. The fraction of test cases a model gets right on a fixed problem set. Simple, popular, and easy to game if the test set leaks.
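As a minimal sketch of the Pass@1 definition above, assume a hypothetical list with one boolean per problem that records whether the first submission was accepted. The helper name `pass_at_1` and the sample data are illustrative only and do not come from any of the boards.

```python
from typing import Sequence


def pass_at_1(first_attempt_accepted: Sequence[bool]) -> float:
    """Fraction of problems solved on the first attempt, no retries."""
    if not first_attempt_accepted:
        raise ValueError("need at least one problem")
    return sum(first_attempt_accepted) / len(first_attempt_accepted)


# Hypothetical run: 100 problems, 93 accepted on the first try.
results = [True] * 93 + [False] * 7
print(f"pass@1 = {pass_at_1(results):.0%}")
```

With binary judging, where a submission is either accepted or rejected, this is the whole calculation. Everything interesting is in how the problem set was chosen.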
The rest of the article walks the three boards that actually exist in public, names what each one is good at, and then names the column none of them publish.
Most “best LLM for coding” articles answer one of three different questions while citing one of three different leaderboards. Knowing which question you are asking is half the work.
HackerNoon: 5 Models x 6 Languages x 100 LeetCode
The first board worth reading is the HackerNoon coding-languages benchmark, published on 2026-03-08. It is not the most-cited coding evaluation on the internet, and that is exactly what makes it useful. Most public coding rankings hand you one average number per model. HackerNoon hands you six numbers per model, one per language, and lets you see how flat or how spiky the curve is. That is the column the popular boards leave out.
The methodology is small and concrete. Five LLMs went on the bench: Claude Sonnet 4.5, Gemini 2.5 Flash, Gemini 3 Flash Preview, GPT-5-mini, and Grok Code Fast 1.0825. Six languages: Python3, Java, Rust, Elixir, MySQL, and Oracle SQL. The algorithmic test set is 100 LeetCode problems sampled between October 2025 and February 2026, split 15 Easy, 59 Medium, 26 Hard. A separate set of 321 database problems carries the SQL comparison. Judging is binary: the online judge either accepts the submission or it does not. No partial credit, no style points, no architectural taste.
| Model | Python3 | Java | Rust | Elixir |
|---|---|---|---|---|
| Claude Sonnet 4.5 | 50% | 52% | 51% | 35% |
| Gemini 2.5 Flash | 82% | 82% | 77% | 39% |
| Gemini 3 Flash Preview | 84% | 93% | 78% | 83% |
| GPT-5-mini | 93% | 94% | 80% | 63% |
| Grok Code Fast 1.0825 | 73% | 65% | 65% | 30% |
The SQL split tells its own story. All five models perform measurably worse on Oracle SQL than on MySQL, with the gap ranging from 9.7 to 18.7 percentage points depending on which model you read. The HackerNoon author lays the cause at training-data density: MySQL shows up in public code orders of magnitude more often than Oracle SQL, so every model has seen far fewer Oracle examples during training. The lesson generalises beyond databases. If your stack lives in a popular language, every model has seen ten thousand examples of your problem. If your stack lives in Elixir or Oracle SQL or a niche framework, you are working against a thinner slice of every model’s memory.
Read the pass rate honestly. It tells you whether the submission compiled, ran, and produced the expected output on every test case the judge had. It does not tell you whether a human teammate could read the diff a year from now, whether the variable names made sense, whether the algorithm choice fits the surrounding code, or whether the same model would still be right when the test set grew to cover edge cases the judge never asked about. A model that wins LeetCode in Rust is not, by that result alone, a model you want refactoring your service.
Pass rate measures whether a model can clear a single fence. It does not measure whether the codebase you ship can survive the model writing in it for six months.
The Sonnet Surprise (And the Opus Inverse)
Sonnet 4.5 finishes last of five on the HackerNoon LeetCode set. That alone is news worth pausing on, given how many of those forum recommendations cite Sonnet as the default. Fifty percent on Python, fifty-two on Java, fifty-one on Rust, thirty-five on Elixir. The leader on the same set is GPT-5-mini at 93/94/80/63. The most uniform curve belongs to Gemini 3 Flash Preview at 84/93/78/83. Three cheap variants of three frontier families, three very different shapes.
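One way to put a number on "flat" versus "spiky" is to compute the spread of each model's per-language pass rates. The sketch below uses the four algorithmic columns from the HackerNoon table. The choice of range and population standard deviation as the spikiness measures is ours, not HackerNoon's.

```python
from statistics import mean, pstdev

# Pass rates (%) copied from the HackerNoon table above,
# in column order: Python3, Java, Rust, Elixir.
PASS_RATES = {
    "Claude Sonnet 4.5": [50, 52, 51, 35],
    "Gemini 2.5 Flash": [82, 82, 77, 39],
    "Gemini 3 Flash Preview": [84, 93, 78, 83],
    "GPT-5-mini": [93, 94, 80, 63],
    "Grok Code Fast 1.0825": [73, 65, 65, 30],
}


def profile(scores):
    """Return (mean, best-minus-worst range, population std dev)."""
    return mean(scores), max(scores) - min(scores), pstdev(scores)


# Sort flattest curve first, using the range as the spikiness measure.
ranked = sorted(PASS_RATES.items(), key=lambda kv: profile(kv[1])[1])
for model, scores in ranked:
    avg, spread, sd = profile(scores)
    print(f"{model:24s} mean={avg:5.1f} range={spread:3d} sd={sd:4.1f}")
```

On these numbers Gemini 3 Flash Preview has the smallest range at 15 points. Sonnet 4.5 is second at 17, which is a reminder that a flat curve and a high curve are different things.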
Now look at a different board. On WhatLLM’s coding ranking, Claude Opus 4.7 with Adaptive Reasoning sits at #2 with a Quality Index of 57.3 and a SciCode score of 55%, narrowly behind GPT-5.5 xhigh at #1. WhatLLM also tags Opus 4.5 specifically for “multi-file understanding and architectural reasoning”. The expensive Anthropic variant is at the very top of the WhatLLM list. The cheap Anthropic variant is at the very bottom of the HackerNoon list. Same vendor. Same family. Opposite verdicts, depending on which board you cite and which variant the board chose to test.
The forum quote we opened with said “Sonnet for most of the non-thinking work”. That is a real working sentence about a real working day, and the developer who wrote it is almost certainly happy with Sonnet in the role they use it for. But “Anthropic is good at coding” is not a sentence either board can settle by itself, and the two of them together do not settle it either. They split it: yes if you ask the question on WhatLLM’s terms with Opus on the bench, no if you ask it on HackerNoon’s terms with Sonnet on the bench.
Call this the Sonnet Surprise on one side and the Opus Inverse on the other. Neither is a complete read. Pretending otherwise is how the recommendation game gets stuck.
A round of explicit disclaimers, because the data deserves them. This is not a verdict on Sonnet’s daily fitness. LeetCode in Elixir is not what most developers ask Sonnet to do. The HackerNoon set is algorithm puzzles, where the question is whether the submission passed, not whether the patch made the repo better. It is also not a verdict against Opus, since HackerNoon never put Opus on the bench in the first place. The interesting fact is the disconnect: between what the default reflex reaches for, what the public benchmark rewards, and which family variant the source happened to be able to afford to test.
Within one model family, the cheap variant and the expensive variant can land on opposite ends of two different leaderboards. “Anthropic is good at coding” is not a sentence either board can settle.
WhatLLM.org: Top-10 Coding, and What That Ranking Is Made Of
The second board is the WhatLLM coding ranking. It is the closest thing to a mainstream answer to the question “which model should I use to write code”, and it gets quoted in roundup posts the way arena.ai gets quoted in marketing decks. If you have only ever read one coding leaderboard, this is probably it.
| Rank | Model | Quality Index | SciCode |
|---|---|---|---|
| 1 | GPT-5.5 (xhigh), OpenAI | 60.2 | 56% |
| 2 | Claude Opus 4.7 (Adaptive Reasoning), Anthropic | 57.3 | 55% |
| 3 | Gemini 3.1 Pro Preview, Google | 57.2 | 59% |
| 4 | GPT-5.4 (xhigh), OpenAI | 56.8 | 57% |
| 5 | Qwen3.7 Max, Alibaba | 56.6 | 49% |
| 6 | Gemini 3.5 Flash (high), Google | 55.3 | 53% |
| 7 | Kimi K2.6, Kimi | 53.9 | 54% |
| 8 | MiMo-V2.5-Pro, Xiaomi | 53.8 | 50% |
| 9 | GPT-5.3 Codex (xhigh), OpenAI | 53.6 | 53% |
| 10 | Grok 4.3 (high), xAI | 53.2 | 47% |
WhatLLM’s top-line ranking is an aggregate. Three benchmarks feed into the Quality Index number that drives the sort: LiveCodeBench for contamination-resistant code generation that refreshes its problem set monthly, Terminal-Bench Hard for shell scripting, devops, and system-level programming, and SciCode for scientific computing and research code. The underlying numbers come from Artificial Analysis, and the page itself refreshes weekly as new model releases land.
Note what is missing from this view. There is no per-language split in the top-line ranking. There is no hardware annotation, so you cannot tell which of these scores came from a quantized local run and which came from a full-precision API. There is no real-repo task in the benchmark mix. The Quality Index treats LiveCodeBench-style code-generation, Terminal-Bench shell tasks, and SciCode research workloads as if they were the same kind of skill and rolls them into one number. They are not the same skill, and rolling them up is exactly what makes this ranking useful as a coarse filter and dangerous as a fine instrument.
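To see why rolling three skills into one number loses information, consider a toy aggregate. The weighting behind the Quality Index is not described here, so the sketch assumes a plain unweighted mean. The two models and their scores are invented for illustration and are not taken from WhatLLM or Artificial Analysis.

```python
from statistics import mean

# Invented scores for two hypothetical models; NOT real leaderboard data.
# The unweighted mean is an assumption, not the published formula.
SCORES = {
    "model_a": {"LiveCodeBench": 70, "Terminal-Bench Hard": 40, "SciCode": 55},
    "model_b": {"LiveCodeBench": 55, "Terminal-Bench Hard": 55, "SciCode": 55},
}


def aggregate(per_benchmark):
    """Collapse per-benchmark scores into one number (plain mean)."""
    return mean(list(per_benchmark.values()))


for name, per_benchmark in SCORES.items():
    print(name, aggregate(per_benchmark), per_benchmark)
```

Both hypothetical models aggregate to 55, yet one is 15 points weaker on shell tasks. If your workload is mostly shell and devops, the single number hides exactly the difference you care about.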
The point is not that WhatLLM is wrong. It is doing serious work, the page refreshes faster than most academic leaderboards, and the underlying benchmarks have real methodological care behind them. The point is that an aggregate is a sieve, not a scalpel. Use it to decide which two or three models are worth a closer look. Do not use it to decide which one your stack should run.
An aggregate benchmark is a useful coarse filter and a misleading fine instrument. Sieve with it. Do not decide with it.
Two Tests, Zero Overlap
Here is the part that broke our read of the existing coverage. The set of models on the HackerNoon bench and the set of models in the WhatLLM top-10 do not intersect. Five models on one list, ten on the other, zero shared between them, even though all four vendors on the HackerNoon bench also appear in the WhatLLM top-10 with a different variant.
| Vendor family | Variant on HackerNoon | Variant in WhatLLM top-10 |
|---|---|---|
| Anthropic | Sonnet 4.5 (last of 5 on LeetCode) | Opus 4.7 Adaptive Reasoning (#2) |
| Google | Gemini 2.5 Flash, Gemini 3 Flash Preview | Gemini 3.1 Pro Preview (#3), Gemini 3.5 Flash high (#6) |
| OpenAI | GPT-5-mini | GPT-5.5 xhigh (#1), GPT-5.4 xhigh (#4), GPT-5.3 Codex xhigh (#9) |
| xAI | Grok Code Fast 1.0825 | Grok 4.3 high (#10) |
HackerNoon picked the cheap, fast, frequently-used variant of each family. WhatLLM picked the expensive flagship. The cheap variant is the one a working developer actually runs all day. The expensive flagship is the one the marketing post compares. Both are valid choices for a benchmark to make, and both produce real, defensible numbers. They are also disjoint, which means the “best coding LLM” answer depends entirely on which slice of the family tree the source decided to test.
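The disjointness claim is easy to check mechanically. This sketch copies the model names from the two tables in this article and intersects them. It relies on exact string matching, so it only works because both lists are spelled here the way the boards spell them.

```python
# Model names as listed in the HackerNoon and WhatLLM tables above.
HACKERNOON = {
    "Claude Sonnet 4.5",
    "Gemini 2.5 Flash",
    "Gemini 3 Flash Preview",
    "GPT-5-mini",
    "Grok Code Fast 1.0825",
}
WHATLLM_TOP10 = {
    "GPT-5.5 (xhigh)",
    "Claude Opus 4.7 (Adaptive Reasoning)",
    "Gemini 3.1 Pro Preview",
    "GPT-5.4 (xhigh)",
    "Qwen3.7 Max",
    "Gemini 3.5 Flash (high)",
    "Kimi K2.6",
    "MiMo-V2.5-Pro",
    "GPT-5.3 Codex (xhigh)",
    "Grok 4.3 (high)",
}

# Set intersection: models that appear on both boards.
shared = HACKERNOON & WHATLLM_TOP10
print(len(HACKERNOON), len(WHATLLM_TOP10), sorted(shared))
```

The same three lines are worth running against any pair of boards before you compare their rankings. If the intersection is empty, the rankings cannot contradict each other, and they cannot confirm each other either.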
This is not a methodology critique. Each board did the work it could afford and labelled it honestly. The critique lands one layer up: it lands on every “best coding LLM” article that cites one of these two boards as if it were the answer, without naming which variant of which vendor was on the bench and why that choice was made. The framing dies in the citation.
When two serious rankings measure disjoint model sets from the same vendor, the disagreement is not about capability, it is about which variant the benchmark could afford to run. Neither answer is wrong. Neither answers your question either.
Cloud Dominance Is a Methodology Bias
Both boards skew heavily cloud, and the reason is mostly logistics. A cloud API is the cheapest testbed in the world. You pay per token, the inference hardware belongs to someone else, the scaling problem is the vendor’s problem, and an engineer can wire up an evaluation script in an afternoon. Running the same evaluation against an open-weight model means standing up SGLang or vLLM, picking a quantization, finding GPU time, and tuning the inference engine before the first benchmark token comes out. That work costs a benchmark team time and money, and the work is invisible in the final ranking. So the rankings tilt cloud.
WhatLLM has a separate open-source page that ranks open-weight models. It is a useful page, and the top entries are Kimi K2.6, MiMo-V2.5-Pro, Qwen3.6 Max Preview, and DeepSeek V4 Pro at Quality Index values in the 51 to 54 range. Read the small print, though: those Quality Index values still come from cloud-hosted runs of those open-weight models, served through vendor APIs and aggregator endpoints, not from local-quantized runs on consumer hardware. The open-weight ranking is open-weight in the sense that you could in principle download the weights, not in the sense that the score reflects what the weights do once you actually have them.
Forward-reference, because it lands harder later: the cloud-hosted Qwen3.6 Max Preview that WhatLLM tested is not the same artifact as the locally-quantized Qwen3.6 PrismaQuant that runs on a DGX Spark. The model card is shared. The capability is not. We pick that thread back up in the Qwen section below.
Open-weight does not mean open-tested. The model on the leaderboard usually ran on hardware you cannot afford, in a precision your machine cannot load.
The Self-Host Column That Doesn’t Exist
WhatLLM also publishes a local-LLM recommendation page. It is one of the better public attempts to translate the leaderboard into a hardware-buying decision. The page breaks recommendations into VRAM tiers and names a coding-suitable model for each tier. Read the tiers carefully and you will spot what is missing.
| VRAM tier | Hardware examples | WhatLLM coding pick |
|---|---|---|
| 8-16 GB | M-series laptops, RTX 4060/4070 | Gemma 3 4B, Qwen2.5 7B, Llama 3.2 8B |
| 16-24 GB | RTX 3090/4090, M2 Pro/Max | Qwen2.5-Coder 32B, DeepSeek Coder V2 16B, Mistral Small 22B |
| 40 GB+ multi-GPU | Server racks, Mac Studio Ultra | Llama 3.3 70B, DeepSeek R1 70B distilled, Qwen2.5 72B |
| 128 GB unified | NVIDIA DGX Spark, M-series Ultra, Strix Halo | no recommendation |
The 128 GB unified-memory tier is the one that opened up over the last twelve months. NVIDIA’s DGX Spark has been on the market since early April 2026. Apple’s M-series Ultra workstations sit in the same band. AMD’s Strix Halo platform overlaps it. This is the tier where serious self-host coding work is currently being decided. The board that should answer “what should I run on it” simply does not. The recommendation goes up to 70B-class models on a 40 GB GPU and then stops, exactly at the hardware class where the interesting choices begin.
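A back-of-envelope estimate shows why the tiers fall where they do. The sketch below computes memory for the weights alone as parameter count times bits per weight. It deliberately ignores KV cache, activations, and quantization metadata, so treat the result as a lower bound. The function name is ours.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in decimal gigabytes.

    Ignores KV cache, activations, and quantization metadata such as
    scaling factors, so real usage is higher.
    """
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9


# A 70B-class model at three common precisions.
for bits in (16, 8, 4):
    print(f"70B at {bits}-bit: ~{weight_memory_gb(70, bits):.0f} GB")
```

By this rough estimate a 70B model needs about 35 GB at 4 bits, which is why it lands in the 40 GB tier, and about 140 GB at 16 bits, which does not fit in 128 GB at all. The 128 GB tier is where larger models or higher precisions become options, and that is the choice the recommendation page leaves open.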
The blog has covered the buying decision in Self-Hosted AI: Start Here and the model-stack decision in the DGX Spark model-choice piece. Neither of those replaces a public coding-correctness ranking for the 128 GB tier. They surface what we run, why we picked it, and what the trade-offs look like. They do not substitute for a board that measures whether our pick is better than the alternative pick on a specific language.
Whenever a leaderboard skips a hardware tier, ask whether the tier is too new to test or too inconvenient to test. The answer changes how seriously you read it.
Spark Arena Has Throughput, Not Correctness
The third board worth pulling into the picture is spark-arena.com. It is the throughput sibling to arena.ai: tokens-per-second numbers for self-host engines (vLLM, SGLang, llama.cpp) running on real NVIDIA DGX Spark hardware. If you want to know whether a model will keep up with your typing when you self-host it, spark-arena is the only public source that gives you a real answer.
It does not score correctness. A model that streams 200 tokens per second but hallucinates the wrong function signature is not better than a model that streams 30 tokens per second and gets the signature right on the first try. Throughput and correctness are independent axes. Spark-arena nails the first one. Nobody publishes the second one for the same hardware tier.
This is exactly the gap the spiritual-parent piece named one axis over. There the missing column was self-host total cost of ownership: arena.ai gave you quality, spark-arena gave you throughput, nobody combined them. The same pattern repeats here, just rotated. HackerNoon gives you specialty by language. WhatLLM gives you aggregate quality. Spark-arena gives you throughput. No board gives you correctness per language on self-host hardware. The reading protocol later in this article is the workaround.
Other Coding Benchmarks That Should Be on the Radar
The popular roundup posts cite HackerNoon, WhatLLM, and occasionally arena.ai. Those are not the only boards that exist. Six more deserve a seat at the table, and most of them are closer to the work you actually do than LeetCode is.
| Benchmark | Tests | Strength | Weakness |
|---|---|---|---|
| LiveCodeBench | Code generation, monthly problem refresh | Contamination-resistant by design | Python-heavy, single-file scope |
| Aider Polyglot | Real edit-tasks across multiple languages | Closer to actual dev work than LeetCode | Smaller scoreboard, slower updates |
| BigCodeBench | Function-level tasks with library calls | Tests real API knowledge, not just algorithms | Single-file scope |
| SWE-Bench | Real GitHub issues, multi-file patches | Closest to a working repo of any public benchmark | Expensive and slow to run, mostly Python |
| MultiPL-E | HumanEval translated into 18+ languages | Best per-language coverage available | HumanEval-style problems are dated |
| HumanEval / MBPP | Original Python pass@1 | Historical anchor everyone still cites | Saturated; results may be contaminated |
If your stack lives in Rust and Postgres, none of the popular boards measure your problem directly. The boards that come closer are Aider Polyglot, which scores real edits across the languages it tests, and MultiPL-E, which translates the same problem set into eighteen-plus languages and lets you compare how a model degrades when the language changes underneath it. Neither is famous. Neither shows up in “best coding LLM in 2026” roundups. That is the gap.
A word on contamination, for the non-experts. A benchmark is “contaminated” when its test cases ended up inside a model’s training data, which means the model is partly remembering the answer instead of solving the problem. Older benchmarks (HumanEval, MBPP) are likely contaminated in every modern model: the test cases have been public for years, scraped repeatedly, and the model trained on those scrapes. Monthly-refresh benchmarks like LiveCodeBench partly fix this by rotating the problem set faster than the next training cycle can keep up. When you see a pass rate quoted from HumanEval in 2026, mentally discount it. When you see one from LiveCodeBench last month, trust it more, but not absolutely; the refresh cycle delays contamination, it does not eliminate it.
The benchmark that names your problem is usually less famous than the benchmark that names its winner.
Qwen3.6 Max Preview Is Not Qwen3.6 PrismaQuant
Pick the thread back up. On WhatLLM’s open-source ranking, Qwen3.6 Max Preview sits near the top of the Quality Index for open-weight models, with Qwen3.6 Plus just below it. Read the page and it is reasonable to conclude that Qwen3.6 is a competitive open-weight coding model. That conclusion is correct in the sense it was measured. It is also incomplete in a specific, important way.
The numbers on that page come from cloud-hosted runs at the precision Alibaba serves the model at through its hosted API. They are runs of “Qwen3.6 Max Preview” the API endpoint, not runs of any specific quantization you can download and put on your own machine. When the same architecture gets pushed into a 4-bit quantization like NVFP4 and loaded onto a DGX Spark, what it can do changes. Sometimes by a little. Sometimes by a lot.
We have written about this exact problem before from the inside. The mistral-vs-qwen36 piece on this blog walked through one concrete case where the local quantization dropped vision support that the model card still advertised. The model card was correct about the architecture. The quantization rewrote the capability sheet, silently, and nothing in the public coding rankings would have warned us. The same risk applies to coding: a quantization that preserves overall pass rate on one language can still degrade noticeably on a different language, on a longer context, or on a specific framework. The leaderboard cannot warn you because the leaderboard never tested the artifact you are actually loading.
The DGX Spark model-stack piece names what we ended up running and why. Read it as a worked example, not as a recommendation that ports across boxes. A different operator on different hardware with different priorities would land somewhere else, and the same Quality Index would have led both of us astray if we had treated it as the answer.
Capability is a property of a specific quantization on specific hardware, not a property of the model card.
A Reading Protocol for Self-Hosters
If everything above is right, then no single board can answer “which model should I run”. The boards are still useful, but only if you read them in a specific order and use each one for what it is good at. This is the section to bookmark.
- Name your stack. Languages, frameworks, databases, deployment target. Be specific. “Backend in Rust, Postgres for everything, deployed on Nomad, CI in GitHub Actions” is a stack description. “Modern web stack” is a wish. Most public boards will fail this step for you, because they average across stacks instead of splitting by them. That is fine. You are doing the splitting yourself.
- Filter by language first. If your stack is Rust, Elixir, or a non-MySQL SQL dialect, sort the HackerNoon table and the MultiPL-E page first. Models that look great on aggregate boards can drop ten or twenty percentage points on a single language change, and your stack lives or dies on those points. The popular ranking comes second.
- Use Quality Index as a sieve, not a decision. WhatLLM and arena.ai are good at telling you which models are not contenders. They are bad at telling you which contender is yours. Pull a shortlist of three to five candidates off the aggregate, then put the aggregate down.
- Test throughput on your own hardware. Spark-arena.com for DGX Spark, your own quick bench for everything else. The vendor’s quoted tokens-per-second figure was measured on hardware you do not own, with batch sizes you will not run, in a precision you might not load. Run your own. It is one afternoon of work for one decision you will live with for months.
- Test correctness on your own repo. Pick five real tasks you would actually ask a model to do. A bug fix that requires reading two files. A schema migration. A small refactor of a real module. A diff summary. A test you would write yourself. Run each candidate. Read the output yourself. Score them. This is the only benchmark that matches your job, and it is the only one nobody else can run for you.
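The throughput step can be sketched as a small timing harness. It assumes a hypothetical `generate` callable that you write around your own inference engine and that returns the generated tokens as a list. No real engine API is used here, and `fake_generate` is a stand-in so the sketch runs anywhere.

```python
import time
from typing import Callable, List


def tokens_per_second(generate: Callable[[str], List[str]], prompt: str,
                      runs: int = 3) -> float:
    """Average end-to-end tokens/second over a few runs.

    `generate` is a hypothetical wrapper around your own inference
    engine; it must return the generated tokens as a list.
    """
    total_tokens = 0
    total_seconds = 0.0
    for _ in range(runs):
        start = time.perf_counter()
        tokens = generate(prompt)
        total_seconds += time.perf_counter() - start
        total_tokens += len(tokens)
    return total_tokens / total_seconds


def fake_generate(prompt: str) -> List[str]:
    """Stand-in for a real model call so the sketch runs anywhere."""
    time.sleep(0.05)
    return prompt.split() * 10


print(f"{tokens_per_second(fake_generate, 'fix the failing test'):.0f} tok/s")
```

This measures wall-clock time for the whole call, so prompt processing is included. That is usually the number you feel at the keyboard, but it is not directly comparable to a decode-only figure from a vendor sheet.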
We have written about the practical workflow side in the opencode self-hosted coding-assistant piece and about prior tool-selection methodology in the coding-tools evaluation piece. They are not a replacement for the five-step protocol above. They are what happens once the protocol picks a winner.
The leaderboard for your codebase has one user, runs on your hardware, and never gets published. Build it anyway.
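A private leaderboard does not need infrastructure. The sketch below tallies hand-scored outcomes for the five task types named in the protocol. The candidate names and pass/fail values are placeholders, not measurements of any real model.

```python
from collections import defaultdict

# Hypothetical hand-scored results: (model, task, passed).
# Model names and outcomes are placeholders, not measurements.
RESULTS = [
    ("candidate_a", "bug fix across two files", True),
    ("candidate_a", "schema migration", False),
    ("candidate_a", "small refactor", True),
    ("candidate_a", "diff summary", True),
    ("candidate_a", "unit test", True),
    ("candidate_b", "bug fix across two files", True),
    ("candidate_b", "schema migration", True),
    ("candidate_b", "small refactor", False),
    ("candidate_b", "diff summary", False),
    ("candidate_b", "unit test", True),
]


def scoreboard(results):
    """Return {model: (passed, total)} from hand-scored task outcomes."""
    tally = defaultdict(lambda: [0, 0])
    for model, _task, passed in results:
        tally[model][0] += int(passed)
        tally[model][1] += 1
    return {model: tuple(counts) for model, counts in tally.items()}


for model, (passed, total) in sorted(scoreboard(RESULTS).items()):
    print(f"{model}: {passed}/{total}")
```

Five tasks is a tiny sample, so read a 4/5 versus 3/5 result as a prompt to look at the failing outputs rather than as a ranking. Here the two candidates fail on different tasks, which the totals alone would not show.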
What’s Missing Across All Sources
Even with three boards triangulated and a five-step protocol on top, five axes are still not measured anywhere public. They are the ones that matter most for self-hosters.
- Cost per correct answer. No board divides dollars-per-million-tokens by pass rate. The model that costs one tenth the price and passes half the tasks is usually the better buy. Nobody publishes the ratio. You can compute it yourself in a spreadsheet in twenty minutes, and the answer it gives is almost always cheaper than the model marketing posts recommend.
- Privacy as an axis. Every cloud test logs the prompt. No public board scores the difference between sending your codebase to a US-hosted API and running the same query against a model in your own apartment. For a working developer in a regulated industry, that difference is the entire decision, and the leaderboard is silent on it. The closest thing to a privacy column is “is this model open-weight, yes or no”, which is necessary but not sufficient.
- Reproducibility. Pass rates drift between benchmark runs and between API versions. The same model on the same benchmark can read differently month to month, sometimes by several percentage points. None of the popular boards publish a version-pinned re-run schedule. You read a number, you cite it, and a quarter later the number has moved and your citation is stale.
- Real-repo tasks. SWE-Bench gets closest, and it is the slowest of all the public benchmarks to update. LeetCode-style puzzles are a poor proxy for “fix a bug in a 50,000-line monorepo that has its own conventions and three years of history”. The skills barely overlap.
- Multilingual code review. No board tests whether a model can read a Rust diff and explain it in German, or read a Python service and explain it in Spanish. That is a daily task for half of Europe, and the public benchmarks treat the world as if everyone working with code thinks in English. The mismatch shows up in subtle ways: code is fine, comments are wrong, error messages are translated badly, naming drifts between languages.
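The cost-per-correct-answer ratio from the first item in this list can be sketched in a few lines. The prices, token counts, and pass rates below are hypothetical, chosen only to mirror the "one tenth the price, half the tasks" example. The formula assumes every attempt costs the same and failed attempts are simply discarded.

```python
def cost_per_correct(price_per_mtok: float, mtok_per_task: float,
                     pass_rate: float) -> float:
    """Expected spend per accepted answer, in the currency of the price.

    Assumes every attempt costs the same and failures are discarded.
    """
    if not 0 < pass_rate <= 1:
        raise ValueError("pass_rate must be in (0, 1]")
    return price_per_mtok * mtok_per_task / pass_rate


# Hypothetical figures: the budget model costs one tenth as much per
# million tokens and passes half as many tasks as the flagship.
flagship = cost_per_correct(price_per_mtok=10.0, mtok_per_task=0.01, pass_rate=0.80)
budget = cost_per_correct(price_per_mtok=1.0, mtok_per_task=0.01, pass_rate=0.40)
print(f"flagship: {flagship:.4f}  budget: {budget:.4f}  ratio: {budget / flagship:.2f}")
```

Under these assumptions the budget model costs a fifth as much per accepted answer. The sketch leaves out the human time spent reading and rejecting failed attempts, which can easily outweigh the token bill, so add that column before trusting the ratio.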
The quality-gate piece on this blog is the methodology-skepticism mirror for these gaps. Read it if you want the same kind of “the metric measures the wrong thing” argument applied to a different metric on a different stack.
Every public benchmark is shaped by what is cheap to test. The columns you care about are usually the expensive ones.
Glossary and Sources
Glossary
- Pass@1. The fraction of problems a model solves on its first attempt with no retries. The simplest possible measure of “did it work”.
- Quality Index. A composite score from Artificial Analysis that bundles several benchmark results into one number, used as the top-line ranking on WhatLLM.org.
- LiveCodeBench. A contamination-resistant code-generation benchmark that refreshes its problem set monthly so the test cases have not yet had time to land in any training corpus.
- Terminal-Bench Hard. A benchmark for shell scripting, devops, and system-level programming tasks, used as one of WhatLLM’s three coding-evaluation inputs.
- SciCode. A benchmark for scientific computing and research programming, used as the third of WhatLLM’s coding inputs.
- Elo. A relative rating system made famous by chess. arena.ai applies it to pairwise human comparisons to rank model quality.
- Quantization (NVFP4, Q4, Q8). A way to shrink a model so it fits on a smaller machine by storing weights with fewer bits per number. NVFP4 is NVIDIA’s 4-bit floating-point format. Q4 and Q8 are integer formats common in llama.cpp.
- Contamination. When a benchmark’s test cases ended up in a model’s training data, so the model partly “remembers” the answers instead of solving the problem. Older benchmarks like HumanEval are likely contaminated in every modern model.
- Aider Polyglot. A real-edit-task benchmark across multiple languages, run by the Aider coding-assistant project.
- BigCodeBench. A function-level coding benchmark that requires realistic library calls, designed to be harder than HumanEval.
- SWE-Bench. A benchmark built from real GitHub issues, scored by whether the model’s patch passes the project’s own test suite.
- MultiPL-E. HumanEval translated into 18+ programming languages for cross-language comparison.
- DGX Spark. NVIDIA’s 128 GB unified-memory developer workstation, on the market since early 2026. The hardware tier WhatLLM’s local-LLM page does not have a recommendation for.
Sources
External:
- HackerNoon, “Comparing LLMs’ Coding Abilities Across Programming Languages”: https://hackernoon.com/comparing-llms-coding-abilities-across-programming-languages
- WhatLLM.org, “Best LLM for Coding”: https://whatllm.org/best-llm-for-coding
- WhatLLM.org, “Best Open Source LLM”: https://whatllm.org/best-open-source-llm
- WhatLLM.org, “Best Local LLM”: https://whatllm.org/best-local-llm
- arena.ai: https://arena.ai
- spark-arena.com: https://spark-arena.com
- LiveCodeBench: https://livecodebench.github.io
- Aider Polyglot Leaderboards: https://aider.chat/docs/leaderboards/
- BigCodeBench: https://bigcode-bench.github.io
- SWE-Bench: https://swebench.com
- MultiPL-E: https://huggingface.co/datasets/nuprl/MultiPL-E
- Artificial Analysis: https://artificialanalysis.ai
Related on this blog
- Two Leaderboards Nobody Reads Together: Why arena.ai Doesn’t Tell You About Self-Hosted AI: the spiritual parent, same triangulation pattern on a different axis (quality vs throughput vs TCO).
- Self-Hosted AI: Start Here: hub article for new readers, hardware-decision tree and inference-engine choice.
- Setup opencode: Self-Hosted Coding Assistant: the practical workflow side of self-hosted coding.
- Strategy: Coding Tools Evaluation: prior tool-selection methodology.
- Mistral vs Qwen3.6 on DGX Spark: The Zero That Was a Broken Ruler: concrete case where quantization rewrote the capability sheet.
- Strategy: Next Model Choices on DGX Spark: model-stack decision log.
- The Quality Gate That Rewards Fabrication: methodology-skepticism mirror, the same kind of “metric measures the wrong thing” argument applied to a different scorer.
- Fixes: SGLang Vibe Performance Benchmark: adjacent self-host performance baseline.