Mainstream AI coverage cites only one leaderboard. arena.ai ranks quality. spark-arena.com ranks throughput on real hardware. The decision that matters lives in the third column nobody publishes.

Two Leaderboards Nobody Reads Together: Why arena.ai Doesn't Tell You About Self-Hosted AI

Most “best LLM” articles cite arena.ai. They show Claude Opus 4.7 at Elo 1503, GPT-5.5 High at 1488, Mistral Small 4 somewhere mid-table around 1420. End of story. But a leaderboard that ignores where the model runs, what it costs per token, and who controls the kill-switch is half a leaderboard.

arena.ai ranks models by human-judged quality. spark-arena.com ranks models by raw tokens per second on NVIDIA DGX Spark hardware, from single boxes to small clusters. Neither tells the full story alone.

This article reads both at the same time, then asks the column nobody publishes: what does this cost me, and who owns the result?

Two Leaderboards, Two Currencies

Quality and throughput are different currencies, measured by different people, optimized for different audiences.

|  | arena.ai | spark-arena.com |
| --- | --- | --- |
| Question answered | Which model gives better answers? | Which model runs fastest on my hardware? |
| Methodology | Pairwise human votes converted to Elo | Empirical benchmark on NVIDIA DGX Spark, test type tg128, concurrency 1 |
| Top 5 (text, 2026-04-29) | Claude Opus 4.7 (1503), Claude Opus 4.6 Thinking (1501), Claude Opus 4.6 (1496), Claude Opus 4.7 Thinking (1493), Gemini 3.1 Pro (1493) | Qwen3.5-0.8B BF16/sglang (106.69 tok/s), Qwen3.6-35B-A3B-PrismaQuant INT4/vllm (95.11), Qwen3.6-35B-A3B int4-AutoRound (92.34), gpt-oss-120b MXFP4 2 nodes (75.96), gemma-4-26B-A4B FP8 4 nodes (67.63) |
| Open-source presence | Mistral, Qwen, DeepSeek, GLM appear, rarely top 10 | Exclusively open-source (closed weights cannot run on consumer hardware) |
| Cost dimension | Ignored | Implicit (hardware is a one-time cost) |
| Sovereignty dimension | Ignored | Required |
| How to submit | Vote in Battle Mode | spark-cli benchmark recipe.yaml (auto-uploads) |

Two patterns jump out from the spark-arena top 10. First, vllm dominates: sglang’s only appearance is the rank-1 entry, a tiny 0.8B model, which suggests vllm currently ships the better DGX Spark optimization for this hardware class. Second, cluster size cuts both ways: gpt-oss-120b hits 75.96 tok/s at 2 nodes but drops to 63.10 at 4 nodes. Past two nodes, communication overhead eats more than the added compute contributes for that workload.
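A quick back-of-the-envelope check makes the negative scaling explicit. A minimal sketch using only the published tok/s figures; nothing here comes from the spark-arena tooling:

```python
# Per-node throughput for gpt-oss-120b at the two published cluster sizes.
# Figures are the aggregate tok/s numbers from the leaderboard table above.
runs = {2: 75.96, 4: 63.10}  # nodes -> aggregate tok/s

for nodes, toks in runs.items():
    print(f"{nodes} nodes: {toks:.2f} tok/s total, {toks / nodes:.2f} tok/s per node")

# Doubling the cluster from 2 to 4 nodes *loses* absolute throughput:
print(f"4-node vs 2-node: {runs[4] / runs[2] - 1:+.0%}")  # prints -17%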

What Each Leaderboard Hides

Cost

arena.ai ignores cost entirely. spark-arena.com makes it implicit by assuming you own the hardware. The actual math is brutal once you put both sides in the same row.

Daily output of 500,000 tokens at $75 per million output tokens (Claude Opus 4.7 list price) comes to $37.50 per day, $1,125 per month. The same workload on local Mistral Small 4 NVFP4 at 41 tok/s takes roughly 3.4 hours of compute and about 480 Wh of energy, around 14 cents at €0.30 per kWh. The DGX Spark amortizes against this workload in about three months. After that, the next 500,000 tokens cost electricity. The ratio is roughly 260-to-1 per token, in exchange for roughly 80 Elo points of quality.
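A minimal sketch of the same arithmetic, assuming the ~140 W average draw implied by the 480 Wh figure (the draw is an inference from that figure, not a measured value):

```python
# The cost arithmetic from the paragraph above, made checkable.
tokens_per_day = 500_000
cloud_per_mtok = 75.0                                # $/M output tokens, list price
cloud_daily = tokens_per_day / 1e6 * cloud_per_mtok  # $37.50/day

tok_per_s = 41
power_w = 140                                        # assumed average draw (~480 Wh / 3.4 h)
eur_per_kwh = 0.30
hours = tokens_per_day / tok_per_s / 3600            # ~3.4 h
local_daily = power_w / 1000 * hours * eur_per_kwh   # ~0.14 EUR/day

print(f"cloud ${cloud_daily:.2f}/day vs local ~{local_daily:.2f} EUR/day "
      f"-> ratio ~{cloud_daily / local_daily:.0f}:1")
```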

That trade is invisible if you only read arena.ai. It is the entire story if you only read spark-arena.com.

Sovereignty

arena.ai assumes inference is cloud, the network is reliable, API keys do not rotate, and someone else’s data center is an acceptable dependency. spark-arena.com assumes you already control the hardware. Neither prints a sovereignty score.

For workloads touching medical records, financial data, internal architecture notes, or anything regulated by GDPR, the missing column is the one that ranks models by data jurisdiction, log retention, and the training-on-your-data clause. Until that leaderboard exists, the choice is binary: accept the dependency or build the stack.

Latency under load

arena.ai’s Elo scale measures isolated single-prompt votes. spark-arena.com’s tg128 benchmark measures single-stream throughput at concurrency 1. Neither captures what happens when an agentic loop fires hundreds of small calls per session.

A coding assistant making 200 small completions per minute rewards 100+ tok/s steady-state. A single architecture-grade question rewards capability over throughput. The missing column would rank models by p99 latency under realistic concurrent load. Both leaderboards ignore it.
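Measuring that yourself is straightforward. A minimal sketch, assuming a hypothetical OpenAI-compatible endpoint on localhost and using httpx; the URL, model name, and load shape are placeholders, not anything either leaderboard prescribes:

```python
import asyncio
import time

import httpx

URL = "http://localhost:8000/v1/completions"   # hypothetical local endpoint
PAYLOAD = {"model": "local-model", "prompt": "def fib(n):", "max_tokens": 64}

async def timed_call(client: httpx.AsyncClient) -> float:
    """Fire one completion request and return its wall-clock latency in seconds."""
    t0 = time.perf_counter()
    resp = await client.post(URL, json=PAYLOAD, timeout=120.0)
    resp.raise_for_status()
    return time.perf_counter() - t0

async def main(concurrency: int = 16, total: int = 200) -> None:
    sem = asyncio.Semaphore(concurrency)       # cap in-flight requests
    latencies: list[float] = []

    async def bounded(client: httpx.AsyncClient) -> None:
        async with sem:
            latencies.append(await timed_call(client))

    async with httpx.AsyncClient() as client:
        await asyncio.gather(*(bounded(client) for _ in range(total)))

    latencies.sort()
    p50 = latencies[len(latencies) // 2]
    p99 = latencies[min(int(total * 0.99), total - 1)]
    print(f"p50 {p50:.2f}s / p99 {p99:.2f}s at concurrency {concurrency}")

asyncio.run(main())
```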

The Three-Column View

The decision that matters fits in one table the rest of the industry refuses to print.

| Model | arena.ai Elo | spark-arena tok/s | $/M output | Sovereign? |
| --- | --- | --- | --- | --- |
| Claude Opus 4.7 | 1503 | n/a (cloud only) | 75 | no |
| Claude Opus 4.6 Thinking | 1501 | n/a (cloud only) | 75 | no |
| GPT-5.5 High | 1488 | n/a (cloud only) | 100 | no |
| Gemini 3 Pro | 1486 | n/a (cloud only) | ~80 | no |
| Qwen3.5-0.8B BF16 (sglang) | not ranked | 106.69 (rank 1) | ~0.01 | yes |
| Qwen3.6-35B-A3B INT4 (vllm) | ~1430 | 95.11 (rank 2) | ~0.04 | yes |
| gpt-oss-120b MXFP4 (2 nodes) | ~1410 | 75.96 (rank 4) | ~0.06 | yes |
| Qwen3-Coder-Next int4 | ~1395 (Code) | 73.33 (rank 6) | ~0.05 | yes |
| gemma-4-26B-A4B FP8 (4 nodes) | ~1380 | 67.63 (rank 9) | ~0.04 | yes |
| Mistral Small 4 119B NVFP4 + EAGLE (this stack) | ~1420 | ~41 | ~0.05 | yes |

$/M output is the cloud list price for the closed rows and a rough electricity estimate for the sovereign rows.

The closed-source rows offer roughly 80 Elo points of quality bonus for about 1,500 times the per-token cost and a permanent dependency on someone else’s infrastructure. The open-source rows, even mid-table, cost roughly the price of electricity.

This is the trade nobody graphs because graphs need both axes.
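For scale, the standard Elo expected-score formula translates that 80-point gap into a head-to-head win rate. This is ordinary rating math, not something arena.ai publishes about its pipeline:

```python
# Expected head-to-head win rate implied by an Elo gap (standard Elo formula).
def expected_win_rate(elo_gap: float) -> float:
    return 1 / (1 + 10 ** (-elo_gap / 400))

print(f"{expected_win_rate(80):.0%}")  # prints 61%
```

An 80-point gap means the pricier model wins roughly six of ten blind comparisons, not ten of ten. That is what the 1,500× buys.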

Where This Stack Lands

Mistral Small 4 119B with NVFP4 quantization, running on SGLang nightly-dev-cu13-20260323-999bad5a (currently the only build stable on SM121A) with EAGLE speculative decoding, pulls roughly 41 tok/s on a tg128-equivalent workload on a single GB10 box. That is mid-table for this size class, not top 10.

The numbers are unsexy but real.

Why mid-table rather than top 10: spark-arena’s leaders are smaller models (0.8B–35B) riding mature vllm + INT4 paths, while a 119B model in NVFP4 stays memory-bandwidth-bound on unified memory. The trade is real: parameter count buys capability, and capability costs tokens per second.

The choice depends on the workload. Agentic loops with thousands of small calls reward 100+ tok/s. A single architecture-sized question rewards capability over throughput.

How to Read Both

Pick your category in arena.ai (Text, Code, Vision, etc.). Find the open-source models. Note their Elo. The gap to the top closed-source row is your quality tax for sovereignty.

Then spark-arena.com. Find the same models on your target hardware. Note tok/s. Marginal cost per token is (power draw in kW × price per kWh) divided by (tok/s × 3,600): electricity spent per token generated.

Compare that to the closed-source per-token API price. If the savings outweigh the quality tax from step one, self-host. If not, stay cloud. If unsure, run both for a week and measure.
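As a reusable helper combining both steps (the 140 W draw and €0.30 tariff are illustrative assumptions; the Elo and price figures come from the table above):

```python
# Quality tax vs. cost ratio, the two numbers the how-to above asks for.
def marginal_cost_per_mtok(tok_per_s: float, power_watts: float, price_per_kwh: float) -> float:
    """Electricity cost per million tokens: kW x tariff, spread over generation time."""
    hours_per_mtok = 1e6 / tok_per_s / 3600
    return power_watts / 1000 * hours_per_mtok * price_per_kwh

cloud_price = 75.0                                   # $/M output, closed-source list price
local_price = marginal_cost_per_mtok(41, 140, 0.30)  # assumed draw and tariff
elo_gap = 1503 - 1420                                # quality tax vs. this stack

print(f"cost ratio ~{cloud_price / local_price:.0f}:1 for an {elo_gap}-point quality tax")
```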

Most real workflows are mixed. Cloud Claude for the architecture pass, local Mistral for the file-by-file refactor pass. Hybrid beats either pure-cloud or pure-local for most product builders. (The setup story covers the local half of that hybrid.)

Submitting Your Own Numbers

spark-arena.com is open-submission via spark-arena-cli. The flow:

# DGX Spark host is ARM64 (GB10 / ARM v9.2-A)
wget https://github.com/spark-arena/spark-arena-cli/releases/latest/download/spark-arena-cli_0.1.0_arm64.deb
sudo dpkg -i spark-arena-cli_0.1.0_arm64.deb

# x86_64 hosts: replace _arm64.deb with _amd64.deb (Linux + macOS binaries also published)

spark-cli login
spark-cli setup
spark-cli benchmark mistral-small-4-nvfp4-eagle.yaml

The benchmark uploads automatically to the leaderboard. There is no opt-out for that step in v0.1.0, which is a sovereignty trade-off worth noting: to be on the public leaderboard, you accept that your hardware run profile becomes public data. Self-hosting privacy and public benchmarking are in tension here. For most setups that’s fine. For air-gapped deployments, run the same scripts manually and skip the upload.

The recipe YAML is the interesting artifact. Once written for a specific stack like mistral-small-4-nvfp4-eagle, the same recipe runs reproducibly elsewhere. That single file becomes the most useful documentation a sovereign AI builder can publish: not “I get 41 tok/s,” but “here is the exact configuration that produces 41 tok/s on a GB10 host.”

What’s Missing

A third leaderboard. One that ranks by privacy and sovereignty, with columns for data jurisdiction, log retention, training-on-your-data clauses, GDPR audit status, and “your code stays on your desk.” That leaderboard would put Anthropic, OpenAI, and Google into the same table as Mistral, Qwen, and DeepSeek and rank them on the dimension that matters for production deployments.
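If it existed, each row might look something like this. A hypothetical sketch: every field name below is invented to mirror the columns just listed, not any real schema:

```python
from dataclasses import dataclass

# Hypothetical row for the missing third leaderboard. No such schema exists;
# the fields mirror the columns proposed in the paragraph above.
@dataclass
class SovereigntyRow:
    model: str
    data_jurisdiction: str          # e.g. "EU", "US", "your rack"
    log_retention_days: int | None  # None = no retention at all
    trains_on_your_data: bool       # the clause that decides production use
    gdpr_audit_passed: bool
    code_stays_on_your_desk: bool
```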

It does not exist yet. arena.ai will probably never add it because the closed-source rows would all rank at the bottom. spark-arena.com cannot add it because it benchmarks hardware, not policy.

Until that third leaderboard exists, read the two that do. Or pick a side and own the trade-off. The wrong call shows up as one provider rotation, one outage, one regulatory letter, or one runaway monthly bill, each the price of convincing yourself the missing column did not matter.

The writing of this article followed the same hybrid pattern the article describes: cloud LLM as scaffold, local Mistral for draft, human polish. Sovereign by output, not by every keystroke.