Two Leaderboards Nobody Reads Together: Why arena.ai Doesn't Tell You About Self-Hosted AI
Most “best LLM” articles cite arena.ai. They show Claude Opus 4.7 at Elo 1503, GPT-5.5 High at 1488, Mistral Small 4 somewhere mid-table around 1420. End of story. But a leaderboard that ignores where the model runs, what it costs per token, and who controls the kill-switch is half a leaderboard.
arena.ai ranks models by human-judged quality. spark-arena.com ranks models by raw tokens-per-second on a single NVIDIA DGX Spark. Neither tells the full story alone.
This article reads both at the same time, then asks the column nobody publishes: what does this cost me, and who owns the result?
Two Leaderboards, Two Currencies
Quality and throughput are different currencies, measured by different people, optimized for different audiences.
| | arena.ai | spark-arena.com |
|---|---|---|
| Question answered | Which model gives better answers? | Which model runs fastest on my hardware? |
| Methodology | Pairwise human votes converted to Elo | Empirical benchmark on NVIDIA DGX Spark, test type tg128, concurrency 1 |
| Top 5 (text, 2026-04-29) | Claude Opus 4.7 (1503), Claude Opus 4.6 Thinking (1501), Claude Opus 4.6 (1496), Claude Opus 4.7 Thinking (1493), Gemini 3.1 Pro (1493) | Qwen3.5-0.8B BF16/sglang (106.69 tok/s), Qwen3.6-35B-A3B-PrismaQuant INT4/vllm (95.11), Qwen3.6-35B-A3B int4-AutoRound (92.34), gpt-oss-120b MXFP4 2 nodes (75.96), gemma-4-26B-A4B FP8 4 nodes (67.63) |
| Open-source presence | Mistral, Qwen, DeepSeek, GLM appear, rarely top 10 | Exclusively open-source (closed weights cannot run on consumer hardware) |
| Cost dimension | Ignored | Implicit (hardware = one-time) |
| Sovereignty dimension | Ignored | Required |
| How to submit | Vote in Battle Mode | spark-cli benchmark recipe.yaml (auto-uploads) |
Two patterns jump out from the spark-arena top 10. First: vllm dominates; sglang places only at rank 1, with a tiny 0.8B model. Right now, vllm simply ships better DGX Spark optimization for this hardware class. Second: cluster size has surprising trade-offs. gpt-oss-120b hits 75.96 tok/s at 2 nodes but drops to 63.10 at 4. Past 2 nodes, communication overhead eats the throughput gain for that workload.
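That second pattern is easy to quantify as scaling efficiency: actual throughput divided by what linear scaling would predict. A minimal sketch, using only the tok/s figures quoted above:

```python
# Per-node scaling efficiency relative to a linear-scaling baseline.
# gpt-oss-120b tg128 throughput, from the spark-arena rows quoted above.
runs_tok_s = {2: 75.96, 4: 63.10}

base_nodes = 2
base = runs_tok_s[base_nodes]

for nodes, tok_s in sorted(runs_tok_s.items()):
    # Linear scaling would multiply throughput by nodes / base_nodes.
    efficiency = (tok_s / base) / (nodes / base_nodes)
    print(f"{nodes} nodes: {tok_s:6.2f} tok/s, scaling efficiency {efficiency:.0%}")
```

Doubling the node count while losing absolute throughput works out to roughly 40% scaling efficiency: the interconnect is doing most of the work.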
What Each Leaderboard Hides
Cost
arena.ai ignores cost entirely. spark-arena.com makes it implicit by assuming you own the hardware. The actual math is brutal once you put both sides in the same row.
Daily output of 500,000 tokens at $75 per million output tokens (Claude Opus 4.7 list price) comes to $37.50 per day, $1,125 per month. The same workload on local Mistral Small 4 NVFP4 at 41 tok/s takes roughly 3.4 hours of compute and about 480 Wh, around 14 cents at €0.30 per kWh. The DGX Spark amortizes against this workload in roughly three months. After that, the next 500,000 tokens cost electricity. The per-token ratio is roughly 260-to-1, in exchange for roughly 80 Elo points of quality.
That trade is invisible if you only read arena.ai. It is the entire story if you only read spark-arena.com.
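The arithmetic above fits in a few lines. This is a sketch parameterized on the article's stated inputs; the hardware price is an added assumption (roughly the DGX Spark's announced street price), so swap in your own figures:

```python
# Cloud vs. local marginal cost for a fixed daily token budget.
# Inputs from the article: list price, local tok/s, power draw, tariff.
DAILY_TOKENS = 500_000
CLOUD_PRICE_PER_MTOK = 75.00   # $/M output tokens (Claude Opus 4.7 list price)
LOCAL_TOK_S = 41               # Mistral Small 4 NVFP4, single-stream
LOCAL_POWER_W = 140            # measured draw under load
KWH_PRICE = 0.30               # €/kWh
HARDWARE_COST = 4_000          # ASSUMPTION: approximate DGX Spark price

cloud_per_day = DAILY_TOKENS / 1e6 * CLOUD_PRICE_PER_MTOK
hours_per_day = DAILY_TOKENS / LOCAL_TOK_S / 3600
local_per_day = hours_per_day * LOCAL_POWER_W / 1000 * KWH_PRICE
breakeven_months = HARDWARE_COST / (cloud_per_day * 30)

print(f"cloud: ${cloud_per_day:.2f}/day   local: €{local_per_day:.2f}/day")
print(f"hardware amortizes in ~{breakeven_months:.1f} months of avoided cloud spend")
```

Changing the tariff, the model, or the daily volume moves the break-even point; the shape of the curve does not change.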
Sovereignty
arena.ai assumes inference is cloud, network is reliable, API keys do not rotate, and someone else’s data center is your problem. spark-arena.com assumes you already control the hardware. Neither prints a sovereignty score.
For workloads touching medical records, financial data, internal architecture notes, or anything regulated by GDPR, the missing column is the one that ranks models by data jurisdiction, log retention, and the training-on-your-data clause. Until that leaderboard exists, the choice is binary: accept the dependency or build the stack.
Latency under load
arena.ai’s Elo scale measures isolated single-prompt votes. spark-arena.com’s tg128 benchmark measures single-stream throughput at concurrency 1. Neither captures what happens when an agentic loop fires hundreds of small calls per session.
A coding assistant making 200 small completions per minute rewards 100+ tok/s steady-state. A single architecture-grade question rewards capability over throughput. The missing column would rank models by p99 latency under realistic concurrent load. Both leaderboards ignore it.
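Measuring that missing column yourself is not hard: record wall-clock latencies under your real concurrency and report percentiles. A minimal sketch; the simulated samples below are a stand-in for measured request timings, not data from either leaderboard:

```python
import random
import statistics

def p99(latencies_ms: list[float]) -> float:
    """99th-percentile latency from a list of samples."""
    # quantiles(n=100) returns the 1st..99th percentile cut points.
    return statistics.quantiles(latencies_ms, n=100)[98]

# Simulated samples: tight median, heavy tail -- the shape that
# concurrent agentic load typically produces on a saturated endpoint.
random.seed(0)
samples = [random.lognormvariate(5.5, 0.6) for _ in range(2_000)]

print(f"p50 = {statistics.median(samples):7.1f} ms")
print(f"p99 = {p99(samples):7.1f} ms")
```

The gap between those two numbers is what a single-vote Elo scale and a concurrency-1 benchmark both fail to see.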
The Three-Column View
The decision that matters fits in one table the rest of the industry refuses to print.
| Model | arena.ai Elo | spark-arena tok/s | $/M output (cloud) | Sovereign? |
|---|---|---|---|---|
| Claude Opus 4.7 | 1503 | n/a (cloud only) | 75 | no |
| Claude Opus 4.6 Thinking | 1501 | n/a (cloud only) | 75 | no |
| GPT-5.5 High | 1488 | n/a (cloud only) | 100 | no |
| Gemini 3 Pro | 1486 | n/a (cloud only) | ~80 | no |
| Qwen3.5-0.8B BF16 (sglang) | not ranked | 106.69 (rank 1) | ~0.01 | yes |
| Qwen3.6-35B-A3B INT4 (vllm) | ~1430 | 95.11 (rank 2) | ~0.04 | yes |
| gpt-oss-120b MXFP4 (2 nodes) | ~1410 | 75.96 (rank 4) | ~0.06 | yes |
| Qwen3-Coder-Next int4 | ~1395 (Code) | 73.33 (rank 6) | ~0.05 | yes |
| gemma-4-26B-A4B FP8 (4 nodes) | ~1380 | 67.63 (rank 9) | ~0.04 | yes |
| Mistral Small 4 119B NVFP4 + EAGLE (this stack) | ~1420 | ~41 | ~0.05 | yes |
The closed-source rows offer roughly 80 Elo points of quality bonus at about 1,500 times the per-token cost (going by the table's $/M column) and a permanent dependency on someone else's infrastructure. The open-source rows, even mid-table, run at roughly the price of electricity.
This is the trade nobody graphs because graphs need both axes.
Where This Stack Lands
Mistral Small 4 119B with NVFP4 quantization, SGLang nightly-dev-cu13-20260323-999bad5a (the only build currently stable on SM121A), and EAGLE speculative decoding on a single GB10 box pulls roughly 41 tok/s on a tg128-equivalent workload. That is mid-table for this size class, not top 10.
The numbers are unsexy but real:
- Output throughput: ~41 tok/s single-stream, EAGLE accept rate 2.5–3.4
- Without EAGLE: 12–15 tok/s (benchmark detail)
- Time to first token: 200–600 ms depending on reasoning budget
- Context length: 65,536 tokens
- Memory utilization: 75% static (`--mem-fraction-static 0.75`)
- Power draw under load: ~140 W
- Quantization: NVFP4 (4-bit weights)
Why mid-table not top: spark-arena’s leaders are smaller models (0.8B–35B) on mature vllm + INT4 paths. A 119B model in NVFP4 stays memory-bandwidth-bound on unified memory. The trade is real:
- Smaller, faster, less authoritative answers (Qwen3.5-0.8B at 106 tok/s, Elo around 1300)
- Mid-size, fast, broadly useful (Qwen3.6-35B at 95 tok/s, Elo ~1430)
- Large, slower, more capable (Mistral Small 4 119B at 41 tok/s, Elo ~1420 with reasoning enabled)
The choice depends on the workload. Agentic loops with thousands of small calls reward 100+ tok/s. A single architecture-sized question rewards capability over throughput.
How to Read Both
Pick your category in arena.ai (Text, Code, Vision, etc.). Find the open-source models. Note their Elo. The gap to the top closed-source row is your quality tax for sovereignty.
Then spark-arena.com. Find the same models on your target hardware. Note tok/s. Multiply by ($/kWh × power) to get marginal cost per token.
Compare to the closed-source per-token API cost. If the ratio exceeds your tolerance for quality loss, self-host. If not, cloud. If unsure, run both for a week and measure.
Most real workflows are mixed. Cloud Claude for the architecture pass, local Mistral for the file-by-file refactor pass. Hybrid beats either pure-cloud or pure-local for most product builders. (The setup story covers the local half of that hybrid.)
Submitting Your Own Numbers
spark-arena.com is open-submission via spark-arena-cli. The flow:
```bash
# DGX Spark host is ARM64 (GB10 / ARM v9.2-A)
wget https://github.com/spark-arena/spark-arena-cli/releases/latest/download/spark-arena-cli_0.1.0_arm64.deb
sudo dpkg -i spark-arena-cli_0.1.0_arm64.deb
# x86_64 hosts: replace _arm64.deb with _amd64.deb (Linux + macOS binaries also published)

spark-cli login
spark-cli setup
spark-cli benchmark mistral-small-4-nvfp4-eagle.yaml
```
The benchmark uploads automatically to the leaderboard. There is no opt-out for that step in v0.1.0, which is a sovereignty trade-off worth noting: being on the public leaderboard means your hardware run profile becomes public data. Self-hosting privacy and public benchmarking are in tension here. For most setups that's fine. For air-gapped deployments, run the same scripts manually and skip the upload.
The recipe YAML is the interesting artifact. Once written for a specific stack like mistral-small-4-nvfp4-eagle, the same recipe runs reproducibly elsewhere. That single file becomes the most useful documentation a sovereign AI builder can publish: not “I get 41 tok/s,” but “here is the exact configuration that produces 41 tok/s on a GB10 host.”
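A recipe for this stack might look like the following. This is an illustrative sketch only: the real schema is defined by spark-arena-cli, and every field name below is an assumption built from details stated elsewhere in this article.

```yaml
# mistral-small-4-nvfp4-eagle.yaml -- hypothetical recipe sketch.
# Field names are illustrative; consult the spark-arena-cli docs for the real schema.
name: mistral-small-4-nvfp4-eagle
engine: sglang
engine_version: nightly-dev-cu13-20260323-999bad5a
model: mistral-small-4-119b
quantization: nvfp4
speculative_decoding: eagle
context_length: 65536
mem_fraction_static: 0.75
nodes: 1
benchmark:
  test: tg128
  concurrency: 1
```

The point is less the exact keys than the property: everything that produced the 41 tok/s figure lives in one versionable file.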
What’s Missing
A third leaderboard. One that ranks by privacy and sovereignty, with columns for data jurisdiction, log retention, training-on-your-data clauses, GDPR audit status, and “your code stays on your desk.” That leaderboard would put Anthropic, OpenAI, and Google into the same table as Mistral, Qwen, and DeepSeek and rank them on the dimension that matters for production deployments.
It does not exist yet. arena.ai will probably never add it because the closed-source rows would all rank at the bottom. spark-arena.com cannot add it because it benchmarks hardware, not policy.
Until that third leaderboard exists, read both that exist. Or pick a side and own the trade-off. The cost of the wrong call is one provider rotation, one outage, one regulatory letter, or one runaway monthly bill from convincing yourself the missing column did not matter.
The writing of this article followed the same hybrid pattern the article describes: cloud LLM as scaffold, local Mistral for draft, human polish. Sovereign by output, not by every keystroke.