Two Leaderboards Nobody Reads Together: Why arena.ai Doesn't Tell You About Self-Hosted AI
Most “best LLM” articles cite arena.ai. They show Claude Opus 4.7 at Elo 1503, GPT-5.5 High at 1488, Mistral Small 4 somewhere mid-table around 1420. End of story. But a leaderboard that ignores where the model runs, who owns the data, and who controls the kill-switch is half a leaderboard.
arena.ai ranks models by human-judged quality and prints per-token cloud API pricing alongside. spark-arena.com ranks models by raw tokens-per-second on a single NVIDIA DGX Spark. Neither combines them into a self-host total-cost-of-ownership column.
This article reads both at the same time, then asks the column nobody publishes: what does running this myself cost, including hardware and electricity, and who owns the result?
Correction 2026-05-13. This article originally said “arena.ai ignores cost entirely.” That was wrong: arena.ai shows a “Price $/M” column (input/output split) plus input and output price filters. The accurate framing, now reflected throughout, is that arena.ai shows cloud API pricing for hosted models while spark-arena.com shows self-host hardware throughput; neither combines them into a self-host TCO column (hardware amortization + electricity per token), which remains the third missing axis. The article also originally cited “$75 per million output tokens” for Claude Opus 4.7; the actual Anthropic list price is $5 input / $25 output per MTok (Opus 4.5/4.6/4.7 share this rate; the older Opus 4 / 4.1 / 3 used $15/$75). The corrected cost ratio appears in Cost below. The spark-cli command names in Submitting Your Own Numbers should have been spark-arena-cli (interactive REPL, not a per-command flag invocation). Thanks to reader feedback for catching all three.
New here? I shipped a Self-Hosted AI: Start Here hub article that walks through the hardware-decision tree, the inference-engine choice, and what hurts most after you start. Read this article for the leaderboard framing; read that one when you decide to act on it.
On this page:
- Two Leaderboards, Two Currencies
- What Each Leaderboard Hides
- The Three-Column View
- Where This Stack Lands
- How to Read Both
- Submitting Your Own Numbers
- Other leaderboards worth knowing
- What’s Missing
Two Leaderboards, Two Currencies
Quality and throughput are different currencies, measured by different people, optimized for different audiences.
Numbers snapshot 2026-04-29. Both leaderboards update continuously; verify against the live tables at arena.ai and spark-arena.com before quoting any specific Elo or tok/s. The framing below is durable; the specific numbers are not.
| | arena.ai | spark-arena.com |
|---|---|---|
| Question answered | Which model gives better answers? | Which model runs fastest on my hardware? |
| Methodology | Pairwise human votes converted to Elo | Empirical benchmark on NVIDIA DGX Spark, test type tg128, concurrency 1 |
| Top 5 (text, 2026-04-29) | Claude Opus 4.7 (1503), Claude Opus 4.6 Thinking (1501), Claude Opus 4.6 (1496), Claude Opus 4.7 Thinking (1493), Gemini 3.1 Pro (1493) | Qwen3.5-0.8B BF16/sglang (106.69 tok/s), Qwen3.6-35B-A3B-PrismaQuant INT4/vllm (95.11), Qwen3.6-35B-A3B int4-AutoRound (92.34), gpt-oss-120b MXFP4 2 nodes (75.96), gemma-4-26B-A4B FP8 4 nodes (67.63) |
| Open-source presence | Mistral, Qwen, DeepSeek, GLM appear, rarely top 10 | Exclusively open-weight (closed models publish no weights to run locally) |
| Cost dimension | Cloud API price column ($/M input + $/M output, with filters) | Implicit (hardware amortization + electricity, not surfaced) |
| Sovereignty dimension | Ignored | Required |
| How to submit | Vote in Battle Mode | spark-arena-cli interactive REPL, benchmark recipe.yaml (auto-uploads) |
Two patterns jump out from the spark-arena top 10. First, vllm dominates; sglang places only once, at rank 1 with a tiny 0.8B model. For now, vllm appears to have the better-optimized path on this hardware class. Second, cluster size has surprising trade-offs: gpt-oss-120b hits 75.96 tok/s on 2 nodes but drops to 63.10 on 4. For that workload, communication overhead outweighs the throughput gain past 2 nodes.
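To make the second pattern concrete, the short sketch below computes per-node throughput for the two gpt-oss-120b figures quoted above. It uses only those two published numbers; the function name is illustrative, and a single-stream tg128 run at concurrency 1 may scale very differently from a batched workload.

```python
def per_node_throughput(tok_per_s: float, nodes: int) -> float:
    """Single-stream tokens per second divided by the number of nodes used."""
    return tok_per_s / nodes


# gpt-oss-120b MXFP4 figures from the spark-arena.com snapshot (2026-04-29).
runs = {2: 75.96, 4: 63.10}

for nodes, tps in runs.items():
    share = per_node_throughput(tps, nodes)
    print(f"{nodes} nodes: {tps:.2f} tok/s total, {share:.2f} tok/s per node")

# Relative change in total throughput when doubling from 2 to 4 nodes.
change = runs[4] / runs[2] - 1.0
print(f"2 -> 4 nodes: {change:+.1%} total throughput")
```

Doubling the node count here lowers total single-stream throughput, so each extra node contributes less than nothing for this test type.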
What Each Leaderboard Hides
Cost
arena.ai prints cloud API pricing per million tokens. spark-arena.com prints hardware throughput. Neither prints the combined math: hardware amortization plus electricity per self-hosted token, side by side with the cloud cost the closed-source row would have charged.
Daily output of 500,000 tokens at $25 per million output tokens (Claude Opus 4.7 list price as of 2026-05-13: $5 input / $25 output per MTok) equals $12.50 per day, $375 per month. The same workload on local Mistral Small 4 NVFP4 at 41 tok/s takes roughly 3.4 hours of compute at about 140 W, which is about 480 Wh, or about 14 cents at €0.30 per kWh. The DGX Spark hardware amortizes against this workload in roughly nine months at this token volume. After that, the next 500,000 tokens cost electricity only. The cloud-to-self-host cost ratio is roughly 87 to 1 per day of operating expense once the hardware is paid off (comparing dollars to euros without conversion), in exchange for roughly 80 Elo points of quality.
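The arithmetic in this paragraph is easy to re-run with your own numbers. The sketch below is a minimal version under stated assumptions: the 140 W draw is the load figure from Where This Stack Lands, input-token charges, idle power, and cooling are ignored, and dollars and euros are compared without conversion. The function names are illustrative.

```python
def cloud_daily_cost(tokens_per_day: int, usd_per_mtok: float) -> float:
    """Cloud output cost per day in USD."""
    return tokens_per_day / 1_000_000 * usd_per_mtok


def local_daily_energy_kwh(tokens_per_day: int, tok_per_s: float, watts: float) -> float:
    """Energy in kWh to generate the daily tokens at a steady single-stream rate."""
    hours = tokens_per_day / tok_per_s / 3600
    return hours * watts / 1000


tokens_per_day = 500_000
cloud_usd = cloud_daily_cost(tokens_per_day, usd_per_mtok=25.0)
kwh = local_daily_energy_kwh(tokens_per_day, tok_per_s=41.0, watts=140.0)
local_eur = kwh * 0.30  # tariff in EUR per kWh

print(f"cloud:  ${cloud_usd:.2f} per day")
print(f"local:  {kwh:.3f} kWh, about EUR {local_eur:.2f} per day")
print(f"ratio:  {cloud_usd / local_eur:.0f} to 1 (currencies not converted)")
```

Every input scales the result linearly, so a different tariff, wattage, or token volume shifts the ratio in direct proportion.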
That trade is the part arena.ai’s price column and spark-arena.com’s tok/s column do not co-present. It is the entire story if you self-host. arena.ai readers see only the cloud half; spark-arena.com readers see only the throughput half.
Sovereignty
arena.ai assumes inference is cloud, network is reliable, API keys do not rotate, and someone else’s data center is your problem. spark-arena.com assumes you already control the hardware. Neither prints a sovereignty score.
For workloads touching medical records, financial data, internal architecture notes, or anything regulated by GDPR, the missing column is the one that ranks models by data jurisdiction, log retention, and the training-on-your-data clause. Until that leaderboard exists, the choice is binary: accept the dependency or build the stack.
Latency under load
arena.ai’s Elo scale measures isolated single-prompt votes. spark-arena.com’s tg128 benchmark measures single-stream throughput at concurrency 1. Neither captures what happens when an agentic loop fires hundreds of small calls per session.
A coding assistant making 200 small completions per minute rewards 100+ tok/s steady-state. A single architecture-grade question rewards capability over throughput. The missing column would rank models by p99 latency under realistic concurrent load. Both leaderboards ignore it.
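If you log per-request latencies from your own agentic sessions, that missing column is straightforward to compute locally. The sketch below uses the nearest-rank percentile definition; the sample latencies are invented for illustration and are not measurements from either leaderboard or from this stack.

```python
import math


def percentile_nearest_rank(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of values at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Hypothetical request latencies in milliseconds: mostly fast, a few slow, one stall.
latencies_ms = [210.0] * 95 + [480.0] * 4 + [2600.0]

p50 = percentile_nearest_rank(latencies_ms, 50)
p99 = percentile_nearest_rank(latencies_ms, 99)
print(f"p50 = {p50:.0f} ms, p99 = {p99:.0f} ms")
```

The median hides the slow tail entirely, which is why a concurrency-1 throughput number says little about how an agent loop feels under load.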
The Three-Column View
The decision that matters fits in one table the rest of the industry refuses to print.
Numbers snapshot 2026-04-29. Cloud per-token pricing changes; verify on each provider’s pricing page before locking a budget against these rows.
| Model | arena.ai Elo | spark-arena tok/s | $/M output (cloud) | Sovereign? |
|---|---|---|---|---|
| Claude Opus 4.7 | 1503 | n/a (cloud only) | 25 | no |
| Claude Opus 4.6 Thinking | 1501 | n/a (cloud only) | 25 | no |
| GPT-5.5 High | 1488 | n/a (cloud only) | ~60 | no |
| Gemini 3 Pro | 1486 | n/a (cloud only) | ~30 | no |
| Qwen3.5-0.8B BF16 (sglang) | not ranked | 106.69 (rank 1) | ~0.01 | yes |
| Qwen3.6-35B-A3B INT4 (vllm) | ~1430 | 95.11 (rank 2) | ~0.04 | yes |
| gpt-oss-120b MXFP4 (2 nodes) | ~1410 | 75.96 (rank 4) | ~0.06 | yes |
| Qwen3-Coder-Next int4 | ~1395 (Code) | 73.33 (rank 6) | ~0.05 | yes |
| gemma-4-26B-A4B FP8 (4 nodes) | ~1380 | 67.63 (rank 9) | ~0.04 | yes |
| Mistral Small 4 119B NVFP4 + EAGLE (this stack) | ~1420 | ~41 | ~0.05 | yes |
The closed-source rows offer roughly 80 Elo points of quality bonus for roughly 500 times the per-token cost (25 versus ~0.05 $/M output in the table above) and a permanent dependency on someone else’s infrastructure. The open-source rows, even mid-table, cost roughly the price of electricity.
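It helps to translate an Elo gap into something tangible. The sketch below assumes the conventional logistic Elo formula with a 400-point scale; arena.ai's exact rating model may differ, so treat the result as an approximation of how often the higher-rated model would win a pairwise vote.

```python
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of A against B (win probability, ties counted as half)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


# Claude Opus 4.7 (1503) against the ~1420 estimate for Mistral Small 4 from the table.
p = elo_expected_score(1503, 1420)
print(f"Expected preference for the higher-rated model: {p:.1%}")
```

Under that assumption, a gap of about 80 points means the higher-rated model is preferred in roughly six of ten head-to-head votes, not ten of ten.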
This is the trade nobody graphs because graphs need both axes.
Where This Stack Lands
Mistral Small 4 119B with NVFP4 quantization, SGLang nightly-dev-cu13-20260323-999bad5a (the only build stable on SM121A right now), and EAGLE speculative decoding, on a single GB10 box, pulls roughly 41 tok/s on a tg128-equivalent workload. That is mid-table for this size class, not top 10.
The numbers are unsexy but real:
- Output throughput: ~41 tok/s single-stream, EAGLE accept rate 2.5 to 3.4
- Without EAGLE: 12 to 15 tok/s (benchmark detail)
- Time to first token: 200 to 600 ms depending on reasoning budget
- Context length: 65,536 tokens
- Memory utilization: 75% static (--mem-fraction-static 0.75)
- Power draw under load: ~140 W
- Quantization: NVFP4 (4-bit weights)
Why mid-table not top: spark-arena’s leaders are smaller models (0.8B to 35B) on mature vllm + INT4 paths. A 119B model in NVFP4 stays memory-bandwidth-bound on unified memory. The trade is real:
- Smaller, faster, less authoritative answers (Qwen3.5-0.8B at 106 tok/s, Elo around 1300)
- Mid-size, fast, broadly useful (Qwen3.6-35B at 95 tok/s, Elo ~1430)
- Large, slower, more capable (Mistral Small 4 119B at 41 tok/s, Elo ~1420 with reasoning enabled)
The choice depends on the workload. Agentic loops with thousands of small calls reward 100+ tok/s. A single architecture-sized question rewards capability over throughput.
How to Read Both
Pick your category in arena.ai (Text, Code, Vision, etc.). Find the open-source models. Note their Elo. The gap to the top closed-source row is your quality tax for sovereignty. That tax buys something no leaderboard column prices. The week a frontier vendor’s models were switched off for every non-US user is the argument: a reachable mid-table local model beats an unreachable top-table cloud one on the only day the comparison is forced.
Then spark-arena.com. Find the same models on your target hardware. Note tok/s. Divide the hourly electricity cost ($/kWh × power in kW) by tokens per hour (tok/s × 3600) to get marginal cost per token.
Compare to the closed-source per-token API cost. If the ratio exceeds your tolerance for quality loss, self-host. If not, cloud. If unsure, run both for a week and measure.
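The comparison can be sketched in a few lines. This is a minimal version that covers electricity only, assuming steady single-stream generation at the measured tok/s and a constant power draw; the function names are illustrative, and the example inputs (41 tok/s, 140 W, 0.30 per kWh, 25 per MTok cloud) are the figures used elsewhere in this article.

```python
def marginal_cost_per_mtok(tok_per_s: float, watts: float, price_per_kwh: float) -> float:
    """Electricity cost to generate one million tokens at a steady rate."""
    hours = 1_000_000 / tok_per_s / 3600
    kwh = watts / 1000 * hours
    return kwh * price_per_kwh


def cloud_to_local_ratio(cloud_per_mtok: float, local_per_mtok: float) -> float:
    """How many times more the cloud charges per million output tokens."""
    return cloud_per_mtok / local_per_mtok


local = marginal_cost_per_mtok(tok_per_s=41.0, watts=140.0, price_per_kwh=0.30)
ratio = cloud_to_local_ratio(cloud_per_mtok=25.0, local_per_mtok=local)
print(f"local electricity: {local:.2f} per MTok, cloud/local ratio: {ratio:.0f}x")
```

The ratio is only the operating-expense half of the decision; hardware amortization and the Elo gap still have to be weighed against it.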
Most real workflows are mixed. Cloud Claude for the architecture pass, local Mistral for the file-by-file refactor pass. Hybrid beats either pure-cloud or pure-local for most product builders. (The setup story covers the local half of that hybrid.)
Submitting Your Own Numbers
spark-arena.com is open-submission via spark-arena-cli. The tool is an interactive REPL, not a per-command CLI. The flow:
# DGX Spark host is ARM64 (GB10 / ARM v9.2-A)
wget https://github.com/spark-arena/spark-arena-cli/releases/latest/download/spark-arena-cli_0.1.0_arm64.deb
sudo dpkg -i spark-arena-cli_0.1.0_arm64.deb
# x86_64 hosts: replace _arm64.deb with _amd64.deb (Linux + macOS binaries also published)
spark-arena-cli # launches the REPL
spark-arena> login # Google or GitHub, admin pre-approval required
spark-arena> setup
spark-arena> benchmark mistral-small-4-nvfp4-eagle.yaml
The benchmark uploads automatically to the leaderboard. There is no opt-out for that step in v0.1.0, which is a sovereignty trade-off worth noting: to be on the public leaderboard, you accept that your hardware run profile becomes public data. Self-hosting privacy and public benchmarking are in tension here. For most setups that’s fine. For air-gapped deployments, run the same scripts manually and skip the upload.
The recipe YAML is the interesting artifact. Once written for a specific stack like mistral-small-4-nvfp4-eagle, the same recipe runs reproducibly elsewhere. That single file becomes the most useful documentation a sovereign AI builder can publish: not “I get 41 tok/s,” but “here is the exact configuration that produces 41 tok/s on a GB10 host.”
Other leaderboards worth knowing
The two-leaderboard frame above is the cleanest way to introduce the gap between cloud quality and self-host throughput. It is not an exhaustive map of the leaderboard ecosystem. The honest version of this article names what else is out there so readers do not stop their research at arena.ai plus spark-arena.com.
artificialanalysis.ai is the closest thing to a “three-column view” that already exists. Their LLM leaderboard combines four metrics in one table: Intelligence Index (their internal quality score), Blended USD per 1M tokens, Median Output Speed (tok/s), and Latency to first token. 360 models, both cloud and open-weight, on the same table. Updated about eight times per day over a rolling 72-hour window. If you only want one cloud-comparison leaderboard, this is the one. What it still does not show: the self-host TCO column. Their speed metric is cloud-vendor-reported, not “what this model does on a DGX Spark in your basement”. Their price column is cloud API pricing, not “your hardware amortization plus your electricity bill divided by your tokens”.
Hugging Face Open LLM Leaderboard ranks open-weight models on academic benchmarks (MMLU-Pro, GPQA, MATH, IFEval, BBH, MUSR). No human-preference Elo, no cloud cost, no hardware throughput. It answers “which open model does best on standardized exams” which is a useful but narrow question.
OpenRouter Rankings ranks models by actual usage volume across thousands of apps routed through their gateway. The lens is “what production teams are paying for right now,” which encodes preference, price, and reliability into a single signal. Cloud-API only, but the closest thing to “what working developers actually run”.
SEAL Leaderboards (Scale AI) run private evaluations on domain-specific benchmarks (coding, agents, math) and publish quarterly. Methodology is opaque-by-design to prevent training contamination. Useful as a tiebreaker when arena.ai’s preference vote disagrees with academic benchmark rankings.
What none of these add: the column the rest of this article argues about. Self-host TCO (hardware amortization + electricity per token) is the dimension every one of them leaves to the operator. Until one of them adds it, the math in Cost above is the math you do yourself.
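A rough version of that missing column fits in one function. The sketch below is an illustration under loudly stated assumptions: the 4,000 hardware price and the 36-month amortization period are hypothetical placeholders rather than figures from this article, the monthly volume is 500,000 tokens per day over 30 days, and the 0.28 per MTok electricity figure assumes 41 tok/s at 140 W and 0.30 per kWh.

```python
def self_host_tco_per_mtok(
    hardware_price: float,
    amortization_months: int,
    tokens_per_month: int,
    electricity_per_mtok: float,
) -> float:
    """Hardware amortization plus electricity, per million generated tokens."""
    mtok_per_month = tokens_per_month / 1_000_000
    amortization_per_mtok = hardware_price / amortization_months / mtok_per_month
    return amortization_per_mtok + electricity_per_mtok


tco = self_host_tco_per_mtok(
    hardware_price=4000.0,        # hypothetical placeholder, not a quoted price
    amortization_months=36,       # assumed useful life
    tokens_per_month=15_000_000,  # 500,000 tokens per day for 30 days
    electricity_per_mtok=0.28,    # assumed: 41 tok/s, 140 W, 0.30 per kWh
)
print(f"self-host TCO: about {tco:.2f} per MTok")
```

At low token volumes the amortization term dominates the electricity term, so the result depends far more on how hard the box is used than on the tariff.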
Not actual competitors despite SEO claims: AI productivity workspaces like chatlyai.app, GPT-store-style aggregators, and “Best AI Tools 2026” listicles. These market themselves as “vs arena.ai” because comparison-page SEO is cheap. They do not benchmark, rank, or publish performance data. Save the click.
What’s Missing
A third leaderboard. One that ranks by privacy and sovereignty, with columns for data jurisdiction, log retention, training-on-your-data clauses, GDPR audit status, and “your code stays on your desk.” That leaderboard would put Anthropic, OpenAI, and Google into the same table as Mistral, Qwen, and DeepSeek and rank them on the dimension that matters for production deployments.
It does not exist yet. arena.ai will probably never add it because the closed-source rows would all rank at the bottom. spark-arena.com cannot add it because it benchmarks hardware, not policy.
Until that third leaderboard exists, read both that exist. Or pick a side and own the trade-off. The cost of the wrong call is one provider rotation, one outage, one regulatory letter, or one runaway monthly bill from convincing yourself the missing column did not matter.
The writing of this article followed the same hybrid pattern the article describes: cloud LLM as scaffold, local Mistral for draft, human polish. Sovereign by output, not by every keystroke.
Where to next
If this framing landed for you, the operational follow-up is the Self-Hosted AI: Start Here hub article. It walks through the hardware-decision tree (DGX Spark vs Mac Studio vs used 3090s vs cloud-rented), the inference-engine choice (SGLang vs vLLM vs llama.cpp), the minimum-viable agent-ready deploy, and the operational gotchas that bite hardest in the first three months on this kind of stack.
If you came from a Bitcoin context (self-custody discipline, no-KYC infrastructure, V4V tipping), the Bitcoin-context bridge section maps the not-your-keys-not-your-coins mental model directly onto AI inference.
For the live state of this stack today (what is running, what is being built next, what is honestly broken), the Sovereign AI Grid roadmap is the status snapshot updated as the stack evolves.
Correction log (2026-05-13)
As of 2026-05-13 this article was the most-viewed page on sovgrid.org per the public NSM dashboard (45 views over the 30-day window, ahead of every other blog post). A reader-driven fact-check followed. The questions surfaced three concrete errors. A deep-research pass against the live source pages (arena.ai, Anthropic pricing docs, the spark-arena-cli README) confirmed each. Changes below, with anchor links to the affected sections.
- Two Leaderboards, Two Currencies: table row “Cost dimension: Ignored” rewritten to “Cloud API price column ($/M input + $/M output, with filters)”. arena.ai does in fact print pricing; it is the self-host TCO column that is missing.
- Cost: Claude Opus 4.7 list price corrected from “$75 per million output tokens” to “$5 input / $25 output per MTok” (the $15/$75 rate applies to the older Opus 4 / 4.1 / 3). Daily cost recalculation: $12.50/day instead of $37.50/day. Cloud-to-self-host ratio corrected from “11,000-to-1 per token” to “125-to-1 per day of operating expense once hardware amortizes”. The qualitative trade (cloud cost vs electricity, 80 Elo bonus) is unchanged.
- The Three-Column View: $/M output column corrected for the closed-source rows (Opus 4.7/4.6 to 25, GPT-5.5 High to ~60, Gemini 3 Pro to ~30, from the pre-correction 75/75/100/~80). The “1,500-times the per-token cost” summary line corrected to “500-times”.
- Submitting Your Own Numbers: spark-cli command names replaced with the actual tool name spark-arena-cli. The tool is an interactive REPL, not a per-command flag-invocation CLI. Code block now shows the correct launch-then-type-inside flow.
Why these slipped past the publish pipeline: factcheck.py validates Docker / PyPI / npm registry presence only. Arbitrary claims about external services (Anthropic API pricing, competitor leaderboard features, third-party CLI invocation patterns) are out of scope for that gate. The lesson moved into the operator’s fact-fabrication audit memo: external-service claims require manual verification against the live source page, not just registry-presence checks. The visible Correction block at the top of this article is the standard going forward whenever a published claim turns out to be wrong.