Agent-bench: stop trusting install counts, start measuring your agent's tools
The agent-tool ecosystem has a measurement problem. MCP servers and skills are ranked by install counts and star counts, their READMEs make quantified claims (“75% token savings”, “surgical semantic edits”), and the evidence behind those claims is usually a demo GIF. Meanwhile you are deciding whether to wire one of these things into the agent that edits your code.
I wanted a different basis for that decision, so I built agent-bench: a small, dependency-free harness that answers one question with numbers instead of vibes. Does this enhancement make my agent measurably better, on my models, on my tasks?
The method in one paragraph
Every experiment is an A/B: the agent runs the same task with the enhancement (treatment arm) and without it (baseline arm), N times each, on each model you care about. An arm is one fixed configuration, and the arms in an experiment differ in exactly one variable. Success is decided by deterministic gates: the project type-checks, the rename actually happened, the answer contains the frozen list of required facts. No LLM grades another LLM anywhere. Per run, the harness records tool calls, input and output tokens, wallclock time, and for code edits three quality KPIs: how minimal the diff is against a reference patch, whether the full project still builds (not just the target), and whether the edit introduced lint violations. Raw JSONL for every published number lives in the repo.
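To make the A/B shape concrete, here is a minimal sketch of how per-run records could be rolled up per model and arm. The field names (arm, pass, inputTokens and so on) and the numbers are assumptions for illustration, not the repo's actual JSONL schema or any published result.

```javascript
// Hypothetical per-run records; the real JSONL schema in the repo may differ.
const runs = [
  { model: "local-a", arm: "baseline", pass: true, toolCalls: 5, inputTokens: 10000, outputTokens: 800, wallclockMs: 40000 },
  { model: "local-a", arm: "baseline", pass: true, toolCalls: 6, inputTokens: 12000, outputTokens: 900, wallclockMs: 42000 },
  { model: "local-a", arm: "baseline", pass: true, toolCalls: 7, inputTokens: 14000, outputTokens: 850, wallclockMs: 45000 },
  { model: "local-a", arm: "treatment", pass: true, toolCalls: 9, inputTokens: 20000, outputTokens: 820, wallclockMs: 50000 },
  { model: "local-a", arm: "treatment", pass: true, toolCalls: 8, inputTokens: 22000, outputTokens: 870, wallclockMs: 52000 },
  { model: "local-a", arm: "treatment", pass: true, toolCalls: 9, inputTokens: 24000, outputTokens: 910, wallclockMs: 51000 },
];

// Group runs by model and arm, then report pass rate and mean input tokens.
function summarize(records) {
  const groups = new Map();
  for (const r of records) {
    const key = `${r.model}/${r.arm}`;
    const g = groups.get(key) ?? { n: 0, passes: 0, inputTokens: 0 };
    g.n += 1;
    g.passes += r.pass ? 1 : 0;
    g.inputTokens += r.inputTokens;
    groups.set(key, g);
  }
  const out = {};
  for (const [key, g] of groups) {
    out[key] = {
      n: g.n,
      passRate: g.passes / g.n,
      meanInputTokens: g.inputTokens / g.n,
    };
  }
  return out;
}

const summary = summarize(runs);
console.log(summary);
```

Because the comparison stays inside one model, the only thing that differs between the two summary rows is the enhancement itself.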
What a deterministic gate actually is
A deterministic gate is a pass/fail check whose verdict never changes for the same input. That is the whole property that matters. tsc either exits 0 or it does not. A grep for the renamed symbol either finds it or it does not. A frozen checklist of required facts is either all present or it is not. A gate can also encode intent that a green build cannot show, which means two edits that both compile can still earn opposite verdicts.
Compare this to the usual alternative, an LLM asked “is this a good edit?”. That judge is non-deterministic by construction: run it twice on the same diff and you can get “looks correct” and “missing a call site” because temperature and context drift move the answer. For a benchmark whose entire job is to detect small regressions, a grader that disagrees with itself is worse than useless. This is why every gate here is a script, never a model.
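As a minimal sketch of what "a script, never a model" can mean, here are two gates written as pure functions. The function names and the return shape are hypothetical and are not taken from the harness; the only property that matters is that the same input always yields the same verdict.

```javascript
// Escape regex metacharacters so a symbol name is matched literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Gate 1: every fact from a frozen checklist must appear in the answer.
function factsGate(answer, requiredFacts) {
  const text = answer.toLowerCase();
  const missing = requiredFacts.filter(
    (fact) => !text.includes(fact.toLowerCase())
  );
  return { pass: missing.length === 0, missing };
}

// Gate 2: after a rename, the old symbol must be gone and the new one present.
// Word boundaries keep `getUserById` from counting as a hit for `getUser`.
function renameGate(source, oldName, newName) {
  const hasWord = (name) =>
    new RegExp(`\\b${escapeRegExp(name)}\\b`).test(source);
  return { pass: !hasWord(oldName) && hasWord(newName) };
}

console.log(factsGate("Needs Node 22 and git.", ["node 22", "git"]));
console.log(renameGate("fetchUser(); getUserById(1);", "getUser", "fetchUser"));
```

Running either gate twice on the same input cannot produce two different answers, which is exactly what an LLM judge cannot promise.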
The three quality KPIs, defined
Pass/fail is not enough for code edits, because two passing edits are not equally good. So each code run also records three numeric KPIs, measured deterministically. Diff minimality is how close the edit is to a hand-written reference patch: a 3-file, 10-line change against a 3-file, 10-line reference scores minimal, while an 8-file change for the same task is bloat. Regression-freedom means the whole project still type-checks, not just the file the agent was asked to touch, because a local fix that breaks a distant import is a net loss. Lint-cleanliness is the count of new linter violations the edit introduced, i.e. style and safety debt the agent left behind. These three turn a green checkmark into a profile, which is what lets the spokes say things like “same success rate, 158% more tokens, identical diff” instead of just “it worked”.
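One way to picture the three KPIs as numbers is sketched below. The scoring rules are assumptions for illustration: the harness may measure minimality differently, and counting new lint violations as "after minus before" is a simplification that would miss an edit which fixes one violation and adds another.

```javascript
// Hypothetical scoring: size of the agent's diff relative to the reference patch.
function diffMinimality(edit, reference) {
  return {
    fileRatio: edit.files / reference.files,
    lineRatio: edit.lines / reference.lines,
  };
}

// Combine the three KPIs into one profile for a single run.
function profile(run, reference) {
  const { fileRatio, lineRatio } = diffMinimality(run.diff, reference);
  return {
    fileRatio,
    lineRatio,
    minimal: fileRatio <= 1 && lineRatio <= 1,
    regressionFree: run.projectTypeChecks,
    newLintViolations: Math.max(0, run.lintAfter - run.lintBefore),
  };
}

const reference = { files: 3, lines: 10 };

// Same size as the reference patch, whole project still type-checks, no new lint debt.
const tidy = profile(
  { diff: { files: 3, lines: 10 }, projectTypeChecks: true, lintBefore: 2, lintAfter: 2 },
  reference
);

// Also passes the build, but touches 8 files and leaves 3 new violations behind.
const bloated = profile(
  { diff: { files: 8, lines: 40 }, projectTypeChecks: true, lintBefore: 2, lintAfter: 5 },
  reference
);

console.log({ tidy, bloated });
```

Both runs would show the same green checkmark; only the profile separates them.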
Why this matters more than it sounds
The payoff of deterministic gates is catching failures that look like successes. The Serena benchmark turned up a weak model whose broken edit compiled and linted clean: a CI pipeline would have shipped it, and an LLM judge reading the diff would likely have praised it. Only a gate that encodes the intended target caught the regression. The detail lives in that write-up; the point for the method is that a benchmark which cannot tell a green-but-wrong build from a green-and-right one is measuring the wrong thing. That class of bug is exactly why the gates are scripts and the KPIs are counted, not judged.
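To illustrate a gate that encodes the intended target, here is a made-up example; it is not the actual Serena task. Assume the job is to rename parse to parseConfig in config.ts while an unrelated parse in date.ts stays untouched. The file names, symbols, and rule format are all hypothetical.

```javascript
// Per-file expectations that describe the intended outcome of the rename.
const expectations = [
  { file: "config.ts", mustContain: "parseConfig", mustNotContain: /\bparse\b/ },
  { file: "date.ts", mustContain: "parse(", mustNotContain: /\bparseConfig\b/ },
];

// Check every rule against the file contents and collect failures.
function targetGate(files, rules) {
  const failures = [];
  for (const rule of rules) {
    const text = files[rule.file] ?? "";
    if (!text.includes(rule.mustContain)) {
      failures.push(`${rule.file}: missing ${rule.mustContain}`);
    }
    if (rule.mustNotContain.test(text)) {
      failures.push(`${rule.file}: unexpected ${rule.mustNotContain}`);
    }
  }
  return { pass: failures.length === 0, failures };
}

// The intended edit: only the function in config.ts is renamed.
const rightEdit = {
  "config.ts": "export function parseConfig(raw: string) { return raw.split('='); }",
  "date.ts": "export function parse(s: string) { return new Date(s); }",
};

// A plausible wrong edit: the same-named function in the other file is renamed.
// If call sites were updated to match, this could still build and lint clean.
const wrongEdit = {
  "config.ts": "export function parse(raw: string) { return raw.split('='); }",
  "date.ts": "export function parseConfig(s: string) { return new Date(s); }",
};

const rightVerdict = targetGate(rightEdit, expectations);
const wrongVerdict = targetGate(wrongEdit, expectations);
console.log(rightVerdict, wrongVerdict);
```

A compiler has no opinion about which of the two functions was supposed to change; the gate does, and that is the difference between green-and-right and green-but-wrong.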
What it is not
agent-bench is not an academic benchmark and does not want to be one. MCP-Bench and friends measure how well models use MCP tools in general, against fixed task suites, mostly on frontier APIs. Useful for model comparisons, useless for the question “should I install this”. agent-bench is also not a leaderboard of models: it compares arms within a model, so every conclusion is about the enhancement, not about which model is smarter. And it is deliberately small: a few hundred lines of Node with zero runtime dependencies, because a benchmark you cannot audit in fifteen minutes is just another claim.
What it has measured so far
Two case studies, both fully written up:
Serena (semantic code tools, MCP): a strong local model gained nothing on refactor tasks and paid 15-158% more input tokens for the privilege. The weak model was not fixed by it either, but its failure mode changed from “confidently wrong and it compiles” to “incompletely right”, which is a real safety difference. Verdict: SITUATIONAL, a guardrail rather than a turbo.
caveman (token-compression skill): claims ~75% savings. Measured: -31% on local models, -33% best-case on Claude, and +18% on Fable 5 (it speaks fluent caveman and uses the saved words to say more things). In measured dollars it was never cheaper on any model, because the injected instruction is billed on every request. Verdict: SKIP.
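The dollar result for caveman is easier to see as arithmetic. The sketch below uses invented prices, token counts, and instruction size purely for illustration; they are not the measured values from the write-up, so substitute your own numbers.

```javascript
// Cost of a session in USD, given per-request token counts and per-million-token prices.
function costUsd(session, price) {
  const input = (session.requests * session.inputTokensPerRequest * price.inputPerMTok) / 1e6;
  const output = (session.requests * session.outputTokensPerRequest * price.outputPerMTok) / 1e6;
  return input + output;
}

// Hypothetical prices: 3 USD per million input tokens, 15 USD per million output tokens.
const price = { inputPerMTok: 3, outputPerMTok: 15 };

// Hypothetical baseline session: 10 requests, 8000 input and 300 output tokens each.
const baseline = { requests: 10, inputTokensPerRequest: 8000, outputTokensPerRequest: 300 };

// With the skill: output shrinks by a third, but an assumed 1500-token
// instruction is added to the input of every single request.
const withSkill = {
  requests: 10,
  inputTokensPerRequest: 8000 + 1500,
  outputTokensPerRequest: 200,
};

const baselineCost = costUsd(baseline, price);
const skillCost = costUsd(withSkill, price);
console.log({ baselineCost, skillCost, skillIsCheaper: skillCost < baselineCost });
```

Under these assumptions the output savings are real, but the per-request input overhead outweighs them, which is the same mechanism the measured result points to.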
Both write-ups follow the same template: a verdict box up top (ADOPT / SITUATIONAL / SKIP, install-if, skip-if, cost, and a mandatory “Do I run it myself?” disclosure), then methodology, results, limitations, and a reproduce section. Negative results ship with the same prominence as positive ones. That is the series contract.
Benchmark hygiene, learned the hard way
One result in the caveman matrix initially showed the skill rescuing the weak model on the hardest task, three out of three. It was an artifact: a health-check timer had resurrected an idle model mid-benchmark, both engines fought over unified memory, and the runs degraded. The clean rerun showed zero out of three. The contaminated data was deleted, the rerun is the published number, and the incident is documented. If your benchmark infrastructure can lie to you, it eventually will; the method has to include noticing.
Run it on your stack
The harness needs Node 22, git, and opencode with your models configured as providers. Point MODELS at your provider ids, smoke one run, then run a matrix. Adding a new MCP server or skill to test is one entry in a config file, no harness changes. The repo’s AGENTS.md is a contract for AI agents working on the codebase itself, and the README documents the two opencode footguns that cost me a night so they do not cost you one.
git clone https://github.com/cipherfoxie/agent-bench && cd agent-bench && npm install
node runner/smoke.js baseline <provider/model> ts-rename
EXPERIMENT=serena ARMS=baseline,serena TASK_NAME=ts-ambiguous N=5 node runner/bench.js
If you benchmark something from the ecosystem’s top charts with it, I would genuinely like to see the numbers, especially if they disagree with mine.
What is next in the series
The two case studies so far cover the top of the Claude marketplace skills and MCP directories at the time of writing. The series continues through the charts in the same format: one tool at a time, same harness, same verdict scale, no skipping negative results.
Candidates on the list: memory MCPs (persistent context across sessions), browser automation MCPs (tools that give agents a real browser), and a few skills that make bolder efficiency claims than caveman did. The question is always the same: does this thing actually help on the tasks I run, or is the install count doing the marketing?
If you benchmark something from the top charts before I do and want to submit numbers, the harness is generic and the AGENTS.md explains the contract. Open an issue or PR on the repo.
This is the pillar of the agent-bench series. Spokes so far: Serena, caveman. The repo: github.com/cipherfoxie/agent-bench. Follow via RSS or Nostr.