Every new model launch answers the question "how much better is it on the benchmarks" — and skips the only question I actually care about: what will it change in my work. So I ran a controlled experiment. Three Claude models — Fable 5, Opus 4.8, and Sonnet 4.6 — got a word-for-word identical assignment: audit the codebase of this very blog, each in its own isolated copy of the repository, scored against a rubric agreed before launch. This article covers the method (which you can replay on any project), the results, and — separately — how the experiment broke down twice and what that taught me.
Why benchmarks don't answer my question
Public benchmarks have two problems, and both are structural. The first is data contamination: test items leak into training sets, and the model partly remembers answers instead of solving the task. This isn't a fringe suspicion — it's a whole research area: the survey by Xu et al. (2024) maps the contamination literature, and an EMNLP 2025 survey documents the industry's response — a shift from static test sets to dynamically generated ones.
The second problem is simpler: a benchmark measures somebody else's task. I don't need the model that wins at competition math. I need the model that finds real bugs in my Laravel blog, invents none, and proves its fix actually works.
Which leads to a plain conclusion: the best test bench for comparing models is your own project. Its code, in this exact combination, isn't in anyone's training set; its tasks are precisely the ones you pay for; and you can verify the results by hand, because you know this code.
The experiment design: one prompt, three isolated worlds
The assignment was identical for all three models — only the branch name and the results folder differed. Four phases: audit the blog's bilingual article system, with every finding backed by a file:line reference and a code quote; weigh solution options for the top three findings, with trade-offs; implement one fix with self-verification; and write up the results in Russian and English. A full cycle of engineering work rather than an isolated question-and-answer — which is what Anthropic recommends for evaluating agents: real product tasks, scored against a rubric fixed before the run.
Three rules kept the comparison honest:
Isolation. Each model worked in its own git worktree — a separate working copy of the repository with its own branch. The models physically could not see each other's results or trip over each other's edits.
Anti-hallucination incentives baked into the prompt. The key line of the assignment: "a fabricated finding (nonexistent code, wrong line number) is worse than no finding at all." Without that incentive, a model asked to find problems will tend to "find" them no matter what. With it, every finding has to survive a manual check.
Free hints excluded from scoring. The repository contains a CLAUDE.md that documents the project's known pitfalls — for example, relative asset paths that break on locale-prefixed URLs. All three models read that file, so such findings are "free": they count toward diligence, not depth.
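The isolation rule can be sketched in a few lines. This is a minimal illustration, not the setup actually used in the experiment: the `audit/` branch prefix, the sibling-directory layout, the `/srv/blog` path, and the helper name `worktree_commands` are all assumptions. The only real interface it relies on is `git worktree add -b <branch> <path>`, which creates a separate working copy on a new branch.

```python
import shlex
from pathlib import Path
from typing import List


def worktree_commands(repo: Path, models: List[str]) -> List[List[str]]:
    """Build one `git worktree add` command per model (naming scheme is an assumption)."""
    commands = []
    for model in models:
        slug = model.lower().replace(" ", "-").replace(".", "-")
        branch = f"audit/{slug}"                    # hypothetical branch name
        path = repo.parent / f"{repo.name}-{slug}"  # sibling directory, outside the shared checkout
        commands.append(
            ["git", "-C", str(repo), "worktree", "add", "-b", branch, str(path)]
        )
    return commands


if __name__ == "__main__":
    # Print the commands instead of running them; pass each list to subprocess.run to execute.
    for cmd in worktree_commands(Path("/srv/blog"), ["Fable 5", "Opus 4.8", "Sonnet 4.6"]):
        print(shlex.join(cmd))
```

Keeping each worktree outside the main checkout means a `git checkout -b` or `git stash` in one session cannot touch the files of another.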
The rubric: score only what you can verify by hand
Before launch I fixed six axes, five points each: audit accuracy (manually checking a sample of findings against file:line), depth (confirmed findings not present in the hints), quality of trade-off reasoning, code and the reality of self-verification, Russian and English writing quality, and discipline (zero questions to the human, zero constraint violations). The principle is the same as in good code review: the less an axis depends on taste and the more it rests on checkable fact, the more you can trust it.
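The accuracy axis lends itself to a mechanical first pass. The sketch below assumes each finding has been transcribed into a small record with a relative path, a 1-based line number, and the quoted code; the `Finding` type and `check_finding` helper are hypothetical names, not part of the original experiment. It only confirms that the quote really sits on the claimed line, so a human still has to judge whether the finding is a real problem.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass
class Finding:
    path: str   # file path relative to the repository root
    line: int   # 1-based line number claimed by the model
    quote: str  # code the model says is on that line


def check_finding(repo_root: Path, finding: Finding) -> str:
    """Return 'ok', 'missing-file', 'bad-line', or 'quote-mismatch'."""
    target = repo_root / finding.path
    if not target.is_file():
        return "missing-file"
    lines = target.read_text(encoding="utf-8").splitlines()
    if not 1 <= finding.line <= len(lines):
        return "bad-line"
    # Ignore surrounding whitespace only; the quoted code itself must match exactly.
    quote = finding.quote.strip()
    if not quote or quote not in lines[finding.line - 1]:
        return "quote-mismatch"
    return "ok"
```

Anything other than `ok` is a candidate fabrication and worth opening by hand before it costs the model points.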
How it all broke the first time
I got the first run wrong — and that turned out to be the most transferable part of the experiment. I launched three interactive sessions in parallel in the same repository folder. Within minutes the working tree held uncommitted edits that nobody could attribute: a shared checkout means that when one session runs git checkout -b, the branch switches for all three, and their edits and commits interleave. One session went further and ran git stash, sweeping everything into it — including my own files that had nothing to do with the experiment.
“Parallel AI agents in one folder aren't a parallel experiment. They're one big merge conflict with three authors.” That is the lesson that cost me a restart.
The results of two sessions had to be written off: even if they had finished, I couldn't have honestly said whose edits I was scoring. The second run — this time in isolated worktrees — hit my plan's usage limits: two agents died at the start, and a third hung while still looking "busy" in the UI. The diagnosis came from the file system, not the status indicator: in twenty minutes, not a single file had appeared in its working copy. That's the second transferable lesson: an agent's liveness is verified by changes on disk. After the restart, all three models completed the full cycle under equal conditions.
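The disk-based liveness check can be approximated with file modification times. This is a rough sketch under stated assumptions: the twenty-minute threshold mirrors the anecdote above rather than any tuned value, the function names are illustrative, and mtime is only a heuristic (an agent that is thinking for a long time, or only deleting files, would look idle).

```python
import time
from pathlib import Path
from typing import Optional


def newest_mtime(worktree: Path) -> Optional[float]:
    """Newest modification time among regular files, ignoring Git metadata."""
    newest = None
    for path in worktree.rglob("*"):
        # In a linked worktree `.git` is a file, in the main checkout a directory; skip both.
        if ".git" in path.relative_to(worktree).parts or not path.is_file():
            continue
        mtime = path.stat().st_mtime
        if newest is None or mtime > newest:
            newest = mtime
    return newest


def looks_stalled(worktree: Path, started_at: float,
                  max_idle_s: float = 20 * 60, now: Optional[float] = None) -> bool:
    """True if nothing in the worktree changed for longer than max_idle_s."""
    now = time.time() if now is None else now
    newest = newest_mtime(worktree)
    # Files checked out before the run started do not count as activity.
    last_activity = started_at if newest is None else max(started_at, newest)
    return now - last_activity > max_idle_s
```

Calling `looks_stalled(path, started_at)` for each worktree every few minutes gives a signal that does not depend on what the UI claims.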
The results: where the models are alike and where they differ
The most important result is the one that didn't happen: not a single hallucination turned up in the findings I checked. I manually verified at least three findings from each model, and every quote and line number was accurate to the character. For audit-type tasks this is the threshold of trust, and all three cleared it.
Past that, the differences begin:
| Metric | Fable 5 | Opus 4.8 | Sonnet 4.6 |
|---|---|---|---|
| Confirmed findings | 11 | 5 | 9 |
| Non-trivial unique findings | 3 | 1 | 2 |
| Fix self-verification | integration render against a throwaway DB, xmllint, a failing test correctly attributed as pre-existing | linter + render; caught its own bug | linter and route list only |
| Tokens / time | 164k / 21 min | 101k / 12 min | 131k / 12 min |
| Rubric total (out of 30) | 30 | 24 | 25.5 |
The qualitative differences say more than the scores. Opus 4.8 was the only one that missed the broken Content-Type header in the sitemap (the other two found it) — and left it in place in its own fix; it also let an English word leak into its Russian write-up («и even путь к картинке»). Sonnet 4.6 shipped a correct fix, but its self-verification was declarative: a linter run instead of actually exercising the code. Fable 5 stood out precisely on self-verification: it discovered that php on my machine is a wrapper around a Docker container mounted to the main checkout, rebuilt all its checks through throwaway containers, and ran its fix end-to-end against a temporary database. The gap between "I did it" and "I proved I did it" is the single most differentiating axis for agentic work.
One more useful signal is convergence: all three models independently named the same sitemap problem as their top finding. When independent auditors agree on the top finding, you get a cheap analogue of inter-rater reliability: the problem is almost certainly real.
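A simple agreement share is enough to make the convergence signal concrete. The sketch below is a crude stand-in, not a proper inter-rater statistic such as Cohen's kappa; the helper name and the `"sitemap"` label are illustrative, and it assumes the top findings have already been normalized to a common label by hand.

```python
from collections import Counter
from typing import Dict, Tuple


def top_finding_agreement(top_findings: Dict[str, str]) -> Tuple[str, float]:
    """Return the most common top finding and the share of auditors that named it."""
    if not top_findings:
        raise ValueError("need at least one auditor")
    counts = Counter(top_findings.values())
    label, votes = counts.most_common(1)[0]
    return label, votes / len(top_findings)


if __name__ == "__main__":
    # All three models named the sitemap problem as their top finding.
    tops = {"Fable 5": "sitemap", "Opus 4.8": "sitemap", "Sonnet 4.6": "sitemap"}
    print(top_finding_agreement(tops))
```

With only three auditors the number is coarse, so it is best read as a triage hint for what to verify first.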
A judge drawn from the contestants
The rubric scoring was done by Fable 5 — the same model that took part in the comparison. That's a methodological hole, and it deserves to be named: LLM judges systematically overrate their own text, and the better a model recognizes its own writing, the stronger the bias — as shown by Panickssery et al. at NeurIPS 2024. There are two mitigations, and I used both. First, push the rubric away from taste-based axes toward checkable facts — line-number accuracy, fix completeness, and a leaked foreign word don't care who the judge roots for. Second, declare the conflict openly and hand the contested axes to a different model or a human for cross-checking. The subjective axes of my scoring — writing quality, reasoning quality — should be read with exactly that caveat.
The honest limitations
One run per model is n=1, so a one-or-two-point difference means nothing; only the qualitative gaps deserve trust: a missed bug, the depth of self-verification. The experiment measures three models on one task in one domain; "Fable 5 is better, period" does not follow. And keep the price in mind: the top model burned about 1.6× the tokens (164k vs. 101k) and 1.75× the wall-clock time (21 vs. 12 minutes) of the most frugal contestant, and its tokens cost more to begin with. For routine work, that's a real argument for the cheaper tiers.
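The cost ratios can be recomputed from the token and time figures in the results table. This small sketch takes the model with the fewest tokens as the baseline; the function name is illustrative, and it compares raw token counts only, since per-token prices are not given here.

```python
from typing import Dict, Tuple

# (tokens, minutes) per model, copied from the results table.
RUNS: Dict[str, Tuple[int, int]] = {
    "Fable 5": (164_000, 21),
    "Opus 4.8": (101_000, 12),
    "Sonnet 4.6": (131_000, 12),
}


def relative_to_cheapest(runs: Dict[str, Tuple[int, int]]) -> Dict[str, Tuple[float, float]]:
    """Express each run's tokens and minutes as multiples of the most token-frugal run."""
    baseline = min(runs, key=lambda name: runs[name][0])
    base_tokens, base_minutes = runs[baseline]
    return {
        name: (round(tokens / base_tokens, 2), round(minutes / base_minutes, 2))
        for name, (tokens, minutes) in runs.items()
    }


if __name__ == "__main__":
    for name, (token_ratio, time_ratio) in relative_to_cheapest(RUNS).items():
        print(f"{name}: {token_ratio}x tokens, {time_ratio}x time")
```

Multiplying the token ratio by the price ratio between tiers would give the actual cost multiple, which is the number that matters for routine work.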
What I'm keeping
The method boils down to a short checklist that works for any project and any set of models:
- The test task is a full cycle of real work on your own code, not a synthetic puzzle.
- One prompt for everyone; isolation via separate worktrees; results in separate branches.
- The prompt prices hallucinations explicitly: a fabricated finding is worse than no finding.
- Hints available to everyone (docs, notes in the repo) are excluded from depth scoring.
- The rubric is fixed before launch and leans on verifiable facts, not impressions.
- Hand-check a sample of findings; independent models converging on the same top finding signals it's real.
- A judge who is also a contestant is a declared conflict: send the contested axes out for cross-evaluation.
- Verify an agent's liveness by the disk, not by the status indicator.
There was a bonus I hadn't planned for: the experiment paid for itself. Three independent audits surfaced real problems in the blog — from a sitemap with no English pages to dead queries in the controller — and two ready fixes are now sitting in branches, waiting to be shipped. Comparing models on your own code, unlike reading benchmark charts, leaves you with more than an opinion.
Sources and reference points
- Benchmark Data Contamination of Large Language Models: A Survey (Xu et al., 2024) — a systematic map of the benchmark-contamination problem; the core argument for testing on your own code.
- Benchmarking LLMs Under Data Contamination: From Static to Dynamic Evaluation (EMNLP 2025) — how the industry is moving from static to dynamic evaluation; this experiment is a small instance of a dynamic test.
- Demystifying evals for AI agents (Anthropic Engineering, 2026) — practical agent evaluation: real product tasks, verifiable rubrics, trajectory vs. outcome.
- LLM Evaluators Recognize and Favor Their Own Generations (NeurIPS 2024) — why a model judging itself inflates the score, and what to do about it.
- Run parallel sessions with worktrees (Claude Code Docs) — the standard isolation mechanism for parallel agent sessions; its absence is exactly what wrecked the first run.