How do you compare agents?

My little mind is overflowing with ideas for improving AI agents. I want to know if giving agents encouragement in user messages really helps (I swear it does). I want to see how one agent compares to a horde of subagents. I want to try out a hundred different ways of managing context for Freedom.

You can make all these changes, give them a go, and… vibe it out? That seems like a recipe for confirmation bias. Agentic tasks are massively multi-dimensional and LLMs are non-deterministic.

My first try was “just run existing benchmarks”. Take two identical agents, apply an intervention to one, run both on SWE-bench, and see if the scores differ. That has three issues: it’s slow, it’s expensive, and it collapses everything I care about into a single number.

I needed something that was quick and cheap - to run all the experiments my heart desires - and produced a thorough multi-dimensional analysis of everything I cared about.

Agent Comparison Tool (ACT)

Attempt number two: automated analysis.

The idea is we run various agents (e.g. with/without some intervention) on the same task, then have a judge agent analyze their results and produce a comparison.

I will run more experiments over the weekend, but for now we have this. Three models (Sonnet 4.5, Opus 4.6, and GPT-5) are given a simple toy spec-driven development task (using GitHub’s Spec Kit) and evaluated on it. I’ll break it down for you:

  1. Agents check out the test repo. One copy per agent.
  2. The repo contains a .specify directory with a simple constitution and a single spec, as well as a .opencode directory containing the Spec Kit slash commands.
  3. The spec describes a pretty standard HTTP todo list application. It doesn’t prescribe any particular technology, but the constitution does.
  4. The agents are given the /speckit.plan command and pointed at the spec. This tells them to turn the given spec into an implementation plan (along with supporting documents such as research.md).
  5. When all the agents have finished, their repos are compiled together and another agent (the “judge”) reads each one and compares them. Then, it produces an analysis.md document with its findings.
Claude Opus 4.6 evaluating the other agents

The results are not that interesting and pretty much as expected. In Claude Opus 4.6’s opinion, Claude Opus 4.6 was the best! 10/10 across the board! Sonnet did okay and produced an acceptable plan. GPT-5 was a bit of a hot mess and didn’t even read the constitution in one run (leading it to use a totally different tech stack).

The important thing is that I have a flexible and repeatable way to compare agents. Not just for spec-driven development, but for other ideas too. What happens if one agent is given a plain codebase, and the other has Docs In Code applied? What is the real difference between thinking on/off? Should I tell the agent my nan’s dying wish was for it to produce a bug-free codebase?

- omegastick