Using a spec as a benchmark: What 22 model runs taught me about agents, prompts, and shortcuts

Simple prompts are the standard currency of benchmarking: they are easy to reproduce and easy to score. I wanted to try something different: what if the benchmark were a proper spec?

A while back I wrote about building a retrospective board in under an hour using spec-driven development. This time I took the same OpenSpec and ran it 22 times across different models, agents, and configurations to see how the implementations varied and why. The spec covered a collaborative retrospective board with a React frontend, Node.js backend, WebSockets, SQLite, Docker, drag-and-drop, nested comments, and CSV export.

However, I deliberately left out any visual design guidelines and verification instructions. I wanted to see what choices different models and agents would make on their own. Some results were expected. Others were not.

The full results are documented at Realtime Retro Board — Model Comparison Report.

Code: achintmehta/retrospective-board-eval.

Observation 1: Same model, different agent, completely different UI

I ran Claude Opus 4.6 through both Claude Code and Antigravity. Same model, same spec, same effort level.

Antigravity produced a noticeably more polished result, with a dark theme, colour-coded columns, avatar chips, and a branded navbar. The Claude Code result was clean and functional but plain.

Antigravity with Opus 4.6:

Antigravity Opus 4.6 — dashboard

Antigravity Opus 4.6 — board view

Same Opus 4.6 model, Claude Code, no system prompt customization:

Claude Opus 4.6 plain — dashboard

Claude Opus 4.6 plain — board view

Antigravity doesn't call Anthropic's APIs directly. Instead, it talks to a Vertex AI backend, which in turn calls Anthropic's APIs. That backend injects its own system prompt regardless of whether you provide one. The prompt contains explicit visual design instructions: rich aesthetics, dark modes, modern typography, micro-animations. It tells the model that a plain-looking result is unacceptable.

Once I applied the Antigravity system prompt to my Claude Code runs, the visual quality matched immediately. Sonnet 4.6 with that prompt scored 5/5 on aesthetics:

Sonnet 4.6 with Antigravity prompt — dashboard

Sonnet 4.6 with Antigravity prompt — board view

Agents are not neutral wrappers. The scaffolding around a model shapes its output just as much as the model itself.
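To make the mechanism concrete, here is a minimal sketch of an agent layer that always prepends its own preamble to the system prompt, whether or not the user supplies one. Everything in it is an assumption for illustration: `ModelRequest`, `buildRequest`, and `AGENT_PREAMBLE` are hypothetical names, and the preamble text is a stand-in, not the actual Antigravity prompt.

```typescript
// Hypothetical request shape; not taken from any real SDK.
interface ModelRequest {
  model: string;
  system: string;
  messages: { role: "user" | "assistant"; content: string }[];
}

// Placeholder wording only. The real injected prompt is not reproduced here.
const AGENT_PREAMBLE =
  "Prefer rich aesthetics: dark mode, modern typography, micro-animations.";

function buildRequest(
  model: string,
  spec: string,
  userSystemPrompt?: string,
  agentPreamble: string = AGENT_PREAMBLE,
): ModelRequest {
  // The agent's preamble comes first; the user's prompt is optional.
  // Empty or whitespace-only parts are dropped.
  const system = [agentPreamble, userSystemPrompt]
    .filter((part): part is string => Boolean(part && part.trim()))
    .join("\n\n");
  return { model, system, messages: [{ role: "user", content: spec }] };
}

// Same model, same spec, two different "agents".
const plain = buildRequest("example-model", "Build a retro board.", undefined, "");
const styled = buildRequest("example-model", "Build a retro board.");

console.log(JSON.stringify(plain.system));  // no design guidance reaches the model
console.log(JSON.stringify(styled.system)); // the preamble reaches the model
```

The user never typed a system prompt in either case, yet the two requests the model sees are different. That is the sense in which the wrapper is not neutral.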

Observation 2: More tools does not mean better output

This was the result I found most surprising.

I ran Opus 4.6 at extra-high (xHigh) effort twice, once with Playwright enabled and once without. With Playwright: 40/45. Without: 42/45. The no-Playwright version was better architected, had proper error handling, and worked correctly. The Playwright version crashed on drag-and-drop and had no error boundaries anywhere.

Opus 4.6 with Playwright at xHigh effort produced a completely unstyled UI with raw browser defaults:

Opus 4.6 with Playwright xhigh — dashboard

Opus 4.6 with Playwright xhigh — board view

I asked Claude why. The explanation, captured in Implementation_difference_rationale.txt:

When Playwright is configured, the model can verify the UI in a browser during implementation. This creates a subtle but important behavioral shift: less upfront investment in code quality — since the model can "see" the result, it tends to take shortcuts and rely on visual verification to confirm things work. The approach becomes: "write something quick, check it in the browser, move on."

The full architecture comparison makes the divergence concrete:

| Aspect | With Playwright | Without Playwright |
| Frontend files | 2 page components, everything inlined | 2 pages + 4 dedicated components |
| CSS | ~24 lines of inline styles | 400+ lines across 9 component CSS files |
| Error handling | No try-catch, no response.ok checks | Proper error states throughout |
| Drag-and-drop | Simple state update, no position sort | Defensive guards, filter + sort by position |
| Result | Blank page crash on drag | Works correctly |
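The drag-and-drop difference is easier to see in code. Below is a minimal sketch of what "defensive guards, filter + sort by position" could look like as a pure function. It is not code from either run: the `Card` shape and the `moveCard` name are assumptions made for illustration.

```typescript
// Hypothetical card shape; the real project's data model may differ.
interface Card {
  id: string;
  columnId: string;
  position: number;
}

// Move one card to a target column and index, returning a new array.
function moveCard(cards: Card[], cardId: string, toColumnId: string, toIndex: number): Card[] {
  const moving = cards.find((c) => c.id === cardId);
  // Defensive guard: an unknown id or a non-integer index is treated as a no-op.
  if (!moving || !Number.isInteger(toIndex)) return cards;

  // Filter to the target column (without the moving card), then sort by position.
  const target = cards
    .filter((c) => c.columnId === toColumnId && c.id !== cardId)
    .sort((a, b) => a.position - b.position);

  // Clamp the drop index so an out-of-range value cannot corrupt the order.
  const index = Math.max(0, Math.min(toIndex, target.length));
  target.splice(index, 0, moving);

  // Renumber the target column; cards in other columns are left untouched.
  const positions = new Map<string, number>();
  target.forEach((c, i) => positions.set(c.id, i));
  return cards.map((c) => {
    const position = positions.get(c.id);
    return position === undefined ? c : { ...c, columnId: toColumnId, position };
  });
}

const board: Card[] = [
  { id: "a", columnId: "went-well", position: 0 },
  { id: "b", columnId: "went-well", position: 1 },
  { id: "c", columnId: "to-improve", position: 0 },
];

console.log(moveCard(board, "a", "to-improve", 1));
console.log(moveCard(board, "missing", "to-improve", 0) === board); // true
```

A "simple state update" version would skip the guard, the clamp, and the sort. It works on the happy path and fails on the first stale or malformed drag event, which is consistent with the blank-page crash in the table.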

The tool meant to improve quality assurance reduced code quality by making the model less careful upfront. Without a browser to lean on, the model compensated with better architecture and defensive error handling. Architecture Without Architects (2025) found exactly this pattern: agents actively make architectural choices in ways that are not always visible to the developer, and more feedback mechanisms do not automatically mean better design decisions.
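For readers unfamiliar with the `response.ok` point in the comparison: `fetch` only rejects on network failure, so an HTTP 500 resolves normally and has to be checked explicitly. The sketch below shows the pattern under stated assumptions. `fetchJson`, `Result`, `MinimalResponse`, and `Fetcher` are hypothetical names, and the fetcher is injected so the example runs without a server.

```typescript
// A result type that forces callers to handle the failure branch.
type Result<T> = { ok: true; data: T } | { ok: false; error: string };

// The subset of the Fetch API Response that this sketch relies on.
interface MinimalResponse {
  ok: boolean;
  status: number;
  json(): Promise<unknown>;
}
type Fetcher = (url: string) => Promise<MinimalResponse>;

async function fetchJson<T>(url: string, fetcher: Fetcher): Promise<Result<T>> {
  try {
    const response = await fetcher(url);
    // HTTP 4xx/5xx responses do not throw, so check the status explicitly.
    if (!response.ok) {
      return { ok: false, error: `HTTP ${response.status}` };
    }
    // The cast is unchecked; a real app would validate the payload shape.
    return { ok: true, data: (await response.json()) as T };
  } catch (err) {
    // Network failures and JSON parse errors end up here.
    return { ok: false, error: err instanceof Error ? err.message : String(err) };
  }
}

// Stub fetcher simulating a server error (the URL is illustrative).
const failing: Fetcher = async () => ({ ok: false, status: 500, json: async () => ({}) });

fetchJson<string[]>("/api/boards", failing).then((result) => {
  console.log(result.ok ? result.data : `Could not load boards: ${result.error}`);
});
```

In a browser or a recent Node.js version, the real `fetch` can be passed as `(url) => fetch(url)`. The component then renders an error state from the `ok: false` branch instead of crashing on missing data.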

The practical takeaway: put the architecture in the spec. Do not delegate design to the model and assume it will tool its way to a good answer. Models take the path the environment makes easiest, and that path is not always the right one.

What the experiment showed overall

Opus 4.7 scored 45/45 across all five configurations tested. Sonnet 4.6 with Playwright at xHigh also hit 45/45, at $2.57. Qwen Coder Next scored 16/45 at $178 in Claude orchestration overhead, which says something blunt about the cost-quality relationship in agentic coding.
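One rough way to put those two runs on the same scale is score per dollar. The sketch below uses only the figures quoted above. The `Run` type and `pointsPerDollar` helper are hypothetical, and treating the Qwen figure as directly comparable is an assumption, since it is described as orchestration overhead rather than a total.

```typescript
interface Run {
  label: string;
  score: number;   // out of 45
  costUsd: number; // as reported above
}

function pointsPerDollar(run: Run): number {
  if (run.costUsd <= 0) throw new Error(`No usable cost for ${run.label}`);
  return run.score / run.costUsd;
}

const sonnet: Run = { label: "Sonnet 4.6 + Playwright (xHigh)", score: 45, costUsd: 2.57 };
const qwen: Run = { label: "Qwen Coder Next", score: 16, costUsd: 178 };

for (const run of [sonnet, qwen]) {
  console.log(`${run.label}: ${pointsPerDollar(run).toFixed(2)} points per dollar`);
}
```

That works out to roughly 17.5 points per dollar for the Sonnet run against roughly 0.09 for the Qwen run, a gap of about two orders of magnitude.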

The more durable findings are the three above. Agents are not neutral. Model families use tools differently. More capability available to the model does not automatically produce better software. The spec is the thing that most determines the output, and an underspecified one gets filled by defaults that vary enormously and are often invisible. The more deliberate you are before you start, the less of that gap there is to fill.