How It Works

intent-bench is an A/B benchmark measuring whether structured intent improves coding agent performance.

The Hypothesis

When coding agents receive structured requirements with dependency ordering, acceptance criteria, and decomposed tasks, they achieve higher completion rates and use fewer tokens than when given only a natural-language prompt.

A/B Design

Every experiment runs under two conditions that share an identical task prompt:

Control (C) -- The agent receives only the task prompt. No structured requirements, no dependency graph, no tools beyond its built-in capabilities.
Treatment (T) -- The agent receives the same prompt plus a structured intent layer: a requirements traceability matrix (RTM) with dependency ordering, acceptance criteria, and per-requirement specification files.

Execution Flow

bench.sh run <experiment> --condition <control|treatment>
  |
  +-- setup_workdir()
  |     +-- clone repo (if brownfield)
  |     +-- run setup_command
  |     +-- [treatment only] install intent layer via treatment plugin
  |
  +-- execute_run()
  |     +-- delegate to agents/<agent>.sh
  |     +-- capture transcript, stderr, wall clock
  |     +-- run test_command to determine outcome
  |     +-- parse transcript for token accounting
  |     +-- compute knowledge entropy score
  |     +-- append row to results/summary.csv
  |
  +-- repeat for N runs per condition
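
The repeat-and-record loop at the bottom of this flow can be sketched in Python. This is an illustration only, not the actual bench.sh implementation: the column names in FIELDS, the run_condition function, and the execute callable are all hypothetical stand-ins for the real agent delegation and test_command steps.

```python
import csv
import time
from pathlib import Path

# Hypothetical column set; the real results/summary.csv may differ.
FIELDS = ["experiment", "condition", "run", "passed", "wall_clock_s"]

def run_condition(experiment, condition, n_runs, execute, summary_path):
    """Run `execute` n_runs times and append one CSV row per run.

    `execute` is a caller-supplied callable standing in for the agent run
    plus test_command; it returns True when the tests pass.
    """
    path = Path(summary_path)
    write_header = not path.exists()
    with path.open("a", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=FIELDS)
        if write_header:
            writer.writeheader()
        for run in range(1, n_runs + 1):
            start = time.monotonic()
            passed = bool(execute(condition))
            writer.writerow({
                "experiment": experiment,
                "condition": condition,
                "run": run,
                "passed": passed,
                "wall_clock_s": round(time.monotonic() - start, 3),
            })
```

Appending rather than overwriting lets control and treatment runs accumulate in the same file, which matches the "append row" step in the flow above.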

What We Measure

Primary Metrics

Completion rate (tests passing), total tokens consumed, wall-clock time, and the ratio of planning tokens to execution tokens.
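
As a rough sketch of how two of these metrics could be derived from per-run records, assume each run is reduced to a pass/fail flag and two token counts. The Run record and its field names are hypothetical, and treating completion as a per-run boolean is an assumption made for illustration.

```python
from dataclasses import dataclass

@dataclass
class Run:
    # Hypothetical per-run record; field names are illustrative only.
    passed: bool
    planning_tokens: int
    execution_tokens: int
    wall_clock_s: float

def completion_rate(runs):
    """Fraction of runs whose tests passed."""
    if not runs:
        raise ValueError("need at least one run")
    return sum(r.passed for r in runs) / len(runs)

def planning_ratio(run):
    """Planning tokens divided by execution tokens for a single run."""
    if run.execution_tokens == 0:
        raise ValueError("execution_tokens must be non-zero")
    return run.planning_tokens / run.execution_tokens
```

For example, a run with 200 planning tokens and 800 execution tokens has a planning ratio of 0.25.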

Secondary Metrics

Knowledge entropy (how much thrashing the agent did), backtrack count, intent tool utilization, and coefficient of variation across runs.
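
The coefficient of variation is the standard deviation divided by the mean, so it expresses run-to-run spread as a unitless fraction. A minimal sketch follows, assuming the sample standard deviation is used; the function name is illustrative, and whether the benchmark uses the sample or population form is not stated here.

```python
from statistics import mean, stdev

def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean.

    Requires at least two values and a non-zero mean.
    """
    if len(values) < 2:
        raise ValueError("need at least two values")
    m = mean(values)
    if m == 0:
        raise ValueError("mean must be non-zero")
    return stdev(values) / m
```

Token counts of 90, 100, and 110 across three runs give a coefficient of variation of 0.1, meaning the runs vary by about 10% around their mean.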

Statistical Analysis

Completion rates are compared between conditions with Fisher's exact test, and token efficiency with the Mann-Whitney U test. Results are published only once each condition has N ≥ 5 runs.
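
To make the two tests concrete, here is a standard-library-only sketch. It is not the benchmark's own analysis code: the function names are illustrative, and the Mann-Whitney p-value is computed by exhaustively enumerating group assignments, which is practical only for small samples such as N = 5 per condition.

```python
from itertools import combinations
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]].

    Rows are conditions; columns are pass and fail counts.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x passes in the first row.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum every table that is no more likely than the observed one.
    return sum(prob(x) for x in range(lo, hi + 1)
               if prob(x) <= p_obs * (1 + 1e-7))

def mann_whitney_u(xs, ys):
    """Count of (x, y) pairs with x < y; ties count one half."""
    return sum(1.0 if x < y else 0.5 if x == y else 0.0
               for x in xs for y in ys)

def mann_whitney_exact_p(xs, ys):
    """Two-sided exact p-value by enumerating all group assignments."""
    pooled = list(xs) + list(ys)
    n1 = len(xs)
    center = n1 * len(ys) / 2
    observed = abs(mann_whitney_u(xs, ys) - center)
    indices = range(len(pooled))
    extreme = total = 0
    for chosen in combinations(indices, n1):
        picked = set(chosen)
        group_a = [pooled[i] for i in chosen]
        group_b = [pooled[i] for i in indices if i not in picked]
        total += 1
        if abs(mann_whitney_u(group_a, group_b) - center) >= observed - 1e-12:
            extreme += 1
    return extreme / total
```

With five runs per condition, the most extreme possible outcome (for instance 0 of 5 passes against 5 of 5) gives a two-sided p-value of 2/252, roughly 0.008, under either test. With fewer runs per condition the smallest attainable p-value rises quickly, which is one reason to set a minimum N.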

Pluggable Architecture

The benchmark is tool-agnostic. Three extension points allow community contributions:

1. Treatments -- Different ways to provide structured intent. Current: RTMX (MCP-based RTM), manual-spec (plain markdown), test-first (pre-written test suites).
2. Agents -- Different coding agent implementations. Current: Claude Code, aider (supports OpenAI, Ollama, DeepSeek).
3. Experiments -- Different coding tasks at varying complexity. Current: 11 experiments from greenfield URL shorteners to brownfield bug fixes in real open-source projects.