intent-bench is an A/B benchmark measuring whether structured intent improves coding agent performance.
The working hypothesis is that coding agents given structured requirements (dependency ordering, acceptance criteria, and decomposed tasks) achieve higher completion rates and use fewer tokens than agents given only a natural-language prompt.
Every experiment runs in two conditions with identical prompts:
    bench.sh run <experiment> --condition <control|treatment>
      |
      +-- setup_workdir()
      |     +-- clone repo (if brownfield)
      |     +-- run setup_command
      |     +-- [treatment only] install intent layer via treatment plugin
      |
      +-- execute_run()
      |     +-- delegate to agents/<agent>.sh
      |     +-- capture transcript, stderr, wall clock
      |     +-- run test_command to determine outcome
      |     +-- parse transcript for token accounting
      |     +-- compute knowledge entropy score
      |     +-- append row to results/summary.csv
      |
      +-- repeat for N runs per condition
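The run step can be pictured roughly as follows. This is a minimal Python sketch of the flow, not the actual `bench.sh` code: the function signatures, the dictionary keys, and the CSV column names are all assumptions made for illustration, and only the "run agent, time it, run tests, append a row" sequence comes from the outline above.

```python
import csv
import subprocess
import time
from pathlib import Path


def execute_run(agent_cmd, test_cmd, workdir):
    """Run the agent, then the test command, and report what happened.

    Hypothetical stand-in for execute_run(); the real script delegates
    to agents/<agent>.sh and also does token accounting.
    """
    start = time.monotonic()
    agent = subprocess.run(agent_cmd, cwd=workdir, capture_output=True, text=True)
    wall_clock_s = time.monotonic() - start

    # The outcome is decided by the test command's exit status (0 = pass).
    tests = subprocess.run(test_cmd, cwd=workdir, capture_output=True, text=True)
    return {
        "passed": tests.returncode == 0,
        "wall_clock_s": round(wall_clock_s, 3),
        "transcript": agent.stdout,
        "stderr": agent.stderr,
    }


def append_summary_row(csv_path, row):
    """Append one row to the summary CSV, writing a header for a new file."""
    path = Path(csv_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    write_header = not path.exists()
    with path.open("a", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=list(row))
        if write_header:
            writer.writeheader()
        writer.writerow(row)
```

Keeping the pass/fail decision tied to the test command's exit status means the control and treatment conditions are judged by exactly the same criterion.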
Each run is scored on completion rate (whether the tests pass), total tokens consumed, wall-clock time, and the ratio of planning to execution tokens.
Additional diagnostics are knowledge entropy (how much the agent thrashed), backtrack count, intent tool utilization, and the coefficient of variation across runs.
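Three of these metrics have simple closed forms. The sketch below shows one plausible way to compute them, assuming per-run token counts are already available. The function names are hypothetical, and the use of the sample standard deviation for the coefficient of variation is an assumption. Knowledge entropy is left out because its definition is not given here.

```python
from statistics import mean, stdev


def completion_rate(outcomes):
    """Fraction of runs whose tests passed (outcomes is a list of booleans)."""
    return sum(outcomes) / len(outcomes)


def planning_ratio(planning_tokens, execution_tokens):
    """Planning tokens divided by execution tokens for a single run."""
    if execution_tokens == 0:
        raise ValueError("execution_tokens must be non-zero")
    return planning_tokens / execution_tokens


def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean (assumed definition)."""
    m = mean(values)
    if m == 0:
        raise ValueError("mean is zero; coefficient of variation is undefined")
    return stdev(values) / m
```

For example, token totals of 90, 100, and 110 across three runs give a mean of 100 and a sample standard deviation of 10, so the coefficient of variation is 0.1.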
Differences in completion rate are tested with Fisher's exact test, and differences in token efficiency with the Mann-Whitney U test. Results are published only when each condition has N ≥ 5 runs.
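To make the two tests concrete, here is a standard-library sketch of both, sized for the small samples the N ≥ 5 rule implies. It is not the benchmark's own analysis code. The function names are hypothetical, the Mann-Whitney p-value is computed exactly by enumerating every split of the pooled data (an assumption that is only practical for small N), and both p-values are two-sided.

```python
from itertools import combinations
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher p-value for the 2x2 table [[a, b], [c, d]].

    Rows are conditions; columns are pass/fail counts.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x passes in row 1, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum every table that is at least as unlikely as the observed one.
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))


def mann_whitney_u(xs, ys):
    """U statistic for xs: pairs with x > y count 1, ties count 0.5."""
    return sum(1.0 if x > y else 0.5 if x == y else 0.0 for x in xs for y in ys)


def mann_whitney_exact_p(xs, ys):
    """Exact two-sided p-value by enumerating all relabelings of the pool."""
    pooled = list(xs) + list(ys)
    n1, n2 = len(xs), len(ys)
    mean_u = n1 * n2 / 2
    observed = abs(mann_whitney_u(xs, ys) - mean_u)
    extreme = total = 0
    for idx in combinations(range(len(pooled)), n1):
        chosen = set(idx)
        g1 = [pooled[i] for i in chosen]
        g2 = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        total += 1
        if abs(mann_whitney_u(g1, g2) - mean_u) >= observed - 1e-12:
            extreme += 1
    return extreme / total
```

With five runs per condition, the most extreme outcome possible (all five treatment runs pass and all five control runs fail, or the two sets of token counts do not overlap at all) gives p = 2/252 ≈ 0.008 under both tests. That is the smallest p-value either test can report at N = 5.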
The benchmark is tool-agnostic. Three extension points allow community contributions: