How to Use

Run the benchmark locally or via GitHub Actions.

Quick Start

# Clone and set up
git clone https://github.com/intent-bench/intent-bench.git
cd intent-bench
make setup

# Validate an experiment
bash bench.sh validate url-shortener

# Run a single experiment (control condition, 1 run)
bash bench.sh run url-shortener --condition control --runs 1

# Run with treatment
bash bench.sh run url-shortener --condition treatment --runs 1

# Analyze results
make analyze

Multi-Model Runs

Use the --agent and --model flags to benchmark different models:

# Claude Code (default)
bash bench.sh run url-shortener --condition control \
    --agent claude-code --model claude-sonnet-4-20250514

# OpenAI via aider
bash bench.sh run url-shortener --condition control \
    --agent aider --model openai/gpt-4o

# Local Ollama model via aider
bash bench.sh run url-shortener --condition control \
    --agent aider --model ollama/llama3

GitHub Actions

The benchmark can also be triggered via workflow_dispatch in the GitHub Actions UI. Select the experiment, condition, agent, model, and number of runs. Results are submitted as a pull request automatically.

Available Experiments

Experiment	Tier	Language	Budget
url-shortener	Baseline	Any	$5
task-manager	Standard	Any	$5
rest-api	Standard	Python	$10
cli-tool	Standard	Python	$10
brownfield	Advanced	Go	$15
rtmx-self	Advanced	Go	$15
go-toml	Advanced	Go	$15
hyperfine	Advanced	Rust	$15
csvkit	Advanced	Python	$10
go-yaml	Advanced	Go	$15
structlog	Advanced	Python	$10

Contributing Results

Submit Your Runs

Run at least N=5 per condition for statistical significance. Commit results/summary.csv and results/analysis.json to a branch and open a PR.

See REPRODUCING.md for full instructions.

Add a New Experiment

Create experiments/<name>.yaml, prompts/<name>.md, and fixtures/rtmx/<name>/rtm.csv. Run make gen-fixtures to generate the manual-spec treatment. Validate with bash bench.sh validate <name>.

Add a New Agent

Create agents/<name>.sh conforming to the agent interface: <workdir> <model> <prompt_file> <result_dir> <max_budget>. Must produce transcript.jsonl and stderr.log in the result directory.