How to Use

Run the benchmark locally or via GitHub Actions.

Quick Start

# Clone and set up
git clone https://github.com/intent-bench/intent-bench.git
cd intent-bench
make setup

# Validate an experiment
bash bench.sh validate url-shortener

# Run a single experiment (control condition, 1 run)
bash bench.sh run url-shortener --condition control --runs 1

# Run with treatment
bash bench.sh run url-shortener --condition treatment --runs 1

# Analyze results
make analyze

Multi-Model Runs

Use the --agent and --model flags to benchmark different models:

# Claude Code (default)
bash bench.sh run url-shortener --condition control \
    --agent claude-code --model claude-sonnet-4-20250514

# OpenAI via aider
bash bench.sh run url-shortener --condition control \
    --agent aider --model openai/gpt-4o

# Local Ollama model via aider
bash bench.sh run url-shortener --condition control \
    --agent aider --model ollama/llama3

GitHub Actions

The benchmark can also be triggered via workflow_dispatch in the GitHub Actions UI. Select the experiment, condition, agent, model, and number of runs. Results are submitted as a pull request automatically.

Available Experiments

Experiment     Tier      Language  Budget
url-shortener  Baseline  Any       $5
task-manager   Standard  Any       $5
rest-api       Standard  Python    $10
cli-tool       Standard  Python    $10
brownfield     Advanced  Go        $15
rtmx-self      Advanced  Go        $15
go-toml        Advanced  Go        $15
hyperfine      Advanced  Rust      $15
csvkit         Advanced  Python    $10
go-yaml        Advanced  Go        $15
structlog      Advanced  Python    $10

Contributing Results

Submit Your Runs

Run each condition at least five times (N=5) so the analysis has enough samples to compare conditions. Commit results/summary.csv and results/analysis.json to a branch and open a PR.

See REPRODUCING.md for full instructions.
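
As a rough sketch of the submission workflow above, the helper below runs both conditions of one experiment with the same run count. It uses only the bench.sh flags shown in Quick Start; the function name run_all_conditions and the DRY_RUN switch are illustrative assumptions, not part of the benchmark.

```shell
#!/usr/bin/env bash
# Sketch: run both conditions of one experiment with the same number of runs.
# run_all_conditions and DRY_RUN are hypothetical names, not part of bench.sh.
# Set DRY_RUN=1 to print the commands instead of executing them.
run_all_conditions() {
    local experiment="$1"
    local runs="${2:-5}"   # defaults to the suggested minimum of 5 runs
    local condition
    for condition in control treatment; do
        ${DRY_RUN:+echo} bash bench.sh run "$experiment" \
            --condition "$condition" --runs "$runs" || return 1
    done
}
```

Calling DRY_RUN=1 run_all_conditions url-shortener prints the two commands without running them. Without DRY_RUN, call it from the repository root, then run make analyze before committing the result files.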

Add a New Experiment

Create experiments/<name>.yaml, prompts/<name>.md, and fixtures/rtmx/<name>/rtm.csv. Run make gen-fixtures to generate the manual-spec treatment. Validate with bash bench.sh validate <name>.
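
Before validating, it can help to confirm that the three hand-written files are in place. The check below is a minimal sketch that uses only the paths named above; check_experiment_files is a hypothetical helper, not something bench.sh provides, and it does not inspect file contents.

```shell
#!/usr/bin/env bash
# Sketch: confirm the three hand-written files for a new experiment exist.
# check_experiment_files is a hypothetical helper, not part of bench.sh.
check_experiment_files() {
    local name="$1"
    local missing=0
    local path
    for path in \
        "experiments/${name}.yaml" \
        "prompts/${name}.md" \
        "fixtures/rtmx/${name}/rtm.csv"; do
        if [ ! -f "$path" ]; then
            echo "missing: $path" >&2
            missing=1
        fi
    done
    # Returns 0 when every file exists, 1 otherwise.
    return "$missing"
}
```

Run from the repository root, check_experiment_files <name> && bash bench.sh validate <name> only proceeds to validation when all three files are present.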

Add a New Agent

Create agents/<name>.sh conforming to the agent interface. The script takes five positional arguments: <workdir> <model> <prompt_file> <result_dir> <max_budget>. It must write transcript.jsonl and stderr.log to the result directory.
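
The interface above could be sketched as the skeleton below. It only shows argument handling and the two required output files. The name my-agent.sh, the run_agent function, and the JSON record written to the transcript are placeholders chosen for illustration; the actual transcript schema and the real agent invocation are not specified here.

```shell
#!/usr/bin/env bash
# agents/my-agent.sh: hypothetical agent wrapper (sketch only).
run_agent() {
    if [ "$#" -ne 5 ]; then
        echo "usage: <workdir> <model> <prompt_file> <result_dir> <max_budget>" >&2
        return 2
    fi
    local workdir="$1" model="$2" prompt_file="$3" result_dir="$4" max_budget="$5"

    mkdir -p "$result_dir"
    : > "$result_dir/stderr.log"   # create the log even if nothing is written to it

    if [ ! -r "$prompt_file" ]; then
        echo "cannot read prompt file: $prompt_file" >> "$result_dir/stderr.log"
        return 1
    fi

    # Placeholder: a real agent would invoke its CLI here with $workdir as the
    # working directory, pass it the prompt, and stream its events to the
    # transcript. The record below is an assumed shape, not a documented schema.
    printf '{"event":"placeholder","model":"%s","max_budget":"%s"}\n' \
        "$model" "$max_budget" > "$result_dir/transcript.jsonl"
}

# Run only when arguments are supplied, so the file can also be sourced.
if [ "$#" -gt 0 ]; then
    run_agent "$@"
fi
```

Wrapping the logic in a function keeps the argument check in one place and lets the script be sourced and exercised against a temporary directory before it is wired into bench.sh with --agent.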