Run the benchmark locally or via GitHub Actions.
```bash
# Clone and set up
git clone https://github.com/intent-bench/intent-bench.git
cd intent-bench
make setup

# Validate an experiment
bash bench.sh validate url-shortener

# Run a single experiment (control condition, 1 run)
bash bench.sh run url-shortener --condition control --runs 1

# Run with treatment
bash bench.sh run url-shortener --condition treatment --runs 1

# Analyze results
make analyze
```
Use the `--agent` and `--model` flags to benchmark different models:
```bash
# Claude Code (default)
bash bench.sh run url-shortener --condition control \
  --agent claude-code --model claude-sonnet-4-20250514

# OpenAI via aider
bash bench.sh run url-shortener --condition control \
  --agent aider --model openai/gpt-4o

# Local Ollama model via aider
bash bench.sh run url-shortener --condition control \
  --agent aider --model ollama/llama3
```
The benchmark can also be triggered via `workflow_dispatch` in
the GitHub Actions UI. Select the experiment, condition, agent, model,
and number of runs. Results are submitted as a pull request automatically.
| Experiment | Tier | Language | Budget |
|---|---|---|---|
| url-shortener | Baseline | Any | $5 |
| task-manager | Standard | Any | $5 |
| rest-api | Standard | Python | $10 |
| cli-tool | Standard | Python | $10 |
| brownfield | Advanced | Go | $15 |
| rtmx-self | Advanced | Go | $15 |
| go-toml | Advanced | Go | $15 |
| hyperfine | Advanced | Rust | $15 |
| csvkit | Advanced | Python | $10 |
| go-yaml | Advanced | Go | $15 |
| structlog | Advanced | Python | $10 |
Run each condition at least five times (N ≥ 5) so that the comparison between conditions rests on enough samples for a statistical test.
Commit `results/summary.csv` and `results/analysis.json` to a branch and open a PR.
See `REPRODUCING.md` for full instructions.
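As a rough sketch, the N=5 runs can be scripted with a small loop. It assumes the two condition names used in the examples above (`control` and `treatment`) and that `--runs 5` yields five runs per condition. `BENCH_CMD` is a hypothetical override introduced here for dry runs. It is not part of the benchmark.

```bash
#!/usr/bin/env bash
# Sketch: run both conditions of one experiment with the same run count.
# BENCH_CMD is a hypothetical override (not part of the benchmark) so the
# loop can be dry-run, for example with BENCH_CMD=echo.
run_all_conditions() {
  local experiment="${1:-}" runs="${2:-5}" condition
  if [ -z "$experiment" ]; then
    echo "usage: run_all_conditions <experiment> [runs]" >&2
    return 2
  fi
  for condition in control treatment; do
    # BENCH_CMD is left unquoted on purpose: it holds a command plus arguments.
    ${BENCH_CMD:-bash bench.sh} run "$experiment" \
      --condition "$condition" --runs "$runs" || return 1
  done
}
```

For example, `run_all_conditions url-shortener 5` followed by `make analyze` would cover both conditions before the results are committed.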
Create `experiments/<name>.yaml`, `prompts/<name>.md`, and `fixtures/rtmx/<name>/rtm.csv`.
Run `make gen-fixtures` to generate the manual-spec treatment.
Validate with `bash bench.sh validate <name>`.
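The three paths above can be scaffolded with a small helper. This is only a sketch: `scaffold_experiment` is a hypothetical function, and it creates empty placeholder files because the YAML schema, prompt text, and CSV columns are not described here.

```bash
#!/usr/bin/env bash
# Hypothetical helper: create empty placeholder files for a new experiment.
# File contents are left to the author; only the paths come from the text above.
scaffold_experiment() {
  local name="${1:-}" f
  if [ -z "$name" ]; then
    echo "usage: scaffold_experiment <name>" >&2
    return 2
  fi
  mkdir -p experiments prompts "fixtures/rtmx/$name" || return 1
  for f in "experiments/$name.yaml" "prompts/$name.md" "fixtures/rtmx/$name/rtm.csv"; do
    # Create the file only if it does not exist, so existing work is never overwritten.
    [ -e "$f" ] || : > "$f" || return 1
  done
}
```

Run from the repository root, `scaffold_experiment my-experiment` would leave three empty files to fill in before running `make gen-fixtures` and `bash bench.sh validate my-experiment`.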
Create `agents/<name>.sh` conforming to the agent interface: `<workdir> <model> <prompt_file> <result_dir> <max_budget>`.
The script must produce `transcript.jsonl` and `stderr.log` in the result directory.
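A minimal skeleton for such a script might look like the following. It assumes the five values are passed as positional arguments in the order listed. The single JSON line written to `transcript.jsonl` is a placeholder, since the transcript schema is not specified here, and the subshell body stands in for a call to a real agent CLI.

```bash
#!/usr/bin/env bash
# agents/example.sh: hypothetical agent wrapper (sketch only).
# Assumed positional arguments, in the order listed above:
#   <workdir> <model> <prompt_file> <result_dir> <max_budget>
set -u

agent_main() {
  if [ "$#" -ne 5 ]; then
    echo "usage: example.sh <workdir> <model> <prompt_file> <result_dir> <max_budget>" >&2
    return 2
  fi
  local workdir="$1" model="$2" prompt_file="$3" result_dir="$4" max_budget="$5"
  local prompt_bytes

  if [ ! -r "$prompt_file" ]; then
    echo "cannot read prompt file: $prompt_file" >&2
    return 1
  fi
  mkdir -p "$result_dir" || return 1

  # Measure the prompt before changing directory, in case the path is relative.
  prompt_bytes=$(wc -c < "$prompt_file" | tr -d ' ')

  # Placeholder for the real agent call: stdout becomes the transcript and
  # stderr becomes the log. The JSON fields are illustrative, not a
  # documented schema.
  (
    cd "$workdir" || exit 1
    echo "agent start: model=$model budget=$max_budget" >&2
    printf '{"event":"start","model":"%s","prompt_bytes":%s,"max_budget":"%s"}\n' \
      "$model" "$prompt_bytes" "$max_budget"
  ) > "$result_dir/transcript.jsonl" 2> "$result_dir/stderr.log"
}

# Dispatch only when arguments were passed, so the file can also be sourced.
if [ "$#" -gt 0 ]; then
  agent_main "$@"
fi
```

Redirecting the whole subshell keeps both required files in place even when the agent exits early, which makes failed runs easier to inspect.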