About

intent-bench is an open-source benchmark that evaluates whether giving AI coding agents structured intent makes them more effective.

Motivation

Existing coding benchmarks (HumanEval, SWE-bench, MBPP) measure raw code generation ability. They ask: "Can the model write correct code?" But in real software engineering, the hard part is not writing code -- it is understanding what to build, in what order, and how the pieces fit together.

intent-bench asks a different question: Does providing structured requirements to a coding agent improve its implementation effectiveness? The answer bears on how engineering teams should work with AI agents, and in particular on whether investing in requirement specification pays off in better outcomes.

How It Differs

A/B Design, Not Leaderboard

Most benchmarks rank models against a fixed task set. intent-bench compares two conditions (with and without structured intent) on the same model. The question is not "which model is best" but "does this practice help."
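
As a minimal sketch of this A/B comparison, the Python below computes the pass rate for each condition on the same tasks and reports the difference. The RunResult type, the condition labels "with_intent" and "without_intent", and the intent_effect helper are hypothetical names chosen for illustration, not part of intent-bench itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunResult:
    """One agent run on one task (hypothetical record, not an intent-bench type)."""
    task_id: str
    condition: str  # assumed labels: "with_intent" or "without_intent"
    passed: bool


def pass_rate(results: list[RunResult], condition: str) -> float:
    """Fraction of runs under `condition` that passed."""
    outcomes = [r.passed for r in results if r.condition == condition]
    if not outcomes:
        raise ValueError(f"no runs recorded for condition {condition!r}")
    return sum(outcomes) / len(outcomes)


def intent_effect(results: list[RunResult]) -> float:
    """Pass-rate difference between the two conditions on the same model."""
    return pass_rate(results, "with_intent") - pass_rate(results, "without_intent")


# Illustrative data only: two tasks, each run once per condition.
results = [
    RunResult("task-1", "with_intent", True),
    RunResult("task-1", "without_intent", True),
    RunResult("task-2", "with_intent", True),
    RunResult("task-2", "without_intent", False),
]
print(intent_effect(results))  # 0.5
```

A positive value means the structured-intent condition passed more often. A real analysis would likely need several runs per task and condition to separate the effect from run-to-run noise.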

Real Codebases, Not Puzzles

Advanced experiments use real open-source projects (go-toml, hyperfine, csvkit, go-yaml, structlog) at pinned commits with real closed issues as tasks. This measures brownfield effectiveness, not just greenfield generation.
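
One way to picture such a task is as a small record that ties a repository to an exact commit and a closed issue. The sketch below is an assumption about shape, not intent-bench's actual task format: BrownfieldTask, its field names, and the placeholder values are all hypothetical. It rejects branch names so that a task always refers to the same code.

```python
import re
from dataclasses import dataclass

# A pinned commit is assumed here to mean a full 40-character hex SHA,
# as opposed to a moving reference such as a branch name.
_FULL_SHA = re.compile(r"[0-9a-f]{40}")


@dataclass(frozen=True)
class BrownfieldTask:
    """Hypothetical task record: a repository, a pinned commit, a closed issue."""
    repo: str
    commit: str
    issue_id: str

    def __post_init__(self) -> None:
        if not _FULL_SHA.fullmatch(self.commit):
            raise ValueError(f"commit must be a full SHA, got {self.commit!r}")


# Placeholder values only; these do not refer to a real repository or issue.
task = BrownfieldTask(
    repo="example-org/example-repo",
    commit="a" * 40,
    issue_id="issue-placeholder",
)
```

Pinning to a full SHA rather than a branch keeps repeated runs comparable, because every run starts from identical source code.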

Multi-Dimensional Metrics

Beyond pass/fail, intent-bench measures token efficiency (cost), knowledge entropy (thrashing), planning-to-execution ratio, and outcome variance across runs. These capture how an agent reaches a result, not just whether it succeeds.
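
Two of these metrics can be sketched directly from per-run records. The definitions below are assumptions made for illustration: token efficiency is taken as tokens spent per passing run, and outcome variance as the population variance of pass/fail outcomes across repeated runs of the same task. AgentRun and both function names are hypothetical. Knowledge entropy and planning-to-execution ratio are omitted because this page does not define them.

```python
from dataclasses import dataclass
from statistics import pvariance


@dataclass(frozen=True)
class AgentRun:
    """Hypothetical record of one run of the agent on a single task."""
    tokens: int   # total tokens consumed by the run
    passed: bool  # whether the run's output passed the task's checks


def tokens_per_success(runs: list[AgentRun]) -> float:
    """Assumed token-efficiency measure: total tokens divided by passing runs."""
    successes = sum(1 for r in runs if r.passed)
    if successes == 0:
        return float("inf")  # no passing run, so cost per success is unbounded
    return sum(r.tokens for r in runs) / successes


def outcome_variance(runs: list[AgentRun]) -> float:
    """Population variance of pass (1.0) / fail (0.0) outcomes across runs."""
    return pvariance([1.0 if r.passed else 0.0 for r in runs])


# Illustrative numbers only: four repeated runs of the same task.
runs = [
    AgentRun(tokens=1000, passed=True),
    AgentRun(tokens=3000, passed=False),
    AgentRun(tokens=2000, passed=True),
    AgentRun(tokens=2000, passed=False),
]
print(tokens_per_success(runs))  # 4000.0
print(outcome_variance(runs))    # 0.25
```

Under these assumed definitions, a lower tokens-per-success value means cheaper successful outcomes, and a variance near zero means the agent behaves consistently across runs, whether it consistently passes or consistently fails.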

Related Work

SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
  Jimenez et al., 2024. ICLR 2024.
SWE-EVO: Evolving Software Engineering Benchmarks
  Zhang et al., 2025.
ProjDevBench: Benchmarking LLM Agents on Full-Project Development
  Zhong et al., 2025.
FeatureBench: Multi-Feature Software Engineering Benchmark
  Jain et al., 2025.
Plan Compliance in LLM-Based Multi-Agent Systems
  He et al., 2024.

License

intent-bench is released under the Apache License 2.0. Contributions are welcome.