---
title: AI Agent Benchmarks & Productivity Measurement
type: concept
tags: [ai-agent-benchmark, llm-agents, productivity, rct, evaluation, methodology]
sources: ["[[sources/2024-xu-the-agent-company-benchmark]]", "[[sources/2025-becker-metr-ai-developer-productivity]]"]
created: 2026-04-20
updated: 2026-04-20
---

# AI Agent Benchmarks & Productivity Measurement

How we measure whether AI agents and AI-augmented workflows actually do useful work. Two complementary methodologies are converging in 2025:

## Two methodological poles

### Synthetic but multi-tool agent benchmarks

[[sources/2024-xu-the-agent-company-benchmark]] (TheAgentCompany, CMU) builds a reproducible Dockerised simulated software company with GitLab, OwnCloud, Plane, and RocketChat, 175 professional tasks across 7 role categories, and LLM-backed simulated colleagues. Checkpoint-based partial credit supports long-horizon tasks. Best result: Gemini 2.5 Pro at 30.3% full task success. The framing is deliberately nuanced: simple tasks are automatable, while long-horizon professional tasks remain beyond current frontier agents.

**Strengths:** reproducibility, direct agent-vs-agent comparison, cross-role coverage, multi-tool + communication requirement.

**Weaknesses:** simulated environment cannot fully mirror workplace politics, stakeholder nuance, or the cost of incorrect outputs; no human baseline.

### Field RCT with human developers

[[sources/2025-becker-metr-ai-developer-productivity]] (METR) runs a randomised controlled trial with 16 experienced OSS developers working on 246 real issues in their own mature repositories (avg 23K stars, 5 years' familiarity each). The outcome measure (completion time) is fixed before randomisation. Result: AI-allowed issues take **19% longer**, the opposite of the 24–39% speedup forecast by developers, ML experts, and economists. An extensive 21-factor analysis identifies 5 contributing causes (repo quality bar, developer context depth, AI weakness on repo-specific context, etc.).

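The completion-time comparison behind the METR result above can be illustrated with a small sketch. It assumes a simplified estimator, the ratio of geometric-mean completion times between AI-allowed and AI-disallowed issues, which is a stand-in and not necessarily the estimator used in the study. The function name `relative_time_change` and all numbers are hypothetical.

```python
import math
from statistics import mean


def relative_time_change(ai_allowed_hours, ai_disallowed_hours):
    """Relative change in completion time, via a ratio of geometric means.

    Positive values mean AI-allowed issues took longer.
    Simplified stand-in estimator; assumed, not taken from the study.
    """
    log_ai = mean(math.log(t) for t in ai_allowed_hours)
    log_no_ai = mean(math.log(t) for t in ai_disallowed_hours)
    return math.exp(log_ai - log_no_ai) - 1.0


# Hypothetical completion times in hours (NOT data from the study).
ai_allowed = [2.4, 1.2, 3.6, 0.6]
ai_disallowed = [2.0, 1.0, 3.0, 0.5]

change = relative_time_change(ai_allowed, ai_disallowed)
print(f"{change:+.1%}")  # positive sign indicates a slowdown
```

Working on log times keeps a few very long issues from dominating the comparison. Because completion time is fixed as the outcome before randomisation, the sign of `change` can be read directly as a slowdown or a speedup.
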
**Strengths:** ecological validity, fixed outcome measure, compliance verified by screen recordings, pre-registered factor analysis.

**Weaknesses:** small N, setting-specific (senior OSS developers on mature repos), does not generalise to novices or synthetic tasks.

## What each methodology is blind to

- Agent benchmarks measure *agent capability*; they don't measure whether humans with AI are more productive.
- Field RCTs on developers measure *human productivity with AI as a tool*; they don't measure *autonomous agent task success*.
- Both are blind to **skill formation**, the longitudinal effect on human competence. That third dimension is covered by [[concepts/ai-skill-formation]].

## Methodological lessons

- **Fixed outcome measures.** METR's insistence on defining tasks before randomisation exposes inflation in the non-fixed metrics (lines of code, PR count) used by prior field studies (Peng 2023; Cui 2024). Non-fixed metrics can rise without productivity rising.
- **Checkpoint-based partial credit.** TheAgentCompany's checkpoint evaluators give an informative signal on long-horizon tasks rather than a binary pass/fail. Relevant to [[concepts/agent-process-observability]].
- **Perception-vs-reality gap.** Both practitioners (20%) and experts (38–39%) over-estimate AI speedup in complex real-world settings, against a measured 19% slowdown in the METR trial.
- **Setting specificity.** Results from any one measurement regime must be scoped: synthetic task ≠ real issue, novice ≠ senior, isolated task ≠ workflow.

## Relevance to APM / BPM

Agent benchmarks like TheAgentCompany simulate exactly the multi-tool, multi-stakeholder environment that [[concepts/agentic-bpm|APM]] systems must coordinate. The long-horizon checkpoint grading methodology parallels [[concepts/agent-process-observability]]. The METR productivity surprise cautions BPM practitioners against assuming that AI-augmented execution speeds up knowledge-work processes in the general case: setting and operator experience matter.

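The checkpoint-based grading mentioned above can be sketched as follows. This is a minimal illustration, not the benchmark's documented scoring rule: the `Checkpoint` type, the `score_task` function, and the `completion_bonus` weighting are all hypothetical assumptions chosen to show how partial credit and full success can be reported side by side.

```python
from dataclasses import dataclass


@dataclass
class Checkpoint:
    name: str      # what the evaluator checks at this step
    points: int    # weight of this checkpoint
    passed: bool   # whether the agent satisfied it


def score_task(checkpoints, completion_bonus=0.5):
    """Return (full_success, partial_score) for one task.

    partial_score mixes the fraction of checkpoint points earned with a
    bonus paid only on full completion. The 0.5 default is an assumed
    weighting for illustration, not the benchmark's documented value.
    """
    total = sum(c.points for c in checkpoints)
    if total == 0:
        raise ValueError("task needs at least one weighted checkpoint")
    earned = sum(c.points for c in checkpoints if c.passed)
    full = all(c.passed for c in checkpoints)
    partial = (1 - completion_bonus) * (earned / total)
    if full:
        partial += completion_bonus
    return full, partial


# Hypothetical long-horizon task with three checkpoints.
task = [
    Checkpoint("cloned the repository", 1, True),
    Checkpoint("opened a merge request", 2, True),
    Checkpoint("notified the reviewer in chat", 1, False),
]

full, partial = score_task(task)
print(full, partial)
```

A binary grader would report this run as a plain failure. The partial score shows that most of the process completed and which step broke, which is the kind of signal process observability depends on.
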
## Related

[[concepts/agentic-bpm]] · [[concepts/agent-process-observability]] · [[concepts/ai-adoption]] · [[concepts/ai-skill-formation]] · [[concepts/behavioral-variability]]