---
title: "Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity"
type: source
tags: [ai-productivity, rct, software-engineering, benchmark, metr, slowdown, cursor]
authors: ["Becker, Joel", "Rush, Nate", "Barnes, Beth", "Rein, David"]
year: 2025
venue: "arXiv:2507.09089 (cs.AI); Model Evaluation & Threat Research (METR)"
kind: paper
raw_path: "raw/AI Capabilities & Adoption/METR - Measuring the Impact of Early-2025 AI.pdf"
sources: []
key_claims:
  - "Randomised controlled trial: 16 experienced open-source developers, 246 real tasks on mature repos (avg 23,000 stars, 1.1M LOC) they had contributed to for ~5 years."
  - "AI tools (primarily Cursor Pro with Claude 3.5/3.7 Sonnet) increased completion time by 19%, the opposite of every forecast."
  - "Pre-study developer forecast: 24% speedup; post-study developer estimate: 20% speedup; ML expert forecast: 38% speedup; economics expert forecast: 39% speedup; observed: 19% slowdown."
  - "Screen recordings (143 hours, 29% of total) enable ~10-second-resolution time decomposition and compliance verification."
  - "21 candidate explanations for slowdown analysed; 5 contributed (e.g., high-quality bar of mature repos, developer familiarity with codebase, AI unreliability on repo-specific context), 10 mixed/unclear, 6 ruled against."
  - "Result contradicts synthetic-task studies (Peng 2023: 56% speedup) and output-metric field studies (Cui 2024: 26.8%) that use non-fixed outcome measures."
  - "Tasks defined before randomisation; fixed outcome measure (completion time) controls for AI-induced verbosity and PR fragmentation biases that inflate output metrics."
  - "Developers earn $150/hour; 93% have prior LLM experience but only 44% had used Cursor IDE before the study."
  - "Results are specific to this setting (experienced devs, mature OSS repos, early-2025 frontier) and do not generalise to all economically relevant settings or future models."
created: 2026-04-20
updated: 2026-04-20
---

# Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity

## Summary
METR (Model Evaluation & Threat Research) conducts a randomised controlled trial to measure the real-world productivity impact of frontier AI coding tools on experienced open-source developers. This is a methodological corrective to the bulk of AI-productivity literature, which relies on (a) synthetic tasks that over-represent LLM training data or (b) field experiments with non-fixed outcome measures such as lines-of-code or PR count that AI can inflate without real productivity gain.

**Design.** 16 developers (typically 10+ years of experience, ~5 years on their repo, averaging 1,500 commits to it) each provided a list of real issues from large repositories they regularly contribute to (average 23,000 stars, 1.1M LOC, 710 committers, high review bars). Each of the 246 issues was defined before randomisation and then assigned by coin flip to AI-allowed or AI-disallowed. When AI was allowed, developers could freely use any AI tool; in practice they mostly used Cursor Pro with Claude 3.5/3.7 Sonnet. Developers self-reported completion times; ~29% of working hours were screen-recorded (143 hours) and manually labelled for compliance and behavioural analysis. Pay: $150/hour.
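
The assignment step can be illustrated with a minimal sketch. It assumes each issue receives an independent fair coin flip, as the description above suggests; the function name, issue identifiers, and seed are hypothetical and are not taken from the paper.

```python
import random
from collections import Counter

def assign_conditions(issue_ids, seed=None):
    """Assign each pre-defined issue to a condition by an independent fair coin flip.

    The issue list must be fixed before this function is called, so that
    knowing the assignment cannot influence which tasks enter the study.
    """
    rng = random.Random(seed)
    return {
        issue_id: "ai_allowed" if rng.random() < 0.5 else "ai_disallowed"
        for issue_id in issue_ids
    }

# Hypothetical identifiers standing in for the 246 real issues.
issues = [f"issue-{n}" for n in range(1, 247)]
assignment = assign_conditions(issues, seed=2025)

# Independent flips give roughly, not exactly, equal group sizes.
print(Counter(assignment.values()))
```

Because the task list is frozen before the flip, both arms draw from the same pool of work, which is what lets completion time serve as a fixed outcome measure.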

**Main result.** AI-allowed issues took **19% longer** than AI-disallowed ones. This contradicts the pre-study developer forecast (24% speedup), the post-study developer estimate (20% speedup), the ML expert forecast (38% speedup), and the economics expert forecast (39% speedup). The gap between expectation and observation, roughly 39 to 58 percentage points depending on the forecaster, is the paper's central finding.
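
The size of the expectation gap can be worked out from the figures above. The sketch below assumes each forecast "speedup" means a percentage reduction in completion time and that the observed effect is a 19% increase in completion time; the variable and function names are illustrative only.

```python
# Observed effect: AI-allowed issues took 19% longer.
OBSERVED_TIME_INCREASE_PCT = 19

# Forecasts, read here as expected % reduction in completion time (assumption).
forecast_time_reduction_pct = {
    "developers (pre-study)": 24,
    "developers (post-study)": 20,
    "ML experts": 38,
    "economics experts": 39,
}

def gap_points(forecast_reduction_pct, observed_increase_pct=OBSERVED_TIME_INCREASE_PCT):
    """Distance in percentage points between a forecast reduction and the observed increase."""
    return forecast_reduction_pct + observed_increase_pct

gaps = {who: gap_points(pct) for who, pct in forecast_time_reduction_pct.items()}
for who, gap in gaps.items():
    print(f"{who}: {gap} percentage points")

# Taking 19% longer per task is not the same as being 19% less productive:
# the implied throughput ratio is 1 / 1.19.
time_ratio = 1 + OBSERVED_TIME_INCREASE_PCT / 100
speed_ratio = 1 / time_ratio
print(f"implied throughput ratio: {speed_ratio:.3f}")
```

Under this convention the gaps are 43, 39, 57, and 58 points respectively, and a 19% increase in time per task corresponds to roughly 16% lower throughput.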

**Explanatory analysis.** The authors analyse 21 candidate factors grouped into four categories (direct productivity loss, experimental artifact, factors raising human performance, factors limiting AI performance). Evidence supports 5 (including very high code-quality bars on mature repos, deep developer familiarity with the codebase, and AI failures on repo-specific context and large files), is mixed or unclear for 10, and argues against 6 (including experimental artifacts as the primary driver). The slowdown is broadly robust across design variants.

**Scope of the claim.** Results are setting-specific. They do *not* imply AI is unhelpful in general; synthetic-task and less-experienced-developer evidence still shows speedups, and the heterogeneous-effects literature (Agrawal et al. 2018) predicts that less-experienced workers benefit most, which compresses the performance distribution. The paper's core insight is the *perception–reality gap*: both practitioners and experts systematically over-estimate AI speedup in settings where AI's weaknesses (context ingestion, code-quality thresholds) compound.

## Connections
- Central evidence for [[concepts/ai-agent-benchmarks]] — methodologically complements [[sources/2024-xu-the-agent-company-benchmark]] (synthetic benchmark of LLM agents) by measuring real developer productivity rather than agent task completion.
- In tension with the perceived productivity gains in the leader survey [[sources/2025-korst-wharton-gen-ai-enterprise-adoption]] (three-quarters report positive ROI) and with what the bottom-up usage data in [[sources/2025-handa-which-economic-tasks-ai]] (high software-engineering usage) might suggest.
- Complements learning-focused findings in [[sources/2026-shen-ai-skill-formation]] — METR measures experienced-developer productivity, Shen measures novice skill formation; both find AI less beneficial than expected.
- Feeds [[concepts/ai-adoption]] as a caution against self-reported ROI.
- New entities: [[entities/joel-becker]], [[entities/beth-barnes]].