---
title: "TheAgentCompany: Benchmarking LLM Agents on Consequential Real-World Tasks"
type: source
tags: [ai-agent-benchmark, llm-agents, workplace-automation, benchmark, simulated-environment, cmu]
authors: ["Xu, Frank F.", "Song, Yufan", "Li, Boxuan", "Tang, Yuxuan", "Jain, Kritanjali", "Neubig, Graham"]
year: 2024
venue: "arXiv:2412.14161 (cs.CL), v3 Sep 2025"
kind: paper
raw_path: "raw/AI Capabilities & Adoption/The Agent Company2412.14161v3.pdf"
sources: []
key_claims:
  - "Self-hosted, reproducible benchmark of 175 professional tasks in a simulated software-engineering company (GitLab, Plane, OwnCloud, RocketChat) covering SDE, PM, DS, Admin, HR, Finance roles."
  - "Tasks require web browsing, coding, terminal use, and multi-turn communication with simulated colleagues implemented via Sotopia."
  - "Best model (Gemini 2.5 Pro via OpenHands) autonomously completes 30.3% of tasks in full and scores 39.3% under partial-credit checkpoint grading."
  - "Checkpoint-based partial-credit evaluators (Python functions checking environment state) support long-horizon tasks rather than single-shot correctness."
  - "Three checkpoint types: Action Completion, Data Accuracy, Collaboration (with simulated colleagues)."
  - "Strong role variance: SDE up to ~38% success, Finance/DS/Admin often <15%; open-weights models (Llama-3.1-405b, Qwen-2.5-72b) trail closed API models markedly."
  - "Simulated colleagues (all backed by Claude-3-5-Sonnet) test communication, negotiation, and information-gathering subtasks."
  - "Designed to address the gap between impressive single-task benchmark scores and real workplace performance; finds that a good portion of simple tasks are automatable but long-horizon professional tasks remain beyond frontier agents."
  - "All environments can be reset via Docker; reproducibility is a core design goal."
created: 2026-04-20
updated: 2026-04-20
---

# TheAgentCompany: Benchmarking LLM Agents on Consequential Real-World Tasks

## Summary
Xu, Song, Li et al. (CMU, 2024; v3 September 2025) build a reproducible benchmark that evaluates LLM agents on the kind of heterogeneous, multi-tool, multi-stakeholder tasks a digital knowledge worker performs in a small software company. The motivation is a gap in the benchmark literature: existing evaluations (SWE-Bench, WebArena, MiniWoB++, τ-bench) isolate one interface or skill, whereas real workplace tasks combine web UIs, code, terminals, and colleague communication across long horizons.

**Environment.** A Dockerised self-hosted intranet mimics a startup (TheAgentCompany) with four platforms: GitLab (code + wiki), OwnCloud (documents), Plane (Jira-like task tracker), RocketChat (Slack-like chat). Simulated colleagues are LLM-backed characters (built on the Sotopia platform, all driven by Claude-3-5-Sonnet) with roles, responsibilities, and project contexts. The whole environment can be reset to its initial state between runs.

**Tasks.** 175 tasks mapped to role categories: SDE (69), PM (28), HR (29), Admin (15), DS (14), Finance (12), Other (8). Each task has a natural-language (English) intent, a list of checkpoints with point values, and Python evaluator functions that check environment state or the agent's trajectory. Checkpoints cover Action Completion, Data Accuracy, and Collaboration. Partial credit lets long-horizon tasks yield informative scores rather than a binary pass/fail.
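
The checkpoint idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the benchmark's actual evaluator code: the `Checkpoint` class, the `grade_task` and `aggregate` helpers, the state keys, and the example task are all hypothetical, and the per-task score here is simply the fraction of checkpoint points earned. The benchmark's own scoring rule may differ (for example by rewarding full completion beyond the raw point fraction).

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

# Hypothetical representation of the environment state after an agent run.
State = Dict[str, Any]


@dataclass
class Checkpoint:
    """One graded milestone of a task (names and fields are illustrative)."""
    name: str
    kind: str                       # "action_completion", "data_accuracy", "collaboration"
    points: int
    check: Callable[[State], bool]  # evaluator function inspecting the state


def grade_task(checkpoints: List[Checkpoint], state: State) -> Dict[str, Any]:
    """Award points for every passed checkpoint; assumed rule: score = earned / total."""
    total = sum(cp.points for cp in checkpoints)
    if total <= 0:
        raise ValueError("a task needs at least one checkpoint with positive points")
    earned = sum(cp.points for cp in checkpoints if cp.check(state))
    return {
        "earned": earned,
        "total": total,
        "score": earned / total,
        "full_success": earned == total,
    }


def aggregate(results: List[Dict[str, Any]]) -> Dict[str, float]:
    """Summarise many task results as a full-success rate and a mean partial score."""
    if not results:
        raise ValueError("no results to aggregate")
    n = len(results)
    return {
        "full_success_rate": sum(1 for r in results if r["full_success"]) / n,
        "mean_score": sum(r["score"] for r in results) / n,
    }


# Hypothetical task with one checkpoint of each type.
checkpoints = [
    Checkpoint("issue_closed", "action_completion", 1,
               lambda s: s.get("issue_status") == "closed"),
    Checkpoint("report_total_correct", "data_accuracy", 2,
               lambda s: s.get("report_total") == 1200),
    Checkpoint("asked_colleague", "collaboration", 1,
               lambda s: len(s.get("messages_sent", [])) > 0),
]

# One run gets the report figure wrong; the other completes every checkpoint.
partial_state = {"issue_status": "closed", "report_total": 1150,
                 "messages_sent": ["Hi, which sprint is this for?"]}
full_state = {**partial_state, "report_total": 1200}

partial = grade_task(checkpoints, partial_state)
full = grade_task(checkpoints, full_state)
summary = aggregate([partial, full])
```

In this toy run the first trajectory earns 2 of 4 points without counting as a full success, which is the kind of signal a binary pass/fail metric would discard. Reporting both a full-success rate and a mean score mirrors the two headline numbers the note quotes for each model.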

**Results.** Twelve LLM backbones were tested via the OpenHands agent framework. Best result: Gemini 2.5 Pro at 30.3% full success and 39.3% weighted score. Claude-3.7-Sonnet follows at 23.9% / 32.6%. Open-weights models trail significantly: Llama-3.1-405b and Qwen-2.5-72b each reach 5.6%. Role breakdown: SDE strongest, Finance/DS weakest. Platform breakdown: Plane (task manager) highest, OwnCloud (docs) lowest.

**Framing.** The result is deliberately nuanced: "a good portion of simpler tasks could be solved autonomously, but more difficult long-horizon tasks are still beyond the reach of current systems." It is positioned against both AI-hype (Amodei; Eloundou et al. 2023) and AI-scepticism (Kambhampati et al. 2024) by providing a concrete measurement instrument. No human baseline was collected because of recruitment cost, which the authors state as a limitation.

## Connections
- Anchor for [[concepts/ai-agent-benchmarks]] together with [[sources/2025-becker-metr-ai-developer-productivity]]; the two differ in methodology (synthetic vs. real RCT) and population (LLM agents vs. human developers with AI).
- Environment simulates exactly the tools agentic BPM systems need to orchestrate — relevant to [[concepts/agentic-bpm]] and [[concepts/agent-process-observability]]. Multi-colleague communication tests map onto APM's agent-to-agent protocol layer.
- Task coverage across roles complements usage-distribution evidence from [[sources/2025-handa-which-economic-tasks-ai]] and [[sources/2025-tomlinson-working-with-ai]].
- Long-horizon checkpoint grading parallels [[concepts/agent-process-observability]] (trajectory-level evaluation).
- New entities: [[entities/graham-neubig]], [[entities/frank-xu]].