---
title: "LLMs Corrupt Your Documents When You Delegate"
type: source
tags: [llm, agents, evaluation, benchmark, delegation, drift, microsoft-research, agentic-bpm]
authors: [Laban, Philippe; Schnabel, Tobias; Neville, Jennifer]
year: 2026
venue: "arXiv preprint 2604.15597v1 [cs.CL], 17 April 2026 (under review)"
kind: preprint
raw_path: "raw/AI Capabilities & Adoption/2026-laban-schnabel-neville-llms-corrupt-documents-delegate.pdf"
arxiv_id: "2604.15597v1"
external_url: "https://arxiv.org/abs/2604.15597"
status: ingested
sources: []
key_claims:
  - "Frontier models (Gemini 3.1 Pro, Claude 4.6 Opus, GPT 5.4) corrupt on average 25% of document content over 20 delegated edit interactions on DELEGATE-52; the average degradation across all 19 tested LLMs is 50% (Table 1)."
  - "DELEGATE-52 is a reference-free benchmark spanning 52 professional domains, 310 work environments, ~3–5k token seed documents (8–12k token distractor context), and 5–10 invertible edit tasks per environment, organized in five categories: Code & Configuration, Science & Engineering, Creative & Media, Structured Records, Everyday (Figure 3)."
  - "Evaluation uses a round-trip relay: a forward edit (σ) followed by its inverse (σ⁻¹) is applied as two single-turn LLM calls; reconstruction score RS@k = sim(s, ŝ_{k/2}) compares the original seed s with the reconstructed document after k interactions, using domain-specific parsing-based similarity functions calibrated per domain (Section 2.1, Figure 5)."
  - "Python is the only domain (out of 52) where most models reach the 'ready' bar (RS@20 ≥ 98%); the best model (Gemini 3.1 Pro) is 'ready' in only 11 of 52 domains, and 80% of model-domain combinations exhibit at least −20% degradation by end of simulation (Table 2)."
  - "Agentic tool use (file read/write + code execution harness, Yao-style) does NOT improve performance: tested models incur an additional 6% average degradation in agentic mode versus direct text output; even the best (GPT 5.4) loses 3% (71.5% vs 68.3%) (Table 3, Section 4.2)."
  - "Degradation compounds with document size, interaction length, and distractor presence: each +1k tokens in document size costs ~0.7% after 2 interactions but ~3.6% after 20 (≈5× snowball, Table 5); extending relays from 20 to 100 interactions shows monotonic decline with no plateau (Table 6); distractor harm widens from 0.4–4% at 2 interactions to 2–8% at 20 interactions (Table 7)."
  - "Critical-failure analysis: ~80% of total degradation comes from sparse but severe (≥10pt drop) round-trips, not 'death by a thousand cuts'; stronger models delay rather than avoid critical failures (Table 9, Section 5)."
  - "Weaker models degrade primarily via deletion (missing content), while frontier models degrade primarily via corruption (incorrect content present) — a qualitatively different failure mode for the most capable systems (Figure 7, Appendix F)."
  - "Image-editing extension (9 image-generation models, 6 visual environments): final reconstruction scores 28–30%, far worse than 70–80% for textual domains, indicating image models are markedly less ready for delegated work than text models (Table 8, Section 4.6)."
  - "Performance after 2 interactions is not predictive of long-horizon (20-interaction) performance; e.g., GPT 5 and Kimi K2.5 are near-tied at k=2 (91.5 vs 91.1) but diverge to 48.3 vs 64.1 at k=20 — short benchmarks materially underestimate corruption (Section 4.1)."
  - "Limitations explicitly acknowledged: single-turn instructions per round-trip (no clarification dialog), document-editing-only scope (excludes communication/planning), reversibility-based evaluation favors structured domains where parsing is tractable; multi-turn or instruction-sharded simulations would likely amplify degradation (Section 8)."
created: 2026-04-27
updated: 2026-04-27
---

# Laban, Schnabel & Neville 2026 — LLMs Corrupt Your Documents When You Delegate

## Summary

This Microsoft Research preprint (arXiv:2604.15597v1, 17 April 2026, under review) introduces **DELEGATE-52**, a reference-free, multi-domain benchmark that measures whether current LLMs are reliable *delegates* for long-horizon document editing, the operational substrate of "vibe coding" and similar delegated-work paradigms (Shao et al. 2025; Ulloa et al. 2025). The headline empirical finding is that **frontier models (Gemini 3.1 Pro, Claude 4.6 Opus, GPT 5.4) corrupt on average 25% of document content after 20 delegated edit interactions, and the average across all 19 tested LLMs is 50%** (Section 1, Table 1). Performance is **strongly domain-dependent**: Python is the only domain where most models cross the 98% "ready" bar, and the best single model (Gemini 3.1 Pro) is "ready" in only 11/52 domains. Agentic tool harnesses do *not* help: they add roughly 6% additional degradation on average. Degradation compounds with document size, interaction length, and distractor context, and ~80% of it comes from sparse "critical failures" (drops of ≥10 percentage points in a single round-trip) rather than uniform decay.

For the wiki, the paper provides the **first large-N empirical bound** on the agentic-process-execution thesis being pushed by [[sources/2025-calvanese-autonomy-business-process-execution]] and [[sources/2026-dumas-agentic-bpms-pyramid]], and a sharp empirical anchor for the trajectory-drift framing in [[syntheses/study-sketch-agent-trajectory-drift]].

## DELEGATE-52: what is benchmarked (Section 2)

- **52 professional domains** in five categories (Figure 3): Code & Configuration (11 — Python, DBSchema, Docker, Filesystem, Graphviz, Infra, JSON, Makefile, Malware, DNS, Translation), Science & Engineering (11 — Aviation, Circuit, Crystal, MathLean, Molecule, Protein, Quantum, Robotics, Satellite, StarCatalog, Weather), Creative & Media (11 — AudioSyn, Fiction, FontEng, LaTeX, MusicSheet, OBJ3D, Screenplay, Slides, Subtitles, Vector, Weaving), Structured Records (11 — Accounting, Calendar, EDIFACT, Emails, Genealogy, GeoData, GeoTrack, HamRadio, LibCatalog, Spreadsheet, Treebank), Everyday (8 — Chess, EarnCall, FoodMenu, JobBoard, Landmarks, Playlist, Recipe, Transit). Inclusion criterion: existence of a **standard textual unencoded document type** (e.g., `.srt`, `.cif`).
- **310 work environments** (≈6 per domain). Each environment = (seed document, 5–10 edit tasks, distractor context).
- **Seed documents**: real online documents (no synthetic or templated content), 3–5k tokens by default, permissively licensed; Section 4.3 also tests 1–10k token variants.
- **Edit tasks**: pairs `(x→, x←)` of forward + backward natural-language instructions defining an *invertible* transformation σ. Edits must (a) reflect a realistic stakeholder request, (b) be non-trivial (not just concatenation/cropping), and (c) be tagged with semantic operations (split/merge, classification, sorting, numerical reasoning, string manipulation, referencing, context expansion, topic modeling, format knowledge, domain knowledge, constraint satisfaction — Figure 9, Appendix H).
- **Distractor context**: 8–12k tokens of topically related but task-irrelevant documents per environment, modelling imperfect retrieval precision.
- **Domain-specific parsers + similarity functions** (Figure 5): each domain implements a parsing function `text → structured representation` and a weighted similarity over parsed components (e.g., recipe: 40% ingredients + 40% steps + 20% tips), calibrated by ablation. Generic measures (Levenshtein, embedding cosine, GPT-5.4-as-judge) capture at most 25% of the variance of the parsing-based metric (Appendix C).
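
The parse-then-compare idea in the last bullet can be illustrated with a minimal sketch. Only the 40/40/20 recipe weighting comes from this note; the plain-text recipe layout, the `parse_recipe` helper, and the use of Jaccard overlap as the per-component similarity are assumptions made for illustration and are not the paper's implementation.

```python
import re

# Component weights from the recipe example quoted in this note.
WEIGHTS = {"ingredients": 0.4, "steps": 0.4, "tips": 0.2}


def parse_recipe(text: str) -> dict[str, set[str]]:
    """Toy parser (hypothetical format): section headers followed by list items."""
    sections: dict[str, set[str]] = {name: set() for name in WEIGHTS}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        header = line.rstrip(":").lower()
        if header in sections:
            current = header
        elif current is not None:
            # Strip a leading bullet ("- ") or step number ("1. "), keep quantities.
            item = re.sub(r"^(?:[-*]|\d+\.)\s+", "", line)
            sections[current].add(re.sub(r"\s+", " ", item).lower())
    return sections


def jaccard(a: set[str], b: set[str]) -> float:
    """Set overlap in [0, 1]; two empty components count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def recipe_similarity(original: str, reconstructed: str) -> float:
    """Weighted similarity over parsed components (assumed Jaccard per component)."""
    p, q = parse_recipe(original), parse_recipe(reconstructed)
    return sum(w * jaccard(p[name], q[name]) for name, w in WEIGHTS.items())


SEED = """Ingredients:
- 2 eggs
- 100 g flour
Steps:
1. Whisk the eggs
2. Fold in the flour
Tips:
- Rest the batter
"""

# One corrupted quantity and one deleted tip.
DAMAGED = SEED.replace("100 g", "150 g").replace("- Rest the batter\n", "")

print(round(recipe_similarity(SEED, SEED), 3))
print(round(recipe_similarity(SEED, DAMAGED), 3))
```

Because the comparison runs on parsed components, a single changed quantity registers as a lost ingredient even though the two texts differ by one character, which is the kind of error a generic string distance would barely notice.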

## The round-trip relay method (Section 2.1, Figures 2 and 6)

- **Round-trip primitive**: given seed `s`, apply forward instruction `x→` to get `t = LLM(s; x→)`, then backward instruction `x←` to get reconstructed `ŝ = LLM(t; x←)`. A perfect model yields `sim(s, ŝ) = 1`.
- **Single-turn, independent sessions** for each step — no chain-of-thought continuity between forward and backward.
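- **Relay**: chain `N` round-trips in sequence. Writing `σ_i` for the forward edit of round-trip `i` and `σ_i⁻¹` for its inverse (each realized by one LLM call), the document after `n` round-trips is `ŝ_n = (σ_n⁻¹ ∘ σ_n ∘ … ∘ σ_1⁻¹ ∘ σ_1)(s)` for `1 ≤ n ≤ N`, with `σ_1` applied first. Main metric: **RS@k = sim(s, ŝ_{k/2})** for even `k`, i.e. evaluated after every two interactions.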
- **Round-robin scheduling** of the 5–10 available edit tasks (shuffled order each epoch) up to `N=10` round-trips = 20 interactions; Appendix D validates round-robin as more realistic and harsher than repeating the same edit.
- **Critical insight**: backtranslation as evaluation circumvents the need for gold reference solutions. **Limitation made explicit by the authors**: this measures *reversibility/consistency*, not *task correctness* — a model could complete an edit in one of several valid ways and still fail the round-trip if the inverse cannot recover the original. Backtranslation alignment with model performance is validated empirically in Appendix A.
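
The relay loop described in this section can be sketched as follows. This is a rough illustration under stated assumptions, not the authors' harness: `run_relay`, `lossy_edit`, and `line_overlap` are hypothetical names, the "LLM" is a deterministic stub that silently drops one line on each backward step, and the similarity is a plain line-set overlap rather than a domain parser.

```python
import random
from typing import Callable

# An edit task is a (forward, backward) instruction pair. `EditFn` stands in
# for one independent single-turn LLM call: (document, instruction) -> document.
EditTask = tuple[str, str]
EditFn = Callable[[str, str], str]
SimFn = Callable[[str, str], float]


def run_relay(seed: str, tasks: list[EditTask], edit: EditFn, sim: SimFn,
              n_round_trips: int = 10, rng_seed: int = 0) -> dict[int, float]:
    """Return {k: RS@k} for k = 2, 4, ..., 2 * n_round_trips."""
    rng = random.Random(rng_seed)
    queue: list[EditTask] = []
    doc = seed
    scores: dict[int, float] = {}
    for n in range(1, n_round_trips + 1):
        if not queue:                    # new epoch: reshuffle all tasks
            queue = tasks[:]
            rng.shuffle(queue)
        forward, backward = queue.pop()
        doc = edit(doc, forward)         # interaction 2n - 1
        doc = edit(doc, backward)        # interaction 2n
        scores[2 * n] = sim(seed, doc)   # always compared against the seed
    return scores


def lossy_edit(doc: str, instruction: str) -> str:
    """Hypothetical stand-in for an LLM: reverses line order and silently
    drops one line on every backward ("restore") step."""
    lines = doc.splitlines()[::-1]
    if instruction == "restore" and len(lines) > 1:
        lines = lines[:-1]
    return "\n".join(lines)


def line_overlap(a: str, b: str) -> float:
    """Toy similarity: Jaccard overlap of the two documents' line sets."""
    sa, sb = set(a.splitlines()), set(b.splitlines())
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0


seed_doc = "\n".join(f"row {i}" for i in range(10))
scores = run_relay(seed_doc, [("reverse", "restore")], lossy_edit,
                   line_overlap, n_round_trips=3)
print(scores)
```

Comparing against the seed at every even `k`, rather than against the previous reconstruction, is what lets small per-round-trip losses show up as cumulative degradation.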

## Findings (Section 4)

- **Main result (Table 1, 20 interactions)**: Gemini 3.1 Pro 80.9, Claude 4.6 Opus 73.1, GPT 5.4 71.5, Claude 4.6 Sonnet 66.9, Kimi K2.5 64.1, GPT 5.1 60.5, Grok 4 59.3, GPT 5.2 66.1, GPT 5 48.3, GPT 4.1 49.5, o3 48.2, o1 48.1, GPT 5 Chat 46.8, GPT 5 Mini 45.1, Mistral Large 3 35.5, Gemini 3 Flash 35.8, OSS 120B 19.2, GPT 4o 14.7, GPT 5 Nano 10.0. **Frontier average ≈ 75% reconstruction = 25% corruption; full average ≈ 50% corruption.**
- **Domain readiness (Table 2)**: scores are binned into ✓ (≥98 "ready"), 95–98, 90–95, 80–90, 70–80, 55–70, and <55 (catastrophic). 80% of (model, domain) pairs show at least 20% degradation by the end of the simulation, i.e. fall at or below the 80 mark. Python is the *only* domain where a majority of models reach "ready"; the best model (Gemini 3.1 Pro) is ready in 11/52 domains.
- **Agentic tool use does NOT help (Section 4.2, Table 3)**: a basic Yao-style harness with file read/write + code execution adds 6% average extra degradation; even the best model, GPT 5.4, still loses about 3% (71.5 direct vs 68.3 agentic). Tool overhead (Table 4): models invoke 8–12 tools per task, consume 2–5× the input tokens, and prefer file-write over code-execution (45–81% file-write vs 6–14% code-exec), so they mostly rewrite files wholesale rather than edit programmatically, which limits the harness's upside.
- **Document size (Section 4.3, Table 5)**: GPT 5.4 drops from 91.4 (1k tokens) to 59.9 (10k tokens) by k=20; size and interaction length compound *multiplicatively*, with the per-1k-token cost increasing ~5× from k=2 to k=20.
- **Interaction length (Section 4.4, Table 6)**: extending relays from 20 to 100 interactions for four GPT models shows monotonic decline with no plateau; even GPT 5.4 falls below 60% by k=100. The first half of an extended relay accounts for ~2–3× more loss than the second half, but novel errors keep accumulating even when tasks repeat.
- **Distractor effect (Section 4.5, Table 7)**: removing distractors yields a small +0.4–4% bump at k=2 but a +2–8% bump at k=20; distractor harm compounds with interaction length.
- **Image extension (Section 4.6, Table 8)**: 9 image-generation models on 6 visual environments. The best models reach only 28–30% by k=20, and even at k=2 no image model exceeds 65%, which is below the 70–80% that frontier text models retain at k=20. Image-domain delegation is markedly less ready than text-domain delegation.
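
As a sanity check on the arithmetic behind the headline numbers, the short script below recomputes the averages from the RS@20 scores listed in the main-result bullet, plus the document-size and agentic deltas quoted above. All figures are copied from this note; the variable and function names are illustrative only.

```python
# RS@20 per model, copied from the Table 1 figures quoted in this note.
RS20 = {
    "Gemini 3.1 Pro": 80.9, "Claude 4.6 Opus": 73.1, "GPT 5.4": 71.5,
    "Claude 4.6 Sonnet": 66.9, "Kimi K2.5": 64.1, "GPT 5.1": 60.5,
    "Grok 4": 59.3, "GPT 5.2": 66.1, "GPT 5": 48.3, "GPT 4.1": 49.5,
    "o3": 48.2, "o1": 48.1, "GPT 5 Chat": 46.8, "GPT 5 Mini": 45.1,
    "Mistral Large 3": 35.5, "Gemini 3 Flash": 35.8, "OSS 120B": 19.2,
    "GPT 4o": 14.7, "GPT 5 Nano": 10.0,
}
FRONTIER = ("Gemini 3.1 Pro", "Claude 4.6 Opus", "GPT 5.4")


def mean_score(models) -> float:
    """Mean RS@20 over a sequence of model names."""
    return sum(RS20[m] for m in models) / len(models)


frontier_avg = mean_score(FRONTIER)
overall_avg = mean_score(list(RS20))

# Document-size effect for GPT 5.4 at k=20: points lost per extra 1k tokens
# between the 1k-token and 10k-token variants.
slope_k20 = (91.4 - 59.9) / (10 - 1)

# Snowball factor from the quoted average per-1k-token costs at k=2 and k=20.
snowball = 3.6 / 0.7

# Agentic penalty for GPT 5.4 (direct text output vs tool harness).
agentic_delta = 71.5 - 68.3

print(f"frontier: {frontier_avg:.1f} reconstruction, {100 - frontier_avg:.1f} corruption")
print(f"all {len(RS20)} models: {overall_avg:.1f} reconstruction, {100 - overall_avg:.1f} corruption")
print(f"GPT 5.4 size cost at k=20: {slope_k20:.1f} points per 1k tokens")
print(f"snowball factor: {snowball:.1f}x, agentic delta: {agentic_delta:.1f} points")
```

With these inputs the frontier mean is about 75.2 and the 19-model mean about 49.7, which matches the rounded 25% and 50% corruption headlines; the GPT 5.4 slope of 3.5 points per 1k tokens is close to the ~3.6 average quoted from Table 5.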

## Domain dependence (Section 4.1, Appendix G)

- **Programmatic / structured domains easier**: Python, DBSchema, JSON, Makefile, Crystal, Molecule, Chess.
- **Natural-language / niche domains harder**: Earnings calls, Music sheet, Recipe, Fiction, Transit, Textile/Weaving.
- **Cohen's d effect sizes (Figure 8)**: easier when document is more *repetitive*, has higher *numerical fraction*, higher *structural density*; harder when document has rich *vocabulary* or high *naturalness*. Reading: LLMs are best where verifiable rewards can be defined (echoes Suma & Dauncey 2025), and DELEGATE-52 effectively *constructs* such rewards via parsing.
- **Operation difficulty (Figure 9)**: Split & Merge, Classification, Format Knowledge, Topic Modeling are *easier*; Constraint Satisfaction, Domain Knowledge, Numerical Reasoning, Sorting, Context Expansion, Referencing, String Manipulation are *harder* (negative point-biserial correlation with reconstruction score). Tasks coordinating multiple operations are markedly harder than single-operation tasks (Appendix H).
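
For readers unfamiliar with the two statistics named above, here is a small sketch of how Cohen's d and a point-biserial correlation are computed. The scores and flags are made-up toy values, not data from the paper; the formulas are the standard textbook ones (pooled-variance Cohen's d, and point-biserial as Pearson correlation between a binary flag and a continuous score).

```python
from math import sqrt
from statistics import mean, pstdev, stdev


def cohens_d(group_a: list[float], group_b: list[float]) -> float:
    """Standardized mean difference using the pooled sample variance."""
    na, nb = len(group_a), len(group_b)
    pooled_var = ((na - 1) * stdev(group_a) ** 2
                  + (nb - 1) * stdev(group_b) ** 2) / (na + nb - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


def point_biserial(flags: list[int], scores: list[float]) -> float:
    """Pearson correlation between a 0/1 flag and a continuous score."""
    ones = [s for f, s in zip(flags, scores) if f]
    zeros = [s for f, s in zip(flags, scores) if not f]
    p = len(ones) / len(scores)
    return (mean(ones) - mean(zeros)) / pstdev(scores) * sqrt(p * (1 - p))


# Hypothetical toy data: reconstruction scores for four tasks, and a flag
# marking whether each task involves a given operation (1 = yes).
toy_scores = [90.0, 85.0, 60.0, 55.0]
uses_operation = [0, 0, 1, 1]

r = point_biserial(uses_operation, toy_scores)
d = cohens_d(toy_scores[:2], toy_scores[2:])
print(f"point-biserial r = {r:.3f}, Cohen's d = {d:.3f}")
```

A negative `r` means tasks carrying the flag score lower, which is the sense in which an operation is called "harder" in the bullet above.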

## Critical failures and deletion-vs-corruption (Section 5, Appendices E–F)

- **Critical failure** = a single round-trip that causes a drop of ≥10 points in reconstruction score. ~80% of total degradation across all models comes from these sparse but severe events. Frontier models (Gemini 3.1 Pro, Claude 4.6 Opus/Sonnet, GPT 5.4) hit them in fewer of their round-trips (Table 9) but do *not* avoid them; they delay them.
- **Deletion vs corruption decomposition (Figure 7)**: weaker models (GPT 5 Nano, GPT 4o, OSS 120B) degrade primarily by *deleting* content; frontier models degrade primarily by *corrupting* existing content (introducing wrong-but-plausible material). The latter is harder to detect in monitoring without ground truth — directly relevant to process-monitoring research that wants to alert on agent misbehaviour.
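
The critical-failure share can be illustrated with a short sketch. The trajectory below is invented for illustration, and the accounting rule (sum only the drops, ignore any partial recoveries) is an assumption; the paper's exact bookkeeping may differ.

```python
def critical_failure_share(trajectory: list[float], threshold: float = 10.0) -> float:
    """Fraction of total score loss that comes from single round-trips whose
    drop is at least `threshold` points.

    `trajectory` holds reconstruction scores after 0, 1, 2, ... round-trips.
    Assumption: only drops are counted; recoveries are ignored.
    """
    drops = [prev - cur for prev, cur in zip(trajectory, trajectory[1:])]
    losses = [d for d in drops if d > 0]
    total = sum(losses)
    if total == 0:
        return 0.0
    critical = sum(d for d in losses if d >= threshold)
    return critical / total


# Hypothetical trajectory: mostly 1-2 point slippage, plus two severe drops.
toy_trajectory = [100.0, 99.0, 98.0, 84.0, 83.0, 82.0, 70.0, 68.0]

share = critical_failure_share(toy_trajectory)
print(f"{share:.1%} of the loss comes from critical failures")
```

In this toy run two of the seven round-trips account for 26 of the 32 points lost, the same "sparse but severe" shape the section describes.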

## Limitations and what the benchmark does NOT measure (Section 8)

The authors' own limitations and several this wiki should also flag:
- **Single-turn per round-trip**: no clarification, no multi-turn refinement. Real delegated work is multi-turn; Naous et al. 2025 / Laban et al. 2025 results suggest multi-turn would *worsen* degradation — so DELEGATE-52 is an optimistic bound.
- **Document-editing-only scope**: excludes communication, planning, retrieval-driven knowledge synthesis. Many delegated knowledge-work tasks are not document edits.
- **Reversibility ≠ correctness**: round-trip identity is a *proxy*; a model can solve the forward task validly in multiple ways and still fail reversibility. The benchmark measures consistency-violations, not task success on stakeholder criteria.
- **Practical-constraint envelope**: 3–5k token documents, 8–12k distractor, 20-interaction horizon — chosen for cost. Real industrial documents and workflows exceed this; the paper shows degradation worsens as parameters grow, so DELEGATE-52 *underestimates* real-world corruption.
- **Domain selection bias**: criterion was the existence of a parseable textual format; open-ended generation domains (only Fiction included) are under-represented.
- **Non-optimised agentic harness**: Section 4.2 explicitly notes this is a basic harness, not a state-of-the-art agent system; future agentic systems may close the gap.

## Why it matters for agentic BPM

This paper is the empirical counter-weight the agentic-BPM literature has been waiting for. Four connections:

1. **Direct bound on autonomous process execution.** [[sources/2025-calvanese-autonomy-business-process-execution|Calvanese et al. 2025 (PMAI'25)]] argues for elevating goals and normative frames so agents can synthesize their own operational plans. DELEGATE-52 shows that even on relatively well-defined edit tasks (which are *operationally* simpler than synthesizing plans), 80% of model-domain pairs across the 19 tested LLMs show at least 20% degradation after 20 interactions, and even the best model is "ready" in only 11 of 52 domains. The "frame the autonomy" programme is therefore not optional: it is a precondition for the underlying execution engine to be trustworthy at all.

2. **Empirical anchor for trajectory drift.** [[syntheses/study-sketch-agent-trajectory-drift]] (Riess study sketch) hypothesises that agent-induced drift has a distinct temporal signature with sparse, discontinuous failure events. **DELEGATE-52's critical-failure analysis (Table 9) provides evidence of exactly this kind**: ~80% of degradation comes from sparse ≥10pt drops, not gradual decay. The paper thereby supports H2's core mechanism empirically (although on round-trip reconstruction rather than process conformance) and supplies a measurement instrument (parsing-based domain similarity + round-trip relay) that could be adapted to agent-trajectory drift detection. The deletion-vs-corruption split (Figure 7) further sharpens the trajectory-drift framing: frontier-model failures are *corruption*-mode, i.e., agents continue to act plausibly while content drifts off-spec. That is precisely the regime where naïve outcome-monitoring fails and where conformance-checking on agent actions becomes necessary.

3. **Challenges the "agentic harness solves it" optimism.** Several positions in the agentic-BPM literature (implicitly in [[sources/2026-dumas-agentic-bpms-pyramid|Dumas et al. 2026 (A-BPMS pyramid)]] and the [[concepts/agentic-bpm-pyramid|agentic BPM pyramid]] concept page) assume that wiring LLMs into tool harnesses (file ops, code execution, MCP servers) lifts performance to deployment-grade. Section 4.2 of DELEGATE-52 is a direct empirical counter-example under a basic harness: tools *worsen* performance by ~6% on average. This does not rule out gains from more sophisticated agent systems, but it shifts the burden of proof: claims that agentic harnessing improves reliability must now be empirically defended, not assumed.

4. **Practitioner-perspective alignment.** [[sources/2025-vu-practitioner-perspectives-agent-governance|Vu et al. 2025]] reports that BPM practitioners (none of whom had agentic-AI hands-on experience at the time) anticipated *configurable autonomy* — restrict autonomy in high-risk areas, expand in low-risk ones. DELEGATE-52 supplies an objective signal for which areas count as "high-risk": natural-language and niche domains where reconstruction collapses below 70% within 20 interactions. The benchmark could plausibly inform a domain-risk taxonomy for ABPM governance frameworks.

## Connections

**Concepts:** [[concepts/agentic-bpm]] · [[concepts/agentic-bpm-pyramid]] · [[concepts/framed-autonomy]] · [[concepts/concept-drift]] · [[concepts/ai-agent-benchmarks]] · [[concepts/agent-process-observability]] · [[concepts/abps-autonomy-levels]] · [[concepts/conformance-checking]]

**Entities (potentially new):** Philippe Laban, Tobias Schnabel, Jennifer Neville (all Microsoft Research) — *not yet pages in this wiki*.

**Related sources:** [[sources/2025-calvanese-autonomy-business-process-execution]] · [[sources/2026-dumas-agentic-bpms-pyramid]] · [[sources/2025-vu-practitioner-perspectives-agent-governance]] · [[sources/2026-calvanese-agentic-bpm-manifesto]] · [[sources/2025-fournier-agentic-ai-process-observability]]

**Syntheses:** [[syntheses/study-sketch-agent-trajectory-drift]] (primary load-bearing connection) · [[syntheses/apm-manifesto-core-messages]] · [[syntheses/llm-bpm-reading-list]]

## Cited from

Not yet cited from any other wiki page (newly ingested 2026-04-27). Suggested back-links to add in a follow-up sweep:
- [[syntheses/study-sketch-agent-trajectory-drift]] — empirical evidence for H2 (sparse, discontinuous drift signature) and a candidate measurement instrument.
- [[concepts/agentic-bpm-pyramid]] — empirical bound on the "agents execute reliably" assumption underlying the pyramid.
- [[concepts/framed-autonomy]] — domain-risk taxonomy could inform frame-tightening defaults.

## Cited by

None yet (newly ingested 2026-04-27).

## Open questions raised by the source

- Can the parsing-based similarity functions be repurposed as **online process-conformance metrics** for agent trajectories on document-editing workflows in production BPM stacks?
- Does the deletion-vs-corruption split persist for *non-editing* delegated work (planning, communication, retrieval) where DELEGATE-52's methodology cannot be applied directly?
- What does a multi-turn (instruction-sharded, clarification-allowed) extension of DELEGATE-52 measure, and how much of the 25–50% corruption is recoverable by adding human-in-the-loop micro-interventions?
- Given that RS@20 rose from 14.7 for GPT 4o (Nov 2024) to 71.5 for GPT 5.4 (Mar 2026), a span of 16 months, what is the expected timeline to ≥98% RS@20 across the 41 domains in which even the best model (Gemini 3.1 Pro) is not yet "ready"? Linear extrapolation is unsafe, but the question is operationally relevant for any BPM team planning agentic deployment.
- Is the round-trip-relay method extensible to **end-to-end business-process execution traces** (not just document edits) — e.g., by treating a process as a sequence of state transformations with declared inverses where they exist?