---
title: "The emergence of large language models as tools in literature reviews: a large language model-assisted systematic review"
type: source
tags: [literature-review, systematic-review, llm, meta-review, automation, biomedical]
authors: [Scherbakov Dmitry, Hubig Nina, Jansari Vinita, Bakumenko Alexander, Lenert Leslie A.]
year: 2025
venue: "Journal of the American Medical Informatics Association (JAMIA) 32(6):1071–1086"
kind: paper
raw_path: "raw/Literature Review Methodology/large language models as tools in literature reviews.pdf"
doi: "10.1093/jamia/ocaf063"
created: 2026-04-20
updated: 2026-04-20
key_claims:
  - From 3788 articles retrieved in PubMed/Scopus/Dimensions/Google Scholar (June 2024), 172 LLM-assisted review-automation studies were eligible; 26 (15.1%) were actual reviews that acknowledged LLM usage, the rest methodological.
  - GPT/ChatGPT dominates usage (126 of the 172 eligible studies, 73.2%); BERT-based models are second (18.6%); LLaMA/Alpaca, Claude and Gemini trail.
  - The most frequently automated stages are Searching for publications (34.9%) and Data extraction (31.4%), followed by Title and abstract screening (25.0%), Evidence synthesis/summarisation (18.6%), Drafting (12.8%), Full-text screening (8.1%) and Quality/bias assessment (7.0%).
  - The authors' own review used LLM assistance via a Covidence plugin built around GPT-4o; in data extraction the LLM reached 83.0% precision and 86.0% recall, outperforming BERT baselines.
  - Rule-based and pre-LLM ML systems (SVM, Naive Bayes, logistic regression) previously showed 40–50% workload reduction while maintaining ≥95% recall; LLMs qualitatively expand these capabilities.
  - Automation bias (over-reliance on automated suggestions) is a documented adoption risk; independent human consensus plus LLM as a third reviewer is a promising pattern.
  - Numeric-data extraction accuracy remains a weakness; current LLMs are closer to production for categorical or textual extraction.
---

# Scherbakov et al. 2025 — LLMs as tools in literature reviews (LLM-assisted SR)

A JAMIA *review* that is both (a) a **systematic review of LLM-assisted review automation** and (b) an **applied demonstration** — the authors used an LLM-augmented Covidence workflow to perform the review itself. Published May 2025 (advance access) — the most recent of the four LLM-era papers in this batch.

## Why it matters here

Supplies a **landscape map** of which SR stages are being automated and how well, plus a self-demonstrating pipeline (Covidence + GPT-4o plugin) that operationalises the patterns advocated by [[sources/2023-qureshi-chatgpt-sr-automation]] (always keep a human), [[sources/2024-dennstaedt-llm-title-abstract-screening]] (screening as Likert classifier) and [[sources/2024-agarwal-litllms-are-we-there-yet]] (retrieval + generation).

## Coverage of the field (172 eligible studies)

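**By review type automated**: Systematic Review (68.6%), Literature/Narrative Review (21.5%), Meta-Analysis (11.0%), Scoping Review (4.7%), Umbrella Review (1.2%). The shares sum to more than 100%, presumably because a study can address more than one review type.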

**By stage automated**: Searching 34.9%, Data extraction 31.4%, Title/abstract screening 25.0%, Evidence synthesis 18.6%, Drafting 12.8%, Full-text screening 8.1%, Quality/bias assessment 7.0%.
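
The stage shares can be turned back into approximate study counts with a few lines of arithmetic. The sketch below assumes every percentage uses the 172 eligible studies as its denominator (an assumption of this note, not something stated in the list above), so the counts are rounded estimates. It also shows that the shares sum to well over 100%, which is consistent with one study automating several stages.

```python
# Stage shares as listed in this note (percent of eligible studies).
N_ELIGIBLE = 172  # assumed denominator for every share

stage_share_pct = {
    "Searching": 34.9,
    "Data extraction": 31.4,
    "Title/abstract screening": 25.0,
    "Evidence synthesis": 18.6,
    "Drafting": 12.8,
    "Full-text screening": 8.1,
    "Quality/bias assessment": 7.0,
}


def approx_count(pct, n=N_ELIGIBLE):
    """Convert a percentage share into a rounded study count."""
    return round(pct * n / 100)


stage_counts = {stage: approx_count(pct) for stage, pct in stage_share_pct.items()}
total_pct = round(sum(stage_share_pct.values()), 1)

for stage, count in stage_counts.items():
    print(f"{stage}: ~{count} studies")
print(f"Sum of shares: {total_pct}%")
```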

**By model family**: GPT/ChatGPT 73.2%, BERT-family 18.6%, LLaMA/Alpaca 4.7%, Claude 4.1%, Gemini 2.9%.

## Demonstration pipeline

- Covidence plugin wraps OpenAI GPT-4o via Azure; Python/R intermediary passes content between Covidence and the LLM.
- Three automated stages: **abstract screening**, **full-text screening**, **extraction**. Each uses 2 calibrated human reviewers plus an LLM whose decision is the majority vote over 3 inference runs (self-consistency).
- Human-LLM comparison: when the 2 humans agree, their decision is the human consensus, which is compared against the LLM vote; disagreement between the two reveals LLM false positives/negatives.
- Extraction validated by a single human; categories with low precision (<80%) reassigned to a human.
- The LLM drafted ~40% of the Introduction, ~90% of the Results and ~30% of the Discussion; the drafts were subsequently edited.
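
The voting and comparison logic in the bullets above can be sketched in a few lines. This is a minimal illustration and not the authors' plugin code: the function names, the `include`/`exclude` labels and the `human_conflict` outcome are all hypothetical, and the sketch assumes binary screening decisions so that 3 runs always yield a majority.

```python
from collections import Counter


def majority_vote(run_labels):
    """Return the label chosen by most runs (assumes an odd number of binary votes)."""
    return Counter(run_labels).most_common(1)[0][0]


def compare_with_humans(reviewer_a, reviewer_b, llm_runs):
    """Classify one record by comparing the LLM vote with the human consensus."""
    if reviewer_a != reviewer_b:
        # No human consensus yet; resolve the conflict before judging the LLM.
        return "human_conflict"
    llm_label = majority_vote(llm_runs)
    if llm_label == reviewer_a:
        return "agree"
    # The LLM disagrees with the human consensus.
    return "llm_false_positive" if llm_label == "include" else "llm_false_negative"


# Hypothetical record: both humans exclude, two of three LLM runs include.
print(compare_with_humans("exclude", "exclude", ["include", "include", "exclude"]))
```

Keeping the human conflict as its own outcome means the LLM is only scored against records where a human consensus actually exists.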

## Performance

- GPT-4o extraction: **mean precision 83.0% (SD 10.4), mean recall 86.0% (SD 9.8)** against an expert gold standard; outperforms the BERT baselines of the pre-LLM era.
- Numeric extraction shows lower accuracy and is flagged as a specific weakness.
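
For reference, a mean and SD of this kind can be computed from per-category validation counts, and the same counts can drive an 80% precision cut-off for handing categories back to a human. The sketch below uses invented counts and hypothetical category names purely for illustration; none of the numbers come from the paper.

```python
from statistics import mean, stdev

# Hypothetical validation counts per extraction category: (TP, FP, FN).
validation = {
    "study_design": (45, 5, 5),
    "model_family": (40, 10, 8),
    "sample_size": (30, 20, 15),  # numeric field, made weaker for illustration
}

PRECISION_THRESHOLD = 0.80  # categories below this go back to a human


def precision(tp, fp, fn):
    """Share of extracted values that were correct."""
    return tp / (tp + fp)


def recall(tp, fp, fn):
    """Share of gold-standard values that were extracted."""
    return tp / (tp + fn)


precisions = {cat: precision(*c) for cat, c in validation.items()}
recalls = {cat: recall(*c) for cat, c in validation.items()}

mean_precision = mean(precisions.values())
sd_precision = stdev(precisions.values())  # sample standard deviation
mean_recall = mean(recalls.values())
sd_recall = stdev(recalls.values())

needs_human = [cat for cat, p in precisions.items() if p < PRECISION_THRESHOLD]

print(f"precision: mean {mean_precision:.3f}, SD {sd_precision:.3f}")
print(f"recall: mean {mean_recall:.3f}, SD {sd_recall:.3f}")
print("reassign to human:", needs_human)
```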

## Connections
- [[methods/systematic-literature-review]] — covers all [[sources/2007-kitchenham-slr-guidelines|Kitchenham]] stages via LLM assistance.
- [[concepts/llm-assisted-literature-review]] — hub; this paper consolidates the batch.
- [[entities/dmitry-scherbakov]] — first author.
- Applies patterns from [[sources/2024-dennstaedt-llm-title-abstract-screening]] (screening) and [[sources/2024-agarwal-litllms-are-we-there-yet]] (retrieval + writing).