---
title: "The emergence of large language models as tools in literature reviews: a large language model-assisted systematic review"
type: source
tags: [literature-review, systematic-review, llm, meta-review, automation, biomedical]
authors: [Scherbakov Dmitry; Hubig Nina; Jansari Vinita; Bakumenko Alexander; Lenert Leslie A.]
year: 2025
venue: "Journal of the American Medical Informatics Association (JAMIA) 32(6):1071–1086"
kind: paper
raw_path: "raw/Literature Review Methodology/large language models as tools in literature reviews.pdf"
doi: "10.1093/jamia/ocaf063"
created: 2026-04-20
updated: 2026-04-20
key_claims:
  - From 3788 articles retrieved in PubMed/Scopus/Dimensions/Google Scholar (June 2024), 172 LLM-assisted review-automation studies were eligible; 26 (15.1%) were actual reviews that acknowledged LLM usage, the rest methodological.
  - GPT/ChatGPT dominates usage (n = 126, 73.2% of the 172 eligible studies); BERT-based models are second (18.6%); LLaMA/Alpaca, Claude, Gemini trail.
  - The most automated stages are Searching for publications (34.9%) and Data extraction (31.4%); Title and abstract screening (25.0%), Evidence synthesis/summarisation (18.6%), Drafting (12.8%), Full-text screening (8.1%), Quality/bias assessment (7.0%) are less covered.
  - The authors' own review used LLM assistance via a Covidence plugin built around GPT-4o; LLM achieved 83.0% precision and 86.0% recall in data extraction (vs. BERT baselines).
  - Rule-based and pre-LLM ML systems (SVM, Naive Bayes, logistic regression) previously showed 40–50% workload reduction while maintaining ≥95% recall; LLMs qualitatively expand these capabilities.
  - Automation bias (over-reliance on automated suggestions) is a documented adoption risk; independent human consensus plus LLM as a third reviewer is a promising pattern.
  - Numeric-data extraction accuracy remains a weakness; current LLMs are closer to production for categorical or textual extraction.
---

# Scherbakov et al. 2025: LLMs as tools in literature reviews (LLM-assisted SR)

A JAMIA *review* that is both (a) a **systematic review of LLM-assisted review automation** and (b) an **applied demonstration**: the authors used an LLM-augmented Covidence workflow to perform the review itself. Published May 2025 (advance access), it is the most recent of the four LLM-era papers in this batch.

## Why it matters here

Supplies a **landscape map** of which SR stages are being automated and how well, plus a self-demonstrating pipeline (Covidence + GPT-4o plugin) that operationalises the patterns advocated by [[sources/2023-qureshi-chatgpt-sr-automation]] (always keep a human), [[sources/2024-dennstaedt-llm-title-abstract-screening]] (screening as Likert classifier) and [[sources/2024-agarwal-litllms-are-we-there-yet]] (retrieval + generation).

## Coverage of the field (172 eligible studies)

**By review type automated**: Systematic Review (68.6%), Literature/Narrative Review (21.5%), Meta-Analysis (11.0%), Scoping Review (4.7%), Umbrella (1.2%).

**By stage automated**: Searching 34.9%, Data extraction 31.4%, Title/abstract screening 25.0%, Evidence synthesis 18.6%, Drafting 12.8%, Full-text screening 8.1%, Quality/bias assessment 7.0%.

**By model family**: GPT/ChatGPT 73.2%, BERT-family 18.6%, LLaMA/Alpaca 4.7%, Claude 4.1%, Gemini 2.9%.

## Demonstration pipeline

- Covidence plugin wraps OpenAI GPT-4o via Azure; a Python/R intermediary passes content between Covidence and the LLM.
- Three automated stages: **abstract screening**, **full-text screening**, **extraction**. Each uses 2 calibrated human reviewers plus the LLM, whose vote is the majority over 3 inference runs (self-consistency).
- Human-LLM comparison: when the 2 humans agree, that agreement is the human consensus; it is compared against the LLM vote, and disagreements reveal LLM false positives and false negatives.
- Extraction validated by a single human; low-precision categories (<80%) reassigned to humans.
- LLM drafted ~40% of Introduction, ~90% of Results, ~30% of Discussion; all subsequently edited.

## Performance

- GPT-4o extraction: **precision 83.0% (SD 10.4), recall 86.0% (SD 9.8)** against expert gold standard; outperforms BERT baselines from pre-LLM era.
- Numeric extraction has lower accuracy and is flagged as a specific weakness.

## Connections

- [[methods/systematic-literature-review]]: covers all [[sources/2007-kitchenham-slr-guidelines|Kitchenham]] stages via LLM assistance.
- [[concepts/llm-assisted-literature-review]]: hub; this paper consolidates the batch.
- [[entities/dmitry-scherbakov]]: first author.
- Applies patterns from [[sources/2024-dennstaedt-llm-title-abstract-screening]] (screening) and [[sources/2024-agarwal-litllms-are-we-there-yet]] (retrieval + writing).
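
## Sketch of the voting and validation logic

The demonstration pipeline described in this note combines a majority vote over 3 LLM runs, a two-human consensus used as the reference, and a precision threshold (<80%) for handing extraction categories back to humans. The Python sketch below illustrates that logic only; it is not the authors' plugin code. All function names, the `include`/`exclude` labels, the record layout, and the example numbers are hypothetical, and the LLM outputs are hard-coded rather than fetched from Covidence or Azure OpenAI.

```python
from collections import Counter


def majority_vote(runs):
    """Most common label across repeated LLM runs (self-consistency).

    With 3 runs and 2 possible labels a strict majority always exists.
    """
    label, _count = Counter(runs).most_common(1)[0]
    return label


def human_consensus(reviewer_a, reviewer_b):
    """Agreement between the two reviewers, or None if they conflict."""
    return reviewer_a if reviewer_a == reviewer_b else None


def precision_recall(pairs, positive="include"):
    """Precision and recall of the LLM label against the reference label.

    pairs: list of (reference_label, llm_label) tuples.
    """
    tp = sum(1 for ref, llm in pairs if ref == positive and llm == positive)
    fp = sum(1 for ref, llm in pairs if ref != positive and llm == positive)
    fn = sum(1 for ref, llm in pairs if ref == positive and llm != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def categories_for_human(precision_by_category, threshold=0.80):
    """Extraction categories whose precision falls below the threshold."""
    return sorted(c for c, p in precision_by_category.items() if p < threshold)


# Hypothetical screening records: two human decisions plus 3 LLM runs each.
records = [
    {"h1": "include", "h2": "include", "runs": ["include", "include", "exclude"]},
    {"h1": "exclude", "h2": "exclude", "runs": ["include", "include", "exclude"]},
    {"h1": "include", "h2": "include", "runs": ["exclude", "exclude", "include"]},
    {"h1": "include", "h2": "exclude", "runs": ["include", "include", "include"]},
    {"h1": "exclude", "h2": "exclude", "runs": ["exclude", "exclude", "exclude"]},
]

# Keep only records where the humans agree, then pair with the LLM vote.
pairs = []
for rec in records:
    consensus = human_consensus(rec["h1"], rec["h2"])
    if consensus is not None:
        pairs.append((consensus, majority_vote(rec["runs"])))

precision, recall = precision_recall(pairs)

# Hypothetical per-category extraction precision, for the reassignment step.
flagged = categories_for_human(
    {"study_design": 0.91, "sample_size": 0.62, "model_family": 0.79}
)

print(precision, recall, flagged)
```

Records where the two humans disagree are skipped here because no reference label exists for them; in practice they would go to conflict resolution first. The choice to treat `include` as the positive class is likewise an assumption of this sketch.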