---
title: LLM-assisted Literature Review
type: concept
tags: [literature-review, llm, screening, extraction, synthesis, automation]
sources:
  - "[[sources/2023-qureshi-chatgpt-sr-automation]]"
  - "[[sources/2024-agarwal-litllms-are-we-there-yet]]"
  - "[[sources/2024-dennstaedt-llm-title-abstract-screening]]"
  - "[[sources/2025-scherbakov-llms-as-tools-literature-reviews]]"
created: 2026-04-20
updated: 2026-04-20
---

# LLM-assisted Literature Review

Shared concept tying together the four 2023–2025 papers that ask **what can LLMs do inside the [[methods/systematic-literature-review|SLR]] workflow?** The state of the art is uneven across stages: LLMs are usable as first-pass tooling for screening, retrieval and extraction, but unreliable for search-strategy construction and unsupervised synthesis.

## Stage-by-stage capability map

| SLR stage (Kitchenham) | LLMs: can | LLMs: cannot | Evidence |
|---|---|---|---|
| Question formulation / PICOC | Drafting, contextualising | Judging importance to practice | [[sources/2023-qureshi-chatgpt-sr-automation]] |
| Search strategy | None | Construct valid MeSH/controlled-vocabulary queries (fabricates terms) | [[sources/2023-qureshi-chatgpt-sr-automation]] |
| Retrieval | Keyword extraction from abstract → external search; combined embedding + keyword retrieval beats either alone | Ground retrieval in real corpora without a retriever | [[sources/2024-agarwal-litllms-are-we-there-yet]] |
| Title/abstract screening | Likert-score classifier; ~82% sensitivity, ~75% specificity (Mixtral) | Reach production specificity alone | [[sources/2024-dennstaedt-llm-title-abstract-screening]], [[sources/2025-scherbakov-llms-as-tools-literature-reviews]] |
| Full-text screening | Covered in Scherbakov pipeline with 3-run majority vote | Handle reasoning over long documents without document-structure awareness | [[sources/2025-scherbakov-llms-as-tools-literature-reviews]] |
| Quality / bias assessment | None established (only 7.0% of the 172 studies cover this stage) | Apply domain-specific rubrics | [[sources/2025-scherbakov-llms-as-tools-literature-reviews]] |
| Data extraction | Categorical/textual fields: GPT-4o ~83% precision, 86% recall | Numeric-data extraction (low precision) | [[sources/2025-scherbakov-llms-as-tools-literature-reviews]] |
| Synthesis / narrative | Drafting outlines, summarising 3–5 abstracts, plan-then-generate related work | Unsupervised multi-study synthesis; 18–26% hallucinated references without plans | [[sources/2024-agarwal-litllms-are-we-there-yet]], [[sources/2023-qureshi-chatgpt-sr-automation]] |
| Drafting / reporting | Sections of Introduction/Results/Discussion (40%/90%/30% in Scherbakov), code outlines | Final, verifiable text without expert editing | [[sources/2025-scherbakov-llms-as-tools-literature-reviews]] |

## Cross-cutting findings

1. **Human-in-the-loop is mandatory.** All four papers converge on an assisted-not-replaced posture. Automation bias (over-trusting LLM output) is a named adoption risk.
2. **Prompt sensitivity is a reproducibility risk.** Minor rewording and scale changes shifted Dennstädt 2024's numbers considerably; Kitchenham-style reproducibility requires that prompts be treated as part of the protocol.
3. **Non-determinism breaks SLR reproducibility.** Mitigation: fix seeds where possible, run N inference passes and take the majority vote (Scherbakov uses 3).
4. **GPT/ChatGPT dominates** published usage (73% of 126 architectures mapped in Scherbakov), but open Mixtral-class models can match proprietary models on screening at much lower operational cost.
5. **Plan-then-generate** is the headline hallucination-control pattern for writing stages (18–26% fewer fabricated citations).
6. **Most automated stages** are Search (35%) and Extraction (31%); the least automated are Quality/bias assessment (7%) and Full-text screening (8%).

## Integration pattern (distilled)

LLM as **third reviewer** inside a conventional SLR workflow:

- Two human reviewers calibrated on a subset → human consensus.
- LLM vote = majority of N (≥3) self-consistency runs on a fixed prompt.
- Human consensus compared to LLM vote; disagreements resolved by a senior reviewer.
- Low-precision extraction fields (<80%) are reassigned to a human reviewer.

## Relation to Kitchenham

This concept extends, but does **not replace**, [[methods/systematic-literature-review]]. The protocol, inclusion/exclusion criteria, quality instrument and audit trail remain Kitchenham-shaped; LLMs accelerate specific steps inside that frame.

## Open questions

- Transferability of biomedical-domain screening numbers to BPM / software engineering corpora is not established.
- No benchmark exists for LLM quality/bias assessment; only 7% of published automation projects touch this stage.
- Numeric-data extraction remains unsolved for meta-analytic use.
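
As a rough illustration of the third-reviewer integration pattern described earlier, the aggregation logic could be sketched as follows. This is a minimal sketch under stated assumptions, not the method of any of the cited papers: the names `ScreeningRecord`, `llm_vote`, `triage` and `fields_for_humans`, the binary include/exclude labels, and the record layout are all hypothetical. Only the N ≥ 3 majority vote, the escalation of disagreements, and the 80% precision floor come from the pattern itself.

```python
from collections import Counter
from dataclasses import dataclass

# Assumption: screening decisions are binary labels.
LABELS = {"include", "exclude"}
# Precision floor taken from the integration pattern (<80% goes to humans).
PRECISION_FLOOR = 0.80


@dataclass
class ScreeningRecord:
    """Hypothetical container for one screened paper."""
    record_id: str
    human_consensus: str   # decision agreed by the two human reviewers
    llm_runs: list         # one decision per self-consistency run


def llm_vote(run_decisions):
    """Return the majority label over N runs of the same fixed prompt."""
    n = len(run_decisions)
    if n < 3 or n % 2 == 0:
        # An odd N >= 3 guarantees a strict majority for binary labels.
        raise ValueError("use an odd number of runs, at least 3")
    if not set(run_decisions) <= LABELS:
        raise ValueError("unexpected label in run decisions")
    label, _count = Counter(run_decisions).most_common(1)[0]
    return label


def triage(records):
    """Split record ids into (agreed, escalated to a senior reviewer)."""
    agreed, escalated = [], []
    for rec in records:
        if llm_vote(rec.llm_runs) == rec.human_consensus:
            agreed.append(rec.record_id)
        else:
            escalated.append(rec.record_id)
    return agreed, escalated


def fields_for_humans(precision_by_field):
    """List extraction fields whose measured precision is below the floor."""
    return sorted(
        name for name, precision in precision_by_field.items()
        if precision < PRECISION_FLOOR
    )


if __name__ == "__main__":
    # Illustrative data only; these are not results from any cited study.
    demo = [
        ScreeningRecord("rec-1", "include", ["include", "include", "exclude"]),
        ScreeningRecord("rec-2", "exclude", ["include", "include", "exclude"]),
    ]
    print(triage(demo))
    print(fields_for_humans({"country": 0.91, "sample_size": 0.62}))
```

Keeping the vote, the comparison and the field reassignment as separate functions means each decision point can be written into the review protocol and audited on its own, in line with the Kitchenham-style audit trail.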