---
title: LLM-assisted Literature Review
type: concept
tags: [literature-review, llm, screening, extraction, synthesis, automation]
sources: ["[[sources/2023-qureshi-chatgpt-sr-automation]]", "[[sources/2024-agarwal-litllms-are-we-there-yet]]", "[[sources/2024-dennstaedt-llm-title-abstract-screening]]", "[[sources/2025-scherbakov-llms-as-tools-literature-reviews]]"]
created: 2026-04-20
updated: 2026-04-20
---

# LLM-assisted Literature Review

Shared concept tying together the four 2023–2025 papers that ask **what can LLMs do inside the [[methods/systematic-literature-review|SLR]] workflow?** The state of the art is uneven across stages: usable first-pass tooling for screening, retrieval and extraction; unreliable for search-strategy construction and unsupervised synthesis.

## Stage-by-stage capability map

| SLR stage (Kitchenham) | LLMs: can | LLMs: cannot | Evidence |
|---|---|---|---|
| Question formulation / PICOC | Drafting, contextualising | Judging importance to practice | [[sources/2023-qureshi-chatgpt-sr-automation]] |
| Search strategy | — | Construct valid MeSH/controlled-vocabulary queries; fabricates terms | [[sources/2023-qureshi-chatgpt-sr-automation]] |
| Retrieval | Keyword extraction from abstract → external search; embedding + keyword combined beats either | Ground retrieval in real corpora without a retriever | [[sources/2024-agarwal-litllms-are-we-there-yet]] |
| Title/abstract screening | Likert-score classifier; ~82% sensitivity / ~75% specificity (Mixtral) | Reach production specificity alone | [[sources/2024-dennstaedt-llm-title-abstract-screening]], [[sources/2025-scherbakov-llms-as-tools-literature-reviews]] |
| Full-text screening | Covered in Scherbakov pipeline with 3-run majority vote | Handle reasoning over long documents without document-structure awareness | [[sources/2025-scherbakov-llms-as-tools-literature-reviews]] |
| Quality / bias assessment | — (only 7.0% of the 172 studies) | Apply domain-specific rubrics | [[sources/2025-scherbakov-llms-as-tools-literature-reviews]] |
| Data extraction | Categorical/textual fields: GPT-4o ~83% precision, 86% recall | Numeric-data extraction (low precision) | [[sources/2025-scherbakov-llms-as-tools-literature-reviews]] |
| Synthesis / narrative | Drafting outlines, summarising 3–5 abstracts, plan-then-generate related work | Unsupervised multi-study synthesis; 18–26% more hallucinated references without plans | [[sources/2024-agarwal-litllms-are-we-there-yet]], [[sources/2023-qureshi-chatgpt-sr-automation]] |
| Drafting / reporting | Sections of Introduction/Results/Discussion (40%/90%/30% in Scherbakov), code outlines | Final, verifiable text without expert editing | [[sources/2025-scherbakov-llms-as-tools-literature-reviews]] |
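
The screening row treats the LLM's Likert rating as a classifier score. A minimal sketch of how such ratings might be thresholded and scored against human labels follows; the function names, the 1–5 scale and the toy data are assumptions for illustration and are not taken from the cited papers.

```python
from typing import Sequence


def screen(ratings: Sequence[int], threshold: int) -> list[bool]:
    """Include a record when its Likert rating is at or above the threshold."""
    return [r >= threshold for r in ratings]


def sensitivity_specificity(
    predicted: Sequence[bool], actual: Sequence[bool]
) -> tuple[float, float]:
    """Return (sensitivity, specificity) against human include/exclude labels."""
    pairs = list(zip(predicted, actual))
    tp = sum(p and a for p, a in pairs)          # correctly included
    tn = sum(not p and not a for p, a in pairs)  # correctly excluded
    fn = sum(not p and a for p, a in pairs)      # relevant but missed
    fp = sum(p and not a for p, a in pairs)      # irrelevant but included
    sens = tp / (tp + fn) if tp + fn else 0.0
    spec = tn / (tn + fp) if tn + fp else 0.0
    return sens, spec


# Toy data (hypothetical): LLM ratings on a 1-5 scale and human labels.
ratings = [5, 4, 2, 3, 1, 3, 2, 5]
human = [True, True, False, True, False, False, False, True]

for threshold in (3, 4):
    sens, spec = sensitivity_specificity(screen(ratings, threshold), human)
    print(f"threshold={threshold}: sensitivity={sens:.2f}, specificity={spec:.2f}")
```

On this toy data, raising the threshold from 3 to 4 trades sensitivity for specificity, which is why the threshold and the scale belong in the review protocol alongside the prompt.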

## Cross-cutting findings

1. **Human-in-the-loop is mandatory.** All four papers converge on an assisted-not-replaced posture. Automation bias — over-trusting LLM output — is a named adoption risk.
2. **Prompt sensitivity is a reproducibility risk.** Minor rewording and scale changes shifted the results in Dennstädt 2024 considerably; Kitchenham-style reproducibility requires that prompts be treated as part of the protocol.
3. **Non-determinism breaks SLR reproducibility.** Mitigation: fix seeds where possible, run N inference passes and majority-vote (Scherbakov uses 3).
4. **GPT/ChatGPT dominates** published usage (73% of 126 architectures mapped in Scherbakov), but open Mixtral-class models can match proprietary models on screening at much lower operational cost.
5. **Plan-then-generate** is the headline hallucination-control pattern for writing stages (fabricated citations reduced by 18–26%).
6. **Most frequently automated stages** in published studies are Search (35%) and Extraction (31%); least frequently automated are Quality/bias assessment (7%) and Full-text screening (8%).
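
Findings 2 and 3 suggest recording the prompt and inference settings in the protocol itself. One way this could look is sketched below; `ScreeningProtocolEntry`, its fields and the model identifier are hypothetical, and none of the four papers prescribes this structure.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class ScreeningProtocolEntry:
    """Settings that would need to stay fixed for a repeatable LLM screening run."""

    model: str
    prompt_template: str
    likert_scale: tuple[int, int]
    n_runs: int
    temperature: float
    seed: Optional[int] = None  # not every inference API honours a seed

    def fingerprint(self) -> str:
        """SHA-256 over a canonical JSON form; any change yields a new value."""
        canonical = json.dumps(asdict(self), sort_keys=True, ensure_ascii=False)
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


entry = ScreeningProtocolEntry(
    model="example-model",  # hypothetical identifier
    prompt_template="Rate the relevance of this abstract from 1 to 5:\n{abstract}",
    likert_scale=(1, 5),
    n_runs=3,
    temperature=0.0,
    seed=42,
)

# A minor rewording produces a different fingerprint, so it cannot go unnoticed.
reworded = replace(entry, prompt_template="How relevant is this abstract (1-5)?\n{abstract}")

print(entry.fingerprint()[:12])
print(entry.fingerprint() != reworded.fingerprint())
```

Storing the fingerprint next to each screening decision would make it possible to tell, during audit, which decisions were produced under which prompt version.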

## Integration pattern (distilled)

LLM as **third reviewer** inside a conventional SLR workflow:
- Two human reviewers calibrated on a subset → human consensus.
- LLM vote = majority of N (≥3) self-consistency runs on a fixed prompt.
- Human consensus compared to LLM vote; disagreements resolved by a senior reviewer.
- Low-precision extraction fields (<80%) reassigned to human.
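
The four steps above can be sketched as follows. This is an illustration under stated assumptions: binary include/exclude decisions, an odd number of runs, and hypothetical names (`llm_vote`, `route`, `fields_for_humans`) and toy values that do not come from the cited papers.

```python
from typing import Mapping, Sequence


def llm_vote(runs: Sequence[bool]) -> bool:
    """Majority include/exclude decision over N self-consistency runs."""
    if len(runs) < 3 or len(runs) % 2 == 0:
        raise ValueError("use an odd number of runs, at least 3, to avoid ties")
    return sum(runs) * 2 > len(runs)


def route(human_consensus: bool, runs: Sequence[bool]) -> str:
    """Return 'include', 'exclude', or 'senior_review' on disagreement."""
    if llm_vote(runs) != human_consensus:
        return "senior_review"
    return "include" if human_consensus else "exclude"


def fields_for_humans(
    field_precision: Mapping[str, float], min_precision: float = 0.80
) -> list[str]:
    """Extraction fields whose measured precision falls below the cut-off."""
    return sorted(name for name, p in field_precision.items() if p < min_precision)


# Toy records (hypothetical): human consensus plus three LLM runs on a fixed prompt.
records = {
    "rec-1": (True, [True, True, False]),
    "rec-2": (False, [True, True, True]),
    "rec-3": (False, [False, True, False]),
}
decisions = {rid: route(consensus, runs) for rid, (consensus, runs) in records.items()}
print(decisions)

# Toy per-field precision values (hypothetical), measured on a calibration subset.
precision = {"study_design": 0.91, "sample_size": 0.62, "country": 0.88, "effect_size": 0.55}
print(fields_for_humans(precision))
```

Restricting the vote to an odd number of binary decisions keeps the majority well defined; a multi-class or Likert-valued vote would need an explicit tie-breaking rule.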

## Relation to Kitchenham

This concept extends — but does **not replace** — [[methods/systematic-literature-review]]. The protocol, inclusion/exclusion criteria, quality instrument and audit trail remain Kitchenham-shaped; LLMs accelerate specific steps inside that frame.

## Open questions
- Transferability of biomedical-domain screening numbers to BPM / software engineering corpora is not established.
- No benchmark for LLM quality/bias assessment — only 7% of published automation projects touch this stage.
- Numeric-data extraction remains unsolved for meta-analytic use.