---
title: "Are ChatGPT and large language models 'the answer' to bringing us closer to systematic review automation?"
type: source
tags: [literature-review, systematic-review, llm, chatgpt, automation, commentary]
authors: [Qureshi Riaz; Shaughnessy Daniel; Gill Kayden A. R.; Robinson Karen A.; Li Tianjing; Agai Eitan]
year: 2023
venue: "Systematic Reviews 12:72 (BMC)"
kind: paper
raw_path: "raw/Literature Review Methodology/Are ChatGPT and llms the answer.pdf"
doi: "10.1186/s13643-023-02243-z"
created: 2026-04-20
updated: 2026-04-20
key_claims:
  - ChatGPT and LLMs show promise for aiding systematic review (SR) tasks but the technology is in its infancy and needs substantial development before unsupervised use.
  - ChatGPT output looks authoritative ('uncanny valley') but is often erroneous and requires active vetting by domain experts; this pre-requisite expertise defeats the purpose of intelligent automation.
  - Useful for contextualising a review question, drafting eligibility criteria, starter title screening, and outlining code for search/meta-analysis; unsuitable for producing verifiable search strategies (controlled-vocabulary fabrication) or reliable synthesis.
  - A particularly strong limitation is that ChatGPT cannot reference or verify real literature - it predicts plausible text rather than retrieving real sources.
  - Non-deterministic output means the same prompt yields different responses, hampering reproducibility - a core SLR requirement.
  - LLMs may eventually polish drafts and high-level summaries for experts revising their own writing, but cannot currently be used with confidence for any SR step unattended.
---

# Qureshi et al. 2023 — Are ChatGPT and LLMs the answer to SR automation?

A commentary in *Systematic Reviews* documenting the PICO Portal webinar of 6 February 2023, where the authors tested ChatGPT (GPT-3.5, with minor GPT-4 comparison) against systematic-review tasks: review-question formulation, eligibility criteria, PubMed search strategy, meta-analysis code outline, and summarisation of three abstracts.

## Why it matters here

The earliest widely cited commentary taking the position that **LLMs are not yet a drop-in replacement** for any SLR stage and will need content experts as an always-on safety net. A grounding counterweight to over-optimistic framings of [[concepts/llm-assisted-literature-review]]. Uses [[sources/2007-kitchenham-slr-guidelines|Kitchenham-style]] SR phases implicitly (via PICO, eligibility, search, extraction, synthesis).

## Findings

**Where ChatGPT was acceptable as a starter**:

- Formulating a structured review question and contextualising PICO.
- Drafting eligibility criteria (needs refinement).
- Screening three abstracts for relevance (promising but with errors).
- Outlining Python/R code for a meta-analysis (needs SE expertise to debug).

**Where ChatGPT failed**:

- PubMed search strategy — fabricated MeSH/controlled-vocabulary terms that would silently break recall.
- Factual referencing — cannot verify sources; hallucinates citations.
- Synthesis / summary of multiple studies — the most essential product of an SR, and the one with the least trustworthy output.

## The "uncanny valley" diagnosis

From a distance the output mimics expert SR writing; on inspection it is not expertly formed. Reviewers without domain expertise cannot tell the difference — which inverts the value proposition of automation.

## Connections

- [[methods/systematic-literature-review]] — baseline procedure ChatGPT is tested against.
- [[concepts/llm-assisted-literature-review]] — hub page.
- [[entities/riaz-qureshi]] — first author.
- Companion evidence: [[sources/2024-dennstaedt-llm-title-abstract-screening]] (empirical screening benchmark), [[sources/2024-agarwal-litllms-are-we-there-yet]] (retrieval+generation pipeline), [[sources/2025-scherbakov-llms-as-tools-literature-reviews]] (LLM-assisted SR of LLM-assisted SRs).
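
The search-strategy failure noted under Findings (fabricated MeSH terms) suggests one cheap guard a reviewer could apply before running an LLM-drafted query. The following is a minimal Python sketch and is not taken from Qureshi et al.: the helper names `extract_mesh_terms` and `unverified_terms`, the sample query, and the `vetted` set are all hypothetical. The vetted headings are assumed to come from an information specialist who has checked them against the MeSH browser, and the parser only recognises quoted terms tagged `[mh]` or `[MeSH Terms]`.

```python
import re

# Matches quoted terms followed by a MeSH field tag, e.g. "Hypertension"[mh].
# Simplification (assumption): unquoted terms and other field tags are ignored.
_MESH_TAGGED = re.compile(r'"([^"]+)"\s*\[(?:mh|mesh terms)\]', re.IGNORECASE)


def extract_mesh_terms(query: str) -> list[str]:
    """Return every quoted term in the query that carries a MeSH field tag."""
    return [match.group(1) for match in _MESH_TAGGED.finditer(query)]


def unverified_terms(query: str, vetted_headings: set[str]) -> list[str]:
    """Return MeSH-tagged terms that are absent from the reviewer-vetted set."""
    vetted = {heading.casefold() for heading in vetted_headings}
    return [t for t in extract_mesh_terms(query) if t.casefold() not in vetted]


# Hypothetical LLM-drafted query; the second term is invented for illustration.
query = (
    '("Hypertension"[mh] OR "Blood Pressure Elevation Syndrome"[MeSH Terms]) '
    'AND "Adult"[mh]'
)
# Hypothetical set of headings a human has already confirmed.
vetted = {"Hypertension", "Adult"}

print(unverified_terms(query, vetted))
# prints ['Blood Pressure Elevation Syndrome']
```

A check like this only flags terms for human review; it does not confirm that the remaining terms are the right ones for the review question, so it does not remove the need for the expert vetting the commentary calls for.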