---
title: "Are ChatGPT and large language models 'the answer' to bringing us closer to systematic review automation?"
type: source
tags: [literature-review, systematic-review, llm, chatgpt, automation, commentary]
authors: [Qureshi Riaz, Shaughnessy Daniel, Gill Kayden A. R., Robinson Karen A., Li Tianjing, Agai Eitan]
year: 2023
venue: "Systematic Reviews 12:72 (BMC)"
kind: paper
raw_path: "raw/Literature Review Methodology/Are ChatGPT and llms the answer.pdf"
doi: "10.1186/s13643-023-02243-z"
created: 2026-04-20
updated: 2026-04-20
key_claims:
  - ChatGPT and LLMs show promise for aiding systematic review (SR) tasks but the technology is in its infancy and needs substantial development before unsupervised use.
  - ChatGPT output looks authoritative ('uncanny valley') but is often erroneous and requires active vetting by domain experts; this prerequisite expertise defeats the purpose of intelligent automation.
  - Useful for contextualising a review question, drafting eligibility criteria, starter title screening, and outlining code for search/meta-analysis; unsuitable for producing verifiable search strategies (controlled-vocabulary fabrication) or reliable synthesis.
  - A particularly strong limitation is that ChatGPT cannot reference or verify real literature - it predicts plausible text rather than retrieving real sources.
  - Non-deterministic output means the same prompt yields different responses, hampering reproducibility - a core SLR requirement.
  - LLMs may eventually polish drafts and high-level summaries for experts revising their own writing, but cannot currently be used with confidence for any SR step unattended.
---

# Qureshi et al. 2023 — Are ChatGPT and LLMs the answer to SR automation?

A commentary in *Systematic Reviews* documenting the PICO Portal webinar of 6 February 2023, where the authors tested ChatGPT (GPT-3.5, with a minor GPT-4 comparison) against systematic-review tasks: review-question formulation, eligibility criteria, PubMed search strategy, meta-analysis code outline, and summarisation of three abstracts.

## Why it matters here

The earliest widely cited commentary taking the position that **LLMs are not yet a drop-in replacement** for any SLR stage and will need content experts as an always-on safety net. A grounding counterweight to over-optimistic framings of [[concepts/llm-assisted-literature-review]]. Uses [[sources/2007-kitchenham-slr-guidelines|Kitchenham-style]] SR phases implicitly (via PICO, eligibility, search, extraction, synthesis).

## Findings

**Where ChatGPT was acceptable as a starter**:
- Formulating a structured review question and contextualising PICO.
- Drafting eligibility criteria (needs refinement).
- Screening three abstracts for relevance (promising but with errors).
- Outlining Python/R code for a meta-analysis (needs software-engineering expertise to debug).
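
To make the meta-analysis item concrete, here is a minimal Python sketch of fixed-effect inverse-variance pooling, the sort of outline a reviewer would still need to check and debug. It is not code from the commentary or from ChatGPT; the function name `fixed_effect_pool` and the example numbers are assumptions for illustration, and a real analysis would also need to assess heterogeneity and consider a random-effects model.

```python
import math

# Two-sided 95% critical value of the standard normal distribution.
Z_95 = 1.959963984540054


def fixed_effect_pool(effects, std_errors):
    """Pool per-study effects using inverse-variance (fixed-effect) weights.

    effects     -- per-study effect estimates (e.g. log odds ratios)
    std_errors  -- matching per-study standard errors (must be > 0)
    Returns (pooled_effect, pooled_se, (ci_low, ci_high)).
    """
    if not effects or len(effects) != len(std_errors):
        raise ValueError("effects and std_errors must be non-empty and equal length")
    if any(se <= 0 for se in std_errors):
        raise ValueError("standard errors must be positive")

    # Each study is weighted by the reciprocal of its variance.
    weights = [1.0 / se ** 2 for se in std_errors]
    total_weight = sum(weights)

    pooled = sum(w * y for w, y in zip(weights, effects)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    ci = (pooled - Z_95 * pooled_se, pooled + Z_95 * pooled_se)
    return pooled, pooled_se, ci


if __name__ == "__main__":
    # Hypothetical inputs, not data from any real study.
    pooled, se, (low, high) = fixed_effect_pool([1.0, 3.0], [1.0, 0.5])
    print(f"pooled={pooled:.3f} se={se:.3f} 95% CI=({low:.3f}, {high:.3f})")
```

Even for a sketch this small, checking that the weights, pooled standard error, and interval are right requires the statistical expertise the commentary says cannot be skipped.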

**Where ChatGPT failed**:
- PubMed search strategy — fabricated MeSH/controlled-vocabulary terms that would silently break recall.
- Factual referencing — cannot verify sources; hallucinates citations.
- Synthesis / summary of multiple studies — the most essential product of an SR, and the one with the least trustworthy output.
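
One way a reviewer might guard against the fabricated-vocabulary failure is to check every MeSH-tagged phrase in a drafted search string against a list of headings they have verified themselves. The sketch below is an assumption-laden illustration, not a method from the commentary: `unverified_mesh_terms` is a hypothetical helper, the verified set must be supplied by the reviewer (for example, copied from the MeSH browser), only double-quoted phrases followed by a `[mesh...]` or `[mh...]` tag are recognised, and the second heading in the example is invented to play the role of a fabricated term.

```python
import re

# Matches a double-quoted phrase followed by a tag starting with "mesh" or "mh",
# e.g. "Term"[MeSH], "Term"[MeSH Terms], "Term"[mh]. Unquoted terms are ignored.
MESH_TAGGED = re.compile(r'"([^"]+)"\s*\[(?:mesh|mh)[^\]]*\]', re.IGNORECASE)


def unverified_mesh_terms(search_string, verified_headings):
    """Return MeSH-tagged phrases that are absent from the reviewer's verified set.

    search_string      -- drafted PubMed-style query text
    verified_headings  -- iterable of headings the reviewer has confirmed exist
    Comparison is case-insensitive; order of first appearance is preserved.
    """
    verified = {heading.casefold() for heading in verified_headings}
    tagged = MESH_TAGGED.findall(search_string)
    return [term for term in tagged if term.casefold() not in verified]


if __name__ == "__main__":
    # Hypothetical draft query; the second heading is invented for illustration.
    draft = (
        '"Diabetes Mellitus, Type 2"[MeSH] '
        'AND "Glycemic Self Regulation"[MeSH Terms] '
        'AND metformin[tiab]'
    )
    checked_by_reviewer = {"Diabetes Mellitus, Type 2"}
    for term in unverified_mesh_terms(draft, checked_by_reviewer):
        print(f"Needs manual verification: {term}")
```

A check like this only flags terms for a human to look up. It does not decide whether a heading is appropriate for the review question, which remains a content-expert judgement.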

## The "uncanny valley" diagnosis

From a distance the output mimics expert SR writing; on inspection it is not expertly formed. Reviewers without domain expertise cannot tell the difference — which inverts the value proposition of automation.

## Connections
- [[methods/systematic-literature-review]] — baseline procedure ChatGPT is tested against.
- [[concepts/llm-assisted-literature-review]] — hub page.
- [[entities/riaz-qureshi]] — first author.
- Companion evidence: [[sources/2024-dennstaedt-llm-title-abstract-screening]] (empirical screening benchmark), [[sources/2024-agarwal-litllms-are-we-there-yet]] (retrieval+generation pipeline), [[sources/2025-scherbakov-llms-as-tools-literature-reviews]] (LLM-assisted SR of LLM-assisted SRs).