---
title: "Title and abstract screening for literature reviews using large language models: an exploratory study in the biomedical domain"
type: source
tags: [literature-review, systematic-review, screening, llm, biomedical]
authors: [Dennstädt Fabio; Zink Johannes; Putora Paul Martin; Hastings Janna; Cihoric Nikola]
year: 2024
venue: "Systematic Reviews 13:158 (BMC)"
kind: paper
raw_path: "raw/Literature Review Methodology/Title and abstract screening for literature reviews.pdf"
doi: "10.1186/s13643-024-02575-4"
created: 2026-04-20
updated: 2026-04-20
key_claims:
  - An automated classifier built around a prompt = [Instruction] + [Title] + [Abstract] + [Relevant Criteria] that asks the LLM for a 1–5 (or 1–10) Likert relevance score can be turned into an inclusion/exclusion classifier by thresholding.
  - Performance varies dramatically across openly available LLMs - on ten published biomedical SLR datasets, sensitivity/specificity was 94.48%/31.78% (FlanT5), 97.58%/19.12% (OpenHermes-NeuralChat), 81.93%/75.19% (Mixtral), 97.58%/38.34% (Platypus 2).
  - On the authors' newly created dataset, 100% sensitivity was achievable at specificities of 12.58% / 4.54% / 62.47% / 24.74% for the four models - i.e. Mixtral dominated.
  - Minor prompt changes (rephrasing the instruction, changing the Likert range 1–5 vs 1–10) had considerable impact on performance, underlining prompt-sensitivity as a reproducibility risk.
  - LLM title/abstract screening is feasible as a first-pass filter at the cost of reviewer-set specificity; human second-pass screening remains necessary.
---

# Dennstädt et al. 2024: LLM title/abstract screening (biomedical)

*Systematic Reviews* 2024 research article benchmarking **four openly available LLMs** on title-and-abstract screening, across ten published biomedical SLR datasets plus one newly created dataset. Arguably one of the more rigorous single-stage SLR-automation benchmarks published to date.
## Why it matters here

Provides an **empirical yardstick** for when LLM-assisted screening is workable and which architectures to prefer. Complements the sceptical commentary of [[sources/2023-qureshi-chatgpt-sr-automation]] with real sensitivity/specificity numbers. Mixtral's 81.93% sensitivity / 75.19% specificity on the published datasets suggests an open MoE model can reach a usable screening balance without proprietary APIs.

## Method

- **Prompt template**: `[Instruction] + "Title: " + title + ", Abstract: " + abstract + [Relevant Criteria]`. Relevant Criteria is hand-written per SLR.
- **Output**: the LLM is asked to emit a single integer on a Likert scale (default 1–5). A regex extracts the number.
- **Thresholding**: a score cutoff (e.g. ≥3 = include) converts the Likert output into a binary include/exclude decision.
- **Models tested**: FlanT5-XXL, OpenHermes-2.5-Neural-Chat-7B, Mixtral-8×7B-Instruct, Platypus 2.
- **Datasets**: 10 published biomedical SLR corpora + 1 new dataset constructed for this study.

## Results

| Model | Sensitivity (published sets) | Specificity (published sets) | Specificity at 100% sensitivity (new set) |
|---|---|---|---|
| FlanT5-XXL | 94.48% | 31.78% | 12.58% |
| OpenHermes-NeuralChat | 97.58% | 19.12% | 4.54% |
| **Mixtral-8×7B** | **81.93%** | **75.19%** | **62.47%** |
| Platypus 2 | 97.58% | 38.34% | 24.74% |

Most models give high sensitivity at the cost of low specificity, i.e. they err toward inclusion, which is the right direction for a screening tool but still imposes reviewer burden. Mixtral is the exception: on the published sets it trades lower sensitivity (81.93%) for much higher specificity (75.19%), and on the new set it keeps 100% sensitivity at the highest specificity of the four.

## Caveats

- **Prompt sensitivity**: changing "3." to "The relevance is 3." and changing the Likert range (1–5 vs 1–10) both change outcomes considerably; systematic prompt engineering is mandatory.
- No LLM reached production-grade specificity on most datasets; use as a first-pass filter, not as the sole reviewer.

## Connections

- [[methods/systematic-literature-review]]: screening is stage 6.2 ([[sources/2007-kitchenham-slr-guidelines|Kitchenham]] §6.2).
- [[concepts/llm-assisted-literature-review]]: hub.
- [[entities/fabio-dennstaedt]]: first author.
- Contrasts with [[sources/2023-qureshi-chatgpt-sr-automation]] (qualitative scepticism) and [[sources/2025-scherbakov-llms-as-tools-literature-reviews]] (meta-empirical mapping).
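
## Sketch: score-then-threshold pipeline

The steps listed under Method (prompt assembly, regex extraction of the Likert score, thresholding, and the sensitivity/specificity evaluation) can be sketched in Python as below. This is an illustration, not the authors' code. The function names, the newline separators between prompt parts, the decision to treat unparseable replies as "include", and the toy replies and labels are all assumptions. No LLM is called; the replies are hard-coded strings standing in for model output.

```python
import re
from typing import Optional, Sequence, Tuple


def build_prompt(instruction: str, title: str, abstract: str, criteria: str) -> str:
    """Assemble [Instruction] + title/abstract + [Relevant Criteria].

    The newline separators are an assumption; the note only gives the order.
    """
    return f"{instruction}\nTitle: {title}, Abstract: {abstract}\n{criteria}"


def extract_score(reply: str, low: int = 1, high: int = 5) -> Optional[int]:
    """Return the first integer in the reply that lies in [low, high], else None."""
    for match in re.finditer(r"\d+", reply):
        value = int(match.group())
        if low <= value <= high:
            return value
    return None


def classify(score: Optional[int], threshold: int = 3) -> bool:
    """Include when score >= threshold.

    Assumption: an unparseable reply (None) is included, so that a parsing
    failure cannot silently drop a relevant record.
    """
    return True if score is None else score >= threshold


def sensitivity_specificity(
    predicted: Sequence[bool], actual: Sequence[bool]
) -> Tuple[float, float]:
    """Compute (sensitivity, specificity) from include/exclude decisions."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    fn = sum((not p) and a for p, a in zip(predicted, actual))
    tn = sum((not p) and (not a) for p, a in zip(predicted, actual))
    fp = sum(p and (not a) for p, a in zip(predicted, actual))
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one relevant and one irrelevant record")
    return tp / (tp + fn), tn / (tn + fp)


if __name__ == "__main__":
    # Toy stand-ins for LLM replies and human reference labels (hypothetical data).
    replies = ["5", "The relevance is 4.", "2", "1.", "3", "no answer"]
    labels = [True, True, False, False, False, True]

    decisions = [classify(extract_score(r), threshold=3) for r in replies]
    sens, spec = sensitivity_specificity(decisions, labels)
    print(f"sensitivity={sens:.2f} specificity={spec:.2f}")
```

On the toy data, a cutoff of 3 keeps every relevant record but also admits one irrelevant one (sensitivity 1.00, specificity 0.67). Raising the cutoff to 4 removes that false positive. This is the same trade-off the Results table reports at scale: moving the threshold shifts work between the LLM and the human second-pass reviewer.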