---
title: "Title and abstract screening for literature reviews using large language models: an exploratory study in the biomedical domain"
type: source
tags: [literature-review, systematic-review, screening, llm, biomedical]
authors: [Dennstädt Fabio, Zink Johannes, Putora Paul Martin, Hastings Janna, Cihoric Nikola]
year: 2024
venue: "Systematic Reviews 13:158 (BMC)"
kind: paper
raw_path: "raw/Literature Review Methodology/Title and abstract screening for literature reviews.pdf"
doi: "10.1186/s13643-024-02575-4"
created: 2026-04-20
updated: 2026-04-20
key_claims:
  - An automated classifier built around a prompt = [Instruction] + [Title] + [Abstract] + [Relevant Criteria] that asks the LLM for a 1–5 (or 1–10) Likert relevance score can be turned into an inclusion/exclusion classifier by thresholding.
  - Performance varies dramatically across openly available LLMs - on ten published biomedical SLR datasets, sensitivity/specificity was 94.48%/31.78% (FlanT5), 97.58%/19.12% (OpenHermes-NeuralChat), 81.93%/75.19% (Mixtral), 97.58%/38.34% (Platypus 2).
  - On the authors' newly created dataset, 100% sensitivity was achievable at specificities of 12.58% (FlanT5), 4.54% (OpenHermes-NeuralChat), 62.47% (Mixtral), and 24.74% (Platypus 2), so Mixtral was clearly the strongest model.
  - Minor prompt changes (rephrasing the instruction, changing the Likert range 1–5 vs 1–10) had considerable impact on performance, underlining prompt-sensitivity as a reproducibility risk.
  - LLM title/abstract screening is feasible as a first-pass filter, but low specificity means many irrelevant records still reach reviewers; human second-pass screening remains necessary.
---

# Dennstädt et al. 2024 — LLM title/abstract screening (biomedical)

*Systematic Reviews* 2024 research article benchmarking **four openly available LLMs** on title-and-abstract screening, across ten published biomedical SLR datasets plus one newly created dataset. In this wiki's assessment, it is among the more rigorous benchmarks of a single SLR stage (screening) with open models.

## Why it matters here

Provides an **empirical yardstick** for when LLM-assisted screening is workable and which architectures to prefer. Complements the sceptical commentary of [[sources/2023-qureshi-chatgpt-sr-automation]] with real sensitivity/specificity numbers. Mixtral's 81.93%/75.19% on the published datasets suggests that an open mixture-of-experts (MoE) model can reach a usable sensitivity/specificity balance without proprietary APIs.

## Method

- **Prompt template**: `[Instruction] + "Title: " + title + ", Abstract: " + abstract + [Relevant Criteria]`. Relevant Criteria is hand-written per SLR.
- **Output**: LLM is asked to emit a single integer on a Likert scale (default 1–5). A regex extracts the number.
- **Thresholding**: a score cutoff (e.g. ≥3 = include) converts Likert output into a binary classifier.
- **Models tested**: FlanT5-XXL, OpenHermes-2.5-Neural-Chat-7B, Mixtral-8×7B-Instruct, Platypus 2.
- **Datasets**: 10 published biomedical SLR corpora + 1 new dataset constructed for this study.
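
The pipeline described in the list above can be sketched as follows. This is a minimal illustration, not the authors' code: the instruction wording, the function names (`build_prompt`, `extract_score`, `classify`), and the choice to include records whose reply cannot be parsed are all assumptions made for this note. The LLM call itself is omitted; `extract_score` takes whatever text the model returned.

```python
import re
from typing import Optional

# Hypothetical instruction text; the paper's exact wording is not reproduced here.
INSTRUCTION = (
    "Rate the relevance of the following publication to the criteria below "
    "on a scale from 1 (not relevant) to 5 (highly relevant). "
    "Answer with a single number. "
)


def build_prompt(title: str, abstract: str, criteria: str,
                 instruction: str = INSTRUCTION) -> str:
    """Assemble [Instruction] + title + abstract + [Relevant Criteria]."""
    return instruction + "Title: " + title + ", Abstract: " + abstract + criteria


def extract_score(reply: str, low: int = 1, high: int = 5) -> Optional[int]:
    """Return the first integer in the reply that lies on the Likert scale."""
    for match in re.finditer(r"\d+", reply):
        value = int(match.group())
        if low <= value <= high:
            return value
    return None  # no usable number found


def classify(score: Optional[int], threshold: int = 3) -> bool:
    """True means 'include'. Scores at or above the threshold are included."""
    if score is None:
        # Assumption: keep unparseable replies so no record is silently dropped.
        return True
    return score >= threshold
```

Raising `threshold` trades sensitivity for specificity, which is why the same set of scores can yield very different classifiers depending on the cutoff.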

## Results

| Model | Sensitivity (published datasets) | Specificity (published datasets) | Specificity at 100% sensitivity (new dataset) |
|---|---|---|---|
| FlanT5-XXL | 94.48% | 31.78% | 12.58% |
| OpenHermes-NeuralChat | 97.58% | 19.12% | 4.54% |
| **Mixtral-8×7B** | **81.93%** | **75.19%** | **62.47%** |
| Platypus 2 | 97.58% | 38.34% | 24.74% |

Most models give high sensitivity at the cost of low specificity: they err toward inclusion, which is the right direction for a screening tool but still imposes reviewer burden. Mixtral is the exception, with balanced performance, although its 81.93% sensitivity on the published datasets means roughly 18% of relevant records were missed there.
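
To make the two metrics concrete, the sketch below computes sensitivity and specificity for a given Likert cutoff and finds the strictest cutoff that still keeps every relevant record, which is the situation reported for the new dataset. The function names and the toy scores are hypothetical and are not data from the paper.

```python
from typing import Optional, Sequence, Tuple


def sensitivity_specificity(scores: Sequence[int], labels: Sequence[bool],
                            threshold: int) -> Tuple[float, float]:
    """labels[i] is True when record i is truly relevant."""
    tp = fn = tn = fp = 0
    for score, relevant in zip(scores, labels):
        included = score >= threshold
        if relevant and included:
            tp += 1
        elif relevant:
            fn += 1
        elif included:
            fp += 1
        else:
            tn += 1
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


def strictest_lossless_threshold(scores: Sequence[int],
                                 labels: Sequence[bool]) -> Optional[int]:
    """Highest cutoff that still includes every relevant record.

    A record is included when score >= cutoff, so the cutoff can rise
    only as far as the lowest score given to any relevant record.
    """
    relevant_scores = [s for s, relevant in zip(scores, labels) if relevant]
    return min(relevant_scores) if relevant_scores else None


# Toy example (invented scores): two relevant and four irrelevant records.
toy_scores = [5, 3, 4, 1, 2, 3]
toy_labels = [True, True, False, False, False, False]
cutoff = strictest_lossless_threshold(toy_scores, toy_labels)
sens, spec = sensitivity_specificity(toy_scores, toy_labels, cutoff)
```

In the toy data the cutoff is 3, which keeps both relevant records and excludes two of the four irrelevant ones. Because specificity can only stay equal or rise as the cutoff rises, this cutoff gives the best specificity available without losing a relevant record.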

## Caveats

- **Prompt sensitivity**: small wording changes (e.g. "3." versus "The relevance is 3.") and a different Likert range (1–5 versus 1–10) change outcomes considerably, so prompts need systematic testing rather than ad hoc wording.
- No LLM reached production-grade specificity on most datasets; use as first-pass, not sole reviewer.
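
One simple way to quantify the prompt-sensitivity caveat is to screen the same records with two prompt variants and count how often the include/exclude decision changes. The sketch below is this note's suggestion rather than a procedure taken from the paper, and `flip_rate` is a hypothetical helper name.

```python
from typing import Sequence


def flip_rate(decisions_a: Sequence[bool], decisions_b: Sequence[bool]) -> float:
    """Fraction of records whose decision differs between two prompt variants."""
    if len(decisions_a) != len(decisions_b):
        raise ValueError("both variants must screen the same records")
    if not decisions_a:
        raise ValueError("at least one record is required")
    flips = sum(a != b for a, b in zip(decisions_a, decisions_b))
    return flips / len(decisions_a)
```

A flip rate near zero suggests the classifier is stable under that rewording; a high value signals that reported sensitivity and specificity may not transfer to a differently worded prompt.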

## Connections
- [[methods/systematic-literature-review]] — screening is stage 6.2 ([[sources/2007-kitchenham-slr-guidelines|Kitchenham]] §6.2).
- [[concepts/llm-assisted-literature-review]] — hub.
- [[entities/fabio-dennstaedt]] — first author.
- Contrasts with [[sources/2023-qureshi-chatgpt-sr-automation]] (qualitative scepticism) and [[sources/2025-scherbakov-llms-as-tools-literature-reviews]] (meta-empirical mapping).