---
title: "LitLLMs, LLMs for Literature Review: Are we there yet?"
type: source
tags: [literature-review, llm, retrieval-augmented-generation, generation, related-work]
authors: [Agarwal Shubham, Sahu Gaurav, Puri Abhay, Laradji Issam H., Dvijotham Krishnamurthy DJ, Stanley Jason, Charlin Laurent, Pal Christopher]
year: 2024
venue: "Transactions on Machine Learning Research (TMLR), 12/2024; arXiv:2412.15249"
kind: paper
raw_path: "raw/Literature Review Methodology/LMs for Literature Review- Are we there yet.pdf"
arxiv: "2412.15249v2"
project: "https://litllm.github.io"
created: 2026-04-20
updated: 2026-04-20
key_claims:
  - Literature-review writing decomposes cleanly into two LLM-tractable sub-tasks - (1) retrieving related work for a query abstract and (2) generating a related-work section from retrieved papers.
  - A two-step retrieval (LLM extracts keywords from an abstract, then queries Google/Semantic Scholar + document embeddings) combined with LLM re-ranking doubles normalised recall versus naive embedding search.
  - Combining keyword-based and embedding-based search improves precision by 10% and recall by 30% over either method alone.
  - A plan-then-generate approach (LLM first outputs a sentence-level plan for which paper to cite where, then executes the plan) reduces hallucinated references by 18-26% versus simpler LLM generation pipelines.
  - Attribution prompting (asking the LLM to justify why each ranked paper is relevant) improves re-ranking reliability and transparency.
  - A rolling arXiv test-set protocol (use only the most recent month of submissions) is proposed to avoid test-set contamination when evaluating newly released LLMs zero-shot.
---

# Agarwal et al. 2024 — LitLLMs: LLMs for literature review, are we there yet?

TMLR 2024 paper from ServiceNow Research and Mila that decomposes literature-review writing into **retrieval + generation** and evaluates LLM-based techniques for each sub-task. Code and a demo are available from the project page at `litllm.github.io`.

## Why it matters here

The most operationally concrete of the LLM-for-literature-review papers: it specifies a full pipeline (keyword extraction → dual retrieval → re-ranking → plan → generation) and reports ablations showing which techniques pay off. The **plan-then-generate** pattern is the strongest hallucination-reduction finding so far.

## Pipeline

1. **Keyword extraction** — LLM takes an abstract or research-idea paragraph and emits search keywords.
2. **Retrieval** — two-track: LLM-generated keywords into Google Scholar / Semantic Scholar, *plus* embedding-based similarity over abstracts. Combining both beats either alone.
3. **Re-ranking** — LLM scores candidate papers against the query abstract and attributes its reasoning to specific excerpts; debate-style aggregation between multiple LLM passes is also explored.
4. **Plan** — optional sentence-level outline ("cite paper X at sentence 3 to justify Y") generated by LLM, user, or hybrid.
5. **Generation** — LLM writes the related-work section conditioned on query abstract, top-*k* papers, and plan.
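
The five stages above can be sketched as a small orchestration function. This is a minimal sketch under stated assumptions, not the authors' code: `Paper`, `merge_candidates`, `run_pipeline`, and every injected callable are hypothetical names, and the LLM and search backends are passed in as plain functions so that nothing here assumes a particular API. Sorting by a numeric score stands in for the paper's prompt-based re-ranking with attribution.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class Paper:
    """Hypothetical record for a retrieved candidate."""
    paper_id: str
    title: str
    abstract: str


def merge_candidates(keyword_hits: Iterable[Paper],
                     embedding_hits: Iterable[Paper]) -> list[Paper]:
    """Union of both retrieval tracks, de-duplicated by paper_id.

    The first occurrence of each id is kept, so keyword hits win ties.
    """
    seen: dict[str, Paper] = {}
    for paper in list(keyword_hits) + list(embedding_hits):
        seen.setdefault(paper.paper_id, paper)
    return list(seen.values())


def run_pipeline(
    query_abstract: str,
    extract_keywords: Callable[[str], list[str]],      # assumed LLM call
    keyword_search: Callable[[list[str]], list[Paper]],  # assumed search API wrapper
    embedding_search: Callable[[str], list[Paper]],    # assumed vector search
    score: Callable[[str, Paper], float],              # assumed LLM relevance score
    make_plan: Callable[[str, list[Paper]], str],      # assumed LLM (or user) plan
    generate: Callable[[str, list[Paper], str], str],  # assumed LLM writer
    top_k: int = 5,
) -> str:
    keywords = extract_keywords(query_abstract)                   # 1. keyword extraction
    candidates = merge_candidates(keyword_search(keywords),       # 2. two-track retrieval
                                  embedding_search(query_abstract))
    ranked = sorted(candidates,                                   # 3. re-ranking
                    key=lambda p: score(query_abstract, p),
                    reverse=True)
    top = ranked[:top_k]
    plan = make_plan(query_abstract, top)                         # 4. plan
    return generate(query_abstract, top, plan)                    # 5. generation
```

Injecting the stages as callables keeps each one independently swappable, which mirrors how the note describes the ablations: a stage can be replaced by a trivial stub (for example, a `make_plan` that returns an empty string) to compare plan-based and plan-free runs.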

## Findings

- Keyword-from-abstract extraction outperforms using the raw abstract as a query.
- Attribution-prompted re-ranking improves ROUGE and human-judged quality.
- **Plan-based generation reduces hallucinated citations by 18-26%** vs. plan-free baselines — the headline result for scientific writing use.
- Rolling arXiv protocol (most-recent-month papers) is proposed as a contamination-resistant evaluation design.
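
The hallucination finding can be made concrete with a simple check: a citation counts as hallucinated when the generated text cites a reference that was not among the papers supplied to the model. The sketch below assumes numeric bracket citations such as `[2]` or `[1, 3]`; the citation format, the function name, and the exact metric definition are illustrative assumptions, not the paper's evaluation code.

```python
import re

# Matches bracketed numeric citations such as "[2]" or "[1, 3]" (assumed format).
CITATION_GROUP = re.compile(r"\[(\d+(?:\s*,\s*\d+)*)\]")


def hallucinated_citation_rate(generated_text: str, provided_ids: set[int]) -> float:
    """Fraction of distinct cited ids that were not in the provided set.

    Returns 0.0 when the text contains no citations at all.
    """
    cited: set[int] = set()
    for group in CITATION_GROUP.findall(generated_text):
        cited.update(int(number) for number in group.split(","))
    if not cited:
        return 0.0
    return len(cited - provided_ids) / len(cited)
```

Running this over plan-based and plan-free outputs for the same query and the same top-*k* papers gives a like-for-like comparison of how often each variant cites something it was never given.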

## Limits

Abstract-only input means the method yields preliminary, early-stage references rather than a final curated bibliography. Evaluation relies on ROUGE and human judgement over arXiv ML papers; transfer to other domains (medicine, BPM) is untested. The paper does not address the inclusion/exclusion-criteria rigour or quality assessment required by [[sources/2007-kitchenham-slr-guidelines|Kitchenham]].

## Connections
- [[methods/systematic-literature-review]] — positioning: this is a narrative / related-work pipeline, not a full SLR.
- [[concepts/llm-assisted-literature-review]] — hub; this paper covers the retrieval and writing phases.
- [[entities/shubham-agarwal]] — first author.
- Contrasts with screening-centric [[sources/2024-dennstaedt-llm-title-abstract-screening]] and meta-empirical [[sources/2025-scherbakov-llms-as-tools-literature-reviews]].