---
title: "LitLLMs, LLMs for Literature Review: Are we there yet?"
type: source
tags: [literature-review, llm, retrieval-augmented-generation, generation, related-work]
authors: [Agarwal Shubham; Sahu Gaurav; Puri Abhay; Laradji Issam H.; Dvijotham Krishnamurthy DJ; Stanley Jason; Charlin Laurent; Pal Christopher]
year: 2024
venue: "Transactions on Machine Learning Research (TMLR), 12/2024; arXiv:2412.15249"
kind: paper
raw_path: "raw/Literature Review Methodology/LMs for Literature Review- Are we there yet.pdf"
arxiv: "2412.15249v2"
project: "https://litllm.github.io"
created: 2026-04-20
updated: 2026-04-20
key_claims:
  - Literature-review writing decomposes cleanly into two LLM-tractable sub-tasks - (1) retrieving related work for a query abstract and (2) generating a related-work section from retrieved papers.
  - A two-step retrieval (LLM extracts keywords from an abstract, then queries Google/Semantic Scholar + document embeddings) combined with LLM re-ranking doubles normalised recall versus naive embedding search.
  - Combining keyword-based and embedding-based search improves precision +10% and recall +30% over either alone.
  - A plan-then-generate approach (LLM first outputs a sentence-level plan for which paper to cite where, then executes the plan) reduces hallucinated references by 18-26% versus simpler LLM generation pipelines.
  - Attribution prompting (asking the LLM to justify why each ranked paper is relevant) improves re-ranking reliability and transparency.
  - A rolling arXiv test-set protocol (use only the most recent month of submissions) is proposed to avoid test-set contamination when evaluating newly released LLMs zero-shot.
---

# Agarwal et al. 2024: LitLLMs, LLMs for literature review, are we there yet?

TMLR 2024 paper from ServiceNow Research + Mila that decomposes literature-review writing into **retrieval + generation** and evaluates a pipeline of LLM-based tricks over each sub-task.
The project page ships code and a demo at `litllm.github.io`.

## Why it matters here

The most operationally concrete of the LLM-for-literature-review papers: it produces a pipeline specification (keyword extraction → dual retrieval → re-ranking → plan → generation) with ablations showing which tricks pay off. The **plan-then-generate** pattern is the strongest hallucination-reduction finding so far.

## Pipeline

1. **Keyword extraction**: the LLM takes an abstract or research-idea paragraph and emits search keywords.
2. **Retrieval**: two-track, with LLM-generated keywords sent to Google Scholar / Semantic Scholar *plus* embedding-based similarity over abstracts. Combining both beats either alone.
3. **Re-ranking**: the LLM scores candidate papers against the query abstract and attributes its reasoning to specific excerpts; debate-style aggregation between multiple LLM passes is also explored.
4. **Plan**: an optional sentence-level outline ("cite paper X at sentence 3 to justify Y") generated by LLM, user, or hybrid.
5. **Generation**: the LLM writes the related-work section conditioned on query abstract, top-*k* papers, and plan.

## Findings

- Keyword-from-abstract extraction outperforms using the raw abstract as a query.
- Attribution-prompted re-ranking improves ROUGE and human-judged quality.
- **Plan-based generation reduces hallucinated citations by 18-26%** vs. plan-free baselines, the headline result for scientific writing use.
- Rolling arXiv protocol (most-recent-month papers) is proposed as a contamination-resistant evaluation design.

## Limits

Abstract-only input means the method gives early-stage preliminary references, not a final curated bibliography. Evaluation relies on ROUGE and human judgement on arXiv ML papers; transfer to other domains (medicine, BPM) is untested. The paper does not address the [[sources/2007-kitchenham-slr-guidelines|Kitchenham]] inclusion/exclusion-criteria rigour or quality assessment.
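
The retrieval, re-ranking, and hallucination-check ideas from the pipeline above can be illustrated with a small Python sketch. This is not the authors' implementation: the function names, the toy relevance scores standing in for the LLM re-ranker, and the `[@key]` citation syntax are all assumptions made for illustration.

```python
import re
from typing import Callable


def merge_candidates(keyword_hits: list[str], embedding_hits: list[str]) -> list[str]:
    """Union the two retrieval tracks, keeping first-seen order and dropping duplicates."""
    seen: set[str] = set()
    merged: list[str] = []
    for paper_id in keyword_hits + embedding_hits:
        if paper_id not in seen:
            seen.add(paper_id)
            merged.append(paper_id)
    return merged


def rerank(candidates: list[str], score: Callable[[str], float], k: int) -> list[str]:
    """Keep the k highest-scoring candidates; `score` is a stand-in for an LLM re-ranker."""
    return sorted(candidates, key=score, reverse=True)[:k]


def hallucinated_citations(text: str, allowed: set[str]) -> set[str]:
    """Return cited keys (hypothetical [@key] syntax) that are not among the supplied papers."""
    cited = set(re.findall(r"\[@([\w-]+)\]", text))
    return cited - allowed


# Toy data: paper ids and scores are invented for the example.
keyword_hits = ["p1", "p2", "p3"]
embedding_hits = ["p3", "p4"]
toy_scores = {"p1": 0.2, "p2": 0.9, "p3": 0.7, "p4": 0.4}

candidates = merge_candidates(keyword_hits, embedding_hits)
top_k = rerank(candidates, lambda pid: toy_scores[pid], k=2)

draft = "Prior work [@p2] builds on [@p3], extending [@p9]."
unsupported = hallucinated_citations(draft, set(top_k))
```

Here `unsupported` contains only `p9`, the one key cited in the draft that was never supplied to the generator. A check of this kind is one plausible way to count hallucinated references when comparing plan-based and plan-free drafts; the paper's exact metric may differ.
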

## Connections

- [[methods/systematic-literature-review]]: positioning; this is a narrative / related-work pipeline, not a full SLR.
- [[concepts/llm-assisted-literature-review]]: hub; this paper covers the retrieval and writing phases.
- [[entities/shubham-agarwal]]: first author.
- Contrasts with screening-centric [[sources/2024-dennstaedt-llm-title-abstract-screening]] and meta-empirical [[sources/2025-scherbakov-llms-as-tools-literature-reviews]].