Detailed summary
The literature search finds that models such as GPT-3.5 and GPT-4 significantly reduce the time and effort required for title and abstract screening in systematic reviews while maintaining sensitivity comparable to that of human reviewers, though specificity can vary, as shown in key studies [1, 2, 3].
Key findings
- Efficiency and Workload Reduction: LLMs, particularly GPT-3.5 and GPT-4, cut screening duration from months to hours without compromising sensitivity, though specificity can vary, sometimes leading to higher false-positive rates [1, 2, 4, 6].
- Performance Metrics: studies report sensitivity closely aligned with that of human reviewers and moderate specificity, emphasizing the role of refined prompts and model fine-tuning for optimal performance [5, 9, 12].
- Practical Integration: many studies advocate hybrid models that combine human oversight with AI tools, pairing efficiency with reliability and pointing to reduced cognitive load and improved screening systems [3, 10, 15].
- Quantitative Evaluations: most studies employed quantitative measures such as F1 score and ROC analysis to validate model effectiveness; prompt engineering and zero-shot learning emerged as crucial strategies for enhancing performance (see the sketch after this list) [11, 17, 31].
- Variability and Optimization: while model performance varies, research highlights customized prompt strategies and prospective studies as pivotal for optimizing models on domain-specific tasks [7, 13, 22].
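To make these metrics concrete, the sketch below scores a set of hypothetical screening decisions with scikit-learn; the labels and confidence scores are illustrative, not data from any cited study.

```python
# Scoring illustrative title/abstract screening decisions.
# Labels and scores are made up; scikit-learn is assumed to be installed.
from sklearn.metrics import f1_score, recall_score, roc_auc_score

human  = [1, 1, 0, 0, 1, 0, 0, 0, 1, 0]   # 1 = include, 0 = exclude
model  = [1, 1, 1, 0, 1, 0, 0, 1, 1, 0]   # model's binary decisions
scores = [0.9, 0.8, 0.6, 0.2, 0.7, 0.1, 0.3, 0.55, 0.95, 0.05]  # model confidence

sensitivity = recall_score(human, model)               # recall on relevant records
specificity = recall_score(human, model, pos_label=0)  # recall on irrelevant records
f1  = f1_score(human, model)
auc = roc_auc_score(human, scores)                     # threshold-free ROC analysis

print(f"sensitivity={sensitivity:.2f} specificity={specificity:.2f} "
      f"F1={f1:.2f} ROC-AUC={auc:.2f}")
```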
Overall, this body of research confirms the potential of LLMs in systematic reviews: AI-driven screening delivers substantial efficiency gains, provided that specificity and integration challenges are addressed.
Categories of papers
- Studies using quantitative methods to assess recall, precision, and F1 score for GPT models in systematic review screening.
- References: [1, 2, 5, 12, 16]
- Details: [1] evaluates GPT-3.5 with accuracy and specificity statistics; [2] uses sensitivity optimization with a layered approach; [5] assesses sensitivity and specificity of GPT models with meta-prompts; [12] calculates precision, recall, and F1 score for ChatGPT; [16] measures precision, recall, and F1 score in the context of hallucinations.
- Studies directly comparing GPT models against manual or traditional systematic review methods (see the sketch after this category).
- References: [3, 4, 6, 9, 19]
- Details: [3] compares ChatGPT to traditional classifiers; [4] evaluates sensitivity and specificity against manual methods; [6] uses real-world datasets for direct performance comparisons with human reviewers; [9] evaluates GPT-3.5 Turbo acting as a single reviewer; [19] tests ChatGPT's screening accuracy in gastroenterology systematic reviews.
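One common way to quantify model-versus-human agreement in comparisons like these is Cohen's kappa, which corrects raw agreement for chance. A minimal sketch with illustrative labels, assuming scikit-learn:

```python
# Chance-corrected agreement between model and human screening decisions.
# Labels are illustrative placeholders, not data from the cited studies.
from sklearn.metrics import cohen_kappa_score, confusion_matrix

human = [1, 0, 0, 1, 1, 0, 0, 0, 1, 0]   # 1 = include, 0 = exclude
model = [1, 0, 1, 1, 0, 0, 0, 0, 1, 0]

kappa = cohen_kappa_score(human, model)   # 1.0 = perfect, 0.0 = chance-level
tn, fp, fn, tp = confusion_matrix(human, model).ravel()
print(f"kappa={kappa:.2f}  TP={tp} FP={fp} FN={fn} TN={tn}")
```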
- Studies demonstrating how GPT models reduce time and effort compared to traditional methods (see the sketch after this category).
- References: [1, 7, 10, 15, 21]
- Details: [1] highlights screening time reduced from months to hours; [7] outlines a prospective study of efficiency and accuracy improvements; [10] evaluates GPT-4 for reducing manual screening workload; [15] proposes hybrid approaches for workload reduction; [21] demonstrates rapid abstract screening capabilities.
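As a back-of-the-envelope illustration of the workload claim, the sketch below estimates reviewer hours saved when a model's exclusions are trusted; the corpus size, exclusion rate, and per-abstract time are hypothetical assumptions, not figures from the cited studies.

```python
# Rough estimate of human screening hours avoided by model pre-screening.
# All numbers are hypothetical placeholders.
def hours_saved(n_abstracts: int, excluded_fraction: float,
                seconds_per_abstract: float = 60.0) -> float:
    """Hours of human screening avoided if model exclusions are trusted."""
    return n_abstracts * excluded_fraction * seconds_per_abstract / 3600.0

# e.g., 10,000 abstracts, model confidently excludes 70%, ~1 minute each:
print(f"{hours_saved(10_000, 0.70):.0f} reviewer-hours saved")  # ~117 hours
```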
- Studies examining how prompt design affects LLM efficiency and accuracy (see the sketch after this category).
- References: [5, 11, 17, 18, 26]
- Details: [5] discusses optimized prompt development for GPT models; [11] explores how different prompting techniques affect performance; [17] leverages question-answering frameworks with optimized prompts; [18] employs LLMs with customized prompts across the entire review process; [26] uses zero-shot queries with calibrated prompts.
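To make the prompt-design discussion concrete, here is a minimal zero-shot screening call in the general style these studies describe. It assumes the OpenAI Python client; the inclusion criteria, model name, and one-word answer format are illustrative choices, not taken from any cited paper.

```python
# A minimal zero-shot title/abstract screening prompt (illustrative sketch).
# Assumes the OpenAI Python client with OPENAI_API_KEY set in the environment.
from openai import OpenAI

client = OpenAI()

CRITERIA = """Include: randomized controlled trials in adults.
Exclude: animal studies, case reports, non-English articles."""  # placeholder criteria

def screen_abstract(title: str, abstract: str, model: str = "gpt-4") -> str:
    """Ask the model for a binary include/exclude decision on one record."""
    prompt = (
        "You are screening titles and abstracts for a systematic review.\n"
        f"Criteria:\n{CRITERIA}\n\n"
        f"Title: {title}\nAbstract: {abstract}\n\n"
        "Answer with exactly one word: INCLUDE or EXCLUDE."
    )
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # deterministic decisions aid reproducibility
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip().upper()
```

Constraining the output to a single-word answer makes downstream parsing trivial and makes disagreements with human reviewers easy to audit.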
- Studies focused on deploying models in practice with empirical data (see the sketch after this category).
- References: [6, 8, 14, 15, 20]
- Details: [6] applies GPT models to clinical review datasets; [8] evaluates LLMs' efficiency in practice; [14] integrates GPT-4 into real-world scoping reviews; [15] demonstrates hybrid model applications in systematic review processes; [20] discusses ChatGPT for complete systematic review automation in practice.
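A hybrid workflow of the kind these deployment studies describe can be sketched as a simple routing rule: only high-confidence model exclusions are set aside automatically, and everything else goes to a human reviewer. The threshold and record format below are assumptions for illustration.

```python
# A minimal human-in-the-loop routing sketch for hybrid screening.
# The confidence threshold and record format are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Decision:
    record_id: str
    include: bool      # model's binary call
    confidence: float  # model's self-reported or calibrated confidence

def route(decisions: list[Decision], threshold: float = 0.95):
    """Auto-exclude only high-confidence negatives; send the rest to humans."""
    auto_excluded = [d for d in decisions
                     if not d.include and d.confidence >= threshold]
    needs_human = [d for d in decisions
                   if d.include or d.confidence < threshold]
    return auto_excluded, needs_human

decisions = [
    Decision("rec-1", include=False, confidence=0.99),  # safe to set aside
    Decision("rec-2", include=True,  confidence=0.80),  # humans verify includes
    Decision("rec-3", include=False, confidence=0.60),  # too uncertain: human
]
auto, human = route(decisions)
print(len(auto), "auto-excluded;", len(human), "to human review")
```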
Together, these categories cover performance evaluation with quantitative metrics, comparison against traditional methods, workload reduction, prompt engineering, and real-world applicability, providing a comprehensive picture of current capabilities and implications for systematic review screening.
Timeline and citation network
Early Exploration (2019-2022): Initial studies explored the feasibility of using AI models, particularly LLMs, for systematic review screening, focusing on replacing traditional manual methods with automated systems to handle the growing volume of literature [6].
Integration of LLMs (2023): By 2023, research had shifted to specific LLMs such as GPT-3.5 and GPT-4, with studies increasingly evaluating these models' ability to manage systematic review processes using quantitative metrics such as precision, recall, and specificity. A consensus emerged that LLMs could significantly reduce time and workload, albeit with caveats around specificity [3, 4, 5].
Refinement and Comparative Studies (2024): The most recent work refines models with advanced prompting strategies and conducts more robust comparative analyses against traditional methods, with growing attention to fine-tuning and prompt engineering to maximize efficiency and accuracy [1, 7, 9].
Viet-Thi Tran's Group: This group appears especially active, contributing studies on diagnostic accuracy and workload reduction using LLMs such as GPT-3.5 Turbo, including evaluations of sensitivity and specificity for systematic reviews [4, 9].
T. Oami and Collaborators: Their contributions explore LLMs' potential across various domains, such as citation screening for clinical guidelines, often employing comprehensive prospective study designs [7, 13, 22].
medRxiv and arXiv Authors: A significant portion of the exploratory and early-adoption work, published as preprints on medRxiv and arXiv, has been pivotal in expanding the methodological foundations and real-world application scenarios for LLMs in systematic reviews [1, 4, 5].
These clusters trace an evolving focus, from feasibility studies and initial trials to more sophisticated application and optimization of LLMs, supported by a community of consistent contributors driving innovation at the intersection of AI and literature review methodology.