Evaluating large language models for renal colic imaging recommendations: a comparative analysis of Gemini, Copilot, and ChatGPT-4.0

dc.contributor.author: Yiğit, Y
dc.contributor.author: Özbek, AE
dc.contributor.author: Doğru, B
dc.contributor.author: Günay, S
dc.contributor.author: Alkahlout, B
dc.date.accessioned: 2026-03-31T13:21:19Z
dc.date.available: 2026-03-31T13:21:19Z
dc.date.issued: 2025
dc.description.abstract: Background: The field of natural language processing (NLP) has evolved significantly since its inception in the 1950s, with large language models (LLMs) now playing a crucial role in addressing medical challenges. Objectives: This study evaluates the alignment of three prominent LLMs (Gemini, Copilot, and ChatGPT-4.0) with expert consensus on imaging recommendations for acute flank pain. Methods: A total of 29 clinical vignettes representing different combinations of age, sex, pregnancy status, likelihood of stone disease, and alternative diagnoses were posed to the three LLMs between March and April 2024. Responses were compared to the consensus recommendations of a multispecialty panel. The primary outcome was the rate of LLM responses matching the majority consensus. Secondary outcomes included alignment with consensus-rated perfect (9/9) or excellent (8/9) responses and agreement with any of the nine panel members. Results: Gemini aligned with the majority consensus in 65.5% of cases, compared to 41.4% for both Copilot and ChatGPT-4.0. In scenarios rated as perfect or excellent by the consensus, Gemini showed 69.5% agreement, significantly higher than Copilot and ChatGPT-4.0, both at 43.4% (p = 0.045 and < 0.001, respectively). Overall, Gemini demonstrated an agreement rate of 82.7% with any of the nine reviewers, indicating superior capability in addressing complex medical inquiries. Conclusion: Gemini consistently outperformed Copilot and ChatGPT-4.0 in aligning with expert consensus, suggesting its potential as a reliable tool in clinical decision-making. Further research is needed to enhance the reliability and accuracy of LLMs and to address the ethical and legal challenges associated with their integration into healthcare systems.
dc.identifier.doi: 10.1186/s12245-025-00895-3
dc.identifier.issn: 1865-1372
dc.identifier.issn: 1865-1380
dc.identifier.issue: 1
dc.identifier.pmid: 40615804
dc.identifier.uri: http://dx.doi.org/10.1186/s12245-025-00895-3
dc.identifier.uri: https://hdl.handle.net/11491/9682
dc.identifier.volume: 18
dc.identifier.wos: WOS:001523047300001
dc.language.iso: en
dc.publisher: BMC
dc.relation.ispartof: INT J EMERG MED
dc.subject: Large Language models (LLMs)
dc.subject: Natural Language processing (NLP)
dc.subject: Renal colic
dc.subject: Imaging recommendations
dc.subject: Gemini
dc.subject: Copilot
dc.subject: ChatGPT-4.0
dc.title: Evaluating large language models for renal colic imaging recommendations: a comparative analysis of Gemini, Copilot, and ChatGPT-4.0
dc.type: Article
