Evaluating large language models for renal colic imaging recommendations: a comparative analysis of Gemini, Copilot, and ChatGPT-4.0
| dc.contributor.author | Yiğit, Y | |
| dc.contributor.author | Özbek, AE | |
| dc.contributor.author | Doğru, B | |
| dc.contributor.author | Günay, S | |
| dc.contributor.author | Alkahlout, B | |
| dc.date.accessioned | 2026-03-31T13:21:19Z | |
| dc.date.available | 2026-03-31T13:21:19Z | |
| dc.date.issued | 2025 | |
| dc.description.abstract | Background The field of natural language processing (NLP) has evolved significantly since its inception in the 1950s, with large language models (LLMs) now playing a crucial role in addressing medical challenges. Objectives This study evaluates the alignment of three prominent LLMs (Gemini, Copilot, and ChatGPT-4.0) with expert consensus on imaging recommendations for acute flank pain. Methods A total of 29 clinical vignettes representing different combinations of age, sex, pregnancy status, likelihood of stone disease, and alternative diagnoses were posed to the three LLMs (Gemini, Copilot, and ChatGPT-4.0) between March and April 2024. Responses were compared to the consensus recommendations of a multispecialty panel. The primary outcome was the rate of LLM responses matching the majority consensus. Secondary outcomes included alignment with consensus-rated perfect (9/9) or excellent (8/9) responses and agreement with any of the nine panel members. Results Gemini aligned with the majority consensus in 65.5% of cases, compared to 41.4% for both Copilot and ChatGPT-4.0. In scenarios rated as perfect or excellent by the consensus, Gemini showed 69.5% agreement, significantly higher than Copilot and ChatGPT-4.0, both at 43.4% (p = 0.045 and < 0.001, respectively). Overall, Gemini demonstrated an agreement rate of 82.7% with any of the nine reviewers, indicating superior capability in addressing complex medical inquiries. Conclusion Gemini consistently outperformed Copilot and ChatGPT-4.0 in aligning with expert consensus, suggesting its potential as a reliable tool in clinical decision-making. Further research is needed to enhance the reliability and accuracy of LLMs and to address the ethical and legal challenges associated with their integration into healthcare systems. | |
| dc.identifier.doi | 10.1186/s12245-025-00895-3 | |
| dc.identifier.issn | 1865-1372 | |
| dc.identifier.issn | 1865-1380 | |
| dc.identifier.issue | 1 | |
| dc.identifier.pmid | 40615804 | |
| dc.identifier.uri | http://dx.doi.org/10.1186/s12245-025-00895-3 | |
| dc.identifier.uri | https://hdl.handle.net/11491/9682 | |
| dc.identifier.volume | 18 | |
| dc.identifier.wos | WOS:001523047300001 | |
| dc.language.iso | en | |
| dc.publisher | BMC | |
| dc.relation.ispartof | INT J EMERG MED | |
| dc.subject | Large Language models (LLMs) | |
| dc.subject | Natural Language processing (NLP) | |
| dc.subject | Renal colic | |
| dc.subject | Imaging recommendations | |
| dc.subject | Gemini | |
| dc.subject | Copilot | |
| dc.subject | ChatGPT-4.0 | |
| dc.title | Evaluating large language models for renal colic imaging recommendations: a comparative analysis of Gemini, Copilot, and ChatGPT-4.0 | |
| dc.type | Article |