Evaluating large language models for renal colic imaging recommendations: a comparative analysis of Gemini, Copilot, and ChatGPT-4.0
| dc.contributor.author | Yiğit, Y | |
| dc.contributor.author | Özbek, AE | |
| dc.contributor.author | Doğru, B | |
| dc.contributor.author | Günay, S | |
| dc.contributor.author | Alkahlout, B | |
| dc.date.accessioned | 2026-03-31T13:21:19Z | |
| dc.date.available | 2026-03-31T13:21:19Z | |
| dc.date.issued | 2025 | |
| dc.description.abstract | Background The field of natural language processing (NLP) has evolved significantly since its inception in the 1950s, with large language models (LLMs) now playing a crucial role in addressing medical challenges. Objectives This study evaluates the alignment of three prominent LLMs (Gemini, Copilot, and ChatGPT-4.0) with expert consensus on imaging recommendations for acute flank pain. Methods A total of 29 clinical vignettes representing different combinations of age, sex, pregnancy status, likelihood of stone disease, and alternative diagnoses were posed to the three LLMs (Gemini, Copilot, and ChatGPT-4.0) between March and April 2024. Responses were compared to the consensus recommendations of a multispecialty panel. The primary outcome was the rate of LLM responses matching the majority consensus. Secondary outcomes included alignment with consensus-rated perfect (9/9) or excellent (8/9) responses and agreement with any of the nine panel members. Results Gemini aligned with the majority consensus in 65.5% of cases, compared to 41.4% for both Copilot and ChatGPT-4.0. In scenarios rated as perfect or excellent by the consensus, Gemini showed 69.5% agreement, significantly higher than Copilot and ChatGPT-4.0, both at 43.4% (p = 0.045 and < 0.001, respectively). Overall, Gemini demonstrated an agreement rate of 82.7% with any of the nine reviewers, indicating superior capability in addressing complex medical inquiries. Conclusion Gemini consistently outperformed Copilot and ChatGPT-4.0 in aligning with expert consensus, suggesting its potential as a reliable tool in clinical decision-making. Further research is needed to enhance the reliability and accuracy of LLMs and to address the ethical and legal challenges associated with their integration into healthcare systems. | |
| dc.identifier.doi | 10.1186/s12245-025-00895-3 | |
| dc.identifier.issn | 1865-1372 | |
| dc.identifier.issn | 1865-1380 | |
| dc.identifier.issue | 1 | |
| dc.identifier.pmid | 40615804 | |
| dc.identifier.uri | http://dx.doi.org/10.1186/s12245-025-00895-3 | |
| dc.identifier.uri | https://hdl.handle.net/11491/9682 | |
| dc.identifier.volume | 18 | |
| dc.identifier.wos | WOS:001523047300001 | |
| dc.language.iso | en | |
| dc.publisher | BMC | |
| dc.relation.ispartof | INT J EMERG MED | |
| dc.subject | Large Language models (LLMs) | |
| dc.subject | Natural Language processing (NLP) | |
| dc.subject | Renal colic | |
| dc.subject | Imaging recommendations | |
| dc.subject | Gemini | |
| dc.subject | Copilot | |
| dc.subject | ChatGPT-4.0 | |
| dc.title | Evaluating large language models for renal colic imaging recommendations: a comparative analysis of Gemini, Copilot, and ChatGPT-4.0 | |
| dc.type | Article |