Comparing DeepSeek and GPT-4o in ECG interpretation: Is AI improving over time?
| dc.contributor.author | Günay, S | |
| dc.contributor.author | Öztürk, A | |
| dc.contributor.author | Karahan, AT | |
| dc.contributor.author | Barındık, M | |
| dc.contributor.author | Komut, S | |
| dc.contributor.author | Yiğit, Y | |
| dc.date.accessioned | 2026-03-31T13:21:06Z | |
| dc.date.available | 2026-03-31T13:21:06Z | |
| dc.date.issued | 2026 | |
| dc.description.abstract | Background: DeepSeek is a recently launched large language model (LLM), whereas GPT-4o is an advanced version of ChatGPT whose electrocardiography (ECG) interpretation capabilities have been studied previously. DeepSeek's performance in this domain, however, remains unexplored. Objectives: This study aims to evaluate DeepSeek's accuracy in ECG interpretation and compare it with that of GPT-4o, emergency medicine specialists, and cardiologists. A secondary aim is to assess any change in GPT-4o's performance over one year. Methods: Between February 9 and March 1, 2025, 40 ECG images (20 daily routine, 20 more challenging) from the book 150 ECG Cases were evaluated by both GPT-4o and DeepSeek, with each model tested 13 times. The accuracy of their responses was compared with previously collected answers from 12 cardiologists and 12 emergency medicine specialists. GPT-4o's 2025 performance was compared with its 2024 results on identical ECGs. Results: GPT-4o outperformed DeepSeek, achieving higher median numbers of correct answers on daily routine (14 vs. 12), more challenging (13 vs. 10), and total ECGs (27 vs. 22), with statistically significant differences (p=0.048, p<0.001, and p<0.001, respectively). Moderate agreement was observed among the responses provided by GPT-4o (p<0.001, Fleiss kappa=0.473), whereas substantial agreement was observed among the responses provided by DeepSeek (p<0.001, Fleiss kappa=0.712). No significant year-over-year improvement was observed in GPT-4o's performance. Conclusion: This first evaluation of DeepSeek in ECG interpretation reveals that its performance is lower than that of GPT-4o and human experts. While GPT-4o demonstrates greater accuracy, both models fall short of expert-level performance, underscoring the need for caution and further validation before clinical integration. | |
| dc.identifier.doi | 10.1016/j.hrtlng.2025.08.007 | |
| dc.identifier.issn | 0147-9563 | |
| dc.identifier.issn | 1527-3288 | |
| dc.identifier.pmid | 40947358 | |
| dc.identifier.uri | http://dx.doi.org/10.1016/j.hrtlng.2025.08.007 | |
| dc.identifier.uri | https://hdl.handle.net/11491/9543 | |
| dc.identifier.volume | 75 | |
| dc.identifier.wos | WOS:001621557800002 | |
| dc.language.iso | en | |
| dc.publisher | MOSBY-ELSEVIER | |
| dc.relation.ispartof | HEART LUNG | |
| dc.subject | ChatGPT | |
| dc.subject | GPT-4o | |
| dc.subject | DeepSeek | |
| dc.subject | Electrocardiography | |
| dc.subject | Emergency medicine | |
| dc.title | Comparing DeepSeek and GPT-4o in ECG interpretation: Is AI improving over time? | |
| dc.type | Article |