NCT07632859|Recruiting

Diagnostic Accuracy of GPT-4o and Claude 4.6 Sonnet in Turkish ED Anamnesis Notes

LLM-ED-DX-TR

Diagnostic Accuracy of Large Language Models From Emergency Department Anamnesis Notes: A Comparison of GPT-4o and Claude 4.6 Sonnet With Emergency Medicine Specialists

Marmara University Pendik Training and Research Hospital | ClinicalTrials.gov

1 other identifier

09.2026.26-0514

Study Type

observational

Target

600

Locations

1 country

Sites

Timeline

Registered: Jun 2026

Started: Jun 2026

Completion: Oct 2026

Brief Summary

This retrospective diagnostic accuracy study evaluates the ability of two large language models (LLMs) - GPT-4o (gpt-4o-2024-11-20; OpenAI) and Claude 4.6 Sonnet (claude-sonnet-4-6; Anthropic) - to generate correct diagnoses from anonymized Turkish-language emergency department (ED) anamnesis notes, and compares their performance with the diagnosis entered by the treating emergency physician. A consensus gold standard is established by three independent board-certified emergency medicine specialists who blindly review each note and vote on the primary diagnosis using ICD-10 three-character codes; the majority vote (at least 2 of 3 specialists agreeing) constitutes the reference standard. Both LLMs are evaluated using a standardized zero-shot direct prompting strategy (temperature=0, stateless API sessions). The primary outcome is diagnostic accuracy (proportion of ICD-10 chapter-level matches) and Cohen's kappa for each LLM against the gold standard. Secondary outcomes include top-3 accuracy, treating physician accuracy, inter-model agreement, and subgroup analyses by ESI triage level and ICD-10 chapter. Inter-rater reliability among the three specialists is quantified using Fleiss' kappa. Analyses are performed in Jamovi. This study represents the first evaluation of LLM diagnostic accuracy using Turkish-language clinical notes and the first to benchmark LLM performance against an independent three-specialist majority-vote gold standard rather than against the treating physician's own diagnosis.
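
The reference-standard step described above (three specialists, majority of at least 2 of 3, Fleiss' kappa for inter-rater reliability) can be sketched as follows. This is a minimal illustration, not the investigators' code: the function names and the example votes are hypothetical, and returning `None` when all three specialists disagree is an assumption, since the record does not say how such cases are handled.

```python
from collections import Counter
from typing import Optional, Sequence


def majority_vote(votes: Sequence[str]) -> Optional[str]:
    """Return the code chosen by at least 2 raters, or None if no code reaches 2.

    Assumption: cases without a majority yield None; the registry entry
    does not describe a tie-breaking rule.
    """
    code, count = Counter(votes).most_common(1)[0]
    return code if count >= 2 else None


def fleiss_kappa(ratings: Sequence[Sequence[str]]) -> float:
    """Fleiss' kappa for cases that were each rated by the same number of raters."""
    n_cases = len(ratings)
    n_raters = len(ratings[0])
    categories = sorted({code for row in ratings for code in row})
    # counts[i][j] = number of raters who assigned category j to case i
    counts = [[Counter(row)[c] for c in categories] for row in ratings]
    # Observed agreement per case, averaged over cases
    p_i = [
        (sum(n * n for n in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_i) / n_cases
    # Chance agreement from the overall proportion of each category
    p_j = [
        sum(row[j] for row in counts) / (n_cases * n_raters)
        for j in range(len(categories))
    ]
    p_e = sum(p * p for p in p_j)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when every rating is the same category")
    return (p_bar - p_e) / (1 - p_e)


# Hypothetical three-character ICD-10 votes for four notes
example_votes = [
    ["I21", "I21", "I21"],
    ["J18", "J18", "R07"],
    ["K35", "N20", "R10"],
    ["R10", "R10", "K35"],
]
gold_standard = [majority_vote(v) for v in example_votes]
reliability = fleiss_kappa(example_votes)
```

Computing the majority on three-character codes follows the summary; whether agreement statistics are reported at the three-character or chapter level is not stated in the record.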

Trial Health

On Track

Trial Health Score

Automated assessment based on enrollment pace, timeline, and geographic reach

Enrollment

600

participants targeted

Target at P75+ for all trials

Timeline

2mo left

Started Jun 2026

Shorter than P25 for all trials

Geographic Reach

1 country

1 active site

Status

recruiting

Health score is calculated from publicly available data and should be used for screening purposes only.

Study Timeline

Key milestones and dates

4 months study duration

Study Progress: 51%

Jun 2026 to Oct 2026

Study Start

First participant enrolled

June 1, 2026

Completed

2 days until next milestone

First Submitted

Initial submission to the registry

June 3, 2026

Completed

5 days until next milestone

First Posted

Study publicly available on registry

June 8, 2026

Completed

23 days until next milestone

Primary Completion

Last participant's last visit for primary outcome

July 1, 2026

Completed

3 months until next milestone

Study Completion

Last participant's last visit for all outcomes

October 1, 2026

Expected

Last Updated

June 25, 2026

Status Verified

June 1, 2026

Enrollment Period

1 month

First QC Date

June 3, 2026

Last Update Submit

June 22, 2026

Conditions

Emergency Medicine; Diagnostic Errors; Artificial Intelligence (AI) in Diagnosis

Keywords

Large Language Model; GPT-4o; Claude 4.6 Sonnet; ICD-10; Clinical Coding; Turkish; Emergency Department; Diagnostic Accuracy; STARD; STARD-AI

Outcome Measures

Primary Outcomes (2)

Diagnostic Accuracy of GPT-4o for ICD-10 Chapter-Level Diagnosis
Proportion of cases in which GPT-4o primary (rank 1) diagnosis matches the 3-specialist majority-vote gold standard at the ICD-10 chapter level (22 categories). Range: 0 to 1.00.
At the time of single-session algorithmic evaluation (each case evaluated once following data extraction in June 2026).
Diagnostic Accuracy of Claude 4.6 Sonnet for ICD-10 Chapter-Level Diagnosis
Proportion of cases in which Claude 4.6 Sonnet primary (rank 1) diagnosis matches the 3-specialist majority-vote gold standard at the ICD-10 chapter level (22 categories). Range: 0 to 1.00.
At the time of single-session algorithmic evaluation (each case evaluated once following data extraction in June 2026).
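
The chapter-level match used in both primary outcomes can be illustrated with a short sketch. The block ranges below follow the WHO ICD-10 chapter list, which is an assumption here: the record only says "22 categories" and does not name the ICD-10 edition or mapping the investigators use. Function names and example codes are hypothetical.

```python
import re
from typing import Optional, Sequence

# (first block, last block, chapter number); WHO ICD-10 ranges, assumed.
ICD10_CHAPTERS = [
    ("A00", "B99", 1), ("C00", "D48", 2), ("D50", "D89", 3),
    ("E00", "E90", 4), ("F00", "F99", 5), ("G00", "G99", 6),
    ("H00", "H59", 7), ("H60", "H95", 8), ("I00", "I99", 9),
    ("J00", "J99", 10), ("K00", "K93", 11), ("L00", "L99", 12),
    ("M00", "M99", 13), ("N00", "N99", 14), ("O00", "O99", 15),
    ("P00", "P96", 16), ("Q00", "Q99", 17), ("R00", "R99", 18),
    ("S00", "T98", 19), ("V01", "Y98", 20), ("Z00", "Z99", 21),
    ("U00", "U85", 22),
]


def icd10_chapter(code: str) -> Optional[int]:
    """Map an ICD-10 code to its chapter number, or None if unrecognized."""
    key = code.strip().upper()[:3]
    if not re.fullmatch(r"[A-Z]\d{2}", key):
        return None
    # Letter plus two digits compares correctly as a plain string
    for start, end, chapter in ICD10_CHAPTERS:
        if start <= key <= end:
            return chapter
    return None


def chapter_accuracy(predicted: Sequence[str], reference: Sequence[str]) -> float:
    """Proportion of cases whose predicted and reference codes share a chapter."""
    if len(predicted) != len(reference) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    hits = 0
    for pred, ref in zip(predicted, reference):
        pred_chapter = icd10_chapter(pred)
        # An unrecognized prediction counts as a miss
        if pred_chapter is not None and pred_chapter == icd10_chapter(ref):
            hits += 1
    return hits / len(reference)


# Hypothetical rank-1 model outputs versus gold-standard codes
accuracy = chapter_accuracy(["I21", "J18", "K35", "R10"],
                            ["I25", "J18", "N20", "R07"])
```

Counting an unparseable model output as a miss is a design choice made for this sketch; the record does not specify how malformed outputs are scored.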

Secondary Outcomes (5)

Cohen's Kappa Between GPT-4o Primary Diagnosis and Gold Standard
At the time of algorithmic evaluation (June-July 2026)
Cohen's Kappa Between Claude 4.6 Sonnet Primary Diagnosis and Gold Standard
At the time of algorithmic evaluation (June-July 2026)
Top-3 Diagnostic Accuracy of GPT-4o
At the time of algorithmic evaluation (June-July 2026)
Top-3 Diagnostic Accuracy of Claude 4.6 Sonnet
At the time of algorithmic evaluation (June-July 2026)
Treating Physician Diagnostic Accuracy Against Gold Standard
At the time of the original clinical encounter (retrospective data spanning August-December 2025)
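
The agreement and top-3 measures listed above can be sketched in a few lines. This is an illustrative implementation of the standard definitions, not the study's analysis (the summary states that analyses are performed in Jamovi). The labels in the example are chapter numbers, which is an assumption: the record does not say whether the secondary outcomes are scored at the chapter or three-character level.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohen_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Cohen's kappa between two label sequences of equal length."""
    if len(a) != len(b) or not a:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(a)
    # Observed agreement
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from each rater's marginal label frequencies
    count_a, count_b = Counter(a), Counter(b)
    p_e = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when both raters use a single label")
    return (p_o - p_e) / (1 - p_e)


def top_k_accuracy(ranked: Sequence[Sequence[Hashable]],
                   reference: Sequence[Hashable],
                   k: int = 3) -> float:
    """Proportion of cases whose reference label is among the first k predictions."""
    if len(ranked) != len(reference) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    hits = sum(ref in preds[:k] for preds, ref in zip(ranked, reference))
    return hits / len(reference)


# Hypothetical chapter-level labels: model rank-1 output versus gold standard
model = [9, 10, 11, 18, 9, 10]
gold = [9, 10, 14, 18, 9, 18]
kappa = cohen_kappa(model, gold)

# Hypothetical ranked differential lists (three predictions per case)
ranked_lists = [[9, 10, 18], [11, 14, 18], [10, 9, 6]]
top3 = top_k_accuracy(ranked_lists, [10, 19, 10], k=3)
```

The same `cohen_kappa` function could be applied to the two models' outputs to obtain the inter-model agreement mentioned in the summary.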

Study Arms (1)

Emergency Department Patient Cohort

Consecutive adult patients presenting to the emergency department with a fully documented electronic anamnesis note and a definitive primary ICD-10 diagnosis

Eligibility Criteria

Age: 18 Years+

Sex: All

Healthy Volunteers: No

Age Groups: Adult (18-64), Older Adult (65+)

Sampling Method: Non-Probability Sample

Study Population

The study population comprises consecutive adult patients (aged 18 years and older) who presented to the emergency department of a tertiary care training and research hospital and had their encounters fully documented in the hospital information system (HBYS). Eligible individuals must have a complete electronic anamnesis note containing the chief complaint, history of present illness, and clinical presentation, alongside a definitive primary ICD-10 diagnosis finalized by the treating emergency physician at file closure. The population excludes pediatric cases, patients triaged to high-acuity resuscitation areas (ESI level 1), and clinical notes with fewer than 50 words or insufficient clinical content.

You may qualify if:

Adult patients (aged 18 years and older) presenting to the emergency department.
Complete electronic health record available in the hospital information system (HBYS) containing a detailed anamnesis note with chief complaint, symptom duration, associated symptoms, and relevant medical history.
A definitive primary diagnosis recorded by the treating emergency physician using ICD-10 codes at the time of patient file closure.

You may not qualify if:

Emergency department anamnesis notes containing fewer than 50 words or completely lacking substantive clinical content.
Pediatric cases (age under 18 years).
Patients critically ill and triaged to high-acuity resuscitation areas (Emergency Severity Index [ESI] level 1).
Clinical notes containing residual identifying information that cannot be fully de-identified, preventing compliance with data privacy regulations.
Non-independent clinical notes consisting solely of a brief cross-reference to a prior hospital visit without a new history entry.
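
The mechanical parts of these criteria can be expressed as a simple screening function. This is a sketch under stated assumptions: the `EdNote` fields are a hypothetical record layout, words are counted by whitespace splitting (the record does not say how the 50-word threshold is measured), and judgments such as "lacking substantive clinical content" or de-identification status are taken as pre-assigned flags because they need human review.

```python
from dataclasses import dataclass
from typing import Optional

MIN_WORDS = 50  # threshold stated in the exclusion criteria


@dataclass
class EdNote:
    """Hypothetical record layout; field names are illustrative only."""
    age_years: int
    esi_level: int              # Emergency Severity Index, 1 is most acute
    text: str                   # anamnesis note
    has_primary_icd10: bool     # definitive diagnosis recorded at file closure
    fully_deidentified: bool    # set after manual de-identification review
    cross_reference_only: bool  # note only points to a prior visit


def exclusion_reason(note: EdNote) -> Optional[str]:
    """Return the first exclusion that applies, or None if the note is eligible."""
    if note.age_years < 18:
        return "pediatric case"
    if note.esi_level == 1:
        return "ESI level 1"
    if not note.has_primary_icd10:
        return "no definitive primary ICD-10 diagnosis"
    # Assumption: a word is any whitespace-delimited token
    if len(note.text.split()) < MIN_WORDS:
        return "note shorter than 50 words"
    if not note.fully_deidentified:
        return "residual identifying information"
    if note.cross_reference_only:
        return "cross-reference to a prior visit only"
    return None


# Hypothetical notes built from placeholder text
eligible = EdNote(45, 3, " ".join(["word"] * 50), True, True, False)
too_short = EdNote(45, 3, " ".join(["word"] * 49), True, True, False)
```

Returning a reason string rather than a boolean makes it straightforward to tabulate exclusions for a STARD-style flow diagram.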

Contact the study team to confirm eligibility.

Sponsors & Collaborators

Marmara University Pendik Training and Research Hospital (lead)

Study Sites (1)

Marmara University Pendik Training and Research Hospital

Istanbul, Istanbul, 34899, Turkey (Türkiye)

RECRUITING

Related Publications (11)

Newman-Toker DE, Peterson SM, Badihian S, Hassoon A, Nassery N, Parizadeh D, Wilson LM, Jia Y, Omron R, Tharmarajah S, Guerin L, Bastani PB, Fracica EA, Kotwal S, Robinson KA. Diagnostic Errors in the Emergency Department: A Systematic Review [Internet]. Rockville (MD): Agency for Healthcare Research and Quality (US); 2022 Dec. Report No.: 22(23)-EHC043. Available from http://www.ncbi.nlm.nih.gov/books/NBK588118/
PMID: 36574484 | RESULT
Wei J et al. Chain-of-thought prompting elicits reasoning in large language models. NeurIPS. 2022;35:24824-24837.
RESULT
Sounderajah V, Guni A, Liu X, Collins GS, Karthikesalingam A, Markar SR, Golub RM, Denniston AK, Shetty S, Moher D, Bossuyt PM, Darzi A, Ashrafian H; STARD-AI Steering Committee. The STARD-AI reporting guideline for diagnostic accuracy studies using artificial intelligence. Nat Med. 2025 Oct;31(10):3283-3289. doi: 10.1038/s41591-025-03953-8. Epub 2025 Sep 15.
PMID: 40954311 | RESULT
Bossuyt PM, Reitsma JB, Bruns DE, Gatsonis CA, Glasziou PP, Irwig L, Lijmer JG, Moher D, Rennie D, de Vet HC, Kressel HY, Rifai N, Golub RM, Altman DG, Hooft L, Korevaar DA, Cohen JF; STARD Group. STARD 2015: an updated list of essential items for reporting diagnostic accuracy studies. BMJ. 2015 Oct 28;351:h5527. doi: 10.1136/bmj.h5527.
PMID: 26511519 | RESULT
Niset A, Melot I, Pireau M, Englebert A, Scius N, Flament J, El Hadwe S, Al Barajraji M, Thonon H, Barrit S. Grounded large language models for diagnostic prediction in real-world emergency department settings. JAMIA Open. 2025 Oct 21;8(5):ooaf119. doi: 10.1093/jamiaopen/ooaf119. eCollection 2025 Oct.
PMID: 41127256 | RESULT
Williams CYK, Miao BY, Kornblith AE, Butte AJ. Evaluating the use of large language models to provide clinical recommendations in the Emergency Department. Nat Commun. 2024 Oct 8;15(1):8236. doi: 10.1038/s41467-024-52415-1.
PMID: 39379357 | RESULT
Hoppe JM, Auer MK, Struven A, Massberg S, Stremmel C. ChatGPT With GPT-4 Outperforms Emergency Department Physicians in Diagnostic Accuracy: Retrospective Analysis. J Med Internet Res. 2024 Jul 8;26:e56110. doi: 10.2196/56110.
PMID: 38976865 | RESULT
Kanjee Z, Crowe B, Rodman A. Accuracy of a Generative Artificial Intelligence Model in a Complex Diagnostic Challenge. JAMA. 2023 Jul 3;330(1):78-80. doi: 10.1001/jama.2023.8288.
PMID: 37318797 | RESULT
Takita H, Kabata D, Walston SL, Tatekawa H, Saito K, Tsujimoto Y, Miki Y, Ueda D. A systematic review and meta-analysis of diagnostic performance comparison between generative AI and physicians. NPJ Digit Med. 2025 Mar 22;8(1):175. doi: 10.1038/s41746-025-01543-z.
PMID: 40121370 | RESULT
Shan G, Chen X, Wang C, Liu L, Gu Y, Jiang H, Shi T. Comparing Diagnostic Accuracy of Clinical Professionals and Large Language Models: Systematic Review and Meta-Analysis. JMIR Med Inform. 2025 Apr 25;13:e64963. doi: 10.2196/64963.
PMID: 40279517 | RESULT
Taylor RA, Sangal RB, Smith ME, Haimovich AD, Rodman A, Iscoe MS, Pavuluri SK, Rose C, Janke AT, Wright DS, Socrates V, Declan A. Leveraging artificial intelligence to reduce diagnostic errors in emergency medicine: Challenges, opportunities, and future directions. Acad Emerg Med. 2025 Mar;32(3):327-339. doi: 10.1111/acem.15066. Epub 2024 Dec 15.
PMID: 39676165 | RESULT

MeSH Terms

Conditions

Emergencies

Condition Hierarchy (Ancestors)

Disease Attributes; Pathologic Processes; Pathological Conditions, Signs and Symptoms

Study Officials

Emir Ünal
Marmara University
PRINCIPAL INVESTIGATOR

Central Study Contacts

Emir Ünal, Assistant Professor

CONTACT

+905327766010 emirunal@gmail.com

Study Design

Study Type: observational
Observational Model: COHORT
Time Perspective: RETROSPECTIVE
Sponsor Type: OTHER
Responsible Party: PRINCIPAL INVESTIGATOR
PI Title: MD, Assistant Professor

Study Record Dates

First Submitted

June 3, 2026

First Posted

June 8, 2026

Study Start

June 1, 2026

Primary Completion

July 1, 2026

Study Completion (Estimated)

October 1, 2026

Last Updated

June 25, 2026

Record last verified: 2026-06
