NCT07037940

Brief Summary

Clinical decision support tools powered by artificial intelligence are being rapidly integrated into medical practice. Two leading systems currently available to clinicians are OpenEvidence, which uses retrieval-augmented generation to access medical literature, and GPT-4, a large language model. While both tools show promise, their relative effectiveness in supporting clinical decision-making has not been directly compared. This study aims to evaluate how these tools influence diagnostic reasoning and management decisions among internal medicine physicians.

Trial Health

Trial Health Score: 87 (On Track)

Automated assessment based on enrollment pace, timeline, and geographic reach. The score is calculated from publicly available data and should be used for screening purposes only.

Enrollment: 27 participants targeted (target below the 25th percentile for comparable trials)

Timeline: Completed; started Jul 2025 (shorter than the 25th percentile for comparable trials)

Geographic Reach: 1 country, 2 active sites

Status: Completed


Study Timeline

Key milestones and dates (all completed):

First Submitted (initial submission to the registry): June 17, 2025
First Posted (study publicly available on registry, 9 days later): June 26, 2025
Study Start (first participant enrolled, 7 days later): July 3, 2025
Primary Completion (last participant's last visit for the primary outcome, 6 months later): December 30, 2025
Study Completion (last participant's last visit for all outcomes, same day): December 30, 2025
Last Updated: April 9, 2026
Status Verified: April 1, 2026
Enrollment Period: 6 months
First QC Date: June 17, 2025
Last Update Submitted: April 7, 2026

Conditions

Outcome Measures

Primary Outcomes (1)

  • Clinical Reasoning Performance as determined by Rater Scores

    Clinical reasoning performance will be evaluated from rater scores of participants' responses to the administered case surveys. Six blinded, trained, independent raters will score each participant's response using a validated scoring rubric. Scores range from 0 to 100%, with higher scores indicating better clinical reasoning performance. Results for each assessment will be summarized by study arm using descriptive statistics and analyzed with mixed-effects models to account for within-subject correlation and between-subject factors.

    15 minutes upon completion of cases, up to approximately 90 minutes total
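The planned analysis (descriptive summaries by arm, then a mixed-effects model with a per-participant random intercept to handle within-subject correlation) could be sketched as follows. This is an illustrative sketch only, using synthetic data, statsmodels as one possible fitting library, and hypothetical variable names; it is not the study's actual analysis code.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data: 27 participants, each scoring 3 cases (all names hypothetical)
rng = np.random.default_rng(42)
n_participants, n_cases = 27, 3
arm_assignment = rng.permutation(["OpenEvidence"] * 14 + ["GPT-4"] * 13)
df = pd.DataFrame({
    "participant": np.repeat(np.arange(n_participants), n_cases),
    "arm": np.repeat(arm_assignment, n_cases),
    # Rubric scores on the 0-100% scale described above
    "score": rng.normal(70, 10, n_participants * n_cases).clip(0, 100),
})

# Descriptive statistics by study arm
print(df.groupby("arm")["score"].agg(["mean", "std", "count"]))

# Mixed-effects model: fixed effect for arm (between-subject factor),
# random intercept per participant (within-subject correlation across cases)
model = smf.mixedlm("score ~ arm", df, groups=df["participant"])
result = model.fit()
print(result.summary())
```

The random intercept lets each physician have their own baseline skill level, so the arm comparison is not confounded by repeated measurements from the same participant.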

Secondary Outcomes (2)

  • Time efficiency

    Up to approximately 75 minutes

  • Decision confidence

    15 minutes upon completion of cases, up to approximately 90 minutes total

Study Arms (2)

OpenEvidence

ACTIVE COMPARATOR

Participants in this arm will use OpenEvidence as their research tool.

Other: OpenEvidence

ChatGPT

ACTIVE COMPARATOR

Participants in this arm will use ChatGPT as their research tool.

Other: GPT-4

Interventions

OpenEvidence

Medical information platform that uses retrieval-augmented generation to access medical literature.

GPT-4 (OTHER)

A chatbot application that uses GPT-4, a large language model, to engage in conversational interactions with users.

Also known as: ChatGPT

Eligibility Criteria

Age: 25 Years+
Sex: All
Healthy Volunteers: Yes
Age Groups: Adult (18-64), Older Adult (65+)

You may qualify if:

  • Internal medicine residents
  • Internal medicine attending physicians

Contact the study team to confirm eligibility.

Sponsors & Collaborators

Study Sites (2)

Harvard Beth Israel Deaconess Medical Center
Boston, Massachusetts, 02215, United States

Montefiore Medical Center
The Bronx, New York, 10467, United States

Related Publications (6)

  • Cabral S, Restrepo D, Kanjee Z, Wilson P, Crowe B, Abdulnour RE, Rodman A. Clinical Reasoning of a Generative Artificial Intelligence Model Compared With Physicians. JAMA Intern Med. 2024 May 1;184(5):581-583. doi: 10.1001/jamainternmed.2024.0295.

    PMID: 38557971 (Background)
  • Singhal K, Azizi S, Tu T, Mahdavi SS, Wei J, Chung HW, Scales N, Tanwani A, Cole-Lewis H, Pfohl S, Payne P, Seneviratne M, Gamble P, Kelly C, Babiker A, Schärli N, Chowdhery A, Mansfield P, Demner-Fushman D, Agüera y Arcas B, Webster D, Corrado GS, Matias Y, Chou K, Gottweis J, Tomasev N, Liu Y, Rajkomar A, Barral J, Semturs C, Karthikesalingam A, Natarajan V. Large language models encode clinical knowledge. Nature. 2023 Aug;620(7972):172-180. doi: 10.1038/s41586-023-06291-2. Epub 2023 Jul 12.

    PMID: 37438534 (Background)
  • Strong E, DiGiammarino A, Weng Y, Kumar A, Hosamani P, Hom J, Chen JH. Chatbot vs Medical Student Performance on Free-Response Clinical Reasoning Examinations. JAMA Intern Med. 2023 Sep 1;183(9):1028-1030. doi: 10.1001/jamainternmed.2023.2909.

    PMID: 37459090 (Background)
  • Schaye V, Miller L, Kudlowitz D, Chun J, Burk-Rafel J, Cocks P, Guzman B, Aphinyanaphongs Y, Marin M. Development of a Clinical Reasoning Documentation Assessment Tool for Resident and Fellow Admission Notes: a Shared Mental Model for Feedback. J Gen Intern Med. 2022 Feb;37(3):507-512. doi: 10.1007/s11606-021-06805-6. Epub 2021 May 4.

    PMID: 33945113 (Background)
  • Goh E, Gallo R, Hom J, Strong E, Weng Y, Kerman H, Cool JA, Kanjee Z, Parsons AS, Ahuja N, Horvitz E, Yang D, Milstein A, Olson APJ, Rodman A, Chen JH. Large Language Model Influence on Diagnostic Reasoning: A Randomized Clinical Trial. JAMA Netw Open. 2024 Oct 1;7(10):e2440969. doi: 10.1001/jamanetworkopen.2024.40969.

    PMID: 39466245 (Background)
  • Goh E, Gallo RJ, Strong E, Weng Y, Kerman H, Freed JA, Cool JA, Kanjee Z, Lane KP, Parsons AS, Ahuja N, Horvitz E, Yang D, Milstein A, Olson APJ, Hom J, Chen JH, Rodman A. GPT-4 assistance for improvement of physician performance on patient care tasks: a randomized controlled trial. Nat Med. 2025 Apr;31(4):1233-1238. doi: 10.1038/s41591-024-03456-y. Epub 2025 Feb 5.

    PMID: 39910272 (Background)

Study Officials

  • Shitij Arora, MD

    Montefiore Medical Center

    PRINCIPAL INVESTIGATOR

Study Design

Study Type: Interventional
Phase: Not Applicable
Allocation: Randomized
Masking: Single (Outcomes Assessor masked)
Purpose: Other
Intervention Model: Parallel
Sponsor Type: Other
Responsible Party: Sponsor


Data Sharing

IPD Sharing: Will not share
