OpenEvidence Safety and Comparative Efficacy of Four LLMs in Clinical Practice
A Comparative Performance Evaluation of Four Publicly Available Large Language Models Against Gold Standard Medical References
Observational study · 20 participants targeted · 1 country · 1 site · 1 other identifier
Brief Summary
OpenEvidence is an online tool that aggregates and synthesizes data from peer-reviewed medical studies, then uses generative AI to produce a response to a user's question. While it is in use by a number of clinicians (including residents) today, there is little to no published data on whether the tool's outputs are accurate and whether this information appropriately informs clinical decision-making. Similarly, a number of clinicians are turning to other large language models (LLMs) to assist in decision-making when providing clinical care. While a number of studies have been published on the accuracy of these LLMs' responses to medical board questions or clinical vignettes, few studies to date have examined their performance in a real-world clinical setting, and even fewer have compared that performance across models. In this study, the investigators have two goals:
- 1. To determine whether use of the AI tool "OpenEvidence" leads to clinically appropriate decisions when it is used by family medicine, internal medicine, and psychiatry residents in the course of clinical practice.
- 2. To determine how the output of the OpenEvidence tool compares with that of three other commonly used, publicly available large language models (OpenAI's ChatGPT, Anthropic's Claude, and Google's Gemini) in answering common questions that residents have in the course of clinical practice.
For each clinical question arising during routine care, participating residents will complete the following steps (a sketch of the resulting data record follows this list):
- 1. State their clinical question.
- 2. Query OpenEvidence, capturing their prompt and the OpenEvidence output for data analysis. (All residents will undergo training in prompt engineering at the start of the study.)
- 3. State their clinical conclusion based on the OpenEvidence output.
- 4. Query the Gold Standard Resource.
- 5. State their final clinical conclusion.
- 6. Answer a question on whether their clinical conclusion was modified by the Gold Standard reference.
- 7. Answer a question on whether they had any clinical safety concerns about the output from OpenEvidence.
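As a compact illustration of the data each encounter generates, the following hypothetical record mirrors the steps above; the class and field names are illustrative only and do not come from the study protocol.

```python
from dataclasses import dataclass

# Hypothetical per-encounter record mirroring the protocol steps above.
# All names are illustrative; the study record does not specify a schema.
@dataclass
class EncounterRecord:
    clinical_question: str      # step 1: the resident's stated question
    prompt: str                 # step 2: exact prompt sent to OpenEvidence
    openevidence_output: str    # step 2: captured OpenEvidence response
    initial_conclusion: str     # step 3: conclusion from OpenEvidence alone
    gold_standard_query: str    # step 4: query to the Gold Standard Resource
    final_conclusion: str       # step 5: conclusion after both sources
    conclusion_modified: bool   # step 6: changed by the Gold Standard reference?
    safety_concern: bool        # step 7: any safety concern with the output?
```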
Study Timeline
Key milestones and dates
- First Submitted (Completed): September 15, 2025. Initial submission to the registry.
- First Posted (Completed): September 30, 2025. Study publicly available on registry.
- Study Start (Completed): October 1, 2025. First participant enrolled.
- Primary Completion (Completed): April 30, 2026. Last participant's last visit for primary outcome.
- Study Completion (Expected): September 1, 2026. Last participant's last visit for all outcomes.
Conditions
Outcome Measures
Primary Outcomes (3)
Clinical Appropriateness: Mean with SD
Clinical appropriateness score of resident decisions based on OpenEvidence output. This is a numeric score on a 10-point Likert scale. (For Likert scales described in this and all of our outcome measures, higher scores indicate better outcomes.) The mean score with standard deviation will be used for the primary outcome.
6 months
Clinical Appropriateness: Median
Clinical appropriateness score of resident decisions based on OpenEvidence output. This is a numeric score on a 10-point Likert scale. Median clinical appropriateness scores will also be reported.
6 months
Clinical Appropriateness: Interrater Reliability
SMEs will evaluate the Clinical Appropriateness of resident decisions based on OpenEvidence output on a 10-point Likert scale. Interrater reliability of the SME Clinical Appropriateness scores will be calculated using a kappa statistic (see the sketch below).
6 months
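A minimal sketch of this analysis, assuming exactly two SME raters and quadratically weighted kappa; neither assumption comes from the study record, but quadratic weighting is a common choice for ordinal Likert data:

```python
# Minimal sketch of the interrater-reliability analysis described above.
# Assumptions not stated in the study record: exactly two SME raters and
# quadratically weighted kappa (suited to ordinal Likert data).
from sklearn.metrics import cohen_kappa_score

# Hypothetical 10-point Likert appropriateness scores for the same set of
# resident decisions, one list per SME rater.
rater_a = [8, 9, 7, 10, 6, 8, 9, 5, 7, 8]
rater_b = [7, 9, 8, 10, 6, 7, 9, 6, 7, 9]

# Quadratic weights penalize large disagreements more than near-misses;
# weights=None would give unweighted kappa instead.
kappa = cohen_kappa_score(rater_a, rater_b, weights="quadratic")
print(f"Weighted kappa: {kappa:.2f}")  # 1.0 = perfect agreement, 0 = chance
```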
Secondary Outcomes (12)
Comparative Accuracy of LLMs: Win Rate (a sketch of this computation follows the outcomes list)
6 months
Comparative Accuracy: Margin of Win
6 months
Comparative Accuracy: Effect Size
6 months
Comparative Accuracy of LLMs: Interrater Reliability
6 months
Comparative Completeness of LLMs: Win Rate
6 months
- 7 additional secondary outcomes not shown here
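The record does not describe how a win is scored. A minimal sketch, assuming every clinical question is rated for all four models and that a model wins a question when its SME accuracy score is strictly highest (ties award no win):

```python
# Minimal sketch of a per-question win-rate computation across the four
# models named in the study. Assumptions not in the record: every question
# is scored for all four models, and a strict maximum is required to win.
from collections import Counter

# Hypothetical SME accuracy scores (10-point Likert) per clinical question.
scores = [
    {"OpenEvidence": 9, "ChatGPT": 7, "Claude": 8, "Gemini": 6},
    {"OpenEvidence": 6, "ChatGPT": 8, "Claude": 7, "Gemini": 8},
    {"OpenEvidence": 10, "ChatGPT": 9, "Claude": 9, "Gemini": 8},
]

wins = Counter()
for question in scores:
    best = max(question.values())
    leaders = [model for model, s in question.items() if s == best]
    if len(leaders) == 1:  # strict winner only; ties award no win
        wins[leaders[0]] += 1

for model, count in wins.items():
    print(f"{model}: win rate {count / len(scores):.0%}")
```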
Study Arms (2)
Medicine Residents
Trainees in internal medicine or family medicine
Psychiatry Residents
Trainees in adult and child psychiatry
Interventions
Residents will use the OpenEvidence clinical reference tool in the course of routine clinical care. They must also use a Gold Standard clinical reference tool (e.g., PubMed, UpToDate) to mitigate risk.
Eligibility Criteria
Post-graduate trainees in Internal Medicine, Family Medicine, Adult Psychiatry, or Child Psychiatry
You may qualify if:
- Active trainees PGY-1 through PGY-6 in Internal Medicine, Family Medicine, Adult Psychiatry, or Child Psychiatry at Cambridge Health Alliance
- Must agree to the study protocol requirements outlined in the study description.
You may not qualify if:
- Residents who plan to leave CHA prior to the end of the study collection period.
Contact the study team to confirm eligibility.
Sponsors & Collaborators
Study Sites (1)
Cambridge Health Alliance
Cambridge, Massachusetts, 02193, United States
Related Publications (6)
1. Skryd A, Lawrence K. ChatGPT as a Tool for Medical Education and Clinical Decision-Making on the Wards: Case Study. JMIR Form Res. 2024 May 8;8:e51346. doi: 10.2196/51346. PMID: 38717811. (Background)
2. Rao A, Kim J, Kamineni M, Pang M, Lie W, Dreyer KJ, Succi MD. Evaluating GPT as an Adjunct for Radiologic Decision Making: GPT-4 Versus GPT-3.5 in a Breast Imaging Pilot. J Am Coll Radiol. 2023 Oct;20(10):990-997. doi: 10.1016/j.jacr.2023.05.003. Epub 2023 Jun 21. PMID: 37356806. (Background)
3. Pagano S, Strumolo L, Michalk K, Schiegl J, Pulido LC, Reinhard J, Maderbacher G, Renkawitz T, Schuster M. Evaluating ChatGPT, Gemini and other Large Language Models (LLMs) in orthopaedic diagnostics: A prospective clinical study. Comput Struct Biotechnol J. 2024 Dec 26;28:9-15. doi: 10.1016/j.csbj.2024.12.013. eCollection 2025. PMID: 39850460. (Background)
4. Levkovich I, Rabin E, Brann M, Elyoseph Z. Large language models outperform general practitioners in identifying complex cases of childhood anxiety. Digit Health. 2024 Dec 15;10:20552076241294182. doi: 10.1177/20552076241294182. eCollection 2024 Jan-Dec. PMID: 39687523. (Background)
5. Griewing S, Knitza J, Boekhoff J, Hillen C, Lechner F, Wagner U, Wallwiener M, Kuhn S. Evolution of publicly available large language models for complex decision-making in breast cancer care. Arch Gynecol Obstet. 2024 Jul;310(1):537-550. doi: 10.1007/s00404-024-07565-4. Epub 2024 May 29. PMID: 38806945. (Background)
6. Benary M, Wang XD, Schmidt M, Soll D, Hilfenhaus G, Nassir M, Sigler C, Knodler M, Keller U, Beule D, Keilholz U, Leser U, Rieke DT. Leveraging Large Language Models for Decision Support in Personalized Oncology. JAMA Netw Open. 2023 Nov 1;6(11):e2343689. doi: 10.1001/jamanetworkopen.2023.43689. PMID: 37976064. (Background)
Related Links
Study Officials
- Principal Investigator: Hannah K Galvin, MD (Cambridge Health Alliance)
Study Design
- Study Type: Observational
- Observational Model: Cohort
- Time Perspective: Prospective
- Sponsor Type: Other
- Responsible Party: Principal Investigator
- PI Title: Chief Health Information Officer
Study Record Dates
First Submitted: September 15, 2025
First Posted: September 30, 2025
Study Start: October 1, 2025
Primary Completion: April 30, 2026
Study Completion (Estimated): September 1, 2026
Last Updated: February 18, 2026
Record last verified: 2026-02