OpenEvidence Safety and Comparative Efficacy of Four LLMs in Clinical Practice
A Comparative Performance Evaluation of Four Publicly Available Large Language Models Against Gold Standard Medical References
Observational study · 20 participants targeted · 1 country · 1 site · 1 other identifier
Brief Summary
OpenEvidence is an online tool that aggregates and synthesizes data from peer-reviewed medical studies, then uses generative AI to produce a response to a user's question. While it is in use by a number of clinicians (including residents) today, there is little to no published data on whether the tool's outputs are accurate and whether this information appropriately informs clinical decision-making. Similarly, a number of clinicians are turning to other large language models (LLMs) to assist in decision-making when providing clinical care. While a number of studies have been published on the accuracy of these LLMs' responses to medical board questions or clinical vignettes, few studies to date have examined their performance in a real-world clinical setting, and even fewer have compared that performance across models. In this study, the investigators have two goals:
- 1. To determine whether use of the AI tool "OpenEvidence" leads to clinically appropriate decisions when it is used by family medicine, internal medicine, and psychiatry residents in the course of clinical practice.
- 2. To determine how the output of the OpenEvidence tool compares with that of three other commonly used, publicly available large language models (OpenAI's ChatGPT, Anthropic's Claude, and Google's Gemini) in answering common questions that residents have in the course of clinical practice.
For each clinical question arising during routine care, participating residents will complete the following steps (a sketch of the resulting data record follows this list):
- 1. State their clinical question.
- 2. Query OpenEvidence, capturing their prompt and the OpenEvidence output for data analysis. (All residents will undergo training in prompt engineering at the start of the study.)
- 3. State their clinical conclusion based on the OpenEvidence output.
- 4. Query the Gold Standard Resource.
- 5. State their final clinical conclusion.
- 6. Answer a question on whether their clinical conclusion was modified by the Gold Standard reference.
- 7. Answer a question on whether they had any clinical safety concerns about the output from OpenEvidence.
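As a compact illustration of the data each encounter generates, the following hypothetical record mirrors the steps above; the class and field names are illustrative only and do not come from the study protocol.

```python
from dataclasses import dataclass

# Hypothetical per-encounter record mirroring the protocol steps above.
# All names are illustrative; the study record does not specify a schema.
@dataclass
class EncounterRecord:
    clinical_question: str      # step 1: the resident's stated question
    prompt: str                 # step 2: exact prompt sent to OpenEvidence
    openevidence_output: str    # step 2: captured OpenEvidence response
    initial_conclusion: str     # step 3: conclusion from OpenEvidence alone
    gold_standard_query: str    # step 4: query to the Gold Standard Resource
    final_conclusion: str       # step 5: conclusion after both sources
    conclusion_modified: bool   # step 6: changed by the Gold Standard reference?
    safety_concern: bool        # step 7: any safety concern with the output?
```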
Study Timeline
Key milestones and dates
- First Submitted (Completed): September 15, 2025. Initial submission to the registry.
- First Posted (Completed): September 30, 2025. Study publicly available on registry.
- Study Start (Completed): October 1, 2025. First participant enrolled.
- Primary Completion (Completed): April 30, 2026. Last participant's last visit for primary outcome.
- Study Completion (Expected): September 1, 2026. Last participant's last visit for all outcomes.
Conditions
Outcome Measures
Primary Outcomes (3)
Clinical Appropriateness: Mean with SD
Clinical appropriateness score of resident decisions based on OpenEvidence output. This is a numeric score on a 10-point Likert scale. (For Likert scales described in this and all of our outcome measures, higher scores indicate better outcomes.) The mean score with standard deviation will be used for the primary outcome.
6 months
Clinical Appropriateness: Median
Clinical appropriateness score of resident decisions based on OpenEvidence output. This is a numeric score on a 10-point Likert scale. Median clinical appropriateness scores will also be reported.
6 months
Clinical Appropriateness: Interrater Reliability
SMEs will evaluate the Clinical Appropriateness of resident decisions based on OpenEvidence output on a 10-point Likert scale. Interrater reliability of the SME Clinical Appropriateness scores will be calculated using a kappa statistic (see the sketch below).
6 months
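A minimal sketch of this analysis, assuming exactly two SME raters and quadratically weighted kappa; neither assumption comes from the study record, but quadratic weighting is a common choice for ordinal Likert data:

```python
# Minimal sketch of the interrater-reliability analysis described above.
# Assumptions not stated in the study record: exactly two SME raters and
# quadratically weighted kappa (suited to ordinal Likert data).
from sklearn.metrics import cohen_kappa_score

# Hypothetical 10-point Likert appropriateness scores for the same set of
# resident decisions, one list per SME rater.
rater_a = [8, 9, 7, 10, 6, 8, 9, 5, 7, 8]
rater_b = [7, 9, 8, 10, 6, 7, 9, 6, 7, 9]

# Quadratic weights penalize large disagreements more than near-misses;
# weights=None would give unweighted kappa instead.
kappa = cohen_kappa_score(rater_a, rater_b, weights="quadratic")
print(f"Weighted kappa: {kappa:.2f}")  # 1.0 = perfect agreement, 0 = chance
```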
Secondary Outcomes (12)
Comparative Accuracy of LLMs: Win Rate (a sketch of this computation follows the outcomes list)
6 months
Comparative Accuracy: Margin of Win
6 months
Comparative Accuracy: Effect Size
6 months
Comparative Accuracy of LLMs: Interrater Reliability
6 months
Comparative Completeness of LLMs: Win Rate
6 months
- 7 additional secondary outcomes not shown here
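The record does not describe how a win is scored. A minimal sketch, assuming every clinical question is rated for all four models and that a model wins a question when its SME accuracy score is strictly highest (ties award no win):

```python
# Minimal sketch of a per-question win-rate computation across the four
# models named in the study. Assumptions not in the record: every question
# is scored for all four models, and a strict maximum is required to win.
from collections import Counter

# Hypothetical SME accuracy scores (10-point Likert) per clinical question.
scores = [
    {"OpenEvidence": 9, "ChatGPT": 7, "Claude": 8, "Gemini": 6},
    {"OpenEvidence": 6, "ChatGPT": 8, "Claude": 7, "Gemini": 8},
    {"OpenEvidence": 10, "ChatGPT": 9, "Claude": 9, "Gemini": 8},
]

wins = Counter()
for question in scores:
    best = max(question.values())
    leaders = [model for model, s in question.items() if s == best]
    if len(leaders) == 1:  # strict winner only; ties award no win
        wins[leaders[0]] += 1

for model, count in wins.items():
    print(f"{model}: win rate {count / len(scores):.0%}")
```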
Study Arms (2)
Medicine Residents
Trainees in internal medicine or family medicine
Psychiatry Residents
Trainees in adult and child psychiatry
Interventions
Residents will use the OpenEvidence clinical reference tool in the course of routine clinical care. They must also use a Gold Standard clinical reference tool (e.g., PubMed, UpToDate) to mitigate risk.
Eligibility Criteria
Post-graduate trainees in Internal Medicine, Family Medicine, Adult Psychiatry, or Child Psychiatry
You may qualify if:
- Active trainees PGY-1 through PGY-6 in Internal Medicine, Family Medicine, Adult Psychiatry, or Child Psychiatry at Cambridge Health Alliance
- Must agree to the study protocol requirements outlined in the study description.
You may not qualify if:
- Residents who plan to leave CHA prior to the end of the study collection period.
Contact the study team to confirm eligibility.
Sponsors & Collaborators
Study Sites (1)
Cambridge Health Alliance
Cambridge, Massachusetts, 02193, United States
Related Publications (6)
1. Skryd A, Lawrence K. ChatGPT as a Tool for Medical Education and Clinical Decision-Making on the Wards: Case Study. JMIR Form Res. 2024 May 8;8:e51346. doi: 10.2196/51346. PMID: 38717811. (Background)
2. Rao A, Kim J, Kamineni M, Pang M, Lie W, Dreyer KJ, Succi MD. Evaluating GPT as an Adjunct for Radiologic Decision Making: GPT-4 Versus GPT-3.5 in a Breast Imaging Pilot. J Am Coll Radiol. 2023 Oct;20(10):990-997. doi: 10.1016/j.jacr.2023.05.003. Epub 2023 Jun 21. PMID: 37356806. (Background)
3. Pagano S, Strumolo L, Michalk K, Schiegl J, Pulido LC, Reinhard J, Maderbacher G, Renkawitz T, Schuster M. Evaluating ChatGPT, Gemini and other Large Language Models (LLMs) in orthopaedic diagnostics: A prospective clinical study. Comput Struct Biotechnol J. 2024 Dec 26;28:9-15. doi: 10.1016/j.csbj.2024.12.013. eCollection 2025. PMID: 39850460. (Background)
4. Levkovich I, Rabin E, Brann M, Elyoseph Z. Large language models outperform general practitioners in identifying complex cases of childhood anxiety. Digit Health. 2024 Dec 15;10:20552076241294182. doi: 10.1177/20552076241294182. eCollection 2024 Jan-Dec. PMID: 39687523. (Background)
5. Griewing S, Knitza J, Boekhoff J, Hillen C, Lechner F, Wagner U, Wallwiener M, Kuhn S. Evolution of publicly available large language models for complex decision-making in breast cancer care. Arch Gynecol Obstet. 2024 Jul;310(1):537-550. doi: 10.1007/s00404-024-07565-4. Epub 2024 May 29. PMID: 38806945. (Background)
6. Benary M, Wang XD, Schmidt M, Soll D, Hilfenhaus G, Nassir M, Sigler C, Knodler M, Keller U, Beule D, Keilholz U, Leser U, Rieke DT. Leveraging Large Language Models for Decision Support in Personalized Oncology. JAMA Netw Open. 2023 Nov 1;6(11):e2343689. doi: 10.1001/jamanetworkopen.2023.43689. PMID: 37976064. (Background)
Related Links
Study Officials
- Principal Investigator: Hannah K Galvin, MD (Cambridge Health Alliance)
Study Design
- Study Type: Observational
- Observational Model: Cohort
- Time Perspective: Prospective
- Sponsor Type: Other
- Responsible Party: Principal Investigator
- PI Title: Chief Health Information Officer
Study Record Dates
First Submitted: September 15, 2025
First Posted: September 30, 2025
Study Start: October 1, 2025
Primary Completion: April 30, 2026
Study Completion (Estimated): September 1, 2026
Last Updated: February 18, 2026
Record last verified: 2026-02