NCT07199231

Brief Summary

OpenEvidence is an online tool that aggregates and synthesizes data from peer-reviewed medical studies, then produces a response to a user's question using generative AI. While it is in use by a number of clinicians (including residents) today, there is little to no published data on whether the tool's outputs are accurate and whether this information appropriately informs clinical decision making. Similarly, a number of clinicians are turning to other large language models (LLMs) to assist in decision making when providing clinical care. While a number of studies have been published on the accuracy of these LLMs' responses to medical board questions or clinical vignettes, few studies to date have examined their performance in a real-world clinical setting, and even fewer have compared performance across tools. In this study, investigators have two goals:

  1. To determine whether the use of the AI tool "OpenEvidence" leads to clinically appropriate decisions when utilized by family medicine, internal medicine, and psychiatry residents in the course of clinical practice.
  2. To determine how the output of the OpenEvidence tool compares with three other commonly-used, publicly-available large language models (OpenAI's ChatGPT, Anthropic's Claude, and Google's Gemini) in answering common questions that residents have in the course of clinical practice.

To accomplish these goals, each time residents consult the tool they will complete the following steps (sketched as a data-capture record after this list):

  1. State their clinical question.
  2. Query OpenEvidence, capturing their prompt and the OpenEvidence output for data analysis. All residents will undergo training in prompt engineering at the start of the study.
  3. State their clinical conclusion based on the OpenEvidence data.
  4. Query the Gold Standard Resource.
  5. State their final clinical conclusion.
  6. Answer a question on whether their clinical conclusion was modified by the Gold Standard reference.
  7. Answer a question on whether they had any clinical safety concerns on the output from OpenEvidence.
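To make the captured data concrete, here is a minimal sketch of a per-consultation record implied by the steps above; the schema and all field names are hypothetical illustrations, not the study's actual case report form.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class EncounterRecord:
    """One resident consultation of OpenEvidence (hypothetical schema)."""
    resident_id: str            # de-identified resident code
    timestamp: datetime
    clinical_question: str      # step 1: the stated clinical question
    prompt: str                 # step 2: exact prompt sent to OpenEvidence
    openevidence_output: str    # step 2: captured tool output
    interim_conclusion: str     # step 3: conclusion based on OpenEvidence alone
    gold_standard_source: str   # step 4: e.g. "PubMed" or "UpToDate"
    final_conclusion: str       # step 5: conclusion after the gold-standard check
    conclusion_modified: bool   # step 6: changed by the Gold Standard reference?
    safety_concern: bool        # step 7: any clinical safety concern with the output?
    safety_notes: str = ""      # optional free-text detail on safety concerns
```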

Trial Health

Trial Health Score: 75 (On Track)

Automated assessment based on enrollment pace, timeline, and geographic reach.

Enrollment: 20 participants targeted (target is below P25 for all trials)

Timeline: started October 2025, 4 months left (shorter than P25 for all trials)

Geographic Reach: 1 country, 1 active site

Status: Enrolling by invitation

Health score is calculated from publicly available data and should be used for screening purposes only.


Study Timeline

Key milestones and dates

Study Progress: 65% (Oct 2025 to Sep 2026)

First Submitted (initial submission to the registry): September 15, 2025 (Completed)

First Posted (study publicly available on registry): September 30, 2025 (Completed)

Study Start (first participant enrolled): October 1, 2025 (Completed)

Primary Completion (last participant's last visit for primary outcome): April 30, 2026 (Expected)

Study Completion (last participant's last visit for all outcomes): September 1, 2026 (Expected)

Last Updated: February 18, 2026

Status Verified: February 1, 2026

Enrollment Period: 7 months

First QC Date: September 15, 2025

Last Update Submit: February 17, 2026

Outcome Measures

Primary Outcomes (3)

  • Clinical Appropriateness: Mean with SD

    Clinical appropriateness score of resident decisions based on OpenEvidence output. This is a numeric score on a 10-point Likert scale. (For the Likert scales described in this and all of our outcome metrics, higher scores are better outcomes.) The mean score with standard deviation will be used for the primary outcome.

    6 months

  • Clinical Appropriateness: Median

    Clinical appropriateness score of resident decisions based on OpenEvidence output. This is a numeric score on a 10-point Likert scale. Median clinical appropriateness scores will also be reported.

    6 months

  • Clinical Appropriateness: Interrater Reliability

    Subject matter experts (SMEs) will evaluate Clinical Appropriateness scores of resident decisions based on OpenEvidence output on a 10-point Likert scale. Interrater reliability of SME Clinical Appropriateness scores will be calculated using a kappa statistic (a computational sketch follows this list).

    6 months
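As a worked illustration of these primary outcome computations, here is a minimal sketch on made-up scores. The descriptive statistics are standard; the quadratic-weighted Cohen's kappa shown is one common choice for ordinal 10-point ratings, and is an assumption, since the record does not specify which kappa variant will be used.

```python
from statistics import mean, median, stdev

def quadratic_weighted_kappa(r1, r2, n_levels=10):
    """Cohen's kappa with quadratic weights for two raters on a 1..n_levels ordinal scale."""
    n = len(r1)
    # Observed agreement matrix, expressed as proportions
    obs = [[0.0] * n_levels for _ in range(n_levels)]
    for a, b in zip(r1, r2):
        obs[a - 1][b - 1] += 1 / n
    # Marginal rating distributions (expected agreement under rater independence)
    p1 = [r1.count(k + 1) / n for k in range(n_levels)]
    p2 = [r2.count(k + 1) / n for k in range(n_levels)]
    num = den = 0.0
    for i in range(n_levels):
        for j in range(n_levels):
            w = (i - j) ** 2 / (n_levels - 1) ** 2  # quadratic disagreement weight
            num += w * obs[i][j]
            den += w * p1[i] * p2[j]
    return 1 - num / den

# Hypothetical clinical appropriateness scores from two SME raters
rater1 = [8, 9, 7, 6, 9, 8, 5, 7]
rater2 = [8, 8, 7, 7, 9, 7, 6, 7]

print(f"mean={mean(rater1):.2f}, sd={stdev(rater1):.2f}, median={median(rater1)}")
print(f"quadratic weighted kappa={quadratic_weighted_kappa(rater1, rater2):.3f}")
```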

Secondary Outcomes (12)

  • Comparative Accuracy of LLMs: Win Rate (see the computational sketch after this list)

    6 months

  • Comparative Accuracy: Margin of Win

    6 months

  • Comparative Accuracy: Effect Size

    6 months

  • Comparative Accuracy of LLMs: Interrater Reliability

    6 months

  • Comparative Completeness of LLMs: Win Rate

    6 months

  • +7 more secondary outcomes
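The registry entry does not define these comparative metrics, so the following is a minimal sketch of one plausible reading, in which SMEs score paired responses from two models to the same clinical questions. The pairing scheme and the use of Cohen's d on paired score differences as the effect size are assumptions, not the study's stated analysis plan.

```python
from statistics import mean, stdev

def compare_models(scores_a, scores_b):
    """Pairwise comparison of two models scored on the same clinical questions.

    scores_a, scores_b: SME accuracy scores (e.g. 10-point Likert) for models A and B.
    Returns the win rate of A (ties excluded), the mean margin on questions A wins,
    and a paired effect size (Cohen's d computed on the score differences).
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    decided = [d for d in diffs if d != 0]  # drop ties for the win rate
    win_rate = sum(d > 0 for d in decided) / len(decided) if decided else 0.5
    wins = [d for d in diffs if d > 0]
    margin = mean(wins) if wins else 0.0
    sd = stdev(diffs) if len(diffs) > 1 else 0.0
    effect_size = mean(diffs) / sd if sd > 0 else 0.0
    return win_rate, margin, effect_size

# Hypothetical scores: OpenEvidence vs. one comparator LLM on six questions
oe = [8, 7, 9, 6, 8, 7]
comparator = [7, 7, 8, 7, 6, 7]
wr, mg, d = compare_models(oe, comparator)
print(f"win rate={wr:.2f}, mean margin of win={mg:.2f}, effect size d={d:.2f}")
```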

Study Arms (2)

Medicine Residents

Trainees in internal medicine or family medicine

Other: AI clinical reference tool

Psychiatry residents

Trainees in adult and child psychiatry

Other: AI clinical reference tool

Interventions

Residents will use the OpenEvidence clinical reference tool in the course of routine clinical care. They must also consult a Gold Standard clinical reference tool (e.g., PubMed, UpToDate) to mitigate risk.

Arms: Medicine Residents, Psychiatry residents

Eligibility Criteria

Sex: All
Healthy Volunteers: Yes
Age Groups: Child (0-17), Adult (18-64), Older Adult (65+)
Sampling Method: Non-Probability Sample
Study Population

Post-graduate trainees in Internal Medicine, Family Medicine, Adult Psychiatry, or Child Psychiatry

You may qualify if:

  • Active trainees PGY-1 through PGY-6 in Internal Medicine, Family Medicine, Adult Psychiatry, or Child Psychiatry at Cambridge Health Alliance
  • Must agree to the study protocol requirements outlined in the study description.

You may not qualify if:

  • Residents who plan to leave CHA prior to the end of the study collection period.


Sponsors & Collaborators

Cambridge Health Alliance

Study Sites (1)

Cambridge Health Alliance

Cambridge, Massachusetts, 02193, United States

Related Publications (6)

  • Skryd A, Lawrence K. ChatGPT as a Tool for Medical Education and Clinical Decision-Making on the Wards: Case Study. JMIR Form Res. 2024 May 8;8:e51346. doi: 10.2196/51346.

    PMID: 38717811 (Background)
  • Rao A, Kim J, Kamineni M, Pang M, Lie W, Dreyer KJ, Succi MD. Evaluating GPT as an Adjunct for Radiologic Decision Making: GPT-4 Versus GPT-3.5 in a Breast Imaging Pilot. J Am Coll Radiol. 2023 Oct;20(10):990-997. doi: 10.1016/j.jacr.2023.05.003. Epub 2023 Jun 21.

    PMID: 37356806 (Background)
  • Pagano S, Strumolo L, Michalk K, Schiegl J, Pulido LC, Reinhard J, Maderbacher G, Renkawitz T, Schuster M. Evaluating ChatGPT, Gemini and other Large Language Models (LLMs) in orthopaedic diagnostics: A prospective clinical study. Comput Struct Biotechnol J. 2024 Dec 26;28:9-15. doi: 10.1016/j.csbj.2024.12.013. eCollection 2025.

    PMID: 39850460 (Background)
  • Levkovich I, Rabin E, Brann M, Elyoseph Z. Large language models outperform general practitioners in identifying complex cases of childhood anxiety. Digit Health. 2024 Dec 15;10:20552076241294182. doi: 10.1177/20552076241294182. eCollection 2024 Jan-Dec.

    PMID: 39687523 (Background)
  • Griewing S, Knitza J, Boekhoff J, Hillen C, Lechner F, Wagner U, Wallwiener M, Kuhn S. Evolution of publicly available large language models for complex decision-making in breast cancer care. Arch Gynecol Obstet. 2024 Jul;310(1):537-550. doi: 10.1007/s00404-024-07565-4. Epub 2024 May 29.

    PMID: 38806945 (Background)
  • Benary M, Wang XD, Schmidt M, Soll D, Hilfenhaus G, Nassir M, Sigler C, Knodler M, Keller U, Beule D, Keilholz U, Leser U, Rieke DT. Leveraging Large Language Models for Decision Support in Personalized Oncology. JAMA Netw Open. 2023 Nov 1;6(11):e2343689. doi: 10.1001/jamanetworkopen.2023.43689.

    PMID: 37976064 (Background)

Study Officials

  • Hannah K Galvin, MD

    Cambridge Health Alliance

    PRINCIPAL INVESTIGATOR

Study Design

Study Type: Observational
Observational Model: Cohort
Time Perspective: Prospective
Sponsor Type: Other
Responsible Party: Principal Investigator
PI Title: Chief Health Information Officer

Study Record Dates

First Submitted

September 15, 2025

First Posted

September 30, 2025

Study Start

October 1, 2025

Primary Completion

April 30, 2026

Study Completion (Estimated)

September 1, 2026

Last Updated

February 18, 2026

Record last verified: 2026-02
