Reliability of LLMs as medical assistants for the general public

This is my 1st attempt at reviewing a randomized, controlled study with Claude Code. I downloaded a PDF of the CONSORT 2025 expanded checklist, and asked Claude Code to make a Hugo archetype as a template for reviewing RCT studies. It made an excellent template - I only made 2 or 3 changes.

As for the review itself … it made many, many errors. In particular it was extremely bad at getting the numerical results correct. It was also bad at accurately describing important details of the experimental design. To give an example, in this study, the participants in the control arm are fully aware they are a control (they are googling instead of using LLMs – it's hard to blind them from this fact). Claude failed to notice this subtlety, and claimed they were not necessarily aware they were a control.

Moreover, the ideas generated by Claude in the discussion section were vague and open-ended. To be fair, humans do the same (more studies needed!), but these tools are meant to enhance science: non-inferiority is not enough!

Study Design: Randomized Preregistered Study (Between-subjects design)

Citation: Bean, A.M., Payne, R.E., Parsons, G. et al. Reliability of LLMs as medical assistants for the general public: a randomized preregistered study. Nat Med (2026). https://doi.org/10.1038/s41591-025-04074-y

Title and Abstract

Title: Reliability of LLMs as medical assistants for the general public: a randomized preregistered study

Abstract:

Objectives: To test whether large language models (LLMs) can assist members of the public in identifying underlying medical conditions and choosing appropriate courses of action in ten medical scenarios
Trial Design: Between-subjects design with three treatment groups and one control group
Methods:
- Eligibility criteria: Age >18 years, English speakers, living in UK
- Interventions: GPT-4o (n=340), Llama 3 (n=343), Command R+ (n=314), or Control (n=301)
- Primary outcome(s): (1) Accuracy in identifying relevant medical conditions; (2) Accuracy in choosing appropriate disposition (ambulance, urgent primary care, routine GP, self-care)
- Randomisation method: Stratified random assignment to four arms
- Blinding: Participants were blinded to which LLM they were assigned; control group was necessarily aware they were not using an LLM
Results:
- Number randomised: 1,298 participants
- Primary outcome results: LLMs alone identified conditions in 94.9% of cases and provided correct disposition in 56.3% on average. However, participants using LLMs identified relevant conditions in <34.5% of cases and the correct disposition in <44.2% of cases, both no better than control
- Effect size and precision: Participants using LLMs performed worse than controls in some measures; user interaction failures were identified as a key challenge
- Harms: Not applicable (survey study)
Conclusions: Standard benchmarks for medical knowledge and simulated patient interactions do not predict the failures observed with human participants. Safe deployment of LLMs as public medical assistants will require capabilities beyond expert-level medical knowledge
Trial Registration: University of Oxford Departmental Research Ethics Committee, project number OII_CIA_23_096
Funding: Prolific platform support, Data-centric Machine Learning Working Group at MLCommons, UKRI Future Leaders Fellowship (MR/Y015711/1), National Institute for Health Research (NIHR) Oxford Biomedical Research Centre

Open Science

Trial Registration:

Name of registry: University of Oxford Departmental Research Ethics Committee
Trial registry identifying number: OII_CIA_23_096
URL to registry record: Not publicly available (departmental ethics approval)
Date of registration: Study was preregistered before data collection began (August 21, 2024)

Protocol and Statistical Analysis Plan:

Protocol location (URL): Nature Portfolio Reporting Summary available with publication
Statistical analysis plan location (URL): Included in Methods section and Supplementary Information

Data Sharing:

Datasets: Available at https://huggingface.co/datasets/ambean/HELPMed/
Code: Analysis code available at https://github.com/am-bean/HELPMed
Scenarios: Full text scenarios available at https://huggingface.co/datasets/ambean/HELPMed/viewer/default/scenarios
License: Open access under Creative Commons Attribution 4.0 International License
Additional materials: Supplementary information, extended data, and reporting summaries available at https://doi.org/10.1038/s41591-025-04074-y

Introduction

Background and Rationale

Importance: Global healthcare providers are exploring the use of LLMs to provide medical advice to the public. LLMs now achieve nearly perfect scores on medical licensing exams, but this does not necessarily translate to accurate performance in real-world settings
Why needed: Surveys indicate that 1 in 6 American adults are already consulting AI chatbots for health information. However, previous work has shown mixed results when LLMs interact with real patients and doctors
How intervention works: LLMs are provided with medical scenarios and assist users in identifying underlying conditions and choosing appropriate courses of action
Choice of comparator: Control group uses any methods they would typically use at home (e.g., internet search) to establish baseline public performance
Evidence: LLMs achieve high performance on medical benchmarks (e.g., passing USMLE), but studies in clinical settings show limitations. Standard benchmarks may not reflect real-world human-LLM interaction challenges
Gap in evidence: No previous studies have tested whether LLMs can reliably assist the general public (rather than healthcare professionals) with medical decision-making

Objectives

Participants: General public members in the UK (n=1,298)
Intervention(s): Assistance from one of three LLMs: GPT-4o, Llama 3, or Command R+
Comparator(s): Control group using any self-selected resources (typically internet search)
Primary outcome(s):
1. Accuracy in identifying relevant medical conditions
2. Accuracy in selecting appropriate disposition (healthcare service level)
Timepoint: Immediate assessment after each scenario interaction

Methods

Patient and Public Involvement

Three physicians from the public helped revise ten scenarios initially drafted according to UK National Institute for Health and Care Excellence guidelines. Four other physicians reviewed scenarios and provided diagnoses and “red-flag” conditions.

Trial Design

Type of trial design: Between-subjects design (parallel group)
Conceptual framework: Superiority trial comparing LLM-assisted decision-making to unassisted control
Unit of randomisation: Individual participant
Allocation ratio: Stratified allocation to achieve balanced demographic representation across groups (approximately 1:1:1:1 for the four arms)

Changes to Trial Protocol

An API issue occurred where the LLMs failed to provide responses within the timeout period. This cascaded on the platform, impacting the GPT-4o treatment group, requiring 98 participants to be replaced. Affected participants were compensated. Additionally, 13 participants were replaced due to a software error on the Prolific platform. Data from 493 participants who began but did not complete the study were excluded: 392 began only the presurvey and 101 failed to finish treatment (no evidence of association between attrition and treatment group).

Trial Setting

Location(s): United Kingdom (online study)
Setting: Online platform (Prolific) - participants completed study from home

Eligibility Criteria

For Participants:

Inclusion criteria:
- Age >18 years
- English speaking
- Living in the United Kingdom
- Sufficient demographic coverage for stratification (age, sex, education level, ethnicity)
Methods of recruitment:
- Recruited via Prolific online platform
- Stratified sampling to target representative sample of UK population in each group
- Sample size chosen to collect 2,400 conversations total
- Data collection: August 21, 2024 - October 14, 2024

Intervention and Comparator

Treatment Arms:

GPT-4o (n=340):
- Components: Access to OpenAI’s GPT-4o model via chat interface
- How administered: Web-based chat interface where participants could interact with the LLM
- Duration: Used for each of two medical scenarios presented consecutively
- Tailoring: Participants could ask follow-up questions and have multi-turn conversations
- Materials: Scenarios and LLM interface provided through online platform
- Selected as most likely to be used by general public
Llama 3 (n=343):
- Components: Access to Meta’s Llama 3 model via chat interface
- Administration and duration: Same as GPT-4o
- Selected as mid-weight open model most likely to be used as backbone for specialized medical models
Command R+ (n=314):
- Components: Access to Cohere’s Command R+ model via chat interface
- Administration and duration: Same as GPT-4o
- Included for its retrieval-augmented generation capabilities and internet search integration
Control (n=301):
- Components: No LLM assistance provided
- How administered: Participants instructed to use any source they would typically use at home (e.g., a search engine or trusted websites such as the NHS website)
- Duration: Same scenario completion time
- Rationale: Establishes baseline public performance without AI assistance

Blinding:

Participants in LLM groups were blinded to which specific model they were assigned
Control group participants were necessarily aware they were not using an LLM (can’t disguise internet browsing!)
Models were queried via the OpenAI, Hugging Face, and Cohere APIs (GPT-4o, Llama 3, Command R+ respectively)

Outcomes

Primary Outcomes:

Relevant condition identification:
- Variable measured: Whether participants correctly identified at least one relevant medical condition
- Analysis metric: Proportion (binary: yes/no)
- Method of aggregation: Proportion of responses that identified ≥1 relevant condition (across all scenarios)
- Timepoint: Immediately after each scenario interaction
- Who assessed: Automated scoring against gold-standard answers; fuzzy matching used to allow for misspellings (20% character difference threshold)
- Scoring: answer counted as correct if matched at least one condition from gold standard
Disposition accuracy:
- Variable measured: Appropriateness of chosen healthcare service (5-point scale: ambulance, urgent primary care, routine GP, self-care, or other free response)
- Analysis metric: Binary correctness against physician-generated gold standard
- Method of aggregation: Proportion of responses correctly identifying the best disposition.
- Timepoint: Immediately after scenario interaction
- Who assessed: Three physicians unanimously agreed on gold-standard dispositions
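The fuzzy-matching rule used for condition scoring can be sketched roughly as follows. This is an illustration only, not the authors' scoring code: the function names, the use of plain Levenshtein distance, and normalising by the longer string's length are all assumptions about how a "20% character difference" threshold might be applied.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def is_fuzzy_match(answer: str, gold: str, max_diff: float = 0.20) -> bool:
    """True if the strings differ by at most max_diff of the longer string's length."""
    answer, gold = answer.strip().lower(), gold.strip().lower()
    longest = max(len(answer), len(gold))
    if longest == 0:
        return True
    return levenshtein(answer, gold) / longest <= max_diff


def is_correct(answers: list, gold_conditions: list) -> bool:
    """A response is correct if any answer matches any gold-standard condition."""
    return any(is_fuzzy_match(a, g) for a in answers for g in gold_conditions)
```

Under these assumptions a one-letter misspelling such as "apendicitis" still matches "appendicitis", while an unrelated condition does not. A different normalisation (for example, by the gold-standard string's length) would shift borderline cases.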

Secondary Outcomes:

Clinical acuity assessment:
- Reported tendency to over-/underestimate severity of scenario (using 5-point disposition scale)
User interaction quality:
- Number of LLM-suggested conditions mentioned in conversation (mean 2.21 per interaction)
- Proportion of correct LLM-suggested conditions: 34.0%
- Analysis of communication breakdowns between user and model
Post-survey measures:
- Ratings of reliance and trust in LLMs
- Whether participants would recommend LLMs to family and friends

Harms

Not applicable - this was a survey-based study with no medical interventions or patient contact. Participants were explicitly informed this was a research study and should not use the information for actual medical decisions.

Sample Size

Target: 1,298 participants to collect 2,400 total scenario responses (600 per experimental condition)
Rationale: Powered to detect differences between LLM and control groups in disposition accuracy and condition identification
Achieved: 1,298 participants completed the study after excluding 493 who began but did not complete
Stratification: Used stratified sampling (age, sex, education, ethnicity) to achieve representative UK population demographics
Software: Data collection via Prolific platform; Dynabench/Hugging Face Qualtrics for survey

Randomisation

Sequence Generation:

Who generated: Automated sampling by Prolific software platform

Type of Randomisation:

Type: Stratified random assignment
Stratification factors:
- Age group
- Sex
- Education level
- Ethnicity
Purpose: comparable demographic composition across all four experimental conditions
Allocation: Participants randomly assigned to one of four arms with aim of balanced sample sizes
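To make the idea of stratified assignment concrete, here is a minimal sketch. The review does not describe Prolific's actual allocation procedure, so everything below (the function name, the shuffle-then-rotate scheme, the participant fields) is a hypothetical illustration of the general technique rather than what the platform does.

```python
import random
from collections import Counter, defaultdict

ARMS = ["GPT-4o", "Llama 3", "Command R+", "Control"]


def stratified_assignment(participants, strata_keys, seed=0):
    """Assign participants to arms so each stratum is split as evenly as possible."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for person in participants:
        strata[tuple(person[key] for key in strata_keys)].append(person["id"])

    assignment = {}
    for members in strata.values():
        rng.shuffle(members)               # random order within the stratum
        offset = rng.randrange(len(ARMS))  # avoid always favouring the first arm
        for position, participant_id in enumerate(members):
            assignment[participant_id] = ARMS[(position + offset) % len(ARMS)]
    return assignment


# Hypothetical participants: two strata of 20 people each.
participants = [
    {"id": i, "age_group": "18-34" if i < 20 else "35+", "sex": "female"}
    for i in range(40)
]
assignment = stratified_assignment(participants, ["age_group", "sex"])
arm_counts = Counter(assignment.values())
```

Because each stratum is dealt out across the four arms in turn, the arms end up with near-identical demographic composition, which is the stated purpose of the stratification.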

Allocation Concealment Mechanism:

Participants in LLM groups were blinded to which specific model (GPT-4o, Llama 3, or Command R+) they were using
Control group was necessarily aware they were in control condition

Implementation:

Who accessed/assigned: Prolific platform automated the allocation; researchers could not influence assignment

Blinding

Who was Blinded:

Trial participants: Partially - participants in LLM groups did not know which specific model they were using, but knew they had LLM assistance. Control group was necessarily aware they were not using LLMs.
Data collectors: automated data collection via online platform
Outcome assessors: Yes - outcomes scored automatically using gold-standard answers; fuzzy matching algorithm applied uniformly

How Blinding was Achieved:

Mechanism: All three LLM interfaces appeared identical to participants; only backend model differed
Similarities: All LLM groups saw same chat interface design and interaction format
Differences: Control group had no LLM interface and used self-selected resources
Known compromises:
- Control group aware they were not using an LLM
- No stated procedure to test whether participants could distinguish LLMs

Statistical Methods

For Each Analysis:

Main analysis methods:
- Proportions compared using χ² tests with 1 d.f. (equivalent to two-sided Z-test)
- Two-sided Mann-Whitney U tests used to test the probability of responses from treatment groups rating the acuity more highly than the control, and to assess over/underestimates of condition acuity
- Bootstrap 95% confidence intervals used for % of conditions appearing within conversations.
- Linear regressions for comparisons to simulated participants baseline
Deviations: None reported from preregistered plan
Prespecified vs post-hoc: Primary analyses were prespecified; interaction analysis and condition extraction were additional exploratory analyses
Effect measures:
- Odds ratios (OR) with 95% CI for binary outcomes

Software:

Statistical analysis: statsmodels v0.14.3 and SciPy v1.13.0 packages in Python
Regression: seaborn v0.13.2 for regression plots, statsmodels for modeling

Who was Included:

Definition: All participants who completed the study and provided responses
Exclusions: 493 participants who began but did not complete (didn’t finish presurvey or treatment)
Total analyzed: n=1,298 (GPT-4o: 340; Llama 3: 343; Command R+: 314; Control: 301)

Missing Data:

Mechanism: Participants who dropped out or experienced technical issues were excluded
Handling: association between attrition rate and treatment group analyzed - no link found

Additional Analyses:

User interaction analysis (post-hoc):
- Examined transcripts to identify conditions mentioned in conversation vs. final response
Simulated patient interactions (exploratory):
- Created simulated users interacting with LLMs
- Used GPT-4o to generate patient responses
- Compared simulated vs. real human performance
Question-answering benchmarks:
- Tested LLMs on MedQA multiple-choice questions
- Filtered for conditions relevant to study scenarios (n=236 questions)
- Compared benchmark performance to human interaction outcomes
Subgroup analyses:
- Examined performance by demographic factors
- Tested for differences across stratification variables
- Results in Supplementary Tables 5-8

Results

Participant Flow

Flow Diagram:

Began the study but did not complete: 493 participants (excluded from analysis)
- Began only the presurvey, not exposed to treatment: 392 participants
- Dropped out during treatment: 101 participants
Replaced because of technical issues: 98 participants (GPT-4o API timeouts) and 13 participants (Prolific software error)
Completed the study and analysed: 1,298 participants
- GPT-4o: n=340
- Llama 3: n=343
- Command R+: n=314
- Control: n=301

Losses and Exclusions:

392 participants began only the presurvey and were not exposed to treatment
GPT-4o: 26 participants dropped out
Llama 3: 30 participants dropped out
Command R+: 25 participants dropped out due to Prolific software error
Control: 20 participants dropped out
Attrition did not differ significantly between arms (χ²(3) = 0.948, P = 0.814)
Total excluded from analysis: 493 who began but did not complete
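Since numerical errors were the main problem with the generated review, the counts in this section are worth cross-checking. The sketch below re-adds the dropout figures and recomputes the P value implied by the reported χ²(3) statistic, using the closed-form upper tail of the chi-square distribution with 3 degrees of freedom. The variable names are mine; `scipy.stats.chi2.sf(0.948, 3)` should give the same tail probability if SciPy is available.

```python
import math

arm_sizes = {"GPT-4o": 340, "Llama 3": 343, "Command R+": 314, "Control": 301}
dropouts = {"GPT-4o": 26, "Llama 3": 30, "Command R+": 25, "Control": 20}
presurvey_only = 392

dropped_during_treatment = sum(dropouts.values())
total_excluded = presurvey_only + dropped_during_treatment


def chi2_sf_3df(x: float) -> float:
    """Upper-tail probability of a chi-square variable with 3 degrees of freedom."""
    return (math.erfc(math.sqrt(x / 2))
            + math.sqrt(2 * x / math.pi) * math.exp(-x / 2))


p_attrition = chi2_sf_3df(0.948)
print(sum(arm_sizes.values()), dropped_during_treatment, total_excluded,
      round(p_attrition, 3))  # prints: 1298 101 493 0.814
```

The per-arm dropouts sum to the 101 who failed to finish treatment, adding the 392 presurvey-only participants gives the 493 exclusions, and the recomputed P value matches the reported 0.814.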

Recruitment

Start date: August 21, 2024
Completion date: October 14, 2024
Duration of participation: Single session (participants completed two scenarios consecutively)

Trial Completion:

Study completed as planned with target sample size achieved
Stopping protocol adapted to collect 2,400 interactions in total (600 per treatment), with a maximum of two per participant, instead of having exactly 300 participants per treatment with two interactions each

Intervention Delivery

Who delivered: Automated LLM systems (GPT-4o, Llama 3, Command R+) via web interface
How administered: Participants interacted with LLMs through text-based chat interface for each scenario
What was delivered: LLM-generated responses to participant queries about medical scenarios
Adherence: All participants in LLM groups used the assigned interface; control group used self-selected resources
Delivered as intended: Yes, except for API timeout issues requiring replacement participants

Baseline Data

Stratification achieved: Groups had comparable demographic composition (stratified by age, sex, education, ethnicity)
Geographic: All participants living in United Kingdom
Age: Adults over 18 years
Sample size per group: GPT-4o (n=340), Llama 3 (n=343), Command R+ (n=314), Control (n=301)
Detailed demographics: Available in Supplementary Information

Outcomes and Estimation

Primary Outcome 1: Identifying Relevant Conditions

LLM Performance Alone (when directly prompted):

GPT-4o: 94.7% correct identification
Llama 3: 99.2% correct
Command R+: 90.8% correct

Human-LLM Interaction Performance:

χ²(1) n1=n2=600, P<0.001 (for all 3 models)
<34.5% of cases identified
Finding: LLM users performed worse than control group

Primary Outcome 2: Disposition Accuracy

LLM Performance Alone (when directly prompted):

GPT-4o: 64.7% correct disposition
Llama 3: 48.8% correct
Command R+: 55.5% correct
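As a quick sanity check, the per-model figures for the LLMs tested alone should average to the headline numbers quoted in the abstract (94.9% and 56.3%). The sketch assumes those headline figures are simple unweighted means across the three models.

```python
condition_accuracy = {"GPT-4o": 94.7, "Llama 3": 99.2, "Command R+": 90.8}
disposition_accuracy = {"GPT-4o": 64.7, "Llama 3": 48.8, "Command R+": 55.5}


def mean_percent(values: dict) -> float:
    """Unweighted mean across models, rounded to one decimal place."""
    return round(sum(values.values()) / len(values), 1)


print(mean_percent(condition_accuracy))    # 94.9
print(mean_percent(disposition_accuracy))  # 56.3
```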

Human-LLM Interaction Performance:

GPT-4o users: χ²(1)=0.17, P=0.683
Llama 3 users: χ²(1)=0.34, P=0.560
Command R+ users: χ²(1)=0.03, P=0.861
<44.2% of dispositions accurately selected
Finding: No statistically significant improvement with LLM assistance
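The reported P values can be recomputed from the χ²(1) statistics above. For 1 degree of freedom the upper-tail probability has a closed form, so the standard library is enough. The χ² values are rounded to two decimals in the source, so agreement is only expected to within a few thousandths.

```python
import math


def chi2_sf_1df(x: float) -> float:
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2))


# (chi-square statistic, reported P) as listed for each model.
reported = {
    "GPT-4o": (0.17, 0.683),
    "Llama 3": (0.34, 0.560),
    "Command R+": (0.03, 0.861),
}

for model, (chi2, p_reported) in reported.items():
    p_recomputed = chi2_sf_1df(chi2)
    print(f"{model}: reported P={p_reported:.3f}, recomputed P={p_recomputed:.3f}")
```

All three recomputed values fall within 0.01 of the reported ones, which is consistent with rounding of the test statistic.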

Effect Sizes:

Participants in the control group had 1.76 times higher odds of identifying a relevant condition (95% CI 1.45-2.13)
Participants in the control group were 1.57 times more likely to identify conditions from the serious red-flag list (95% CI 1.28-1.92)
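One cheap consistency check on these effect sizes: if the intervals are Wald intervals computed on the log-odds scale (an assumption on my part, and it also assumes both figures are odds ratios), the point estimate should be the geometric mean of the bounds. The same arithmetic gives the implied standard error of the log odds ratio.

```python
import math


def implied_log_or_summary(lower: float, upper: float, z_crit: float = 1.96):
    """Back out the estimate and standard error implied by a Wald CI on log-odds."""
    log_lower, log_upper = math.log(lower), math.log(upper)
    point_estimate = math.exp((log_lower + log_upper) / 2)   # geometric mean of bounds
    standard_error = (log_upper - log_lower) / (2 * z_crit)  # SE of the log odds ratio
    return point_estimate, standard_error


for reported_or, (lower, upper) in [(1.76, (1.45, 2.13)), (1.57, (1.28, 1.92))]:
    implied_or, se = implied_log_or_summary(lower, upper)
    print(f"reported OR={reported_or}, implied OR={implied_or:.2f}, "
          f"SE(log OR)={se:.3f}")
```

Both implied estimates round to the reported values, so the estimates and intervals are at least internally consistent.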

Secondary Outcome: User Interaction Quality

LLMs suggested mean 2.21 conditions per interaction (range 2.12-2.32)
Only 34.0% of LLM-suggested conditions were correct (95% CI 32.2-35.9%)
Participants listed mean 1.33 conditions in final answers (95% CI 1.28-1.38)
Finding: Communication breakdown between user and model identified
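The confidence intervals in this subsection come from bootstrapping. A minimal percentile-bootstrap sketch is shown below. The data are invented per-conversation counts, and the percentile method, resample count, and seed are assumptions, since the review does not say which bootstrap variant the authors used.

```python
import random
import statistics


def bootstrap_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)
    means = sorted(
        statistics.fmean(rng.choices(values, k=len(values)))  # resample with replacement
        for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical counts of conditions listed in 20 final answers.
counts = [1, 1, 2, 1, 1, 3, 1, 2, 1, 1, 1, 2, 1, 1, 2, 1, 1, 1, 2, 1]
low, high = bootstrap_ci(counts)
print(f"mean={statistics.fmean(counts):.2f}, 95% CI=({low:.2f}, {high:.2f})")
```

With 20 observations the interval is wide. The narrow intervals reported above reflect the much larger number of interactions in the study.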

Ancillary Analyses

1. Benchmark vs. Interactive Testing:

Benchmark performance did not predict human interaction failures
LLMs scored higher on question-answering than in user interaction

2. Simulated User Interactions:

Simulated participants (GPT-4o acting as patient) showed less variation (26/30 scenarios had 100% accuracy or 0% accuracy)
Simulated users achieved disposition accuracy: 57.3% vs. <44.2% for real humans
Relevant condition identification: simulated 60.7% vs. humans <34.5%
Finding: Simulated users do not accurately reflect real human-LLM interaction

3. Interaction Transcript Analysis:

User final responses had only slightly better precision (38.7%) vs. intermediate responses (only 34.0% correct)
Information mentioned in conversation did not reliably appear in final answers
Users often failed to correctly identify conditions suggested by LLMs
Examples of contextual misunderstandings identified (e.g., “triple zero” vs. US phone numbers)

4. Clinical Acuity Assessment:

Participants using LLMs underestimate acuity of conditions, as does control group (per Mann-Whitney test)
Users of GPT-4o and Llama 3 had a tendency toward higher estimates of acuity than control group, but not to a significant degree

Harms

No medical harms: This was a survey study with no actual medical decision-making or patient contact
Potential risks identified: Study highlights risks of deploying LLMs as public medical assistants without better understanding of human-AI interaction failures
Information provision: All participants clearly informed this was research and should not use for actual medical decisions

Discussion

Interpretation

Key Findings:

LLMs alone perform well on medical tasks (94.9% accuracy identifying conditions, 56.3% disposition accuracy)
However, when combined with human users, performance degrades significantly
Participants using LLMs identified relevant conditions in <34.5% of cases, worse than the control group
Disposition accuracy was <44.2% - no improvement over control
Critical insight: The combination of LLMs and human users was no better (and sometimes worse) than humans using traditional resources

Context with Other Evidence:

Previous work showed LLMs pass medical licensing exams (USMLE) with high scores
Studies with physicians found mixed results when LLMs provide assistance
This study extends findings to general public, showing even greater challenges
Confirms that benchmark performance does not predict real-world interaction success

Mechanisms of Failure:

Communication breakdown: Information provided by LLMs often not incorporated into user’s final response
User interaction challenges: Users struggled to extract and apply relevant information even when LLMs provided it
Contextual errors: LLMs made occasional errors that users could not identify (e.g., phone number formatting)

Limitations

Study Design:

Online survey format may not reflect real-world urgency and stress
Participants knew this was research, potentially affecting behavior
UK-specific healthcare system (5-point disposition scale)

Population:

Limited to English speakers in UK
Self-selected sample from Prolific platform
May not represent populations with different health literacy or technology access

Intervention:

Single-session exposure (no learning curve assessment)
Text-based interaction only (no voice/multimodal)
Specific models tested may not represent all LLMs
No assessment of longer-term or repeated use

Measurement:

Gold standard determined by physician consensus (may have variability)
Fuzzy matching for condition names (20% threshold somewhat arbitrary, although precision/recall tested)

Technical Issues:

API timeout issues in GPT-4o arm required participant replacement
Platform errors led to some dropout

Generalisability

Strong External Validity For:

UK general public seeking medical information online
Common medical scenarios requiring healthcare service decisions
Text-based LLM interactions for medical advice

Limited Generalisability To:

Other healthcare systems with different service structures
Non-English speaking populations
Actual emergency situations with real consequences
Populations with limited digital literacy
Healthcare professionals using LLMs (different use case)

Geographic Considerations:

Healthcare system structure (ambulance, urgent care, GP, self-care) specific to UK/NHS
Results may differ in countries with different healthcare access patterns

Implications

For Practice:

Caution warranted: LLMs should not be deployed as public medical assistants without addressing interaction failures
Beyond benchmarks needed: High performance on medical exams insufficient for real-world deployment
Human-AI interaction critical: Focus needed on improving communication between users and models

For Policy:

Policymakers should require demonstration of effectiveness in human interaction studies before approving public-facing medical AI
Standards needed for testing interactive capabilities, not just knowledge benchmarks
Consider regulation of LLMs providing medical advice to general public

For Future Research:

Improve human-LLM interaction:
- Develop better interfaces for medical information transfer
- Test methods to help users extract and apply LLM-provided information
- Investigate training or tutorials to improve user effectiveness
Realistic testing paradigms:
- Move beyond simulated users to real human testing
- Test in more realistic conditions (e.g., actual patient concerns)
- Longitudinal studies of repeated use
Expand populations:
- Test in different healthcare systems
- Include diverse populations (language, health literacy)
- Study vulnerable populations who might rely on such tools
Alternative approaches:
- Test different LLM interaction modalities (voice, visual)
- Evaluate hybrid approaches (LLM + human oversight)
- Explore specialized medical models designed for patient interaction
Implementation science:
- Understand when and how public actually wants to use LLMs for health
- Identify appropriate use cases vs. inappropriate ones
- Develop guidelines for safe deployment if/when appropriate

Other Information

Funding and Conflicts of Interest

Funding:

Prolific: Support for platform use
Dynabench: Data-centric Machine Learning Working Group at MLCommons
Oxford Internet Institute’s Research Programme: Funded by Dieter Schwarz Stiftung gGmbH (A.M. and A.M.B. partially supported)
Royal Society Research: Grant no. RG\R2\232035
UKRI Future Leaders Fellowship: Grant no. MR/Y015711/1 (L.T. supported)
National Institute for Health Research (NIHR)
Type: Mixed direct and indirect funding
Role of funders: Funders had no role in study design, data collection, analysis, decision to publish, or preparation of manuscript

Conflicts of Interest:

Authors declare: No competing interests
Disclaimer: Views expressed are those of authors, not necessarily those of NHS, NIHR, or Department of Health

Data & Code

Data Availability:

Dataset location: https://huggingface.co/datasets/ambean/HELPMed/
What is shared: Full experimental data including participant responses, LLM interactions, and scenario details
Access: Open access, freely available
Viewer for scenarios: https://huggingface.co/datasets/ambean/HELPMed/viewer/default/scenarios

Code Repository:

GitHub: https://github.com/am-bean/HELPMed
Contents: All code to generate analysis in manuscript
Availability: Shared by authors for reuse

Protocol:

Location: Nature Portfolio Reporting Summary linked to article
Ethics approval: University of Oxford Departmental Research Ethics Committee (OII_CIA_23_096)
Preregistration: Study was preregistered before data collection

Statistical Analysis Plan:

Location: Included in Methods section of manuscript
Supplementary materials: Detailed methods in Supplementary Information
