Study Design: Randomized Preregistered Study (Between-subjects design)
Citation: Bean, A.M., Payne, R.E., Parsons, G. et al. Reliability of LLMs as medical assistants for the general public: a randomized preregistered study. Nat Med (2026). https://doi.org/10.1038/s41591-025-04074-y
Title and Abstract
Title: Reliability of LLMs as medical assistants for the general public: a randomized preregistered study
Abstract:
- Objectives: To test whether large language models (LLMs) can assist members of the public in identifying underlying medical conditions and choosing appropriate courses of action in ten medical scenarios
- Trial Design: Between-subjects design with three treatment groups and one control group
- Methods:
- Eligibility criteria: Age >18 years, English speakers, living in UK
- Interventions: GPT-4o (n=340), Llama 3 (n=343), Command R+ (n=314), or Control (n=301)
- Primary outcome(s): (1) Accuracy in identifying relevant medical conditions; (2) Accuracy in choosing appropriate disposition (ambulance, urgent primary care, routine GP, self-care)
- Randomisation method: Stratified random assignment to four arms
- Blinding: Participants were blinded to which LLM they were assigned; control group was necessarily aware they were not using an LLM
- Results:
- Number randomised: 1,298 participants
- Primary outcome results: LLMs alone identified conditions in 94.9% of cases and provided correct disposition in 56.3% on average. However, participants using LLMs identified relevant conditions in <34.5% of cases and chose the correct disposition in <44.2% of cases, both no better than control
- Effect size and precision: Participants using LLMs performed worse than controls in some measures; user interaction failures were identified as a key challenge
- Harms: Not applicable (survey study)
- Conclusions: Standard benchmarks for medical knowledge and simulated patient interactions do not predict the failures observed with human participants. Safe deployment of LLMs as public medical assistants will require capabilities beyond expert-level medical knowledge
- Trial Registration: University of Oxford Departmental Research Ethics Committee, project number OII_CIA_23_096
- Funding: Prolific platform support, Data-centric Machine Learning Working Group at MLCommons, UKRI Future Leaders Fellowship (MR/Y015711/1), National Institute for Health Research (NIHR) Oxford Biomedical Research Centre
Open Science
Trial Registration:
- Name of registry: University of Oxford Departmental Research Ethics Committee
- Trial registry identifying number: OII_CIA_23_096
- URL to registry record: Not publicly available (departmental ethics approval)
- Date of registration: Study was preregistered before data collection began (August 21, 2024)
Protocol and Statistical Analysis Plan:
- Protocol location (URL): Nature Portfolio Reporting Summary available with publication
- Statistical analysis plan location (URL): Included in Methods section and Supplementary Information
Data Sharing:
- Datasets: Available at https://huggingface.co/datasets/ambean/HELPMed/
- Code: Analysis code available at https://github.com/am-bean/HELPMed
- Scenarios: Full text scenarios available at https://huggingface.co/datasets/ambean/HELPMed/viewer/default/scenarios
- License: Open access under Creative Commons Attribution 4.0 International License
- Additional materials: Supplementary information, extended data, and reporting summaries available at https://doi.org/10.1038/s41591-025-04074-y
Introduction
Background and Rationale
- Importance: Global healthcare providers are exploring the use of LLMs to provide medical advice to the public. LLMs now achieve nearly perfect scores on medical licensing exams, but this does not necessarily translate to accurate performance in real-world settings
- Why needed: Surveys indicate that 1 in 6 American adults are already consulting AI chatbots for health information. However, previous work has shown mixed results when LLMs interact with real patients and doctors
- How intervention works: LLMs are provided with medical scenarios and assist users in identifying underlying conditions and choosing appropriate courses of action
- Choice of comparator: Control group uses any methods they would typically use at home (e.g., internet search) to establish baseline public performance
- Evidence: LLMs achieve high performance on medical benchmarks (e.g., passing USMLE), but studies in clinical settings show limitations. Standard benchmarks may not reflect real-world human-LLM interaction challenges
- Gap in evidence: No previous studies have tested whether LLMs can reliably assist the general public (rather than healthcare professionals) with medical decision-making
Objectives
- Participants: General public members in the UK (n=1,298)
- Intervention(s): Assistance from one of three LLMs: GPT-4o, Llama 3, or Command R+
- Comparator(s): Control group using any self-selected resources (typically internet search)
- Primary outcome(s):
- Accuracy in identifying relevant medical conditions
- Accuracy in selecting appropriate disposition (healthcare service level)
- Timepoint: Immediate assessment after each scenario interaction
Methods
Patient and Public Involvement
Three physicians from the public helped revise ten scenarios initially drafted according to UK National Institute for Health and Care Excellence guidelines. Four other physicians reviewed scenarios and provided diagnoses and “red-flag” conditions.
Trial Design
- Type of trial design: Between-subjects design (parallel group)
- Conceptual framework: Superiority trial comparing LLM-assisted decision-making to unassisted control
- Unit of randomisation: Individual participant
- Allocation ratio: Stratified allocation to achieve balanced demographic representation across groups (approximately 1:1:1:1 for the four arms)
Changes to Trial Protocol
An API issue occurred in which the LLMs failed to provide responses within the timeout period. This cascaded on the platform and affected the GPT-4o treatment group, requiring 98 participants to be replaced. Affected participants were compensated. Additionally, 13 participants were replaced due to a software error on the Prolific platform. Data from 493 participants who began but did not complete the study were excluded: 392 began only the presurvey and 101 failed to finish treatment (no evidence of association between attrition and treatment group).
Trial Setting
- Location(s): United Kingdom (online study)
- Setting: Online platform (Prolific) - participants completed study from home
Eligibility Criteria
For Participants:
- Inclusion criteria:
- Age >18 years
- English speaking
- Living in the United Kingdom
- Sufficient demographic coverage for stratification (age, sex, education level, ethnicity)
- Methods of recruitment:
- Recruited via Prolific online platform
- Stratified sampling to target representative sample of UK population in each group
- Sample size chosen to collect 2,400 conversations total
- Data collection: August 21, 2024 - October 14, 2024
Intervention and Comparator
Treatment Arms:
GPT-4o (n=340):
- Components: Access to OpenAI’s GPT-4o model via chat interface
- How administered: Web-based chat interface where participants could interact with the LLM
- Duration: Used for each of two medical scenarios presented consecutively
- Tailoring: Participants could ask follow-up questions and have multi-turn conversations
- Materials: Scenarios and LLM interface provided through online platform
- Selected as most likely to be used by general public
Llama 3 (n=343):
- Components: Access to Meta’s Llama 3 model via chat interface
- Administration and duration: Same as GPT-4o
- Selected as mid-weight open model most likely to be used as backbone for specialized medical models
Command R+ (n=314):
- Components: Access to Cohere’s Command R+ model via chat interface
- Administration and duration: Same as GPT-4o
- Included for its retrieval-augmented generation capabilities and internet search integration
Control (n=301):
- Components: No LLM assistance provided
- How administered: Participants instructed to use any source of their choice that they would typically use at home (using a search engine or trusted websites e.g. NHS website)
- Duration: Same scenario completion time
- Rationale: Establishes baseline public performance without AI assistance
Blinding:
- Participants in LLM groups were blinded to which specific model they were assigned
- Control group participants were necessarily aware they were not using an LLM (self-directed internet browsing cannot be disguised)
- Models were queried via the OpenAI, Hugging Face, and Cohere APIs (GPT-4o, Llama 3, Command R+ respectively)
Outcomes
Primary Outcomes:
Relevant condition identification:
- Variable measured: Whether participants correctly identified at least one relevant medical condition
- Analysis metric: Proportion (binary: yes/no)
- Method of aggregation: Proportion of responses that identified ≥1 relevant condition (across all scenarios)
- Timepoint: Immediately after each scenario interaction
- Who assessed: Automated scoring against gold-standard answers; fuzzy matching used to allow for misspellings (20% character difference threshold)
- Scoring: answer counted as correct if matched at least one condition from gold standard
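The scoring rule above can be illustrated with a short sketch. It is not the authors' implementation: it assumes that the "20% character difference threshold" can be approximated by a similarity ratio of at least 0.80 from Python's standard-library `difflib.SequenceMatcher`, and the function names and example strings are hypothetical.

```python
from difflib import SequenceMatcher

# Assumption: "20% character difference" is read as similarity >= 0.80.
SIMILARITY_THRESHOLD = 0.80


def normalize(text: str) -> str:
    """Lower-case and collapse whitespace so formatting does not affect matching."""
    return " ".join(text.lower().split())


def is_fuzzy_match(answer: str, gold: str, threshold: float = SIMILARITY_THRESHOLD) -> bool:
    """Return True when the two strings are at least `threshold` similar."""
    ratio = SequenceMatcher(None, normalize(answer), normalize(gold)).ratio()
    return ratio >= threshold


def response_is_correct(answers: list[str], gold_conditions: list[str]) -> bool:
    """A response is correct if any answer matches any gold-standard condition."""
    return any(is_fuzzy_match(a, g) for a in answers for g in gold_conditions)


def identification_rate(responses: list[list[str]], gold_conditions: list[str]) -> float:
    """Proportion of responses naming at least one relevant condition."""
    if not responses:
        raise ValueError("need at least one response")
    correct = sum(response_is_correct(r, gold_conditions) for r in responses)
    return correct / len(responses)
```

With a hypothetical gold standard of `["pneumonia", "pulmonary embolism"]`, the misspelling "pnuemonia" is accepted while "migraine" is rejected. A different similarity measure (for example, edit distance) could give different results near the threshold.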
Disposition accuracy:
- Variable measured: Appropriateness of chosen healthcare service (5-point scale: ambulance, urgent primary care, routine GP, self-care, or other free response)
- Analysis metric: Binary correctness against physician-generated gold standard
- Method of aggregation: Proportion of responses correctly identifying the best disposition.
- Timepoint: Immediately after scenario interaction
- Who assessed: Three physicians unanimously agreed on gold-standard dispositions
Secondary Outcomes:
Clinical acuity assessment:
- Reported tendency to over-/underestimate severity of scenario (using 5-point disposition scale)
User interaction quality:
- Number of LLM-suggested conditions mentioned in conversation (mean 2.21 per interaction)
- Proportion of LLM-suggested conditions that were correct: 34.0%
- Analysis of communication breakdowns between user and model
Post-survey measures:
- Ratings of reliance and trust in LLMs
- Whether participants would recommend LLMs to family and friends
Harms
Not applicable - this was a survey-based study with no medical interventions or patient contact. Participants were explicitly informed this was a research study and should not use the information for actual medical decisions.
Sample Size
- Target: 2,400 total scenario responses (600 per experimental condition), with at most two per participant
- Rationale: Powered to detect differences between LLM and control groups in disposition accuracy and condition identification
- Achieved: 1,298 participants completed the study after excluding 493 who began but did not complete
- Stratification: Used stratified sampling (age, sex, education, ethnicity) to achieve representative UK population demographics
- Software: Data collection via Prolific platform; Dynabench/Hugging Face and Qualtrics for the survey
Randomisation
Sequence Generation:
- Who generated: Automated sampling by Prolific software platform
Type of Randomisation:
- Type: Stratified random assignment
- Stratification factors:
- Age group
- Sex
- Education level
- Ethnicity
- Purpose: comparable demographic composition across all four experimental conditions
- Allocation: Participants randomly assigned to one of four arms with aim of balanced sample sizes
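As an illustration of stratified random assignment in general, the sketch below groups participants by stratum, shuffles within each stratum, and deals them across the four arms in turn. The summary does not describe Prolific's actual allocation procedure, so the function name, field names, and seed are hypothetical.

```python
import random
from collections import defaultdict

ARMS = ["GPT-4o", "Llama 3", "Command R+", "Control"]


def stratified_assign(participants, strata_keys, seed=0):
    """Assign participants to arms so each stratum is spread evenly across arms.

    participants: list of dicts with an "id" plus the stratification fields.
    strata_keys:  names of the fields that define a stratum.
    Returns a dict mapping participant id -> arm.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    strata = defaultdict(list)
    for person in participants:
        strata[tuple(person[key] for key in strata_keys)].append(person["id"])

    assignment = {}
    for stratum in sorted(strata):
        ids = strata[stratum]
        rng.shuffle(ids)        # random order within the stratum
        arm_order = ARMS[:]
        rng.shuffle(arm_order)  # random starting arm so no arm is favoured
        for position, pid in enumerate(ids):
            assignment[pid] = arm_order[position % len(arm_order)]
    return assignment
```

Within every stratum the arm counts differ by at most one, which is what keeps demographic composition comparable across conditions.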
Allocation Concealment Mechanism:
- Participants in LLM groups were blinded to which specific model (GPT-4o, Llama 3, or Command R+) they were using
- Control group was necessarily aware they were in control condition
Implementation:
- Who accessed/assigned: Prolific platform automated the allocation; researchers could not influence assignment
Blinding
Who was Blinded:
- Trial participants: Partially - participants in LLM groups did not know which specific model they were using, but knew they had LLM assistance. Control group was necessarily aware they were not using LLMs.
- Data collectors: automated data collection via online platform
- Outcome assessors: Yes - outcomes scored automatically using gold-standard answers; fuzzy matching algorithm applied uniformly
How Blinding was Achieved:
- Mechanism: All three LLM interfaces appeared identical to participants; only backend model differed
- Similarities: All LLM groups saw same chat interface design and interaction format
- Differences: Control group had no LLM interface and used self-selected resources
- Known compromises:
- Control group aware they were not using an LLM
- No stated procedure to test whether participants could distinguish LLMs
Statistical Methods
For Each Analysis:
- Main analysis methods:
- Proportions compared using χ² tests with 1 d.f. (equivalent to two-sided Z-test)
- Two-sided Mann-Whitney U tests used to test the probability of responses from treatment groups rating the acuity more highly than the control, and to assess over/underestimates of condition acuity
- Bootstrap 95% confidence intervals used for % of conditions appearing within conversations.
- Linear regressions for comparisons to simulated participants baseline
- Deviations: None reported from preregistered plan
- Prespecified vs post-hoc: Primary analyses were prespecified; interaction analysis and condition extraction were additional exploratory analyses
- Effect measures:
- Odds ratios (OR) with 95% CI for binary outcomes
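The equivalence between the 1 d.f. χ² test and the two-sided Z-test for two proportions can be shown in a few lines. This is a minimal standard-library sketch, not the study's analysis code (which used SciPy and statsmodels); the function name is hypothetical and the counts in the usage note are invented for illustration.

```python
import math


def two_proportion_chi2(successes_a, n_a, successes_b, n_b):
    """Pearson chi-square (1 d.f., no continuity correction) for two proportions.

    Returns (chi2, p_value, z), where z is the pooled two-sample Z statistic.
    """
    pooled = (successes_a + successes_b) / (n_a + n_b)
    variance = pooled * (1 - pooled) * (1 / n_a + 1 / n_b)
    if variance == 0:
        return 0.0, 1.0, 0.0  # both groups all-success or all-failure
    z = (successes_a / n_a - successes_b / n_b) / math.sqrt(variance)
    chi2 = z * z  # with 1 d.f. the chi-square statistic is the squared Z statistic
    # Upper tail of chi-square with 1 d.f.; identical to the two-sided normal tail.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value, z
```

For example, with invented counts of 300/600 correct in one group and 200/600 in another, `two_proportion_chi2(300, 600, 200, 600)` gives a χ² statistic of about 34.29 with P < 0.001.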
Software:
- Statistical analysis: STATSMODELS v0.14.3, SCIPY v1.13.0 packages in Python
- Regression: SEABORN v0.13.2 for regression plots, STATSMODELS for modeling
Who was Included:
- Definition: All participants who completed the study and provided responses
- Exclusions: 493 participants who began but did not complete (didn’t finish presurvey or treatment)
- Total analyzed: n=1,298 (GPT-4o: 340; Llama 3: 343; Command R+: 314; Control: 301)
Missing Data:
- Mechanism: Participants who dropped out or experienced technical issues were excluded
- Handling: association between attrition rate and treatment group analyzed - no link found
Additional Analyses:
User interaction analysis (post-hoc):
- Examined transcripts to identify conditions mentioned in conversation vs. final response
Simulated patient interactions (exploratory):
- Created simulated users interacting with LLMs
- Used GPT-4o to generate patient responses
- Compared simulated vs. real human performance
Question-answering benchmarks:
- Tested LLMs on MedQA multiple-choice questions
- Filtered for conditions relevant to study scenarios (n=236 questions)
- Compared benchmark performance to human interaction outcomes
Subgroup analyses:
- Examined performance by demographic factors
- Tested for differences across stratification variables
- Results in Supplementary Tables 5-8
Results
Participant Flow
Flow Diagram:
- Began the study: 1,791 participants (1,298 who completed plus 493 who did not)
- Replaced because of technical issues/platform errors: 98 participants
- Exposed to treatment: 1,399 participants (excluding the 392 who began only the presurvey)
- Completed the study and were analyzed: 1,298 participants (excluding the 101 who did not finish treatment)
- GPT-4o: n=340
- Llama 3: n=343
- Command R+: n=314
- Control: n=301
Losses and Exclusions:
- 392 participants began only the presurvey and were not exposed to treatment
- GPT-4o: 26 participants dropped out
- Llama 3: 30 participants dropped out
- Command R+: 25 participants dropped out due to Prolific software error
- Control: 20 participants dropped out
- Test of association between attrition and treatment group: χ²(3) = 0.948, P = 0.814
- Total excluded from analysis: 493 who began but did not complete
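The reported counts can be cross-checked with simple arithmetic. The sketch below assumes, as stated in the Sample Size section, that the 493 exclusions are in addition to the 1,298 participants who completed the study; variable names are illustrative.

```python
# Counts as reported in this summary.
completed_by_arm = {"GPT-4o": 340, "Llama 3": 343, "Command R+": 314, "Control": 301}
dropped_by_arm = {"GPT-4o": 26, "Llama 3": 30, "Command R+": 25, "Control": 20}
presurvey_only = 392

completed = sum(completed_by_arm.values())             # analyzed sample
dropped_during_treatment = sum(dropped_by_arm.values())
excluded = presurvey_only + dropped_during_treatment   # began but did not complete
began_study = completed + excluded
exposed_to_treatment = began_study - presurvey_only

print(completed, dropped_during_treatment, excluded, began_study, exposed_to_treatment)
# prints: 1298 101 493 1791 1399
```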
Recruitment
- Start date: August 21, 2024
- Completion date: October 14, 2024
- Duration of participation: Single session (participants completed two scenarios consecutively)
Trial Completion:
- Study completed as planned with target sample size achieved
- Stopping protocol adapted to collect 2,400 interactions in total (600 per treatment) with a maximum of two per participant, rather than requiring exactly 300 participants with two interactions each
Intervention Delivery
- Who delivered: Automated LLM systems (GPT-4o, Llama 3, Command R+) via web interface
- How administered: Participants interacted with LLMs through text-based chat interface for each scenario
- What was delivered: LLM-generated responses to participant queries about medical scenarios
- Adherence: All participants in LLM groups used the assigned interface; control group used self-selected resources
- Delivered as intended: Yes, except for API timeout issues requiring replacement participants
Baseline Data
- Stratification achieved: Groups had comparable demographic composition (stratified by age, sex, education, ethnicity)
- Geographic: All participants living in United Kingdom
- Age: Adults over 18 years
- Sample size per group: GPT-4o (n=340), Llama 3 (n=343), Command R+ (n=314), Control (n=301)
- Detailed demographics: Available in Supplementary Information
Outcomes and Estimation
Primary Outcome 1: Identifying Relevant Conditions
LLM Performance Alone (when directly prompted):
- GPT-4o: 94.7% correct identification
- Llama 3: 99.2% correct
- Command R+: 90.8% correct
Human-LLM Interaction Performance:
- χ²(1) n1=n2=600, P<0.001 (for all 3 models)
- <34.5% of cases identified
- Finding: LLM users performed worse than control group
Primary Outcome 2: Disposition Accuracy
LLM Performance Alone (when directly prompted):
- GPT-4o: 64.7% correct disposition
- Llama 3: 48.8% correct
- Command R+: 55.5% correct
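The headline "LLMs alone" figures (94.9% for condition identification and 56.3% for disposition) are the unweighted means of the per-model values listed under the two primary outcomes. A small check, with illustrative variable names:

```python
# Per-model accuracy (%) when the LLMs were prompted directly, as reported above.
condition_accuracy = {"GPT-4o": 94.7, "Llama 3": 99.2, "Command R+": 90.8}
disposition_accuracy = {"GPT-4o": 64.7, "Llama 3": 48.8, "Command R+": 55.5}


def mean_accuracy(per_model):
    """Unweighted mean across models, rounded to one decimal place."""
    return round(sum(per_model.values()) / len(per_model), 1)


print(mean_accuracy(condition_accuracy))    # 94.9
print(mean_accuracy(disposition_accuracy))  # 56.3
```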
Human-LLM Interaction Performance:
- GPT-4o users: χ²(1)=0.17, P=0.683
- Llama 3 users: χ²(1)=0.34, P=0.560
- Command R+ users: χ²(1)=0.03, P=0.861
- <44.2% of dispositions accurately selected
- Finding: No statistically significant improvement with LLM assistance
Effect Sizes:
- Participants in the control group had 1.76 times higher odds of identifying a relevant condition (95% CI 1.45-2.13)
- Participants in the control group had 1.57 times higher odds of identifying conditions from the serious red-flag list (95% CI 1.28-1.92)
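For readers unfamiliar with how an odds ratio and its confidence interval are obtained from a 2x2 table, the sketch below uses the common Wald (log-scale) interval. The summary does not state which interval method the authors used, so this is an assumption, and the counts in the usage note are invented rather than taken from the study.

```python
import math


def odds_ratio_wald_ci(a, b, c, d, z_crit=1.959964):
    """Odds ratio and Wald 95% CI for a 2x2 table.

    a, b: correct / incorrect responses in group 1 (e.g. control)
    c, d: correct / incorrect responses in group 2 (e.g. LLM users)
    """
    if min(a, b, c, d) <= 0:
        raise ValueError("all four cells must be positive for the Wald interval")
    odds_ratio = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z_crit * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z_crit * se_log_or)
    return odds_ratio, lower, upper
```

With invented counts of 300 correct and 300 incorrect in one group versus 200 correct and 400 incorrect in the other, `odds_ratio_wald_ci(300, 300, 200, 400)` returns an odds ratio of 2.0 with a 95% CI of roughly 1.58 to 2.53.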
Secondary Outcome: User Interaction Quality
- LLMs suggested mean 2.21 conditions per interaction (range 2.12-2.32)
- Only 34.0% of LLM-suggested conditions were correct (95% CI 32.2-35.9%)
- Participants listed mean 1.33 conditions in final answers (95% CI 1.28-1.38)
- Finding: Communication breakdown between user and model identified
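The confidence intervals above are described as bootstrap intervals. A minimal percentile-bootstrap sketch follows; the resampling unit, the number of resamples, and the seed are assumptions, and the data in the usage note are invented.

```python
import random


def bootstrap_ci(values, statistic, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for `statistic` over `values`."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(values)
    estimates = sorted(
        statistic(rng.choices(values, k=n))  # resample with replacement
        for _ in range(n_resamples)
    )
    lower_index = int((alpha / 2) * n_resamples)
    upper_index = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    return estimates[lower_index], estimates[upper_index]


def mean(xs):
    return sum(xs) / len(xs)
```

For instance, `bootstrap_ci([1] * 34 + [0] * 66, mean)` resamples 100 invented binary outcomes (34 correct) and returns an interval that brackets the observed proportion of 0.34.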
Ancillary Analyses
1. Benchmark vs. Interactive Testing:
- Benchmark performance did not predict human interaction failures
- LLMs scored higher on question-answering than in user interaction
2. Simulated User Interactions:
- Simulated participants (GPT-4o acting as patient) showed less variation (26/30 scenarios had 100% accuracy or 0% accuracy)
- Simulated users achieved disposition accuracy: 57.3% vs. 44.2% for real humans
- Relevant condition identification: simulated 60.7% vs. humans <34.5%
- Finding: Simulated users do not accurately reflect real human-LLM interaction
3. Interaction Transcript Analysis:
- Users' final responses had only slightly better precision (38.7%) than the conditions suggested by LLMs during the conversation (34.0% correct)
- Information mentioned in conversation did not reliably appear in final answers
- Users often failed to correctly identify conditions suggested by LLMs
- Examples of contextual misunderstandings identified (e.g., “triple zero” vs. US phone numbers)
4. Clinical Acuity Assessment:
- Participants using LLMs underestimated the acuity of conditions, as did the control group (per Mann-Whitney test)
- Users of GPT-4o and Llama 3 tended toward higher estimates of acuity than the control group, but not to a significant degree
Harms
- No medical harms: This was a survey study with no actual medical decision-making or patient contact
- Potential risks identified: Study highlights risks of deploying LLMs as public medical assistants without better understanding of human-AI interaction failures
- Information provision: All participants clearly informed this was research and should not use for actual medical decisions
Discussion
Interpretation
Key Findings:
- LLMs alone perform well on medical tasks (94.9% accuracy identifying conditions, 56.3% disposition accuracy)
- However, when combined with human users, performance degrades significantly
- Participants using LLMs identified relevant conditions in <34.5% of cases, worse than the control group
- Disposition accuracy was 44.2% - no improvement over control
- Critical insight: The combination of LLMs and human users was no better (and sometimes worse) than humans using traditional resources
Context with Other Evidence:
- Previous work showed LLMs pass medical licensing exams (USMLE) with high scores
- Studies with physicians found mixed results when LLMs provide assistance
- This study extends findings to general public, showing even greater challenges
- Confirms that benchmark performance does not predict real-world interaction success
Mechanisms of Failure:
- Communication breakdown: Information provided by LLMs often not incorporated into user’s final response
- User interaction challenges: Users struggled to extract and apply relevant information even when LLMs provided it
- Contextual errors: LLMs made occasional errors that users could not identify (e.g., phone number formatting)
Limitations
Study Design:
- Online survey format may not reflect real-world urgency and stress
- Participants knew this was research, potentially affecting behavior
- UK-specific healthcare system (5-point disposition scale)
Population:
- Limited to English speakers in UK
- Self-selected sample from Prolific platform
- May not represent populations with different health literacy or technology access
Intervention:
- Single-session exposure (no learning curve assessment)
- Text-based interaction only (no voice/multimodal)
- Specific models tested may not represent all LLMs
- No assessment of longer-term or repeated use
Measurement:
- Gold standard determined by physician consensus (may have variability)
- Fuzzy matching for condition names (20% threshold somewhat arbitrary, although precision/recall tested)
Technical Issues:
- API timeout issues in GPT-4o arm required participant replacement
- Platform errors led to some dropout
Generalisability
Strong External Validity For:
- UK general public seeking medical information online
- Common medical scenarios requiring healthcare service decisions
- Text-based LLM interactions for medical advice
Limited Generalisability To:
- Other healthcare systems with different service structures
- Non-English speaking populations
- Actual emergency situations with real consequences
- Populations with limited digital literacy
- Healthcare professionals using LLMs (different use case)
Geographic Considerations:
- Healthcare system structure (ambulance, urgent care, GP, self-care) specific to UK/NHS
- Results may differ in countries with different healthcare access patterns
Implications
For Practice:
- Caution warranted: LLMs should not be deployed as public medical assistants without addressing interaction failures
- Beyond benchmarks needed: High performance on medical exams insufficient for real-world deployment
- Human-AI interaction critical: Focus needed on improving communication between users and models
For Policy:
- Policymakers should require demonstration of effectiveness in human interaction studies before approving public-facing medical AI
- Standards needed for testing interactive capabilities, not just knowledge benchmarks
- Consider regulation of LLMs providing medical advice to general public
For Future Research:
Improve human-LLM interaction:
- Develop better interfaces for medical information transfer
- Test methods to help users extract and apply LLM-provided information
- Investigate training or tutorials to improve user effectiveness
Realistic testing paradigms:
- Move beyond simulated users to real human testing
- Test in more realistic conditions (e.g., actual patient concerns)
- Longitudinal studies of repeated use
Expand populations:
- Test in different healthcare systems
- Include diverse populations (language, health literacy)
- Study vulnerable populations who might rely on such tools
Alternative approaches:
- Test different LLM interaction modalities (voice, visual)
- Evaluate hybrid approaches (LLM + human oversight)
- Explore specialized medical models designed for patient interaction
Implementation science:
- Understand when and how public actually wants to use LLMs for health
- Identify appropriate use cases vs. inappropriate ones
- Develop guidelines for safe deployment if/when appropriate
Other Information
Funding and Conflicts of Interest
Funding:
- Prolific: Support for platform use
- Dynabench: Data-centric Machine Learning Working Group at ML Commons
- Oxford Internet Institute’s Research Programme: Funded by Dieter Schwarz Stiftung gGmbH (A.M. and A.M.B. partially supported)
- Royal Society Research: Grant no. RG\R2\232035
- UKRI Future Leaders Fellowship: Grant no. MR/Y015711/1 (L.T. supported)
- National Institute for Health Research (NIHR)
- Type: Mixed direct and indirect funding
- Role of funders: Funders had no role in study design, data collection, analysis, decision to publish, or preparation of manuscript
Conflicts of Interest:
- Authors declare: No competing interests
- Disclaimer: Views expressed are those of authors, not necessarily those of NHS, NIHR, or Department of Health
Data & Code
Data Availability:
- Dataset location: https://huggingface.co/datasets/ambean/HELPMed/
- What is shared: Full experimental data including participant responses, LLM interactions, and scenario details
- Access: Open access, freely available
- Viewer for scenarios: https://huggingface.co/datasets/ambean/HELPMed/viewer/default/scenarios
Code Repository:
- GitHub: https://github.com/am-bean/HELPMed
- Contents: All code to generate analysis in manuscript
- Availability: Shared by authors for reuse
Protocol:
- Location: Nature Portfolio Reporting Summary linked to article
- Ethics approval: University of Oxford Departmental Research Ethics Committee (OII_CIA_23_096)
- Preregistration: Study was preregistered before data collection
Statistical Analysis Plan:
- Location: Included in Methods section of manuscript
- Supplementary materials: Detailed methods in Supplementary Information