Oxford University: Is AI Reliable in Healthcare & Medicine?

The debate surrounding AI chatbots in medical decision-making has intensified following recent research from the University of Oxford, which suggests that relying on AI for health-related guidance could present significant risks to patient safety and clinical outcomes.
A comprehensive study examining AI reliability for health-related questions involved nearly 1,300 UK participants, divided into four groups.
Three groups each utilised a different chatbot system (GPT-4o, Llama 3 or Command R+), whilst the fourth served as a control group.
This control cohort was permitted to use any method, including internet searches or personal judgement, to identify health conditions without AI assistance.
Performance gaps in AI
The research methodology involved qualified doctors drafting ten medical scenarios, which were then passed to other clinicians who provided correct diagnoses for these hypothetical cases.
Each participant was randomly allocated one of these medical scenarios and, with the exception of the control group, engaged with their assigned AI chatbot to help assess the condition.
Participants subsequently reported the advice they received from the AI systems, or the conclusions drawn from alternative resources, along with their proposed course of action, such as seeking urgent primary care, consulting their GP or calling an ambulance.
The findings revealed that large language models performed no better than traditional methods for medical assessment.
Patients who used AI chatbots did not make superior decisions regarding their condition management compared to participants who relied on conventional approaches like online searches or personal judgement.
Accuracy in recommending appropriate patient action varied considerably based on the model employed.
GPT-4o achieved the highest accuracy at 64.7%, whilst Command R+ scored just over half at 55.5% and Llama 3 fell below half at 48.8%.
"These findings highlight the difficulty of building AI systems that can genuinely support people in sensitive, high-stakes areas like health," says Dr Rebecca Payne, who is a GP and was the lead medical practitioner on the study.
"Despite all the hype, AI just isn't ready to take on the role of the physician.
"Patients need to be aware that asking a large language model about their symptoms can be dangerous, giving wrong diagnoses and failing to recognise when urgent help is needed."
These results could indicate that current AI systems lack the reliability needed for clinical decision support, particularly in high-stakes medical situations where incorrect guidance could lead to delayed treatment or inappropriate care pathways.
The performance variations between different AI models highlight inconsistencies in diagnostic capabilities across platforms.
Healthcare professionals reviewing the research have expressed caution about the readiness of large language models to provide reliable medical guidance.
Some experts emphasised that whilst AI has applications in healthcare, several factors require addressing before widespread clinical deployment becomes appropriate.
Lead author Andrew Bean said the findings show how interacting with humans poses a challenge "even for top" AI models.
"We hope this work will contribute to the development of safer and more useful AI systems," he says.
Healthcare AI adoption challenges
The study arrives at a critical juncture for healthcare AI integration.
According to OpenAI, ChatGPT Health launched in December 2024 and now attracts more than 230 million users worldwide seeking health-related information each week, with more than 40 million daily queries.
However, the University of Oxford research demonstrates that two of the three chatbots tested were accurate only around half of the time, which could raise concerns for healthcare providers about patient safety and liability.
The findings suggest that human interaction poses challenges even for advanced AI models.
One significant limitation identified in the study involved users' uncertainty about what information they should provide to AI chatbots for accurate assessment.
This knowledge gap could mean that even sophisticated AI systems may receive incomplete or irrelevant data, potentially compromising diagnostic accuracy.
For healthcare systems considering AI implementation, this could highlight the need for patient education alongside technology deployment.
Without proper guidance on how to interact with AI tools, patients may struggle to obtain accurate assessments regardless of the underlying technology's capabilities.
Medical ethics and limitations
Healthcare commentators have questioned whether AI can replace physician judgement, particularly given concerns about the absence of moral compass and ethical conviction in AI tools.
For clinical leaders, this could suggest that AI should serve as a decision support tool rather than a replacement for medical professionals.
Saadia Mahmud, a Management Consultant at Interactive, said: "AI has uses but being a double edged sword we need to use it wisely. Can AI replace a physician? My view is a resounding 'No'.
"AI tools have no moral compass and ethical conviction."
The research underscores the difficulty of building AI systems that can genuinely support patients in sensitive, high-stakes areas like health, despite considerable hype surrounding AI capabilities in healthcare.
The limitations identified raise fundamental questions about the appropriate role of AI in clinical settings.
The study's implications extend beyond individual patient interactions to broader healthcare system considerations.
As healthcare organisations face pressure to adopt innovative technologies whilst maintaining patient safety standards, the Oxford research could inform evidence-based approaches to AI integration.
Healthcare executives may need to balance enthusiasm for AI efficiency gains against the demonstrated limitations in accuracy and the potential for misdiagnosis or inappropriate triage recommendations.
The evidence suggests a measured approach to AI deployment may be more appropriate than rapid, widespread adoption.




