Building the First Evidence-Based Safety Guide for Health AI Chatbots
People are already using AI chatbots as a first stop for medical questions. Do these symptoms warrant a doctor's visit? What does this medication side effect mean? Is this pain serious? Millions of people worldwide are typing these questions into ChatGPT, Gemini, Copilot, Claude, and similar tools - without any established framework for judging whether the answers they receive are reliable, dangerously wrong, or somewhere in between.
The absence of such a framework is not an accident of slow technology adoption. Regulatory systems for AI in healthcare were designed around clinical decision-support tools and medical devices, not general-purpose language models being used informally by the public. That gap - a governance vacuum, as the researchers describe it - has left users to navigate health information from AI without guidance analogous to what exists for, say, reading a drug label or consulting a pharmacist.
A research consortium led by the University of Birmingham is now building that guide. Announced in a correspondence published in Nature Health, the project - called The Health Chatbot Users' Guide - is an international effort involving more than 20 institutions, explicitly co-designed with members of the public rather than developed solely by researchers and handed down to users. Applications are open for public co-investigators to help shape the project.
What the Guide Is Designed to Address
The project team has identified four categories of risk that make unguided AI health chatbot use genuinely hazardous.
Medical inaccuracy is the most obvious. Large language models generate plausible-sounding text by predicting likely continuations of prompts based on patterns in training data - they do not reason from a curated medical knowledge base and can deliver factually incorrect statements with apparent confidence. The phenomenon of "hallucination" - generating specific, detailed information that is simply false - is well documented and has been observed in medical contexts across multiple published studies.
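To make that mechanism concrete, the sketch below is a deliberately crude toy, not a description of how any commercial chatbot is built: a tiny bigram model that continues a prompt with whichever word most often followed the previous one in an invented miniature "training corpus". The corpus text and example prompt are made up for illustration; the point is only that the output is a statistical continuation, with no step anywhere that checks the result against verified medical facts.

```python
# Toy illustration: next-word prediction from corpus statistics alone.
# The "corpus" below is invented; real chatbots use large neural networks
# trained on vastly more text, but the underlying idea is similar: continue
# the prompt with statistically likely words, with no built-in check against
# a verified medical knowledge base.
from collections import Counter, defaultdict

corpus = (
    "mild headache is usually caused by dehydration . "
    "mild headache is usually caused by stress . "
    "chest pain is usually caused by indigestion . "
).split()

# Count which word tends to follow which (a bigram model).
next_word_counts = defaultdict(Counter)
for current, nxt in zip(corpus, corpus[1:]):
    next_word_counts[current][nxt] += 1

def continue_prompt(prompt: str, max_words: int = 8) -> str:
    """Greedily extend the prompt with the most frequent next word."""
    words = prompt.split()
    for _ in range(max_words):
        candidates = next_word_counts.get(words[-1])
        if not candidates:
            break
        words.append(candidates.most_common(1)[0][0])
    return " ".join(words)

# The continuation reads fluently and confidently, but it is only an echo
# of the corpus statistics - nothing verifies whether it is true, or safe,
# for the person asking.
print(continue_prompt("chest pain is"))
```

Run on this toy corpus, the prompt "chest pain is" gets completed with a confident-sounding cause simply because those words co-occurred in the training text - a miniature version of why fluent output is not evidence of accuracy.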
The echo chamber effect is subtler but potentially as harmful. AI systems optimized for user satisfaction tend to be agreeable, validating the user's framing of a question rather than challenging potentially incorrect assumptions. For someone convinced their symptoms have a benign cause when they may not, an AI that confirms that belief represents a real clinical risk.
Algorithmic bias reflects a structural problem: the data on which large language models are trained over-represents certain populations and perspectives. This can result in AI tools that perform less reliably for non-English speakers, for patients with presentations atypical of the majority populations studied in medical research, or for health questions pertaining to conditions more prevalent in historically underserved communities.
Data privacy concerns center on what happens to the sensitive personal health information that people enter into commercial AI platforms. Most large language model services are operated by private companies whose terms of service for data use are complex and not always transparent to users.
Co-Design as Method
A distinctive feature of the project is its public co-design model. Three public co-investigators and a public steering group have been given formal roles in setting the direction of the guide's development, not merely as consultants after the fact. The rationale is practical: a safety resource that is not accessible and useful to the people it is intended to help - across different ages, literacy levels, languages, and levels of health knowledge - is a failed resource regardless of its scientific rigor.
"The use of general-purpose chatbots for healthcare is no longer a hypothetical future possibility; it is a current reality," said Dr Joseph Alderman, National Institute for Health and Care Research Clinical Lecturer at the University of Birmingham and corresponding author. "Ignoring this shift leaves the public to navigate a hazardous information landscape unaided."
Dr Charlotte Blease, a health AI researcher at Uppsala University and Harvard Medical School who is senior researcher on the project and author of Dr Bot, framed the scale of the problem: "Health chatbots have become the world's most accessible first opinion - often speaking to patients before any doctor does."
Scope and Limitations of the Initiative
The guide is being positioned as a harm reduction and benefit maximization resource rather than a blanket endorsement or condemnation of AI health tools. This framing is deliberate: the goal is not to discourage use but to help people use these tools more safely given that they will use them regardless.
The initiative is at an early stage. No draft guide exists yet; the project is in the co-design and development phase. Whether a user-facing guide can meaningfully reduce health harms from AI chatbot use - as opposed to other interventions like improved AI training, regulatory oversight, or better integration of AI with formal clinical pathways - is an empirical question that this project alone cannot answer.
The consortium spans researchers from the University of Birmingham, University Hospitals Birmingham NHS Foundation Trust, the NIHR Birmingham Biomedical Research Centre, and more than 20 international institutions. The correspondence was published in Nature Health.