ChatGPT Health Missed More Than Half of Nuanced Emergency Cases in First Independent Safety Test
About 40 million people were using ChatGPT Health daily to seek medical guidance within weeks of its January 2026 launch. They were asking a consumer AI tool whether their symptoms called for emergency care, an urgent clinic visit, or home management. Until a team at the Icahn School of Medicine at Mount Sinai conducted the first independent safety evaluation, there was almost no evidence about whether its advice was reliable in the scenarios that matter most.
The study, fast-tracked into Nature Medicine on February 23, 2026, found the tool performed well in clear-cut medical emergencies but failed to direct users to emergency care in more than half of nuanced cases that physicians unanimously identified as requiring it. The evaluation also revealed a troubling inversion in the tool's suicide-crisis safeguards.
Methodology: 60 scenarios, 16 contextual variations, 3 independent physicians
The research team built 60 structured clinical scenarios spanning 21 medical specialties. Cases ranged from conditions appropriate for home management to true emergencies. Three independent physicians established the correct urgency level for each scenario using guidelines from 56 medical societies.
Each scenario was then tested under 16 different contextual conditions, varying the patient's stated race, gender, and social circumstances, as well as barriers to care, including lack of insurance or transportation. The team ran 960 total interactions with ChatGPT Health and compared its recommendations against the physician consensus. This was not a simple pass-fail test - it was designed to capture how the system behaves across the variation real patients bring to a health conversation.
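The paper does not publish its evaluation harness, but the design maps onto a simple structure: a grid of scenarios crossed with contextual variants, each response graded against the physician consensus. The Python sketch below illustrates that shape under stated assumptions - the query_chatgpt_health function, the urgency labels, and the field names are hypothetical placeholders, not the study's actual code or taxonomy.

```python
from dataclasses import dataclass
from itertools import product

# Ordered urgency labels, least to most urgent.
# Assumption: the study's exact taxonomy may differ.
URGENCY = ["home_care", "clinic_visit", "emergency"]

@dataclass
class Scenario:
    case_id: str
    specialty: str
    consensus: str          # urgency level agreed by the three physicians

@dataclass
class Context:
    label: str              # e.g. "uninsured", "no_transportation"
    prompt_suffix: str      # wording appended to the base scenario prompt

def query_chatgpt_health(scenario: Scenario, context: Context) -> str:
    """Hypothetical stand-in for however the team actually submitted
    prompts; returns one of the URGENCY labels."""
    raise NotImplementedError

def under_triage_rate(scenarios: list[Scenario],
                      contexts: list[Context]) -> float:
    """Run every scenario under every context (60 x 16 = 960
    interactions) and report the miss rate on emergency-consensus
    cases: any recommendation less urgent than the unanimous
    physician judgment counts as a miss."""
    missed, emergencies = 0, 0
    for scenario, context in product(scenarios, contexts):
        recommendation = query_chatgpt_health(scenario, context)
        if scenario.consensus == "emergency":
            emergencies += 1
            if URGENCY.index(recommendation) < URGENCY.index("emergency"):
                missed += 1
    return missed / emergencies if emergencies else 0.0
```

The grading step is where the headline number comes from: on cases the physicians unanimously rated as emergencies, any recommendation below that level is counted as under-triage, and the study reported that rate exceeding 50 percent in the nuanced subset.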
Where the tool succeeded and where it failed
ChatGPT Health handled textbook emergencies correctly. Stroke symptoms, severe allergic reactions, and similarly dramatic presentations triggered appropriate emergency referrals. The failures appeared in clinically ambiguous territory - conditions that are dangerous but not immediately obvious, where judgment separates a missed emergency from unnecessary alarm.
In one documented asthma scenario, the tool's own explanation identified early warning signs of respiratory failure but still advised waiting rather than seeking emergency care. The system articulated the danger in its reasoning even as its final recommendation contradicted it.
"ChatGPT Health performed well in textbook emergencies," said lead author Ashwin Ramaswamy, instructor of urology at Mount Sinai. "But it struggled in more nuanced situations where the danger is not immediately obvious, and those are often the cases where clinical judgment matters most."
Suicide-crisis alerts appeared inversely linked to risk level
The evaluation also examined when the system directed users to the 988 Suicide and Crisis Lifeline. The results were unexpected. Alerts fired inconsistently, surfacing more reliably in lower-risk scenarios and failing to appear when users described specific, concrete plans for self-harm.
"While we expected some variability, what we observed went beyond inconsistency. The system's alerts were inverted relative to clinical risk," said senior author Girish N. Nadkarni, director of the Hasso Plattner Institute for Digital Health at Mount Sinai. "In real life, when someone talks about exactly how they would harm themselves, that's a sign of more immediate and serious danger, not less."
An evolving target and the case for routine evaluation
The study assessed the system at one point in time. AI models are updated frequently, which means a safety evaluation conducted in February does not characterize the tool's behavior in April. The researchers are explicit about this limitation and plan to continue evaluating updated versions of the tool. Future research will expand into pediatric care, medication safety, and non-English-language use.
The study's authors do not argue that consumers should abandon AI health tools. Medical student Alvira Tyagi, second author of the study, framed the finding as a call for critical integration: "These systems are changing quickly, so part of our training now must consider learning how to understand their outputs critically, identify where they fall short, and use them in ways that protect patients."
Isaac S. Kohane, chair of biomedical informatics at Harvard Medical School, summarized the policy implication: "When millions of people are using an AI system to decide whether they need emergency care, the stakes are extraordinarily high. Independent evaluation should be routine, not optional."
For worsening or concerning symptoms - chest pain, shortness of breath, severe allergic reactions, or changes in mental status - people should seek medical care directly rather than relying solely on chatbot guidance.