The study, fast-tracked in the February 23, 2026, online issue of Nature Medicine [https://doi.org/10.1038/s41591-026-04297-7], is the first independent safety evaluation of the large language model (LLM)-based tool since its January 2026 launch. It also identified serious concerns with the tool’s suicide-crisis safeguards.
“LLMs have become patients’ first stop for medical advice—but in 2026 they are least safe at the clinical extremes, where judgment separates missed emergencies from needless alarm,” says Isaac S. Kohane, MD, PhD, Chair, Department of Biomedical Informatics at Harvard Medical School, who was not involved with the research. “When millions of people are using an AI system to decide whether they need emergency care, the stakes are extraordinarily high. Independent evaluation should be routine, not optional.”
Within weeks of its release, ChatGPT Health’s maker, OpenAI, reported that about 40 million people were using the tool daily to seek health information and guidance, including advice about whether to seek urgent or emergency care. At the same time, say the investigators, there was little independent evidence about how safe or reliable its advice actually was.
“That gap motivated our study,” says lead author Ashwin Ramaswamy, MD, Instructor of Urology at the Icahn School of Medicine at Mount Sinai. “We wanted to answer a very basic but critical question: if someone is experiencing a real medical emergency and turns to ChatGPT Health for help, will it clearly tell them to go to the emergency room?”
With respect to suicide-risk alerts, ChatGPT Health was designed to direct users to the 988 Suicide and Crisis Lifeline in high-risk situations. However, the investigators found that these alerts appeared inconsistently, sometimes triggering in lower-risk scenarios while—alarmingly—failing to appear when users described specific plans for self-harm.
“This was a particularly surprising and concerning finding,” says senior and co-corresponding study author Girish N. Nadkarni, MD, MPH, Barbara T. Murphy Chair of the Windreich Department of Artificial Intelligence and Human Health, Director of the Hasso Plattner Institute for Digital Health, Irene and Dr. Arthur M. Fishberg Professor of Medicine at the Icahn School of Medicine at Mount Sinai, and Chief AI Officer of the Mount Sinai Health System. “While we expected some variability, what we observed went beyond inconsistency. The system’s alerts were inverted relative to clinical risk, appearing more reliably for lower-risk scenarios than for cases when someone shared how they intended to hurt themselves. In real life, when someone talks about exactly how they would harm themselves, that’s a sign of more immediate and serious danger, not less.”
As part of the evaluation, the research team created 60 structured clinical scenarios spanning 21 medical specialties. Cases ranged from minor conditions appropriate for home care to true medical emergencies. Three independent physicians determined the correct level of urgency for each case using guidelines from 56 medical societies.
Each scenario was tested under 16 different contextual conditions, including variations in race, gender, social dynamics (such as someone minimizing symptoms), and barriers to care such as lack of insurance or transportation. In total, the team conducted 960 interactions (60 scenarios × 16 conditions) with ChatGPT Health and compared its recommendations with physician consensus.
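The 960 figure follows from fully crossing the scenarios with the contextual conditions. A minimal Python sketch of that design is shown below; the variable names and placeholder labels are hypothetical, since the study's actual scenario and condition names are not listed here.

```python
from itertools import product

# Counts come from the study description; the labels are placeholders (assumed).
N_SCENARIOS = 60
N_CONDITIONS = 16

scenarios = [f"scenario_{i:02d}" for i in range(1, N_SCENARIOS + 1)]
conditions = [f"condition_{j:02d}" for j in range(1, N_CONDITIONS + 1)]

# Fully crossed design: every scenario is paired with every condition once.
interactions = list(product(scenarios, conditions))

print(len(interactions))  # prints 960
```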
In testing the 60 realistic patient scenarios developed by physicians, the researchers found that while the tool generally handled clear-cut emergencies correctly, it under-triaged more than half of cases that physicians determined required emergency care.
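Under-triage means the tool recommended a lower level of urgency than the physician consensus. One way to express that comparison is sketched below in Python; the four-level urgency scale, the function names, and the example values are illustrative assumptions, not the study's actual rubric or data.

```python
from enum import IntEnum


class Urgency(IntEnum):
    # Hypothetical ordinal scale (assumption); a higher value is more urgent.
    HOME_CARE = 0
    ROUTINE_VISIT = 1
    URGENT_CARE = 2
    EMERGENCY = 3


def triage_error(recommended: Urgency, reference: Urgency) -> str:
    """Compare a tool recommendation with the physician reference level."""
    if recommended < reference:
        return "under-triage"
    if recommended > reference:
        return "over-triage"
    return "match"


def under_triage_rate(pairs) -> float:
    """Share of (recommended, reference) pairs that are under-triaged."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    under = sum(1 for rec, ref in pairs if triage_error(rec, ref) == "under-triage")
    return under / len(pairs)


# Made-up pairs for illustration only; these are not study data.
example_pairs = [
    (Urgency.URGENT_CARE, Urgency.EMERGENCY),      # under-triage
    (Urgency.EMERGENCY, Urgency.EMERGENCY),        # match
    (Urgency.HOME_CARE, Urgency.URGENT_CARE),      # under-triage
    (Urgency.URGENT_CARE, Urgency.ROUTINE_VISIT),  # over-triage
]

print(under_triage_rate(example_pairs))  # prints 0.5
```

Treating urgency as an ordered scale keeps the direction of the error explicit, which matters because under-triage and over-triage carry different risks.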
The investigators were also struck by how the system failed in emergency medical cases. The tool often demonstrated that it recognized dangerous findings in its own explanations, yet still reassured the patient.
“ChatGPT Health performed well in textbook emergencies such as stroke or severe allergic reactions,” says Dr. Ramaswamy. “But it struggled in more nuanced situations where the danger is not immediately obvious, and those are often the cases where clinical judgment matters most. In one asthma scenario, for example, the system identified early warning signs of respiratory failure in its explanation but still advised waiting rather than seeking emergency treatment.”
The study authors advise that for worsening or concerning symptoms, including chest pain, shortness of breath, severe allergic reactions, or changes in mental status, people should seek medical care directly rather than relying solely on chatbot guidance. In cases involving thoughts of self-harm, individuals should contact the 988 Suicide and Crisis Lifeline or go to an emergency department.
Still, the researchers emphasize that the findings do not suggest consumers should abandon AI health tools altogether.
“As a medical student training at a time when AI health tools are already in the hands of millions, I see them as technologies we must learn to integrate thoughtfully into care rather than substitutes for clinical judgment,” says Alvira Tyagi, a first-year medical student at the Icahn School of Medicine at Mount Sinai and second author of the study. “These systems are changing quickly, so part of our training now must consider learning how to understand their outputs critically, identify where they fall short, and use them in ways that protect patients.”
The study assessed the system at a single point in time. Because AI models are frequently updated, performance may change over time, underscoring the need for independent evaluation, the researchers say.
“Starting medical training alongside tools that are evolving in real time makes it clear that today’s results are not set in stone,” Ms. Tyagi says. “That reality calls for ongoing review to ensure that improvements in technology translate into safer care.”
The team plans to continue evaluating updated versions of ChatGPT Health and other consumer-facing AI tools, expanding future research into areas such as pediatric care, medication safety, and non-English-language use.
The paper is titled “ChatGPT Health performance in a structured test of triage recommendations.”
The study’s authors, as listed in the journal, are Ashwin Ramaswamy, MD, MPP; Alvira Tyagi, BA; Hannah Hugo, MD; Joy Jiang, PhD; Pushkala Jayaraman, PhD; Mateen Jangda, MSc; Alexis E. Te, MD; Steven A. Kaplan, MD; Joshua Lampert, MD; Robert Freeman, MSN, MS; Nicholas Gavin, MD, MBA; Ashutosh K. Tewari, MBBS, MCh; Ankit Sakhuja, MBBS, MS; Bilal Naved, PhD; Alexander W. Charney, MD, PhD; Mahmud Omar, MD; Michael A. Gorin, MD; Eyal Klang, MD; Girish N. Nadkarni, MD, MPH.
For more Mount Sinai artificial intelligence news, visit: https://icahn.mssm.edu/about/artificial-intelligence.
About Mount Sinai's Windreich Department of AI and Human Health
Led by Girish N. Nadkarni, MD, MPH—an international authority on the safe, effective, and ethical use of AI in health care—Mount Sinai’s Windreich Department of AI and Human Health is the first of its kind at a U.S. medical school, pioneering transformative advancements at the intersection of artificial intelligence and human health.
The Department is committed to leveraging AI in a responsible, effective, ethical, and safe manner to transform research, clinical care, education, and operations. By bringing together world-class AI expertise, cutting-edge infrastructure, and unparalleled computational power, the department is advancing breakthroughs in multi-scale, multimodal data integration while streamlining pathways for rapid testing and translation into practice.
The Department benefits from dynamic collaborations across Mount Sinai, including with the Hasso Plattner Institute for Digital Health at Mount Sinai—a partnership between the Hasso Plattner Institute for Digital Engineering in Potsdam, Germany, and the Mount Sinai Health System—which complements its mission by advancing data-driven approaches to improve patient care and health outcomes.
At the heart of this innovation is the renowned Icahn School of Medicine at Mount Sinai, which serves as a central hub for learning and collaboration. This unique integration enables dynamic partnerships across institutes, academic departments, hospitals, and outpatient centers, driving progress in disease prevention, improving treatments for complex illnesses, and elevating quality of life on a global scale.
In 2024, the Department's innovative NutriScan AI application, developed by the Mount Sinai Health System Clinical Data Science team in partnership with Department faculty, earned Mount Sinai Health System the prestigious Hearst Health Prize. NutriScan is designed to facilitate faster identification and treatment of malnutrition in hospitalized patients. This machine learning tool improves malnutrition diagnosis rates and resource utilization, demonstrating the impactful application of AI in health care.
For more information on Mount Sinai's Windreich Department of AI and Human Health, visit: ai.mssm.edu
About the Hasso Plattner Institute at Mount Sinai
At the Hasso Plattner Institute for Digital Health at Mount Sinai, the tools of data science, biomedical and digital engineering, and medical expertise are used to improve and extend lives. The Institute represents a collaboration between the Hasso Plattner Institute for Digital Engineering in Potsdam, Germany, and the Mount Sinai Health System.
Girish Nadkarni, MD, MPH, who directs the Institute, and Professor Lothar Wieler, a globally recognized expert in public health and digital transformation, jointly oversee the partnership, driving innovations that positively impact patient lives while transforming how people think about personal health and health systems.
The Hasso Plattner Institute for Digital Health at Mount Sinai receives generous support from the Hasso Plattner Foundation. Current research programs and machine learning efforts focus on improving the ability to diagnose and treat patients.
About the Icahn School of Medicine at Mount Sinai
The Icahn School of Medicine at Mount Sinai is internationally renowned for its outstanding research, educational, and clinical care programs. It is the sole academic partner for the seven member hospitals* of the Mount Sinai Health System, one of the largest academic health systems in the United States, providing care to New York City’s large and diverse patient population.
The Icahn School of Medicine at Mount Sinai offers highly competitive MD, PhD, MD-PhD, and master’s degree programs, with enrollment of more than 1,200 students. It has the largest graduate medical education program in the country, with more than 2,600 clinical residents and fellows training throughout the Health System. Its Graduate School of Biomedical Sciences offers 13 degree-granting programs, conducts innovative basic and translational research, and trains more than 560 postdoctoral research fellows.
Ranked 11th nationwide in National Institutes of Health (NIH) funding, the Icahn School of Medicine at Mount Sinai is in the 99th percentile in research dollars per investigator, according to the Association of American Medical Colleges. More than 4,500 scientists, educators, and clinicians work within and across dozens of academic departments and multidisciplinary institutes with an emphasis on translational research and therapeutics. Through Mount Sinai Innovation Partners (MSIP), the Health System facilitates the real-world application and commercialization of medical breakthroughs made at Mount Sinai.
-------------------------------------------------------
* Mount Sinai Health System member hospitals: The Mount Sinai Hospital; Mount Sinai Brooklyn; Mount Sinai Morningside; Mount Sinai Queens; Mount Sinai South Nassau; Mount Sinai West; and New York Eye and Ear Infirmary of Mount Sinai
END