How good are AI doctors at medical conversations?

Researchers design a more realistic test to evaluate AI's clinical communication skills

2025-01-02

(Press-News.org) Artificial intelligence tools such as ChatGPT have been touted for their promise to alleviate clinician workload by triaging patients, taking medical histories and even providing preliminary diagnoses.

These tools, known as large-language models, are already being used by patients to make sense of their symptoms and medical test results.

But while these AI models perform impressively on standardized medical tests, how well do they fare in situations that more closely mimic the real world?

Not that great, according to the findings of a new study led by researchers at Harvard Medical School and Stanford University.

For their analysis, published Jan. 2 in Nature Medicine, the researchers designed an evaluation framework — or a test — called CRAFT-MD (Conversational Reasoning Assessment Framework for Testing in Medicine) and deployed it on four large-language models to see how well they performed in settings closely mimicking actual interactions with patients.

All four large-language models did well on medical exam-style questions, but their performance worsened when engaged in conversations more closely mimicking real-world interactions.

This gap, the researchers said, underscores a twofold need: first, to create more realistic evaluations that better gauge the fitness of clinical AI models for use in the real world and, second, to improve the ability of these tools to make diagnoses based on more realistic interactions before they are deployed in the clinic.

Evaluation tools like CRAFT-MD, the research team said, can not only assess AI models more accurately for real-world fitness but could also help optimize their performance in the clinic.

"Our work reveals a striking paradox - while these AI models excel at medical board exams, they struggle with the basic back-and-forth of a doctor's visit," said study senior author Pranav Rajpurkar, assistant professor of biomedical informatics at Harvard Medical School. “The dynamic nature of medical conversations - the need to ask the right questions at the right time, to piece together scattered information, and to reason through symptoms - poses unique challenges that go far beyond answering multiple choice questions. When we switch from standardized tests to these natural conversations, even the most sophisticated AI models show significant drops in diagnostic accuracy."

A better test to check AI’s real-world performance

Right now, developers test the performance of AI models by asking them to answer multiple choice medical questions, typically derived from the national exam for graduating medical students or from tests given to medical residents as part of their certification.

“This approach assumes that all relevant information is presented clearly and concisely, often with medical terminology or buzzwords that simplify the diagnostic process, but in the real world this process is far messier,” said study co-first author Shreya Johri, a doctoral student in the Rajpurkar Lab at Harvard Medical School. “We need a testing framework that reflects reality better and is, therefore, better at predicting how well a model would perform.”

CRAFT-MD was designed to be one such more realistic gauge.

To simulate real-world interactions, CRAFT-MD evaluates how well large-language models can collect information about symptoms, medications, and family history and then make a diagnosis. An AI agent is used to pose as a patient, answering questions in a conversational, natural style. Another AI agent grades the accuracy of the final diagnosis rendered by the large-language model. Human experts then evaluate the outcomes of each encounter for the ability to gather relevant patient information, for diagnostic accuracy when presented with scattered information, and for adherence to prompts.
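The evaluation loop described above can be sketched in Python. This is an illustration only, not the CRAFT-MD implementation: the names Vignette, patient_agent, toy_doctor, run_encounter and grader_agent are hypothetical, and simple rule-based stand-ins replace the large-language models that the study uses for the patient, clinician and grader roles.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

Transcript = List[Tuple[str, str]]  # (doctor question, patient answer) pairs


@dataclass
class Vignette:
    """Hypothetical case record: what the patient knows, plus the reference diagnosis."""
    facts: Dict[str, str]
    true_diagnosis: str


def patient_agent(vignette: Vignette, question: str) -> str:
    """Stand-in for the patient AI agent: reveals a fact only when asked about it."""
    for topic, answer in vignette.facts.items():
        if topic in question.lower():
            return answer
    return "I'm not sure."


def toy_doctor(transcript: Transcript) -> Tuple[str, str]:
    """Stand-in for the model under test: asks three scripted questions, then diagnoses."""
    questions = [
        "What symptoms do you have?",
        "Which medications do you take?",
        "Any family history of illness?",
    ]
    if len(transcript) < len(questions):
        return ("question", questions[len(transcript)])
    answers = " ".join(answer for _, answer in transcript).lower()
    return ("diagnosis", "eczema" if "itchy" in answers else "unknown")


def run_encounter(
    vignette: Vignette,
    doctor: Callable[[Transcript], Tuple[str, str]],
    max_turns: int = 10,
) -> Tuple[Optional[str], Transcript]:
    """Alternate doctor questions and patient answers until a diagnosis is given."""
    transcript: Transcript = []
    for _ in range(max_turns):
        kind, text = doctor(transcript)
        if kind == "diagnosis":
            return text, transcript
        transcript.append((text, patient_agent(vignette, text)))
    return None, transcript  # no diagnosis within the turn limit


def grader_agent(predicted: Optional[str], truth: str) -> bool:
    """Stand-in for the grader AI agent: exact match after normalization."""
    return predicted is not None and predicted.strip().lower() == truth.strip().lower()


case = Vignette(
    facts={
        "symptoms": "An itchy, dry rash on my elbows.",
        "medications": "None.",
        "family history": "My mother has asthma.",
    },
    true_diagnosis="Eczema",
)
diagnosis, transcript = run_encounter(case, toy_doctor)
is_correct = grader_agent(diagnosis, case.true_diagnosis)
```

In this toy setup the transcript itself is what human experts would review for history taking and adherence to prompts; the automated grader only checks the final diagnosis.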

The researchers used CRAFT-MD to test four AI models, both proprietary (commercial) and open-source, for performance in 2,000 clinical vignettes featuring conditions common in primary care and across 12 medical specialties.

All AI models showed limitations, particularly in their ability to conduct clinical conversations and reason based on information given by patients. That, in turn, compromised their ability to take medical histories and render an appropriate diagnosis. For example, the models often struggled to ask the right questions to gather pertinent patient history, missed critical information during history taking, and had difficulty synthesizing scattered information. The accuracy of these models declined when they were presented with open-ended information rather than multiple choice answers. These models also performed worse when engaged in back-and-forth exchanges, as most real-world conversations are, than when engaged in summarized conversations.

Recommendations for optimizing AI’s real-world performance

Based on these findings, the team offers a set of recommendations both for AI developers who design AI models and for regulators charged with evaluating and approving these tools.

These include:

Using conversational, open-ended questions that more accurately mirror unstructured doctor-patient interactions in the design, training, and testing of AI tools

Assessing models for their ability to ask the right questions and to extract the most essential information

Designing models capable of following multiple conversations and integrating information from them

Designing AI models capable of integrating textual data (notes from conversations) with non-textual data (images, EKGs)

Designing more sophisticated AI agents that can interpret non-verbal cues such as facial expressions, tone, and body language

Additionally, the evaluation should include both AI agents and human experts, the researchers recommend, because relying solely on human experts is labor-intensive and expensive. For example, CRAFT-MD outpaced human evaluators, processing 10,000 conversations in 48 to 72 hours, plus 15 to 16 hours of expert evaluation. In contrast, human-based approaches would require extensive recruitment and an estimated 500 hours for patient simulations (about 3 minutes per conversation) and about 650 hours for expert evaluations (nearly 4 minutes per conversation). Using AI evaluators as a first line of assessment has the added advantage of eliminating the risk of exposing real patients to unverified AI tools.
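The per-conversation figures follow directly from the totals quoted above, as this short Python check shows. The variable and function names are illustrative. Adding the processing time and the expert time into one CRAFT-MD total is an assumption made here for comparison, since the release does not say whether the two stages run back to back.

```python
CONVERSATIONS = 10_000  # number of conversations quoted in the release


def minutes_per_conversation(total_hours: float, n: int = CONVERSATIONS) -> float:
    """Convert a total effort in hours into minutes spent per conversation."""
    return total_hours * 60 / n


# Human-only estimates quoted in the release.
simulation_minutes = minutes_per_conversation(500)  # patient simulations
expert_minutes = minutes_per_conversation(650)      # expert evaluations
human_total_hours = 500 + 650

# CRAFT-MD figures: 48 to 72 hours of processing plus 15 to 16 hours of expert review.
# Summing the two stages is an assumption, not a figure from the study.
craft_md_total_hours = (48 + 15, 72 + 16)

print(f"Patient simulation: {simulation_minutes:.1f} min per conversation")
print(f"Expert evaluation: {expert_minutes:.1f} min per conversation")
print(f"Human-only total: {human_total_hours} hours")
print(f"CRAFT-MD total: {craft_md_total_hours[0]} to {craft_md_total_hours[1]} hours")
```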

The researchers said they expect that CRAFT-MD itself will also be updated and optimized periodically to integrate improved patient-AI models.

“As a physician scientist, I am interested in AI models that can augment clinical practice effectively and ethically,” said study co-senior author Roxana Daneshjou, assistant professor of Biomedical Data Science and Dermatology at Stanford University. “CRAFT-MD creates a framework that more closely mirrors real-world interactions and thus it helps move the field forward when it comes to testing AI model performance in health care.”

Authorship, funding, disclosures

Publication DOI 10.1038/s41591-024-03328-5

Additional authors included Jaehwan Jeong and Hong-Yu Zhou, Harvard Medical School; Benjamin A. Tran, Georgetown University; Daniel I. Schlessinger, Northwestern University; Shannon Wongvibulsin, University of California-Los Angeles; Leandra A. Barnes, Zhuo Ran Cai and David Kim, Stanford University; and Eliezer M. Van Allen, Dana-Farber Cancer Institute.

The work was supported by the HMS Dean’s Innovation Award and a Microsoft Accelerate Foundation Models Research grant awarded to Pranav Rajpurkar. Johri received further support through the IIE Quad Fellowship.

Daneshjou reported receiving personal fees from DWA, Pfizer, L'Oreal, and VisualDx; stock options from MDAlgorithms and Revea outside the submitted work; and a pending patent for TrueImage. Schlessinger is the co-founder of FixMySkin Healing Balms, a shareholder in Appiell Inc. and K-Health, a consultant with Appiell Inc. and LuminDx, and an investigator for AbbVie and Sanofi. Van Allen serves as an advisor to Enara Bio, Manifold Bio, Monte Rosa, Novartis Institute for Biomedical Research, and Serinus Bio. Van Allen provides research support to Novartis, BMS, Sanofi, and NextPoint. Van Allen holds equity in Tango Therapeutics, Genome Medical, Genomic Life, Enara Bio, Manifold Bio, Microsoft, Monte Rosa, Riva Therapeutics, Serinus Bio, and Syapse. Van Allen has filed for institutional patents on chromatin mutations and immunotherapy response, and methods for clinical interpretation; provides intermittent legal consulting on patents for Foley & Hoag; and serves on the editorial board of Science Advances.
