Expert consensus outlines a standardized framework to evaluate clinical large language models

Consensus proposes retrospective workflows, metrics, multidisciplinary teams, design principles, feedback, and reporting standards for safe deployment of models in healthcare

2026-01-23

(Press-News.org)

A new expert consensus, made available online on 10 October 2025 and published in Volume 5, Issue 4 of the journal Intelligent Medicine on 1 November 2025, sets out a structured framework to assess large language models (LLMs) before they are introduced into clinical workflows. The guidance responds to the rapid uptake of artificial intelligence (AI) tools for diagnostic support, medical documentation, and patient communication, and the corresponding need for consistent evaluation of safety, effectiveness, and fairness.

The consensus formalizes retrospective evaluation—testing fully trained models on real or simulated clinical data in specific care contexts, without further modifying the models—to verify performance, ethical compliance, and operational readiness prior to deployment.

Developed in line with World Health Organization guideline methods and registered on the Practice Guideline Registration for Transparency (PREPARE) platform (ID: PREPARE-2025CN503), the consensus draws on literature review, Delphi procedures, and multidisciplinary expert deliberation. In the final round, 35 experts achieved agreement on six recommendations.

What does the framework include?

Evaluation workflows prioritizing scientific rigor, objectivity, comprehensiveness, and ethics (e.g., double-blind procedures, conflict-of-interest transparency).

Integrated metrics combining quantitative measures (accuracy, recall, F1-score; BLEU/ROUGE for generation) with structured qualitative ratings (e.g., mean opinion scores for accuracy, completeness, safety, practicality, professionalism).

Multidisciplinary teams spanning clinicians, data and computer engineers, ethicists, legal experts, and statisticians, with standardized training and role definitions.

Dataset design principles centered on clinical authenticity, broad representativeness across diseases, populations, and institutions, and fairness for vulnerable groups, with modular versioning and privacy/compliance safeguards.

Feedback and versioning mechanisms to update standards as technology, regulations, or application scope evolve, including transparent dispute-resolution processes.

Standardized reporting templates to improve transparency, reproducibility, and comparability across evaluations.
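To make the integrated-metrics recommendation concrete, the following is a minimal sketch of how quantitative measures (accuracy, recall, F1-score) and a mean opinion score might be computed for a retrospective evaluation. It is an illustration only, not the procedure defined in the consensus: the function names, the binary-label setup, and the 1-to-5 rating scale are assumptions made for this example.

```python
from statistics import mean


def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall, and F1 for binary labels.

    Hypothetical helper for illustration; labels equal to `positive`
    are treated as the positive class.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs)
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


def mean_opinion_scores(ratings):
    """Average each rating dimension across raters.

    `ratings` is a list of dicts, one per rater, mapping a dimension
    name (e.g. "safety") to a score; the 1-5 scale is an assumption.
    """
    dimensions = ratings[0].keys()
    return {d: mean(r[d] for r in ratings) for d in dimensions}


if __name__ == "__main__":
    # Toy data: reference labels vs. model outputs on six cases.
    y_true = [1, 0, 1, 1, 0, 1]
    y_pred = [1, 0, 0, 1, 1, 1]
    print(classification_metrics(y_true, y_pred))

    # Toy data: three raters scoring one model response.
    ratings = [
        {"accuracy": 4, "safety": 5},
        {"accuracy": 5, "safety": 4},
        {"accuracy": 3, "safety": 5},
    ]
    print(mean_opinion_scores(ratings))
```

In practice, generation metrics such as BLEU or ROUGE would come from an established library, and the rating dimensions would follow whatever scoring rubric the evaluation team adopts.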

The consensus also defines six key LLM capability domains for assessment: medical knowledge question and answer; complex medical language understanding; diagnosis and treatment recommendation; medical documentation generation; multi-turn dialogue; and multimodal dialogue.

By emphasizing essential safeguards for patient data protection, bias mitigation, and the need for AI outputs to remain clinically explainable, the consensus is positioned to support the advancement of safer, more reliable, and ethically governed LLM applications within healthcare systems globally.

***

Reference
DOI: 10.1016/j.imed.2025.09.001

About the journal
Intelligent Medicine is a peer-reviewed, open-access journal focusing on the integration of artificial intelligence, data science, and digital technology in clinical medicine and public health. It is published by the Chinese Medical Association in partnership with Elsevier. To learn more about Intelligent Medicine, please visit https://www.sciencedirect.com/journal/intelligent-medicine

Funding information
The authors received no financial support for this research.

END
