Medicine · 2026-03-02 · 3 min read

AI Pathology Models Achieve Cancer Diagnoses by Exploiting Statistical Shortcuts, Not Biology

A University of Warwick analysis of more than 8,000 patient samples across four cancer types found that deep learning models often rely on correlated biomarkers rather than the specific signals they claim to detect.

On paper, the performance numbers look impressive. Deep learning models trained to detect cancer mutations or predict tumor behavior from microscope images routinely report accuracy scores above 80%. Clinical validation studies cite these figures; investment decisions are made on the strength of them. But a new analysis from the University of Warwick suggests that a substantial portion of this apparent accuracy is illusory - built on statistical shortcuts that collapse when conditions change.

The study, published in Nature Biomedical Engineering, analyzed more than 8,000 patient samples across four cancer types: breast, colorectal, lung, and endometrial. The researchers, led by Dr. Fayyaz Minhas of the Predictive Systems in Biomedicine Lab at Warwick's Department of Computer Science, compared the performance of leading machine learning approaches while specifically testing whether models were learning the biological signal they claimed to detect or something else entirely.

The Shortcut Problem in Concrete Terms

Consider a model trained to detect mutations in the BRAF gene from tissue images. BRAF mutations in colorectal cancer often co-occur with microsatellite instability (MSI) - a separate clinical feature that also has visible correlates in tissue histology. A model can learn to predict BRAF status not by detecting the actual tissue features caused by BRAF mutation, but by detecting the features associated with MSI - a correlated surrogate rather than the target signal itself.

"It's a bit like judging a restaurant's quality by the queue of people waiting to get in: it's a useful shortcut, but it's not a direct measure of what's happening in the kitchen," said Dr. Minhas. "Many AI pathology models are doing the same thing, relying on correlations between biomarkers or on obvious tissue features, rather than isolating biomarker-specific signals. And when conditions change, these shortcuts often fall apart."

When the researchers stratified performance within patient subgroups - looking only at high-grade breast cancers, for example, or only MSI-positive tumors - accuracy dropped substantially. The models that performed well in mixed populations became unreliable in homogeneous subgroups where the confounding correlations they had learned no longer applied.
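The failure mode is easy to reproduce in miniature. The toy simulation below (a hypothetical sketch in Python using numpy and scikit-learn, not the study's code or data, with all rates invented for illustration) generates BRAF labels that correlate with MSI, gives the "image features" an MSI signal only, and lets a simple classifier stand in for a deep model:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
n = 8000

# Latent clinical states: MSI status, and BRAF mutation correlated with MSI
# (invented rates: 70% of MSI-positive vs 10% of MSI-negative tumors are BRAF-mutant).
msi = rng.random(n) < 0.3
braf = np.where(msi, rng.random(n) < 0.7, rng.random(n) < 0.1)

# "Image features" carry an MSI signal only - no direct BRAF signal at all.
features = msi[:, None] * 2.0 + rng.normal(size=(n, 5))

train = np.arange(n) < n // 2
test = ~train

# A linear classifier stands in for a deep model.
clf = LogisticRegression().fit(features[train], braf[train])

# Overall accuracy lands around 0.84: the confound alone clears the 80% bar.
print("overall accuracy:", accuracy_score(braf[test], clf.predict(features[test])))

# Within the MSI-positive subgroup the confound is constant, so the shortcut
# has nothing left to exploit: accuracy falls to the subgroup's base rate
# (~0.70), no better than always guessing the majority class.
sub = test & msi
print("MSI-positive subgroup:", accuracy_score(braf[sub], clf.predict(features[sub])))
```

The model never sees any BRAF-specific signal, yet its headline accuracy looks excellent - exactly the pattern the Warwick team observed when confounded performance was stratified away.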

The Modest Advantage Over Simpler Methods

The performance gap between deep learning and much simpler approaches also proved narrower than headline numbers suggested. When predicting biomarker status, the AI models achieved accuracy scores just above 80%, compared with roughly 75% from tumor grade alone - a measure pathologists already assess visually. For a field often presented as delivering transformative improvements, the more honest characterization is a 5-percentage-point advantage that evaporates under subgroup analysis.

"We've found that predicting a BRAF mutation by looking at correlated features like MSI is often like predicting rain by looking at umbrellas - it works, but it doesn't mean you understand meteorology," said Kim Branson, SVP Global Head of AI and Machine Learning at GSK and a co-author on the study. "Crucially, if a model cannot demonstrate information gain above a simple pathologist-assigned grade, we haven't advanced the field; we've just automated a shortcut."

What Rigorous Evaluation Would Look Like

The researchers are not arguing that AI has no value in pathology. Machine learning methods remain capable of processing large quantities of imaging data and identifying patterns at scales no human pathologist could match. The question is whether current evaluation practices allow the field to distinguish genuine biological learning from sophisticated pattern-matching on confounded data.

The paper calls for evaluation protocols that test models within stratified subgroups - not just overall accuracy - and that assess information gain relative to clinical information already available. Models should be required to perform well not only in populations where their shortcut correlations exist, but also in populations where those correlations have been removed.
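As a sketch of what such a protocol might look like in practice - illustrative code, not an artifact of the paper; the function name, inputs, and scikit-learn dependency are assumptions for the example - the helper below reports per-subgroup accuracy for a model alongside a clinical baseline such as tumor grade, and the gain of one over the other:

```python
from collections import defaultdict
from sklearn.metrics import accuracy_score

def stratified_report(y_true, y_model, y_baseline, subgroups):
    """Accuracy per clinical subgroup for a model and a clinical baseline.

    A model only demonstrates biomarker-specific learning if its gain over
    the baseline survives inside homogeneous subgroups, where shortcut
    correlations are held constant.
    """
    buckets = defaultdict(list)
    for i, group in enumerate(subgroups):
        buckets[group].append(i)

    report = {}
    for group, idx in buckets.items():
        truth = [y_true[i] for i in idx]
        model_acc = accuracy_score(truth, [y_model[i] for i in idx])
        base_acc = accuracy_score(truth, [y_baseline[i] for i in idx])
        report[group] = {"model": model_acc, "baseline": base_acc,
                         "gain": model_acc - base_acc}
    return report

# Example: stratify BRAF predictions by MSI status, with a grade-only
# predictor as the baseline. A headline gain that vanishes in every
# subgroup is the signature of a learned shortcut.
# report = stratified_report(braf_truth, model_preds, grade_preds, msi_status)
```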

"This study highlights a critical point about the rollout of AI in medicine: to deliver real and lasting impact, the value of AI-based clinically important predictions must be judged through rigorous, bias-aware evaluation, rather than relying solely on headline accuracies that fail to account for confounding effects," said Professor Nasir Rajpoot, Director of the Tissue Image Analytics Centre at Warwick.

Source: Minhas F et al. Published in Nature Biomedical Engineering. Lead researcher: Dr. Fayyaz Minhas, Predictive Systems in Biomedicine Lab, University of Warwick. Co-author: Kim Branson, GSK. Analysis included 8,000+ patient samples from breast, colorectal, lung, and endometrial cancer cohorts.