One AI Model, 750 Tasks: Stanford's Merlin Reads CT Scans Better Than Specialist Tools
Every day, radiology departments produce thousands of CT scans that take expert eyes hours to interpret and often yield only partial answers. The images contain far more information than any radiologist can extract in a clinical read - structural patterns, tissue densities, subtle organ changes - most of which is never formally characterized. What if a model could be trained on all of it?
That is the premise behind Merlin, a machine learning model developed by researchers at Stanford University and funded by the National Institutes of Health. The tool was trained on the largest collection of abdominal CT data assembled to date: more than 15,000 three-dimensional scans paired with their radiology reports and nearly one million diagnostic codes from Stanford's clinical systems. Results published in Nature show that Merlin not only handles a sweeping range of tasks but beats purpose-built specialist models at their own jobs.
What "general-purpose" actually means in practice
The researchers tested Merlin on more than 750 individual tasks spanning six broad categories: identifying anatomical features, predicting diagnosis codes, detecting organ abnormalities, generating radiology report content, predicting chronic disease onset, and performing quality assessment. That breadth is unusual. Most medical AI tools are designed to do one thing - detect a specific tumor type, segment a particular organ, flag a specific finding - and are evaluated almost exclusively on that one thing.
Merlin is what researchers call a foundation model: a large model trained on unlabeled data spanning many kinds of information, which can then be applied to downstream tasks with relatively little additional training. The concept comes from natural language processing, where models like GPT trained on vast text corpora turned out to be remarkably good at tasks they were never explicitly trained for. Merlin applies the same logic to 3D medical imaging.
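To make the recipe concrete, here is a minimal sketch in PyTorch of how a model like this might be pretrained and then reused downstream. It illustrates the foundation-model pattern, not Merlin's actual code: the tiny ScanEncoder network, the CLIP-style contrastive objective, and the 128-dimensional embeddings are all assumptions chosen for brevity.

```python
# Sketch of the foundation-model recipe, not Merlin's actual code: a 3D image
# encoder is pretrained with a CLIP-style contrastive objective against
# report-text embeddings, then reused for a downstream task with only a small
# linear head. All names and sizes here are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ScanEncoder(nn.Module):
    """Tiny 3D CNN standing in for a real volumetric backbone."""

    def __init__(self, embed_dim: int = 128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv3d(1, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv3d(16, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),
        )
        self.proj = nn.Linear(32, embed_dim)

    def forward(self, x):                       # x: (batch, 1, D, H, W)
        h = self.conv(x).flatten(1)             # (batch, 32)
        return F.normalize(self.proj(h), dim=-1)


def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE: matched scan/report pairs should align."""
    logits = img_emb @ txt_emb.t() / temperature
    targets = torch.arange(len(img_emb))
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2


# --- Pretraining step (report text assumed already embedded) ---
encoder = ScanEncoder()
scans = torch.randn(8, 1, 32, 64, 64)           # stand-in CT volumes
report_emb = F.normalize(torch.randn(8, 128), dim=-1)
loss = contrastive_loss(encoder(scans), report_emb)
loss.backward()

# --- Downstream reuse: freeze the backbone, train only a linear head ---
for p in encoder.parameters():
    p.requires_grad = False
head = nn.Linear(128, 1)                        # e.g. 5-year disease risk
risk_logit = head(encoder(scans))
```

The key design point is the second half: once the encoder has learned general-purpose representations, each new task needs only a small head trained on top, which is what makes one model serviceable across hundreds of tasks.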
The numbers that matter
Tested across 692 diagnostic codes, Merlin identified which of two scans was more likely to match a given code 81% of the time, outperforming competing models. For a subset of 102 codes, accuracy rose to 90%.
The disease prediction results are harder to dismiss. Given two scans from different patients, Merlin correctly identified the one at higher risk of developing diabetes, osteoporosis, or heart disease within five years 75% of the time, compared to 68% for the next-best model. That 7-point gap may sound modest, but at population scale - and given that the comparison model was purpose-built for this task while Merlin is general-purpose - the difference is meaningful.
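Both the 81% figure and the 75% figure describe the same kind of evaluation: hand the model one scan that matches a condition and one that does not, and count how often it scores the matching scan higher. The paper's exact protocol is not reproduced here, but pairwise accuracy of this form is the standard probabilistic reading of AUROC, as the hypothetical sketch below illustrates; the scores are random stand-ins, not Merlin outputs.

```python
# Hedged sketch of the two-scan comparison metric: the fraction of
# (positive, negative) scan pairs the model ranks correctly. This quantity
# equals AUROC under its standard probabilistic interpretation.
import random


def pairwise_accuracy(pos_scores, neg_scores):
    """Fraction of (positive, negative) pairs ranked correctly;
    ties count as half a win, matching the AUROC convention."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


random.seed(0)
# Simulated model scores for scans with / without a given diagnosis code.
pos = [random.gauss(1.0, 1.0) for _ in range(500)]   # patients with the code
neg = [random.gauss(0.0, 1.0) for _ in range(500)]   # patients without it
print(f"pairwise accuracy (AUROC): {pairwise_accuracy(pos, neg):.2f}")
```

Read this way, 50% is coin-flipping and 100% is perfect ranking, which is why a move from 68% to 75% closes a meaningful share of the remaining gap.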
The researchers also challenged Merlin with chest CT scans, a body region entirely absent from its training data. Merlin matched or exceeded models trained exclusively on chest images. That kind of cross-domain transfer is not something narrowly specialized tools typically achieve.
Why a physician shortage makes this urgent
CT is often among the first imaging tests ordered in a medical evaluation. The path from scan to diagnosis is long: a radiologist interprets the images, determines whether additional tests are warranted, and communicates findings to the ordering physician. At every step there is delay - and the U.S. physician shortage makes those delays worse year by year.
"With Merlin, you could potentially go beyond traditional radiology and jump straight from imaging to a possible diagnosis," said co-first author Louis Blankemeier, who conducted the work as a graduate student at Stanford. That is not a claim that Merlin is ready to replace radiologists. It is a claim about where the technology is pointing.
The more immediate application may be triage: using Merlin to flag scans that require urgent attention, predict which patients need follow-up testing, or identify individuals at elevated risk of conditions that are cheaper to treat early. The model might also help identify new biomarkers - patterns in imaging data that correlate with disease but that human observers have never formally characterized.
What Merlin cannot do yet
The researchers are candid about limitations. Report generation - drafting a complete radiology report from scratch - required additional task-specific fine-tuning and remains an area for refinement. The training data came from a single institution, Stanford, which raises questions about how well the model generalizes to hospitals with different patient populations and imaging equipment. And like all AI medical tools, Merlin will need regulatory review before clinical deployment.
"Our model and the data will provide the community a robust backbone to build upon," said senior author Akshay Chaudhari, a professor of radiology and biomedical data science at Stanford. The researchers have released both the model and the dataset for other groups to adapt and extend. Whether Merlin's general-purpose approach ultimately becomes the standard architecture for medical imaging AI, or whether specialist models prove more reliable in specific clinical contexts, will depend on how the technology performs across more institutions and more diverse patient populations.
The study was supported by NIH grants R01EB002524 and P41EB027060, the Medical Imaging and Data Resource Center, the National Heart, Lung, and Blood Institute, and the National Institute of Arthritis and Musculoskeletal and Skin Diseases.