AI Chatbots Matched Expert Data Teams on Preterm Birth Prediction in a Fraction of the Time
Data science teams competing in the DREAM challenges on preterm birth prediction worked for three months building machine learning algorithms from vaginal microbiome and blood data. Then it took nearly two more years to compile the results and publish them. A research team at UC San Francisco and Wayne State University ran a parallel experiment using AI chatbots on the same data and completed the equivalent work - from initial experiment to submitted paper - in six months.
That comparison is the headline number from a study published February 17, 2026 in Cell Reports Medicine. But the underlying findings are more nuanced than a simple speed comparison suggests, and they point toward something specific about where AI tools add value in health data science and where they remain limited.
The DREAM Challenge Baseline
The DREAM (Dialogue on Reverse Engineering Assessment and Methods) challenges are structured scientific competitions in which research groups work on the same dataset with the same prediction goal. More than 100 groups competed in the UCSF-led challenge on vaginal microbiome data for preterm birth prediction. The competition format produced robust benchmarks: many teams achieved the prediction goal within the three-month challenge period.
Marina Sirota, PhD, professor of Pediatrics and interim director of the Bakar Computational Health Sciences Institute at UCSF, and Adi L. Tarca, PhD, professor in the Center for Molecular Medicine and Genetics at Wayne State, had led this DREAM challenge and two related challenges on pregnancy dating. Their combined datasets included microbiome samples and blood and placental tissue data from approximately 1,200 pregnant women across nine studies.
What the AI Tools Were Asked to Do
The researchers gave eight AI chatbots natural language prompts asking them to build prediction models using the same data from the three DREAM challenges. The prompts were carefully constructed to guide the AI tools to analyze the data the same way the DREAM teams had, without requiring the AI to know the details of the challenges in advance.
Only four of the eight AI tools produced usable code - prediction models that actually ran on the data and produced valid output. That 50 percent failure rate is a meaningful limitation: not all current AI chatbots are reliable enough for technical data science tasks. Among the four that worked, performance was comparable to or better than the DREAM challenge competitors for several prediction tasks.
The speed advantage was stark. Experienced programmers would typically take at least a few hours and potentially a few days to write the analysis code the AI tools generated in minutes. One research pair - a master's student at UCSF and a high school student - produced viable prediction models with AI assistance in a timeframe that would have been impossible without it.
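The analysis code in question is, at its core, a standard supervised-learning pipeline: fit a classifier on microbiome-style features, then score it on held-out samples. The sketch below is a purely hypothetical illustration of that shape - the data is synthetic, the taxon features and risk function are invented for the example, and none of it reflects the study's actual code or datasets. It uses only the Python standard library.

```python
# Hypothetical sketch of the kind of pipeline the chatbots were asked to produce:
# fit a classifier on relative-abundance features, score it with ROC AUC.
# All data and feature names here are synthetic stand-ins, not the study's data.
import math
import random

random.seed(0)

def make_samples(n=200, n_taxa=5):
    """Synthetic relative-abundance profiles; taxon 0 is (arbitrarily) tied to risk."""
    X, y = [], []
    for _ in range(n):
        raw = [random.random() for _ in range(n_taxa)]
        total = sum(raw)
        profile = [v / total for v in raw]
        risk = 1 / (1 + math.exp(-(6 * profile[0] - 2)))  # invented risk function
        X.append(profile)
        y.append(1 if random.random() < risk else 0)
    return X, y

def fit_logistic(X, y, lr=0.5, epochs=300):
    """Plain stochastic-gradient logistic regression, no external libraries."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            z = b + sum(wj * xj for wj, xj in zip(w, xi))
            p = 1 / (1 + math.exp(-z))
            err = p - yi
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
            b -= lr * err
    return w, b

def auc(scores, labels):
    """ROC AUC via the rank-sum (Mann-Whitney) formulation."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

X, y = make_samples()
w, b = fit_logistic(X[:150], y[:150])                     # train on 150 samples
scores = [b + sum(wj * xj for wj, xj in zip(w, x)) for x in X[150:]]
print(f"held-out AUC: {auc(scores, y[150:]):.2f}")        # evaluate on the rest
```

The nontrivial part - and where the failing tools broke down - is not any one of these steps but producing a version of them that runs end to end on real, messy data and returns valid scores.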
Three Tasks, Different AI Performance
The AI tools were evaluated on three prediction tasks: identifying preterm birth risk from vaginal microbiome data; estimating gestational age from blood samples; and estimating gestational age from placental tissue data. The last task proved most difficult for the AI tools, with performance falling furthest short of the human teams' benchmarks. The vaginal microbiome and blood-based dating tasks showed stronger AI performance.
Accurate gestational age estimation matters clinically because birth timing affects fetal development assessment, preparation for labor, and the type of care provided to newborns. When estimates are off, clinical decisions downstream are made on incorrect assumptions.
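Gestational-age estimation is a regression problem, and models for it are commonly scored by how many weeks their estimates miss by - for example, mean absolute error (MAE). The sketch below illustrates that scoring on a synthetic, invented "marker" feature; it is an assumption-laden toy, not the study's molecular data or model.

```python
# Hypothetical illustration of scoring a gestational-age model by MAE in weeks.
# The "marker" feature and its relationship to age are invented for the example.
import random

random.seed(1)

# Synthetic marker that rises roughly linearly with gestational age (8-40 weeks).
ages = [random.uniform(8, 40) for _ in range(100)]
marker = [0.5 * a + random.gauss(0, 1.0) for a in ages]

# Closed-form simple linear regression: age ~ slope * marker + intercept.
n = len(ages)
mx = sum(marker) / n
my = sum(ages) / n
slope = sum((x - mx) * (a - my) for x, a in zip(marker, ages)) / \
        sum((x - mx) ** 2 for x in marker)
intercept = my - slope * mx

preds = [slope * x + intercept for x in marker]
mae = sum(abs(p - a) for p, a in zip(preds, ages)) / n
print(f"MAE: {mae:.2f} weeks")
```

An error of a week or two on a metric like this is the difference between a newborn triaged as late preterm versus term, which is why the placental-tissue task's weaker AI performance matters clinically.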
What AI Cannot Replace
Tarca was direct about the limits: the AI tools generated code and ran analyses, but they did not decide which questions were worth asking, design the study, collect the data, interpret results in light of biological plausibility, or navigate peer review. Scientists still need to be on guard for misleading results and step in when the AI fails. All four AI tools that produced usable output required human review to verify the code was correct and the results were interpretable. The failure modes of AI-generated code - silent errors that produce plausible-looking but wrong outputs - can be difficult to detect without subject matter expertise.
The study is explicitly described as an "early test" of how AI can be used to analyze health data at scale. The finding that four of eight tools matched human expert performance is promising. The finding that the other four failed entirely is a reminder that current AI tools remain inconsistent for technical data science tasks.