Medicine 2026-03-17 3 min read

A single pipeline for all your genomes: metapipeline-DNA automates sequencing analysis across any computing system

Sanford Burnham Prebys and UCLA release an open-access tool that standardizes whole-genome analysis, catches errors before they waste computing time, and works on any infrastructure

A single human genome generates roughly 100 gigabytes of raw sequencing data - the equivalent of about 20,000 smartphone photos. Now multiply that by hundreds of genomes in a single experiment, across dozens of labs using different computing systems, different analysis software, and different quality standards. The result is a fragmentation problem that has been quietly slowing genomic research for years.

A new open-access tool called metapipeline-DNA, published in Cell Reports Methods, aims to fix that. Built by researchers at Sanford Burnham Prebys Medical Discovery Institute and UCLA, it automates the entire DNA sequencing analysis workflow - from quality control through variant detection - in a standardized, reproducible way that works across different computing environments.
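The staged structure the article describes - quality control feeding into alignment, then variant detection - can be pictured as a chain of steps where each consumes the previous step's output. The sketch below is purely illustrative; the function names and toy data are invented for this example and are not metapipeline-DNA code.

```python
# Toy sketch of a staged whole-genome workflow: QC -> alignment -> variant
# calling. Stage names follow the article; everything else is hypothetical.

def quality_control(reads):
    # Drop reads flagged as failing QC (here: a simple boolean marker).
    return [r for r in reads if r["pass_qc"]]

def align(reads):
    # Stand-in for alignment: tag each surviving read with a genomic position.
    return [{**r, "position": i * 100} for i, r in enumerate(reads)]

def call_variants(alignments):
    # Report positions where the read base differs from the reference base.
    return [a["position"] for a in alignments if a["base"] != a["ref"]]

reads = [
    {"base": "A", "ref": "A", "pass_qc": True},
    {"base": "T", "ref": "C", "pass_qc": True},
    {"base": "G", "ref": "G", "pass_qc": False},  # removed at the QC stage
]
variants = call_variants(align(quality_control(reads)))
```

Standardizing this chain is what lets the same analysis run identically on a laptop, a cluster, or the cloud.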

The fragmentation problem

As genome sequencing technology has become faster and cheaper over the past decade, many labs have cobbled together their own analysis software or customized shared tools for their specific computing infrastructure. Some pipelines only run on particular supercomputing clusters or cloud platforms. When a lab moves institutions, when collaborators try to reproduce results, or when institutions switch computing systems, these bespoke solutions break.

The consequences are real: wasted time debugging software, inconsistent results across studies, and a reproducibility problem that undermines confidence in genomic findings. A pipeline that runs identically on any infrastructure, produces standardized output, and catches configuration errors before burning days of computing time addresses all three issues.

Error detection before execution

One of metapipeline-DNA's design priorities is preventing failed runs. Genomic analysis on supercomputing clusters can consume days of processing time. A configuration error discovered late in a run means starting over - potentially losing a week of work.

"In designing the software, we focused on making sure that the choices we present to the users are fully validated before the pipeline runs," said Paul Boutros, director of the NCI-Designated Cancer Center at Sanford Burnham Prebys and the study's senior author. The tool checks inputs, parameters, and system compatibility upfront, flagging problems before any heavy computation begins.
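The "fail fast" idea - reject a bad configuration before any compute time is spent - can be sketched in a few lines. The parameter names and allowed values below are hypothetical examples, not metapipeline-DNA's actual options.

```python
# Illustrative fail-fast configuration check: every user-supplied setting is
# validated up front, so a typo is caught in seconds rather than days into a
# run. Parameter names and allowed values are invented for this sketch.

ALLOWED_ALIGNERS = {"BWA-MEM2", "HISAT2"}
ALLOWED_GENOMES = {"GRCh37", "GRCh38"}

def validate_config(config):
    """Return a list of every problem found; empty means the run may start."""
    errors = []
    if config.get("aligner") not in ALLOWED_ALIGNERS:
        errors.append(f"unknown aligner: {config.get('aligner')!r}")
    if config.get("reference_genome") not in ALLOWED_GENOMES:
        errors.append(f"unknown reference genome: {config.get('reference_genome')!r}")
    threads = config.get("threads", 0)
    if not isinstance(threads, int) or threads < 1:
        errors.append(f"threads must be a positive integer, got {threads!r}")
    return errors

# A misconfigured run is rejected immediately, with every problem listed at once.
bad = {"aligner": "bwa", "reference_genome": "GRCh38", "threads": 0}
problems = validate_config(bad)
```

Collecting all errors in one pass, rather than stopping at the first, spares the user several fix-and-retry cycles.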

Built by a community, benchmarked against NIST

The tool's development has been genuinely collaborative: 43 contributors made 1,408 pull requests to the codebase, and 46 individuals submitted 1,124 feature requests, suggestions, and issue reports. That level of community involvement is unusual for a bioinformatics pipeline and suggests the tool was designed to address real, widely shared pain points.

To ensure accuracy in detecting genetic variants, the team incorporated reference data from the Genome in a Bottle Consortium, a public-private-academic partnership led by the National Institute of Standards and Technology (NIST). By benchmarking against these meticulously validated reference genomes, the researchers reduced false-positive variant calls without sacrificing the tool's ability to detect true genetic changes.

Cancer genomics as a test case

The team demonstrated metapipeline-DNA's capabilities with two cancer genomics case studies. They analyzed whole-genome sequencing data from five patients who donated both normal tissue and tumor samples to the Pan-Cancer Analysis of Whole Genomes dataset, and another five from The Cancer Genome Atlas. These are well-characterized datasets, making them useful benchmarks for validating the pipeline's performance against established results.

From DNA to RNA to protein

The long-term vision extends beyond DNA. The developers plan to build on metapipeline-DNA's architecture to create analogous automated workflows for RNA sequencing and protein analysis. Because the different pipelines would share the same underlying automation, error-checking, and quality control framework, improvements to any one pipeline would benefit the others.

"This tool should enable labs to process data without needing a lot of background in computation or computer infrastructure, and without having to optimize for their specific computing environment," said Yash Patel, a cloud and AI infrastructure architect at Sanford Burnham Prebys and co-first author.

What it won't do

Metapipeline-DNA automates standard whole-genome analysis steps - alignment, quality control, variant calling, and annotation. It does not perform downstream biological interpretation: identifying which variants are pathogenic, determining clinical significance, or integrating genomic data with other omics layers. Those tasks still require domain-specific tools and expert judgment. The pipeline is also DNA-specific; the planned RNA and protein extensions are not yet available.

And while the tool's portability across computing systems is a strength, initial setup still requires some bioinformatics expertise. The goal of enabling any lab to run genomic analysis without computational specialists is aspirational rather than fully realized.

Source: Yash Patel, Chenghao Zhu, Takafumi Yamaguchi, Paul Boutros et al. Sanford Burnham Prebys and UCLA. Published in Cell Reports Methods, March 17, 2026. DOI: 10.1016/j.crmeth.2026.101340