A Genetic Map of 220,000 DNA Variants Reveals Which Ones Actually Affect Disease Risk
Human genetics has spent decades identifying regions of the genome associated with disease. The problem is that a "region" can span hundreds of thousands of DNA letters, containing dozens or hundreds of variants that might be responsible for the association - or might be innocent bystanders. Pinning down which specific letter changes actually matter has been the next obstacle, and it is slow work when done one variant at a time.
A study published in Nature by researchers at The Jackson Laboratory (JAX), the Broad Institute, and Yale University sped up that process by testing more than 220,000 variants simultaneously across five cell types. The result: 13,000 single-letter DNA changes confirmed to influence gene expression, a clearer picture of how non-coding regions of the genome shape traits like blood pressure, cholesterol, and blood sugar, and a finding that complicates standard genetic models.
The Identification Problem in Human Genetics
Most DNA changes linked to common diseases like heart disease and type 2 diabetes do not fall within genes themselves. Genes constitute roughly 2 percent of the human genome. The disease-associated variants overwhelmingly land in the remaining 98 percent - stretches of non-coding DNA that contain regulatory elements governing when, where, and how strongly genes are expressed. Altering a regulatory element can shift how much of a protein a cell produces, which in turn affects disease risk.
Genome-wide association studies over the past two decades have identified millions of non-coding variants statistically associated with disease. The association tells you that a region matters. It does not tell you which specific variant in that region matters, or how it works at the molecular level.
Ryan Tewhey, a geneticist and associate professor who led the team at JAX, put the challenge this way: finding a causal variant in a disease-associated region is like searching for a single typo on one page of a massive book. His team's approach was to speed-read thousands of pages at once.
Massively Parallel Testing
The technology enabling this is a massively parallel reporter assay (MPRA). Each DNA variant to be tested is paired with a molecular tag - a reporter sequence that produces a measurable signal reflecting how the variant affects gene expression. Thousands of these reporter constructs can be introduced into cells simultaneously, and the output of each can be read out through high-throughput sequencing.
The team applied this approach to more than 220,000 previously identified disease-associated variants across five cell types, including brain, liver, and blood cells. This generated data on whether each variant increased, decreased, or had no effect on gene activity in each cell type. The results resolved about 20 percent of the genomic regions in which multiple candidate variants had previously gone without clear attribution.
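To make the readout concrete, here is a minimal Python sketch of how one variant might be scored from sequencing counts. It assumes, as is common for reporter assays but not described in this article, that activity is estimated as the ratio of RNA tag reads to DNA tag reads. The function names, the pseudocount, and the 0.5 cutoff are hypothetical, and a real analysis would rely on replicates and a statistical test rather than a fixed threshold.

```python
import math

def log2_activity(rna_count, dna_count, pseudocount=1.0):
    """Reporter activity as log2(RNA tag reads / DNA tag reads).

    The pseudocount (an assumption) avoids division by zero for rare tags.
    """
    return math.log2((rna_count + pseudocount) / (dna_count + pseudocount))

def classify_variant(ref, alt, threshold=0.5):
    """Compare the alternate allele with the reference allele.

    ref and alt are (rna_count, dna_count) pairs. threshold is a
    hypothetical cutoff on the log2 difference between the two alleles.
    """
    skew = log2_activity(*alt) - log2_activity(*ref)
    if skew > threshold:
        return "increased", skew
    if skew < -threshold:
        return "decreased", skew
    return "no effect", skew

# Made-up counts: the alternate allele yields about 4x more RNA per DNA copy.
label, skew = classify_variant(ref=(400, 200), alt=(1600, 200))
print(label, round(skew, 2))
```

Running the same comparison separately for each cell type would give the per-cell-type calls of increased, decreased, or no effect described above.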
An Unexpected Interaction Effect
Among the 13,000 variants confirmed to affect gene expression, most behaved independently - their effects could be predicted without knowing what neighboring variants were doing. But about 11 percent did not. These variants behaved differently depending on which other variants sat nearby, producing gene expression changes larger or smaller than the sum of their individual effects would predict.
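The idea of "more or less than the sum" can be written down in a few lines. The sketch below assumes each effect is a log2 change in expression relative to a reference sequence, so that independent effects add. The function names and the tolerance are hypothetical and are not taken from the study.

```python
def interaction_effect(effect_a, effect_b, effect_both):
    """Departure of the combined effect from the additive expectation.

    All three inputs are assumed to be log2 expression changes measured
    against the same reference sequence.
    """
    return effect_both - (effect_a + effect_b)

def is_non_additive(effect_a, effect_b, effect_both, tolerance=0.25):
    """Flag a pair whose joint effect differs from the sum of its parts.

    tolerance is a hypothetical stand-in for measurement noise.
    """
    return abs(interaction_effect(effect_a, effect_b, effect_both)) > tolerance

# Made-up numbers: each variant alone raises expression modestly,
# but together they raise it far more than 0.5 + 0.3 would suggest.
print(is_non_additive(0.5, 0.3, 0.8))  # combined effect matches the sum
print(is_non_additive(0.5, 0.3, 1.6))  # combined effect exceeds the sum
```

A positive interaction value corresponds to a combined effect greater than predicted, and a negative value to one that is smaller.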
This interaction effect has direct implications for how genetic risk is calculated. Standard polygenic risk scores assume that the effects of individual variants can simply be added together. If a meaningful fraction of disease-associated variants interact - if their combined effect depends on their co-occurrence - then additive models systematically misestimate risk for individuals who carry particular combinations.
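As a rough illustration of why this matters for risk scores, the sketch below compares a purely additive score with one that includes a pairwise term. The variant names, weights, and the choice to model the interaction as a product of allele counts are all assumptions made for the example. They do not come from the study or from any published scoring method.

```python
def additive_score(genotypes, weights):
    """Sum of per-variant weights times allele counts (0, 1, or 2)."""
    return sum(w * genotypes.get(variant, 0) for variant, w in weights.items())

def score_with_interactions(genotypes, weights, pair_weights):
    """Additive score plus a term for each pair of co-occurring variants.

    The product-of-allele-counts coding is an illustrative assumption.
    """
    score = additive_score(genotypes, weights)
    for (a, b), w in pair_weights.items():
        score += w * genotypes.get(a, 0) * genotypes.get(b, 0)
    return score

# Hypothetical variants and weights.
weights = {"varA": 0.25, "varB": 0.5}
pair_weights = {("varA", "varB"): 0.5}

one_variant = {"varA": 1}
both_variants = {"varA": 1, "varB": 1}

# Carrying one variant: the two models agree.
print(additive_score(one_variant, weights),
      score_with_interactions(one_variant, weights, pair_weights))

# Carrying both: the additive model misses the pairwise contribution.
print(additive_score(both_variants, weights),
      score_with_interactions(both_variants, weights, pair_weights))
```

In this toy setup the two scores differ only for a person who carries both variants, which is the group an additive model would misestimate.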
The finding was not hypothetical: the team documented cases where pairs of interacting variants were associated with gene activity linked to lower LDL cholesterol levels, and others affecting a gene associated with blood pressure. They also identified two variants near the ESS2 gene - associated with developmental disorders - whose combined effect on gene expression exceeded what either variant produced alone.
Equity Implications
The study included a notable effort to address a persistent problem in human genetics: most genetic studies have been conducted in populations of European ancestry, creating uncertainty about whether findings apply to people from other backgrounds.
The team identified a single variant associated with long-term blood sugar control in people of European ancestry, then predicted that similar but previously understudied variants found predominantly in people of African ancestry would show comparable associations. Follow-up analysis confirmed the prediction. This suggests that mechanistic understanding of what variants do at the molecular level - rather than simply cataloging statistical associations - can help extend genetic discoveries across diverse populations.
Scope and Limitations
The study tested variants in five cell types, which sounds broad but represents a small fraction of the human body's thousands of distinct cell types. Many disease-associated variants likely exert their effects in tissues not examined here. The researchers themselves note that switching genes on or off in a single cell type is only one piece of a much larger puzzle in determining health outcomes. Millions of genetic variants also remain entirely untested.
The lead authors include Layla Siraj (first author, now in obstetrics and gynecology residency at Columbia University), Jacob Ulirsch (now at Illumina), Steven Reilly of Yale School of Medicine, and Hilary Finucane of the Broad Institute and Harvard Medical School. The study builds on MPRA work Tewhey and Ulirsch published together at the Broad Institute in 2016.