Evo 2 was developed by scientists from Arc Institute and NVIDIA, convening collaborators across Stanford University, UC Berkeley, and UC San Francisco. The model's code is publicly accessible from Arc's GitHub and is also integrated into the NVIDIA BioNeMo framework, as part of a collaboration between Arc Institute and NVIDIA to accelerate scientific research. Arc Institute also worked with AI research lab Goodfire to develop a mechanistic interpretability visualizer that uncovers the key biological features and patterns the model learns to recognize in genomic sequences. The Evo team has shared its training data, training and inference code, and model weights, making it the largest-scale, fully open-source AI model to date.
Building on its predecessor Evo 1, which was trained entirely on genomes of single-celled organisms, Evo 2 is the largest artificial intelligence model in biology to date, trained on over 9.3 trillion nucleotides (the building blocks that make up DNA or RNA) from over 128,000 whole genomes as well as metagenomic data. In addition to an expanded collection of bacterial, archaeal, and phage genomes, Evo 2 includes information from humans, plants, and other single-celled and multi-cellular species in the eukaryotic domain of life.
"Our development of Evo 1 and Evo 2 represents a key moment in the emerging field of generative biology, as the models have enabled machines to read, write, and think in the language of nucleotides," says Patrick Hsu, Arc Institute Co-Founder, Arc Core Investigator, an Assistant Professor of Bioengineering and Deb Faculty Fellow at University of California, Berkeley, and a co-senior author on the paper. "Evo 2 has a generalist understanding of the tree of life that's useful for a multitude of tasks, from predicting disease-causing mutations to designing potential code for artificial life. We're excited to see what the research community builds on top of these foundation models."
Evolution has encoded biological information in DNA and RNA, creating patterns that Evo 2 can detect and utilize. "Just as the world has left its imprint on the language of the Internet used to train large language models, evolution has left its imprint on biological sequences," says co-senior author Brian Hie, an Assistant Professor of Chemical Engineering at Stanford University, the Dieter Schwarz Foundation Stanford Data Science Faculty Fellow, and Arc Institute Innovation Investigator in Residence. "These patterns, refined over millions of years, contain signals about how molecules work and interact."
Evo 2 was trained for several months on the NVIDIA DGX Cloud AI platform via AWS, utilizing over 2,000 NVIDIA H100 GPUs and bolstered by collaboration with NVIDIA researchers and engineers. The model can process genetic sequences of up to 1 million nucleotides at once, enabling it to understand relationships between distant parts of a genome. Achieving this technical feat required the research team to reimagine how an AI model could quickly ingest and make inferences about this scale of data. The resulting AI architecture, called StripedHyena 2, enabled Evo 2 to be trained with 30 times more data than Evo 1 and reason over 8 times as many nucleotides at a time.
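To make the 1-million-nucleotide limit concrete, here is a minimal sketch of how a longer sequence might be split into windows that each fit in one pass. Only the 1 million figure comes from the text above; the helper name `context_windows`, the `overlap` parameter, and the windowing strategy itself are illustrative assumptions, not part of the Evo 2 code.

```python
from typing import Iterator, Tuple

# Maximum nucleotides per pass, as stated in the text above.
MAX_CONTEXT = 1_000_000


def context_windows(
    seq_len: int, window: int = MAX_CONTEXT, overlap: int = 0
) -> Iterator[Tuple[int, int]]:
    """Yield half-open (start, end) index pairs that cover seq_len nucleotides.

    Hypothetical helper: consecutive windows share `overlap` nucleotides so
    that features near a boundary appear whole in at least one window.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    if not 0 <= overlap < window:
        raise ValueError("overlap must be in [0, window)")

    step = window - overlap
    start = 0
    while start < seq_len:
        end = min(start + window, seq_len)
        yield start, end
        if end == seq_len:
            break  # the final window reached the end of the sequence
        start += step


if __name__ == "__main__":
    # A 2.5-million-nucleotide sequence needs three non-overlapping windows.
    for start, end in context_windows(2_500_000):
        print(start, end)
```

Any sequence at or below the limit comes back as a single window, which is the case where the model can relate distant parts of the sequence directly.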
The model already shows enough versatility to identify genetic changes that affect protein function and organism fitness. For example, in tests with variants of the breast cancer-associated gene BRCA1, Evo 2 achieved over 90% accuracy in predicting which mutations are benign versus potentially pathogenic. Predictions like this could help pinpoint the genetic causes of human diseases and accelerate the development of new medicines, saving countless hours and research dollars otherwise spent on cell or animal experiments.
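The text above does not say how variants are scored. One common approach with sequence models is to compare how likely the model finds the mutated sequence relative to the reference, and the sketch below illustrates that idea. It assumes a toy k-mer Markov model as a stand-in for a genomic language model; it does not use Evo 2 or its API, and all function names are hypothetical.

```python
import math
from collections import Counter


def train_kmer_model(sequences, k=3):
    """Build a toy order-k Markov model by counting (context, next base) pairs."""
    pair_counts = Counter()
    context_counts = Counter()
    for seq in sequences:
        for i in range(k, len(seq)):
            context = seq[i - k:i]
            pair_counts[(context, seq[i])] += 1
            context_counts[context] += 1
    return pair_counts, context_counts, k


def log_likelihood(seq, model):
    """Sum log P(base | previous k bases), with add-one smoothing over 4 bases."""
    pair_counts, context_counts, k = model
    total = 0.0
    for i in range(k, len(seq)):
        context = seq[i - k:i]
        prob = (pair_counts[(context, seq[i])] + 1) / (context_counts[context] + 4)
        total += math.log(prob)
    return total


def variant_score(reference, position, alt_base, model):
    """Log-likelihood of the variant minus that of the reference.

    A strongly negative score means the stand-in model finds the mutated
    sequence less plausible than the reference.
    """
    variant = reference[:position] + alt_base + reference[position + 1:]
    return log_likelihood(variant, model) - log_likelihood(reference, model)


if __name__ == "__main__":
    # Toy training data: a simple repeating pattern, not real genomic sequence.
    model = train_kmer_model(["ACGT" * 20])
    reference = "ACGT" * 5
    print(variant_score(reference, 6, "T", model))  # substitution breaks the pattern
    print(variant_score(reference, 6, "G", model))  # same base as the reference
```

In practice a score like this would be compared against a threshold, or fed into a classifier, to separate likely benign changes from potentially pathogenic ones; the toy model here only demonstrates the comparison itself.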
In the year since its preprint release, researchers have applied the model to a range of scientific problems, from predicting genetic disease risk in Alzheimer's patients to assessing variant effects across domesticated animal species. Arc researchers have also used Evo 2 to design functional synthetic bacteriophages, demonstrating potential applications for treating antibiotic-resistant bacteria.
In addition to genetic analysis, Evo 2 could be useful for engineering new biological tools or treatments. "If you have a gene therapy that you want to turn on only in neurons to avoid side effects, or only in liver cells, you could design a genetic element that is only accessible in those specific cells," says co-author and computational biologist Hani Goodarzi, an Arc Core Investigator and an Associate Professor of Biochemistry and Biophysics at the University of California, San Francisco. "This precise control could help develop more targeted treatments with fewer side effects."
The research team envisions that more specific AI models could be built with Evo 2 as a foundation. "In a loose way, you can think of the model almost like an operating system kernel—you can have all of these different applications that are built on top of it," says Arc's Chief Technology Officer Dave Burke, a co-author on the paper. "From predicting how single DNA mutations affect a protein's function to designing genetic elements that behave differently in different cell types, as we continue to refine the model and researchers begin using it in creative ways, we expect to see beneficial uses for Evo 2 we haven't even imagined yet."
In consideration of potential ethics and safety risks, the scientists excluded pathogens that infect humans and other complex organisms from Evo 2's base data set, and ensured that the model would not return productive answers to queries about these pathogens. Co-author Tina Hernandez-Boussard, a Stanford Professor of Medicine, and her lab members helped the team implement responsible development and deployment of this technology.
"Evo 2 has fundamentally advanced our understanding of biological systems," says Anthony Costa, director of digital biology at NVIDIA. "By overcoming previous limitations in the scale of biological foundation models with a unique architecture and the largest integrated dataset of its kind, Evo 2 generalizes across more known biology than any other model to date — and by releasing these capabilities broadly, Arc Institute has given scientists around the world a new partner in solving humanity's most pressing health and disease challenges."
###
Brixi, G., Durrant, M.G., Ku, J., Naghipourfar, M., Poli, M., Brockman, G., Chang, D., Fanton, A., Gonzalez, G.A., King, S.H., Li, D.B., Merchant, A.T., Nguyen, E., Ricci-Tam, C., Romero, D.W., Schmok, J.C., Sun, G., Taghibakhshi, A., Vorontsov, A., Yang, B., Deng, M., Gorton, L., Nguyen, N., Wang, N.K., Pearce, M.T., Simon, E., Adams, E., Amador, Z.J., Ashley, E.A., Baccus, S.A., Dai, H., Dillmann, S., Ermon, S., Guo, D., Herschl, M.H., Ilango, R., Janik, K., Lu, A.X., Mehta, R., Mofrad, M.R.K., Ng, M.Y., Pannu, J., Ré, C., St. John, J., Sullivan, J., Tey, J., Viggiano, B., Zhu, K., Zynda, G., Balsam, D., Collison, P., Costa, A.B., Hernandez-Boussard, T., Ho, E., Liu, M.-Y., McGrath, T., Powell, K., Pinglay, S., Burke, D.P., Goodarzi, H., Hsu, P.D., & Hie, B.L. (2026). Genome modeling and design across all domains of life with Evo 2. Nature. https://doi.org/10.1038/s41586-026-10176-5
Arc Institute is an independent nonprofit research organization based in Palo Alto, California, that aims to accelerate scientific progress and understand the root causes of complex diseases. Arc's investigators are supported by long-term funding and freedom to pursue bold ideas. Its Technology Centers leverage multi-omics, genome engineering, and cellular, mammalian and computational models to advance discoveries at the intersection of biology and artificial intelligence. Founded in 2021, Arc partners with Stanford, UC Berkeley, and UCSF.