AlphaFold Database now includes millions of protein complex structures - and it is just the beginning
The AlphaFold Database already held predicted structures for more than 200 million individual proteins. What it lacked were the structures of protein complexes - the molecular machines that form when two or more proteins bind together to carry out biological functions. That gap has now begun to close.
1.7 million complexes released, 30 million calculated
A collaboration between EMBL's European Bioinformatics Institute (EMBL-EBI), Google DeepMind, NVIDIA, and Seoul National University has calculated predicted structures for 30 million protein complexes. Of these, 1.7 million high-confidence homodimer predictions - complexes formed by two copies of the same protein - have been added to the AlphaFold Database and made openly accessible. Another 18 million lower-confidence homodimers are available for bulk download. The remaining predictions are heterodimers (complexes of two different proteins), which are still being analyzed.
This is the largest dataset of protein complex predictions currently available.
The dataset prioritizes proteins relevant to human health, covering 20 of the most studied species including humans, as well as organisms on the World Health Organization's bacterial priority pathogens list - the bacteria most urgently in need of new antibiotics.
Why complexes matter more than individual proteins
Individual protein structures tell you what a molecule looks like in isolation. But biology does not work in isolation. Proteins bind to each other, forming complexes that carry out functions neither partner can perform alone. Signaling cascades, gene regulation, immune responses, metabolic pathways - all depend on specific protein-protein interactions.
Understanding these interactions is central to drug development. Many drugs work by disrupting or mimicking protein-protein interfaces. Knowing the three-dimensional structure of those interfaces - which amino acids touch, how the surfaces complement each other, where vulnerabilities lie - accelerates the design of molecules that can intervene.
"The human genome has just over 20,000 different proteins," said Dame Janet Thornton, Director Emeritus of EMBL-EBI. "Despite this relatively small genome, human beings display incredibly complex pathways, processes and regulation. Much of this complexity arises from the intermolecular interactions between proteins."
The infrastructure behind the predictions
Generating 30 million complex predictions required significant computational muscle. NVIDIA provided AI infrastructure and scaled inference pipelines. The Steinegger Lab at Seoul National University developed methodological improvements to multiple sequence alignment calculations and deep learning inference, building on Google DeepMind's AlphaFold system. EMBL-EBI contributed scientific and biodata management expertise and, together with Google DeepMind, integrated the results into the existing database.
The data hosted by the collaboration would take approximately 17 million hours of GPU computing to recreate. By generating these predictions once and making them freely available, the project removes a computational barrier that would keep most research groups from accessing protein complex structural information at this scale.
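To give a sense of that scale, the figure of 17 million GPU hours can be converted into more familiar units. The short Python sketch below is illustrative only: the 1,000-GPU cluster is a hypothetical assumption that does not come from the project, and the calculation assumes perfect parallel scaling, which real workloads rarely achieve.

```python
# Approximate GPU time needed to recreate the dataset (figure from the article).
GPU_HOURS = 17_000_000

# Assumption for illustration only: a hypothetical cluster size,
# not a number reported by the collaboration.
HYPOTHETICAL_GPUS = 1_000

HOURS_PER_YEAR = 24 * 365  # 8,760 hours, ignoring leap years


def gpu_years(gpu_hours: float) -> float:
    """Convert GPU hours into GPU-years (one GPU running non-stop)."""
    return gpu_hours / HOURS_PER_YEAR


def wall_clock_days(gpu_hours: float, n_gpus: int) -> float:
    """Days of wall-clock time if the work is split evenly across n_gpus.

    Assumes perfect parallel scaling with no idle time or overhead.
    """
    if n_gpus <= 0:
        raise ValueError("n_gpus must be positive")
    return gpu_hours / n_gpus / 24


if __name__ == "__main__":
    print(f"GPU-years: {gpu_years(GPU_HOURS):,.0f}")
    print(
        f"Days on {HYPOTHETICAL_GPUS:,} GPUs: "
        f"{wall_clock_days(GPU_HOURS, HYPOTHETICAL_GPUS):,.0f}"
    )
```

Under these assumptions, the workload comes to roughly 1,900 GPU-years, or close to two years of continuous running on the hypothetical 1,000-GPU cluster.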
"By making predicted protein complexes accessible at an unprecedented scale, we are illuminating an unseen landscape of molecular interactions across the tree of life," said Martin Steinegger, Associate Professor at Seoul National University.
Predictions, not experimental structures
An important caveat: these are AI-predicted structures, not experimentally determined ones. The distinction matters. Predicted structures are hypotheses about how proteins fold and interact, generated by neural networks trained on experimentally solved structures. They are remarkably accurate in many cases but can be wrong, particularly for protein regions that are intrinsically disordered or for interactions that depend on conditions (pH, temperature, the presence of other molecules) not captured by the prediction model.
The 1.7 million structures added to the main database are those that met high-confidence thresholds. The 18 million lower-confidence predictions, available separately, require more careful interpretation. Researchers using any predicted structure will still need experimental validation for high-stakes applications like drug design.
The AlphaFold Database now has over 3.4 million users from 190 countries. More protein complex predictions will be added in the coming months as the heterodimer analyses are completed. The work is described in a preprint.