Stanford engineers read proteins by converting them back into DNA
DNA sequencing has become fast, cheap, and routine. A human genome that cost $3 billion to read in 2003 can now be sequenced for a few hundred dollars in under a day. Proteins - the molecules that actually carry out the work of living cells - have no equivalent technology. Reading their sequences remains slow, expensive, and limited in scale.
This gap matters. DNA tells you what a cell could build. Proteins tell you what it actually built, in what quantities, and in what forms. Two cells with identical genomes can behave completely differently because they produce different sets of proteins. Understanding disease, immune response, and drug action often comes down to knowing which proteins are present and in what amounts - information that DNA alone cannot provide.
A team led by bioengineers at Stanford University has now found a way around the bottleneck, and the strategy is counterintuitive: instead of building a better protein sequencer, they convert proteins back into DNA and read them with the sequencing platforms that already exist.
Twenty amino acids versus four bases
The reason protein sequencing has been so much harder than DNA sequencing is fundamental chemistry. DNA is built from four nucleotide bases - A, T, C, and G. Proteins are built from 20 different amino acids. Those amino acids are also physically smaller than nucleotides, roughly one-third the size, making them harder for any detection technology to reliably distinguish from one another.
The dominant method for identifying proteins has long been mass spectrometry, which works by breaking proteins into fragments, ionizing them, and measuring the mass-to-charge ratio of each piece. Mass spectrometry is powerful, but it has a sensitivity problem. From a starting sample containing billions of protein molecules, the technique typically detects about a million - a recovery rate of roughly 0.1 percent or less, which leaves the vast majority unseen.
Rare proteins, which may be the most biologically important, are the hardest to catch. A protein that exists in only a handful of copies in a given cell - potentially driving disease progression or marking a specific immune state - can easily fall below the detection threshold of mass spectrometry.
Running the biological process in reverse
The Stanford approach, published March 18 in Nature Biotechnology, works by tagging individual amino acids within a peptide chain with short DNA barcodes. Each barcode encodes two pieces of information: which amino acid it is attached to (identified using specific antibodies) and where in the peptide chain it sits (marked by synthetic cycle barcodes that record sequential position).
Once every detectable amino acid in a peptide has been tagged with its DNA barcode, the protein's identity and sequence have been translated into DNA. At that point, standard DNA sequencing platforms - fast, affordable, and widely available - take over. The DNA barcodes are read out, and the protein sequence is reconstructed computationally.
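To make the barcode idea concrete, here is a minimal Python sketch of the read-out step. It is an illustration under stated assumptions, not the Stanford team's actual chemistry or software: the six-base barcode layout (three bases for position, three for amino acid identity), the codebooks, and the function names encode and decode are all hypothetical.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids, one-letter codes

# Hypothetical codebooks (assumption, not from the paper):
# three bases give 4**3 = 64 distinct codes, enough to label 20 amino acids.
TRIPLETS = ["".join(p) for p in product("ACGT", repeat=3)]
AA_CODE = dict(zip(TRIPLETS, AMINO_ACIDS))           # first 20 triplets -> amino acid
CYCLE_CODE = {t: i for i, t in enumerate(TRIPLETS)}  # triplet -> position in peptide


def encode(peptide):
    """Return one 6-base barcode per residue: 3 bases of position + 3 of identity."""
    return [TRIPLETS[i] + TRIPLETS[AMINO_ACIDS.index(aa)]
            for i, aa in enumerate(peptide)]


def decode(barcodes, length):
    """Rebuild a peptide from barcodes in any order; untagged positions stay 'X'."""
    seq = ["X"] * length
    for bc in barcodes:
        pos = CYCLE_CODE[bc[:3]]   # where in the chain the residue sits
        seq[pos] = AA_CODE[bc[3:]]  # which amino acid was recognized
    return "".join(seq)


if __name__ == "__main__":
    tags = encode("MKTAY")
    del tags[2]                       # simulate a residue that was never tagged
    print(decode(reversed(tags), 5))  # read order does not matter
```

Because each barcode carries its own position, the reads can arrive in any order, and a residue that was never tagged simply shows up as a gap rather than shifting the rest of the sequence.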
H. Tom Soh, the W. M. Keck Foundation Professor of Electrical Engineering at Stanford and senior author of the study, framed it simply: in nature, proteins are made from DNA. Over the past two decades, society has built extraordinary technology for sequencing DNA quickly and inexpensively, but has not achieved similar progress for proteins. This work runs the natural process in reverse, converting protein sequences back into DNA so that existing sequencing infrastructure can be leveraged.
A thousand-fold jump in molecular coverage
The potential scale advantage over mass spectrometry is significant. First author Liwei Zheng, a research engineer at Stanford, explained the difference in concrete terms. With mass spectrometry, a researcher starts with 1 billion to 10 billion protein molecules and typically recovers about a million identifiable signals. The new method could potentially detect a thousand times as many, on the order of a billion molecules, from a comparable sample.
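The arithmetic behind that comparison can be checked in a few lines of Python. The figures are the ones quoted above, taken at face value. The thousand-fold gain is a stated potential, not a measured result, and the variable names are illustrative.

```python
MS_DETECTED = 1_000_000  # "about a million" identifiable signals
FOLD_GAIN = 1_000        # "a thousand times" (potential, not measured)
STARTING_SAMPLES = (1_000_000_000, 10_000_000_000)  # 1 billion to 10 billion molecules


def recovery_percent(detected, starting):
    """Share of the starting molecules that end up detected, in percent."""
    return 100.0 * detected / starting


for starting in STARTING_SAMPLES:
    ms = recovery_percent(MS_DETECTED, starting)
    upper = recovery_percent(MS_DETECTED * FOLD_GAIN, starting)
    print(f"{starting:.0e} molecules: mass spec {ms:.2f}%, "
          f"thousand-fold potential {upper:.0f}%")
```

On these numbers, mass spectrometry recovers between 0.01 and 0.1 percent of the starting molecules, while a thousand-fold improvement would correspond to between 10 and 100 percent.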
That difference in coverage could bring rare protein species into view for the first time. In cancer research, for example, the proteins that distinguish a responsive tumor cell from a resistant one might exist in tiny quantities. In immunology, the surface markers that identify specific T cell subpopulations during immunotherapy response may be present on only a fraction of cells. Mass spectrometry averages across a bulk sample. This technique, operating at single-molecule resolution, could capture the diversity that bulk measurements erase.
Soh points to CAR-T cell therapy as a case in point. These engineered immune cells have transformed treatment for certain blood cancers, but their effectiveness varies dramatically depending on the cancer type and the specific cells involved. By isolating individual immune cells and reading their protein profiles at single-molecule sensitivity, researchers could begin to understand why the therapy works in some contexts and fails in others - a question that bulk protein analysis has struggled to answer.
From chemistry to instrument
The technology has already attracted commercial interest and has been licensed with the aim of developing a user-friendly instrument. The vision, as Soh describes it, is a device where a researcher can load a sample, press a button, and receive protein sequence data - analogous to the workflow that DNA sequencing instruments already provide.
Zheng highlighted an additional advantage of the DNA-conversion approach. Once protein information has been encoded as DNA, it becomes accessible to the full toolkit of molecular biology. DNA can be amplified, copied, lengthened, and manipulated with enzymes that evolution has refined over billions of years. Converting protein data into DNA format means that all of these tools become available for processing protein sequence information - a capability that direct protein analysis methods lack.
Early stage, significant hurdles
The method is still in its early stages, and the path to widespread adoption is not straightforward. The current chemistry does not read every amino acid in a peptide chain with equal efficiency. Some amino acids are easier to tag with antibodies than others, and gaps in coverage affect reconstruction accuracy. The antibody panel used for recognition is limited and would need to be expanded for comprehensive sequencing.
The work has been demonstrated primarily with purified peptides in controlled laboratory conditions. Applying it to the complex mixture of thousands of different proteins present in a real biological sample - a cell lysate, a blood sample, or a tissue biopsy - introduces challenges of specificity, noise, and data processing that have not yet been fully addressed.
The computational task of reconstructing full protein sequences from partial barcode reads across millions of molecules is also non-trivial. Error correction, alignment, and assembly algorithms will need to be developed and validated as the technology scales.
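As a rough illustration of what one piece of that error correction might look like, the Python sketch below merges several partial reads of the same peptide by majority vote at each position. It is a generic consensus technique, not the authors' algorithm. It assumes the reads have already been grouped by peptide and aligned to the same length, with 'X' marking a position that was not read, and the function name consensus is hypothetical.

```python
from collections import Counter


def consensus(reads, gap="X"):
    """Majority-vote consensus over equal-length, pre-aligned partial reads.

    Gap characters do not vote; a position with no votes at all stays a gap.
    """
    if not reads:
        return ""
    length = len(reads[0])
    if any(len(r) != length for r in reads):
        raise ValueError("reads must be aligned to the same length")

    result = []
    for column in zip(*reads):  # walk the reads position by position
        votes = Counter(c for c in column if c != gap)
        result.append(votes.most_common(1)[0][0] if votes else gap)
    return "".join(result)


if __name__ == "__main__":
    # Each read has a gap or a misread residue in a different place.
    reads = ["MKXAY", "MXTAY", "MKTGY", "XKTAY"]
    print(consensus(reads))
```

In this toy case each read is missing or misreading a different residue, yet the combined result recovers the full sequence. Real samples would add the harder problems of deciding which reads belong to the same peptide and of handling positions where the votes are tied.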
Soh, Zheng, and co-author Yujia Sun are listed as co-inventors on a pending patent application related to the work. The research was funded by the Helmsley Charitable Trust and the Wellcome Leap SAVE program.
If the approach matures as its developers hope, it would fill a gap that molecular biology has struggled with for decades: the ability to read proteins with the same speed, scale, and affordability that we now take for granted when reading DNA. We don't yet know whether this specific chemistry will be the one that closes that gap. But the strategy - translating proteins into a molecular language we already know how to read at scale - is compelling.