PRESS-NEWS.org - Press Release Distribution

refget v2.0 links the hidden dictionaries of DNA

Millions of genomes have been sequenced worldwide. The GA4GH standard refget quietly helps to decipher many of them. Now a new refget version opens up reliable reference genome retrieval to even more communities.

2023-07-19
(Press-News.org)  

A widely used tool that finds the exact references needed to pinpoint differences in our DNA just got a refresh.

On 17 July, the Standards Steering Committee of the Global Alliance for Genomics and Health (GA4GH) voted to release refget v2.0. With better compatibility for a range of reference genome names, formats, and systems, the new version of refget makes it easier than ever to retrieve verified genomic reference sequences.

 

A vital infrastructure

You may not even realise that you’re using refget already.

“Almost anyone who uses a CRAM file uses refget,” said Timothée Cezard, co-leader of the team that produced the new refget version and a project lead at EMBL’s European Bioinformatics Institute (EMBL-EBI). “Compression, decompression, all CRAM tools (say, for conversion to other formats or direct analysis) sit on top of refget.”

CRAM is a popular and efficient file format for storing DNA sequences, able to reduce storage costs by up to 50%. It achieves that staggering compression by relying on a reference sequence: a stretch of DNA considered typical.

Compare your own genes against the reference, and you will begin to see variation: genetic differences that could lead to everything from freckles to a high risk of breast cancer.

Instead of storing all three billion base pairs of the reference sequence alongside the DNA being studied, CRAM files simply hold onto the reference sequence’s name.

When it’s time to decompress the data, refget steps in — helping you “get” the “reference” you need. 
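To make the idea concrete, here is a toy sketch of reference-based storage. It is an illustration only and not the real CRAM encoding, which is a far richer binary format; the function names `encode_read` and `decode_read` and the short sequences are made up for this example.

```python
from typing import List, Tuple

# Toy illustration only: real CRAM uses a much more elaborate binary encoding.
Diff = Tuple[int, str]  # (offset within the read, base that differs from the reference)


def encode_read(reference: str, read: str, start: int) -> Tuple[int, int, List[Diff]]:
    """Store a read as its start position, its length, and its mismatches."""
    window = reference[start:start + len(read)]
    if len(window) != len(read):
        raise ValueError("read extends past the end of the reference")
    diffs = [
        (offset, base)
        for offset, (ref_base, base) in enumerate(zip(window, read))
        if ref_base != base
    ]
    return start, len(read), diffs


def decode_read(reference: str, start: int, length: int, diffs: List[Diff]) -> str:
    """Rebuild the read by copying the reference window and applying the mismatches."""
    bases = list(reference[start:start + length])
    for offset, base in diffs:
        bases[offset] = base
    return "".join(bases)


reference = "ACGTACGTACGTACGT"        # made-up reference
read = "GTACCTAC"                     # matches reference[2:10] except for one base
encoded = encode_read(reference, read, 2)

# Decoding against a slightly different reference silently yields the wrong read.
wrong_reference = "ACGAACGTACGTACGT"
corrupted = decode_read(wrong_reference, *encoded)
```

Only the position, the length, and one mismatch are stored, so the read can be rebuilt only if exactly the same reference is available at decoding time. That is the step where a verified reference matters.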

 

Solving the dictionary dilemma

CRAM is just one example of how refget removes dangerous uncertainty from genomic data. 

“refget identifies each reference sequence using its inherent unique qualities, so you can always trust that a sequence contains what it says on the label,” said Andrew Yates, a founding developer of refget and a team leader at EMBL-EBI.

“The consequences of comparing genomic data to incorrect or misaligned reference sequences are serious. Genetic variants may be classified as pathogenic or harmless incorrectly, and patients could receive improper care. Being exact matters,” he said.

By assigning a unique identifier to reference sequences, refget solves a tricky naming problem in genomics.

Central authorities like the International Nucleotide Sequence Database Collaboration (INSDC), Ensembl, and the University of California, Santa Cruz (UCSC) Genome Browser use different naming conventions for the same reference sequence.

Think of how the Oxford English Dictionary and Merriam-Webster sometimes spell and define the same English word differently. Then try to convince a British English speaker to use “color” instead of “colour,” and you will see the challenge of standardising nomenclature.

Non-unique names create even more uncertainty when analysing data. For instance, another common naming convention counts by chromosome number, starting with chromosome 1 as the largest. But the reference genomes of many organisms have a chromosome called “1.” How do you know you’re getting a human chromosome and not a mouse one? Which “1” is the right one to use?

refget clears up any confusion.

“refget is very straightforward. You’ve got a name, you grab a sequence. You have a sequence, you construct the name,” said Cezard. “You don’t need to rely on any naming authority.”
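The two directions Cezard describes can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a refget server: the class `SequenceStore` and the function `content_name` are hypothetical, the sequences are made-up stand-ins rather than real chromosome data, and MD5 is used simply because it is a checksum refget accepts.

```python
import hashlib


def content_name(sequence: str) -> str:
    """Construct a name from the sequence itself (uppercase bases, MD5 hex digest)."""
    return hashlib.md5(sequence.upper().encode("ascii")).hexdigest()


class SequenceStore:
    """Hypothetical in-memory store keyed by content-derived names."""

    def __init__(self) -> None:
        self._by_name = {}

    def add(self, sequence: str) -> str:
        """You have a sequence, you construct the name."""
        name = content_name(sequence)
        self._by_name[name] = sequence.upper()
        return name

    def get(self, name: str) -> str:
        """You've got a name, you grab a sequence (and can re-verify it)."""
        sequence = self._by_name[name]
        if content_name(sequence) != name:
            raise ValueError("stored sequence does not match its name")
        return sequence


store = SequenceStore()
human_chr1 = "ACGTTGCAAC"  # made-up stand-in; both would be labelled "1" by number
mouse_chr1 = "ACGTTGCATC"  # made-up stand-in
human_name = store.add(human_chr1)
mouse_name = store.add(mouse_chr1)
```

Because the name is derived from the content, the two sequences that would both be called “1” under a numbering convention receive different names, and anyone holding a sequence can recompute its name without asking a naming authority.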

 

Why you need refget for any genomic analysis

For the initial refget release in 2018, the GA4GH Large Scale Genomics Work Stream tailored the API to support CRAM.

But Yates and team quickly realised that refget could smooth over issues in other genomic data formats and models. VCF and SAM also support refget identifiers, with growing community interest in using them.

“refget is a fundamental building block for GA4GH standards,” said Yates. “It can solve problems beyond CRAM, for any file format or data model that requires a reference sequence. With refget, you know exactly what sequence you’re talking about.”

For instance, refget is already solving problems for the GA4GH Variation Representation Specification (VRS), which provides a framework for describing genetic variants that computers can easily compare and analyse. 

Yates and Cezard worked closely with the VRS team to develop refget v2.0, which supports VRS sequence identifiers. Now hospitals, laboratories, and databases like NIH-funded ClinGen, which use VRS to represent and share genetic variants, link to reference sequences via refget.

“In large part due to refget, GA4GH VRS makes sharing and comparing variant data across institutions much more reliable. refget lets us pinpoint the exact reference sequence, which then helps us represent the variation unambiguously,” said Larry Babb, a leader of the VRS team, principal software engineer at the Broad Institute of MIT and Harvard, and a GA4GH Driver Project Champion for ClinGen.

“By using refget identifiers in VRS, we can tackle important interoperability challenges that arise when comparing evidence from new reference sequences. The strategy has already worked in the real world in a project with the Atlas of Variant Effects Alliance,” said Alex Wagner, the other VRS team leader, who is a principal investigator at Nationwide Children’s Hospital and GA4GH Driver Project Champion for the Variant Interpretation for Cancer Consortium.

Another major resource for the genomics community, the European Nucleotide Archive (ENA), has already implemented refget v2.0.

ENA contains all sequenced DNA and RNA in the public domain — nearly three billion sequences. To decompress files from the database, researchers use the CRAM reference registry, which runs on refget.

The new version of refget will also debut in the Ensembl genome browser. This collection of more than 50,000 genomes (representing great diversity within and among species, from humans to maize to zebrafish) offers tools for analysis and comparison.

“refget is powering our new Ensembl infrastructure. These refget endpoints will be made available in the near future and will provide access to Ensembl-hosted protein and transcript sequences,” said Yates.

 

New features in v2.0

The latest version of refget expands the capabilities of the API, making it more accessible and compatible with other systems.

Work with the VRS team led to a new preferred algorithm for defining identifiers. Other new features detailed in the specification include recommended best practices (such as lowercase naming authority strings), and options when searching for a specific identifier (with or without a namespace).
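The release does not spell out the algorithm. The sketch below assumes it is the GA4GH truncated digest (often written sha512t24u) shared with VRS: take the SHA-512 of the uppercase sequence, keep the first 24 bytes, base64url-encode them, and prefix the result with `SQ.`. The helper name `ga4gh_sequence_id` is hypothetical, and the specification remains the authority on details such as normalisation.

```python
import base64
import hashlib


def sha512t24u(data: bytes) -> str:
    """SHA-512 digest, truncated to 24 bytes, base64url-encoded (32 characters)."""
    digest = hashlib.sha512(data).digest()
    return base64.urlsafe_b64encode(digest[:24]).decode("ascii")


def ga4gh_sequence_id(sequence: str) -> str:
    """Assumed identifier form: 'SQ.' followed by the truncated digest."""
    return "SQ." + sha512t24u(sequence.upper().encode("ascii"))


if __name__ == "__main__":
    print(ga4gh_sequence_id("ACGT"))
```

Twenty-four bytes encode to exactly 32 base64url characters with no padding, so under these assumptions every identifier has the same length and is safe to use in a URL.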

One key change — aimed at expanding the groups who can benefit from refget — allows you to search not just by unique refget identifier, but by another naming convention. In dictionary terms, you can look up either “colour” or “color” and still retrieve the right definition.

“refget servers can now retrieve the same sequence using a different naming convention. The new version is interoperable with other systems that rely on naming authorities, so you can search even if you don’t have access to the reference sequence itself,” said Cezard. 

“You can enter a name that isn’t a refget identifier and still get the same verified, reliable sequence — which you can then recompute into a refget identifier,” he added.

The new version includes technical solutions for handling non-unique names.
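A rough sketch of how lookup by another naming convention might behave is shown below. Everything in it is an assumption made for illustration: the class `AliasResolver`, the namespaces `authority_a`, `authority_b`, and `authority_c`, the identifiers, and the choice to reject an ambiguous unqualified name are not taken from the specification.

```python
from typing import Dict, Iterable, Optional, Tuple


def parse_identifier(identifier: str) -> Tuple[Optional[str], str]:
    """Split 'namespace:value'; the namespace is lowercased, the value is left as-is."""
    if ":" in identifier:
        namespace, value = identifier.split(":", 1)
        return namespace.lower(), value
    return None, identifier


class AliasResolver:
    """Hypothetical resolver mapping namespaced aliases to one stored sequence."""

    def __init__(self) -> None:
        self._sequences: Dict[str, str] = {}             # canonical id -> sequence
        self._aliases: Dict[Tuple[str, str], str] = {}   # (namespace, name) -> canonical id

    def register(self, canonical_id: str, sequence: str, aliases: Iterable[str]) -> None:
        self._sequences[canonical_id] = sequence
        for alias in aliases:
            namespace, value = parse_identifier(alias)
            if namespace is None:
                raise ValueError("aliases must carry a namespace")
            self._aliases[(namespace, value)] = canonical_id

    def resolve(self, identifier: str) -> str:
        namespace, value = parse_identifier(identifier)
        if namespace is not None:
            key = (namespace, value)
            if key not in self._aliases:
                raise LookupError(f"unknown identifier: {identifier}")
            return self._sequences[self._aliases[key]]
        if value in self._sequences:
            return self._sequences[value]
        # Without a namespace, accept the name only if it points at one sequence.
        matches = {cid for (_, name), cid in self._aliases.items() if name == value}
        if len(matches) != 1:
            raise LookupError(f"ambiguous or unknown name: {identifier}")
        return self._sequences[matches.pop()]


resolver = AliasResolver()
resolver.register("SQ.exampleHuman1", "ACGTTGCAAC", ["authority_a:1", "authority_b:chr1"])
resolver.register("SQ.exampleMouse1", "ACGTTGCATC", ["authority_c:1"])
```

In this sketch, `resolver.resolve("authority_a:1")` and `resolver.resolve("authority_b:chr1")` return the same sequence, while a bare `"1"` is refused because two authorities use it for different sequences.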

These major v2.0 upgrades don’t entail major work for implementers: all existing refget clients can continue using the API. The only breaking change is minimal: it makes refget servers compatible with the GA4GH Service Info API, which helps find web services for analysing genomic data.

 

refget for an entire genome

Building on the same principles as refget, the team is currently developing a new specification that verifies the identity of collections of sequences.

“refget defines a name for a single sequence, like a chromosome. Sequence Collections defines a name for a group of sequences, which we would often use for assemblies or a whole genome,” said Cezard.
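The general idea of naming a group can be illustrated with a digest computed over the members' own names. This is not the Sequence Collections algorithm, which was still in development when this release was written: the `COLL.` prefix, the newline-joined serialisation, the order sensitivity, and the function names are all invented for this example, and the per-sequence digest assumes the truncated SHA-512 form used by GA4GH identifiers.

```python
import base64
import hashlib
from typing import Sequence


def sha512t24u(data: bytes) -> str:
    """SHA-512 digest, truncated to 24 bytes, base64url-encoded."""
    digest = hashlib.sha512(data).digest()
    return base64.urlsafe_b64encode(digest[:24]).decode("ascii")


def sequence_name(sequence: str) -> str:
    """Name for a single sequence, derived from its content."""
    return "SQ." + sha512t24u(sequence.upper().encode("ascii"))


def collection_name(sequences: Sequence[str]) -> str:
    """Name for a group of sequences (invented scheme, for illustration only)."""
    # Invented serialisation: member names joined by newlines, in the order given.
    member_names = [sequence_name(sequence) for sequence in sequences]
    return "COLL." + sha512t24u("\n".join(member_names).encode("ascii"))


assembly = ["ACGTTGCAAC", "GGCCTTAA", "TTTTACGT"]          # made-up stand-in sequences
patched_assembly = ["ACGTTGCAAC", "GGCCTTAT", "TTTTACGT"]  # one base changed

original_name = collection_name(assembly)
patched_name = collection_name(patched_assembly)
```

Changing a single base in any member changes that member's name and therefore the name of the whole group, which is the property that makes such a name useful for checking that two parties hold the same assembly.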

Sequence Collections will offer many new features beyond defining names, including searching within and comparing collections. 

In the meantime, Cezard, Yates, and collaborators aim to strengthen support for refget in a wide range of genomic file formats, from BED to SAM to VCF, and demonstrate how useful refget can be.

“refget already safeguards a crucial step in genomic analysis for researchers around the world,” said Yates. “This second version reinforces how vital the concept of unique identifiers really is, whether you are identifying a single reference sequence, an entire genome, or even a pangenome.”

END



