Developed a 21-language, fast and high-fidelity neural text-to-speech technology that works on smartphones

2024-07-26

(Press-News.org) Highlights

-Developed a 21-language, fast and high-fidelity neural text-to-speech technology

-The developed model can synthesize one second of speech in only 0.1 seconds using a single CPU core, which is about eight times faster than the conventional methods

-The developed model can realize fast synthesis with a latency of 0.5 seconds on a smartphone without a network connection

-The technology is expected to be introduced into speech applications, such as multilingual speech translation and car navigation

Abstract

The Universal Communication Research Institute of the National Institute of Information and Communications Technology (NICT, President: TOKUDA Hideyuki, Ph.D.) has successfully developed a 21-language, fast and high-fidelity neural text-to-speech technology. This technology makes it possible to synthesize one second of speech in just 0.1 seconds using a single CPU core, which is about eight times faster than the conventional methods. It also enables fast synthesis with a latency of 0.5 seconds on a mid-range smartphone without a network connection (see Figure 1).
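The speed figures above can be read as a real-time factor (RTF), the ratio of compute time to the duration of the audio produced. The following is a minimal sketch of that arithmetic only; the function names are hypothetical, and the "conventional" value is inferred from the stated eight-fold ratio rather than taken from a measurement in this release.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """Compute time divided by audio duration; below 1.0 is faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def synthesis_time(audio_seconds: float, rtf: float) -> float:
    """Estimated compute time needed to synthesize audio of the given duration."""
    return audio_seconds * rtf


# Figures quoted in the release: 0.1 s of compute per 1 s of speech on one CPU core.
developed_rtf = real_time_factor(0.1, 1.0)

# "About eight times faster" implies a conventional RTF of roughly 8 x 0.1.
# This is an inference from the stated ratio, not a measured value.
SPEEDUP = 8.0
implied_conventional_rtf = developed_rtf * SPEEDUP

print(f"developed RTF: {developed_rtf:.2f}")
print(f"implied conventional RTF: {implied_conventional_rtf:.2f}")
print(f"compute time for 5 s of speech: {synthesis_time(5.0, developed_rtf):.2f} s")
```

An RTF well below 1.0 on a single core is what leaves headroom for on-device synthesis; the 0.5-second smartphone latency is a separate figure and is not derived from this calculation.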

Furthermore, the developed 21-language neural text-to-speech models are installed on the server of VoiceTra, a multilingual speech translation application for smartphones operated by NICT, and have been made available to the public. In the future, the technology is expected to be introduced into various speech applications, such as multilingual speech translation and car navigation through commercial licensing.

These results will be presented at INTERSPEECH 2024 Show & Tell, an international conference hosted by the International Speech Communication Association (ISCA) in September 2024.

Background

The Universal Communication Research Institute of NICT is conducting R&D on multilingual speech translation technology to realize spoken language communication that transcends language barriers. The outcomes of this R&D have been released to the public as a field experiment through VoiceTra, a speech translation application for smartphones, and have been deployed in many other products and services through commercial licensing. Text-to-speech technology, which synthesizes the translated text as human speech, is as important to multilingual speech translation as automatic speech recognition and machine translation. The sound quality of synthesized speech has improved dramatically in recent years thanks to the introduction of neural network technology and has reached a level comparable to that of natural speech. However, the huge amount of computation required was a major issue, making it impossible to synthesize speech on a smartphone without a network connection.

Furthermore, NICT is currently conducting R&D on multilingual simultaneous interpretation technology. In simultaneous interpretation, translated speech must be output continuously without waiting for the speaker to finish speaking. It is therefore indispensable to further accelerate text-to-speech, as well as automatic speech recognition and machine translation.

Achievements

Text-to-speech models are typically constructed from an acoustic model that converts input text into intermediate features, and a waveform generative model that converts intermediate features into speech waveforms.
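The two-stage structure described above can be sketched as follows. This is only an illustration of the data flow under stated assumptions: both functions are dummy stand-ins rather than neural networks, and the constants (frames per character, feature dimension, samples per frame) are hypothetical values chosen for the example, not parameters of the NICT system.

```python
from typing import List

FRAMES_PER_SYMBOL = 5  # hypothetical: feature frames produced per input character
FEATURE_DIM = 80       # hypothetical: size of each intermediate feature frame
HOP_SIZE = 300         # hypothetical: waveform samples generated per feature frame


def acoustic_model(text: str) -> List[List[float]]:
    """Stand-in for stage 1: input text -> sequence of intermediate feature frames."""
    frames: List[List[float]] = []
    for ch in text:
        value = (ord(ch) % 32) / 32.0  # dummy feature value, not a real model output
        frames.extend([[value] * FEATURE_DIM for _ in range(FRAMES_PER_SYMBOL)])
    return frames


def waveform_model(frames: List[List[float]]) -> List[float]:
    """Stand-in for stage 2: intermediate feature frames -> waveform samples."""
    samples: List[float] = []
    for frame in frames:
        level = sum(frame) / len(frame)
        samples.extend([level] * HOP_SIZE)
    return samples


def synthesize(text: str) -> List[float]:
    """Chain the two stages: text -> features -> waveform."""
    return waveform_model(acoustic_model(text))


wave = synthesize("hello")
print(len(acoustic_model("hello")), "frames ->", len(wave), "samples")
```

Because the stages communicate only through the intermediate features, each one can be replaced or accelerated independently, which is how the acoustic-model and waveform-model improvements described in this section are presented.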

Neural networks that combine a Transformer encoder and a Transformer decoder, which are widely used in machine translation, automatic speech recognition, and large language models (e.g. ChatGPT), are the mainstream in acoustic modeling for neural text-to-speech. We instead introduced high-speed, high-performance neural networks (ConvNeXt encoder + ConvNeXt decoder), which were recently proposed for image recognition, into the acoustic model, and achieved three times faster synthesis than the conventional methods without performance degradation.

In 2021, we introduced MS-HiFi-GAN, which extends the conventional model HiFi-GAN, a model that can synthesize speech equivalent to human speech, by representing the signal processing method [2-4] as a trainable neural network. MS-HiFi-GAN achieved two times faster synthesis without degrading synthesis performance [5]. In 2023, we developed MS-FC-HiFi-GAN by further accelerating MS-HiFi-GAN, and achieved four times faster synthesis than the conventional HiFi-GAN without degrading synthesis performance.

As the culmination of these achievements, we have developed a novel, fast and high-quality neural text-to-speech model using an acoustic model (Transformer encoder + ConvNeXt decoder) and a waveform generative model (MS-FC-HiFi-GAN), as shown in Figure 2. The developed model can synthesize one second of speech in only 0.1 seconds using a single CPU core, which is about eight times faster than the conventional models. In addition, by applying incremental synthesis only to the waveform generative model (see Figure 3), the developed model achieved fast synthesis with a latency of 0.5 seconds on a mid-range smartphone without a network connection and without degrading synthesis performance. This eliminates the need for an internet connection or conventional server-based synthesis, and enables high-quality neural text-to-speech on smartphones, PCs, and other devices with reduced communication costs. Furthermore, incremental synthesis also makes it possible to immediately synthesize translated text in multilingual simultaneous interpretation.
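The idea of applying incremental synthesis only to the waveform generative model can be sketched as follows: the acoustic model runs once over the whole input, and the waveform stage then emits audio chunk by chunk so that playback can begin before the full waveform exists. This is a simplified illustration, not the MS-FC-HiFi-GAN implementation; the stand-in functions, the chunk size, and the samples-per-frame value are all hypothetical, and the sketch ignores the context a real convolutional vocoder needs at chunk boundaries.

```python
from typing import Iterator, List

HOP_SIZE = 300     # hypothetical: waveform samples generated per feature frame
CHUNK_FRAMES = 20  # hypothetical: feature frames processed per incremental step


def acoustic_model(text: str) -> List[float]:
    """Stand-in acoustic model: runs once over the whole text (not incremental)."""
    return [(ord(ch) % 32) / 32.0 for ch in text for _ in range(5)]


def vocoder_chunk(frames: List[float]) -> List[float]:
    """Stand-in waveform model applied to one chunk of feature frames."""
    return [value for value in frames for _ in range(HOP_SIZE)]


def incremental_synthesize(
    text: str, chunk_frames: int = CHUNK_FRAMES
) -> Iterator[List[float]]:
    """Yield waveform chunks one at a time so playback can start early."""
    frames = acoustic_model(text)
    for start in range(0, len(frames), chunk_frames):
        yield vocoder_chunk(frames[start:start + chunk_frames])


chunks = list(incremental_synthesize("hello world"))
print([len(chunk) for chunk in chunks])
```

With this structure, the time to first audio depends on the acoustic model plus one chunk of waveform generation rather than on the length of the whole utterance, which is the property that matters for low-latency and simultaneous-interpretation use.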

Since March 2024, the developed technology has been used for neural text-to-speech in 21† of the languages supported in VoiceTra and has been made available to the public.

†21 languages: Japanese, English, Chinese, Korean, Thai, French, Indonesian, Vietnamese, Spanish, Myanmar, Filipino, Brazilian Portuguese, Khmer, Nepali, Mongolian, Arabic, Italian, Ukrainian, German, Hindi, and Russian

Future prospects

In the future, we will promote social implementation through commercial licensing, particularly in smartphone applications such as multilingual speech translation and car navigation systems.

Article information

Journal: Proceedings of INTERSPEECH 2024

Title: Mobile PresenTra: NICT fast neural text-to-speech system on smartphones with incremental inference of MS-FC-HiFi-GAN for low-latency synthesis

Authors: Takuma Okamoto, Yamato Ohtani, Hisashi Kawai

References

[1] T. Okamoto, Y. Ohtani, T. Toda and H. Kawai, "ConvNeXt-TTS and ConvNeXt-VC: ConvNeXt-based fast end-to-end sequence-to-sequence text-to-speech and voice conversion," in Proc. ICASSP, Apr. 2024, pp. 12456–12460.

[2] T. Okamoto, K. Tachibana, T. Toda, Y. Shiga and H. Kawai, "Subband WaveNet with overlapped single-sideband filterbanks," in Proc. ASRU, Dec. 2017, pp. 698–704.

[3] T. Okamoto, K. Tachibana, T. Toda, Y. Shiga and H. Kawai, "An investigation of subband WaveNet vocoder covering entire audible frequency range with limited acoustic features," in Proc. ICASSP, Apr. 2018, pp. 5654–5658.

[4] T. Okamoto, T. Toda, Y. Shiga and H. Kawai, "Improving FFTNet vocoder with noise shaping and subband approaches," in Proc. SLT, Dec. 2018, pp. 304–311.

[5] T. Okamoto, T. Toda and H. Kawai, "Multi-stream HiFi-GAN with data-driven waveform decomposition," in Proc. ASRU, Dec. 2021, pp. 610–617.

[6] T. Okamoto, H. Yamashita, Y. Ohtani, T. Toda and H. Kawai, "WaveNeXt: ConvNeXt-based fast neural vocoder without iSTFT layer," in Proc. ASRU, Dec. 2023.

[7] H. Yamashita, T. Okamoto, R. Takashima, Y. Ohtani, T. Takiguchi, T. Toda and H. Kawai, "Fast neural speech waveform generative models with fully-connected layer-based upsampling," IEEE Access, vol. 12, pp. 31409–31421, 2024.

END
