Sounds familiar: A speaker identity-controllable framework for machine speech translation
Researchers propose a deep learning-based model for mimicking and continuously modifying speaker voice identity during speech translation
2021-04-26
Ishikawa, Japan - Robots have come a long way from their origins as insentient machines meant primarily for mechanical assistance to humans. Today, they can assist us intellectually and even emotionally, and they are getting ever better at mimicking conscious humans. An integral part of this ability is the use of speech to communicate with the user (smart assistants such as Google Home and Amazon Echo are notable examples). Despite these remarkable developments, they still do not sound very "human".
This is where voice conversion (VC) comes in. VC is a technology that changes the speaker identity of an utterance from one speaker to another without altering its linguistic content. It can make human-machine communication sound more "natural" by changing non-linguistic information, for example by adding emotion to speech. "Besides linguistic information, non-linguistic information is also important for natural (human-to-human) communication. In this regard, VC can actually help people be more sociable since they can get more information from speech," explains Prof. Masato Akagi from Japan Advanced Institute of Science and Technology (JAIST), who works on speech perception and speech processing.
Speech, however, can occur in a multitude of languages (for example, on a language-learning platform), and we often need a machine to act as a speech-to-speech translator. In this case, a conventional VC model has several drawbacks, as Prof. Akagi and his doctoral student at JAIST, Tuan Vu Ho, discovered when they tried to apply their monolingual VC model to a "cross-lingual" VC (CLVC) task. For one, changing the speaker identity led to an undesirable modification of linguistic information. Moreover, their model did not account for cross-lingual differences in the "F0 contour", an important quality for speech perception, where F0 refers to the fundamental frequency at which the vocal cords vibrate in voiced sounds. It also did not guarantee the desired speaker identity for the output speech.
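To make the notion of F0 concrete, the following minimal Python sketch estimates the fundamental frequency of a synthetic voiced sound using autocorrelation, a standard textbook technique. It is an illustration only and is not the F0 modeling used in the study; the function name `estimate_f0`, the search range, and the test signal are all assumptions made for this example.

```python
import numpy as np

def estimate_f0(signal, sample_rate, fmin=50.0, fmax=500.0):
    """Estimate F0 (Hz) of a voiced frame via autocorrelation (illustrative only)."""
    signal = signal - signal.mean()
    # Keep only the non-negative lags of the autocorrelation.
    ac = np.correlate(signal, signal, mode="full")[len(signal) - 1:]
    # Restrict the search to lags matching a plausible pitch range (assumed values).
    lag_min = int(sample_rate / fmax)
    lag_max = int(sample_rate / fmin)
    best_lag = lag_min + int(np.argmax(ac[lag_min:lag_max + 1]))
    return sample_rate / best_lag

# Synthetic "voiced" frame: a 200 Hz fundamental plus one harmonic (hypothetical data).
sample_rate = 16000
t = np.arange(int(0.05 * sample_rate)) / sample_rate
voiced = np.sin(2 * np.pi * 200 * t) + 0.5 * np.sin(2 * np.pi * 400 * t)

f0 = estimate_f0(voiced, sample_rate)
print(f"Estimated F0: {f0:.1f} Hz")
```

Applying such an estimate frame by frame over an utterance would yield an F0 contour, that is, how pitch rises and falls over time.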
Now, in a new study published in IEEE Access, the researchers have proposed a new model suitable for CLVC that allows for both voice mimicking and control of speaker identity of the generated speech, marking a significant improvement over their previous VC model.
Specifically, the new model combines two components: language embedding (mapping natural language text, such as words and phrases, to mathematical representations), which separates language from speaker individuality, and F0 modeling, which provides control over the F0 contour. Additionally, it adopts a deep learning-based training model called a star generative adversarial network, or StarGAN, in addition to the variational autoencoder (VAE) model the researchers used previously. Roughly put, a VAE model takes in an input, converts it into a smaller and denser representation, and then reconstructs the original input from that representation, whereas a StarGAN uses two competing networks that push each other to generate improved iterations until the output samples are indistinguishable from natural ones.
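The data flow described above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the researchers' implementation: the weights are random and untrained, the adversarial (StarGAN) training is omitted entirely, and all names and sizes (`FEAT_DIM`, `LATENT_DIM`, `EMB_DIM`, `encode`, `decode`, `interpolate`, the speaker and language labels) are hypothetical. Blending two speaker embeddings is likewise only an assumed way to picture continuous control of speaker identity.

```python
import numpy as np

rng = np.random.default_rng(0)

FEAT_DIM = 80    # hypothetical size of one spectral frame
LATENT_DIM = 16  # the "smaller and denser" representation
EMB_DIM = 8      # hypothetical size of speaker and language embeddings

# Random, untrained weights: this only shows the shape of the computation.
W_mu = rng.normal(scale=0.1, size=(FEAT_DIM, LATENT_DIM))
W_logvar = rng.normal(scale=0.1, size=(FEAT_DIM, LATENT_DIM))
W_dec = rng.normal(scale=0.1, size=(LATENT_DIM + 2 * EMB_DIM, FEAT_DIM))

# Hypothetical embedding tables (in a trained model these would be learned).
speaker_emb = {"spk_a": rng.normal(size=EMB_DIM), "spk_b": rng.normal(size=EMB_DIM)}
language_emb = {"en": rng.normal(size=EMB_DIM), "ja": rng.normal(size=EMB_DIM)}

def encode(frame):
    """VAE-style encoder: map a frame to a sampled latent code."""
    mu = frame @ W_mu
    logvar = frame @ W_logvar
    eps = rng.normal(size=mu.shape)
    return mu + np.exp(0.5 * logvar) * eps  # reparameterization trick

def decode(z, spk, lang):
    """Decoder conditioned on speaker and language embeddings."""
    return np.concatenate([z, spk, lang]) @ W_dec

def interpolate(a, b, alpha):
    """Blend two embeddings; alpha=0 gives a, alpha=1 gives b."""
    return (1.0 - alpha) * a + alpha * b

frame = rng.normal(size=FEAT_DIM)  # stand-in for one frame of source speech
z = encode(frame)
mixed_speaker = interpolate(speaker_emb["spk_a"], speaker_emb["spk_b"], 0.5)
converted = decode(z, mixed_speaker, language_emb["en"])
print(z.shape, converted.shape)
```

Keeping the speaker and language embeddings as separate decoder inputs is what lets one be changed without touching the other in this sketch.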
The researchers showed that their model could be trained in an end-to-end fashion, with the language embedding optimized directly during training, and that it allowed good control of speaker identity. The F0 conditioning also helped remove the language dependence of speaker individuality, which enhanced this controllability.
The results are exciting, and Prof. Akagi envisions several future applications for their CLVC model. "Our findings have direct applications in protection of speaker's privacy by anonymizing one's identity, adding sense of urgency to speech during an emergency, post-surgery voice restoration, cloning of voices of historical figures, and reducing the production cost of audiobooks by creating different voice characters, to name a few," he comments, excitedly. He intends to further improve the controllability of speaker identity in future research.
Perhaps the day is not far when smart devices start sounding even more like humans!
INFORMATION:
Reference
Title of original paper: Cross-Lingual Voice Conversion With Controllable Speaker Individuality Using Variational Autoencoder and Star Generative Adversarial Network
Journal: IEEE Access
DOI: 10.1109/ACCESS.2021.3063519
About Japan Advanced Institute of Science and Technology, Japan
Founded in 1990 in Ishikawa prefecture, the Japan Advanced Institute of Science and Technology (JAIST) was the first independent national graduate school in Japan. Now, after 30 years of steady progress, JAIST has become one of Japan's top-ranking universities. JAIST has multiple satellite campuses and strives to foster capable leaders with a state-of-the-art education system where diversity is key; about 40% of its alumni are international students. The university has a unique style of graduate education based on a carefully designed coursework-oriented curriculum to ensure that its students have a solid foundation on which to carry out cutting-edge research. JAIST also works closely with both local and overseas communities by promoting industry-academia collaborative research.
About Professor Masato Akagi from Japan Advanced Institute of Science and Technology, Japan
Masato Akagi is a professor at the School of Information Science at Japan Advanced Institute of Science and Technology (JAIST). He received his PhD degree from the Tokyo Institute of Technology, Japan, in 1984 and joined JAIST in 1992. His research interests include speech perception and its modeling in humans, and the signal processing of speech. A senior and reputed professor, he has published 456 papers with over 2500 citations to his credit. For more information about his research, visit: https://www.jaist.ac.jp/english/areas/hld/laboratory/akagi.html#page
Funding information
The study was funded by National Institute of Informatics-Center for Robust Intelligence and Social Technology (NII-CRIS), Grant-in-Aid for Scientific Research, and the Japan Society for the Promotion of Science (JSPS)-NSFC Bilateral Joint Research Projects/Seminars.