PRESS-NEWS.org - Press Release Distribution

Automatic speaker tracking in audio recordings

A new system dispenses with the human annotation of training data required by its predecessors but achieves comparable results

2013-10-18
(Press-News.org) Contact information: Andrew Carleen
acarleen@mit.edu
617-253-1682
Massachusetts Institute of Technology
CAMBRIDGE, Mass. -- A central topic in spoken-language-systems research is what's called speaker diarization, or computationally determining how many speakers feature in a recording and which of them speaks when. Speaker diarization would be an essential function of any program that automatically annotated audio or video recordings.

To date, the best diarization systems have used what's called supervised machine learning: They're trained on sample recordings that a human has indexed, indicating which speaker enters when. In the October issue of IEEE Transactions on Audio, Speech, and Language Processing, however, MIT researchers describe a new speaker-diarization system that achieves comparable results without supervision: No prior indexing is necessary.

Moreover, one of the MIT researchers' innovations was a new, compact way to represent the differences between individual speakers' voices, which could be of use in other spoken-language computational tasks.

"You can know something about the identity of a person from the sound of their voice, so this technology is keying in to that type of information," says Jim Glass, a senior research scientist at MIT's Computer Science and Artificial Intelligence Laboratory (CSAIL) and head of its Spoken Language Systems Group. "In fact, this technology could work in any language. It's insensitive to that."

To create a sonic portrait of a single speaker, Glass explains, a computer system will generally have to analyze more than 2,000 different acoustic features; many of those may correspond to familiar consonants and vowels, but many may not. To characterize each of those features, the system might need about 60 variables, which describe properties such as the strength of the acoustic signal in different frequency bands.
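
As a rough illustration of the scale involved, the short Python sketch below multiplies the two approximate figures Glass cites. The constant and function names are illustrative assumptions, not part of the researchers' system.

```python
# Approximate figures quoted in the article.
ACOUSTIC_FEATURES = 2_000       # "more than 2,000 different acoustic features"
VARIABLES_PER_FEATURE = 60      # "about 60 variables" per feature


def search_space_dimensions(features: int, variables_per_feature: int) -> int:
    """Every feature needs its own set of variables, so the counts multiply."""
    return features * variables_per_feature


dims = search_space_dimensions(ACOUSTIC_FEATURES, VARIABLES_PER_FEATURE)
print(dims)  # 120000
```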

E pluribus tres

With more than 2,000 features and about 60 variables for each, a diarization system would have to search a space with 120,000 dimensions for every second of a recording, which would be prohibitively time-consuming. In prior work, Najim Dehak, a research scientist in the Spoken Language Systems Group and one of the new paper's co-authors, had demonstrated a technique for reducing the number of variables required to describe the acoustic signature of a particular speaker. The resulting compact representation is dubbed the i-vector.

To get a sense of how the technique works, imagine a graph that plotted, say, hours worked by an hourly worker against money earned. The graph would be a diagonal line in a two-dimensional space. Now imagine rotating the axes of the graph so that the x-axis is parallel to the line. All of a sudden, the y-axis becomes irrelevant: All the variation in the graph is captured by the x-axis alone.
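
The rotation in this analogy can be checked numerically. The sketch below assumes a hypothetical wage of $15 an hour and a handful of made-up shift lengths (the article gives no figures). It rotates the axes so that the new x-axis lies along the wage line, then measures how far any point strays from that axis.

```python
import math

HOURLY_WAGE = 15.0  # hypothetical wage, not taken from the article

# Points on the "hours worked vs. money earned" graph.
hours = [0, 5, 10, 20, 40]
points = [(h, HOURLY_WAGE * h) for h in hours]

# Angle between the wage line and the original x-axis.
theta = math.atan2(HOURLY_WAGE, 1.0)


def rotate_axes(point, angle):
    """Coordinates of `point` in axes rotated counterclockwise by `angle`."""
    x, y = point
    new_x = x * math.cos(angle) + y * math.sin(angle)
    new_y = -x * math.sin(angle) + y * math.cos(angle)
    return new_x, new_y


rotated = [rotate_axes(p, theta) for p in points]

# After the rotation, the second coordinate carries no information.
max_offset = max(abs(new_y) for _, new_y in rotated)
print(max_offset < 1e-9)  # True
```

Each rotated point keeps its distance from the origin as its new x-coordinate, so one number per point now describes the whole graph.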

Similarly, i-vectors find new axes for describing the information that characterizes speech sounds in the 120,000-dimension space. The technique first finds the axis that captures most of the variation in the information, then the axis that captures the next-most variation, and so on. So the information added by each new axis steadily decreases.
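
The idea of ranking axes by how much variation they capture can be sketched with principal component analysis, computed here by power iteration with deflation. This is a generic stand-in chosen for illustration, not the researchers' i-vector extractor, whose details the article does not give. The three-dimensional synthetic data and all function names are assumptions.

```python
import random


def covariance(rows):
    """Sample covariance matrix of a list of equal-length rows."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    return [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
            for i in range(d)]


def dominant_axis(matrix, iterations=500):
    """Power iteration: the unit axis along which `matrix` varies most."""
    d = len(matrix)
    start = [1.0 / (i + 1) for i in range(d)]  # arbitrary non-zero start
    scale = sum(x * x for x in start) ** 0.5
    axis = [x / scale for x in start]
    for _ in range(iterations):
        product = [sum(matrix[i][j] * axis[j] for j in range(d))
                   for i in range(d)]
        norm = sum(x * x for x in product) ** 0.5
        if norm == 0.0:
            break
        axis = [x / norm for x in product]
    variance = sum(axis[i] * matrix[i][j] * axis[j]
                   for i in range(d) for j in range(d))
    return variance, axis


def ranked_axes(rows, count):
    """Return `count` (variance, axis) pairs, largest variance first."""
    matrix = covariance(rows)
    d = len(matrix)
    ranked = []
    for _ in range(count):
        variance, axis = dominant_axis(matrix)
        ranked.append((variance, axis))
        # Deflation: subtract the variation this axis already explains.
        matrix = [[matrix[i][j] - variance * axis[i] * axis[j]
                   for j in range(d)] for i in range(d)]
    return ranked


# Synthetic data: three measurements driven mostly by one hidden factor `a`.
rng = random.Random(0)
rows = []
for _ in range(500):
    a, b, c = rng.gauss(0, 3.0), rng.gauss(0, 1.0), rng.gauss(0, 0.2)
    rows.append([a + b, a - b, c + 0.5 * a])

ranked = ranked_axes(rows, 3)
variances = [variance for variance, _ in ranked]
total = sum(variances)
print([round(v / total, 3) for v in variances])  # share captured by each axis
```

The printed shares shrink from one axis to the next, which is why the later axes can be dropped with little loss.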

Stephen Shum, a graduate student in MIT's Department of Electrical Engineering and Computer Science and lead author on the new paper, found that a 100-variable i-vector -- a 100-dimension approximation of the 120,000-dimension space -- was an adequate starting point for a diarization system. Since i-vectors are intended to describe every possible combination of sounds that a speaker might emit over any span of time, and since a diarization system needs to classify only the sounds on a single recording, Shum was able to use similar techniques to reduce the number of variables even further, to only three.

Birds of a feather

For every second of sound in a recording, Shum thus ends up with a single point in a three-dimensional space. The next step is to identify the bounds of the clusters of points that correspond to the individual speakers. For that, Shum used an iterative process. The system begins with an artificially high estimate of the number of speakers -- say, 15 -- and finds a cluster of points that corresponds to each one.

Clusters that are very close to each other then coalesce to form new clusters, until the distances between them grow too large to be plausibly bridged. The process then repeats, beginning each time with the same number of clusters that it ended with on the previous iteration. Finally, it reaches a point at which it begins and ends with the same number of clusters, and the system associates each cluster with a single speaker.
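
The loop described above can be sketched as follows. Plain k-means and a fixed merge distance (`max_gap`) stand in for the researchers' actual clustering and stopping criteria, which the article does not specify. The synthetic three-speaker data, the evenly spaced starting centers, and every name in the code are assumptions made for illustration.

```python
import math
import random


def mean(points):
    """Coordinate-wise average of a list of points."""
    return tuple(sum(axis) / len(points) for axis in zip(*points))


def fit_clusters(points, centers, rounds=20):
    """Plain k-means refinement; clusters left with no points are dropped."""
    for _ in range(rounds):
        groups = [[] for _ in centers]
        for p in points:
            closest = min(range(len(centers)),
                          key=lambda i: math.dist(p, centers[i]))
            groups[closest].append(p)
        centers = [mean(g) for g in groups if g]
    return centers


def merge_close(centers, max_gap):
    """Merge the closest pair of centers until every gap exceeds max_gap."""
    centers = list(centers)
    while len(centers) > 1:
        gap, i, j = min((math.dist(centers[i], centers[j]), i, j)
                        for i in range(len(centers))
                        for j in range(i + 1, len(centers)))
        if gap > max_gap:
            break
        merged = mean([centers[i], centers[j]])
        centers = [c for k, c in enumerate(centers) if k not in (i, j)]
        centers.append(merged)
    return centers


def estimate_speakers(points, initial_guess=15, max_gap=5.0):
    """Repeat cluster-then-merge until a pass starts and ends with the same count."""
    step = max(1, len(points) // initial_guess)
    centers = points[::step][:initial_guess]  # deliberately too many clusters
    while True:
        started_with = len(centers)
        centers = merge_close(fit_clusters(points, centers), max_gap)
        if len(centers) == started_with:
            return centers


# Synthetic "recording": 60 one-second points for each of three speakers.
rng = random.Random(0)
TRUE_SPEAKERS = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0), (0.0, 10.0, 0.0)]
points = [tuple(rng.gauss(coord, 0.5) for coord in speaker)
          for speaker in TRUE_SPEAKERS for _ in range(60)]

centers = estimate_speakers(points)
print(len(centers))  # 3
```

On real recordings the clusters overlap far more than in this toy data, so the choice of merge criterion matters much more than it does here.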

"What was completely not obvious, what was surprising, was that this i-vector representation could be used on this very, very different scale, that you could use this method of extracting features on very, very short speech segments, perhaps one second long, corresponding to a speaker turn in a telephone conversation," Kenny adds. "I think that was the significant contribution of Stephen's work."

###

Written by Larry Hardesty, MIT News Office

