Recognizing a Million Voices: Low Dimensional Audio Representations for Speaker Identification
Recent advances in speaker verification technology have resulted in dramatic performance improvements in both speed and accuracy. Over the past few years, error rates have decreased by a factor of 5 or more. At the same time, the new techniques have resulted in massive speed-ups, which have increased the scale of viable speaker-ID systems by several orders of magnitude. These improvements stem from a recent shift in the speaker modeling paradigm. Only a few years ago, the model for each individual speaker was trained using data from only that particular speaker. Now, we make use of large speaker-labeled databases to learn distributions describing inter- and intra-speaker variability. This allows us to reveal the speech characteristics that are important for discriminating between speakers.
During the 2008 JHU summer workshop, our team found that speech utterances can be encoded into low-dimensional fixed-length vectors that preserve information about speaker identity. This concept of so-called “i-vectors”, which now forms the basis of state-of-the-art systems, enabled new machine learning approaches to be applied to the speaker identification problem. Inter- and intra-speaker variability can now be easily modeled using Bayesian approaches, which leads to superior performance. New training strategies can now benefit from the simpler statistical model form and the inherent speed-up. In our most recent work, we have retrained the hyperparameters of our Bayesian model using a discriminative objective function that directly addresses the task of speaker verification: discrimination between same-speaker and different-speaker trials. This is the first time such discriminative training has been successfully applied to the speaker verification task.
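To make the idea concrete, here is a minimal sketch of how same-speaker vs. different-speaker trials can be scored once utterances are reduced to fixed-length i-vectors. It uses a two-covariance model (a simplified form of the Bayesian PLDA-style scoring described above): a between-speaker covariance B captures inter-speaker variability and a within-speaker covariance W captures intra-speaker variability. The function name, toy dimensions, and covariance values are illustrative, not taken from the talk.

```python
import numpy as np
from scipy.stats import multivariate_normal

def plda_llr(w1, w2, B, W):
    """Log-likelihood ratio: same-speaker vs. different-speaker hypotheses
    for a pair of i-vectors under a two-covariance model (illustrative).

    B: between-speaker covariance (inter-speaker variability)
    W: within-speaker covariance (intra-speaker variability)
    """
    d = len(w1)
    T = B + W                                   # total covariance of one i-vector
    stacked = np.concatenate([w1, w2])
    # Same-speaker hypothesis: a shared speaker factor correlates the pair,
    # so the cross-covariance block equals B.
    sigma_same = np.block([[T, B], [B, T]])
    # Different-speaker hypothesis: the two i-vectors are independent.
    zeros = np.zeros((d, d))
    sigma_diff = np.block([[T, zeros], [zeros, T]])
    mean = np.zeros(2 * d)
    return (multivariate_normal.logpdf(stacked, mean=mean, cov=sigma_same)
            - multivariate_normal.logpdf(stacked, mean=mean, cov=sigma_diff))

# Toy example: strong between-speaker variability relative to within-speaker.
B = 4 * np.eye(2)
W = np.eye(2)
w = np.array([1.0, -0.5])
print(plda_llr(w, w, B, W))    # identical i-vectors: high same-speaker score
print(plda_llr(w, -w, B, W))   # opposed i-vectors: lower score
```

A positive score favors the same-speaker hypothesis; in a real system B and W would be estimated from a large speaker-labeled database, and the discriminative retraining mentioned above would adjust these hyperparameters to directly optimize trial discrimination.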
Lukas Burget http://www.fit.vutbr.cz/~burget (Ing. [MS], Brno University of Technology, 1999; Ph.D., Brno University of Technology, 2004) is an assistant professor at the Faculty of Information Technology, Brno University of Technology, Czech Republic. He serves as scientific director of the Speech@FIT research group and supervises several PhD students. From 2000 to 2002, he was a visiting researcher at OGI in Portland, USA, under the supervision of Prof. Hynek Hermansky. Lukas was invited to lead the “Robust Speaker Recognition over Varying Channels” team at the Johns Hopkins University CLSP summer workshop in 2008, and will lead a team at the BOSARIS workshop in 2010.
Dr. Burget participated in the EU-sponsored projects “Multimodal meeting manager” (M4, 5th FP), “Augmented MultiParty Interaction” (AMI, 6th FP), and “Augmented MultiParty Interaction with Distant Access” (AMIDA, 6th FP), as well as in several projects sponsored at the local Czech level. Currently, he is participating in the EU project “Mobile biometry” (MOBIO, 7th FP). He is principal investigator of the US Air Force EOARD-sponsored project “Improving the capacity of language recognition systems to handle rare languages using radio broadcast data”.
His scientific interests are in the field of speech processing, namely acoustic modeling for speech, speaker, and language recognition, including their software implementations. He has authored or co-authored more than 40 papers in journals and conferences. Lukas led teams that were successful in the NIST LRE 2005 and 2007 and NIST SRE 2006 and 2008 evaluations. He contributed significantly to the team developing the AMI LVCSR systems, which were successful in the NIST RT 2005, 2006, and 2007 evaluations. He has served as a reviewer for numerous speech-oriented journals and conferences.
Dr. Burget is a member of IEEE and ISCA.
- Lukas Burget
- Brno University of Technology