Techniques to improve the robustness of automatic speech recognition systems to noise and channel mismatches
Robustness of ASR Technology to Background Noise
You have probably seen that most people using speech dictation software wear a close-talking microphone. So why has senior researcher Li Deng been trying to get rid of close-talking microphones? Close-talking microphones pick up relatively little background noise, and speech recognition systems can obtain decent accuracy with them. If you are going to dictate a long memo, you might not mind putting the microphone on and adjusting it. For handheld devices, where input is hard due to the lack of a keyboard and the small screen size, speech recognition could be a lifesaver. But if it requires you to put on a close-talking microphone every time you want to use it, you may not bother. In fact, researcher Jasha Droppo observed that users get the microphone wires tangled up on their desks and find the wire a big inconvenience. Wouldn’t it be nice if users could obtain the same accuracy without having to wear a close-talking microphone?
Speech recognition systems work reasonably well in quiet conditions but poorly under noisy conditions or distorted channels. For example, the accuracy of a speech recognition system may be acceptable if you call from the phone in your quiet office, yet its performance can be unacceptable if you try to use your cellular phone in a shopping mall. The researchers in the speech group are working on algorithms to improve the robustness of speech recognition systems to high noise levels and to channel conditions not present in the training data used to build the recognizer.
Despite the difficulty of the problem, the group has made a lot of progress on algorithms that clean up the signal so that the recognizer’s accuracy does not degrade very much. Instead of cleaning the speech waveform itself, these algorithms typically operate on samples of the speech spectrum taken every 10 milliseconds or so, or on the cepstrum (computed from the spectrum). Many of the group’s latest algorithms produce an estimate of the cepstrum of undistorted speech given the observed cepstrum of distorted speech, avoiding the need to recreate the waveform. SPLICE, Stereo-based Piecewise Linear Compensation for Environments, is one such algorithm.

We evaluated SPLICE on the Aurora2 task, which consists of digit sequences from the TIDigits database that have been digitally corrupted by passing them through a linear filter and/or by adding different types of realistic noise at signal-to-noise ratios (SNRs) ranging from 20 dB down to -5 dB. When the recognizer was trained on clean speech, the use of SPLICE resulted in a 67.4% average decrease in word error rate over all test sets. Accuracy is higher with multi-condition training, in which the recognizer is retrained on speech containing all the different noise types and SNRs; using SPLICE to preprocess the noisy speech in training decreases the error rate by a further 27.9% on average relative to the multi-condition baseline. SPLICE is actually quite simple, yet it performs well because it learns the complex relationship between the cepstrum vectors of clean and noisy speech from stereo recordings. You can find more about SPLICE in our publications. Under the protocol specified for the Aurora2 evaluation conducted at the 2001 Eurospeech Conference, SPLICE offered the highest accuracy under clean training conditions.

Senior researcher Alex Acero has been working on technology to solve the noise robustness problem for the last 15 years. “Error rates under very noisy conditions can still go up by almost an order of magnitude even after using SPLICE.
We’re definitely not done yet,” says Acero.
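To make the idea concrete, here is a minimal sketch of stereo-based compensation in the spirit of SPLICE. It partitions the noisy cepstral space into regions and learns one bias correction per region from stereo (clean, noisy) frame pairs. This toy version uses k-means regions and hard nearest-codeword assignment, whereas the published SPLICE algorithm uses Gaussian mixture posteriors for soft weighting; the function names and parameters are invented for illustration.

```python
import numpy as np

def train_splice(clean, noisy, n_codewords=4, n_iters=10, seed=0):
    """Learn per-region bias corrections from stereo cepstral pairs.

    clean, noisy: (n_frames, n_dims) arrays recorded in parallel.
    Returns (codebook, corrections), one row per region.
    """
    rng = np.random.default_rng(seed)
    # Initialize codewords from random noisy frames, then run a few
    # Lloyd (k-means) iterations to partition the noisy cepstral space.
    codebook = noisy[rng.choice(len(noisy), n_codewords, replace=False)]
    for _ in range(n_iters):
        dist = np.linalg.norm(noisy[:, None, :] - codebook[None, :, :], axis=2)
        assign = dist.argmin(axis=1)
        for k in range(n_codewords):
            if np.any(assign == k):
                codebook[k] = noisy[assign == k].mean(axis=0)
    # In each region, the correction is the mean clean-minus-noisy offset.
    corrections = np.zeros_like(codebook)
    for k in range(n_codewords):
        if np.any(assign == k):
            corrections[k] = (clean[assign == k] - noisy[assign == k]).mean(axis=0)
    return codebook, corrections

def apply_splice(noisy, codebook, corrections):
    """Enhance noisy cepstra by adding the nearest region's correction."""
    dist = np.linalg.norm(noisy[:, None, :] - codebook[None, :, :], axis=2)
    return noisy + corrections[dist.argmin(axis=1)]
```

On synthetic stereo data where the distortion is a region-dependent bias, `apply_splice` moves the noisy vectors measurably closer to the clean ones; the piecewise structure is what lets the method approximate a complex, nonlinear clean-given-noisy relationship with simple local corrections.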
In addition to SPLICE, researchers in the speech group are working on online estimation of noise and channel parameters, and on Bayesian inference approaches that estimate the clean speech without the use of stereo training data.
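As a rough illustration of what online noise estimation involves, the sketch below tracks a running estimate of the noise power spectrum, updating it by exponential averaging only on frames whose energy looks noise-like. This is a generic stand-in with an invented name and a crude energy-threshold speech/pause decision, not the group's actual estimator, which is considerably more sophisticated.

```python
import numpy as np

def track_noise(power_spectra, alpha=0.9, threshold=2.0):
    """Online running estimate of the noise power spectrum.

    power_spectra: (n_frames, n_bins) magnitude-squared spectra.
    A frame is treated as noise-only when its total energy is below
    `threshold` times the current noise-energy estimate; only such
    frames update the estimate, with exponential-averaging weight
    `alpha`. Returns the estimate after each frame.
    """
    noise = power_spectra[0].astype(float).copy()  # assume frame 0 is noise
    estimates = [noise.copy()]
    for frame in power_spectra[1:]:
        if frame.sum() < threshold * noise.sum():  # crude speech/pause decision
            noise = alpha * noise + (1.0 - alpha) * frame
        estimates.append(noise.copy())
    return np.array(estimates)
```

Because the estimate adapts frame by frame, it can follow slowly changing noise conditions during use, which is exactly what fixed, offline-trained compensation cannot do.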