Decades of research in audio-signal processing have led to performance saturation. However, recent advances in AI and machine learning provide a new opportunity to advance the state of the art. In this project, one of the first problems we focus on is enhancing speech signals as they are captured by microphones. Speech enhancement is a precursor to applications such as VoIP, teleconferencing systems, automatic speech recognition (ASR) and hearing aids. Its importance has grown further with the emergence of mobile and wearable devices, which present challenging capture and processing conditions owing to their limited processing capabilities and voice-first I/O interfaces. The goal of speech enhancement is to take the audio signal from a microphone, clean it, and forward the clean audio to multiple clients such as speech-recognition software, archival databases and loudspeakers. The cleaning process is what we focus on in this project. It has traditionally been done with statistical signal processing; however, these techniques make several imprecise assumptions. We explore data-driven ways of completing this task efficiently, adaptively and accurately.
The performance of a speech-enhancement algorithm is measured by the quality and intelligibility of the cleaned signal, both to listeners, captured by standard metrics such as PESQ, and to speech-recognition software, captured by metrics such as word error rate (WER). Classical signal-processing approaches to speech enhancement rely on the assumptions of quasi-stationary noise and Gaussian distributions for the spectral amplitudes of noise and speech signals. Under these simplifying assumptions, even the most precise (and complex) mathematical models are limited in their performance on real-world data. Further relaxations that assume statistical independence across consecutive audio frames and across frequency bins within a frame help the processing algorithms converge, but again miss opportunities for improving the quality of noisy data in realistic settings. In this project, we develop supervised and semi-supervised neural networks that process audio sample vectors to build parametric non-linear regression models, making no assumptions about the signal or noise statistics. To adapt the speech-enhancement process to varying input conditions, we explore reinforcement-learning techniques. To make our algorithms run faster, we also consider efficient neural-network implementations through techniques such as bit-precision scaling and factorization.
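To make one of these metrics concrete: word error rate is the word-level edit distance between a reference transcript and the recognizer's hypothesis, normalized by the reference length. A minimal sketch (the transcripts below are made-up examples, not from our test sets):

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference length,
    computed via word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution / match
    return dp[len(ref)][len(hyp)] / len(ref)

# one insertion against a three-word reference -> WER = 1/3
print(word_error_rate("the cat sat", "the cat sat on"))
```

Lower WER after enhancement indicates that the cleaned signal is more useful to downstream recognizers.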
With recent advances in machine learning (ML) and artificial intelligence (AI), the signal-processing research community has started to look for new data-driven approaches to speech enhancement. Despite some research activity in this area, there are several gaps that preclude practical applicability:
- First, uncertainty about algorithm applicability prevails among methods based on representation learning, generative models and feed-forward inference. Each claims its own robustness benefits, yet existing results contradict one another across datasets. It is therefore important to evaluate these methods under uniform settings and on benchmarks we actually care about.
- Second, data-driven methods for speech enhancement have been oblivious to efficiency constraints on processing latency, memory footprint, CPU utilization and energy consumption. Existing methods incur a latency of 10-15 audio frames and require prohibitively large amounts of memory. A careful re-design, optimization and mapping of these algorithms is therefore essential to make them practical.
- Third, eliminating certain kinds of noise, especially non-stationary noise, is not a strong suit of data-driven methods. Existing approaches are good at removing both seen and unseen stationary noise, but they suffer heavily when the characteristics of the noise change over time. It is therefore important to develop mechanisms that enable data-driven methods to adapt to a range of varying noise conditions.
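To make the latency constraint in the second gap concrete: an algorithm that must observe future frames before emitting output incurs an algorithmic latency of the look-ahead times the frame hop. The 16 ms hop below is a hypothetical example, not a measured figure from our systems:

```python
def lookahead_latency_ms(lookahead_frames, hop_ms):
    """Algorithmic latency contributed by waiting for future audio frames."""
    return lookahead_frames * hop_ms

# e.g., a 10-frame look-ahead at a 16 ms hop adds 160 ms of latency,
# which is already a large share of an interactive VoIP latency budget
print(lookahead_latency_ms(10, 16))
```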
We believe that the long-standing success of classic, statistical speech-enhancement algorithms is due to their ability to deliver three important properties: (1) high accuracy, (2) adaptivity and (3) efficiency. In this project, we develop data-driven methods that combine these same traits while overcoming the limitations of existing algorithms listed above.
Achieving High Accuracy
We have explored different architectures including fully-connected DNNs, convolutional-recurrent networks, semi-supervised hierarchical denoising autoencoders and RNNs. Each has its own strengths, improving one or another of the performance metrics we care about. Unfortunately, we have not yet found an algorithm that does well across the board. We believe several approaches could change this, including utilizing side-channel information to improve the neural-network models, augmenting the training data with more targeted examples, and incorporating additional inputs such as multi-channel data.
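To illustrate the regression formulation these architectures share, the sketch below trains a single-layer sigmoid network to predict a time-frequency mask from noisy magnitude spectra. Everything here is a toy assumption for illustration: the synthetic data, the sizes, and the one-layer model, which is far smaller than the networks discussed above:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic toy data: "noisy" magnitude spectra are clean + additive noise.
frames, bins = 200, 8                       # hypothetical toy sizes
clean = rng.random((frames, bins))
noise = 0.5 * rng.random((frames, bins))
noisy = clean + noise
target_mask = clean / (noisy + 1e-8)        # ideal ratio mask, in (0, 1)

# Single-layer mask estimator: mask = sigmoid(noisy @ W + b).
W = rng.normal(0.0, 0.1, (bins, bins))
b = np.zeros(bins)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

lr, losses = 0.1, []
for _ in range(300):
    m = sigmoid(noisy @ W + b)
    err = m - target_mask
    losses.append(float((err ** 2).mean()))
    grad_z = err * m * (1.0 - m) / frames   # backprop through MSE + sigmoid
    W -= lr * noisy.T @ grad_z
    b -= lr * grad_z.sum(axis=0)

print(losses[0], losses[-1])                # MSE should drop over training
```

The estimated mask is applied bin-wise to the noisy spectrum to recover an enhanced magnitude; the deeper architectures above differ in how they compute the mask, not in this overall formulation.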
We are continuing to advance the state of the art in this space and are still in the early stages of research; check back soon for our latest results.
Adapting to Varying Input Conditions

Our policy-gradient-based methods have helped tune the performance of speech-enhancement algorithms so that they adapt to varying input-signal characteristics. As it turns out, although many neural-network architectures work well for this task, simple LSTM-based networks provide the highest benefit.
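A minimal sketch of the policy-gradient idea, reduced to its simplest form: a softmax policy (a plain bandit here, rather than the LSTM-based networks mentioned above) learns via REINFORCE to pick the enhancement configuration that yields the highest reward. The three "presets" and their mean rewards are invented for illustration:

```python
import math
import random

random.seed(0)

# Hypothetical mean rewards (e.g., PESQ gains) for three enhancement presets.
mean_reward = [0.2, 0.5, 0.9]
theta = [0.0, 0.0, 0.0]                     # policy logits, one per preset

def softmax(logits):
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [v / total for v in exps]

lr, baseline = 0.1, 0.0
for _ in range(2000):
    probs = softmax(theta)
    action = random.choices(range(3), weights=probs)[0]
    reward = mean_reward[action] + random.gauss(0.0, 0.05)  # noisy feedback
    baseline += 0.01 * (reward - baseline)  # running baseline reduces variance
    for i in range(3):
        indicator = 1.0 if i == action else 0.0
        # REINFORCE update: grad of log pi(action) w.r.t. theta[i]
        theta[i] += lr * (reward - baseline) * (indicator - probs[i])

print(softmax(theta))   # probability mass should concentrate on the best preset
```

In our actual setting the "reward" is a quality metric computed on the enhanced output, and the policy conditions on the input signal rather than being a fixed preference.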
We have designed our system with recurrent components that track the algorithm state and adapt it to changing conditions. One downside is the high computational complexity of the adaptation process, which we will need to reduce to make reinforcement-learning-based methods applicable to practical scenarios. This is an area we believe needs more work in the future.
Improving Efficiency

Our initial thoughts on speeding up neural networks were to utilize factorization and sparse-processing techniques. However, given our past experience, we decided to start with bit-precision scaling, and we have found it extremely useful for lowering the computational complexity of many audio-processing problems. Detailed experiments have revealed that aggressive scaling of bit precision hurts fine-grained estimation tasks like the non-linear vector regression we require for speech enhancement. Thus, we have applied this technique to voice activity detection, which can be formulated as a bin-wise classification problem on spectral components.
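A minimal sketch of the bit-precision-scaling idea: symmetric uniform quantization of a weight tensor to k-bit integers, so that inference can run in low-precision integer arithmetic. The weight matrix below is made up for illustration, and our actual pipeline (and the VAD network it feeds) is more involved:

```python
import numpy as np

def quantize(weights, bits):
    """Symmetric uniform quantization: map floats to signed `bits`-bit codes."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(weights).max() / qmax    # one scale for the whole tensor
    q = np.round(weights / scale).astype(np.int32)
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the integer codes."""
    return q.astype(np.float64) * scale

rng = np.random.default_rng(1)
w = rng.normal(0.0, 1.0, (16, 16))          # made-up weight matrix
q8, s8 = quantize(w, 8)
error = np.abs(dequantize(q8, s8) - w).max()
print(error, s8 / 2)    # rounding error is bounded by half a quantization step
```

The trade-off discussed above is visible here: shrinking `bits` grows the step size `scale`, and with it the worst-case error, which is what hurts fine-grained regression more than coarse classification.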
Although we have taken the first steps in applying ML algorithms to improve audio quality, we have found only limited success in simultaneously optimizing all metrics of interest, including WER, PESQ and MSE. We will continue to investigate advanced neural-network architectures that will allow us to achieve this goal. Dynamic adaptation of the enhancement process has provided some gains, but they have come at the cost of increased computational complexity, an issue we are also trying to address. Finally, although efficient bit-precision scaling of fully-connected networks has benefited speech enhancement, the best algorithms in our experience use convolutional-recurrent architectures. We are continuing to investigate ways of speeding up these models so that they can run efficiently on resource-constrained devices like low-power DSPs and modest FPGAs. We have not yet published most of our work in this project, so look out for more updates soon.