A Hybrid Approach to Combining Conventional and Deep Learning Techniques for Single-channel Speech Enhancement and Recognition

IEEE Int. Conf. Acoustics, Speech and Signal Processing (ICASSP)


Conventional speech-enhancement techniques employ statistical signal-processing algorithms. They are computationally efficient and improve speech quality even under unknown noise conditions, which makes them preferred for deployment in unpredictable environments. Their main limitation is that they fail to suppress non-stationary noise, which hinders their broad usage. Emerging algorithms based on deep learning promise to overcome this limitation of conventional methods, but they under-perform when presented with noise conditions that were not captured in the training data. In this paper, we propose a single-channel speech-enhancement technique that combines the benefits of both worlds to achieve the best listening quality and recognition accuracy under noise conditions that are both unknown and non-stationary. Our method uses a conventional speech-enhancement algorithm to produce an intermediate representation of the input data by multiplying the noisy input spectrogram features with gain vectors (known as the suppression rule). This intermediate representation is then processed by a recurrent neural network based on long short-term memory (LSTM) units, which is trained to jointly learn two targets: a direct estimate of the clean-speech features and a noise-reduction mask. With this LSTM multistyle training (LSTM-MT) architecture, we demonstrate a PESQ improvement of 0.76 (from 2.65 to 3.41) and a relative word-error-rate reduction of 47.7% (from 15.9% to 8.3%).
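The pipeline described in the abstract can be sketched in a few lines. The snippet below is a minimal, illustrative PyTorch sketch of the idea, not the authors' implementation: a conventional suppression rule produces gain vectors that scale the noisy spectrogram features, and an LSTM with two output heads is trained jointly on a clean-feature target and a noise-reduction mask. The gain rule, layer sizes, feature dimensions, and loss weighting shown here are assumptions made purely for illustration.

```python
# Minimal sketch of the LSTM multistyle-training (LSTM-MT) idea described above.
# All names, layer sizes, and the gain rule are illustrative assumptions, not the
# authors' exact configuration.
import torch
import torch.nn as nn

class LSTMMT(nn.Module):
    """LSTM with two output heads: clean-feature estimate and noise-reduction mask."""
    def __init__(self, n_freq=257, hidden=512, layers=2):
        super().__init__()
        self.lstm = nn.LSTM(n_freq, hidden, num_layers=layers, batch_first=True)
        self.clean_head = nn.Linear(hidden, n_freq)  # direct estimate of clean features
        self.mask_head = nn.Sequential(nn.Linear(hidden, n_freq), nn.Sigmoid())  # mask in [0, 1]

    def forward(self, x):
        h, _ = self.lstm(x)
        return self.clean_head(h), self.mask_head(h)

def conventional_gain(noisy_mag, floor=1e-3):
    # Stand-in for a conventional suppression rule; here a crude gain is derived
    # from a global noise estimate purely for illustration.
    noise_est = noisy_mag.mean(dim=1, keepdim=True).clamp(min=floor)
    return (1.0 - noise_est / noisy_mag.clamp(min=floor)).clamp(0.0, 1.0)

# Joint training step on a dummy batch of shape (batch, frames, frequency bins).
noisy = torch.rand(4, 100, 257) + 1e-3        # noisy magnitude spectrogram features
clean = torch.rand(4, 100, 257) + 1e-3        # clean reference features
ideal_mask = (clean / noisy).clamp(0.0, 1.0)  # stand-in for the mask target

gain = conventional_gain(noisy)
intermediate = noisy * gain                   # conventional enhancement as preprocessing

model = LSTMMT()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
clean_hat, mask_hat = model(intermediate)
loss = nn.functional.mse_loss(clean_hat, clean) + nn.functional.mse_loss(mask_hat, ideal_mask)
loss.backward()
opt.step()
```

The key design point this sketch illustrates is the division of labor: the conventional gain handles unknown but stationary noise cheaply, while the LSTM, trained on the two joint targets, learns to suppress the residual non-stationary noise in the pre-enhanced features.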