Speech quality, as perceived by humans, is an important performance metric for telephony and voice services. It is typically measured through subjective listening tests, which are tedious and expensive. Algorithms such as PESQ and POLQA serve as computational proxies for subjective listening tests. Here we propose using a convolutional neural network to predict the perceived quality of speech degraded by noise, reverberation, and distortion, both intrusively and non-intrusively, i.e., with and without a clean reference signal. The model is trained and evaluated on a corpus of about ten thousand utterances labeled by human listeners, yielding a Mean Opinion Score (MOS) for each utterance. It is shown to provide more accurate MOS estimates than existing speech quality metrics, including PESQ and POLQA.
The proposed method reduces the root mean squared error from 0.48 to 0.35 MOS points and increases the Pearson correlation from 0.78 to 0.89 compared to the state-of-the-art POLQA algorithm.
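The two evaluation metrics reported above, root mean squared error in MOS points and Pearson correlation between predicted and labeled MOS, can be computed as follows. This is an illustrative sketch with made-up MOS values, not data from the corpus described here:

```python
import numpy as np

# Hypothetical per-utterance MOS labels and model predictions
# (illustrative values only, not from the actual corpus).
mos_true = np.array([3.2, 4.1, 2.5, 3.8, 4.5, 1.9])
mos_pred = np.array([3.0, 4.3, 2.8, 3.6, 4.4, 2.2])

# Root mean squared error in MOS points.
rmse = np.sqrt(np.mean((mos_pred - mos_true) ** 2))

# Pearson correlation between predictions and human labels.
pearson = np.corrcoef(mos_true, mos_pred)[0, 1]

print(f"RMSE: {rmse:.3f} MOS points, Pearson r: {pearson:.3f}")
```

A lower RMSE and a higher Pearson correlation both indicate that the predicted scores track human judgments more closely, which is the sense in which the proposed method improves on POLQA.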