Digit. Signal Process. | 2021

Target exaggeration for deep learning-based speech enhancement

 
 

Abstract


Deep learning has been actively utilized for speech enhancement. However, deep learning-based speech enhancement usually produces over-smoothed speech, resulting in speech distortion and degraded intelligibility. In this paper, we propose exaggerating the training target so that the dynamic range of the enhanced speech becomes more similar to that of the clean speech. Target exaggeration can be implemented in two ways. The first approach is to exaggerate the target feature in the cost function of a deep learning-based speech enhancement system. This method requires no additional parameters or computation, but can only be applied to schemes that work in the time-frequency domain with the mean-square error cost function. The second approach is to introduce an additional deep neural network (DNN) that estimates the residual error in the output of a deep learning-based speech enhancement system. This requires more computation, but can be applied even to time-domain approaches. To evaluate the proposed target exaggeration, it is applied to feed-forward DNN- and long short-term memory (LSTM)-based speech enhancement schemes in the time-frequency domain, and to a convolutional time-domain audio separation network (Conv-TasNet)-based speech enhancement scheme in the time domain. Experimental results showed that the proposed method improved the quality of speech produced by the deep learning-based speech enhancement systems in terms of perceptual evaluation of speech quality (PESQ) scores, and that it outperformed other approaches to alleviating the over-smoothing problem, including global variance equalization and a perceptually optimized speech denoising autoencoder.
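The second approach described in the abstract, adding an estimate of the residual error to the output of the first-stage enhancer, can be illustrated with a toy sketch. This is not the authors' implementation. The synthetic feature track, the compression factor 0.6, and the one-parameter linear model `fit_linear` (standing in for the residual-estimating DNN) are all assumptions made for illustration only. For brevity the stand-in is fitted and applied on the same data.

```python
import random
import statistics


def fit_linear(xs, ys):
    """Least-squares fit ys ~ a * xs + b (hypothetical stand-in for the residual DNN)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, my - a * mx


def mse(xs, ys):
    """Mean-square error between two equal-length sequences."""
    return statistics.fmean((x - y) ** 2 for x, y in zip(xs, ys))


random.seed(0)

# Toy "clean" feature track (assumed; e.g. one spectral feature over time).
clean = [random.gauss(0.0, 1.0) for _ in range(2000)]

# Toy over-smoothed first-stage output: dynamic range compressed toward
# the mean, plus a small estimation error (both values are assumptions).
enhanced = [0.6 * c + random.gauss(0.0, 0.05) for c in clean]

# Second stage: learn to predict the residual error of the first stage
# from the first-stage output itself.
residual = [c - e for c, e in zip(clean, enhanced)]
a, b = fit_linear(enhanced, residual)

# Final output = first-stage output + estimated residual.
corrected = [e + (a * e + b) for e in enhanced]

var_clean = statistics.pvariance(clean)
var_enhanced = statistics.pvariance(enhanced)
var_corrected = statistics.pvariance(corrected)

print(f"variance  clean={var_clean:.3f}  enhanced={var_enhanced:.3f}  "
      f"corrected={var_corrected:.3f}")
print(f"MSE       enhanced={mse(enhanced, clean):.4f}  "
      f"corrected={mse(corrected, clean):.4f}")
```

In this toy setting the corrected output has a variance much closer to that of the clean track than the over-smoothed output does, which is the dynamic-range effect the abstract describes. A real system would replace the linear stand-in with a trained DNN and evaluate on held-out speech.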

Volume 116
Pages 103109
DOI 10.1016/J.DSP.2021.103109
Language English
Journal Digit. Signal Process.
