SkipConvNet: Skip Convolutional Neural Network for Speech Dereverberation using Optimally Smoothed Spectral Mapping
Vinay Kothapally¹, Wei Xia¹, Shahram Ghorbani¹, John H.L. Hansen¹, Wei Xue², Jing Huang²

¹Center for Robust Speech Systems (CRSS), The University of Texas at Dallas, TX, USA
²JD AI Research, JD.com, USA
{vinay.kothapally, wei.xia, shahram.ghorbani, john.hansen}@utdallas.edu, {xuewei27, jing.huang}@jd.com

Abstract
The reliability of fully convolutional networks (FCNs) has been successfully demonstrated by recent studies in many speech applications. One of the most popular variants of these FCNs is the 'U-Net', an encoder-decoder network with skip connections. In this study, we propose 'SkipConvNet', in which we replace each skip connection with multiple convolutional modules to provide the decoder with intuitive feature maps rather than the encoder's raw output, thereby improving the learning capacity of the network. We also propose the use of optimal smoothing of the power spectral density (PSD) as a pre-processing step, which further enhances the efficiency of the network. To evaluate our proposed system, we use the REVERB challenge corpus to assess the performance of various enhancement approaches under the same conditions. We focus solely on improvements in speech quality and their contribution to the efficiency of back-end speech systems, such as speech recognition and speaker verification, trained only on clean speech. Experimental findings show that the proposed system consistently outperforms other approaches.
Index Terms: fully convolutional networks, speech dereverberation, speech recognition, speaker verification
1. Introduction
Recent years have seen an exponential rise in the need for effective distant speech systems to improve the experience of human-machine interactions. These systems find their applications in many consumer devices today as personal assistants. Speech captured by these devices in confined spaces such as conference rooms, lobbies, cafeterias, etc. faces two major challenges: (a) reverberation: self-distortion due to reflections, which greatly reduces the intelligibility of the speech, and (b) background noise: speech from multiple overlapping speakers, music, or other acoustic sounds picked up from the environment. These two challenges are well known and have been addressed with various signal processing and deep neural network (DNN) based speech enhancement strategies.

In earlier days, statistical signal enhancement methods played a crucial part in all speech processing pipelines [1, 2, 3]. However, over the past few years, many DNN-based approaches for speech enhancement have shown promising results in enhancing reverberant speech. While DNN strategies for time-domain and frequency-domain processing [4, 5] were developed, most approaches prefer to operate on the short-time Fourier transform (STFT) of reverberant speech, enhancing the log-power spectrum (LPS) and reusing the unaltered noisy phase to restore a clean time-domain signal. As reverberation has its effects spread over time and frequency, sequence-to-sequence learning strategies like recurrent neural networks (RNNs) and long short-term memory (LSTM) networks [6] have been explored to capture and leverage temporal correlations for speech dereverberation. Despite the strong capabilities of these networks to capture temporal correlations in speech, they fail to capture the spectral structure of formants encoded in the STFT. Therefore, researchers have moved to convolutional neural networks (CNNs), which learn dependencies from groups of neighboring time-frequency pixels. In a conventional CNN, the spectral structure learned using 2-D convolutions is compromised by the presence of fully connected layers. For this reason, researchers adopted fully convolutional networks (FCNs), which substitute the fully connected layers with 1x1 convolutions to prevent the loss of spectral structure information. In the past couple of years, many FCN architectures like U-Net, ResNet, and DenseNet were adopted from computer vision for various speech applications [7, 8, 9, 10, 11]. Since these networks have shown significant success, exploring network architectures that further improve system performance on speech-related tasks remains an active line of research. In this study, we propose modifications to one such FCN architecture, U-Net, specifically designed for the speech dereverberation task. We also show that using pre-processed LPS for training such networks improves their efficiency significantly.

The rest of this paper is organized as follows. Section 2 briefly introduces the problem statement. Section 3 provides insights on the optimal smoothing proposed as a pre-processing step. Section 4 describes the proposed 'SkipConvNet' for single-channel speech dereverberation. Details on the experimental setup and results are presented in Section 5. Finally, we conclude our work in Section 6.
2. Problem Formulation
For a given room impulse response (RIR), the reverberant speech signal received by an omni-directional microphone can be modeled as:

x(t) = \sum_{\tau=0}^{L} s(\tau)\, h(t - \tau) + n(t)    (1)

X(t, f) = S(t, f) H(t, f) + N(t, f)    (2)

where x(t) is the signal observed by the distant microphone, s(t) is the clean speech signal from the source, h(t) is the room impulse response of length L, and n(t) is additive background noise. The relation in the frequency domain is given by Eq. (2), where X(t, f), S(t, f), H(t, f), and N(t, f) represent the STFTs of the observed reverberant speech, clean speech, RIR, and background noise, respectively. In this study, the noise level is considerably lower than the target speaker, so that the analysis concentrates solely on evaluating and suggesting strategies to reverse the impact of reverberation. The early and late reflections in an RIR create a smearing effect in both time and frequency of a speech spectrogram, see Fig. 1(a). The emphasis of this study is on learning a non-linear function using FCNs that maps the LPS of noisy and reverberant speech X(t, f) to that of the corresponding clean speech S(t, f). The estimated enhanced LPS from the network is then combined with the unaltered noisy phase response to reconstruct the enhanced speech.
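To make this pipeline concrete, the following is a minimal Python sketch of the LPS-enhancement and phase-reuse reconstruction, using the STFT settings reported later in Section 5 (512-sample frames, 384-sample overlap). The enhance_lps argument is a hypothetical placeholder standing in for the trained network; this is an illustrative sketch, not the authors' released code.

    import numpy as np
    from scipy.signal import stft, istft

    def reconstruct(x, enhance_lps, fs=16000):
        """Enhance the LPS of x and rebuild a waveform with the noisy phase."""
        # STFT of the observed reverberant signal, X(t, f) in Eq. (2)
        _, _, X = stft(x, fs=fs, nperseg=512, noverlap=384)
        lps = np.log(np.abs(X) ** 2 + 1e-10)   # log-power spectrum
        phase = np.angle(X)                    # unaltered noisy phase
        lps_hat = enhance_lps(lps)             # placeholder for the trained FCN
        # magnitude from the enhanced LPS, phase from the noisy input
        S_hat = np.exp(0.5 * lps_hat) * np.exp(1j * phase)
        _, s_hat = istft(S_hat, fs=fs, nperseg=512, noverlap=384)
        return s_hat

    # sanity check: an identity "network" leaves the signal nearly unchanged
    # y = reconstruct(x, enhance_lps=lambda lps: lps)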
3. Optimal Smoothing based Pre-processing
Estimation of the effects of late reflections and/or background noise statistics plays an important role in the design of a robust speech enhancement system. We use a minimum statistics based approach [12], originally developed to estimate the power spectral density (PSD) of noise, to also estimate the PSD of late reflections given noisy and reverberant speech. The approach is based on the assumption that the energy across all frequencies will theoretically tend to zero during silent periods in speech, unless affected by stationary noise, reverberation, or both. Thus, tracking the minimum energy in each frequency bin over time in a smoothed reverberant PSD can produce a robust estimate of the PSD of noise/reverberation. Although this approach is focused on estimating the PSD of noise/reverberation, we discard that estimate and use only the smoothed speech PSD for training our proposed system.

P(t, f) = \alpha_{opt}(t, f)\, P(t-1, f) + \left(1 - \alpha_{opt}(t, f)\right) |X(t, f)|^2    (3)

\alpha_{opt}(t, f) = \frac{1}{1 + \left( P(t-1, f) / \sigma_n^2(t, f) - 1 \right)^2}    (4)

In general, a fixed smoothing parameter α is used in speech applications to obtain a smoothed PSD P(t, f). It is known that a fixed value often comes with estimation trade-offs. A sliding-window smoothing mechanism is robust for a higher value of α but blurs the boundaries between speech activity and silence. On the contrary, abrupt changes in speech activity can be captured with a lower value of α at the cost of a less reliable PSD estimate. To address this issue, a time-varying and frequency-dependent smoothing parameter is used to obtain an accurate estimate of the speech PSD, as shown in Eq. (3). The optimal smoothing parameter is computed using Eq. (4), where σ_n²(t, f) is the variance of the noise. We refer the reader to [12] for a detailed derivation of this optimal smoothing parameter. In addition, the smoothed PSD values below -80 dB for all training samples are clipped to obtain a constant dynamic range, see Fig. 1.

Figure 1: Optimal smoothing as pre-processing
The original and smoothed PSDs for a reverberant speech utterance and the corresponding clean utterance are shown in Fig. 1. The optimal smoothing parameter adapts itself accordingly for active and silent (late reflections, in our case) regions of speech. It is clear from the highlighted regions in Fig. 1 that optimal smoothing helps retrieve the formant structure lost in reverberant speech. Thus, we propose to use this smoothing strategy to pre-process the reverberant speech before it is fed to the proposed system for training.
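As an illustration, the following is a minimal NumPy sketch of the recursion in Eqs. (3)-(4). It assumes the noise variance σ_n²(t, f) is tracked externally (e.g., by the minimum statistics tracker of [12]) and passed in; the upper clamp alpha_max and the numerical floors are illustrative choices, not values from the paper.

    import numpy as np

    def optimally_smoothed_psd(X, sigma_n2, alpha_max=0.96, floor_db=-80.0):
        """X: complex STFT (n_freq x n_frames); sigma_n2: externally tracked
        noise variance per bin and frame. Returns the optimally smoothed
        PSD in dB, clipped to a constant dynamic range."""
        n_freq, n_frames = X.shape
        periodogram = np.abs(X) ** 2
        P = np.empty_like(periodogram)
        P[:, 0] = periodogram[:, 0]                  # initialize with frame 0
        for t in range(1, n_frames):
            ratio = P[:, t - 1] / np.maximum(sigma_n2[:, t], 1e-12)
            alpha = 1.0 / (1.0 + (ratio - 1.0) ** 2)  # Eq. (4)
            alpha = np.minimum(alpha, alpha_max)      # illustrative upper clamp
            P[:, t] = alpha * P[:, t - 1] \
                      + (1.0 - alpha) * periodogram[:, t]  # Eq. (3)
        P_db = 10.0 * np.log10(np.maximum(P, 1e-12))
        return np.maximum(P_db, floor_db)             # clip below -80 dB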
4. Skip Convolutional Neural Network
In this section, we start with a formal description of the standard U-Net and then explain the modifications we propose to make it a 'SkipConvNet'. U-Net is an encoder-decoder based image-to-image translation network. The encoder stage of the architecture extracts spectral and temporal features from the LPS of the input reverberant speech, and the decoder constructs an enhanced LPS from the encoded features. The encoder and decoder networks consist of multiple layers of convolutions followed by down-sampling and up-sampling, respectively. With increasing numbers of layers in the network, each neuron's receptive field increases and, correspondingly, the dimension of the encoded feature maps decreases. Similar to [13], we use enough layers in the architecture that the encoder down-samples a given input to a single pixel. This ensures that the decoder uses all spectral and temporal features learned by the encoder from the input to construct an enhanced output.

Figure 2: SkipConvNet: U-Net with convolutional modules in the skip connections, trained on optimally smoothed PSDs

The skip connections play the most crucial role in the U-Net architecture. A skip connection is a link between the encoder and decoder used to share learned features, represented by dotted lines in Fig. 2. A skip connection in each layer concatenates the output of the encoder before down-sampling with the decoder features in the corresponding layer, under the assumption that the input and output PSDs have a similar structure. These concatenated features from the encoder and the previous layers of the decoder not only help preserve information that would otherwise be lost during down-sampling in the encoder, but also guide the decoder in reconstructing the enhanced output.

Although skip connections have proven efficient in building robust systems, a recent study [14] analyzes a probable semantic gap in the features exchanged between encoder and decoder. For instance, the first layer of the encoder extracts low-level, local spectral and temporal features. These features are concatenated with the final layer of the decoder, which receives highly processed features from its previous layers. Merging these two incompatible sets of features might limit the learning abilities of FCNs. Adding a few convolutional layers within each skip connection can compensate for these incompatibilities by transforming the encoder features to be more intuitive to the decoder. We believe that, with minimized differences within the features at each layer of the decoder, the learning ability of FCNs can potentially be maximized.

Unlike [14], which uses convolutions with varying kernels in parallel, we use standard convolutions followed by normalization and non-linear activation in the encoder and the decoder, and multiple 'skipconv' blocks in series for each skip connection in the architecture. A skipconv block consists of a non-linear activation followed by a 5x5 convolution with a residual connection, and does not alter the dimensions of the feature set. The features are then normalized before being passed to either another skipconv block or the decoder. The number of skipconv blocks in a particular skip connection varies inversely with the depth of the encoder layer it is associated with, see Fig. 2. For instance, the skip connection associated with the final layer of the encoder has only one skipconv block, whereas the first layer of the encoder has a total of eight skipconv blocks. This is based on the assumption that the deeper layers of the network deal with high-level information and require minimal transformation to reduce the incompatibilities within the feature sets at the decoder.
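The sketch below, assuming PyTorch, illustrates the skipconv block and the depth-dependent block count described above. The activation and normalization choices (LeakyReLU, batch normalization) and the channel bookkeeping are assumptions for illustration; the paper specifies only the activation-conv-residual-normalization structure, the 5x5 kernel, and the eight-layer encoder.

    import torch
    import torch.nn as nn

    class SkipConvBlock(nn.Module):
        """Non-linear activation -> 5x5 convolution with a residual
        connection -> normalization; feature-map dimensions are
        preserved (stride 1, padding 2)."""
        def __init__(self, channels):
            super().__init__()
            self.act = nn.LeakyReLU(0.2)          # activation choice assumed
            self.conv = nn.Conv2d(channels, channels, kernel_size=5, padding=2)
            self.norm = nn.BatchNorm2d(channels)  # normalization choice assumed

        def forward(self, x):
            return self.norm(x + self.conv(self.act(x)))

    def skip_connection(channels, depth, total_layers=8):
        """Blocks per skip connection vary inversely with encoder depth:
        depth 1 (first encoder layer) -> 8 blocks, depth 8 (deepest) -> 1."""
        return nn.Sequential(*[SkipConvBlock(channels)
                               for _ in range(total_layers - depth + 1)])

    # e.g., the skip connection attached to the first encoder layer:
    # skip = skip_connection(channels=64, depth=1)  # 8 skipconv blocks in series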
5. Experimental Results
All experiments were run on the REVERB challenge corpus [15, 16]. The outcomes of SkipConvNet are compared with those of a standard U-Net [13] trained on the LPS of reverberant and clean speech utterances, with and without the proposed pre-processing, and with a widely used statistical dereverberation algorithm, weighted prediction error (WPE) [17, 18]. We also test the enhanced speech utterances with back-end automatic speech recognition (ASR) and speaker verification (SV) models trained using only clean speech.
The REVERB challenge corpus is a collection of simulated and real recordings of speech sampled at 16 kHz from different rooms with varying levels of reverberation and background noise at a 20 dB SNR. The simulated data is generated by convolving clean speech utterances from WSJCAM0 [19] with room impulse responses (RIRs) collected from three different rooms (small, medium, and large) and two different microphone placements (near, far), using a single microphone, a 2-channel array, and an 8-channel microphone array. The multi-channel real recordings were drawn from the MC-WSJ-AV corpus [20]. The corpus is divided into train, dev, and eval sets. The train set consists of 7,861 simulated reverberant utterances, which are used to train the systems compared in this study. The dev and eval sets contain both simulated and real recordings.

We compute the STFT with a frame length of 512 samples and an overlap of 384 samples for a given speech utterance. We then compute the LPS of a speech signal from its optimally smoothed PSD using Eqs. (3) and (4). We only consider the lower half of the spectrum, since the STFT of a real-valued signal is symmetric. Later, the LPS of each utterance is divided into batches of 256 consecutive frames to form spectral images of size 256x256. We re-use the U-Net architecture proposed in [13] as our baseline. Fig. 2 gives an overview of the proposed 'SkipConvNet' architecture with the convolutional module 'skipconv block' added to the skip connections. All convolutions in the encoder and the skipconv blocks use a kernel size of 5x5, with a stride of 2 in the encoder. Similarly, all convolutions in the decoder are transposed convolutions with a kernel size and stride of 2. We train our network on a total of 62,888 spectral images (corresponding to 7,861 utterances) from the train set of the corpus with the Adam optimizer. The network is trained to minimize the mean square error (MSE) between the network prediction and the LPS of the corresponding clean utterance. We use a batch size of 8 and train the network for 10 epochs. Finally, the estimated LPS from the network is combined with the unaltered noisy phase to reconstruct the enhanced speech. We report the improvements seen on the eval set of the corpus.
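As a minimal sketch of this feature preparation, the snippet below turns one utterance's LPS into 256x256 spectral images. Dropping one of the 257 frequency bins and discarding the ragged tail of frames are assumptions for illustration; the paper does not specify how the 257-bin spectrum and non-multiple-of-256 utterance lengths are handled. The training objective itself would then be, e.g., torch.nn.functional.mse_loss(model(noisy_img), clean_img).

    import numpy as np

    def lps_to_images(lps, size=256):
        """lps: (257, T) log-power spectrum of one utterance; returns an
        (N, 256, 256) array of spectral images for training."""
        lps = lps[:size, :]             # keep 256 of the 257 bins (assumed)
        n = lps.shape[1] // size        # drop the ragged tail (assumed)
        return np.stack([lps[:, i * size:(i + 1) * size] for i in range(n)])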
Figure 3: Reverberant, enhanced, and clean speech spectrograms

Table 1: Improvements in speech quality measures for SimData and RealData

Near Microphone (SimData: CD, LLR, FWSegSNR, and SRMR per Room 1/2/3; RealData: SRMR)

Method                 CD              LLR             FWSegSNR         SRMR            SRMR (Real)
Reverb                 1.96/4.58/4.20  0.34/0.51/0.65  8.10/3.07/2.32   4.37/3.67/3.66  4.05
WPE                    1.82/4.53/4.12  0.33/0.52/0.62  8.66/3.44/2.69   4.51/3.89/3.92  4.42
UNet                   2.14/3.06/2.79  0.28/0.39/0.38  7.31/4.41/5.24   4.34/3.91/4.06  4.68
UNet+Pre-Processing    2.05/2.82/2.71  0.22/0.35/0.36  11.68/9.62/8.87  4.67/4.50/4.40  5.87
SkipConvNet
Table 2: SV and ASR performance on simulated and real recordings for models trained on clean speech. Speaker verification is reported as EER (%) using an X-vector system with a PLDA backend trained on clean speech; speech recognition is reported as WER (%) using an acoustic model trained on clean speech, each for SimData and RealData per method.
We begin our presentation of experimental results with a second look at the optimally smoothed PSDs in Fig. 1. From Fig. 1(a), (c), and (e), we see that optimal smoothing helps preserve the formant structure during speech frames by using a low smoothing parameter, while assigning regions with reverberant content a higher smoothing parameter.

We then measure the relative enhancement achieved by each system using several speech quality measures, as shown in Table 1. The FCN-based U-Net and the proposed 'SkipConvNet' performed consistently better than the widely used statistical dereverberation algorithm, WPE. Moreover, we observed a 39.19% relative improvement in the performance of the baseline U-Net solely by introducing the proposed pre-processing. This shows that the proposed pre-processing helps FCN networks in general and is not biased toward the proposed 'SkipConvNet'. Nevertheless, the proposed 'SkipConvNet' consistently performed best, with average relative improvements of 54.45% and 10.40% over all quality metrics compared to the U-Net and the U-Net trained on pre-processed inputs, respectively. Consistent improvements in SRMR and FWSegSNR, in addition to CD, confirm the reduction of reverberation and background noise in the enhanced speech utterances without processing artifacts or distortions.

Finally, we test the improvements in back-end automatic speech recognition (ASR) and speaker verification (SV) systems achieved with the proposed system for single- and multi-channel streams, see Table 2. For multi-channel streams, the individual channels are enhanced with the different dereverberation techniques discussed in this study and then spatially combined using the BeamformIt [21, 22] beamforming strategy. Since the proposed pre-processing enhanced the performance of the traditional U-Net, we compare the proposed system only with WPE and the U-Net trained on pre-processed spectral images. For the ASR system, we use a TDNN-based acoustic model [23] trained on single-channel clean speech from the REVERB challenge corpus. Similarly, for the speaker verification system, we train an X-vector model [24] on the VoxCeleb 1 & 2 corpora [25, 26] and a PLDA backend on the in-domain single-channel clean speech of the corpus. Three utterances from each speaker, from both the simulated and real recordings of the eval set, are placed in the enrollment set, and the rest in the evaluation set. We see relative improvements of 35.03% and 16.42% in speaker verification performance using X-vectors, averaged over simulated and real recordings, compared to WPE and the U-Net trained on pre-processed spectral images, respectively. Similarly, we see 48.15% and 23.94% relative improvements in ASR performance, averaged over simulated and real recordings, compared to WPE and the U-Net trained on pre-processed spectral images. The interested reader may check out audio samples at: https://vkothapally.github.io/SkipConv/
6. Conclusions
In this study, we presented 'SkipConvNet', an encoder-decoder based FCN with convolutional modules introduced in the skip connections, which enhance the network's ability to map reverberant speech to its corresponding clean speech. We proposed the use of optimal smoothing of the PSD as a pre-processing step for training the network, which yields considerable improvements in the network's performance. With the proposed modifications, we achieved significant improvements in speech quality for both real and simulated data from the REVERB challenge corpus in comparison with the traditional U-Net and the widely used WPE dereverberation algorithm. We have also shown that the proposed system improves the performance of single-channel and multi-channel back-end speech systems such as speech recognition and speaker verification. To summarize, the addition of convolutions in skip connections reduces the incompatibilities within the feature sets received at each layer of the decoder and boosts the learning capabilities of the network. We believe this work can potentially be extended to a number of complex FCN architectures that have recently been investigated for speech enhancement.

References

[1] K. Lebart, J.-M. Boucher, and P. N. Denbigh, "A new method based on spectral subtraction for speech dereverberation," Acta Acustica united with Acustica, vol. 87, no. 3, pp. 359–366, 2001.
[2] S. Griebel and M. Brandstein, "Wavelet transform extrema clustering for multi-channel speech dereverberation," in IEEE Workshop on Acoustic Echo and Noise Control, Citeseer, 1999, pp. 27–30.
[3] E. A. P. Habets, "Single and multi-microphone speech dereverberation using spectral enhancement," 2007.
[4] K. Tan and D. Wang, "Complex spectral mapping with a convolutional recurrent network for monaural speech enhancement," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019, pp. 6865–6869.
[5] A. Pandey and D. Wang, "A new framework for CNN-based speech enhancement in the time domain," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 7, pp. 1179–1188, 2019.
[6] M. Mimura, S. Sakai, and T. Kawahara, "Speech dereverberation using long short-term memory," in Sixteenth Annual Conference of the International Speech Communication Association, 2015.
[7] O. Ronneberger, P. Fischer, and T. Brox, "U-Net: Convolutional networks for biomedical image segmentation," in International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer, 2015, pp. 234–241.
[8] Z. Zhou, M. M. R. Siddiquee, N. Tajbakhsh, and J. Liang, "UNet++: A nested U-Net architecture for medical image segmentation," in Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support, Springer, 2018, pp. 3–11.
[9] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[10] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, "Densely connected convolutional networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 4700–4708.
[11] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1–9.
[12] R. Martin, "Noise power spectral density estimation based on optimal smoothing and minimum statistics," IEEE Transactions on Speech and Audio Processing, vol. 9, no. 5, pp. 504–512, 2001.
[13] O. Ernst, S. E. Chazan, S. Gannot, and J. Goldberger, "Speech dereverberation using fully convolutional networks," in European Signal Processing Conference (EUSIPCO), IEEE, 2018, pp. 390–394.
[14] N. Ibtehaz and M. S. Rahman, "MultiResUNet: Rethinking the U-Net architecture for multimodal biomedical image segmentation," Neural Networks, vol. 121, pp. 74–87, 2020.
[15] K. Kinoshita, M. Delcroix, T. Yoshioka, T. Nakatani, E. Habets, R. Haeb-Umbach, V. Leutnant, A. Sehr, W. Kellermann, R. Maas et al., "The REVERB challenge: A common evaluation framework for dereverberation and recognition of reverberant speech," in IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2013, pp. 1–4.
[16] K. Kinoshita, M. Delcroix, S. Gannot, E. A. Habets, R. Haeb-Umbach, W. Kellermann, V. Leutnant, R. Maas, T. Nakatani, B. Raj et al., "A summary of the REVERB challenge: state-of-the-art and remaining challenges in reverberant speech processing research," EURASIP Journal on Advances in Signal Processing, vol. 2016, no. 1, p. 7, 2016.
[17] T. Nakatani, T. Yoshioka, K. Kinoshita, M. Miyoshi, and B.-H. Juang, "Speech dereverberation based on variance-normalized delayed linear prediction," IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 7, pp. 1717–1731, 2010.
[18] T. Yoshioka and T. Nakatani, "Generalization of multi-channel linear prediction methods for blind MIMO impulse response shortening," IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 10, pp. 2707–2720, 2012.
[19] T. Robinson, J. Fransen, D. Pye, J. Foote, and S. Renals, "WSJCAM0: a British English speech corpus for large vocabulary continuous speech recognition," in International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol. 1, IEEE, 1995, pp. 81–84.
[20] M. Lincoln, I. McCowan, J. Vepa, and H. K. Maganti, "The multi-channel Wall Street Journal audio visual corpus (MC-WSJ-AV): Specification and initial experiments," in IEEE Workshop on Automatic Speech Recognition and Understanding, 2005, pp. 357–362.
[21] X. Anguera, C. Wooters, and J. Hernando, "Acoustic beamforming for speaker diarization of meetings," IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 7, pp. 2011–2022, 2007.
[22] X. Anguera, C. Wooters, and J. M. Pardo, "Robust speaker diarization for meetings: ICSI RT06s meetings evaluation system," in International Workshop on Machine Learning for Multimodal Interaction, Springer, 2006, pp. 346–358.
[23] F. Weninger, S. Watanabe, J. Le Roux, J. Hershey, Y. Tachioka, J. Geiger, B. Schuller, and G. Rigoll, "The MERL/MELCO/TUM system for the REVERB challenge using deep recurrent neural network feature enhancement," in Proc. REVERB Workshop, 2014, pp. 1–8.
[24] D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur, "X-vectors: Robust DNN embeddings for speaker recognition," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 5329–5333.
[25] A. Nagrani, J. S. Chung, and A. Zisserman, "VoxCeleb: a large-scale speaker identification dataset," arXiv preprint arXiv:1706.08612, 2017.
[26] J. S. Chung, A. Nagrani, and A. Zisserman, "VoxCeleb2: Deep speaker recognition," arXiv preprint arXiv:1806.05622, 2018.