PoCoNet: Better Speech Enhancement with Frequency-Positional Embeddings, Semi-Supervised Conversational Data, and Biased Loss
Umut Isik, Ritwik Giri, Neerad Phansalkar, Jean-Marc Valin, Karim Helwani, Arvindh Krishnaswamy
Amazon Web Services [email protected], [email protected]
Abstract
Neural network applications generally benefit from larger-sized models, but current large-scale speech enhancement networks often suffer from decreased robustness to the variety of real-world use cases beyond what is encountered in training data. We introduce several innovations that lead to better large neural networks for speech enhancement. The novel PoCoNet architecture is a convolutional neural network that, with the use of frequency-positional embeddings, is able to more efficiently build frequency-dependent features in the early layers. A semi-supervised method increases the amount of conversational training data by pre-enhancing noisy datasets, improving performance on real recordings. A new loss function biased towards preserving speech quality helps the optimization better match human perceptual opinions on speech quality. Ablation experiments and objective and human opinion metrics show the benefits of the proposed improvements.
1. Introduction
Neural network based approaches have greatly improved the output quality of speech enhancement systems [1, 2, 3, 4, 5]. These networks are typically trained in a supervised setting, with synthetic mixtures of clean speech and known noise clips, sometimes with synthetic reverberation added. Usually, the model is used to estimate a magnitude gain for each bin in the time-frequency domain representation of the noisy and/or reverberant mixture signal. Recent phase-aware models use a complex ratio mask instead of a magnitude gain [6, 7], while other approaches work directly in the waveform domain [8, 9, 10].

The speech enhancement problem has multiple challenges associated with it. First, the model needs to be robust to the multitude of different speech, recording, and noise conditions present in real-world usage. Second, clean speech data for training is limited in the public domain, with the biggest datasets coming from read material. Third, the task becomes increasingly difficult in low signal-to-noise ratio (SNR) cases; this can be helped by training larger models, which in turn makes the model more prone to fitting to the biases of the available dataset, decreasing robustness to other real-world conditions and making the first two challenges more pronounced. Fourth, the mismatch between human perception of sound quality and standard loss functions and metrics [11] can make well-optimized models perform worse in human evaluation.

We propose several architectural, data preparation, augmentation, and loss-function innovations that help meet the above challenges for large neural networks for speech enhancement.

Standard convolutional implementations in the time-frequency domain rely on 1D or 2D convnets. In the typical 1D architecture (e.g. Conv-TasNet [12]), the kernels move in the time direction and are fully connected in the frequency direction. These tend to have very large weight matrices in the early layers, where the architecture could benefit from a more hierarchical development of features. On the other hand, in standard 2D U-Net models where kernels move in both the time and frequency directions [13], early-layer activations are blind to which frequency they operate at: even when padding is used, these early features' receptive fields have not yet reached the edges of the time-frequency image. Our proposed architecture has the advantages of both options: it is a 2D U-Net (with DenseNet blocks and self-attention) with small kernels, and can therefore develop features hierarchically, but it can also take frequency information into account in early layers through the inclusion of frequency-positional embeddings.

On the data front, we scale up the amount of clean conversational data available for training by using a semi-supervised approach. The clean portion of the LibriSpeech dataset, our starting point, contains data only from audio books, which is not conversational. The larger VoxCeleb dataset [14], on the other hand, is from television broadcasts and contains background music and effects; some of the data is also highly reverberant. We use LibriSpeech-trained speech enhancement models to isolate the clean speech in VoxCeleb2 and eliminate reverberant clips, and show that adding this processed clean speech dataset to the training data improves the robustness of the model to conditions not well-represented in LibriSpeech.
To make the most effective use of the data, we also use an extensive data augmentation stack that also helps address specific failure modes. We apply synthetic reverberation using a library of recorded and synthetically generated room impulse responses, and train separate models to target the task with and without partial dereverberation. For non-dereverberating models, reverberation is added during training to the clean speech data as an augmentation before mixing. For training partially dereverberating models, we add, to the clean speech labels, a faster decaying version of the reverberation, as was done in [15].

We use L1 losses across the board to help deal with dataset noise, as a linear combination of two losses. The first is a new L1 loss on magnitudes which is biased to penalize underestimation of speech time-frequency bin magnitudes, and is also weighted towards high frequencies; this makes the output of the trained model better preserve speech quality and avoid muffling. The second is an L1 loss in the audio waveform domain, which is backpropagated through the STFT layer and complex multiplication to the estimated complex ratio mask values in the time-frequency domain.

To measure the performance of our model, we rely on Mean Opinion Score (MOS) subjective testing crowd-sourced on Amazon Mechanical Turk, using model outputs on the Deep Noise Suppression (DNS) challenge [16] pre-competition test set, as well as standard numerical metrics on the synthetic portion of the same test set. An ablation study shows the added improvement to human MOS and numerical metrics from each proposed component discussed above.
2. Method
Let s be the clean speech audio signal and let x = s ∗ h + n be the reverberated signal s ∗ h, where h is a room impulse response, with added noise n; let y be the denoised and/or dereverberated target signal. The neural model N takes as input the STFT of the reverberant and noisy mixture x and estimates the complex ratio mask that gives the target signal estimate as

    ŷ = ISTFT( N(STFT(x)) · STFT(x) ).

For the neural model N, we start with a fully-convolutional 2D U-Net architecture with self-attention layers and 4-layer DenseNet blocks at each level, similar to [17]. We take the convolutions to be causal in the time direction, but not in the frequency direction: padding is applied symmetrically in the frequency direction, as is usual in 2D convnets, but asymmetrically in the time direction, in the sense that it is only used at the edge of each layer corresponding to the early part in time. This helps preserve output quality at the late portion of the output, which is what is used in low-latency applications, since padding tends to hurt quality near edges and borders. Note that look-ahead is provided by the average-pooling layers, which are used instead of max-pooling. Figure 1 shows the overall architecture, while Figure 2 shows details of the DenseNet and attention blocks.

[Figure 1: Top two levels of the U-Net architecture, shown with frequency-positional embeddings and the STFT real and imaginary parts as inputs, and the real and imaginary parts of the complex ratio mask as outputs. We use a 6-level U-Net architecture.]

[Figure 2: Details of the DenseNet and attention block. Straight arrows are convolutions with batch normalization and ReLU non-linearity; curved arrows are concatenations.]

The self-attention blocks we use are the same as those used in [18, 19], except that the mechanism aggregates information only in the time direction, to increase efficiency during training and inference.

For early convolutional layers to be able to do frequency-aware processing, we concatenate a vector of positional embeddings to each time-frequency bin at the input layer of the model. The frequency-positional embedding vector for the time-frequency bin centered at (t, f) depends only on f and is defined as

    ρ(t, f) = ( cos(πf/F), cos(2πf/F), . . . , cos(2^(k−1)πf/F) ),

where F is the frequency bandwidth and k = 10.

For the clean signal s, we combine data from two sources. First, we take two subsets of the publicly available LibriVox data, totaling approximately 600 hours of speech: LibriSpeech-clean [20] (whose 'clean' clips were selected based on word error rates in ASR experiments, not directly by audio quality), as well as the subset of the LibriVox dataset filtered based on Mean Opinion Scores to form the DNS Challenge dataset [21]. The second source is VoxCeleb2, from which we use approximately 800 hours of data.

To be able to use this large and varied dataset, we first train two models on the LibriSpeech dataset described above. The first is a speech enhancement model that also does full dereverberation, trained to estimate the reverb-only portion h ∗ s − s, along with the clean signal s and the noise n. This model uses the same architecture as our proposed network, but with fewer filters, and early stopping to avoid overfitting. We use this model to estimate the direct-to-reverberant ratio (DRR) of each clip in VoxCeleb2 and filter out clips with DRR less than 30 dB. While this model is better at estimating DRR than more traditional methods, its clean signal estimates contain artifacts and are not suitable for training. Instead, we use a second model with the same architecture, trained to estimate h ∗ s and n only. We use this denoise-only model to filter out all clips with signal-to-noise ratio (SNR) less than 10 dB, and use its clean speech estimates as training data for subsequent experiments.
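To make the frequency-positional embedding concrete, the following is a minimal NumPy sketch; the function names and the channel layout are our own assumptions, as the paper does not specify an implementation. It builds the k cosine channels and concatenates them to a real/imaginary STFT input tensor:

```python
import numpy as np

def frequency_positional_embedding(num_freq_bins: int, k: int = 10) -> np.ndarray:
    """Cosine frequency-positional embedding: rho(t, f) depends only on f.

    Returns an array of shape (k, num_freq_bins) whose i-th channel is
    cos(2^i * pi * f / F) for i = 0, ..., k-1, with f the bin center
    frequency and F the full bandwidth.
    """
    f = np.arange(num_freq_bins) / max(num_freq_bins - 1, 1)  # f / F in [0, 1]
    multipliers = 2.0 ** np.arange(k)                          # 1, 2, 4, ..., 2^(k-1)
    return np.cos(np.pi * multipliers[:, None] * f[None, :])   # shape (k, F_bins)

def add_positional_channels(stft: np.ndarray, k: int = 10) -> np.ndarray:
    """Concatenate embeddings to a (2, T, F) real/imag STFT tensor -> (2 + k, T, F)."""
    _, num_frames, num_bins = stft.shape
    rho = frequency_positional_embedding(num_bins, k)          # (k, F)
    rho = np.broadcast_to(rho[:, None, :], (k, num_frames, num_bins))
    return np.concatenate([stft, rho], axis=0)

# Example: roughly 1 second at 16 kHz with a 512-point STFT (257 bins).
x = np.random.randn(2, 63, 257).astype(np.float32)  # placeholder real/imag STFT
print(add_positional_channels(x).shape)              # (12, 63, 257)
```

Because the embedding depends only on f, the same (k, F) matrix is broadcast across all time frames, so the cost of the extra input channels is negligible.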
For noise data, we filter the AudioSet dataset, selecting clips whose tags in the AudioSet ontology correspond to sounds that a speech enhancement system would be expected to remove, while excluding any clips with tags related to sounds that humans make.

We found that, even though most AudioSet tags correspond to non-stationary noise categories, a random 1-second chunk used in training will more often than not contain no non-stationary noise. We therefore compute, for each chunk, the energy levels in 50 ms windows, and upsample, during training, chunks that have a standard deviation of windowed energy of at least 3 dB. This increases the prevalence of non-stationary noise during training.
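A minimal sketch of this selection criterion follows; the window length and threshold are as stated above, while the function name, the energy floor, and the use of non-overlapping windows are our own assumptions:

```python
import numpy as np

def is_nonstationary(chunk: np.ndarray, sample_rate: int = 16000,
                     win_ms: float = 50.0, min_std_db: float = 3.0) -> bool:
    """Flag a noise chunk for upsampling when its 50 ms windowed energy,
    expressed in dB, has a standard deviation of at least `min_std_db`."""
    win = int(sample_rate * win_ms / 1000)
    n_windows = len(chunk) // win
    frames = chunk[: n_windows * win].reshape(n_windows, win)
    energy = np.sum(frames ** 2, axis=1)
    energy_db = 10.0 * np.log10(np.maximum(energy, 1e-12))  # floor avoids log(0)
    return float(np.std(energy_db)) >= min_std_db

# Example: a stationary noise chunk is not flagged; an impulsive one is.
rng = np.random.default_rng(0)
stationary = rng.standard_normal(16000)
impulsive = stationary.copy()
impulsive[4000:4800] *= 10.0                 # one loud 50 ms burst
print(is_nonstationary(stationary), is_nonstationary(impulsive))  # False True
```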
We use the following augmentation stack. Unless specified otherwise, distributions are uniform in the given number ranges.

• EQ. Random high- and low-shelf EQ filters, with center frequency chosen uniformly in the logarithmic domain between 40 and 8000 Hz and gain between ± dB; and two random EQ bell curves per datapoint, symmetric in the log domain, with Q-value between 0.5 and 1.5 and frequency chosen from the same interval as the shelf EQ. Randomized and applied to speech and noise separately.

• Pitch shifts. Random resampling with ± of the original sample rate.

• Clipping. Random clipping between 0.5 and 1 of the peak value of the signal, applied 10% of the time.

• Empty buffer simulation. Random deletion of the first 0.5 to 1 of the input signal, to simulate a partially filled buffer in low-latency evaluation.

• Level and Silence. We skip datapoints with foreground RMS less than -38 dBFS (dB relative to a full scale of 1.0) and normalize each signal to an RMS value of -20 dBFS. We then apply a random volume reduction between -30 and 0 dB to the background, normalize the mix to -20 dBFS RMS, then apply a random amplification between -25 and 5 dB to everything. We additionally use silence as the foreground 3% of the time. (A sketch of this step follows the list.)

• Band-limiting. To make the model robust to cases where the input signal is band-limited, we apply a low-pass filter at a frequency between 4 and 7 kHz: 2.5% of the time to the background only, 2.5% of the time to the foreground, and 5% of the time to both.

• Reverberation. Used both as an augmentation and for datapoint creation, as described below.
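A minimal sketch of the Level and Silence step, under our reading of the bullet above; the function names are ours, and since it is not specified here whether the final random amplification is also applied to the labels, the sketch returns only the mix:

```python
import numpy as np

def rms_dbfs(x: np.ndarray) -> float:
    """RMS level in dB relative to a full scale of 1.0."""
    return 20.0 * np.log10(np.sqrt(np.mean(x ** 2)) + 1e-12)

def normalize_to(x: np.ndarray, target_dbfs: float = -20.0) -> np.ndarray:
    """Scale x so that its RMS level equals target_dbfs."""
    return x * 10.0 ** ((target_dbfs - rms_dbfs(x)) / 20.0)

def level_augment(fg: np.ndarray, bg: np.ndarray, rng: np.random.Generator):
    """Skip quiet foregrounds, normalize levels, attenuate the background,
    re-normalize the mix, then apply a global random gain."""
    if rms_dbfs(fg) < -38.0:
        return None                                        # skip quiet datapoints
    fg = normalize_to(fg)                                  # foreground to -20 dBFS
    bg = normalize_to(bg)                                  # background to -20 dBFS
    bg = bg * 10.0 ** (rng.uniform(-30.0, 0.0) / 20.0)     # random background turn-down
    mix = normalize_to(fg + bg)                            # mix back to -20 dBFS RMS
    return mix * 10.0 ** (rng.uniform(-25.0, 5.0) / 20.0)  # global random gain

rng = np.random.default_rng(0)
fg, bg = rng.standard_normal(16000) * 0.1, rng.standard_normal(16000) * 0.05
out = level_augment(fg, bg, rng)
print(None if out is None else round(rms_dbfs(out), 1))
```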
When adding reverberation, we first identify, in each room impulse response (RIR), the portion corresponding to the direct path, i.e. the 'first tap', and scale and shift the RIR so that the first tap is at t = 0 and has height 1. So we have x = s ∗ (h_0 + h_{>0}) + n, where h_0 is a single tap at time zero. We then apply a gain to all taps except the first, by a value between -25 and 0 dB. Also, 60% of the time, we add reverberation via the same impulse response to the noise signal as well, with a separate downward scaling of the non-first taps. Hence, the model input becomes

    x = s ∗ (h_0 + α h_{>0}) + n ∗ (h_0 + β h_{>0}).

We use both real-recorded and synthetic RIRs. For real impulse responses, we use the Aachen Impulse Response dataset [22], consisting of 214 RIR recordings. For synthetic RIRs, we generate a library of 10,000 RIRs using the image method [23], with random rectangular rooms with sizes from 2 to 10 meters and random reflection coefficients. We restrict to impulse responses with RT60 below a fixed threshold. We further augment all impulse responses with random resampling, which simulates changing room sizes with the same materials, and with random exponential decays, which approximate changing the uniform absorption level of the room material.

We experiment with no-dereverberation models, where, during training, reverberation is used simply as an augmentation and the foreground speech label is y = s ∗ h; and with partial-dereverberation models, where the label's room impulse response has its initial portion left unaltered and is then made to decay quickly, reducing RT60, by multiplying with an exponential decay function.

We train the neural model by optimizing, for each target y, the loss function

    L(y, ŷ) = λ_audio L_audio(y, ŷ) + λ_spectral L_spectral(Y, Ŷ),

where the audio loss is the L1 loss L_audio(y, ŷ) = |y − ŷ|. For the spectral loss L_spectral, let Y_{t,f} = |STFT(y)_{t,f}| and Ŷ_{t,f} = |STFT(ŷ)_{t,f}| be the STFT bin magnitudes; we set

    L_spectral(Y, Ŷ) = Σ_{t,f} w(f) ( λ_over 1[Ŷ_{t,f} ≥ Y_{t,f}] + λ_under 1[Ŷ_{t,f} < Y_{t,f}] ) |Y_{t,f} − Ŷ_{t,f}|,

where w(f) is a weighting towards high frequencies, and λ_under > λ_over biases the loss against underestimating speech magnitudes.
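To make the biased loss concrete, here is a minimal NumPy sketch of L_spectral as reconstructed above. The linear-ramp w(f) is purely illustrative, and since the decimal parts of λ_over and λ_under are not recoverable from the source, round values are used:

```python
import numpy as np

def biased_spectral_loss(Y: np.ndarray, Y_hat: np.ndarray,
                         lam_over: float = 2.0, lam_under: float = 13.0) -> float:
    """L1 loss on STFT magnitudes, asymmetric so that underestimating speech
    magnitudes (which causes muffling) costs more than overestimating them.

    Y, Y_hat: magnitude spectrograms of shape (T, F).
    """
    _, num_f = Y.shape
    w = np.linspace(0.5, 1.5, num_f)[None, :]       # illustrative high-frequency weighting
    over = (Y_hat >= Y).astype(Y.dtype)             # 1 where the estimate overshoots
    bias = lam_over * over + lam_under * (1.0 - over)
    return float(np.sum(w * bias * np.abs(Y - Y_hat)))

# Underestimating by 0.1 costs more than overestimating by the same amount:
Y = np.full((1, 4), 1.0)
print(biased_spectral_loss(Y, Y + 0.1) < biased_spectral_loss(Y, Y - 0.1))  # True
```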
For low-latency evaluation, we use 40 ms input frames (i.e. 640 samples at 16 kHz) with one-frame look-ahead. For each input chunk of samples, we run the model on the last 16384 samples in the input buffer. We use cross-over to eliminate artifacts between frames. Figure 3 illustrates the evaluation mechanism for two input chunks. Inference takes 0.65 seconds per second of audio on a V100 GPU.

[Figure 3: The inference-time mechanism. The convnet is always evaluated on the last second of the input buffer.]

3. Experiments and Results

For each model, we use 6 down-blocks in the U-Net, with the number of per-layer filters in each being 32, 64, 128, 256, 256, 256, and up-blocks symmetric to the down-blocks, for a total of ∼50M parameters. We multiply the foreground and background losses with weighting coefficients λ_fg = 2. and λ_bg = 0. for the foreground and background estimation, respectively. We take λ_audio = 1 and λ_spectral = 1., with (only for the foreground signal) λ_over = 2. and λ_under = 13. We have not used any hyper-parameter tuning techniques; most parameters, especially those used for augmentations, were set as sensible defaults and left unmodified.

We train each model for 700,000 iterations with a total mini-batch size of 112, using the ADAM optimizer with a learning rate of 1e-4, halved every 100,000 iterations. Training each model takes about 4 days on 8 V100 GPUs. Our implementation uses the MXNet [25] framework.

For human opinion tests, we use the methodology of ITU-T P.808, Subjective Evaluation of Speech Quality with a Crowdsourcing Approach [26], on Amazon Mechanical Turk. The MOS scores are based on 10 listenings each of the model's outputs on the 600 real and synthetic inputs in the INTERSPEECH 2020 DNS Challenge test set [16], which covers varied cases. For objective metrics, we evaluate wide-band Perceptual Evaluation of Speech Quality (PESQ) – ITU-T P.862.2 – [27] and the composite CSIG, CBAK, and COVL scores proposed in [28], on the 300 synthetic examples in the same test set.

Tables 1 and 2 show, respectively, subjective and objective metric evaluation results. Note that removing the semi-supervised conversational data has a strong effect on performance on real recordings, which tend to have more varied speech styles. The proposed model took 1st place in the 2020 Deep Noise Suppression Challenge's Non-Real-Time Track [16]; Table 3 shows the MOS evaluation provided by the challenge on the blind test set.

Method                           Full    Synth. no Reverb   Synth. with Reverb   Real Recordings
Proposed                         –       –                  –                    –
 without positional embeddings   3.789   4.092              3.548                3.758
 without background reverbs      3.768   4.119              3.573                3.690
 without biased loss             3.766   4.094              3.519                3.726
 without semi-supervised data    3.755   4.131              3.620                3.634
 without reverb augmentations    3.467   4.133              2.358                3.688
RNNoise [24]                     3.464   3.660              3.162                3.517
DNS Challenge Baseline [21]      3.439   3.703              3.120                3.466
Noisy                            3.432   3.568              3.183                3.489
Confidence interval              ±       ±                  ±                    ±

Table 1: Mean Opinion Score evaluation of different algorithms over the DNS Challenge non-blind test sets.

                                Synthetic without Reverb        Synthetic with Reverb
Method                          PESQ    CBAK    COVL    CSIG    PESQ    CSIG    CBAK    COVL
RNNoise [24]                    1.973   3.463   2.789   2.692   1.777   –       –       –
Proposed                        –       –       –       –       –       –       –       –
 without background reverbs     2.667   2.997   3.296   3.909   1.613   2.301   2.223   2.906
 without biased loss            2.457   2.904   3.106   3.749   1.545   2.258   2.070   2.671

Table 2: Objective evaluation of different algorithms over the DNS Challenge synthetic non-blind test sets. Note that the reference clean labels of the Synthetic-with-Reverb test contain reverberation, so the model that is trained to keep all reverberation performs best on this set.

Method      Full    Synthetic without Reverb   Synthetic with Reverb   Real Recs
Noisy       2.95    3.13                       2.64                    2.83
Proposed    –       –                          –                       –

Table 3: Mean Opinion Score evaluation provided by the DNS Challenge based on the blind test set. Full results are available in [16] (team ...).

4. Conclusion

We described new techniques that result in improvements to speech enhancement with large neural networks. The resulting PoCoNet speech enhancer is a large U-Net with DenseNet and self-attention blocks and frequency-positional embeddings. It is trained with a semi-supervised technique, partly on conversational data, with an extensive augmentation stack including reverberation, and with a loss function that is biased to preserve speech. Evaluation results show the quality improvement of the overall system due to each component and demonstrate the effectiveness of the introduced techniques for training large neural speech enhancers.

5. References

[1] Bingyin Xia and Changchun Bao. Speech enhancement with weighted denoising auto-encoder. In INTERSPEECH, pages 3444–3448, 2013.
[2] Yong Xu, Jun Du, Li-Rong Dai, and Chin-Hui Lee. A regression approach to speech enhancement based on deep neural networks.
IEEE/ACM Transactions on Audio, Speech, and Language Processing, 23(1):7–19, 2014.
[3] Felix Weninger, Hakan Erdogan, Shinji Watanabe, Emmanuel Vincent, Jonathan Le Roux, John R Hershey, and Björn Schuller. Speech enhancement with LSTM recurrent neural networks and its application to noise-robust ASR. In International Conference on Latent Variable Analysis and Signal Separation, pages 91–99. Springer, 2015.
[4] Kun Han, Yuxuan Wang, DeLiang Wang, William S Woods, Ivo Merks, and Tao Zhang. Learning spectral mapping for speech dereverberation and denoising. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 23(6):982–992, 2015.
[5] R. Giri, U. Isik, and A. Krishnaswamy. Attention wave-u-net for speech enhancement. In WASPAA, pages 249–253, 2019.
[6] D.S. Williamson, Y. Wang, and D. Wang. Complex ratio masking for monaural speech separation. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 24(3):483–492, 2016.
[7] Jonathan Le Roux, Gordon Wichern, Shinji Watanabe, Andy Sarroff, and John R Hershey. Phasebook and friends: Leveraging discrete representations for source separation. IEEE Journal of Selected Topics in Signal Processing, 13(2):370–382, 2019.
[8] Dario Rethage, Jordi Pons, and Xavier Serra. A wavenet for speech denoising. In ICASSP, pages 5069–5073. IEEE, 2018.
[9] Craig Macartney and Tillman Weyde. Improved speech enhancement with the wave-u-net. arXiv preprint arXiv:1811.11307, 2018.
[10] Francois G Germain, Qifeng Chen, and Vladlen Koltun. Speech denoising with deep feature losses. arXiv preprint arXiv:1806.10522, 2018.
[11] Yi Hu and Philipos C Loizou. Evaluation of objective quality measures for speech enhancement. IEEE Transactions on Audio, Speech, and Language Processing, 16(1):229–238, 2007.
[12] Yi Luo and Nima Mesgarani. Conv-TasNet: Surpassing ideal time–frequency magnitude masking for speech separation. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 27(8):1256–1266, 2019.
[13] Se Rim Park and Jin Won Lee. A fully convolutional neural network for speech enhancement. Proc. Interspeech 2017, pages 1993–1997, 2017.
[14] Arsha Nagrani, Joon Son Chung, and Andrew Zisserman. VoxCeleb: a large-scale speaker identification dataset. arXiv preprint arXiv:1706.08612, 2017.
[15] Y. Zhao, D. Wang, B. Xu, and T. Zhang. Late reverberation suppression using recurrent neural networks with long short-term memory. In International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5434–5438. IEEE, 2018.
[16] Chandan KA Reddy, Vishak Gopal, Ross Cutler, Ebrahim Beyrami, Roger Cheng, Harishchandra Dubey, Sergiy Matusevych, Robert Aichner, Ashkan Aazami, Sebastian Braun, et al. The INTERSPEECH 2020 deep noise suppression challenge: Datasets, subjective testing framework, and challenge results. arXiv preprint arXiv:2005.13981, 2020.
[17] B. Tolooshams, R. Giri, A. H. Song, U. Isik, and A. Krishnaswamy. Channel-attention dense u-net for multichannel speech enhancement. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 836–840, 2020.
[18] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7794–7803, 2018.
[19] Han Zhang, Ian Goodfellow, Dimitris Metaxas, and Augustus Odena. Self-attention generative adversarial networks. arXiv preprint arXiv:1805.08318, 2018.
[20] Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur.
LibriSpeech: an ASR corpus based on public domain audio books. In ICASSP, pages 5206–5210. IEEE, 2015.
[21] C. Reddy et al. The INTERSPEECH 2020 deep noise suppression challenge: Datasets, subjective speech quality and testing framework. arXiv preprint arXiv:2001.08662, 2020.
[22] Marco Jeub, Magnus Schafer, and Peter Vary. A binaural room impulse response database for the evaluation of dereverberation algorithms. In International Conference on Digital Signal Processing, pages 1–5. IEEE, 2009.
[23] Jont B Allen and David A Berkley. Image method for efficiently simulating small-room acoustics. The Journal of the Acoustical Society of America, 65(4):943–950, 1979.
[24] J.-M. Valin. A hybrid DSP/deep learning approach to real-time full-band speech enhancement. In Proceedings of IEEE Multimedia Signal Processing (MMSP) Workshop, 2018.
[25] Tianqi Chen, Mu Li, Yutian Li, Min Lin, Naiyan Wang, Minjie Wang, Tianjun Xiao, Bing Xu, Chiyuan Zhang, and Zheng Zhang. MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems. arXiv preprint arXiv:1512.01274, 2015.
[26] ITU-T. Recommendation P.808: Subjective evaluation of speech quality with a crowdsourcing approach, 2018.
[27] ITU-T. Perceptual evaluation of speech quality (PESQ): An objective method for end-to-end speech quality assessment of narrow-band telephone networks and speech codecs, 2001.
[28] Philipos C Loizou. Speech Enhancement: Theory and Practice. CRC Press, 2007.