PoCoNet: Better Speech Enhancement with Frequency-Positional Embeddings, Semi-Supervised Conversational Data, and Biased Loss
Umut Isik, Ritwik Giri, Neerad Phansalkar, Jean-Marc Valin, Karim Helwani, Arvindh Krishnaswamy
Amazon Web Services [email protected], [email protected]
Abstract
Neural network applications generally benefit from larger-sized models, but current large-scale speech enhancement networks often suffer from decreased robustness to the variety of real-world use cases beyond what is encountered in training data. We introduce several innovations that lead to better large neural networks for speech enhancement. The novel PoCoNet architecture is a convolutional neural network that, with the use of frequency-positional embeddings, is able to more efficiently build frequency-dependent features in the early layers. A semi-supervised method increases the amount of conversational training data by pre-enhancing noisy datasets, improving performance on real recordings. A new loss function biased towards preserving speech quality helps the optimization better match human perceptual opinions on speech quality. Ablation experiments and objective and human opinion metrics show the benefits of the proposed improvements.
1. Introduction
Neural network based approaches have greatly improved the output quality of speech enhancement systems [1, 2, 3, 4, 5]. These networks are typically trained in a supervised setting, with synthetic mixtures of clean speech and known noise clips, sometimes with synthetic reverberation added. Usually, the model is used to estimate a magnitude gain for each bin in the time-frequency domain representation of the noisy and/or reverberant mixture signal. Recent phase-aware models use a complex ratio mask instead of a magnitude gain [6, 7], while other approaches work directly in the waveform domain [8, 9, 10].

The speech enhancement problem has multiple challenges associated with it. First, the model needs to be robust to the multitude of different speech, recording, and noise conditions present in real-world usage. Second, clean speech data for training is limited in the public domain, with the biggest datasets coming from read material. Third, the task becomes increasingly difficult in low signal-to-noise ratio (SNR) cases; this can be helped by training larger models, which in turn makes the model more prone to fitting to the biases of the available dataset, decreasing robustness to other real-world conditions and making the first two challenges more pronounced. Fourth, the mismatch between human perception of sound quality and standard loss functions and metrics [11] can make well-optimized models perform worse in human evaluation.

We propose several architectural, data preparation, augmentation, and loss-function innovations that help meet the above challenges for large neural networks for speech enhancement.

Standard convolutional implementations in the time-frequency domain rely on 1D or 2D convnets. In the typical 1D architecture (e.g. Conv-TasNet [12]), the kernels move in the time direction and are fully connected in the frequency direction. These tend to have very large weight matrices in the early layers, where the architecture could benefit from a more hierarchical development of features. On the other hand, in standard 2D U-Net models where kernels move in both the time and frequency directions [13], early-layer activations are blind to which frequency they operate at: even when padding is used, these early features' receptive fields have not yet reached the edges of the time-frequency image. Our proposed architecture has the advantages of both options: it is a 2D U-Net (with DenseNet blocks and self-attention) with small kernels, and can therefore develop features hierarchically, but it can also take frequency information into account in early layers through the inclusion of frequency-positional embeddings.

On the data front, we scale up the amount of clean conversational data available for training by using a semi-supervised approach. The clean portion of the LibriSpeech dataset, our starting point, contains data only from audio books, which is not conversational. The larger VoxCeleb dataset [14], on the other hand, is from television broadcasts and contains background music and effects; some of the data is also highly reverberant. We use LibriSpeech-trained speech enhancement models to isolate the clean speech in VoxCeleb2 and eliminate reverberant clips, and show that adding this processed clean speech dataset to the training data improves the robustness of the model to conditions not well-represented in LibriSpeech.
To make the most effective use of the data, we also use an extensive data augmentation stack that also helps address specific failure modes. We apply synthetic reverberation using a library of recorded and synthetically generated room impulse responses, and train separate models to target the task with and without partial dereverberation. For non-dereverberating models, reverberation is added during training to the clean speech data as an augmentation before mixing. For training partially dereverberating models, we add, to the clean speech labels, a faster decaying version of the reverberation, as was done in [15].

We use L1 losses across the board to help deal with dataset noise, as a linear combination of two losses. The first is a new L1 loss on magnitudes which is biased to penalize underestimation of speech time-frequency bin magnitudes, and is also weighted towards high frequencies; this makes the output of the trained model better preserve speech quality and avoid muffling. The second is an L1 loss in the audio waveform domain, which is backpropagated through the STFT layer and complex multiplication to the estimated complex ratio mask values in the time-frequency domain.

To measure the performance of our model, we rely on Mean Opinion Score (MOS) subjective testing crowd-sourced on Amazon Mechanical Turk, using model outputs on the Deep Noise Suppression (DNS) challenge [16] pre-competition test set, as well as standard numerical metrics on the synthetic portion of the same test set. An ablation study shows the added improvement to human MOS and numerical metrics from each proposed component discussed above.
2. Method
Let s be the clean speech audio signal and let x = s ∗ h + n be the reverberated signal s ∗ h, where h is a room impulse response, with added noise n; let y be the denoised and/or dereverberated target signal. The neural model N takes as input the STFT of the reverberant and noisy mixture x and estimates the complex ratio mask that gives the target signal estimate as

    ŷ = ISTFT( N(STFT(x)) · STFT(x) ).

For the neural model N, we start with a fully-convolutional 2D U-Net architecture with self-attention layers and 4-layer DenseNet blocks at each level, similar to [17]. We take the convolutions to be causal in the time direction, but not in the frequency direction: padding is applied symmetrically in the frequency direction, as is usual in 2D convnets, but asymmetrically in the time direction, in the sense that it is only used at the edge of each layer corresponding to the early part in time. This helps preserve output quality at the late portion of the output, which is what is used in low-latency applications, since padding tends to hurt quality near edges and borders. Note that look-ahead is provided by the average-pooling layers, which are used instead of max-pooling. Figure 1 shows the overall architecture, while Figure 2 shows details of the DenseNet and attention blocks.

[Figure 1: Top two levels of the U-Net architecture, shown with frequency-positional embeddings and the STFT real and imaginary parts as inputs, and the real and imaginary parts of the complex ratio mask as outputs. We use a 6-level U-Net architecture.]

[Figure 2: Details of the DenseNet and attention block. Straight arrows are convolutions with batch normalization and ReLU non-linearity; curved arrows are concatenations.]

The self-attention blocks we use are the same as those used in [18, 19], except that the mechanism aggregates information only in the time direction, to increase efficiency during training and inference.

For early convolutional layers to be able to do frequency-aware processing, we concatenate a vector of positional embeddings to each time-frequency bin at the input layer of the model. The frequency-positional embedding vector for the time-frequency bin centered at (t, f) depends only on f and is defined as

    ρ(t, f) = ( cos(πf/F), cos(2πf/F), . . . , cos(2^(k−1)πf/F) ),

where F is the frequency bandwidth and k = 10.

For the clean signal s, we combine data from two sources. First, we take two subsets of the publicly available LibriVox data, totaling approximately 600 hours of speech: LibriSpeech-clean [20] (whose 'clean' clips were selected based on word error rates in ASR experiments, not directly by audio quality), as well as the subset of the LibriVox dataset filtered based on Mean Opinion Scores to form the DNS Challenge dataset [21]. The second source is VoxCeleb2, from which we use approximately 800 hours of data.

To be able to use this large and varied dataset, we first train two models on the LibriSpeech dataset described above. The first is a speech enhancement model that also does full dereverberation, trained to estimate the reverb-only portion h ∗ s − s, along with the clean signal s and the noise n. This model uses the same architecture as our proposed network, but with fewer filters, and early stopping to avoid overfitting. We use this model to estimate the direct-to-reverberant ratio (DRR) of each clip in VoxCeleb2 and filter out clips with DRR less than 30 dB. While this model is better at estimating DRR than more traditional methods, its clean signal estimates contain artifacts and are not suitable for training. Instead, we use a second model with the same architecture, trained to estimate h ∗ s and n only. We use this denoise-only model to filter out all clips with signal-to-noise ratio (SNR) less than 10 dB, and use its clean speech estimates as training data for subsequent experiments.
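To make the frequency-positional embedding concrete, the following is a minimal NumPy sketch; the function names and the channel layout are our own assumptions, as the paper does not specify an implementation. It builds the k cosine channels and concatenates them to a real/imaginary STFT input tensor:

```python
import numpy as np

def frequency_positional_embedding(num_freq_bins: int, k: int = 10) -> np.ndarray:
    """Cosine frequency-positional embedding: rho(t, f) depends only on f.

    Returns an array of shape (k, num_freq_bins) whose i-th channel is
    cos(2^i * pi * f / F) for i = 0, ..., k-1, with f the bin center
    frequency and F the full bandwidth.
    """
    f = np.arange(num_freq_bins) / max(num_freq_bins - 1, 1)  # f / F in [0, 1]
    multipliers = 2.0 ** np.arange(k)                          # 1, 2, 4, ..., 2^(k-1)
    return np.cos(np.pi * multipliers[:, None] * f[None, :])   # shape (k, F_bins)

def add_positional_channels(stft: np.ndarray, k: int = 10) -> np.ndarray:
    """Concatenate embeddings to a (2, T, F) real/imag STFT tensor -> (2 + k, T, F)."""
    _, num_frames, num_bins = stft.shape
    rho = frequency_positional_embedding(num_bins, k)          # (k, F)
    rho = np.broadcast_to(rho[:, None, :], (k, num_frames, num_bins))
    return np.concatenate([stft, rho], axis=0)

# Example: roughly 1 second at 16 kHz with a 512-point STFT (257 bins).
x = np.random.randn(2, 63, 257).astype(np.float32)  # placeholder real/imag STFT
print(add_positional_channels(x).shape)              # (12, 63, 257)
```

Because the embedding depends only on f, the same (k, F) matrix is broadcast across all time frames, so the cost of the extra input channels is negligible.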
For noise data, we filter the AudioSet dataset, selecting clips whose tags in the AudioSet ontology correspond to sounds that a speech enhancement system would be expected to remove, while excluding any clips with tags related to sounds that humans make.

We found that, even though most AudioSet tags correspond to non-stationary noise categories, a random 1-second chunk used in training will more often than not contain no non-stationary noise. We therefore compute, for each chunk, the energy levels in 50 ms windows, and upsample, during training, chunks that have a standard deviation of windowed energy of at least 3 dB. This increases the prevalence of non-stationary noise during training.
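A minimal sketch of this selection criterion follows; the window length and threshold are as stated above, while the function name, the energy floor, and the use of non-overlapping windows are our own assumptions:

```python
import numpy as np

def is_nonstationary(chunk: np.ndarray, sample_rate: int = 16000,
                     win_ms: float = 50.0, min_std_db: float = 3.0) -> bool:
    """Flag a noise chunk for upsampling when its 50 ms windowed energy,
    expressed in dB, has a standard deviation of at least `min_std_db`."""
    win = int(sample_rate * win_ms / 1000)
    n_windows = len(chunk) // win
    frames = chunk[: n_windows * win].reshape(n_windows, win)
    energy = np.sum(frames ** 2, axis=1)
    energy_db = 10.0 * np.log10(np.maximum(energy, 1e-12))  # floor avoids log(0)
    return float(np.std(energy_db)) >= min_std_db

# Example: a stationary noise chunk is not flagged; an impulsive one is.
rng = np.random.default_rng(0)
stationary = rng.standard_normal(16000)
impulsive = stationary.copy()
impulsive[4000:4800] *= 10.0                 # one loud 50 ms burst
print(is_nonstationary(stationary), is_nonstationary(impulsive))  # False True
```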
We use the following augmentation stack. Unless specified otherwise, distributions are uniform in the given number ranges.

• EQ. Random high- and low-shelf EQ filters, with center frequency chosen uniformly in the logarithmic domain between 40 and 8000 Hz and gain between ± dB; and two random EQ bell curves per datapoint, symmetric in the log domain, with Q-value between 0.5 and 1.5 and frequency chosen from the same interval as the shelf EQ. Randomized and applied to speech and noise separately.

• Pitch shifts. Random resampling with ± of the original sample rate.

• Clipping. Random clipping between 0.5 and 1 of the peak value of the signal, applied 10% of the time.

• Empty buffer simulation. Random deletion of the first 0.5 to 1 of the input signal, to simulate a partially filled buffer in low-latency evaluation.

• Level and Silence. We skip datapoints with foreground RMS less than -38 dBFS (dB relative to a full scale of 1.0) and normalize each signal to an RMS value of -20 dBFS. We then apply a random volume reduction between -30 and 0 dB to the background, normalize the mix to -20 dBFS RMS, then apply a random amplification between -25 and 5 dB to everything. We additionally use silence as the foreground 3% of the time. (A sketch of this step follows the list.)

• Band-limiting. To make the model robust to cases where the input signal is band-limited, we apply a low-pass filter at a frequency between 4 and 7 kHz: 2.5% of the time to the background only, 2.5% of the time to the foreground, and 5% of the time to both.

• Reverberation. Used both as an augmentation and for datapoint creation, as described below.
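A minimal sketch of the Level and Silence step, under our reading of the bullet above; the function names are ours, and since it is not specified here whether the final random amplification is also applied to the labels, the sketch returns only the mix:

```python
import numpy as np

def rms_dbfs(x: np.ndarray) -> float:
    """RMS level in dB relative to a full scale of 1.0."""
    return 20.0 * np.log10(np.sqrt(np.mean(x ** 2)) + 1e-12)

def normalize_to(x: np.ndarray, target_dbfs: float = -20.0) -> np.ndarray:
    """Scale x so that its RMS level equals target_dbfs."""
    return x * 10.0 ** ((target_dbfs - rms_dbfs(x)) / 20.0)

def level_augment(fg: np.ndarray, bg: np.ndarray, rng: np.random.Generator):
    """Skip quiet foregrounds, normalize levels, attenuate the background,
    re-normalize the mix, then apply a global random gain."""
    if rms_dbfs(fg) < -38.0:
        return None                                        # skip quiet datapoints
    fg = normalize_to(fg)                                  # foreground to -20 dBFS
    bg = normalize_to(bg)                                  # background to -20 dBFS
    bg = bg * 10.0 ** (rng.uniform(-30.0, 0.0) / 20.0)     # random background turn-down
    mix = normalize_to(fg + bg)                            # mix back to -20 dBFS RMS
    return mix * 10.0 ** (rng.uniform(-25.0, 5.0) / 20.0)  # global random gain

rng = np.random.default_rng(0)
fg, bg = rng.standard_normal(16000) * 0.1, rng.standard_normal(16000) * 0.05
out = level_augment(fg, bg, rng)
print(None if out is None else round(rms_dbfs(out), 1))
```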
When adding reverberation, we first identify, in each room impulse response (RIR), the portion corresponding to the direct path, i.e. the 'first tap', and scale and shift the RIR so that the first tap is at t = 0 and has height 1. So we have x = s ∗ (h_0 + h_{>0}) + n, where h_0 is a single tap at time zero. We then apply a gain to all taps except the first, by a value between -25 and 0 dB. Also, 60% of the time, we add reverberation via the same impulse response to the noise signal as well, with a separate downward scaling of the non-first taps. Hence, the model input becomes

    x = s ∗ (h_0 + α h_{>0}) + n ∗ (h_0 + β h_{>0}).

We use both real-recorded and synthetic RIRs. For real impulse responses, we use the Aachen Impulse Response dataset [22], consisting of 214 RIR recordings. For synthetic RIRs, we generate a library of 10,000 RIRs using the image method [23], with random rectangular rooms with sizes from 2 to 10 meters and random reflection coefficients. We restrict to impulse responses with RT60 below a fixed threshold. We further augment all impulse responses with random resampling, which simulates changing room sizes with the same materials, and with random exponential decays, which approximate changing the uniform absorption level of the room material.

We experiment with no-dereverberation models, where, during training, reverberation is used simply as an augmentation and the foreground speech label is y = s ∗ h; and with partial-dereverberation models, where the label's room impulse response has its initial portion left unaltered and is then made to decay quickly, reducing RT60, by multiplying with an exponential decay function.

We train the neural model by optimizing, for each target y, the loss function

    L(y, ŷ) = λ_audio L_audio(y, ŷ) + λ_spectral L_spectral(Y, Ŷ),

where the audio loss is the L1 loss L_audio(y, ŷ) = |y − ŷ|. For the spectral loss L_spectral, let Y_{t,f} = |STFT(y)_{t,f}| and Ŷ_{t,f} = |STFT(ŷ)_{t,f}| be the STFT bin magnitudes; we set

    L_spectral(Y, Ŷ) = Σ_{t,f} w(f) ( λ_over 1[Ŷ_{t,f} ≥ Y_{t,f}] + λ_under 1[Ŷ_{t,f} < Y_{t,f}] ) |Y_{t,f} − Ŷ_{t,f}|,

where w(f) is a weighting towards high frequencies, and λ_under > λ_over biases the loss against underestimating speech magnitudes.
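To make the biased loss concrete, here is a minimal NumPy sketch of L_spectral as reconstructed above. The linear-ramp w(f) is purely illustrative, and since the decimal parts of λ_over and λ_under are not recoverable from the source, round values are used:

```python
import numpy as np

def biased_spectral_loss(Y: np.ndarray, Y_hat: np.ndarray,
                         lam_over: float = 2.0, lam_under: float = 13.0) -> float:
    """L1 loss on STFT magnitudes, asymmetric so that underestimating speech
    magnitudes (which causes muffling) costs more than overestimating them.

    Y, Y_hat: magnitude spectrograms of shape (T, F).
    """
    _, num_f = Y.shape
    w = np.linspace(0.5, 1.5, num_f)[None, :]       # illustrative high-frequency weighting
    over = (Y_hat >= Y).astype(Y.dtype)             # 1 where the estimate overshoots
    bias = lam_over * over + lam_under * (1.0 - over)
    return float(np.sum(w * bias * np.abs(Y - Y_hat)))

# Underestimating by 0.1 costs more than overestimating by the same amount:
Y = np.full((1, 4), 1.0)
print(biased_spectral_loss(Y, Y + 0.1) < biased_spectral_loss(Y, Y - 0.1))  # True
```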
For low-latency evaluation, we use 40 ms input frames (i.e. 640 samples at 16 kHz) with one-frame look-ahead. For each input chunk of samples, we run the model on the last 16384 samples in the input buffer. We use cross-over to eliminate artifacts between frames. Figure 3 illustrates the evaluation mechanism for two input chunks. Inference takes 0.65 seconds per second of audio on a V100 GPU.

[Figure 3: The inference-time mechanism. The convnet is always evaluated on the last second of the input buffer.]

3. Experiments and Results

For each model, we use 6 down-blocks in the U-Net, with the number of per-layer filters in each being 32, 64, 128, 256, 256, 256, and up-blocks symmetric to the down-blocks, for a total of ∼50M parameters. We multiply the foreground and background losses with weighting coefficients λ_fg = 2. and λ_bg = 0. for the foreground and background estimation, respectively. We take λ_audio = 1 and λ_spectral = 1., with (only for the foreground signal) λ_over = 2. and λ_under = 13. We have not used any hyper-parameter tuning techniques; most parameters, especially those used for augmentations, were set as sensible defaults and left unmodified.

We train each model for 700,000 iterations with a total mini-batch size of 112, using the ADAM optimizer with a learning rate of 1e-4, halved every 100,000 iterations. Training each model takes about 4 days on 8 V100 GPUs. Our implementation uses the MXNet [25] framework.

For human opinion tests, we use the methodology of ITU-T P.808, Subjective Evaluation of Speech Quality with a Crowdsourcing Approach [26], on Amazon Mechanical Turk. The MOS scores are based on 10 listenings each of the model's outputs on the 600 real and synthetic inputs in the INTERSPEECH 2020 DNS Challenge test set [16], which covers varied cases. For objective metrics, we evaluate wide-band Perceptual Evaluation of Speech Quality (PESQ) – ITU-T P.862.2 – [27] and the composite CSIG, CBAK, and COVL scores proposed in [28], on the 300 synthetic examples in the same test set.

Tables 1 and 2 show, respectively, subjective and objective metric evaluation results. Note that removing the semi-supervised conversational data has a strong effect on performance on real recordings, which tend to have more varied speech styles. The proposed model took 1st place in the 2020 Deep Noise Suppression Challenge's Non-Real-Time Track [16]; Table 3 shows the MOS evaluation provided by the challenge on the blind test set.

Method                           Full    Synth. no Reverb   Synth. with Reverb   Real Recordings
Proposed                         –       –                  –                    –
 without positional embeddings   3.789   4.092              3.548                3.758
 without background reverbs      3.768   4.119              3.573                3.690
 without biased loss             3.766   4.094              3.519                3.726
 without semi-supervised data    3.755   4.131              3.620                3.634
 without reverb augmentations    3.467   4.133              2.358                3.688
RNNoise [24]                     3.464   3.660              3.162                3.517
DNS Challenge Baseline [21]      3.439   3.703              3.120                3.466
Noisy                            3.432   3.568              3.183                3.489
Confidence interval              ±       ±                  ±                    ±

Table 1: Mean Opinion Score evaluation of different algorithms over the DNS Challenge non-blind test sets.

                                Synthetic without Reverb        Synthetic with Reverb
Method                          PESQ    CBAK    COVL    CSIG    PESQ    CSIG    CBAK    COVL
RNNoise [24]                    1.973   3.463   2.789   2.692   1.777   –       –       –
Proposed                        –       –       –       –       –       –       –       –
 without background reverbs     2.667   2.997   3.296   3.909   1.613   2.301   2.223   2.906
 without biased loss            2.457   2.904   3.106   3.749   1.545   2.258   2.070   2.671

Table 2: Objective evaluation of different algorithms over the DNS Challenge synthetic non-blind test sets. Note that the reference clean labels of the Synthetic-with-Reverb test contain reverberation, so the model that is trained to keep all reverberation performs best on this set.

Method      Full    Synthetic without Reverb   Synthetic with Reverb   Real Recs
Noisy       2.95    3.13                       2.64                    2.83
Proposed    –       –                          –                       –

Table 3: Mean Opinion Score evaluation provided by the DNS Challenge based on the blind test set. Full results are available in [16] (team ...).

4. Conclusion

We described new techniques that result in improvements to speech enhancement with large neural networks. The resulting PoCoNet speech enhancer is a large U-Net with DenseNet and self-attention blocks and frequency-positional embeddings. It is trained with a semi-supervised technique, partly on conversational data, with an extensive augmentation stack including reverberation, and with a loss function that is biased to preserve speech. Evaluation results show the quality improvement of the overall system due to each component and demonstrate the effectiveness of the introduced techniques for training large neural speech enhancers.

5. References

[1] Bingyin Xia and Changchun Bao. Speech enhancement with weighted denoising auto-encoder. In INTERSPEECH, pages 3444–3448, 2013.
[2] Yong Xu, Jun Du, Li-Rong Dai, and Chin-Hui Lee. A regression approach to speech enhancement based on deep neural networks.
IEEE/ACM Transactions on Audio, Speech, and Language Processing, 23(1):7–19, 2014.
[3] Felix Weninger, Hakan Erdogan, Shinji Watanabe, Emmanuel Vincent, Jonathan Le Roux, John R Hershey, and Björn Schuller. Speech enhancement with LSTM recurrent neural networks and its application to noise-robust ASR. In International Conference on Latent Variable Analysis and Signal Separation, pages 91–99. Springer, 2015.
[4] Kun Han, Yuxuan Wang, DeLiang Wang, William S Woods, Ivo Merks, and Tao Zhang. Learning spectral mapping for speech dereverberation and denoising. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 23(6):982–992, 2015.
[5] R. Giri, U. Isik, and A. Krishnaswamy. Attention wave-u-net for speech enhancement. In WASPAA, pages 249–253, 2019.
[6] D.S. Williamson, Y. Wang, and D. Wang. Complex ratio masking for monaural speech separation. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 24(3):483–492, 2016.
[7] Jonathan Le Roux, Gordon Wichern, Shinji Watanabe, Andy Sarroff, and John R Hershey. Phasebook and friends: Leveraging discrete representations for source separation. IEEE Journal of Selected Topics in Signal Processing, 13(2):370–382, 2019.
[8] Dario Rethage, Jordi Pons, and Xavier Serra. A wavenet for speech denoising. In ICASSP, pages 5069–5073. IEEE, 2018.
[9] Craig Macartney and Tillman Weyde. Improved speech enhancement with the wave-u-net. arXiv preprint arXiv:1811.11307, 2018.
[10] Francois G Germain, Qifeng Chen, and Vladlen Koltun. Speech denoising with deep feature losses. arXiv preprint arXiv:1806.10522, 2018.
[11] Yi Hu and Philipos C Loizou. Evaluation of objective quality measures for speech enhancement. IEEE Transactions on Audio, Speech, and Language Processing, 16(1):229–238, 2007.
[12] Yi Luo and Nima Mesgarani. Conv-TasNet: Surpassing ideal time–frequency magnitude masking for speech separation. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 27(8):1256–1266, 2019.
[13] Se Rim Park and Jin Won Lee. A fully convolutional neural network for speech enhancement. Proc. Interspeech 2017, pages 1993–1997, 2017.
[14] Arsha Nagrani, Joon Son Chung, and Andrew Zisserman. VoxCeleb: a large-scale speaker identification dataset. arXiv preprint arXiv:1706.08612, 2017.
[15] Y. Zhao, D. Wang, B. Xu, and T. Zhang. Late reverberation suppression using recurrent neural networks with long short-term memory. In International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5434–5438. IEEE, 2018.
[16] Chandan KA Reddy, Vishak Gopal, Ross Cutler, Ebrahim Beyrami, Roger Cheng, Harishchandra Dubey, Sergiy Matusevych, Robert Aichner, Ashkan Aazami, Sebastian Braun, et al. The INTERSPEECH 2020 deep noise suppression challenge: Datasets, subjective testing framework, and challenge results. arXiv preprint arXiv:2005.13981, 2020.
[17] B. Tolooshams, R. Giri, A. H. Song, U. Isik, and A. Krishnaswamy. Channel-attention dense u-net for multichannel speech enhancement. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 836–840, 2020.
[18] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7794–7803, 2018.
[19] Han Zhang, Ian Goodfellow, Dimitris Metaxas, and Augustus Odena. Self-attention generative adversarial networks. arXiv preprint arXiv:1805.08318, 2018.
[20] Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur.
LibriSpeech: an ASR corpus based on public domain audio books. In ICASSP, pages 5206–5210. IEEE, 2015.
[21] C. Reddy et al. The INTERSPEECH 2020 deep noise suppression challenge: Datasets, subjective speech quality and testing framework. arXiv preprint arXiv:2001.08662, 2020.
[22] Marco Jeub, Magnus Schafer, and Peter Vary. A binaural room impulse response database for the evaluation of dereverberation algorithms. In International Conference on Digital Signal Processing, pages 1–5. IEEE, 2009.
[23] Jont B Allen and David A Berkley. Image method for efficiently simulating small-room acoustics. The Journal of the Acoustical Society of America, 65(4):943–950, 1979.
[24] J.-M. Valin. A hybrid DSP/deep learning approach to real-time full-band speech enhancement. In Proceedings of IEEE Multimedia Signal Processing (MMSP) Workshop, 2018.
[25] Tianqi Chen, Mu Li, Yutian Li, Min Lin, Naiyan Wang, Minjie Wang, Tianjun Xiao, Bing Xu, Chiyuan Zhang, and Zheng Zhang. MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems. arXiv preprint arXiv:1512.01274, 2015.
[26] ITU-T. Recommendation P.808: Subjective evaluation of speech quality with a crowdsourcing approach, 2018.
[27] ITU-T. Perceptual evaluation of speech quality (PESQ): An objective method for end-to-end speech quality assessment of narrow-band telephone networks and speech codecs, 2001.
[28] Philipos C Loizou. Speech Enhancement: Theory and Practice. CRC Press, 2007.