Handling Background Noise in Neural Speech Generation
Tom Denton, Alejandro Luebs, Michael Chinen, Felicia S. C. Lim, Andrew Storus, Hengchin Yeh, W. Bastiaan Kleijn, Jan Skoglund

Google LLC, San Francisco, CA
Victoria University of Wellington, NZ
ABSTRACT
Recent advances in neural-network based generative modeling of speech have shown great potential for speech coding. However, the performance of such models drops when the input is not clean speech, e.g., in the presence of background noise, preventing their use in practical applications. In this paper we examine the reason and discuss methods to overcome this issue. Placing a denoising preprocessing stage before feature extraction, combined with using clean speech as the training target, is shown to be the best performing strategy.
1. INTRODUCTION
Autoregressive neural synthesis systems are based on the idea that the speech signal's probability distribution can be formulated as a scalar autoregressive structure, where the probability of each speech sample s_t is conditioned on previous samples and a set of conditioning features (spectral information, pitch, etc.), θ_t:

p(s_t | s_{t-1}, s_{t-2}, ..., θ_t).   (1)

The paradigm was first introduced for text-to-speech using the WaveNet [1] architecture. Soon thereafter WaveNet was shown to be beneficial also for low bit rate speech coding in [2], which was the first coder using neural generative synthesis. Another example of a codec using WaveNet for synthesis generation is [3]. Since then other autoregressive generators with lower complexity have been introduced, and codecs based on them include [4], based on SampleRNN [5], and [6], based on WaveRNN [7].

However, the reproduction of real-world speech signals by generative models is still a challenge. Coding real-world speech signals with a neural vocoder requires solving a number of problems simultaneously. Foremost is the handling of background noise, but a successful system must also be able to reliably reproduce speech from arbitrary speakers using a low bit rate input stream, ideally with a model small and fast enough to run on a standard smartphone. High quality has been achieved only for clean input signals and, to date, no coding performance has been reported for noisy speech.

The difficulty of coding noisy signals can perhaps be explained by the signal structure, where the signal to be coded is the sum of a clean speech signal and an interfering signal. The autoregressive architecture is a good match for the structure of speech, but the addition of a second signal removes this match. We note that this phenomenon is well known in linear modeling: the sum of two signals generated by linear autoregressive systems cannot be modeled efficiently with one autoregressive model unless its order is infinite [8]. This suggests that to reproduce both the speech and the additive signal with high quality a significantly larger model may be needed. Resisting the urge to increase the model size, we instead performed experiments to find out whether there is a better training and inference strategy to improve robustness to background noise, without changing the network configuration.

To establish best practices for handling noise in a neural vocoder system, we run two sets of tests. The first uses large models with no reduction of bit rate in the inputs. This allows us to understand how the systems respond to noise in the best-case scenario, without concern for the quantization schemes or model pruning techniques used. We find that placing a denoiser in front of a system trained on a large database of clean speech works best.

To evaluate real-world, end-to-end performance, we run an additional listening test using conversational speech at varying signal-to-noise ratios (SNR), with both the vocoder and denoiser models optimized for on-device performance. In this second test we also marginalize by noise SNR and type of noise to demonstrate that the combination of a pruned ConvTASNet and WaveRNN system works well in all circumstances.
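To make the autoregressive structure of Eq. (1) concrete, the toy sketch below (our own, not the paper's model; the linear predictor, dimensions, and noise level are placeholders) shows the basic generation loop: each output sample is drawn conditioned on the previous samples and on the conditioning features of the current frame.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_step(history, features, weights):
    # Toy stand-in for the network behind Eq. (1): a linear predictor over the
    # past samples plus a conditioning term, with a small amount of sampling noise.
    mean = history @ weights["ar"] + features @ weights["cond"]
    return mean + 0.01 * rng.standard_normal()

def synthesize(cond_frames, order=16, samples_per_frame=640):
    weights = {"ar": rng.standard_normal(order) * 0.05,
               "cond": rng.standard_normal(cond_frames.shape[1]) * 0.05}
    out = list(np.zeros(order))                    # zero history to start
    for frame in cond_frames:                      # conditioning features theta_t
        for _ in range(samples_per_frame):         # one scalar sample at a time
            s_t = sample_step(np.array(out[-order:]), frame, weights)
            out.append(s_t)
    return np.array(out[order:])

audio = synthesize(rng.standard_normal((5, 160)))  # 5 frames of 160-dim features
```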
2. NOISE HANDLING STRATEGIES
In deep learning it is common practice to apply augmentations to training data to increase the range of conditions familiar to a model [9]. In particular, the artificial addition of noise signals is a common and effective step in audio event classification. This allows the model to train on a wider variety of realistic scenarios given a relatively clean set of ground truth data. In audio event classification, the event labels are unchanged by the augmentations, and the goal is to train a system which is invariant under the full set of augmentations. Augmentations can be static (in which a new static dataset is used for training) or dynamic (in which augmentations are applied 'on-the-fly' as new training examples are consumed). A dataset with dynamic augmentations is effectively infinite, though one still requires a base dataset with sufficient variation to capture the full variety of (clean) signals to be modeled.

The neural vocoder has two sets of inputs during training: the aligned conditioning vectors and the teacher-forced autoregressive input signal. The conditioning should match what is available at inference time (e.g., noisy melspectra), while the autoregressive input is what we measure loss against and, therefore, what we train the model to produce. Observe that these do not need to be derived from the same input signal: in particular, if the input conditioning vectors are calculated from noise-augmented clean speech, we can use the raw clean speech as the teacher-forced training target. The result is a system which learns to produce denoised audio from noisy conditioning. This is similar to how augmentation is used in classification problems: the result is (hopefully) a model which is close to invariant under the addition of noise.

As an alternative, we can include a denoising model in the encoder. During inference, we do not have access to the underlying clean speech, but we can apply a denoiser to push the conditioning closer to the speech manifold. A 'perfect' denoiser would then allow a model trained solely on clean speech to perform well, since all interfering noise has been removed. In reality, no denoiser is perfect, and will miss some noise and introduce artifacts. Preprocessing with a denoiser is known to work well with classical low bit rate vocoding systems [10], leading us to believe that a denoiser could help with a neural vocoder as well.

In this paper, we denote the different training regimes as X2Y, where X describes the conditioning input and Y describes the autoregressive input. The regimes we consider are listed below; a sketch of how an n2c training pair is constructed follows the list.

• c2c: 'Clean-to-Clean': trained with clean, studio-recorded audio for both the conditioning and the autoregressive inputs.
• n2n: 'Noisy-to-Noisy': trained with noisy inputs, using both larger, noisier speech databases and dynamic noise augmentations. This noisy data is used for both the conditioning and the autoregressive inputs.
• n2c: 'Noisy-to-Clean': trained with noise-augmented speech from a studio-recorded database. The noise-augmented speech is used to compute the conditioning inputs, and the original, unaugmented speech is used as the training target.
• dc2c: 'Denoised Clean-to-Clean': the same training regime as c2c, but a denoiser is applied during inference.
• dn2n: 'Denoised Noisy-to-Noisy': the same training regime as n2n, but a denoiser is applied during inference.
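As an illustration of the n2c regime, the sketch below (our own; the feature extractor is a crude stand-in for the 160-dimensional log-melspectrum front end) mixes noise into a clean utterance at a random SNR in the 1-40 dB range used for training, computes conditioning features from the noisy mixture, and keeps the clean waveform as the teacher-forced target.

```python
import numpy as np

rng = np.random.default_rng(0)

def mix_at_snr(clean, noise, snr_db):
    # Scale the noise so the clean/noise power ratio equals snr_db.
    clean_power = np.mean(clean ** 2) + 1e-12
    noise_power = np.mean(noise ** 2) + 1e-12
    gain = np.sqrt(clean_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return clean + gain * noise

def log_spectrum(x, frame_len=640, n_bins=160):
    # Stand-in feature extractor: log-magnitude spectra truncated to n_bins
    # (a real front end would apply a mel filterbank; this keeps the sketch short).
    frames = x[: len(x) // frame_len * frame_len].reshape(-1, frame_len)
    mag = np.abs(np.fft.rfft(frames, axis=-1))[:, :n_bins]
    return np.log(mag + 1e-6)

def make_n2c_example(clean, noise):
    snr_db = rng.uniform(1.0, 40.0)           # SNR range used in the paper
    noisy = mix_at_snr(clean, noise, snr_db)
    conditioning = log_spectrum(noisy)        # features come from the noisy mix
    target = clean                            # teacher-forced target stays clean
    return conditioning, target

clean = rng.standard_normal(16000)            # 1 s of placeholder "speech"
noise = rng.standard_normal(16000)
cond, target = make_n2c_example(clean, noise)
```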
3. NEURAL VOCODER ARCHITECTURE
In this section we describe the architecture of the neural vocoder. The parameter settings of the scheme are provided in Section 3.1.

The vocoder consists of an encoder and a decoder. The encoder simply converts the input signal to log melspectra (e.g., [11]). The objective of the decoder is to turn these melspectra back into a high-quality speech waveform.

The overall structure of the decoder is similar to WaveRNN, with some changes which result in a leaner model, suitable for subsequent deployment to low-resource environments. To summarize the differences, we use a single-pass GRU, predict output samples using a mixture-of-logistics distribution, and predict M frequency-band samples at a time. (All are described in full below.)

The decoder first consumes the log melspectra with a conditioning stack, consisting of:

1. an input 1D convolution (which is non-causal, allowing a fixed amount of delay based on the conditioning frame size),
2. three dilated causal 1D convolutions (allowing a large receptive field over the past),
3. three transpose convolutions, which upsample to narrow the gap between the input conditioning rate and the vocoder's output sample rate, and
4. a final tiled upsampling so that the final output is at the vocoder's output sample rate exactly.

The autoregressive network consists of a multi-band WaveGRU, which is based on gated recurrent units (GRU) [12]. We split the target audio (with sample rate S) using a cascade of quadrature mirror filters, dividing the signal evenly into M = 2^k frequency bands. Similar to [13], this allows the system to predict M samples at a time, greatly reducing the computational load and increasing the effective receptive field, resulting in a slight quality improvement. Thus, for our M-band WaveGRU, M samples are generated simultaneously at an update rate of S/M Hz, one sample for each frequency band.
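The rate bookkeeping of the conditioning stack can be made concrete with a small sketch (ours; linear interpolation and repetition stand in for the learned transpose convolutions and the tiling layer, and the rates follow the parameter settings of Section 3.1): 25 Hz conditioning frames are doubled three times to 200 Hz and then tiled to the 4 kHz WaveGRU update rate.

```python
import numpy as np

def upsample_conditioning(cond, gru_rate_hz=4000, cond_rate_hz=25):
    # cond: [num_frames, channels] conditioning output at cond_rate_hz.
    # Three doublings (stand-ins for the transpose convolutions) take the
    # rate from 25 Hz to 200 Hz; linear interpolation is only a placeholder
    # for the learned upsampling filters.
    x = cond
    for _ in range(3):
        n = x.shape[0]
        idx = np.arange(2 * n) / 2.0
        lo = np.floor(idx).astype(int)
        hi = np.minimum(lo + 1, n - 1)
        frac = (idx - lo)[:, None]
        x = (1 - frac) * x[lo] + frac * x[hi]
    # Tile (repeat) each 200 Hz vector to reach the WaveGRU update rate.
    repeats = gru_rate_hz // (cond_rate_hz * 8)   # 4000 / 200 = 20
    return np.repeat(x, repeats, axis=0)

cond = np.random.randn(25, 512)                   # 1 s of 25 Hz conditioning
upsampled = upsample_conditioning(cond)           # -> [4000, 512], one per GRU step
print(upsampled.shape)
```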
For each update, the state of the GRU network is projected onto an M × K × 3-dimensional space that defines M parameter sets, each set corresponding to a mixture-of-logistics distribution with K mixture components for a particular frequency band. A sample for each band is then drawn by first selecting the mixture component (a logistic distribution) according to its probability and then drawing the sample from this logistic distribution by transforming a sample from a uniform distribution [14]. For each set of M output samples a synthesis filter bank produces M subsequent time-domain samples, which results in an output with sampling rate S Hz.

The input to the WaveGRU consists of the sum of an autoregressive and a conditioning component. The autoregressive component is a projection of the last step's M frequency-band samples onto a vector of the dimensionality of the WaveGRU state. The second component is the output of the conditioning stack (which has the same dimensionality as the WaveGRU state).

The training of the WaveGRU network and the conditioning stack is performed simultaneously using teacher forcing. That is, the past signal samples that are provided as input to the GRU are ground-truth signal samples. The training objective is maximizing the log likelihood (cross entropy) of the ground-truth samples.

Fig. 1. WaveGRU Neural Vocoder architecture.

3.1. Parameter Settings

The neural vocoder operates on 160-dimensional log melspectra computed from 80 ms windows at an update rate of 25 Hz. The system uses four frequency bands, such that the overall update rate of the WaveGRU system is 4 kHz. The conditioning stack uses 512 hidden states and a single frame (40 ms) of lookahead. The dilated convolutional layers have kernel size two, and dilations of one, two and four, respectively. Each of the three upsampling transpose convolutions doubles the rate, so that the output of the third upsampling layer is at 200 Hz. This is then tiled to match the GRU rate. The GRU state is 1024-dimensional, and eight mixture-of-logistics components are used for each output sample distribution.

For training the models we used speech from the publicly available sets WSJ0 [15] and LibriTTS [16], as well as Google proprietary TTS recordings of English speech. These were mixed with additive noise from Freesound [17] and a set of recordings captured in a variety of environments, including busy streets, cafés and offices. During training of the n2n and n2c models, noise samples are dynamically mixed into training samples with a random SNR chosen uniformly between 1 dB and 40 dB.
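As an illustration of the per-band sampling step, the following sketch (our own; the four bands and eight mixture components follow the settings above, and all parameter values are random placeholders) draws one sample per frequency band from a mixture of logistics by first choosing a component and then applying the inverse logistic CDF to a uniform variate.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_mixture_of_logistics(logit_weights, means, log_scales):
    # logit_weights, means, log_scales: arrays of shape [M, K]
    # (M frequency bands, K mixture components per band).
    weights = np.exp(logit_weights - logit_weights.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    samples = np.empty(weights.shape[0])
    for m in range(weights.shape[0]):
        k = rng.choice(weights.shape[1], p=weights[m])   # pick a component
        u = rng.uniform(1e-5, 1.0 - 1e-5)                # uniform variate
        # Inverse CDF of the logistic distribution.
        samples[m] = means[m, k] + np.exp(log_scales[m, k]) * np.log(u / (1.0 - u))
    return samples

M, K = 4, 8                                              # four bands, eight components
params = rng.standard_normal((M, K, 3))                  # placeholder network output
band_samples = sample_mixture_of_logistics(params[..., 0], params[..., 1],
                                            params[..., 2] - 2.0)
```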
For the second listening test, we use a model optimized for on-device performance at a low bit rate. We apply a Karhunen-Loève transform (KLT) to each melspectrum, and then apply vector quantization to achieve a 3 kbps rate. Meanwhile, the model is pruned to 92% sparsity in most layers using iterative magnitude pruning [18]. We use 4x4 structured sparse blocks to allow fast inference using SIMD instructions [7]. For the main GRU layer, we use a fixed block-diagonal sparsity pattern with 16 blocks for each of the three GRU matrices, corresponding to 93.75% sparsity. We find this has no impact on output quality relative to magnitude pruning, and greatly improves training speed.
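For intuition, the block-diagonal constraint on the recurrent GRU matrices can be sketched as follows (our own illustration, not the training code; with a 1024-dimensional state and 16 diagonal blocks, 15/16 = 93.75% of each matrix is forced to zero):

```python
import numpy as np

def block_diagonal_mask(dim=1024, num_blocks=16):
    # Returns a {0, 1} mask keeping only num_blocks square blocks on the diagonal.
    mask = np.zeros((dim, dim), dtype=np.float32)
    block = dim // num_blocks                      # 1024 / 16 = 64
    for b in range(num_blocks):
        lo, hi = b * block, (b + 1) * block
        mask[lo:hi, lo:hi] = 1.0
    return mask

mask = block_diagonal_mask()
sparsity = 1.0 - mask.mean()                       # 1 - 1/16 = 0.9375
print(f"sparsity: {sparsity:.4f}")                 # the mask is applied as W * mask
```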
4. DENOISERS
In this section we describe the architecture of the ConvTASNet denoiser [19]. We use two different denoisers. For the main MOS listening tests, we use a TDCN++ architecture, as configured in Appendix A of [20]. This is a very high quality model, but is non-causal. Thus, it provides an upper bound on real-world quality, though the same model has been shown to work well with lower latencies [21]. Based on these good experimental results, we then developed a causal ConvTASNet model. Similar to the WaveGRU neural vocoder, the architecture is modified to minimize complexity when deploying to mobile devices. The parameter settings of the scheme are provided in Section 4.1.

As with the original ConvTASNet, we use a learned filterbank F with stride H to transform the signal to sample rate S/H. The mask network then generates sigmoid masks which are applied to the filterbanked signal. A learned transpose filterbank G^T then transforms the masked signal back to the time domain.

The mask network has a separate learned filterbank with F′ filters and matching stride H. Unlike the original ConvTASNet, we remove all layer-wise normalizations to preserve causality. We also use causal dilated convolutions and depth-wise convolutions. We allow a fixed amount of lookahead by introducing a delay between the generated masks and the filtered mixture. As in the original ConvTASNet, we use depth-wise convolutional blocks, consisting of an 'input' inverted-bottleneck convolution (kernel size 1, increasing the number of channels), a depth-wise convolution, and an 'output' bottleneck convolution (also with kernel size 1, decreasing the number of channels). Skip connections combine the input and output of each block.

Fig. 2. Convolutional TASNet architecture.

4.1. Parameter Settings

The on-device model consumes audio sampled at S = 16 kHz. The learned filterbank consists of 256 filters, with a window of 4 ms and a step size of 1 ms. The mask network's input filterbank is the same, but with 128 filters. We use two 'repeats' of ten depth-wise convolutional blocks. The depth-wise convolutions have kernel size 3 and dilation k mod d, where k is the block number and d = 10, providing a sawtooth pattern to the dilations. The inter-block hidden size is 128, and 256 channels are used within the depth-wise convolutional blocks. Finally, a transpose convolutional layer with kernel size 3 combines outputs from adjacent time steps and matches the depth of the filterbank. A sigmoid activation is then applied to get the final masks.

The model is pruned to 95% sparsity using iterative magnitude pruning, just as we did for the vocoder, reducing the number of parameters from 1.5M to 140k. Pruning is not applied to the input or output filterbanks or the depth-wise convolutional layers (which constitute only about 1% of the total weights). The unpruned model achieves a scale-invariant SNR improvement (SI-SNRi) of 12 dB on held-out evaluation data, and the pruned model achieves 9.8 dB SI-SNRi. The pruned model can run at about 3x real time on a single thread on a Pixel 3 phone. More recent models are even smaller, and run reliably in real time alongside the neural vocoder on the Pixel 3.
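To summarize the signal path, the following sketch (ours; random matrices stand in for the learned filterbanks and for the TCN mask network, and only the 4 ms window / 1 ms stride framing follows the settings above) shows the analysis filterbank, sigmoid masking, and transpose-filterbank resynthesis with overlap-add:

```python
import numpy as np

rng = np.random.default_rng(0)

S, WIN, HOP, N_FILT = 16000, 64, 16, 256      # 16 kHz; 4 ms window, 1 ms stride

def frame(x, win, hop):
    n = (len(x) - win) // hop + 1
    return np.stack([x[i * hop: i * hop + win] for i in range(n)])

def denoise(x, analysis, mask_net, synthesis):
    frames = frame(x, WIN, HOP)               # [T, WIN]
    feats = frames @ analysis                 # learned filterbank F: [T, N_FILT]
    masks = 1.0 / (1.0 + np.exp(-(feats @ mask_net)))   # sigmoid masks in [0, 1]
    masked = feats * masks                    # suppress noise-dominated filters
    out_frames = masked @ synthesis           # transpose filterbank G^T: [T, WIN]
    out = np.zeros(len(x))                    # overlap-add back to the time domain
    for t, fr in enumerate(out_frames):
        out[t * HOP: t * HOP + WIN] += fr
    return out

analysis = rng.standard_normal((WIN, N_FILT)) * 0.1
mask_net = rng.standard_normal((N_FILT, N_FILT)) * 0.1   # stands in for the mask TCN
synthesis = rng.standard_normal((N_FILT, WIN)) * 0.1
y = denoise(rng.standard_normal(S), analysis, mask_net, synthesis)
```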
5. EXPERIMENTS AND DISCUSSION

5.1. Noise Handling Strategies Listening Test
For our first experiment, we train unpruned vocoder models under the c2c, n2n, and n2c regimes. We also include dc2c samples, in which a large, non-causal ConvTASNet is applied to the inputs before they are fed to the c2c model. All models in this experiment use unquantized conditioning features, to study the modeling and reproduction aspects of the synthesis in generative speech coding while bypassing the quantization aspects of the conditioning features.

Subjective evaluation was carried out through a crowdsourced MOS listening test, selecting clean and noisy utterances containing both male and female speakers from the VCTK dataset [22]. Each utterance was rated by naive listeners (which could be different from utterance to utterance) on a quality scale from 1 to 5. The results are given in Fig. 3.

Fig. 3. Mean opinion scores from the Noise Handling Strategies Listening Test. The vertical bars indicate 95% confidence intervals.

The baseline c2c is, as expected, the best performer for clean speech, but it is also the worst performer in noisy speech. Qualitatively, the c2c model produces choppy-sounding output in regions with steady background noise. It also produces babbling when transient noises are present. Using noisy features and target (n2n) improves the quality in noisy speech, but at a great expense of quality in clean speech.

With n2c the quality does not significantly improve for noisy speech. We find that it occasionally drops phonemes, especially 'noisy' fricatives at the beginning of a word. Having access to only a single melspectrum frame of lookahead likely makes it difficult to determine whether a noisy frame is an actual speech sound or a transient background noise.

The denoised setup is the best system overall. For clean speech it has statistically indistinguishable performance from c2c, indicating that the denoiser is quite transparent in clean speech. With noisy speech it is also the best setup, with a mean MOS somewhat higher than the reference noisy speech.

The conclusion from these experiments is thus that using noisy features in training will improve the performance in noisy backgrounds, but the trade-off is inferior performance in clean conditions. Instead, we recommend adding a speech enhancer to obtain denoised features and using clean speech as the teacher-forced target during training.
5.2. On-Device Models Listening Test

For our second experiment, we check that the denoising setup is still superior in situations closer to a real-world deployment. In this experiment, we use the pruned vocoder with melspectrum features quantized to 3 kbps, and the pruned ConvTASNet variant described above. We wish to demonstrate that inclusion of the denoiser improves a current-best n2n system, and thus use a single n2n vocoder model for both the n2n and dn2n cases.

For subjective evaluation, we used another crowdsourced MOS listening test. For this test, we use the Hispanic-English Database [23], which contains spontaneous, conversational speech from 22 speakers. We composed an evaluation set of twelve 10-second audio segments, eight with an isolated speaker, and two with cross-talking speakers. We added randomly selected noise samples at 1, 5 and 10 dB SNR, to produce a total of 36 evaluation segments in the test. Each item was evaluated by 30 listeners. Noise samples are either 'ambient' noise (e.g., cars passing) or 'babble' noise consisting of mixed background talking (e.g., background chatter in a café), as babble noise is a weakness of the ConvTASNet system.

Table 1. On-Device Models Test MOS Results. Bold entries indicate that the 95% confidence intervals do not overlap.

                All SNRs            10 dB SNR           5 dB SNR            1 dB SNR
System          All   Bbl   Amb     All   Bbl   Amb     All   Bbl   Amb     All   Bbl   Amb
Reference       2.96  2.97  2.95    3.27  3.23  3.31    3.02  3.07  2.97    2.60  2.60  2.59
TASNet          2.70  2.53  2.87    3.12  3.01  3.23    2.69  2.48  2.91    2.29  2.12  2.47
dn2n
n2n             1.87  1.83  1.91    2.21  2.07  2.36    1.93  1.93  1.92    1.46  1.48  1.44

Results are reported in Table 1, both overall and marginalized by SNR and type of noise (babble vs. ambient). In summary, the dn2n system has a higher mean MOS score overall and in all marginalizations by SNR and type of noise. A 95% confidence interval was computed for each bucket; entries where the confidence intervals did not overlap are indicated in boldface. In particular, the dn2n system is significantly better overall and at 5 dB and 1 dB SNR.

We also report results for the pruned ConvTASNet in isolation, and find that it does not improve on the reference in any case; we also observe that it performs a bit worse on babble noise. On listening, we find that the pruned ConvTASNet occasionally has a 'scratchiness' in its output, especially at lower SNRs. Curiously, this scratchiness is largely removed in the output of the dn2n model: the artifacts may be masked by the quantized melspectrum transformation, or may be sounds that the model never saw in the training data, and are therefore smoothed away.
6. CONCLUSIONS
In this paper, we examined three strategies for handling noise with a neural vocoder: adding noisy training data, training the vocoder to act as a denoiser, and adding an additional ConvTASNet denoiser in the encoder. Training on noisy data and introducing a denoiser to the encoder both worked well, though the denoiser gave the best quality. We also demonstrated that a heavily pruned ConvTASNet works well in conjunction with the neural vocoder in on-device conditions: in conversational speech with varying levels of background noise, using low bit-rate features.
7. REFERENCES

[1] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, "WaveNet: A generative model for raw audio," arXiv preprint arXiv:1609.03499, 2016.
[2] W. B. Kleijn, F. S. C. Lim, A. Luebs, J. Skoglund, F. Stimberg, Q. Wang, and T. C. Walters, "WaveNet based low rate speech coding," in Proc. ICASSP, 2018, pp. 676–680.
[3] C. Gârbacea, A. van den Oord, Y. Li, F. S. C. Lim, A. Luebs, O. Vinyals, and T. C. Walters, "Low bit-rate speech coding with VQ-VAE and a WaveNet decoder," in Proc. ICASSP, 2019, pp. 735–739.
[4] J. Klejsa, P. Hedelin, C. Zhou, R. Fejgin, and L. Villemoes, "High-quality speech coding with SampleRNN," in Proc. ICASSP, 2019, pp. 7155–7159.
[5] S. Mehri, K. Kumar, I. Gulrajani, R. Kumar, S. Jain, J. Sotelo, A. Courville, and Y. Bengio, "SampleRNN: An unconditional end-to-end neural audio generation model," arXiv preprint arXiv:1612.07837, 2016.
[6] J.-M. Valin and J. Skoglund, "A real-time wideband neural vocoder at 1.6 kb/s using LPCNet," in Proc. Interspeech 2019, 2019.
[7] N. Kalchbrenner, E. Elsen, K. Simonyan, S. Noury, N. Casagrande, E. Lockhart, F. Stimberg, A. van den Oord, S. Dieleman, and K. Kavukcuoglu, "Efficient neural audio synthesis," in Proc. 35th International Conference on Machine Learning, Jennifer Dy and Andreas Krause, Eds., 2018, vol. 80 of Proceedings of Machine Learning Research, pp. 2410–2419, PMLR.
[8] C. W. J. Granger and M. J. Morris, "Time series modelling and interpretation," Journal of the Royal Statistical Society: Series A (General), vol. 139, no. 2, pp. 246–257, 1976.
[9] J. Salamon and J. P. Bello, "Deep convolutional neural networks and data augmentation for environmental sound classification," IEEE Signal Processing Letters, vol. 24, no. 3, pp. 279–283, 2017.
[10] T. Wang, K. Koishida, V. Cuperman, A. Gersho, and J. S. Collura, "A 1200/2400 bps coding suite based on MELP," 2002, pp. 90–92.
[11] D. O'Shaughnessy, Speech Communications: Human and Machine (IEEE), Universities Press, 1987.
[12] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, "Empirical evaluation of gated recurrent neural networks on sequence modeling," arXiv preprint arXiv:1412.3555, 2014.
[13] T. Okamoto, K. Tachibana, T. Toda, Y. Shiga, and H. Kawai, "An investigation of subband WaveNet vocoder covering entire audible frequency range with limited acoustic features," in Proc. ICASSP, 2018, pp. 5654–5658.
[14] T. Salimans, A. Karpathy, X. Chen, and D. P. Kingma, "PixelCNN++: Improving the PixelCNN with discretized logistic mixture likelihood and other modifications," arXiv preprint arXiv:1701.05517, 2017.
[15] E. Charniak et al., "BLLIP 1987-89 WSJ corpus release 1, LDC2000T43, web download," 2000.
[16] H. Zen, V. Dang, R. Clark, Y. Zhang, R. J. Weiss, Y. Jia, Z. Chen, and Y. Wu, "LibriTTS: A corpus derived from LibriSpeech for text-to-speech," arXiv, 2019.
[17] F. Font, G. Roma, and X. Serra, "Freesound technical demo," in Proc. 21st ACM Int. Conf. Multimedia, Barcelona, Spain, 2013, pp. 411–412.
[18] M. Zhu and S. Gupta, "To prune, or not to prune: Exploring the efficacy of pruning for model compression," arXiv preprint arXiv:1710.01878, 2017.
[19] Y. Luo and N. Mesgarani, "Conv-TasNet: Surpassing ideal time–frequency magnitude masking for speech separation," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 8, pp. 1256–1266, 2019.
[20] S. Wisdom, E. Tzinis, H. Erdogan, R. J. Weiss, K. Wilson, and J. Hershey, "Unsupervised sound separation using mixture invariant training," Advances in Neural Information Processing Systems, vol. 33, 2020.
[21] S. Sonning, C. Schüldt, H. Erdogan, and S. Wisdom, "Performance study of a convolutional time-domain audio separation network for real-time speech denoising," in Proc. ICASSP, 2020, pp. 831–835.
[22] C. Valentini-Botinhao, "Noisy speech database for training speech enhancement algorithms and TTS models."