Generative Speech Coding with Predictive Variance Regularization
W. Bastiaan Kleijn, Andrew Storus, Michael Chinen, Tom Denton, Felicia S. C. Lim, Alejandro Luebs, Jan Skoglund, Hengchin Yeh
Google LLC, San Francisco, USA
School of Engineering and Computer Science, Victoria University of Wellington, New Zealand
ABSTRACT
The recent emergence of machine-learning based generative models for speech suggests a significant reduction in bit rate for speech codecs is possible. However, the performance of generative models deteriorates significantly with the distortions present in real-world input signals. We argue that this deterioration is due to the sensitivity of the maximum likelihood criterion to outliers and the ineffectiveness of modeling a sum of independent signals with a single autoregressive model. We introduce predictive-variance regularization to reduce the sensitivity to outliers, resulting in a significant increase in performance. We show that noise reduction to remove unwanted signals can significantly increase performance. We provide extensive subjective performance evaluations that show that our system based on generative modeling provides state-of-the-art coding performance at 3 kb/s for real-world speech signals at reasonable computational complexity.
Index Terms — Speech, coding, WaveNet, regularization
1. INTRODUCTION
In recent years it has become possible to generate high-quality speech from a conditioning sequence with a very low information rate. This suggests that generative synthesis forms a natural basis for the coding and enhancement of speech. However, it has been found that generative synthesis is sensitive to the quality of the data used for training and the conditioning sequences used for training and inference. This can result in poor synthesized speech quality. In this paper, we discuss methods that significantly reduce the impact of distortions in the input signal on signal synthesis for speech coding.

WaveNet [1] first showed that the generation of high-quality speech from only a low-rate conditioning sequence, such as written text, is possible. WaveNet is based on an autoregressive structure that specifies a predictive distribution for each subsequent signal sample. While WaveNet uses a dilated convolution to determine the predictive distribution, other recurrent neural network structures such as WaveRNN [2] and the WaveGRU structure that we use in this paper have also been used successfully for this purpose. Although autoregressive structures for synthesis are common, feed-forward structures are used by, for example, Parallel WaveNet [3], WaveGlow [4], WaveGAN [5], and GANSynth [6]. It is fair to state that while the more recent methods may have computational advantages, they do not surpass the basic synthesis quality of the original WaveNet approach.

The high quality of generative speech synthesis has led to a significant effort towards its usage for coding. In contrast to synthesis from text, synthesis for coding must be able to generate an unlimited range of voices, and its conditioning is variable as it is computed from input signals that may suffer from a range of distortions. It was found that the synthesis of a wide range of voices with a single generative model is not a significant problem. Generative synthesis of a wide range of unknown voices with a single model results in only a minor reduction of speaker identifiability [7]. However, variability in the conditioning leads to a reduced output speech quality that cannot be improved significantly with straightforward measures such as training with noisy conditioning and undistorted target signals. Hence, despite extensive research, e.g., [8–12], generative-synthesis based speech coding has not yet seen major practical applications.

The contribution of this paper consists of the identification of causes of the sensitivity to distortion, the development of methods to reduce this sensitivity, and subjective testing of the new methods confirming the improvements. We show that a major cause of the sensitivity is associated with an attribute of the log-likelihood (LL) objective function. The LL objective function incurs a high penalty if the model assigns a low probability to observed data. Hence, in the context of autoregressive structures, it encourages an overly broad predictive distribution when at least some training data are difficult to predict accurately from the past signal and conditioning, which is true for real-world training data. We mitigate this effect by including predictive variance regularization in the objective function.

We also show with experiments that input-noise suppression can improve performance significantly. It is well known that a sum of low-order linear autoregressive processes is, in general, not a low-order autoregressive process. This suggests that linear autoregressive models are poor for sums of independent signals, and our results indicate this also holds for nonlinear autoregressive models. Whereas traditional analysis-by-synthesis coding methods can compensate for model inadequacies, this is not true for generative-synthesis based coding, which explains the effectiveness of noise suppression.
2. PROBLEM FORMULATION
In this section, we first describe how an autoregressive model is used to model a process. The method is as proposed in [1]. We then discuss a common problem that occurs when training such models.

Consider a random process {X_i} consisting of real-valued random samples X_i, with a time index i ∈ Z. The joint distribution of a finite sequence, p(x_i, ..., x_{i-N}), can be expressed as a product of conditional distributions:

    p(x_i, \ldots, x_{i-N} \mid \beta) = \prod_{j=0}^{N} p(x_{i-j} \mid x_{i-j-1}, \ldots, x_{i-N}, \beta),    (1)

where β is conditioning information.

It follows from (1) that we can create an approximate realization of a random process by recursively sampling from a model of the predictive distribution p(x_i | x_{i-1}, ..., x_{i-N}, β) for sufficiently large N. It is convenient to use a standard-form distribution q(x_i | α) with parameters α as a model predictive distribution. The standard-form distribution can be a Gaussian or a logistic mixture, for example. This formulation allows us to predict the model parameters with a deterministic neural network φ: (x_{i-1}, ..., x_{i-N}, β, W) ↦ α, where W is a vector of network parameters. Thus, the predictive distribution for sample x_i is now q(x_i | φ(x_{i-1}, ..., x_{i-N}, β, W)).

To find the parameters W, a reasonable objective is to minimize the Kullback-Leibler divergence between the ground-truth joint distribution p(x_i, ..., x_{i-N}) and the model distribution q(x_i, ..., x_{i-N}), or, equivalently, the cross-entropy between these distributions. The latter measure is tractable even though p is only available as an empirical distribution. It follows from (1) and our formulation of q(x_i | α) that cross-entropy based estimation of the parameters of φ can be implemented using maximum-likelihood based teacher forcing. For a database of M signal samples, the maximum-likelihood estimate of W can be written as

    W^* = \arg\max_W \sum_{i=1}^{M} \log q(x_i \mid \phi(x_{i-1}, \ldots, x_{i-N}, \beta, W)).    (2)

Note that (2) leads to rapid training as it facilitates parallel implementation. For sufficiently large N and M, the LL objective provides an upper bound on the differential entropy rate:

    h(X_i \mid X_{i-1}, \ldots, X_{i-N}) \le -\frac{1}{M} \sum_{i=1}^{M} \log q(x_i \mid \phi(x_{i-1}, \ldots, x_{i-N}, W)),    (3)

where, for notational convenience, we considered the unconditioned case. Conversely, (3) can be interpreted as a lower bound on a measure of uncertainty associated with the model predictive distribution. This lower bound is associated with the process itself and not with the model.
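To make the teacher-forced objective of (2) concrete, the following toy sketch (our illustration, not the authors' code) evaluates the log likelihood of a signal under a simple predictive network. All names, shapes, the affine form of phi, and the choice of a Gaussian standard-form distribution are assumptions made only for the example.

```python
import numpy as np

N = 8  # prediction order used in this toy example (not a value from the paper)

def phi(past, beta, W):
    """Toy predictive network: an affine map from the past N samples and the
    conditioning beta to the parameters alpha = (mu, sigma) of a Gaussian."""
    h = past @ W["A"] + beta @ W["B"]   # W["A"]: (N, 2), W["B"]: (len(beta), 2)
    return h[0], np.exp(h[1])           # mean, strictly positive scale

def log_q(x, mu, sigma):
    """Log density of the standard-form model distribution q(x | alpha)."""
    return -0.5 * np.log(2.0 * np.pi * sigma ** 2) - 0.5 * ((x - mu) / sigma) ** 2

def teacher_forced_log_likelihood(x, beta, W):
    """Sum of log q(x_i | phi(x_{i-1}, ..., x_{i-N}, beta, W)), the quantity
    maximized in (2); ground-truth past samples condition every prediction."""
    return sum(log_q(x[i], *phi(x[i - N:i], beta, W)) for i in range(N, len(x)))

# Example call with random placeholder data and weights.
rng = np.random.default_rng(0)
x, beta = rng.standard_normal(100), rng.standard_normal(4)
W = {"A": 0.01 * rng.standard_normal((N, 2)), "B": 0.01 * rng.standard_normal((4, 2))}
print(teacher_forced_log_likelihood(x, beta, W))
```

Because every term in the sum depends only on ground-truth past samples, all terms can be evaluated in parallel during training, which is the source of the rapid training noted above.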
Although the differential entropy rate is subadditive for summed signals, predictive models tend not to work well for summed signals. In general, a model of summed signals is essentially multiplicative in the required model configurations. It is well known that the sum of finite-order linear autoregressive models is, in general, not a finite-order autoregressive model [13]. It is relatively straightforward to reduce this problem with noise suppression.

A more difficult problem relates to well-known drawbacks of the Kullback-Leibler divergence and, hence, the LL objective of (2). When the model distribution q vanishes in the support region of the ground truth p, the Kullback-Leibler divergence diverges. In (2) this manifests itself as a severe penalty for training data x_i that have a low model probability q(x_i | φ(x_{i-1}, ..., x_{i-N}, β, W)). Hence, a few nonrepresentative outliers in the training data may lead the training procedure to equip the predictive model distribution with heavy tails. Such tails lead to signal synthesis with a relatively high entropy rate during inference. In audio synthesis this corresponds to a noisy synthesized signal. Hence it is desirable to counter the severity of the penalty for low-probability training data.

We can identify a second relevant drawback of the ML objective. When the ML objective function is used, the model distribution should converge to the ground-truth distribution with increasing database size. However, in practice the stochastic nature of the training data and the training method results in inaccuracies, and this in turn means the method attempts to minimize the impact of such errors. For example, the implicit description of pitch by the predictive distribution may be inaccurate. A predictive model distribution with heavy tails for voiced speech then increases the likelihood of the training data, as it reduces the impact of the model pitch deviating from the ground-truth pitch. From this reasoning we conclude that it is desirable to account for the audibility (perception) of distortions, leading to empirically motivated refinements of the objective function.

The problems associated with the LL objective have been considered earlier in different contexts. The vanishing-support problem described above was addressed in the context of generative adversarial networks (GANs) [14], where the implicit Jensen-Shannon objective function of the original method and the more general f-divergence based method [15] suffer, at least in principle, from similar problems. The support problem in GANs can be removed by using the 1-Wasserstein distance [16] or maximum mean discrepancy (MMD) [17, 18]. However, as these measures require two empirical distributions as input, they are natural for static distributions and not for dynamic predictive distributions. These methods also do not facilitate adjustment to account for perception. An existing approach that attempts to compensate for overly broad predictive distributions is to lower the "temperature" during inference, e.g., [19]. The predictive distribution is typically raised to a power and then renormalized. This approach does not account for the implicit cost penalty in the basic training objective.
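For reference, the temperature heuristic mentioned above can be sketched as follows for a discretized predictive distribution. This is only an illustration of the general idea, not the specific procedure used in [19].

```python
import numpy as np

def lower_temperature(p, T=0.8):
    """Sharpen a discrete predictive distribution by raising it to the power 1/T
    (T < 1) and renormalizing, which reduces the entropy of the sampled output."""
    p_sharp = np.asarray(p, dtype=float) ** (1.0 / T)
    return p_sharp / p_sharp.sum()

print(lower_temperature([0.1, 0.2, 0.7], T=0.5))  # mass concentrates on the mode
```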
3. OBJECTIVE FUNCTIONS FOR PREDICTIVE DISTRIBUTION MODELS
In this section, we discuss two related approaches that modify the maximum likelihood criterion to obtain improved performance. Both approaches aim to reduce the impact of data points in the training set that are difficult to predict. The methods remove the need for heuristic modifications during inference. While the principles of our methods are general, we apply them to the mixture-of-logistics distribution that we use in our coding scheme (cf. section 4).
We now discuss how to add a term to the objective function that encourages low-variance predictive distributions. In this approach we define the overall objective function for the weights W given a database {x} as

    J(\{x\}, W) = J_{LL}(\{x\}; W) + \nu J_{var}(\{x\}, W),    (4)

where the log likelihood over the database, J_{LL}(\{x\}; W) = E_{data} \log q(x_i; \phi(x_{i-1}, \ldots, x_{i-N}, W)), is combined with a variance regularization term J_{var}(\{x\}, W) that is defined below, and where ν is a constant that must be tuned.

The variance of the predictive distribution is an instantaneous parameter that varies over a database, and J_{var}(\{x\}, W) must be an average over the predictive distributions. The predictive distribution of each sample has a distinct variance, and the averaging method can be selected to have properties that are advantageous for the specific application. As noted in section 2, the predictive distribution is a standard-form distribution q(x | α).

The predictive distribution q(x | α) is commonly a mixture distribution. Hence we must find an expression for the variance of a mixture distribution. We first note that the mean of a mixture distribution is simply

    \mu = E_q[X] = \sum_{k=1}^{K} \gamma_k E_{q_k}[X] = \sum_{k=1}^{K} \gamma_k \mu_k,    (5)

where E_q is the expectation over q and q_k = \breve{q}(\,\cdot\,; \mu_k, s_k), with \breve{q} a mixture component. The variance of the mixture distribution is

    E_q[(X - E_q[X])^2] = \sum_{k=1}^{K} \gamma_k (\sigma_k^2 + \mu_k^2 - \mu^2).    (6)

We now consider the specific case of a mixture of logistics in more detail. The logistic distribution for component k is

    \breve{q}(x; \mu_k, s_k) = \frac{e^{-(x-\mu_k)/s_k}}{s_k \left(1 + e^{-(x-\mu_k)/s_k}\right)^2},    (7)

where s_k is the scale and μ_k is an offset. It is easily seen that the logistic distribution is symmetric around μ_k and that, hence, μ_k is the distribution mean. The variance of the logistic distribution is

    E_{X \sim \breve{q}}[(X - E_{X \sim \breve{q}}[X])^2] = s_k^2 \pi^2 / 3.    (8)

We can now write down the variance of the mixture-of-logistics model by combining (6) and (8):

    \sigma_q^2 = \sum_{k=1}^{K} \gamma_k \left( \frac{s_k^2 \pi^2}{3} + \mu_k^2 \right) - \left( \sum_{k=1}^{K} \gamma_k \mu_k \right)^2.    (9)

The most obvious approach to reducing the prediction variance is to use the prediction variance (9) directly as variance regularization in the objective function (4):

    J_{var}(\{x\}, W) = E_{data}[\sigma_q^2],    (10)

where E_{data} indicates averaging over the database. That is, we encourage selection of weights W of the network φ that minimize σ_q².
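As a check on (9), a direct numpy transcription of the mixture-of-logistics variance might look as follows; this is our own sketch, and the array names and example values are placeholders.

```python
import numpy as np

def mixture_of_logistics_variance(gamma, mu, s):
    """sigma_q^2 of (9): gamma, mu, s are length-K arrays of mixture weights,
    component means, and logistic scales."""
    mean = np.sum(gamma * mu)                                     # mixture mean, (5)
    second_moment = np.sum(gamma * (s ** 2 * np.pi ** 2 / 3.0 + mu ** 2))
    return second_moment - mean ** 2                              # (6) combined with (8)

print(mixture_of_logistics_variance(np.array([0.3, 0.7]),
                                    np.array([-1.0, 2.0]),
                                    np.array([0.5, 0.2])))
```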
Straightforward optimization of (10) over a database may result in the prediction variance being reduced mainly for signal regions where the conditional differential entropy (3) is large. The conditional differential entropy can be decomposed into the sum of a scale-independent term and a logarithmic scale (signal variance) dependency. For speech, the scale-independent term is large for unvoiced segments, while the scale-dependent term is large for voiced speech (as it is relatively loud).

For signals that have uniform overall signal variance, it may be desirable to encourage low predictive variance only for regions that have relatively low conditional differential entropy. (For speech that would correspond to encouraging low variance for voiced speech only.) This can be accomplished by a monotonically increasing concave function of the predictive variance. The logarithm is particularly attractive for this purpose as it is invariant with scale: the effect of a small variance getting smaller equals that of a large variance getting smaller by the same proportion. We then have

    J_{var}(\{x\}, W) = E_{data}[\log(\sigma_q^2 + a)],    (11)

with a providing a floor.

For completeness we describe an alternative method for preventing the vanishing-support problem of the Kullback-Leibler divergence by using a "baseline" distribution. To this purpose, consider a mixture distribution of the form

    q_{train}(x_i; \phi) = \gamma \breve{q}(x_i \mid \alpha) + \sum_{k=1}^{K} \gamma_k \breve{q}(x_i; \phi(x_{i-1}, \ldots, x_{i-N}, \beta, W)_k),    (12)

where the parameters γ and α are set by the designer, and where the first term is omitted during inference (the remaining terms must then be renormalized by a factor 1/(1 - γ)). By selecting α to provide an overly broad baseline distribution that absorbs hard-to-predict samples during training, the remaining components, which form the distribution used for inference, can be kept at low variance.
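A minimal sketch of how (4) and (11) might be combined into a training loss is given below. This is our illustration only: the loss-to-minimize sign convention, the constants, and the per-sample array interface are assumptions, and the refinements described in section 5 (per-band application, voicing-dependent ν) are omitted.

```python
import numpy as np

def regularized_loss(log_q_per_sample, pred_var_per_sample, nu=0.1, a=1e-4):
    """Loss-to-minimize form of objective (4) with regularizer (11): the negative
    mean log likelihood plus nu times the mean log of the floored predictive
    variance. nu and a are tuning constants; the defaults here are arbitrary."""
    J_LL = np.mean(log_q_per_sample)                     # E_data log q(x_i | phi(...))
    J_var = np.mean(np.log(pred_var_per_sample + a))     # eq. (11), with floor a
    return -J_LL + nu * J_var

# Placeholder numbers: per-sample log likelihoods and predictive variances.
print(regularized_loss(np.array([-1.2, -0.8, -2.0]), np.array([0.05, 0.02, 0.4])))
```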
4. SYSTEM ARCHITECTURE
In this section we describe the architecture of our coding scheme. The parameter settings of the scheme are provided in section 5.

Let us consider an input signal with a sampling rate of S Hz. To avoid the need for modeling summed independent signals, the input is pre-processed with a real-time TasNet [20, 21].

The encoder first converts the signal into a sequence of log mel spectra (e.g., [22]). A set of subsequent log mel spectra is stacked into a supervector that is subjected to a Karhunen-Loève transform (KLT) that is optimized off-line. The transformed stacked log mel spectra are encoded using split vector quantization with a small number of coefficients per split. No other information is encoded.
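A minimal sketch of this encoder path (stack, KLT, split vector quantization) is given below. It is our own illustration: the function names are hypothetical, the KLT matrix and codebooks stand in for quantities trained off-line, and the dimensions (160-dimensional spectra, two spectra per supervector, two-dimensional splits) are taken from the settings reported in section 5 purely for concreteness.

```python
import numpy as np

def encode_frames(log_mel, klt_matrix, codebooks, split_size=2):
    """Stack two log mel spectra into a supervector, apply a fixed KLT, and
    quantize the transformed coefficients with split vector quantization.
    log_mel: (2, D) spectra; klt_matrix: (2D, 2D); codebooks: one
    (codebook_size, split_size) array per split."""
    supervector = log_mel.reshape(-1)              # stack into a 2D-dimensional vector
    coeffs = klt_matrix @ supervector              # Karhunen-Loève transform
    indices = []
    for j, cb in enumerate(codebooks):             # quantize each small split separately
        split = coeffs[j * split_size:(j + 1) * split_size]
        indices.append(int(np.argmin(np.sum((cb - split) ** 2, axis=1))))
    return indices                                 # codebook indices form the bit stream

# Placeholder usage with random stand-ins for the trained KLT and codebooks.
rng = np.random.default_rng(0)
log_mel = rng.standard_normal((2, 160))
klt = np.linalg.qr(rng.standard_normal((320, 320)))[0]
codebooks = [rng.standard_normal((4, 2)) for _ in range(160)]
print(encode_frames(log_mel, klt, codebooks)[:8])
```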
The decoder first decodes the bit stream into a sequence of quantized log mel spectra. These spectra form the input to the conditioning stack, which consists of a set of 1D convolutional layers, all except the first with dilation. The output is a vector sequence with a sampling rate equal to that of the mel spectra of the encoder and a dimensionality equal to that of the state of the GRU unit discussed below.

The autoregressive network consists of a multi-band WaveGRU, which is based on gated recurrent units (GRUs) [23]. For our N-band WaveGRU, N samples are generated simultaneously at an update rate of S/N Hz, one sample for each frequency band. For each update, the state of the GRU network is projected onto an N × K × 3 dimensional space that defines N parameter sets, each set corresponding to a K-component mixture of logistics for one band. The value of the next signal sample for each band is then drawn by first selecting the mixture component (a logistic distribution) according to its probability and then drawing the sample from this logistic distribution by transforming a sample from a uniform distribution. For each set of N samples, a synthesis filter bank produces N subsequent time-domain samples, which results in an output with sampling rate S Hz.

The input to the WaveGRU is the sum of an autoregressive component and a conditioning component. The autoregressive component is a projection of the last N frequency-band samples onto a vector with the dimensionality of the WaveGRU state. The second component is the output of the conditioning stack (also with the dimensionality of the WaveGRU state), repeated in time to obtain the correct sampling rate of S/N Hz.
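The per-band sampling step described above can be sketched as follows (our illustration; the function name and example parameters are placeholders). The uniform draw is mapped through the inverse CDF of the logistic distribution, μ_k + s_k log(u / (1 − u)).

```python
import numpy as np

def sample_band(gamma, mu, s, rng):
    """Draw one sample for one frequency band from its mixture of logistics:
    pick a component with probability gamma_k, then transform a uniform draw
    through the inverse logistic CDF of that component."""
    k = rng.choice(len(gamma), p=gamma)        # select the mixture component
    u = rng.uniform(1e-7, 1.0 - 1e-7)          # keep away from the endpoints
    return mu[k] + s[k] * np.log(u / (1.0 - u))

rng = np.random.default_rng(0)
print(sample_band(np.array([0.2, 0.8]), np.array([0.0, 0.05]), np.array([0.01, 0.02]), rng))
```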
The training of the GRU network and the conditioning stack is performed simultaneously using teacher forcing. That is, the past signal samples that are provided as input to the GRU are ground-truth signal samples. The objective function (4), combining log likelihood (cross entropy) and variance regularization, is used for each subsequent signal sample. For our implementation with variance regularization, we found that the baseline distribution did not aid performance significantly, and it was omitted from the experiments.

Table 1. Systems in test.

  system label         | b | v | t | vt | q | qv | qt | qvt
  var. regularization  |   | ✓ |   | ✓  |   | ✓  |    | ✓
  TasNet noise supp.   |   |   | ✓ | ✓  |   |    | ✓  | ✓
  93% pruning          |   |   |   |    | ✓ | ✓  | ✓  | ✓
5. EXPERIMENTS
Our experiments had two goals: the first was to show the effect of predictive variance regularization and noise suppression; the second was to show that our contributions enable a practical system.
We tested eight systems, all variants of a single baseline system operating on 16 kHz sampled signals. The baseline is conditioned using a sequence of 160-dimensional log mel spectra computed from 80 ms windows at an update rate of 50 Hz. The system uses four frequency bands, each sampled at 4 kHz. The conditioning stack consists of a single non-causal input layer (expanding from 160 to 512 channels), three dilated causal convolutional layers with kernel size two, and three upsampling transpose convolutional layers (kernel size two). The overall algorithmic delay is 90 ms. The conditioning outputs are tiled to match the GRU update frequency. The GRU state dimensionality is 1024, and eight mixture-of-logistics components are used for the predictive distribution of each band.

The systems were trained from randomly initialized weights W for 7.5 million steps, using a mini-batch size of 256. The target signal was drawn from a combination of clean [24, 25] and noisy [26] sources, including large proprietary TTS datasets. Additional noise from [27] was added at randomly drawn SNRs.

Table 1 shows the combinations of coder attributes that were used. We briefly discuss each attribute. The variance regularization included refinements that further improved its performance: it was applied to the first two bands only, and ν in (4) was made proportional to a voicing score. The noise suppression system was a version of ConvTasNet [21]. The weight pruning attribute was selected to enable implementation on consumer devices. For the three GRU matrices, we used block-diagonal matrices with 16 blocks, which use 93% fewer weights than a fully connected model. For the other hidden layers, we applied iterative magnitude pruning to remove 92% of the model weights [28]. The pruning makes the codec, including TasNet, run reliably on a Pixel 2 phone in single-threaded mode. The system was quantized with 120 bits per supervector, each supervector containing two log mel spectra, for an overall rate of 3 kb/s. The quantization was a two-dimensional vector quantization of the KLT coefficients.

To evaluate the absolute quality of the different systems at different SNRs, a Mean Opinion Score (MOS) listening test was performed. Except for the data collection, we followed the ITU-T P.800 [29] (ACR) recommendation. The data was collected using a crowd-sourcing platform, with listeners required to be native English speakers and to use headphones. The evaluation dataset is composed of 30 samples from the Noisy VCTK dataset [30]: 15 clean and 15 augmented with additive noise at various SNRs (2.5, 7.5, and 12.5 dB). Each utterance for each system was rated about 200 times, and the average and 95% confidence interval were calculated per SNR.
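For orientation, the stated parameters imply the 3 kb/s rate directly; a quick arithmetic check using only the figures quoted above:

```python
# Back-of-the-envelope bit-rate check from the stated configuration.
spectra_rate_hz = 50          # log mel spectra per second
spectra_per_supervector = 2   # two spectra are stacked per supervector
bits_per_supervector = 120
bit_rate = bits_per_supervector * spectra_rate_hz / spectra_per_supervector
print(bit_rate)               # 3000.0 bits per second, i.e. 3 kb/s
```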
Fig. 1. Quality vs SNR for the baseline, regularized, denoised, and regularized-and-denoised systems. Also shown are the quality of the unprocessed and denoised-only signals.
Fig. 2. Quality vs SNR for the pruned and quantized systems.
The quality for the systems of Table 1 is shown in Figs. 1 and 2. The MOS with 95% confidence intervals are given for four SNRs.

Fig. 1 displays the effect of predictive variance regularization and noise suppression (TN) without weight pruning and quantization. Predictive variance regularization results in a significant quality improvement and reduces the sensitivity to noise in the input signal. Noise suppression aids performance when noise is present.

Fig. 2 shows the quality for the pruned and quantized systems. For this case, the improvement due to variance regularization is particularly large for clean signals. The effect of noise suppression (TN) varies in an unexpected manner with SNR. This likely results from an interaction between noise suppression and quantization. It may be related to noise suppression reducing signal variability and quantization reducing noise on its own.

As a reference, Fig. 2 provides the performance of the Opus codec [31] operating at 6 kb/s and the EVS codec [32] operating at 5.9 kb/s (for fairness, with DTX disabled). It is seen that the proposed fully practical 3 kb/s WaveGRU coder performs significantly better than Opus at 6 kb/s and similarly to EVS operating at 5.9 kb/s.
6. CONCLUSION
We have developed a robust speech codec using neural-network based signal synthesis that encodes speech at 3 kb/s. Our system is suitable for, for example, low-rate video calls, and fits in consumer devices, as evidenced by our implementation running on a wide range of mobile phones including the Pixel 2. Our experiments show that its quality is similar to or better than that of state-of-the-art conventional codecs operating at double the rate. Our main contribution is that we addressed the impact of variability and distortion inherent in real-world input to practical speech codecs. We identified as causes for poor performance i) the inherent emphasis on outliers by the maximum likelihood criterion and ii) the difficulty of modeling a sum of multiple independent sources. We resolved these problems with predictive variance regularization and noise suppression.

7. REFERENCES

[1] A. v. d. Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, "WaveNet: A generative model for raw audio," arXiv preprint arXiv:1609.03499, 2016.
[2] N. Kalchbrenner, E. Elsen, K. Simonyan, S. Noury, N. Casagrande, E. Lockhart, F. Stimberg, A. v. d. Oord, S. Dieleman, and K. Kavukcuoglu, "Efficient neural audio synthesis," arXiv preprint arXiv:1802.08435, 2018.
[3] A. v. d. Oord, Y. Li, I. Babuschkin, K. Simonyan, O. Vinyals, K. Kavukcuoglu, G. Driessche, E. Lockhart, L. Cobo, F. Stimberg et al., "Parallel WaveNet: Fast high-fidelity speech synthesis," in International Conference on Machine Learning. PMLR, 2018, pp. 3918–3926.
[4] R. Prenger, R. Valle, and B. Catanzaro, "WaveGlow: A flow-based generative network for speech synthesis," in IEEE Int. Conf. Acoustics, Speech and Signal Processing (ICASSP), 2019, pp. 3617–3621.
[5] C. Donahue, J. McAuley, and M. Puckette, "Adversarial audio synthesis," arXiv preprint arXiv:1802.04208, 2018.
[6] J. Engel, K. K. Agrawal, S. Chen, I. Gulrajani, C. Donahue, and A. Roberts, "GANSynth: Adversarial neural audio synthesis," arXiv preprint arXiv:1902.08710, 2019.
[7] W. B. Kleijn, F. S. Lim, A. Luebs, J. Skoglund, F. Stimberg, Q. Wang, and T. C. Walters, "WaveNet based low rate speech coding," in IEEE Int. Conf. Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 676–680.
[8] J. Klejsa, P. Hedelin, C. Zhou, R. Fejgin, and L. Villemoes, "High-quality speech coding with sample RNN," in IEEE Int. Conf. Acoustics, Speech and Signal Processing (ICASSP), 2019, pp. 7155–7159.
[9] C. Gârbacea, A. van den Oord, Y. Li, F. S. Lim, A. Luebs, O. Vinyals, and T. C. Walters, "Low bit-rate speech coding with VQ-VAE and a WaveNet decoder," in IEEE Int. Conf. Acoustics, Speech and Signal Processing (ICASSP), 2019, pp. 735–739.
[10] J.-M. Valin and J. Skoglund, "A real-time wideband neural vocoder at 1.6 kb/s using LPCNet," in Proc. Interspeech 2019, 2019, pp. 3406–3410.
[11] F. S. Lim, W. B. Kleijn, M. Chinen, and J. Skoglund, "Robust low rate speech coding based on cloned networks and WaveNet," in IEEE Int. Conf. Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 6769–6773.
[12] R. Fejgin, J. Klejsa, L. Villemoes, and C. Zhou, "Source coding of audio signals with a generative model," in IEEE Int. Conf. Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 341–345.
[13] C. W. J. Granger and M. J. Morris, "Time series modelling and interpretation," Journal of the Royal Statistical Society: Series A (General), vol. 139, no. 2, pp. 246–257, 1976.
[14] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial nets," in Advances in Neural Information Processing Systems, 2014, pp. 2672–2680.
[15] S. Nowozin, B. Cseke, and R. Tomioka, "f-GAN: Training generative neural samplers using variational divergence minimization," in Advances in Neural Information Processing Systems, 2016, pp. 271–279.
[16] M. Arjovsky, S. Chintala, and L. Bottou, "Wasserstein GAN," arXiv preprint arXiv:1701.07875, 2017.
[17] Y. Li, K. Swersky, and R. Zemel, "Generative moment matching networks," in International Conference on Machine Learning, 2015, pp. 1718–1727.
[18] C.-L. Li, W.-C. Chang, Y. Cheng, Y. Yang, and B. Póczos, "MMD GAN: Towards deeper understanding of moment matching network," in Advances in Neural Information Processing Systems, 2017, pp. 2203–2213.
[19] S. Kim, S.-g. Lee, J. Song, J. Kim, and S. Yoon, "FloWaveNet: A generative flow for raw audio," arXiv preprint arXiv:1811.02155, 2018.
[20] Y. Luo and N. Mesgarani, "Conv-TasNet: Surpassing ideal time-frequency magnitude masking for speech separation," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 8, pp. 1256–1266, 2019.
[21] S. Sonning, C. Schüldt, H. Erdogan, and S. Wisdom, "Performance study of a convolutional time-domain audio separation network for real-time speech denoising," in IEEE Int. Conf. Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 831–835.
[22] D. O'Shaughnessy, Speech Communications: Human and Machine (IEEE). Universities Press, 1987.
[23] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, "Empirical evaluation of gated recurrent neural networks on sequence modeling," arXiv preprint arXiv:1412.3555, 2014.
[24] J. S. Garofolo, D. Graff, D. Paul, and D. Pallett, "CSR-I (WSJ0) Other," Harvard Dataverse, Tech. Rep., 2016. [Online]. Available: https://doi.org/10.7910/DVN/ZVU9HF
[25] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, "Librispeech: An ASR corpus based on public domain audio books," in IEEE Int. Conf. Acoustics, Speech and Signal Processing (ICASSP), 2015, pp. 5206–5210.
[26] R. Ardila et al., "Common Voice: A massively-multilingual speech corpus," arXiv preprint arXiv:1912.06670, 2019.
[27] E. Fonseca, J. Pons, X. Favory, F. Font, D. Bogdanov, A. Ferraro, S. Oramas, A. Porter, and X. Serra, "Freesound Datasets: A platform for the creation of open audio datasets," in Proc. 18th Int. Society for Music Information Retrieval Conference (ISMIR 2017), Suzhou, China, 2017, pp. 486–493.
[28] M. Zhu and S. Gupta, "To prune, or not to prune: Exploring the efficacy of pruning for model compression," arXiv preprint arXiv:1710.01878, 2017.
[29] Recommendation ITU-T P.800: Methods for subjective determination of transmission quality, ITU-T Std., Aug. 1996.
[30] C. Valentini-Botinhao, "Noisy speech database for training speech enhancement algorithms and TTS models," University of Edinburgh, School of Informatics, Centre for Speech Technology Research (CSTR), Tech. Rep., 2016.
[31] J. M. Valin, K. Vos, and T. Terriberry, Definition of the Opus Audio Codec, IETF Std., Sept. 2012, RFC 6716.
[32] M. Dietz et al., "Overview of the EVS codec architecture," in 2015 IEEE Int. Conf. Acoustics, Speech and Signal Processing (ICASSP), 2015.