Variational Autoencoder for Speech Enhancement with a Noise-Aware Encoder
Huajian Fang, Guillaume Carbajal, Stefan Wermter, Timo Gerkmann
Signal Processing (SP), Universität Hamburg, Germany
Knowledge Technology (WTM), Universität Hamburg, Germany
{fang, carbajal, wermter, gerkmann}@informatik.uni-hamburg.de

ABSTRACT
Recently, a generative variational autoencoder (VAE) has been proposed for speech enhancement to model speech statistics. However, this approach only uses clean speech in the training phase, making the estimation particularly sensitive to noise presence, especially at low signal-to-noise ratios (SNRs). To increase the robustness of the VAE, we propose to include noise information in the training phase by using a noise-aware encoder trained on noisy-clean speech pairs. We evaluate our approach on real recordings of different noisy environments and acoustic conditions using two different noise datasets. We show that our proposed noise-aware VAE outperforms the standard VAE in terms of overall distortion without increasing the number of model parameters. At the same time, we demonstrate that our model generalizes to unseen noise conditions better than a supervised feedforward deep neural network (DNN). Furthermore, we demonstrate that the model's performance is robust to a reduction of the noisy-clean speech training data size.
Index Terms — speech enhancement, generative model, variational autoencoder, semi-supervised learning.
1. INTRODUCTION
Speech enhancement refers to the problem of extracting a target speech signal from a noisy mixture in order to enhance the quality and intelligibility of the speech. This task is of particular interest for applications like speech recognition and hearing aids. Single-channel speech enhancement is a challenging task, especially at low signal-to-noise ratios (SNRs).

Speech enhancement typically requires the statistical estimation of the noise and speech power spectral densities (PSDs) [1, 2]. Non-negative matrix factorization (NMF) is a popular choice for PSD estimation [3–6]. However, underlying linearity assumptions limit the performance when modeling complex high-dimensional data. In contrast, speech enhancement based on non-linear deep neural networks (DNNs) has shown better modeling capacity. Common approaches focus on inferring a time-frequency mask in a supervised manner [7]. However, to generalize to unseen noise conditions, DNNs require a large number of pairs of noisy and clean speech in various acoustic conditions [8].

Recently, there has been an increasing interest in generative models, such as generative adversarial networks (GANs) [9] and variational autoencoders (VAEs) [10, 11]. The generative VAE is a probabilistic model widely used for learning latent representations of a probability distribution. The VAE features a similar architecture as a classical autoencoder with an encoder and a decoder, but its latent space differs by being regularized to follow a standard Gaussian distribution. Moreover, the VAE has been extended to deep conditional generative models for effectively performing probabilistic inference [12, 13]. VAEs have been applied to speech enhancement in both single-channel and multi-channel scenarios [14–16]. They have been used to model the speech statistics by training on clean speech spectra only. However, because no noise information is involved in the training phase, the encoder of the standard VAE is sensitive to noise. At low SNRs, this noise sensitivity results in an erroneous estimation of the latent variables and thus in inappropriately generated speech coefficients and reduced performance.

In this work, inspired by conditional VAEs and their application to image segmentation [12, 13, 17], we propose to replace the encoder of the VAE by a noise-aware encoder in order to increase noise robustness. To learn this encoder, the VAE is first trained on clean speech spectra only; then, given noisy speech, the proposed noise-aware encoder is trained in a supervised fashion to make its latent space as close as possible to that of the speech-only trained encoder. For our analyses we rely on the VAE-NMF speech enhancement framework [14, 15], which uses NMF to model the noise PSD. We show that the proposed encoder is more robust to noise presence and improves speech estimation without increasing the number of model parameters. The method also shows robustness to unseen noise conditions when evaluated on real recordings from different noise datasets. Finally, we illustrate that already a small amount of noisy-clean speech data can lead to improvements in overall distortion.

In Section 2, we introduce the problem setting and notation, as well as the VAE-based speech model and the NMF-based noise model. In Section 3, we present the proposed noise-aware VAE. After describing the experimental settings in Section 4, we present the experimental evaluation results and conclusions in Sections 5 and 6.
2. PROBLEM FORMULATION

2.1. Mixture model
In our work, we employ an additive signal model, where a noisy mixture is seen as a superposition of clean speech and additive noise. In the short-time Fourier transform (STFT) domain, this reads

\[
x_{ft} = s_{ft} + n_{ft}, \tag{1}
\]

where $x_{ft}$, $s_{ft}$, and $n_{ft}$ denote the time-frequency coefficients of the noisy mixture $X \in \mathbb{C}^{F \times T}$, the speech $S \in \mathbb{C}^{F \times T}$, and the noise $N \in \mathbb{C}^{F \times T}$, respectively. $F$ denotes the number of frequency bins and $T$ the number of time frames, indexed by $f$ and $t$, respectively. The speech and noise coefficients are assumed to be mutually independent and complex Gaussian distributed with zero mean, i.e., $s_{ft} \sim \mathcal{N}_{\mathbb{C}}(0, \sigma_{s,ft})$ and $n_{ft} \sim \mathcal{N}_{\mathbb{C}}(0, \sigma_{n,ft})$, where $\sigma_{s,ft}$ and $\sigma_{n,ft}$ represent the variances of speech and noise. Under a local stationarity assumption, the PSD of each signal is characterized by its variance [18].

Furthermore, to provide increased robustness to the loudness of the audio utterances, a time-dependent and frequency-independent gain $g_t$ is introduced [15]. This modifies the additive mixture model in (1) to

\[
x_{ft} = \sqrt{g_t}\, s_{ft} + n_{ft}. \tag{2}
\]

Given the observed noisy mixture, which follows a complex Gaussian distribution $x_{ft} \sim \mathcal{N}_{\mathbb{C}}(0, g_t \sigma_{s,ft} + \sigma_{n,ft})$, the desired speech can be extracted by separately modeling the speech and noise variances.

For the VAE-based speech model, a frame-wise $D$-dimensional latent variable $z_t \in \mathbb{R}^D$ is defined, and an $F$-dimensional speech frame $s_t$ is assumed to be sampled from the conditional likelihood distribution $p_\theta(s_t \mid z_t)$. This is achieved by the decoder of the VAE, also called the generative model. The variable $\theta$ indicates the parameters of the decoder network. $\hat{\sigma}_s : \mathbb{R}^D \to \mathbb{R}^F_+$ denotes the nonlinear function from the latent space to the reconstructed signal given by the generative model of the VAE.

The VAE provides a principled method to jointly learn latent variables and the inference model [10].
Following a Bayesian framework, this requires approximating the intractable true posterior distribution $p(z_t \mid s_t)$. In the VAE, the encoder, also called the inference model, is used to approximate the true posterior, denoted $q_\phi(z_t \mid s_t)$. The variable $\phi$ indicates the parameters of the encoder network. $\hat{\mu}_d : \mathbb{R}^F_+ \to \mathbb{R}^D$ and $\hat{\sigma}_d : \mathbb{R}^F_+ \to \mathbb{R}^D_+$ denote the nonlinear mappings of the neural network given by the inference model of the VAE. Under stochastic gradient descent, the generative model's parameters $\theta$ and the inference model's parameters $\phi$ are jointly optimized by maximizing the variational lower bound

\[
\log p(S) \geq -\sum_t \mathrm{KL}\left[ q_\phi(z_t \mid s_t) \,\|\, p(z_t) \right] + \sum_t \mathbb{E}_{q_\phi(z_t \mid s_t)}\left[ \log p_\theta(s_t \mid z_t) \right]. \tag{3}
\]

Here, $p(z_t)$ is the prior distribution of the $D$-dimensional variable $z_t$, and KL denotes the Kullback–Leibler divergence. The prior of the latent variables is defined as a zero-mean isotropic multivariate Gaussian, $z_t \sim \mathcal{N}(0, I)$, as in [10]. The first term in the objective function (3) is a regularization term in the latent space that ensures meaningful latent variables, and the second term is the reconstruction error.

As shown in Fig. 1, the VAE is trained on the periodograms of clean speech $|s_t|$ [14, 15]. During testing, the estimates of the clean speech power spectra $\hat{\sigma}_s(z_t)$ are expected to be generated from latent variables learnt from the noisy periodograms $|x_t| \in \mathbb{R}^F_+$. Note that a robust estimation of latent variables that represent the clean speech statistics plays a crucial role in the generative process.

Fig. 1. The generative model and inference model of the adopted VAE. The dashed line indicates the sampling process.

NMF finds an optimal approximation to an input matrix by a dictionary matrix containing basis functions weighted by a coefficients matrix [3]. Here, NMF is used to model the noise variance [14, 15]. The noise variance $\sigma_n$ is approximated by the product of the dictionary matrix $W \in \mathbb{R}^{F \times K}_+$ and the coefficients matrix $H \in \mathbb{R}^{K \times T}_+$:

\[
\sigma_{n,ft} = (WH)_{ft} = \sum_k w_{fk} h_{kt}, \tag{4}
\]

where $K$ denotes the rank of the noise model, indexed by $k$, and $w_{fk}$ and $h_{kt}$ are the elements of $W$ and $H$ at the row and column indexed by $f$, $k$, and $t$.

By modeling speech and noise with the VAE and NMF, respectively, the distribution of the noisy mixture can be represented as

\[
x_{ft} \sim \mathcal{N}_{\mathbb{C}}\Big(0,\; g_t\, \hat{\sigma}_{s,f}(z_t) + \sum_k w_{fk} h_{kt}\Big), \tag{5}
\]

where $\hat{\sigma}_{s,f} : \mathbb{R}^D \to \mathbb{R}_+$ denotes the nonlinear function $\hat{\sigma}_s$ for the $f$-th frequency bin. Given the noisy mixture as an observation, the Monte Carlo expectation-maximization (MCEM) algorithm is utilized to estimate the NMF parameters and the gain factor [15, 19]. The sampling strategy is based on the Metropolis–Hastings algorithm [20]. The clean speech can then be extracted from the noisy mixture in the time-frequency domain by constructing a Wiener filter $\hat{m}_{ft}$, given as

\[
\hat{m}_{ft} = \frac{g_t\, \hat{\sigma}_{s,f}(z_t)}{g_t\, \hat{\sigma}_{s,f}(z_t) + \sum_k w_{fk} h_{kt}}. \tag{6}
\]

Although modeling speech with a VAE can be achieved by training solely on clean speech data, using it for speech enhancement is another matter, since gaining robustness to noise is difficult without including noise samples in the training data and the model. However, the standard VAE does not allow for including noise in the training phase.
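As a rough numerical sketch of Eqs. (4)–(6), the following snippet builds an NMF noise variance, uses random positive values as a stand-in for the decoder output $\hat{\sigma}_{s,f}(z_t)$, and applies the resulting Wiener filter to a surrogate noisy STFT; all values are illustrative, not a trained model.

```python
import numpy as np

rng = np.random.default_rng(1)
F, T, K = 513, 100, 8      # frequency bins, time frames, NMF rank (K = 8, cf. Section 4)

sigma_s_hat = rng.uniform(0.1, 2.0, size=(F, T))  # stand-in for the decoder output sigma_s(z_t)
g = rng.uniform(0.5, 1.5, size=T)                 # time-dependent gain g_t
W = rng.uniform(0.0, 1.0, size=(F, K))            # NMF dictionary of basis spectra
H = rng.uniform(0.0, 1.0, size=(K, T))            # NMF activation coefficients

sigma_n_hat = W @ H                               # Eq. (4): noise variance model

# Eq. (6): Wiener filter built from the modeled speech and noise variances.
mask = (g[None, :] * sigma_s_hat) / (g[None, :] * sigma_s_hat + sigma_n_hat)

# Applying the mask to a (surrogate) noisy STFT yields the speech estimate.
x = rng.normal(size=(F, T)) + 1j * rng.normal(size=(F, T))
s_hat = mask * x
```

Since all variances are positive, the mask necessarily lies strictly between 0 and 1, i.e., it attenuates but never amplifies a time-frequency bin.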
3. NOISE-AWARE VAE
Instead of using the encoder trained on the clean speech signals, we propose a noise-aware VAE that improves the robustness of the encoder against noise presence. For a generative process, it is difficult or even impossible to derive the optimal mapping between latent variables and targets. However, we argue that the latent variables estimated from noisy mixtures should be as close as possible to the ones inferred from the corresponding clean speech.

To obtain the noise-aware VAE based on this assumption, we propose a two-step learning algorithm, which learns a non-linear mapping from the noisy signals to latent variables that represent the clean speech statistics. We first train a VAE using (3) to learn a regularized latent space over the clean speech signals.

Fig. 2. The proposed architecture for minimizing the divergence between latent variables. The constraint in the latent space is shown in (a), and its graphical explanation is given in (b).

The noise-aware encoder is then trained to approximate the probability $q_\gamma(z'_t \mid x_t)$, i.e., to output $D$-dimensional latent variables $z'_t \in \mathbb{R}^D$ conditioned on the noisy mixture $x_t$. The conditional probability $q_\gamma(z'_t \mid x_t)$ is also assumed to follow a Gaussian distribution. The variable $\gamma$ indicates the parameters of the new encoder. Finally, the divergence of $z'_t$ obtained from noisy speech to the latent variables $z_t$ inferred from the corresponding clean speech is minimized using the Kullback–Leibler divergence, as shown in Fig. 2 (a):

\[
\mathcal{L}(\gamma) = \sum_t \mathrm{KL}\big( q_\phi(z_t \mid s_t) \,\|\, q_\gamma(z'_t \mid x_t) \big) \tag{7}
\]
\[
= \sum_{t,d} \left\{ \frac{1}{2} \log \frac{\tilde{\sigma}_d(|x_t|)}{\hat{\sigma}_d(|s_t|)} - \frac{1}{2} + \frac{\hat{\sigma}_d(|s_t|) + \big(\hat{\mu}_d(|s_t|) - \tilde{\mu}_d(|x_t|)\big)^2}{2\,\tilde{\sigma}_d(|x_t|)} \right\}, \tag{8}
\]

where $\tilde{\mu}_d : \mathbb{R}^F_+ \to \mathbb{R}^D$ and $\tilde{\sigma}_d : \mathbb{R}^F_+ \to \mathbb{R}^D_+$ represent the nonlinear mappings of the neural network for the mean and variance of the posterior Gaussian distribution of $z'_t$. The parameters $\gamma$ of the new inference model are optimized by minimizing this cost function using stochastic gradient descent. In this way, we combine unsupervised learning of the speech characteristics by the VAE and supervised learning using the pairs of noisy-clean speech signals.

Eventually, as graphically shown in Fig. 2 (b), by introducing this cost function in the latent space, the latent variables $z'_t$ estimated from the noisy mixture $x_t$ are pulled towards $z_t$ estimated from the corresponding clean speech $s_t$. The dashed lines indicate the nonlinear mapping from the signal space to the latent space, and the different colors indicate the two mapping pairs. At the inference stage, the noise-aware inference model replaces the standard speech-based encoder. The decoder of the VAE remains unchanged.
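The training loss above is the closed-form KL divergence between two diagonal Gaussians. A minimal sketch of this computation, with array shapes and names chosen here purely for illustration:

```python
import numpy as np

def kl_diag_gaussians(mu_c, var_c, mu_n, var_n):
    """KL( N(mu_c, var_c) || N(mu_n, var_n) ) for diagonal Gaussians, cf. Eq. (8).

    mu_c, var_c: mean/variance from the clean-speech encoder q_phi(z_t | s_t)
    mu_n, var_n: mean/variance from the noise-aware encoder q_gamma(z'_t | x_t)
    All arrays have shape (T, D); the summed divergence is returned as a scalar loss.
    """
    per_dim = (0.5 * np.log(var_n / var_c)
               - 0.5
               + (var_c + (mu_c - mu_n) ** 2) / (2.0 * var_n))
    return per_dim.sum()

# Sanity check: identical distributions yield zero divergence.
T, D = 10, 16
mu = np.zeros((T, D))
var = np.ones((T, D))
assert np.isclose(kl_diag_gaussians(mu, var, mu, var), 0.0)
```

In a training loop, the gradient of this loss with respect to the noise-aware encoder's parameters would be taken while the clean-speech encoder's outputs are treated as fixed targets.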
4. EXPERIMENTAL SETTINGS

4.1. Datasets
We evaluate the performance of the proposed model using signals from the Wall Street Journal (WSJ0) speech dataset [21] and the noise databases QUT-NOISE [22] and DEMAND [23]. QUT-NOISE is used to construct both the training and evaluation datasets, using the 4 noise types "cafe", "car", "home", and "street", recorded at unique locations. DEMAND is introduced as a second evaluation dataset corresponding to completely unseen noise conditions; its noise signals are randomly sampled from recordings of 12 noise types in the categories "domestic", "public", "street", and "transportation".

To train the noise-aware encoder, around 25 hours of speech samples are chosen from WSJ0 and mixed with the sampled noise signals at an SNR randomly chosen from -5 dB to 5 dB in steps of 1 dB. Two speaker-independent evaluation datasets, each containing around 2.3 hours of 1000 noisy samples, are created by mixing the speech and noise signals at SNRs of -10 dB, -5 dB, 0 dB, 5 dB, and 10 dB.
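The construction of noisy training pairs at a target SNR can be sketched as follows; the function name `mix_at_snr` and the white-noise surrogate signals are illustrative stand-ins for the WSJ0 and QUT-NOISE material, not the authors' pipeline.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so the speech-to-noise power ratio equals `snr_db`, then add.

    Assumes equal-length time-domain signals. The paper draws training SNRs
    uniformly from -5 dB to 5 dB in 1 dB steps.
    """
    p_s = np.mean(speech ** 2)
    p_n = np.mean(noise ** 2)
    scale = np.sqrt(p_s / (p_n * 10.0 ** (snr_db / 10.0)))
    return speech + scale * noise

rng = np.random.default_rng(2)
speech = rng.normal(size=16000)            # 1 s of surrogate "speech" at 16 kHz
noise = rng.normal(size=16000)             # surrogate noise recording
snr_db = rng.choice(np.arange(-5, 6))      # -5 dB to 5 dB, 1 dB steps
noisy = mix_at_snr(speech, noise, snr_db)
```

Scaling the noise rather than the speech keeps the clean target unchanged, so the same speech signal can serve as the regression target for the noisy input.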
We show evaluation results by comparing the proposed noise-aware VAE to the standard VAE and to a fully-connected DNN model. The DNN model outputs a Wiener filter trained with a mean square error cost function [24], referred to as DNN-WF. The standard VAE is trained on the same amount of clean speech signals, not mixed with noise, while the supervised DNN-WF is trained on the same dataset as the noise-aware encoder.
All signals are sampled at 16 kHz. The signals are transformed to the STFT domain with a sine window of length 1024 ($F = 513$) and a 25% hop size. Global normalization to zero mean and unit standard deviation is employed for training the noise-aware encoder, since the Kullback–Leibler divergence is scale-dependent. The rank of the NMF noise model is chosen as $K = 8$, and its matrices $W$ and $H$ are randomly initialized. The parameters of the MCEM algorithm follow the settings in [15].

The VAE comprises an encoder and a decoder, both with two feedforward hidden layers of 128 units. The hyperbolic tangent activation function is applied to all hidden layers. The dimension of the latent space is fixed at $D = 16$. The noise-aware encoder has the same structure as the speech-based encoder of the standard VAE. The fully supervised DNN-WF contains 5 hidden layers, each with 128 units, and its architecture is built to contain a similar number of parameters as our VAE model. No temporal information is considered in DNN-WF, which is consistent with the non-sequential characteristic of the VAE. We apply the ReLU activation function to all its hidden layers, and the sigmoid function to the output layer to ensure the estimated Wiener filter mask lies in the range [0, 1]. The parameters $\theta$ and $\phi$ of the VAE are optimized by Adam [25] with a learning rate of 1e-3, and the parameters $\gamma$ of the noise-aware encoder with a learning rate of 1e-4.

To measure the enhancement performance, we employ the scale-invariant signal-to-distortion ratio (SI-SDR) in decibels (dB) [26], which takes both noise reduction and artifacts into account.

Table 1. Performance comparison in SI-SDR for 5 different SNR conditions, trained and evaluated on different subsets of the QUT-NOISE dataset (4 noise types). Values of SI-SDR are given as mean ± confidence interval (95% confidence) over all utterances of the evaluation dataset, in dB. NA-VAE refers to the proposed noise-aware VAE.

Table 2. Performance comparison in SI-SDR for 5 different SNR conditions, trained on the QUT-NOISE dataset and evaluated on the DEMAND dataset (12 noise types, completely unseen noise conditions). Values of SI-SDR are given as mean ± confidence interval (95% confidence) over all utterances of the evaluation dataset, in dB.
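A minimal forward pass matching the reported encoder size (two tanh hidden layers of 128 units, a 16-dimensional latent space over $F = 513$ bins) might look as follows; the random weights are untrained stand-ins, and the exact layer wiring is an assumption beyond what the paper states.

```python
import numpy as np

rng = np.random.default_rng(3)
F, D, H = 513, 16, 128     # STFT bins, latent dimension, hidden units (Section 4 values)

# Two tanh hidden layers followed by linear heads for the mean and
# log-variance of the latent Gaussian.
W1, b1 = rng.normal(0, 0.05, (H, F)), np.zeros(H)
W2, b2 = rng.normal(0, 0.05, (H, H)), np.zeros(H)
W_mu, b_mu = rng.normal(0, 0.05, (D, H)), np.zeros(D)
W_lv, b_lv = rng.normal(0, 0.05, (D, H)), np.zeros(D)

def encode(x_pow):
    """Map a normalized power-spectrum frame to a latent Gaussian (mu, var)."""
    h = np.tanh(W1 @ x_pow + b1)
    h = np.tanh(W2 @ h + b2)
    mu = W_mu @ h + b_mu
    var = np.exp(W_lv @ h + b_lv)      # log-variance head keeps the variance positive
    return mu, var

def reparameterize(mu, var, rng):
    """Sample z = mu + sigma * eps (the VAE reparameterization trick)."""
    return mu + np.sqrt(var) * rng.normal(size=mu.shape)

x_frame = rng.uniform(0, 1, F)         # surrogate noisy power-spectrum frame
mu, var = encode(x_frame)
z = reparameterize(mu, var, rng)
```

The noise-aware encoder shares this structure; only its input (noisy rather than clean frames) and its training objective differ.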
Fig. 3. Influence of the amount of noisy-clean speech training data on the SI-SDR improvements for both VAE models, averaged over all noise conditions (relative dataset sizes: 1%, 3%, 5%, 10%, 25%, 50%, 100%).
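For reference, the SI-SDR metric [26] used in the evaluation can be sketched as below; this is a generic implementation of the definition, not the authors' evaluation code.

```python
import numpy as np

def si_sdr(reference, estimate):
    """Scale-invariant SDR in dB: project the estimate onto the reference
    and compare the target energy to the residual energy."""
    alpha = np.dot(estimate, reference) / np.dot(reference, reference)
    target = alpha * reference
    residual = estimate - target
    return 10.0 * np.log10(np.sum(target ** 2) / np.sum(residual ** 2))

rng = np.random.default_rng(4)
ref = rng.normal(size=16000)
est = ref + 0.1 * rng.normal(size=16000)   # estimate with mild residual noise
score = si_sdr(ref, est)
```

Because of the projection step, the metric is invariant to a global rescaling of the estimate, which is what distinguishes SI-SDR from the classical SDR.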
5. RESULTS AND DISCUSSION

5.1. Performance evaluation
As can be seen from Table 1, which presents results trained and evaluated on different subsets of QUT-NOISE, the proposed noise-aware VAE outperforms the standard VAE in terms of overall distortion in all SNR scenarios, and the SI-SDR improvements are more pronounced at low SNRs. For example, the noise-aware VAE outperforms the baseline VAE by nearly 1 dB at an input SNR of -10 dB. Table 1 also shows that the DNN-WF performs better than the plain VAE, which implies that appropriate prior noise information is beneficial.

In Table 2, which shows the evaluation performed on the DEMAND database while training is still conducted on QUT-NOISE, we see that the fully connected DNN-WF performs significantly worse than the other models. This was expected, as we now test on a different, more diverse dataset with 12 noise types instead of only 4. The supervised DNN-WF cannot transfer its denoising capability to unseen noise types, implying that inappropriate prior noise information may even deteriorate performance [8, 14]. However, the proposed noise-aware VAE still outperforms the VAE in all SNR conditions, which suggests that the proposed method of improving the latent variables is better able to generalize to unseen noise scenarios. Informal listening confirms the SI-SDR results, especially for Table 1, while the improvements reported in Table 2 are relatively subtle. Audio examples are available online.

We then examine the influence of the amount of noisy-clean speech training data used for estimating the speech latent variables. To this end, we initialize the noise-aware encoder with the encoder parameters of the pre-trained standard VAE and then train the new encoder on randomly selected subsets of 1%, 3%, 5%, 10%, 25%, and 50% of the noisy-clean speech pairs constructed with the QUT-NOISE dataset. Fig. 3 shows that the performance can already be improved by using only a small percentage of the paired noisy-clean speech data: more than 0.2 dB SI-SDR improvement is observed with just 1% of the total paired data. It can also be observed that increasing the amount of data beyond this point leads only to gradual improvements, which may be because the noise diversity is already largely represented in the small fraction of data used. This research could be extended by increasing the diversity of the noise types in the training phase. The ability to improve performance with only few labeled data shows potential for alleviating overfitting issues in supervised training strategies.
6. CONCLUSION
In this paper, we proposed a noise-aware encoding scheme to improve the robustness of the VAE encoder, particularly at low SNRs. To this end, we incorporate noise information into the VAE encoder to enable a more accurate speech variance estimation based on improved latent variables. By constraining the latent space, the VAE with the proposed noise-aware encoder learns a non-linear mapping from the noisy mixture to latent variables that represent the clean speech statistics. Our proposed VAE outperforms the standard VAE and a supervised DNN-based filter in SI-SDR. Experiments also showed the generalization ability to unseen noise scenarios by evaluating across different datasets. Moreover, we showed that the performance can be improved even with a small amount of noisy-clean speech data. For future work, our approach could also be integrated with deep generative models that capture temporal dependencies [27].

Audio examples: https://uhh.de/inf-sp-navae2021

REFERENCES

[1] T. Gerkmann and R. C. Hendriks, "Unbiased MMSE-based noise power estimation with low complexity and low tracking delay," IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 4, pp. 1383–1393, 2011.
[2] R. C. Hendriks, T. Gerkmann, and J. Jensen, DFT-Domain Based Single-Microphone Noise Reduction for Speech Enhancement: A Survey of the State of the Art, Morgan & Claypool Publishers, 2013.
[3] D. D. Lee and H. S. Seung, "Algorithms for non-negative matrix factorization," in Advances in Neural Information Processing Systems, 2001, pp. 556–562.
[4] C. Févotte, N. Bertin, and J. L. Durrieu, "Nonnegative matrix factorization with the Itakura-Saito divergence: With application to music analysis," Neural Computation, vol. 21, no. 3, pp. 793–830, 2009.
[5] N. Mohammadiha, T. Gerkmann, and A. Leijon, "A new linear MMSE filter for single channel speech enhancement based on non-negative matrix factorization," in IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2011, pp. 45–48.
[6] H. Sawada, H. Kameoka, S. Araki, and N. Ueda, "Multichannel extensions of non-negative matrix factorization with complex-valued data," IEEE Transactions on Audio, Speech, and Language Processing, vol. 21, no. 5, pp. 971–982, 2013.
[7] D. Wang and J. Chen, "Supervised speech separation based on deep learning: An overview," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 26, no. 10, pp. 1702–1726, 2018.
[8] R. Rehr and T. Gerkmann, "An analysis of noise-aware features in combination with the size and diversity of training data for DNN-based speech enhancement," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019, pp. 601–605.
[9] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial nets," in Advances in Neural Information Processing Systems, 2014, pp. 2672–2680.
[10] D. P. Kingma and M. Welling, "Auto-encoding variational Bayes," in International Conference on Learning Representations (ICLR), Yoshua Bengio and Yann LeCun, Eds., 2014.
[11] D. J. Rezende, S. Mohamed, and D. Wierstra, "Stochastic backpropagation and approximate inference in deep generative models," International Conference on Machine Learning, pp. 1278–1286, 2014.
[12] D. P. Kingma, S. Mohamed, D. J. Rezende, and M. Welling, "Semi-supervised learning with deep generative models," in Advances in Neural Information Processing Systems, 2014, pp. 3581–3589.
[13] K. Sohn, H. Lee, and X. Yan, "Learning structured output representation using deep conditional generative models," in Advances in Neural Information Processing Systems, 2015, pp. 3483–3491.
[14] Y. Bando, M. Mimura, K. Itoyama, K. Yoshii, and T. Kawahara, "Statistical speech enhancement based on probabilistic integration of variational autoencoder and non-negative matrix factorization," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 716–720.
[15] S. Leglaive, L. Girin, and R. Horaud, "A variance modeling framework based on variational autoencoders for speech enhancement," in International Workshop on Machine Learning for Signal Processing (MLSP), 2018, pp. 1–6.
[16] S. Leglaive, L. Girin, and R. Horaud, "Semi-supervised multichannel speech enhancement with variational autoencoders and non-negative matrix factorization," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2019, pp. 101–105.
[17] S. Kohl, B. Romera-Paredes, C. Meyer, J. De Fauw, J. R. Ledsam, K. Maier-Hein, S. M. A. Eslami, D. J. Rezende, and O. Ronneberger, "A probabilistic U-Net for segmentation of ambiguous images," in Advances in Neural Information Processing Systems, 2018, pp. 6965–6975.
[18] A. Liutkus, R. Badeau, and G. Richard, "Gaussian processes for underdetermined source separation," IEEE Transactions on Signal Processing, vol. 59, no. 7, pp. 3155–3167, 2011.
[19] G. C. Wei and M. A. Tanner, "A Monte Carlo implementation of the EM algorithm and the poor man's data augmentation algorithms," Journal of the American Statistical Association, vol. 85, no. 411, pp. 699–704, 1990.
[20] C. Robert and G. Casella, Monte Carlo Statistical Methods, Springer Science & Business Media, 2013.
[21] J. Garofolo, D. Graff, D. Paul, and D. Pallett, "CSR-I (WSJ0) Sennheiser LDC93S6B," Web Download. Philadelphia: Linguistic Data Consortium, 1993.
[22] D. B. Dean, S. Sridharan, R. J. Vogt, and M. W. Mason, "The QUT-NOISE-TIMIT corpus for the evaluation of voice activity detection algorithms," Proceedings of Interspeech, 2010.
[23] J. Thiemann, N. Ito, and E. Vincent, "DEMAND: Diverse Environments Multichannel Acoustic Noise Database," http://parole.loria.fr/DEMAND/, 2013.
[24] F. Weninger, J. R. Hershey, J. Le Roux, and B. Schuller, "Discriminatively trained recurrent neural networks for single-channel speech separation," in IEEE Global Conference on Signal and Information Processing (GlobalSIP), 2014, pp. 577–581.
[25] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," International Conference on Learning Representations, 2014.
[26] J. Le Roux, S. Wisdom, H. Erdogan, and J. R. Hershey, "SDR – half-baked or well done?," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019, pp. 626–630.
[27] J. Richter, G. Carbajal, and T. Gerkmann, "Speech Enhancement with Stochastic Temporal Convolutional Networks," in