Guided Variational Autoencoder for Speech Enhancement With a Supervised Classifier
Guillaume Carbajal, Julius Richter, Timo Gerkmann
Signal Processing (SP), Universität Hamburg, Germany
{guillaume.carbajal, julius.richter, timo.gerkmann}@uni-hamburg.de
ABSTRACT
Recently, variational autoencoders have been successfully used to learn a probabilistic prior over speech signals, which is then used to perform speech enhancement. However, variational autoencoders are trained on clean speech only, which results in a limited ability to extract the speech signal from noisy speech compared to supervised approaches. In this paper, we propose to guide the variational autoencoder with a supervised classifier separately trained on noisy speech. The estimated label is a high-level categorical variable describing the speech signal (e.g. speech activity), allowing for a more informed latent distribution compared to the standard variational autoencoder. We evaluate our method with different types of labels on real recordings of different noisy environments. Provided that the label better informs the latent distribution and that the classifier achieves good performance, the proposed approach outperforms the standard variational autoencoder and a conventional neural network-based supervised approach.
Index Terms — Speech enhancement, deep generative model, variational autoencoder, semi-supervised learning.
1. INTRODUCTION
The task of single-channel speech enhancement consists in recovering a speech signal from a mixture signal captured with one microphone in a noisy environment [1]. Common speech enhancement approaches estimate the speech signal using a filter in the time-frequency domain to reduce the noise signal while avoiding speech artifacts [2]. Under the Gaussian assumption, the optimal filter in the minimum mean square error sense requires estimating the signal variances [3–5].

Supervised deep neural networks (DNNs) have demonstrated excellent performance in estimating the speech signal [6–9]. However, supervised approaches require labeled data which originates from pairs of noisy and clean speech. These pairs can be created synthetically. However, since supervised approaches may not generalize well to unseen situations, a large number of pairs is needed to cover various acoustic conditions, e.g. different noise types, reverberation and different signal-to-noise ratios (SNRs).

Recently, deep generative models based on the variational autoencoder (VAE) have gained attention for learning the probability distribution of complex data [10]. VAEs have been used to learn a prior distribution of clean speech, and have been combined with an untrained non-negative matrix factorization (NMF) noise model to estimate the signal variances using a Monte Carlo expectation-maximization (MCEM) algorithm [11, 12].
This work has been funded by the German Research Foundation (DFG) in the transregio project Crossmodal Learning (TRR 169) and ahoi.digital.
However, since the VAE speech model is trained in an unsupervised manner on clean speech only, its ability to extract speech characteristics from noisy speech is limited in low SNRs. This results in limited speech enhancement performance compared to supervised approaches in already-seen noisy environments [11].

To overcome this limitation, the VAE can be conditioned on an auxiliary variable that allows for a more informed probabilistic latent distribution [13]. Kameoka et al. used a VAE conditioned on the speaker identity to inform the speech prior for multichannel speech separation [14]. However, their approach can only separate speakers which are included in the training set. As a result, their approach aims at speaker-dependent speech separation and not at speaker-independent speech enhancement.

In this work, we propose to guide the VAE with a classifier fully decoupled from the VAE. The classifier is trained separately in a supervised manner with pairs of noisy and clean speech. The estimated label is a high-level categorical variable describing the speech signal (e.g. speech activity). We show that the choice of label is crucial for the performance of the proposed guided VAE. In addition, we show that a noise-robust classifier is also required to outperform the standard VAE and a conventional supervised DNN-based approach.

The rest of this paper is organized as follows. In Section 2 we summarize the background related to the VAE for speech enhancement. Section 3 describes our proposed approach. The experimental setup is described in Section 4, which is followed by the evaluation in Section 5.
2. BACKGROUND

2.1. Mixture model and filtering
In the time-frequency domain using the short-time Fourier transform (STFT), the mixture signal $x_{nf} \in \mathbb{C}$ is the sum of the clean speech $s_{nf} \in \mathbb{C}$ and the noise $b_{nf} \in \mathbb{C}$:
$$x_{nf} = \sqrt{g_n}\, s_{nf} + b_{nf}, \qquad (1)$$
at time frame index $n \in [1, N]$ and frequency bin $f \in [1, F]$, where $N$ denotes the number of time frames and $F$ the number of frequency bins of the utterance. The scalar $g_n \in \mathbb{R}_+$ represents a frequency-independent but time-varying gain providing some robustness with respect to the time-varying loudness of different speech signals [12].

Under the Gaussian assumption, the clean speech $s_{nf}$ can be estimated in the minimum mean square error sense using the Wiener estimator:
$$\hat{s}_{nf} = \frac{\hat{g}_n \hat{v}_{s,nf}}{\hat{g}_n \hat{v}_{s,nf} + \hat{v}_{b,nf}}\, x_{nf}, \qquad (2)$$
where $\hat{v}_{s,nf}$ and $\hat{v}_{b,nf}$ are the estimated variances of the clean speech $s_{nf}$ and the noise $b_{nf}$, respectively. Under a local stationarity assumption, the short-time power spectra $|s_{nf}|^2$ and $|b_{nf}|^2$ are unbiased estimates of the signal variances [15].

2.2. VAE speech model

The standard VAE, which we refer to as model M1, is used to learn a prior over clean speech [11, 12]. At time frame $n$, the frequency bins of clean speech $\mathbf{s}_n \in \mathbb{C}^F$ are modeled as
$$\mathbf{s}_n \mid \mathbf{z}_{1,n} \sim \mathcal{CN}\big(\mathbf{0}, \operatorname{diag}(\mathbf{v}_\theta(\mathbf{z}_{1,n}))\big), \qquad \mathbf{z}_{1,n} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}), \qquad (3)$$
where $\mathbf{z}_{1,n} \in \mathbb{R}^D$ denotes a latent variable of dimension $D$ and $\mathbf{v}_\theta: \mathbb{R}^D \mapsto \mathbb{R}_+^F$ represents a trainable feedforward DNN, called the generative model or decoder, parametrized by $\theta$ (see Fig. 1a).

In variational inference, the posterior of $\mathbf{z}_{1,n}$ is approximated as
$$\mathbf{z}_{1,n} \mid \mathbf{s}_n \sim \mathcal{N}\big(\boldsymbol{\mu}_\phi(|\mathbf{s}_n|^2), \operatorname{diag}(\mathbf{v}_\phi(|\mathbf{s}_n|^2))\big), \qquad (4)$$
where $\boldsymbol{\mu}_\phi: \mathbb{R}_+^F \mapsto \mathbb{R}^D$ and $\mathbf{v}_\phi: \mathbb{R}_+^F \mapsto \mathbb{R}_+^D$ represent feedforward DNNs sharing the same input and hidden layers, called the recognition model or encoder, which are parametrized by $\phi$ (see Fig. 1b). Note that the absolute value and squaring in (4) are performed element-wise.

The generative model and recognition model are simultaneously trained by maximizing the evidence lower bound (ELBO) on the per-frame log-likelihood
$$\log p_\theta(\mathbf{s}_n) \geq \mathbb{E}_{q_\phi(\mathbf{z}_{1,n} \mid \mathbf{s}_n)}\big[\log p_\theta(\mathbf{s}_n \mid \mathbf{z}_{1,n})\big] - D_{\mathrm{KL}}\big(q_\phi(\mathbf{z}_{1,n} \mid \mathbf{s}_n) \,\|\, p(\mathbf{z}_{1,n})\big), \qquad (5)$$
where the first term is the reconstruction loss and $D_{\mathrm{KL}}(\cdot \| \cdot)$ denotes the Kullback-Leibler divergence.

Fig. 1. Model M1 consisting of (a) a generative model $p_\theta(\mathbf{s}_n \mid \mathbf{z}_{1,n})\, p(\mathbf{z}_{1,n})$ and (b) a recognition model $q_\phi(\mathbf{z}_{1,n} \mid \mathbf{s}_n)$.

2.3. NMF noise model

The noise variance is modeled with an untrained NMF as
$$v_{b,nf} = \{\mathbf{H}\mathbf{W}\}_{nf}, \qquad (6)$$
where $\mathbf{H} \in \mathbb{R}_+^{N \times K}$ and $\mathbf{W} \in \mathbb{R}_+^{K \times F}$ are two non-negative matrices representing the temporal activations and spectral patterns of the noise power spectrogram, and $K$ denotes the NMF rank.

2.4. Speech enhancement

Given the speech prior provided by model M1 and the noise model, the mixture signal $x_{nf}$ is distributed as
$$x_{nf} \mid \mathbf{z}_{1,n} \sim \mathcal{CN}\big(0,\; g_n \{\mathbf{v}_\theta(\mathbf{z}_{1,n})\}_f + \{\mathbf{H}\mathbf{W}\}_{nf}\big), \qquad (7)$$
where $\Theta_u = \{g_n, \mathbf{H}, \mathbf{W}\}$ are the unsupervised parameters to be estimated. Since the resulting optimization problem is intractable due to the non-linear relation between the speech variance and the latent variable, an MCEM algorithm is employed to iteratively optimize the unsupervised parameters $\Theta_u$ [12]. At each iteration, the estimated terms $\hat{g}_n \{\mathbf{v}_\theta(\mathbf{z}_{1,n})\}_f$ and $\{\widehat{\mathbf{H}}\widehat{\mathbf{W}}\}_{nf}$ are supposed to get closer to the true variances $v_{s,nf}$ and $v_{b,nf}$, respectively, reaching a local optimum.
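To make the training of model M1 concrete, the following is a minimal PyTorch sketch of the encoder, decoder and negative ELBO of (3)–(5). Layer sizes, activations and variable names are illustrative assumptions and do not correspond to the configuration of Tab. 1; the reconstruction term is the Itakura-Saito-type negative log-likelihood of the zero-mean complex Gaussian in (3), up to an additive constant.

```python
import torch
import torch.nn as nn

class VAESpeechModel(nn.Module):
    """Sketch of model M1 (illustrative layer sizes, not those of Tab. 1)."""
    def __init__(self, n_freq=513, latent_dim=16, hidden=128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_freq, hidden), nn.Tanh())
        self.enc_mean = nn.Linear(hidden, latent_dim)     # mu_phi(|s_n|^2)
        self.enc_logvar = nn.Linear(hidden, latent_dim)   # log v_phi(|s_n|^2)
        # Decoder outputs the log of the speech variance v_theta(z_n) for stability.
        self.decoder = nn.Sequential(nn.Linear(latent_dim, hidden), nn.Tanh(),
                                     nn.Linear(hidden, n_freq))

    def forward(self, s_pow):
        # s_pow: (batch, n_freq) squared magnitudes |s_n|^2 of clean speech frames
        h = self.encoder(s_pow)
        mu, logvar = self.enc_mean(h), self.enc_logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterization
        log_v_s = self.decoder(z)
        return log_v_s, mu, logvar

def negative_elbo(s_pow, log_v_s, mu, logvar, eps=1e-8):
    # Reconstruction: Itakura-Saito divergence between |s_n|^2 and v_theta(z_n),
    # i.e. the negative log-likelihood of the complex Gaussian in (3) up to a constant.
    v_s = torch.exp(log_v_s)
    recon = torch.sum(s_pow / (v_s + eps) + log_v_s, dim=-1)
    # KL divergence between N(mu, diag(exp(logvar))) and the standard normal prior.
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=-1)
    return (recon + kl).mean()
```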
Note that while the VAE operates on a frame-by-frame basis, the MCEM algorithm is offline, resulting in an offline estimation of the parameters.

At test time, the recognition model takes the mixture signal $x_{nf}$ as input instead of the clean speech $s_{nf}$. However, since model M1 is trained on clean speech only, its ability to extract speech characteristics from the mixture $x_{nf}$ is limited. As a result, the speech enhancement performance of the MCEM using M1 may be lower compared to supervised approaches trained on already-seen noisy environments [11].
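For illustration, once the MCEM has produced estimates of the gain, the speech variance and the NMF noise variance, the Wiener estimator (2) can be applied to the mixture STFT as in the following NumPy sketch (function and variable names are ours, not from the paper's code):

```python
import numpy as np

def wiener_filter(x_stft, g, v_s, H, W):
    """Apply the Wiener estimator (2).

    x_stft : (N, F) complex mixture STFT
    g      : (N,)   estimated frequency-independent gains g_n
    v_s    : (N, F) estimated speech variances {v_theta(z_n)}_f
    H, W   : (N, K) and (K, F) estimated NMF factors, so that v_b = H @ W, cf. (6)
    """
    v_b = H @ W                        # noise variances
    num = g[:, None] * v_s             # scaled speech variances
    return num / (num + v_b) * x_stft  # element-wise Wiener gain applied to the mixture
```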
3. GUIDED VARIATIONAL AUTOENCODER
In this section, we propose a guided VAE, which is an extension of model M1 combined with a supervised classifier.
3.1. Guided VAE speech model

Inspired by Kingma et al.'s deep generative model for semi-supervised learning [13], we extend model M1 with a categorical variable $y_n \in \mathcal{Y}$ that characterizes a high-level feature of the speech signal (e.g. speech activity). Hereafter, we denote $y_n$ as the label. The label is supposed to allow for a more informed probabilistic speech prior learned by the VAE. We describe our choice for $y_n$ in Section 3.3.

At time frame $n$, the frequency bins of clean speech $\mathbf{s}_n$ are generated as
$$\mathbf{s}_n \mid y_n, \mathbf{z}_{2,n} \sim \mathcal{CN}\big(\mathbf{0}, \operatorname{diag}(\mathbf{v}_\theta(y_n, \mathbf{z}_{2,n}))\big), \qquad (8)$$
where $\mathbf{z}_{2,n}$ has the same prior as $\mathbf{z}_{1,n}$ and $\mathbf{v}_\theta: \mathcal{Y} \times \mathbb{R}^D \mapsto \mathbb{R}_+^F$ is a feedforward DNN, resulting in the guided generative model (see Fig. 2a). The posterior of $\mathbf{z}_{2,n}$ is approximated as
$$\mathbf{z}_{2,n} \mid y_n, \mathbf{s}_n \sim \mathcal{N}\big(\boldsymbol{\mu}_\phi(y_n, |\mathbf{s}_n|^2), \operatorname{diag}(\mathbf{v}_\phi(y_n, |\mathbf{s}_n|^2))\big), \qquad (9)$$
where $\boldsymbol{\mu}_\phi: \mathcal{Y} \times \mathbb{R}_+^F \mapsto \mathbb{R}^D$ and $\mathbf{v}_\phi: \mathcal{Y} \times \mathbb{R}_+^F \mapsto \mathbb{R}_+^D$ are feedforward DNNs sharing the same input and hidden layers, resulting in the guided recognition model (see Fig. 2b).

The guided generative model and recognition model are simultaneously trained by maximizing the ELBO on the per-frame joint log-likelihood
$$\log p_\theta(y_n, \mathbf{s}_n) \geq \mathbb{E}_{q_\phi(\mathbf{z}_{2,n} \mid y_n, \mathbf{s}_n)}\big[\log p_\theta(\mathbf{s}_n \mid y_n, \mathbf{z}_{2,n})\big] - D_{\mathrm{KL}}\big(q_\phi(\mathbf{z}_{2,n} \mid y_n, \mathbf{s}_n) \,\|\, p(\mathbf{z}_{2,n})\big) + \log p(y_n), \qquad (10)$$
where the first term is the reconstruction loss and $p(y_n)$ is the prior distribution of $y_n$.

Fig. 2. Model M2 consisting of (a) the guided generative model $p_\theta(\mathbf{s}_n \mid y_n, \mathbf{z}_{2,n})\, p(\mathbf{z}_{2,n})\, p(y_n)$ and (b) the guided recognition model $q_\phi(\mathbf{z}_{2,n} \mid \mathbf{s}_n, y_n)$.

3.2. Speech enhancement

The estimation path at test time is shown in Fig. 3. First, we use a classifier to estimate $\hat{y}_n$ from the mixture $\mathbf{x}_n$. Then, we use the mixture signal $\mathbf{x}_n$ and the estimated label $\hat{y}_n$ as inputs for the guided recognition model. Given $\hat{y}_n$ and the latent variable $\mathbf{z}_{2,n}$, the mixture signal $x_{nf}$ is distributed as
$$x_{nf} \mid \hat{y}_n, \mathbf{z}_{2,n} \sim \mathcal{CN}\big(0,\; g_n \{\mathbf{v}_\theta(\hat{y}_n, \mathbf{z}_{2,n})\}_f + \{\mathbf{H}\mathbf{W}\}_{nf}\big). \qquad (11)$$
For estimating the unsupervised parameters we use the same MCEM configuration as for model M1. Provided that 1) the classifier is noise-robust and that 2) the label $y_n$ better informs the speech prior, model M2 is supposed to better extract speech characteristics from the mixture $\mathbf{x}_n$ than model M1.

Fig. 3. Combined model M2 and classifier $p(y_n \mid \mathbf{x}_n)$ at test time.

3.3. Supervised classifier

Since the classifier is fully decoupled from model M2, it can be trained separately. This fact can be used to construct the best classifier possible. In order to obtain a noise-robust classifier, we train a feedforward DNN in a supervised manner using the mixture power spectra $|\mathbf{x}_n|^2$ as inputs and the corresponding labels $y_n$ as targets. The classifier outputs the posterior probability $p(y_n \mid \mathbf{x}_n)$, and the estimated label $\hat{y}_n$ is subsequently determined by selecting the class $c$ corresponding to the highest posterior probability $p(y_n = c \mid \mathbf{x}_n)$.

We consider two types of labels related to speech activity. First, we use a classifier to perform voice activity detection (VAD), i.e. $y_n^{\mathrm{VAD}} \in \{0, 1\}$. We use the binary cross-entropy (BCE) as the learning objective for this classifier. The prior of $y_n^{\mathrm{VAD}}$ is a symmetric Bernoulli distribution. Second, we consider a classifier to perform ideal binary mask (IBM) estimation, i.e. $\mathbf{y}_n^{\mathrm{IBM}} \in \{0, 1\}^F$, which is equivalent to performing VAD per time-frequency bin. Thus, we use the BCE averaged over all frequency bins $f$, and the prior for each frequency bin is a symmetric Bernoulli distribution.
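As an illustration of how the label can enter model M2 and of the supervised classifier, the sketch below concatenates the label with the power spectrum at the encoder input and trains a feedforward classifier with the BCE loss (one output unit for VAD, F output units for the IBM). This is one plausible realization under our own assumptions, not necessarily the exact architecture used here.

```python
import torch
import torch.nn as nn

class GuidedEncoder(nn.Module):
    """Sketch of the guided recognition model q_phi(z_{2,n} | y_n, s_n):
    the label y_n is concatenated with |s_n|^2 at the input (illustrative choice)."""
    def __init__(self, n_freq=513, label_dim=1, latent_dim=16, hidden=128):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(n_freq + label_dim, hidden), nn.Tanh())
        self.mean = nn.Linear(hidden, latent_dim)
        self.logvar = nn.Linear(hidden, latent_dim)

    def forward(self, s_pow, y):
        # s_pow: (batch, n_freq) power spectra; y: (batch, label_dim) label as float
        h = self.shared(torch.cat([s_pow, y], dim=-1))
        return self.mean(h), self.logvar(h)

class LabelClassifier(nn.Module):
    """Noise-robust classifier p(y_n | x_n): label_dim = 1 for VAD, F for the IBM."""
    def __init__(self, n_freq=513, label_dim=1, hidden=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_freq, hidden), nn.Tanh(),
                                 nn.Linear(hidden, label_dim))

    def forward(self, x_pow):
        return self.net(x_pow)  # logits; apply a sigmoid to obtain posterior probabilities

# Training objective for the classifier (BCE on logits; averaged over bins for the IBM):
# loss = nn.BCEWithLogitsLoss()(classifier(x_pow_normalized), y_target)
```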
4. EXPERIMENTAL SETUP

4.1. Dataset
For training, we use the "si_tr_s" subset of the Wall Street Journal (WSJ0) dataset, which consists of approximately h of clean speech [16], and the noise signals DWASHING, NRIVER, OOFFICE and TMETRO of the DEMAND dataset [17]. For validation, we use the "si_dt_05" subset of WSJ0 and the noise signals NFIELD, OHALLWAY, PSTATION and TBUS of the DEMAND dataset. All signals have a sampling rate of kHz. For the test, we use the "si_et_05" subset of WSJ0, consisting of utterances resulting in . h, and the noise signals from the "verification" subset of the QUT-NOISE dataset [18], which we downsample to kHz. Note that both the speakers and the noise types in the test set are different from those in the training set. Each mixture signal is created by uniformly sampling a noise type and mixing the speech and noise signals at SNRs of − , and +5 dB.
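As an example of how such mixtures can be generated, the following sketch scales the noise so that a desired utterance-level SNR is reached; the exact scaling convention used for the dataset may differ, and the function name is ours.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale the noise so that the utterance-level speech-to-noise power ratio equals snr_db.
    Assumes the noise excerpt is at least as long as the speech signal."""
    noise = noise[: len(speech)]                     # crop noise to the speech length
    p_s = np.mean(speech ** 2)
    p_n = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_s / (p_n * 10 ** (snr_db / 10)))
    return speech + scale * noise
```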
Hereafter, we denote model M2 with VAD labels as M2+VAD and model M2 with IBM labels as M2+IBM. We use the DNN-based classifier described in Section 3.3 for the estimation of the VAD and IBM labels. To compare with our DNN-based IBM classifier, we also use a non-learned classifier, which consists of the IBM estimator used inside the algorithm of Gerkmann and Hendriks [5], originally employed for noise PSD estimation. For the baselines, we use model M1 and a feedforward DNN estimating a Wiener-like mask trained with the magnitude spectrum approximation loss [19], which we denote as Supervised.

The STFT is computed using a ms Hann window with 75% overlap, resulting in a frame period of ms and F = 513 unique frequency bins. To obtain the ground truth for the VAD and IBM labels, we use the method of Heymann et al. applied to the clean speech [20].

For a fair comparison between all the approaches, we consider a similar architecture for each model. Tab. 1 shows the configuration of the models. In particular, we consider hidden layers for Supervised to match the same number of layers as models M1 and M2 (encoder + decoder). Model M1 has , learnable parameters, whereas M2+VAD has , and M2+IBM has , . The VAD classifier has , learnable parameters, whereas the IBM classifier has , and Supervised has , .

We use the Adam optimizer with the standard configuration and a learning rate of − [21]. We set the batch size to . Note that because the learning objective of the classifier is scale-dependent, the DNN input $|\mathbf{x}_n|^2$ needs to be normalized at training time. This is not the case for models M1 and M2, since the reconstruction loss (i.e. the Itakura-Saito distance) is scale-independent. Early stopping with a patience of epochs is performed using the validation set. For the MCEM, we follow the settings of Leglaive et al. and set the NMF rank to K = 10 [12]. For the non-learned IBM classifier, we use the standard configuration of the IBM estimator as in Gerkmann and Hendriks [5].
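For reference, F = 513 unique frequency bins correspond to a 1024-point DFT, and 75% overlap corresponds to a hop of 256 samples; a minimal analysis sketch using SciPy is given below (the sampling rate, and hence the frame length and period in ms, is left as an input, since it is not restated here).

```python
import numpy as np
from scipy.signal import stft

def analysis_stft(x, fs):
    # F = 513 unique bins implies a 1024-point DFT; 75% overlap implies a 256-sample hop.
    f, t, X = stft(x, fs=fs, window='hann', nperseg=1024, noverlap=768)
    return X.T  # (N, 513) complex STFT, time frames first as in the paper's notation
```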
Table 1. Model configurations (hidden layers and output layer) of the encoder, the decoder, the DNN classifier and Supervised.
To evaluate the classification performance of the different classifiers, we use the F1-score, which combines the precision and recall rates. To evaluate the speech enhancement performance of the approaches, we use the scale-invariant signal-to-distortion ratio (SI-SDR) measured in dB [22].

Table 2. Average SI-SDR (in dB) and 95% confidence intervals on the test set, on average and for different input SNRs, together with the F1-scores of the classifiers, for the unprocessed mixture, Supervised, M1, M2+VAD (DNN and oracle classifiers) and M2+IBM (classifier of [5], DNN and oracle classifiers).
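For completeness, here is a minimal sketch of the SI-SDR computation reported in Tab. 2, following the usual definition of [22]; the mean removal is our own implementation choice.

```python
import numpy as np

def si_sdr(estimate, reference, eps=1e-12):
    """Scale-invariant SDR in dB between an estimated and a reference time-domain signal."""
    reference = reference - np.mean(reference)
    estimate = estimate - np.mean(estimate)
    # Project the estimate onto the reference to find the optimal scaling factor.
    alpha = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = alpha * reference
    noise = estimate - target
    return 10 * np.log10((np.sum(target ** 2) + eps) / (np.sum(noise ** 2) + eps))
```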
5. RESULTS
Tab. 2 shows the results on average and per input SNR. Regarding the results on average, M2+VAD with the DNN classifier outperforms Supervised by . dB but is outperformed by M1 by . dB. M2+VAD with the oracle classifier outperforms M1 by . dB, but the difference is not statistically significant. We conclude that, even with the best classifier, VAD does not inform the speech prior learned by the VAE statistically better.

M2+IBM with the DNN classifier outperforms Supervised by . dB and M1 by . dB on average. The improvement of M2+IBM over these two models is also statistically significant. However, the performance of M2+IBM with the non-learned classifier drops dramatically compared to the DNN classifier. Since the classification performance of the non-learned classifier is worse than that of the DNN classifier, we conclude that the performance of M2+IBM crucially depends on the classifier. Finally, the performance of M2+IBM with the oracle classifier shows that the IBM informs the speech prior significantly better. Thus, the performance of M2+IBM could be improved with a better classifier.

M2+IBM with the DNN classifier also outperforms M1 for all input SNRs. The difference between the two models gets larger as the input SNR decreases. Therefore, M2+IBM with the DNN classifier is particularly more robust to noise than M1 in low SNRs.

From informal listening tests, we can state that M2+IBM with the DNN classifier typically reduces the noise better than M1, which is particularly obvious in the presence of nonstationary noise and transient interferences such as bursts. Code and audio examples are available online (https://uhh.de/inf-sp-guided2021).
6. CONCLUSION
We proposed to guide a VAE for speech enhancement with a supervised classifier separately trained on noisy speech. We evaluated our method with labels corresponding to VAD and IBM on real recordings of different noisy environments. Using the IBM as label and a feedforward DNN classifier, the guided VAE outperforms the standard VAE and a feedforward DNN-based Wiener filter, particularly in low SNRs. Improving the classifier by taking time dependencies and/or visual information into account could further improve the guided VAE. The model could then be compared to other deep generative models taking temporal dependencies into account [23].
7. REFERENCES

[1] E. Vincent, T. Virtanen, and S. Gannot, eds., Audio Source Separation and Speech Enhancement. Hoboken, NJ: John Wiley & Sons, 2018.
[2] R. C. Hendriks, T. Gerkmann, and J. Jensen, DFT-Domain Based Single-Microphone Noise Reduction for Speech Enhancement: A Survey of the State-of-the-Art. No. 11 in Synthesis Lectures on Speech and Audio Processing, Williston, VT: Morgan & Claypool, 2013.
[3] C. Breithaupt, T. Gerkmann, and R. Martin, "Cepstral Smoothing of Spectral Filter Gains for Speech Enhancement Without Musical Noise," IEEE Signal Processing Letters, vol. 14, pp. 1036–1039, Dec. 2007.
[4] C. Févotte, N. Bertin, and J.-L. Durrieu, "Nonnegative Matrix Factorization with the Itakura-Saito Divergence: With Application to Music Analysis," Neural Computation, vol. 21, pp. 793–830, Mar. 2009.
[5] T. Gerkmann and R. C. Hendriks, "Unbiased MMSE-Based Noise Power Estimation With Low Complexity and Low Tracking Delay," IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, pp. 1383–1393, May 2012.
[6] A. Narayanan and D. Wang, "Ideal ratio mask estimation using deep neural networks for robust speech recognition," in ICASSP, pp. 7092–7096, May 2013.
[7] P. Huang, M. Kim, M. Hasegawa-Johnson, and P. Smaragdis, "Joint Optimization of Masks and Deep Recurrent Neural Networks for Monaural Source Separation," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 23, pp. 2136–2147, Dec. 2015.
[8] D. S. Williamson, Y. Wang, and D. Wang, "Complex Ratio Masking for Monaural Speech Separation," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, pp. 483–492, Mar. 2016.
[9] Y. Luo and N. Mesgarani, "Conv-TasNet: Surpassing Ideal Time–Frequency Magnitude Masking for Speech Separation," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, pp. 1256–1266, Aug. 2019.
[10] D. P. Kingma and M. Welling, "An Introduction to Variational Autoencoders," Foundations and Trends in Machine Learning, vol. 12, no. 4, pp. 307–392, 2019.
[11] Y. Bando, M. Mimura, K. Itoyama, K. Yoshii, and T. Kawahara, "Statistical Speech Enhancement Based on Probabilistic Integration of Variational Autoencoder and Non-Negative Matrix Factorization," in ICASSP, pp. 716–720, Apr. 2018.
[12] S. Leglaive, L. Girin, and R. Horaud, "A variance modeling framework based on variational autoencoders for speech enhancement," in MLSP, pp. 1–6, Sept. 2018.
[13] D. P. Kingma, S. Mohamed, D. Jimenez Rezende, and M. Welling, "Semi-supervised learning with deep generative models," in NeurIPS, pp. 3581–3589, Curran Associates, Inc., 2014.
[14] H. Kameoka, L. Li, S. Inoue, and S. Makino, "Supervised determined source separation with multichannel variational autoencoder," Neural Computation, vol. 31, pp. 1891–1914, Sept. 2019.
[15] A. Liutkus, R. Badeau, and G. Richard, "Gaussian Processes for Underdetermined Source Separation," IEEE Transactions on Signal Processing, vol. 59, pp. 3155–3167, July 2011.
[16] J. S. Garofolo, D. Graff, D. Paul, and D. S. Pallett, CSR-I (WSJ0) Sennheiser. Linguistic Data Consortium.
[17] J. Thiemann, N. Ito, and E. Vincent, "The Diverse Environments Multi-channel Acoustic Noise Database (DEMAND): A database of multichannel environmental noise recordings," in Proc. Meetings on Acoustics, 2013.
[18] in Interspeech, pp. 3456–3460, 2015.
[19] F. Weninger, J. R. Hershey, J. L. Roux, and B. Schuller, "Discriminatively trained recurrent neural networks for single-channel speech separation," in GlobalSIP, pp. 577–581, Dec. 2014.
[20] J. Heymann, L. Drude, and R. Haeb-Umbach, "Neural network based spectral mask estimation for acoustic beamforming," in ICASSP, pp. 196–200, Mar. 2016.
[21] D. P. Kingma and J. Ba, "Adam: A Method for Stochastic Optimization," in ICLR, Dec. 2014.
[22] J. L. Roux, S. Wisdom, H. Erdogan, and J. R. Hershey, "SDR – Half-baked or Well Done?," in ICASSP, pp. 626–630, May 2019.
[23] J. Richter, G. Carbajal, and T. Gerkmann, "Speech Enhancement with Stochastic Temporal Convolutional Networks," in