Guided Variational Autoencoder for Speech Enhancement With a Supervised Classifier
Guillaume Carbajal, Julius Richter, Timo Gerkmann
Signal Processing (SP), Universität Hamburg, Germany
{guillaume.carbajal, julius.richter, timo.gerkmann}@uni-hamburg.de
ABSTRACT
Recently, variational autoencoders have been successfully used to learn a probabilistic prior over speech signals, which is then used to perform speech enhancement. However, variational autoencoders are trained on clean speech only, which results in a limited ability to extract the speech signal from noisy speech compared to supervised approaches. In this paper, we propose to guide the variational autoencoder with a supervised classifier separately trained on noisy speech. The estimated label is a high-level categorical variable describing the speech signal (e.g. speech activity), allowing for a more informed latent distribution compared to the standard variational autoencoder. We evaluate our method with different types of labels on real recordings of different noisy environments. Provided that the label better informs the latent distribution and that the classifier achieves good performance, the proposed approach outperforms the standard variational autoencoder and a conventional neural network-based supervised approach.
Index Terms — Speech enhancement, deep generative model, variational autoencoder, semi-supervised learning.
1. INTRODUCTION
The task of single-channel speech enhancement consists in recovering a speech signal from a mixture signal captured with one microphone in a noisy environment [1]. Common speech enhancement approaches estimate the speech signal using a filter in the time-frequency domain to reduce the noise signal while avoiding speech artifacts [2]. Under the Gaussian assumption, the optimal filter in the minimum mean square error sense requires estimating the signal variances [3–5].

Supervised deep neural networks (DNNs) have demonstrated excellent performance in estimating the speech signal [6–9]. However, supervised approaches require labeled data which originates from pairs of noisy and clean speech. These pairs can be created synthetically. However, since supervised approaches may not generalize well to unseen situations, a large number of pairs is needed to cover various acoustic conditions, e.g. different noise types, reverberation and different signal-to-noise ratios (SNRs).

Recently, deep generative models based on the variational autoencoder (VAE) have gained attention for learning the probability distribution of complex data [10]. VAEs have been used to learn a prior distribution of clean speech, and have been combined with an untrained non-negative matrix factorization (NMF) noise model to estimate the signal variances using a Monte Carlo expectation-maximization (MCEM) algorithm [11, 12].
This work has been funded by the German Research Foundation (DFG) in the transregio project Crossmodal Learning (TRR 169) and ahoi.digital.
However, since the VAE speech model is trained in an unsupervised manner on clean speech only, its ability to extract speech characteristics from noisy speech is limited in low SNRs. This results in limited speech enhancement performance compared to supervised approaches in already-seen noisy environments [11].

To overcome this limitation, the VAE can be conditioned on an auxiliary variable that allows for a more informed probabilistic latent distribution [13]. Kameoka et al. used a VAE conditioned on the speaker identity to inform the speech prior for multichannel speech separation [14]. However, their approach can only separate speakers which are included in the training set. As a result, their approach aims at speaker-dependent speech separation and not at speaker-independent speech enhancement.

In this work, we propose to guide the VAE with a classifier fully decoupled from the VAE. The classifier is trained separately in a supervised manner with pairs of noisy and clean speech. The estimated label is a high-level categorical variable describing the speech signal (e.g. speech activity). We show that the choice of label is crucial for the performance of the proposed guided VAE. In addition, we show that a noise-robust classifier is also required to outperform the standard VAE and a conventional supervised DNN-based approach.

The rest of this paper is organized as follows. In Section 2 we summarize the background related to the VAE for speech enhancement. Section 3 describes our proposed approach. The experimental setup is described in Section 4, which is followed by the evaluation in Section 5.
2. BACKGROUND

2.1. Mixture model and filtering
In the time-frequency domain using the short-time Fourier transform (STFT), the mixture signal $x_{nf} \in \mathbb{C}$ is the sum of the clean speech $s_{nf} \in \mathbb{C}$ and the noise $b_{nf} \in \mathbb{C}$:
$$x_{nf} = \sqrt{g_n}\, s_{nf} + b_{nf}, \qquad (1)$$
at time frame index $n \in [1, N]$ and frequency bin $f \in [1, F]$, where $N$ denotes the number of time frames and $F$ the number of frequency bins of the utterance. The scalar $g_n \in \mathbb{R}_+$ represents a frequency-independent but time-varying gain providing some robustness with respect to the time-varying loudness of different speech signals [12].

Under the Gaussian assumption, the clean speech $s_{nf}$ can be estimated in the minimum mean square error sense using the Wiener estimator:
$$\hat{s}_{nf} = \frac{\hat{g}_n \hat{v}_{s,nf}}{\hat{g}_n \hat{v}_{s,nf} + \hat{v}_{b,nf}}\, x_{nf}, \qquad (2)$$
where $\hat{v}_{s,nf}$ and $\hat{v}_{b,nf}$ are the estimated variances of the clean speech $s_{nf}$ and the noise $b_{nf}$, respectively. Under a local stationarity assumption, the short-time power spectra $|s_{nf}|^2$ and $|b_{nf}|^2$ are unbiased estimates of the signal variances [15].

2.2. VAE speech model

The standard VAE, which we refer to as model M1, is used to learn a prior over clean speech [11, 12]. At time frame $n$, the frequency bins of clean speech $\mathbf{s}_n \in \mathbb{C}^F$ are modeled as
$$\mathbf{s}_n \mid \mathbf{z}_{1,n} \sim \mathcal{CN}\big(\mathbf{0}, \operatorname{diag}(\mathbf{v}_\theta(\mathbf{z}_{1,n}))\big), \qquad \mathbf{z}_{1,n} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}), \qquad (3)$$
where $\mathbf{z}_{1,n} \in \mathbb{R}^D$ denotes a latent variable of dimension $D$ and $\mathbf{v}_\theta: \mathbb{R}^D \mapsto \mathbb{R}_+^F$ represents a trainable feedforward DNN, called the generative model or decoder, parametrized by $\theta$ (see Fig. 1a).

In variational inference, the posterior of $\mathbf{z}_{1,n}$ is approximated as
$$\mathbf{z}_{1,n} \mid \mathbf{s}_n \sim \mathcal{N}\big(\boldsymbol{\mu}_\phi(|\mathbf{s}_n|^2), \operatorname{diag}(\mathbf{v}_\phi(|\mathbf{s}_n|^2))\big), \qquad (4)$$
where $\boldsymbol{\mu}_\phi: \mathbb{R}_+^F \mapsto \mathbb{R}^D$ and $\mathbf{v}_\phi: \mathbb{R}_+^F \mapsto \mathbb{R}_+^D$ represent feedforward DNNs sharing the same input and hidden layers, called the recognition model or encoder, which are parametrized by $\phi$ (see Fig. 1b). Note that the absolute value and squaring in (4) are performed element-wise.

The generative model and recognition model are simultaneously trained by maximizing the evidence lower bound (ELBO) on the per-frame log-likelihood
$$\log p_\theta(\mathbf{s}_n) \geq \mathbb{E}_{q_\phi(\mathbf{z}_{1,n} \mid \mathbf{s}_n)}\big[\log p_\theta(\mathbf{s}_n \mid \mathbf{z}_{1,n})\big] - D_{\mathrm{KL}}\big(q_\phi(\mathbf{z}_{1,n} \mid \mathbf{s}_n) \,\|\, p(\mathbf{z}_{1,n})\big), \qquad (5)$$
where the first term is the reconstruction loss and $D_{\mathrm{KL}}(\cdot \| \cdot)$ denotes the Kullback-Leibler divergence.

Fig. 1. Model M1 consisting of (a) a generative model $p_\theta(\mathbf{s}_n \mid \mathbf{z}_{1,n})\, p(\mathbf{z}_{1,n})$ and (b) a recognition model $q_\phi(\mathbf{z}_{1,n} \mid \mathbf{s}_n)$.

2.3. NMF noise model

The noise variance is modeled with an untrained NMF as
$$v_{b,nf} = \{\mathbf{H}\mathbf{W}\}_{nf}, \qquad (6)$$
where $\mathbf{H} \in \mathbb{R}_+^{N \times K}$ and $\mathbf{W} \in \mathbb{R}_+^{K \times F}$ are two non-negative matrices representing the temporal activations and spectral patterns of the noise power spectrogram, and $K$ denotes the NMF rank.

2.4. Speech enhancement

Given the speech prior provided by model M1 and the noise model, the mixture signal $x_{nf}$ is distributed as
$$x_{nf} \mid \mathbf{z}_{1,n} \sim \mathcal{CN}\big(0,\; g_n \{\mathbf{v}_\theta(\mathbf{z}_{1,n})\}_f + \{\mathbf{H}\mathbf{W}\}_{nf}\big), \qquad (7)$$
where $\Theta_u = \{g_n, \mathbf{H}, \mathbf{W}\}$ are the unsupervised parameters to be estimated. Since the resulting optimization problem is intractable due to the non-linear relation between the speech variance and the latent variable, an MCEM algorithm is employed to iteratively optimize the unsupervised parameters $\Theta_u$ [12]. At each iteration, the estimated terms $\hat{g}_n \{\mathbf{v}_\theta(\mathbf{z}_{1,n})\}_f$ and $\{\widehat{\mathbf{H}}\widehat{\mathbf{W}}\}_{nf}$ are supposed to get closer to the true variances $v_{s,nf}$ and $v_{b,nf}$, respectively, reaching a local optimum.
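To make the training of model M1 concrete, the following is a minimal PyTorch sketch of the encoder, decoder and negative ELBO of (3)–(5). Layer sizes, activations and variable names are illustrative assumptions and do not correspond to the configuration of Tab. 1; the reconstruction term is the Itakura-Saito-type negative log-likelihood of the zero-mean complex Gaussian in (3), up to an additive constant.

```python
import torch
import torch.nn as nn

class VAESpeechModel(nn.Module):
    """Sketch of model M1 (illustrative layer sizes, not those of Tab. 1)."""
    def __init__(self, n_freq=513, latent_dim=16, hidden=128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_freq, hidden), nn.Tanh())
        self.enc_mean = nn.Linear(hidden, latent_dim)     # mu_phi(|s_n|^2)
        self.enc_logvar = nn.Linear(hidden, latent_dim)   # log v_phi(|s_n|^2)
        # Decoder outputs the log of the speech variance v_theta(z_n) for stability.
        self.decoder = nn.Sequential(nn.Linear(latent_dim, hidden), nn.Tanh(),
                                     nn.Linear(hidden, n_freq))

    def forward(self, s_pow):
        # s_pow: (batch, n_freq) squared magnitudes |s_n|^2 of clean speech frames
        h = self.encoder(s_pow)
        mu, logvar = self.enc_mean(h), self.enc_logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterization
        log_v_s = self.decoder(z)
        return log_v_s, mu, logvar

def negative_elbo(s_pow, log_v_s, mu, logvar, eps=1e-8):
    # Reconstruction: Itakura-Saito divergence between |s_n|^2 and v_theta(z_n),
    # i.e. the negative log-likelihood of the complex Gaussian in (3) up to a constant.
    v_s = torch.exp(log_v_s)
    recon = torch.sum(s_pow / (v_s + eps) + log_v_s, dim=-1)
    # KL divergence between N(mu, diag(exp(logvar))) and the standard normal prior.
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=-1)
    return (recon + kl).mean()
```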
Note that while the VAE operates on a frame-by-frame basis, the MCEM algorithm is offline, resulting in an offline estimation of the parameters.

At test time, the recognition model takes the mixture signal $x_{nf}$ as input instead of the clean speech $s_{nf}$. However, since model M1 is trained on clean speech only, its ability to extract speech characteristics from the mixture $x_{nf}$ is limited. As a result, the speech enhancement performance of the MCEM using M1 may be lower compared to supervised approaches trained on already-seen noisy environments [11].
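For illustration, once the MCEM has produced estimates of the gain, the speech variance and the NMF noise variance, the Wiener estimator (2) can be applied to the mixture STFT as in the following NumPy sketch (function and variable names are ours, not from the paper's code):

```python
import numpy as np

def wiener_filter(x_stft, g, v_s, H, W):
    """Apply the Wiener estimator (2).

    x_stft : (N, F) complex mixture STFT
    g      : (N,)   estimated frequency-independent gains g_n
    v_s    : (N, F) estimated speech variances {v_theta(z_n)}_f
    H, W   : (N, K) and (K, F) estimated NMF factors, so that v_b = H @ W, cf. (6)
    """
    v_b = H @ W                        # noise variances
    num = g[:, None] * v_s             # scaled speech variances
    return num / (num + v_b) * x_stft  # element-wise Wiener gain applied to the mixture
```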
3. GUIDED VARIATIONAL AUTOENCODER
In this section, we propose a guided VAE, which is an extension of model M1 combined with a supervised classifier.
3.1. Guided VAE speech model

Inspired by Kingma et al.'s deep generative model for semi-supervised learning [13], we extend model M1 with a categorical variable $y_n \in \mathcal{Y}$ that characterizes a high-level feature of the speech signal (e.g. speech activity). Hereafter, we denote $y_n$ as the label. The label is supposed to allow for a more informed probabilistic speech prior learned by the VAE. We describe our choice for $y_n$ in Section 3.3.

At time frame $n$, the frequency bins of clean speech $\mathbf{s}_n$ are generated as
$$\mathbf{s}_n \mid y_n, \mathbf{z}_{2,n} \sim \mathcal{CN}\big(\mathbf{0}, \operatorname{diag}(\mathbf{v}_\theta(y_n, \mathbf{z}_{2,n}))\big), \qquad (8)$$
where $\mathbf{z}_{2,n}$ has the same prior as $\mathbf{z}_{1,n}$ and $\mathbf{v}_\theta: \mathcal{Y} \times \mathbb{R}^D \mapsto \mathbb{R}_+^F$ is a feedforward DNN, resulting in the guided generative model (see Fig. 2a). The posterior of $\mathbf{z}_{2,n}$ is approximated as
$$\mathbf{z}_{2,n} \mid y_n, \mathbf{s}_n \sim \mathcal{N}\big(\boldsymbol{\mu}_\phi(y_n, |\mathbf{s}_n|^2), \operatorname{diag}(\mathbf{v}_\phi(y_n, |\mathbf{s}_n|^2))\big), \qquad (9)$$
where $\boldsymbol{\mu}_\phi: \mathcal{Y} \times \mathbb{R}_+^F \mapsto \mathbb{R}^D$ and $\mathbf{v}_\phi: \mathcal{Y} \times \mathbb{R}_+^F \mapsto \mathbb{R}_+^D$ are feedforward DNNs sharing the same input and hidden layers, resulting in the guided recognition model (see Fig. 2b).

The guided generative model and recognition model are simultaneously trained by maximizing the ELBO on the per-frame joint log-likelihood
$$\log p_\theta(y_n, \mathbf{s}_n) \geq \mathbb{E}_{q_\phi(\mathbf{z}_{2,n} \mid y_n, \mathbf{s}_n)}\big[\log p_\theta(\mathbf{s}_n \mid y_n, \mathbf{z}_{2,n})\big] - D_{\mathrm{KL}}\big(q_\phi(\mathbf{z}_{2,n} \mid y_n, \mathbf{s}_n) \,\|\, p(\mathbf{z}_{2,n})\big) + \log p(y_n), \qquad (10)$$
where the first term is the reconstruction loss and $p(y_n)$ is the prior distribution of $y_n$.

Fig. 2. Model M2 consisting of (a) the guided generative model $p_\theta(\mathbf{s}_n \mid y_n, \mathbf{z}_{2,n})\, p(\mathbf{z}_{2,n})\, p(y_n)$ and (b) the guided recognition model $q_\phi(\mathbf{z}_{2,n} \mid \mathbf{s}_n, y_n)$.

3.2. Speech enhancement

The estimation path at test time is shown in Fig. 3. First, we use a classifier to estimate $\hat{y}_n$ from the mixture $\mathbf{x}_n$. Then, we use the mixture signal $\mathbf{x}_n$ and the estimated label $\hat{y}_n$ as inputs for the guided recognition model. Given $\hat{y}_n$ and the latent variable $\mathbf{z}_{2,n}$, the mixture signal $x_{nf}$ is distributed as
$$x_{nf} \mid \hat{y}_n, \mathbf{z}_{2,n} \sim \mathcal{CN}\big(0,\; g_n \{\mathbf{v}_\theta(\hat{y}_n, \mathbf{z}_{2,n})\}_f + \{\mathbf{H}\mathbf{W}\}_{nf}\big). \qquad (11)$$
For estimating the unsupervised parameters we use the same MCEM configuration as for model M1. Provided that 1) the classifier is noise-robust and that 2) the label $y_n$ better informs the speech prior, model M2 is supposed to better extract speech characteristics from the mixture $\mathbf{x}_n$ than model M1.

Fig. 3. Combined model M2 and classifier $p(y_n \mid \mathbf{x}_n)$ at test time.

3.3. Supervised classifier

Since the classifier is fully decoupled from model M2, it can be trained separately. This fact can be used to construct the best classifier possible. In order to obtain a noise-robust classifier, we train a feedforward DNN in a supervised manner using the mixture power spectra $|\mathbf{x}_n|^2$ as inputs and the corresponding labels $y_n$ as targets. The classifier outputs the posterior probability $p(y_n \mid \mathbf{x}_n)$, and the estimated label $\hat{y}_n$ is subsequently determined by selecting the class $c$ corresponding to the highest posterior probability $p(y_n = c \mid \mathbf{x}_n)$.

We consider two types of labels related to speech activity. First, we use a classifier to perform voice activity detection (VAD), i.e. $y_n^{\mathrm{VAD}} \in \{0, 1\}$. We use the binary cross-entropy (BCE) as the learning objective for this classifier. The prior of $y_n^{\mathrm{VAD}}$ is a symmetric Bernoulli distribution. Second, we consider a classifier to perform ideal binary mask (IBM) estimation, i.e. $\mathbf{y}_n^{\mathrm{IBM}} \in \{0, 1\}^F$, which is equivalent to performing VAD per time-frequency bin. Thus, we use the BCE averaged over all frequency bins $f$, and the prior for each frequency bin is a symmetric Bernoulli distribution.
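As an illustration of how the label can enter model M2 and of the supervised classifier, the sketch below concatenates the label with the power spectrum at the encoder input and trains a feedforward classifier with the BCE loss (one output unit for VAD, F output units for the IBM). This is one plausible realization under our own assumptions, not necessarily the exact architecture used here.

```python
import torch
import torch.nn as nn

class GuidedEncoder(nn.Module):
    """Sketch of the guided recognition model q_phi(z_{2,n} | y_n, s_n):
    the label y_n is concatenated with |s_n|^2 at the input (illustrative choice)."""
    def __init__(self, n_freq=513, label_dim=1, latent_dim=16, hidden=128):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(n_freq + label_dim, hidden), nn.Tanh())
        self.mean = nn.Linear(hidden, latent_dim)
        self.logvar = nn.Linear(hidden, latent_dim)

    def forward(self, s_pow, y):
        # s_pow: (batch, n_freq) power spectra; y: (batch, label_dim) label as float
        h = self.shared(torch.cat([s_pow, y], dim=-1))
        return self.mean(h), self.logvar(h)

class LabelClassifier(nn.Module):
    """Noise-robust classifier p(y_n | x_n): label_dim = 1 for VAD, F for the IBM."""
    def __init__(self, n_freq=513, label_dim=1, hidden=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_freq, hidden), nn.Tanh(),
                                 nn.Linear(hidden, label_dim))

    def forward(self, x_pow):
        return self.net(x_pow)  # logits; apply a sigmoid to obtain posterior probabilities

# Training objective for the classifier (BCE on logits; averaged over bins for the IBM):
# loss = nn.BCEWithLogitsLoss()(classifier(x_pow_normalized), y_target)
```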
4. EXPERIMENTAL SETUP

4.1. Dataset
For training, we use the "si_tr_s" subset of the Wall Street Journal (WSJ0) dataset, which consists of approximately h of clean speech [16], and the noise signals DWASHING, NRIVER, OOFFICE and TMETRO of the DEMAND dataset [17]. For validation, we use the "si_dt_05" subset of WSJ0 and the noise signals NFIELD, OHALLWAY, PSTATION and TBUS of the DEMAND dataset. All signals have a sampling rate of kHz. For the test, we use the "si_et_05" subset of WSJ0, consisting of utterances resulting in . h, and the noise signals from the "verification" subset of the QUT-NOISE dataset [18], which we downsample to kHz. Note that both the speakers and the noise types in the test set are different from those in the training set. Each mixture signal is created by uniformly sampling a noise type and mixing the speech and noise signals at SNRs of − , and +5 dB.
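As an example of how such mixtures can be generated, the following sketch scales the noise so that a desired utterance-level SNR is reached; the exact scaling convention used for the dataset may differ, and the function name is ours.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale the noise so that the utterance-level speech-to-noise power ratio equals snr_db.
    Assumes the noise excerpt is at least as long as the speech signal."""
    noise = noise[: len(speech)]                     # crop noise to the speech length
    p_s = np.mean(speech ** 2)
    p_n = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_s / (p_n * 10 ** (snr_db / 10)))
    return speech + scale * noise
```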
Hereafter, we denote model M2 with VAD labels as M2+VAD and model M2 with IBM labels as M2+IBM. We use the DNN-based classifier described in Section 3.3 for the estimation of the VAD and IBM labels. To compare with our DNN-based IBM classifier, we also use a non-learned classifier, which consists of the IBM estimator used inside the algorithm of Gerkmann and Hendriks [5], originally employed for noise PSD estimation. For the baselines, we use model M1 and a feedforward DNN estimating a Wiener-like mask trained with the magnitude spectrum approximation loss [19], which we denote as Supervised.

The STFT is computed using a ms Hann window with 75% overlap, resulting in a frame period of ms and F = 513 unique frequency bins. To obtain the ground truth for the VAD and IBM labels, we use the method of Heymann et al. applied to the clean speech [20].

For a fair comparison between all the approaches, we consider a similar architecture for each model. Tab. 1 shows the configuration of the models. In particular, we consider hidden layers for Supervised to match the same number of layers as models M1 and M2 (encoder + decoder). Model M1 has , learnable parameters, whereas M2+VAD has , and M2+IBM has , . The VAD classifier has , learnable parameters, whereas the IBM classifier has , and Supervised has , .

We use the Adam optimizer with the standard configuration and a learning rate of − [21]. We set the batch size to . Note that because the learning objective of the classifier is scale-dependent, the DNN input $|\mathbf{x}_n|^2$ needs to be normalized at training time. This is not the case for models M1 and M2, since the reconstruction loss (i.e. the Itakura-Saito distance) is scale-independent. Early stopping with a patience of epochs is performed using the validation set. For the MCEM, we follow the settings of Leglaive et al. and set the NMF rank to K = 10 [12]. For the non-learned IBM classifier, we use the standard configuration of the IBM estimator as in Gerkmann and Hendriks [5].
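For reference, F = 513 unique frequency bins correspond to a 1024-point DFT, and 75% overlap corresponds to a hop of 256 samples; a minimal analysis sketch using SciPy is given below (the sampling rate, and hence the frame length and period in ms, is left as an input, since it is not restated here).

```python
import numpy as np
from scipy.signal import stft

def analysis_stft(x, fs):
    # F = 513 unique bins implies a 1024-point DFT; 75% overlap implies a 256-sample hop.
    f, t, X = stft(x, fs=fs, window='hann', nperseg=1024, noverlap=768)
    return X.T  # (N, 513) complex STFT, time frames first as in the paper's notation
```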
Table 1. Model configurations (hidden layers and output layer) of the encoder, the decoder, the DNN classifier and Supervised.
To evaluate the classification performance of the different classifiers, we use the F1-score, which combines the precision and recall rates. To evaluate the speech enhancement performance of the approaches, we use the scale-invariant signal-to-distortion ratio (SI-SDR) measured in dB [22].

Table 2. Average SI-SDR (in dB) and 95% confidence intervals on the test set, on average and for different input SNRs, together with the F1-scores of the classifiers, for the unprocessed mixture, Supervised, M1, M2+VAD (DNN and oracle classifiers) and M2+IBM (classifier of [5], DNN and oracle classifiers).
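For completeness, here is a minimal sketch of the SI-SDR computation reported in Tab. 2, following the usual definition of [22]; the mean removal is our own implementation choice.

```python
import numpy as np

def si_sdr(estimate, reference, eps=1e-12):
    """Scale-invariant SDR in dB between an estimated and a reference time-domain signal."""
    reference = reference - np.mean(reference)
    estimate = estimate - np.mean(estimate)
    # Project the estimate onto the reference to find the optimal scaling factor.
    alpha = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = alpha * reference
    noise = estimate - target
    return 10 * np.log10((np.sum(target ** 2) + eps) / (np.sum(noise ** 2) + eps))
```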
5. RESULTS
Tab. 2 shows the results on average and per input SNR. Regarding the results on average, M2+VAD with the DNN classifier outperforms Supervised by . dB but is outperformed by M1 by . dB. M2+VAD with the oracle classifier outperforms M1 by . dB, but the difference is not statistically significant. We conclude that, even with the best classifier, VAD does not inform the speech prior learned by the VAE statistically better.

M2+IBM with the DNN classifier outperforms Supervised by . dB and M1 by . dB on average. The improvement of M2+IBM over these two models is also statistically significant. However, the performance of M2+IBM with the non-learned classifier drops dramatically compared to the DNN classifier. Since the classification performance of the non-learned classifier is worse than that of the DNN classifier, we conclude that the performance of M2+IBM crucially depends on the classifier. Finally, the performance of M2+IBM with the oracle classifier shows that the IBM informs the speech prior significantly better. Thus, the performance of M2+IBM could be improved with a better classifier.

M2+IBM with the DNN classifier also outperforms M1 for all input SNRs. The difference between the two models gets larger as the input SNR decreases. Therefore, M2+IBM with the DNN classifier is particularly more robust to noise than M1 in low SNRs.

From informal listening tests, we can state that M2+IBM with the DNN classifier typically reduces the noise better than M1, which is particularly obvious in the presence of nonstationary noise and transient interferences such as bursts. Code and audio examples are available online (https://uhh.de/inf-sp-guided2021).
6. CONCLUSION
We proposed to guide a VAE for speech enhancement with a supervised classifier separately trained on noisy speech. We evaluated our method with labels corresponding to VAD and IBM on real recordings of different noisy environments. Using the IBM as label and a feedforward DNN classifier, the guided VAE outperforms the standard VAE and a feedforward DNN-based Wiener filter, particularly in low SNRs. Improving the classifier by taking time dependencies and/or visual information into account could further improve the guided VAE. The model could then be compared to other deep generative models taking temporal dependencies into account [23].
7. REFERENCES

[1] E. Vincent, T. Virtanen, and S. Gannot, eds., Audio Source Separation and Speech Enhancement. Hoboken, NJ: John Wiley & Sons, 2018.
[2] R. C. Hendriks, T. Gerkmann, and J. Jensen, DFT-Domain Based Single-Microphone Noise Reduction for Speech Enhancement: A Survey of the State-of-the-Art. No. 11 in Synthesis Lectures on Speech and Audio Processing, Williston, VT: Morgan & Claypool, 2013.
[3] C. Breithaupt, T. Gerkmann, and R. Martin, "Cepstral Smoothing of Spectral Filter Gains for Speech Enhancement Without Musical Noise," IEEE Signal Processing Letters, vol. 14, pp. 1036–1039, Dec. 2007.
[4] C. Févotte, N. Bertin, and J.-L. Durrieu, "Nonnegative Matrix Factorization with the Itakura-Saito Divergence: With Application to Music Analysis," Neural Computation, vol. 21, pp. 793–830, Mar. 2009.
[5] T. Gerkmann and R. C. Hendriks, "Unbiased MMSE-Based Noise Power Estimation With Low Complexity and Low Tracking Delay," IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, pp. 1383–1393, May 2012.
[6] A. Narayanan and D. Wang, "Ideal ratio mask estimation using deep neural networks for robust speech recognition," in ICASSP, pp. 7092–7096, May 2013.
[7] P. Huang, M. Kim, M. Hasegawa-Johnson, and P. Smaragdis, "Joint Optimization of Masks and Deep Recurrent Neural Networks for Monaural Source Separation," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 23, pp. 2136–2147, Dec. 2015.
[8] D. S. Williamson, Y. Wang, and D. Wang, "Complex Ratio Masking for Monaural Speech Separation," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, pp. 483–492, Mar. 2016.
[9] Y. Luo and N. Mesgarani, "Conv-TasNet: Surpassing Ideal Time–Frequency Magnitude Masking for Speech Separation," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, pp. 1256–1266, Aug. 2019.
[10] D. P. Kingma and M. Welling, "An Introduction to Variational Autoencoders," Foundations and Trends in Machine Learning, vol. 12, no. 4, pp. 307–392, 2019.
[11] Y. Bando, M. Mimura, K. Itoyama, K. Yoshii, and T. Kawahara, "Statistical Speech Enhancement Based on Probabilistic Integration of Variational Autoencoder and Non-Negative Matrix Factorization," in ICASSP, pp. 716–720, Apr. 2018.
[12] S. Leglaive, L. Girin, and R. Horaud, "A variance modeling framework based on variational autoencoders for speech enhancement," in MLSP, pp. 1–6, Sept. 2018.
[13] D. P. Kingma, S. Mohamed, D. Jimenez Rezende, and M. Welling, "Semi-supervised learning with deep generative models," in NeurIPS, pp. 3581–3589, Curran Associates, Inc., 2014.
[14] H. Kameoka, L. Li, S. Inoue, and S. Makino, "Supervised determined source separation with multichannel variational autoencoder," Neural Computation, vol. 31, pp. 1891–1914, Sept. 2019.
[15] A. Liutkus, R. Badeau, and G. Richard, "Gaussian Processes for Underdetermined Source Separation," IEEE Transactions on Signal Processing, vol. 59, pp. 3155–3167, July 2011.
[16] J. S. Garofolo, D. Graff, D. Paul, and D. S. Pallett, CSR-I (WSJ0) Sennheiser. Linguistic Data Consortium.
[17] J. Thiemann, N. Ito, and E. Vincent, "The Diverse Environments Multi-channel Acoustic Noise Database (DEMAND): A database of multichannel environmental noise recordings," in Proc. Meetings on Acoustics, 2013.
[18] in Interspeech, pp. 3456–3460, 2015.
[19] F. Weninger, J. R. Hershey, J. L. Roux, and B. Schuller, "Discriminatively trained recurrent neural networks for single-channel speech separation," in GlobalSIP, pp. 577–581, Dec. 2014.
[20] J. Heymann, L. Drude, and R. Haeb-Umbach, "Neural network based spectral mask estimation for acoustic beamforming," in ICASSP, pp. 196–200, Mar. 2016.
[21] D. P. Kingma and J. Ba, "Adam: A Method for Stochastic Optimization," in ICLR, Dec. 2014.
[22] J. L. Roux, S. Wisdom, H. Erdogan, and J. R. Hershey, "SDR – Half-baked or Well Done?," in ICASSP, pp. 626–630, May 2019.
[23] J. Richter, G. Carbajal, and T. Gerkmann, "Speech Enhancement with Stochastic Temporal Convolutional Networks," in