Switching Variational Auto-Encoders for Noise-Agnostic Audio-visual Speech Enhancement
Mostafa Sadeghi and Xavier Alameda-Pineda, IEEE Senior Member — Inria Nancy Grand-Est, Inria Grenoble Rhône-Alpes & Univ. Grenoble Alpes, France
ABSTRACT
Recently, audio-visual speech enhancement has been tackled in the unsupervised setting based on variational auto-encoders (VAEs), where during training only clean data is used to train a generative model for speech, which at test time is combined with a noise model, e.g. nonnegative matrix factorization (NMF), whose parameters are learned without supervision. Consequently, the proposed model is agnostic to the noise type. When visual data are clean, audio-visual VAE-based architectures usually outperform the audio-only counterpart. The opposite happens when the visual data are corrupted by clutter, e.g. the speaker not facing the camera. In this paper, we propose to find the optimal combination of these two architectures through time. More precisely, we introduce the use of a latent sequential variable with Markovian dependencies to switch between different VAE architectures through time in an unsupervised manner, leading to the switching variational auto-encoder (SwVAE). We propose a variational factorization to approximate the computationally intractable posterior distribution. We also derive the corresponding variational expectation-maximization algorithm to estimate the parameters of the model and enhance the speech signal. Our experiments demonstrate the promising performance of SwVAE.
Index Terms — Audio-visual speech enhancement, robustness, variational auto-encoder, variational inference.
1. INTRODUCTION
Audio-visual speech enhancement (AVSE) refers to the task of removing background noise from a noisy speech signal with the help of the visual information (lip movements) of the speaker [1, 2]. Several deep neural network (DNN)-based methods have been proposed for AVSE in the past. The majority of these methods are supervised, where the underlying idea is to learn a DNN that maps noisy speech and its associated visual data (video frames of the mouth area) to clean speech [2–5]. To achieve good generalization performance, a huge dataset with different noise types and various signal-to-noise ratio (SNR) levels is usually required.
Xavier Alameda-Pineda acknowledges the ANR JCJC ML3RI project (ANR-19-CE33-0008-01). This work has been partially supported by MIAI @ University Grenoble Alpes (ANR-19-P3IA-0003).
Recently, some unsupervised AVSE methods have been proposed that do not need noise signals for training [6–8], meaning that their training is agnostic to the noise type. This approach builds upon the audio-only speech enhancement counterpart [9, 10], consisting of two main steps. First, modeling the probabilistic generative process of clean speech using VAEs [11]. Second, combining it with a noise model, e.g. NMF, to perform speech enhancement from noisy speech.

One critical issue with AVSE methods, shared with other AV-processing tasks such as speaker localisation and tracking [12, 13], is how to robustly handle noisy visual data at test time, e.g., when the mouth area is heavily occluded or non-frontal. Exploiting such noisy visual data by an AVSE model trained on clean data may degrade the performance. In the supervised setting, this problem is usually addressed by proper data augmentation and efficient audio-visual fusion strategies during model training. For example, [14] proposes to combine speaker embeddings with visual cues to achieve more robustness to an occluded visual stream. Moreover, during training, some artificial occlusions are added to the video frames. In the VAE-based unsupervised setting, a totally different perspective is pursued owing to its probabilistic nature. In this regard, a robust generative model has been proposed in [7], which is a mixture of trained audio-based (A-VAE) and audio-visual based (AV-VAE) models. As such, following a variational inference approach, for noisy visual data the A-VAE model is chosen, whereas for clean visual data the AV-VAE model is used, thus providing robustness.

In this paper, we build upon [7] and introduce a new model and associated robust AVSE algorithm, where a Markovian dependency is assumed to switch between different VAE-based generative models; we term the result the switching variational auto-encoder (SwVAE). Alternatively, the proposed model can be understood as a hidden Markov model (HMM) [15] with emission probabilities given by the decoders of several VAEs. Furthermore, we propose a variational factorization of the posterior distribution of the latent variables, enabling efficient inference and algorithm initialization. Experimental results demonstrate the superior performance of the proposed method compared to [7].

The rest of the paper is organized as follows. Section 2 introduces the proposed SwVAE. The inference and speech enhancement methodologies, and the relation of the present work to [7], are also detailed in this section. Section 3 presents and discusses the experiments.

Fig. 1: Graphical model (left) and proposed variational inference (right) associated with switching variational auto-encoders.
2. SWITCHING VARIATIONAL AUTOENCODERS
In this section, we present a generative model for short-time Fourier transform (STFT) time frames of clean speech, consisting of audio-only and audio-visual VAE models plus a switching variable deciding which model is to be used for each audio frame. The switching variable is modeled with an HMM. We also discuss how to structure the variance of the background noise via NMF. Then, a variational approximation is proposed to estimate the model parameters and infer the latent variables, including the clean speech signal, from the noisy mixture.
We define $\mathbf{s}_t \in \mathbb{C}^F$ as the vector of clean speech STFT coefficients at time frame $t \in \{1, \ldots, T\}$. In the following, $\mathcal{N}_c$ and $\mathcal{N}$ stand for complex- and real-valued Gaussian distributions, respectively. The main methodological contribution of this paper is the use of a switching variable $m_t \in \{1, \ldots, M\}$, modeled with a Markov chain, in combination with a set of $M$ non-linear generative models (i.e. VAEs) to model clean speech. The full generative model describes the probabilistic relationship between the switching variable $m_t$, the clean speech $\mathbf{s}_t$, and the latent code $\mathbf{z}_t \in \mathbb{R}^L$, describing some hidden characteristics of $\mathbf{s}_t$, given the associated visual data representation $\mathbf{v}_t \in \mathbb{R}^V$. There are two possible, equivalent interpretations of this model. First, a hidden Markov model with emission probabilities given by the decoders of $M$ VAEs. Second, a set of $M$ VAEs switched by a selecting variable modeled with Markovian dependencies. More formally:

$$
\begin{aligned}
p(m_1, \ldots, m_T) &\sim \mathrm{MC}(\lambda, \tau), \\
p(\mathbf{z}_t \mid m_t; \mathbf{v}_t) &\sim \mathcal{N}\big(\boldsymbol{\xi}_{m_t}(\mathbf{v}_t), \boldsymbol{\Lambda}_{m_t}(\mathbf{v}_t)\big), \\
p(\mathbf{s}_t \mid \mathbf{z}_t, m_t; \mathbf{v}_t) &\sim \mathcal{N}_c\big(\mathbf{0}, \boldsymbol{\Sigma}_{m_t}(\mathbf{z}_t, \mathbf{v}_t)\big),
\end{aligned} \tag{1}
$$

where $\mathrm{MC}(\lambda, \tau)$ is short for a Markov chain with initial distribution $\lambda$ and transition distribution $\tau$, and $\boldsymbol{\xi}_{m_t}(\cdot)$, $\boldsymbol{\Lambda}_{m_t}(\cdot)$, and $\boldsymbol{\Sigma}_{m_t}(\cdot, \cdot)$ are non-linear transformations of their inputs, indexed by $m_t \in \{1, \ldots, M\}$ and realized as DNNs. For each generative model, the associated DNNs are trained by approximating the intractable posterior $p(\mathbf{z}_t \mid \mathbf{s}_t, m_t; \mathbf{v}_t)$ with another DNN-based parameterized Gaussian distribution called the encoder [6, 11]. So, there are $M$ different distributions for the prior of $\mathbf{z}_t$ and for the likelihood of $\mathbf{s}_t$. Importantly, the switching variable $m_t$ selects which one of the $M$ models is used at each time step $t$, while ensuring temporal smoothing in the choice of this transformation. To complete the definition of the probabilistic model, we use an NMF structure for the additive noise [6, 9, 10]:

$$
p(\mathbf{x}_t \mid \mathbf{s}_t) \sim \mathcal{N}_c\big(\mathbf{s}_t, \mathrm{diag}(\mathbf{W}\mathbf{h}_t)\big), \tag{2}
$$

where $\mathbf{W} \in \mathbb{R}_+^{F \times K}$, $\mathbf{H} \in \mathbb{R}_+^{K \times T}$, and $\mathbf{h}_t$ denotes the $t$-th column of $\mathbf{H}$. The graphical representation of the full model is shown in Fig. 1 (a). The set of HMM and NMF parameters, i.e. $\{\lambda, \tau, \mathbf{W}, \mathbf{H}\}$, is then estimated following a variational inference method detailed in the next section, and represented in Fig. 1 (b). While for the generative model the dependencies are forward in time, at inference time the latent code and spectrogram at any time $t$ depend on the past and future noisy observations. It should be emphasized that the DNN parameters of (1), trained according to [6], are fixed.
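To make the generative process (1)-(2) concrete, the following minimal NumPy sketch draws one sequence from the model. The functions xi, Lam and Sig stand in for the trained decoder DNNs of the M VAEs and are replaced here by arbitrary placeholders; all names, shapes and values are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
F, L, V, M, T, K = 257, 32, 64, 2, 50, 10   # STFT bins, latent dim, visual dim, #models, frames, NMF rank

# Placeholder "decoders": in the paper these are trained DNNs (one per model m).
def xi(m, v):
    return np.tanh((m + 1) * v[:L])                  # prior mean of z_t
def Lam(m, v):
    return np.full(L, 0.5 + 0.1 * m)                 # prior (diagonal) variance of z_t
def Sig(m, z, v):
    return np.exp(0.1 * np.sin(z).sum()) * np.ones(F)  # speech PSD Sigma_{m_t}(z_t, v_t)

lam = np.full(M, 1.0 / M)                            # initial distribution lambda
tau = np.full((M, M), 0.1 / (M - 1))                 # transition matrix tau
np.fill_diagonal(tau, 0.9)

W = rng.gamma(1.0, 1.0, (F, K))                      # NMF noise model
H = rng.gamma(1.0, 1.0, (K, T))

v = rng.standard_normal((T, V))                      # visual embeddings v_t (given)
m = np.zeros(T, dtype=int)
s = np.zeros((T, F), dtype=complex)
x = np.zeros((T, F), dtype=complex)

for t in range(T):
    # Markov chain over the switching variable m_t
    m[t] = rng.choice(M, p=lam if t == 0 else tau[m[t - 1]])
    # latent code z_t ~ N(xi_{m_t}(v_t), Lam_{m_t}(v_t))
    z = xi(m[t], v[t]) + np.sqrt(Lam(m[t], v[t])) * rng.standard_normal(L)
    # clean speech s_t ~ complex Gaussian with diagonal covariance Sigma_{m_t}(z_t, v_t)
    ps = Sig(m[t], z, v[t])
    s[t] = np.sqrt(ps / 2) * (rng.standard_normal(F) + 1j * rng.standard_normal(F))
    # noisy mixture x_t = s_t + noise, noise variance diag(W h_t)
    pn = W @ H[:, t]
    x[t] = s[t] + np.sqrt(pn / 2) * (rng.standard_normal(F) + 1j * rng.standard_normal(F))
```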
In the proposed formulation, the problem of speech enhancement is cast as the computation of the posterior probability $p(\mathbf{s} \mid \mathbf{x}, \mathbf{v})$, which is the marginal of the full posterior $p(\mathbf{s}, \mathbf{z}, \mathbf{m} \mid \mathbf{x}, \mathbf{v})$, where we define $\mathbf{x} = \{\mathbf{x}_t\}_{t=1}^{T}$ and analogously $\mathbf{s}, \mathbf{z}, \mathbf{m}, \mathbf{v}$. The full posterior being intractable, we propose the following variational factorization:

$$
p(\mathbf{s}, \mathbf{z}, \mathbf{m} \mid \mathbf{x}, \mathbf{v}) \approx r_s(\mathbf{s} \mid \mathbf{m})\, r_z(\mathbf{z} \mid \mathbf{m})\, r_m(\mathbf{m}). \tag{3}
$$

It is easy to see that $r_s$ and $r_z$ further factorize over time, meaning that $r_s(\mathbf{s} \mid \mathbf{m}) = \prod_t r_s(\mathbf{s}_t \mid m_t)$, and analogously for $r_z(\mathbf{z} \mid \mathbf{m})$. Moreover, as a variational approximation, the posterior of the latent code $\mathbf{z}_t$ is assumed to follow a Gaussian distribution $r_z(\mathbf{z}_t \mid m_t) = \mathcal{N}(\mathbf{c}_{tm}, \boldsymbol{\Omega}_{tm})$, where the mean vector $\mathbf{c}_{tm}$ and the diagonal covariance matrix $\boldsymbol{\Omega}_{tm}$ are to be estimated along with $r_s$ and $r_m$.

To this end, we optimize the following lower bound of the data log-likelihood $\log p(\mathbf{x}, \mathbf{v})$, as done in variational inference:

$$
\mathbb{E}_{r_s r_z r_m}\left[\log \frac{p(\mathbf{x}, \mathbf{v}, \mathbf{s}, \mathbf{z}, \mathbf{m})}{r_s(\mathbf{s} \mid \mathbf{m})\, r_z(\mathbf{z} \mid \mathbf{m})\, r_m(\mathbf{m})}\right] \le \log p(\mathbf{x}, \mathbf{v}). \tag{4}
$$

Optimizing (4) over $r_s$ provides the following expression:

$$
r_s(\mathbf{s}_t \mid m_t) \propto p(\mathbf{x}_t \mid \mathbf{s}_t) \cdot \exp\Big(\mathbb{E}_{r_z}\big[\log p(\mathbf{s}_t \mid \mathbf{z}_t, m_t; \mathbf{v}_t)\big]\Big).
$$

Approximating the intractable expectation with a Monte-Carlo estimate, we obtain a Gaussian distribution $r_s(\mathbf{s}_t \mid m_t) = \mathcal{N}_c(\boldsymbol{\eta}_t^{m_t}, \mathrm{diag}[\boldsymbol{\nu}_t^{m_t}])$, where:

$$
\eta_{ft}^{m_t} = \frac{\gamma_{ft}^{m_t}}{\gamma_{ft}^{m_t} + (\mathbf{W}\mathbf{H})_{ft}} \cdot x_{ft}, \qquad
\nu_{ft}^{m_t} = \frac{\gamma_{ft}^{m_t} \cdot (\mathbf{W}\mathbf{H})_{ft}}{\gamma_{ft}^{m_t} + (\mathbf{W}\mathbf{H})_{ft}}, \tag{5}
$$

$$
\gamma_{ft}^{m_t} = \Big[\frac{1}{D} \sum_{d=1}^{D} \Sigma_{m_t, ff}^{-1}\big(\mathbf{z}_{m_t}^{(d)}, \mathbf{v}_t\big)\Big]^{-1}, \tag{6}
$$

in which $\Sigma_{m_t, ff}$ denotes the $(f, f)$-th entry of $\boldsymbol{\Sigma}_{m_t}$ (similarly for the rest of the variables), and $\{\mathbf{z}_{m_t}^{(d)}\}_{d=1}^{D}$ is a sequence sampled from $r_z(\mathbf{z}_t \mid m_t)$. The result in (5) must be interpreted as a Wiener filter, averaged over the latent variable $\mathbf{z}_t$ for a given VAE generative model $m_t$. The enhanced speech signal is the marginalisation over the switching variable at time $t$, and naturally writes:

$$
\hat{\mathbf{s}}_t = \mathbb{E}_{r_m(m_t)}\Big[\mathbb{E}_{r_s(\mathbf{s}_t \mid m_t)}[\mathbf{s}_t]\Big] = \sum_{m_t} r_m(m_t)\, \boldsymbol{\eta}_t^{m_t}, \quad \forall t. \tag{7}
$$

After some derivations, the set of parameters of $r_z(\mathbf{z}_t \mid m_t)$ is estimated by solving:

$$
\max_{\mathbf{c}_{tm}, \boldsymbol{\Omega}_{tm}} \; \mathbb{E}_{r_m(m_t)}\Big[\mathbb{E}_{r_z(\mathbf{z}_t \mid m_t)}\big[\mathbb{E}_{r_s(\mathbf{s}_t \mid m_t)}[\log p(\mathbf{s}_t \mid \mathbf{z}_t, m_t; \mathbf{v}_t)]\big] - \mathrm{KL}\big(r_z(\mathbf{z}_t \mid m_t)\,\|\, p(\mathbf{z}_t \mid m_t; \mathbf{v}_t)\big)\Big], \tag{8}
$$

where KL denotes the Kullback-Leibler divergence. In (8), the expectations over $r_m$ and $r_s$ can be evaluated in closed form. This is also the case for the KL term, as both distributions are Gaussian. However, the expectation over $r_z$ is intractable. As in a standard VAE, we approximate this expectation with a single sample drawn from $r_z$. Furthermore, to be able to back-propagate through the posterior parameters, the reparametrization trick is utilized [11].

For $r_m(\mathbf{m})$, we obtain:

$$
r_m(\mathbf{m}) \propto p(\mathbf{m}) \cdot \prod_{t=1}^{T} \exp(-g_t(m_t)), \tag{9}
$$

with:

$$
g_t(m_t) = \mathbb{E}_{r_z}\big[\mathrm{KL}\big(r_s(\mathbf{s}_t \mid m_t)\,\|\, p(\mathbf{s}_t \mid \mathbf{z}_t, m_t; \mathbf{v}_t)\big)\big] - \mathbb{E}_{r_s}\big[\log p(\mathbf{x}_t \mid \mathbf{s}_t)\big] + \mathrm{KL}\big(r_z(\mathbf{z}_t \mid m_t)\,\|\, p(\mathbf{z}_t \mid m_t; \mathbf{v}_t)\big). \tag{10}
$$

Again, the KL terms and the expectation over $r_s$ can be computed in closed form. However, we approximate the expectation over $r_z$ by a Monte-Carlo estimate. This allows us to compute (10).
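For illustration, the Wiener-like updates (5)-(6) and the model-averaged estimate (7) reduce to a few array operations. In the sketch below, the sampled speech variances, the noise PSD and the posterior weights r_m are random placeholders; variable names and shapes are assumptions made for this example, not the released code.

```python
import numpy as np

rng = np.random.default_rng(1)
F, T, M, D = 257, 50, 2, 20

x = rng.standard_normal((T, F)) + 1j * rng.standard_normal((T, F))  # noisy STFT frames
sigma = rng.gamma(1.0, 1.0, (M, D, T, F))   # Sigma_{m,ff}(z^{(d)}, v_t): D decoder samples per model
wh = rng.gamma(1.0, 1.0, (T, F))            # noise PSD (W H)_{ft}
r_m = np.full((T, M), 1.0 / M)              # variational posterior r_m(m_t)

# Eq. (6): gamma is the harmonic-type mean of the sampled speech PSDs
gamma = 1.0 / np.mean(1.0 / sigma, axis=1)            # shape (M, T, F)

# Eq. (5): Wiener-like posterior mean and variance of s_t for each model m_t
eta = gamma / (gamma + wh) * x                         # broadcasting x over the model axis
nu = gamma * wh / (gamma + wh)

# Eq. (7): enhanced speech = per-model estimates averaged with the weights r_m(m_t)
s_hat = np.einsum('tm,mtf->tf', r_m, eta)
```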
Algorithm 1: SwVAE
Input: Trained A-VAE and AV-VAE models, noisy STFT frames $\{\mathbf{x}_t\}_{t=1}^{T}$, and visual embeddings $\{\mathbf{v}_t\}_{t=1}^{T}$.
Initialize:
• The latent codes $\{\mathbf{z}_{m_t}^{(d)}\}_{d=1}^{D}$ via the VAE encoders.
• The parameters of $r_s(\mathbf{s} \mid \mathbf{m})$ using (5).
• The posterior $r_m(\mathbf{m})$ uniformly.
• The parameters $\mathbf{W}$, $\mathbf{H}$, $\tau$ and $\lambda$ (randomly).
While stop criterion not met do:
• E-$z$ step: Using (8).
• E-$s$ step: Using (5).
• E-$m$ step: Compute $q_t(m_t) = \exp(-g_t(m_t)) / \sum_{m_t} \exp(-g_t(m_t))$ using (10), and run the forward-backward algorithm [15] to obtain the posterior probability $r_m(m_t)$ and the joint posterior probability $\zeta_m(m_{t-1}, m_t)$.
• M step:
Update $\mathbf{W}$, $\mathbf{H}$ using (12) and (11), and $\lambda$, $\tau$ using the standard formulae with $r_m$ and $\zeta_m$ [15].
End while
Speech enhancement: Using (7).
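The E-m step of Algorithm 1 relies on the classical forward-backward recursions, with $\exp(-g_t(m_t))$ playing the role of the emission likelihood. Below is a minimal NumPy sketch of one standard scaled implementation; the placeholder g values and the particular scaling convention are assumptions of this sketch, not necessarily the authors' exact implementation.

```python
import numpy as np

def forward_backward(log_emission, lam, tau):
    """Posterior marginals r_m(m_t) and pairwise posteriors zeta(m_{t-1}, m_t)
    for an HMM with initial distribution lam, transition matrix tau, and
    log emission likelihoods log_emission[t, m] = -g_t(m_t)."""
    T, M = log_emission.shape
    b = np.exp(log_emission - log_emission.max(axis=1, keepdims=True))  # scaled emissions
    alpha = np.zeros((T, M)); beta = np.ones((T, M)); c = np.zeros(T)
    alpha[0] = lam * b[0]; c[0] = alpha[0].sum(); alpha[0] /= c[0]
    for t in range(1, T):                           # forward pass (with scaling)
        alpha[t] = (alpha[t - 1] @ tau) * b[t]
        c[t] = alpha[t].sum(); alpha[t] /= c[t]
    for t in range(T - 2, -1, -1):                  # backward pass
        beta[t] = (tau @ (b[t + 1] * beta[t + 1])) / c[t + 1]
    r = alpha * beta                                # marginal posteriors r_m(m_t)
    zeta = np.array([np.outer(alpha[t - 1], b[t] * beta[t]) * tau / c[t]
                     for t in range(1, T)])         # pairwise posteriors zeta(m_{t-1}, m_t)
    return r, zeta

# toy usage with placeholder g_t(m_t) values
rng = np.random.default_rng(2)
T, M = 50, 2
g = rng.random((T, M))
lam = np.full(M, 1.0 / M)
tau = np.array([[0.9, 0.1], [0.1, 0.9]])
r_m, zeta = forward_backward(-g, lam, tau)
```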
In order to compute the marginal variational posterior $r_m(m_t)$ required in the E-$s$ and E-$z$ steps, we realize that (9) has the same structure as a standard HMM if we consider $\exp(-g_t(m_t))$ as the emission probability of the HMM. We therefore use the forward-backward algorithm [15] to efficiently compute $r_m(m_t)$.

After performing the E steps, the NMF parameters are updated by optimizing (4). The update formulas for $\mathbf{W}$ and $\mathbf{H}$ are then obtained by using the standard multiplicative rules [16]:

$$
\mathbf{H} \leftarrow \mathbf{H} \odot \frac{\mathbf{W}^{\top}\big(\mathbf{V} \odot (\mathbf{W}\mathbf{H})^{\odot -2}\big)}{\mathbf{W}^{\top}(\mathbf{W}\mathbf{H})^{\odot -1}}, \tag{11}
$$

$$
\mathbf{W} \leftarrow \mathbf{W} \odot \frac{\big(\mathbf{V} \odot (\mathbf{W}\mathbf{H})^{\odot -2}\big)\mathbf{H}^{\top}}{(\mathbf{W}\mathbf{H})^{\odot -1}\mathbf{H}^{\top}}, \tag{12}
$$

where $\mathbf{V} = \big[\sum_{m_t} r_m(m_t)\big(|x_{ft} - \eta_{ft}^{m_t}|^2 + \nu_{ft}^{m_t}\big)\big]_{(f,t)}$, and $\odot$ signifies entry-wise operation. The parameters of the HMM, i.e. $\lambda$ and $\tau$, are updated by the standard formulae using the joint posterior probabilities computed by the forward-backward algorithm in the E-$m$ step. The complete inference and enhancement algorithm is summarized in Algorithm 1.
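As a sketch of the M step, the multiplicative updates (11)-(12) can be written directly in NumPy. The posterior statistics eta, nu and r_m are filled with random placeholders, and the small eps added for numerical stability is an assumption of this example, not part of the paper's equations.

```python
import numpy as np

rng = np.random.default_rng(3)
F, T, K, M = 257, 50, 10, 2
eps = 1e-8

# placeholder posterior statistics from the E steps (illustrative values only)
x = rng.standard_normal((T, F)) + 1j * rng.standard_normal((T, F))
eta = rng.standard_normal((M, T, F)) + 1j * rng.standard_normal((M, T, F))
nu = rng.gamma(1.0, 1.0, (M, T, F))
r_m = np.full((T, M), 1.0 / M)

# V_{ft} = sum_m r_m(m_t) (|x_{ft} - eta^m_{ft}|^2 + nu^m_{ft}), arranged as F x T
V = np.einsum('tm,mtf->tf', r_m, np.abs(x - eta) ** 2 + nu).T

W = rng.gamma(1.0, 1.0, (F, K))
H = rng.gamma(1.0, 1.0, (K, T))

for _ in range(10):                      # a few multiplicative iterations, Eqs. (11)-(12)
    WH = W @ H + eps
    H *= (W.T @ (V * WH ** -2)) / (W.T @ WH ** -1 + eps)
    WH = W @ H + eps
    W *= ((V * WH ** -2) @ H.T) / (WH ** -1 @ H.T + eps)
```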
The closest work to ours is [7], which uses a mixture model, comprising an A-VAE and an AV-VAE, as the generative model of clean speech. Though sharing some similarities, there are several crucial differences between the two methods. First, here we assume a Markovian dependency on the switching variable that ensures smoothness over time. Second, in [7] the following variational factorization is proposed: $p(\mathbf{s}, \mathbf{z}, \mathbf{m} \mid \mathbf{x}) \approx r_s(\mathbf{s})\, r_z(\mathbf{z})\, r_m(\mathbf{m})$, where $r_s$ and $r_z$ are not conditioned on $\mathbf{m}$. This is in contrast to our proposed factorization given in (3), which provides a more effective approximation and a robust initialization for the latent codes $\mathbf{z}$, as required by the inference algorithm. More precisely, in the proposed framework, the parameters of $r_s(\mathbf{s} \mid \mathbf{m})$ are initialized using its respective set of latent codes $\mathbf{z}$, which themselves are initialized by the corresponding encoders (see Section 3), as opposed to [7], where a weighted combination of the latent codes (coming from different models) is used for initializing the parameters of $r_s(\mathbf{s})$. This might not be effective given that latent initialization is important in VAE-based AVSE [8]. Finally, the proposed posterior approximation $r_z(\mathbf{z}_t \mid m_t) = \mathcal{N}(\mathbf{c}_{tm}, \boldsymbol{\Omega}_{tm})$ makes sampling, needed by (6), more efficient than the method of [7], which relies on the computationally demanding Metropolis-Hastings algorithm [15].

Table 1: Average PESQ, SDR and STOI values of the enhanced speech signals. Here, "clean" and "noisy" refer to visual data. Scores of the unprocessed input per SNR level:

SNR (dB):         -5       0       5      10      15
Input PESQ:      1.44    1.67    2.04    2.30    2.72
Input SDR (dB): -12.30   -7.30   -3.45    1.88    6.73
Input STOI:      0.22    0.32    0.45    0.56    0.68
3. EXPERIMENTS
Protocol. We evaluate the performance of SwVAE and compare it with [7] using the same experimental protocol. We used two VAE models (A-VAE and AV-VAE) from [6], trained on the NTCD-TIMIT dataset [17]. The test set includes 9 speakers, along with their corresponding lip regions of interest, with different noise types: Living Room (LR), White, Cafe, Car, Babble, and
Street, and noise levels: {-5, 0, 5, 10, 15} dB. From each speaker, we randomly selected 150 examples per noise level for evaluation. The parameters for the algorithm of [7] were set to their proposed values. Both algorithms were run for the same number of iterations, on the same test set. For optimizing (8), the Adam optimizer [18] was used, with a fixed learning rate, for 10 iterations. Moreover, we used $D = 20$ samples to compute (6) and (10). The $\mathbf{c}_{tm}$, $\boldsymbol{\Omega}_{tm}$ parameters of $r_z$ were, respectively, initialized with the means and variances at the output of the respective VAE encoders by giving $(\mathbf{x}_t, \mathbf{v}_t)$ as their inputs. The parameters of $r_s$ are then initialized using (5) and (6). For A-VAE, the prior of $\mathbf{z}_t$ is a standard normal distribution, and $\boldsymbol{\Sigma}_{m_t}$ is a function of only $\mathbf{z}_t$; see (1).

The two AVSE algorithms were run on the test set with both clean visual data as well as artificially generated noisy versions, where about one third of the total video frames per test instance were occluded. Similarly to [7], the occlusions were simulated by random patches of standard Gaussian noise added to randomly selected sub-sequences of 20 consecutive video frames. We used three standard speech enhancement scores, i.e., signal-to-distortion ratio (SDR) [19], perceptual evaluation of speech quality (PESQ) [20], and short-time objective intelligibility (STOI) [21]. SDR is measured in decibels (dB), and PESQ and STOI values lie in the intervals [-0.5, 4.5] and [0, 1], respectively (the higher the better).
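The initialization of $r_z$ and $r_s$ described in the protocol above can be sketched as follows, assuming hypothetical encoder functions that return per-frame means and variances given $(\mathbf{x}_t, \mathbf{v}_t)$; the names and shapes are illustrative, not the released code.

```python
import numpy as np

rng = np.random.default_rng(4)
F, L, V, M, T = 257, 32, 64, 2, 50

x = rng.standard_normal((T, F)) + 1j * rng.standard_normal((T, F))  # noisy STFT frames
v = rng.standard_normal((T, V))                                     # visual embeddings

# Hypothetical stand-ins for the trained VAE encoders (model m = 0 ignores the visual input).
def encoder(m, x_t, v_t):
    mean = np.tanh(np.abs(x_t[:L]) + (m * v_t[:L] if m > 0 else 0.0))
    var = np.full(L, 0.1)
    return mean, var

# c_tm and Omega_tm initialized with the encoder means/variances given (x_t, v_t)
c = np.zeros((T, M, L))
Omega = np.zeros((T, M, L))
for t in range(T):
    for m in range(M):
        c[t, m], Omega[t, m] = encoder(m, x[t], v[t])

r_m = np.full((T, M), 1.0 / M)   # r_m initialized uniformly
# r_s is then initialized via (5)-(6) using latent samples z ~ N(c, Omega), as sketched earlier.
```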
Results. Table 1 summarizes the results, averaged over all the test samples, for the three performance measures, and clean as well as noisy visual data. From this table, we can see that in terms of PESQ and SDR, SwVAE outperforms [7], with the performance difference being more significant at high SNR values. In terms of the intelligibility measure, i.e., STOI, the proposed method exhibits much better performance than [7]. These observations are consistent for both clean and noisy visual data. Furthermore, the two algorithms show robustness to noisy visual data, which is especially noticeable in terms of STOI. However, for the algorithm of [7], the performance drop due to noisy visual data is higher than for SwVAE. Supplementary materials are available online at https://team.inria.fr/perception/research/swvae/.
4. CONCLUSION
In this paper, we proposed a noise-agnostic audio-visual speech generative model based on a sequential mixture of trained A-VAE and AV-VAE models, combined with an NMF model for the noise variance. The switching variable allows us to seamlessly use either of the auto-encoders for speech enhancement, without requiring supervision. We detailed a variational expectation-maximization approach to estimate the parameters of the model as well as to enhance the noisy speech. The proposed algorithm, called switching VAE (SwVAE), exhibits promising performance when compared to the previous work [7] on robust AVSE. In the future, we would like to explore the use of dynamical VAEs [22] for unsupervised AVSE.

REFERENCES

[1] L. Girin, J.-L. Schwartz, and G. Feng, "Audio-visual enhancement of speech in noise," The Journal of the Acoustical Society of America, vol. 109, no. 6, pp. 3007–3020, 2001.
[2] D. Michelsanti, Z. H. Tan, S. X. Zhang, Y. Xu, M. Yu, D. Yu, and J. Jensen, "An overview of deep-learning-based audio-visual speech enhancement and separation," 2020, arXiv:2008.09586.
[3] J.-C. Hou, S.-S. Wang, Y.-H. Lai, Y. Tsao, H.-W. Chang, and H.-M. Wang, "Audio-visual speech enhancement using multimodal deep convolutional neural networks," IEEE Transactions on Emerging Topics in Computational Intelligence, vol. 2, no. 2, pp. 117–128, 2018.
[4] T. Afouras, J. S. Chung, and A. Zisserman, "The conversation: Deep audio-visual speech enhancement," in Proc. Conference of the International Speech Communication Association (INTERSPEECH), 2018, pp. 3244–3248.
[5] A. Gabbay, A. Shamir, and S. Peleg, "Visual speech enhancement," in Proc. Conference of the International Speech Communication Association (INTERSPEECH), 2018, pp. 1170–1174.
[6] M. Sadeghi, S. Leglaive, X. Alameda-Pineda, L. Girin, and R. Horaud, "Audio-visual speech enhancement using conditional variational auto-encoders," IEEE/ACM Transactions on Audio, Speech and Language Processing, vol. 28, pp. 1788–1800, 2020.
[7] M. Sadeghi and X. Alameda-Pineda, "Robust unsupervised audio-visual speech enhancement using a mixture of variational autoencoders," in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2020.
[8] M. Sadeghi and X. Alameda-Pineda, "Mixture of inference networks for VAE-based audio-visual speech enhancement," 2020, arXiv:1912.10647.
[9] S. Leglaive, L. Girin, and R. Horaud, "A variance modeling framework based on variational autoencoders for speech enhancement," in Proc. IEEE International Workshop on Machine Learning for Signal Processing (MLSP), 2018, pp. 1–6.
[10] Y. Bando, M. Mimura, K. Itoyama, K. Yoshii, and T. Kawahara, "Statistical speech enhancement based on probabilistic integration of variational autoencoder and non-negative matrix factorization," in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2018, pp. 716–720.
[11] D. P. Kingma and M. Welling, "Auto-encoding variational Bayes," in International Conference on Learning Representations (ICLR), 2014.
[12] J. Cech, R. Mittal, A. Deleforge, J. Sanchez-Riera, X. Alameda-Pineda, and R. Horaud, "Active-speaker detection and localization with microphones and cameras embedded into a robotic head," in IEEE-RAS Humanoids, 2013, pp. 203–210.
[13] Y. Ban, L. Girin, X. Alameda-Pineda, and R. Horaud, "Exploiting the complementarity of audio and visual data in multi-speaker tracking," in IEEE ICCV Workshops, 2017, pp. 446–454.
[14] T. Afouras, J. S. Chung, and A. Zisserman, "My lips are concealed: Audio-visual speech enhancement through obstructions," in INTERSPEECH, 2019.
[15] C. Bishop, Pattern Recognition and Machine Learning, Springer-Verlag Berlin, Heidelberg, 2006.
[16] C. Févotte, N. Bertin, and J.-L. Durrieu, "Nonnegative matrix factorization with the Itakura-Saito divergence: With application to music analysis," Neural Computation, vol. 21, no. 3, pp. 793–830, 2009.
[17] A.-H. Abdelaziz, "NTCD-TIMIT: A new database and baseline for noise-robust audio-visual speech recognition," in Proc. Conference of the International Speech Communication Association (INTERSPEECH), 2017, pp. 3752–3756.
[18] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," in International Conference on Learning Representations (ICLR), 2015.
[19] E. Vincent, R. Gribonval, and C. Févotte, "Performance measurement in blind audio source separation," IEEE Transactions on Audio, Speech, and Language Processing, vol. 14, no. 4, pp. 1462–1469, 2006.
[20] A. W. Rix, J. G. Beerends, M. P. Hollier, and A. P. Hekstra, "Perceptual evaluation of speech quality (PESQ) - a new method for speech quality assessment of telephone networks and codecs," in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2001, pp. 749–752.
[21] C. H. Taal, R. C. Hendriks, R. Heusdens, and J. Jensen, "An algorithm for intelligibility prediction of time-frequency weighted noisy speech," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 7, pp. 2125–2136, 2011.
[22] L. Girin, S. Leglaive, X. Bie, J. Diard, T. Hueber, and X. Alameda-Pineda, "Dynamical variational autoencoders: A comprehensive review," 2020, arXiv:2008.12595.