VSEGAN: Visual Speech Enhancement Generative Adversarial Network
Xinmeng Xu, Yang Wang, Dongxiang Xu, Yiyuan Peng, Cong Zhang, Jie Jia, Binbin Chen
E.E. Engineering, Trinity College Dublin, Ireland
vivo AI Lab, P.R. China
[email protected], {yang.wang.rj, dongxiang.xu, pengyiyuan, zhangcong, jie.jia, bb.chen}@vivo.com

Abstract
Speech enhancement is an essential task of improving speech quality in noisy scenarios. Several state-of-the-art approaches have introduced visual information for speech enhancement, since the visual aspect of speech is essentially unaffected by the acoustic environment. This paper proposes a novel framework that involves visual information for speech enhancement by incorporating a Generative Adversarial Network (GAN). In particular, the proposed visual speech enhancement GAN consists of two networks trained in an adversarial manner: i) a generator that adopts a multi-layer feature fusion convolution network to enhance the input noisy speech, and ii) a discriminator that attempts to minimize the discrepancy between the distributions of the clean speech signal and the enhanced speech signal. Experimental results demonstrate the superior performance of the proposed model against several state-of-the-art models.
Index Terms: speech enhancement, visual information, multi-layer feature fusion convolution network, generative adversarial network
1. Introduction
Speech processing systems are used in a wide variety of applications such as speech recognition, speech coding, and hearing aids. These systems perform best when noise interference is absent. Consequently, speech enhancement is essential to improve the performance of these systems in noisy backgrounds [1]. Speech enhancement refers to algorithms that improve the quality and intelligibility of noisy speech, decrease hearing fatigue, and improve the performance of many speech processing systems [2].

Conventional speech enhancement algorithms are mainly based on signal processing techniques, e.g., exploiting speech signal characteristics of a known speaker, and include spectral subtraction [3], signal subspace methods [4], Wiener filtering [5], and model-based statistical algorithms [6]. Various deep learning architectures, such as fully connected networks [7], Convolutional Neural Networks (CNNs) [8, 9], and Recurrent Neural Networks (RNNs) [10], have been demonstrated to improve speech enhancement capabilities notably over conventional approaches. Although deep learning approaches make noisy speech signals more audible, there are remaining deficiencies in restoring intelligibility.

Speech enhancement is inherently multimodal, where visual cues help to understand speech better. The correlation between the visible properties of the articulatory organs, e.g., lips, teeth, and tongue, and speech reception has been shown previously in numerous behavioural studies [11, 12]. Similarly, a large number of earlier works have been developed for visual speech enhancement based on signal processing techniques and machine learning algorithms [13, 14]. Not surprisingly, visual speech enhancement has recently been addressed in the framework of DNNs: a fully connected network was used to jointly process audio and visual inputs to perform speech enhancement [15]. However, the fully connected architecture cannot effectively process visual information, which made the audio-visual speech enhancement system only slightly better than its audio-only counterpart. In addition, there is a model that feeds the video frames into a trained speech generation network and predicts clean speech from the noisy input [16], which has shown a more obvious improvement compared with the previous approaches.

The Generative Adversarial Network (GAN) [17] consists of a generator network and a discriminator network that play a min-max game against each other, and GANs have been explored for speech enhancement; SEGAN [18] is the first approach to apply a GAN to a speech enhancement model. This paper proposes a Visual Speech Enhancement Generative Adversarial Network (VSEGAN) that enhances noisy speech using visual information under a GAN architecture.

The main contributions of the proposed VSEGAN can be summarized as follows:
• It provides a robust audio-visual feature fusion strategy. The audio-visual fusion occurs at every processing level, rather than as a single fusion in the whole network; hence, more audio-visual information can be obtained.
• It is the first to investigate GAN for visual speech enhancement. In addition, the GAN architecture helps visual speech enhancement control the proportion of information between the audio and visual modalities during the fusion process.

The rest of the article is organized as follows: Section 2 presents the proposed method in detail. Section 3 introduces the experimental setup. Experimental results are discussed in Section 4, and conclusions are summarized in Section 5.
2. Model Architecture
GAN is comprised of a generator (G) and a discriminator (D). The function of G is to map a noise vector x from a given prior distribution X to an output sample y from the distribution Y of the training data. D is a binary classifier network that determines whether its input is real or fake: samples coming from Y are classified as real, whereas samples produced by G are classified as fake. The learning process can be regarded as a minimax game between G and D, and can be expressed by [17]:

\min_G \max_D V(D, G) = \mathbb{E}_{y \sim p_y(y)}[\log D(y)] + \mathbb{E}_{x \sim p_x(x)}[\log(1 - D(G(x)))]    (1)

Figure 1: Network architecture of the generator. Conv-A, Conv-V, Conv-AV, BN, and Deconv denote the convolution of the audio encoder, the convolution of the video encoder, the convolution of the audio-visual fusion, batch normalization, and transposed convolution, respectively.

The training procedure for GAN can be summarized as the repetition of the following three steps:
Step 1: D back-props a batch of real samples y.

Step 2: Freeze the parameters of G, and D back-props a batch of fake samples generated by G.

Step 3: Freeze the parameters of D, and G back-props to make D misclassify.

The regression task generally works with a conditioned version of GAN [19], in which some extra information, contained in a vector y_c, is provided along with the noise vector x at the input of G. In that case, the cost function of D is expressed as follows:

\min_G \max_D V(D, G) = \mathbb{E}_{y, y_c \sim p_y(y, y_c)}[\log D(y, y_c)] + \mathbb{E}_{x \sim p_x(x),\, y_c \sim p_y(y_c)}[\log(1 - D(G(x, y_c), y_c))]    (2)

However, Eq. (2) suffers from vanishing gradients due to the sigmoid cross-entropy loss function [20]. To tackle this problem, the least-squares GAN approach [21] substitutes the cross-entropy loss with the mean-squares function with binary coding, as given in Eq. (3) and Eq. (4):

\min_D V(D) = \frac{1}{2}\mathbb{E}_{y, y_c \sim p_y(y, y_c)}\big[(D(y, y_c) - 1)^2\big] + \frac{1}{2}\mathbb{E}_{x \sim p_x(x),\, y_c \sim p_y(y_c)}\big[(D(G(x, y_c), y_c))^2\big]    (3)

\min_G V(G) = \frac{1}{2}\mathbb{E}_{x \sim p_x(x),\, y_c \sim p_y(y_c)}\big[(D(G(x, y_c), y_c) - 1)^2\big]    (4)

The G network of VSEGAN performs the enhancement: its inputs are the noisy speech \tilde{y} and the video frames v, and its output is the enhanced speech \hat{y} = G(\tilde{y}, v). The G network uses the Multi-layer Feature Fusion Convolution Network (MFFCN) [22], which follows an encoder-decoder scheme and consists of an encoder part, a fusion part, an embedding part, and a decoder part. The architecture of the G network is shown in Figure 1.

The encoder part of the G network involves an audio encoder and a video encoder. The audio encoder is designed as a CNN taking a spectrogram as input, and each of its layers is a strided convolutional layer [20] followed by batch normalization and a Leaky-ReLU non-linearity. The video encoder processes the input face embedding through a number of max-pooling convolutional layers, likewise followed by batch normalization and Leaky-ReLU. In the G network, the dimension of the visual feature vector after each convolution layer has to be the same as that of the corresponding audio feature vector, since both vectors pass through a fusion part at every encoder layer during the encoding stage. The audio decoder reverses the audio encoder by means of transposed convolutions, followed again by batch normalization and Leaky-ReLU.

The fusion part designates a merged dimension at which fusion is implemented: the audio and video streams are concatenated and passed through several strided convolution layers, followed by batch normalization and Leaky-ReLU. The embedding part consists of three steps: 1) flatten the audio and visual streams, 2) concatenate the flattened audio and visual streams, and 3) feed the concatenated feature vector into several fully-connected layers. The output of the fusion part in each layer is fed to the corresponding decoder layer. The embedding part is a bottleneck that applies a deeper feature fusion strategy, but at a larger computational expense. This architecture of the G network avoids losing the many low-level details needed to reconstruct the speech waveform properly, which would happen if all information were forced to flow through the compression bottleneck.

The D network of VSEGAN has the same structure as SERGAN [23], as shown in Figure 2. D can be seen as a kind of loss function that transmits the classification information (real or fake) to G, i.e., G learns to predict waveforms that move towards the realistic distribution while getting rid of the noisy signals labeled as fake.

Figure 2: Network architecture of the discriminator, and the GAN training procedure.
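To make the layer-wise audio-visual fusion concrete, the following is a minimal PyTorch sketch of a single encoder level (Conv-A, Conv-V, and Conv-AV as in Figure 1), written from the description above. The class name, argument names, and kernel/stride values are illustrative assumptions rather than the authors' released implementation; in the real model they would follow Table 1.

```python
import torch
import torch.nn as nn

class AVEncoderLayer(nn.Module):
    """One encoder level: Conv-A on the audio stream, Conv-V on the video
    stream, then concatenation and Conv-AV fusion (a sketch of the scheme
    described in the text, not the authors' exact code)."""

    def __init__(self, audio_in, video_in, out_ch, a_stride=(2, 2), v_pool=(2, 4)):
        super().__init__()
        # Conv-A: strided convolution + BN + Leaky-ReLU on the spectrogram stream.
        self.conv_a = nn.Sequential(
            nn.Conv2d(audio_in, out_ch, kernel_size=(5, 5), stride=a_stride, padding=2),
            nn.BatchNorm2d(out_ch),
            nn.LeakyReLU(0.2),
        )
        # Conv-V: convolution + max-pooling + BN + Leaky-ReLU on the face-embedding stream.
        self.conv_v = nn.Sequential(
            nn.Conv2d(video_in, out_ch, kernel_size=(5, 5), padding=2),
            nn.MaxPool2d(kernel_size=v_pool),
            nn.BatchNorm2d(out_ch),
            nn.LeakyReLU(0.2),
        )
        # Conv-AV: fusion of the concatenated audio and video feature maps.
        self.conv_av = nn.Sequential(
            nn.Conv2d(2 * out_ch, out_ch, kernel_size=(3, 3), padding=1),
            nn.BatchNorm2d(out_ch),
            nn.LeakyReLU(0.2),
        )

    def forward(self, audio, video):
        a = self.conv_a(audio)
        v = self.conv_v(video)
        # Strides and pooling are chosen so that a and v share spatial dimensions here.
        fused = self.conv_av(torch.cat([a, v], dim=1))
        return a, v, fused  # fused also feeds the matching decoder layer (skip connection)
```

Stacking ten such levels, with the fused output of each level routed to the corresponding decoder layer, gives the multi-layer feature fusion scheme described above.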
Table 1: Detailed architecture of the VSEGAN generator encoders. Conv1 denotes the first convolution layer of the VSEGAN generator encoder part.

                  Conv1   Conv2   Conv3   Conv4   Conv5   Conv6   Conv7   Conv8   Conv9   Conv10
Num Filters       64      64      128     128     256     256     512     512     1024    1024
Filter Size       (5, 5)  (4, 4)  (4, 4)  (4, 4)  (2, 2)  (2, 2)  (2, 2)  (2, 2)  (2, 2)  (2, 2)
Stride (audio)    (2, 2)  (1, 1)  (2, 2)  (1, 1)  (2, 1)  (1, 1)  (2, 1)  (1, 1)  (1, 5)  (1, 1)
MaxPool (video)   (2, 4)  (1, 2)  (2, 2)  (1, 1)  (2, 1)  (1, 1)  (2, 1)  (1, 1)  (1, 5)  (1, 1)

In addition, previous approaches [24, 25] demonstrated that using the L1 norm as an additional component is beneficial to the loss of G, and the L1 norm performs better than the L2 norm in minimizing the distance between the enhanced speech and the target speech [26]. Therefore, the G loss is modified as:

\min_G V(G) = \frac{1}{2}\mathbb{E}_{x \sim p_x(x),\, \tilde{y} \sim p_y(\tilde{y})}\big[(D(G(x, (v, \tilde{y})), \tilde{y}) - 1)^2\big] + \lambda \, \| G(x, (v, \tilde{y})) - y \|_1    (5)

where \lambda is a hyper-parameter that controls the magnitude of the L1 norm.
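As a rough illustration of how Eqs. (3)-(5) translate into training code, the snippet below computes the least-squares discriminator loss and the generator loss with the λ-weighted L1 term. The function names and the discriminator call signature (conditioning on the noisy input) are assumptions made for this sketch, not the released implementation.

```python
import torch
import torch.nn.functional as F

LAMBDA = 100.0  # weight of the L1 term, as in Eq. (5)

def discriminator_loss(D, clean, enhanced, noisy):
    """Least-squares D loss, Eq. (3): real pairs are pushed to 1, fake pairs to 0.
    The conditioning is assumed here to be the noisy input."""
    d_real = D(clean, noisy)
    d_fake = D(enhanced.detach(), noisy)  # detach so only D is updated here
    return 0.5 * torch.mean((d_real - 1.0) ** 2) + 0.5 * torch.mean(d_fake ** 2)

def generator_loss(D, clean, enhanced, noisy):
    """Least-squares G loss with the L1 regression term, Eqs. (4)-(5)."""
    d_fake = D(enhanced, noisy)
    adv = 0.5 * torch.mean((d_fake - 1.0) ** 2)
    return adv + LAMBDA * F.l1_loss(enhanced, clean)
```

One optimization step then alternates the two losses following the three-step procedure of Section 2: update D on a real and a fake batch, then update G while the parameters of D are frozen.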
3. Experiment
The model is trained on two datasets: the first is GRID [27], which consists of video recordings in which 18 male speakers and 16 female speakers pronounce 1000 sentences each; the second is TCD-TIMIT [28], which consists of 32 male speakers and 30 female speakers with around 200 videos each.

The noise signals are collected from the real world and categorized into 12 types: room, car, instrument, engine, train, talker speaking, air-brake, water, street, mic-noise, ring-bell, and music. At every iteration of training, a random attenuation of the noise interference in the range of [-15, 0] dB is applied as a data augmentation scheme. This augmentation is done to make the network robust against various SNRs.
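The SNR augmentation can be reproduced in a few lines. The sketch below mixes a clean utterance with a noise segment whose gain is drawn uniformly from the stated [-15, 0] dB range; the variable names and the exact scaling convention are assumptions, since the paper does not specify how the attenuation is referenced to the clean signal level.

```python
import numpy as np

def mix_with_noise(clean, noise, rng=np.random.default_rng()):
    """Mix a clean utterance with noise attenuated by a random gain in [-15, 0] dB
    (one plausible reading of the augmentation described above)."""
    # Loop or trim the noise so it covers the whole utterance.
    if len(noise) < len(clean):
        noise = np.tile(noise, int(np.ceil(len(clean) / len(noise))))
    start = rng.integers(0, len(noise) - len(clean) + 1)
    noise = noise[start:start + len(clean)]

    gain_db = rng.uniform(-15.0, 0.0)           # random attenuation in dB
    noise = noise * (10.0 ** (gain_db / 20.0))  # dB gain -> amplitude factor
    return clean + noise
```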
The video representation is extracted from the input video, which is resampled to 25 frames per second. Each video is divided into non-overlapping segments of 5 consecutive frames and is cropped with a mouth-centered window of size × [29]. The audio representation is the magnitude spectrogram transformed into the log Mel-domain with 80 Mel frequency bands from 0 to 8 kHz, using a Hanning window of length 640 bins (40 milliseconds) and a hop size of 160 bins (10 milliseconds). The whole spectrograms are sliced into pieces of 200 milliseconds duration, corresponding to the length of 5 video frames.

Consequently, the size of the video feature is × × , and the size of the audio feature is × × . During the training process, the size of the visual modality has to be the same as that of the audio modality; therefore, the processed video segment is zoomed to × × by the bilinear interpolation algorithm [30]. The proposed VSEGAN has 10 convolutional layers for each encoder and decoder of the generator; the details of the audio and visual encoders are described in Table 1, and each Conv-A or Conv-V block in Figure 1 comprises two of the convolution layers in Table 1.

The model is trained with the ADAM optimizer for 70 epochs, with a learning rate of − , a batch size of 8, and the hyper-parameter λ of the loss function in Eq. (5) set to 100 [18]. The performance of VSEGAN is evaluated with the following metrics: Perceptual Evaluation of Speech Quality (PESQ), more specifically the wideband version recommended in ITU-T P.862.2 (ranging from -0.5 to 4.5) [31], and Short-Time Objective Intelligibility (STOI) [32]. Each measurement compares the enhanced speech with the clean reference of the test stimuli provided in the dataset.
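Assuming a 16 kHz sampling rate (implied by the 0 to 8 kHz Mel range together with the 640-sample window and 160-sample hop), the audio features described above can be computed with librosa roughly as follows; librosa is only one possible implementation, and the log offset is an assumption.

```python
import numpy as np
import librosa

SR = 16000     # assumed sampling rate
N_FFT = 640    # 40 ms window
HOP = 160      # 10 ms hop
N_MELS = 80    # Mel bands covering 0 to 8 kHz

def log_mel(wav):
    """Log-Mel magnitude spectrogram with the parameters stated in the text."""
    mel = librosa.feature.melspectrogram(
        y=wav, sr=SR, n_fft=N_FFT, hop_length=HOP, win_length=N_FFT,
        window="hann", n_mels=N_MELS, fmin=0, fmax=8000, power=1.0)
    return np.log(mel + 1e-6)

def slice_200ms(spec, frames_per_slice=20):
    """Cut the spectrogram into 200 ms pieces (20 frames at a 10 ms hop),
    matching segments of 5 video frames at 25 fps."""
    n = spec.shape[1] // frames_per_slice
    return [spec[:, i * frames_per_slice:(i + 1) * frames_per_slice] for i in range(n)]
```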
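For the evaluation metrics, the widely used pesq and pystoi Python packages expose wideband PESQ and STOI directly; the snippet below is a sketch assuming 16 kHz signals, since the paper does not state which implementation was used.

```python
from pesq import pesq     # pip install pesq
from pystoi import stoi   # pip install pystoi

SR = 16000  # assumed sampling rate

def evaluate(clean, enhanced):
    """Wideband PESQ (ITU-T P.862.2, roughly -0.5 to 4.5) and STOI in percent,
    comparing the enhanced signal against the clean reference."""
    return {
        "pesq": pesq(SR, clean, enhanced, "wb"),
        "stoi": 100.0 * stoi(clean, enhanced, SR, extended=False),
    }
```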
4. Results
There are four networks trained as follows:
• SEGAN [18]: an audio-only speech enhancement generative adversarial network.
• Baseline [33]: a baseline work of visual speech enhancement.
• MFFCN [22]: a visual speech enhancement approach using the multi-layer feature fusion convolution network; this model is the G network of the proposed model.
• VSEGAN: the proposed model, the visual speech enhancement generative adversarial network.

Table 2: Performance of trained networks

Test SNR             -5 dB          0 dB
Evaluation Metrics   STOI   PESQ    STOI   PESQ
Noisy                51.4   1.03    62.6   1.24
SEGAN                63.4   1.97    77.3   2.21
Baseline             81.3   2.35    87.9   2.94
MFFCN                82.7   2.72    89.3   2.92
VSEGAN

Table 2 demonstrates the performance improvement of the network as each new component is added to the architecture: visual information, the multi-layer feature fusion strategy, and finally the GAN model. The VSEGAN outperforms SEGAN, which is evidence that visual information significantly improves the performance of the speech enhancement system. Moreover, the comparison between VSEGAN and MFFCN illustrates that the GAN model for visual speech enhancement is more robust than the G-only model. Hence the performance improvement from SEGAN to VSEGAN has two primary reasons: 1) the use of visual information, and 2) the use of the GAN model. Figure 3 shows the visualization of the baseline system enhancement, the MFFCN enhancement, and the VSEGAN enhancement, where the most obvious spectral distinctions are framed by dotted boxes.

Figure 3: Example of input and enhanced spectra from an example speech utterance. (a) Noisy speech under the condition of noise at 0 dB. (b) Enhanced speech generated by the baseline work. (c) Enhanced speech generated by MFFCN. (d) Enhanced speech generated by VSEGAN.

To further investigate the superiority of the proposed method, the performance of VSEGAN has also been compared with the following recent audio-visual speech enhancement approaches on the GRID dataset:
• Looking-to-Listen (L2L) model [34]: a speaker-independent audio-visual speech separation model.
• Online Visual Augmented (OVA) model [35]: a late-fusion-based visual speech enhancement model, which involves an audio-based component, a visual-based component, and an augmentation component.
• AV(SE) model [36]: an audio-visual squeeze-excite speech enhancement model.
• AMFFCN model [37]: a visual speech enhancement approach using an attentional multi-layer feature fusion convolution network.

Table 3: Performance comparison of VSEGAN with state-of-the-art results on GRID

Test SNR             -5 dB   0 dB
Evaluation Metrics   PESQ    PESQ
L2L                  2.61    2.92
OVA                  2.69    3.00
AV(SE)               -       2.98
AMFFCN               2.81    3.04
VSEGAN

Table 3 shows that the VSEGAN produces state-of-the-art results in terms of PESQ score, comparing against four recently proposed methods that use DNNs to perform end-to-end visual speech enhancement. Results for competing methods are taken from the corresponding papers, and the missing entries in the table indicate that the metric is not reported in the reference paper. Although the competing results are for reference only, the VSEGAN has better performance than state-of-the-art results on the GRID dataset.

Speech samples are available at: https://XinmengXu.github.io/AVSE/VSEGAN
5. Conclusions
This paper proposed an end-to-end visual speech enhancement method implemented within the generative adversarial framework. The model adopts a multi-layer feature fusion convolution network structure, which provides better training behavior, as the gradient can flow deeper through the whole structure. According to the experimental results, the performance of the speech enhancement system is significantly improved by involving visual information, and visual speech enhancement using GAN yields better quality of enhanced speech than several state-of-the-art models.

Possible future work involves integrating spatial and temporal information into the network, exploring better convolutional structures, and including weightings for the audio and visual modality fusion blocks to control how the information from the audio and visual modalities is treated.

6. References

[1] P. C. Loizou, Speech Enhancement: Theory and Practice. CRC Press, 2013.
[2] T. F. Quatieri, Discrete-Time Speech Signal Processing: Principles and Practice. Pearson Education India, 2006.
[3] R. Martin, "Spectral subtraction based on minimum statistics," power, vol. 6, no. 8, 1994.
[4] Y. Ephraim and H. L. Van Trees, "A signal subspace approach for speech enhancement," IEEE Transactions on Speech and Audio Processing, vol. 3, no. 4, pp. 251–266, 1995.
[5] J. Lim and A. Oppenheim, "All-pole modeling of degraded speech," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 26, no. 3, pp. 197–210, 1978.
[6] M. Dendrinos, S. Bakamidis, and G. Carayannis, "Speech enhancement from noise: A regenerative approach," Speech Communication, vol. 10, no. 1, pp. 45–57, 1991.
[7] Y. Xu, J. Du, L.-R. Dai, and C.-H. Lee, "A regression approach to speech enhancement based on deep neural networks," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 23, no. 1, pp. 7–19, 2014.
[8] K. Qian, Y. Zhang, S. Chang, X. Yang, D. Florêncio, and M. Hasegawa-Johnson, "Speech enhancement using Bayesian WaveNet," in Interspeech, 2017, pp. 2013–2017.
[9] S.-W. Fu, T.-W. Wang, Y. Tsao, X. Lu, and H. Kawai, "End-to-end waveform utterance enhancement for direct evaluation metrics optimization by fully convolutional neural networks," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 26, no. 9, pp. 1570–1584, 2018.
[10] F. Weninger, F. Eyben, and B. Schuller, "Single-channel speech separation with memory-enhanced recurrent neural networks," in ICASSP. IEEE, 2014, pp. 3709–3713.
[11] W. H. Sumby and I. Pollack, "Visual contribution to speech intelligibility in noise," The Journal of the Acoustical Society of America, vol. 26, no. 2, pp. 212–215, 1954.
[12] Q. Summerfield, "Use of visual information for phonetic perception," Phonetica, vol. 36, no. 4-5, pp. 314–331, 1979.
[13] W. Wang, D. Cosker, Y. Hicks, S. Saneit, and J. Chambers, "Video assisted speech source separation," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP'05), vol. 5. IEEE, 2005, pp. v–425.
[14] J. R. Hershey and M. Casey, "Audio-visual sound separation via hidden Markov models," in Advances in Neural Information Processing Systems, 2002, pp. 1173–1180.
[15] J.-C. Hou, S.-S. Wang, Y.-H. Lai, J.-C. Lin, Y. Tsao, H.-W. Chang, and H.-M. Wang, "Audio-visual speech enhancement using deep neural networks," in APSIPA ASC. IEEE, 2016, pp. 1–6.
[16] A. Gabbay, A. Ephrat, T. Halperin, and S. Peleg, "Seeing through noise: Visually driven speaker separation and enhancement," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 3051–3055.
[17] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial nets," Advances in Neural Information Processing Systems, vol. 27, pp. 2672–2680, 2014.
[18] S. Pascual, A. Bonafonte, and J. Serrà, "SEGAN: Speech enhancement generative adversarial network," in Interspeech 2017, 2017, pp. 3642–3646.
[19] M. Mirza and S. Osindero, "Conditional generative adversarial nets," Computer Science, pp. 2672–2680, 2014.
[20] A. Radford, L. Metz, and S. Chintala, "Unsupervised representation learning with deep convolutional generative adversarial networks," Nov. 2016.
[21] X. Mao, Q. Li, H. Xie, R. Y. Lau, Z. Wang, and S. Paul Smolley, "Least squares generative adversarial networks," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2794–2802.
[22] X. Xu, D. Xu, J. Jia, Y. Wang, and B. Chen, "MFFCN: Multi-layer feature fusion convolution network for audio-visual speech enhancement," arXiv preprint arXiv:2101.05975, 2021.
[23] D. Baby and S. Verhulst, "SERGAN: Speech enhancement using relativistic generative adversarial networks with gradient penalty," in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 106–110.
[24] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, "Image-to-image translation with conditional adversarial networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1125–1134.
[25] D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, and A. A. Efros, "Context encoders: Feature learning by inpainting," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2536–2544.
[26] A. Pandey and D. Wang, "On adversarial training and loss functions for speech enhancement," in ICASSP. IEEE, 2018, pp. 5414–5418.
[27] M. Cooke, J. Barker, S. Cunningham, and X. Shao, "An audio-visual corpus for speech perception and automatic speech recognition," The Journal of the Acoustical Society of America, vol. 120, no. 5, pp. 2421–2424, 2006.
[28] N. Harte and E. Gillen, "TCD-TIMIT: An audio-visual corpus of continuous speech," IEEE Transactions on Multimedia, vol. 17, no. 5, pp. 603–615, 2015.
[29] V. Kazemi and J. Sullivan, "One millisecond face alignment with an ensemble of regression trees," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 1867–1874.
[30] K. T. Gribbon and D. G. Bailey, "A novel approach to real-time bilinear interpolation," in Proceedings. DELTA 2004. Second IEEE International Workshop on Electronic Design, Test and Applications. IEEE, 2004, pp. 126–131.
[31] A. W. Rix, J. G. Beerends, M. P. Hollier, and A. P. Hekstra, "Perceptual evaluation of speech quality (PESQ) - a new method for speech quality assessment of telephone networks and codecs," in IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No. 01CH37221), vol. 2. IEEE, 2001, pp. 749–752.
[32] C. H. Taal, R. C. Hendriks, R. Heusdens, and J. Jensen, "An algorithm for intelligibility prediction of time–frequency weighted noisy speech," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 7, pp. 2125–2136, 2011.
[33] A. Gabbay, A. Shamir, and S. Peleg, "Visual speech enhancement," Interspeech, pp. 1170–1174, 2018.
[34] A. Ephrat, I. Mosseri, O. Lang, T. Dekel, K. Wilson, A. Hassidim, W. T. Freeman, and M. Rubinstein, "Looking to listen at the cocktail party: A speaker-independent audio-visual model for speech separation," ACM Transactions on Graphics, 2018.
[35] W. Wang, C. Xing, D. Wang, X. Chen, and F. Sun, "A robust audio-visual speech enhancement model," in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 7529–7533.
[36] M. L. Iuzzolino and K. Koishida, "AV(SE): Audio-visual squeeze-excite speech enhancement," in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 7539–7543.
[37] X. Xu, Y. Wang, D. Xu, C. Zhang, Y. Peng, J. Jia, and B. Chen, "AMFFCN: Attentional multi-layer feature fusion convolution network for audio-visual speech enhancement,"