MFFCN: Multi-layer Feature Fusion Convolution Network for Audio-visual Speech Enhancement
Xinmeng Xu, Yang Wang, Dongxiang Xu, Yiyuan Peng, Cong Zhang, Jie Jia, Binbin Chen

E.E. Engineering, Trinity College Dublin, Ireland
vivo AI Lab, P.R. China
[email protected], {yang.wang.rj, dongxiang.xu, pengyiyuan, zhangcong, jie.jia, bb.chen}@vivo.com

Abstract
The purpose of speech enhancement is to extract the target speech signal from a mixture of sounds generated by several sources. Speech enhancement can potentially benefit from visual information about the target speaker, such as lip movement and facial expressions, because the visual aspect of speech is essentially unaffected by the acoustic environment. In order to fuse audio and visual information, an audio-visual fusion strategy is proposed that goes beyond simple feature concatenation and learns to automatically align the two modalities, leading to a more powerful representation that increases intelligibility in noisy conditions. The proposed model fuses audio-visual features layer by layer and feeds these audio-visual features to each corresponding decoding layer. Experimental results show a relative improvement from 6% to 24% on the test sets over the audio modality alone, depending on the audio noise level. Moreover, there is a significant increase of PESQ from 1.21 to 2.06 in our -15 dB SNR experiment.

Index Terms: speech enhancement, audio-visual, multi-layer feature fusion convolution network (MFFCN)
1. Introduction
Speech enhancement aims at improving speech quality and intelligibility when audio is recorded in a noisy environment. This step is important for applications involving voice commands, especially in far-field conditions where Automatic Speech Recognition (ASR) may be affected by noise and interference, such as radio, TV, or other speakers [1]. Speech enhancement has been the subject of extensive research [2-4] and has recently benefited from advancements in lip reading [5, 6] and speech reading [7, 8].

Advanced audio-only speech enhancement algorithms make noisy signals more audible, but they remain deficient in restoring intelligibility. Consequently, multi-modal speech enhancement algorithms are needed that simulate the audio-visual speech processing mechanism used by humans, amplify the target speaker, or filter out acoustic clutter.

Recently, a large amount of research has shown that the fusion of visual and audio information is beneficial for various speech perception tasks, e.g., [9-11], but several studies substantiate the observation that audio-visual speech enhancement is still less investigated than audio-only speech enhancement. The overview article by Rivet et al. [12] surveys audio-visual speech separation techniques, but it only covers work up to 2014, before deep learning was adopted for the task. Although audio-visual speech enhancement has recently been addressed in the framework of deep neural networks (DNNs), and several interesting architectures and well-performing algorithms have been developed, e.g., [13, 14], the majority of existing systems share a common disadvantage: one modality (not necessarily the most reliable in a given scenario) tends to dominate the other, causing performance degradation.

To tackle the above problems, this paper proposes an audio-visual deep Convolutional Neural Network (CNN) based speech enhancement model that integrates audio and visual cues into a unified network. Moreover, the proposed model adopts a novel fusion technique named the multi-layer audio-visual fusion strategy: instead of concatenating the audio and visual modalities only once in the whole network, the proposed model extracts audio and visual features in every encoding layer and fuses the audio-visual information in each layer. When the two modalities in a layer are concatenated, the system applies them as an additional input to the corresponding decoding layer.

The method is evaluated on an audio-visual speech enhancement task involving the two largest publicly available audio-visual datasets, TCD-TIMIT [15] and the GRID corpus [16], which contain complex sentences of both read speech and in-the-wild recordings. Using both of these datasets offers repeatability and allows other researchers to compare their systems directly to ours. The training videos are mixed with synthetic background noise taken from a noise dataset collected in our lab.

The remainder of the paper is organised as follows. Section 2 reviews related work in the field of audio-visual speech enhancement. Section 3 introduces the framework and the audio-visual fusion strategy of the proposed model. Section 4 describes the employed datasets and the audio-visual feature extraction method. In Section 5 experimental results are presented, and a discussion is given in Section 6.
2. Related Work
Speech processing based on audio-visual multi-modal learning has been applied to speech enhancement and separation [17-19]. Furthermore, a fully connected network, proposed by Hou et al. [13], was used to jointly process audio and visual inputs to perform speech enhancement. Since the fully connected architecture cannot effectively process visual information, the audio-visual speech enhancement system in Hou's approach is only slightly better than its audio-only counterpart. In addition, Gabbay et al. proposed a model [20] which feeds the video frames into a trained speech generation network and predicts clean speech from the noisy input, obtaining better performance than the previous approaches.

The significant performance of audio-visual multi-modal learning is mainly reflected in the audio-visual feature fusion approaches. These fusion approaches aim at one-time data fusion, which not only requires a large multi-modal training dataset, but also wastes data features.

Speech samples are available at: https://XinmengXu.github.io/AVSE/MultilayerFFCN

Figure 1: Illustration of the proposed MFFCN model architecture. A sequence of 5 video frames centered on the lip region is resized by a convolution layer and fed into video encoding convolutional neural network blocks (blue). The corresponding spectrogram of the noisy speech is put into audio encoding convolutional neural network blocks (green) in the same fashion as the video encoder. A single audio-visual embedding (purple) is obtained by concatenating the last video and audio encoding layers and is fed into several consecutive fully-connected layers (amber). Finally, a spectrogram of enhanced speech is decoded in audio decoding layers, each of which concatenates an audio-visual fusion vector (red), a fusion of the audio (green) and visual (blue) modalities generated from the encoding layers, with the audio decoding vector (gray) from the previous audio decoder layer.
3. Model Architecture
The presented MFFCN architecture consists of an encoder component, a fusion component, and a decoder component; its architecture is shown in Figure 1.
As in several previous convolutional neural network based audio encoding models [21-23], the audio encoder is designed as a convolutional neural network that takes the spectrogram as input. Each layer of an audio encoder block is followed by batch normalization, Leaky ReLU for non-linearity, and strided convolutions to maintain the temporal sequence. The layer structure of the audio encoder is described in Table 1.
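As an illustration of this encoder design, the following is a minimal PyTorch-style sketch of one audio encoding layer (strided convolution, batch normalization, Leaky ReLU). The filter counts, kernel sizes and strides follow Table 1, while details such as padding and the Leaky ReLU slope are assumptions not stated in the paper.

```python
import torch
import torch.nn as nn

class AudioEncoderBlock(nn.Module):
    """One audio encoding layer: strided 2-D convolution over the spectrogram,
    followed by batch normalization and Leaky ReLU (sizes follow Table 1)."""

    def __init__(self, in_ch, out_ch, kernel, stride):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=kernel, stride=stride,
                              padding=tuple(k // 2 for k in kernel))
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.LeakyReLU(0.2)  # slope not given in the paper; 0.2 is an assumption

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

# Example: the first two audio encoder layers of Table 1.
audio_enc = nn.Sequential(
    AudioEncoderBlock(1, 64, (5, 5), (2, 2)),   # Conv1
    AudioEncoderBlock(64, 64, (4, 4), (1, 1)),  # Conv2
)
spec = torch.randn(8, 1, 80, 20)  # batch of log-Mel spectrograms (80 bins x 20 frames)
print(audio_enc(spec).shape)
```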
The video encoder is used to process the input face embedding. In our approach, the video feature vectors and audio feature vectors are concatenated at every step of the encoding stage, so the size of the visual feature vectors after each convolution layer has to match that of the corresponding audio feature vectors, as shown in Figure 1. Consequently, the first encoding layer is used to regulate the size of the video input to equal that of the audio input, and the following video encoding blocks take the same structure as the audio encoder, as described in Table 1. Each layer in a video encoder block is followed by batch normalization, Leaky ReLU for non-linearity, max pooling, and dropout of 0.25.
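A matching sketch of one video encoding block, again assuming PyTorch: it differs from the audio block only in using max pooling (the "MaxPool (video)" column of Table 1) plus dropout of 0.25 instead of a strided convolution. Treating the 5-frame stack as input channels is an illustrative assumption.

```python
import torch
import torch.nn as nn

class VideoEncoderBlock(nn.Module):
    """One video encoding layer: 2-D convolution, batch normalization,
    Leaky ReLU, max pooling, and dropout of 0.25."""

    def __init__(self, in_ch, out_ch, kernel, pool):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=kernel,
                              padding=tuple(k // 2 for k in kernel))
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.LeakyReLU(0.2)    # slope is an assumption
        self.pool = nn.MaxPool2d(pool)  # pooling sizes from Table 1
        self.drop = nn.Dropout(0.25)

    def forward(self, x):
        return self.drop(self.pool(self.act(self.bn(self.conv(x)))))

# Example: a 5-frame lip-region segment with frames used as channels (an assumption).
frames = torch.randn(8, 5, 128, 128)
block = VideoEncoderBlock(5, 64, (5, 5), (2, 2))  # Conv1 settings from Table 1
print(block(frames).shape)  # -> torch.Size([8, 64, 64, 64])
```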
The proposed model includes two fusion strategies:

i) audio-visual fusion, which combines the audio and visual streams in each layer directly and feeds the combination into several convolution layers;

ii) audio-visual embedding, which flattens the audio and visual streams from 3-D to 1-D, concatenates both flattened streams, and finally feeds the concatenated feature vector into several fully-connected layers.

The audio-visual fusion process designates a consolidated dimension along which fusion is implemented. The concatenation step of audio-visual fusion is given by

Z_concat = {V_i, A_i}    (1)

where V_i and A_i denote the visual and audio features in encoding layer i of the proposed model. Each individual feature and Z_concat can be regarded as a fusion set containing all the features. For the following convolution layers, the relationship between input and output is

X_i = Conv_av(Conv_av(Conv_av(Z_concat)))    (2)

The resulting vectors X_i are then fed into the corresponding audio decoder layer.

The audio-visual embedding process flattens the feature vectors from 3-dimensional to 1-dimensional in order to pursue a higher degree of feature fusion. The concatenation step of audio-visual embedding is given by
Z_embed = {Flatten(V_j), Flatten(A_j)}    (3)

where j denotes the index of the last encoder layer, which equals 10 in the proposed model. The concatenated feature maps, named the shared embedding, are subsequently fed into a block of 3 consecutive fully-connected layers. The resulting vector is then used to build the audio decoder.

Table 1: Detailed architecture of the encoders.

Layer             Conv1   Conv2   Conv3   Conv4   Conv5   Conv6   Conv7   Conv8   Conv9   Conv10
Num Filters       64      64      128     128     256     256     512     512     1024    1024
Filter Size       (5, 5)  (4, 4)  (4, 4)  (4, 4)  (2, 2)  (2, 2)  (2, 2)  (2, 2)  (2, 2)  (2, 2)
Stride (audio)    (2, 2)  (1, 1)  (2, 2)  (1, 1)  (2, 1)  (1, 1)  (2, 1)  (1, 1)  (1, 5)  (1, 1)
MaxPool (video)   (2, 2)  (1, 1)  (2, 2)  (1, 1)  (2, 1)  (1, 1)  (2, 1)  (1, 1)  (1, 5)  (1, 1)

The audio decoder consists of 6 transposed convolution layers, mirroring the layers of the audio encoder. Referring to Figure 1, the input of each decoder layer is the concatenation of the corresponding audio-visual fusion vector and the output of the previous decoder layer. Because of the downsampling blocks, the model can compute several higher-level features on coarser time scales, which are concatenated with the local, high-resolution features computed by the upsampling block at the same level. This concatenation yields multi-scale features for prediction.
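To make the two fusion strategies of Eqs. (1)-(3) and the decoder-side concatenation concrete, the following PyTorch sketch shows one possible wiring; the channel counts, the number of fusion convolutions, and the shapes in the example are assumptions rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn

def fuse_layer(v_i, a_i, fusion_convs):
    """Eqs. (1)-(2): concatenate visual and audio features of layer i along the
    channel axis and pass them through a small stack of convolutions."""
    z_concat = torch.cat([v_i, a_i], dim=1)
    return fusion_convs(z_concat)

def shared_embedding(v_last, a_last, fc_block):
    """Eq. (3): flatten the last visual and audio encoder outputs, concatenate
    them, and feed the result through consecutive fully connected layers."""
    z_embed = torch.cat([v_last.flatten(1), a_last.flatten(1)], dim=1)
    return fc_block(z_embed)

# Illustrative modules (channel counts are assumptions).
fusion_convs = nn.Sequential(                       # "several convolution layers"
    nn.Conv2d(128, 64, 3, padding=1), nn.LeakyReLU(0.2),
    nn.Conv2d(64, 64, 3, padding=1), nn.LeakyReLU(0.2),
)
fc_block = nn.Sequential(nn.Linear(2 * 1024, 1024), nn.ReLU(),
                         nn.Linear(1024, 1024), nn.ReLU(),
                         nn.Linear(1024, 1024), nn.ReLU())
# Decoder side: each transposed-convolution layer receives the concatenation of
# the previous decoder output and the matching audio-visual fusion vector.
decoder_layer = nn.ConvTranspose2d(64 + 64, 64, kernel_size=4, stride=2, padding=1)

v_i = torch.randn(2, 64, 20, 5)            # visual features of some encoding layer
a_i = torch.randn(2, 64, 20, 5)            # audio features of the same layer
x_i = fuse_layer(v_i, a_i, fusion_convs)   # skip connection into the decoder
prev = torch.randn(2, 64, 20, 5)           # output of the previous decoder layer
out = decoder_layer(torch.cat([prev, x_i], dim=1))

v_last = torch.randn(2, 1024, 1, 1)        # outputs of encoder layer j = 10
a_last = torch.randn(2, 1024, 1, 1)
emb = shared_embedding(v_last, a_last, fc_block)
print(x_i.shape, out.shape, emb.shape)
```

In the full model this per-layer fusion is repeated at every encoding layer, so that the resulting vectors X_i act as multi-scale skip connections into the corresponding decoder layers.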
4. Dataset and Preprocessing
The model is trained on two datasets. The first is TCD-TIMIT [15], which consists of 60 volunteer speakers with around 200 videos each, as well as three lip-speakers. The second is the GRID audio-visual sentence corpus [16], a large dataset of audio and facial recordings of 1,000 sentences spoken by each of 34 people (18 male and 16 female). The noise dataset includes 12 types of noise recorded in real-world environments.

The videos are divided into a training set containing 30 speakers (15 male and 15 female) with 900 videos per speaker; a development set containing the same 30 speakers with 100 videos per speaker, none of which are included in the training set; and a test set containing two speakers not in the training set, each with 1,000 videos.

The noise signals come from a dataset categorized into 12 types: room, car, instrument, engine, train, human-chatting, air-brake, water, street, mic-noise, ring-bell, and music. For each type, part of the noise signals (80%) is used to corrupt the training and development data, while the rest is used to mix the test data. Moreover, all noise is treated as of unknown type and is randomly added to the speech data.

The audio representation is extracted from the raw audio waveform using the Short-Time Fourier Transform (STFT) with a Hanning window after resampling the audio signal to 16 kHz. Each frame covers a window of 40 milliseconds, which equals 640 samples and corresponds to the duration of a single video frame, and the frame shift is 160 samples (10 milliseconds). For each speech frame, a log Mel-scale spectrogram is extracted by multiplying the spectrogram with a Mel-scale filter bank. The resulting spectrogram has size 80 × 20, i.e., 20 temporal samples with 80 frequency bins each.
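A sketch of this audio front end, assuming the librosa library: 16 kHz audio, a 640-sample (40 ms) Hann window, a 160-sample (10 ms) hop, and an 80-band Mel filter bank. The FFT size and the log offset are assumptions.

```python
import numpy as np
import librosa

def log_mel_spectrogram(wav_path, sr=16000, win=640, hop=160, n_mels=80):
    """Resample to 16 kHz, take an STFT with a 40 ms Hann window and a
    10 ms shift, apply an 80-band Mel filter bank, and take the log."""
    y, _ = librosa.load(wav_path, sr=sr)
    stft = librosa.stft(y, n_fft=win, win_length=win, hop_length=hop, window="hann")
    mel_fb = librosa.filters.mel(sr=sr, n_fft=win, n_mels=n_mels)
    mel = mel_fb @ (np.abs(stft) ** 2)
    return np.log(mel + 1e-8)  # small constant avoids log(0)

# Each network input then covers 20 consecutive frames (one 5-frame video segment):
# spec = log_mel_spectrogram("sample.wav"); chunk = spec[:, :20]  # shape (80, 20)
```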
Visual features are extracted from the input videos, which are resampled to 25 frames per second. Each video is divided into non-overlapping segments of 5 frames. During processing, a mouth-centered window of size 128 × 128 is cropped from each frame using the 20 mouth landmarks from the 68 facial landmarks suggested by Kazemi et al. [24]. The video segment processed as input is therefore a stack of five 128 × 128 mouth crops.
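The lip-region cropping could be implemented roughly as follows, assuming dlib's 68-point landmark predictor (in which the 20 mouth points are indices 48-67) and OpenCV; the landmark model file name and the border handling are assumptions.

```python
import cv2
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
# Path to the pre-trained 68-landmark model is assumed; it must be downloaded separately.
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def crop_mouth(frame, size=128):
    """Crop a mouth-centered size x size window using the 20 mouth landmarks
    (indices 48-67 of the 68-point annotation scheme)."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = detector(gray)
    if not faces:
        return None
    shape = predictor(gray, faces[0])
    mouth = np.array([(shape.part(i).x, shape.part(i).y) for i in range(48, 68)])
    cx, cy = mouth.mean(axis=0).astype(int)
    half = size // 2
    crop = frame[max(cy - half, 0):cy + half, max(cx - half, 0):cx + half]
    return cv2.resize(crop, (size, size))  # resize guards against image-border clipping

# Frames are read at 25 fps and grouped into non-overlapping 5-frame segments.
```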
5. Experiment Results
The proposed model is evaluated on several speech enhancement tasks using the datasets described in Section 4. In all cases, the background interference is set by the different types of acoustic environment from the noise dataset. The speech and noise signals are mixed at SNRs from 10 dB to -10 dB for both the training and testing datasets. Model performance is assessed by two objective scores: Short-Time Objective Intelligibility (STOI) [26] and Perceptual Evaluation of Speech Quality (PESQ) [27].
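As a sketch of how such mixtures and scores can be produced, assuming numpy together with the third-party pystoi and pesq packages (tools not named in the paper):

```python
import numpy as np
from pystoi import stoi   # assumed package implementing STOI [26]
from pesq import pesq     # assumed package implementing PESQ [27]

def mix_at_snr(clean, noise, snr_db):
    """Scale the noise so that the clean/noise power ratio matches snr_db."""
    noise = noise[:len(clean)]
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10.0)))
    return clean + scale * noise

sr = 16000
clean = np.random.randn(sr * 3).astype(np.float32)  # placeholders for real signals
noise = np.random.randn(sr * 3).astype(np.float32)
noisy = mix_at_snr(clean, noise, snr_db=-5)

print("STOI:", stoi(clean, noisy, sr, extended=False))
# PESQ requires actual speech content; with real utterances:
# print("PESQ:", pesq(sr, clean, noisy, "wb"))  # wide-band mode at 16 kHz
```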
To examine the effectiveness of the proposed MFFCN model, a comparison of speech enhancement capability was conducted against an audio-only approach, the temporal convolutional neural network (TCNN) [28], whose structure is similar to that of the proposed model. The comparison results are given in Table 2. In each sample, the target speech is mixed with natural interference and with speech interference, respectively. Speech interference denotes background speech produced by unknown talker(s); as the table shows, the audio-only model exhibits degraded performance on this speech noise, whereas our approach achieves a clear improvement, with a 30% increase in PESQ score compared with the audio-only model. Moreover, for natural interference, which denotes noise not produced by the human vocal system, the proposed model also outperforms the audio-only approach: PESQ is improved by 24.2% at -5 dB SNR and by 13% at 0 dB SNR.

Table 2: Model comparison in terms of STOI and PESQ scores. "Speech" interference denotes the background speech signal from unknown talker(s); "Natural" interference denotes the ambient non-speech noise.
Evaluation metrics       STOI (%)                          PESQ
Test SNR                 -5 dB           0 dB              -5 dB           0 dB
Interference             Speech  Natural Speech  Natural   Speech  Natural Speech  Natural
Unprocessed              57.8    51.4    64.7    62.6      1.59    1.03    1.66    1.24
TCNN (Audio-only)        73.2    78.7    80.8    81.3      2.01    2.19    2.47    2.58
Gabbay et al. [2017]     77.9    81.3
MFFCN (proposed)
Table 3: Model comparison in terms of STOI and PESQ scores between a recent AV model and the proposed model at -15 dB SNR on the "natural" interference set.

Evaluation metrics       STOI (%)   PESQ
Unprocessed              43.1       0.83
Gabbay et al. [2017]     61.8       1.21
MFFCN (proposed)
Figure 2: Waveforms of an example speech utterance under the condition of natural noise at -15 dB. Top: noisy speech; Middle: enhanced speech using the baseline system; Bottom: enhanced speech using the proposed model.

To further determine the significance of the results, the performance of the proposed MFFCN model is contrasted with a baseline speech enhancement algorithm proposed by Gabbay et al. [14], and the results are shown in Table 2. In this pair-wise comparison, the proposed model has no obvious advantage at 0 dB, but performs better at -5 dB SNR, where the results show an improvement of the STOI score from 77.9% to 80.7% on the speech interference set and an improvement of PESQ from 2.35 to 2.72 on the natural interference set.

In order to verify the robustness of the proposed model in a stronger noise environment, an enhancement capability test on noisy speech at -15 dB SNR, comparing our approach with Gabbay's approach, is presented. The test results are shown in Table 3, and the corresponding waveforms are shown in Figure 2. Table 3 confirms that the proposed approach produces better results than the baseline on noisy speech at -15 dB SNR; in particular, the PESQ value is improved from 1.21 to 2.06. What is more, the observation from Figure 2 supports that the results generated by MFFCN keep more speech elements in both the time domain and the frequency domain.

Figure 3: Details of the spectrograms of an example speech utterance under the condition of natural noise at -15 dB. Top: noisy speech; Middle: enhanced speech using the baseline system; Bottom: enhanced speech using the proposed model.

In addition, Figure 3 shows in more detail that the proposed model is robust when enhancing highly noisy speech: the spectrogram generated by MFFCN visibly retains more of the speech signal's energy.
6. Discussion
A multi-layer feature fusion based MFFCN model for audio-visual speech enhancement, separating the target speech of a visible speaker from background noise, has been presented. A long temporal context is processed by repeated downsampling and convolution of feature maps to combine both high-level and low-level features at different layer steps.

The proposed model consistently improves the quality and intelligibility of noisy speech, and the experimental results show that MFFCN performs better than a recent audio-only based model and also demonstrates an obvious improvement on highly noisy speech enhancement. The proposed model is compact and operates on short speech segments, and is thus potentially suitable for real-time applications.

7. References

[1] S. Dupont and J. Luettin, "Audio-visual speech modeling for continuous speech recognition," IEEE Transactions on Multimedia, vol. 2, no. 3, pp. 141–151, 2000.
[2] I. Cohen and B. Berdugo, "Speech enhancement for non-stationary noise environments," Signal Processing, vol. 81, no. 11, pp. 2403–2418, 2001.
[3] S. Gannot and I. Cohen, "Speech enhancement based on the general transfer function GSC and postfiltering," IEEE Transactions on Speech and Audio Processing, vol. 12, no. 6, pp. 561–571, 2004.
[4] P. C. Loizou, Speech Enhancement: Theory and Practice. CRC Press, 2013.
[5] J. S. Chung and A. Zisserman, "Lip reading in the wild," in Asian Conference on Computer Vision. Springer, 2016, pp. 87–103.
[6] C. J. Son and A. Zisserman, "Lip reading in profile," 2017.
[7] T. Le Cornu and B. Milner, "Generating intelligible audio speech from visual speech," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 9, pp. 1751–1761, 2017.
[8] A. Ephrat, T. Halperin, and S. Peleg, "Improved speech reconstruction from silent video," in Proceedings of the IEEE International Conference on Computer Vision Workshops, 2017, pp. 455–462.
[9] W. H. Sumby and I. Pollack, "Visual contribution to speech intelligibility in noise," The Journal of the Acoustical Society of America, vol. 26, no. 2, pp. 212–215, 1954.
[10] N. P. Erber, "Auditory-visual perception of speech," Journal of Speech and Hearing Disorders, vol. 40, no. 4, pp. 481–492, 1975.
[11] A. MacLeod and Q. Summerfield, "Quantifying the contribution of vision to speech perception in noise," British Journal of Audiology, vol. 21, no. 2, pp. 131–141, 1987.
[12] B. Rivet, W. Wang, S. M. Naqvi, and J. A. Chambers, "Audio-visual speech source separation: An overview of key methodologies," IEEE Signal Processing Magazine, vol. 31, no. 3, pp. 125–134, 2014.
[13] J.-C. Hou, S.-S. Wang, Y.-H. Lai, J.-C. Lin, Y. Tsao, H.-W. Chang, and H.-M. Wang, "Audio-visual speech enhancement using deep neural networks," IEEE, 2016, pp. 1–6.
[14] A. Gabbay, A. Shamir, and S. Peleg, "Visual speech enhancement," in Proc. Interspeech, 2018, pp. 1170–1174.
[15] N. Harte and E. Gillen, "TCD-TIMIT: An audio-visual corpus of continuous speech," IEEE Transactions on Multimedia, vol. 17, no. 5, pp. 603–615, 2015.
[16] M. Cooke, J. Barker, S. Cunningham, and X. Shao, "An audio-visual corpus for speech perception and automatic speech recognition," The Journal of the Acoustical Society of America, vol. 120, no. 5, pp. 2421–2424, 2006.
[17] L. Girin, J.-L. Schwartz, and G. Feng, "Audio-visual enhancement of speech in noise," The Journal of the Acoustical Society of America, vol. 109, no. 6, pp. 3007–3020, 2001.
[18] Q. Summerfield, "Use of visual information for phonetic perception," Phonetica, vol. 36, no. 4-5, pp. 314–331, 1979.
[19] W. Wang, D. Cosker, Y. Hicks, S. Saneit, and J. Chambers, "Video assisted speech source separation," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '05), vol. 5. IEEE, 2005, pp. 421–425.
[20] A. Gabbay, A. Ephrat, T. Halperin, and S. Peleg, "Seeing through noise: Visually driven speaker separation and enhancement," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 3051–3055.
[21] S.-W. Fu, Y. Tsao, and X. Lu, "SNR-aware convolutional neural network modeling for speech enhancement," in Interspeech, 2016, pp. 3768–3772.
[22] T. Kounovsky and J. Malek, "Single channel speech enhancement using convolutional neural network," in IEEE International Workshop of Electronics, Control, Measurement, Signals and their Application to Mechatronics (ECMSM). IEEE, 2017, pp. 1–5.
[23] K. Tan and D. Wang, "A convolutional recurrent neural network for real-time speech enhancement," in Interspeech, 2018, pp. 3229–3233.
[24] V. Kazemi and J. Sullivan, "One millisecond face alignment with an ensemble of regression trees," in CVPR, 2014.
[25] K. T. Gribbon and D. G. Bailey, "A novel approach to real-time bilinear interpolation," in Proceedings of DELTA 2004, Second IEEE International Workshop on Electronic Design, Test and Applications. IEEE, 2004, pp. 126–131.
[26] C. H. Taal, R. C. Hendriks, R. Heusdens, and J. Jensen, "An algorithm for intelligibility prediction of time-frequency weighted noisy speech," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 7, pp. 2125–2136, 2011.
[27] A. W. Rix, J. G. Beerends, M. P. Hollier, and A. P. Hekstra, "Perceptual evaluation of speech quality (PESQ) - a new method for speech quality assessment of telephone networks and codecs," in IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No. 01CH37221), vol. 2. IEEE, 2001, pp. 749–752.
[28] A. Pandey and D. Wang, "TCNN: Temporal convolutional neural network for real-time speech enhancement in the time domain," in