AMFFCN: Attentional Multi-layer Feature Fusion Convolution Network for Audio-visual Speech Enhancement
Xinmeng Xu, Yang Wang, Dongxiang Xu, Yiyuan Peng, Cong Zhang, Jie Jia, Binbin Chen
E.E. Engineering, Trinity College Dublin, Ireland; vivo AI Lab, P.R. China
[email protected], {yang.wang.rj, dongxiang.xu, pengyiyuan, zhangcong, jie.jia, bb.chen}@vivo.com

Abstract
Audio-visual speech enhancement systems are regarded as one of the promising solutions for isolating and enhancing the speech of a desired speaker. Conventional methods focus on predicting the clean speech spectrum via a naive convolutional neural network based encoder-decoder architecture, and these methods a) do not use the data fully and effectively, and b) cannot process features selectively. The proposed model addresses these drawbacks by a) applying a model that fuses audio and visual features layer by layer in the encoding phase and feeds the fused audio-visual features to each corresponding decoder layer, and, more importantly, b) introducing soft threshold attention into the model to select the informative modality softly. This paper proposes an attentional audio-visual multi-layer feature fusion model, in which soft threshold attention units are applied to the feature maps at every layer of the decoder. The proposed model demonstrates superior performance against state-of-the-art models.
Index Terms: speech enhancement, audio-visual, soft-threshold attention, multi-layer feature fusion model
1. Introduction
Speech processing systems are commonly used in a variety of applications such as automatic speech recognition, speech synthesis, and speaker verification. Numerous speech processing devices (e.g., mobile communication systems and digital hearing aids) are often used in environments with high levels of ambient noise, such as public places and cars. Generally speaking, the presence of high-level noise interference severely degrades the perceptual quality and intelligibility of the speech signal. There is therefore an urgent need for speech enhancement algorithms that automatically filter out the noise signal and improve the effectiveness of speech processing systems.

Recently, many approaches have been proposed to recover the clean speech of a target speaker immersed in a noisy environment. They can be roughly divided into two categories, i.e., audio-only speech enhancement (AO-SE) [1-3] and audio-visual speech enhancement (AV-SE) [4-6]. AO-SE approaches make assumptions on the statistical properties of the involved signals [7, 8] and aim to estimate the target speech signal according to mathematically tractable criteria [9, 10]. Advanced AO-SE methods based on deep learning can predict the target speech signal directly, but they tend to depart from knowledge-based modelling. Compared with AO-SE approaches, AV-SE methods achieve better intelligibility of the enhanced speech because the visual modality can recover some of the suppressed linguistic features when the target speech is corrupted by noise interference [11, 12]. However, an AV-SE model should be trained on data that is representative of the settings in which it is deployed. In order to obtain robust performance in a wide variety of settings, very large AV datasets for training and testing need to be collected. Furthermore, AV-SE is inherently a multi-modal process, and it focuses not only on determining the parameters of a model but also on the possible fusion architectures [13]. Generally, a naive fusion strategy does not allow control over how the information from the audio and the visual modalities is fused; as a consequence, one of the two modalities dominates over the other.

To overcome the aforementioned limitations, this paper proposes an attentional audio-visual Convolutional Neural Network (CNN) based speech enhancement algorithm that integrates the selected audio and visual cues into a unified network using a multi-layer audio-visual fusion strategy. The proposed framework applies a Soft Threshold Attention (STA) inspired by the soft thresholding algorithm [14], which has often been used as a key step in many signal denoising methods [15] and for eliminating unimportant features [16]. Moreover, the proposed model adopts a multi-layer audio-visual fusion strategy, in which the extracted audio and visual features are concatenated in every encoding layer.
When the two modalities in each layer are concatenated, the system applies them, via STA, as an additional input to the corresponding decoding layer.

The main contributions of this paper can be summarized as follows:

• Adopting STA for audio and video processing, the proposed framework is able to eliminate unimportant samples, which further leads to an improvement of speech enhancement performance and a reduction of the model size.

• Adopting the multi-layer feature fusion strategy, the proposed model can extract audio-visual features at different levels and feed them into the decoder blocks, which helps the model make better use of the data, further improves the performance, and requires less data.

The remainder of the paper is organised as follows. Section 2 introduces the model architecture. Section 3 illustrates the employed datasets and audio-visual representations. In Section 4 experiment results are presented, and a conclusion is drawn in Section 5.
2. Model Architecture
The Multi-layer Feature Fusion Convolution Network (MFFCN) architecture is shown in Figure 1. This model follows an encoder-decoder scheme, uses a series of downsampling and upsampling blocks to make its predictions, and consists of an encoder component, a fusion component, and a decoder component [17].

Figure 1: Illustration of the proposed Attentional MFFCN (AMFFCN) model architecture. A sequence of 5 video frames centered on the lip region is resized by a convolution layer and fed into the video encoding convolutional blocks (blue). The corresponding spectrogram of the noisy speech is fed into the audio encoding convolutional blocks (green) in the same fashion as the video encoder. A single audio-visual embedding (purple) is obtained by concatenating the last video and audio encoding layers and is fed into several consecutive fully-connected layers (amber). Finally, a spectrogram of the enhanced speech is decoded in the audio decoding layers, each obtained by concatenating, via an STA block (dark gray), the audio-visual fusion vector (red), a fusion of the audio (green) and visual (blue) modalities generated from the encoding layers, with the audio decoding vector (gray) from the last audio decoder layer. The overall architecture of the STA is shown in the dotted-line circle.
The encoder component involves an audio encoder and a video encoder. Following previous CNN-based audio encoding models [18-20], the audio encoder is designed as a CNN that takes the spectrogram as input. The video encoder is used to process the input face embedding. In our approach, the video feature vectors and audio feature vectors are concatenated at every step of the encoding stage, and the size of the visual feature vectors after each convolution layer has to be the same as that of the corresponding audio feature vectors, as shown in Figure 1.

The fusion component consists of an audio-visual fusion process and an audio-visual embedding process. The audio-visual fusion process designates a consolidated dimension to implement fusion: it combines the audio and visual streams of each layer directly and feeds the combination into several convolution layers. The audio-visual embedding flattens the audio and visual streams from 3D to 1D, concatenates both flattened streams, and finally feeds the concatenated feature vector into several fully-connected layers. The audio-visual embedding is a deeper feature fusion strategy, and the resulting vector is then used to build the decoder component.

The decoder component, or audio decoder, is made of deconvolutional layers. Because of the downsampling blocks, the model computes a number of higher-level features on coarser time scales and generates the audio-visual features by the audio-visual fusion process at each level, which are concatenated with the local, high-resolution features computed by the upsampling block of the same level. This concatenation results in multi-scale features for the predictions.
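To make the per-layer fusion concrete, the following is a minimal PyTorch sketch of the idea described above: audio and visual feature maps are concatenated and fused at every encoder level, and each fused map is passed as a skip connection to the decoder level of the same scale. The class name, layer counts, channel sizes, and the assumption that the video stream has already been resized to the spectrogram grid are illustrative, not the authors' exact configuration; the STA gating of the skips, described below, is omitted here.

```python
import torch
import torch.nn as nn

class AVFusionEncoderDecoder(nn.Module):
    """Encoder-decoder with per-layer audio-visual fusion skip connections (sketch)."""

    def __init__(self, audio_ch=1, video_ch=5, base=16, depth=3):
        super().__init__()
        chans = [base * 2 ** d for d in range(depth)]            # e.g. [16, 32, 64]
        self.a_enc, self.v_enc, self.fuse = nn.ModuleList(), nn.ModuleList(), nn.ModuleList()
        a_in, v_in = audio_ch, video_ch
        for c in chans:
            self.a_enc.append(nn.Sequential(nn.Conv2d(a_in, c, 3, 2, 1), nn.ReLU()))  # audio downsampling block
            self.v_enc.append(nn.Sequential(nn.Conv2d(v_in, c, 3, 2, 1), nn.ReLU()))  # video downsampling block
            self.fuse.append(nn.Sequential(nn.Conv2d(2 * c, c, 3, 1, 1), nn.ReLU()))  # per-layer AV fusion
            a_in, v_in = c, c
        self.dec = nn.ModuleList()
        d_in = chans[-1]                                         # bottleneck: deepest audio features
        for c in reversed(chans):
            # each decoder block sees the previous decoder output plus the fused skip of its level
            self.dec.append(nn.Sequential(nn.ConvTranspose2d(d_in + c, c, 4, 2, 1), nn.ReLU()))
            d_in = c
        self.out = nn.Conv2d(chans[0], audio_ch, 1)              # enhanced spectrogram

    def forward(self, spec, video):
        a, v, skips = spec, video, []
        for a_enc, v_enc, fuse in zip(self.a_enc, self.v_enc, self.fuse):
            a, v = a_enc(a), v_enc(v)
            skips.append(fuse(torch.cat([a, v], dim=1)))         # fused AV features at this scale
        x = a
        for dec, skip in zip(self.dec, reversed(skips)):
            x = dec(torch.cat([x, skip], dim=1))
        return self.out(x)

# Example: an 80x24 log-Mel chunk (time padded to a multiple of 2**depth) and 5 lip frames
# assumed pre-resized to the same 80x24 grid.
net = AVFusionEncoderDecoder()
enhanced = net(torch.randn(2, 1, 80, 24), torch.randn(2, 5, 80, 24))   # -> (2, 1, 80, 24)
```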
In the proposed architecture, the potential imbalance caused by concatenation-based fusion easily occurs in the decoder blocks, when features computed along the contracting path are concatenated directly with the decoder blocks of the same hierarchical level. Consequently, the proposed model uses attention gates, as shown in Figure 1, to selectively filter out unimportant information using a soft-thresholding algorithm.

Soft-thresholding is a kind of filter that keeps useful information as strongly positive or negative features and maps noise information to near-zero features. Deep learning enables the soft thresholding to be learned automatically using gradient descent, which is a promising way to eliminate noise-related information and construct highly discriminative features. The soft-thresholding function can be expressed as

    Y = X − τ,   if X > τ
    Y = 0,       if −τ ≤ X ≤ τ                      (1)
    Y = X + τ,   if X < −τ

where X is the input feature, Y is the output feature, and τ is the threshold. In addition, X and τ are not independent variables, τ is non-negative, and their relation is expressed in Eq. 3.

Figure 2: The STA, where X_{i,j,k} denotes the input feature, i, j, and k are the indices of width, height, and channel of the feature map X, Y is the output feature, whose size is the same as that of X, and z, α are the indicators of the feature maps used when determining the threshold.

The threshold is estimated by a set of deep learning blocks, as shown in Figure 2. In the threshold estimating module, the absolute value of the feature map X_{i,j,k}, where i, j, and k are the indices of width, height, and channel, is taken and its dimension is reduced to 1D. Several fully-connected layers then generate the attention mask [21], where a sigmoid function at the last layer scales the attention mask to the range 0 to 1:

    α = 1 / (1 + e^(−z))                            (2)

where z is the output of the fully-connected layers and α is the attention mask. Finally, the threshold parameter τ, obtained by multiplying the average value of |X_{i,j,k}| by the attention mask α, is used to threshold the feature values:

    τ = α × Avg(|X_{i,j,k}|)                        (3)

where Avg(·) denotes average pooling. Substituting Eq. 2 and Eq. 3 into Eq. 1, the output feature Y_{i,j,k} is obtained.

There are two advantages of STA. Firstly, it removes noise-related features from the higher-level audio-visual fusion vectors. Secondly, it balances the audio and visual modalities in the audio-visual fusion vector and selects audio-visual features adaptively.
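As a concrete illustration, the STA described by Eqs. (1)-(3) can be sketched in PyTorch as below. The two-layer fully-connected stack and the reduction ratio are assumptions (the text only states "several fully-connected layers"); the threshold is estimated per channel from the pooled absolute feature values.

```python
import torch
import torch.nn as nn

class SoftThresholdAttention(nn.Module):
    """Sketch of STA: learn a per-channel threshold tau = alpha * Avg(|X|) and soft-threshold X."""

    def __init__(self, channels, reduction=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),                              # attention mask alpha in (0, 1), Eq. (2)
        )

    def forward(self, x):                              # x: (B, C, H, W)
        abs_mean = x.abs().mean(dim=(2, 3))            # Avg(|X|) per channel -> (B, C)
        alpha = self.fc(abs_mean)                      # attention mask, Eq. (2)
        tau = (alpha * abs_mean).view(x.size(0), -1, 1, 1)   # threshold, Eq. (3)
        # Soft thresholding, Eq. (1): shrink towards zero by tau, zero inside [-tau, tau].
        return torch.sign(x) * torch.clamp(x.abs() - tau, min=0.0)

# Example usage on a fused audio-visual feature map.
sta = SoftThresholdAttention(channels=64)
y = sta(torch.randn(2, 64, 10, 3))                     # same shape as the input
```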
3. Datasets and Implementation Details
The data used by the proposed model comes from two publicly available audio-visual datasets, GRID [22] and TCD-TIMIT [23], which are the two most commonly used databases in the area of audio-visual speech processing. GRID consists of video recordings in which 18 male speakers and 16 female speakers pronounce 1000 sentences each. TCD-TIMIT consists of 32 male speakers and 30 female speakers with around 200 videos each.

The proposed model shuffles and splits the data into training, validation, and evaluation sets of 24300 utterances (15 males, 12 females, 900 utterances each), 4400 utterances (12 males, 10 females, 200 utterances each), and 1200 utterances (4 males, 4 females, 150 utterances each), respectively.

Table 1: Performance of trained networks

Test SNR            -5 dB            0 dB
Evaluation Metrics  STOI(%)  PESQ    STOI(%)  PESQ
Noisy               51.4     1.03    62.6     1.24
TCNN                78.7     2.19    81.3     2.58
Baseline            81.3     2.35    87.9     2.94
MFFCN               82.7     2.72

The noise dataset contains 25.3 hours of ambient noise categorized into 12 types: room, car, instrument, engine, train, human chatting, air-brake, water, street, mic-noise, ring-bell, and music. Part of the noise signals (23.9 hours) is used for both the training and validation sets, while the rest is used to mix the evaluation set. The speech-noise mixtures for training and validation are generated by randomly selecting utterances from the speech dataset and the noise dataset and mixing them at a random SNR between -10 dB and 10 dB. The evaluation set is generated at SNRs of 0 dB and -5 dB.
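The mixing at a target SNR can be sketched as follows; the function name and the noise tiling are illustrative assumptions, and only the SNR-based scaling reflects the procedure described above.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so that the speech-to-noise ratio of the mixture is `snr_db` (both float arrays)."""
    # Crop or tile the noise to the speech length (illustrative choice).
    if len(noise) < len(speech):
        noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
    noise = noise[:len(speech)]
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    # Want p_speech / (scale**2 * p_noise) = 10**(snr_db / 10).
    scale = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return speech + scale * noise

# Training/validation mixtures at a random SNR in [-10, 10] dB, as described above:
# snr = np.random.uniform(-10.0, 10.0); mixture = mix_at_snr(clean, noise, snr)
```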
The audio representation is the magnitude spectrogram transformed to the log Mel-domain. The input audio signals are raw waveforms, which are first resampled to 16 kHz and transformed to spectrograms using the Short Time Fourier Transform (STFT) with a Hann window. Each frame contains a window of 40 milliseconds, which equals 640 samples per frame and corresponds to the duration of a single video frame, and the frame shift is 160 samples (10 milliseconds).

The spectrograms are then converted to log Mel-scale spectrograms via Mel-scale filter banks. The resulting spectrogram has 80 Mel frequency bands from 0 to 8 kHz. The whole spectrogram is sliced into pieces of 200 milliseconds, corresponding to the length of 5 video frames, resulting in spectrograms of size 80 × 20, i.e., 20 temporal samples with 80 frequency bins each.
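A minimal sketch of this audio front-end, assuming the librosa package; the log epsilon and the slicing details are illustrative choices, while the window, hop, Mel-band, and sampling settings follow the text.

```python
import librosa
import numpy as np

def log_mel_chunks(wav_path):
    """Return an array of (80, 20) log-Mel chunks, each covering 200 ms of audio."""
    y, sr = librosa.load(wav_path, sr=16000)             # resample to 16 kHz
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr,
        n_fft=640, win_length=640, hop_length=160,        # 40 ms Hann window, 10 ms shift
        window="hann", power=1.0,                         # magnitude (not power) spectrogram
        n_mels=80, fmin=0, fmax=8000)                     # 80 Mel bands, 0-8 kHz
    log_mel = np.log(mel + 1e-8)                          # log compression (epsilon is an assumption)
    n = log_mel.shape[1] // 20                            # 20 frames = 200 ms = 5 video frames at 25 fps
    return np.stack([log_mel[:, i * 20:(i + 1) * 20] for i in range(n)])
```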
The visual representation is extracted from the input videos, which are resampled to 25 frames per second. Each video is divided into non-overlapping segments of 5 frames. During processing, each frame is cropped to a mouth-centered window of size 128 × 128 using the 20 mouth landmarks from the 68 facial landmarks suggested by Kazemi et al. [24]. The video segment fed to the network is therefore a stack of five 128 × 128 mouth crops.
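A sketch of the mouth-cropping step, assuming dlib's 68-point facial landmark predictor from [24] together with OpenCV; the predictor file path and the simple center-and-crop logic are illustrative assumptions.

```python
import cv2
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
# Path to the pretrained 68-landmark model is a placeholder.
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def crop_mouth(frame):
    """Return a 128x128 mouth-centered crop of a BGR video frame, or None if no face is found."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = detector(gray, 1)
    if not faces:
        return None
    shape = predictor(gray, faces[0])
    # Landmarks 48-67 are the 20 mouth points of the 68-point scheme.
    pts = np.array([(shape.part(i).x, shape.part(i).y) for i in range(48, 68)])
    cx, cy = pts.mean(axis=0).astype(int)
    half = 64                                             # 128x128 window centered on the mouth
    crop = frame[max(cy - half, 0):cy + half, max(cx - half, 0):cx + half]
    return cv2.resize(crop, (128, 128))
```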
4. Experiment Results
To evaluate the performance of the proposed approach, comparisons are provided with several recently proposed speech enhancement algorithms. Specifically, the AMFFCN model is compared with the TCNN model (an AO-SE approach), the AV-SE baseline system, and the MFFCN model.

Table 2: Performance comparison of AMFFCN with state-of-the-art results on GRID

Test SNR                    -5 dB    0 dB
Evaluation Metrics          ΔPESQ    ΔPESQ
Deep-learning-based AV-SE   1.13     0.74
OVA Approach                0.21     0.06
L2L Model                   0.26     0.19

Therefore, four networks have been trained:
• TCNN [25]: Temporal convolutional neural network for real-time speech enhancement in the time domain.

• Baseline [26]: A baseline work of visual speech enhancement.

• MFFCN [17]: Multi-layer Feature Fusion Convolution Network for audio-visual speech enhancement.

• AMFFCN: Attentional Multi-layer Feature Fusion Convolution Network for audio-visual speech enhancement.
The results of the proposed network are reported using the following evaluation metrics: Short Term Objective Intelligibility (STOI) [27] and Perceptual Evaluation of Speech Quality (PESQ) [28]. Each measurement compares the enhanced speech with the clean reference of each of the test stimuli provided in the dataset.

Table 1 demonstrates the improvement in performance as new components are added to the speech enhancement architecture: the visual modality, the multi-layer audio-visual feature fusion strategy, and finally the STA. The AV-SE baseline work outperforms TCNN, an end-to-end deep learning based AO-SE system, and the MFFCN model performs better than the baseline system. The performance improvement from TCNN (AO-SE) to MFFCN thus has two primary reasons: a) the addition of the visual modality, and b) the use of the multi-layer audio-visual fusion strategy instead of concatenating the audio and visual modalities only once in the whole network. Finally, the results in Table 1 show that STA improves the performance of MFFCN further. Figure 3 visualizes the enhancement results of the baseline system, the MFFCN model, and the AMFFCN model; the details of the spectra being compared are framed by dotted boxes.

Table 2 demonstrates that our proposed approach produces state-of-the-art results in terms of the speech quality metrics discussed above, comparing against the following three recently proposed methods that use deep neural networks to perform AV-SE on the GRID dataset:
• Deep-learning-based AV-SE [29]: Deep-learning-based audio-visual speech enhancement in the presence of the Lombard effect.

• OVA approach [30]: An LSTM based AV-SE model with mask estimation.

• L2L model [31]: A speaker-independent audio-visual model for speech separation.

Speech samples are available at: https://XinmengXu.github.io/AVSE/AMFFCN
Figure 3: Example of input and enhanced spectra from an example speech utterance. (a) Noisy speech input from the test data under the condition of ambient noise at -10 dB. (b) Enhanced speech generated by the baseline work. (c) Enhanced speech generated by the MFFCN model. (d) Enhanced speech generated by the proposed AMFFCN model.
In Table 2, ΔPESQ denotes the PESQ improvement over noisy speech; the corresponding AMFFCN results are given in Table 1. Results for the competing methods are taken from the corresponding papers. Although the comparison is therefore for reference only, the proposed model demonstrates robust performance in comparison with state-of-the-art results on the GRID AV-SE task.
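Both metrics are available as open-source implementations; the following is a minimal evaluation sketch assuming the pystoi and pesq Python packages (the wideband PESQ mode is an assumption).

```python
from pystoi import stoi
from pesq import pesq

def evaluate(clean, enhanced, fs=16000):
    """Return (STOI in %, PESQ) for one enhanced utterance against its clean reference."""
    stoi_score = stoi(clean, enhanced, fs, extended=False) * 100.0
    pesq_score = pesq(fs, clean, enhanced, "wb")      # wideband mode for 16 kHz signals
    return stoi_score, pesq_score
```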
5. Conclusion
This paper proposed an AMFFCN model for audio-visual speech enhancement. The multi-layer feature fusion strategy processes a long temporal context by repeated downsampling and convolution of feature maps to combine both high-level and low-level features at different layer steps. In addition, the STA, inspired by the soft-thresholding algorithm, can automatically select informative features, keep them as strongly positive or negative features, and eliminate the remaining near-zero features. The results illustrate that the proposed model performs better than several published state-of-the-art models on the GRID dataset.

6. References

[1] L.-P. Yang and Q.-J. Fu, "Spectral subtraction-based speech enhancement for cochlear implant patients in background noise," The Journal of the Acoustical Society of America, vol. 117, no. 3, pp. 1001-1004, 2005.
[2] K. Paliwal and A. Basu, "A speech enhancement method based on Kalman filtering," in ICASSP'87, IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 12. IEEE, 1987, pp. 177-180.
[3] Y. Xu, J. Du, L.-R. Dai, and C.-H. Lee, "A regression approach to speech enhancement based on deep neural networks," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 23, no. 1, pp. 7-19, 2014.
[4] J.-C. Hou, S.-S. Wang, Y.-H. Lai, Y. Tsao, H.-W. Chang, and H.-M. Wang, "Audio-visual speech enhancement using multimodal deep convolutional neural networks," IEEE Transactions on Emerging Topics in Computational Intelligence, vol. 2, no. 2, pp. 117-128, 2018.
[5] I. Almajai and B. Milner, "Visually derived Wiener filters for speech enhancement," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 6, pp. 1642-1651, 2010.
[6] L. Girin, J.-L. Schwartz, and G. Feng, "Audio-visual enhancement of speech in noise," The Journal of the Acoustical Society of America, vol. 109, no. 6, pp. 3007-3020, 2001.
[7] Y. Ephraim and H. L. Van Trees, "A signal subspace approach for speech enhancement," IEEE Transactions on Speech and Audio Processing, vol. 3, no. 4, pp. 251-266, 1995.
[8] Y. Ephraim, "Statistical-model-based speech enhancement systems," Proceedings of the IEEE, vol. 80, no. 10, pp. 1526-1555, 1992.
[9] A. Rezayee and S. Gazor, "An adaptive KLT approach for speech enhancement," IEEE Transactions on Speech and Audio Processing, vol. 9, no. 2, pp. 87-95, 2001.
[10] P. Scalart et al., "Speech enhancement based on a priori signal to noise estimation," in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol. 2. IEEE, 1996, pp. 629-632.
[11] W. H. Sumby and I. Pollack, "Visual contribution to speech intelligibility in noise," The Journal of the Acoustical Society of America, vol. 26, no. 2, pp. 212-215, 1954.
[12] Q. Summerfield, "Use of visual information for phonetic perception," Phonetica, vol. 36, no. 4-5, pp. 314-331, 1979.
[13] D. Ramachandram and G. W. Taylor, "Deep multimodal learning: A survey on recent advances and trends," IEEE Signal Processing Magazine, vol. 34, no. 6, pp. 96-108, 2017.
[14] D. L. Donoho, "De-noising by soft-thresholding," IEEE Transactions on Information Theory, vol. 41, no. 3, pp. 613-627, 1995.
[15] M. Zhao, S. Zhong, X. Fu, B. Tang, and M. Pecht, "Deep residual shrinkage networks for fault diagnosis," IEEE Transactions on Industrial Informatics, vol. 16, no. 7, pp. 4681-4690, 2019.
[16] M. Zhao, S. Zhong, X. Fu, and B. Tang, "Deep residual shrinkage networks for fault diagnosis," IEEE Transactions on Industrial Informatics, vol. 16, no. 7, pp. 4681-4690, 2019.
[17] X. Xu, D. Xu, J. Jia, Y. Wang, and B. Chen, "MFFCN: Multi-layer feature fusion convolution network for audio-visual speech enhancement," arXiv preprint arXiv:2101.05975, 2021.
[18] S.-W. Fu, Y. Tsao, and X. Lu, "SNR-aware convolutional neural network modeling for speech enhancement," in Interspeech, 2016, pp. 3768-3772.
[19] T. Kounovsky and J. Malek, "Single channel speech enhancement using convolutional neural network," IEEE, 2017, pp. 1-5.
[20] K. Tan and D. Wang, "A convolutional recurrent neural network for real-time speech enhancement," in Interspeech, 2018, pp. 3229-3233.
[21] J. Hu, L. Shen, and G. Sun, "Squeeze-and-excitation networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 7132-7141.
[22] M. Cooke, J. Barker, S. Cunningham, and X. Shao, "An audio-visual corpus for speech perception and automatic speech recognition," The Journal of the Acoustical Society of America, vol. 120, no. 5, pp. 2421-2424, 2006.
[23] N. Harte and E. Gillen, "TCD-TIMIT: An audio-visual corpus of continuous speech," IEEE Transactions on Multimedia, vol. 17, no. 5, pp. 603-615, 2015.
[24] V. Kazemi and J. Sullivan, "One millisecond face alignment with an ensemble of regression trees," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 1867-1874.
[25] A. Pandey and D. Wang, "TCNN: Temporal convolutional neural network for real-time speech enhancement in the time domain," in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 6875-6879.
[26] A. Gabbay, A. Shamir, and S. Peleg, "Visual speech enhancement," in Interspeech, 2018, pp. 1170-1174.
[27] C. H. Taal, R. C. Hendriks, R. Heusdens, and J. Jensen, "An algorithm for intelligibility prediction of time-frequency weighted noisy speech," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 7, pp. 2125-2136, 2011.
[28] A. W. Rix, J. G. Beerends, M. P. Hollier, and A. P. Hekstra, "Perceptual evaluation of speech quality (PESQ) - a new method for speech quality assessment of telephone networks and codecs," in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol. 2. IEEE, 2001, pp. 749-752.
[29] D. Michelsanti, Z.-H. Tan, S. Sigurdsson, and J. Jensen, "Deep-learning-based audio-visual speech enhancement in presence of Lombard effect," Speech Communication, vol. 115, pp. 38-50, 2019.
[30] W. Wang, C. Xing, D. Wang, X. Chen, and F. Sun, "A robust audio-visual speech enhancement model," in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 7529-7533.
[31] A. Ephrat, I. Mosseri, O. Lang, T. Dekel, K. Wilson, A. Hassidim, W. T. Freeman, and M. Rubinstein, "Looking to listen at the cocktail party: A speaker-independent audio-visual model for speech separation," ACM Transactions on Graphics, vol. 37, no. 4, 2018.