Correlating Subword Articulation with Lip Shapes for Embedding Aware Audio-Visual Speech Enhancement
Hang Chen a, Jun Du a,∗, Yu Hu a, Li-Rong Dai a, Bao-Cai Yin b, Chin-Hui Lee c

a National Engineering Laboratory for Speech and Language Information Processing, University of Science and Technology of China, Hefei, Anhui, China
b iFlytek Research, iFlytek Co., Ltd., Hefei, Anhui, China
c School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, GA, USA
Abstract
In this paper, we propose a visual embedding approach to improving embedding aware speech enhancement (EASE) by synchronizing visual lip frames at the phone and place of articulation levels. We first extract visual embedding from lip frames using a pre-trained phone or articulation place recognizer for visual-only EASE (VEASE). Next, we extract audio-visual embedding from noisy speech and lip videos in an information intersection manner, utilizing the complementarity of audio and visual features for multi-modal EASE (MEASE). Experiments on the TCD-TIMIT corpus corrupted by simulated additive noises show that our proposed subword-based VEASE approach is more effective than conventional embedding at the word level. Moreover, visual embedding at the articulation place level, leveraging upon a high correlation between place of articulation and lip shapes, shows an even better performance than that at the phone level. Finally, the proposed MEASE framework, incorporating both audio and visual embedding, yields significantly better speech quality and intelligibility than those obtained with the best visual-only and audio-only EASE systems.
Keywords:
Speech enhancement, audio-visual, representation learning, deep learning, universal attribute recognition

∗ Corresponding author
Email address: [email protected] (Jun Du)
1. Introduction

Background noises greatly reduce the quality and intelligibility of the speech signal, limiting the performance of speech-related applications in real-world conditions (e.g., automatic speech recognition, dialogue systems and hearing aids). The goal of speech enhancement [1] is to generate enhanced speech with better quality and intelligibility by suppressing the background noise components in noisy speech. Conventional speech enhancement approaches, such as spectral subtraction [2], Wiener filtering [3], minimum mean squared error (MMSE) estimation [4], and the optimally-modified log-spectral amplitude (OM-LSA) speech estimator [5, 6], have been extensively studied in the past. Recently, deep learning technologies have been successfully used for speech enhancement [7, 8, 9].

The human auditory system can track a single target voice source in an extremely noisy acoustic environment such as a cocktail party, which is known as the cocktail party effect [10]. This fascinating ability motivates us to draw on how humans perceive speech when designing speech enhancement systems. The McGurk effect [11] suggests a strong influence of vision on human speech perception. Further studies [12, 13, 14, 15] have shown that visual cues such as facial/lip movements can supplement the acoustic information of the corresponding speaker and help speech perception, especially in noisy environments. Inspired by these discoveries, speech enhancement methods utilizing both audio and visual signals, known as audio-visual speech enhancement (AVSE), have been developed.

AVSE methods can be traced back to [16] and follow-up work, e.g., [17, 18, 19, 20, 21, 22]. Recently, numerous studies have attempted to build deep neural network-based AVSE models. [23] employed a video-to-speech method to construct T-F masks for speech enhancement. An encoder-decoder architecture was used in [24, 25]. These methods were only demonstrated under constrained conditions (e.g., utterances drawn from a fixed set of phrases, or a small number of known speakers). [26] proposed a deep AVSE network consisting of magnitude and phase sub-networks, which enhanced the magnitude and phase, respectively. [27] designed a model that was conditioned on the facial embedding of the source speaker and output a complex mask. [28] proposed a time-domain AVSE framework based on Conv-TasNet [29]. These methods all performed well in situations with unknown speakers and unknown noise types.

We briefly discuss the above-mentioned AVSE methods from two perspectives: the visual embedding and the audio-visual fusion method. Regarding the visual embedding, [25, 23, 24] made use of image sequences of the lip region. To discard irrelevant variations between images, such as illumination, [27] proposed using a face embedding obtained from a pre-trained face recognizer and confirmed through ablation experiments that the lip area contributed most to enhancement performance within the face region. Moreover, [26, 28] chose lip embeddings taken from a middle-layer output of a pre-trained isolated word recognition model. In recent work, [30] adopted the phone as the classification target instead of isolated words and provided a more useful visual embedding for speech separation.
In terms of the audio-visual fusion method, most AVSE methods perform audio-visual fusion at a middle layer of the enhancement network via channel-wise concatenation.

We can draw some lessons from these pioneering works. A useful visual embedding should contain as much of the acoustic information in the video as possible. However, the acoustic information in video is very limited, and there is also other, redundant information. In the current classification-based embedding extraction framework, we can obtain a more robust and generalizable visual embedding by reducing this information redundancy and increasing the correlation between the classification target and the visual acoustic information. Cropping the lip area helps reduce the redundancy; for the latter, finding a classification target that is more relevant to lip movements is informative.

The superset of speech information, called speech attributes, includes a series of fundamental speech sounds with their linguistic interpretations, speaker characteristics, emotional state, etc. [31]. In contrast to phone models, a smaller number of universal attribute units are needed for a complete characterization of any spoken language [32]. The sets of universal speech attributes used in related works mainly consist of place and manner of articulation [33, 34, 35]. We propose that a higher correlation between the speech attribute and the visual acoustic information can provide a more useful supervisory signal when training the visual embedding extractor.

One commonly accepted consensus in multimodal learning is that the data of each modality obeys an independent distribution conditioned on the ground truth label [36, 37, 38, 39]. Each modality captures features related to the ground truth labels from different aspects, so the information extracted (labels excluded) from one modality is not necessarily related to the others. This means the ground truth can be seen as the "information intersection" between all modalities [40], i.e., the amount of agreement shared by all the modalities. In AVSE, however, there is a mismatch between the information intersection and the ground truth label: the intersection of the audio modality (noisy speech) and the video modality (lip video) is not the clean speech that serves as the ground truth.

In this work, we extend the previous AVSE framework to the embedding aware speech enhancement (EASE) framework. Conventional AVSE methods are regarded as special EASE methods that only utilize visual embedding extracted from lip frames, known as visual-only EASE (VEASE) methods. Within the EASE framework, we propose a VEASE model using a novel visual embedding, namely the middle-layer output of a pre-trained universal articulation place recognizer. We use the same dataset for training the embedding extractor and the enhancement network. A more effective visual embedding is obtained by exploiting the high correlation between the designed classification target, i.e., the articulation place, and the visual acoustic information, rather than relying on additional video data. Moreover, we present a novel multimodal EASE (MEASE) model using a multimodal embedding instead of a unimodal embedding. The visual embedding extractor in the VEASE model evolves into an audio-visual embedding extractor in the MEASE model. The enhancement network takes not only the noisy speech but also the fused audio-visual embedding as inputs and outputs an ideal ratio mask.
The fusion of audio and visual embeddings occurs in the stage of embedding extractor training and is supervised by their information intersection at the articulation place label level.

The main contributions of this paper are:

(1) We explore the effectiveness of visual embeddings pre-trained for different classification targets on enhancement performance. A novel classification target, i.e., the articulation place, is proposed for training the visual embedding extractor. The visual embedding exploiting the high correlation between the articulation place and the acoustic information in video achieves better enhancement performance with no additional data used.

(2) We verify through ablation experiments that the complementarity between audio and visual embeddings lies in different SNR levels, as well as in different articulation places. Based on the information intersection, we adopt a novel fusion method integrating visual and audio embeddings in the proposed MEASE model, which achieves better performance at all SNR levels and for all articulation places.

(3) We design experiments to study the effect of the stage at which audio-visual fusion occurs on the quality and intelligibility of enhanced speech, and we observe that the early fusion of audio and visual embeddings achieves better enhancement performance.

Concurrently and independently from us, a number of groups have proposed various methods for AVSE from the above two perspectives. [41] observed serious performance degradations when AVSE methods were applied at medium or high SNRs and proposed a late fusion-based approach to safely combine visual knowledge in speech enhancement, which is the opposite of our work. [42] proposed a new mechanism for audio-visual fusion, in which the fusion block was adaptable to any middle layer of the enhancement network. The performance degradation in [41] may result from changes in the network structure, but we have indeed observed a reduction in improvements in our comparative experiments, as will be discussed in Section 4.4.
2. VEASE Model Utilizing Articulation Place Label
In this section, we elaborate our proposed VEASE model, covering two aspects: its architecture and its training process. The visual embedding extractor is an important part of the VEASE model; it takes a sequence of lip frames as input and outputs a compact vector for every lip frame, known as the visual embedding. The VEASE model takes both noisy log-power spectra (LPS) features and visual embeddings as inputs, and outputs an ideal ratio mask. The details of the visual embedding extractor and the VEASE model are elaborated in the following.

Figure 1: Illustration of a visual embedding extractor (3D-Conv, BN & ReLU, 3D-MaxPooling and ResNet-18, followed by a BiGRU and SoftMax classification backend producing word/phone/place posteriors; shown in blue for ease of cross-referencing in Figures 1, 3, 5 and 6). For every lip frame, the extractor outputs a compact vector. We train the visual embedding extractor with 3 different classification labels, i.e., word, phone and place.

The visual embedding extractor f_V has a similar structure to [43, 44], which was also used in previous AVSE studies [26, 28]. The extractor consists of a spatiotemporal convolution followed by an 18-layer ResNet [45] in its identity-mapping version [46], as shown in Figure 1. The spatiotemporal convolution consists of a convolution layer with 64 3D kernels with a temporal extent of 5 frames, followed by batch normalization, ReLU activation and 3D max-pooling. Given the lip frames V = {V^t ∈ R^{H×W}; t = 0, 1, ..., T_V − 1}, feature maps are extracted by the spatiotemporal convolution and then passed through the 18-layer ResNet. The spatial dimensionality shrinks progressively in the ResNet until the output becomes an L_V-dimensional vector per time step, known as the visual embedding E_V:

E_V = {E_V^t ∈ R^{L_V}; t = 0, 1, ..., T_V − 1} = f_V(V) = ResNet-18_V(MaxPooling(BN(ReLU(Conv_3D(V)))))    (1)

where T_V, and H and W, denote the number and the size of lip frames, respectively. In this study, we use L_V = 256, H = 98 and W = 98 by default.

The visual embedding extractor is trained with a classification backend f′_C, shown on the right side of Figure 1, consisting of a 2-layer BiGRU and a fully connected layer followed by a SoftMax activation. The output E_V is fed to f′_C, and the posterior probability of each class representing each segment of lip frames is calculated as follows:

P_class = f′_C(E_V) = SoftMax(Mean(FC(BiGRU(BiGRU(E_V)))))    (2)

where the class can be a word, a phone or a place of articulation.
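To make the structure above concrete, the following is a minimal PyTorch sketch of such an extractor and its classification backend. It is an illustrative reconstruction, not the authors' released code: the ResNet-18 trunk is replaced by a small stand-in CNN, and all layer hyper-parameters (strides, padding, hidden sizes) and names are our assumptions.

```python
import torch
import torch.nn as nn

class VisualEmbeddingExtractor(nn.Module):
    """Spatiotemporal 3D-Conv front-end + per-frame 2D trunk -> one vector per lip frame (Eq. 1)."""
    def __init__(self, embed_dim=256):
        super().__init__()
        # 64 3D kernels (temporal extent 5), then BN, ReLU and 3D max-pooling, as in Figure 1.
        self.frontend = nn.Sequential(
            nn.Conv3d(1, 64, kernel_size=(5, 7, 7), stride=(1, 2, 2), padding=(2, 3, 3)),
            nn.BatchNorm3d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool3d(kernel_size=(1, 3, 3), stride=(1, 2, 2), padding=(0, 1, 1)),
        )
        # Stand-in for the 18-layer ResNet applied to every frame; it shrinks the
        # spatial dimensions until a single embed_dim vector per time step remains.
        self.trunk = nn.Sequential(
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.BatchNorm2d(128), nn.ReLU(inplace=True),
            nn.Conv2d(128, embed_dim, 3, stride=2, padding=1), nn.BatchNorm2d(embed_dim), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),
        )

    def forward(self, lips):                      # lips: (batch, 1, T_V, H, W) grayscale lip crops
        x = self.frontend(lips)                   # (batch, 64, T_V, H', W')
        b, c, t, h, w = x.shape
        x = x.transpose(1, 2).reshape(b * t, c, h, w)
        return self.trunk(x).view(b, t, -1)       # E_V: (batch, T_V, embed_dim)

class ClassificationBackend(nn.Module):
    """2-layer BiGRU + FC (Eq. 2); mean pooling over time gives one posterior per segment."""
    def __init__(self, embed_dim=256, num_classes=500, segment_level=True):
        super().__init__()
        self.gru = nn.GRU(embed_dim, embed_dim, num_layers=2, bidirectional=True, batch_first=True)
        self.fc = nn.Linear(2 * embed_dim, num_classes)
        self.segment_level = segment_level        # True for word labels, False for frame-level phone/place labels

    def forward(self, emb):                       # emb: (batch, T_V, embed_dim)
        h, _ = self.gru(emb)
        logits = self.fc(h)                       # (batch, T_V, num_classes)
        return logits.mean(dim=1) if self.segment_level else logits

# Example: two clips of 75 frames of 98x98 lip crops; the softmax is folded into the CE loss.
extractor, backend = VisualEmbeddingExtractor(), ClassificationBackend(num_classes=500)
posteriors = backend(extractor(torch.randn(2, 1, 75, 98, 98)))
```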
Conventional AVSE techniques [26, 28] often obtain the visual embedding extractor discussed above from an isolated word classification task, using a lip reading dataset such as Lip Reading in the Wild (LRW). We build our baseline model, denoted as VEASE-word, using the LRW corpus, which consists of audio-visual speech segments (up to 1,000 per target word) extracted from BBC TV broadcasts (news, talk shows, etc.), totaling around 170 hours with 500 target words and more than 200 speakers. The LRW dataset provides a word-level label for each audio-visual speech segment, i.e., the true word distribution P^truth_word. We calculate the cross-entropy (CE) loss L_CE between P^truth_word and P_word:

L_CE = CE(P^truth_word ‖ P_word) = − Σ P^truth_word log P_word    (3)

The objective function L_CE is minimized by using the Adam optimizer [47] for 100 epochs with a mini-batch size of 64.

The isolated word classification task usually requires a word-level dataset, which is not easy to collect at a large scale. To alleviate this problem, we propose to use the same data, with different labels, for training the visual embedding extractor and the enhancement network. Guided by the results in [30], we choose context-independent (CI) phones, consisting of 39 units from the CMU dictionary, as classification labels; the resulting model is denoted as VEASE-phone. E_V is fed to a classification backend f_C, which has the same structure as f′_C and outputs the posterior probability of each CI phone for each time frame: P_phone = f_C(E_V) = SoftMax(FC(BiGRU(BiGRU(E_V)))).

The TCD-TIMIT dataset is a high-quality audio-visual speech corpus labeled at both the phonetic and the word level, so we can directly obtain the frame-level true distribution of CI phones P^truth_phone. The calculation of L_CE between P^truth_phone and P_phone is similar to Equation (3):

L_CE = CE(P^truth_phone ‖ P_phone) = − Σ P^truth_phone log P_phone    (4)

We use the same optimization process as in Section 2.2 to minimize L_CE.
Figure 2: Nine lip shapes corresponding to utterance segments representing the 9 articulation places (Dental, Velar, Glottal, Coronal, Retroflex, Low, High, Mid, Labial); all lip shapes come from a single speaker starting with the lips closed. The lip shape changes much more for High, Mid and Labial than for Dental, Velar and Glottal.
As discussed earlier, we believe there is a high correlation between speech attributes and the visual acoustic information. To verify this idea, we examine the lip shapes belonging to different places and manners of articulation. We find that different articulation places influence the change of lip shape differently, i.e., the lip shape changes greatly in utterance segments belonging to certain articulation places. An example is presented in Figure 2. In contrast, we do not observe similar changes for the manner of articulation. Consequently, we propose to train the visual embedding extractor with the articulation place label in this study; the resulting model is denoted as VEASE-place. We adopt 10 units for the articulation place set, as in [48, 49].

Compared with phones, the category granularity of articulation places is coarser, so the classification model can achieve comparable performance with lower complexity. The articulation place also has fewer categories, which reduces labeling costs. Moreover, the articulation place label is believed to be more language-independent than phones, which allows different languages to appear in training and testing.

The same classification backend f_C takes E_V as input and outputs the posterior probability of each articulation place class for each time frame, P_place. P^truth_phone is mapped into the frame-level true distribution of articulation places P^truth_place by using Table 1. L_CE between P^truth_place and P_place is calculated as follows:

L_CE = CE(P^truth_place ‖ P_place) = − Σ P^truth_place log P_place    (5)

The optimization process to minimize L_CE is the same as that in Section 2.2.

Table 1: The mapping between articulation place classes and CI-phones as in [48].

Articulation place class | CI-phones
Coronal   | d, l, n, s, t, z
High      | ch, ih, iy, jh, sh, uh, uw, y
Dental    | dh, th
Glottal   | hh
Labial    | b, f, m, p, v, w
Low       | aa, ae, aw, ay, oy
Mid       | ah, eh, ey, ow
Retroflex | er, r
Velar     | g, k, ng
Silence   | sil
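As a small illustration of how the frame-level place labels can be derived in practice, the snippet below maps CI-phone labels to the Table 1 classes; the function and variable names are ours, not from the paper.

```python
# Mapping of Table 1: articulation place class -> CI-phones (plus silence).
PLACE_TO_PHONES = {
    "Coronal":   ["d", "l", "n", "s", "t", "z"],
    "High":      ["ch", "ih", "iy", "jh", "sh", "uh", "uw", "y"],
    "Dental":    ["dh", "th"],
    "Glottal":   ["hh"],
    "Labial":    ["b", "f", "m", "p", "v", "w"],
    "Low":       ["aa", "ae", "aw", "ay", "oy"],
    "Mid":       ["ah", "eh", "ey", "ow"],
    "Retroflex": ["er", "r"],
    "Velar":     ["g", "k", "ng"],
    "Silence":   ["sil"],
}
# Inverted lookup: CI-phone -> articulation place.
PHONE_TO_PLACE = {p: place for place, phones in PLACE_TO_PHONES.items() for p in phones}

def phone_frames_to_place_frames(phone_frames):
    """Convert a frame-level CI-phone alignment into frame-level articulation place labels."""
    return [PHONE_TO_PLACE[p] for p in phone_frames]

# A few frames of a forced alignment (hypothetical example):
print(phone_frames_to_place_frames(["sil", "b", "ah", "t"]))  # ['Silence', 'Labial', 'Mid', 'Coronal']
```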
The VEASE model consists of three stacks of 1D-ConvBlocks and a frozen visual embedding extractor, as shown in Figure 3. Each 1D-ConvBlock includes a 1D convolution layer with a residual connection, a ReLU activation, and batch normalization, as in [26]. Some of the blocks contain an extra up-sampling or down-sampling layer, because the number of audio frames differs from the number of video frames.

The visual embedding E_V is processed by the stack s_E at the bottom left, consisting of N_E 1D-ConvBlocks, and the noisy LPS features A_LPS = {A_LPS^t ∈ R^F; t = 0, 1, ..., T_A − 1} are processed by the stack s_LPS at the bottom right, consisting of N_LPS 1D-ConvBlocks:

R_E = s_E(E_V) = ConvBlock(··· ConvBlock(E_V))    [N_E blocks]    (6)

R_LPS = s_LPS(A_LPS) = ConvBlock(··· ConvBlock(A_LPS))    [N_LPS blocks]    (7)
Figure 3: Illustration of the VEASE model. The VEASE model takes the visual embeddings as auxiliary inputs in addition to the regular noisy LPS features. The visual embedding extractor is pre-trained separately with the classification backend, following the steps introduced in the above-mentioned sections. During the training of the VEASE model, the visual embedding extractor is kept frozen.

In Equations (6) and (7), T_A and F denote the number of time frames and frequency bins of the spectrogram, respectively, and R_E and R_LPS denote the outputs of the two stacks. R_E and R_LPS are then concatenated along the channel dimension and fed to the top stack s_F, consisting of N_F 1D-ConvBlocks, which outputs the estimated mask M ∈ R^{T_A × F}:

M = σ(s_F([R_E, R_LPS])) = σ(ConvBlock(··· ConvBlock([R_E, R_LPS])))    [N_F blocks]    (8)

The values of M range from 0 to 1. In this study, we use N_E = 10, N_LPS = 5 and N_F = 15 by default.

To show the effect of the embedding on enhancement performance, we also design a competitive no-embedding version of the EASE model, stripped of the stack s_E at the bottom left and of the frozen visual embedding extractor, denoted as the NoEASE model. The NoEASE model computes M using only the noisy LPS features as inputs:

M = σ(s_F(s_LPS(A_LPS)))    (9)

The ideal ratio mask (IRM) [50], which is widely used in monaural speech enhancement [51], is employed as the learning target. The IRM M_IRM ∈ R^{T_A × F} is calculated as follows:

M_IRM = (C_PS / (C_PS + D_PS))^{0.5}    (10)

where C_PS ∈ R^{T_A × F} and D_PS ∈ R^{T_A × F} denote the power spectrograms of clean speech and noise, respectively.

The mean square error (MSE) L_MSE between M and M_IRM is used as the loss function:

L_MSE = MSE(M, M_IRM) = Σ ‖M − M_IRM‖²    (11)

We use the Adam optimizer to train for 100 epochs with early stopping when there is no improvement on the validation loss for 10 epochs. The batch size is 96.
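The following sketch shows, under our own naming and with illustrative hyper-parameters, how the IRM target of Equation (10) (assuming the standard square-root IRM definition of [50]) and the masking loss of Equation (11) can be computed for a ConvBlock-style mask estimator; it is not the authors' implementation.

```python
import torch
import torch.nn as nn

class ConvBlock1D(nn.Module):
    """1D convolution with residual connection, ReLU and batch norm, as described for the stacks."""
    def __init__(self, channels, kernel_size=5):
        super().__init__()
        self.conv = nn.Conv1d(channels, channels, kernel_size, padding=kernel_size // 2)
        self.bn = nn.BatchNorm1d(channels)

    def forward(self, x):                          # x: (batch, channels, T)
        return self.bn(torch.relu(self.conv(x)) + x)

def ideal_ratio_mask(clean_power, noise_power, eps=1e-8):
    """Eq. (10): IRM from clean and noise power spectrograms, values in [0, 1]."""
    return torch.sqrt(clean_power / (clean_power + noise_power + eps))

# Toy shapes: T_A audio frames, F frequency bins.
T_A, F = 100, 201
clean_power = torch.rand(1, T_A, F)
noise_power = torch.rand(1, T_A, F)
m_irm = ideal_ratio_mask(clean_power, noise_power)

# A stand-in for sigma(s_F(s_LPS(A_LPS))), i.e. a NoEASE-style mask estimator over the LPS stream only.
s_lps = nn.Sequential(ConvBlock1D(F), ConvBlock1D(F), nn.Conv1d(F, F, 1))
noisy_lps = torch.randn(1, F, T_A)                           # channels-first for Conv1d
m_est = torch.sigmoid(s_lps(noisy_lps)).transpose(1, 2)      # (1, T_A, F), values in (0, 1)

loss = nn.functional.mse_loss(m_est, m_irm)                  # Eq. (11), mean-reduced variant
print(float(loss))
```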
3. Proposed MEASE Model
In this section, we elaborate our proposed MEASE model. The MEASE model takes the fused audio-visual embedding as the auxiliary input instead of the visual embedding. As described in Section 1, the MEASE model utilizes the complementarity of audio and visual features in an information intersection manner. To verify the complementarity between audio and visual embeddings, we design an EASE model that utilizes the audio embedding, denoted as the AEASE model. The AEASE model has a structure similar to the VEASE model, with the main difference of employing an audio embedding extractor instead of the visual embedding extractor. To verify the effectiveness of the information intersection-based audio-visual fusion on enhancement performance, we also design an EASE model that utilizes the concatenation of audio and visual embeddings, denoted as the cMEASE model. The details of the AEASE, MEASE and cMEASE models are elaborated in the following.
Figure 4: Illustration of an audio embedding extractor (noisy FBANK input, 1D-Conv, BN & ReLU and ResNet-18, followed by a BiGRU and SoftMax backend producing place posteriors; shown in green for ease of cross-referencing in Figures 4, 5 and 6). The audio embedding extractor has a structure similar to the visual embedding extractor in Section 2.1, and its training is the same as that in Section 2.4.
The AEASE model has a structure similar to the VEASE model shown in Figure 3, with the main difference of employing an audio embedding extractor instead of the visual embedding extractor.

The audio embedding extractor f_A has a structure similar to the visual embedding extractor in Section 2.1, as shown in Figure 4. The 3D kernels in the spatiotemporal convolution are replaced by 1D kernels, and the 3D max-pooling layer is dropped because each audio frame is a vector. We also use a ResNet-18, with the main difference of employing 1D kernels instead of 2D kernels. Given noisy FBANK features A_FBANK ∈ R^{T_A × F_mel}, the audio embeddings E_A ∈ R^{T_A × L_A} are calculated as follows:

E_A = {E_A^t ∈ R^{L_A}; t = 0, 1, ..., T_A − 1} = f_A(A_FBANK) = ResNet-18_A(BN(ReLU(Conv_1D(A_FBANK))))    (12)

where F_mel and L_A are the number of triangular mel filters used for the FBANK features and the length of E_A^t, respectively. In this study, L_A = L_V = 256 is used by default.

We use the same training process for the audio embedding extractor as for the visual embedding extractor in Section 2.4. The Adam optimizer is used to minimize L_CE, calculated by Equation (5), but P_place is computed by using E_A:

P_place = f_C(E_A)    (13)

The AEASE model takes both A_LPS and E_A as inputs and outputs M:

M = σ(s_F([s_E(E_A), s_LPS(A_LPS)]))    (14)

The same optimization process as in Section 2.5 is used to minimize L_MSE, calculated by Equation (11).
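A compact PyTorch sketch of this 1D analogue is given below; as with the earlier visual sketch, the ResNet trunk is replaced by a small stand-in, and all layer sizes are assumptions on our part.

```python
import torch
import torch.nn as nn

class AudioEmbeddingExtractor(nn.Module):
    """1D-Conv front-end over FBANK frames + 1D trunk -> one embedding per audio frame (Eq. 12)."""
    def __init__(self, n_mels=40, embed_dim=256):
        super().__init__()
        self.frontend = nn.Sequential(
            nn.Conv1d(n_mels, 64, kernel_size=5, padding=2),
            nn.BatchNorm1d(64),
            nn.ReLU(inplace=True),
        )
        # Stand-in for the 1D ResNet-18 trunk.
        self.trunk = nn.Sequential(
            nn.Conv1d(64, 128, 3, padding=1), nn.BatchNorm1d(128), nn.ReLU(inplace=True),
            nn.Conv1d(128, embed_dim, 3, padding=1), nn.BatchNorm1d(embed_dim), nn.ReLU(inplace=True),
        )

    def forward(self, fbank):               # fbank: (batch, T_A, n_mels)
        x = fbank.transpose(1, 2)           # Conv1d expects (batch, channels, T_A)
        x = self.trunk(self.frontend(x))
        return x.transpose(1, 2)            # E_A: (batch, T_A, embed_dim)

emb = AudioEmbeddingExtractor()(torch.randn(2, 400, 40))   # e.g. 400 frames of 40-dim FBANK
print(emb.shape)                                           # torch.Size([2, 400, 256])
```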
The most significant change in the MEASE model is that the visual embedding extractor evolves into an audio-visual embedding extractor. The audio-visual embedding extractor takes not only lip frames but also noisy FBANK features as inputs, and outputs the fused audio-visual embedding, which is learned under the supervision of the information intersection, i.e., the articulation place label.

The audio-visual embedding extractor consists of visual, audio and fused streams, as shown in the bottom left part of Figure 5.

Figure 5: Illustration of the proposed MEASE model. The previous visual embedding extractor evolves into the audio-visual embedding extractor, which consists of a visual stream (in blue), an audio stream (in green) and a fused stream (in orange). The audio-visual embedding extractor fuses the audio and visual embeddings in an information intersection manner.

The visual stream has the same structure as the visual embedding extractor in Section 2.1, while the audio stream has the same structure as the audio embedding extractor in Section 3.1. V and A_FBANK are processed by the visual and audio streams, respectively:

E_AV^V = f_V(V)    (15)

E_AV^A = f_A(A_FBANK)    (16)

where E_AV^V ∈ R^{T_V × L_V} and E_AV^A ∈ R^{T_A × L_A} denote the outputs of the visual and audio streams, respectively. The mismatch in the number of frames between E_AV^V and E_AV^A, i.e., T_A ≠ T_V, is resolved by repeating each video frame for several audio frames:

~E_AV^V = {E_AV^{V,0}, ..., E_AV^{V,0}, E_AV^{V,1}, ...}    [each frame repeated T_A/T_V times]    (17)

The fused stream, consisting of a 2-layer BiGRU at the top, takes ~E_AV^V and E_AV^A as inputs and outputs the audio-visual embedding E_AV ∈ R^{T_A × L_AV}:

E_AV = {E_AV^t; t = 0, 1, ..., T_A − 1} = BiGRU(BiGRU([~E_AV^V, E_AV^A]))    (18)

where L_AV is the length of E_AV^t. In this paper, we use L_AV = L_A + L_V = 512 by default.

We also use the same steps as in Section 2.4 to minimize L_CE, calculated by Equation (5), but P_place is computed by using E_AV:

P_place = f_C(E_AV)    (19)

It is worth noting that the audio-visual classification model achieves better and faster convergence when the visual and audio streams are initialized with the independently pre-trained parameters.

The MEASE model takes both A_LPS and E_AV as inputs and outputs M:

M = σ(s_F([s_E(E_AV), s_LPS(A_LPS)]))    (20)

We use the same optimization process as in Section 2.5 to minimize L_MSE, which is calculated by Equation (11).
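The frame-rate alignment of Equation (17) and the BiGRU fusion of Equation (18) can be sketched as follows; the repeat counts and sizes mirror the defaults quoted above (T_A = 4 × T_V, L_V = L_A = 256, L_AV = 512), and everything else, including obtaining the 512-dimensional output via a bidirectional hidden size of 256, is an assumption on our part.

```python
import torch
import torch.nn as nn

class FusedStream(nn.Module):
    """2-layer BiGRU over the concatenated visual and audio embeddings (Eqs. 17-18)."""
    def __init__(self, l_v=256, l_a=256, l_av=512):
        super().__init__()
        # A bidirectional GRU with hidden size l_av // 2 outputs l_av features per audio frame.
        self.gru = nn.GRU(l_v + l_a, l_av // 2, num_layers=2, bidirectional=True, batch_first=True)

    def forward(self, e_v, e_a):
        # e_v: (batch, T_V, l_v) video-rate embeddings; e_a: (batch, T_A, l_a) audio-rate embeddings.
        repeat = e_a.size(1) // e_v.size(1)                   # T_A / T_V, e.g. 4
        e_v_up = torch.repeat_interleave(e_v, repeat, dim=1)  # Eq. (17): repeat each video frame
        fused, _ = self.gru(torch.cat([e_v_up, e_a], dim=-1))
        return fused                                          # E_AV: (batch, T_A, l_av)

e_v = torch.randn(2, 75, 256)     # 3 s of 25 fps lip embeddings
e_a = torch.randn(2, 300, 256)    # 3 s of 100 fps audio embeddings
print(FusedStream()(e_v, e_a).shape)   # torch.Size([2, 300, 512])
```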
By ablating the fused stream in Figure 5, another audio-visual embedding, cE_AV ∈ R^{T_A × (L_A + L_V)}, which is simply the concatenation of the audio and visual embeddings, is designed:

cE_AV = [E_V, E_A] = [f_V(V), f_A(A_FBANK)]    (21)

where f_A and f_V are trained independently, following the steps introduced in Section 3.1 and Section 2.4, respectively.

The cMEASE model takes both A_LPS and cE_AV as inputs and outputs M:

M = σ(s_F([s_E(cE_AV), s_LPS(A_LPS)]))    (22)

We use the same optimization process as in Section 2.5 to minimize L_MSE, which is calculated by Equation (11).
Figure 6: Illustration of the MEASE model with different fusion stages of audio and visual embeddings.
To study the effect of the audio-visual fusion stage on enhancement performance, we design a MEASE variant that fuses the visual and audio embeddings at the i-th layer of the enhancement network, denoted as the Middle-i model, as shown in Figure 6. We vary N_E while keeping the sum of N_E and N_F fixed, and use the same stack to process the audio and visual embeddings, respectively:

R_{E_V} = s_E(E_V) = ConvBlock(··· ConvBlock(E_V))    [N_E = i blocks]    (23)

R_{E_A} = s_E(E_A) = ConvBlock(··· ConvBlock(E_A))    [N_E = i blocks]    (24)

R_LPS = s_LPS(A_LPS) = ConvBlock(··· ConvBlock(A_LPS))    [N_LPS blocks]    (25)

M = σ(s_F([R_{E_V}, R_{E_A}, R_LPS])) = σ(ConvBlock(··· ConvBlock([R_{E_V}, R_{E_A}, R_LPS])))    [N_F = 25 − i blocks]    (26)

where s_E(·) in Equation (23) shares the same parameters as that in Equation (24), and E_A and E_V are extracted using f_A and f_V trained independently. By modifying the value of i, we can make the fusion take place at different stages without changing the network structure.
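A sketch of this parameterization is shown below; the channel sizes, the simplified conv block, and the way the three streams are concatenated are our assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

def conv_block(ch):
    """Minimal 1D conv block stand-in (conv + ReLU + BN); the residual connection is omitted for brevity."""
    return nn.Sequential(nn.Conv1d(ch, ch, 5, padding=2), nn.ReLU(inplace=True), nn.BatchNorm1d(ch))

class MiddleIFusion(nn.Module):
    """Middle-i model: i blocks per embedding stream, 25 - i blocks after fusion (Eqs. 23-26)."""
    def __init__(self, i, emb_ch=256, lps_ch=201, n_lps=5):
        super().__init__()
        self.s_e = nn.Sequential(*[conv_block(emb_ch) for _ in range(i)])        # shared for E_V and E_A
        self.s_lps = nn.Sequential(*[conv_block(lps_ch) for _ in range(n_lps)])
        fused_ch = 2 * emb_ch + lps_ch
        self.s_f = nn.Sequential(*[conv_block(fused_ch) for _ in range(25 - i)],
                                 nn.Conv1d(fused_ch, lps_ch, 1), nn.Sigmoid())

    def forward(self, e_v, e_a, a_lps):          # all inputs: (batch, channels, T_A)
        r = torch.cat([self.s_e(e_v), self.s_e(e_a), self.s_lps(a_lps)], dim=1)
        return self.s_f(r)                        # estimated mask: (batch, lps_ch, T_A)

mask = MiddleIFusion(i=5)(torch.randn(1, 256, 100), torch.randn(1, 256, 100), torch.randn(1, 201, 100))
print(mask.shape)   # torch.Size([1, 201, 100])
```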
4. Experiments
To evaluate the performance of our proposed method, we created a simulated noisy speech dataset based on the TCD-TIMIT audio-visual corpus [53]. TCD-TIMIT consists of 59 volunteer speakers with around 98 videos each, as well as 3 lipspeakers who were specially trained to speak in a way that helps deaf people understand their visual speech. The speakers were recorded saying various sentences from the TIMIT corpus [54] using both front-facing and 30-degree cameras. The utterances of the 3 lipspeakers and the 30-degree videos were not used in this paper. To test robustness to the unseen speaker condition, we divided the videos and audios into a train-clean set consisting of 57 speakers (31 male and 26 female) and a test-clean set consisting of 2 speakers (1 male and 1 female) who were not in the train-clean set.

We chose the TCD-TIMIT dataset for two main reasons:

(1) TCD-TIMIT was recorded in a controlled environment and provides near-field signals collected by a microphone close to the mouth, which ensures that the utterances do not contain background noise. In contrast, large-scale in-the-wild audio-visual datasets, such as the BBC-Oxford Lip Reading Sentences 2 (LRS2) dataset [55] and the AVSpeech dataset [27], were collected from real-world sources using automated pipelines, and none of them were checked for background noise. (We manually listened to the test and validation sets of the LRS2 dataset and found that more than half of the sentences can clearly be perceived as noisy.) When testing an enhancement system, if the ground truth contains background noise, the metrics will be severely distorted and cannot properly measure the performance of the system.

(2) The utterances in the TCD-TIMIT dataset consist of varied phrases, making them more suitable for actual scenarios than the utterances drawn from a fixed set of phrases in the GRID dataset [56]. The TCD-TIMIT dataset also contains phonetic-level transcriptions, which provide labels for training the embedding extractor.

A total of 115 noise types, including the 100 noise types in [57] and 15 homemade noise types, were adopted for training to improve robustness to unseen noise types. The 5600 utterances from the train-clean set were corrupted with the above-mentioned 115 noise types at five SNR levels, i.e., 15 dB, 10 dB, 5 dB, 0 dB and −5 dB. Utterances from the train-clean set were also corrupted with 3 unseen noise types (Destroyer Operations, Factory2 and F-16 Cockpit) at the above-mentioned SNR levels to build a validation set. The 198 utterances from the test-clean set were used to construct a test set for each combination of 3 other unseen noise types (Destroyer Engine, Factory1 and Speech Babble) and the above SNR levels. All unseen noises were taken from the NOISEX-92 corpus [58]. The five SNR levels used in training were also adopted for validation and testing.

For audio preprocessing, we used F = 201 frequency bins for the LPS features and F_mel = 40 triangular filters for the FBANK features. Mean and variance normalizations were applied to the noisy LPS and FBANK vectors.
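As an illustration of the corruption procedure, a noisy mixture at a target SNR can be created as in the following sketch; the scaling convention, the function name and the 16 kHz sampling rate in the usage lines are our assumptions, not taken from the paper.

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Scale `noise` so that the clean/noise power ratio equals `snr_db`, then add it to `clean`."""
    if len(noise) < len(clean):                          # tile the noise if it is shorter than the utterance
        noise = np.tile(noise, int(np.ceil(len(clean) / len(noise))))
    noise = noise[:len(clean)]
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10.0)))
    return clean + scale * noise

# e.g. corrupt one (here synthetic) 16 kHz utterance at the five training SNRs
rng = np.random.default_rng(0)
clean = rng.standard_normal(16000 * 3).astype(np.float32)
noise = rng.standard_normal(16000 * 5).astype(np.float32)
mixtures = {snr: mix_at_snr(clean, noise, snr) for snr in (15, 10, 5, 0, -5)}
```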
For video preprocessing, each video clip was downsampled from 29.97 fps to 25 fps, so that T_A = 4 × T_V. For every video frame, 68 facial landmarks were extracted using the Dlib [59] implementation of the face landmark estimator described in [60]; we then cropped a lip-centered window of 98 × 98 pixels using the 20 lip landmarks among the 68 facial landmarks. The frames were converted to grayscale and normalized with respect to the overall mean and variance.
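A sketch of this cropping step with Dlib is given below; the predictor model path, the centering strategy, and the convention that lip landmarks are indices 48-67 of the 68-point scheme are our assumptions for illustration, not details taken from the paper.

```python
import cv2
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")  # hypothetical local path

def crop_lip_window(frame_bgr, size=98):
    """Return a size x size grayscale crop centered on the mouth, or None if no face is found."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    faces = detector(gray, 1)
    if not faces:
        return None
    shape = predictor(gray, faces[0])
    # In the common 68-point annotation scheme, points 48-67 are the 20 lip landmarks.
    lips = np.array([(shape.part(i).x, shape.part(i).y) for i in range(48, 68)])
    cx, cy = lips.mean(axis=0).astype(int)
    half = size // 2
    crop = gray[max(cy - half, 0):cy + half, max(cx - half, 0):cx + half]
    return cv2.resize(crop, (size, size))
```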
In these experiments, we mainly adopt the Perceptual Evaluation of Speech Quality (PESQ) [61] and Short-Time Objective Intelligibility (STOI) [62] to evaluate the models. Both metrics are commonly used to evaluate the performance of speech enhancement systems. PESQ, a speech quality estimator, is designed to predict the mean opinion score of a speech quality listening test for certain degradations. To show the improvement in speech intelligibility, we also calculate STOI. The STOI score typically lies between 0 and 1, and the PESQ score lies between −0.5 and 4.5. For both metrics, higher scores indicate better performance.
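For reference, both metrics are available in open-source Python packages; the snippet below shows one way to score an enhanced utterance against its clean reference, assuming 16 kHz audio and the third-party `pesq` and `pystoi` packages (the mode and settings may differ from those used in this paper).

```python
from pesq import pesq      # pip install pesq   (ITU-T P.862 implementation)
from pystoi import stoi    # pip install pystoi

def score(clean, enhanced, fs=16000):
    """Return (PESQ, STOI) for one utterance; higher is better for both metrics."""
    p = pesq(fs, clean, enhanced, 'wb')           # wideband mode for 16 kHz signals
    s = stoi(clean, enhanced, fs, extended=False)
    return p, s

# usage (with real 16 kHz waveforms loaded as 1-D float arrays):
# p, s = score(clean_waveform, enhanced_waveform)
```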
In Section 2, we proposed two VEASE models with different visual embeddings, i.e., VEASE-phone and VEASE-place. To compare their effectiveness against the baseline model, i.e., VEASE-word (LRW), on enhancement performance, a series of experiments was conducted for the unprocessed system (denoted as Noisy), NoEASE, VEASE-word (LRW), VEASE-phone and VEASE-place. We present the learning curves of the MSEs of NoEASE, VEASE-word (LRW), VEASE-phone and VEASE-place on the validation set in Figure 7. The corresponding evaluation metrics are shown in Table 2. We evaluate the average performance of the two measures at different SNRs across 3 unseen noise types.

Figure 7: A comparison of learning curves among NoEASE, VEASE-word (LRW), VEASE-phone and VEASE-place on the validation set.

Table 2: Average performance comparison of VEASE models with different visual embeddings on the test set at different SNRs, averaged over 3 unseen noise types.

                     PESQ                                   STOI (in %)
Model          -5 dB   0 dB   5 dB  10 dB  15 dB     -5 dB   0 dB   5 dB  10 dB  15 dB
Noisy           1.70   1.97   2.26   2.56   2.86     54.34  65.11  75.33  84.48  90.88
NoEASE          2.07   2.34   2.64   2.92   3.21     58.79  70.29  80.24  87.83  92.57
VEASE-word      2.16   2.45   2.72   2.99   3.25     66.26  75.11  82.57  88.75  92.98
VEASE-phone     2.14   2.42   2.69   2.96   3.23     66.29  74.89  82.22  88.45  92.79
VEASE-place     2.21   2.47   2.73   3.00   3.26     66.57  75.27  82.64  88.80  92.96

Based on Figure 7 and Table 2, we make the following observations.

(1) The learning curves indicate that all VEASE models consistently yield smaller MSEs on the validation set than NoEASE. This result implies that the visual embedding is useful for speech enhancement. As shown in Table 2, all VEASE models yield improvements in PESQ and STOI over NoEASE at all SNRs; in particular, the improvement is more significant at low SNRs.

(2) VEASE-word (LRW) consistently yields smaller MSE values but slower convergence than VEASE-phone. This implies that the visual embedding in VEASE-word (LRW) is more useful but slower to fit for speech enhancement. In VEASE-word (LRW), the visual embedding extractor is trained with a large amount of additional data, so it obtains better generalization ability and provides a visual embedding carrying more information. On the other hand, the data mismatch between embedding learning and the enhancement task brings information redundancy, which leads to slower convergence. This observation is consistent with the comparison of the objective evaluation metrics on the test set shown in Table 2: VEASE-phone does not perform better than VEASE-word (LRW) on any evaluation metric.

(3) VEASE-place clearly achieves better and faster convergence than VEASE-phone and VEASE-word (LRW). This implies that VEASE-place provides a more useful and quicker-to-fit visual embedding for speech enhancement. Comparing the evaluation metrics in Table 2, we also observe that VEASE-place not only yields remarkable gains over VEASE-phone across all evaluation metrics and all SNR levels, but also outperforms VEASE-word (LRW) in most cases, with the only exception of STOI at 15 dB SNR, where the results are still close. These results suggest that our proposed VEASE-place model achieves better generalization capability while reducing the mismatch between embedding learning and the enhancement task.

Overall, the high correlation between the articulation place label and the acoustic information in video is beneficial to the extraction of a visual embedding that is useful for speech enhancement, even without additional data. Therefore, we select the articulation place as the default classification target in all subsequent experiments and use VEASE to refer to VEASE-place in all subsequent sections.
Table 3: Average performance comparison of the NoEASE, VEASE, AEASE, cMEASE and MEASE models on the test set at different SNRs, averaged over 3 unseen noise types.

                     PESQ                                   STOI (in %)
Model          -5 dB   0 dB   5 dB  10 dB  15 dB     -5 dB   0 dB   5 dB  10 dB  15 dB
NoEASE          2.07   2.34   2.64   2.92   3.21     58.79  70.29  80.24  87.83  92.57
VEASE           2.21   2.47   2.73   3.00   3.26     66.57  75.27  82.64  88.80  92.96
AEASE           2.09   2.39   2.69   2.98   3.27     60.84  72.24  81.58  88.39  92.76
cMEASE          2.27   2.55   2.81   3.08   3.34     67.60  76.26  83.26  89.13  93.12
MEASE           2.29   2.59   2.88   3.16   3.42     68.96  77.64  84.43  89.99  93.64
In this section, the goal is to examine the effectiveness of the proposed MEASE model on enhancement performance and to obtain a better understanding of the contributions of its different parts. We present an average performance comparison of NoEASE, VEASE, AEASE, cMEASE and MEASE in Table 3.

Looking at the last row of Table 3, we observe that MEASE shows significant improvements over VEASE across all evaluation metrics, with larger gains at high SNRs. Comparing the results of VEASE with NoEASE, the improvement yielded by the visual embedding decreases as the SNR increases: for example, the PESQ of VEASE improves over NoEASE from 2.07 to 2.21 at −5 dB SNR, but only from 3.21 to 3.26 at 15 dB SNR. This observation is consistent with that in [41]. In contrast, MEASE shows stable improvements over NoEASE at high SNRs: for example, the PESQ of MEASE improves over NoEASE from 2.07 to 2.29 at −5 dB SNR and from 3.21 to 3.42 at 15 dB SNR. All these results indicate that MEASE is more robust to changes in noise level and yields better generalization capability than VEASE.

Table 3 also shows the results of AEASE. Comparing them with NoEASE, we observe that the improvement yielded by the audio embedding increases as the SNR grows: for example, the PESQ of AEASE improves over NoEASE from 2.07 to 2.09 at −5 dB SNR and from 3.21 to 3.27 at 15 dB SNR. This suggests that the complementarity between the audio and visual embeddings lies in the variation tendencies of metric improvement with respect to the SNR level. However, directly comparing AEASE and VEASE on the evaluation metrics in Table 3, we do not observe that AEASE performs better than VEASE at high SNRs, i.e., at 5, 10 and 15 dB, especially at 5 dB SNR.
Table 4: Average PESQ of different models on the test set at different SNRs and different articulation places, averaged over 3 unseen noise types.

                    -5 dB SNR                        0 dB SNR                         5 dB SNR
Place        NoEASE AEASE VEASE MEASE    NoEASE AEASE VEASE MEASE    NoEASE AEASE VEASE MEASE
Labial        1.28  1.38  1.58  1.76      1.57  1.75  1.81  2.06      2.05  2.18  2.23  2.50
Mid           1.54  1.68  1.86  2.02      2.03  2.21  2.29  2.45      2.58  2.72  2.73  2.96
High          1.38  1.52  1.65  1.81      1.79  1.95  1.99  2.17      2.28  2.39  2.42  2.62
Low           1.63  1.89  2.00  2.29      2.17  2.48  2.46  2.69      2.84  2.99  2.93  3.20
Retroflex     1.46  1.66  1.75  2.00      1.95  2.15  2.12  2.32      2.44  2.57  2.54  2.77
Coronal       1.59  1.74  1.80  1.93      1.92  2.07  2.05  2.23      2.30  2.39  2.35  2.56
Glottal       1.02  1.22  1.36  1.70      1.42  1.71  1.59  1.92      1.95  2.10  2.05  2.30
Velar         1.31  1.44  1.41  1.49      1.48  1.64  1.68  1.86      1.86  2.01  2.00  2.22
Dental        0.94  1.22  1.25  1.64      1.32  1.62  1.36  2.05      1.98  2.21  1.98  2.44
To further explore the complementarity between audio and visual embeddings, we present an average performance comparison over utterance segments belonging to different articulation places in Table 4. Because such utterance segments do not carry actual semantics, we only examine the average PESQ at different SNRs across 3 unseen noise types. Table 4 shows that VEASE and AEASE each play the major role for different articulation places at the same SNR level. Even at high SNRs, VEASE still yields larger improvements than AEASE for some articulation places: for example, at 5 dB SNR, VEASE's PESQ values are 2.23, 2.73 and 2.42, while AEASE's are 2.18, 2.72 and 2.39, for Labial, Mid and High, respectively. This result explains why AEASE does not outperform VEASE at high SNR levels. Relating this to the lip shapes belonging to different articulation places, as shown in Figure 2, we find that VEASE yields greater improvement at articulation places where the lip shape changes greatly, i.e., Labial, Mid and High, while AEASE shows the opposite trend. Overall, we can conclude that the complementarity between audio and visual embeddings lies in different SNR levels, as well as in different articulation places. More specifically, when the SNR level is low and the articulation place has high visual correlation, the visual embedding performs better, whereas the audio embedding is better for articulation places with low visual correlation at high SNR levels. Based on these observations, our proposed MEASE model takes advantage of both the visual and audio embeddings and achieves the best performance at all SNRs and for all articulation places.

The information intersection-based audio-visual fusion manner in the MEASE model is our other main contribution. From Table 3, we observe that MEASE consistently outperforms cMEASE over all SNR levels in terms of both measures, especially at high SNRs. This demonstrates that the information intersection-based audio-visual fusion method has better information integration capability for audio and visual embeddings than the channel-wise concatenation widely used in previous works.
One of the most significant differences between our method and previous methods is that the proposed MEASE model fuses the audio and visual modalities in the stage of embedding extractor training. This is an early fusion, while previous methods fuse the audio and visual modalities in the middle of the enhancement network, known as middle fusion. To verify the effectiveness of early fusion on enhancement performance, we designed the comparative study described in Section 3.4 and conducted a set of experiments using five different values of i, i.e., i = 5, ….

Figure 8: Average performance comparison among different audio-visual fusion stages for the (a) PESQ and (b) STOI measures at different SNRs, averaged over 3 unseen noise types.

5. Conclusion

In this study, we extend the previous audio-visual speech enhancement (AVSE) framework to embedding aware speech enhancement (EASE). We first propose a visual embedding to enhance speech, leveraging the high correlation between articulation place labels and the acoustic information in videos. Next, we propose a multi-modal audio-visual embedding obtained by fusing audio and visual embeddings in the stage of embedding extractor training, under the supervision of their information intersection at the articulation place label level.

Extensive experiments empirically validate that our proposed visual embedding consistently yields improvements over the conventional word-based approaches. Our proposed audio-visual embedding achieves even greater performance improvements by utilizing the complementarity of audio and visual embeddings in an information intersection-based way, with higher information integration capability and better speech enhancement performance through early fusion.

Our future work will focus on how to use unsupervised or self-supervised techniques to extract effective audio-visual embeddings, in order to achieve comparable or better enhancement performance than the current framework.
6. Acknowledgements
This work was supported by the National Natural Science Foundation of China under Grant No. 61671422.
References

[1] P. C. Loizou, Speech Enhancement: Theory and Practice, CRC Press, 2013.
[2] S. Boll, Suppression of acoustic noise in speech using spectral subtraction, IEEE Transactions on Acoustics, Speech, and Signal Processing 27 (2) (1979) 113–120.
[3] J. Lim, A. Oppenheim, All-pole modeling of degraded speech, IEEE Transactions on Acoustics, Speech, and Signal Processing 26 (3) (1978) 197–210.
[4] Y. Ephraim, D. Malah, Speech enhancement using a minimum mean-square error log-spectral amplitude estimator, IEEE Transactions on Acoustics, Speech, and Signal Processing 33 (2) (1985) 443–445.
[5] I. Cohen, B. Berdugo, Speech enhancement for non-stationary noise environments, Signal Processing 81 (11) (2001) 2403–2418.
[6] I. Cohen, Noise spectrum estimation in adverse environments: Improved minima controlled recursive averaging, IEEE Transactions on Speech and Audio Processing 11 (5) (2003) 466–475.
[7] Y. Xu, J. Du, L. Dai, C. Lee, A regression approach to speech enhancement based on deep neural networks, IEEE/ACM Transactions on Audio, Speech, and Language Processing 23 (1) (2015) 7–19.
[8] A. Narayanan, D. Wang, Ideal ratio mask estimation using deep neural networks for robust speech recognition, in: 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, 2013, pp. 7092–7096.
[9] D. Wang, J. Chen, Supervised speech separation based on deep learning: An overview, IEEE/ACM Transactions on Audio, Speech, and Language Processing 26 (10) (2018) 1702–1726.
[10] E. C. Cherry, Some experiments on the recognition of speech, with one and with two ears, The Journal of the Acoustical Society of America 25 (5) (1953) 975–979.
[11] H. McGurk, J. MacDonald, Hearing lips and seeing voices, Nature 264 (5588) (1976) 746–748.
[12] L. E. Bernstein, C. Benoît, For speech perception by humans or machines, three senses are better than one, in: Proceedings of the Fourth International Conference on Spoken Language Processing (ICSLP'96), Vol. 3, IEEE, 1996, pp. 1477–1480.
[13] A. MacLeod, Q. Summerfield, Quantifying the contribution of vision to speech perception in noise, British Journal of Audiology 21 (2) (1987) 131–141.
[14] D. W. Massaro, J. A. Simpson, Speech Perception by Ear and Eye: A Paradigm for Psychological Inquiry, Psychology Press, 2014.
[15] L. D. Rosenblum, Speech perception as a multimodal phenomenon, Current Directions in Psychological Science 17 (6) (2008) 405–409.
[16] L. Girin, G. Feng, J.-L. Schwartz, Noisy speech enhancement with filters estimated from the speaker's lips, in: Fourth European Conference on Speech Communication and Technology, 1995, pp. 1559–1562.
[17] L. Girin, J.-L. Schwartz, G. Feng, Audio-visual enhancement of speech in noise, The Journal of the Acoustical Society of America 109 (6) (2001) 3007–3020.
[18] J. W. Fisher III, T. Darrell, W. T. Freeman, P. A. Viola, Learning joint statistical models for audio-visual fusion and segregation, in: Advances in Neural Information Processing Systems, 2001, pp. 772–778.
[19] S. Deligne, G. Potamianos, C. Neti, Audio-visual speech enhancement with AVCDCN (audio-visual codebook dependent cepstral normalization), in: Sensor Array and Multichannel Signal Processing Workshop Proceedings, IEEE, 2002, pp. 68–71.
[20] R. Goecke, G. Potamianos, C. Neti, Noisy audio feature enhancement using audio-visual speech data, in: 2002 IEEE International Conference on Acoustics, Speech, and Signal Processing, Vol. 2, IEEE, 2002, pp. II-2025.
[21] J. R. Hershey, M. Casey, Audio-visual sound separation via hidden Markov models, in: Advances in Neural Information Processing Systems, 2002, pp. 1173–1180.
[22] A. H. Abdelaziz, S. Zeiler, D. Kolossa, Twin-HMM-based audio-visual speech enhancement, in: 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, IEEE, 2013, pp. 3726–3730.
[23] A. Gabbay, A. Ephrat, T. Halperin, S. Peleg, Seeing through noise: Visually driven speaker separation and enhancement, in: 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2018, pp. 3051–3055.
[24] A. Gabbay, A. Shamir, S. Peleg, Visual speech enhancement, in: Proc. Interspeech 2018, 2018, pp. 1170–1174.
[25] J.-C. Hou, S.-S. Wang, Y.-H. Lai, Y. Tsao, H.-W. Chang, H.-M. Wang, Audio-visual speech enhancement using multimodal deep convolutional neural networks, IEEE Transactions on Emerging Topics in Computational Intelligence 2 (2) (2018) 117–128.
[26] T. Afouras, J. S. Chung, A. Zisserman, The conversation: Deep audio-visual speech enhancement, in: Proc. Interspeech 2018, 2018, pp. 3244–3248.
[27] A. Ephrat, I. Mosseri, O. Lang, T. Dekel, K. Wilson, A. Hassidim, W. T. Freeman, M. Rubinstein, Looking to listen at the cocktail party: A speaker-independent audio-visual model for speech separation, ACM Transactions on Graphics 37 (4).
[28] E. Ideli, B. Sharpe, I. V. Bajić, R. G. Vaughan, Visually assisted time-domain speech enhancement, in: 2019 IEEE Global Conference on Signal and Information Processing (GlobalSIP), 2019, pp. 1–5.
[29] Y. Luo, N. Mesgarani, Conv-TasNet: Surpassing ideal time–frequency magnitude masking for speech separation, IEEE/ACM Transactions on Audio, Speech, and Language Processing 27 (8) (2019) 1256–1266.
[30] J. Wu, Y. Xu, S.-X. Zhang, L.-W. Chen, M. Yu, L. Xie, D. Yu, Time domain audio visual speech separation, in: 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), IEEE, 2019, pp. 667–673.
[31] C.-H. Lee, M. A. Clements, S. Dusan, E. Fosler-Lussier, K. Johnson, B.-H. Juang, L. R. Rabiner, An overview on automatic speech attribute transcription (ASAT), in: Eighth Annual Conference of the International Speech Communication Association, 2007, pp. 1825–1828.
[32] H. Li, B. Ma, C.-H. Lee, A vector space modeling approach to spoken language identification, IEEE Transactions on Audio, Speech, and Language Processing 15 (1) (2006) 271–284.
[33] J. Li, Y. Tsao, C.-H. Lee, A study on knowledge source integration for candidate rescoring in automatic speech recognition, in: Proceedings (ICASSP'05), IEEE International Conference on Acoustics, Speech, and Signal Processing, Vol. 1, IEEE, 2005, pp. I-837.
[34] S. M. Siniscalchi, J. Reed, T. Svendsen, C.-H. Lee, Universal attribute characterization of spoken languages for automatic spoken language recognition, Computer Speech & Language 27 (1) (2013) 209–227.
[35] W. Li, S. M. Siniscalchi, N. F. Chen, C.-H. Lee, Improving non-native mispronunciation detection and enriching diagnostic feedback with DNN-based speech attribute modeling, in: 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2016, pp. 6135–6139.
[36] A. Blum, T. Mitchell, Combining labeled and unlabeled data with co-training, in: Proceedings of the Eleventh Annual Conference on Computational Learning Theory, 1998, pp. 92–100.
[37] S. Dasgupta, M. L. Littman, D. A. McAllester, PAC generalization bounds for co-training, in: Advances in Neural Information Processing Systems, 2002, pp. 375–382.
[38] B. Leskes, The value of agreement, a new boosting algorithm, in: International Conference on Computational Learning Theory, Springer, 2005, pp. 95–110.
[39] D. D. Lewis, Naive (Bayes) at forty: The independence assumption in information retrieval, in: European Conference on Machine Learning, Springer, 1998, pp. 4–15.
[40] X. Sun, Y. Xu, P. Cao, Y. Kong, L. Hu, S. Zhang, Y. Wang, TCGM: An information-theoretic framework for semi-supervised multi-modality learning, arXiv preprint arXiv:2007.06793.
[41] W. Wang, C. Xing, D. Wang, X. Chen, F. Sun, A robust audio-visual speech enhancement model, in: ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2020, pp. 7529–7533.
[42] M. L. Iuzzolino, K. Koishida, AV(SE)²: Audio-visual squeeze-excite speech enhancement, in: ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2020, pp. 7539–7543.
[43] T. Stafylakis, G. Tzimiropoulos, Combining residual networks with LSTMs for lipreading, in: Proc. Interspeech 2017, 2017, pp. 3652–3656. doi:10.21437/Interspeech.2017-85.