Multimodal Attention Fusion for Target Speaker Extraction
Hiroshi Sato, Tsubasa Ochiai, Keisuke Kinoshita, Marc Delcroix, Tomohiro Nakatani, Shoko Araki
NTT Corporation, Japan
ABSTRACT
Target speaker extraction, which aims at extracting a target speaker's voice from a mixture of voices using audio, visual, or locational clues, has received much interest. Recently, audio-visual target speaker extraction has been proposed, which extracts target speech by using complementary audio and visual clues. Although audio-visual target speaker extraction offers more stable performance than single-modality methods on simulated data, its adaptation to realistic situations has not been fully explored, nor has it been evaluated on real recorded mixtures. One of the major issues in handling realistic situations is how to make the system robust to clue corruption, because in real recordings both clues may not be equally reliable; e.g., visual clues may be affected by occlusions. In this work, we propose a novel attention mechanism for multi-modal fusion and its training methods that enable the system to effectively capture the reliability of the clues and weight the more reliable ones. Our proposals improve the signal-to-distortion ratio (SDR) by 1.0 dB over conventional fusion mechanisms on simulated data. Moreover, we also record an audio-visual dataset of simultaneous speech with realistic visual clue corruption and show that audio-visual target speaker extraction with our proposals works successfully on real data.
Index Terms — audio-visual, target speaker extraction, multi-modal fusion
1. INTRODUCTION
Although recent advances in machine learning techniques have greatly improved the performance of automatic speech recognition, it remains difficult to recognize speech when it overlaps other people's voices, a problem that frequently occurs in daily life. Most approaches proposed to tackle this problem focus on blind source separation; the mixture of speech signals is separated into each of its sources. Recently, deep learning-based separation approaches such as deep clustering [1], permutation invariant training [2], TasNet [3], etc. have received increased attention.

Target speaker extraction is another promising approach to deal with overlapping speech; it uses auxiliary information about the target to extract only the target speaker's voice. Hereafter, the auxiliary information is referred to as target speaker clues. There are several ways to realize target speaker extraction depending on the nature of the clues used. Some approaches utilize audio clues provided by enrollment utterances of the target speaker [4-7]. Extraction based on these audio clues has shown promising results, but the performance tends to degrade when the voice characteristics of the target and interference are similar. Other approaches utilize video to capture the face or mouth of the target speaker as visual clues [8-11]. Visual clue-based extraction has shown robust performance for mixtures of similar voice characteristics, but it is susceptible to the occlusions caused by the speaker's hands, masks, microphones, etc. that occur in natural dialogues.

To overcome the shortcomings of the single-modality approaches, audio-visual target speaker extraction has been investigated recently [12-14]. In [12], the target speaker information extracted from audio and video clues is summed to create multi-modal embeddings, which are then used as auxiliary information to predict a time-frequency mask that extracts the target speaker out of the mixture. Simultaneously to [12], the multi-modal target speaker extraction method called "audio-visual SpeakerBeam" [13] was proposed, which uses source-target attention [15] for fusing target speaker information extracted from audio and video clues instead of summing them up.

Audio-visual target speaker extraction methods have been tested on simulated data and shown to offer more robust performance than single-modality methods, thanks to the use of the complementary clues. However, their application to more realistic situations has not been fully explored. One of the important aspects to be considered is how to deal with clue corruption. For example, visual clues are not accessible if the speaker's hands or microphones are blocking his or her face and mouth. Audio clues recorded under noisy conditions are also not reliable. Such corruption of clues is often present in real recordings. Besides, the recordings could be corrupted by noise and reverberation, which could also hinder the extraction performance. The results in [12] indicated that existing methods could handle intermittent occlusion of visual clues to some extent when reliable audio clues were available. However, it remains unclear whether the negative effect of the unreliable visual clues on the overall performance could be well mitigated, i.e., whether the performance of the multi-modal system with clean audio clues and unreliable visual clues was not significantly lower than the performance of a system relying on the clean audio clues only.
The modality attention mechanism was introduced in [13] with the expectation of better mitigating the effect of unreliable clues, but it was only tested on conditions where both clues were reliable, and thus its effectiveness with corrupted clues has not been verified.

In this work, we focus on dealing with such clue corruption towards more realistic applications. To deal with the corruption, it is desirable to utilize the more informative clues while selectively ignoring unreliable ones at every time instance, according to their reliability. We realized that the simple attention mechanism introduced in [13] was insufficient to achieve such a reliability-based clue selection due to an imbalance in the norm of the contributions from the different modalities, which hindered the desirable behaviors of attention mechanisms. Therefore, we propose a novel modality attention approach that introduces a normalization mechanism to correct the norm imbalance problem of the conventional method. Besides, we also propose two multi-task training methods that introduce a loss term on the attention-based modality fusion module to explicitly consider the reliability of the clues in the training phase, helping the attention module become more fully aware of clue reliability.

Another important aspect of this work is the evaluation of multi-modal target speaker extraction methods on real recordings to confirm their potential for realistic applications. Unlike previous studies that used only simulated mixtures, we created an audio-visual dataset by recording the simultaneous speech of two speakers. The recordings include video of the speakers' faces with natural occlusions by hand motions for evaluating the ability to deal with realistic visual-clue corruption. Since the amount of real recorded data is relatively small, we cannot train a system using only real recordings. Instead, we perform model adaptation to adapt a base model trained on a large simulated dataset to real recording conditions. With such a scheme, we show that audio-visual target speaker extraction can deal with real recorded mixtures.

The contribution of this work is twofold.
1. We propose a novel normalized modality attention mechanism and two multi-task training methods in order to enable the attention mechanism to select clues based on their reliability. Our proposals outperform previous attention-based and summation-based fusion and maintain the performance even when either of the clues is corrupted.
2. We conduct evaluations on real recorded mixtures with realistic occlusions and show that audio-visual target speaker extraction performs well in real conditions. We investigate model adaptation techniques to adapt an audio-visual speech extraction model trained on a large amount of simulated data to the real recording condition, for which a limited amount of data is available.
2. CONVENTIONAL METHODS

2.1. General Framework
Fig. 1 shows the general framework of the audio-visual target speaker extraction network called audio-visual SpeakerBeam, proposed in [13]. SpeakerBeam extracts the waveform signal of the target speaker, $\hat{x}_s \in \mathbb{R}^T$, from the mixture $y \in \mathbb{R}^T$ as follows:

$$\hat{x}_s = \mathrm{SpeakerBeam}(y, c^A_s, C^V_s), \quad (1)$$

where $s$ is the index of the target speaker, $T$ is the number of samples of the mixture, $c^A_s$ corresponds to the audio clue provided by enrollment utterances of the target speaker, and $C^V_s$ to the visual clue derived from a video capturing the face region of the target speaker.

The network $\mathrm{SpeakerBeam}(\cdot)$ is composed of two parts: a target extraction network and a clue extraction network. The target extraction network adopts a structure inspired by TasNet [16, 17], which conducts time-domain separation and has outperformed time-frequency mask-based approaches. TasNet consists of three parts: encoder, separator, and decoder. The encoder converts the mixture waveform $y$ into optimized representations $Y$. The separator calculates masks $M_s$ from $Y$, which are subsequently applied to $Y$ by an element-wise product operation. The separator is composed of several stacked 1-D convolutional blocks followed by a 1x1 convolution layer and a nonlinear activation function. The masked representations are then converted into the extracted audio in raw waveform, $\hat{x}_s \in \mathbb{R}^T$, by the decoder.

While TasNet was developed for speech separation, here we aim at extracting only the speech of the target speaker. To guide SpeakerBeam toward extracting only the target speaker, the processing of the network is conditioned on the audio-visual target speaker clues $Z^{AV}_s$, which are computed by the clue extraction network from the audio and visual clues (see 2.2 for the details of the clue extraction network and the computation of $Z^{AV}_s$). The audio-visual target speaker clues $Z^{AV}_s \in \mathbb{R}^{T_e \times D_e}$ are incorporated into the TasNet architecture after $N_L$ convolutional blocks of the separator, by an element-wise product operation $\odot$ as follows [17]:

$$X'_s = Y' \odot Z^{AV}_s, \quad (2)$$

where $Y' \in \mathbb{R}^{T_e \times D_e}$ is an intermediate representation of the mixture, and $X'_s \in \mathbb{R}^{T_e \times D_e}$ is that of the target speaker. $T_e$ and $D_e$ are the number of frames and the dimension of the hidden representation, respectively. In previous studies such as [7], it was found that SpeakerBeam works effectively when the clue information is provided to the extraction network with the element-wise product as in Eq. (2).

2.2. Clue Extraction Network

The audio-visual target speaker clues $Z^{AV}_s$ are calculated from the audio clue $c^A_s$, the visual clue $C^V_s$, and the intermediate representation within the target extraction network $Y'$ as follows:

$$Z^{AV}_s = \mathrm{ClueExtraction}(c^A_s, C^V_s, Y'), \quad (3)$$

where $\mathrm{ClueExtraction}(\cdot)$ is a neural network that consists of two sub-networks, hereafter called cluenets, which compute embeddings associated with the audio and visual clues, and a modality fusion block that combines these embeddings into the audio-visual target speaker clues $Z^{AV}_s$.

2.2.1. Audio cluenet

For audio clues, the audio cluenet computes the audio embedding $z^A_s \in \mathbb{R}^{D_e}$ from the raw-wave audio clue $c^A_s \in \mathbb{R}^{T_{a,s}}$ as follows:

$$z^A_s = \mathrm{Average}(\mathrm{conv}(\mathrm{Encoder}(c^A_s))), \quad (4)$$

where $T_{a,s}$ denotes the number of samples of the audio clue for speaker $s$. The audio $\mathrm{Encoder}(\cdot)$ has an identical structure to the TasNet encoder and projects the raw-wave clue into a $D_e$-dimensional embedding space to obtain the two-dimensional $C^A_s \in \mathbb{R}^{T'_{a,s} \times D_e}$, where $T'_{a,s}$ denotes the number of frames of the hidden representation of the audio clue for speaker $s$. $C^A_s$ is subsequently fed into several convolutional layers ($\mathrm{conv}(\cdot)$ in Eq. (4)) to extract features, and then averaged over time ($\mathrm{Average}(\cdot)$ in Eq. (4)) to obtain the time-invariant, thus one-dimensional, audio clue $z^A_s$.
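To make Eqs. (2) and (4) concrete, the following minimal PyTorch-style sketch shows how an enrollment waveform could be mapped to a time-invariant audio clue embedding and how such a clue could condition the separator by an element-wise product. The module name `AudioClueNet`, the layer counts, and the kernel sizes are illustrative assumptions, not the authors' exact configuration.

```python
# Minimal sketch of the audio cluenet (Eq. (4)) and the clue conditioning (Eq. (2)).
# Module/parameter names are illustrative assumptions.
import torch
import torch.nn as nn

class AudioClueNet(nn.Module):
    def __init__(self, d_e=256, enc_kernel=20, enc_stride=10):
        super().__init__()
        # Encoder(.): a 1-D conv with the same role as the TasNet encoder,
        # mapping the raw enrollment waveform to a D_e-dimensional frame sequence.
        self.encoder = nn.Conv1d(1, d_e, kernel_size=enc_kernel, stride=enc_stride)
        # conv(.): a few convolutional layers over time to extract features.
        self.conv = nn.Sequential(
            nn.Conv1d(d_e, d_e, kernel_size=7, padding=3), nn.ReLU(),
            nn.Conv1d(d_e, d_e, kernel_size=5, padding=2), nn.ReLU(),
        )

    def forward(self, clue_wave):                  # clue_wave: (batch, samples)
        c = self.encoder(clue_wave.unsqueeze(1))   # (batch, D_e, T'_a)
        c = self.conv(c)
        return c.mean(dim=-1)                      # Average(.): time-invariant z^A_s, (batch, D_e)

def condition_on_clue(mixture_repr, clue_embedding):
    """Eq. (2): element-wise product between the intermediate mixture
    representation Y' (batch, T_e, D_e) and the target speaker clue."""
    # A time-invariant clue is broadcast over all T_e frames.
    return mixture_repr * clue_embedding.unsqueeze(1)
```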
2.2.2. Visual cluenet

The visual embedding $Z^V_s \in \mathbb{R}^{T_e \times D_e}$ is calculated with the visual cluenet from the visual feature $C^V_s \in \mathbb{R}^{T_v \times D_v}$ as follows:

$$Z^V_s = \mathrm{Upsample}(\mathrm{conv}(C^V_s)), \quad (5)$$

where $T_v$ and $D_v$ are the number of frames and the dimension of a visual feature. We used the pre-trained face recognition model FaceNet [18, 19] to extract the visual features $C^V_s$, following the procedure in [13]. They are then transformed using several convolutional layers $\mathrm{conv}(\cdot)$. Contrary to the audio clue, the visual clue $Z^V_s$ is time-variant. $\mathrm{Upsample}(\cdot)$ is applied in order to align the number of frames of the visual embeddings with $Y'$.

Fig. 1. Overview of audio-visual SpeakerBeam.

2.2.3. Modality fusion

The modality fusion block combines the single-modality embeddings $z^A_s$ and $Z^V_s$ into the audio-visual target speaker clues $Z^{AV}_s$ as follows:

$$Z^{AV}_s = \mathrm{ModalityFusion}(z^A_s, Z^V_s, Z^M_s), \quad (6)$$

where $Z^M_s$ represents an embedding of the input mixture, which is calculated from an intermediate representation within the target extraction network $Y'$ using several convolutional layers.

Several variants have been proposed for the modality fusion block in Fig. 1: attention fusion [13] and summation fusion [12] are examples. We first describe attention fusion, as it can be seen as a general case from which summation fusion can be derived as a special case.

Attention fusion is designed such that the network can discern the reliability of each clue and selectively utilize the more reliable one at each time instance. Fig. 2(a) shows a conceptual diagram of the conventional attention fusion approach.

Fig. 2. Diagram of modality fusion methods: (a) conventional attention, (b) proposed normalized attention, (c) attention guided training, and (d) clue condition aware training. Systems (c) and (d) can be combined with (b) normalized attention, which is depicted by the dashed normalize and scaling blocks.

In attention fusion, the embeddings from each modality $\psi \in (A, V)$ are averaged with weight $\hat{a}^{\psi}_{st}$ at each time frame $t$ as follows:

$$z^{AV}_{st} = \sum_{\psi \in (A,V)} \hat{a}^{\psi}_{st} z^{\psi}_{st} \quad (t = 1, \ldots, T_e), \quad (7)$$

where $z^{\psi}_{st}$ is the embedding of the audio or visual clue at time frame $t = 1, \ldots, T_e$. Since the audio embedding is time-invariant, $z^A_{st}$ is the same for each frame and is expressed as $z^A_{st} = z^A_s \in \mathbb{R}^{D_e}$ for every time frame. The visual embedding $z^V_{st} \in \mathbb{R}^{D_e}$ is the $t$-th time frame of $Z^V_s = \{z^V_{st}; t = 1, \ldots, T_e\}$. Following the same notation, $z^{AV}_{st}$ is the $t$-th time frame element of $Z^{AV}_s = \{z^{AV}_{st}; t = 1, \ldots, T_e\}$. The attention value $\hat{a}_{st} = [\hat{a}^A_{st}, \hat{a}^V_{st}]$ is determined by the attention calculation block in Fig. 2 from the audio, visual, and mixture embeddings for each time frame $t$. For calculating the attention value $\hat{a}^{\psi}_{st}$, the additive attention proposed in [15] was adopted. $\hat{a}^{\psi}_{st}$ is calculated as follows:

$$\hat{a}^{\psi}_{st} = \frac{\exp(\epsilon e^{\psi}_{st})}{\sum_{\psi \in (A,V)} \exp(\epsilon e^{\psi}_{st})}, \quad (8)$$

$$e^{\psi}_{st} = w \tanh(W z^M_t + V z^{\psi}_{st} + b), \quad (9)$$

where $z^M_t \in \mathbb{R}^{D_e}$ is the $t$-th time frame element of $Z^M = \{z^M_t; t = 1, \ldots, T_e\}$.
Here $w$, $W$, $V$, and $b$ are learnable parameters, and $\epsilon$ is a sharpening factor [15].

Summation fusion, proposed in [12], is another way to fuse the embeddings from the audio and visual modalities. It can be seen as a special case of attention fusion where the attention weights are the same for each modality regardless of the clues, i.e., $\hat{a}_{st} = [0.5, 0.5]$. In addition, separation based on only audio or only visual clues are also special cases, with $\hat{a}_{st} = [1.0, 0.0]$ and $\hat{a}_{st} = [0.0, 1.0]$, respectively.
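The additive attention of Eqs. (7)-(9) can be sketched as follows in PyTorch. The module name `AttentionFusion`, the tensor layouts, and the realization of the learnable $w$, $W$, $V$, and $b$ as linear layers are assumptions for illustration; the sharpening factor $\epsilon$ is applied before the softmax over the two modalities.

```python
# Minimal sketch of conventional attention fusion, Eqs. (7)-(9).
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    def __init__(self, d_e=256, epsilon=2.0):
        super().__init__()
        self.W = nn.Linear(d_e, d_e, bias=False)   # applied to the mixture embedding z^M_t
        self.V = nn.Linear(d_e, d_e, bias=True)    # applied to each clue embedding z^psi_st (bias plays the role of b)
        self.w = nn.Linear(d_e, 1, bias=False)     # projection vector w
        self.epsilon = epsilon                     # sharpening factor

    def forward(self, z_audio, z_visual, z_mix):
        # z_audio: (batch, D_e) time-invariant; z_visual, z_mix: (batch, T_e, D_e)
        z_audio = z_audio.unsqueeze(1).expand_as(z_visual)
        clues = torch.stack([z_audio, z_visual], dim=2)            # (batch, T_e, 2, D_e)
        # Eq. (9): e^psi_st = w tanh(W z^M_t + V z^psi_st + b)
        scores = self.w(torch.tanh(self.W(z_mix).unsqueeze(2) + self.V(clues))).squeeze(-1)
        # Eq. (8): softmax over the two modalities with sharpening factor epsilon
        att = torch.softmax(self.epsilon * scores, dim=2)          # (batch, T_e, 2)
        # Eq. (7): attention-weighted sum of the clue embeddings per time frame
        z_av = (att.unsqueeze(-1) * clues).sum(dim=2)              # (batch, T_e, D_e)
        return z_av, att
```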
3. PROPOSED METHODS
In this work, we focus on how to deal with clue corruption. To this end, the attention mechanism should have the ability to select clues based on their reliability. In order to help the attention mechanism acquire this behavior, we propose 1) a normalized attention fusion method and 2) two multi-task training methods, which are described in the following sections.

3.1. Normalized Attention Fusion
Although the attention mechanism was originally introduced to handle clue corruption by attending to the more reliable clues [13], our preliminary experiments suggested that the conventional attention mechanism explained in 2.2.3 did not learn such a behavior. As a possible cause of this issue, we found that the embeddings $z^{\psi}_{st}$ for the audio and video clues tended to take very different norms during training. In this situation, a small fluctuation in the attention weights $\hat{a}_{st}$ causes a large displacement of the attention output $z^{AV}_{st}$ and hinders attention controllability, resulting in poor extraction performance. In addition, the heavy bias in the attention values makes them difficult to interpret.

To address this issue, we propose the normalized attention mechanism shown in Fig. 2(b). Our approach is to normalize each input clue by its norm in each time frame before integrating them, as follows:

$$z^{\psi\prime}_{st} = z^{\psi}_{st} / |z^{\psi}_{st}|. \quad (10)$$

The normalized clues $z^{\psi\prime}_{st}$ are integrated as in Eq. (7) to obtain $Z^{AV\prime}_s$. $Z^{AV\prime}_s$ is then multiplied by a constant $l$ to reflect the original norm as follows:

$$Z^{AV}_s = l\, Z^{AV\prime}_s, \quad (11)$$

where $l$ is calculated as $l = 1 / \big( \sum_{\psi \in (A,V)} 1/|z^{\psi}_{st}| \big)$. In addition to resolving the norm imbalance, normalized attention makes the value of the inter-modal attention interpretable. For example, $a^A_{st} = a^V_{st} = 0.5$ means both clues are almost equally reliable, and $a^A_{st} \gg a^V_{st}$ indicates that the audio clue is more reliable than the visual clue.
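A minimal sketch of the normalized attention of Eqs. (10)-(11) follows, assuming the attention weights have already been computed as in the previous sketch; the rescaling constant follows the reconstruction $l = 1/\sum_{\psi} 1/|z^{\psi}_{st}|$ given above, and the function name and the small epsilon guard are illustrative.

```python
# Minimal sketch of normalized attention fusion, Eqs. (10)-(11).
import torch

def normalized_attention_fusion(z_audio, z_visual, att, eps=1e-8):
    # z_audio, z_visual: (batch, T_e, D_e); att: (batch, T_e, 2)
    clues = torch.stack([z_audio, z_visual], dim=2)            # (batch, T_e, 2, D_e)
    norms = clues.norm(dim=-1, keepdim=True).clamp_min(eps)    # |z^psi_st| per frame and modality
    normalized = clues / norms                                 # Eq. (10)
    z_av_prime = (att.unsqueeze(-1) * normalized).sum(dim=2)   # Eq. (7) applied to normalized clues
    scale = 1.0 / (1.0 / norms).sum(dim=2)                     # Eq. (11): l = 1 / sum_psi 1/|z^psi_st|
    return scale * z_av_prime                                  # fused clue with restored norm
```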
3.2. Attention Guided Training

In order to train the attention mechanism more efficiently, we also introduce a multi-task loss that guides the attention explicitly by directly giving the ideal attention value according to the clue condition, in addition to the usual source-to-distortion ratio loss, as shown in Fig. 2(c). The overall loss function $\mathcal{L}$ with the attention guided training loss is expressed as follows:

$$\mathcal{L} = \mathcal{L}_{\mathrm{SDR}}(\hat{x}_s, x_s) + \alpha \sum_{\psi \in (A,V)} \frac{\sum_{t=1}^{T_e} \mathcal{L}_{\mathrm{MSE}}(\hat{a}^{\psi}_{st}, a^{\psi}_{st})}{T_e}, \quad (12)$$

where $x_s$ is the reference signal and $a^{\psi}_{st}$ is the oracle attention value. $\mathcal{L}_{\mathrm{SDR}}(\cdot)$ and $\mathcal{L}_{\mathrm{MSE}}(\cdot)$ stand for the scale-invariant source-to-distortion ratio [16] and the mean square error loss, respectively. $\alpha$ is a multi-task weight that balances the effects of the loss terms.

Since it is unclear which attention value is appropriate in every clue condition, we calculate and append the attention prediction loss only in the following extreme cases: 1) one of the clues is completely unreliable and the other is completely clean, and 2) both clues are completely clean. In the former case, we set the oracle attention to $\hat{a}_{st} = [1.0, 0.0]$ or $[0.0, 1.0]$. In the latter case, we adopt $\hat{a}_{st} = [0.5, 0.5]$ to improve interpretability. When the oracle attention is not defined, we only use the first term of Eq. (12). Although attention guided training can be used with normalized attention, this training method by itself would mitigate the norm imbalance problem.

3.3. Clue Condition Aware Training

In attention guided training, oracle attention values are explicitly given, but their appropriate values are not necessarily evident. In order to make the attention module aware of clue reliability without explicitly controlling its behavior, we propose clue condition aware training, as shown in Fig. 2(d). In order for the embeddings $z^{\psi}_{st}$ to contain clue reliability information, we introduce clue condition prediction networks that predict the clue reliabilities $\hat{r}^A_s$ and $\hat{r}^V_{st}$ from each embedding. The clue condition prediction network consists of several linear layers. The overall loss function with the clue condition aware training loss is expressed as follows:

$$\mathcal{L} = \mathcal{L}_{\mathrm{SDR}}(\hat{x}_s, x_s) + \beta \left[ \mathcal{L}_{\mathrm{MSE}}(\hat{r}^A_s, r^A_s) + \frac{\sum_{t=1}^{T_e} \mathcal{L}_{\mathrm{MSE}}(\hat{r}^V_{st}, r^V_{st})}{T_e} \right], \quad (13)$$

where $r^A_s$ and $r^V_{st}$ represent the oracle clue reliabilities (see 4.1 for the definition). $\beta$ is a multi-task weight. Clue condition aware training complements normalized attention.
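The two multi-task objectives of Eqs. (12)-(13) could be implemented along the following lines. The SI-SDR term follows the standard scale-invariant SDR definition of [16]; the MSE terms average over batch, time, and modality (differing from the per-modality sum in the paper only by a constant factor); and the function names, tensor layouts, and default weights ($\alpha = 10$, $\beta = 5$, as used later in the experiments) are assumptions.

```python
# Minimal sketch of the multi-task losses, Eqs. (12)-(13).
import torch
import torch.nn.functional as F

def si_sdr_loss(est, ref, eps=1e-8):
    """Negative scale-invariant SDR, averaged over the batch."""
    est = est - est.mean(dim=-1, keepdim=True)
    ref = ref - ref.mean(dim=-1, keepdim=True)
    proj = (torch.sum(est * ref, dim=-1, keepdim=True) /
            (torch.sum(ref ** 2, dim=-1, keepdim=True) + eps)) * ref
    noise = est - proj
    si_sdr = 10 * torch.log10(torch.sum(proj ** 2, dim=-1) /
                              (torch.sum(noise ** 2, dim=-1) + eps) + eps)
    return -si_sdr.mean()

def attention_guided_loss(est, ref, att, oracle_att, alpha=10.0):
    """Eq. (12): SDR loss plus MSE between predicted and oracle attention.
    The MSE term is used only when oracle attention is defined
    (att, oracle_att: (batch, T_e, 2))."""
    loss = si_sdr_loss(est, ref)
    if oracle_att is not None:
        loss = loss + alpha * F.mse_loss(att, oracle_att)
    return loss

def clue_condition_aware_loss(est, ref, r_a_hat, r_a, r_v_hat, r_v, beta=5.0):
    """Eq. (13): SDR loss plus MSE between predicted and oracle clue
    reliabilities for the audio clue (utterance level) and the visual clue
    (frame level)."""
    return (si_sdr_loss(est, ref)
            + beta * (F.mse_loss(r_a_hat, r_a) + F.mse_loss(r_v_hat, r_v)))
```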
4. EXPERIMENTS

4.1. Datasets
We created a simulated dataset from the Lip Reading Sentences 3 (LRS3-TED) audio-visual corpus. For training, we prepared a "clean-clue set," where both clues were unobstructed and intact, and a "corrupted-clue set," where one of the clues was corrupted to various degrees. The clean-clue set was prepared using basically the same procedure as in [13]. For the corrupted-clue set, we used the same mixture and clue pairs as the clean-clue set but randomly corrupted one of the clues. Half of the visual-clue-corrupted data were rendered totally unreliable by setting an artificial mask that occluded the entire region of the video, and the other half were made partially unreliable with random-sized rectangular masks from 40x30 pixels to 140x105 pixels within the 188x188-pixel face region, centered on the mouth position as detected by [18, 19]; see Fig. 3.

Fig. 3. Samples of visual clue occlusion. The photos correspond to 40x30, 80x60, and 140x105 pixel-sized masks.

Similarly, one half of the audio-clue-corrupted data were made totally unreliable by adding white noise to yield a signal-to-noise ratio (SNR) of -20 dB, and the other half were made partially unreliable with the SNR varying from -20 dB up to 20 dB. In attention guided training, we calculated the multi-task loss only for the totally unreliable clues. For clue condition aware training, we define the oracle clue reliability of each visual or audio clue as $r^V_{st} = 1 - l_s / (188 \times 4)$, where $l_s$ denotes the perimeter of the mask in pixels, and $r^A_s = (\mathrm{SNR}_s + 20)/40$. We conducted training with an "augmented-clue set" that consisted of the clean-clue set and the corrupted-clue set.

For the evaluation, we similarly prepared clean and corrupted datasets with several degrees of corruption. Although the visual clue occlusions were time-invariant in the training phase, i.e., the same mask size for every frame, we also evaluated the performance using intermittent masks. We randomly selected multiple consecutive frames and applied masking. Half of the frames were selected and occluded by full-sized masks, so the visual clues were completely unreliable half of the time.

In order to verify whether the attention proposals can deal with real recorded mixtures with natural time-varying occlusions, we created an audio-visual dataset of simultaneous speech by two speakers. The dataset was recorded under three recording conditions: 1) speakers do not occlude their face with their hands (without occlusion), 2) speakers occlude their face during speech (full occlusion), and 3) speakers move their hands freely so that occlusion occurs intermittently (intermittent occlusion). Eight pairs of speakers participated in the recording; four pairs had the same gender and the others had different genders. Each pair spoke 60 utterances in Japanese in total, 20 for each recording condition.
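For illustration, the following sketch mimics the clue-corruption procedure used to build the corrupted-clue set described above: white noise added to the enrollment audio at a target SNR, and a rectangular occlusion applied to the 188x188 face crop centered on the mouth. The fill value of the artificial mask, the function names, and the mouth-position argument are assumptions.

```python
# NumPy sketch of simulated clue corruption (audio SNR degradation and visual occlusion).
import numpy as np

def corrupt_audio_clue(clue_wave, snr_db, rng=None):
    """Add white noise so that the clue-to-noise ratio equals snr_db
    (from -20 dB, totally unreliable, up to 20 dB)."""
    if rng is None:
        rng = np.random.default_rng()
    noise = rng.standard_normal(clue_wave.shape)
    clue_power = np.mean(clue_wave ** 2)
    noise_power = np.mean(noise ** 2)
    scale = np.sqrt(clue_power / (noise_power * 10 ** (snr_db / 10) + 1e-12))
    return clue_wave + scale * noise

def occlude_visual_clue(frames, mouth_xy, mask_w, mask_h):
    """Blank out a mask_w x mask_h rectangle centered on the mouth position
    in every frame; frames: (T_v, 188, 188, 3), masks up to 188x188."""
    frames = frames.copy()
    x, y = mouth_xy
    x0, x1 = max(0, x - mask_w // 2), min(frames.shape[2], x + mask_w // 2)
    y0, y1 = max(0, y - mask_h // 2), min(frames.shape[1], y + mask_h // 2)
    frames[:, y0:y1, x0:x1, :] = 0  # fill value of the mask is an assumption
    return frames
```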
Since the model trained on LRS3-TED had a large domain mismatch with the real recordings, e.g., language, lighting conditions, speaking style, etc., we conducted domain adaptation with real recordings. In addition to the real recorded mixtures, we also recorded the same amount of non-mixed speech from eight speakers who did not take part in the recordings of the real mixtures. Based on the non-mixed speech, we generated 10,000 mixtures for the adaptation set, which contains the same three occlusion conditions as the real recorded mixtures.

As multi-task training, we conducted attention guided training for the real dataset by assuming that both the audio and visual clues were always clean in the "without occlusion" condition, and that the audio clues were always clean while the visual clues were always unreliable in the "full occlusion" condition. We defined the oracle attention value as $\hat{a}_{st} = [0.5, 0.5]$ and $\hat{a}_{st} = [1.0, 0.0]$ for the "without occlusion" and "full occlusion" conditions, respectively. We did not use the multi-task loss term for the "intermittent occlusion" condition, thus setting $\alpha = 0$ for data recorded in this condition.

4.2. Experimental Settings

For all experiments, the network configurations were the same except for the modality fusion block. We set the hyperparameters of the target extraction network as follows:
$N = 256$, $L = 20$, $B = 256$, $R = 4$, $X = 8$, $H = 512$, and $P = 3$, following the notation in [16], and $N_L = 1$. All the cluenets had the same structure: three convolutional layers over the time axis (256 channels; 7x1, 5x1, and 5x1 kernel sizes; 1x1 stride) with layer normalization, followed by a linear layer (256 channels). The clue condition prediction networks consisted of three linear layers (256 channels for each hidden layer) with rectified linear unit (ReLU) activations after each hidden layer and a sigmoid function after the output layer.

For training and adaptation, we adopted the Adam algorithm [20]. The learning rate was 5e-4 for training with simulated data from scratch and 1e-4 for adaptation with real data. For LRS3-TED, we trained the models for 50 epochs and chose the best-performing model based on a development set. For adaptation, we conducted training for 20 epochs. For the multi-task hyperparameters, we set $\alpha = 10$ and $\beta = 5$ to make the ranges of the loss terms roughly the same scale, and $\epsilon = 2$. The evaluation metric is the scale-invariant SDR calculated with the BSSEval toolbox [21], and all results were averaged over each evaluation condition. Since we cannot acquire separated signals for the mixed speech of the real data, we used the close-talk recordings as the reference signals.

4.3. Results on Simulated Data

Table 1 shows the SDR [dB] of the extracted speech for the simulated LRS3-TED data. The results include the conditions where both clues are clean, where one of the clues is corrupted and the other is clean, and where the visual clue is intermittently occluded while the audio clue is clean or corrupted. "audio" and "visual" in the "fusion" column stand for separation with only audio clues or only visual clues, and "sum" and "attention" stand for the multi-modal methods with summation and attention fusion, respectively.

Table 1. The extraction performance for the simulated dataset in SDR [dB]. Systems (5)-(9) are the proposals and all outperform the conventional attention fusion (4) and summation fusion (3) approaches. The mask sizes med. and full correspond to 80x60 and 188x188 pixels, respectively. "Intermittent full" is the condition where half of the frames are corrupted by full-sized masks.

      fusion      multi-task         norm.  | no mask              | med.  | full  | intermittent full    | avg.
                  training                  | clean  0 dB  -20 dB  | clean | clean | clean  0 dB  -20 dB  |
      mixture                               |  0.1                 |       |       |                      |  0.1
  (1) audio                                 | 15.1   14.8    3.5   | 15.1  | 15.1  | 15.1   14.8    3.5   | 12.1
  (2) visual                                | 15.3   15.3   15.3   | 13.1  | -1.5  | 13.7   13.7   13.7   | 12.3
  (3) sum                                   | 15.4   15.4   15.3   | 14.8  | 14.4  | 14.7   14.6   13.8   | 14.8
  (4) attention                             | 15.4   15.4   15.3   | 14.8  | 14.5  | 15.1   14.9   12.4   | 14.7
  (5) attention   att. guided               | 15.9   15.9   15.9   | 15.3  | 14.8  | 15.6   15.6   14.4   | 15.4
  (6) attention   clue cond. aware          | 15.9   15.9   15.9   | 15.4  | 14.9  | 15.7   15.6   15.1   | 15.6
  (7) attention                       yes   |                      |       |       |                      |
  (8) attention   att. guided         yes   |                      |       |       |                      |
  (9) attention   clue cond. aware    yes   |                      |       |       |                      |

The results show that the proposed approaches (5)-(9) all offered improved performance, especially the combination of normalized attention with attention guided training (8). The extraction performance of (8) was 0.96 dB better than summation fusion (3) and 1.04 dB better than the conventional attention (4) on average. As other metrics, the perceptual evaluation of speech quality (PESQ) and short-time objective intelligibility (STOI) were also calculated; on average they improved from 2.45 and 0.939 for the conventional attention (4) to 2.61 and 0.947 for the best proposed system (8).

The improvement is especially large in the intermittent mask conditions. The conventional systems (3)-(4) exhibited significant drops in performance with such time-variant visual occlusions, while the proposed systems performed robustly even with corrupted audio clues. It seems that the attention proposals help to deal with clues whose quality changes dynamically, even though the model was trained only on time-invariant occlusions.

It is worth noting that the performance of the conventional methods (3)-(4) in the full mask condition fell behind the audio-only extraction (1) (14.4 dB and 14.5 dB vs. 15.1 dB). In contrast, the best multi-modal model (8) yielded an SDR comparable to the audio-only method (1) even when the visual clues were totally corrupted (15.2 dB vs. 15.1 dB). This observation suggests that our proposal can mitigate the performance degradation caused by unreliable clues. Additionally, proposal (8) significantly outperformed the visual-only method (2) even when the audio clues were mostly unreliable (16.0 dB vs. 15.3 dB). This further improvement is probably due to the multi-task effect of multi-modal training.
Fig. 4. An example of speech extracted by the normalized attention model with attention guided training. The target voice is correctly extracted from the mixture.
4.4. Results on Real Recordings

Fig. 4 shows an example of speech extracted from a real recorded mixture. The figure shows the spectrograms of (a) the mixture recorded with a distant microphone, (b) the speech extracted with the proposed method, (c) the target speaker recorded with a close-talk microphone, and (d) the interference speaker recorded with a close-talk microphone. The references (c)-(d) are not true target signals, since these are unavailable in real recordings. This result was obtained with the normalized attention model with attention guided training. From the result, we can confirm that the target voice is successfully extracted from the mixture. More qualitative examples are also available on the website.

Table 2 shows the SDR [dB] of the extracted speech. Since the "true" target signals are unavailable for the real mixtures, these values were calculated using the close-talk microphone signals as reference signals and are thus only indicative, because the reference contains some leakage of the interference speaker, as seen in Fig. 4(c). Evaluations on other metrics, including speech recognition accuracy, will be part of our future work.

Table 2. Performance with real data in SDR [dB]. Our proposal extracts the target voice with reasonable quality thanks to the model adaptation.

  system                  without occlusion   intermittent occlusion   full occlusion
  mixture                       -0.4                  -0.4                 -0.3
  audio                          7.7                   7.2                  7.9
  visual                         3.1                   0.6                 -1.2
  attention (proposal)           8.9                   8.3                  8.2
  - w/o adaptation               6.6                   5.0                  4.6

"audio" and "visual" in the table represent the models with single-modality clues, and "attention (proposal)" represents the normalized attention model with attention guided training; domain adaptation was conducted for each of them. The system with the proposed attention extracted the target voice significantly better than the single-modality systems for all occlusion conditions. This result demonstrates that audio-visual target speaker extraction with the proposed attention mechanism is also valid for real recorded mixtures, even with realistic visual clue occlusions. It must also be noted that the domain adaptation improved the performance significantly for the proposed method. The result indicates that domain adaptation can mitigate the mismatched conditions even with a relatively limited amount of adaptation data.

Fig. 5. An example of attention values for the audio modality $a^A_{st}$. The proposed attention mechanism captured the reliability of the visual clues and yielded interpretable results.

Fig. 5 shows an example of the attention values of the audio modality $a^A_{st}$ in response to the hand motions of a speaker. The results were obtained from the domain-adapted models. The result shows that the conventional attention mechanism fails to capture the clue reliability. Additionally, its attention value was heavily biased toward the visual modality, and consequently $a^A_{st}$ took near-zero values due to the norm imbalance problem. By mitigating the norm imbalance problem, the proposed attention with normalization was able to selectively weight the audio clues when the mouth region of the visual clue was occluded, and to use the audio and visual clues equally when it was not occluded. Although the displacement of the attention values was relatively small with normalized attention alone, attention guided training further improved the attention behavior so as to utilize the whole range of variability. These results indicate that the proposed attention scheme successfully captures the reliability of the clues and learns to neglect corrupted clues even with real occlusions. This may help improve the extraction performance by mitigating the adverse effects of the corruption. Also, this dynamic selection makes the attention values interpretable, as we can judge the visual clue reliability from the attention value displacements.
5. CONCLUSIONS
Towards more realistic applications of audio-visual target speaker extraction, we focused on the clue corruption problem that naturally occurs in real-world recordings. In this work, we proposed a normalized attention mechanism and two multi-task training methods that efficiently train the attention mechanism to be aware of clue reliability. Experiments on a simulated dataset showed that both proposed methods improve the extraction performance over conventional attention fusion and summation fusion by capturing clue reliability. We also conducted experiments on real recorded mixtures with natural visual occlusions. The results showed that our proposals, combined with domain adaptation, successfully extracted target voices from the mixtures while correctly dealing with the visual occlusions by selectively using the more informative clues at each time instance. As future work, we will carry out a deeper investigation using real datasets with more training and test speakers, as well as adopting other evaluation metrics such as speech recognition accuracy.
6. REFERENCES

[1] John R. Hershey, Zhuo Chen, Jonathan Le Roux, and Shinji Watanabe, "Deep clustering: Discriminative embeddings for segmentation and separation," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016, pp. 31–35.
[2] Morten Kolbæk, Dong Yu, Zheng-Hua Tan, and Jesper Jensen, "Multitalker speech separation with utterance-level permutation invariant training of deep recurrent neural networks," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 10, pp. 1901–1913, 2017.
[3] Yi Luo and Nima Mesgarani, "TasNet: Time-domain audio separation network for real-time, single-channel speech separation," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 696–700.
[4] Kateřina Žmolíková, Marc Delcroix, Keisuke Kinoshita, Takuya Higuchi, Atsunori Ogawa, and Tomohiro Nakatani, "Speaker-aware neural network based beamformer for speaker extraction in speech mixtures," in Proc. Interspeech, 2017, pp. 2655–2659.
[5] Jun Wang, Jie Chen, Dan Su, Lianwu Chen, Meng Yu, Yanmin Qian, and Dong Yu, "Deep extractor network for target speaker recovery from single channel speech mixtures," in Proc. Interspeech, 2018, pp. 307–311.
[6] Marc Delcroix, Kateřina Žmolíková, Keisuke Kinoshita, Atsunori Ogawa, and Tomohiro Nakatani, "Single channel target speaker extraction and recognition with speaker beam," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 5554–5558.
[7] Marc Delcroix, Kateřina Žmolíková, Tsubasa Ochiai, Keisuke Kinoshita, Shoko Araki, and Tomohiro Nakatani, "Compact network for speakerbeam target speaker extraction," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019, pp. 6965–6969.
[8] Ariel Ephrat, Inbar Mosseri, Oran Lang, Tali Dekel, Kevin Wilson, Avinatan Hassidim, William T. Freeman, and Michael Rubinstein, "Looking to listen at the cocktail party: A speaker-independent audio-visual model for speech separation," ACM Transactions on Graphics (TOG), vol. 37, no. 4, pp. 1–11, 2018.
[9] Triantafyllos Afouras, Joon Son Chung, and Andrew Zisserman, "The conversation: Deep audio-visual speech enhancement," in Proc. Interspeech, 2018, pp. 3244–3248.
[10] Aviv Gabbay, Asaph Shamir, and Shmuel Peleg, "Visual speech enhancement," in Proc. Interspeech, 2018, pp. 1170–1174.
[11] Jian Wu, Yong Xu, Shi-Xiong Zhang, Lian-Wu Chen, Meng Yu, Lei Xie, and Dong Yu, "Time domain audio visual speech separation," in IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2019.
[12] Triantafyllos Afouras, Joon Son Chung, and Andrew Zisserman, "My lips are concealed: Audio-visual speech enhancement through obstructions," in Proc. Interspeech, 2019, pp. 4295–4299.
[13] Tsubasa Ochiai, Marc Delcroix, Keisuke Kinoshita, Atsunori Ogawa, and Tomohiro Nakatani, "Multimodal SpeakerBeam: Single channel target speech extraction with audio-visual speaker clues," in Proc. Interspeech, 2019, pp. 2718–2722.
[14] Rongzhi Gu, Shi-Xiong Zhang, Yong Xu, Lianwu Chen, Yuexian Zou, and Dong Yu, "Multi-modal multi-channel target speech separation," IEEE Journal of Selected Topics in Signal Processing, 2020.
[15] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio, "Neural machine translation by jointly learning to align and translate," in International Conference on Learning Representations (ICLR), 2015.
[16] Yi Luo and Nima Mesgarani, "Conv-TasNet: Surpassing ideal time–frequency magnitude masking for speech separation," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 8, pp. 1256–1266, 2019.
[17] Marc Delcroix, Tsubasa Ochiai, Kateřina Žmolíková, Keisuke Kinoshita, Naohiro Tawara, Tomohiro Nakatani, and Shoko Araki, "Improving speaker discrimination of target speech extraction with time-domain speakerbeam," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020.
[18] Florian Schroff, Dmitry Kalenichenko, and James Philbin, "FaceNet: A unified embedding for face recognition and clustering," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 815–823.
[19] https://github.com/davidsandberg/facenet.
[20] Diederik P. Kingma and Jimmy Ba, "Adam: A method for stochastic optimization," in International Conference on Learning Representations (ICLR), 2015.
[21] Emmanuel Vincent, Rémi Gribonval, and Cédric Févotte, "Performance measurement in blind audio source separation," IEEE Transactions on Audio, Speech, and Language Processing, vol. 14, no. 4, pp. 1462–1469, 2006.