SPEAKER ACTIVITY DRIVEN NEURAL SPEECH EXTRACTION

Marc Delcroix, Katerina Zmolikova*, Tsubasa Ochiai, Keisuke Kinoshita, Tomohiro Nakatani

NTT Corporation, Japan
Brno University of Technology, Speech@FIT and IT4I Center of Excellence, Czechia
ABSTRACT
Target speech extraction, which extracts the speech of a target speaker in a mixture given auxiliary speaker clues, has recently received increased interest. Various clues have been investigated, such as pre-recorded enrollment utterances, direction information, or video of the target speaker. In this paper, we explore the use of speaker activity information as an auxiliary clue for single-channel neural network-based speech extraction. We propose a speaker activity driven speech extraction neural network (ADEnet) and show that it can achieve performance levels competitive with enrollment-based approaches, without the need for pre-recordings. We further demonstrate the potential of the proposed approach for processing meeting-like recordings, where the speaker activity is obtained from a diarization system. We show that this simple yet practical approach can successfully extract speakers after diarization, which results in improved ASR performance, especially in highly overlapping conditions, with a relative word error rate reduction of up to 25 %.
Index Terms — Speech extraction, Speaker activity, Speech enhancement, Meeting recognition, Neural network
1. INTRODUCTION
Recognizing speech in the presence of interfering speakers remains one of the challenges for automatic speech recognition (ASR). A conventional approach to tackle this problem consists of separating the observed speech mixture into all of its source speech signals before ASR. Single-channel speech separation has greatly progressed with the introduction of deep learning [1, 2]. However, most separation approaches suffer from two limitations: (1) they require knowing or estimating the number of speakers in the mixture, and (2) they suffer from a global permutation ambiguity issue, i.e., an arbitrary mapping between source speakers and outputs.

Target speech extraction [3] has been proposed as an alternative to speech separation to alleviate the above limitations. It focuses on extracting only the speech signal of a speaker of interest, or target speaker, by exploiting auxiliary clues about that speaker. The problem formulation thus becomes independent of the number of speakers in the mixture. Besides, the global permutation ambiguity is naturally solved thanks to the use of auxiliary clues. Several target speech extraction schemes have been proposed, exploiting different types of auxiliary clues such as pre-recorded enrollment utterances of the target speaker [4, 5], direction information [6], video of the target speaker [7, 8], or electroencephalogram (EEG) signals [9].

For example, SpeakerBeam [4, 10] is an early approach for enrollment utterance-based target speech extraction. It exploits a speaker embedding vector derived from an enrollment utterance to inform a speech extraction network which speaker to extract from the mixture. With SpeakerBeam, the speaker embedding vectors are obtained from an auxiliary network that is jointly trained with the speech extraction network. Consequently, the speaker embedding vector may capture the voice characteristics of the target speaker, which are needed to distinguish that speaker in a mixture.

Alternative approaches that exploit video clues have been proposed subsequently [7, 8]. In these works, a sequence of face embedding vectors is extracted from the video of the face of the target speaker speaking in the mixture and passed to the extraction network. Different from SpeakerBeam, the auxiliary clue is time-synchronized with the mixture signal. What information the face embedding vectors actually capture remains unclear, but we can naturally assume that they capture the lip movements of the target speaker and that the mouth opening and closing regions may be important information provided to the speech extraction network.

Although target speech extraction can achieve a high level of performance [7, 11], it is not always possible to have access to the enrollment utterance or video. In this paper, inspired by the works on visual speech extraction, we investigate the use of another clue, which consists of the speech activity of a speaker. The speaker activity consists of a signal that indicates for each time frame whether the speaker is speaking or not. It can be a general and practical clue for speech extraction, as there are various ways to obtain it.

* Katerina Zmolikova was partly supported by Czech Ministry of Education, Youth and Sports from the National Programme of Sustainability (NPU II) project "IT4Innovations excellence in science - LQ1602".

We propose a single-channel speaker activity driven speech extraction neural network (ADEnet).
We hypothesize that a neural network can exploit the speaker activity information to identify and extract the speech of a speaker, assuming that the speakers in the mixture do not always fully overlap. We experimentally demonstrate that, assuming the availability of oracle speaker activity, ADEnet can achieve superior performance to SpeakerBeam, without requiring pre-recorded enrollment speech.

In practice, there are various possible approaches to obtain the speaker activity information, including visual-based voice activity detection (VAD) [12], personal VAD [13], or diarization [14–16]. For example, speaker diarization has greatly progressed [14–17], and finding speaker activity regions in meetings has achieved a low diarization error rate (DER) even in overlapping conditions. It does not require pre-recorded enrollment utterances of the speakers or video. Diarization is often used as a pre-processing for ASR. However, diarization itself does not extract the speech signals, which limits ASR performance in overlapping conditions. As a practical potential use-case of ADEnet, we investigate using the speaker activity information obtained from a diarization system as clues for ADEnet. We show that this simple approach can improve ASR performance in meeting-like single-channel recordings, especially in severe overlapping conditions. Note that our proposed ADEnet does not depend on how the speaker activity information is obtained, making it a versatile speech extraction method.

Fig. 1. Diagram of SpeakerBeam and different variations of ADEnet: (a) SpeakerBeam, (b) ADEnet-auxiliary, (c) ADEnet-input, (d) ADEnet-mix. See Sect. 5.3 for details of the auxiliary and extraction blocks.

In the remainder of the paper, we first briefly review SpeakerBeam in Section 2, which serves as a basis for the proposed ADEnet that we describe in Section 3. We discuss related works in Section 4. In Section 5, we present experimental results based on the LibriCSS corpus [18]. Finally, we conclude the paper in Section 6.
2. SPEAKERBEAM
Figure 1-(a) shows a schematic diagram of SpeakerBeam [10]. Let y_t be an observed speech mixture, a_t be an enrollment utterance, and x_t be the speech signal of the target speaker. Here, we consider networks operating in the frequency domain, and the signals y_t, x_t, and a_t are sequences of amplitude spectrum coefficients, where t represents the time-frame index. Note that we could derive the methods presented in this paper equivalently for time-domain networks [11].

SpeakerBeam is composed of two networks. First, an auxiliary network accepts the enrollment utterance and outputs a fixed-dimension speaker embedding vector e, obtained as the time-average of the output of the auxiliary network,

    e = \frac{1}{T_a} \sum_{t=1}^{T_a} f(a_t),    (1)

where f(·) represents the auxiliary network, and T_a is the number of frames of the enrollment utterance.

The second network computes a time-frequency mask, m_t, given the mixture signal, y_t, and the speaker embedding vector, e, as m_t = g(y_t, e), where g(·) represents the speech extraction network. The extracted speech signal, \hat{x}_t, is then obtained as \hat{x}_t = m_t ⊙ y_t, where ⊙ denotes element-wise multiplication.

There are several approaches to combine the two networks using, e.g., feature concatenation, addition, or multiplication. We employ here the multiplicative combination [19], where the output of the first hidden layer of the speech extraction network g(·) is multiplied with the speaker embedding vector e, as proposed in [20]. We jointly train both networks to minimize the mean square error (MSE) between the estimated and reference target speech.
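As a rough sketch of this scheme (not the authors' implementation; the layer sizes, the sigmoid mask output, and all names are illustrative assumptions), the enrollment-based embedding of Eq. (1) and the multiplicative combination could be written in PyTorch as follows:

```python
import torch
import torch.nn as nn

class SpeakerBeamSketch(nn.Module):
    """Minimal sketch of an enrollment-driven extraction network (hypothetical sizes)."""

    def __init__(self, n_freq=257, emb_dim=64, hidden=600):
        super().__init__()
        self.aux = nn.Sequential(nn.Linear(n_freq, emb_dim), nn.ReLU(),
                                 nn.Linear(emb_dim, emb_dim), nn.ReLU())
        self.first = nn.Sequential(nn.Linear(n_freq, emb_dim), nn.ReLU())
        self.rest = nn.Sequential(nn.Linear(emb_dim, hidden), nn.ReLU(),
                                  nn.Linear(hidden, n_freq), nn.Sigmoid())

    def forward(self, y, a):
        # y: (T, F) mixture amplitude spectra, a: (Ta, F) enrollment spectra
        e = self.aux(a).mean(dim=0)   # Eq. (1): time-averaged speaker embedding
        h = self.first(y) * e         # multiplicative combination of the two networks
        m = self.rest(h)              # time-frequency mask m_t
        return m * y                  # extracted speech: mask applied element-wise
```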
3. SPEAKER ACTIVITY DRIVEN NEURAL SPEECH EXTRACTION
SpeakerBeam has demonstrated high speech extraction performance. However, it requires a pre-recorded enrollment utterance, which may not always be available. Moreover, there may be a mismatch between the enrollment utterance and the target speech in the mixture due to, e.g., different recording conditions.

Here, we assume that we can have access to the speaker activity information instead of the pre-recorded utterance. Let p_t ∈ {0, 1} be a signal representing the activity of a speaker, where p_t takes a value of 1 when the speaker is speaking and 0 otherwise. We consider two cases, i.e., (1) speaker activity with overlap, where p_t contains all speaker active regions, including regions where the interference speakers overlap with the target speaker, and (2) speaker activity without overlap, where we remove the overlap regions from the speaker active regions based on the activity information of the interference speakers.

We first describe three configurations of ADEnet that exploit the speaker activity for target speech extraction, shown in Fig. 1-(b), (c), and (d). We then discuss how to obtain the target speaker activity in practice. Finally, we discuss a training strategy to make ADEnet robust to estimation errors in the speaker activity signals.

The first configuration, shown in Fig. 1-(b), exploits a similar architecture as SpeakerBeam, but replaces the enrollment utterance by the input mixture and the time-averaging operation of Eq. (1) by a weighted sum,

    e = \frac{1}{\sum_t p_t} \sum_t p_t f(y_t).    (2)

This formulation is similar to SpeakerBeam, but uses as enrollment the regions of the mixture signal where the target speaker is active instead of a pre-recorded utterance. If we can remove the overlapping regions based on the speaker activity information (i.e., activity without overlap), the enrollment consists of the regions where the target speaker is the single active speaker. This naturally avoids any mismatch in the recording conditions between the enrollment and the mixture. We call this approach ADEnet-auxiliary.

An alternative option consists of simply concatenating the speaker activity p_t and the speech mixture y_t at the input of the network. This approach does not compute any explicit speaker embedding vector but expects that the extraction network can learn to track and identify the target speaker internally based on its activity. Consequently, it does not use any auxiliary network. This approach is shown in Fig. 1-(c). We call it ADEnet-input.

The third configuration is shown in Fig. 1-(d). It consists of a combination of both approaches, where we use the speaker activity at the input and also to compute the speaker embedding vector with Eq. (2). We call this approach ADEnet-mix.
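To make the configurations concrete, here is a minimal NumPy sketch of the two ways the activity signal enters the system; the function names and the frame-wise encoder `aux_net` are hypothetical:

```python
import numpy as np

def activity_weighted_embedding(y, p, aux_net):
    """Eq. (2): embedding from mixture frames where the target is active (ADEnet-auxiliary)."""
    # y: (T, F) mixture features, p: (T,) binary target activity, aux_net: frame-wise encoder
    feats = aux_net(y)                                   # (T, D)
    return (p[:, None] * feats).sum(axis=0) / max(p.sum(), 1.0)

def concat_activity_input(y, p):
    """ADEnet-input: append the activity signal as an extra feature per frame."""
    return np.concatenate([y, p[:, None]], axis=1)       # (T, F + 1)

# ADEnet-mix would use both: the concatenated input and the weighted-sum embedding.
```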
ADEnet requires the speech activity of the target speaker in the mixture. We aim at building a versatile system that can perform speech extraction independently of the way we obtain the speaker activity.

In practice, there are several possible options to obtain the speaker activity. Naturally, a conventional VAD [21] would not work in this context because it could not discriminate the target speaker from the interference in a mixture. However, if we have access to video recordings of the target speaker, we could use a visual-based VAD to detect the speaker activity [12, 22]. If enrollment utterances are available, we can obtain the speaker activity with a personal VAD [13]. If none of the above are available, we can use diarization [14, 16, 17] to obtain the speech activity of the speakers in a recording.

As an example use-case, we investigate using ADEnet in a meeting scenario. We use diarization to obtain the speech activity of all speakers in a meeting. We then apply ADEnet to each speaker identified by the diarization system. This builds upon the great progress of recent diarization systems, which can achieve a low DER in meeting-like situations [15–17]. Combining diarization with ADEnet offers a practical yet relatively simple approach for enhancing ASR performance in overlapping conditions.

However good recent VAD or diarization systems are, we cannot expect to obtain oracle speaker activity information. Besides, depending on the scenario, it may not be possible to obtain speaker activity without overlap. It is thus essential to develop a system robust to errors in the speaker activity detection. In the next section, we explain a training strategy that makes ADEnet robust to errors in the activity signals.
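Before turning to the training strategy, here is a small sketch of how the "speaker activity without overlap" used throughout could be derived from frame-aligned binary activities of all speakers (e.g., a diarization output); the function name and layout are illustrative:

```python
import numpy as np

def activity_without_overlap(activities, target):
    """Keep only frames where the target speaker is the single active speaker.

    activities: (S, T) binary activity of all S speakers, target: index of the
    target speaker. Returns a (T,) binary activity signal without overlap.
    """
    act = np.asarray(activities, dtype=bool)
    others_active = np.delete(act, target, axis=0).any(axis=0)
    return (act[target] & ~others_active).astype(int)
```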
We require the triplet of the observed mixture y_t, the reference target speech x_t, and the target speaker activity p_t to train ADEnet. We use simulated speech mixtures and extract the oracle speaker activity by applying a conventional VAD approach [21] on the reference target speech. Besides, we use speaker activity without overlap, to allow the system to better capture the characteristics of the target speaker.

Using oracle speaker activity information obtained from the reference signals allows us to build a system independent of the method used for estimating the speaker activity at test time. However, it is unrealistic to assume the availability of oracle error-free speaker activity information. Therefore, we add noise to the speaker activity information at training time to make the system robust. Practically, for each oracle speech segment found by the VAD, we modify the segment boundaries by adding to the start and end times a value uniformly sampled between − and  seconds.
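A minimal sketch of this boundary perturbation follows; `max_shift` is a placeholder parameter, since the exact jitter range is not specified above:

```python
import numpy as np

def perturb_segments(segments, max_shift, rng=None):
    """Jitter oracle VAD segment boundaries for noisy-activity training.

    segments: list of (start, end) times in seconds. max_shift is a hypothetical
    bound on the uniform jitter applied independently to each boundary.
    """
    rng = rng or np.random.default_rng()
    noisy = []
    for start, end in segments:
        start += rng.uniform(-max_shift, max_shift)
        end += rng.uniform(-max_shift, max_shift)
        if end > start:                       # drop segments that collapse
            noisy.append((max(start, 0.0), end))
    return noisy
```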
4. RELATED WORK
There have been several works that integrated speech activity into non-neural multi-channel speech separation systems to extract speech [23–26]. For example, supervised independent vector analysis (IVA) modifies the objective function of IVA by including an extra term related to the target speaker, such as the source activity, allowing the system to extract that speaker [25].

The diarization information has also been combined with spatial clustering-based separation in the guided source separation (GSS) framework [26]. GSS introduces the speaker activity as a prior for the mask estimation of the complex angular central Gaussian mixture model (cACGMM) based source separation, enabling identification of the target source among the cACGMM outputs. GSS has been applied successfully to severe recording conditions [26]. Note that supervised IVA and GSS require multi-microphone recordings, while our proposed ADEnet can work with a single microphone.

If multi-microphone recordings are available, we could extend GSS with ADEnet by using time-frequency masks of the sources obtained from ADEnet instead of the speaker activity as priors to the cACGMM, as proposed in [27] for speech separation. Such an extension will be part of our future work.
5. EXPERIMENTS
We performed two types of experiments, with (1) simulated two-speaker mixtures to show the potential of using speaker activity for speech extraction, and (2) meeting-like recordings to illustrate a potential use-case of ADEnet as a pre-processor for ASR.
We used the same training data for all experiments. It consisted of 63,000 reverberant two-speaker mixtures with background noise at an SNR between 10 and 20 dB. The speech signals were taken from the LibriSpeech corpus [28]. We further created a validation and a test set with the same conditions. The test set consisted of 1,364 mixtures with an average overlapping ratio of 38.5 %.

In the second experiment, we used the meeting-like LibriCSS corpus [18], which consists of 8-speaker meeting-like recording sessions of 10 minutes, obtained by re-recording LibriSpeech utterances played through loudspeakers in a meeting room. The overlap ratio varies from 0 to 40 %.
For training, we used the "oracle" speaker activity without overlap obtained by applying the webrtc VAD [21] to the reference signals. We trained the system with or without noisy activity training (adding noise to the speaker activity signals, as described in Section 3.3). In the first experiment with simulated two-speaker mixtures, we used the oracle speaker activity with and without noise at test time. In the second experiment with LibriCSS, we used target-speaker VAD (TS-VAD) based diarization [16] to obtain the speech activity regions of each speaker.
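For reference, frame-level activity from a clean reference signal could be obtained with py-webrtcvad [21] roughly as in the sketch below; the 30 ms frame length and aggressiveness mode are illustrative choices, not the settings used in the paper:

```python
import numpy as np
import webrtcvad  # https://github.com/wiseman/py-webrtcvad

def frame_activity(pcm16, sample_rate=16000, frame_ms=30, mode=3):
    """Frame-level speech/non-speech decisions on a clean reference signal.

    pcm16: 1-D numpy array of int16 samples. Returns one binary value per frame
    of `frame_ms` milliseconds (webrtcvad accepts 10/20/30 ms frames).
    """
    vad = webrtcvad.Vad(mode)
    frame_len = sample_rate * frame_ms // 1000
    n_frames = len(pcm16) // frame_len
    activity = np.zeros(n_frames, dtype=int)
    for i in range(n_frames):
        frame = pcm16[i * frame_len:(i + 1) * frame_len].tobytes()
        activity[i] = int(vad.is_speech(frame, sample_rate))
    return activity
```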
In all experiments, we used a similar architecture for the extraction network, which consisted of 3 bidirectional long short-term memory (BLSTM) layers with 1200 units and two fully connected (FC) layers with rectified linear unit (ReLU) activation functions. The auxiliary network consisted of two FC layers with 64 hidden units and ReLU activation functions. We used the same configuration for the SpeakerBeam network. The input features consisted of 257-dimensional log amplitude spectrum coefficients obtained using a short-time Fourier transform with a window of 32 msec and a shift of 8 msec. The training loss consisted of the MSE between the spectra of the reference signal and the extracted speech. We trained the networks with the ADAM optimizer using the padertorch toolkit [29].
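A loose PyTorch sketch of such an extraction network is shown below; whether the 1200 BLSTM units are per direction, the FC widths, and the output non-linearity are assumptions made for illustration:

```python
import torch
import torch.nn as nn

class ExtractionNetSketch(nn.Module):
    """Sketch of the described extraction network: 3 BLSTM layers + 2 FC layers.

    in_dim is 257 for the plain mixture features, or 258 when the activity signal is
    appended (ADEnet-input / ADEnet-mix).
    """
    def __init__(self, in_dim=257, n_freq=257, lstm_units=600, fc_units=600):
        super().__init__()
        self.blstm = nn.LSTM(in_dim, lstm_units, num_layers=3,
                             bidirectional=True, batch_first=True)
        self.fc = nn.Sequential(
            nn.Linear(2 * lstm_units, fc_units), nn.ReLU(),
            nn.Linear(fc_units, n_freq), nn.ReLU())

    def forward(self, features):
        # features: (batch, frames, in_dim) log amplitude spectra (plus activity, if used)
        h, _ = self.blstm(features)
        return self.fc(h)  # time-frequency mask; training uses an MSE loss on the spectra
```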
We first performed experiments on the simulated two-speaker mixtures. Table 1 shows the signal-to-distortion ratio (SDR) [30] obtained with ADEnet with and without noisy activity training and for different activity signals at test time. At test time, we used oracle speaker activity signals without overlap, as during training (i.e., "w/o overlap"), without and with added noise, i.e., "Oracle" and "+Noise", respectively. We also explored a training and test mismatch by using speaker activity with overlap at test time (i.e., "w/ overlap").

The SDR of the mixture signals is -0.2 dB. We compare the proposed ADEnet with SpeakerBeam using pre-recorded enrollment utterances. We used a randomly selected utterance of the target speaker from the LibriSpeech corpus as the enrollment utterance. SpeakerBeam achieved an SDR of 9.4 dB. Besides, masking out the regions where the speaker is inactive from the observed signal ("inactivity-masking") leads to an SDR of 3.8 dB with oracle speaker activity, and -1.6 dB when using noisy speaker activity.

Table 1. Speech enhancement results in terms of SDR [dB] for simulated 2-speaker data with ADEnet, without and with noisy activity training. The SDR of the mixture signal is -0.2 dB, and the SDR of SpeakerBeam is 9.4 dB.

  Noisy activity        Activity signal at test time
  training           w/o overlap         w/ overlap
                    Oracle   +Noise     Oracle   +Noise    Avg
  ADEnet-aux   -      9.7      6.0        6.0      4.3     6.5

The upper part of Table 1 shows the results without noisy activity training. The first column confirms that ADEnet with oracle speaker activity can achieve a higher SDR than SpeakerBeam (10.2 dB vs. 9.4 dB) without the need for a pre-recorded enrollment utterance. Moreover, it greatly outperforms the simple "inactivity-masking". These results demonstrate that ADEnet can effectively extract speakers based on the speaker activity. However, the SDR degrades significantly when precise speaker activity is not available, as revealed by the poor results when including noise or overlap regions.

The lower part of Table 1 shows the results with noisy activity training. The oracle performance degrades, but the systems become much more robust to errors in the activity signals. Overall, the performance in noisy conditions becomes a little worse than SpeakerBeam but still much better than using "inactivity-masking". ADEnet-mix with noisy training achieves superior performance on average, and we use this configuration in the following experiments. Figure 2 shows an example of extracted speech, illustrating that ADEnet can extract a speaker even in the overlapping regions.

This experiment demonstrates that single-channel speech extraction is possible without explicit use of enrollment utterances or video. Moreover, the relatively stable performance with activity signals of various quality indicates that ADEnet could be used with various diarization or VAD methods, whether or not they provide speaker regions without overlap.
We present results on the LibriCSS meeting-like corpus, using TS-VAD-based diarization to obtain the speech activity regions of all speakers in the meeting. We use TS-VAD as an example of diarization, but other diarization approaches could be used, such as, e.g., [14, 15]. TS-VAD achieved a DER of 7.3 % [31]. We use the ESPnet toolkit and the transformer-based recipe developed for LibriCSS [32, 33] as the ASR back-end, which provides a strong baseline for the task [31].

Diarization finds the speech activity for the speakers in the recording. From the diarization output, we process each continuous speech segment separately, adding 2 seconds of context at the beginning and the end to create extended segments. We used the speaker activity estimated by the diarization within the extended segment as the speaker activity signal. Besides, we also experimented with speaker activity without overlap by removing from the activity signal the regions where multiple speakers were detected. ASR is performed separately on each speech segment found by the diarization and evaluated using the concatenated minimum-permutation word error rate (cpWER) [34], which includes the errors caused by diarization.
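As a small sketch of this segment handling (not the exact pipeline; the function name and the `model` callable standing in for ADEnet are placeholders):

```python
import numpy as np

def extract_per_segment(segments, activity, mixture, sr, frame_shift, model, context=2.0):
    """Run the extraction model on each diarized segment of one speaker,
    extended by 2 s of context at the beginning and the end.

    segments: list of (start, end) times in seconds; activity: frame-level binary
    activity of that speaker with a hop of `frame_shift` seconds;
    model: a callable (signal, activity) -> extracted signal.
    """
    outputs = []
    for start, end in segments:
        s = max(0.0, start - context)
        e = min(len(mixture) / sr, end + context)
        sig = mixture[int(s * sr):int(e * sr)]
        act = activity[int(s / frame_shift):int(e / frame_shift)]
        outputs.append((s, e, model(sig, act)))
    return outputs
```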
Fig. 2. Example of extracted speech with ADEnet-mix: (a) mixture, (b) reference speaker 1, (c) reference speaker 2, (d) extracted speaker 1, (e) extracted speaker 2.
Table 2. cpWER [%] using TS-VAD-based diarization, the proposed ADEnet, and a Transformer-based ASR back-end.

            w/o        Overlap ratio in %
            overlap    0L     0S    OV10   OV20   OV30   OV40    Avg
  no proc     na      11.2    9.4   16.2   23.1   33.6   41.1   23.9

Table 2 shows the cpWER obtained with the diarized output with and without ADEnet. Note that ADEnet operates on single-channel recordings. Even though we used a powerful diarization, we see the degradation caused by the overlapping speech. We confirm that ADEnet consistently improves the cpWER, especially for the high overlapping conditions, with relative WER improvements ranging from 6.5 % to 25 %. These results should be put into perspective with those obtained with a multi-channel system using GSS [26] with seven microphones, which achieved an average WER of 11.2 %. Compared to GSS, ADEnet has the advantage that it can work with single-channel recordings. The integration of ADEnet with GSS is possible and will be part of our future work.
6. CONCLUSION
We investigated the potential of using speaker activity to develop a simple and versatile neural speech extraction method. We showed experimentally that the proposed ADEnet with precise speaker activity information could achieve high single-channel speech extraction performance. Besides, combined with diarization, it can significantly improve ASR performance in meeting-like situations.

For further improvements, we plan to investigate more powerful network configurations [11] and to combine ADEnet with multi-microphone schemes [26, 27]. Besides, we would like to compare ADEnet with visual clue based speech extraction approaches.
Acknowledgment.
This work was started at JSALT 2020 at JHU, with support from Microsoft, Amazon, and Google. We thank Desh Raj, Pavel Denisov, and Shinji Watanabe for providing the LibriCSS baseline.

7. REFERENCES

[1] M. Kolbaek, D. Yu, Z. Tan, and J. Jensen, "Multitalker speech separation with utterance-level permutation invariant training of deep recurrent neural networks," IEEE/ACM Trans. ASLP, vol. 25, no. 10, pp. 1901–1913, 2017.
[2] Y. Luo and N. Mesgarani, "TasNet: Surpassing ideal time-frequency masking for speech separation," in Proc. of ICASSP'18, 2018.
[3] K. Zmolikova, M. Delcroix, K. Kinoshita, T. Higuchi, A. Ogawa, and T. Nakatani, "Speaker-aware neural network based beamformer for speaker extraction in speech mixtures," in Proc. of Interspeech'17, 2017, pp. 2655–2659.
[4] K. Zmolikova, M. Delcroix, K. Kinoshita, T. Higuchi, A. Ogawa, and T. Nakatani, "Learning speaker representation for neural network based multichannel speaker extraction," in Proc. of ASRU'17, 2017, pp. 8–15.
[5] J. Chen, D. Su, L. Chen, M. Yu, Y. Qian, and D. Yu, "Deep extractor network for target speaker recovery from single channel speech mixtures," in Proc. of Interspeech'18, 2018, pp. 307–311.
[6] R. Gu, L. Chen, S.-X. Zhang, J. Zheng, Y. Xu, M. Yu, D. Su, Y. Zou, and D. Yu, "Neural spatial filter: Target speaker speech separation assisted with directional information," in Proc. of Interspeech'19, 2019, pp. 4290–4294.
[7] A. Ephrat, I. Mosseri, O. Lang, T. Dekel, K. Wilson, A. Hassidim, W. T. Freeman, and M. Rubinstein, "Looking to listen at the cocktail party: A speaker-independent audio-visual model for speech separation," ACM Trans. on Graphics (TOG), vol. 37, no. 4, pp. 1–11, 2018.
[8] T. Afouras, J. S. Chung, and A. Zisserman, "The conversation: Deep audio-visual speech enhancement," in Proc. of Interspeech'18, 2018, pp. 3244–3248.
[9] E. Ceolini, J. Hjortkjær, D. D. Wong, J. O'Sullivan, V. S. Raghavan, J. Herrero, A. D. Mehta, S.-C. Liu, and N. Mesgarani, "Brain-informed speech separation (BISS) for enhancement of target speaker in multitalker speech perception," NeuroImage, vol. 223, pp. 117282, 2020.
[10] K. Zmolikova, M. Delcroix, K. Kinoshita, T. Ochiai, T. Nakatani, L. Burget, and J. Cernocky, "SpeakerBeam: Speaker aware neural network for target speaker extraction in speech mixtures," IEEE JSTSP, vol. 13, no. 4, pp. 800–814, 2019.
[11] M. Delcroix, T. Ochiai, K. Zmolikova, K. Kinoshita, N. Tawara, T. Nakatani, and S. Araki, "Improving speaker discrimination of target speech extraction with time-domain SpeakerBeam," in Proc. of ICASSP'20, 2020.
[12] P. Liu and Z. Wang, "Voice activity detection using visual information," in Proc. of ICASSP'04, 2004, vol. 1, pp. I–609.
[13] S. Ding, Q. Wang, S.-Y. Chang, L. Wan, and I. Lopez Moreno, "Personal VAD: Speaker-conditioned voice activity detection," in Proc. of Odyssey'20, 2020, pp. 433–439.
[14] D. Garcia-Romero, D. Snyder, G. Sell, D. Povey, and A. McCree, "Speaker diarization using deep neural network embeddings," in Proc. of ICASSP'17, 2017, pp. 4930–4934.
[15] Z. Huang, S. Watanabe, Y. Fujita, P. García, Y. Shao, D. Povey, and S. Khudanpur, "Speaker diarization with region proposal network," in Proc. of ICASSP'20, 2020, pp. 6514–6518.
[16] I. Medennikov, M. Korenevsky, T. Prisyach, Y. Y. Khokhlov, M. Korenevskaya, I. Sorokin, T. V. Timofeeva, A. Mitrofanov, A. Andrusenko, I. Podluzhny, A. Laptev, and A. Romanenko, "Target-speaker voice activity detection: A novel approach for multi-speaker diarization in a dinner party scenario," ArXiv, vol. abs/2005.07272, 2020.
[17] Y. Fujita, S. Watanabe, S. Horiguchi, Y. Xue, and K. Nagamatsu, "End-to-end neural diarization: Reformulating speaker diarization as simple multi-label classification," ArXiv, vol. abs/2003.02966, 2020.
[18] Z. Chen, T. Yoshioka, L. Lu, T. Zhou, Z. Meng, Y. Luo, J. Wu, and J. Li, "Continuous speech separation: Dataset and analysis," in Proc. of ICASSP'20, 2020, pp. 7284–7288.
[19] L. Samarakoon and K. C. Sim, "Subspace LHUC for fast adaptation of deep neural network acoustic models," in Proc. of Interspeech'16, 2016, pp. 1593–1597.
[20] M. Delcroix, K. Zmolikova, T. Ochiai, K. Kinoshita, S. Araki, and T. Nakatani, "Compact network for SpeakerBeam target speaker extraction," in Proc. of ICASSP'19, 2019, pp. 6965–6969.
[21] https://github.com/wiseman/py-webrtcvad
[22] D. Sodoyer, B. Rivet, L. Girin, J.-L. Schwartz, and C. Jutten, "An analysis of visual speech information applied to voice activity detection," in Proc. of ICASSP'06, 2006, vol. 1.
[23] B. Rivet, L. Girin, and C. Jutten, "Visual voice activity detection as a help for speech source separation from convolutive mixtures," Speech Communication, vol. 49, no. 7-8, pp. 667–677, 2007.
[24] T. Ono, N. Ono, and S. Sagayama, "User-guided independent vector analysis with source activity tuning," in Proc. of ICASSP'12, 2012, pp. 2417–2420.
[25] F. Nesta, S. Mosayyebpour, Z. Koldovský, and K. Paleček, "Audio/video supervised independent vector analysis through multimodal pilot dependent components," in Proc. of EUSIPCO'17, 2017, pp. 1150–1164.
[26] C. Böddeker, J. Heitkaemper, J. Schmalenstroeer, L. Drude, J. Heymann, and R. Haeb-Umbach, "Front-end processing for the CHiME-5 dinner party scenario," in Proc. of Interspeech'18, 2018.
[27] T. Nakatani, N. Ito, T. Higuchi, S. Araki, and K. Kinoshita, "Integrating DNN-based and spatial clustering-based mask estimation for robust MVDR beamforming," in Proc. of ICASSP'17, 2017, pp. 286–290.
[28] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, "Librispeech: An ASR corpus based on public domain audio books," in Proc. of ICASSP'15, 2015, pp. 5206–5210.
[29] https://github.com/fgnt/padertorch
[30] E. Vincent, R. Gribonval, and C. Févotte, "Performance measurement in blind audio source separation," IEEE Trans. ASLP, vol. 14, no. 4, pp. 1462–1469, 2006.
[31] D. Raj, P. Denisov, Z. Chen, H. Erdogan, Z. Huang, M. He, S. Watanabe, J. Du, T. Yoshioka, Y. Luo, N. Kanda, J. Li, S. Wisdom, and J. R. Hershey, "Integration of speech separation, diarization, and recognition for multi-speaker meetings: System description, comparison, and analysis," in Submitted to SLT'21, 2021.
[32] S. Watanabe, T. Hori, S. Karita, T. Hayashi, J. Nishitoba, Y. Unno, N. Enrique Yalta Soplin, J. Heymann, M. Wiesner, N. Chen, A. Renduchintala, and T. Ochiai, "ESPnet: End-to-end speech processing toolkit," in Proc. of Interspeech'18, 2018, pp. 2207–2211.
[33] https://github.com/espnet/espnet/tree/master/egs/libri_css/asr1
[34] S. Watanabe, M. Mandel, J. Barker, and E. Vincent, "CHiME-6 challenge: Tackling multispeaker speech recognition for unsegmented recordings."