Target-Speaker Voice Activity Detection: a Novel Approach for Multi-Speaker Diarization in a Dinner Party Scenario
Ivan Medennikov, Maxim Korenevsky, Tatiana Prisyach, Yuri Khokhlov, Mariya Korenevskaya, Ivan Sorokin, Tatiana Timofeeva, Anton Mitrofanov, Andrei Andrusenko, Ivan Podluzhny, Aleksandr Laptev, Aleksei Romanenko
STC-innovations Ltd, St. Petersburg, Russia
ITMO University, St. Petersburg, Russia
{medennikov, korenevsky, knyazeva, khokhlov, korenevskaya, sorokin, timofeeva, mitrofanov-aa, andrusenko, podluzhnyi, laptev, romanenko}@speechpro.com

Abstract
Speaker diarization for real-life scenarios is an extremely challenging problem. Widely used clustering-based diarization approaches perform rather poorly in such conditions, mainly due to their limited ability to handle overlapping speech. We propose a novel Target-Speaker Voice Activity Detection (TS-VAD) approach, which directly predicts the activity of each speaker at each time frame. The TS-VAD model takes conventional speech features (e.g., MFCC) along with i-vectors for each speaker as inputs. A set of binary classification output layers produces the activities of all speakers. I-vectors can be estimated iteratively, starting from a strong clustering-based diarization.

We also extend the TS-VAD approach to the multi-microphone case using a simple attention mechanism on top of hidden representations extracted from the single-channel TS-VAD model. Moreover, post-processing strategies for the predicted speaker activity probabilities are investigated. Experiments on the CHiME-6 unsegmented data show that TS-VAD achieves state-of-the-art results, outperforming the baseline x-vector-based system by more than 30% Diarization Error Rate (DER) absolute.
Index Terms: speaker diarization, TS-VAD, CHiME-6
1. Introduction
Diarization is the process of determining the boundaries of each speaker's utterances in a conversation. It is an important part of many applications, primarily automatic speech recognition (ASR), e.g., for creating meeting minutes. A conventional approach [1, 2] consists of several stages, namely speech/voice activity detection (SAD/VAD), segmentation of the detected speech into short subsegments, and extraction of the current speaker's features (i-vectors [3], d-vectors [4, 5], x-vectors [6], etc.) followed by clustering (k-means [7], agglomerative hierarchical [8], spectral [9], etc.) according to some similarity metric (Probabilistic Linear Discriminant Analysis (PLDA) [10, 8] score, cosine, etc.). These stages can also be followed by re-segmentation (such as GMM [11], Variational Bayes [12], or LSTM-based [13]) and post-processing of overlapping speech segments.

Currently, high diarization accuracy is achieved on many benchmarks, such as CallHome. However, the development of a diarization system for complex acoustic environments is still an unsolved task. This was the motivation for the DIHARD Challenges [14, 15], focused on the development of systems for "hard" diarization. The DIHARD II Challenge [15] includes, in particular, multichannel Tracks 3 and 4 based on the CHiME-5 Challenge [16] data, which are very hard for both diarization and ASR. Diarization of the same data is also one of the CHiME-6 Challenge [17] Track 2 tasks. This data, recorded in real-life conditions, contains a large amount of overlapping speech. Conventional diarization systems are not well-suited for processing highly overlapping speech, so it is not very surprising that even the best DIHARD II system, developed by BUT [18], achieved a DER only slightly below 60%.

During our participation in the CHiME-6 Challenge, we were solving the same diarization problem [19]. Achieving high diarization accuracy was crucial for high ASR performance, which was the main challenge goal. So we started by reviewing approaches that are effective in diarizing highly overlapping speech. One of the most promising methods of this kind is end-to-end neural diarization (EEND) [20], which performs diarization in a single stage and outputs frame-level activity probabilities for each speaker independently. Another direction that we found promising consists of using pre-computed features of a speaker of interest to draw the system's attention to only their speech. This direction is represented by approaches such as Target-Speaker ASR [21] and, for target-speaker speech extraction, SpeakerBeam [22, 23] and VoiceFilter [24]. Moreover, the TS-ASR approach may be used for simultaneous speech recognition and diarization [25]. One more representative of this direction is the Personal VAD [26] approach, which detects speech boundaries only for the speaker whose acoustic "profile" is fed into the system.

Inspired by the ideas mentioned above, we combined their benefits in our own diarization approach for CHiME-6, referred to as Target-Speaker VAD (TS-VAD). In the course of development, our model evolved from a simple one, very similar to Personal VAD, to a sophisticated one, which processes multi-channel recordings and outputs independent speaker activity streams like the EEND model.
The main difficulty for a successful TS-VAD application is that, unlike the TS-ASR or Personal VAD scenarios, we do not have any pre-computed speaker features and have to compute them directly from severely distorted and highly overlapping speech. Nevertheless, we managed to find a way of applying TS-VAD which reduced DER to 33% and 36% for the CHiME-6 development and evaluation sets, respectively, which is much better than both the BUT system result for DIHARD II and the CHiME-6 baseline result. Thus, although a direct comparison to the DIHARD II results may not be fully correct (see Section 2 for a detailed explanation), the obtained results show the potential of our approach for diarization in CHiME-6-like scenarios. Our TS-VAD implementation is available as part of a new Kaldi recipe for CHiME-6: https://github.com/kaldi-asr/kaldi/tree/master/egs/chime6/s5b_track2

The rest of the paper is organized as follows: in Section 2 we describe in more detail the CHiME-6 Challenge conditions and data; in Section 3 we discuss how the single-channel version of TS-VAD evolved and how the above-mentioned prototypes influenced this process. The multi-channel TS-VAD system is described in Section 4, and the fusion of different systems as well as post-processing of diarization results are discussed in Section 5. Section 6 provides some conclusions and suggests possible future directions.
2. CHiME-6 Challenge
The CHiME-6 Challenge continues a series of challenges on speech recognition in complex real-life acoustic environments. It is based on the same data as the previous CHiME-5 Challenge, which consists of multi-channel recordings from six 4-microphone Microsoft Kinect arrays located in three different rooms. Each recording contains an informal conversation of four persons in a dinner party scenario across the three locations. Besides, two-channel recordings from in-ear microphone pairs worn by each person are also provided for training. The data includes 20 sessions (16 for training, 2 for development, and 2 for evaluation). The goal is to create an ASR system for the multi-channel Kinect recordings in two tracks. In Track 1, participants are allowed to use the boundaries of each speaker's utterances, created manually by listening to the worn-microphone audio, while in Track 2 this information cannot be used. Since knowledge of the boundaries can substantially improve ASR, one of the Track 2 tasks is to obtain a high-quality diarization of the Kinect recordings. This task is very similar to Track 4 of the DIHARD II Challenge. As in DIHARD II, the metrics for this task are DER and Jaccard Error Rate (JER), which are evaluated without a collar and without excluding overlapping segments. Nonetheless, there are some differences as well. Firstly, the CHiME-6 organizers provided software for time-aligning the different channels within a session, which was not available in DIHARD II. All our results are obtained on pre-aligned data. Also, unlike DIHARD II, which allowed participants to use any training data, the CHiME-6 rules limit the training data for Track 2 to the CHiME-6 and VoxCeleb data only.

Initially, the reference rttm files provided by the CHiME-6 organizers were created based on manually set utterance boundaries (the same information was used in Tracks 3-4 of DIHARD II). However, such segmentation contains intra-utterance pauses, which are treated as speech, and unlabelled speaker introductions. Therefore, the organizers provided another reference segmentation with silence segments excluded. This was carried out using triphone GMM-HMM forced alignment of the reference transcripts over the manual segments. Besides, a UEM file was provided, which makes it possible to exclude the speaker introductions from scoring. The significant difference between the reference rttm files in CHiME-6 and DIHARD II (about 30-40% in terms of DER) is the main reason why a direct comparison of the results is not possible.

The CHiME-6 data is rather noisy and reverberated and contains a significant amount of overlapping speech, so it is difficult to perform accurate diarization with clustering-based systems. Statistics of overlapping speech in the development and evaluation datasets are shown in Table 1. If one changes the reference rttms to leave only a single speaker on each overlapping segment, this results in a DER of 25.61%/21.76% (due to miss errors only) for the development and evaluation sets, respectively. These values show the lowest achievable DERs on the CHiME-6 data for clustering-based diarization systems without special overlap processing.

Table 1: Distribution of audio with respect to the number of simultaneously speaking persons.

           0        1        2       3       4
dev     24.05%   54.25%   17.74%   3.49%   0.47%
eval    33.47%   51.52%   12.03%   2.47%   0.51%
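These lower bounds can in fact be reproduced from Table 1 alone, under the assumption that DER here reduces to missed speaker-time divided by total speaker-time. The short script below is an illustrative check, not the challenge scoring tool:

```python
# Reproducing the single-speaker lower bounds from Table 1: if only one
# speaker is kept on each overlapping segment, every additional
# simultaneous speaker contributes missed speech.
def lower_bound_der(frac):
    """frac[k] = fraction of audio with k simultaneous speakers."""
    total_speech = sum(k * f for k, f in enumerate(frac))
    missed = sum((k - 1) * f for k, f in enumerate(frac) if k >= 2)
    return missed / total_speech

dev = [0.2405, 0.5425, 0.1774, 0.0349, 0.0047]
ev = [0.3347, 0.5152, 0.1203, 0.0247, 0.0051]
print(f"dev: {lower_bound_der(dev):.2%}, eval: {lower_bound_der(ev):.2%}")
# -> dev: 25.60%, eval: 21.76% (matching the values quoted above,
#    up to rounding of the Table 1 percentages)
```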
3. Single-channel TS-VAD
Recently, a series of papers on multi-speaker speech processing was published in which models focus on a specific speaker while ignoring the speech of others. These approaches include TS-ASR [21] for target speech recognition, SpeakerBeam [22, 23] and VoiceFilter [24] for target speech extraction, and Personal VAD [26] for target speech detection. Most of them use an acoustic footprint of the target speaker (usually an i-vector), obtained during prior enrollment, to focus on the speech of interest. The Personal VAD model seemed most appropriate for our goals, since it detects each speaker's speech independently. However, this model was trained and tested on concatenated rather than overlapped speech segments, so it was unclear whether it could handle overlapping speech as well. Besides, under the CHiME-6 conditions, it was not trivial to compute i-vectors for each speaker, since the use of manual segmentation was prohibited at test time.

Thus, we started with a purely research question: given an i-vector estimated on manual non-overlapping speech segments of a target speaker, is it possible to detect their speech in overlapping conditions? The answer was generally positive. Our first single-speaker TS-VAD model was very similar to Personal VAD, with the same three targets obtained from the forced alignment, namely silence, target speech, and non-target speech. The model architecture was a simple 3-layer BLSTM with projections [27].

All the experiments were performed with the Kaldi ASR Toolkit [28]. We used the acoustic model training dataset and the i-vector extractor from the Kaldi chime6 recipe. Moreover, we used a "negative" version of each utterance with an i-vector corresponding to a random speaker from the same session and device (without such "negative" examples, the model tends to detect any speech as target speech). Training targets were taken from the tri3 GMM forced alignment: silence and noise phones were treated as the silence class, and the remaining phones as target speech or non-target speech for the original and "negative" utterances, respectively.

The TS-VAD model outputs a sequence of target-speaker presence probabilities for each time frame. To convert it into a segmentation, the simple post-processing described in Section 5 was applied. After multiple experiments with i-vectors computed on non-overlapping regions of the manual segmentation, we managed to obtain DER=66.81% on the development data (for all experiments on manual segmentation, we used CH1 of the reference device, i.e., the device closest to the currently active speaker).

Single-speaker TS-VAD processes each speaker independently, which is intuitively sub-optimal. So, we introduced an additional "mutual" threshold into the post-processing procedure. Considering the set of speech probabilities on the current frame, we observed that there is likely no speech from those speakers whose probabilities are strongly dominated by the maximum probability on this frame. Thus, a speaker's probability was set to zero if it differed from the maximum probability on the current frame by more than the threshold (a sketch is given below). This trick provided a dramatic diarization improvement, reducing DER to 46.12%, which looked promising.
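For clarity, here is a minimal NumPy sketch of this "mutual" threshold; the function name is illustrative, and the threshold value is a tunable parameter not specified above:

```python
import numpy as np

def mutual_threshold(probs, thr):
    """probs: (num_frames, num_speakers) matrix of per-speaker TS-VAD
    speech probabilities. Zero out a speaker's probability on frames
    where it is dominated by the frame maximum by more than `thr`."""
    probs = probs.copy()
    frame_max = probs.max(axis=1, keepdims=True)
    probs[frame_max - probs > thr] = 0.0
    return probs
```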
We also experimented with the EEND model [20], hoping it would work well for the CHiME-6 data. Unfortunately, this was not the case, for several reasons. First of all, relatively short audio pieces are required to apply EEND, since the self-attention module needs to see the whole sequence. But such block-wise processing induces a between-block speaker permutation problem. We used the oracle permutation to estimate the lowest possible DER for the EEND. The oracle DER of the EEND model trained on the VoxCeleb data was comparable to the DER of the baseline diarization, presumably due to the severe acoustic mismatch between the VoxCeleb and CHiME-6 data. The model trained on CHiME-6 data provided a much better oracle DER (about 50%). To resolve the between-block permutations, we extracted baseline x-vectors from the EEND-induced segments and applied several clustering approaches to them. We found that the simple unconstrained k-means algorithm worked best, but the obtained DERs were only slightly better than those of the baseline diarization. Nonetheless, we adopted some ideas from EEND to improve the TS-VAD model.

As the CHiME-6 diarization task is closed (the number of speakers in a session is always four), we decided to use the i-vectors of all speakers in the session as inputs, instead of only the target speaker's i-vector. The initial model was designed to estimate the speech probability for the speaker corresponding to the first i-vector. It was beneficial to average the predicted probabilities obtained on various permutations of the i-vectors. However, much better results were obtained with a TS-VAD model designed to predict speech probabilities for all speakers simultaneously, as the EEND model does. This model with four output layers was trained using a sum of binary cross-entropies as the loss function. We also found that it is essential to process each speaker with the same Speaker Detection (SD) 2-layer BLSTMP and then combine the SD outputs for all speakers with one more BLSTMP layer. The model architecture is shown in Figure 1. Note that the parameters of the SD block are shared across speakers, and it is trained jointly with the whole TS-VAD model.

Figure 1: Single-channel TS-VAD scheme
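To make the structure concrete, below is a minimal PyTorch sketch of this multi-speaker model. It is an illustration only: our implementation is in Kaldi, and all layer sizes here are assumptions. The text above fixes only the overall topology: a shared per-speaker SD block of two BLSTMP layers, one combining BLSTMP layer, and four binary output layers trained with a sum of binary cross-entropies.

```python
import torch
import torch.nn as nn

class TSVAD1C(nn.Module):
    """Sketch of the single-channel multi-speaker TS-VAD model (Figure 1)."""

    def __init__(self, feat_dim=40, ivec_dim=100, hidden=384, proj=160,
                 num_speakers=4):
        super().__init__()
        self.num_speakers = num_speakers
        # Shared Speaker Detection (SD) block: 2-layer BLSTM with projection,
        # applied to (features, speaker i-vector) for each speaker in turn.
        self.sd = nn.LSTM(feat_dim + ivec_dim, hidden, num_layers=2,
                          bidirectional=True, proj_size=proj, batch_first=True)
        # One more BLSTMP layer combines the SD outputs of all speakers.
        self.combine = nn.LSTM(2 * proj * num_speakers, hidden,
                               bidirectional=True, proj_size=proj,
                               batch_first=True)
        # One binary output layer per speaker.
        self.heads = nn.ModuleList(
            nn.Linear(2 * proj, 1) for _ in range(num_speakers))

    def forward(self, feats, ivectors):
        # feats: (B, T, feat_dim); ivectors: (B, num_speakers, ivec_dim)
        sd_outs = []
        for s in range(self.num_speakers):
            ivec = ivectors[:, s:s + 1, :].expand(-1, feats.size(1), -1)
            out, _ = self.sd(torch.cat([feats, ivec], dim=-1))
            sd_outs.append(out)
        combined, _ = self.combine(torch.cat(sd_outs, dim=-1))
        # (B, T, num_speakers) per-frame speech probabilities; training uses
        # nn.BCELoss, i.e., a sum of per-speaker binary cross-entropies.
        return torch.sigmoid(
            torch.cat([head(combined) for head in self.heads], dim=-1))
```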
As we performed all the experiments with the Kaldi ASR Toolkit [28], it was easier to use a 2-class softmax instead of a sigmoid in the output layers. Training targets were 8-dimensional vectors representing four pairs of silence and speech probabilities corresponding to the four speakers. Given an utterance, the targets corresponding to the current speaker were taken directly from the forced alignment. Targets for the three other speakers were obtained by averaging alignments from neighboring overlapping utterances over all the devices and channels (for non-overlapping frames, the targets for speech and silence were 0 and 1, respectively).

To provide more data variability, we performed on-the-fly random permutations of the speakers (i.e., of both i-vectors and targets) during training, as sketched below. It was also beneficial to use mixup data augmentation [29] with our Kaldi-compatible tools presented in [30]. Moreover, we obtained a small DER improvement (about 0.5%) by adding an 800h subset of the VoxCeleb data augmented with artificial room impulse responses. The final multi-speaker model achieved an impressive DER of 37.40% using i-vectors estimated on the manual segmentation.

Note that the "mutual" threshold, which provided a huge DER reduction for the single-speaker TS-VAD, turned out to be totally useless for the multi-speaker model.
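A minimal sketch of this permutation augmentation (the helper below is ours): the i-vectors and the corresponding target streams must be shuffled consistently.

```python
import numpy as np

def permute_speakers(ivectors, targets, rng):
    """ivectors: (num_speakers, ivec_dim); targets: (num_frames, num_speakers).
    Apply the same random speaker permutation to both."""
    perm = rng.permutation(ivectors.shape[0])
    return ivectors[perm], targets[:, perm]

rng = np.random.default_rng(0)
ivecs, tgts = permute_speakers(np.zeros((4, 100)), np.zeros((500, 4)), rng)
```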
After the experiments on manual segments, we switched to the real CHiME-6 task and tried to compute i-vectors from the segmentation provided by the baseline clustering-based diarization (DER=63.42% on the development set).

Unfortunately, the early versions of TS-VAD did not provide any tangible improvement over the baseline results. So, we first focused on improving the baseline diarization. To this end, we applied an improved x-vector extractor based on a Wide ResNet (WRN), which was trained on the VoxCeleb data; the details of the extractor are given in [31]. Applying the baseline Agglomerative Hierarchical Clustering (AHC) based on PLDA scoring to these x-vectors improved DER by about 10% absolute. Then we used the idea from [32] and replaced AHC based on PLDA scores with Spectral Clustering (SC) based on cosine similarities. Using similarity matrix binarization with respect to an automatically chosen threshold [32] provided a consistent DER improvement on both the development and evaluation sets (see Table 2 for the results). Besides, using WRN x-vectors substantially improved the permutation resolution on the EEND segmentation and covered a large fraction of the gap to the oracle DER.

The improved clustering-based diarization provided a good initial estimation of the i-vectors for TS-VAD. The next idea was to re-estimate the i-vectors iteratively, using the segmentation from the previous TS-VAD iteration. However, we found that even better results can be achieved by using the probabilities produced by TS-VAD as soft weights for the re-estimation of the i-vectors. Note that, to ensure robust i-vector estimation, for each speaker we considered only those frames where the speech probability was more than 0.8 of the total speech probability over all speakers on the given frame (see the sketch below). The second iteration provided a significant gain, but the third one did not lead to any improvement. Note that such i-vector estimation provides results as good as i-vectors computed on the manual segmentation: in the same conditions, without WPE and channel averaging, iterative i-vector estimation led to 38% DER, compared to 37.4% on the manual segments. Later we found that the same iterative procedure with the best TS-VAD model, starting from the baseline diarization, also improves DER by about 15% absolute; however, more iterations are required for convergence.
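A minimal NumPy sketch of this frame-selection rule for the soft-weighted i-vector re-estimation (function and variable names are ours; `probs` is the frames-by-speakers matrix of TS-VAD output probabilities):

```python
import numpy as np

def ivector_frames(probs, speaker, ratio=0.8):
    """Return a frame mask and soft weights for re-estimating one speaker's
    i-vector: keep only frames where this speaker's speech probability
    exceeds `ratio` of the total speech probability over all speakers."""
    total = probs.sum(axis=1) + 1e-8
    mask = probs[:, speaker] > ratio * total
    return mask, probs[mask, speaker]  # soft weights for i-vector statistics
```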
4. Multi-channel processing

The single-channel version of TS-VAD (TS-VAD-1C) processes each channel separately, but we conjectured that multi-channel processing might be beneficial. Indeed, we found that multi-channel Weighted Prediction Error (WPE) dereverberation [33, 34] improves the TS-VAD results by about 1% DER absolute. Moreover, averaging the per-channel TS-VAD probabilities provides up to 2% absolute DER reduction.

To process the separate Kinect channels jointly, we investigated a multi-channel TS-VAD model (TS-VAD-MC), which takes as input a combination of the TS-VAD-1C SD block outputs from a set of 10 Kinect recordings. The architecture of this model is presented in Figure 2. The channels of the input Kinect recordings are chosen randomly during training, while CH1 and CH4 are taken at test time. This way of combining information from different channels is more effective than the simple averaging of probabilities used with the TS-VAD-1C model. All the SD vectors for each speaker are passed through a 1-d convolutional layer and then combined by means of a simple attention mechanism (sketched below). The combined attention outputs for all speakers are passed through a single BLSTM layer and converted into a set of per-frame presence/absence probabilities for each speaker.

Figure 2: Multi-channel TS-VAD scheme
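A PyTorch sketch of this per-speaker channel combination follows. The dimensions and the exact attention scoring function are our assumptions; the text above specifies only a 1-d convolution followed by a simple attention over channels.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Combine the per-channel SD vectors of one speaker by attention."""

    def __init__(self, sd_dim=320, conv_dim=128):
        super().__init__()
        self.conv = nn.Conv1d(sd_dim, conv_dim, kernel_size=1)
        self.score = nn.Linear(conv_dim, 1)

    def forward(self, sd):
        # sd: (B, C, T, sd_dim) -- SD outputs for C channels
        B, C, T, D = sd.shape
        x = self.conv(sd.reshape(B * C, T, D).transpose(1, 2)).transpose(1, 2)
        x = x.reshape(B, C, T, -1)
        w = torch.softmax(self.score(x), dim=1)  # attention over channels
        return (w * x).sum(dim=1)                # (B, T, conv_dim)
```

The resulting per-speaker streams are then concatenated and fed to the final BLSTM layer, as in Figure 2.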
Finally, to improve the overall diarization performance, we fused 3 single-channel and 3 multi-channel TS-VAD models by computing a weighted average of their probability streams (see the sketch after Table 2). The fusion weights were tuned to minimize DER on the development set. The diarization results at different stages of our system are presented in Table 2.

Table 2: Diarization results (* stands for DIHARD II reference)

                            DEV             EVAL
                        DER     JER     DER     JER
x-vectors + AHC        63.42   70.83   68.20   72.54
EEND + WRN x-vectors   52.20   57.42   56.01   61.49
WRN x-vectors + AHC    53.45   56.76   63.79   62.02
WRN x-vectors + SC     47.29   49.03   60.10   57.99
+ TS-VAD-1C (it1)      39.19   40.87   45.01   47.03
+ TS-VAD-1C (it2)      35.80   37.38   39.80   41.79
+ TS-VAD-MC            34.59   36.73   37.57   40.51
Fusion
Fusion*                41.76   44.04   40.71   45.32
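The fusion itself is a plain weighted average of probability streams; a minimal sketch (the dev-set-tuned weight values are not reproduced here):

```python
import numpy as np

def fuse(prob_streams, weights):
    """prob_streams: list of (num_frames, num_speakers) arrays from the
    individual TS-VAD systems; weights: fusion weights tuned on dev."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    return sum(wi * p for wi, p in zip(w, prob_streams))
```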
5. Post-processing
To convert the TS-VAD output probabilities into a sequence of segments, a simple post-processing was applied. It includes 51-tap median filtering, binarization with a threshold of 0.4, combining speech segments separated by pauses shorter than 0.3s, and deleting speech segments shorter than 0.2s (a sketch is given after Table 3).

Alternatively, Viterbi decoding was applied for the post-processing. We introduced a simple Hidden Markov Model (HMM) with 11 states representing silence, speech from each of the four speakers without overlaps, and overlapping speech from the 6 possible pairs of speakers (overlaps of three and four speakers were neglected due to their short duration). The emission probabilities in each state were derived from the TS-VAD output probabilities, while the transition probabilities were tuned to minimize DER on the development set. Transitions from the silence state to the two-speaker states and vice versa were prohibited, and the other transition probabilities were shared between all pairs of states with the same number of speakers (7 tunable parameters in total). The most likely state sequence found by the Viterbi search determined the final segments.

The influence of several post-processing techniques on DER is shown in Table 3.

Table 3: DER for different post-processing. T, F, S and V stand for binarization with the threshold, median filtering, short speech/pause processing, and Viterbi decoding, respectively.

Post-processing               DEV     EVAL
Best single TS-VAD   T+F+S    34.59   37.57
Fusion               T        34.73   37.52
Fusion               T+F+S    33.56   36.63
Fusion               V+S      32.84   36.02
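A minimal sketch of the simple (T+F+S) post-processing chain for one speaker's probability stream; the 10 ms frame shift and helper names are our assumptions, while the filter length, threshold, and duration constants follow the text:

```python
import numpy as np
from scipy.signal import medfilt

def probs_to_segments(probs, frame_shift=0.01, thr=0.4,
                      min_pause=0.3, min_speech=0.2):
    """probs: per-frame speech probabilities of one speaker.
    51-tap median filtering, binarization at `thr`, merging pauses
    shorter than `min_pause`, deleting segments shorter than `min_speech`."""
    act = medfilt(probs, kernel_size=51) > thr
    # Frame mask -> list of (start, end) segments in seconds.
    edges = np.diff(act.astype(int), prepend=0, append=0)
    starts = np.where(edges == 1)[0] * frame_shift
    ends = np.where(edges == -1)[0] * frame_shift
    segments = []
    for s, e in zip(starts, ends):
        if segments and s - segments[-1][1] < min_pause:
            segments[-1] = (segments[-1][0], e)   # merge across a short pause
        else:
            segments.append((s, e))
    return [(s, e) for s, e in segments if e - s >= min_speech]
```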
6. Conclusions
We presented a novel approach for the diarization of multi-speaker conversations, which provides state-of-the-art results in a complex multi-channel dinner party scenario. The proposed Target-Speaker VAD detects the speech of every conversation participant, taking their i-vector as an input along with MFCC features. Since a sufficiently good segmentation is required for reliable i-vector estimation, we significantly improved the baseline clustering-based diarization.

It is worth noting that we also tried to replace the i-vectors at the TS-VAD inputs with more discriminative speaker embeddings such as x-vectors, but the results got much worse. We believe the reason is that the relatively simple TS-VAD architecture is unable to match the current MFCC features well to speaker embeddings obtained from a much more complicated model. Another possible reason is severe overfitting due to a sparse embedding space and the small number of speakers in the training data.

Although our final solution is task-dependent (multi-channel input, a fixed number of speakers), we believe the proposed approach is flexible enough to be easily modified for other similar tasks. In the future, we plan to extend our solution to the scenario of informal meetings with an unknown, large number of participants.
7. Acknowledgements
This work was partially financially supported by the Government of the Russian Federation (Grant 08-08). We are grateful to the STC Voice Biometrics Team for the awesome speaker embeddings extractor and valuable discussions.

8. References

[1] G. Sell, D. Snyder, A. McCree, D. Garcia-Romero, J. Villalba, M. Maciejewski, V. Manohar, N. Dehak, D. Povey, S. Watanabe, and S. Khudanpur, "Diarization is hard: Some experiences and lessons learned for the JHU team in the inaugural DIHARD challenge," in INTERSPEECH, 2018, pp. 2808-2812.
[2] M. Diez, F. Landini, L. Burget, J. Rohdin, A. Silnova, K. Žmolíková, O. Novotný, K. Veselý, O. Glembek, O. Plchot, L. Mošner, and P. Matějka, "BUT system for DIHARD speech diarization challenge 2018," in INTERSPEECH, 2018, pp. 2798-2802.
[3] N. Dehak, P. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, "Front-end factor analysis for speaker verification," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, pp. 788-798, Jun 2011.
[4] E. Variani, X. Lei, E. McDermott, I. Moreno, and J. Gonzalez-Dominguez, "Deep neural networks for small footprint text-dependent speaker verification," in ICASSP, 2014, pp. 4052-4056.
[5] L. Wan, Q. Wang, A. Papir, and I. L. Moreno, "Generalized end-to-end loss for speaker verification," arXiv:1710.10467, 2017.
[6] D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur, "X-vectors: Robust DNN embeddings for speaker recognition," in ICASSP, 2018, pp. 5329-5333.
[7] D. Dimitriadis and P. Fousek, "Developing on-line speaker diarization system," in INTERSPEECH, 2017, pp. 2739-2743.
[8] D. Garcia-Romero, D. Snyder, G. Sell, D. Povey, and A. McCree, "Speaker diarization using deep neural network embeddings," in ICASSP, 2017, pp. 4930-4934.
[9] H. Ning, M. Liu, H. Tang, and T. Huang, "A spectral clustering approach to speaker diarization," in ICSLP, vol. 5, 2006, pp. 2178-2181.
[10] S. J. D. Prince and J. H. Elder, "Probabilistic linear discriminant analysis for inferences about identity," in ICCV, 2007, pp. 1-8.
[11] C. Fredouille, S. Bozonnet, and N. Evans, "The LIA-EURECOM RT'09 speaker diarization system," in Proc. RT 2009, NIST Rich Transcription Workshop, 2009.
[12] G. Sell and D. Garcia-Romero, "Diarization resegmentation in the factor analysis subspace," in ICASSP, 2015, pp. 4794-4798.
[13] M. Sahidullah, J. Patino, S. Cornell, R. Yin, S. Sivasankaran, H. Bredin, P. Korshunov, A. Brutti, R. Serizel, E. Vincent, N. Evans, S. Marcel, S. Squartini, and C. Barras, "The Speed submission to DIHARD II: Contributions & lessons learned," arXiv:1911.02388, 2019.
[14] N. Ryant, K. Church, C. Cieri, A. Cristia, J. Du, S. Ganapathy, and M. Liberman, "First DIHARD challenge evaluation plan," Tech. Rep., 2018.
[15] ——, "The second DIHARD diarization challenge: Dataset, task, and baselines," arXiv:1906.07839, 2019.
[16] J. Barker, S. Watanabe, E. Vincent, and J. Trmal, "The fifth 'CHiME' speech separation and recognition challenge: Dataset, task and baselines," in INTERSPEECH, 2018, pp. 1561-1565.
[17] S. Watanabe, M. Mandel, J. Barker, and E. Vincent, "CHiME-6 challenge: Tackling multispeaker speech recognition for unsegmented recordings," arXiv:2004.09249, 2020.
[18] F. Landini, S. Wang, M. Diez, L. Burget, P. Matějka, K. Žmolíková, L. Mošner, A. Silnova, O. Plchot, O. Novotný, H. Zeinali, and J. Rohdin, "BUT system for the second DIHARD speech diarization challenge," arXiv:2002.11356, 2020.
[19] I. Medennikov, M. Korenevsky, T. Prisyach, Y. Khokhlov, M. Korenevskaya, I. Sorokin, T. Timofeeva, A. Mitrofanov, A. Andrusenko, I. Podluzhny, A. Laptev, and A. Romanenko, "The STC system for the CHiME-6 challenge," in CHiME 2020 Workshop on Speech Processing in Everyday Environments, 2020.
[20] Y. Fujita, N. Kanda, S. Horiguchi, Y. Xue, K. Nagamatsu, and S. Watanabe, "End-to-end neural speaker diarization with self-attention," in ASRU, 2019, pp. 296-303.
[21] N. Kanda, S. Horiguchi, R. Takashima, Y. Fujita, K. Nagamatsu, and S. Watanabe, "Auxiliary interference speaker loss for target-speaker speech recognition," in INTERSPEECH, 2019, pp. 236-240.
[22] K. Žmolíková, M. Delcroix, K. Kinoshita, T. Higuchi, A. Ogawa, and T. Nakatani, "Speaker-aware neural network based beamformer for speaker extraction in speech mixtures," in INTERSPEECH, 2017, pp. 2655-2659.
[23] M. Delcroix, K. Zmolikova, K. Kinoshita, A. Ogawa, and T. Nakatani, "Single channel target speaker extraction and recognition with SpeakerBeam," in ICASSP, 2018, pp. 5554-5558.
[24] Q. Wang, H. Muckenhirn, K. Wilson, P. Sridhar, Z. Wu, J. Hershey, R. A. Saurous, R. J. Weiss, Y. Jia, and I. L. Moreno, "VoiceFilter: Targeted voice separation by speaker-conditioned spectrogram masking," arXiv:1810.04826, 2018.
[25] N. Kanda, S. Horiguchi, Y. Fujita, Y. Xue, K. Nagamatsu, and S. Watanabe, "Simultaneous speech recognition and speaker diarization for monaural dialogue recordings with target-speaker acoustic models," in ASRU, 2019, pp. 31-38.
[26] S. Ding, Q. Wang, S.-y. Chang, L. Wan, and I. Moreno, "Personal VAD: Speaker-conditioned voice activity detection," arXiv:1908.04284, 2019.
[27] H. Sak, A. Senior, and F. Beaufays, "Long short-term memory recurrent neural network architectures for large scale acoustic modeling," in INTERSPEECH, 2014, pp. 338-342.
[28] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlíček, Y. Qian, P. Schwarz, J. Silovský, G. Stemmer, and K. Veselý, "The Kaldi speech recognition toolkit," in ASRU, 2011.
[29] H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz, "mixup: Beyond empirical risk minimization," arXiv:1710.09412, 2017.
[30] I. Medennikov, Y. Khokhlov, A. Romanenko, D. Popov, N. Tomashenko, I. Sorokin, and A. Zatvornitskiy, "An investigation of mixup training strategies for acoustic models in ASR," in INTERSPEECH, 2018.
[31] A. Gusev, V. Volokhov, T. Andzhukaev, S. Novoselov, G. Lavrentyeva, M. Volkova, A. Gazizullina, A. Shulipa, A. Gorlanov, A. Avdeeva, A. Ivanov, T. Pekhovsky, and Y. Matveev, "Deep speaker embeddings for far-field speaker recognition on short utterances," arXiv:2002.06033, 2020.
[32] T. Park, K. Han, M. Kumar, and S. Narayanan, "Auto-tuning spectral clustering for speaker diarization using normalized maximum eigengap," IEEE Signal Processing Letters, vol. 27, pp. 381-385, Dec 2019.
[33] T. Yoshioka and T. Nakatani, "Generalization of multi-channel linear prediction methods for blind MIMO impulse response shortening," IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, pp. 2707-2720, Dec 2012.
[34] L. Drude, J. Heymann, C. Boeddeker, and R. Haeb-Umbach, "NARA-WPE: A Python package for weighted prediction error dereverberation in Numpy and Tensorflow for online and offline processing," in 13. ITG Fachtagung Sprachkommunikation (ITG 2018), 2018.