The Hitachi-JHU DIHARD III System: Competitive End-to-End Neural Diarization and X-Vector Clustering Systems Combined by DOVER-Lap
Shota Horiguchi, Nelson Yalta, Paola Garcia, Yuki Takashima, Yawen Xue, Desh Raj, Zili Huang, Yusuke Fujita, Shinji Watanabe, Sanjeev Khudanpur
Hitachi, Ltd. Research & Development Group, Tokyo, Japan
Center for Language and Speech Processing, Johns Hopkins University, Baltimore, MD, USA
{shota.horiguchi.wk, nelson.yalta.wm}@hitachi.com, [email protected]

Abstract—This paper provides a detailed description of the Hitachi-JHU system that was submitted to the Third DIHARD Speech Diarization Challenge. The system outputs the ensemble results of five subsystems: two x-vector-based subsystems, two end-to-end neural diarization-based subsystems, and one hybrid subsystem. We refined each subsystem so that all five became competitive and complementary. After DOVER-Lap-based system combination, the ensemble achieved diarization error rates of 11.58% and 14.09% in Track 1 full and core, and 16.94% and 20.01% in Track 2 full and core, respectively. With these results, we won second place in all the tasks of the challenge.
Index Terms—speaker diarization, x-vector, VBx, EEND, DOVER-Lap
I. NOTABLE HIGHLIGHTS
This technical report describes the Hitachi-JHU system submitted to the Third DIHARD Speech Diarization Challenge [1]. We mainly focused our efforts on how to pick the best of diarization based on x-vector clustering and end-to-end neural speaker diarization (EEND). The highlights of our system are as follows:
• Two x-vector-based subsystems incorporating VBx clustering and heuristic overlap assignment. One is based on a time-delay neural network (TDNN) x-vector extractor following the winning system of the DIHARD II Challenge [2], [3]. The other is based on Res2Net x-vector extractors, which won the VoxCeleb Speaker Recognition Challenge 2020 [4].
• Two EEND-based subsystems, each of which extends the original self-attentive EEND [5] to output diarization results for a variable number of speakers, with improved inference.
• A hybrid subsystem of x-vector clustering and EEND, in which the results of x-vector clustering are updated using EEND as post-processing [6].
• A modified DOVER-Lap [7] to combine the results from the five subsystems above.
• Self-supervised adaptation of the EEND model.

II. DATA RESOURCES
Table I summarizes the corpora we used to train the models that compose our diarization system. We briefly explain each corpus below.
• DIHARD III: focused on "hard" speaker diarization; contains 5-10 minute utterances selected from 11 conversational domains, each domain including approximately 2 hours of audio [1].
• VoxCeleb 1: a large-scale speaker identification dataset with 1,251 speakers and over 100,000 utterances, collected "in the wild" [8].
• VoxCeleb 2: a speaker recognition dataset that contains over a million utterances from over 6,000 speakers under noisy and unconstrained conditions [8].
• Switchboard-2 (Phase I, II, III), Switchboard Cellular (Part 1, 2): English telephone conversation datasets. Their LDC catalog numbers are LDC98S75, LDC99S79, LDC2002S06, LDC2001S13, and LDC2004S07, respectively.
• NIST Speaker Recognition Evaluation (2004, 2005, 2006, 2008): also telephone conversations but not limited to English, composed of the following LDC corpora: LDC2006S44, LDC2011S01, LDC2011S04, LDC2011S09, LDC2011S10, LDC2012S01, LDC2011S05, LDC2011S08.
• MUSAN: a publicly available corpus that consists of music, speech, and noise [9]. The music and noise portions are sometimes used for data augmentation.

III. DETAILED DESCRIPTION OF ALGORITHM
A. Voice Activity Detector
We employed two voice activity detectors (VADs): a SincNet-based VAD [10] and a TDNN-based VAD.
SincNet-based VAD: Our SincNet-based VAD is implemented using the pyannote framework [11]. This VAD model learns to detect speech from the raw waveform using a SincNet [12] followed by BiLSTM layers and fully connected layers. For our experiments, we employed the default configuration provided by pyannote: a SincNet with 80 channels and a kernel size of 251, two BiLSTM layers with 128 cell dimensions, and two fully connected layers of

TABLE I: Corpora we used to train the models in our system.
| Corpus | SincNet VAD | TDNN VAD | TDNN x-vector | Res2Net x-vector | PLDA | Overlap detector | EEND-EDA | SC-EEND |
| DIHARD III development set [1] | ✓ | ✓ | | | ✓ | ✓ | ✓ | ✓ |
| DIHARD III evaluation set [1] (with pseudo labels) | | | | | | | ✓ | |
| VoxCeleb 1 [8] | | | ✓ | ✓ | | | | |
| VoxCeleb 2 [8] | | | ✓ | ✓ | | | | |
| Switchboard-2 Phase I, II, III | | | | | | | ✓ | ✓ |
| Switchboard Cellular Part 1, 2 | | | | | | | ✓ | ✓ |
| NIST Speaker Recognition Evaluation 2004, 2005, 2006, 2008 | | | | | | | ✓ | ✓ |
| MUSAN corpus [9] | | ✓ | ✓ | ✓ | | | ✓ | ✓ |
TABLE II: VAD performance on the DIHARD III development set.

| Method | False alarm (%) | Missed speech (%) |
| SincNet-based VAD | 2.78 | 2.51 |
| TDNN-based VAD | 2.85 | 2.80 |
| Posterior average | 2.58 | 2.55 |
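The posterior-averaging fusion reported in the last row of Table II can be sketched as below. This is a minimal illustration: the 0.5 threshold and 11-frame median filter are placeholder values, not the tuned ones used in the system.

```python
from statistics import median

def fuse_vad(post_a, post_b, threshold=0.5, kernel=11):
    """Average frame-wise speech posteriors from two VADs,
    threshold them, then smooth with a median filter."""
    avg = [(a + b) / 2.0 for a, b in zip(post_a, post_b)]
    decisions = [1 if p > threshold else 0 for p in avg]
    half = kernel // 2
    smoothed = []
    for t in range(len(decisions)):
        lo, hi = max(0, t - half), min(len(decisions), t + half + 1)
        smoothed.append(int(median(decisions[lo:hi])))
    return smoothed
```

A single low-posterior frame surrounded by speech is smoothed away by the median filter, which removes spurious short speech/non-speech flips.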
128 dimensions. We trained the model using the DIHARD III development set for 300 epochs.
TDNN-based VAD: Our TDNN VAD is based on an example Kaldi [13] recipe. The acoustic features are 40-dimensional MFCCs, and the left and right two frames are appended to generate the 200-dimensional input features. The model first transforms the input features with linear discriminant analysis (LDA) estimated with the VAD labels. The transformed features then pass through five TDNN blocks, each consisting of a TDNN layer, a rectified linear unit (ReLU), and a batch normalization layer. In the last two TDNN blocks, to capture long temporal contexts, the mean vector over neighboring frames is computed as an additional input. Finally, a linear layer predicts the speech probability for each frame. The model was trained on the DIHARD III development set for around 10 epochs. We augmented the training data with the noise, music, and babble from the MUSAN corpus [9] and created reverberated speech with simulated room impulse responses [14].

The final VAD results were calculated by averaging the posterior probabilities from the two models, followed by thresholding and median filtering. As shown in Table II, posterior averaging of the two systems achieved a better trade-off between false alarms and missed speech than either individual system.

B. X-vector-based subsystems

1) TDNN (System (1)): The TDNN x-vector-based system consists of two main parts: the TDNN extractor and VBx clustering.
TDNN-based extractor: It employs 40-dimensional filterbanks with a 25 ms window and a 15 ms frame shift. These features are used for embedding extraction as in [15]. The x-vectors were extracted using a sliding window with a frame shift of 0.25 s. The TDNN extractor consists of four TDNN-ReLU layers, each followed by a dense-ReLU. Two dense-ReLU layers are then incorporated before a pooling layer; a final dense-ReLU is included, from which 512-dimensional embeddings are computed. A dense-softmax concludes this TDNN architecture [16].
VBx clustering: To eliminate the need for a tuned agglomerative hierarchical clustering (AHC) stopping threshold, we perform VBx clustering after AHC [15]. VBx clustering is a simplified variational Bayes diarization. It follows a hidden Markov model (HMM) in which each state represents a speaker and the state transitions correspond to speaker turns. The state distributions, or emission probabilities, are Gaussian mixture models constrained by an eigenvoice matrix. The HMM stays in the same speaker state with probability P_loop. The initialization for this system is a probabilistic LDA (PLDA) model. For our experiments, this PLDA is the interpolation of the VoxCeleb PLDA and the in-domain DIHARD III PLDA. Both PLDAs were centered and whitened using the DIHARD III development set. For the TDNN-based system, the x-vectors were projected from 512 to 220 dimensions using an LDA, the PLDA interpolation weight alpha was set to 0.50, and P_loop to 0.80. We finally applied overlap assignment, described in Section III-B3, to obtain the final diarization results from this subsystem.
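As a toy illustration of the AHC stage that precedes VBx, the sketch below clusters segment-level x-vectors with average-linkage agglomerative clustering. The cosine distance and the stopping threshold are illustrative simplifications: the actual system scores pairs with PLDA and hands the resulting clusters to VBx as initialization.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (nu * nv)

def ahc(xvectors, stop_threshold):
    """Average-linkage AHC: repeatedly merge the closest pair of clusters
    until the smallest inter-cluster distance exceeds the threshold."""
    clusters = [[i] for i in range(len(xvectors))]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = sum(cosine_distance(xvectors[a], xvectors[b])
                        for a in clusters[i] for b in clusters[j])
                d /= len(clusters[i]) * len(clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        if best[0] > stop_threshold:
            break
        _, i, j = best
        clusters[i] += clusters[j]
        del clusters[j]
    return clusters
```

With PLDA scores in place of cosine distance, the same loop structure applies; VBx then refines the resulting speaker assignment.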
2) Res2Net (System (2)): Initially proposed for image recognition, Res2Net has been applied to speaker embedding extraction because it provides highly accurate speaker clustering [17]. The Res2Net-based extractor uses the default configuration described in [17]: 80-dimensional log filterbanks as input, multi-head attentive pooling with 16 attention heads that learns to weight each frame, and additive angular margin softmax (AAM) [18] with a margin of 0.1 and a scale of 30 as the training criterion. For our experiments, we employed four extractors:
i) Res2Net-UN: the default configuration of a Res2Net with 23 layers, utterance normalization, log compression, an AAM margin of 0.1, and an AAM scale of 30.
ii) Res2Net-BN: similar to Res2Net-UN, with a batch normalization layer instead of utterance normalization and ln compression.
iii) Res2Net-BN-Large: a Res2Net with 50 layers and a configuration otherwise similar to Res2Net-BN.
iv) Res2Net-UN-Large: a Res2Net with 50 layers and a configuration otherwise similar to Res2Net-UN. Additionally, it uses SpecAugment [19] for data augmentation.
We employed the VoxCeleb 1 and VoxCeleb 2 sets [8] for training, which provided 7,323 speakers and over 1M recordings. We augmented the data following the Kaldi VoxCeleb recipe. Each audio recording is randomly chunked into subsegments of length between . and . that are fed into the models. Similarly to the TDNN-based system, the 128-dimensional embeddings were passed through an LDA without dimensionality reduction; the PLDA interpolation weight alpha was set to 0.10, and P_loop to 0.80. Once the results from the four extractors were obtained, we combined them using the modified DOVER-Lap explained in Section III-E.
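The AAM training criterion adds an angular margin to the target-class angle before scaling the logit. A minimal per-logit sketch with the stated margin m = 0.1 and scale s = 30 (the single-logit interface is illustrative; in training this is applied across all class logits before the softmax):

```python
import math

def aam_logit(cos_theta, is_target, margin=0.1, scale=30.0):
    """Additive angular margin: for the target class, add margin m to
    the angle theta = arccos(cos_theta) before rescaling by s."""
    if is_target:
        theta = math.acos(max(-1.0, min(1.0, cos_theta)))
        return scale * math.cos(theta + margin)
    return scale * cos_theta
```

Penalizing the target-class cosine in this way forces embeddings of the same speaker to cluster more tightly on the hypersphere.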
3) Overlap detection and assignment: For the Res2Net and TDNN x-vector subsystems, we performed overlap detection with an approach similar to the SincNet-based VAD, the only difference being that the classifier distinguishes overlapping speech from non-overlapping speech. For each detected frame, we assigned the temporally closest other speaker as the second speaker. Table III shows the diarization error rates (DERs) and Jaccard error rates (JERs) on the DIHARD III development set using the x-vector-based subsystems.
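The second-speaker assignment heuristic can be sketched as follows, assuming one primary speaker label per frame; the frame-level representation is a simplification of the actual segment-level bookkeeping.

```python
def assign_second_speaker(primary, overlap_frames):
    """For each frame flagged as overlapped, add the temporally closest
    *other* speaker as the second speaker.
    primary: per-frame speaker ids; overlap_frames: detected frame indices."""
    second = {}
    for t in overlap_frames:
        best, best_dist = None, None
        for u, spk in enumerate(primary):
            if spk != primary[t]:
                d = abs(u - t)
                if best_dist is None or d < best_dist:
                    best, best_dist = spk, d
        second[t] = best
    return second
```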
C. EEND-based subsystems
We employed EEND-EDA [20] and SC-EEND [21] as EEND-based subsystems, each of which can handle a flexible number of speakers. The inputs to the EEND-based models were log-Mel filterbanks, with different configurations for each model. For EEND-EDA, 23-dimensional log-Mel filterbanks were extracted from the recordings with a frame length of 25 ms and a frame shift of 10 ms. Each frame was then concatenated with the left and right seven frames to construct 345-dimensional features. We subsampled these by a factor of 10 to obtain one input feature every 100 ms during training, and by a factor of five to obtain one every 50 ms during inference. For SC-EEND, we used 40-dimensional log-Mel filterbanks from 16 kHz recordings and concatenated each frame with the left and right 14 frames to construct 1160-dimensional features. The subsampling factor was set to 20 during pretraining on simulated mixtures and to 10 during adaptation and inference.
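The splicing and subsampling described above can be sketched as below; edge padding at the recording boundaries is our assumption. With context=7 and factor=10 this reproduces the 345-dimensional EEND-EDA features at 100 ms resolution.

```python
def splice_and_subsample(frames, context, factor):
    """Concatenate each frame with its +-context neighbours
    (repeating edge frames at the boundaries), then keep every
    `factor`-th spliced frame."""
    T = len(frames)
    spliced = []
    for t in range(T):
        feat = []
        for u in range(t - context, t + context + 1):
            u = min(max(u, 0), T - 1)  # clamp to valid frame indices
            feat.extend(frames[u])
        spliced.append(feat)
    return spliced[::factor]
```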
1) EEND-EDA (System (3)): EEND-EDA [20] calculates posteriors as dot products between time-frame-wise embeddings and speaker-wise attractors, which are computed from the embeddings by an encoder-decoder attractor calculation module (EDA). The training procedure depends on the simulated mixtures summarized in Table IV and on the DIHARD III corpus. We created the mixtures using the script provided in the EEND repository with the various β values shown in Table IV, which determine the average duration of silence between utterances. We first trained the model on Sim2spk for 100 epochs, then finetuned it on the concatenation of Sim1spk to Sim5spk for another 75 epochs, and finally adapted it on the DIHARD III development set for 200 epochs. We used the Adam optimizer [22] for all training, with the Noam scheduler [23] set to 100,000 warm-up iterations for training on simulated mixtures and with a fixed learning rate for adaptation.

During inference, we used audio dereverberated with weighted prediction error (WPE) [24]. We estimated a dereverberation filter on the short-time Fourier transform (STFT) spectrum using the entire audio recording as a single input block. The STFT features are computed using a window of 32 ms (512 samples) and a shift of 8 ms (128 samples) for 16 kHz audio. Using 5 iterations, we set the prediction delay and the filter length to 3 and 30, respectively.

Because EEND-based models conduct speaker diarization and voice activity detection simultaneously, they must be incorporated with oracle speech segments (for Track 1) or an accurate external VAD (for Track 2) to fit the DIHARD tasks. Thus, once the diarization results were obtained with the EEND-EDA model, we filtered false alarms and recovered missed speech by assigning the speakers with the highest posterior probabilities according to the VAD. We call these procedures VAD post-processing.

Even though the adaptation used the DIHARD III development set, which contains mixtures of at most 10 speakers, the model rarely produces diarization results with more than five speakers because its pretraining used mixtures with at most five speakers. Therefore, we produce diarization results for more than five speakers with the following iterative inference:
i) decide the maximum number of speakers K (≤ 5) to decode,
ii) decode at most K speakers' diarization results,
iii) stop inference if the estimated number of speakers is less than K; otherwise continue to the next step,
iv) select the frames in which all the decoded speakers are inactive and go back to i).
We varied K ∈ {1, ..., 5} at the first iteration and fixed it from the second iteration onward. Finally, the five estimated results are combined using the modified DOVER-Lap described in Section III-E to obtain the final results of the EEND-EDA-based subsystem.

Table V shows the DERs and JERs of the EEND-EDA-based and SC-EEND-based subsystems. For EEND-EDA (Table Va), it clearly indicates that the VAD post-processing and the iterative inference improved the diarization performance.

(Kaldi VoxCeleb recipe: https://github.com/kaldi-asr/kaldi/blob/master/egs/voxceleb/v2)
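The VAD post-processing (filtering false alarms and recovering missed speech) can be sketched frame-wise as below; the list-based interface is illustrative.

```python
def vad_postprocess(posteriors, activities, vad):
    """Make frame-wise diarization consistent with an external VAD.
    posteriors/activities: per-frame lists over speakers; vad: per-frame 0/1."""
    out = []
    for post, act, v in zip(posteriors, activities, vad):
        if not v:
            # filter false alarms: no speaker outside VAD speech regions
            out.append([0] * len(act))
        elif not any(act):
            # recover missed speech: activate the highest-posterior speaker
            k = max(range(len(post)), key=post.__getitem__)
            out.append([1 if i == k else 0 for i in range(len(post))])
        else:
            out.append(list(act))
    return out
```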
TABLE III: DERs / JERs (%) of x-vector-based subsystems on the DIHARD III development set.

(a) TDNN (System (1))
| Method | DER / JER (%) |
| x-vector + VBx | 16.33 / 34.18 |
| x-vector + VBx + OvlAssign | 13.87 / 32.73 |

(b) Res2Net (System (2))
| Method | Res2Net-BN | Res2Net-UN | Res2Net-BN-Large | Res2Net-UN-Large |
| x-vector + VBx | 17.24 / 37.12 | 17.04 / 36.17 | 16.85 / 35.86 | 17.08 / 35.95 |
| x-vector + VBx + OvlAssign | 14.89 / 35.64 | 14.72 / 34.65 | 14.56 / 34.31 | 14.74 / 34.40 |
| Modified DOVER-Lap | 14.04 / 34.29 | | | |

TABLE IV: Simulated mixtures used for EEND-EDA training. Sim1spk, Sim2spk, Sim3spk, and Sim4spk are the same as the ones used in the EEND-EDA paper [20].

| Dataset | #Speakers | #Mixtures | β | Overlap ratio (%) |
| Sim1spk | 1 | 100,000 | 2 | 0.0 |
| Sim2spk | 2 | 100,000 | 2 | 34.1 |
| Sim3spk | 3 | 100,000 | 5 | 34.2 |
| Sim4spk | 4 | 100,000 | 9 | 31.5 |
| Sim5spk | 5 | 100,000 | 13 | 30.3 |

TABLE V: DERs and JERs (%) on the DIHARD III development set using EEND-based models. FA: false alarm, MI: missed speech.

(a) EEND-EDA (System (3))
| Method | DER | JER |
| EEND-EDA | 18.77 | 38.98 |
| + filter FA | 17.33 | 37.92 |
| + recover MI | 13.08 | 35.38 |
| + iterative inference (K = 5) | 13.35 | 34.19 |
| + iterative inference (K ∈ {1, ..., 5}) & DOVER-Lap | 12.92 | 33.85 |

(b) SC-EEND (System (4))
| Method | DER | JER |
| SC-EEND | 18.61 | 39.19 |
| + filter FA | 16.02 | 37.46 |
| + recover MI | 13.13 | 35.35 |

2) SC-EEND (System (4)): SC-EEND estimates each speaker's speech activities one by one, conditioned on the previously estimated speech activities. We used stacked Conformer encoders [25] instead of the Transformer encoders used in the original SC-EEND. The model was first trained on simulated mixtures, each containing at most four speakers, for 100 epochs using the Adam optimizer with the same scheduler as for EEND-EDA. The model was then initialized with the average weights of the last 10 epochs and trained again on the simulated mixtures for an additional 100 epochs. Finally, the model was adapted on the DIHARD III development set, starting from the average weights of the last 10 epochs of the second-round pretraining, for an additional 200 epochs using the Adam optimizer with a fixed learning rate. The details of the simulated mixtures are described in the SC-EEND paper [21].

For SC-EEND, we also used dereverberated audio and applied VAD post-processing (filtering false alarms and recovering missed speech) as described in Section III-C1. However, the Conformer encoders are order-dependent, so we cannot run decoding only on selected frames that are not necessarily equally spaced along the time axis. Therefore, we did not apply iterative inference to the SC-EEND model. The results of SC-EEND with the step-by-step improvements from VAD post-processing are shown in Table Vb.

(EEND simulation script: https://github.com/hitachi-speech/EEND/blob/master/egs/callhome/v1/run_prepare_shared_eda.sh)

D. Hybrid subsystem (System (5))

We also used EEND as post-processing (EENDasP) [6] to refine the diarization results obtained from the TDNN-based x-vectors described in Section III-B1. In EENDasP, two speakers are iteratively selected from the results and their segments are updated using the EEND model. In the original paper, the EEND-EDA model was trained to output only two-speaker results; for our system, we instead used the first two speakers' outputs from the model trained in Section III-C1. By applying EENDasP to the TDNN-based x-vectors with VBx clustering but without heuristic overlap assignment, the DER improved from 16.33% to 12.63%.

E. System fusion
To combine multiple diarization results, we used DOVER-Lap [7] with a modification. The original DOVER-Lap assigns uniformly divided regions to each speaker when multiple speakers are weighted equally in the label voting stage. However, we found that this increases missed speech. This is obvious when the same three hypotheses with overlaps are input to DOVER-Lap: the speakers included in the hypotheses are always tied, so the overlapped regions in the hypotheses are divided and assigned to each speaker, which results in a combined hypothesis with no overlap. Thus, we assign all the tied speakers to such regions without any division.

When we combine diarization results from various systems, we sometimes know that some systems are more accurate than others. Therefore, we introduced hypothesis-wise manual weighting to DOVER-Lap. In the original DOVER [26] and DOVER-Lap, the input hypotheses H_1, ..., H_k, ..., H_K are ranked by their average DER to all the other hypotheses, i.e., by the score s_k:

    s_k = (1 / (K - 1)) Σ_{k' ∈ {1,...,K}, k' ≠ k} DER(H_k, H_{k'}),    (1)

where DER(H_k, H_{k'}) calculates the diarization error rate of the estimation H_{k'} against the reference H_k. In our system, we used w_k s_k instead of s_k, where w_k ∈ R_+ is a weighting value, to control the importance of each hypothesis.

TABLE VI: Comparison between the original and modified DOVER-Lap on the DIHARD III development set. MI: missed speech, FA: false alarm, CF: speaker confusion.

| Method | MI | FA | CF | DER |
| (1) TDNN-based x-vector + VBx + OvlAssign | 5.36 | 1.93 | 6.58 | 13.87 |
| (2) Res2Net-based x-vector + VBx + OvlAssign | 5.47 | 1.89 | 6.68 | 14.04 |
| (3) EEND-EDA | 6.54 | 1.36 | 5.02 | 12.92 |
| (4) SC-EEND | 4.85 | 1.96 | 6.32 | 13.13 |
| (5) TDNN-based x-vector + VBx + EENDasP | 6.53 | 1.32 | 4.79 | 12.63 |
| DOVER-Lap | 6.96 | 0.77 | 4.33 | 12.07 |
| Modified DOVER-Lap (System (6)) | 5.53 | 0.93 | 4.27 | 10.73 |
| Modified DOVER-Lap + manual weighting | 5.54 | 0.93 | 4.21 | 10.68 |

Table VI shows the DERs and their breakdown on the DIHARD III development set. Note that, due to time constraints, the manual weighting was only used to combine the five hypotheses in System (9); it was not used to combine the Res2Net results in System (2), the EEND-EDA iterative inference results in Systems (3) and (7), or the five-system fusion for System (6). The weights to combine Systems (1)(2)(4)(7)(8) were set to w_(1) = 2, w_(2) = 2, w_(4) = 1, w_(7) = 4, w_(8) = 3, which were determined using the development set.

F. Self-supervised adaptation
After the first system fusion, we applied self-supervised adaptation (SSA) to the EEND-EDA model. The estimated results were used as pseudo labels for the DIHARD III evaluation set. We redid the adaptation step of Section III-C1 on the concatenation of the DIHARD III development set with ground-truth labels and the evaluation set with pseudo labels. With the new model, we recomputed the results of EEND-EDA (System (3)), EENDasP (System (5)), and DOVER-Lap (System (6)), yielding Systems (7), (8), and (9), respectively. Note that we used different pseudo labels for Track 1 and Track 2 because the oracle VAD was only available in Track 1.

IV. RESULTS
Table VII shows the results on the DIHARD III development and evaluation sets. The results on the evaluation set are from the official scoring server. Every subsystem significantly outperformed the baseline system [1]. System (5) performed best as a single subsystem without self-supervised adaptation, but the other four subsystems showed comparable performance. Our best system achieved DERs of 11.58% and 14.09% on the full and core evaluation sets in Track 1, respectively. It also achieved 16.94% and 20.01% in Track 2.

V. HARDWARE REQUIREMENTS
We ran our experiments on two different infrastructures. One is equipped with Intel Xeon Gold 6123 CPUs @ 2.60 GHz using up to 56 threads with 750 GB of RAM, and up to eight NVIDIA V100 GPUs, each with 32 GB of VRAM and 15.7 single-precision TFLOPS. On this infrastructure, we trained and ran the VAD models, the Res2Net models, the PLDA model, the EEND-based systems, and DOVER-Lap.

The other is the JHU CLSP cluster, which is equipped with Intel Xeon E5-2680 v2 CPUs @ 2.80 GHz using up to 54 threads and 60 GB of RAM, and up to four NVIDIA GeForce GTX 1080 Ti GPUs, each with 11 GB of VRAM and 10.6 single-precision TFLOPS. The TDNN-based extractor, the VBx clustering, and the overlap detection and assignment model were trained on this cluster.

The processing time for WPE dereverberation is .54 s for 1 minute of audio. Our framework's components were trained with PyTorch [27], except for the TDNN-based extractor, which was trained with Kaldi [13].

The SincNet VAD was trained on a single NVIDIA V100 GPU and required about 22 hours of training. The processing of the labels required .132 s for 1 minute of audio. The TDNN VAD was trained with 3 to 8 NVIDIA GeForce GTX 1080 Ti GPUs (we gradually increased the number of GPU jobs during training) for 1 hour.

The TDNN x-vector extraction, VBx, and the overlap detection were conducted on the CLSP cluster. The overlap detector required 40 CPUs with a decoding time of 30 minutes for all datasets, including the development and evaluation sets. The TDNN x-vector extractor was trained on 4-8 GPUs and required approximately 48 hours. The PLDAs, trained on CPUs, required around 30 minutes to train on the VoxCeleb datasets and 10 minutes on the DIHARD III dataset. The scoring took around .25 s for each audio file. All the procedures were parallelized using 30 to 40 jobs to reduce the computational time.

The Res2Net-based x-vector extractors were trained using four NVIDIA V100 GPUs and required approximately 54 hours of training. The processing time for x-vector extraction using this model is .52 s for 1 minute of audio.

The EEND-based models were trained using a single NVIDIA V100 GPU. For EEND-EDA, training took 30 hours on Sim2spk, 325 hours for finetuning on the concatenation of Sim1spk to Sim5spk, and 1.5 hours for adaptation on the DIHARD III development set. The processing time of iterative inference and VAD post-processing was about 30 minutes. Self-supervised adaptation takes about 3 hours, almost double the adaptation on the development set, because we additionally used the evaluation set with pseudo labels. For SC-EEND, training took 200 hours on simulated mixtures, 2 hours for adaptation, and 5 minutes for inference.

The processing time of EENDasP, given the results of TDNN-based x-vectors + VBx, was about 5 minutes for the entire development set. DOVER-Lap of the five systems was based on the official repository (https://github.com/desh2608/dover-lap) and took about 3 minutes to process the development set.

TABLE VII: DERs / JERs (%) on Track 1 & 2.
| System | Track 1 Dev full | Track 1 Dev core | Track 1 Eval full | Track 1 Eval core | Track 2 Dev full | Track 2 Dev core | Track 2 Eval full | Track 2 Eval core |
| Baseline [1] | 19.41 / 41.66 | 20.25 / 46.02 | 19.25 / 42.45 | 20.65 / 47.74 | 21.71 / 43.66 | 22.28 / 47.75 | 25.36 / 46.95 | 27.34 / 51.91 |
| (1) TDNN-based x-vector + VBx + OvlAssign | 13.87 / 32.73 | 14.88 / 36.72 | 15.65 / 33.71 | 18.20 / 38.42 | 17.61 / 36.03 | 18.64 / 39.92 | 21.47 / 37.83 | 24.58 / 42.02 |
| (2) Res2Net-based x-vector + VBx + OvlAssign | 14.04 / 34.29 | 15.18 / 38.80 | 15.81 / 35.53 | 18.47 / 40.47 | 17.26 / 37.17 | 18.39 / 41.56 | 21.37 / 39.59 | 24.64 / 44.49 |
| (3) EEND-EDA | 12.92 / 33.85 | 13.95 / 35.37 | 13.95 / 35.37 | 17.28 / 41.97 | 15.90 / 35.94 | 18.50 / 41.71 | 19.04 / 38.89 | 22.84 / 45.27 |
| (4) SC-EEND | 13.13 / 35.35 | 16.05 / 41.80 | 15.16 / 38.62 | 19.14 / 46.04 | 16.16 / 37.52 | 19.00 / 43.74 | 20.30 / 42.19 | 24.75 / 49.36 |
| (5) TDNN-based x-vector + VBx + EENDasP | 12.63 / 31.52 | 14.61 / 36.28 | 13.30 / 33.02 | 15.92 / 38.29 | 15.94 / 34.11 | 18.09 / 38.97 | 18.13 / 35.82 | 21.31 / 40.78 |
| (6) DOVER-Lap of (1)(2)(3)(4)(5) | 10.73 / 31.39 | 12.56 / 36.88 | 11.83 / 32.85 | 14.41 / 38.81 | 14.13 / 34.32 | 16.06 / 39.75 | 17.21 / 37.64 | 20.34 / 43.40 |
| (7) EEND-EDA (SSA) | 12.95 / 33.98 | 15.69 / 40.03 | 12.74 / 34.08 | 15.86 / 40.44 | 15.03 / 33.64 | 17.52 / 39.15 | 17.81 / 38.32 | 21.31 / 44.32 |
| (8) TDNN-based x-vector + VBx + EENDasP (SSA) | 12.54 / 31.32 | 14.55 / – | – | – | – | – | – | – |
| (9) DOVER-Lap of (1)(2)(4)(7)(8) | – / – | – / 36.21 | 11.58 / 32.37 | 14.09 / 38.25 | – / – | – / 38.77 | 16.94 / 36.31 | 20.01 / 41.78 |

The trained models and the generated outputs had a total disk usage of 1.2 TB.
REFERENCES
[1] N. Ryant, P. Singh, V. Krishnamohan, R. Varma, K. Church, C. Cieri, J. Du, S. Ganapathy, and M. Liberman, "The third DIHARD diarization challenge," arXiv:2012.01477, 2020.
[2] F. Landini, S. Wang, M. Diez, L. Burget, P. Matějka, K. Žmolíková, L. Mošner, O. Plchot, O. Novotný, H. Zeinali, and J. Rohdin, "BUT system description for DIHARD Speech Diarization Challenge 2019," arXiv:1910.08847, 2019.
[3] F. Landini, S. Wang, M. Diez, L. Burget, P. Matějka, K. Žmolíková, L. Mošner, A. Silnova, O. Plchot, O. Novotný, H. Zeinali, and J. Rohdin, "BUT system for the Second DIHARD Speech Diarization Challenge," in ICASSP, 2020, pp. 6529-6533.
[4] X. Xiao, N. Kanda, Z. Chen, T. Zhou, T. Yoshioka, S. Chen, Y. Zhao, G. Liu, Y. Wu, J. Wu, S. Liu, J. Li, and Y. Gong, "Microsoft speaker diarization system for the VoxCeleb Speaker Recognition Challenge 2020," arXiv:2010.11458, 2020.
[5] Y. Fujita, N. Kanda, S. Horiguchi, Y. Xue, K. Nagamatsu, and S. Watanabe, "End-to-end neural speaker diarization with self-attention," in ASRU, 2019, pp. 296-303.
[6] S. Horiguchi, P. García, Y. Fujita, S. Watanabe, and K. Nagamatsu, "End-to-end speaker diarization as post-processing," in ICASSP, 2021 (to appear).
[7] D. Raj, L. P. Garcia-Perera, Z. Huang, S. Watanabe, D. Povey, A. Stolcke, and S. Khudanpur, "DOVER-Lap: A method for combining overlap-aware diarization outputs," in SLT, 2021, pp. 881-888.
[8] A. Nagrani, J. S. Chung, W. Xie, and A. Zisserman, "VoxCeleb: Large-scale speaker verification in the wild," Computer Speech & Language, vol. 60, p. 101027, 2020.
[9] D. Snyder, G. Chen, and D. Povey, "MUSAN: A music, speech, and noise corpus," arXiv:1510.08484, 2015.
[10] M. Lavechin, M.-P. Gill, R. Bousbib, H. Bredin, and L. P. Garcia-Perera, "End-to-end domain-adversarial voice activity detection," in INTERSPEECH, 2020, pp. 3685-3689.
[11] H. Bredin, R. Yin, J. M. Coria, G. Gelly, P. Korshunov, et al., "pyannote.audio: neural building blocks for speaker diarization," in ICASSP, 2020, pp. 7124-7128.
[12] M. Ravanelli and Y. Bengio, "Speaker recognition from raw waveform with SincNet," in SLT, 2018, pp. 1021-1028.
[13] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, et al., "The Kaldi speech recognition toolkit," in ASRU, 2011.
[14] T. Ko, V. Peddinti, D. Povey, M. L. Seltzer, and S. Khudanpur, "A study on data augmentation of reverberant speech for robust speech recognition," in ICASSP, 2017, pp. 5220-5224.
[15] M. Diez, L. Burget, and P. Matejka, "Speaker diarization based on Bayesian HMM with eigenvoice priors," in Odyssey, 2018, pp. 102-109.
[16] D. Snyder, D. Garcia-Romero, G. Sell, A. McCree, D. Povey, and S. Khudanpur, "Speaker recognition for multi-speaker conversations using x-vectors," in ICASSP, 2019, pp. 5796-5800.
[17] T. Zhou, Y. Zhao, and J. Wu, "ResNeXt and Res2Net structures for speaker verification," in SLT, 2021, pp. 301-307.
[18] J. Deng, J. Guo, N. Xue, and S. Zafeiriou, "ArcFace: Additive angular margin loss for deep face recognition," in CVPR, 2019, pp. 4685-4694.
[19] D. S. Park, W. Chan, Y. Zhang, C.-C. Chiu, B. Zoph, E. D. Cubuk, and Q. V. Le, "SpecAugment: A simple data augmentation method for automatic speech recognition," in INTERSPEECH, 2019, pp. 2613-2617.
[20] S. Horiguchi, Y. Fujita, S. Watanabe, Y. Xue, and K. Nagamatsu, "End-to-end speaker diarization for an unknown number of speakers with encoder-decoder based attractors," in INTERSPEECH, 2020, pp. 269-273.
[21] Y. Fujita, S. Watanabe, S. Horiguchi, Y. Xue, J. Shi, and K. Nagamatsu, "Neural speaker diarization with speaker-wise chain rule," arXiv:2006.01796, 2020.
[22] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," in ICLR, 2015.
[23] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention is all you need," in NeurIPS, 2017, pp. 5998-6008.
[24] T. Nakatani, T. Yoshioka, K. Kinoshita, M. Miyoshi, and B.-H. Juang, "Speech dereverberation based on variance-normalized delayed linear prediction," IEEE TASLP, vol. 18, no. 7, pp. 1717-1731, 2010.
[25] A. Gulati, J. Qin, C.-C. Chiu, N. Parmar, Y. Zhang, J. Yu, W. Han, S. Wang, Z. Zhang, Y. Wu, and R. Pang, "Conformer: Convolution-augmented transformer for speech recognition," in INTERSPEECH, 2020, pp. 5036-5040.
[26] A. Stolcke and T. Yoshioka, "DOVER: A method for combining diarization outputs," in ASRU, 2019, pp. 757-763.
[27] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala, "PyTorch: An imperative style, high-performance deep learning library," in NeurIPS, 2019.