The DKU-Duke-Lenovo System Description for the Third DIHARD Speech Diarization Challenge
Weiqing Wang, Department of ECE, Duke University, Durham ([email protected])
Qingjian Lin, Lenovo Voice Department, Lenovo AI Lab, Beijing ([email protected])
Danwei Cai, Department of ECE, Duke University, Durham ([email protected])
Lin Yang, Lenovo Voice Department, Lenovo AI Lab, Beijing ([email protected])
Ming Li, Data Science Research Center, Duke Kunshan University, Kunshan ([email protected])
Abstract—In this paper, we present the submitted system for the third DIHARD Speech Diarization Challenge from the DKU-Duke-Lenovo team. Our system consists of several modules: voice activity detection (VAD), segmentation, speaker embedding extraction, attentive similarity scoring, and agglomerative hierarchical clustering. In addition, target-speaker VAD (TSVAD) is used on the phone call data to further improve the performance. Our final submitted system achieves a DER of 15.43% on the core evaluation set and 13.39% on the full evaluation set for task 1, and a DER of 21.63% on the core evaluation set and 18.90% on the full evaluation set for task 2.
Index Terms—Speaker Diarization, Speaker Recognition, Deep Learning, Self-attention, Target-speaker Voice Activity Detection
I. INTRODUCTION
Speaker diarization is the task of breaking up audio into homogeneous pieces that belong to the same speaker, and it aims to determine "who spoke when" in a continuous audio recording. A traditional speaker diarization system usually contains several modules: voice activity detection (VAD), segmentation, speaker embedding extraction, and clustering. While this kind of traditional speaker diarization system has been successful in many domains, including meetings, interviews, and conversations, it is still difficult to transfer this success to more challenging corpora, such as web videos, speech in the wild, child language recordings, etc. [1]. One of the problems is recognizing the speakers in overlapping regions, and an extremely noisy background is also harmful to a diarization system. To draw more researchers' attention to such challenging data, the third DIHARD speech diarization challenge was held, where the data is drawn from a diverse sampling of sources [2].

It is difficult for a traditional speaker diarization system to find the speakers in an overlapping region, but some pre- and post-processing can be employed to address this problem. In the VoxCeleb Speaker Recognition Challenge 2020 (VoxSRC-20) [3], Xiao et al. [4] employed conformer-based continuous speech separation (CSS) as pre-processing to separate the overlapping speech. Besides, Medennikov et al. [18] used target-speaker voice activity detection (TSVAD) as post-processing to predict the activity of each speaker on each time frame in the CHiME-6 challenge. Recently, an end-to-end system was proposed in [5], which can also recognize speakers in overlap and shows better performance than the traditional method on the CallHome corpus.

Our submitted system contains several modules. First, we partition the data into conversational telephone speech (CTS) and non-conversational telephone speech (NCTS), since the CTS data is upsampled to 16 kHz from 8 kHz audio signals. Second, for the CTS and NCTS data, we train an 8 kHz and a 16 kHz speaker embedding extractor, respectively, to extract embeddings for audio segments. Third, we perform different clustering methods on the CTS and NCTS data, including agglomerative hierarchical clustering (AHC) and spectral clustering (SC). Finally, we employ TSVAD on the CTS data, which significantly improves the performance. For task 2, an additional ResNet-based VAD model is employed to remove the non-speech regions from the data.

The rest of this paper is organized as follows: Section 2 describes the details of the datasets we used in this challenge. Section 3 introduces our submitted systems and algorithms for different tasks. Section 4 presents the experimental results and analysis. Finally, Section 5 concludes this paper.
II. DATA PARTITION AND DATA RESOURCES
From the evaluation plan, we notice that the conversational telephone speech (CTS) data are upsampled from 8 kHz audio signals while the others are 16 kHz audio signals. In addition, the CTS recordings only contain two speakers. Considering that the CTS data are so different from the remaining non-conversational telephone speech (NCTS) 16 kHz audio signals, we build two different systems for the CTS data and the NCTS data. For the NCTS data, we employ the system described in [6]. For the CTS data, we first use AHC to determine the homogeneous speaker regions.
Fig. 1. The architecture of the VAD model: log Mel-filterbank energies are fed into a ResNet18, followed by global average pooling, a 2-layer Bi-LSTM, and linear layers with a sigmoid output that predicts the frame-level speech likelihood.
Then, we extract a speaker embedding for each speaker and perform TSVAD to get the diarization results.

To partition the evaluation set into CTS and NCTS data, we extract the STFT spectrogram of the first 100 seconds of each recording and compare the maximum value in the frequency bins above 4 kHz against a threshold. If this maximum value is greater than the threshold, the recording is classified as NCTS; otherwise, it is CTS data. The threshold is 0.07, which is obtained from the development set. Finally, we downsample the CTS data to 8 kHz. A minimal sketch of this partition rule is given at the end of this section.

For the NCTS data, we use VoxCeleb 1 & 2 [7] as the training dataset for speaker embedding extraction. The AMI meeting corpus [8], the ICSI meeting corpus [9], and the VoxConverse dev set [3] are used for similarity measurement. The MUSAN dataset [10] is employed for data augmentation.

For the CTS data, we first downsample the VoxCeleb 1 & 2 data to 8 kHz and then train another speaker embedding model that is suitable for 8 kHz data. Finally, the TSVAD model is trained on a collection of SRE databases, including SRE 2004, 2005, 2006, 2008, and Switchboard.
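To make the partition rule concrete, the following is a minimal sketch; librosa is used here only for illustration, and the exact spectrogram normalization against which the 0.07 threshold is applied is an assumption, since it is not specified above.

```python
import librosa
import numpy as np

def is_ncts(wav_path, threshold=0.07, seconds=100):
    """Classify a recording as NCTS (True) or CTS (False) by checking whether
    any energy is present above 4 kHz in the first 100 seconds."""
    y, sr = librosa.load(wav_path, sr=16000, duration=seconds)
    # Magnitude STFT; the max-normalization below is illustrative -- the 0.07
    # threshold tuned on the development set depends on how the spectrogram is scaled.
    spec = np.abs(librosa.stft(y, n_fft=512, hop_length=160))
    spec = spec / (spec.max() + 1e-8)
    freqs = librosa.fft_frequencies(sr=sr, n_fft=512)
    high_band = spec[freqs > 4000, :]
    return high_band.max() > threshold
```

A recording flagged as CTS would then be downsampled to 8 kHz before entering the CTS pipeline.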
III. DETAILED DESCRIPTION OF ALGORITHM
A. VAD
The architecture of the VAD model is shown in Figure 1. The VAD model consists of a ResNet18 [11], a global average pooling (GAP) layer, a 2-layer Bi-LSTM with 64 units per direction and a dropout rate of 0.5, and two fully-connected layers followed by a sigmoid function. The widths (numbers of channels) of the residual blocks are {16, 32, 64, 128}, and the corresponding strides are {(1, 1), (1, 2), (1, 2), (1, 2)}. The dimensions of the two fully-connected layers are 64 and 1, respectively. Given a Mel-filterbank feature map, the ResNet18 extracts feature maps for speech and non-speech regions. Then, the GAP layer is applied to each channel to obtain a C-dimensional vector for each frame, where C is the number of channels. Finally, a 2-layer Bi-LSTM captures the sequential information, and a fully-connected layer predicts the frame-level speech likelihood.

The VAD model is trained on the DIHARD III development set, where 90% of the data is used for training and the remaining data for validation. Data augmentation with the MUSAN and RIRS corpora is employed to improve the performance, where ambient noise and music are used as background additive noise and RIRS is used for reverberation. We do not split the data into CTS and NCTS for VAD model training.

The acoustic features are 32-dimensional log Mel-filterbank energies with a frame length of 25 ms and a hop size of 10 ms. The data is broken up into 4 s segments with a shift of 2 s. During the training phase, we employ the stochastic gradient descent (SGD) optimizer and the binary cross-entropy (BCE) loss with an initial learning rate of 0.1. The learning rate decreases by a factor of 0.1 every 20 epochs. For evaluation, we also break up all data into 4 s segments with 2 s overlap. The output in the overlapping region between two consecutive segments is the mean of the predictions of these segments. The decision threshold is set to 0.5.
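A minimal PyTorch sketch of this architecture follows. It reflects our reading of the description above, not a released implementation; the padding, the two blocks per stage, the initial convolution, and the placement of the 0.5 dropout inside the LSTM are assumptions.

```python
import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    """Residual block with two 3x3 convolutions; stride only downsamples frequency."""
    def __init__(self, in_ch, out_ch, stride):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        self.shortcut = nn.Sequential()
        if stride != (1, 1) or in_ch != out_ch:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_ch))

    def forward(self, x):
        out = torch.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return torch.relu(out + self.shortcut(x))

class ResNetVAD(nn.Module):
    """ResNet18-style trunk -> per-frame pooling over frequency -> Bi-LSTM -> sigmoid."""
    def __init__(self, widths=(16, 32, 64, 128),
                 strides=((1, 1), (1, 2), (1, 2), (1, 2))):
        super().__init__()
        layers = [nn.Conv2d(1, widths[0], 3, padding=1, bias=False),
                  nn.BatchNorm2d(widths[0]), nn.ReLU()]
        in_ch = widths[0]
        for w, s in zip(widths, strides):
            # two basic blocks per stage, as in ResNet18 (assumed)
            layers += [BasicBlock(in_ch, w, s), BasicBlock(w, w, (1, 1))]
            in_ch = w
        self.resnet = nn.Sequential(*layers)
        self.lstm = nn.LSTM(widths[-1], 64, num_layers=2,
                            bidirectional=True, batch_first=True, dropout=0.5)
        self.fc1 = nn.Linear(128, 64)
        self.fc2 = nn.Linear(64, 1)

    def forward(self, feats):
        # feats: (batch, frames, 32) log Mel-filterbank energies
        x = feats.unsqueeze(1)              # (B, 1, T, F)
        x = self.resnet(x)                  # (B, C, T, F'); time resolution preserved
        x = x.mean(dim=3).transpose(1, 2)   # average over frequency -> (B, T, C)
        x, _ = self.lstm(x)                 # (B, T, 128)
        return torch.sigmoid(self.fc2(torch.relu(self.fc1(x)))).squeeze(-1)
```

At inference, 4 s segments with 2 s overlap would be scored and the frame predictions in overlapping regions averaged before applying the 0.5 decision threshold, as described above.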
B. Speaker Embedding Extraction

We adopt the same structure as in [12] for the speaker embedding model, including three components: a front-end pattern extractor, an encoder layer, and a back-end classifier. We employ a ResNet34 as the front-end pattern extractor, where the widths (numbers of channels) of the residual blocks are {...}. Then, a global statistic pooling (GSP) layer projects the variable-length input to a fixed-length vector; this vector contains the mean and standard deviation of the output feature maps (a minimal sketch of this pooling step is given at the end of this subsection). Finally, a fully-connected layer extracts the 128-dimensional speaker embedding. We use ArcFace [13] (s=32, m=0.2) as the classifier. The detailed configuration of the neural network is the same as in [14].

We also perform data augmentation with the MUSAN and RIRS datasets. For the MUSAN corpus, we use ambient noise, music, television, and babble noise as background additive noise. For the RIRS corpus, we only use audio from small and medium rooms and convolve it with the training data.

The acoustic features are 80-dimensional log Mel-filterbank energies with a frame length of 25 ms and a hop size of 10 ms. The extracted features are mean-normalized before being fed into the deep speaker network. We train two speaker embedding models. One is trained with 16 kHz data and is used for the NCTS data. The other is trained with 8 kHz downsampled data and is used in the AHC and TSVAD models for the CTS data.
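The sketch below illustrates the global statistic pooling step; the tensor layout and the axes over which the statistics are computed are assumptions, since they are not spelled out above.

```python
import torch

def global_statistic_pooling(feature_maps: torch.Tensor) -> torch.Tensor:
    """Project a variable-length ResNet output to a fixed-length vector by
    concatenating the mean and standard deviation over the time axis.

    feature_maps: (batch, channels, freq, frames) output of the ResNet34 (assumed layout).
    returns:      (batch, 2 * channels * freq) utterance-level vector.
    """
    b = feature_maps.size(0)
    x = feature_maps.reshape(b, -1, feature_maps.size(-1))  # (B, C*F, T)
    mean = x.mean(dim=-1)
    std = x.std(dim=-1)
    return torch.cat([mean, std], dim=-1)
```

The fully-connected layer that follows would map this vector to the 128-dimensional embedding trained with the ArcFace loss (s=32, m=0.2).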
C. Segmentation

For the NCTS data, we employ uniform segmentation with a window length of 1.5 s and a shift of 0.75 s on the speech regions for training, and a window length of 1.5 s and a shift of 0.25 s on the speech regions for inference.

For the CTS data, we first employ uniform segmentation with a window length of 0.5 s and a shift of 0.25 s on the speech regions. Then, we extract a speaker embedding for each segment and merge two consecutive segments if the cosine similarity of these two segments is greater than a predefined threshold.
Fig. 2. The architecture of the Att-v2s model: a speaker embedding sequence passes through a linear layer, a Transformer encoder, and linear layers with a sigmoid output.
After two segments are merged into a new segment, the embedding of this new segment becomes the mean of the embeddings of the previous two segments. We merge segments recursively until all cosine similarities between consecutive segments are lower than the threshold. The threshold is set to 0.6, which is tuned on the development set.
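The following is a minimal sketch of this recursive merging, assuming each segment carries a start time, an end time, and an embedding vector; the names are illustrative.

```python
import numpy as np

def merge_segments(segments, embeddings, threshold=0.6):
    """Recursively merge consecutive segments whose embeddings have a
    cosine similarity above `threshold`.

    segments:   list of (start, end) tuples in seconds.
    embeddings: list of 1-D numpy arrays, one per segment.
    """
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

    segments, embeddings = list(segments), [np.asarray(e) for e in embeddings]
    merged = True
    while merged:
        merged = False
        i = 0
        while i < len(segments) - 1:
            if cosine(embeddings[i], embeddings[i + 1]) > threshold:
                # Merge i and i+1: extend the time span, average the embeddings.
                segments[i] = (segments[i][0], segments[i + 1][1])
                embeddings[i] = (embeddings[i] + embeddings[i + 1]) / 2.0
                del segments[i + 1], embeddings[i + 1]
                merged = True
            else:
                i += 1
    return segments, embeddings
```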
D. Similarity Measurement and Clustering
For the NCTS data, we employ an attention-based neural network to measure the similarity between two segments. The network architecture and training process are the same as the attentive vector-to-sequence (Att-v2s) scoring in [6]. The architecture of this transformer-based model consists of a multi-head self-attention module and several linear layers, as Figure 2 shows.

The input $m_i$ is a sequence where each frame is a concatenation of two embeddings. The similarity matrix $S$ is then extracted as follows:

$$S_i = [S_{i1}, S_{i2}, \ldots, S_{in}] = f_{\mathrm{att}}(m_i) \quad (1)$$

$$m_i = \begin{bmatrix} x_i & x_i & \cdots & x_i \\ x_1 & x_2 & \cdots & x_n \end{bmatrix}, \quad (2)$$

where $S_i$ is the i-th row of the similarity matrix $S$, $m_i$ is the i-th row of the network input, and $x_i$ is the speaker embedding of the i-th segment. For the j-th entry $m_{ij} = \begin{bmatrix} x_i \\ x_j \end{bmatrix}$ of $m_i$, the corresponding output is $S_{ij}$. A sketch of this pairwise input construction is given at the end of this subsection.

The first linear layer contains 256 units. The self-attention-based encoder contains two heads with 128 attention units. The dimensions of the last two linear layers are 1024 and 1, followed by a sigmoid function. During the training phase, we employ the Diaconis augmentation [15] on the embedding sequences with a probability of 0.5 for data augmentation. The binary cross-entropy (BCE) loss function and the stochastic gradient descent (SGD) optimizer are employed with an initial learning rate of 0.01. The learning rate decreases twice to 0.0001 with a factor of 0.1. Then, we finetune the model on the whole development set for 30 epochs with a fixed learning rate of 0.0001, without using a validation set. For inference, we use segments with a window length of 1.5 s and a shift of 0.25 s. Finally, we employ spectral clustering (SC) [16] to obtain the diarization result. For more details, please refer to [6].

For the CTS data, we use the cosine distance to measure the similarity between two segments. Then, we perform AHC to cluster these segments. Note that this clustering step is not intended to obtain the final diarization result. We want to find the speech regions for each speaker, and these speech regions should contain as little overlap as possible. Since we already know that the CTS data only contain two speakers, we can cluster all segments into two clusters, where the center of each cluster is the mean of all its speaker embeddings. Since our purpose is to find the two speakers' speech without overlap, we set a high stopping threshold of 0.6. These two clusters will later be used to extract the target speaker embeddings for TSVAD.

After we obtain these two clusters, we can still assign the remaining segments to them to get final diarization results for comparison. The center embedding of each cluster is fixed. Once the cosine distance between the speaker embedding of a segment and each cluster center embedding is lower than another predefined threshold, we consider it an overlapping segment, and it is added to both clusters. This threshold is set to 0.0, which is tuned on the development set.

In the next subsection, we use these speech regions to extract the speaker embedding for each speaker and perform TSVAD to obtain a more accurate result.
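To make Eq. (2) concrete, the sketch below builds the pairwise input $m_i$ for every anchor segment and assembles the similarity matrix from a scoring function; `f_att` stands for the trained Att-v2s network and is only a placeholder here.

```python
import numpy as np

def build_pairwise_input(embeddings: np.ndarray, i: int) -> np.ndarray:
    """Build m_i of Eq. (2): each row j is the concatenation [x_i ; x_j].

    embeddings: (n, d) matrix of segment embeddings x_1 ... x_n.
    returns:    (n, 2 * d) sequence fed to the Att-v2s network.
    """
    n = embeddings.shape[0]
    anchor = np.repeat(embeddings[i][None, :], n, axis=0)   # x_i copied n times
    return np.concatenate([anchor, embeddings], axis=1)

def similarity_matrix(embeddings: np.ndarray, f_att) -> np.ndarray:
    """Stack the rows S_i = f_att(m_i) of Eq. (1) into the full matrix S."""
    n = embeddings.shape[0]
    return np.stack([f_att(build_pairwise_input(embeddings, i)) for i in range(n)])
```

The resulting matrix would then serve as the affinity matrix for spectral clustering [16] on the NCTS data.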
Fig. 3. The architecture of the TSVAD model: the mixture speech is fed to a ResNet34 and a linear layer to produce frame-level speaker features; the target speaker embedding, extracted from the target speaker's speech by the speaker embedding extractor, is copied and concatenated to each frame; Bi-LSTM and linear layers with a sigmoid output predict the target speaker's activity.

E. TSVAD

We only perform TSVAD on the CTS data. The model is similar to the models in [17], [18], but the training strategy and network architecture are different. First, we only detect the speech regions, which means that the task becomes a binary classification. Another difference is that we use a ResNet34 to further extract frame-level speaker identity information instead of directly using acoustic features. Figure 3 shows the structure of our TSVAD model. First, a ResNet with a fully-connected layer extracts a 128-dimensional frame-level speaker identity vector. Then, the target speaker embedding is concatenated to each frame of the speaker identity vectors as the input of a target speaker detector, which consists of two Bi-LSTM layers. Finally, a fully-connected layer predicts whether a frame contains the target speaker. The speaker embedding extractor is the same as the model described in the Speaker Embedding Extraction subsection.

In the training phase, the parameters of the ResNet34 are initialized from the front-end ResNet of the speaker embedding extractor. These ResNet34 parameters are frozen during the early training phase, and we only update the parameters of the Bi-LSTM and linear layers. After these parameters converge, we unfreeze the parameters of the ResNet34 on the left of Figure 3. Then we train the whole model for several epochs until convergence.

The acoustic features are 80-dimensional log Mel-filterbank energies with a frame length of 25 ms and a hop size of 10 ms. During the training phase, we first use the Kaldi toolkit [19] to get the VAD labels for the training data. We randomly select 8 s of a speaker's audio to extract the target speaker embedding and randomly select 4 s to 20 s of speech as the input of the ResNet34 to extract the frame-level speaker identity, as shown in Figure 3. The learning rate is set to 0.0001 when the ResNet34 is frozen and 0.00001 when it is unfrozen. The stochastic gradient descent (SGD) optimizer and the binary cross-entropy (BCE) loss are employed. During the finetuning stage, we use the first 41 recordings as the finetuning set and the remaining 20 recordings as the validation set. The learning rate is set to 0.00001.

During the inference phase, some post-processing is employed to obtain the final diarization result. First, we perform 11-tap median filtering on the output of the neural network. Then we decide the target speech for each speaker with a predefined threshold of 0.65. Note that we only perform TSVAD on the speech regions. Thus, some frames may be misclassified as non-speech because the output for each target speaker is lower than the threshold. For these frames, we choose the speaker with the larger output as the target speaker.
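A minimal PyTorch sketch of the detector part follows, assuming the pre-trained frame-level feature extractor (the ResNet34 plus linear layer above) is available as `frame_encoder`; the hidden sizes other than the 128-dimensional frame features are assumptions.

```python
import torch
import torch.nn as nn

class TSVADDetector(nn.Module):
    """Frame-level speaker features concatenated with a copied target speaker
    embedding, followed by two Bi-LSTM layers and a sigmoid output."""
    def __init__(self, frame_encoder: nn.Module, feat_dim=128, emb_dim=128, hidden=128):
        super().__init__()
        self.frame_encoder = frame_encoder   # ResNet34 + linear, pre-trained
        self.lstm = nn.LSTM(feat_dim + emb_dim, hidden, num_layers=2,
                            bidirectional=True, batch_first=True)
        self.fc = nn.Linear(2 * hidden, 1)

    def forward(self, feats, target_emb):
        # feats: (B, T, 80) log Mel-filterbank energies of the mixture speech
        # target_emb: (B, emb_dim) embedding of the target speaker
        frame_feats = self.frame_encoder(feats)                   # (B, T', feat_dim)
        copied = target_emb.unsqueeze(1).expand(-1, frame_feats.size(1), -1)
        x = torch.cat([frame_feats, copied], dim=-1)              # concat per frame
        x, _ = self.lstm(x)
        return torch.sigmoid(self.fc(x)).squeeze(-1)              # (B, T') activity

def set_frozen(module: nn.Module, frozen: bool):
    """Freeze or unfreeze the ResNet34 front-end between the two training stages."""
    for p in module.parameters():
        p.requires_grad = not frozen
```

In the first stage one would call `set_frozen(model.frame_encoder, True)` and train only the Bi-LSTM and linear layers; after convergence, `set_frozen(model.frame_encoder, False)` unfreezes the front-end for joint training. At inference, the per-speaker outputs would be median-filtered (11 taps) and thresholded at 0.65, as described above.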
IV. EXPERIMENTAL RESULTS
A. VAD
In the DIHARD III challenge, the VAD labels for the evaluation set are available in task 1 but must not be used in task 2, so we do not test our VAD model on the evaluation set. The model is trained on 90% of the development set and validated on the remaining 10% of the data. Table I shows that the training accuracy is 96.8% and the validation accuracy is 94.9%, which indicates that our model is not overfitted.
TABLE I
VAD ACCURACY ON THE DEVELOPMENT SET

              Training set    Validation set
Accuracy      96.8%           94.9%
B. Clustering
Table II shows the results on the NCTS data and the CTS data of the development set in task 1. For the NCTS data, we provide the DER before finetuning on the development set. For the CTS data, we provide the DER of AHC clustering on the development set. Note that all results on the CTS data are computed on the last 20 recordings. For the whole dataset, the result is the combination of the CTS data and the NCTS data on the development set. Table III shows the total DER on the evaluation set.
TABLE II
SYSTEM PERFORMANCE (DER) ON THE DEVELOPMENT DATASET (TASK 1)

Dataset        Method             DER (%)
NCTS           att-v2s + SC       16.05
CTS            Cosine + AHC       15.07
CTS            TSVAD              10.60
CTS (adapt)    TSVAD round 1       7.80
CTS (adapt)    TSVAD round 2       7.63
C. TSVAD
Table II also shows the results of the TSVAD model on the CTS data. We first directly evaluate the performance on the CTS dataset. Then, we use the first 41 recordings of the CTS data for finetuning (adapt) and test on the remaining 20 recordings. After obtaining the diarization results (round 1) from the TSVAD model, we extract each speaker's speech from these results and feed it to our TSVAD model again to get new results (round 2). The results show that the DER decreases further in the 2nd round, but it no longer changes in later rounds.

The results show that the TSVAD model significantly reduces the DER from 15.07% to 7.63%. Table III shows the total DER with TSVAD on the evaluation set. Since we only submitted our TSVAD-based results for task 2, some entries in Table III are missing.
D. Discussion
Our TSVAD model shows good performance on the CTS data. From the experiments in [18], it seems that the x-vector is not as good as the i-vector as the target speaker's embedding. However, in our experiments, the speaker embedding extracted by the ResNet-based model shows good performance compared with the i-vector-based method. The main reason may be that we use a ResNet34 to further extract the frame-level speaker features, which can help the later Bi-LSTM layers to find the relationship between each frame and the target speaker's embedding. Besides, we copy the parameters of the pre-trained speaker embedding extractor to the ResNet34 that extracts the frame-level speaker identity, which speeds up convergence.
TABLE III
SYSTEM PERFORMANCE (DER) ON THE EVALUATION DATASET (TASKS 1 AND 2)

Task     Dataset                      Method                          DER on full set (%)    DER on core set (%)
Task 1   NCTS (adapt) & CTS           att-v2s + SC & Cosine + AHC     16.34                  17.03
Task 1   NCTS (adapt) & CTS (adapt)   att-v2s + SC & TSVAD round 2    13.39                  15.43
Task 2   NCTS (adapt) & CTS           att-v2s + SC & Cosine + AHC     -                      -
Task 2   NCTS (adapt) & CTS (adapt)   att-v2s + SC & TSVAD round 2    18.90                  21.63
Although TSVAD shows good performance on the CTS data, it performs poorly on the NCTS data. The reason may be that the NCTS data come from various domains and some recordings are extremely noisy. Our speaker embedding cannot capture enough speaker identity information in such noisy recordings. We will train our model for each domain in the future.
V. CONCLUSION
In this paper, we provide a detailed description of our diarization system. We break up the dataset into CTS and NCTS data and evaluate the performance on each. We also employ TSVAD on the CTS data to find the overlapping regions and reduce the DER. In task 1, our final submission achieves a DER of 13.39% and 15.43% on the full and core evaluation sets, respectively. In task 2, we achieve a DER of 18.90% and 21.63% on the full and core evaluation sets. We rank 4th place in task 1 and 3rd place in task 2.
REFERENCES

[1] N. Ryant, E. Bergelson, K. Church, A. Cristia, J. Du, S. Ganapathy, S. Khudanpur, D. Kowalski, M. Krishnamoorthy, R. Kulshreshta et al., "Enhancement and analysis of conversational speech: JSALT 2017," in Proc. ICASSP. IEEE, 2018, pp. 5154–5158.
[2] N. Ryant, K. Church, C. Cieri, J. Du, S. Ganapathy, and M. Liberman, "Third DIHARD challenge evaluation plan," arXiv preprint arXiv:2006.05815, 2020.
[3] A. Nagrani, J. S. Chung, J. Huh, A. Brown, E. Coto, W. Xie, M. McLaren, D. A. Reynolds, and A. Zisserman, "VoxSRC 2020: The second VoxCeleb speaker recognition challenge," arXiv preprint arXiv:2012.06867, 2020.
[4] X. Xiao, N. Kanda, Z. Chen, T. Zhou, T. Yoshioka, Y. Zhao, G. Liu, J. Wu, J. Li, and Y. Gong, "Microsoft speaker diarization system for the VoxCeleb speaker recognition challenge 2020," arXiv preprint arXiv:2010.11458, 2020.
[5] S. Horiguchi, Y. Fujita, S. Watanabe, Y. Xue, and K. Nagamatsu, "End-to-end speaker diarization for an unknown number of speakers with encoder-decoder based attractors," arXiv preprint arXiv:2005.09921, 2020.
[6] Q. Lin, Y. Hou, and M. Li, "Self-attentive similarity measurement strategies in speaker diarization," in Proc. Interspeech 2020, 2020, pp. 284–288. [Online]. Available: http://dx.doi.org/10.21437/Interspeech.2020-1908
[7] A. Nagrani, J. S. Chung, and A. Zisserman, "VoxCeleb: A large-scale speaker identification dataset," in Proc. Interspeech, 2017.
[8] I. McCowan, J. Carletta, W. Kraaij, S. Ashby, S. Bourban, M. Flynn, M. Guillemot, T. Hain, J. Kadlec, V. Karaiskos et al., "The AMI meeting corpus," in Proceedings of the 5th International Conference on Methods and Techniques in Behavioral Research, vol. 88. Citeseer, 2005, p. 100.
[9] A. Janin, D. Baron, J. Edwards, D. Ellis, D. Gelbart, N. Morgan, B. Peskin, T. Pfau, E. Shriberg, A. Stolcke et al., "The ICSI meeting corpus," in Proc. ICASSP, vol. 1. IEEE, 2003, pp. I–I.
[10] D. Snyder, G. Chen, and D. Povey, "MUSAN: A music, speech, and noise corpus," arXiv preprint arXiv:1510.08484, 2015.
[11] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[12] W. Cai, J. Chen, and M. Li, "Exploring the encoding layer and loss function in end-to-end speaker and language recognition system," arXiv preprint arXiv:1804.05160, 2018.
[13] J. Deng, J. Guo, N. Xue, and S. Zafeiriou, "ArcFace: Additive angular margin loss for deep face recognition," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 4690–4699.
[14] X. Qin, M. Li, H. Bu, R. K. Das, W. Rao, S. Narayanan, and H. Li, "The FFSVC 2020 evaluation plan," arXiv preprint arXiv:2002.00387, 2020.
[15] Q. Li, F. L. Kreyssig, C. Zhang, and P. C. Woodland, "Discriminative neural clustering for speaker diarisation," arXiv preprint arXiv:1910.09703, 2019.
[16] U. von Luxburg, "A tutorial on spectral clustering," Statistics and Computing, vol. 17, no. 4, pp. 395–416, 2007.
[17] S. Ding, Q. Wang, S.-y. Chang, L. Wan, and I. L. Moreno, "Personal VAD: Speaker-conditioned voice activity detection," arXiv preprint arXiv:1908.04284, 2019.
[18] I. Medennikov, M. Korenevsky, T. Prisyach, Y. Khokhlov, M. Korenevskaya, I. Sorokin, T. Timofeeva, A. Mitrofanov, A. Andrusenko, I. Podluzhny et al., "Target-speaker voice activity detection: A novel approach for multi-speaker diarization in a dinner party scenario," arXiv preprint arXiv:2005.07272, 2020.
[19] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, J. Silovsky, G. Stemmer, and K. Vesely, "The Kaldi speech recognition toolkit," in Proc. IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), 2011.