The HUAWEI Speaker Diarisation System for the VoxCeleb Speaker Diarisation Challenge
Renyu Wang∗, Ruilin Tong, Yu Ting Yeung, Xiao Chen

Huawei Noah's Ark Lab, China
{wangrenyu1, tongruilin, yeung.yu.ting, chen.xiao2}@huawei.com

∗ Corresponding author

Abstract
This paper describes the system setup of our submission to the speaker diarisation track (Track 4) of the VoxCeleb Speaker Recognition Challenge 2020. Our diarisation system uses a well-trained neural network based speech enhancement model as a pre-processing front-end for the input speech signals. We replace conventional energy-based voice activity detection (VAD) with a neural network based VAD. The neural network based VAD provides more accurate annotation of speech segments containing only background music, noise, and other interference, which is crucial to diarisation performance. We apply agglomerative hierarchical clustering (AHC) of x-vectors and variational Bayesian hidden Markov model (VB-HMM) based iterative clustering for speaker clustering. Experimental results demonstrate that our proposed system achieves substantial improvements over the baseline system, yielding a diarisation error rate (DER) of 10.45% and a Jaccard error rate (JER) of 22.46% on the evaluation set.
Index Terms: speaker diarisation, speech enhancement, voice activity detection, variational Bayesian
1. Introduction
Speaker diarisation is the task of annotating a conversation into homogeneous segments of the same speaker, often referred to as the "who spoke when" problem [1, 2, 3]. It is an interesting and challenging research topic due to the complex scenarios of real-life conversations. A diarisation system often serves as a pre-processing step for different speech applications, such as speaker tracking, speech transcription, and meeting summarisation.

A conventional speaker diarisation system usually consists of two main components, namely speech segmentation and speaker clustering. A speech segmentation component usually consists of a voice activity detection (VAD) module and a speaker change point detection module that split an audio stream into homogeneous segments. Ideally, each segment should contain only one speaker. Methods based on generative models and neural discriminant models [4, 5, 6] have achieved considerable performance in various segmentation tasks. A speaker clustering algorithm usually belongs to one of two categories, bottom-up or top-down [3, 7]. A classic bottom-up approach such as agglomerative hierarchical clustering (AHC) [8] merges speech segments iteratively until one or more pre-defined stopping criteria are reached. In contrast, a top-down approach splits speech segments into new speaker clusters according to splitting prerequisites [9]. We refer readers to [7] for further comparison of the two approaches.

Recent advances in sequence modelling with deep neural networks (DNN) have led to supervised speaker diarisation systems with end-to-end DNN models [10, 11, 12]. These end-to-end DNN models perform online speaker diarisation without segmenting and clustering an input audio stream. However, end-to-end methods require large amounts of human-annotated or simulated conversations as training data. Preparing training data that suits real-life scenarios is a main challenge for end-to-end diarisation models.

System performance of speaker diarisation continues to improve for both conventional systems and end-to-end DNN models. However, the lack of a common task has fragmented research, with individual groups focusing on different datasets or domains. Research efforts and challenges focusing on diarisation of real-life scenarios [13, 14] catalyse the development of speaker diarisation systems for practical applications.

The diarisation track of the VoxCeleb Speaker Recognition Challenge 2020 is launched as a satellite event of Interspeech 2020 to facilitate the study of speaker diarisation on videos collected "in the wild" [15]. The dataset includes multiple real-life scenarios such as talk shows, news broadcasts, interviews, and vlogs, and covers a large number of speakers. Short rapid exchanges, cross-talk, applause, background music, noise, and reverberation are common in the recordings, all of which degrade diarisation performance. For example, noisy non-speech segments and speech segments contaminated by noise are easily mishandled as new speaker clusters. When these errors accumulate, the quality of the final diarisation output can be significantly degraded, and probably unusable. Therefore, speech front-end pre-processing is essential for speaker diarisation. Improved speech quality generally leads to a higher performance upper bound for a speaker diarisation system [16].

In this paper, we describe our system and experimental results for this challenging diarisation task.
Our system is still based on the methodology of a conventional speaker diarisation system. First, we build a baseline system with energy-based VAD, x-vector speaker representation, and AHC-based clustering [17]. For the advanced system, we add a DNN-based speech enhancement model with a long short-term memory (LSTM) architecture as the speech pre-processing front-end, and replace the energy-based VAD with a DNN-based VAD. After these modifications, diarisation errors due to mis-classifying noise and music segments as speech are reduced significantly, leading to substantial improvement in diarisation performance.

This paper is organised as follows. We present the structure of our diarisation system in Section 2. Experimental results are presented in Section 3, followed by conclusions in Section 4.
2. System Structure
Our speaker diarisation system consists of several components, including speech enhancement, voice activity detection, speaker feature representation, speaker segmentation and clustering.

Figure 1: Framework of the Huawei speaker diarisation system.
Figure 2: Comparison of spectrograms with background music and applause for the proposed DNN-based speech enhancement. (a) Original speech segment. (b) Enhanced speech segment.
The overall structure of the proposed speaker diarisation system is shown in Figure 1.
2.1. Speech Enhancement

An effective speech pre-processing front-end should remove most of the background noise and music while retaining most of the speaker information. Due to the limitations of their model assumptions, traditional speech enhancement methods such as Wiener filtering [3] and minimum mean square error (MMSE) estimators [18] perform poorly on non-stationary noise and introduce artifacts into the speech spectrum. These lead to loss of speaker information and degrade the performance of a speaker diarisation task. Therefore, we adopt a DNN-based speech enhancement model with an advanced LSTM architecture, specially designed hidden layers, and multiple learning targets that predict both log-power spectra (LPS) features and the ideal ratio mask (IRM). In order to retain most of the speaker information, we apply the densely connected progressive learning strategy proposed in [19]: the input feature and the estimated targets are spliced together for learning the next targets of higher signal-to-noise ratio (SNR). We apply the MMSE criterion for parameter optimisation.

A sample speech segment selected from the evaluation set is shown in Figure 2, where we compare spectrograms before and after speech enhancement. In the original spectrogram, the speaker's voice is accompanied by cheers, applause, and music. After speech enhancement, a significant amount of interference has been removed, while the spectral components of the speech signal are generally retained.
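As a concrete illustration, below is a minimal PyTorch sketch of a multi-target enhancement model of this kind: a stacked LSTM with one head regressing clean LPS features and one head predicting the IRM, trained with an MSE loss on both targets. The class name, layer sizes, and two-layer depth are illustrative assumptions, and the densely connected progressive learning of [19] is omitted for brevity; this is not the authors' implementation.

```python
import torch
import torch.nn as nn

class LstmEnhancer(nn.Module):
    """Sketch of an LSTM enhancement model with two learning targets."""

    def __init__(self, n_bins: int = 257, hidden: int = 512):
        super().__init__()
        self.lstm = nn.LSTM(n_bins, hidden, num_layers=2, batch_first=True)
        # Head 1: regress clean log-power spectra.
        self.lps_head = nn.Linear(hidden, n_bins)
        # Head 2: predict the ideal ratio mask, constrained to [0, 1].
        self.irm_head = nn.Sequential(nn.Linear(hidden, n_bins), nn.Sigmoid())

    def forward(self, noisy_lps: torch.Tensor):
        h, _ = self.lstm(noisy_lps)            # (batch, frames, hidden)
        return self.lps_head(h), self.irm_head(h)

# MMSE-style training step on random placeholder tensors.
model = LstmEnhancer()
noisy = torch.randn(4, 100, 257)               # (batch, frames, bins)
clean_lps, irm = torch.randn(4, 100, 257), torch.rand(4, 100, 257)
pred_lps, pred_irm = model(noisy)
loss = (nn.functional.mse_loss(pred_lps, clean_lps)
        + nn.functional.mse_loss(pred_irm, irm))
loss.backward()
```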
2.2. Voice Activity Detection

The recordings of this challenge are collected "in the wild". Various types of noise segments of different lengths occur between speakers, and these segments are easily mishandled as new speaker clusters if they are not correctly removed, so a well-trained VAD is essential. The core of the VAD module is a deep neural network for binary classification of speech and non-speech frames. The DNN model consists of two 2D-convolution layers and three time-delay layers. The left and right contexts of the time-delay layers are asymmetric in order to reduce latency during streaming inference. We also apply AM-softmax [20] at the DNN output for better frame-wise discriminative training. We use 128-dimensional Mel-frequency filterbanks, normalised globally, as input features. We perform forced alignment on the training data with our automatic speech recognition (ASR) system to obtain frame-wise speech and non-speech labels.

We further apply post-processing to the output of the neural network based VAD model. We apply speech smoothing to remove speech chunks shorter than 200 ms, and split segments at silences longer than 500 ms.
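A minimal sketch of this post-processing is shown below, assuming frame-wise speech probabilities from the VAD network and a 10 ms frame shift (the frame shift is not stated in the paper); the function and constant names are hypothetical.

```python
import numpy as np

FRAME_MS = 10  # assumed frame shift in milliseconds

def vad_postprocess(speech_probs: np.ndarray, thr: float = 0.5,
                    min_speech_ms: int = 200, min_silence_ms: int = 500):
    """Turn frame-wise speech probabilities into (start, end) segments in
    seconds: drop speech chunks shorter than 200 ms, and only split
    segments at silences of at least 500 ms."""
    speech = speech_probs >= thr
    # Collect runs of consecutive speech frames as (begin, end) indices.
    runs, start = [], None
    for i, s in enumerate(speech):
        if s and start is None:
            start = i
        elif not s and start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(speech)))
    # Smoothing: remove speech chunks shorter than min_speech_ms.
    runs = [(b, e) for b, e in runs if (e - b) * FRAME_MS >= min_speech_ms]
    # Merge runs separated by silences shorter than min_silence_ms, so a
    # boundary only appears where the silence is long enough.
    merged = []
    for b, e in runs:
        if merged and (b - merged[-1][1]) * FRAME_MS < min_silence_ms:
            merged[-1] = (merged[-1][0], e)
        else:
            merged.append((b, e))
    return [(b * FRAME_MS / 1000.0, e * FRAME_MS / 1000.0) for b, e in merged]
```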
2.3. Speaker Feature Representation

After the VAD module segments the recordings, we obtain speech segments of varying lengths. A good speaker feature representation in a diarisation system should capture speaker-discriminative features from a short speech segment, while becoming more accurate for a long speech segment. The x-vector fulfils this requirement as our speaker feature representation [17]. X-vectors are extracted from a well-trained time-delay neural network (TDNN) model following the VoxCeleb recipe of the Kaldi toolkit [21]. We apply 40-dimensional Mel-frequency filterbanks as input features. The TDNN model consists of three components. First, a feature learning component of 5 time-delay layers computes a higher-order representation of the speaker filterbank features. The slicing parameters of the 5 time-delay layers are {t−2, t−1, t, t+1, t+2}, {t−2, t, t+2}, {t−3, t, t+3}, {t}, {t} respectively. Then, a statistics pooling component computes the mean and standard deviation of the higher-order speaker representation over a given speech segment. The final component is a speaker classification module consisting of two fully-connected layers and a softmax output layer, whose size corresponds to the number of speakers in the training set. Once the network is well-trained, the x-vector is extracted from the penultimate fully-connected layer.

Considering the complex noise and reverberation conditions in the Challenge, we apply data augmentation based on the pipeline of the Kaldi SRE16 recipe [17] during training. We augment the training data by mixing in various music and noise signals and convolving with different room impulse responses for reverberation [22].

2.4. Speaker Clustering

We apply agglomerative hierarchical clustering (AHC) and variational Bayesian (VB) based iterative clustering [23] for speaker clustering in our diarisation system. The x-vectors extracted from speech segments after VAD are first clustered by AHC, using log-likelihood scores from probabilistic linear discriminant analysis (PLDA) as the similarity metric; the PLDA metric is widely used in speaker verification.

After the initial assignment from AHC, we apply a variational Bayesian hidden Markov model (VB-HMM) at the x-vector level to further cluster the x-vectors [24]. The x-vectors are first projected into a low-dimensional discriminative space with linear discriminant analysis (LDA). VB-HMM inference is then performed iteratively to refine the assignment of x-vectors to speaker clusters for a more accurate speaker distribution.
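The sketch below shows the bottom-up merging idea behind the AHC stage, operating on a precomputed pairwise similarity matrix that stands in for the PLDA log-likelihood scores between x-vectors. It uses plain average linkage and a merge threshold as the stopping criterion; the actual system relies on the Kaldi tooling followed by VB-HMM refinement, so this is illustrative only.

```python
import numpy as np

def ahc(sim: np.ndarray, threshold: float) -> np.ndarray:
    """Agglomerative clustering over a pairwise similarity matrix: merge
    the most similar pair of clusters (average linkage) until the best
    remaining pair falls below `threshold`."""
    n = sim.shape[0]
    clusters = {i: [i] for i in range(n)}

    def score(a, b):  # average pairwise similarity between two clusters
        return np.mean([sim[i, j] for i in clusters[a] for j in clusters[b]])

    while len(clusters) > 1:
        keys = list(clusters)
        pairs = [(score(a, b), a, b)
                 for ai, a in enumerate(keys) for b in keys[ai + 1:]]
        best, a, b = max(pairs)
        if best < threshold:
            break  # stopping criterion reached
        clusters[a] += clusters.pop(b)  # merge cluster b into cluster a

    labels = np.empty(n, dtype=int)
    for lab, members in enumerate(clusters.values()):
        labels[members] = lab
    return labels

# Demo with a random symmetric "PLDA score" matrix for 10 segments.
rng = np.random.default_rng(0)
s = rng.standard_normal((10, 10))
s = (s + s.T) / 2
print(ahc(s, threshold=0.0))
```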
3. Experimental Results

In this section, we describe the datasets used to build our diarisation system, and report the results of our system in terms of diarisation error rate (DER) and Jaccard error rate (JER).
3.1. Datasets

The diarisation task of the VoxCeleb Speaker Recognition Challenge is an open-set track which allows the use of extra data. Here, we introduce all the datasets used to build each part of our system.

The neural network based speech enhancement model is trained on the CHiME-3 dataset [25]. The speech signals from the near-field microphone (channel 5) are mixed with background noise at SNR levels of -5 dB, 0 dB, and 5 dB.

For the neural network based VAD, we use the LibriSpeech [26] and Common Voice [27] datasets as training data. We augment the training data by adding noise from the AudioSet dataset [28] at SNRs of 0-20 dB.

For x-vector model training, we choose the VoxCeleb corpus [29, 30], containing VoxCeleb 1 and 2 with 1.2 million speech utterances from 7146 speakers. Data augmentation is performed with the MUSAN corpus [31] and RIRs from the AIR dataset [22].

Details of the development and evaluation sets of the VoxCeleb Speaker Recognition Challenge can be found in [15]. Note that we perform speaker diarisation only on the audio stream of the dataset.
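The basic operation behind this additive-noise augmentation is mixing a noise signal into speech at a chosen SNR. A minimal sketch follows, with synthetic signals standing in for real data; the helper name is hypothetical.

```python
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix noise into speech so that 10*log10(P_speech / P_noise) == snr_db,
    e.g. -5/0/5 dB for the enhancement data, 0-20 dB for the VAD data."""
    # Loop or trim the noise to match the speech length.
    reps = int(np.ceil(len(speech) / len(noise)))
    noise = np.tile(noise, reps)[: len(speech)]
    # Scale the noise to hit the requested SNR.
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return speech + scale * noise

# Example: augment one utterance at a random SNR between 0 and 20 dB.
rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)
noise = rng.standard_normal(8000)
noisy = mix_at_snr(speech, noise, snr_db=rng.uniform(0.0, 20.0))
```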
3.2. Evaluation Metrics

The diarisation system is evaluated with two metrics, diarisation error rate (DER) and Jaccard error rate (JER). Both metrics compare the system output against reference human annotations.

DER measures the percentage of total speaker time that is not attributed to the correct speakers. It considers three types of error: FA is the total system speaker duration not attributed to any reference speaker; MISS is the total reference speaker duration not attributed to any system speaker; ERROR is the total reference speaker duration attributed to wrong speakers. DER is defined as

\[ \mathrm{DER} = \frac{\mathrm{FA} + \mathrm{MISS} + \mathrm{ERROR}}{\mathrm{TOTAL}}. \tag{1} \]

JER is based on the Jaccard index and is defined through the ratio between the intersection and the union of the system and reference speaker durations. For each speaker, FA is the total system speaker duration not attributed to that reference speaker, and MISS is the total reference speaker duration not attributed to the system speaker:

\[ \mathrm{JER}_{\mathrm{spk}} = \frac{\mathrm{FA} + \mathrm{MISS}}{\mathrm{TOTAL}}. \tag{2} \]

The total JER is the average of the JERs of all speakers, where N is the total number of speakers:

\[ \mathrm{JER} = \frac{1}{N} \sum_{\mathrm{spk}} \mathrm{JER}_{\mathrm{spk}}. \tag{3} \]
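A direct transcription of Equations (1)-(3) into code, for durations measured in seconds. Note that the official challenge scoring additionally handles overlap and forgiveness collars, which this sketch ignores:

```python
def der(fa: float, miss: float, error: float, total: float) -> float:
    """Diarisation error rate, Eq. (1), in percent; `total` is the
    total reference speaker time."""
    return 100.0 * (fa + miss + error) / total

def jer(per_speaker: list) -> float:
    """Jaccard error rate, Eqs. (2)-(3), in percent; `per_speaker`
    holds one (fa, miss, total) duration triple per reference speaker."""
    return 100.0 * sum((fa + miss) / total
                       for fa, miss, total in per_speaker) / len(per_speaker)

# Toy example with made-up durations in seconds, not challenge data.
print(der(fa=3.0, miss=5.0, error=2.0, total=100.0))  # 10.0
print(jer([(1.0, 2.0, 30.0), (0.5, 1.5, 20.0)]))      # 10.0
```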
3.3. Results

First, we evaluate our baseline speaker diarisation system, which is based on energy-based VAD, x-vector speaker representation, and AHC-based speaker clustering. For the baseline system, we use the VoxCeleb dataset to train a TDNN model as the x-vector extractor. The results match the official performance of the Challenge [15], as shown in Table 1 and Table 2.

Table 1: Performance on development set

Methods                                                   DER      JER
AHC (baseline)                                            [?].09   47.[?]
AHC + VB-HMM                                              [?].06   31.[?]
AHC + VB-HMM + NN-VAD (thr-0.8)                           [?].87   34.[?]
AHC + VB-HMM + NN-VAD (thr-0.7)                           [?].81   30.[?]
AHC + VB-HMM + NN-VAD (thr-0.6)                           [?].54   29.[?]
AHC + VB-HMM + NN-VAD (thr-0.5)                           [?].58   27.[?]
AHC + VB-HMM + NN-VAD (thr-0.5) + NN-enhanced             [?].06   26.[?]
AHC (thr-0.8) + VB-HMM + NN-VAD (thr-0.5) + NN-enhanced   [?].46   25.[?]
Table 2: Performance on evaluation set

Methods                                                   DER      JER
AHC (baseline)                                            [?].74   51.[?]
AHC + VB-HMM                                              [?].03   36.[?]
AHC + VB-HMM + NN-VAD (thr-0.5)                           [?]      [?]
AHC (thr-0.8) + VB-HMM + NN-VAD (thr-0.5) + NN-enhanced   10.45    22.46

Then we study the performance of VB-HMM based clustering. As shown in the first two rows of Table 1 and Table 2, the VB-HMM iterative clustering method improves performance on both the development set and the evaluation set significantly. This gives us great confidence in tackling such a hard "in the wild" problem by incorporating the VB-HMM method.

After replacing the energy-based VAD with our neural network based VAD, we obtain a further significant performance improvement. Tuning the VAD threshold from 0.8 down to 0.5 on the development set, we attain the best results at a threshold of 0.5, which we keep for the evaluation set.

The results confirm that VAD is a critical component of a diarisation system. The neural network based VAD successfully removes music, noise, and other daily-life non-speech segments, interference that would otherwise lead to inevitable mistakes in the clustering back-end. An effective VAD helps to achieve a higher performance upper bound for a diarisation system. However, we notice that performance on the development set is sensitive to the VAD threshold. This may be an interesting topic for further study.

Applying neural network based speech enhancement (NN-enhanced) further improves the performance slightly. Speech enhancement should minimise the impact of noise and retain more speaker-discriminative features in the speech segments. We therefore further raise the AHC threshold from 0.0 to 0.8 in the speaker clustering stage to allow more aggressive clustering, which improves the performance further.

Finally, we perform diarisation on the evaluation set with the best-performing parameters from the development set, obtaining a DER of 10.45% and a JER of 22.46%, a substantial improvement over the baseline method. We have submitted the results to the official Challenge submission system.
4. Conclusions
This paper presents the development of our diarisation system for the diarisation task of the VoxCeleb Speaker Recognition Challenge 2020. A neural network based speech enhancement model is applied as speech pre-processing to reduce background interference. We believe that the improved speech enhancement model helps to retain speaker information in the enhanced speech segments. A well-trained neural network based VAD is applied to identify non-speech segments of music, background noise, and other daily-life interference. The neural network based VAD also helps to obtain accurate speech boundaries for the back-end clustering algorithms. We notice that the main performance improvement of the diarisation system is contributed by the neural network based VAD. We apply a TDNN-based x-vector system for speaker feature extraction, followed by AHC for initial speaker clustering. Variational Bayesian hidden Markov model (VB-HMM) clustering at the x-vector level further improves the clustering performance by locating more accurate speaker boundaries. We also apply various data augmentation methods to increase the diversity of the training data for each component. By combining these strategies, we achieve substantial improvement in the diarisation results in terms of both DER and JER.

Speaker diarisation is a hard task, as a system has to solve multiple problems across complex acoustic environments and different applications. In the future, we aim to further improve diarisation performance by investigating overlapped speech detection and automatic threshold selection for real-life scenarios.
5. References

[1] D. A. Reynolds and P. Torres-Carrasquillo, "Approaches and applications of audio diarization," in Proc. ICASSP, vol. 5, 2005, pp. v-953.
[2] S. E. Tranter and D. A. Reynolds, "An overview of automatic speaker diarization systems," IEEE Transactions on Audio, Speech, and Language Processing, vol. 14, no. 5, pp. 1557–1565, 2006.
[3] X. Anguera, S. Bozonnet, N. Evans, C. Fredouille, G. Friedland, and O. Vinyals, "Speaker diarization: A review of recent research," IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 2, pp. 356–370, 2012.
[4] S. Chen, P. Gopalakrishnan et al., "Speaker, environment and channel change detection and clustering via the Bayesian information criterion," in Proc. DARPA Broadcast News Transcription and Understanding Workshop, vol. 8, 1998, pp. 127–132.
[5] K. Chen and A. Salman, "Learning speaker-specific characteristics with a deep neural architecture," IEEE Transactions on Neural Networks, vol. 22, no. 11, pp. 1744–1756, 2011.
[6] R. Wang, M. Gu, L. Li, M. Xu, and T. F. Zheng, "Speaker segmentation using deep speaker vectors for fast speaker change scenarios," in Proc. ICASSP, 2017, pp. 5420–5424.
[7] N. Evans, S. Bozonnet, D. Wang, C. Fredouille, and R. Troncy, "A comparative study of bottom-up and top-down approaches to speaker diarization," IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 2, pp. 382–392, 2012.
[8] K. J. Han and S. S. Narayanan, "A robust stopping criterion for agglomerative hierarchical clustering in a speaker diarization system," in Proc. Interspeech, 2007.
[9] S. Bozonnet, N. W. Evans, and C. Fredouille, "The LIA-EURECOM RT'09 speaker diarization system: Enhancements in speaker modelling and cluster purification," in Proc. ICASSP, 2010, pp. 4958–4961.
[10] A. Zhang, Q. Wang, Z. Zhu, J. Paisley, and C. Wang, "Fully supervised speaker diarization," in Proc. ICASSP, 2019, pp. 6301–6305.
[11] Y. Fujita, N. Kanda, S. Horiguchi, Y. Xue, K. Nagamatsu, and S. Watanabe, "End-to-end neural speaker diarization with self-attention," in Proc. ASRU, 2019, pp. 296–303.
[12] L. E. Shafey, H. Soltau, and I. Shafran, "Joint speech recognition and speaker diarization via sequence transduction," arXiv preprint arXiv:1907.05337, 2019.
[13] G. Sell, D. Snyder, A. McCree, D. Garcia-Romero, J. Villalba, M. Maciejewski, V. Manohar, N. Dehak, D. Povey, S. Watanabe et al., "Diarization is hard: Some experiences and lessons learned for the JHU team in the inaugural DIHARD challenge," in Proc. Interspeech, 2018, pp. 2808–2812.
[14] N. Ryant, K. Church, C. Cieri, A. Cristia, J. Du, S. Ganapathy, and M. Liberman, "The second DIHARD diarization challenge: Dataset, task, and baselines," arXiv preprint arXiv:1906.07839, 2019.
[15] J. S. Chung, J. Huh, A. Nagrani, T. Afouras, and A. Zisserman, "Spot the conversation: speaker diarisation in the wild," arXiv preprint arXiv:2007.01216, 2020.
[16] L. Sun, J. Du, C. Jiang, X. Zhang, S. He, B. Yin, and C.-H. Lee, "Speaker diarization with enhancing speech for the first DIHARD challenge," in Proc. Interspeech, 2018, pp. 2793–2797.
[17] D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur, "X-vectors: Robust DNN embeddings for speaker recognition," in Proc. ICASSP, 2018, pp. 5329–5333.
[18] Y. Ephraim and D. Malah, "Speech enhancement using a minimum-mean square error short-time spectral amplitude estimator," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 32, no. 6, pp. 1109–1121, 1984.
[19] T. Gao, J. Du, L.-R. Dai, and C.-H. Lee, "SNR-based progressive learning of deep neural network for speech enhancement," in Proc. Interspeech, 2016, pp. 3713–3717.
[20] F. Wang, J. Cheng, W. Liu, and H. Liu, "Additive margin softmax for face verification," IEEE Signal Processing Letters, vol. 25, no. 7, pp. 926–930, 2018.
[21] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz et al., "The Kaldi speech recognition toolkit," in IEEE 2011 Workshop on Automatic Speech Recognition and Understanding (ASRU), 2011.
[22] T. Ko, V. Peddinti, D. Povey, M. L. Seltzer, and S. Khudanpur, "A study on data augmentation of reverberant speech for robust speech recognition," in Proc. ICASSP, 2017, pp. 5220–5224.
[23] F. Landini, S. Wang, M. Diez, L. Burget, P. Matějka, K. Žmolíková, L. Mošner, A. Silnova, O. Plchot, O. Novotný et al., "BUT system for the second DIHARD speech diarization challenge," in Proc. ICASSP, 2020, pp. 6529–6533.
[24] M. Diez, L. Burget, F. Landini, and J. Černocký, "Analysis of speaker diarization based on Bayesian HMM with eigenvoice priors," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 28, pp. 355–368, 2019.
[25] J. Barker, R. Marxer, E. Vincent, and S. Watanabe, "The third 'CHiME' speech separation and recognition challenge: Dataset, task and baselines," in Proc. ASRU, 2015, pp. 504–511.
[26] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, "Librispeech: An ASR corpus based on public domain audio books," in Proc. ICASSP, 2015, pp. 5206–5210.
[27] R. Ardila, M. Branson, K. Davis, M. Henretty, M. Kohler, J. Meyer, R. Morais, L. Saunders, F. M. Tyers, and G. Weber, "Common Voice: A massively-multilingual speech corpus," arXiv preprint arXiv:1912.06670, 2019.
[28] J. F. Gemmeke, D. P. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore, M. Plakal, and M. Ritter, "Audio Set: An ontology and human-labeled dataset for audio events," in Proc. ICASSP, 2017, pp. 776–780.
[29] A. Nagrani, J. S. Chung, and A. Zisserman, "VoxCeleb: A large-scale speaker identification dataset," arXiv preprint arXiv:1706.08612, 2017.
[30] J. S. Chung, A. Nagrani, and A. Zisserman, "VoxCeleb2: Deep speaker recognition," arXiv preprint arXiv:1806.05622, 2018.
[31] D. Snyder, G. Chen, and D. Povey, "MUSAN: A music, speech, and noise corpus," arXiv preprint arXiv:1510.08484, 2015.