ABSP System for The Third DIHARD Challenge

A Kishore Kumar, Shefali Waldekar, Goutam Saha, Md Sahidullah
Dept. of Electronics and ECE, Indian Institute of Technology Kharagpur, Kharagpur, India
Université de Lorraine, CNRS, Inria, LORIA, F-54000, Nancy, France
[email protected]
Abstract—This report describes the speaker diarization system developed by the ABSP Laboratory team for the third DIHARD speech diarization challenge. Our primary contribution is an acoustic domain identification (ADI) system for speaker diarization. We investigate a speaker-embedding-based ADI system. We apply a domain-dependent threshold for agglomerative hierarchical clustering. In addition, we optimize the parameters for PCA-based dimensionality reduction in a domain-dependent way. Our method of integrating domain-based processing schemes into the baseline system of the challenge achieved relative improvements of approximately 8.5% and 10.4% in DER for the core and full conditions, respectively, for Track 1 of the DIHARD III evaluation set.

I. NOTABLE HIGHLIGHTS
We participated in Track 1 of the third DIHARD challenge [1]. Our main focus was to apply domain-dependent processing, which was found promising in preliminary studies with the second DIHARD dataset [2], [3]. We propose a simple modification of the baseline system of the challenge which results in a considerable reduction of the error rates compared to the baseline performance. The notable features of our submission to the challenge are as follows:

• We propose a simple but efficient method for acoustic domain identification (ADI) using speaker embeddings of the full recording. We observed that i-vector-based speaker embeddings are considerably better than x-vector-based speaker embeddings for the ADI task.

• We found that full domain-dependent processing, with both domain-dependent clustering and domain-dependent probabilistic linear discriminant analysis (PLDA) adaptation, does not improve the diarization performance. However, it helps when the clustering is done in a domain-dependent way but the PLDA adaptation during scoring is made with audio data from all eleven domains.

• We also found that experimental optimization of the parameters for principal component analysis (PCA) in a domain-specific way further improves the diarization performance.

• The proposed system does not introduce much computational overhead over the baseline system for the diarization. Though this approach requires more time for empirical optimization of the parameters on the development set, the additional computational cost is negligible for the evaluation data.

• The proposed system does not involve any fusion or system combination. Considering that most of the top systems in this challenge are combinations of two or more sub-systems, our algorithm is remarkably faster than other competitive systems.

II. DATA RESOURCES
The ABSP system has two major components: ADI and speaker diarization. The ADI system uses i-vector speaker embeddings extracted with models trained on the VoxCeleb 1 and 2 corpora. On the other hand, the diarization system uses an embedding extractor trained on a combination of VoxCeleb 1 and VoxCeleb 2, augmented with additive noise and reverberation from the MUSAN and RIR databases, respectively.

III. DETAILED DESCRIPTION OF ALGORITHM
Our diarization system is primarily based on the baseline system created by the organizers [4]. We used the toolkit (https://github.com/dihardchallenge/dihard3_baseline) with the same frame-level acoustic features, embedding extractor, scoring method, etc. The ADI system is based on speaker embeddings as sentence-level features and a nearest-neighbour classifier [5]. In order to extract utterance-level embeddings for the ADI task, we used a pre-trained i-vector [6] model trained on VoxCeleb audio data (https://kaldi-asr.org/models/m7).

We can summarize the steps for speaker diarization as follows:

1) ADI task: First, the ADI system was developed from the development set. We used a nearest-neighbour classifier with cosine similarity. The full development set with all 254 files was used for training the final ADI system. More details about this system are reported in [5].

2) Domain-dependent threshold selection: The baseline system for the challenge finds the optimum clustering threshold by computing diarization error rates (DERs) on the full development set over a range of thresholds. We follow the same process, but for each acoustic domain independently. At the end of this step, the optimum thresholds for each domain are stored in a lookup table.

3) Domain-dependent dimensionality reduction: The PLDA scoring involves dimensionality reduction of the embeddings using PCA. The baseline system preserves 30% of the total energy during dimensionality reduction. Instead of applying the fixed value of 0.3 for all recordings, we optimized this parameter for each domain separately over a grid of values. Similar to the previous step, the optimum parameters for each domain are preserved in another lookup table.

4) Diarization on the evaluation set: Finally, during diarization on the evaluation set, we first computed the i-vector of the full recording to be processed. Then, we predicted the corresponding acoustic domain using the ADI system. This is followed by the selection of the clustering threshold and dimensionality reduction parameters corresponding to the predicted label.
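The four steps above can be sketched end-to-end as follows. This is a minimal illustration, not our actual implementation: the domain names, lookup values, and the per-domain centroid simplification of the nearest-neighbour classifier are hypothetical placeholders, and the baseline diarization call is left as a stub.

```python
import numpy as np

# Hypothetical per-domain lookup tables from steps 2 and 3;
# the values below are placeholders, not our tuned parameters.
AHC_THRESHOLD = {"audiobooks": 0.1, "broadcast_interview": -0.2}
PCA_ENERGY = {"audiobooks": 0.3, "broadcast_interview": 0.5}

def predict_domain(ivector, domain_centroids):
    """Step 1 (simplified): nearest-neighbour ADI with cosine similarity."""
    best_domain, best_score = None, -np.inf
    for domain, centroid in domain_centroids.items():
        score = np.dot(ivector, centroid) / (
            np.linalg.norm(ivector) * np.linalg.norm(centroid))
        if score > best_score:
            best_domain, best_score = domain, score
    return best_domain

def diarize_recording(recording_ivector, domain_centroids):
    """Step 4: select domain-dependent parameters, then run the
    baseline diarization (represented here by a commented stub)."""
    domain = predict_domain(recording_ivector, domain_centroids)
    threshold = AHC_THRESHOLD[domain]
    pca_energy = PCA_ENERGY[domain]
    # run_baseline_diarization(recording, threshold, pca_energy)  # hypothetical
    return domain, threshold, pca_energy
```

In the real system, the only evaluation-time overhead over the baseline is the full-recording i-vector extraction and two table lookups.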
IV. RESULTS ON THE DEVELOPMENT SET
The speaker diarization results on the development set are shown in Table I. We also show the results for the evaluation set in Table II. Both results confirm a considerable improvement over the baseline system.
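The per-domain tuning behind these results (sweeping the clustering threshold, and analogously the PCA energy, on the development set) can be sketched as follows; `dev_recordings` and `compute_der` are hypothetical stand-ins for the baseline tooling, and the grid values are illustrative rather than the range we actually used:

```python
import numpy as np

def tune_threshold_per_domain(dev_recordings, compute_der,
                              grid=np.arange(-0.5, 0.5, 0.05)):
    """For each domain, keep the clustering threshold that minimizes the
    mean DER over that domain's development recordings.

    dev_recordings: list of dicts with at least a "domain" key.
    compute_der(recording, threshold): returns the DER for one recording
    diarized at the given threshold (hypothetical helper).
    """
    best = {}
    for domain in {r["domain"] for r in dev_recordings}:
        recs = [r for r in dev_recordings if r["domain"] == domain]
        # (mean DER, threshold) pairs; min() picks the lowest mean DER
        ders = [(np.mean([compute_der(r, t) for r in recs]), t) for t in grid]
        best[domain] = min(ders)[1]
    return best
```

The resulting dictionary plays the role of the lookup table consulted at evaluation time.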
TABLE I
RESULTS SHOWING THE SPEAKER DIARIZATION PERFORMANCE USING BASELINE AND PROPOSED METHODS ON DEVELOPMENT SET.

         |      Full      |      Core
Method   |  DER     JER   |  DER     JER
---------|----------------|---------------
Baseline |  19.59   43.01 |  20.17   47.28
Proposed |  17.40   38.08 |  17.95   42.12

TABLE II
SAME AS TABLE I BUT FOR EVALUATION SET.

         |      Full      |      Core
Method   |  DER     JER   |  DER     JER
---------|----------------|---------------
Baseline |  19.19   43.28 |  20.39   48.61
Proposed |  17.20   37.30 |  18.66   42.23
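As a quick sanity check, the relative DER improvements on the evaluation set implied by Table II can be computed directly:

```python
# DER values taken from Table II (evaluation set).
baseline = {"full": 19.19, "core": 20.39}
proposed = {"full": 17.20, "core": 18.66}

for cond in ("full", "core"):
    rel = 100.0 * (baseline[cond] - proposed[cond]) / baseline[cond]
    print(f"{cond}: {rel:.1f}% relative DER reduction")
# full: 10.4% relative DER reduction
# core: 8.5% relative DER reduction
```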
V. HARDWARE REQUIREMENTS
The codes were run on a Dell PowerEdge R730 server. It has an Intel Xeon(R) CPU E5-2695 v4 @ 2.10 GHz processor and 128 GB of memory, with 72 logical CPUs (2 sockets, 18 cores/socket, and 2 threads/core). We used the Ubuntu 16.04 LTS 64-bit operating system. We used 32 cores for the experiments.

VI. CONCLUSION
Our study is a step towards advancing the baseline diarization system with domain-dependent processing. Our system showed substantially reduced error rates as we optimized the clustering threshold and the dimensionality reduction parameters for each domain separately. Future work involves investigating advanced embedding extractors and exploring further domain-dependent processing, e.g., a domain-dependent acoustic front-end, embedding extractor, re-segmentation, etc.

VII. ACKNOWLEDGEMENTS
REFERENCES
[1] N. Ryant et al., "Third DIHARD challenge evaluation plan," arXiv preprint arXiv:2006.05815, 2020.
[2] M. Sahidullah et al., "The Speed submission to DIHARD II: Contributions & lessons learned," arXiv preprint arXiv:1911.02388, 2019.
[3] T. Fennir, F. Habib, and C. Macaire, "Acoustic scene classification for speaker diarization," Université de Lorraine, Tech. Rep., 2020.
[4] N. Ryant et al., "The third DIHARD diarization challenge," arXiv preprint arXiv:2012.01477, 2020.
[5] A. K. Kumar, S. Waldekar, G. Saha, and M. Sahidullah, "Domain-dependent speaker diarization for the third DIHARD challenge," arXiv preprint arXiv:2101.09884, 2021.
[6] N. Dehak et al., "Front-end factor analysis for speaker verification," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 4, pp. 788-798, 2011.