ABSP System for The Third DIHARD Challenge

A Kishore Kumar, Shefali Waldekar, Goutam Saha, Md Sahidullah
Dept. of Electronics and ECE, Indian Institute of Technology Kharagpur, Kharagpur, India
Université de Lorraine, CNRS, Inria, LORIA, F-54000, Nancy, France
[email protected]
Abstract—This report describes the speaker diarization system developed by the ABSP Laboratory team for the third DIHARD speech diarization challenge. Our primary contribution is an acoustic domain identification (ADI) system for speaker diarization. We investigate a speaker-embedding-based ADI system. We apply a domain-dependent threshold for agglomerative hierarchical clustering. In addition, we optimize the parameters for PCA-based dimensionality reduction in a domain-dependent way. Our method of integrating domain-based processing schemes into the baseline system of the challenge achieved relative improvements of approximately 8.5% and 10.4% in DER for the core and full conditions, respectively, for Track 1 of the DIHARD III evaluation set.

I. NOTABLE HIGHLIGHTS
We participated in Track 1 of the third DIHARD challenge [1]. Our main focus was to apply domain-dependent processing, which was found promising in preliminary studies with the second DIHARD dataset [2], [3]. We propose a simple modification of the baseline system of the challenge which results in a considerable reduction of the error rates compared to the baseline performance. The notable features of our submission to the challenge are as follows:

• We propose a simple but efficient method for acoustic domain identification (ADI) using speaker embeddings of the full recording. We observed that i-vector-based speaker embeddings are considerably better than x-vector-based speaker embeddings for the ADI task.

• We found that full domain-dependent processing, with both domain-dependent clustering and domain-dependent probabilistic linear discriminant analysis (PLDA) adaptation, does not improve the diarization performance. However, it helps when the clustering is done in a domain-dependent way but the PLDA adaptation during scoring is made with audio data from all eleven domains.

• We also found that experimental optimization of the parameters for principal component analysis (PCA) in a domain-specific way further improves the diarization performance.

• The proposed system does not introduce much computational overhead over the baseline system for the diarization. Though this approach requires more time for empirical optimization of the parameters on the development set, the additional computational cost is negligible for the evaluation data.

• The proposed system does not involve any fusion or system combination. Considering that most of the top systems in this challenge are combinations of two or more sub-systems, our algorithm is remarkably faster than other competitive systems.

II. DATA RESOURCES
The ABSP system has two major components: ADI and speaker diarization. The ADI system uses i-vector speaker embeddings extracted with models trained on the VoxCeleb 1 and 2 corpora. On the other hand, the diarization system uses an embedding extractor trained on a combination of VoxCeleb 1 and VoxCeleb 2, augmented with additive noise and reverberation from the MUSAN and RIR databases, respectively.

III. DETAILED DESCRIPTION OF ALGORITHM
Our diarization system is primarily based on the baseline system created by the organizers [4]. We used the toolkit (https://github.com/dihardchallenge/dihard3_baseline) with the same frame-level acoustic features, embedding extractor, scoring method, etc. The ADI system is based on speaker embeddings as sentence-level features and a nearest-neighbour classifier [5]. In order to extract utterance-level embeddings for the ADI task, we used a pre-trained i-vector [6] model trained on VoxCeleb audio data (https://kaldi-asr.org/models/m7).

We can summarize the steps for speaker diarization as follows:

1) ADI task: First, the ADI system was developed from the development set. We used a nearest-neighbour classifier with cosine similarity. The full development set with all 254 files was used for training the final ADI system. More details about this system are reported in [5].

2) Domain-dependent threshold selection: The baseline system for the challenge finds the optimum clustering threshold by computing diarization error rates (DERs) on the full development set over a range of thresholds. We follow the same process, but for each acoustic domain independently. At the end of this step, the optimum thresholds for each domain are stored in a lookup table.

3) Domain-dependent dimensionality reduction: The PLDA scoring involves dimensionality reduction of the embeddings using PCA. The baseline system preserves 30% of the total energy during dimensionality reduction. Instead of applying the fixed value of 0.3 for all recordings, we optimized this parameter for each domain separately over a grid of values. Similar to the previous step, the optimum parameters for each domain are preserved in another lookup table.

4) Diarization on the evaluation set: Finally, during diarization on the evaluation set, we first computed the i-vector of the full recording to be processed. Then, we predicted the corresponding acoustic domain using the ADI system. This is followed by the selection of the clustering threshold and dimensionality reduction parameters corresponding to the predicted label.
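The four steps above can be sketched end-to-end as follows. This is a minimal illustration, not our actual implementation: the domain names, lookup values, and the per-domain centroid simplification of the nearest-neighbour classifier are hypothetical placeholders, and the baseline diarization call is left as a stub.

```python
import numpy as np

# Hypothetical per-domain lookup tables from steps 2 and 3;
# the values below are placeholders, not our tuned parameters.
AHC_THRESHOLD = {"audiobooks": 0.1, "broadcast_interview": -0.2}
PCA_ENERGY = {"audiobooks": 0.3, "broadcast_interview": 0.5}

def predict_domain(ivector, domain_centroids):
    """Step 1 (simplified): nearest-neighbour ADI with cosine similarity."""
    best_domain, best_score = None, -np.inf
    for domain, centroid in domain_centroids.items():
        score = np.dot(ivector, centroid) / (
            np.linalg.norm(ivector) * np.linalg.norm(centroid))
        if score > best_score:
            best_domain, best_score = domain, score
    return best_domain

def diarize_recording(recording_ivector, domain_centroids):
    """Step 4: select domain-dependent parameters, then run the
    baseline diarization (represented here by a commented stub)."""
    domain = predict_domain(recording_ivector, domain_centroids)
    threshold = AHC_THRESHOLD[domain]
    pca_energy = PCA_ENERGY[domain]
    # run_baseline_diarization(recording, threshold, pca_energy)  # hypothetical
    return domain, threshold, pca_energy
```

In the real system, the only evaluation-time overhead over the baseline is the full-recording i-vector extraction and two table lookups.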
IV. RESULTS ON THE DEVELOPMENT SET
The speaker diarization results on the development set are shown in Table I. We also show the results for the evaluation set in Table II. Both results confirm a considerable improvement over the baseline system.
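The per-domain tuning behind these results (sweeping the clustering threshold, and analogously the PCA energy, on the development set) can be sketched as follows; `dev_recordings` and `compute_der` are hypothetical stand-ins for the baseline tooling, and the grid values are illustrative rather than the range we actually used:

```python
import numpy as np

def tune_threshold_per_domain(dev_recordings, compute_der,
                              grid=np.arange(-0.5, 0.5, 0.05)):
    """For each domain, keep the clustering threshold that minimizes the
    mean DER over that domain's development recordings.

    dev_recordings: list of dicts with at least a "domain" key.
    compute_der(recording, threshold): returns the DER for one recording
    diarized at the given threshold (hypothetical helper).
    """
    best = {}
    for domain in {r["domain"] for r in dev_recordings}:
        recs = [r for r in dev_recordings if r["domain"] == domain]
        # (mean DER, threshold) pairs; min() picks the lowest mean DER
        ders = [(np.mean([compute_der(r, t) for r in recs]), t) for t in grid]
        best[domain] = min(ders)[1]
    return best
```

The resulting dictionary plays the role of the lookup table consulted at evaluation time.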
TABLE I
RESULTS SHOWING THE SPEAKER DIARIZATION PERFORMANCE USING BASELINE AND PROPOSED METHODS ON DEVELOPMENT SET.

         |      Full      |      Core
Method   |  DER     JER   |  DER     JER
---------|----------------|---------------
Baseline |  19.59   43.01 |  20.17   47.28
Proposed |  17.40   38.08 |  17.95   42.12

TABLE II
SAME AS TABLE I BUT FOR EVALUATION SET.

         |      Full      |      Core
Method   |  DER     JER   |  DER     JER
---------|----------------|---------------
Baseline |  19.19   43.28 |  20.39   48.61
Proposed |  17.20   37.30 |  18.66   42.23
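As a quick sanity check, the relative DER improvements on the evaluation set implied by Table II can be computed directly:

```python
# DER values taken from Table II (evaluation set).
baseline = {"full": 19.19, "core": 20.39}
proposed = {"full": 17.20, "core": 18.66}

for cond in ("full", "core"):
    rel = 100.0 * (baseline[cond] - proposed[cond]) / baseline[cond]
    print(f"{cond}: {rel:.1f}% relative DER reduction")
# full: 10.4% relative DER reduction
# core: 8.5% relative DER reduction
```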
V. HARDWARE REQUIREMENTS
The codes were run on a Dell PowerEdge R730 server. It has an Intel Xeon(R) CPU E5-2695 v4 @ 2.10 GHz processor and 128 GB of memory, with 72 logical CPUs (2 sockets, 18 cores/socket, and 2 threads/core). We used the Ubuntu 16.04 LTS 64-bit operating system. We used 32 cores for the experiments.

VI. CONCLUSION
Our study is a step towards advancing the baseline diarization system with domain-dependent processing. Our system showed substantially reduced error rates as we optimized the clustering threshold and the dimensionality reduction parameters for each domain separately. Future work involves investigating advanced embedding extractors and exploring further domain-dependent processing, e.g., a domain-dependent acoustic front-end, embedding extractor, re-segmentation, etc.

VII. ACKNOWLEDGEMENTS
REFERENCES
[1] N. Ryant et al., "Third DIHARD challenge evaluation plan," arXiv preprint arXiv:2006.05815, 2020.
[2] M. Sahidullah et al., "The Speed submission to DIHARD II: Contributions & lessons learned," arXiv preprint arXiv:1911.02388, 2019.
[3] T. Fennir, F. Habib, and C. Macaire, "Acoustic scene classification for speaker diarization," Université de Lorraine, Tech. Rep., 2020.
[4] N. Ryant et al., "The third DIHARD diarization challenge," arXiv preprint arXiv:2012.01477, 2020.
[5] A. K. Kumar, S. Waldekar, G. Saha, and M. Sahidullah, "Domain-dependent speaker diarization for the third DIHARD challenge," arXiv preprint arXiv:2101.09884, 2021.
[6] N. Dehak et al., "Front-end factor analysis for speaker verification," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 4, pp. 788-798, 2011.