Mixture of Speaker-type PLDAs for Children's Speech Diarization
Jiamin Xie, Suzanna Sia, Paola Garcia, Daniel Povey, Sanjeev Khudanpur
Center for Language and Speech Processing & Human Language Technology Center of Excellence, The Johns Hopkins University, Baltimore, MD 21218, USA
Xiaomi Corp., Beijing, China
{jxie27, ssia1, lgarci27}@jhu.edu, [email protected], [email protected]

Abstract
In diarization, a PLDA model is typically used in an inference structure which assumes that the variation in speech segments is induced by various speakers. The speaker variation is then learned from the training data. However, human perception can differentiate speakers by age, gender, and other characteristics. In this paper, we investigate a speaker-type informed model that explicitly captures the known variation of speakers. We explore a mixture of three PLDA models, where each model represents an adult female, adult male, or child category. The weighting of each model is decided by the prior probability of its respective class, which we study. The evaluation is performed on a subset of the BabyTrain corpus. We examine the expected performance gain using the oracle speaker type labels, which yields an 11.7% DER reduction. We introduce a novel baby vocalization augmentation technique and then compare the mixture model to the single model. Our experimental results show an effective 0.9% DER reduction obtained by adding vocalizations. We discover empirically that a balanced dataset is important for training the mixture PLDA model, which then outperforms the single PLDA by 1.3% using the same training data, achieving a 35.8% DER. The same setup improves over a standard baseline by 2.8% DER.
Index Terms: speaker diarization, children's speech, transformer encoder, mixture of PLDAs
1. Introduction
Speaker diarization aims to answer the question of "who speaks when?" in a recording. It is the crucial first step that ensures the single-speaker assumption necessary for downstream tasks, including speaker verification and speech recognition, among others. Most diarization methods take short (1-2 sec) overlapping segments of a recording, estimate similarities between each pair of segments, and cluster/separate segments into same/different speaker(s). The process results in hypothesized speech segments of various lengths that belong to each speaker.

Research in diarization has mainly focused on adult speech. The benchmark diarization error rates (DER) in controlled speech environments, such as telephone conversations or business meetings, typically range from 2% to 10% among the best systems [1, 2]. In more natural conditions, such as conversational telephone speech, the diarization performance varies between 5% and 30% DER [3, 4]. However, these results are made possible under a rather easy setup with a small number of speakers or a similar speaking style shared by participants. Recent studies have found diarization to be hard under realistic conditions that involve overlapping speech, noisy backgrounds, and diverse modalities of speech [5, 6].

Children's speech is one of the realistic domains that pose challenges for speaker diarization [7]. The acoustic and linguistic properties of children differ from those of adults, with, for example, higher pitch and formant frequencies and longer phoneme durations [8]. In addition, children utter spontaneous vocalizations during their speech, which increases the need for intra-speaker variation to be appropriately modeled. These spontaneous vocalizations can also occur when others are speaking, which calls for overlap diarization. The above factors introduce a large performance discrepancy between diarization systems for children's speech and for adult speech.
One study that analyzes the language exposure of children in a home environment revealed on average a 48.9% DER across different training datasets and state-of-the-art diarization systems [9]. Our previous work [10] focused on the adaptation of a PLDA model through both children's speech data augmentation and a discrimination of speaker representations into adult female, adult male, and child speaker types. In this paper, we extend the latter idea to a mixture PLDA model that explicitly captures the variation of speakers across speaker types. The organization of the paper is as follows. Section 2 addresses the related work that inspired the mixture PLDA model. Section 3 describes the main methods. Section 4 outlines the experimental setup and data preparation. Section 5 presents the results. Finally, Section 6 concludes this work and mentions future work.
2. Related Work
Children's speech has long been studied for automatic speech recognition (ASR) [11, 12]. Early development of diarization systems for children's speech focused on child language acquisition analysis [13] or proprietary smart home devices [14, 15]. The diarization of four classes, namely a primary child, secondary child, adult, and non-speech, was first explored in [13]. The speaker-independent DNN-HMM system [13] achieved around 20% DER per child category. However, disentangling non-speech events and children's speech remains a challenge, as about 10% of the true child speech was misclassified as non-speech. One recent work [16] studied speech enhancement for the noisy environment of daylong realistic recordings, including SEEDLingS [17], a child-centered dataset. The proposed LSTM-based enhancement preprocessor with a built-in diarization system achieved a 39.2% DER [18]. The mixture PLDA model has mainly been studied for speaker verification tasks [19, 20]. The work of [20] showed better performance using a mixture of two gender-dependent PLDAs than a single gender-independent PLDA. Our work extends [20] to three classes of speakers and further develops the mixture of PLDAs for diarization.
3. Methods
In this section, we illustrate the methods to incorporate speaker type information into diarization. Subsection 3.1 explains an ideal segmentation step that takes account of oracle speaker type labels. Subsection 3.2 explains the concept of the mixture of PLDA models that encompasses speaker type priors. Lastly, subsection 3.3 describes an estimator of speaker type confidence scores.

3.1. Speaker Type Segmentation

The speaker type segmentation refers to the process which splits speech into three parts, each belonging to an adult female, adult male, or child class. Diarization is subsequently performed on each speech region of a class. As illustrated on the left of Figure 1, each of the speech splits then goes through uniform segmentation and is scored by a PLDA model trained on the data from the corresponding speaker type. Since the speaker types are mutually exclusive, the diarization proposals of speakers in each of the speech splits will be differentiated, i.e., speaker 1 of the female speech is not speaker 1 of the male speech. This is of course an ideal setup if provided with the gold speaker type labels. We find that the empirical performance using predicted labels from a classifier is worse compared to a standard baseline, because confusions made early between speaker types can cause wrong assignments of speakers later.

Figure 1: Diarization steps using speaker type information. Left: oracle speaker type segmentation; Right: mixture PLDA.
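The oracle routing on the left of Figure 1 can be sketched as follows. This is a minimal illustration, not the paper's implementation: `cluster_fn` is a hypothetical stand-in for any within-type clustering backend, and the segment tuples stand in for SAD output with oracle labels.

```python
from collections import defaultdict


def diarize_with_oracle_types(segments, cluster_fn):
    """Cluster segments separately per oracle speaker type.

    segments:   list of (segment_id, speaker_type) pairs, with
                speaker_type in {'F', 'M', 'C'} (oracle labels).
    cluster_fn: any within-type clustering routine; it receives the
                segment ids of one type and returns one integer
                cluster label per segment.
    Returns a dict segment_id -> speaker label, where labels are
    prefixed by type so speakers of different types can never merge.
    """
    by_type = defaultdict(list)
    for seg_id, spk_type in segments:
        by_type[spk_type].append(seg_id)

    hypothesis = {}
    for spk_type, seg_ids in by_type.items():
        labels = cluster_fn(seg_ids)
        for seg_id, lab in zip(seg_ids, labels):
            hypothesis[seg_id] = f"{spk_type}-spk{lab}"
    return hypothesis
```

Because the label space is partitioned by type, a child speaker and an adult speaker can never be assigned to the same cluster, which is exactly the property the oracle experiment exploits.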
3.2. Mixture of PLDA Models

To prevent the hard assignment of speaker types, a probabilistic framework is considered. The similarity scoring in diarization [21] relies on the likelihood ratio between the same-speaker hypothesis H_s and the different-speaker hypothesis H_d,

R = P(z_1, z_2 | H_s) / P(z_1, z_2 | H_d)    (1)

where z_1 and z_2 are the segment-level speaker representations, and the likelihoods may be obtained from either a single PLDA model or a mixture of PLDA models.

The PLDA model was originally proposed in [22], where data variations are captured by a latent class variable. In diarization, speech utterances are thought to vary between and within speakers. The PLDA model is often used to represent the speaker as a latent class variable. The model learns a projection space where the distance between representations of different speakers is maximized, and the distance between representations of the same speaker is minimized. Given a single PLDA model with the learned covariance ψ_s in the transformed space, the likelihood ratio in equation (1) can be represented by

LR(z_1, z_2 | ψ_s) = P(z_2 | z_1, ψ_s) / P(z_2 | ψ_s)    (2)

where z_1 and z_2 are conditionally independent given H_d. Although the single PLDA model provides a unified framework to compare speakers, it does not explicitly model the known variations of speakers, such as gender or age, that are often provided as metadata in a dataset.
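For intuition, the likelihood ratio of equations (1)-(2) has a closed form in the diagonalized PLDA space, where the within-speaker covariance is the identity and the across-speaker covariance is diag(ψ). A minimal numpy sketch, assuming length-normalized embeddings already projected into that space (this is an illustration of the scoring rule, not the Kaldi implementation):

```python
import numpy as np


def plda_llr(u1, u2, psi):
    """Log-likelihood ratio for a single PLDA in its diagonalized space.

    u1, u2: embeddings in the space where the within-class covariance
            is I and the across-class covariance is diag(psi).
    Same-speaker hypothesis: per dimension, (u1, u2) are jointly
    Gaussian with covariance [[psi+1, psi], [psi, psi+1]].
    Different-speaker hypothesis: u1, u2 independent, each N(0, psi+1).
    """
    u1, u2, psi = (np.asarray(a, float) for a in (u1, u2, psi))
    var = psi + 1.0                      # marginal variance per dim
    det_s = var**2 - psi**2              # det of the 2x2 joint covariance
    # quadratic form of the same-speaker joint Gaussian, per dimension
    q_s = (var * (u1**2 + u2**2) - 2.0 * psi * u1 * u2) / det_s
    log_num = -0.5 * (2.0 * np.log(2 * np.pi) + np.log(det_s) + q_s)
    log_den = -0.5 * (2.0 * np.log(2 * np.pi) + 2.0 * np.log(var)
                      + (u1**2 + u2**2) / var)
    return float(np.sum(log_num - log_den))
```

With a large ψ (speakers well separated), the score is positive for nearby embeddings and negative for embeddings pointing in opposite directions, matching the same/different-speaker interpretation.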
This is figuratively shown on the right of Figure 1. Under a mixture PLDA model, the numerator of equation (1) can be written as a convex combination of speaker-type dependent PLDA models,

P(z_1, z_2 | H_s) = Σ_{g_1 ∈ G, g_2 ∈ G} P(g_1, g_2 | H_s) P(z_1, z_2 | g_1, g_2, H_s)    (3)

where g_1 and g_2 are the speaker types in G = {'M', 'F', 'C'} corresponding to z_1 and z_2, and 'M', 'F', 'C' are the adult male, adult female, and child speaker types, respectively. The denominator of equation (1) can be written as

P(z_1, z_2 | H_d) = Σ_{g_1 ∈ G, g_2 ∈ G} P(g_1, g_2 | H_d) P(z_1, z_2 | g_1, g_2, H_d)    (4)

where there are 9 terms in the denominator (all pairwise combinations of speaker types). Given the same speaker, both g_1 and g_2 must belong to the same speaker type, so

P(g_1, g_2 | H_s) = P(g_1) = P(g_2),
P(z_1, z_2 | g_1, g_2, H_s) = P(z_1, z_2 | ψ_{g_1}) = P(z_1, z_2 | ψ_{g_2}).

Under different speakers,

P(g_1, g_2 | H_d) = P(g_1) × P(g_2),
P(z_1, z_2 | g_1, g_2, H_d) = P(z_1 | ψ_{g_1}) × P(z_2 | ψ_{g_2}),

where ψ_{g_1} and ψ_{g_2} are the parameters of the g_1-type PLDA and g_2-type PLDA, respectively. Here, we assume the distribution of a speaker type is independent under the different-speaker condition H_d. The single likelihood P(z_i | ψ_{g_i}) or the joint likelihood P(z_1, z_2 | ψ_g) can be obtained from the single PLDA model as described in [22]. But the prior distribution P(g) is a design choice to make.

3.3. Speaker Type Confidence Estimation

As explained in the previous section, the prior distribution of a speaker type is key to the mixture PLDA formulation. One simple way is to assume a constant prior for all recordings encountered in the evaluation. For instance, the prior of child speakers should be above the uniform threshold of 0.33 for diarization on children's speech. We briefly illustrate the other approach, which estimates the speaker type confidence from frame-level features.

3.3.1. Problem Formulation
Our goal is to obtain an informed prior probability P(g) for the mixture PLDA. We consider using the posterior P(g | X) given an input feature sequence X. Confidence estimates for each speaker type can be obtained by taking the softmax output of a neural network [23, 24]. The prior probability P(g_i) that a speech segment z_i belongs to a speaker type g, for all i ∈ {1, 2}, can be approximated by

P(g_i) ≈ P_nn(g_i | z_i) = (1 / T_i) Σ_{t=1}^{T_i} P_nn(g_i^t | X)    (5)

where z_i has T_i frames and P_nn(g_i^t | X) is the frame-wise posterior output of the network given the whole input sequence. We experimented with this using various sequence-to-sequence and Transformer architectures [23, 24], but found that although such a trained system can predict the correct speaker type label with around 75% accuracy, the performance gains do not transfer when the posteriors are used as mixture PLDA weights, motivating future work on the calibration of neural network output probabilities.
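Putting equations (3)-(5) together, the mixture score is a prior-weighted combination of per-type likelihoods: the same-speaker numerator has one term per type (g_1 = g_2 = g), while the 9-term denominator factorizes into a product of two per-segment mixtures because the types are independent under H_d. The sketch below assumes scalar toy embeddings and a per-type diagonalized PLDA with parameter ψ_g; the type names and ψ values are illustrative, not trained models.

```python
import numpy as np


def gauss_ll(x, var):
    """Log N(x; 0, var), summed over dimensions."""
    x, var = np.asarray(x, float), np.asarray(var, float)
    return float(np.sum(-0.5 * (np.log(2 * np.pi * var) + x**2 / var)))


def joint_ll_same(u1, u2, psi):
    """Log joint likelihood of (u1, u2) under ONE speaker of a
    psi-type PLDA: per dim, covariance [[psi+1, psi], [psi, psi+1]]."""
    u1, u2, psi = (np.asarray(a, float) for a in (u1, u2, psi))
    var = psi + 1.0
    det = var**2 - psi**2
    q = (var * (u1**2 + u2**2) - 2 * psi * u1 * u2) / det
    return float(np.sum(-0.5 * (2 * np.log(2 * np.pi) + np.log(det) + q)))


def mixture_llr(u1, u2, priors, psis):
    """Equations (3)-(4): mixture PLDA log-likelihood ratio.

    priors: dict type -> P(g);  psis: dict type -> psi array.
    Numerator: same speaker forces g1 = g2 = g (one term per type).
    Denominator: P(g1, g2 | H_d) = P(g1) P(g2), so the double sum
    factorizes into a product of two single-segment mixtures.
    """
    num = sum(priors[g] * np.exp(joint_ll_same(u1, u2, psis[g]))
              for g in priors)
    den = (sum(priors[g] * np.exp(gauss_ll(u1, psis[g] + 1.0)) for g in priors)
           * sum(priors[g] * np.exp(gauss_ll(u2, psis[g] + 1.0)) for g in priors))
    return float(np.log(num) - np.log(den))


def segment_prior(frame_post):
    """Equation (5): average the frame-wise softmax posteriors of a
    segment to obtain its speaker-type prior estimate."""
    return np.mean(np.asarray(frame_post, float), axis=0)
```

When the prior is degenerate (all mass on one type), `mixture_llr` reduces to the single PLDA score of that type, which is the oracle setup of Section 3.1.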
4. Experimental Setup
This section describes the experimental setup and data preparation.
Our diarization system mainly follows the x-vector-based system from the DIHARD 2018 [25] recipe in Kaldi [26]. We focus on extending the PLDA model within this pipeline. The input audio is sampled at 16 kHz. Mel-frequency cepstral coefficients (MFCC) are used as features; 30 cepstral coefficients are taken from 30 mel-frequency bins. The features are extracted over a 25 ms window with a frame rate of 10 ms. Both the delta and delta-delta features are appended. Cepstral mean normalization is applied over a sliding window of up to 3 seconds. After this pre-processing, each segment in the evaluation and the PLDA training data is subsegmented by a 1.5 s sliding window with a 0.75 s overlap. The speaker features are then extracted from the subsegments and length-normalized [27]. The x-vector embedding has 512 dimensions.
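The uniform subsegmentation step (1.5 s window, 0.75 s shift, so consecutive windows overlap by 0.75 s) can be sketched as follows; clipping the final window at the segment end is an assumption about edge handling, not a detail stated in the text.

```python
def uniform_subsegments(start, end, window=1.5, shift=0.75):
    """Cut a speech segment [start, end] (seconds) into overlapping
    subsegments: a `window`-second span advanced by `shift` seconds.
    The final window is clipped at the segment end."""
    subs = []
    t = start
    while t < end:
        subs.append((round(t, 2), round(min(t + window, end), 2)))
        if t + window >= end:
            break
        t += shift
    return subs
```

For example, a 3-second segment yields three subsegments, each sharing half its duration with its neighbor, which gives the embedding extractor overlapping views of the same speech.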
We use the VoxCeleb [28, 29] datasets for the adult speech. VoxCeleb1 and VoxCeleb2 [28, 29] are two versions of a large-scale dataset that contains interview videos of celebrities uploaded to YouTube. The speakers in the dataset are expected to be mainly adults. There are a total of 7325 speakers, 61% male and 39% female. We filter by gender in each dataset to train the individual PLDA models of the adult speaker types.
The CMU Kids [30] and CSLU Kids [31] corpora are used for child speech training. Both datasets were collected for speech recognition tasks, so the audio quality is considered clean. The ages of children in both datasets range from 5 to 10 years old. The combined set contains 1191 speakers and about 42.6 hours of speech. The average duration of an utterance is about 4 seconds.
We use a subset of the data provided in the Interspeech ComParE challenge [32] as the augmentation dataset, which consists mostly of baby crying sounds. The dataset contains 5.6k recordings with about 2.8 hours of baby vocalizations. We highlight that the average duration of a recording is only 1.8 seconds, which makes this small dataset hardly sufficient to train models on baby speakers alone.
BabyTrain is a newly aggregated dataset of 9 child-centered corpora [17] with daylong recordings in the home environment. The train, dev, and test splits of the dataset were prepared by the JSALT 2019 workshop [6] and cover a total of 270 hours of recordings. The ages of children in the dataset vary between 5 and 60 months. We adopt the provided dataset splits of 57.5% train, 27% development, and 15.5% test of the total audio length. The oracle mapping of speakers to the categories of key child, child, adult female, adult male, and others is provided. The distribution of speaker types in train is similar to test: about 46% child, 50% female, and 4% male. We exclude recordings where distant speakers are annotated but with an undefined speaker type label. This leaves us with 329 out of 413 files.

Our baseline system uses the single PLDA model trained on the VoxCeleb1, CMU Kids, and CSLU Kids corpora.
To obtain an estimate of the upper bound on the performance of the speaker type informed diarization system, we evaluate the system using the oracle speaker type labels, that is, P(F) = 1 when the speaker of the speech segment z is female. This effectively shifts the responsibility of likelihood ratio scoring to the one PLDA model trained on the true speaker type.

The mixture PLDA is compared to the single PLDA model in the evaluation. Both models are trained on the same dataset, where we further split the data by speaker types to train the mixture PLDA. To study the influence of data imbalance, we either use the whole dataset or randomly select 1000 speakers from each speaker type to compose the mixture PLDA. We further compare a nonuniform and a uniform prior over speaker types, where the nonuniform distribution over female, child, and male is 40%, 40%, and 20%, respectively.
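The balanced selection of 1000 speakers per type can be sketched as follows; the `spk2type` mapping is a hypothetical stand-in for the corpus metadata.

```python
import random
from collections import defaultdict


def balanced_speaker_subset(spk2type, n_per_type=1000, seed=0):
    """Randomly keep at most n_per_type speakers of each speaker type
    ('F', 'M', 'C'), to balance the mixture PLDA training data.

    spk2type: dict speaker_id -> speaker type.
    Returns the set of retained speaker ids.
    """
    by_type = defaultdict(list)
    for spk in sorted(spk2type):          # sort for reproducibility
        by_type[spk2type[spk]].append(spk)

    rng = random.Random(seed)
    keep = set()
    for spks in by_type.values():
        rng.shuffle(spks)
        keep.update(spks[:n_per_type])
    return keep
```

Types with fewer than `n_per_type` speakers (here, the male and child sets) are kept whole, so the cap only trims the over-represented classes.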
The primary evaluation metric of our experiments is the diarization error rate (DER). We score speech overlaps and do not use a no-score collar. The DER measures the cumulative duration of the following three types of errors over the total duration of the valid scoring regions:

1. False alarm (FA): classifying non-speech as speech
2. Miss (MS): classifying speech as non-speech
3. Speaker mismatch (SM): the actual speaker differs from the claimed speaker
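In terms of durations, the metric defined by the three error types above is simply:

```python
def diarization_error_rate(false_alarm, miss, speaker_mismatch, total_scored):
    """DER(%) = (FA + MS + SM) / total scored speech duration * 100.

    All arguments are durations in seconds accumulated over the valid
    scoring regions (overlaps scored, no collar applied)."""
    return 100.0 * (false_alarm + miss + speaker_mismatch) / total_scored
```

For instance, 10 s of false alarms, 20 s of misses, and 30 s of speaker mismatch over 600 s of scored speech gives a 10% DER.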
5. Results

We conduct three main experiments. The first examines the upper bound of the performance gain using the oracle speaker type segmentation. The second studies the effectiveness of the baby-vocalization augmentation. The last evaluates the single (UniPLDA) and mixture (MixPLDA) models as well as the influence of training data balance. The system performance is evaluated using DER given the gold number of speakers and oracle speech activity detection. The main results from the three experiments are presented in Table 1, Table 2, and Table 3. Details of the experiments are given in Section 4.
To study the benefit of speaker type information, the evaluation recording is split into three speech regions, one per speaker type, using the oracle labels. We instead had to use a score threshold to stop the clustering, since the number of speakers in each speaker-type segmented audio is unknown. The results are shown in Table 1.

UniPLDA Baseln | Oracle Speaker Type | Same Speaker
39.90 (-0.2)   | 28.2 (0.0)          | 40.26 (-)

Table 1: DER(%) (threshold) comparison between the baseline and the oracle speaker type.

Using the oracle speaker type reduces the UniPLDA baseline by a significant 11.7% DER. This verifies our claim that extra speaker type information is beneficial for diarization. The last entry in Table 1 illustrates the worst-case scenario, in which the system outputs only one speaker. This result also implies that the dominant speaker accounts for about 60% of the speech (~40% DER), and the remaining 40% belongs to other speakers.
The baby vocalization augmentation is found to be an effective domain adaptation of the child-type PLDA. We apply different augmentation techniques to the clean children's speech and compare the results in Table 2.

System    | Clean | mn    | v     | vn   | vmus
CHI-PLDA  | 39.61 | 39.28 | 38.74 | 37.7 | -

Table 2: DER(%) comparison between clean and different augmentations of the child-type PLDA. The music, noise, vocalization, and MUSAN [33] augmentations are labeled as m, n, v, and mus, respectively.
As shown above, using the vocalization augmentation alone (column 4) reduces the clean baseline by 0.9% DER. Compared to the gains from adding the double-sized or triple-sized samples generated by the music and noise augmentation (column 3) or the MUSAN augmentation [33] (column 6), the vocalizations seem to provide the most closely matched information to the target domain. Lastly, we find the vocalization augmentation to be complementary with the noise augmentation, which further reduces the clean baseline by an absolute 1.9% DER.
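One plausible sketch of the vocalization augmentation is additive mixing at a target signal-to-noise ratio, as in standard Kaldi-style noise augmentation; the SNR handling and the tiling of the short vocalization clips are assumptions, not the paper's exact recipe.

```python
import numpy as np


def add_vocalization(speech, vocal, snr_db):
    """Mix a baby-vocalization clip into a speech waveform as additive
    background noise at the given signal-to-noise ratio (dB).

    Both inputs are 1-D float arrays at the same sample rate; the
    short vocalization is tiled/trimmed to the speech length."""
    speech = np.asarray(speech, float)
    vocal = np.asarray(vocal, float)
    reps = int(np.ceil(len(speech) / len(vocal)))
    vocal = np.tile(vocal, reps)[:len(speech)]

    speech_pow = np.mean(speech**2)
    vocal_pow = np.mean(vocal**2) + 1e-12
    # scale the vocalization so speech_pow / (scale^2 * vocal_pow)
    # matches the requested SNR
    scale = np.sqrt(speech_pow / (vocal_pow * 10.0 ** (snr_db / 10.0)))
    return speech + scale * vocal
```

Since the ComParE clips average only 1.8 s, the tiling step matters: one clip must be repeated to cover a full child-speech utterance.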
We observe that combining the kids data and VoxCeleb1 to train the UniPLDA baseline achieves a 38.62% DER. Our proposed mixture of PLDA models with uniform weights (33%) on each speaker type, 'Male', 'Female', and 'Child', has very comparable performance to the UniPLDA. However, with a matched estimate of the speaker type prior for the evaluation, the MixPLDA outperforms the UniPLDA model (rows 3 and 5), showing the potential of the speaker type informed model. The cost of using a mixture model comes close to that of the single model, since the three models can be trained in parallel. The value of the constant prior may be elicited from human expert knowledge; either a manual inspection of samples or an intuitive estimate should be sufficient.

Table 3: DER(%) comparison between the single (Uni) and mixture PLDA (Mix-nunif and -unif). The size of the training data is shown by the number of speakers.

We find it important to keep the data balanced while training the MixPLDA. Comparing row 2 to 3 or row 4 to 5 in Table 3, we find on average a 1.9% DER improvement for the nonuniform MixPLDA, even though the data size is reduced by the balancing operation. On the contrary, the performance of the UniPLDA depends heavily on the size of the data (the largest is vmus, the second is vn, and the baseline is the smallest). The formulation of [20] shows that the likelihood ratio obtained under the MixPLDA is equivalent to a weighted sum of the likelihood ratio scores obtained from each of the mixed PLDAs. We suspect this may explain why data balancing is helpful, since similarly constrained data can limit the modeling space that each PLDA covers.
6. Conclusion and Future Work
In this paper, we presented a diarization framework using a mixture PLDA model targeted at the children's speech domain. We discovered that speaker type information is beneficial and verified a large upper bound of improvement. Empirically, the mixture of speaker-type PLDA models outperforms the single PLDA model when balanced training data is used. Though the best result is obtained by the single PLDA, the best mixture PLDA system comes close, with an absolute 0.2% difference in DER and half of the training size required. Using baby vocalizations as additive background noise has been shown to match the age and acoustic conditions of children's speech. Our future work is to develop a confidence score estimator of speaker types using neural networks, as outlined in this paper.
7. Acknowledgements
The authors would like to thank Jesús Villalba for the constructive discussion on the mixture PLDA formulation.

8. References

[1] S. E. Tranter and D. A. Reynolds, "An overview of automatic speaker diarization systems," IEEE Transactions on Audio, Speech, and Language Processing, vol. 14, no. 5, pp. 1557-1565, Sep. 2006.
[2] X. Anguera, S. Bozonnet, N. Evans, C. Fredouille, G. Friedland, and O. Vinyals, "Speaker diarization: A review of recent research," IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 2, pp. 356-370, Feb. 2012.
[3] A. McCree, G. Sell, and D. Garcia-Romero, "Speaker diarization using leave-one-out Gaussian PLDA clustering of DNN embeddings," in Proc. Interspeech 2019, pp. 381-385, 2019.
[4] M. Senoussaoui, P. Kenny, T. Stafylakis, and P. Dumouchel, "A study of the cosine distance-based mean shift for telephone speech diarization," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, no. 1, pp. 217-227, 2013.
[5] S. Watanabe, M. Mandel, J. Barker, and E. Vincent, "CHiME-6 challenge: Tackling multispeaker speech recognition for unsegmented recordings," arXiv preprint arXiv:2004.09249, 2020.
[6] P. García, J. Villalba, H. Bredin, J. Du, D. Castan, A. Cristia, L. Bullock, L. Guo, K. Okabe, P. S. Nidadavolu et al., "Speaker detection in the wild: Lessons learned from JSALT 2019," arXiv preprint arXiv:1912.00938, 2019.
[7] N. Ryant, K. Church, C. Cieri, A. Cristia, J. Du, S. Ganapathy, and M. Liberman, "First DIHARD challenge evaluation plan," 2018.
[8] M. Gerosa, S. Lee, D. Giuliani, and S. Narayanan, "Analyzing children's speech: An acoustic study of consonants and consonant-vowel transition," in Proc. ICASSP, vol. 1. IEEE, 2006.
[9] A. Cristià, S. Ganesh, M. Casillas, and S. Ganapathy, "Talker diarization in the wild: The case of child-centered daylong audio-recordings," in Interspeech, 2018.
[10] J. Xie, L. P. Garcia-Perera, D. Povey, and S. Khudanpur, "Multi-PLDA diarization on children's speech," 2019.
[11] A. Potamianos, S. Narayanan, and S. Lee, "Automatic speech recognition for children," in Fifth European Conference on Speech Communication and Technology, 1997.
[12] S. Ghai and R. Sinha, "A study on the effect of pitch on LPCC and PLPC features for children's ASR in comparison to MFCC," in Twelfth Annual Conference of the International Speech Communication Association, 2011.
[13] M. Najafian and J. H. Hansen, "Speaker independent diarization for child language environment analysis using deep neural networks," in IEEE Spoken Language Technology Workshop (SLT), 2016, pp. 114-120.
[14] D. Xu, U. Yapanel, and S. Gray, "Reliability of the LENA language environment analysis system in young children's natural home environment," Boulder, CO: LENA Foundation, pp. 1-16, 2009.
[15] M. Ford, C. T. Baer, D. Xu, U. Yapanel, and S. Gray, "The LENA language environment analysis system: Audio specifications of the DLP-0121," Boulder, CO: LENA Foundation, 2008.
[16] L. Sun, J. Du, T. Gao, Y.-D. Lu, Y. Tsao, C.-H. Lee, and N. Ryant, "A novel LSTM-based speech preprocessor for speaker diarization in realistic mismatch conditions," in Proc. ICASSP, 2018, pp. 5234-5238.
[17] M. VanDam, A. S. Warlaumont, E. Bergelson, A. Cristia, M. Soderstrom, P. De Palma, and B. MacWhinney, "HomeBank: An online repository of daylong child-centered audio recordings," in Seminars in Speech and Language, vol. 37, no. 02. Thieme Medical Publishers, 2016, pp. 128-142.
[18] D. Vijayasenan and F. Valente, "DiarTk: An open source toolkit for research in multistream speaker diarization and its application to meetings recordings," in Thirteenth Annual Conference of the International Speech Communication Association, 2012.
[19] M.-W. Mak, X. Pang, and J.-T. Chien, "Mixture of PLDA for noise robust i-vector speaker verification," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 1, pp. 130-142, 2015.
[20] M. Senoussaoui, P. Kenny, N. Brümmer, E. de Villiers, and P. Dumouchel, "Mixture of PLDA models in i-vector space for gender-independent speaker recognition," in Twelfth Annual Conference of the International Speech Communication Association, 2011.
[21] G. Sell and D. Garcia-Romero, "Speaker diarization with PLDA i-vector scoring and unsupervised calibration," in IEEE Spoken Language Technology Workshop (SLT), 2014, pp. 413-417.
[22] S. Ioffe, "Probabilistic linear discriminant analysis," in European Conference on Computer Vision. Springer, 2006, pp. 531-542.
[23] Z. Dai, Z. Yang, Y. Yang, J. Carbonell, Q. V. Le, and R. Salakhutdinov, "Transformer-XL: Attentive language models beyond a fixed-length context," arXiv preprint arXiv:1901.02860, 2019.
[24] I. Sutskever, O. Vinyals, and Q. V. Le, "Sequence to sequence learning with neural networks," in Advances in Neural Information Processing Systems, 2014, pp. 3104-3112.
[25] G. Sell, D. Snyder, A. McCree, D. Garcia-Romero, J. Villalba, M. Maciejewski, V. Manohar, N. Dehak, D. Povey, S. Watanabe et al., "Diarization is hard: Some experiences and lessons learned for the JHU team in the inaugural DIHARD challenge," in Interspeech, vol. 2018, 2018, pp. 2808-2812.
[26] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz et al., "The Kaldi speech recognition toolkit," IEEE Signal Processing Society, Tech. Rep., 2011.
[27] D. Garcia-Romero and C. Y. Espy-Wilson, "Analysis of i-vector length normalization in speaker recognition systems," in Twelfth Annual Conference of the International Speech Communication Association, 2011.
[28] A. Nagrani, J. S. Chung, and A. Zisserman, "VoxCeleb: A large-scale speaker identification dataset," in Interspeech, 2017.
[29] J. S. Chung, A. Nagrani, and A. Zisserman, "VoxCeleb2: Deep speaker recognition," in Interspeech, 2018.
[30] M. Eskenazi, J. Mostow, and D. Graff, "The CMU Kids corpus," Linguistic Data Consortium, 1997.
[31] K. Shobaki, J.-P. Hosom, and R. Cole, "CSLU: Kids' speech version 1.1," Linguistic Data Consortium, 2007.
[32] B. W. Schuller, S. Steidl, A. Batliner, P. B. Marschik, H. Baumeister, F. Dong, S. Hantke, F. B. Pokorny, E.-M. Rathner, K. D. Bartl-Pokorny et al., "The Interspeech 2018 computational paralinguistics challenge: Atypical & self-assessed affect, crying & heartbeats," in Interspeech, 2018, pp. 122-126.
[33] D. Snyder, G. Chen, and D. Povey, "MUSAN: A music, speech, and noise corpus," arXiv preprint arXiv:1510.08484, 2015.