Neural PLDA Modeling for End-to-End Speaker Verification
Shreyas Ramoji, Prashant Krishnan, Sriram Ganapathy
Learning and Extraction of Acoustic Patterns (LEAP) Lab, Department of Electrical Engineering, Indian Institute of Science, Bengaluru, India
{shreyasr, prashantkv1, sriramg}@iisc.ac.in

Abstract
While deep learning models have made significant advances in supervised classification problems, the application of these models for out-of-set verification tasks like speaker recognition has been limited to deriving feature embeddings. The state-of-the-art x-vector PLDA based speaker verification systems use a generative model based on probabilistic linear discriminant analysis (PLDA) for computing the verification score. Recently, we had proposed a neural network approach for backend modeling in speaker verification called the neural PLDA (NPLDA), where the likelihood ratio score of the generative PLDA model is posed as a discriminative similarity function and the learnable parameters of the score function are optimized using a verification cost. In this paper, we extend this work to achieve joint optimization of the embedding neural network (x-vector network) with the NPLDA network in an end-to-end (E2E) fashion. The proposed end-to-end model is optimized directly from the acoustic features with a verification cost function, and during testing, the model directly outputs the likelihood ratio score. With various experiments using the NIST speaker recognition evaluation (SRE) 2018 and 2019 datasets, we show that the proposed E2E model improves significantly over the x-vector PLDA baseline speaker verification system.
Index Terms: NPLDA, End-to-End Systems, Speaker Verification
1. Introduction
Automatic speaker verification (ASV) has several applications such as voice biometrics for commercial applications, speaker detection in surveillance, speaker diarization, etc. A speaker is enrolled using one or more sample utterances, and the task of ASV is to detect whether the target speaker is present in a given test utterance or not. Several challenges have been organized over the years for benchmarking and advancing speaker verification technology, such as the NIST speaker recognition evaluation (SRE) challenge 2019 [1], the VoxCeleb speaker recognition challenge (VoxSRC) [2] and the VOiCES challenge [3]. The major challenges in speaker verification include language mismatch in testing, short duration audio and the presence of noise/reverberation in the speech data.

The state-of-the-art systems in speaker verification use a model to extract embeddings of fixed dimension from utterances of variable duration. The earlier approaches based on the unsupervised Gaussian mixture model (GMM) i-vector extractor [4] have recently been replaced with neural embedding extractors [5, 6], which are trained on large amounts of data for supervised speaker classification tasks. These fixed dimensional embeddings are pre-processed with a length normalization technique [7] followed by a probabilistic linear discriminant analysis (PLDA) based backend modeling approach [8].

In our previous work, we had explored a discriminative neural PLDA (NPLDA) approach [9] to backend modeling in which a discriminative similarity function was used. The learnable parameters of the NPLDA model were optimized using an approximation of the minimum detection cost function (DCF). This model also showed good improvements in our SRE evaluations and in the VOiCES from a distance challenge [10, 11]. In this paper, we extend this work to propose a joint modeling framework that optimizes both the front-end x-vector embedding model and the backend NPLDA model in a single end-to-end (E2E) neural framework. The proposed model is initialized with the pre-trained x-vector time delay neural network (TDNN). The NPLDA E2E model is then fully trained on pairs of speech utterances starting directly from the mel-frequency cepstral coefficient (MFCC) features. The advantage of this method is that both the embedding extractor and the final score computation are optimized on pairs of utterances with the speaker verification metric. With experiments on the NIST SRE 2018 and 2019 datasets, we show that the proposed NPLDA E2E model improves significantly over the baseline system using x-vectors and generative PLDA modeling.

This work was funded by the Ministry of Human Resources Development (MHRD) of India and the Department of Science and Technology (DST) EMR/2016/007934 grant.
2. Related Prior Work
The common approaches for scoring in speaker verification systems include support vector machines (SVMs) [12] and probabilistic linear discriminant analysis (PLDA) [8]. Some efforts on pairwise generative and discriminative modeling are discussed in [13, 14, 15]. The discriminative version of PLDA with logistic regression and support vector machine (SVM) kernels has also been explored in [16]. In that work, the authors use the functional form of the generative model and pool all the parameters to be trained into a single long vector. These parameters are then discriminatively trained using the SVM loss function with pairs of input vectors. The discriminative PLDA (DPLDA) is, however, prone to over-fitting on the training speakers and leads to degradation on unseen speakers in SRE evaluations [17]. The regularization of the embedding extractor network using a Gaussian backend scoring has been investigated in [18]. Other recent developments in this direction include efforts in using the approximate DCF metric for text-dependent speaker verification [19].

Recently, some end-to-end approaches for speaker verification have been examined. For example, in [20], the PLDA scoring that follows the i-vector extraction is jointly derived using a deep neural network architecture, and the entire model is trained using a binary cross entropy training criterion. In [21], a generalized end-to-end loss that minimizes the distances to within-speaker centroids while maximizing across-speaker distances was proposed. In another E2E effort, the use of triplet loss has been explored [22]. However, in spite of these efforts, most state-of-the-art systems use a generative PLDA backend model with x-vectors or similar neural network embeddings.

Figure 1: End-to-End x-vector NPLDA architecture for Speaker Verification.
3. Background
The PLDA model on the processed x-vector embedding $\eta_r$ (after centering, LDA transformation and unit length normalization) is given by

$$\eta_r = \Phi \omega + \epsilon_r \qquad (1)$$

where $\omega$ is the latent speaker factor with a Gaussian prior of $\mathcal{N}(\mathbf{0}, \mathbf{I})$, $\Phi$ characterizes the speaker subspace matrix, and $\epsilon_r$ is the residual assumed to have the distribution $\mathcal{N}(\mathbf{0}, \Sigma)$. For scoring, a pair of embeddings, $\eta_e$ from the enrollment recording and $\eta_t$ from the test recording, are used with the PLDA model to compute the log-likelihood ratio score given by

$$s(\eta_e, \eta_t) = \eta_e^{\top} Q \eta_e + \eta_t^{\top} Q \eta_t + 2\,\eta_e^{\top} P \eta_t + \text{const} \qquad (2)$$

where

$$Q = \Sigma_{tot}^{-1} - \left(\Sigma_{tot} - \Sigma_{ac}\Sigma_{tot}^{-1}\Sigma_{ac}\right)^{-1} \qquad (3)$$

$$P = \Sigma_{tot}^{-1}\Sigma_{ac}\left(\Sigma_{tot} - \Sigma_{ac}\Sigma_{tot}^{-1}\Sigma_{ac}\right)^{-1} \qquad (4)$$

with $\Sigma_{tot} = \Phi\Phi^{\top} + \Sigma$ and $\Sigma_{ac} = \Phi\Phi^{\top}$. In the Kaldi implementation of PLDA, a diagonalizing transformation which simultaneously diagonalizes the within and between speaker covariances is computed, which reduces $P$ and $Q$ to diagonal matrices.

In the discriminative NPLDA approach [11], we construct the pre-processing steps of LDA as the first affine layer, the unit-length normalization as a non-linear activation, and the PLDA centering and diagonalization as another affine transformation. The final PLDA pair-wise scoring given in Eq. (2) is implemented as a quadratic layer (Fig. 1). Thus, the NPLDA implements the pre-processing of the x-vectors and the PLDA scoring as a neural backend.
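As a reference for Eqs. (2)-(4), the following minimal NumPy sketch computes the pairwise log-likelihood ratio from a speaker subspace matrix Phi and residual covariance Sigma. The variable names and the omission of the additive constant are illustrative choices, not the exact implementation used in this paper.

```python
import numpy as np

def plda_llr_score(eta_e, eta_t, Phi, Sigma):
    """Pairwise PLDA log-likelihood ratio of Eqs. (2)-(4),
    up to the additive constant (illustrative sketch)."""
    Sigma_ac = Phi @ Phi.T                      # across-speaker covariance
    Sigma_tot = Sigma_ac + Sigma                # total covariance
    Sigma_tot_inv = np.linalg.inv(Sigma_tot)
    # Common factor (Sigma_tot - Sigma_ac Sigma_tot^{-1} Sigma_ac)^{-1}
    M = np.linalg.inv(Sigma_tot - Sigma_ac @ Sigma_tot_inv @ Sigma_ac)
    Q = Sigma_tot_inv - M                       # Eq. (3)
    P = Sigma_tot_inv @ Sigma_ac @ M            # Eq. (4)
    return (eta_e @ Q @ eta_e + eta_t @ Q @ eta_t
            + 2.0 * eta_e @ P @ eta_t)          # Eq. (2), const omitted
```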
To train the NPLDA for the task of speaker verification, we sample pairs of x-vectors representing the target hypothesis (from the same speaker) and the non-target hypothesis (from different speakers). The normalized detection cost function (DCF) [23] for a detection threshold $\theta$ is defined as

$$C_{Norm}(\beta, \theta) = P_{Miss}(\theta) + \beta P_{FA}(\theta) \qquad (5)$$

where $\beta$ is an application based weight defined as

$$\beta = \frac{C_{FA}(1 - P_{target})}{C_{Miss}\, P_{target}} \qquad (6)$$

where $C_{Miss}$ and $C_{FA}$ are the costs assigned to misses and false alarms, and $P_{target}$ is the prior probability of a target trial. $P_{Miss}$ and $P_{FA}$ are the probabilities of miss and false alarm respectively, and are computed by applying a detection threshold of $\theta$ to the log-likelihood ratios. A differentiable approximation of the normalized detection cost was proposed in [11, 19]:

$$P_{Miss}^{\text{(soft)}}(\theta) = \frac{\sum_{i=1}^{N} t_i \left[1 - \sigma(\alpha(s_i - \theta))\right]}{\sum_{i=1}^{N} t_i} \qquad (7)$$

$$P_{FA}^{\text{(soft)}}(\theta) = \frac{\sum_{i=1}^{N} (1 - t_i)\,\sigma(\alpha(s_i - \theta))}{\sum_{i=1}^{N} (1 - t_i)} \qquad (8)$$

Here, $i$ is the trial index, $s_i$ is the system score, $t_i$ denotes the ground truth label for trial $i$, and $\sigma$ denotes the sigmoid function. $N$ is the total number of trials in the minibatch over which the cost is computed. By choosing a large enough value for the warping factor $\alpha$, the approximation can be made arbitrarily close to the actual detection cost function for a wide range of thresholds. The minimum detection cost (minDCF) is achieved at the threshold where the DCF is minimized:

$$\text{minDCF} = \min_{\theta} C_{Norm}(\beta, \theta) \qquad (9)$$

The threshold $\theta$ is included in the set of learnable parameters of the neural network. This way, the network learns to minimize the minDCF as a function of all the parameters through backpropagation.
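A minimal PyTorch sketch of the differentiable cost in Eqs. (5), (7) and (8) is given below. The values chosen for alpha and beta are placeholders; in practice beta follows Eq. (6) from the application-specific costs and target prior.

```python
import torch

def soft_detection_cost(scores, labels, theta, beta=99.0, alpha=20.0):
    """Differentiable approximation of C_Norm in Eqs. (5), (7), (8).
    scores: LLR scores s_i; labels: 1.0 for target, 0.0 for non-target trials.
    theta is the detection threshold; beta and alpha are illustrative values."""
    probs = torch.sigmoid(alpha * (scores - theta))
    p_miss = torch.sum(labels * (1.0 - probs)) / torch.sum(labels)
    p_fa = torch.sum((1.0 - labels) * probs) / torch.sum(1.0 - labels)
    return p_miss + beta * p_fa

# theta can be registered as a parameter so backpropagation also minimizes over it:
# theta = torch.nn.Parameter(torch.tensor(0.0))
```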
4. End-to-End Modeling
The model we explore is a concatenated version of two parameter-tied x-vector extractors (TDNN networks [24]) with the NPLDA model (Fig. 1). The end-to-end model processes the mel frequency cepstral coefficients (MFCCs) of a pair of utterances to output a score. The MFCC features are passed through nine time delay neural network (TDNN) layers followed by a statistics pooling layer. The statistics pooling layer is followed by a fully connected layer with a unit length normalization non-linearity. This is followed by a linear layer and a quadratic layer as a function of the enrollment and test embeddings to output a score. The parameters of the TDNN and the affine layers on the enrollment and test sides of the E2E model are tied, which makes the model symmetric.
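The tied-parameter structure can be sketched as a Siamese-style module in PyTorch as shown below. The encoder object (standing in for the TDNN, pooling and affine layers) and the parameterization of the quadratic scoring layer with matrices P and Q are simplifications of the architecture in Fig. 1, not its exact implementation.

```python
import torch
import torch.nn as nn

class E2EScorer(nn.Module):
    """Schematic tied-parameter (Siamese) pair scorer; names are placeholders."""
    def __init__(self, encoder, emb_dim):
        super().__init__()
        self.encoder = encoder                  # shared TDNN + pooling + affine stack
        self.P = nn.Parameter(torch.eye(emb_dim))
        self.Q = nn.Parameter(torch.eye(emb_dim))

    def forward(self, feats_enroll, feats_test):
        # The same encoder weights process both sides, making the model symmetric.
        eta_e = self.encoder(feats_enroll)      # (batch, emb_dim)
        eta_t = self.encoder(feats_test)
        quad_e = torch.sum((eta_e @ self.Q) * eta_e, dim=-1)
        quad_t = torch.sum((eta_t @ self.Q) * eta_t, dim=-1)
        cross = torch.sum((eta_e @ self.P) * eta_t, dim=-1)
        return quad_e + quad_t + 2.0 * cross    # quadratic scoring layer, as in Eq. (2)
```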
We can estimate the memory required for a single iteration (batch update) of training as the sum of the memory required to store the network parameters, the gradients, and the forward and backward activations of each batch. In this end-to-end network, each training batch of $N$ trials can have up to $2N$ unique utterances, assuming there are no repetitions. For simplicity, let us assume that each utterance corresponds to $T$ frames. We denote $k_i$ as the dimension of the input to the $i$-th TDNN layer, with a TDNN context of $c_i$ frames. The activation memory required can then be estimated as $2NT \sum_i k_i c_i$ values, multiplied by the number of bytes used to store each value. The GPU memory is thus limited by the total number of frames that go into the TDNN, captured by the factor $2NT$. A large batch size of the kind used in [10] is infeasible for the end-to-end model, as it results in a prohibitively large GPU memory load. Hence, we resorted to a sampling strategy to reduce the GPU memory requirements.
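As a back-of-the-envelope helper for this estimate, the sketch below multiplies the utterance and frame counts by the per-frame activation sizes of each TDNN layer. The 4-byte default is an assumed single-precision storage size, and parameters, gradients and optimizer state are deliberately ignored.

```python
def tdnn_activation_memory_bytes(num_utterances, num_frames, layer_dims, contexts,
                                 bytes_per_value=4):
    """Rough per-batch activation memory: utterances x frames per utterance
    x sum_i(k_i * c_i) values, times the storage size of each value."""
    per_frame = sum(k * c for k, c in zip(layer_dims, contexts))
    return num_utterances * num_frames * per_frame * bytes_per_value

# For a batch of N trials with no repeated utterances, num_utterances would be 2N.
```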
In the current work, in order to avoid memory explosion in the x-vector extraction stage of the E2E model, we propose to use a small number of utterances (64) in a batch, with about 20 sec. of audio in each utterance. These utterances are drawn from m speakers, with m varied across batches. The utterances are split randomly into two halves for each speaker to form the enrollment and test sides of the trials. The MFCC features of the enrollment and test utterances are transformed to utterance embeddings η_e and η_t (as shown in Fig. 1). Each pair of enrollment and test utterances is given a label indicating whether the trial belongs to the target class (same speaker) or the non-target class (different speakers). In this way, while the number of utterances is small, the number of trials used in the batch is 1024. Using the label information and the cost function defined in Eq. (5), the gradients are back-propagated to update the entire set of E2E model parameters. The implementation of this model can be found at https://github.com/iiscleap/E2E-NPLDA.

This algorithm is applied separately to the male and female partitions of each training dataset to ensure the trials are gender and domain matched. All the utterances used in a batch come from the same gender and the same dataset (to avoid cross gender and cross language trials). The algorithm is repeated multiple times with different numbers of speakers (m), for the male and female partitions of every dataset. Finally, all the training batches are pooled together and randomized.

In contrast, the trial sampling algorithm used in our previous work on NPLDA [11, 10] was much simpler. For each gender of each dataset, we sample an enrollment utterance from a randomly sampled speaker, and sample another utterance from either the same speaker or a different speaker to get a target or a non-target trial. This was done without any repetition of utterances, to ensure that each utterance appears once per sampling epoch. This procedure was repeated numerous times for multiple datasets and for both genders to obtain the required number of trials. All the trials were then pooled together, shuffled and split into batches.

It is worth noting that the batch statistics of the two sampling methods are significantly different. A batch of trials in the previous sampling method (Algo. 1) can contain trials from multiple datasets and genders, whereas in the modified sampling method, which we will refer to as Algo. 2, all the trials in a batch are from a particular gender of a particular dataset.
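A simplified sketch of the proposed batch sampling (Algo. 2) for one gender/dataset partition is shown below. The data structures and the exhaustive pairing of the two halves are illustrative assumptions rather than the exact procedure.

```python
import random

def sample_batch(utts_by_speaker, num_speakers):
    """Sketch of the proposed sampling (Algo. 2): draw a few speakers from one
    gender/dataset partition, split each speaker's utterances into enrollment
    and test halves, and pair the halves to form labelled trials."""
    speakers = random.sample(list(utts_by_speaker), num_speakers)
    enroll, test = [], []
    for spk in speakers:
        utts = list(utts_by_speaker[spk])
        random.shuffle(utts)
        half = len(utts) // 2
        enroll += [(spk, u) for u in utts[:half]]
        test += [(spk, u) for u in utts[half:]]
    # Label is 1 for target (same speaker) and 0 for non-target trials.
    return [(ue, ut, int(se == st)) for se, ue in enroll for st, ut in test]
```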
5. Experiments and Results
This work is an extension of our work in [10]. The x-vector model is trained using the extended time-delay neural network (E-TDNN) architecture described in [24]. This uses 10 layers of TDNNs followed by a statistics pooling layer. Once the network is trained, x-vectors of 512 dimensions are extracted from the affine component of layer 12 in the E-TDNN architecture. By combining the VoxCeleb 1 & 2 datasets [2] with the Switchboard, Mixer 6, SRE04-10, SRE16 evaluation and SRE18 evaluation sets, we obtained a large pool of training recordings and speakers. The datasets were augmented with the 5-fold augmentation strategy used in the previous models. In order to reduce the weight given to the VoxCeleb speakers (out-of-domain compared to conversational telephone speech (CTS)), we also subsampled the augmented VoxCeleb portion. The x-vector model is trained on MFCC features computed with a mel-scale filter bank, mean-normalized over a sliding window of up to 3 seconds, with speaker targets, using the Kaldi toolkit. More information about the model can be found in [10].

The various backend PLDA models are trained on the SRE18 evaluation dataset. The evaluation datasets used include the SRE18 development and the SRE19 evaluation datasets. We perform several experiments under various conditions. The primary baseline to benchmark our systems is the Gaussian PLDA backend implementation in the Kaldi toolkit (GPLDA). The Kaldi implementation models the average embedding x-vector of each training speaker. The x-vectors are centered, reduced in dimensionality using LDA to 170 dimensions, and then unit length normalized.

Table 1: Performance comparison of training utterance durations (full utterances vs. 20 second segments) on the GPLDA and NPLDA [10] models.

Model         Duration of    SRE18 Dev            SRE19 Eval
              utterance      EER (%)   C_Min      EER (%)   C_Min
GPLDA (G1)    Full           6.43      0.417      6.18      0.512
GPLDA (G2)    20 secs        5.96      0.436      5.80      0.518
NPLDA (N1)    Full           5.33      0.389      5.10      0.443
NPLDA (N2)    20 secs        5.57      0.359      5.32      0.432
Table 2: Performance comparison of the two trial sampling techniques with the NPLDA [10] method: the previous sampling method (Algo. 1) and the proposed new sampling method (Algo. 2).

Model         Sampling    SRE18 Dev            SRE19 Eval
                          EER (%)   C_Min      EER (%)   C_Min
NPLDA (N2)    Algo. 1     5.57      0.359      5.32      0.432
NPLDA (N3)    Algo. 2     5.23      0.338      5.73      0.439
In the traditional x-vector system, the statistics pooling layer computes the mean and standard deviation of the final TDNN layer outputs. These two statistics are then concatenated into a fixed dimensional embedding. We also perform experiments where we use the variance instead of the standard deviation and compare the performance.

In the following sections, we study the influence of reduced training duration, and provide a performance comparison of the sampling methods (Algo. 1 vs. Algo. 2). We then compare the performance of the Gaussian PLDA (GPLDA), the Neural PLDA (NPLDA), and the proposed end-to-end approach (E2E). The PLDA backend training dataset used is the SRE18 evaluation dataset. We report our results on the SRE18 development set and the SRE19 evaluation dataset using two cost metrics, the equal error rate (EER) and the minimum DCF (C_Min), which are the primary cost metrics for the SRE19 evaluations.
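A minimal sketch of the statistics pooling options compared here (mean concatenated with either the standard deviation or the variance) is given below; the tensor layout is an assumption.

```python
import torch

def statistics_pooling(frames, use_variance=False, eps=1e-10):
    """Pool frame-level TDNN outputs of shape (batch, frames, dim) into a
    fixed-length embedding by concatenating the mean with either the
    standard deviation or the variance (the two options compared in Table 3)."""
    mean = frames.mean(dim=1)
    var = frames.var(dim=1, unbiased=False) + eps
    spread = var if use_variance else torch.sqrt(var)
    return torch.cat([mean, spread], dim=-1)
```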
As discussed in Section 4, due to GPU memory considerations and ease of implementation, we create a modified dataset by splitting longer utterances into 20 second chunks (2000 frames) after voice activity detection (VAD) and mean normalization. We compare the performance of the models on the modified dataset and the original one. The results are reported in Table 1. We observe that the performance of the systems is quite comparable. This allows us to proceed with these conditions in the implementation of the end-to-end model. All subsequently reported models use 20 second chunks for PLDA training.
The way the training trials are generated is crucial to how the model trains and performs. The performance comparison of the two sampling techniques, with PLDA models trained on the SRE18 evaluation dataset, can be seen in Table 2. Although the nature of the batch-wise trials has changed significantly in terms of the number of speakers in each batch and the gender matching of batches in the proposed new sampling method (Algo. 2), we see that its performance is comparable to that of our previous sampling method (Algo. 1).
Table 3: Performance comparison between the GPLDA, NPLDA and E2E models using standard deviation and variance as the secondary pooling functions. The model used to initialize each network is denoted in the third column.

Model         Pooling     Init.   SRE18 Dev            SRE19 Eval
              function            EER (%)   C_Min      EER (%)   C_Min
GPLDA (G2)    StdDev      -       5.96      0.436      5.80      0.518
GPLDA (G3)    Var         -       7.23      0.459      6.33      0.560
NPLDA (N2)    StdDev      G2      5.57      0.359      5.32      0.432
NPLDA (N4)    Var         G3      6.05      0.377      5.91      0.465
E2E (E1)      StdDev      N2
E2E (E2)      Var         N4      5.60
Using the proposed sampling method, we generate batches of 1024 trials using 64 utterances per batch. Both the NPLDA and E2E models were trained with this batch size. We use the Adam optimizer for the backpropagation learning. The performance of these models is reported in Table 3. The NPLDA model is initialized with the GPLDA model. The initialization details of the models, along with the pooling functions, are reported in the table. We compare performances using two different statistics (StdDev or Var). We observe significant improvements of the NPLDA over the GPLDA system and, subsequently, of the E2E system over the NPLDA. Comparing the E2E and GPLDA systems, we observe relative improvements on both the SRE18 development and SRE19 evaluation sets in terms of the C_Min metric, for both the standard deviation and the variance pooling functions. Though the cost function in the neural network aims to minimize the detection cost function (DCF), we also see improvements in the EER metric using the proposed approach. These results show that the joint E2E training with a single neural pipeline and optimization results in improved speaker recognition performance.
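To illustrate the optimization setup described above, the following sketch combines the earlier scorer and soft detection cost sketches in a single Adam update per batch. The encoder, the data iterator and the learning rate are placeholders, not values reported in the paper.

```python
import torch

# Illustrative training step; E2EScorer and soft_detection_cost are the
# sketches given earlier, and my_tdnn_encoder / trial_batches are placeholders.
model = E2EScorer(encoder=my_tdnn_encoder, emb_dim=512)
theta = torch.nn.Parameter(torch.tensor(0.0))          # learnable threshold
optimizer = torch.optim.Adam(list(model.parameters()) + [theta], lr=1e-4)

for feats_enroll, feats_test, labels in trial_batches:  # e.g. 1024 trials per batch
    scores = model(feats_enroll, feats_test)
    loss = soft_detection_cost(scores, labels.float(), theta)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```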
6. Summary and Conclusions
This paper explores a step in the direction of a neural end-to-end (E2E) approach for speaker verification tasks. It is an extension of our work on a discriminative neural PLDA (NPLDA) backend. The proposed model is a single elegant end-to-end approach that optimizes directly from acoustic features like MFCCs with a verification cost function to output a likelihood ratio score. We discuss the factors that were key in implementing the E2E model; these involved modifying the duration of the training utterances and developing a new sampling technique to generate training trials. The model shows considerable improvements over the generative Gaussian PLDA and the NPLDA models on the NIST SRE 2018 and 2019 datasets. One drawback of the proposed method is the requirement to initialize the E2E model with the pre-trained weights of an x-vector network.

Future work in this direction could include investigating better sampling algorithms such as the use of curriculum learning [25], different loss functions, and improved architectures for the embedding extractor using attention and other sequence models such as LSTMs.

7. References
[2] A. Nagrani, J. S. Chung, and A. Zisserman, "VoxCeleb: A large-scale speaker identification dataset," in Proc. Interspeech, 2017, pp. 2616-2620.
[3] M. K. Nandwana, J. van Hout, C. Richey, M. McLaren, M. A. Barrios, and A. Lawson, "The VOiCES from a Distance Challenge 2019," in Proc. Interspeech, 2019, pp. 2438-2442.
[4] N. Dehak, P. J. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, "Front-end factor analysis for speaker verification," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 4, pp. 788-798, 2010.
[5] D. Snyder, P. Ghahremani, D. Povey, D. Garcia-Romero, Y. Carmiel, and S. Khudanpur, "Deep neural network-based speaker embeddings for end-to-end speaker verification," in Spoken Language Technology Workshop (SLT), 2016 IEEE. IEEE, 2016, pp. 165-170.
[6] D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur, "X-vectors: Robust DNN embeddings for speaker recognition," in ICASSP. IEEE, 2018, pp. 5329-5333.
[7] D. Garcia-Romero and C. Y. Espy-Wilson, "Analysis of i-vector length normalization in speaker recognition systems," in Proc. Interspeech, 2011, pp. 249-252.
[8] P. Kenny, "Bayesian speaker verification with heavy-tailed priors," in Proc. Odyssey, 2010, pp. 14-21.
[9] S. Ramoji, V. Krishnan, P. Singh, S. Ganapathy et al., "Pairwise discriminative neural PLDA for speaker verification," arXiv preprint arXiv:2001.07034, 2020.
[10] S. Ramoji, P. Krishnan, B. Mysore, P. Singh, and S. Ganapathy, "LEAP system for SRE 2019 CTS challenge - improvements and error analysis," in Proc. Odyssey, 2020, pp. 281-288.
[11] S. Ramoji, P. Krishnan, and S. Ganapathy, "NPLDA: A deep neural PLDA model for speaker verification," in Proc. Odyssey, 2020, pp. 202-209.
[12] W. M. Campbell, D. E. Sturim, and D. A. Reynolds, "Support vector machines using GMM supervectors for speaker verification," IEEE Signal Processing Letters, vol. 13, no. 5, pp. 308-311, 2006.
[13] S. Cumani, N. Brümmer, L. Burget, P. Laface, O. Plchot, and V. Vasilakakis, "Pairwise discriminative speaker verification in the i-vector space," IEEE Transactions on Audio, Speech, and Language Processing, vol. 21, no. 6, pp. 1217-1227, 2013.
[14] S. Cumani and P. Laface, "Large-scale training of pairwise support vector machines for speaker recognition," IEEE Transactions on Audio, Speech and Language Processing, vol. 22, no. 11, pp. 1590-1600, 2014.
[15] S. Cumani and P. Laface, "Generative pairwise models for speaker recognition," in Proc. Odyssey, 2014, pp. 273-279.
[16] L. Burget, O. Plchot, S. Cumani, O. Glembek, P. Matějka, and N. Brümmer, "Discriminatively trained probabilistic linear discriminant analysis for speaker verification," in ICASSP. IEEE, 2011, pp. 4832-4835.
[17] J. Villalba, N. Chen, D. Snyder, D. Garcia-Romero, A. McCree, G. Sell, J. Borgstrom, L. P. García-Perera, F. Richardson, R. Dehak et al., "State-of-the-art speaker recognition with neural network embeddings in NIST SRE18 and Speakers in the Wild evaluations," Computer Speech & Language, vol. 60, p. 101026, 2020.
[18] L. Ferrer and M. McLaren, "Optimizing a speaker embedding extractor through backend-driven regularization," in Proc. Interspeech, 2019, pp. 4350-4354.
[19] V. Mingote, A. Miguel, D. Ribas, A. Ortega, and E. Lleida, "Optimization of false acceptance/rejection rates and decision threshold for end-to-end text-dependent speaker verification systems," in Proc. Interspeech, 2019, pp. 2903-2907. [Online]. Available: http://dx.doi.org/10.21437/Interspeech.2019-2550
[20] J. Rohdin, A. Silnova, M. Diez, O. Plchot, P. Matějka, and L. Burget, "End-to-end DNN based speaker recognition inspired by i-vector and PLDA," in ICASSP. IEEE, 2018, pp. 4874-4878.
[21] L. Wan, Q. Wang, A. Papir, and I. L. Moreno, "Generalized end-to-end loss for speaker verification," in ICASSP. IEEE, 2018, pp. 4879-4883.
[22] C. Zhang and K. Koishida, "End-to-end text-independent speaker verification with triplet loss on short utterances," in Proc. Interspeech, 2017, pp. 1487-1491. [Online]. Available: http://dx.doi.org/10.21437/Interspeech.2017-1608
[23] D. A. van Leeuwen and N. Brümmer, "An introduction to application-independent evaluation of speaker recognition systems," in Speaker Classification I. Springer, 2007, pp. 330-353.
[24] D. Snyder, D. Garcia-Romero, G. Sell, A. McCree, D. Povey, and S. Khudanpur, "Speaker recognition for multi-speaker conversations using x-vectors," in ICASSP. IEEE, 2019, pp. 5796-5800.
[25] S. Ranjan and J. H. Hansen, "Curriculum learning based approaches for noise robust speaker recognition."