Neural PLDA Modeling for End-to-End Speaker Verification
Shreyas Ramoji, Prashant Krishnan, Sriram Ganapathy
Learning and Extraction of Acoustic Patterns (LEAP) Lab, Department of Electrical Engineering, Indian Institute of Science, Bengaluru, India
{shreyasr, prashantkv1, sriramg}@iisc.ac.in

Abstract
While deep learning models have made significant advances in supervised classification problems, the application of these models for out-of-set verification tasks like speaker recognition has been limited to deriving feature embeddings. The state-of-the-art x-vector PLDA based speaker verification systems use a generative model based on probabilistic linear discriminant analysis (PLDA) for computing the verification score. Recently, we had proposed a neural network approach for backend modeling in speaker verification called the neural PLDA (NPLDA), where the likelihood ratio score of the generative PLDA model is posed as a discriminative similarity function and the learnable parameters of the score function are optimized using a verification cost. In this paper, we extend this work to achieve joint optimization of the embedding neural network (x-vector network) with the NPLDA network in an end-to-end (E2E) fashion. The proposed end-to-end model is optimized directly from the acoustic features with a verification cost function, and during testing, the model directly outputs the likelihood ratio score. With various experiments using the NIST speaker recognition evaluation (SRE) 2018 and 2019 datasets, we show that the proposed E2E model improves significantly over the x-vector PLDA baseline speaker verification system.
Index Terms: NPLDA, End-to-End Systems, Speaker Verification
1. Introduction
Automatic speaker verification (ASV) has several applications such as voice biometrics for commercial applications, speaker detection in surveillance, speaker diarization, etc. A speaker is enrolled using one or more sample utterances, and the task of ASV is to detect whether the target speaker is present in a given test utterance or not. Several challenges have been organized over the years for benchmarking and advancing speaker verification technology, such as the NIST speaker recognition evaluation (SRE) challenge 2019 [1], the VoxCeleb speaker recognition challenge (VoxSRC) [2] and the VOiCES challenge [3]. The major challenges in speaker verification include language mismatch in testing, short duration audio and the presence of noise/reverberation in the speech data.

The state-of-the-art systems in speaker verification use a model to extract embeddings of fixed dimension from utterances of variable duration. The earlier approaches based on the unsupervised Gaussian mixture model (GMM) i-vector extractor [4] have recently been replaced with neural embedding extractors [5, 6], which are trained on large amounts of data for supervised speaker classification tasks. These fixed dimensional embeddings are pre-processed with a length normalization technique [7] followed by a probabilistic linear discriminant analysis (PLDA) based backend modeling approach [8].

In our previous work, we had explored a discriminative neural PLDA (NPLDA) approach [9] to backend modeling in which a discriminative similarity function was used. The learnable parameters of the NPLDA model were optimized using an approximation of the minimum detection cost function (DCF). This model also showed good improvements in our SRE evaluations and in the VOiCES from a distance challenge [10, 11]. In this paper, we extend this work to propose a joint modeling framework that optimizes both the front-end x-vector embedding model and the backend NPLDA model in a single end-to-end (E2E) neural framework. The proposed model is initialized with the pre-trained x-vector time delay neural network (TDNN). The NPLDA E2E model is then fully trained on pairs of speech utterances starting directly from the mel-frequency cepstral coefficient (MFCC) features. The advantage of this method is that both the embedding extractor and the final score computation are optimized on pairs of utterances with the speaker verification metric. With experiments on the NIST SRE 2018 and 2019 datasets, we show that the proposed NPLDA E2E model improves significantly over the baseline system using x-vectors and generative PLDA modeling.

This work was funded by the Ministry of Human Resources Development (MHRD) of India and the Department of Science and Technology (DST) EMR/2016/007934 grant.
2. Related Prior Work
The common approaches for scoring in speaker verification systems include support vector machines (SVMs) [12] and probabilistic linear discriminant analysis (PLDA) [8]. Some efforts on pairwise generative and discriminative modeling are discussed in [13, 14, 15]. The discriminative version of PLDA with logistic regression and support vector machine (SVM) kernels has also been explored in [16]. In that work, the authors use the functional form of the generative model and pool all the parameters to be trained into a single long vector. These parameters are then discriminatively trained using the SVM loss function with pairs of input vectors. The discriminative PLDA (DPLDA) is, however, prone to over-fitting on the training speakers and leads to degradation on unseen speakers in SRE evaluations [17]. The regularization of the embedding extractor network using a Gaussian backend scoring has been investigated in [18]. Other recent developments in this direction include efforts in using the approximate DCF metric for text-dependent speaker verification [19].

Recently, some end-to-end approaches for speaker verification have been examined. For example, in [20], the PLDA scoring that follows the i-vector extraction is jointly derived using a deep neural network architecture, and the entire model is trained using a binary cross entropy training criterion. In [21], a generalized end-to-end loss that minimizes the distances to within-speaker centroids while maximizing across-speaker distances was proposed. In another E2E effort, the use of triplet loss has been explored [22]. However, in spite of these efforts, most state-of-the-art systems use a generative PLDA backend model with x-vectors or similar neural network embeddings.

Figure 1: End-to-End x-vector NPLDA architecture for Speaker Verification.
3. Background
The PLDA model on the processed x-vector embedding $\eta_r$ (after centering, LDA transformation and unit length normalization) is given by

$$\eta_r = \Phi \omega + \epsilon_r \qquad (1)$$

where $\omega$ is the latent speaker factor with a Gaussian prior of $\mathcal{N}(\mathbf{0}, \mathbf{I})$, $\Phi$ characterizes the speaker subspace matrix, and $\epsilon_r$ is the residual assumed to have the distribution $\mathcal{N}(\mathbf{0}, \Sigma)$. For scoring, a pair of embeddings, $\eta_e$ from the enrollment recording and $\eta_t$ from the test recording, are used with the PLDA model to compute the log-likelihood ratio score given by

$$s(\eta_e, \eta_t) = \eta_e^{\top} Q \eta_e + \eta_t^{\top} Q \eta_t + 2\,\eta_e^{\top} P \eta_t + \text{const} \qquad (2)$$

where

$$Q = \Sigma_{tot}^{-1} - \left(\Sigma_{tot} - \Sigma_{ac}\Sigma_{tot}^{-1}\Sigma_{ac}\right)^{-1} \qquad (3)$$

$$P = \Sigma_{tot}^{-1}\Sigma_{ac}\left(\Sigma_{tot} - \Sigma_{ac}\Sigma_{tot}^{-1}\Sigma_{ac}\right)^{-1} \qquad (4)$$

with $\Sigma_{tot} = \Phi\Phi^{\top} + \Sigma$ and $\Sigma_{ac} = \Phi\Phi^{\top}$. In the Kaldi implementation of PLDA, a diagonalizing transformation which simultaneously diagonalizes the within and between speaker covariances is computed, which reduces $P$ and $Q$ to diagonal matrices.

In the discriminative NPLDA approach [11], we construct the pre-processing steps of LDA as the first affine layer, the unit-length normalization as a non-linear activation, and the PLDA centering and diagonalization as another affine transformation. The final PLDA pair-wise scoring given in Eq. (2) is implemented as a quadratic layer (Fig. 1). Thus, the NPLDA implements the pre-processing of the x-vectors and the PLDA scoring as a neural backend.
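As a reference for Eqs. (2)-(4), the following minimal NumPy sketch computes the pairwise log-likelihood ratio from a speaker subspace matrix Phi and residual covariance Sigma. The variable names and the omission of the additive constant are illustrative choices, not the exact implementation used in this paper.

```python
import numpy as np

def plda_llr_score(eta_e, eta_t, Phi, Sigma):
    """Pairwise PLDA log-likelihood ratio of Eqs. (2)-(4),
    up to the additive constant (illustrative sketch)."""
    Sigma_ac = Phi @ Phi.T                      # across-speaker covariance
    Sigma_tot = Sigma_ac + Sigma                # total covariance
    Sigma_tot_inv = np.linalg.inv(Sigma_tot)
    # Common factor (Sigma_tot - Sigma_ac Sigma_tot^{-1} Sigma_ac)^{-1}
    M = np.linalg.inv(Sigma_tot - Sigma_ac @ Sigma_tot_inv @ Sigma_ac)
    Q = Sigma_tot_inv - M                       # Eq. (3)
    P = Sigma_tot_inv @ Sigma_ac @ M            # Eq. (4)
    return (eta_e @ Q @ eta_e + eta_t @ Q @ eta_t
            + 2.0 * eta_e @ P @ eta_t)          # Eq. (2), const omitted
```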
To train the NPLDA for the task of speaker verification, we sample pairs of x-vectors representing the target hypothesis (from the same speaker) and the non-target hypothesis (from different speakers). The normalized detection cost function (DCF) [23] for a detection threshold $\theta$ is defined as

$$C_{Norm}(\beta, \theta) = P_{Miss}(\theta) + \beta P_{FA}(\theta) \qquad (5)$$

where $\beta$ is an application based weight defined as

$$\beta = \frac{C_{FA}(1 - P_{target})}{C_{Miss}\, P_{target}} \qquad (6)$$

where $C_{Miss}$ and $C_{FA}$ are the costs assigned to misses and false alarms, and $P_{target}$ is the prior probability of a target trial. $P_{Miss}$ and $P_{FA}$ are the probabilities of miss and false alarm respectively, and are computed by applying a detection threshold of $\theta$ to the log-likelihood ratios. A differentiable approximation of the normalized detection cost was proposed in [11, 19]:

$$P_{Miss}^{\text{(soft)}}(\theta) = \frac{\sum_{i=1}^{N} t_i \left[1 - \sigma(\alpha(s_i - \theta))\right]}{\sum_{i=1}^{N} t_i} \qquad (7)$$

$$P_{FA}^{\text{(soft)}}(\theta) = \frac{\sum_{i=1}^{N} (1 - t_i)\,\sigma(\alpha(s_i - \theta))}{\sum_{i=1}^{N} (1 - t_i)} \qquad (8)$$

Here, $i$ is the trial index, $s_i$ is the system score, $t_i$ denotes the ground truth label for trial $i$, and $\sigma$ denotes the sigmoid function. $N$ is the total number of trials in the minibatch over which the cost is computed. By choosing a large enough value for the warping factor $\alpha$, the approximation can be made arbitrarily close to the actual detection cost function for a wide range of thresholds. The minimum detection cost (minDCF) is achieved at the threshold where the DCF is minimized:

$$\text{minDCF} = \min_{\theta} C_{Norm}(\beta, \theta) \qquad (9)$$

The threshold $\theta$ is included in the set of learnable parameters of the neural network. This way, the network learns to minimize the minDCF as a function of all the parameters through backpropagation.
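A minimal PyTorch sketch of the differentiable cost in Eqs. (5), (7) and (8) is given below. The values chosen for alpha and beta are placeholders; in practice beta follows Eq. (6) from the application-specific costs and target prior.

```python
import torch

def soft_detection_cost(scores, labels, theta, beta=99.0, alpha=20.0):
    """Differentiable approximation of C_Norm in Eqs. (5), (7), (8).
    scores: LLR scores s_i; labels: 1.0 for target, 0.0 for non-target trials.
    theta is the detection threshold; beta and alpha are illustrative values."""
    probs = torch.sigmoid(alpha * (scores - theta))
    p_miss = torch.sum(labels * (1.0 - probs)) / torch.sum(labels)
    p_fa = torch.sum((1.0 - labels) * probs) / torch.sum(1.0 - labels)
    return p_miss + beta * p_fa

# theta can be registered as a parameter so backpropagation also minimizes over it:
# theta = torch.nn.Parameter(torch.tensor(0.0))
```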
4. End-to-End Modeling
The model we explore is a concatenated version of two parameter-tied x-vector extractors (TDNN networks [24]) with the NPLDA model (Fig. 1). The end-to-end model processes the mel frequency cepstral coefficients (MFCCs) of a pair of utterances to output a score. The MFCC features are passed through nine time delay neural network (TDNN) layers followed by a statistics pooling layer. The statistics pooling layer is followed by a fully connected layer with a unit length normalization non-linearity. This is followed by a linear layer and a quadratic layer as a function of the enrollment and test embeddings to output a score. The parameters of the TDNN and the affine layers on the enrollment and test sides of the E2E model are tied, which makes the model symmetric.
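The tied-parameter structure can be sketched as a Siamese-style module in PyTorch as shown below. The encoder object (standing in for the TDNN, pooling and affine layers) and the parameterization of the quadratic scoring layer with matrices P and Q are simplifications of the architecture in Fig. 1, not its exact implementation.

```python
import torch
import torch.nn as nn

class E2EScorer(nn.Module):
    """Schematic tied-parameter (Siamese) pair scorer; names are placeholders."""
    def __init__(self, encoder, emb_dim):
        super().__init__()
        self.encoder = encoder                  # shared TDNN + pooling + affine stack
        self.P = nn.Parameter(torch.eye(emb_dim))
        self.Q = nn.Parameter(torch.eye(emb_dim))

    def forward(self, feats_enroll, feats_test):
        # The same encoder weights process both sides, making the model symmetric.
        eta_e = self.encoder(feats_enroll)      # (batch, emb_dim)
        eta_t = self.encoder(feats_test)
        quad_e = torch.sum((eta_e @ self.Q) * eta_e, dim=-1)
        quad_t = torch.sum((eta_t @ self.Q) * eta_t, dim=-1)
        cross = torch.sum((eta_e @ self.P) * eta_t, dim=-1)
        return quad_e + quad_t + 2.0 * cross    # quadratic scoring layer, as in Eq. (2)
```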
We can estimate the memory required for a single iteration (batch update) of training as the sum of the memory required to store the network parameters, the gradients, and the forward and backward activations of each batch. In this end-to-end network, each training batch of $N$ trials can have up to $2N$ unique utterances, assuming there are no repetitions. For simplicity, let us assume that each utterance corresponds to $T$ frames. We denote $k_i$ as the dimension of the input to the $i$-th TDNN layer, with a TDNN context of $c_i$ frames. The activation memory required can then be estimated as $2NT \sum_i k_i c_i$ values, multiplied by the number of bytes used to store each value. The GPU memory is thus limited by the total number of frames that go into the TDNN, captured by the factor $2NT$. A large batch size of the kind used in [10] is infeasible for the end-to-end model, as it results in a prohibitively large GPU memory load. Hence, we resorted to a sampling strategy to reduce the GPU memory requirements.
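As a back-of-the-envelope helper for this estimate, the sketch below multiplies the utterance and frame counts by the per-frame activation sizes of each TDNN layer. The 4-byte default is an assumed single-precision storage size, and parameters, gradients and optimizer state are deliberately ignored.

```python
def tdnn_activation_memory_bytes(num_utterances, num_frames, layer_dims, contexts,
                                 bytes_per_value=4):
    """Rough per-batch activation memory: utterances x frames per utterance
    x sum_i(k_i * c_i) values, times the storage size of each value."""
    per_frame = sum(k * c for k, c in zip(layer_dims, contexts))
    return num_utterances * num_frames * per_frame * bytes_per_value

# For a batch of N trials with no repeated utterances, num_utterances would be 2N.
```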
In the current work, in order to avoid memory explosion in the x-vector extraction stage of the E2E model, we propose to use a small number of utterances (64) in a batch, with about 20 sec. of audio in each utterance. These utterances are drawn from m speakers, with m varied across batches. The utterances are split randomly into two halves for each speaker to form the enrollment and test sides of the trials. The MFCC features of the enrollment and test utterances are transformed to utterance embeddings η_e and η_t (as shown in Fig. 1). Each pair of enrollment and test utterances is given a label indicating whether the trial belongs to the target class (same speaker) or the non-target class (different speakers). In this way, while the number of utterances is small, the number of trials used in the batch is 1024. Using the label information and the cost function defined in Eq. (5), the gradients are back-propagated to update the entire set of E2E model parameters. The implementation of this model can be found at https://github.com/iiscleap/E2E-NPLDA.

This algorithm is applied separately to the male and female partitions of each training dataset to ensure the trials are gender and domain matched. All the utterances used in a batch come from the same gender and the same dataset (to avoid cross gender and cross language trials). The algorithm is repeated multiple times with different numbers of speakers (m), for the male and female partitions of every dataset. Finally, all the training batches are pooled together and randomized.

In contrast, the trial sampling algorithm used in our previous work on NPLDA [11, 10] was much simpler. For each gender of each dataset, we sample an enrollment utterance from a randomly sampled speaker, and sample another utterance from either the same speaker or a different speaker to get a target or a non-target trial. This was done without any repetition of utterances, to ensure that each utterance appears once per sampling epoch. This procedure was repeated numerous times for multiple datasets and for both genders to obtain the required number of trials. All the trials were then pooled together, shuffled and split into batches.

It is worth noting that the batch statistics of the two sampling methods are significantly different. A batch of trials in the previous sampling method (Algo. 1) can contain trials from multiple datasets and genders, whereas in the modified sampling method, which we will refer to as Algo. 2, all the trials in a batch are from a particular gender of a particular dataset.
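A simplified sketch of the proposed batch sampling (Algo. 2) for one gender/dataset partition is shown below. The data structures and the exhaustive pairing of the two halves are illustrative assumptions rather than the exact procedure.

```python
import random

def sample_batch(utts_by_speaker, num_speakers):
    """Sketch of the proposed sampling (Algo. 2): draw a few speakers from one
    gender/dataset partition, split each speaker's utterances into enrollment
    and test halves, and pair the halves to form labelled trials."""
    speakers = random.sample(list(utts_by_speaker), num_speakers)
    enroll, test = [], []
    for spk in speakers:
        utts = list(utts_by_speaker[spk])
        random.shuffle(utts)
        half = len(utts) // 2
        enroll += [(spk, u) for u in utts[:half]]
        test += [(spk, u) for u in utts[half:]]
    # Label is 1 for target (same speaker) and 0 for non-target trials.
    return [(ue, ut, int(se == st)) for se, ue in enroll for st, ut in test]
```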
5. Experiments and Results
This work is an extension of our work in [10]. The x-vector model is trained using the extended time-delay neural network (E-TDNN) architecture described in [24]. This uses 10 layers of TDNNs followed by a statistics pooling layer. Once the network is trained, x-vectors of 512 dimensions are extracted from the affine component of layer 12 in the E-TDNN architecture. By combining the VoxCeleb 1 & 2 datasets [2] with the Switchboard, Mixer 6, SRE04-10, SRE16 evaluation and SRE18 evaluation sets, we obtained a large pool of training recordings and speakers. The datasets were augmented with the 5-fold augmentation strategy used in the previous models. In order to reduce the weight given to the VoxCeleb speakers (out-of-domain compared to conversational telephone speech (CTS)), we also subsampled the augmented VoxCeleb portion. The x-vector model is trained on MFCC features computed with a mel-scale filter bank, mean-normalized over a sliding window of up to 3 seconds, with speaker targets, using the Kaldi toolkit. More information about the model can be found in [10].

The various backend PLDA models are trained on the SRE18 evaluation dataset. The evaluation datasets used include the SRE18 development and the SRE19 evaluation datasets. We perform several experiments under various conditions. The primary baseline to benchmark our systems is the Gaussian PLDA backend implementation in the Kaldi toolkit (GPLDA). The Kaldi implementation models the average embedding x-vector of each training speaker. The x-vectors are centered, reduced in dimensionality using LDA to 170 dimensions, and then unit length normalized.

Table 1: Performance comparison of training utterance durations (full utterances vs. 20 second segments) on the GPLDA and NPLDA [10] models.

Model         Duration of    SRE18 Dev            SRE19 Eval
              utterance      EER (%)   C_Min      EER (%)   C_Min
GPLDA (G1)    Full           6.43      0.417      6.18      0.512
GPLDA (G2)    20 secs        5.96      0.436      5.80      0.518
NPLDA (N1)    Full           5.33      0.389      5.10      0.443
NPLDA (N2)    20 secs        5.57      0.359      5.32      0.432
Table 2: Performance comparison of the two trial sampling techniques with the NPLDA [10] method: the previous sampling method (Algo. 1) and the proposed new sampling method (Algo. 2).

Model         Sampling    SRE18 Dev            SRE19 Eval
                          EER (%)   C_Min      EER (%)   C_Min
NPLDA (N2)    Algo. 1     5.57      0.359      5.32      0.432
NPLDA (N3)    Algo. 2     5.23      0.338      5.73      0.439
In the traditional x-vector system, the statistics pooling layer computes the mean and standard deviation of the final TDNN layer outputs. These two statistics are then concatenated into a fixed dimensional embedding. We also perform experiments where we use the variance instead of the standard deviation and compare the performance.

In the following sections, we study the influence of reduced training duration, and provide a performance comparison of the sampling methods (Algo. 1 vs. Algo. 2). We then compare the performance of the Gaussian PLDA (GPLDA), the Neural PLDA (NPLDA), and the proposed end-to-end approach (E2E). The PLDA backend training dataset used is the SRE18 evaluation dataset. We report our results on the SRE18 development set and the SRE19 evaluation dataset using two cost metrics, the equal error rate (EER) and the minimum DCF (C_Min), which are the primary cost metrics for the SRE19 evaluations.
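A minimal sketch of the statistics pooling options compared here (mean concatenated with either the standard deviation or the variance) is given below; the tensor layout is an assumption.

```python
import torch

def statistics_pooling(frames, use_variance=False, eps=1e-10):
    """Pool frame-level TDNN outputs of shape (batch, frames, dim) into a
    fixed-length embedding by concatenating the mean with either the
    standard deviation or the variance (the two options compared in Table 3)."""
    mean = frames.mean(dim=1)
    var = frames.var(dim=1, unbiased=False) + eps
    spread = var if use_variance else torch.sqrt(var)
    return torch.cat([mean, spread], dim=-1)
```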
As discussed in Section 4, due to GPU memory considerations and ease of implementation, we create a modified dataset by splitting longer utterances into 20 second chunks (2000 frames) after voice activity detection (VAD) and mean normalization. We compare the performance of the models on the modified dataset and the original one. The results are reported in Table 1. We observe that the performance of the systems is quite comparable. This allows us to proceed with these conditions in the implementation of the end-to-end model. All subsequently reported models use 20 second chunks for PLDA training.
The way the training trials are generated is crucial to how the model trains and performs. The performance comparison of the two sampling techniques, with PLDA models trained on the SRE18 evaluation dataset, can be seen in Table 2. Although the nature of the batch-wise trials has changed significantly in terms of the number of speakers in each batch and the gender matching of batches in the proposed new sampling method (Algo. 2), we see that its performance is comparable to that of our previous sampling method (Algo. 1).
Table 3: Performance comparison between the GPLDA, NPLDA and E2E models using standard deviation and variance as the secondary pooling functions. The model used to initialize each network is denoted in the third column.

Model         Pooling     Init.   SRE18 Dev            SRE19 Eval
              function            EER (%)   C_Min      EER (%)   C_Min
GPLDA (G2)    StdDev      -       5.96      0.436      5.80      0.518
GPLDA (G3)    Var         -       7.23      0.459      6.33      0.560
NPLDA (N2)    StdDev      G2      5.57      0.359      5.32      0.432
NPLDA (N4)    Var         G3      6.05      0.377      5.91      0.465
E2E (E1)      StdDev      N2
E2E (E2)      Var         N4      5.60
Using the proposed sampling method, we generate batches of 1024 trials using 64 utterances per batch. Both the NPLDA and E2E models were trained with this batch size. We use the Adam optimizer for the backpropagation learning. The performance of these models is reported in Table 3. The NPLDA model is initialized with the GPLDA model. The initialization details of the models, along with the pooling functions, are reported in the table. We compare performances using two different statistics (StdDev or Var). We observe significant improvements of the NPLDA over the GPLDA system and, subsequently, of the E2E system over the NPLDA. Comparing the E2E and GPLDA systems, we observe relative improvements on both the SRE18 development and SRE19 evaluation sets in terms of the C_Min metric, for both the standard deviation and the variance pooling functions. Though the cost function in the neural network aims to minimize the detection cost function (DCF), we also see improvements in the EER metric using the proposed approach. These results show that the joint E2E training with a single neural pipeline and optimization results in improved speaker recognition performance.
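To illustrate the optimization setup described above, the following sketch combines the earlier scorer and soft detection cost sketches in a single Adam update per batch. The encoder, the data iterator and the learning rate are placeholders, not values reported in the paper.

```python
import torch

# Illustrative training step; E2EScorer and soft_detection_cost are the
# sketches given earlier, and my_tdnn_encoder / trial_batches are placeholders.
model = E2EScorer(encoder=my_tdnn_encoder, emb_dim=512)
theta = torch.nn.Parameter(torch.tensor(0.0))          # learnable threshold
optimizer = torch.optim.Adam(list(model.parameters()) + [theta], lr=1e-4)

for feats_enroll, feats_test, labels in trial_batches:  # e.g. 1024 trials per batch
    scores = model(feats_enroll, feats_test)
    loss = soft_detection_cost(scores, labels.float(), theta)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```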
6. Summary and Conclusions
This paper explores a step in the direction of a neural end-to-end (E2E) approach for speaker verification tasks. It is an extension of our work on a discriminative neural PLDA (NPLDA) backend. The proposed model is a single elegant end-to-end approach that optimizes directly from acoustic features like MFCCs with a verification cost function to output a likelihood ratio score. We discuss the factors that were key in implementing the E2E model; these involved modifying the duration of the training utterances and developing a new sampling technique to generate training trials. The model shows considerable improvements over the generative Gaussian PLDA and the NPLDA models on the NIST SRE 2018 and 2019 datasets. One drawback of the proposed method is the requirement to initialize the E2E model with the pre-trained weights of an x-vector network.

Future work in this direction could include investigating better sampling algorithms such as the use of curriculum learning [25], different loss functions, and improved architectures for the embedding extractor using attention and other sequence models such as LSTMs.

7. References
[2] A. Nagrani, J. S. Chung, and A. Zisserman, "VoxCeleb: A large-scale speaker identification dataset," in Proc. Interspeech, 2017, pp. 2616-2620.
[3] M. K. Nandwana, J. van Hout, C. Richey, M. McLaren, M. A. Barrios, and A. Lawson, "The VOiCES from a Distance Challenge 2019," in Proc. Interspeech, 2019, pp. 2438-2442.
[4] N. Dehak, P. J. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, "Front-end factor analysis for speaker verification," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 4, pp. 788-798, 2010.
[5] D. Snyder, P. Ghahremani, D. Povey, D. Garcia-Romero, Y. Carmiel, and S. Khudanpur, "Deep neural network-based speaker embeddings for end-to-end speaker verification," in Spoken Language Technology Workshop (SLT), 2016 IEEE. IEEE, 2016, pp. 165-170.
[6] D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur, "X-vectors: Robust DNN embeddings for speaker recognition," in ICASSP. IEEE, 2018, pp. 5329-5333.
[7] D. Garcia-Romero and C. Y. Espy-Wilson, "Analysis of i-vector length normalization in speaker recognition systems," in Proc. Interspeech, 2011, pp. 249-252.
[8] P. Kenny, "Bayesian speaker verification with heavy-tailed priors," in Proc. Odyssey, 2010, pp. 14-21.
[9] S. Ramoji, V. Krishnan, P. Singh, S. Ganapathy et al., "Pairwise discriminative neural PLDA for speaker verification," arXiv preprint arXiv:2001.07034, 2020.
[10] S. Ramoji, P. Krishnan, B. Mysore, P. Singh, and S. Ganapathy, "LEAP system for SRE 2019 CTS challenge - improvements and error analysis," in Proc. Odyssey, 2020, pp. 281-288.
[11] S. Ramoji, P. Krishnan, and S. Ganapathy, "NPLDA: A deep neural PLDA model for speaker verification," in Proc. Odyssey, 2020, pp. 202-209.
[12] W. M. Campbell, D. E. Sturim, and D. A. Reynolds, "Support vector machines using GMM supervectors for speaker verification," IEEE Signal Processing Letters, vol. 13, no. 5, pp. 308-311, 2006.
[13] S. Cumani, N. Brümmer, L. Burget, P. Laface, O. Plchot, and V. Vasilakakis, "Pairwise discriminative speaker verification in the i-vector space," IEEE Transactions on Audio, Speech, and Language Processing, vol. 21, no. 6, pp. 1217-1227, 2013.
[14] S. Cumani and P. Laface, "Large-scale training of pairwise support vector machines for speaker recognition," IEEE Transactions on Audio, Speech and Language Processing, vol. 22, no. 11, pp. 1590-1600, 2014.
[15] S. Cumani and P. Laface, "Generative pairwise models for speaker recognition," in Proc. Odyssey, 2014, pp. 273-279.
[16] L. Burget, O. Plchot, S. Cumani, O. Glembek, P. Matějka, and N. Brümmer, "Discriminatively trained probabilistic linear discriminant analysis for speaker verification," in ICASSP. IEEE, 2011, pp. 4832-4835.
[17] J. Villalba, N. Chen, D. Snyder, D. Garcia-Romero, A. McCree, G. Sell, J. Borgstrom, L. P. García-Perera, F. Richardson, R. Dehak et al., "State-of-the-art speaker recognition with neural network embeddings in NIST SRE18 and Speakers in the Wild evaluations," Computer Speech & Language, vol. 60, p. 101026, 2020.
[18] L. Ferrer and M. McLaren, "Optimizing a speaker embedding extractor through backend-driven regularization," in Proc. Interspeech, 2019, pp. 4350-4354.
[19] V. Mingote, A. Miguel, D. Ribas, A. Ortega, and E. Lleida, "Optimization of false acceptance/rejection rates and decision threshold for end-to-end text-dependent speaker verification systems," in Proc. Interspeech, 2019, pp. 2903-2907. [Online]. Available: http://dx.doi.org/10.21437/Interspeech.2019-2550
[20] J. Rohdin, A. Silnova, M. Diez, O. Plchot, P. Matějka, and L. Burget, "End-to-end DNN based speaker recognition inspired by i-vector and PLDA," in ICASSP. IEEE, 2018, pp. 4874-4878.
[21] L. Wan, Q. Wang, A. Papir, and I. L. Moreno, "Generalized end-to-end loss for speaker verification," in ICASSP. IEEE, 2018, pp. 4879-4883.
[22] C. Zhang and K. Koishida, "End-to-end text-independent speaker verification with triplet loss on short utterances," in Proc. Interspeech, 2017, pp. 1487-1491. [Online]. Available: http://dx.doi.org/10.21437/Interspeech.2017-1608
[23] D. A. van Leeuwen and N. Brümmer, "An introduction to application-independent evaluation of speaker recognition systems," in Speaker Classification I. Springer, 2007, pp. 330-353.
[24] D. Snyder, D. Garcia-Romero, G. Sell, A. McCree, D. Povey, and S. Khudanpur, "Speaker recognition for multi-speaker conversations using x-vectors," in ICASSP. IEEE, 2019, pp. 5796-5800.
[25] S. Ranjan and J. H. Hansen, "Curriculum learning based approaches for noise robust speaker recognition."