A Generalized Framework for Domain Adaptation of PLDA in Speaker Recognition
Qiongqiong Wang, Koji Okabe, Kong Aik Lee, Takafumi Koshinaka
Biometrics Research Laboratories, NEC, Japan
ABSTRACT
This paper proposes a generalized framework for domain adaptation of Probabilistic Linear Discriminant Analysis (PLDA) in speaker recognition. It not only includes several existing supervised and unsupervised domain adaptation methods but also enables more flexible use of available data in different domains. In particular, we introduce two new techniques: (1) correlation-alignment-based interpolation and (2) covariance regularization. The proposed correlation-alignment-based interpolation method decreases minC_primary by up to 30.5% compared with an out-of-domain PLDA model before adaptation, and its minC_primary is also 5.5% lower than that of a conventional linear interpolation method with optimal interpolation weights. Further, the proposed regularization technique ensures robustness of the interpolations w.r.t. varying interpolation weights, which is essential in practice.
Index Terms — Speaker verification, domain adaptation, correlation alignment, regularization, generalized framework
1. INTRODUCTION
Recent progress in speaker recognition has achieved successful application of deep neural networks to derive deep speaker embeddings from speech utterances [1, 2, 3, 4]. Speaker embeddings are fixed-length continuous-valued vectors that provide succinct characterizations of speakers' voices rendered in speech utterances. Similar to classical i-vectors [5], deep speaker embeddings live in a simpler Euclidean space in which distance can be measured far more easily than with much more complex input patterns. Techniques such as within-class covariance normalization (WCCN) [6], linear discriminant analysis (LDA) [7], and probabilistic LDA (PLDA) [8, 9, 10] can also be applied.

State-of-the-art speaker recognition systems composed of an x-vector (or i-vector) speaker embedding front-end followed by a PLDA backend have shown promising performance [11]. The effectiveness of these components relies on the availability of a large collection of labeled training data, typically over a hundred hours of speech recordings consisting of multi-session recordings from several thousand speakers. It would be prohibitively expensive, however, to collect such a large amount of in-domain (InD) data for a new domain of interest for every application. Most available resource-rich data that already exist will not match new domains of interest, i.e., most will be out-of-domain (OOD) data. The challenge of domain mismatch arises when a speaker recognition system is used in a domain different from that of the training data (e.g., in language, demographics, etc.), and performance may degrade considerably. Domain adaptation techniques for adapting resource-rich OOD systems so as to produce good results in new domains have recently been studied with the aim of alleviating this problem.
These techniques involve either supervised adaptation [12, 13, 14, 15], for which a small amount of InD data and their speaker labels are used, or unsupervised adaptation [16, 17, 18, 19], for which InD data is used without speaker labels. Supervised domain adaptation is more powerful than unsupervised adaptation.

Supervised domain adaptation methods can be further categorized into three approaches: 1) Data pooling. It has been proposed, for example, to add InD data to a large amount of OOD data to train the PLDA [14]. 2) Feature vector compensation. Data shifting for OOD data has been proposed that uses statistical information about the data in both domains [12]. 3) PLDA parameter adaptation. A linear interpolation method has been proposed for combining the parameters of PLDAs trained separately on OOD and InD data so as to take advantage of both PLDAs [13]; in [20], a maximum likelihood linear transformation has been proposed for transforming the OOD PLDA parameters so as to be closer to InD. Among unsupervised methods, CORAL [21] belongs to 2) feature vector compensation, while CORAL+ [22] and clustering methods [16] belong to 3) PLDA parameter adaptation.

Among the three approaches, PLDA parameter adaptation, such as linear interpolation, has advantages over the others. First, it directly optimizes the model in an efficient way and does not require computationally expensive retraining with large-scale OOD data or transformation of individual feature vectors. Secondly, it can easily adjust the mixing rate of OOD and InD data by changing the interpolation weights. Simple linear modeling, however, which implicitly assumes small differences between OOD and InD data, is not always feasible in real-world situations. Further, if the interpolation weights are not appropriately determined, performance may seriously deteriorate.

In this paper, we take advantage of previous work [22] and propose 1) a correlation-alignment-based interpolation and 2) a covariance regularization for both unsupervised and supervised methods, on the basis of linear interpolation [13], for robust domain adaptation. Finally, we combine existing and proposed methods into a generalized framework and demonstrate its use in certain special cases. Domain adaptation is successful in all cases.

The remainder of this paper is organized as follows: Section 2 reviews the CORAL+ unsupervised method and the linear interpolation supervised method; Section 3 introduces the proposed correlation-alignment-based interpolation and a regularization technique, as well as a generalized framework for domain adaptation and examples of its use; Section 4 describes our experimental setup, results, and analyses; and Section 5 summarizes our work.
2. DOMAIN ADAPTATIONS FOR PLDA

2.1. Probabilistic Linear Discriminant Analysis
Let vector φ be a speaker embedding (e.g., x-vector, i-vector, etc.). We assume that vector φ is generated from a linear Gaussian model [7], as follows [8, 9]:

p(φ | h, x) = N(φ | µ + Fh + Gx, Σ).   (1)

Vector µ represents the global mean, while F and G are, respectively, the speaker and channel loading matrices, and the diagonal matrix Σ models residual variances. The variables h and x are, respectively, the latent speaker and channel variables. The loading matrices define the between- and within-speaker covariance matrices:

Φ_b = FF^T,   Φ_w = GG^T + Σ.   (2)

PLDA adaptation involves the adaptation of its mean vector and covariance matrices. Mean shift due to domain mismatch can be dealt with by centralizing the datasets to a common origin [23]. In this paper, we focus on the adaptation of the between- and within-speaker covariances in PLDA.

2.2. CORAL+

CORAL+ [22] is a correlation-alignment-based, model-level unsupervised domain adaptation method. It adapts both the between- and within-speaker covariance matrices given only the total covariance matrix estimated directly from InD data. Pseudo-InD between- and within-speaker covariance matrices Φ_{I,pseudo} are first computed from pseudo-InD data that is recolored from whitened OOD vectors, using the total covariance matrices of the OOD and InD vectors {C_O, C_I} [21]. It is commonly known that a linear transformation of a normally distributed vector leads to an equivalent transformation of the mean vector and covariance matrix of its density function. Thus,

Φ_{I,pseudo} = C_I^{1/2} C_O^{-1/2} Φ_O C_O^{-1/2} C_I^{1/2},   (3)

where Φ_O is the covariance matrix of the OOD PLDA.
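As a concrete illustration of the recoloring in (3), the NumPy sketch below whitens a covariance matrix with the OOD total covariance and recolors it with the InD total covariance. This is our own illustration under stated assumptions (symmetric positive-definite inputs), not the authors' implementation, and the function names are hypothetical.

```python
import numpy as np

def sqrtm_sym(C):
    """Symmetric square root of a symmetric PSD matrix via eigendecomposition."""
    w, V = np.linalg.eigh(C)
    return (V * np.sqrt(np.clip(w, 0.0, None))) @ V.T

def invsqrtm_sym(C, eps=1e-10):
    """Symmetric inverse square root, with a small floor on eigenvalues."""
    w, V = np.linalg.eigh(C)
    return (V / np.sqrt(np.clip(w, eps, None))) @ V.T

def coral_transform_cov(phi_ood, C_ood, C_ind):
    """Eq. (3): map an OOD PLDA covariance into pseudo-InD space by
    whitening with C_ood^{-1/2} and recoloring with C_ind^{1/2}."""
    A = sqrtm_sym(C_ind) @ invsqrtm_sym(C_ood)
    return A @ phi_ood @ A.T
```

A quick sanity check of the algebra: when the OOD covariance being transformed equals the OOD total covariance, the output equals the InD total covariance, since the whitening and recoloring cancel exactly.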
In CORAL+, the adapted PLDA covariance matrices Φ+ are

Φ+ = β Φ_O + (1 − β) Γ_max(Φ_{I,pseudo}, Φ_O).   (4)

Here, Γ_max(Y, Z) is a regularization function that ensures that the variance increases, by choosing the larger value between the two covariance matrices {E, I} in a diagonalized space obtained via a transformation with a matrix B:

Γ_max(Y, Z) = B^{-T} max(E, I) B^{-1},   where B^T Y B = E,  B^T Z B = I.   (5)

Here, max(·) is an element-wise operator. The effect of CORAL+ has been experimentally validated on the recent NIST 2016 and 2018 Speaker Recognition Evaluation (SRE16, SRE18) datasets [22, 24].

2.3. Linear Interpolation

The major features of CORAL+ adaptation are (1) the CORAL transformation, (2) covariance regularization, and (3) linear interpolation. When the first two factors are dropped, the adaptation equation (4) reduces to linear interpolation. It has been shown in [13] that linear interpolation, though simple, is a promising method. It employs a linear combination, with a weight α, of PLDA parameters, i.e., the between- and within-speaker covariances of independently trained OOD and InD PLDAs:

Φ+ = α Φ_I + (1 − α) Φ_O,   (6)

where Φ_I represents the InD covariance matrix. Linear interpolation [13] implicitly assumes, however, that simple interpolation is sufficient, and such an assumption may not hold if the characteristics of the OOD and InD data are significantly different. In addition, performance is strongly affected by the interpolation weights.
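To make Γ_max in (5) concrete, the following NumPy sketch performs the simultaneous diagonalization and the element-wise maximum. It is our illustration, not the authors' code: B is built so that B^T Z B = I and B^T Y B = E (diagonal), and the element-wise max(E, I) guarantees the returned covariance is never smaller than Z in the diagonalized space.

```python
import numpy as np

def gamma_max(Y, Z, eps=1e-10):
    """Eq. (5): simultaneous diagonalization of (Y, Z) followed by an
    element-wise maximum, so no variance falls below that of Z."""
    w, V = np.linalg.eigh(Z)
    w = np.clip(w, eps, None)
    Z_isqrt = (V / np.sqrt(w)) @ V.T            # Z^{-1/2}
    Z_sqrt = (V * np.sqrt(w)) @ V.T             # Z^{1/2}
    E, W = np.linalg.eigh(Z_isqrt @ Y @ Z_isqrt)  # B = Z^{-1/2} W
    B_invT = Z_sqrt @ W                         # B^{-T}
    return B_invT @ np.diag(np.maximum(E, 1.0)) @ B_invT.T
```

With B = Z^{-1/2} W (W orthogonal), one can verify B^T Z B = I and B^T Y B = E, matching the constraints in (5); when Y already dominates Z the function returns Y unchanged, and when Y is smaller everywhere it returns Z.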
3. PROPOSED METHOD
In this section, we build on the advantages of CORAL+ [22] and propose (1) correlation-alignment-based interpolation (CIP) and (2) a covariance regularization, used in both unsupervised and supervised methods for robust domain adaptation. Finally, we propose a generalized framework.
3.1. Correlation-Alignment-Based Interpolation

As noted in the introduction, linear interpolation [13] assumes that the two domains to be interpolated are not significantly distant from one another. If the OOD PLDA could be transformed into something closer to the true InD PLDA, the resulting PLDA would be more reliable. For this reason, we propose replacing the OOD covariance matrix in linear interpolation with that of a pseudo-InD PLDA obtained by correlation alignment (CORAL):

Φ+ = α Φ_I + (1 − α) Φ_{I,pseudo}.   (7)

As noted in Section 2.2, CORAL aims to align the covariance matrices so that they match the InD feature space, while maintaining the good properties that the OOD PLDA learned from a large amount of data. We refer to the process represented in (7) as correlation-alignment-based interpolation (CIP).

3.2. Covariance Regularization

The central idea in domain adaptation is to propagate the uncertainty seen in the InD data to the PLDA model. Neither the adaptation equation for linear interpolation (LIP) in (6), nor that for correlation-alignment-based interpolation (CIP) in (7), guarantees that the variance, and therefore the uncertainty, will increase. To deal with this, we introduce a covariance regularization that guarantees an increase in variance. LIP with regularization (LIP reg) is given by

Φ+ = α Φ_I + (1 − α) Γ_max(Φ_O, Φ_I),   (8)

while CIP with regularization (CIP reg) is

Φ+ = α Φ_I + (1 − α) Γ_max(Φ_{I,pseudo}, Φ_I).   (9)

Note that the Γ_max operator is the same as that in (5).

3.3. Generalized Framework

In the previous sections, we have presented various adaptation equations that work with covariance matrices. The three main factors are (1) interpolation of covariance matrices, (2) correlation alignment, and (3) covariance regularization.
The adaptations (4), (6), (7), (8), and (9) can be summarized with a single formula:

Φ+ = α Φ_base + (1 − α) Γ_max(Φ_dev, Φ_ref),   (10)

where Φ_base is the covariance matrix of a base PLDA from which a new PLDA is adapted; Φ_dev is the covariance matrix of a developer PLDA that is supposed to have some properties that are the same as, or similar to, those of the actual InD PLDA; and Φ_ref is the covariance matrix of a reference PLDA for comparison with the developer PLDA covariance matrix. The three PLDAs can be the same or different. When the same model is chosen for Φ_dev and Φ_ref, (10) is equivalent to the corresponding adaptation without regularization.

The above-mentioned domain adaptation methods, as well as a KALDI unsupervised domain adaptation method [25], can be formulated in this single generalized framework with specific covariance matrices as parameters (see Table 1). In addition, more cases can be derived from the generalized framework. For example, Special Case 7 in Table 1 is a variation of CIP with regularization (CIP reg):

Φ+ = α Φ_I + (1 − α) Γ_max(Φ_{I,pseudo}, Φ_O).   (11)

Rather than using the InD PLDA as the reference for covariance regularization, Special Case 7 uses the OOD PLDA. Special Case 8 is another variation of CIP reg that employs a pseudo-InD PLDA regularized using the OOD PLDA as the developer covariance:

Φ+ = α Φ_I + (1 − α) Γ_max(Γ_max(Φ_{I,pseudo}, Φ_O), Φ_I).   (12)

Table 1. The special cases derived from the general form.

  Method         Φ_base  Φ_dev                       Φ_ref           Eq.
1 CORAL+ [22]    Φ_O     Φ_{I,pseudo}                Φ_O             (4)
2 KALDI [25]     Φ_O     C_O                         Φ_O^b + Φ_O^w   -
3 LIP [13]       Φ_I     Φ_O                         Φ_O             (6)
4 LIP reg        Φ_I     Φ_O                         Φ_I             (8)
5 CIP            Φ_I     Φ_{I,pseudo}                Φ_{I,pseudo}    (7)
6 CIP reg        Φ_I     Φ_{I,pseudo}                Φ_I             (9)
7 -              Φ_I     Φ_{I,pseudo}                Φ_O             (11)
8 -              Φ_I     Γ_max(Φ_{I,pseudo}, Φ_O)    Φ_I             (12)

Note that Special Cases 1 and 2 are unsupervised methods, while Cases 3 to 8 are supervised methods.
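The single formula (10) and the rows of Table 1 can be sketched as one routine. The NumPy code below is an illustrative reading of the framework, not the authors' implementation; the gamma_max helper mirrors (5), and the variable names for the three PLDA covariances are ours.

```python
import numpy as np

def gamma_max(Y, Z, eps=1e-10):
    """Regularization of (5): find B with B^T Z B = I and B^T Y B = E
    (diagonal), then return B^{-T} max(E, I) B^{-1}."""
    w, V = np.linalg.eigh(Z)
    w = np.clip(w, eps, None)
    Z_isqrt = (V / np.sqrt(w)) @ V.T
    Z_sqrt = (V * np.sqrt(w)) @ V.T
    E, W = np.linalg.eigh(Z_isqrt @ Y @ Z_isqrt)
    B_invT = Z_sqrt @ W
    return B_invT @ np.diag(np.maximum(E, 1.0)) @ B_invT.T

def adapt(phi_base, phi_dev, phi_ref, alpha):
    """Eq. (10): Phi+ = alpha*Phi_base + (1-alpha)*Gamma_max(Phi_dev, Phi_ref).
    Applied once to the between- and once to the within-speaker covariance."""
    return alpha * phi_base + (1.0 - alpha) * gamma_max(phi_dev, phi_ref)

# Special cases of Table 1 (phi_pseudo as in Eq. (3)):
#   LIP     (row 3): adapt(phi_ind, phi_ood,    phi_ood,    a)  # plain interpolation
#   LIP reg (row 4): adapt(phi_ind, phi_ood,    phi_ind,    a)
#   CIP     (row 5): adapt(phi_ind, phi_pseudo, phi_pseudo, a)
#   CIP reg (row 6): adapt(phi_ind, phi_pseudo, phi_ind,    a)
```

Note how choosing Φ_dev = Φ_ref makes Γ_max the identity on Φ_dev, so rows 3 and 5 collapse to the unregularized interpolations (6) and (7).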
4. EXPERIMENTS
Experiments were conducted on the recent SRE18 dataset. Performance was evaluated in terms of equal error rate (EER) and minimum detection cost (minC_primary) [26].

The latest SREs organized by NIST have focused on domain mismatch as a particular technical challenge. The Switchboard, VoxCeleb 1 and 2, and MIXER corpora, the latter consisting of SREs 04, 06, 08, 10, and 12, were used to train an x-vector extractor. They were considered OOD data, as they are English speech corpora, while SRE18 is in Tunisian Arabic. Data augmentation applied to the OOD data follows that of our work in [24].

Only the MIXER corpora and their augmentation, which consisted of 262,427 segments from 4,322 speakers in total, were used as OOD data in PLDA training. SRE18 has three datasets: an evaluation set (13,451 segments), a development set (1,741 segments), and an unlabeled set (2,332 segments). We chose the larger labeled dataset, the evaluation set, as InD data to train the PLDA, and we conducted the evaluation on the development set. The enroll and test data in this section are those in the development set. The unlabeled set is used for adaptive symmetric score normalization [27] in all the experiments.

The x-vector extractor is a 43-layer TDNN with residual connections and 2-head attentive statistics pooling, in the same way as in [24]. The number of dimensions of the x-vector was 512. Mean shift was applied to the OOD data using its own mean. The InD data and the enroll and test data were centralized using the InD data. As is commonly done in most state-of-the-art systems, LDA was used to reduce the dimensionality to 150. In our interpolation domain adaptation experiments, LDA trained with OOD data was applied to both the InD and OOD vectors for training and evaluation. For the single InD PLDA training, LDA trained with InD data was used.

Table 2. PLDA with CORAL+ and linear interpolation domain adaptations.

  Systems       EER(%)  minC_primary
  InD PLDA      4.15    0.293
  OOD PLDA      4.38    0.249
  CORAL+ [22]   3.95    0.217
  LIP [13]      3.58    0.195

Table 3. Comparison of LIP and CIP with and without regularization, using interpolation weights of . .

  Systems       EER(%)  minC_primary
  LIP [13]      3.58    0.195
  LIP reg       3.58    0.195
  CIP           3.68    0.186
  CIP reg       3.58    0.173
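The adaptive symmetric score normalization used above can be sketched in a few lines. This is a generic AS-Norm sketch under our own assumptions (hypothetical function and argument names; cohort scores are PLDA scores of each trial side against the unlabeled set), not the exact recipe of [27].

```python
import numpy as np

def as_norm(score, enroll_cohort, test_cohort, top_k=200):
    """Adaptive symmetric score normalization: z-normalize a raw trial
    score against the top-k highest cohort scores of the enrollment side
    and of the test side, then average the two normalized scores."""
    e = np.sort(np.asarray(enroll_cohort))[-top_k:]  # top-k vs. enrollment
    t = np.sort(np.asarray(test_cohort))[-top_k:]    # top-k vs. test
    return 0.5 * ((score - e.mean()) / e.std() + (score - t.mean()) / t.std())
```

Selecting only the top-k cohort scores is what makes the normalization "adaptive": each trial is normalized against the part of the cohort most similar to its own enrollment and test utterances.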
We first evaluated the performance of PLDAs trained using either OOD or InD data (see Table 2). Also shown in the table are the OOD PLDA adapted to InD in an unsupervised manner, i.e., using InD data without labels, with CORAL+ [22], and in a supervised manner with LIP [13]. The weights in both adaptations were chosen to be . .

The InD PLDA resulted in a lower EER but a higher minC_primary than the OOD PLDA. We reckon that the InD PLDA did not outperform the OOD PLDA because the training data for the InD PLDA was too limited to train a good PLDA. Both domain adaptation methods outperformed either single OOD or InD system. As expected, the supervised linear interpolation outperformed unsupervised CORAL+.

The performance of the proposed correlation-alignment-based interpolation (CIP) and covariance regularization (reg) is shown in Table 3. The weights for all the interpolations were also chosen to be . . CIP performed worse in terms of EER than the other methods but achieved a better minC_primary than conventional linear interpolation (LIP). All of the proposed methods performed better than LIP in terms of minC_primary. The best system is CIP reg domain adaptation, which reduced minC_primary by . and . , respectively, as compared with the single InD and OOD systems in Table 2. It was also lower by . than that of LIP.

We further investigated the effects on speaker verification performance of varying the interpolation weights from . to . (see Figure 1). It can be seen that the proposed covariance regularization technique provided more robust performance for both LIP and CIP over a wider range of interpolation weights, which would be beneficial in practice. For the proposed correlation-alignment-based interpolation (CIP), though its EER was worse than that of the other interpolations at the weight α = 0. (Table 3), its best EER was . at the weight α = 0. , which is comparable to the other three systems' best EER of .56%. Also, the correlation-alignment interpolations (CIP and CIP reg) were better than the linear interpolations (LIP and LIP reg) in terms of minC_primary at all weights. The best EER of the CIP reg system was . lower than LIP's best.

Fig. 1. The proposed methods with varying weights.

Figure 2 summarizes the experimental results of all the special cases shown in Table 1. Performance improvement is observed in all cases.

Fig. 2. Results of special cases derived from the generalized framework, using interpolation weights of . .
5. SUMMARY
We have proposed a generalized framework for domain adaptation of PLDA in speaker recognition that works with both unsupervised and supervised methods, together with two new techniques: (1) correlation-alignment-based interpolation and (2) covariance regularization. The generalized framework enables us to combine the two techniques, as well as several existing supervised and unsupervised domain adaptation methods, into a single formulation. The proposed correlation-alignment-based interpolation method decreases minC_primary by up to 30.5% compared with the out-of-domain PLDA model before adaptation. It is also 5.5% lower than the conventional linear interpolation method with optimal interpolation weights. Further, the proposed regularization technique ensures robustness of the interpolations w.r.t. varying interpolation weights, which is essential in practice.

6. REFERENCES

[1] D. Snyder, D. Garcia-Romero, D. Povey, and S. Khudanpur, "Deep neural network embeddings for text-independent speaker verification," in Proc. Interspeech, 2017.
[2] E. Variani, X. Lei, E. McDermott, I. Moreno, and J. Gonzalez-Dominguez, "Deep neural networks for small footprint text-dependent speaker verification," in Proc. ICASSP, 2014.
[3] D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur, "X-vectors: Robust DNN embeddings for speaker recognition," in Proc. ICASSP, 2018.
[4] K. Okabe, T. Koshinaka, and K. Shinoda, "Attentive statistics pooling for deep speaker embedding," in Proc. Interspeech, 2018.
[5] N. Dehak, P. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, "Front-end factor analysis for speaker verification," IEEE Transactions on Audio, Speech, and Language Processing, 2011.
[6] A. Hatch, S. Kajarekar, and A. Stolcke, "Within-class covariance normalization for SVM-based speaker recognition," in Proc. Interspeech, 2006.
[7] C. Bishop, Pattern Recognition and Machine Learning, Springer, 2006.
[8] S. Prince and J. Elder, "Probabilistic linear discriminant analysis for inferences about identity," in Proc. ICCV, 2007.
[9] S. Ioffe, "Probabilistic linear discriminant analysis," in Proc. ECCV, 2006.
[10] P. Kenny, "Bayesian speaker verification with heavy-tailed priors," in Proc. Odyssey, 2010.
[11] H. Yamamoto, K. Lee, K. Okabe, and T. Koshinaka, "Speaker augmentation and bandwidth extension for deep speaker embedding," in Proc. Interspeech, 2019.
[12] H. Aronowitz, "Inter dataset variability compensation for speaker recognition," in Proc. ICASSP, 2014.
[13] D. Garcia-Romero and A. McCree, "Supervised domain adaptation for i-vector based speaker recognition," in Proc. ICASSP, 2014.
[14] A. Misra and J. Hansen, "Spoken language mismatch in speaker verification: An investigation with NIST-SRE and CRSS Bi-Ling corpora," in Proc. IEEE SLT, 2014.
[15] O. Glembek, J. Ma, P. Matejka, B. Zhang, O. Plchot, L. Burget, and S. Matsoukas, "Domain adaptation via within-class covariance correction in i-vector based speaker recognition systems," in Proc. ICASSP, 2014.
[16] S. H. Shum, D. A. Reynolds, D. Garcia-Romero, and A. McCree, "Unsupervised clustering approaches for domain adaptation in speaker recognition systems," in Proc. Odyssey, 2014.
[17] D. Garcia-Romero, A. McCree, S. Shum, N. Brummer, and C. Vaquero, "Unsupervised domain adaptation for i-vector speaker recognition," in Proc. Odyssey, 2014.
[18] J. Villalba and E. Lleida, "Unsupervised adaptation of PLDA by using variational Bayes methods," in Proc. ICASSP, 2014.
[19] D. Garcia-Romero, X. Zhang, A. McCree, and D. Povey, "Improving speaker recognition performance in the domain adaptation challenge using deep neural networks," in Proc. IEEE SLT, 2014.
[20] Q. Wang, H. Yamamoto, and T. Koshinaka, "Domain adaptation using maximum likelihood linear transformation for PLDA-based speaker verification," in Proc. ICASSP, 2016.
[21] J. Alam, G. Bhattacharya, and P. Kenny, "Speaker verification in mismatched conditions with frustratingly easy domain adaptation," in Proc. Odyssey, 2018.
[22] K. Lee, Q. Wang, and T. Koshinaka, "The CORAL+ algorithm for unsupervised domain adaptation of PLDA," in Proc. ICASSP, 2019.
[23] K. Lee, V. Hautamaki, T. Kinnunen, et al., "The I4U mega fusion and collaboration for NIST speaker recognition evaluation 2016," in Proc. Interspeech, 2017.
[24] K. Lee, H. Yamamoto, K. Okabe, Q. Wang, L. Guo, T. Koshinaka, J. Zhang, and K. Shinoda, "The NEC-TT 2018 speaker verification system," in Proc. Interspeech