VAE-based Domain Adaptation for Speaker Verification
Xueyi Wang∗†, Lantian Li∗, Dong Wang∗
∗Center for Speech and Language Technologies, Tsinghua University, Beijing, China
†China University of Mining & Technology, Beijing, China
Corresponding Author E-mail: [email protected]
Abstract—Deep speaker embedding has achieved satisfactory performance in speaker verification. By enforcing the neural model to discriminate the speakers in the training set, deep speaker embeddings (called 'x-vectors') can be derived from the hidden layers. Despite its good performance, the present embedding model is highly domain sensitive: it often works well in domains whose acoustic condition matches that of the training data (in-domain), but degrades in mismatched domains (out-of-domain). In this paper, we present a domain adaptation approach based on the Variational Auto-Encoder (VAE). This model transforms x-vectors to a regularized latent space; within this latent space, a small amount of data from the target domain is sufficient to accomplish the adaptation. Our experiments demonstrate that with this VAE-adaptation approach, speaker embeddings can be easily transformed to the target domain, leading to noticeable performance improvement.
I. INTRODUCTION
Automatic speaker verification (ASV) is an important biometric authentication technology and has found a broad range of applications. Conventional ASV methods are based on statistical models [1], [2], [3]. Perhaps the most famous statistical model is the Gaussian mixture model-universal background model (GMM-UBM) [1]. It factorizes the speech signal into the phonetic factor and the speaker factor, and this factorization process is based on the maximum likelihood (ML) criterion. This basic factorization model was later extended to various low-rank variants, including the joint factor analysis model [2] and the i-vector model [3]. Further improvements were obtained by either discriminative models (e.g., PLDA [4]) or phonetic knowledge transferring (e.g., the DNN-based i-vector model [5], [6]).

Recently, inspired by the success of deep learning in automatic speech recognition (ASR), neural-based ASV models have been studied and have shown great potential [7], [8], [9]. These models leverage the power of deep neural networks (DNNs) in learning strong speaker-related discriminative features, ideally from a large amount of speaker-labelled data. A state-of-the-art neural-based architecture is the x-vector model proposed by Snyder et al. [10]. In this architecture, frame-level deep features are derived by several fully-connected layers (or more structured layers); the first- and second-order statistics of the frame-level features are then collected and projected to a low-dimensional representation, which is called the 'x-vector'. During training, the objective of discriminating the speakers in the training dataset encourages the DNN structure to learn discriminative representations at both the frame level (deep features) and the utterance level (x-vectors).
The x-vector model has achieved state-of-the-art performance in various speaker recognition tasks, as well as in related tasks such as language identification [11].

In spite of its powerful discriminability, the x-vector model still heavily relies on a strong back-end scoring component, such as LDA or PLDA [12], [13]. This is puzzling at first glance, since in the i-vector regime the back-end models play the role of enhancing the discrimination among speakers, whereas x-vectors are discriminative already. Our previous study shows that the back-end models play a different role when accompanying x-vectors: instead of promoting discrimination, they essentially normalize the prior distribution of speaker x-vectors and the conditional distribution of the utterance x-vectors of a particular speaker [13].

A critical problem that usually arises in real-life applications is that the back-end models are highly domain-sensitive: an LDA-PLDA model that is well trained in one domain may degrade significantly in other domains whose acoustic condition is substantially different from that of the training data. To tackle this problem, this paper presents a domain adaptation approach based on the Variational Auto-Encoder (VAE). VAE is a powerful architecture that can project an unconstrained distribution to a simple Gaussian distribution, and the projection can be learned in a purely unsupervised way. In our previous study [13], VAE was used as a normalization model that normalizes the distribution of x-vectors into a more regularized Gaussian. This normalization, when combined with PLDA, clearly improves ASV performance. In this study, we investigate a domain adaptation approach based on the VAE-based normalization architecture. Our experiments showed that this VAE-based adaptation outperforms both the LDA- and PCA-based adaptation and the well-known unsupervised PLDA adaptation [14], [15].

The organization of this paper is as follows.
Section 2 describes the related work, and Section 3 presents the proposed VAE-based adaptation approach. Experiments are reported in Section 4, and the paper is concluded in Section 5.

II. RELATED WORK
This work is a direct extension of our previous work [13]. The main contribution of this extension is that we provide a thorough investigation of VAE-based domain adaptation for ASV.

Fig. 1. The three-component architecture of an x-vector system, where the normalization model is a VAE. X-vectors are extracted from the speaker-discriminative network and then pass through the VAE network for normalization. The normalized x-vectors are retrieved from the bottleneck layer of the VAE and scored by PLDA. The adaptation can be conducted on either VAE or PLDA, or both.

Some recent studies on domain adaptation in the x-vector model regime are related to this work. For example, Alam et al. [16] presented an unsupervised adaptation approach based on Correlation Alignment (CORAL) [17]. CORAL can align the distributions of in-domain and out-of-domain features, and Alam et al. found that this technique can be applied to compensate for the domain mismatch of x-vectors. Lee et al. [18] proposed a similar approach that employs CORAL to align the statistics of in-domain and out-of-domain vectors; the OOD statistics were then used to update the PLDA model. Our VAE-based approach works on the normalization model rather than the scoring model.

III. VAE-BASED DOMAIN ADAPTATION

A. Revisit VAE
VAE is a generative model that can represent a complex data distribution [19]. The key idea of VAE is to learn a DNN-based mapping function x = f(z) that maps a simple distribution p(z) to a complex distribution p(x). In other words, it represents complex observations by simply-distributed latent codes via a complex mapping function.

In brief, VAE consists of two parts: a decoder f(z) that maps p(z) to p(x), i.e.,

p(x) = ∫ p(x|z) p(z) dz = ∫ N(f(z), I) p(z) dz,

where p(x|z) is assumed to be Gaussian; and an encoder g(x) that produces a distribution q(z|x) approximating the posterior p(z|x):

p(z|x) ≈ q(z|x) = N(µ(x), σ(x)), where [µ(x), σ(x)] = g(x).

The training objective is the log probability of the training data, Σ_i ln p(x_i). It is intractable, so a variational lower bound is optimized instead, which depends on both the encoder g(x) and the decoder f(z). This is formally written as:

L(f, g) = Σ_i { −D_KL[q(z|x_i) || p(z)] + E_{q(z|x_i)}[ln p(x_i|z)] },

where D_KL is the KL divergence and E_q denotes the expectation w.r.t. the distribution q. As the expectation is intractable, a sampling scheme is often used, as shown in the blue box in Fig. 1. More details of the training process can be found in [19].
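To make this objective concrete, here is a minimal numpy sketch of the two ELBO terms under the Gaussian assumptions above: for a diagonal Gaussian encoder the KL term has a closed form, and with a unit-variance Gaussian decoder the reconstruction term reduces to a squared error up to an additive constant. The function name and toy shapes are illustrative, not taken from the paper.

```python
import numpy as np

def elbo(x, mu, log_var, x_recon):
    """Per-sample variational lower bound on ln p(x) for a VAE with
    encoder q(z|x) = N(mu, diag(exp(log_var))) and decoder p(x|z) = N(f(z), I)."""
    # Closed-form KL[q(z|x) || N(0, I)] for diagonal Gaussians.
    kl = 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var, axis=-1)
    # ln p(x|z) up to an additive constant: -0.5 * ||x - f(z)||^2.
    log_px = -0.5 * np.sum((x - x_recon) ** 2, axis=-1)
    return log_px - kl

# Sanity check: a perfect reconstruction with q(z|x) = N(0, I) gives an
# ELBO of zero (up to the dropped constant), since both terms vanish.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
print(elbo(x, np.zeros((4, 2)), np.zeros((4, 2)), x))  # all zero
```

In practice the expectation over q(z|x) is approximated by sampling z via the reparameterization trick; the closed-form KL term needs no sampling.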
B. VAE for normalization and adaptation

A conventional x-vector system consists of two components: a front-end model used to extract speaker embeddings (x-vectors), and a back-end model used for scoring. The front-end model is trained by discriminating the speakers in the training set, as shown in the dotted gray box in Fig. 1. Learning sufficiently discriminative and generalizable speaker embeddings requires a large amount of speaker-labelled data. In spite of its powerful discriminability, the x-vector model still heavily relies on a PLDA back-end.

A potential problem, however, is that these back-end models may be domain-sensitive. For instance, a well-trained PLDA tends to be ineffective on out-of-domain (OOD) data. To deal with this OOD issue, an intuitive idea is to retrain an OOD PLDA model. However, training a PLDA model from scratch requires a large amount of labelled data, usually thousands of speakers, each with multiple sessions. In many practical situations, collecting such a large amount of labelled data is very difficult and time consuming. Therefore, making full use of the limited speaker-labelled data is the key to dealing with the OOD issue. To tackle this problem, a multitude of PLDA adaptation approaches have been proposed [14], [15]. In all these methods, within-class and between-class statistics are collected from the adaptation data and then used to update the PLDA model.

In a previous study [13], we presented a three-component architecture, where a normalization model is introduced between the front-end (x-vector DNN) and the back-end (PLDA). The role of the normalization model is to project x-vectors to a latent space in which the projected codes are more regularized, e.g., more Gaussian. This model could be PCA or LDA, but we found VAE is more powerful, due to its capability of representing complex distributions with a simple distribution. This three-component architecture is shown in Fig.
1. This architecture motivates a new domain-adaptation approach, i.e., adapting the normalization model rather than the PLDA back-end. In particular, there are several advantages to adapting the VAE-based normalization model: (1) VAE is essentially a distribution mapping function that involves strong structural constraints (i.e., conditional Gaussians) in both the data space and the latent space; this highly structured architecture allows effective adaptation even with a very limited amount of data. (2) VAE training is purely unsupervised, so the adaptation data are easy to obtain. (3) After VAE adaptation, the normalized x-vectors (latent codes) remain regularized even though the distribution of the raw x-vectors may have greatly changed; this largely alleviates the necessity of PLDA adaptation. Fig. 1 illustrates the VAE-based adaptation, where the parameters of both VAE and PLDA could be adapted using the OOD data.

IV. EXPERIMENTS
A. Data
Three datasets were used in our experiments: VoxCeleb, SITW and CSLT-SITW. VoxCeleb was used for model training, while the other two were used for evaluation. More information about these three datasets is presented below.
VoxCeleb: A large-scale free speaker database collected by the University of Oxford, UK [20]. Data augmentation was applied, where the MUSAN corpus [21] was used to generate noisy utterances and the room impulse responses (RIRS) corpus [22] was used to generate reverberant utterances. This dataset, after removing the utterances shared with SITW, was used to train the DNN x-vector model, plus the PLDA and VAE models.
SITW-Eval.Core: A standard free database collected by [23] for ASV evaluation. It was collected from open-source media channels and consists of speech data covering well-known persons. This dataset was used as the IND test set.

CSLT-SITW: A small database collected by CSLT for commercial usage. It consists of speakers, each of whom records a simple Chinese command word, with a duration of about seconds. The scenarios involve laboratory, corridor, street, restaurant, bus, subway, mall, home, etc. Speakers varied their recording devices and poses during the recording. In our experiments, some speakers were used for OOD adaptation (OOD adaptation set), and the rest were used for OOD evaluation (OOD test set).

B. Settings
We built several systems to validate the VAE-based domain adaptation. All these systems use the same x-vector front-end and PLDA back-end, but differ in the normalization model. We denote these systems as follows.
Baseline: The baseline x-vector system. It was built following the Kaldi SITW recipe [24]. The feature-learning component is a -layer time-delay neural network (TDNN). The statistics pooling layer computes the mean and standard deviation of the frame-level features from a speech segment. The size of the output layer corresponds to the number of speakers in the training set. Once trained, the activations of the penultimate hidden layer are read out as an x-vector. There is no normalization model.

PCA: As the baseline, but with PCA as the normalization model. Similar to VAE, PCA is also an unsupervised model, though it is linear and shallow.

LDA: As the baseline, but with LDA as the normalization model, using the same code-space dimension as PCA.

VAE: As the baseline, but with VAE as the normalization model. The VAE model is a -layer DNN.

C-VAE: As the baseline, but with C-VAE as the normalization model. C-VAE is a variant of VAE, with a cohesive loss involved to encourage within-class coherence [13].
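The statistics pooling step described above maps a variable-length frame sequence to a fixed-length vector; a numpy sketch (the 512-dimensional feature size is illustrative, not the recipe's actual width):

```python
import numpy as np

def stats_pooling(frames):
    """Statistics pooling: concatenate the per-dimension mean and standard
    deviation of frame-level features, turning a variable-length (T, D)
    sequence into a fixed-length (2*D,) segment representation."""
    return np.concatenate([frames.mean(axis=0), frames.std(axis=0)])

rng = np.random.default_rng(2)
seg = rng.standard_normal((300, 512))   # 300 frames of 512-dim deep features
print(stats_pooling(seg).shape)         # (1024,)
```

In the actual TDNN, further fully-connected layers operate on this segment-level vector, and the x-vector is read out from one of them.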
C. Basic results
We first present the basic results evaluated on the IND test set and the OOD test set. All three components of the system (front-end, normalization, back-end) are trained with VoxCeleb. The results in terms of equal error rate (EER) are reported in Table I. As expected, for all five systems the performance on the IND data is better than that on the OOD data. For the baseline system (without any normalization), the performance degradation on the OOD data is not large, suggesting that the DNN x-vector model has been well trained and is fairly generalizable. For the systems with normalization models (PCA, LDA, VAE and C-VAE), the performance on the IND data is significantly improved, which confirms the contribution of normalization. However, the performance on the OOD data remains nearly unchanged, indicating that all these normalization models suffer from domain mismatch. In particular, the two VAE systems drop the most on the OOD data, though their performance on the IND data is the best. This is not surprising, as VAE/C-VAE are the most complex models and so tend to overfit the training domain.
TABLE I
EER(%) RESULTS OF VARIOUS SYSTEMS ON THE IND DATA AND THE OOD DATA.

      Baseline  PCA    LDA    VAE    C-VAE
IND   16.79     4.84   3.80   3.64   3.77
OOD   18.51     14.58  14.82  16.72  15.58
D. PLDA adaptation
In this experiment, we keep all the settings as in Section IV-C, but adapt the PLDA model using the OOD adaptation data. This back-end adaptation will partly mitigate the domain-mismatch problem and so presumably improves the performance of all the systems. We investigated two adaptation schemes: PLDA-RET, which retrains the PLDA model from scratch, and PLDA-UAT, which adapts the PLDA model using the unsupervised adaptation approach proposed by [15]. The results are presented in Table II.
TABLE II
EER(%) RESULTS ON THE OOD SET WITH PLDA ADAPTATION.

           Baseline  PCA    LDA    VAE    C-VAE
PLDA       18.51     14.58  14.82  16.72  15.58
PLDA-RET   15.25     13.83  14.18  13.85  13.47
PLDA-UAT   14.49
Firstly, we observe that both PLDA adaptation approaches improve the performance of all five systems, as expected. Secondly, the best performance is obtained by the PCA system with PLDA-UAT. The VAE and C-VAE systems do not work as well as the PCA and LDA systems. This indicates that PLDA adaptation cannot fully compensate for the domain mismatch inherent in the VAE/C-VAE models.
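As a rough illustration of the interpolation idea underlying such unsupervised PLDA adaptation, the sketch below shifts the in-domain within- and between-class covariances toward out-of-domain estimates. This is a deliberate simplification: the method of [15] also derives how the unlabeled OOD total covariance is split into the two terms, which is not reproduced here, and the weight `alpha` is a hypothetical knob.

```python
import numpy as np

def adapt_plda_covs(within_in, between_in, within_ood, between_ood, alpha=0.5):
    """Interpolate in-domain PLDA covariances toward OOD statistics.
    alpha = 0 keeps the in-domain model; alpha = 1 fully trusts the OOD data."""
    within = (1.0 - alpha) * within_in + alpha * within_ood
    between = (1.0 - alpha) * between_in + alpha * between_ood
    return within, between

# Toy diagonal covariances, purely for illustration.
d = 4
w, b = adapt_plda_covs(np.eye(d), 2.0 * np.eye(d),
                       3.0 * np.eye(d), np.eye(d), alpha=0.5)
print(np.diag(w), np.diag(b))  # [2. 2. 2. 2.] [1.5 1.5 1.5 1.5]
```

The adapted covariances then replace the original ones in PLDA scoring; no speaker labels from the new domain are required once the OOD statistics are available.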
E. Adaptation for normalization
We have found that the normalization models, in particular VAE and C-VAE, suffer from domain mismatch on the OOD data, which PLDA adaptation cannot fully address. In this experiment, we adapt both the normalization model and the PLDA back-end. For simplicity, both adaptations are implemented as re-training. The results are shown in Table III, where 'Norm-Adapt' denotes the normalization model adaptation.
TABLE III
EER(%) RESULTS ON THE OOD SET WITH ADAPTATION ON BOTH THE NORMALIZATION MODELS AND THE PLDA BACK-END.

                        PCA    LDA    VAE    C-VAE
PLDA-RET                13.83  14.18  13.85  13.47
Norm-Adapt + PLDA-RET   13.31  14.84
Firstly, it can be observed that the adaptation of the normalization models delivers performance gains on all these systems, compared with the sole PLDA adaptation (PLDA-RET). As expected, the improvement on the VAE and C-VAE systems is much more significant than that on the PCA and LDA systems, indicating that adaptation is more important for complex normalization models. Overall, the C-VAE system obtains the best performance with both normalization model adaptation and PLDA adaptation. This performance is better than that of the best unsupervised PLDA adaptation shown in Table II.
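As a linear analogy to this re-training of the normalization model, the sketch below refits a simple whitening normalizer (a shallow stand-in for the PCA or VAE normalizer) on a small pool of unlabeled OOD vectors. Like the VAE adaptation described above, it needs no speaker labels; the synthetic data and function names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(1)

def fit_whitener(vectors):
    """Fit a mean-removal + ZCA-whitening transform on unlabeled vectors.
    Like the VAE normalizer, it is estimated without speaker labels."""
    mean = vectors.mean(axis=0)
    cov = np.cov(vectors - mean, rowvar=False)
    evals, evecs = np.linalg.eigh(cov)
    w = evecs @ np.diag(1.0 / np.sqrt(evals + 1e-8)) @ evecs.T
    return mean, w

# A small unlabeled OOD pool whose distribution is shifted and scaled
# relative to the (hypothetical) training domain.
ood = 3.0 * rng.standard_normal((200, 16)) + 5.0

mean, w = fit_whitener(ood)       # "adapt" the normalizer on OOD data only
ood_norm = (ood - mean) @ w       # normalized OOD vectors
# Maximum deviation of the normalized covariance from identity: near zero,
# i.e., the refitted normalizer re-regularizes the shifted pool.
print(float(np.abs(np.cov(ood_norm, rowvar=False) - np.eye(16)).max()))
```

In the paper's setting, the analogous step is re-training the VAE on the OOD adaptation set; the point of the sketch is only that distribution normalization can be re-estimated from a small unlabeled pool.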
F. Analysis
To better understand these adaptation methods, we compute the skewness and kurtosis of the distributions of normalized x-vectors of utterances in the OOD test dataset. The skewness and kurtosis are defined as follows:
Skew(x) = E[(x − µ_x)^3] / σ_x^3,

Kurt(x) = E[(x − µ_x)^4] / σ_x^4 − 3,

where µ_x and σ_x denote the mean and standard deviation of x, respectively. The more Gaussian a distribution is, the closer to zero both values are.

The utterance-level skewness and kurtosis of x-vectors normalized by the different normalization models are reported in Table IV. The Original group denotes the normalized vectors produced by the original normalization models trained with VoxCeleb, and the Adaptation group denotes the normalized vectors produced by the adapted normalization models.
TABLE IV
UTTERANCE-LEVEL SKEWNESS AND KURTOSIS OF X-VECTORS NORMALIZED BY DIFFERENT NORMALIZATION MODELS.

                       Skew     Kurt
Original    Baseline   -0.0890  -0.1154
            PCA         0.0004   0.0713
            LDA         0.0050   0.1257
            VAE         0.0096   0.0560
            C-VAE      -0.0132  -0.0027
Adaptation  PCA        -0.0076   0.1447
            LDA         0.0054   0.3465
            VAE        -0.0010  -0.0115
            C-VAE      -0.0023   0.0011
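The skewness and excess kurtosis used here can be estimated from samples with a few lines of numpy (a generic sketch on synthetic data, not the paper's actual x-vectors):

```python
import numpy as np

def skew_kurt(x):
    """Sample skewness and excess kurtosis; both approach zero as the
    empirical distribution approaches a Gaussian."""
    z = (x - x.mean()) / x.std()
    return (z**3).mean(), (z**4).mean() - 3.0

rng = np.random.default_rng(3)
gauss = rng.standard_normal(200_000)    # skewness and kurtosis both near 0
flat = rng.uniform(-1.0, 1.0, 200_000)  # flat-topped: kurtosis near -1.2
print(skew_kurt(gauss))
print(skew_kurt(flat))
```

The −3 offset makes the statistic zero for an exact Gaussian, so deviations in either direction indicate non-Gaussianity.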
In the Original group, the skewness and kurtosis values of the utterance-level x-vectors are clearly reduced by any of the normalization models, confirming that PCA, LDA, VAE and C-VAE are all capable of normalizing x-vectors. Moreover, the skewness and kurtosis of the PCA- and LDA-normalized vectors are smaller than those of the VAE- and C-VAE-normalized vectors. This indicates that in the OOD scenario, PCA and LDA perform better than VAE and C-VAE at vector normalization. This is consistent with the observations in Table II, where the PCA and LDA systems perform better than the VAE and C-VAE systems on the OOD data.

After adaptation, the skewness and kurtosis of the VAE- and C-VAE-normalized vectors are clearly reduced. This is understandable, as these two models are the most powerful in distribution normalization: the normalization does not work well on the OOD data, but a simple adaptation recovers its power quickly. Again, these results are consistent with the observations shown in Table III, where VAE and C-VAE show the best performance after adaptation.

V. CONCLUSIONS

This paper proposed a VAE-based domain adaptation approach for deep speaker embedding. VAE (and its variant C-VAE) is a powerful model for normalizing the distribution of x-vectors, and can be easily adapted to a new domain with a small amount of data. Experiments demonstrated that this VAE-based adaptation outperforms the LDA- and PCA-based adaptation, and when combined with PLDA re-training, it outperforms the unsupervised PLDA adaptation.

ACKNOWLEDGEMENT
This work was supported by the National Natural Science Foundation of China No. 61633013, and the Postdoctoral Science Foundation of China No. 2018M640133.
REFERENCES
[1] Douglas A. Reynolds, Thomas F. Quatieri, and Robert B. Dunn, "Speaker verification using adapted Gaussian mixture models," Digital Signal Processing, vol. 10, no. 1-3, pp. 19–41, 2000.
[2] Patrick Kenny, Gilles Boulianne, Pierre Ouellet, and Pierre Dumouchel, "Joint factor analysis versus eigenchannels in speaker recognition," IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 4, pp. 1435–1447, 2007.
[3] Najim Dehak, Patrick J. Kenny, Réda Dehak, Pierre Dumouchel, and Pierre Ouellet, "Front-end factor analysis for speaker verification," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 4, pp. 788–798, 2011.
[4] Sergey Ioffe, "Probabilistic linear discriminant analysis," in ECCV, 2006, pp. 531–542.
[5] Patrick Kenny, Vishwa Gupta, Themos Stafylakis, P. Ouellet, and J. Alam, "Deep neural networks for extracting Baum-Welch statistics for speaker recognition," in Proc. Odyssey, 2014, pp. 293–298.
[6] Yun Lei, Nicolas Scheffer, Luciana Ferrer, and Mitchell McLaren, "A novel scheme for speaker recognition using a phonetically-aware deep neural network," in ICASSP. IEEE, 2014, pp. 1695–1699.
[7] Ehsan Variani, Xin Lei, Erik McDermott, Ignacio Lopez Moreno, and Javier Gonzalez-Dominguez, "Deep neural networks for small footprint text-dependent speaker verification," in ICASSP. IEEE, 2014, pp. 4052–4056.
[8] Georg Heigold, Ignacio Moreno, Samy Bengio, and Noam Shazeer, "End-to-end text-dependent speaker verification," in ICASSP. IEEE, 2016, pp. 5115–5119.
[9] Lantian Li, Yixiang Chen, Ying Shi, Zhiyuan Tang, and Dong Wang, "Deep speaker feature learning for text-independent speaker verification," in Interspeech, 2017, pp. 1542–1546.
[10] David Snyder, Daniel Garcia-Romero, Gregory Sell, Daniel Povey, and Sanjeev Khudanpur, "X-vectors: Robust DNN embeddings for speaker recognition," in ICASSP. IEEE, 2018, pp. 5329–5333.
[11] David Snyder, Daniel Garcia-Romero, Alan McCree, Gregory Sell, Daniel Povey, and Sanjeev Khudanpur, "Spoken language recognition using x-vectors," in Proc. Odyssey 2018 The Speaker and Language Recognition Workshop, 2018, pp. 105–111.
[12] Weicheng Cai, Jinkun Chen, and Ming Li, "Exploring the encoding layer and loss function in end-to-end speaker and language recognition system," in Proc. Odyssey 2018 The Speaker and Language Recognition Workshop, 2018, pp. 74–81.
[13] Yang Zhang, Lantian Li, and Dong Wang, "VAE-based regularization for deep speaker embedding," arXiv preprint arXiv:1904.03617, 2019.
[14] Daniel Garcia-Romero, Xiaohui Zhang, Alan McCree, and Daniel Povey, "Improving speaker recognition performance in the domain adaptation challenge using deep neural networks," in SLT. IEEE, 2014, pp. 378–383.
[15] Daniel Garcia-Romero, Alan McCree, Stephen Shum, Niko Brummer, and Carlos Vaquero, "Unsupervised domain adaptation for i-vector speaker recognition," in Proceedings of Odyssey: The Speaker and Language Recognition Workshop, 2014.
[16] Jahangir Alam, Gautam Bhattacharya, and Patrick Kenny, "Speaker verification in mismatched conditions with frustratingly easy domain adaptation," in Proc. Odyssey 2018 The Speaker and Language Recognition Workshop, 2018, pp. 176–180.
[17] Baochen Sun, Jiashi Feng, and Kate Saenko, "Return of frustratingly easy domain adaptation," in Thirtieth AAAI Conference on Artificial Intelligence, 2016.
[18] Kong Aik Lee, Qiongqiong Wang, and Takafumi Koshinaka, "The CORAL+ algorithm for unsupervised domain adaptation of PLDA," in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 5821–5825.
[19] Diederik P. Kingma and Max Welling, "Auto-encoding variational Bayes," arXiv preprint arXiv:1312.6114, 2013.
[20] Arsha Nagrani, Joon Son Chung, and Andrew Zisserman, "VoxCeleb: a large-scale speaker identification dataset," arXiv preprint arXiv:1706.08612, 2017.
[21] David Snyder, Guoguo Chen, and Daniel Povey, "MUSAN: A Music, Speech, and Noise Corpus," 2015.
[22] Tom Ko, Vijayaditya Peddinti, Daniel Povey, Michael L. Seltzer, and Sanjeev Khudanpur, "A study on data augmentation of reverberant speech for robust speech recognition," in ICASSP. IEEE, 2017, pp. 5220–5224.
[23] Mitchell McLaren, Luciana Ferrer, Diego Castan, and Aaron Lawson, "The speakers in the wild (SITW) speaker recognition database," in Interspeech, 2016, pp. 818–822.
[24] Daniel Povey, Arnab Ghoshal, Gilles Boulianne, Lukas Burget, Ondrej Glembek, Nagendra Goel, Mirko Hannemann, Petr Motlicek, Yanmin Qian, Petr Schwarz, et al., "The Kaldi speech recognition toolkit," in ASRU. IEEE, 2011.