SPEAKER VERIFICATION USING END-TO-END ADVERSARIAL LANGUAGE ADAPTATION

Johan Rohdin*, Themos Stafylakis*, Anna Silnova, Hossein Zeinali, Lukáš Burget, Oldřich Plchot

Brno University of Technology, Speech@FIT, Brno, Czech Republic
{rohdin,isilnova,zeinali,burget,plchot}@fit.vutbr.cz
Omilia - Conversational Intelligence, Athens, Greece
[email protected]
ABSTRACT
In this paper we investigate the use of adversarial domain adaptation for addressing the problem of language mismatch between speaker recognition corpora. In the context of speaker verification, adversarial domain adaptation methods aim at minimizing certain divergences between the distributions that the utterance-level features (i.e. speaker embeddings) follow when drawn from source and target domains (i.e. languages), while preserving their capacity to recognize speakers. Neural architectures for extracting utterance-level representations enable us to apply adversarial adaptation methods in an end-to-end fashion and train the network jointly with the standard cross-entropy loss. We examine several configurations, such as the use of (pseudo-)labels on the target domain as well as domain labels in the feature extractor, and we demonstrate the effectiveness of our method on the challenging NIST SRE16 and SRE18 benchmarks.
Index Terms — Speaker recognition, domain adaptation
1. INTRODUCTION
The need for domain adaptation (DA) arises in cases when the target domain data is insufficient (and possibly unlabeled) for training a model from scratch, and therefore source domain data (assumed labeled and sufficient for training a model) should be leveraged as well. The core idea behind DA is that the knowledge distilled from the source domain can be transferred to the target domain, despite the differences in the marginal distributions of the two domains. Conventional approaches to DA, such as fine-tuning a source domain model on the target domain data, may fail in many settings due to the target data being weakly labeled or even unlabeled.

*Equal contribution. This project has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Sklodowska-Curie grant agreement No. 665860 and is co-financed by the South Moravian Region. The project was also supported by the Czech Science Foundation under project No. GJ17-23870Y.
DA methods for speaker verification are of particular interest, as for many real-world applications large amounts of target domain labeled data are rarely available. Hence, for training state-of-the-art models, which require several thousand training utterances, one should resort to large out-of-domain corpora and use the small and possibly unlabeled target domain datasets for language, channel or other types of adaptation. In order to promote further research in DA methods, MIT-LL and NIST have organized three evaluations (namely the MIT-LL DA challenge DAC-2013, NIST SRE16, and the recent NIST SRE18), with the two latter focused primarily on language adaptation. Several DA methods were introduced as part of those evaluations, the majority of which approach the problem as a transformation of fixed utterance-level representations, such as i-vectors.

In this article we examine the use of recently emerged adversarial DA methods, which employ Generative Adversarial Networks (GANs) as a means to reduce the mismatch between source and target domains [1]. Different from [2], where an adversarial architecture is proposed for i-vector adaptation and tested on mildly mismatched domains (DAC-2013, between Switchboard and NIST data, both of which are telephone data in English), (a) we propose an end-to-end DA method by adding the adversarial loss to the cross-entropy loss of the x-vector architecture, and (b) we evaluate our method on the challenging task of language adaptation. Moreover, we use Wasserstein GANs, a recently proposed variant of GANs which addresses the vanishing gradient problem by replacing the domain discrepancy measure with the Wasserstein distance [3]. We also explore the use of speaker labels in the adaptation data, even in the form of pseudo-labels for cases where the set is unlabeled, and we show that they are very helpful in attaining good performance.
Finally, we examine the use of domain labels, which we concatenate to the layers of the network to learn domain-dependent transforms using a single network. To the best of our knowledge, we are the first to utilize domain labels in such a way.

2. DOMAIN ADAPTATION IN SPEAKER RECOGNITION

During the past few years, several DA methods for speaker recognition have been proposed. In the case of unsupervised adaptation, several methods apply a clustering algorithm in order to estimate speaker labels, with which a target-domain PLDA model is trained, while interpolation between the parameters of the source and target-domain PLDA models is applied to obtain the adapted PLDA [4, 5, 6]. The standard speaker recognition recipe in the Kaldi toolkit utilizes a simpler method for unsupervised adaptation that does not require clustering [7]. This method aims at adjusting the covariance matrices of the PLDA model so that its total covariance better matches the total covariance of the adaptation data. An alternative approach is to compensate for dataset shift in the i-vector space by modelling the subspace of dataset shift and removing those directions from the i-vectors [8]. Other approaches do not attempt to cluster the utterances and perform DA simply by matching first and second order statistics of the i-vectors between source and target domains [9].

Closer to the spirit of our work, two methods based on domain-adversarial adaptation and maximum mean discrepancy were recently introduced. In [2], a domain-adversarial neural network (DANN) is employed in order to transform i-vectors to a domain-invariant representation space. The authors follow the recipe introduced in [1] without using or estimating speaker labels for the target domain training data. They evaluate their method on DAC-2013 and demonstrate significant gains over other DA methods.
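The PLDA parameter interpolation used by the clustering-based methods [4, 5, 6] amounts to a convex combination of the source- and target-domain covariance parameters. A minimal NumPy sketch (function name and the interpolation weight are illustrative, not taken from those papers):

```python
import numpy as np

def interpolate_plda(between_src, within_src, between_tgt, within_tgt, alpha=0.5):
    """Convex combination of source- and target-domain PLDA covariances.
    alpha weights the (clustering-derived) target-domain model."""
    B = alpha * between_tgt + (1.0 - alpha) * between_src
    W = alpha * within_tgt + (1.0 - alpha) * within_src
    return B, W

# toy diagonal covariances
Bs, Ws = np.diag([2.0, 1.0]), np.eye(2)   # source-domain PLDA
Bt, Wt = np.diag([1.0, 1.0]), 2.0 * np.eye(2)  # target-domain PLDA
B, W = interpolate_plda(Bs, Ws, Bt, Wt, alpha=0.5)
```

With alpha = 0.5 the adapted model sits halfway between the two domains; in practice alpha would be tuned on held-out in-domain data.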
In [10], DA is performed using maximum mean discrepancy (MMD) as a means to reduce the mismatch between the two distributions. The main differences compared to the DANN-based architecture in [2] are (a) the use of MMD, which makes training easier compared to GANs, (b) the use of a reconstruction loss instead of cross-entropy in the main network (i.e. an autoencoder architecture instead of a classifier over speakers [11]), and (c) the application field, which is the language adaptation task of NIST SRE16 instead of DAC-2013. The method yields slightly better results compared to inter-dataset variability compensation [8].
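The MMD criterion used in [10] can be estimated directly from two sets of embeddings. A small sketch of the standard (biased) empirical estimator with an RBF kernel, assuming nothing beyond the textbook definition (the bandwidth value is illustrative):

```python
import numpy as np

def mmd_rbf(hs, ht, sigma=3.0):
    """Biased empirical MMD^2 between two embedding sets, RBF kernel."""
    def kernel(a, b):
        # pairwise squared Euclidean distances
        d2 = np.sum(a**2, 1)[:, None] + np.sum(b**2, 1)[None, :] - 2.0 * a @ b.T
        return np.exp(-d2 / (2.0 * sigma**2))
    return kernel(hs, hs).mean() + kernel(ht, ht).mean() - 2.0 * kernel(hs, ht).mean()

rng = np.random.default_rng(0)
same = mmd_rbf(rng.normal(0, 1, (200, 8)), rng.normal(0, 1, (200, 8)))
shifted = mmd_rbf(rng.normal(0, 1, (200, 8)), rng.normal(2, 1, (200, 8)))
```

Two samples from the same distribution give an MMD near zero, while a mean-shifted "target domain" gives a clearly larger value; minimizing this quantity over the embedding extractor is what pulls the two domains together.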
3. ADVERSARIAL ADAPTATION ALGORITHM

3.1. Notation and Wasserstein distance
We assume a labeled source dataset X_s = {(x_i^s, y_i^s)}_{i=1}^{n_s} from the source domain D_s, and a target dataset X_t = {(x_i^t, y_i^t)}_{i=1}^{n_t} from the target domain D_t, where x ∈ R^{m,·} denotes utterances and the labels y_i^t may be given, estimated (e.g. using clustering) or not used at all. The two domains (i.e. languages) have different marginal data distributions, P_{x^s} and P_{x^t} respectively. The embedding extractor, which is a standard TDNN x-vector architecture up to the embedding layer, implements a function f_g: R^{m,·} → R^d parametrized by θ_g, where d is the size of the embedding. The additional structure, useful only during training, is called the domain critic and is a feed-forward neural network implementing a function f_w: R^d → R parametrized by θ_w.

Fig. 1. Block-diagram of the architecture.

The Wasserstein distance between the two representation distributions, where h^s = f_g(x^s) and h^t = f_g(x^t), is approximated by

L_wd(x^s, x^t) = (1/n_s) Σ_{x^s ∈ X_s} f_w(f_g(x^s)) − (1/n_t) Σ_{x^t ∈ X_t} f_w(f_g(x^t)).   (1)

As proposed in [3], an improved method (compared to [12]) for constraining f_w to be a 1-Lipschitz function (a necessary condition so that L_wd is an approximation of the Wasserstein distance) is to introduce a gradient penalty loss

L_grad(ĥ) = (‖∇_ĥ f_w(ĥ)‖_2 − 1)^2,   (2)

where ĥ is a set of features created by randomly pairing and linearly combining features from h^s and h^t [3].

The speaker discriminator (i.e. the part of the x-vector network after the embedding layer) implements a function f_c: R^d → R^l, i.e. it maps the embeddings h to the space of posterior probabilities over training speakers (either from source or from target domain) and is parametrized by θ_c (separate linear and softmax layers are assumed for source and target domains).
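To make Eqs. (1) and (2) concrete, the following NumPy sketch (not the paper's TensorFlow implementation; all values are illustrative) trains a toy linear critic f_w(h) = w·h by gradient ascent on the critic objective. For a linear critic the gradient in the penalty term is simply w, the same at every interpolate ĥ, so everything is computable analytically:

```python
import numpy as np

rng = np.random.default_rng(0)
hs = rng.normal(0.0, 1.0, (256, 16))   # source embeddings h^s = f_g(x^s)
ht = rng.normal(1.0, 1.0, (256, 16))   # target embeddings h^t = f_g(x^t)

w = np.zeros(16)                       # toy linear critic f_w(h) = w . h
alpha, gamma, n_critic = 0.05, 10.0, 25

for _ in range(n_critic):
    # Ascend on L_wd = mean f_w(h^s) - mean f_w(h^t); for a linear critic
    # the gradient w.r.t. w is the difference of the two batch means.
    grad_wd = hs.mean(axis=0) - ht.mean(axis=0)
    # Gradient penalty (||grad_h f_w(h_hat)||_2 - 1)^2 of Eq. (2); the
    # gradient of f_w w.r.t. h is w everywhere, so the penalty gradient
    # w.r.t. w is analytic.
    norm = np.linalg.norm(w) + 1e-12
    grad_pen = 2.0 * (norm - 1.0) * w / norm
    w += alpha * (grad_wd - gamma * grad_pen)

# The trained critic separates the domains (L_wd > 0) while staying
# approximately 1-Lipschitz (||w|| close to 1).
L_wd = (hs @ w).mean() - (ht @ w).mean()
```

In the real system the critic is a two-layer network and both losses are differentiated through the embedding extractor as well, but the mechanics of the minimax objective are the same.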
The classification loss L_c(x, y) is the standard cross-entropy over speakers. The architecture during training is illustrated in Fig. 1.

The training algorithm is given in Algorithm 1 (see [13] for more information). As we observe, the critic θ_w tries to maximize L_wd, i.e. to approximate the Wasserstein distance between the two domains, while the feature extractor θ_g tries to minimize it, yielding the usual minimax optimization problem of GANs.

Algorithm 1 Domain adaptation algorithm
  Initialize feature extractor, domain critic, speaker discriminator θ_g, θ_w, θ_c.
  if supervised: d ← (s, t) else: d ← s
  repeat
    Sample minibatches {(x_i^s, y_i^s)}, {(x_i^t, y_i^t)}
    for t = 1, ..., n do
      h^s ← f_g(x^s), h^t ← f_g(x^t)
      Sample h as random points along straight lines between h^s and h^t pairs
      ĥ ← {h^s, h^t, h}
      θ_w ← θ_w + α ∇_{θ_w} [L_wd(x^s, x^t) − γ L_grad(ĥ)]
    end for
    θ_c ← θ_c − α ∇_{θ_c} L_c(x^d, y^d)
    θ_g ← θ_g − α ∇_{θ_g} [L_c(x^d, y^d) + δ L_wd(x^s, x^t)]
  until θ_g, θ_w, θ_c converge

In the inner loop of the algorithm, a set of points ĥ is randomly chosen as linear combinations between randomly paired h^s and h^t, on which the gradient penalization is applied, constraining f_w to be a 1-Lipschitz function [3]. Finally, when labels are used in the target domain, the classification loss is backpropagated for both sets.

We use TensorFlow [14] for implementing the adversarial adaptation. We follow the standard Kaldi x-vector architecture [7], i.e. 5 TDNN layers with ReLU activation functions followed by batch normalization, then a pooling layer that accumulates means and standard deviations, then two feed-forward layers with ReLU and batch normalization, and finally a softmax layer for classifying speakers. For the critic network we use two feed-forward layers with 512 units and leaky ReLU activation functions. The critic network takes the x-vector, i.e.
the output of the first affine layer after pooling, as input and returns a scalar as output. The domain label is passed to the feature extractor as a binary variable which is concatenated to the input of every affine layer (0 corresponds to the source domain). This is equivalent to having domain-dependent biases, enabling the network to learn domain-dependent transforms. Based on light tuning to make the training stable, we set the parameters of the adversarial training to γ = 10., δ = 0., α = 0., α = 1., and n = 10. For supervised adaptation, we add a second softmax layer to the x-vector network, i.e. the source- and target-domain classifiers share all model parameters except those of their softmax layers. In order to better balance the source- and target-domain classification losses, we normalize them by the logarithm of their number of classes, so that the loss of a random prediction is approximately equal to one. After that, we set the weight for the target domain classification loss to 0.2 and the weight for the source domain classification loss to 0.8. In the experiments, we use minibatches containing 150 segments of the target domain data and 150 segments of the (labeled) source domain data. The lengths of the segments are 2-4s. We use stochastic gradient descent, starting with a learning rate of 1.0 which we then halve every 5 epochs, where an epoch is defined to be 400 minibatches. We stop the training after 85 epochs. During the first 3 epochs, we train only the critic and the network for source domain classification.
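The claim above that concatenating a binary domain label to an affine layer's input is equivalent to a domain-dependent bias can be checked in a few lines (shapes and names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))   # layer input activations
W = rng.normal(size=(9, 5))   # weights for [activation, domain_flag]
b = rng.normal(size=5)        # shared bias

def affine_with_domain(x, d):
    """Affine layer whose input is the activation concatenated with a
    binary domain label d (0 = source, 1 = target)."""
    xd = np.concatenate([x, np.full((len(x), 1), float(d))], axis=1)
    return xd @ W + b

# Equivalent view: shared weights W[:8] plus a domain-dependent bias.
W_shared, w_dom = W[:8], W[8]
src = affine_with_domain(x, 0)   # source: bias is b
tgt = affine_with_domain(x, 1)   # target: bias is b + w_dom
```

The last row of W acts as an extra bias that is switched on only for the target domain, which is exactly the domain-dependent transform described above.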
4. EXPERIMENTS
We conduct experiments on two databases: the NIST SRE 2018 cmn2 evaluation set, using the unlabeled cmn2 development set for adaptation, and the NIST SRE16 evaluation set [15], using the unlabeled major dataset for adaptation. It should be noted that the SRE16 evaluation data, as well as the unlabeled major data, contains utterances from two languages (Cantonese and Tagalog), which is not ideal for our proposed methods, since language labels are not provided (and are not estimated by our algorithm). Hence, we treat two distinct languages as one, which is clearly suboptimal. Further, for the SRE18 unlabeled development set, we have access to the telephone numbers, which should help the supervised training compared to just using utterance IDs. The adaptation data is augmented similarly to the training data [16], i.e. with babble, noise, music and reverberated versions of the utterances.

The baseline x-vector model is trained with the Kaldi toolkit [17] using the Kaldi SRE16 x-vector recipe [16], but with additional training data from VoxCeleb2, resulting in 12170 training speakers. We apply the adversarial DA to this model rather than training a model with adversarial DA from the beginning. We experimented with two backends. The first is identical to the backend in the Kaldi x-vector recipe [16]. This backend involves a preprocessing step which first reduces the x-vector dimension by LDA from 512 to 150, and then applies an unusual variant of length-norm. In all experiments, we center the evaluation data at the mean of the unlabeled development set instead of the mean of the PLDA training set. This can be seen as a light form of adaptation.

In this subsection, we evaluate the different variants of adversarial DA detailed in Section 3 with the Gaussian PLDA backend. In summary, three observations can be made. First, adversarial adaptation is effective when it is combined with supervised training on the target data.
Without supervised training on the target data, adversarial adaptation degrades the performance. Second, including language labels as side-information to the TDNN layers helps for SRE18 but not for SRE16. This is not too surprising, considering that SRE16 contains two languages which we treat as one.

(See https://github.com/kaldi-asr/kaldi/blob/master/src/ivector/plda.cc)

Table 1. Results with adversarial adaptation. DCF refers to the average minDCF at the operating points 0.01 and 0.005. EER refers to equal error rate in %. Cantonese and Tagalog are denoted by yue and tgl respectively.

                 SRE18        SRE16 all    SRE16 yue    SRE16 tgl
                 DCF   EER    DCF   EER    DCF   EER    DCF   EER
PLDA  Baseline   0.664 10.01  0.897 11.55  0.553  7.28  0.976 15.87
      sup        0.652  9.59  0.859 10.94  0.536  6.79  0.950 15.19
      adv        0.658 10.35  0.899 13.25  0.561  7.39  0.968 19.12
      adv+sup    0.630  8.89
We present results for the standard Kaldi-style unsupervised PLDA DA. This method essentially estimates the excess variance in the adaptation data and distributes a portion, ξ, of it to the PLDA between-class covariance matrix and a portion, η, to the PLDA within-class covariance matrix. (Our experiments with supervised PLDA adaptation using clustering methods were unsuccessful both for the baseline and for the adversarial DA model.) In Kaldi, ξ = 1 − η, and in the SRE16 x-vector recipe [16], ξ = 0. . The results for these settings are shown in the first part of Table 2 (PLDA adp 1). As can be seen, adversarial DA degrades the performance. However, the Kaldi settings of ξ and η may not be optimal when adversarial DA is applied. For example, if the adversarial DA manages to remove between-class variability much better than within-class variability, the balance between ξ and η should be different. Moreover, in these experiments we use the same adaptation data for both the adversarial DA and the PLDA adaptation. After being used for adversarial DA, the adaptation data is most likely closer to the source data than unseen (test) data is, meaning that the PLDA adaptation will not be strong enough. To mitigate this, we tune ξ and η on the SRE18 labeled development set (without requiring that they sum to 1). The results are shown in the second part of Table 2 (PLDA adp 2). As we observe, tuning ξ and η helps both the baseline and the models from adversarial training for SRE18. In terms of EER, the adversarial training now complements PLDA DA. For SRE16, using these values of ξ and η is worse than using the original Kaldi settings. Probably, the SRE18 development set is not similar enough to SRE16 for ξ and η to be properly estimated.

In the main experiment we adapted all layers of the network in an end-to-end fashion. In Table 3, we show results of adapting only the first layer after pooling.
This is similar to the i-vector DA approach in [18], although here the transformation we learn is an affine dimensionality reduction of x-vectors. Two observations can be made. First, when adapting only this layer there is no clear advantage of either supervised adaptation or including language labels.
Table 2. Results with adversarial DA and PLDA DA.

                       SRE18        SRE16 all
                       DCF   EER    DCF   EER
PLDA adp1 Baseline     0.580  9.05  0.613  8.01
          sup
          adv          0.616  9.70  0.664  9.42
          adv+sup      0.591  9.15  0.630  8.10
          adv+lan+sup  0.588  9.03  0.615  7.92
PLDA adp2 Baseline     0.572  8.51  0.656  7.98
          sup
          adv          0.602  9.21  0.677  9.15
          adv+sup      0.578  8.28  0.649  8.00
          adv+lan+sup  0.576
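The ξ/η covariance adaptation evaluated in Table 2 can be sketched as follows. This is a simplified NumPy illustration of the idea only (the actual Kaldi implementation works in a basis that diagonalizes the model covariances; the function name and toy values are ours):

```python
import numpy as np

def adapt_plda(B, W, S_adapt, xi=0.75, eta=0.25):
    """Split the excess total covariance of the adaptation data between the
    PLDA between-class (xi) and within-class (eta) covariance matrices."""
    excess = S_adapt - (B + W)            # variance unexplained by the model
    excess = np.clip(excess, 0.0, None)   # only add variance, never remove
    return B + xi * excess, W + eta * excess

# toy diagonal covariances
B = np.diag([1.0, 2.0])                   # between-class covariance
W = np.diag([1.0, 1.0])                   # within-class covariance
S_adapt = np.diag([3.0, 3.5])             # adaptation-data total covariance
B2, W2 = adapt_plda(B, W, S_adapt)
```

With ξ + η = 1, the adapted model's total covariance B2 + W2 matches the adaptation data exactly; tuning ξ and η independently (as done for PLDA adp 2) relaxes this constraint.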
Table 3. Results with adversarial DA after pooling.

               SRE18        SRE16 all
               DCF   EER    DCF   EER
PLDA Baseline  0.664 10.01  0.897 11.55
     sup       0.667  9.92  0.887 11.53
     adv

Second, in SRE16, adapting only this layer performs similarly to adapting all layers. This can possibly be mitigated by some form of regularization in the earlier layers.
5. CONCLUSION AND FUTURE WORK
In this paper we introduced an end-to-end DA method for x-vectors based on Wasserstein GANs. We examined several configurations, especially with respect to the use of speaker and domain labels. We provided a detailed evaluation on NIST SRE16 and SRE18, and a fair comparison with state-of-the-art DA methods. The results show the effectiveness of the method in certain experiments, but also emphasize the need for further experimentation. To this end, we consider training the system from scratch with the adversarial loss, applying the method to other DA tasks such as gender and channel adaptation, as well as addressing overfitting to the target domain data.

6. REFERENCES

[1] Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky, "Domain-adversarial training of neural networks,"
J. Mach. Learn. Res., vol. 17, no. 1, pp. 2096-2030, Jan. 2016.

[2] Qing Wang, Wei Rao, Sining Sun, Lei Xie, Eng Siong Chng, and Haizhou Li, "Unsupervised domain adaptation via domain adversarial training for speaker recognition," in Proc. ICASSP. IEEE, 2018, pp. 4889-4893.

[3] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C. Courville, "Improved training of Wasserstein GANs," in
Advances in Neural Information Processing Systems, 2017, pp. 5767-5777.

[4] Daniel Garcia-Romero, Alan McCree, Stephen Shum, Niko Brummer, and Carlos Vaquero, "Unsupervised domain adaptation for i-vector speaker recognition," in Proceedings of Odyssey: The Speaker and Language Recognition Workshop, 2014.

[5] Stephen H. Shum, Douglas A. Reynolds, Daniel Garcia-Romero, and Alan McCree, "Unsupervised clustering approaches for domain adaptation in speaker recognition systems," 2014.

[6] Sergey Novoselov, Timur Pekhovsky, Konstantin Simonchik, and Andrey Shulipa, "RBM-PLDA subsystem for the NIST i-vector challenge," in Fifteenth Annual Conference of the International Speech Communication Association, 2014.

[7] D. Snyder, D. Garcia-Romero, D. Povey, and S. Khudanpur, "Deep neural network embeddings for text-independent speaker verification," in Interspeech 2017, Aug. 2017.

[8] Hagai Aronowitz, "Inter dataset variability compensation for speaker recognition," in Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on. IEEE, 2014, pp. 4002-4006.

[9] Md Jahangir Alam, Gautam Bhattacharya, and Patrick Kenny, "Speaker verification in mismatched conditions with frustratingly easy domain adaptation," in Proc. Odyssey 2018 The Speaker and Language Recognition Workshop, 2018, pp. 176-180.

[10] Weiwei Lin, Man-Wai Mak, Longxin Li, and Jen-Tzung Chien, "Reducing domain mismatch by maximum mean discrepancy based autoencoders," in Proc. Odyssey 2018 The Speaker and Language Recognition Workshop, 2018, pp. 162-167.

[11] Alireza Makhzani, Jonathon Shlens, Navdeep Jaitly, Ian Goodfellow, and Brendan Frey, "Adversarial autoencoders," arXiv preprint arXiv:1511.05644, 2015.

[12] Martín Arjovsky, Soumith Chintala, and Léon Bottou, "Wasserstein generative adversarial networks," in Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, 2017, pp. 214-223.

[13] Jian Shen, Yanru Qu, Weinan Zhang, and Yong Yu, "Wasserstein distance guided representation learning for domain adaptation," in AAAI. AAAI Press, 2018, pp. 4058-4065.

[14] Martín Abadi et al., "TensorFlow: Large-scale machine learning on heterogeneous systems," 2015, software available from tensorflow.org.

[15] NIST, "NIST 2016 speaker recognition evaluation plan," 2018.

[16] David Snyder et al., "Kaldi SRE16 x-vector recipe," https://github.com/kaldi-asr/kaldi/tree/master/egs/sre16/v2, 2017, accessed: 2017-11.

[17] Daniel Povey, Arnab Ghoshal, Gilles Boulianne, Lukas Burget, Ondrej Glembek, Nagendra Goel, Mirko Hannemann, Petr Motlicek, Yanmin Qian, Petr Schwarz, et al., "The Kaldi speech recognition toolkit," in IEEE 2011 Workshop on Automatic Speech Recognition and Understanding. IEEE Signal Processing Society, 2011, number EPFL-CONF-192584.

[18] Qing Wang, Wei Rao, Sining Sun, Lei Xie, Eng Siong Chng, and Haizhou Li, "Unsupervised domain adaptation via domain adversarial training for speaker recognition," in