Cosine-Distance Virtual Adversarial Training for Semi-Supervised Speaker-Discriminative Acoustic Embeddings
Florian L. Kreyssig & Philip C. Woodland
Cambridge University Engineering Dept., Trumpington St., Cambridge, CB2 1PZ U.K.
{flk24,pcw}@eng.cam.ac.uk

Abstract
In this paper, we propose a semi-supervised learning (SSL) technique for training deep neural networks (DNNs) to generate speaker-discriminative acoustic embeddings (speaker embeddings). Obtaining large amounts of speaker recognition training data can be difficult for desired target domains, especially under privacy constraints. The proposed technique reduces requirements for labelled data by leveraging unlabelled data. The technique is a variant of virtual adversarial training (VAT) [1] in the form of a loss that is defined as the robustness of the speaker embedding against input perturbations, as measured by the cosine-distance. Thus, we term the technique cosine-distance virtual adversarial training (CD-VAT). In comparison to many existing SSL techniques, the unlabelled data does not have to come from the same set of classes (here speakers) as the labelled data. The effectiveness of CD-VAT is shown on the 2750+ hour VoxCeleb data set, where on a speaker verification task it achieves a reduction in equal error rate (EER) of 11.1% relative to a purely supervised baseline. This is 32.5% of the improvement that would be achieved from supervised training if the speaker labels for the unlabelled data were available.
Index Terms: semi-supervised, speaker embeddings, d-vector, speaker verification
1. Introduction
Speaker-discriminative acoustic embeddings (or just speaker embeddings) derived through deep learning techniques have become the state-of-the-art for learning speaker representations [2, 3] to be used for tasks such as speaker recognition, speaker verification or speaker diarisation [2, 4, 5, 6]. Previously, i-vectors [7] based on factor analysis were widely used. The neural networks used to generate speaker embeddings are typically trained on a speaker classification task, for which the input is the acoustic feature sequence of an utterance and the output is the speaker label of that utterance [2, 8]. By taking the output of a layer of this neural network (often the penultimate layer) as an embedding, a fixed-dimensional vector can be generated for any given input utterance. This vector is speaker-discriminative due to the training objective. It has been found that such speaker embeddings can be used to discriminate between speakers that are not present in the training data.

The quality of these embeddings will improve with the amount of training data and with the number of speakers in the training data, assuming the data comes from the target domain. However, the acquisition of enough suitable speaker classification data for the exact conditions one desires can be difficult. This is especially true as the regulations around identifiable user data tighten (see the General Data Protection Regulation (GDPR) or the California Consumer Privacy Act (CCPA)). Under these constraints it is useful to use audio data with associated speaker labels together with de-identified (unlabelled) data to train speaker embedding generators.

This paper proposes a method that enables semi-supervised learning (SSL) of speaker embeddings. In comparison to many SSL methods in machine learning, the proposed method does not assume that the labelled and unlabelled data comes from the same classes (here speakers). Therefore, a small amount of labelled data from a small number of speakers can be complemented by a larger amount of data from a large number of speakers. The proposed method is a newly derived sibling to virtual adversarial training (VAT) [1], which is an SSL method for classification tasks. Vanilla VAT assumes the labelled and unlabelled data to come from the same set of classes. This paper, however, attempts to utilise unlabelled data that comes from a completely different set of classes (here speakers).

The proposed SSL technique, termed cosine-distance virtual adversarial training (CD-VAT), works by adding an additional loss to the standard supervised training loss. The loss is defined as the cosine-distance between a speaker embedding generated for an utterance and the embedding generated for the same utterance after it was perturbed by an adversarial noise that maximally increases the cosine-distance to the original, unperturbed, embedding. The loss is computed for every data point in the labelled and unlabelled data sets, thus smoothing the embedding generator with respect to (w.r.t.) the input for all data points lying on the data manifold. It can, therefore, be seen as a regularisation technique that is informed by the unlabelled data, which constrains the neural network to learn embeddings that generalise well to unseen speakers.

This paper is organised as follows. Section 2 describes the CD-VAT loss and how the adversarial noise is computed. In Sec. 3, the experimental setup is described, including the data sets and evaluation metrics used. In Sec. 4 the experimental results are presented and Sec. 5 gives conclusions.
2. Cosine-Distance VAT
In nature, the outputs of most systems are smooth w.r.t. spatial and temporal inputs [9]. Prior studies have confirmed that smoothing the output distribution of a classifier (i.e., encouraging the classifier to output similar distributions) against perturbations of the input can improve its generalisation performance in semi-supervised learning [1, 10, 11, 12]. In the standard version of VAT [1] (the efficacy of which has been verified by [13, 14]) an additive loss is introduced, which tries to smooth the categorical output distribution (measured by the KL-divergence) around every data point that lies on the data manifold. Here, VAT will be formulated on the level of the embedding layer rather than the output (classification) layer. The purpose of the proposed variant of VAT is to smooth the embedding generator in terms of the cosine-distance; the technique is hence termed cosine-distance virtual adversarial training (CD-VAT). CD-VAT should be used together with an angular penalty loss (such as angular softmax [15]) rather than the standard cross-entropy loss. These types of losses are very popular for both speaker verification [16, 17, 18] as well as face verification and identification [15, 19]. When angular penalty losses are used during speaker classification training, the resulting embedding generator produces embeddings that are angularly discriminative, i.e. the cosine-distance between embeddings indicates whether embeddings come from the same speaker.
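As an illustration, the following is a minimal PyTorch-style sketch of such an angular penalty loss for the $m = 1$ angular-softmax case (the setting used for training in Sec. 3.3). The class name and tensor shapes are our own illustrative choices; the paper's actual implementation used HTK [35].

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AngularSoftmaxHead(nn.Module):
    """Minimal m=1 angular-softmax classification head (after [15]).

    Logits are ||e|| * cos(theta_j), i.e. the dot product between the
    embedding and an L2-normalised class-weight vector, so training
    encourages angular (cosine) discrimination between speakers.
    """
    def __init__(self, emb_dim: int, num_speakers: int):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_speakers, emb_dim))

    def forward(self, emb: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        w = F.normalize(self.weight, dim=1)   # unit-norm class weights, no bias
        logits = emb @ w.t()                  # ||e|| * cos(theta_j) per speaker
        return F.cross_entropy(logits, labels)
```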
2.1. The CD-VAT loss

CD-VAT adds an additional loss $\mathcal{R}_{\mathrm{CDVAT}}$ (the CD-VAT loss) to the supervised loss $\ell(\cdot,\cdot)$ with the interpolation constant $\alpha$, and is computed on both the labelled data set $\mathcal{D}_l$ (size $N_l$) and the unlabelled data set $\mathcal{D}_{ul}$ (size $N_{ul}$). The combined loss $\mathcal{L}$ is then used to train the parameters $\theta$ and in turn the embedding generator $e(x, \theta)$:

$$\mathcal{L}(\mathcal{D}_l, \mathcal{D}_{ul}, \theta) = \ell(\mathcal{D}_l, \theta) + \alpha\,\mathcal{R}_{\mathrm{CDVAT}}(\mathcal{D}_l, \mathcal{D}_{ul}, \theta) \quad (1)$$

The CD-VAT loss, $\mathcal{R}_{\mathrm{CDVAT}}$, is the sum of local losses $\mathrm{LCS}(x, \theta)$ that are computed for each input feature sequence $x \in \mathcal{D}_l, \mathcal{D}_{ul}$:

$$\mathcal{R}_{\mathrm{CDVAT}}(\mathcal{D}_l, \mathcal{D}_{ul}, \theta) = \frac{1}{N_l + N_{ul}} \sum_{x \in \mathcal{D}_l, \mathcal{D}_{ul}} \mathrm{LCS}(x, \theta) \quad (2)$$

The local cosine smoothness, $\mathrm{LCS}(x, \theta)$, is calculated in two steps. First, a perturbation $r_{\mathrm{CDVAT}}$ to the input sequence $x$ is found. This perturbation is chosen to be an adversarial perturbation that maximally changes the embedding $e(x + r_{\mathrm{CDVAT}}, \theta)$ of the input feature sequence $x$, as measured by the cosine-distance $\mathrm{cd}[\cdot,\cdot]$; $\epsilon$ is the maximum norm of $r_{\mathrm{CDVAT}}$:

$$r_{\mathrm{CDVAT}} = \underset{r;\,\|r\| \le \epsilon}{\arg\max}\; \mathrm{cd}\big[e(x, \hat\theta),\, e(x + r, \hat\theta)\big]\Big|_{\hat\theta = \theta} \quad (3)$$

$$\mathrm{cd}[a, b] = \frac{1}{2} - \frac{a^{\mathsf{T}} b}{2\,\|a\|\,\|b\|} \quad (4)$$

Second, $\mathrm{LCS}(x, \theta)$ is then the cosine-distance between the embedding of the (maximally) perturbed input sequence and the embedding for the unperturbed input sequence:

$$\mathrm{LCS}(x, \theta) = \mathrm{cd}\big[e(x, \hat\theta),\, e(x + r_{\mathrm{CDVAT}}, \theta)\big]\Big|_{\hat\theta = \theta} \quad (5)$$

$\hat\theta$ is the current setting of $\theta$ at a particular instant during optimisation, i.e. it is treated as a constant. The distinction between $\theta$ and $\hat\theta$ is made because gradients of $\mathrm{LCS}(x, \theta)$ are only propagated back through the embedding generated from the perturbed input, i.e. $e(x + r_{\mathrm{CDVAT}}, \theta)$, and not $e(x, \hat\theta)$. $\mathrm{LCS}(x, \theta)$ indicates how "sensitive" the embedding of input $x$ is.
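Assuming the perturbation $r_{\mathrm{CDVAT}}$ has already been computed (Sec. 2.2), Eqns. (1)-(5) translate into a few lines of code. Below is a minimal PyTorch-style sketch, where `.detach()` realises the constant $\hat\theta$ by blocking gradients through the clean embedding; `embed`, `supervised_loss` and the batch variables are illustrative names, not the paper's (HTK-based) implementation.

```python
import torch
import torch.nn.functional as F

def cosine_distance(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """cd[a, b] = 1/2 - a.b / (2 ||a|| ||b||), in [0, 1] (Eqn. 4)."""
    return 0.5 - F.cosine_similarity(a, b, dim=-1) / 2

def local_cosine_smoothness(embed, x, r_cdvat):
    """LCS(x, theta) of Eqn. (5): distance between the clean embedding
    (treated as a constant, i.e. theta-hat) and the perturbed one."""
    e_clean = embed(x).detach()   # no gradient through e(x, theta-hat)
    e_pert = embed(x + r_cdvat)   # gradients flow through this branch only
    return cosine_distance(e_clean, e_pert)

def combined_loss(embed, supervised_loss, x_lab, y_lab, x_all, r_all, alpha):
    """L = l(D_l) + alpha * R_CDVAT(D_l, D_ul)  (Eqns. 1-2).
    x_all holds labelled and unlabelled inputs; r_all their perturbations."""
    r_cdvat_loss = local_cosine_smoothness(embed, x_all, r_all).mean()
    return supervised_loss(embed(x_lab), y_lab) + alpha * r_cdvat_loss
```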
2.2. Approximating $r_{\mathrm{CDVAT}}$

Given the adversarial perturbation $r_{\mathrm{CDVAT}}$ (note: despite the name, there is no relationship to generative adversarial networks (GANs)), the optimisation of the combined loss $\mathcal{L}$ is straightforward, because the gradients of LCS w.r.t. $\theta$ are well defined (Eqn. (19) can be used). In this section a method for approximately finding $r_{\mathrm{CDVAT}}$ is described. For simplicity, let:

$$\mathrm{cd}[r, x, \theta] = \mathrm{cd}\big[e(x, \hat\theta),\, e(x + r, \hat\theta)\big]\Big|_{\hat\theta=\theta} \quad (6)$$

$\mathrm{cd}[r, x, \theta]$ has a minimum of zero at $r = 0$. Therefore, the gradient w.r.t. $r$ is also zero at $r = 0$, and the second-order Taylor approximation of $\mathrm{cd}[r, x, \theta]$ is given by:

$$\mathrm{cd}[r, x, \theta] \approx \tfrac{1}{2}\, r^{\mathsf{T}} H(x, \theta)\, r \quad (7)$$

where $H(x,\theta) = \nabla\nabla_r\, \mathrm{cd}[r,x,\theta]\big|_{r=0}$. For simplicity, let $H = H(x,\theta)$. Under this approximation, $r_{\mathrm{CDVAT}}$ emerges as the dominant eigenvector $u(x,\theta)$ of $H$, scaled to magnitude $\epsilon$ (see the constraint in Eqn. (3)). This shows that CD-VAT in effect penalises $\lambda(x,\theta)$, the largest eigenvalue of $H$:

$$\mathrm{cd}[r_{\mathrm{CDVAT}}, x, \theta] \approx \tfrac{1}{2}\,\epsilon^2\, \lambda(x,\theta) \quad (8)$$

The dominant eigenvector, $u$, of $H$ can be found using the standard power iteration method [20] combined with finite differences. Let $v_0$ be a randomly sampled vector that is not orthogonal to $u$. Then the iterative calculation

$$v_{i+1} \leftarrow H v_i \quad (9)$$

causes $v_i$ to converge to $u$. The Hessian-vector product, $Hv_i$, can be approximated based on finite differences, using the fact that the gradient at $r = 0$ vanishes:

$$\nabla_r\, \mathrm{cd}[r,x,\theta]\big|_{r=\zeta v_i} \approx \nabla_r\, \mathrm{cd}[r,x,\theta]\big|_{r=0} + \zeta\, H v_i \quad (10)$$

$$Hv_i \approx \frac{1}{\zeta}\, \nabla_r\, \mathrm{cd}[r,x,\theta]\big|_{r=\zeta v_i} \quad (11)$$

Therefore, to obtain $r_{\mathrm{CDVAT}}$ we can use the iterative procedure:

$$r_{\mathrm{CDVAT}} \approx \epsilon \cdot v_K \quad (12)$$

$$v_{i+1} = \frac{g_{i+1}}{\|g_{i+1}\|} \quad (13)$$

$$g_{i+1} = \nabla_r\, \mathrm{cd}[r,x,\theta]\big|_{r=\zeta v_i} \quad (14)$$

where $v_K$ is the approximation of $u(x,\theta)$ after $K$ power iterations and $v_0$ is sampled uniformly on the unit-sphere. The value of $\zeta$ should be as small as possible to get the best estimate of $Hv_i$, but large enough not to cause numerical issues. Here, $\zeta$ is set to 0.005 in all our experiments (for our experiments this is a norm of 10e-6 per feature vector). The gradient $g_{i+1} = \nabla_r\, \mathrm{cd}[r,x,\theta]\big|_{r=\zeta v_i}$ is derived below:

$$g_{i+1} = \frac{\partial\, \mathrm{cd}[r,x,\theta]}{\partial r}\bigg|_{r=\zeta v_i} \quad (15)$$

$$= \frac{\partial\, e(x+r,\hat\theta)}{\partial r}\; \frac{\partial\, \mathrm{cd}[r,x,\theta]}{\partial\, e(x+r,\hat\theta)}\Bigg|_{r=\zeta v_i} \quad (16)$$

Let $e(x,\hat\theta) = e$ and $e(x+r,\hat\theta) = e_r$. Then:

$$\frac{\partial\, \mathrm{cd}[r,x,\theta]}{\partial e_r} = -\frac{1}{2\|e\|} \cdot \frac{\partial}{\partial e_r}\left(\frac{e^{\mathsf{T}} e_r}{\|e_r\|}\right) \quad (17)$$

$$= -\frac{1}{2\|e\|} \left( \frac{1}{\|e_r\|} \cdot \frac{\partial}{\partial e_r}\big(e^{\mathsf{T}} e_r\big) + \frac{\partial}{\partial e_r}\left(\frac{1}{\|e_r\|}\right) \cdot \big(e^{\mathsf{T}} e_r\big) \right) \quad (18)$$

$$= -\frac{1}{2\|e\|} \left( \frac{e}{\|e_r\|} - \frac{e_r}{\|e_r\|^3} \cdot \big(e^{\mathsf{T}} e_r\big) \right) \quad (19)$$

The pre-multiplication with $\frac{\partial e(x+r,\hat\theta)}{\partial r}$ in Eqn. (16) is equivalent to the standard back-propagation algorithm.

To summarise, to obtain $r_{\mathrm{CDVAT}}$ the required calculations are:
• a forward pass to get $e(x,\hat\theta) = e$ for Eqn. (19)
• then for each power iteration:
  – a forward pass to get $e(x + \zeta v_i, \hat\theta) = e_r$ for Eqn. (19)
  – a backward pass to get $\nabla_r\, \mathrm{cd}[r,x,\theta]\big|_{r=\zeta v_i}$ for Eqn. (14)

Our experiments on multiple data sets suggest that $K = 1$ is sufficient, in that $r_{\mathrm{CDVAT}}$ does not change significantly over further iterations, as measured by the dot-product of consecutive $v_i$. This single power iteration, however, increases $\mathrm{cd}[r,x,\theta]$ substantially in comparison to just using $v_0$: the robustness of the embedding to the simple normalised Gaussian noise $v_0$ is far larger than to the adversarial noise $r_{\mathrm{CDVAT}}$.
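The summary above maps directly onto automatic differentiation. The following is a minimal PyTorch-style sketch of the $K$-step power iteration of Eqns. (12)-(14), where each `torch.autograd.grad` call is the backward pass listed above. The per-example normalisation helper and all names are illustrative assumptions; the paper's own implementation was in HTK [35].

```python
import torch
import torch.nn.functional as F

def _unit_norm(d: torch.Tensor) -> torch.Tensor:
    """L2-normalise each example in the batch onto the unit sphere."""
    return d / (d.flatten(1).norm(dim=1).view(-1, *([1] * (d.dim() - 1))) + 1e-12)

def cdvat_perturbation(embed, x, eps=13.0, zeta=0.005, n_power=1):
    """Approximate r_CDVAT via finite-difference power iteration (Eqns. 12-14).

    embed: the embedding generator e(., theta); x: a batch of input windows.
    eps and zeta follow the values quoted in the paper; treat them as
    hyper-parameters in practice.
    """
    e_clean = embed(x).detach()             # e(x, theta-hat): treated as constant
    v = _unit_norm(torch.randn_like(x))     # v_0: random direction on unit sphere
    for _ in range(n_power):                # K power iterations (K = 1 suffices)
        v.requires_grad_(True)
        e_pert = embed(x + zeta * v)        # forward pass at r = zeta * v_i
        cd = (0.5 - F.cosine_similarity(e_clean, e_pert, dim=-1) / 2).sum()
        (g,) = torch.autograd.grad(cd, v)   # g_{i+1} of Eqn. (14), one backward pass
        v = _unit_norm(g).detach()          # v_{i+1} = g / ||g||  (Eqn. 13)
    return eps * v                          # r_CDVAT ~ eps * v_K  (Eqn. 12)
```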
2.3. Related work

Adversarial training was originally proposed by [21], where it was discovered that, for image classification, deep neural networks (DNNs) are very vulnerable to input perturbations applied in the direction to which the model's label assignment is the most sensitive, even when the perturbation is so small that human eyes cannot discern it. Such perturbed data points are also known as adversarial examples and can be used as additional data points for training [22]. For speaker verification, [23] has investigated robustness to adversarial examples.

Commonly used SSL techniques in speech processing include self-training. The classifier is first trained on the labelled data and then used to assign labels to the unlabelled data. The unlabelled data (possibly after filtering) with assigned labels is combined with the labelled data. The classifier is then trained on the enlarged set of labelled data. The first successful applications of self-training were in speech recognition [24, 25, 26, 27] and word-sense disambiguation [28]. For SSL of speaker embeddings, self-training cannot be used if the unlabelled data does not come from the same set of speakers as the labelled data.

Self-supervised learning for speaker embeddings was used in [29], where speaker embeddings were trained via reconstructing the frames of a target speech segment, given the inferred embedding of another speech segment of the same utterance. In comparison to CD-VAT, their method needs the training data to be segmented into utterances that each belong to one speaker. A purely unsupervised approach for speaker embeddings that is also based on a reconstruction loss is proposed in [30]. Though these unsupervised approaches demonstrate impressive results for generating unsupervised speaker embeddings, they still require supervised data to train the probabilistic linear discriminant analysis (PLDA) back-end used for speaker verification.

CD-VAT notably improves the intra-speaker compactness (ISC). In [31] the ISC is directly optimised by adding a supervised loss to the triplet loss that is otherwise used.

Related to VAT are consistency-based SSL methods, such as the mean teacher method [32], which was applied to audio command classification by [33]. Another popular method to reduce labelling effort is to more effectively exploit the existing labelled corpus through data augmentation [3, 34].
3. Experimental Setup
Experiments were designed to evaluate the effect of CD-VAT on general speaker verification performance, while also more directly testing the effect of CD-VAT on intra-speaker compactness and inter-speaker separability.
3.1. Data

Two data sets, VoxCeleb1 (dev+test) and VoxCeleb2 (dev+test), are used to train the models and evaluate speaker verification performance. The VoxCeleb data sets consist of utterances that were obtained from YouTube videos and automatically labelled using a visual speaker recognition pipeline. The VoxCeleb2 (dev+test) data set, together with the dev portion of the VoxCeleb1 data set, is used for training. For evaluation, the test portion of VoxCeleb1 is used. The combined train set consists of more than 2750 hours of data from 7323 speakers. The evaluation set consists of 4874 utterances from 40 speakers, for which the official speaker verification list of 37720 utterance pairs is used. More information about the data is contained in Table 1.
                train     test
Speakers        7323      40
Utterances      1277k     4874

Table 1: The train data is the combination of VoxCeleb2 (dev+test) and the development portion of VoxCeleb1. The test data is the test portion of VoxCeleb1, for which 37720 utterance pairs form the verification list.
3.2. Input features and model topology

The system input features are 30-d mel-frequency cepstral coefficients (MFCCs). The MFCCs are extracted (using HTK [35]) using 25 ms windows with 10 ms frame increments from 30 filterbank channels. No vocal tract length normalisation (VTLN) was applied and c0 is used instead of energy. These inputs were normalised at the utterance level for mean and globally for variance. No data augmentation was used for these experiments.

In our model, utterance-level speaker embeddings are created by averaging multiple L2-normalised window-level embeddings. The input window to the window-level embedding generator is around 2 seconds (213 frames, [-106,+106]) long. The shift between windows is just under 100 frames (see details below). The embedding generator uses a time-delay neural network (TDNN) [36, 37] with a total input context of [-7,+7], which is shifted from -99 to +99 with shifts of 6 frames (resulting in the overall input window of [-106,+106]). The 34 output vectors of the TDNN are combined using the self-attentive layer proposed in [38]. This is followed by a linear projection down to the embedding size, which is then the window-level embedding. The TDNN structure resembles the one used in the x-vector models [3] (i.e. TDNN-layers with the following input contexts: [-2,+2], followed by {-2,0,+2}, followed by {-3,0,+3}, followed by {0}). The first three TDNN-layers have a size of 512, the fourth a size of 256, and the embedding size is 32.

An utterance of length T fits N = ⌈(T − 213)/100⌉ full windows (at shifts of 100 frames) plus another window if padding were used (e.g. replication padding to
213 + N·100 frames). To avoid padding, shifts of (T − 213)/N are used (i.e. slightly under 100 frames); for example, for T = 500 this gives N = ⌈287/100⌉ = 3 and a shift of 287/3 ≈ 95.7 frames. The resulting window indices are rounded to the nearest integer. For utterances shorter than 213 frames, the window is aligned to the centre of the utterance and replication padding is used.

3.3. Training

CD-VAT was implemented in HTK [35], with which all models were trained in conjunction with PyHTK [39]. For training, the window-level embedding is classified into the different speakers. The training objective for supervised training is angular softmax [15] with m = 1. The embedding generator is optimised using stochastic gradient descent (SGD) with momentum, and weight decay was used for regularisation. The learning rate scheduler is NewBob. The batch size used for the supervised loss, l(·,·), was 200, except for the model trained on the entire data set, for which a batch size of 400 was used. The batch size used for the CD-VAT loss, R_CDVAT, was 800, i.e. four times higher. The interpolation coefficient α was set to 0.4 and the norm of the adversarial perturbation ε was set to 13. The model was trained directly on the combined loss, L(·), i.e. not pre-trained on the purely supervised loss.

For the experiments, two partitions into labelled and unlabelled data were created. For one, 220k utterances are labelled; for the other, 440k. The remaining 1057k and 837k utterances, respectively, form the unlabelled data set. The 220k and 440k utterances are chosen from the top of the utterance list sorted by official utterance name. Of the labelled data set, 20k and 40k utterances, respectively, form the validation set.

3.4. Evaluation metrics

The main evaluation criterion is the speaker verification equal error rate (EER). First, utterance embeddings are formed for each utterance by averaging the window embeddings, where the windows are based on the shifts described at the end of Sec. 3.2. The scores necessary for the receiver operating characteristic (ROC) curve used in the EER calculation are the cosine-distances between the embeddings of utterance pairs, in comparison to the otherwise commonly used PLDA back-end ([16, 18] are other publications that experiment with direct cosine-scoring for speaker verification). The ROC curve is built using scikit-learn [40].

Speaker-discriminative acoustic embeddings should have two qualities: intra-speaker compactness (ISC), i.e. how close to each other the embeddings of a single speaker are, and inter-speaker separability (ISS), i.e. how far apart the embeddings of different speakers are. To give further insight into the embeddings generated from our models, we attempt to give a measure of these two qualities. First, all utterance embeddings belonging to the same speaker are collected and the centroid calculated (one per speaker). For the ISS, the average pairwise distance between the centroids of all speakers is found. For the ISC, all utterance embeddings belonging to the same speaker are collected, the average cosine-distance to the centroid of that speaker found, and the average of those per-speaker scores calculated. The sketch below illustrates these computations.
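As a concrete illustration of the metrics just described, the following NumPy/scikit-learn sketch computes the EER from cosine scores via the ROC curve [40], together with the ISC and ISS statistics. The data structures (a dictionary from speaker to utterance embeddings, label/score arrays for the trial pairs) are our own illustrative assumptions.

```python
import numpy as np
from sklearn.metrics import roc_curve

def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    """cd[a, b] of Eqn. (4), in [0, 1]; lower means more similar."""
    return 0.5 - np.dot(a, b) / (2 * np.linalg.norm(a) * np.linalg.norm(b))

def equal_error_rate(pair_labels, pair_scores):
    """EER from the ROC curve; pair_labels are 1 for same-speaker trials,
    and pair_scores are similarities, e.g. 1 - cosine_distance."""
    fpr, tpr, _ = roc_curve(pair_labels, pair_scores)
    fnr = 1.0 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))   # operating point where FPR ~ FNR
    return (fpr[idx] + fnr[idx]) / 2

def isc_and_iss(spk2embs):
    """spk2embs: dict mapping speaker -> list of utterance embeddings."""
    centroids = {s: np.mean(np.stack(e), axis=0) for s, e in spk2embs.items()}
    # ISC: per-speaker mean distance of utterance embeddings to the speaker
    # centroid, averaged over speakers (lower is better).
    isc = np.mean([np.mean([cosine_distance(e, centroids[s]) for e in embs])
                   for s, embs in spk2embs.items()])
    # ISS: mean pairwise distance between speaker centroids (higher is better).
    cs = list(centroids.values())
    iss = np.mean([cosine_distance(cs[i], cs[j])
                   for i in range(len(cs)) for j in range(i + 1, len(cs))])
    return isc, iss
```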
4. Experimental Results
Tables 2 and 3 show the results of applying CD-VAT on the VoxCeleb data set. The supervised baseline model trained on 200k utterances of labelled data achieves an EER of 8.32%. The use of CD-VAT reduces the EER by 11.1% relative, down to an EER of 7.40%. This represents an EER recovery of 32.5%, i.e. we achieve 32.5% of the reduction in EER that we would get from pure supervised training if we had the speaker labels for the unlabelled part of the training data (such a model has an EER of 5.52%). This error rate recovery is similar to those seen in other areas of machine learning. For instance, [41] presents a word error rate recovery of 37% for SSL (minimum entropy training) of a DNN-HMM speech recogniser without additional language modelling data. At the same time, SSL for speaker embeddings presents additional challenges, as for larger numbers of speakers class overlap can exist. The information content of unlabelled examples decreases as classes overlap, as shown by [42, 43].

The supervised baseline model trained on 400k utterances of labelled data achieves an EER of 6.85%. The use of CD-VAT reduces the EER by 5.7% relative, down to an EER of 6.46%. This represents an EER recovery of 29.3%.

System     Utts D_l   Spks D_l   Utts D_ul   EER
Sup 1      200k       1249       -           8.32%
CDVAT 1    200k       1249       1057k       7.40%
Sup 2      400k       2504       -           6.85%
CDVAT 2    400k       2504       837k        6.46%
Sup 3      1277k      7323       -           5.52%

Table 2:
Evaluation of CD-VAT on the VoxCeleb data set. The evaluation criterion EER is explained in Section 3.4. For EER a lower value is better. D_l is the labelled data set and D_ul is the unlabelled data set.

System     Utts D_l   Utts D_ul   ISC    ISS
Sup 1      220k       -           0.13   0.38
CDVAT 1    220k       1057k       0.09   0.36
Sup 2      440k       -           0.14   0.40
CDVAT 2    440k       837k        0.09   0.38
Sup 3      1277k      -           0.14   0.46

Table 3:
Evaluation of CD-VAT on the VoxCeleb data set. The evaluation criteria ISC and ISS are explained in Section 3.4. For ISC a lower value is better. For ISS a higher value is better.
Furthermore, the ISC, which is very closely related to the CD-VAT smoothing loss, is reduced for the 200k and the 400k models by 31% and 36% respectively. However, at the same time the ISS is also slightly reduced. This shows one disadvantage of CD-VAT, which is that it also brings the embeddings of all utterances closer together. To put these values into perspective, the threshold of the cosine-scoring used to obtain the EER is between 0.42 and 0.48 for the systems trained.
5. Conclusions
We have presented cosine-distance virtual adversarial training (CD-VAT), a method that allows for semi-supervised training of speaker-discriminative acoustic embeddings without the requirement that the set of speakers is the same for the labelled and the unlabelled data. It is shown that CD-VAT can improve speaker verification performance on the VoxCeleb data set over a purely supervised baseline. The proposed method recovers 32.5% of the EER improvement that is obtained when speaker labels are available for the unlabelled data. CD-VAT also significantly improves the intra-speaker compactness (ISC) of the speaker embeddings. At the same time, however, the computational cost of CD-VAT is twice as high (per data point) as supervised training, and two new hyper-parameters that need to be tuned are introduced.
6. Acknowledgements
Florian Kreyssig is funded by an EPSRC Doctoral Training Partnership Award.

7. References

[1] T. Miyato, S. Maeda, S. Ishii, & M. Koyama, "Virtual adversarial training: A regularization method for supervised and semi-supervised learning," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.
[2] D. Snyder, D. Garcia-Romero, D. Povey, & S. Khudanpur, "Deep neural network embeddings for text-independent speaker verification," Proc. Interspeech, Stockholm, 2017.
[3] D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, & S. Khudanpur, "X-vectors: Robust DNN embeddings for speaker recognition," Proc. ICASSP, Calgary, 2018.
[4] M. Diez, L. Burget, S. Wang, J. Rohdin, & J. Černocký, "Bayesian HMM based x-vector clustering for speaker diarization," Proc. Interspeech, Graz, 2019.
[5] Q. Li, F.L. Kreyssig, C. Zhang, & P.C. Woodland, "Discriminative neural clustering for speaker diarisation," arXiv preprint arXiv:1910.09703, 2019.
[6] J. Villalba, N. Chen, D. Snyder, D. Garcia-Romero, A. McCree, G. Sell, J. Borgstrom, L.P. García-Perera, F. Richardson, R. Dehak, P.A. Torres-Carrasquillo, & N. Dehak, "State-of-the-art speaker recognition with neural network embeddings in NIST SRE18 and speakers in the wild evaluations," Computer Speech & Language, vol. 60, pp. 101026, 2020.
[7] N. Dehak, P.J. Kenny, R. Dehak, P. Dumouchel, & P. Ouellet, "Front-end factor analysis for speaker verification," IEEE Trans. on Audio, Speech, and Language Processing, vol. 19, pp. 788–798, 2011.
[8] E. Variani, X. Lei, E. McDermott, I.L. Moreno, & J. Gonzalez-Dominguez, "Deep neural networks for small footprint text-dependent speaker verification," Proc. ICASSP, Florence, 2014.
[9] G. Wahba, Spline Models for Observational Data, SIAM, 1990.
[10] M. Sajjadi, M. Javanmardi, & T. Tasdizen, "Regularization with stochastic transformations and perturbations for deep semi-supervised learning," Proc. NIPS, Barcelona, 2016.
[11] S. Laine & T. Aila, "Temporal ensembling for semi-supervised learning," Proc. ICLR, Toulon, 2017.
[12] Y. Luo, J. Zhu, M. Li, Y. Ren, & B. Zhang, "Smooth neighbors on teacher graphs for semi-supervised learning," Proc. CVPR, Salt Lake City, 2018.
[13] A. Oliver, A. Odena, C. Raffel, E.D. Cubuk, & I.J. Goodfellow, "Realistic evaluation of deep semi-supervised learning algorithms," Proc. NeurIPS, Montreal, 2018.
[14] X. Zhai, A. Oliver, A. Kolesnikov, & L. Beyer, "S4L: Self-supervised semi-supervised learning," Proc. ICCV, Seoul, 2019.
[15] W. Liu, Y. Wen, Z. Yu, M. Li, B. Raj, & L. Song, "SphereFace: Deep hypersphere embedding for face recognition," Proc. CVPR, Honolulu, 2017.
[16] Y. Li, F. Gao, Z. Ou, & J. Sun, "Angular softmax loss for end-to-end speaker verification," Proc. ISCSLP, Taipei City, 2018.
[17] Y. Liu, L. He, & J. Liu, "Large margin softmax loss for speaker verification," Proc. Interspeech, Graz, 2019.
[18] C. Luu, P. Bell, & S. Renals, "DropClass and DropAdapt: Dropping classes for deep speaker representation learning," Proc. Speaker Odyssey, Tokyo, 2020.
[19] J. Deng, J. Guo, N. Xue, & S. Zafeiriou, "ArcFace: Additive angular margin loss for deep face recognition," Proc. CVPR, Long Beach, 2019.
[20] E. Kreyszig, H. Kreyszig, & E.J. Norminton, Advanced Engineering Mathematics, chapter 20.8 "Power method for eigenvalues," pp. 885–887, Wiley, Hoboken, NJ, tenth edition, 2011.
[21] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, & R. Fergus, "Intriguing properties of neural networks," Proc. ICLR, Banff, 2014.
[22] I.J. Goodfellow, J. Shlens, & C. Szegedy, "Explaining and harnessing adversarial examples," Proc. ICLR, San Diego, 2015.
[23] X. Li, N. Li, J. Zhong, X. Wu, X. Liu, D. Su, D. Yu, & H. Meng, "Investigating robustness of adversarial samples detection for automatic speaker verification," Proc. Interspeech, Shanghai (virtual), 2020.
[24] G. Zavaliagkos & T. Colthurst, "Utilizing untranscribed training data to improve performance," Proc. DARPA Broadcast News Transcription and Understanding Workshop, Landsdowne, 1998.
[25] T. Kemp & A. Waibel, "Unsupervised training of a speech recognizer using TV broadcasts," Proc. ICSLP, Sydney, 1998.
[26] L. Lamel, J.L. Gauvain, & G. Adda, "Lightly supervised and unsupervised acoustic model training," Computer Speech & Language, vol. 16, no. 1, 2002.
[27] H.Y. Chan & P. Woodland, "Improving broadcast news transcription by lightly supervised discriminative training," Proc. ICASSP, Montreal, 2004.
[28] D. Yarowsky, "Unsupervised word sense disambiguation rivaling supervised methods," Proc. ACL, Stroudsburg, 1995.
[29] T. Stafylakis, J. Rohdin, O. Plchot, P. Mizera, & L. Burget, "Self-supervised speaker embeddings," Proc. Interspeech, Graz, 2019.
[30] Z. Peng, S. Feng, & T. Lee, "Mixture factorized auto-encoder for unsupervised hierarchical deep factorization of speech signal," Proc. ICASSP, Barcelona (virtual), 2020.
[31] N. Le & J.M. Odobez, "Robust and discriminative speaker embedding via intra-class distance variance regularization," Proc. Interspeech, Hyderabad, 2018.
[32] A. Tarvainen & H. Valpola, "Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results," Proc. NIPS, 2017.
[33] K. Lu, C.S. Foo, K.K. Teh, H.D. Tran, & V.R. Chandrasekhar, "Semi-supervised audio classification with consistency-based regularization," Proc. Interspeech, Graz, 2019.
[34] H. Yamamoto, K.A. Lee, K. Okabe, & T. Koshinaka, "Speaker augmentation and bandwidth extension for deep speaker embedding," Proc. Interspeech, Graz, 2019.
[35] S. Young, G. Evermann, M. Gales, T. Hain, D. Kershaw, X. Liu, G. Moore, J. Odell, D. Ollason, D. Povey, A. Ragni, V. Valtchev, P. Woodland, & C. Zhang, The HTK Book, Cambridge University Engineering Department, 2015.
[36] V. Peddinti, D. Povey, & S. Khudanpur, "A time delay neural network architecture for efficient modeling of long temporal contexts," Proc. Interspeech, Dresden, 2015.
[37] F.L. Kreyssig, C. Zhang, & P.C. Woodland, "Improved TDNNs using deep kernels and frequency dependent grid-RNNs," Proc. ICASSP, Calgary, 2018.
[38] G. Sun, C. Zhang, & P.C. Woodland, "Speaker diarisation using 2D self-attentive combination of embeddings," Proc. ICASSP, Brighton, 2019.
[39] C. Zhang, F.L. Kreyssig, Q. Li, & P.C. Woodland, "PyHTK: Python library and ASR pipelines for HTK," Proc. ICASSP, Brighton, 2019.
[40] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, & E. Duchesnay, "Scikit-learn: Machine learning in Python," Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.
[41] V. Manohar, D. Povey, & S. Khudanpur, "Semi-supervised maximum mutual information training of deep neural network acoustic models," Proc. Interspeech, Dresden, 2015.
[42] V. Castelli & T.M. Cover, "The relative value of labeled and unlabeled samples in pattern recognition with an unknown mixing parameter," IEEE Transactions on Information Theory, vol. 42, no. 6, 1996.
[43] T.J. O'Neill, "Normal discrimination with unclassified observations," Journal of the American Statistical Association, 1978.