Integrating a joint Bayesian generative model in a discriminative learning framework for speaker verification
Xugang Lu, Peng Shen, Yu Tsao, Hisashi Kawai

Abstract—The task of speaker verification (SV) is to decide whether an utterance is spoken by a target or an imposter speaker. In most SV studies, a log-likelihood ratio (LLR) score is estimated from a generative probability model on speaker features and compared with a threshold for decision making. However, the generative model usually focuses on feature distributions and lacks the ability to select discriminative features, so it is easily distracted by nuisance features. SV, as a hypothesis test, can also be formulated as a binary classification task to which neural network (NN) based discriminative learning can be applied. Through discriminative learning, nuisance features can be removed with the help of label supervision. However, discriminative learning concentrates on classification boundaries and is prone to overfitting the training data, yielding poor generalization on testing data. In this paper, we propose a hybrid learning framework, i.e., integrating a joint Bayesian (JB) generative model into a neural discriminative learning framework for SV. A Siamese NN is built with dense layers to approximate the mapping functions used in the SV pipeline with the JB model, and the LLR score estimated from the JB model is connected to the distance metric in pair-wise discriminative learning. After initializing the Siamese NN with the parameters learned from the JB model, we further train the model parameters with pair-wise samples as a binary discrimination task. Moreover, a direct evaluation metric for SV, i.e., the minimum empirical Bayes risk, is designed and integrated as an objective function in the discriminative learning. We carried out SV experiments on the speakers in the wild (SITW) and Voxceleb corpora. Experimental results showed that our proposed model improved performance by a large margin compared with state-of-the-art models for SV.
I. INTRODUCTION
Speaker verification (SV) is a technique to verify whether an acoustic speech signal is spoken by a target or an imposter speaker. It is widely used in many speech application systems where speaker information is required from authentication or security perspectives [1], [2], [3]. The basic problem in SV is to decide whether two utterances (usually denoted as testing and enrollment utterances) are generated by the same or different speakers, i.e., a hypothesis test defined as:

H_S: x_i, x_j are spoken by the same speaker
H_D: x_i, x_j are spoken by different speakers    (1)

where H_S and H_D are the two hypotheses corresponding to the same-speaker and different-speaker spaces, respectively. (x_i, x_j) is a tuple with
1. National Institute of Information and Communications Technology, Japan.
2. Research Center for Information Technology Innovation, Academia Sinica, Taiwan.

two compared utterances indexed by i and j. For making a decision, it is necessary to estimate the similarity of the two utterances, calculated either as a log-likelihood ratio (LLR) or as a distance measure, and compare it with a threshold. The conventional pipeline for constructing an SV system to perform the hypothesis test defined in Eq. (1) is composed of front-end speaker feature extraction and back-end speaker classifier modeling. Front-end feature extraction tries to extract robust and discriminative features to represent speakers, and the back-end classifier tries to model speakers with the extracted features, based on which the similarity or LLR scores can be estimated.

A. Front-end speaker feature extraction
In most state-of-the-art frameworks, the front-end speaker feature is based on the i-vector representation [4], [5]. In i-vector extraction, speech utterances with variable durations are converted to fixed-dimension vectors with the help of Gaussian mixture models (GMMs) on probability distributions of acoustic features. With the resurgence of deep learning techniques, several alternative speaker features have been proposed, e.g., the d-vector and X-vector [5], [6]. These features are extracted from a well-trained deep neural network with bottleneck layers or statistical pooling. In recent years, the X-vector has become the most widely used speaker embedding representation [5]. The advantage of the X-vector representation is that the model for X-vector extraction can be trained efficiently with a large quantity of speech samples from various speakers. Moreover, in order to capture robust speaker information, data augmentation with various noise types and signal-to-noise ratios (SNRs) can easily be applied in model training [5]. Since the original front-end feature (either the i-vector or X-vector) encodes various acoustic factors, e.g., the speaker factor, channel transmission factor, recording device factor, etc., a linear discriminant analysis (LDA) is usually applied before classifier modeling for dimension reduction to eliminate non-speaker-specific information.
B. Back-end classifier modeling
After speaker features are obtained, how to build a speaker classifier in back-end modeling is important. There are two types of modeling strategies: generative modeling and discriminative modeling. In generative modeling, features are regarded as observations from a generation process with certain probability distribution assumptions on the generating variables. Based on the generative model, the hypothesis test defined in Eq. (1) is treated as statistical inference on the variable probability distributions. For example, probabilistic linear discriminant analysis (PLDA) [4], [7] has been widely used in SV. PLDA models the within-speaker and between-speaker variabilities with linear subspaces on the speaker and noise spaces in generation. However, it is difficult to determine the dimensions of the subspaces, which have a large effect on the final performance. As an alternative, joint Bayesian (JB) modeling [8], [9] is regarded as a more efficient model than PLDA since it makes no subspace assumptions on the speaker and noise spaces. The hypothesis test defined in Eq. (1) can also be formulated as a binary classification task, for which a discriminative model can be learned with supervised learning algorithms. In discriminative modeling, rather than modeling the feature probability distributions, we focus only on the classification boundaries. For example, the support vector machine (SVM) has been proposed to maximize the between-class distance [10], neural network based discriminative models have been applied to directly maximize classification accuracy with labeled training data sets [11], and pairwise discriminative training on i-vectors has been proposed as a binary classification task for SV [12], [13]. In recent years, supervised end-to-end speaker models, which integrate front-end feature extraction and back-end speaker classifier modeling in a unified optimization framework, have also been proposed [14], [15].
However, in SV tasks, many test speakers are usually not registered in the training data, so the current state-of-the-art pipeline in SV is still a speaker feature representation (e.g., the X-vector) combined with a generative speaker classifier model.
C. Hybrid of generative and discriminative model learning
In generative model learning, with either the PLDA or JB model, the observed feature variable is assumed to be an additive mixture of speaker and noise variables. There are model assumptions on the speaker and noise variables, e.g., Gaussian assumptions on the probability distributions. If these assumptions are not satisfied, the performance cannot be guaranteed. Moreover, in most generative model learning, the objectives focus on feature probability distributions. The disadvantage is that it is difficult to learn the model parameters for high-dimensional features, and the model lacks the discriminative feature selection ability, so it is easily distracted by nuisance features. In discriminative model learning, the objectives pay more attention to discrimination boundaries. Although nuisance features can be removed with label supervision, the model easily overfits the training data set and may not generalize well to a testing data set. For a better understanding, we illustrate the different focuses of generative and discriminative model learning in Fig. 1. In this figure, samples from two classes are shown (circles and triangles for classes 1 and 2, respectively). As shown, the generative model focuses on class distributions while the discriminative model pays attention to the classification boundary (solid curve in Fig. 1).
Fig. 1. Different focuses of generative and discriminative model learning: generative model learning focuses on class distributions (dashed circles indicating class distribution shapes), and discriminative model learning emphasizes the class discriminative boundary (solid curve).
As we have discussed, the definition of SV in Eq. (1) relates to both generative and discriminative modeling strategies, either as a generative hypothesis test or as a binary discriminative classification task. However, in most studies, there is no explicit modeling strategy connecting these two aspects in one model framework for SV. In this study, we propose a hybrid model framework which explicitly integrates the JB generative model into a binary discriminative learning framework on X-vectors for SV. The generative model endows the model with good generalization through its probability distribution assumptions, while the discriminative learning helps the generative model enhance its discriminative power in feature selection and hypothesis space modeling. Our contributions are summarized as follows:

(1) We propose a hybrid model for SV which integrates the JB generative model into a discriminative learning framework. Although hybrids of generative and discriminative modeling have been studied in machine learning to fully utilize unlabeled and labeled samples, and have shown improved performance in classification tasks [16], [17], it is difficult to integrate generative and discriminative models in SV tasks. The main reason is that in most studies the generative and discriminative models adopt different modeling structures. In this study, we connect the generative and discriminative models in the classification task via the calculation of the LLR for the hypothesis test in the JB model, and factorize the matrix transforms used in the JB model into affine transforms which can be approximated with dense layers in a discriminative neural network model.

(2) We design a discriminative learning objective function based on a direct evaluation metric in the hybrid model learning. In JB generative model learning, an objective function with the negative log-likelihood is usually minimized, while in neural network based discriminative model learning, an objective function
indicating the classification error rate is minimized. However, the objective for the hypothesis test task in SV is different from both of them. In an SV task, the evaluation metric is based on a weighting of two types of errors [18], [19]: the type I error (false alarm rate) and the type II error (miss rate). In this study, we formulate this type of objective function in the discriminative learning framework.

(3) We analyze the effects of all components in the model parameterization with detailed SV experiments, and reveal their connections to conventional distance metric learning.

The remainder of the paper is organized as follows. Section II introduces the basic theoretical considerations and the proposed hybrid model framework. Section III describes the implementation details and experiments; in particular, we investigate in depth the effect of model parameters and their connections to other related model frameworks. Section IV summarizes the study with discussions.

II. PROPOSED HYBRID MODEL FRAMEWORK
The generative and discriminative models can be connected through Bayesian theory. Before introducing their connections, we give a brief review of the basic ideas of generative and discriminative modeling.
A. Generative and discriminative models in classification tasks
A generative model tries to capture the data generation process with a full joint model of the relation between the feature input and label variables, p(x, y), while a discriminative model only tries to model the direct relation between the input feature and output label, p(y|x), where x and y are the feature and label variables, respectively. Although the generative model is not directly used for classification, a classification model can be deduced from the generative model by inference based on Bayes' theorem:

p(y|x) = p(x, y) / p(x) = p(x|y) p(y) / p(x)    (2)

In this equation, p(x|y) is the likelihood of generating feature x given a label y. Although the generative model has better generalization ability due to its prior data distribution assumptions, it is difficult for the model to learn the data structure in a high-dimensional space with complex distributions. Usually, dimension reduction is applied before a generative model, for example, principal component analysis (PCA) or LDA, as widely used in SV systems. In most SV studies, the dimension reduction and generative modeling are applied independently, which is sub-optimal. Moreover, generative model based classification is not accurate since the probability distribution assumptions are usually not exact enough.

The discriminative model can learn complex classification boundaries with nonlinear mapping functions and pays much attention to discriminative boundaries (as illustrated in Fig. 1); however, it is prone to overfitting the training data and tends to make highly confident predictions.
Besides these theoretical differences, there is a practical difference in model training: generative model parameters are usually estimated with expectation-maximization (EM)-like algorithms under simple assumptions on the data distributions (e.g., Gaussian distributions), while the parameters of a discriminative model (a neural network) are usually estimated with gradient descent algorithms. In the following subsections, we show how to integrate them in a hybrid model with careful formulations.
1) Generative model based classification:
Given a training data set {(x_i, y_i)}_{i=1,...,N}, y_i ∈ {1, 2, ..., K}, with x_i and y_i as the data feature and label, and K the number of classes, classification based on a generative model follows from Eq. (2) as:

p(y = k|x) = p(x|y = k) p(y = k) / \sum_{j=1}^{K} p(x|y = j) p(y = j).    (3)

Eq. (3) can be further cast as:

p(y = k|x) = 1 / (1 + \sum_{j=1, j≠k}^{K} exp(−r_{k,j}(x, Θ_G))),    (4)

where

r_{k,j}(x, Θ_G) = log [ p(x|y = k) p(y = k) / (p(x|y = j) p(y = j)) ]    (5)

is an LLR score based on the class generative probability model with Θ_G as the model parameter set.
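As a sanity check on Eqs. (3)-(5), the posterior recovered through the pairwise LLR form equals direct Bayes normalization. A minimal numpy sketch (the class scores below are illustrative values, not taken from the paper):

```python
import numpy as np

# Hypothetical per-class joint log-scores log p(x|y=j) + log p(y=j), K = 3.
log_joint = np.array([-2.0, -3.5, -1.2])

def posterior_via_llr(log_joint, k):
    """Posterior p(y=k|x) computed through the pairwise LLR form of Eq. (4)."""
    r = log_joint[k] - np.delete(log_joint, k)   # r_{k,j} for all j != k
    return 1.0 / (1.0 + np.sum(np.exp(-r)))

# Direct Bayes normalization (Eq. (3)) gives the same posterior.
direct = np.exp(log_joint) / np.exp(log_joint).sum()
assert np.allclose([posterior_via_llr(log_joint, k) for k in range(3)], direct)
```

The equivalence holds because dividing numerator and denominator of Eq. (3) by the k-th joint score leaves exactly the sum of exp(−r_{k,j}) terms of Eq. (4).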
2) Discriminative model based classification:
Rather than using a generative model, a neural network can be applied to directly approximate the posterior probability function p(y|x). Discriminative learning approximates the mapping between the input feature and label with a softmax function defined as:

p(y = k|x) = exp(o_k) / \sum_{j=1}^{K} exp(o_j),    (6)

where the network mapping function o_j = φ_j(x, Θ_D) is defined as the output corresponding to the j-th class, and Θ_D denotes the neural network parameters. Eq. (6) can be further cast as:

p(y = k|x) = 1 / (1 + \sum_{j=1, j≠k}^{K} exp(−h_{k,j}(x, Θ_D))),    (7)

where

h_{k,j}(x, Θ_D) = φ_k(x, Θ_D) − φ_j(x, Θ_D).    (8)

Comparing Eqs. (7), (8) with Eqs. (4), (5), we can see that h_{k,j}(x, Θ_D) can be connected to r_{k,j}(x, Θ_G) through the LLR calculation. This connection inspired us to incorporate the LLR of pair-wise samples from a generative model into neural network discriminative training for SV.

B. Integrating the log-likelihood ratio for SV in generative and discriminative models
The task of SV is a hypothesis test as defined in Eq. (1). It can be solved based on LLR score estimation for the two hypotheses. The advantage of using the LLR is that it is not necessary to estimate each marginal probability distribution (they cancel out in the ratio calculation), which is a difficult task.

Based on the generative model, given hypothesis H_S or H_D, the joint probability of generating (x_i, x_j) is p(x_i, x_j|H_S) or p(x_i, x_j|H_D). In making a decision, the LLR is defined as:

r_{i,j} ≜ r(x_i, x_j) = log [ p(x_i, x_j|H_S) / p(x_i, x_j|H_D) ]    (9)

With a given decision threshold, we can decide whether the two observation vectors are from H_S or H_D (as defined in Eq. (1)). For convenience of formulation, we define a trial as a tuple z_{i,j} = (x_i, x_j), and the two hypothesis spaces are constructed from the two data sets as:

S = {z_{i,j} = (x_i, x_j) ∈ H_S}
D = {z_{i,j} = (x_i, x_j) ∈ H_D}    (10)

The final calculation depends on the assumed generative model with its density functions. We first derive the LLR score calculation based on the JB generative model.
1) Joint Bayesian generative model:
Given two speaker feature vectors, their distance is associated with the probability distributions that model their generation process. An observed X-vector variable x is assumed to be generated by a speaker identity variable and a random noise variable (possibly induced by different recording background noises, sessions, or transmission channels, etc.) as:

x = u + n,    (11)

where u is the speaker identity variable and n represents intra-speaker variation caused by noise. For simplicity, the observation x is mean-subtracted, and the speaker identity and intra-speaker variation variables are assumed to follow Gaussian distributions:

u ∼ N(0, C_u), n ∼ N(0, C_n),    (12)

where C_u and C_n are the speaker and noise covariance matrices, respectively. In verification, for a trial with x_i and x_j generated from Eq. (11), under the assumption in Eq. (12), the two terms p(x_i, x_j|H_S) and p(x_i, x_j|H_D) defined in Eq. (9) are zero-mean Gaussians with covariances:

cov_S = [ C_u + C_n, C_u; C_u, C_u + C_n ]
cov_D = [ C_u + C_n, 0; 0, C_u + C_n ]    (13)

Based on this formulation, the LLR defined in Eq. (9) can be calculated as:

r(x_i, x_j) = x_i^T A x_i + x_j^T A x_j − 2 x_i^T G x_j,    (14)

where

A = (C_u + C_n)^{−1} − [(C_u + C_n) − C_u (C_u + C_n)^{−1} C_u]^{−1}
G = −(2 C_u + C_n)^{−1} C_u C_n^{−1}    (15)

As seen from Eq. (15), the generative model parameters Θ_G used in estimating the LLR are related only to the covariance parameters C_u and C_n [8], [9]. Given a training data set, the parameters can be estimated with an EM (or EM-like) learning algorithm based on:

Θ_G^* = arg min_{Θ_G} − \sum_i log p(X_i|Θ_G)    (16)
where Θ_G = {C_u, C_n}, and X_i is the collection of samples for speaker i.

Fig. 2. Pipeline for joint Bayesian based generative modeling on X-vectors for speaker verification. LDA: linear discriminant analysis; JB: joint Bayesian model; LLR: log-likelihood ratio.
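The closed form of Eqs. (14)-(15) can be checked against the block-covariance definition in Eq. (13). A minimal numpy sketch with hypothetical covariances (the additive log-determinant constant and the 1/2 scale are dropped, as in Eq. (14); they do not affect thresholding):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
# Hypothetical SPD speaker / noise covariances C_u, C_n (illustrative only).
M = rng.standard_normal((d, d)); C_u = M @ M.T + np.eye(d)
M = rng.standard_normal((d, d)); C_n = M @ M.T + np.eye(d)

S = C_u + C_n
inv = np.linalg.inv
# Eq. (15): score matrices depend only on the covariance parameters.
A = inv(S) - inv(S - C_u @ inv(S) @ C_u)
G = -inv(2 * C_u + C_n) @ C_u @ inv(C_n)

def llr(xi, xj):
    """Eq. (14): JB verification score for a trial (x_i, x_j)."""
    return xi @ A @ xi + xj @ A @ xj - 2 * xi @ G @ xj

# Sanity check against the block covariances of Eq. (13): the score equals
# the difference of the Mahalanobis terms under H_D and H_S.
xi, xj = rng.standard_normal(d), rng.standard_normal(d)
z = np.concatenate([xi, xj])
cov_S = np.block([[S, C_u], [C_u, S]])
cov_D = np.block([[S, np.zeros((d, d))], [np.zeros((d, d)), S]])
assert np.isclose(llr(xi, xj), z @ inv(cov_D) @ z - z @ inv(cov_S) @ z)
```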
2) Pair-wise discriminative model:
The hypothesis test defined in Eq. (1) can be regarded as a binary classification task, which can be solved with neural discriminative learning as formulated in Eqs. (6) and (7). In neural discriminative learning, the parameters are neural weights (affine transform matrices with linear or nonlinear activations); we can connect the parameters of the generative model with the neural weights and optimize them with an objective function. As a binary classification task, given a trial with two observed X-vector variables z_{i,j} = (x_i, x_j), the task is to estimate and compare p(H_S|z_{i,j}) and p(H_D|z_{i,j}). For binary discriminative learning, the label is defined as:

y_{i,j} = { 1, z_{i,j} ∈ H_S;  0, z_{i,j} ∈ H_D }    (17)

For a binary discriminative neural network, with reference to Eqs. (7) and (8), the posterior probability is estimated as:

p(y_{i,j}|z_{i,j}) = { 1 / (1 + exp(−h_{H_S,H_D}(z_{i,j}, Θ_D))), z_{i,j} ∈ H_S;
                       1 − 1 / (1 + exp(−h_{H_S,H_D}(z_{i,j}, Θ_D))), z_{i,j} ∈ H_D }    (18)

As revealed in Eqs. (4), (5), and (9), we replace h_{H_S,H_D}(z_{i,j}, Θ_D) with the LLR score, and define a mapping as a logistic function with scale parameters [20], [21]:

f(r_{i,j}) ≜ 1 / (1 + exp(−(α r_{i,j} + β)))    (19)

where r_{i,j} = r(z_{i,j}) = r(x_i, x_j) as defined in Eq. (9), and α and β are the gain and bias factors of the regression model. In Eq. (19), the LLR score estimated from the JB generative model is integrated into a discriminative training framework. The probability estimation in Eq. (18) is then cast as:

ŷ_{i,j} ≜ p(y_{i,j}|z_{i,j}) = { f(r_{i,j}), z_{i,j} ∈ H_S;  1 − f(r_{i,j}), z_{i,j} ∈ H_D }    (20)

Training can be based on optimizing the binary cross entropy defined as:

L = − \sum_{z_{i,j} ∈ {H_S ∪ H_D}} ( y_{i,j} log f(r_{i,j}) + (1 − y_{i,j}) log(1 − f(r_{i,j})) )    (21)

In the following subsection, we investigate the neural network architecture for the hybrid model framework.
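The calibration of Eq. (19) and the loss of Eq. (21) can be sketched directly; the scores, labels, and α, β values below are toy illustrations:

```python
import numpy as np

def f(r, alpha=1.0, beta=0.0):
    """Eq. (19): scaled logistic mapping of the LLR score; alpha and beta
    are the trainable gain/bias (illustrative values here)."""
    return 1.0 / (1.0 + np.exp(-(alpha * r + beta)))

def bce_loss(r, y):
    """Eq. (21): binary cross entropy over trials; y = 1 for same-speaker
    (H_S) trials and y = 0 for different-speaker (H_D) trials."""
    p = f(r)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Toy LLR scores: same-speaker trials should receive large values.
r = np.array([4.0, 3.0, -5.0, -2.0])
y = np.array([1.0, 1.0, 0.0, 0.0])
assert bce_loss(r, y) < bce_loss(r, 1 - y)  # correct labels give lower loss
```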
C. Hybrid model framework with neural network architecture
The conventional state-of-the-art framework for SV based on the X-vector and JB model is illustrated in Fig. 2. In this figure, the
Fig. 3. The proposed Siamese network for estimating the LLR score of the JB model for pair-wise discriminative model learning. FC1: fully connected layer for LDA; FC2: fully connected layer for JB; H_D and H_S are the class hypotheses for different and the same speakers, respectively.

LDA is applied on the X-vector for discriminative dimension reduction. After the LDA, vector length normalization is used; then the JB generative model is applied, by which the LLR is estimated. In a pair-wise discriminative learning framework, the LLR can be used for a binary classification task with a Siamese network, which is shown in Fig. 3. In this Siamese network, the LDA and JB model are implemented with dense layers of the neural network architecture (indicated as the "FC1" and "FC2" blocks).

We first explain the LDA, which will be approximated by an affine transform in the neural network modeling. For input X-vector samples and their corresponding labels {(x_1, y_1), (x_2, y_2), ..., (x_M, y_M)}, x_i ∈ R^l, the LDA transform is:

h_i = W^T x_i,    (22)

where W ∈ R^{l×d}, l and d are the dimensions of the input X-vector and the transformed feature vector, and M is the number of samples. W is estimated from the following definition:

W^* = arg max_W tr( (W^T S_w W)^{−1} (W^T S_b W) ),    (23)

where tr(·) denotes the matrix trace operator, and S_w and S_b are the intra-class and inter-class covariance matrices defined as:

S_w = \sum_{j=1}^{C} \sum_{i=1}^{M_j} (x_{j,i} − μ_j)(x_{j,i} − μ_j)^T
S_b = (1/M) \sum_{j=1}^{C} M_j (μ_j − μ̄)(μ_j − μ̄)^T,    (24)

where C is the number of speakers, M_j is the number of samples of the j-th class, and μ_j (for the j-th class) and μ̄ are the class-wise and global means defined as:

μ_j = (1/M_j) \sum_{i=1}^{M_j} x_{j,i},  μ̄ = (1/M) \sum_{i=1}^{M} x_i.    (25)

From Eq. (22), we can see that the LDA can be implemented as a linear dense layer.

We further look into the estimation of the LLR score defined in Eq. (14). In Eq.
(14), A and G are negative semi-definite symmetric matrices [8], [9], so they can be decomposed as:

A = −P_A P_A^T
G = −P_G P_G^T    (26)
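The LDA transform of Eqs. (22)-(25), which the "FC1" dense layer approximates, can be sketched as follows. This is a minimal fit on synthetic data; a small regularizer on S_w is added for numerical stability (an assumption of this sketch, not stated in the paper):

```python
import numpy as np

def lda_transform(X, y, d):
    """Fit the LDA projection W of Eqs. (23)-(25) and return h_i = W^T x_i.
    X: (M, l) feature matrix, y: integer speaker labels, d: output dim."""
    mu_bar = X.mean(axis=0)
    l = X.shape[1]
    S_w = np.zeros((l, l)); S_b = np.zeros((l, l))
    for c in np.unique(y):
        Xc = X[y == c]
        mu_c = Xc.mean(axis=0)
        S_w += (Xc - mu_c).T @ (Xc - mu_c)                      # Eq. (24)
        S_b += len(Xc) * np.outer(mu_c - mu_bar, mu_c - mu_bar)
    # Maximizing Eq. (23) leads to the top-d eigenvectors of S_w^{-1} S_b.
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(S_w + 1e-6 * np.eye(l), S_b))
    W = np.real(eigvecs[:, np.argsort(-np.real(eigvals))[:d]])
    return X @ W, W

rng = np.random.default_rng(1)
# Synthetic data: 3 "speakers", 20 samples each, 8-dim features.
X = rng.standard_normal((60, 8)) + np.repeat(rng.standard_normal((3, 8)) * 3, 20, axis=0)
y = np.repeat([0, 1, 2], 20)
H, W = lda_transform(X, y, d=2)
assert H.shape == (60, 2)
```

Since the projection is a single matrix multiply, it maps directly onto a linear dense layer, as noted after Eq. (22).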
Fig. 4. The proposed Siamese network with integration of the JB model structure (in Fig. 3) for speaker verification (see the text for a detailed explanation).
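The decomposition in Eq. (26) can be verified numerically: factoring the negative semi-definite A and G and scoring through the resulting affine transforms reproduces the matrix-form LLR of Eq. (14). A minimal sketch with hypothetical covariances:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
# Hypothetical SPD speaker / noise covariances, as in the JB model.
M = rng.standard_normal((d, d)); C_u = M @ M.T + np.eye(d)
M = rng.standard_normal((d, d)); C_n = M @ M.T + np.eye(d)
S = C_u + C_n
inv = np.linalg.inv
A = inv(S) - inv(S - C_u @ inv(S) @ C_u)          # Eq. (15)
G = -inv(2 * C_u + C_n) @ C_u @ inv(C_n)

def nsd_factor(M):
    """Eq. (26): factor a negative semi-definite symmetric M as M = -P P^T
    via its eigendecomposition (tiny positive eigenvalues are clipped)."""
    w, V = np.linalg.eigh(M)
    return V @ np.diag(np.sqrt(np.clip(-w, 0, None)))

P_A, P_G = nsd_factor(A), nsd_factor(G)

h1, h2 = rng.standard_normal(d), rng.standard_normal(d)
h1, h2 = h1 / np.linalg.norm(h1), h2 / np.linalg.norm(h2)  # length norm
a1, g1 = P_A.T @ h1, P_G.T @ h1                            # affine transforms
a2, g2 = P_A.T @ h2, P_G.T @ h2
r_factored = 2 * g1 @ g2 - a1 @ a1 - a2 @ a2               # factored score
r_matrix = h1 @ A @ h1 + h2 @ A @ h2 - 2 * h1 @ G @ h2     # Eq. (14)
assert np.isclose(r_factored, r_matrix)
```

Each branch of the factored score is a single matrix multiply of the length-normalized vector, which is exactly what the two dense-layer branches of the "JB net" compute and how they can be initialized from the generative model.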
The LLR score is then cast as:

r_{i,j} = 2 g_i^T g_j − a_i^T a_i − a_j^T a_j    (27)

with the affine linear transforms:

a_i = P_A^T h̃_i,  g_i = P_G^T h̃_i,    (28)

where h̃_i = h_i / ||h_i|| is the length-normalized vector from the LDA transform defined in Eq. (22). These transforms can be applied in a neural network as linear dense layers. Based on these formulations, the Siamese network in Fig. 3 is further implemented as shown in Fig. 4. In this figure, there are two sub-nets, the "LDA net" and the "JB net". The "LDA net" is a dense layer with the transform of Eq. (22). In the "JB net", the JB model structure is realized as a two-branch dense layer network according to Eq. (28).

D. Learning objective function based on minimum empirical Bayes risk (EBR)
The cross entropy defined in Eq. (21) can be applied for discriminative training to measure the classification error. However, the hypothesis test defined in Eq. (1) differs from a classification goal, and the final evaluation metric for SV usually adopts different criteria. It is better to optimize the model parameters directly on the evaluation metrics. In a hypothesis test task, there are two types of errors [18], [19], type I and type II, defined as:

Type I error (false alarm): z_{i,j} ∈ H_D, LLR ≥ θ
Type II error (miss): z_{i,j} ∈ H_S, LLR < θ,    (29)

where θ is a decision threshold. These two types of errors are further illustrated in Fig. 5 for an SV task: the objective for SV is to minimize the target miss P_miss (or false reject) and false alarm P_fa (or false accept) rates in the two hypothesis spaces H_S and H_D.

Fig. 5. The LLR distributions in H_S and H_D for the same- and different-speaker spaces, and the two types of errors in the hypothesis test for SV.

In real applications, it is better to generalize the classification errors to a weighting of these two types of errors. Taking prior knowledge into account in a measure of empirical Bayes risk (EBR), the evaluation metric for SV adopts a detection cost function (DCF) to measure the hardness of the decisions. It is defined as a weighted loss:

C_det ≜ P_tar C_miss P_miss + (1 − P_tar) C_fa P_fa,    (30)

where C_miss and C_fa are user-assigned costs for miss and false alarm detections, P_tar is the prior of target trials, and P_miss and P_fa are the miss and false alarm probabilities defined as:

P_fa = (1/N_non) \sum_{z_{i,j} ∈ H_D} u(r_{i,j} ≥ θ)
P_miss = (1/N_tar) \sum_{z_{i,j} ∈ H_S} u(r_{i,j} < θ)    (31)

In Eq. (31), N_non and N_tar are the numbers of nontarget and target trials, r_{i,j} is the LLR estimated from Eq. (27), θ is a decision threshold, and u(·) is an indicator function counting the trials with scores below or above the decision threshold. In order to make the objective function differentiable so that it can be used in gradient based neural network learning, the indicator function u(·) in Eq. (31) is replaced with:

P_fa = (1/N_non) \sum_{z_{i,j} ∈ {H_S ∪ H_D}} (1 − y_{i,j}) f(r_{i,j})
P_miss = (1/N_tar) \sum_{z_{i,j} ∈ {H_S ∪ H_D}} y_{i,j} (1 − f(r_{i,j}))    (32)

where f(r_{i,j}) is the sigmoid logistic function defined in Eq. (19). With reference to the cross-entropy loss in Eq. (21), the loss of Eq. (30) with Eq. (32) can be regarded as a generalized cross-entropy loss with user-defined weighting cost parameters.

III. EXPERIMENTS AND RESULTS
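The evaluation metrics used in the experiments below, i.e., the hard-count DCF of Eqs. (29)-(31) and the differentiable relaxation of Eq. (32), can be sketched as follows (the scores, labels, and α, β values are illustrative):

```python
import numpy as np

def detection_cost(r, y, theta, p_tar=0.01, c_miss=1.0, c_fa=1.0):
    """Eqs. (29)-(31): hard-count false-alarm / miss rates and the DCF."""
    p_fa = np.mean(r[y == 0] >= theta)    # type I error over nontarget trials
    p_miss = np.mean(r[y == 1] < theta)   # type II error over target trials
    return p_tar * c_miss * p_miss + (1 - p_tar) * c_fa * p_fa

def soft_dcf_loss(r, y, alpha=1.0, beta=0.0, p_tar=0.01, c_miss=1.0, c_fa=1.0):
    """Eq. (32): differentiable surrogate, replacing the indicator u(.)
    with the logistic f(r) of Eq. (19) so gradients can flow."""
    f = 1.0 / (1.0 + np.exp(-(alpha * r + beta)))
    p_fa = np.sum((1 - y) * f) / np.sum(1 - y)       # soft type I rate
    p_miss = np.sum(y * (1 - f)) / np.sum(y)         # soft type II rate
    return p_tar * c_miss * p_miss + (1 - p_tar) * c_fa * p_fa

r = np.array([3.0, 1.5, 0.2, -0.5, -2.0])   # toy LLR scores
y = np.array([1, 1, 1, 0, 0])               # 1: target (H_S), 0: nontarget (H_D)
# minDCF sweeps the decision threshold over the observed scores.
min_dcf = min(detection_cost(r, y, t) for t in r)
```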
We carried out experiments on SV tasks where the test data sets are from the speakers in the wild (SITW) [22] and Voxceleb [23] corpora. The speaker features and models were trained on the Voxceleb data corpus (sets 1 and 2) [23]. We adopt a state-of-the-art pipeline for constructing the SV system, as shown in Fig. 2. In this figure, the "LDA", "Length Norm", and "JB" blocks are designed independently rather than optimized jointly. The input speaker feature in our pipeline is the X-vector. The X-vectors are extracted with a well-trained neural network model designed for a speaker classification task [5]. For back-end models, both the well-known PLDA and the JB generative models are implemented in our comparisons.
A. Speaker embedding feature based on X-vector
A speaker embedding model is trained for X-vector extraction. The neural architecture of the embedding model is composed of deep time-delay neural network (TDNN) layers and statistical pooling layers, implemented as in Kaldi [5]. In training the model, the cross-entropy criterion for speaker classification is used as the learning objective function. The training data includes two sets from the Voxceleb corpus: the training set of Voxceleb1, with the speakers that overlap with the SITW test set removed, and the training set of Voxceleb2. In total, about 7,185 speakers with 1,236,567 utterances are used for training. Moreover, data augmentation is applied by adding noise, music, and babble at several SNRs, and reverberation with simulated room impulse responses is also applied to increase data diversity. Input features for training the speaker embedding model are MFCCs with 30 Mel band bins, extracted with a 25 ms frame length and a 10 ms frame shift. Energy based voice activity detection (VAD) is applied to remove silent background regions in speaker feature extraction. More details of the features, model architecture, and training procedure are given in [5]. The final extracted X-vectors have 512 dimensions.
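The statistical pooling step mentioned above can be sketched as follows: it collapses a variable-length sequence of frame-level activations into one fixed-size utterance vector by concatenating the mean and standard deviation over time (the dimensions below are illustrative, not the paper's):

```python
import numpy as np

def stats_pooling(frames):
    """Statistical pooling as used in X-vector extractors: map a (T, d)
    frame-level activation sequence to a fixed 2d-dim utterance vector."""
    return np.concatenate([frames.mean(axis=0), frames.std(axis=0)])

rng = np.random.default_rng(0)
utt_short = rng.standard_normal((120, 64))   # 120 frames, 64-dim activations
utt_long = rng.standard_normal((900, 64))    # a much longer utterance
# Both utterances map to the same fixed dimensionality (2 * 64 = 128).
assert stats_pooling(utt_short).shape == stats_pooling(utt_long).shape == (128,)
```

This is what allows utterances of arbitrary duration to be compared through fixed-dimension embeddings downstream.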
B. Back-end models
Although the X-vector extracted from the speaker embedding model is supposed to encode speaker-discriminative information, it also encodes other acoustic factors. In the conventional pipeline illustrated in Fig. 2, an LDA is applied before the generative speaker model. In this study, the 512-dimension X-vectors are transformed to 200-dimension vectors by the LDA. Correspondingly, in the discriminative neural network model shown in Fig. 4, a dense layer with 200 neurons is also applied. Moreover, in the discriminative model, two dense layers corresponding to P_A and P_G of the JB model are trained with "positive" and "negative" X-vector pairs (pairs from the same and different speakers). Since the discriminative neural network architecture fits the pipeline based on the generative model structure, the dense layer parameters can be initialized with the LDA and JB model parameters in training (according to Eqs. (22) and (28)). For comparison, random initialization with "he normal", as widely used in deep neural network learning, is also applied in the experiments [24]. In model training, the Adam algorithm with an initial learning rate of . [30] was used. In order to include enough "negative" and "positive" samples, the mini-batch size was set to 4096. The training X-vectors were split into training and validation sets with a ratio of . The model parameters were selected based on the best performance on the validation set.

C. Results
We first carried out SV experiments on the data sets of SITW. Two test sets are used, i.e., the development and evaluation
TABLE I
PERFORMANCE ON THE DEVELOPMENT SET OF SITW.

Methods             EER (%)   minDCF (0.01)   minDCF (0.001)
LDA+PLDA            3.003     0.3315          0.5198
LDA+JB              3.043     0.3288          0.5019
Hybrid (rand init)  4.159     0.3792          0.5883
Hybrid (JB init)
TABLE II
PERFORMANCE ON THE EVALUATION SET OF SITW.

Methods             EER (%)   minDCF (0.01)   minDCF (0.001)
LDA+PLDA            3.554     0.3526          0.5657
LDA+JB              3.496     0.3422          0.5645
Hybrid (rand init)  4.505     0.3920          0.6003
Hybrid (JB init)

sets, and each is used as an independent test set. The evaluation metrics are the equal error rate (EER) and the minimum detection cost function (minDCF) (with target priors 0.01 and 0.001) [22]. The EER is the operating point where the type I and type II errors (as defined in Eq. (29)) are equal, and the minDCF is defined in Eq. (30). The performance results are shown in Tables I and II. In these two tables, "LDA+PLDA" and "LDA+JB" denote the PLDA and JB generative model based SV systems following the pipeline in Fig. 2 (with the "JB" block replaced by "PLDA" for the PLDA based system). "Hybrid" denotes the discriminative neural network based SV system which adopts the JB model structure in its neural architecture following the pipeline in Fig. 4. For the "Hybrid" SV system, the two model initialization methods explained in Section III-B are tested in model training. From these two tables, we can see that the performance of the JB generative model is comparable to or slightly better than that of the PLDA model. In the hybrid model, if the model parameters ("LDA net" and "JB net") are randomly initialized, the performance is worse than the original generative model based results. However, when the neural network parameters are initialized with the LDA and JB model parameters, the performance is significantly improved. These results indicate that discriminative training can further enhance the discriminative power of the generative model when the model parameters are initialized from it; random initialization does not improve performance even when the generative model structure is taken into consideration. Following the same procedure, the experimental results on the Voxceleb1 test set are shown in Table III.
From this table, we could observe the same tendency asin tables I and II.
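The EER and minDCF reported in these tables can be computed from arrays of target and impostor trial scores. Since Eqs. (29)–(30) are outside this excerpt, the sketch below follows the standard definitions with a simple threshold sweep; it is illustrative, not the paper's actual evaluation tooling:

```python
import numpy as np

def compute_eer(target_scores, impostor_scores):
    """Equal error rate: the operating point where the miss rate
    (type II error) equals the false-alarm rate (type I error)."""
    thresholds = np.unique(np.concatenate([target_scores, impostor_scores]))
    best_gap, eer = np.inf, 1.0
    for t in thresholds:
        miss = np.mean(target_scores < t)    # targets rejected
        fa = np.mean(impostor_scores >= t)   # impostors accepted
        if abs(miss - fa) < best_gap:
            best_gap, eer = abs(miss - fa), (miss + fa) / 2
    return eer

def min_dcf(target_scores, impostor_scores, p_target, c_miss=1.0, c_fa=1.0):
    """Minimum normalized detection cost over all thresholds
    (standard NIST-style convention)."""
    thresholds = np.unique(np.concatenate([target_scores, impostor_scores]))
    norm = min(c_miss * p_target, c_fa * (1.0 - p_target))
    costs = [c_miss * p_target * np.mean(target_scores < t)
             + c_fa * (1.0 - p_target) * np.mean(impostor_scores >= t)
             for t in thresholds]
    return min(costs) / norm
```

For perfectly separated scores both metrics are zero; overlapping score distributions push both up.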
TABLE III
PERFORMANCE ON EVALUATION SET OF VOXCELEB TEST.

Methods             EER (%)   minDCF (0.01)   minDCF (0.001)
LDA+PLDA            3.128     0.3258          0.5003
LDA+JB              3.105     0.3226          0.4992
Hybrid (rand init)  3.340     0.3778          0.4977
Hybrid (JB init)
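The “LDA+PLDA” and “LDA+JB” systems in these tables share an LDA front end that reduces X-vector dimensionality before generative scoring. A minimal sketch of that step, using scikit-learn as a stand-in for the authors' actual tooling (dimensions are hypothetical):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def lda_reduce(xvectors, speaker_ids, out_dim):
    """Fit LDA on training X-vectors (one speaker id per row) and
    return the dimension-reduced speaker features."""
    # out_dim must satisfy out_dim <= min(n_speakers - 1, feature_dim)
    lda = LinearDiscriminantAnalysis(n_components=out_dim)
    return lda.fit_transform(xvectors, speaker_ids)
```

In the pipeline, the reduced features would then be scored by the PLDA or JB back end.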
Fig. 6. Visualization (t-SNE) of speaker cluster distributions of speaker features from the LDA net transform based on the X-vectors before (a) and after (b) joint discriminative training (only 20 speakers are shown).
D. Ablation study
In the proposed framework, there are two important modeling blocks, i.e., the “LDA net” and the “JB net”, as illustrated in Fig. 4. The “LDA net” extracts low-dimensional discriminative speaker representations from X-vectors, and the “JB net” is applied on the extracted feature vectors for speaker modeling. They were jointly learned in a unified framework. In this subsection, we investigate their effects on SV performance with ablation studies.
1) Effect of the “LDA net” in learning:
The X-vector is extracted from the TDNN based speaker embedding model, which is optimized for speaker classification. After the LDA process, the speaker feature has strong power for speaker discrimination. In the proposed hybrid model, the LDA model is further jointly optimized for the SV task. The t-SNE visualization of speaker feature distributions from the LDA transform before and after joint discriminative learning is shown in Fig. 6. In this figure, only 20 speakers are shown. From this figure, we can see that the speaker clusters are distinctly separated based on the speaker features (Fig. 6-a). After joint training, the separation of speaker clusters is further enhanced (Fig. 6-b). We further verify the discrimination power of the speaker representations on SV performance by randomly setting the classifier model (the two dense layers of the JB model) while setting the parameters of the “LDA net” under the following conditions (Fig. 4): (a) setting the dense layer of the “LDA net” with random values (He normal initialization), (b) setting the “LDA net” with the LDA parameters (independent LDA transform), (c) setting the “LDA net” with the jointly trained LDA parameters. The results are shown in Table IV. From these results, we can see that even with random initialization of the “LDA net” (setting (a)), the SV performance is fairly good. The LDA transform improved the performance (setting (b)), and after joint learning, the performance is further improved.
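A visualization of the kind shown in Fig. 6 can be produced with off-the-shelf t-SNE; a minimal sketch, assuming `features` holds the LDA-transformed X-vectors (this is illustrative, not the authors' plotting code):

```python
import numpy as np
from sklearn.manifold import TSNE

def tsne_embed(features, perplexity=10.0):
    """Project speaker features (e.g., LDA-transformed X-vectors) to 2-D
    for cluster visualization; rows from the same speaker should form
    tight clusters if the representation is discriminative."""
    tsne = TSNE(n_components=2, init="pca",
                perplexity=perplexity, random_state=0)
    return tsne.fit_transform(features)
```

The 2-D embedding can then be scatter-plotted with one color per speaker id, as in Fig. 6.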
2) Effect of A and G on SV performance: As shown in Eq. (14), the two terms have different effects on the speaker verification performance. In our discriminative training, which integrates the L_LLR of the JB model, the L_LLR in Eq. (14) is adapted. With different settings of A and G in Eq. (14), we obtain:

r(x_i, x_j) =
    -2 x_i^T G x_j,               for A = 0;
    x_i^T A x_i + x_j^T A x_j,    for G = 0;
    (x_i - x_j)^T G (x_i - x_j),  for A = G;
    (x_i - x_j)^T A (x_i - x_j),  for G = A.    (33)

Based on this formulation, we can check the different effects of A and G on the SV performance. The two matrices A and G are connected to the two dense layer branches of the hybrid model with weights P_A and P_G (refer to Fig. 4). In our model, the dense layers were first initialized with the parameters from the learned JB based generative model, and the model was then further trained with pair-wised “negative” and “positive” samples. Only in the testing stage do we use different parameter settings according to Eq. (33); the results are shown in Tables V and VI for the dev and evaluation sets of SITW, respectively. In these two tables, by comparing the conditions with A = 0 or G = 0, we can see that the cross term contributes most to the SV performance, i.e., the dense layer branch with neural weight P_G carries the most discriminative information in the SV task. Moreover, when keeping the cross term by setting either A = G or G = A, the performance is better than setting either of them to zero.

TABLE IV
PERFORMANCE ON DEVELOPMENT SET OF SITW: RANDOM SETTING OF THE CLASSIFIER MODEL (“JB NET”) AND THREE SETTING CONDITIONS FOR THE “LDA NET”. LDA_R: RANDOM INITIALIZATION, LDA_I: INDEPENDENT LDA TRANSFORM, LDA_J: JOINTLY TRAINED LDA TRANSFORM.

Methods   EER (%)   minDCF (0.01)   minDCF (0.001)
LDA_R     18.37     0.9714          0.9826
LDA_I     8.24      0.6546          0.8434
LDA_J

TABLE V
PERFORMANCE ON DEVELOPMENT SET OF SITW: DIFFERENT SETTINGS OF THE CLASSIFIER MODEL (“JB NET”), BEFORE JOINT TRAINING (SETTING THE LDA NET AND JB NET WITH THE INDEPENDENTLY LEARNED LDA AND JB MODEL PARAMETERS).

Methods             EER (%)   minDCF (0.01)   minDCF (0.001)
A (G=0)             47.71     1.000           1.000
G (A=0)             6.353     0.8261          0.9806
A, G (set G to A)
A, G (set A to G)   3.504     0.3978          0.6316
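The special cases in Eq. (33) can be checked numerically. The sketch below assumes the common JB score form r(x_i, x_j) = x_i^T A x_i + x_j^T A x_j - 2 x_i^T G x_j (the full Eq. (14) is outside this excerpt):

```python
import numpy as np

def jb_llr(xi, xj, A, G):
    """JB-style log-likelihood-ratio score:
    r(xi, xj) = xi^T A xi + xj^T A xj - 2 xi^T G xj."""
    return xi @ A @ xi + xj @ A @ xj - 2.0 * (xi @ G @ xj)
```

With A = G (and G symmetric), the score collapses to the quadratic form (x_i - x_j)^T G (x_i - x_j) in the feature difference, matching the third case of Eq. (33); with A = 0, only the cross term -2 x_i^T G x_j remains.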
3) Relation to distance metric learning:
Distance metric learning is widely used in discriminative learning with pair-wised training samples as input [25], [26], [27], [28]. The Mahalanobis distance metric between two vectors is defined as:

d_{i,j} ≜ d(x_i, x_j) = (x_i - x_j)^T M (x_i - x_j),    (34)

where M = P P^T is a positive definite matrix. Based on this distance metric, the binary classification task for SV can be formulated as:

p(y_{i,j} | z_{i,j}) = σ(λ(d - d_{i,j})),    (35)

where σ(x) = (1 + exp(-x))^{-1} is the sigmoid logistic function, d is a distance decision threshold, and λ is a scale parameter for probability calibration. From Eq. (35), we can see that when the Mahalanobis distance d(x_i, x_j) < d, the probability of x_i and x_j belonging to the same speaker is high, and vice versa. With pair-wised “positive” and “negative” samples, the parameters (M, d, and λ) can be learned on a given training data set as a binary discriminative learning task. Comparing Eqs. (34) and (33), we can see that if we set A = G or G = A, the L_LLR and the Mahalanobis distance have the same formulation (except that the matrix is negative rather than positive definite), i.e., d(x_i, x_j) ∝ -r(x_i, x_j). In this sense, the distance metric based discriminative learning framework can be regarded as a special case of the hybrid discriminative framework, and the L_LLR defined in Eq. (9) is cast to:

r(x_i, x_j) = log [ p(Δ_{i,j} | H_S) / p(Δ_{i,j} | H_D) ],    (36)

where Δ_{i,j} = x_i - x_j. From this definition, we can see that the distance metric based discriminative learning only considers the distribution of the pair-wised sample difference space [29]. In implementation, by merging the two dense layers of the classifier model (the “JB net” with parameters P_A and P_G), the proposed hybrid framework is reduced to a one-branch framework as shown in Fig. 7.

TABLE VI
PERFORMANCE ON DEVELOPMENT SET OF SITW: DIFFERENT SETTINGS OF THE CLASSIFIER MODEL (“JB NET”), AFTER JOINT TRAINING.

Methods             EER (%)   minDCF (0.01)   minDCF (0.001)
A (G=0)             50.29     0.9996          0.9996
G (A=0)             4.775     0.4206          0.6340
A, G (set G to A)

Fig. 7. Siamese net with Mahalanobis net on X-vector features for speaker verification. FC1: fully connected layer for LDA; FC2: fully connected layer for JB; H_D: hypothesis for different speakers; H_S: hypothesis for the same speaker.
In this figure, the “MD net” is the network dense layer for the Mahalanobis distance metric with an affine transform matrix P, and it can be initialized with the parameters of the JB based generative model (either P = P_A or P = P_G) or with random values (He normal initialization). We tested this one-branch model on the dev set of SITW with different settings of the “MD net” (the “LDA net” is initialized with the LDA transform based parameters); the results are shown in Table VII.

TABLE VII
PERFORMANCE ON DEVELOPMENT SET OF SITW OF THE SIAMESE NET WITH “MD NET” AS CLASSIFIER MODEL.

Methods           EER (%)   minDCF (0.01)   minDCF (0.001)
Random init P
Init P with P_A
Init P with P_G

From this table, we can see that when the LDA net and MD net of the one-branch model are initialized with the LDA and P_A parameters, the performance is the best. However, in all conditions, comparing the results in Tables I and VII, the hybrid model framework shows the best performance, which confirms that the model structure inspired by the JB based generative model is helpful in the SV task.
4) L_LLR distributions for intra- and inter-speaker spaces:
The SV task, defined as a hypothesis test, can be regarded as a binary classification task. Correspondingly, as defined in Eq. (9), the performance is measured based on the L_LLR distributions in two spaces, i.e., the intra-speaker space H_S and the inter-speaker space H_D. The separability can be visualized as the histogram distributions of pair-wise distances in the two spaces. We check the histograms of the L_LLR on the training and test sets based on the hybrid model (refer to the network pipeline in Fig. 4) with different parameter settings, and show them in Fig. 8. From this figure, we can see that when the hybrid network parameters are set with random values, there are large overlaps of the L_LLR distributions between the two hypothesis spaces. When the network parameters are set with the parameters of the JB based generative model, the separation of the L_LLR distributions is increased. With discriminative training, the separation is further enhanced. In particular, the L_LLR distribution of “negative” sample pairs becomes much more compact for both the training and testing data sets.

We have shown the SV performance with only the A or G matrix in subsection III-D2. We check the L_LLR distributions of “negative” and “positive” sample pairs in the inter- and intra-speaker spaces, and show the histograms for the test set of SITW in Fig. 9. From this figure, we can see that both matrices A and G contribute to the difference of the L_LLR distributions; in particular, G contributes the main difference between “negative” and “positive” sample pairs. After the model is learned, the difference of the L_LLR distributions between the intra-speaker space H_S and the inter-speaker space H_D is increased.

IV. DISCUSSION AND CONCLUSION
The current state-of-the-art pipeline for SV is composed of two building models, i.e., a front-end model for speaker feature extraction and a generative back-end model for speaker classification. In this study, the X-vector, as a speaker embedding feature, is extracted by the front-end model and encodes strong speaker discriminative information. Based on this speaker feature, a JB based generative back-end model is applied. The JB model tries to model the probability distributions of speaker features, and can predict the conditional probabilities for utterances even from unknown speakers. This is the advantage of using the generative model in the SV task, since the testing utterances are often from speakers not registered in the training set. However, as a generative model, the parameter estimation is easily distracted by nuisance features in a high dimensional space, i.e., generative modeling does not have the feature selection ability for the final SV task. Therefore, a discriminative dimension reduction (e.g., LDA) is applied as an independent processing block on the speaker features before applying the generative modeling.

We take a further look at the SV problem by regarding it as a hypothesis test, i.e., whether two compared utterances are from the same or different speakers. As an alternative, the SV task can also be regarded as a binary classification task. Correspondingly, a discriminative learning framework can be applied with “positive” and “negative” sample pairs (from the same speaker and from different speakers, respectively). The advantage of this discriminative learning framework is that the speaker features can be automatically transformed and modeled in a unified optimization framework. However, the learning easily overfits to the training data set and does not generalize well to unknown test speakers.
In this study, as our main contribution, we proposed to integrate the generative model into a discriminative learning framework as a hybrid model. The key point is that we integrated the L_LLR estimation from the JB based generative model into a neural discriminative learning framework. In particular, the linear matrices in the JB based generative model are factorized into the linear affine transforms in dense layers of the neural network model, and the network parameters are connected to the JB based generative model parameters, so they can be initialized from the learned JB based generative model. Moreover, as another contribution, in the discriminative learning framework, rather than simply learning the hybrid model with a conventional binary discrimination objective function, the direct evaluation metric for the hypothesis test, i.e., the EBR with false alarm and miss rates, can be easily applied as an objective function in parameter optimization. Our experiments confirmed that SV benefits from the hybrid model, which integrates the advantages of both generative and discriminative model learning.

In this study, the JB based generative model is based on simple Gaussian probability distribution assumptions on speaker features and noise. In real applications, the probability distributions are much more complex. Although it is difficult for a generative model to fit complex shapes of probability distributions in a high dimensional space, it is relatively easy for a discriminative learning framework to approximate the complex distribution shapes. In the future, we will extend the current hybrid model framework to learn more complex probability distributions in SV tasks.

Fig. 8.
L_LLR distributions in H_S and H_D spaces: the first row (a, b, and c) for the training set, the second row (d, e, and f) for the testing set; the left column (a and d) for the model set with random parameters, the middle column (b and e) for the model set with learned generative model parameters, and the right column (c and f) for the model set with learned generative model parameters and further discriminatively trained parameters.

Fig. 9. L_LLR distributions on the testing set of SITW for the hybrid model: the first row for the initial model with JB based generative model parameters with G = 0 (a) and with A = 0 (b); the second row for the jointly trained model with G = 0 (c) and with A = 0 (d).

REFERENCES

[1] J. Hansen, T. Hasan, “Speaker recognition by machines and humans: A tutorial review,”
IEEE Signal Processing Magazine, vol. 32, no. 6, pp. 74-99, 2015.
[2] A. Poddar, M. Sahidullah, G. Saha, “Speaker verification with short utterances: A review of challenges, trends and opportunities,” IET Biometrics, vol. 7, no. 2, pp. 91-101, 2018.
[3] H. Beigi, Fundamentals of Speaker Recognition, Springer-Verlag, Berlin, 2011, ISBN 978-0-387-77591-3.
[4] N. Dehak, P. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, “Front-end factor analysis for speaker verification,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 4, pp. 788-798, 2011.
[5] D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur, “X-vectors: Robust DNN embeddings for speaker recognition,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5329-5333, 2018.
[6] E. Variani, X. Lei, E. McDermott, I. L. Moreno, and J. Gonzalez-Dominguez, “Deep neural networks for small footprint text-dependent speaker verification,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4052-4056, 2014.
[7] S. Prince and J. Elder, “Probabilistic linear discriminant analysis for inferences about identity,” in IEEE International Conference on Computer Vision (ICCV), pp. 1-8, 2007.
[8] D. Chen, X. Cao, L. Wang, F. Wen, and J. Sun, “Bayesian face revisited: A joint formulation,” in European Conference on Computer Vision, pp. 566-579, 2012.
[9] D. Chen, X. Cao, D. Wipf, F. Wen, and J. Sun, “An efficient joint formulation for Bayesian face verification,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, pp. 32-46, 2016.
[10] V. Wan, W. Campbell, “Support vector machines for speaker verification and identification,” in Neural Networks for Signal Processing X: Proceedings of the IEEE Signal Processing Society Workshop, vol. 2, pp. 775-784, 2000.
[11] J. Villalba, N. Brummer, N. Dehak, “Tied variational autoencoder backends for i-vector speaker recognition,” in Proceedings of INTERSPEECH, pp. 1004-1008, 2017.
[12] L. Burget, O. Plchot, S. Cumani, O. Glembek, P. Matejka, and N. Brummer, “Discriminatively trained probabilistic linear discriminant analysis for speaker verification,” in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4832-4835, 2011.
[13] S. Cumani, N. Brummer, L. Burget, P. Laface, O. Plchot, and V. Vasilakakis, “Pairwise discriminative speaker verification in the i-vector space,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 21, no. 6, pp. 1217-1227, June 2013.
[14] G. Heigold, I. Moreno, S. Bengio, and N. Shazeer, “End-to-end text-dependent speaker verification,” in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5115-5119, 2016.
[15] L. Wan, Q. Wang, A. Papir, and I. Moreno, “Generalized end-to-end loss for speaker verification,” in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4879-4883, 2018.
[16] A. Lasserre, C. Bishop, T. Minka, “Principled hybrids of generative and discriminative models,” in Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), pp. 87-94, 2006.
[17] R. Raina, Y. Shen, A. Ng, A. McCallum, “Classification with hybrid generative/discriminative models,” in Proceedings of the International Conference on Neural Information Processing Systems, pp. 545-552, 2003.
[18] N. Brummer, E. Villiers, “The BOSARIS toolkit user guide: Theory, algorithms and code for binary classifier score processing,” Documentation of BOSARIS toolkit, 2011.
[19] E. Lehmann, J. Romano, Testing Statistical Hypotheses, Springer-Verlag New York, 2005.
[20] J. Platt, “Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods,” Advances in Large Margin Classifiers, pp. 61-74, 1999.
[21] H. Lin, C. Lin, R. Weng, “A note on Platt's probabilistic outputs for support vector machines,” Machine Learning, vol. 68, pp. 267-276, 2007.
[22] M. McLaren, L. Ferrer, D. Castan, and A. Lawson, “The speakers in the wild (SITW) speaker recognition database,” in Proceedings of INTERSPEECH, pp. 818-822, 2016.
[23] A. Nagrani, J. Chung, W. Xie, A. Zisserman, “VoxCeleb: Large-scale speaker verification in the wild,” Computer Speech and Language, vol. 60, 2020.
[24] K. He, X. Zhang, S. Ren, J. Sun, “Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification,” in Proceedings of IEEE International Conference on Computer Vision (ICCV), pp. 1026-1034, 2015.
[25] E. Xing, A. Ng, M. Jordan, and S. Russell, “Distance metric learning, with application to clustering with side-information,” in Advances in Neural Information Processing Systems, MIT Press, pp. 521-528, 2002.
[26] K. Weinberger, J. Blitzer, L. Saul, “Distance metric learning for large margin nearest neighbor classification,” Advances in Neural Information Processing Systems 18, pp. 1473-1480, 2006.
[27] K. Weinberger, L. Saul, “Distance metric learning for large margin classification,” Journal of Machine Learning Research, vol. 10, pp. 207-244, 2009.
[28] M. Guillaumin, J. Verbeek, C. Schmid, “Is that you? Metric learning approaches for face identification,” in Proceedings of the IEEE International Conference on Computer Vision, pp. 498-505, 2009.
[29] B. Moghaddam, T. Jebara, A. Pentland, “Bayesian face recognition,” Pattern Recognition, vol. 33, pp. 1771-1782, 2000.
[30] D. P. Kingma, J. Ba, “Adam: A method for stochastic optimization,” in Proceedings of the 3rd International Conference on Learning Representations (ICLR), 2015.