Integrating a joint Bayesian generative model in a discriminative learning framework for speaker verification
Xugang Lu, Peng Shen, Yu Tsao, Hisashi Kawai

Abstract—The task of speaker verification (SV) is to decide whether an utterance is spoken by a target or an imposter speaker. In most SV studies, a log-likelihood ratio (LLR) score is estimated from a generative probability model on speaker features and compared with a threshold for decision making. However, the generative model usually focuses on feature distributions and lacks the ability to select discriminative features, so it is easily distracted by nuisance features. SV, as a hypothesis test, can also be formulated as a binary classification task to which neural network (NN) based discriminative learning can be applied. Through discriminative learning, nuisance features can be removed with the help of label supervision. However, discriminative learning concentrates on classification boundaries and is prone to overfitting the training data, yielding poor generalization on testing data. In this paper, we propose a hybrid learning framework, i.e., integrating a joint Bayesian (JB) generative model into a neural discriminative learning framework for SV. A Siamese NN is built with dense layers to approximate the mapping functions used in the SV pipeline with the JB model, and the LLR score estimated from the JB model is connected to the distance metric in pair-wise discriminative learning. After initializing the Siamese NN with the parameters learned from the JB model, we further train the model parameters with pair-wise samples as a binary discrimination task. Moreover, a direct evaluation metric for SV, i.e., the minimum empirical Bayes risk, is designed and integrated as an objective function in the discriminative learning. We carried out SV experiments on the speakers in the wild (SITW) and Voxceleb corpora. Experimental results showed that our proposed model improved performance by a large margin compared with state-of-the-art models for SV.
I. INTRODUCTION
Speaker verification (SV) is a technique to verify whether an acoustic speech signal is spoken by a target or an imposter speaker. It is widely used in many speech application systems where speaker information is required from authentication or security perspectives [1], [2], [3]. The basic problem in SV is to decide whether two utterances (usually denoted as testing and enrollment utterances) are generated by the same or different speakers, i.e., a hypothesis test defined as:

H_S: x_i, x_j are spoken by the same speaker
H_D: x_i, x_j are spoken by different speakers    (1)

where H_S and H_D are the two hypotheses corresponding to the same-speaker and different-speaker spaces, respectively. (x_i, x_j) is a tuple with
1. National Institute of Information and Communications Technology, Japan.
2. Research Center for Information Technology Innovation, Academia Sinica, Taiwan.

two compared utterances indexed by i and j. For making a decision, it is necessary to estimate the similarity of the two utterances, calculated either as a log-likelihood ratio (LLR) or as a distance measure, and compare it with a threshold. The conventional pipeline for constructing an SV system to perform the hypothesis test defined in Eq. (1) is composed of front-end speaker feature extraction and back-end speaker classifier modeling. Front-end feature extraction tries to extract robust and discriminative features to represent speakers, and the back-end classifier tries to model speakers with the extracted features, based on which the similarity or LLR scores can be estimated.

A. Front-end speaker feature extraction
In most state-of-the-art frameworks, the front-end speaker feature is based on the i-vector representation [4], [5]. In i-vector extraction, speech utterances with variable durations are converted to fixed-dimension vectors with the help of Gaussian mixture models (GMMs) on probability distributions of acoustic features. With the resurgence of deep learning techniques, several alternative speaker features have been proposed, e.g., the d-vector and X-vector [5], [6]. These features are extracted from a well-trained deep neural network with bottleneck layers or statistical pooling. In recent years, the X-vector has become the most widely used speaker embedding representation [5]. The advantage of the X-vector representation is that the model for X-vector extraction can be trained efficiently with a large quantity of speech samples from various speakers. Moreover, in order to capture robust speaker information, data augmentation with various noise types and signal-to-noise ratios (SNRs) can easily be applied in model training [5]. Since the original front-end feature (either the i-vector or X-vector) encodes various acoustic factors, e.g., the speaker factor, channel transmission factor, recording device factor, etc., a linear discriminant analysis (LDA) is usually applied before classifier modeling for dimension reduction to eliminate non-speaker-specific information.
B. Back-end classifier modeling
After speaker features are obtained, how to build a speaker classifier in back-end modeling is important. There are two types of modeling strategies: generative modeling and discriminative modeling. In generative modeling, features are regarded as observations from a generation process with certain probability distribution assumptions on the generating variables. Based on the generative model, the hypothesis test defined in Eq. (1) is treated as statistical inference on the variable probability distributions. For example, probabilistic linear discriminant analysis (PLDA) [4], [7] has been widely used in SV. PLDA models the within-speaker and between-speaker variabilities with linear subspaces on the speaker and noise spaces in generation. However, it is difficult to determine the dimensions of the subspaces, which have a large effect on the final performance. As an alternative, joint Bayesian (JB) modeling [8], [9] is regarded as a more efficient model than PLDA since it makes no subspace assumptions on the speaker and noise spaces. The hypothesis test defined in Eq. (1) can also be formulated as a binary classification task, for which a discriminative model can be learned with supervised learning algorithms. In discriminative modeling, rather than modeling the feature probability distributions, we focus only on the classification boundaries. For example, the support vector machine (SVM) has been proposed to maximize the between-class distance [10], neural network based discriminative models have been applied to directly maximize classification accuracy with labeled training data sets [11], and pairwise discriminative training on i-vectors has been proposed as a binary classification task for SV [12], [13]. In recent years, supervised end-to-end speaker models, which integrate front-end feature extraction and back-end speaker classifier modeling in a unified optimization framework, have also been proposed [14], [15].
However, in SV tasks, many test speakers are usually not registered in the training data, so the current state-of-the-art pipeline in SV is still a speaker feature representation (e.g., the X-vector) combined with a generative speaker classifier model.
C. Hybrid of generative and discriminative model learning
In generative model learning, with either the PLDA or JB model, the observed feature variable is assumed to be an additive mixture of speaker and noise variables. There are model assumptions on the speaker and noise variables, e.g., Gaussian assumptions on the probability distributions. If these assumptions are not satisfied, the performance cannot be guaranteed. Moreover, in most generative model learning, the objectives focus on feature probability distributions. The disadvantage is that it is difficult to learn the model parameters for high-dimensional features, and the model lacks the discriminative feature selection ability, so it is easily distracted by nuisance features. In discriminative model learning, the objectives pay more attention to discrimination boundaries. Although nuisance features can be removed with label supervision, the model easily overfits the training data set and may not generalize well to a testing data set. For a better understanding, we illustrate the different focuses of generative and discriminative model learning in Fig. 1. In this figure, samples from two classes are shown (circles and triangles for classes 1 and 2, respectively). As shown, the generative model focuses on class distributions while the discriminative model pays attention to the classification boundary (solid curve in Fig. 1).
Fig. 1. Different focuses of generative and discriminative model learning: generative model learning focuses on class distributions (dashed circles indicating class distribution shapes), and discriminative model learning emphasizes the class discriminative boundary (solid curve).
As we have discussed, the definition of SV in Eq. (1) relates to both generative and discriminative modeling strategies, either as a generative hypothesis test or as a binary discriminative classification task. However, in most studies, there is no explicit modeling strategy connecting these two aspects in one model framework for SV. In this study, we propose a hybrid model framework which explicitly integrates the JB generative model into a binary discriminative learning framework on X-vectors for SV. The generative model endows the model with good generalization through its probability distribution assumptions, while the discriminative learning helps the generative model enhance its discriminative power in feature selection and hypothesis space modeling. Our contributions are summarized as follows:

(1) We propose a hybrid model for SV which integrates the JB generative model into a discriminative learning framework. Although hybrids of generative and discriminative modeling have been studied in machine learning to fully utilize unlabeled and labeled samples, and have shown improved performance in classification tasks [16], [17], it is difficult to integrate generative and discriminative models in SV tasks. The main reason is that in most studies the generative and discriminative models adopt different modeling structures. In this study, we connect the generative and discriminative models in the classification task via the calculation of the LLR for the hypothesis test in the JB model, and factorize the matrix transforms used in the JB model into affine transforms which can be approximated with dense layers in a discriminative neural network model.

(2) We design a discriminative learning objective function based on a direct evaluation metric in the hybrid model learning. In JB generative model learning, an objective function with the negative log-likelihood is usually minimized, while in neural network based discriminative model learning, an objective function
indicating the classification error rate is minimized. However, the objective for the hypothesis test task in SV is different from both of them. In an SV task, the evaluation metric is based on a weighting of two types of errors [18], [19]: the type I error (false alarm rate) and the type II error (miss rate). In this study, we formulate this type of objective function in the discriminative learning framework.

(3) We analyze the effects of all components in the model parameterization with detailed SV experiments, and reveal their connections to conventional distance metric learning.

The remainder of the paper is organized as follows. Section II introduces the basic theoretical considerations and the proposed hybrid model framework. Section III describes the implementation details and experiments; in particular, we investigate in depth the effect of model parameters and their connections to other related model frameworks. Section IV summarizes the study with discussions.

II. PROPOSED HYBRID MODEL FRAMEWORK
The generative and discriminative models can be connected through Bayesian theory. Before introducing their connections, we give a brief review of the basic ideas of generative and discriminative modeling.
A. Generative and discriminative models in classification tasks
A generative model tries to capture the data generation process with a full joint model of the relation between the feature input and label variables, p(x, y), while a discriminative model only tries to model the direct relation between the input feature and output label, p(y|x), where x and y are the feature and label variables, respectively. Although the generative model is not directly used for classification, a classification model can be deduced from the generative model by inference based on Bayes' theorem:

p(y|x) = p(x, y) / p(x) = p(x|y) p(y) / p(x)    (2)

In this equation, p(x|y) is the likelihood of generating feature x given a label y. Although the generative model has better generalization ability due to its prior data distribution assumptions, it is difficult for the model to learn the data structure in a high-dimensional space with complex distributions. Usually, dimension reduction is applied before a generative model, for example, principal component analysis (PCA) or LDA, as widely used in SV systems. In most SV studies, the dimension reduction and generative modeling are applied independently, which is sub-optimal. Moreover, generative model based classification is not accurate since the probability distribution assumptions are usually not exact enough.

The discriminative model can learn complex classification boundaries with nonlinear mapping functions and pays much attention to discriminative boundaries (as illustrated in Fig. 1); however, it is prone to overfitting the training data and tends to make highly confident predictions.
Besides these theoretical differences, there is a practical difference in model training: generative model parameters are usually estimated with expectation-maximization (EM)-like algorithms under simple assumptions on the data distributions (e.g., Gaussian distributions), while the parameters of a discriminative model (a neural network) are usually estimated with gradient descent algorithms. In the following subsections, we show how to integrate them in a hybrid model with careful formulations.
1) Generative model based classification:
Given a training data set {(x_i, y_i)}_{i=1,...,N}, y_i ∈ {1, 2, ..., K}, with x_i and y_i as the data feature and label, and K the number of classes, classification based on a generative model follows from Eq. (2) as:

p(y = k|x) = p(x|y = k) p(y = k) / \sum_{j=1}^{K} p(x|y = j) p(y = j).    (3)

Eq. (3) can be further cast as:

p(y = k|x) = 1 / (1 + \sum_{j=1, j≠k}^{K} exp(−r_{k,j}(x, Θ_G))),    (4)

where

r_{k,j}(x, Θ_G) = log [ p(x|y = k) p(y = k) / (p(x|y = j) p(y = j)) ]    (5)

is an LLR score based on the class generative probability model with Θ_G as the model parameter set.
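As a sanity check on Eqs. (3)-(5), the posterior recovered through the pairwise LLR form equals direct Bayes normalization. A minimal numpy sketch (the class scores below are illustrative values, not taken from the paper):

```python
import numpy as np

# Hypothetical per-class joint log-scores log p(x|y=j) + log p(y=j), K = 3.
log_joint = np.array([-2.0, -3.5, -1.2])

def posterior_via_llr(log_joint, k):
    """Posterior p(y=k|x) computed through the pairwise LLR form of Eq. (4)."""
    r = log_joint[k] - np.delete(log_joint, k)   # r_{k,j} for all j != k
    return 1.0 / (1.0 + np.sum(np.exp(-r)))

# Direct Bayes normalization (Eq. (3)) gives the same posterior.
direct = np.exp(log_joint) / np.exp(log_joint).sum()
assert np.allclose([posterior_via_llr(log_joint, k) for k in range(3)], direct)
```

The equivalence holds because dividing numerator and denominator of Eq. (3) by the k-th joint score leaves exactly the sum of exp(−r_{k,j}) terms of Eq. (4).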
2) Discriminative model based classification:
Rather than using a generative model, a neural network can be applied to directly approximate the posterior probability function p(y|x). Discriminative learning approximates the mapping between the input feature and label with a softmax function defined as:

p(y = k|x) = exp(o_k) / \sum_{j=1}^{K} exp(o_j),    (6)

where the network mapping function o_j = φ_j(x, Θ_D) is defined as the output corresponding to the j-th class, and Θ_D denotes the neural network parameters. Eq. (6) can be further cast as:

p(y = k|x) = 1 / (1 + \sum_{j=1, j≠k}^{K} exp(−h_{k,j}(x, Θ_D))),    (7)

where

h_{k,j}(x, Θ_D) = φ_k(x, Θ_D) − φ_j(x, Θ_D).    (8)

Comparing Eqs. (7), (8) with Eqs. (4), (5), we can see that h_{k,j}(x, Θ_D) can be connected to r_{k,j}(x, Θ_G) through the LLR calculation. This connection inspired us to incorporate the LLR of pair-wise samples from a generative model into neural network discriminative training for SV.

B. Integrating the log-likelihood ratio for SV in generative and discriminative models
The task of SV is a hypothesis test as defined in Eq. (1). It can be solved based on LLR score estimation for the two hypotheses. The advantage of using the LLR is that it is not necessary to estimate each marginal probability distribution (they cancel out in the ratio calculation), which is a difficult task.

Based on the generative model, given hypothesis H_S or H_D, the joint probability of generating (x_i, x_j) is p(x_i, x_j|H_S) or p(x_i, x_j|H_D). In making a decision, the LLR is defined as:

r_{i,j} ≜ r(x_i, x_j) = log [ p(x_i, x_j|H_S) / p(x_i, x_j|H_D) ]    (9)

With a given decision threshold, we can decide whether the two observation vectors are from H_S or H_D (as defined in Eq. (1)). For convenience of formulation, we define a trial as a tuple z_{i,j} = (x_i, x_j), and the two hypothesis spaces are constructed from the two data sets as:

S = {z_{i,j} = (x_i, x_j) ∈ H_S}
D = {z_{i,j} = (x_i, x_j) ∈ H_D}    (10)

The final calculation depends on the assumed generative model with its density functions. We first derive the LLR score calculation based on the JB generative model.
1) Joint Bayesian generative model:
Given two speaker feature vectors, their distance is associated with the probability distributions that model their generation process. An observed X-vector variable x is assumed to be generated by a speaker identity variable and a random noise variable (possibly induced by different recording background noises, sessions, or transmission channels, etc.) as:

x = u + n,    (11)

where u is the speaker identity variable and n represents intra-speaker variation caused by noise. For simplicity, the observation x is mean-subtracted, and the speaker identity and intra-speaker variation variables are assumed to follow Gaussian distributions:

u ∼ N(0, C_u), n ∼ N(0, C_n),    (12)

where C_u and C_n are the speaker and noise covariance matrices, respectively. In verification, for a trial with x_i and x_j generated from Eq. (11), under the assumption in Eq. (12), the two terms p(x_i, x_j|H_S) and p(x_i, x_j|H_D) defined in Eq. (9) are zero-mean Gaussians with covariances:

cov_S = [ C_u + C_n, C_u; C_u, C_u + C_n ]
cov_D = [ C_u + C_n, 0; 0, C_u + C_n ]    (13)

Based on this formulation, the LLR defined in Eq. (9) can be calculated as:

r(x_i, x_j) = x_i^T A x_i + x_j^T A x_j − 2 x_i^T G x_j,    (14)

where

A = (C_u + C_n)^{−1} − [(C_u + C_n) − C_u (C_u + C_n)^{−1} C_u]^{−1}
G = −(2 C_u + C_n)^{−1} C_u C_n^{−1}    (15)

As seen from Eq. (15), the generative model parameters Θ_G used in estimating the LLR are related only to the covariance parameters C_u and C_n [8], [9]. Given a training data set, the parameters can be estimated with an EM (or EM-like) learning algorithm based on:

Θ_G^* = arg min_{Θ_G} − \sum_i log p(X_i|Θ_G)    (16)
where Θ_G = {C_u, C_n}, and X_i is the collection of samples for speaker i.

Fig. 2. Pipeline for joint Bayesian based generative modeling on X-vectors for speaker verification. LDA: linear discriminant analysis; JB: joint Bayesian model; LLR: log-likelihood ratio.
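The closed form of Eqs. (14)-(15) can be checked against the block-covariance definition in Eq. (13). A minimal numpy sketch with hypothetical covariances (the additive log-determinant constant and the 1/2 scale are dropped, as in Eq. (14); they do not affect thresholding):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
# Hypothetical SPD speaker / noise covariances C_u, C_n (illustrative only).
M = rng.standard_normal((d, d)); C_u = M @ M.T + np.eye(d)
M = rng.standard_normal((d, d)); C_n = M @ M.T + np.eye(d)

S = C_u + C_n
inv = np.linalg.inv
# Eq. (15): score matrices depend only on the covariance parameters.
A = inv(S) - inv(S - C_u @ inv(S) @ C_u)
G = -inv(2 * C_u + C_n) @ C_u @ inv(C_n)

def llr(xi, xj):
    """Eq. (14): JB verification score for a trial (x_i, x_j)."""
    return xi @ A @ xi + xj @ A @ xj - 2 * xi @ G @ xj

# Sanity check against the block covariances of Eq. (13): the score equals
# the difference of the Mahalanobis terms under H_D and H_S.
xi, xj = rng.standard_normal(d), rng.standard_normal(d)
z = np.concatenate([xi, xj])
cov_S = np.block([[S, C_u], [C_u, S]])
cov_D = np.block([[S, np.zeros((d, d))], [np.zeros((d, d)), S]])
assert np.isclose(llr(xi, xj), z @ inv(cov_D) @ z - z @ inv(cov_S) @ z)
```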
2) Pair-wise discriminative model:
The hypothesis test defined in Eq. (1) can be regarded as a binary classification task, which can be solved with neural discriminative learning as formulated in Eqs. (6) and (7). In neural discriminative learning, the parameters are neural weights (affine transform matrices with linear or nonlinear activations); we can connect the parameters of the generative model with the neural weights and optimize them with an objective function. As a binary classification task, given a trial with two observed X-vector variables z_{i,j} = (x_i, x_j), the task is to estimate and compare p(H_S|z_{i,j}) and p(H_D|z_{i,j}). For binary discriminative learning, the label is defined as:

y_{i,j} = { 1, z_{i,j} ∈ H_S;  0, z_{i,j} ∈ H_D }    (17)

For a binary discriminative neural network, with reference to Eqs. (7) and (8), the posterior probability is estimated as:

p(y_{i,j}|z_{i,j}) = { 1 / (1 + exp(−h_{H_S,H_D}(z_{i,j}, Θ_D))), z_{i,j} ∈ H_S;
                       1 − 1 / (1 + exp(−h_{H_S,H_D}(z_{i,j}, Θ_D))), z_{i,j} ∈ H_D }    (18)

As revealed in Eqs. (4), (5), and (9), we replace h_{H_S,H_D}(z_{i,j}, Θ_D) with the LLR score, and define a mapping as a logistic function with scale parameters [20], [21]:

f(r_{i,j}) ≜ 1 / (1 + exp(−(α r_{i,j} + β)))    (19)

where r_{i,j} = r(z_{i,j}) = r(x_i, x_j) as defined in Eq. (9), and α and β are the gain and bias factors of the regression model. In Eq. (19), the LLR score estimated from the JB generative model is integrated into a discriminative training framework. The probability estimation in Eq. (18) is then cast as:

ŷ_{i,j} ≜ p(y_{i,j}|z_{i,j}) = { f(r_{i,j}), z_{i,j} ∈ H_S;  1 − f(r_{i,j}), z_{i,j} ∈ H_D }    (20)

Training can be based on optimizing the binary cross entropy defined as:

L = − \sum_{z_{i,j} ∈ {H_S ∪ H_D}} ( y_{i,j} log f(r_{i,j}) + (1 − y_{i,j}) log(1 − f(r_{i,j})) )    (21)

In the following subsection, we investigate the neural network architecture for the hybrid model framework.
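The calibration of Eq. (19) and the loss of Eq. (21) can be sketched directly; the scores, labels, and α, β values below are toy illustrations:

```python
import numpy as np

def f(r, alpha=1.0, beta=0.0):
    """Eq. (19): scaled logistic mapping of the LLR score; alpha and beta
    are the trainable gain/bias (illustrative values here)."""
    return 1.0 / (1.0 + np.exp(-(alpha * r + beta)))

def bce_loss(r, y):
    """Eq. (21): binary cross entropy over trials; y = 1 for same-speaker
    (H_S) trials and y = 0 for different-speaker (H_D) trials."""
    p = f(r)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Toy LLR scores: same-speaker trials should receive large values.
r = np.array([4.0, 3.0, -5.0, -2.0])
y = np.array([1.0, 1.0, 0.0, 0.0])
assert bce_loss(r, y) < bce_loss(r, 1 - y)  # correct labels give lower loss
```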
C. Hybrid model framework with neural network architecture
The conventional state-of-the-art framework for SV based on the X-vector and JB model is illustrated in Fig. 2. In this figure, the
Fig. 3. The proposed Siamese network for estimating the LLR score of the JB model for pair-wise discriminative model learning. FC1: fully connected layer for LDA; FC2: fully connected layer for JB; H_D and H_S are the class hypotheses for different and the same speakers, respectively.

LDA is applied on the X-vector for discriminative dimension reduction. After the LDA, vector length normalization is used; then the JB generative model is applied, by which the LLR is estimated. In a pair-wise discriminative learning framework, the LLR can be used for a binary classification task with a Siamese network, which is shown in Fig. 3. In this Siamese network, the LDA and JB model are implemented with dense layers of the neural network architecture (indicated as the "FC1" and "FC2" blocks).

We first explain the LDA, which will be approximated by an affine transform in the neural network modeling. For input X-vector samples and their corresponding labels {(x_1, y_1), (x_2, y_2), ..., (x_M, y_M)}, x_i ∈ R^l, the LDA transform is:

h_i = W^T x_i,    (22)

where W ∈ R^{l×d}, l and d are the dimensions of the input X-vector and the transformed feature vector, and M is the number of samples. W is estimated from the following definition:

W^* = arg max_W tr( (W^T S_w W)^{−1} (W^T S_b W) ),    (23)

where tr(·) denotes the matrix trace operator, and S_w and S_b are the intra-class and inter-class covariance matrices defined as:

S_w = \sum_{j=1}^{C} \sum_{i=1}^{M_j} (x_{j,i} − μ_j)(x_{j,i} − μ_j)^T
S_b = (1/M) \sum_{j=1}^{C} M_j (μ_j − μ̄)(μ_j − μ̄)^T,    (24)

where C is the number of speakers, M_j is the number of samples of the j-th class, and μ_j (for the j-th class) and μ̄ are the class-wise and global means defined as:

μ_j = (1/M_j) \sum_{i=1}^{M_j} x_{j,i},  μ̄ = (1/M) \sum_{i=1}^{M} x_i.    (25)

From Eq. (22), we can see that the LDA can be implemented as a linear dense layer.

We further look into the estimation of the LLR score defined in Eq. (14). In Eq.
(14), A and G are negative semi-definite symmetric matrices [8], [9], so they can be decomposed as:

A = −P_A P_A^T
G = −P_G P_G^T    (26)
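The LDA transform of Eqs. (22)-(25), which the "FC1" dense layer approximates, can be sketched as follows. This is a minimal fit on synthetic data; a small regularizer on S_w is added for numerical stability (an assumption of this sketch, not stated in the paper):

```python
import numpy as np

def lda_transform(X, y, d):
    """Fit the LDA projection W of Eqs. (23)-(25) and return h_i = W^T x_i.
    X: (M, l) feature matrix, y: integer speaker labels, d: output dim."""
    mu_bar = X.mean(axis=0)
    l = X.shape[1]
    S_w = np.zeros((l, l)); S_b = np.zeros((l, l))
    for c in np.unique(y):
        Xc = X[y == c]
        mu_c = Xc.mean(axis=0)
        S_w += (Xc - mu_c).T @ (Xc - mu_c)                      # Eq. (24)
        S_b += len(Xc) * np.outer(mu_c - mu_bar, mu_c - mu_bar)
    # Maximizing Eq. (23) leads to the top-d eigenvectors of S_w^{-1} S_b.
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(S_w + 1e-6 * np.eye(l), S_b))
    W = np.real(eigvecs[:, np.argsort(-np.real(eigvals))[:d]])
    return X @ W, W

rng = np.random.default_rng(1)
# Synthetic data: 3 "speakers", 20 samples each, 8-dim features.
X = rng.standard_normal((60, 8)) + np.repeat(rng.standard_normal((3, 8)) * 3, 20, axis=0)
y = np.repeat([0, 1, 2], 20)
H, W = lda_transform(X, y, d=2)
assert H.shape == (60, 2)
```

Since the projection is a single matrix multiply, it maps directly onto a linear dense layer, as noted after Eq. (22).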
Fig. 4. The proposed Siamese network with integration of the JB model structure (in Fig. 3) for speaker verification (see the text for a detailed explanation).
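The decomposition in Eq. (26) can be verified numerically: factoring the negative semi-definite A and G and scoring through the resulting affine transforms reproduces the matrix-form LLR of Eq. (14). A minimal sketch with hypothetical covariances:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
# Hypothetical SPD speaker / noise covariances, as in the JB model.
M = rng.standard_normal((d, d)); C_u = M @ M.T + np.eye(d)
M = rng.standard_normal((d, d)); C_n = M @ M.T + np.eye(d)
S = C_u + C_n
inv = np.linalg.inv
A = inv(S) - inv(S - C_u @ inv(S) @ C_u)          # Eq. (15)
G = -inv(2 * C_u + C_n) @ C_u @ inv(C_n)

def nsd_factor(M):
    """Eq. (26): factor a negative semi-definite symmetric M as M = -P P^T
    via its eigendecomposition (tiny positive eigenvalues are clipped)."""
    w, V = np.linalg.eigh(M)
    return V @ np.diag(np.sqrt(np.clip(-w, 0, None)))

P_A, P_G = nsd_factor(A), nsd_factor(G)

h1, h2 = rng.standard_normal(d), rng.standard_normal(d)
h1, h2 = h1 / np.linalg.norm(h1), h2 / np.linalg.norm(h2)  # length norm
a1, g1 = P_A.T @ h1, P_G.T @ h1                            # affine transforms
a2, g2 = P_A.T @ h2, P_G.T @ h2
r_factored = 2 * g1 @ g2 - a1 @ a1 - a2 @ a2               # factored score
r_matrix = h1 @ A @ h1 + h2 @ A @ h2 - 2 * h1 @ G @ h2     # Eq. (14)
assert np.isclose(r_factored, r_matrix)
```

Each branch of the factored score is a single matrix multiply of the length-normalized vector, which is exactly what the two dense-layer branches of the "JB net" compute and how they can be initialized from the generative model.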
The LLR score is then cast as:

r_{i,j} = 2 g_i^T g_j − a_i^T a_i − a_j^T a_j    (27)

with the affine linear transforms:

a_i = P_A^T h̃_i,  g_i = P_G^T h̃_i,    (28)

where h̃_i = h_i / ||h_i|| is the length-normalized vector from the LDA transform defined in Eq. (22). These transforms can be applied in a neural network as linear dense layers. Based on these formulations, the Siamese network in Fig. 3 is further implemented as shown in Fig. 4. In this figure, there are two sub-nets, the "LDA net" and the "JB net". The "LDA net" is a dense layer with the transform of Eq. (22). In the "JB net", the JB model structure is realized as a two-branch dense layer network according to Eq. (28).

D. Learning objective function based on minimum empirical Bayes risk (EBR)
The cross entropy defined in Eq. (21) can be applied for discriminative training to measure the classification error. However, the hypothesis test defined in Eq. (1) differs from a classification goal, and the final evaluation metric for SV usually adopts different criteria. It is better to optimize the model parameters directly on the evaluation metrics. In a hypothesis test task, there are two types of errors [18], [19], type I and type II, defined as:

Type I error (false alarm): z_{i,j} ∈ H_D, LLR ≥ θ
Type II error (miss): z_{i,j} ∈ H_S, LLR < θ,    (29)

where θ is a decision threshold. These two types of errors are further illustrated in Fig. 5 for an SV task: the objective for SV is to minimize the target miss P_miss (or false reject) and false alarm P_fa (or false accept) rates in the two hypothesis spaces H_S and H_D.

Fig. 5. The LLR distributions in H_S and H_D for the same- and different-speaker spaces, and the two types of errors in the hypothesis test for SV.

In real applications, it is better to generalize the classification errors to a weighting of these two types of errors. Taking prior knowledge into account in a measure of empirical Bayes risk (EBR), the evaluation metric for SV adopts a detection cost function (DCF) to measure the hardness of the decisions. It is defined as a weighted loss:

C_det ≜ P_tar C_miss P_miss + (1 − P_tar) C_fa P_fa,    (30)

where C_miss and C_fa are user-assigned costs for miss and false alarm detections, P_tar is the prior of target trials, and P_miss and P_fa are the miss and false alarm probabilities defined as:

P_fa = (1/N_non) \sum_{z_{i,j} ∈ H_D} u(r_{i,j} ≥ θ)
P_miss = (1/N_tar) \sum_{z_{i,j} ∈ H_S} u(r_{i,j} < θ)    (31)

In Eq. (31), N_non and N_tar are the numbers of nontarget and target trials, r_{i,j} is the LLR estimated from Eq. (27), θ is a decision threshold, and u(·) is an indicator function counting the trials with scores below or above the decision threshold. In order to make the objective function differentiable so that it can be used in gradient based neural network learning, the indicator function u(·) in Eq. (31) is replaced with:

P_fa = (1/N_non) \sum_{z_{i,j} ∈ {H_S ∪ H_D}} (1 − y_{i,j}) f(r_{i,j})
P_miss = (1/N_tar) \sum_{z_{i,j} ∈ {H_S ∪ H_D}} y_{i,j} (1 − f(r_{i,j}))    (32)

where f(r_{i,j}) is the sigmoid logistic function defined in Eq. (19). With reference to the cross-entropy loss in Eq. (21), the loss of Eq. (30) with Eq. (32) can be regarded as a generalized cross-entropy loss with user-defined weighting cost parameters.

III. EXPERIMENTS AND RESULTS
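The evaluation metrics used in the experiments below, i.e., the hard-count DCF of Eqs. (29)-(31) and the differentiable relaxation of Eq. (32), can be sketched as follows (the scores, labels, and α, β values are illustrative):

```python
import numpy as np

def detection_cost(r, y, theta, p_tar=0.01, c_miss=1.0, c_fa=1.0):
    """Eqs. (29)-(31): hard-count false-alarm / miss rates and the DCF."""
    p_fa = np.mean(r[y == 0] >= theta)    # type I error over nontarget trials
    p_miss = np.mean(r[y == 1] < theta)   # type II error over target trials
    return p_tar * c_miss * p_miss + (1 - p_tar) * c_fa * p_fa

def soft_dcf_loss(r, y, alpha=1.0, beta=0.0, p_tar=0.01, c_miss=1.0, c_fa=1.0):
    """Eq. (32): differentiable surrogate, replacing the indicator u(.)
    with the logistic f(r) of Eq. (19) so gradients can flow."""
    f = 1.0 / (1.0 + np.exp(-(alpha * r + beta)))
    p_fa = np.sum((1 - y) * f) / np.sum(1 - y)       # soft type I rate
    p_miss = np.sum(y * (1 - f)) / np.sum(y)         # soft type II rate
    return p_tar * c_miss * p_miss + (1 - p_tar) * c_fa * p_fa

r = np.array([3.0, 1.5, 0.2, -0.5, -2.0])   # toy LLR scores
y = np.array([1, 1, 1, 0, 0])               # 1: target (H_S), 0: nontarget (H_D)
# minDCF sweeps the decision threshold over the observed scores.
min_dcf = min(detection_cost(r, y, t) for t in r)
```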
We carried out experiments on SV tasks where the test data sets are from the speakers in the wild (SITW) [22] and Voxceleb [23] corpora. The speaker features and models were trained on the Voxceleb data corpus (sets 1 and 2) [23]. We adopt a state-of-the-art pipeline for constructing the SV system, as shown in Fig. 2. In this figure, the "LDA", "Length Norm", and "JB" blocks are designed independently rather than optimized jointly. The input speaker feature in our pipeline is the X-vector. The X-vectors are extracted with a well-trained neural network model designed for a speaker classification task [5]. For back-end models, both the well-known PLDA and the JB generative models are implemented in our comparisons.
A. Speaker embedding feature based on X-vector
A speaker embedding model is trained for X-vector extraction. The neural architecture of the embedding model is composed of deep time-delay neural network (TDNN) layers and statistical pooling layers, implemented as in Kaldi [5]. In training the model, the cross-entropy criterion for speaker classification is used as the learning objective function. The training data includes two sets from the Voxceleb corpus: the training set of Voxceleb1, with the speakers that overlap with the SITW test set removed, and the training set of Voxceleb2. In total, about 7,185 speakers with 1,236,567 utterances are used for training. Moreover, data augmentation is applied by adding noise, music, and babble at several SNRs, and reverberation with simulated room impulse responses is also applied to increase data diversity. Input features for training the speaker embedding model are MFCCs with 30 Mel band bins, extracted with a 25 ms frame length and a 10 ms frame shift. Energy based voice activity detection (VAD) is applied to remove silent background regions in speaker feature extraction. More details of the features, model architecture, and training procedure are given in [5]. The final extracted X-vectors have 512 dimensions.
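The statistical pooling step mentioned above can be sketched as follows: it collapses a variable-length sequence of frame-level activations into one fixed-size utterance vector by concatenating the mean and standard deviation over time (the dimensions below are illustrative, not the paper's):

```python
import numpy as np

def stats_pooling(frames):
    """Statistical pooling as used in X-vector extractors: map a (T, d)
    frame-level activation sequence to a fixed 2d-dim utterance vector."""
    return np.concatenate([frames.mean(axis=0), frames.std(axis=0)])

rng = np.random.default_rng(0)
utt_short = rng.standard_normal((120, 64))   # 120 frames, 64-dim activations
utt_long = rng.standard_normal((900, 64))    # a much longer utterance
# Both utterances map to the same fixed dimensionality (2 * 64 = 128).
assert stats_pooling(utt_short).shape == stats_pooling(utt_long).shape == (128,)
```

This is what allows utterances of arbitrary duration to be compared through fixed-dimension embeddings downstream.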
B. Back-end models
Although the X-vector extracted from the speaker embedding model is supposed to encode speaker-discriminative information, it also encodes other acoustic factors. In the conventional pipeline illustrated in Fig. 2, an LDA is applied before the generative speaker model. In this study, the 512-dimension X-vectors are transformed to 200-dimension vectors by the LDA. Correspondingly, in the discriminative neural network model shown in Fig. 4, a dense layer with 200 neurons is also applied. Moreover, in the discriminative model, two dense layers corresponding to P_A and P_G of the JB model are trained with "positive" and "negative" X-vector pairs (pairs from the same and different speakers). Since the discriminative neural network architecture fits the pipeline based on the generative model structure, the dense layer parameters can be initialized with the LDA and JB model parameters in training (according to Eqs. (22) and (28)). For comparison, random initialization with "he normal", as widely used in deep neural network learning, is also applied in the experiments [24]. In model training, the Adam algorithm with an initial learning rate of . [30] was used. In order to include enough "negative" and "positive" samples, the mini-batch size was set to 4096. The training X-vectors were split into training and validation sets with a ratio of . The model parameters were selected based on the best performance on the validation set.

C. Results
We first carried out SV experiments on the data sets of SITW. Two test sets are used, i.e., the development and evaluation
TABLE I
PERFORMANCE ON THE DEVELOPMENT SET OF SITW.

Methods             EER (%)   minDCF (0.01)   minDCF (0.001)
LDA+PLDA            3.003     0.3315          0.5198
LDA+JB              3.043     0.3288          0.5019
Hybrid (rand init)  4.159     0.3792          0.5883
Hybrid (JB init)
TABLE II
PERFORMANCE ON THE EVALUATION SET OF SITW.

Methods             EER (%)   minDCF (0.01)   minDCF (0.001)
LDA+PLDA            3.554     0.3526          0.5657
LDA+JB              3.496     0.3422          0.5645
Hybrid (rand init)  4.505     0.3920          0.6003
Hybrid (JB init)

sets, and each is used as an independent test set. The evaluation metrics are the equal error rate (EER) and the minimum detection cost function (minDCF) (with target priors 0.01 and 0.001) [22]. The EER is the operating point where the type I and type II errors (as defined in Eq. (29)) are equal, and the minDCF is defined in Eq. (30). The performance results are shown in Tables I and II. In these two tables, "LDA+PLDA" and "LDA+JB" denote the PLDA and JB generative model based SV systems following the pipeline in Fig. 2 (with the "JB" block replaced by "PLDA" for the PLDA based system). "Hybrid" denotes the discriminative neural network based SV system which adopts the JB model structure in its neural architecture following the pipeline in Fig. 4. For the "Hybrid" SV system, the two model initialization methods explained in Section III-B are tested in model training. From these two tables, we can see that the performance of the JB generative model is comparable to or slightly better than that of the PLDA model. In the hybrid model, if the model parameters ("LDA net" and "JB net") are randomly initialized, the performance is worse than the original generative model based results. However, when the neural network parameters are initialized with the LDA and JB model parameters, the performance is significantly improved. These results indicate that discriminative training can further enhance the discriminative power of the generative model when the model parameters are initialized from it; random initialization does not improve performance even when the generative model structure is taken into consideration. Following the same procedure, the experimental results on the Voxceleb1 test set are shown in Table III.
From this table, we could observe the same tendency asin tables I and II.
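The EER and minDCF reported in these tables can be computed from arrays of target and impostor trial scores. Since Eqs. (29)–(30) are outside this excerpt, the sketch below follows the standard definitions with a simple threshold sweep; it is illustrative, not the paper's actual evaluation tooling:

```python
import numpy as np

def compute_eer(target_scores, impostor_scores):
    """Equal error rate: the operating point where the miss rate
    (type II error) equals the false-alarm rate (type I error)."""
    thresholds = np.unique(np.concatenate([target_scores, impostor_scores]))
    best_gap, eer = np.inf, 1.0
    for t in thresholds:
        miss = np.mean(target_scores < t)    # targets rejected
        fa = np.mean(impostor_scores >= t)   # impostors accepted
        if abs(miss - fa) < best_gap:
            best_gap, eer = abs(miss - fa), (miss + fa) / 2
    return eer

def min_dcf(target_scores, impostor_scores, p_target, c_miss=1.0, c_fa=1.0):
    """Minimum normalized detection cost over all thresholds
    (standard NIST-style convention)."""
    thresholds = np.unique(np.concatenate([target_scores, impostor_scores]))
    norm = min(c_miss * p_target, c_fa * (1.0 - p_target))
    costs = [c_miss * p_target * np.mean(target_scores < t)
             + c_fa * (1.0 - p_target) * np.mean(impostor_scores >= t)
             for t in thresholds]
    return min(costs) / norm
```

For perfectly separated scores both metrics are zero; overlapping score distributions push both up.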
TABLE III
PERFORMANCE ON EVALUATION SET OF VOXCELEB TEST.

Methods             EER (%)   minDCF (0.01)   minDCF (0.001)
LDA+PLDA            3.128     0.3258          0.5003
LDA+JB              3.105     0.3226          0.4992
Hybrid (rand init)  3.340     0.3778          0.4977
Hybrid (JB init)
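The “LDA+PLDA” and “LDA+JB” systems in these tables share an LDA front end that reduces X-vector dimensionality before generative scoring. A minimal sketch of that step, using scikit-learn as a stand-in for the authors' actual tooling (dimensions are hypothetical):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def lda_reduce(xvectors, speaker_ids, out_dim):
    """Fit LDA on training X-vectors (one speaker id per row) and
    return the dimension-reduced speaker features."""
    # out_dim must satisfy out_dim <= min(n_speakers - 1, feature_dim)
    lda = LinearDiscriminantAnalysis(n_components=out_dim)
    return lda.fit_transform(xvectors, speaker_ids)
```

In the pipeline, the reduced features would then be scored by the PLDA or JB back end.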
Fig. 6. Visualization (t-SNE) of speaker cluster distributions of speaker features from the LDA net transform based on the X-vectors before (a) and after (b) joint discriminative training (only 20 speakers are shown).
D. Ablation study
In the proposed framework, there are two important modeling blocks, i.e., the “LDA net” and the “JB net”, as illustrated in Fig. 4. The “LDA net” extracts low-dimensional discriminative speaker representations from X-vectors, and the “JB net” is applied on the extracted feature vectors for speaker modeling. They were jointly learned in a unified framework. In this subsection, we investigate their effects on SV performance with ablation studies.
1) Effect of the “LDA net” in learning:
The X-vector is extracted from the TDNN based speaker embedding model, which is optimized for speaker classification. After the LDA process, the speaker feature has strong power for speaker discrimination. In the proposed hybrid model, the LDA model is further jointly optimized for the SV task. The t-SNE visualization of speaker feature distributions from the LDA transform before and after joint discriminative learning is shown in Fig. 6. In this figure, only 20 speakers are shown. From this figure, we can see that the speaker clusters are distinctly separated based on the speaker features (Fig. 6-a). After joint training, the separation of speaker clusters is further enhanced (Fig. 6-b). We further verify the discrimination power of the speaker representations on SV performance by randomly setting the classifier model (the two dense layers of the JB model) while setting the parameters of the “LDA net” under the following conditions (Fig. 4): (a) setting the dense layer of the “LDA net” with random values (He normal initialization), (b) setting the “LDA net” with the LDA parameters (independent LDA transform), (c) setting the “LDA net” with the jointly trained LDA parameters. The results are shown in Table IV. From these results, we can see that even with random initialization of the “LDA net” (setting (a)), the SV performance is fairly good. The LDA transform improved the performance (setting (b)), and after joint learning, the performance is further improved.
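A visualization of the kind shown in Fig. 6 can be produced with off-the-shelf t-SNE; a minimal sketch, assuming `features` holds the LDA-transformed X-vectors (this is illustrative, not the authors' plotting code):

```python
import numpy as np
from sklearn.manifold import TSNE

def tsne_embed(features, perplexity=10.0):
    """Project speaker features (e.g., LDA-transformed X-vectors) to 2-D
    for cluster visualization; rows from the same speaker should form
    tight clusters if the representation is discriminative."""
    tsne = TSNE(n_components=2, init="pca",
                perplexity=perplexity, random_state=0)
    return tsne.fit_transform(features)
```

The 2-D embedding can then be scatter-plotted with one color per speaker id, as in Fig. 6.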
2) Effect of A and G on SV performance: As shown in Eq. (14), the two terms have different effects on the speaker verification performance. In our discriminative training, which integrates the L_LLR of the JB model, the L_LLR in Eq. (14) is adapted. With different settings of A and G in Eq. (14), we obtain:

r(x_i, x_j) =
    -2 x_i^T G x_j,               for A = 0;
    x_i^T A x_i + x_j^T A x_j,    for G = 0;
    (x_i - x_j)^T G (x_i - x_j),  for A = G;
    (x_i - x_j)^T A (x_i - x_j),  for G = A.    (33)

Based on this formulation, we can check the different effects of A and G on the SV performance. The two matrices A and G are connected to the two dense layer branches of the hybrid model with weights P_A and P_G (refer to Fig. 4). In our model, the dense layers were first initialized with the parameters from the learned JB based generative model, and the model was then further trained with pair-wised “negative” and “positive” samples. Only in the testing stage do we use different parameter settings according to Eq. (33); the results are shown in Tables V and VI for the dev and evaluation sets of SITW, respectively. In these two tables, by comparing the conditions with A = 0 or G = 0, we can see that the cross term contributes most to the SV performance, i.e., the dense layer branch with neural weight P_G carries the most discriminative information in the SV task. Moreover, when keeping the cross term by setting either A = G or G = A, the performance is better than setting either of them to zero.

TABLE IV
PERFORMANCE ON DEVELOPMENT SET OF SITW: RANDOM SETTING OF THE CLASSIFIER MODEL (“JB NET”) AND THREE SETTING CONDITIONS FOR THE “LDA NET”. LDA_R: RANDOM INITIALIZATION, LDA_I: INDEPENDENT LDA TRANSFORM, LDA_J: JOINTLY TRAINED LDA TRANSFORM.

Methods   EER (%)   minDCF (0.01)   minDCF (0.001)
LDA_R     18.37     0.9714          0.9826
LDA_I     8.24      0.6546          0.8434
LDA_J

TABLE V
PERFORMANCE ON DEVELOPMENT SET OF SITW: DIFFERENT SETTINGS OF THE CLASSIFIER MODEL (“JB NET”), BEFORE JOINT TRAINING (SETTING THE LDA NET AND JB NET WITH THE INDEPENDENTLY LEARNED LDA AND JB MODEL PARAMETERS).

Methods             EER (%)   minDCF (0.01)   minDCF (0.001)
A (G=0)             47.71     1.000           1.000
G (A=0)             6.353     0.8261          0.9806
A, G (set G to A)
A, G (set A to G)   3.504     0.3978          0.6316
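The special cases in Eq. (33) can be checked numerically. The sketch below assumes the common JB score form r(x_i, x_j) = x_i^T A x_i + x_j^T A x_j - 2 x_i^T G x_j (the full Eq. (14) is outside this excerpt):

```python
import numpy as np

def jb_llr(xi, xj, A, G):
    """JB-style log-likelihood-ratio score:
    r(xi, xj) = xi^T A xi + xj^T A xj - 2 xi^T G xj."""
    return xi @ A @ xi + xj @ A @ xj - 2.0 * (xi @ G @ xj)
```

With A = G (and G symmetric), the score collapses to the quadratic form (x_i - x_j)^T G (x_i - x_j) in the feature difference, matching the third case of Eq. (33); with A = 0, only the cross term -2 x_i^T G x_j remains.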
3) Relation to distance metric learning:
Distance metric learning is widely used in discriminative learning with pair-wised training samples as input [25], [26], [27], [28]. The Mahalanobis distance metric between two vectors is defined as:

d_{i,j} ≜ d(x_i, x_j) = (x_i - x_j)^T M (x_i - x_j),    (34)

where M = P P^T is a positive definite matrix. Based on this distance metric, the binary classification task for SV can be formulated as:

p(y_{i,j} | z_{i,j}) = σ(λ(d - d_{i,j})),    (35)

where σ(x) = (1 + exp(-x))^{-1} is the sigmoid logistic function, d is a distance decision threshold, and λ is a scale parameter for probability calibration. From Eq. (35), we can see that when the Mahalanobis distance d(x_i, x_j) < d, the probability of x_i and x_j belonging to the same speaker is high, and vice versa. With pair-wised “positive” and “negative” samples, the parameters (M, d, and λ) can be learned on a given training data set as a binary discriminative learning task. Comparing Eqs. (34) and (33), we can see that if we set A = G or G = A, the L_LLR and the Mahalanobis distance have the same formulation (except that the matrix is negative rather than positive definite), i.e., d(x_i, x_j) ∝ -r(x_i, x_j). In this sense, the distance metric based discriminative learning framework can be regarded as a special case of the hybrid discriminative framework, and the L_LLR defined in Eq. (9) is cast to:

r(x_i, x_j) = log [ p(Δ_{i,j} | H_S) / p(Δ_{i,j} | H_D) ],    (36)

where Δ_{i,j} = x_i - x_j. From this definition, we can see that the distance metric based discriminative learning only considers the distribution of the pair-wised sample difference space [29]. In implementation, by merging the two dense layers of the classifier model (the “JB net” with parameters P_A and P_G), the proposed hybrid framework is reduced to a one-branch framework as shown in Fig. 7.

TABLE VI
PERFORMANCE ON DEVELOPMENT SET OF SITW: DIFFERENT SETTINGS OF THE CLASSIFIER MODEL (“JB NET”), AFTER JOINT TRAINING.

Methods             EER (%)   minDCF (0.01)   minDCF (0.001)
A (G=0)             50.29     0.9996          0.9996
G (A=0)             4.775     0.4206          0.6340
A, G (set G to A)

Fig. 7. Siamese net with Mahalanobis net on X-vector features for speaker verification. FC1: fully connected layer for LDA; FC2: fully connected layer for JB; H_D: hypothesis for different speakers; H_S: hypothesis for the same speaker.
In this figure, the “MD net” is the network dense layer for the Mahalanobis distance metric with an affine transform matrix P, and it can be initialized with the parameters of the JB based generative model (either P = P_A or P = P_G) or with random values (He normal initialization). We tested this one-branch model on the dev set of SITW with different settings of the “MD net” (the “LDA net” is initialized with the LDA transform based parameters); the results are shown in Table VII.

TABLE VII
PERFORMANCE ON DEVELOPMENT SET OF SITW OF THE SIAMESE NET WITH “MD NET” AS CLASSIFIER MODEL.

Methods           EER (%)   minDCF (0.01)   minDCF (0.001)
Random init P
Init P with P_A
Init P with P_G

From this table, we can see that when the LDA net and MD net of the one-branch model are initialized with the LDA and P_A parameters, the performance is the best. However, in all conditions, comparing the results in Tables I and VII, the hybrid model framework shows the best performance, which confirms that the model structure inspired by the JB based generative model is helpful in the SV task.
4) L_LLR distributions for intra- and inter-speaker spaces:
The SV task, defined as a hypothesis test, can be regarded as a binary classification task. Correspondingly, as defined in Eq. (9), the performance is measured based on the L_LLR distributions in two spaces, i.e., the intra-speaker space H_S and the inter-speaker space H_D. The separability can be visualized as the histogram distributions of pair-wise distances in the two spaces. We check the histograms of the L_LLR on the training and test sets based on the hybrid model (refer to the network pipeline in Fig. 4) with different parameter settings, and show them in Fig. 8. From this figure, we can see that when the hybrid network parameters are set with random values, there are large overlaps of the L_LLR distributions between the two hypothesis spaces. When the network parameters are set with the parameters of the JB based generative model, the separation of the L_LLR distributions is increased. With discriminative training, the separation is further enhanced. In particular, the L_LLR distribution of “negative” sample pairs becomes much more compact for both the training and testing data sets.

We have shown the SV performance with only the A or G matrix in subsection III-D2. We check the L_LLR distributions of “negative” and “positive” sample pairs in the inter- and intra-speaker spaces, and show the histograms for the test set of SITW in Fig. 9. From this figure, we can see that both matrices A and G contribute to the difference of the L_LLR distributions; in particular, G contributes the main difference between “negative” and “positive” sample pairs. After the model is learned, the difference of the L_LLR distributions between the intra-speaker space H_S and the inter-speaker space H_D is increased.

IV. DISCUSSION AND CONCLUSION
The current state-of-the-art pipeline for SV is composed of two building models, i.e., a front-end model for speaker feature extraction and a generative back-end model for speaker classification. In this study, the X-vector, as a speaker embedding feature, is extracted by the front-end model and encodes strong speaker discriminative information. Based on this speaker feature, a JB based generative back-end model is applied. The JB model tries to model the probability distributions of speaker features, and can predict the conditional probabilities for utterances even from unknown speakers. This is the advantage of using the generative model in the SV task, since the testing utterances are often from speakers not registered in the training set. However, as a generative model, the parameter estimation is easily distracted by nuisance features in a high dimensional space, i.e., generative modeling does not have the feature selection ability for the final SV task. Therefore, a discriminative dimension reduction (e.g., LDA) is applied as an independent processing block on the speaker features before applying the generative modeling.

We take a further look at the SV problem by regarding it as a hypothesis test, i.e., whether two compared utterances are from the same or different speakers. As an alternative, the SV task can also be regarded as a binary classification task. Correspondingly, a discriminative learning framework can be applied with “positive” and “negative” sample pairs (from the same speaker and from different speakers, respectively). The advantage of this discriminative learning framework is that the speaker features can be automatically transformed and modeled in a unified optimization framework. However, the learning easily overfits to the training data set and does not generalize well to unknown test speakers.
In this study, as our main contribution, we proposed to integrate the generative model into a discriminative learning framework as a hybrid model. The key point is that we integrated the L_LLR estimation from the JB based generative model into a neural discriminative learning framework. In particular, the linear matrices in the JB based generative model are factorized into the linear affine transforms in dense layers of the neural network model, and the network parameters are connected to the JB based generative model parameters, so they can be initialized from the learned JB based generative model. Moreover, as another contribution, in the discriminative learning framework, rather than simply learning the hybrid model with a conventional binary discrimination objective function, the direct evaluation metric for the hypothesis test, i.e., the EBR with false alarm and miss rates, can be easily applied as an objective function in parameter optimization. Our experiments confirmed that SV benefits from the hybrid model, which integrates the advantages of both generative and discriminative model learning.

In this study, the JB based generative model is based on simple Gaussian probability distribution assumptions on speaker features and noise. In real applications, the probability distributions are much more complex. Although it is difficult for a generative model to fit complex shapes of probability distributions in a high dimensional space, it is relatively easy for a discriminative learning framework to approximate the complex distribution shapes. In the future, we will extend the current hybrid model framework to learn more complex probability distributions in SV tasks.

Fig. 8.
L_LLR distributions in H_S and H_D spaces: the first row (a, b, and c) for the training set, the second row (d, e, and f) for the testing set; the left column (a and d) for the model set with random parameters, the middle column (b and e) for the model set with learned generative model parameters, and the right column (c and f) for the model set with learned generative model parameters and further discriminatively trained parameters.

Fig. 9. L_LLR distributions on the testing set of SITW for the hybrid model: the first row for the initial model with JB based generative model parameters with G = 0 (a) and with A = 0 (b); the second row for the jointly trained model with G = 0 (c) and with A = 0 (d).

REFERENCES

[1] J. Hansen, T. Hasan, “Speaker recognition by machines and humans: A tutorial review,”
IEEE Signal Processing Magazine, vol. 32, no. 6, pp. 74-99, 2015.
[2] A. Poddar, M. Sahidullah, G. Saha, “Speaker verification with short utterances: A review of challenges, trends and opportunities,” IET Biometrics, vol. 7, no. 2, pp. 91-101, 2018.
[3] H. Beigi, Fundamentals of Speaker Recognition, Springer-Verlag, Berlin, 2011, ISBN 978-0-387-77591-3.
[4] N. Dehak, P. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, “Front-end factor analysis for speaker verification,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 4, pp. 788-798, 2011.
[5] D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur, “X-vectors: Robust DNN embeddings for speaker recognition,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5329-5333, 2018.
[6] E. Variani, X. Lei, E. McDermott, I. L. Moreno, and J. Gonzalez-Dominguez, “Deep neural networks for small footprint text-dependent speaker verification,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4052-4056, 2014.
[7] S. Prince and J. Elder, “Probabilistic linear discriminant analysis for inferences about identity,” in IEEE International Conference on Computer Vision (ICCV), pp. 1-8, 2007.
[8] D. Chen, X. Cao, L. Wang, F. Wen, and J. Sun, “Bayesian face revisited: A joint formulation,” in European Conference on Computer Vision, pp. 566-579, 2012.
[9] D. Chen, X. Cao, D. Wipf, F. Wen, and J. Sun, “An efficient joint formulation for Bayesian face verification,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, pp. 32-46, 2016.
[10] V. Wan, W. Campbell, “Support vector machines for speaker verification and identification,” in Neural Networks for Signal Processing X: Proceedings of the IEEE Signal Processing Society Workshop, vol. 2, pp. 775-784, 2000.
[11] J. Villalba, N. Brummer, N. Dehak, “Tied variational autoencoder backends for i-vector speaker recognition,” in Proceedings of INTERSPEECH, pp. 1004-1008, 2017.
[12] L. Burget, O. Plchot, S. Cumani, O. Glembek, P. Matejka, and N. Brummer, “Discriminatively trained probabilistic linear discriminant analysis for speaker verification,” in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4832-4835, 2011.
[13] S. Cumani, N. Brummer, L. Burget, P. Laface, O. Plchot, and V. Vasilakakis, “Pairwise discriminative speaker verification in the i-vector space,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 21, no. 6, pp. 1217-1227, June 2013.
[14] G. Heigold, I. Moreno, S. Bengio, and N. Shazeer, “End-to-end text-dependent speaker verification,” in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5115-5119, 2016.
[15] L. Wan, Q. Wang, A. Papir, and I. Moreno, “Generalized end-to-end loss for speaker verification,” in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4879-4883, 2018.
[16] A. Lasserre, C. Bishop, T. Minka, “Principled hybrids of generative and discriminative models,” in Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), pp. 87-94, 2006.
[17] R. Raina, Y. Shen, A. Ng, A. McCallum, “Classification with hybrid generative/discriminative models,” in Proceedings of the International Conference on Neural Information Processing Systems, pp. 545-552, 2003.
[18] N. Brummer, E. Villiers, “The BOSARIS toolkit user guide: Theory, algorithms and code for binary classifier score processing,” Documentation of BOSARIS toolkit, 2011.
[19] E. Lehmann, J. Romano, Testing Statistical Hypotheses, Springer-Verlag New York, 2005.
[20] J. Platt, “Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods,” Advances in Large Margin Classifiers, pp. 61-74, 1999.
[21] H. Lin, C. Lin, R. Weng, “A note on Platt's probabilistic outputs for support vector machines,” Machine Learning, vol. 68, pp. 267-276, 2007.
[22] M. McLaren, L. Ferrer, D. Castan, and A. Lawson, “The speakers in the wild (SITW) speaker recognition database,” in Proceedings of INTERSPEECH, pp. 818-822, 2016.
[23] A. Nagrani, J. Chung, W. Xie, A. Zisserman, “VoxCeleb: Large-scale speaker verification in the wild,” Computer Speech and Language, vol. 60, 2020.
[24] K. He, X. Zhang, S. Ren, J. Sun, “Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification,” in Proceedings of IEEE International Conference on Computer Vision (ICCV), pp. 1026-1034, 2015.
[25] E. Xing, A. Ng, M. Jordan, and S. Russell, “Distance metric learning, with application to clustering with side-information,” in Advances in Neural Information Processing Systems, MIT Press, pp. 521-528, 2002.
[26] K. Weinberger, J. Blitzer, L. Saul, “Distance metric learning for large margin nearest neighbor classification,” Advances in Neural Information Processing Systems 18, pp. 1473-1480, 2006.
[27] K. Weinberger, L. Saul, “Distance metric learning for large margin classification,” Journal of Machine Learning Research, vol. 10, pp. 207-244, 2009.
[28] M. Guillaumin, J. Verbeek, C. Schmid, “Is that you? Metric learning approaches for face identification,” in Proceedings of the IEEE International Conference on Computer Vision, pp. 498-505, 2009.
[29] B. Moghaddam, T. Jebara, A. Pentland, “Bayesian face recognition,” Pattern Recognition, vol. 33, pp. 1771-1782, 2000.
[30] D. P. Kingma, J. Ba, “Adam: A method for stochastic optimization,” in Proceedings of the 3rd International Conference on Learning Representations (ICLR), 2015.