Extrapolating False Alarm Rates in Automatic Speaker Verification

Alexey Sholokhov¹, Tomi Kinnunen², Ville Vestman², Kong Aik Lee³

¹Huawei Technologies Ltd., Moscow, Russia
²Computational Speech Group, University of Eastern Finland, Finland
³Biometrics Research Laboratories, NEC Corporation, Tokyo, Japan

[email protected], [email protected], [email protected], [email protected]
Abstract
Automatic speaker verification (ASV) vendors and corpus providers would both benefit from tools to reliably extrapolate performance metrics for large speaker populations without collecting new speakers. We address false alarm rate extrapolation under a worst-case model whereby an adversary identifies the closest impostor for a given target speaker from a large population. Our models are generative and allow sampling new speakers. The models are formulated in the ASV detection score space to facilitate analysis of arbitrary ASV systems.
Index Terms: speaker verification, false alarm rate, closest impostor, black-box attack, PLDA, implicit generative models
1. Introduction
How unique is the human voice?
This question is clearly relevant for practical deployment of automatic speaker verification (ASV) technology — yet, it is scarcely addressed [1] due to its open-ended nature. Unlike passwords, which have zero uncertainty conditioned on a person's identity [2], the human voice is subject to both extrinsic and intrinsic variations, none of which are deterministic. 'Uniqueness' thus depends both on data conditions and the observer (e.g. a specific ASV system or listener). In our recent work [3], we addressed an alternative, more tangible question:
Given a specific ASV system (black-box) and evaluation corpus, how does the false alarm rate behave with an increased number of speakers?
To be precise, we modeled the sampling process of nontarget detection scores of a given ASV system through a probabilistic generative model, to enable indefinite increasing of the impostor population size without having to collect new speech data. The assumption is that the underlying sampling process, governed by the properties of the ASV system (treated as a black-box) and corpus, remains fixed. Drawing a random nontarget score proceeds in two steps. First, we draw a random pair of speakers, implicitly represented by a Gaussian distribution which models the similarity scores between these two speakers. Second, we draw a random score from that distribution.

In [3] we also revised the notion of 'nontarget speaker'. Apart from efforts devoted to the study of spoofing attacks [4], standard evaluation benchmarks of ASV technology [5] assume nontarget speakers to be non-proactive or zero-effort impostors — other random speakers paired up with targets. We, instead, considered worst-case impostors with a deterministic, proactive imposture policy: given a target speaker of interest (for instance, a notable politician), the adversary identifies the closest impostor to the given target from a large population (such as the Internet) to increase the chance of this impostor being accepted as the targeted speaker. This is an instance of an adversarial attack [6, 7] on ASV [8, 9]. The general motivations are to identify loopholes of ASV and to develop defense mechanisms against them.

In this study we improve upon the generative model presented in [3]. Despite demonstrating the expected overall trends, the predicted false alarm rates were substantially overestimated, particularly at high ASV thresholds (proxies of high-security applications). To tackle this shortcoming, we propose a discriminative training method which uses empirical estimates of false alarm rates as targets.
The setup is similar to standard regression tasks, except that our primary goal is extrapolation — making predictions substantially beyond the range of inputs in the training set. In our context, this means predicting the false alarm rate of an ASV system for a population of nontarget speakers far larger than the number of speakers for which data is available. Without additional assumptions on the predictor functions, standard regression methods available in machine learning libraries have a higher risk of producing meaningless results (see [10]).

In general, the task of learning interpretable functional dependencies has received far less attention within machine learning compared to the natural sciences, where discovering physically plausible models is important. To obtain more trustworthy predictions, we build upon a regressor which takes into account the specifics of the detection score distribution governed by unobserved similarities between speakers. Specifically, it uses a generative model of ASV scores together with an estimator of false alarm rates in a single prediction pipeline.

Another novelty of this work is modeling the generation of nontarget scores using probabilistic linear discriminant analysis (PLDA) in the detection score space. PLDA [11] — a generative model in the space of vector representations of speech utterances (e.g. i-vectors or x-vectors) — is well known to ASV researchers. Our formulation, however, differs substantially from this familiar use case, as our modeling takes place in the detection score space rather than a vector space. We use PLDA to generate 'new' detection scores. The scores used for training can, but are not required to, be outcomes of trial comparisons by an actual PLDA model. We learn a PLDA model whose log-likelihood ratio scores are designed to approximate the distribution of detection scores of any ASV system. Similar to [3], our models require no other data than ASV scores (and their labels).
Specifically, we do not need any speaker embeddings to train our PLDA score generator.
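For illustration, the two-step sampling of nontarget scores described above (a random speaker pair, then a random score from its pair-specific Gaussian, as in [3]) can be sketched as follows. The hyper-priors over the pair-specific location and scale, and all numeric values, are hypothetical stand-ins, not the fitted values from [3]:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical hyper-priors over speaker-pair score distributions;
# in practice these would be fit to the empirical nontarget scores
# of the ASV system under study.
MU_MEAN, MU_STD = -5.0, 2.0           # spread of pair-specific means
SIGMA_SHAPE, SIGMA_SCALE = 2.0, 0.5   # spread of pair-specific std. devs.

def sample_nontarget_score():
    """Two-step draw: (1) a random speaker pair, represented implicitly
    by a Gaussian over its similarity scores; (2) a score from it."""
    mu = rng.normal(MU_MEAN, MU_STD)              # step 1: draw a pair
    sigma = rng.gamma(SIGMA_SHAPE, SIGMA_SCALE)   # (location, scale)
    return rng.normal(mu, sigma)                  # step 2: draw a score

scores = np.array([sample_nontarget_score() for _ in range(10000)])
```

Repeating the draw yields an arbitrarily large synthetic pool of nontarget scores, which is what allows the impostor population size to be increased indefinitely without new data.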
2. Preliminaries
We begin with a brief review of some necessary technical background on the false alarm rate, its extrapolation, and PLDA.
The false alarm (FA) rate is defined as
$$P_{\text{FA}}(\tau) \equiv \int_{\tau}^{\infty} p(s\,|\,\text{non})\,\mathrm{d}s, \qquad (1)$$
where $\tau \in \mathbb{R}$ is a detection threshold and $p(s\,|\,\text{non})$ is the probability density of nontarget scores of an ASV system. The FA rate can be written as the expectation $\mathbb{E}_{s \sim p(s|\theta_{\text{non}})}[\mathbb{I}\{s > \tau\}]$, and approximated by Monte-Carlo (MC) sampling as
$$P_{\text{FA}}(\tau) \approx \frac{1}{R}\sum_{r=1}^{R} \mathbb{I}\{s_r > \tau\}, \qquad s_r \sim p(s\,|\,\theta_{\text{non}}), \qquad (2)$$
where $r = 1, \ldots, R$ are the indices of the nontarget trials and $\mathbb{I}\{\cdot\}$ is an indicator function. Each nontarget trial consists of a pairwise comparison of utterances from two different speakers (conversely, a target trial constitutes a pairwise comparison of utterances from the same speaker). In the special case when every unique speaker pair in a trial list has the same number of trials, $L$, the above estimator is the same as averaging speaker-pair specific FA rates:
$$\frac{1}{R}\sum_{r=1}^{R} \mathbb{I}\{s_r > \tau\} = \frac{1}{T}\sum_{i=1}^{T} \frac{1}{L}\sum_{\ell=1}^{L} \mathbb{I}\{s_{i,\ell} > \tau\}, \qquad (3)$$
where $s_{i,\ell}$ denotes the $\ell$th score from the $i$th speaker pair, $T$ is the number of unique speaker pairs, and $R = T \cdot L$.

This reformulation of (2) leads to an alternative estimator of $P_{\text{FA}}$, as presented in [3]:
$$P_{\text{FA}}(\tau) \approx \frac{1}{T}\sum_{i=1}^{T} P^{(i)}_{\text{FA}}(\tau), \qquad P^{(i)}_{\text{FA}}(\tau) = \frac{1}{|\mathcal{S}_i|}\sum_{s_\ell \in \mathcal{S}_i} \mathbb{I}\{s_\ell > \tau\}, \qquad (4)$$
where $\mathcal{S}_i$ is the set of scores for the $i$th speaker pair, consisting of an enrolled (target) speaker and an impostor selected randomly from a dataset, and $P^{(i)}_{\text{FA}}(\tau)$ is the corresponding speaker-pair specific FA rate. The following discussion is based on the fact that selecting a random impostor is equivalent to selecting a random subset of $N$ speakers, followed by selecting a random speaker from this subset. Thus, (4) can be interpreted as averaging the results of $T$ stochastic simulations, where both the target speaker and the impostor subset are randomly drawn from a given database.
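A minimal numeric sketch of the two estimators (2) and (4), using synthetic score values rather than the output of an actual ASV system:

```python
import numpy as np

def fa_rate(scores, tau):
    """Monte-Carlo estimate of P_FA(tau), Eq. (2): the fraction of
    nontarget scores exceeding the detection threshold tau."""
    return np.mean(np.asarray(scores) > tau)

def fa_rate_pairwise(score_sets, tau):
    """Alternative estimator, Eq. (4): average the speaker-pair specific
    FA rates P_FA^(i)(tau) over the T speaker pairs.  Coincides with
    Eq. (2) when every pair contributes the same number of scores L."""
    return np.mean([np.mean(np.asarray(s) > tau) for s in score_sets])

# Toy check with T = 3 pairs and L = 4 synthetic scores per pair:
sets = [[-1.0, 0.5, -2.0, 1.5], [0.2, 0.4, -0.3, 2.1], [-4.0, -3.0, -2.5, -1.0]]
flat = [s for ss in sets for s in ss]
assert np.isclose(fa_rate(flat, 0.0), fa_rate_pairwise(sets, 0.0))
```

With equal set sizes the two estimates agree exactly, mirroring the identity in (3).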
This is in line with typical ASV trial designs, where (zero-effort) impostors can be considered as random speakers with different identities.

This view of (4) allows us to consider several alternative policies for choosing an impostor. We consider a worst-case impostor, that is, the closest match to a given target speaker. The adversary might locate the closest impostor using a speaker identification system [12] or by other means. In [3] we proposed a new metric, the worst-case FA rate with $N$ impostors, abbreviated $P^N_{\text{FA}}(\tau)$. It represents a scenario where target speakers are scored against their closest impostors. We also introduced a generative model of nontarget scores to allow $N$ to exceed the number of speakers in the corpus. This allows extrapolation of FA rates for an arbitrarily-sized impostor population.

PLDA [11] models between- and within-class distributions of high-dimensional vectors using low-dimensional subspaces. In ASV, PLDA is used to model distributions of speaker embeddings and for same/different-speaker hypothesis testing. PLDA was revised in [13] and [14] (see also [15]). We use the so-called two-covariance PLDA [14]. It models the $j$th embedding of the $i$th speaker by
$$\boldsymbol{\phi}_{i,j} = \mathbf{b} + \mathbf{y}_i + \boldsymbol{\varepsilon}_{i,j}, \qquad (5)$$
where $\mathbf{b} \in \mathbb{R}^D$ is the center of the embedding space, $\mathbf{y}_i \in \mathbb{R}^D$ is a latent speaker identity variable with normal prior $\mathcal{N}(\mathbf{0}, \mathbf{B})$, and $\boldsymbol{\varepsilon}_{i,j} \in \mathbb{R}^D$ is a residual with prior $\mathcal{N}(\mathbf{0}, \mathbf{W})$. $\mathbf{B}$ and $\mathbf{W}$ are the between- and within-class covariance matrices. The parameters $\theta_{\text{plda}} = \{\mathbf{b}, \mathbf{B}, \mathbf{W}\}$ are typically estimated via the expectation-maximization (EM) algorithm [16, 17] using a set of development speakers (different from target speakers).

At the recognition stage, $\theta_{\text{plda}}$ is used for computing the log-likelihood ratio (LLR) score for a given pair of enrollment and test utterances, as
$$s(\boldsymbol{\phi}_e, \boldsymbol{\phi}_t) = \log \frac{p(\boldsymbol{\phi}_e, \boldsymbol{\phi}_t \,|\, H_1, \theta_{\text{plda}})}{p(\boldsymbol{\phi}_e, \boldsymbol{\phi}_t \,|\, H_0, \theta_{\text{plda}})}, \qquad (6)$$
where $H_1$ and $H_0$ denote, respectively, the target (same speaker) and nontarget (different speaker) hypotheses. $H_1$ assumes that $\boldsymbol{\phi}_e$ and $\boldsymbol{\phi}_t$ share the same latent identity variable, while $H_0$ assumes that their latent identity variables are different. The score (6) is given by a closed-form expression — see [18].

In this work, we do not use PLDA to model speaker embeddings. For generality, all our modeling takes place in the detection score space. We use PLDA to model the distribution of empirical scores of any ASV system — whether or not it is based on a PLDA back-end. Note, first, that (6) represents a deterministic function that assigns a real number to any pair of embeddings. Concerning performance assessment, the embeddings themselves are not relevant: the distribution of the detection scores (rather, the order of the scores) is a complete description of the detection error trade-off (DET) behavior of a given system [19]. Second, note that PLDA is a generative model — it allows sampling new 'speakers' in the $\mathbf{y}$-space. We want to fit a PLDA model whose score generation mechanism produces distributions similar to the given empirical scores.

To this end, we first note that PLDA is heavily over-parameterized from the perspective of LLR score order preservation. A centered PLDA model ($\mathbf{b} = \mathbf{0}$) uses $D(D+1)/2$ parameters for each of the matrices $\mathbf{B}$ and $\mathbf{W}$, totaling $D^2 + D$ [15]. In fact, we need only $D$ numbers. Note that any invertible linear transformation of the feature space leaves the order of scores unchanged; hence, it does not alter a DET curve. We can therefore perform simultaneous diagonalization [20, 21] of the within-class and between-class covariance matrices such that (i) $\mathbf{B}$ becomes an identity matrix and (ii) $\mathbf{W}$ becomes diagonal: $\mathbf{W} = \mathrm{diag}(d_1, \ldots, d_D)$. Therefore, a PLDA model can be defined through $D$ nonnegative numbers. We use this minimal parametrization in our experiments.
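The simultaneous diagonalization step can be sketched as follows, using the standard Cholesky-plus-eigendecomposition recipe; the matrices here are random stand-ins:

```python
import numpy as np

def simdiag(B, W):
    """Simultaneous diagonalization of the between-class (B) and
    within-class (W) covariances: returns T such that T B T^T = I
    and T W T^T = diag(d_1, ..., d_D).  Being an invertible linear
    map, T leaves the order of LLR scores (hence the DET curve) intact."""
    L = np.linalg.cholesky(B)            # B = L L^T
    Linv = np.linalg.inv(L)
    M = Linv @ W @ Linv.T                # W expressed in the B-whitened space
    d, V = np.linalg.eigh(M)             # M = V diag(d) V^T, V orthogonal
    T = V.T @ Linv
    return T, d

# Sanity check on random symmetric positive-definite matrices:
rng = np.random.default_rng(1)
D = 4
A1, A2 = rng.normal(size=(D, D)), rng.normal(size=(D, D))
B, W = A1 @ A1.T + D * np.eye(D), A2 @ A2.T + D * np.eye(D)
T, d = simdiag(B, W)
assert np.allclose(T @ B @ T.T, np.eye(D))
assert np.allclose(T @ W @ T.T, np.diag(d))
```

After applying `T`, the model is fully specified by the $D$ nonnegative diagonal entries `d`, which is the minimal parametrization used in the experiments.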
3. Extrapolating false alarm rates
With the above preliminaries, we are now set to present models to produce predictions of $P^N_{\text{FA}}(\tau)$. We consider two different types of models. Our previous model [3] is a special case of the location-scale model described below, while the PLDA-based model is a new proposal. Both models serve to approximate the distribution of sets of scores between a random target speaker and the closest impostor selected from a random set of $N$ impostors. These sets of scores can be viewed as outcomes of the generative process in Algorithm 1.

Algorithm 1:
for $i = 1, \ldots, T$ do
    Sample a random enrolled (target) speaker, $\mathbf{y}^{(i)}_e$.
    Sample $N$ random test speakers, $\mathbf{y}^{(i)}_{t,1}, \mathbf{y}^{(i)}_{t,2}, \ldots, \mathbf{y}^{(i)}_{t,N}$.
    Find the closest speaker $\mathbf{y}^{(i)}_{t,k}$, where $k = \arg\max_j \mathrm{sim}(\mathbf{y}^{(i)}_e, \mathbf{y}^{(i)}_{t,j})$.
    Sample scores $\mathcal{S}_i = \{s_\ell\}_{\ell=1}^{L_i}$ between $\mathbf{y}^{(i)}_e$ and $\mathbf{y}^{(i)}_{t,k}$.
end for

Here, $\mathrm{sim}(\cdot, \cdot)$ is any speaker similarity measure. Since explicit speaker representations are not available in the general case, the similarity function has to be computed from a set of speaker-pair specific scores. This case includes estimating $P^N_{\text{FA}}(\tau)$ from empirical scores. We use the mean value of the scores as a similarity measure. Given a sampled set of score sets $\{\mathcal{S}_1, \ldots, \mathcal{S}_T\}$, we compute the corresponding MC estimates of the speaker-pair conditioned FA rates $\{P^{(1)}_{\text{FA}}(\tau), \ldots, P^{(T)}_{\text{FA}}(\tau)\}$ — the individual terms of the sum in (4). Averaging them yields an estimate of $P^N_{\text{FA}}(\tau)$. We now describe two generative models that allow sampling scores according to Algorithm 1 for an arbitrary $N$. Each model can be trained on sets of speaker-pair specific ASV scores and further be used for FA rate extrapolation.

Our first family of models assumes the distribution of between-speaker scores for a given pair of speakers to be a scaled and shifted version of some base distribution defined by its cumulative distribution function (CDF).
Our earlier model [3] assumes a Gaussian base distribution. The following generalized algorithm generates a set of between-speaker scores for a given $N$:

1. Sample $N$ pairs of location-scale values $\{(\mu_j, \sigma_j)\}_{j=1}^{N}$.
2. Find the largest location parameter $\mu_k = \max_j \{\mu_j\}$.
3. Sample scores $\mathcal{S}_i = \{s_\ell\}_{\ell=1}^{L_i}$ by $s_\ell = \mu_k + \sigma_k F^{-1}(u_\ell)$, where $u_\ell \sim U[0, 1]$ is uniformly distributed.

Here, $F(\cdot)$ is the CDF of the base distribution of scores. The algorithm uses inverse transform sampling [22] to generate scores from the underlying distribution. Each pair $(\mu_j, \sigma_j)$ parameterizes the distribution of scores between a fixed target speaker and the $j$th impostor. The model also assumes that the closest impostor has the largest location parameter $\mu_j$. One limitation of the model in [3] is the unrealistic assumption of Gaussian between-speaker scores. Here, $F(\cdot)$ is allowed to be arbitrary. In practice, we use torchpwl (https://pypi.org/project/torchpwl/) to define a piece-wise linear function with a monotonicity constraint for CDF approximation.

The above location-scale family of models represents speakers indirectly, through their relative similarities defined by between-speaker score distributions. The model described next uses, instead, latent identity variables to represent individual speakers explicitly. This gives an alternative predictor of $P^N_{\text{FA}}(\tau)$ based on PLDA.

A PLDA model with known parameters $\theta_{\text{plda}}$ can be used to generate LLR scores, as follows:

1. Sample a pair of enrollment and test latent identity variables $(\mathbf{y}_e, \mathbf{y}_t)$ from the prior: $\mathbf{y}_e \sim \mathcal{N}(\mathbf{0}, \mathbf{B})$ and $\mathbf{y}_t = \mathbf{y}_e$ under $H_1$; or draw a second sample $\mathbf{y}_t \sim \mathcal{N}(\mathbf{0}, \mathbf{B})$ under $H_0$.
2. Sample a pair of enrollment and test feature vectors $\boldsymbol{\phi}_e \sim \mathcal{N}(\mathbf{y}_e, \mathbf{W})$, $\boldsymbol{\phi}_t \sim \mathcal{N}(\mathbf{y}_t, \mathbf{W})$, conditioned on the latent identity variables from the first step.
3. Compute the LLR score $s = s(\boldsymbol{\phi}_e, \boldsymbol{\phi}_t)$ using (6).

Note that the first two steps are stochastic, while the LLR score is a deterministic function of the sampled pair of feature vectors and the PLDA model.
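Putting the pieces together, the following sketch samples worst-case scores from a two-covariance PLDA model in the minimal parametrization (B = I, W = diag(d)), using the closed-form LLR for the centered model (stacking the two vectors and comparing a shared-identity covariance against an independent-identity one). The dimensionality, the number of impostors, and the values of `d` are arbitrary illustrations:

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(2)
D, N, L = 5, 50, 10                 # latent dim, impostors, scores per pair
d = rng.uniform(0.5, 2.0, size=D)   # minimal parametrization: B = I, W = diag(d)
B, W = np.eye(D), np.diag(d)

# Closed-form two-covariance LLR (centered model, b = 0): stack
# z = [phi_e; phi_t]; under H1 the latent identity is shared, so the
# off-diagonal blocks equal B; under H0 the identities are independent.
S1 = np.block([[B + W, B], [B, B + W]])
S0 = np.block([[B + W, np.zeros((D, D))], [np.zeros((D, D)), B + W]])

def llr(phi_e, phi_t):
    z = np.concatenate([phi_e, phi_t])
    return multivariate_normal.logpdf(z, cov=S1) - multivariate_normal.logpdf(z, cov=S0)

# Algorithm 1 instantiated: one target, N impostors, keep the closest one,
# using the LLR itself as the similarity measure.
y_e = rng.multivariate_normal(np.zeros(D), B)             # target identity
y_imp = rng.multivariate_normal(np.zeros(D), B, size=N)   # impostor identities
phi_e = rng.multivariate_normal(y_e, W)                   # enrollment vector
sims = [llr(phi_e, rng.multivariate_normal(y, W)) for y in y_imp]
y_closest = y_imp[int(np.argmax(sims))]

# Speaker-pair specific scores between the target and its closest impostor:
scores = [llr(rng.multivariate_normal(y_e, W),
              rng.multivariate_normal(y_closest, W)) for _ in range(L)]
```

Repeating this for $T$ targets and averaging the per-pair FA rates as in (4) yields the model-based estimate of $P^N_{\text{FA}}(\tau)$ for any chosen $N$.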
Under the $H_0$ hypothesis, this generative procedure yields scores of zero-effort impostors. Following Algorithm 1, it can be extended to sample $N > 1$ impostors for a given target speaker. To select the closest impostor, we use the LLR score (6) as a similarity measure between speakers. Given the identity variable of the closest impostor, one can sample a set of speaker-pair specific scores by repeating steps 2 and 3 in the algorithm above. To increase flexibility, we include a learnable monotonic warping function applied to the scores generated by the model.

We now describe a method for training the generative models introduced above. The training data is a set of sets of between-speaker scores produced by any ASV system (black-box). Generally, there are at least three alternative approaches to construct a regressor for predicting $P^N_{\text{FA}}(\tau)$, given $N$ and $\tau$. The first is to use any standard general-purpose regression technique to match model predictions with empirical estimates of $P^N_{\text{FA}}(\tau)$ computed with (4) from the empirical scores. Despite the apparent simplicity and attractiveness of such an approach, and due to the lack of task-specific constraints, such models are exposed to a greater risk of failure for large values of $N$ [10].

The second approach is to follow a two-stage strategy: first, train a generative model of scores and use it to generate nontarget scores following Algorithm 1; next, use the generated sets of scores to estimate $P^N_{\text{FA}}(\tau)$ using (4). We used this approach in [3], where a location-scale model with a Gaussian base distribution was trained to maximize the model log-likelihood [17]. Different from [3], the models proposed in this work are instances of implicit generative models: they are specified through a forward stochastic procedure for data generation, but do not allow direct likelihood evaluation [23, 24].
Even though implicit generative models can be trained using a plethora of methods different from ML estimation (see [23] and [25]), it is non-trivial to design a training algorithm for models whose training set is a set of sets (see, e.g., [26]), as is the case here.

In the last approach, a generative model is also included in the prediction pipeline, but it is trained discriminatively by comparing the model-based estimates of $P^N_{\text{FA}}(\tau)$ against the corresponding empirical estimates (treated as ground-truth). The regressor is trained by minimizing the mean square error (MSE) between the empirical and model-based false alarm rates. To address the lack of differentiability, we replace the unit step function in (4) by a sigmoid function with a scaled argument. Also, the argmax function which appears in the PLDA score generating algorithm was replaced by its approximation, computed as a weighted sum of the speaker identity variables, where the weights are softmax-normalized similarities to the target speaker. This is similar to the so-called soft-attention mechanism introduced in [27].

In contrast to purely generative training aimed at approximating the distribution of scores, discriminative training directly optimizes the final regression target. Using a restricted class of regression functions, in turn, allows us to keep the extrapolated values within the range of reasonable expectations.

The resulting objective function (MSE in our experiments) includes random sampling and can be viewed as a nested Monte-Carlo estimate of the expected loss. Generally, such MC estimates are biased for any finite $T$ [28], but they are useful for training via stochastic optimization, provided that $T$ is sufficiently large. We used the Adam [29] optimizer with mini-batches to train both models.

4. Experiments

We closely follow the experimental setup of [3].
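The two differentiable relaxations described in Section 3 can be sketched as follows; the temperature values are hypothetical and do not reproduce the paper's actual scaling constants:

```python
import numpy as np

def soft_fa_rate(scores, tau, temp=0.1):
    """Differentiable surrogate for the FA rate in Eq. (4): the unit step
    I{s > tau} is replaced by a sigmoid with a scaled argument."""
    scores = np.asarray(scores)
    return np.mean(1.0 / (1.0 + np.exp(-(scores - tau) / temp)))

def soft_closest_impostor(y_imp, sims, temp=0.1):
    """Differentiable surrogate for selecting the closest impostor:
    a softmax-weighted sum of identity variables, akin to the
    soft-attention mechanism of [27]."""
    sims = np.asarray(sims) / temp
    w = np.exp(sims - sims.max())    # numerically stable softmax
    w /= w.sum()
    return w @ np.asarray(y_imp)     # weighted sum of identity vectors

# As temp shrinks, both surrogates approach their hard counterparts:
scores = np.array([-2.0, -1.0, 0.5, 1.5])
assert abs(soft_fa_rate(scores, 0.0, temp=0.1) - 0.5) < 0.01
```

In an actual implementation these operations would be written with an autodiff framework (e.g. PyTorch tensors) so that gradients of the MSE objective can flow through both the FA-rate estimator and the impostor selection.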
We combine the VoxCeleb1 [30] and VoxCeleb2 [31] corpora to obtain a dataset with a large number of speakers and a sufficient number of utterances per speaker, as needed for reliable estimation of $P^N_{\text{FA}}$. The resulting dataset has more than 100 utterances per speaker, on average. The data was divided into three disjoint sets. The first set was used to train the ASV systems. The second is the standard VoxCeleb1 evaluation protocol [32], used as a sanity check of our ASV systems (see [3] for details). The third set, containing both male and female speakers, was used to compute scores for training the models for $P^N_{\text{FA}}$ extrapolation. We computed similarity scores for each unique speaker pair of the same gender. To this end, we randomly selected 18 utterances per speaker to obtain at least three hundred scores per pair ($18^2 = 324$), which we assume to be sufficient to represent speaker-pair specific score distributions.

We used two standard ASV systems, based on i-vectors and x-vectors, to compute the ASV scores used in our experiments. Due to space limitations, we present results only for the x-vector system; the key conclusions, however, are similar for the i-vector system. For more details on the ASV systems and setup, refer to [3].

We computed empirical and model-based estimates of the worst-case false alarm rates with $N$ impostors, $P^N_{\text{FA}}$, by randomly selecting a target speaker $T = 1000$ times in Algorithm 1. Fig. 1 shows the estimates obtained with different models. The three groups of curves correspond to different choices of the ASV threshold, $\tau$. As detailed in [3], these thresholds are the minimizers of three different detection cost functions (DCFs). The first DCF has a high cost for misses, the second has equal costs for misses and false alarms, and the last penalizes false alarms more heavily.
The empirical curves extend up to $N = 1000$ impostors (as we have exhausted all data), while the extrapolated regression curves for larger $N$ may be used to speculate about the range of values of $P^N_{\text{FA}}$ for large impostor populations. For instance, an ASV system with a low zero-effort false alarm rate may still exhibit a high worst-case false alarm rate for very large $N$. That is, if the attacker has a speech sample of the target speaker and access to a proxy ASV system with accuracy comparable to the attacked one, the chance of the closest impostor from a large population being accepted may be substantial.

To objectively assess the quality of the models' forecasts, we measure the mean absolute error (MAE) on the extrapolated values of $P^N_{\text{FA}}$ for a held-out set of larger $N$, while the corresponding empirical values (treated as ground-truth and computed according to (4)) were unseen by the models during training. Specifically, the inputs in the training data were formed as pairs $(N, \tau)$ sampled uniformly over the training range of $N$ and $[\tau_{\min}, \tau_{\max}]$, where the range of thresholds is determined according to the range of empirical scores. The held-out set was created similarly, but with a different (larger) range of $N$. The results summarised in Table 1 indicate that more flexible models produce more accurate predictions. For instance, using a learnable base distribution instead of a Gaussian decreases the MAE for location-scale models, and both models benefit from score warping. The location-scale and PLDA models have comparable accuracy. Importantly, both provide a substantial improvement over the earlier, purely generative model [3].
Figure 1:
Worst-case false alarm estimates for male scores given by the x-vector system. Two new models and the model from [3] are shown along with the empirical estimates. The estimates are shown together with their 99% confidence intervals.
Table 1:
Extrapolation performance for different models in terms of MAE computed on the held-out set. '+' indicates that a learnable score warping function was included in the model. We found that PLDA with a 10-dimensional feature space produces the best results.

Model                                        MAE, %
Location-scale (Gaussian), generative [3]    8.34
Location-scale (Gaussian)                    1.34
Location-scale (Gaussian, +)                 0.57
Location-scale (general CDF)                 0.67
Location-scale (general CDF, +)              0.48
PLDA (D = 10)                                1.18
PLDA (D = 10, +)                             0.39
5. Conclusions
We advanced our recent work [12, 3] on worst-case impostors in the context of ASV. Specifically, we introduced new tools for performance extrapolation of ASV systems. The models operate in the detection score space and are therefore applicable outside the scope of ASV too. Our results indicate a substantial improvement over our previous model [3].

In future work, we may relax our worst-case impostor assumption, for instance so that the attacker fails to identify the closest impostor. More generally, the usual assumption in adversarial machine learning that the attacker knows everything about the attacked system is potentially overly pessimistic.
6. Acknowledgements
This work was supported in part by the Academy of Finland(Proj. No. 309629).
7. References

[1] A. Nautsch, C. Rathgeb, R. Saeidi, and C. Busch, "Entropy analysis of i-vector feature spaces in duration-sensitive speaker recognition," April 2015, pp. 4674–4678.
[2] K. Takahashi and T. Murakami, "A measure of information gained through biometric systems," Image and Vision Computing, vol. 32, no. 12, pp. 1194–1203, 2014.
[3] A. Sholokhov, T. Kinnunen, V. Vestman, and K. A. Lee, "Voice biometrics security: Extrapolating false alarm rate via hierarchical Bayesian modeling of speaker verification scores," Computer Speech & Language, vol. 60, p. 101024, 2020.
[4] Z. Wu, J. Yamagishi, T. Kinnunen, C. Hanilçi, M. Sahidullah, A. Sizov, N. Evans, M. Todisco, and H. Delgado, "ASVspoof: The automatic speaker verification spoofing and countermeasures challenge," IEEE Journal of Selected Topics in Signal Processing, vol. 11, no. 4, pp. 588–604, June 2017.
[5] C. S. Greenberg, L. P. Mason, S. O. Sadjadi, and D. A. Reynolds, "Two decades of speaker recognition evaluation at the National Institute of Standards and Technology," Computer Speech & Language, vol. 60, p. 101032, 2020.
[6] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus, "Intriguing properties of neural networks," in International Conference on Learning Representations, 2014. [Online]. Available: http://arxiv.org/abs/1312.6199
[7] B. Biggio and F. Roli, "Wild patterns: Ten years after the rise of adversarial machine learning," Pattern Recognition, vol. 84, pp. 317–331, 2018.
[8] F. Kreuk, Y. Adi, M. Cisse, and J. Keshet, "Fooling end-to-end speaker verification with adversarial examples," 2018, pp. 1962–1966.
[9] R. K. Das, X. Tian, T. Kinnunen, and H. Li, "The attacker's perspective on automatic speaker verification: An overview," 2020.
[10] G. S. Martius and C. Lampert, "Extrapolation and learning equations," 2017.
[11] S. Prince, P. Li, Y. Fu, U. Mohammed, and J. H. Elder, "Probabilistic models for inference about identity," IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 1, pp. 144–157, 2012.
[12] V. Vestman, T. Kinnunen, R. G. Hautamäki, and M. Sahidullah, "Voice mimicry attacks assisted by automatic speaker verification," Computer Speech & Language, vol. 59, pp. 36–54, 2020.
[13] P. Kenny, "Bayesian speaker verification with heavy-tailed priors," in Odyssey 2010: The Speaker and Language Recognition Workshop, Brno, Czech Republic, June 2010.
[14] N. Brümmer and E. de Villiers, "The speaker partitioning problem," in Odyssey 2010: The Speaker and Language Recognition Workshop, Brno, Czech Republic, June 2010, p. 34.
[15] A. Sizov, K. Lee, and T. Kinnunen, "Unifying probabilistic linear discriminant analysis variants in biometric authentication," in Structural, Syntactic, and Statistical Pattern Recognition — Joint IAPR International Workshop, S+SSPR 2014, Joensuu, Finland, August 2014, pp. 464–475.
[16] A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," Journal of the Royal Statistical Society, Series B, vol. 39, no. 1, pp. 1–38, 1977.
[17] C. M. Bishop, Pattern Recognition and Machine Learning (Information Science and Statistics). Berlin, Heidelberg: Springer-Verlag, 2006.
[18] J. Rohdin, S. Biswas, and K. Shinoda, "Discriminative PLDA training with application-specific loss functions for speaker verification," in Odyssey 2014: The Speaker and Language Recognition Workshop, Joensuu, Finland, June 2014, pp. 26–32.
[19] N. Brümmer, "Measuring, refining and calibrating speaker and language information extracted from speech," Ph.D. dissertation, Stellenbosch University, 2010.
[20] K. Fukunaga, Introduction to Statistical Pattern Recognition (2nd Ed.). USA: Academic Press Professional, Inc., 1990.
[21] Y. Wang, H. Xu, and Z. Ou, "Joint Bayesian Gaussian discriminant analysis for speaker verification," IEEE, 2017, pp. 5390–5394.
[22] L. Devroye, Non-Uniform Random Variate Generation. New York, NY, USA: Springer-Verlag, 1986.
[23] S. Mohamed and B. Lakshminarayanan, "Learning in implicit generative models," CoRR, vol. abs/1610.03483, 2016. [Online]. Available: http://arxiv.org/abs/1610.03483
[24] I. J. Goodfellow, "NIPS 2016 tutorial: Generative adversarial networks," CoRR, vol. abs/1701.00160, 2017. [Online]. Available: http://arxiv.org/abs/1701.00160
[25] G. Louppe, J. Hermans, and K. Cranmer, "Adversarial variational optimization of non-differentiable simulators," in Proceedings of Machine Learning Research, K. Chaudhuri and M. Sugiyama, Eds., vol. 89. PMLR, April 2019, pp. 1438–1447.
[26] C.-L. Li, M. Zaheer, Y. Zhang, B. Poczos, and R. Salakhutdinov, "Point cloud GAN," arXiv preprint arXiv:1810.05795, 2018.
[27] D. Bahdanau, K. Cho, and Y. Bengio, "Neural machine translation by jointly learning to align and translate," arXiv preprint arXiv:1409.0473, 2014.
[28] T. Rainforth, R. Cornish, H. Yang, A. Warrington, and F. Wood, "On nesting Monte Carlo estimators," in Proceedings of the 35th International Conference on Machine Learning, J. Dy and A. Krause, Eds., vol. 80. Stockholm, Sweden: PMLR, July 2018, pp. 4267–4276.
[29] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[30] A. Nagrani, J. S. Chung, and A. Zisserman, "VoxCeleb: A large-scale speaker identification dataset," Proc. Interspeech 2017, pp. 2616–2620, 2017.
[31] J. S. Chung, A. Nagrani, and A. Zisserman, "VoxCeleb2: Deep speaker recognition," in Interspeech, 2018.
[32] J. S. Chung, A. Nagrani, E. Coto, W. Xie, M. McLaren, D. A. Reynolds, and A. Zisserman, "VoxSRC 2019: The first VoxCeleb speaker recognition challenge," 2019.