A Speaker Verification Backend with Robust Performance across Conditions
Luciana Ferrer^a, Mitchell McLaren^b, Niko Brümmer^c
^a Instituto de Investigación en Ciencias de la Computación (ICC), CONICET-UBA, Argentina
^b Speech Technology and Research Lab (StarLab), SRI International, USA
^c Omilia Conversational Intelligence, Athens, Greece
Abstract
In this paper, we address the problem of speaker verification in conditions unseen or unknown during development. A standard method for speaker verification consists of extracting speaker embeddings with a deep neural network and processing them through a backend composed of probabilistic linear discriminant analysis (PLDA) and global logistic regression score calibration. This method is known to result in systems that work poorly on conditions different from those used to train the calibration model. We propose to modify the standard backend, introducing an adaptive calibrator that uses duration and other automatically extracted side-information to adapt to the conditions of the inputs. The backend is trained discriminatively to optimize binary cross-entropy. When trained on a number of diverse datasets that are labeled only with respect to speaker, the proposed backend consistently and, in some cases, dramatically improves calibration, compared to the standard PLDA approach, on a number of held-out datasets, some of which are markedly different from the training data. Discrimination performance is also consistently improved. We show that joint training of the PLDA and the adaptive calibrator is essential: the same benefits cannot be achieved when freezing PLDA and fine-tuning the calibrator. To our knowledge, the results in this paper are the first evidence in the literature that it is possible to develop a speaker verification system with robust out-of-the-box performance on a large variety of conditions.
Keywords:
Speaker Verification, Probabilistic Linear Discriminant Analysis, Robust Calibration
1. Introduction

The task of speaker verification (SV) is to determine whether two sets of speech samples originate from a common speaker, or from two different speakers. SV systems output a score for each comparison, usually called a trial, which can be used in different ways depending on the application. In some cases, like in surveillance applications where the goal is to search for certain target speakers within a large database of speech samples, the scores can be used to obtain a sorted list of potential matches, from most to least likely. This list can then be reviewed by a person to make the final decisions. In other cases, like in authentication scenarios where the system is used to automatically verify whether a person is who they claim to be, a hard decision has to be made for each comparison. This means that the system has to apply a threshold to the scores so that only trials with a score above that threshold are declared same-speaker trials.

The selection of such a threshold is not a trivial matter. A system with excellent discrimination between same- and different-speaker trials could nevertheless produce extremely poor decisions, on average, if the wrong threshold is used. If a data set that matches the conditions of the test samples is available for development, the selection of the threshold can be done using that data. Yet, in many cases, no well-matched development data is available. In these cases, a threshold selected on some potentially mismatched development set would have to be used. Unfortunately, for modern SV systems, a threshold selected on a certain dataset is unlikely to be suitable for a different dataset, even when conditions are seemingly very similar (Nandwana et al., 2019).

As a consequence of this difficulty in the choice of threshold, SV systems can very rarely be used out-of-the-box. While automatic speech recognition (ASR) systems may be sub-optimal on conditions for which they have not been trained, in most cases they still return some reasonable output. On the other hand, SV systems with a threshold selected on conditions mismatched to the test conditions may make, on average, worse-than-random choices. Interestingly, in most cases, reasonably good performance can be achieved

(What do we mean by conditions? They are a collection of (often hidden) factors that differ between different speech samples and that can have the unfortunate effect of also shifting the distributions of speaker verification scores. See Section 4 below for a more detailed discussion.)

A calibrated binary classifier maps its input to a posterior probability,

data ↦ P(class 1 | data) = 1 − P(class 2 | data).

In speaker verification, where class priors can vary widely between training and a variety of different applications at run-time, the log-likelihood-ratio (LLR) format of calibration is preferred. The speaker verifier maps its input to the LLR as defined below:

data ↦ LLR = log [ P(data | H_s) / P(data | H_d) ],   (1)

where 'data' refers to all of the speech inputs to a speaker verification trial (or some processed version thereof), while H_s is the same-speaker hypothesis and H_d is the different-speakers hypothesis. The output of a well-calibrated classifier can be used to make minimum-expected-cost Bayes decisions.
For the LLR format, the Bayes decision can be written as

argmin_{d ∈ D} P(H_s | data) C(d | H_s) + P(H_d | data) C(d | H_d)
  = argmin_{d ∈ D} σ(LLR + γ) C(d | H_s) + σ(−LLR − γ) C(d | H_d),   (2)

where σ(x) = 1 / (1 + e^{−x}) is the sigmoid function, γ = log [P(H_s) / P(H_d)] is the prior log odds, and C(d | h) is the cost of making decision d when h is the true class. Typically, one takes D = {accept, reject}, where 'accept' means deciding in favor of H_s and 'reject' deciding in favor of H_d. If correct decisions have zero cost, C(accept | H_s) = C(reject | H_d) = 0, then equation (2) reduces to comparing the LLR to a threshold t that is a function of γ and C (see, e.g., Brümmer and du Preez, 2006):

t = log [ C(accept | H_d) / C(reject | H_s) ] − γ.   (3)

In summary, for a well-calibrated system where equation (1) holds for the test conditions, optimal decisions can be made using a threshold that solely depends on the cost function and the hypothesis prior. Hence, if we could develop a system that achieves robust calibration across conditions, we would, at the same time, solve the problem of threshold selection on unseen data. Also, for the same price, we would obtain scores that are interpretable as LLRs, which is essential in certain applications like forensic voice comparison.

In this paper we propose an SV system that achieves excellent out-of-the-box calibration performance across a wide variety of conditions and, hence, does not require matched development data for the selection of the threshold for every new test set. As a side-effect, the system also improves discrimination on some datasets for which miscalibration is occurring within the set because they are composed of different conditions. The proposed approach modifies the standard probabilistic linear discriminant analysis (PLDA) backend (Kenny, 2010) by adding condition-dependent calibration components. The initial idea was proposed in our prior work (Ferrer and McLaren, 2020a,b). In this paper, we propose several improvements to the original method, including adding a duration-dependent calibration stage and an improved training procedure. We provide a detailed analysis of performance for text-independent speaker verification, including new datasets not considered in our previous work, comparing different architectures and training data subsets, and showing the convergence behavior of our models. To our knowledge, no other system in the literature has been shown to give a comparable level of robustness in terms of actual speaker verification performance across a wide variety of conditions.

The code used to run the experiments in this paper and some example runs can be found in https://github.com/luferrer/DCA-PLDA.
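For concreteness, the following sketch (Python with NumPy; the function names are ours, not part of any toolkit) applies equation (3) to turn a cost function and a target prior into a decision threshold for LLR scores, assuming zero cost for correct decisions.

```python
import numpy as np

def bayes_threshold(p_target, c_miss=1.0, c_fa=1.0):
    """Threshold of equation (3): t = log(C_fa / C_miss) - log(P(Hs) / P(Hd)),
    where C_fa = C(accept|Hd), C_miss = C(reject|Hs) and correct decisions cost zero."""
    prior_log_odds = np.log(p_target) - np.log(1.0 - p_target)
    return np.log(c_fa / c_miss) - prior_log_odds

def bayes_decisions(llrs, p_target, c_miss=1.0, c_fa=1.0):
    """Accept the same-speaker hypothesis whenever the LLR exceeds the Bayes threshold."""
    return np.asarray(llrs) > bayes_threshold(p_target, c_miss, c_fa)

llrs = np.array([-3.2, 0.4, 2.7])
print(bayes_threshold(p_target=0.01))          # ~4.6 nats for a low target prior
print(bayes_decisions(llrs, p_target=0.01))    # [False False False]
print(bayes_decisions(llrs, p_target=0.5))     # [False  True  True]
```

With a low target prior such as 0.01 the threshold is high (about 4.6 nats), so only trials with strong same-speaker evidence are accepted; with equal priors and costs the threshold is simply zero.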
2. Prior Work
Most current speaker verification systems are composed of a cascade of multiple stages. First, frame-level acoustic features that represent the short-time contents of the signal are extracted, typically one feature vector every 10 ms. These variable-length sequences of frame-level features are input to a deep neural network (DNN) which is trained to optimize speaker classification performance on the training dataset. The DNN uses a temporal pooling layer so that it can represent the speaker information in the variable-length input as a new feature of fixed dimension (typically 512) that is termed the speaker embedding (e.g., Snyder et al., 2016). The speaker embeddings are harvested from a hidden layer in the DNN, after the temporal pooling but before the classifier. The speaker embeddings are then typically transformed using linear discriminant analysis (LDA), then mean- and, sometimes, variance-normalized and, finally, L2 length-normalized. Next, probabilistic linear discriminant analysis (PLDA) is used to obtain scores for each speaker verification trial. Finally, if required, a calibration stage can be used to convert the scores produced by PLDA into log-likelihood-ratios (LLRs) that can be used to make cost-effective Bayes decisions in the intended application.

Although many different calibration strategies have been explored in the literature (e.g., Brümmer et al., 2014), a common solution is a simple, discriminatively trained affine transformation from scores to LLRs. The parameters of this transformation (a scale and an offset) are trained to minimize a weighted binary cross-entropy objective which measures the ability of the calibrated scores to make cost-effective Bayes decisions when they are interpreted as LLRs (Brümmer and du Preez, 2006). Assuming the calibration training data reflects the evaluation conditions, this procedure has been repeatedly shown to provide hard-to-beat performance on a wide range of datasets. Yet, when evaluation conditions differ from those present in the calibration training data, the average performance of the hard decisions made with the system can be very poor, sometimes even worse than that of random decisions.

Several approaches have been proposed in the speaker verification literature which take into account the signal's conditions in different ways in the calibration stage in order to achieve robustness across conditions. In some cases, the side-information (some representation of the conditions of a signal) was assumed to be known or estimated separately during testing (Ferrer et al., 2005; Solewicz and Koppel, 2005, 2007; Ferrer et al., 2008; Mandasari et al., 2013, 2015; Nautsch et al., 2016) and calibration parameters were conditioned on these discrete or continuous side-information values. The Focal Bilinear toolkit (Brümmer, 2008) implements a version of side-information-dependent calibration where the calibrated score is a bilinear function of the scores and the side-information vector, which is assumed to be composed of numbers between 0 and 1. More recently, we proposed an approach called trial-based calibration (TBC) where calibration parameters are trained independently for each trial using a subset of the development data (McLaren et al., 2014; Ferrer et al., 2019) selected using a model trained to estimate the similarity between the conditions of two samples. This approach, while successful, is computationally expensive.
Another approach is proposed by Tan and Mak (2017), where the calibration stage jointly learns to calibrate and to estimate the SNR level and duration of the training signals, implicitly using the SNR and duration information to determine the calibration parameters.

Our proposed approach follows the spirit of these earlier methods, tackling the problem of robustness in calibration by using a representation of the signals' conditions to determine the calibration parameters for each trial. The main differences between our method and the previous methods are (1) that the representation of the signal's conditions is estimated automatically within the model, without the need for data labeled with respect to condition in training or testing, and (2) that the full backend, including the score generation and the calibration components, is learned jointly and discriminatively to optimize speaker verification performance. The approach, first proposed in (Ferrer and McLaren, 2020a,b) and extended here, is a backend that takes speaker embeddings, for example x-vectors (Snyder et al., 2016), as input and outputs calibrated scores for each trial. The backend has the same functional form as the usual PLDA backend, including the calibration stage, but it is trained discriminatively to optimize a weighted binary cross-entropy. Further, the calibration stage is made to depend on the duration of the signal and on a low-dimensional side-information vector that is derived from the same speaker embeddings and is meant to represent the condition of the signal other than the duration. We call this method discriminative condition-aware PLDA (DCA-PLDA).

The proposed method is related to recent papers (Snyder et al., 2016; Rohdin et al., 2018; Garcia-Romero et al., 2020) which also propose to use the binary cross-entropy as loss function during DNN training. The idea of discriminatively training a PLDA-like backend was first proposed by Burget et al. (2011). In that work, the parameters of the backend were trained using a support vector machine or linear logistic regression. More recently, Snyder et al. (2016) applied this idea to a DNN-based SV pipeline, integrating the embedding extractor and the backend together and training them jointly to optimize binary cross-entropy. A similar approach was presented by Ramoji et al. (2020), who proposed to use a loss function that approximates the detection cost function instead of binary cross-entropy. Rohdin et al. (2018) proposed to use an architecture that mimics the previous i-vector pipeline (Dehak et al., 2011) for speaker verification, pre-training all parameters separately and then fine-tuning the full model to minimize binary cross-entropy. Finally, Garcia-Romero et al. (2020) proposed to transform the scores produced as the dot product between the embeddings generated by a ResNet with an affine function trained to optimize binary cross-entropy. The weight parameter of this transform is given by a network that takes as input the pooling layer from the ResNet.

Most of the papers described above focus on discrimination performance, not including calibration performance in the results. As we show in this paper, though, discriminatively training a PLDA-like backend as done in those papers does not suffice to achieve good generalization in terms of calibration. One paper that shows calibration performance is the one by Garcia-Romero et al. (2020), where both discrimination and calibration performance are shown to improve using the proposed method over a global calibration model.
Yet, the calibration loss for some conditions is still very large. Our proposed method is similar to this method, with the main difference being that, in our case, the calibration stage depends on a low-dimensional estimate of the condition of the samples and on their duration which, according to our results, appears to be key for achieving good calibration generalization. Yet, several other characteristics, including training and test data, network architecture, and training procedure, differ between the two methods, which might also explain the difference in calibration performance.
3. Standard PLDA-based Backend
In the current state of the art in embedding-based speaker verification, two scoring approaches are common. One is cosine scoring, where the score is the dot product between two L2 length-normalized embeddings (Garcia-Romero et al., 2020). This approach has no trainable parameters in the scoring stage and instead relies on training the embedding extractor in a way that works well with cosine scoring. Although cosine scoring can achieve very good accuracy on matched test and training conditions, it is not equipped to deal with variable conditions and there is no easy way to adapt it to new conditions. The other way to score is to use a trainable PLDA-based backend. PLDA training requires less data than what is needed for training an embedding extractor and the computational requirements are orders of magnitude smaller. PLDA scoring is therefore an attractive way to adapt a speaker recognizer to new conditions. PLDA backends are typically composed of several stages, described in the following sections.

3.1. Pre-processing
First, linear discriminant analysis (LDA) is applied to reduce the dimension of the embeddings while emphasizing speaker information and reducing other irrelevant information. In addition to concentrating speaker information, the LDA stage also contributes to making the data more Gaussian: every output of the LDA projection is a sum of many relatively independent inputs, which tends to be more Gaussian. Then, each transformed dimension is mean- and variance-normalized and the resulting vectors are length-normalized, a procedure that has an even more powerful Gaussianization effect, which is essential for good performance of the subsequent Gaussian PLDA modeling stage (Garcia-Romero and Espy-Wilson, 2011). The pre-processing stage for an embedding x can be summarized by the following equation:

w = Norm(A_p x + m_p),   (4)

where A_p is the LDA projection matrix restricted to the first N dimensions and scaled to result in a variance of 1.0 in each dimension, m_p is the global mean of the data after multiplication with A_p, and Norm performs L2 length normalization.

3.2. PLDA

After pre-processing, PLDA (Ioffe, 2006; Prince, 2007) is used to compute a score for each trial. The two-covariance PLDA variant (Brümmer and De Villiers, 2010), widely used in speaker verification, assumes that each pre-processed embedding, w, can be modeled as

w = y + e,

where y and e are assumed to be independent and both Gaussian-distributed, so that

y ~ N(µ, B^{-1}),    w | y ~ N(y, W^{-1}),

where B is the between-speaker precision matrix and W is the within-speaker precision matrix.

The training procedure used to obtain point estimates for µ, B and W requires the use of an expectation-maximization (EM) algorithm (Brümmer, 2010a; Sizov et al., 2014). Initial values µ_0, B_0 and W_0 for the parameters are needed in order to jump start the EM algorithm. While random initialization works fine, a much better starting point can be achieved by using the sample estimates of the global mean and covariance matrices. Given a set of pre-processed vectors w_i, where sample i corresponds to speaker s_i, we compute the initial parameters as follows:

µ_s = (1 / N_s) Σ_{i | s_i = s} w_i,

µ_0 = [ Σ_s c_s Σ_{i | s_i = s} w_i ] / [ Σ_s c_s N_s ],   (5)

B_0 = [ ( Σ_s c_s N_s (µ_s − µ_0)(µ_s − µ_0)^T ) / ( Σ_s c_s N_s ) ]^{-1},   (6)

W_0 = [ ( Σ_s c_s Σ_{i | s_i = s} (w_i − µ_s)(w_i − µ_s)^T ) / ( Σ_s c_s N_s ) ]^{-1},   (7)

where i is an index over samples, s is an index over speakers, N_s is the number of samples for speaker s, and c_s is an arbitrary weight corresponding to speaker s which can be used to increase or decrease the influence of a speaker's data. As we will see, in our experiments, the weights are used to compensate for the severe imbalance across conditions that is present in our training data. Note that, for integer weights, this weighting procedure is equivalent to repeating the data from each speaker s, c_s times, taking each repetition as coming from a different speaker.

The EM formulation can also be changed easily to introduce the weights in the expectation step.
Specifically, the weights are introduced in equations (24) and (25) in (Brümmer, 2010a) by multiplying each term in each of those summations by the weight corresponding to speaker i (note that the subindex i in those equations is the speaker index).

Given a trial composed of a single enrollment and a single test sample with pre-processed embeddings w_1 and w_2, the PLDA score is defined as the following log-likelihood-ratio (LLR):

s = log [ P(w_1, w_2 | H_s) / P(w_1, w_2 | H_d) ],   (8)

where, as mentioned in Section 1, H_s and H_d are the hypotheses that the enrollment and test speakers are the same or different, respectively. The following closed-form solution can be derived for this LLR given the assumptions of the PLDA model:

s = 2 w_1^T Λ w_2 + w_1^T Γ w_1 + w_2^T Γ w_2 + w_1^T c + w_2^T c + k,   (9)

where Λ, Γ, c and k are given by

Λ = (1/2) W^T Λ̃ W,   (10)
Γ = (1/2) W^T (Λ̃ − Γ̃) W,   (11)
c = W^T (Λ̃ − Γ̃) B µ,   (12)
k = (1/2) k̃ + (1/2) (Bµ)^T (Λ̃ − 2Γ̃) (Bµ),   (13)

with

Λ̃ = (B + 2W)^{-1},
Γ̃ = (B + W)^{-1},
k̃ = −2 log|Γ̃| − log|B| + log|Λ̃| + µ^T B µ.

See (Cumani et al., 2013) for a derivation of these equations. Note, though, that the expression for k in that paper is missing a 1/2 multiplying k̃, and k̃ has wrong signs for two of the terms. These errors have been corrected in the equations above.

In summary, once the PLDA parameters have been trained, the score for each trial can be computed with a simple second-order polynomial expression of the pre-processed embeddings for the trial.

3.3. Calibration

As mentioned above, PLDA scores are computed as LLRs given the estimated PLDA model's parameters. Yet, since the assumptions made by PLDA do not exactly hold in practice, the scores produced by this model are usually badly calibrated. For this reason, the usual procedure is to post-process the PLDA scores using a calibration stage. The standard procedure for calibration in speaker verification is to use linear logistic regression, which applies an affine transformation to the scores, training the parameters to minimize binary cross-entropy (Brümmer and Doddington, 2013). The objective function to be minimized is given by

C_π = −(π / T) Σ_{i ∈ T} log(q_i) − ((1 − π) / N) Σ_{i ∈ N} log(1 − q_i),   (14)

where

q_i = σ(l_i + γ),   (15)
l_i = α s_i + β,   (16)

and where s_i is the score for trial i given by equation (9), σ is the sigmoid function, π = P(H_s) is a parameter reflecting the expected prior probability for same-speaker trials, γ = log(π / (1 − π)) is the prior log odds, and α and β are the calibration parameters to be optimized by minimizing the quantity in equation (14). Here, T and N denote the sets (and, in the denominators, the numbers) of same-speaker and different-speaker trials, respectively.

Since the cross-entropy is a proper scoring rule (Brümmer and Doddington, 2013), this procedure usually leads to well-calibrated scores on data that is similar to the data used to train the calibration model. On the other hand, this does not guarantee that calibration will be good on data from any other condition. Nandwana et al. (2019) showed the effect that a mismatch in several different characteristics of the signals has on calibration performance. In particular, they show that duration, distance to the microphone and language mismatch can cause a large degradation in calibration performance. Those results illustrate a very common phenomenon in speaker verification: applying a calibration model on data mismatched to that used to train the model can lead to severely miscalibrated scores. This is the problem we aim to tackle in this work.
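The closed-form score of equations (9)-(13) is straightforward to implement once µ, B and W have been estimated. The sketch below is a minimal NumPy illustration of these formulas (not the DCA-PLDA code base); it assumes B and W are the precision matrices defined above and that the embeddings have already been pre-processed as in equation (4).

```python
import numpy as np

def plda_score_params(mu, B, W):
    """Precompute Lambda, Gamma, c and k of equations (10)-(13) from the two-covariance
    PLDA point estimates: mu is the mean, B and W the between- and within-speaker
    precision matrices."""
    Lt = np.linalg.inv(B + 2.0 * W)        # Lambda-tilde = (B + 2W)^-1
    Gt = np.linalg.inv(B + W)              # Gamma-tilde  = (B + W)^-1
    Bmu = B @ mu
    Lam = 0.5 * W.T @ Lt @ W
    Gam = 0.5 * W.T @ (Lt - Gt) @ W
    c = W.T @ (Lt - Gt) @ Bmu
    k_tilde = (mu @ Bmu - np.linalg.slogdet(B)[1]
               + np.linalg.slogdet(Lt)[1] - 2.0 * np.linalg.slogdet(Gt)[1])
    k = 0.5 * k_tilde + 0.5 * Bmu @ (Lt - 2.0 * Gt) @ Bmu
    return Lam, Gam, c, k

def plda_llr(w1, w2, params):
    """Second-order polynomial score of equation (9) for a single trial of pre-processed
    embeddings w1 and w2."""
    Lam, Gam, c, k = params
    return (2.0 * w1 @ Lam @ w2 + w1 @ Gam @ w1 + w2 @ Gam @ w2
            + w1 @ c + w2 @ c + k)
```

Since the parameters only need to be computed once, scoring a full trial list reduces to a few matrix products over the stacked enrollment and test embeddings.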
4. Theoretical perspective on conditions
Thus far we have relied on an intuitive understanding of the concept of the conditions of speaker verification trials. In this section we present a generative modeling view that explains in more detail what we mean by conditions and how they interact with the PLDA model.

Consider the following graphical model for a speaker verification trial, which has as its input a pair of pre-processed x-vectors, w_1 and w_2, hypothesized to have been produced by two different speakers:

[Graphical model of an impostor trial: in the top row, the PLDA variables y_1, e_1 and y_2, e_2 generate w_1 and w_2; in the bottom row, the condition variables h_1, d_1, g_1 and h_2, d_2, g_2 also influence w_1 and w_2. Observed nodes are shaded, hidden nodes are clear.]

The top row represents the PLDA model, where y_1, y_2 are the hidden speaker identity variables, theoretical variables suggested by Kenny et al. (2007) that determine speakers' voices. We need two of them, for the two hypothesized speakers. The hidden variables e_1, e_2 represent additive Gaussian noise that models the fact that a given speaker sounds different from one occasion to the next. PLDA training and scoring algorithms marginalize over these hidden variables (see, e.g., Kenny, 2010; Brümmer and De Villiers, 2010).

Let us now allow for an additional set of variables (some hidden, some observed) that we collectively term the condition variables. They are shown in the bottom row. We refer to (h_i, d_i, g_i) as the conditions of side i ∈ {1, 2} of the trial and to (h_1, d_1, g_1, h_2, d_2, g_2) as the conditions of the whole trial. For the target trial hypothesis, where w_1, w_2 are assumed to have been produced by a common speaker, the graphical model changes to:

[Graphical model of a target trial: a single speaker identity variable y generates both w_1 and w_2, each with its own noise variable e_1, e_2 and condition variables (h_1, d_1) and (h_2, d_2), with a single shared g.]

Here we can still refer to the conditions of the trial as (h_1, d_1, g_1, h_2, d_2, g_2), with the understanding that g_1 = g_2 = g.

In PLDA, the e_i represent a collection of factors that influence the x-vectors but by themselves do not carry any speaker information. The h_i represent additional hidden factors of this nature that cannot be successfully modeled by PLDA. The hidden variables g_i denote a collection of factors that roughly determine the speaker's vocal qualities, for example gender and mother tongue. Again, the idea is that the g_i allow for factors that cannot be modeled well by PLDA. Finally, d_i is the observed duration of the speech segment from which w_i is extracted. (Note that, when w_i is not given, y_i is independent of (e_i, h_i); this independence can be read from the diagram, see for example Bishop (2006).)

To understand how additional hidden variables can generalize PLDA, consider a simplified example where g_i ∈ {male, female} and h_i ∈ {telephone, laptop}. This generalizes PLDA to a mixture model. Similarly, h_i → w_i ← g_i represent even more general mechanisms for the condition factors to influence the w vectors.

The classical PLDA training and scoring algorithms effectively marginalize over the y_i and e_i variables, at fixed, implicit values of the condition variables. The work in this paper effectively augments PLDA with mechanisms that can deal with variable conditions. These mechanisms, which will be detailed in the next section, can be understood in two different ways:

• Explicitly, we are discriminatively training a data-dependent calibrator of the PLDA scores.

• Implicitly, we are replacing the traditional combination of a PLDA model and a fixed post-calibrator with a more complex model that produces well-calibrated scores across conditions. The above graphical diagrams represent the new model.
Since this model is discriminatively trained without any explicit reference to the hidden variables, it is implicitly learning a scoring function where all the hidden variables (the PLDA ones and the new condition variables) are effectively marginalized.
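The generative story behind this view can be made concrete with a toy sampling sketch. The snippet below draws pairs of pre-processed embeddings from the two-covariance PLDA model, with optional additive offsets h1 and h2 standing in for condition factors that PLDA does not model; the dimensionality and covariances are arbitrary illustrative choices, not values used in our experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 4                         # toy embedding dimension
mu = np.zeros(dim)              # global mean
B_inv = 2.0 * np.eye(dim)       # between-speaker covariance (B is the precision matrix)
W_inv = 0.5 * np.eye(dim)       # within-speaker covariance (W is the precision matrix)

def sample_trial(same_speaker, h1=0.0, h2=0.0):
    """Draw a pair of pre-processed embeddings (w1, w2) from the two-covariance story:
    y ~ N(mu, B^-1) is the speaker identity variable and e ~ N(0, W^-1) the within-speaker
    noise. h1 and h2 are toy condition offsets (e.g., channel effects) that PLDA does not
    model; nonzero values shift the embeddings and, with them, the score distribution."""
    y1 = rng.multivariate_normal(mu, B_inv)
    y2 = y1 if same_speaker else rng.multivariate_normal(mu, B_inv)
    w1 = y1 + rng.multivariate_normal(np.zeros(dim), W_inv) + h1
    w2 = y2 + rng.multivariate_normal(np.zeros(dim), W_inv) + h2
    return w1, w2

w1, w2 = sample_trial(same_speaker=True, h1=0.0, h2=0.5)   # target trial with a channel shift
```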
5. Condition-Aware Discriminative Backend
In our recent papers (Ferrer and McLaren, 2020a,b) we presented a backend method with the same functional form as the PLDA backend explained in Section 3, but where all parameters are optimized jointly, in a manner similar to the one used by Rohdin et al. (2018) (though, note that in this paper we only optimize jointly up to the backend stage instead of the full pipeline, as in Rohdin's paper). The novel aspect in our previous papers was the integration into the backend of a side-information-dependent calibration stage aimed at improving calibration robustness across conditions. We proposed to capture the information about the conditions that affect calibration in a side-information vector, which was then used to determine the parameters of the calibration stage. In our first paper (Ferrer and McLaren, 2020a), the side-information vector was generated by a separate model, which was trained to predict labeled conditions in the training data. These condition labels were derived from the available metadata in each of the training datasets, which had different levels of detail depending on the dataset. In our second paper, we integrated the extraction of the side-information vector within the backend itself. This approach has the advantage of not requiring condition-labeled data during training. Further, it gives the backend the freedom to define as side-information only what is useful for calibration. In this paper, we continue in the direction of the second approach, with the addition of a duration-dependent calibration stage. Hence, in the current paper the condition of a sample is represented by both the side-information vector and the duration of the sample. As we will see, this addition leads to a significant gain over the original architecture. In the following subsections, we describe the backend architecture and the training procedure in detail.
In Section 7 we show that discriminatively training a PLDA-like backend does not suffice to obtain good calibration across conditions. This is, in part, due to the fact that the PLDA assumptions (explained in Sections 3.2 and 4) do not hold in practice. This problem can be solved, as usually done for the standard PLDA backend, by training a specific calibration model for each domain of interest, which requires having at least some domain-specific labeled data. In this research, though, we assume that no domain-specific data is available for system adaptation or for training a calibration model. This also means that a domain-specific decision threshold cannot be learned. Hence, we aim to design the best possible out-of-the-box system for unknown conditions, for which the output scores will be well-calibrated even on such unseen conditions. As explained in Section 1, well-calibrated scores, when interpreted as LLRs, can be thresholded at theoretically-determined Bayes decision thresholds, which will then make cost-effective decisions for a range of different applications (see, for example, Van Leeuwen and Brümmer, 2007).

In order to achieve the goal of a robust well-calibrated system, we make the parameters of the calibration stage depend on side-information vectors that are meant to describe the conditions of the signals that affect calibration. As explained in Section 4, the implicit goal behind this approach is to generalize the PLDA formulation to take into account the condition of the signals. Specifically, we transform the scores produced by PLDA using an affine transformation as in equation (16), but make the parameters α and β be functions of side-information vectors, z_1 and z_2, for each of the signals in a trial, using a form identical to the one used for PLDA scoring:

α_s = 2 z_1^T Λ_{αs} z_2 + z_1^T Γ_{αs} z_1 + z_2^T Γ_{αs} z_2 + (z_1 + z_2)^T c_{αs} + k_{αs},   (17)

β_s = 2 z_1^T Λ_{βs} z_2 + z_1^T Γ_{βs} z_1 + z_2^T Γ_{βs} z_2 + (z_1 + z_2)^T c_{βs} + k_{βs},   (18)

where the Γ's, Λ's, c's, and k's are parameters to be optimized. The key components of this model are the side-information vectors z_i, which we define to be given by

z_i = f(A_z m_i + b_z),   (19)

where A_z and b_z are parameters to optimize, f is some transformation, and m_i is obtained using the same form as in equation (4), though with different parameters, A_m and b_m. That is,

m_i = Norm(A_m x_i + b_m).   (20)

We explored three different forms for the function f: the identity function where no transformation is used, the softmax, and the logarithm of the softmax, which is what we used in our prior work (Ferrer and McLaren, 2020a,b).

Finally, in this work we propose to add an additional duration-dependent calibration stage before the side-information-dependent one. Nandwana et al. (2019) showed that sample duration has a large impact on the score distribution. As a consequence of this, training a calibration model with data of durations that are mismatched to the test data can lead to severely miscalibrated scores. Several approaches have been proposed in the literature to mitigate this problem (Mandasari et al., 2013, 2015; Nautsch et al., 2016). In these works, the duration of each of the samples in a trial is used to condition the parameters of the calibration model. Here, we propose a simple model where the durations of the samples in a trial are transformed into feature vectors and the calibration parameters are obtained as a function of these vectors using the same functional form as in equations (17) and (18):
α_d = 2 e_1^T Λ_{αd} e_2 + e_1^T Γ_{αd} e_1 + e_2^T Γ_{αd} e_2 + (e_1 + e_2)^T c_{αd} + k_{αd},   (21)

β_d = 2 e_1^T Λ_{βd} e_2 + e_1^T Γ_{βd} e_1 + e_2^T Γ_{βd} e_2 + (e_1 + e_2)^T c_{βd} + k_{βd},   (22)

where the Γ's, Λ's, c's, and k's are parameters to be optimized, and e_i is a feature vector derived from d_i, the number of frames used to extract the embedding vector x_i.

In this work we explore three ways to convert a duration d into a feature vector, given by

e_log = log(d),   (23)

e_bin = onehot(bin(d, t)),   (24)

e_wlog = log(d) [σ_sc(d), (1 − σ_sc(d))]^T,   (25)

with σ_sc(d) = σ(s (log(d) − log(c))), where bin refers to the operation that turns d into an index corresponding to its bin given a list of thresholds t, onehot converts the index into a vector with a 1 at that index and 0 elsewhere, σ is the sigmoid function, and s and c are hyperparameters to be tuned. Note that e_log is a scalar, e_bin is a column vector of size given by the number of thresholds in t plus 1, and e_wlog is a column vector of size 2.

In the first case, e_log, we assume that the calibration parameters can be estimated as a second-order polynomial function of the log durations of the two samples involved in the trial. In the second case, the calibration parameters are free to take any possible value for each combination of enrollment and test bins. This gives more flexibility to the model, but has the disadvantage of setting the calibration parameters to fixed values within each combination of enrollment and test bins. Hence, in this approach it is necessary to use several bins so that the piece-wise constant approximation can fit the data. This may potentially result in a model that overfits the training conditions more easily. Note that in this case, one does not need to use the full expression for α_d and β_d; the Λ term suffices, since the one-hot vectors operate as selectors of the α_d and β_d for each combination of bins for the two samples in the trial. Finally, the third option, e_wlog, is a hybrid where we assume that the second-order polynomial of the log durations is approximately appropriate over the left and right regions of the duration space, defined by the parameter c. By multiplying the logarithm of the duration with a sigmoid centered at log(c) and its flipped version, we create two features that are non-zero for small and large durations, respectively, with tapering at values near log(c) determined by the s parameter. This approach allows more flexibility than using the single logarithmic feature while still being continuous. We compare these three approaches in Section 7.

Figure 1 shows the complete architecture of the proposed backend. Note that the orange blocks at the center correspond to the standard PLDA formulation, except that the global calibration stage is replaced by the duration- and side-information-dependent calibration stages.
Figure 1: Schematic of the proposed DCA-PLDA backend, with speaker verification, side-information and duration branches. The backend includes the same components as the PLDA approach, an affine transform, length-normalization, PLDA-like scoring, and calibration, with the only difference in functional form being that the calibration stage, rather than being a global model with fixed parameters for all samples, is made to depend on duration features and side-information vectors extracted from the embeddings. The equations indicated in parentheses inside each block describe the functional form of the blocks.
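The following sketch illustrates the duration feature maps of equations (23)-(25) and the second-order form shared by equations (17)-(18) and (21)-(22). The threshold list and the s and c values are placeholders, not the tuned values used in our experiments, and the zero-initialized parameter tensors are stand-ins for quantities that the backend learns jointly, which is why the example reduces to a global calibration.

```python
import numpy as np

def duration_features(d, kind="wlog", thresholds=(8.0, 16.0, 32.0, 64.0), s=2.0, c=16.0):
    """Duration-to-feature maps of equations (23)-(25). d is the duration used to extract
    the embedding; thresholds, s and c are hyperparameters (illustrative values here)."""
    if kind == "log":
        return np.array([np.log(d)])                       # e_log, a scalar feature
    if kind == "bin":
        idx = np.searchsorted(np.asarray(thresholds), d)   # bin(d, t)
        e = np.zeros(len(thresholds) + 1)
        e[idx] = 1.0                                       # onehot(...)
        return e
    if kind == "wlog":
        sig = 1.0 / (1.0 + np.exp(-s * (np.log(d) - np.log(c))))
        return np.log(d) * np.array([sig, 1.0 - sig])      # e_wlog, two tapered features
    raise ValueError(kind)

def cond_calibration_param(e1, e2, Lam, Gam, c_vec, k):
    """Second-order form of equations (17)-(18) and (21)-(22); the same expression, with
    its own trainable (Lam, Gam, c_vec, k), is used for each of alpha and beta."""
    return 2 * e1 @ Lam @ e2 + e1 @ Gam @ e1 + e2 @ Gam @ e2 + (e1 + e2) @ c_vec + k

# Usage sketch: duration-dependent calibration of a raw PLDA score s_raw.
e1, e2 = duration_features(120.0), duration_features(900.0)
n = e1.shape[0]
alpha = cond_calibration_param(e1, e2, np.zeros((n, n)), np.zeros((n, n)), np.zeros(n), 1.0)
beta = cond_calibration_param(e1, e2, np.zeros((n, n)), np.zeros((n, n)), np.zeros(n), 0.0)
s_raw = 3.1
llr = alpha * s_raw + beta   # equation (16) with condition-dependent parameters
```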
The proposed architecture can be trained using stochastic gradient descent or one of its variants. In particular, in this work we use Adam optimization (Kingma and Ba, 2015). These algorithms require an initial value for the parameters, the selection of a loss function, and a process to generate batches during training. We discuss all these issues in this section. Further, we explain the training procedure, which is done in stages, and the way we select the best epoch from each run.
We initialize the parameters in the speaker verification branch (Figure 1) using the corresponding values from a standard PLDA-based backend, which are obtained as described in Section 3. The parameters of the calibration stages are all initialized to 0, except for the k values. If only one stage is used (duration- or side-information-dependent), then the k parameters for the α and β blocks entering that model are initialized to the corresponding global calibration parameters obtained using linear logistic regression, as explained in Section 3.3. If both calibration stages are used, then the k parameters of the duration stage are initialized to the global calibration parameters and the ones for the side-information stage are initialized to be a pass-through, k_{αs} = 1 and k_{βs} = 0. Finally, in our experiments, we also test a plain discriminative PLDA (D-PLDA) architecture where the calibration stage has fixed values for α and β that do not depend on duration or side-information. In this case, those parameters are initialized to the global calibration parameters.

Given this initialization procedure, before the first training iteration the model is identical to a PLDA-based backend with global calibration. This gives us a way to directly compare PLDA and D-PLDA systems using exactly the same code-base, which can be found in https://github.com/luferrer/DCA-PLDA.

The side-information branch has some additional parameters that need to be initialized. This initialization has no effect on the output of the model before the first training iteration, since, as described above, the Γ, Λ and c parameters in equations (17) and (18) are initialized to 0. Yet, this initialization has a significant effect on the performance of the final model. To initialize the first stage of the side-information branch, we use the last M dimensions of the full LDA transform (before selection of the best N dimensions) used to initialize the speaker verification branch of the model, including mean and variance normalization. That is, we use the dimensions that are less useful for speaker verification under the LDA assumptions and apply the same procedure as the LDA stage in the speaker verification branch. In (Ferrer and McLaren, 2020b) we show that this initialization gives an advantage over random initialization. The parameters of the dimensionality reduction stage (equation 19) are the only parameters that are initialized randomly, using a normal distribution centered at 0.0 with a standard deviation of 0.5.

After initialization, the Γ and Λ parameters in all polynomial blocks are symmetric matrices. During training, these matrices are still constrained to be symmetric. This is done simply by optimizing auxiliary matrices Γ̂ and Λ̂, and setting Γ = 0.5 (Γ̂ + Γ̂^T) and Λ = 0.5 (Λ̂ + Λ̂^T).

We train the model for the target speaker verification task. To this end, we use the same loss function used to train the calibration model in the traditional systems, the weighted cross-entropy given in equation (14), and optimize it using the Adam optimizer. Hence, we need to define mini-batches over which the loss and gradients are computed at each optimization step. These mini-batches need to be composed of speaker verification trials. In this work, we restrict ourselves to solving the simplest speaker verification problem where two speech segments are compared to each other to decide whether they belong to the same speaker or not.
That is, each trial is simply composed of two audio segments.

We test two different ways of creating the trials for a batch of size N, where N is an even number. In the simplest method, similar to the one used in our previous papers, we first randomly select N/2 speakers. Then, two sessions from each of those speakers are selected. Finally, for each of those sessions, a sample is selected. Speakers, sessions and samples are selected, in order, from randomly sorted lists. That is, at the beginning of each training process, we randomly sort a set of lists of speakers, sessions and samples. Then, when creating mini-batches, we traverse those lists in order, selecting the next item in the list until the list is exhausted, at which time we randomize the list again and start selecting from the beginning of that list. In this way, during training we see each speaker approximately the same number of times. For each speaker, each of their sessions is also seen approximately the same number of times. And, for each session, each of its samples is seen approximately the same number of times. Finally, once the samples for a mini-batch have been selected, all possible trials between these N samples are used to compute the cross-entropy, excluding different-domain and same-session impostor trials.

(Two speech segments are defined as belonging to the same session if they were recorded during the same day, in the same recording setup; for example, when the same audio is recorded using more than one microphone, or when several recordings are made one after the other. Also, during a telephone conversation, both sides of the conversation are considered to belong to the same recording session. Finally, when we split waveforms in chunks, all chunks from a certain waveform belong to the recording session of the original waveform.)

In the second method for batch selection, we modify the method above to ensure that each batch is composed of the same number of speakers from each training domain. That is, if the training data is composed of D domains, then at the first step during batch creation we select N/(2D) speakers from each domain. The rest of the process is identical to the first method.

5.2.3. Training Stages and Model Selection

In our previous works we trained the parameters using a two-stage procedure where the parameters of the speaker verification branch were trained using all available samples during the first stage and then frozen during a second stage where the parameters of the side-information branch and the calibration stage were fine-tuned using data balanced by domain. During both stages the mini-batches were created using a method similar to the first one described above, without domain balancing. Yet, since during the second stage the input data was domain-balanced, the mini-batches created during that stage were also approximately domain-balanced.

In the current work, instead of creating a balanced dataset, which results in a large percentage of the data being unused during training, we use the second method for batch generation described above, where the domain balance is enforced at batch creation. In this case, all training speakers are used. In Section 7 we compare two procedures for training: (1) balancing by domain at initialization and during batch generation, and (2) without balancing at either stage.

As mentioned above, we want the resulting model to be robust to a variety of conditions, including those that may not appear during training.
What we find experimentally is that the models trained with the procedure above vary wildly in terms of robustness from one mini-batch to the next. That is, while the loss computed on the training data or on data from conditions similar to those in training behaves as expected, decreasing somewhat monotonically as training progresses (though, certainly, with some noise), the loss on unseen conditions is extremely noisy, even from one mini-batch to the next. Results illustrating this phenomenon are shown in Section 7.2.

Given this behavior, we use the following procedure for selecting a model from a certain training run. In a first stage, we train the model over several thousand mini-batches to reach a good performance on seen conditions. Then, in a second stage, we take the last model from the first stage and keep training it using a higher learning rate than in the first stage, evaluating the loss on a list of development sets after every mini-batch update. The increase in learning rate was found to give improved results over using the same rate as in the first stage. After training with the increased learning rate over a few thousand mini-batches, we select the model that led to the best average loss over the development sets. Finally, in a third stage, the best model from the second stage is fine-tuned using a slower learning rate over a few more batches. The final model from the training procedure is given by the best model from the third stage, selected based on the average performance over the development sets.

Importantly, we found the random seed to have a significant effect on the resulting best model. The random seed affects the mini-batch creation and also the initialization of the dimensionality reduction stage that turns m vectors into z vectors. In our experiments, unless otherwise stated, the results shown correspond to the best model over 20 seeds, selected based on the average performance over the development sets.
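As a complement to the description above, the following simplified sketch illustrates the domain-balanced batch generation; the nested metadata structure is hypothetical, and the real implementation cycles through randomly sorted lists rather than sampling independently per batch.

```python
import random

def make_batch(meta, domains, batch_size):
    """Simplified sketch of domain-balanced mini-batch creation. `meta` is a hypothetical
    nested dict: domain -> speaker -> session -> list of sample ids. For each batch we pick
    the same number of speakers from every domain, two sessions per speaker and one sample
    per session (speakers with a single session are excluded from training, as described
    above)."""
    spk_per_domain = batch_size // (2 * len(domains))
    batch = []
    for dom in domains:
        for spk in random.sample(list(meta[dom]), spk_per_domain):
            for sess in random.sample(list(meta[dom][spk]), 2):
                batch.append(random.choice(meta[dom][spk][sess]))
    # All pairs of samples within the batch are then used as trials for the loss,
    # excluding different-domain and same-session impostor trials.
    return batch
```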
6. Experimental Setup
In this section we describe the system configuration and datasets used for our experiments.
6.1. Speaker Embedding Extractor

We use standard x-vectors as input for our backend experiments (Snyder et al., 2016; Snyder, 2017). These vectors are obtained as the pre-activations from a DNN trained to classify the speakers in the training data. The input features for the embedding extraction network are power-normalized cepstral coefficients (PNCC) (Kim and Stern, 2012) which, in our experiments, gave better results than the more standard mel frequency cepstral coefficients (MFCCs). We extract 30 PNCCs with a bandwidth going from 100 to 7600 Hz and root compression of 1/15. The features are mean and variance normalized over a rolling window of 3 seconds.

Silence frames are then discarded using a DNN-based speech activity detection (SAD) system. Frames with a log-posterior for the speech class above -0.5 are used for the extraction of embeddings. The SAD DNN has two hidden layers with 500 and 100 nodes, respectively, and uses 19-dimensional mel-frequency cepstral coefficients (MFCC) features (excluding C0), stacked over a window of 31 frames around each frame, and mean and variance normalized over a 3-second rolling window. The model was trained on clean telephone and microphone data from a random selection of files from Mixer datasets (2004-2010), Fisher, and Switchboard. A 5-minute DTMF tone, and a selection of noise and music samples with and without speech added, were included in the pool of training data. The ground-truth SAD labels needed for DNN training were generated by decoding the training audio using an English senone DNN. The speech/non-speech labels were then produced by applying a threshold of 0.2 to the sum of the posteriors of the three silence senones from the DNN. For all augmented system training data, the SAD alignments from the raw audio were used as ground truth.

The embedding extractor was trained with 234K signals from 14,630 speakers. This data was compiled from Switchboard, NIST SRE 2004-2008, NIST SRE 2012, Mixer6, Voxceleb1, and Voxceleb2 (train set) data. Voxceleb1 data had 60 speakers removed that overlapped with Speakers in the Wild (SITW). More details on these datasets are given in Section 6.2. All waveforms were up- or down-sampled to 16 kHz before further processing. In addition, we down-sampled any data originally of 16 kHz or higher sampling rate (74K files) to 8 kHz before up-sampling back to 16 kHz, keeping two "raw" versions of each of these waveforms. This procedure allowed the embeddings system to operate well in both 8 kHz and 16 kHz bandwidths.

Augmentation of data was applied using four categories of degradations as in (McLaren et al., 2018), including music and noise, both at 10 to 25 dB signal-to-noise ratio, compression, and low levels of reverb. We used 412 noises compiled from both freesound.org and the MUSAN corpus. Music degradations were sourced from 645 files from MUSAN and 99 instrumental pieces purchased from Amazon music. For reverberation, examples were collected from 47 real impulse responses available on echothief.com and 400 low-level reverb signals sourced from MUSAN. Compression was applied using 32 different codec-bitrate combinations with open-source tools. We augmented the raw training data to produce 2 copies per file per degradation type (randomly selecting the specific degradation and SNR level, when appropriate) such that the data available for training was 9-fold the amount of raw samples. In total, this resulted in 2,778K files for training the speaker embedding DNNs.

The architecture of our embedding extractor DNN follows the Kaldi recipe (Snyder, 2017).
The DNN is implemented in Tensorflow and trained using an Adam optimizer with chunks of speech between 2.0 and 3.5 seconds. Overall, we extract about 4K chunks of speech from each of the speakers. DNNs were trained over 4 epochs over the data using a mini-batch size of 96 examples. We used dropout with a probability linearly increasing from 0.0 up to 0.1 at 1.5 epochs and then linearly decreasing back to 0.0 at the final iteration. The learning rate started at 0.0005, increasing linearly after 0.3 epochs and reaching 0.03 at the final iteration, while training simultaneously on 8 GPUs and averaging the parameters from the 8 jobs every 100 mini-batches.

6.2. Backend

The training data for the PLDA and DCA-PLDA backends includes all the training data used for the embedding extractor, plus two additional datasets, excluding all signals for which no information about the recording session could be obtained and all speakers for which only a single session was available. We divide this data into 6 different domains:

• VOX: The same Voxceleb 1 and 2 (Nagrani et al., 2020) data used for embedding extractor training. It includes 7129 speakers. Recordings are interviews mostly, though not exclusively, in English. We exclude the test part of Voxceleb 2 for use in evaluation.

• SWB: Also included in embedding extractor training. It consists of 2389 speakers speaking through a telephone channel (Godfrey et al., 1992).

• MIX: This includes all the English-only speakers from the SRE datasets from 2004 through 2012 (Przybocki et al., 2007; Martin and Greenberg, 2009, 2010; Greenberg et al., 2013, 2020) and Mixer6 (Brandschain et al., 2013) datasets used for embedding extractor training. It includes 2713 speakers recorded over telephone and microphone channels.

• MIX ML: This domain includes any bilingual speakers from the SRE datasets used for embedding extractor training. It consists of 1322 speakers. This is the only training domain on which we can create cross-language trials, since all other domains are either exclusively in English or include speakers speaking a single language each.

• FVCAUS: This domain is composed of interviews and conversational excerpts from 328 Australian English speakers from the forensic voice comparison dataset (Morrison et al., 2015). Audio was recorded using close-talking microphones, resulting in extremely clean waveforms compared to the rest of the domains. This set is not included in the training data for the embedding extractor.

• RATS SRC: This domain is composed of telephone calls in five non-English languages from 286 speakers. We only used the source data (not re-transmitted) of the DARPA RATS program (Walker and Strassel, 2012) for the SID task. This set is not included in the training data for the embedding extractor.

Table 1: The development and evaluation datasets with number of speakers (spk) and target/impostor (tgt/imp) trial counts.
The inclusion of the last two sets is important for achieving better generalization performance, since those two sets consist of conditions that are underrepresented in the embedding extractor training data: very clean high-quality microphone data and non-English data.

The speaker counts listed above exclude 50 speakers from each domain that are held out as development sets. The training samples from the first four domains are augmented with the same procedure as for embedding training, except that only one degraded version of each original segment is created instead of 8 as in embedding extractor training, and that noise and music are added at a 5 dB SNR level instead of 10-25 dB. This decision was made very early on during development of our PLDA backend because we saw small but consistent gains in results from using this lower SNR level.

For backend training we explore two options: using the full waveforms available from each domain, and cutting each waveform into smaller segments from 4 to 240 seconds, creating 8 chunks per waveform. The chunks are created to achieve a close-to-uniform distribution of log durations for each domain.
We focus on the problem of text-independent speaker verification, with trials defined as the comparison between two individual segments containing a single speaker each. We use several different datasets for development and evaluation of the proposed approach to include a large variety of conditions.
• Sets held-out from training: These sets were created from 50 speakers from each of the 6 training domains described in the previous section which are held out during training.

• SITW: The Speakers in the Wild (SITW) dataset contains speech samples in English from open-source media (McLaren et al., 2016), including naturally occurring noises, reverberation, codec, and channel variability. This data is similar in nature to the Voxceleb data.

• SRE16: The NIST Speaker Recognition Evaluation (SRE) 2016 dataset (Sadjadi et al., 2017) includes variability due to domain/channel and language. We present results on the non-English conversational telephone speech, which is recorded over a wide variety of handset types. We use both the labeled development set, with speech in Cebuano and Mandarin, and the evaluation set, with speech in Tagalog and Cantonese.

• SRE18: We show results on the Call My Net 2 (CMN2) subset of the SRE'18 dataset (Sadjadi et al., 2019). This subset has similar characteristics to the SRE16 dataset with the exception of focusing on a different language, Tunisian Arabic, and including speech recorded over VOIP instead of just PSTN calls.

• SRE19: We use the NIST SRE 2019 conversational telephone speech (CTS) dataset (Sadjadi et al., 2020), which consists of data from the CMN2 collect that was unused for SRE18. The spoken language was Tunisian Arabic and the collection was made over varying telephony channels and in varying environments.

• Voices: For this corpus (Richey et al., 2018; Nandwana et al., 2020), audio was recorded in furnished rooms with background noise played in conjunction with foreground speech selected from the LibriSpeech corpus, using 12 different microphones.

• FBI: The FBI evaluation corpus was supplied by the Federal Bureau of Investigation (FBI) and consists of 14 distinct conditions including same/cross-channel and same/cross-language trials with a wide range of difficulties. This set of corpora was selected by the FBI for calibration research to represent a very wide range of different conditions, collection sources, environments, languages, and channels. Details on this dataset and the conditions included in our experiments can be found in (Ferrer et al., 2019).
• Vox2-Tst: We use the test split of the Voxceleb 2 dataset (Nagrani et al., 2020). This set is composed of interviews downloaded from YouTube. Language labels are not available for this data but it appears to be mostly, though not completely, in English.

• LASRS: The corpus is composed of 100 bilingual speakers from each of three languages, Arabic, Korean and Spanish (Beck et al., 2004). Each speaker is recorded in two separate sessions speaking English and their native language, using several recording devices.

• FVCCMN: Composed of interviews and conversational excerpts from over 68 female Chinese speakers from the forensic voice comparison dataset (Zhang and Morrison, 2011). Recordings were made with high-quality lapel microphones. This corpus is similar to the FVCAUS one used for backend training, except that the language is different.

Table 1 shows the statistics for the development and evaluation sets. For SITW, SRE16, SRE18, SRE19 and Voices we use the 1-side enrollment trials defined with the datasets. These sets also have a corresponding development set defined in their releases. In this paper we only use the one for SRE16, which we use for development. We do not show results for the other development splits since we found that conclusions were very similar for each development split compared to the corresponding evaluation split. For LASRS, FVCCMN and Vox2-Tst we create exhaustive trials, excluding same-session trials.

For most of the results we use summary metrics where we average the results over six different groups:

• TRNH: The 6 sets composed of speakers held out from the training data. We use these sets for development.

• DEV: The two additional development sets, SRE16 Dev and FVCCMN, corresponding to conditions unseen during training.

• VOX: Voxceleb2 and SITW Eval.

• SRE: SRE16 Eval, SRE18 Eval, and SRE19 Eval.

• VOICES: Voices Eval.

• XLAN: FBI and LASRS.

As explained in Section 5.2.3, in order to select a model from a certain training process, after a first warm-up stage, we start evaluating the models obtained after every mini-batch update. This evaluation is done using the average over the sets included in the TRNH and DEV groups. As we will show in the results, including the two unseen DEV sets when selecting the best model is essential to achieve robustness on the evaluation sets. For each of the 8 sets, we also include a version where each segment in each trial is chunked from 4 to 30 seconds (with a uniform distribution in the log domain) to ensure that the selected models are robust on short durations as well. Finally, the performance on the 16 sets (both the full-duration and the short-duration versions of the 8 sets) is averaged to select the best model. This same averaged metric was used to select hyperparameters like the learning rate, architecture, side-information size and all other development decisions. Results on the rest of the datasets were not used during development. When showing results for a certain group we average both over the original sets and the corresponding chunked versions between 4 and 16 seconds (again, using a uniform distribution of durations in the log domain). In this case, we use more extreme durations than during development for a more challenging condition as well as additional mismatch with the development scenario.

Note that chunks (partial speech segments) are used in all stages of training and evaluation. For embedding extractor training we use the standard chunking approach for this purpose, with durations between 2.0 and 3.5 seconds, as explained in Section 6.1.
For backend training we either use the full segments or chunks between 4 and 240 seconds. For development we use, for each set, the original full-segment version and a new version created by chunking between 4 and 30 seconds. Finally, for the evaluation of the selected models, we use, for each set, the full-segment version and a new version created by chunking between 4 and 16 seconds. The use of chunks was essential for obtaining a model that is robust across durations, since most datasets used in our experiments are composed of relatively long segments.
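As a minimal sketch, the snippet below shows one way to draw such chunk durations uniformly in the log domain. The duration ranges are the ones quoted above; the function name and the use of NumPy are our own illustrative choices rather than the repository code.

```python
import numpy as np

def sample_chunk_duration(min_dur, max_dur, rng):
    """Draw a chunk duration (in seconds) uniformly in the log domain."""
    return float(np.exp(rng.uniform(np.log(min_dur), np.log(max_dur))))

rng = np.random.default_rng(0)
# Ranges quoted in the text for the different stages.
backend_train_dur = sample_chunk_duration(4.0, 240.0, rng)   # backend training chunks
development_dur   = sample_chunk_duration(4.0, 30.0, rng)    # chunked development sets
evaluation_dur    = sample_chunk_duration(4.0, 16.0, rng)    # chunked evaluation sets
```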
In machine learning, there has been much recent interest in measures of the goodness of the calibration of probabilistic classifiers. Guo et al. (2017), and many others that cite them, motivate the need for calibration by the requirement to be able to make good decisions in the face of uncertainty. We agree. Unfortunately, Guo et al. (2017) and subsequent publications lose sight of this requirement and judge calibration with measures that are not designed to reflect decision-making ability. On the other hand, the measurement of calibration via proper scoring rules (see Gneiting and Raftery, 2007, and references therein) is a mostly solved problem. Proper scoring rules quantify the risk (expected cost) of using the outputs of probabilistic classifiers to make minimum-expected-cost Bayes decisions (Brümmer, 2010b). As shown in equation (2), that is exactly how we apply the LLR outputs of our speaker verifier: we use the LLRs to make Bayes decisions. We therefore use a proper scoring rule to evaluate the goodness of our LLRs. The ubiquitous classifier training objective, cross-entropy (also known as the negative log-likelihood), is indeed the expected value of a proper scoring rule. In (Brümmer and du Preez, 2006; Brümmer, 2010b) we motivated the use of prior-weighted cross-entropy as a measure of the goodness of binary classifiers that output LLRs. We termed this measure Cllr (cost of LLR). In this work, we show results in terms of this metric.

Like any other proper scoring rule, this metric is sensitive to both the discrimination and the calibration performance of the system. Specifically, the Cllr is defined as the weighted cross-entropy in equation (14) with target prior π = 0.5. We also show results in terms of a weighted cross-entropy with π = 0.01, which represents applications where the effective prior probability of targets is small. We will call these metrics Cllr.5 and Cllr.01, respectively. During training we optimize Cllr.01. We also select the best model per run (best seed and epoch) and the hyperparameters based on this metric. In our experiments, training and optimizing for this metric resulted in better test Cllr.01, without hurting Cllr.5.

Even a very discriminant system (with low equal-error-rate) can have a high Cllr (for any value of π) if the calibration is bad. Such a system would lead to bad decisions when thresholded with the theoretical Bayes decision threshold, for some or all of the cost functions that represent the applications of interest. In order to analyze what part of the Cllr is due to miscalibration, we can compute the value of this metric when the test scores are calibrated optimally. This is usually done by using the PAV algorithm (Brümmer and du Preez, 2013) on the test scores, which finds a non-parametric monotonic transformation that optimizes the value of any proper scoring rule, including Cllr. The optimal value of the Cllr is called minimum Cllr, since it is the best Cllr that can be obtained on the test set without changing the discrimination power of the scores. The scores transformed with the PAV algorithm form a useful reference for judging calibration: the relative difference between the actual Cllr and the minimum Cllr indicates what percentage of the Cllr is due to miscalibration.

Finally, we also show results in terms of the detection cost function (DCF), computed with a target prior P(H_s) = 0.01 and unity costs, C(accept | H_d) = C(reject | H_s) = 1 (see equation 2). That is, we compute DCF = 0.01 P_miss + 0.99 P_fa, where P_miss is the probability of labeling a same-speaker trial as a different-speaker trial and P_fa is the probability of labeling a different-speaker trial as a same-speaker trial. The errors are computed on hard decisions made by thresholding the scores with the threshold that would result in the best expected DCF if the scores were well calibrated (Equation 3). In contrast with the Cllr, the DCF measures the performance of hard decisions at a single operating point rather than the overall quality of the scores. As in the case of the Cllr, a minimum value of DCF can also be obtained to determine what part of the DCF is due to miscalibration. In this case, the minimum is obtained by simply sweeping a threshold and choosing the one that minimizes the DCF.

For a gentle introduction to calibration, the Cllr metric, the PAV algorithm, DCF, and optimal decisions, please see (Van Leeuwen and Brümmer, 2007).
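As a reference, the sketch below shows how these quantities can be computed from target and non-target LLRs. The Cllr routine is the usual prior-weighted cross-entropy in bits; since equation (14) is not reproduced here, any additional normalization it applies is omitted. The DCF routine implements the weighted error reconstructed above, and min_dcf sweeps the threshold as described; the PAV-based minimum Cllr is left out to keep the example short. All function names are ours.

```python
import numpy as np

def cllr(tar_llrs, non_llrs, p_tar=0.5):
    """Prior-weighted binary cross-entropy of LLRs, in bits (Cllr.5 for p_tar=0.5,
    Cllr.01 for p_tar=0.01). Any extra normalization in equation (14) is omitted."""
    tar = np.asarray(tar_llrs, float)
    non = np.asarray(non_llrs, float)
    logit_prior = np.log(p_tar / (1.0 - p_tar))
    softplus = lambda x: np.logaddexp(0.0, x)          # log(1 + exp(x)), computed stably
    c_miss = np.mean(softplus(-(tar + logit_prior)))   # -log P(target | llr)
    c_fa = np.mean(softplus(non + logit_prior))        # -log P(non-target | llr)
    return (p_tar * c_miss + (1.0 - p_tar) * c_fa) / np.log(2.0)

def dcf(tar_llrs, non_llrs, p_tar=0.01):
    """Actual DCF with unity costs: decisions at the theoretical Bayes threshold."""
    tar = np.asarray(tar_llrs, float)
    non = np.asarray(non_llrs, float)
    bayes_thr = np.log((1.0 - p_tar) / p_tar)          # log(0.99/0.01) for p_tar = 0.01
    p_miss = np.mean(tar <= bayes_thr)
    p_fa = np.mean(non > bayes_thr)
    return p_tar * p_miss + (1.0 - p_tar) * p_fa

def min_dcf(tar_llrs, non_llrs, p_tar=0.01):
    """Minimum DCF obtained by sweeping the decision threshold over the scores."""
    tar = np.asarray(tar_llrs, float)
    non = np.asarray(non_llrs, float)
    candidates = np.concatenate(([-np.inf], np.sort(np.concatenate((tar, non)))))
    return min(p_tar * np.mean(tar <= t) + (1.0 - p_tar) * np.mean(non > t)
               for t in candidates)
```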
7. Results
In this section we include various results and analyses comparing the standard PLDA backend with different configurations of the proposed backend and different training methods. In all cases we use the training procedure described in Section 5.2.3 with a fixed learning rate of 0.0005 during the first stage, which consists of 12000 mini-batches. Then, for the second stage, when we start testing performance after every update, we increase the learning rate to 0.001, which improved performance over using the smaller learning rate for this stage. We do this over 3000 mini-batches. Finally, in the third stage we fine-tune the best model from the second stage with a smaller learning rate of 0.00001 over 100 mini-batches. We use a batch size of 2048, which resulted in better performance than the smaller sizes we used in our previous papers. We use L2 regularization on all parameters, with weights tuned on the development data, and clip the gradients to a maximum norm of 4.0.

Unless otherwise noted, we use an LDA dimension of 300, a dimension for m (equation 20) of 200, a dimension of 6 and the identity transform as f for z (equation 19), and duration features given by equation (25) with c = 30 and s = 2. These default parameters for the architecture, along with those for the training procedure described above, are provided in configuration files in the example directory at https://github.com/luferrer/DCA-PLDA .

To initialize the calibration stage we do not use all the available training data, since that is computationally infeasible for linear logistic regression and also unnecessary given the small number of parameters to optimize. Hence, we simply select 2000 speakers and create trials using one of the two procedures described in Section 5.2.2.

We use the following naming convention for the systems. Systems for which the LDA and PLDA parameters are frozen at their standard values are called PLDA. Systems where the LDA and PLDA stages are trained discriminatively after initialization with the standard values are called D-PLDA. For both cases, the calibration stages are indicated as suffixes: DD for duration-dependent calibration, SD for side-information-dependent calibration, and DSD for duration- and side-information-dependent calibration. The full system proposed in this paper is then called D-PLDA-DSD. We also call this system DCA-PLDA, which is easier to pronounce and remember. We use the harder-to-pronounce D-PLDA-DSD name when convenient for indicating the difference between systems under comparison.

We first show support for two important decisions made in terms of the training setup: the initialization procedure and the use of training data as full or chunked waveforms. Figure 2 shows results for the standard PLDA backend, a D-PLDA backend obtained by replacing the duration- and side-information-dependent calibration stages in DCA-PLDA with a global calibration stage, and the full proposed backend, DCA-PLDA, with both calibration stages. Results are shown on the two groups of sets that are used for development, TRNH and DEV.

We compare results using two different approaches for initialization and batch generation. In one case, which we call flat, all training samples are considered equally important. The PLDA model is learned setting all speaker weights in equations (5), (6), and (7), and during EM iterations, equal to 1.0. Further, when generating batches for D-PLDA or DCA-PLDA, training speakers are selected randomly with equal probability.
Hence, in this case, the most frequent domain, VOX, which contains half of the training speakers, dominates the training process. In the other case, which we call balanced-by-domain, we set the speaker weights for PLDA initialization and EM iterations to be the inverse of the number of speakers in their domain (given the way we define the domains, each speaker only appears in a single domain). Further, when generating batches for D-PLDA or DCA-PLDA we use the second approach described in Section 5.2.2, where each batch contains the same number of speakers from each domain. Hence, in this case, all domains are equally represented during training.

We show results when training with full files and using chunks (see Section 6.2). Arguably, the DCA-PLDA model with duration-dependent calibration should only be trained with chunked data, since otherwise it would not be able to separate the effect of duration and side-information, given that the duration distribution for the original waveforms is highly domain dependent. On the other hand, after chunking, all domains have a uniform duration distribution between 4 and 240 seconds, which should better allow the model to separate the effects of duration and side-information on the calibration parameters.

We can see that for all three systems, the best training approach is to use chunks and balance the data by domain. For the rest of the experiments in this paper, chunks and balanced-by-domain weights and batches are used to train all systems.

Finally, we also tried random initialization for DCA-PLDA, using a normal distribution centered at 0.0 with a standard deviation of 0.5. These results are not included, to reduce clutter in the figure; they are significantly worse than the ones shown here, especially on the DEV sets, where random initialization led to up to 60% worse results for the balanced-by-domain case. This highlights the importance of the initialization process described in Section 5.2.1 for getting optimal performance with the proposed approach.
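The sketch below illustrates, under our own naming, the two balancing mechanisms just described: inverse-domain-size speaker weights for the generative PLDA initialization, and mini-batches containing the same number of speakers from every domain. It is only a schematic view of the procedure in Section 5.2.2, not the repository implementation.

```python
import numpy as np
from collections import Counter

def speaker_weights_balanced_by_domain(spk2dom):
    """Weight each speaker by the inverse of its domain's speaker count so that
    every domain contributes equally to the weighted PLDA statistics."""
    dom_sizes = Counter(spk2dom.values())
    return {spk: 1.0 / dom_sizes[dom] for spk, dom in spk2dom.items()}

def sample_balanced_batch(spk2dom, spk_per_dom, rng):
    """Compose a mini-batch with the same number of speakers from every domain."""
    by_dom = {}
    for spk, dom in spk2dom.items():
        by_dom.setdefault(dom, []).append(spk)
    batch = []
    for spks in by_dom.values():
        batch.extend(rng.choice(spks, size=spk_per_dom, replace=False))
    return batch

# Toy usage: two domains of very different sizes.
spk2dom = {f"vox_{i}": "VOX" for i in range(1000)}
spk2dom.update({f"fvc_{i}": "FVCAUS" for i in range(100)})
weights = speaker_weights_balanced_by_domain(spk2dom)
batch = sample_balanced_batch(spk2dom, spk_per_dom=8, rng=np.random.default_rng(0))
```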
Figure 2: Average actual and minimum (black lines) Cllr with π = 0.5 on the TRNH and DEV groups for the PLDA, D-PLDA, and DCA-PLDA systems trained on full waveforms (FULL) or on chunks (CHNK).

The left and center plots of Figure 3 show the loss used for training (Cllr.01) for the TRNH and DEV groups, for two systems, D-PLDA and DCA-PLDA. We show the actual (solid) and minimum (dashed) Cllr.01 for every epoch, each composed of 10 mini-batches. The loss at iteration 0 corresponds to the model initialized as described in Section 5.2.1. We can see that for the TRNH sets, which correspond to unseen speakers from conditions seen in the training data, after the first few mini-batches the learning curves are smooth and stabilize at a low value, especially for the simpler D-PLDA model. On the other hand, on the DEV sets, which correspond to unseen conditions, the actual Cllr is quite unstable. Interestingly, this does not happen for the minimum Cllr, which changes relatively little across epochs. The minimum Cllr in these curves is obtained as the average Cllr over all test sets in each group, after transforming the scores with an affine function obtained with linear logistic regression trained on each test set. The fact that these curves are smooth implies that the changes in parameter values that occur between epochs result in a similar shift and scaling of all scores from each test set, so that the discrimination performance within each set (and, hence, the minimum Cllr) is not affected by these changes. On the other hand, the shift and scaling of scores that occurs across epochs greatly affects calibration. This observation highlights the importance of using the actual instead of the minimum loss as a metric during development.

The right plot in Figure 3 shows the results obtained on the TRNH and DEV sets for the D-PLDA and DCA-PLDA models selected using only the TRNH sets or using the average over all TRNH and DEV sets (our default approach). As we can see, the performance on the TRNH sets is better when using only the TRNH sets for selection, but the performance on the DEV sets is much worse for those models compared to the ones selected using the average over both TRNH and DEV sets. We pay a very large price on unseen conditions when selecting the optimal model on seen conditions. For this reason, we use the strategy described in Section 5.2.3 to select the best model for a certain run based on the average performance over the TRNH and DEV sets. As we will see, the model selected this way shows good generalization to unseen datasets.

We have tried several approaches to prevent the DEV performance from changing so drastically between one mini-batch and the next, including using slower learning rates or different learning rate schedules, higher L2 regularization coefficients, and regularizing the output of the system as proposed by Pezeshki et al. (2020). While some of these approaches succeeded in taming the learning curve for the DEV sets, they all resulted in worse final performance.

Finally, we tried adding the two DEV sets to the training data as two separate training domains, one for each set. This led to a degradation in performance on the evaluation sets for both systems. It appears to be important that those two sets, which, as shown in this section, are essential for choosing a robust model, be held out from training.
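As an illustration of how the per-set minimum-Cllr reference curves above can be produced, the sketch below fits the affine transformation (linear logistic regression) on the scores of a single test set by directly minimizing the prior-weighted cross-entropy. The optimizer choice and function names are ours; the PAV-based minimum Cllr of Section 6.4 would replace the affine fit with an isotonic (non-parametric monotonic) one.

```python
import numpy as np
from scipy.optimize import minimize

def fit_affine_calibration(tar_scores, non_scores, p_tar=0.5):
    """Fit llr = a*score + b by minimizing the prior-weighted cross-entropy
    (linear logistic regression) on one test set."""
    tar = np.asarray(tar_scores, float)
    non = np.asarray(non_scores, float)
    logit_prior = np.log(p_tar / (1.0 - p_tar))
    softplus = lambda x: np.logaddexp(0.0, x)   # stable log(1 + exp(x))

    def objective(params):
        a, b = params
        c_miss = np.mean(softplus(-(a * tar + b + logit_prior)))
        c_fa = np.mean(softplus(a * non + b + logit_prior))
        return p_tar * c_miss + (1.0 - p_tar) * c_fa

    a, b = minimize(objective, x0=[1.0, 0.0], method="Nelder-Mead").x
    return lambda scores: a * np.asarray(scores, float) + b
```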
As described in Section 5.1, we explored different ways to encode duration information into features that can then be used to obtain condition-dependent calibration parameters using the polynomial forms in equations (21) and (22). These features are given by the logarithm (Log, equation 23), by binning and one-hot encoding (Bin, equation 24), and by a windowed version of the logarithm (WLog, equation 25). The left plot in Figure 4 compares the results for a simple D-PLDA with global calibration and three D-PLDA systems using only the duration-dependent calibration stage, each with a different set of duration features.
Figure 3: Left and center: Cllr.01 (the training objective function) for the first 200 epochs, each composed of 10 mini-batches, on the TRNH and DEV groups for two systems, D-PLDA and DCA-PLDA. Solid curves correspond to the actual Cllr.01, while dashed curves correspond to the minimum Cllr.01. Right: Cllr.5 on TRNH and DEV for the D-PLDA and DCA-PLDA models selected using only the TRNH sets, or using the average over all TRNH and DEV sets.
For the Bin features we use thresholds of 8, 16, 32, 64 and 128, which are equally spread in the logarithmic domain. For the WLog features, we use c = 30 seconds and s = 2. These values were lightly tuned on the TRNH+DEV sets. The side-information-dependent calibration stage is disabled for the experiments in this section in order to study the effect of duration-dependent calibration alone.

We can see that on the TRNH sets only the binning approach gives a modest gain over not using a duration-dependent calibration stage. On the other hand, for the DEV sets, this approach for creating duration features results in a degradation in performance. It appears that the flexibility of this model allows it to overfit the training conditions. Overall, the WLog approach gives the best trade-off across sets. This is the setup we use for the rest of the experiments in this paper. As we will see in our final results, the gain observed on the DEV sets with respect to D-PLDA when using this duration-dependent approach generalizes to the evaluation sets, resulting in larger gains on some conditions.

The center and right plots in Figure 4 show the values of α_d and β_d obtained with the WLog model as a function of the original duration for the enrollment and test sides of a trial.
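For concreteness, the Log and Bin duration features can be sketched as follows. The exact bin-boundary convention is our assumption, and the WLog feature of equation (25) is not reproduced here since its exact form is defined earlier in the paper.

```python
import numpy as np

def log_duration_feature(dur):
    """'Log' feature (equation 23): the logarithm of the duration in seconds."""
    return np.log(dur)

def bin_duration_feature(dur, thresholds=(8, 16, 32, 64, 128)):
    """'Bin' feature (equation 24): one-hot encoding of the duration bin.
    The thresholds are the ones quoted above, equally spread in the log domain."""
    idx = int(np.searchsorted(thresholds, dur, side="right"))
    one_hot = np.zeros(len(thresholds) + 1)
    one_hot[idx] = 1.0
    return one_hot

print(bin_duration_feature(20.0))   # a 20 s segment falls in the (16, 32] bin
```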
Figure 4: Left: Average actual and minimum (black lines) Cllr with π = 0.5 on the TRNH and DEV groups for D-PLDA and for D-PLDA-DD with Log, Bin, and WLog duration features. Right: Values of α_d and β_d obtained by the D-PLDA-DD WLog model as a function of the durations of the two sides of a trial (dur enroll and dur test).
We can see that, while there is some displeasing non-monotonicity in α_d at the lower end of the durations, the trends are reasonable: larger durations correspond to larger values of α_d, which implies larger values of the resulting LLR (i.e., increased confidence). Also, α_d is always positive, meaning that the sign of the scores that come out of the PLDA stage is never reversed.

We explored three different forms for the f function in equation (19): identity, softmax and log-softmax (Section 5.1). The identity gave a slightly better result than the other two, though all three were similar in performance. We also explored different dimensions for the z vector, selecting the value that gave the best average performance on the TRNH plus DEV sets, which was a dimension of 6. Nevertheless, any value between 4 and 8 gave similar performance. This optimization was done on the full model with both calibration stages, using the duration-dependent calibration stage with the parameters selected as described in the previous section. For the rest of the experiments in this paper we use the identity function as the transform and 6-dimensional z vectors.

Analyzing the values of α and β for the side-information-dependent calibration stage as we did for the duration-dependent stage is not possible, since the side-information vector is not unidimensional. One way to visualize the effect of the side-information-dependent calibration as a whole is to compare the scores from the D-PLDA-DD model selected in the previous section with the scores from the full system, D-PLDA-DSD.
Figure 5: Scatter plot and distributions of the output scores (l in Figure 1) from the D-PLDA-DD and D-PLDA-DSD systems on the Vox2-Tst set.

Figure 5 shows the scatter plot of the two scores and their distributions for the Vox2-Tst dataset. In this case, the D-PLDA-DD system has a Cllr.5 of 0.11, while the D-PLDA-DSD system has a Cllr.5 of 0.08 (see Figure 8). This improvement is reflected in the fact that the target and impostor distributions for the D-PLDA-DSD system cross at zero, which is a property of well-calibrated LLRs (van Leeuwen and Brümmer, 2013), while the ones for the D-PLDA-DD system do not.

In the scatter plot we can see that, while the target scores are close to linearly related, the impostor scores are not. This suggests that the side-information vectors z are not only reflecting speaker-independent information (the h variables from Section 4), since that information should be similar for target and impostor trials within this set and would result in linearly related scores for both classes, but also some rough speaker-dependent information (the g variables from Section 4), which would be the same for both sides of a target trial but differ in impostor trials. We can also see that the D-PLDA-DSD system is more confident: the average slope in the scatter plot is greater than one, and the target and impostor means move further apart.

Next, we show the effect of discriminatively and jointly training all parameters in the model versus training only the calibration parameters discriminatively, using generatively trained LDA and PLDA models for score generation. For each case, training the full model (D-PLDA) or training only the calibration stages (PLDA), we run four options: global calibration (no suffix), side-information-dependent calibration (SD), duration-dependent calibration (DD), and side-information- and duration-dependent calibration (DSD).
Figure 6: Comparison of PLDA and D-PLDA systems with and without duration- and side-information-dependent calibration stages. Note that the D-PLDA-DSD system is the full DCA-PLDA system proposed in this paper.

Figure 6 shows the results for these 8 systems on all 6 groups described in Section 6.3. The results show that joint training of all the model's parameters gives a significant and consistent gain across all architectures. That is, the D-PLDA version is always better than its corresponding PLDA version, in some cases by a very large margin. Notably, condition-dependent calibration only gives consistent gains over global calibration if the full model is trained jointly.

The one exception to these observations is on the TRNH sets, where the PLDA versions work quite well. It is important to note that this is the most common scenario in speaker verification papers, where the PLDA model is trained on data that is matched to the evaluation data. In this scenario, the jointly trained models do not offer a consistent advantage; D-PLDA is better than PLDA only when global calibration is used. This highlights the importance of evaluating performance on mismatched conditions. While all models perform somewhat similarly on conditions matched to the training conditions, they differ significantly on unseen conditions. In particular, the D-PLDA-DSD model gives the best trade-off across sets.
As explained in Section 5.2.3, all results shown in this paper correspond to the best model in terms of average Cllr.01 over the TRNH and DEV sets, obtained out of 20 runs with different random seeds. For our proposed system, D-PLDA-DSD, the median performance across the 20 seeds (approximated by the average Cllr.01 of seed number 10 after sorting the seeds by performance) and the worst performance are 6% and 14% worse than the best performance, respectively, while for D-PLDA these numbers are 22% and 38%.
Figure 7: Comparison of D-PLDA and DCA-PLDA (a.k.a. D-PLDA-DSD) systems (1) training the models with all training domains and selecting the best epoch and seed using the TRNH and DEV sets (FULL), and (2) training the models only with the VOX domain and selecting the best epoch and seed using the held-out VOX data, one of the 6 sets in the TRNH group (VOX).

Hence, despite the fact that our model is slightly more complex than D-PLDA in terms of number of parameters, it gives more stable performance across seeds. A similar trend is observed for the results in Figure 4, where the best model, D-PLDA-DD WLog, has a smaller relative range of performance across seeds than the other systems in that figure. These results suggest that better models have an easier time reaching a good set of parameters, regardless of the seed.
In this section, we select two systems, D-PLDA and D-PLDA-DSD, and compare their performance on all evaluation groups when the systems are trained in two ways: (1) with the default approach, using all training domains for training and selecting the best epoch and seed using the TRNH+DEV sets, and (2) using only the VOX training domain and selecting the best epoch and seed using only the held-out part of the VOX training data. The latter would be the standard approach when developing a system exclusively for VOX-like data. Figure 7 shows that training the model using a variety of conditions is 16% worse on the VOX sets (SITW-Eval and Vox2-Tst) compared to using only VOX data. Yet, this loss is small compared to the gain obtained on all other sets. Again, as we saw in Section 7.2 with the issue of model selection, a large price is paid on unseen conditions in exchange for a moderate gain on seen conditions. This again highlights the importance of evaluating systems on a variety of conditions when trying to develop a system that is robust to unseen conditions.

7.8. Final Results on Evaluation Data
In this section we present complete results on all sets described in Section 6.3, excluding the ones used for development. We show results in terms of Cllr.5, which we have used for most results in this section, Cllr.01, which we use during system training and model selection, and DCF, as described in Section 6.4. We compare four systems: the two systems that we consider our baselines, PLDA and D-PLDA, the proposed D-PLDA-DD system with duration-dependent calibration, and the full D-PLDA-DSD system with both calibration stages (which we have also called DCA-PLDA).

Figure 8 shows that the D-PLDA-DSD system outperforms the two baseline systems, PLDA and D-PLDA, on all datasets without exception. Note that these baseline systems are better than in a standard implementation, since we train the models using chunked waveforms and using weights to balance the data by domain, two strategies which, as we saw in Section 7.1 (Figure 2), give a significant advantage over the standard approach of training with full waveforms and equal weights for all speakers. Comparing the D-PLDA-DD and the full D-PLDA-DSD systems, we see that in most cases the latter is better, with some exceptions on SRE data. Overall, D-PLDA-DSD gives the best trade-off across datasets.

Figure 9 shows the results for the same four systems in Figure 8 on subsets of the FBI data, which was specifically designed by the FBI for work on calibration. Again we see that the DCA-PLDA system is significantly better than both PLDA and D-PLDA. In most cases the largest gain is due to the duration dependency of the calibration stage. Yet, in the easier conditions (cond2, cond5, cond3) the additional side-information-dependent calibration stage gives a further gain.
Scoring the full matrix of trials for Vox2-Tst (i.e., all 24 million trials that can be formed by pairing the 4903 samples in the set) takes 2.0 seconds for the PLDA and D-PLDA systems (these two approaches are identical in terms of the operations needed for evaluation), 3.1 seconds for D-PLDA-DD, and 4.3 seconds for D-PLDA-DSD, on an Intel Xeon 2.53 GHz CPU. Hence, the run time for the proposed approach is about twice that of the standard PLDA approach on this large set. On smaller sets like SITW-Eval, composed of 1.4 million trials, the difference is smaller, with PLDA and D-PLDA taking about 0.18 seconds and D-PLDA-DSD taking 0.29 seconds.
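These timings are dominated by a single large matrix product. The sketch below uses a generic bilinear form as a stand-in for the actual scoring function (which has additional terms defined earlier in the paper) simply to illustrate why roughly 4903² ≈ 24 million trials can be scored in a few seconds on a CPU.

```python
import time
import numpy as np

rng = np.random.default_rng(0)
n, dim = 4903, 300                       # Vox2-Tst sample count and the LDA dimension used here
X = rng.standard_normal((n, dim))        # stand-in for the processed embeddings
W = rng.standard_normal((dim, dim))      # stand-in for a PLDA-like bilinear form

start = time.time()
scores = X @ W @ X.T                     # all n*n (about 24 million) trial scores at once
print(f"{scores.size / 1e6:.1f} million trials in {time.time() - start:.2f} s")
```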
Figure 8: Comparison of the PLDA and D-PLDA baseline systems with the two proposed systems, D-PLDA-DD and D-PLDA-DSD (also called DCA-PLDA), on all evaluation sets for the two Cllr metrics and DCF with an effective target probability of 0.01.
Figure 9: Same as Figure 8 on the FBI subsets.

8. Conclusions

We presented a novel backend approach for speaker verification which consists of a series of operations that mimic the standard PLDA backend followed by calibration. The parameters of the model are learned jointly to optimize the overall speaker verification performance of the system, directly targeting the loss function of interest in the speaker verification task. In order to achieve good generalization in terms of calibration performance across varying conditions, we introduced a duration-dependent calibration stage followed by a side-information-dependent calibration stage, where the side-information vector is meant to represent the hidden condition factors and is learned jointly and discriminatively with the rest of the model.

We compared our proposed approach with two baselines: a standard PLDA backend and a discriminatively trained PLDA-like backend, both with global calibration. We showed that the proposed backend with condition-aware calibration stages gives significant improvements over both baseline systems on a wide variety of test conditions. To our knowledge, this is the first work showing a single system providing robust out-of-the-box performance across several different conditions, some of them unseen during training.

In the future we plan to extend this work in several ways. We plan to replace the PLDA stage by a heavy-tailed PLDA formulation, which has been shown to be a better fit for the distribution of the x-vectors (Silnova et al., 2018). We will also extend this work to multi-class tasks like language identification. Finally, we will explore end-to-end training of the embedding extractor and the proposed backend.
References
Beck, S.D., Schwartz, R., Nakasone, H., 2004. A bilingual multi-modal voice corpus for language and speaker recognition (LASR) services, in: Proc. Odyssey-04, Toledo, Spain.
Bishop, C.M., 2006. Pattern Recognition and Machine Learning. Springer.
Brandschain, L., Graff, D., Walker, K., Cieri, C., 2013. Mixer 6 speech. https://catalog.ldc.upenn.edu/LDC2013S03 .
Brümmer, N., 2008. Focal bilinear toolkit. http://niko.brummer.googlepages.com/focalbilinear .
Brümmer, N., 2010a. EM for simplified PLDA. https://sites.google.com/site/nikobrummer/EMforSPLDA.pdf .
Brümmer, N., 2010b. Measuring, Refining and Calibrating Speaker and Language Information Extracted from Speech. Ph.D. thesis. Stellenbosch University.
Brümmer, N., De Villiers, E., 2010. The speaker partitioning problem, in: Proc. Odyssey-10, Brno, Czech Republic.
Brümmer, N., Doddington, G., 2013. Likelihood-ratio calibration using prior-weighted proper scoring rules, in: Proc. Interspeech, Lyon, France.
Brümmer, N., du Preez, J., 2006. Application independent evaluation of speaker detection. Computer Speech and Language 20.
Brümmer, N., du Preez, J., 2013. The PAV algorithm optimizes binary proper scoring rules. https://sites.google.com/site/nikobrummer/pav_optimizes_rbpsr.pdf .
Brümmer, N., Swart, A., van Leeuwen, D., 2014. A comparison of linear and non-linear calibrations for speaker recognition, in: Proc. Odyssey-14, Joensuu, Finland.
Burget, L., Plchot, O., Cumani, S., Glembek, O., Matejka, P., Brümmer, N., 2011. Discriminatively trained probabilistic linear discriminant analysis for speaker verification, in: Proc. ICASSP, Prague.
Cumani, S., Brümmer, N., Burget, L., Laface, P., Plchot, O., Vasilakakis, V., 2013. Pairwise discriminative speaker verification in the i-vector space. IEEE Transactions on Audio, Speech, and Language Processing 21.
Dehak, N., Kenny, P., Dehak, R., Dumouchel, P., Ouellet, P., 2011. Front-end factor analysis for speaker verification. IEEE Transactions on Audio, Speech, and Language Processing 19.
Ferrer, L., Graciarena, M., Zymnis, A., Shriberg, E., 2008. System combination using auxiliary information for speaker verification, in: Proc. ICASSP, Las Vegas.
Ferrer, L., McLaren, M., 2020a. A discriminative condition-aware backend for speaker verification, in: Proc. of ICASSP 2020, Barcelona, Spain.
Ferrer, L., McLaren, M., 2020b. A speaker verification backend for improved calibration performance across varying conditions, in: Proc. Odyssey-20, Tokyo, Japan.
Ferrer, L., Nandwana, M.K., McLaren, M., Castan, D., Lawson, A., 2019. Toward fail-safe speaker recognition: Trial-based calibration with a reject option. IEEE/ACM Trans. Audio Speech and Language Processing 27.
Ferrer, L., Sönmez, K., Kajarekar, S., 2005. Class-dependent score combination for speaker recognition, in: Proc. Interspeech, Lisbon.
Garcia-Romero, D., Espy-Wilson, C., 2011. Analysis of i-vector length normalization in speaker recognition systems, in: Proc. Interspeech, Florence, Italy.
Garcia-Romero, D., Sell, G., McCree, A., 2020. MagNetO: X-vector magnitude estimation network plus offset for improved speaker recognition, in: Proc. Odyssey-20, Tokyo, Japan.
Gneiting, T., Raftery, A.E., 2007. Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association 102.
Godfrey, J., Holliman, E., McDaniel, J., 1992. Switchboard: Telephone speech corpus for research and development, in: Proc. ICASSP, San Francisco.
Greenberg, C.S., Mason, L.P., Sadjadi, S.O., Reynolds, D.A., 2020. Two decades of speaker recognition evaluation at the National Institute of Standards and Technology. Computer Speech and Language 60.
Greenberg, C.S., Stanford, V.M., Martin, A.F., Yadagiri, M., Doddington, G.R., Godfrey, J.J., Hernandez-Cordero, J., 2013. The 2012 NIST speaker recognition evaluation, in: Proc. Interspeech, Lyon, France.
Guo, C., Pleiss, G., Sun, Y., Weinberger, K.Q., 2017. On calibration of modern neural networks, in: Proc. of the 34th International Conference on Machine Learning, Sydney, Australia.
Ioffe, S., 2006. Probabilistic linear discriminant analysis, in: Proc. of the 9th European Conference on Computer Vision, Graz, Austria.
Kenny, P., 2010. Bayesian speaker verification with heavy-tailed priors, in: Proc. Odyssey-10, Brno, Czech Republic. Keynote presentation.
Kenny, P., Boulianne, G., Ouellet, P., Dumouchel, P., 2007. Joint factor analysis versus eigenchannels in speaker recognition. IEEE Transactions on Audio, Speech, and Language Processing.
Kenny, P., Reynolds, D., Castaldo, F., 2010. Diarization of telephone conversations using factor analysis. IEEE Journal of Selected Topics in Signal Processing.
Kim, C., Stern, R., 2012. Power-normalized cepstral coefficients (PNCC) for robust speech recognition, in: Proc. ICASSP, Kyoto.
Kingma, D.P., Ba, J., 2015. Adam: A method for stochastic optimization, in: Proc. of ICLR, San Diego.
van Leeuwen, D.A., Brümmer, N., 2013. The distribution of calibrated likelihood-ratios in speaker recognition, in: Proc. Interspeech, Lyon, France.
Mandasari, M.I., Saeidi, R., van Leeuwen, D.A., 2015. Quality measures based calibration with duration and noise dependency for speaker recognition. Speech Communication 72.
Mandasari, M.I., Saeidi, R., McLaren, M., van Leeuwen, D.A., 2013. Quality measure functions for calibration of speaker recognition systems in various duration conditions. IEEE Transactions on Audio, Speech, and Language Processing 21.
Martin, A.F., Greenberg, C.S., 2009. NIST 2008 speaker recognition evaluation: Performance across telephone and room microphone channels, in: Proc. Interspeech, Brighton.
Martin, A.F., Greenberg, C.S., 2010. The NIST 2010 speaker recognition evaluation, in: Proc. Interspeech, Makuhari, Japan.
McLaren, M., Castan, D., Nandwana, M., Ferrer, L., Yilmaz, E., 2018. How to train your speaker embeddings extractor, in: Proc. of Speaker Odyssey, Les Sables d'Olonne, France.
McLaren, M., Ferrer, L., Castan, D., Lawson, A., 2016. The speakers in the wild (SITW) speaker recognition database, in: Proc. Interspeech, San Francisco.
McLaren, M., Lawson, A., Ferrer, L., Scheffer, N., Lei, Y., 2014. Trial-based calibration for speaker recognition in unseen conditions, in: Proc. Odyssey-14, Joensuu, Finland.
Morrison, G., Zhang, C., Enzinger, E., et al., 2015. Forensic database of voice recordings of 500+ Australian English speakers. http://databases.forensic-voice-comparison.net .
Nagrani, A., Chung, J.S., Xie, W., Zisserman, A., 2020. Voxceleb: Large-scale speaker verification in the wild. Computer Speech and Language 60.
Nandwana, M.K., Ferrer, L., McLaren, M., Castan, D., Lawson, A., 2019. Analysis of critical metadata factors for the calibration of speaker recognition systems, in: Proc. Interspeech, Graz, Austria.
Nandwana, M.K., Lomnitz, M., Richey, C., McLaren, M., Castan, D., Ferrer, L., Lawson, A., 2020. The VOiCES from a Distance Challenge 2019: Analysis of speaker verification results and remaining challenges, in: Proc. Odyssey-20, Tokyo, Japan.
Nautsch, A., Saeidi, R., Rathgeb, C., Busch, C., 2016. Robustness of quality-based score calibration of speaker recognition systems with respect to low-SNR and short-duration conditions, in: Proc. Odyssey-16, Bilbao, Spain.
Pezeshki, M., Kaba, S.O., Bengio, Y., Courville, A., Precup, D., Lajoie, G., 2020. Gradient starvation: A learning proclivity in neural networks. arXiv preprint arXiv:2011.09468.
Prince, S., 2007. Probabilistic linear discriminant analysis for inferences about identity, in: Proceedings of the International Conference on Computer Vision.
Przybocki, M.A., Martin, A.F., Le, A.N., 2007. NIST speaker recognition evaluations utilizing the mixer corpora - 2004, 2005, 2006. IEEE Transactions on Audio, Speech, and Language Processing 15.
Ramoji, S., Krishnan, P., Ganapathy, S., 2020. Neural PLDA modeling for end-to-end speaker verification, in: Proc. Interspeech, Shanghai, China.
Richey, C., Barrios, M.A., Armstrong, Z., Bartels, C., Franco, H., Graciarena, M., Lawson, A., Nandwana, M.K., Stauffer, A.R., van Hout, J., Gamble, P., Hetherly, J., Stephenson, C., Ni, K., 2018. Voices obscured in complex environmental settings (VOICES) corpus, in: Proc. Interspeech, Hyderabad, India.
Rohdin, J., Silnova, A., Diez, M., Plchot, O., Matejka, P., Burget, L., 2018. End-to-end DNN based speaker recognition inspired by i-vector and PLDA, in: Proc. ICASSP, Calgary, Canada.
Sadjadi, S.O., Greenberg, C.S., Reynolds, D.A., Mason, L., 2019. The 2018 NIST speaker recognition evaluation, in: Proc. Interspeech, Graz, Austria.
Sadjadi, S.O., Greenberg, C.S., Singer, E., Reynolds, D., Mason, L., Hernandez-Cordero, J., 2020. The 2019 NIST speaker recognition evaluation CTS challenge, in: Proc. Odyssey-20, Tokyo, Japan.
Sadjadi, S.O., Kheyrkhah, T., Tong, A., Greenberg, C.S., Reynolds, D.A., 2017. The 2016 NIST speaker recognition evaluation, in: Proc. Interspeech, Stockholm.
Silnova, A., Brümmer, N., Garcia-Romero, D., Snyder, D., Burget, L., 2018. Fast variational Bayes for heavy-tailed PLDA applied to i-vectors and x-vectors, in: Proc. Interspeech, Hyderabad, India.
Sizov, A., Lee, K.A., Kinnunen, T., 2014. Unifying probabilistic linear discriminant analysis variants in biometric authentication, in: Joint IAPR International Workshops on Statistical Techniques in Pattern Recognition (SPR) and Structural and Syntactic Pattern Recognition (SSPR), Springer. pp. 464–475.
Snyder, D., 2017. NIST SRE 2016 xvector recipe. https://david-ryan-snyder.github.io/2017/10/04/model_sre16_v2.html .
Snyder, D., Ghahremani, P., Povey, D., Garcia-Romero, D., Carmiel, Y., Khudanpur, S., 2016. Deep neural network-based speaker embeddings for end-to-end speaker verification, in: Proc. of Spoken Language Technology Workshop (SLT).
Solewicz, Y., Koppel, M., 2005. Considering speech quality in speaker verification fusion, in: Proc. Interspeech, Lisbon.
Solewicz, Y., Koppel, M., 2007. Using post-classifiers to enhance fusion of low- and high-level speaker recognition. IEEE Transactions on Audio, Speech, and Language Processing 15.
Tan, Z., Mak, M.W., 2017. i-Vector DNN scoring and calibration for noise robust speaker verification, in: Proc. Interspeech, Stockholm.
Van Leeuwen, D.A., Brümmer, N., 2007. An introduction to application-independent evaluation of speaker recognition systems, in: Speaker Classification I: Fundamentals, Features, and Methods. Springer-Verlag.
Walker, K., Strassel, S., 2012. The RATS radio traffic collection system, in: Proc. Odyssey-12, Singapore.
Zhang, C., Morrison, G., 2011. Forensic database of audio recordings of 68 female speakers of standard Chinese. http://databases.forensic-voice-comparison.net