Learnable MFCCs for Speaker Verification
Xuechen Liu
CNRS, Inria, LORIA, Université de Lorraine
F-54000 Nancy, France
[email protected]
Md Sahidullah
CNRS, Inria, LORIA, Université de Lorraine
F-54000 Nancy, France
[email protected]
Tomi Kinnunen
School of Computing, University of Eastern Finland
FI-80101 Joensuu, Finland
[email protected]
Abstract—We propose a learnable mel-frequency cepstral coefficients (MFCCs) front-end architecture for deep neural network (DNN) based automatic speaker verification. Our architecture retains the simplicity and interpretability of MFCC-based features while allowing the model to be flexibly adapted to data. In practice, we formulate data-driven versions of the four linear transforms in a standard MFCC extractor: windowing, discrete Fourier transform (DFT), mel filterbank and discrete cosine transform (DCT). The reported results reach up to 6.7% (VoxCeleb1) and 9.7% (SITW) relative improvement in terms of equal error rate (EER) over static MFCCs, without additional tuning effort.
Index Terms—Speaker verification, feature extraction, mel-frequency cepstral coefficients (MFCCs).
I. INTRODUCTION
Automatic speaker verification (ASV) [1] is used in forensic voice comparison, personalization of voice-based services and, more recently, smart home electronic devices. A typical ASV system can be broken down into three elementary components: (i) a frame-level feature extractor, (ii) a speaker embedding extractor, and (iii) a speaker comparator. Their functions are, respectively, to transform a waveform into a sequence of feature vectors, to extract fixed-sized speaker embedding vectors, and to compare two speaker embeddings (one from an enrollment and the other from a test utterance).

While previous generations of ASV technology relied largely on statistical approaches such as i-vectors [2], state-of-the-art ASV leverages deep neural networks (DNNs) to extract speaker embeddings. Representative examples include the d-vector [3] and x-vector [4], and a number of extensions to them have been proposed as well [5], [6]. Common to most of these speaker embedding extractors is the use of either mel-frequency cepstral coefficients (MFCCs) [7] or the raw spectrum as frame-level features. Unlike the speaker embedding extractor, whose parameters are obtained through numerical optimization, raw spectra and MFCCs are obtained with fixed/static operations. In this work, our main goal is to formulate a lightweight, data-driven version of a standard MFCC extractor.

Related recent work includes so-called end-to-end [8], [9] front-end solutions. Using DNN-based components that are optimized jointly, such end-to-end solutions process the raw
waveform to produce either detection scores or intermediate features to be used with other components. Despite promising results, the end-to-end approaches tend to require substantial engineering effort, making them potentially inflexible to adapt to new applications or data. Additionally, unless some prior domain knowledge is used in designing the DNN components, such models can be difficult to interpret. Meanwhile, analysis and assessment of the relative importance of different signal processing components is important in speech-related research. Interpretability is also demanded in high-stakes applications, such as forensic voice comparison.

For the reasons above, we advocate a novel architecture whose design is guided by one of the most successful fixed feature extractors, MFCCs. Even if an MFCC extractor is typically not viewed as a neural network, it can be seen as a DNN consisting of a number of linear layers (and some non-linearities). It is therefore natural to expand the speaker embedding extractor to include MFCC-specific layers that are optimized in a data-driven manner. This is enabled by defining a computational graph and the associated automatic differentiation procedures available in standard DNN toolboxes. Though we draw inspiration from similar ideas in other tasks (e.g. [10], [11]), our aim is an initial formulation and an experimental feasibility study in the context of ASV.

This work was partially supported by the Academy of Finland (project 309629) and Inria Nancy Grand Est.

II. LEARNABLE MFCC EXTRACTOR
A. Front-end MFCC extractor
A typical MFCC extractor consists of a cascade of linear and non-linear transformations originally motivated [7] by signal processing and human auditory system considerations. Typical steps (after pre-processing) include windowing, power spectrum computation using the discrete Fourier transform (DFT), smoothing by a bank of triangular-shaped filters, logarithmic compression and the discrete cosine transform (DCT).

MFCCs have been used successfully across many different speech and audio applications, suggesting their generality as an application-agnostic frame-level feature. Nonetheless, the standard transformations in MFCC extractors may be improved further. For instance, [12] uses low-variance multitaper spectrum estimation to replace the Hamming-windowed DFT. Other studies employ alternative time-frequency representations, such as the constant-Q transform (CQT) [13] and wavelet-based methods [14], [15]. Different frequency warping scales are studied in [16]. Similarly, the triangular filterbank can be replaced with Gaussian and Gammatone filterbanks [17]. The logarithmic compression can also be substituted with cube-root compression [18]. The suitability of block DCT as an alternative to the standard DCT (i.e., DCT-II) is explored in [19]. All the above studies develop alternative fixed operations that overcome some limitations of the existing ones. Differently from those studies, we propose to optimize the parameters of the MFCC pipeline in a data-driven manner. We consider making components learnable based on static MFCCs only, as dynamic (delta) coefficients were not found useful in our previous work [20].

B. Differentiable linear transforms for MFCCs
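To make the static cascade described above concrete, here is a minimal NumPy sketch of the pipeline that the learnable variant mirrors. The helper names, the 30-coefficient default and the 16 kHz sampling rate are our illustrative assumptions, not the paper's exact configuration:

```python
import numpy as np

def mel_filterbank(n_mels, n_fft, sr):
    """Triangular filters spaced uniformly on the mel scale."""
    hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    hz = mel_to_hz(np.linspace(0.0, hz_to_mel(sr / 2), n_mels + 2))
    bins = np.floor((n_fft + 1) * hz / sr).astype(int)
    M = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        M[i - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)  # rising edge
        M[i - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)  # falling edge
    return M

def static_mfcc(frame, sr=16000, n_mels=30, n_mfcc=30):
    """Windowing -> DFT power spectrum -> mel smoothing -> log -> DCT."""
    N = len(frame)
    window = 0.54 - 0.46 * np.cos(2 * np.pi * np.arange(N) / (N - 1))  # Hamming
    power = np.abs(np.fft.rfft(frame * window)) ** 2                   # |DFT|^2
    logmel = np.log(mel_filterbank(n_mels, N, sr) @ power + 1e-10)     # compression
    k = np.arange(n_mfcc)[:, None]
    n = np.arange(n_mels)[None, :]
    dct = np.cos(np.pi / n_mels * (n + 0.5) * k)                       # DCT-II basis
    return dct @ logmel
```

Every step except the modulus and the logarithm is a plain matrix multiplication, which is exactly why each admits a trainable replacement.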
With the above motivations, we start from fixed MFCCs and make the four highlighted linear transforms learnable. Three of them are real-valued, namely windowing, the mel filterbank and the DCT. When designing their learnable counterparts, for each component we simply create operators that have the same input and output as the static ones, so that we retain the exact same computational flow. The only difference from the static feature extractor is that the gradient can now be back-propagated to update the numerical values of the linear transforms. The DFT is an exception since it is a complex-valued linear operator. Nonetheless, when integrated as a step to produce a power spectrum, the operation can be expressed as:

|X|^2 = |Fx|^2 = |(F_real + j F_imag) x|^2 = g(F_real x) + g(F_imag x),   (1)

where g(Fx) = |Fx|^2 element-wise. Here, x is a windowed speech frame, X is the complex-valued spectrum, F is the complex-valued DFT matrix and |.| denotes the element-wise modulus. Thus, (1) can be implemented as two real-valued linear transforms, followed by squaring and summation.

III. OPTIMIZATION OF LEARNABLE COMPONENTS
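As a concrete starting point for the optimization techniques described next, the complex DFT of Eq. (1) can be realized as a pair of trainable real-valued kernels whose squared responses are summed. This is a hypothetical PyTorch sketch (the class and parameter names are ours); at initialization its output equals the standard power spectrum:

```python
import math
import torch
import torch.nn as nn

class LearnablePowerSpectrum(nn.Module):
    """Eq. (1): |X|^2 = g(F_real x) + g(F_imag x) with g(u) = u^2 element-wise.
    Both kernels start from the complex DFT matrix and remain trainable."""
    def __init__(self, n):
        super().__init__()
        k = torch.arange(n, dtype=torch.float64)
        ang = 2.0 * math.pi * torch.outer(k, k) / n   # angles 2*pi*k*m/N
        self.F_real = nn.Parameter(torch.cos(ang))    # real part of DFT matrix
        self.F_imag = nn.Parameter(-torch.sin(ang))   # imaginary part

    def forward(self, x):
        # x: (batch, n) windowed frames -> (batch, n) power spectra
        return (x @ self.F_real.T) ** 2 + (x @ self.F_imag.T) ** 2
```

Because both kernels are ordinary parameters, gradients from the speaker classification loss flow into them exactly as into any other layer.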
We describe below three techniques to optimize the selected components. We refer to the corresponding matrices as kernels, denoted by specific symbols: W for the window function, F for the DFT (as in Eq. (1)), M for the mel filterbank, and D for the DCT.

A. Kernelized initialization
Trainable components are often initialized using random numbers drawn from a normal distribution [21]. In this work, however, we assert that a standard MFCC extractor serves as a reasonable starting point for further learning. Thus, our first technique initializes each kernel with its corresponding static counterpart. For windowing, we use the Hamming window [22]. For the mel filterbank and DCT, static counterparts are directly available and used in place. For the DFT, we generate kernels from the DFT matrix, separating the real and imaginary parts F_real and F_imag of Eq. (1). After initialization, training proceeds the same way as for any standard DNN-based speaker embedding extractor.

The kernel initialization sets a starting point for further adaptation. We consider two additions to the training procedure. The idea in both is to promote specific numerical properties of each static component in order to regulate learning, discouraging overly aggressive deviation from the respective static counterparts. We detail the two ideas, loss regularization and kernel update, in the following two subsections.

B. Loss regularization
We modify the training objective of the speaker embedding extractor as L_new = L + λ · g_loss(K), where L is the multi-class cross-entropy loss, L_new is the regularized loss, K denotes the kernel, and λ is a regularization constant fixed for all experiments in this work. In Section V, systems adapted with this method are marked as name + loss., where name is the name of the adapted component. We design a separate regularizer g_loss(·) for each of the four linear components.

Windowing. Many window functions (e.g. Hamming and Blackman) are generated using sinusoids [22]. Thus, our regularizer measures the distance from the learnable window to a cosine function: g_loss(W) = ||W_norm − C||, where C(n) = −cos(2πn/M), n ∈ [0, M−1], is a cosine function with M equal to the frame length (i.e. the length of the window vector), W_norm is the mean-normalized window, and ||.|| denotes the Frobenius norm [23]. Therefore, when the constraint equals zero, the window equals a cosine function.

DFT. A DFT matrix is square and symmetric, and it can be split into real and imaginary parts, both of which are real-valued, square and symmetric. We therefore introduce this property into the regularization by computing the matrix-wise distance of the kernel to its symmetric version: F_dist = F_norm − F_norm F_norm^T, where F_dist is the difference matrix and F_norm is the normalized version of F. This applies to both F_real and F_imag in Eq. (1). The Frobenius norm of F_dist is then used for regularization: g_loss(F) = ||F_dist||. Therefore, when the constraint is perfectly met (g_loss(F) = 0), we see that F_norm^T = I, where I is an identity matrix.

Mel filterbank. The mel filterbank is a set of overlapping triangular filters with scaled peak magnitudes, which can be either constant across all filters (our case) or varied over frequency bins [24]. Computationally, it is a highly sparse matrix with non-negative elements.
In order to control the level of sparsity of the kernel, we adopt L2 regularization [25] on the filterbank kernel to avoid over-fitting, instead of L1, which as a loss regularizer tends to further enhance the sparsity of the model. Formally, g_loss(M) = ||M||.

DCT. A DCT matrix is orthonormal, i.e. DD^T = D^T D = I. We employ a recently-proposed soft orthonormality loss function [26], expressed as g_loss(D) = ||D^T D − I||, where I is the identity matrix. Optimizing this loss function minimizes the distance between the Gram matrix of D and I, encouraging orthonormality.

C. Kernel update
Aside from loss regularization, the other optimization technique performs a direct update on the kernel operators every time after a gradient update. Compared to loss regularization, it is a more 'brute-force' approach. The updated kernel matrix or vector is then used directly in the next iteration: K_new = g_kernel(K), where K is the kernel matrix after the gradient update and K_new is the directly-updated one used for the next iteration. In Section V, systems adapted with this method are marked as name + kernel., where name is the name of the adapted component. The design of the updater g_kernel(·) for each component is as follows.

Windowing. Commonly-used window functions are non-negative and symmetric. Inspired by these properties, our kernel update is g_kernel(W) = |cat(W[:M/2], W_flip)|, where W[:M/2] denotes the half-length truncated version of the window vector W and W_flip is its flipped (time-reversed) version. Here cat(·) performs column-wise concatenation and |·| denotes absolute values.

DFT. As noted above, DFT matrices are square and symmetric. To enhance these properties, we perform a simple update on the kernel: g_kernel(F) = F F^T, so that the kernel used in the next iteration, F_new = F F^T, is indeed symmetric, since (F F^T)^T = (F^T)^T F^T = F F^T. As in the loss regularization scheme, this update is applied to both F_real and F_imag.

Mel filterbank. As the mel filterbank is a set of overlapping triangular filters with non-negative values, we force positivity by replacing negative elements with a small value: for all i, j, g_kernel(M_{i,j}) = ε if M_{i,j} ≤ 0, otherwise M_{i,j}, where ε is a small positive constant, and i and j denote the row and column indices of the filterbank M.

DCT. For the DCT, we again capture the orthonormality requirement of its static counterpart by performing QR decomposition [23] on the learnt kernel matrix: g_kernel(D) = Q, where the QR decomposition factorizes D = QR and we keep only the orthogonal matrix Q.
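The four regularizers of Sec. III-B and the four updaters above might be sketched as follows in PyTorch; the function names, the normalization of F, and the value of ε are our own illustrative choices:

```python
import math
import torch

# --- Sec. III-B: loss regularizers, used as L_new = L + lam * g_loss(K) ---
def g_loss_window(W):
    """Distance of the mean-normalized window to a cosine shape."""
    M = W.numel()
    C = -torch.cos(2 * math.pi * torch.arange(M, dtype=W.dtype) / M)
    return torch.norm((W - W.mean()) - C)

def g_loss_dft(F):
    """Penalize asymmetry: F_dist = F_norm - F_norm F_norm^T."""
    Fn = F / (torch.norm(F) + 1e-8)   # normalization choice is an assumption
    return torch.norm(Fn - Fn @ Fn.T)

def g_loss_melbank(M):
    return torch.norm(M)              # L2 penalty, not sparsity-promoting L1

def g_loss_dct(D):
    I = torch.eye(D.shape[1], dtype=D.dtype)
    return torch.norm(D.T @ D - I)    # soft orthonormality on the Gram matrix

# --- Sec. III-C: direct kernel updates applied after each gradient step ---
def g_kernel_window(W):
    half = W[: W.numel() // 2]        # keep first half, mirror it, force >= 0
    return torch.abs(torch.cat([half, torch.flip(half, dims=[0])]))

def g_kernel_dft(F):
    return F @ F.T                    # symmetric by construction

def g_kernel_melbank(M, eps=1e-6):    # eps value is an assumption
    return torch.where(M <= 0, torch.full_like(M, eps), M)

def g_kernel_dct(D):
    Q, _ = torch.linalg.qr(D)         # re-orthonormalize the square kernel
    return Q
```

In a training loop, a kernel update would be applied in place after the optimizer step, e.g. `with torch.no_grad(): W.copy_(g_kernel_window(W))`.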
Such an operation is possible because the learnt kernel corresponding to the DCT is set to be a square matrix, i.e. the number of mel filters equals the number of output cepstral coefficients. We adopt this design choice because setting the number of filters equal to the final static feature dimension yields competitive performance, as shown in [20]. This applies to all experiments in this work, including the baseline.

IV. DATA AND EXPERIMENTAL PROTOCOL
A. Data
We trained the baseline x-vector model on the dev partition [27] of VoxCeleb1, which consists of 1211 speakers, and used the same dataset for the additional training of the learnable linear components. For evaluation, we considered one matched and one relatively mismatched condition. For the former, we used the test partition of VoxCeleb1, consisting of 40 speakers, 18860 genuine trials and the same number of impostor trials [27]. The latter comprised the development part of the speakers-in-the-wild (SITW) corpus [28], "core-core" condition, containing 119 speakers, 2597 genuine and 335629 impostor trials. We refer to the two datasets as Voxceleb1-test and SITW-DEV, respectively.
B. System configuration
For the baseline system, we used 30 static MFCCs as the input features and replicated the x-vector configuration from [4] as the speaker embedding extractor. We trained the model on VoxCeleb1 without any data augmentation, using Adam [29] as the optimizer. At test time, we extracted 512-dimensional speaker embeddings from the first fully-connected layer after statistics pooling.

We adapted each of the four learnable front-end components one at a time, using the same data as for training. To avoid joint optimization from scratch, and to meet the aim of providing a lightweight interface for adaptation, the selected component was jointly optimized with the pre-trained baseline x-vector. After adaptation, speaker embeddings for all systems with learnt front-end components were extracted in the same manner as for the baseline.

For all systems, we applied energy-based speech activity detection (SAD) before feature extraction, together with cepstral mean normalization (CMN). All embeddings extracted at inference time were length-normalized and centered prior to being transformed by 200-dimensional linear discriminant analysis (LDA). Scoring was implemented with a probabilistic linear discriminant analysis (PLDA) [30] classifier. We used Kaldi for data preparation and PLDA training, and PyTorch [31] for all DNN-related training and inference experiments.
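The embedding post-processing chain just described (length normalization, centering, LDA projection) can be sketched as follows; the function name and shapes are our assumptions, and PLDA scoring itself is omitted:

```python
import numpy as np

def backend_transform(embedding, global_mean, lda_matrix):
    """Length-normalize, center, then project a 512-dim x-vector to 200 dims.
    The result is what would be fed to the PLDA scorer."""
    e = embedding / (np.linalg.norm(embedding) + 1e-8)  # length normalization
    return lda_matrix @ (e - global_mean)               # centering + LDA (200 x 512)
```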
C. Evaluation

Equal error rate (EER) and minimum detection cost function (minDCF) were used to measure ASV performance. MinDCF was computed with detection costs C_FA = C_miss = 1 and a fixed target speaker prior p. We used BOSARIS [32] to produce selected detection error trade-off (DET) curves.

Fig. 1. Loss propagation for baseline and adapted MFCC components. Best viewed in color.
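For completeness, the EER metric just mentioned, i.e. the operating point where the miss and false-alarm rates coincide, can be computed from raw trial scores as follows (a minimal NumPy sketch, not the BOSARIS implementation):

```python
import numpy as np

def compute_eer(target_scores, nontarget_scores):
    """Sweep a threshold over the sorted scores and return the point where
    the false-negative (miss) and false-positive (false alarm) rates cross."""
    scores = np.concatenate([target_scores, nontarget_scores])
    labels = np.concatenate([np.ones(len(target_scores)),
                             np.zeros(len(nontarget_scores))])
    labels = labels[np.argsort(scores)]                      # ascending scores
    fnr = np.cumsum(labels) / labels.sum()                   # targets rejected so far
    fpr = 1.0 - np.cumsum(1 - labels) / (1 - labels).sum()   # non-targets still accepted
    idx = np.argmin(np.abs(fnr - fpr))
    return 0.5 * (fnr[idx] + fpr[idx])
```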
V. RESULTS
Before presenting the ASV results, we show in Fig. 1 the validation loss (on the dev set) of our baseline and adapted systems. The baseline x-vector system (with fixed MFCC components) was pre-trained for 1000 iterations, followed by another 1000 iterations to adapt the MFCC
Fig. 2. DET plots on SITW-DEV for (a) windowing, (b) DFT, (c) mel filterbank and (d) DCT (miss rate vs. false alarm rate, both in %). Best viewed in color.

TABLE I
RESULTS ON Voxceleb1-test AND SITW-DEV. ALL SYSTEMS ASIDE FROM THE BASELINE USE KERNEL INITIALIZATION.

                       Voxceleb1-test       SITW-DEV
Operator (+optim.)     EER(%)  minDCF      EER(%)  minDCF
Baseline MFCC          4.64    0.6071      6.72    0.8243
Window                 4.51    0.5544
DCT                    4.36    0.5572      6.27    0.7950
DCT + loss.            4.46

components. The adaptation, especially for the window function and DFT, results in a notable decrease of the validation loss. This indicates the potential of making the components of an MFCC extractor learnable. The ASV results are reported in Table I.
A. ASV results on Voxceleb1-test
Concerning windowing, all three adapted variants outperform the baseline, with loss regularization and kernel update being particularly effective. The results indicate the usefulness of retaining the symmetry and positivity of the window.

Concerning the DFT, simply letting it be data-driven (without added regularization or kernel update) yields the lowest EER among all systems, a relative improvement of 6.7% over the baseline. In fact, the additional regularization and direct update are detrimental, which points to a potential weakness of our symmetry constraint.

Concerning the mel filterbank, Melbank + kernel. yields the best performance among the three adapted variants, with the best minDCF of all systems, improving on the baseline by 6.7% relative. This indicates the importance of enforcing positivity of the learnt filters.

Concerning the DCT, as with windowing, all the learning schemes improve upon the baseline. While QR decomposition does not bring a notable positive impact, the orthonormality-enhancing loss regularization results in slightly worse EER but improved minDCF. In fact, DCT + loss. attains the lowest minDCF among all systems.
B. ASV results on SITW-DEV
We now move on to the ASV results on the more challenging SITW-DEV data. Overall, the data-driven components now yield a more pronounced performance boost over the baseline. Adapting the window function is the most effective, with a relative improvement of 9.7% in EER over the baseline. Concerning the DFT, DFT + loss. slightly outperforms the baseline in both metrics, while DFT + kernel. is the only variant that does not reach the baseline EER. This finding is in line with the VoxCeleb1-test results. Concerning the mel filterbank, all three systems outperform the baseline and achieve performance competitive with the learnable DCT, reflecting the filterbank's potential for improving system robustness when made adaptable.

DET curves for the single systems, including the baseline, on SITW-DEV are shown in Fig. 2. The curves generally agree with the observations from Table I. Systems with an adapted window function produce the largest improvement over the baseline, consistent with their EERs. At operating points that are less strict on false alarms, variants such as DCT + kernel. and window + loss. perform notably well and merit consideration.

VI. CONCLUSION
We conducted an initial study on a lightweight learnable MFCC feature extractor as a compromise between end-to-end architectures and hand-crafted feature extractors. Our initial results on SITW-DEV are promising: the proposed scheme improved upon the baseline MFCC extractor, with the optimized window and mel filterbank being particularly promising. Due to our domain-specific optimization constraints, the learnt representations bear close resemblance to the fixed MFCC operations. For interpretability and computational reasons, we restricted our focus to the optimization of individual MFCC extractor components; joint optimization of all four linear components is left as future work. Similarly, the work can be extended to other deep models, such as extended TDNN and ResNet, using larger datasets and data augmentation.
REFERENCES
[1] J. H. L. Hansen and T. Hasan, "Speaker recognition by machines and humans: A tutorial review," IEEE Signal Processing Magazine, vol. 32, no. 6, pp. 74–99, 2015.
[2] N. Dehak, P. J. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, "Front-end factor analysis for speaker verification," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 4, pp. 788–798, May 2011.
[3] E. Variani et al., "Deep neural networks for small footprint text-dependent speaker verification," in Proc. ICASSP, 2014, pp. 4052–4056.
[4] D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur, "X-vectors: Robust DNN embeddings for speaker recognition," in Proc. ICASSP, 2018, pp. 5329–5333.
[5] L. You, W. Guo, D. Li, and J. Du, "Multi-task learning with high-order statistics for x-vector based text-independent speaker verification," in Proc. INTERSPEECH, 2019, pp. 1158–1162.
[6] D. Snyder, D. Garcia-Romero, G. Sell, A. McCree, D. Povey, and S. Khudanpur, "Speaker recognition for multi-speaker conversations using x-vectors," in Proc. ICASSP, 2019, pp. 5796–5800.
[7] S. Davis and P. Mermelstein, "Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences," IEEE Transactions on Acoustics, Speech, and Signal Processing, pp. 357–366, 1980.
[8] M. Ravanelli and Y. Bengio, "Speaker recognition from raw waveform with SincNet," in Proc. SLT, 2018, pp. 1021–1028.
[9] N. Zeghidour, N. Usunier, I. Kokkinos, T. Schaiz, G. Synnaeve, and E. Dupoux, "Learning filterbanks from raw speech for phone recognition," 2018, pp. 5509–5513.
[10] S. Vasquez and M. Lewis, "MelNet: A generative model for audio in the frequency domain," arXiv preprint arXiv:1906.01083, 2019.
[11] M. Pariente, S. Cornell, A. Deleforge, and E. Vincent, "Filterbank design for end-to-end speech separation," in Proc. ICASSP, 2020, pp. 6364–6368.
[12] T. Kinnunen et al., "Low-variance multitaper MFCC features: A case study in robust speaker verification," IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 7, pp. 1990–2001, 2012.
[13] M. Todisco, H. Delgado, and N. Evans, "Articulation rate filtering of CQCC features for automatic speaker verification," in Proc. INTERSPEECH, 2016, pp. 3628–3632.
[14] J. Andén and S. Mallat, "Deep scattering spectrum," IEEE Transactions on Signal Processing, vol. 62, no. 16, pp. 4114–4128, 2014.
[15] O. Farooq and S. Datta, "Mel filter-like admissible wavelet packet structure for speech recognition," IEEE Signal Processing Letters, vol. 8, no. 7, pp. 196–198, 2001.
[16] S. Umesh, L. Cohen, and D. Nelson, "Fitting the mel scale," 1999, vol. 1, pp. 217–220.
[17] C. Kim and R. M. Stern, "Power-normalized cepstral coefficients (PNCC) for robust speech recognition," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 7, pp. 1315–1329, 2016.
[18] H. Hermansky, "Perceptual linear predictive (PLP) analysis of speech," The Journal of the Acoustical Society of America, vol. 87, no. 4, pp. 1738–1752, 1990.
[19] A. K. Naveena and N. K. Narayanan, "Block DCT coefficients and histogram for image retrieval," 2017, pp. 48–52.
[20] X. Liu, M. Sahidullah, and T. Kinnunen, "A comparative re-assessment of feature extractors for deep speaker embeddings," in Proc. INTERSPEECH, 2020.
[21] X. Glorot and Y. Bengio, "Understanding the difficulty of training deep feed-forward neural networks," in Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, 2010, pp. 249–256, PMLR.
[22] F. J. Harris, "On the use of windows for harmonic analysis with the discrete Fourier transform," Proceedings of the IEEE, vol. 66, no. 1, pp. 51–83, 1978.
[23] G. H. Golub and C. F. Van Loan, Matrix Computations, The Johns Hopkins University Press, 1996.
[24] R. Lawrence and J. Hwang, Fundamentals of Speech Recognition, Prentice-Hall, Inc., USA, 1993.
[25] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning, Springer Series in Statistics, Springer, New York, NY, USA, 2001.
[26] Y. Zhu and B. Mak, "Orthogonality regularizations for end-to-end speaker verification," in Proc. Odyssey 2020 The Speaker and Language Recognition Workshop, 2020, pp. 17–23.
[27] A. Nagrani, J. Chung, and A. Zisserman, "VoxCeleb: A large-scale speaker identification dataset," in Proc. INTERSPEECH, 2017, pp. 2616–2620.
[28] M. McLaren, L. Ferrer, D. Castán Lavilla, and A. Lawson, "The speakers in the wild (SITW) speaker recognition database," in Proc. INTERSPEECH, 2016, pp. 818–822.
[29] D. Kingma and J. Ba, "Adam: A method for stochastic optimization," 2015.
[30] S. Ioffe, "Probabilistic linear discriminant analysis," in Computer Vision – ECCV 2006, A. Leonardis, H. Bischof, and A. Pinz, Eds., Berlin, Heidelberg, 2006, pp. 531–542, Springer Berlin Heidelberg.
[31] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, L. Zeming, A. Desmaison, L. Antiga, and A. Lerer, "Automatic differentiation in PyTorch," in