Learnable MFCCs for Speaker Verification
Xuechen Liu
CNRS, Inria, LORIA, Université de Lorraine
F-54000 Nancy, France
[email protected]
Md Sahidullah
CNRS, Inria, LORIA, Université de Lorraine
F-54000 Nancy, France
[email protected]
Tomi Kinnunen
School of Computing, University of Eastern Finland
FI-80101 Joensuu, Finland
[email protected]
Abstract—We propose a learnable mel-frequency cepstral coefficients (MFCCs) front-end architecture for deep neural network (DNN) based automatic speaker verification. Our architecture retains the simplicity and interpretability of MFCC-based features while allowing the model to be flexibly adapted to data. In practice, we formulate data-driven versions of the four linear transforms in a standard MFCC extractor: windowing, discrete Fourier transform (DFT), mel filterbank and discrete cosine transform (DCT). The reported results reach up to 6.7% (VoxCeleb1) and 9.7% (SITW) relative improvement in terms of equal error rate (EER) over static MFCCs, without additional tuning effort.
Index Terms—Speaker verification, feature extraction, mel-frequency cepstral coefficients (MFCCs).
I. INTRODUCTION
Automatic speaker verification (ASV) [1] is used in forensic voice comparison, personalization of voice-based services and, more recently, smart home electronic devices. A typical ASV system can be broken down into three elementary components: (i) a frame-level feature extractor, (ii) a speaker embedding extractor, and (iii) a speaker comparator. Their functions are, respectively, to transform a waveform into a sequence of feature vectors, to extract fixed-sized speaker embedding vectors, and to compare two speaker embeddings (one from an enrollment and the other from a test utterance).

While previous generations of ASV technology relied largely on statistical approaches such as i-vectors [2], state-of-the-art ASV leverages deep neural networks (DNNs) to extract speaker embeddings. Representative examples include the d-vector [3] and x-vector [4], and a number of extensions to them have been proposed as well [5], [6]. Common to most of these speaker embedding extractors is the use of either mel-frequency cepstral coefficients (MFCCs) [7] or the raw spectrum as frame-level features. Unlike the speaker embedding extractor, whose parameters are obtained through numerical optimization, raw spectra and MFCCs are obtained with fixed/static operations. In this work, our main goal is to formulate a lightweight, data-driven version of a standard MFCC extractor.

Related recent work includes so-called end-to-end [8], [9] front-end solutions. Using DNN-based components that are optimized jointly, such end-to-end solutions process the raw
waveform to produce either detection scores or intermediate features to be used with other components. Despite promising results, the end-to-end approaches tend to require substantial engineering effort, making them potentially inflexible to adapt to new applications or data. Additionally, unless some prior domain knowledge is used in designing the DNN components, such models can be difficult to interpret. Meanwhile, analysis and assessment of the relative importance of different signal processing components is important in speech-related research. Interpretability is also demanded in high-stakes applications, such as forensic voice comparison.

For the reasons above, we advocate a novel architecture whose design is guided by one of the most successful fixed feature extractors, MFCCs. Even if an MFCC extractor is typically not viewed as a neural network, it can be seen as a DNN consisting of a number of linear layers (and some non-linearities). It is therefore natural to expand the speaker embedding extractor to include MFCC-specific layers that are optimized in a data-driven manner. This is enabled by defining a computational graph and the associated automatic differentiation procedures available in standard DNN toolboxes. Though we draw inspiration from similar ideas in other tasks (e.g. [10], [11]), our aim is an initial formulation and an experimental feasibility study in the context of ASV.

This work was partially supported by the Academy of Finland (project 309629) and Inria Nancy Grand Est.

II. LEARNABLE MFCC EXTRACTOR
A. Front-end MFCC extractor
A typical MFCC extractor consists of a cascade of linear and non-linear transformations originally motivated [7] by signal processing and human auditory system considerations. Typical steps (after pre-processing) include windowing, power spectrum computation using the discrete Fourier transform (DFT), smoothing by a bank of triangular-shaped filters, logarithmic compression and the discrete cosine transform (DCT).

MFCCs have been used successfully across many different speech and audio applications, suggesting their generality as an application-agnostic frame-level feature. Nonetheless, the standard transformations in MFCC extractors may be improved further. For instance, [12] uses low-variance multitaper spectrum estimation to replace the Hamming-windowed DFT. Other studies employ alternative time-frequency representations, such as the constant-Q transform (CQT) [13] and wavelet-based methods [14], [15]. Different frequency warping scales are studied in [16]. Similarly, the triangular filterbank can be replaced with Gaussian and Gammatone filterbanks [17]. The logarithmic compression can also be substituted with cube-root compression [18]. The suitability of block DCT as an alternative to the standard DCT (i.e., DCT-II) is explored in [19]. All the above studies develop alternative fixed operations that overcome some limitations of the existing ones. Differently from those studies, we propose to optimize the parameters of the MFCC pipeline in a data-driven manner. We consider making components learnable based on static MFCCs only, as dynamic (delta) coefficients were not found useful in our previous work [20].

B. Differentiable linear transforms for MFCCs
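To make the static cascade described above concrete, here is a minimal NumPy sketch of the pipeline that the learnable variant mirrors. The helper names, the 30-coefficient default and the 16 kHz sampling rate are our illustrative assumptions, not the paper's exact configuration:

```python
import numpy as np

def mel_filterbank(n_mels, n_fft, sr):
    """Triangular filters spaced uniformly on the mel scale."""
    hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    hz = mel_to_hz(np.linspace(0.0, hz_to_mel(sr / 2), n_mels + 2))
    bins = np.floor((n_fft + 1) * hz / sr).astype(int)
    M = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        M[i - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)  # rising edge
        M[i - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)  # falling edge
    return M

def static_mfcc(frame, sr=16000, n_mels=30, n_mfcc=30):
    """Windowing -> DFT power spectrum -> mel smoothing -> log -> DCT."""
    N = len(frame)
    window = 0.54 - 0.46 * np.cos(2 * np.pi * np.arange(N) / (N - 1))  # Hamming
    power = np.abs(np.fft.rfft(frame * window)) ** 2                   # |DFT|^2
    logmel = np.log(mel_filterbank(n_mels, N, sr) @ power + 1e-10)     # compression
    k = np.arange(n_mfcc)[:, None]
    n = np.arange(n_mels)[None, :]
    dct = np.cos(np.pi / n_mels * (n + 0.5) * k)                       # DCT-II basis
    return dct @ logmel
```

Every step except the modulus and the logarithm is a plain matrix multiplication, which is exactly why each admits a trainable replacement.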
With the above motivations, we start from fixed MFCCs and make the four highlighted linear transforms learnable. Three of them are real-valued, namely windowing, the mel filterbank and the DCT. When designing their learnable counterparts, for each component we simply create operators that have the same input and output as the static ones, so that we retain the exact same computational flow. The only difference from the static feature extractor is that the gradient can now be back-propagated to update the numerical values of the linear transforms. The DFT is an exception since it is a complex-valued linear operator. Nonetheless, when integrated as a step to produce a power spectrum, the operation can be expressed as:

|X|^2 = |Fx|^2 = |(F_real + j F_imag) x|^2 = g(F_real x) + g(F_imag x),   (1)

where g(Fx) = |Fx|^2 element-wise. Here, x is a windowed speech frame, X is the complex-valued spectrum, F is the complex-valued DFT matrix and |.| denotes the element-wise modulus. Thus, (1) can be implemented as two real-valued linear transforms, followed by squaring and summation.

III. OPTIMIZATION OF LEARNABLE COMPONENTS
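As a concrete starting point for the optimization techniques described next, the complex DFT of Eq. (1) can be realized as a pair of trainable real-valued kernels whose squared responses are summed. This is a hypothetical PyTorch sketch (the class and parameter names are ours); at initialization its output equals the standard power spectrum:

```python
import math
import torch
import torch.nn as nn

class LearnablePowerSpectrum(nn.Module):
    """Eq. (1): |X|^2 = g(F_real x) + g(F_imag x) with g(u) = u^2 element-wise.
    Both kernels start from the complex DFT matrix and remain trainable."""
    def __init__(self, n):
        super().__init__()
        k = torch.arange(n, dtype=torch.float64)
        ang = 2.0 * math.pi * torch.outer(k, k) / n   # angles 2*pi*k*m/N
        self.F_real = nn.Parameter(torch.cos(ang))    # real part of DFT matrix
        self.F_imag = nn.Parameter(-torch.sin(ang))   # imaginary part

    def forward(self, x):
        # x: (batch, n) windowed frames -> (batch, n) power spectra
        return (x @ self.F_real.T) ** 2 + (x @ self.F_imag.T) ** 2
```

Because both kernels are ordinary parameters, gradients from the speaker classification loss flow into them exactly as into any other layer.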
We describe below three techniques to optimize the selected components. We refer to the corresponding matrices as kernels, denoted by specific symbols: W for the window function, F for the DFT (as in Eq. (1)), M for the mel filterbank, and D for the DCT.

A. Kernelized initialization
Trainable components are often initialized using random numbers drawn from a normal distribution [21]. In this work, however, we assert that a standard MFCC extractor serves as a reasonable starting point for further learning. Thus, our first technique initializes each kernel with its corresponding static counterpart. For windowing, we use the Hamming window [22]. For the mel filterbank and DCT, static counterparts are directly available and used in place. For the DFT, we generate kernels from the DFT matrix, separating the real and imaginary parts F_real and F_imag of Eq. (1). After initialization, training proceeds the same way as for any standard DNN-based speaker embedding extractor.

The kernel initialization sets a starting point for further adaptation. We consider two additions to the training procedure. The idea in both is to promote specific numerical properties of each static component in order to regulate learning, discouraging overly aggressive deviation from the respective static counterparts. We detail the two ideas, loss regularization and kernel update, in the following two subsections.

B. Loss regularization
We modify the training objective of the speaker embedding extractor as L_new = L + λ · g_loss(K), where L is the multi-class cross-entropy loss, L_new is the regularized loss, K denotes the kernel, and λ is a regularization constant fixed for all experiments in this work. In Section V, systems adapted with this method are marked as name + loss., where name is the name of the adapted component. We design a separate regularizer g_loss(·) for each of the four linear components.

Windowing. Many window functions (e.g. Hamming and Blackman) are generated using sinusoids [22]. Thus, our regularizer measures the distance from the learnable window to a cosine function: g_loss(W) = ||W_norm − C||, where C(n) = −cos(2πn/M), n ∈ [0, M−1], is a cosine function with M equal to the frame length (i.e. the length of the window vector), W_norm is the mean-normalized window, and ||.|| denotes the Frobenius norm [23]. Therefore, when the constraint equals zero, the window equals a cosine function.

DFT. A DFT matrix is square and symmetric, and it can be split into real and imaginary parts, both of which are real-valued, square and symmetric. We therefore introduce this property into the regularization by computing the matrix-wise distance of the kernel to its symmetric version: F_dist = F_norm − F_norm F_norm^T, where F_dist is the difference matrix and F_norm is the normalized version of F. This applies to both F_real and F_imag in Eq. (1). The Frobenius norm of F_dist is then used for regularization: g_loss(F) = ||F_dist||. Therefore, when the constraint is perfectly met (g_loss(F) = 0), we see that F_norm^T = I, where I is an identity matrix.

Mel filterbank. The mel filterbank is a set of overlapping triangular filters with scaled peak magnitudes, which can be either constant across all filters (our case) or varied over frequency bins [24]. Computationally, it is a highly sparse matrix with non-negative elements.
In order to control the level of sparsity of the kernel, we adopt L2 regularization [25] on the filterbank kernel to avoid over-fitting, instead of L1, which as a loss regularizer tends to further enhance the sparsity of the model. Formally, g_loss(M) = ||M||.

DCT. A DCT matrix is orthonormal, i.e. DD^T = D^T D = I. We employ a recently-proposed soft orthonormality loss function [26], expressed as g_loss(D) = ||D^T D − I||, where I is the identity matrix. Optimizing this loss function minimizes the distance between the Gram matrix of D and I, encouraging orthonormality.

C. Kernel update
Aside from loss regularization, the other optimization technique performs a direct update on the kernel operators every time after a gradient update. Compared to loss regularization, it is a more 'brute-force' approach. The updated kernel matrix or vector is then used directly in the next iteration: K_new = g_kernel(K), where K is the kernel matrix after the gradient update and K_new is the directly-updated one used for the next iteration. In Section V, systems adapted with this method are marked as name + kernel., where name is the name of the adapted component. The design of the updater g_kernel(·) for each component is as follows.

Windowing. Commonly-used window functions are non-negative and symmetric. Inspired by these properties, our kernel update is g_kernel(W) = |cat(W[:M/2], W_flip)|, where W[:M/2] denotes the half-length truncated version of the window vector W and W_flip is its flipped (time-reversed) version. Here cat(·) performs column-wise concatenation and |·| denotes absolute values.

DFT. As noted above, DFT matrices are square and symmetric. To enhance these properties, we perform a simple update on the kernel: g_kernel(F) = F F^T, so that the kernel used in the next iteration, F_new = F F^T, is indeed symmetric, since (F F^T)^T = (F^T)^T F^T = F F^T. As in the loss regularization scheme, this update is applied to both F_real and F_imag.

Mel filterbank. As the mel filterbank is a set of overlapping triangular filters with non-negative values, we force positivity by replacing negative elements with a small value: for all i, j, g_kernel(M_{i,j}) = ε if M_{i,j} ≤ 0, otherwise M_{i,j}, where ε is a small positive constant, and i and j denote the row and column indices of the filterbank M.

DCT. For the DCT, we again capture the orthonormality requirement of its static counterpart by performing QR decomposition [23] on the learnt kernel matrix: g_kernel(D) = Q, where the QR decomposition factorizes D = QR and we keep only the orthogonal matrix Q.
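The four regularizers of Sec. III-B and the four updaters above might be sketched as follows in PyTorch; the function names, the normalization of F, and the value of ε are our own illustrative choices:

```python
import math
import torch

# --- Sec. III-B: loss regularizers, used as L_new = L + lam * g_loss(K) ---
def g_loss_window(W):
    """Distance of the mean-normalized window to a cosine shape."""
    M = W.numel()
    C = -torch.cos(2 * math.pi * torch.arange(M, dtype=W.dtype) / M)
    return torch.norm((W - W.mean()) - C)

def g_loss_dft(F):
    """Penalize asymmetry: F_dist = F_norm - F_norm F_norm^T."""
    Fn = F / (torch.norm(F) + 1e-8)   # normalization choice is an assumption
    return torch.norm(Fn - Fn @ Fn.T)

def g_loss_melbank(M):
    return torch.norm(M)              # L2 penalty, not sparsity-promoting L1

def g_loss_dct(D):
    I = torch.eye(D.shape[1], dtype=D.dtype)
    return torch.norm(D.T @ D - I)    # soft orthonormality on the Gram matrix

# --- Sec. III-C: direct kernel updates applied after each gradient step ---
def g_kernel_window(W):
    half = W[: W.numel() // 2]        # keep first half, mirror it, force >= 0
    return torch.abs(torch.cat([half, torch.flip(half, dims=[0])]))

def g_kernel_dft(F):
    return F @ F.T                    # symmetric by construction

def g_kernel_melbank(M, eps=1e-6):    # eps value is an assumption
    return torch.where(M <= 0, torch.full_like(M, eps), M)

def g_kernel_dct(D):
    Q, _ = torch.linalg.qr(D)         # re-orthonormalize the square kernel
    return Q
```

In a training loop, a kernel update would be applied in place after the optimizer step, e.g. `with torch.no_grad(): W.copy_(g_kernel_window(W))`.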
Such an operation is possible because the learnt kernel corresponding to the DCT is set to be a square matrix, i.e. the number of mel filters equals the number of output cepstral coefficients. We adopt this design choice because setting the number of filters equal to the final static feature dimension yields competitive performance, as shown in [20]. This applies to all experiments in this work, including the baseline.

IV. DATA AND EXPERIMENTAL PROTOCOL
A. Data
We trained the baseline x-vector model on the dev partition [27] of VoxCeleb1, which consists of 1211 speakers, and used the same dataset for the additional training of the learnable linear components. For evaluation, we considered one matched and one relatively mismatched condition. For the former, we used the test partition of VoxCeleb1, consisting of 40 speakers, 18860 genuine trials and the same number of impostor trials [27]. The latter comprised the development part of the speakers-in-the-wild (SITW) corpus [28], "core-core" condition, containing 119 speakers, 2597 genuine and 335629 impostor trials. We refer to the two datasets as Voxceleb1-test and SITW-DEV, respectively.
B. System configuration
For the baseline system, we used 30 static MFCCs as the input features and replicated the x-vector configuration from [4] as the speaker embedding extractor. We trained the model on VoxCeleb1 without any data augmentation, using Adam [29] as the optimizer. At test time, we extracted 512-dimensional speaker embeddings from the first fully-connected layer after statistics pooling.

We adapted each of the four learnable front-end components one at a time, using the same data as for training. To avoid joint optimization from scratch, and to meet the aim of providing a lightweight interface for adaptation, the selected component was jointly optimized with the pre-trained baseline x-vector. After adaptation, speaker embeddings for all systems with learnt front-end components were extracted in the same manner as for the baseline.

For all systems, we applied energy-based speech activity detection (SAD) before feature extraction, together with cepstral mean normalization (CMN). All embeddings extracted at inference time were length-normalized and centered prior to being transformed by 200-dimensional linear discriminant analysis (LDA). Scoring was implemented with a probabilistic linear discriminant analysis (PLDA) [30] classifier. We used Kaldi for data preparation and PLDA training, and PyTorch [31] for all DNN-related training and inference experiments.
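The embedding post-processing chain just described (length normalization, centering, LDA projection) can be sketched as follows; the function name and shapes are our assumptions, and PLDA scoring itself is omitted:

```python
import numpy as np

def backend_transform(embedding, global_mean, lda_matrix):
    """Length-normalize, center, then project a 512-dim x-vector to 200 dims.
    The result is what would be fed to the PLDA scorer."""
    e = embedding / (np.linalg.norm(embedding) + 1e-8)  # length normalization
    return lda_matrix @ (e - global_mean)               # centering + LDA (200 x 512)
```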
C. Evaluation

Equal error rate (EER) and minimum detection cost function (minDCF) were used to measure ASV performance. MinDCF was computed with detection costs C_FA = C_miss = 1 and a fixed target speaker prior p. We used BOSARIS [32] to produce selected detection error trade-off (DET) curves.

Fig. 1. Loss propagation for baseline and adapted MFCC components. Best viewed in color.
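For completeness, the EER metric just mentioned, i.e. the operating point where the miss and false-alarm rates coincide, can be computed from raw trial scores as follows (a minimal NumPy sketch, not the BOSARIS implementation):

```python
import numpy as np

def compute_eer(target_scores, nontarget_scores):
    """Sweep a threshold over the sorted scores and return the point where
    the false-negative (miss) and false-positive (false alarm) rates cross."""
    scores = np.concatenate([target_scores, nontarget_scores])
    labels = np.concatenate([np.ones(len(target_scores)),
                             np.zeros(len(nontarget_scores))])
    labels = labels[np.argsort(scores)]                      # ascending scores
    fnr = np.cumsum(labels) / labels.sum()                   # targets rejected so far
    fpr = 1.0 - np.cumsum(1 - labels) / (1 - labels).sum()   # non-targets still accepted
    idx = np.argmin(np.abs(fnr - fpr))
    return 0.5 * (fnr[idx] + fpr[idx])
```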
V. RESULTS
Before presenting the ASV results, we show in Fig. 1 the validation loss (on the dev set) of our baseline and adapted systems. The baseline x-vector system (with fixed MFCC components) was pre-trained for 1000 iterations, followed by another 1000 iterations to adapt the MFCC
Fig. 2. DET plots on SITW-DEV for (a) windowing, (b) DFT, (c) mel filterbank and (d) DCT (miss rate vs. false alarm rate, both in %). Best viewed in color.

TABLE I
RESULTS ON Voxceleb1-test AND SITW-DEV. ALL SYSTEMS ASIDE FROM THE BASELINE USE KERNEL INITIALIZATION.

                       Voxceleb1-test       SITW-DEV
Operator (+optim.)     EER(%)  minDCF      EER(%)  minDCF
Baseline MFCC          4.64    0.6071      6.72    0.8243
Window                 4.51    0.5544
DCT                    4.36    0.5572      6.27    0.7950
DCT + loss.            4.46

components. The adaptation, especially for the window function and DFT, results in a notable decrease of the validation loss. This indicates the potential of making the components of an MFCC extractor learnable. The ASV results are reported in Table I.
A. ASV results on Voxceleb1-test
Concerning windowing, all three adapted variants outperform the baseline, with loss regularization and kernel update being particularly effective. The results indicate the usefulness of retaining the symmetry and positivity of the window.

Concerning the DFT, simply letting it be data-driven (without added regularization or kernel update) yields the lowest EER among all systems, a relative improvement of 6.7% over the baseline. In fact, the additional regularization and direct update are detrimental, which points to a potential weakness of our symmetry constraint.

Concerning the mel filterbank, Melbank + kernel. yields the best performance among the three adapted variants, with the best minDCF of all systems, improving on the baseline by 6.7% relative. This indicates the importance of enforcing positivity of the learnt filters.

Concerning the DCT, as with windowing, all the learning schemes improve upon the baseline. While QR decomposition does not bring a notable positive impact, the orthonormality-enhancing loss regularization results in slightly worse EER but improved minDCF. In fact, DCT + loss. attains the lowest minDCF among all systems.
B. ASV results on SITW-DEV
We now move on to the ASV results on the more challenging SITW-DEV data. Overall, the data-driven components now yield a more pronounced performance boost over the baseline. Adapting the window function is the most effective, with a relative improvement of 9.7% in EER over the baseline. Concerning the DFT, DFT + loss. slightly outperforms the baseline in both metrics, while DFT + kernel. is the only variant that does not reach the baseline EER. This finding is in line with the VoxCeleb1-test results. Concerning the mel filterbank, all three systems outperform the baseline and achieve performance competitive with the learnable DCT, reflecting the filterbank's potential for improving system robustness when made adaptable.

DET curves for the single systems, including the baseline, on SITW-DEV are shown in Fig. 2. The curves generally agree with the observations from Table I. Systems with an adapted window function produce the largest improvement over the baseline, consistent with their EERs. At operating points that are less strict on false alarms, variants such as DCT + kernel. and window + loss. perform notably well and merit consideration.

VI. CONCLUSION
We conducted an initial study on a lightweight learnable MFCC feature extractor as a compromise between end-to-end architectures and hand-crafted feature extractors. Our initial results on SITW-DEV are promising: the proposed scheme improved upon the baseline MFCC extractor, with the optimized window and mel filterbank being particularly promising. Due to our domain-specific optimization constraints, the learnt representations bear close resemblance to the fixed MFCC operations. For interpretability and computational reasons, we restricted our focus to the optimization of individual MFCC extractor components; joint optimization of all four linear components is left as future work. Similarly, the work can be extended to other deep models, such as extended TDNN and ResNet, using larger datasets and data augmentation.
REFERENCES
[1] J. H. L. Hansen and T. Hasan, "Speaker recognition by machines and humans: A tutorial review," IEEE Signal Processing Magazine, vol. 32, no. 6, pp. 74–99, 2015.
[2] N. Dehak, P. J. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, "Front-end factor analysis for speaker verification," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 4, pp. 788–798, May 2011.
[3] E. Variani et al., "Deep neural networks for small footprint text-dependent speaker verification," in Proc. ICASSP, 2014, pp. 4052–4056.
[4] D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur, "X-vectors: Robust DNN embeddings for speaker recognition," in Proc. ICASSP, 2018, pp. 5329–5333.
[5] L. You, W. Guo, D. Li, and J. Du, "Multi-task learning with high-order statistics for x-vector based text-independent speaker verification," in Proc. INTERSPEECH, 2019, pp. 1158–1162.
[6] D. Snyder, D. Garcia-Romero, G. Sell, A. McCree, D. Povey, and S. Khudanpur, "Speaker recognition for multi-speaker conversations using x-vectors," in Proc. ICASSP, 2019, pp. 5796–5800.
[7] S. Davis and P. Mermelstein, "Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences," IEEE Transactions on Acoustics, Speech, and Signal Processing, pp. 357–366, 1980.
[8] M. Ravanelli and Y. Bengio, "Speaker recognition from raw waveform with SincNet," in Proc. SLT, 2018, pp. 1021–1028.
[9] N. Zeghidour, N. Usunier, I. Kokkinos, T. Schaiz, G. Synnaeve, and E. Dupoux, "Learning filterbanks from raw speech for phone recognition," 2018, pp. 5509–5513.
[10] S. Vasquez and M. Lewis, "MelNet: A generative model for audio in the frequency domain," arXiv preprint arXiv:1906.01083, 2019.
[11] M. Pariente, S. Cornell, A. Deleforge, and E. Vincent, "Filterbank design for end-to-end speech separation," in Proc. ICASSP, 2020, pp. 6364–6368.
[12] T. Kinnunen et al., "Low-variance multitaper MFCC features: A case study in robust speaker verification," IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 7, pp. 1990–2001, 2012.
[13] M. Todisco, H. Delgado, and N. Evans, "Articulation rate filtering of CQCC features for automatic speaker verification," in Proc. INTERSPEECH, 2016, pp. 3628–3632.
[14] J. Andén and S. Mallat, "Deep scattering spectrum," IEEE Transactions on Signal Processing, vol. 62, no. 16, pp. 4114–4128, 2014.
[15] O. Farooq and S. Datta, "Mel filter-like admissible wavelet packet structure for speech recognition," IEEE Signal Processing Letters, vol. 8, no. 7, pp. 196–198, 2001.
[16] S. Umesh, L. Cohen, and D. Nelson, "Fitting the mel scale," 1999, vol. 1, pp. 217–220.
[17] C. Kim and R. M. Stern, "Power-normalized cepstral coefficients (PNCC) for robust speech recognition," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 7, pp. 1315–1329, 2016.
[18] H. Hermansky, "Perceptual linear predictive (PLP) analysis of speech," The Journal of the Acoustical Society of America, vol. 87, no. 4, pp. 1738–1752, 1990.
[19] A. K. Naveena and N. K. Narayanan, "Block DCT coefficients and histogram for image retrieval," 2017, pp. 48–52.
[20] X. Liu, M. Sahidullah, and T. Kinnunen, "A comparative re-assessment of feature extractors for deep speaker embeddings," in Proc. INTERSPEECH, 2020.
[21] X. Glorot and Y. Bengio, "Understanding the difficulty of training deep feed-forward neural networks," in Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, 2010, pp. 249–256, PMLR.
[22] F. J. Harris, "On the use of windows for harmonic analysis with the discrete Fourier transform," Proceedings of the IEEE, vol. 66, no. 1, pp. 51–83, 1978.
[23] G. H. Golub and C. F. Van Loan, Matrix Computations, The Johns Hopkins University Press, 1996.
[24] R. Lawrence and J. Hwang, Fundamentals of Speech Recognition, Prentice-Hall, Inc., USA, 1993.
[25] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning, Springer Series in Statistics, Springer, New York, NY, USA, 2001.
[26] Y. Zhu and B. Mak, "Orthogonality regularizations for end-to-end speaker verification," in Proc. Odyssey 2020 The Speaker and Language Recognition Workshop, 2020, pp. 17–23.
[27] A. Nagrani, J. Chung, and A. Zisserman, "VoxCeleb: A large-scale speaker identification dataset," in Proc. INTERSPEECH, 2017, pp. 2616–2620.
[28] M. McLaren, L. Ferrer, D. Castán Lavilla, and A. Lawson, "The speakers in the wild (SITW) speaker recognition database," in Proc. INTERSPEECH, 2016, pp. 818–822.
[29] D. Kingma and J. Ba, "Adam: A method for stochastic optimization," 2015.
[30] S. Ioffe, "Probabilistic linear discriminant analysis," in Computer Vision – ECCV 2006, A. Leonardis, H. Bischof, and A. Pinz, Eds., Berlin, Heidelberg, 2006, pp. 531–542, Springer Berlin Heidelberg.
[31] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, L. Zeming, A. Desmaison, L. Antiga, and A. Lerer, "Automatic differentiation in PyTorch," in