A Comparative Re-Assessment of Feature Extractors for Deep Speaker Embeddings
Xuechen Liu, Md Sahidullah, Tomi Kinnunen
School of Computing, University of Eastern Finland, Joensuu, Finland
Université de Lorraine, CNRS, Inria, LORIA, F-54000, Nancy, France
Abstract
Modern automatic speaker verification relies largely on deep neural networks (DNNs) trained on mel-frequency cepstral coefficient (MFCC) features. While there are alternative feature extraction methods based on phase, prosody and long-term temporal operations, they have not been extensively studied with DNN-based methods. We aim to fill this gap by providing an extensive re-assessment of 14 feature extractors on the VoxCeleb and SITW datasets. Our findings reveal that features equipped with techniques such as spectral centroids, the group delay function, and integrated noise suppression provide promising alternatives to MFCCs for deep speaker embedding extraction. Experimental results demonstrate up to a 16.3% (VoxCeleb) and 25.1% (SITW) relative decrease in equal error rate (EER) compared to the baseline.
Index Terms: Speaker verification, feature extraction, deep speaker embeddings.
1. Introduction
Automatic speaker verification (ASV) [1] aims to determine whether two speech segments are from the same speaker. It finds applications in forensics, surveillance, access control, and home electronics. While the field has long been dominated by approaches such as i-vectors [2], the focus has recently shifted to non-linear deep neural networks (DNNs), which have been found to surpass previous solutions in many cases. Representative DNN approaches include the d-vector [3], deep speaker [4] and x-vector [5]. As illustrated in Figure 1, DNNs are used to extract a fixed-sized speaker embedding from each utterance. These embeddings can then be used for speaker comparison with a back-end classifier. The network input and output consist of a sequence of acoustic feature vectors and a vector of speaker posteriors, respectively. The DNN learns the input-output mapping through a number of intermediate layers, including temporal pooling (necessary for the extraction of a fixed-sized embedding). A number of improvements to this core framework have been proposed, including hybrid frame-level layers [6], multi-task learning [7] and alternative loss functions [8], to name a few. In addition, practitioners often use external data [9, 10] to augment the training data, which forces the DNN to extract speaker-related attributes regardless of input perturbations.

While a substantial amount of work has been devoted to improving DNN architectures, loss functions, and data augmentation recipes, the same cannot be said about acoustic features. There are, however, at least two important reasons to study feature extraction. First, data-driven models can only be as good as their input data, i.e., the features. Second, in collaborative settings, it is customary to fuse several ASV systems. These systems should not only perform well in isolation, but be sufficiently diverse as well. One way to achieve diversity is to train systems with different features.

Figure 1: X-vector speaker embedding extractor [5]. Speaker embeddings are usually extracted from the first fully-connected layer after statistics pooling.

The acoustic features used to train deep speaker embedding extractors are typically standard mel-frequency cepstral coefficients (MFCCs) or intermediate representations needed in MFCC extraction: the raw spectrum [11], mel-spectrum or mel-filterbank outputs. There are a few exceptions where the feature extractor is also learnt as part of the DNN architecture (e.g. [12]), although the empirical performance often lags behind hand-crafted feature extraction schemes. This raises the question whether deep speaker embedding extractors might be improved by a simple plug-and-play of other hand-crafted feature extractors in place of MFCCs. Such methods are abundant in the past ASV literature [13, 14, 15], and in the context of related tasks such as spoofing attack detection [16, 17]. An extensive study in the context of DNN-based ASV is, however, missing. Our study aims to fill this gap.

MFCCs are obtained from the power spectrum of a specific time-frequency representation, the short-term Fourier transform (STFT). MFCCs are therefore subject to certain shortcomings of the STFT. They also lack specificity to the short-term phase of the signal. We therefore include a number of alternative features based on the short-term power spectrum and short-term phase. Additionally, we include the fundamental frequency and methods that leverage long-term processing beyond a short-time frame. Improvements over MFCCs are often motivated by robustness to additive noise, improved statistical properties, or closer alignment with human perception. The selected 14 features and their categorization, detailed below, are inspired by [16] and [17]. For generality, we carry out experiments on two widely-adopted datasets, VoxCeleb [11] and speakers-in-the-wild (SITW) [18]. To the best of our knowledge, this is the first extensive re-assessment of acoustic features for DNN-based ASV.
2. Feature Extraction Methods
In this section, we provide a comprehensive list of feature extractors with a brief description of each method. Table 1 summarizes the selected feature extractors along with their parameter settings and references to earlier ASV studies.

Mel-frequency cepstral coefficients (MFCCs). MFCCs are computed by integrating the STFT power spectrum with overlapped band-pass filters on the mel scale, followed by log compression and discrete cosine transform (DCT). Following [1], a desired number of lower-order coefficients is retained. Standard MFCCs form our baseline features.
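For illustration, a minimal NumPy/SciPy sketch of this pipeline is given below. The STFT power spectrum and triangular mel filterbank matrix are assumed to be precomputed (hypothetical inputs, not part of the paper's toolchain):

```python
import numpy as np
from scipy.fftpack import dct

def mfcc_from_power_spectrum(power_spec, mel_fbank, n_ceps=30):
    # power_spec: (num_frames, n_fft // 2 + 1) STFT power spectrum.
    # mel_fbank:  (n_mels, n_fft // 2 + 1) triangular mel filterbank.
    mel_energies = power_spec @ mel_fbank.T            # filterbank integration
    log_mel = np.log(np.maximum(mel_energies, 1e-10))  # log compression
    ceps = dct(log_mel, type=2, axis=1, norm='ortho')  # decorrelating DCT
    return ceps[:, :n_ceps]                            # keep lower-order coefficients
```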
Multi-taper mel-frequency cepstral coefficients (Multi-taper). Viewing each short-term frame of speech as a realization of a random process, the windowed STFT used in MFCC extraction is known to have high variance. To alleviate this, a multi-taper spectrum estimator is adopted [13]. It uses several window functions (tapers) to obtain a low-variance power spectrum estimate, given by $\hat{S}(f) = \sum_{j=1}^{K} \lambda(j) \bigl| \sum_{t=0}^{N-1} w_j(t)\, x(t)\, e^{-i 2\pi t f / N} \bigr|^{2}$. Here, $w_j(t)$ is the $j$-th taper (window) and $\lambda(j)$ is its corresponding weight. The number of tapers, $K$, is an integer (typically between 4 and 8). There are a number of alternative taper sets to choose from: Thomson [28], sinusoidal (SWCE) [29] and multi-peak [30]. In this study, we chose SWCE. A detailed introduction to this spectrum estimator, with experiments on conventional ASV, can be found in [13].
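A sketch of the estimator follows, using classical sine tapers for concreteness; an exact reproduction would substitute the SWCE tapers and weights of [29]:

```python
import numpy as np

def sine_tapers(N, K=8):
    # Classical sine tapers; the SWCE tapers/weights of [29] would
    # replace these in the paper's exact setup.
    n = np.arange(1, N + 1)
    return np.stack([np.sqrt(2.0 / (N + 1)) * np.sin(np.pi * k * n / (N + 1))
                     for k in range(1, K + 1)])

def multitaper_power_spectrum(frame, tapers, weights, n_fft=512):
    # frame: (N,) samples; tapers: (K, N); weights: (K,) summing to 1.
    # Weighted average of K individually windowed periodograms:
    # S_hat(f) = sum_j lambda(j) * |FFT(w_j * x)(f)|^2.
    spectra = np.abs(np.fft.rfft(tapers * frame, n=n_fft, axis=1)) ** 2
    return weights @ spectra

# Example with uniform weights lambda(j) = 1/K:
# S = multitaper_power_spectrum(frame, sine_tapers(400, 8), np.full(8, 1 / 8))
```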
Linear prediction cepstral features. An alternative route to cepstral features derives from an all-pole [31] representation of the signal. Linear prediction cepstral coefficients (LPCCs) are derived from the linear prediction coefficients (LPCs) by a recursive operation [32]. A similar method applies to perceptual LPCCs (PLPCCs), with a series of perceptual processing steps applied at the primary stage [33].
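A sketch of the LPC-to-cepstrum recursion is shown below; the sign convention assumes $A(z) = 1 + \sum_k a_k z^{-k}$ and may need adjusting to match a particular LPC routine:

```python
import numpy as np

def lpc_to_cepstrum(a, n_ceps=30):
    # a: LPC coefficients a_1..a_p of A(z) = 1 + sum_k a_k z^{-k}.
    # Returns cepstral coefficients c_1..c_{n_ceps} of the all-pole
    # model via the standard recursion (cf. [32]).
    p = len(a)
    c = np.zeros(n_ceps)
    for n in range(1, n_ceps + 1):
        acc = -a[n - 1] if n <= p else 0.0
        for k in range(max(1, n - p), n):
            acc -= (k / n) * c[k - 1] * a[n - k - 1]
        c[n - 1] = acc
    return c
```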
Spectral subband centroid features. Spectral subband centroid based features were introduced and investigated in statistical ASV [22]. We consider two types of spectral centroid features: spectral centroid magnitude (SCM) and subband centroid frequency (SCF). They are computed as weighted averages of the normalized subband magnitudes and frequencies, respectively. SCFs are used directly as SCF coefficients (SCFCs), while log compression and DCT are applied to the SCMs to obtain SCM coefficients (SCMCs). For more details, one may refer to [22].
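The following sketch illustrates the two centroid quantities under a simplified rectangular-subband assumption; the exact subband weighting and filterbank design follow [22]:

```python
import numpy as np

def subband_centroids(power_spec, freqs, band_edges):
    # power_spec: (num_frames, n_bins); freqs: (n_bins,) bin center
    # frequencies; band_edges: list of (lo, hi) bin-index pairs.
    scf, scm = [], []
    for lo, hi in band_edges:
        P, f = power_spec[:, lo:hi], freqs[lo:hi]
        scf.append((P * f).sum(axis=1) / (P.sum(axis=1) + 1e-10))  # centroid frequency
        scm.append((P * f).sum(axis=1) / f.sum())                  # centroid magnitude
    # SCFCs: use the centroid frequencies directly;
    # SCMCs: log-compress the magnitudes, then apply a DCT (see text).
    return np.stack(scf, axis=1), np.stack(scm, axis=1)
```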
Constant-Q cepstral coefficients (CQCCs). The constant-Q transform (CQT) was introduced in [34]. It has been applied in music signal processing [35], spoofing detection [36], as well as in ASV [37]. Unlike the STFT, the CQT produces a time-frequency representation with variable resolution. The resulting CQT power spectrum is log-compressed and uniformly resampled, followed by DCT to yield CQCCs. Further details can be found in [36].

Modified group delay function (MGDF). The MGDF was introduced in [38] with application to phone recognition, and was further applied to speaker recognition [23]. It is a parametric representation of the phase spectrum, defined as $\tau(k) = \mathrm{sign} \cdot \bigl| (X_R(k) Y_R(k) + Y_I(k) X_I(k)) / S(k)^{2\gamma} \bigr|^{\alpha}$, where $k$ is the frequency index; $X_R(k)$ and $X_I(k)$ are the real and imaginary parts of the discrete Fourier transform (DFT) of the speech samples $x(n)$; $Y_R(k)$ and $Y_I(k)$ are the real and imaginary parts of the DFT of $n\,x(n)$; $\mathrm{sign}$ is the sign of $X_R(k) Y_R(k) + Y_I(k) X_I(k)$; $\alpha$ and $\gamma$ are control parameters; and $S(k)$ is a smoothed magnitude spectrum. Cepstral-like coefficients usable as features are then obtained from the function outputs by log-compression and DCT.

All-pole group delay function (APGDF). An alternative phase representation of the signal was proposed for ASV in [14]. The group delay function is computed by differentiating the unwrapped phase of an all-pole spectrum. The main advantage of the APGDF over the MGDF is its smaller number of control parameters.
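A sketch of the MGDF computation above is given below; the smoothing of $S(k)$ and the $\alpha$, $\gamma$ values are illustrative placeholders rather than the paper's exact settings (which follow [23]):

```python
import numpy as np

def modified_group_delay(frame, n_fft=512, alpha=0.4, gamma=0.9):
    # alpha and gamma here are placeholder values for illustration.
    n = np.arange(len(frame))
    X = np.fft.rfft(frame, n_fft)        # DFT of x(n)
    Y = np.fft.rfft(n * frame, n_fft)    # DFT of n*x(n)
    num = X.real * Y.real + X.imag * Y.imag
    # S(k): smoothed magnitude spectrum; a moving average stands in
    # for the usual cepstral smoothing.
    S = np.convolve(np.abs(X), np.ones(11) / 11, mode='same') + 1e-8
    tau = num / S ** (2 * gamma)
    return np.sign(tau) * np.abs(tau) ** alpha
```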
Cosine phase function (cosphase). The cosine of the phase has been applied to spoofing attack detection [16, 39]. The DFT-based unwrapped phase is first normalized to $[-1, 1]$ using the cosine operation, and then processed with DCT to derive the cosphase coefficients.

Constant-Q magnitude-phase octave coefficients (CMPOCs). Unlike the previous DFT-based features, CMPOCs utilize the CQT. The magnitude-phase spectrum (MPS) of the CQT is computed as $\sqrt{\ln^{2}|X(\omega)| + \phi^{2}(\omega)}$, where $X(\omega)$ and $\phi(\omega)$ denote the magnitude and phase of the CQT. The MPS is then segmented according to octave, and processed with log-compression and DCT to derive the CMPOCs. So far, CMPOCs have been studied only for playback attack detection [40].
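A minimal sketch of the cosphase computation described above:

```python
import numpy as np
from scipy.fftpack import dct

def cosphase_features(frame, n_fft=512, n_ceps=30):
    # Cosine of the unwrapped DFT phase (values in [-1, 1]),
    # decorrelated with a DCT.
    phase = np.unwrap(np.angle(np.fft.rfft(frame, n_fft)))
    return dct(np.cos(phase), type=2, norm='ortho')[:n_ceps]
```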
We use the term 'long-term processing' to refer to methods that use information across a longer context of consecutive frames.

Mean Hilbert envelope coefficients (MHECs). Proposed in [25] for i-vector based ASV, MHEC applies Gammatone filterbanks to the speech signal. The output of each channel of the filterbank is then processed to compute temporal envelopes as $e_s(t,j) = \sqrt{s^{2}(t,j) + \hat{s}^{2}(t,j)}$, where $s(t,j)$ is the so-called 'analytic signal' and $\hat{s}(t,j)$ denotes its Hilbert transform [41]; $t$ and $j$ represent the time and channel index, respectively. The envelopes are low-pass filtered, framed and averaged to compute energies. Finally, the energies are transformed to cepstral-like coefficients by log-compression and DCT. More details can be found in [25].

Power-normalized cepstral coefficients (PNCCs). To generate PNCCs, the input waveform is first processed by Gammatone filterbanks and fed into a cascade of non-linear time-varying operations aimed at suppressing the impact of noise and reverberation. Mean power normalization is performed at the output of this series of operations so as to minimize the potentially detrimental effect of amplitude scaling. Cepstral features are then obtained by a power-law non-linearity and DCT. PNCCs have been applied to speech recognition [15] as well as to i-vector based ASV [26].
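For concreteness, the Hilbert-envelope operation at the core of MHEC described above can be sketched as follows (the subsequent low-pass filtering, framing and averaging are omitted):

```python
import numpy as np
from scipy.signal import hilbert

def hilbert_envelope(subband):
    # e_s(t) = sqrt(s(t)^2 + s_hat(t)^2): magnitude of the analytic
    # signal of one Gammatone channel output.
    return np.abs(hilbert(subband))
```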
Fundamental frequency features. Aside from the various types of features above, an initial investigation into the effect of harmonic information was conducted. For simplicity and comparability, the pitch extraction algorithm from [42], based on the normalized cross-correlation function (NCCF), was employed to extract 3-dimensional pitch vectors. These are appended to the MFCCs. In the rest of the paper, we refer to this feature as MFCC+pitch.

Table 1: List of feature extractors addressed in this study, with configuration details and references to exemplar earlier relevant studies on ASV. As mentioned in Section 1, aside from MFCCs, the previous works noted here concern conventional (non-DNN) models.

Short-term magnitude/power spectral features:
  MFCC (30)         Baseline, no. of FFT coefficients = 512          [5, 6]
  CQCC (60)         CQCC v2.0 package                                [19]
  LPCC (30)         LP order = 30                                    [20]
  PLPCC (30)        LP order = 30, Bark-scale filterbank             [21]
  SCFC (30)         No. of filters = 30                              [22]
  SCMC (30)         No. of filters = 30                              [22]
  Multi-taper (30)  MFCC with SWCE windowing, no. of tapers = 8      [13, 21]

Short-term phase spectral features:
  MGDF (30)         First 30 coeff. from DCT                         [23, 24]
  APGDF (30)        LP order = 30                                    [14]
  CosPhase (30)     First 30 coeff. from DCT                         -
  CMPOC (30)        N = 96, first 30 coeff. from DCT                 -

Short-term features with long-term processing:
  MHEC (30)         No. of filters in Gammatone filterbank = 20      [25]
  PNCC (30)         First 30 coeff. from DCT                         [26]

Fundamental frequency features:
  MFCC+pitch (33)   Kaldi pitch extractor, MFCC (30) with pitch (3)  [27]

Table 2: Results of a preliminary experiment investigating dynamic features on the Voxceleb1-E test set. The dimension of the static part was 30 in all three cases.

  Feature    EER(%)  minDCF
  MFCC       4.65    0.5937
  MFCC+∆     -       -
  MFCC+∆∆    -       -

Table 3: Results of different features and fusion systems on the Voxceleb1-E test set and the SITW development set (SITW-DEV).

                          Voxceleb1-E        SITW-DEV
  Feature                 EER(%)  minDCF     EER(%)  minDCF
  MFCC                    4.65    0.5937     8.12    0.8531
  CQCC                    8.21    0.8310     9.43    0.9093
  LPCC                    6.42    0.7129     9.39    0.9109
  PLPCC                   7.06    0.7433     9.12    0.9178
  SCFC                    6.56    0.7173     7.82    0.8530
  SCMC                    -       -          -       -
  MFCC+pitch              -       -          -       -
  MFCC+SCMC+Multi-taper   -       -          -       -
  MFCC+cosphase+PNCC      -       -          -       -
3. Experiments
We trained the neural networks on the dev part [11] of VoxCeleb1, consisting of 1211 speakers. We used two evaluation sets: one for a matched train-test condition and one for a relatively mismatched condition. The first was the test part of the same VoxCeleb1 dataset, consisting of 40 speakers; the second was the development part of SITW under the "core-core" condition, consisting of 119 speakers. The VoxCeleb1 evaluation consists of 18860 genuine trials and the same number of impostor trials; the corresponding SITW partition has 2597 genuine and 335629 impostor trials. We refer to the two datasets as 'Voxceleb1-E' and 'SITW-DEV', respectively.
All features were extracted with a frame length of 25 ms and a 10 ms shift. We applied a Hamming [43] window in all cases except for the multi-taper features. Table 1 describes the associated control parameters (where applicable) and the implementation details for each feature extractor. As post-processing, we applied energy-based speech activity detection (SAD) and utterance-level cepstral mean normalization (CMN) [1], except for MFCC+pitch, whose additional components contain the probability of voicing (POV).
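The framing, windowing and CMN steps can be sketched as follows (energy-based SAD omitted):

```python
import numpy as np

def frame_signal(x, sr, frame_ms=25, shift_ms=10):
    # 25 ms frames with a 10 ms shift, Hamming-windowed (cf. [43]).
    flen = int(sr * frame_ms / 1000)
    fshift = int(sr * shift_ms / 1000)
    n_frames = 1 + max(0, (len(x) - flen) // fshift)
    idx = np.arange(flen)[None, :] + fshift * np.arange(n_frames)[:, None]
    return x[idx] * np.hamming(flen)

def cmn(feats):
    # Utterance-level cepstral mean normalization: subtract the
    # per-coefficient mean computed over all frames of the utterance.
    return feats - feats.mean(axis=0, keepdims=True)
```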
To compare the different feature extractors, we trained an x-vector system for each of them, as illustrated in Figure 1. We replicated the DNN configuration from [5] and trained the model on the data described above without any data augmentation; this helps to assess the inherent robustness of the individual features. We extracted a 512-dimensional speaker embedding for each test utterance. The embeddings were length-normalized and centered before being transformed with 200-dimensional linear discriminant analysis (LDA), followed by scoring with a probabilistic linear discriminant analysis (PLDA) [44] classifier.
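A sketch of this embedding post-processing chain, here using scikit-learn's LDA in place of the paper's Kaldi recipe and omitting the PLDA scoring step:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def prepare_embeddings(train_x, train_spk, test_x, lda_dim=200):
    # Length-normalize, center with the training mean, then reduce to
    # lda_dim dimensions with LDA. PLDA scoring (done with Kaldi in
    # the paper) would follow on the transformed embeddings.
    def lnorm(x):
        return x / np.linalg.norm(x, axis=1, keepdims=True)
    train_x, test_x = lnorm(train_x), lnorm(test_x)
    mu = train_x.mean(axis=0)
    lda = LinearDiscriminantAnalysis(n_components=lda_dim)
    lda.fit(train_x - mu, train_spk)
    return lda.transform(test_x - mu)
```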
The verification accuracy was measured by the equal error rate (EER) and the minimum detection cost function (minDCF), the latter with equal costs $C_{\mathrm{fa}} = C_{\mathrm{miss}} = 1.0$ and a fixed target speaker prior. Detection error trade-off (DET) curves for all feature extraction methods are also presented. We used Kaldi (https://github.com/kaldi-asr/kaldi) to compute the EER and minDCF, and the BOSARIS toolkit (https://sites.google.com/site/bosaristoolkit/) for the DET plots.
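For reference, a minimal sketch of how EER and a normalized minDCF can be computed from pooled trial scores; the target prior of 0.01 used as a default below is an assumption for illustration, not a value taken from the paper:

```python
import numpy as np

def eer_mindcf(tar, non, p_tgt=0.01, c_miss=1.0, c_fa=1.0):
    # Sweep the decision threshold over every observed score.
    scores = np.concatenate([tar, non])
    labels = np.concatenate([np.ones(len(tar)), np.zeros(len(non))])
    labels = labels[np.argsort(scores)]
    p_miss = np.cumsum(labels) / len(tar)          # targets below threshold
    p_fa = 1.0 - np.cumsum(1 - labels) / len(non)  # non-targets above it
    i = np.argmin(np.abs(p_miss - p_fa))
    eer = (p_miss[i] + p_fa[i]) / 2.0
    dcf = c_miss * p_tgt * p_miss + c_fa * (1 - p_tgt) * p_fa
    min_dcf = dcf.min() / min(c_miss * p_tgt, c_fa * (1 - p_tgt))
    return eer, min_dcf
```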
Figure 2: DET plots for the evaluation sets: (top) Voxceleb1-E; (bottom) SITW-DEV. Best viewed in color.
4. Results
As a sanity check, we first conducted a preliminary experiment investigating the effectiveness of dynamic features, with results reported in Table 2. We extended the baseline by appending delta and double-delta coefficients to the static MFCCs. According to the table, adding delta features did not improve performance. This might be because the frame-level network layers already capture information across neighboring frames. In the remainder, we use static features only.
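The delta coefficients used in this experiment follow the standard regression formula, sketched below:

```python
import numpy as np

def delta(feats, N=2):
    # Regression-based delta coefficients over a +/-N frame window:
    # d_t = sum_n n * (c_{t+n} - c_{t-n}) / (2 * sum_n n^2).
    T = len(feats)
    padded = np.pad(feats, ((N, N), (0, 0)), mode='edge')
    out = np.zeros_like(feats, dtype=float)
    for n in range(1, N + 1):
        out += n * (padded[N + n:N + n + T] - padded[N - n:N - n + T])
    return out / (2 * sum(n * n for n in range(1, N + 1)))

# MFCC + delta + double-delta, as in Table 2:
# feats = np.hstack([mfcc, delta(mfcc), delta(delta(mfcc))])
```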
Table 3 summarizes the results for both corpora. On Voxceleb1-E, we found that MFCCs outperform most of the alternative features in terms of EER, with SCMCs as the only exception. This may indicate the effectiveness of information related to subband energies. However, SCFCs did not outperform SCMCs, which suggests that the subband magnitudes may be more important than their frequencies. Concerning the phase spectral features, MGDFs lagged behind the others, possibly due to sub-optimal control parameter settings. CMPOCs reached a 27.6% relatively lower EER than CQCCs, which highlights the effectiveness of phase information among the CQT-based features. Moreover, while MFCC+pitch yielded a competitive EER and the best minDCF, LPCCs and PLPCCs did not perform as well. This indicates the potential importance of explicit harmonic information, a finding echoed in the SITW-DEV results. A similar observation holds for the multi-taper MFCCs, confirming the efficacy of multi-taper windowing known from conventional ASV.

Focusing on SITW-DEV, the most competitive features include those from the phase and 'long-term' categories. PNCCs reached the best performance in both metrics, outperforming the baseline MFCCs by 25.1% relative in terms of EER. This might be due to the robustness-enhancing operations integrated into the pipeline, recalling that SITW-DEV represents more challenging and mismatched data conditions. While not outperforming the baseline on Voxceleb1-E, SCFCs yielded competitive numbers along with SCMCs, which further indicates the usefulness of subband information. The best performance of cosphase within the phase category reflects the advantage of the cosine normalizer relative to the group delay function. An additional benefit of cosphase over the group delay features is its smaller number of control parameters.

Next, we addressed simple equal-weighted linear score fusion. We considered two sets of features: 1) MFCCs, SCMCs and Multi-taper; 2) MFCCs, cosphase and PNCCs. The former set of extractors shares similar spectral operations, while the latter covers more diverse speech attributes. Results are presented at the bottom of Table 3. On Voxceleb1-E, we see further improvement for both fused systems, especially the first one, which reached the lowest overall EER, outperforming the baseline by 16.3% relative. On SITW-DEV, however, the best performance was still held by a single system. This indicates that simple equal-weighted linear score-level fusion may be more effective in relatively matched conditions.
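The fusion itself is a plain average of per-system scores; the z-normalization in the sketch below is our assumption, as the paper does not detail score scaling:

```python
import numpy as np

def fuse_scores(system_scores):
    # Equal-weighted linear score fusion. Each system's trial scores
    # are z-normalized first to put them on a comparable scale
    # (an assumption; not specified in the paper).
    normed = [(s - s.mean()) / s.std() for s in map(np.asarray, system_scores)]
    return np.mean(normed, axis=0)

# fused = fuse_scores([scores_mfcc, scores_scmc, scores_multitaper])
```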
Finally, the DET curves for all systems, including the fused ones, are shown in Figure 2, and they agree with the findings in Table 3. On Voxceleb1-E, the two fusion systems are in general closer to the origin than any of the single systems, corresponding to the observation above. On SITW, PNCC confirms its superior performance on SITW-DEV, although towards the bottom-right both spectral centroid features pull away, which may indicate that they favor operating points that are less strict on false alarms.
5. Conclusion
This paper presented an extensive re-assessment of acoustic feature extractors for DNN-based ASV systems. We evaluated them on Voxceleb1 and SITW, covering matched and mismatched conditions. We achieved improvements over MFCCs especially on SITW, which represents the more mismatched testing condition. We also found that alternative methods based on spectral centroids, the group delay function, and integrated noise suppression can be useful for DNN systems; future work should revisit and extend them under more scenarios. Finally, we made an initial attempt at score-level fusion, with competitive performance indicating the potential of this approach.
6. Acknowledgements
This work was partially supported by the Academy of Finland (project 309629) and Inria Nancy Grand Est.

7. References

[1] T. Kinnunen and H. Li, "An overview of text-independent speaker recognition: From features to supervectors," Speech Communication, vol. 52, no. 1, pp. 12–40, 2010.
[2] N. Dehak, P. J. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, "Front-end factor analysis for speaker verification," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 4, pp. 788–798, May 2011.
[3] E. Variani et al., "Deep neural networks for small footprint text-dependent speaker verification," in Proc. ICASSP, 2014, pp. 4052–4056.
[4] C. Li, X. Ma, B. Jiang, X. Li, X. Zhang, X. Liu, Y. Cao, A. Kannan, and Z. Zhu, "Deep speaker: an end-to-end neural speaker embedding system," CoRR, vol. abs/1705.02304, 2017.
[5] D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur, "X-vectors: Robust DNN embeddings for speaker recognition," in Proc. ICASSP, 2018, pp. 5329–5333.
[6] D. Snyder, D. Garcia-Romero, G. Sell, A. McCree, D. Povey, and S. Khudanpur, "Speaker recognition for multi-speaker conversations using x-vectors," in Proc. ICASSP, 2019, pp. 5796–5800.
[7] L. You, W. Guo, L. R. Dai, and J. Du, "Multi-task learning with high-order statistics for x-vector based text-independent speaker verification," in Proc. INTERSPEECH, 2019, pp. 1158–1162.
[8] Y. Li, F. Gao, Z. Ou, and J. Sun, "Angular softmax loss for end-to-end speaker verification," 2018, pp. 190–194.
[9] T. Ko, V. Peddinti, D. Povey, M. L. Seltzer, and S. Khudanpur, "A study on data augmentation of reverberant speech for robust speech recognition," in Proc. ICASSP, 2017, pp. 5220–5224.
[10] D. Snyder, G. Chen, and D. Povey, "MUSAN: A music, speech, and noise corpus," 2015, arXiv:1510.08484v1.
[11] A. Nagrani, J. S. Chung, and A. Zisserman, "VoxCeleb: A large-scale speaker identification dataset," in Proc. INTERSPEECH, 2017, pp. 2616–2620.
[12] M. Ravanelli and Y. Bengio, "Speaker recognition from raw waveform with SincNet," in Proc. SLT, 2018, pp. 1021–1028.
[13] T. Kinnunen et al., "Low-variance multitaper MFCC features: A case study in robust speaker verification," IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 7, pp. 1990–2001, 2012.
[14] P. Rajan, T. Kinnunen, C. Hanilçi, J. Pohjalainen, and P. Alku, "Using group delay functions from all-pole models for speaker recognition," in Proc. INTERSPEECH, 2013, pp. 2489–2493.
[15] C. Kim and R. M. Stern, "Power-normalized cepstral coefficients (PNCC) for robust speech recognition," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 7, pp. 1315–1329, 2016.
[16] M. Sahidullah, T. Kinnunen, and C. Hanilçi, "A comparison of features for synthetic speech detection," in Proc. INTERSPEECH, 2015, pp. 2087–2091.
[17] C. Hanilçi, "Features and classifiers for replay spoofing attack detection," 2017, pp. 1187–1191.
[18] M. McLaren, L. Ferrer, D. Castán Lavilla, and A. Lawson, "The speakers in the wild (SITW) speaker recognition database," in Proc. INTERSPEECH, 2016, pp. 818–822.
[19] M. Todisco, H. Delgado, and N. Evans, "Articulation rate filtering of CQCC features for automatic speaker verification," in Proc. INTERSPEECH, 2016, pp. 3628–3632.
[20] X. Jing, J. Ma, J. Zhao, and H. Yang, "Speaker recognition based on principal component analysis of LPCC and MFCC," in Proc. ICSPCC, 2014, pp. 403–408.
[21] M. J. Alam et al., "Multitaper MFCC and PLP features for speaker verification using i-vectors," Speech Communication, vol. 55, no. 2, pp. 237–251, 2013.
[22] J. M. K. Kua et al., "Investigation of spectral centroid magnitude and frequency for speaker recognition," in Proc. Odyssey, 2010, pp. 34–39.
[23] P. Rajan, S. H. K. Parthasarathi, and H. A. Murthy, "Robustness of phase based features for speaker recognition," in Proc. INTERSPEECH, 2009.
[24] T. Thiruvaran, E. Ambikairajah, and J. Epps, "Group delay features for speaker recognition," 2007, pp. 1–5.
[25] S. Sadjadi and J. Hansen, "Mean Hilbert envelope coefficients (MHEC) for robust speaker and language identification," Speech Communication, vol. 72, pp. 138–148, 2015.
[26] N. Wang and L. Wang, "Robust speaker recognition based on multi-stream features," 2016, pp. 1–4.
[27] A. G. Adami, "Modeling prosodic differences for speaker recognition," Speech Communication, vol. 49, no. 4, pp. 277–291, 2007.
[28] D. J. Thomson, "Spectrum estimation and harmonic analysis," Proceedings of the IEEE, vol. 70, no. 9, pp. 1055–1096, 1982.
[29] M. Hansson-Sandsten and J. Sandberg, "Optimal cepstrum estimation using multiple windows," in Proc. ICASSP, 2009, pp. 3077–3080.
[30] M. Hansson, T. Gänsler, and G. Salomonsson, "A multiple window method for estimation of a peaked spectrum," in Proc. ICASSP, vol. 3, 1995, pp. 1617–1620.
[31] J. Makhoul, "Linear prediction: A tutorial review," Proceedings of the IEEE, vol. 63, no. 4, pp. 561–580, 1975.
[32] L. Rabiner and B.-H. Juang, Fundamentals of Speech Recognition. USA: Prentice-Hall, Inc., 1993.
[33] H. Hermansky, "Perceptual linear predictive (PLP) analysis of speech," The Journal of the Acoustical Society of America, vol. 87, no. 4, pp. 1738–1752, 1990.
[34] J. Youngberg and S. Boll, "Constant-Q signal analysis and synthesis," in Proc. ICASSP, vol. 3, April 1978, pp. 375–378.
[35] C. Schörkhuber and A. Klapuri, "Constant-Q transform toolbox for music processing," 2010.
[36] M. Todisco, H. Delgado, and N. W. D. Evans, "A new feature for automatic speaker verification anti-spoofing: Constant Q cepstral coefficients," in Proc. Odyssey, 2016, pp. 283–290.
[37] H. Delgado et al., "Further optimisations of constant Q cepstral processing for integrated utterance verification and text-dependent speaker verification," in Proc. SLT, 2016.
[38] H. A. Murthy and V. Gadde, "The modified group delay function and its application to phoneme recognition," in Proc. ICASSP, vol. 1, 2003, pp. I-68.
[39] Z. Wu, X. Xiao, E. S. Chng, and H. Li, "Synthetic speech detection using temporal modulation feature," in Proc. ICASSP, 2013, pp. 7234–7238.
[40] J. Yang and L. Liu, "Playback speech detection based on magnitude-phase spectrum," Electronics Letters, vol. 54, 2018.
[41] L. Cohen, Time-Frequency Analysis: Theory and Applications. USA: Prentice-Hall, Inc., 1995.
[42] P. Ghahremani et al., "A pitch extraction algorithm tuned for automatic speech recognition," in Proc. ICASSP, 2014, pp. 2494–2498.
[43] F. J. Harris, "On the use of windows for harmonic analysis with the discrete Fourier transform," Proceedings of the IEEE, vol. 66, no. 1, pp. 51–83, Jan. 1978.
[44] S. Ioffe, "Probabilistic linear discriminant analysis," in Proc. ECCV, 2006.