A Comparative Re-Assessment of Feature Extractors for Deep Speaker Embeddings
Xuechen Liu, Md Sahidullah, Tomi Kinnunen
School of Computing, University of Eastern Finland, Joensuu, Finland
Université de Lorraine, CNRS, Inria, LORIA, F-54000, Nancy, France
Abstract
Modern automatic speaker verification relies largely on deep neural networks (DNNs) trained on mel-frequency cepstral coefficient (MFCC) features. While there are alternative feature extraction methods based on phase, prosody and long-term temporal operations, they have not been extensively studied with DNN-based methods. We aim to fill this gap by providing an extensive re-assessment of 14 feature extractors on the VoxCeleb and SITW datasets. Our findings reveal that features equipped with techniques such as spectral centroids, the group delay function, and integrated noise suppression provide promising alternatives to MFCCs for deep speaker embedding extraction. Experimental results demonstrate up to a 16.3% (VoxCeleb) and 25.1% (SITW) relative decrease in equal error rate (EER) compared to the baseline.
Index Terms: Speaker verification, feature extraction, deep speaker embeddings.
1. Introduction
Automatic speaker verification (ASV) [1] aims to determine whether two speech segments are from the same speaker. It finds applications in forensics, surveillance, access control, and home electronics. While the field has long been dominated by approaches such as i-vectors [2], the focus has recently shifted to non-linear deep neural networks (DNNs), which have been found to surpass previous solutions in many cases. Representative DNN approaches include the d-vector [3], deep speaker [4] and x-vector [5]. As illustrated in Figure 1, DNNs are used to extract a fixed-sized speaker embedding from each utterance. These embeddings can then be used for speaker comparison with a back-end classifier. The network input and output consist of a sequence of acoustic feature vectors and a vector of speaker posteriors, respectively. The DNN learns the input-output mapping through a number of intermediate layers, including temporal pooling (necessary for the extraction of a fixed-sized embedding). A number of improvements to this core framework have been proposed, including hybrid frame-level layers [6], multi-task learning [7] and alternative loss functions [8], to name a few. In addition, practitioners often use external data [9, 10] to augment the training data, which forces the DNN to extract speaker-related attributes regardless of input perturbations.

While a substantial amount of work has been devoted to improving DNN architectures, loss functions, and data augmentation recipes, the same cannot be said about acoustic features. There are, however, at least two important reasons to study feature extraction. First, data-driven models can only be as good as their input data, i.e., the features. Second, in collaborative settings, it is customary to fuse several ASV systems. These systems should not only perform well in isolation, but be sufficiently diverse as well. One way to achieve diversity is to train systems with different features.

Figure 1: X-vector speaker embedding extractor [5]. Speaker embeddings are usually extracted from the first fully-connected layer after statistics pooling.

The acoustic features used to train deep speaker embedding extractors are typically standard mel-frequency cepstral coefficients (MFCCs) or intermediate representations needed in MFCC extraction: the raw spectrum [11], mel-spectrum or mel-filterbank outputs. There are a few exceptions where the feature extractor is also learnt as part of the DNN architecture (e.g. [12]), although the empirical performance often lags behind hand-crafted feature extraction schemes. This raises the question whether deep speaker embedding extractors might be improved by a simple plug-and-play of other hand-crafted feature extractors in place of MFCCs. Such methods are abundant in the past ASV literature [13, 14, 15], and in the context of related tasks such as spoofing attack detection [16, 17]. An extensive study in the context of DNN-based ASV is, however, missing. Our study aims to fill this gap.

MFCCs are obtained from the power spectrum of a specific time-frequency representation, the short-term Fourier transform (STFT). MFCCs are therefore subject to certain shortcomings of the STFT. They also lack specificity to the short-term phase of the signal. We therefore include a number of alternative features based on the short-term power spectrum and short-term phase. Additionally, we include the fundamental frequency and methods that leverage long-term processing beyond a short-time frame. Improvements over MFCCs are often motivated by robustness to additive noise, improved statistical properties, or closer alignment with human perception. The selected 14 features and their categorization, detailed below, are inspired by [16] and [17]. For generality, we carry out experiments on two widely-adopted datasets, VoxCeleb [11] and speakers-in-the-wild (SITW) [18]. To the best of our knowledge, this is the first extensive re-assessment of acoustic features for DNN-based ASV.
2. Feature Extraction Methods
In this section, we provide a comprehensive list of feature extractors with a brief description of each method. Table 1 summarizes the selected feature extractors along with their parameter settings and references to earlier ASV studies.

Mel-frequency cepstral coefficients (MFCCs). MFCCs are computed by integrating the STFT power spectrum with overlapped band-pass filters on the mel scale, followed by log compression and discrete cosine transform (DCT). Following [1], a desired number of lower-order coefficients is retained. Standard MFCCs form our baseline features.
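For illustration, a minimal NumPy/SciPy sketch of this pipeline is given below. The STFT power spectrum and triangular mel filterbank matrix are assumed to be precomputed (hypothetical inputs, not part of the paper's toolchain):

```python
import numpy as np
from scipy.fftpack import dct

def mfcc_from_power_spectrum(power_spec, mel_fbank, n_ceps=30):
    # power_spec: (num_frames, n_fft // 2 + 1) STFT power spectrum.
    # mel_fbank:  (n_mels, n_fft // 2 + 1) triangular mel filterbank.
    mel_energies = power_spec @ mel_fbank.T            # filterbank integration
    log_mel = np.log(np.maximum(mel_energies, 1e-10))  # log compression
    ceps = dct(log_mel, type=2, axis=1, norm='ortho')  # decorrelating DCT
    return ceps[:, :n_ceps]                            # keep lower-order coefficients
```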
Multi-taper mel-frequency cepstral coefficients (Multi-taper). Viewing each short-term frame of speech as a realization of a random process, the windowed STFT used in MFCC extraction is known to have high variance. To alleviate this, a multi-taper spectrum estimator is adopted [13]. It uses several window functions (tapers) to obtain a low-variance power spectrum estimate, given by $\hat{S}(f) = \sum_{j=1}^{K} \lambda(j) \bigl| \sum_{t=0}^{N-1} w_j(t)\, x(t)\, e^{-i 2\pi t f / N} \bigr|^{2}$. Here, $w_j(t)$ is the $j$-th taper (window) and $\lambda(j)$ is its corresponding weight. The number of tapers, $K$, is an integer (typically between 4 and 8). There are a number of alternative taper sets to choose from: Thomson [28], sinusoidal (SWCE) [29] and multi-peak [30]. In this study, we chose SWCE. A detailed introduction to this spectrum estimator, with experiments on conventional ASV, can be found in [13].
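A sketch of the estimator follows, using classical sine tapers for concreteness; an exact reproduction would substitute the SWCE tapers and weights of [29]:

```python
import numpy as np

def sine_tapers(N, K=8):
    # Classical sine tapers; the SWCE tapers/weights of [29] would
    # replace these in the paper's exact setup.
    n = np.arange(1, N + 1)
    return np.stack([np.sqrt(2.0 / (N + 1)) * np.sin(np.pi * k * n / (N + 1))
                     for k in range(1, K + 1)])

def multitaper_power_spectrum(frame, tapers, weights, n_fft=512):
    # frame: (N,) samples; tapers: (K, N); weights: (K,) summing to 1.
    # Weighted average of K individually windowed periodograms:
    # S_hat(f) = sum_j lambda(j) * |FFT(w_j * x)(f)|^2.
    spectra = np.abs(np.fft.rfft(tapers * frame, n=n_fft, axis=1)) ** 2
    return weights @ spectra

# Example with uniform weights lambda(j) = 1/K:
# S = multitaper_power_spectrum(frame, sine_tapers(400, 8), np.full(8, 1 / 8))
```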
Linear prediction cepstral features. An alternative route to cepstral features derives from an all-pole [31] representation of the signal. Linear prediction cepstral coefficients (LPCCs) are derived from the linear prediction coefficients (LPCs) by a recursive operation [32]. A similar method applies to perceptual LPCCs (PLPCCs), with a series of perceptual processing steps applied at the primary stage [33].
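A sketch of the LPC-to-cepstrum recursion is shown below; the sign convention assumes $A(z) = 1 + \sum_k a_k z^{-k}$ and may need adjusting to match a particular LPC routine:

```python
import numpy as np

def lpc_to_cepstrum(a, n_ceps=30):
    # a: LPC coefficients a_1..a_p of A(z) = 1 + sum_k a_k z^{-k}.
    # Returns cepstral coefficients c_1..c_{n_ceps} of the all-pole
    # model via the standard recursion (cf. [32]).
    p = len(a)
    c = np.zeros(n_ceps)
    for n in range(1, n_ceps + 1):
        acc = -a[n - 1] if n <= p else 0.0
        for k in range(max(1, n - p), n):
            acc -= (k / n) * c[k - 1] * a[n - k - 1]
        c[n - 1] = acc
    return c
```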
Spectral subband centroid features. Spectral subband centroid based features were introduced and investigated in statistical ASV [22]. We consider two types of spectral centroid features: spectral centroid magnitude (SCM) and subband centroid frequency (SCF). They are computed as weighted averages of the normalized subband magnitudes and frequencies, respectively. SCFs are used directly as SCF coefficients (SCFCs), while log compression and DCT are applied to the SCMs to obtain SCM coefficients (SCMCs). For more details, one may refer to [22].
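The following sketch illustrates the two centroid quantities under a simplified rectangular-subband assumption; the exact subband weighting and filterbank design follow [22]:

```python
import numpy as np

def subband_centroids(power_spec, freqs, band_edges):
    # power_spec: (num_frames, n_bins); freqs: (n_bins,) bin center
    # frequencies; band_edges: list of (lo, hi) bin-index pairs.
    scf, scm = [], []
    for lo, hi in band_edges:
        P, f = power_spec[:, lo:hi], freqs[lo:hi]
        scf.append((P * f).sum(axis=1) / (P.sum(axis=1) + 1e-10))  # centroid frequency
        scm.append((P * f).sum(axis=1) / f.sum())                  # centroid magnitude
    # SCFCs: use the centroid frequencies directly;
    # SCMCs: log-compress the magnitudes, then apply a DCT (see text).
    return np.stack(scf, axis=1), np.stack(scm, axis=1)
```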
Constant-Q cepstral coefficients (CQCCs). The constant-Q transform (CQT) was introduced in [34]. It has been applied in music signal processing [35], spoofing detection [36], as well as in ASV [37]. Unlike the STFT, the CQT produces a time-frequency representation with variable resolution. The resulting CQT power spectrum is log-compressed and uniformly resampled, followed by DCT to yield CQCCs. Further details can be found in [36].

Modified group delay function (MGDF). The MGDF was introduced in [38] with application to phone recognition, and was further applied to speaker recognition [23]. It is a parametric representation of the phase spectrum, defined as $\tau(k) = \mathrm{sign} \cdot \bigl| (X_R(k) Y_R(k) + Y_I(k) X_I(k)) / S(k)^{2\gamma} \bigr|^{\alpha}$, where $k$ is the frequency index; $X_R(k)$ and $X_I(k)$ are the real and imaginary parts of the discrete Fourier transform (DFT) of the speech samples $x(n)$; $Y_R(k)$ and $Y_I(k)$ are the real and imaginary parts of the DFT of $n\,x(n)$; $\mathrm{sign}$ is the sign of $X_R(k) Y_R(k) + Y_I(k) X_I(k)$; $\alpha$ and $\gamma$ are control parameters; and $S(k)$ is a smoothed magnitude spectrum. Cepstral-like coefficients usable as features are then obtained from the function outputs by log-compression and DCT.

All-pole group delay function (APGDF). An alternative phase representation of the signal was proposed for ASV in [14]. The group delay function is computed by differentiating the unwrapped phase of an all-pole spectrum. The main advantage of the APGDF over the MGDF is its smaller number of control parameters.
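A sketch of the MGDF computation above is given below; the smoothing of $S(k)$ and the $\alpha$, $\gamma$ values are illustrative placeholders rather than the paper's exact settings (which follow [23]):

```python
import numpy as np

def modified_group_delay(frame, n_fft=512, alpha=0.4, gamma=0.9):
    # alpha and gamma here are placeholder values for illustration.
    n = np.arange(len(frame))
    X = np.fft.rfft(frame, n_fft)        # DFT of x(n)
    Y = np.fft.rfft(n * frame, n_fft)    # DFT of n*x(n)
    num = X.real * Y.real + X.imag * Y.imag
    # S(k): smoothed magnitude spectrum; a moving average stands in
    # for the usual cepstral smoothing.
    S = np.convolve(np.abs(X), np.ones(11) / 11, mode='same') + 1e-8
    tau = num / S ** (2 * gamma)
    return np.sign(tau) * np.abs(tau) ** alpha
```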
Cosine phase function (cosphase). The cosine of the phase has been applied to spoofing attack detection [16, 39]. The DFT-based unwrapped phase is first normalized to $[-1, 1]$ using the cosine operation, and then processed with DCT to derive the cosphase coefficients.

Constant-Q magnitude-phase octave coefficients (CMPOCs). Unlike the previous DFT-based features, CMPOCs utilize the CQT. The magnitude-phase spectrum (MPS) of the CQT is computed as $\sqrt{\ln^{2}|X(\omega)| + \phi^{2}(\omega)}$, where $X(\omega)$ and $\phi(\omega)$ denote the magnitude and phase of the CQT. The MPS is then segmented according to octave, and processed with log-compression and DCT to derive the CMPOCs. So far, CMPOCs have been studied only for playback attack detection [40].
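A minimal sketch of the cosphase computation described above:

```python
import numpy as np
from scipy.fftpack import dct

def cosphase_features(frame, n_fft=512, n_ceps=30):
    # Cosine of the unwrapped DFT phase (values in [-1, 1]),
    # decorrelated with a DCT.
    phase = np.unwrap(np.angle(np.fft.rfft(frame, n_fft)))
    return dct(np.cos(phase), type=2, norm='ortho')[:n_ceps]
```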
We use the term 'long-term processing' to refer to methods that use information across a longer context of consecutive frames.

Mean Hilbert envelope coefficients (MHECs). Proposed in [25] for i-vector based ASV, MHEC applies Gammatone filterbanks to the speech signal. The output of each channel of the filterbank is then processed to compute temporal envelopes as $e_s(t,j) = \sqrt{s^{2}(t,j) + \hat{s}^{2}(t,j)}$, where $s(t,j)$ is the so-called 'analytic signal' and $\hat{s}(t,j)$ denotes its Hilbert transform [41]; $t$ and $j$ represent the time and channel index, respectively. The envelopes are low-pass filtered, framed and averaged to compute energies. Finally, the energies are transformed to cepstral-like coefficients by log-compression and DCT. More details can be found in [25].

Power-normalized cepstral coefficients (PNCCs). To generate PNCCs, the input waveform is first processed by Gammatone filterbanks and fed into a cascade of non-linear time-varying operations aimed at suppressing the impact of noise and reverberation. Mean power normalization is performed at the output of this series of operations so as to minimize the potentially detrimental effect of amplitude scaling. Cepstral features are then obtained by a power-law non-linearity and DCT. PNCCs have been applied to speech recognition [15] as well as to i-vector based ASV [26].
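For concreteness, the Hilbert-envelope operation at the core of MHEC described above can be sketched as follows (the subsequent low-pass filtering, framing and averaging are omitted):

```python
import numpy as np
from scipy.signal import hilbert

def hilbert_envelope(subband):
    # e_s(t) = sqrt(s(t)^2 + s_hat(t)^2): magnitude of the analytic
    # signal of one Gammatone channel output.
    return np.abs(hilbert(subband))
```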
Fundamental frequency features. Aside from the various types of features above, an initial investigation into the effect of harmonic information was conducted. For simplicity and comparability, the pitch extraction algorithm from [42], based on the normalized cross-correlation function (NCCF), was employed to extract 3-dimensional pitch vectors. These are appended to the MFCCs. In the rest of the paper, we refer to this feature as MFCC+pitch.

Table 1: List of feature extractors addressed in this study, with configuration details and references to exemplar earlier relevant studies on ASV. As mentioned in Section 1, aside from MFCCs, the previous works noted here concern conventional (non-DNN) models.

Short-term magnitude/power spectral features:
  MFCC (30)         Baseline, no. of FFT coefficients = 512          [5, 6]
  CQCC (60)         CQCC v2.0 package                                [19]
  LPCC (30)         LP order = 30                                    [20]
  PLPCC (30)        LP order = 30, Bark-scale filterbank             [21]
  SCFC (30)         No. of filters = 30                              [22]
  SCMC (30)         No. of filters = 30                              [22]
  Multi-taper (30)  MFCC with SWCE windowing, no. of tapers = 8      [13, 21]

Short-term phase spectral features:
  MGDF (30)         First 30 coeff. from DCT                         [23, 24]
  APGDF (30)        LP order = 30                                    [14]
  CosPhase (30)     First 30 coeff. from DCT                         -
  CMPOC (30)        N = 96, first 30 coeff. from DCT                 -

Short-term features with long-term processing:
  MHEC (30)         No. of filters in Gammatone filterbank = 20      [25]
  PNCC (30)         First 30 coeff. from DCT                         [26]

Fundamental frequency features:
  MFCC+pitch (33)   Kaldi pitch extractor, MFCC (30) with pitch (3)  [27]

Table 2: Results of a preliminary experiment investigating dynamic features on the Voxceleb1-E test set. The dimension of the static part was 30 in all three cases.

  Feature    EER(%)  minDCF
  MFCC       4.65    0.5937
  MFCC+∆     -       -
  MFCC+∆∆    -       -

Table 3: Results of different features and fusion systems on the Voxceleb1-E test set and the SITW development set (SITW-DEV).

                          Voxceleb1-E        SITW-DEV
  Feature                 EER(%)  minDCF     EER(%)  minDCF
  MFCC                    4.65    0.5937     8.12    0.8531
  CQCC                    8.21    0.8310     9.43    0.9093
  LPCC                    6.42    0.7129     9.39    0.9109
  PLPCC                   7.06    0.7433     9.12    0.9178
  SCFC                    6.56    0.7173     7.82    0.8530
  SCMC                    -       -          -       -
  MFCC+pitch              -       -          -       -
  MFCC+SCMC+Multi-taper   -       -          -       -
  MFCC+cosphase+PNCC      -       -          -       -
3. Experiments
We trained the neural networks on the dev part [11] of VoxCeleb1, consisting of 1211 speakers. We used two evaluation sets: one for a matched train-test condition and one for a relatively mismatched condition. The first was the test part of the same VoxCeleb1 dataset, consisting of 40 speakers; the second was the development part of SITW under the "core-core" condition, consisting of 119 speakers. The VoxCeleb1 evaluation consists of 18860 genuine trials and the same number of impostor trials; the corresponding SITW partition has 2597 genuine and 335629 impostor trials. We refer to the two datasets as 'Voxceleb1-E' and 'SITW-DEV', respectively.
All features were extracted with a frame length of 25 ms and a 10 ms shift. We applied a Hamming [43] window in all cases except for the multi-taper features. Table 1 describes the associated control parameters (where applicable) and the implementation details for each feature extractor. As post-processing, we applied energy-based speech activity detection (SAD) and utterance-level cepstral mean normalization (CMN) [1], except for MFCC+pitch, whose additional components contain the probability of voicing (POV).
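The framing, windowing and CMN steps can be sketched as follows (energy-based SAD omitted):

```python
import numpy as np

def frame_signal(x, sr, frame_ms=25, shift_ms=10):
    # 25 ms frames with a 10 ms shift, Hamming-windowed (cf. [43]).
    flen = int(sr * frame_ms / 1000)
    fshift = int(sr * shift_ms / 1000)
    n_frames = 1 + max(0, (len(x) - flen) // fshift)
    idx = np.arange(flen)[None, :] + fshift * np.arange(n_frames)[:, None]
    return x[idx] * np.hamming(flen)

def cmn(feats):
    # Utterance-level cepstral mean normalization: subtract the
    # per-coefficient mean computed over all frames of the utterance.
    return feats - feats.mean(axis=0, keepdims=True)
```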
To compare the different feature extractors, we trained an x-vector system for each of them, as illustrated in Figure 1. We replicated the DNN configuration from [5] and trained the model on the data described above without any data augmentation; this helps to assess the inherent robustness of the individual features. We extracted a 512-dimensional speaker embedding for each test utterance. The embeddings were length-normalized and centered before being transformed with 200-dimensional linear discriminant analysis (LDA), followed by scoring with a probabilistic linear discriminant analysis (PLDA) [44] classifier.
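A sketch of this embedding post-processing chain, here using scikit-learn's LDA in place of the paper's Kaldi recipe and omitting the PLDA scoring step:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def prepare_embeddings(train_x, train_spk, test_x, lda_dim=200):
    # Length-normalize, center with the training mean, then reduce to
    # lda_dim dimensions with LDA. PLDA scoring (done with Kaldi in
    # the paper) would follow on the transformed embeddings.
    def lnorm(x):
        return x / np.linalg.norm(x, axis=1, keepdims=True)
    train_x, test_x = lnorm(train_x), lnorm(test_x)
    mu = train_x.mean(axis=0)
    lda = LinearDiscriminantAnalysis(n_components=lda_dim)
    lda.fit(train_x - mu, train_spk)
    return lda.transform(test_x - mu)
```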
The verification accuracy was measured by the equal error rate (EER) and the minimum detection cost function (minDCF), the latter with equal costs $C_{\mathrm{fa}} = C_{\mathrm{miss}} = 1.0$ and a fixed target speaker prior. Detection error trade-off (DET) curves for all feature extraction methods are also presented. We used Kaldi (https://github.com/kaldi-asr/kaldi) to compute the EER and minDCF, and the BOSARIS toolkit (https://sites.google.com/site/bosaristoolkit/) for the DET plots.
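For reference, a minimal sketch of how EER and a normalized minDCF can be computed from pooled trial scores; the target prior of 0.01 used as a default below is an assumption for illustration, not a value taken from the paper:

```python
import numpy as np

def eer_mindcf(tar, non, p_tgt=0.01, c_miss=1.0, c_fa=1.0):
    # Sweep the decision threshold over every observed score.
    scores = np.concatenate([tar, non])
    labels = np.concatenate([np.ones(len(tar)), np.zeros(len(non))])
    labels = labels[np.argsort(scores)]
    p_miss = np.cumsum(labels) / len(tar)          # targets below threshold
    p_fa = 1.0 - np.cumsum(1 - labels) / len(non)  # non-targets above it
    i = np.argmin(np.abs(p_miss - p_fa))
    eer = (p_miss[i] + p_fa[i]) / 2.0
    dcf = c_miss * p_tgt * p_miss + c_fa * (1 - p_tgt) * p_fa
    min_dcf = dcf.min() / min(c_miss * p_tgt, c_fa * (1 - p_tgt))
    return eer, min_dcf
```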
Figure 2: DET plots for the evaluation sets: (top) Voxceleb1-E; (bottom) SITW-DEV. Best viewed in color.
4. Results
As a sanity check, we first conducted a preliminary experiment investigating the effectiveness of dynamic features, with results reported in Table 2. We extended the baseline by appending delta and double-delta coefficients to the static MFCCs. According to the table, adding delta features did not improve performance. This might be because the frame-level network layers already capture information across neighboring frames. In the remainder, we use static features only.
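The delta coefficients used in this experiment follow the standard regression formula, sketched below:

```python
import numpy as np

def delta(feats, N=2):
    # Regression-based delta coefficients over a +/-N frame window:
    # d_t = sum_n n * (c_{t+n} - c_{t-n}) / (2 * sum_n n^2).
    T = len(feats)
    padded = np.pad(feats, ((N, N), (0, 0)), mode='edge')
    out = np.zeros_like(feats, dtype=float)
    for n in range(1, N + 1):
        out += n * (padded[N + n:N + n + T] - padded[N - n:N - n + T])
    return out / (2 * sum(n * n for n in range(1, N + 1)))

# MFCC + delta + double-delta, as in Table 2:
# feats = np.hstack([mfcc, delta(mfcc), delta(delta(mfcc))])
```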
Table 3 summarizes the results for both corpora. On Voxceleb1-E, we found that MFCCs outperform most of the alternative features in terms of EER, with SCMCs as the only exception. This may indicate the effectiveness of information related to subband energies. However, SCFCs did not outperform SCMCs, which suggests that the subband magnitudes may be more important than their frequencies. Concerning the phase spectral features, MGDFs lagged behind the others, possibly due to sub-optimal control parameter settings. CMPOCs reached a 27.6% relatively lower EER than CQCCs, which highlights the effectiveness of phase information among the CQT-based features. Moreover, while MFCC+pitch yielded a competitive EER and the best minDCF, LPCCs and PLPCCs did not perform as well. This indicates the potential importance of explicit harmonic information, a finding echoed in the SITW-DEV results. A similar observation holds for the multi-taper MFCCs, confirming the efficacy of multi-taper windowing known from conventional ASV.

Focusing on SITW-DEV, the most competitive features include those from the phase and 'long-term' categories. PNCCs reached the best performance in both metrics, outperforming the baseline MFCCs by 25.1% relative in terms of EER. This might be due to the robustness-enhancing operations integrated into the pipeline, recalling that SITW-DEV represents more challenging and mismatched data conditions. While not outperforming the baseline on Voxceleb1-E, SCFCs yielded competitive numbers along with SCMCs, which further indicates the usefulness of subband information. The best performance of cosphase within the phase category reflects the advantage of the cosine normalizer relative to the group delay function. An additional benefit of cosphase over the group delay features is its smaller number of control parameters.

Next, we addressed simple equal-weighted linear score fusion. We considered two sets of features: 1) MFCCs, SCMCs and Multi-taper; 2) MFCCs, cosphase and PNCCs. The former set of extractors shares similar spectral operations, while the latter covers more diverse speech attributes. Results are presented at the bottom of Table 3. On Voxceleb1-E, we see further improvement for both fused systems, especially the first one, which reached the lowest overall EER, outperforming the baseline by 16.3% relative. On SITW-DEV, however, the best performance was still held by a single system. This indicates that simple equal-weighted linear score-level fusion may be more effective in relatively matched conditions.
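The fusion itself is a plain average of per-system scores; the z-normalization in the sketch below is our assumption, as the paper does not detail score scaling:

```python
import numpy as np

def fuse_scores(system_scores):
    # Equal-weighted linear score fusion. Each system's trial scores
    # are z-normalized first to put them on a comparable scale
    # (an assumption; not specified in the paper).
    normed = [(s - s.mean()) / s.std() for s in map(np.asarray, system_scores)]
    return np.mean(normed, axis=0)

# fused = fuse_scores([scores_mfcc, scores_scmc, scores_multitaper])
```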
Finally, the DET curves for all systems, including the fused ones, are shown in Figure 2, and they agree with the findings in Table 3. On Voxceleb1-E, the two fusion systems are in general closer to the origin than any of the single systems, corresponding to the observation above. On SITW, PNCC confirms its superior performance on SITW-DEV, although towards the bottom-right both spectral centroid features pull away, which may indicate that they favor operating points that are less strict on false alarms.
5. Conclusion
This paper presented an extensive re-assessment of acoustic feature extractors for DNN-based ASV systems. We evaluated them on Voxceleb1 and SITW, covering matched and mismatched conditions. We achieved improvements over MFCCs especially on SITW, which represents the more mismatched testing condition. We also found that alternative methods based on spectral centroids, the group delay function, and integrated noise suppression can be useful for DNN systems; future work should revisit and extend them under more scenarios. Finally, we made an initial attempt at score-level fusion, with competitive performance indicating the potential of this approach.
6. Acknowledgements
This work was partially supported by the Academy of Finland (project 309629) and Inria Nancy Grand Est.

7. References

[1] T. Kinnunen and H. Li, "An overview of text-independent speaker recognition: From features to supervectors," Speech Communication, vol. 52, no. 1, pp. 12–40, 2010.
[2] N. Dehak, P. J. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, "Front-end factor analysis for speaker verification," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 4, pp. 788–798, May 2011.
[3] E. Variani et al., "Deep neural networks for small footprint text-dependent speaker verification," in Proc. ICASSP, 2014, pp. 4052–4056.
[4] C. Li, X. Ma, B. Jiang, X. Li, X. Zhang, X. Liu, Y. Cao, A. Kannan, and Z. Zhu, "Deep speaker: an end-to-end neural speaker embedding system," CoRR, vol. abs/1705.02304, 2017.
[5] D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur, "X-vectors: Robust DNN embeddings for speaker recognition," in Proc. ICASSP, 2018, pp. 5329–5333.
[6] D. Snyder, D. Garcia-Romero, G. Sell, A. McCree, D. Povey, and S. Khudanpur, "Speaker recognition for multi-speaker conversations using x-vectors," in Proc. ICASSP, 2019, pp. 5796–5800.
[7] L. You, W. Guo, L. R. Dai, and J. Du, "Multi-task learning with high-order statistics for x-vector based text-independent speaker verification," in Proc. INTERSPEECH, 2019, pp. 1158–1162.
[8] Y. Li, F. Gao, Z. Ou, and J. Sun, "Angular softmax loss for end-to-end speaker verification," 2018, pp. 190–194.
[9] T. Ko, V. Peddinti, D. Povey, M. L. Seltzer, and S. Khudanpur, "A study on data augmentation of reverberant speech for robust speech recognition," in Proc. ICASSP, 2017, pp. 5220–5224.
[10] D. Snyder, G. Chen, and D. Povey, "MUSAN: A music, speech, and noise corpus," 2015, arXiv:1510.08484v1.
[11] A. Nagrani, J. S. Chung, and A. Zisserman, "VoxCeleb: A large-scale speaker identification dataset," in Proc. INTERSPEECH, 2017, pp. 2616–2620.
[12] M. Ravanelli and Y. Bengio, "Speaker recognition from raw waveform with SincNet," in Proc. SLT, 2018, pp. 1021–1028.
[13] T. Kinnunen et al., "Low-variance multitaper MFCC features: A case study in robust speaker verification," IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 7, pp. 1990–2001, 2012.
[14] P. Rajan, T. Kinnunen, C. Hanilçi, J. Pohjalainen, and P. Alku, "Using group delay functions from all-pole models for speaker recognition," in Proc. INTERSPEECH, 2013, pp. 2489–2493.
[15] C. Kim and R. M. Stern, "Power-normalized cepstral coefficients (PNCC) for robust speech recognition," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 7, pp. 1315–1329, 2016.
[16] M. Sahidullah, T. Kinnunen, and C. Hanilçi, "A comparison of features for synthetic speech detection," in Proc. INTERSPEECH, 2015, pp. 2087–2091.
[17] C. Hanilçi, "Features and classifiers for replay spoofing attack detection," 2017, pp. 1187–1191.
[18] M. McLaren, L. Ferrer, D. Castán Lavilla, and A. Lawson, "The speakers in the wild (SITW) speaker recognition database," in Proc. INTERSPEECH, 2016, pp. 818–822.
[19] M. Todisco, H. Delgado, and N. Evans, "Articulation rate filtering of CQCC features for automatic speaker verification," in Proc. INTERSPEECH, 2016, pp. 3628–3632.
[20] X. Jing, J. Ma, J. Zhao, and H. Yang, "Speaker recognition based on principal component analysis of LPCC and MFCC," in Proc. ICSPCC, 2014, pp. 403–408.
[21] M. J. Alam et al., "Multitaper MFCC and PLP features for speaker verification using i-vectors," Speech Communication, vol. 55, no. 2, pp. 237–251, 2013.
[22] J. M. K. Kua et al., "Investigation of spectral centroid magnitude and frequency for speaker recognition," in Proc. Odyssey, 2010, pp. 34–39.
[23] P. Rajan, S. H. K. Parthasarathi, and H. A. Murthy, "Robustness of phase based features for speaker recognition," in Proc. INTERSPEECH, 2009.
[24] T. Thiruvaran, E. Ambikairajah, and J. Epps, "Group delay features for speaker recognition," 2007, pp. 1–5.
[25] S. Sadjadi and J. Hansen, "Mean Hilbert envelope coefficients (MHEC) for robust speaker and language identification," Speech Communication, vol. 72, pp. 138–148, 2015.
[26] N. Wang and L. Wang, "Robust speaker recognition based on multi-stream features," 2016, pp. 1–4.
[27] A. G. Adami, "Modeling prosodic differences for speaker recognition," Speech Communication, vol. 49, no. 4, pp. 277–291, 2007.
[28] D. J. Thomson, "Spectrum estimation and harmonic analysis," Proceedings of the IEEE, vol. 70, no. 9, pp. 1055–1096, 1982.
[29] M. Hansson-Sandsten and J. Sandberg, "Optimal cepstrum estimation using multiple windows," in Proc. ICASSP, 2009, pp. 3077–3080.
[30] M. Hansson, T. Gänsler, and G. Salomonsson, "A multiple window method for estimation of a peaked spectrum," in Proc. ICASSP, vol. 3, 1995, pp. 1617–1620.
[31] J. Makhoul, "Linear prediction: A tutorial review," Proceedings of the IEEE, vol. 63, no. 4, pp. 561–580, 1975.
[32] L. Rabiner and B.-H. Juang, Fundamentals of Speech Recognition. USA: Prentice-Hall, Inc., 1993.
[33] H. Hermansky, "Perceptual linear predictive (PLP) analysis of speech," The Journal of the Acoustical Society of America, vol. 87, no. 4, pp. 1738–1752, 1990.
[34] J. Youngberg and S. Boll, "Constant-Q signal analysis and synthesis," in Proc. ICASSP, vol. 3, April 1978, pp. 375–378.
[35] C. Schörkhuber and A. Klapuri, "Constant-Q transform toolbox for music processing," 2010.
[36] M. Todisco, H. Delgado, and N. W. D. Evans, "A new feature for automatic speaker verification anti-spoofing: Constant Q cepstral coefficients," in Proc. Odyssey, 2016, pp. 283–290.
[37] H. Delgado et al., "Further optimisations of constant Q cepstral processing for integrated utterance verification and text-dependent speaker verification," in Proc. SLT, 2016.
[38] H. A. Murthy and V. Gadde, "The modified group delay function and its application to phoneme recognition," in Proc. ICASSP, vol. 1, 2003, pp. I-68.
[39] Z. Wu, X. Xiao, E. S. Chng, and H. Li, "Synthetic speech detection using temporal modulation feature," in Proc. ICASSP, 2013, pp. 7234–7238.
[40] J. Yang and L. Liu, "Playback speech detection based on magnitude-phase spectrum," Electronics Letters, vol. 54, 2018.
[41] L. Cohen, Time-Frequency Analysis: Theory and Applications. USA: Prentice-Hall, Inc., 1995.
[42] P. Ghahremani et al., "A pitch extraction algorithm tuned for automatic speech recognition," in Proc. ICASSP, 2014, pp. 2494–2498.
[43] F. J. Harris, "On the use of windows for harmonic analysis with the discrete Fourier transform," Proceedings of the IEEE, vol. 66, no. 1, pp. 51–83, Jan. 1978.
[44] S. Ioffe, "Probabilistic linear discriminant analysis," in Proc. ECCV, 2006.