Light Convolutional Neural Network with Feature Genuinization for Detection of Synthetic Speech Attacks
Zhenzong Wu, Rohan Kumar Das*, Jichen Yang* and Haizhou Li
Department of Electrical and Computer Engineering, National University of Singapore, Singapore
Kriston AI Lab, China
[email protected], {rohankd, eleyji, haizhou.li}@nus.edu.sg

Abstract
Modern text-to-speech (TTS) and voice conversion (VC) systems produce natural sounding speech that questions the security of automatic speaker verification (ASV). This makes detection of such synthetic speech very important to safeguard ASV systems from unauthorized access. Most of the existing spoofing countermeasures perform well when the nature of the attacks is made known to the system during training. However, their performance degrades in the face of attacks of an unseen nature. In comparison to the synthetic speech created by a wide range of TTS and VC methods, genuine speech has a more consistent distribution. We believe that the difference between the distributions of synthetic and genuine speech is an important discriminative feature between the two classes. In this regard, we propose a novel method referred to as feature genuinization that learns a transformer based on a convolutional neural network (CNN) using the characteristics of only genuine speech. We then use this genuinization transformer with a light CNN (LCNN) classifier. The ASVspoof 2019 logical access corpus is used to evaluate the proposed method. The studies show that the proposed feature genuinization based LCNN system outperforms other state-of-the-art spoofing countermeasures, depicting its effectiveness for detection of synthetic speech attacks.
Index Terms: Feature genuinization, synthetic speech detection, ASVspoof 2019, logical access attacks
1. Introduction
In recent years, automatic speaker verification (ASV) systems have been deployed in different real-world applications [1–3]. These systems are exposed to spoofing attacks for unauthorized access; hence, detection of such attacks attracts much attention [4, 5]. Spoofing attacks are broadly classified into replay, impersonation, voice conversion (VC) and text-to-speech synthesis (TTS) attacks [6]. The latest VC and TTS systems can produce perceptually natural sounding speech, which poses a threat of fooling ASV systems [7–9]. Research on spoofing countermeasures has grown in the last decade since the inception of the ASVspoof challenge series. The challenge provides a platform for researchers across different domains to explore fake speech detection using a common benchmark corpus [10, 11]. Its recent edition, ASVspoof 2019, is devoted to the detection of both synthetic and replay speech with two subtasks [12]. The logical access track focuses on the detection of synthetic speech created using state-of-the-art VC and TTS systems, which is the focus of this paper.

*Corresponding Author

The explorations on spoofing attack detection cover two directions from the perspective of a detection task. Spoofing countermeasures focus either on novel front-end features or on effective classifiers. Some earlier studies focused on robust features such as cochlear filter cepstral coefficients and instantaneous frequency (CFCCIF) [13], linear frequency cepstral coefficients (LFCC), subband spectral flux coefficients and spectral centroid frequency coefficients [14]. Later, the long-term constant-Q transform (CQT) based constant-Q cepstral coefficients (CQCC) proved to be one of the strong front-ends for synthetic speech detection [15]. Recent explorations with features derived from CQT are also found to be effective for spoofing detection [16–18]. With the advent of deep learning methods, robust classifiers have been investigated for the detection of spoofing attacks.
Some of these include end-to-end systems with light convolutional neural networks (LCNN) [19, 20], and squeeze-excitation and residual networks [21, 22]. The end-to-end systems differ substantially from works that focus on novel features: the former are data-driven deep learning methods, while the latter emphasize hand-crafted features, which require prior knowledge. Further, we note that the same neural network based system can perform differently for a range of features [19]. Therefore, a robust spoofing countermeasure requires a strong feature extractor that captures the discriminative artifacts, along with an effective classifier.

Synthetic speech attacks can be created with a wide range of TTS and VC algorithms [6]. In general, spoofing countermeasures do not handle synthetic speech from unseen sources well because of a lack of generalization ability [23]. We note that genuine examples have comparatively lower variance than synthetic speech. We believe that the consistent characteristics of genuine speech set it apart from a variety of different synthetic speech. A recent study using temporal domain information shows that spoofing detection can be improved by modifying the probability mass function of spoofed speech to be close to that of genuine speech [24]. This process is termed genuinization, and it is found to be effective when applied to both train and test examples for synthetic speech detection.

In a similar direction, we hypothesize that, if we are able to derive a model that fits the distribution of genuine speech well, such a model will take genuine speech as input and generate output following the same distribution as the genuine speech. However, when the model takes spoofed speech as input, it will generate a very different output, which amplifies the difference from genuine speech.
With this hypothesis, we propose to derive a model from genuine speech features using a convolutional neural network (CNN), which is referred to as the genuinization transformer. Further, the process is referred to as feature genuinization, as a given feature representation is projected onto a domain learned using only the genuine features. The genuinization transformer is then used together with an LCNN system for the detection of synthetic speech attacks.

Figure 1: The block diagram of the feature genuinization process.

The rest of the paper is organized as follows. Section 2 introduces the details of the proposed feature genuinization. Section 3 describes the feature genuinization based LCNN system for the detection of spoofing attacks. The experiments and their results with discussion are reported in Section 4 and Section 5, respectively. Finally, the paper is concluded in Section 6.
2. Feature Genuinization
We aim to learn a transformer that does not change the characteristics of genuine speech features, whereas it projects spoofed speech to a different output, maximizing the difference between genuine and spoofed speech. Figure 1 shows the block diagram of the proposed feature genuinization process, which has two stages. The first stage focuses on training a feature genuinization transformer using characteristic features derived from only genuine speech. During the second stage, this trained feature genuinization transformer is used to convert any given features in a way that enhances the discrimination between genuine and spoofed speech.

CNN based architectures have shown their effectiveness in the field of anti-spoofing research [25]. In this regard, we use a CNN for training the genuinization transformer, as shown in Figure 1. The detailed architecture of the CNN used in this framework is given in Figure 2. The functionality of the proposed genuinization transformer is similar to that of an autoencoder. However, the output of the genuinization transformer is considered as the final transformed result. In addition, we apply fully convolutional layers, so there are no fully connected layers in the transformer. This forces the network to focus on the temporal correlation between the input signal and the whole transformation process. Further, it reduces the number of training parameters, which significantly shortens the training period.

A study in [26] shows that it is good practice to use strided convolution rather than pooling for downsampling, as it allows the network to learn its own pooling function. Therefore, we use this method during the training of the genuinization transformer. In addition, BatchNorm2d and leaky rectified linear unit (ReLU) activation functions are used in training because they can
promote healthy gradient flow, which is critical for the learning process.

Figure 2: The architecture of the genuinization transformer. The encoder stacks five strided Conv2d layers (4×4 kernels; 32, 64, 128, 256 and 512 channels; stride 2, with padding), each followed by LeakyReLU (0.2) and batch normalization. The decoder mirrors this with ConvTranspose2d layers (256, 128, 64 and 32 channels; 4×4 kernels; stride 2, with padding), each followed by ReLU and batch normalization.

The architecture of the proposed genuinization transformer shown in Figure 2 consists of two functionalities: encoding and decoding. During the encoding phase, the input signal is compressed through a number of strided convolutional layers, and the convolution result is passed through leaky ReLU. In the decoding phase, the encoding process is reversed by deconvolution, followed by ReLU. In this way, the transformer works as an autoencoder that learns the characteristics of genuine speech [27]. As a result, it amplifies the discrimination between genuine and spoofed speech in the transformed domain.

Once the genuinization transformer is trained, it can be used to transform any given genuine or spoofed features into a transformed domain that is learned using only the genuine feature characteristics. This novel way of transforming the features is referred to as feature genuinization, as mentioned earlier. Next, we discuss the LCNN system using feature genuinization for the detection of spoofing attacks.
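The encoder-decoder structure of Figure 2 can be sketched in PyTorch as follows. This is a minimal illustration, not the authors' released code: the exact padding, the final layer mapping back to one channel, and an even input height of 864 (instead of the 863-bin LPS, so that five stride-2 stages invert cleanly) are our assumptions.

```python
import torch
import torch.nn as nn

class GenuinizationTransformer(nn.Module):
    """Fully convolutional encoder-decoder sketch of the genuinization
    transformer (layer widths follow Figure 2; names are illustrative)."""
    def __init__(self):
        super().__init__()
        def down(cin, cout):
            # strided convolution replaces pooling, as advocated in the paper
            return nn.Sequential(
                nn.Conv2d(cin, cout, kernel_size=4, stride=2, padding=1),
                nn.BatchNorm2d(cout),
                nn.LeakyReLU(0.2),
            )
        def up(cin, cout):
            return nn.Sequential(
                nn.ConvTranspose2d(cin, cout, kernel_size=4, stride=2, padding=1),
                nn.BatchNorm2d(cout),
                nn.ReLU(),
            )
        self.encoder = nn.Sequential(down(1, 32), down(32, 64), down(64, 128),
                                     down(128, 256), down(256, 512))
        # assumed: a final transposed conv maps the 32 channels back to 1
        self.decoder = nn.Sequential(up(512, 256), up(256, 128), up(128, 64),
                                     up(64, 32),
                                     nn.ConvTranspose2d(32, 1, 4, 2, 1))

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = GenuinizationTransformer()
lps = torch.randn(2, 1, 864, 256)   # batch of log power spectrograms
out = model(lps)
print(out.shape)                    # spatial size matches the input
```

Training such a model only on genuine LPS with a reconstruction loss (e.g., mean squared error) would realize the autoencoder behavior described above; the specific loss is not stated in the text.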
3. LCNN with Feature Genuinization
Various deep learning systems have shown their effectiveness for spoofing attack detection [19, 21, 22, 28, 29]. Therefore, we use the proposed feature genuinization with a deep learning system. The LCNN is one of the strongest systems that has proven useful for its compactness and efficacy in anti-spoofing [19, 30]. In this work, we use an LCNN based system with the transformed features obtained using the genuinization transformer.

Figure 3: Block diagram of the proposed feature genuinization based LCNN system.

Figure 3 shows the block diagram of the proposed feature genuinization based LCNN system. We consider the log power spectrum (LPS) of a given speech signal as the input feature to the genuinization transformer. It transforms the given input LPS to a genuinized feature, which is the input to the LCNN. During training, the training data and the corresponding label information are fed to the LCNN system. Once training is completed, the detection result for a given input can be obtained to identify spoofing attacks.

We use the Max-Feature-Map (MFM) activation function instead of the commonly used ReLU function for the LCNN system, similar to [19]. The main advantage of MFM is that it can learn compact features instead of sparse high-dimensional ones as ReLU does. Further, MFM uses the max function to suppress the activations of a small number of neurons, so that MFM based CNN models are light and robust. Therefore, it is applied to reduce the dimensionality of the output and obtain more discriminative feature maps.
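The MFM operation described above is compact enough to write down directly. A minimal NumPy sketch (the function name is ours): the feature maps are split into two halves along the channel axis and the elementwise maximum is kept, halving the number of channels.

```python
import numpy as np

def max_feature_map(x, axis=1):
    """Max-Feature-Map: split the feature maps into two halves along the
    channel axis and keep the elementwise maximum, halving the channels."""
    a, b = np.split(x, 2, axis=axis)
    return np.maximum(a, b)

# 4 channels with values 0..3 -> 2 channels: max(ch0, ch2), max(ch1, ch3)
x = np.arange(8.0).reshape(1, 4, 1, 2)
y = max_feature_map(x)
print(y.ravel())   # [4. 5. 6. 7.]
```

Because the output of each MFM unit is the max of two competing linear responses, gradients flow through exactly one of the two halves per position, which is what keeps the resulting CNN "light".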
4. Experiments
In this section, we discuss the database and experimental setupfor the studies.
We consider the ASVspoof 2019 logical access corpus for the studies of synthetic speech detection in this work [12, 31] (available at https://datashare.is.ed.ac.uk/handle/10283/3336 and http://dx.doi.org/10.7488/ds/1994). The corpus has three partitions: train, development and evaluation sets. The genuine examples of the ASVspoof 2019 corpus are part of the VCTK database, a standard corpus for speech synthesis. It contains data from 107 speakers, comprising 46 male and 61 female speakers. It is to be noted that there is no overlap of speakers across the different subsets. The synthetic speech attacks for the development set are created with two VC and four TTS state-of-the-art methods, whereas the spoofed examples of the evaluation set are derived from unseen methods.

Table 1: Summary of ASVspoof 2019 logical access corpus.

Subset       | #Male | #Female | #Genuine | #Spoofed
Train        | 8     | 12      | 2,580    | 22,800
Development  | 4     | 6       | 2,548    | 22,296
Evaluation   | 21    | 27      | 7,355    | 63,882
The ASVspoof 2019 challenge uses an ASV-centric metric, the tandem detection cost function (t-DCF), as the primary metric, and the equal error rate (EER) as a secondary metric for benchmarking the systems [31, 32]. We consider the scores of the ASV system provided along with the ASVspoof 2019 logical access corpus to combine with those from the spoofing countermeasure system for computation of the t-DCF measure. Table 1 presents a summary of the ASVspoof 2019 logical access corpus.
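While t-DCF requires the ASV operating point and cost parameters from the evaluation plan, the EER is easy to illustrate. A minimal sketch (not the official scoring tool): sweep a threshold over the countermeasure scores and take the point where the false acceptance and false rejection rates meet.

```python
import numpy as np

def compute_eer(genuine_scores, spoof_scores):
    """Equal error rate: sweep a threshold over all observed scores and
    return the operating point where FAR and FRR are closest (minimal sketch)."""
    thresholds = np.sort(np.concatenate([genuine_scores, spoof_scores]))
    eer, best_gap = 1.0, np.inf
    for t in thresholds:
        far = np.mean(spoof_scores >= t)    # spoofed accepted as genuine
        frr = np.mean(genuine_scores < t)   # genuine rejected
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer

gen = np.array([0.9, 0.8, 0.7, 0.6])   # higher score = more genuine-like
spf = np.array([0.5, 0.4, 0.3, 0.2])
print(compute_eer(gen, spf))           # 0.0 for perfectly separated scores
```

The official results in this paper use the organizers' evaluation scripts; this sketch only conveys what the EER numbers in Tables 2-4 measure.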
The long-term CQT based features are found to capture useful artifacts for spoofing attack detection [33]. Therefore, we use LPS derived from CQT as the input feature for the studies. The parameters for CQT computation are set following those in [15]. The number of octaves and the number of frequency bins per octave are set to 9 and 96, respectively, and the static dimension of the LPS is 863. For LPS extraction from CQT, the length of every file is set to 256 frames by either padding or cropping. In particular, examples longer than 256 frames are truncated, while examples shorter than 256 frames are filled with the last frame value. Thus, we have an input feature of 863 × 256 for every example.

During training of the LCNN system, an additional batch normalization step is used after the max pooling layer to increase stability and convergence speed. As such models are prone to overfitting, we use dropout and weight decay to avoid this issue. Dropout is used for the fully connected layers with a ratio of 0.4, and weight decay is also applied. In addition, parameters such as the number of layers and nodes are optimized on the development set. The proposed feature genuinization based LCNN system is implemented using the PyTorch toolkit (https://pytorch.org).
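The fixed-length scheme described above (truncate long utterances, repeat the last frame for short ones) can be sketched as follows; the function name and the use of NumPy are our choices for illustration.

```python
import numpy as np

TARGET_FRAMES = 256  # fixed frame count used in the paper
LPS_BINS = 863       # static LPS dimension from the CQT setup above

def fix_length(lps, target=TARGET_FRAMES):
    """Truncate a (bins x frames) log power spectrogram to `target` frames,
    or pad it by repeating its last frame, as described in the text."""
    bins, frames = lps.shape
    if frames >= target:
        return lps[:, :target]
    pad = np.repeat(lps[:, -1:], target - frames, axis=1)
    return np.concatenate([lps, pad], axis=1)

short = np.random.randn(LPS_BINS, 100)   # shorter than 256 frames -> padded
long_ = np.random.randn(LPS_BINS, 300)   # longer than 256 frames -> cropped
print(fix_length(short).shape, fix_length(long_).shape)  # (863, 256) twice
```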
5. Results and Discussion
The proposed system is a pipeline with feature genuinization followed by an LCNN. We compare the proposed system with an LCNN baseline without feature genuinization. Further, we also consider the two baseline spoofing countermeasures of the ASVspoof 2019 challenge, which are based on CQCC and LFCC features with a Gaussian mixture model (GMM) classifier [12, 31].

Table 2 shows the results of the proposed feature genuinization based LCNN system, which we refer to as FG-LCNN, on the ASVspoof 2019 logical access corpus, and its comparison to the baseline systems. We observe that introducing the feature genuinization module into the baseline LCNN system improves the detection of spoofing attacks. While the results on the development set are close, the improvement from the proposed system is evident from the results on the evaluation set, which contains more challenging spoofing attacks of unseen nature. This confirms our hypothesis of using a feature genuinization model that exploits the characteristics of genuine speech. Further, we find that the performance of the proposed system is much better than the two ASVspoof 2019 challenge baselines.

Table 2: Performance of the proposed feature genuinization based LCNN (FG-LCNN) and its comparison to baseline systems on the ASVspoof 2019 logical access corpus.
System          | Development Set |         | Evaluation Set |
                | t-DCF   | EER (%) | t-DCF  | EER (%)
Baseline: LCNN  | 0.002   | 0.080   | 0.111  | 4.448
FG-LCNN         | 0.000   | 0.002   | 0.102  | 4.070
ASVspoof 2019 baselines [12]:
CQCC-GMM        | 0.0123  | 0.43    | 0.2366 | 9.57
LFCC-GMM        | 0.0663  | 2.71    | 0.2116 | 8.09
Table 3:
Performance of the proposed feature genuinization based LCNN (FG-LCNN) and its comparison to the feature spoofing based LCNN (FS-LCNN) contrast system on the ASVspoof 2019 logical access corpus evaluation set.
System            | t-DCF | EER (%)
Baseline: LCNN    | 0.111 | 4.448
Proposed: FG-LCNN | 0.102 | 4.070
Contrast: FS-LCNN | 0.138 | 4.860

We further perform a justification experiment to validate our proposed method. The idea behind the feature genuinization process is based on the assumption that genuine speech examples are less varied than the synthetic speech attacks created using a wide range of methods. We perform a contrast experiment, where we learn a transformation model with a CNN by considering only the spoofed speech features. We refer to this process as feature spoofing and to the model as the spoofing transformer, analogous to our proposed method. This spoofing transformer is then used to transform any given feature of genuine or spoofed speech to another domain, which is then used in the LCNN system pipeline; we call this system FS-LCNN. The rest of the experimental setup remains the same as that of our proposed method.

Table 3 shows the performance comparison of the FS-LCNN contrast system with our proposed FG-LCNN and the baseline LCNN system. We consider the results on the evaluation set for the comparison, as the results on the development set already show very accurate detection of synthetic speech attacks. We find that the FS-LCNN contrast system does not perform better than our proposed FG-LCNN system, but rather degrades from the baseline LCNN system. This further strengthens our proposed idea of using the feature genuinization process with the LCNN system for the detection of spoofing attacks.

We are now interested in comparing the proposed system to various single-system results available on the ASVspoof 2019 logical access corpus. In this regard, we consider some of the well performing front-ends as well as back-ends that have shown their effectiveness for spoofing attack detection in the ASVspoof 2019 challenge. Some of those front-ends are zero time windowing cepstral coefficients (ZTWCC), single frequency filtering cepstral coefficients (SFFCC) and instantaneous frequency cepstral coefficients (IFCC), implemented with a GMM based classifier [34].
Further, deep learning based classifiers such as deep neural networks (DNN), ResNet and LCNN are used for the detection of spoofing attacks with front-ends like mel frequency cepstral coefficients (MFCC), constant-Q statistics-plus-principal information coefficients (CQSPIC), CQCC, LFCC, and LPS from the discrete Fourier transform (DFT) and fast Fourier transform (FFT) in the ASVspoof 2019 challenge [19, 35–37]. We report the respective system results from their published works for the comparison on the evaluation set of the ASVspoof 2019 logical access corpus.

Table 4: Performance comparison of the proposed feature genuinization based LCNN system to some known single systems on the ASVspoof 2019 logical access evaluation set.

System              | t-DCF | EER (%)
ZTWCC-GMM [34]      | 0.141 | 6.13
IFCC-GMM [34]       | 0.357 | 15.59
SFFCC-GMM [34]      | 0.323 | 13.97
CQCC-DNN [35]       | 0.308 | 12.79
LFCC-DNN [35]       | 0.234 | 9.65
MFCC-ResNet [36]    | 0.204 | 9.33
LPS-DFT-ResNet [36] | 0.274 | 9.68
CQCC-ResNet [36]    | 0.217 | 7.69
CQSPIC-DNN [35]     | 0.183 | 7.81
CQSPIC-GMM [35]     | 0.164 | 7.74
LFCC-LCNN [19]      | 0.100 | 5.06
LPS-FFT-LCNN [19]   | 0.103 | 4.53
Proposed: FG-LCNN   | 0.102 | 4.07

Table 4 reports the performance comparison of the proposed FG-LCNN system to some of the single systems reported in the ASVspoof 2019 challenge discussed above. It is observed that the LCNN based systems represent the best performing single systems, which justifies the use of the LCNN as the baseline in this work. Further, the effectiveness of the proposed feature genuinization is evident when using it with the LCNN system, which outperforms the other reported single systems in terms of EER on the ASVspoof 2019 logical access corpus.
6. Conclusion
This work proposes a novel feature genuinization based LCNN system for the detection of synthetic speech attacks. The characteristics of genuine speech are exploited to learn a model using a CNN. It keeps the distribution of transformed genuine features close to that of genuine speech, whereas it leads to a very different output for the features of spoofed speech, thereby maximizing their difference. The transformed features are then used with an LCNN system. The studies conducted on the ASVspoof 2019 logical access corpus show the effectiveness of the feature genuinization based LCNN system for detecting synthetic speech attacks. The comparison of the proposed system to various state-of-the-art spoofing countermeasures showcases it as one of the strong single anti-spoofing systems. Future work will focus on extending the studies to replay attack detection.
7. Acknowledgements
This research work is partially supported by Programmatic Grant No. A1687b0033 from the Singapore Government's Research, Innovation and Enterprise 2020 plan (Advanced Manufacturing and Engineering domain) and Human-Robot Interaction Phase 1 (Grant No. 192 25 00054) by the National Research Foundation, Prime Minister's Office, Singapore, under the National Robotics Programme. This work is also part of a collaboration with Kriston AI Lab, China in 2019.

8. References

[1] K. A. Lee, B. Ma, and H. Li, "Speaker verification makes its debut in smartphone," in SLTC Newsletter, February 2013.
[2] R. K. Das, S. Jelil, and S. R. M. Prasanna, "Development of multi-level speech based person authentication system," Journal of Signal Processing Systems, vol. 88, no. 3, pp. 259–271, Sep 2017.
[3] S. Jelil, A. Shrivastava, R. K. Das, S. R. M. Prasanna, and R. Sinha, "SpeechMarker: A voice based multi-level attendance application," in Interspeech 2019, 2019, pp. 3665–3666.
[4] Z. Wu and H. Li, "On the study of replay and voice conversion attacks to text-dependent speaker verification," Multimedia Tools and Applications, vol. 75, no. 9, pp. 5311–5327, May 2016.
[5] R. K. Das, X. Tian, T. Kinnunen, and H. Li, "The attacker's perspective on automatic speaker verification: An overview," in Interspeech 2020, 2020. [Online]. Available: https://arxiv.org/abs/2004.08849
[6] Z. Wu, N. Evans, T. Kinnunen, J. Yamagishi, F. Alegre, and H. Li, "Spoofing and countermeasures for speaker verification: A survey," Speech Communication, vol. 66, pp. 130–153, 2015.
[7] J. Lorenzo-Trueba, J. Yamagishi, T. Toda, D. Saito, F. Villavicencio, T. Kinnunen, and Z. Ling, "The voice conversion challenge 2018: Promoting development of parallel and nonparallel methods," in Odyssey 2018, 2018, pp. 195–202.
[8] T. Kinnunen, J. Lorenzo-Trueba, J. Yamagishi, T. Toda, D. Saito, F. Villavicencio, and Z. Ling, "A spoofing benchmark for the 2018 voice conversion challenge: Leveraging from spoofing countermeasures for speech artifact assessment," in Odyssey 2018, 2018, pp. 187–194.
[9] J. Lorenzo-Trueba, F. Fang, X. Wang, I. Echizen, J. Yamagishi, and T. Kinnunen, "Can we steal your vocal identity from the internet?: Initial investigation of cloning Obama's voice using GAN, WaveNet and low-quality found data," in Odyssey 2018, 2018, pp. 240–247.
[10] N. Evans, T. Kinnunen, and J. Yamagishi, "Spoofing and countermeasures for automatic speaker verification," in Interspeech 2013, 2013, pp. 925–929.
[11] Z. Wu, J. Yamagishi, T. Kinnunen, C. Hanilçi, M. Sahidullah, A. Sizov, N. Evans, M. Todisco, and H. Delgado, "ASVspoof: The automatic speaker verification spoofing and countermeasures challenge," IEEE Journal of Selected Topics in Signal Processing, vol. 11, no. 4, pp. 588–604, June 2017.
[12] M. Todisco, X. Wang, V. Vestman, M. Sahidullah, H. Delgado, A. Nautsch, J. Yamagishi, N. Evans, T. Kinnunen, and K. A. Lee, "ASVspoof 2019: Future horizons in spoofed and fake audio detection," in Interspeech 2019, 2019, pp. 1008–1012.
[13] T. B. Patel and H. A. Patil, "Combining evidences from mel cepstral, cochlear filter cepstral and instantaneous frequency features for detection of natural vs. spoofed speech," in Interspeech 2015, 2015, pp. 2062–2066.
[14] M. Sahidullah, T. Kinnunen, and C. Hanilçi, "A comparison of features for synthetic speech detection," in Interspeech 2015, 2015, pp. 2087–2091.
[15] M. Todisco, H. Delgado, and N. Evans, "A new feature for automatic speaker verification anti-spoofing: Constant Q cepstral coefficients," in Odyssey 2016, 2016, pp. 283–290.
[16] J. Yang, R. K. Das, and N. Zhou, "Extraction of octave spectra information for spoofing attack detection," IEEE/ACM Transactions on Audio, Speech and Language Processing, vol. 27, pp. 2373–2384, 2019.
[17] J. Yang, R. K. Das, and H. Li, "Significance of subband features for synthetic speech detection," IEEE Transactions on Information Forensics and Security, vol. 15, pp. 2160–2170, 2020.
[18] J. Yang and R. K. Das, "Long-term high frequency features for synthetic speech detection," Digital Signal Processing, vol. 97, p. 102622, 2020.
[19] G. Lavrentyeva, S. Novoselov, A. Tseren, M. Volkova, A. Gorlanov, and A. Kozlov, "STC antispoofing systems for the ASVspoof2019 challenge," in Interspeech 2019, Graz, Austria, 2019, pp. 1033–1037.
[20] Y. Yang, H. Wang, H. Dinkel, Z. Chen, S. Wang, Y. Qian, and K. Yu, "The SJTU robust anti-spoofing system for the ASVspoof 2019 challenge," in Interspeech 2019, 2019, pp. 1038–1042.
[21] C.-I. Lai, N. Chen, J. Villalba, and N. Dehak, "ASSERT: Anti-spoofing with squeeze-excitation and residual networks," in Interspeech 2019, Graz, Austria, 2019, pp. 1013–1017.
[22] J. Monteiro and J. Alam, "Development of voice spoofing detection systems for 2019 edition of automatic speaker verification and countermeasures challenge," in IEEE Automatic Speech Recognition and Understanding (ASRU) Workshop 2019, 2019, pp. 1003–1010.
[23] R. K. Das, J. Yang, and H. Li, "Assessing the scope of generalized countermeasures for anti-spoofing," in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) 2020, 2020, pp. 6589–6593.
[24] I. Lapidot and J.-F. Bonastre, "Effects of waveform PMF on anti-spoofing detection," in Interspeech 2019, 2019, pp. 2853–2857.
[25] C. Zhang, S. Ranjan, M. K. Nandwana, Q. Zhang, A. Misra, G. Liu, F. Kelly, and J. H. L. Hansen, "Joint information from nonlinear and linear features for spoofing detection: an i-vector based approach," IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 5035–5038, 2016.
[26] A. Radford, L. Metz, and S. Chintala, "Unsupervised representation learning with deep convolutional generative adversarial networks," CoRR, vol. abs/1511.06434, 2015. [Online]. Available: http://arxiv.org/abs/1511.06434
[27] J. Gehring, Y. Miao, F. Metze, and A. Waibel, "Extracting deep bottleneck features using stacked auto-encoders," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2013, 2013, pp. 3377–3381.
[28] J.-w. Jung, H.-j. Shim, H.-S. Heo, and H.-J. Yu, "Replay attack detection with complementary high-resolution information using end-to-end DNN for the ASVspoof 2019 challenge," in Interspeech 2019, 2019, pp. 1083–1087.
[29] A. Gomez-Alanis, A. M. Peinado, J. A. Gonzalez, and A. M. Gomez, "A light convolutional GRU-RNN deep feature extractor for ASV spoofing detection," in Interspeech 2019, 2019, pp. 1068–1072.
[30] X. Wu, R. He, Z. Sun, and T. Tan, "A light CNN for deep face representation with noisy labels," IEEE Transactions on Information Forensics and Security, vol. 13, no. 11, pp. 2884–2896, 2018.
[31] "ASVspoof 2019: Automatic speaker verification spoofing and countermeasures challenge evaluation plan," 2019.
[32] T. Kinnunen, K. A. Lee, H. Delgado, N. Evans, M. Todisco, M. Sahidullah, J. Yamagishi, and D. A. Reynolds, "t-DCF: a detection cost function for the tandem assessment of spoofing countermeasures and automatic speaker verification," in Odyssey 2018, 2018, pp. 312–319.
[33] R. K. Das, J. Yang, and H. Li, "Long range acoustic features for spoofed speech detection," in Interspeech 2019, 2019, pp. 1058–1062.
[34] K. N. R. K. R. Alluri and A. K. Vuppala, "IIIT-H spoofing countermeasures for automatic speaker verification spoofing and countermeasures challenge 2019," in Interspeech 2019, Graz, Austria, 2019, pp. 1043–1047.
[35] R. K. Das, J. Yang, and H. Li, "Long range acoustic and deep features perspective on ASVspoof 2019," in Automatic Speech Recognition and Understanding (ASRU) Workshop, 2019, pp. 1018–1025.
[36] M. Alzantot, Z. Wang, and M. B. Srivastava, "Deep residual neural networks for audio spoofing detection," in Interspeech 2019, Graz, Austria, 2019, pp. 1078–1082.
[37] J. Yang and R. K. Das, "Improving anti-spoofing with octave spectrum and short-term spectral statistics information,"