Light Convolutional Neural Network with Feature Genuinization for Detection of Synthetic Speech Attacks
Zhenzong Wu, Rohan Kumar Das*, Jichen Yang* and Haizhou Li
Department of Electrical and Computer Engineering, National University of Singapore, Singapore
Kriston AI Lab, China
[email protected], {rohankd, eleyji, haizhou.li}@nus.edu.sg

Abstract
Modern text-to-speech (TTS) and voice conversion (VC) systems produce natural sounding speech that questions the security of automatic speaker verification (ASV). This makes detection of such synthetic speech very important to safeguard ASV systems from unauthorized access. Most of the existing spoofing countermeasures perform well when the nature of the attacks is made known to the system during training. However, their performance degrades in the face of attacks of an unseen nature. In comparison to the synthetic speech created by a wide range of TTS and VC methods, genuine speech has a more consistent distribution. We believe that the difference between the distributions of synthetic and genuine speech is an important discriminative feature between the two classes. In this regard, we propose a novel method referred to as feature genuinization that learns a transformer based on a convolutional neural network (CNN) using the characteristics of only genuine speech. We then use this genuinization transformer with a light CNN (LCNN) classifier. The ASVspoof 2019 logical access corpus is used to evaluate the proposed method. The studies show that the proposed feature genuinization based LCNN system outperforms other state-of-the-art spoofing countermeasures, depicting its effectiveness for detection of synthetic speech attacks.
Index Terms: Feature genuinization, synthetic speech detection, ASVspoof 2019, logical access attacks
1. Introduction
In recent years, automatic speaker verification (ASV) systems have been deployed in different real-world applications [1–3]. These systems are exposed to spoofing attacks for unauthorized access; hence, detection of such attacks attracts much attention [4, 5]. Spoofing attacks are broadly classified into replay, impersonation, voice conversion (VC) and text-to-speech synthesis (TTS) attacks [6]. The latest VC and TTS systems can produce perceptually natural sounding speech, which poses a threat of fooling ASV systems [7–9]. Research on spoofing countermeasures has grown in the last decade since the inception of the ASVspoof challenge series. The challenge provides a platform for researchers across different domains to explore fake speech detection using a common benchmark corpus [10, 11]. Its recent edition, ASVspoof 2019, is devoted to the detection of both synthetic and replay speech with two subtasks [12]. The logical access track focuses on the detection of synthetic speech created using state-of-the-art VC and TTS systems, which is the focus of this paper.

*Corresponding Author

The explorations on spoofing attack detection cover two directions from the perspective of a detection task. Spoofing countermeasures focus either on novel front-end features or on effective classifiers. Some earlier studies focused on robust features such as cochlear filter cepstral coefficients and instantaneous frequency (CFCCIF) [13], linear frequency cepstral coefficients (LFCC), subband spectral flux coefficients and spectral centroid frequency coefficients [14]. Later, the long-term constant-Q transform (CQT) based constant-Q cepstral coefficients (CQCC) proved to be one of the strong front-ends for synthetic speech detection [15]. Recent explorations with features derived from CQT are also found to be effective for spoofing detection [16–18]. With the advent of deep learning methods, robust classifiers have been investigated for the detection of spoofing attacks.
Some of these include end-to-end systems with light convolutional neural networks (LCNN) [19, 20], and squeeze-excitation and residual networks [21, 22]. The end-to-end systems differ substantially from works that focus on novel features: the former are data-driven deep learning methods, while the latter emphasize hand-crafted features, which require prior knowledge. Further, we note that the same neural network based system can perform differently for a range of features [19]. Therefore, a robust spoofing countermeasure requires a strong feature extractor that captures the discriminative artifacts, along with an effective classifier.

Synthetic speech attacks can be created with a wide range of TTS and VC algorithms [6]. In general, spoofing countermeasures do not handle synthetic speech from unseen sources well because of a lack of generalization ability [23]. We note that genuine examples have comparatively lower variance than synthetic speech. We believe that the consistent characteristics of genuine speech set it apart from a variety of different synthetic speech. A recent study using temporal domain information shows that spoofing detection can be improved by modifying the probability mass function of spoofed speech to be close to that of genuine speech [24]. This process is termed genuinization, and it is found to be effective when applied to both train and test examples for synthetic speech detection.

In a similar direction, we hypothesize that, if we are able to derive a model that fits the distribution of genuine speech well, such a model will take genuine speech as input and generate output following the same distribution as the genuine speech. However, when the model takes spoofed speech as input, it will generate a very different output, which amplifies the difference from genuine speech.
With this hypothesis, we propose to derive a model from genuine speech features using a convolutional neural network (CNN), which is referred to as the genuinization transformer. Further, the process is referred to as feature genuinization, as a given feature representation is projected onto a domain learned using only the genuine features. The genuinization transformer is then used together with an LCNN system for the detection of synthetic speech attacks.

Figure 1: The block diagram of the feature genuinization process.

The rest of the paper is organized as follows. Section 2 introduces the details of the proposed feature genuinization. Section 3 describes the feature genuinization based LCNN system for the detection of spoofing attacks. The experiments and their results with discussion are reported in Section 4 and Section 5, respectively. Finally, the paper is concluded in Section 6.
2. Feature Genuinization
We aim to learn a transformer that does not change the characteristics of genuine speech features, whereas it projects spoofed speech to a different output, maximizing the difference between genuine and spoofed speech. Figure 1 shows the block diagram of the proposed feature genuinization process, which has two stages. The first stage focuses on training a feature genuinization transformer using characteristic features derived from only genuine speech. During the second stage, this trained feature genuinization transformer is used to convert any given features in a way that enhances the discrimination between genuine and spoofed speech.

CNN based architectures have shown their effectiveness in the field of anti-spoofing research [25]. In this regard, we use a CNN for training the genuinization transformer, as shown in Figure 1. The detailed architecture of the CNN used in this framework is given in Figure 2. The functionality of the proposed genuinization transformer is similar to that of an autoencoder. However, the output of the genuinization transformer is considered as the final transformed result. In addition, we apply fully convolutional layers, so there are no fully connected layers in the transformer. This forces the network to focus on the temporal correlation between the input signal and the whole transformation process. Further, it reduces the number of training parameters, which significantly shortens the training period.

A study in [26] shows that it is good practice to use strided convolution rather than pooling for downsampling, as it allows the network to learn its own pooling function. Therefore, we use this method during the training of the genuinization transformer. In addition, BatchNorm2d and leaky rectified linear unit (ReLU) activation functions are used in training because they can
promote healthy gradient flow, which is critical for the learning process.

Figure 2: The architecture of the genuinization transformer. The encoder stacks five strided Conv2d layers (4×4 kernels; 32, 64, 128, 256 and 512 channels; stride 2, with padding), each followed by LeakyReLU (0.2) and batch normalization. The decoder mirrors this with ConvTranspose2d layers (256, 128, 64 and 32 channels; 4×4 kernels; stride 2, with padding), each followed by ReLU and batch normalization.

The architecture of the proposed genuinization transformer shown in Figure 2 consists of two functionalities: encoding and decoding. During the encoding phase, the input signal is compressed through a number of strided convolutional layers, and the convolution result is passed through leaky ReLU. In the decoding phase, the encoding process is reversed by deconvolution, followed by ReLU. In this way, the transformer works as an autoencoder that learns the characteristics of genuine speech [27]. As a result, it amplifies the discrimination between genuine and spoofed speech in the transformed domain.

Once the genuinization transformer is trained, it can be used to transform any given genuine or spoofed features into a transformed domain that is learned using only the genuine feature characteristics. This novel way of transforming the features is referred to as feature genuinization, as mentioned earlier. Next, we discuss the LCNN system using feature genuinization for the detection of spoofing attacks.
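The encoder-decoder structure of Figure 2 can be sketched in PyTorch as follows. This is a minimal illustration, not the authors' released code: the exact padding, the final layer mapping back to one channel, and an even input height of 864 (instead of the 863-bin LPS, so that five stride-2 stages invert cleanly) are our assumptions.

```python
import torch
import torch.nn as nn

class GenuinizationTransformer(nn.Module):
    """Fully convolutional encoder-decoder sketch of the genuinization
    transformer (layer widths follow Figure 2; names are illustrative)."""
    def __init__(self):
        super().__init__()
        def down(cin, cout):
            # strided convolution replaces pooling, as advocated in the paper
            return nn.Sequential(
                nn.Conv2d(cin, cout, kernel_size=4, stride=2, padding=1),
                nn.BatchNorm2d(cout),
                nn.LeakyReLU(0.2),
            )
        def up(cin, cout):
            return nn.Sequential(
                nn.ConvTranspose2d(cin, cout, kernel_size=4, stride=2, padding=1),
                nn.BatchNorm2d(cout),
                nn.ReLU(),
            )
        self.encoder = nn.Sequential(down(1, 32), down(32, 64), down(64, 128),
                                     down(128, 256), down(256, 512))
        # assumed: a final transposed conv maps the 32 channels back to 1
        self.decoder = nn.Sequential(up(512, 256), up(256, 128), up(128, 64),
                                     up(64, 32),
                                     nn.ConvTranspose2d(32, 1, 4, 2, 1))

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = GenuinizationTransformer()
lps = torch.randn(2, 1, 864, 256)   # batch of log power spectrograms
out = model(lps)
print(out.shape)                    # spatial size matches the input
```

Training such a model only on genuine LPS with a reconstruction loss (e.g., mean squared error) would realize the autoencoder behavior described above; the specific loss is not stated in the text.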
3. LCNN with Feature Genuinization
Various deep learning systems have shown their effectiveness for spoofing attack detection [19, 21, 22, 28, 29]. Therefore, we use the proposed feature genuinization with a deep learning system. The LCNN is one of the strongest systems that has proven useful for its compactness and efficacy in anti-spoofing [19, 30]. In this work, we use an LCNN based system with the transformed features obtained using the genuinization transformer.

Figure 3: Block diagram of the proposed feature genuinization based LCNN system.

Figure 3 shows the block diagram of the proposed feature genuinization based LCNN system. We consider the log power spectrum (LPS) of a given speech signal as the input feature to the genuinization transformer. It transforms the given input LPS to a genuinized feature, which is the input to the LCNN. During training, the training data and the corresponding label information are fed to the LCNN system. Once training is completed, the detection result for a given input can be obtained to identify spoofing attacks.

We use the Max-Feature-Map (MFM) activation function instead of the commonly used ReLU function for the LCNN system, similar to [19]. The main advantage of MFM is that it can learn compact features instead of sparse high-dimensional ones as ReLU does. Further, MFM uses the max function to suppress the activations of a small number of neurons, so that MFM based CNN models are light and robust. Therefore, it is applied to reduce the dimensionality of the output and obtain more discriminative feature maps.
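The MFM operation described above is compact enough to write down directly. A minimal NumPy sketch (the function name is ours): the feature maps are split into two halves along the channel axis and the elementwise maximum is kept, halving the number of channels.

```python
import numpy as np

def max_feature_map(x, axis=1):
    """Max-Feature-Map: split the feature maps into two halves along the
    channel axis and keep the elementwise maximum, halving the channels."""
    a, b = np.split(x, 2, axis=axis)
    return np.maximum(a, b)

# 4 channels with values 0..3 -> 2 channels: max(ch0, ch2), max(ch1, ch3)
x = np.arange(8.0).reshape(1, 4, 1, 2)
y = max_feature_map(x)
print(y.ravel())   # [4. 5. 6. 7.]
```

Because the output of each MFM unit is the max of two competing linear responses, gradients flow through exactly one of the two halves per position, which is what keeps the resulting CNN "light".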
4. Experiments
In this section, we discuss the database and experimental setupfor the studies.
We consider the ASVspoof 2019 logical access corpus for the studies of synthetic speech detection in this work [12, 31] (available at https://datashare.is.ed.ac.uk/handle/10283/3336 and http://dx.doi.org/10.7488/ds/1994). The corpus has three partitions: train, development and evaluation sets. The genuine examples of the ASVspoof 2019 corpus are part of the VCTK database, a standard corpus for speech synthesis. It contains data from 107 speakers, comprising 46 male and 61 female speakers. It is to be noted that there is no overlap of speakers across the different subsets. The synthetic speech attacks for the development set are created with two VC and four TTS state-of-the-art methods, whereas the spoofed examples of the evaluation set are derived from unseen methods.

Table 1: Summary of ASVspoof 2019 logical access corpus.

Subset       | #Male | #Female | #Genuine | #Spoofed
Train        | 8     | 12      | 2,580    | 22,800
Development  | 4     | 6       | 2,548    | 22,296
Evaluation   | 21    | 27      | 7,355    | 63,882
The ASVspoof 2019 challenge uses an ASV-centric metric, the tandem detection cost function (t-DCF), as the primary metric, and the equal error rate (EER) as a secondary metric for benchmarking the systems [31, 32]. We consider the scores of the ASV system provided along with the ASVspoof 2019 logical access corpus to combine with those from the spoofing countermeasure system for computation of the t-DCF measure. Table 1 presents a summary of the ASVspoof 2019 logical access corpus.
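While t-DCF requires the ASV operating point and cost parameters from the evaluation plan, the EER is easy to illustrate. A minimal sketch (not the official scoring tool): sweep a threshold over the countermeasure scores and take the point where the false acceptance and false rejection rates meet.

```python
import numpy as np

def compute_eer(genuine_scores, spoof_scores):
    """Equal error rate: sweep a threshold over all observed scores and
    return the operating point where FAR and FRR are closest (minimal sketch)."""
    thresholds = np.sort(np.concatenate([genuine_scores, spoof_scores]))
    eer, best_gap = 1.0, np.inf
    for t in thresholds:
        far = np.mean(spoof_scores >= t)    # spoofed accepted as genuine
        frr = np.mean(genuine_scores < t)   # genuine rejected
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer

gen = np.array([0.9, 0.8, 0.7, 0.6])   # higher score = more genuine-like
spf = np.array([0.5, 0.4, 0.3, 0.2])
print(compute_eer(gen, spf))           # 0.0 for perfectly separated scores
```

The official results in this paper use the organizers' evaluation scripts; this sketch only conveys what the EER numbers in Tables 2-4 measure.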
The long-term CQT based features are found to capture useful artifacts for spoofing attack detection [33]. Therefore, we use LPS derived from CQT as the input feature for the studies. The parameters for CQT computation are set following those in [15]. The number of octaves and the number of frequency bins per octave are set to 9 and 96, respectively, and the static dimension of the LPS is 863. For LPS extraction from CQT, the length of every file is set to 256 frames by either padding or cropping. In particular, examples longer than 256 frames are truncated, while examples shorter than 256 frames are filled with the last frame value. Thus, we have an input feature of 863 × 256 for every example.

During training of the LCNN system, an additional batch normalization step is used after the max pooling layer to increase stability and convergence speed. As such models are prone to overfitting, we use dropout and weight decay to avoid this issue. Dropout is used for the fully connected layers with a ratio of 0.4, and weight decay is also applied. In addition, parameters such as the number of layers and nodes are optimized on the development set. The proposed feature genuinization based LCNN system is implemented using the PyTorch toolkit (https://pytorch.org).
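The fixed-length scheme described above (truncate long utterances, repeat the last frame for short ones) can be sketched as follows; the function name and the use of NumPy are our choices for illustration.

```python
import numpy as np

TARGET_FRAMES = 256  # fixed frame count used in the paper
LPS_BINS = 863       # static LPS dimension from the CQT setup above

def fix_length(lps, target=TARGET_FRAMES):
    """Truncate a (bins x frames) log power spectrogram to `target` frames,
    or pad it by repeating its last frame, as described in the text."""
    bins, frames = lps.shape
    if frames >= target:
        return lps[:, :target]
    pad = np.repeat(lps[:, -1:], target - frames, axis=1)
    return np.concatenate([lps, pad], axis=1)

short = np.random.randn(LPS_BINS, 100)   # shorter than 256 frames -> padded
long_ = np.random.randn(LPS_BINS, 300)   # longer than 256 frames -> cropped
print(fix_length(short).shape, fix_length(long_).shape)  # (863, 256) twice
```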
5. Results and Discussion
The proposed system is a pipeline with feature genuinization followed by an LCNN. We compare the proposed system with an LCNN baseline without feature genuinization. Further, we also consider the two baseline spoofing countermeasures of the ASVspoof 2019 challenge, which are based on CQCC and LFCC features with a Gaussian mixture model (GMM) classifier [12, 31].

Table 2 shows the results of the proposed feature genuinization based LCNN system, which we refer to as FG-LCNN, on the ASVspoof 2019 logical access corpus, and its comparison to the baseline systems. We observe that introducing the feature genuinization module into the baseline LCNN system improves the detection of spoofing attacks. While the results on the development set are close, the improvement from the proposed system is evident from the results on the evaluation set, which contains more challenging spoofing attacks of unseen nature. This confirms our hypothesis of using a feature genuinization model that exploits the characteristics of genuine speech. Further, we find that the performance of the proposed system is much better than the two ASVspoof 2019 challenge baselines.

Table 2: Performance of the proposed feature genuinization based LCNN (FG-LCNN) and its comparison to baseline systems on the ASVspoof 2019 logical access corpus.
System          | Development Set |         | Evaluation Set |
                | t-DCF   | EER (%) | t-DCF  | EER (%)
Baseline: LCNN  | 0.002   | 0.080   | 0.111  | 4.448
FG-LCNN         | 0.000   | 0.002   | 0.102  | 4.070
ASVspoof 2019 baselines [12]:
CQCC-GMM        | 0.0123  | 0.43    | 0.2366 | 9.57
LFCC-GMM        | 0.0663  | 2.71    | 0.2116 | 8.09
Table 3:
Performance of the proposed feature genuinization based LCNN (FG-LCNN) and its comparison to the feature spoofing based LCNN (FS-LCNN) contrast system on the ASVspoof 2019 logical access corpus evaluation set.
System            | t-DCF | EER (%)
Baseline: LCNN    | 0.111 | 4.448
Proposed: FG-LCNN | 0.102 | 4.070
Contrast: FS-LCNN | 0.138 | 4.860

We further perform a justification experiment to validate our proposed method. The idea behind the feature genuinization process is based on the assumption that genuine speech examples are less varied than the synthetic speech attacks created using a wide range of methods. We perform a contrast experiment, where we learn a transformation model with a CNN by considering only the spoofed speech features. We refer to this process as feature spoofing and to the model as the spoofing transformer, analogous to our proposed method. This spoofing transformer is then used to transform any given feature of genuine or spoofed speech to another domain, which is then used in the LCNN system pipeline; we call this system FS-LCNN. The rest of the experimental setup remains the same as that of our proposed method.

Table 3 shows the performance comparison of the FS-LCNN contrast system with our proposed FG-LCNN and the baseline LCNN system. We consider the results on the evaluation set for the comparison, as the results on the development set already show very accurate detection of synthetic speech attacks. We find that the FS-LCNN contrast system does not perform better than our proposed FG-LCNN system, but rather degrades from the baseline LCNN system. This further strengthens our proposed idea of using the feature genuinization process with the LCNN system for the detection of spoofing attacks.

We are now interested in comparing the proposed system to various single-system results available on the ASVspoof 2019 logical access corpus. In this regard, we consider some of the well performing front-ends as well as back-ends that have shown their effectiveness for spoofing attack detection in the ASVspoof 2019 challenge. Some of those front-ends are zero time windowing cepstral coefficients (ZTWCC), single frequency filtering cepstral coefficients (SFFCC) and instantaneous frequency cepstral coefficients (IFCC), implemented with a GMM based classifier [34].
Further, deep learning based classifiers such as deep neural networks (DNN), ResNet and LCNN are used for the detection of spoofing attacks with front-ends like mel frequency cepstral coefficients (MFCC), constant-Q statistics-plus-principal information coefficients (CQSPIC), CQCC, LFCC, and LPS from the discrete Fourier transform (DFT) and fast Fourier transform (FFT) in the ASVspoof 2019 challenge [19, 35–37]. We report the respective system results from their published works for the comparison on the evaluation set of the ASVspoof 2019 logical access corpus.

Table 4: Performance comparison of the proposed feature genuinization based LCNN system to some known single systems on the ASVspoof 2019 logical access evaluation set.

System              | t-DCF | EER (%)
ZTWCC-GMM [34]      | 0.141 | 6.13
IFCC-GMM [34]       | 0.357 | 15.59
SFFCC-GMM [34]      | 0.323 | 13.97
CQCC-DNN [35]       | 0.308 | 12.79
LFCC-DNN [35]       | 0.234 | 9.65
MFCC-ResNet [36]    | 0.204 | 9.33
LPS-DFT-ResNet [36] | 0.274 | 9.68
CQCC-ResNet [36]    | 0.217 | 7.69
CQSPIC-DNN [35]     | 0.183 | 7.81
CQSPIC-GMM [35]     | 0.164 | 7.74
LFCC-LCNN [19]      | 0.100 | 5.06
LPS-FFT-LCNN [19]   | 0.103 | 4.53
Proposed: FG-LCNN   | 0.102 | 4.07

Table 4 reports the performance comparison of the proposed FG-LCNN system to some of the single systems reported in the ASVspoof 2019 challenge discussed above. It is observed that the LCNN based systems represent the best performing single systems, which justifies the use of the LCNN as the baseline in this work. Further, the effectiveness of the proposed feature genuinization is evident when using it with the LCNN system, which outperforms the other reported single systems in terms of EER on the ASVspoof 2019 logical access corpus.
6. Conclusion
This work proposes a novel feature genuinization based LCNN system for the detection of synthetic speech attacks. The characteristics of genuine speech are exploited to learn a model using a CNN. It keeps the distribution of transformed genuine features close to that of genuine speech, whereas it leads to a very different output for the features of spoofed speech, thereby maximizing their difference. The transformed features are then used with an LCNN system. The studies conducted on the ASVspoof 2019 logical access corpus show the effectiveness of the feature genuinization based LCNN system for detecting synthetic speech attacks. The comparison of the proposed system to various state-of-the-art spoofing countermeasures showcases it as one of the strong single anti-spoofing systems. Future work will focus on extending the studies to replay attack detection.
7. Acknowledgements
This research work is partially supported by Programmatic Grant No. A1687b0033 from the Singapore Government's Research, Innovation and Enterprise 2020 plan (Advanced Manufacturing and Engineering domain) and Human-Robot Interaction Phase 1 (Grant No. 192 25 00054) by the National Research Foundation, Prime Minister's Office, Singapore, under the National Robotics Programme. This work is also part of a collaboration with Kriston AI Lab, China in 2019.

8. References

[1] K. A. Lee, B. Ma, and H. Li, "Speaker verification makes its debut in smartphone," in SLTC Newsletter, February 2013.
[2] R. K. Das, S. Jelil, and S. R. M. Prasanna, "Development of multi-level speech based person authentication system," Journal of Signal Processing Systems, vol. 88, no. 3, pp. 259–271, Sep 2017.
[3] S. Jelil, A. Shrivastava, R. K. Das, S. R. M. Prasanna, and R. Sinha, "SpeechMarker: A voice based multi-level attendance application," in Interspeech 2019, 2019, pp. 3665–3666.
[4] Z. Wu and H. Li, "On the study of replay and voice conversion attacks to text-dependent speaker verification," Multimedia Tools and Applications, vol. 75, no. 9, pp. 5311–5327, May 2016.
[5] R. K. Das, X. Tian, T. Kinnunen, and H. Li, "The attacker's perspective on automatic speaker verification: An overview," in Interspeech 2020, 2020. [Online]. Available: https://arxiv.org/abs/2004.08849
[6] Z. Wu, N. Evans, T. Kinnunen, J. Yamagishi, F. Alegre, and H. Li, "Spoofing and countermeasures for speaker verification: A survey," Speech Communication, vol. 66, pp. 130–153, 2015.
[7] J. Lorenzo-Trueba, J. Yamagishi, T. Toda, D. Saito, F. Villavicencio, T. Kinnunen, and Z. Ling, "The voice conversion challenge 2018: Promoting development of parallel and nonparallel methods," in Odyssey 2018, 2018, pp. 195–202.
[8] T. Kinnunen, J. Lorenzo-Trueba, J. Yamagishi, T. Toda, D. Saito, F. Villavicencio, and Z. Ling, "A spoofing benchmark for the 2018 voice conversion challenge: Leveraging from spoofing countermeasures for speech artifact assessment," in Odyssey 2018, 2018, pp. 187–194.
[9] J. Lorenzo-Trueba, F. Fang, X. Wang, I. Echizen, J. Yamagishi, and T. Kinnunen, "Can we steal your vocal identity from the internet?: Initial investigation of cloning Obama's voice using GAN, WaveNet and low-quality found data," in Odyssey 2018, 2018, pp. 240–247.
[10] N. Evans, T. Kinnunen, and J. Yamagishi, "Spoofing and countermeasures for automatic speaker verification," in Interspeech 2013, 2013, pp. 925–929.
[11] Z. Wu, J. Yamagishi, T. Kinnunen, C. Hanilçi, M. Sahidullah, A. Sizov, N. Evans, M. Todisco, and H. Delgado, "ASVspoof: The automatic speaker verification spoofing and countermeasures challenge," IEEE Journal of Selected Topics in Signal Processing, vol. 11, no. 4, pp. 588–604, June 2017.
[12] M. Todisco, X. Wang, V. Vestman, M. Sahidullah, H. Delgado, A. Nautsch, J. Yamagishi, N. Evans, T. Kinnunen, and K. A. Lee, "ASVspoof 2019: Future horizons in spoofed and fake audio detection," in Interspeech 2019, 2019, pp. 1008–1012.
[13] T. B. Patel and H. A. Patil, "Combining evidences from mel cepstral, cochlear filter cepstral and instantaneous frequency features for detection of natural vs. spoofed speech," in Interspeech 2015, 2015, pp. 2062–2066.
[14] M. Sahidullah, T. Kinnunen, and C. Hanilçi, "A comparison of features for synthetic speech detection," in Interspeech 2015, 2015, pp. 2087–2091.
[15] M. Todisco, H. Delgado, and N. Evans, "A new feature for automatic speaker verification anti-spoofing: Constant Q cepstral coefficients," in Odyssey 2016, 2016, pp. 283–290.
[16] J. Yang, R. K. Das, and N. Zhou, "Extraction of octave spectra information for spoofing attack detection," IEEE/ACM Transactions on Audio, Speech and Language Processing, vol. 27, pp. 2373–2384, 2019.
[17] J. Yang, R. K. Das, and H. Li, "Significance of subband features for synthetic speech detection," IEEE Transactions on Information Forensics and Security, vol. 15, pp. 2160–2170, 2020.
[18] J. Yang and R. K. Das, "Long-term high frequency features for synthetic speech detection," Digital Signal Processing, vol. 97, p. 102622, 2020.
[19] G. Lavrentyeva, S. Novoselov, A. Tseren, M. Volkova, A. Gorlanov, and A. Kozlov, "STC antispoofing systems for the ASVspoof2019 challenge," in Interspeech 2019, Graz, Austria, 2019, pp. 1033–1037.
[20] Y. Yang, H. Wang, H. Dinkel, Z. Chen, S. Wang, Y. Qian, and K. Yu, "The SJTU robust anti-spoofing system for the ASVspoof 2019 challenge," in Interspeech 2019, 2019, pp. 1038–1042.
[21] C.-I. Lai, N. Chen, J. Villalba, and N. Dehak, "ASSERT: Anti-spoofing with squeeze-excitation and residual networks," in Interspeech 2019, Graz, Austria, 2019, pp. 1013–1017.
[22] J. Monteiro and J. Alam, "Development of voice spoofing detection systems for 2019 edition of automatic speaker verification and countermeasures challenge," in IEEE Automatic Speech Recognition and Understanding (ASRU) Workshop 2019, 2019, pp. 1003–1010.
[23] R. K. Das, J. Yang, and H. Li, "Assessing the scope of generalized countermeasures for anti-spoofing," in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) 2020, 2020, pp. 6589–6593.
[24] I. Lapidot and J.-F. Bonastre, "Effects of waveform PMF on anti-spoofing detection," in Interspeech 2019, 2019, pp. 2853–2857.
[25] C. Zhang, S. Ranjan, M. K. Nandwana, Q. Zhang, A. Misra, G. Liu, F. Kelly, and J. H. L. Hansen, "Joint information from nonlinear and linear features for spoofing detection: an i-vector based approach," IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 5035–5038, 2016.
[26] A. Radford, L. Metz, and S. Chintala, "Unsupervised representation learning with deep convolutional generative adversarial networks," CoRR, vol. abs/1511.06434, 2015. [Online]. Available: http://arxiv.org/abs/1511.06434
[27] J. Gehring, Y. Miao, F. Metze, and A. Waibel, "Extracting deep bottleneck features using stacked auto-encoders," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2013, 2013, pp. 3377–3381.
[28] J.-w. Jung, H.-j. Shim, H.-S. Heo, and H.-J. Yu, "Replay attack detection with complementary high-resolution information using end-to-end DNN for the ASVspoof 2019 challenge," in Interspeech 2019, 2019, pp. 1083–1087.
[29] A. Gomez-Alanis, A. M. Peinado, J. A. Gonzalez, and A. M. Gomez, "A light convolutional GRU-RNN deep feature extractor for ASV spoofing detection," in Interspeech 2019, 2019, pp. 1068–1072.
[30] X. Wu, R. He, Z. Sun, and T. Tan, "A light CNN for deep face representation with noisy labels," IEEE Transactions on Information Forensics and Security, vol. 13, no. 11, pp. 2884–2896, 2018.
[31] "ASVspoof 2019: Automatic speaker verification spoofing and countermeasures challenge evaluation plan," 2019.
[32] T. Kinnunen, K. A. Lee, H. Delgado, N. Evans, M. Todisco, M. Sahidullah, J. Yamagishi, and D. A. Reynolds, "t-DCF: a detection cost function for the tandem assessment of spoofing countermeasures and automatic speaker verification," in Odyssey 2018, 2018, pp. 312–319.
[33] R. K. Das, J. Yang, and H. Li, "Long range acoustic features for spoofed speech detection," in Interspeech 2019, 2019, pp. 1058–1062.
[34] K. N. R. K. R. Alluri and A. K. Vuppala, "IIIT-H spoofing countermeasures for automatic speaker verification spoofing and countermeasures challenge 2019," in Interspeech 2019, Graz, Austria, 2019, pp. 1043–1047.
[35] R. K. Das, J. Yang, and H. Li, "Long range acoustic and deep features perspective on ASVspoof 2019," in Automatic Speech Recognition and Understanding (ASRU) Workshop, 2019, pp. 1018–1025.
[36] M. Alzantot, Z. Wang, and M. B. Srivastava, "Deep residual neural networks for audio spoofing detection," in Interspeech 2019, Graz, Austria, 2019, pp. 1078–1082.
[37] J. Yang and R. K. Das, "Improving anti-spoofing with octave spectrum and short-term spectral statistics information,"