Replay Spoofing Countermeasure Using Autoencoder and Siamese Network on ASVspoof 2019 Challenge
Mohammad Adiban, Hossein Sameti, Saeedreza Shehnepoor
Abstract—Automatic Speaker Verification (ASV) is the process of identifying a person based on the voice presented to a system. Different synthetic approaches allow spoofing to deceive ASV systems (ASVs), whether by imitating a voice or reconstructing its features. Attackers try to defeat ASVs using four general techniques: impersonation, speech synthesis, voice conversion, and replay. The last technique is considered a common and high-potential tool for spoofing, since replay attacks are more accessible and require no technical knowledge from adversaries. In this study, we introduce a novel replay spoofing countermeasure for ASVs. Accordingly, we use Constant Q Cepstral Coefficient (CQCC) features fed into an autoencoder to attain more informative features and to capture the noise information of spoofed utterances for discrimination purposes. Finally, different configurations of the Siamese network are used, for the first time in this context, for classification. Experiments performed on the ASVspoof 2019 challenge dataset, using Equal Error Rate (EER) and tandem Detection Cost Function (t-DCF) as evaluation metrics, show that the proposed system improves over the baseline by 10.73% and 0.2344 in terms of EER and t-DCF, respectively.
Index Terms—Spoof detection, Replay Attack, ASVspoof Challenge, CQCC, Autoencoder, Siamese Network
I. INTRODUCTION
Speech can be considered one of the most important means of communication for humans. Each individual has a unique voice pattern, identifiable as a signature. This pattern helps people identify the other end of a communication. Consequently, numerous low-cost technologies have been developed based on voice as a biometric feature for identity recognition [1], known as Automatic Speaker Verification systems (ASVs). An ASV captures different cues such as intonation and vocal tract shape in order to verify a person's identity [2]. On the other hand, techniques are available to synthesize a voice or its characteristics [3][4]. This provides a great opportunity for spoofing, where the attacker exploits a specific speaker's voice to deceive an ASV system. Besides, advances in channel and noise detection and the removal of their effects have made ASV applications suitable for market usage. The issue becomes severe when it comes to e-commerce and smartphone logical access scenarios [5]. These ASVs mostly use short-term spectral features, which are very vulnerable to spoofing attacks. Accordingly, the research effort mainly investigates four types of attacks to address this problem: impersonation, speech synthesis, voice conversion, and replay. In the first category, attackers deceive ASVs by mimicking the human voice or even altering it. Formant extraction ($F_1$, $F_2$, and $F_3$) is a regular feature-based approach to spot spoofing in this category [3][6][7]. The second category focuses on Text-to-Speech (TTS) approaches, where a system synthesizes a text to generate the voice. In the third category, the voice of a person is converted to that of the target speaker and then presented to ASVs for spoofing [8][9].

The authors are with the Department of Computer Engineering, Sharif University of Technology, Tehran, Iran (e-mail: [email protected]; [email protected]; [email protected]).
Finally, attackers make use of recorded voices from genuine speakers for spoofing, to form a replay attack. Several studies extract features from speech and classify spoofed/genuine utterances [10][11]. However, some problems still remain out of attention. As [12] mentioned, one important aspect to consider is determining which features or classifiers are better for discrimination. Furthermore, transforming features into a space with a more informative context for discrimination is also an interesting research direction that should be considered. Here, we employ Siamese networks for the first time as a classifier to increase the discrimination strength between the extracted features. Although Siamese networks have been used for spoofing detection tasks such as discriminative feature extraction [13], the performance of such networks as a classifier has been overlooked in ASVs. Our motivation for using such a method is owing to the fact that Siamese networks are applicable to tasks involving measuring similarity or determining relationships between two comparable subjects [14]. In such tasks, two identical subnetworks are usually used to process the two injected inputs, and another module takes the outputs of these two subnetworks and makes the final decision. Siamese networks can be fed two sets of features and, after taking the final output from the output layer, use a distance measure in order to label an input as a spoofed/non-spoofed sample. In this study, we also extract CQCCs as basic features from both genuine and spoofed samples. Since these features are sparse and raw, we feed them to autoencoders to obtain compressed and more discriminative features. In particular, our motivation for using an autoencoder in this context is that replayed speech suffers from channel or convolutional noise, since the recording process is performed by two microphones through one loudspeaker [15].
As a result, noise detection is another aspect of replay spoofing detection, as it is one of the most important factors in the successful adaptation of ASVs in the marketplace. For instance, in [16], a robust system is developed to improve Deep Neural Network (DNN) performance by using noisy data from channels. This type of system is called noise-aware, where noise is introduced to the system in order to help it detect spoofing attacks, especially replay ones. In this study, an autoencoder neural network is employed, due to its prior success in extracting the noise as a latent variable, given the same noisy input as the network's output [17][18]. We trained the autoencoder in order to consider the noise and extract an informative representation of the basic features, capable of representing both meaningful and sparse features related to the original samples. Moreover, in contrast with other feature reduction and selection methods, no feature is excluded; hence, for both clean and noisy speech, all features are preserved. Finally, in order to have a contextual representation of the features, we first feed them to a Convolutional Neural Network (CNN) and then share the weights of this CNN with a second, identical CNN; the two are then employed in a Siamese network. Siamese networks extract the similarity or a connection between two inputs and yield a significantly discriminative classification.

In the following, we first list the related studies in Section II, then define the models of our work in Section III. Subsequently, the proposed system is introduced in Section IV. Next, we give a brief explanation of the experimental setup in Section V. In Section VI, we discuss experimental results and, finally, draw conclusions and propose future work in Section VII.

II. RELATED WORKS
Different techniques have been proposed based on physiological and behavioral cues used for identification. Four types of spoofing attacks mostly threaten ASVs. Among them, replay attacks focus on using a pre-recorded speech from a genuine speaker in order to deceive ASVs. Speech can be recorded without the speaker's consent, or even generated by concatenating different parts of speech from the genuine speaker. High-tech devices designed for recording, including smartphones, laptops, and recorders, complicate the situation even more. As a result, replay can be the most potent threat to ASVs. The vulnerability of ASVs to replay attacks was first investigated by [10], where the authors reported a significant increase in Equal Error Rate (EER) and False Acceptance Rate (FAR) for ASVs. Other studies [11][19] demonstrated, using Joint Factor Analysis (JFA), that FAR increased for attacks in this category. In [20], a considerable increase in EER is reported for a text-independent system with a GMM-UBM classifier. As mentioned, various low-cost ASVs have saturated the market on the one hand and, on the other hand, the ease and availability of devices required for replay attacks have led to countermeasures being proposed for both sides [5]. Furthermore, a replay attack requires no special speech processing techniques [21] and, unlike other types of spoofing, can be mounted with less knowledge about ASVs. Hence, replay is more often engaged in spoofing attacks. Therefore, in this study, considering the importance of replay spoofing attacks, we focus on this type of attack and examine countermeasures against it. Among the first countermeasures for replay attacks, [22] employed a fixed pass-phrase on a text-dependent ASV. The detector compares stored past access attempts with new attempts and then makes the decision. Results demonstrated an improvement in EER while detecting and recognizing playback utterances.
In [23], the authors use the spectral ratio and modulation index in order to detect spoofing in far-field recordings, where noise increases in the recorded speech due to the distance. As a result of the long distance, the spectrum is flattened and the modulation index is reduced. For classification purposes, an SVM approach was used. Their results show that the FAR of text-independent JFA is reduced for the ASVs. Wang et al. [24] claim that licit recordings have just one specific type of channel noise, which is mixed with additional environmental noise when a replay attack is presented in the far field. A GMM-UBM classifier was employed in order to detect the replay attack, and the results show a significant decrease in EER. Ji et al. [20] captured CQCC features and fed them to a decision tree classifier; results showed an EER of 10.8% on the ASVspoof 2017 dataset. Adiban et al. [15] combined different features, including Mel-Frequency Cepstral Coefficients (MFCC), RASTA-PLP, Modified Group Delay, CQCC, i-vector, and Linear Prediction Cepstral Coefficients (LPCC), and fed them to different classifiers such as SVM, Multi-Layer Perceptron (MLP) neural networks, and GMMs on the ASVspoof 2017 dataset. The best performance was an EER of 10.31%. Shim et al. [25] investigated the problem of replay spoofing attack detection and noise classification using multi-task learning on playback devices, recording environments, and recording devices. Results showed a 30% improvement, from 13.57% to 9.56%, in terms of EER on the ASVspoof 2017 dataset. A Discrete Fourier Transform (DFT) was used in [26] for the ASVspoof 2017 dataset. Then, a normalization was applied to the features in the q-log domain. For feature reduction, Principal Component Analysis (PCA) was performed on the initial features. The best performance was an EER of 11.19%. An anti-spoofing task on the ASVspoof 2017 dataset was performed in [27], where a CNN was used as a deep learning approach with the Max-Feature-Map activation function.
The proposed approach yielded an EER of 6.73%. Long-term temporal envelopes were also extracted from sub-band signals using Frequency Domain Linear Prediction (FDLP) for feature extraction, with GMM and CNN used for classification, in [28]. The reported EER on the ASVspoof 2017 dataset was 9.70%. Sailor et al. [29] reported an EER of 8.89% on the ASVspoof 2017 dataset using ConvRBM-CC as features and GMMs for classification. In the work presented in [30], Inverted Mel-Frequency Cepstral Coefficients (IMFCC), short-time Fourier transform (STFT), group delay gram (GD gram), and joint gram are used as features. The extracted features are fed into a residual neural network (ResNet) classifier. The best results are obtained in score-level fusion on the ASVspoof 2019 dataset: 0.66% and 0.0168 in terms of EER and t-DCF, respectively. Li et al. [31], working on the ASVspoof 2019 dataset, utilize multiple spectral features within the network, such as MFCC, CQCC, and Fbank, as well as a butterfly unit (BU) for multi-task learning. Results showed 0.67% and 0.0148 in terms of EER and t-DCF, respectively. In [32], various features, including MFCC, CQCC, and STFT, are extracted and then fed into a ResNet classifier. Finally, at the fusion level, an EER of 0.28% and a t-DCF of 0.0074 are achieved. The study in [33] proposed a method for replay spoofing detection based on various long-range acoustic features and Deep Neural Networks as classifiers. Their best combined system obtained a t-DCF of 0.1381 and an EER of 5.95% for physical access on the ASVspoof 2019 dataset. The study in [34] is another work in which the proposed system is implemented using a Light CNN (LCNN) based on various acoustic features such as CQT, LFCC, and the Discrete Cosine Transform (DCT).
The best result reported in that study is obtained in score-level fusion: 0.54% and 0.0122 in terms of EER and t-DCF, respectively. In [35], the authors utilize different spectral features, including delta and acceleration MFCC (SDA MFCC), Inverted MFCC (IMFCC), CQCC, and sub-band centroid magnitude coefficient (SCMC) features, together with i-vectors, in order to detect spoofing using various shallow and deep classifiers such as GMM, SVM, CNN, Convolutional Recurrent Neural Network (CRNN), 1D-CNN, and Wave-U-Net, resulting in an EER of 5.43% and a t-DCF of 0.1465. Table I summarizes the replay attack countermeasures investigated in this section.

III. MODELS
A. Constant Q Cepstral Coefficient
In this work, we use CQCCs. These features utilize the Constant Q Transform (CQT) [37], which was originally presented for tasks related to music processing [38] and later employed in speaker verification tasks [36]. Unlike many cepstral features based on the prevalent Fourier Transform (FT), CQCCs employ the CQT, which uses geometrically spaced frequency bins [36]. The main difference between the FT and the CQT is that the FT uses a fixed time-frequency resolution, whereas the CQT maintains a constant Q factor, which leads to greater frequency resolution at lower frequencies and better time resolution at higher frequencies [39]. The Q factor is the ratio between the center frequency $f_k$ and the bandwidth $\delta_f$, defined as

$$Q = \frac{f_k}{\delta_f}. \quad (1)$$

Details about the CQT can be found in [39]. The CQCC extraction steps are as follows. Initially, for a given signal $x(n)$, the CQT is applied to obtain the spectrum. Then, the power spectrum and subsequently the logarithmic power spectrum are calculated. Eventually, in order to attain the cepstrum of $x(n)$, we apply the DCT as the inverse transformation of the logarithmic spectrum. The extraction steps of CQCC are shown in Fig. 1. In addition, a comparison of the CQT and STFT spectrograms for a speech signal is illustrated in Fig. 2.

Fig. 2. A comparison of CQT and STFT spectrograms.
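As a small illustration of Eq. (1), the following sketch (our own, not from the paper; the minimum frequency and bin counts are arbitrary toy values) shows that geometrically spaced center frequencies give the same Q factor for every bin:

```python
# Sketch: with geometrically spaced center frequencies
# f_k = f_min * 2^(k/B) (B bins per octave), the ratio
# Q = f_k / delta_f_k of Eq. (1) is identical for every bin k.
def cqt_center_frequencies(f_min, bins_per_octave, n_bins):
    return [f_min * 2.0 ** (k / bins_per_octave) for k in range(n_bins)]

def q_factor(f_k, f_next):
    # Bandwidth of bin k taken as the spacing to the next center frequency.
    return f_k / (f_next - f_k)

freqs = cqt_center_frequencies(f_min=32.7, bins_per_octave=24, n_bins=96)
qs = [q_factor(freqs[k], freqs[k + 1]) for k in range(len(freqs) - 1)]

# Constant Q: minimum and maximum Q over all bins coincide.
print(round(min(qs), 4), round(max(qs), 4))
```

In contrast, a linearly spaced FT grid would make `q_factor` grow with frequency, which is precisely the fixed time-frequency resolution the text contrasts with the CQT.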
B. Autoencoders
The autoencoder (AE) is a generative model that can be used to extract substantial information and features. Accordingly, an autoencoder is a neural network devised to learn an identity function in an unsupervised setting, reconstructing high-dimensional input data $\{x_1, x_2, \ldots, x_m\}$ as outputs $\{\hat{x}_1, \hat{x}_2, \ldots, \hat{x}_m\}$. As mentioned in Sec. I, the AE is capable of capturing latent variables hidden in the input samples. Furthermore, the AE extracts the most informative and independent features, compressing the significant information of the input samples [18]. Considering the sparse nature of noise, the AE can both capture the sparse features of the noisy input and map them to a feature space capable of discriminating noisy and clean samples. Accordingly, the classifier is able to discriminate both types of samples with high accuracy. The structure of the autoencoder is depicted in Fig. 3. According to this structure, the input features are reduced to compact ones, which can be considered a feature vector. As shown in Fig. 3, $L$ is a hidden layer. Eq. 2 defines the activation of unit $i$ in layer $l$:

$$a_i^{(l)} = f\left(\sum_{j=1}^{n} W_{ij}^{(l-1)} a_j^{(l-1)} + b_i^{(l-1)}\right), \quad (2)$$

where $W$ represents the weights and $b$ denotes the bias parameters. According to Fig. 3, the input layer is considered $a^{(1)} = x$, and in the output layer, $a^{(3)} = \hat{x}$. In addition, in the hidden layers, the sigmoid function is used as the activation function $f$, whereas the linear function is used in the output layer. The objective function is defined as:

$$J(W, b) = \frac{1}{m}\sum_{i=1}^{m}\left(\frac{1}{2}\left\|x_i - \hat{x}_i\right\|^2\right) + \lambda \sum_{l=1}^{n_l-1}\sum_{i=1}^{s_l}\sum_{j=1}^{s_{l+1}}\left(W_{ji}^{(l)}\right)^2, \quad (3)$$

where the parameters $\lambda$, $n_l$, and $s_l$ represent the strength of regularization, the number of layers in the network, and the number of units in layer $l$, respectively.
Fig. 1. Block diagram of CQCC feature extraction [36].

TABLE I
SUMMARY OF THE REPLAY ATTACK COUNTERMEASURE (CM) METHODS AND RESULTS

Study | CM Method | EER% / t-DCF
Z. Ji, et al. [20] | CQCC + GMM mean supervector + Gradient Boosting | 10.8 / -
M. Adiban, et al. [15] | MFCC, RASTA-PLP, CQCC, LPCC and i-vector + GMM, MLP and SVM | 10.31 / -
H.-J. Shim, et al. [25] | Noise detection + neural network and multi-task learning | 9.56-13.57 / -
M. J. Alam, et al. [26] | DFT-based features + feature normalization + PCA | 11.9 / -
G. Lavrentyeva, et al. [27] | Max-Feature-Map activation + CNN | 6.37 / -
B. Wickramasinghe, et al. [28] | FDLP (TC and RC) + GMM and CNN | 9.70 / -
H. B. Sailor, et al. [29] | ConvRBM-CC + GMM | 8.89 / -
W. Cai, et al. [30] | IMFCC, STFT, GD gram, joint gram + ResNet | 0.66 / 0.0168
R. Li, et al. [31] | MFCC, CQCC, Fbank + BU | 0.67 / 0.0148
M. Alzantot, et al. [32] | MFCC, CQCC, STFT + ResNet | 0.28 / 0.0074
R. K. Das, et al. [33] | Long-range acoustic features + DNN | 5.95 / 0.1381
G. Lavrentyeva, et al. [34] | CQT, LFCC and DCT + LCNN | 0.54 / 0.0122
B. Chettri, et al. [35] | Spectral features, i-vector + deep and shallow classifiers | 5.43 / 0.1465

In the training phase, the objective function is minimized with respect to the parameters $W$ and $b$.

Fig. 3. Diagram of the Autoencoder.
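Eqs. (2) and (3) can be made concrete with a small numerical sketch (our own illustration, not the paper's implementation; the layer sizes, weights, and data are toy values): a sigmoid hidden layer, a linear output layer, and the regularized reconstruction objective.

```python
import math, random

random.seed(0)

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

# Toy X-Y-X autoencoder: 4-dim input, 2-dim bottleneck, linear output.
n_in, n_hid = 4, 2
W1 = [[random.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hid)]
b1 = [0.0] * n_hid
W2 = [[random.uniform(-0.5, 0.5) for _ in range(n_hid)] for _ in range(n_in)]
b2 = [0.0] * n_in

def forward(x):
    # Hidden layer: Eq. (2) with f = sigmoid.
    h = [sigmoid(sum(W1[i][j] * x[j] for j in range(n_in)) + b1[i])
         for i in range(n_hid)]
    # Output layer: linear activation, reconstructing the input.
    return [sum(W2[i][j] * h[j] for j in range(n_hid)) + b2[i]
            for i in range(n_in)]

def objective(xs, lam=1e-3):
    # Eq. (3): mean squared reconstruction error plus L2 weight decay.
    recon = sum(0.5 * sum((xi - ri) ** 2 for xi, ri in zip(x, forward(x)))
                for x in xs) / len(xs)
    decay = lam * sum(w ** 2 for M in (W1, W2) for row in M for w in row)
    return recon + decay

data = [[random.uniform(0, 1) for _ in range(n_in)] for _ in range(8)]
print(objective(data))  # scalar loss, minimized w.r.t. W and b in training
```

In the paper's setting, the 2-dimensional hidden activations would play the role of the compact bottleneck features passed on to the classifier.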
C. Siamese Networks
Siamese networks are a type of neural network first used by [40][41] for different purposes. Most of the early work focused on their application in verification tasks such as face recognition, signature recognition, etc. A Siamese network takes a sample as input and maps it onto a new latent space, where similar samples lie at a shorter distance than dissimilar ones. The whole idea is thus to find a target space in which the semantic distance between inputs is captured. Consequently, Siamese networks can be very useful when the training data does not contain sufficient information for classification. Hence, Siamese networks are well suited to verification tasks. Spoofing detection is also a verification process deciding whether a voice is genuine or not. In this work, considering the fake nature of spoofed speech, a similarity measure can be engaged to compare the features extracted from the two inputs given to the Siamese network, with another neural layer making the final decision.

The architecture of this network is given in Fig. 4. It includes two identical deep neural networks with the same configuration in terms of weights, hyper-parameters, etc. Each of these neural networks is called a leg of the network, and both are identical. One network is trained on input samples, and a copy of the network with the same weights is used as the other leg. The network can be a simple Multi-Layer Perceptron (MLP) or another type of deep neural network such as a CNN or RNN. The outputs are usually combined by a combination function, and the result is fed to a fully connected network trained to produce a metric indicating whether the two samples are the same or different, based on the output of the combination function. The task is therefore to minimize a loss function based on the combination function. The final fully connected layer can be implemented with a sigmoid function or a single neuron.

IV. PROPOSED SYSTEM
In this section, we give details of the model we employ in each step of the whole framework. The proposed system is depicted in Fig. 5. In the first step, CQCC features with different dimensions are extracted from the input voice using the CQCC feature extractor. Thereupon, for encoding the features, we use an X × Y × X structure, where X is the dimension of the extracted features and Y is the bottleneck dimension. For training, we use fine-tuning back-propagation, so the parameters improve in each iteration. We then use a Convolutional Neural Network (CNN) similar to the one used in [42]. The CNN consists of 3 convolutional layers, 3 max-pooling layers, and 2 Fully Connected (FC) layers. The max-pooling operator has been shown to be sensitive to noisy input, which is an important property here, since noise is one of the key characteristics of a replay attack [43].

Fig. 4. Architecture of Siamese Networks.
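The two-leg arrangement of Fig. 4 can be sketched as follows (our own illustrative code, not the paper's implementation; the tiny leg network, its weights, and the merge function are toy choices). Both legs apply the SAME shared weights, their embeddings are merged by a combination function (absolute difference here), and a single sigmoid output neuron scores similarity:

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

# One shared set of leg weights: both legs are the same function.
LEG_W = [[0.4, -0.2, 0.1], [-0.3, 0.5, 0.2]]  # toy 3-dim -> 2-dim embedding

def leg(x):
    # Identical subnetwork applied to either input (weight sharing).
    return [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in LEG_W]

def siamese_score(x_original, x_test, out_w=(-4.0, -4.0), out_b=2.0):
    # Combination function: element-wise absolute difference of embeddings,
    # followed by a single sigmoid output neuron (the "Merge" + FC of Fig. 4).
    e1, e2 = leg(x_original), leg(x_test)
    merged = [abs(a - b) for a, b in zip(e1, e2)]
    return sigmoid(sum(w * m for w, m in zip(out_w, merged)) + out_b)

same = siamese_score([0.2, 0.7, 0.1], [0.2, 0.7, 0.1])   # identical inputs
diff = siamese_score([0.2, 0.7, 0.1], [0.9, -0.8, 0.6])  # different inputs
print(same > diff)  # identical inputs score higher similarity
```

In the proposed system, each leg is the CNN described above rather than this one-layer toy, but the weight sharing and distance-based decision are the same idea.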
Fig. 5. Diagram of our proposed Siamese network.

Two independent convolutional networks were used in every layer, with Max-Feature-Map (MFM) [44] as the activation function. For compression purposes, a max-pooling kernel of size 2 × 2 was used. For the FC layer, we used a softmax function to discriminate between spoofed and genuine samples. To improve performance, we employed a highway structure from [45] in the final layer, which regulates information transfer based on the idea that a neural network can act as a highway, where unimpeded information can flow over several layers without attenuation. We used ReLU as the affine transform function and sigmoid as the activation function:

$$y = H(x, W_H) \cdot T(x, W_T) + x \cdot (1 - T(x, W_T)). \quad (4)$$

In Eq. 4, $H$ and $T$ are the affine transform function and the activation (transform gate) function, respectively. The weight matrices $W_H$ and $W_T$ are updated during training, $x$ represents the input, and the dot operator (·) denotes element-wise multiplication. Finally, $y$ is the output of the highway function. The probability of a voice being spoofed or not is calculated by a softmax layer using the following equation:

$$P(c_i \mid z) = \frac{e^{z^{\top} w_i}}{\sum_{j=1}^{C} e^{z^{\top} w_j}}, \quad (5)$$

where $c_i$ is the $i$-th class, for which we try to determine whether sample $x$ is a member, $z$ is the given feature vector, and $C$ is the number of classes, which in our case is two: spoofed and genuine. One of the trained CNNs is fed with genuine fixed-length vectors obtained from the autoencoder, and the other is fed with the test sample. If they are similar, they get the same class; otherwise, they are placed in different classes. Following [46], we used dropout to prevent overfitting, with mini-batches of 200 samples. For the loss function we used cross-entropy:

$$CE = -\sum_{i=1}^{C'} \big[\, t_i \log(f(s_i)) + (1 - t_i)\log(1 - f(s_i)) \,\big], \quad (6)$$

where $t_i$ is the true label of class $i$, $f(s_i)$ is the predicted label for sample $s_i$, and $C'$ indicates the total number of classes in the dataset, which is 2 in our case. Finally, a simple threshold on the distance between the two feature vectors assigns the value 1 (spoofed) if it is more than 0.5, and 0 (genuine) otherwise.

V. EXPERIMENTAL SETUP
A. Dataset
The ASVspoof 2019 challenge [47] was introduced to expand the goals of the previous challenges, ASVspoof 2015 [48] and ASVspoof 2017 [49]. The main purpose of ASVspoof 2015 was to introduce countermeasure systems for detecting spoofed/non-spoofed speech, in which spoofed speech was produced by either text-to-speech (TTS) or voice conversion (VC) approaches. The ASVspoof 2017 challenge, in turn, guided studies toward countermeasure systems for detecting replay spoofing attacks. Subsequently, the ASVspoof 2019 challenge was presented to complete the objectives of the two previous challenges and provides two subsets: Logical Access (LA) and Physical Access (PA). In the PA scenario, the spoofing attacks are replay attacks, where an adversary tries to record a genuine speech and replay it in order to deceive the ASV system. PA includes three subsets: training, development, and evaluation sets. For training and development, a dataset was created using a combination of three room sizes, three levels of reverberation, and three different speaker-to-ASV microphone distances, for 27 configurations in total. The replay attack dataset comprises nine different configurations from three categories of distances and three different qualities. The evaluation dataset consists of 137,457 trials, including both replayed spoofed speech and bona fide speech, with different configurations and unique speakers. The statistics of each subset are summarized in Table II.
TABLE II
PHYSICAL ACCESS SCENARIO STATISTICS FOR THE ASVSPOOF 2019 DATABASE
B. Evaluation Metrics
The ASVspoof 2019 challenge is a binary classification task, in which utterances from real humans are labeled as the positive class and spoofing attacks are labeled as the negative class. The tandem Detection Cost Function (t-DCF) [50] was adopted by ASVspoof 2019 as the standard metric; it is based on detection theory and can be specified for the envisioned application. Isolation of the different systems (such as the CM and the ASV) is a key feature of tandem systems and the t-DCF. We also use the EER as an additional metric to measure our system's performance.

VI. EXPERIMENTAL EVALUATION
A. Baseline System
The organizers of the ASVspoof 2019 challenge introduced two systems as baselines [47] for participants. The baseline method is based on CQCC and Linear Frequency Cepstral Coefficient (LFCC) [51] features and GMM classifiers. Accordingly, CQCC and LFCC features are extracted from the training data, and two GMMs (one for bona fide and one for spoofed data), each with 512 components, are trained by EM iterations. In the next step, the score for each trial is computed using these GMMs. The results on the development and evaluation sets of the physical access scenario in terms of t-DCF and EER are presented in Table III.
TABLE III
t-DCF AND EER RESULTS FOR TWO BASELINE COUNTERMEASURES ON PHYSICAL ACCESS SCENARIOS FOR BOTH THE DEVELOPMENT SET AND EVALUATION SET [47][52]

Baseline System | Development Set (EER% / t-DCF) | Evaluation Set (EER% / t-DCF)
LFCC-GMM | 11.96 / 0.2554 | 13.54 / 0.3017
CQCC-GMM | 9.87 / 0.1953 | 11.04 / 0.2454
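The baseline scoring described above (one GMM per class, one score per trial) amounts to a log-likelihood ratio between the bona fide and spoofed models. A minimal sketch with toy one-dimensional GMMs (our own illustration; the actual baseline uses 512-component GMMs over CQCC/LFCC frame vectors):

```python
import math

def gmm_loglik(x, weights, means, variances):
    # Log-likelihood of a scalar observation under a 1-D GMM.
    p = sum(w * math.exp(-(x - m) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)
            for w, m, v in zip(weights, means, variances))
    return math.log(p)

# Toy models: bona fide frames centered near 0, spoofed frames near 2.
BONAFIDE = ([0.6, 0.4], [0.0, 0.5], [1.0, 0.8])
SPOOF    = ([0.5, 0.5], [2.0, 2.5], [1.0, 1.2])

def trial_score(frames):
    # Average per-frame log-likelihood ratio: positive -> bona fide.
    return sum(gmm_loglik(x, *BONAFIDE) - gmm_loglik(x, *SPOOF)
               for x in frames) / len(frames)

print(trial_score([0.1, -0.2, 0.3]) > 0)   # bona-fide-like trial
print(trial_score([2.1, 1.9, 2.4]) < 0)    # spoof-like trial
```

The EM training of the two GMMs is omitted here; only the scoring step that produces the per-trial score is sketched.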
B. Proposed System Configuration
As mentioned, our proposed system consists of three main parts: first, extracting CQCC features with different dimensions from the input speech; second, feature dimensionality reduction to different dimensions using the autoencoder; and finally, using the Siamese network as a classifier. This process includes training and evaluation phases:
1) Training Phase:
To train our system, we extract CQCC features from the training data (for both bona fide and spoofed data). We then train an autoencoder with different bottleneck dimensions on the extracted features, separately for each Siamese network. Accordingly, we take into account two objectives when training the autoencoder. First, we try to reduce the data dispersion. More importantly, the autoencoder is trained to account for the noise in spoofed speech, yielding more valuable data for spoofing detection. As the final step, we train the Siamese network (comprising two CNNs) as our classifier. In order to train these CNNs, we first divide our dataset into two balanced parts (one part for the first CNN and the other for the second). Afterwards, we randomly select data from each part without replacement and present each selected sample, with its label, to the network assigned to that part. Finally, the CNNs learn whether two input samples are from the same class or not, and their shared weights are updated accordingly. Different configurations of the CNNs, denoted "Siamese Network Configs", were used to observe the impact of various filters with different sizes on performance. Considering the effects of using different feature extraction methods, we used three different CNN configurations, running our systems on the development data to choose them and tune their parameters. The first configuration, denoted Config. 1, contains three convolutional and average-pooling layers and 1 hidden layer, holding 160, 200, and 100 filters, respectively. In the hidden layer, we have 300 nodes and a pooling width of 3. The second configuration (Config. 2) has the same architecture, with the number of hidden nodes changed to 500. In the third configuration (Config. 3), we replaced average-pooling with max-pooling to observe the impact of the max filter instead of the average one.
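The three configurations just described differ only in two hyper-parameters, which can be summarized programmatically (a sketch of the stated settings; the field names are our own shorthand, not from the paper's code):

```python
# Hyper-parameters of the three Siamese CNN configurations described above
# (field names are our own shorthand, not from the paper's code).
BASE = {
    "conv_filters": (160, 200, 100),  # three convolutional layers
    "pooling": "average",
    "pooling_width": 3,
    "hidden_nodes": 300,
}

CONFIGS = {
    "config1": dict(BASE),                     # as described for Config. 1
    "config2": dict(BASE, hidden_nodes=500),   # wider hidden layer
    "config3": dict(BASE, pooling="max"),      # max- instead of average-pooling
}

for name, cfg in sorted(CONFIGS.items()):
    print(name, cfg["conv_filters"], cfg["pooling"], cfg["hidden_nodes"])
```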
2) Evaluation Phase:
In the evaluation phase, similar to the training phase, we extract CQCC features of different dimensions from the evaluation set. We then use the autoencoder to reduce the dimensions of the extracted features. At the classification level, we feed the evaluation data to one of the CNNs trained in the previous step, while the input of the other CNN is a fixed bona fide speech sample randomly selected from the training set. So, for each evaluation utterance, we check whether the input data matches the fixed bona fide speech: if it matches, we consider it bona fide; otherwise, it is considered spoofed. Finally, we repeat this cycle with 100 fixed bona fide samples, randomly selected from the training set without regard to gender or speaker, and take the majority vote. As mentioned for the training phase, it should be noted that we used the development trials to tune the parameters of our system.
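The repeated comparison and majority vote can be sketched as follows (illustrative only; `compare` stands in for the trained Siamese network, and the scalar "utterances" and reference set are toy values):

```python
import random

random.seed(1)

def majority_vote(test_utt, references, compare):
    # Compare the test utterance against each fixed bona fide reference;
    # each comparison votes "bona fide" (True) or "spoof" (False).
    votes = [compare(ref, test_utt) for ref in references]
    return sum(votes) > len(votes) / 2  # True -> classified bona fide

# Stand-in comparator: utterances are toy scalars; "match" = close together.
compare = lambda ref, test: abs(ref - test) < 1.0

# 100 fixed bona fide references drawn at random (toy values near 0).
references = [random.gauss(0.0, 0.5) for _ in range(100)]

print(majority_vote(0.2, references, compare))   # bona-fide-like input
print(majority_vote(5.0, references, compare))   # spoof-like input
```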
C. Siamese Networks Results and Analysis
Results obtained from different classifier setups are investigated in this section. Table IV shows the EER and t-DCF achieved with different dimensions of CQCC and different configurations of our classifier, without any dimensionality reduction, on the development and evaluation sets. As shown in Table IV, the best results in each classifier configuration belong to the 90-dimensional CQCC. Among these, the best result is achieved by Siamese Config. 3. This result improves on the baseline system by 10.42% and 0.2344 in terms of EER and t-DCF, respectively.
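The EER values reported in Table IV correspond to the operating point where the false acceptance rate equals the false rejection rate. A minimal sketch of computing EER from score lists (our own code; the scores are toy values, with higher meaning more bona-fide-like):

```python
def eer(bonafide_scores, spoof_scores):
    # Sweep thresholds over all observed scores; EER is where FAR ~= FRR.
    candidates = sorted(bonafide_scores + spoof_scores)
    best = (2.0, None)  # (|FAR - FRR|, EER estimate)
    for th in candidates:
        far = sum(s >= th for s in spoof_scores) / len(spoof_scores)
        frr = sum(s < th for s in bonafide_scores) / len(bonafide_scores)
        gap = abs(far - frr)
        if gap < best[0]:
            best = (gap, (far + frr) / 2)
    return best[1]

# Toy scores for ten trials.
bona = [0.9, 0.8, 0.75, 0.6, 0.55]
spoof = [0.7, 0.5, 0.4, 0.3, 0.2]
print(eer(bona, spoof))  # -> 0.2
```

With one overlapping spoof score (0.7 above the lowest bona fide scores), the rates cross at 20%, so the sketch reports an EER of 0.2.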
Discussion:
As explained in the introduction, using descriptive CQCCs of higher feature dimension can be effective in the spoofing detection task, because they give better resolution at different frequencies. This leads to a detailed description of both noisy and ordinary frequencies. On the other hand, increasing the dimension of the features may introduce redundancy. Thus, we look for a balance between higher resolution and lower redundancy. Experimental results show that 90-dimensional CQCC achieves this equilibrium. In addition, the results for Config. 3 are clearly better than those of the other two configurations. The key point is that it has fewer hidden nodes than the second configuration, so we can conclude that increasing the number of hidden nodes helps only up to a point, after which performance decays. Furthermore, the results show that a max-pooling layer can be more helpful than average-pooling. This again comes back to the noisy nature of the task: noise effects can be smoothed away by average-pooling [43], while max-pooling tends to select the noisier values at the pooling layer, preserving the cue that distinguishes replayed speech.
TABLE IV
CQCC dimension effects on the performance of the proposed system with different CNN configurations, noted as "Siamese Configs."

                    Config. 1        Config. 2        Config. 3
Set    CQCC Dims.   EER%   t-DCF    EER%   t-DCF    EER%   t-DCF
Dev.   30           2.17   0.0411   2.35   0.0437   1.96   0.0391
       60           1.29   0.0240   1.84   0.0319   1.65   0.0324
       90           -      -        -      -        -      -
       120          1.46   0.0247   1.93   0.0386   1.24   0.0207
Eval.  30           6.80   0.1550   7.12   0.1649   5.66   0.1371
       60           4.19   0.1072   5.57   0.1323   3.78   0.0883
       90           -      -        -      -        -      -
       120          4.73   0.1251   6.29   0.1528   4.11   0.0854
D. Autoencoder Performance
Fig. 6 and Fig. 7 show the effect of using the autoencoder in our proposed system. Here, the autoencoder reduces the dimensionality of the 90-dimensional CQCC. As shown in Figs. 6 and 7, the best results are obtained with a 70-dimensional autoencoder bottleneck for each Siamese configuration. As before, the best results belong to Siamese Config. 3, which outperforms the baseline system by 10.72% and 0.2344 in terms of EER and t-DCF, respectively.
Discussion:
Figs. 6 and 7 show consistent behavior across the three configurations analyzed in the previous discussion, but they also show that the proposed approach performs better with larger bottleneck sizes. The purpose of the autoencoder is to account for the impact of noise, but attaching too much importance to the noise effect means discarding valuable information, which increases the EER. There is therefore a trade-off: the more the noise is emphasized, the more information is lost, and vice versa. Consequently, there is an optimum bottleneck size at which most of the noise is captured while the informative values are still preserved, which is 70 in this case.
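As a rough sketch of this dimensionality reduction step, a minimal linear autoencoder with the paper's 90-to-70 bottleneck can be written in plain NumPy; the single linear layer per side, the initialization, and the learning rate are assumptions for illustration, not the paper's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

D_IN, D_BOTTLENECK = 90, 70   # CQCC dimension and bottleneck size from the paper

# randomly initialized encoder/decoder weights (one linear layer each)
W_enc = rng.normal(0.0, 0.05, (D_IN, D_BOTTLENECK))
W_dec = rng.normal(0.0, 0.05, (D_BOTTLENECK, D_IN))

def encode(x):
    return x @ W_enc          # 90-dim features -> 70-dim bottleneck

def decode(z):
    return z @ W_dec          # 70-dim bottleneck -> 90-dim reconstruction

def train_step(X, lr=0.01):
    """One gradient-descent step on the squared reconstruction error."""
    global W_enc, W_dec
    Z = encode(X)
    err = decode(Z) - X                      # reconstruction residual
    grad_dec = Z.T @ err / len(X)
    grad_enc = X.T @ (err @ W_dec.T) / len(X)
    W_dec -= lr * grad_dec
    W_enc -= lr * grad_enc
    return float(np.mean(err ** 2))
```

After training, `encode` supplies the 70-dimensional features passed on to the Siamese classifier.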
Fig. 6. Effects of dimension reduction by the autoencoder on EER for the evaluation set with CQCC vector size of 90 (Siamese Configs. 1-3; autoencoder bottleneck dimensions 20-80).
Table V summarizes the best results obtained by the different configurations of the proposed system and by the baseline systems.
Fig. 7. Effects of dimension reduction by the autoencoder on t-DCF for the evaluation set with CQCC vector size of 90 (Siamese Configs. 1-3; autoencoder bottleneck dimensions 20-80).

TABLE V
Comparison of the best results obtained by different Siamese network configurations with the baseline systems (CQCC vector size = 90, autoencoder bottleneck size = 70).

                                      Dev.             Eval.
System                                EER%   t-DCF     EER%   t-DCF
Baseline 1 (LFCC + GMM)               11.96  0.2554    13.54  0.3017
Baseline 2 (CQCC + GMM)               -      -         -      -
Siamese   Conf. 1  CQCC               1.98   0.1529    3.27   0.0745
Network            CQCC + AE          -      -         -      -
Configs.  Conf. 2  CQCC               2.40   0.0448    4.13   0.1086
                   CQCC + AE          -      -         -      -
          Conf. 3  CQCC               1.77   0.0212    3.02   0.0627
                   CQCC + AE          -      -         -      -
E. Performance with Different Training Data Sizes
To measure the performance of the proposed system with different amounts of training data, we trained the system on different sizes of the training set. We randomly divided the training set into five equal folds, without considering gender or number of speakers, and enlarged the training set fold by fold at each step. For each step, the system was trained exactly as in the training phase. Fig. 8 and Fig. 9 show the effects of the different training set sizes. As expected, the results improved as the volume of training data increased. It is noteworthy that the proposed system achieved results comparable to the baseline system on the evaluation set using only 60% of the training data, and with 80% of the data it outperforms the baseline system.
Discussion:
The results in this part show better outcomes with more data, which is expected: more training data helps the classifier model the data distribution, leading to better predictions on test samples. However, using only 60% of the data as the training set already yields results comparable to the baseline system, and 80% of the data gives a significant improvement over the baseline. Another important point is that the proposed system achieves impressive results with less training data, making it more robust to missing data: it can be applied to tasks with limited training data, and since data scarcity is a serious issue, this is a strong advantage of the proposed approach.
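The fold-by-fold enlargement of the training set can be sketched as follows; the random, speaker-agnostic split matches the description above, while the function name is ours:

```python
import random

def cumulative_folds(train_set, n_folds=5, seed=0):
    """Split the training set into equal random folds (ignoring gender and
    number of speakers, as in the paper) and yield cumulatively larger
    subsets: 20%, 40%, ..., 100% of the data."""
    data = list(train_set)
    random.Random(seed).shuffle(data)
    fold_size = len(data) // n_folds
    for k in range(1, n_folds + 1):
        yield data[: k * fold_size]
```

Each yielded subset contains all previous folds, so the system at 40% is trained on a strict superset of the 20% data.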
Fig. 8. Effects of using different sizes of the training set (20%-80%) on the EER of the proposed system (Siamese Configs. 1-3, 90-dimensional CQCC, 70-dimensional AE bottleneck).
Fig. 9. Effects of using different sizes of the training set (20%-80%) on the t-DCF of the proposed system (Siamese Configs. 1-3, 90-dimensional CQCC, 70-dimensional AE bottleneck).
F. Effect of Siamese Networks on Performance
To evaluate the contribution of the Siamese networks, the results of the employed Siamese networks and of a single CNN are compared in Table VI. The hyper-parameters of the CQCC and the autoencoder are tuned to the best-performing setting: the CQCC dimension is fixed at 90, and the autoencoder bottleneck size at 70. Three configurations are used in each case. The results indicate that the Siamese networks are effective, improving the results compared with a single CNN.
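The pairwise objective that distinguishes a Siamese network from a single CNN can be sketched with the standard contrastive loss of Chopra et al. [40]; the margin value here is illustrative, not taken from the paper:

```python
import numpy as np

def contrastive_loss(z1, z2, same, margin=1.0):
    """Contrastive loss on a pair of embeddings: pulls same-class pairs
    together and pushes different-class pairs at least `margin` apart.
    `same` is 1 for a matched pair, 0 otherwise."""
    d = np.linalg.norm(z1 - z2)
    return same * d ** 2 + (1 - same) * max(margin - d, 0.0) ** 2
```

A single CNN instead learns a per-sample class posterior; the Siamese formulation learns distances between samples, which is the extra step discussed below.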
TABLE VI
Effect of the employed Siamese network on performance (CQCC vector size = 90, bottleneck size = 70).

                     Siamese Network     CNN
Set     Config.      EER    t-DCF        EER    t-DCF
Dev.    Conf. 1      0.23   0.0027       0.94   0.0212
        Conf. 2      0.29   0.0196       1.21   0.0403
        Conf. 3      -      -            -      -
Eval.   Conf. 1      0.80   0.0208       3.25   0.1339
        Conf. 2      0.91   0.0278       5.11   0.2726
        Conf. 3      -      -            -      -
Discussion:
Siamese networks have shown promising results in various classification tasks, but their application has largely been restricted to tasks with an unlimited number of classes, so their effectiveness in tasks like this one has been overlooked. However, as Table VI indicates, the Siamese networks improve the performance of the system. Although the single CNN yields satisfactory results, it shows weaknesses in stability: its performance varies with the configuration, and its results on the evaluation set are much worse than on the development set. This is mostly because both classes are trained in one network, which shows some evidence of over-fitting. In addition, since the network must discriminate between samples directly, it has difficulty handling varied inputs, especially when average-pooling removes the noise. The Siamese networks, in contrast, learn the difference between the generated outputs, an extra step beyond plain classification, so even with average-pooling the model is still learned through the final layer; the average-pooling nevertheless removes noise and causes some drop in classification performance.

VII. CONCLUSION

We proposed a novel replay spoofing countermeasure system for ASVs based on the physical access portion of the ASVspoof 2019 dataset. Different configurations of CNNs in a Siamese network structure were investigated for classification. Moreover, an autoencoder was employed to resolve the dispersion problem of the well-known CQCC features. The experimental results confirmed the high efficiency of the Siamese network for spoofing detection. Additionally, the autoencoder could effectively capture and utilize noise information while eliminating redundant and irrelevant information in the CQCC features, resulting in outperforming the baseline system. The results also show that the proposed method achieves comparable performance with a smaller amount of training data (only 60% of the training set), which reflects the effectiveness and scalability of our system.

For future work, an integrated framework can be proposed to identify each of the four general attacks on ASV systems and apply the appropriate countermeasure for each attack. This is a necessary task, since every ASV system also faces other types of attacks and must be able to confront each of them.
REFERENCES
[1] A. K. Jain, A. Ross, and S. Pankanti, “Biometrics: a tool for information security,” IEEE Transactions on Information Forensics and Security, vol. 1, no. 2, pp. 125–143, 2006.
[2] T. Kinnunen and H. Li, “An overview of text-independent speaker recognition: From features to supervectors,” Speech Communication, vol. 52, no. 1, pp. 12–40, 2010.
[3] T. B. Amin, P. Marziliano, and J. S. German, “Glottal and vocal tract characteristics of voice impersonators,” IEEE Transactions on Multimedia, vol. 16, no. 3, pp. 668–678, 2014.
[4] T. Masuko, K. Tokuda, and T. Kobayashi, “Imposture using synthetic speech against speaker verification based on spectrum and pitch,” in Sixth International Conference on Spoken Language Processing, 2000.
[5] K. A. Lee, B. Ma, and H. Li, “Speaker verification makes its debut in smartphone,” IEEE Signal Processing Society Speech and Language Technical Committee Newsletter, 2013.
[6] Y. W. Lau, M. Wagner, and D. Tran, “Vulnerability of speaker verification to voice mimicking,” in Proceedings of 2004 International Symposium on Intelligent Multimedia, Video and Speech Processing. IEEE, 2004, pp. 145–148.
[7] A. Eriksson and P. Wretling, “How flexible is the human voice? A case study of mimicry,” in Fifth European Conference on Speech Communication and Technology, 1997.
[8] Y. Stylianou, “Voice transformation: a survey.” IEEE, 2009, pp. 3585–3588.
[9] N. Evans, F. Alegre, Z. Wu, and T. Kinnunen, “Anti-spoofing, voice conversion,” Encyclopedia of Biometrics, pp. 115–122, 2015.
[10] J. Lindberg and M. Blomberg, “Vulnerability in speaker verification: a study of technical impostor techniques,” in Sixth European Conference on Speech Communication and Technology, 1999.
[11] J. Villalba and E. Lleida, “Speaker verification performance degradation against spoofing and tampering attacks,” in FALA Workshop, 2010, pp. 131–134.
[12] S. Kaavya, V. Sethu, P. N. Le, and E. Ambikairajah, “Investigation of sub-band discriminative information between spoofed and genuine speech,” in Interspeech, 2016, pp. 1710–1714.
[13] S. Kaavya, V. Sethu, and E. Ambikairajah, “Deep Siamese architecture based replay detection for secure voice biometric,” in Interspeech, 2018, pp. 671–675.
[14] S. Bell and K. Bala, “Learning visual similarity for product design with convolutional neural networks,” ACM Transactions on Graphics (TOG), vol. 34, no. 4, p. 98, 2015.
[15] M. Adiban, H. Sameti, N. Maghsoodi, and S. Shahsavari, “SUT system description for Anti-Spoofing 2017 Challenge,” in Proceedings of the 29th Conference on Computational Linguistics and Speech Processing (ROCLING 2017), 2017, pp. 264–275.
[16] S. Yin, C. Liu, Z. Zhang, Y. Lin, D. Wang, J. Tejedor, T. F. Zheng, and Y. Li, “Noisy training for deep neural networks in speech recognition,” EURASIP Journal on Audio, Speech, and Music Processing, vol. 2015, no. 1, pp. 1–14, 2015.
[17] M. Sun, X. Zhang, and T. F. Zheng, “Unseen noise estimation using separable deep auto encoder for speech enhancement,” IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), vol. 24, no. 1, pp. 93–104, 2016.
[18] W.-N. Hsu, Y. Zhang, and J. Glass, “Unsupervised domain adaptation for robust speech recognition via variational autoencoder-based data augmentation.” IEEE, 2017, pp. 16–23.
[19] J. Villalba and E. Lleida, “Detecting replay attacks from far-field recordings on speaker verification systems,” in European Workshop on Biometrics and Identity Management. Springer, 2011, pp. 274–285.
[20] Z. Ji, Z.-Y. Li, P. Li, M. An, S. Gao, D. Wu, and F. Zhao, “Ensemble learning for countermeasure of audio replay spoofing attack in ASVspoof2017,” in Interspeech, 2017, pp. 87–91.
[21] Z. Wu, S. Gao, E. S. Cling, and H. Li, “A study on replay attack and anti-spoofing for text-dependent speaker verification,” in Signal and Information Processing Association Annual Summit and Conference (APSIPA), 2014, pp. 1–5.
[22] W. Shang and M. Stevenson, “Score normalization in playback attack detection.” IEEE, 2010, pp. 1678–1681.
[23] T. Kinnunen, Z.-Z. Wu, K. A. Lee, F. Sedlak, E. S. Chng, and H. Li, “Vulnerability of speaker verification systems against voice conversion spoofing attacks: The case of telephone speech.” IEEE, 2012, pp. 4401–4404.
[24] Z.-F. Wang, G. Wei, and Q.-H. He, “Channel pattern noise based playback attack detection algorithm for speaker recognition,” vol. 4. IEEE, 2011, pp. 1708–1713.
[25] H.-J. Shim, J.-W. Jung, H.-S. Heo, S.-H. Yoon, and H.-J. Yu, “Replay spoofing detection system for automatic speaker verification using multi-task learning of noise classes.” IEEE, 2018, pp. 172–176.
[26] M. J. Alam, G. Bhattacharya, and P. Kenny, “Boosting the performance of spoofing detection systems on replay attacks using q-logarithm domain feature normalization,” in Proc. Odyssey 2018 The Speaker and Language Recognition Workshop, 2018, pp. 393–398.
[27] G. Lavrentyeva, S. Novoselov, E. Malykh, A. Kozlov, O. Kudashev, and V. Shchemelinin, “Audio replay attack detection with deep learning frameworks,” in Interspeech, 2017, pp. 82–86.
[28] B. Wickramasinghe, S. Irtza, E. Ambikairajah, and J. Epps, “Frequency domain linear prediction features for replay spoofing attack detection,” in Interspeech, 2018, pp. 661–665.
[29] H. Sailor, M. Kamble, and H. Patil, “Auditory filterbank learning for temporal modulation features in replay spoof speech detection,” in Interspeech, 2018, pp. 666–670.
[30] W. Cai, H. Wu, D. Cai, and M. Li, “The DKU replay detection system for the ASVspoof 2019 challenge: On data augmentation, feature representation, classification, and fusion,” arXiv preprint arXiv:1907.02663, 2019.
[31] R. Li, M. Zhao, Z. Li, L. Li, and Q. Hong, “Anti-spoofing speaker verification system with multi-feature integration and multi-task learning,” Proc. Interspeech 2019, pp. 1048–1052, 2019.
[32] M. Alzantot, Z. Wang, and M. B. Srivastava, “Deep residual neural networks for audio spoofing detection,” arXiv preprint arXiv:1907.00501, 2019.
[33] R. K. Das, J. Yang, and H. Li, “Long range acoustic features for spoofed speech detection,” 2019.
[34] G. Lavrentyeva, S. Novoselov, A. Tseren, M. Volkova, A. Gorlanov, and A. Kozlov, “STC antispoofing systems for the ASVspoof2019 challenge,” arXiv preprint arXiv:1904.05576, 2019.
[35] B. Chettri, D. Stoller, V. Morfi, M. A. M. Ramírez, E. Benetos, and B. L. Sturm, “Ensemble models for spoofing detection in automatic speaker verification,” arXiv preprint arXiv:1904.04589, 2019.
[36] M. Todisco, H. Delgado, and N. W. D. Evans, “Articulation rate filtering of CQCC features for automatic speaker verification,” in Interspeech, 2016, pp. 3628–3632.
[37] J. C. Brown, “Calculation of a constant Q spectral transform,” The Journal of the Acoustical Society of America, vol. 89, no. 1, pp. 425–434, 1991.
[38] C. Schörkhuber and A. Klapuri, “Constant-Q transform toolbox for music processing,” 2010, pp. 3–64.
[39] M. Todisco, H. Delgado, and N. Evans, “A new feature for automatic speaker verification anti-spoofing: Constant Q cepstral coefficients,” in Speaker Odyssey Workshop, Bilbao, Spain, vol. 25, 2016, pp. 249–252.
[40] S. Chopra, R. Hadsell, and Y. LeCun, “Learning a similarity metric discriminatively, with application to face verification,” in Proc. of Computer Vision and Pattern Recognition. IEEE, 2005, pp. 539–546.
[41] J. Bromley, J. W. Bentz, L. Bottou, I. Guyon, Y. LeCun, C. Moore, E. Säckinger, and R. Shah, “Signature verification using a siamese time delay neural network,” International Journal of Pattern Recognition and Artificial Intelligence, vol. 7, pp. 669–688, 1993.
[42] J. Yang, Z. Lei, and S. Z. Li, “Learn convolutional neural network for face anti-spoofing,” arXiv preprint arXiv:1408.5601, 2014.
[43] Z. Ma, Y. Ding, B. Li, and X. Yuan, “Deep CNNs with robust LBP guiding pooling for face recognition,” Sensors, vol. 18, no. 11, p. 3876, 2018.
[44] X. Wu, R. He, and Z. Sun, “A lightened CNN for deep face representation,” arXiv preprint arXiv:1511.02683, vol. 4, no. 8, 2015.
[45] R. K. Srivastava, K. Greff, and J. Schmidhuber, “Highway networks,” arXiv preprint arXiv:1505.00387, 2015.
[46] K. Ahrabian and B. Babaali, “Usage of autoencoders and siamese networks for online handwritten signature verification,” Neural Computing and Applications, pp. 1–14, 2018.
[47] M. Todisco, X. Wang, V. Vestman, M. Sahidullah, H. Delgado, A. Nautsch, J. Yamagishi, N. Evans, T. Kinnunen, and K. A. Lee, “ASVspoof 2019: Future horizons in spoofed and fake audio detection,” arXiv preprint arXiv:1904.05441, 2019.
[48] Z. Wu, T. Kinnunen, N. Evans, J. Yamagishi, C. Hanilçi, M. Sahidullah, and A. Sizov, “ASVspoof 2015: the first automatic speaker verification spoofing and countermeasures challenge,” in Sixteenth Annual Conference of the International Speech Communication Association, 2015.
[49] T. Kinnunen, M. Todisco, N. Evans, J. Yamagishi, and K. A. Lee, “The ASVspoof 2017 Challenge: Assessing the limits of replay spoofing attack detection,” in Interspeech, 2017, pp. 2–6.
[50] T. Kinnunen, K. A. Lee, H. Delgado, N. Evans, M. Todisco, M. Sahidullah, J. Yamagishi, and D. A. Reynolds, “t-DCF: a detection cost function for the tandem assessment of spoofing countermeasures and automatic speaker verification,” arXiv preprint arXiv:1804.09618, 2018.
[51] M. Sahidullah, T. Kinnunen, and C. Hanilçi, “A comparison of features for synthetic speech detection,” in Interspeech, 2015, pp. 2087–2091.
[52] T. Kinnunen, N. Evans, J. Yamagishi, K. A. Lee, M. Sahidullah, M. Todisco, and H. Delgado, “ASVspoof 2019: Automatic Speaker Verification Spoofing and Countermeasures Challenge Evaluation Plan.”