Replay Spoofing Countermeasure Using Autoencoder and Siamese Network on ASVspoof 2019 Challenge
Mohammad Adiban, Hossein Sameti, Saeedreza Shehnepoor
Abstract—Automatic Speaker Verification (ASV) is the process of identifying a person based on the voice presented to a system. Different synthetic approaches allow spoofing to deceive ASV systems (ASVs), whether by imitating a voice or reconstructing its features. Attackers try to defeat ASVs using four general techniques: impersonation, speech synthesis, voice conversion, and replay. The last technique is considered a common and high-potential tool for spoofing, since replay attacks are more accessible and require no technical knowledge from adversaries. In this study, we introduce a novel replay spoofing countermeasure for ASVs. Accordingly, we use Constant Q Cepstral Coefficient (CQCC) features fed into an autoencoder to attain more informative features and to capture the noise information of spoofed utterances for discrimination purposes. Finally, different configurations of the Siamese network are used, for the first time in this context, for classification. Experiments performed on the ASVspoof 2019 challenge dataset, using Equal Error Rate (EER) and tandem Detection Cost Function (t-DCF) as evaluation metrics, show that the proposed system improves over the baseline by 10.73% and 0.2344 in terms of EER and t-DCF, respectively.
Index Terms—Spoof detection, Replay Attack, ASVspoof Challenge, CQCC, Autoencoder, Siamese Network
I. INTRODUCTION
Speech can be considered one of the most important means of communication for humans. Each individual has a unique voice pattern, identifiable as a signature. This pattern helps people identify the other end of a communication. Consequently, numerous low-cost technologies have been developed based on voice as a biometric feature for identity recognition [1], known as Automatic Speaker Verification systems (ASVs). An ASV captures different cues such as intonation and vocal tract shape in order to verify a person's identity [2]. On the other hand, techniques are available to synthesize a voice or its characteristics [3][4]. This provides a great opportunity for spoofing, where the attacker exploits a specific speaker's voice to deceive an ASV system. Besides, advances in channel and noise detection and the removal of their effects have made ASV applications suitable for market usage. The issue becomes severe when it comes to e-commerce and smartphone logical access scenarios [5]. These ASVs mostly use short-term spectral features, which are very vulnerable to spoofing attacks. Accordingly, the research effort mainly investigates four types of attacks to address this problem: impersonation, speech synthesis, voice conversion, and replay. In the first category, attackers deceive ASVs by mimicking the human voice or even altering it. Formant extraction ($F_1$, $F_2$, and $F_3$) is a regular feature-based approach to spot spoofing in this category [3][6][7]. The second category focuses on Text-to-Speech (TTS) approaches, where a system synthesizes a text to generate the voice. In the third category, the voice of a person is converted to that of the target speaker and then presented to ASVs for spoofing [8][9].

The authors are with the Department of Computer Engineering, Sharif University of Technology, Tehran, Iran (e-mail: [email protected]; [email protected]; [email protected]).
Finally, attackers make use of recorded voices from genuine speakers for spoofing, to form a replay attack. Several studies extract features from speech and classify spoofed/genuine utterances [10][11]. However, some problems still remain out of attention. As [12] mentioned, one important aspect to consider is determining which features or classifiers are better for discrimination. Furthermore, transforming features into a space with a more informative context for discrimination is also an interesting research direction that should be considered. Here, we employ Siamese networks for the first time as a classifier to increase the discrimination strength between the extracted features. Although Siamese networks have been used for spoofing detection tasks such as discriminative feature extraction [13], the performance of such networks as a classifier has been overlooked in ASVs. Our motivation for using such a method is owing to the fact that Siamese networks are applicable to tasks involving measuring similarity or determining relationships between two comparable subjects [14]. In such tasks, two identical subnetworks are usually used to process the two injected inputs, and another module takes the outputs of these two subnetworks and makes the final decision. Siamese networks can be fed two sets of features and, after taking the final output from the output layer, use a distance measure in order to label an input as a spoofed/non-spoofed sample. In this study, we also extract CQCCs as basic features from both genuine and spoofed samples. Since these features are sparse and raw, we feed them to autoencoders to obtain compressed and more discriminative features. In particular, our motivation for using an autoencoder in this context is that replayed speech suffers from channel or convolutional noise, since the recording process is performed by two microphones through one loudspeaker [15].
As a result, noise detection is another aspect of replay spoofing detection, as it is one of the most important factors in the successful adaptation of ASVs in the marketplace. For instance, in [16], a robust system is developed to improve Deep Neural Network (DNN) performance by using noisy data from channels. This type of system is called noise-aware, where noise is introduced to the system in order to help it detect spoofing attacks, especially replay ones. In this study, an autoencoder neural network is employed, due to its prior success in extracting the noise as a latent variable, given the same noisy input as the network's output [17][18]. We trained the autoencoder in order to consider the noise and extract an informative representation of the basic features, capable of representing both meaningful and sparse features related to the original samples. Moreover, in contrast with other feature reduction and selection methods, no feature is excluded; hence, for both clean and noisy speech, all features are preserved. Finally, in order to have a contextual representation of the features, we first feed them to a Convolutional Neural Network (CNN) and then share the weights of this CNN with a second, identical CNN; the two are then employed in a Siamese network. Siamese networks extract the similarity or a connection between two inputs and yield a significantly discriminative classification.

In the following, we first list the related studies in Section II, then define the models of our work in Section III. Subsequently, the proposed system is introduced in Section IV. Next, we give a brief explanation of the experimental setup in Section V. In Section VI, we discuss experimental results and, finally, draw conclusions and propose future work in Section VII.

II. RELATED WORKS
Different techniques have been proposed based on physiological and behavioral cues used for identification. Four types of spoofing attacks mostly threaten ASVs. Among them, replay attacks focus on using a pre-recorded speech from a genuine speaker in order to deceive ASVs. Speech can be recorded without the speaker's consent, or even generated by concatenating different parts of speech from the genuine speaker. High-tech devices designed for recording, including smartphones, laptops, and recorders, complicate the situation even more. As a result, replay can be the most potent threat to ASVs. The vulnerability of ASVs to replay attacks was first investigated by [10], where the authors reported a significant increase in Equal Error Rate (EER) and False Acceptance Rate (FAR) for ASVs. Other studies [11][19] demonstrated, using Joint Factor Analysis (JFA), that FAR increased for attacks in this category. In [20], a considerable increase in EER is reported for a text-independent system with a GMM-UBM classifier. As mentioned, various low-cost ASVs have saturated the market on the one hand and, on the other hand, the ease and availability of devices required for replay attacks have led to countermeasures being proposed for both sides [5]. Furthermore, a replay attack requires no special speech processing techniques [21] and, unlike other types of spoofing, can be mounted with less knowledge about ASVs. Hence, replay is more often engaged in spoofing attacks. Therefore, in this study, considering the importance of replay spoofing attacks, we focus on this type of attack and examine countermeasures against it. Among the first countermeasures for replay attacks, [22] employed a fixed pass-phrase on a text-dependent ASV. The detector compares stored past access attempts with new attempts and then makes the decision. Results demonstrated an improvement in EER while detecting and recognizing playback utterances.
In [23], the authors use the spectral ratio and modulation index in order to detect spoofing in far-field recordings, where noise increases in the recorded speech due to the distance. As a result of the long distance, the spectrum is flattened and the modulation index is reduced. For classification purposes, an SVM approach was used. Their results show that the FAR of text-independent JFA is reduced for the ASVs. Wang et al. [24] claim that licit recordings have just one specific type of channel noise, which is mixed with additional environmental noise when a replay attack is presented in the far field. A GMM-UBM classifier was employed in order to detect the replay attack, and the results show a significant decrease in EER. Ji et al. [20] captured CQCC features and fed them to a decision tree classifier; results showed an EER of 10.8% on the ASVspoof 2017 dataset. Adiban et al. [15] combined different features, including Mel-Frequency Cepstral Coefficients (MFCC), RASTA-PLP, Modified Group Delay, CQCC, i-vector, and Linear Prediction Cepstral Coefficients (LPCC), and fed them to different classifiers such as SVM, Multi-Layer Perceptron (MLP) neural networks, and GMMs on the ASVspoof 2017 dataset. The best performance was an EER of 10.31%. Shim et al. [25] investigated the problem of replay spoofing attack detection and noise classification using multi-task learning on playback devices, recording environments, and recording devices. Results showed a 30% improvement, from 13.57% to 9.56%, in terms of EER on the ASVspoof 2017 dataset. A Discrete Fourier Transform (DFT) was used in [26] for the ASVspoof 2017 dataset. Then, a normalization was applied to the features in the q-log domain. For feature reduction, Principal Component Analysis (PCA) was performed on the initial features. The best performance was an EER of 11.19%. An anti-spoofing task on the ASVspoof 2017 dataset was performed in [27], where a CNN was used as a deep learning approach with the Max-Feature-Map activation function.
The proposed approach yielded an EER of 6.73%. Long-term temporal envelopes were also extracted from sub-band signals using Frequency Domain Linear Prediction (FDLP) for feature extraction, with GMM and CNN used for classification, in [28]. The reported EER on the ASVspoof 2017 dataset was 9.70%. Sailor et al. [29] reported an EER of 8.89% on the ASVspoof 2017 dataset using ConvRBM-CC as features and GMMs for classification. In the work presented in [30], Inverted Mel-Frequency Cepstral Coefficients (IMFCC), short-time Fourier transform (STFT), group delay gram (GD gram), and joint gram are used as features. The extracted features are fed into a residual neural network (ResNet) classifier. The best results are obtained in score-level fusion on the ASVspoof 2019 dataset: 0.66% and 0.0168 in terms of EER and t-DCF, respectively. Li et al. [31], working on the ASVspoof 2019 dataset, utilize multiple spectral features within the network, such as MFCC, CQCC, and Fbank, as well as a butterfly unit (BU) for multi-task learning. Results showed 0.67% and 0.0148 in terms of EER and t-DCF, respectively. In [32], various features, including MFCC, CQCC, and STFT, are extracted and then fed into a ResNet classifier. Finally, at the fusion level, an EER of 0.28% and a t-DCF of 0.0074 are achieved. The study in [33] proposed a method for replay spoofing detection based on various long-range acoustic features and Deep Neural Networks as classifiers. Their best combined system obtained a t-DCF of 0.1381 and an EER of 5.95% for physical access on the ASVspoof 2019 dataset. The study in [34] is another work in which the proposed system is implemented using a Light CNN (LCNN) based on various acoustic features such as CQT, LFCC, and the Discrete Cosine Transform (DCT).
The best result reported in that study is obtained in score-level fusion: 0.54% and 0.0122 in terms of EER and t-DCF, respectively. In [35], the authors utilize different spectral features, including delta and acceleration MFCC (SDA MFCC), Inverted MFCC (IMFCC), CQCC, and sub-band centroid magnitude coefficient (SCMC) features, together with i-vectors, in order to detect spoofing using various shallow and deep classifiers such as GMM, SVM, CNN, Convolutional Recurrent Neural Network (CRNN), 1D-CNN, and Wave-U-Net, resulting in an EER of 5.43% and a t-DCF of 0.1465. Table I summarizes the replay attack countermeasures investigated in this section.

III. MODELS
A. Constant Q Cepstral Coefficient
In this work, we use CQCCs. These features utilize the Constant Q Transform (CQT) [37], which was originally presented for tasks related to music processing [38] and later employed in speaker verification tasks [36]. Unlike many cepstral features based on the prevalent Fourier Transform (FT), CQCCs employ the CQT, which uses geometrically spaced frequency bins [36]. The main difference between the FT and the CQT is that the FT uses a fixed time-frequency resolution, whereas the CQT maintains a constant Q factor, which leads to greater frequency resolution at lower frequencies and better time resolution at higher frequencies [39]. The Q factor is the ratio between the center frequency $f_k$ and the bandwidth $\delta_f$, defined as

$$Q = \frac{f_k}{\delta_f}. \quad (1)$$

Details about the CQT can be found in [39]. The CQCC extraction steps are as follows. Initially, for a given signal $x(n)$, the CQT is applied to obtain the spectrum. Then, the power spectrum and subsequently the logarithmic power spectrum are calculated. Eventually, in order to attain the cepstrum of $x(n)$, we apply the DCT as the inverse transformation of the logarithmic spectrum. The extraction steps of CQCC are shown in Fig. 1. In addition, a comparison of the CQT and STFT spectrograms for a speech signal is illustrated in Fig. 2.

Fig. 2. A comparison of CQT and STFT spectrograms.
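As a small illustration of Eq. (1), the following sketch (our own, not from the paper; the minimum frequency and bin counts are arbitrary toy values) shows that geometrically spaced center frequencies give the same Q factor for every bin:

```python
# Sketch: with geometrically spaced center frequencies
# f_k = f_min * 2^(k/B) (B bins per octave), the ratio
# Q = f_k / delta_f_k of Eq. (1) is identical for every bin k.
def cqt_center_frequencies(f_min, bins_per_octave, n_bins):
    return [f_min * 2.0 ** (k / bins_per_octave) for k in range(n_bins)]

def q_factor(f_k, f_next):
    # Bandwidth of bin k taken as the spacing to the next center frequency.
    return f_k / (f_next - f_k)

freqs = cqt_center_frequencies(f_min=32.7, bins_per_octave=24, n_bins=96)
qs = [q_factor(freqs[k], freqs[k + 1]) for k in range(len(freqs) - 1)]

# Constant Q: minimum and maximum Q over all bins coincide.
print(round(min(qs), 4), round(max(qs), 4))
```

In contrast, a linearly spaced FT grid would make `q_factor` grow with frequency, which is precisely the fixed time-frequency resolution the text contrasts with the CQT.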
B. Autoencoders
The autoencoder (AE) is a generative model that can be used to extract substantial information and features. Accordingly, an autoencoder is a neural network devised to learn an identity function in an unsupervised setting, reconstructing high-dimensional input data $\{x_1, x_2, \ldots, x_m\}$ as outputs $\{\hat{x}_1, \hat{x}_2, \ldots, \hat{x}_m\}$. As mentioned in Sec. I, the AE is capable of capturing latent variables hidden in the input samples. Furthermore, the AE extracts the most informative and independent features, compressing the significant information of the input samples [18]. Considering the sparse nature of noise, the AE can both capture the sparse features of the noisy input and map them to a feature space capable of discriminating noisy and clean samples. Accordingly, the classifier is able to discriminate both types of samples with high accuracy. The structure of the autoencoder is depicted in Fig. 3. According to this structure, the input features are reduced to compact ones, which can be considered a feature vector. As shown in Fig. 3, $L$ is a hidden layer. Eq. 2 defines the activation of unit $i$ in layer $l$:

$$a_i^{(l)} = f\left(\sum_{j=1}^{n} W_{ij}^{(l-1)} a_j^{(l-1)} + b_i^{(l-1)}\right), \quad (2)$$

where $W$ represents the weights and $b$ denotes the bias parameters. According to Fig. 3, the input layer is considered $a^{(1)} = x$, and in the output layer, $a^{(3)} = \hat{x}$. In addition, in the hidden layers, the sigmoid function is used as the activation function $f$, whereas the linear function is used in the output layer. The objective function is defined as:

$$J(W, b) = \frac{1}{m}\sum_{i=1}^{m}\left(\frac{1}{2}\left\|x_i - \hat{x}_i\right\|^2\right) + \lambda \sum_{l=1}^{n_l-1}\sum_{i=1}^{s_l}\sum_{j=1}^{s_{l+1}}\left(W_{ji}^{(l)}\right)^2, \quad (3)$$

where the parameters $\lambda$, $n_l$, and $s_l$ represent the strength of regularization, the number of layers in the network, and the number of units in layer $l$, respectively.
Fig. 1. Block diagram of CQCC feature extraction [36].

TABLE I
SUMMARY OF THE REPLAY ATTACK COUNTERMEASURE (CM) METHODS AND RESULTS

Study | CM Method | EER% / t-DCF
Z. Ji, et al. [20] | CQCC + GMM mean supervector + Gradient Boosting | 10.8 / -
M. Adiban, et al. [15] | MFCC, RASTA-PLP, CQCC, LPCC and i-vector + GMM, MLP and SVM | 10.31 / -
H.-J. Shim, et al. [25] | Noise detection + neural network and multi-task learning | 9.56-13.57 / -
M. J. Alam, et al. [26] | DFT-based features + feature normalization + PCA | 11.9 / -
G. Lavrentyeva, et al. [27] | Max-Feature-Map activation + CNN | 6.37 / -
B. Wickramasinghe, et al. [28] | FDLP (TC and RC) + GMM and CNN | 9.70 / -
H. B. Sailor, et al. [29] | ConvRBM-CC + GMM | 8.89 / -
W. Cai, et al. [30] | IMFCC, STFT, GD gram, joint gram + ResNet | 0.66 / 0.0168
R. Li, et al. [31] | MFCC, CQCC, Fbank + BU | 0.67 / 0.0148
M. Alzantot, et al. [32] | MFCC, CQCC, STFT + ResNet | 0.28 / 0.0074
R. K. Das, et al. [33] | Long-range acoustic features + DNN | 5.95 / 0.1381
G. Lavrentyeva, et al. [34] | CQT, LFCC and DCT + LCNN | 0.54 / 0.0122
B. Chettri, et al. [35] | Spectral features, i-vector + deep and shallow classifiers | 5.43 / 0.1465

In the training phase, the objective function is minimized with respect to the parameters $W$ and $b$.

Fig. 3. Diagram of the Autoencoder.
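Eqs. (2) and (3) can be made concrete with a small numerical sketch (our own illustration, not the paper's implementation; the layer sizes, weights, and data are toy values): a sigmoid hidden layer, a linear output layer, and the regularized reconstruction objective.

```python
import math, random

random.seed(0)

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

# Toy X-Y-X autoencoder: 4-dim input, 2-dim bottleneck, linear output.
n_in, n_hid = 4, 2
W1 = [[random.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hid)]
b1 = [0.0] * n_hid
W2 = [[random.uniform(-0.5, 0.5) for _ in range(n_hid)] for _ in range(n_in)]
b2 = [0.0] * n_in

def forward(x):
    # Hidden layer: Eq. (2) with f = sigmoid.
    h = [sigmoid(sum(W1[i][j] * x[j] for j in range(n_in)) + b1[i])
         for i in range(n_hid)]
    # Output layer: linear activation, reconstructing the input.
    return [sum(W2[i][j] * h[j] for j in range(n_hid)) + b2[i]
            for i in range(n_in)]

def objective(xs, lam=1e-3):
    # Eq. (3): mean squared reconstruction error plus L2 weight decay.
    recon = sum(0.5 * sum((xi - ri) ** 2 for xi, ri in zip(x, forward(x)))
                for x in xs) / len(xs)
    decay = lam * sum(w ** 2 for M in (W1, W2) for row in M for w in row)
    return recon + decay

data = [[random.uniform(0, 1) for _ in range(n_in)] for _ in range(8)]
print(objective(data))  # scalar loss, minimized w.r.t. W and b in training
```

In the paper's setting, the 2-dimensional hidden activations would play the role of the compact bottleneck features passed on to the classifier.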
C. Siamese Networks
Siamese networks are a type of neural network first used by [40][41] for different purposes. Most of the early work focused on their application in verification tasks such as face recognition, signature recognition, etc. A Siamese network takes a sample as input and maps it onto a new latent space, where similar samples lie at a shorter distance than dissimilar ones. The whole idea is thus to find a target space in which the semantic distance between inputs is captured. Consequently, Siamese networks can be very useful when the training data does not contain sufficient information for classification. Hence, Siamese networks are well suited to verification tasks. Spoofing detection is also a verification process deciding whether a voice is genuine or not. In this work, considering the fake nature of spoofed speech, a similarity measure can be engaged to compare the features extracted from the two inputs given to the Siamese network, with another neural layer making the final decision.

The architecture of this network is given in Fig. 4. It includes two identical deep neural networks with the same configuration in terms of weights, hyper-parameters, etc. Each of these neural networks is called a leg of the network, and both are identical. One network is trained on input samples, and a copy of the network with the same weights is used as the other leg. The network can be a simple Multi-Layer Perceptron (MLP) or another type of deep neural network such as a CNN or RNN. The outputs are usually combined by a combination function, and the result is fed to a fully connected network trained to produce a metric indicating whether the two samples are the same or different, based on the output of the combination function. The task is therefore to minimize a loss function based on the combination function. The final fully connected layer can be implemented with a sigmoid function or a single neuron.

IV. PROPOSED SYSTEM
In this section, we give details of the model we employ in each step of the whole framework. The proposed system is depicted in Fig. 5. In the first step, CQCC features with different dimensions are extracted from the input voice using the CQCC feature extractor. Thereupon, for encoding the features, we use an X × Y × X structure, where X is the dimension of the extracted features and Y is the bottleneck dimension. For training, we use fine-tuning back-propagation, so the parameters improve in each iteration. We then use a Convolutional Neural Network (CNN) similar to the one used in [42]. The CNN consists of 3 convolutional layers, 3 max-pooling layers, and 2 Fully Connected (FC) layers. The max-pooling operator has been shown to be sensitive to noisy input, which is an important property here, since noise is one of the key characteristics of a replay attack [43].

Fig. 4. Architecture of Siamese Networks.
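The two-leg arrangement of Fig. 4 can be sketched as follows (our own illustrative code, not the paper's implementation; the tiny leg network, its weights, and the merge function are toy choices). Both legs apply the SAME shared weights, their embeddings are merged by a combination function (absolute difference here), and a single sigmoid output neuron scores similarity:

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

# One shared set of leg weights: both legs are the same function.
LEG_W = [[0.4, -0.2, 0.1], [-0.3, 0.5, 0.2]]  # toy 3-dim -> 2-dim embedding

def leg(x):
    # Identical subnetwork applied to either input (weight sharing).
    return [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in LEG_W]

def siamese_score(x_original, x_test, out_w=(-4.0, -4.0), out_b=2.0):
    # Combination function: element-wise absolute difference of embeddings,
    # followed by a single sigmoid output neuron (the "Merge" + FC of Fig. 4).
    e1, e2 = leg(x_original), leg(x_test)
    merged = [abs(a - b) for a, b in zip(e1, e2)]
    return sigmoid(sum(w * m for w, m in zip(out_w, merged)) + out_b)

same = siamese_score([0.2, 0.7, 0.1], [0.2, 0.7, 0.1])   # identical inputs
diff = siamese_score([0.2, 0.7, 0.1], [0.9, -0.8, 0.6])  # different inputs
print(same > diff)  # identical inputs score higher similarity
```

In the proposed system, each leg is the CNN described above rather than this one-layer toy, but the weight sharing and distance-based decision are the same idea.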
Fig. 5. Diagram of our proposed Siamese network.

Two independent convolutional networks were used in every layer, with Max-Feature-Map (MFM) [44] as the activation function. For compression purposes, a max-pooling kernel of size 2 × 2 was used. For the FC layer, we used a softmax function to discriminate between spoofed and genuine samples. To improve performance, we employed a highway structure from [45] in the final layer, which regulates information transfer based on the idea that a neural network can act as a highway, where unimpeded information can flow over several layers without attenuation. We used ReLU as the affine transform function and sigmoid as the activation function:

$$y = H(x, W_H) \cdot T(x, W_T) + x \cdot (1 - T(x, W_T)). \quad (4)$$

In Eq. 4, $H$ and $T$ are the affine transform function and the activation (transform gate) function, respectively. The weight matrices $W_H$ and $W_T$ are updated during training, $x$ represents the input, and the dot operator (·) denotes element-wise multiplication. Finally, $y$ is the output of the highway function. The probability of a voice being spoofed or not is calculated by a softmax layer using the following equation:

$$P(c_i \mid z) = \frac{e^{z^{\top} w_i}}{\sum_{j=1}^{C} e^{z^{\top} w_j}}, \quad (5)$$

where $c_i$ is the $i$-th class, for which we try to determine whether sample $x$ is a member, $z$ is the given feature vector, and $C$ is the number of classes, which in our case is two: spoofed and genuine. One of the trained CNNs is fed with genuine fixed-length vectors obtained from the autoencoder, and the other is fed with the test sample. If they are similar, they get the same class; otherwise, they are placed in different classes. Following [46], we used dropout to prevent overfitting, with mini-batches of 200 samples. For the loss function we used cross-entropy:

$$CE = -\sum_{i=1}^{C'} \big[\, t_i \log(f(s_i)) + (1 - t_i)\log(1 - f(s_i)) \,\big], \quad (6)$$

where $t_i$ is the true label of class $i$, $f(s_i)$ is the predicted label for sample $s_i$, and $C'$ indicates the total number of classes in the dataset, which is 2 in our case. Finally, a simple threshold on the distance between the two feature vectors assigns the value 1 (spoofed) if it is more than 0.5, and 0 (genuine) otherwise.

V. EXPERIMENTAL SETUP
A. Dataset
The ASVspoof 2019 challenge [47] was introduced to expand the goals of the previous challenges, ASVspoof 2015 [48] and ASVspoof 2017 [49]. The main purpose of ASVspoof 2015 was to introduce countermeasure systems for detecting spoofed/non-spoofed speech, in which spoofed speech was produced by either text-to-speech (TTS) or voice conversion (VC) approaches. The ASVspoof 2017 challenge, in turn, guided studies toward countermeasure systems for detecting replay spoofing attacks. Subsequently, the ASVspoof 2019 challenge was presented to complete the objectives of the two previous challenges and provides two subsets: Logical Access (LA) and Physical Access (PA). In the PA scenario, the spoofing attacks are replay attacks, where an adversary tries to record a genuine speech and replay it in order to deceive the ASV system. PA includes three subsets: training, development, and evaluation sets. For training and development, a dataset was created using a combination of three room sizes, three levels of reverberation, and three different speaker-to-ASV microphone distances, for 27 configurations in total. The replay attack dataset comprises nine different configurations from three categories of distances and three different qualities. The evaluation dataset consists of 137,457 trials, including both replayed spoofed speech and bona fide speech, with different configurations and unique speakers. The statistics of each subset are summarized in Table II.
TABLE II
PHYSICAL ACCESS SCENARIO STATISTICS FOR THE ASVSPOOF 2019 DATABASE
B. Evaluation Metrics
The ASVspoof 2019 challenge is a binary classification task, in which utterances from real humans are labeled as the positive class and spoofing attacks are labeled as the negative class. The tandem Detection Cost Function (t-DCF) [50] was adopted by ASVspoof 2019 as the standard metric; it is based on detection theory and can be specified for the envisioned application. Isolation of the different systems (such as the CM and the ASV) is a key feature of tandem systems and the t-DCF. We also use the EER as an additional metric to measure our system's performance.

VI. EXPERIMENTAL EVALUATION
A. Baseline System
The organizers of the ASVspoof 2019 challenge introduced two systems as baselines [47] for participants. The baseline method is based on CQCC and Linear Frequency Cepstral Coefficient (LFCC) [51] features and GMM classifiers. Accordingly, CQCC and LFCC features are extracted from the training data, and two GMMs (one for bona fide and one for spoofed data), each with 512 components, are trained by EM iterations. In the next step, the score for each trial is computed using these GMMs. The results on the development and evaluation sets of the physical access scenario in terms of t-DCF and EER are presented in Table III.
TABLE III
t-DCF AND EER RESULTS FOR TWO BASELINE COUNTERMEASURES ON PHYSICAL ACCESS SCENARIOS FOR BOTH THE DEVELOPMENT SET AND EVALUATION SET [47][52]

Baseline System | Development Set (EER% / t-DCF) | Evaluation Set (EER% / t-DCF)
LFCC-GMM | 11.96 / 0.2554 | 13.54 / 0.3017
CQCC-GMM | 9.87 / 0.1953 | 11.04 / 0.2454
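The baseline scoring described above (one GMM per class, one score per trial) amounts to a log-likelihood ratio between the bona fide and spoofed models. A minimal sketch with toy one-dimensional GMMs (our own illustration; the actual baseline uses 512-component GMMs over CQCC/LFCC frame vectors):

```python
import math

def gmm_loglik(x, weights, means, variances):
    # Log-likelihood of a scalar observation under a 1-D GMM.
    p = sum(w * math.exp(-(x - m) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)
            for w, m, v in zip(weights, means, variances))
    return math.log(p)

# Toy models: bona fide frames centered near 0, spoofed frames near 2.
BONAFIDE = ([0.6, 0.4], [0.0, 0.5], [1.0, 0.8])
SPOOF    = ([0.5, 0.5], [2.0, 2.5], [1.0, 1.2])

def trial_score(frames):
    # Average per-frame log-likelihood ratio: positive -> bona fide.
    return sum(gmm_loglik(x, *BONAFIDE) - gmm_loglik(x, *SPOOF)
               for x in frames) / len(frames)

print(trial_score([0.1, -0.2, 0.3]) > 0)   # bona-fide-like trial
print(trial_score([2.1, 1.9, 2.4]) < 0)    # spoof-like trial
```

The EM training of the two GMMs is omitted here; only the scoring step that produces the per-trial score is sketched.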
B. Proposed System Configuration
As mentioned, our proposed system consists of three main parts: first, extracting CQCC features with different dimensions from the input speech; second, feature dimensionality reduction to different dimensions using the autoencoder; and finally, using the Siamese network as a classifier. This process includes training and evaluation phases:
1) Training Phase:
To train our system, we extract CQCC features from the training data (for both bona fide and spoofed data). We then train an autoencoder with different bottleneck dimensions on the extracted features, separately for each Siamese network. Accordingly, we take into account two objectives when training the autoencoder. First, we try to reduce the data dispersion. More importantly, the autoencoder is trained to account for the noise in spoofed speech, yielding more valuable data for spoofing detection. As the final step, we train the Siamese network (comprising two CNNs) as our classifier. In order to train these CNNs, we first divide our dataset into two balanced parts (one part for the first CNN and the other for the second). Afterwards, we randomly select data from each part without replacement and present each selected sample, with its label, to the network assigned to that part. Finally, the CNNs learn whether two input samples are from the same class or not, and their shared weights are updated accordingly. Different configurations of the CNNs, denoted "Siamese Network Configs", were used to observe the impact of various filters with different sizes on performance. Considering the effects of using different feature extraction methods, we used three different CNN configurations, running our systems on the development data to choose them and tune their parameters. The first configuration, denoted Config. 1, contains three convolutional and average-pooling layers and 1 hidden layer, holding 160, 200, and 100 filters, respectively. In the hidden layer, we have 300 nodes and a pooling width of 3. The second configuration (Config. 2) has the same architecture, with the number of hidden nodes changed to 500. In the third configuration (Config. 3), we replaced average-pooling with max-pooling to observe the impact of the max filter instead of the average one.
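The three configurations just described differ only in two hyper-parameters, which can be summarized programmatically (a sketch of the stated settings; the field names are our own shorthand, not from the paper's code):

```python
# Hyper-parameters of the three Siamese CNN configurations described above
# (field names are our own shorthand, not from the paper's code).
BASE = {
    "conv_filters": (160, 200, 100),  # three convolutional layers
    "pooling": "average",
    "pooling_width": 3,
    "hidden_nodes": 300,
}

CONFIGS = {
    "config1": dict(BASE),                     # as described for Config. 1
    "config2": dict(BASE, hidden_nodes=500),   # wider hidden layer
    "config3": dict(BASE, pooling="max"),      # max- instead of average-pooling
}

for name, cfg in sorted(CONFIGS.items()):
    print(name, cfg["conv_filters"], cfg["pooling"], cfg["hidden_nodes"])
```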
2) Evaluation Phase:
In the evaluation phase, similar to the training phase, we extract CQCC features of different dimensions from the evaluation set. We then use the autoencoder to reduce the dimensions of the extracted features. At the classification level, we feed the evaluation data to one of the CNNs trained in the previous step, while the input of the other CNN is a fixed bona fide speech sample randomly selected from the training set. So, for each evaluation utterance, we check whether the input data matches the fixed bona fide speech: if it matches, we consider it bona fide; otherwise, it is considered spoofed. Finally, we repeat this cycle with 100 fixed bona fide samples, randomly selected from the training set without regard to gender or speaker, and take the majority vote. As mentioned for the training phase, it should be noted that we used the development trials to tune the parameters of our system.
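The repeated comparison and majority vote can be sketched as follows (illustrative only; `compare` stands in for the trained Siamese network, and the scalar "utterances" and reference set are toy values):

```python
import random

random.seed(1)

def majority_vote(test_utt, references, compare):
    # Compare the test utterance against each fixed bona fide reference;
    # each comparison votes "bona fide" (True) or "spoof" (False).
    votes = [compare(ref, test_utt) for ref in references]
    return sum(votes) > len(votes) / 2  # True -> classified bona fide

# Stand-in comparator: utterances are toy scalars; "match" = close together.
compare = lambda ref, test: abs(ref - test) < 1.0

# 100 fixed bona fide references drawn at random (toy values near 0).
references = [random.gauss(0.0, 0.5) for _ in range(100)]

print(majority_vote(0.2, references, compare))   # bona-fide-like input
print(majority_vote(5.0, references, compare))   # spoof-like input
```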
C. Siamese Networks Results and Analysis
Results obtained from different classifier setups are investigated in this section. Table IV shows the EER and t-DCF achieved with different dimensions of CQCC and different configurations of our classifier, without any dimensionality reduction, on the development and evaluation sets. As shown in Table IV, the best results in each classifier configuration belong to the 90-dimensional CQCC. Among these, the best result is achieved by Siamese Config. 3. This result improves on the baseline system by 10.42% and 0.2344 in terms of EER and t-DCF, respectively.
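The EER values reported in Table IV correspond to the operating point where the false acceptance rate equals the false rejection rate. A minimal sketch of computing EER from score lists (our own code; the scores are toy values, with higher meaning more bona-fide-like):

```python
def eer(bonafide_scores, spoof_scores):
    # Sweep thresholds over all observed scores; EER is where FAR ~= FRR.
    candidates = sorted(bonafide_scores + spoof_scores)
    best = (2.0, None)  # (|FAR - FRR|, EER estimate)
    for th in candidates:
        far = sum(s >= th for s in spoof_scores) / len(spoof_scores)
        frr = sum(s < th for s in bonafide_scores) / len(bonafide_scores)
        gap = abs(far - frr)
        if gap < best[0]:
            best = (gap, (far + frr) / 2)
    return best[1]

# Toy scores for ten trials.
bona = [0.9, 0.8, 0.75, 0.6, 0.55]
spoof = [0.7, 0.5, 0.4, 0.3, 0.2]
print(eer(bona, spoof))  # -> 0.2
```

With one overlapping spoof score (0.7 above the lowest bona fide scores), the rates cross at 20%, so the sketch reports an EER of 0.2.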
Discussion:
As explained in the introduction, using descriptive CQCCs of higher feature dimension can be effective in the spoofing detection task, because they give better resolution at different frequencies. This leads to a detailed description of both noisy and ordinary frequencies. On the other hand, increasing the dimension of the features may introduce redundancy. Thus, we look for a balance between higher resolution and lower redundancy. Experimental results show that 90-dimensional CQCC achieves this equilibrium. In addition, the results for Config. 3 are clearly better than those of the other two configurations. The key point is that it has fewer hidden nodes than the second configuration, so we can conclude that increasing the number of hidden nodes helps only up to a point, after which performance decays. Furthermore, the results show that a max-pooling layer can be more helpful than average-pooling. This again comes back to the noisy nature of the task: noise effects can be smoothed away by average-pooling [43], while max-pooling tends to select the noisier values at the pooling layer, preserving the cue that distinguishes replayed speech.
TABLE IV
CQCC dimension effects on the performance of the proposed system with different CNN configurations, noted as "Siamese Configs."

                    Config. 1        Config. 2        Config. 3
Set    CQCC Dims.   EER%   t-DCF    EER%   t-DCF    EER%   t-DCF
Dev.   30           2.17   0.0411   2.35   0.0437   1.96   0.0391
       60           1.29   0.0240   1.84   0.0319   1.65   0.0324
       90           -      -        -      -        -      -
       120          1.46   0.0247   1.93   0.0386   1.24   0.0207
Eval.  30           6.80   0.1550   7.12   0.1649   5.66   0.1371
       60           4.19   0.1072   5.57   0.1323   3.78   0.0883
       90           -      -        -      -        -      -
       120          4.73   0.1251   6.29   0.1528   4.11   0.0854
D. Autoencoder Performance
Fig. 6 and Fig. 7 show the effect of using the autoencoder in our proposed system. Here, the autoencoder reduces the dimensionality of the 90-dimensional CQCC. As shown in Figs. 6 and 7, the best results are obtained with a 70-dimensional autoencoder bottleneck for each Siamese configuration. As before, the best results belong to Siamese Config. 3, which outperforms the baseline system by 10.72% and 0.2344 in terms of EER and t-DCF, respectively.
Discussion:
Figs. 6 and 7 show consistent behavior across the three configurations analyzed in the previous discussion, but they also show that the proposed approach performs better with larger bottleneck sizes. The purpose of the autoencoder is to account for the impact of noise, but attaching too much importance to the noise effect means discarding valuable information, which increases the EER. There is therefore a trade-off: the more the noise is emphasized, the more information is lost, and vice versa. Consequently, there is an optimum bottleneck size at which most of the noise is captured while the informative values are still preserved, which is 70 in this case.
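As a rough sketch of this dimensionality reduction step, a minimal linear autoencoder with the paper's 90-to-70 bottleneck can be written in plain NumPy; the single linear layer per side, the initialization, and the learning rate are assumptions for illustration, not the paper's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

D_IN, D_BOTTLENECK = 90, 70   # CQCC dimension and bottleneck size from the paper

# randomly initialized encoder/decoder weights (one linear layer each)
W_enc = rng.normal(0.0, 0.05, (D_IN, D_BOTTLENECK))
W_dec = rng.normal(0.0, 0.05, (D_BOTTLENECK, D_IN))

def encode(x):
    return x @ W_enc          # 90-dim features -> 70-dim bottleneck

def decode(z):
    return z @ W_dec          # 70-dim bottleneck -> 90-dim reconstruction

def train_step(X, lr=0.01):
    """One gradient-descent step on the squared reconstruction error."""
    global W_enc, W_dec
    Z = encode(X)
    err = decode(Z) - X                      # reconstruction residual
    grad_dec = Z.T @ err / len(X)
    grad_enc = X.T @ (err @ W_dec.T) / len(X)
    W_dec -= lr * grad_dec
    W_enc -= lr * grad_enc
    return float(np.mean(err ** 2))
```

After training, `encode` supplies the 70-dimensional features passed on to the Siamese classifier.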
Fig. 6. Effects of dimension reduction by the autoencoder on EER for the evaluation set with CQCC vector size of 90 (Siamese Configs. 1-3; autoencoder bottleneck dimensions 20-80).
Table V summarizes the best results obtained by the different configurations of the proposed system and by the baseline systems.
Fig. 7. Effects of dimension reduction by the autoencoder on t-DCF for the evaluation set with CQCC vector size of 90 (Siamese Configs. 1-3; autoencoder bottleneck dimensions 20-80).

TABLE V
Comparison of the best results obtained by different Siamese network configurations with the baseline systems (CQCC vector size = 90, autoencoder bottleneck size = 70).

                                      Dev.             Eval.
System                                EER%   t-DCF     EER%   t-DCF
Baseline 1 (LFCC + GMM)               11.96  0.2554    13.54  0.3017
Baseline 2 (CQCC + GMM)               -      -         -      -
Siamese   Conf. 1  CQCC               1.98   0.1529    3.27   0.0745
Network            CQCC + AE          -      -         -      -
Configs.  Conf. 2  CQCC               2.40   0.0448    4.13   0.1086
                   CQCC + AE          -      -         -      -
          Conf. 3  CQCC               1.77   0.0212    3.02   0.0627
                   CQCC + AE          -      -         -      -
E. Performance with Different Training Data Sizes
To measure the performance of the proposed system with different amounts of training data, we trained the system on different sizes of the training set. We randomly divided the training set into five equal folds, without considering gender or number of speakers, and enlarged the training set fold by fold at each step. For each step, the system was trained exactly as in the training phase. Fig. 8 and Fig. 9 show the effects of the different training set sizes. As expected, the results improved as the volume of training data increased. It is noteworthy that the proposed system achieved results comparable to the baseline system on the evaluation set using only 60% of the training data, and with 80% of the data it outperforms the baseline system.
Discussion:
The results in this part show better outcomes with more data, which is expected: more training data helps the classifier model the data distribution, leading to better predictions on test samples. However, using only 60% of the data as the training set already yields results comparable to the baseline system, and 80% of the data gives a significant improvement over the baseline. Another important point is that the proposed system achieves impressive results with less training data, making it more robust to missing data: it can be applied to tasks with limited training data, and since data scarcity is a serious issue, this is a strong advantage of the proposed approach.
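The fold-by-fold enlargement of the training set can be sketched as follows; the random, speaker-agnostic split matches the description above, while the function name is ours:

```python
import random

def cumulative_folds(train_set, n_folds=5, seed=0):
    """Split the training set into equal random folds (ignoring gender and
    number of speakers, as in the paper) and yield cumulatively larger
    subsets: 20%, 40%, ..., 100% of the data."""
    data = list(train_set)
    random.Random(seed).shuffle(data)
    fold_size = len(data) // n_folds
    for k in range(1, n_folds + 1):
        yield data[: k * fold_size]
```

Each yielded subset contains all previous folds, so the system at 40% is trained on a strict superset of the 20% data.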
Fig. 8. Effects of using different sizes of the training set (20%-80%) on the EER of the proposed system (Siamese Configs. 1-3, 90-dimensional CQCC, 70-dimensional AE bottleneck).
Fig. 9. Effects of using different sizes of the training set (20%-80%) on the t-DCF of the proposed system (Siamese Configs. 1-3, 90-dimensional CQCC, 70-dimensional AE bottleneck).
F. Effect of Siamese Networks on Performance
To evaluate the contribution of the Siamese networks, the results of the employed Siamese networks and of a single CNN are compared in Table VI. The hyper-parameters of the CQCC and the autoencoder are tuned to the best-performing setting: the CQCC dimension is fixed at 90, and the autoencoder bottleneck size at 70. Three configurations are used in each case. The results indicate that the Siamese networks are effective, improving the results compared with a single CNN.
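The pairwise objective that distinguishes a Siamese network from a single CNN can be sketched with the standard contrastive loss of Chopra et al. [40]; the margin value here is illustrative, not taken from the paper:

```python
import numpy as np

def contrastive_loss(z1, z2, same, margin=1.0):
    """Contrastive loss on a pair of embeddings: pulls same-class pairs
    together and pushes different-class pairs at least `margin` apart.
    `same` is 1 for a matched pair, 0 otherwise."""
    d = np.linalg.norm(z1 - z2)
    return same * d ** 2 + (1 - same) * max(margin - d, 0.0) ** 2
```

A single CNN instead learns a per-sample class posterior; the Siamese formulation learns distances between samples, which is the extra step discussed below.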
TABLE VI
Effect of the employed Siamese network on performance (CQCC vector size = 90, bottleneck size = 70).

                     Siamese Network     CNN
Set     Config.      EER    t-DCF        EER    t-DCF
Dev.    Conf. 1      0.23   0.0027       0.94   0.0212
        Conf. 2      0.29   0.0196       1.21   0.0403
        Conf. 3      -      -            -      -
Eval.   Conf. 1      0.80   0.0208       3.25   0.1339
        Conf. 2      0.91   0.0278       5.11   0.2726
        Conf. 3      -      -            -      -
Discussion:
Siamese networks have shown promising results in various classification tasks, but their application has largely been restricted to tasks with an unlimited number of classes, so their effectiveness in tasks like this one has been overlooked. However, as Table VI indicates, the Siamese networks improve the performance of the system. Although the single CNN yields satisfactory results, it shows weaknesses in stability: its performance varies with the configuration, and its results on the evaluation set are much worse than on the development set. This is mostly because both classes are trained in one network, which shows some evidence of over-fitting. In addition, since the network must discriminate between samples directly, it has difficulty handling varied inputs, especially when average-pooling removes the noise. The Siamese networks, in contrast, learn the difference between the generated outputs, an extra step beyond plain classification, so even with average-pooling the model is still learned through the final layer; the average-pooling nevertheless removes noise and causes some drop in classification performance.

VII. CONCLUSION

We proposed a novel replay spoofing countermeasure system for ASVs based on the physical access portion of the ASVspoof 2019 dataset. Different configurations of CNNs in a Siamese network structure were investigated for classification. Moreover, an autoencoder was employed to resolve the dispersion problem of the well-known CQCC features. The experimental results confirmed the high efficiency of the Siamese network for spoofing detection. Additionally, the autoencoder could effectively capture and utilize noise information while eliminating redundant and irrelevant information in the CQCC features, resulting in outperforming the baseline system. The results also show that the proposed method achieves comparable performance with a smaller amount of training data (only 60% of the training set), which reflects the effectiveness and scalability of our system.

For future work, an integrated framework can be proposed to identify each of the four general attacks on ASV systems and apply the appropriate countermeasure for each attack. This is a necessary task, since every ASV system also faces other types of attacks and must be able to confront each of them.
REFERENCES
[1] A. K. Jain, A. Ross, and S. Pankanti, “Biometrics: a tool for information security,” IEEE Transactions on Information Forensics and Security, vol. 1, no. 2, pp. 125–143, 2006.
[2] T. Kinnunen and H. Li, “An overview of text-independent speaker recognition: From features to supervectors,” Speech Communication, vol. 52, no. 1, pp. 12–40, 2010.
[3] T. B. Amin, P. Marziliano, and J. S. German, “Glottal and vocal tract characteristics of voice impersonators,” IEEE Transactions on Multimedia, vol. 16, no. 3, pp. 668–678, 2014.
[4] T. Masuko, K. Tokuda, and T. Kobayashi, “Imposture using synthetic speech against speaker verification based on spectrum and pitch,” in Sixth International Conference on Spoken Language Processing, 2000.
[5] K. A. Lee, B. Ma, and H. Li, “Speaker verification makes its debut in smartphone,” IEEE Signal Processing Society Speech and Language Technical Committee Newsletter, 2013.
[6] Y. W. Lau, M. Wagner, and D. Tran, “Vulnerability of speaker verification to voice mimicking,” in Proceedings of 2004 International Symposium on Intelligent Multimedia, Video and Speech Processing. IEEE, 2004, pp. 145–148.
[7] A. Eriksson and P. Wretling, “How flexible is the human voice? A case study of mimicry,” in Fifth European Conference on Speech Communication and Technology, 1997.
[8] Y. Stylianou, “Voice transformation: a survey.” IEEE, 2009, pp. 3585–3588.
[9] N. Evans, F. Alegre, Z. Wu, and T. Kinnunen, “Anti-spoofing, voice conversion,” Encyclopedia of Biometrics, pp. 115–122, 2015.
[10] J. Lindberg and M. Blomberg, “Vulnerability in speaker verification: a study of technical impostor techniques,” in Sixth European Conference on Speech Communication and Technology, 1999.
[11] J. Villalba and E. Lleida, “Speaker verification performance degradation against spoofing and tampering attacks,” in FALA Workshop, 2010, pp. 131–134.
[12] S. Kaavya, V. Sethu, P. N. Le, and E. Ambikairajah, “Investigation of sub-band discriminative information between spoofed and genuine speech,” in Interspeech, 2016, pp. 1710–1714.
[13] S. Kaavya, V. Sethu, and E. Ambikairajah, “Deep Siamese architecture based replay detection for secure voice biometric,” in Interspeech, 2018, pp. 671–675.
[14] S. Bell and K. Bala, “Learning visual similarity for product design with convolutional neural networks,” ACM Transactions on Graphics (TOG), vol. 34, no. 4, p. 98, 2015.
[15] M. Adiban, H. Sameti, N. Maghsoodi, and S. Shahsavari, “SUT system description for Anti-Spoofing 2017 Challenge,” in Proceedings of the 29th Conference on Computational Linguistics and Speech Processing (ROCLING 2017), 2017, pp. 264–275.
[16] S. Yin, C. Liu, Z. Zhang, Y. Lin, D. Wang, J. Tejedor, T. F. Zheng, and Y. Li, “Noisy training for deep neural networks in speech recognition,” EURASIP Journal on Audio, Speech, and Music Processing, vol. 2015, no. 1, pp. 1–14, 2015.
[17] M. Sun, X. Zhang, and T. F. Zheng, “Unseen noise estimation using separable deep auto encoder for speech enhancement,” IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), vol. 24, no. 1, pp. 93–104, 2016.
[18] W.-N. Hsu, Y. Zhang, and J. Glass, “Unsupervised domain adaptation for robust speech recognition via variational autoencoder-based data augmentation.” IEEE, 2017, pp. 16–23.
[19] J. Villalba and E. Lleida, “Detecting replay attacks from far-field recordings on speaker verification systems,” in European Workshop on Biometrics and Identity Management. Springer, 2011, pp. 274–285.
[20] Z. Ji, Z.-Y. Li, P. Li, M. An, S. Gao, D. Wu, and F. Zhao, “Ensemble learning for countermeasure of audio replay spoofing attack in ASVspoof2017,” in Interspeech, 2017, pp. 87–91.
[21] Z. Wu, S. Gao, E. S. Cling, and H. Li, “A study on replay attack and anti-spoofing for text-dependent speaker verification,” in Signal and Information Processing Association Annual Summit and Conference (APSIPA), 2014, pp. 1–5.
[22] W. Shang and M. Stevenson, “Score normalization in playback attack detection.” IEEE, 2010, pp. 1678–1681.
[23] T. Kinnunen, Z.-Z. Wu, K. A. Lee, F. Sedlak, E. S. Chng, and H. Li, “Vulnerability of speaker verification systems against voice conversion spoofing attacks: The case of telephone speech.” IEEE, 2012, pp. 4401–4404.
[24] Z.-F. Wang, G. Wei, and Q.-H. He, “Channel pattern noise based playback attack detection algorithm for speaker recognition,” vol. 4. IEEE, 2011, pp. 1708–1713.
[25] H.-J. Shim, J.-W. Jung, H.-S. Heo, S.-H. Yoon, and H.-J. Yu, “Replay spoofing detection system for automatic speaker verification using multi-task learning of noise classes.” IEEE, 2018, pp. 172–176.
[26] M. J. Alam, G. Bhattacharya, and P. Kenny, “Boosting the performance of spoofing detection systems on replay attacks using q-logarithm domain feature normalization,” in Proc. Odyssey 2018 The Speaker and Language Recognition Workshop, 2018, pp. 393–398.
[27] G. Lavrentyeva, S. Novoselov, E. Malykh, A. Kozlov, O. Kudashev, and V. Shchemelinin, “Audio replay attack detection with deep learning frameworks,” in Interspeech, 2017, pp. 82–86.
[28] B. Wickramasinghe, S. Irtza, E. Ambikairajah, and J. Epps, “Frequency domain linear prediction features for replay spoofing attack detection,” in Interspeech, 2018, pp. 661–665.
[29] H. Sailor, M. Kamble, and H. Patil, “Auditory filterbank learning for temporal modulation features in replay spoof speech detection,” in Interspeech, 2018, pp. 666–670.
[30] W. Cai, H. Wu, D. Cai, and M. Li, “The DKU replay detection system for the ASVspoof 2019 challenge: On data augmentation, feature representation, classification, and fusion,” arXiv preprint arXiv:1907.02663, 2019.
[31] R. Li, M. Zhao, Z. Li, L. Li, and Q. Hong, “Anti-spoofing speaker verification system with multi-feature integration and multi-task learning,” Proc. Interspeech 2019, pp. 1048–1052, 2019.
[32] M. Alzantot, Z. Wang, and M. B. Srivastava, “Deep residual neural networks for audio spoofing detection,” arXiv preprint arXiv:1907.00501, 2019.
[33] R. K. Das, J. Yang, and H. Li, “Long range acoustic features for spoofed speech detection,” 2019.
[34] G. Lavrentyeva, S. Novoselov, A. Tseren, M. Volkova, A. Gorlanov, and A. Kozlov, “STC antispoofing systems for the ASVspoof2019 challenge,” arXiv preprint arXiv:1904.05576, 2019.
[35] B. Chettri, D. Stoller, V. Morfi, M. A. M. Ramírez, E. Benetos, and B. L. Sturm, “Ensemble models for spoofing detection in automatic speaker verification,” arXiv preprint arXiv:1904.04589, 2019.
[36] M. Todisco, H. Delgado, and N. W. D. Evans, “Articulation rate filtering of CQCC features for automatic speaker verification,” in Interspeech, 2016, pp. 3628–3632.
[37] J. C. Brown, “Calculation of a constant Q spectral transform,” The Journal of the Acoustical Society of America, vol. 89, no. 1, pp. 425–434, 1991.
[38] C. Schörkhuber and A. Klapuri, “Constant-Q transform toolbox for music processing,” 2010, pp. 3–64.
[39] M. Todisco, H. Delgado, and N. Evans, “A new feature for automatic speaker verification anti-spoofing: Constant Q cepstral coefficients,” in Speaker Odyssey Workshop, Bilbao, Spain, vol. 25, 2016, pp. 249–252.
[40] S. Chopra, R. Hadsell, and Y. LeCun, “Learning a similarity metric discriminatively, with application to face verification,” in Proc. of Computer Vision and Pattern Recognition. IEEE, 2005, pp. 539–546.
[41] J. Bromley, J. W. Bentz, L. Bottou, I. Guyon, Y. LeCun, C. Moore, E. Säckinger, and R. Shah, “Signature verification using a siamese time delay neural network,” International Journal of Pattern Recognition and Artificial Intelligence, vol. 7, pp. 669–688, 1993.
[42] J. Yang, Z. Lei, and S. Z. Li, “Learn convolutional neural network for face anti-spoofing,” arXiv preprint arXiv:1408.5601, 2014.
[43] Z. Ma, Y. Ding, B. Li, and X. Yuan, “Deep CNNs with robust LBP guiding pooling for face recognition,” Sensors, vol. 18, no. 11, p. 3876, 2018.
[44] X. Wu, R. He, and Z. Sun, “A lightened CNN for deep face representation,” arXiv preprint arXiv:1511.02683, vol. 4, no. 8, 2015.
[45] R. K. Srivastava, K. Greff, and J. Schmidhuber, “Highway networks,” arXiv preprint arXiv:1505.00387, 2015.
[46] K. Ahrabian and B. Babaali, “Usage of autoencoders and siamese networks for online handwritten signature verification,” Neural Computing and Applications, pp. 1–14, 2018.
[47] M. Todisco, X. Wang, V. Vestman, M. Sahidullah, H. Delgado, A. Nautsch, J. Yamagishi, N. Evans, T. Kinnunen, and K. A. Lee, “ASVspoof 2019: Future horizons in spoofed and fake audio detection,” arXiv preprint arXiv:1904.05441, 2019.
[48] Z. Wu, T. Kinnunen, N. Evans, J. Yamagishi, C. Hanilçi, M. Sahidullah, and A. Sizov, “ASVspoof 2015: the first automatic speaker verification spoofing and countermeasures challenge,” in Sixteenth Annual Conference of the International Speech Communication Association, 2015.
[49] T. Kinnunen, M. Todisco, N. Evans, J. Yamagishi, and K. A. Lee, “The ASVspoof 2017 Challenge: Assessing the limits of replay spoofing attack detection,” in Interspeech, 2017, pp. 2–6.
[50] T. Kinnunen, K. A. Lee, H. Delgado, N. Evans, M. Todisco, M. Sahidullah, J. Yamagishi, and D. A. Reynolds, “t-DCF: a detection cost function for the tandem assessment of spoofing countermeasures and automatic speaker verification,” arXiv preprint arXiv:1804.09618, 2018.
[51] M. Sahidullah, T. Kinnunen, and C. Hanilçi, “A comparison of features for synthetic speech detection,” in Interspeech, 2015, pp. 2087–2091.
[52] T. Kinnunen, N. Evans, J. Yamagishi, K. A. Lee, M. Sahidullah, M. Todisco, and H. Delgado, “ASVspoof 2019: Automatic Speaker Verification Spoofing and Countermeasures Challenge Evaluation Plan.”