Multi-Task Siamese Neural Network for Improving Replay Attack Detection
Patrick von Platen, Fei Tao, Gokhan Tur
Uber AI / RWTH Aachen University

ABSTRACT
Automatic speaker verification systems are vulnerable to audio replay attacks, which bypass security by replaying recordings of authorized speakers. Replay attack (RA) detection systems built upon Residual Neural Networks (ResNets) have yielded astonishing results on the public benchmark ASVspoof 2019 Physical Access challenge. With most teams using fine-tuned feature extraction pipelines and model architectures, however, the generalizability of such systems remains questionable. In this work, we analyse the effect that discriminative feature learning in a multi-task learning (MTL) setting can have on the generalizability and discriminability of RA detection systems. We use a popular ResNet architecture optimized by the cross-entropy criterion as our baseline and compare it to the same architecture optimized by MTL using Siamese Neural Networks (SNN). We show that SNN outperform the baseline by a relative 26.8% Equal Error Rate (EER). We further enhance the model's architecture and demonstrate that SNN with an additional reconstruction loss yield another significant improvement of a relative 13.8% EER.
Index Terms — replay attack detection, siamese neural networks, multi-task learning, discriminative feature learning
1. INTRODUCTION
Automatic speaker verification (ASV) systems are nowadays increasingly used for various applications. However, ASV systems are vulnerable to audio spoofing attacks, which attempt to gain unauthorized access by manipulating the audio input. One of the most popular and effective audio spoofing attacks is the replay attack (RA). In an RA, the attacker fools the ASV system by replaying a recording of an authorized speaker. Considering how effective and cheap RAs are, it is necessary in practice to augment an ASV system with an RA detection system.

The public benchmark ASVspoof initiative started with the ASVspoof 2015 challenge, which dealt with text-to-speech and voice conversion spoofing attacks [1]. ASVspoof 2017 [2] was the first challenge concerned with RA detection and thus created a benchmark data set consisting of voice command recordings. ASVspoof 2019 [3] then introduced a much larger corpus of longer and text-independent recordings for RA detection.

The performance of RA detection systems has been thought to be highly dependent on their input feature processing [4]. Correspondingly, earlier work has largely dealt with handcrafted feature processing, and it has been found that high-frequency and phase information can be helpful for RA detection (e.g. in [5, 6]). Popular input features that emerged include linear frequency cepstral coefficients (LFCC) [7] and group delay (GD) grams [8]. In recent years, input features derived from shorter handcrafted feature processing pipelines, such as the log power magnitude spectra (LOGSPEC) [9], attracted more interest. In contrast to LFCC, LOGSPEC preserves much more of the information present in the original raw signal and thus relies on deep neural networks (DNNs) as powerful feature extractors [9, 10, 11, 12, 13]. Overall, there is currently no conclusive consensus about the best input feature for RA detection.

As the quality of recording and replaying devices is getting better, detecting the difference between genuine and spoofed audio is becoming more difficult. Thus, it becomes necessary to improve the discriminability and generalizability of RA detection systems. Besides common regularization techniques like data augmentation and Dropout (cf. [10, 13]), multiple teams have used discriminative loss functions and multi-task learning (MTL) [14] for better feature discrimination and generalization (cf. [9, 12, 15]).

Siamese Neural Networks (SNN) [16] have been shown to significantly improve the discriminability and generalizability of models [17]. In this paper, we propose to use SNN in an MTL setting for RA detection. More generally, we investigate to what extent adding discriminative loss functions in an MTL setting can improve the performance of RA detection systems on the ASVspoof 2019 challenge Physical Access (PA) data. The analysis is conducted on multiple input features. We make sure that none of the systems rely on additional data and labels and that all of our settings follow a real-world application implementation. Our main contributions include: 1) proposal of SNN in an MTL setting for improved discriminability and generalizability of RA detection systems; 2) extensive analysis of discriminative loss functions on multiple input features; 3) enhancement of a popular architecture for RA detection with second-order statistics pooling; 4) combination of reconstruction loss (ReL) with SNN in an MTL setting.
2. RELATED WORK
Convolutional neural networks (CNNs) and especially deep residual neural networks (ResNets) [18] have yielded state-of-the-art performance on the ASVspoof 2019 PA data set [9, 10]. To deal with a much smaller data set than the one ResNet was originally designed for, [9, 10, 15] significantly reduce the size of their models by scaling down the number of kernels employed in each of the CNN layers. A key component in the architecture of their models is the projection of ResNet's three-dimensional tensor output to a one-dimensional vector for further binary classification. In [15], a recurrent layer processes the tensor along the time dimension and outputs the last hidden state. A simpler and apparently more effective approach is to use a global average pooling (GAP) layer instead [9, 10]. Given the success of ResNet with GAP, we use this architecture as our baseline in this study. In other fields of research it has been shown that using second-order statistics in addition to first-order statistics yields better feature embeddings for utterance-level classification tasks, e.g. in [19]. This led us to extend the GAP layer to additionally perform variance pooling.

In [15], MTL has been applied to RA detection in the form of center loss (CL), which has been shown to greatly improve the discriminability of a model [20]. CL is comprised of the cross-entropy (CE) loss and the intra-class variance loss of the feature embeddings, weighted by a hyperparameter to control the intra-class compactness [20]. SNN are known to significantly improve the discriminability and generalizability of a model [17] and have been found to be effective for similarity assessment in computer vision [21]. By using a pair of input features during training, SNN simultaneously increase the inter-class variance of the embedded input features while decreasing their intra-class variance.
Since CL can be seen as a special case of SNN (the centroid used in CL can be seen as one of the inputs in the input pair used for SNN), SNN are expected to better improve the discriminability of the model. This inspired us to propose SNN in an MTL setting for RA detection.

Another loss function which is easily applicable in the MTL setting is ReL. ReL is an unsupervised loss function and is usually employed in autoencoders to improve the network's ability to maintain the most distinctive information about the input features in compressed form. When added to a standard CE loss function, ReL can act as an effective regularizer by encouraging the network to learn robust feature embeddings [22].
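The relationship between CL and SNN can be made concrete with a small numpy toy example (our own illustration with made-up embeddings, not code from any of the cited systems): the CL term only pulls embeddings toward their class centroid, while a siamese distance loss additionally sees negative pairs and can push the two classes apart.

```python
import numpy as np

# Toy 2-D embeddings for two classes (rows = samples); values are made up.
genuine = np.array([[1.0, 0.0], [0.9, 0.1]])
spoofed = np.array([[0.0, 1.0], [0.1, 0.9]])

def center_loss(embs):
    """Intra-class variance term of CL: mean squared distance to the centroid."""
    centroid = embs.mean(axis=0)
    return float(np.mean(np.sum((embs - centroid) ** 2, axis=1)))

def pair_sq_dist(e1, e2):
    """Distance seen by a siamese loss for one (here: negative) pair."""
    return float(np.sum((e1 - e2) ** 2))

# CL only measures intra-class spread ...
intra = center_loss(genuine) + center_loss(spoofed)
# ... while SNN also measures (and can maximize) inter-class distances.
inter = pair_sq_dist(genuine[0], spoofed[0])
```

Treating the centroid as the fixed second input of a pair recovers the CL term from the siamese distance, which is the sense in which CL is a special case of SNN.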
3. PROPOSED APPROACH

3.1. Audio Preprocessing & Feature Extraction
In a real-world application, the utterance input can be considered as a continuous buffer of audio input. We set the buffer size to l_b seconds to keep the audio processing step simple and easy to deploy. Therefore, all utterances are cut or zero-padded to have a maximum length of l_b seconds, and only utterance-level input is considered.

The models are tested on three input features: linear frequency filterbank features (LFBANK), LOGSPEC and GD grams. LFBANK corresponds to the conventionally used LFCC features without the discrete cosine decorrelation step. We chose to leave out this decorrelation step because neural networks are known to act as excellent decorrelators.

We used modified GD grams as defined in Eqs. (28) and (29) in [8] with α = 0. and γ = 0., because the GD grams as formalized in [6] and [10] did not yield any reasonable results in our experiments. For all input features, the short-time Fourier transform employed a window size of 50 ms and a window shift of 15 ms. LFBANK subsequently applies 80 filters (cf. [23]) without any delta coefficients. The resulting input dimensions for GD gram/LOGSPEC and LFBANK differ accordingly. Confirming the observations made in [9] and [23], the input features are not normalized, but simply scaled down to the range from −1 to 1.

3.2. Model Architecture

Similar to [10], the RA detection system is built upon a "thin" 34-layer ResNet, which is presented in detail in Table 1. The ResNet blocks (i.e. Res1 - Res4) employ the "full pre-activation" residual unit proposed in [24]. Due to differences in the input dimensions between LOGSPEC/GD gram and LFBANK, slightly different stride kernels are used (cf. Table 1). The ResNet network is followed by a
GAP layer as explained in Eq. (1) in [10]. Extending GAP to the retrieval of second-order statistics, we define a global average and variance pooling (GAVP) layer that extracts both the mean and the variance from all feature maps of ResNet's last CNN layer. To keep the number of parameters constant, the pooling layer is followed by a dense layer whose size depends on whether GAP or GAVP is employed. The final dense layer (called "Out" in Fig. 1) following the GAP or GAVP layer has a single output neuron with a Sigmoid activation function, yielding the probability of the input feature being classified as spoofed. All layers except the pooling and final layer make use of the Rectified Linear Unit (ReLU). The model has about 1.34 million trainable parameters.

Table 1. Architecture of the "thin" 34-layer ResNet.

Layer | Kernel             | Filters | Stride (GD/LOGSPEC) | Stride (LFBANK)
Conv1 | [3 × 3]            | 16      | [2 × ]              | [2 × ]
Res1  | [3 × 3] conv pairs | 16      | [2 × ]              | [1 × ]
Res2  | [3 × 3] conv pairs | 32      | [2 × ]              | [1 × ]
Res3  | [3 × 3] conv pairs | 64      | [1 × ]              | [2 × ]
Res4  | [3 × 3] conv pairs | 128     | [1 × ]              | [2 × ]

3.3. Siamese Neural Networks in an MTL Setting

SNN are made of two sub-networks which share the same set of trainable parameters, so that a pair of input features X_1, X_2 is used as input during training. Besides computing the conventional CE loss for each sub-network individually (i.e. L_CE1, L_CE2), a distance loss L_SNN between the feature embeddings (i.e. e_1, e_2) of the two sub-networks is calculated (cf. Fig. 1). A common choice for L_SNN is the hinge loss (cf. [25]):

L_SNN = E_{(X_1,y_1),(X_2,y_2) ~ D×D} [ max(0, m − l_d · d(e_1, e_2)) ],   (1)

where m represents the margin, l_d equals 1 if the input feature labels y_1, y_2 are equal and −1 otherwise, and d(e_1, e_2) is a distance metric of choice, for which we empirically found the cosine similarity to work best. During training, SNN then aim at minimizing the sum of L_CE1, L_CE2 and L_SNN, where each loss contributes with equal weight.

Optionally, two ReLs (L_ReL1, L_ReL2) - one for each sub-network - can be added to the overall loss.
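A minimal numpy sketch of the per-pair term of Eq. (1), using cosine similarity as d (our own illustration; the function names and the margin value are placeholders, not the paper's code):

```python
import numpy as np

def cosine_sim(e1, e2):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(e1, e2) / (np.linalg.norm(e1) * np.linalg.norm(e2)))

def snn_hinge_loss(e1, e2, y1, y2, margin=0.5):
    """Per-pair hinge loss of Eq. (1): l_d = +1 for same-label pairs, -1 otherwise.

    Same-label pairs are pushed toward high similarity; different-label pairs
    incur loss until their similarity drops below -margin.
    """
    l_d = 1.0 if y1 == y2 else -1.0
    return max(0.0, margin - l_d * cosine_sim(e1, e2))
```

During training, the expectation in Eq. (1) is approximated by averaging this per-pair term over a minibatch of sampled pairs and adding the two CE losses.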
In this case a shared decoder (with a negligible number of parameters) is used to reconstruct the pair of input features X_1, X_2 from the outputs O_1, O_2 of the last convolutional layer:

L_ReL,i = | X_i − Decoder(O_i) |_F ,  ∀ i ∈ {1, 2},   (2)

with |·|_F being the Frobenius norm. The decoder consists of three consecutive deconvolution layers, each of which upsamples its input using a stride of [2 × ] and employs kernels of size [3 × ]. The outputs of Decoder(O_1) and Decoder(O_2) are "mean-pooled" over their output feature maps and finally zero-padded to have exactly the same dimensions as their respective input feature matrices. The complete architecture of the SNN is illustrated in Fig. 1.

As can be noted from Eq. (1), the space of possible training samples for SNN includes all pair-wise combinations of D with itself, which is prohibitively large. A simple remedy taken in this study is to control the data set's size via a hyperparameter numSamples. Before every epoch, a data set D_SNN is created by the following simple but effective sampling procedure:

function CreateSNNDataSet(numSamples)
    D_pos, D_neg ← SplitByLabelAndShuffle(D)
    c_pos, c_neg ← 0
    for i in 1 to numSamples do
        for j in 1 to 2 do
            D_r, c_r ← Rand((D_pos, c_pos), (D_neg, c_neg))
            X_j, y_j ← D_r[c_r], (D_r == D_pos)
            c_r ← (c_r + 1) mod Len(D_r)
        D_SNN[i] ← ((X_1, y_1), (X_2, y_2))
    return D_SNN
First, no data sample in D_pos (or D_neg, respectively) is used twice before every other data sample has been sampled at least once, which ensures almost surely that all data samples are used per epoch when numSamples is set accordingly. Second, it is ensured that the space of possible sample pairs D × D is vastly explored by shuffling the order of D_pos and D_neg before every epoch. Third, by choosing to sample from D_pos or D_neg with even probability, the smaller of the two data sets is upsampled so that D_SNN is balanced.
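The sampling procedure above can be sketched in Python as follows (a re-implementation in spirit; the function and variable names are ours, and the data layout — a list of (feature, label) tuples with label 1 meaning genuine — is an assumption):

```python
import random

def create_snn_dataset(data, num_samples, seed=None):
    """Balanced pair sampling for SNN training; call before every epoch.

    `data` is a list of (feature, label) tuples, label 1 = genuine, 0 = spoofed.
    Returns num_samples pairs of (feature, label) tuples.
    """
    rng = random.Random(seed)
    pools = {
        True:  [x for x, y in data if y == 1],   # genuine (D_pos)
        False: [x for x, y in data if y == 0],   # spoofed (D_neg)
    }
    for pool in pools.values():
        rng.shuffle(pool)                        # explore D x D anew each epoch
    cursors = {True: 0, False: 0}

    dataset = []
    for _ in range(num_samples):
        pair = []
        for _ in range(2):
            is_pos = rng.random() < 0.5          # even odds upsample the smaller pool
            pool, c = pools[is_pos], cursors[is_pos]
            pair.append((pool[c], int(is_pos)))
            cursors[is_pos] = (c + 1) % len(pool)  # round-robin: no repeats before a full pass
        dataset.append(tuple(pair))
    return dataset
```

The round-robin cursors implement the first property above, the per-epoch shuffle the second, and the fair coin flip between the two pools the third.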
Fig. 1. A sketch of SNN for MTL showing the loss functions L_CE1, L_CE2, L_SNN and L_ReL1, L_ReL2 for RA detection.
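The GAVP pooling that appears in Fig. 1 can be sketched as follows (a numpy illustration of the pooling step only, assuming a (time, frequency, channels) feature-map tensor):

```python
import numpy as np

def gavp(feature_maps):
    """Global average and variance pooling: concatenates the per-channel
    mean and variance over the time/frequency axes, so C feature maps
    yield a 2C-dimensional embedding (GAP alone would yield C)."""
    mean = feature_maps.mean(axis=(0, 1))
    var = feature_maps.var(axis=(0, 1))
    return np.concatenate([mean, var])
```

The doubled embedding size is why the subsequent dense layer must shrink when GAVP replaces GAP in order to keep the parameter count constant.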
4. EXPERIMENTAL SETUP AND RESULTS
In all experiments, the models were evaluated on the PA subset of the ASVspoof 2019 corpus [3]. PA consists of 48600 spoofed and 5400 genuine utterances in the training (train) data set, 24300 spoofed and 5400 genuine utterances in the development (dev) data set and 116640 spoofed and 18090 genuine utterances in the evaluation (eval) data set. The models were optimized by Adam, with the learning rate and the weight decay tuned for each experiment separately. Training was stopped if the equal error rate (EER) on the dev data set did not improve over 15 consecutive epochs. The models were implemented with the Keras framework [26].

First, we analysed the effect of the audio input length on the performance of a simplified ResNet model using LFBANK on the eval set. We noticed that increasing the input length l_b from 5.0s to 6.5s and eventually to 8.5s improved the EER from 9.31% to 6.75% to finally 6.22%. In this experiment, we simply cut or padded the end of the audio to the specific length. Based on existing literature (e.g. [23]), it can be explained that the leading and trailing silence cues can lead to better performance. Considering these findings and our practical application, we decided to use an 8.5s input length and to do cutting and padding at the end of the audio from now on (so that we do not rely on voice activity detection in practical applications).

We then analysed the proposed model architecture with GAP [10]. As one baseline, the model was trained using the simple CE loss (L_CE) [15], which is abbreviated as M_CE^GAP. As another baseline, the model was trained using the CL loss (L_CE + γ L_CL), which is abbreviated as M_CL^GAP; for M_CL^GAP, we tuned γ for the best results. The baselines are compared to SNN as described in Section 3.3, which we abbreviate as M_SNN^GAP; for M_SNN^GAP, numSamples and the margin m were set to fixed values. The same batch size was used in all training setups. M_CE^GAP, M_CL^GAP and M_SNN^GAP were all evaluated using LFBANK, LOGSPEC and GD gram as input. In a final step, the systems were systematically fused by means of logistic regression with the Bosaris toolkit [27], using the dev data set for calibration.

Due to the 9-to-1 data imbalance in the training set, we adopted the weighted CE loss for M_CE^GAP and M_CL^GAP, with a reduced CE weight for spoofed utterance input. To improve training stability, the bias of the output neuron was initialized to log(9) (cf. [28]) if weighted CE was used. The results can be seen in Table 2.

Table 2. Comparison of M_CE^GAP, M_CL^GAP and M_SNN^GAP for different input features. Results are reported in % EER.

Model     | Loss                  | Input Feature | Dev  | Eval
M_CE^GAP  | L_CE                  | LFBANK        | 3.70 | 5.17
          |                       | GD Gram       | 6.20 | 8.63
          |                       | LOGSPEC       | 1.98 | 2.79
          |                       | Fused         | -    | 2.22
M_CL^GAP  | L_CE + γ L_CL         | LFBANK        | 2.76 | 4.06
          |                       | GD Gram       | 4.44 | 7.13
          |                       | LOGSPEC       | 1.37 | 2.33
          |                       | Fused         | -    | 1.70
M_SNN^GAP | L_CE1 + L_CE2 + L_SNN | LFBANK        | 2.73 | 3.66
          |                       | GD Gram       | 3.53 | 5.89
          |                       | LOGSPEC       |      |
          |                       | Fused         | -    | 1.52

It can be seen that both MTL models M_CL^GAP and M_SNN^GAP outperform the single-task learning model M_CE^GAP, by a relative 23.4% EER and 31.5% EER averaged over all input features. M_SNN^GAP further outperforms M_CL^GAP by a relative margin of 10.6% EER. We could observe that during training the MTL setups M_CL^GAP and M_SNN^GAP converged faster and also seemed to generalize better, as the EER on the dev data set decreased much more smoothly during training.

In the second experiment, we took the best-performing model M_SNN^GAP for LOGSPEC input as our new baseline. First, we analysed the effect of extracting second-order statistics in addition to first-order statistics from the CNN feature maps by replacing the GAP layer with a GAVP layer. This setup is abbreviated as M_SNN^GAVP. Second, we extended SNN with two additional reconstruction loss functions L_ReL1, L_ReL2 according to Eq. (2), for both GAP (M_SNN,ReL^GAP) and GAVP (M_SNN,ReL^GAVP). Empirically, it was found that L_ReL1 and L_ReL2 are much smaller than L_CE1, L_CE2 and L_SNN, so that each reconstruction loss is scaled by a weighting factor of 50. Because we experienced RAM memory overflow issues with M_SNN,ReL^GAP and M_SNN,ReL^GAVP, the batch size used in training was reduced and numSamples adjusted so as to keep the same number of steps per epoch as before. The results are shown in Table 3.

Table 3. Comparison of M_SNN^GAVP, M_SNN,ReL^GAP and M_SNN,ReL^GAVP for the LOGSPEC input feature. Results are reported in % EER.

Model          | Loss                                          | Dev | Eval
M_SNN^GAVP     | L_CE1 + L_CE2 + L_SNN                         |     |
M_SNN,ReL^GAP  | L_CE1 + L_CE2 + L_SNN + 50 L_ReL1 + 50 L_ReL2 |     |
M_SNN,ReL^GAVP | L_CE1 + L_CE2 + L_SNN + 50 L_ReL1 + 50 L_ReL2 |     |

It can be seen that both using the GAVP layer and adding ReL give a significant performance boost compared to M_SNN^GAP. Consequently, the best single-system performance on the eval data set is achieved by M_SNN,ReL^GAVP, which outperforms M_CE^GAP by a relative 30.5% EER while having the same number of parameters.
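Since all results above are reported in EER, a minimal numpy sketch of the metric may be helpful (our own implementation via a threshold sweep; production toolkits such as Bosaris [27] compute it from the ROC instead):

```python
import numpy as np

def equal_error_rate(scores, labels):
    """EER: the operating point where the false-acceptance rate (FAR) and the
    false-rejection rate (FRR) are equal, found by sweeping all thresholds.

    scores: higher = more likely spoofed; labels: 1 = spoofed, 0 = genuine
    (both classes are assumed to be present).
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    best_gap, eer = float("inf"), 1.0
    for t in np.unique(scores):
        pred_spoof = scores >= t
        far = np.mean(pred_spoof[labels == 0])    # genuine flagged as spoofed
        frr = np.mean(~pred_spoof[labels == 1])   # spoofed passed as genuine
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2.0
    return float(eer)
```

A perfectly separating system yields an EER of 0; random scores yield roughly 0.5.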
5. CONCLUSION
We have thoroughly analysed discriminative feature learning in an MTL setting for RA detection and found that SNN significantly outperform the baseline on multiple input features. We explain this improvement as follows. First, SNN greatly improve the discriminability of the model by explicitly increasing its inter-class variance. Second, because SNN sample from a very large pool of possible sample pairs - each giving a different gradient signal - the model regularizes much better during training. We then further improved upon SNN by adding ReL and replacing GAP with GAVP. The resulting best single-system EER can be justified by the better regularization induced by ReL and the more discriminative feature embeddings obtained through the extraction of first- and second-order statistics.

REFERENCES

[1] Zhizheng Wu, Tomi Kinnunen, Nicholas Evans, Junichi Yamagishi, Cemal Hanilçi, Md Sahidullah, and Aleksandr Sizov, "ASVspoof 2015: the first automatic speaker verification spoofing and countermeasures challenge," in Sixteenth Annual Conference of the International Speech Communication Association, 2015.

[2] Tomi Kinnunen, Md Sahidullah, Héctor Delgado, Massimiliano Todisco, Nicholas Evans, Junichi Yamagishi, and Kong Aik Lee, "The ASVspoof 2017 challenge: Assessing the limits of replay spoofing attack detection,"
ISCA (the International Speech Communication Association), 2017.

[3] Massimiliano Todisco, Xin Wang, Ville Vestman, Md Sahidullah, Hector Delgado, Andreas Nautsch, Junichi Yamagishi, Nicholas Evans, Tomi Kinnunen, and Kong Aik Lee, "ASVspoof 2019: Future horizons in spoofed and fake audio detection," arXiv preprint arXiv:1904.05441, 2019.

[4] Hemant A. Patil and Madhu R. Kamble, "A survey on replay attack detection for automatic speaker verification (ASV) system," IEEE, 2018, pp. 1047–1053.

[5] Parav Nagarsheth, Elie Khoury, Kailash Patil, and Matt Garland, "Replay attack detection using DNN for channel discrimination," in Interspeech, 2017, pp. 97–101.

[6] Francis Tom, Mohit Jain, and Prasenjit Dey, "End-to-end audio replay attack detection using deep convolutional networks with attention," in Interspeech, 2018, pp. 681–685.

[7] Md Sahidullah, Tomi Kinnunen, and Cemal Hanilçi, "A comparison of features for synthetic speech detection," ISCA (the International Speech Communication Association), 2015.

[8] Rajesh M. Hegde, Hema A. Murthy, and Venkata Ramana Rao Gadde, "Significance of the modified group delay feature in speech recognition," IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 1, pp. 190–202, 2006.

[9] Cheng-I Lai, Nanxin Chen, Jesús Villalba, and Najim Dehak, "ASSERT: Anti-spoofing with squeeze-excitation and residual networks," arXiv preprint arXiv:1904.01120, 2019.

[10] Weicheng Cai, Haiwei Wu, Danwei Cai, and Ming Li, "The DKU replay detection system for the ASVspoof 2019 challenge: On data augmentation, feature representation, classification, and fusion," arXiv preprint arXiv:1907.02663, 2019.

[11] Alejandro Gomez-Alanis, Antonio M. Peinado, Jose A. Gonzalez, and Angel Manuel Gomez, "A gated recurrent convolutional neural network for robust spoofing detection," IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2019.

[12] Galina Lavrentyeva, Sergey Novoselov, Andzhukaev Tseren, Marina Volkova, Artem Gorlanov, and Alexandr Kozlov, "STC antispoofing systems for the ASVspoof2019 challenge," arXiv preprint arXiv:1904.05576, 2019.

[13] Hossein Zeinali, Themos Stafylakis, Georgia Athanasopoulou, Johan Rohdin, Ioannis Gkinis, Lukáš Burget, Jan Černocký, et al., "Detecting spoofing attacks using VGG and SincNet: BUT-Omilia submission to ASVspoof 2019 challenge," arXiv preprint arXiv:1907.12908, 2019.

[14] Rich Caruana, "Multitask learning," Machine Learning, vol. 28, no. 1, pp. 41–75, 1997.

[15] Jee-weon Jung, Hye-jin Shim, Hee-Soo Heo, and Ha-Jin Yu, "Replay attack detection with complementary high-resolution information using end-to-end DNN for the ASVspoof 2019 challenge," arXiv preprint arXiv:1904.10134, 2019.

[16] Jane Bromley, Isabelle Guyon, Yann LeCun, Eduard Säckinger, and Roopak Shah, "Signature verification using a 'siamese' time delay neural network," in Advances in Neural Information Processing Systems, 1994, pp. 737–744.

[17] Gregory Koch, Richard Zemel, and Ruslan Salakhutdinov, "Siamese neural networks for one-shot image recognition," in ICML Deep Learning Workshop, 2015, vol. 2.

[18] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.

[19] Joao Carreira, Rui Caseiro, Jorge Batista, and Cristian Sminchisescu, "Semantic segmentation with second-order pooling," in European Conference on Computer Vision. Springer, 2012, pp. 430–443.

[20] Yandong Wen, Kaipeng Zhang, Zhifeng Li, and Yu Qiao, "A discriminative feature learning approach for deep face recognition," in European Conference on Computer Vision. Springer, 2016, pp. 499–515.

[21] Sean Bell and Kavita Bala, "Learning visual similarity for product design with convolutional neural networks," ACM Transactions on Graphics (TOG), vol. 34, no. 4, pp. 98, 2015.

[22] F. Zhuang, D. Luo, X. Jin, H. Xiong, P. Luo, and Q. He, "Representation learning via semi-supervised autoencoder for multi-task learning," 2015.

[23] Bhusan Chettri, Daniel Stoller, Veronica Morfi, Marco A. Martínez Ramírez, Emmanouil Benetos, and Bob L. Sturm, "Ensemble models for spoofing detection in automatic speaker verification," arXiv preprint arXiv:1904.04589, 2019.

[24] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, "Identity mappings in deep residual networks," in European Conference on Computer Vision. Springer, 2016, pp. 630–645.

[25] Lorenzo Rosasco, Ernesto De Vito, Andrea Caponnetto, Michele Piana, and Alessandro Verri, "Are loss functions all the same?," Neural Computation, vol. 16, no. 5, pp. 1063–1076, 2004.

[26] François Chollet et al., "Keras," https://keras.io, 2015.

[27] Niko Brümmer and Edward De Villiers, "The BOSARIS toolkit: Theory, algorithms and code for surviving the new DCF," arXiv preprint arXiv:1304.2865, 2013.

[28] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár, "Focal loss for dense object detection," in