Adversarial Defense for Automatic Speaker Verification by Cascaded Self-Supervised Learning Models

Haibin Wu*, Xu Li*, Andy T. Liu, Zhiyong Wu, Helen Meng, Hung-yi Lee

Graduate Institute of Communication Engineering, National Taiwan University
Human-Computer Communications Laboratory, The Chinese University of Hong Kong
Shenzhen International Graduate School, Tsinghua University

* Equal contribution.
ABSTRACT
Automatic speaker verification (ASV) is one of the core technologies in biometric identification. With the ubiquitous usage of ASV systems in safety-critical applications, more and more malicious attackers attempt to launch adversarial attacks at ASV systems. In the midst of the arms race between attack and defense in ASV, how to effectively improve the robustness of ASV against adversarial attacks remains an open question. We note that self-supervised learning models possess the ability to mitigate superficial perturbations in the input after pretraining. Hence, with the goal of effective defense in ASV against adversarial attacks, we propose a standard and attack-agnostic method based on cascaded self-supervised learning models to purify the adversarial perturbations. Experimental results demonstrate that the proposed method achieves effective defense performance and can successfully counter adversarial attacks in scenarios where attackers may either be aware or unaware of the self-supervised learning models.
Index Terms — Adversarial attack, adversarial defense, auto-matic speaker verification, self-supervised learning
1. INTRODUCTION
Automatic speaker verification (ASV) aims at confirming a speaker identity claim given a segment of spoken utterance. The technology has been widely applied in our everyday lives, such as smart phones, e-banking authentication, etc. Through decades of development, the three most representative high-performance model architectures were proposed, i.e. i-vector embedding systems [1-4], x-vector embedding systems [5, 6] and r-vector embedding systems [7, 8]. ASV is one of the most essential technologies for biometric identification, so the security of ASV systems is vitally important. However, previous works have shown that cutting-edge ASV systems are not only subject to spoofing audios [9] generated by audio replay, speech synthesis and voice conversion, but are vulnerable to adversarial attacks as well [10-15].

The concept of adversarial attacks was first proposed by Szegedy et al. [10], who showed that an image classification neural network that outperforms humans on clean testing images can become seriously confused on the same testing set after some imperceptible adversarial perturbations are added. Adversarial samples are composed of genuine samples and deliberately crafted adversarial perturbations, and using adversarial samples to attack well-trained neural networks is called an adversarial attack. Not only can adversarial perturbations make image classification models fail catastrophically, such attacks can also affect speech-related tasks. Carlini et al. [16] investigated the vulnerability of end-to-end automatic speech recognition (ASR) models to targeted adversarial attacks. Given a piece of audio, whether speech or music, they can craft another adversarial audio, over 99% similar to the original one, that manipulates the ASR model into producing arbitrarily predefined transcriptions.
The anti-spoofing model, a protector for ASV systems that detects and filters spoofing audios, can also be subjected to adversarial attacks [17]. This was among the first efforts to show that high-performance anti-spoofing models cannot counter adversarial attacks in both white-box and black-box scenarios.

With the ubiquitous usage of ASV systems in safety-critical environments, more and more malicious attackers attempt to launch adversarial attacks at ASV systems [11-14]. [11] first adopted adversarial samples to deceive end-to-end ASV systems. They conducted both cross-dataset and cross-feature attacks and showed the effectiveness of adversarial samples in both settings. Even state-of-the-art ASV models, the GMM i-vector system and the x-vector system, are vulnerable to adversarial attacks [12]. Also, [12] illustrated that adversarial samples generated from i-vector systems are transferable to attack x-vector systems. Xie et al. [13] crafted more dangerous adversarial samples, which were universal, real-time and robust, to deceive x-vector based speaker recognition systems. [14] employed the psychoacoustic principle of frequency masking to make adversarial audios against an x-vector based speaker recognition system more indistinguishable from the original audios to human perception.

Adversarial perturbations on ASV models have compromised their robustness considerably, which makes them unreliable in some safety-critical environments. This has led researchers to develop a variety of defense methods to counter the attacks. Wang et al. [18] injected adversarial samples into the training set and adopted the adversarial objective as regularization to improve the robustness of ASV. Adversarial training, which adopts adversarial samples to augment the training set, was introduced to alleviate the vulnerability of anti-spoofing for ASV against adversarial attacks [19]. Li et al. [20] separately trained a detection network to distinguish adversarial samples from genuine samples. A major drawback of these three methods [18-20] is that they need to know the details of the attacking algorithm used for adversarial sample generation. They therefore tend to overfit to the attacking algorithm used to generate the adversarial samples on which the defense models are trained, not to mention that it is impossible for the ASV system designer to know the exact attacking algorithm adopted by attackers in the wild. Wu et al. [21] proposed to use a self-supervised learning based model, Mockingjay [22], as a feature extractor in front of the anti-spoofing model of ASV to mitigate the transferability of black-box attacks. This method requires modification and retraining of the anti-spoofing model, and is confined to black-box attack scenarios.

Self-supervised learning has aroused keen interest recently, and transformer encoder representations from alteration (TERA) [23] is a self-supervised learning method proposed as a more advanced approach than Mockingjay [22]. The TERA model is trained by a denoising task. After training, it possesses the ability to mitigate superficial perturbations in the inputs and transform corrupted speech into clean speech. Adversarial perturbations can be considered a kind of noise, and therefore pretrained TERA models can also counter adversarial noise to some extent.

In the midst of the arms race between attack and defense for ASV, how to improve the robustness of ASV against adversarial perturbations remains an open question. Hence, this work proposes cascaded TERA models to purify the adversarial perturbations and counter adversarial attacks. The proposed defense method is a standard method that does not change the internals of ASV systems. It therefore has no conflict with previous defense methods [18-20] and can even serve as reinforcement for them. Also, in contrast to previous attack-dependent methods [18-20], the proposed method is attack-agnostic: it requires no knowledge about the adversarial sample generation process. As this is among the first work on adversarial defense of ASV by attack-agnostic methods, there is no baseline for reference, so we also employ hand-crafted filters for adversarial defense and set them as our baseline. Our contributions are as follows: We are among the first to propose self-supervised learning based models for adversarial defense of ASV systems. We begin by applying hand-crafted filters, including Gaussian, mean and median filters, to counter adversarial attacks on ASV. Experimental results demonstrate that our proposed method achieves effective defense performance and successfully counters adversarial attacks in both scenarios, where attackers are aware or unaware of the self-supervised learning models.
2. PROPOSED METHOD

2.1. TERA pretraining
The TERA model is pretrained by solving a self-supervised alteration-prediction task with an L1 reconstruction loss function, as shown in Fig. 1a. At training time, the TERA pretraining task requires the model to take as input a sequence of frames in which a certain percentage of randomly selected portions has been altered, and attempts to reconstruct the altered frames. The TERA pretraining scheme consists of several objectives: 1) time alteration: reconstructing from corrupted blocks of time steps with width W_T; 2) channel alteration: reconstructing from missing blocks of frequency channels with width W_C; 3) magnitude alteration: reconstructing from altered feature magnitudes with probability P_N. The model acquires information around the corrupted or altered input and uses it to reconstruct the clean input. After pretraining, the model has learned to map corrupted speech to clean speech, i.e. the ability of denoising and purification.

2.2. Adversarial defense by cascaded TERA models

This subsection presents the defense procedure of our proposed method. As shown in Fig. 1c, in-the-wild attackers can deliberately find adversarial noise δ and add it to the genuine sample x to generate the adversarial sample x̃. The adversarial sample x̃ is over 99% similar to the genuine sample x to human ears, but the prediction of the ASV model reverses. Adversarial attacks that make ASV models fail catastrophically are dangerous. Hence, this paper proposes cascaded self-supervised learning models to counter them. Fig. 1b illustrates the framework of adversarial defense by integrating ASV with cascaded TERA models. We first pretrain the TERA model, as shown in Fig. 1a. Then we choose K, the number of concatenated TERA models. Concatenating K TERA models should reduce the adversarial attack success rate without sacrificing the accuracy on benign samples too much. We show the procedure of finding such a qualified K in section 4.1.
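The three alteration objectives can be sketched on a (time, channel) feature matrix as below. This is a minimal sketch: the widths, probability, and noise scale are illustrative defaults, not the values used in the paper, and the real TERA implementation applies a richer alteration policy.

```python
import numpy as np

def alter_features(feats, wt=2, wc=2, pn=0.1, rng=None):
    """TERA-style alterations on a (time, channel) feature matrix:
    magnitude alteration first, then a zeroed time block and a zeroed
    channel block (so the zeroed regions stay exactly zero)."""
    rng = rng or np.random.default_rng(0)
    x = feats.copy().astype(float)
    T, C = x.shape
    # magnitude alteration: perturb a pn fraction of entries with noise
    noise_mask = rng.random(x.shape) < pn
    x = x + noise_mask * rng.normal(0.0, 0.1, size=x.shape)
    # time alteration: zero out a random block of wt consecutive frames
    t0 = rng.integers(0, T - wt + 1)
    x[t0:t0 + wt, :] = 0.0
    # channel alteration: zero out a random block of wc consecutive channels
    c0 = rng.integers(0, C - wc + 1)
    x[:, c0:c0 + wc] = 0.0
    return x
```

During pretraining, the model receives the altered matrix and is trained to reconstruct the original one; the reconstruction objective is what gives the pretrained model its denoising behavior.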
Ideally, given a piece of adversarial audio x̃, the cascaded TERA models serve as a deep filter that helps decontaminate the superficial adversarial perturbations and reconstruct the pivotal information from the input. If the input is a piece of genuine audio x, the deep filter simply performs a nearly lossless reconstruction and keeps the key information. After purification, the purified audio x̃′ is used for ASV tasks.

2.3. Threat models

Previous works coarsely divide attacking scenarios into white-box and black-box, which is vague and ambiguous to some extent. In contrast, we attempt to detail the attacking scenario and name two threat models from the perspective of the adversary's knowledge. Thereafter, we present our countermeasures.

• Adversaries are unaware of TERA: Attackers have access to the internals of the target ASV model, including model structure, model parameters and gradients, but they are not aware of the existence of the cascaded TERA models in front of the ASV model. In this setting, the cascaded TERA models serve as a deep filter to decontaminate the adversarial samples. TERA obtains the ability to purify corrupted speech into clean speech after pretraining. As we will see in subsection 4.1, when the number of cascaded TERA models increases, the equal error rate decreases, which shows the effectiveness of the proposed method in mitigating adversarial attacks against these adversaries.

• Adversaries are aware of TERA: Attackers have access to the entire ASV model and the training strategy of the TERA models. Our experiments show that even though attackers generate adversarial samples with some information about the TERA models, our approach is still effective at purifying adversarial samples and protecting the ASV models.
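The cascaded purification itself is just repeated application of a pretrained reconstruction model. The sketch below uses `soft_threshold`, a toy stand-in for a pretrained TERA model (not the real model), chosen only to make the shrinking of small, noise-like perturbations visible:

```python
import numpy as np

def purify(feats, denoiser, k):
    """Pass features through k cascaded copies of a denoising model.

    `denoiser` is a stand-in for one pretrained TERA model
    (hypothetical interface: features in, reconstructed features out).
    """
    x = feats
    for _ in range(k):
        x = denoiser(x)
    return x

# Toy stand-in denoiser: shrink small, noise-like magnitudes toward zero.
def soft_threshold(x, tau=0.05):
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

genuine = np.zeros(4)
adversarial = genuine + np.array([0.04, -0.03, 0.02, 0.05])  # small perturbation
purified = purify(adversarial, soft_threshold, k=2)  # perturbation removed
```

The design choice mirrors the paper: the defense sits entirely in front of the ASV system, so nothing inside the ASV model needs to change, and K is a free parameter traded off against reconstruction quality on genuine inputs.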
3. EXPERIMENTAL SETUP

3.1. ASV setting
This work adopts the r-vector embedding system [24] as the ASV system to be attacked. The name r-vector comes from the ResNet network architecture [24], which has been adopted in state-of-the-art speaker verification systems. The r-vector system adopts the same architecture as [24], and the AAM-softmax loss [25] with hyperparameters {m = 0.2, s = 30} is used for training the neural networks. Extracted r-vectors are length-normalized before cosine scoring.

Fig. 1. (a) The illustration of TERA's pretraining strategy. (b) The framework for adversarial defense on ASV by TERA models. (c) The procedure of adversarial attack.

3.2. Attack setting

In this work, we generate adversarial samples using the basic iterative method (BIM) [26] to attack the r-vector system; BIM has been verified to be effective at degrading deep neural network systems. We assume that X^{(e)} and X^{(t)} are enrollment and testing utterances, respectively. The ASV system function is denoted as S with parameters θ. Attackers aim at perturbing the genuine testing input X^{(t)} to make it more similar to X^{(e)} under the judgement of the ASV system. BIM perturbs X^{(t)} along the gradient of the system output S w.r.t. X^{(t)} in an iterative manner. Starting from the genuine input X^{(t)}_0 = X^{(t)}, this process can be formulated as Eq. 1:

X^{(t)}_{n+1} = clip_{X^{(t)}, ε} ( X^{(t)}_n + α · sign( ∇_{X^{(t)}_n} S_θ( X^{(e)}, X^{(t)}_n ) ) ), for n = 0, ..., N − 1,   (1)

where sign takes the sign of the gradient, α is the step size, N is the number of iterations, ε is the perturbation degree, and clip_{X^{(t)}, ε}(X) enforces the norm constraint by applying element-wise clipping such that ‖X − X^{(t)}‖∞ ≤ ε. In our experiments, N is set as 5.
The ε is set as 0.3, so that there is no perceptible difference between adversarial and genuine audios to humans, while the attack can still succeed in making the ASV system behave incorrectly. Finally, α is set as ε divided by N.

3.3. Dataset

This work is conducted on Voxceleb1 [27], which consists of short clips of human speech. There are in total 148,642 utterances from 1,251 speakers. We develop our ASV system on the training and development partitions, while reserving 4,874 utterances of the testing partition for evaluating our ASV system and generating adversarial samples. Note that generating adversarial samples is time- and resource-consuming. Without loss of generality, we randomly select 1,000 trials out of the 37,720 trials provided in [27] to generate adversarial samples.
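The BIM iteration of Eq. 1 can be sketched in a few lines. The toy linear scorer below, together with its analytic gradient, is a hypothetical stand-in for the r-vector system S_θ; a real attack would backpropagate through the network instead.

```python
import numpy as np

def bim_attack(x_e, x_t, score_grad, eps=0.3, n_iters=5):
    """Basic Iterative Method (Eq. 1) against a scoring function.

    score_grad(x_e, x) returns the gradient of the score w.r.t. x.
    alpha = eps / n_iters, and every step is clipped back into the
    L-infinity ball of radius eps around the genuine input x_t.
    """
    alpha = eps / n_iters
    x = x_t.copy()
    for _ in range(n_iters):
        x = x + alpha * np.sign(score_grad(x_e, x))
        x = np.clip(x, x_t - eps, x_t + eps)  # ||x - x_t||_inf <= eps
    return x

# Toy linear scorer S(x_e, x) = x_e . x, so its gradient w.r.t. x is x_e.
grad = lambda x_e, x: x_e
x_e = np.array([1.0, -1.0, 1.0])   # stand-in enrollment representation
x_t = np.zeros(3)                  # stand-in genuine test input
x_adv = bim_attack(x_e, x_t, grad)
```

With a constant gradient, the iterate simply walks to the corner of the ε-ball, which makes the role of the sign step and the clipping easy to check by hand.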
Table 1 shows the r-vector system performance on the complete trials provided in [27] and also on the selected 1K trials. The system performance is evaluated by equal error rate (EER) and minimum detection cost function (minDCF) with a prior probability of target trials of 0.01. We observe that the system performance is seriously degraded by adversarial inputs, which verifies the effectiveness of our adversarial attack algorithm. Besides, we observe consistent performance trends between the results on the complete
trials and those on the selected 1K trials, which indicates that the selection process for the 1K trials is reasonable. Further experiments are conducted on these 1K adversarial samples.

Table 1. The r-vector system performance with genuine (gen-input) and adversarial inputs (adv-input).

            complete trials       1K trials
            EER (%)   minDCF      EER (%)   minDCF
gen-input   8.39      0.638       8.87      0.792
adv-input   65.92     1.000       66.02     1.000
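The EER metric used throughout can be sketched as a simple threshold sweep; this is a minimal sketch, not the calibrated DET-curve interpolation typically used in evaluation toolkits:

```python
import numpy as np

def compute_eer(target_scores, nontarget_scores):
    """Equal error rate: the operating point where the false-rejection
    rate (targets rejected) equals the false-acceptance rate
    (nontargets accepted), found by sweeping candidate thresholds."""
    thresholds = np.sort(np.concatenate([target_scores, nontarget_scores]))
    best_gap, eer = np.inf, None
    for t in thresholds:
        frr = np.mean(target_scores < t)      # false rejections
        far = np.mean(nontarget_scores >= t)  # false acceptances
        if abs(frr - far) < best_gap:
            best_gap, eer = abs(frr - far), (frr + far) / 2
    return eer
```

A successful attack pushes nontarget (impostor) scores above target scores, which drives the crossing point, and hence the EER, upward, exactly the 8.87% → 66.02% jump in Table 1.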
3.4. TERA setting

We use the TERA implementation from the S3PRL speech toolkit. We use a time alteration width W_T of , a channel alteration width W_C of , and a magnitude alteration probability P_N of . A time alteration of frames corresponds to ms of speech, which is in the range of average phoneme durations. According to [28], we set W_C as for better reconstruction. Setting W_C and W_T too large makes reconstruction hard for the self-supervised learning model. The rest of the alteration policy follows the original design of TERA [23]. In order to evaluate our proposed method in the scenario where adversaries are aware of TERA, we pretrain two TERA models with identical settings except for a unique random seed, denoted as TERA0 and TERA1, respectively. Each model consists of 3 layers of Transformer encoders with multi-head self-attention [29], followed by a feed-forward prediction network. The dataset adopted to pretrain the TERA models is Voxceleb2 [7]. The input to the model is 24-dim MFCC extracted by standard Kaldi [30] scripts. We use the Adam optimizer [31] with mini-batches of size 128 to find model parameters that minimize the L1 loss of the TERA pretraining task. The model is trained for 30K steps, where the learning rate is warmed up over the first 7% of steps to a peak value and then linearly decayed. If not specified otherwise, other pretraining settings follow the TERA paper [23].

4. EXPERIMENTAL RESULTS

4.1. Adversaries are unaware of TERA

This subsection assumes that attackers have access to the complete ASV model parameters while they are unaware of the cascaded TERA models in the frontend of the ASV system. This setting is the most practical one in the real world, because it is unrealistic for attackers to learn everything about the target models through querying the API. In this setting, we adopt TERA0 as a basic element and duplicate it different numbers of times to be placed in front of the ASV system.
Defense performance is evaluated when the ASV system is integrated with different numbers of TERA models, as shown in Fig. 2.
Fig. 2. The r-vector system's EER (%) for ASV integrated with different numbers of cascaded TERA models.

We observe that for adversarial speech, the integration of TERA models can dramatically decrease the EER of the attacked system from over 65% to around 20%, which indicates that the integration of TERA models can purify the adversarial signals and mitigate the attack effectiveness. We also observe that the performance of ASV with genuine inputs drops due to the imperfect reconstruction of the TERA models. Possible solutions are either improving the reconstruction ability of TERA or using reconstructed inputs to finetune ASV systems, which will be investigated in future work.

As investigated in [19], some hand-crafted filters also have the ability to purify adversarial signals and alleviate the destructiveness of adversarial attacks. In this work, we leverage three filters, i.e. Gaussian, median and mean filters, positioned in front of the ASV system to defend against adversarial attacks, and set them as our baseline. Table 2 illustrates the system EER for genuine and adversarial inputs when the ASV system is integrated with the cascaded TERA models, the Gaussian filter, the median filter and the mean filter. We observe that all filters have the ability to purify adversarial signals given adversarial inputs. The attack effectiveness is degraded by over 50% after integration with these filters. However, due to the additional noise caused by the filtering process, all filters also degrade the system performance on genuine inputs. Notice that the proposed TERA models outperform the other filters with respect to both purifying adversarial signals within adversarial inputs and preserving ASV performance on genuine inputs.
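The hand-crafted baselines can be sketched with standard smoothing filters applied to the feature matrix; the filter sizes and sigma below are illustrative choices, as the paper does not report its filter hyperparameters:

```python
import numpy as np
from scipy.ndimage import gaussian_filter, median_filter, uniform_filter

def filter_baseline(feats, kind="median", size=3):
    """Hand-crafted filtering baselines placed in front of the ASV system.

    Smooths a (time, channel) feature matrix to suppress small
    adversarial perturbations, at the cost of also blurring genuine
    speech features (hence the degraded gen-input EER in Table 2).
    """
    if kind == "gaussian":
        return gaussian_filter(feats, sigma=1.0)
    if kind == "mean":
        return uniform_filter(feats, size=size)
    return median_filter(feats, size=size)
```

For example, a single large outlier entry in an otherwise constant feature matrix is completely removed by the median filter, which is the same mechanism that blunts small adversarial perturbations.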
Table 2. The system's EER (%) for genuine and adversarial inputs when integrating ASV with TERA, Gaussian, median and mean filters. (NA means nothing is positioned in front of ASV.)

            NA      10*TERA0   Gaussian   median   mean
gen-input   8.87    17.32      30.30      27.06    27.71
adv-input   66.02   22.94      31.60      29.65    29.44

4.2. Adversaries are aware of TERA

This subsection gives a case study to show the robustness of our defense approach in a more severe attacking scenario, where attackers not only have access to the entire ASV parameters, but are also aware of the TERA models in front of the ASV system. We assume that attackers know the training strategy of the TERA model, and pretrain a substitute TERA model (denoted as TERA1) to be placed in front of the ASV system to generate adversarial samples. In real-world applications, it is hard for attackers to know the specific number of TERA models in front of the ASV system, so in this work we only integrate ASV with one TERA1 model to generate adversarial samples. The attacking process is identically configured as the BIM attack above, with perturbation degree ε = 0.3. Table 3 illustrates the system performance when ASV is integrated with different numbers of TERA0 models, given as inputs the adversarial samples generated by the integration of ASV and one TERA1 model. We observe that even though attackers know the training setting of the TERA models in front of ASV, the attack destructiveness is still alleviated by integrating ASV with more TERA models. Moreover, based on our experiments, performing white-box attacks on ASV integrated with TERA models is memory- and computation-consuming. Hence, placing a sufficient number of TERA models in front of ASV could be a good option to defend against adversarial attacks.

Table 3. The r-vector system's EER (%) when integrating ASV with different numbers of TERA0 models.

          NA      1*TERA0   2*TERA0   3*TERA0
EER (%)   54.55   53.68     47.62     40.69
5. CONCLUSION
This work proposes integrating ASV with cascaded TERA models for defense against adversarial attacks. We conduct experiments in two attacking scenarios, depending on whether the adversaries are aware of the TERA models or not. The scenario where attackers are unaware of the TERA models is the more practical one, and experimental results indicate that introducing cascaded TERA models as a deep filter can purify the adversarial signals and mitigate the attack destructiveness. Also, our proposed method outperforms hand-crafted filters with respect to both decontaminating adversarial signals within adversarial inputs and preserving ASV performance on genuine inputs. For the other scenario, where attackers are aware of the TERA models, experimental results verify that integrating ASV with more TERA models is still effective at alleviating adversarial noise, even though attackers utilize the TERA information to generate adversarial samples. Merely preserving better performance on genuine samples than hand-crafted filters is not good enough, so we will attempt to tackle this problem in future work.
6. ACKNOWLEDGEMENT
This work was done when H. Wu was a visiting student at Shenzhen International Graduate School, Tsinghua University. H. Wu and A. Liu are supported by the Frontier Speech Technology Scholarship of National Taiwan University. A. Liu is supported by ASUS AICS. Xu Li is supported by the HKSAR Government's Research Grants Council General Research Fund (Project No. 14208718).

7. REFERENCES

[1] Y. Lei, N. Scheffer, L. Ferrer, and M. McLaren, "A novel scheme for speaker recognition using a phonetically-aware deep neural network," in . IEEE, 2014, pp. 1695–1699.
[2] P. Kenny, "A small footprint i-vector extractor," in Odyssey 2012 - The Speaker and Language Recognition Workshop, 2012.
[3] N. Dehak, P. J. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, "Front-end factor analysis for speaker verification," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 4, pp. 788–798, 2010.
[4] D. Garcia-Romero and C. Y. Espy-Wilson, "Analysis of i-vector length normalization in speaker recognition systems," in Twelfth Annual Conference of the International Speech Communication Association, 2011.
[5] D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur, "X-vectors: Robust DNN embeddings for speaker recognition," in . IEEE, 2018, pp. 5329–5333.
[6] X. Li, J. Zhong, J. Yu, S. Hu, X. Wu, X. Liu, and H. Meng, "Bayesian x-vector: Bayesian neural network based x-vector system for speaker verification," arXiv preprint arXiv:2004.04014, 2020.
[7] J. S. Chung, A. Nagrani, and A. Zisserman, "Voxceleb2: Deep speaker recognition," arXiv preprint arXiv:1806.05622, 2018.
[8] N. Li, D. Tuo, D. Su, Z. Li, D. Yu, and A. Tencent, "Deep discriminative embeddings for duration robust speaker verification," in Interspeech, 2018, pp. 2262–2266.
[9] J. Yamagishi, M. Todisco, M. Sahidullah, H. Delgado, X. Wang, N. Evans, T. Kinnunen, K. A. Lee, V. Vestman, and A. Nautsch, "ASVspoof 2019: The 3rd automatic speaker verification spoofing and countermeasures challenge database," 2019.
[10] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus, "Intriguing properties of neural networks," arXiv preprint arXiv:1312.6199, 2013.
[11] F. Kreuk, Y. Adi, M. Cisse, and J. Keshet, "Fooling end-to-end speaker verification with adversarial examples," in . IEEE, 2018, pp. 1962–1966.
[12] X. Li, J. Zhong, X. Wu, J. Yu, X. Liu, and H. Meng, "Adversarial attacks on GMM i-vector based speaker verification systems," in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 6579–6583.
[13] Y. Xie, C. Shi, Z. Li, J. Liu, Y. Chen, and B. Yuan, "Real-time, universal, and robust adversarial attacks against speaker recognition systems," in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 1738–1742.
[14] Q. Wang, P. Guo, and L. Xie, "Inaudible adversarial perturbations for targeted attack in speaker recognition," arXiv preprint arXiv:2005.10637, 2020.
[15] R. K. Das, X. Tian, T. Kinnunen, and H. Li, "The attacker's perspective on automatic speaker verification: An overview," arXiv preprint arXiv:2004.08849, 2020.
[16] N. Carlini and D. Wagner, "Audio adversarial examples: Targeted attacks on speech-to-text," in . IEEE, 2018, pp. 1–7.
[17] S. Liu, H. Wu, H.-y. Lee, and H. Meng, "Adversarial attacks on spoofing countermeasures of automatic speaker verification," arXiv preprint arXiv:1910.08716, 2019.
[18] Q. Wang, P. Guo, S. Sun, L. Xie, and J. H. Hansen, "Adversarial regularization for end-to-end robust speaker verification," in Interspeech, 2019, pp. 4010–4014.
[19] H. Wu, S. Liu, H. Meng, and H.-y. Lee, "Defense against adversarial attacks on spoofing countermeasures of ASV," arXiv preprint arXiv:2003.03065, 2020.
[20] X. Li, N. Li, J. Zhong, X. Wu, X. Liu, D. Su, D. Yu, and H. Meng, "Investigating robustness of adversarial samples detection for automatic speaker verification," arXiv preprint arXiv:2006.06186, 2020.
[21] H. Wu, A. T. Liu, and H.-y. Lee, "Defense for black-box attacks on anti-spoofing models by self-supervised learning," arXiv preprint arXiv:2006.03214, 2020.
[22] A. T. Liu, S.-w. Yang, P.-H. Chi, P.-c. Hsu, and H.-y. Lee, "Mockingjay: Unsupervised speech representation learning with deep bidirectional transformer encoders," in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2020.
[23] A. T. Liu, S.-W. Li, and H.-y. Lee, "TERA: Self-supervised learning of transformer encoder representation for speech," 2020.
[24] H. Zeinali, S. Wang, A. Silnova, P. Matějka, and O. Plchot, "BUT system description to VoxCeleb speaker recognition challenge 2019," arXiv preprint arXiv:1910.12592, 2019.
[25] X. Xiang, S. Wang, H. Huang, Y. Qian, and K. Yu, "Margin matters: Towards more discriminative deep neural network embeddings for speaker recognition," in . IEEE, 2019, pp. 1652–1656.
[26] A. Kurakin, I. Goodfellow, and S. Bengio, "Adversarial machine learning at scale," arXiv preprint arXiv:1611.01236, 2016.
[27] A. Nagrani, J. S. Chung, and A. Zisserman, "Voxceleb: A large-scale speaker identification dataset," arXiv preprint arXiv:1706.08612, 2017.
[28] D. S. Park, W. Chan, Y. Zhang, C.-C. Chiu, B. Zoph, E. D. Cubuk, and Q. V. Le, "SpecAugment: A simple data augmentation method for automatic speech recognition," arXiv preprint arXiv:1904.08779, 2019.
[29] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, "Attention is all you need," 2017.
[30] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, J. Silovsky, G. Stemmer, and K. Vesely, "The Kaldi speech recognition toolkit," in ASRU, 2011.
[31] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980.