Adversarial Defense for Automatic Speaker Verification by Cascaded Self-Supervised Learning Models

Haibin Wu*, Xu Li*, Andy T. Liu, Zhiyong Wu, Helen Meng, Hung-yi Lee

Graduate Institute of Communication Engineering, National Taiwan University
Human-Computer Communications Laboratory, The Chinese University of Hong Kong
Shenzhen International Graduate School, Tsinghua University

* Equal contribution.
ABSTRACT
Automatic speaker verification (ASV) is one of the core technologies in biometric identification. With the ubiquitous usage of ASV systems in safety-critical applications, more and more malicious attackers attempt to launch adversarial attacks at ASV systems. In the midst of the arms race between attack and defense in ASV, how to effectively improve the robustness of ASV against adversarial attacks remains an open question. We note that self-supervised learning models possess the ability to mitigate superficial perturbations in the input after pretraining. Hence, with the goal of effective defense in ASV against adversarial attacks, we propose a standard and attack-agnostic method based on cascaded self-supervised learning models to purify the adversarial perturbations. Experimental results demonstrate that the proposed method achieves effective defense performance and can successfully counter adversarial attacks in scenarios where attackers may either be aware or unaware of the self-supervised learning models.
Index Terms — Adversarial attack, adversarial defense, auto-matic speaker verification, self-supervised learning
1. INTRODUCTION
Automatic speaker verification (ASV) aims at confirming a speaker identity claim given a segment of spoken utterance. The technology has been widely applied in our everyday lives, such as smart phones, e-banking authentication, etc. Through decades of development, the three most representative high-performance model architectures were proposed, i.e. i-vector embedding systems [1-4], x-vector embedding systems [5, 6] and r-vector embedding systems [7, 8]. ASV is one of the most essential technologies for biometric identification, so the security of ASV systems is vitally important. However, previous works have shown that cutting-edge ASV systems are not only subject to spoofing audios [9] generated by audio replay, speech synthesis and voice conversion, but are vulnerable to adversarial attacks as well [10-15].

The concept of adversarial attacks was first proposed by Szegedy et al. [10], who showed that an image classification neural network that outperforms humans on clean testing images can become seriously confused on the same testing set after some imperceptible adversarial perturbations are added. Adversarial samples are composed of genuine samples and deliberately crafted adversarial perturbations, and using adversarial samples to attack well-trained neural networks is called an adversarial attack. Not only can adversarial perturbations make image classification models fail catastrophically, such attacks can also affect speech-related tasks. Carlini et al. [16] investigated the vulnerability of end-to-end automatic speech recognition (ASR) models to targeted adversarial attacks. Given a piece of audio, whether speech or music, they can craft another adversarial audio, over 99% similar to the original one, that manipulates the ASR model into producing arbitrarily predefined transcriptions.
The anti-spoofing model, a protector for ASV systems that detects and filters spoofing audios, can also be subjected to adversarial attacks [17]. This was among the first efforts to show that high-performance anti-spoofing models cannot counter adversarial attacks in both white-box and black-box scenarios.

With the ubiquitous usage of ASV systems in safety-critical environments, more and more malicious attackers attempt to launch adversarial attacks at ASV systems [11-14]. [11] first adopted adversarial samples to deceive end-to-end ASV systems. They conducted both cross-dataset and cross-feature attacks and showed the effectiveness of adversarial samples in both settings. Even state-of-the-art ASV models, the GMM i-vector system and the x-vector system, are vulnerable to adversarial attacks [12]. Also, [12] illustrated that adversarial samples generated from i-vector systems are transferable to attack x-vector systems. Xie et al. [13] crafted more dangerous adversarial samples, which were universal, real-time and robust, to deceive x-vector based speaker recognition systems. [14] employed the psychoacoustic principle of frequency masking to make adversarial audios against an x-vector based speaker recognition system more indistinguishable from the original audios to human perception.

Adversarial perturbations on ASV models have compromised their robustness considerably, which makes them unreliable in some safety-critical environments. This has led researchers to develop a variety of defense methods to counter the attacks. Wang et al. [18] injected adversarial samples into the training set and adopted the adversarial objective as regularization to improve the robustness of ASV. Adversarial training, which adopts adversarial samples to augment the training set, was introduced to alleviate the vulnerability of anti-spoofing for ASV against adversarial attacks [19]. Li et al. [20] separately trained a detection network to distinguish adversarial samples from genuine samples. A major drawback of these three methods [18-20] is that they need to know the details of the attacking algorithm used for adversarial sample generation. They therefore tend to overfit to the attacking algorithm used to generate the adversarial samples on which the defense models are trained, not to mention that it is impossible for the ASV system designer to know the exact attacking algorithm adopted by attackers in the wild. Wu et al. [21] proposed to use a self-supervised learning based model, Mockingjay [22], as a feature extractor in front of the anti-spoofing model of ASV to mitigate the transferability of black-box attacks. This method requires modification and retraining of the anti-spoofing model, and is confined to black-box attack scenarios.

Self-supervised learning has aroused keen interest recently, and transformer encoder representations from alteration (TERA) [23] is a self-supervised learning method proposed as a more advanced approach than Mockingjay [22]. The TERA model is trained by a denoising task. After training, it possesses the ability to mitigate superficial perturbations in the inputs and transform corrupted speech into clean speech. Adversarial perturbations can be considered a kind of noise, and therefore pretrained TERA models can also counter adversarial noise to some extent.

In the midst of the arms race between attack and defense for ASV, how to improve the robustness of ASV against adversarial perturbations remains an open question. Hence, this work proposes cascaded TERA models to purify the adversarial perturbations and counter adversarial attacks. The proposed defense method is a standard method that does not change the internals of ASV systems. It therefore has no conflict with previous defense methods [18-20] and can even serve as reinforcement for them. Also, in contrast to previous attack-dependent methods [18-20], the proposed method is attack-agnostic: it requires no knowledge about the adversarial sample generation process. As this is among the first work on adversarial defense of ASV by attack-agnostic methods, there is no baseline for reference, so we also employ hand-crafted filters for adversarial defense and set them as our baseline. Our contributions are as follows: We are among the first to propose self-supervised learning based models for adversarial defense of ASV systems. We begin by applying hand-crafted filters, including Gaussian, mean and median filters, to counter adversarial attacks on ASV. Experimental results demonstrate that our proposed method achieves effective defense performance and successfully counters adversarial attacks in both scenarios, where attackers are aware or unaware of the self-supervised learning models.
2. PROPOSED METHOD

2.1. TERA pretraining
The TERA model is pretrained by solving a self-supervised alteration-prediction task with an L1 reconstruction loss function, as shown in Fig. 1a. At training time, the TERA pretraining task requires the model to take as input a sequence of frames in which a certain percentage of randomly selected portions has been altered, and attempts to reconstruct the altered frames. The TERA pretraining scheme consists of several objectives: 1) time alteration: reconstructing from corrupted blocks of time steps with width W_T; 2) channel alteration: reconstructing from missing blocks of frequency channels with width W_C; 3) magnitude alteration: reconstructing from altered feature magnitudes with probability P_N. The model acquires information around the corrupted or altered input and uses it to reconstruct the clean input. After pretraining, the model has learned to map corrupted speech to clean speech, i.e. the ability of denoising and purification.

2.2. Adversarial defense by cascaded TERA models

This subsection presents the defense procedure of our proposed method. As shown in Fig. 1c, in-the-wild attackers can deliberately find adversarial noise δ and add it to the genuine sample x to generate the adversarial sample x̃. The adversarial sample x̃ is over 99% similar to the genuine sample x to human ears, but the prediction of the ASV model reverses. Adversarial attacks that make ASV models fail catastrophically are dangerous. Hence, this paper proposes cascaded self-supervised learning models to counter them. Fig. 1b illustrates the framework of adversarial defense by integrating ASV with cascaded TERA models. We first pretrain the TERA model, as shown in Fig. 1a. Then we choose K, the number of concatenated TERA models. Concatenating K TERA models should reduce the adversarial attack success rate without sacrificing the accuracy on benign samples too much. We show the procedure of finding such a qualified K in section 4.1.
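The three alteration objectives can be sketched on a (time, channel) feature matrix as below. This is a minimal sketch: the widths, probability, and noise scale are illustrative defaults, not the values used in the paper, and the real TERA implementation applies a richer alteration policy.

```python
import numpy as np

def alter_features(feats, wt=2, wc=2, pn=0.1, rng=None):
    """TERA-style alterations on a (time, channel) feature matrix:
    magnitude alteration first, then a zeroed time block and a zeroed
    channel block (so the zeroed regions stay exactly zero)."""
    rng = rng or np.random.default_rng(0)
    x = feats.copy().astype(float)
    T, C = x.shape
    # magnitude alteration: perturb a pn fraction of entries with noise
    noise_mask = rng.random(x.shape) < pn
    x = x + noise_mask * rng.normal(0.0, 0.1, size=x.shape)
    # time alteration: zero out a random block of wt consecutive frames
    t0 = rng.integers(0, T - wt + 1)
    x[t0:t0 + wt, :] = 0.0
    # channel alteration: zero out a random block of wc consecutive channels
    c0 = rng.integers(0, C - wc + 1)
    x[:, c0:c0 + wc] = 0.0
    return x
```

During pretraining, the model receives the altered matrix and is trained to reconstruct the original one; the reconstruction objective is what gives the pretrained model its denoising behavior.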
Ideally, given a piece of adversarial audio x̃, the cascaded TERA models serve as a deep filter that helps decontaminate the superficial adversarial perturbations and reconstruct the pivotal information from the input. If the input is a piece of genuine audio x, the deep filter simply performs a nearly lossless reconstruction and keeps the key information. After purification, the purified audio x̃′ is used for ASV tasks.

2.3. Threat models

Previous works coarsely divide attacking scenarios into white-box and black-box, which is vague and ambiguous to some extent. In contrast, we attempt to detail the attacking scenario and name two threat models from the perspective of the adversary's knowledge. Thereafter, we present our countermeasures.

• Adversaries are unaware of TERA: Attackers have access to the internals of the target ASV model, including model structure, model parameters and gradients, but they are not aware of the existence of the cascaded TERA models in front of the ASV model. In this setting, the cascaded TERA models serve as a deep filter to decontaminate the adversarial samples. TERA obtains the ability to purify corrupted speech into clean speech after pretraining. As we will see in subsection 4.1, when the number of cascaded TERA models increases, the equal error rate decreases, which shows the effectiveness of the proposed method in mitigating adversarial attacks against these adversaries.

• Adversaries are aware of TERA: Attackers have access to the entire ASV model and the training strategy of the TERA models. Our experiments show that even though attackers generate adversarial samples with some information about the TERA models, our approach is still effective at purifying adversarial samples and protecting the ASV models.
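The cascaded purification itself is just repeated application of a pretrained reconstruction model. The sketch below uses `soft_threshold`, a toy stand-in for a pretrained TERA model (not the real model), chosen only to make the shrinking of small, noise-like perturbations visible:

```python
import numpy as np

def purify(feats, denoiser, k):
    """Pass features through k cascaded copies of a denoising model.

    `denoiser` is a stand-in for one pretrained TERA model
    (hypothetical interface: features in, reconstructed features out).
    """
    x = feats
    for _ in range(k):
        x = denoiser(x)
    return x

# Toy stand-in denoiser: shrink small, noise-like magnitudes toward zero.
def soft_threshold(x, tau=0.05):
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

genuine = np.zeros(4)
adversarial = genuine + np.array([0.04, -0.03, 0.02, 0.05])  # small perturbation
purified = purify(adversarial, soft_threshold, k=2)  # perturbation removed
```

The design choice mirrors the paper: the defense sits entirely in front of the ASV system, so nothing inside the ASV model needs to change, and K is a free parameter traded off against reconstruction quality on genuine inputs.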
3. EXPERIMENTAL SETUP

3.1. ASV setting
This work adopts the r-vector embedding system [24] as the ASV system to be attacked. The name r-vector comes from the ResNet network architecture [24], which has been adopted in state-of-the-art speaker verification systems. The r-vector system adopts the same architecture as [24], and the AAM-softmax loss [25] with hyperparameters {m = 0.2, s = 30} is used for training the neural networks. Extracted r-vectors are length-normalized before cosine scoring.

Fig. 1. (a) The illustration of TERA's pretraining strategy. (b) The framework for adversarial defense on ASV by TERA models. (c) The procedure of adversarial attack.

3.2. Attack setting

In this work, we generate adversarial samples using the basic iterative method (BIM) [26] to attack the r-vector system; BIM has been verified to be effective at degrading deep neural network systems. We assume that X^{(e)} and X^{(t)} are enrollment and testing utterances, respectively. The ASV system function is denoted as S with parameters θ. Attackers aim at perturbing the genuine testing input X^{(t)} to make it more similar to X^{(e)} under the judgement of the ASV system. BIM perturbs X^{(t)} along the gradient of the system output S w.r.t. X^{(t)} in an iterative manner. Starting from the genuine input X^{(t)}_0 = X^{(t)}, this process can be formulated as Eq. 1:

X^{(t)}_{n+1} = clip_{X^{(t)}, ε} ( X^{(t)}_n + α · sign( ∇_{X^{(t)}_n} S_θ( X^{(e)}, X^{(t)}_n ) ) ), for n = 0, ..., N − 1,   (1)

where sign takes the sign of the gradient, α is the step size, N is the number of iterations, ε is the perturbation degree, and clip_{X^{(t)}, ε}(X) enforces the norm constraint by applying element-wise clipping such that ‖X − X^{(t)}‖∞ ≤ ε. In our experiments, N is set as 5.
The ε is set as 0.3, so that there is no perceptible difference between adversarial and genuine audios to humans, while the attack can still succeed in making the ASV system behave incorrectly. Finally, α is set as ε divided by N.

3.3. Dataset

This work is conducted on Voxceleb1 [27], which consists of short clips of human speech. There are in total 148,642 utterances from 1,251 speakers. We develop our ASV system on the training and development partitions, while reserving 4,874 utterances of the testing partition for evaluating our ASV system and generating adversarial samples. Note that generating adversarial samples is time- and resource-consuming. Without loss of generality, we randomly select 1,000 trials out of the 37,720 trials provided in [27] to generate adversarial samples.
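The BIM iteration of Eq. 1 can be sketched in a few lines. The toy linear scorer below, together with its analytic gradient, is a hypothetical stand-in for the r-vector system S_θ; a real attack would backpropagate through the network instead.

```python
import numpy as np

def bim_attack(x_e, x_t, score_grad, eps=0.3, n_iters=5):
    """Basic Iterative Method (Eq. 1) against a scoring function.

    score_grad(x_e, x) returns the gradient of the score w.r.t. x.
    alpha = eps / n_iters, and every step is clipped back into the
    L-infinity ball of radius eps around the genuine input x_t.
    """
    alpha = eps / n_iters
    x = x_t.copy()
    for _ in range(n_iters):
        x = x + alpha * np.sign(score_grad(x_e, x))
        x = np.clip(x, x_t - eps, x_t + eps)  # ||x - x_t||_inf <= eps
    return x

# Toy linear scorer S(x_e, x) = x_e . x, so its gradient w.r.t. x is x_e.
grad = lambda x_e, x: x_e
x_e = np.array([1.0, -1.0, 1.0])   # stand-in enrollment representation
x_t = np.zeros(3)                  # stand-in genuine test input
x_adv = bim_attack(x_e, x_t, grad)
```

With a constant gradient, the iterate simply walks to the corner of the ε-ball, which makes the role of the sign step and the clipping easy to check by hand.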
Table 1 shows the r-vector system performance on the complete trials provided in [27] and also on the selected 1K trials. The system performance is evaluated by equal error rate (EER) and minimum detection cost function (minDCF) with a prior probability of target trials of 0.01. We observe that the system performance is seriously degraded by adversarial inputs, which verifies the effectiveness of our adversarial attack algorithm. Besides, we observe consistent performance trends between the results on the complete
trials and those on the selected 1K trials, which indicates that the selection process for the 1K trials is reasonable. Further experiments are conducted on these 1K adversarial samples.

Table 1. The r-vector system performance with genuine (gen-input) and adversarial inputs (adv-input).

            complete trials       1K trials
            EER (%)   minDCF      EER (%)   minDCF
gen-input   8.39      0.638       8.87      0.792
adv-input   65.92     1.000       66.02     1.000
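The EER metric used throughout can be sketched as a simple threshold sweep; this is a minimal sketch, not the calibrated DET-curve interpolation typically used in evaluation toolkits:

```python
import numpy as np

def compute_eer(target_scores, nontarget_scores):
    """Equal error rate: the operating point where the false-rejection
    rate (targets rejected) equals the false-acceptance rate
    (nontargets accepted), found by sweeping candidate thresholds."""
    thresholds = np.sort(np.concatenate([target_scores, nontarget_scores]))
    best_gap, eer = np.inf, None
    for t in thresholds:
        frr = np.mean(target_scores < t)      # false rejections
        far = np.mean(nontarget_scores >= t)  # false acceptances
        if abs(frr - far) < best_gap:
            best_gap, eer = abs(frr - far), (frr + far) / 2
    return eer
```

A successful attack pushes nontarget (impostor) scores above target scores, which drives the crossing point, and hence the EER, upward, exactly the 8.87% → 66.02% jump in Table 1.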
3.4. TERA setting

We use the TERA implementation from the S3PRL speech toolkit. We use a time alteration width W_T of , a channel alteration width W_C of , and a magnitude alteration probability P_N of . A time alteration of frames corresponds to ms of speech, which is in the range of average phoneme durations. According to [28], we set W_C as for better reconstruction. Setting W_C and W_T too large makes reconstruction hard for the self-supervised learning model. The rest of the alteration policy follows the original design of TERA [23]. In order to evaluate our proposed method in the scenario where adversaries are aware of TERA, we pretrain two TERA models with identical settings except for a unique random seed, denoted as TERA0 and TERA1, respectively. Each model consists of 3 layers of Transformer encoders with multi-head self-attention [29], followed by a feed-forward prediction network. The dataset adopted to pretrain the TERA models is Voxceleb2 [7]. The input to the model is 24-dim MFCC extracted by standard Kaldi [30] scripts. We use the Adam optimizer [31] with mini-batches of size 128 to find model parameters that minimize the L1 loss of the TERA pretraining task. The model is trained for 30K steps, where the learning rate is warmed up over the first 7% of steps to a peak value and then linearly decayed. If not specified otherwise, other pretraining settings follow the TERA paper [23].

4. EXPERIMENTAL RESULTS

4.1. Adversaries are unaware of TERA

This subsection assumes that attackers have access to the complete ASV model parameters while they are unaware of the cascaded TERA models in the frontend of the ASV system. This setting is the most practical one in the real world, because it is unrealistic for attackers to learn everything about the target models through querying the API. In this setting, we adopt TERA0 as a basic element and duplicate it different numbers of times to be placed in front of the ASV system.
Defense performance is evaluated when the ASV system is integrated with different numbers of TERA models, as shown in Fig. 2.
Fig. 2. The r-vector system's EER (%) for ASV integrated with different numbers of cascaded TERA models.

We observe that for adversarial speech, the integration of TERA models can dramatically decrease the EER of the attacked system from over 65% to around 20%, which indicates that the integration of TERA models can purify the adversarial signals and mitigate the attack effectiveness. We also observe that the performance of ASV with genuine inputs drops due to the imperfect reconstruction of the TERA models. Possible solutions are either improving the reconstruction ability of TERA or using reconstructed inputs to finetune ASV systems, which will be investigated in future work.

As investigated in [19], some hand-crafted filters also have the ability to purify adversarial signals and alleviate the destructiveness of adversarial attacks. In this work, we leverage three filters, i.e. Gaussian, median and mean filters, positioned in front of the ASV system to defend against adversarial attacks, and set them as our baseline. Table 2 illustrates the system EER for genuine and adversarial inputs when the ASV system is integrated with the cascaded TERA models, the Gaussian filter, the median filter and the mean filter. We observe that all filters have the ability to purify adversarial signals given adversarial inputs. The attack effectiveness is degraded by over 50% after integration with these filters. However, due to the additional noise caused by the filtering process, all filters also degrade the system performance on genuine inputs. Notice that the proposed TERA models outperform the other filters with respect to both purifying adversarial signals within adversarial inputs and preserving ASV performance on genuine inputs.
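The hand-crafted baselines can be sketched with standard smoothing filters applied to the feature matrix; the filter sizes and sigma below are illustrative choices, as the paper does not report its filter hyperparameters:

```python
import numpy as np
from scipy.ndimage import gaussian_filter, median_filter, uniform_filter

def filter_baseline(feats, kind="median", size=3):
    """Hand-crafted filtering baselines placed in front of the ASV system.

    Smooths a (time, channel) feature matrix to suppress small
    adversarial perturbations, at the cost of also blurring genuine
    speech features (hence the degraded gen-input EER in Table 2).
    """
    if kind == "gaussian":
        return gaussian_filter(feats, sigma=1.0)
    if kind == "mean":
        return uniform_filter(feats, size=size)
    return median_filter(feats, size=size)
```

For example, a single large outlier entry in an otherwise constant feature matrix is completely removed by the median filter, which is the same mechanism that blunts small adversarial perturbations.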
Table 2. The system's EER (%) for genuine and adversarial inputs when integrating ASV with TERA, Gaussian, median and mean filters. (NA means nothing is positioned in front of ASV.)

            NA      10*TERA0   Gaussian   median   mean
gen-input   8.87    17.32      30.30      27.06    27.71
adv-input   66.02   22.94      31.60      29.65    29.44

4.2. Adversaries are aware of TERA

This subsection gives a case study to show the robustness of our defense approach in a more severe attacking scenario, where attackers not only have access to the entire ASV parameters, but are also aware of the TERA models in front of the ASV system. We assume that attackers know the training strategy of the TERA model, and pretrain a substitute TERA model (denoted as TERA1) to be placed in front of the ASV system to generate adversarial samples. In real-world applications, it is hard for attackers to know the specific number of TERA models in front of the ASV system, so in this work we only integrate ASV with one TERA1 model to generate adversarial samples. The attacking process is identically configured as the BIM attack above, with perturbation degree ε = 0.3. Table 3 illustrates the system performance when ASV is integrated with different numbers of TERA0 models, given as inputs the adversarial samples generated by the integration of ASV and one TERA1 model. We observe that even though attackers know the training setting of the TERA models in front of ASV, the attack destructiveness is still alleviated by integrating ASV with more TERA models. Moreover, based on our experiments, performing white-box attacks on ASV integrated with TERA models is memory- and computation-consuming. Hence, placing a sufficient number of TERA models in front of ASV could be a good option to defend against adversarial attacks.

Table 3. The r-vector system's EER (%) when integrating ASV with different numbers of TERA0 models.

          NA      1*TERA0   2*TERA0   3*TERA0
EER (%)   54.55   53.68     47.62     40.69
5. CONCLUSION
This work proposes integrating ASV with cascaded TERA models for defense against adversarial attacks. We conduct experiments in two attacking scenarios, depending on whether the adversaries are aware of the TERA models or not. The scenario where attackers are unaware of the TERA models is the more practical one, and experimental results indicate that introducing cascaded TERA models as a deep filter can purify the adversarial signals and mitigate the attack destructiveness. Also, our proposed method outperforms hand-crafted filters with respect to both decontaminating adversarial signals within adversarial inputs and preserving ASV performance on genuine inputs. For the other scenario, where attackers are aware of the TERA models, experimental results verify that integrating ASV with more TERA models is still effective at alleviating adversarial noise, even though attackers utilize the TERA information to generate adversarial samples. Merely preserving better performance on genuine samples than hand-crafted filters is not good enough, so we will attempt to tackle this problem in future work.
6. ACKNOWLEDGEMENT
This work was done when H. Wu was a visiting student at Shenzhen International Graduate School, Tsinghua University. H. Wu and A. Liu are supported by the Frontier Speech Technology Scholarship of National Taiwan University. A. Liu is supported by ASUS AICS. Xu Li is supported by the HKSAR Government's Research Grants Council General Research Fund (Project No. 14208718).

7. REFERENCES

[1] Y. Lei, N. Scheffer, L. Ferrer, and M. McLaren, "A novel scheme for speaker recognition using a phonetically-aware deep neural network," in . IEEE, 2014, pp. 1695–1699.
[2] P. Kenny, "A small footprint i-vector extractor," in Odyssey 2012 - The Speaker and Language Recognition Workshop, 2012.
[3] N. Dehak, P. J. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, "Front-end factor analysis for speaker verification," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 4, pp. 788–798, 2010.
[4] D. Garcia-Romero and C. Y. Espy-Wilson, "Analysis of i-vector length normalization in speaker recognition systems," in Twelfth Annual Conference of the International Speech Communication Association, 2011.
[5] D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur, "X-vectors: Robust DNN embeddings for speaker recognition," in . IEEE, 2018, pp. 5329–5333.
[6] X. Li, J. Zhong, J. Yu, S. Hu, X. Wu, X. Liu, and H. Meng, "Bayesian x-vector: Bayesian neural network based x-vector system for speaker verification," arXiv preprint arXiv:2004.04014, 2020.
[7] J. S. Chung, A. Nagrani, and A. Zisserman, "Voxceleb2: Deep speaker recognition," arXiv preprint arXiv:1806.05622, 2018.
[8] N. Li, D. Tuo, D. Su, Z. Li, D. Yu, and A. Tencent, "Deep discriminative embeddings for duration robust speaker verification," in Interspeech, 2018, pp. 2262–2266.
[9] J. Yamagishi, M. Todisco, M. Sahidullah, H. Delgado, X. Wang, N. Evans, T. Kinnunen, K. A. Lee, V. Vestman, and A. Nautsch, "ASVspoof 2019: The 3rd automatic speaker verification spoofing and countermeasures challenge database," 2019.
[10] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus, "Intriguing properties of neural networks," arXiv preprint arXiv:1312.6199, 2013.
[11] F. Kreuk, Y. Adi, M. Cisse, and J. Keshet, "Fooling end-to-end speaker verification with adversarial examples," in . IEEE, 2018, pp. 1962–1966.
[12] X. Li, J. Zhong, X. Wu, J. Yu, X. Liu, and H. Meng, "Adversarial attacks on GMM i-vector based speaker verification systems," in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 6579–6583.
[13] Y. Xie, C. Shi, Z. Li, J. Liu, Y. Chen, and B. Yuan, "Real-time, universal, and robust adversarial attacks against speaker recognition systems," in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 1738–1742.
[14] Q. Wang, P. Guo, and L. Xie, "Inaudible adversarial perturbations for targeted attack in speaker recognition," arXiv preprint arXiv:2005.10637, 2020.
[15] R. K. Das, X. Tian, T. Kinnunen, and H. Li, "The attacker's perspective on automatic speaker verification: An overview," arXiv preprint arXiv:2004.08849, 2020.
[16] N. Carlini and D. Wagner, "Audio adversarial examples: Targeted attacks on speech-to-text," in . IEEE, 2018, pp. 1–7.
[17] S. Liu, H. Wu, H.-y. Lee, and H. Meng, "Adversarial attacks on spoofing countermeasures of automatic speaker verification," arXiv preprint arXiv:1910.08716, 2019.
[18] Q. Wang, P. Guo, S. Sun, L. Xie, and J. H. Hansen, "Adversarial regularization for end-to-end robust speaker verification," in Interspeech, 2019, pp. 4010–4014.
[19] H. Wu, S. Liu, H. Meng, and H.-y. Lee, "Defense against adversarial attacks on spoofing countermeasures of ASV," arXiv preprint arXiv:2003.03065, 2020.
[20] X. Li, N. Li, J. Zhong, X. Wu, X. Liu, D. Su, D. Yu, and H. Meng, "Investigating robustness of adversarial samples detection for automatic speaker verification," arXiv preprint arXiv:2006.06186, 2020.
[21] H. Wu, A. T. Liu, and H.-y. Lee, "Defense for black-box attacks on anti-spoofing models by self-supervised learning," arXiv preprint arXiv:2006.03214, 2020.
[22] A. T. Liu, S.-w. Yang, P.-H. Chi, P.-c. Hsu, and H.-y. Lee, "Mockingjay: Unsupervised speech representation learning with deep bidirectional transformer encoders," in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2020.
[23] A. T. Liu, S.-W. Li, and H.-y. Lee, "TERA: Self-supervised learning of transformer encoder representation for speech," 2020.
[24] H. Zeinali, S. Wang, A. Silnova, P. Matějka, and O. Plchot, "BUT system description to VoxCeleb speaker recognition challenge 2019," arXiv preprint arXiv:1910.12592, 2019.
[25] X. Xiang, S. Wang, H. Huang, Y. Qian, and K. Yu, "Margin matters: Towards more discriminative deep neural network embeddings for speaker recognition," in . IEEE, 2019, pp. 1652–1656.
[26] A. Kurakin, I. Goodfellow, and S. Bengio, "Adversarial machine learning at scale," arXiv preprint arXiv:1611.01236, 2016.
[27] A. Nagrani, J. S. Chung, and A. Zisserman, "Voxceleb: A large-scale speaker identification dataset," arXiv preprint arXiv:1706.08612, 2017.
[28] D. S. Park, W. Chan, Y. Zhang, C.-C. Chiu, B. Zoph, E. D. Cubuk, and Q. V. Le, "SpecAugment: A simple data augmentation method for automatic speech recognition," arXiv preprint arXiv:1904.08779, 2019.
[29] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, "Attention is all you need," 2017.
[30] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, J. Silovsky, G. Stemmer, and K. Vesely, "The Kaldi speech recognition toolkit," in ASRU, 2011.
[31] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980.