Adversarial Attack and Defense Strategies for Deep Speaker Recognition Systems
Arindam Jati, Chin-Cheng Hsu, Monisankha Pal, Raghuveer Peri, Wael AbdAlmageed, Shrikanth Narayanan
Arindam Jati†∗ ([email protected]) · Chin-Cheng Hsu†∗ ([email protected]) · Monisankha Pal† ([email protected]) · Raghuveer Peri† ([email protected]) · Wael AbdAlmageed†§ ([email protected]) · Shrikanth Narayanan†§ ([email protected])
† Electrical and Computer Engineering, University of Southern California (USC), Los Angeles, CA, USA
§ USC Information Sciences Institute, Marina del Rey, CA, USA
August 19, 2020

Abstract
Robust speaker recognition, including in the presence of malicious attacks, is becoming increasingly important and essential, especially due to the proliferation of smart speakers and personal agents that interact with an individual's voice commands to perform diverse, and even sensitive, tasks. Adversarial attack is a recently revived domain which is shown to be effective in breaking deep neural network-based classifiers, specifically by forcing them to change their posterior distribution by perturbing the input samples only by a very small amount. Although significant progress in this realm has been made in the computer vision domain, advances within speaker recognition are still limited. The present expository paper considers several state-of-the-art adversarial attacks to a deep speaker recognition system, employs strong defense methods as countermeasures, and reports on several ablation studies to obtain a comprehensive understanding of the problem. The experiments show that the speaker recognition systems are vulnerable to adversarial attacks, and the strongest attacks can reduce the accuracy of the system from 94% to even 0%. The study also compares the performance of the employed defense methods in detail, and finds adversarial training based on Projected Gradient Descent (PGD) to be the best defense method in our setting. We hope that the experiments presented in this paper provide baselines that can be useful for the research community interested in further studying adversarial robustness of speaker recognition systems.

Keywords: Adversarial attack · Deep neural network · Speaker recognition
Deep learning models have recently been found to be vulnerable to adversarial attacks [1, 2], where the attacker potentially discovers blind spots in the model and crafts adversarial samples that are only slightly different from the original samples, causing the trained model to fail to classify them correctly or to perform any other inference task on them. Over the last few years, several researchers have devoted significant effort to devising novel adversarial attack algorithms [3, 4, 5, 6], proposing defensive countermeasures to gain robustness [3, 4], and demonstrating exploratory analyses [6, 7, 8].

∗ Authors contributed equally.
Adversarial attack on speech processing systems.
With the rapid increase in the incorporation of Deep Neural Networks (DNN) within speech processing applications like Automatic Speech Recognition (ASR) [9, 10], speaker recognition [11, 12, 13, 14], and speech emotion and behavior studies [15, 16], it is becoming essential to study the probable weaknesses of the employed models in the presence of adversarial attacks. In [17], the authors have shown that it is possible to achieve very high success rates in attacking deep ASR systems. In [18], the authors have successfully generated adversarial audio samples that are imperceptible to humans while retaining a high attack success rate. These studies highlight the vulnerability of deep ASR models against adversarial attacks.
Adversarial attack on speaker recognition systems.
Speaker recognition models are being widely employed in several applications including smart speakers and personal digital assistants [19, 11], biometric systems [20], and forensics [21]. Therefore, having robust speaker recognition models that are not susceptible to adversarial perturbation is an important requirement. However, speaker recognition models have not been investigated extensively in the presence of adversarial attacks. Some initial work can be found in the literature (please refer to Section 3), but a detailed analysis of white-box attacks (discussed in Section 2.2) with state-of-the-art attack algorithms is difficult to find. Moreover, to the best of our knowledge, effective defensive countermeasures for those attacks have not been proposed. The present work aims to address these issues in particular.
Contributions.
This paper focuses on adversarial attacks and possible countermeasures for deep speaker recognition systems, with the following contributions.

• In contrast to previous works in this field (discussed in Section 3), we perform adversarial attacks directly on the time-domain speech signal (and not on the spectrogram), which is more realistic in real-life scenarios.
• We provide an extensive analysis of the effect of multiple state-of-the-art white-box adversarial attacks on a DNN-based speaker recognition model.
• We propose multiple defensive countermeasures for the deep speaker recognition system, and analyze their performance.
• We perform transferability analysis [8] to investigate how adversarial speech crafted with a particular model can also be harmful to a different model.
• We present various ablation studies (e.g., varying the strength of the attack, measuring the signal-to-noise ratio (SNR) and perceptibility of the adversarial speech samples, etc.) that might be helpful to gain a comprehensive understanding of the problem.
• We share a ready-to-run software implementation of the present work toward supporting reproducibility and further research (source code is available at https://github.com/usc-sail/gard-adversarial-speaker-id).

We aim to set baselines in the present exposition study, and hope it can help the community interested in continuing further research in this domain.

Paper outline.
The rest of the paper is organized as follows. In Section 2, we provide preliminaries about speaker recognition and adversarial attack. In Section 3, we highlight the related work. The adversarial attack algorithms and defense strategies are introduced in Section 4. Experimental settings and results are described in Section 5 and Section 6, respectively. Finally, conclusions and future directions are provided in Section 7.
Speaker recognition systems can be developed either for identification or verification [11] of individuals from their speech. In a closed-set speaker identification scenario [11, 14], we are provided with train and test utterances from a set of unique speakers. The task is to train a model that, given a test utterance, can classify it to one of the training speakers. Speaker verification [13, 12], on the other hand, is an open-set problem. The task is to verify whether a test utterance claiming a particular speaker's identity is actually spoken by that speaker (whose enrollment utterance is available beforehand). The training data in the latter case generally consist of utterances from a mutually exclusive set of speakers. Although speaker verification differs from speaker identification during the testing phase, most of the recent state-of-the-art speaker verification systems [13, 22, 12, 23] are trained with the objective of learning to classify the set of
training speakers. In other words, these models are trained with a cross-entropy objective over the unique set of training speakers (i.e., similar to a speaker identification scenario). Formally, if $\mathbf{x} \in \mathbb{R}^D$ denotes a time-domain audio sample with speaker label $y$, then learning a speaker identifier model is generally done through Empirical Risk Minimization (ERM) [4]:

$$\arg\min_{\theta} \; \mathbb{E}_{(\mathbf{x}, y) \sim \mathcal{D}} \big[ \mathcal{L}(\mathbf{x}, y, \theta) \big], \qquad (1)$$

where $\mathcal{L}(\cdot)$ is the cross-entropy objective, and $\theta$ denotes the set of trainable parameters of the DNN. An intermediate representation of the trained DNN model might be subsequently extracted as a speaker embedding [13], which is expected to carry speaker-specific information. The speaker embeddings are then utilized for verification purposes. Because of this widespread use, in this study we work with a closed-set speaker identification (or classification) model. The findings of this study can motivate future research on the open-set speaker verification task (see Section 7 for future directions).

Given an audio sample $\mathbf{x}$, an adversarial attack generates a perturbed signal given by

$$\tilde{\mathbf{x}} = \mathbf{x} + \boldsymbol{\eta} \quad \text{such that} \quad \|\boldsymbol{\eta}\|_p < \epsilon, \qquad (2)$$

with the goal of forcing the classifier to produce erroneous output for $\tilde{\mathbf{x}}$. In other words, if $\mathbf{x}$ has a true label $y$, then the attacker forces the classifier to produce $\tilde{y} \neq y$ for the perturbed sample $\tilde{\mathbf{x}}$. In this paper, we focus on the $l_\infty$ and $l_2$ norms, which are the most widely employed in the literature.

We explore white-box [8] attacks in this study. This assumes that the attacker has complete knowledge of the model architecture, parameters, loss functions, and gradients. We adopt this stronger form of attack (compared to black-box attack [8]) because it does not assume that any part of the model can be kept hidden from the attacker, and it is the most frequently employed threat model in the adversarial attack literature [3, 4, 5, 6]. Adversarial attacks can be targeted or untargeted [8]. An untargeted attack only forces the model to generate erroneous outputs, whereas a targeted attack forces the model to predict a target class which is different from the true class. We perform untargeted attacks in this study, and leave targeted attacks for future study (see Section 7).

Although most of the experiments in this paper are with white-box attacks, we study the transferability of adversarial samples in Section 6.6, which gives us a notion of performance during a black-box attack as well. The transferability test [8, 24] evaluates the vulnerability of a target model against the adversarial samples generated with a source model. The attacker has full knowledge about the source model, but no or limited knowledge about the target model (for example, knowledge about the fact that both source and target have convolutional layers). The goal of the attacker is to generate adversarial samples (with the source model) in such a way that they "transfer well" to the target model, i.e., those samples also make the target model vulnerable.
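To make the threat model concrete, the sketch below checks whether a candidate perturbation respects the $l_\infty$ constraint of Equation (2) and whether an untargeted attack succeeded. This is only an illustrative PyTorch-style helper; the function and argument names are our own and not part of the paper's released code.

```python
import torch

def untargeted_attack_report(model, x, x_adv, y, epsilon):
    """Check the threat model of Eq. (2) and untargeted attack success.

    x, x_adv: (batch, num_samples) clean and perturbed waveforms.
    y:        (batch,) true speaker labels.
    """
    # The perturbation must stay inside the l_inf ball of radius epsilon.
    eta = x_adv - x
    within_ball = eta.abs().amax(dim=-1) < epsilon        # (batch,) bool
    # An untargeted attack succeeds whenever the prediction differs from y.
    with torch.no_grad():
        pred = model(x_adv).argmax(dim=-1)
    fooled = pred != y
    return within_ball, fooled
```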
This section describes key previous work on adversarial attack and defense methods proposed for speaker recognition systems.

• Li et al. [25] showed that an i-vector [26] based speaker verification system is susceptible to adversarial attacks, and that the adversarial samples generated with the i-vector system also transfer well to a DNN-based x-vector [13] system (i-vectors had been the state of the art in speaker verification for a decade until DNN-based x-vectors were shown to outperform them [23, 13]). The attack was performed on the feature space (and not directly on the time-domain speech signal), and only the Fast Gradient Sign Method (FGSM) [3] (discussed further in Section 4) was investigated for that purpose. Moreover, no defense method was proposed.
• Kreuk et al. [27] demonstrated the vulnerability of an end-to-end DNN-based speaker verification system to the FGSM attack. The attack was done on the feature space, and the authors discovered cross-feature transferability of the adversarial samples. No defense method was proposed in the paper.
• Chen et al. [28] proposed a Natural Evolution Strategy (NES) based adversarial sample generation procedure, and successfully attacked a GMM-UBM system and i-vector based speaker recognition systems (GMM-UBM stands for Gaussian Mixture Model-Universal Background Model, a classical model in speaker recognition [29]). They found impressive attack success rates with their proposed method. However, the authors did not attack more recent DNN-based speaker recognition frameworks, which have been shown to achieve state-of-the-art performance. Moreover, the test set involved in their experiments included only a small number of speakers (TABLE I of [28]), and thus an extensive study with a much higher number of test speakers is still needed.
• Wang et al. [30] proposed adversarial regularization based defense methods using FGSM and Local Distributional Smoothness (LDS) [31] techniques. The proposed method was shown to improve the performance of a speaker verification system, but only FGSM was employed as the attack algorithm, and, similar to most of the above methods, the attack was performed on the feature space and not on the time-domain audio.

In summary, although these studies represent important initial efforts on adversarial attacks on speaker recognition systems, many technical questions still remain to be addressed. Limitations include consideration of primarily feature-space attacks [25, 27, 30] (and not time domain), a limited number of attack algorithms [25, 27, 28, 30], a limited number of speakers in the test set [28], and no or a limited number of defense methods [25, 27, 28]. The present exposition study aims to address some of these limitations by reporting extensive experimental analysis and ablation studies, and by proposing and evaluating various defense methods.
A group of gradient-based attack algorithms tries to maximize the loss function by finding a suitable perturbation which lies inside the $l_p$-ball around $\mathbf{x}$. Formally,

$$\max_{\boldsymbol{\eta}: \|\boldsymbol{\eta}\|_p < \epsilon} \mathcal{L}(\mathbf{x} + \boldsymbol{\eta}, y, \theta). \qquad (3)$$

A different group of algorithms aims at decreasing the posterior of the true output class, and increasing the posterior of the most confusing wrong class. Here we present the attack algorithms we employ in our study.

Fast Gradient Sign Method (FGSM).
Goodfellow et al. [3] proposed this computationally efficient one-step $l_\infty$ attack that generates adversarial samples by using only the sign of the gradient and moving in the direction of the gradient to increase the loss:

$$\tilde{\mathbf{x}} = \mathbf{x} + \epsilon \, \mathrm{sign}\big( \nabla_{\mathbf{x}} \mathcal{L}(\mathbf{x}, y, \theta) \big). \qquad (4)$$
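As a rough illustration of Equation (4) (the experiments in this paper rely on the Adversarial Robustness Toolbox [33] rather than custom attack code), a one-step FGSM perturbation of a batch of raw waveforms could be computed as follows; the PyTorch style and the `model`, `x`, `y` names are assumptions for the sketch.

```python
import torch
import torch.nn.functional as F

def fgsm_attack(model, x, y, epsilon):
    """One-step l_inf FGSM attack on raw waveforms (Eq. 4)."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    # Gradient of the loss with respect to the input waveform only.
    grad = torch.autograd.grad(loss, x_adv)[0]
    # Step in the direction of the gradient sign to increase the loss.
    return (x_adv + epsilon * grad.sign()).detach()
```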
Projected Gradient Descent (PGD). Madry et al. [4] proposed a more generalized, iterative gradient-based $l_\infty$ attack:

$$\tilde{\mathbf{x}}_{i+1} = \Pi_{\mathbf{x} + S} \big[ \tilde{\mathbf{x}}_i + \alpha \, \mathrm{sign}\big( \nabla_{\mathbf{x}} \mathcal{L}(\mathbf{x}, y, \theta) \big) \big], \qquad (5)$$

where $\alpha$ is the step size of the gradient descent update, $\mathbf{x} + S$ is the set of allowed perturbations, i.e., the $l_\infty$-ball around $\mathbf{x}$, and $\Pi_{\mathbf{x}+S}$ denotes the constrained projection operation in a standard PGD optimization algorithm. PGD is run for a fixed maximum number of iterations, $T$. Throughout the text, we denote PGD run for $T$ iterations by "PGD-$T$".

Carlini and Wagner attack (Carlini $l_2$ and Carlini $l_\infty$). Carlini and Wagner [6] defined the general methodology of their attack as

$$\text{minimize} \quad \|\boldsymbol{\eta}\|_p + c \cdot g(\tilde{\mathbf{x}}) \quad \text{such that} \quad \tilde{\mathbf{x}} \in [0, 1]^D. \qquad (6)$$

Here, $g(\cdot)$ defines the objective function given by

$$g(\tilde{\mathbf{x}}) = \Big[ Z(\tilde{\mathbf{x}})_t - \max_{j \neq t} \big( Z(\tilde{\mathbf{x}})_j \big) + \delta \Big]_+, \qquad (7)$$

where $Z(\cdot)$ is the output vector containing the posterior probabilities for all the classes, $t$ denotes the output node corresponding to the true class $y$, $\delta$ is the confidence margin parameter, and $[\cdot]_+$ denotes the $\max(\cdot, 0)$ function. Intuitively, the attack tries to maximize the posterior probability of a class that is not the true class of $\mathbf{x}$, but has the highest posterior among all the wrong classes. The norm can be either $l_2$ or $l_\infty$. For the Carlini $l_\infty$ attack, the minimization of $\|\boldsymbol{\eta}\|_\infty$ is not straightforward due to non-differentiability, and an iterative procedure is employed in [6]. We suggest that readers refer to [6] for detailed information about the iterative workaround for the $l_\infty$ attack, and also for choosing the values of the weight parameter $c$.
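Returning to Equation (5), the iterative PGD attack can be sketched as below, with an explicit projection back onto the $\epsilon$-ball around the clean waveform after every step. Again, this is only an illustration of the update rule under the same assumptions as the FGSM sketch, not the toolbox implementation used for the reported experiments.

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, epsilon, alpha, num_iter):
    """Iterative l_inf PGD attack (Eq. 5), i.e., PGD-T with T = num_iter."""
    x_adv = x.clone().detach()
    for _ in range(num_iter):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        # Gradient ascent step followed by projection onto the epsilon-ball.
        x_adv = x_adv.detach() + alpha * grad.sign()
        x_adv = x + torch.clamp(x_adv - x, min=-epsilon, max=epsilon)
    return x_adv.detach()
```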
Adversarial training. The intuition here is to train the model on adversarial samples generated by a certain adversarial attack. The adversarial samples are generated online using the training data and the current model parameters. Madry et al. [4] introduced the generalized notion of adversarial training as a mini-max optimization:

$$\arg\min_{\theta} \; \mathbb{E}_{(\mathbf{x}, y) \sim \mathcal{D}} \Big[ \max_{\boldsymbol{\eta}: \|\boldsymbol{\eta}\|_p < \epsilon} \mathcal{L}(\mathbf{x} + \boldsymbol{\eta}, y, \theta) \Big]. \qquad (8)$$

The inner maximization task is addressed by the attack algorithm utilized during adversarial training, and the outer minimization is the standard ERM (Equation (1)) employed to train the model parameterized by $\theta$. We separately apply both the one-step FGSM (Equation (4)) and the $T$-step PGD (Equation (5)) algorithms to solve the inner maximization problem. Throughout the remaining text, we refer to these as "FGSM adversarial training" and "PGD-$T$ adversarial training", respectively. Notably, the overall training is done on clean as well as adversarial samples. The overall loss function is given by

$$\mathcal{L}_{AT}(\mathbf{x}, \tilde{\mathbf{x}}, y, \theta) = (1 - w_{AT}) \cdot \mathcal{L}(\mathbf{x}, y, \theta) + w_{AT} \cdot \mathcal{L}(\tilde{\mathbf{x}}, y, \theta), \qquad (9)$$

where $w_{AT}$ is the weight of the adversarial training.
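For concreteness, one adversarial-training step that combines the clean and adversarial losses as in Equation (9) might look as follows. It reuses the hypothetical `pgd_attack` helper sketched earlier and is not the authors' exact training loop.

```python
import torch.nn.functional as F

def adversarial_training_step(model, optimizer, x, y,
                              epsilon, alpha, num_iter, w_at=0.5):
    """One step of PGD adversarial training (Eqs. 8 and 9)."""
    # Inner maximization: craft adversarial samples with the current parameters.
    x_adv = pgd_attack(model, x, y, epsilon, alpha, num_iter)
    # Outer minimization: weighted cross-entropy on clean and adversarial samples.
    optimizer.zero_grad()
    loss = ((1.0 - w_at) * F.cross_entropy(model(x), y)
            + w_at * F.cross_entropy(model(x_adv), y))
    loss.backward()
    optimizer.step()
    return loss.item()
```

Setting `w_at = 0.5` weights clean and adversarial samples equally, matching the minibatch construction described in the experimental settings below.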
Adversarial Lipschitz Regularization (ALR). This approach to gaining robustness is based on learning a function that is not very sensitive to a small change in the input. In other words, if we can learn a relatively smooth function, then the posterior distribution should not vary abruptly if the input perturbation is within the maximum allowed limit. We propose a training strategy equipped with the recently introduced adversarial Lipschitz regularization technique [32]. Similar to the regularization based on local distributional smoothness in Virtual Adversarial Training (VAT) [31], ALR imposes a regularization term defined using Lipschitz smoothness:

$$\|f\|_L = \sup_{\mathbf{x}, \tilde{\mathbf{x}} \in X, \; d_X(\mathbf{x}, \tilde{\mathbf{x}}) > 0} \frac{d_Y\big(f(\mathbf{x}), f(\tilde{\mathbf{x}})\big)}{d_X(\mathbf{x}, \tilde{\mathbf{x}})}, \qquad (10)$$

where $f(\cdot)$ is the function of interest (implemented by the neural network) that maps the input metric space $(X, d_X)$ to the output metric space $(Y, d_Y)$. In our case of speaker classification, we chose $f(\cdot)$ as the final log-posterior output of the network, i.e., $f(\mathbf{x}) = \log p(y \,|\, \mathbf{x}, \theta)$, with norm-induced metrics as $d_Y$ and $d_X$. The adversarial perturbation $\boldsymbol{\eta} = \epsilon \boldsymbol{\eta}_k$ in $\tilde{\mathbf{x}} = \mathbf{x} + \boldsymbol{\eta}$ is approximated by the power iterations

$$\boldsymbol{\eta}_{i+1} = \frac{\nabla_{\boldsymbol{\eta}_i} \, d_Y\big(f(\mathbf{x}), f(\mathbf{x} + \xi \boldsymbol{\eta}_i)\big)}{\big\| \nabla_{\boldsymbol{\eta}_i} \, d_Y\big(f(\mathbf{x}), f(\mathbf{x} + \xi \boldsymbol{\eta}_i)\big) \big\|}, \qquad (11)$$

where $\boldsymbol{\eta}_0$ is randomly initialized, and $\xi$ is another hyperparameter (see Section 5.5). The regularization term added to training is

$$\mathcal{L}_{ALR} = \left[ \frac{d_Y\big(f(\mathbf{x}), f(\tilde{\mathbf{x}})\big)}{d_X(\mathbf{x}, \tilde{\mathbf{x}})} - K \right]_+, \qquad (12)$$

where $K$ is the desired Lipschitz constant we wish to impose.
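A rough sketch of how the ALR penalty of Equations (10)-(12) could be computed for a batch of waveforms is given below. The concrete metric choices ($l_1$ distance between log-posteriors for $d_Y$, $l_2$ distance between waveforms for $d_X$), the single power iteration, and all function and argument names are assumptions made for illustration; this is not the implementation of [32] that the paper follows.

```python
import torch
import torch.nn.functional as F

def alr_penalty(model, x, epsilon, xi=10.0, lipschitz_k=1.0, power_iters=1):
    """Approximate Adversarial Lipschitz Regularization term (Eqs. 10-12)."""
    log_p_clean = F.log_softmax(model(x), dim=-1).detach()
    # Random initial direction, refined by the power iteration of Eq. (11).
    eta = torch.randn_like(x)
    eta = eta / (eta.flatten(1).norm(dim=1).view(-1, 1) + 1e-12)
    for _ in range(power_iters):
        eta.requires_grad_(True)
        d_y = (F.log_softmax(model(x + xi * eta), dim=-1)
               - log_p_clean).abs().sum(dim=-1).sum()
        grad = torch.autograd.grad(d_y, eta)[0]
        eta = grad / (grad.flatten(1).norm(dim=1).view(-1, 1) + 1e-12)
    x_adv = x + epsilon * eta.detach()
    # Ratio of output distance to input distance, penalized above K (Eq. 12).
    d_y = (F.log_softmax(model(x_adv), dim=-1) - log_p_clean).abs().sum(dim=-1)
    d_x = (x_adv - x).flatten(1).norm(dim=1) + 1e-12
    return torch.clamp(d_y / d_x - lipschitz_k, min=0.0).mean()
```

During training, this penalty would be added to the usual cross-entropy loss with a suitable weight.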
We implement the core of most of our attack and defense algorithms (except ALR) through the Adversarial Robustness Toolbox [33]. For ALR, we follow the original implementation of [32]. The rest of the experimental details are described below.

We employ the Librispeech [34] "train-clean-100" subset for all the experiments. It contains approximately 100 hours of clean speech from 251 unique speakers (roughly balanced between female and male speakers), and we utilize all the speakers for our experiments. For every speaker, we employ a fixed portion of the utterances for training the classifier, and the remaining utterances for testing. The train-test split is deterministic, and it is kept fixed throughout all the experiments.
We implemented our classifier, f : X → Y , by combining a Convolutional Neural Network (CNN) with a digital signalprocessing (DSP) front-end which is differentiable. The non-trainable
DSP front-end extracts a log Mel-spectrogram, which is viewed as a temporal signal of $F$ channels, where $F$ is the number of Mel frequency bins. The back-end is either of the two DNN models described below. As both modules are differentiable, the adversarial attack schemes introduced in Section 4 can be applied to create time-domain perturbations directly.

The first back-end is a 1D CNN consisting of 8 stacks of convolutional layers that transform the Mel-spectrogram into the label space $Y$. ReLU nonlinearity and batch normalization are used after every convolutional layer, and max-pooling is employed after every alternate layer. The penultimate layer has a dimension of 32, and the model has a total of 219k trainable parameters. We perform most of our analysis with this model, and utilize the following TDNN model for transferability experiments.

The second back-end is a Time Delay Neural Network (TDNN) [23, 13], one of the current state-of-the-art models for speaker recognition. We adopt the model architecture proposed in [13] for the experiments related to transferability analysis (Section 6.6). The model consists of time-dilated convolutional layers along with a statistics pooling module; it has several million trainable parameters, and hence is much larger than the 1D CNN model.

We employ the Adam optimizer [35] with a fixed learning rate and standard values of the exponential decay rates $\beta_1$ and $\beta_2$, and train with a fixed minibatch size. All models (except ALR) are trained until the training accuracy nearly saturates. The ALR training converges much more slowly, and thus all ALR-based experiments are run for a larger number of epochs, after which the training accuracy tends to saturate.

Our main results (Section 6.3) are obtained with a fixed attack strength $\epsilon$ for the $l_\infty$ attacks and a confidence margin $\delta = 0$ for the Carlini $l_2$ attack; the chosen $\epsilon$ yields a reasonably high SNR for the resulting adversarial samples (see Section 6.1). Furthermore, we vary the strength of the different attacks, and the results are shown in Section 6.4. The PGD attack is run for a fixed number of iterations with a step size $\alpha$ set to a fraction of $\epsilon$. For the ALR method, we set the number of power iterations to 1 and the hyperparameter $\xi = 10$, as recommended in [32]. The FGSM- and PGD-based adversarial training algorithms are run with the same value of $\epsilon$; hence, the main results (Section 6.3) employ the same $\epsilon$ in both the attack and the adversarial-training-based defense. The ablation study in Section 6.4 is particularly designed to investigate the effect of using different values of $\epsilon$ during the attack. Specifically, the study varies $\epsilon$ above and below the value set during defense training, and analyzes the effectiveness of the defense method. The PGD adversarial training uses 10 iterations (i.e., PGD-10 as introduced in Section 4.2), although we evaluate it against PGD attacks with a higher number of iterations (Sections 6.3 and 6.5); PGD adversarial training is slow, and we could not afford to run it for more iterations. During adversarial training, we create minibatches containing an equal number of clean and adversarial samples, i.e., in Equation (9) we set $w_{AT} = 0.5$.
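Before turning to the results, the sketch below illustrates how a differentiable log-Mel front-end can be chained with a small 1D CNN back-end so that attack gradients flow all the way back to the raw waveform. The torchaudio-based front-end, layer sizes, and channel counts are assumptions for the sketch and do not reproduce the exact 219k-parameter architecture described above.

```python
import torch
import torch.nn as nn
import torchaudio

class MelCnnSpeakerClassifier(nn.Module):
    """Differentiable log-Mel front-end followed by a small 1D CNN back-end."""

    def __init__(self, num_speakers, n_mels=64, sample_rate=16000):
        super().__init__()
        # Non-trainable but differentiable DSP front-end.
        self.melspec = torchaudio.transforms.MelSpectrogram(
            sample_rate=sample_rate, n_mels=n_mels)
        self.backend = nn.Sequential(
            nn.Conv1d(n_mels, 128, kernel_size=5, padding=2),
            nn.BatchNorm1d(128), nn.ReLU(),
            nn.Conv1d(128, 128, kernel_size=5, padding=2),
            nn.BatchNorm1d(128), nn.ReLU(), nn.MaxPool1d(2),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(),
            nn.Linear(128, num_speakers))

    def forward(self, waveform):
        # waveform: (batch, num_samples) -> log-Mel: (batch, n_mels, frames)
        feats = torch.log(self.melspec(waveform) + 1e-6)
        return self.backend(feats)  # (batch, num_speakers) speaker logits
```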
Attack strength vs. SNR. To gain a better understanding of the strength of the different attack algorithms, we computed the mean Signal-to-Noise Ratio (SNR) of all the test adversarial samples for every level of attack strength. For the $l_\infty$ attacks, $\epsilon$ is varied over four values, and for the Carlini $l_2$ attack the confidence margin $\delta$ is varied over four values starting from 0. The curves for the $l_\infty$ attacks are shown in Figure 1a. There are two important observations. First, the average level of SNR is noticeably higher for Carlini $l_\infty$ than for FGSM and PGD. Second, the SNR level tends to decrease faster with increase in
[Figure 1a: Mean SNR (dB) vs. attack strength for the FGSM, PGD, and Carlini $l_\infty$ attacks.]