Adversarial Attack and Defense Strategies for Deep Speaker Recognition Systems
Arindam Jati, Chin-Cheng Hsu, Monisankha Pal, Raghuveer Peri, Wael AbdAlmageed, Shrikanth Narayanan
Arindam Jati†∗ ([email protected]) · Chin-Cheng Hsu†∗ ([email protected]) · Monisankha Pal† ([email protected]) · Raghuveer Peri† ([email protected]) · Wael AbdAlmageed†§ ([email protected]) · Shrikanth Narayanan†§ ([email protected])
† Electrical and Computer Engineering, University of Southern California (USC), Los Angeles, CA, USA
§ USC Information Sciences Institute, Marina del Rey, CA, USA
August 19, 2020

Abstract
Robust speaker recognition, including in the presence of malicious attacks, is becoming increasingly important and essential, especially due to the proliferation of smart speakers and personal agents that interact with an individual's voice commands to perform diverse, and even sensitive, tasks. Adversarial attack is a recently revived domain which is shown to be effective in breaking deep neural network-based classifiers, specifically by forcing them to change their posterior distribution by perturbing the input samples only by a very small amount. Although significant progress in this realm has been made in the computer vision domain, advances within speaker recognition are still limited. The present expository paper considers several state-of-the-art adversarial attacks to a deep speaker recognition system, employs strong defense methods as countermeasures, and reports on several ablation studies to obtain a comprehensive understanding of the problem. The experiments show that the speaker recognition systems are vulnerable to adversarial attacks, and the strongest attacks can reduce the accuracy of the system from 94% to even 0%. The study also compares the performance of the employed defense methods in detail, and finds adversarial training based on Projected Gradient Descent (PGD) to be the best defense method in our setting. We hope that the experiments presented in this paper provide baselines that can be useful for the research community interested in further studying adversarial robustness of speaker recognition systems.

Keywords: Adversarial attack · Deep neural network · Speaker recognition
Deep learning models have recently been found to be vulnerable to adversarial attacks [1, 2], where the attacker potentially discovers blind spots in the model and crafts adversarial samples that are only slightly different from the original samples, causing the trained model to fail to classify them correctly or to perform any other inference task on them. Over the last few years, several researchers have devoted significant effort to devising novel adversarial attack algorithms [3, 4, 5, 6], proposing defensive countermeasures to gain robustness [3, 4], and demonstrating exploratory analyses [6, 7, 8].

∗ Authors contributed equally.
Adversarial attack on speech processing systems.
With the rapid increase in the incorporation of Deep Neural Networks (DNN) within speech processing applications like Automatic Speech Recognition (ASR) [9, 10], speaker recognition [11, 12, 13, 14], and speech emotion and behavior studies [15, 16], it is becoming essential to study the probable weaknesses of the employed models in the presence of adversarial attacks. In [17], the authors have shown that it is possible to achieve very high success rates in attacking deep ASR systems. In [18], the authors have successfully generated adversarial audio samples that are imperceptible to humans while retaining a high attack success rate. These studies highlight the vulnerability of deep ASR models against adversarial attacks.
Adversarial attack on speaker recognition systems.
Speaker recognition models are being widely employed in several applications including smart speakers and personal digital assistants [19, 11], biometric systems [20], and forensics [21]. Therefore, having robust speaker recognition models that are not susceptible to adversarial perturbation is an important requirement. However, speaker recognition models have not been investigated extensively in the presence of adversarial attacks. Some initial work can be found in the literature (please refer to Section 3), but a detailed analysis of white-box attacks (discussed in Section 2.2) with state-of-the-art attack algorithms is difficult to find. Moreover, to the best of our knowledge, effective defensive countermeasures for those attacks have not been proposed. The present work aims to address these issues in particular.
Contributions.
This paper focuses on adversarial attacks and possible countermeasures for deep speaker recognition systems, with the following contributions.

• In contrast to previous works in this field (discussed in Section 3), we perform adversarial attacks directly on the time-domain speech signal (and not on the spectrogram), which is more realistic in real-life scenarios.
• We provide an extensive analysis of the effect of multiple state-of-the-art white-box adversarial attacks on a DNN-based speaker recognition model.
• We propose multiple defensive countermeasures for the deep speaker recognition system, and analyze their performance.
• We perform transferability analysis [8] to investigate how adversarial speech crafted with a particular model can also be harmful to a different model.
• We present various ablation studies (e.g., varying the strength of the attack, measuring the signal-to-noise ratio (SNR) and perceptibility of the adversarial speech samples, etc.) that might be helpful to gain a comprehensive understanding of the problem.
• We share a ready-to-run software implementation of the present work toward supporting reproducibility and further research (source code is available at https://github.com/usc-sail/gard-adversarial-speaker-id).

We aim to set baselines in the present exposition study, and hope it can help the community interested in continuing further research in this domain.

Paper outline.
The rest of the paper is organized as follows. In Section 2, we provide preliminaries about speaker recognition and adversarial attack. In Section 3, we highlight the related work. The adversarial attack algorithms and defense strategies are introduced in Section 4. Experimental settings and results are described in Section 5 and Section 6, respectively. Finally, conclusions and future directions are provided in Section 7.
Speaker recognition systems can be developed either for identification or verification [11] of individuals from their speech. In a closed-set speaker identification scenario [11, 14], we are provided with train and test utterances from a set of unique speakers. The task is to train a model that, given a test utterance, can classify it to one of the training speakers. Speaker verification [13, 12], on the other hand, is an open-set problem. The task is to verify whether a test utterance claiming a particular speaker's identity is actually spoken by that speaker (whose enrollment utterance is available beforehand). The training data in the latter case generally consist of utterances from a mutually exclusive set of speakers. Although speaker verification differs from speaker identification during the testing phase, most of the recent state-of-the-art speaker verification systems [13, 22, 12, 23] are trained with the objective of learning to classify the set of
training speakers. In other words, these models are trained with a cross-entropy objective over the unique set of training speakers (i.e., similar to a speaker identification scenario). Formally, if $\mathbf{x} \in \mathbb{R}^D$ denotes a time-domain audio sample with speaker label $y$, then learning a speaker identifier model is generally done through Empirical Risk Minimization (ERM) [4]:

$$\arg\min_{\theta} \; \mathbb{E}_{(\mathbf{x}, y) \sim \mathcal{D}} \big[ \mathcal{L}(\mathbf{x}, y, \theta) \big], \qquad (1)$$

where $\mathcal{L}(\cdot)$ is the cross-entropy objective, and $\theta$ denotes the set of trainable parameters of the DNN. An intermediate representation of the trained DNN model might be subsequently extracted as a speaker embedding [13], which is expected to carry speaker-specific information. The speaker embeddings are then utilized for verification purposes. Because of this widespread use, in this study we work with a closed-set speaker identification (or classification) model. The findings of this study can motivate future research on the open-set speaker verification task (see Section 7 for future directions).

Given an audio sample $\mathbf{x}$, an adversarial attack generates a perturbed signal given by

$$\tilde{\mathbf{x}} = \mathbf{x} + \boldsymbol{\eta} \quad \text{such that} \quad \|\boldsymbol{\eta}\|_p < \epsilon, \qquad (2)$$

with the goal of forcing the classifier to produce erroneous output for $\tilde{\mathbf{x}}$. In other words, if $\mathbf{x}$ has a true label $y$, then the attacker forces the classifier to produce $\tilde{y} \neq y$ for the perturbed sample $\tilde{\mathbf{x}}$. In this paper, we focus on the $l_\infty$ and $l_2$ norms, which are the most widely employed in the literature.

We explore white-box [8] attacks in this study. This assumes that the attacker has complete knowledge of the model architecture, parameters, loss functions, and gradients. We adopt this stronger form of attack (compared to black-box attack [8]) because it does not assume that any part of the model can be kept hidden from the attacker, and it is the most frequently employed threat model in the adversarial attack literature [3, 4, 5, 6]. Adversarial attacks can be targeted or untargeted [8]. An untargeted attack only forces the model to generate erroneous outputs, whereas a targeted attack forces the model to predict a target class which is different from the true class. We perform untargeted attacks in this study, and leave targeted attacks for future study (see Section 7).

Although most of the experiments in this paper are with white-box attacks, we study the transferability of adversarial samples in Section 6.6, which gives us a notion of performance during a black-box attack as well. The transferability test [8, 24] evaluates the vulnerability of a target model against the adversarial samples generated with a source model. The attacker has full knowledge about the source model, but no or limited knowledge about the target model (for example, knowledge about the fact that both source and target have convolutional layers). The goal of the attacker is to generate adversarial samples (with the source model) in such a way that they "transfer well" to the target model, i.e., those samples also make the target model vulnerable.
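To make the threat model concrete, the sketch below checks whether a candidate perturbation respects the $l_\infty$ constraint of Equation (2) and whether an untargeted attack succeeded. This is only an illustrative PyTorch-style helper; the function and argument names are our own and not part of the paper's released code.

```python
import torch

def untargeted_attack_report(model, x, x_adv, y, epsilon):
    """Check the threat model of Eq. (2) and untargeted attack success.

    x, x_adv: (batch, num_samples) clean and perturbed waveforms.
    y:        (batch,) true speaker labels.
    """
    # The perturbation must stay inside the l_inf ball of radius epsilon.
    eta = x_adv - x
    within_ball = eta.abs().amax(dim=-1) < epsilon        # (batch,) bool
    # An untargeted attack succeeds whenever the prediction differs from y.
    with torch.no_grad():
        pred = model(x_adv).argmax(dim=-1)
    fooled = pred != y
    return within_ball, fooled
```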
This section describes key previous work on adversarial attack and defense methods proposed for speaker recognition systems.

• Li et al. [25] showed that an i-vector [26] based speaker verification system is susceptible to adversarial attacks, and that the adversarial samples generated with the i-vector system also transfer well to a DNN-based x-vector [13] system (i-vectors had been the state of the art in speaker verification for a decade until DNN-based x-vectors were shown to outperform them [23, 13]). The attack was performed on the feature space (and not directly on the time-domain speech signal), and only the Fast Gradient Sign Method (FGSM) [3] (discussed further in Section 4) was investigated for that purpose. Moreover, no defense method was proposed.
• Kreuk et al. [27] demonstrated the vulnerability of an end-to-end DNN-based speaker verification system to the FGSM attack. The attack was done on the feature space, and the authors discovered cross-feature transferability of the adversarial samples. No defense method was proposed in the paper.
• Chen et al. [28] proposed a Natural Evolution Strategy (NES) based adversarial sample generation procedure, and successfully attacked a GMM-UBM system and i-vector based speaker recognition systems (GMM-UBM stands for Gaussian Mixture Model-Universal Background Model, a classical model in speaker recognition [29]). They found impressive attack success rates with their proposed method. However, the authors did not attack more recent DNN-based speaker recognition frameworks, which have been shown to achieve state-of-the-art performance. Moreover, the test set involved in their experiments included only a small number of speakers (TABLE I of [28]), and thus an extensive study with a much higher number of test speakers is still needed.
• Wang et al. [30] proposed adversarial regularization based defense methods using FGSM and Local Distributional Smoothness (LDS) [31] techniques. The proposed method was shown to improve the performance of a speaker verification system, but only FGSM was employed as the attack algorithm, and, similar to most of the above methods, the attack was performed on the feature space and not on the time-domain audio.

In summary, although these studies represent important initial efforts on adversarial attacks on speaker recognition systems, many technical questions still remain to be addressed. Limitations include consideration of primarily feature-space attacks [25, 27, 30] (and not time domain), a limited number of attack algorithms [25, 27, 28, 30], a limited number of speakers in the test set [28], and no or a limited number of defense methods [25, 27, 28]. The present exposition study aims to address some of these limitations by reporting extensive experimental analysis and ablation studies, and by proposing and evaluating various defense methods.
A group of gradient-based attack algorithms tries to maximize the loss function by finding a suitable perturbation which lies inside the $l_p$-ball around $\mathbf{x}$. Formally,

$$\max_{\boldsymbol{\eta}: \|\boldsymbol{\eta}\|_p < \epsilon} \mathcal{L}(\mathbf{x} + \boldsymbol{\eta}, y, \theta). \qquad (3)$$

A different group of algorithms aims at decreasing the posterior of the true output class, and increasing the posterior of the most confusing wrong class. Here we present the attack algorithms we employ in our study.

Fast Gradient Sign Method (FGSM).
Goodfellow et al. [3] proposed this computationally efficient one-step $l_\infty$ attack that generates adversarial samples by using only the sign of the gradient and moving in the direction of the gradient to increase the loss:

$$\tilde{\mathbf{x}} = \mathbf{x} + \epsilon \, \mathrm{sign}\big( \nabla_{\mathbf{x}} \mathcal{L}(\mathbf{x}, y, \theta) \big). \qquad (4)$$
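As a rough illustration of Equation (4) (the experiments in this paper rely on the Adversarial Robustness Toolbox [33] rather than custom attack code), a one-step FGSM perturbation of a batch of raw waveforms could be computed as follows; the PyTorch style and the `model`, `x`, `y` names are assumptions for the sketch.

```python
import torch
import torch.nn.functional as F

def fgsm_attack(model, x, y, epsilon):
    """One-step l_inf FGSM attack on raw waveforms (Eq. 4)."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    # Gradient of the loss with respect to the input waveform only.
    grad = torch.autograd.grad(loss, x_adv)[0]
    # Step in the direction of the gradient sign to increase the loss.
    return (x_adv + epsilon * grad.sign()).detach()
```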
Projected Gradient Descent (PGD). Madry et al. [4] proposed a more generalized, iterative gradient-based $l_\infty$ attack:

$$\tilde{\mathbf{x}}_{i+1} = \Pi_{\mathbf{x} + S} \big[ \tilde{\mathbf{x}}_i + \alpha \, \mathrm{sign}\big( \nabla_{\mathbf{x}} \mathcal{L}(\mathbf{x}, y, \theta) \big) \big], \qquad (5)$$

where $\alpha$ is the step size of the gradient descent update, $\mathbf{x} + S$ is the set of allowed perturbations, i.e., the $l_\infty$-ball around $\mathbf{x}$, and $\Pi_{\mathbf{x}+S}$ denotes the constrained projection operation in a standard PGD optimization algorithm. PGD is run for a fixed maximum number of iterations, $T$. Throughout the text, we denote PGD run for $T$ iterations by "PGD-$T$".

Carlini and Wagner attack (Carlini $l_2$ and Carlini $l_\infty$). Carlini and Wagner [6] defined the general methodology of their attack as

$$\text{minimize} \quad \|\boldsymbol{\eta}\|_p + c \cdot g(\tilde{\mathbf{x}}) \quad \text{such that} \quad \tilde{\mathbf{x}} \in [0, 1]^D. \qquad (6)$$

Here, $g(\cdot)$ defines the objective function given by

$$g(\tilde{\mathbf{x}}) = \Big[ Z(\tilde{\mathbf{x}})_t - \max_{j \neq t} \big( Z(\tilde{\mathbf{x}})_j \big) + \delta \Big]_+, \qquad (7)$$

where $Z(\cdot)$ is the output vector containing the posterior probabilities for all the classes, $t$ denotes the output node corresponding to the true class $y$, $\delta$ is the confidence margin parameter, and $[\cdot]_+$ denotes the $\max(\cdot, 0)$ function. Intuitively, the attack tries to maximize the posterior probability of a class that is not the true class of $\mathbf{x}$, but has the highest posterior among all the wrong classes. The norm can be either $l_2$ or $l_\infty$. For the Carlini $l_\infty$ attack, the minimization of $\|\boldsymbol{\eta}\|_\infty$ is not straightforward due to non-differentiability, and an iterative procedure is employed in [6]. We suggest that readers refer to [6] for detailed information about the iterative workaround for the $l_\infty$ attack, and also for choosing the values of the weight parameter $c$.
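Returning to Equation (5), the iterative PGD attack can be sketched as below, with an explicit projection back onto the $\epsilon$-ball around the clean waveform after every step. Again, this is only an illustration of the update rule under the same assumptions as the FGSM sketch, not the toolbox implementation used for the reported experiments.

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, epsilon, alpha, num_iter):
    """Iterative l_inf PGD attack (Eq. 5), i.e., PGD-T with T = num_iter."""
    x_adv = x.clone().detach()
    for _ in range(num_iter):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        # Gradient ascent step followed by projection onto the epsilon-ball.
        x_adv = x_adv.detach() + alpha * grad.sign()
        x_adv = x + torch.clamp(x_adv - x, min=-epsilon, max=epsilon)
    return x_adv.detach()
```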
Adversarial training. The intuition here is to train the model on adversarial samples generated by a certain adversarial attack. The adversarial samples are generated online using the training data and the current model parameters. Madry et al. [4] introduced the generalized notion of adversarial training as a mini-max optimization:

$$\arg\min_{\theta} \; \mathbb{E}_{(\mathbf{x}, y) \sim \mathcal{D}} \Big[ \max_{\boldsymbol{\eta}: \|\boldsymbol{\eta}\|_p < \epsilon} \mathcal{L}(\mathbf{x} + \boldsymbol{\eta}, y, \theta) \Big]. \qquad (8)$$

The inner maximization task is addressed by the attack algorithm utilized during adversarial training, and the outer minimization is the standard ERM (Equation (1)) employed to train the model parameterized by $\theta$. We separately apply both the one-step FGSM (Equation (4)) and the $T$-step PGD (Equation (5)) algorithms to solve the inner maximization problem. Throughout the remaining text, we refer to these as "FGSM adversarial training" and "PGD-$T$ adversarial training", respectively. Notably, the overall training is done on clean as well as adversarial samples. The overall loss function is given by

$$\mathcal{L}_{AT}(\mathbf{x}, \tilde{\mathbf{x}}, y, \theta) = (1 - w_{AT}) \cdot \mathcal{L}(\mathbf{x}, y, \theta) + w_{AT} \cdot \mathcal{L}(\tilde{\mathbf{x}}, y, \theta), \qquad (9)$$

where $w_{AT}$ is the weight of the adversarial training.
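For concreteness, one adversarial-training step that combines the clean and adversarial losses as in Equation (9) might look as follows. It reuses the hypothetical `pgd_attack` helper sketched earlier and is not the authors' exact training loop.

```python
import torch.nn.functional as F

def adversarial_training_step(model, optimizer, x, y,
                              epsilon, alpha, num_iter, w_at=0.5):
    """One step of PGD adversarial training (Eqs. 8 and 9)."""
    # Inner maximization: craft adversarial samples with the current parameters.
    x_adv = pgd_attack(model, x, y, epsilon, alpha, num_iter)
    # Outer minimization: weighted cross-entropy on clean and adversarial samples.
    optimizer.zero_grad()
    loss = ((1.0 - w_at) * F.cross_entropy(model(x), y)
            + w_at * F.cross_entropy(model(x_adv), y))
    loss.backward()
    optimizer.step()
    return loss.item()
```

Setting `w_at = 0.5` weights clean and adversarial samples equally, matching the minibatch construction described in the experimental settings below.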
Adversarial Lipschitz Regularization (ALR). This approach to gaining robustness is based on learning a function that is not very sensitive to a small change in the input. In other words, if we can learn a relatively smooth function, then the posterior distribution should not vary abruptly if the input perturbation is within the maximum allowed limit. We propose a training strategy equipped with the recently introduced adversarial Lipschitz regularization technique [32]. Similar to the regularization based on local distributional smoothness in Virtual Adversarial Training (VAT) [31], ALR imposes a regularization term defined using Lipschitz smoothness:

$$\|f\|_L = \sup_{\mathbf{x}, \tilde{\mathbf{x}} \in X, \; d_X(\mathbf{x}, \tilde{\mathbf{x}}) > 0} \frac{d_Y\big(f(\mathbf{x}), f(\tilde{\mathbf{x}})\big)}{d_X(\mathbf{x}, \tilde{\mathbf{x}})}, \qquad (10)$$

where $f(\cdot)$ is the function of interest (implemented by the neural network) that maps the input metric space $(X, d_X)$ to the output metric space $(Y, d_Y)$. In our case of speaker classification, we chose $f(\cdot)$ as the final log-posterior output of the network, i.e., $f(\mathbf{x}) = \log p(y \,|\, \mathbf{x}, \theta)$, with norm-induced metrics as $d_Y$ and $d_X$. The adversarial perturbation $\boldsymbol{\eta} = \epsilon \boldsymbol{\eta}_k$ in $\tilde{\mathbf{x}} = \mathbf{x} + \boldsymbol{\eta}$ is approximated by the power iterations

$$\boldsymbol{\eta}_{i+1} = \frac{\nabla_{\boldsymbol{\eta}_i} \, d_Y\big(f(\mathbf{x}), f(\mathbf{x} + \xi \boldsymbol{\eta}_i)\big)}{\big\| \nabla_{\boldsymbol{\eta}_i} \, d_Y\big(f(\mathbf{x}), f(\mathbf{x} + \xi \boldsymbol{\eta}_i)\big) \big\|}, \qquad (11)$$

where $\boldsymbol{\eta}_0$ is randomly initialized, and $\xi$ is another hyperparameter (see Section 5.5). The regularization term added to training is

$$\mathcal{L}_{ALR} = \left[ \frac{d_Y\big(f(\mathbf{x}), f(\tilde{\mathbf{x}})\big)}{d_X(\mathbf{x}, \tilde{\mathbf{x}})} - K \right]_+, \qquad (12)$$

where $K$ is the desired Lipschitz constant we wish to impose.
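A rough sketch of how the ALR penalty of Equations (10)-(12) could be computed for a batch of waveforms is given below. The concrete metric choices ($l_1$ distance between log-posteriors for $d_Y$, $l_2$ distance between waveforms for $d_X$), the single power iteration, and all function and argument names are assumptions made for illustration; this is not the implementation of [32] that the paper follows.

```python
import torch
import torch.nn.functional as F

def alr_penalty(model, x, epsilon, xi=10.0, lipschitz_k=1.0, power_iters=1):
    """Approximate Adversarial Lipschitz Regularization term (Eqs. 10-12)."""
    log_p_clean = F.log_softmax(model(x), dim=-1).detach()
    # Random initial direction, refined by the power iteration of Eq. (11).
    eta = torch.randn_like(x)
    eta = eta / (eta.flatten(1).norm(dim=1).view(-1, 1) + 1e-12)
    for _ in range(power_iters):
        eta.requires_grad_(True)
        d_y = (F.log_softmax(model(x + xi * eta), dim=-1)
               - log_p_clean).abs().sum(dim=-1).sum()
        grad = torch.autograd.grad(d_y, eta)[0]
        eta = grad / (grad.flatten(1).norm(dim=1).view(-1, 1) + 1e-12)
    x_adv = x + epsilon * eta.detach()
    # Ratio of output distance to input distance, penalized above K (Eq. 12).
    d_y = (F.log_softmax(model(x_adv), dim=-1) - log_p_clean).abs().sum(dim=-1)
    d_x = (x_adv - x).flatten(1).norm(dim=1) + 1e-12
    return torch.clamp(d_y / d_x - lipschitz_k, min=0.0).mean()
```

During training, this penalty would be added to the usual cross-entropy loss with a suitable weight.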
We implement the core of most of our attack and defense algorithms (except ALR) through the Adversarial Robustness Toolbox [33]. For ALR, we follow the original implementation of [32]. The rest of the experimental details are described below.

We employ the Librispeech [34] "train-clean-100" subset for all the experiments. It contains approximately 100 hours of clean speech from 251 unique speakers (roughly balanced between female and male speakers), and we utilize all the speakers for our experiments. For every speaker, we employ a fixed portion of the utterances for training the classifier, and the remaining utterances for testing. The train-test split is deterministic, and it is kept fixed throughout all the experiments.
We implemented our classifier, f : X → Y , by combining a Convolutional Neural Network (CNN) with a digital signalprocessing (DSP) front-end which is differentiable. The non-trainable
DSP front-end extracts a log Mel-spectrogram, which is viewed as a temporal signal of $F$ channels, where $F$ is the number of Mel frequency bins. The back-end is either of the two DNN models described below. As both modules are differentiable, the adversarial attack schemes introduced in Section 4 can be applied to create time-domain perturbations directly.

The first back-end is a 1D CNN consisting of 8 stacks of convolutional layers that transform the Mel-spectrogram into the label space $Y$. ReLU nonlinearity and batch normalization are used after every convolutional layer, and max-pooling is employed after every alternate layer. The penultimate layer has a dimension of 32, and the model has a total of 219k trainable parameters. We perform most of our analysis with this model, and utilize the following TDNN model for transferability experiments.

The second back-end is a Time Delay Neural Network (TDNN) [23, 13], one of the current state-of-the-art models for speaker recognition. We adopt the model architecture proposed in [13] for the experiments related to transferability analysis (Section 6.6). The model consists of time-dilated convolutional layers along with a statistics pooling module; it has several million trainable parameters, and hence is much larger than the 1D CNN model.

We employ the Adam optimizer [35] with a fixed learning rate and standard values of the exponential decay rates $\beta_1$ and $\beta_2$, and train with a fixed minibatch size. All models (except ALR) are trained until the training accuracy nearly saturates. The ALR training converges much more slowly, and thus all ALR-based experiments are run for a larger number of epochs, after which the training accuracy tends to saturate.

Our main results (Section 6.3) are obtained with a fixed attack strength $\epsilon$ for the $l_\infty$ attacks and a confidence margin $\delta = 0$ for the Carlini $l_2$ attack; the chosen $\epsilon$ yields a reasonably high SNR for the resulting adversarial samples (see Section 6.1). Furthermore, we vary the strength of the different attacks, and the results are shown in Section 6.4. The PGD attack is run for a fixed number of iterations with a step size $\alpha$ set to a fraction of $\epsilon$. For the ALR method, we set the number of power iterations to 1 and the hyperparameter $\xi = 10$, as recommended in [32]. The FGSM- and PGD-based adversarial training algorithms are run with the same value of $\epsilon$; hence, the main results (Section 6.3) employ the same $\epsilon$ in both the attack and the adversarial-training-based defense. The ablation study in Section 6.4 is particularly designed to investigate the effect of using different values of $\epsilon$ during the attack. Specifically, the study varies $\epsilon$ above and below the value set during defense training, and analyzes the effectiveness of the defense method. The PGD adversarial training uses 10 iterations (i.e., PGD-10 as introduced in Section 4.2), although we evaluate it against PGD attacks with a higher number of iterations (Sections 6.3 and 6.5); PGD adversarial training is slow, and we could not afford to run it for more iterations. During adversarial training, we create minibatches containing an equal number of clean and adversarial samples, i.e., in Equation (9) we set $w_{AT} = 0.5$.
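Before turning to the results, the sketch below illustrates how a differentiable log-Mel front-end can be chained with a small 1D CNN back-end so that attack gradients flow all the way back to the raw waveform. The torchaudio-based front-end, layer sizes, and channel counts are assumptions for the sketch and do not reproduce the exact 219k-parameter architecture described above.

```python
import torch
import torch.nn as nn
import torchaudio

class MelCnnSpeakerClassifier(nn.Module):
    """Differentiable log-Mel front-end followed by a small 1D CNN back-end."""

    def __init__(self, num_speakers, n_mels=64, sample_rate=16000):
        super().__init__()
        # Non-trainable but differentiable DSP front-end.
        self.melspec = torchaudio.transforms.MelSpectrogram(
            sample_rate=sample_rate, n_mels=n_mels)
        self.backend = nn.Sequential(
            nn.Conv1d(n_mels, 128, kernel_size=5, padding=2),
            nn.BatchNorm1d(128), nn.ReLU(),
            nn.Conv1d(128, 128, kernel_size=5, padding=2),
            nn.BatchNorm1d(128), nn.ReLU(), nn.MaxPool1d(2),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(),
            nn.Linear(128, num_speakers))

    def forward(self, waveform):
        # waveform: (batch, num_samples) -> log-Mel: (batch, n_mels, frames)
        feats = torch.log(self.melspec(waveform) + 1e-6)
        return self.backend(feats)  # (batch, num_speakers) speaker logits
```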
Attack strength vs. SNR. To gain a better understanding of the strength of the different attack algorithms, we computed the mean Signal-to-Noise Ratio (SNR) of all the test adversarial samples for every level of attack strength. For the $l_\infty$ attacks, $\epsilon$ is varied over four values, and for the Carlini $l_2$ attack the confidence margin $\delta$ is varied over four values starting from 0. The curves for the $l_\infty$ attacks are shown in Figure 1a. There are two important observations. First, the average level of SNR is noticeably higher for Carlini $l_\infty$ than for FGSM and PGD. Second, the SNR level tends to decrease faster with increase in
[Figure 1a: Mean SNR (dB) vs. attack strength for the FGSM, PGD, and Carlini $l_\infty$ attacks.]