Study of Pre-processing Defenses against Adversarial Attacks on State-of-the-art Speaker Recognition Systems
Sonal Joshi, Jesús Villalba, Piotr Żelasko, Laureano Moro-Velázquez, Najim Dehak
Adversarial Attacks and Defenses for Speaker Identification Systems
Sonal Joshi, Student Member, IEEE, Jesús Villalba, Member, IEEE, Piotr Żelasko, Member, IEEE, Laureano Moro-Velázquez, Member, IEEE, and Najim Dehak, Senior Member, IEEE
Abstract—Research in automatic speaker recognition (SR) has been undertaken for several decades, reaching great performance. However, researchers discovered potential loopholes in these technologies, such as spoofing attacks–voice replay, conversion, and synthesis–which have been thoroughly investigated in recent years. Quite recently, a new genre of attack, termed adversarial attacks, has proved fatal in computer vision (CV), and it is vital to study their effect on SR systems. This paper examines how state-of-the-art speaker identification (SID) systems are vulnerable to adversarial attacks and how to defend against them. We investigated adversarial attacks common in the literature, like the fast gradient sign method (FGSM), iterative FGSM–a.k.a. the basic iterative method (BIM)–and Carlini-Wagner (CW). Furthermore, we propose four pre-processing defenses against these attacks–viz. randomized smoothing, DefenseGAN, variational autoencoder (VAE), and WaveGAN vocoder. We found that SID systems were extremely vulnerable under iterative FGSM and Carlini-Wagner attacks. The randomized smoothing defense robustified the system for imperceptible BIM and Carlini-Wagner attacks, recovering classification accuracies ∼97%. Defenses based on generative models (DefenseGAN, VAE, and WaveGAN) project adversarial examples (outside of the manifold) back into the clean manifold. When the attacker cannot adapt the attack to the defense (black-box defense), WaveGAN performed the best, being close to the clean condition (accuracy > 90% for most attacks). However, if the attack is adapted to the defense–assuming the attacker has access to the defense model–(white-box defense), VAE and WaveGAN protection dropped significantly–50% and 37% accuracy for the CW attack. To counteract this, we combined randomized smoothing with VAE or WaveGAN. We found that smoothing followed by the WaveGAN vocoder was the most effective defense overall. As a black-box defense, it provides 93% average accuracy. As a white-box defense, accuracy only degraded for iterative attacks with perceptible perturbations (L∞ ≥ 0.1).

Index Terms—Speaker recognition, x-vectors, adversarial attacks, adversarial defenses.
All authors are with the Department of Electrical and Computer Engineering and the Center for Language and Speech Processing (CLSP), Johns Hopkins University, Baltimore, MD 21218, USA. Jesús Villalba and Najim Dehak are also affiliated with the Human Language Technology Center of Excellence, Johns Hopkins University, Baltimore, MD 21218, USA. This project was supported by DARPA Award HR001119S0026-GARD-FP-052. Manuscript received January xx, 2021.

I. INTRODUCTION

Speaker recognition (SR) has applications in forensics, smart home assistants, authentication, etc. Recently, research has shown that these systems are subject to threats, including spoofing and adversarial attacks [1]. Spoofing has been extensively studied in the ASVSpoof challenges [2], and countermeasures have been proposed [3]. However, the field of adversarial attacks remains less explored. SR adversarial attacks usually consist of slightly modifying the original signal by adding a small, human-imperceptible noise to make the SR neural network fail and give incorrect predictions [4]. The question is how effective these attacks are and how we can defend against them.

As the pioneering work on adversarial attacks was in the area of image classification [4], attacks on computer vision systems have been extensively studied [5], [6], [7], [8]. These attacks have later been extended to other modalities like video [9] and speech. For the latter, adversarial attacks have been studied for automatic speech recognition (ASR) and speaker recognition (SR) systems. Adversarial attacks on ASR generally focus on end-to-end architectures, e.g., Mozilla's DeepSpeech [10], rather than hybrid DNN-HMM systems. For example, the study in [11] attacks DeepSpeech using the Houdini attack. Another study [12] proposes a white-box iterative optimization-based adversarial attack (henceforth referred to as the Carlini-Wagner attack), demonstrating a 100% success rate. The authors in [13] propose an audio-agnostic universal adversarial perturbation for DeepSpeech. The authors in [14] attack WaveNet [15] using the fast gradient sign method [16] and the fooling gradient sign method [4]. On the other hand, some works attack state-of-the-art ASR based on Kaldi [17]. The authors in [18] propose a surreptitious attack on a Kaldi-based ASR by embedding voice into songs and playing it in the background while being inaudible to the human ear. The studies [19], [20] show that psychoacoustic modeling can be leveraged to make the attacks imperceptible. Some other works [20], [21] suggest that physical adversarial attacks, meaning adversarial attacks generated over-the-air using realistic simulated environmental distortions, also break ASR systems. A more detailed overview of adversarial attacks and countermeasures for ASR is presented in [22].

Another speech technology under the radar of adversarial attacks is speaker recognition (SR), which is an umbrella term for three areas: speaker identification or classification (given a voice, identify who is speaking), speaker verification (given a voice and a claimed identity, decide whether the claim is correct or not), and anti-spoofing (spoofing countermeasures for replay and voice-synthesis attacks on speaker verification systems). Adversarial attack research in the SR field started on non-state-of-the-art speaker identification systems. For instance, the authors in [23], [24] attack models based on recurrent Long Short-Term Memory (LSTM) networks. Recently, a handful of works attacking state-of-the-art
x-vector [25] and i-vector [26] systems have appeared in the literature. [27] attacks GMM i-vector and x-vector models using the fast gradient sign method (FGSM). [28] attacks a public pre-trained Kaldi time-delay neural network (TDNN) based x-vector model by proposing a real-time black-box universal attack. The authors in [29] introduce universal adversarial perturbations (UAP) that can fool SR systems with high probability, using generative networks to model the low-dimensional manifold of UAPs. The study [30] proposes a white-box psychoacoustics-based method to attack an x-vector model. A recent survey [31] provides a good overview of existing research in this area and claims that the attacks "almost universally" fail to transfer to other models. The authors in [32] study the effects of adversarial attacks on speaker verification x-vector systems. They successfully transfer white-box attacks like FGSM, randomized FGSM, and CW from small white-box systems to larger black-box systems. [33] proposes a Momentum Iterative Fast Gradient Sign Method (MI-FGSM) attack against spoofing countermeasure systems.

Fig. 1: x-vector speaker classification pipeline. Here, x′ is an adversarial version of the benign waveform x (which, without attack, yields the classifier output y_benign), crafted such that the classifier output y_adv (in short, y) satisfies y ≠ y_benign.

The question that arises is whether there is a way to defend against adversarial attacks. As attack algorithms rely on the classifier's gradient information, gradient masking/obfuscation can be used as a defense. For example, the authors of [34] use defensive distillation when training DNNs to defend against adversarial attacks; and [35], [36] pre-process the input data with a non-smooth or non-differentiable preprocessor. PixelDefend [37] and Defense-GAN [38] use generative models to project adversarial samples onto the clean/benign manifold. PixelDefend uses the PixelCNN model, and Defense-GAN uses a generative adversarial network to do the projection. Adding the generative network before the classifier causes the final model to be an extremely deep neural network, and hence the cumulative gradient to be extremely small or irregularly large [9]. This makes estimating the gradient difficult. Another type of defense is to make the model inherently robust against adversarial attacks. This can be done by introducing adversarial samples into model training. This method is called adversarial training. The authors of [39] and [40] propose adversarial training using FGSM attacks, and those of [41] use projected gradient descent (PGD) attacks. PGD adversarial training generalizes better to unseen attacks than FGSM; however, it is computationally costly and hence slow. [42] proposes the "You Only Propagate Once" (YOPO) algorithm to make adversarial training faster by reusing gradients while generating PGD attacks. Some defenses, called "certified defenses", guarantee protection up to a certain perturbation L_p radius [43], [44].

For SR and ASR systems, pre-processing and adversarial training defenses are also found in the literature. In the class of pre-processing defenses, audio turbulence and audio squeezing have been proposed for ASR defense in [18]. Another study [45] proposes local smoothing using a low-pass filter, down-sampling/up-sampling, and quantization. Research on defenses for SR has been sparse.
Some authors propose MP3 compression [46], whereas others employ a separate VGG-like binary classifier to detect the attacks [47]. Recently, the study [48] proposed adversarial training based on projected gradient descent (PGD) attacks. Another work [49] proposes a defense mechanism based on hybrid adversarial training using multi-task objectives, feature scattering (FS), and margin losses. However, the caveats of adversarial training are that it generalizes poorly to attack algorithms and threat models unseen during training, and that it requires careful tuning of training hyperparameters, besides significantly increasing training time. Given these issues, our work focuses on the class of pre-processing defenses. The main advantage of these defenses is that they do not need information about the type of attacks used by the adversary.

The main contributions of this work are as follows:

• We evaluate the performance of four attacks at various strengths on a state-of-the-art speaker identification (classification) model (x-vector) and show its vulnerability to adversarial attacks.
• We analyze four pre-processing defenses–viz. the randomized smoothing defense (Section V-A), Defense-GAN (Section V-B), the variational auto-encoder (VAE) defense (Section V-C), and the WaveGAN vocoder defense (Section V-D). The first two are adapted from the computer vision literature, and the last two are newly proposed in this paper.
• We show that while the x-vector system has inherent robustness against some attacks, like transfer universal attacks and low-L∞ FGSM attacks, defenses like the WaveGAN vocoder and VAE help the x-vector withstand stronger attacks.

The rest of the paper is organized as follows: Section II describes the x-vector speaker classification network; Section III describes the considered threat model; Section IV formally introduces adversarial attacks and methods; Section V describes the proposed defense methods against the adversarial
attacks; Section VI describes the experimental setup; the results are discussed in Section VII; and finally, Section VIII gives concluding remarks.

II. X-VECTOR SPEAKER CLASSIFICATION

TABLE I: ThinResNet34 x-vector architecture. N in the last row is the number of speakers. The first dimension of the input shows the number of filter-banks and the third dimension indicates the number of frames T.

Layer name       | Structure                               | Output
Input            | –                                       | 80 × 1 × T
Conv2D-1         | 3 × 3, 16, Stride 1                     | 80 × 16 × T
ResNetBlock-1    | [3 × 3, 16; 3 × 3, 16] × 3, Stride 1    | 80 × 16 × T
ResNetBlock-2    | [3 × 3, 32; 3 × 3, 32] × 4, Stride 2    | 40 × 32 × T/2
ResNetBlock-3    | [3 × 3, 64; 3 × 3, 64] × 6, Stride 2    | 20 × 64 × T/4
ResNetBlock-4    | [3 × 3, 128; 3 × 3, 128] × 3, Stride 2  | 10 × 128 × T/8
Flatten          | –                                       | 1280 × T/8
StatsPooling     | –                                       | 2560
Dense1           | –                                       | 256
Dense2 (Softmax) | –                                       | N
In this study, we used state-of-the-art x-vector style neural architectures [50], [25] for the speaker classification task. x-Vector networks are divided into three parts. First, an encoder network extracts frame-level representations from acoustic features such as MFCCs or filter-banks. Second, a global temporal pooling layer aggregates the frame-level representations into a single vector per utterance. We used the mean and standard deviation over time for pooling. Finally, a classification head, which is a feed-forward classification network, processes this single vector and computes the speaker class posteriors. The network is trained using the additive angular margin softmax (AAM-softmax) loss [51]. Different x-vector systems are characterized by different encoder architectures, pooling methods, and training objectives. In preliminary experiments, we compared the vulnerability to adversarial attacks of x-vectors based on ThinResNet34, ResNet34 [52], [53], EfficientNet [54], and Transformer [55] encoders. These results are shown in Section VII-A. We found that all architectures were similarly vulnerable, with EfficientNet being slightly more robust to some attacks. To evaluate the defenses, we used the ThinResNet34 x-vector, whose architecture is described in Table I. ThinResNet34 was more vulnerable than the larger EfficientNets and ResNets, so we can observe the improvement provided by the defenses better. It also allowed us to reduce the computing cost. Having a small computing cost is very important since generating attacks involves multiple gradient descent iterations per trial.
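As an illustration, a minimal PyTorch sketch of the mean-plus-standard-deviation pooling described above could look as follows (the (batch, channels, frames) tensor layout is our assumption):

import torch
import torch.nn as nn

class MeanStdPooling(nn.Module):
    """Global temporal pooling: concatenate mean and std over frames."""
    def forward(self, h):
        # h: (batch, channels, frames) frame-level encoder output
        mu = h.mean(dim=-1)
        sigma = h.std(dim=-1)
        return torch.cat([mu, sigma], dim=-1)  # (batch, 2 * channels)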
III. THREAT MODEL
We consider a threat model where an attacker crafts an adversarial example by adding an imperceptible perturbation to the speech waveform in order to alter the speaker identification decision. Suppose x ∈ ℝ^T is the original clean audio waveform of length T, also called benign or bonafide. Let y_benign be its true label. Let g(·) be the x-vector network predicting speaker class posteriors. An adversarial example of x is given by x′ = x + δ, where δ is the adversarial perturbation.

δ can be optimized to produce untargeted or targeted attacks. Untargeted attacks make the system predict any class other than the true class, g(x′) ≠ y_benign, without forcing it to predict any specific target class. On the contrary, targeted attacks are successful only when the system predicts a given target class y_adv instead of the true class, g(x′) = y_adv ≠ y_benign. In this work, we consider untargeted attacks since we observed that they are easier to make succeed. Thus, we expect that defending against them will be more difficult.

To enforce imperceptibility of the perturbation, some distance metric is minimized or bounded, D(x, x′) < ε. Typically, this is the L_p norm of the perturbation, D(x, x′) = ‖δ‖_p. Among the attacks that we considered, the FGSM and Iter-FGSM attacks limit the maximum L∞ norm, while Carlini-Wagner minimizes the L2 norm.

We considered two variants of this threat model depending on the knowledge of the attacker: white-box and black-box attacks. In white-box attacks, the attacker has full knowledge of the system, including the architecture and parameters. They can compute the gradient of the loss function and back-propagate it all the way to the waveform to compute the adversarial perturbation. We performed experiments with three attacks in white-box settings: the fast gradient sign method (FGSM) [5] (Section IV-A), iterative FGSM [7] (Section IV-B), and Carlini-Wagner [6] (Section IV-C) attacks.

In black-box attacks, the attacker does not have details about the architecture of the x-vector system, and the adversary can only observe the predicted labels of the system given an input. In this work, we consider a particular case of black-box attack, usually known as a transfer-based black-box attack. Here, the idea is that the adversary has access to an alternative white-box model, which they can use to create adversarial examples, and then uses those examples to attack the victim model.
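To make the threat model concrete, the following sketch checks untargeted success and measures the perturbation norms used by the attacks below (the classifier g and tensor shapes are placeholders):

import torch

def untargeted_success(g, x, x_adv, y_benign):
    """Return attack success plus the L_inf and L2 perturbation norms."""
    y_pred = g(x_adv.unsqueeze(0)).argmax(dim=-1).item()
    delta = x_adv - x
    linf = delta.abs().max().item()   # bounded by FGSM / Iter-FGSM
    l2 = delta.norm(p=2).item()       # minimized by Carlini-Wagner
    return y_pred != y_benign, linf, l2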
IV. ADVERSARIAL ATTACKS

In the following, we describe the attack algorithms that we evaluated.
A. Fast gradient sign method (FGSM)
The fast gradient sign method (FGSM) [5] takes the benign audio waveform x ∈ ℝ^T of length T and computes an adversarial example x′ by taking a single step in the direction that maximizes the misclassification error,

x′ = x + ε sign(∇_x L(g(x), y_benign)),   (1)

where g(x) is the x-vector network producing speaker label posteriors, L is the categorical cross-entropy loss, and y_benign is the true label of the utterance. ε restricts the L∞ norm of the perturbation by imposing ‖x′ − x‖∞ ≤ ε to keep the attack imperceptible. In other words, the larger the ε, the more perceptible the perturbation and the more effective the attack (meaning classification accuracy deteriorates more). It should be noted that FGSM produces adversarial attacks quickly, but the attacks are not very strong at fooling the classifier, as demonstrated by our experiments.
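A minimal PyTorch sketch of Eq. (1) follows; clamping to the [-1, 1) waveform range is an assumption based on the normalization stated later in the paper:

import torch
import torch.nn.functional as F

def fgsm(g, x, y_benign, eps):
    """Single-step FGSM: move eps along the sign of the input gradient."""
    # x: (batch, T) waveforms, y_benign: (batch,) labels
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(g(x), y_benign)   # L(g(x), y_benign)
    loss.backward()
    x_adv = x + eps * x.grad.sign()
    return x_adv.detach().clamp(-1.0, 1.0)   # keep a valid waveform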
B. Iterative Fast Gradient Sign Method or Basic Iterative Method (BIM)
Iterative FGSM [7], or the basic iterative method (BIM), takes iterative smaller steps α in the direction of the gradient, in contrast to FGSM, which takes a single step:

x′_{i+1} = x + clip_ε( x′_i + α sign(∇_{x′_i} L(g(x′_i), y_benign)) − x ),   (2)

where x′_0 = x and i is the optimization iteration step. The clip function ensures that the L∞ norm of the perturbation is smaller than ε after each optimization step i. This results in a stronger attack than FGSM; however, it also takes more time to compute. In our experiments, we set α to a fixed fraction of ε.
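A sketch of Eq. (2); the default step-size ratio α = ε/5 below is only an illustrative choice, not necessarily the setting used in our experiments:

import torch
import torch.nn.functional as F

def bim(g, x, y_benign, eps, alpha=None, n_iter=7):
    """Iterative FGSM / BIM: repeated small steps with an L_inf clip."""
    alpha = alpha if alpha is not None else eps / 5  # illustrative ratio
    x_adv = x.clone().detach()
    for _ in range(n_iter):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(g(x_adv), y_benign)
        loss.backward()
        with torch.no_grad():
            step = x_adv + alpha * x_adv.grad.sign()
            x_adv = x + (step - x).clamp(-eps, eps)  # keep ||x' - x||_inf <= eps
    return x_adv.detach()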
C. Carlini-Wagner attack

The Carlini-Wagner (CW) attack [6] is computed by finding the minimum perturbation δ that fools the classifier while maintaining imperceptibility. δ is obtained by minimizing the loss

C(δ) ≜ D(x, x + δ) + c f(x + δ),   (3)

where D is a distance metric, typically the L2 norm. By minimizing D, we minimize the perceptibility of the perturbation. The function f is a criterion defined in such a way that the system fails if and only if f(x + δ) ≤ 0. To perform a non-targeted attack in a classification task, we define f(x′) as

f(x′) = max( g(x′)_benign − max{ g(x′)_j , j ≠ benign } + κ, 0 ),   (4)

where x′ is the adversarial sample, g(x′)_j is the logit prediction for class j, and κ is a confidence parameter. The attack is successful when f(x′) ≤ 0. This condition is met when at least one of the logits of the non-benign classes is larger than the logit of the benign class plus κ. We can increase the confidence in the attack's success by setting κ > 0.

The weight c balances the D and f objectives. The optimization algorithm consists of a nested loop. In the outer loop, a binary search is used to find the optimal c for every utterance. In the inner loop, for each value of c, C(δ) is optimized by gradient descent iterations. If an attack fails, c is increased in order to increase the weight of the f objective over D, and the optimization is repeated. If the attack is successful, c is reduced.

For our experiments, we set the learning rate parameter to 0.001, the confidence κ to 0, the maximum number of inner loop iterations to 10, and the maximum number of halvings/doublings of c in the outer loop to 5.
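The criterion of Eq. (4) and the objective of Eq. (3) can be sketched as follows; using the squared L2 distance for D follows the original CW formulation and is our assumption here, and a single benign class index shared across the batch is assumed:

import torch

def cw_f(logits, benign_class, kappa=0.0):
    """Eq. (4): f <= 0 once some non-benign logit exceeds the benign one by kappa."""
    z_benign = logits[:, benign_class]
    masked = logits.clone()
    masked[:, benign_class] = float("-inf")
    z_other = masked.max(dim=1).values          # best non-benign logit
    return torch.clamp(z_benign - z_other + kappa, min=0.0)

def cw_objective(delta, logits, benign_class, c, kappa=0.0):
    """Eq. (3): C(delta) = D + c * f, with D the squared L2 distance."""
    return (delta ** 2).sum() + c * cw_f(logits, benign_class, kappa).sum()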
D. Transfer universal attacks

Universal perturbations are adversarial noises optimized to be sample-agnostic [56]. Thus, we can add the same universal perturbation to multiple audios and make the classifier fail on all of them. The algorithm to optimize these universal perturbations is summarized as follows. Given a set of samples X = {x_1, ..., x_N}, we intend to find a perturbation δ, with ‖δ‖∞ ≤ ε to make it imperceptible, that can fool most samples in X. The algorithm iterates over the data in X, gradually updating δ. In each iteration i, if the current perturbation δ does not fool sample x_i ∈ X, we add an extra perturbation ∆δ_i with minimal norm such that the perturbed point x_i + δ + ∆δ_i fools the classifier. This is done by solving

∆δ_i = arg min_r ‖r‖  s.t.  g(x_i + δ + r) ≠ g(x_i),   (5)

where g is the classification network. Then, we enforce the constraint ‖δ‖∞ ≤ ε by projecting the updated perturbation onto the ℓ_p ball of radius ε,

δ ← P_{p,ε}(δ + ∆δ_i),   (6)

where P_{p,ε} is the projection function. We take several iterations over the data in X to improve the universal perturbation. The iterations stop when at least a proportion P of the samples in X is misclassified.

In [56], the authors show that universal perturbations also present cross-model universality. In other words, universal perturbations computed for a specific architecture can also fool other architectures. Hence, they pose a potential security breach that could break any classifier. Motivated by this, we used the universal perturbations pre-computed in the Armory toolkit (https://github.com/twosixlabs/armory/blob/master/armory/data/adversarial/librispeech_adversarial.py) to create transfer black-box attacks against our x-vector classifier. These universal perturbations were created using a white-box SincNet classification model [57] (https://github.com/mravanelli/SincNet).
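The optimization loop of Eqs. (5)-(6) can be sketched as below; the minimal-norm inner attack of Eq. (5) is left abstract, and p = ∞ is assumed for the projection:

import torch

def universal_perturbation(g, X, eps, inner_attack, target_rate=0.8,
                           max_passes=10):
    """Build a sample-agnostic perturbation delta with ||delta||_inf <= eps.
    inner_attack(x, y) must return a minimal-norm perturbation fooling x."""
    delta = torch.zeros_like(X[0])
    for _ in range(max_passes):
        fooled = 0
        for x in X:
            y = g(x.unsqueeze(0)).argmax(dim=-1).item()
            y_pert = g((x + delta).unsqueeze(0)).argmax(dim=-1).item()
            if y_pert != y:
                fooled += 1
                continue
            delta = delta + inner_attack(x + delta, y)   # Eq. (5)
            delta = delta.clamp(-eps, eps)               # Eq. (6), p = inf
        if fooled / len(X) >= target_rate:               # stop at proportion P
            break
    return delta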
V. ADVERSARIAL DEFENSES

Defenses against adversarial attacks mainly include adversarial training [58], introducing randomization [59], or using some form of denoising to remove the adversarial perturbation [60]. Adversarial training has proven to be a good defense [48]; however, robustness degrades when test-time attacks differ from those used when training the system. Adversarial training also significantly increases training time, which hinders scalability when using very large x-vector models. In this work, we focus on pre-processing defenses, which intend to remove the adversarial perturbation from the input signal. These defenses do not have prior knowledge of the type and strength of the attacks.
A. Randomized smoothing
Randomized smoothing is a method to construct a new smoothed classifier g′ from a base classifier g. The smoothed classifier g′ predicts the class that the base classifier g would most likely predict given the input x perturbed with isotropic Gaussian noise [59]. That is,

g′(x) = arg max_c P(g(x + n) = c),   (7)
n ∼ N(0, σ²I),   (8)
where P is a probability measure and the noise standard deviation σ is a hyperparameter that controls the trade-off between robustness and accuracy. Randomized smoothing is a certifiable defense against attacks with bounded L2 norm perturbations. In [59], the authors prove tight bounds for certified accuracies under Gaussian noise smoothing. Although this defense is not certifiable for L_p with p ≠ 2, we found that it also performs well for other norms like L∞.

The main requirement for this defense to be effective is that the base classifier g be robust to Gaussian noise. The base x-vector classifier is already robust to noise since it is trained on speech augmented with real noises [61] and reverberation [62] using a wide range of signal-to-noise ratios and room sizes. However, we robustified our classifier further by adding an extra fine-tuning step. In this step, we fine-tune all the layers of the x-vector classifier by adding Gaussian noise on top of the real noises and reverberations. The training standard deviation was sampled from a uniform distribution, assuming a waveform with dynamic range in the [-1, 1) interval. At test time, we added Gaussian noise with a fixed σ. This defense made the system robust to FGSM and BIM (with L∞ ≤ 0.01) and CW attacks in comparison to the undefended system. We found that the test σ needed to be significantly larger than the L∞ norm of the adversarial perturbation for the defense to be effective. We also found that training with a random uniform σ was better than matching the train and test σ values.
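A sketch of the smoothed prediction of Eqs. (7)-(8); a single noisy sample (n_samples = 1) matches the default setting used in our experiments, and an unbatched waveform input is assumed:

import torch

@torch.no_grad()
def smoothed_predict(g, x, sigma, n_samples=1):
    """Randomized smoothing: vote over predictions on Gaussian-noisy inputs."""
    preds = []
    for _ in range(n_samples):
        noisy = x + sigma * torch.randn_like(x)   # n ~ N(0, sigma^2 I)
        preds.append(g(noisy.unsqueeze(0)).argmax(dim=-1).item())
    return max(set(preds), key=preds.count)       # majority vote, Eq. (7)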
B. Defense-GAN

Generative adversarial networks (GAN) are implicit generative models inspired by game theory [16], where two networks, termed generator G and discriminator D, compete with each other (hence the name "adversarial"). G maps a low-dimensional latent variable z to the observed data space x. The goal of G is to generate samples G(z) that are similar to real data given the random variable z ∼ N(0, I). Meanwhile, D learns to discriminate between real and generated (fake) samples. D and G are trained alternately to optimize the objective

V(G, D) = E_{x∼p_data(x)}[log D(x)] + E_{z∼p_z(z)}[log(1 − D(G(z)))],   (9)

which is minimized w.r.t. G and maximized w.r.t. D. When G is being updated, it tries to fool the discriminator into classifying fake samples as real. When D is being updated, it learns to detect fake and real samples better. As training progresses, these adversaries keep improving by learning to fool one another. It can be shown that this objective minimizes the Jensen-Shannon (JS) distance between the real and generated distributions [16].

Wasserstein GAN (WGAN) is an alternative GAN formulation that minimizes the Wasserstein distance rather than the JS distance [63]. Using the Kantorovich-Rubinstein duality, the objective that achieves this is

V(G, D) = E_{x∼p_data(x)}[D(x)] − E_{z∼p_z(z)}[D(G(z))],   (10)

where the network D, termed the critic, needs to be a K-Lipschitz function. Wasserstein GAN usually yields better performance than the vanilla GAN.

The work in [38] proposes to use GANs as a pre-processing defense against adversarial attacks. This method, termed Defense-GAN, has been demonstrated to be an effective defense on several computer vision datasets. The key assumptions are that a well-trained generator learns the low-dimensional manifold of the clean data, while adversarial samples lie outside this manifold. Therefore, projecting the test data onto the manifold defined by the GAN generator will clean the sample of the adversarial perturbation.

Defense-GAN follows a two-step approach. In the training phase, a GAN is trained on benign data to learn the low-dimensional data manifold. In the inference phase, the test sample x is projected onto the GAN manifold by optimizing the latent variable z to minimize the reconstruction error,

z* = arg min_z ‖G(z) − x‖²,   (11)

Then, the projected sample is x* = G(z*). z is optimized by L gradient descent iterations, initialized with R random seeds (random restarts). Figure 2 outlines the Defense-GAN inference step.

Fig. 2: Scheme of the Defense-GAN inference step.

We adapted DefenseGAN to the audio domain. We trained a Wasserstein GAN with a deep convolutional generator-discriminator (DCGAN) to generate clean log-Mel-spectrograms, similar to [64]. We enforced the 1-Lipschitz constraint for the discriminator by adding a gradient penalty loss [65]. The generator was trained to generate spectrogram patches of 80 frames per latent variable. At inference, we used the GAN to project adversarial spectrograms onto the clean manifold. Note that the adversarial perturbation was added in the wave domain, not in the log-spectrogram domain. The DefenseGAN procedure needs to be sequentially
applied to each block of 80 frames until the full utterance spectrogram is reconstructed. We observed that DefenseGAN did not achieve high-quality spectrogram generation, which degraded performance on benign samples. To mitigate this, we used a weighted combination of the original spectrogram x and the reconstruction x*, i.e., x** = α x* + (1 − α) x.

The major advantage of Defense-GAN is that it is not trained for any particular type of attack. Therefore, this defense strategy is "attack independent". Another advantage is that the iterative procedure used for inference is not differentiable. Thus, it is difficult to create adaptive attacks that can break the defense. To perform attacks on this defense, we approximate x′ ≈ x–assuming small adversarial perturbations–for the backward pass. The Jacobian of the defense can then be approximated by an identity matrix. This is termed the backward pass differentiable approximation (BPDA) [66].

DefenseGAN also has some disadvantages. Its success is highly dependent on training a good generator [38]; otherwise, performance suffers on both original and adversarial examples. Also, it is a very slow procedure, requiring R × L forward/backward passes through the generator. The hyper-parameters L and R need to be tuned to balance speed and accuracy.
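A sketch of the Defense-GAN inference step (Eq. 11 plus the weighted combination); the learning rate, iteration count, and number of restarts follow the values given in Section VI-D, while alpha = 0.5 is only a placeholder:

import torch

def defense_gan_project(G, x, latent_dim=100, n_iters=300, n_restarts=10,
                        lr=0.025, alpha=0.5):
    """Project x onto the GAN manifold (Eq. 11), then blend:
    x** = alpha * G(z*) + (1 - alpha) * x. alpha = 0.5 is illustrative."""
    best_x, best_err = x, float("inf")
    for _ in range(n_restarts):                  # R random restarts
        z = torch.randn(1, latent_dim, requires_grad=True)
        opt = torch.optim.SGD([z], lr=lr)
        for _ in range(n_iters):                 # L gradient descent steps
            opt.zero_grad()
            err = ((G(z) - x) ** 2).sum()        # ||G(z) - x||^2
            err.backward()
            opt.step()
        if err.item() < best_err:
            best_err, best_x = err.item(), G(z).detach()
    return alpha * best_x + (1 - alpha) * x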
C. Variational autoencoder defense

Fig. 3: Pipeline of the VAE defense. The combination of the VAE with randomized smoothing is indicated by the optional block.

Considering the disadvantages of DefenseGAN, we decided to explore other generative models, such as variational auto-encoders (VAE) [67]. Variational auto-encoders are probabilistic latent variable models consisting of two networks, termed encoder and decoder. The decoder defines the generation model of the observed variable x given the latent z, i.e., the conditional data likelihood P(x|z). Meanwhile, the encoder computes the approximate posterior of the latent given the data, q(z|x). The model is trained by maximizing the evidence lower bound (ELBO),

L = E_{q(z|x)}[log P(x|z)] − D_KL(q(z|x) ‖ P(z)),   (12)

where P(z) is the latent prior, typically a standard Normal. The first term of the ELBO intends to minimize the reconstruction loss, while the second is a regularization term that keeps the posterior close to the prior. The KL divergence term also helps to create an information bottleneck (IB), limiting the amount of information that flows through the latent variable [68]. We expect that this IB will let the most relevant information pass while removing small adversarial perturbations from the signal.

As for DefenseGAN, we applied this defense in the log-filter-bank domain, as shown in Figure 3. We assumed Gaussian forms for the likelihood and the approximate posterior. The encoder and decoder predict the means and variances of these distributions. To make our VAE more robust, we trained it in a denoising fashion [69] (denoising VAE). During training, the latent posterior is computed from a noisy version of the sample, while the decoder tries to predict the original sample. At inference, we compute the latent posterior from the adversarial sample, draw a sample z ∼ q(z|x), and forward z through the decoder. We used the decoder's predicted mean as an estimate of the benign sample. Variational auto-encoders have the advantage that they are easier to train than GANs, and inference is also faster, since it just consists of a single forward pass through the encoder/decoder networks. Nevertheless, they have the drawback that they are fully differentiable, so the attacker can adapt to the defense by back-propagating gradients through the VAE. In that sense, we considered two alternatives for the defense:

• Black-box VAE: The attacker does not know that there is a defense or does not have access to the defense model, so we approximate the Jacobian of the VAE by the identity matrix during back-propagation, in the same way as we did with DefenseGAN.
• White-box VAE: The attacker has full access to the defense model, so we back-propagate gradients through the combined VAE plus x-vector classifier pipeline.

We also experimented with combining randomized smoothing and VAE defenses, as illustrated in Figure 3.
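A sketch of the VAE defense at inference time; for brevity, the encoder here is assumed to return both the posterior mean and log-variance, whereas our actual model keeps the variances as trainable constants:

import torch

@torch.no_grad()
def vae_defend(encoder, decoder, x_spec):
    """Denoising-VAE defense: sample z ~ q(z|x) and return the decoder's
    predicted mean as the estimate of the benign log-Mel-spectrogram."""
    mu_z, logvar_z = encoder(x_spec)                  # q(z|x) parameters
    z = mu_z + torch.exp(0.5 * logvar_z) * torch.randn_like(mu_z)
    mu_x = decoder(z)                                 # mean of P(x|z)
    return mu_x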
D. WaveGAN vocoder defense
The last defense that we explored was a WaveGAN vocoder defense. A vocoder reconstructs the speech waveform given a compressed representation of it, in our case log-Mel-spectrograms. We used the ParallelWaveGAN vocoder proposed in [70]. This vocoder is based on a generative adversarial network similar to DefenseGAN. However, there are some differences:

• DefenseGAN reconstructs the clean spectrogram, while WaveGAN reconstructs the clean waveform.
• In DefenseGAN, reconstruction is based only on the latent variable z. Meanwhile, in the WaveGAN vocoder, both the generator G(y, z) and the discriminator D(x, y) are also conditioned on the log-Mel spectrogram y of the input signal.
• At inference, DefenseGAN needs to iterate to find the optimal value of the latent z. In the WaveGAN, we just need to forward the log-Mel-spectrogram y and the latent z through the WaveGAN generator. z is a standard Normal random variable with the same length as the input waveform. Figure 4 shows the reconstruction pipeline.
Fig. 4: Pipeline for the WaveGAN vocoder defense. The combination of WaveGAN with randomized smoothing is indicated by the optional blocks.

This ParallelWaveGAN was trained on a combination of a waveform-domain adversarial loss and a reconstruction loss in the short-time Fourier transform (STFT) domain. The reconstruction loss improves the stability and efficiency of the adversarial training. It is the sum of several STFT losses computed with different spectral analysis parameters (window length/shift, FFT length). The ParallelWaveGAN training scheme combines adversarial and multi-resolution STFT losses. The architecture of the generator is a non-autoregressive WaveNet, while the architecture of the discriminator is based on a dilated convolutional network.

We considered two alternatives regarding the knowledge of the attacker:
• Black-box WaveGAN: the attacker does not have access to the WaveGAN model; and
• White-box WaveGAN: the attacker can use the WaveGAN model to adapt the attack to the defense.

As shown by the optional blocks in Figure 4, we can also combine the smoothing and WaveGAN defenses. Here, we can choose whether to apply smoothing before or after the WaveGAN.
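A sketch of the combined smoothing-plus-WaveGAN pipeline at inference time; the vocoder and feature-extractor call signatures are assumptions:

import torch

@torch.no_grad()
def wavegan_defend(vocoder, logmel_fn, x_wave, sigma=None):
    """WaveGAN-vocoder defense: one forward pass re-synthesizes the waveform
    from its log-Mel-spectrogram. Optional smoothing is applied before the
    vocoder here; it can equally be applied after."""
    if sigma is not None:
        x_wave = x_wave + sigma * torch.randn_like(x_wave)  # smoothing first
    y = logmel_fn(x_wave)              # conditioning log-Mel features
    z = torch.randn_like(x_wave)       # standard-Normal latent, input length
    return vocoder(z, y)               # generator G(y, z) -> clean waveform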
VI. EXPERIMENTAL SETUP
A. Dataset and task
The DARPA program Guaranteeing AI Robustness Against Deception (GARD) promotes research on defenses against adversarial attacks in different domains. The program organizes periodic evaluations to benchmark the defenses developed by the participants. Our experimental setup is based on the most recent GARD evaluation on defenses for speaker identification systems. The proposed task is closed-set speaker identification, also known as speaker classification. Given a test utterance, the task is to identify the most likely speaker among the enrolled speakers.

The speaker identification task is built on the Librispeech dataset [71] development set, which contains 40 speakers (20 male and 20 female). The development set was split into training (GARD-train), validation (GARD-val), and test (GARD-test). Each speaker has about 4 minutes for enrollment and 2 minutes for test, with a total of 1283 test recordings. While the setup is small, we need to take into account the extensive computing cost of performing adversarial attacks. For example, Carlini-Wagner attacks require hundreds of gradient descent iterations to obtain the adversarial perturbation. Also, note that following Doddington's rule of 30 [72], which requires at least 30 errors to estimate an error rate reliably, with 1283 recordings we can measure error rates as low as 30/1283 ≈ 2.3% with confidence (97.7% accuracy). Thus, the dataset was selected to be able to obtain meaningful conclusions while limiting computing cost.

We will compare the performance of attacks and defenses using classification accuracy as the primary metric. Attacks will degrade the accuracy, while defenses will try to increase accuracy to match that of the non-attacked system. To quantify the strength of an attack, we will indicate the L∞ norm of the adversarial perturbation, assuming that the waveform's dynamic range is normalized to [−1, 1).

This experimental setup is publicly available through the Armory toolkit (https://github.com/twosixlabs/armory/blob/master/docs/scenarios.md). Armory uses the attacks available in the Adversarial Robustness Toolbox (https://github.com/Trusted-AI/adversarial-robustness-toolbox) to generate the adversarial samples.
B. Baseline

We experimented with different x-vector architectures for our undefended baseline. Adversarial perturbations were added in the wave domain. Acoustic features were 80-dimensional log-Mel-filter-banks with short-time mean normalization. These features are used as input to x-vector networks based on ThinResNet34, ResNet34, EfficientNet-b0/b4, or Transformer-Encoder, as described in Section II. The network predicts logits for the speaker classes. The full pipeline was implemented in PyTorch [73] so that we can back-propagate gradients from the final score to the waveform. Note that we used state-of-the-art architectures: they provided equal error rates between 1.17% (ResNet34) and 1.9% (ThinResNet34) on the popular VoxCeleb1 test set. We observed that all the architectures are extremely vulnerable to attacks. The EfficientNet x-vector was more robust overall but had a higher computing cost than the other architectures. For this reason, we decided to evaluate our defenses on ThinResNet34, which allowed us to perform experiments faster.

The x-vector networks were pre-trained on the VoxCeleb2 dataset [74], comprising 6114 speakers. Training data was augmented with noise from the MUSAN corpus and impulse responses from the RIR dataset. We used the additive angular margin softmax loss (AAM-Softmax) [51] with margin = 0.3 and scale = 30, and the Adam optimizer to train the network. To use this network for speaker identification on the LibriSpeech dev set, we re-initialized the last two layers of the network and fine-tuned on the LibriSpeech dev train partition. We fine-tuned with AAM-Softmax with margin = 0.9, since increasing the margin slightly improved performance on this set.
TABLE II: GAN generator/discriminator architectures. For the generator, latent dim = 100 and ReLU activations were used after each deconvolution layer. For the discriminator, LeakyReLU activations with negative slope = 0.2 were used after each convolution layer. For both, the output padding is – × –.

Generator
Layer name        | Structure              | Output
Input             | –                      | latent dim (100)
Linear            | –                      | –
Reshape           | –                      | – × – × –
ConvTranspose2d-1 | – × –, 1024, Stride 2  | –
ConvTranspose2d-2 | – × –, 512, Stride 2   | –
ConvTranspose2d-3 | – × –, 256, Stride 2   | –
ConvTranspose2d-4 | – × –, 128, Stride 2   | –
ConvTranspose2d-5 | – × –, 64, Stride 2    | –
Crop center       | –                      | 80 × 80 patch

Discriminator
Layer name | Structure              | Output
Input      | –                      | 80 × 80 patch
Conv2d-1   | – × –, 64, Stride 2    | –
Conv2d-2   | – × –, 128, Stride 2   | –
Conv2d-3   | – × –, 256, Stride 2   | –
Conv2d-4   | – × –, 512, Stride 2   | –
Conv2d-5   | – × –, 1024, Stride 2  | –
Reshape    | –                      | 1024
Linear     | –                      | 1
C. Adversarial Attacks
We experimented with attacks implemented in the Adversarial Robustness Toolbox: the fast gradient sign method (FGSM), the basic iterative method (BIM) (a.k.a. IterFGSM), Carlini-Wagner-L2, and universal perturbations. For FGSM and BIM, we tested L∞ norms ε between 0.0001 and 0.2. For BIM, we used a step size α proportional to ε and performed seven iterations. For Carlini-Wagner, we used confidence κ = 0, learning rate 0.001, 10 iterations in the inner loop, and a maximum of 10 iterations in the outer loop. Universal perturbations were transferred from a SincNet model [57] to create transfer black-box attacks, as explained in Section IV-D.
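These settings map onto the Adversarial Robustness Toolbox roughly as in the following sketch, where xvector_net, loss_fn, num_samples, and x_benign are placeholders and the BIM step-size ratio is assumed:

from art.attacks.evasion import (BasicIterativeMethod, CarliniL2Method,
                                 FastGradientMethod)
from art.estimators.classification import PyTorchClassifier

# Wrap the PyTorch x-vector pipeline (waveform in, 40 speaker logits out).
classifier = PyTorchClassifier(model=xvector_net, loss=loss_fn,
                               input_shape=(num_samples,), nb_classes=40,
                               clip_values=(-1.0, 1.0))

eps = 0.01                                       # one of the tested L_inf norms
fgsm = FastGradientMethod(estimator=classifier, eps=eps)
bim = BasicIterativeMethod(estimator=classifier, eps=eps,
                           eps_step=eps / 5,     # assumed ratio for alpha
                           max_iter=7)           # seven iterations
cw = CarliniL2Method(classifier=classifier, confidence=0.0,
                     learning_rate=0.001, max_iter=10,
                     binary_search_steps=10)

x_adv = bim.generate(x=x_benign)                 # x_benign: numpy waveform batch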
D. Defenses

1) Randomized smoothing defense: To improve the performance of the randomized smoothing defense, we fine-tuned the network with Gaussian noise with σ sampled uniformly at random, on top of real noises and reverberation. We compared three versions: fine-tuning the network classification head (last two layers) without Gaussian noise; fine-tuning the head with Gaussian noise; and fine-tuning the full network with Gaussian noise. At inference time, we experimented with using one or multiple smoothed samples. When using multiple samples, we add several noise signals to the utterance, evaluate each one of them, and average the outputs.
2) Defense-GAN:
The GAN in DefenseGAN used a deep convolutional generator-discriminator (DCGAN) trained with Wasserstein and gradient penalty losses, similar to [64]. It was trained on VoxCeleb2 using the same log-Mel-filter-bank features required by the x-vector networks. The generator and discriminator architectures are shown in Table II. The generator projects the latent variable z of dimension 100 with a linear layer and reshapes the result into a small square tensor. A sequence of 2D transposed convolutions with stride 2 upsamples this tensor to generate square log-Mel-filter-bank patches of 80 frames with dimension 80. The critic is a sequence of 2D convolutions with stride 2, which downsample the input to generate a single output per filter-bank patch. We did not observe improvements by increasing the complexity of the generator and discriminator networks. At inference time, the DefenseGAN procedure used a 0.025 learning rate, 300 iterations, and 10 random seeds.
3) Variational auto-encoder defense:
A denoising VAE was trained on VoxCeleb2, augmented with noise and reverberation, to model the manifold of benign log-Mel-spectrograms. Contrary to DefenseGAN, the VAE improved the reconstruction error when increasing the complexity of the encoder/decoder architectures. The architectures that performed best were based on 2D residual networks. Table III describes the encoder/decoder architectures. The encoder downsamples the input spectrogram in the time and frequency dimensions multiple times to create an information bottleneck. The output of the encoder is the mean of the posterior q(z|x), while the variance is kept as a trainable constant. Meanwhile, the decoder is a mirrored version of the encoder. It takes the latent sample z ∼ q(z|x) and upsamples it to predict the mean of P(x|z). The variance is also a trainable constant. To perform the upsampling operation, we used subpixel convolutions, since they provide better quality than transposed convolutions [75].
4) WaveGAN vocoder defense:
For the ParallelWaveGAN vocoder, we used the implementation available at https://github.com/kan-bayashi/ParallelWaveGAN. We compared two models trained on different datasets: first, the public model trained on the Arctic Voices [76] datasets; second, a model we trained on VoxCeleb2 [77].

VII. RESULTS
A. Undefended baselines
We compared the vulnerability of the ResNet34, EfficientNet, Transformer, and ThinResNet34 x-vectors to adversarial attacks. Table IV shows the classification accuracy of the undefended baselines under FGSM, BIM, CW, and universal transfer attacks. The systems are robust to the universal transfer attack, obtaining accuracy similar to the clean condition. This indicates that the attacker cannot transfer attacks from white-box models that are very different from the victim model, verifying the claim made in [31].

For FGSM attacks, we only observed strong degradation at the largest L∞ values. However, for iterative attacks (BIM, CW), all systems failed even with imperceptible perturbations. A system fusing the above networks gives slight improvements (especially for low-L∞ BIM attacks and CW); however, it is still vulnerable. In the following sections, we analyze the defenses using the ThinResNet34 x-vector. This is mainly motivated by the high computing cost of performing adversarial attacks. Note that ThinResNet34 is several times faster than the full ResNet34 architecture.
B. Randomized smoothing defense

Table V shows the results for the randomized smoothing defense. We experimented with different Gaussian noise standard deviations σ. Unless indicated otherwise, we use a single noisy sample for the randomized smoothing.
TABLE III: ResNet2d VAE encoder/decoder architectures. The first dimension of the input shows the number of filter-banks, and the third dimension indicates the number of frames T. Batchnorm and ReLU activations were used after each convolution. Subpixel convolutions were used for upsampling operations.

Encoder
Layer name    | Structure                     | Output
Input         | –                             | 80 × 1 × T
Conv2D-1      | – × –, 64, Stride 1           | – × 64 × T
ResNetBlock-1 | [– × –; – × –] × –, Stride 1  | – × – × T
ResNetBlock-2 | [– × –; – × –] × –, Stride 2  | – × – × T/2
ResNetBlock-3 | [– × –; – × –] × –, Stride 2  | – × – × T/4
ResNetBlock-4 | [– × –; – × –] × –, Stride 2  | – × – × T/8
Projection    | – × –, 80, Stride 1           | – × 80 × T/8

Decoder (mirrored, with stride-2 upsampling)
Layer name    | Structure                     | Output
Input         | –                             | – × – × T/8
Conv2D-1      | – × –, 512, Stride 1          | – × 512 × T/8
ResNetBlocks  | [– × –; – × –] × –, Stride 2  | – × – × T
Projection    | – × –, 1, Stride 1            | 80 × 1 × T
TABLE IV: Classification accuracy (%) for several undefended x-vector architectures under adversarial attacks.
TABLE V: Classification accuracy (%) for the randomized smoothing defense. We compare fine-tuning the x-vector with and without Gaussian noise, and fine-tuning the network classification head versus the full network.

The first block of the table shows results for ThinResNet34 fine-tuned without Gaussian noise (only real noise and reverberation). Smoothing only damages clean and universal-attack performance when using a very high σ, which corresponds to very perceptible noise. We also observe that smoothing damaged accuracy under FGSM attacks as we increased the smoothing σ. However, it improved accuracy under BIM and CW attacks. To make the system robust to BIM attacks, we needed a σ several times larger than the L∞ norm of the attack. Smoothing also provided a huge improvement for the Carlini-Wagner attack w.r.t. the undefended system.

In the second block, we fine-tuned the network head with Gaussian noise on top of real noise and reverberation. By doing this, we increased the robustness of the system for the smoothing defense. Thus, the performance for the clean and universal conditions with high smoothing σ improved, being close to the performance without defense. It also significantly improved the accuracy under FGSM over a range of L∞ values–in some cases up to 100% relative improvement w.r.t. the system fine-tuned without Gaussian noise–and improved BIM at some perturbation strengths. When using multiple smoothing samples, accuracy degraded. This was because averaging multiple noise samples provides a better estimation of the gradients for the attacker.

In the third block, we fine-tuned the full x-vector network with Gaussian noise. This allowed us to obtain good accuracies for very high smoothing σ. For most attacks, we got similar or better results than when fine-tuning just the head. One σ value obtained the best results overall, while a larger σ achieved the best results under high-L∞ BIM attacks, at the cost of damaging accuracy for low-L∞ attacks.
C. Defense GAN

We observed that the spectrograms generated by DefenseGAN lose fine details, which degraded performance on benign signals. As explained in Section V-B, we interpolated the original and reconstructed spectrograms using a factor α. We selected an intermediate α to maintain high benign accuracy while improving robustness against BIM and CW attacks. A lower α provided better benign accuracy but no improvement w.r.t. the undefended system. A higher α degraded benign accuracy down to 72%. Table VI compares DefenseGAN versus undefended classification accuracies. DefenseGAN degraded accuracy under FGSM attacks but improved it under BIM and CW attacks. For CW, we get an accuracy of 60.94% against 1.41% without defense. This shows that DefenseGAN has good potential to defend against the CW attack. This finding is similar to that of DefenseGAN for the CW attack on images [38].
D. Variational Auto-encoder (VAE) defense

Table VII presents the VAE defense results. First, we used the VAE as a pre-processing defense for the original ThinResNet34 x-vector (no fine-tuning block in the table). We compare the case where the attacker does not have access to the VAE model (black-box VAE) versus the case where the attacker knows the VAE model and can back-propagate through it (white-box VAE). The black-box VAE degrades FGSM accuracy w.r.t. the undefended system at high L∞ values, which are perceptible. However, it provides significant improvements for low-L∞ BIM–from 2.3% to 71.4% for L∞ = 0.001–and Carlini-Wagner–from 1.4% to 71.1%. When using the white-box VAE, adversarial accuracies were reduced. We only observe significant improvements w.r.t. the undefended system for low-L∞ BIM and Carlini-Wagner.

Introducing the VAE degraded clean accuracy by about 10% absolute. To counteract this, we fine-tuned the x-vector model including the VAE in the pipeline. Fine-tuning improved the clean, FGSM, and universal transfer attack accuracies, bringing them close to those of the undefended system. Again, the VAE improved BIM and CW attacks w.r.t. the undefended system.

Finally, we combined the VAE and randomized smoothing defenses. Smoothing plus the black-box VAE improved all BIM and CW attacks with respect to just the VAE. It also improved w.r.t. just smoothing in Table V. Only BIM with L∞ = 0.2, which is very perceptible, was seriously degraded w.r.t. the clean condition. For the case of smoothing + white-box VAE, BIM and CW also improved with respect to just the white-box VAE. However, there was no significant improvement w.r.t. just smoothing.
E. WaveGAN vocoder defense

Table VIII displays the WaveGAN defense results. We start using the WaveGAN as a black-box block from the attacker's point of view and compare the publicly available model trained on Arctic to our model trained on the VoxCeleb test set. We observe that the VoxCeleb model performed better in the benign condition and across all attacks. We conclude that the larger variability in terms of speakers and channels in VoxCeleb contributes to making the defense more robust. Contrary to the VAE experiments above, the WaveGAN did not degrade the benign accuracy. Thus, we did not need to fine-tune the x-vector network including the WaveGAN in the pipeline. The black-box VoxCeleb WaveGAN provided very high adversarial robustness, with accuracies above 90% for most attacks. The average absolute improvement over all attacks was 41.5% w.r.t. the baseline. The improvement for BIM (80.47% absolute) and CW (97.34% absolute) is noteworthy. This is far superior to the previous smoothing and VAE defenses. However, white-box WaveGAN performance degraded significantly w.r.t. the black-box setting. In the white-box setting, the WaveGAN only improved BIM with L∞ < 0.01 and CW, similarly to the white-box VAE.

We tried to improve further by integrating smoothing with the WaveGAN vocoder defense. We can do this in two ways: applying smoothing before or after the vocoder. The results are shown in the third and fourth blocks of Table VIII. We did not observe statistically significant differences between the two variants for small smoothing σ. For larger σ, smoothing followed by the WaveGAN was better. However, overall, adding smoothing to the black-box WaveGAN did not provide consistent improvements w.r.t. just the WaveGAN.

The last line of the table shows the results for smoothing followed by the white-box WaveGAN. In this case, accuracy was similar to just using smoothing. We conclude that combining smoothing and the WaveGAN vocoder defense is the best strategy. When the attacker does not have access to the defense model, it is robust across attacks and L∞ levels. When the attacker can back-propagate through the WaveGAN, it still maintains competitive robustness for most attacks, except those with very high L∞ norms.
VIII. CONCLUSION

We studied the robustness of several x-vector speaker identification systems to adversarial attacks, and analyzed four defenses to increase adversarial robustness. Table IX summarizes the results for the best configuration of each defense. The proposed defenses degraded benign accuracies very little–4.8% absolute in the worst case. This is a very mild degradation compared to the accuracy losses observed when the undefended system is adversarially attacked. We observed that undefended systems were inherently robust to FGSM and black-box universal attacks transferred from a speaker identification system with a different architecture. However, they were extremely vulnerable under iterative FGSM and Carlini-Wagner attacks.

The simple randomized smoothing defense robustified the system for imperceptible BIM (L∞ ≤ 0.01) and Carlini-Wagner attacks, recovering classification accuracies ∼97%. We also explored defenses based on three types of generative models: DefenseGAN, VAE, and WaveGAN vocoder. These models learn the manifold of benign samples and were used to project adversarial examples (outside of the manifold) back into the clean manifold. DefenseGAN and VAE worked at the log-spectrogram level, while WaveGAN worked at the waveform level. We noted that DefenseGAN performed poorly.
TABLE VI: Classification accuracy (%) for the DefenseGAN defense for two values of α.

TABLE VII: Classification accuracy (%) for the VAE defense with and without randomized smoothing; with and without fine-tuning the x-vector classifier with the VAE pre-processing; considering the VAE as a black-box or white-box block for the attacker.

TABLE VIII: Classification accuracy (%) for the WaveGAN defense with and without randomized smoothing; considering the WaveGAN as a black-box or white-box block for the attacker. The WaveGAN was trained on VoxCeleb, except the pretrained model, which was trained on Arctic.
For the VAE and WaveGAN, we considered two configurations: the attacker having knowledge of the defense (white-box) or not (black-box). In the black-box setting, both VAE and WaveGAN significantly improved adversarial robustness w.r.t. the undefended system. WaveGAN performed the best, being close to the clean condition. For BIM attacks, WaveGAN improved average accuracy from 17.16% to a staggering 97.63%. For CW, it improved accuracy from 1.4% to 98.8%. However, in the white-box setting, VAE and WaveGAN protection dropped significantly–WaveGAN accuracy dropped to 37% for the CW attack.

Finally, we experimented with combining randomized smoothing with the VAE or WaveGAN. We found that smoothing followed by the WaveGAN vocoder is the most effective defense. In the black-box setting, it protected against all attacks, including perceptible attacks with high-L∞ perturbations–average accuracy was 93%, compared to 52% for the undefended system. In the white-box setting, accuracy only degraded for perceptible perturbations with L∞ ≥ 0.1.

In the future, we plan to evaluate these defenses on other speech tasks like automatic speech recognition. Also, we plan to study how to improve the robustness of generative pre-processing defenses in the white-box setting.

TABLE IX: Summary of classification accuracy (%) of all defenses with their best settings. Note: all smoothing rows use the same σ, WaveGAN models are trained on VoxCeleb, and smoothing is applied before the WaveGAN/VAE.

                             |       | FGSM Attack (L∞)           | BIM Attack (L∞)            | Universal    | CW
Defense                      | Clean | 0.0001 0.001 0.01 0.1 0.2  | 0.0001 0.001 0.01 0.1 0.2  | (L∞ = 0.3)   | –
No defense                   | 100.0 | 96.9  90.0  92.3 93.4 91.1 | 83.4  2.3   0.0  0.0  0.0  | 100.0        | 1.4
Smoothing                    | 98.0  | 98.3  98.4  97.0 64.4 44.1 | 97.2  97.8  97.7 18.9 2.0  | 98.7         | 96.9
DefenseGAN                   | 96.3  | 91.6  84.2  81.9 49.4 23.3 | 85.0  23.6  2.8  1.4  1.6  | 96.9         | 60.9
VAE-blackbox                 | 98.9  | 98.4  98.1  94.2 91.4 84.8 | 94.7  67.2  12.7 0.9  1.1  | 99.9         | 56.1
VAE-whitebox                 | 99.4  | 96.9  94.1  94.2 91.4 84.8 | 92.3  35.3  1.3  0.9  0.5  | 99.8         | 50.2
Smoothing+VAE-blackbox       | 95.2  | 95.8  95.5  94.7 79.5 51.4 | 96.6  95.2  94.5 63.1 9.8  | 97.5         | 95.5
Smoothing+VAE-whitebox       | 95.2  | 95.8  96.1  95.2 65.6 50.0 | 96.1  95.8  94.2 19.1 2.7  | 97.4         | 95.8
WaveGAN-blackbox             | 99.5  | 99.5  99.7  99.1 86.6 77.2 | 99.4  99.7  99.5 97.2 92.3 | 99.9         | 98.8
WaveGAN-whitebox             | 97.0  | 98.8  95.8  93.6 83.0 62.3 | 94.7  36.4  0.8  0.8  0.8  | 99.8         | 37.5
Smoothing+WaveGAN-blackbox   | 95.6  | 95.2  96.3  95.8 93.0 74.2 | 94.8  94.8  96.9 95.5 93.4 | 97.5         | 95.2
Smoothing+WaveGAN-whitebox   | 95.8  | 94.7  95.6  94.4 88.9 60.6 | 95.2  93.3  86.7 14.4 3.1  | 97.4         | 92.8

REFERENCES

[1] R. K. Das, X. Tian, T. Kinnunen, and H. Li, "The Attacker's Perspective on Automatic Speaker Verification: An Overview," arXiv:2004.08849 [cs, eess], Apr. 2020. [Online]. Available: http://arxiv.org/abs/2004.08849
TABLE IX: Summary of classification accuracy (%) of all defenses with their best setting. Note: Smoothing σ = 0 . ,WaveGAN models is trained on Voxceleb, and smoothing is applied before WaveGAN/VAE. Defense Clean FGSM Attack BIM Attack Universal CW L ∞ - 0.0001 0.001 0.01 0.1 0.2 0.0001 0.001 0.01 0.1 0.2 0.3 -No defense 100.0 96.9 90.0 92.3 93.4 91.1 83.4 2.3 0.0 0.0 0.0 100.0 1.4Smoothing 98.0 98.3 98.4 97.0 64.4 44.1 97.2 97.8 97.7 18.9 2.0 98.7 96.9DefenseGAN 96.3 91.6 84.2 81.9 49.4 23.3 85.0 23.6 2.8 1.4 1.6 96.9 60.9VAE-blackbox 98.9 98.4 98.1 94.2 91.4 84.8 94.7 67.2 12.7 0.9 1.1 99.9 56.1VAE-whitebox 99.4 96.9 94.1 94.2 91.4 84.8 92.3 35.3 1.3 0.9 0.5 99.8 50.2Smoothing+VAE-blackbox 95.2 95.8 95.5 94.7 79.5 51.4 96.6 95.2 94.5 63.1 9.8 97.5 95.5Smoothing+VAE-whitebox 95.2 95.8 96.1 95.2 65.6 50.0 96.1 95.8 94.2 19.1 2.7 97.4 95.8WaveGAN-blackbox 99.5 99.5 99.7 99.1 86.6 77.2 99.4 99.7 99.5 97.2 92.3 99.9 98.8WaveGAN-whitebox 97.0 98.8 95.8 93.6 83.0 62.3 94.7 36.4 0.8 0.8 0.8 99.8 37.5Smoothing+WaveGAN-blackbox 95.6 95.2 96.3 95.8 93.0 74.2 94.8 94.8 96.9 95.5 93.4 97.5 95.2Smoothing+WaveGAN-whitebox 95.8 94.7 95.6 94.4 88.9 60.6 95.2 93.3 86.7 14.4 3.1 97.4 92.8[2] M. Todisco, X. Wang, V. Vestman, M. Sahidullah, H. Delgado,A. Nautsch, J. Yamagishi, N. Evans, T. Kinnunen, and K. A. Lee,“ASVSpoof 2019: Future horizons in spoofed and fake audio detection,”in INTERSPEECH 2019 , Graz, Austria, sep 2019, pp. 1008–1012.[3] C.-I. Lai, N. Chen, J. Villalba, and N. Dehak, “ASSERT: Anti-Spoofingwith Squeeze-Excitation and Residual neTworks,” in