Adversarial Training for Audio Classifiers

Raymel Alfonso Sallo, Mohammad Esmaeilpour, Patrick Cardinal
École de Technologie Supérieure (ÉTS), Département de Génie Logiciel et TI
1100 Notre-Dame W, Montréal, H3C 1K3, Québec, Canada
[email protected]
Abstract—In this paper, we investigate the effect of adversarial training on the robustness of six advanced deep neural networks against a variety of targeted and non-targeted adversarial attacks. We first show that the ResNet-56 model trained on the 2D representation of the discrete wavelet transform appended with the tonnetz chromagram outperforms other models in terms of recognition accuracy. We then demonstrate the positive impact of adversarial training on this model, as well as on other deep architectures, against six types of attack algorithms (white-box and black-box), at the cost of reduced recognition accuracy, under a limited adversarial perturbation budget. We run our experiments on two benchmark environmental sound datasets and show that, without any limitation on the budget allocated to the adversary, the fooling rate of the adversarially trained models can exceed 90%. In other words, adversarial examples exist at any scale, but they may require larger adversarial perturbations than against non-adversarially trained models.
Index Terms—Spectrogram, Chromagram, Tonnetz Features, Discrete Wavelet Transformation (DWT), Short-Time Fourier Transformation (STFT), Sound Classification, Deep Neural Network, ResNet, VGG, AlexNet, GoogLeNet, Adversarial Attack, Adversarial Training.
I. INTRODUCTION

The existence of adversarial attacks has been characterized for data-driven audio and speech recognition models in both the waveform and representation domains [1], [2]. Over the past years, many strong white-box and black-box adversarial algorithms have been introduced, which essentially recast costly optimization problems against victim classifiers. Unfortunately, these attacks effectively degrade the classification performance of almost all data-driven models, from conventional classifiers (e.g., support vector machines) to state-of-the-art deep neural networks [3]. This poses a growing concern about the security and reliability of such classifiers.

The typical approach to crafting an adversarial example is to solve an optimization problem that finds the smallest possible perturbation of a legitimate sample, undetectable by a human, that fools the classifier. The measures commonly used to compare the altered sample with the original one are the $\ell_2$ and $\ell_\infty$ similarity metrics. The computational complexity of this optimization depends on the dimensionality of the given input samples. Consequently, it requires considerable computational overhead for high-dimensional data, even for short audio signals [1]. Regardless of their computational cost, however, these threats actively exist for any end-to-end audio and speech classifier. Since the highest recognition accuracies have been reported on 2D representations of audio signals [2], [4], optimized attack algorithms developed for computer vision applications, such as the fast gradient sign method (FGSM) [5], have led to security concerns for audio classifiers [3].

Although some approaches have been introduced for defending victim models against adversarial attacks, no reliable framework has yet achieved the required efficiency. Based on the detailed discussion in [6], common defense algorithms usually obfuscate gradient information, but running stronger attack algorithms against them consistently fools these detectors. Vulnerability to adversarial attacks remains an open problem in data-driven classification, and although the generated fake examples look very similar to noisy samples, they lie in dissimilar subspaces [3], [7]. It has been shown that adversarial examples lie in manifolds marginally over the decision boundary of the victim classifier, where the model lacks generalizability [3]. Therefore, integrating these examples into the training set of the victim classifier could improve its robustness. This approach, known as adversarial training [5], may be a more reasonable defense since it does not shatter gradient vectors [6]. However, there is no guarantee for the safety of adversarially trained classifiers [8].

Although there are some discussions in the computer vision domain about the negative effect of adversarial training on the recognition performance of the victim classifiers [9], to the best of our knowledge, this potential side effect has not yet been studied for 2D representations of audio signals. We address this issue in this paper and report our results on two benchmark environmental sound datasets.
Specifically, our main contributions in this paper are:
• characterizing the impact of adversarial training on six advanced deep neural network architectures for diverse audio representations,
• demonstrating that deep neural networks, especially those with residual blocks, achieve higher recognition performance on tonnetz features concatenated with DWT spectrograms than on STFT representations,
• showing that the adversarially trained AlexNet model outperforms ResNets when the perturbation magnitude is limited,
• experimentally proving that although adversarial training reduces the recognition accuracy of the victim model, it makes the attack more costly for the adversary in terms of the required perturbation.

The rest of this paper is organized as follows. In Section II, we review related work on adversarial attacks developed for 2D domains. Details about signal transformation and 2D representation production are provided in Sections III and IV, respectively. In Section V, we briefly introduce the selected front-end audio classifiers, which are state-of-the-art deep learning architectures. The adversarial attack procedures and the budget allocation for the adversary are discussed in Section VI. Section VII then explains the adversarial training framework, and the obtained results are summarized in Section VIII.
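Since FGSM recurs throughout the paper, both as the baseline attack and, later, as the generator used for adversarial training, a minimal sketch may help fix ideas. It assumes a PyTorch classifier returning logits (the paper does not state its implementation framework); `model`, `x`, `y`, and `eps` are placeholders.

```python
import torch
import torch.nn.functional as F

def fgsm(model, x, y, eps):
    """One-shot FGSM [5]: step of size eps along the sign of the loss gradient."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)   # J(x, l, w)
    loss.backward()
    return (x + eps * x.grad.sign()).detach()
```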
II. RELATED WORKS

There is a large volume of published studies on attacking classifiers with different optimization techniques that aim to disrupt their recognition performance. In this paper, we focus on five strong white-box targeted and non-targeted attack algorithms, which have been reported to be very destructive against advanced deep learning models trained on audio representations [2]. Moreover, we also use a black-box adversarial attack, based on gradient approximation, against the victim classifiers.

The fast gradient sign method is a well-established baseline for targeted adversarial attacks. The computational cost of this one-shot approach at runtime is low, as it takes advantage of the linear characteristics of deep neural networks. Kurakin et al. [10] introduced an iterative version of FGSM, known as the basic iterative method (BIM), for running stronger attacks against victim classifiers; it is formulated as:

$$x'_{n+1} = \mathrm{clip}_{\epsilon}\left\{ x'_{n} + \zeta\, \mathrm{sgn}\left( \nabla_{x} J[x'_{n}, l(x)] \right) \right\} \qquad (1)$$

where the legitimate sample and its associated adversarial example are denoted by $x$ and $x'$, respectively. The initial state in this recursive formulation is $x'_{0} = x$, and the iterates stay in the $\epsilon$-neighbourhood (the distance measured by a similarity metric such as $\ell_2$) of the legitimate manifold; the clipping operation keeps the adversarial perturbation within $[-\epsilon, \epsilon]$. Moreover, $l(x)$ and $\mathrm{sgn}(\cdot)$ stand for the label of $x$ and the usual sign function. In Eq. 1, the step size is $\zeta = 1$, though it is tunable according to the adversary's wishes. Two stopping rules can be used with Eq. 1: (1) optimizing until the first adversarial example is reached (BIM-a), and (2) continuing the optimization for a predefined number of iterations (BIM-b). For measuring $\epsilon$, two similarity metrics are suggested: $\ell_\infty$ and $\ell_2$. In this work, we focus on the latter.
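A sketch of Eq. 1, under the same PyTorch assumptions as the FGSM sketch above; the step size `zeta` and the iteration count are placeholders.

```python
import torch
import torch.nn.functional as F

def bim(model, x, y, eps, zeta=1.0, steps=10):
    """Basic iterative method, Eq. 1: repeated signed-gradient steps,
    clipped so the accumulated perturbation stays within [-eps, eps].
    BIM-a would stop at the first misclassification; BIM-b always runs
    all `steps` iterations."""
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad, = torch.autograd.grad(loss, x_adv)
        x_adv = x_adv.detach() + zeta * grad.sign()
        x_adv = x + torch.clamp(x_adv - x, -eps, eps)   # the clip_eps{...} of Eq. 1
    return x_adv.detach()
```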
Gradient information of a deep neural network encodes the directions of intensity variation associated with the model's decision boundary. Exploiting these vectors to find the least likely probability distribution is the key idea of the Jacobian-based saliency map attack (JSMA) [11]. For the adversarial label $l'$, this iterative attack runs against the model $f$ and strives to achieve $l' = \arg\max_{j} f_{j}(x)$. JSMA increases the probability of the target label $l'$ while minimizing those of the other classes, including the ground truth, using a saliency map as shown in Eq. 2:

$$S(x, l')[i] = \left| J_{i,l'}(x) \right| \sum_{j \neq l'} J_{i,j}(x) \qquad (2)$$

where $J_{i,j}$ denotes the forward derivative of the model for feature $i$, computed as:

$$J_{f}[i,j](x) = \frac{\partial f_{j}(x)}{\partial x_{i}} \qquad (3)$$

If the Jacobian value associated with label $l'$ is negative, or the summed saliency over the other labels is positive (the no-variation shield), then $S(x, l')[i] = 0$. This white-box attack iteratively searches for the feature index at which the perturbation should be applied in order to fool the model toward the target label $l'$, using the $\ell_2$ similarity metric.

The perturbation required for pushing a sample over the decision boundary of the victim classifier should be as small as possible. In a white-box scenario, the optimization process can use local properties of the decision boundary. It has been shown that linearizing the boundary in the subspace of the original samples can yield adversarial perturbations smaller than those of the FGSM attack. This approach, known as the DeepFool attack, is shown in Eq. 4 [12]:

$$\arg\min \|\epsilon\|_{2}, \quad \epsilon = -\frac{f(x)\, w}{\|w\|_{2}^{2}} \qquad (4)$$

where $w$ refers to the weight function of the recognition model. Unlike the other aforementioned adversarial attacks, DeepFool is a non-targeted attack, and it iterates as many times as needed to push random samples marginally over the locally linearized decision boundary, with the condition of maximizing the prediction likelihood toward any label other than the ground truth. Though both $\ell_\infty$ and $\ell_2$ metrics can be used in the DeepFool attack, we use the latter, in accordance with the BIM algorithms.

Presumably, a straightforward way to keep an adversarial perturbation undetectable is to reduce its magnitude and distribute it over all input features. Additionally, not every feature should be perturbed, and their gradient vectors should not be shattered. Following these two assumptions, the Carlini and Wagner attack (CWA) was introduced [13]. The general framework of their algorithm is based on the following minimization problem:

$$\min \|x' - x\|_{2}^{2} + c \cdot L(x') \qquad (5)$$

where the constant $c$ is found through a binary search. Finding the most appropriate value for this hyperparameter is very challenging, since it may easily dominate the distance function and push the sample too far away from the adversarial subspace. Although Eq. 5 employs the $\ell_2$ similarity metric for computing the adversarial perturbation $\epsilon$, CWA generalizes properly to both $\ell_0$ and $\ell_\infty$. In the configuration of this attack, the loss function $L$ is defined over the logits $Z$ of the trained model $f$, as shown in the following equation:

$$L(x') = \max\left[ \max_{i \neq l'} \left\{ Z(x')_{i} \right\} - Z(x')_{l'},\; -\kappa \right] \qquad (6)$$

where $\kappa$ controls the effectiveness and the adjacency of the adversarial examples to the decision boundary of the model. In this regard, higher values of this parameter, in conjunction with a minimal $\epsilon$-neighbourhood, result in adversarial examples with higher confidence.

To achieve an overall unrestricted adversarial perturbation $\|\epsilon\|$ of sufficiently small magnitude, CWA solves Eq. 5 through the following optimization framework:

$$\min_{\rho} \left\| \tfrac{1}{2}\left(\tanh(\rho) + 1\right) - x \right\|_{2}^{2} + c \cdot L\!\left( \tfrac{1}{2}\left(\tanh(\rho) + 1\right) \right) \qquad (7)$$

where $\rho = \operatorname{arctanh}(x + \delta)$ and the unrestricted approximate perturbation $\delta^{*}$ is as follows:

$$\delta^{*} = \tfrac{1}{2}\left(\tanh(\rho) + 1\right) - x \qquad (8)$$

This perturbation is unrestricted and should be tuned per feature value by measuring $\nabla f(x + \delta^{*})$. For feature intensities with negligible gradient values, the actual adversarial perturbation truncates to zero; for the rest, $\delta \leftarrow \delta^{*}$.

Attacking victim classifiers with unrestricted access to the details of the attacked models, including the training dataset, hyperparameters, architecture, and, more importantly, gradient information, as in all the abovementioned attack algorithms, is less challenging than black-box attack scenarios. Usually, in the latter scheme, the adversary runs a gradient estimation by querying the classifier or by training a surrogate model. In this paper, the chosen black-box attack is based on the natural evolution strategy (NES) [14], which has been employed for gradient approximation in [15]. This iterative algorithm is known as the partial information attack (PIA), and it encodes the $\ell_\infty$ similarity metric as part of its targeted optimization problem. Finding a proper adversarial perturbation bound for PIA is somewhat challenging and requires a very large number of queries to the victim model.
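The NES gradient estimate at the core of PIA can be sketched as follows; the query interface `prob_of_target` and the sample counts are assumptions for illustration, not the exact setup of [15].

```python
import torch

def nes_gradient(prob_of_target, x, sigma=1e-3, n_samples=50):
    """NES gradient estimate used by PIA [14], [15]: query the victim on
    antithetic Gaussian perturbations of x and average the weighted scores.
    `prob_of_target(x)` is an assumed query interface returning the victim's
    probability for the adversarial target label."""
    grad = torch.zeros_like(x)
    for _ in range(n_samples):
        u = torch.randn_like(x)
        # Antithetic pair (x + su, x - su) halves the estimator variance.
        grad += (prob_of_target(x + sigma * u) - prob_of_target(x - sigma * u)) * u
    return grad / (2 * sigma * n_samples)
```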
Before discussing how the adversarial attacks and adversarial training have been implemented for the various deep neural network architectures, we first provide a brief overview of the transformation of an audio signal into 2D representations. The next section describes spectrogram generation using the short-time Fourier transformation (STFT), the discrete wavelet transformation (DWT), and tonnetz feature extraction. We then train our classifiers on these representations and investigate how adversarial training impacts their robustness to adversarial attacks.

III. AUDIO TRANSFORMATION

Since audio and speech signals have high dimensionality in the time domain, their 2D representations, with lower dimensionality, have been widely used for training advanced classifiers originally developed for 2D computer vision applications [16]. In this work, we use STFT and DWT, both with and without tonnetz features, for generating 2D representations of audio signals. This section briefly reviews the transformations required by this work.

For a discrete signal $a[n]$ distributed over time $n$, using the Hann window function $H[\cdot]$, we can compute the complex Fourier transformation with the following equation:

$$\mathrm{STFT}\{a[n]\}(m, \omega) = \sum_{n=-\infty}^{\infty} a[n]\, H[n - m]\, e^{-j\omega n} \qquad (9)$$

where $m$ is the time scale and $m \ll n$. Additionally, $\omega$ stands for the continuous frequency coefficient. This transformation is applied to short overlapping sub-signals with a predefined sampling rate and forms the STFT spectrogram as shown in Eq. 10:

$$\mathrm{SP}_{\mathrm{STFT}}\{a[n]\}(m, \omega) = \left| \sum_{n=-\infty}^{\infty} a[n]\, H[n - m]\, e^{-j\omega n} \right|^{2} \qquad (10)$$

There are several variants of the STFT, such as mel-scale and cepstral coefficients, that produce even lower dimensionality and have been widely used for various speech processing tasks [17], [18]. In this work, however, we use the standard STFT representation for training the front-end dense classifiers.

Generating a DWT spectrogram is very similar to the Fourier transformation, as both employ continuous and differentiable basis functions. For the wavelet transformation, several basis functions have been studied, and their effectiveness for audio signals has been investigated in [19], [20]. The general form of this transformation for a continuous function $a(t)$ is shown in Eq. 11:

$$\mathrm{DWT}\{a(t)\} = \frac{1}{\sqrt{|s|}} \int_{-\infty}^{\infty} a(t)\, \psi\!\left( \frac{t - \tau}{s} \right) dt \qquad (11)$$

where $\tau$ and $s$ refer to the time variation in the transformation and the wavelet scale, respectively. Moreover, $\psi$ stands for the mother basis function. Common choices for this function are the Haar, Mexican Hat, and complex Morlet wavelets. The latter has been extensively used in signal processing, mainly because of its nonlinear characteristics [16] (see Eq. 12):

$$\psi(t) = \frac{1}{\sqrt{\pi}}\, e^{-j\omega t}\, e^{-t^{2}/2} \qquad (12)$$

The complex Morlet is continuous in its conjugate manifold. The convolution of this function with overlapping chunks of the given audio signal results in its spectral visualization, as described in Eq. 13:

$$\mathrm{SP}_{\mathrm{DWT}}\{a(t)\} = \left| \mathrm{DWT}\{a[k, n]\} \right|^{2} \qquad (13)$$

where $k$ and $n$ are integer numbers associated with the scales of $\psi$.
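For illustration, Eq. 10 maps onto a few lines of Librosa [22], the library later used for spectrogram production in Section IV; the file name is a stand-in, and the window/hop sizes anticipate the values reported there.

```python
import numpy as np
import librosa

# Eq. 10 in a few lines of Librosa [22]; "example.wav" is a stand-in.
y, sr = librosa.load("example.wav", sr=None)
S = librosa.stft(y, n_fft=2048, hop_length=512, window="hann")
sp_stft = np.abs(S) ** 2                 # |STFT{a[n]}|^2
log_sp = librosa.power_to_db(sp_stft)    # logarithmic visualization
```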
The two aforementioned transformations represent the spatiotemporal modulation features of a signal in the frequency domain, regardless of its harmonic characteristics. It has been demonstrated that using the harmonic change detection function (HCDF) provides distinctive features for an audio signal [21]. This function derives a chromagram from the constant-Q transformation (CQT); the resulting features are also known as tonnetz features. According to [21], there are four major steps in an HCDF system. First, the audio signal is converted into logarithmic spectrum vectors using the CQT. Then, pitch-class vectors are extracted from the tonal transformation based on the quantized chromagram. In the third step, 6-dimensional centroid vectors form a tensor from the tonal transformation. Finally, a smoothing operation postprocesses this tensor for distance calculation.

We use the HCDF system for generating chromagrams from audio signals in order to enhance the recognition performance of the classifiers.
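Librosa also exposes a tonnetz extractor that mirrors the chromagram-to-tonal-centroid part of this pipeline; a minimal sketch, with a stand-in file name:

```python
import librosa

# Tonal centroid (tonnetz) features from the chroma stage of the pipeline
# described above.
y, sr = librosa.load("example.wav", sr=22050)
tonnetz = librosa.feature.tonnetz(y=y, sr=sr)   # shape: (6, n_frames)
```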
In the next section, we provide details of this process for two benchmark environmental sound datasets.

IV. SPECTROGRAM PRODUCTION

We produce the STFT representation with the open-source Python library Librosa [22]. We set the window size and the hop length ($n$ and $m$ in Eq. 9) to 2048 and 512, respectively. Additionally, we set the number of filters to 2048, which is the standard value for environmental sound tasks [16]. The audio chunks associated with each window are padded in order to reduce the potential negative effect of losing temporal dependencies. Furthermore, the frames are overlapped with a ratio of 50%.

For generating DWT spectrograms, we use our modified version of the Wavelet Sound Explorer [23] with the complex Morlet mother function. As proposed in [4], we set the DWT sampling frequency to 16 kHz for ESC-50 and 8 kHz for UrbanSound8K, with a uniform 50% overlapping ratio. For enhancement purposes, we apply a logarithmic visualization to the generated spectrograms to better characterize high-frequency regions.

For the tonnetz chromagram, we use the default settings provided by Librosa with a sampling rate of 22.05 kHz. We resize the resulting chromagrams so that they comply with the aforementioned representations. Inspired by [24], we append these features to the STFT and DWT spectrograms and organize them into two additional representations.
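A sketch of the appending step, assuming the chromagram is simply resized and stacked below the spectrogram; the resizing routine and target height are assumptions, since the paper only states that the resized chromagram must comply with the spectrogram dimensions.

```python
import numpy as np
from skimage.transform import resize   # any image-resampling routine works

def append_tonnetz(spectrogram, tonnetz, chroma_height):
    """Resize the tonnetz chromagram to the spectrogram's frame axis and
    stack it below, yielding the STFT|Tonnetz / DWT|Tonnetz inputs."""
    chroma = resize(tonnetz, (chroma_height, spectrogram.shape[1]),
                    anti_aliasing=True)
    return np.vstack([spectrogram, chroma])
```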
In the next section, we provide more details about the training of the front-end classifiers on these four spectrogram sets.

V. CLASSIFICATION MODELS

Since an adversary runs the adversarial attack against the classifier, the choice of the victim network architecture affects the fooling rate of the model. This issue has been studied in [2] for the advanced GoogLeNet [25] and AlexNet [26] architectures trained on DWT (with linear, logarithmic, and logarithmic-real visualizations), STFT, and their pooled spectrograms. Since our main objective is investigating the impact of adversarial training on advanced deep learning classifiers, we additionally include ResNet-X architectures with X ∈ {18, 34, 56} [27] and the VGG-16 [28] architecture.

Pretrained models of these six classifiers have been used, and their input and output layers have been fine-tuned as described in [2]. The computational hardware used for all experiments consists of two NVIDIA GTX Ti-series GPUs in addition to a 64-bit Intel Core-i series CPU. We carry out our experiments using a five-fold cross-validation setup for all the spectrogram sets. As a common practice in model performance analysis, we preserve 70% of the samples for training and development, followed by an early-stopping scenario. We report the recognition accuracy of these models on the remaining 30% of the samples.
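A hypothetical PyTorch sketch of this fine-tuning protocol (the paper does not name its framework, and torchvision ships ResNet-18/34 rather than a ResNet-56, so ResNet-18 stands in): load a pretrained backbone and retrain only the replaced input and output layers.

```python
import torch
import torchvision

# Load an ImageNet-pretrained backbone; replace the input- and
# output-facing layers, as in [2].
model = torchvision.models.resnet18(pretrained=True)
model.conv1 = torch.nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3,
                              bias=False)                  # 1-channel spectrograms
model.fc = torch.nn.Linear(model.fc.in_features, 10)       # e.g., UrbanSound8K

# Freeze everything except the replaced layers.
for name, p in model.named_parameters():
    p.requires_grad = name.startswith(("conv1", "fc"))
optimizer = torch.optim.Adam(
    [p for p in model.parameters() if p.requires_grad], lr=1e-4)
```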
In the next section, we provide the detailed setup for the adversarial algorithms introduced in Section II. We additionally discuss the budget allocations required by the adversary for successfully attacking the six finely trained victim models.

VI. ADVERSARIAL ATTACK SETUP
For effectively attacking the classifiers, the adversary should tune the hyperparameters required by the attack algorithms, such as the number of iterations, the perturbation limit, and the number of line searches within the manifold, which we collectively express as the budget allocations. For finding the optimal required budgets, we bind the fooling rates of the attack algorithms to a predefined threshold of $\mathrm{AUC} > 0.9$, associated with the area under the curve of the attack success. In other words, we allocate as much budget as needed to reach $\mathrm{AUC} > 0.9$ for all attacks against the victim models. This is a critical threshold for demonstrating the extreme vulnerability of neural networks to adversarial attacks.

In accordance with the above, we use Foolbox [29], a freely available Python package that supports uniform, reproducible implementations of the attack algorithms. For the BIM-a and BIM-b algorithms, we impose a minimum perturbation bound on $\epsilon$ together with a minimum confidence level. In the JSMA framework, we set the number of iterations to a maximum of 1000 and bound the scaling factor (with an equivalent displacement of 50). The number of iterations in the DeepFool attack is initialized to 100, with a supremum of 600 and a static step of 100. For the costly CWA attack, we use four search values of $c$, each with an associated iteration budget. Except for DeepFool, which is a non-targeted attack, we randomly select wrong target labels for the rest of the algorithms.

Four hyperparameters are required for the black-box PIA algorithm. We empirically limit the perturbation bound $\epsilon$, followed by an iterative line search to find an approximately optimal variance for the NES gradient estimation. We initialize the number of iterations to 500, with a fixed decay rate and a bounded learning rate $\eta$.

In the framework in which we attack the front-end audio classifiers, we run the algorithms on shuffled batches of 500 samples, in up to 50 batches of 100 samples randomly selected from the clean spectrograms at every step, toward spanning the entire datasets. These attacks are performed with the abovementioned allocated budgets, once before and once after adversarial training, in order to measure the robustness of the models. Section VII provides details on how adversarial training has been implemented.

Fig. 1: Crafted adversarial examples for the ResNet-56 using the six optimization-based attack algorithms. The first column of the figure shows the original DWT, DWT|Tonnetz, STFT, and STFT|Tonnetz representations of a randomly selected sample from the class 'children playing' in the UrbanSound8K dataset. The other columns are associated with the attack algorithms, namely BIM-a, BIM-b, JSMA, DeepFool, CWA, and PIA, respectively. Adversarial perturbation values are given at the bottom of each adversarial spectrogram.
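The per-attack settings above map onto Foolbox [29] roughly as follows (foolbox v3 API; the perturbation budget, `images`, and `labels` are placeholders, not values from the paper):

```python
import foolbox as fb

# `model` is a trained PyTorch classifier on spectrograms scaled to [0, 1];
# `images` and `labels` are a batch of clean samples.
fmodel = fb.PyTorchModel(model.eval(), bounds=(0, 1))
attack = fb.attacks.L2BasicIterativeAttack()           # BIM under the l2 metric
raw, clipped, success = attack(fmodel, images, labels, epsilons=1.0)
fooling_rate = success.float().mean().item()
```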
VII. ADVERSARIAL TRAINING
The idea of adversarial training was first proposed in [5], where the authors showed that augmenting the training dataset with one-shot FGSM adversarial examples improves the robustness of the victim models. The main advantage of this simple approach is that it neither shatters nor obfuscates gradient information, while running a fast, non-iterative procedure. This has made adversarial training a relatively reliable defense approach. However, it may not confidently defend against stronger white-box adversarial algorithms [8].

Many adversarial defense approaches introduced over the past years have been reported to outperform FGSM-based adversarial training [30], [31], [32]. However, some studies have reported that these advanced defenses shatter gradient vectors and may easily break against strong adversarial attacks that do not incorporate exact gradient information, such as the backward pass differentiable approximation [6].

Augmenting the clean training dataset with adversarial examples in the adversarial training framework is expressed in Eq. 14 [5]:

$$J'(x, l, w) = \alpha J(x, l, w) + (1 - \alpha) J(x', l, w) \qquad (14)$$

where $\alpha$ is a subjective weighting scalar definable by the adversary. Additionally, $J$ and $w$ denote the loss function and the derived weight vector of the victim model, respectively. Moreover, $x$ and $x'$ refer to the legitimate and adversarial examples associated with the genuine label $l$. Adversarial training using a costly attack algorithm is very time-consuming and memory-prohibitive in practice. Therefore, we use FGSM for augmenting the original spectrogram datasets with adversarial examples, under the assumption $J'(x, l, w) = J(x', l, w)$.
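Eq. 14 reduces to a few lines when combined with the FGSM sketch from Section I; `alpha` and the reuse of that sketch are assumptions about implementation details the paper leaves open.

```python
import torch.nn.functional as F

def adversarial_training_loss(model, x, y, eps, alpha=0.5):
    """Mixed objective of Eq. 14. With alpha = 0 it collapses to the
    J'(x, l, w) = J(x', l, w) assumption adopted in this section;
    `fgsm` is the sketch given earlier."""
    x_adv = fgsm(model, x, y, eps)
    return (alpha * F.cross_entropy(model(x), y)
            + (1 - alpha) * F.cross_entropy(model(x_adv), y))
```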
In the next section, we report our results for the dense neural network models regarding adversarial attacks and adversarial training on four different representations, namely STFT, DWT, STFT appended with tonnetz features, and DWT appended with tonnetz chromagrams.

VIII. EXPERIMENTAL RESULTS

We conduct our experiments on two environmental sound datasets: UrbanSound8K [33] and ESC-50 [34]. The first dataset contains 8732 short recordings, each at most four seconds long, arranged in 10 classes (car horn, dog bark, drilling, jackhammer, street music, siren, children playing, air conditioner, engine idling, and gun shot). The ESC-50 dataset contains 2000 audio signals of an equal length of five seconds, organized in 50 classes.

For enhancing both the quality and quantity of these datasets, especially ESC-50, we augment samples using the pitch-shifting operation in the temporal domain, as proposed in [16]. Following their 1D filtration setup, we use four pitch-shifting scales, which increases the size of the datasets by a factor of 4.
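A sketch of this augmentation using Librosa's pitch shifter; the four shift values shown are placeholders for illustration, not the exact scales of [16].

```python
import librosa

def pitch_shift_set(y, sr, steps=(-2, -1, 1, 2)):
    """Temporal-domain pitch-shifting augmentation in the spirit of [16];
    returns one shifted copy of the signal per step value."""
    return [librosa.effects.pitch_shift(y, sr=sr, n_steps=s) for s in steps]
```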
Following the explanations provided in Section IV about spectrogram production, each resulting spectrogram has the same fixed dimensions for both the STFT and DWT (logarithmic-scale) representations on the two datasets. Moreover, the resulting chromagrams are resized so that they can be appended to the aforementioned representations.

TABLE I: Recognition performance (%) of the audio classifiers (GoogLeNet, AlexNet, ResNet-18, ResNet-34, ResNet-56, and VGG-16) trained on the original STFT, DWT, STFT|Tonnetz, and DWT|Tonnetz spectrogram datasets of ESC-50 and UrbanSound8K, without adversarial example augmentation. Values inside parentheses indicate the recognition percentage drop after adversarially training the models with the fooling rate AUC > 0.9, with the maximum perturbation capped at a fixed $\ell_2$ budget. Outperforming accuracies are shown in bold face.

TABLE II: Robustness comparison (average AUC %) of the adversarially trained models, attacked under the same $\ell_2$ perturbation constraint, for the same six architectures and four representations. Victim models with lower fooling rates are indicated in bold.

Table I summarizes the recognition accuracies of the classifiers trained on these spectrograms. Additionally, it shows the effect of adversarial training on the recognition performance of these models.

The classifiers in Table I have been selected for evaluation on the test sets after running the five-fold cross-validation scenario on the randomized development portion of the training datasets. Regarding this table, the different deep neural network architectures show competitive performances. However, in the majority of cases, the ResNet-56 outperforms the other classifiers, averaged over 10 repeated experiments on the spectrograms. The highest recognition accuracy is achieved by the ResNet-56 architecture trained on the appended representation of DWT and tonnetz chromagrams, for both UrbanSound8K and ESC-50. The number of parameters in the ResNet-56 is 11.3% and 14.26% higher than in its rival models VGG-16 and ResNet-34, respectively.
TABLE III: Comparison of $\epsilon_{r}$ for attacking the original and the adversarially trained models under the constraint AUC > 0.9, for the same six architectures and four representations. The higher value of $\epsilon_{r}$ for each representation is shown in bold.

Fig. 1 visually compares the adversarial examples crafted against the outperforming classifier, the ResNet-56, using the six adversarial attacks on a randomly selected audio sample represented with the four spectrogram approaches described earlier. Although the generated spectrograms are visually very similar to their legitimate counterparts, they all cause the classifier to predict wrong labels.

Table I also shows the drop in recognition accuracy after adversarially training the models following the procedure explained in Section VII. The maximum adversarial perturbation required for complying with the fooling rate of AUC > 0.9, averaged over all the attacks, stays within a fixed $\ell_2$ budget. In attacking the adversarially trained models, the procedure outlined in Section VI has been implemented individually for every audio classifier. According to the obtained results, adversarial training considerably reduces the performance of all models. For ESC-50, the neural networks trained on the appended representation of STFT and tonnetz features (STFT|Tonnetz) experienced the most negative impact compared to the other representations. The average drop ratio for adversarially trained models on the DWT|Tonnetz representations is slightly higher than for their STFT|Tonnetz counterparts on the UrbanSound8K dataset.
However, for both datasets, these ratios for models trained on the DWT spectrograms are considerably higher than for those trained on the STFT representations.

We measure the fooling rate of the adversarially trained models after attacking them with the same six adversarial algorithms, following the procedure explained in Section VI, with the same bound imposed on the adversarial perturbation. This experiment uncovers the impact of adversarial training on the robustness of the audio classifiers (see Table II). We applied the aforementioned condition to make this table comparable with Table I. Regarding the results reported in Table II, adversarial training has improved the robustness of all the classifiers, particularly AlexNet.

For investigating the overall impact of adversarial training on the robustness of the audio classifiers, we attack the adversarially trained models using the same six attack algorithms without any constraint on $\|\epsilon\|$. Unfortunately, we could achieve a fooling rate of AUC > 0.9 for all the classifiers following the attack procedure explained in Section VI. However, attacking the adversarially trained models requires larger adversarial perturbations $\|\epsilon\|$ than attacking the original models and, consequently, increases the number of callbacks to the original spectrogram with extra batch gradient computations. This might degrade the quality of the generated spectrograms. In order to analytically compare the maximum adversarial perturbation required for the original and the adversarially trained models, we compute the average perturbation ratio as shown in Eq. 15:

$$\epsilon_{r} = \left| \frac{\epsilon_{a}}{\epsilon_{o}} \right| \qquad (15)$$

where $\epsilon_{a}$ and $\epsilon_{o}$ denote the average adversarial perturbations required for successfully attacking the adversarially trained and the original models (both with AUC > 0.9), respectively. Table III summarizes the values of $\epsilon_{r}$ for the victim models trained on the different representations.

Note that $\epsilon_{r} \geq 1$ indicates a positive impact of adversarial training on the robustness of the audio classifiers, via increasing the computational cost of the attack by expanding the magnitude of the required adversarial perturbation. With respect to the measured $\epsilon_{r}$ metric over all the front-end classifiers, the ResNet-56 architecture showed better robustness against adversarial attacks, on average, in 50% of the experiments. In other words, attacking this model adds additional cost for the adversary in crafting adversarial examples with AUC > 0.9.
IX. CONCLUSION

In this paper, we studied the impact of adversarial training as a gradient-obfuscation-free defense against adversarial attacks. We trained six advanced deep learning classifiers on four different 2D representations of environmental audio signals and ran five white-box and one black-box attack algorithms against these victim models. We demonstrated that adversarial training considerably reduces the recognition accuracy of the classifiers but improves their robustness against six types of targeted and non-targeted adversarial examples when the maximum required adversarial perturbation is constrained. In other words, adversarial training is not a remedy for the threat of adversarial attacks, but it escalates the cost of the attack for the adversary by demanding larger adversarial perturbations than for non-adversarially trained models.
ACKNOWLEDGMENT

This work was funded by the Natural Sciences and Engineering Research Council of Canada (NSERC) under Grant RGPIN 2016-04855 and Grant RGPIN 2016-06628.
REFERENCES

[1] N. Carlini and D. Wagner, "Audio adversarial examples: Targeted attacks on speech-to-text," arXiv preprint arXiv:1801.01944, 2018.
[2] M. Esmaeilpour, P. Cardinal, and A. L. Koerich, "A robust approach for securing audio classification against adversarial attacks," IEEE Transactions on Information Forensics and Security, vol. 15, pp. 2147–2159, 2020.
[3] M. Esmaeilpour, P. Cardinal, and A. L. Koerich, "Detection of adversarial attacks and characterization of adversarial subspace," in IEEE Intl Conf on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 3097–3101.
[4] V. Boddapati, A. Petef, J. Rasmusson, and L. Lundberg, "Classifying environmental sounds using image recognition networks," Procedia Computer Science, vol. 112, pp. 2048–2056, 2017.
[5] I. Goodfellow, J. Shlens, and C. Szegedy, "Explaining and harnessing adversarial examples," arXiv preprint arXiv:1412.6572, 2014.
[6] A. Athalye, N. Carlini, and D. Wagner, "Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples," arXiv preprint arXiv:1802.00420, 2018.
[7] X. Ma, B. Li, Y. Wang, S. M. Erfani, S. Wijewickrema, M. E. Houle, G. Schoenebeck, D. Song, and J. Bailey, "Characterizing adversarial subspaces using local intrinsic dimensionality," arXiv preprint arXiv:1801.02613, 2018.
[8] F. Tramèr, A. Kurakin, N. Papernot, I. Goodfellow, D. Boneh, and P. McDaniel, "Ensemble adversarial training: Attacks and defenses," arXiv preprint arXiv:1705.07204, 2017.
[9] N. Papernot, P. McDaniel, and I. Goodfellow, "Transferability in machine learning: from phenomena to black-box attacks using adversarial samples," arXiv preprint arXiv:1605.07277, 2016.
[10] A. Kurakin, I. Goodfellow, and S. Bengio, "Adversarial examples in the physical world," arXiv preprint arXiv:1607.02533, 2016.
[11] N. Papernot, P. McDaniel, S. Jha, M. Fredrikson, Z. B. Celik, and A. Swami, "The limitations of deep learning in adversarial settings," in IEEE European Symposium on Security and Privacy (EuroS&P). IEEE, 2016, pp. 372–387.
[12] S.-M. Moosavi-Dezfooli, A. Fawzi, and P. Frossard, "DeepFool: a simple and accurate method to fool deep neural networks," in IEEE Conf Comp Vis Patt Recog, 2016, pp. 2574–2582.
[13] N. Carlini and D. Wagner, "Towards evaluating the robustness of neural networks," in IEEE Symp Secur Priv, 2017, pp. 39–57.
[14] D. Wierstra, T. Schaul, J. Peters, and J. Schmidhuber, "Natural evolution strategies," in IEEE Congress on Evolutionary Computation. IEEE, 2008, pp. 3381–3387.
[15] A. Ilyas, L. Engstrom, A. Athalye, and J. Lin, "Black-box adversarial attacks with limited queries and information," arXiv preprint arXiv:1804.08598, 2018.
[16] M. Esmaeilpour, P. Cardinal, and A. L. Koerich, "Unsupervised feature learning for environmental sound classification using weighted cycle-consistent generative adversarial network," Applied Soft Computing, vol. 86, p. 105912, 2020.
[17] I. Patel and Y. S. Rao, "Speech recognition using hidden markov model with mfcc-subband technique," IEEE, 2010, pp. 168–172.
[18] L. Juvela, B. Bollepalli, X. Wang, H. Kameoka, M. Airaksinen, J. Yamagishi, and P. Alku, "Speech waveform synthesis from mfcc sequences with generative adversarial networks," in IEEE Intl Conf on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 5679–5683.
[19] V. Mitra and C.-J. Wang, "Content based audio classification: a neural network approach," Soft Computing, vol. 12, no. 7, pp. 639–646, 2008.
[20] S. Patidar and R. B. Pachori, "Classification of cardiac sound signals using constrained tunable-q wavelet transform," Expert Systems with Applications, vol. 41, no. 16, pp. 7161–7170, 2014.
[21] C. Harte, M. Sandler, and M. Gasser, "Detecting harmonic change in musical audio," in Proceedings of the 1st ACM Workshop on Audio and Music Computing Multimedia, 2006, pp. 21–26.
[22] B. McFee, C. Raffel, D. Liang, D. P. Ellis, M. McVicar, E. Battenberg, and O. Nieto, "librosa: Audio and music signal analysis in python," in Proceedings of the 14th Python in Science Conference, vol. 8, 2015.
[23] S. Hanov, "Wavelet sound explorer software," http://stevehanov.ca/wavelet/, 2008.
[24] Y. Su, K. Zhang, J. Wang, and K. Madani, "Environment sound classification using a two-stream CNN based on decision-level fusion," Sensors, vol. 19, no. 7, p. 1733, 2019.
[25] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," in IEEE Conf Comp Vis Patt Recog, 2015, pp. 1–9.
[26] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
[27] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[28] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.
[29] J. Rauber, W. Brendel, and M. Bethge, "Foolbox: A python toolbox to benchmark the robustness of machine learning models," arXiv preprint arXiv:1707.04131, 2017.
[30] N. Papernot, P. McDaniel, X. Wu, S. Jha, and A. Swami, "Distillation as a defense to adversarial perturbations against deep neural networks," in IEEE Symposium on Security and Privacy, 2016, pp. 582–597.
[31] J. Buckman, A. Roy, C. Raffel, and I. Goodfellow, "Thermometer encoding: One hot way to resist adversarial examples," in International Conference on Learning Representations, 2018.
[32] C. Guo, M. Rana, M. Cisse, and L. Van Der Maaten, "Countering adversarial images using input transformations," arXiv preprint arXiv:1711.00117, 2017.
[33] J. Salamon, C. Jacoby, and J. P. Bello, "A dataset and taxonomy for urban sound research," in 22nd ACM International Conference on Multimedia, Orlando, FL, USA, Nov. 2014.
[34] K. J. Piczak, "ESC: Dataset for environmental sound classification," in 23rd ACM International Conference on Multimedia, 2015, pp. 1015–1018.