Adversarial Training for Audio Classifiers

Raymel Alfonso Sallo, Mohammad Esmaeilpour, Patrick Cardinal
École de Technologie Supérieure (ÉTS), Département de Génie Logiciel et TI
1100 Notre-Dame W, Montréal, H3C 1K3, Québec, Canada
[email protected]
Abstract—In this paper, we investigate the effect of adversarial training on the robustness of six advanced deep neural networks against a variety of targeted and non-targeted adversarial attacks. We first show that the ResNet-56 model trained on the 2D representation of the discrete wavelet transform appended with the tonnetz chromagram outperforms other models in terms of recognition accuracy. We then demonstrate the positive impact of adversarial training on this model, as well as on other deep architectures, against six types of attack algorithms (white-box and black-box), at the cost of reduced recognition accuracy, under a limited adversarial perturbation budget. We run our experiments on two benchmark environmental sound datasets and show that, without any limitation on the budget allocated to the adversary, the fooling rate of the adversarially trained models can exceed 90%. In other words, adversarial examples exist at any scale, but they may require larger adversarial perturbations than against non-adversarially trained models.
Index Terms—Spectrogram, Chromagram, Tonnetz Features, Discrete Wavelet Transformation (DWT), Short-Time Fourier Transformation (STFT), Sound Classification, Deep Neural Network, ResNet, VGG, AlexNet, GoogLeNet, Adversarial Attack, Adversarial Training.
I. INTRODUCTION

The existence of adversarial attacks has been characterized for data-driven audio and speech recognition models in both the waveform and representation domains [1], [2]. Over the past years, many strong white-box and black-box adversarial algorithms have been introduced, which essentially recast costly optimization problems against victim classifiers. Unfortunately, these attacks effectively degrade the classification performance of almost all data-driven models, from conventional classifiers (e.g., support vector machines) to state-of-the-art deep neural networks [3]. This poses a growing concern about the security and reliability of such classifiers.

The typical approach to crafting an adversarial example is to solve an optimization problem that finds the smallest possible perturbation of a legitimate sample, undetectable by a human, that fools the classifier. The measures commonly used to compare the altered sample with the original one are the $\ell_2$ and $\ell_\infty$ similarity metrics. The computational complexity of this optimization depends on the dimensionality of the given input samples. Consequently, it requires considerable computational overhead for high-dimensional data, even for short audio signals [1]. Regardless of their computational cost, however, these threats actively exist for any end-to-end audio and speech classifier. Since the highest recognition accuracies have been reported on 2D representations of audio signals [2], [4], optimized attack algorithms developed for computer vision applications, such as the fast gradient sign method (FGSM) [5], have led to security concerns for audio classifiers [3].

Although some approaches have been introduced for defending victim models against adversarial attacks, no reliable framework has yet achieved the required efficiency. Based on the detailed discussion in [6], common defense algorithms usually obfuscate gradient information, but running stronger attack algorithms against them consistently fools these detectors. Vulnerability to adversarial attacks remains an open problem in data-driven classification, and although the generated fake examples look very similar to noisy samples, they lie in dissimilar subspaces [3], [7]. It has been shown that adversarial examples lie in manifolds marginally over the decision boundary of the victim classifier, where the model lacks generalizability [3]. Therefore, integrating these examples into the training set of the victim classifier could improve its robustness. This approach, known as adversarial training [5], may be a more reasonable defense since it does not shatter gradient vectors [6]. However, there is no guarantee for the safety of adversarially trained classifiers [8].

Although there are some discussions in the computer vision domain about the negative effect of adversarial training on the recognition performance of the victim classifiers [9], to the best of our knowledge, this potential side effect has not yet been studied for 2D representations of audio signals. We address this issue in this paper and report our results on two benchmark environmental sound datasets.
Specifically, our main contributions in this paper are:
• characterizing the impact of adversarial training on six advanced deep neural network architectures for diverse audio representations,
• demonstrating that deep neural networks, especially those with residual blocks, achieve higher recognition performance on tonnetz features concatenated with DWT spectrograms than on STFT representations,
• showing that the adversarially trained AlexNet model outperforms ResNets when the perturbation magnitude is limited,
• experimentally proving that although adversarial training reduces the recognition accuracy of the victim model, it makes the attack more costly for the adversary in terms of the required perturbation.

The rest of this paper is organized as follows. In Section II, we review related work on adversarial attacks developed for 2D domains. Details about signal transformation and 2D representation production are provided in Sections III and IV, respectively. In Section V, we briefly introduce the selected front-end audio classifiers, which are state-of-the-art deep learning architectures. The adversarial attack procedures and the budget allocation for the adversary are discussed in Section VI. Section VII then explains the adversarial training framework, and the obtained results are summarized in Section VIII.
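Since FGSM recurs throughout the paper, both as the baseline attack and, later, as the generator used for adversarial training, a minimal sketch may help fix ideas. It assumes a PyTorch classifier returning logits (the paper does not state its implementation framework); `model`, `x`, `y`, and `eps` are placeholders.

```python
import torch
import torch.nn.functional as F

def fgsm(model, x, y, eps):
    """One-shot FGSM [5]: step of size eps along the sign of the loss gradient."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)   # J(x, l, w)
    loss.backward()
    return (x + eps * x.grad.sign()).detach()
```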
II. RELATED WORKS

There is a large volume of published studies on attacking classifiers with different optimization techniques that aim to disrupt their recognition performance. In this paper, we focus on five strong white-box targeted and non-targeted attack algorithms, which have been reported to be very destructive against advanced deep learning models trained on audio representations [2]. Moreover, we also use a black-box adversarial attack, based on gradient approximation, against the victim classifiers.

The fast gradient sign method is a well-established baseline for targeted adversarial attacks. The computational cost of this one-shot approach at runtime is low, as it takes advantage of the linear characteristics of deep neural networks. Kurakin et al. [10] introduced an iterative version of FGSM, known as the basic iterative method (BIM), for running stronger attacks against victim classifiers; it is formulated as:

$$x'_{n+1} = \mathrm{clip}_{\epsilon}\left\{ x'_{n} + \zeta\, \mathrm{sgn}\left( \nabla_{x} J[x'_{n}, l(x)] \right) \right\} \qquad (1)$$

where the legitimate sample and its associated adversarial example are denoted by $x$ and $x'$, respectively. The initial state in this recursive formulation is $x'_{0} = x$, and the iterates stay in the $\epsilon$-neighbourhood (the distance measured by a similarity metric such as $\ell_2$) of the legitimate manifold; the clipping operation keeps the adversarial perturbation within $[-\epsilon, \epsilon]$. Moreover, $l(x)$ and $\mathrm{sgn}(\cdot)$ stand for the label of $x$ and the usual sign function. In Eq. 1, the step size is $\zeta = 1$, though it is tunable according to the adversary's wishes. Two stopping rules can be used with Eq. 1: (1) optimizing until the first adversarial example is reached (BIM-a), and (2) continuing the optimization for a predefined number of iterations (BIM-b). For measuring $\epsilon$, two similarity metrics are suggested: $\ell_\infty$ and $\ell_2$. In this work, we focus on the latter.
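A sketch of Eq. 1, under the same PyTorch assumptions as the FGSM sketch above; the step size `zeta` and the iteration count are placeholders.

```python
import torch
import torch.nn.functional as F

def bim(model, x, y, eps, zeta=1.0, steps=10):
    """Basic iterative method, Eq. 1: repeated signed-gradient steps,
    clipped so the accumulated perturbation stays within [-eps, eps].
    BIM-a would stop at the first misclassification; BIM-b always runs
    all `steps` iterations."""
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad, = torch.autograd.grad(loss, x_adv)
        x_adv = x_adv.detach() + zeta * grad.sign()
        x_adv = x + torch.clamp(x_adv - x, -eps, eps)   # the clip_eps{...} of Eq. 1
    return x_adv.detach()
```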
Gradient information of a deep neural network encodes the directions of intensity variation associated with the model's decision boundary. Exploiting these vectors to find the least likely probability distribution is the key idea of the Jacobian-based saliency map attack (JSMA) [11]. For the adversarial label $l'$, this iterative attack runs against the model $f$ and strives to achieve $l' = \arg\max_{j} f_{j}(x)$. JSMA increases the probability of the target label $l'$ while minimizing those of the other classes, including the ground truth, using a saliency map as shown in Eq. 2:

$$S(x, l')[i] = \left| J_{i,l'}(x) \right| \sum_{j \neq l'} J_{i,j}(x) \qquad (2)$$

where $J_{i,j}$ denotes the forward derivative of the model for feature $i$, computed as:

$$J_{f}[i,j](x) = \frac{\partial f_{j}(x)}{\partial x_{i}} \qquad (3)$$

If the Jacobian value associated with label $l'$ is negative, or the summed saliency over the other labels is positive (the no-variation shield), then $S(x, l')[i] = 0$. This white-box attack iteratively searches for the feature index at which the perturbation should be applied in order to fool the model toward the target label $l'$, using the $\ell_2$ similarity metric.

The perturbation required for pushing a sample over the decision boundary of the victim classifier should be as small as possible. In a white-box scenario, the optimization process can use local properties of the decision boundary. It has been shown that linearizing the boundary in the subspace of the original samples can yield adversarial perturbations smaller than those of the FGSM attack. This approach, known as the DeepFool attack, is shown in Eq. 4 [12]:

$$\arg\min \|\epsilon\|_{2}, \quad \epsilon = -\frac{f(x)\, w}{\|w\|_{2}^{2}} \qquad (4)$$

where $w$ refers to the weight function of the recognition model. Unlike the other aforementioned adversarial attacks, DeepFool is a non-targeted attack, and it iterates as many times as needed to push random samples marginally over the locally linearized decision boundary, with the condition of maximizing the prediction likelihood toward any label other than the ground truth. Though both $\ell_\infty$ and $\ell_2$ metrics can be used in the DeepFool attack, we use the latter, in accordance with the BIM algorithms.

Presumably, a straightforward way to keep an adversarial perturbation undetectable is to reduce its magnitude and distribute it over all input features. Additionally, not every feature should be perturbed, and their gradient vectors should not be shattered. Following these two assumptions, the Carlini and Wagner attack (CWA) was introduced [13]. The general framework of their algorithm is based on the following minimization problem:

$$\min \|x' - x\|_{2}^{2} + c \cdot L(x') \qquad (5)$$

where the constant $c$ is found through a binary search. Finding the most appropriate value for this hyperparameter is very challenging, since it may easily dominate the distance function and push the sample too far away from the adversarial subspace. Although Eq. 5 employs the $\ell_2$ similarity metric for computing the adversarial perturbation $\epsilon$, CWA generalizes properly to both $\ell_0$ and $\ell_\infty$. In the configuration of this attack, the loss function $L$ is defined over the logits $Z$ of the trained model $f$, as shown in the following equation:

$$L(x') = \max\left[ \max_{i \neq l'} \left\{ Z(x')_{i} \right\} - Z(x')_{l'},\; -\kappa \right] \qquad (6)$$

where $\kappa$ controls the effectiveness and the adjacency of the adversarial examples to the decision boundary of the model. In this regard, higher values of this parameter, in conjunction with a minimal $\epsilon$-neighbourhood, result in adversarial examples with higher confidence.

To achieve an overall unrestricted adversarial perturbation $\|\epsilon\|$ of sufficiently small magnitude, CWA solves Eq. 5 through the following optimization framework:

$$\min_{\rho} \left\| \tfrac{1}{2}\left(\tanh(\rho) + 1\right) - x \right\|_{2}^{2} + c \cdot L\!\left( \tfrac{1}{2}\left(\tanh(\rho) + 1\right) \right) \qquad (7)$$

where $\rho = \operatorname{arctanh}(x + \delta)$ and the unrestricted approximate perturbation $\delta^{*}$ is as follows:

$$\delta^{*} = \tfrac{1}{2}\left(\tanh(\rho) + 1\right) - x \qquad (8)$$

This perturbation is unrestricted and should be tuned per feature value by measuring $\nabla f(x + \delta^{*})$. For feature intensities with negligible gradient values, the actual adversarial perturbation truncates to zero; for the rest, $\delta \leftarrow \delta^{*}$.

Attacking victim classifiers with unrestricted access to the details of the attacked models, including the training dataset, hyperparameters, architecture, and, more importantly, gradient information, as in all the abovementioned attack algorithms, is less challenging than black-box attack scenarios. Usually, in the latter scheme, the adversary runs a gradient estimation by querying the classifier or by training a surrogate model. In this paper, the chosen black-box attack is based on the natural evolution strategy (NES) [14], which has been employed for gradient approximation in [15]. This iterative algorithm is known as the partial information attack (PIA), and it encodes the $\ell_\infty$ similarity metric as part of its targeted optimization problem. Finding a proper adversarial perturbation bound for PIA is somewhat challenging and requires a very large number of queries to the victim model.
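The NES gradient estimate at the core of PIA can be sketched as follows; the query interface `prob_of_target` and the sample counts are assumptions for illustration, not the exact setup of [15].

```python
import torch

def nes_gradient(prob_of_target, x, sigma=1e-3, n_samples=50):
    """NES gradient estimate used by PIA [14], [15]: query the victim on
    antithetic Gaussian perturbations of x and average the weighted scores.
    `prob_of_target(x)` is an assumed query interface returning the victim's
    probability for the adversarial target label."""
    grad = torch.zeros_like(x)
    for _ in range(n_samples):
        u = torch.randn_like(x)
        # Antithetic pair (x + su, x - su) halves the estimator variance.
        grad += (prob_of_target(x + sigma * u) - prob_of_target(x - sigma * u)) * u
    return grad / (2 * sigma * n_samples)
```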
Before discussing how the adversarial attacks and adversarial training have been implemented for the various deep neural network architectures, we first provide a brief overview of the transformation of an audio signal into 2D representations. The next section describes spectrogram generation using the short-time Fourier transformation (STFT), the discrete wavelet transformation (DWT), and tonnetz feature extraction. We then train our classifiers on these representations and investigate how adversarial training impacts their robustness to adversarial attacks.

III. AUDIO TRANSFORMATION

Since audio and speech signals have high dimensionality in the time domain, their 2D representations, with lower dimensionality, have been widely used for training advanced classifiers originally developed for 2D computer vision applications [16]. In this work, we use STFT and DWT, both with and without tonnetz features, for generating 2D representations of audio signals. This section briefly reviews the transformations required by this work.

For a discrete signal $a[n]$ distributed over time $n$, using the Hann window function $H[\cdot]$, we can compute the complex Fourier transformation with the following equation:

$$\mathrm{STFT}\{a[n]\}(m, \omega) = \sum_{n=-\infty}^{\infty} a[n]\, H[n - m]\, e^{-j\omega n} \qquad (9)$$

where $m$ is the time scale and $m \ll n$. Additionally, $\omega$ stands for the continuous frequency coefficient. This transformation is applied to short overlapping sub-signals with a predefined sampling rate and forms the STFT spectrogram as shown in Eq. 10:

$$\mathrm{SP}_{\mathrm{STFT}}\{a[n]\}(m, \omega) = \left| \sum_{n=-\infty}^{\infty} a[n]\, H[n - m]\, e^{-j\omega n} \right|^{2} \qquad (10)$$

There are several variants of the STFT, such as mel-scale and cepstral coefficients, that produce even lower dimensionality and have been widely used for various speech processing tasks [17], [18]. In this work, however, we use the standard STFT representation for training the front-end dense classifiers.

Generating a DWT spectrogram is very similar to the Fourier transformation, as both employ continuous and differentiable basis functions. For the wavelet transformation, several basis functions have been studied, and their effectiveness for audio signals has been investigated in [19], [20]. The general form of this transformation for a continuous function $a(t)$ is shown in Eq. 11:

$$\mathrm{DWT}\{a(t)\} = \frac{1}{\sqrt{|s|}} \int_{-\infty}^{\infty} a(t)\, \psi\!\left( \frac{t - \tau}{s} \right) dt \qquad (11)$$

where $\tau$ and $s$ refer to the time variation in the transformation and the wavelet scale, respectively. Moreover, $\psi$ stands for the mother basis function. Common choices for this function are the Haar, Mexican Hat, and complex Morlet wavelets. The latter has been extensively used in signal processing, mainly because of its nonlinear characteristics [16] (see Eq. 12):

$$\psi(t) = \frac{1}{\sqrt{\pi}}\, e^{-j\omega t}\, e^{-t^{2}/2} \qquad (12)$$

The complex Morlet is continuous in its conjugate manifold. The convolution of this function with overlapping chunks of the given audio signal results in its spectral visualization, as described in Eq. 13:

$$\mathrm{SP}_{\mathrm{DWT}}\{a(t)\} = \left| \mathrm{DWT}\{a[k, n]\} \right|^{2} \qquad (13)$$

where $k$ and $n$ are integer numbers associated with the scales of $\psi$.
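For illustration, Eq. 10 maps onto a few lines of Librosa [22], the library later used for spectrogram production in Section IV; the file name is a stand-in, and the window/hop sizes anticipate the values reported there.

```python
import numpy as np
import librosa

# Eq. 10 in a few lines of Librosa [22]; "example.wav" is a stand-in.
y, sr = librosa.load("example.wav", sr=None)
S = librosa.stft(y, n_fft=2048, hop_length=512, window="hann")
sp_stft = np.abs(S) ** 2                 # |STFT{a[n]}|^2
log_sp = librosa.power_to_db(sp_stft)    # logarithmic visualization
```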
The two aforementioned transformations represent the spatiotemporal modulation features of a signal in the frequency domain, regardless of its harmonic characteristics. It has been demonstrated that using the harmonic change detection function (HCDF) provides distinctive features for an audio signal [21]. This function derives a chromagram from the constant-Q transformation (CQT); the resulting features are also known as tonnetz features. According to [21], there are four major steps in an HCDF system. First, the audio signal is converted into logarithmic spectrum vectors using the CQT. Then, pitch-class vectors are extracted from the tonal transformation based on the quantized chromagram. In the third step, 6-dimensional centroid vectors form a tensor from the tonal transformation. Finally, a smoothing operation postprocesses this tensor for distance calculation.

We use the HCDF system for generating chromagrams from audio signals in order to enhance the recognition performance of the classifiers.
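Librosa also exposes a tonnetz extractor that mirrors the chromagram-to-tonal-centroid part of this pipeline; a minimal sketch, with a stand-in file name:

```python
import librosa

# Tonal centroid (tonnetz) features from the chroma stage of the pipeline
# described above.
y, sr = librosa.load("example.wav", sr=22050)
tonnetz = librosa.feature.tonnetz(y=y, sr=sr)   # shape: (6, n_frames)
```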
In the next section, we provide details of this process for two benchmark environmental sound datasets.

IV. SPECTROGRAM PRODUCTION

We produce the STFT representation with the open-source Python library Librosa [22]. We set the window size and the hop length ($n$ and $m$ in Eq. 9) to 2048 and 512, respectively. Additionally, we set the number of filters to 2048, which is the standard value for environmental sound tasks [16]. The audio chunks associated with each window are padded in order to reduce the potential negative effect of losing temporal dependencies. Furthermore, the frames are overlapped with a ratio of 50%.

For generating DWT spectrograms, we use our modified version of the Wavelet Sound Explorer [23] with the complex Morlet mother function. As proposed in [4], we set the DWT sampling frequency to 16 kHz for ESC-50 and 8 kHz for UrbanSound8K, with a uniform 50% overlapping ratio. For enhancement purposes, we apply a logarithmic visualization to the generated spectrograms to better characterize high-frequency regions.

For the tonnetz chromagram, we use the default settings provided by Librosa with a sampling rate of 22.05 kHz. We resize the resulting chromagrams so that they comply with the aforementioned representations. Inspired by [24], we append these features to the STFT and DWT spectrograms and organize them into two additional representations.
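A sketch of the appending step, assuming the chromagram is simply resized and stacked below the spectrogram; the resizing routine and target height are assumptions, since the paper only states that the resized chromagram must comply with the spectrogram dimensions.

```python
import numpy as np
from skimage.transform import resize   # any image-resampling routine works

def append_tonnetz(spectrogram, tonnetz, chroma_height):
    """Resize the tonnetz chromagram to the spectrogram's frame axis and
    stack it below, yielding the STFT|Tonnetz / DWT|Tonnetz inputs."""
    chroma = resize(tonnetz, (chroma_height, spectrogram.shape[1]),
                    anti_aliasing=True)
    return np.vstack([spectrogram, chroma])
```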
In the next section, we provide more details about the training of the front-end classifiers on these four spectrogram sets.

V. CLASSIFICATION MODELS

Since an adversary runs the adversarial attack against the classifier, the choice of the victim network architecture affects the fooling rate of the model. This issue has been studied in [2] for the advanced GoogLeNet [25] and AlexNet [26] architectures trained on DWT (with linear, logarithmic, and logarithmic-real visualizations), STFT, and their pooled spectrograms. Since our main objective is investigating the impact of adversarial training on advanced deep learning classifiers, we additionally include ResNet-X architectures with X ∈ {18, 34, 56} [27] and the VGG-16 [28] architecture.

Pretrained models of these six classifiers have been used, and their input and output layers have been fine-tuned as described in [2]. The computational hardware used for all experiments consists of two NVIDIA GTX Ti-series GPUs in addition to a 64-bit Intel Core-i series CPU. We carry out our experiments using a five-fold cross-validation setup for all the spectrogram sets. As a common practice in model performance analysis, we preserve 70% of the samples for training and development, followed by an early-stopping scenario. We report the recognition accuracy of these models on the remaining 30% of the samples.
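A hypothetical PyTorch sketch of this fine-tuning protocol (the paper does not name its framework, and torchvision ships ResNet-18/34 rather than a ResNet-56, so ResNet-18 stands in): load a pretrained backbone and retrain only the replaced input and output layers.

```python
import torch
import torchvision

# Load an ImageNet-pretrained backbone; replace the input- and
# output-facing layers, as in [2].
model = torchvision.models.resnet18(pretrained=True)
model.conv1 = torch.nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3,
                              bias=False)                  # 1-channel spectrograms
model.fc = torch.nn.Linear(model.fc.in_features, 10)       # e.g., UrbanSound8K

# Freeze everything except the replaced layers.
for name, p in model.named_parameters():
    p.requires_grad = name.startswith(("conv1", "fc"))
optimizer = torch.optim.Adam(
    [p for p in model.parameters() if p.requires_grad], lr=1e-4)
```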
In the next section, we provide the detailed setup for the adversarial algorithms introduced in Section II. We additionally discuss the budget allocations required by the adversary for successfully attacking the six finely trained victim models.

VI. ADVERSARIAL ATTACK SETUP
For effectively attacking the classifiers, the adversary should tune the hyperparameters required by the attack algorithms, such as the number of iterations, the perturbation limit, and the number of line searches within the manifold, which we collectively express as the budget allocations. For finding the optimal required budgets, we bind the fooling rates of the attack algorithms to a predefined threshold of $\mathrm{AUC} > 0.9$, associated with the area under the curve of the attack success. In other words, we allocate as much budget as needed to reach $\mathrm{AUC} > 0.9$ for all attacks against the victim models. This is a critical threshold for demonstrating the extreme vulnerability of neural networks to adversarial attacks.

In accordance with the above, we use Foolbox [29], a freely available Python package that supports uniform, reproducible implementations of the attack algorithms. For the BIM-a and BIM-b algorithms, we impose a minimum perturbation bound on $\epsilon$ together with a minimum confidence level. In the JSMA framework, we set the number of iterations to a maximum of 1000 and bound the scaling factor (with an equivalent displacement of 50). The number of iterations in the DeepFool attack is initialized to 100, with a supremum of 600 and a static step of 100. For the costly CWA attack, we use four search values of $c$, each with an associated iteration budget. Except for DeepFool, which is a non-targeted attack, we randomly select wrong target labels for the rest of the algorithms.

Four hyperparameters are required for the black-box PIA algorithm. We empirically limit the perturbation bound $\epsilon$, followed by an iterative line search to find an approximately optimal variance for the NES gradient estimation. We initialize the number of iterations to 500, with a fixed decay rate and a bounded learning rate $\eta$.

In the framework in which we attack the front-end audio classifiers, we run the algorithms on shuffled batches of 500 samples, in up to 50 batches of 100 samples randomly selected from the clean spectrograms at every step, toward spanning the entire datasets. These attacks are performed with the abovementioned allocated budgets, once before and once after adversarial training, in order to measure the robustness of the models. Section VII provides details on how adversarial training has been implemented.

Fig. 1: Crafted adversarial examples for the ResNet-56 using the six optimization-based attack algorithms. The first column of the figure shows the original DWT, DWT|Tonnetz, STFT, and STFT|Tonnetz representations of a randomly selected sample from the class 'children playing' in the UrbanSound8K dataset. The other columns are associated with the attack algorithms, namely BIM-a, BIM-b, JSMA, DeepFool, CWA, and PIA, respectively. Adversarial perturbation values are given at the bottom of each adversarial spectrogram.
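The per-attack settings above map onto Foolbox [29] roughly as follows (foolbox v3 API; the perturbation budget, `images`, and `labels` are placeholders, not values from the paper):

```python
import foolbox as fb

# `model` is a trained PyTorch classifier on spectrograms scaled to [0, 1];
# `images` and `labels` are a batch of clean samples.
fmodel = fb.PyTorchModel(model.eval(), bounds=(0, 1))
attack = fb.attacks.L2BasicIterativeAttack()           # BIM under the l2 metric
raw, clipped, success = attack(fmodel, images, labels, epsilons=1.0)
fooling_rate = success.float().mean().item()
```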
VII. ADVERSARIAL TRAINING
The idea of adversarial training was first proposed in [5], where the authors showed that augmenting the training dataset with one-shot FGSM adversarial examples improves the robustness of the victim models. The main advantage of this simple approach is that it neither shatters nor obfuscates gradient information, while running a fast, non-iterative procedure. This has made adversarial training a relatively reliable defense approach. However, it may not confidently defend against stronger white-box adversarial algorithms [8].

Many adversarial defense approaches introduced over the past years have been reported to outperform FGSM-based adversarial training [30], [31], [32]. However, some studies have reported that these advanced defenses shatter gradient vectors and may easily break against strong adversarial attacks that do not incorporate exact gradient information, such as the backward pass differentiable approximation [6].

Augmenting the clean training dataset with adversarial examples in the adversarial training framework is expressed in Eq. 14 [5]:

$$J'(x, l, w) = \alpha J(x, l, w) + (1 - \alpha) J(x', l, w) \qquad (14)$$

where $\alpha$ is a subjective weighting scalar definable by the adversary. Additionally, $J$ and $w$ denote the loss function and the derived weight vector of the victim model, respectively. Moreover, $x$ and $x'$ refer to the legitimate and adversarial examples associated with the genuine label $l$. Adversarial training using a costly attack algorithm is very time-consuming and memory-prohibitive in practice. Therefore, we use FGSM for augmenting the original spectrogram datasets with adversarial examples, under the assumption $J'(x, l, w) = J(x', l, w)$.
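Eq. 14 reduces to a few lines when combined with the FGSM sketch from Section I; `alpha` and the reuse of that sketch are assumptions about implementation details the paper leaves open.

```python
import torch.nn.functional as F

def adversarial_training_loss(model, x, y, eps, alpha=0.5):
    """Mixed objective of Eq. 14. With alpha = 0 it collapses to the
    J'(x, l, w) = J(x', l, w) assumption adopted in this section;
    `fgsm` is the sketch given earlier."""
    x_adv = fgsm(model, x, y, eps)
    return (alpha * F.cross_entropy(model(x), y)
            + (1 - alpha) * F.cross_entropy(model(x_adv), y))
```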
In the next section, we report our results for the dense neural network models regarding adversarial attacks and adversarial training on four different representations, namely STFT, DWT, STFT appended with tonnetz features, and DWT appended with tonnetz chromagrams.

VIII. EXPERIMENTAL RESULTS

We conduct our experiments on two environmental sound datasets: UrbanSound8K [33] and ESC-50 [34]. The first dataset contains 8732 short recordings, each at most four seconds long, arranged in 10 classes (car horn, dog bark, drilling, jackhammer, street music, siren, children playing, air conditioner, engine idling, and gun shot). The ESC-50 dataset contains 2000 audio signals of an equal length of five seconds, organized in 50 classes.

For enhancing both the quality and quantity of these datasets, especially ESC-50, we augment samples using the pitch-shifting operation in the temporal domain, as proposed in [16]. Following their 1D filtration setup, we use four pitch-shifting scales, which increases the size of the datasets by a factor of 4.
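A sketch of this augmentation using Librosa's pitch shifter; the four shift values shown are placeholders for illustration, not the exact scales of [16].

```python
import librosa

def pitch_shift_set(y, sr, steps=(-2, -1, 1, 2)):
    """Temporal-domain pitch-shifting augmentation in the spirit of [16];
    returns one shifted copy of the signal per step value."""
    return [librosa.effects.pitch_shift(y, sr=sr, n_steps=s) for s in steps]
```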
Following the explanations provided in Section IV about spectrogram production, each resulting spectrogram has the same fixed dimensions for both the STFT and DWT (logarithmic-scale) representations on the two datasets. Moreover, the resulting chromagrams are resized so that they can be appended to the aforementioned representations.

TABLE I: Recognition performance (%) of the audio classifiers (GoogLeNet, AlexNet, ResNet-18, ResNet-34, ResNet-56, and VGG-16) trained on the original STFT, DWT, STFT|Tonnetz, and DWT|Tonnetz spectrogram datasets of ESC-50 and UrbanSound8K, without adversarial example augmentation. Values inside parentheses indicate the recognition percentage drop after adversarially training the models with the fooling rate AUC > 0.9, with the maximum perturbation capped at a fixed $\ell_2$ budget. Outperforming accuracies are shown in bold face.

TABLE II: Robustness comparison (average AUC %) of the adversarially trained models, attacked under the same $\ell_2$ perturbation constraint, for the same six architectures and four representations. Victim models with lower fooling rates are indicated in bold.

Table I summarizes the recognition accuracies of the classifiers trained on these spectrograms. Additionally, it shows the effect of adversarial training on the recognition performance of these models.

The classifiers in Table I have been selected for evaluation on the test sets after running the five-fold cross-validation scenario on the randomized development portion of the training datasets. Regarding this table, the different deep neural network architectures show competitive performances. However, in the majority of cases, the ResNet-56 outperforms the other classifiers, averaged over 10 repeated experiments on the spectrograms. The highest recognition accuracy is achieved by the ResNet-56 architecture trained on the appended representation of DWT and tonnetz chromagrams, for both UrbanSound8K and ESC-50. The number of parameters in the ResNet-56 is 11.3% and 14.26% higher than in its rival models VGG-16 and ResNet-34, respectively.
TABLE III: Comparison of $\epsilon_{r}$ for attacking the original and the adversarially trained models under the constraint AUC > 0.9, for the same six architectures and four representations. The higher value of $\epsilon_{r}$ for each representation is shown in bold.

Fig. 1 visually compares the adversarial examples crafted against the outperforming classifier, the ResNet-56, using the six adversarial attacks on a randomly selected audio sample represented with the four spectrogram approaches described earlier. Although the generated spectrograms are visually very similar to their legitimate counterparts, they all cause the classifier to predict wrong labels.

Table I also shows the drop in recognition accuracy after adversarially training the models following the procedure explained in Section VII. The maximum adversarial perturbation required for complying with the fooling rate of AUC > 0.9, averaged over all the attacks, stays within a fixed $\ell_2$ budget. In attacking the adversarially trained models, the procedure outlined in Section VI has been implemented individually for every audio classifier. According to the obtained results, adversarial training considerably reduces the performance of all models. For ESC-50, the neural networks trained on the appended representation of STFT and tonnetz features (STFT|Tonnetz) experienced the most negative impact compared to the other representations. The average drop ratio for adversarially trained models on the DWT|Tonnetz representations is slightly higher than for their STFT|Tonnetz counterparts on the UrbanSound8K dataset.
However, for both datasets, these ratios for models trained on the DWT spectrograms are considerably higher than for those trained on the STFT representations.

We measure the fooling rate of the adversarially trained models after attacking them with the same six adversarial algorithms, following the procedure explained in Section VI, with the same bound imposed on the adversarial perturbation. This experiment uncovers the impact of adversarial training on the robustness of the audio classifiers (see Table II). We applied the aforementioned condition to make this table comparable with Table I. Regarding the results reported in Table II, adversarial training has improved the robustness of all the classifiers, particularly AlexNet.

For investigating the overall impact of adversarial training on the robustness of the audio classifiers, we attack the adversarially trained models using the same six attack algorithms without any constraint on $\|\epsilon\|$. Unfortunately, we could achieve a fooling rate of AUC > 0.9 for all the classifiers following the attack procedure explained in Section VI. However, attacking the adversarially trained models requires larger adversarial perturbations $\|\epsilon\|$ than attacking the original models and, consequently, increases the number of callbacks to the original spectrogram with extra batch gradient computations. This might degrade the quality of the generated spectrograms. In order to analytically compare the maximum adversarial perturbation required for the original and the adversarially trained models, we compute the average perturbation ratio as shown in Eq. 15:

$$\epsilon_{r} = \left| \frac{\epsilon_{a}}{\epsilon_{o}} \right| \qquad (15)$$

where $\epsilon_{a}$ and $\epsilon_{o}$ denote the average adversarial perturbations required for successfully attacking the adversarially trained and the original models (both with AUC > 0.9), respectively. Table III summarizes the values of $\epsilon_{r}$ for the victim models trained on the different representations.

Note that $\epsilon_{r} \geq 1$ indicates a positive impact of adversarial training on the robustness of the audio classifiers, via increasing the computational cost of the attack by expanding the magnitude of the required adversarial perturbation. With respect to the measured $\epsilon_{r}$ metric over all the front-end classifiers, the ResNet-56 architecture showed better robustness against adversarial attacks, on average, in 50% of the experiments. In other words, attacking this model adds additional cost for the adversary in crafting adversarial examples with AUC > 0.9.
IX. CONCLUSION

In this paper, we studied the impact of adversarial training as a gradient-obfuscation-free defense against adversarial attacks. We trained six advanced deep learning classifiers on four different 2D representations of environmental audio signals and ran five white-box and one black-box attack algorithms against these victim models. We demonstrated that adversarial training considerably reduces the recognition accuracy of the classifiers but improves their robustness against six types of targeted and non-targeted adversarial examples when the maximum required adversarial perturbation is constrained. In other words, adversarial training is not a remedy for the threat of adversarial attacks, but it escalates the cost of the attack for the adversary by demanding larger adversarial perturbations than for non-adversarially trained models.
ACKNOWLEDGMENT

This work was funded by the Natural Sciences and Engineering Research Council of Canada (NSERC) under Grant RGPIN 2016-04855 and Grant RGPIN 2016-06628.
REFERENCES

[1] N. Carlini and D. Wagner, "Audio adversarial examples: Targeted attacks on speech-to-text," arXiv preprint arXiv:1801.01944, 2018.
[2] M. Esmaeilpour, P. Cardinal, and A. L. Koerich, "A robust approach for securing audio classification against adversarial attacks," IEEE Transactions on Information Forensics and Security, vol. 15, pp. 2147–2159, 2020.
[3] M. Esmaeilpour, P. Cardinal, and A. L. Koerich, "Detection of adversarial attacks and characterization of adversarial subspace," in IEEE Intl Conf on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 3097–3101.
[4] V. Boddapati, A. Petef, J. Rasmusson, and L. Lundberg, "Classifying environmental sounds using image recognition networks," Procedia Computer Science, vol. 112, pp. 2048–2056, 2017.
[5] I. Goodfellow, J. Shlens, and C. Szegedy, "Explaining and harnessing adversarial examples," arXiv preprint arXiv:1412.6572, 2014.
[6] A. Athalye, N. Carlini, and D. Wagner, "Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples," arXiv preprint arXiv:1802.00420, 2018.
[7] X. Ma, B. Li, Y. Wang, S. M. Erfani, S. Wijewickrema, M. E. Houle, G. Schoenebeck, D. Song, and J. Bailey, "Characterizing adversarial subspaces using local intrinsic dimensionality," arXiv preprint arXiv:1801.02613, 2018.
[8] F. Tramèr, A. Kurakin, N. Papernot, I. Goodfellow, D. Boneh, and P. McDaniel, "Ensemble adversarial training: Attacks and defenses," arXiv preprint arXiv:1705.07204, 2017.
[9] N. Papernot, P. McDaniel, and I. Goodfellow, "Transferability in machine learning: from phenomena to black-box attacks using adversarial samples," arXiv preprint arXiv:1605.07277, 2016.
[10] A. Kurakin, I. Goodfellow, and S. Bengio, "Adversarial examples in the physical world," arXiv preprint arXiv:1607.02533, 2016.
[11] N. Papernot, P. McDaniel, S. Jha, M. Fredrikson, Z. B. Celik, and A. Swami, "The limitations of deep learning in adversarial settings," in IEEE European Symposium on Security and Privacy (EuroS&P). IEEE, 2016, pp. 372–387.
[12] S.-M. Moosavi-Dezfooli, A. Fawzi, and P. Frossard, "DeepFool: a simple and accurate method to fool deep neural networks," in IEEE Conf Comp Vis Patt Recog, 2016, pp. 2574–2582.
[13] N. Carlini and D. Wagner, "Towards evaluating the robustness of neural networks," in IEEE Symp Secur Priv, 2017, pp. 39–57.
[14] D. Wierstra, T. Schaul, J. Peters, and J. Schmidhuber, "Natural evolution strategies," in IEEE Congress on Evolutionary Computation. IEEE, 2008, pp. 3381–3387.
[15] A. Ilyas, L. Engstrom, A. Athalye, and J. Lin, "Black-box adversarial attacks with limited queries and information," arXiv preprint arXiv:1804.08598, 2018.
[16] M. Esmaeilpour, P. Cardinal, and A. L. Koerich, "Unsupervised feature learning for environmental sound classification using weighted cycle-consistent generative adversarial network," Applied Soft Computing, vol. 86, p. 105912, 2020.
[17] I. Patel and Y. S. Rao, "Speech recognition using hidden markov model with mfcc-subband technique," IEEE, 2010, pp. 168–172.
[18] L. Juvela, B. Bollepalli, X. Wang, H. Kameoka, M. Airaksinen, J. Yamagishi, and P. Alku, "Speech waveform synthesis from mfcc sequences with generative adversarial networks," in IEEE Intl Conf on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 5679–5683.
[19] V. Mitra and C.-J. Wang, "Content based audio classification: a neural network approach," Soft Computing, vol. 12, no. 7, pp. 639–646, 2008.
[20] S. Patidar and R. B. Pachori, "Classification of cardiac sound signals using constrained tunable-q wavelet transform," Expert Systems with Applications, vol. 41, no. 16, pp. 7161–7170, 2014.
[21] C. Harte, M. Sandler, and M. Gasser, "Detecting harmonic change in musical audio," in Proceedings of the 1st ACM Workshop on Audio and Music Computing Multimedia, 2006, pp. 21–26.
[22] B. McFee, C. Raffel, D. Liang, D. P. Ellis, M. McVicar, E. Battenberg, and O. Nieto, "librosa: Audio and music signal analysis in python," in Proceedings of the 14th Python in Science Conference, vol. 8, 2015.
[23] S. Hanov, "Wavelet sound explorer software," http://stevehanov.ca/wavelet/, 2008.
[24] Y. Su, K. Zhang, J. Wang, and K. Madani, "Environment sound classification using a two-stream CNN based on decision-level fusion," Sensors, vol. 19, no. 7, p. 1733, 2019.
[25] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," in IEEE Conf Comp Vis Patt Recog, 2015, pp. 1–9.
[26] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
[27] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[28] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.
[29] J. Rauber, W. Brendel, and M. Bethge, "Foolbox: A python toolbox to benchmark the robustness of machine learning models," arXiv preprint arXiv:1707.04131, 2017.
[30] N. Papernot, P. McDaniel, X. Wu, S. Jha, and A. Swami, "Distillation as a defense to adversarial perturbations against deep neural networks," in IEEE Symposium on Security and Privacy, 2016, pp. 582–597.
[31] J. Buckman, A. Roy, C. Raffel, and I. Goodfellow, "Thermometer encoding: One hot way to resist adversarial examples," in International Conference on Learning Representations, 2018.
[32] C. Guo, M. Rana, M. Cisse, and L. Van Der Maaten, "Countering adversarial images using input transformations," arXiv preprint arXiv:1711.00117, 2017.
[33] J. Salamon, C. Jacoby, and J. P. Bello, "A dataset and taxonomy for urban sound research," in 22nd ACM International Conference on Multimedia, Orlando, FL, USA, Nov. 2014.
[34] K. J. Piczak, "ESC: Dataset for environmental sound classification," in 23rd ACM International Conference on Multimedia, 2015, pp. 1015–1018.