From Sound Representation to Model Robustness

Mohammad Esmaeilpour, Student Member, IEEE, Patrick Cardinal, Member, IEEE, and Alessandro Lameiras Koerich, Member, IEEE

SUBMITTED TO IEEE TRANSACTIONS ON INFORMATION FORENSICS AND SECURITY, VOL. X, NO. X, JULY 2020
Abstract—In this paper, we demonstrate the extreme vulnerability of a residual deep neural network architecture (ResNet-18) against adversarial attacks in time-frequency representations of audio signals. We evaluate MFCC, short-time Fourier transform (STFT), and discrete wavelet transform (DWT) to modulate environmental sound signals in 2D representation spaces. ResNet-18 not only outperforms other dense deep learning classifiers (i.e., GoogLeNet and AlexNet) in terms of recognition accuracy, but also considerably transfers adversarial examples to other victim classifiers. On the balance of average budgets allocated by adversaries and the cost of the attack, we notice an inverse relationship between high recognition accuracy and model robustness against six strong adversarial attacks. We investigated this relationship for the three 2D representation domains commonly used to represent audio signals, on three benchmarking environmental sound datasets. The experimental results show that while the ResNet-18 classifier trained on DWT spectrograms achieves the highest recognition accuracy, attacking this model is relatively more costly for the adversary compared to the MFCC and STFT representations.
Index Terms—Spectrogram, discrete wavelet transform, short-time Fourier transform, Mel-frequency cepstral coefficient, environmental sound classification, adversarial attack, deep neural network, ResNet.
I. INTRODUCTION

Developing reliable sound recognition algorithms has always been a significant challenge for the signal processing community, motivated by real-life applications such as music annotation [1], emotion recognition for medical purposes [2], and environmental event analysis [3]. For analyzing a surrounding scene, either for surveillance [4] or multimedia sensor networks [5], there is a constant need for understanding environmental events. Driven by these concerns, several unsupervised [6] and supervised [7] algorithms have been devised for the classification of environmental sounds.

With the proliferation of deep learning (DL) algorithms for image-related tasks during the last decade, a large volume of publications on 2D audio representations has been released. The DL architectures primarily developed for computer vision applications have been well adapted to sound recognition tasks, with recognition accuracy competitive with human understanding. However, such algorithms require large amounts of training data. Since comprehensive audio datasets are usually not available, many data augmentation approaches have been introduced so as to allow an appropriate training of DL models and improve their performance on sound-related tasks. Dynamic range compression, time stretching, pitch-shifting, and background noise addition are a few among them [7]. These approaches apply directly to audio waveforms, affecting low-level sampled datapoints of the audio signal, which may not necessarily improve the performance of the front-end classification models [8]. In response to this concern, high-level data augmentation approaches have been developed, which are particularly useful for 2D audio representations [9], [10]. Experimental results on a variety of environmental sound datasets attest to the considerable positive impact of high-level data augmentation on the overall performance of DL classifiers (e.g., AlexNet [11], GoogLeNet [12], etc.) [8]. While the recognition performance of these models is competitive with the human level of understanding, a recent study has demonstrated the vulnerability of these convolutional neural networks (ConvNets), trained on 2D representations of audio signals, against adversarial attacks [13]. Their experiments were primarily conducted on the combination of discrete wavelet transform (DWT), short-time Fourier transform (STFT), and cross recurrence plot (CRP) spectrograms. Moreover, they have shown that crafted adversarial examples are transferable not only between two dense ConvNets, but also to support vector machines (SVM). This poses a potential harm for sound recognition systems, especially since the highest recognition accuracy has been reported on 2D representations of audio signals [14].

M. Esmaeilpour, P. Cardinal and A. L. Koerich are with the Department of Software and Information Technology Engineering, École de Technologie Supérieure (ÉTS), University of Quebec, Montreal, QC, Canada, e-mail: ([email protected], [email protected], [email protected]).
Developing reliable environmental sound classifiers obliges us to study adversarial attacks in greater detail and to measure their impact on different sound representations. Toward proposing robust classifiers, there have been debates and case studies on the link between the intrusion of adversarial examples and the loss functions of some victim classifiers [15]. It has been shown that integrating more convex loss functions into the victim model (or into the surrogate counterpart) might increase the chance of crafting stronger adversarial examples [15]. It might also depend on other key factors such as the properties of the classifier, the input sample distribution, the adversarial setups, etc. In order to study other potential links between robustness and input representation, we evaluate the robustness and the transferability of some state-of-the-art ConvNets trained on different 2D sound representations against adversarial attacks. Our primary front-end ConvNet is the ResNet-18 architecture because of its superior recognition performance compared to other ConvNet architectures. We discuss the ResNet-18 architecture in Section IV-B and briefly report our findings on other dense architectures, such as GoogLeNet and AlexNet, in Section V.

Our main contribution in this paper is demonstrating the lower vulnerability of residual networks trained on DWT spectrograms, compared to STFT and MFCC representations, on the balance of the average budget (number of iterations or callbacks). Furthermore, we show that there is a covariant relationship between the high recognition accuracy and the fooling rate of these victim ConvNets, averaged over the budgets allocated for searching optimal perturbations. This paper is organized as follows. In Section II, we briefly review some strong adversarial attacks for 2D audio representations.
Explanations of the different 2D audio representations used in the experiments are summarized in Section III. Experimental results, as well as associated discussions, are presented in Section IV. Conclusions and perspectives of future work are presented in the last section.

II. ADVERSARIAL ATTACKS
An adversarial attack can be formulated as an optimization problem toward achieving a very small perturbation δ, as stated in Eq. 1 [16]:

$$\min_{\delta} \|\delta\| \quad \mathrm{s.t.} \quad f^{*}(x+\delta) \neq f(x) \tag{1}$$

where x and f* denote a legitimate random sample and the post-activation function of the victim classifier, respectively. The optimal value for δ should be as small as possible so as not to be perceivable by the human visual system, although perceiving the applied perturbation on 2D audio representations such as spectrograms is very difficult in any case. To satisfy such a constraint, many attack algorithms have been introduced, in both white-box and black-box scenarios. In this paper, we briefly go over six strong targeted and non-targeted adversarial attacks, which are well adapted to sound recognition models trained on 2D audio representations. We use the average fooling rate of these attacks, which is a standard metric for assessing the robustness of victim ConvNets trained on different 2D audio representations.

A. Limited-Memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS)
Szegedy et al. [16] argue that the viability of fooling deep neural networks with fake examples is due to the extremely low probability of such examples, because they are rarely seen in a given dataset. This can be understood as a pitfall of deep networks: low generalizability to unseen but very similar samples. They propose an optimization algorithm to mislead finely trained DL models, based on Eq. 2:

$$\min_{x'} \; c\,\|\epsilon\| + J_{w}(x', l') \tag{2}$$

where c is a positive scaling factor achievable by the line search strategy, x' denotes the associated crafted adversarial example, l' refers to its target label, and J_w denotes the loss function for updating the weights w. There is a variety of choices for this function, such as the cross-entropy loss or any other surrogate function. The solution of this optimization problem is quite costly, and it has been proposed to use the L-BFGS optimizer, subject to 0 ≤ x' ≤ M, where M refers to the maximum possible intensity in a spectrogram. This attack is the baseline for the adversarial algorithms presented next.

B. Fast Gradient Sign Method (FGSM)
Goodfellow et al. [17] explain the existence of adversarial examples by the linear nature of deep neural networks, even those with super-dense hidden layers. Toward this claim, they proposed a fast optimization algorithm based on Eq. 3:

$$x' \leftarrow x + \epsilon \cdot \mathrm{sign}\left(\nabla_{x} J(x, l)\right) \tag{3}$$

where ε is a small constant controlling the perturbation applied to the legitimate sample x. Different choices of ℓp norms can be integrated into the FGSM attack, and the adversary should make a trade-off between high similarity and a perturbation large enough to fool the model. The formulation of Eq. 3 for the ℓ2 norm is shown in Eq. 4:

$$x' \leftarrow x + \epsilon\, \frac{\nabla_{x} J(x, l)}{\left\|\nabla_{x} J(x, l)\right\|_{2}} \tag{4}$$

where, to satisfy the constraint x' ∈ [0, M], the resulting adversarial spectrogram should be clipped or truncated. This white-box adversarial attack is targeted toward a wrong label pre-defined by the adversary in a one-shot scenario.

C. Basic Iterative Method (BIM)
This non-targeted adversarial attack [18] is in fact the iterative version of the FGSM optimization algorithm, which crafts and positions potential adversarial examples ideally outside of legitimate subspaces by optimizing Eq. 5 for ε:

$$x'_{n+1} \leftarrow \mathrm{clip}_{x,\epsilon}\left\{ x'_{n} + \epsilon \cdot \mathrm{sign}\left(\nabla_{x} J(x_{n}, l)\right) \right\} \tag{5}$$

where clip is a function keeping the generated examples within the range [x − ε, x + ε], as defined in Eq. 6:

$$\min\left\{ M,\; x + \epsilon,\; \max\{0,\; x - \epsilon,\; x'\} \right\} \tag{6}$$

where M = 255 for an 8-bit RGB visualization of spectrograms. There are two implementations of this optimization algorithm: either keep optimizing up to the first adversarial example (BIM-a) or continue optimizing for a predefined number of iterations (BIM-b). The latter usually generates stronger adversarial examples, though it is more costly since it usually requires more callbacks. Both BIM attacks are iterative, white-box algorithms minimizing Eq. 5 for the optimal perturbation ε measured by the ℓ∞ norm.

D. Jacobian-based Saliency Map Attack (JSMA)
Similar to the FGSM attack, this algorithm also uses gradient information to perturb the input, taking advantage of a greedy approach [19]. This attack is targeted toward a pre-defined wrong label (l'). In fact, it optimizes argmin ‖ε_x‖ subject to f*(x + ε_x) = l' (optimizing with the ℓ0 norm). There are three steps in crafting JSMA adversarial examples. First, computing the derivative of the victim model as in Eq. 7:

$$\nabla f(x) = \frac{\partial f_{j}(x)}{\partial x_{i}} \tag{7}$$

where x_i denotes pixel intensities. Second, a saliency map is computed to detect the least effective pixel values for perturbation according to the desired outputs of the model. Specifically, the saliency map for pixels where ∂f_{l'}(x)/∂x_i < 0 or Σ_{j≠l'} ∂f_j(x)/∂x_i > 0 should be set to zero, since these correspond to detectable variations; otherwise:

$$S_{\mathrm{map}}(x, l')[i] = \frac{\partial f_{l'}(x)}{\partial x_{i}} \left| \sum_{j \neq l'} \frac{\partial f_{j}(x)}{\partial x_{i}} \right| \tag{8}$$

where S_map denotes the saliency map for a given spectrogram x and target label l'. The last step of the JSMA is applying the perturbation to the original input according to the obtained map.

E. Carlini and Wagner Attack (CWA)
This is an iterative, white-box adversarial algorithm [15] which can use three types of distance metrics: the ℓ0, ℓ∞, and ℓ2 norms. In this paper, we focus on the latter distance measure, which makes the algorithm very strong, even against distillation networks. The optimization problem in this attack is given by Eq. 9:

$$\min_{\epsilon} \; \|x' - x\|_{2} + c\, f(x') \tag{9}$$

where c is a constant value as explained in Eq. 2. Assuming that the target class is l' and G(x')_i denotes the logits of the trained model f before the softmax activation corresponding to the i-th class, then:

$$f(x') = \max\left\{ \max_{i \neq l'} \{G(x')_{i}\} - G(x')_{l'},\; -\kappa \right\} \tag{10}$$

where κ is a tunable confidence parameter for increasing the misclassification confidence toward label l'. The actual adversarial example is given by Eq. 11:

$$x' = \frac{1}{2}\left[ \tanh\left(\operatorname{arctanh}(x) + \epsilon\right) + 1 \right] \tag{11}$$

where the tanh activation function is used in replacement of box-constrained optimization. For non-targeted attacks, Eq. 10 should be updated as:

$$f(x') = \max\left\{ G(x')_{l} - \max_{i \neq l} \{G(x')_{i}\},\; -\kappa \right\} \tag{12}$$

F. DeepFool Adversarial Attack
Moosavi-Dezfooli et al. [20] proposed a white-box algorithm for finding the most optimal perturbation redirecting the position of a legitimate sample toward a pre-defined target label using a linear approximation. The optimization problem for achieving the optimal ε is given by Eq. 13:

$$\arg\min \|\epsilon\|_{2} \quad \mathrm{s.t.} \quad \mathrm{sign}\left(f(x')\right) \neq \mathrm{sign}\left(f(x)\right) \tag{13}$$

where ε = −f(x) w / ‖w‖₂². DeepFool can also be modified into a non-targeted attack by optimizing over the hyperplanes of the victim model. In this paper, we implement the targeted DeepFool attack, averaged over the available labels, measuring over the ℓ2 and ℓ∞ norms. In practice, this scenario is not only faster but also more destructive than the BIMs.

In the next section, we provide a brief overview of common 2D representations of audio signals using time-frequency transformations. Then, we carry out our adversarial experiments on the transformed audio signals (spectrograms).

III. 2D AUDIO REPRESENTATIONS
Representing audio signals using time-frequency plots is a standard operation in audio and speech processing, which aims at mapping such signals into a compact and informative form. Fourier and wavelet transforms are the most commonly used approaches for mapping an audio signal into frequency-magnitude representations. In this section, we briefly review some of the most common approaches.
A. Short-Time Fourier Transform (STFT)
For a given continuous signal a(t) distributed over time, its STFT using a Hann window function w(τ) can be computed using Eq. 14:

$$\mathrm{STFT}\left\{a(t)\right\}(\tau, \omega) = \int_{-\infty}^{\infty} a(t)\, w(t-\tau)\, e^{-j\omega t}\, dt \tag{14}$$

where τ and ω are the time and frequency axes, respectively. This transform generalizes to the discrete time domain for a discrete signal a[n] as:

$$\mathrm{STFT}\left\{a[n]\right\}(m, \omega) = \sum_{n=-\infty}^{\infty} a[n]\, w[n-m]\, e^{-j\omega n} \tag{15}$$

where m ≪ n and ω is a continuous frequency coefficient. In other words, to generate the STFT of a discrete signal, we divide it into overlapping shorter-length sub-signals and compute the Fourier transform of each, which results in an array of complex coefficients. Computing the squared magnitude of this array yields a spectrogram representation, as shown in Eq. 16:

$$\mathrm{Sp}_{\mathrm{STFT}}\left\{a[n]\right\}(m, \omega) = \left| \sum_{n=-\infty}^{\infty} a[n]\, w[n-m]\, e^{-j\omega n} \right|^{2} \tag{16}$$

This 2D representation shows the frequency distribution over discrete time and, compared to the original signal a[n], it has a lower dimensionality, although it is a lossy operation.
This transform is a variation of the STFT with additional postprocessing operations, including a non-linear transformation. For every column of the obtained spectrogram, we compute its dot product with a number of Mel filter banks (power estimates of amplitudes distributed over frequency). To increase the resolution of the resulting vector, logarithmic filtering is applied and, finally, it is mapped to another 1D representation using the discrete cosine transform. This representation has been widely used for sound enhancement and classification. Furthermore, it has been well studied as a standard approach for conventional generative models incorporating Markov chains and Gaussian mixture models [21], [22].
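The per-frame pipeline described above (Mel filter-bank dot product, logarithmic compression, then an orthonormal DCT) can be sketched as follows; this is an illustration, not the paper's production code, and the Mel filter-bank matrix is assumed precomputed (its triangular-filter construction is omitted):

```python
import numpy as np

def dct2(v):
    """Orthonormal DCT-II of a 1-D vector (the last step of MFCC)."""
    n = len(v)
    k = np.arange(n)
    basis = np.cos(np.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * n))
    scale = np.full(n, np.sqrt(2.0 / n))
    scale[0] = np.sqrt(1.0 / n)
    return scale * (basis @ v)

def mfcc_column(power_col, mel_fb, n_mfcc=13):
    """MFCCs for one spectrogram frame: filter bank -> log -> DCT."""
    mel_energies = mel_fb @ power_col          # dot product with Mel filters
    log_mel = np.log(mel_energies + 1e-10)     # logarithmic compression
    return dct2(log_mel)[:n_mfcc]              # keep the first n_mfcc coeffs
```

In practice, a library such as Librosa performs these steps internally; the sketch only makes the order of operations explicit.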
C. Discrete Wavelet Transform (DWT)
The wavelet transform maps the continuous signal a(t) into time and scale (frequency) coefficients, similarly to the STFT, using Eq. 17:

$$\mathrm{DWT}\left\{a(t)\right\} = \frac{1}{\sqrt{|s|}} \int_{-\infty}^{\infty} a(t)\, \psi\!\left(\frac{t-\tau}{s}\right) dt \tag{17}$$

where s and τ denote discrete scale and time variations, respectively, and ψ is the core transformation function, also known as the mother function. There is a variety of mother functions for different applications, such as the complex Morlet, which is given by Eq. 18:

$$\psi(t) = \frac{1}{\sqrt{\pi}}\, e^{-j\omega t}\, e^{-t^{2}/2} \tag{18}$$

The discrete-time formulation of this transform is shown in Eq. 19:

$$\mathrm{DWT}\left\{a[k, n]\right\} = \int_{-\infty}^{\infty} a(t)\, h\!\left(na^{k}T - t\right) dt \tag{19}$$

where n and k are integer values for the continuous mother function h. The spectral representation of this transformed signal is a 2D array computed by Eq. 20:

$$\mathrm{Sp}_{\mathrm{DWT}}\left\{a(t)\right\} = \left| \mathrm{DWT}\left\{a[k, n]\right\} \right| \tag{20}$$

In the next section, we explain our experiments on three benchmarking sound datasets. We first generate separate spectrogram sets with the three aforementioned representations using different configurations. Second, we train a ResNet on these datasets and run adversarial attack algorithms against them. Finally, we measure both the fooling rate and the cost of the attacks. We demonstrate that these metrics differ meaningfully across spectrogram configurations.

IV. EXPERIMENTS
We use three environmental sound datasets in all our experiments: UrbanSound8k [23], ESC-50 [24], and ESC-10 [24]. The first dataset includes 8732 four-second audio samples distributed in 10 classes: engine idling, car horn, children playing, drilling, air conditioner, jackhammer, dog bark, siren, gun shot, and street music. ESC-50 is a comprehensive dataset with 50 different classes and overall 2000 five-second audio recordings of natural acoustic sounds. ESC-10 is a subset of this dataset released with 10 classes and 400 recordings.

To increase both the quality and the quantity of samples of these datasets, we apply the pitch-shifting augmentation approach with the four scale factors proposed in [8], which positively affects classification accuracy. This data augmentation operation generates four extra audio samples for every original audio sample and eventually increases the size of the original dataset by a factor of four. We discuss the usefulness of this 1D data augmentation approach in Section V.

In the following subsection, we explain the details of generating 2D representations for audio signals. To this aim, we use the open-source Librosa signal processing Python library [25] and our upgraded version of the wavelet toolbox [26].
For every dataset, including the augmented signals, we separately generate independent sets of 2D representations, namely MFCC, STFT, and DWT. Our aim is to investigate which audio representation yields a better trade-off between recognition accuracy and robustness of a victim model against a variety of strong adversarial attacks.
1) MFCC Production Settings:
There are four major settings in generating MFCC spectrograms using Librosa. The default value for the sampling rate is 22050 Hz. Since there is no optimal approach for determining the sampling rate that generates the most informative spectrogram, we run extensive experiments using sampling rates from 8 to 24 kHz. The second tunable hyperparameter is the number of MFCCs (N_MFCC), for which we examine different values: 13, 20, and 40 per frame, with a hop length of 1024. Normalization of the discrete cosine transform (type 2 or 3) using an orthonormal DCT basis for MFCC production is the third setting. By default, this hyperparameter is set to true in almost all libraries, including Librosa, although we also measure the performance of the front-end classifier trained on MFCC spectrograms without normalization. The last argument is the amount of cepstral filtering (CF) [27] applied to the MFCC features. The sinusoidal CF reduces the involvement of higher-order coefficients and improves recognition performance [28] (see Eq. 21):

$$M[n, :] \leftarrow M[n, :] \left[ 1 + \frac{CF}{2} \sin\!\left(\frac{\pi (n+1)}{CF}\right) \right] \tag{21}$$

where M stands for the MFCC array with rows indexed by n. We investigate the effect of CF on the overall performance of the classification models.
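The sinusoidal liftering of Eq. 21 can be sketched directly; this mirrors the weighting that Librosa exposes through its `lifter` argument (the array sizes below are illustrative):

```python
import numpy as np

def cepstral_lifter(mfcc, cf):
    """Apply the sinusoidal cepstral filter of Eq. 21 row-wise.

    mfcc: array of shape (n_mfcc, n_frames); cf: the lifter parameter CF.
    """
    if cf <= 0:
        return mfcc                              # CF = 0 disables liftering
    n = np.arange(mfcc.shape[0])
    weights = 1.0 + (cf / 2.0) * np.sin(np.pi * (n + 1) / cf)
    return mfcc * weights[:, None]               # de-emphasize high orders
```

Larger CF values push the sinusoid's period up, so more of the higher-order coefficients keep non-trivial weight; CF = 0 leaves the array untouched.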
2) STFT Production Settings:
For producing STFT representations, we use the default configurations for general hyperparameters as outlined in the Librosa manual. For the length of the windowed signal, we use 2048, 1024, and 512 with associated sampling rates. We also use a variable window size of 2048 (default value), 1024, and 512 (very small window) associated with the default hop size of 512. We investigate the potential effects of these configurations on the resiliency of the victim models against adversarial attacks.
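The production step itself follows Eq. 16: window each frame, take an FFT per frame, then the squared magnitude. A minimal NumPy sketch with the default window length (2048) and hop (512) mentioned above (the Hann window and the test signal are illustrative choices):

```python
import numpy as np

def stft_spectrogram(a, n_fft=2048, hop=512):
    """Squared-magnitude STFT (Eq. 16) of a 1-D signal a[n]."""
    window = np.hanning(n_fft)                     # w[n - m]
    frames = [a[s:s + n_fft] * window
              for s in range(0, len(a) - n_fft + 1, hop)]
    # one FFT per overlapping frame; rows: time, cols: frequency
    spectrum = np.fft.rfft(np.stack(frames), axis=1)
    return np.abs(spectrum) ** 2                   # |STFT{a[n]}|^2
```

Shrinking `n_fft` to 1024 or 512 trades frequency resolution for time resolution, which is exactly the axis explored by the window-size settings above.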
3) DWT Production Settings:
For generating DWT representations, we modified the sound explorer software [26] to support the Haar and Mexican Hat wavelet mother functions in addition to the complex Morlet. The sampling frequency for DWT spectrograms has been set to 8 kHz and 16 kHz with a constant frame length of 50 ms. Moreover, by convention, the overlapping threshold is set to 50%. In our experiments, we measure the impact of these DWT configurations, visualized in logarithmic scale (for higher resolution), on both recognition accuracy and robustness against adversarial attacks.

In the next subsection, we discuss possible choices of classification models to be separately trained on the aforementioned spectrogram representations and setups. We select our final front-end classifier from a diverse domain spanning traditional handcrafted feature learning algorithms to state-of-the-art DL architectures.
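A direct discretization of Eq. 17 with a real Morlet-like mother function (cf. Eq. 18) can be sketched as follows; the mother function, scales, and normalization are illustrative assumptions, not the toolbox's exact implementation:

```python
import numpy as np

def morlet(t, omega=5.0):
    """Real Morlet-like mother wavelet (cf. Eq. 18)."""
    return np.cos(omega * t) * np.exp(-t ** 2 / 2.0) / np.sqrt(np.pi)

def dwt_spectrogram(a, scales, dt=1.0):
    """|wavelet transform| of a 1-D signal (Eqs. 17 and 20)."""
    n = len(a)
    t = np.arange(n) * dt
    rows = []
    for s in scales:
        # psi((t - tau) / s) sampled for every shift tau, as one convolution
        kernel = morlet((t - t[n // 2]) / s)
        rows.append(np.convolve(a, kernel, mode="same") * dt / np.sqrt(abs(s)))
    return np.abs(np.array(rows))                  # scale x time array
```

Swapping `morlet` for a Haar or Mexican Hat mother function changes only the kernel; the scale-by-scale convolution structure stays the same.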
B. Classification Model
For the choice of classification algorithms, we initially included both conventional classifiers, such as linear and Gaussian SVM [13] and random forest [8], and some deep learning architectures. Specifically, we selected pre-trained GoogLeNet, AlexNet, and ResNet [29] models tuned for our three benchmarking datasets. We preserved the architectures of these ConvNets except for the first layer and the last layer, which maps logits into class labels (softmax layer). The dimension of spectrograms at the input layer has also been kept as defined in the standard implementations of these networks. Since spectrograms may have different dimensions according to their length and transformation schemes, we bilinearly interpolate them to fit 128 × 128 for all the ConvNets.

A performance comparison of the aforementioned SVMs, GoogLeNet, and AlexNet against a few adversarial attacks has already been studied, mainly for DWT representations of environmental sound datasets, in [13]. Those experiments were conducted on standard spectrograms without validating the potential impacts of different settings in the process of producing different representations. In this paper, we carry out extensive experiments using: (i) three common 2D representations for audio signals, namely MFCC, STFT, and DWT; (ii) more and stronger targeted and non-targeted algorithms for adversarial attacks; (iii) a fair comparison of the fooling rates of victim models, taking their cost of attacks averaged over the allocated budgets into account.

We primarily select a ConvNet as our front-end classifier for the sake of simplicity and interpretability of results. We present concise results for other classification models in Section V. To this aim, we selected the ResNet architecture because such ConvNets are currently among the best-performing classifiers for several tasks [30]. Our implementations corroborate that, on average, these ConvNet architectures outperform all the above-mentioned algorithms (both SVMs and other DL approaches) trained on spectrograms. Among the possible architectures for ResNet (ResNet-18, ResNet-34, and ResNet-56), we selected ResNet-18 due to its highest recognition performance and relatively low number of parameters compared to the others. Recall that we investigate potential effects of spectrogram configurations on a classifier which not only has a very competitive recognition accuracy compared to others, but also requires fewer training parameters. Thus, herein we specifically focus on the ResNet-18 network, and all our investigations consider this victim architecture.

For every configuration used to produce the 2D representations, we generate an individual set of spectrograms and train an independent ResNet-18 classifier on each set.
We use a 5-fold cross-validation setup on 70% of the overall dataset volume (training plus development). To avoid overfitting, we implemented early stopping during training, and we report the mean recognition accuracy on the test sets (the remaining 30%).
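The bilinear interpolation used to bring variable-size spectrograms to the 128 × 128 input resolution can be sketched in plain NumPy (a production pipeline would normally call a library resize; this only illustrates the operation itself):

```python
import numpy as np

def bilinear_resize(sp, out_h=128, out_w=128):
    """Bilinearly interpolate a 2-D spectrogram to out_h x out_w."""
    in_h, in_w = sp.shape
    ys = np.linspace(0, in_h - 1, out_h)           # target row coordinates
    xs = np.linspace(0, in_w - 1, out_w)           # target column coordinates
    y0 = np.clip(ys.astype(int), 0, in_h - 2)
    x0 = np.clip(xs.astype(int), 0, in_w - 2)
    wy = (ys - y0)[:, None]                        # fractional row offsets
    wx = (xs - x0)[None, :]                        # fractional column offsets
    a = sp[np.ix_(y0, x0)]                         # four neighbouring corners
    b = sp[np.ix_(y0, x0 + 1)]
    c = sp[np.ix_(y0 + 1, x0)]
    d = sp[np.ix_(y0 + 1, x0 + 1)]
    top = a * (1 - wx) + b * wx
    bot = c * (1 - wx) + d * wx
    return top * (1 - wy) + bot * wy
```

Because the interpolation is linear in the pixel intensities, small adversarial perturbations in the original spectrogram survive the resize essentially unchanged.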
C. Adversarial Attacks
In this section, we provide details of attacking the models trained on 2D audio representations. We examine their robustness against six strong adversarial attacks by reporting the average fooling rates obtained, using the area under the ROC curve (AUC) as a performance metric.
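As a reference for how a raw fooling rate can be tallied before any averaging (a simplified sketch; the paper reports AUC over budgets rather than this bare fraction):

```python
import numpy as np

def fooling_rate(clean_pred, adv_pred, labels):
    """Fraction of correctly-classified samples whose prediction
    flips after the adversarial perturbation."""
    clean_pred, adv_pred, labels = map(np.asarray, (clean_pred, adv_pred, labels))
    correct = clean_pred == labels          # only samples the model got right
    if not correct.any():
        return 0.0
    return float((adv_pred[correct] != labels[correct]).mean())
```

Restricting the count to samples the model classified correctly in the first place keeps the metric from rewarding a model that was already wrong.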
1) Settings for Attack Algorithms:
In the FGSM and BIM attacks, the range for ε extends from small values up to any possible supremum under different confidence intervals. For the implementation of the DeepFool attack, we use the open-source Foolbox package [31] with iterations from 100 to 1000 (10 different scales with a step of 100). In the implementation of the JSMA attack, the number of iterations has been set to (m_i γ)/n_i, where m_i and n_i denote the total number of pixels and a scaling factor, respectively, and γ is the maximum allowed distortion within the maximum number of iterations. The budget allocated to the CWA is defined by a small set of search steps in c, with a fixed number of iterations in each search step, using early stopping. For targeted attacks (i.e., FGSM, JSMA, and CWA), we randomly select targeted wrong labels for running the adversarial optimization algorithms.

We executed these attack algorithms on two NVIDIA GTX-1080-Ti GPUs, except for the DeepFool attack, which was executed on a 64-bit Intel Core-i7-7700 (3.6 GHz) CPU with 64 GB of memory. For attacks on the smallest dataset (ESC-10), we used batches of 200 samples. For the larger datasets (ESC-50 and UrbanSound8k), we used 25 batches of 100 samples.
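To make the role of ε concrete, the FGSM step of Eq. 3 and the BIM loop of Eqs. 5 and 6 can be sketched against a toy gradient oracle; `grad_fn` is a stand-in for ∇x J of an actual victim model, and the clipping range [0, 1] plays the role of [0, M]:

```python
import numpy as np

def fgsm_step(x, grad, eps, m_max=1.0):
    """One-shot FGSM (Eq. 3), clipped into the valid intensity range."""
    return np.clip(x + eps * np.sign(grad), 0.0, m_max)

def bim(grad_fn, x, eps, alpha, n_iter, m_max=1.0):
    """Iterative FGSM (Eq. 5) with the combined clipping of Eq. 6."""
    x_adv = x.copy()
    for _ in range(n_iter):
        x_adv = x_adv + alpha * np.sign(grad_fn(x_adv))
        # Eq. 6: min{M, x + eps, max{0, x - eps, x'}}
        x_adv = np.minimum(np.minimum(m_max, x + eps),
                           np.maximum(np.maximum(0.0, x - eps), x_adv))
    return x_adv
```

The inner clipping is what keeps BIM's accumulated steps inside the ε-ball around the legitimate sample, so raising ε directly widens the adversary's search region.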
2) Adversarial Attacks for MFCC Representations:
We first investigate the potential effect of different sampling rates in MFCC production on the performance of the trained models. To this end, sampling rates have been selected from fairly low (8 kHz) to moderately high (24 kHz) ranges, including the default value (22.05 kHz) defined in Librosa. Therefore, we trained four ResNet-18 models per dataset, associated with the four sampling rates. The results summarized in Table I show that the recognition performance of the classifiers is, to some extent, dependent on the sampling rate. For the ESC-10 and UrbanSound8k datasets, the sampling rate of 8 kHz improves recognition accuracy, while 16 kHz works better for ESC-50. These results also imply that a low sampling rate filters out high-frequency components, which can negatively affect the learning of discriminative features from the spectrograms.

We attack these models using the aforementioned six adversarial algorithms and measure their fooling rates averaged over different budgets, as explained in Section IV-C. From the results shown in Table I, we notice an inverse relationship between recognition accuracy and robustness of these models, on average. For instance, ResNet-18 trained on MFCC spectrograms of the ESC-10 dataset sampled at 8 kHz reaches the highest recognition accuracy, but this model is less robust against five out of six adversarial attacks, averaged over the allocated budgets. We present two hypotheses on this issue. First, adversarial attacks are essentially optimization-based problems, and their final results depend on the hyperparameters defined by the adversary; confidence intervals, the number of callbacks to the original spectrogram, the number of iterations in the optimization formulation, and the line search for the optimal coefficient are a few among them.
TABLE I: Performance comparison of models trained on MFCC representations with different sampling rates, averaged over experiments and budgets. Each attack cell reports the AUC score and the number of gradient computations; "—" marks a value that could not be recovered.

| Dataset       | Rate (kHz) | Acc (%) | FGSM      | DeepFool    | BIM-a       | BIM-b       | JSMA        | CWA          |
|---------------|------------|---------|-----------|-------------|-------------|-------------|-------------|--------------|
| ESC-10        | 8          | —       | —, 1      | 0.9473, 074 | —, 065      | —, 110      | —, 096      | —, 1346      |
| ESC-10        | 16         | 72.15   | 0.9456, 1 | —, 046      | 0.9334, 059 | 0.9375, 197 | 0.9144, 151 | 0.9616, 1435 |
| ESC-10        | 22.05      | 72.06   | 0.9467, 1 | 0.9518, 129 | 0.9309, 088 | 0.9379, 186 | 0.9145, 213 | 0.9405, 1471 |
| ESC-10        | 24         | 70.13   | 0.9471, 1 | 0.9341, 078 | 0.9298, 115 | 0.9327, 171 | 0.9233, 091 | 0.9302, 1149 |
| ESC-50        | 8          | 69.89   | 0.9517, 1 | 0.9023, 061 | 0.9612, 084 | 0.9703, 193 | 0.9288, 118 | 0.9598, 2418 |
| ESC-50        | 16         | —       | —, 1      | —, 248      | —, 209      | —, 160      | —, 251      | 0.9672, 2639 |
| ESC-50        | 22.05      | 69.97   | 0.9534, 1 | 0.9386, 331 | 0.9430, 423 | 0.9581, 288 | 0.9233, 219 | 0.9434, 2318 |
| ESC-50        | 24         | 67.25   | 0.9433, 1 | 0.9214, 208 | 0.9307, 159 | 0.9415, 216 | 0.9187, 417 | —, 2744      |
| UrbanSound8k  | 8          | —       | —, 1      | —, 326      | 0.9411, 317 | —, 223      | —, 398      | —, 2791      |
| UrbanSound8k  | 16         | 70.81   | 0.9508, 1 | 0.9215, 631 | 0.9346, 519 | 0.9389, 817 | 0.9447, 442 | 0.9449, 3805 |
| UrbanSound8k  | 22.05      | 69.57   | 0.9457, 1 | 0.9151, 269 | —, 184      | 0.9256, 513 | 0.9370, 416 | 0.9456, 3015 |
| UrbanSound8k  | 24         | 69.33   | 0.9440, 1 | 0.9221, 318 | 0.9236, 299 | 0.9120, 862 | 0.9242, 343 | 0.9371, 2816 |
The fooling rate of a victim model depends on tuning these hyperparameters. Our second hypothesis comes from a statistical perspective on training a neural network. A model with higher recognition accuracy has probably learned a better decision boundary by maximizing intra-class similarity and inter-class dissimilarity. Attacking such a model provides a wider search area for the adversary to find pinholes in the model, especially when the decision boundaries of the classes lie in the vicinity of each other. Table I also compares the average number of gradient computations per batch required by each attack algorithm. According to these statistics, CWA is the costliest adversarial attack on spectrograms across all sampling rates.

The default number of MFCCs (N_MFCC) is 20, as defined in Librosa. However, we encompass values from a minimum of 13 to a maximum of 40 when generating MFCC spectrograms, although increasing N_MFCC beyond 40 does not tangibly improve the performance of the classifiers. More specifically, the recognition performance of models trained on spectrograms with N_MFCC = 13 is, on average, 14% lower than that of models trained on spectrograms with N_MFCC ≥ 20. Our experimental results on attacking victim models trained on spectrograms with low N_MFCC unveil their extreme vulnerability. However, in terms of the cost of the attack, these models need fewer callbacks for gradient computations to yield AUC > 90% (see Fig. 1). This could be due to the nature of adversarial attacks as optimization formulations, regardless of the performance of the victim models.

Using an orthonormal discrete cosine transform basis function is a standard approach in crafting MFCC spectrograms. In our experiments, we produced two separate subsets of spectrograms, with and without normalization, to measure its potential effect on both the recognition accuracy and the fooling rate (see Fig. 2). Disabling this normalization scheme causes drops in the recognition accuracy and in the cost of attacking the models of 7% and 8.5%, respectively, on average.

For the choice of cepstral filtering, we covered values in the range [0, d × N_MFCC], where the maximum d is 2.5 with a hop size of 0.5, in the production of spectrograms. Values above the supremum of this interval generate higher-order coefficients in linear-like weighting distributions, which considerably reduce the recognition accuracy, on average, to about 48%. The optimal values of d are 0, 0.5, and 0.3 for ESC-10, ESC-50, and UrbanSound8k, respectively (see Fig. 3).

Fig. 1: Effect of N_MFCC on the recognition accuracy and on the average cost of the attack (number of batch gradient computations) over six adversarial algorithms for ResNet-18 models.

Fig. 2: Normalization effect on the recognition accuracy and on the average cost for reaching AUC > 90%.

Fig. 3: Cepstral filtering effect on the recognition accuracy and on the average cost for reaching AUC > 90%.
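The cepstral-filtering sweep described above can be sketched as follows. The sinusoidal lifter below matches the common definition (used, e.g., by the `lifter` parameter of `librosa.feature.mfcc`), while the cepstra themselves are random stand-ins for real MFCC matrices:

```python
import numpy as np

def lifter(cepstra, L):
    """Apply sinusoidal cepstral liftering with coefficient L.
    L = 0 disables liftering, as in the common MFCC pipeline."""
    if L <= 0:
        return cepstra
    n = np.arange(cepstra.shape[0])
    # Per-coefficient weights: 1 + (L/2) * sin(pi * (n+1) / L)
    w = 1.0 + (L / 2.0) * np.sin(np.pi * (n + 1) / L)
    return cepstra * w[:, None]

n_mfcc = 20
rng = np.random.default_rng(1)
cepstra = rng.normal(size=(n_mfcc, 50))   # stand-in for an MFCC spectrogram

# Sweep d over {0, 0.5, ..., 2.5}: lifter coefficient L = d * n_mfcc
for d in np.arange(0.0, 3.0, 0.5):
    liftered = lifter(cepstra, d * n_mfcc)
    print(d, float(np.abs(liftered).mean()))
```

Larger lifter coefficients increasingly emphasize higher-order cepstral coefficients, which is one way to read the accuracy drop reported for values above the interval's supremum.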
3) Adversarial Attacks for STFT Representations:
There is a significant similarity between producing MFCC and STFT spectrograms, mainly in terms of transformation and frequency modulation. Therefore, we omit the experimental results on measuring the impact of sampling rates on the robustness of the victim classifiers. The fooling rates of ResNet-18 models on STFT representations are similar to those on MFCC representations, and they support the aforementioned inverse relationship between recognition accuracy and robustness against attacks.

Table II summarizes the adversarial experiments conducted on STFT representations with the same setup described in Section IV-C. This table illustrates the impact of the number of FFT points (N_FFT) both on the recognition accuracy and on the robustness of victim models against adversarial attacks, averaged over all the different adversarial setups. For the ESC-10 and ESC-50 datasets, N_FFT = 1024 outperforms the other values of N_FFT. Since small window lengths improve the temporal resolution of the final STFT representation, we evaluate the performance of the models for window lengths ranging from a fraction of N_FFT up to N_FFT, with a hop size proportional to N_FFT. As shown in Fig. 4, the evaluation on the ESC-50 and UrbanSound8k datasets uncovers that models trained on STFT representations with a window length smaller than N_FFT outperform the others. On the ESC-10 dataset, a window length of N_FFT resulted in better recognition accuracy.

Comparing the recognition accuracies in Tables I and II shows that STFT provides more discriminative features for the classifiers.

Fig. 4: Effect of scales for N_FFT on the recognition accuracy and on the average cost of the attack for reaching AUC > 90%.
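A minimal numpy sketch of this window-length experiment: frames shorter than N_FFT are zero-padded up to the FFT size, which is what allows the window length to be varied as a fraction of N_FFT. The test tone, window type, fraction, and overlap below are illustrative assumptions, not the paper's exact settings:

```python
import numpy as np

def stft_mag(x, n_fft=512, win_frac=0.5):
    """Magnitude STFT where the analysis window is win_frac * n_fft
    samples long; shorter windows are zero-padded up to n_fft."""
    win_len = int(win_frac * n_fft)
    hop = win_len // 2                     # 50% overlap between frames
    window = np.hanning(win_len)
    frames = []
    for start in range(0, len(x) - win_len + 1, hop):
        frame = x[start:start + win_len] * window
        # rfft with n=n_fft zero-pads the frame to the FFT size
        frames.append(np.abs(np.fft.rfft(frame, n=n_fft)))
    return np.stack(frames, axis=1)        # shape: (n_fft//2 + 1, n_frames)

sr = 8000
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 440 * t)            # 1 s, 440 Hz test tone
S = stft_mag(x, n_fft=512, win_frac=0.5)
print(S.shape)
```

Shortening the window improves temporal resolution at the expense of spectral resolution; zero-padding keeps the frequency axis of the spectrogram fixed across window lengths, so the trained classifiers always see inputs of the same height.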
4) Adversarial Attacks for DWT Representations:
There is no algorithmic approach for obtaining the optimal mother function to generate DWT spectrograms. Therefore, we employed several mother functions, from the simple Haar to the complex Morlet, to investigate their potential impact on the recognition accuracy and on the adversarial robustness of the victim models, following an empirical approach based on multiple experiments. Table IV shows that although the complex Morlet mother function outperforms the other mother functions in terms of recognition accuracy, it is more vulnerable to adversarial examples, averaged over the six attack algorithms with different budgets.

Table III compares the recognition accuracy of models trained on DWT representations with the complex Morlet mother function, evaluated on DWT spectrograms with sampling rates of 8 kHz and 16 kHz. For ESC-50, a sampling rate of 8 kHz yields better recognition accuracy. There are three findings in these tables. First, averaged over all the budgets allocated for the attacks, models trained on DWT representations demonstrate slightly higher robustness against adversarial attacks than those trained on MFCC and STFT spectrograms. Second, the highest recognition accuracy was achieved by classifiers trained on DWT representations. Third, the trade-off between the recognition accuracy and the adversarial robustness of the victim models is noticeable across sampling rates. Moreover, the cost of the attack
TABLE II: Performance comparison of models trained on STFT representations with different N FFT averaged over experimentsand budgets. Relatively better performances are in bold face.
| Dataset | N_FFT | Recognition Accuracy (%) | FGSM | DeepFool | BIM-a | BIM-b | JSMA | CWA |
|---|---|---|---|---|---|---|---|---|
| ESC-10 | 512 | 82.41 | 0.9768, 1 | 0.9430, 089 | 0.9576, 109 | 0.9717, 134 | 0.9662, 141 | 0.9846, 1415 |
| ESC-10 | 1024 | — | —, 1 | —, 129 | —, 091 | —, 183 | 0.9531, 209 | —, 2008 |
| ESC-10 | 2048 | 80.56 | 0.9651, 1 | 0.9544, 092 | 0.9407, 163 | 0.9529, 279 | 0.9588, 341 | 0.8731, 1730 |
| ESC-50 | 512 | 82.44 | 0.9786, 1 | 0.9542, 082 | 0.9583, 109 | 0.9665, 244 | 0.9614, 128 | 0.9618, 1995 |
| ESC-50 | 1024 | — | —, 1 | 0.9512, 331 | —, 267 | —, 179 | 0.9702, 361 | —, 2353 |
| ESC-50 | 2048 | 83.12 | 0.9567, 1 | —, 145 | 0.9765, 211 | 0.9606, 567 | —, 399 | 0.9729, 2412 |
| UrbanSound8k | 512 | 90.58 | 0.9761, 1 | 0.9414, 583 | —, 442 | 0.9682, 421 | 0.9402, 345 | 0.9539, 2569 |
| UrbanSound8k | 1024 | 91.74 | 0.9827, 1 | 0.9752, 322 | 0.9340, 471 | —, 719 | 0.9515, 502 | 0.9654, 3271 |
| UrbanSound8k | 2048 | — | —, 1 | —, 643 | 0.9407, 602 | 0.9630, 408 | —, 655 | —, 3342 |
TABLE III: Performance comparison of models trained on DWT representations with different sampling rates averaged overdifferent budgets. Relatively better performances are shown in bold face.
| Dataset | Sampling Rate (kHz) | Recognition Accuracy (%) | FGSM | DeepFool | BIM-a | BIM-b | JSMA | CWA |
|---|---|---|---|---|---|---|---|---|
| ESC-10 | 8 | — | —, 1 | —, 429 | 0.9307, 612 | —, 744 | —, 781 | —, 4205 |
| ESC-10 | 16 | 82.04 | 0.9068, 1 | 0.9192, 672 | —, 490 | 0.9347, 513 | 0.9018, 801 | 0.9216, 4439 |
| ESC-50 | 8 | 80.34 | —, 1 | —, 367 | 0.9161, 452 | 0.9314, 809 | 0.9168, 298 | 0.9233, 3981 |
| ESC-50 | 16 | — | — | —, 628 | —, 701 | —, 561 | — | —, 4575 |
| UrbanSound8k | 8 | — | —, 1 | —, 761 | —, 841 | —, 738 | —, 691 | 0.9320, 4684 |
| UrbanSound8k | 16 | 91.83 | 0.9321, 1 | 0.9274, 533 | 0.9125, 719 | 0.9408, 941 | 0.9139, 774 | —, 4879 |
TABLE IV: Comparison of mother functions on the perfor-mance of the models. Outperforming values are shown in boldface.
| Dataset | Mother Function | Average Recognition Accuracy (%) | Average AUC Score |
|---|---|---|---|
| ESC-10 | Haar | 82.14 | 95.14 |
| ESC-10 | Mexican Hat | 84.51 | 94.19 |
| ESC-10 | Complex Morlet | — | — |
| ESC-50 | Haar | 83.08 | 92.16 |
| ESC-50 | Mexican Hat | 84.33 | 93.40 |
| ESC-50 | Complex Morlet | — | — |
| UrbanSound8k | Haar | 91.22 | 96.16 |
| UrbanSound8k | Mexican Hat | 93.48 | 95.63 |
| UrbanSound8k | Complex Morlet | — | — |

(number of gradient computations) for models trained on DWT representations is considerably higher than for the other two representations.

In all these experiments, we assumed a frame length of 50 ms with 50% overlap to convolve the input signal with the mother functions. We also carried out experiments on the potential effect of the frame length on the performance of the models. They showed that short frame lengths (e.g., 30 ms) reduce the recognition performance of the models on the three benchmarking datasets. Additionally, long frames introduce a high redundancy in the frequency plots, which also reduces the recognition accuracy (see Fig. 5). Fig. 6 visually compares crafted adversarial examples for the three representations. Although they are visually very similar to their legitimate counterparts, they confidently drive the classifier toward wrong predictions. This showcases the active threat that adversarial attacks pose to sound recognition models.

V. DISCUSSION
In this section, we discuss some secondary aspects of our experiments that could be relevant for future studies.
Fig. 5: Effect of the DWT frame length (ms) on the recognition accuracy and on the average cost of the attack for yielding AUC > 90%.

A. Neural Network Architecture
For selecting the front-end classifier, we measured the recognition accuracy and the total number of trainable parameters of all candidates. We explored DL architectures without residual blocks (AlexNet) and with inception blocks (GoogLeNet) as choices of victim classifiers. Our experiments unveiled that these dense networks do not outperform ResNet-18 in terms of recognition accuracy. Although the average recognition accuracies of ResNet-18 and GoogLeNet are competitive on spectrograms, the latter has 1.41× more trainable parameters. On average, the recognition performance of AlexNet is 8% lower than that of ResNet-18, even though it has 61% fewer parameters. Furthermore, the recognition performance of other ResNet models, such as ResNet-34 and ResNet-56, is very competitive with ResNet-18, but the latter requires 50% fewer parameters. Comparing the robustness of these models against adversarial attacks, they can all reach fooling rates higher than 95%. Taking the allocated budgets into account, ResNet-18 is the costliest network in terms of the number of gradient computations required by the adversary, followed by GoogLeNet and AlexNet.

Fig. 6: Crafted adversarial spectrograms for the three audio representations. The original audio sample has been randomly selected from the class of dog bark (l = 1). The examples in columns two to seven correspond to the six adversarial attacks applied to the original input sample. The required perturbation (‖ε‖) and the target label (l′) are shown under each spectrogram.

B. Data Augmentation
To improve the performance of the classifiers, we augmented the original datasets only at the waveform level (1D) using a time-stretching filter, except for the DWT representations, for which we additionally scaled the spectrograms by a logarithmic function. Removing the 1D data augmentation negatively affects the recognition accuracy of the models, with drop ratios of about 0.056%, 0.036%, and 0.029% for MFCC, STFT, and DWT spectrograms, respectively. For measuring the robustness of these models against adversarial examples, we ran the attack algorithms on random batches of size 100 drawn from the entire datasets. The experimental results show that, to reach fooling rates close to the values reported in Tables I to III, less gradient computation is required, mainly for the JSMA and CWA attacks.
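The waveform-level augmentation can be sketched with a simple resampling-based time stretch. This linear-interpolation stretch is an illustrative stand-in for the actual filter used; unlike a phase-vocoder stretch, it also shifts pitch:

```python
import numpy as np

def time_stretch(x, rate):
    """Naive time stretch by resampling with linear interpolation:
    rate > 1 shortens the signal, rate < 1 lengthens it."""
    n_out = int(len(x) / rate)
    src = np.linspace(0, len(x) - 1, n_out)   # fractional source positions
    return np.interp(src, np.arange(len(x)), x)

rng = np.random.default_rng(2)
x = rng.normal(size=8000)                      # 1 s of noise at 8 kHz
augmented = [time_stretch(x, r) for r in (0.8, 0.9, 1.1, 1.2)]
print([len(a) for a in augmented])   # [10000, 8888, 7272, 6666]
```

Each stretched copy is then converted to a spectrogram like the original waveform, multiplying the effective size of the training set.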
C. Adversarial Attacks on Raw Audio
All the adversarial attack algorithms discussed in this paper were developed for 2D representations, from natural images to spectrograms. Unfortunately, these adversarial attacks do not fit end-to-end classifiers trained directly on audio waveforms [32], [33]. Optimizing Eq. 1 even for a very short audio signal sampled at a low rate is very costly, and the resulting examples are not transferable when played over the air [34]. Toward addressing this interesting open problem, we trained several end-to-end ConvNets on randomly selected batches of the environmental sound datasets. Upon running both targeted and non-targeted attacks against these ConvNets, we could reduce the performance of the victim classifiers by 30% on average. Interestingly, multiplying the adversarial examples by a small random scalar restores the correct label of the audio waveforms. In other words, unlike adversarial spectrograms, 1D adversarial audio waveforms are not resilient against additional perturbations. This probably relates to the nature of audio signals, regardless of the number of channels they may contain.
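As a toy illustration of perturbing a waveform directly, a one-step FGSM-style perturbation can be sketched against a linear surrogate classifier. The model, signal, and ε below are illustrative assumptions, not the end-to-end ConvNets used in our experiments:

```python
import numpy as np

# Toy linear "classifier" over a 1 s waveform at 8 kHz: sign(w . x)
rng = np.random.default_rng(3)
w = rng.normal(size=8000)
x = w / np.linalg.norm(w)          # a waveform classified as +1
y = 1

# One-step FGSM on the margin y * (w . x): move against the gradient sign
eps = 0.02
grad = y * w
x_adv = x - eps * np.sign(grad)    # the predicted label flips from +1 to -1

print(float(np.sign(w @ x)), float(np.sign(w @ x_adv)))
```

Even this toy case hints at the cost argument: a one-step attack needs a single gradient over all 8000 samples, while iterative attacks on real end-to-end models repeat such full-length gradients many times.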
D. Adversarial Transferability
Adversarial examples are known to be transferable from one victim model to another [35]. We explored this potential property on deep neural networks trained on the different spectrograms. Table V reports the transferability ratios averaged over budgets with batch sizes of 100. Crafted adversarial examples are the least transferable for MFCC representations, while DWT spectrograms have the highest transfer rates on average. Examples generated in the STFT domain are more transferable than those generated for MFCC, which may be due to the higher order of information in STFT spectrograms.

VI. CONCLUSION
In this paper, we demonstrated the inverse relationship between the recognition accuracy and the robustness of ResNet-18 trained on 2D representations of environmental audio signals, averaged over the budgets allocated by the adversary. This relationship generalizes to other DL architectures and is a common behavior of models trained on spectrograms. Additionally, we showed that our front-end classifier reaches the highest recognition accuracy when trained on the DWT representation. Furthermore, attacking this model is, on average, more costly for the adversary than attacking models trained on MFCC and STFT representations. This demonstrates the advantage of the DWT representation for environmental sound recognition.
TABLE V: Average transferability ratio of adversarial examples among ConvNets. Higher ratios are shown in boldface.
| Dataset | Model | MFCC→ResNet-18 | MFCC→GoogLeNet | MFCC→AlexNet | STFT→ResNet-18 | STFT→GoogLeNet | STFT→AlexNet | DWT→ResNet-18 | DWT→GoogLeNet | DWT→AlexNet |
|---|---|---|---|---|---|---|---|---|---|---|
| ESC-10 | ResNet-18 | 1 | — | — | — | — | — | — | — | — |
| ESC-10 | GoogLeNet | — | — | — | — | — | — | — | — | — |
| ESC-10 | AlexNet | 0.491 | — | — | — | — | — | — | — | — |
| ESC-50 | ResNet-18 | — | — | — | — | — | — | — | — | — |
| ESC-50 | GoogLeNet | — | — | — | — | — | — | — | — | — |
| ESC-50 | AlexNet | 0.523 | — | — | — | — | — | — | — | — |

Moreover, we examined the transferability of the crafted adversarial examples among AlexNet, GoogLeNet, and ResNet-18 for the three spectrogram representations. According to our results, the lowest transferability ratio, averaged over the six adversarial attacks, was achieved for MFCC spectrograms. In our future studies, we intend to investigate this property for networks trained on speech datasets.
REFERENCES

[1] B. Li, X. Liu, K. Dinesh, Z. Duan, and G. Sharma, "Creating a multitrack classical music performance dataset for multimodal music analysis: Challenges, insights, and applications," IEEE Trans Multimedia, vol. 21, no. 2, pp. 522–535, 2018.
[2] W.-L. Zheng, J.-Y. Zhu, and B.-L. Lu, "Identifying stable patterns over time for emotion recognition from EEG," IEEE Trans Affect Computing, 2017.
[3] R. Radhakrishnan, A. Divakaran, and A. Smaragdis, "Audio analysis for surveillance applications," in IEEE Workshop Appl Signal Proc Audio Acous, 2005, pp. 158–161.
[4] G. Valenzise, L. Gerosa, M. Tagliasacchi, F. Antonacci, and A. Sarti, "Scream and gunshot detection and localization for audio-surveillance systems," in IEEE Conf Adv Video Sign Based Surv, 2007, pp. 21–26.
[5] D. Steele, J. Krijnders, and C. Guastavino, "The sensor city initiative: cognitive sensors for soundscape transformations," GIS Ostrava, pp. 1–8, 2013.
[6] J. Salamon and J. P. Bello, "Unsupervised feature learning for urban sound classification," in Intl Conf Acous Speech Sign Proc, 2015, pp. 171–175.
[7] ——, "Deep convolutional neural networks and data augmentation for environmental sound classification," IEEE Sign Proc Lett, vol. 24, no. 3, pp. 279–283, 2017.
[8] M. Esmaeilpour, P. Cardinal, and A. L. Koerich, "Unsupervised feature learning for environmental sound classification using weighted cycle-consistent generative adversarial network," Applied Soft Computing, vol. 86, p. 105912, 2020.
[9] T. Kaneko, S. Takaki, H. Kameoka, and J. Yamagishi, "Generative adversarial network-based postfilter for STFT spectrograms," in INTERSPEECH, 2017, pp. 3389–3393.
[10] A. Mathur, A. Isopoussu, F. Kawsar, N. Berthouze, and N. D. Lane, "Mic2Mic: using cycle-consistent generative adversarial networks to overcome microphone variability in speech systems," in Intl Conf Inform Proc Sensor Netw (IPSN), 2019, pp. 169–180.
[11] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in NIPS, 2012, pp. 1097–1105.
[12] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," in IEEE Conf Comp Vis Patt Recog, 2015, pp. 1–9.
[13] M. Esmaeilpour, P. Cardinal, and A. L. Koerich, "A robust approach for securing audio classification against adversarial attacks," IEEE Trans Inf Forensics Security, vol. 15, pp. 2147–2159, 2020.
[14] V. Boddapati, A. Petef, J. Rasmusson, and L. Lundberg, "Classifying environmental sounds using image recognition networks," Procedia Comp Sci, vol. 112, pp. 2048–2056, 2017.
[15] N. Carlini and D. Wagner, "Towards evaluating the robustness of neural networks," in IEEE Symp Secur Priv, 2017, pp. 39–57.
[16] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus, "Intriguing properties of neural networks," arXiv preprint arXiv:1312.6199, 2013.
[17] I. J. Goodfellow, J. Shlens, and C. Szegedy, "Explaining and harnessing adversarial examples," arXiv preprint arXiv:1412.6572, 2014.
[18] A. Kurakin, I. Goodfellow, and S. Bengio, "Adversarial examples in the physical world," arXiv preprint arXiv:1607.02533, 2016.
[19] N. Papernot, P. McDaniel, S. Jha, M. Fredrikson, Z. B. Celik, and A. Swami, "The limitations of deep learning in adversarial settings," in IEEE European Symp Secur Privacy, 2016, pp. 372–387.
[20] S.-M. Moosavi-Dezfooli, A. Fawzi, and P. Frossard, "DeepFool: a simple and accurate method to fool deep neural networks," in IEEE Conf Comp Vis Patt Recog, 2016, pp. 2574–2582.
[21] L. Shi, I. Ahmad, Y. He, and K. Chang, "Hidden Markov model based drone sound recognition using MFCC technique in practical noisy environments," Journal of Communications and Networks, vol. 20, no. 5, pp. 509–518, 2018.
[22] A. Maurya, D. Kumar, and R. Agarwal, "Speaker recognition for Hindi speech signal using MFCC-GMM approach," Procedia Computer Science, vol. 125, pp. 880–887, 2018.
[23] J. Salamon, C. Jacoby, and J. P. Bello, "A dataset and taxonomy for urban sound research," in ACM Intl Conf Multimedia, Orlando, FL, USA, 2014.
[24] K. J. Piczak, "ESC: Dataset for environmental sound classification," in ACM Intl Conf Multimedia, 2015, pp. 1015–1018.
[25] B. McFee, C. Raffel, D. Liang, D. P. Ellis, M. McVicar, E. Battenberg, and O. Nieto, "librosa: Audio and music signal analysis in Python," in Python in Science Conf, vol. 8, 2015.
[26] S. Hanov, "Wavelet sound explorer software," http://stevehanov.ca/wavelet/, 2008.
[27] B.-H. Juang, L. Rabiner, and J. Wilpon, "On the use of bandpass liftering in speech recognition," IEEE Trans Acous Speech Sign Proc, vol. 35, no. 7, pp. 947–954, 1987.
[28] K. K. Paliwal, "Decorrelated and liftered filter-bank energies for robust speech recognition," in Eurospeech, 1999.
[29] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in IEEE Conf Comp Vis Patt Recog, 2016, pp. 770–778.
[30] S. Hershey, S. Chaudhuri, D. P. Ellis, J. F. Gemmeke, A. Jansen, R. C. Moore, M. Plakal, D. Platt, R. A. Saurous, B. Seybold et al., "CNN architectures for large-scale audio classification," in IEEE Intl Conf Acous Speech Sign Proc, 2017, pp. 131–135.
[31] J. Rauber, W. Brendel, and M. Bethge, "Foolbox: A Python toolbox to benchmark the robustness of machine learning models," arXiv preprint arXiv:1707.04131, 2017.
[32] S. Abdoli, P. Cardinal, and A. L. Koerich, "End-to-end environmental sound classification using a 1D convolutional neural network," Expert Systems with Applications, vol. 136, pp. 252–263, 2019.
[33] K. M. Koerich, S. Abdoli, M. Esmaeilpour, A. S. Britto Jr., and A. L. Koerich, "Cross-representation transferability of adversarial attacks: From spectrograms to audio waveforms," in IEEE International Joint Conference on Neural Networks (IJCNN), 2020, pp. 1–7.
[34] N. Carlini and D. Wagner, "Audio adversarial examples: Targeted attacks on speech-to-text," arXiv preprint arXiv:1801.01944, 2018.
[35] N. Papernot, P. McDaniel, and I. Goodfellow, "Transferability in machine learning: from phenomena to black-box attacks using adversarial samples," arXiv preprint arXiv:1605.07277, 2016.