Improving adversarial robustness of deep neural networks by using semantic information
Lina Wang, Rui Tang, Yawei Yue, Xingshu Chen, Wei Wang, Yi Zhu, Xuemei Zeng
Abstract—The vulnerability of deep neural networks (DNNs) to adversarial attack, which is an attack that can mislead state-of-the-art classifiers into making an incorrect classification with high confidence by deliberately perturbing the original inputs, raises concerns about the robustness of DNNs to such attacks. Adversarial training, which is the main heuristic method for improving adversarial robustness and the first line of defense against adversarial attacks, requires many sample-by-sample calculations to increase the training size and is usually insufficiently strong for an entire network. This paper provides a new perspective on the issue of adversarial robustness, one that shifts the focus from the network as a whole to the critical part of the region close to the decision boundary corresponding to a given class. From this perspective, we propose a method to generate a single but image-agnostic adversarial perturbation that carries the semantic information implying the directions to the fragile parts on the decision boundary and causes inputs to be misclassified as a specified target. We call the adversarial training based on such perturbations "region adversarial training" (RAT), which resembles classical adversarial training but is distinguished in that it reinforces the semantic information missing in the relevant regions. Experimental results on the MNIST and CIFAR-10 datasets show that this approach greatly improves adversarial robustness even when using only a very small subset of the training data; moreover, it can defend against FGSM adversarial attacks that have a completely different pattern from those the model saw during retraining.
Index Terms—adversarial robustness, semantic information, region adversarial training, targeted universal perturbations.
I. INTRODUCTION

As an accepted technique in machine learning, deep learning (DL) has proved itself capable of performing singularly well on a number of categories of machine learning tasks [1]. In particular, deep neural networks (DNNs) can learn very effective models for input classification. State-of-the-art DNNs have achieved impressive performance in tasks of computer vision [2], [3], speech recognition [4], [5], and natural language understanding [6], [7] and provide solutions based on these tasks for many other problems, such as in medical science [8]. The universal approximator theorem [9] guarantees the representational power of DNNs but does not indicate whether a training algorithm will be able to discover a function having all the desired properties. For all the success of deep learning algorithms, Szegedy et al. [10], [11] revealed an inherent weakness of DNNs by pointing out the existence of a new type of attack called an adversarial attack.
Lina Wang, Rui Tang, Yawei Yue, and Xingshu Chen are with the College of Cybersecurity, Sichuan University, Chengdu 610065, China (e-mail: [email protected], [email protected], [email protected], [email protected]). Wei Wang, Yi Zhu, and Xuemei Zeng are with the Cybersecurity Research Institute, Sichuan University, Chengdu 610065, China (e-mail: [email protected], [email protected]).

The adversary in this type of attack misleads models into producing an incorrect output with an adversarial example, a plausible member of the input dataset that is only slightly different from benign examples, created by adding a carefully constructed adversarial perturbation. For example, the images on the diagonal in Fig. 1 are unperturbed clean examples, and the other images are adversarial examples misclassified as specified target classes, with perturbations that are almost imperceptible to human vision. Recent studies have made it clear that DNNs are universally vulnerable to adversarial examples; this seems to contradict the assumptions that underlie many deep learning methods and suggests that our deep classifiers based on modern machine learning techniques have only built a Potemkin village instead of learning the true underlying concepts that determine a correct output label. Ideally, the label estimated by a classifier should not be altered by a sufficiently small perturbation of an input data point, let alone an adversarial perturbation. This excellent property, called robustness, is extremely significant for DNNs when applied in realistic contexts, and above all in security-critical environments [12]. Because of the importance and imminence of the issue, the robustness of classifiers to adversarial examples has been attracting much attention in recent years.

Previous studies on the robustness of DNNs have approached the question from two directions, attempting either to prove a lower bound of robustness through formal guarantees or to find an upper bound of robustness through adversarial attacks. The formal approach is sound but difficult to carry out in practice [13], whereas heuristic defenses against adversarial attacks are not sufficiently strong [14]. There is a puzzling problem concerning the latter approach. It is generally believed that neural networks are not learning the true concepts [11], yet the adversarial perturbations generated by almost all known methods appear to be chaotic! This seems counter-intuitive, because if the network is missing important information related to the true underlying concepts, this information should be reflected in the adversarial examples, representing the blind spots of the network.

In addition, despite a number of meaningful studies on the issue, achieving ideal robustness remains a difficult goal. Improving the adversarial robustness of a network as a whole is rather ambitious and difficult; sometimes enhancing the robustness of particular regions in the manifold represented by the network can provide a greater benefit in reality. This is even more remarkable for certain application scenarios, especially security-sensitive applications.

Fig. 1: Illustration of targeted universal perturbation (TUP) attacks on a typical DNN using images sampled from CIFAR-10, showing source–target pairs. To facilitate the presentation, we use the numbers 0–9 to represent the ten classes of CIFAR-10.
The number preceding each row represents the source class, and the number preceding each column represents the target class. For example, an image with a row number of 1 and a column number of 2 is a TUP adversarial example whose true label is class 1 but is incorrectly classified as class 2. All of the original images displayed were selected at random.

For example, for a classifier that distinguishes different kinds of animals, it is no more dangerous to classify dogs as cats than dogs as birds, but in the case of a multi-category classifier for malware classification or for traffic sign recognition as used in autonomous vehicles, things are quite different [15]. Incorrectly classifying a yield sign as a stop sign is likely to be safer than misclassifying it as a sign that allows vehicles to pass. Similarly, misclassifying malware as belonging to the wrong malware family is less harmful than incorrectly classifying it as benign.

Furthermore, almost all heuristic methods for improving adversarial robustness require a large number of calculations on a very large dataset of a size comparable to that of the training set. This considerably reduces their suitability for practical scenarios, especially application environments having high timeliness requirements.

In this paper, we focus on the region corresponding to a certain class in the manifold represented by the attacked network, and we propose a method to extract semantic information that is universal for most examples from a small set of the data points that lie very close to the classifier's decision boundary separating one class from all others. The key idea is to emphasize to the classifier the semantic information it has not yet learned and to prompt the classifier to learn a clearer (usually more complicated) decision boundary and the underlying concepts. We retain this universality property across the inputs as [16] did, but unlike researchers in previous studies, we generate perturbations containing semantic information instead of meaningless noise with the aim of improving robustness. The main contributions of this paper are as follows:

• We find that there exists a single perturbation applicable to most of the inputs that could constitute a targeted adversarial attack on a classifier, and, importantly, that such perturbations are not meaningless but contain explicit semantic information. Furthermore, we propose an algorithm for generating such targeted universal perturbations (TUPs). The algorithm computes a series of perturbation vectors one at a time, sending a data point to the classification boundary of the region corresponding to the specified target class for a set of points in the training dataset, and then aggregates the perturbation vectors to find a universal vector indicating the direction to the region in an iterative way. We show that the proposed algorithm can calculate such a perturbation on a very small set of training data points, which causes new samples to be misclassified as a specific target class with high probability.

• We present a new approach to improve adversarial robustness, called region adversarial training (RAT), and formalize it conceptually. RAT pays special attention to the region near the decision boundary corresponding to a selected target class and then uses the extracted semantic information related to this region to guide the retraining process.
The information used by RAT comprises the common patterns for most samples that follow the same distribution as the training data, and these patterns contain semantic information related to the true underlying concepts; consequently, RAT can not only perform well on a very small data set, but also defend against adversarial attacks that have never been seen by the network before.

• We validate the algorithm by reporting the results of extensive experiments using MNIST [17] and CIFAR-10 [2] and show that the perturbations achieve a similarly high attack success rate for each target class. We also systematically evaluate the choice of algorithm parameters. We find that our TUP perturbations not only retain the universality property of being able to fool unseen data points but also transfer well across different architectures and can work well even when calculated from a very small dataset. We show experimentally that using the proposed algorithm to provide examples for region adversarial training (even on a very small set from the training data), the test set accuracy on both TUP adversarial examples and the best-known FGSM adversarial examples [11] can be increased on MNIST and CIFAR-10.

The rest of this paper is organized as follows. In Section II, we summarize recent work on generating adversarial examples and improving adversarial robustness. Section III provides the preliminaries and defines the notation. Then, we introduce the proposed approaches for finding TUPs and formalize the region adversarial training method in Section IV. The experiments we conducted to test the proposed method are described and their results analyzed in Section V. Finally, we conclude with Section VI.
II. RELATED WORK
As our goal is to extract missed semantic information through a method of generating adversarial examples and then to improve the robustness of DNNs, this section first introduces the work related to the generation of adversarial examples and then describes the studies on improving adversarial robustness.
Fig. 2: Overview of region adversarial training (RAT) based on TUP adversarial examples. (a) shows the entire RAT process. (b) is an illustration of Algorithm 1 for computing a TUP perturbation.
Adversarial examples.
Szegedy et al. [10] discovered the possibility of adversarial attacks on deep neural networks by generating adversarial examples using box-constrained L-BFGS. The fact that deep neural networks are surprisingly susceptible to such adversarial attacks triggered wide interest among researchers in the security and machine learning communities, and since then, a sizable body of related literature has introduced several new methods for crafting adversarial examples to construct an upper bound on the robustness of neural networks. Goodfellow et al. [11] proposed a method called the "Fast Gradient Sign Method" (FGSM), which perturbs an image to increase the loss of the classifier on the resulting image based on the "linearity hypothesis" of deep network models in higher-dimensional space. Kurakin et al. [18] presented an alternative approach, named "Fast Gradient $L_\infty$", that uses a different norm from the one used in FGSM, and also extended FGSM to a "target class" variation wherein the label of the class least likely to be predicted by the attacked network is used as the target class. Unlike the one-step methods, which take a single step in the direction that increases the loss, the Basic Iterative Method (BIM) [19] computes the perturbation iteratively by adjusting the direction step by step. Papernot et al. [20] modified pixels of the original image one at a time by computing a saliency map and then monitored the effect of the changes. The method proposed by Su et al. [21] was deduced for the extreme case in which the attacker is allowed to change only one pixel of the image, and they reported a fairly good success rate. A more refined algorithm, named DeepFool [22], moves a given image toward the boundary of a polyhedron through a small vector based on an iterative linearization of the classifier to compute a minimal-norm adversarial perturbation. All of the above methods compute adversarial perturbations to fool an attacked network using a single image; the method in [16] is fundamentally different. Its authors computed perturbations that do not involve a data-dependent optimization per image but instead fool the classifier on all images through one and the same perturbation. However, the perturbations they produced caused the clean samples to be misclassified as any (unpredictable) class and contained very little semantic information. By contrast, our approach computes a perturbation that moves the sample in a specific direction chosen to cause the perturbed sample to be misclassified as a target class t while preserving the universality property across samples, without the need to use any complex generative models such as in [23]. More importantly, our approach extracts explicit semantic information with very few samples and generates adversarial perturbations that show this semantic information clearly and that exhibit a pattern completely different from the others.
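For concreteness, the one-step FGSM attack and its "target class" variant described above amount to a single gradient-sign step. The following is a minimal PyTorch-style sketch, not the exact implementation used in this paper; the model, labels, step size eps, and the clipping to [0, 1] pixel values are all illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def fgsm_untargeted(model, x, y, eps):
    """One-step FGSM (Goodfellow et al.): move each input by eps in the
    direction of the sign of the loss gradient, increasing the loss."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    x_adv = x + eps * x.grad.sign()
    return x_adv.clamp(0.0, 1.0).detach()   # keep pixel values in [0, 1]

def fgsm_targeted(model, x, t, eps):
    """'Target class' variant (Kurakin et al.): step in the direction that
    decreases the loss toward the chosen target labels t."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), t)
    loss.backward()
    x_adv = x - eps * x.grad.sign()
    return x_adv.clamp(0.0, 1.0).detach()
```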
Adversarial robustness.

The appearance of adversarial examples reveals the intrinsic vulnerability of the existing neural network methodology; therefore, studies on improving its robustness to adversarial examples are of great importance. Work has generally been developing in two different directions. One way of making neural networks robust to adversarial attacks focuses on formally ensuring their robustness. Robustness verification is a general method for obtaining safety guarantees [24], [25], [26], [27], [28], [29], [30]; it is typically based on sophisticated theory and is usually computationally expensive. As the investigation in this paper does not involve formal verification techniques, we do not go into detail here. The other way is to explore heuristic defenses against adversarial examples (including their detection), by means of modifying networks directly [31], [32], [33], [34], [35], [36], [37], [38], [39], [40], [41], [42], using extra network add-ons [43], [44], [45], [46], or changing the training procedure or using modified inputs in the inference phase [47], [48], [49], [50], [51], [52], [53], [54]. The method presented in this paper is of this type and, more specifically, falls into the category of adversarial training [11], [55], which modifies the training procedure with adversarial inputs. What most distinguishes our work from other adversarial training methods is that whereas to our knowledge all existing methods improve the adversarial robustness of networks as a whole, ours focuses on certain regions in the manifold represented by the network. In addition, all existing studies on adversarial training have used an image-specific method to increase the size of the training dataset, which requires at least one calculation for each example on a very large dataset (usually a multiple of the training set). To the best of our knowledge, the method in [16] is the only exception; it calculates a single image-agnostic perturbation for a set of training points, but it leads to only a slight improvement in robustness. Our method is designed to enhance the robustness of DNNs using a very small set by employing perturbations that contain semantic information and retain the universality property but that are completely different from the patterns in [16].
III. PRELIMINARIES
A. Neural networks: Definitions and notation
A neural network used as a multi-class classifier, which is the case exclusively studied in this paper, is given an input and provides a corresponding class probability vector as output. Formally, a classifier $\hat{f}: \mathbb{R}^n \rightarrow \{1, \ldots, K\}$ accepts an input $x \in \mathbb{R}^n$ and provides an estimated label $\hat{f}(x)$ as output for it. We assume that $x \sim \psi$, where $\psi$ denotes a distribution of inputs in $\mathbb{R}^n$. The output vector $\hat{f}(x)$ represents the probability that the input $x$ belongs to each of the $K$ classes. The classifier assigns the label $\hat{y}(x) = \arg\max_i \hat{f}(x)_i$ to the input $x$; the ground-truth label is denoted by $y$. The model $\hat{f}$ depends on some parameters $\theta$, but as the network is fixed for our method of crafting an adversarial perturbation, we will omit $\theta$ from $\hat{f}$ when there is no ambiguity. We define $J(\theta, x, y)$ as the loss function used to train the model.
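As a concrete reading of this notation, the following minimal PyTorch-style sketch shows how $\hat{f}$, $\hat{y}(x)$, and $J(\theta, x, y)$ map onto an ordinary softmax classifier; the network itself is a placeholder, not one of the architectures evaluated later.

```python
import torch
import torch.nn.functional as F

def predicted_label(model, x):
    """y_hat(x) = argmax_i f_hat(x)_i: the class with the highest
    estimated probability for input x."""
    probs = F.softmax(model(x), dim=1)   # f_hat(x), one K-vector per input
    return probs.argmax(dim=1)

def loss_J(model, x, y):
    """J(theta, x, y): the loss used to train the model; cross-entropy
    is the choice used later in this paper."""
    return F.cross_entropy(model(x), y)
```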
B. Adversarial examples

Given a naturally occurring example (clean example) $x$ and a classifier $\hat{f}(\cdot)$, an adversarial example [10] is an input that causes the classifier to make a mistake. An adversary launches adversarial attacks by crafting adversarial examples. Let $x' = x + r$ be an adversarial example that is very similar to $x$, where $r$ is a small vector called an adversarial perturbation. More precisely, an untargeted adversarial example is one that causes the classifier to predict any incorrect label (i.e., it makes $\hat{f}(x') \neq \hat{f}(x)$), and a targeted adversarial example is one that causes the classifier to change the prediction to some specific target class $t$ (i.e., $\hat{f}(x') = t$). It is apparent that untargeted adversarial attacks are strictly less powerful than targeted adversarial attacks, meaning that if an adversarial example can cause a targeted adversarial attack, it can certainly cause an untargeted adversarial attack [56], [57]. The similarity between $x$ and $x'$ is usually measured by some distance metric $d(\cdot)$. In the literature on generating adversarial examples, three distance metrics, the $L_0$-norm, the $L_2$-norm, and the $L_\infty$-norm (collectively, $L_p$-norms), are widely used. The $L_p$-norm of a vector $v$ is defined as

$\|v\|_p = \left( \sum_{i=1}^{n} |v_i|^p \right)^{1/p}.$

In this paper, we focus on the $L_\infty$ distance. It is true that no distance metric is a perfect measure of human perception, especially considering different scenarios. Constructing and evaluating a good distance metric may be intuitive, but we do not judge which distance metric is optimal, as that is not the focus of this paper. Instead, we use the $L_\infty$ distance, as $L_p$ distances are sufficient for the computer vision classification task that is the focus of this paper, and the $L_\infty$ norm is considered the optimal choice [14] and has been widely used in many studies [58], [59].
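For illustration, the $L_p$ and $L_\infty$ distances above can be computed directly; a small numpy sketch with made-up example values follows.

```python
import numpy as np

def lp_norm(v, p):
    """||v||_p = (sum_i |v_i|^p)^(1/p)."""
    return np.sum(np.abs(v) ** p) ** (1.0 / p)

def linf_norm(v):
    """||v||_inf = max_i |v_i|, the limit of the L_p norm as p grows."""
    return np.max(np.abs(v))

x = np.array([0.20, 0.50, 0.90])      # a clean example (toy values)
r = np.array([0.03, -0.03, 0.02])     # a small perturbation
x_adv = x + r                         # the adversarial example x' = x + r
print(lp_norm(r, 2), linf_norm(r))    # L2 and L_inf distances d(x, x')
```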
C. Threat model

The threat model of a system, which often involves adversarial goals and capabilities, can be used to measure the security of the system. If a system using a DNN is viewed as a generalized data processing pipeline, at the inference phase the system collects inputs from sensors or data repositories, then processes the inputs in the digital domain and feeds them to the model to produce an output for external systems or users to receive and act upon. According to the attack surface defined by this procedure, in this paper we consider adversaries that are capable of manipulating the collection and processing of data to tamper with the output. The adversaries have no knowledge of the model architecture or the values of any parameters or trainable weights, but they have direct access to at least some of the training data, and of course they can query the model, i.e., feed it inputs and receive outputs. Finally, by modeling the adversarial goals using a classical approach that considers confidentiality, integrity, and availability, called CIA [60], it can be seen that the main threat from such adversaries is to compromise the integrity of the DNN-based system. As they are capable of destroying the input–output mapping of the model, they can also achieve the goal of undermining availability, despite the difference between availability and integrity in definition.
IV. METHODOLOGY
An overview of the method for generating TUP adversarial examples and performing region adversarial training is given in Fig. 2(a). A conceptual illustration of the method for computing TUPs is presented in Fig. 2(b). As shown in Fig. 2(a), a common neural network A can correctly classify a clean input but cannot correctly classify an adversarial example in the inference phase. Retraining using our RAT method based on TUPs results in a more robust network B that can correctly classify even the unseen adversarial examples. In Fig. 2(b), we use black solid lines to represent a simple decision boundary (which is linear in this case) for the original network. A set of data points can be easily separated with the simple decision boundary, but the $L_\infty$ balls around the data points cannot be separated well. Let $\mathcal{F}_k = \{x : \hat{f}_k(x) - \hat{f}_t(x) = 0\}$ (in the case shown in Fig. 2(b), $k = 1, 2, 3$) describe the region of the space where the classifier outputs label $t$. For each point whose ground-truth label is not $t$ but that is not classified correctly by the simple decision boundary, the method calculates a vector that touches a polyhedron approximating the region $\mathcal{F}_k$. Then, by continuously aggregating these vectors and updating the perturbation vector, we finally obtain a TUP perturbation that captures the semantic information that the network has not learned and that concerns the decision boundary of the region where the classifier outputs label $t$. Using this information to retrain the network, a more complicated decision boundary, needed to separate the adversarial examples in the $L_\infty$ balls, can be obtained (represented by the red curve in Fig. 2(b)). This makes the resulting network more robust against adversarial attacks with bounded $L_\infty$ perturbations.

A. Targeted universal perturbations
The problem of generating an adversarial example for an input $x$ is equivalent to that of finding a minimum adversarial perturbation $r$ that satisfies the adversarial condition. Formally, this problem can be defined as follows:

$\min_r \; d(x, x + r) \quad \mathrm{s.t.} \quad \hat{f}(x + r) = t.$   (1)

In Eq. (1), $x$ and $x + r$ must be drawn from the same distribution $\psi$ and the same feature space. As our aim is to cause a targeted adversarial attack for most inputs through a single perturbation and to extract semantic information from them, the problem differs a bit. Our generation method focuses on the following question: Can we find a perturbation vector $r \in \mathbb{R}^n$ that causes the classifier to misclassify almost all data points sampled from $\psi$ as a certain class $t$ that differs from the correct prediction for the original input? In other words, we look for a vector $r$ such that, for most $x \sim \psi$,

$\hat{f}(x + r) = t \neq \hat{f}(x).$   (2)

According to the concepts of adversarial examples and adversarial perturbations as described before, each of the following two constraints on the perturbation vector $r$ must be satisfied:

$d(r) \leq \eta,$   (3a)

$P_{x \sim \psi}\big(\hat{f}(x + r) = t\big) \geq 1 - \delta.$   (3b)

In Eq. (3a), we use $d(r)$ as a measure of the quantified similarity. In Eq. (3b), we use $1 - \delta$ to denote the success-rate threshold, where the parameter $\delta \in (0, 1)$ is a scalar. The parameter $\eta$ restricts the magnitude of the perturbation. The smaller the value of $\eta$, the harder it is for a human to perceive the perturbation in the image; on the other hand, a larger $1 - \delta$ value (i.e., a smaller $\delta$ value) implies a stronger attack that is more powerful for generating a desired perturbation. We call such a perturbation $r$ a targeted-$(\delta, \eta)$-universal perturbation (TUP), as this single input-agnostic perturbation, restricted by the parameters $\delta$ and $\eta$, causes the predicted label of most data points sampled from the data distribution $\psi$ to be converted to the target class $t$.

Algorithm 1.
In this paper, we propose an algorithm that seeks a common perturbation $r$ for most data points in $X = \{x_1, \ldots, x_s\}$, which is a set of images sampled from the same distribution $\psi$, such that the attacked neural network is caused to misclassify the perturbed input as a pre-selected target class $t$ and such that $r$ satisfies $\|r\|_\infty \leq \eta$. The algorithm progressively establishes the targeted universal perturbation via an iterative procedure over the data points in $X$. At each iteration, it computes a minimal perturbation $\Delta r_i$ that sends the current perturbed point $x_i + r_i$ toward the decision boundary of target class $t$ of the classifier, and then aggregates $\Delta r_i$ into the current instance of the targeted universal perturbation $r_i$, as illustrated in Fig. 2(b). More specifically, as long as data point $x_i$ perturbed by the current $r_i$ is not classified as the target class $t$ by the attacked model, we solve the following optimization problem to find a supplemental $\Delta r_i$ that will lead to misclassification of $x_i$:

$\Delta r_i \leftarrow \arg\min_\sigma \|\sigma\|_\infty \quad \mathrm{s.t.} \quad \hat{f}(x_i + r_i + \sigma) = t.$   (4)

We treat the problem in Eq. (4) as a suitable optimization instance and solve it with existing optimization algorithms such as that given in [14]. To reduce the computational cost while ensuring that the constraint $\|r\|_\infty \leq \eta$ is satisfied, the updated perturbation $r$ is further clipped and projected onto the $\ell_\infty$ ball, with radius $\eta$ and centered at 0, every $k$ iterations; the projection operator $P_{\infty,\eta}$ is defined as follows:

$P_{\infty,\eta}(r') = \arg\min_r \|r - r'\|_2 \quad \mathrm{s.t.} \quad \|r\|_\infty \leq \eta.$   (5)

Then, we use the operator in Eq. (5) to update the perturbation vector $r$ in the $i$th iteration as follows:

$r \leftarrow \begin{cases} P_{\infty,\eta}(r + \Delta r_i), & \text{for } i \bmod k = 0,\ i \neq 0, \\ r + \Delta r_i, & \text{otherwise}. \end{cases}$   (6)

When the attack success rate for target class $t$ exceeds the desired threshold $1 - \delta$ on the perturbed dataset $X_r := \{x_1 + r, \ldots, x_s + r\}$, the algorithm is stopped. The success rate $\mathrm{Succ}(X_r)$ is defined as the likelihood that the perturbation will change the label to the target class $t$. In other words, the terminal condition of the algorithm is

$\mathrm{Succ}(X_r) := \frac{1}{s} \sum_{i=1}^{s} \mathbb{1}_{\hat{f}(x_i + r) = t} \geq 1 - \delta,$   (7)

where $\mathbb{1}_{\hat{f}(x_i + r) = t}$ is the indicator function. The details of the algorithm are provided as Algorithm 1.
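As a complement to the pseudocode listing of Algorithm 1 below, the following is a minimal PyTorch-style sketch of the outer loop in Eqs. (4)–(7). The subroutine `minimal_targeted_perturbation` is a placeholder for an existing targeted-attack optimizer that solves Eq. (4), such as the one in [14]; the final clipping is an added safeguard rather than part of the pseudocode.

```python
import torch

def project_linf(r, eta):
    """P_{inf,eta} of Eq. (5): Euclidean projection onto the L_inf ball
    of radius eta, which reduces to element-wise clipping."""
    return r.clamp(-eta, eta)

def success_rate(model, X, r, t):
    """Succ(X_r) of Eq. (7): fraction of perturbed points classified as t."""
    with torch.no_grad():
        preds = model(X + r).argmax(dim=1)
    return (preds == t).float().mean().item()

def compute_tup(model, X, t, eta, k, delta, minimal_targeted_perturbation):
    """Sketch of the outer loop of Algorithm 1: accumulate per-point minimal
    perturbations toward target class t, projecting onto the L_inf ball
    every k updates."""
    r = torch.zeros_like(X[0])
    i = 0
    while success_rate(model, X, r, t) < 1.0 - delta:
        for x_i in X:
            i += 1
            pred = model((x_i + r).unsqueeze(0)).argmax(dim=1).item()
            if pred != t:
                delta_r = minimal_targeted_perturbation(model, x_i + r, t)  # Eq. (4)
                r = r + delta_r
                if i % k == 0:                 # Eq. (6): periodic projection
                    r = project_linf(r, eta)
    # final clipping so the returned perturbation satisfies ||r||_inf <= eta
    return project_linf(r, eta)
```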
Algorithm 1: Computation of targeted universal perturbation.

Input: Dataset $X$, classifier $\hat{f}$, target class $t$, desired $L_\infty$-norm of the perturbation $\eta$, desired projection operator step size $k$, desired accuracy on perturbed data points $\delta$.
Output: Targeted-$(\delta, \eta)$-universal perturbation (TUP) vector $r$.

Initialize $r \leftarrow 0$.
while $\mathrm{Succ}(X_r) < 1 - \delta$ do
    for every $x_i \in X$ do
        if $\hat{f}(x_i + r) \neq t$ then
            $\Delta r_i \leftarrow \arg\min_\sigma \|\sigma\|_\infty$ s.t. $\hat{f}(x_i + r_i + \sigma) = t$
            if $i \bmod k = 0$ and $i \neq 0$ then
                Update the perturbation using the projection operator: $r \leftarrow P_{\infty,\eta}(r + \Delta r_i)$
            else
                Update the perturbation: $r \leftarrow r + \Delta r_i$
            end if
        end if
    end for
end while

B. Region adversarial training

In order to use the TUP approach to enhance the adversarial robustness of deep networks, we introduce a training method, which we call region adversarial training (RAT). The purpose of the training is not to enhance the entire network in undifferentiated ways; instead, it focuses on the weaker regions of the network or the regions of most interest to the user. In region adversarial training, the network is not trained on all inputs from the training set perturbed but on a mixture of original training data and training data perturbed by the TUP method. The targeted universal perturbation that is computed can be considered as containing more complex information about a certain class region's decision boundaries that the network has not yet learned from the original training set. The intuition behind region adversarial training is that incorporating this information into the training will improve the classifier's accuracy on adversarial examples of this class.

Formally, let $\Theta^*$ be the weights of a neural network; then standard training learns $\Theta^*$ as

$\Theta^* = \arg\min_\theta \, \mathbb{E}_{x \in \chi} \, J(\theta, x, y).$   (8)

The adversarial training proposed by Szegedy et al. [10] was originally for solving the following min–max formulation:

$\Theta^* = \arg\min_\theta \, \mathbb{E}_{x \in \chi} \Big[ \max_{\delta \in \Delta(x)} J(\theta, x + \delta, y) \Big],$   (9)

where $\delta$ represents adversarial perturbations computed by some method; in [11], a linear approximation method named the Fast Gradient Sign Method (FGSM) was used to generate $\delta$. The original adversarial training process trained on the perturbed samples roughly, without direction or distinction. The region adversarial training method proposed here pays special attention to the region of the space where the classifier outputs a certain class label $t$ in the manifold represented by the network. Using this method, $\Theta^*$ is computed as

$\Theta^* = \arg\min_\theta \, \mathbb{E}_{x \in \chi} \Big[ \max_{\delta \in \Delta(x)} \big[ J(\theta, x, y) + J_{\mathrm{adv}}(\theta, x + \delta, y) \big] \Big],$   (10)

$J(\theta, x, y) = \sum_{x_i \in \chi,\, f(x_i) = t} J(\theta, x_i, y),$   (11)

$J_{\mathrm{adv}}(\theta, x + \delta, y) = \sum_{x_i \in \chi,\, f(x_i) \neq t} J(\theta, x_i + \delta, y).$   (12)

The saddle point problem in Eq. (10) is similar to that in Eq. (9) in its composition of an inner maximization problem and an outer minimization problem. The loss function $J$ in Eq. (11) is independent of the perturbation $\delta$, and so the inner maximization problem in Eq. (10) can be rewritten as

$J(\theta, x, y) + \max_{\delta \in \Delta(x)} \big[ J_{\mathrm{adv}}(\theta, x + \delta, y) \big].$   (13)

Let $r_t$ be the perturbation vector found by Algorithm 1; then $r_t$ can be interpreted as a scheme for maximizing the loss $J_{\mathrm{adv}}$ in Eq. (13). Thus, the weights $\Theta^*$ are computed by the region adversarial training as

$\Theta^* = \arg\min_\theta \, \mathbb{E}_{x \in \chi} \big[ J(\theta, x, y) + J_{\mathrm{adv}}(\theta, x + r_t, y) \big].$   (14)

Eq. (14) can be used with any suitable loss function $J(\theta, x, y)$; in this paper, we use the common cross-entropy loss function for neural networks.
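To make Eq. (14) concrete, the following is a minimal PyTorch-style sketch of one region adversarial training step, assuming a TUP $r_t$ precomputed by Algorithm 1. The split of the batch by the model's prediction follows Eqs. (11)–(12); the optimizer, batch handling, and pixel clipping are illustrative assumptions rather than the exact training setup used in this paper.

```python
import torch
import torch.nn.functional as F

def rat_step(model, optimizer, x, y, r_t, t):
    """One region adversarial training step in the spirit of Eq. (14):
    clean cross-entropy on inputs the model assigns to class t (Eq. (11)),
    plus cross-entropy on the remaining inputs perturbed by the TUP r_t
    (Eq. (12))."""
    with torch.no_grad():
        pred = model(x).argmax(dim=1)
    is_t = (pred == t)

    loss = 0.0
    if is_t.any():
        loss = loss + F.cross_entropy(model(x[is_t]), y[is_t])
    if (~is_t).any():
        x_adv = (x[~is_t] + r_t).clamp(0.0, 1.0)   # add the image-agnostic TUP
        loss = loss + F.cross_entropy(model(x_adv), y[~is_t])

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```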
V. EVALUATION

In this section, the cases for the experimental investigation are introduced. Before turning to our approach for generating adversarial examples and improving adversarial robustness, we describe the architectures of the models on which we evaluated the proposed approach and the datasets we used. Then, we describe how the TUPs were generated for MNIST and CIFAR-10, show the performance of our TUP attack, discuss the influence of parameter selection, and study the property of transferability across different models and the performance on small datasets. Finally, based on the experimental results, we discuss whether the proposed region adversarial training with TUPs can improve adversarial robustness not only against TUP itself but also against FGSM adversarial examples. Furthermore, we also remark on the size of the set $X$ needed to achieve the desired results.

A. Experimental setup
Dataset description.
To ascertain the feasibility and effectiveness of the algorithm proposed in this paper, a series of experiments were performed on two widely used machine learning datasets: MNIST and CIFAR-10. The MNIST dataset is a collection of black and white images of handwritten digits; it contains 60,000 28×28 training samples and 10,000 test samples, each pixel of which is encoded as a real number between 0 and 1. The CIFAR-10 dataset consists of 60,000 32×32 color images, which are divided into a training set of 50,000 images and a test set of 10,000 images, each pixel of which takes the value of a real number between 0 and 1 for three color channels. For both the MNIST and CIFAR-10 datasets, we created a validation set of examples drawn from the training set. Each image in the MNIST and CIFAR-10 datasets is associated with a label from ten classes. In MNIST, the classes are the values ranging from 0 to 9, representing the digit written, and in CIFAR-10, the ten classes are airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck.

TABLE I: Baseline accuracy (acc.) of five MNIST classifiers and five CIFAR-10 classifiers.

Architecture (MNIST)                        Acc. (%)
Classifier-M-Primary     (Classifier Mp)    99.34
Classifier-M-Alternate-0 (Classifier M0)    99.31
Classifier-M-Alternate-1 (Classifier M1)    99.38
Classifier-M-Alternate-2 (Classifier M2)    99.30
Classifier-M-Alternate-3 (Classifier M3)    99.35

Architecture (CIFAR-10)                     Acc. (%)
Classifier-C-Primary     (Classifier Cp)    77.74
Classifier-C-Alternate-0 (Classifier C0)    78.03
Classifier-C-Alternate-1 (Classifier C1)    73.55
Classifier-C-Alternate-2 (Classifier C2)    73.09
Classifier-C-Alternate-3 (Classifier C3)    75.46
Architecture characteristics.

To begin our empirical explorations, we trained five networks each for the standard MNIST and CIFAR-10 classification tasks. The five networks differed only in their initial weights or their architectures. The baseline accuracies on clean data (unperturbed data) are listed in Table I. The details of the model architectures and the hyper-parameters we used are given in the Appendix. The performance of the networks on MNIST was comparable to state-of-the-art performance [61], but note that the accuracy on CIFAR-10 was much lower for all five networks. The state-of-the-art accuracy on CIFAR-10 is higher [62], but to achieve this performance, data augmentation or additional dropout must be used. In the context of adversarial robustness, researchers are typically concerned with the original data, and our test accuracy is very close to the state-of-the-art validation accuracy obtained without any data augmentation [63]. We did not attempt to increase this number through tuning hyper-parameters or any of the many other techniques available, as we wanted to use a typical convolution structure (based on the well-studied LeNet [64]) that is commonly used in other studies and training approaches identical to those presented in [36] and [14], to make it easy for others to compare with or replicate our work.

B. Crafting of adversarial examples using TUPs
Success rate.
To evaluate the attack performance of the proposed algorithm, we report the success rate, which is defined as the proportion of samples that are misclassified as target class $t$ when perturbed by our perturbation, on CIFAR-10 and MNIST (Fig. 3). For all of the model architectures, results are reported on the set $X$, which was randomly selected from the training sets of CIFAR-10 and MNIST to compute the perturbation, and on a validation set that had never been used during the process of computing the perturbation; the same sizes of $X$ and of the validation set were used for both CIFAR-10 and MNIST. As can be seen, the perturbations achieved quite high success rates under all sets of conditions, although there are some differences in the success rates because of differences in architecture, target classes, datasets, and parameter selection, which we discuss below. Notably, these results demonstrate the universality property, namely, that any image in the validation set can be used to fool the classifier into misclassifying it as a target class $t$ (different from its source class) by the mere addition of the TUP perturbation computed on another, disjoint set. Fig. 1 illustrates images before and after perturbation by TUPs; note that in most cases, the perturbations are nearly imperceptible. We display these perturbations in Fig. 4, where the patterns of the perturbations are clearly shown and are seen to contain distinct semantic information.

Let $N$ be the number of images in set $X$, representing the size of $X$. In all of the above experiments, the values of $k$ and $\eta$ used for CIFAR-10 and MNIST were chosen empirically. Although these parameter values worked well enough, we explored further to learn whether there might be different options for other situations. The effect of the parameter values was evaluated on the baseline network Classifier Cp, and some of these results are shown in Fig. 5. Please note that, in order to reduce the amount of calculation required, we chose a smaller set $X$, which included 3,000 CIFAR-10 images, to compute the adversarial perturbations, and selected the target frog (class 6, chosen randomly from the ten classes) to use as an example. Using a larger value for the projection step size $k$ results in fewer projection operations. Thus, it is natural to hypothesize that the success rate will decrease as $k$ increases. The results displayed in Fig. 5(a) do not violate our intuition: With $k = 100$, the great majority of the examples in the validation set disjoint from $X$ were classified incorrectly as the target class frog, whereas when $k$ increased to 3,000 (equal to the size of $X$), the attack success rate decreased only modestly. In contrast with this modest decrease in the attack success rate, it is surprising to see that the calculation time decreased dramatically as $k$ increased. When $k = 3000$, the computation required only slightly more than half the time needed for $k = 100$. Therefore, as long as one is not pursuing extremely high performance in terms of the success rate, a larger value of $k$ is not a bad choice, because it can greatly reduce the time complexity of the algorithm.

The effect of the parameter $\eta$, the radius of the $\ell_\infty$ ball onto which the perturbation is projected during the computation, is rather interesting. We varied $\eta$ between 0 and 0.5 and found that the success rate first increased linearly with $\eta$ and then plateaued at larger values; clearly, therefore, increasing $\eta$ increased the attack success rate of a TUP perturbation, as displayed in Fig. 5(b).
It should be noted that the method proposed in Algorithm 1 is not theoretically guaranteed to converge to the optimal solution, as it operates in a greedy way. When $\eta$ was chosen to be very small, the success rate oscillated back and forth far below the desired performance, and we observed that the smaller the value, the more violent the oscillation, and thus the more difficult the convergence.

Fig. 3: Success rates of TUP adversarial examples on X and the disjoint validation set for targeted attacks of each target class (from 0 to 9). Left column: Success rate of attacks against the five networks on CIFAR-10; (a)–(e) correspond to models Classifier Cp, Classifier C0, Classifier C1, Classifier C2, and Classifier C3, respectively. Right column: Success rate of attacks against the five networks on MNIST; (f)–(j) correspond to models Classifier Mp, Classifier M0, Classifier M1, Classifier M2, and Classifier M3, respectively.

Fig. 4: Perturbations computed by the TUP method for CIFAR-10. The ten classes shown are the target classes chosen for the respective attack. The pixel values of the perturbations are scaled for visibility. In order to show the semantic information carried by the perturbation more clearly, two randomly selected images from the training set are displayed for each target class. Same-colored boxes on the perturbation images and the sample images indicate the same semantic concept.

Fig. 5: Effect of the values of parameters k and η on attack success rate.

We compared the proposed TUP method with the most well-known version of FGSM [11] on the baseline model Classifier Cp. We used Cleverhans [65] to re-implement the "target class" variation [18] of FGSM, as the TUP method can be used to launch a targeted attack. We generated TUP adversarial examples and FGSM adversarial examples for each source–target pair on CIFAR-10. In Fig. 6, the left column represents the number of successful untargeted attacks out of the attempted attacks for each source–target pair, and the right column represents the number of successful targeted attacks. The first row corresponds to the TUP attack, and the second row to the FGSM attack. As shown by the heat maps, the TUP method had high success rates in both targeted and untargeted attacks, whereas FGSM only achieved a comparable success rate in the untargeted attacks, performing poorly in the targeted attacks.

Fig. 6: Heat maps of the number of times an attack was successful for the corresponding source–target class pair, for both targeted and untargeted attacks by TUP and FGSM. (a) TUP untargeted attacks, (b) TUP targeted attacks, (c) FGSM untargeted attacks, and (d) FGSM targeted attacks.

The number of successful TUP attacks was almost evenly distributed across each source–target class pair, and the heat maps for TUP are almost symmetric. This means that for two classes A and B, perturbing images from A to B is approximately as difficult as perturbing from B to A for a TUP attack. For an FGSM attack, however, there exist some specific source–target class pairs that are much more vulnerable than others in both targeted and untargeted attacks. This indicates that the TUP method has found a universal way to perturb the inputs in a certain direction as specified by the target class, whereas FGSM is inclined to perturb the original images in the direction of some vulnerable target class shared by many data points.
Cross-model transferability.
Previous work demonstrated the transferability property of adversarial examples, that is, that adversarial examples crafted to mislead one model can affect other models provided they are trained to perform the same task, even if their architectures are different or their training sets are disjoint [10], [66], [67]. To measure the cross-model transferability of perturbations crafted by the TUP method, i.e., the extent to which the perturbations computed for a specific architecture are effective for another, we computed perturbations for each architecture on both MNIST and CIFAR-10 for each target class and fed the addition of each universal perturbation to the other networks. We report the average attack success rate over the ten target classes on all other architectures for the same dataset in Table II. The perturbations achieved appreciable average cross-model success rates on both MNIST and CIFAR-10, with the best cases for each model shown in bold in Table II. We observed that the perturbations computed for different architectures had discrepant transfer capabilities across the other architectures; for example, the perturbations computed for Classifier Cp generalized better than those computed for the other CIFAR-10 architectures, and one of the MNIST architectures showed a similar advantage. The bold numbers in Table II represent the highest cross-model success rate of the TUPs calculated on each model. These results show that the TUPs we created do transfer to some extent across models, thereby demonstrating that our TUP perturbations are not an artifact of a specific network nor of a particular selection of training set but have a degree of universality with respect to both data points and architectures.

TABLE II: Cross-model success rates (%) on CIFAR-10 and MNIST. Rows indicate the architecture for which the TUPs were computed, and columns indicate the architecture for which the success rate is reported. The maximum value in each row is shown in bold font.

TABLE III: Success rates (%) corresponding to different sizes N for set X on MNIST and CIFAR-10.

Dataset     N = 100   200     300     400     500     600     700     800     900
MNIST       81.79     83.57   89.16   92.02   93.32   95.06   95.36   96.08   96.08
CIFAR-10    69.37     88.88   92.45   94.93   95.09   96.69   97.54   97.90   98.63

Size of set X.

As described previously, each of the TUPs above was computed for a set X consisting of a random selection of examples from the training set (excluding images that were originally classified as class t). Is such a large set X necessary to achieve similar attack success rates? The answer to this question may allow the TUP method to be made more practical. Using a smaller set X allows a more realistic assumption regarding the attacker's access to data, that is, that the attacker has access to only a subset of the training data rather than full access to any examples that were used in training the target model. Meanwhile, using a smaller set X makes the algorithm faster.

Table III shows the success rates on the validation sets created with TUPs computed on variously sized subsets of the training set. We repeated the experiment ten times for each set X, each time randomly selecting the attack target t from the ten classes of CIFAR-10 and MNIST; we report the average results for the ten trials for each X. To eliminate the effect of different projection steps and focus on the influence of the size of set X, the projection operator was omitted here; although this does cause the success rate to be higher than was shown before (at the expense of the quality of the perturbed images), it does not affect the trend of the change in success rate as the size of X is varied.

Fig. 7: Some TUP perturbations generated on sets X of different sizes, using four randomly selected classes as examples. Columns (a)–(e) correspond to different values of N, with (a) corresponding to N = 1000.

We might expect perturbations computed on higher numbers of data samples to result in higher attack success rates, and this holds true when set X is small. With a set X containing just 100 CIFAR-10 images, the attack was successful for 69.37% of the images on the validation set, and when perturbations were computed on 100 MNIST images, the attack succeeded in more than 80% of cases. When set X was expanded to several hundred samples, the perturbation computed on X fooled more than 90% of the validation images on both CIFAR-10 and MNIST, and the success rates did not change much after that. This surprising result suggests that the proposed method is able to extract a large amount of useful information from a very small dataset.

To illustrate this observation, we display the perturbations of four randomly selected target classes on CIFAR-10 corresponding to different sizes of set X in Fig. 7. As the images show, explicit and rich semantic information was captured by perturbations computed on a very small X; the perturbations differed only slightly when computed on X sets of different sizes. This hints that the structure of the dataset is quite meaningful in the construction of TUPs, whereas the quantity of data has no sizable effect.

C. Effect of region adversarial training on adversarial robustness
We now examine the effect of region adversarial training with perturbed examples on the baseline models Classifier Cp and Classifier Mp. We used the TUP perturbations computed for the networks Classifier Cp and Classifier Mp (described in Section V-B and presented in Fig. 3) and performed region adversarial training according to the method described in Section IV-B. Specifically, we included the adversarial counterparts of the original data during training through the simple addition of a targeted TUP perturbation to all of the clean examples classified as classes other than the target class t by the attacked network. Then, we retrained the two baseline models for a fixed number of epochs. We report the classification accuracy on the perturbed adversarial examples of the test set in Fig. 8. We observe that although the accuracy was not as high as that attained on the clean dataset, the use of region adversarial training did greatly improve the classification accuracy on adversarial examples compared with the accuracy before retraining. As Fig. 8(a) and Fig. 8(c) show, the accuracy on the perturbed test set rose substantially for each class in CIFAR-10, from the much lower accuracies observed for all classes before retraining, and for MNIST the accuracy likewise increased greatly for each class compared with the accuracy before retraining.

Another exciting finding is that region adversarial training using the TUPs not only strengthens the adversarial robustness to TUP perturbations themselves but is also effective against other adversarial attacks, such as the most well-known attack method, FGSM [11]. We generated targeted FGSM adversarial examples for each class in the CIFAR-10 and MNIST validation sets. The accuracies on these FGSM adversarial examples before and after region adversarial training with TUPs on the two models Classifier Cp and Classifier Mp are reported in Fig. 8(b) and Fig. 8(d). The results show that the networks trained using region adversarial training exhibited greater robustness properties that were not limited to the perturbations the models saw during retraining; their ability to correctly classify the unseen FGSM patterns was also greatly improved, on both CIFAR-10 and MNIST. Note that in all of the experiments reported in this paper, better results were obtained on MNIST than on CIFAR-10. One key reason is that the models we trained in this study perform better on the clean MNIST dataset than on clean CIFAR-10 (as explained in Section V-A); thus, we could say that the model Classifier Mp is more powerful than Classifier Cp when they are performing their respective tasks. On the other hand, the MNIST dataset contains only black and white images, which have a pure background. In addition, in order to provide heuristic comparisons, we were more conservative in the parameter selections for CIFAR-10.

One question that remains is the following: Given that a TUP attack on a very small set can be quite powerful (as we have demonstrated), can region adversarial training with such attacks still improve robustness further? To investigate this issue, we measured the accuracies on the test set of CIFAR-10 against TUP and FGSM attacks after region adversarial training with TUPs computed on X sets of different sizes for Classifier Cp; these are reported in Fig. 9. For comparison, the results after adversarial training [11] with FGSM are also shown in Fig. 9. The original accuracies (before retraining) are shown in Table IV.
Note that only the number of images (from the training set) needed to generate adversarial examples for adversarial training has been changed; the final accuracy was calculated on the test set. As FGSM computes perturbations on a single image at a time, whereas TUP computes an image-agnostic perturbation and then simply adds the perturbation to the clean input, the accuracies for FGSM shown in Table IV do not change with N.

Fig. 8: Comparison of accuracy before and after region adversarial training based on TUP against TUP and FGSM attacks.

TABLE IV: Accuracy (%) on the test set against TUP and FGSM adversarial examples corresponding to different sizes of set X for the ten target classes.

Fig. 9: Accuracy against TUP and FGSM attacks on the test set before and after region adversarial training based on TUP (red) or classical adversarial training based on FGSM perturbations (blue).

From the results shown, we find that region adversarial training based on the TUP algorithm offers comparative advantages in improving adversarial robustness through heuristics-based techniques. Firstly, both region adversarial training (RAT) based on TUP and adversarial training (AT) based on FGSM improve the test accuracy on the adversarial patterns they used during the retraining. In all cases of the ten target classes, however, RAT improves the accuracy on TUPs more than AT improves the accuracy on FGSM. Secondly, RAT also improves the robustness of the network against FGSM; in fact, there is not much of a gap between RAT and AT in their improvement of FGSM accuracy. By contrast, AT improves the accuracy on TUP much less than does RAT. Thirdly, the accuracy–N curves for RAT are flatter than those for AT on both TUP and FGSM; this indicates that only a small number of samples are needed for the RAT method to achieve good results in enhancing adversarial robustness. This might be because of the difference in the principles of the two methods: AT hopes that the network can extract the omitted information from a large number of adversarial examples on its own during retraining, whereas RAT uses the missing semantic information near the classification boundary to guide the network's training.
VI. CONCLUSION
In summary, the method proposed in this paper improves the adversarial robustness of deep neural networks by emphasizing to deep models the missed semantic information of the region around the decision boundary. Our research builds on recent research on the generation of image-agnostic universal adversarial perturbations to fool deep neural networks, but it does so with attention to two entirely different goals: to have the perturbations extract the unlearned semantic information of a specific region in the manifold represented by a network and to use them to enhance the robustness of the network. We have proposed an algorithm named TUP to extract this information that the model has not yet learned but that is essential for correctly classifying the adversarial examples. The algorithm uses an iterative process on a subset of the training set to obtain a universal property across inputs, as many previous algorithms have done, but we interfered with the iterative process to push it toward the region corresponding to a specified target class. Furthermore, to enhance adversarial robustness, we designed region adversarial training based on the TUP perturbations. Experimental results on two datasets and ten classifiers show that region adversarial training based on the TUP algorithm not only improves robustness against TUPs but also markedly improves robustness against FGSM perturbations.

The TUP algorithm uses just a few training samples to effectively extract the semantic information obscured by the blind spots of the deep models, and at the same time it provides a powerful adversarial attack method that exhibits transferability across different architectures. The proposed region adversarial training method based on the TUP algorithm offers an efficient way to enhance the robustness of classifiers, especially the robustness of the region corresponding to a specific class, as the perturbation is universal for each class. By simply calculating a TUP perturbation on a very small set and then adding the perturbation to clean images, the method obtains the adversarial examples required for region adversarial training. The proposed approach provides new ideas for enhancing the adversarial robustness of DNNs and can be used as a fast and efficient tool, especially in scenarios where the cost of being attacked in a certain region of the network is much higher than in others. Investigation and theoretical analysis of the geometric correlations between different parts of the decision boundary are left for future work.
ACKNOWLEDGMENT
This work was supported by the National Natural Science Foundation of China (Grant Nos. U19A2081 and 61802270) and the Fundamental Research Funds for the Central Universities (Grant Nos. 2019SCU12069 and SCU2018D018).
REFERENCES

[1] Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 521(7553):436, 2015.
[2] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
[3] Francisco Pérez-Hernández, Siham Tabik, Alberto Lamas, Roberto Olmos, and Francisco Herrera. Object detection binary classifiers methodology based on deep learning to identify small objects handled similarly: Application in video surveillance. Knowledge-Based Systems, page 105590, 2020.
[4] Geoffrey Hinton, Li Deng, Dong Yu, George Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Brian Kingsbury, et al. Deep neural networks for acoustic modeling in speech recognition. IEEE Signal Processing Magazine, 29, 2012.
[5] Gábor Gosztolya. Posterior-thresholding feature extraction for paralinguistic speech classification. Knowledge-Based Systems, page 104943, 2019.
[6] I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural networks. Advances in NIPS, 2014.
[7] Basemah Alshemali and Jugal Kalita. Improving the reliability of deep neural networks in NLP: A review. Knowledge-Based Systems, 191, 2020.
[8] Andre Esteva, Brett Kuprel, Roberto A. Novoa, Justin Ko, Susan M. Swetter, Helen M. Blau, and Sebastian Thrun. Dermatologist-level classification of skin cancer with deep neural networks. Nature, 542(7639):115, 2017.
[9] Kurt Hornik, Maxwell B. Stinchcombe, and Halbert White. Multilayer feedforward networks are universal approximators. Neural Networks, 2(5):359–366, 1989.
[10] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.
[11] Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.
[12] Wei Liu, Zhiming Luo, and Shaozi Li. Improving deep ensemble vehicle classification by using selected adversarial samples. Knowledge-Based Systems, 160:167–175, 2018.
[13] Aditi Raghunathan, Jacob Steinhardt, and Percy Liang. Certified defenses against adversarial examples. arXiv preprint arXiv:1801.09344, 2018.
[14] Nicholas Carlini and David Wagner. Towards evaluating the robustness of neural networks. In 2017 IEEE Symposium on Security and Privacy (SP), pages 39–57. IEEE, 2017.
[15] Johannes Stallkamp, Marc Schlipsing, Jan Salmen, and Christian Igel. Man vs. computer: Benchmarking machine learning algorithms for traffic sign recognition. Neural Networks, 32:323–332, 2012.
[16] Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, Omar Fawzi, and Pascal Frossard. Universal adversarial perturbations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1765–1773, 2017.
[17] Yann LeCun. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/, 1998.
[18] Alexey Kurakin, Ian Goodfellow, and Samy Bengio. Adversarial machine learning at scale. arXiv preprint arXiv:1611.01236, 2016.
[19] Alexey Kurakin, Ian Goodfellow, and Samy Bengio. Adversarial examples in the physical world. arXiv preprint arXiv:1607.02533, 2016.
[20] Nicolas Papernot, Patrick McDaniel, Somesh Jha, Matt Fredrikson, Z. Berkay Celik, and Ananthram Swami. The limitations of deep learning in adversarial settings. In 2016 IEEE European Symposium on Security and Privacy (EuroS&P), pages 372–387. IEEE, 2016.
[21] Jiawei Su, Danilo Vasconcellos Vargas, and Kouichi Sakurai. One pixel attack for fooling deep neural networks. IEEE Transactions on Evolutionary Computation, 23(5):828–841, 2019.
[22] Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, and Pascal Frossard. DeepFool: a simple and accurate method to fool deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2574–2582, 2016.
[23] Sayantan Sarkar, Ankan Bansal, Upal Mahbub, and Rama Chellappa. UPSET and ANGRI: Breaking high performance image classifiers. arXiv preprint arXiv:1707.01159, 2017.
[24] Vincent Tjeng, Kai Y. Xiao, and Russ Tedrake. Evaluating robustness of neural networks with mixed integer programming. In International Conference on Learning Representations, 2019.
[25] Shiqi Wang, Kexin Pei, Justin Whitehouse, Junfeng Yang, and Suman Jana. Formal security analysis of neural networks using symbolic intervals. In USENIX Security Symposium (USENIX Security 18), pages 1599–1614, 2018.
[26] Timon Gehr, Matthew Mirman, Dana Drachsler-Cohen, Petar Tsankov, Swarat Chaudhuri, and Martin Vechev. AI2: Safety and robustness certification of neural networks with abstract interpretation. In 2018 IEEE Symposium on Security and Privacy (SP), pages 3–18. IEEE, 2018.
[27] Eric Wong and Zico Kolter. Provable defenses against adversarial examples via the convex outer adversarial polytope. In International Conference on Machine Learning, pages 5286–5295, 2018.
[28] Gagandeep Singh, Timon Gehr, Matthew Mirman, Markus Püschel, and Martin Vechev. Fast and effective robustness certification. In
Advancesin Neural Information Processing Systems , pages 10802–10813, 2018.[29] Lily Weng, Huan Zhang, Hongge Chen, Zhao Song, Cho-Jui Hsieh,Luca Daniel, Duane Boning, and Inderjit Dhillon. Towards fastcomputation of certified robustness for relu networks. In
InternationalConference on Machine Learning , pages 5276–5285, 2018.[30] Huan Zhang, Tsui-Wei Weng, Pin-Yu Chen, Cho-Jui Hsieh, and LucaDaniel. Efficient neural network robustness certification with generalactivation functions. In
Advances in neural information processingsystems , pages 4939–4948, 2018.[31] Shixiang Gu and Luca Rigazio. Towards deep neural network architec-tures robust to adversarial examples. arXiv preprint arXiv:1412.5068 ,2014.[32] Salah Rifai, Pascal Vincent, Xavier Muller, Xavier Glorot, and YoshuaBengio. Contractive auto-encoders: Explicit invariance during featureextraction. In
In International Conference on Machine Learning .Citeseer, 2011.[33] Andrew Slavin Ross and Finale Doshi-Velez. Improving the adversarialrobustness and interpretability of deep neural networks by regularizingtheir input gradients. In
Thirty-second AAAI conference on artificialintelligence , 2018.[34] Chunchuan Lyu, Kaizhu Huang, and Hai-Ning Liang. A unifiedgradient regularization family for adversarial examples. In , pages 301–309. IEEE, 2015.[35] Linh Nguyen, Sky Wang, and Arunesh Sinha. A learning and maskingapproach to secure learning. In
International Conference on Decisionand Game Theory for Security , pages 453–464. Springer, 2018.[36] Nicolas Papernot, Patrick McDaniel, Xi Wu, Somesh Jha, and Anan-thram Swami. Distillation as a defense to adversarial perturbationsagainst deep neural networks. In , pages 582–597. IEEE, 2016.[37] Aran Nayebi and Surya Ganguli. Biologically inspired protec-tion of deep networks from adversarial attacks. arXiv preprintarXiv:1703.09202 , 2017.[38] Dmitry Krotov and John Hopfield. Dense associative memory is robustto adversarial inputs.
Neural computation , 30(12):3151–3167, 2018.[39] Moustapha Cisse, Yossi Adi, Natalia Neverova, and Joseph Keshet.Houdini: Fooling deep structured prediction models. arXiv preprintarXiv:1707.05373 , 2017.[40] Ji Gao, Beilun Wang, Zeming Lin, Weilin Xu, and Yanjun Qi. Deep-cloak: Masking deep neural network models for robustness againstadversarial samples. arXiv preprint arXiv:1702.06763 , 2017.[41] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, DimitrisTsipras, and Adrian Vladu. Towards deep learning models resistantto adversarial attacks. arXiv preprint arXiv:1706.06083 , 2017.[42] Taesik Na, Jong Hwan Ko, and Saibal Mukhopadhyay. Cascadeadversarial machine learning regularized with a unified embedding. arXiv preprint arXiv:1708.02582 , 2017.[43] Naveed Akhtar, Jian Liu, and Ajmal Mian. Defense against universaladversarial perturbations. In
Proceedings of the IEEE Conference onComputer Vision and Pattern Recognition , pages 3389–3398, 2018.[44] Weilin Xu, David Evans, and Yanjun Qi. Feature squeezing: De-tecting adversarial examples in deep neural networks. arXiv preprintarXiv:1704.01155 , 2017.[45] Shiwei Shen, Guoqing Jin, Ke Gao, and Yongdong Zhang. Ape-gan: Adversarial perturbation elimination with gan. arXiv preprintarXiv:1707.05474 , 2017.[46] Hyeungill Lee, Sungyeob Han, and Jungwoo Lee. Generative adver-sarial trainer: Defense to adversarial perturbations with gan. arXivpreprint arXiv:1705.03387 , 2017.[47] Swami Sankaranarayanan, Arpit Jain, Rama Chellappa, and Ser NamLim. Regularizing deep networks using efficient layerwise adversarialtraining. In
Thirty-Second AAAI Conference on Artificial Intelligence ,2018.[48] Takeru Miyato, Andrew M Dai, and Ian Goodfellow. Adversarialtraining methods for semi-supervised text classification. arXiv preprintarXiv:1605.07725 , 2016. [49] Stephan Zheng, Yang Song, Thomas Leung, and Ian Goodfellow.Improving the robustness of deep neural networks via stability training.In
Proceedings of the ieee conference on computer vision and patternrecognition , pages 4480–4488, 2016.[50] Gintare Karolina Dziugaite, Zoubin Ghahramani, and Daniel M Roy.A study of the effect of jpg compression on adversarial images. arXivpreprint arXiv:1608.00853 , 2016.[51] Chuan Guo, Mayank Rana, Moustapha Cisse, and Laurens VanDer Maaten. Countering adversarial images using input transforma-tions. arXiv preprint arXiv:1711.00117 , 2017.[52] Nilaksh Das, Madhuri Shanbhogue, Shang-Tse Chen, Fred Hohman,Li Chen, Michael E Kounavis, and Duen Horng Chau. Keeping thebad guys out: Protecting and vaccinating deep learning with jpegcompression. arXiv preprint arXiv:1705.02900 , 2017.[53] Yan Luo, Xavier Boix, Gemma Roig, Tomaso Poggio, and Qi Zhao.Foveation-based mechanisms alleviate adversarial examples. arXivpreprint arXiv:1511.06292 , 2015.[54] Konda Reddy Mopuri, Utsav Garg, and R Venkatesh Babu. Fastfeature fool: A data independent approach to universal adversarialperturbations. arXiv preprint arXiv:1707.05572 , 2017.[55] Uri Shaham, Yutaro Yamada, and Sahand Negahban. Understandingadversarial training: Increasing local stability of supervised modelsthrough robust optimization.
Neurocomputing , 307:195–204, 2018.[56] Alexey Kurakin, Ian Goodfellow, Samy Bengio, Yinpeng Dong,Fangzhou Liao, Ming Liang, Tianyu Pang, Jun Zhu, Xiaolin Hu, CihangXie, et al. Adversarial attacks and defences competition.
ComputerVision and Pattern Recognition , pages 195–231, 2018.[57] Naveed Akhtar and Ajmal Mian. Threat of adversarial attacks on deeplearning in computer vision: A survey.
Computer Vision and PatternRecognition , 2018.[58] Nicolas Papernot and Patrick McDaniel. On the effectiveness ofdefensive distillation. arXiv preprint arXiv:1607.05113 , 2016.[59] David Warde-Farley and Ian Goodfellow. adversarial perturbations ofdeep neural networks.
Perturbations, Optimization, and Statistics , 311,2016.[60] B Guttman and E Roback. An introduction to computer security : Thenist handbook.
Natl Inst of Standards & Technology Special PublicationSp , 27(1):3–18, 1995.[61] Dan Cirean, Ueli Meier, and Juergen Schmidhuber. Multi-column deepneural networks for image classification. In
Computer Vision & PatternRecognition , 2012.[62] Benjamin Graham. Fractional max-pooling.
ArXiv , abs/1412.6071,2014.[63] Julien Mairal, Piotr Koniusz, Zaid Harchaoui, and Cordelia Schmid.Convolutional kernel networks.
Advances in Neural InformationProcessing Systems , pages 2627–2635, 2014.[64] Yann LeCun, Patrick Haffner, L´eon Bottou, and Yoshua Bengio. Objectrecognition with gradient-based learning. In
Shape, Contour andGrouping in Computer Vision , 1999.[65] Nicolas Papernot, Fartash Faghri, Nicholas Carlini, Ian Goodfellow,Reuben Feinman, Alexey Kurakin, Cihang Xie, Yash Sharma, TomBrown, Aurko Roy, Alexander Matyasko, Vahid Behzadan, KarenHambardzumyan, Zhishuai Zhang, Yi-Lin Juang, Zhi Li, Ryan Sheats-ley, Abhibhav Garg, Jonathan Uesato, Willi Gierke, Yinpeng Dong,David Berthelot, Paul Hendricks, Jonas Rauber, and Rujun Long.Technical report on the cleverhans v2.1.0 adversarial examples library. arXiv preprint arXiv:1610.00768 , 2018.[66] Nicolas Papernot, Patrick McDaniel, and Ian Goodfellow. Transferabil-ity in machine learning: from phenomena to black-box attacks usingadversarial samples. arXiv preprint arXiv:1605.07277 , 2016.[67] Nicolas Papernot, Patrick McDaniel, Ian Goodfellow, Somesh Jha,Z Berkay Celik, and Ananthram Swami. Practical black-box attacksagainst machine learning. In