[PDF] Adversarial Perturbations Are Not So Weird: Entanglement of Robust and Non-Robust Features in Neural Network Classifiers

Abstract

Neural networks trained on visual data are well-known to be vulnerable to often imperceptible adversarial perturbations. The reasons for this vulnerability are still being debated in the literature. Recently Ilyas et al. (2019) showed that this vulnerability arises, in part, because neural network classifiers rely on highly predictive but brittle "non-robust" features. In this paper we extend the work of Ilyas et al. by investigating the nature of the input patterns that give rise to these features. In particular, we hypothesize that in a neural network trained in a standard way, non-robust features respond to small, "non-semantic" patterns that are typically entangled with larger, robust patterns, known to be more human-interpretable, as opposed to solely responding to statistical artifacts in a dataset. Thus, adversarial examples can be formed via minimal perturbations to these small, entangled patterns. In addition, we demonstrate a corollary of our hypothesis: robust classifiers are more effective than standard (non-robust) ones as a source for generating transferable adversarial examples in both the untargeted and targeted settings. The results we present in this paper provide new insight into the nature of the non-robust features responsible for adversarial vulnerability of neural network classifiers.

Full PDF

AAdversarial Perturbations Are Not So Weird: Entanglement of Robust andNon-Robust Features in Neural Network Classiﬁers

Jacob M. Springer Melanie Mitchell Garrett T. Kenyon Abstract

Neural networks trained on visual data are well-known to be vulnerable to often imperceptibleadversarial perturbations. The reasons for thisvulnerability are still being debated in the litera-ture. Recently Ilyas et al. (2019) showed that thisvulnerability arises, in part, because neural net-work classiﬁers rely on highly predictive but brit-tle “non-robust” features. In this paper we extendthe work of Ilyas et al. by investigating the natureof the input patterns that give rise to these features.In particular, we hypothesize that in a neural net-work trained in a standard way, non-robust fea-tures respond to small, “non-semantic” patternsthat are typically entangled with larger, robust pat-terns, known to be more human-interpretable, asopposed to solely responding to statistical artifactsin a dataset. Thus, adversarial examples can beformed via minimal perturbations to these small,entangled patterns. In addition, we demonstrate acorollary of our hypothesis: robust classiﬁers aremore effective than standard (non-robust) ones asa source for generating transferable adversarialexamples in both the untargeted and targeted set-tings. The results we present in this paper providenew insight into the nature of the non-robust fea-tures responsible for adversarial vulnerability ofneural network classiﬁers.

1. Introduction

It is well-known that neural network classiﬁers trained onvisual data are susceptible to adversarial examples—imagesthat have been minimally perturbed so as to look unchangedto humans but are classiﬁed incorrectly, even though theoriginal image is correctly classiﬁed. Many explanationshave been offered for this susceptibility as well as for thetransferability of adversarial examples across network ar- Los Alamos National Laboratory, Los Alamos, NM Santa FeInstitute, Santa Fe, NM. Correspondence to: Jacob M. Springer.Preprint. chitectures and even training sets; however, the ML com-munity’s understanding of these phenomena remains incom-plete.In this paper we extend the work of Ilyas et al. (2019) onunderstanding these phenomena as the result of highly pre-dictive but brittle non-robust features that are learned byneural networks undergoing standard supervised training.Ilyas et al. proposed that adversarial examples are not “bugs”resulting from properties of a network that cause odd be-havior on examples that are off the training manifold, butrather that they are due to learned “features” that respond topredictive—yet non-interpretable—patterns in the dataset.Ilyas et al. showed that such non-robust features correspondto patterns that are widespread in the data. Moreover, theyshowed that these non-robust features can account for muchif not all of the high accuracy of neural networks on im-age datasets such as CIFAR-10 and ImageNet-9, and thatperturbations targeting these non-robust features allow fortransferable adversarial attacks.Here, we investigate the nature of these non-robust features.In particular, we give empirical evidence that in a neuralnetwork trained in a standard way, predictive but non-robustfeatures respond to small patterns that are typically entan-gled with larger, human-interpretable patterns, as opposedto responding to separate statistical artifacts in a dataset.We argue that the non-robust features of neural networksthat can be exploited to construct seemingly uninterpretableadversarial perturbations in fact rely on the same underlyingpatterns captured by more robust features. While we usemany of the experimental methods proposed by Ilyas et al.(2019) as tools, the question which we seek to answer andour results are distinct: we identify a relationship betweenrobust and non-robust features and then demonstrate thatour ﬁnding can motivate highly effective targeted transfer-able adversarial examples. Ilyas et al., on the other hand,consider robust and non-robust features separately.By demonstrating that a classiﬁer for which only non-robustfeatures are useful can achieve well-above-chance accuracyon a test set in which only robust features are useful, weconﬁrm that networks learn non-robust features that areclosely entangled with robust features. Robust features havebeen shown to be substantially more interpretable than non- a r X i v : . [ c s . L G ] F e b dversarial Perturbations Are Not So Weird: Entanglement of Robust and Non-Robust Features in Neural Network ClassiﬁersType A: Robust features. These fea-tures commonly respond to human-interpretable (“semantic”) patterns(here of a dog or a cat).

Type B:

Non-robust features that respond tosmall yet highly predictive patterns that bythemselves appear non-semantic, yet are en-tangled with patterns associated with robustfeatures (e.g., in (a)).

Type C:

Non-robust features that respond tohighly predictive patterns that are artifacts inthe dataset, and are independent of robust fea-tures.

Figure 1.

A sketch of the possible relationships between robust and non-robust features. Note that this is oversimpliﬁed as non-robustfeatures could respond to combinations of (b) and (c). robust features (Engstrom et al., 2019b; Kaur et al., 2019;Santurkar et al., 2019). By establishing the relationshipbetween non-robust and robust features, we argue that non-robust features may be more interpretable than they appearthrough the lens of adversarial perturbations. In other words,non-robust features may not be so weird, after all.

The results of this paper have important implications. Weexplain conceptually why adversarial perturbations mightappear non-semantic yet are related to semantic features.We present and test a corollary of our results: transferableexamples (targeted and untargeted) can be generated moreeffectively by using robust classiﬁers rather than standardclassiﬁers. Finally, our results provide new insights into thenature of adversarial vulnerabilities, and suggest directionsof future research.

2. Terminology

In this section, following Ilyas et al. (2019), we deﬁne sev-eral important terms, in particular the notions of robust and non-robust features.Deep learning classiﬁers, in particular convolutional neuralnetworks (CNNs) are typically composed of a sequence ofnon-linear transformations called layers, the result of whichis fed through a ﬁnal linear classiﬁer layer to select a class(Krizhevsky et al., 2012; LeCun et al., 1998). We refer tothe penultimate layer as the representation layer , denoted rep ( x ) , where x is the n dimensional input to the network(e.g., an image). We refer to each unit of the representationlayer as a feature that the network computes for classiﬁca-tion. Each feature f : R n → R maps the input to a real number—the feature’s activation, given the input.We adapt the following deﬁnitions from Ilyas et al. (2019),who considered the binary classiﬁcation case, in whichthe dataset D consists of pairs ( x, y ) ∈ D , x ∈ R n , y ∈{ , − } , and the ﬁnal layer outputs either or − .• A feature f is useful for a dataset D when there exists a ρ > such that, E ( x,y ) ∈D [ y · f ( x )] ≥ ρ, or, more intuitively, when f is correlated with the class y of an input x .• For a given input x , adversarial example ˆ x = x + δ , and ε > , δ is said to be a permissible perturbation when (cid:107) δ (cid:107) < ε .• A feature f is robust when, for some γ > , E ( x,y ) ∈D [ min (cid:107) δ (cid:107) ≤ ε y · f ( x + δ )] ≥ γ, More intuitively, a feature is robust when it is correlatedwith the class y even under worst-case permissible pertur-bations of the input x .• For simplicity, a feature is said to be non-robust when it isuseful but not robust . (Here we do not consider non-usefulfeatures.)A feature is a property of a classiﬁer, and describes a wayin which the classiﬁer measures information in the input.However, it will also be conceptually useful to refer to theinformation itself. Thus we deﬁne the closely related notionof a pattern P ⊂ R n , a subset of inputs. We say that animage x contains a pattern P when x ∈ P . For example, dversarial Perturbations Are Not So Weird: Entanglement of Robust and Non-Robust Features in Neural Network Classiﬁers a pattern of stop signs would be the set of all images thatcontain a stop sign.We are primarily concerned with the relationship betweenrobust features and non-robust features. The main question we address in this paper is, what is thenature of the non-robust features described by Ilyas et al.(2019)? We hypothesize that many non-robust featureslearned by standard (i.e., non-robust classiﬁers) respond topatterns that are entangled with human-interpretable robust features, rather than responding to dataset artifacts. More-over, these entangled non-robust features can be exploited tocreate adversarial examples. If true, this hypothesis has im-portant implications for defenses against, and transferabilityof, adversarial examples.We illustrate our hypothesis with a conceptual diagram inFigure 1 consisting of three different types of patterns thatfeatures in a network could respond to. Robust featuresrespond to Type A patterns—ones that are interpretable byhumans as giving rise to a particular class. Type C illustratesa type of pattern that non-robust features might respond to:highly predictive non-semantic patterns in the dataset thatare unrelated to the robust features in the image. Thesemight be called “spurious correlations” or “artifacts” in thedataset. Type B illustrates another possibility for non-robustfeatures: they might respond to small but highly predictivecomponents of robust features (e.g., the nose of a dog orthe whiskers of a cat) that could be easily exploited to yieldadversarial examples.We conﬁrm in this paper that Type B features are indeedlearned by CNNs, that these features can, in some cases, ex-plain the high accuracy of CNNs, and that these features canbe exploited in adversarial attacks. Thus the adversarial vul-nerability of CNNs is not necessarily due to non-semanticartifacts, but can be explained in terms of non-robust fea-tures that are entangled with robust features that respond tosemantically meaningful patterns.

3. Methods

In order to test our hypothesis that Type B features arelearned by standard CNN classiﬁers, we use the followingstrategy, inspired by Ilyas et al. (2019). We construct aneural network classiﬁer in which only non-robust featuresare useful. We then construct a new test set for which onlyrobust features are useful for classiﬁcation.We then show that that the classiﬁer achieves substantiallyhigher than chance accuracy on the constructed test set; this seed = 0 = 1/64 = 1/8 = 1 original

Figure 2.

Examples of robustiﬁed images by ε of robust classiﬁer.The rightmost column depicts the original CIFAR-10 image; theleftmost column depicts the seed ( x ) from which the gradientdescent of the robustiﬁcation process began; the rest are robustiﬁedimages. Since the representation layers of less robust classiﬁersare more easily perturbed, robustiﬁed images generated with a lessrobust model (smaller value of ε ) will appear closer to the (random)CIFAR-10 image that seeded the gradient-descent process. (Bestviewed in color.) implies that the classiﬁer must be using non-robust featuresthat are entangled with robust features—namely, Type Bfeatures.The following subsections describe how we construct thisclassiﬁer and test set. We use the term zero-robust clas-siﬁer for the classiﬁer for which only non-robust featuresare useful. To train these classiﬁers, we follow the methodof Ilyas et al. (2019). We ﬁrst construct a training set D zero from our original training set D such that under D zero , allrobust features are non-useful, i.e., totally uncorrelated withthe assigned label, while non-robust features remain useful.By training a standard classiﬁer on this distribution, thestandard classiﬁer should only learn non-robust features.The new training set D zero can be constructed as follows:for each input-label pair ( x, y ) ∈ D , choose a new label ˆ y uniformly at random from the possible labels. Construct apermissible adversarial example ˆ x (i.e., ˆ x = x + δ , where δ is a permissible perturbation) such that ˆ x is classiﬁed as ˆ y .Then, let (ˆ x, ˆ y ) be an element of D zero . Ilyas et al. (2019)argued that there should be almost no correlation betweenrobust features and the assigned labels in this new dataset.Thus, a classiﬁer trained on this dataset should not rely onrobust features. Negative-Robust Classiﬁer: Addressing Problem of“Robust Feature Leakage”.

As pointed out by Engstromet al. (2019a), there may be a small correlation between dversarial Perturbations Are Not So Weird: Entanglement of Robust and Non-Robust Features in Neural Network Classiﬁers robust features and the assigned labels in D zero , due to “ro-bust feature leakage,” which may account for some of theaccuracy of a zero-robust classiﬁer on a robustiﬁed dataset.We control for this possibility with the method proposedby Ilyas et al. (2019): we construct a negative-robust clas-siﬁer, where robust features are anti-correlated with thelabel and non-robust features are correlated with the label.Thus, robust features should actively hurt accuracy and anypositive accuracy can certainly be attributed to non-robustfeatures. We construct a negative-robust classiﬁer similarto the construction of the zero-robust classiﬁer, but while ˆ y is selected uniformly at random to construct D zero , for thenegative-robust classiﬁer we deterministically permute theclass labels, associating each label y with the corresponding ˆ y in the permutation. Then for each ( x, y ) pair in D wecreate (ˆ x, ˆ y ) , where ˆ x is an adversarial example based on x targeting class ˆ y . Using a deterministic permutation of thelabel classes makes robust features anti-correlated with thecorrect label class. In order to construct a test set for which only robust featuresare useful, we follow the approach of Ilyas et al. (2019).First we construct a robust classiﬁer , and then use thatclassiﬁer to construct the desired dataset.

Robust Classiﬁer.

We construct robust classiﬁers via stan-dard adversarial training (Madry et al., 2017). Formally, wewish to ﬁnd classiﬁer parameters θ ∗ that minimize the fol-lowing expression: θ ∗ = arg min θ E ( x,y ) ∈D [ max (cid:107) δ (cid:107) ≤ ε L ( θ, x + δ, y )] where L ( θ, x, y ) computes the cross-entropy loss of theclassiﬁer with parameters θ on the input x and label y . Themagnitude of adversarial examples to which the classiﬁeris robust is characterized by the ε parameter, which wecall the robustness parameter of a classiﬁer. Larger valuesof ε correspond to more robust classiﬁers (i.e., they arerobust to larger perturbations). The special case when ε = 0 corresponds to a standard classiﬁer. We refer to a classiﬁerwith robustness parameter ε as an ε -robust classiﬁer .The above saddle-point optimization problem can be solvedby ﬁrst solving the inner maximization problem with pro-jected gradient ascent and then solving the outer maximiza-tion problem with standard back-propagation. We trainwith projected gradient descent as proposed by Madry et al.(2017). Since this problem is well studied, we defer to priorliterature for discussion of the matter (Carlini et al., 2019;Madry et al., 2017). Test Set Robustiﬁcation.

To construct a test set for whichonly robust features are useful for classiﬁcation, we traina robust classiﬁer C as described above, and use it to cre-ate a new test set ˆ D in which there is a one-to-one map-ping between the elements of ˆ D and the original test set D : ( x, y ) (cid:55)→ (ˆ x, y ) . In particular, an image ˆ x ∈ ˆ D is con-structed so that any feature of C that is useful in classifying x is equally useful in classifying ˆ x , in the sense that usefulfeatures are equally correlated with the label y , and no otherpossible feature is useful. In short, our construction willensure that the only useful features for classifying ˆ x will bethe features of C, which are, by deﬁnition, robust features.Deﬁne rep C ( x ) as the output of the m -dimensional rep-resentation (penultimate) layer of robust classiﬁer C withinput x . Then for r ∈ R m , rep C − ( r ) is a set of imageswith robust features r .We cannot easily compute rep C − ( r ) . However, for each el-ement of x our original test set, we can ﬁnd an approximateinverse by solving the minimization problem min x r (cid:107) rep C ( x r ) − rep C ( x ) (cid:107) . We solve this via gradient descent, starting from an image x . In other words, starting from randomly chosen test-setimage x , we search for an image x r whose representation rep C ( x r ) is as close as possible to rep C ( x ) .Following Ilyas et al. (2019), if we choose the initial image x uniformly from the test set, and the test set has a uniformdistribution over labels, then all features in rep C ( x ) areuncorrelated (in expectation over x ) with x ’s label. Thisensures (in expectation) that any features that respond topatterns in x will not be correlated with x ’s label, y . Sincewe apply gradient descent to x in order to ﬁnd an ˆ x whoserepresentation is similar as possible to C ’s representationfor x , the only features that correlate with y will be those in C , which are robust by deﬁnition. Thus only robust featuresare useful for classifying ˆ x .Following Ilyas et al. (2019), we refer to the inversion pro-cess the robustiﬁcation of images. Given a test set D , we canconstruct a new test set ˆ D by computing this approximationof rep − ( rep ( x )) and labeling it y for every ( x, y ) ∈ D .If robust classiﬁer C has robustness parameter ε > , wesay that ˆ D is the ε - robustiﬁcation of D and will be referredto as R ε . Figure 2 shows examples of the robustiﬁcationprocess on CIFAR-10 test images. We evaluate our methods on two datasets: CIFAR-10(Krizhevsky et al., 2009) and ImageNet-9, a derivative ofImageNet (Russakovsky et al., 2015) with two key differ-ences from the original ImageNet dataset: ﬁrst, we use a dversarial Perturbations Are Not So Weird: Entanglement of Robust and Non-Robust Features in Neural Network Classiﬁers × pixel downsampled version of the ImageNet data forcomputational feasibility (Chrabaszcz et al., 2017), second,adopted from Ilyas et al. (2019), we reduce the number ofclasses to nine: dog, cat, frog, turtle, bird, primate, ﬁsh,crab, insect. See Ilyas et al. (2019) for further details.For the CIFAR-10 dataset, we train ResNet50 classiﬁers(He et al., 2016a;b). However, due to the high computa-tional cost of training robust models on ImageNet-9, wetrain a classiﬁer with fewer parameters, ResNet20, on theImageNet-9 dataset.

4. Results of Experiments

Table 1.

Accuracy of zero-robust, negative-robust, standard, androbust ResNet50 classiﬁers on robustiﬁed CIFAR-10 test set. Allclassiﬁers are trained starting with random initial weights. The

Test column is the accuracy on the original CIFAR-10 test set. The R ε columns give accuracy on robustiﬁed CIFAR-10 test set, generatedwith respect to a robust ResNet50 with robustness parameter ε . Classiﬁer Test R R / R / R / R Standard 94.1 58.3 79.9 75.1 69.2 56.3Robust 86.5 9.93 18.2 70.5 86.5 70.0Zero-robust 46.8 38.8 23.3 22.1 23.4 21.4Neg-robust 21.9 31.8 11.1 12.8 14.3 13.2

Table 2.

Accuracy of zero-robust, negative-robust, standard, androbust ResNet20 classiﬁers on robustiﬁed ImageNet-9 test set.Analogous to Table 1. Note: since there are nine classes, chanceaccuracy is approximately . . Classiﬁer Test R R / R / R / R Standard 94.8 51.9 76.3 63.4 54.6 47.2Robust 82.9 44.1 46.1 49.0 51.7 56.0Zero-robust 49.2 24.8 24.3 24.6 28.6 31.4Neg-robust 38.3 25.4 23.2 21.8 24.4 23.7Recall our deﬁnition of Type B features: Non-robust fea-tures learned by a network that respond to small yet highlypredictive patterns that by themselves appear non-semantic,yet are entangled with patterns associated with robust fea-tures. Our hypothesis is that Type B features are prevalentin standard (i.e., non-robust) image classiﬁers, and can beexploited to create adversarial examples. That is, Type Cfeatures (which respond to dataset artifacts) are not the onlyfeatures responsible for adversarial vulnerability.We present here the result of an experiment in which weevaluate zero-robust and negative-robust classiﬁers, repre-senting classiﬁers that rely exclusively on non-robust fea-tures, on CIFAR-10 and ImageNet-9 test sets that have beenrobustiﬁed—that is, in which only robust features are useful.The results, given in Tables 1 and 2, conﬁrm our hypothesis: the zero-robust classiﬁer performs above chance on all ro-bustiﬁed test sets; this can only occur when Type B featuresare present. Similarly, the negative-robust classiﬁer, whichacts as a control for the problem of robust feature leakage(see Section 3.2), has above-chance accuracy on robustiﬁedtest sets except for when ε is large. We expect this result,since the presence of robust features decreases accuracy innegative-robust classiﬁers. In fact, in the absence of TypeB features, we would expect the negative-robust classiﬁerto perform at below-chance accuracy on the robustiﬁed testsets. As above-chance accuracy by the negative-robust canonly be explained by the presence of Type B features, theaccuracy of the negative-robust classiﬁer serves as a lower-bound on the accuracy for which can be accounted by TypeB features. The fact that the accuracy is above chance formany of the robustiﬁed test sets further conﬁrms our hypoth-esis, and rejects the possibility that robust feature leakageentirely explains the above-chance accuracy of the zero-robust classiﬁer on the robustiﬁed test sets. We include theaccuracies of a standard classiﬁer and an example robustclassiﬁer with robustness parameter ε = 0 . for comparisonwith the zero- and negative-robust classiﬁer results.We observe that the accuracy of zero-robust and negative-robust classiﬁers decreases on the robustiﬁed test sets asthe robustness parameter ε increases. We have already dis-cussed that we expect this in the negative-robust classiﬁerdue to the negative inﬂuence of robust features on accuracy.However, we hypothesize that the reason zero-robust ac-curacy decreases is similar to that of the well-establishedresult that the accuracy of robust classiﬁers decreases on theoriginal test set as the robustness parameter of the classiﬁerincreases (Tsipras et al., 2019; Zhang et al., 2019). Weconﬁrm this well-known result in Figure 4. We speculatethat as the robustness increases, the classiﬁer begins to ig-nore useful features, even some that are robust for smaller ε .Thus, the robustiﬁed test sets contain fewer useful featuresas ε increases, and thus zero-robust accuracy decreases.

5. The Universality of Robust Features

The previous section established that non-robust featuresrespond to patterns that are entangled with the patterns re-sponded to by robust features. However, there may be manyequally predictive “non-robust” patterns that can be entan-gled with a single “robust” pattern. For example, as il-lustrated in Figure 1, a non-robust Type B feature whichresponds to the texture of a dog nose may be a componentof a robust Type A feature that identiﬁes the shape of anentire dog. However, there may be many components ofa particular robust feature that would equivalently predict“dog”. Standard (non-robust) classiﬁer C might learn a par-ticular subset of predictive non-robust (i.e., Type B) features,whereas another standard classiﬁer C with a different ar- dversarial Perturbations Are Not So Weird: Entanglement of Robust and Non-Robust Features in Neural Network Classiﬁers Robustness parameter A cc u r a c y o n Figure 3.

Accuracy of a standard ResNet50 classiﬁer on robus-tiﬁcations of the CIFAR-10 test dataset, generated with respectto robust ResNet50 classiﬁers with varying robustness parameter ε . See Appendix for the analogous graph for ImageNet-9. (Bestviewed in color.) chitecture or different initial weights might learn a differentsubset, with both classiﬁers exhibiting similar accuracy ona test set. The features of both classiﬁers would be respond-ing to different subpatterns of the same underlying “robust”patterns; in this case, the features learned by C would notstrongly overlap the features learned by C .The features learned by a classiﬁer that tend to overlap withthe features learned by other classiﬁers trained on the samedataset are characterized by universality , as they are roughlythe same across classiﬁers (Olah et al., 2020). Li et al.(2015) called this phenomenon “convergent learning”.We hypothesize that robust features will be more universalthan non-robust features, since universal features likely rep-resent the underlying, better generalizing properties of thedataset that are captured by robust features. Analogous tothe way we demonstrated the existence of Type B featuresentangled with Type A features in Section 3, here we giveevidence for the hypothesis that robust features are moreuniversal than non-robust features by evaluating standardclassiﬁers on robustiﬁed test sets.Recall that for a test set D T that has been robustiﬁed withrespect to a robust classiﬁer C , the only features that areuseful for classiﬁcation are robust features useful to C . Ifa different classiﬁer C (cid:48) has high accuracy on D T , then C (cid:48) must be using the same (or very similar) features as C . Ifmany different classiﬁers have high accuracy on D T , thenwe can say that C ’s robust features are universal . Robustness parameter A cc u r a c y o n o r i g i n a l t e s t d a t a Figure 4.

Accuracy of robust ResNet50 classiﬁers trained on theCIFAR-10 training data and evaluated on the CIFAR-10 originaltest data. See Appendix for the analogous graph for ImageNet-9.(Best viewed in color.)

Figure 3 shows the accuracy of a standard (non-robust)ResNet50 classiﬁer, trained on the CIFAR-10 training set,when tested on robustiﬁed CIFAR-10 test sets with varyingrobustness parameter ε . The ﬁgure shows that the the stan-dard classiﬁer, which we’ll call C (cid:48) , has signiﬁcantly higheraccuracy on robustiﬁed test sets with < ε ≤ / , with apeak at ε = / .To show why this supports the hypothesis that robust fea-tures are more universal than non-robust features, let C ε denote the ε -robust ResNet50 classiﬁer used to create therobustiﬁed test set R ε . Figure 3 shows that the accuracy of C (cid:48) is low on R , which was designed speciﬁcally so thatthe features of C would be useful to classify it. This meansthat C ’s features, which are likely to be mostly non-robustsince ε = 0 , tend not to be shared by C (cid:48) . However, the ac-curacy of C (cid:48) on R ε for < ε ≤ / is dramatically higher,meaning that the features that are useful to classiﬁers C ε arealso useful to C (cid:48) , for ε in that range. Thus the robust fea-tures useful to these C ε tend to be shared by C (cid:48) . Repeatedevaluations with different initial weights for C (cid:48) exhibitedthe same behavior. Thus, the results shown in Figure 3 sup-port the hypothesis that robust features are more universalthan non-robust features.It is noteworthy that accuracy is highest for a relatively smallrobustness parameter of ε = / . We hypothesize that asrobustness increases, while patterns associated with Type Bfeatures may remain present in the robustiﬁed dataset, pat-terns associated with Type C features, which were presentin the original dataset, may be progressively removed, caus- dversarial Perturbations Are Not So Weird: Entanglement of Robust and Non-Robust Features in Neural Network Classiﬁers Robustness parameter of source classifier E rr o r o n o r i g i n a l l a b e l s DenseNet121InceptionV3MobileNetV2VGG16ResNet50V2Xception

Figure 5.

Error rate of different standard classiﬁers evaluated onuntargeted transfer attacks. Each error rate is measured on a setof untargeted adversarial examples generated from the CIFAR-10test set to attack a source classiﬁer (ResNet50) with robustnessparameter ε . Each adversarial example was generated with aperturbation δ where (cid:107) δ (cid:107) < . A higher error corresponds to astronger transfer attack. See Appendix for the analogous graph forImageNet-9. (Best viewed in color.) ing a detriment to accuracy as the robustness parameterincreases signiﬁcantly. Similarly, we speculate that this mayalso explain the well-known decrease in accuracy on origi-nal test data associated with an increase in model robustnessas shown in Figure 4 (Tsipras et al., 2019; Zhang et al.,2019).Our results suggest that even when performance is criticaland the drop in accuracy associated with adversarial trainingis unacceptable, adding even slight robustness ( ε = / ) candrastically improve the universality of the learned featuresof a neural network. Networks can be improved in this waywith little to no additional computational cost due to recentadvances in improving the efﬁciency of adversarial training(Jeddi et al., 2020; Shafahi et al., 2019; Wong et al., 2020).We repeat these experiments using ResNet20 classiﬁerstrained on the ImageNet-9 dataset and ﬁnd similar results,presented in the Appendix. Adversarial examples designed to fool a given classiﬁerare sometimes transferable : they can also successfully foolother classiﬁers, even those with different architectures fromthe original (“source”) classiﬁer. This is true for both un-targeted adversarial examples x u , which are consideredto be successful if a classiﬁer predicts any incorrect labelwhen given x u , and for targeted adversarial examples x t ,which are considered to be successful if a classiﬁer predictsa speciﬁc targeted (incorrect) label y t when given x t . It is Robustness parameter of source classifier A cc u r a c y o n t a r g e t l a b e l s DenseNet121InceptionV3MobileNetV2VGG16ResNet50V2Xception

Figure 6.

Accuracy of different standard classiﬁers on targetedtransfer attacks. Each accuracy is measured as the fraction of timesthe target label y t is predicted for targeted adversarial example x t .The evaluation is done on a set of targeted adversarial examplesgenerated from the CIFAR-10 test set to attack a source ResNet50classiﬁer with robustness parameter ε . Each adversarial examplewas generated with a perturbation δ where (cid:107) δ (cid:107) < . A higheraccuracy corresponds to a stronger transfer attack. See Appendixfor the analogous graph for ImageNet-9. (Best viewed in color.) typically much harder to design transferable targeted thanuntargeted adversarial examples (Liu et al., 2016).Our earlier hypothesis—that robust features are more uni-versal than non-robust features—suggests that adversarialexamples designed to fool a robust classiﬁer transfer (i.e.,that exploit robust features) should be more transferablethan adversarial examples designed to fool a non-robustclassiﬁer. In this section we give evidence for this conclu-sion: we show that if a robust classiﬁer is used as a sourcemodel for creating adversarial examples, those examplestransfer more effectively than if a non-robust classiﬁer isused as the source. This is an important novel result for re-searchers trying to create transferable adversarial examples:one should use a robust model as the source, instead of astandard model.To give evidence for these hypotheses, we train ResNet50classiﬁers on the CIFAR-10 training set using adversarialtraining with varying robustness parameters ε (in addition,we train ResNet20 on ImageNet-9 data, see Appendix). Wethen use these robust classiﬁers to generate targeted anduntargeted adversarial examples via projected gradient de-scent (Madry et al., 2017), using the standard adversarialobjectives. In the untargeted case, given an initial input x and a parameter ε to specify the maximum L2 norm of apermissible perturbation δ , we seek to ﬁnd an adversarialexample x u = x + ˆ δ (where x is the original example with dversarial Perturbations Are Not So Weird: Entanglement of Robust and Non-Robust Features in Neural Network Classiﬁers label y ) such that ˆ δ = arg max (cid:107) δ (cid:107) ≤ ε L ( θ, x + δ, y ) , and for the targeted case, for a target label y t , we seek toﬁnd an adversarial example x t = x + ˆ δ such that ˆ δ = arg min (cid:107) δ (cid:107) ≤ ε L ( θ, x + δ, ˆ y ) . We evaluate the generated adversarial examples on stan-dard classiﬁers with different architectures (Chollet, 2017;Howard et al., 2017; Huang et al., 2017; Simonyan & Zis-serman, 2014; Szegedy et al., 2016). For the untargetedcase, Figure 5 plots, for each classiﬁer, the error rate on theoriginal labels y , given adversarial examples x u . This errorrate is plotted as a function of the robustness parameter ε ofthe source classiﬁer used to generate the adversarial exam-ples ( x u ). The higher the error rate, the more successful thetransfer attack. To evaluate targeted adversarial examples,in Figure 6 we plot the accuracy of each standard classiﬁeron adversarial examples x t with respect to the target label y t —i.e., the fraction of times the classiﬁer predicted thetarget label y t , given example x t . The higher this accuracy,the more successful the transfer attack.Figure 5 shows that untargeted adversarial examples beneﬁtslightly if generated with respect to a robust classiﬁer witha small robustness parameter, < ε ≤ / with a peak at ε ≈ / . The success of transferable adversarial examplesdecreases as ε is further increased.Figure 6 shows that transferable targeted adversarial exam-ples attain a more striking increase in success by increas-ing the robustness parameter of the source classiﬁer. Ourtechnique for generating transferable targeted adversarialexamples is largely ineffective when the source classiﬁeris equivalent to standard classiﬁer (i.e., ε = 0 ). However,using a robust source classiﬁer with ε ≈ / dramaticallyincreases the success of targeted transfer attacks.Experiments on the ImageNet-9 dataset using ResNet20 assource models behave similarly (see Appendix for details).Our goal here is not to generate state-of-the-art transfer at-tacks; we leave this to future work. Thus, we do not employcomplex state-of-the-art transfer attacks and instead opt forthe simple projected gradient descent approach. Nonethe-less, our results conﬁrm that robust features with small ro-bustness parameter have high universality and can improvetransfer attacks, especially in the targeted case, in whichimprovements are dramatic.

6. Related Work

Adversarial examples and associated adversarial robustnesshave been studied extensively in the machine learning liter-ature (Allen-Zhu & Li, 2020; Athalye et al., 2018; Biggio et al., 2013; Carlini & Wagner, 2017a;b; Carlini et al., 2019;Cohen et al., 2019; Engstrom et al., 2019b; Feinman et al.,2017; Goodfellow et al., 2014; Kaur et al., 2019; Madryet al., 2017; Metzen et al., 2017; Moosavi-Dezfooli et al.,2016; Papernot et al., 2016b; Raghunathan et al., 2018; San-turkar et al., 2019; Shafahi et al., 2018; Stutz et al., 2019;Szegedy et al., 2014; Uesato et al., 2018; Warde-Farley &Goodfellow, 2016). Non-robust features have been studieddirectly by Ilyas et al. (2019) who ﬁnds that non-robust fea-tures are present in image training datasets. By contrast, weestablish an entanglement relationship between robust andnon-robust features.We hypothesize that the overlap between robust and non-robust features may be explained in part from a simplicitybias (Arpit et al., 2017; De Palma et al., 2019; Nakkiranet al., 2019; Shah et al., 2020; Valle-Pérez et al., 2018; Wuet al., 2017), which suggests that neural networks learnsimple functions more easily than complex functions, andfrom gradient starvation (Combes et al., 2018; Pezeshkiet al., 2020), which suggests that once highly predictivefeatures are learned, other features become increasinglydifﬁcult to learn due to a diminishing gradient. If non-robustfeatures are learned more readily and support sufﬁcientlylow classiﬁcation loss, such a combination may impedethe learning of robust features, for instance, see Nakkiran(2019). Similarly, texture bias may be related (Hermannet al., 2020).Many papers have shown that deep learning classiﬁers mayrely on non-semantic cues or shortcuts (Geirhos et al., 2020;Jo & Bengio, 2017; McCoy et al., 2019; Wang et al., 2020;Wei, 2020) and that optimization-based feature visualiza-tion may be inadequate to visualize neural network features(Borowski et al., 2020; Hendrycks et al., 2019). On the otherhand, there has been work on feature visualization (Aubry& Russell, 2015; Dosovitskiy & Brox, 2016; Mahendran& Vedaldi, 2015; Olah et al., 2017; Simonyan et al., 2013;Zhang & Zhu, 2018).Related to our hypothesis that robust features are more uni-versal than non-robust features, prior work has found thatadversarial robustness serves as an effective prior for trans-fer learning (Liang et al., 2020; Salman et al., 2020; Terziet al., 2020; Utrera et al., 2020). There has been some workon the universality hypothesis (Kornblith et al., 2019; Liet al., 2015; Olah et al., 2020; Raghu et al., 2017). Addi-tionally, there has been signiﬁcant work on constructingand evaluating transferable targeted adversarial examples(Eykholt et al., 2018; Kurakin et al., 2016; Liu et al., 2016;Moosavi-Dezfooli et al., 2017; Papernot et al., 2016a; 2017;Tramèr et al., 2017). dversarial Perturbations Are Not So Weird: Entanglement of Robust and Non-Robust Features in Neural Network Classiﬁers

7. Conclusion

In this paper, we have presented an empirical study demon-strating that robust and non-robust features can be entangledin standard neural network classiﬁers. In particular, we haveproposed three different types of features: robust features(Type A), non-robust features that are entangled with ro-bust features (Type B), and non-robust features that are notentangled with robust features (Type C). Type A featuresare known to appear signiﬁcantly more semantic than non-robust features (Engstrom et al., 2019b; Kaur et al., 2019;Santurkar et al., 2019). To our knowledge, no previousstudy has addressed the question of whether the non-robustfeatures underlying adversarial vulnerability are the non-semantic yet highly predictive artifacts in image datasets, orif the features appear non-semantic for a different reason.We present evidence that at least some of the non-robustfeatures indicate the presence of the same patterns of thedataset as robust features, suggesting that while appearingnon-semantic when observed through the lens of their gra-dient and adversarial perturbations, these features may, infact, indicate the presence of features that are aligned withhuman perception.In addition, we provide evidence that robust features can bethought of as more universal than non-robust features, andas result, we ﬁnd that robust classiﬁers are more effectiveas source classiﬁers for generating transferable adversarialexamples in both the untargeted and targeted settings.We believe that this work is an important step towards un-derstanding the nature of non-robust features and answeringthe universality hypothesis (Olah et al., 2020). With a theoryof robust and non-robust features, and thus of adversarialvulnerability and resilience, we hope to eventually under-stand how to build more interpretable classiﬁers and be ableto better defend against adversarial attacks.

Acknowledgements

Research presented in this article was supported by theLaboratory Directed Research and Development programof Los Alamos National Laboratory under project number20210043DR.The authors would like to thank Rory Soiffer for his helpfulideas and discussion.

References

Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z.,Citro, C., Corrado, G. S., Davis, A., Dean, J., Devin, M.,Ghemawat, S., Goodfellow, I., Harp, A., Irving, G., Isard,M., Jia, Y., Jozefowicz, R., Kaiser, L., Kudlur, M., Lev-enberg, J., Mané, D., Monga, R., Moore, S., Murray, D.,Olah, C., Schuster, M., Shlens, J., Steiner, B., Sutskever, I., Talwar, K., Tucker, P., Vanhoucke, V., Vasudevan,V., Viégas, F., Vinyals, O., Warden, P., Wattenberg, M.,Wicke, M., Yu, Y., and Zheng, X. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015.URL . Softwareavailable from tensorﬂow.org.Allen-Zhu, Z. and Li, Y. Feature puriﬁcation: How adversar-ial training performs robust deep learning. arXiv preprintarXiv:2005.10190 , 2020.Arpit, D., Jastrz˛ebski, S., Ballas, N., Krueger, D., Bengio,E., Kanwal, M. S., Maharaj, T., Fischer, A., Courville, A.,Bengio, Y., et al. A closer look at memorization in deepnetworks. arXiv preprint arXiv:1706.05394 , 2017.Athalye, A., Engstrom, L., Ilyas, A., and Kwok, K. Syn-thesizing robust adversarial examples. In

Internationalconference on machine learning , pp. 284–293. PMLR,2018.Aubry, M. and Russell, B. C. Understanding deep featureswith computer-generated imagery. In

Proceedings of theIEEE International Conference on Computer Vision , pp.2875–2883, 2015.Biggio, B., Corona, I., Maiorca, D., Nelson, B., Šrndi´c, N.,Laskov, P., Giacinto, G., and Roli, F. Evasion attacksagainst machine learning at test time. In

Joint Europeanconference on machine learning and knowledge discoveryin databases , pp. 387–402. Springer, 2013.Borowski, J., Zimmermann, R. S., Schepers, J., Geirhos, R.,Wallis, T. S., Bethge, M., and Brendel, W. Exemplarynatural images explain cnn activations better than featurevisualizations. arXiv preprint arXiv:2010.12606 , 2020.Carlini, N. and Wagner, D. Adversarial examples are noteasily detected: Bypassing ten detection methods. In

Proceedings of the 10th ACM Workshop on ArtiﬁcialIntelligence and Security , pp. 3–14, 2017a.Carlini, N. and Wagner, D. Towards evaluating the robust-ness of neural networks. In , pp. 39–57. IEEE, 2017b.Carlini, N., Athalye, A., Papernot, N., Brendel, W., Rauber,J., Tsipras, D., Goodfellow, I., Madry, A., and Kurakin,A. On evaluating adversarial robustness. arXiv preprintarXiv:1902.06705 , 2019.Chollet, F. Xception: Deep learning with depthwise separa-ble convolutions. In

Proceedings of the IEEE conferenceon computer vision and pattern recognition , pp. 1251–1258, 2017.Chollet, F. et al. Keras. https://keras.io , 2015. dversarial Perturbations Are Not So Weird: Entanglement of Robust and Non-Robust Features in Neural Network Classiﬁers

Chrabaszcz, P., Loshchilov, I., and Hutter, F. A downsam-pled variant of imagenet as an alternative to the cifardatasets. arXiv preprint arXiv:1707.08819 , 2017.Cohen, J., Rosenfeld, E., and Kolter, Z. Certiﬁed adver-sarial robustness via randomized smoothing. In

Interna-tional Conference on Machine Learning , pp. 1310–1320.PMLR, 2019.Combes, R. T. d., Pezeshki, M., Shabanian, S., Courville,A., and Bengio, Y. On the learning dynamics of deepneural networks. arXiv preprint arXiv:1809.06848 , 2018.De Palma, G., Kiani, B., and Lloyd, S. Random deepneural networks are biased towards simple functions. In

Advances in Neural Information Processing Systems , pp.1964–1976, 2019.Dosovitskiy, A. and Brox, T. Inverting visual representationswith convolutional networks. In

Proceedings of the IEEEconference on computer vision and pattern recognition ,pp. 4829–4837, 2016.Engstrom, L., Gilmer, J., Goh, G., Hendrycks, D., Ilyas,A., Madry, A., Nakano, R., Nakkiran, P., Santurkar,S., Tran, B., Tsipras, D., and Wallace, E. A discus-sion of ’adversarial examples are not bugs, they arefeatures’.

Distill , 2019a. doi: 10.23915/distill.00019.https://distill.pub/2019/advex-bugs-discussion.Engstrom, L., Ilyas, A., Santurkar, S., Tsipras, D.,Tran, B., and Madry, A. Adversarial robustness asa prior for learned representations. arXiv preprintarXiv:1906.00945 , 2019b.Eykholt, K., Evtimov, I., Fernandes, E., Li, B., Rahmati, A.,Xiao, C., Prakash, A., Kohno, T., and Song, D. Robustphysical-world attacks on deep learning visual classiﬁca-tion. In

Proceedings of the IEEE Conference on ComputerVision and Pattern Recognition , pp. 1625–1634, 2018.Feinman, R., Curtin, R. R., Shintre, S., and Gardner, A. B.Detecting adversarial samples from artifacts. arXivpreprint arXiv:1703.00410 , 2017.Geirhos, R., Jacobsen, J.-H., Michaelis, C., Zemel, R.,Brendel, W., Bethge, M., and Wichmann, F. A. Short-cut learning in deep neural networks. arXiv preprintarXiv:2004.07780 , 2020.Goodfellow, I. J., Shlens, J., and Szegedy, C. Explain-ing and harnessing adversarial examples. arXiv preprintarXiv:1412.6572 , 2014.He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learn-ing for image recognition. In

Proceedings of the IEEEconference on computer vision and pattern recognition ,pp. 770–778, 2016a. He, K., Zhang, X., Ren, S., and Sun, J. Identity mappingsin deep residual networks. In

European conference oncomputer vision , pp. 630–645. Springer, 2016b.Hendrycks, D., Zhao, K., Basart, S., Steinhardt, J., andSong, D. Natural adversarial examples. arXiv preprintarXiv:1907.07174 , 2019.Hermann, K., Chen, T., and Kornblith, S. The originsand prevalence of texture bias in convolutional neuralnetworks.

Advances in Neural Information ProcessingSystems , 33, 2020.Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D., Wang,W., Weyand, T., Andreetto, M., and Adam, H. Mobilenets:Efﬁcient convolutional neural networks for mobile visionapplications. arXiv preprint arXiv:1704.04861 , 2017.Huang, G., Liu, Z., Van Der Maaten, L., and Weinberger,K. Q. Densely connected convolutional networks. In

Proceedings of the IEEE conference on computer visionand pattern recognition , pp. 4700–4708, 2017.Ilyas, A., Santurkar, S., Tsipras, D., Engstrom, L., Tran,B., and Madry, A. Adversarial examples are not bugs,they are features. In

Advances in Neural InformationProcessing Systems , pp. 125–136, 2019.Jeddi, A., Shaﬁee, M. J., and Wong, A. A simple ﬁne-tuning is all you need: Towards robust deep learning viaadversarial ﬁne-tuning. arXiv preprint arXiv:2012.13628 ,2020.Jo, J. and Bengio, Y. Measuring the tendency of cnnsto learn surface statistical regularities. arXiv preprintarXiv:1711.11561 , 2017.Kaur, S., Cohen, J., and Lipton, Z. C. Are perceptually-aligned gradients a general property of robust classiﬁers? arXiv preprint arXiv:1910.08640 , 2019.Kingma, D. P. and Ba, J. Adam: A method for stochasticoptimization. arXiv preprint arXiv:1412.6980 , 2014.Kornblith, S., Norouzi, M., Lee, H., and Hinton, G. Similar-ity of neural network representations revisited. In

Interna-tional Conference on Machine Learning , pp. 3519–3529.PMLR, 2019.Krizhevsky, A., Hinton, G., et al. Learning multiple layersof features from tiny images.

Master’s thesis, Departmentof Computer Science, University of Toronto , 2009.Krizhevsky, A., Sutskever, I., and Hinton, G. E. Imagenetclassiﬁcation with deep convolutional neural networks.

Advances in neural information processing systems , 25:1097–1105, 2012. dversarial Perturbations Are Not So Weird: Entanglement of Robust and Non-Robust Features in Neural Network Classiﬁers

Kurakin, A., Goodfellow, I., and Bengio, S. Adversar-ial examples in the physical world. arXiv preprintarXiv:1607.02533 , 2016.LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition.

Proceed-ings of the IEEE , 86(11):2278–2324, 1998.Li, Y., Yosinski, J., Clune, J., Lipson, H., and Hopcroft, J. E.Convergent learning: Do different neural networks learnthe same representations? In

FE@ NIPS , pp. 196–212,2015.Liang, K., Zhang, J. Y., Koyejo, O., and Li, B. Does adver-sarial transferability indicate knowledge transferability? arXiv preprint arXiv:2006.14512 , 2020.Liu, Y., Chen, X., Liu, C., and Song, D. Delving intotransferable adversarial examples and black-box attacks. arXiv preprint arXiv:1611.02770 , 2016.Madry, A., Makelov, A., Schmidt, L., Tsipras, D., andVladu, A. Towards deep learning models resistant toadversarial attacks. arXiv preprint arXiv:1706.06083 ,2017.Mahendran, A. and Vedaldi, A. Understanding deep im-age representations by inverting them. In

Proceedingsof the IEEE conference on computer vision and patternrecognition , pp. 5188–5196, 2015.McCoy, R. T., Pavlick, E., and Linzen, T. Right for thewrong reasons: Diagnosing syntactic heuristics in naturallanguage inference. arXiv preprint arXiv:1902.01007 ,2019.Metzen, J. H., Genewein, T., Fischer, V., and Bischoff, B.On detecting adversarial perturbations. arXiv preprintarXiv:1702.04267 , 2017.Moosavi-Dezfooli, S.-M., Fawzi, A., and Frossard, P. Deep-fool: a simple and accurate method to fool deep neuralnetworks. In

Proceedings of the IEEE conference oncomputer vision and pattern recognition , pp. 2574–2582,2016.Moosavi-Dezfooli, S.-M., Fawzi, A., Fawzi, O., andFrossard, P. Universal adversarial perturbations. In

Pro-ceedings of the IEEE conference on computer vision andpattern recognition , pp. 1765–1773, 2017.Nakkiran, P. Adversarial robustness may be at odds withsimplicity. arXiv preprint arXiv:1901.00532 , 2019.Nakkiran, P., Kalimeris, D., Kaplun, G., Edelman, B., Yang,T., Barak, B., and Zhang, H. Sgd on neural networkslearns functions of increasing complexity. In

Advances inNeural Information Processing Systems , pp. 3496–3506,2019. Olah, C., Mordvintsev, A., and Schubert, L. Feature visual-ization.

Distill , 2(11):e7, 2017.Olah, C., Cammarata, N., Schubert, L., Goh, G., Petrov,M., and Carter, S. Zoom in: An introduction to cir-cuits.

Distill , 2020. doi: 10.23915/distill.00024.001.https://distill.pub/2020/circuits/zoom-in.Papernot, N., McDaniel, P., and Goodfellow, I. Transfer-ability in machine learning: from phenomena to black-box attacks using adversarial samples. arXiv preprintarXiv:1605.07277 , 2016a.Papernot, N., McDaniel, P., Jha, S., Fredrikson, M., Celik,Z. B., and Swami, A. The limitations of deep learning inadversarial settings. In , pp. 372–387. IEEE,2016b.Papernot, N., McDaniel, P., Goodfellow, I., Jha, S., Celik,Z. B., and Swami, A. Practical black-box attacks againstmachine learning. In

Proceedings of the 2017 ACM onAsia conference on computer and communications secu-rity , pp. 506–519, 2017.Pezeshki, M., Kaba, S.-O., Bengio, Y., Courville, A., Precup,D., and Lajoie, G. Gradient starvation: A learning procliv-ity in neural networks. arXiv preprint arXiv:2011.09468 ,2020.Raghu, M., Gilmer, J., Yosinski, J., and Sohl-Dickstein,J. Svcca: Singular vector canonical correlation analysisfor deep learning dynamics and interpretability. arXivpreprint arXiv:1706.05806 , 2017.Raghunathan, A., Steinhardt, J., and Liang, P. Certiﬁeddefenses against adversarial examples. arXiv preprintarXiv:1801.09344 , 2018.Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S.,Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein,M., Berg, A. C., and Fei-Fei, L. ImageNet Large ScaleVisual Recognition Challenge.

International Journal ofComputer Vision (IJCV) , 115(3):211–252, 2015. doi:10.1007/s11263-015-0816-y.Salman, H., Ilyas, A., Engstrom, L., Kapoor, A., and Madry,A. Do adversarially robust imagenet models transferbetter? arXiv preprint arXiv:2007.08489 , 2020.Santurkar, S., Ilyas, A., Tsipras, D., Engstrom, L., Tran, B.,and Madry, A. Image synthesis with a single (robust)classiﬁer. In

Advances in Neural Information ProcessingSystems , pp. 1262–1273, 2019.Shafahi, A., Huang, W. R., Studer, C., Feizi, S., and Gold-stein, T. Are adversarial examples inevitable? arXivpreprint arXiv:1809.02104 , 2018. dversarial Perturbations Are Not So Weird: Entanglement of Robust and Non-Robust Features in Neural Network Classiﬁers

Shafahi, A., Najibi, M., Ghiasi, A., Xu, Z., Dickerson,J., Studer, C., Davis, L. S., Taylor, G., and Goldstein,T. Adversarial training for free! arXiv preprintarXiv:1904.12843 , 2019.Shah, H., Tamuly, K., Raghunathan, A., Jain, P., and Netra-palli, P. The pitfalls of simplicity bias in neural networks. arXiv preprint arXiv:2006.07710 , 2020.Simonyan, K. and Zisserman, A. Very deep convolu-tional networks for large-scale image recognition. arXivpreprint arXiv:1409.1556 , 2014.Simonyan, K., Vedaldi, A., and Zisserman, A. Deep in-side convolutional networks: Visualising image clas-siﬁcation models and saliency maps. arXiv preprintarXiv:1312.6034 , 2013.Stutz, D., Hein, M., and Schiele, B. Disentangling adversar-ial robustness and generalization. In

Proceedings of theIEEE/CVF Conference on Computer Vision and PatternRecognition , pp. 6976–6987, 2019.Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan,D., Goodfellow, I., and Fergus, R. Intriguing propertiesof neural networks. In , 2014.Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., and Wojna,Z. Rethinking the inception architecture for computer vi-sion. In

Proceedings of the IEEE conference on computervision and pattern recognition , pp. 2818–2826, 2016.Terzi, M., Achille, A., Maggipinto, M., and Susto, G. A.Adversarial training reduces information and improvestransferability. arXiv preprint arXiv:2007.11259 , 2020.Tramèr, F., Papernot, N., Goodfellow, I., Boneh, D., and Mc-Daniel, P. The space of transferable adversarial examples. arXiv preprint arXiv:1704.03453 , 2017.Tsipras, D., Santurkar, S., Engstrom, L., Turner, A., andMadry, A. Robustness may be at odds with accuracy. In

International Conference on Learning Representations ,2019.Uesato, J., O’donoghue, B., Kohli, P., and Oord, A. Ad-versarial risk and the dangers of evaluating against weakattacks. In

International Conference on Machine Learn-ing , pp. 5025–5034. PMLR, 2018.Utrera, F., Kravitz, E., Erichson, N. B., Khanna, R., andMahoney, M. W. Adversarially-trained deep nets transferbetter. arXiv preprint arXiv:2007.05869 , 2020.Valle-Pérez, G., Camargo, C. Q., and Louis, A. A. Deeplearning generalizes because the parameter-function mapis biased towards simple functions. arXiv preprintarXiv:1805.08522 , 2018. Wang, H., Wu, X., Huang, Z., and Xing, E. P. High-frequency component helps explain the generalizationof convolutional neural networks. In

Proceedings of theIEEE/CVF Conference on Computer Vision and PatternRecognition , pp. 8684–8694, 2020.Warde-Farley, D. and Goodfellow, I. Adversarial perturba-tions of deep neural networks.

Perturbations, Optimiza-tion, and Statistics , 311, 2016.Wei, K.-A. A.

Understanding non-robust features in imageclassiﬁcation . PhD thesis, Massachusetts Institute ofTechnology, 2020.Wong, E., Rice, L., and Kolter, J. Z. Fast is better thanfree: Revisiting adversarial training. arXiv preprintarXiv:2001.03994 , 2020.Wu, L., Zhu, Z., et al. Towards understanding generalizationof deep learning: Perspective of loss landscapes. arXivpreprint arXiv:1706.10239 , 2017.Zhang, H., Yu, Y., Jiao, J., Xing, E., El Ghaoui, L., andJordan, M. Theoretically principled trade-off betweenrobustness and accuracy. In

International Conference onMachine Learning , pp. 7472–7482. PMLR, 2019.Zhang, Q.-s. and Zhu, S.-C. Visual interpretability for deeplearning: a survey.

Frontiers of Information Technology& Electronic Engineering , 19(1):27–39, 2018. dversarial Perturbations Are Not So Weird: Entanglement of Robust and Non-Robust Features in Neural Network Classiﬁers

A. Experimental Setup

In this section, we describe details of the methods that we use to train our standard, robust, zero-robust, and negative-robustmodels, construct robustiﬁed images, and construct adversarial examples.

A.1. Dataset Considerations

Our experiments involve training many models, a large portion of which require adversarial training. Similarly, ourevaluation procedures involve constructing robust datasets, another highly computationally expensive operation. In orderto ensure the computational feasibility of these problems, we conduct all experiments on two image datasets: CIFAR-10(Krizhevsky et al., 2009) and a version of ImageNet (Russakovsky et al., 2015) which has been downsampled to × pixels (Chrabaszcz et al., 2017) and in which we only use images which fall under one of nine non-overlapping classlabels: dog (classes 151–268), cat (classes 281–285), frog (classes 30–32), turtle (classes 33-37), bird (classes 80–100),primate (classes 365–382), ﬁsh (classes 389–397), crab (classes 118-121), insect (classes 300-319). This dataset is the samesubset of the ImageNet dataset used by Ilyas et al. (2019), with the additional modiﬁcation that our dataset is downsampled.CIFAR-10 has 50,000 training images and ImageNet-9 has approximately 200,000.While we conduct our experiments only on these two datasets, we expect that our results should be general across all imagedatasets, and likely across other domains as well. A.2. Training & Evaluation Details

We describe the training procedure for the four different types of models that we construct: standard, robust, zero-robust,and negative-robust.For each model, we train using the Adam optimizer (Kingma & Ba, 2014) with default TensorFlow parameters ( β =0 . , β = 0 . ) (Abadi et al., 2015). The learning rate for each model is speciﬁed in Table 3. Each model was trained forat most 100 epochs. Models with an epoch count marked with an asterisks (*) had their training stopped early in order tomaximize the performance on a validation set. Models that are not marked with an asterisks were trained for the entire 100epochs. Standard Models

To train standard models, we minimize expected cross-entropy loss over the training dataset. Allparameters are speciﬁed in Table 3.

Robust Models

To train robust models, we compute a set of adversarial examples for each batch during training. We useprojected gradient descent to construct these adversarial examples (Madry et al., 2017). For adversarial examples constructedfrom CIFAR-10 images, we use 16 steps with a step size of 0.5. For adversarial examples constructed from ImageNetimages, we use 8 steps with a step size of 1.0. For adversarial training, we train to minimize the expected cross-entropy lossof the network evaluated on adversarial examples labeled as the true label. All other parameters are speciﬁed in Table 3.

Zero-Robust Models

To train zero-robust models, we apply the methods proposed in Section 3.2. In particular, weconstruct adversarial examples (see Table 4 for parameters) using projected gradient descent to minimize the targetedadversarial loss functions min (cid:107) δ (cid:107) ≤ ε L ( θ, x + δ, ˆ y ) where x is the original image, ε is the maximum allowed L2 norm difference between the adversarial example and theoriginal image, and ˆ y is the targeted class. For each image in CIFAR-10, we construct ten adversarial examples, onetargeting each of the ten classes, labeled as the target class. For each image in the ImageNet-9 dataset, we construct a singleadversarial example with a target selected uniformly at random from the nine possible classes, similarly labeled as thetarget class. We train zero-robust models for both CIFAR-10 and ImageNet-9 using standard training on the appropriateconstructed dataset. The hyperparameters for training the models can be found in Table 3. Negative-Robust Models

We train negative-robust models in the same way as we train zero-robust models, except with aslightly different dataset. For each image with label y in the CIFAR-10 dataset (numbered 0–9 to represent each class inCIFAR-10), we construct a new dataset with an adversarial example targeting the class y + 1 mod 10 , and labeled as this dversarial Perturbations Are Not So Weird: Entanglement of Robust and Non-Robust Features in Neural Network Classiﬁers target class. We construct the ImageNet-9 version similarly, using y + 1 mod 9 as the target label instead. The parametersfor constructing the dataset can be found in Table 4 and the hyperparameters for training the models in Table 3. Victim Models

To test the transferability of adversarial examples generated on robust models, we train standard modelswith different architectures: DenseNet121, InceptionV3, MobileNetV2, VGG16, ResNet50V2, and Xception (Chollet, 2017;Howard et al., 2017; Huang et al., 2017; Simonyan & Zisserman, 2014; Szegedy et al., 2016). We initialize these modelswith pre-trained weights from training on the original ImageNet dataset (Chollet et al., 2015). For the hyperparameters, seeTable 3.

Table 3.

Model parameters for each model trained in this paper. Parameters were selected by hand. In addition to what is listed below, thezero- and negative- robust models for both ImageNet-9 and CIFAR-10 were stopped prior to 100 epochs in order to maximize accuracy ona validation set. Each model was trained on a IBM Power9 architecture machine with two NVIDIA Tesla V100 GPUs. Each standardmodel took approximately 24 hours to train. Each robust models took approximately 48 hours.

Model Architecture Batchsize Epochs Learningrate LRdecay Dataaugmentation PretrainingStandard (CIFAR) ResNet56 (v2) 200 100 1e-3 Yes Yes NoRobust (CIFAR) ResNet56 (v2) 200 100 1e-3 Yes Yes NoZero-robust (CIFAR) ResNet56 (v2) 200 100* 1e-4 Yes No NoNeg-robust (CIFAR) ResNet56 (v2) 200 100* 1e-4 Yes No NoVictim models (CIFAR) Misc 200 100 1e-4 Yes Yes YesStandard (ImageNet) ResNet20 (v2) 128 100 1e-3 Yes Yes NoRobust (ImageNet) ResNet20 (v2) 128 100 1e-3 Yes Yes NoZero-Robust (ImageNet) ResNet20 (v2) 128 100* 1e-3 Yes No NoNeg-robust (ImageNet) ResNet20 (v2) 128 100* 1e-3 Yes No NoVictim models (ImageNet) Misc 128 100 1e-4 Yes Yes Yes

Table 4.

Parameters for the various datasets constructed in this paper. Parameters were selected by hand.

Dataset Maximum allowed change(L2 norm from seed) Number of steps Step sizeRobustiﬁed (CIFAR) ∞

400 0.05Zero-robust (CIFAR) 0.5 200 0.1Neg-robust (CIFAR) 0.5 200 0.1Untargeted Adversarial Examples (CIFAR) 2.0 100 0.1Targeted Adversarial Examples (CIFAR) 2.0 100 0.1Robustiﬁed (ImageNet) ∞

200 0.1Zero-robust (ImageNet) 0.5 200 0.1Neg-robust (ImageNet) 0.5 200 0.1Untargeted Adversarial Examples (ImageNet) 4.0 100 0.1Targeted Adversarial Examples (ImageNet) 4.0 100 0.1

B. Extended Results

In this section, we present extended results including the extended data from Tables 1 and 2 (Tables 5 and 6). In addition,we include ImageNet-9 versions of Figures 3, 5, and 6 (Figures 7, 9, and 8). We present the accuracy of each of ourrobust classiﬁers on the original ImageNet-9 testing set (Figure 10). We include extended examples of robustiﬁed images(Figures 11 and 12). Finally, we provide examples of adversarial examples generated on robust models (Figures 13 and 14). dversarial Perturbations Are Not So Weird: Entanglement of Robust and Non-Robust Features in Neural Network Classiﬁers

Table 5.

Extended data: Accuracy of zero-robust, negative-robust, standard, and robust ResNet50 classiﬁers on robustiﬁed CIFAR-10 testset. All classiﬁers are trained starting with random initial weights. The

Test column is the accuracy on the original CIFAR-10 test set. The R ε columns give accuracy on robustiﬁed CIFAR-10 test set, generated with respect to a robust ResNet50 with robustness parameter ε . Classiﬁer Test R R / R / R / R / R / R / R / R / R Standard 94.1 58.3 77.3 81.4 82.2 83.7 79.9 79.0 75.1 69.2 56.3Robust ( ε = 0 . ) 86.5 9.93 9.97 9.96 10.0 11.1 18.2 44.5 70.5 86.5 70.0Zero-robust 46.8 38.8 35.1 26.1 27.2 26.2 23.3 22.6 22.1 23.4 21.4Negative-robust 21.9 31.8 15.8 14.6 16.6 11.3 11.1 11.0 12.8 14.3 13.2 Table 6.

Extended data: Accuracy of zero-robust, negative-robust, standard, and robust ResNet20 classiﬁers on robustiﬁed ImageNet-9test set. Analogous to Table 5. Note: since there are nine classes, chance accuracy is approximately . . Classiﬁer Test R R / R / R / R / R / R / R / R / R Standard 94.8 51.9 68.3 71.5 75.5 76.7 76.3 72.1 63.4 54.6 47.2Robust ( ε = 1 . ) 82.9 41.9 42.1 42.5 42.5 42.5 44.1 46.1 49.0 51.7 56.0Zero-robust 49.2 24.8 28.3 30.5 29.3 27.5 24.3 24.2 24.6 28.6 31.4Negative-robust 38.3 25.4 18.8 18.4 19.2 20.2 23.2 23.6 21.8 24.4 23.7 Robustness parameter A cc u r a c y o n Figure 7.

Accuracy of a standard ResNet20 classiﬁer on robustiﬁcations of the ImageNet-9 test dataset, generated with respect to robustResNet20 classiﬁers with varying robustness parameter ε . (Best viewed in color.) dversarial Perturbations Are Not So Weird: Entanglement of Robust and Non-Robust Features in Neural Network Classiﬁers Robustness parameter of source classifier E rr o r o n o r i g i n a l l a b e l s DenseNet121InceptionV3MobileNetV2VGG16ResNet50V2Xception

Figure 8.

Robustness parameter of source classifier A cc u r a c y o n t a r g e t l a b e l s DenseNet121InceptionV3MobileNetV2VGG16ResNet50V2Xception

Figure 9.

Robustness parameter A cc u r a c y o n o r i g i n a l t e s t d a t a Figure 10.

Accuracy of robust ResNet20s of varying robustness parameter trained on the ImageNet-9 training data and evaluated on theImageNet-9 original test data. Conﬁrms well known result: test accuracy decreases as robustness increases. s ee d = = / = / = / = / = / = / = / = / = o r i g i n a l Figure 11.

Extended data: Examples of robustiﬁed images by ε of robust classiﬁer. The rightmost column depicts the original CIFAR-10image; the leftmost column depicts the seed ( x ) from which the gradient descent of the robustiﬁcation process began; the rest arerobustiﬁed images. The representation layer of an ε -robust classiﬁer for each speciﬁed ε is approximately the same for the robustiﬁedimage and the associated original image. Since the representation layers of less robust classiﬁers are more easily perturbed, robustiﬁedimages generated with respect to classiﬁers with a smaller robustness parameter appear closer to the (random) CIFAR-10 image thatseeded the gradient-descent process. (Best viewed in color.) dversarial Perturbations Are Not So Weird: Entanglement of Robust and Non-Robust Features in Neural Network Classiﬁers s ee d = = / = / = / = / = / = / = / = / = o r i g i n a l Figure 12.

Extended data: Examples of robustiﬁed images by ε of robust classiﬁer. The rightmost column depicts the original ImageNet-9image; the leftmost column depicts the seed ( x ) from which the gradient descent of the robustiﬁcation process began; the rest arerobustiﬁed images. The representation layer of an ε -robust classiﬁer for each speciﬁed ε is approximately the same for the robustiﬁedimage and the associated original image. Since the representation layers of less robust classiﬁers are more easily perturbed, robustiﬁedimages generated with respect to classiﬁers with a smaller robustness parameter appear closer to the (random) ImageNet-9 image thatseeded the gradient-descent process. (Best viewed in color.) o r i g i n a l = = / = / = / = / = / = / = / = / = Figure 13.

Extended data: Examples of adversarial examples for robust CIFAR-10 classiﬁers, of varying robustness. The left-most imageis the original image from CIFAR-10. All other images are adversarial. Each adversarial example was generated via PGD from theoriginal image x such that the resulting adversarial example ˆ x = x + δ where (cid:107) δ (cid:107) ≤ (Best viewed in color.) dversarial Perturbations Are Not So Weird: Entanglement of Robust and Non-Robust Features in Neural Network Classiﬁers o r i g i n a l = = / = / = / = / = / = / = / = / = Figure 14.

Extended data: Examples of adversarial examples for robust ImageNet-9 classiﬁers, of varying robustness. The left-most imageis the original image from ImageNet-9. All other images are adversarial. Each adversarial example was generated via PGD from theoriginal image x such that the resulting adversarial example ˆ x = x + δ where (cid:107) δ (cid:107) ≤4