Adversarial Perturbations Are Not So Weird: Entanglement of Robust and Non-Robust Features in Neural Network Classifiers
Jacob M. Springer, Melanie Mitchell, Garrett T. Kenyon

Abstract
Neural networks trained on visual data are well-known to be vulnerable to often imperceptible adversarial perturbations. The reasons for this vulnerability are still being debated in the literature. Recently, Ilyas et al. (2019) showed that this vulnerability arises, in part, because neural network classifiers rely on highly predictive but brittle "non-robust" features. In this paper we extend the work of Ilyas et al. by investigating the nature of the input patterns that give rise to these features. In particular, we hypothesize that in a neural network trained in a standard way, non-robust features respond to small, "non-semantic" patterns that are typically entangled with larger, robust patterns, known to be more human-interpretable, as opposed to solely responding to statistical artifacts in a dataset. Thus, adversarial examples can be formed via minimal perturbations to these small, entangled patterns. In addition, we demonstrate a corollary of our hypothesis: robust classifiers are more effective than standard (non-robust) ones as a source for generating transferable adversarial examples in both the untargeted and targeted settings. The results we present in this paper provide new insight into the nature of the non-robust features responsible for adversarial vulnerability of neural network classifiers.
1. Introduction
It is well-known that neural network classifiers trained on visual data are susceptible to adversarial examples—images that have been minimally perturbed so as to look unchanged to humans but are classified incorrectly, even though the original image is correctly classified. Many explanations have been offered for this susceptibility as well as for the transferability of adversarial examples across network architectures.

Los Alamos National Laboratory, Los Alamos, NM; Santa Fe Institute, Santa Fe, NM. Correspondence to: Jacob M. Springer.
Type B:
Non-robust features that respond to small yet highly predictive patterns that by themselves appear non-semantic, yet are entangled with patterns associated with robust features (e.g., in (a)).
Type C:
Non-robust features that respond to highly predictive patterns that are artifacts in the dataset, and are independent of robust features.
Figure 1.
A sketch of the possible relationships between robust and non-robust features. Note that this is oversimplified, as non-robust features could respond to combinations of (b) and (c).

robust features (Engstrom et al., 2019b; Kaur et al., 2019; Santurkar et al., 2019). By establishing the relationship between non-robust and robust features, we argue that non-robust features may be more interpretable than they appear through the lens of adversarial perturbations. In other words, non-robust features may not be so weird, after all.
The results of this paper have important implications. We explain conceptually why adversarial perturbations might appear non-semantic yet are related to semantic features. We present and test a corollary of our results: transferable examples (targeted and untargeted) can be generated more effectively by using robust classifiers rather than standard classifiers. Finally, our results provide new insights into the nature of adversarial vulnerabilities, and suggest directions for future research.
2. Terminology
In this section, following Ilyas et al. (2019), we define several important terms, in particular the notions of robust and non-robust features.

Deep learning classifiers, in particular convolutional neural networks (CNNs), are typically composed of a sequence of non-linear transformations called layers, the result of which is fed through a final linear classifier layer to select a class (Krizhevsky et al., 2012; LeCun et al., 1998). We refer to the penultimate layer as the representation layer, denoted rep(x), where x is the n-dimensional input to the network (e.g., an image). We refer to each unit of the representation layer as a feature that the network computes for classification. Each feature f : R^n → R maps the input to a real number—the feature's activation, given the input.

We adapt the following definitions from Ilyas et al. (2019), who considered the binary classification case, in which the dataset D consists of pairs (x, y) ∈ D, x ∈ R^n, y ∈ {+1, −1}, and the final layer outputs either +1 or −1.

• A feature f is useful for a dataset D when there exists a ρ > 0 such that E_{(x,y)∈D}[y · f(x)] ≥ ρ, or, more intuitively, when f is correlated with the class y of an input x. (A small numerical sketch of this definition appears at the end of this section.)

• For a given input x, adversarial example x̂ = x + δ, and ε > 0, δ is said to be a permissible perturbation when ‖δ‖ < ε.

• A feature f is robust when, for some γ > 0, E_{(x,y)∈D}[min_{‖δ‖≤ε} y · f(x + δ)] ≥ γ. More intuitively, a feature is robust when it is correlated with the class y even under worst-case permissible perturbations of the input x.

• For simplicity, a feature is said to be non-robust when it is useful but not robust. (Here we do not consider non-useful features.)

A feature is a property of a classifier, and describes a way in which the classifier measures information in the input. However, it will also be conceptually useful to refer to the information itself. Thus we define the closely related notion of a pattern P ⊂ R^n, a subset of inputs. We say that an image x contains a pattern P when x ∈ P. For example, a pattern of stop signs would be the set of all images that contain a stop sign.

We are primarily concerned with the relationship between robust features and non-robust features. The main question we address in this paper is, what is the nature of the non-robust features described by Ilyas et al. (2019)? We hypothesize that many non-robust features learned by standard (i.e., non-robust) classifiers respond to patterns that are entangled with human-interpretable robust features, rather than responding to dataset artifacts. Moreover, these entangled non-robust features can be exploited to create adversarial examples. If true, this hypothesis has important implications for defenses against, and transferability of, adversarial examples.

We illustrate our hypothesis with a conceptual diagram in Figure 1, consisting of three different types of patterns that features in a network could respond to. Robust features respond to Type A patterns—ones that are interpretable by humans as giving rise to a particular class. Type C illustrates a type of pattern that non-robust features might respond to: highly predictive non-semantic patterns in the dataset that are unrelated to the robust features in the image. These might be called "spurious correlations" or "artifacts" in the dataset.
Type B illustrates another possibility for non-robust features: they might respond to small but highly predictive components of robust features (e.g., the nose of a dog or the whiskers of a cat) that could be easily exploited to yield adversarial examples.

We confirm in this paper that Type B features are indeed learned by CNNs, that these features can, in some cases, explain the high accuracy of CNNs, and that these features can be exploited in adversarial attacks. Thus the adversarial vulnerability of CNNs is not necessarily due to non-semantic artifacts, but can be explained in terms of non-robust features that are entangled with robust features that respond to semantically meaningful patterns.
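As a concrete illustration of the usefulness definition above, the following minimal NumPy sketch estimates E_{(x,y)∈D}[y · f(x)] for a single feature. The feature function and data here are toy placeholders of our own, not the features of any classifier discussed in this paper.

```python
import numpy as np

def usefulness(feature_fn, xs, ys):
    """Estimate E[y * f(x)] over a binary dataset with labels in {+1, -1}.

    A feature is rho-useful when this estimate is at least some rho > 0;
    feature_fn is any callable mapping one input to a scalar activation.
    """
    activations = np.array([feature_fn(x) for x in xs])
    return float(np.mean(ys * activations))

# Toy example: mean pixel intensity as a (probably useless) feature
# evaluated on random arrays standing in for images.
rng = np.random.default_rng(0)
xs = rng.normal(size=(100, 32, 32, 3))
ys = rng.choice([-1, 1], size=100)
print(usefulness(lambda x: x.mean(), xs, ys))
```

Estimating robustness would additionally require the worst-case inner perturbation of each input, which in practice is approximated with projected gradient descent, as described in the next section.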
3. Methods
In order to test our hypothesis that Type B features are learned by standard CNN classifiers, we use the following strategy, inspired by Ilyas et al. (2019). We construct a neural network classifier in which only non-robust features are useful. We then construct a new test set for which only robust features are useful for classification. We then show that the classifier achieves substantially higher than chance accuracy on the constructed test set; this implies that the classifier must be using non-robust features that are entangled with robust features—namely, Type B features.
Figure 2.
Examples of robustified images for different values of the robustness parameter ε of the robust classifier. The rightmost column depicts the original CIFAR-10 image; the leftmost column depicts the seed (x_0) from which the gradient descent of the robustification process began; the rest are robustified images. Since the representation layers of less robust classifiers are more easily perturbed, robustified images generated with a less robust model (smaller value of ε) will appear closer to the (random) CIFAR-10 image that seeded the gradient-descent process. (Best viewed in color.)

The following subsections describe how we construct this classifier and test set. We use the term zero-robust classifier for the classifier for which only non-robust features are useful. To train these classifiers, we follow the method of Ilyas et al. (2019). We first construct a training set D_zero from our original training set D such that under D_zero, all robust features are non-useful, i.e., totally uncorrelated with the assigned label, while non-robust features remain useful. By training a standard classifier on this distribution, the standard classifier should only learn non-robust features.

The new training set D_zero can be constructed as follows: for each input-label pair (x, y) ∈ D, choose a new label ŷ uniformly at random from the possible labels. Construct a permissible adversarial example x̂ (i.e., x̂ = x + δ, where δ is a permissible perturbation) such that x̂ is classified as ŷ. Then, let (x̂, ŷ) be an element of D_zero. Ilyas et al. (2019) argued that there should be almost no correlation between robust features and the assigned labels in this new dataset. Thus, a classifier trained on this dataset should not rely on robust features.
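A minimal sketch of this construction is given below, assuming a TensorFlow/Keras classifier that outputs logits; the helper names are ours, and the default attack hyperparameters (eps = 0.5, 200 steps, step size 0.1) mirror the zero-robust rows of Table 4 in the Appendix.

```python
import tensorflow as tf

def targeted_pgd_l2(model, x, target, eps=0.5, step_size=0.1, steps=200):
    """Targeted PGD on a single image under an L2 budget of eps (a sketch)."""
    x = tf.convert_to_tensor(x[None], tf.float32)   # add a batch dimension
    target = tf.constant([target])
    delta = tf.zeros_like(x)
    for _ in range(steps):
        with tf.GradientTape() as tape:
            tape.watch(delta)
            loss = tf.keras.losses.sparse_categorical_crossentropy(
                target, model(x + delta), from_logits=True)
        grad = tape.gradient(loss, delta)
        # Descend the loss on the target class, then project back to the ball.
        delta = delta - step_size * grad / (tf.norm(grad) + 1e-12)
        delta = tf.clip_by_norm(delta, eps)
    return (x + delta)[0]

def make_zero_robust_dataset(model, xs, num_classes=10):
    """Assign each image a random label and perturb it toward that label."""
    new_xs, new_ys = [], []
    for x in xs:
        y_hat = int(tf.random.uniform([], 0, num_classes, dtype=tf.int32))
        new_xs.append(targeted_pgd_l2(model, x, y_hat))
        new_ys.append(y_hat)
    return tf.stack(new_xs), tf.constant(new_ys)
```

The negative-robust dataset described next differs only in how ŷ is chosen: a fixed permutation of the true label rather than a uniform random draw.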
Negative-Robust Classifier: Addressing the Problem of "Robust Feature Leakage".

As pointed out by Engstrom et al. (2019a), there may be a small correlation between robust features and the assigned labels in D_zero, due to "robust feature leakage," which may account for some of the accuracy of a zero-robust classifier on a robustified dataset. We control for this possibility with the method proposed by Ilyas et al. (2019): we construct a negative-robust classifier, for which robust features are anti-correlated with the label and non-robust features are correlated with the label. Thus, robust features should actively hurt accuracy, and any positive accuracy can certainly be attributed to non-robust features. We construct a negative-robust classifier similarly to the zero-robust classifier, but whereas ŷ is selected uniformly at random to construct D_zero, for the negative-robust classifier we deterministically permute the class labels, associating each label y with the corresponding ŷ in the permutation. Then for each (x, y) pair in D we create (x̂, ŷ), where x̂ is an adversarial example based on x targeting class ŷ. Using a deterministic permutation of the label classes makes robust features anti-correlated with the correct label class.

In order to construct a test set for which only robust features are useful, we follow the approach of Ilyas et al. (2019). First we construct a robust classifier, and then use that classifier to construct the desired dataset.
Robust Classifier.
We construct robust classifiers via standard adversarial training (Madry et al., 2017). Formally, we wish to find classifier parameters θ* that minimize the following expression:

θ* = arg min_θ E_{(x,y)∈D}[ max_{‖δ‖≤ε} L(θ, x + δ, y) ],

where L(θ, x, y) computes the cross-entropy loss of the classifier with parameters θ on the input x and label y. The magnitude of the adversarial examples to which the classifier is robust is characterized by the ε parameter, which we call the robustness parameter of the classifier. Larger values of ε correspond to more robust classifiers (i.e., they are robust to larger perturbations). The special case ε = 0 corresponds to a standard classifier. We refer to a classifier with robustness parameter ε as an ε-robust classifier.

The above saddle-point optimization problem can be solved by first solving the inner maximization problem with projected gradient ascent and then solving the outer minimization problem with standard back-propagation. We train with projected gradient descent as proposed by Madry et al. (2017). Since this problem is well studied, we defer to prior literature for discussion of the matter (Carlini et al., 2019; Madry et al., 2017).
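The following is a minimal sketch of one adversarial training step under the saddle-point objective above, again assuming a TensorFlow/Keras model with logit outputs. The helpers are illustrative; the default inner-attack settings (16 steps, step size 0.5) match the CIFAR-10 values listed in the Appendix, but the exact training configuration is given there.

```python
import tensorflow as tf

def _normalize_l2(t, axis=(1, 2, 3)):
    """Scale each example of a batched tensor to unit L2 norm."""
    norm = tf.sqrt(tf.reduce_sum(tf.square(t), axis=axis, keepdims=True))
    return t / (norm + 1e-12)

def inner_max_pgd_l2(model, x, y, eps, step_size, steps):
    """Approximate the inner maximization with projected gradient ascent."""
    delta = tf.zeros_like(x)
    for _ in range(steps):
        with tf.GradientTape() as tape:
            tape.watch(delta)
            loss = tf.reduce_mean(
                tf.keras.losses.sparse_categorical_crossentropy(
                    y, model(x + delta), from_logits=True))
        grad = tape.gradient(loss, delta)
        delta = delta + step_size * _normalize_l2(grad)
        delta = tf.clip_by_norm(delta, eps, axes=[1, 2, 3])  # per-example projection
    return delta

def adversarial_train_step(model, optimizer, x, y, eps, step_size=0.5, steps=16):
    """Outer minimization: one optimizer step on worst-case perturbed inputs."""
    x_adv = x + inner_max_pgd_l2(model, x, y, eps, step_size, steps)
    with tf.GradientTape() as tape:
        loss = tf.reduce_mean(
            tf.keras.losses.sparse_categorical_crossentropy(
                y, model(x_adv, training=True), from_logits=True))
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss
```

Here eps plays the role of the robustness parameter ε of the resulting classifier.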
Test Set Robustification.

To construct a test set for which only robust features are useful for classification, we train a robust classifier C as described above, and use it to create a new test set D̂ in which there is a one-to-one mapping between the elements of D̂ and the original test set D: (x, y) ↦ (x̂, y). In particular, an image x̂ ∈ D̂ is constructed so that any feature of C that is useful in classifying x is equally useful in classifying x̂, in the sense that useful features are equally correlated with the label y, and no other possible feature is useful. In short, our construction will ensure that the only useful features for classifying x̂ will be the features of C, which are, by definition, robust features.

Define rep_C(x) as the output of the m-dimensional representation (penultimate) layer of robust classifier C with input x. Then for r ∈ R^m, rep_C^{-1}(r) is a set of images with robust features r. We cannot easily compute rep_C^{-1}(r). However, for each element x of our original test set, we can find an approximate inverse by solving the minimization problem

min_{x_r} ‖rep_C(x_r) − rep_C(x)‖.

We solve this via gradient descent, starting from an image x_0. In other words, starting from a randomly chosen test-set image x_0, we search for an image x_r whose representation rep_C(x_r) is as close as possible to rep_C(x). (A minimal code sketch of this inversion procedure appears at the end of this section.)

Following Ilyas et al. (2019), if we choose the initial image x_0 uniformly from the test set, and the test set has a uniform distribution over labels, then all features in rep_C(x_0) are uncorrelated (in expectation over x_0) with x's label. This ensures (in expectation) that any features that respond to patterns in x_0 will not be correlated with x's label, y. Since we apply gradient descent to x_0 in order to find an x̂ whose representation is as similar as possible to C's representation for x, the only features that correlate with y will be those in C, which are robust by definition. Thus only robust features are useful for classifying x̂.

Following Ilyas et al. (2019), we refer to this inversion process as the robustification of images. Given a test set D, we can construct a new test set D̂ by computing this approximation of rep^{-1}(rep(x)) and labeling it y for every (x, y) ∈ D. If the robust classifier C has robustness parameter ε > 0, we say that D̂ is the ε-robustification of D and refer to it as R_ε. Figure 2 shows examples of the robustification process on CIFAR-10 test images.

We evaluate our methods on two datasets: CIFAR-10 (Krizhevsky et al., 2009) and ImageNet-9, a derivative of ImageNet (Russakovsky et al., 2015) with two key differences from the original ImageNet dataset: first, we use a downsampled version of the ImageNet data for computational feasibility (Chrabaszcz et al., 2017); second, adopting the approach of Ilyas et al. (2019), we reduce the number of classes to nine: dog, cat, frog, turtle, bird, primate, fish, crab, insect. See Ilyas et al. (2019) for further details.

For the CIFAR-10 dataset, we train ResNet50 classifiers (He et al., 2016a;b). However, due to the high computational cost of training robust models on ImageNet-9, we train a classifier with fewer parameters, ResNet20, on the ImageNet-9 dataset.
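As promised above, here is a minimal sketch of the robustification (representation inversion) step, assuming rep_model is a tf.keras.Model that maps an image batch to the robust classifier's representation-layer output. The optimizer choice and pixel range are illustrative assumptions; the default step count and step size mirror the robustified-CIFAR row of Table 4 in the Appendix.

```python
import tensorflow as tf

def robustify(rep_model, x_target, x_seed, steps=400, step_size=0.05):
    """Gradient-descend a seed image until its representation-layer output
    matches that of the target image (approximate inversion of rep_C)."""
    target_rep = rep_model(x_target[None])
    x_r = tf.Variable(x_seed[None], dtype=tf.float32)
    optimizer = tf.keras.optimizers.SGD(learning_rate=step_size)
    for _ in range(steps):
        with tf.GradientTape() as tape:
            loss = tf.norm(rep_model(x_r) - target_rep)
        grads = tape.gradient(loss, [x_r])
        optimizer.apply_gradients(zip(grads, [x_r]))
    # Assumes images live in [0, 1]; clip so the result is a valid image.
    return tf.clip_by_value(x_r[0], 0.0, 1.0)
```

To build a robustified test set R_ε, this procedure is applied to every test image, with the seed drawn at random from the test set and the result keeping the label of the target image.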
4. Results of Experiments
Table 1.
Accuracy of zero-robust, negative-robust, standard, and robust ResNet50 classifiers on the robustified CIFAR-10 test set. All classifiers are trained starting with random initial weights. The Test column is the accuracy on the original CIFAR-10 test set. The R_ε columns give accuracy on the robustified CIFAR-10 test set, generated with respect to a robust ResNet50 with robustness parameter ε.

Classifier     Test   R_0    R_{1/16}  R_{1/4}  R_{1/2}  R_1
Standard       94.1   58.3   79.9      75.1     69.2     56.3
Robust         86.5   9.93   18.2      70.5     86.5     70.0
Zero-robust    46.8   38.8   23.3      22.1     23.4     21.4
Neg-robust     21.9   31.8   11.1      12.8     14.3     13.2
Table 2.
Accuracy of zero-robust, negative-robust, standard, and robust ResNet20 classifiers on the robustified ImageNet-9 test set. Analogous to Table 1. Note: since there are nine classes, chance accuracy is approximately 11%.

Classifier     Test   R_0    R_{1/16}  R_{1/4}  R_{1/2}  R_1
Standard       94.8   51.9   76.3      63.4     54.6     47.2
Robust         82.9   44.1   46.1      49.0     51.7     56.0
Zero-robust    49.2   24.8   24.3      24.6     28.6     31.4
Neg-robust     38.3   25.4   23.2      21.8     24.4     23.7

Recall our definition of Type B features: non-robust features learned by a network that respond to small yet highly predictive patterns that by themselves appear non-semantic, yet are entangled with patterns associated with robust features. Our hypothesis is that Type B features are prevalent in standard (i.e., non-robust) image classifiers, and can be exploited to create adversarial examples. That is, Type C features (which respond to dataset artifacts) are not the only features responsible for adversarial vulnerability.

We present here the result of an experiment in which we evaluate zero-robust and negative-robust classifiers, representing classifiers that rely exclusively on non-robust features, on CIFAR-10 and ImageNet-9 test sets that have been robustified—that is, in which only robust features are useful. The results, given in Tables 1 and 2, confirm our hypothesis: the zero-robust classifier performs above chance on all robustified test sets; this can only occur when Type B features are present. Similarly, the negative-robust classifier, which acts as a control for the problem of robust feature leakage (see Section 3.2), has above-chance accuracy on robustified test sets except when ε is large. We expect this result, since the presence of robust features decreases accuracy in negative-robust classifiers. In fact, in the absence of Type B features, we would expect the negative-robust classifier to perform at below-chance accuracy on the robustified test sets. As above-chance accuracy by the negative-robust classifier can only be explained by the presence of Type B features, the accuracy of the negative-robust classifier serves as a lower bound on the accuracy that can be accounted for by Type B features. The fact that this accuracy is above chance for many of the robustified test sets further confirms our hypothesis, and rejects the possibility that robust feature leakage entirely explains the above-chance accuracy of the zero-robust classifier on the robustified test sets. We include the accuracies of a standard classifier and an example robust classifier for comparison with the zero- and negative-robust classifier results.

We observe that the accuracy of zero-robust and negative-robust classifiers decreases on the robustified test sets as the robustness parameter ε increases. We have already discussed that we expect this in the negative-robust classifier due to the negative influence of robust features on accuracy. However, we hypothesize that the reason zero-robust accuracy decreases is similar to that of the well-established result that the accuracy of robust classifiers decreases on the original test set as the robustness parameter of the classifier increases (Tsipras et al., 2019; Zhang et al., 2019). We confirm this well-known result in Figure 4. We speculate that as robustness increases, the classifier begins to ignore useful features, even some that are robust for smaller ε. Thus, the robustified test sets contain fewer useful features as ε increases, and thus zero-robust accuracy decreases.
5. The Universality of Robust Features
The previous section established that non-robust features respond to patterns that are entangled with the patterns responded to by robust features. However, there may be many equally predictive "non-robust" patterns that can be entangled with a single "robust" pattern. For example, as illustrated in Figure 1, a non-robust Type B feature which responds to the texture of a dog nose may be a component of a robust Type A feature that identifies the shape of an entire dog. However, there may be many components of a particular robust feature that would equivalently predict "dog". A standard (non-robust) classifier C_1 might learn a particular subset of predictive non-robust (i.e., Type B) features, whereas another standard classifier C_2 with a different architecture or different initial weights might learn a different subset, with both classifiers exhibiting similar accuracy on a test set.

Figure 3.
Accuracy of a standard ResNet50 classifier on robustifications of the CIFAR-10 test dataset, generated with respect to robust ResNet50 classifiers with varying robustness parameter ε. See Appendix for the analogous graph for ImageNet-9. (Best viewed in color.)

The features of both classifiers would be responding to different subpatterns of the same underlying "robust" patterns; in this case, the features learned by C_1 would not strongly overlap the features learned by C_2.

Features learned by a classifier that tend to overlap with the features learned by other classifiers trained on the same dataset are characterized by universality: they are roughly the same across classifiers (Olah et al., 2020). Li et al. (2015) called this phenomenon "convergent learning".

We hypothesize that robust features will be more universal than non-robust features, since universal features likely represent the underlying, better-generalizing properties of the dataset that are captured by robust features. Analogous to the way we demonstrated the existence of Type B features entangled with Type A features in Section 3, here we give evidence for the hypothesis that robust features are more universal than non-robust features by evaluating standard classifiers on robustified test sets.

Recall that for a test set D_T that has been robustified with respect to a robust classifier C, the only features that are useful for classification are the robust features useful to C. If a different classifier C′ has high accuracy on D_T, then C′ must be using the same (or very similar) features as C. If many different classifiers have high accuracy on D_T, then we can say that C's robust features are universal.

Figure 4.
Accuracy of robust ResNet50 classifiers trained on the CIFAR-10 training data and evaluated on the CIFAR-10 original test data. See Appendix for the analogous graph for ImageNet-9. (Best viewed in color.)
Figure 3 shows the accuracy of a standard (non-robust) ResNet50 classifier, trained on the CIFAR-10 training set, when tested on robustified CIFAR-10 test sets with varying robustness parameter ε. The figure shows that the standard classifier, which we will call C′, has significantly higher accuracy on robustified test sets with 0 < ε ≤ 1/2, with a peak at ε = 1/32.

To show why this supports the hypothesis that robust features are more universal than non-robust features, let C_ε denote the ε-robust ResNet50 classifier used to create the robustified test set R_ε. Figure 3 shows that the accuracy of C′ is low on R_0, which was designed specifically so that the features of C_0 would be useful to classify it. This means that C_0's features, which are likely to be mostly non-robust since ε = 0, tend not to be shared by C′. However, the accuracy of C′ on R_ε for 0 < ε ≤ 1/2 is dramatically higher, meaning that the features that are useful to the classifiers C_ε are also useful to C′, for ε in that range. Thus the robust features useful to these C_ε tend to be shared by C′. Repeated evaluations with different initial weights for C′ exhibited the same behavior. Thus, the results shown in Figure 3 support the hypothesis that robust features are more universal than non-robust features.

It is noteworthy that accuracy is highest for a relatively small robustness parameter of ε = 1/32. We hypothesize that as robustness increases, while patterns associated with Type B features may remain present in the robustified dataset, patterns associated with Type C features, which were present in the original dataset, may be progressively removed, causing a detriment to accuracy as the robustness parameter increases significantly. Similarly, we speculate that this may also explain the well-known decrease in accuracy on original test data associated with an increase in model robustness, as shown in Figure 4 (Tsipras et al., 2019; Zhang et al., 2019).

Our results suggest that even when performance is critical and the drop in accuracy associated with adversarial training is unacceptable, adding even slight robustness (ε = 1/32) can drastically improve the universality of the learned features of a neural network. Networks can be improved in this way with little to no additional computational cost due to recent advances in improving the efficiency of adversarial training (Jeddi et al., 2020; Shafahi et al., 2019; Wong et al., 2020).

We repeat these experiments using ResNet20 classifiers trained on the ImageNet-9 dataset and find similar results, presented in the Appendix.
Figure 5.
Error rate of different standard classifiers evaluated on untargeted transfer attacks. Each error rate is measured on a set of untargeted adversarial examples generated from the CIFAR-10 test set to attack a source classifier (ResNet50) with robustness parameter ε. Each adversarial example was generated with a perturbation δ where ‖δ‖ < 2.0. A higher error corresponds to a stronger transfer attack. See Appendix for the analogous graph for ImageNet-9. (Best viewed in color.)

Adversarial examples designed to fool a given classifier are sometimes transferable: they can also successfully fool other classifiers, even those with architectures different from that of the original ("source") classifier. This is true both for untargeted adversarial examples x_u, which are considered to be successful if a classifier predicts any incorrect label when given x_u, and for targeted adversarial examples x_t, which are considered to be successful if a classifier predicts a specific targeted (incorrect) label y_t when given x_t. It is typically much harder to design transferable targeted than untargeted adversarial examples (Liu et al., 2016).
Figure 6.
Accuracy of different standard classifiers on targeted transfer attacks. Each accuracy is measured as the fraction of times the target label y_t is predicted for targeted adversarial example x_t. The evaluation is done on a set of targeted adversarial examples generated from the CIFAR-10 test set to attack a source ResNet50 classifier with robustness parameter ε. Each adversarial example was generated with a perturbation δ where ‖δ‖ < 2.0. A higher accuracy corresponds to a stronger transfer attack. See Appendix for the analogous graph for ImageNet-9. (Best viewed in color.)

Our earlier hypothesis—that robust features are more universal than non-robust features—suggests that adversarial examples designed to fool a robust classifier (i.e., that exploit robust features) should be more transferable than adversarial examples designed to fool a non-robust classifier. In this section we give evidence for this conclusion: we show that if a robust classifier is used as a source model for creating adversarial examples, those examples transfer more effectively than if a non-robust classifier is used as the source. This is an important novel result for researchers trying to create transferable adversarial examples: one should use a robust model as the source, instead of a standard model.

To give evidence for these hypotheses, we train ResNet50 classifiers on the CIFAR-10 training set using adversarial training with varying robustness parameters ε (in addition, we train ResNet20 classifiers on ImageNet-9 data; see Appendix). We then use these robust classifiers to generate targeted and untargeted adversarial examples via projected gradient descent (Madry et al., 2017), using the standard adversarial objectives. In the untargeted case, given an initial input x and a parameter ε specifying the maximum L2 norm of a permissible perturbation δ, we seek to find an adversarial example x_u = x + δ̂ (where x is the original example with
The success of transferable adversarial examplesdecreases as ε is further increased.Figure 6 shows that transferable targeted adversarial exam-ples attain a more striking increase in success by increas-ing the robustness parameter of the source classifier. Ourtechnique for generating transferable targeted adversarialexamples is largely ineffective when the source classifieris equivalent to standard classifier (i.e., ε = 0 ). However,using a robust source classifier with ε ≈ / dramaticallyincreases the success of targeted transfer attacks.Experiments on the ImageNet-9 dataset using ResNet20 assource models behave similarly (see Appendix for details).Our goal here is not to generate state-of-the-art transfer at-tacks; we leave this to future work. Thus, we do not employcomplex state-of-the-art transfer attacks and instead opt forthe simple projected gradient descent approach. Nonethe-less, our results confirm that robust features with small ro-bustness parameter have high universality and can improvetransfer attacks, especially in the targeted case, in whichimprovements are dramatic.
6. Related Work
Adversarial examples and the associated notion of adversarial robustness have been studied extensively in the machine learning literature (Allen-Zhu & Li, 2020; Athalye et al., 2018; Biggio et al., 2013; Carlini & Wagner, 2017a;b; Carlini et al., 2019; Cohen et al., 2019; Engstrom et al., 2019b; Feinman et al., 2017; Goodfellow et al., 2014; Kaur et al., 2019; Madry et al., 2017; Metzen et al., 2017; Moosavi-Dezfooli et al., 2016; Papernot et al., 2016b; Raghunathan et al., 2018; Santurkar et al., 2019; Shafahi et al., 2018; Stutz et al., 2019; Szegedy et al., 2014; Uesato et al., 2018; Warde-Farley & Goodfellow, 2016). Non-robust features were studied directly by Ilyas et al. (2019), who find that non-robust features are present in image training datasets. By contrast, we establish an entanglement relationship between robust and non-robust features.

We hypothesize that the overlap between robust and non-robust features may be explained in part by a simplicity bias (Arpit et al., 2017; De Palma et al., 2019; Nakkiran et al., 2019; Shah et al., 2020; Valle-Pérez et al., 2018; Wu et al., 2017), which suggests that neural networks learn simple functions more easily than complex functions, and by gradient starvation (Combes et al., 2018; Pezeshki et al., 2020), which suggests that once highly predictive features are learned, other features become increasingly difficult to learn due to a diminishing gradient. If non-robust features are learned more readily and support sufficiently low classification loss, such a combination may impede the learning of robust features; see, for instance, Nakkiran (2019). Texture bias may be similarly related (Hermann et al., 2020).

Many papers have shown that deep learning classifiers may rely on non-semantic cues or shortcuts (Geirhos et al., 2020; Jo & Bengio, 2017; McCoy et al., 2019; Wang et al., 2020; Wei, 2020) and that optimization-based feature visualization may be inadequate for visualizing neural network features (Borowski et al., 2020; Hendrycks et al., 2019). On the other hand, there has been work on feature visualization (Aubry & Russell, 2015; Dosovitskiy & Brox, 2016; Mahendran & Vedaldi, 2015; Olah et al., 2017; Simonyan et al., 2013; Zhang & Zhu, 2018).

Related to our hypothesis that robust features are more universal than non-robust features, prior work has found that adversarial robustness serves as an effective prior for transfer learning (Liang et al., 2020; Salman et al., 2020; Terzi et al., 2020; Utrera et al., 2020). There has also been work on the universality hypothesis (Kornblith et al., 2019; Li et al., 2015; Olah et al., 2020; Raghu et al., 2017). Additionally, there has been significant work on constructing and evaluating transferable targeted adversarial examples (Eykholt et al., 2018; Kurakin et al., 2016; Liu et al., 2016; Moosavi-Dezfooli et al., 2017; Papernot et al., 2016a; 2017; Tramèr et al., 2017).
7. Conclusion
In this paper, we have presented an empirical study demonstrating that robust and non-robust features can be entangled in standard neural network classifiers. In particular, we have proposed three different types of features: robust features (Type A), non-robust features that are entangled with robust features (Type B), and non-robust features that are not entangled with robust features (Type C). Type A features are known to appear significantly more semantic than non-robust features (Engstrom et al., 2019b; Kaur et al., 2019; Santurkar et al., 2019). To our knowledge, no previous study has addressed the question of whether the non-robust features underlying adversarial vulnerability are non-semantic yet highly predictive artifacts in image datasets, or whether these features appear non-semantic for a different reason. We present evidence that at least some non-robust features indicate the presence of the same patterns in the dataset as robust features, suggesting that while these features appear non-semantic when observed through the lens of their gradients and adversarial perturbations, they may, in fact, indicate the presence of patterns that are aligned with human perception.

In addition, we provide evidence that robust features can be thought of as more universal than non-robust features, and as a result, we find that robust classifiers are more effective as source classifiers for generating transferable adversarial examples in both the untargeted and targeted settings.

We believe that this work is an important step towards understanding the nature of non-robust features and answering the universality hypothesis (Olah et al., 2020). With a theory of robust and non-robust features, and thus of adversarial vulnerability and resilience, we hope to eventually understand how to build more interpretable classifiers and be able to better defend against adversarial attacks.
Acknowledgements
Research presented in this article was supported by the Laboratory Directed Research and Development program of Los Alamos National Laboratory under project number 20210043DR.

The authors would like to thank Rory Soiffer for his helpful ideas and discussion.
References
Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z.,Citro, C., Corrado, G. S., Davis, A., Dean, J., Devin, M.,Ghemawat, S., Goodfellow, I., Harp, A., Irving, G., Isard,M., Jia, Y., Jozefowicz, R., Kaiser, L., Kudlur, M., Lev-enberg, J., Mané, D., Monga, R., Moore, S., Murray, D.,Olah, C., Schuster, M., Shlens, J., Steiner, B., Sutskever, I., Talwar, K., Tucker, P., Vanhoucke, V., Vasudevan,V., Viégas, F., Vinyals, O., Warden, P., Wattenberg, M.,Wicke, M., Yu, Y., and Zheng, X. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015.URL . Softwareavailable from tensorflow.org.Allen-Zhu, Z. and Li, Y. Feature purification: How adversar-ial training performs robust deep learning. arXiv preprintarXiv:2005.10190 , 2020.Arpit, D., Jastrz˛ebski, S., Ballas, N., Krueger, D., Bengio,E., Kanwal, M. S., Maharaj, T., Fischer, A., Courville, A.,Bengio, Y., et al. A closer look at memorization in deepnetworks. arXiv preprint arXiv:1706.05394 , 2017.Athalye, A., Engstrom, L., Ilyas, A., and Kwok, K. Syn-thesizing robust adversarial examples. In
Internationalconference on machine learning , pp. 284–293. PMLR,2018.Aubry, M. and Russell, B. C. Understanding deep featureswith computer-generated imagery. In
Proceedings of theIEEE International Conference on Computer Vision , pp.2875–2883, 2015.Biggio, B., Corona, I., Maiorca, D., Nelson, B., Šrndi´c, N.,Laskov, P., Giacinto, G., and Roli, F. Evasion attacksagainst machine learning at test time. In
Joint Europeanconference on machine learning and knowledge discoveryin databases , pp. 387–402. Springer, 2013.Borowski, J., Zimmermann, R. S., Schepers, J., Geirhos, R.,Wallis, T. S., Bethge, M., and Brendel, W. Exemplarynatural images explain cnn activations better than featurevisualizations. arXiv preprint arXiv:2010.12606 , 2020.Carlini, N. and Wagner, D. Adversarial examples are noteasily detected: Bypassing ten detection methods. In
Proceedings of the 10th ACM Workshop on ArtificialIntelligence and Security , pp. 3–14, 2017a.Carlini, N. and Wagner, D. Towards evaluating the robust-ness of neural networks. In , pp. 39–57. IEEE, 2017b.Carlini, N., Athalye, A., Papernot, N., Brendel, W., Rauber,J., Tsipras, D., Goodfellow, I., Madry, A., and Kurakin,A. On evaluating adversarial robustness. arXiv preprintarXiv:1902.06705 , 2019.Chollet, F. Xception: Deep learning with depthwise separa-ble convolutions. In
Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1251–1258, 2017. Chollet, F. et al. Keras. https://keras.io, 2015.
Chrabaszcz, P., Loshchilov, I., and Hutter, F. A downsam-pled variant of imagenet as an alternative to the cifardatasets. arXiv preprint arXiv:1707.08819 , 2017.Cohen, J., Rosenfeld, E., and Kolter, Z. Certified adver-sarial robustness via randomized smoothing. In
Interna-tional Conference on Machine Learning , pp. 1310–1320.PMLR, 2019.Combes, R. T. d., Pezeshki, M., Shabanian, S., Courville,A., and Bengio, Y. On the learning dynamics of deepneural networks. arXiv preprint arXiv:1809.06848 , 2018.De Palma, G., Kiani, B., and Lloyd, S. Random deepneural networks are biased towards simple functions. In
Advances in Neural Information Processing Systems , pp.1964–1976, 2019.Dosovitskiy, A. and Brox, T. Inverting visual representationswith convolutional networks. In
Proceedings of the IEEEconference on computer vision and pattern recognition ,pp. 4829–4837, 2016.Engstrom, L., Gilmer, J., Goh, G., Hendrycks, D., Ilyas,A., Madry, A., Nakano, R., Nakkiran, P., Santurkar,S., Tran, B., Tsipras, D., and Wallace, E. A discus-sion of ’adversarial examples are not bugs, they arefeatures’.
Distill , 2019a. doi: 10.23915/distill.00019.https://distill.pub/2019/advex-bugs-discussion.Engstrom, L., Ilyas, A., Santurkar, S., Tsipras, D.,Tran, B., and Madry, A. Adversarial robustness asa prior for learned representations. arXiv preprintarXiv:1906.00945 , 2019b.Eykholt, K., Evtimov, I., Fernandes, E., Li, B., Rahmati, A.,Xiao, C., Prakash, A., Kohno, T., and Song, D. Robustphysical-world attacks on deep learning visual classifica-tion. In
Proceedings of the IEEE Conference on ComputerVision and Pattern Recognition , pp. 1625–1634, 2018.Feinman, R., Curtin, R. R., Shintre, S., and Gardner, A. B.Detecting adversarial samples from artifacts. arXivpreprint arXiv:1703.00410 , 2017.Geirhos, R., Jacobsen, J.-H., Michaelis, C., Zemel, R.,Brendel, W., Bethge, M., and Wichmann, F. A. Short-cut learning in deep neural networks. arXiv preprintarXiv:2004.07780 , 2020.Goodfellow, I. J., Shlens, J., and Szegedy, C. Explain-ing and harnessing adversarial examples. arXiv preprintarXiv:1412.6572 , 2014.He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learn-ing for image recognition. In
Proceedings of the IEEEconference on computer vision and pattern recognition ,pp. 770–778, 2016a. He, K., Zhang, X., Ren, S., and Sun, J. Identity mappingsin deep residual networks. In
European conference oncomputer vision , pp. 630–645. Springer, 2016b.Hendrycks, D., Zhao, K., Basart, S., Steinhardt, J., andSong, D. Natural adversarial examples. arXiv preprintarXiv:1907.07174 , 2019.Hermann, K., Chen, T., and Kornblith, S. The originsand prevalence of texture bias in convolutional neuralnetworks.
Advances in Neural Information ProcessingSystems , 33, 2020.Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D., Wang,W., Weyand, T., Andreetto, M., and Adam, H. Mobilenets:Efficient convolutional neural networks for mobile visionapplications. arXiv preprint arXiv:1704.04861 , 2017.Huang, G., Liu, Z., Van Der Maaten, L., and Weinberger,K. Q. Densely connected convolutional networks. In
Proceedings of the IEEE conference on computer visionand pattern recognition , pp. 4700–4708, 2017.Ilyas, A., Santurkar, S., Tsipras, D., Engstrom, L., Tran,B., and Madry, A. Adversarial examples are not bugs,they are features. In
Advances in Neural InformationProcessing Systems , pp. 125–136, 2019.Jeddi, A., Shafiee, M. J., and Wong, A. A simple fine-tuning is all you need: Towards robust deep learning viaadversarial fine-tuning. arXiv preprint arXiv:2012.13628 ,2020.Jo, J. and Bengio, Y. Measuring the tendency of cnnsto learn surface statistical regularities. arXiv preprintarXiv:1711.11561 , 2017.Kaur, S., Cohen, J., and Lipton, Z. C. Are perceptually-aligned gradients a general property of robust classifiers? arXiv preprint arXiv:1910.08640 , 2019.Kingma, D. P. and Ba, J. Adam: A method for stochasticoptimization. arXiv preprint arXiv:1412.6980 , 2014.Kornblith, S., Norouzi, M., Lee, H., and Hinton, G. Similar-ity of neural network representations revisited. In
Interna-tional Conference on Machine Learning , pp. 3519–3529.PMLR, 2019.Krizhevsky, A., Hinton, G., et al. Learning multiple layersof features from tiny images.
Master’s thesis, Departmentof Computer Science, University of Toronto , 2009.Krizhevsky, A., Sutskever, I., and Hinton, G. E. Imagenetclassification with deep convolutional neural networks.
Advances in neural information processing systems, 25:1097–1105, 2012.
Kurakin, A., Goodfellow, I., and Bengio, S. Adversar-ial examples in the physical world. arXiv preprintarXiv:1607.02533 , 2016.LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition.
Proceed-ings of the IEEE , 86(11):2278–2324, 1998.Li, Y., Yosinski, J., Clune, J., Lipson, H., and Hopcroft, J. E.Convergent learning: Do different neural networks learnthe same representations? In
FE@ NIPS , pp. 196–212,2015.Liang, K., Zhang, J. Y., Koyejo, O., and Li, B. Does adver-sarial transferability indicate knowledge transferability? arXiv preprint arXiv:2006.14512 , 2020.Liu, Y., Chen, X., Liu, C., and Song, D. Delving intotransferable adversarial examples and black-box attacks. arXiv preprint arXiv:1611.02770 , 2016.Madry, A., Makelov, A., Schmidt, L., Tsipras, D., andVladu, A. Towards deep learning models resistant toadversarial attacks. arXiv preprint arXiv:1706.06083 ,2017.Mahendran, A. and Vedaldi, A. Understanding deep im-age representations by inverting them. In
Proceedingsof the IEEE conference on computer vision and patternrecognition , pp. 5188–5196, 2015.McCoy, R. T., Pavlick, E., and Linzen, T. Right for thewrong reasons: Diagnosing syntactic heuristics in naturallanguage inference. arXiv preprint arXiv:1902.01007 ,2019.Metzen, J. H., Genewein, T., Fischer, V., and Bischoff, B.On detecting adversarial perturbations. arXiv preprintarXiv:1702.04267 , 2017.Moosavi-Dezfooli, S.-M., Fawzi, A., and Frossard, P. Deep-fool: a simple and accurate method to fool deep neuralnetworks. In
Proceedings of the IEEE conference oncomputer vision and pattern recognition , pp. 2574–2582,2016.Moosavi-Dezfooli, S.-M., Fawzi, A., Fawzi, O., andFrossard, P. Universal adversarial perturbations. In
Pro-ceedings of the IEEE conference on computer vision andpattern recognition , pp. 1765–1773, 2017.Nakkiran, P. Adversarial robustness may be at odds withsimplicity. arXiv preprint arXiv:1901.00532 , 2019.Nakkiran, P., Kalimeris, D., Kaplun, G., Edelman, B., Yang,T., Barak, B., and Zhang, H. Sgd on neural networkslearns functions of increasing complexity. In
Advances inNeural Information Processing Systems , pp. 3496–3506,2019. Olah, C., Mordvintsev, A., and Schubert, L. Feature visual-ization.
Distill , 2(11):e7, 2017.Olah, C., Cammarata, N., Schubert, L., Goh, G., Petrov,M., and Carter, S. Zoom in: An introduction to cir-cuits.
Distill , 2020. doi: 10.23915/distill.00024.001.https://distill.pub/2020/circuits/zoom-in.Papernot, N., McDaniel, P., and Goodfellow, I. Transfer-ability in machine learning: from phenomena to black-box attacks using adversarial samples. arXiv preprintarXiv:1605.07277 , 2016a.Papernot, N., McDaniel, P., Jha, S., Fredrikson, M., Celik,Z. B., and Swami, A. The limitations of deep learning inadversarial settings. In , pp. 372–387. IEEE,2016b.Papernot, N., McDaniel, P., Goodfellow, I., Jha, S., Celik,Z. B., and Swami, A. Practical black-box attacks againstmachine learning. In
Proceedings of the 2017 ACM onAsia conference on computer and communications secu-rity , pp. 506–519, 2017.Pezeshki, M., Kaba, S.-O., Bengio, Y., Courville, A., Precup,D., and Lajoie, G. Gradient starvation: A learning procliv-ity in neural networks. arXiv preprint arXiv:2011.09468 ,2020.Raghu, M., Gilmer, J., Yosinski, J., and Sohl-Dickstein,J. Svcca: Singular vector canonical correlation analysisfor deep learning dynamics and interpretability. arXivpreprint arXiv:1706.05806 , 2017.Raghunathan, A., Steinhardt, J., and Liang, P. Certifieddefenses against adversarial examples. arXiv preprintarXiv:1801.09344 , 2018.Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S.,Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein,M., Berg, A. C., and Fei-Fei, L. ImageNet Large ScaleVisual Recognition Challenge.
International Journal ofComputer Vision (IJCV) , 115(3):211–252, 2015. doi:10.1007/s11263-015-0816-y.Salman, H., Ilyas, A., Engstrom, L., Kapoor, A., and Madry,A. Do adversarially robust imagenet models transferbetter? arXiv preprint arXiv:2007.08489 , 2020.Santurkar, S., Ilyas, A., Tsipras, D., Engstrom, L., Tran, B.,and Madry, A. Image synthesis with a single (robust)classifier. In
Advances in Neural Information Processing Systems, pp. 1262–1273, 2019. Shafahi, A., Huang, W. R., Studer, C., Feizi, S., and Goldstein, T. Are adversarial examples inevitable? arXiv preprint arXiv:1809.02104, 2018.
Shafahi, A., Najibi, M., Ghiasi, A., Xu, Z., Dickerson,J., Studer, C., Davis, L. S., Taylor, G., and Goldstein,T. Adversarial training for free! arXiv preprintarXiv:1904.12843 , 2019.Shah, H., Tamuly, K., Raghunathan, A., Jain, P., and Netra-palli, P. The pitfalls of simplicity bias in neural networks. arXiv preprint arXiv:2006.07710 , 2020.Simonyan, K. and Zisserman, A. Very deep convolu-tional networks for large-scale image recognition. arXivpreprint arXiv:1409.1556 , 2014.Simonyan, K., Vedaldi, A., and Zisserman, A. Deep in-side convolutional networks: Visualising image clas-sification models and saliency maps. arXiv preprintarXiv:1312.6034 , 2013.Stutz, D., Hein, M., and Schiele, B. Disentangling adversar-ial robustness and generalization. In
Proceedings of theIEEE/CVF Conference on Computer Vision and PatternRecognition , pp. 6976–6987, 2019.Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan,D., Goodfellow, I., and Fergus, R. Intriguing propertiesof neural networks. In , 2014.Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., and Wojna,Z. Rethinking the inception architecture for computer vi-sion. In
Proceedings of the IEEE conference on computervision and pattern recognition , pp. 2818–2826, 2016.Terzi, M., Achille, A., Maggipinto, M., and Susto, G. A.Adversarial training reduces information and improvestransferability. arXiv preprint arXiv:2007.11259 , 2020.Tramèr, F., Papernot, N., Goodfellow, I., Boneh, D., and Mc-Daniel, P. The space of transferable adversarial examples. arXiv preprint arXiv:1704.03453 , 2017.Tsipras, D., Santurkar, S., Engstrom, L., Turner, A., andMadry, A. Robustness may be at odds with accuracy. In
International Conference on Learning Representations ,2019.Uesato, J., O’donoghue, B., Kohli, P., and Oord, A. Ad-versarial risk and the dangers of evaluating against weakattacks. In
International Conference on Machine Learn-ing , pp. 5025–5034. PMLR, 2018.Utrera, F., Kravitz, E., Erichson, N. B., Khanna, R., andMahoney, M. W. Adversarially-trained deep nets transferbetter. arXiv preprint arXiv:2007.05869 , 2020.Valle-Pérez, G., Camargo, C. Q., and Louis, A. A. Deeplearning generalizes because the parameter-function mapis biased towards simple functions. arXiv preprintarXiv:1805.08522 , 2018. Wang, H., Wu, X., Huang, Z., and Xing, E. P. High-frequency component helps explain the generalizationof convolutional neural networks. In
Proceedings of theIEEE/CVF Conference on Computer Vision and PatternRecognition , pp. 8684–8694, 2020.Warde-Farley, D. and Goodfellow, I. Adversarial perturba-tions of deep neural networks.
Perturbations, Optimiza-tion, and Statistics , 311, 2016.Wei, K.-A. A.
Understanding non-robust features in imageclassification . PhD thesis, Massachusetts Institute ofTechnology, 2020.Wong, E., Rice, L., and Kolter, J. Z. Fast is better thanfree: Revisiting adversarial training. arXiv preprintarXiv:2001.03994 , 2020.Wu, L., Zhu, Z., et al. Towards understanding generalizationof deep learning: Perspective of loss landscapes. arXivpreprint arXiv:1706.10239 , 2017.Zhang, H., Yu, Y., Jiao, J., Xing, E., El Ghaoui, L., andJordan, M. Theoretically principled trade-off betweenrobustness and accuracy. In
International Conference onMachine Learning , pp. 7472–7482. PMLR, 2019.Zhang, Q.-s. and Zhu, S.-C. Visual interpretability for deeplearning: a survey.
Frontiers of Information Technology & Electronic Engineering, 19(1):27–39, 2018.
A. Experimental Setup
In this section, we describe details of the methods that we use to train our standard, robust, zero-robust, and negative-robust models, construct robustified images, and construct adversarial examples.
A.1. Dataset Considerations
Our experiments involve training many models, a large portion of which require adversarial training. Similarly, our evaluation procedures involve constructing robust datasets, another highly computationally expensive operation. In order to ensure the computational feasibility of these experiments, we conduct all experiments on two image datasets: CIFAR-10 (Krizhevsky et al., 2009) and a version of ImageNet (Russakovsky et al., 2015) which has been downsampled (Chrabaszcz et al., 2017) and in which we only use images that fall under one of nine non-overlapping class labels: dog (classes 151–268), cat (classes 281–285), frog (classes 30–32), turtle (classes 33–37), bird (classes 80–100), primate (classes 365–382), fish (classes 389–397), crab (classes 118–121), insect (classes 300–319). This dataset is the same subset of the ImageNet dataset used by Ilyas et al. (2019), with the additional modification that our dataset is downsampled. CIFAR-10 has 50,000 training images and ImageNet-9 has approximately 200,000.

While we conduct our experiments only on these two datasets, we expect that our results should be general across all image datasets, and likely across other domains as well.
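For concreteness, the class ranges above can be turned into a simple label-mapping helper. This sketch uses only the (inclusive) ImageNet class-index ranges listed in the text; the dictionary and function names are ours.

```python
# Original ImageNet class-index ranges (inclusive) for each ImageNet-9 label.
IMAGENET9_RANGES = {
    "dog": (151, 268), "cat": (281, 285), "frog": (30, 32),
    "turtle": (33, 37), "bird": (80, 100), "primate": (365, 382),
    "fish": (389, 397), "crab": (118, 121), "insect": (300, 319),
}

def imagenet9_label(imagenet_class):
    """Return the ImageNet-9 label for an ImageNet class index,
    or None if the class is outside the nine-class subset."""
    for label, (lo, hi) in IMAGENET9_RANGES.items():
        if lo <= imagenet_class <= hi:
            return label
    return None
```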
A.2. Training & Evaluation Details

We describe the training procedure for the four different types of models that we construct: standard, robust, zero-robust, and negative-robust.

For each model, we train using the Adam optimizer (Kingma & Ba, 2014) with default TensorFlow parameters (β_1 = 0.9, β_2 = 0.999) (Abadi et al., 2015). The learning rate for each model is specified in Table 3. Each model was trained for at most 100 epochs. Models with an epoch count marked with an asterisk (*) had their training stopped early in order to maximize performance on a validation set. Models that are not marked with an asterisk were trained for the entire 100 epochs.

Standard Models
To train standard models, we minimize the expected cross-entropy loss over the training dataset. All parameters are specified in Table 3.
Robust Models
To train robust models, we compute a set of adversarial examples for each batch during training. We use projected gradient descent to construct these adversarial examples (Madry et al., 2017). For adversarial examples constructed from CIFAR-10 images, we use 16 steps with a step size of 0.5. For adversarial examples constructed from ImageNet images, we use 8 steps with a step size of 1.0. For adversarial training, we train to minimize the expected cross-entropy loss of the network evaluated on adversarial examples labeled with the true label. All other parameters are specified in Table 3.
Zero-Robust Models
To train zero-robust models, we apply the methods proposed in Section 3.2. In particular, we construct adversarial examples (see Table 4 for parameters) using projected gradient descent to minimize the targeted adversarial loss function

min_{‖δ‖≤ε} L(θ, x + δ, ŷ),

where x is the original image, ε is the maximum allowed L2 norm difference between the adversarial example and the original image, and ŷ is the targeted class. For each image in CIFAR-10, we construct ten adversarial examples, one targeting each of the ten classes, each labeled as its target class. For each image in the ImageNet-9 dataset, we construct a single adversarial example with a target selected uniformly at random from the nine possible classes, similarly labeled as the target class. We train zero-robust models for both CIFAR-10 and ImageNet-9 using standard training on the appropriate constructed dataset. The hyperparameters for training the models can be found in Table 3.

Negative-Robust Models
We train negative-robust models in the same way as we train zero-robust models, except with a slightly different dataset. For each image with label y in the CIFAR-10 dataset (labels numbered 0–9 to represent each class in CIFAR-10), we construct a new dataset with an adversarial example targeting the class y + 1 mod 10, labeled as this target class. We construct the ImageNet-9 version similarly, using y + 1 mod 9 as the target label instead. The parameters for constructing the dataset can be found in Table 4 and the hyperparameters for training the models in Table 3.

Victim Models
Victim Models

To test the transferability of adversarial examples generated on robust models, we train standard models with different architectures: DenseNet121, InceptionV3, MobileNetV2, VGG16, ResNet50V2, and Xception (Chollet, 2017; Howard et al., 2017; Huang et al., 2017; Simonyan & Zisserman, 2014; Szegedy et al., 2016). We initialize these models with weights pre-trained on the original ImageNet dataset (Chollet et al., 2015). For the hyperparameters, see Table 3.
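Such victim models can be assembled from tf.keras.applications backbones with ImageNet-pretrained weights; the sketch below adds a fresh 9-way classification head and is only illustrative (the input shape, preprocessing, and exact head used in our experiments are not specified here).

import tensorflow as tf

def build_victim(backbone_cls, input_shape, num_classes=9):
    """ImageNet-pretrained backbone plus a new linear head, trained with Adam at 1e-4 (Table 3)."""
    backbone = backbone_cls(weights="imagenet", include_top=False,
                            input_shape=input_shape, pooling="avg")
    inputs = tf.keras.Input(shape=input_shape)
    outputs = tf.keras.layers.Dense(num_classes)(backbone(inputs))
    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
                  loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
                  metrics=["accuracy"])
    return model

victim_backbones = {
    "DenseNet121": tf.keras.applications.DenseNet121,
    "InceptionV3": tf.keras.applications.InceptionV3,
    "MobileNetV2": tf.keras.applications.MobileNetV2,
    "VGG16": tf.keras.applications.VGG16,
    "ResNet50V2": tf.keras.applications.ResNet50V2,
    "Xception": tf.keras.applications.Xception,
}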
Table 3.
Model parameters for each model trained in this paper. Parameters were selected by hand. In addition to what is listed below, the zero- and negative-robust models for both ImageNet-9 and CIFAR-10 were stopped prior to 100 epochs in order to maximize accuracy on a validation set. Each model was trained on an IBM Power9 machine with two NVIDIA Tesla V100 GPUs. Each standard model took approximately 24 hours to train; each robust model took approximately 48 hours.
Model                    | Architecture  | Batch size | Epochs | Learning rate | LR decay | Data augmentation | Pretraining
Standard (CIFAR)         | ResNet56 (v2) | 200        | 100    | 1e-3          | Yes      | Yes               | No
Robust (CIFAR)           | ResNet56 (v2) | 200        | 100    | 1e-3          | Yes      | Yes               | No
Zero-robust (CIFAR)      | ResNet56 (v2) | 200        | 100*   | 1e-4          | Yes      | No                | No
Neg-robust (CIFAR)       | ResNet56 (v2) | 200        | 100*   | 1e-4          | Yes      | No                | No
Victim models (CIFAR)    | Misc          | 200        | 100    | 1e-4          | Yes      | Yes               | Yes
Standard (ImageNet)      | ResNet20 (v2) | 128        | 100    | 1e-3          | Yes      | Yes               | No
Robust (ImageNet)        | ResNet20 (v2) | 128        | 100    | 1e-3          | Yes      | Yes               | No
Zero-robust (ImageNet)   | ResNet20 (v2) | 128        | 100*   | 1e-3          | Yes      | No                | No
Neg-robust (ImageNet)    | ResNet20 (v2) | 128        | 100*   | 1e-3          | Yes      | No                | No
Victim models (ImageNet) | Misc          | 128        | 100    | 1e-4          | Yes      | Yes               | Yes
Table 4.
Parameters for the various datasets constructed in this paper. Parameters were selected by hand.
Dataset                                    | Maximum allowed change (L2 norm from seed) | Number of steps | Step size
Robustified (CIFAR)                        | ∞                                          | 400             | 0.05
Zero-robust (CIFAR)                        | 0.5                                        | 200             | 0.1
Neg-robust (CIFAR)                         | 0.5                                        | 200             | 0.1
Untargeted Adversarial Examples (CIFAR)    | 2.0                                        | 100             | 0.1
Targeted Adversarial Examples (CIFAR)      | 2.0                                        | 100             | 0.1
Robustified (ImageNet)                     | ∞                                          | 200             | 0.1
Zero-robust (ImageNet)                     | 0.5                                        | 200             | 0.1
Neg-robust (ImageNet)                      | 0.5                                        | 200             | 0.1
Untargeted Adversarial Examples (ImageNet) | 4.0                                        | 100             | 0.1
Targeted Adversarial Examples (ImageNet)   | 4.0                                        | 100             | 0.1
B. Extended Results
In this section, we present extended results, including the extended data from Tables 1 and 2 (Tables 5 and 6). In addition, we include ImageNet-9 versions of Figures 3, 5, and 6 (Figures 7, 9, and 8). We present the accuracy of each of our robust classifiers on the original ImageNet-9 test set (Figure 10). We include extended examples of robustified images (Figures 11 and 12). Finally, we provide examples of adversarial examples generated on robust models (Figures 13 and 14).
Table 5.
Extended data: Accuracy of zero-robust, negative-robust, standard, and robust ResNet50 classifiers on the robustified CIFAR-10 test set. All classifiers are trained starting with random initial weights. The Test column is the accuracy on the original CIFAR-10 test set. The R_ε columns give accuracy on robustified CIFAR-10 test sets, each generated with respect to a robust ResNet50 with robustness parameter ε.

Classifier        | Test | R_ε columns (one per robustness parameter ε)
Standard          | 94.1 | 58.3  77.3  81.4  82.2  83.7  79.9  79.0  75.1  69.2  56.3
Robust (ε = 0.)   | 86.5 |  9.93  9.97  9.96 10.0  11.1  18.2  44.5  70.5  86.5  70.0
Zero-robust       | 46.8 | 38.8  35.1  26.1  27.2  26.2  23.3  22.6  22.1  23.4  21.4
Negative-robust   | 21.9 | 31.8  15.8  14.6  16.6  11.3  11.1  11.0  12.8  14.3  13.2

Table 6.
Extended data: Accuracy of zero-robust, negative-robust, standard, and robust ResNet20 classifiers on the robustified ImageNet-9 test set, analogous to Table 5. Note: since there are nine classes, chance accuracy is approximately 11.1%.

Classifier        | Test | R_ε columns (one per robustness parameter ε)
Standard          | 94.8 | 51.9  68.3  71.5  75.5  76.7  76.3  72.1  63.4  54.6  47.2
Robust (ε = 1.)   | 82.9 | 41.9  42.1  42.5  42.5  42.5  44.1  46.1  49.0  51.7  56.0
Zero-robust       | 49.2 | 24.8  28.3  30.5  29.3  27.5  24.3  24.2  24.6  28.6  31.4
Negative-robust   | 38.3 | 25.4  18.8  18.4  19.2  20.2  23.2  23.6  21.8  24.4  23.7

Figure 7.
Accuracy of a standard ResNet20 classifier on robustifications of the ImageNet-9 test dataset, generated with respect to robust ResNet20 classifiers with varying robustness parameter ε. (Best viewed in color.)
Figure 8.
Error rate of different standard classifiers evaluated on untargeted transfer attacks. Each error rate is measured on a set of untargeted adversarial examples generated from the ImageNet-9 test set to attack a source classifier (ResNet20) with robustness parameter ε. Each adversarial example was generated with a perturbation δ whose L2 norm is bounded (see Table 4). A higher error corresponds to a stronger transfer attack. (Best viewed in color.)
Figure 9.
Accuracy of different standard classifiers on targeted transfer attacks. Each accuracy is measured as the fraction of times the target label $y_t$ is predicted for targeted adversarial example $x_t$. The evaluation is done on a set of targeted adversarial examples generated from the ImageNet-9 test set to attack a source ResNet20 classifier with robustness parameter ε. Each adversarial example was generated with a perturbation δ whose L2 norm is bounded (see Table 4). A higher accuracy corresponds to a stronger transfer attack. (Best viewed in color.)
Figure 10.
Accuracy of robust ResNet20s of varying robustness parameter, trained on the ImageNet-9 training data and evaluated on the original ImageNet-9 test data. This confirms the well-known result that test accuracy decreases as robustness increases.

Figure 11.
Extended data: Examples of robustified images by ε of the robust classifier. The rightmost column depicts the original CIFAR-10 image; the leftmost column depicts the seed (x) from which the gradient descent of the robustification process began; the rest are robustified images. The representation layer of an ε-robust classifier for each specified ε is approximately the same for the robustified image and the associated original image. Since the representation layers of less robust classifiers are more easily perturbed, robustified images generated with respect to classifiers with a smaller robustness parameter appear closer to the (random) CIFAR-10 image that seeded the gradient-descent process. (Best viewed in color.)

Figure 12.
Extended data: Examples of robustified images by ε of the robust classifier. The rightmost column depicts the original ImageNet-9 image; the leftmost column depicts the seed (x) from which the gradient descent of the robustification process began; the rest are robustified images. The representation layer of an ε-robust classifier for each specified ε is approximately the same for the robustified image and the associated original image. Since the representation layers of less robust classifiers are more easily perturbed, robustified images generated with respect to classifiers with a smaller robustness parameter appear closer to the (random) ImageNet-9 image that seeded the gradient-descent process. (Best viewed in color.)

Figure 13.
Extended data: Adversarial examples for robust CIFAR-10 classifiers of varying robustness. The leftmost image is the original image from CIFAR-10; all other images are adversarial. Each adversarial example was generated via PGD from the original image $x$ such that the resulting adversarial example $\hat{x} = x + \delta$ has bounded L2 perturbation (see Table 4). (Best viewed in color.)

Figure 14.
Extended data: Adversarial examples for robust ImageNet-9 classifiers of varying robustness. The leftmost image is the original image from ImageNet-9; all other images are adversarial. Each adversarial example was generated via PGD from the original image $x$ such that the resulting adversarial example $\hat{x} = x + \delta$ satisfies $\|\delta\|_2 \le 4$. (Best viewed in color.)