Understanding Misclassifications by Attributes
Sadaf Gulshad, Zeynep Akata, Jan Hendrik Metzen, Arnold Smeulders
UvA-Bosch Delta Lab, University of Amsterdam, The Netherlands; Bosch Center for AI (BCAI), Renningen, Germany
Abstract
In this paper, we aim to understand and explain the decisions of deep neural networks by studying the behavior of predicted attributes when adversarial examples are introduced. We study the changes in attributes for clean as well as adversarial images in both standard and adversarially robust networks. We propose a metric to quantify the robustness of an adversarially robust network against adversarial attacks. In a standard network, attributes predicted for adversarial images are consistent with the wrong class, while attributes predicted for the clean images are consistent with the true class. In an adversarially robust network, the attributes predicted for adversarial images that are classified correctly are consistent with the true class. Finally, we show that the ability to robustify a network varies across datasets: it is higher for the fine-grained dataset than for the coarse-grained dataset. Additionally, the ability to robustify a network increases with the amount of adversarial noise.
1. Introduction
Understanding neural networks is crucial in applications like autonomous vehicles, health care, and robotics, for validating and debugging, as well as for building the trust of users [15, 32]. This paper strives to understand and explain the decisions of deep neural networks by studying the behavior of predicted attributes when adversarial examples are introduced. We argue that even if no adversaries are being inserted in real-world applications, adversarial examples can be exploited for understanding neural networks in their failure modes. Most state-of-the-art approaches for interpreting neural networks focus on features: they produce saliency maps from class-specific gradient information [24, 26, 28], or find the part of the image which influences classification the most and remove it by adding perturbations [34, 10]. These approaches reveal the part of the image where there is support for the classification and visualize the performance on known good examples. This tells little about the boundaries of a class, where dubious examples reside.
Figure 1. Our study with an interpretable attribute prediction-grounding framework shows that, for a clean image, the predicted attributes "red belly" and "blue head" are coherent with the ground-truth class (painted bunting), while for an adversarial image, "white belly" and "white head" are coherent with the wrong class (herring gull).

However, humans motivate their decisions through semantically meaningful observations: for example, this type of bird has a blue head and a red belly, so it must be a painted bunting. Hence, we study changes in the predicted attribute values of samples under mild modification of the image through adversarial perturbations. We believe this alternative dimension of study can provide a better understanding of how misclassification in a deep network can best be communicated to humans. Note that we consider adversarial examples that are generated to fool only the classifier and not the interpretation (attribute) mechanism. Interpreting deep neural network decisions for adversarial examples helps in understanding their internal functioning [30, 9]. Therefore, we explore:
How do the attribute values change under an adversarial attack on the standard classification network?
However, while describing misclassifications due to adversarial examples with attributes helps in understanding neural networks, assessing whether the attribute values still retain their discriminative power after making the network robust to adversarial noise is equally important. Hence, we also ask:

How do the attribute values change under an adversarial attack on a robust classification network?
To answer these questions, we design experiments to investigate which attribute values change when an image is misclassified with increasing adversarial perturbations, and further when the classifier is made robust against an adversarial attack. Through these experiments we intend to demonstrate which attributes are important to distinguish between the right and the wrong class. For instance, as shown in Figure 1, "blue head" and "red belly", associated with the class "painted bunting", are predicted correctly for the clean image. On the other hand, because the attributes are incorrectly predicted as "white belly" and "white head", the adversarial image gets classified incorrectly as "herring gull". After analysing the changes in attributes with a standard and with a robust network, we propose a metric to quantify the robustness of the network against adversarial attacks. Therefore, we ask:
Can we quantify the robustness of an adversarially robust network?
In order to answer this third question, we design a robustness quantification metric for both standard as well as attribute-based classifiers. To the best of our knowledge, we are the first to exploit adversarial examples with attributes to perform a systematic investigation of neural networks, both quantitatively and qualitatively, for not only standard but also adversarially robust networks. We explain the decisions of deep computer vision systems by identifying which attributes change when an image is perturbed in order for a classification system to produce a specific output. Our results on three benchmark attribute datasets of varying size and granularity elucidate why adversarial images get misclassified, and why the same images are correctly classified with the adversarially robust framework. Finally, we introduce a new metric to quantify the robustness of a network for both general as well as attribute-based classifiers.
2. Related Work
In this section, we discuss related work on interpretability and adversarial examples.
Interpretability.
Explaining the output of a decision maker is motivated by the need to build user trust before deploying it into a real-world environment. Previous work is broadly grouped into two categories: 1) rationalization, that is, justifying the network's behavior, and 2) introspective explanation, that is, showing the causal relationship between input and the specific output [8]. Text-based class-discriminative explanations [13, 22], text-based interpretation with semantic information [7], and counterfactual visual explanations [12] fall into the first category. On the other hand, activation maximization [26, 37], learning a perturbation mask [10], learning a model locally around its prediction, and finding important features by propagating activation differences [23, 25] fall into the second group. The first group has the benefit of being human understandable, but it lacks the causal relationship between input and output. The second group incorporates the internal behavior of the network, but lacks human understandability. In this work, we incorporate human-understandable justifications through attributes, and the causal relationship between input and output through adversarial attacks.
Interpretability of Adversarial Examples.
After analyzing the neuronal activations of networks for adversarial examples, [6] concluded that networks learn recurrent discriminative parts of objects instead of semantic meaning. In [14], the authors proposed a datapath visualization module consisting of layer-level, feature-level, and neuron-level visualizations of the network for clean as well as adversarial images. In [35], the authors investigated adversarially trained convolutional neural networks by constructing images with different textural transformations while preserving the shape information, to verify the shape bias in adversarially trained networks compared with standard networks. Finally, in [31], the authors showed that saliency maps from adversarially trained networks align well with human perception. These approaches use saliency maps for interpreting adversarial examples, but saliency maps [24] are often weak in justifying classification decisions, especially for fine-grained adversarial images. For instance, in Figure 2 the saliency map of a clean image classified into the ground-truth class, "red winged blackbird", and the saliency map of a misclassified adversarial image look quite similar. Instead, we propose to predict and ground attributes for both clean and adversarial images to provide visual as well as attribute-based interpretations. In fact, our predicted attributes for clean and adversarial images look quite different. By grounding the predicted attributes one can infer that the "orange wing" is important for "red winged blackbird" while the "red head" is important for "red faced cormorant". Indeed, when the attribute value for "orange wing" decreases and "red head" increases, the image gets misclassified.
Adversarial Examples.
Small, carefully crafted perturbations, called adversarial perturbations, when added to the inputs of deep neural networks, result in adversarial examples. These adversarial examples can easily drive classifiers to the wrong classification [29]. Such attacks include the iterative fast gradient sign method (IFGSM) [17], Jacobian-based saliency map attacks [21], one-pixel attacks [27], Carlini and Wagner attacks [5], and universal attacks [20]. We select IFGSM for our experiments, but our method can also be used with other types of adversarial attacks. Adversarial examples can also be used for understanding neural networks.
Figure 2. Adversarial images are difficult to explain: when the answer is wrong, saliency-based methods (left) often fail to detect what went wrong. Instead, attributes (right) provide intuitive and effective visual and textual explanations.

The work in [4] aims at utilizing adversarial examples for understanding deep neural networks by extracting the features that provide the support for classification into the target class. The most salient features in the images provide a way to interpret the decision of a classifier, but they lack human understandability. Additionally, finding the most salient features is computationally rather expensive. The crucial point, however, is that if humans explain classification by attributes, then attributes are also natural candidates to study misclassification and robustness. Hence, in this work, in order to understand neural networks, we utilize adversarial examples with attributes, which explain the misclassification due to adversarial attacks.
3. Method
In this section, in order to explain which attributes change when an adversarial attack is performed on the classification mechanism of the network, we detail a two-step framework. First, we perturb the images using adversarial attack methods and robustify the classifiers via adversarial training. Second, we predict the class-specific attributes and visually ground them on the image to provide an intuitive justification of why an image is classified as a certain class. Finally, we introduce our metric for quantifying the robustness of an adversarially robust network against adversarial attacks.
Given a clean $n$-th input $x_n$ and its respective ground-truth class $y_n$ predicted by a model $f(x_n)$, an adversarial attack model generates an image $\hat{x}_n$ for which the predicted class is $y$, where $y \neq y_n$. In the following, we detail an adversarial attack method for fooling a general classifier and an adversarial training technique that robustifies it.

Adversarial Attacks.
The iterative fast gradient sign method (IFGSM) [17] is leveraged to fool only the classifier network. IFGSM solves the following equation to produce adversarial examples:
$$\hat{x}^0_n = x_n, \qquad \hat{x}^{i+1}_n = \mathrm{Clip}_{\epsilon}\left\{\hat{x}^i_n + \alpha\,\mathrm{Sign}\left(\nabla_{\hat{x}^i_n} L(\hat{x}^i_n, y_n)\right)\right\} \quad (1)$$

where $\nabla_{\hat{x}^i_n} L$ represents the gradient of the cost function w.r.t. the perturbed image $\hat{x}^i_n$ at step $i$, $\alpha$ determines the step size taken in the direction of the sign of the gradient, and finally the result is clipped by $\mathrm{Clip}_{\epsilon}$.

Figure 3. Interpretable attribute prediction-grounding model. After an adversarial attack or adversarial training step, image features of both clean images, $\theta(x_n)$, and adversarial images, $\theta(\hat{x}_n)$, are extracted using ResNet and mapped into the attribute space $\phi(y)$ by learning the compatibility function $F(x_n, y_n; W)$ between image features and class attributes. Finally, attributes predicted by the attribute-based classifier, $A^q_{x_n,y_n}$, are grounded by matching them with attributes predicted by Faster R-CNN, $A^j_{x_n}$, for clean and adversarial images.
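To make the update in Eq. (1) concrete, below is a minimal PyTorch sketch of IFGSM. The cross-entropy loss and the specific eps, alpha, and step-count values are illustrative assumptions, since the paper specifies only the method and the $l_\infty$ constraint.

```python
import torch
import torch.nn.functional as F

def ifgsm(model, x, y, eps=0.03, alpha=0.005, steps=10):
    """Iterative FGSM (Eq. 1): take repeated steps in the direction of the
    gradient sign and clip the result to an l_inf ball of radius eps."""
    x_adv = x.clone().detach()                   # \hat{x}^0_n = x_n
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)  # L(\hat{x}^i_n, y_n)
        grad = torch.autograd.grad(loss, x_adv)[0]
        with torch.no_grad():
            x_adv = x_adv + alpha * grad.sign()  # step along Sign(gradient)
            x_adv = torch.min(torch.max(x_adv, x - eps), x + eps)  # Clip_eps
            x_adv = x_adv.clamp(0.0, 1.0)        # keep a valid image
    return x_adv.detach()
```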
Adversarial Robustness.

We use adversarial training as a defense against adversarial attacks, which minimizes the following objective [11]:

$$L_{adv}(x_n, y_n) = \alpha L(x_n, y_n) + (1 - \alpha) L(\hat{x}_n, y) \quad (2)$$

where $L(x_n, y_n)$ is the classification loss for clean images, $L(\hat{x}_n, y)$ is the loss for adversarial images, and $\alpha$ regulates the loss to be minimized. The model finds the worst-case perturbations and fine-tunes the network parameters to reduce the loss on perturbed inputs. Hence, this results in a robust network $f_r(\hat{x})$, which improves the classification accuracy on the adversarial images.
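A minimal sketch of one training step minimizing Eq. (2); reusing the ifgsm sketch above as the attack, the cross-entropy loss, and $\alpha = 0.5$ are assumptions not fixed by the paper at this point.

```python
import torch.nn.functional as F

def adversarial_training_step(model, optimizer, x, y, alpha=0.5, eps=0.03):
    """One step minimizing L_adv (Eq. 2): a weighted sum of the loss on
    clean images and the loss on their adversarial counterparts."""
    x_adv = ifgsm(model, x, y, eps=eps)          # worst-case perturbation
    optimizer.zero_grad()
    loss_clean = F.cross_entropy(model(x), y)    # L(x_n, y_n)
    loss_adv = F.cross_entropy(model(x_adv), y)  # loss on adversarial images
    loss = alpha * loss_clean + (1.0 - alpha) * loss_adv   # Eq. (2)
    loss.backward()
    optimizer.step()
    return loss.item()
```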
Our attribute prediction and grounding model uses attributes to define a joint embedding space that the images are mapped to.

Attribute prediction.
The model is shown in Figure 3. During training, our model maps clean training images close to their respective class attributes, e.g. "painted bunting" with attributes "red belly, blue head", whereas adversarial images get mapped close to a wrong class, e.g. "herring gull" with attributes "white belly, white head". We employ structured joint embeddings (SJE) [1] to predict attributes in an image. Given input image features $\theta(x_n) \in \mathcal{X}$ and output class attributes $\phi(y_n) \in \mathcal{Y}$ from the sample set $S = \{(\theta(x_n), \phi(y_n)), n = 1 \ldots N\}$, SJE learns a mapping $f: \mathcal{X} \rightarrow \mathcal{Y}$ by minimizing the empirical risk $\frac{1}{N}\sum_{n=1}^{N} \Delta(y_n, f(x_n))$, where $\Delta: \mathcal{Y} \times \mathcal{Y} \rightarrow \mathbb{R}$ estimates the cost of predicting $f(x_n)$ when the ground-truth label is $y_n$.

A compatibility function $F: \mathcal{X} \times \mathcal{Y} \rightarrow \mathbb{R}$ is defined between the input space $\mathcal{X}$ and the output space $\mathcal{Y}$:

$$F(x_n, y_n; W) = \theta(x_n)^T W \phi(y_n) \quad (3)$$

A pairwise ranking loss $L(x_n, y_n, y)$ is used to learn the parameters $W$:

$$\Delta(y_n, y) + \theta(x_n)^T W \phi(y) - \theta(x_n)^T W \phi(y_n) \quad (4)$$

Attributes are predicted for both clean and adversarial images by:

$$A_{n,y_n} = \theta(x_n) W, \qquad \hat{A}_{n,y} = \theta(\hat{x}_n) W \quad (5)$$

The image is assigned the label of the nearest output class attribute vector $\phi(y_n)$.
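The following NumPy sketch illustrates the bilinear compatibility of Eq. (3), the ranking term of Eq. (4), and attribute prediction as in Eq. (5). The feature dimension, the random placeholder data, and the hinge around the ranking term are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
D, A, C = 2048, 85, 50                    # feature dim, attributes, classes (AwA-like)
W = 0.01 * rng.standard_normal((D, A))    # learned bilinear mapping
phi = rng.random((C, A))                  # per-class attribute vectors phi(y)

def compatibility(theta_x, y):
    """F(x, y; W) = theta(x)^T W phi(y)   (Eq. 3)."""
    return theta_x @ W @ phi[y]

def ranking_loss(theta_x, y_true, y_wrong, delta=1.0):
    """Pairwise ranking term of Eq. (4): margin plus wrong-class score
    minus true-class score (hinged at zero in this sketch)."""
    return max(0.0, delta + compatibility(theta_x, y_wrong)
                          - compatibility(theta_x, y_true))

def predict_attributes(theta_x):
    """A = theta(x) W   (Eq. 5): project image features to attribute space."""
    return theta_x @ W

def classify(theta_x):
    """Assign the label of the nearest class attribute vector."""
    a = predict_attributes(theta_x)
    return int(np.argmin(np.linalg.norm(phi - a, axis=1)))
```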
Attribute grounding.

In our final step, we ground the predicted attributes on the input images using a pre-trained Faster R-CNN network and visualize them as in [3]. The pre-trained Faster R-CNN model $F(x_n)$ predicts bounding boxes denoted by $b_j$; for each object bounding box it predicts the class $Y_j$ as well as the attribute $A_j$ [2]:

$$b_j, A_j, Y_j = F(x_n) \quad (6)$$

where $j$ is the bounding box index. The most discriminative attributes predicted by SJE are selected based on the criterion that they change the most when the image is perturbed with noise. For clean images we use:

$$q = \mathrm{argmax}_i \left( A^i_{n,y_n} - \phi(y^i) \right) \quad (7)$$

and for adversarial images we use:

$$p = \mathrm{argmax}_i \left( \hat{A}^i_{n,y} - \phi(y^i_n) \right) \quad (8)$$

where $i$ is the attribute index, $q$ and $p$ are the indices of the most discriminative attributes predicted by SJE, and $\phi(y^i)$, $\phi(y^i_n)$ are the wrong-class and ground-truth-class attributes, respectively. We then search for the selected attributes $A^q_{x_n,y_n}$, $A^p_{\hat{x}_n,y}$ among the attributes predicted by Faster R-CNN for each bounding box, $A^j_{x_n}$, $A^j_{\hat{x}_n}$; when the attributes predicted by SJE and Faster R-CNN match, that is, $A^q_{x_n,y_n} = A^j_{x_n}$ and $A^p_{\hat{x}_n,y} = A^j_{\hat{x}_n}$, we ground them on their respective clean and adversarial images. Note that the adversarial images used here are generated to fool only the general classifier, not the attribute predictor or the Faster R-CNN.

To describe the ability of a network for robustification, independent of its performance on a standard classifier, we introduce a metric called the robust ratio. We calculate the loss of accuracy $L_R$ of a robust classifier by comparing a standard classifier $f(x_n)$ on clean images with the robust classifier $f_r(\hat{x}_n)$ on the adversarially perturbed images:

$$L_R = f(x_n) - f_r(\hat{x}_n) \quad (9)$$

We then calculate the loss of accuracy $L_S$ of a standard classifier by comparing its accuracy on the clean and adversarially perturbed images:

$$L_S = f(x_n) - f(\hat{x}_n) \quad (10)$$

The ability to robustify is then defined as:

$$R = \frac{L_R}{L_S} \quad (11)$$

$R$ is the robust ratio. It indicates the fraction of the classification accuracy of the standard classifier recovered by the robust classifier when adding noise.
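A short sketch of the attribute selection in Eqs. (7)–(8) and the robust ratio of Eqs. (9)–(11), reading $f(\cdot)$ as a classification accuracy; the example accuracy values are hypothetical.

```python
import numpy as np

def most_discriminative(A_pred, phi_other):
    """Eqs. (7)/(8): index of the predicted attribute deviating most from
    the other class's attribute vector (wrong class for clean images,
    ground-truth class for adversarial images)."""
    return int(np.argmax(A_pred - phi_other))

def robust_ratio(acc_std_clean, acc_std_adv, acc_robust_adv):
    """Eqs. (9)-(11): R = L_R / L_S, relating the accuracy loss remaining
    after robustification to the loss of the standard classifier."""
    L_R = acc_std_clean - acc_robust_adv   # Eq. (9)
    L_S = acc_std_clean - acc_std_adv      # Eq. (10)
    return L_R / L_S                       # Eq. (11)

# hypothetical accuracies: 93% clean, 20% under attack, 75% after robustification
print(robust_ratio(0.93, 0.20, 0.75))      # ~0.25
```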
4. Experiments
In this section, we perform experiments on three different datasets and analyse the change in attributes for clean as well as adversarial images. We additionally analyse the results for our proposed robustness quantification metric on both general and attribute-based classifiers.
Datasets.
We experiment on three datasets: Animals with Attributes 2 (AwA) [18], the Large Attribute dataset (LAD) [36], and Caltech UCSD Birds (CUB) [33]. AwA contains 37,322 images (22,206 train / 5,599 val / 9,517 test) with 50 classes and 85 attributes per class. LAD has 78,017 images (40,957 train / 13,653 val / 23,407 test) with 230 classes and 359 attributes per class. CUB consists of 11,788 images (5,395 train / 599 val / 5,794 test) belonging to 200 fine-grained categories of birds, with 312 attributes per class. All three datasets contain real-valued class attributes representing the presence of a certain attribute in a class. The Visual Genome dataset [16] is used to train the Faster R-CNN model, which extracts bounding boxes using 1600 object and 400 attribute annotations. Each bounding box is associated with an attribute followed by the object, e.g. "a brown bird".
Image Features and Adversarial Examples.
We extract image features and generate adversarial images using the fine-tuned ResNet-152. Adversarial attacks are performed using the IFGSM method with three increasing values of $\epsilon$. The $l_\infty$ norm is used as a similarity measure between the clean input and the generated adversarial example.

Adversarial Training.
As for adversarial training, we repeatedly compute adversarial examples while training the fine-tuned ResNet-152 to minimize the loss on these examples. We generate the adversarial examples using the projected gradient descent method, a multi-step variant of FGSM, with three increasing $\epsilon$ values for adversarial training, as in [19]. Note that we do not attack the attribute-based network directly: we attack the general classifier and extract features from it for training the attribute-based classifier. Similarly, adversarial training is also performed on the general classifier, and the features extracted from this model are used for training the attribute-based classifier.

Figure 4. Comparing the accuracy of the general and the attribute-based classifiers for adversarial examples, to investigate the change in attributes. We evaluate both classifiers by extracting features from a standard network and from the adversarially robust network.
Attribute Prediction and Grounding.
At test time, the image features are projected onto the attribute space, and the image is assigned the label of the nearest ground-truth attribute vector. The predicted attributes are grounded using a Faster R-CNN pre-trained on the Visual Genome dataset, since we do not have ground-truth part bounding boxes for any of the attribute datasets.
5. Results
We investigate the change in attributes quantitatively (i) by performing classification based on attributes and (ii) by computing distances between attributes in the embedding space. We additionally investigate the changes qualitatively by grounding the attributes on images, for both standard and adversarially robust networks.

First, we compare the general classifier and the attribute-based classifier in terms of classification accuracy on clean images. The attribute-based model is the more explainable classifier, since it predicts attributes, whereas the general classifier predicts the class label directly. We therefore first verify whether the attribute-based classifier performs as well as the general classifier. We find that the attribute-based and general classifier accuracies are comparable for AwA (general: 93.53, attribute-based: 93.83). The attribute-based classifier accuracy is slightly higher for LAD (general: 80.00, attribute-based: 82.77), and slightly lower for CUB (general: 81.00, attribute-based: 76.90).

To qualitatively analyse the predicted attributes, we ground them on clean and adversarial images. We select our images among the ones that are correctly classified when clean and incorrectly classified when adversarially perturbed. Further, we select the most discriminative attributes based on Equations 7 and 8, and evaluate the attributes that change their value the most for the CUB, AwA, and LAD datasets.
Figure 5. Attribute distance plots for the standard learning framework (AwA, CUB). The plots are shown for the clean and the adversarial image attributes.
By Performing Classification Based on Attributes.

With adversarial attacks, the accuracy of both the general and the attribute-based classifiers drops as the perturbation increases; see Figure 4 (blue curves). The drop in accuracy of the general classifier for the fine-grained CUB dataset is higher than for the coarse AwA dataset, which confirms our hypothesis: at the same $\epsilon$, the general classifier's accuracy drops substantially more for CUB than for AwA, whereas the drop in accuracy with the attribute-based classifier is almost equal for both. We propose that one reason behind the smaller drop in accuracy for the CUB dataset with the attribute-based classifier, compared to the general classifier, is that fine-grained datasets have many common attributes among classes. Therefore, in order to misclassify an image, a significant number of attributes need to be changed, whereas for a coarse-grained dataset, changing a few attributes is sufficient for misclassification. Another reason is that there are more attributes per class in the CUB dataset than in the AwA dataset.

For the coarse dataset, the attribute-based classifier shows performance comparable to the general classifier, while for the fine-grained dataset the attribute-based classifier shows better performance than the general classifier, so a large change in attributes is required to cause misclassification with attributes. Overall, the drop in accuracy under adversarial attacks demonstrates that, with adversarial perturbations, the attribute values change towards those that belong to the new class and cause the misclassification.
Figure 6. Qualitative analysis of adversarial attacks on the standard network. The attributes, ranked by importance for the classification decision, are shown below the images. The grounded attributes are color coded for visibility (the ones in gray could not be grounded). The attributes for clean images are related to the ground-truth classes, whereas the ones predicted for adversarial images are related to the wrong classes.
Figure 7. Attribute distance plots for the robust learning framework (AwA, CUB). The plots are shown only for the attributes of adversarial images that are misclassified with the standard features and correctly classified with the robust features.
By Computing Distances in Embedding Space.

In order to analyse the attributes in the embedding space, we consider the images which are correctly classified without perturbations and misclassified with perturbations. Further, we select the most discriminative attributes using Equations 7 and 8. Our aim is to analyse the change in attributes in the embedding space.
In orderto perform analysis on attributes in embedding space, weconsider the images which are correctly classified withoutperturbations and misclassified with perturbations. Further,we select the top of the most discriminative attributesusing equation 7 and 8. Our aim is to analyse the change inattributes in embedding space.We contrast the Euclidean distance between predicted at- tributes of clean and adversarial samples: d = d { A n,y n , ˆA n,y } = (cid:107) A n,y n − ˆA n,y (cid:107) (12)with the Euclidean distance between the ground truth at-tribute vector of the correct and wrong classes: d = d { φ ( y n ) , φ ( y ) } = (cid:107) φ ( y n ) − φ ( y )) (cid:107) (13)and show the results in Figure 5. Where, A n,y n denotes thepredicted attributes for the clean images classified correctly,and ˆA n,y denotes the predicted attributes for the adversar-ial images misclassified with a standard network. The cor-rect ground truth class attribute is referred to as φ ( y n ) andwrong class attributes are φ ( y ) .We observe that for the AWA dataset the distances be-tween the predicted attributes for adversarial and clean im-ages d are smaller than the distances between the groundtruth attributes of the respective classes d . The closenessin predicted attributes for clean and adversarial images ascompared to their ground truths shows that attributes changetowards the wrong class but not completely. This is due tothe fact that for coarse classes, only a small change in at-tribute values is sufficient to change the class.The fine-grained CUB dataset behaves differently. Theoverlap between d and d distributions demonstratesthat attributes of images belonging to fine-grained classeschange significantly as compared to images from coarsecategories. Although the fine grained classes are closer to lack BeakYellow HeadOrange Head Adversarial with robust network (Ground truth Class)
Figure 8. Qualitative analysis of adversarial attacks on the robust network. The attributes are ranked by importance for the classification decision; the grounded attributes are color coded for visibility (the ones in gray could not be grounded). The attributes for adversarial images with the robust network are related to the ground-truth classes, whereas the ones predicted for adversarial images with the standard network change towards the wrong classes.
By Grounding Attributes on Images.

We observe in Figure 6 that the most discriminative attributes for the clean images are coherent with the ground-truth class and are localized accurately; for adversarial images, however, they are coherent with the wrong class. Those attributes which are common to both the clean and adversarial classes are localized correctly on the adversarial images; however, the attributes which are not related to the ground-truth class, i.e. the ones that are related to the wrong class, cannot be grounded, as there is no visual evidence that supports their presence. For example, the attributes "brown wing, long wing, long tail" are common to both classes; hence, they are present in both the clean image and the adversarial image. On the other hand, "has a brown color" and "a multicolored breast" are related to the wrong class and are not present in the adversarial image; hence, they cannot be grounded. Similarly, in the second example none of the attributes are grounded, because the attributes changed completely towards the wrong class and the evidence for those attributes is not present in the image. This indicates that the attributes for the clean images correspond to the ground-truth class, while those for adversarial images correspond to the wrong class. Additionally, only those attributes common to both the wrong and the ground-truth classes get grounded on adversarial images.

Similarly, our results on the LAD and AwA datasets in the second row of Figure 6 show that the grounded attributes on clean images confirm the classification into the ground-truth class, while the attributes grounded on adversarial images are common to clean and adversarial images. For instance, in the first AwA example, the "is black" attribute is common to both classes, so it is grounded on both images, but "has claws" is an important attribute for the adversarial class; as it is not present in the ground-truth class, it is not grounded.

Compared to misclassifications caused by adversarial perturbations on CUB, images do not necessarily get misclassified into the most similar class for the AwA and LAD datasets, as they are coarse-grained datasets. Therefore, there is less overlap of attributes between the ground-truth and adversarial classes, which is in accordance with our quantitative results. Furthermore, the attributes for both datasets are not highly structured, as different objects can be distinguished from each other with only a small number of attributes.
Our evaluation of the standard and adversarially robust networks shows that the classification accuracy on adversarial images improves when adversarial training is used to robustify the network; see Figure 4 (purple curves). For example, for AwA the accuracy of the general classifier improves substantially under adversarial attack, and, as expected, the improvement for the fine-grained CUB dataset is higher than for the AwA dataset. However, for the attribute-based classifier, the improvement in accuracy for AwA is almost double that of the CUB dataset. We propose this is because the AwA dataset is coarse, so in order to classify an adversarial image correctly into its ground-truth class, a small change in attributes is sufficient. Conversely, the fine-grained CUB dataset requires a large change in attribute values to correctly classify an adversarial image into its ground-truth class. Additionally, CUB contains more attributes per class. For the coarse AwA dataset the attributes change back to the correct class and represent the correct class accurately, while for the fine-grained CUB dataset a large change in attribute values is required to correctly classify images.
Figure 9. Ability to robustify a network. The ability to robustify a network with increasing adversarial perturbations is shown for three different datasets, for both general and attribute-based classifiers.

This shows that, with a robust network, the change in attribute values for adversarial images points to the ground-truth class, resulting in better performance. Overall, by analysing the attribute-based classifier accuracy, we observe that under adversarial attacks the change in attribute values indicates to which wrong class an image is assigned, while with the robust network the change in attribute values points towards the ground-truth class.
By Computing Distances in Embedding Space.
We compare the distances between the predicted attributes of only those adversarial images that are classified correctly with the adversarially robust network, $\hat{A}^r_{n,y_n}$, and classified incorrectly with the standard network, $\hat{A}_{n,y}$:

$$d_3 = d\{\hat{A}^r_{n,y_n}, \hat{A}_{n,y}\} = \|\hat{A}^r_{n,y_n} - \hat{A}_{n,y}\| \quad (14)$$

with the distances between the ground-truth target class attributes $\phi(y_n)$ and the ground-truth wrong-class attributes $\phi(y)$:

$$d_4 = d\{\phi(y_n), \phi(y)\} = \|\phi(y_n) - \phi(y)\| \quad (15)$$

The results are shown in Figure 7. Comparing Figure 7 with Figure 5, we observe a similar behavior. The plots in Figure 5 are computed between clean and adversarial image attributes, while the plots in Figure 7 are computed only between adversarial images, classified correctly with an adversarially robust network and misclassified with a standard network. This shows that the adversarial images classified correctly with a robust network behave like clean images, i.e. a robust network predicts attributes for the adversarial images which are closer to their ground-truth class.

Finally, our analysis of images correctly classified by the adversarially robust network shows that adversarial images with the robust network also behave like clean images visually. In Figure 8, we observe that the attributes of an adversarial image with a standard network are closer to the adversarial-class attributes, whereas the grounded attributes of an adversarial image with a robust network are closer to its ground-truth class. For instance, the first example contains a "blue head" and a "black wing", whereas one of the most discriminating properties of the ground-truth class, "blue head", is not relevant to the adversarial class. Hence this attribute is not predicted as the most relevant by our model, and thus our attribute grounder did not ground it. This shows that the attributes for adversarial images classified correctly with the robust network are in accordance with the ground-truth class and hence get grounded on the adversarial images.
The results for our proposed robustness quantification metric are shown in Figure 9. We observe that the ability to robustify a network against adversarial attacks varies across datasets: the network is easier to robustify on the fine-grained CUB dataset than on the coarse AwA and LAD datasets. For the general classifier, as expected, the ability to robustify the network increases with increasing noise. For the attribute-based classifier, the ability to robustify the network is high at small noise levels, drops at the intermediate noise level, and then increases again at the highest noise level.
6. Conclusion
In this work we conducted a systematic study on understanding neural networks by exploiting adversarial examples with attributes. We showed that if a noisy sample gets misclassified, its most discriminative attribute values indicate to which wrong class it is assigned. On the other hand, if a noisy sample is correctly classified with the robust network, the most discriminative attribute values point towards the ground-truth class. Finally, we proposed a metric for quantifying the robustness of a network and showed that the ability to robustify a network varies across datasets. Overall, the ability to robustify a network increases with increasing adversarial perturbations.

References

[1] Z. Akata, S. Reed, D. Walter, H. Lee, and B. Schiele. Evaluation of output embeddings for fine-grained image classification. In CVPR. IEEE, 2015.
[2] P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, and L. Zhang. Bottom-up and top-down attention for image captioning and visual question answering. In CVPR, 2018.
[3] L. Anne Hendricks, R. Hu, T. Darrell, and Z. Akata. Grounding visual explanations. In ECCV, 2018.
[4] Anonymous. Evaluations and methods for explanation through robustness analysis. Submitted to International Conference on Learning Representations, 2020. Under review.
[5] N. Carlini and D. Wagner. Towards evaluating the robustness of neural networks. In SP. IEEE, 2017.
[6] Y. Dong, H. Su, J. Zhu, and F. Bao. Towards interpretable deep neural networks by leveraging adversarial examples. arXiv, 2017.
[7] Y. Dong, H. Su, J. Zhu, and B. Zhang. Improving interpretability of deep neural networks with semantic information. In CVPR, 2017.
[8] M. Du, N. Liu, and X. Hu. Techniques for interpretable machine learning. arXiv, 2018.
[9] M. Du, N. Liu, Q. Song, and X. Hu. Towards explanation of DNN-based prediction with guided feature inversion. In SIGKDD. ACM, 2018.
[10] R. C. Fong and A. Vedaldi. Interpretable explanations of black boxes by meaningful perturbation. arXiv, 2017.
[11] I. Goodfellow, J. Shlens, and C. Szegedy. Explaining and harnessing adversarial examples. In ICLR, 2015.
[12] Y. Goyal, Z. Wu, J. Ernst, D. Batra, D. Parikh, and S. Lee. Counterfactual visual explanations. arXiv preprint arXiv:1904.07451, 2019.
[13] L. A. Hendricks, Z. Akata, M. Rohrbach, J. Donahue, B. Schiele, and T. Darrell. Generating visual explanations. In ECCV. Springer, 2016.
[14] L. Jiang, S. Liu, and C. Chen. Recent research advances on interactive machine learning. Journal of Visualization, 2018.
[15] J. Kim, A. Rohrbach, T. Darrell, J. Canny, and Z. Akata. Textual explanations for self-driving vehicles. In ECCV, pages 563–578, 2018.
[16] R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L.-J. Li, D. A. Shamma, et al. Visual Genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision, 123(1):32–73, 2017.
[17] A. Kurakin, I. Goodfellow, and S. Bengio. Adversarial examples in the physical world. ICLR Workshop, 2017.
[18] C. H. Lampert, H. Nickisch, and S. Harmeling. Learning to detect unseen object classes by between-class attribute transfer. In CVPR. IEEE, 2009.
[19] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu. Towards deep learning models resistant to adversarial attacks. ICLR, 2018.
[20] S.-M. Moosavi-Dezfooli, A. Fawzi, and P. Frossard. DeepFool: a simple and accurate method to fool deep neural networks. In CVPR, 2016.
[21] N. Papernot, P. McDaniel, S. Jha, M. Fredrikson, Z. B. Celik, and A. Swami. The limitations of deep learning in adversarial settings. In EuroS&P, pages 372–387. IEEE, 2016.
[22] D. H. Park, L. A. Hendricks, Z. Akata, B. Schiele, T. Darrell, and M. Rohrbach. Multimodal explanations: Justifying decisions and pointing to the evidence. In CVPR, 2018.
[23] M. T. Ribeiro, S. Singh, and C. Guestrin. "Why should I trust you?": Explaining the predictions of any classifier. In SIGKDD, pages 1135–1144. ACM, 2016.
[24] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In ICCV, 2017.
[25] A. Shrikumar, P. Greenside, and A. Kundaje. Learning important features through propagating activation differences. In ICML, pages 3145–3153. JMLR.org, 2017.
[26] K. Simonyan, A. Vedaldi, and A. Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv, 2013.
[27] J. Su, D. V. Vargas, and K. Sakurai. One pixel attack for fooling deep neural networks. IEEE Transactions on Evolutionary Computation, 2019.
[28] M. Sundararajan, A. Taly, and Q. Yan. Axiomatic attribution for deep networks. In ICML, pages 3319–3328. JMLR.org, 2017.
[29] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus. Intriguing properties of neural networks. ICLR, 2013.
[30] G. Tao, S. Ma, Y. Liu, and X. Zhang. Attacks meet interpretability: Attribute-steered detection of adversarial samples. In NeurIPS, 2018.
[31] D. Tsipras, S. Santurkar, L. Engstrom, A. Turner, and A. Madry. Robustness may be at odds with accuracy. stat, 1050, 2018.
[32] H. Uzunova, J. Ehrhardt, T. Kepp, and H. Handels. Interpretable explanations of black box classifiers applied on medical images by meaningful perturbations using variational autoencoders. In Medical Imaging 2019: Image Processing, volume 10949, page 1094911. International Society for Optics and Photonics, 2019.
[33] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. The Caltech-UCSD Birds-200-2011 dataset. 2011.
[34] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In ECCV. Springer, 2014.
[35] T. Zhang and Z. Zhu. Interpreting adversarially trained convolutional neural networks. arXiv preprint arXiv:1905.09797, 2019.
[36] B. Zhao, Y. Fu, R. Liang, J. Wu, Y. Wang, and Y. Wang. A large-scale attribute dataset for zero-shot learning. arXiv, 2018.
[37] L. M. Zintgraf, T. S. Cohen, T. Adel, and M. Welling. Visualizing deep neural network decisions: Prediction difference analysis. ICLR, 2017.