Understanding Misclassifications by Attributes
Sadaf Gulshad, Zeynep Akata, Jan Hendrik Metzen, Arnold Smeulders
UvA-Bosch Delta Lab, University of Amsterdam, The Netherlands; Bosch Center for AI (BCAI), Renningen, Germany
Abstract
In this paper, we aim to understand and explain the decisions of deep neural networks by studying the behavior of predicted attributes when adversarial examples are introduced. We study the changes in attributes for clean as well as adversarial images in both standard and adversarially robust networks. We propose a metric to quantify the robustness of an adversarially robust network against adversarial attacks. In a standard network, attributes predicted for adversarial images are consistent with the wrong class, while attributes predicted for the clean images are consistent with the true class. In an adversarially robust network, the attributes predicted for adversarial images that are classified correctly are consistent with the true class. Finally, we show that the ability to robustify a network varies across datasets: it is higher for the fine-grained dataset than for the coarse-grained dataset. Additionally, the ability to robustify a network increases with the amount of adversarial noise.
1. Introduction
Understanding neural networks is crucial in applications like autonomous vehicles, health care, and robotics, for validating and debugging, as well as for building the trust of users [15, 32]. This paper strives to understand and explain the decisions of deep neural networks by studying the behavior of predicted attributes when adversarial examples are introduced. We argue that even if no adversaries are being inserted in real-world applications, adversarial examples can be exploited for understanding neural networks in their failure modes. Most state-of-the-art approaches for interpreting neural networks focus on features: they produce saliency maps from class-specific gradient information [24, 26, 28], or find the part of the image which influences classification the most and remove it by adding perturbations [34, 10]. These approaches reveal the part of the image where there is support for the classification and visualize the performance on known good examples. This tells little about the boundaries of a class, where dubious examples reside.
Figure 1. Our study with an interpretable attribute prediction-grounding framework shows that, for a clean image, the predicted attributes "red belly" and "blue head" are coherent with the ground-truth class (painted bunting), while for an adversarial image, "white belly" and "white head" are coherent with the wrong class (herring gull).

However, humans motivate their decisions through semantically meaningful observations: for example, this type of bird has a blue head and a red belly, so it must be a painted bunting. Hence, we study changes in the predicted attribute values of samples under mild modification of the image through adversarial perturbations. We believe this alternative dimension of study can provide a better understanding of how misclassification in a deep network can best be communicated to humans. Note that we consider adversarial examples that are generated to fool only the classifier and not the interpretation (attribute) mechanism. Interpreting deep neural network decisions for adversarial examples helps in understanding their internal functioning [30, 9]. Therefore, we explore:
How do the attribute values change under an adversarial attack on the standard classification network?
However, while describing misclassifications due to adversarial examples with attributes helps in understanding neural networks, assessing whether the attribute values still retain their discriminative power after making the network robust to adversarial noise is equally important. Hence, we also ask:

How do the attribute values change under an adversarial attack on a robust classification network?
To answer these questions, we design experiments to investigate which attribute values change when an image is misclassified with increasing adversarial perturbations, and further when the classifier is made robust against an adversarial attack. Through these experiments we intend to demonstrate which attributes are important to distinguish between the right and the wrong class. For instance, as shown in Figure 1, "blue head" and "red belly", associated with the class "painted bunting", are predicted correctly for the clean image. On the other hand, because the attributes are incorrectly predicted as "white belly" and "white head", the adversarial image gets classified incorrectly as "herring gull". After analysing the changes in attributes with a standard and with a robust network, we propose a metric to quantify the robustness of the network against adversarial attacks. Therefore, we ask:
Can we quantify the robustness of an adversarially robust network?
In order to answer this third question, we design a robustness quantification metric for both standard as well as attribute-based classifiers. To the best of our knowledge, we are the first to exploit adversarial examples with attributes to perform a systematic investigation of neural networks, both quantitatively and qualitatively, for not only standard but also adversarially robust networks. We explain the decisions of deep computer vision systems by identifying which attributes change when an image is perturbed in order for a classification system to produce a specific output. Our results on three benchmark attribute datasets of varying size and granularity elucidate why adversarial images get misclassified, and why the same images are correctly classified with the adversarially robust framework. Finally, we introduce a new metric to quantify the robustness of a network for both general as well as attribute-based classifiers.
2. Related Work
In this section, we discuss related work on interpretability and adversarial examples.
Interpretability.
Explaining the output of a decision maker is motivated by the need to build user trust before deploying it into a real-world environment. Previous work is broadly grouped into two categories: 1) rationalization, that is, justifying the network's behavior, and 2) introspective explanation, that is, showing the causal relationship between input and the specific output [8]. Text-based class-discriminative explanations [13, 22], text-based interpretation with semantic information [7], and counterfactual visual explanations [12] fall into the first category. On the other hand, activation maximization [26, 37], learning a perturbation mask [10], learning a model locally around its prediction, and finding important features by propagating activation differences [23, 25] fall into the second group. The first group has the benefit of being human understandable, but it lacks the causal relationship between input and output. The second group incorporates the internal behavior of the network, but lacks human understandability. In this work, we incorporate human-understandable justifications through attributes, and the causal relationship between input and output through adversarial attacks.
Interpretability of Adversarial Examples.
After analyzing the neuronal activations of networks for adversarial examples, [6] concluded that networks learn recurrent discriminative parts of objects instead of semantic meaning. In [14], the authors proposed a datapath visualization module consisting of layer-level, feature-level, and neuron-level visualizations of the network for clean as well as adversarial images. In [35], the authors investigated adversarially trained convolutional neural networks by constructing images with different textural transformations while preserving the shape information, to verify the shape bias in adversarially trained networks compared with standard networks. Finally, in [31], the authors showed that saliency maps from adversarially trained networks align well with human perception. These approaches use saliency maps for interpreting adversarial examples, but saliency maps [24] are often weak in justifying classification decisions, especially for fine-grained adversarial images. For instance, in Figure 2 the saliency map of a clean image classified into the ground-truth class, "red winged blackbird", and the saliency map of a misclassified adversarial image look quite similar. Instead, we propose to predict and ground attributes for both clean and adversarial images to provide visual as well as attribute-based interpretations. In fact, our predicted attributes for clean and adversarial images look quite different. By grounding the predicted attributes one can infer that the "orange wing" is important for "red winged blackbird" while the "red head" is important for "red faced cormorant". Indeed, when the attribute value for "orange wing" decreases and "red head" increases, the image gets misclassified.
Adversarial Examples.
Small, carefully crafted perturbations, called adversarial perturbations, when added to the inputs of deep neural networks, result in adversarial examples. These adversarial examples can easily drive classifiers to the wrong classification [29]. Such attacks include the iterative fast gradient sign method (IFGSM) [17], Jacobian-based saliency map attacks [21], one-pixel attacks [27], Carlini and Wagner attacks [5], and universal attacks [20]. We select IFGSM for our experiments, but our method can also be used with other types of adversarial attacks. Adversarial examples can also be used for understanding neural networks.
Figure 2. Adversarial images are difficult to explain: when the answer is wrong, saliency-based methods (left) often fail to detect what went wrong. Instead, attributes (right) provide intuitive and effective visual and textual explanations.

The work in [4] aims at utilizing adversarial examples for understanding deep neural networks by extracting the features that provide the support for classification into the target class. The most salient features in the images provide a way to interpret the decision of a classifier, but they lack human understandability. Additionally, finding the most salient features is computationally rather expensive. The crucial point, however, is that if humans explain classification by attributes, then attributes are also natural candidates to study misclassification and robustness. Hence, in this work, in order to understand neural networks, we utilize adversarial examples with attributes, which explain the misclassification due to adversarial attacks.
3. Method
In this section, in order to explain which attributes change when an adversarial attack is performed on the classification mechanism of the network, we detail a two-step framework. First, we perturb the images using adversarial attack methods and robustify the classifiers via adversarial training. Second, we predict the class-specific attributes and visually ground them on the image to provide an intuitive justification of why an image is classified as a certain class. Finally, we introduce our metric for quantifying the robustness of an adversarially robust network against adversarial attacks.
Given a clean $n$-th input $x_n$ and its respective ground-truth class $y_n$ predicted by a model $f(x_n)$, an adversarial attack model generates an image $\hat{x}_n$ for which the predicted class is $y$, where $y \neq y_n$. In the following, we detail an adversarial attack method for fooling a general classifier and an adversarial training technique that robustifies it.

Adversarial Attacks.
The iterative fast gradient sign method (IFGSM) [17] is leveraged to fool only the classifier network. IFGSM solves the following equation to produce adversarial examples:
$$\hat{x}^0_n = x_n, \qquad \hat{x}^{i+1}_n = \mathrm{Clip}_{\epsilon}\left\{\hat{x}^i_n + \alpha\,\mathrm{Sign}\left(\nabla_{\hat{x}^i_n} L(\hat{x}^i_n, y_n)\right)\right\} \quad (1)$$

where $\nabla_{\hat{x}^i_n} L$ represents the gradient of the cost function w.r.t. the perturbed image $\hat{x}^i_n$ at step $i$, $\alpha$ determines the step size taken in the direction of the sign of the gradient, and finally the result is clipped by $\mathrm{Clip}_{\epsilon}$.

Figure 3. Interpretable attribute prediction-grounding model. After an adversarial attack or adversarial training step, image features of both clean images, $\theta(x_n)$, and adversarial images, $\theta(\hat{x}_n)$, are extracted using ResNet and mapped into the attribute space $\phi(y)$ by learning the compatibility function $F(x_n, y_n; W)$ between image features and class attributes. Finally, attributes predicted by the attribute-based classifier, $A^q_{x_n,y_n}$, are grounded by matching them with attributes predicted by Faster R-CNN, $A^j_{x_n}$, for clean and adversarial images.
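To make the update in Eq. (1) concrete, below is a minimal PyTorch sketch of IFGSM. The cross-entropy loss and the specific eps, alpha, and step-count values are illustrative assumptions, since the paper specifies only the method and the $l_\infty$ constraint.

```python
import torch
import torch.nn.functional as F

def ifgsm(model, x, y, eps=0.03, alpha=0.005, steps=10):
    """Iterative FGSM (Eq. 1): take repeated steps in the direction of the
    gradient sign and clip the result to an l_inf ball of radius eps."""
    x_adv = x.clone().detach()                   # \hat{x}^0_n = x_n
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)  # L(\hat{x}^i_n, y_n)
        grad = torch.autograd.grad(loss, x_adv)[0]
        with torch.no_grad():
            x_adv = x_adv + alpha * grad.sign()  # step along Sign(gradient)
            x_adv = torch.min(torch.max(x_adv, x - eps), x + eps)  # Clip_eps
            x_adv = x_adv.clamp(0.0, 1.0)        # keep a valid image
    return x_adv.detach()
```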
Adversarial Robustness.

We use adversarial training as a defense against adversarial attacks, which minimizes the following objective [11]:

$$L_{adv}(x_n, y_n) = \alpha L(x_n, y_n) + (1 - \alpha) L(\hat{x}_n, y) \quad (2)$$

where $L(x_n, y_n)$ is the classification loss for clean images, $L(\hat{x}_n, y)$ is the loss for adversarial images, and $\alpha$ regulates the loss to be minimized. The model finds the worst-case perturbations and fine-tunes the network parameters to reduce the loss on perturbed inputs. Hence, this results in a robust network $f_r(\hat{x})$, which improves the classification accuracy on the adversarial images.
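A minimal sketch of one training step minimizing Eq. (2); reusing the ifgsm sketch above as the attack, the cross-entropy loss, and $\alpha = 0.5$ are assumptions not fixed by the paper at this point.

```python
import torch.nn.functional as F

def adversarial_training_step(model, optimizer, x, y, alpha=0.5, eps=0.03):
    """One step minimizing L_adv (Eq. 2): a weighted sum of the loss on
    clean images and the loss on their adversarial counterparts."""
    x_adv = ifgsm(model, x, y, eps=eps)          # worst-case perturbation
    optimizer.zero_grad()
    loss_clean = F.cross_entropy(model(x), y)    # L(x_n, y_n)
    loss_adv = F.cross_entropy(model(x_adv), y)  # loss on adversarial images
    loss = alpha * loss_clean + (1.0 - alpha) * loss_adv   # Eq. (2)
    loss.backward()
    optimizer.step()
    return loss.item()
```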
Our attribute prediction and grounding model uses attributes to define a joint embedding space that the images are mapped to.

Attribute prediction.
The model is shown in Figure 3. During training, our model maps clean training images close to their respective class attributes, e.g. "painted bunting" with attributes "red belly, blue head", whereas adversarial images get mapped close to a wrong class, e.g. "herring gull" with attributes "white belly, white head". We employ structured joint embeddings (SJE) [1] to predict attributes in an image. Given input image features $\theta(x_n) \in \mathcal{X}$ and output class attributes $\phi(y_n) \in \mathcal{Y}$ from the sample set $S = \{(\theta(x_n), \phi(y_n)), n = 1 \ldots N\}$, SJE learns a mapping $f: \mathcal{X} \rightarrow \mathcal{Y}$ by minimizing the empirical risk $\frac{1}{N}\sum_{n=1}^{N} \Delta(y_n, f(x_n))$, where $\Delta: \mathcal{Y} \times \mathcal{Y} \rightarrow \mathbb{R}$ estimates the cost of predicting $f(x_n)$ when the ground-truth label is $y_n$.

A compatibility function $F: \mathcal{X} \times \mathcal{Y} \rightarrow \mathbb{R}$ is defined between the input space $\mathcal{X}$ and the output space $\mathcal{Y}$:

$$F(x_n, y_n; W) = \theta(x_n)^T W \phi(y_n) \quad (3)$$

A pairwise ranking loss $L(x_n, y_n, y)$ is used to learn the parameters $W$:

$$\Delta(y_n, y) + \theta(x_n)^T W \phi(y) - \theta(x_n)^T W \phi(y_n) \quad (4)$$

Attributes are predicted for both clean and adversarial images by:

$$A_{n,y_n} = \theta(x_n) W, \qquad \hat{A}_{n,y} = \theta(\hat{x}_n) W \quad (5)$$

The image is assigned the label of the nearest output class attribute vector $\phi(y_n)$.
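The following NumPy sketch illustrates the bilinear compatibility of Eq. (3), the ranking term of Eq. (4), and attribute prediction as in Eq. (5). The feature dimension, the random placeholder data, and the hinge around the ranking term are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
D, A, C = 2048, 85, 50                    # feature dim, attributes, classes (AwA-like)
W = 0.01 * rng.standard_normal((D, A))    # learned bilinear mapping
phi = rng.random((C, A))                  # per-class attribute vectors phi(y)

def compatibility(theta_x, y):
    """F(x, y; W) = theta(x)^T W phi(y)   (Eq. 3)."""
    return theta_x @ W @ phi[y]

def ranking_loss(theta_x, y_true, y_wrong, delta=1.0):
    """Pairwise ranking term of Eq. (4): margin plus wrong-class score
    minus true-class score (hinged at zero in this sketch)."""
    return max(0.0, delta + compatibility(theta_x, y_wrong)
                          - compatibility(theta_x, y_true))

def predict_attributes(theta_x):
    """A = theta(x) W   (Eq. 5): project image features to attribute space."""
    return theta_x @ W

def classify(theta_x):
    """Assign the label of the nearest class attribute vector."""
    a = predict_attributes(theta_x)
    return int(np.argmin(np.linalg.norm(phi - a, axis=1)))
```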
Attribute grounding.

In our final step, we ground the predicted attributes on the input images using a pre-trained Faster R-CNN network and visualize them as in [3]. The pre-trained Faster R-CNN model $F(x_n)$ predicts bounding boxes denoted by $b_j$; for each object bounding box it predicts the class $Y_j$ as well as the attribute $A_j$ [2]:

$$b_j, A_j, Y_j = F(x_n) \quad (6)$$

where $j$ is the bounding box index. The most discriminative attributes predicted by SJE are selected based on the criterion that they change the most when the image is perturbed with noise. For clean images we use:

$$q = \mathrm{argmax}_i \left( A^i_{n,y_n} - \phi(y^i) \right) \quad (7)$$

and for adversarial images we use:

$$p = \mathrm{argmax}_i \left( \hat{A}^i_{n,y} - \phi(y^i_n) \right) \quad (8)$$

where $i$ is the attribute index, $q$ and $p$ are the indices of the most discriminative attributes predicted by SJE, and $\phi(y^i)$, $\phi(y^i_n)$ are the wrong-class and ground-truth-class attributes, respectively. We then search for the selected attributes $A^q_{x_n,y_n}$, $A^p_{\hat{x}_n,y}$ among the attributes predicted by Faster R-CNN for each bounding box, $A^j_{x_n}$, $A^j_{\hat{x}_n}$; when the attributes predicted by SJE and Faster R-CNN match, that is, $A^q_{x_n,y_n} = A^j_{x_n}$ and $A^p_{\hat{x}_n,y} = A^j_{\hat{x}_n}$, we ground them on their respective clean and adversarial images. Note that the adversarial images used here are generated to fool only the general classifier, not the attribute predictor or the Faster R-CNN.

To describe the ability of a network for robustification, independent of its performance on a standard classifier, we introduce a metric called the robust ratio. We calculate the loss of accuracy $L_R$ of a robust classifier by comparing a standard classifier $f(x_n)$ on clean images with the robust classifier $f_r(\hat{x}_n)$ on the adversarially perturbed images:

$$L_R = f(x_n) - f_r(\hat{x}_n) \quad (9)$$

We then calculate the loss of accuracy $L_S$ of a standard classifier by comparing its accuracy on the clean and adversarially perturbed images:

$$L_S = f(x_n) - f(\hat{x}_n) \quad (10)$$

The ability to robustify is then defined as:

$$R = \frac{L_R}{L_S} \quad (11)$$

$R$ is the robust ratio. It indicates the fraction of the classification accuracy of the standard classifier recovered by the robust classifier when adding noise.
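A short sketch of the attribute selection in Eqs. (7)–(8) and the robust ratio of Eqs. (9)–(11), reading $f(\cdot)$ as a classification accuracy; the example accuracy values are hypothetical.

```python
import numpy as np

def most_discriminative(A_pred, phi_other):
    """Eqs. (7)/(8): index of the predicted attribute deviating most from
    the other class's attribute vector (wrong class for clean images,
    ground-truth class for adversarial images)."""
    return int(np.argmax(A_pred - phi_other))

def robust_ratio(acc_std_clean, acc_std_adv, acc_robust_adv):
    """Eqs. (9)-(11): R = L_R / L_S, relating the accuracy loss remaining
    after robustification to the loss of the standard classifier."""
    L_R = acc_std_clean - acc_robust_adv   # Eq. (9)
    L_S = acc_std_clean - acc_std_adv      # Eq. (10)
    return L_R / L_S                       # Eq. (11)

# hypothetical accuracies: 93% clean, 20% under attack, 75% after robustification
print(robust_ratio(0.93, 0.20, 0.75))      # ~0.25
```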
4. Experiments
In this section, we perform experiments on three different datasets and analyse the change in attributes for clean as well as adversarial images. We additionally analyse the results for our proposed robustness quantification metric on both general and attribute-based classifiers.
Datasets.
We experiment on three datasets: Animals with Attributes 2 (AwA) [18], the Large Attribute dataset (LAD) [36], and Caltech UCSD Birds (CUB) [33]. AwA contains 37,322 images (22,206 train / 5,599 val / 9,517 test) with 50 classes and 85 attributes per class. LAD has 78,017 images (40,957 train / 13,653 val / 23,407 test) with 230 classes and 359 attributes per class. CUB consists of 11,788 images (5,395 train / 599 val / 5,794 test) belonging to 200 fine-grained categories of birds, with 312 attributes per class. All three datasets contain real-valued class attributes representing the presence of a certain attribute in a class. The Visual Genome dataset [16] is used to train the Faster R-CNN model, which extracts bounding boxes using 1600 object and 400 attribute annotations. Each bounding box is associated with an attribute followed by the object, e.g. "a brown bird".
Image Features and Adversarial Examples.
We extract image features and generate adversarial images using the fine-tuned ResNet-152. Adversarial attacks are performed using the IFGSM method with three increasing values of $\epsilon$. The $l_\infty$ norm is used as a similarity measure between the clean input and the generated adversarial example.

Adversarial Training.
As for adversarial training, we repeatedly compute adversarial examples while training the fine-tuned ResNet-152 to minimize the loss on these examples. We generate the adversarial examples using the projected gradient descent method, a multi-step variant of FGSM, with three increasing $\epsilon$ values for adversarial training, as in [19]. Note that we do not attack the attribute-based network directly: we attack the general classifier and extract features from it for training the attribute-based classifier. Similarly, adversarial training is also performed on the general classifier, and the features extracted from this model are used for training the attribute-based classifier.

Figure 4. Comparing the accuracy of the general and the attribute-based classifiers for adversarial examples, to investigate the change in attributes. We evaluate both classifiers by extracting features from a standard network and from the adversarially robust network.
Attribute Prediction and Grounding.
At test time, the image features are projected onto the attribute space, and the image is assigned the label of the nearest ground-truth attribute vector. The predicted attributes are grounded using a Faster R-CNN pre-trained on the Visual Genome dataset, since we do not have ground-truth part bounding boxes for any of the attribute datasets.
5. Results
We investigate the change in attributes quantitatively (i) by performing classification based on attributes and (ii) by computing distances between attributes in the embedding space. We additionally investigate the changes qualitatively by grounding the attributes on images, for both standard and adversarially robust networks.

First, we compare the general classifier and the attribute-based classifier in terms of classification accuracy on clean images. The attribute-based model is the more explainable classifier, since it predicts attributes, whereas the general classifier predicts the class label directly. We therefore first verify whether the attribute-based classifier performs as well as the general classifier. We find that the attribute-based and general classifier accuracies are comparable for AwA (general: 93.53, attribute-based: 93.83). The attribute-based classifier accuracy is slightly higher for LAD (general: 80.00, attribute-based: 82.77), and slightly lower for CUB (general: 81.00, attribute-based: 76.90).

To qualitatively analyse the predicted attributes, we ground them on clean and adversarial images. We select our images among the ones that are correctly classified when clean and incorrectly classified when adversarially perturbed. Further, we select the most discriminative attributes based on Equations 7 and 8, and evaluate the attributes that change their value the most for the CUB, AwA, and LAD datasets.
Figure 5. Attribute distance plots for the standard learning framework (AwA, CUB). The plots are shown for the clean and the adversarial image attributes.
By Performing Classification Based on Attributes.

With adversarial attacks, the accuracy of both the general and the attribute-based classifiers drops as the perturbation increases; see Figure 4 (blue curves). The drop in accuracy of the general classifier for the fine-grained CUB dataset is higher than for the coarse AwA dataset, which confirms our hypothesis: at the same $\epsilon$, the general classifier's accuracy drops substantially more for CUB than for AwA, whereas the drop in accuracy with the attribute-based classifier is almost equal for both. We propose that one reason behind the smaller drop in accuracy for the CUB dataset with the attribute-based classifier, compared to the general classifier, is that fine-grained datasets have many common attributes among classes. Therefore, in order to misclassify an image, a significant number of attributes need to be changed, whereas for a coarse-grained dataset, changing a few attributes is sufficient for misclassification. Another reason is that there are more attributes per class in the CUB dataset than in the AwA dataset.

For the coarse dataset, the attribute-based classifier shows performance comparable to the general classifier, while for the fine-grained dataset the attribute-based classifier shows better performance than the general classifier, so a large change in attributes is required to cause misclassification with attributes. Overall, the drop in accuracy under adversarial attacks demonstrates that, with adversarial perturbations, the attribute values change towards those that belong to the new class and cause the misclassification.
Figure 6. Qualitative analysis of adversarial attacks on the standard network. The attributes, ranked by importance for the classification decision, are shown below the images. The grounded attributes are color coded for visibility (the ones in gray could not be grounded). The attributes for clean images are related to the ground-truth classes, whereas the ones predicted for adversarial images are related to the wrong classes.
Figure 7. Attribute distance plots for the robust learning framework (AwA, CUB). The plots are shown only for the attributes of adversarial images that are misclassified with the standard features and correctly classified with the robust features.
By Computing Distances in Embedding Space.

In order to analyse the attributes in the embedding space, we consider the images which are correctly classified without perturbations and misclassified with perturbations. Further, we select the most discriminative attributes using Equations 7 and 8. Our aim is to analyse the change in attributes in the embedding space.
In orderto perform analysis on attributes in embedding space, weconsider the images which are correctly classified withoutperturbations and misclassified with perturbations. Further,we select the top of the most discriminative attributesusing equation 7 and 8. Our aim is to analyse the change inattributes in embedding space.We contrast the Euclidean distance between predicted at- tributes of clean and adversarial samples: d = d { A n,y n , ˆA n,y } = (cid:107) A n,y n − ˆA n,y (cid:107) (12)with the Euclidean distance between the ground truth at-tribute vector of the correct and wrong classes: d = d { φ ( y n ) , φ ( y ) } = (cid:107) φ ( y n ) − φ ( y )) (cid:107) (13)and show the results in Figure 5. Where, A n,y n denotes thepredicted attributes for the clean images classified correctly,and ˆA n,y denotes the predicted attributes for the adversar-ial images misclassified with a standard network. The cor-rect ground truth class attribute is referred to as φ ( y n ) andwrong class attributes are φ ( y ) .We observe that for the AWA dataset the distances be-tween the predicted attributes for adversarial and clean im-ages d are smaller than the distances between the groundtruth attributes of the respective classes d . The closenessin predicted attributes for clean and adversarial images ascompared to their ground truths shows that attributes changetowards the wrong class but not completely. This is due tothe fact that for coarse classes, only a small change in at-tribute values is sufficient to change the class.The fine-grained CUB dataset behaves differently. Theoverlap between d and d distributions demonstratesthat attributes of images belonging to fine-grained classeschange significantly as compared to images from coarsecategories. Although the fine grained classes are closer to lack BeakYellow HeadOrange Head Adversarial with robust network (Ground truth Class)
Figure 8. Qualitative analysis of adversarial attacks on the robust network. The attributes are ranked by importance for the classification decision; the grounded attributes are color coded for visibility (the ones in gray could not be grounded). The attributes for adversarial images with the robust network are related to the ground-truth classes, whereas the ones predicted for adversarial images with the standard network change towards the wrong classes.
By Grounding Attributes on Images.

We observe in Figure 6 that the most discriminative attributes for the clean images are coherent with the ground-truth class and are localized accurately; for adversarial images, however, they are coherent with the wrong class. Those attributes which are common to both the clean and adversarial classes are localized correctly on the adversarial images; however, the attributes which are not related to the ground-truth class, i.e. the ones that are related to the wrong class, cannot be grounded, as there is no visual evidence that supports their presence. For example, the attributes "brown wing, long wing, long tail" are common to both classes; hence, they are present in both the clean image and the adversarial image. On the other hand, "has a brown color" and "a multicolored breast" are related to the wrong class and are not present in the adversarial image; hence, they cannot be grounded. Similarly, in the second example none of the attributes are grounded, because the attributes changed completely towards the wrong class and the evidence for those attributes is not present in the image. This indicates that the attributes for the clean images correspond to the ground-truth class, while those for adversarial images correspond to the wrong class. Additionally, only those attributes common to both the wrong and the ground-truth classes get grounded on adversarial images.

Similarly, our results on the LAD and AwA datasets in the second row of Figure 6 show that the grounded attributes on clean images confirm the classification into the ground-truth class, while the attributes grounded on adversarial images are common to clean and adversarial images. For instance, in the first AwA example, the "is black" attribute is common to both classes, so it is grounded on both images, but "has claws" is an important attribute for the adversarial class; as it is not present in the ground-truth class, it is not grounded.

Compared to misclassifications caused by adversarial perturbations on CUB, images do not necessarily get misclassified into the most similar class for the AwA and LAD datasets, as they are coarse-grained datasets. Therefore, there is less overlap of attributes between the ground-truth and adversarial classes, which is in accordance with our quantitative results. Furthermore, the attributes for both datasets are not highly structured, as different objects can be distinguished from each other with only a small number of attributes.
Our evaluation of the standard and adversarially robust networks shows that the classification accuracy on adversarial images improves when adversarial training is used to robustify the network; see Figure 4 (purple curves). For example, for AwA the accuracy of the general classifier improves substantially under adversarial attack, and, as expected, the improvement for the fine-grained CUB dataset is higher than for the AwA dataset. However, for the attribute-based classifier, the improvement in accuracy for AwA is almost double that of the CUB dataset. We propose this is because the AwA dataset is coarse, so in order to classify an adversarial image correctly into its ground-truth class, a small change in attributes is sufficient. Conversely, the fine-grained CUB dataset requires a large change in attribute values to correctly classify an adversarial image into its ground-truth class. Additionally, CUB contains more attributes per class. For the coarse AwA dataset the attributes change back to the correct class and represent the correct class accurately, while for the fine-grained CUB dataset a large change in attribute values is required to correctly classify images.
Figure 9. Ability to robustify a network. The ability to robustify a network with increasing adversarial perturbations is shown for three different datasets, for both general and attribute-based classifiers.

This shows that, with a robust network, the change in attribute values for adversarial images points to the ground-truth class, resulting in better performance. Overall, by analysing the attribute-based classifier accuracy, we observe that under adversarial attacks the change in attribute values indicates to which wrong class an image is assigned, while with the robust network the change in attribute values points towards the ground-truth class.
By Computing Distances in Embedding Space.
We compare the distances between the predicted attributes of only those adversarial images that are classified correctly with the adversarially robust network, $\hat{A}^r_{n,y_n}$, and classified incorrectly with the standard network, $\hat{A}_{n,y}$:

$$d_3 = d\{\hat{A}^r_{n,y_n}, \hat{A}_{n,y}\} = \|\hat{A}^r_{n,y_n} - \hat{A}_{n,y}\| \quad (14)$$

with the distances between the ground-truth target class attributes $\phi(y_n)$ and the ground-truth wrong-class attributes $\phi(y)$:

$$d_4 = d\{\phi(y_n), \phi(y)\} = \|\phi(y_n) - \phi(y)\| \quad (15)$$

The results are shown in Figure 7. Comparing Figure 7 with Figure 5, we observe a similar behavior. The plots in Figure 5 are computed between clean and adversarial image attributes, while the plots in Figure 7 are computed only between adversarial images, classified correctly with an adversarially robust network and misclassified with a standard network. This shows that the adversarial images classified correctly with a robust network behave like clean images, i.e. a robust network predicts attributes for the adversarial images which are closer to their ground-truth class.

Finally, our analysis of images correctly classified by the adversarially robust network shows that adversarial images with the robust network also behave like clean images visually. In Figure 8, we observe that the attributes of an adversarial image with a standard network are closer to the adversarial-class attributes, whereas the grounded attributes of an adversarial image with a robust network are closer to its ground-truth class. For instance, the first example contains a "blue head" and a "black wing", whereas one of the most discriminating properties of the ground-truth class, "blue head", is not relevant to the adversarial class. Hence this attribute is not predicted as the most relevant by our model, and thus our attribute grounder did not ground it. This shows that the attributes for adversarial images classified correctly with the robust network are in accordance with the ground-truth class and hence get grounded on the adversarial images.
The results for our proposed robustness quantification metric are shown in Figure 9. We observe that the ability to robustify a network against adversarial attacks varies across datasets: the network is easier to robustify on the fine-grained CUB dataset than on the coarse AwA and LAD datasets. For the general classifier, as expected, the ability to robustify the network increases with increasing noise. For the attribute-based classifier, the ability to robustify the network is high at small noise levels, drops at the intermediate noise level, and then increases again at the highest noise level.
6. Conclusion
In this work we conducted a systematic study on understanding neural networks by exploiting adversarial examples with attributes. We showed that if a noisy sample gets misclassified, its most discriminative attribute values indicate to which wrong class it is assigned. On the other hand, if a noisy sample is correctly classified with the robust network, the most discriminative attribute values point towards the ground-truth class. Finally, we proposed a metric for quantifying the robustness of a network and showed that the ability to robustify a network varies across datasets. Overall, the ability to robustify a network increases with increasing adversarial perturbations.

References

[1] Z. Akata, S. Reed, D. Walter, H. Lee, and B. Schiele. Evaluation of output embeddings for fine-grained image classification. In CVPR. IEEE, 2015.
[2] P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, and L. Zhang. Bottom-up and top-down attention for image captioning and visual question answering. In CVPR, 2018.
[3] L. Anne Hendricks, R. Hu, T. Darrell, and Z. Akata. Grounding visual explanations. In ECCV, 2018.
[4] Anonymous. Evaluations and methods for explanation through robustness analysis. Submitted to International Conference on Learning Representations, 2020. Under review.
[5] N. Carlini and D. Wagner. Towards evaluating the robustness of neural networks. In SP. IEEE, 2017.
[6] Y. Dong, H. Su, J. Zhu, and F. Bao. Towards interpretable deep neural networks by leveraging adversarial examples. arXiv, 2017.
[7] Y. Dong, H. Su, J. Zhu, and B. Zhang. Improving interpretability of deep neural networks with semantic information. In CVPR, 2017.
[8] M. Du, N. Liu, and X. Hu. Techniques for interpretable machine learning. arXiv, 2018.
[9] M. Du, N. Liu, Q. Song, and X. Hu. Towards explanation of DNN-based prediction with guided feature inversion. In SIGKDD. ACM, 2018.
[10] R. C. Fong and A. Vedaldi. Interpretable explanations of black boxes by meaningful perturbation. arXiv, 2017.
[11] I. Goodfellow, J. Shlens, and C. Szegedy. Explaining and harnessing adversarial examples. In ICLR, 2015.
[12] Y. Goyal, Z. Wu, J. Ernst, D. Batra, D. Parikh, and S. Lee. Counterfactual visual explanations. arXiv preprint arXiv:1904.07451, 2019.
[13] L. A. Hendricks, Z. Akata, M. Rohrbach, J. Donahue, B. Schiele, and T. Darrell. Generating visual explanations. In ECCV. Springer, 2016.
[14] L. Jiang, S. Liu, and C. Chen. Recent research advances on interactive machine learning. Journal of Visualization, 2018.
[15] J. Kim, A. Rohrbach, T. Darrell, J. Canny, and Z. Akata. Textual explanations for self-driving vehicles. In ECCV, pages 563–578, 2018.
[16] R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L.-J. Li, D. A. Shamma, et al. Visual Genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision, 123(1):32–73, 2017.
[17] A. Kurakin, I. Goodfellow, and S. Bengio. Adversarial examples in the physical world. ICLR Workshop, 2017.
[18] C. H. Lampert, H. Nickisch, and S. Harmeling. Learning to detect unseen object classes by between-class attribute transfer. In CVPR. IEEE, 2009.
[19] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu. Towards deep learning models resistant to adversarial attacks. ICLR, 2018.
[20] S.-M. Moosavi-Dezfooli, A. Fawzi, and P. Frossard. DeepFool: a simple and accurate method to fool deep neural networks. In CVPR, 2016.
[21] N. Papernot, P. McDaniel, S. Jha, M. Fredrikson, Z. B. Celik, and A. Swami. The limitations of deep learning in adversarial settings. In EuroS&P, pages 372–387. IEEE, 2016.
[22] D. H. Park, L. A. Hendricks, Z. Akata, B. Schiele, T. Darrell, and M. Rohrbach. Multimodal explanations: Justifying decisions and pointing to the evidence. In CVPR, 2018.
[23] M. T. Ribeiro, S. Singh, and C. Guestrin. "Why should I trust you?": Explaining the predictions of any classifier. In SIGKDD, pages 1135–1144. ACM, 2016.
[24] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In ICCV, 2017.
[25] A. Shrikumar, P. Greenside, and A. Kundaje. Learning important features through propagating activation differences. In ICML, pages 3145–3153. JMLR.org, 2017.
[26] K. Simonyan, A. Vedaldi, and A. Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv, 2013.
[27] J. Su, D. V. Vargas, and K. Sakurai. One pixel attack for fooling deep neural networks. IEEE Transactions on Evolutionary Computation, 2019.
[28] M. Sundararajan, A. Taly, and Q. Yan. Axiomatic attribution for deep networks. In ICML, pages 3319–3328. JMLR.org, 2017.
[29] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus. Intriguing properties of neural networks. ICLR, 2013.
[30] G. Tao, S. Ma, Y. Liu, and X. Zhang. Attacks meet interpretability: Attribute-steered detection of adversarial samples. In NeurIPS, 2018.
[31] D. Tsipras, S. Santurkar, L. Engstrom, A. Turner, and A. Madry. Robustness may be at odds with accuracy. stat, 1050, 2018.
[32] H. Uzunova, J. Ehrhardt, T. Kepp, and H. Handels. Interpretable explanations of black box classifiers applied on medical images by meaningful perturbations using variational autoencoders. In Medical Imaging 2019: Image Processing, volume 10949, page 1094911. International Society for Optics and Photonics, 2019.
[33] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. The Caltech-UCSD Birds-200-2011 dataset. 2011.
[34] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In ECCV. Springer, 2014.
[35] T. Zhang and Z. Zhu. Interpreting adversarially trained convolutional neural networks. arXiv preprint arXiv:1905.09797, 2019.
[36] B. Zhao, Y. Fu, R. Liang, J. Wu, Y. Wang, and Y. Wang. A large-scale attribute dataset for zero-shot learning. arXiv, 2018.
[37] L. M. Zintgraf, T. S. Cohen, T. Adel, and M. Welling. Visualizing deep neural network decisions: Prediction difference analysis. ICLR, 2017.