How Useful Are the Machine-Generated Interpretations to General Users? A Human Evaluation on Guessing the Incorrectly Predicted Labels
Hua Shen, Ting-Hao (Kenneth) Huang
College of Information Sciences and Technology, The Pennsylvania State University
201 Old Main, University Park, PA 16802, USA
{huashen218, txh710}@psu.edu

Abstract
Explaining to users why automated systems make certain mistakes is important and challenging. Researchers have proposed ways to automatically produce interpretations for deep neural network models. However, it is unclear how useful these interpretations are in helping users figure out why they are getting an error. If an interpretation effectively explains to users how the underlying deep neural network model works, people who were presented with the interpretation should be better at predicting the model's outputs than those who were not. This paper presents an investigation on whether or not showing machine-generated visual interpretations helps users understand the incorrectly predicted labels produced by image classifiers. We showed the images and the correct labels to 150 online crowd workers and asked them to select the incorrectly predicted labels with or without showing them the machine-generated visual interpretations. The results demonstrated that displaying the visual interpretations did not increase, but rather decreased, the average guessing accuracy by roughly 10%.
Introduction
Explaining to users why automated systems make certain mistakes is important. As deep neural network technologies achieve higher performance, they have been applied to important domains, influencing critical decisions in healthcare, transportation, and education. However, due to the non-linear, complicated structures of neural models, the high performance of deep neural networks is achieved at the cost of interpretability. In response, researchers have proposed ways to explain the inner workings of deep neural networks by automatically producing interpretations (Melis and Jaakkola 2018; Selvaraju et al. 2017; Ribeiro, Singh, and Guestrin 2016). Such machine-generated interpretations help various stakeholders (Strobelt et al. 2017): researchers, who develop new deep-learning architectures; machine-learning engineers, who train and optimize existing networks; product engineers, who apply general-purpose pre-trained networks to various tasks; and the general users, who want to understand system outputs (Chu, Roy, and Andreas 2020;
Smith-Renner et al. 2020; Selvaraju et al. 2017). This paper focuses on the end users, who may not understand the mechanism of the underlying deep neural networks but are most influenced by their outputs, to investigate whether machine-generated interpretations can help users make sense of errors made by algorithms. We use the image-classification task as our test bed. Neural image classifiers generate interpretations through two approaches: designing proxies that are inherently interpretable (e.g., decision trees) to substitute the black-box deep neural networks (Melis and Jaakkola 2018), or generating post-hoc interpretations outside the deep neural network workflow (Selvaraju et al. 2017), which is where our work focuses. Most post-hoc interpretations take the form of instance-wise interpretations, for example, saliency maps of input images. A saliency map highlights the most informative region of the image with respect to its classification label, unveiling post-hoc evidence of the neural network prediction. This line of work was in part motivated by the needs of "end users" (Du et al. 2018; Nourani et al. 2019), "non-expert users" (Ribeiro, Singh, and Guestrin 2016), or "untrained users" (Selvaraju et al. 2017), and the generated interpretations were often evaluated by how much they could boost users' trust in deep neural networks. However, it is still unclear how useful these interpretations are in helping users make sense of automated system errors. The need for interpretability arises due to
incompleteness in the problem formalization, which makes it difficult to make further judgements or optimizations (Doshi-Velez and Kim 2017). When a user observes a few cases where the automated system incorrectly labels his or her images, it is difficult for the user to decide what to do. Did the errors occur because the system's accuracy level is low? If so, should the user switch to another system? Are the images too complicated for computers, in which case users should not expect reliable image labels? Did the underlying algorithms have biases that worsened with certain types of images? We argue that errors expose existing incompleteness in the problem formalization, requiring users to seek interpretations. Namely, an important use case of interpretations is to help users figure out what is going on when they get certain errors. Researchers have proposed evaluations to assess how much an interpretation reflects the model's behavior (also known as "fidelity") (Melis and Jaakkola 2018) or boosts users' trust in automated systems (Selvaraju et al. 2017; Ribeiro, Singh, and Guestrin 2016). However, it is unclear how useful these interpretations are in helping users figure out why they are getting an error.

This paper introduces a method that uses crowd workers from Amazon Mechanical Turk (MTurk) to directly evaluate the usefulness of interpretations in helping users to reason about the errors of deep neural networks. Figure 1 overviews the workflow. (The code and interface are available via GitHub: https://github.com/huashen218/GuessWrongLabel.) In this task, each worker is presented with an image and told that the deep neural network incorrectly predicted its label. The worker may also be presented with a set of interpretations (e.g., saliency maps) that explain how the deep neural network "perceives" this image and makes the final prediction. The worker is then asked to guess the incorrectly predicted label from five options, four of them being distractors. If an interpretation effectively explains how the underlying deep neural network model works to users, the people who were presented with the interpretation should be better at predicting the model's outputs than those who were not.

Figure 1: The workflow of the "Guessing the Incorrectly Predicted Label" task. Each worker is presented with an image and told that the deep neural network incorrectly predicted its label (Step 1). The worker may also be presented with visual interpretations (Step 2). The worker is then asked to guess the incorrectly predicted label ("Carousel" in this example) from five options, four of them being distractors (Step 3). If an interpretation effectively explains how the underlying deep neural network model works to users, the people who were presented with the interpretation should be better at predicting the model's outputs.

This paper tried to answer two research questions: First (RQ1), do machine-generated visual interpretations help human users better identify predicted labels? Second (RQ2), when do (and when do not) the visual interpretations help?

Related Work
Interpretation Methods
Our work focuses on post-hoc interpretations. These methods generate saliency maps to indicate where the neural networks "look" in the images for their predictions' evidence. Existing methods can be categorized into four lines:
Backprop-Based: computes the gradient (or its variants) of the neural network output with respect to the input to score the importance of each input pixel, such as SmoothGrad (Smilkov et al. 2017); a minimal sketch of this approach appears after this list;
Representation-Based: uses the feature maps at intermediate layers of the neural network to generate saliency maps, like GradCAM (Selvaraju et al. 2017);
Meta-Model-Based: trains a meta-model to predict the saliency map for any given input in a single feed-forward pass, such as RTS (Dabkowski and Gal 2017);
Perturbation-Based: finds the saliency map by perturbing the input with minimum intervention and observing the change in model prediction, like ExtremalPerturb (Fong, Patrick, and Vedaldi 2019).
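To make the backprop-based family concrete, below is a minimal SmoothGrad-style sketch in PyTorch. It is an illustrative re-implementation rather than the interpreter used in this paper; the function name and the hyperparameters (number of noisy samples, noise scale) are assumptions.

```python
# A minimal SmoothGrad-style saliency sketch (illustrative, not the paper's implementation).
import torch

def smoothgrad_saliency(model, image, target_class, n_samples=25, noise_std=0.15):
    """image: a normalized C x H x W tensor; returns an H x W saliency map in [0, 1]."""
    model.eval()
    grads = torch.zeros_like(image)
    for _ in range(n_samples):
        # Add Gaussian noise to the input and compute the gradient of the target logit.
        noisy = (image + noise_std * torch.randn_like(image)).unsqueeze(0)
        noisy.requires_grad_(True)
        score = model(noisy)[0, target_class]
        score.backward()
        grads += noisy.grad.squeeze(0)
    saliency = grads.abs().max(dim=0).values   # collapse color channels into one heatmap
    return saliency / (saliency.max() + 1e-8)  # normalize for visualization
```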
Evaluating Interpretations
Evaluating the effectiveness of interpretations is critical in practice. Existing evaluations answer two questions: whether the interpretations genuinely reflect neural network behavior (Adebayo et al. 2018), and whether the interpretations are useful for users. To answer the latter question, a set of metrics has been proposed that involves human evaluation. For instance, trust assessment and user satisfaction are verified in Smith-Renner et al. (2020) by surveying general users. Mental model evaluations designed by Bucinca et al. (2020) and Chu, Roy, and Andreas (2020) measure whether general users can understand and predict model outputs. Feng and Boyd-Graber (2019) create a human-computer cooperative task to measure how much interpretation improves human performance. However, more study is needed to investigate how general users perceive and predict neural networks' failure cases, which is of vital importance in building trust and correcting model behavior.
Human-AI Collaboration
Although human computation has traditionally played a data annotation role in deep learning systems, there is increasing interest in incorporating it into diverse stages of human-AI hybrid systems (Nourani et al. 2019). Due to its goal of building human understanding and trust in black-box neural networks, interpretation is inherently a human-centric problem. Related efforts involve human perception of different types of interpretation representations in visual interfaces (Roy et al. 2019), among others.
Method
We used a deep neural network to label images and employed several interpreters to generate visual interpretations for the images. We showed each image that the deep neural network had labeled incorrectly to a group of online crowd workers and asked them to guess which label the deep neural network had mistakenly assigned. Only the workers in the interpretation condition were presented with the visual interpretations. We detail the procedure of the study in this section.
Step 1: Labeling Images
We trained an image classifier on the ImageNet dataset, with its top-1 accuracy reaching 78.67% (Xie et al. 2019). We randomly selected images whose labels were incorrectly identified by the classifier.
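The selection step can be sketched as follows. This is a plausible reconstruction rather than the authors' script: the torchvision ResNet-50, dataset root, and batch size are assumptions (the study used a stronger classifier with 78.67% top-1 accuracy).

```python
# Sketch: collect ImageNet validation images that a pretrained classifier mislabels.
# Assumes torchvision >= 0.13; model and paths are illustrative, not the paper's setup.
import torch
from torchvision import datasets, transforms, models

preprocess = transforms.Compose([
    transforms.Resize(256), transforms.CenterCrop(224), transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
val_set = datasets.ImageNet(root="/data/imagenet", split="val", transform=preprocess)
loader = torch.utils.data.DataLoader(val_set, batch_size=64, shuffle=False)

model = models.resnet50(weights="IMAGENET1K_V2").eval()

misclassified = []  # (dataset index, true label, predicted label)
with torch.no_grad():
    for batch_idx, (images, labels) in enumerate(loader):
        preds = model(images).argmax(dim=1)
        for i in torch.nonzero(preds != labels).flatten().tolist():
            idx = batch_idx * loader.batch_size + i
            misclassified.append((idx, labels[i].item(), preds[i].item()))
```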
Step 2: Generating Instance-Wise Interpretations
For each image in the misclassified subset, we used three existing interpreters, i.e., input perturbation (Fong, Patrick, and Vedaldi 2019), intermediate feature extraction (Selvaraju et al. 2017), and output backpropagation (Smilkov et al. 2017), to explain three aspects of this image.
Input perturbation interpretation (columns 2-4 in Figure 2) observes how the output value changes as the input is "deleted" in different sub-regions. We used ExtremalPerturb, which aims to find a small pixel subset that, when preserved, is sufficient to keep the model output stable. Moreover, ExtremalPerturb allows researchers to explicitly constrain the percentage of preserved pixels; we provided three levels, a = {20%, 40%, 50%}. Intermediate feature extraction interpretation (column 5 in Figure 2) looks at intermediate layers of the neural network to indicate the discriminative image regions used by the model for prediction. We used GradCAM, which extracts the gradient information flowing into the last convolutional layers, to explain the importance of each pixel. Output backpropagation interpretation (column 6 in Figure 2) leverages backpropagation to track information from the model's output back to its input to generate the saliency map. We used SmoothGrad, which samples similar images by adding noise to the original image and uses the average of the resulting heatmaps to obtain the final interpretation. We eventually generated (i) three saliency maps from the input perturbation view with 20%, 40%, and 50% preserved pixels respectively, (ii) one saliency map from the intermediate feature extraction view, and (iii) one saliency map from the output backpropagation view.
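For the intermediate feature extraction view, the computation can be illustrated with a minimal Grad-CAM-style sketch using forward and backward hooks. This is a simplified stand-in for the interpreter used in the study, and the choice of the last convolutional block is an assumption.

```python
# A minimal Grad-CAM-style sketch (simplified; not the exact interpreter used in the study).
import torch
import torch.nn.functional as F

def gradcam_heatmap(model, image, target_class, conv_layer):
    """image: 1 x C x H x W normalized tensor; conv_layer: the last conv block of `model`."""
    store = {}
    h_fwd = conv_layer.register_forward_hook(lambda m, i, o: store.update(acts=o))
    h_bwd = conv_layer.register_full_backward_hook(lambda m, gi, go: store.update(grads=go[0]))
    try:
        model.eval()
        model.zero_grad()
        model(image)[0, target_class].backward()   # backprop the (mis)predicted logit
    finally:
        h_fwd.remove()
        h_bwd.remove()
    weights = store["grads"].mean(dim=(2, 3), keepdim=True)          # channel importance
    cam = F.relu((weights * store["acts"]).sum(dim=1, keepdim=True)) # weighted feature maps
    cam = F.interpolate(cam, size=image.shape[-2:], mode="bilinear", align_corners=False)
    cam = cam.squeeze().detach()
    return cam / (cam.max() + 1e-8)

# Hypothetical usage with a torchvision ResNet:
# heat = gradcam_heatmap(model, img.unsqueeze(0), wrong_label, model.layer4[-1])
```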
Step 3: Having Crowd Workers Guess the Incorrectly Predicted Label
Next, we recruited crowd workers on MTurk to complete the tasks. Each Human Intelligence Task (HIT) contained one image, multiple workers were recruited to answer each question, and the price of a HIT was $0.05; four built-in MTurk qualifications were used, including Locale (US Only) and a minimum HIT Approval Rate. The workers were shown the image and its correct label, and were informed that "a computer algorithm misidentified this image as something else." Only the workers in the interpretation condition, as shown in Figure 1, were presented with the visual interpretations. On the interface, we explained that the visual interpretations are "visualizations that try to show how the algorithm sees this image," and provided comprehensive descriptions for each interpretation. For example, we explained the "input perturbation interpretation" with a 20% mask (column 2 in Figure 2) as follows: "We only allow the algorithm to see 20% of the image and ask the algorithm to choose which 20% is the most important region. The black mask blocks the regions the algorithm pays less attention to." The workers were then asked to guess the incorrectly predicted label from five options. One of the options was the incorrect label predicted by the deep neural network model, and the remaining four were randomly selected from the whole label set of ImageNet. The task is therefore not about recognizing the correct label; it requires workers to take the mechanism of deep neural networks into account to guess the incorrect label predicted by the model. MTurk workers are appropriate participants because they represent general users who do not necessarily understand deep neural network models and are not trained to reason about these models' errors.
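The construction of the answer options can be sketched as follows. The paper states only that one option is the classifier's incorrect prediction and that the other four are drawn at random from the ImageNet label set; excluding the true label from the distractors, the function name, and the seeding are illustrative assumptions.

```python
# Sketch: assemble the five answer options for one HIT (one wrong prediction + four distractors).
import random

def build_options(predicted_wrong_label, true_label, all_labels, seed=0):
    """all_labels: the full ImageNet label list; returns five shuffled options."""
    rng = random.Random(seed)
    pool = [l for l in all_labels if l not in (predicted_wrong_label, true_label)]
    options = rng.sample(pool, 4) + [predicted_wrong_label]  # four random distractors
    rng.shuffle(options)
    return options
```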
Categorizing Error Cases Manually
To inspect the usefulness of interpretations in fine-grained model failure scenarios (RQ2), the authors inspected 1,000 misclassified images and categorized them into the following five types of errors (Figure 2), in part based on the literature (Arjovsky et al. 2019).
1. Local Character Inference (C1): The model arrives at a wrong prediction by looking at only part of the object. For instance, in Figure 2(C1), the error might be due to the model partially capturing the restaurant dome, which looks similar to the canopy of a pulled rickshaw.
2. Multiple Objects Selection (C2): For images with multiple objects, the model makes its prediction based on one of the objects other than the labeled one.
3. Similar Appearance Inference (C3): The model misclassifies the object in the image into another class with a similar appearance, as shown in Figure 2(C3).
4. Correlation Learning (C4): The model exploits correlational relationships in the training data to apply an incorrect label to the image. For example, in Figure 2(C4), the model predicts a "shower curtain" by identifying the bathroom context, even if no curtain is in the image.
5. Incorrect Gold-Standard Labels (C5): The true label of the image might be incorrect in ImageNet. Figure 2(C5) shows an example.

Figure 2: Examples of the five types of errors in image classification. The visual interpretations are generated by three existing interpreters (see "Step 2" in the Method section).
Experimental Results
Experiment 1: Testing Two Conditions in the Same Batch of HITs
Experiment 1 had two conditions: [Interpretation] (i.e., [Int]) and [No-Interpretation] (i.e., [No-Int]). The only difference was that HITs in the [No-Int] group did not show the interpretations to workers in the interface. We evenly divided 200 randomly selected image samples into two groups. We posted these 200 images in the same batch of HITs at the same time on MTurk, where each HIT recruited nine different workers. A total of 1,800 submissions (900 submissions in each condition) were contributed by 41 workers in the [Int] condition and 40 workers in the [No-Int] condition, respectively. We did not control the workers' participation, so a worker could participate in both groups; thirty-six out of 45 workers participated in both conditions. Surprisingly, in Experiment 1, showing the workers machine-generated visual interpretations reduced their average accuracy in guessing the incorrectly predicted labels.
We calculated the accuracy as the percentage of correctly inferring the classifier's prediction among all 900 submissions in each condition. The accuracy collected in [Int] was 0.73, while the accuracy in [No-Int] was 0.81 (Table 1). The difference was statistically significant (unpaired t-test).

Table 1: Results of Experiment 1, reporting per-category (C1-C5) and overall guessing accuracy for the [Int] and [No-Int] conditions, together with the number of images in each category (100 images per condition). Showing the workers the machine-generated visual interpretations reduced their average accuracy in guessing the incorrectly predicted labels. (Unpaired t-test; asterisks mark statistically significant differences.)
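For reference, the accuracy comparison can be reproduced with standard two-sample tests. The sketch below assumes 0/1 correctness values per submission (unpaired, as in Experiment 1) or per image (paired, as in Experiment 2); it is not the authors' analysis script.

```python
# Sketch: compare guessing accuracy between the [Int] and [No-Int] conditions.
import numpy as np
from scipy import stats

def compare_conditions(int_correct, noint_correct, paired=False):
    """Each argument is a sequence of 0/1 correctness values; use paired=True when the
    same items appear in both conditions (as in Experiment 2)."""
    int_correct, noint_correct = np.asarray(int_correct), np.asarray(noint_correct)
    test = stats.ttest_rel if paired else stats.ttest_ind
    t_stat, p_value = test(int_correct, noint_correct)
    return int_correct.mean(), noint_correct.mean(), p_value

# Toy usage with made-up values (the study had 900 or 1,000 submissions per condition):
acc_int, acc_noint, p = compare_conditions([1, 0, 1, 0, 1], [1, 1, 1, 0, 1])
print(f"[Int]: {acc_int:.2f}  [No-Int]: {acc_noint:.2f}  p = {p:.3f}")
```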
Table 2: Results of Experiment 2, reporting per-category (C1-C5) and overall guessing accuracy for the [Int] and [No-Int] conditions, together with the number of images in each category (C1 = 44, C2 = 20, C3 = 112, C4 = 18, C5 = 6; 200 in total). The machine-generated visual interpretation again reduced the average human accuracy in inferring model misclassification. (Paired t-test; asterisks mark statistically significant differences.)

Experiment 2: Testing with Two Non-Overlapping Sets of Workers
Experiment 2 was controlled more strictly. We randomly selected another 200 images (different from those used in Experiment 1) and used the same photos in both the [Int] and [No-Int] conditions. We used custom MTurk qualifications to control the participants: workers who participated in one condition could not accept HITs in the other condition. We recruited 10 different workers for each image, of whom five were in the [Int] group and the other five were in the [No-Int] group. A total of 2,000 submissions (1,000 submissions in each condition) were collected, contributed by 42 workers in the [Int] condition and 63 workers in the [No-Int] condition, respectively. In Experiment 2, the machine-generated visual interpretations again reduced the average human accuracy in inferring model misclassification (Table 2). The accuracy of [Int] was 0.63, whereas the accuracy in the [No-Int] condition was 0.73. The difference was again statistically significant (paired t-test).

Discussion
Our experiments showed that, in the case of image classification, machine-generated visual interpretations are not necessarily useful in helping users understand deep neural network failures. They could even be harmful, as in the cases where the errors were probably caused by similar appearances between items (C3) or by mistakenly learning from the background or scenes of the images (C4). System designers should use caution when displaying machine-generated interpretations to users.
Why did it not help?
More research is required to discover why showing interpretations was ineffective. Here, we submit several of our hypotheses with the goal of helping future explorations. First, the interpreters may not be good enough to help humans. The representational power of the interpretation model, including its correctness and sensitivity, might not be sufficient to augment human reasoning about errors; although machine-generated interpretations captured some of the deep neural network's behaviors, that may not be enough to help humans. Second, the format may be insufficient. Saliency maps may not be the most efficient format for conveying information to humans. For example, when the model changes an inner parameter, this change might not be obvious enough in a saliency map to be noticeable by humans, but it could still affect the final predictions. Third, the interpreters may work poorly in exactly the cases where the image classifier failed.
Limitations
We are aware that this work has several limitations. First, the sample size was relatively small. Given that classifiers incorrectly labelled more than 10,000 images in the ImageNet validation set alone, 200 images are a relatively small portion of the data. Second, we only tested three particular types of interpretations, and we presented the interpretations together on the same page. This experimental setup introduces the possibility of missing the "best" interpretations, or of different interpretations affecting each other and reducing their effectiveness. Third, we recruited MTurk workers with certain qualifications to simulate general users. It is difficult to eliminate data noise stemming from workers' misunderstanding of the images or options. Finally, we only tested visual interpretations for image classifiers. More research is required to study whether similar effects generalize to other tasks.
Conclusion
The goal of this study was to evaluate the usefulness of machine-generated visual interpretations for general users' reasoning about model errors. To this end, we utilized the "guess the incorrectly predicted label" task to examine the usefulness of visual interpretations. Our two sets of controlled experiments, with 3,800 submissions contributed by 150 online crowd workers, suggest that showing the interpretations does not increase, but rather decreases, the average accuracy of human guesses by roughly 10%.
Acknowledgements
We thank Ting Wang for his support. We also thank the workers on MTurk who participated in our studies.
References

[Adebayo et al. 2018] Adebayo, J.; Gilmer, J.; Muelly, M.; Goodfellow, I.; Hardt, M.; and Kim, B. 2018. Sanity checks for saliency maps. In Proceedings of Advances in Neural Information Processing Systems (NeurIPS), 9505-9515.

[Arjovsky et al. 2019] Arjovsky, M.; Bottou, L.; Gulrajani, I.; and Lopez-Paz, D. 2019. Invariant risk minimization. arXiv preprint arXiv:1907.02893.

[Bucinca et al. 2020] Bucinca, Z.; Lin, P.; Gajos, K. Z.; and Glassman, E. L. 2020. Proxy tasks and subjective measures can be misleading in evaluating explainable AI systems. In Proceedings of the 25th International Conference on Intelligent User Interfaces (IUI), 454-464.

[Chu, Roy, and Andreas 2020] Chu, E.; Roy, D.; and Andreas, J. 2020. Are visual explanations useful? A case study in model-in-the-loop prediction. arXiv preprint arXiv:2007.12248.

[Dabkowski and Gal 2017] Dabkowski, P., and Gal, Y. 2017. Real time image saliency for black box classifiers. In Proceedings of Advances in Neural Information Processing Systems (NeurIPS).

[Doshi-Velez and Kim 2017] Doshi-Velez, F., and Kim, B. 2017. Towards a rigorous science of interpretable machine learning. arXiv preprint arXiv:1702.08608.

[Du et al. 2018] Du, M.; Liu, N.; Song, Q.; and Hu, X. 2018. Towards explanation of DNN-based prediction with guided feature inversion. In Proceedings of ACM International Conference on Knowledge Discovery and Data Mining (KDD), 1358-1367.

[Feng and Boyd-Graber 2019] Feng, S., and Boyd-Graber, J. 2019. What can AI do for me? Evaluating machine learning interpretations in cooperative play. In Proceedings of the 24th International Conference on Intelligent User Interfaces (IUI), 229-239.

[Fong, Patrick, and Vedaldi 2019] Fong, R.; Patrick, M.; and Vedaldi, A. 2019. Understanding deep networks via extremal perturbations and smooth masks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2950-2958.

[Melis and Jaakkola 2018] Melis, D. A., and Jaakkola, T. 2018. Towards robust interpretability with self-explaining neural networks. In Proceedings of Advances in Neural Information Processing Systems (NeurIPS).

[Nourani et al. 2019] Nourani, M.; Kabir, S.; Mohseni, S.; and Ragan, E. D. 2019. The effects of meaningful and meaningless explanations on trust and perceived system accuracy in intelligent systems. In Proceedings of the AAAI Conference on Human Computation and Crowdsourcing, volume 7, 97-105.

[Ribeiro, Singh, and Guestrin 2016] Ribeiro, M. T.; Singh, S.; and Guestrin, C. 2016. "Why should I trust you?": Explaining the predictions of any classifier. In Proceedings of ACM International Conference on Knowledge Discovery and Data Mining (KDD).

[Roy et al. 2019] Roy, C.; Shanbhag, M.; Nourani, M.; Rahman, T.; Kabir, S.; Gogate, V.; Ruozzi, N.; and Ragan, E. D. 2019. Explainable activity recognition in videos. In IUI Workshops.

[Selvaraju et al. 2017] Selvaraju, R. R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; and Batra, D. 2017. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 618-626.

[Smilkov et al. 2017] Smilkov, D.; Thorat, N.; Kim, B.; Viégas, F.; and Wattenberg, M. 2017. SmoothGrad: Removing noise by adding noise. In International Conference on Machine Learning Workshop on Visualization for Deep Learning.

[Smith-Renner et al. 2020] Smith-Renner, A.; Fan, R.; Birchfield, M.; Wu, T.; Boyd-Graber, J.; Weld, D. S.; and Findlater, L. 2020. No explainability without accountability: An empirical study of explanations and feedback in interactive ML. In Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems, 1-13.

[Strobelt et al. 2017] Strobelt, H.; Gehrmann, S.; Pfister, H.; and Rush, A. M. 2017. LSTMVis: A tool for visual analysis of hidden state dynamics in recurrent neural networks. IEEE Transactions on Visualization and Computer Graphics.

[Xie et al. 2019] Xie, Q.; Dai, Z.; Hovy, E.; Luong, M.-T.; and Le, Q. V. 2019. Unsupervised data augmentation for consistency training. arXiv preprint arXiv:1904.12848.