A Human-Grounded Evaluation Benchmark for Local Explanations of Machine Learning
Sina Mohseni
Department of Computer Science & Engineering
Texas A&M University
[email protected]

Eric D. Ragan
Department of Visualization
Department of Computer Science & Engineering
Texas A&M University
[email protected]
The benchmark is available online at https://github.com/SinaMohseni/ML-Interpretability-Evaluation-Benchmark for research purposes.
Abstract
In order for people to be able to trust and take advantage of the results of advanced machine learning and artificial intelligence solutions for real decision making, people need to be able to understand the machine rationale for given output. Research in explainable artificial intelligence (XAI) addresses this aim, but there is a need to evaluate the human relevance and understandability of explanations. Our work contributes a novel methodology for evaluating the quality, or human interpretability, of explanations for machine learning models. We present an evaluation benchmark for instance explanations from text and image classifiers. The explanation meta-data in this benchmark is generated from user annotations of image and text samples. We describe the benchmark and demonstrate its utility with a quantitative evaluation of explanations generated from a recent machine learning algorithm. This research demonstrates how human-grounded evaluation could be used as a measure to qualify local machine-learning explanations.
Author Keywords
Interpretable machine learning; human subject evaluation; local explanations; human-computer interaction.
ACM Classification Keywords
H.5.m [Information interfaces and presentation]: Miscellaneous

Introduction

Figure 1: Human and machine interactions have been studied at different levels, but mainly anchored to the human side of the conversation. Interpretable machine learning is about self-describing, causal machine learning algorithms that explain decision-making rules and processes to the human.
With the recent and continuing advancements in robust deep neural networks, the prominence of machine learning and artificial intelligence models is growing for automated decision-making support, especially in critical areas such as financial analysis, medical management systems, military planning, and autonomous systems. In such cases, human experts, operators, and decision makers can take advantage of new machine learning techniques to assist in taking real-world actions. In order to do so, however, these people need to be able to trust and understand the machine outputs, predictions, and recommendations. Unlike shallow machine learning models that can be interpretable and easier to understand in terms of classification logic, deep learning models are significantly more complex and often considered black-box models due to their poor transparency and understandability.

Thus, for machine-assisted decision-making using new machine learning technology, advancements are needed in achieving explainability and supporting human understanding. This is the primary goal of recent research thrusts in explainable artificial intelligence (XAI). For a system to effectively serve human users, people need to be able to understand the reasoning behind the machine's decisions and actions. Numerous researchers have recently been working to advance explainability through methods such as visualization [9, 3, 10] or model simplification [8, 15]. Different types of interpretability and explainability are possible. For human explainability, for instance, local explanations can be used to explain the connection between a single input instance and the resulting machine output, while global explanations aim to provide a more holistic presentation of how the system works as a whole or for collections of instances. While explainability is a multi-faceted topic, the ultimate goal is for people to understand machine models, and it is therefore important to involve human feedback and reasoning as a requisite component for evaluating the explainability or understandability of XAI methods and models. However, since the majority of research in the area of XAI is led by experts in machine learning and artificial intelligence, relatively little work has involved human evaluation.

In this paper, we describe a novel evaluation methodology for assessing the relevance and appropriateness of local explanations of machine output. We present a human-grounded evaluation benchmark for evaluating instance explanations of images and textual data. The benchmark consists of human-annotated samples of images and text articles to approximate the most important regions for human understanding and classification. By comparing the explanation results from classification models to the benchmark's annotation meta-data, it is possible to evaluate the quality and appropriateness of XAI local explanations. To demonstrate the utility of such a benchmark, we perform a quantitative evaluation of explanations generated from a recent machine learning algorithm. We have also made the benchmark publicly available online for research purposes.
Background
Researchers have argued the importance of interpretable machine learning and how its demand rises from the incompleteness of problem formalization (e.g., [2]). For instance, in many cases a user might lose trust in the system, doubting whether the machine has taken all necessary factors into account. In this situation, an interpretable model can assist the user by generating explanations. Lipton [7] states that interpretable machine learning is needed when there is a mismatch between machine objectives and real-world scenarios, which means a transparent machine learning model should share information and decision-making details with a user to prevent mismatched-objectives problems. The goals of XAI naturally motivate a merger between the human-computer interaction (HCI) and artificial intelligence (AI) disciplines for the creation and evaluation of solutions that are interpretable and explainable for users. It is important that these communities work together to achieve useful and meaningful explanations of machine learning technology.
Explanation Strategies
Interpretable models such as tree-based models [8] and rule lists [15] have been proposed as examples that can be directly explained or summarized using relatively simple or common visualization methods. For more complex black-box models such as deep neural networks (DNNs), other methods have been explored to generate local explanations for each individual instance as well as global explanations of the entire model. Local explanations in the form of saliency maps are a popular way to generate explanations for DNNs. This approach presents the features with the greatest contribution to the classification. For example, Simonyan et al. [13] used output gradients to generate a mask of the pixels the model relies on for the classification task. In other work, Ribeiro et al. [11] presented a model-agnostic algorithm that generates local explanations for any classifier in different data domains. As another example, Ross et al. [12] proposed an iterative approach using input gradients that can improve explanations by constraining them with a loss function.

Data visualization is also a basic tool to show the relationship between data points and clusters. Methods like MDS and t-SNE [9] generate a 2D mapping of high-dimensional data to visualize the spatial relation of data clusters. Visual analytics tools such as ActiVis [3] and Squares [10] take advantage of a 2D mapping of data points along with feature-cluster and instance-cluster views to help users with performance analysis and in understanding classification logic.
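As a concrete illustration of the gradient-based saliency idea, the sketch below is a simplified, generic example in the spirit of [13], not the exact method from that paper; the model choice, the placeholder image path "cat.jpg", and the omission of ImageNet normalization are our own assumptions for brevity.

```python
import torch
from torchvision import models, transforms
from PIL import Image

# Load a pre-trained image classifier; any differentiable model would do.
model = models.resnet50(pretrained=True).eval()

# Minimal preprocessing (ImageNet normalization omitted for brevity).
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

# "cat.jpg" is a placeholder path for an input image.
x = preprocess(Image.open("cat.jpg").convert("RGB")).unsqueeze(0)
x.requires_grad_(True)

# Forward pass; take the score of the predicted class.
scores = model(x)
score = scores[0, scores.argmax(dim=1).item()]

# Backpropagate the class score to the input pixels.
score.backward()

# Saliency map: largest absolute gradient across the color channels per pixel.
saliency = x.grad.detach().abs().max(dim=1).values.squeeze(0)  # shape (224, 224)
```

Pixels with large saliency values are the ones whose perturbation would most change the class score, which is the intuition behind presenting them as a local explanation.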
Evaluating Explanations
In considering evaluation approaches for XAI, Doshi-Velez and Kim [2] proposed three categories: application-grounded, human-grounded, and functionality-grounded evaluations. These categories vary in evaluation cost and inclusiveness. In this taxonomy, functionality-grounded evaluation uses formal definitions of interpretability as a proxy for qualifying explanations, and no human research is involved. Application-grounded evaluation is done with expert users reviewing the model and explanations in real tasks. Analytics tools like ActiVis [3], with its participatory design procedure and case studies, show satisfactory results from expert users in the machine learning field. Krause et al. [4] also proposed a visual analytics tool for the medical domain to debug binary classifiers with instance-level explanations. They worked closely with the medical team and hospital management to optimize processing times in the emergency department of the hospital.

In contrast, human-grounded evaluations are generally performed with non-expert users and simplified tasks. To date, there are few research studies involving human subjects to assess XAI. Ribeiro et al. [11] presented an experiment to study whether users can identify the best classifier using its explanations. In their study, participants reviewed explanations generated for two image classifiers. They also performed a small study where the researchers intentionally trained a classifier incorrectly with biased data to study whether participants could identify the connection between the incorrect features and the resulting erroneous classifications. Also studying interpretability for people, Lakkaraju et al. [5] conducted research with interpretable decision sets, which are groups of independent if-then rules. They evaluated interpretability through a user study where participants looked at the decision-set rules and answered a set of questions to measure their understanding of the model. The authors reported that both accuracy and the average time spent in understanding the decisions improved with their interpretable decision sets compared to a baseline with Bayesian decision lists.
Human Evaluation Approaches

Figure 2: Examples of users' feedforward explanations for image and text data. (a) Heat map views from 10 users drawing a contour around the area which explains the object in the image. (b) Heat map view from two expert reviewers highlighting words related to the "electronic" topic.
We discuss two main classes of approaches for human evaluation of interpretability, with the difference depending on whether users have prior knowledge of or access to sample explanations. In one approach, users review existing explanations and provide specific feedback on those explanations. The other option is to capture users' thoughts and opinions about the most relevant features based on the input and output, without review of example explanations. The explanations could be in any form, such as verbal or local explanations, and on any data, such as image, text, or tabular data. The following subsections provide further description of each of these types of human-grounded evaluation of local explanations for machine learning.
Evaluating with Explanation Review and Feedback
For the purposes of evaluating existing known explanations, it is possible to collect user feedback about the quality of the explanation given the original input and the resulting output. For example, users could review several options and choose the best machine-generated explanation for a straightforward comparison. User decisions are made with knowledge about the input, the explanations, and the output label. We would expect users to generally pick explanations that most closely match their logic and background knowledge. One advantage of this method is the ability to make a clear comparison of multiple interpretable machine learning algorithms.

Another means of capturing user feedback would be letting a user interactively refine machine-generated explanations. This method has more flexibility in allowing rejection of wrong features and adding new features to the explanations. Quantifying the difference between an initial given explanation and a user-edited explanation could give a clear measure of quality for the initial machine-generated explanation (a sketch of one such measure follows below). The disadvantage of this method is that human review is always a comparison relative to an existing explanation, which means (1) some form of explanation must already exist, (2) the evaluation is specific to the particular explanations reviewed, and (3) reviewing the existing explanation might bias a user's perception.
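As an illustration of how the difference between an initial explanation and a user-edited explanation might be quantified, the sketch below is our own example (not a method from this paper): it compares two binary feature masks with intersection-over-union and counts which features the user added or rejected.

```python
import numpy as np

def edit_summary(machine_mask: np.ndarray, edited_mask: np.ndarray) -> dict:
    """Compare an initial explanation mask with its user-edited version.

    Both masks are arrays over the same features (pixels or words);
    nonzero means the feature is part of the explanation.
    """
    machine = machine_mask.astype(bool)
    edited = edited_mask.astype(bool)

    intersection = np.logical_and(machine, edited).sum()
    union = np.logical_or(machine, edited).sum()

    return {
        # Overlap between original and edited explanations (1.0 = unchanged).
        "iou": float(intersection) / union if union else 1.0,
        # Features the user added that the machine missed.
        "added": int(np.logical_and(edited, ~machine).sum()),
        # Features the user rejected from the machine explanation.
        "removed": int(np.logical_and(machine, ~edited).sum()),
    }

# Example: a machine explanation over 6 features and a user's edit of it.
machine = np.array([1, 1, 0, 0, 1, 0])
edited = np.array([1, 1, 1, 0, 0, 0])
print(edit_summary(machine, edited))  # {'iou': 0.5, 'added': 1, 'removed': 1}
```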
Evaluating with Input Review and Feedforward
Another option for human evaluation is to collect feedback about the features that would best contribute toward an explanation for a given output, where users provide such information without seeing example explanations. For example, explanations could be obtained by presenting the user with the input and output label and then asking them to find the relevant features corresponding to the label. For instance, if the data is a text article about a "computer science" topic, the user would find and annotate words and phrases related to the topic. User choices are made with knowledge about the input along with the output label. Increasing the number of users results in capturing a wide spectrum of user explanations for each input. In this method, explanations are weighted features aggregated from multiple user opinions. For example, Figure 2 shows examples of text and image heatmaps generated by this approach for our benchmark. This can be thought of as a feedforward approach, as the information from reviewers is independent of any particular explanation. Consequently, this approach can result in a reusable benchmark that can apply to various explanations.
Evaluation Benchmark

Because our goal was a benchmark that could be used for evaluation of known inputs and classifications, we captured explanations in a feedforward approach where the users were asked to annotate the relevant regions in images and the words in text articles that are most related to the topic or subject. The preliminary deployment of this benchmark consists of a subset of 100 sample images and text articles from the well-known ImageNet [1] and 20 Newsgroup [6] data sets. The initial version of this benchmark is available online for research purposes.

Figure 3: (a) User-generated weighted mask for an example image of a cat. We use this weighted mask to evaluate the accuracy of machine-generated explanations. (b) Machine-generated explanation by the LIME algorithm for the same image. Irrelevant red-highlighted regions in this image cause low explanation precision compared to human-generated explanations.
Annotated Image Examples
All image samples were collected from the ImageNet data set from 20 general categories (example categories include animals, plants, humans, indoor objects, and outdoor objects). Our preliminary benchmark includes 5 images per category for a total of 100 images. In a review-board-approved user study, 10 participants viewed images on a tablet and used a stylus to annotate key regions of each image. We asked them to draw a contour around the area most important to recognizing the object, or the portion without which the object could not be recognized. None of the participants were experts in any of the image categories. Each participant annotated all images in a random ordering.

All user annotations are accumulated to create a weighted explanation mask (see Figure 3a) over the image, as sketched below. Figure 2a shows heat map views of user-annotated explanations over two sample images, where "hot" colors (red) show more commonly highlighted regions, and "cooler" colors (blue) show areas that were highlighted less frequently. We also masked all user annotations with exact contour shapes to reduce the impact of user imprecision or hand jitter.
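One way the weighted explanation mask could be accumulated is sketched below. This is a minimal illustration under the assumption that each participant's contour has already been rasterized into a binary mask; the function and variable names are placeholders, not code from the benchmark.

```python
import numpy as np

def weighted_explanation_mask(user_masks: list) -> np.ndarray:
    """Accumulate binary annotation masks (H x W arrays of 0/1) from multiple
    users into a single weighted mask with values in [0, 1].

    A pixel's weight is the fraction of users whose contour included it,
    so commonly highlighted regions receive higher weights.
    """
    stacked = np.stack([np.asarray(m, dtype=float) for m in user_masks], axis=0)
    return stacked.mean(axis=0)

# Example with three hypothetical 4x4 annotations of the same image.
rng = np.random.default_rng(0)
masks = [(rng.random((4, 4)) > 0.5).astype(int) for _ in range(3)]
weights = weighted_explanation_mask(masks)  # each entry is 0, 1/3, 2/3, or 1
```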
Annotated Text Examples

All text articles were collected from two categories (medical, or sci.med, and electronic, or sci.elect) of the 20 Newsgroup data set. For each category, expert reviewers highlighted the most important words relevant to the given topic name (i.e., medical or electronic). Reviewers were instructed to highlight words which, if removed, would make the main topic of the article unrecognizable. Two electrical engineers and two physicians volunteered as experts to annotate 100 documents from each topic. Figure 2(b) shows a single-tone heat map view of user-annotated explanations over a partial sample text article.
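Analogous to the image masks, word-level weights for a text article could be accumulated by counting how many expert reviewers highlighted each word. The sketch below is our own illustrative example of that aggregation; the tokens and reviewer selections are made-up placeholders.

```python
from collections import Counter

def word_weights(article_tokens: list, reviewer_highlights: list) -> dict:
    """Weight each token by the fraction of reviewers who highlighted it."""
    counts = Counter()
    for highlights in reviewer_highlights:
        # Count each highlighted token that appears in the article once per reviewer.
        counts.update(set(article_tokens) & set(highlights))
    n = len(reviewer_highlights)
    return {token: counts[token] / n for token in article_tokens}

# Example: two hypothetical reviewers annotating a short snippet.
tokens = ["the", "circuit", "uses", "a", "voltage", "regulator"]
highlights = [{"circuit", "voltage", "regulator"}, {"circuit", "regulator"}]
print(word_weights(tokens, highlights))
# {'the': 0.0, 'circuit': 1.0, 'uses': 0.0, 'a': 0.0, 'voltage': 0.5, 'regulator': 1.0}
```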
Use Case
To demonstrate the utility of our benchmark, we present a use case in evaluating local explanations from the well-known LIME explainer [11]. Similar to the previously presented research on LIME [11], we used Google's pre-trained Inception v3 model [14] for image classification.

Next, we compared the machine-generated explanations with our evaluation benchmark. The comparison is done pixel-wise for each image sample: we compared our weighted masks (see Figure 3a) to the LIME results for all 100 images in our benchmark set. We calculated true positive, false positive, and false negative pixels with bit-wise operations, and precision and recall for the set were calculated as 0.39 and 0.58, respectively (the sketch after this paragraph outlines the computation). The low precision is indicative of extraneous, irrelevant regions of the images being highlighted in explanations by the LIME algorithm. Figure 3b shows an example of an image explanation from the LIME algorithm where two of the red-highlighted patches show regions that do not correspond to the cat in the image. Using this evaluation method, we would hope to see algorithms produce local explanations with closer alignment to user annotations.
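The sketch below outlines the pixel-wise comparison described above. It is an illustration under the assumption that the benchmark mask is a weighted array in [0, 1] and the LIME explanation is a binary mask; the binarization threshold and the array names are our own placeholders, not values from the paper.

```python
import numpy as np

def precision_recall(benchmark_masks, explanation_masks, weight_threshold=0.5):
    """Pixel-wise precision/recall of machine explanations against the benchmark.

    benchmark_masks: weighted user masks in [0, 1] (one H x W array per image).
    explanation_masks: binary masks of the pixels the explainer highlighted.
    """
    tp = fp = fn = 0
    for user_mask, machine_mask in zip(benchmark_masks, explanation_masks):
        # Binarize the weighted human mask at the chosen threshold.
        human = np.asarray(user_mask) >= weight_threshold
        machine = np.asarray(machine_mask).astype(bool)

        tp += np.logical_and(machine, human).sum()
        fp += np.logical_and(machine, ~human).sum()
        fn += np.logical_and(~machine, human).sum()

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall
```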
Acknowledgements
This research is based on work supported by the DARPA XAI program under Grant
References

[1] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. ImageNet: A large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 248–255.

[2] Finale Doshi-Velez and Been Kim. 2017. Towards a rigorous science of interpretable machine learning. (2017).

[3] Minsuk Kahng, Pierre Y. Andrews, Aditya Kalro, and Duen Horng Polo Chau. 2018. ActiVis: Visual Exploration of Industry-Scale Deep Neural Network Models. IEEE Transactions on Visualization and Computer Graphics 24, 1 (2018), 88–97.

[4] Josua Krause, Aritra Dasgupta, Jordan Swartz, Yindalon Aphinyanaphongs, and Enrico Bertini. 2017. A Workflow for Visual Diagnostics of Binary Classifiers using Instance-Level Explanations. arXiv preprint arXiv:1705.01968 (2017).

[5] Himabindu Lakkaraju, Stephen H. Bach, and Jure Leskovec. 2016. Interpretable decision sets: A joint framework for description and prediction. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 1675–1684.

[6] Ken Lang. 1995. Newsweeder: Learning to filter netnews. In Proceedings of the Twelfth International Conference on Machine Learning. 331–339.

[7] Zachary C. Lipton. 2016. The mythos of model interpretability. arXiv preprint arXiv:1606.03490 (2016).

[8] Yin Lou, Rich Caruana, and Johannes Gehrke. 2012. Intelligible models for classification and regression. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 150–158.

[9] Laurens van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research 9, Nov (2008), 2579–2605.

[10] Donghao Ren, Saleema Amershi, Bongshin Lee, Jina Suh, and Jason D. Williams. 2017. Squares: Supporting interactive performance analysis for multiclass classifiers. IEEE Transactions on Visualization and Computer Graphics 23, 1 (2017), 61–70.

[11] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2016. "Why should I trust you?": Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 1135–1144.

[12] Andrew Slavin Ross, Michael C. Hughes, and Finale Doshi-Velez. 2017. Right for the Right Reasons: Training Differentiable Models by Constraining their Explanations. arXiv preprint arXiv:1703.03717 (2017).

[13] Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. 2013. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034 (2013).

[14] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. 2016. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2818–2826.

[15] Fulton Wang and Cynthia Rudin. 2015. Falling rule lists. In Proceedings of the 18th International Conference on Artificial Intelligence and Statistics (AISTATS).