Measure Utility, Gain Trust: Practical Advice for XAI Researchers

Brittany Davis, School of Engineering and Applied Sciences, Washington State University
Maria Glenski, Data Sciences and Analytics Group, Pacific Northwest National Laboratory
William Sealy, Cognitive Engineering Center, Georgia Institute of Technology
Dustin Arendt, Visual Analytics Group, Pacific Northwest National Laboratory

Abstract
Research into the explanation of machine learning models, i.e., explainable AI (XAI), has seen a commensurate exponential growth alongside deep artificial neural networks throughout the past decade. For historical reasons, explanation and trust have been intertwined. However, the focus on trust is too narrow, and has led the research community astray from tried and true empirical methods that produced more defensible scientific knowledge about people and explanations. To address this, we contribute a practical path forward for researchers in the XAI field. We recommend researchers focus on the utility of machine learning explanations instead of trust. We outline five broad use cases where explanations are useful and, for each, we describe pseudo-experiments that rely on objective empirical measurements and falsifiable hypotheses. We believe that this experimental rigor is necessary to contribute to scientific knowledge in the field of XAI.
Index Terms: Human-centered computing—Human computer interaction (HCI)—HCI design and evaluation methods—User studies; Computing methodologies—Machine learning—Machine learning approaches—Neural networks
1 Position: Measure Utility, Gain Trust
Many AI, HCI, and Visualization researchers may have assumed the purpose of a machine learning explanation is to enhance, increase, or calibrate users' trust in the model [11]. In contrast with this viewpoint and the large body of existing research, we argue that trust is an insufficient metric for evaluating explanations (or models). We recommend that researchers objectively measure the utility of the explanation instead of subjectively measuring trust. Trust should manifest through experience — as users use a system containing a model and an explanation, their trust in the system will grow if that system is reliable and provides a benefit. We also suspect that designing explanations to optimize trust may short-circuit the natural process of trust-building and also mislead users. This could occur if the design of the explanation obfuscates the actual utility or capability of the system. Optimizing for trust could be considered a form of "metric hacking", leading to artificially inflated levels of trust compared to what would build naturally through experience.

If users' increased trust in the model is not necessarily the most appropriate way to measure the "goodness" of an explanation, then what is? Do explanations have an intrinsic value? We believe the answer to the latter question is simply no. Unlike machine learning models, which are built directly upon ground truth data, a ground truth "correct" explanation of a machine learning model is not generally available. Thus, we are not likely to find a way to measure the error between a given explanation and the correct one, except in very specific use cases, e.g., image captioning. Much like a data visualization, the purpose of an explanation is to communicate useful information to a human — the visualization community has no methods for directly measuring the correctness of a visualization. We do, however, have many methods to measure the utility of a visualization, which all require considering what tasks users perform with that visualization. In fact, many explanations are visualizations [23], so the full range of techniques the visualization community has employed for evaluating visualizations remains relevant for evaluating machine learning explanations.

There are many use cases where explanations are helpful as part of a larger workflow. Hohman et al. [23] and Mohseni et al. [32] both describe in depth the various users and uses of explanations. For the purposes of our argument in this paper, we focus on the following use cases: debugging, validating, and selecting a model; understanding a model; teaming with a model; and challenging a model. Each of these use cases allows us to imagine "downstream tasks" where we can quantitatively measure the users' performance with or without the explanation. Resulting differences in performance provide indirect but compelling evidence of the utility of the explanation. There are many potential ways of measuring "explanation goodness", which interested readers can find surveyed by Mohseni et al. [32]. Hoffman et al. [22] also discuss several methods for evaluating explanations, including those that go beyond trust.

We have taken the position that trust is a flawed metric for measuring the "goodness" of a machine learning explanation. The next section of this paper will provide historical context for this argument. We found that trust is a metric advocated by people who want the models they build to be adopted (thus, a good explanation is one that builds trust and increases the likelihood of adoption). The final section of this paper provides practical guidance for how to measure the utility of an explanation across the use cases mentioned above.

2 The Past: Historical Reasons for the Coupling of Trust and Explanations
While the success of deep neural networks in various machine learning tasks is a recent phenomenon, machine learning as a general practice has existed for much longer. Almost since the beginning, developers of machine learning algorithms have sought to increase the acceptance and uptake of those algorithms. To do so, researchers had to prove to others what they felt they knew to be true — that these algorithms could be trusted to work correctly and had benefit. Thus, explanations of classical machine learning were initially developed for the purpose of increasing users' trust. As the field changed over time, the coupling of explanations to trust-building became a habit more than best practice. Today, explanations can and should be used more widely beyond increasing trust — measuring their effectiveness should account for this broader applicability.

2.1 "Classical" Machine Learning

In 2000, AlexNet had not yet won an image classification competition and the field of artificial intelligence (AI) was focused on things like "symbolic systems", which were built on pieces of modular knowledge, gleaned from experts or large databases and stored into a searchable structure such as a decision tree. Herlocker et al. [21] wrote that "current recommender systems are black boxes, providing no transparency into the working of the recommendation. Explanations provide that transparency, exposing the reasoning and data behind a recommendation" and presented an experiment that measured which explanations increased consumers' acceptance of recommendations generated by an intelligent agent and which were "negatively contributing to the acceptance of the recommendation." These systems may appear to be mysterious to end users, but they were just data structures to the researchers who built and deployed them. Researchers knew how to construct and debug these structures, because it was the same way that they constructed and debugged code.

However, these researchers needed a way to get the skeptical public, including corporations, to accept these expert systems. As Bilgic and Mooney [7] put it in 2005, "in order for users to benefit, they must trust the system's recommendations and accept them. A system's ability to explain its recommendations in a way that makes its reasoning more transparent can contribute significantly to users' acceptance of its suggestions." This was the apparent goal of trust with pre-neural-network AI: to help purchasers and users understand the limits of the system so that an inevitable wrong answer would not scare them away. By 2006, Pu and Chen [37] referred to this pursuit as "investigating design issues for trust-inducing interfaces," and in 2009, Haynes et al. [20] observed that "[a]s intelligent agents become more pervasive in our day-to-day computing environment, and as their role becomes more consequential with respect to human purposes, they will be increasingly called upon to communicate in a way that engenders trust."
In 2012, AlexNet achieved a top-5 error rate of 15.3% in the ImageNet Large Scale Visual Recognition Challenge [28], and by 2013 researchers were trying to peer inside the convolutional neural networks which made up AlexNet. Why? Because, as Zeiler and Fergus [51] noted less than a year later, "there is no clear understanding of why they perform so well, or how they might be improved." Or, as Holzinger et al. [25] put it, "Technically, the problem of explainability is as old as AI itself and classic AI represented comprehensible retraceable approaches. However, their weakness was in dealing with uncertainties of the real world. Through the introduction of probabilistic learning, applications became increasingly successful but increasingly opaque." Researchers were suddenly in the same position as those suspicious consumers: they weren't sure why or how neural networks worked, or what tweaks to their construction might improve or deteriorate performance. In response, researchers began trying to find ways to prove to themselves that these new "expert systems" were trustworthy.

From figuring out which layers were detecting edges versus texture [43], to generating generalized images of what specific neuron clusters were capturing [33], or understanding what changes could produce wrong answers in adversarial attacks, researchers in AI had to go back to basics when neural networks exploded onto the scene. By 2016, Zeiler and Fergus's method of pushing convolutions backwards in a network had inspired the development of Layer-wise Relevance Propagation [4]. That same year, Ribeiro et al. [38] published their methodology, called LIME, for generating local or global explanations for image classifiers. In 2017, a methodology called Grad-CAM was developed, which used gradients to highlight parts of the input to an image classifier which contributed most to the outcome [42]. All of these methods were used by the research community to try and better understand the inner workings of deep neural networks.

Most work on end-user-facing trust in this period was still analyzing trust in symbolic systems and other more established machine learning techniques. These experiments still largely aimed to increase users' trust in the systems, by understanding what modulated user trust up or down. Researchers tested explanations of simplified models of bagged decision trees, and found that "when soundness was very low, participants experienced more mental demand and lost trust in the explanations" [29]. Another experiment tested explanations of a system which added a probabilistic Markov Decision Process to a task planning application based on a finite state machine, and found "that transparency explanations can help to reduce the negative effects of trust loss" [34]. Some experiments used the "Wizard of Oz" approach to test user trust in automated systems, where "the behavior of the software is controlled by the researcher unbeknown to the participant" [8]. Another experiment conducted in 2016 used an Auto-Encoder developed in 2010 to find "that perceived system ability was more important in determining trust amongst with-explanation participants and perceived transparency was a greater influence on the trust of participants who did not receive explanations" [24]. Berkovsky et al. [6] ran an experiment "to investigate the dependencies between various aspects of recommendation interfaces and user-system trust."
2.2 Appropriate Trust
In August of 2016, DARPA announced its Explainable Artificial Intelligence program. The announcement brought into common use among researchers the phrase "appropriate trust" and the idea of explainable AI among the general population [19]. Around the same time, the European Union (EU) had announced that the General Data Protection Regulation (GDPR) would take effect in 2018 [48]. This regulation specified that companies could no longer use systems to make certain decisions about consumers who lived in the EU, unless the company could explain the decision. These two events seem to have spurred researchers to tentatively turn their efforts at explaining neural networks outside of the research community, engaging end users and sometimes everyday people.

Researchers began to comb over the progress made on peering inside of neural networks, trying to find ways to use these tools to increase appropriate trust among people outside the AI community. For example, Schaefer et al. [41] found that "by understanding the transparency elements that increase effective bi-directional communication [in human-computer teams], we can engender appropriate trust and reliance in the system."

In addition, much debate ensued over defining, justifying, and measuring explainability and trust in this new context. Jiang et al. [27] presented the concept of a trust score, to provide "information about the relative positions of the data points, which may be lost in common approaches such as the model confidence when the model is trained using stochastic gradient descent." A plethora of taxonomies for trust measurements and explanations appeared. Adadi and Berrada [1] claimed explanations are needed for four reasons: to justify, to control, to improve, and to discover. Pallotta et al. [35] argue that "explanations need to be carefully crafted to fit with their desired aim," and described a methodology which would increase user trust enough to prevent users from interfering with home heating systems. Holzinger et al. [26] argued that increasing trust in deep learning systems necessarily included mechanisms for users to change the system's outcomes.

Out of this debate emerged a growing consensus that the emphasis should be on the word 'appropriate' in the phrase 'appropriate trust'. Yin et al. [50] measured users' trust in a model and found that both the actual capabilities of the model and the specific instances seen by the user influence a user's trust in the model. That is to say, if a model is not accurate, and this is evident to users, they don't trust it, and that's a good thing. However, another experiment found that users can rely too heavily on a poor model, reporting that "in 67.3% of all cases, participants predicted that the system would be correct, whereas it was only correct in 42.9% of the cases" [2].

Neural networks and deep learning shifted the paradigm in such a way that engendering trust, even appropriate trust, is no longer sufficient. The idea of trust is already ambiguous, and applying it to deep neural networks, when even those who build them do not understand the intricacies of the model logic or determinations, creates too many layers of abstraction to produce meaningful science. Instead, researchers should re-evaluate the revelations of 2013, and be humbled by the fact that we still do not know how to debug or improve these networks with much certainty. To that end, creating falsifiable and provable hypotheses should replace the concept of increasing or calibrating trust. Although appropriate trust is a more objective measurement of trust, we ought to be measuring these systems with more relevant metrics that can be clearly measured, tested, and replicated.
3 Possible Future: Downstream Tasks and Falsifiable Hypotheses
As previously argued, the utility of an explanation must be tied to its purpose — why it was created and the context in which it is intended to be used. Trust should not be a metric we maximize through design, but should be a benefit gained after a user interacts with a useful system over time. Thus we believe that studies evaluating explanations should not measure trust unless they are longitudinal, "in the wild", and consider the entire system. Instead, we argue that research studies that seek to measure the benefit of novel explanations should focus on utility over trust. We have identified five broad use cases where researchers could design experiments to measure the utility of explanations.

1. Model Debugging and Validation: Is the model working as designed? Why is the model making mistakes?
2. Model Selection: Potentially going beyond simple performance metrics like F-score, which model is best?
3. Mental Model and Model Understanding: How does the model function or behave? Can I learn something interesting from the model?
4. Human Machine Teaming: Can I do a task with the model better than on my own (and better than the model on its own)?
5. Model Feedback, Challenging, and Prescription: When I am affected by a model's decision, how do I challenge that decision or correct the model when it's wrong about me? What should I change about me to get a better outcome in the future?

The utility of a model and associated explanations can be measured from several viewpoints. We focus on three: the model developer, the end user of the model, and "imposed users" (individuals or groups who are affected by the model's decisions, outcomes, or recommendations). The model developer may consider each of the first three contexts — Model Debugging and Validation; Model Selection; and Mental Model and Model Understanding — as they iterate in development. In contrast, end users will typically participate in Model Selection, Mental Model and Model Understanding, or Human Machine Teaming. Model Feedback, Challenging, and Prescription is of greatest interest to imposed users. They may also be interested in the Mental Model and Model Understanding use case. Figure 1 illustrates the overlap of relevant use cases across the three types of users. Our taxonomy is similar to that of Mohseni et al. [32], although we distinguish imposed users from end users and do not distinguish end users from "data experts".

In addition to considering the above users and use cases, i.e., the context in which an explanation is used, we also need to design controlled experiments with falsifiable hypotheses.
Figure 1: Venn diagram illustrating the overlap in interest across use cases for differing viewpoints – from the developer who builds the model, the end user who selects and applies the model, or the imposed user who is affected by the decisions or recommendations the model provides.

We believe the "gold standard" XAI evaluation experiment should be one where all participants perform the same task, but a randomly assigned group of participants performs the task without the assistance of the explanation being evaluated. By comparing the performance of groups with and without the explanation, we can make claims about the benefit of that explanation for the task. The hypothesis (that the explanation is useful) is falsified when there is no significant difference between these groups.

Not all experiments in this research domain should follow this rigid experimental design, but we suggest following some basic guidelines. Researchers should first consider the three key components of their system: the human, the machine, and the explanation. Next, researchers should determine what can be measured (or is meaningful to measure) using different combinations of these components. Below are examples of four combinations of these components and how they provide us different information:

• Human only: baseline performance of the human at the task (the fully manual scenario)
• Machine only: baseline performance of the machine at the task (the fully automated scenario)
• Human + Machine: baseline performance of the system when the user can rely on the machine learning output
• Human + Machine + Explanation: the performance of the system when the explanation and machine learning output are available to the user

Thus, one should compare the performance of the Human + Machine + Explanation group against the Human + Machine group. The Human only and Machine only conditions provide additional context for this comparison. For example, there may be a significant benefit of the explanation, but perhaps the Machine only performance is greatest, indicating that the system should be fully automated.
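To make the falsifiable hypothesis concrete, the between-group comparison described above can be reduced to a standard significance test. The sketch below is illustrative only: the per-participant accuracy values are hypothetical placeholders, and the Mann-Whitney U test is just one reasonable (non-parametric) choice among several.

```python
# Minimal sketch of the "gold standard" between-subjects comparison:
# each participant completes the same downstream task, and we test
# whether the group given explanations outperforms the group without.
from scipy.stats import mannwhitneyu

# Per-participant task accuracy (fraction of correct responses); the
# values are hypothetical placeholders for the two conditions.
human_machine = [0.55, 0.60, 0.52, 0.58, 0.63, 0.57, 0.61, 0.54]
human_machine_explanation = [0.66, 0.71, 0.59, 0.74, 0.68, 0.70, 0.63, 0.72]

# One-sided test: does the explanation group perform better?
stat, p_value = mannwhitneyu(
    human_machine_explanation, human_machine, alternative="greater"
)

# The hypothesis "the explanation is useful" is not supported when we
# cannot reject the null at the chosen significance level.
alpha = 0.05
if p_value < alpha:
    print(f"Explanation helped (U={stat:.1f}, p={p_value:.3f})")
else:
    print(f"No significant benefit detected (U={stat:.1f}, p={p_value:.3f})")
```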
Of course, researchers should use their best judgement about what combinations are meaningful or practical for their specific applications. For example, Yang et al. [49] designed an experiment in line with this framework. However, the researchers decided not to measure the Human only condition because there was no reasonable expectation that unaided participants had the expertise to perform the task (identifying the species of a tree given a picture of a single leaf). The researchers decided to measure Human + Machine performance as a baseline instead, which they found closely aligned with Machine only performance, indicative of overtrust. In that study, the Human + Machine + Explanation performance was greater than both of the baselines, providing strong evidence of the benefit of the explanation.

The remainder of this section is organized around the use cases previously discussed and hypothetical downstream tasks for evaluation purposes. For each downstream task we describe a pseudo-experiment that is intended to provide inspiration for researchers who wish to conduct the scientifically rigorous research we have argued for in our position statement.
Model Debugging and Validation is a developer-focused use case that leverages explanations as a means of improving the model. Here, explanations are used to provide insight into the mechanisms of complex "black box" models, e.g., neural networks or other deep learning models [9], to identify flaws or biases in the algorithm or the training data that can be addressed in development. For example, a developer may use explanations of model decisions of varying confidence to probe whether the model relies strongly on non-actionable or biased features and to identify constraints that are necessary to implement within the model (as such models should not be used in most, if not all, settings [46]).

Downstream Task: Given an incorrect model decision and corresponding explanation, determine the reason the model made a mistake.
A set of model mistakes is coded by the research team in order to establish a ground truth. In the case of a classification task, these coded mistakes may be an annotation of the image qualities that misled the model, e.g., occlusion of the subject, pixellation, and artifacts in the image, as well as qualities of the model behavior, e.g., misplaced model attention or a lack of training examples. Inter-rater reliability is established on the coded mistakes, and the images are presented to users to determine if explanation methods are useful in helping users determine why a model made an error. Users' ability to correctly describe the reason for the model's mistakes is measured with and without the explanation.
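As an illustration of the ground-truth coding step, the sketch below computes Cohen's kappa between two hypothetical raters; the category labels and codes are placeholders, and kappa is only one of several inter-rater reliability statistics a research team might choose.

```python
# Hypothetical sketch: checking agreement between two researchers who
# independently coded the reason behind each of ten model mistakes.
from sklearn.metrics import cohen_kappa_score

rater_a = ["occlusion", "pixellation", "occlusion", "misplaced_attention",
           "occlusion", "artifact", "pixellation", "occlusion",
           "misplaced_attention", "artifact"]
rater_b = ["occlusion", "pixellation", "artifact", "misplaced_attention",
           "occlusion", "artifact", "pixellation", "occlusion",
           "occlusion", "artifact"]

# Agreement beyond chance on the mistake codes.
kappa = cohen_kappa_score(rater_a, rater_b)
print(f"Cohen's kappa = {kappa:.2f}")
```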
Downstream Task: Identify if the model will improve from additional training.

Given three datasets, a machine learning model is trained on one, and the user is presented with data in all three sets and the performance of the model, e.g., F-score on validation (from train) and the two test sets. The user is asked whether the performance on the test set would significantly improve if the model could include the second test set in its training. Both train and test examples are available to the user, and in the Human + Machine + Explanation condition, the user can view explanations of model decisions on the original train and test data. The ground truth is measurable because the difference in model performance on the smaller and larger training sets is available. A variant of this experiment would ask the user to estimate the difference in performance. A challenge of this experimental design is the creation of datasets with suitable differences in performance.
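The sketch below illustrates one way the ground truth for this task could be pre-computed, assuming a scikit-learn workflow; the dataset, model class, and split sizes are stand-ins chosen only to show the comparison of F-scores with and without the additional training data.

```python
# Hypothetical sketch: establish the ground truth by measuring how much the
# held-out F-score changes when the "second" set is added to training.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Placeholder data standing in for the three datasets in the task.
X, y = load_digits(return_X_y=True)
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, train_size=0.2, random_state=0
)
# Split the remainder into candidate extra training data and a held-out test set.
X_extra, X_test, y_extra, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=0
)

def test_f1(X_tr, y_tr):
    """Macro F-score on the held-out test set for a model trained on (X_tr, y_tr)."""
    model = LogisticRegression(max_iter=2000).fit(X_tr, y_tr)
    return f1_score(y_test, model.predict(X_test), average="macro")

f1_small = test_f1(X_train, y_train)
f1_large = test_f1(np.concatenate([X_train, X_extra]),
                   np.concatenate([y_train, y_extra]))
print(f"F1 with the original training data: {f1_small:.3f}")
print(f"F1 after adding the second set:     {f1_large:.3f}")
```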
Downstream Task: Given a model that exploits artifacts or loopholes of the data, describe the model's behavior.

Model "intelligence" has been reasonably challenged in recent years by the discovery of 'Clever Hans' behaviors. These behaviors reveal a model's reliance on features of the data that humans would consider unintuitive (such as source tags in images) and that are threats to generalizability [30]. In this task, users explore the model explanation to describe how decisions are being made about classes within the data. Performance is measured by determining whether users are able to discover undesirable behaviors in the model's decision-making, such as identification of trains by spotting rails, boats just by identifying water, or wolves by focusing on snow [38]. A control condition, i.e., no explanation, is possible, but would require showing the user many correct and incorrect classifications to give the user an opportunity to understand the model based solely on behavior.
Our second use case, Model Selection, is typically the focus of trust and explanation analyses when considered together because it seeks to answer the intuitive question of "do I trust this model enough to use it?" or "do I trust this model more than another, and thus should use it instead?" Although we argue this is a narrow application of trust and explanations together, it is a common (and important) use case to consider. We note that the model selection use case is more complicated (and potentially problematic) when a second predictive or generative model is required to generate the explanations rather than using artifacts of, or features extracted directly from, the model under assessment (e.g., captions generated from image inputs and model decisions to explain model decisions).

Downstream Task: Determine which model is better suited for a given task.

In this task, users consider two models and either the decisions alone or the decisions alongside their accompanying explanations. Users' performance identifying which model performs better on an unseen test set across the two groups can be used to quantify the quality of the accompanying explanations — e.g., by measuring whether using the explanations to identify whether the model decision was right (or wrong) for the right (or wrong) reasons enabled users to better distinguish which model is best suited for the task.

As a variant, users may be asked to determine which of the models would best extend to a specific out-of-domain task, e.g., classifying foods after seeing classification examples of animals, or classifying posts on Twitter after seeing classification examples using Reddit data. These tasks can intuitively be extended to an experiment ranking multiple models, all of which can be presented and paired with or without an explanation method and outcomes.
The Mental Model and Model Understanding use case relates to whether a user builds an accurate mental model, that is, a mental model which functionally mirrors the overall behavior and decision making of the machine learning model. Visualizations are a popular type of explanation for building and eliciting mental models. Mental models are important for developers to understand their own models. They are also critical for end users who use machine learning to gain insight about a new domain or to complete a task. Accurate mental models can also be helpful for imposed users who are affected by model decisions, for example, if they are denied services because of the model's classification of their profile or history. These users may want to understand how the model makes decisions in order to set expectations of how the model will impact the user.

This use case differs from Model Selection or Model Debugging and Validation in that the focus shifts from explaining specific decisions to comprehending the relationships between the model input and output, or understanding the inner workings of a model's decision-making. This use case is challenging because there are no ways to directly observe a user's mental model, and appropriate metrics for measuring understanding are still widely debated [45].
Downstream Task: Given examples of past behaviors, extrapolate what a model will do given unseen inputs.
In this task, users are given a series of inputs and the corresponding model outputs. Then, users are presented with a series of previously unseen inputs. A randomly assigned subset, representing the Human + Machine + Explanation condition, is given the corresponding explanations. The control group, representing the Human + Machine condition, is not provided the explanations. Users are then asked to either select from a list, or describe in their own words, their expectation of the model output. More accurate user predictions of the model output for the unseen inputs provide evidence of the quality of the mental model built with the support of the explanation.

Some variations of this task include using different types of input for the user to base their extrapolations on. For instance, the set of inputs may consist of all new inputs, none ever seen in the initial series of inputs with associated outputs, or a mix of inputs seen before and new inputs. Ribeiro et al. [39] used all new inputs when testing a tool which presents a visual summary of why the model made a specific classification. In their user study, users were asked to predict a model's output first without an explanation, then presented with a set of decisions with explanations, and finally asked to perform one more round of predicting the model's outputs.

Another variation could be in the temporal dimension of extrapolation. For a model such as an image classifier, any input and output can happen in any order. However, with a model such as a reinforcement learning agent, the task could consist of predicting the agent's immediate next move, or any number of time steps in the future, as explored by Anderson et al. [3]. This type of variation would explore if a user's mental model is accurate enough to predict the model's future choices, and how far into the future that mental model accurately extends.
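One simple way to score a participant's extrapolations in this task is the fraction of unseen inputs for which they anticipate the model's actual output. The sketch below is a hypothetical illustration with placeholder labels; the resulting per-participant scores would then be compared across the with- and without-explanation groups as described earlier.

```python
# Hypothetical sketch: scoring how well a participant's extrapolations match
# the model's actual outputs on unseen inputs (all labels are placeholders).
def mental_model_score(user_predictions, model_outputs):
    """Fraction of unseen inputs where the user predicted the model's output."""
    assert len(user_predictions) == len(model_outputs)
    hits = sum(u == m for u, m in zip(user_predictions, model_outputs))
    return hits / len(model_outputs)

# Example: the model labels five unseen images; the participant anticipates
# four of the five outputs correctly.
model_outputs    = ["cat", "dog", "dog", "cat", "bird"]
user_predictions = ["cat", "dog", "cat", "cat", "bird"]
print(mental_model_score(user_predictions, model_outputs))  # 0.8
```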
Downstream Task: Given information about a model's past performance, match the model to a novel output.

When the context of decisions belonging only to a specified model or models is explicitly set, as in the above described task, users may overestimate how well they understand a model. In this task, user groups have that narrowed context removed and are asked to differentiate between multiple models. User groups are given a series of input-output pairs for a set of models during a training phase. It is specified which model created each output to help users build mental models. After the training phase, users are presented with a new, previously unseen set of inputs and corresponding model outputs. Users are asked to match the models to the new decisions, or to identify if none of the previously presented models would have produced the given decision. This task is repeated including explanations for the decisions made by the models.

The Human + Machine condition would be represented by a user group receiving no additional explanation of the models' outputs during the process of building a mental model. The Human + Machine + Explanation condition would be represented by a group of users who get an explanation of each model's outputs in the mental model building phase. Any improvement in the ability to correctly match models to the new decisions would signal that explanations help users to better understand the model.

One variant of this experiment would be to test users' understanding of a single model by purposely altering the output of the model and testing if users can identify if the output is incorrect, and what part of the output is incorrect. Chang et al. [10] use this variation in a user study of a model which sorted documents into topics, which are defined by a set of words. Users were tested to see if they could detect manipulation by researchers of both the words describing a topic, and the topic assigned to a document.

Another variant of this task would be to change the training process, so that instead of using the input-output pairs, the users would be given a global explanation or description of each model. For example, users may be told that an image classifier was trained on a specific data set of birds, and has been observed to rely heavily on beak shape and size in its classifications.
Downstream Task: Given a model's past behavior, identify if explanations speed up users' creation of a mental model.

This task consists of selecting the expected model output given a previously unseen input (evaluation phase) after having studied examples of inputs and the corresponding model outputs (learning phase). Users in the Human + Machine + Explanation group are also provided an explanation corresponding to each model input-output pair. Users are informed that they will be timed and can proceed through as many sets of inputs and outputs as they want. Users can switch to the evaluation phase at any time, where real-time feedback is given on whether the user chose the right output. The user is free to move back to the learning phase to study more sets of known inputs and outputs before returning again to the evaluation phase.

The total time spent, number of cycles, or accuracy in predicting the model behavior can be compared across groups to determine the benefit of the explanation. In a variation used by Lim et al. [31], a set number of examples are presented in a learning phase and users are timed according to how long they spend in the learning phase. Once users move on to the evaluation phase, they cannot return to the learning phase, and the time taken to answer each question in the evaluation phase is also recorded. Lim et al. compared groups of users receiving explanations during the learning phase to groups getting no explanations. Alternatively, this task can be used to compare the efficacy of different explanations, which could be imperative in safety-critical environments.
Human Machine Teaming distinguishes models as teammates, beyond simple tools. Here, high-quality explanations can elevate models to act as teammates by providing more insightful and actionable recommendations. Essentially, good explanations can serve as the model's response to "explain your work" or "why?" queries and assist users in complicated tasks or alleviate the cognitive load of human teammates. As human machine teaming necessarily generates more specific use cases than the tasks enumerated in previous sections, this section utilizes more specific use cases that can, of course, be generalized to other domains.

Downstream Task: Identify whether users should accept or reject model decisions.
Given a series of inputs, users are tasked with agreeing or disagreeing with a model's output, accompanied by an explanation of interest for a subset of user groups. Performance in this case is measured based on the accuracy of the final decisions selected by the user, which is either the model decision when users agree with the decision or the user-selected decision when users disagree with the model. This task effectively calibrates the users' appropriate trust in the model. Performance of the Human + Machine + Explanation group is compared directly to the Human + Machine group. This experiment requires that the Human only performance be less than the Machine only performance to show a benefit, such as classifying tree leaves in images [49] (instead of everyday objects) or performing quality control in an assembly line scenario [5].

An extension to this task can consider reduction of the cognitive load for users as the desired outcome, and user performance can also be measured using proxies for cognitive load such as the number of correct judgements made by the user in a limited time period or the time needed to complete a specified number of correct judgements.
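A minimal sketch of how the final team decision could be scored in this task is given below; the labels and acceptance pattern are hypothetical, and the scoring rule simply follows the description above (model label when the user accepts, user label when the user rejects).

```python
# Hypothetical sketch: accuracy of the human-machine team's final decisions
# in the accept/reject task (all trial data below are placeholders).
def team_accuracy(model_labels, user_labels, user_accepts, true_labels):
    """Final decision is the model's label when accepted, else the user's label."""
    final = [m if accept else u
             for m, u, accept in zip(model_labels, user_labels, user_accepts)]
    return sum(f == t for f, t in zip(final, true_labels)) / len(true_labels)

# Example: four trials with ground-truth classes "A"/"B".
true_labels  = ["A", "B", "A", "B"]
model_labels = ["A", "A", "A", "B"]   # model is wrong on trial 2
user_labels  = ["A", "B", "B", "B"]   # user alone would be wrong on trial 3
user_accepts = [True, False, True, True]

print(team_accuracy(model_labels, user_labels, user_accepts, true_labels))  # 1.0
```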
Downstream Task: Determine whether properly abstracted explanations improve human experience and performance in an autonomous driving scenario.

One of the challenges when designing model explanations lies in understanding which end user the explanation is being designed for. For example, levels of abstraction change drastically if explanations are designed to target the software engineer responsible for autonomous navigation and collision avoidance rather than the driver sitting behind the wheel. In this task, user groups driving simulated autonomous vehicles would be provided with simplistic explanations of the car's behavior as a driving test takes place, such as warnings about poor object detection in fog or reduced traction in sharp curves. These explanations would be given when environmental dangers are encountered during the simulation. A control group would receive no explanations. Performance can then be measured based on the users' situational awareness (SAGAT, etc. [16]), attention switching from the road to the explanations, trust questionnaires, and other physiological indicators (e.g., heart rate, eye motion).

An extension to this task can present the explanations of the car's behavior in a "user manual" style before the driving test begins and measure how often the drivers accurately respond to hazardous conditions with no real-time input.
Downstream Task: Determine whether explanations increase the efficiency of a human machine team.

Healthcare has been a focus of recent work in human machine teaming, and serves as an exemplary application domain. Studies have shown the effectiveness of explanations on both trust and interpretability in ML models focused on medical diagnosis [14]. As the availability and complexity of medical technology increases, physicians may find themselves in need of machine agents who can help them narrow in on useful treatments and diagnoses. In this task, the user works with an automated physician's assistant who makes recommendations for data collection, testing, and diagnoses during a physician-patient interaction.

The model presents choices to the physician with accompanying explanations of these choices in the Human + Machine + Explanation case, such as visuals of specific patient data and its risk contribution for certain diagnoses. This can present the issue of branching, where presented choices may generate a longer path to the end goal with the potential of detours that do not offer viable paths to the solution. Two control groups exist for this task: Human + Machine groups with no generated explanations of proffered choices, and Human only groups which receive no model assistance. How long it takes user groups to arrive at the correct diagnosis, the cost of that treatment, and how many incorrect branches were explored are all viable performance metrics to establish whether the explanations of the model benefited users. As this task requires trained experts, i.e., physicians, the Human only baseline is meaningful.

A wide array of extended tasks can be generated from this initial example. Time limits can be imposed on the entire task to consider cognitive load on the users, and to examine whether or not user groups are able to finish the task at all. The use case can also be extended into the work allocation domain, wherein the model recommends actions and gives explanations (or does not) that have cascading consequences on the subsequent work tasks, creating a dynamic environment that is shaped by the human-machine team and which can end up in any number of end states with measurable utilities. Additional physiological measures, post-experiment questionnaires, and human factors metrics (e.g., situational awareness or perceived cognitive load) can then be applied to understand the usefulness of these model explanations along different dimensions.
The Model Feedback, Challenging, and Prescription use case effectively considers trust in the system, i.e., the model in the context of the impact on imposed users of its decisions and subsequent recommendations or actions taken. The need for effective explanation of model decisions for recourse is a natural response to the continued widespread application of artificial intelligence or machine learning models to supplement or automate tasks in domains where incorrect or biased recommendations can have significant human impacts. These domains include predictive policing [17], recidivism prediction [12, 15], and hate speech and abusive language identification online [13, 36, 40]. As an example of the recognized necessity of clear explanations for this use case, the European Union's GDPR directly addresses the "right of citizens to receive an explanation for algorithmic decisions" [18].
Downstream Task: Given a model's decision, identify how to get a better outcome.

In this task, a user may be given an input and a decision from a model and asked to identify what has to be changed or updated in the input to get a better decision. User groups are given either an accompanying explanation for the decision on the unaltered input along with the model output, or only the model output. This task touches on counterfactual explanations and the prescriptive use of algorithmic decision making systems. The "What-If Tool" [47] supports this type of counterfactual reasoning. However, the tool is oriented towards end users and developers — there is still a research opportunity to design explanations that support imposed users for this use case.

Given the model is trustworthy, providing the right outcome given the right data, but the user desires a different outcome, does the model explanation provide enough information for a user to identify the personal changes needed to obtain the desired outcome? Glass-Box [44] is an example of such a downstream task in practice – users, given a pre-established persona, probe a loan application system that provides contrastive, counterfactual explanations to understand and challenge the model's automated decisions.
Downstream Task: Given a model's decision, determine if the decision was based on incorrect data, biased data, or bad inference.

User groups are provided an overview of the training examples for a given model and a series of input-output pairs where the model output was incorrect. Users are then asked to identify why the model provided an incorrect response or what methods might be employed to correct the decision (e.g., more training examples for edge-case inputs, removal of misleading or biased training examples, or removal or preprocessing of flawed inputs), and whether the change required is a reasonable expectation (reduce the level of outstanding credit to receive a new loan) or an indication of bias (change your gender/race to receive a new loan). A control group (Human + Machine) will not receive model explanations, but would have access to the training dataset in order to contrast performance with the Human + Machine + Explanation group to examine the benefits of the explanations. More accurate user identification of the reasons behind incorrect model outputs provides a quantitative measure of the quality of the explanation to identify biases in the model – whether the model is fair across the affected population – and also serves as a means to identify challenge-worthy individual decisions.
4 Conclusions

In this paper we have argued that trust in a machine learning model is a benefit of a useful and reliable system that employs that model. However, trust develops slowly over time, and to rely on trust as a metric for evaluating the value of an explanation is problematic and could lead to artificially inflated levels of trust to the users' detriment. We believe trust should only be measured in a longitudinal and empirical study considering the full system.

Instead, researchers should design for and measure utility. Utility-oriented evaluation encourages researchers to consider the broader context of the explanation, i.e., how it is intended to be used. It also encourages researchers to employ scientific methodologies to evaluate explanations, leveraging falsifiable hypotheses and objectively measurable quantities as evidence. Towards this end, we have suggested many pseudo-experimental designs involving "downstream tasks" that can be used to evaluate explanations in this manner. We hope the impact of this work will be to inspire many new experiments that solidify the scientific foundation relating humans, machines, and explanations.

Acknowledgments
The authors wish to thank Matthew Taylor for helpful discussions around downstream tasks in reinforcement learning.
References

[1] A. Adadi and M. Berrada. Peeking inside the black-box: A survey on explainable artificial intelligence (XAI). IEEE Access, 6:52138–52160, 2018. doi: 10.1109/ACCESS.2018.2870052
[2] A. Alqaraawi, M. Schuessler, P. Weiß, E. Costanza, and N. Berthouze. Evaluating saliency map explanations for convolutional neural networks: A user study. In Proceedings of the 25th International Conference on Intelligent User Interfaces, pp. 275–285. ACM, Cagliari, Italy, Mar. 2020. doi: 10.1145/3377325.3377519
[3] A. Anderson, J. Dodge, A. Sadarangani, Z. Juozapaitis, E. Newman, J. Irvine, S. Chattopadhyay, A. Fern, and M. Burnett. Explaining reinforcement learning to mere mortals: An empirical study. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence (IJCAI-19), 2019.
[4] S. Bach, A. Binder, G. Montavon, F. Klauschen, K.-R. Müller, and W. Samek. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PLOS ONE, 10(7):e0130140, July 2015. doi: 10.1371/journal.pone.0130140
[5] G. Bansal, B. Nushi, E. Kamar, W. S. Lasecki, D. S. Weld, and E. Horvitz. Beyond accuracy: The role of mental models in human-AI team performance. In Proceedings of the AAAI Conference on Human Computation and Crowdsourcing, vol. 7, pp. 2–11, 2019.
[6] S. Berkovsky, R. Taib, and D. Conway. How to recommend?: User trust factors in movie recommender systems. In Proceedings of the 22nd International Conference on Intelligent User Interfaces, pp. 287–300. ACM, Limassol, Cyprus, Mar. 2017. doi: 10.1145/3025171.3025209
[7] M. Bilgic and R. Mooney. Explaining recommendations: Satisfaction vs. promotion. In Proceedings of Beyond Personalization 2005: A Workshop on the Next Stage of Recommender Systems Research at the 2005 International Conference on Intelligent User Interfaces, 2005.
[8] A. Bussone, S. Stumpf, and D. O'Sullivan. The role of explanations on trust and reliance in clinical decision support systems. In 2015 International Conference on Healthcare Informatics (ICHI), pp. 160–169. IEEE, Dallas, TX, USA, Oct. 2015. doi: 10.1109/ICHI.2015.26
[9] D. Castelvecchi. Can we open the black box of AI? Nature, 538(7623):20–23, 2016. doi: 10.1038/538020a
[10] J. Chang, J. Boyd-Graber, S. Gerrish, C. Wang, and D. Blei. Reading tea leaves: How humans interpret topic models. vol. 32, pp. 288–296, 2009.
[11] A. Chatzimparmpas, R. M. Martins, I. Jusufi, K. Kucher, F. Rossi, and A. Kerren. The state of the art in enhancing trust in machine learning models with the use of visualizations. In Computer Graphics Forum (Print), 2020.
[12] A. Chouldechova. Fair prediction with disparate impact: A study of bias in recidivism prediction instruments. Big Data, 5(2):153–163, 2017.
[13] T. Davidson, D. Bhattacharya, and I. Weber. Racial bias in hate speech and abusive language detection datasets. In Proceedings of the Third Workshop on Abusive Language Online, pp. 25–35, 2019.
[14] W. K. Diprose, N. Buist, N. Hua, Q. Thurier, G. Shand, and R. Robinson. Physician understanding, explainability, and trust in a hypothetical machine learning risk calculator. Journal of the American Medical Informatics Association, 27(4):592–600, 2020. doi: 10.1093/jamia/ocz229
[15] J. Dressel and H. Farid. The accuracy, fairness, and limits of predicting recidivism. Science Advances, 4(1):eaao5580, 2018.
[16] M. R. Endsley. Measurement of situation awareness in dynamic systems. Human Factors: The Journal of the Human Factors and Ergonomics Society, 37(1):65–84, 1995. doi: 10.1518/001872095779049499
[17] D. Ensign, S. A. Friedler, S. Neville, C. Scheidegger, and S. Venkatasubramanian. Runaway feedback loops in predictive policing. In Conference on Fairness, Accountability and Transparency, pp. 160–171, 2018.
[18] B. Goodman and S. Flaxman. European Union regulations on algorithmic decision-making and a right to explanation. AI Magazine, 38(3):50–57, 2017.
[19] D. Gunning. Broad Agency Announcement: Explainable Artificial Intelligence (XAI), 2016.
[20] S. R. Haynes, M. A. Cohen, and F. E. Ritter. Designs for explaining intelligent agents. International Journal of Human-Computer Studies, 67(1):90–110, Jan. 2009. doi: 10.1016/j.ijhcs.2008.09.008
[21] J. L. Herlocker, J. A. Konstan, and J. Riedl. Explaining collaborative filtering recommendations. In Proceedings of the 2000 ACM Conference on Computer Supported Cooperative Work - CSCW '00, pp. 241–250. ACM Press, Philadelphia, Pennsylvania, United States, 2000. doi: 10.1145/358916.358995
[22] R. R. Hoffman, S. T. Mueller, G. Klein, and J. Litman. Metrics for explainable AI: Challenges and prospects. arXiv preprint arXiv:1812.04608, 2018.
[23] F. Hohman, M. Kahng, R. Pienta, and D. H. Chau. Visual analytics in deep learning: An interrogative survey for the next frontiers. IEEE Transactions on Visualization and Computer Graphics, 25(8):2674–2693, 2018.
[24] D. Holliday, S. Wilson, and S. Stumpf. User trust in intelligent systems: A journey over time. In Proceedings of the 21st International Conference on Intelligent User Interfaces - IUI '16, pp. 164–168. ACM Press, Sonoma, California, USA, 2016. doi: 10.1145/2856767.2856811
[25] A. Holzinger, G. Langs, H. Denk, K. Zatloukal, and H. Müller. Causability and explainability of artificial intelligence in medicine. WIREs Data Mining and Knowledge Discovery, 9(4), July 2019. doi: 10.1002/widm.1312
[26] K. Holzinger, K. Mak, P. Kieseberg, and A. Holzinger. Can we trust machine learning results? Artificial intelligence in safety-critical decision support. Research and Innovation, pp. 112–113, 2018.
[27] H. Jiang, B. Kim, M. Guan, and M. Gupta. To trust or not to trust a classifier. 2018.
[28] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. Communications of the ACM, 60(6):84–90, May 2017. doi: 10.1145/3065386
[29] T. Kulesza, S. Stumpf, M. Burnett, S. Yang, I. Kwan, and W.-K. Wong. Too much, too little, or just right? Ways explanations impact end users' mental models. In 2013 IEEE Symposium on Visual Languages and Human-Centric Computing (VL/HCC), pp. 3–10. IEEE, San Jose, CA, USA, Sept. 2013. doi: 10.1109/VLHCC.2013.6645235
[30] S. Lapuschkin, S. Wäldchen, A. Binder, G. Montavon, W. Samek, and K.-R. Müller. Unmasking Clever Hans predictors and assessing what machines really learn. Nature Communications, 10(1), Mar. 2019. doi: 10.1038/s41467-019-08987-4
[31] B. Lim, A. Dey, and D. Avrahami. Why and why not explanations improve the intelligibility of context-aware intelligent systems. Conference on Human Factors in Computing Systems - Proceedings, 2009. doi: 10.1145/1518701.1519023
[32] S. Mohseni, N. Zarei, and E. D. Ragan. A survey of evaluation methods and measures for interpretable machine learning. arXiv preprint arXiv:1811.11839, 2018.
[33] A. Nguyen, A. Dosovitskiy, J. Yosinski, T. Brox, and J. Clune. Synthesizing the preferred inputs for neurons in neural networks via deep generator networks. arXiv:1605.09304 [cs], Nov. 2016.
[34] F. Nothdurft, F. Richter, and W. Minker. Probabilistic human-computer trust handling. In Proceedings of the 15th Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL), pp. 51–59. Association for Computational Linguistics, Philadelphia, PA, U.S.A., 2014. doi: 10.3115/v1/W14-4307
[35] V. Pallotta, P. Bruegger, and B. Hirsbrunner. Smart heating systems: Optimizing heating systems by kinetic-awareness. In Third International Conference on Digital Information Management (ICDIM), pp. 887–892. IEEE, London, United Kingdom, Nov. 2008. doi: 10.1109/ICDIM.2008.4746833
[36] J. H. Park, J. Shin, and P. Fung. Reducing gender bias in abusive language detection. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 2799–2804, 2018.
[37] P. Pu and L. Chen. Trust building with explanation interfaces. In Proceedings of the 11th International Conference on Intelligent User Interfaces - IUI '06, p. 93. ACM Press, Sydney, Australia, 2006. doi: 10.1145/1111449.1111475
[38] M. T. Ribeiro, S. Singh, and C. Guestrin. "Why should I trust you?": Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1135–1144, 2016.
[39] M. T. Ribeiro, S. Singh, and C. Guestrin. Anchors: High-precision model-agnostic explanations. In AAAI, 2018.
[40] M. Sap, D. Card, S. Gabriel, Y. Choi, and N. A. Smith. The risk of racial bias in hate speech detection. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 1668–1678, 2019.
[41] K. E. Schaefer, E. R. Straub, J. Y. Chen, J. Putney, and A. Evans. Communicating intent to develop shared situation awareness and engender trust in human-agent teams. Cognitive Systems Research, 46:26–39, Dec. 2017. doi: 10.1016/j.cogsys.2017.02.002
[42] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In 2017 IEEE International Conference on Computer Vision (ICCV), pp. 618–626. IEEE, Venice, Oct. 2017. doi: 10.1109/ICCV.2017.74
[43] K. Simonyan, A. Vedaldi, and A. Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv:1312.6034 [cs], Apr. 2014.
[44] K. Sokol and P. A. Flach. Glass-Box: Explaining AI decisions with counterfactual statements through conversation with a voice-enabled virtual assistant. In IJCAI, pp. 5868–5870, 2018.
[45] A. Toon. Where is the understanding? Synthese, 192, 2015. doi: 10.1007/s11229-015-0702-8
[46] B. Ustun, A. Spangher, and Y. Liu. Actionable recourse in linear classification. In Proceedings of the Conference on Fairness, Accountability, and Transparency, pp. 10–19, 2019.
[47] J. Wexler, M. Pushkarna, T. Bolukbasi, M. Wattenberg, F. Viégas, and J. Wilson. The What-If Tool: Interactive probing of machine learning models. IEEE Transactions on Visualization and Computer Graphics, 26(1):56–65, 2019.
[48] B. Wolford. https://gdpr.eu/what-is-gdpr/.
[49] F. Yang, Z. Huang, J. Scholtz, and D. L. Arendt. How do visual explanations foster end users' appropriate trust in machine learning? In Proceedings of the 25th International Conference on Intelligent User Interfaces, pp. 189–201, 2020.
[50] M. Yin, J. Wortman Vaughan, and H. Wallach. Understanding the effect of accuracy on trust in machine learning models. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems - CHI '19, pp. 1–12. ACM Press, Glasgow, Scotland, UK, 2019. doi: 10.1145/3290605.3300509
[51] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In European Conference on Computer Vision (ECCV), 2014.