How can I choose an explainer? An Application-grounded Evaluation of Post-hoc Explanations
Sérgio Jesus, Catarina Belém, Vladimir Balayan, João Bento, Pedro Saleiro, Pedro Bizarro, João Gama
Sérgio Jesus (Feedzai; DCC-FCUP, Universidade do Porto) [email protected]
Catarina Belém
Vladimir Balayan
João Bento
Pedro Saleiro
Pedro Bizarro
João Gama (LIAAD, INESC TEC; Universidade do Porto) [email protected]
ABSTRACT
There have been several research works proposing new Explainable AI (XAI) methods designed to generate model explanations having specific properties, or desiderata, such as fidelity, robustness, or human-interpretability. However, explanations are seldom evaluated based on their true practical impact on decision-making tasks. Without that assessment, explanations might be chosen that, in fact, hurt the overall performance of the combined system of ML model + end-users. This study aims to bridge this gap by proposing XAI Test, an application-grounded evaluation methodology tailored to isolate the impact of providing the end-user with different levels of information. We conducted an experiment following XAI Test to evaluate three popular post-hoc explanation methods, LIME, SHAP, and TreeInterpreter, on a real-world fraud detection task, with real data, a deployed ML model, and fraud analysts. During the experiment, we gradually increased the information provided to the fraud analysts in three stages: Data Only, i.e., just transaction data without access to the model score or explanations; Data + ML Model Score; and Data + ML Model Score + Explanations. Using strong statistical analysis, we show that, in general, these popular explainers have a worse impact than desired. Some of the conclusion highlights include: i) showing Data Only results in the highest decision accuracy and the slowest decision time among all variants tested; ii) all the explainers improve accuracy over the Data + ML Model Score variant but still result in lower accuracy when compared with Data Only; iii) LIME was the least preferred by users, probably due to its substantially lower variability of explanations from case to case.
KEYWORDS
XAI, Evaluation, Explainability, LIME, SHAP, User Study
FAccT '21, March 3–10, 2021, Virtual Event, Canada. © 2021 Copyright held by the owner/author(s). Publication rights licensed to ACM. This is the author's version of the work. It is posted here for your personal use. Not for redistribution. The definitive Version of Record was published in the Conference on Fairness, Accountability, and Transparency (FAccT '21), March 3–10, 2021, Virtual Event, Canada, https://doi.org/10.1145/3442188.3445941.
ACM Reference Format:
Sérgio Jesus, Catarina Belém, Vladimir Balayan, João Bento, Pedro Saleiro, Pedro Bizarro, and João Gama. 2021. How can I choose an explainer? An Application-grounded Evaluation of Post-hoc Explanations. In Conference on Fairness, Accountability, and Transparency (FAccT '21), March 3–10, 2021, Virtual Event, Canada. ACM, New York, NY, USA, 11 pages. https://doi.org/10.1145/3442188.3445941
Figure 1: End-users' average decision accuracy vs. average time to make a decision for each variant tested in our evaluation experiment of post-hoc explanations. We used balanced samples of positive and negative instances; therefore, a random decision process would have 50% accuracy.
The interest in ML models' explainability has been growing in the last years, as a counteractive effort to the current AI black-box paradigm, coupled with increased public scrutiny and evolving regulatory law [1–4]. However, this growth in Explainable AI (XAI) research work has not been accompanied by effective evaluation methodologies [5]. The field is still in its early stages. Even though every persona interacting with a black-box ML model may benefit from model explainability, each persona has a specific role, objectives, actions at disposal, background, domain knowledge, and, consequently, different explainability requirements [6–8]. As a result, the evaluation of XAI methods must be performed with the target persona and the associated task in mind [7, 8]. Notwithstanding, in seminal works of XAI methods, it is common to see one or multiple ad-hoc evaluation setups introduced, mostly focused on ideal explanation desiderata [9–12]. In some cases, user experiments are simulated [9] or even completely discarded from the evaluation step [13]. As a consequence, there is a lack of accurate and exhaustive systematic comparison between different methods. These reasons culminate, ultimately, in general skepticism about the reliability and usefulness of XAI methods, especially when the application carries high responsibility.
In this work, we focus on XAI evaluation having the end-user as the target persona. We consider the end-user as the decision-maker, the human-in-the-loop, who usually is a domain expert, such as a judge, a doctor, or a fraud analyst. We argue that, for end-users, the value of explanations is heavily determined by how useful they are to the associated decision task and, for that reason, that their evaluation should be made by measuring their impact on the performance of the end-users. This implies involving end-users in the evaluation process, in a setup with a real task and real data. Additionally, metrics should reflect directly the users' performance, e.g., how accurate the decisions are, or how fast they are made. We propose XAI Test, an application-grounded evaluation methodology tailored to isolate the impact of gradually providing different levels of information to the end-user. A useful XAI method produces explanations that improve the overall performance of the combined system of ML model + end-user. To perform a reliable assessment, XAI Test requires testing different combinations of data, model score, and XAI methods in a real task with real end-users. Specific performance metrics must be defined (e.g., accuracy or decision time), the agreement between end-users is considered on each variant, and user perception is captured through questionnaires. Lastly, statistical tests are employed to detect significant differences between the variants.
Using XAI Test, we conducted an empirical evaluation in the task of fraud detection in financial transactions. We employed three different post-hoc explainers and observed their impact on human-in-the-loop performance, measuring accuracy, recall, false positive rate (FPR), and decision time. We additionally collected the users' perception of the usefulness, variety, and relevance of each presented explanation. We quantified and isolated the impact of the different interacting parties in a Human-AI collaborative setting by following a three-stage evaluation approach with increasing information. Figure 1 shows how the average accuracy of the decision varies with the decision time for each of the evaluated variants. We observe a clear trade-off between effectiveness and efficiency as the end-user gets access to additional ML model information. In particular, we observe that showing no model-related information to the end-user (i.e., Data Only), although slower, leads to more accurate decisions. Conversely, the mid-level information stage (Data + ML Model Score) yields faster decisions but much worse accuracy, a result that is partially improved by adding model explanations.
In this section, we provide an overview of the current evaluation paradigm in XAI research. In particular, we briefly discuss the often considered desiderata, as well as the different techniques used to measure them. We end by enumerating a few representative state-of-the-art evaluation approaches and by describing how these fail to convey a robust analysis of the real impact of XAI methods in real-life Human-AI decision-making systems.
Most research work on XAI measures some kind of proxy of intuitive desiderata for the ideal explanation, such as fidelity or faithfulness [9, 14], which states that surrogate models used to obtain post-hoc explanations should be able to mimic the behavior of the explained ML model; robustness or stability [15], which measures whether similar input instances get similar explanations; and human-interpretability or comprehensibility [16], which measures how easily a human interprets the result of the explanation method. Despite it being common sense that a good explanation must have high fidelity, be robust, and be intelligible, those characteristics by themselves do not say much about the actual benefit of having an explanation in a specific real-world application, nor do the measurements completely represent those characteristics.

Previous work often assumes that a model is interpretable because it belongs to a certain family of models, such as sparse linear models, decision trees, and rule lists [17–20], or additive models [21–23], and the only focus when generating explanations is on the accuracy of those models. These explanations are directly derived from interpreting the ML model parameters. Most of the time, these over-simplified definitions of model intelligibility are detached from the requirements of real-world applications [24]. In general, these simpler models have much lower predictive accuracy than other more complex models, such as deep neural networks or tree ensembles. Only in a few high-stakes tasks (such as credit scoring [25]) is the complexity of an ML model viewed as an actual limitation, and only in those particular cases is there no alternative to simpler, more intelligible models.

Several works assess fidelity as a measure of the quality of an explanation. Fidelity has been assessed both directly [9, 14, 26, 27], by measuring differences in predictions of the surrogate and explained models, as well as indirectly [11], by measuring how well a human can predict the output of an ML system with and without being exposed to explanations. Again, this is another metric detached from the real-world impact of showing an explanation to a given persona, as it focuses on how well an XAI model approximates the function learned by the original ML model.
Other works defend the importance of robustness. It is measured by directly computing how much the output of an explanation method changes with its input [28, 29] or by showing the sensitivity of explanations to adversarial attacks [30]. However, these metrics are not directly related to how an explanation might help the end-user better perform their task.
Interpretability is also assessed by measuring how close an XAI method's explanation is to an explanation produced by a human expert [3, 31]. Those approaches are somewhat restricted to tasks where the behavior of humans is intuitive and generally close to the ground truth (such as problems in natural language processing and computer vision), but they may not be suitable for complex predictive tasks based on tabular data, where the analysis has to take into account multiple features and interactions, making the task harder and less intuitive.

The way XAI desiderata are interpreted and measured is dispersed and lacking in consensus, as shown by the different methods used to measure the same property. Several problems with current practices have been pointed out, such as non-overlapping and discordant motivations and objectives for interpretability [24], attributing the same level of interpretability to ML models originating from the same model class [32], or the lack of evaluation of XAI methods with the intended end-users [33].

Frameworks have been developed [7, 34] in an attempt to tackle the challenges of XAI evaluation; however, these frameworks are still recent and have yet to see wide adoption. The field is missing a systematic and objective way of comparing explanation methods [35, 36], which promotes research practices where each work uses customised metrics and desiderata that are thought to be the most adequate, encumbering the choice of XAI methods for a given task. This is especially important in scenarios of real-world Human-AI decision-making systems, where XAI methods may have a greater impact.
While many ad-hoc evaluation setups have been used to empirically validate research on XAI methods, these are either founded on idyllic desiderata or overlook the human-in-the-loop and their explainability needs. In an attempt to standardize the existing XAI evaluation approaches, Doshi-Velez and Kim [5] propose a taxonomy to categorize the different types of XAI evaluation practices. In their work, the authors subdivide the evaluation practices into three distinct groups, depending on whether they resort to humans or not and on the task they are employed on. The first group encompasses automated evaluation on proxy tasks and is designated functionality-grounded evaluation. Experiments in this category may try to simulate human behavior [11] and apply these simulations to real tasks, such as fraud detection [37]. Other works do not consider the human factor as part of the evaluation [9, 14]. The other two groups of evaluation methods use humans in the process of evaluation but differ on the task being done. If the evaluation task is a simplified proxy of a real task, the method is designated human-grounded evaluation, while if the task is in a real-world setting, the method is deemed application-grounded evaluation. These methods introduce the human component in the evaluation loop to collect feedback in the form of questionnaires, surveys, interviews, performance at the task, among others. Their focus, however, shifts from how humans perceive and interact with the explanations, in human-grounded evaluation, to how they affect the whole system performance, in application-grounded evaluation.

The evaluation of explanations through experimentation has been done in several past works. Most experimental studies use proxy tasks with real human subjects, i.e., human-grounded experiments, such as trivia answering [38], clinical prescription simulation [16, 20], detection of deceptive reviews [39], comparison between human feedback and explainer output [3], or human prediction of the model output on unseen instances based on the explanation of the model behavior [11].

By analysing the experiments conducted in other works, there is a clear gap in evaluation using real tasks with real end-users. More often than not, explanations are employed in mocked tasks, and the results obtained cannot be generalized to high-responsibility real-world tasks. Simulating human behavior is prone to human bias, since in many cases it depends on the developers' own intuition of the problem, and may diverge from reality, producing unrealistic results. Additionally, seldom do these experiments compare explanation methods; rather, they test different visualizations or output types for these methods, which places the emphasis on the presentation rather than the explanations' content.
The evaluation of the true impact of a given explanation on the end-user experience is not an easy task. Ideally, it should be focused on objectively measuring its utility (or usefulness) in the users' decision-making process. This should rely on the collection of metrics from real users while performing real tasks on real data.

We propose XAI Test, an application-grounded evaluation methodology that relies on realistic settings and statistical tests to robustly assess and compare the explanations' utility of different XAI methods, using metrics that correspond to the performance of the user. Rather than evaluating explainability through idyllic desiderata, we opt for evaluating it through metrics that quantify the true impact on human decision-making. The methodology consists of the following steps: (1) formulate the hypotheses; (2) outline the experimental setup; (3) define the statistical tests with which to report the results; (4) conduct the three stages of the experiment; and (5) apply the statistical tests to the obtained measurements.

With this methodology, we aim to find answers to a set of hypotheses (e.g., is method A more efficient than method B? Is it more accurate?). In the case of an XAI experiment, these hypotheses are related to the utility of the explanations and how they impact the end result of a given task. To support or reject the formulated hypotheses, it is necessary to objectively measure users' performance at the task (e.g., through accuracy, or decision time). It is also important to define the other elements of the experiment, including the explainers, ML models, corresponding configurations to test, the number of users that partake in the experiment, datasets, and other task-specific details, such as experiment scheduling and used software. Equally important for ensuring a robust evaluation is the confidence of the reported results. To this end, we define the appropriate statistical tests as well as their parameters, which are the significance level, statistical power, effect size, and sample size. Prior knowledge of the distributions is required to choose these parameters. In Section 3.3, we elaborate on the choices made in terms of hypothesis testing. The ensuing step is then to conduct the experiments in a way that isolates the impact of explanations in the decision-making process. For this reason, we advocate for the execution of, at least, three stages, each providing added levels of information: (1) Data Only, (2) Data + ML Model Score, and (3) Data + ML Model Score + Explanations. Finally, the last step of the proposed methodology concerns the collected results and their analysis.

The following sections describe the methodology employed in the evaluation of XAI methods. This includes the way explanations are employed, the measured metrics, and the battery of statistical tests to determine any significant difference.
The choice of metrics is task-specific. In Human-AI cooperative systems where the true data labels are known, it is possible to combine this information with the user decision to compute performance metrics (based on the confusion matrix), such as recall, FPR, precision, or false omission rate (FOR). These measures allow us to objectively quantify the impact of different components (e.g., model score and/or different explanation types) in the human decision-making process. In practice, accuracy, recall, and FPR are better choices, because their denominators depend either on the sample size or on the number of label positives and label negatives of the sample. Since these are constant over the course of the experiment and do not depend on the number of predicted positives and negatives (as is the case for metrics such as precision and FOR), we can determine a priori the exact sample size for each metric.

In most systems, time is also a determining factor and should, therefore, be monitored during system modifications. In Human-AI decision-making systems, explanations serve to help the human-in-the-loop make a faster decision, by pointing them to what the model perceives to be the most important information for the decision. Consequently, this is an important aspect to measure when discerning the impact of explanations in decision-making processes.

Another relevant point, despite being more subjective, is the user's perception of the explanation quality, including its relevance and usefulness. For this reason, we propose a set of predefined five-point Likert-type scale questions, specified in Figure 2.

Finally, decisions oftentimes diverge from user to user. We expect the addition of more information (e.g., model scores and/or explanations) to mitigate such differences. To accurately measure this effect, we use an agreement set, a subset of the data that is shared between users with the intent of computing agreement metrics. We use Fleiss' Kappa [40] as the agreement metric because our experiments incorporate multiple users. Additionally, we calculate the average agreement, which is the average pairwise agreement between users.
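For concreteness, the sketch below shows one way to compute these quantities for a single experiment variant: the confusion-matrix metrics (accuracy, recall, FPR), the average decision time, Fleiss' Kappa, and the average pairwise agreement. It is a minimal illustration relying on numpy and statsmodels; the function names and data layout are ours and are not prescribed by XAI Test.

```python
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

def variant_metrics(y_true, y_pred, decision_times):
    """Confusion-matrix metrics and efficiency for one experiment variant."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    return {
        "accuracy": (tp + tn) / len(y_true),
        "recall": tp / (tp + fn),      # denominator fixed by the number of label positives
        "fpr": fp / (fp + tn),         # denominator fixed by the number of label negatives
        "avg_decision_time": float(np.mean(decision_times)),
    }

def agreement_metrics(decisions):
    """decisions: (n_instances, n_users) array of 0/1 labels on the agreement set."""
    decisions = np.asarray(decisions)
    counts, _ = aggregate_raters(decisions)          # instances x categories count table
    kappa = fleiss_kappa(counts)
    # average pairwise agreement: fraction of instances on which a pair of users
    # gives the same label, averaged over all user pairs
    n_users = decisions.shape[1]
    pairs = [(i, j) for i in range(n_users) for j in range(i + 1, n_users)]
    pairwise = np.mean([np.mean(decisions[:, i] == decisions[:, j]) for i, j in pairs])
    return {"fleiss_kappa": kappa, "avg_pairwise_agreement": float(pairwise)}
```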
Figure 2: Questionnaire presented to the users after each instance with an explanation.

While, in the first stage of the experiment, humans only have access to instance-specific information (feature data), in the second stage the human is additionally provided with the model score, calibrated for simplification. Consequently, users may sometimes perceive it as a measure of how confident the model is about predicting a given class: scores closer to 1 or 0 express confidence, whereas scores around 0.5 convey more uncertainty.

The third stage of the experiment involves, in addition to the ML model score, the explanations. How and which information to show for which explainer should be defined in the experimental setup. There are many degrees of freedom when configuring an explainer: the explainer type (e.g., self-explainable, post-hoc), the number of features to consider, and how to represent the explanation (e.g., feature contributions, heatmaps, scores, visualizations) so as to minimize the cognitive load during task execution. Another important aspect to pay attention to are the biases that may arise if explanation methods are distinguishable due to some factor (e.g., their representation). Mitigating their representational differences is, therefore, a preventive step towards isolating the quality and relevance of the explanation methods from all the other possible visual factors.
The appropriate choice of a statistical test depends on two factors: (1) the metric distribution and (2) the end-goal of the test. Most statistical tests aim at identifying significant differences between measured averages of performance metrics in different scenarios (control vs. treatment). In this case, we use the Chi-squared test [41] for multiple-group comparison of instance-level binary metrics, such as accuracy, recall, and FPR. Conversely, for continuous performance metrics like decision time, we use a non-parametric test, the Kruskal-Wallis H test [42], to validate whether the samples belong to the same underlying process. This test is particularly suited for non-normal distributions of continuous variables.

We are also interested in comparing pairs of groups and, specifically, in running comparisons between each variant and the control group. In these cases, we use the Chi-squared test on the pairs to be tested for the binary performance metrics, and the Mann-Whitney U test [43] on continuous data. P-values must also be corrected for family-wise error rate with the Holm-Bonferroni method [44].

In order to quantify the perceived usefulness and relevance of the explanations measured through the questionnaire, we aim to identify distribution differences between the different explainers for the proposed questions. We find the Kruskal-Wallis H test to better suit this goal when comparing multiple variants. To report the results of paired tests, we apply the Kolmogorov-Smirnov test [45] corrected with the Holm-Bonferroni method [46].
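This battery of tests maps directly onto standard scipy and statsmodels routines. The sketch below illustrates the multiple-group comparisons (Chi-squared for binary metrics, Kruskal-Wallis H for decision time) and the pairwise Mann-Whitney U tests with Holm-Bonferroni correction; the helper names and data structures are illustrative assumptions, not part of the paper's code.

```python
import numpy as np
from scipy.stats import chi2_contingency, kruskal, mannwhitneyu
from statsmodels.stats.multitest import multipletests

def compare_binary_metric(hits_per_group):
    """Multiple-group Chi-squared test on an instance-level binary metric
    (e.g., 1 = correct decision, 0 = incorrect). hits_per_group: dict variant -> 0/1 array."""
    table = [[int(np.sum(h)), int(len(h) - np.sum(h))] for h in hits_per_group.values()]
    _, p_value, _, _ = chi2_contingency(table)
    return p_value

def compare_continuous_metric(values_per_group):
    """Multiple-group Kruskal-Wallis H test on a continuous metric (e.g., decision time)."""
    _, p_value = kruskal(*values_per_group.values())
    return p_value

def pairwise_vs_control(control_values, treatment_values_by_variant):
    """Pairwise Mann-Whitney U tests of each variant against the control group,
    with Holm-Bonferroni correction of the resulting p-values."""
    raw = [mannwhitneyu(control_values, values, alternative="two-sided").pvalue
           for values in treatment_values_by_variant.values()]
    _, corrected, _, _ = multipletests(raw, method="holm")
    return dict(zip(treatment_values_by_variant, corrected))
```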
We employ our proposed application-grounded methodology, XAI Test, to evaluate and compare different explanation methods in a real-world decision-making task: fraud detection in payment transactions. We had access to a real fraud prevention system comprising a deployed ML model that predicts the risk of fraud for each payment transaction in a given online retailer. The fraud analyst is responsible for accepting or declining payment transactions about which the ML model is more uncertain (the score is within a review band). This decision-making task is performed through a web interface in which the fraud analyst can inspect details of the payment transaction (e.g., shipping address, billing email, time since last transaction), which represent the feature data (i.e., Data Only), together with the risk score given by the ML model, and an explanation.

While business requirements aim for more effective and efficient decisions, often the model information is not sufficient to meet such criteria (e.g., disagreement between fraud analysts and the ML model, or even mistrust in the model predictions). In an attempt to bridge this Human-AI gap, we conjecture that explanations promote better human performance in such a predictive fraud task. Therefore, the prime goal of this experiment is to assess the real impact of showing explanations to real humans (the fraud analysts) interacting with a real ML model.
As the first step of XAI Test, we formulated our hypotheses. Since we used a production system without permission to modify the ML model, we focus on the evaluation of post-hoc explanation methods. With this in mind, we set out to answer the following hypotheses:

• H1. Showing fraud analysts the ML Model Score improves their performance over Data Only;
• H2. Showing post-hoc explanations (explanations produced by post-hoc methods, defined in Section 4.2) significantly improves human performance over Data Only and/or Data + ML Model Score;
• H3. Explanations from different post-hoc explainers impact humans differently; assuming that humans trust the explanations, some explainers promote more effective and/or efficient decisions;
• H4. Each post-hoc explainer is perceived differently in terms of relevance, usefulness, and diversity;
• H5. Showing explanations increases fraud analysts' agreement over the same set of transactions;
• H6. Showing model score information increases fraud analysts' agreement over the same set of transactions.
We evaluate the above hypotheses using metrics indicative of the fraud analysts' performance in terms of both efficiency and efficacy at the decision-making task.

Metrics: We use the average decision time (of fraud analysts) as an efficiency measure, and we use accuracy, FPR, and recall as measures of their effectiveness. Moreover, to address H4, we also measure the perceived relevance, usefulness, and diversity of the explanations through the questionnaire in Figure 2.

ML model: As an application-grounded evaluation of a real-world system, we used the fraud prevention system's ML model: a variant of Random Forest [47].
Explainers: Among the various XAI methods for tabular data, we opted for two of the most commonly used post-hoc explainers: LIME [9] and SHAP [3]. In particular, we leveraged the fact that the model is a decision tree ensemble to use the tree-based SHAP explainer, TreeSHAP [13]. We also included a third explainer specifically tailored for tree-based algorithms, known to ML practitioners as TreeInterpreter [48]. In terms of hyperparameters, we ran a few sensitivity tests to determine the most appropriate hyperparameters for the proposed task. From this analysis, we concluded that both SHAP and TreeInterpreter could be used with their out-of-the-box parametrization, whereas LIME had to be tweaked, especially due to its stochastic nature (LIME's internal local fidelity metric showed improvements exclusively upon variations of the number of perturbed samples). Thus, besides fixing the random seed, we also set the number of perturbed samples to 5k.

Explanation format: The explanation format for the three explainers consists of feature-contribution pairs, where higher contributions reflect more important features. We decided to only display the top 6 pairs based on contribution value. Unlike other tabular explanation formats, such as decision lists and decision sets [20], the feature-contribution format benefits from its readability, simplicity, and visualization flexibility. Furthermore, to create a seamless experiment, we used this output's simplicity to homogenize the explanation representation across explainers. Given a set of feature-contribution pairs, we: (1) sort it in descending order by absolute contribution value, and (2) transform it into a human-readable format. This transformation comprises mapping the feature name to a natural language description plus parsing the feature value (e.g., converting time from seconds to days). We further added a color-based visual cue to reflect the changes in the associated suspicious risk (score): negative contributions are represented in green, as they contribute to lower scores and consequently legitimate transactions, and, conversely, positive contributions are represented in red. Figure 3 illustrates an explanation shown to a fraud analyst during the experiments.

Users: Three professional fraud analysts partook in the experiment. They were all experienced users of the fraud detection system used in the experiment.
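To make this configuration concrete, the sketch below generates feature-contribution explanations with the three explainers and reduces each one to the top 6 pairs by absolute contribution, with the sign mapped to the green/red cue described above. The library calls follow the public APIs of shap, lime, and treeinterpreter; the wrapper functions, the fixed seed, and the handling of per-class outputs are our assumptions rather than the production configuration.

```python
import numpy as np
import shap
from lime.lime_tabular import LimeTabularExplainer
from treeinterpreter import treeinterpreter as ti

TOP_K = 6  # number of feature-contribution pairs shown to the analyst

def top_pairs(contributions, feature_names, k=TOP_K):
    """Sort feature-contribution pairs by absolute contribution and keep the top k.
    Negative contributions (lower risk) map to green, positive contributions to red."""
    order = np.argsort(-np.abs(contributions))[:k]
    return [(feature_names[i], float(contributions[i]),
             "green" if contributions[i] < 0 else "red") for i in order]

def explain_transaction(model, x_row, X_background, feature_names):
    """Sketch: explanations of one transaction from the three post-hoc explainers."""
    # TreeSHAP: Shapley values computed exactly for tree ensembles
    sv = shap.TreeExplainer(model).shap_values(x_row.reshape(1, -1))
    # older shap versions return a list of per-class arrays; keep the fraud (positive) class
    shap_contrib = (sv[1] if isinstance(sv, list) else sv)[0]

    # TreeInterpreter: prediction = bias + sum of per-feature contributions
    _, _, ti_contrib = ti.predict(model, x_row.reshape(1, -1))
    ti_contrib = ti_contrib[0][:, -1]                 # fraud-class column

    # LIME: local surrogate with a fixed seed and 5k perturbed samples, as in the text
    lime_exp = LimeTabularExplainer(
        X_background, feature_names=feature_names,
        mode="classification", random_state=42,
    ).explain_instance(x_row, model.predict_proba,
                       num_features=TOP_K, num_samples=5000)

    return {
        "shap": top_pairs(shap_contrib, feature_names),
        "treeinterpreter": top_pairs(ti_contrib, feature_names),
        "lime": lime_exp.as_list(),                   # already limited to TOP_K pairs
    }
```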
Figure 3: Visual representation of an explanation, as viewed by fraud analysts during the experiment (obfuscated to preserve privacy).

Data: Two different samples were considered: (1) a training sample, derived from the same data set used to train the ML model, and (2) an experiment sample, from the production period of the ML model. We used the former as the background data for LIME (to obtain information about feature distributions). To create it, we randomly sampled 100k transactions from the model's training set. Conversely, the sample for running the experiment itself, dubbed the experiment sample, was extracted from the model's production period (November 2019), for which we had fraud labels. We extracted a stratified sample to attain 50% fraud prevalence. To replicate a real scenario, the experiment sample exclusively comprises transactions that lie in the review band, i.e., transactions with higher model uncertainty. The final experiment sample size totals 1300 transactions. In the following section, we disclose how these transactions were distributed across the different experiment stages.
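A minimal sketch of how the two samples could be assembled is given below, assuming a pandas workflow; the column names and the review-band thresholds are placeholders, not the actual production values.

```python
import pandas as pd

def build_samples(train_df, production_df, score_col="model_score",
                  label_col="is_fraud", review_band=(0.3, 0.7), seed=0):
    """Sketch of the background and experiment samples described above."""
    # (1) background sample for LIME: 100k random transactions from the training set
    background = train_df.sample(n=100_000, random_state=seed)

    # (2) experiment sample: production transactions inside the review band,
    # stratified to 50% fraud prevalence, 1300 transactions in total
    in_band = production_df[production_df[score_col].between(*review_band)]
    per_class = 1300 // 2
    experiment = pd.concat([
        in_band[in_band[label_col] == 1].sample(n=per_class, random_state=seed),
        in_band[in_band[label_col] == 0].sample(n=per_class, random_state=seed),
    ]).sample(frac=1, random_state=seed)  # shuffle the stratified sample
    return background, experiment
```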
We conducted all three stages of the experiment, as XAI Test suggests (see Section 3.2). Given that each stage added levels of information, we decided to run them in an order that allows fraud analysts to incrementally stabilize their mental model of the task (as they adapt to new information within the system). This leads to the following experiment outline:

(1) Data Only: information exclusively about the transaction (payment details and history) is available;
(2) Data + ML Model Score: both the transaction data and the model score are available;
(3) Data + ML Model Score + Explanations: all of the above information is complemented with an explanation (from LIME, SHAP, or TreeInterpreter) of the model score.

As our baseline, we considered the stage where all information except the data was withheld from fraud analysts (the Data Only stage), as it allows us to isolate and quantify the real impact of different information types on the human's performance in (and understanding of) the task. In the absence of prior knowledge about the metrics' distribution for this particular task, we used a total of 400 transactions to conduct the two initial stages of the experiment (200 for each stage). Each of these samples was created without replacement from the experiment sample (stratified by fraud label to keep fraud prevalence at 50%). We found 200 transactions to be a good compromise between the pressing time and business constraints (e.g., availability of the analysts) and the quality and rigor of the experiment.

On the other hand, we leveraged the results obtained in the initial experiments (Data Only and Data + ML Model Score) to compute the sample size required to obtain significant results at the desired power, β, significance level, α, and effect size, δ. We set δ = β = 1 − α = 0.9, since we perceive both error types associated with statistical hypothesis testing (type I and type II) to be of equal importance during the experiment. In the end, and assuming the proxy estimates of the analysts' distribution were representative of their true performance, we concluded that a sample with 300 transactions would suffice for rigorously running the third stage of the experiment, the Data + ML Model Score + Explanations stage. Each sample was divided equally among the analysts. Each analyst reviewed the same number of transactions for every explainer in the experiment (100 transactions per explainer), which guaranteed that the results were equally balanced and that the experiment results were not skewed towards a specific explainer or user.

To address hypothesis H5, we defined a subset of each sample to belong to an agreement set. In practice, this implies that all users reviewed the same exact transactions of the agreement set. This set accounted for about 12.5% of the transactions on every experiment stage.
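The third-stage sample-size computation can be reproduced with a standard power analysis for the difference between two proportions, as sketched below with statsmodels. The choice of a two-proportion comparison, the default α = 0.1 and power = 0.9 (matching 1 − α = β = 0.9 above), and the placeholder proxy accuracies are all assumptions; the exact power-analysis procedure used in the study is not specified in the text.

```python
import math
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

def required_sample_size(p_control, p_treatment, alpha=0.1, power=0.9):
    """Per-group sample size needed to detect the difference between two accuracy
    proportions (two-sided test, normal approximation) at the given alpha and power."""
    effect = proportion_effectsize(p_control, p_treatment)   # Cohen's h
    n = NormalIndPower().solve_power(effect_size=effect, alpha=alpha,
                                     power=power, alternative="two-sided")
    return math.ceil(n)

# e.g., using proxy accuracies estimated from the first two stages (placeholder values):
# required_sample_size(0.62, 0.52)
```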
In this section, we evaluate how various levels of information affect the human's decision-making process in a fraud detection task. We first examine the impact of disclosing information about the ML model score when compared to withholding that information. We also analyse the impact of showing different post-hoc explanations on top of the information about the ML model score. We discuss the obtained results in terms of human effectiveness and efficiency at detecting fraudulent transactions.

Table 1 shows the results for the conducted three-stage experiment (each stage reflects a group). Besides isolating the contributions of the different system components, this table also comprises the evaluation results of three popular post-hoc explanation methods, making it one of the most comprehensive evaluations and comparisons of XAI methods to date. Our results show that data alone induces better decisions, while showing model scores, or model scores with explanations, significantly improves the decision time. Our results suggest that, in practical settings where decision speed is a main requirement, ML model explanations carry a significant speed-up in human decision-making, as depicted in Figure 5. Additionally, data alone carries a better result in both accuracy and recall, registering even a significant difference in accuracy when compared to the group with the model score, as depicted in Figure 4. Finally, we provide insights about the variability and agreement of the different post-hoc explainers based on the explanations produced during the experiment.
We first analyse the difference between human decision-making with and without the presence of the ML model score. We evaluate H1; H6 is examined under the agreement measures mentioned in Section 3.1.

Table 1: Performance, time, and agreement metrics for all variants of the experiment. Statistical significance is tested between each explainer and each of the two groups that do not show explanations, or only among explainers. ★ indicates a significant difference with Data Only; no statistically significant difference was detected between each explainer and Data + ML Model Score; ♦ indicates a significant difference with all other explainers. The agreement metric is Fleiss' Kappa.

Group | Explainer | Sample Size | Accuracy (%) | Recall (%) | FPR (%) | Time (s) | Agreement
Data Only | - | 200 | | | | | 0.41
Data + ML Model Score | - | 200 | | | | | -0.02
Data + ML Model Score + Explanations | LIME | 300 | 58.59 | 27.03 | | | 0.53
Data + ML Model Score + Explanations | SHAP | 300 | | | | | 0.15
Data + ML Model Score + Explanations | TreeInterpreter | 300 | 56.52 | 25.55 | 12.67 | | 0.30

Showing end-users the ML Model Score improves average decision time over Data Only. Our results show that withholding the model score leads to significantly slower decisions. Using the Mann-Whitney U test, we detect a significant difference in decision times between Data Only and Data + ML Model Score (Table 1). A more thorough analysis of the performance metrics (see Figure 5) reveals an approximate decrease of 25% of the relative average time to decide when presenting information about the model score. When considering time as the performance metric, these results corroborate H1 (as defined in Section 4.1).

Showing end-users the ML Model Score deteriorates their accuracy over Data Only. Our results demonstrate that withholding information about the model score significantly improves the user's predictive accuracy. Table 1 shows that, after the application of the Chi-squared test, significant differences arise between Data Only and Data + ML Model Score. These results contradict H1 (when using accuracy as the performance metric). This might derive from the fact that the instances being reviewed are in a score band near the decision threshold and, therefore, have a higher associated uncertainty when being classified.

Showing end-users the ML Model Score does not significantly improve recall or FPR over Data Only. Our results do not exhibit statistically relevant improvements in terms of other user performance metrics like recall or FPR. Considering these as the desired performance metrics proves inconclusive and, therefore, does not suffice to support nor reject H1. In general, Figure 4 shows a degradation in all metrics derived from the confusion matrix when comparing the ML Model Score group to Data Only, as both recall and accuracy registered a loss of 10%, and FPR registered an increase of around 4 percentage points.
Showing the ML Model Score decreases agreement. The consensus among fraud analysts was shown to decrease as we incorporated more information. This is visible in Table 1, as the Fleiss' Kappa measurement went from 0.41 in the Data Only variant to −0.02 in the Data + ML Model Score variant. The former reflects a setting where users, on average, agreed on the transaction label 76.67% of the time, whereas in the latter they only agreed 63.33% of the time. This refutes the idea that showing more information would guide (or shape) users' thinking process by giving hints about relevant aspects and, consequently, disproves hypothesis H6. We hypothesize this large difference is due to (1) a too small agreement set and (2) a high proportion of transactions classified as legitimate (i.e., 77%), leading to extra sensitivity to disagreements about fraudulent transactions.

We further examine the performance differences between decision-making tasks involving Data Only and Data + ML Model Score + Explanations. In particular, we examine the impact of three distinct variants of the Data + ML Model Score + Explanations group: LIME, SHAP, and TreeInterpreter.
Showing post-hoc explanations significantly improves end-users' average speed over Data Only. Figure 5 shows the confidence intervals of decision time for each group. By running a multiple-group comparison using the Kruskal-Wallis H test, we observe statistically significant differences between the explainer-based variants and the Data Only group, which corroborates H2 (when the performance metric is the reviewing time). We also identify significant differences for every explainer when they are compared pairwise to the Data Only variant: using Holm-Bonferroni-corrected Mann-Whitney U tests, we obtain the smallest p-value for LIME and the largest, 0.09, for TreeInterpreter. When comparing against Data + ML Model Score, all explainers show increased decision time, but this difference is not statistically significant.
Different post-hoc explainers impact the end-users' decision speed differently. We also examine paired comparisons between the different explainers to address H3 in terms of decision efficiency. We detect significant differences when comparing LIME to TreeInterpreter and SHAP to TreeInterpreter. In other words, the results show that, among the three evaluated post-hoc explainers, TreeInterpreter potentiates significantly faster decision-making processes. These results corroborate H3 when considering the average review time as the fraud analysts' measure of performance.

Figure 4: Confidence intervals (95%) for each performance metric of all variants of the experiment. The interval is calculated through the beta distribution for the estimated parameter p of each metric.

Figure 5: Confidence intervals for the average decision time of each variant. The interval represents the standard error of the sample multiplied by 1.96, representing a 95% confidence interval, centered around the group's mean.
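The two interval constructions referenced in the figure captions can be reproduced as below. The Clopper-Pearson (equal-tailed beta) form is our reading of "the beta distribution for the estimated parameter p", and the 1.96 multiplier corresponds to a 95% normal-approximation interval; both are sketches rather than the exact plotting code.

```python
import numpy as np
from scipy.stats import beta

def time_ci(times, z=1.96):
    """95% normal-approximation interval for the mean decision time."""
    times = np.asarray(times, dtype=float)
    se = times.std(ddof=1) / np.sqrt(len(times))
    return times.mean() - z * se, times.mean() + z * se

def proportion_ci(successes, n, level=0.95):
    """Equal-tailed beta (Clopper-Pearson) interval for a proportion such as accuracy."""
    lo = beta.ppf((1 - level) / 2, successes, n - successes + 1) if successes > 0 else 0.0
    hi = beta.ppf(1 - (1 - level) / 2, successes + 1, n - successes) if successes < n else 1.0
    return lo, hi
```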
Showing post-hoc explanations does not significantly improve end-users' efficacy. In addition to efficiency, we examine the impact of showing explanations on human decision-making in terms of accuracy, FPR, and recall. As visible in Figure 4, all evaluated explainers are associated with deteriorated values for the predictive-accuracy metrics, except for the error-based metric, FPR. Effectively, although the values are not statistically significant, all explainers seem to lead to fewer false positives. Furthermore, as visible in Table 1, the multiple-group comparison Chi-squared test provided no conclusive results and, consequently, no paired tests were conducted between the explainer variants. Notwithstanding the lower accuracy and recall values of each explainer when compared to the Data Only variant, explainers were still able to improve upon the results obtained for the Data + ML Model Score variant, although this improvement was also not statistically significant. The obtained results disprove H2 and H3 when the considered performance metrics are either accuracy, FPR, or recall. Notwithstanding these results, we emphasize that, performance-wise, the decision time metric is the most volatile metric and, therefore, the most susceptible to vary during the experiment due to unaccounted external factors (such as connectivity issues or distractions).

Post-hoc explainers are perceived differently in terms of relevance, usefulness, and diversity by the end-users. We perform a multiple-group comparison Kruskal-Wallis H test to compare the results obtained with the questionnaire in Figure 2. While no significant result is detected for the first question, the test reveals significant differences for the second and third questions, that is, "The explanation helped me review faster." and "The explanation was useful to help me make a decision.". Figure 6 shows the distribution of the answers to the three questions posed during the last stage of the conducted experiment, discriminated by explainer. We observe that TreeInterpreter is the explainer with the most positive answers (blue), especially in the third question. We also notice the high number of neutral answers, neither, and the practically non-existent number of extreme answers, i.e., strongly agree or strongly disagree. We can further observe, in statistical terms, that in the second question (middle), LIME registers a significant difference when compared to both SHAP and TreeInterpreter. On the other hand, in the third question, no paired test registered a significant difference; in this question, TreeInterpreter is the explainer with results closest to significance. These results support H4, as the distinct explainers are indeed perceived differently by the users.

Showing explanations increases end-users' agreement over the same set of transactions.
We also examine the impact on the agreement of the fraud analysts' decisions. Table 1 shows LIME to be the only explainer capable of improving the agreement beyond the Data Only group. However, when compared with the Data + ML Model Score variant, all explainer variants seem to evoke more consensus among the fraud analysts. Quantitatively speaking, LIME achieves by far the best agreement result, with a Fleiss' Kappa of 0.53, and fraud analysts agree, on average, on 84.62% of the decisions. Also promising, but still inferior to the agreement achieved when all information is withheld from the user, is TreeInterpreter, with a Fleiss' Kappa of 0.30 and an average agreement of roughly 69%. SHAP registers the lowest agreement among the explainers, with a Fleiss' Kappa of 0.15 and an average agreement of roughly 64%. These results partially corroborate H5, as LIME actually seems to improve analysts' agreement; however, the same does not hold for the other explanation methods.

Figure 6: Distribution of answers to the feedback questionnaire.

We analyse the variety and agreement of the explanations used during the third stage of the experiment. To this end, we collect the explanations of the different evaluated explainers (LIME, SHAP, and TreeInterpreter) for every transaction of the experiment. Each explanation comprises the six feature-contribution pairs which are the basis of the explanations. To better comprehend the explainers' behavior, we measured the diversity of their explanations. This implies counting how many of the 111 available features are actually used to create the explanations: LIME showed the least diversity, using 34 features (30.6% of the total set of features), followed by SHAP, which used a total of 89 features (80.2% of the total set of features), and TreeInterpreter, which used a total of 107 features (96.4% of the total set of features). A lower number of used features translates into less variability in the explanations. This also ends up reflecting on the occurrence rate of the most popular feature (i.e., the feature used the most times to explain an instance), which in LIME occurred in 89.7% of the transactions, as opposed to the most common feature in TreeInterpreter, which only occurred in 45.3% of the explanations.

The agreement between explainers is calculated as the number of features two given explainers both choose to integrate in the explanation, normalized by the length of the explanation. For example, if, for an instance, LIME and SHAP had chosen 2 features in common to explain the instance score, and the other 4 features were different for each explainer, the agreement on that instance would be 33.3%. SHAP and TreeInterpreter registered the highest agreement, i.e., 53.0% of the features used by SHAP for a given explanation were also used by TreeInterpreter. Likewise, the other explainer pairs produce an agreement of 41.0% (between LIME and SHAP) and 23.5% (between LIME and TreeInterpreter). These results show that the output explanation for a given instance depends on the post-hoc method chosen to explain it, i.e., different explainers will choose different features to explain a given instance.
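The diversity and between-explainer agreement statistics reported above can be computed from the top-6 feature sets alone, as in the sketch below; the function names and data layout are ours.

```python
import numpy as np

def diversity(explanations, n_features_total=111):
    """Fraction of the available features an explainer ever uses across all explanations,
    plus the occurrence rate of its most frequent feature.
    `explanations` is a list of top-6 feature-name lists for one explainer."""
    used = [feature for top6 in explanations for feature in top6]
    _, counts = np.unique(used, return_counts=True)
    most_common_rate = counts.max() / len(explanations)
    return len(set(used)) / n_features_total, float(most_common_rate)

def explainer_agreement(expls_a, expls_b, k=6):
    """Per-instance overlap of two explainers' top-k feature sets, normalized by k and
    averaged over all instances (e.g., 2 shared features out of 6 gives 33.3%)."""
    overlaps = [len(set(a) & set(b)) / k for a, b in zip(expls_a, expls_b)]
    return float(np.mean(overlaps))
```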
In this section, we outline the main limitations of our empirical study. We had a constraint on the number of participants, as well as on their availability for the experiment, which in turn limits the sample size for the experiment. This has an impact on the detectable effect size, or on the error rates of the statistical tests. To perform tests with higher sensitivity to smaller changes in the measured metrics, it is necessary to increase the sample size. Another limitation of the study is that we cannot control all the possible external factors, such as the difficulty of the instance, user attention to the tested information (data, model score, and explanations), connectivity speed, among other factors. However, mitigating the effects of such unaccountable factors is only possible when running large-scale randomized controlled trials. This study showed no significant differences in the performance metrics derived from the confusion matrix between LIME, SHAP, and TreeInterpreter, using the same explanation format. A relevant follow-up study would explore how different configurations and visualizations alter the observed results.
The recent development of XAI methods has not been accompanied by a robust and practical assessment of their true impact on decision-making tasks. More often than not, the quality of these methods is measured through proxy desiderata (e.g., fidelity or robustness), hence failing to convey the actual impact on the end-users' performance (e.g., accuracy or decision time). The lack of awareness of the performance of the whole model + explanations + end-users system may result in sub-optimal decision processes.

With this work, we hope to fill in this gap by proposing XAI Test, an application-grounded evaluation methodology suited for isolating the true impact of different information levels (e.g., model score, explanations) in Human-AI collaborative systems. Following XAI Test, we conducted a user study to evaluate three well-known post-hoc explainability methods (i.e., LIME, SHAP, TreeInterpreter) on a real-world fraud detection task, encompassing 3 fraud analysts, an ML production model, and real-world data. Throughout the experiment, we progressively elevate the level of information presented to the analysts in three stages. We begin with information exclusively about the data (Data Only) and subsequently unveil information about the ML model score (Data + ML Model Score) and, in the last stage, about the explanations (Data + ML Model Score + Explanations). In the course of the experiment, we collect measures of the performance of the analysts as a function of the revealed information. These include the duration, the accuracy, recall, and FPR of the decisions made, as well as the users' feedback on the perceived utility of the explanations.

To the best of our knowledge, this is the first study to perform a quantitative benchmark of the impact of different explanation methods on human decision-making performance in a real-world setting (real task, real data, real users). We complement this analysis with a strong battery of statistical tests to strengthen the validity of our conclusions. The obtained results reveal that, when provided with Data Only information, fraud analysts decide significantly better but also more slowly when compared to variants that include information about the ML model. In this regard, our results show explanations (Data + ML Model Score + Explanations) to slightly improve the accuracy upon the Data + ML Model Score variant but to still fall short of the accuracy achieved in the Data Only setup. Finally, amongst the three evaluated explainers, the analysts identify LIME as the least-favoured explanation method, potentially due to its low explanation diversity.

In general, our results suggest an existing trade-off between effectiveness and efficiency as the analysts are provided with added levels of information. This raises awareness against blindly selecting popular post-hoc explanation methods in real-world decision-making settings.
The project CAMELOT (reference POCI-01-0247-FEDER-045915) leading to this work is co-financed by the ERDF - European Regional Development Fund through the Operational Program for Competitiveness and Internationalisation - COMPETE 2020, the North Portugal Regional Operational Program - NORTE 2020, and by the Portuguese Foundation for Science and Technology - FCT under the CMU Portugal international partnership.
REFERENCES
[1] General Data Protection Regulation. Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data, and repealing Directive 95/46. Official Journal of the European Union (OJ), 59(1-88):294, 2016.
[2] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. "Why should I trust you?": Explaining the predictions of any classifier, 2016.
[3] Scott M. Lundberg and Su-In Lee. A unified approach to interpreting model predictions. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 4765–4774. Curran Associates, Inc., 2017.
[4] Joy Buolamwini and Timnit Gebru. Gender shades: Intersectional accuracy disparities in commercial gender classification. In Conference on Fairness, Accountability and Transparency, pages 77–91, 2018.
[5] Finale Doshi-Velez and Been Kim. Towards a rigorous science of interpretable machine learning, 2017.
[6] Richard Tomsett, Dave Braines, Dan Harborne, Alun Preece, and Supriyo Chakraborty. Interpretable to whom? A role-based model for analyzing interpretable machine learning systems. arXiv preprint arXiv:1806.07552, 2018.
[7] Sina Mohseni, Niloofar Zarei, and Eric D. Ragan. A multidisciplinary survey and framework for design and evaluation of explainable AI systems. arXiv, pages arXiv–1811, 2018.
[8] Kasun Amarasinghe, Kit Rodolfa, Hemank Lamba, and Rayid Ghani. Explainable machine learning for public policy: Use cases, gaps, and research directions. arXiv preprint arXiv:2010.14374, 2020.
[9] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. "Why should I trust you?": Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '16, pages 1135–1144, New York, NY, USA, 2016. ACM.
[10] Scott Lundberg, Gabriel Erion, Hugh Chen, Alex DeGrave, Jordan Prutkin, Bala Nair, Ronit Katz, Jonathan Himmelfarb, Nisha Bansal, and Su-In Lee. From local explanations to global understanding with explainable AI for trees. Nature Machine Intelligence, 2, 01 2020.
[11] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. Anchors: High-precision model-agnostic explanations. In AAAI Conference on Artificial Intelligence (AAAI), 2018.
[12] Avanti Shrikumar, Peyton Greenside, and Anshul Kundaje. Learning important features through propagating activation differences. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 3145–3153, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR.
[13] Scott M. Lundberg, Gabriel G. Erion, and Su-In Lee. Consistent individualized feature attribution for tree ensembles. arXiv preprint arXiv:1802.03888, 2018.
[14] Gregory Plumb, Denali Molitor, and Ameet S. Talwalkar. Model agnostic supervised local explanations. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 3–8 December 2018, Montréal, Canada, pages 2520–2529, 2018.
[15] David Alvarez-Melis and Tommi S. Jaakkola. On the robustness of interpretability methods. CoRR, abs/1806.08049, 2018.
[16] Menaka Narayanan, Emily Chen, Jeffrey He, Been Kim, Sam Gershman, and Finale Doshi-Velez. How do humans understand explanations from machine learning systems? An evaluation of the human-interpretability of explanation. CoRR, abs/1802.00682, 2018.
[17] William W. Cohen and Yoram Singer. A simple, fast, and effective rule learner. In Proceedings of the Sixteenth National Conference on Artificial Intelligence and the Eleventh Innovative Applications of Artificial Intelligence Conference, AAAI '99/IAAI '99, pages 335–342, Menlo Park, CA, USA, 1999. American Association for Artificial Intelligence.
[18] Jerome H. Friedman and Bogdan E. Popescu. Predictive learning via rule ensembles. Ann. Appl. Stat., 2(3):916–954, 09 2008.
[19] Krzysztof Dembczynski, Wojciech Kotlowski, and Roman Slowinski. Maximum likelihood rule ensembles. In Machine Learning, Proceedings of the Twenty-Fifth International Conference (ICML 2008), Helsinki, Finland, June 5–9, 2008, pages 224–231, 2008.
[20] Himabindu Lakkaraju, Stephen H. Bach, and Jure Leskovec. Interpretable decision sets: A joint framework for description and prediction. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1675–1684. ACM, 2016.
[21] Joel Vaughan, Agus Sudjianto, Erind Brahimi, Jie Chen, and Vijayan N. Nair. Explainable neural networks based on additive index models. CoRR, abs/1806.01933, 2018.
[22] Xuezhou Zhang, Sarah Tan, Paul Koch, Yin Lou, Urszula Chajewska, and Rich Caruana. Axiomatic interpretability for multiclass additive models. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD '19, pages 226–234, New York, NY, USA, 2019. ACM.
[23] Rich Caruana, Paul Koch, Yin Lou, Marc Sturm, Johannes Gehrke, and Noemie Elhadad. Intelligible models for healthcare: Predicting pneumonia risk and hospital 30-day readmission. In KDD '15, August 10–13, 2015, Sydney, NSW, Australia. ACM, August 2015.
[24] Zachary C. Lipton. The mythos of model interpretability. Queue, 16(3):31–57, 2018.
[25] Cynthia Rudin and Yaron Shaposhnik. Globally-consistent rule-based summary-explanations for machine learning models: Application to credit-risk evaluation. SSRN Electronic Journal, 01 2019.
[26] Ivan Sanchez, Tim Rocktaschel, Sebastian Riedel, and Sameer Singh. Towards extracting faithful and descriptive representations of latent variable models. In AAAI Spring Symposium on Knowledge Representation and Reasoning (KRR): Integrating Symbolic and Neural Approaches, 2015.
[27] Sarah Tan, Rich Caruana, Giles Hooker, and Yin Lou. Distill-and-compare: Auditing black-box models using transparent model distillation. In Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society, AIES 2018, New Orleans, LA, USA, February 02–03, 2018, pages 303–310, 2018.
[28] David Alvarez-Melis and Tommi Jaakkola. Towards robust interpretability with self-explaining neural networks. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 7775–7784. Curran Associates, Inc., 2018.
[29] David Alvarez-Melis and Tommi S. Jaakkola. On the robustness of interpretability methods. CoRR, abs/1806.08049, 2018.
[30] Amirata Ghorbani, Abubakar Abid, and James Y. Zou. Interpretation of neural networks is fragile. In The Thirty-Third AAAI Conference on Artificial Intelligence, AAAI 2019, The Thirty-First Innovative Applications of Artificial Intelligence Conference, IAAI 2019, The Ninth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2019, Honolulu, Hawaii, USA, January 27 – February 1, 2019, pages 3681–3688, 2019.
[31] Been Kim, Martin Wattenberg, Justin Gilmer, Carrie Jun Cai, James Wexler, Fernanda Viegas, and Rory Abbott Sayres. Interpretability beyond feature attribution: Quantitative testing with concept activation vectors (TCAV). 2018.
[32] Finale Doshi-Velez and Been Kim. Towards a rigorous science of interpretable machine learning. arXiv preprint arXiv:1702.08608, 2017.
[33] Tim Miller, Piers Howe, and Liz Sonenberg. Explainable AI: Beware of inmates running the asylum or: How I learnt to stop worrying and love the social and behavioural sciences. arXiv preprint arXiv:1712.00547, 2017.
[34] W. James Murdoch, Chandan Singh, Karl Kumbier, Reza Abbasi-Asl, and Bin Yu. Definitions, methods, and applications in interpretable machine learning. Proceedings of the National Academy of Sciences, 116(44):22071–22080, Oct 2019.
[35] Zachary C. Lipton. The doctor just won't accept that! arXiv preprint arXiv:1711.08037, 2017.
[36] Philipp Schmidt and Felix Biessmann. Quantifying interpretability and trust in machine learning systems. arXiv preprint arXiv:1901.08558, 2019.
[37] Hilde J. P. Weerts, Werner van Ipenburg, and Mykola Pechenizkiy. Case-based reasoning for assisting domain experts in processing fraud alerts of black-box machine learning models, 2019.
[38] Shi Feng and Jordan Boyd-Graber. What can AI do for me? Evaluating machine learning interpretations in cooperative play. In Proceedings of the 24th International Conference on Intelligent User Interfaces, IUI '19, pages 229–239, New York, NY, USA, 2019. Association for Computing Machinery.
[39] Vivian Lai and Chenhao Tan. On human predictions with explanations and predictions of machine learning models: A case study on deception detection. In Proceedings of the Conference on Fairness, Accountability, and Transparency, FAT* '19, pages 29–38, New York, NY, USA, 2019. Association for Computing Machinery.
[40] Joseph L. Fleiss and Jacob Cohen. The equivalence of weighted kappa and the intraclass correlation coefficient as measures of reliability. Educational and Psychological Measurement, 33(3):613–619, 1973.
[41] Karl Pearson. X. On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 50(302):157–175, 1900.
[42] William H. Kruskal and W. Allen Wallis. Use of ranks in one-criterion variance analysis. Journal of the American Statistical Association, 47(260):583–621, 1952.
[43] H. B. Mann and D. R. Whitney. On a test of whether one of two random variables is stochastically larger than the other. Annals of Mathematical Statistics, 18:50–60, 1947.
[44] Sture Holm. A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics, 6(2):65–70, 1979.
[45] Andrey Kolmogorov. Sulla determinazione empirica di una legge di distribuzione. Inst. Ital. Attuari, Giorn., 4:83–91, 1933.
[46] Sture Holm. A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics, 6(2):65–70, 1979.
[47] Leo Breiman. Random forests.