Benchmarking and Survey of Explanation Methods for Black Box Models
Francesco Bodria, Fosca Giannotti, Riccardo Guidotti, Francesca Naretto, Dino Pedreschi, Salvatore Rinzivillo
Francesco Bodria, Fosca Giannotti, Riccardo Guidotti, Francesca Naretto, Dino Pedreschi, and Salvatore Rinzivillo
Scuola Normale Superiore, Pisa, Italy, {name.surname}@sns.it
ISTI-CNR, Pisa, Italy, {name.surname}@isti.cnr.it
Largo Bruno Pontecorvo, Pisa, Italy, {name.surname}@unipi.it
Abstract.
The widespread adoption of black-box models in Artificial Intelligence has enhanced the need for explanation methods to reveal how these obscure models reach specific decisions. Retrieving explanations is fundamental to unveil possible biases and to resolve practical or ethical issues. Nowadays, the literature offers a wide range of methods returning different types of explanations. We provide a categorization of explanation methods based on the type of explanation returned. We present the most recent and widely used explainers, and we show a visual comparison among explanations and a quantitative benchmarking.
Keywords:
Explainable Artificial Intelligence, Interpretable Machine Learning, Transparent Models
Today AI is one of the most important scientific and technological areas, with a tremendous socio-economic impact and a pervasive adoption in many fields of modern society. The impressive performance of AI systems in prediction, recommendation, and decision making support is generally reached by adopting complex Machine Learning (ML) models that "hide" the logic of their internal processes. As a consequence, such models are often referred to as "black-box models" [59,47,95]. Examples of black-box models used within current AI systems include deep learning models and ensembles such as bagging and boosting models. The high performance of such models in terms of accuracy has fostered the adoption of non-interpretable ML models even if the opaqueness of black-box models may hide potential issues inherited by training on biased or unfair data [77]. Thus there is a substantial risk that relying on opaque models may lead to adopting decisions that we do not fully understand or, even worse, that violate ethical principles. Companies are increasingly embedding ML models in their AI products and applications, incurring a potential loss of safety and trust [32]. These risks are particularly relevant in high-stakes decision making scenarios, such as medicine, finance, and automation. In 2018, the European Parliament introduced in the GDPR a set of clauses for automated decision-making in terms of a right of explanation for all individuals to obtain "meaningful explanations of the logic involved" when automated decision making takes place (https://ec.europa.eu/justice/smedataprotect/). Also, in 2019, the High-Level Expert Group on AI presented the ethics guidelines for trustworthy AI (https://ec.europa.eu/digital-single-market/en/news/ethics-guidelines-trustworthy-ai). Despite divergent opinions among legal scholars regarding these clauses [53,121,35], everybody agrees that the need for the implementation of such a principle is urgent and that it is a huge open scientific challenge.
As a reaction to these practical and theoretical ethical issues, in the last years we have witnessed the rise of a plethora of explanation methods for black-box models [59,3,13], both from academia and from industry. Thus, eXplainable Artificial Intelligence (XAI) [87] emerged as the field investigating methods to produce or complement AI so as to make the internal logic and the outcome of a model accessible and interpretable, making such process human understandable.
This work aims to provide a fresh account of the ideas and tools supported by the current explanation methods, or explainers, from the perspective of the different explanations they offer. We categorize explanations w.r.t. their nature, providing a comprehensive ontology of the explanations offered by available explainers and taking into account the three most popular data formats: tabular data, images, and text. We also report extensive examples of the various explanations, together with qualitative comparisons and a quantitative numerical benchmark of some of the explanation methods aimed at testing their faithfulness, stability, robustness, and running time.
This work extends and completes "A Survey Of Methods For Explaining Black-Box Models", appeared in ACM Computing Surveys (CSUR), 51(5), 1-42 [59].
The rest of the paper is organized as follows.
Section 2 summarizes existing surveys on explainability in AI and interpretability in ML and highlights the differences between this work and previous ones. Then, Section 3 presents the proposed categorization based on the type of explanation returned by the explainer and on the data format under analysis. Sections 4, 5, 6 present the details of the most recent and widely adopted explanation methods together with a qualitative and quantitative comparison. Finally, Section 8 summarizes the crucial aspects that emerged from the analysis of the state of the art and future research directions.
The widespread need for XAI in the last years has caused an explosion of interest in the design of explanation methods [52]. For instance, the books [90,105] present in detail the most well-known methodologies to make general machine learning models interpretable [90] and to explain the outcomes of deep neural networks [105].
In [59], the classification is based on four categories of problems, and the explanation methods are classified according to the problem they are able to solve. The first distinction is between explanation by design (also named intrinsic interpretability) and black-box explanation (also named post-hoc interpretability [3,92,26]). The second distinction in [59] further classifies the black-box explanation problem into model explanation, outcome explanation and black-box inspection. Model explanation, achieved by global explainers [36], aims at explaining the whole logic of a model. Outcome explanation, achieved by local explainers [102,84], aims at understanding the reasons for a specific outcome. Finally, the aim of black-box inspection is to retrieve a visual representation for understanding how the black-box works. Another crucial distinction highlighted in [86,59,3,44,26] is between model-specific and model-agnostic explanation methods. This classification depends on whether the technique adopted to explain can work only on a specific black-box model or can be adopted on any black-box.
In [50], the focus is on proposing a unified taxonomy to classify the existing literature. The following key terms are defined: explanation, interpretability and explainability. An explanation answers a "why question", justifying an event. Interpretability consists of describing the internals of a system in a way that is understandable to humans. A system is called interpretable if it produces descriptions that are simple enough for a person to understand using a vocabulary that is meaningful to the user. An alternative, but similar, classification of definitions is presented in [13], with a specific taxonomy for explainers of deep learning models. The leading concept of the classification is Responsible Artificial Intelligence, i.e., a methodology for the large-scale implementation of AI methods in real organizations with fairness, model explainability, and accountability at its core. Similarly to [59], in [13] the term interpretability (or transparency) is used to refer to a passive characteristic of a model that makes sense to a human observer. On the other hand, explainability is an active characteristic of a model, denoting any action taken with the intent of clarifying or detailing its internal functions. Further taxonomies and definitions are presented in [92,26]. Another branch of the literature focuses on the quantitative and qualitative evaluation of explanation methods [105,26]. Finally, we highlight that the literature reviews related to explainability focus not just on ML and AI but also on social studies [87,24], recommendation systems [131], model agents [10], and domain-specific applications such as health and medicine [117].
In this survey we decided to rewrite the taxonomy proposed in [59] from a data type perspective. In light of the works mentioned above, we believe that an updated systematic categorization of explanation methods based on the type of explanation returned, together with a comparison of the explanations, is still missing in the literature.
This paper aims to categorize explanation methods with respect to the type of explanation returned, to present the most widely adopted quantitative evaluation measures for validating explanations under different aspects, and to benchmark the explainers adopting these measures. The objective is to provide the reader with a guide to map a black-box model to a set of compatible explanation methods.
Table 1: Examples of explanations for the different data types and explanation types.
TABULAR / Rule-Based (RB): a set of premises that the record must satisfy in order to meet the rule's consequence. Example: r = {Education ≤ College} → "≤ 50K".
IMAGE / Saliency Maps (SM): a map which highlights the contribution of each pixel to the prediction.
TEXT / Sentence Highlighting (SH): a map which highlights the contribution of each word to the prediction.

TABULAR / Feature Importance (FI): a vector containing a value for each feature. Each value indicates the importance of the feature for the classification.
IMAGE / Concept Attribution (CA): computes the attribution to a target "concept" given by the user. For example, how sensitive is the output (a prediction of zebra) to a concept (the presence of stripes)?
TEXT / Attention Based (AB): this type of explanation gives a matrix of scores which reveals how the words in the sentence are related to each other.

ALL / Prototypes (PR): the user is provided with a series of examples that characterize a class of the black box. Examples: p = {Age ∈ [35, ...], Education ∈ [College, Master]} → "≥ 50K" (tabular); p = image → "cat" (image); p = "... not bad ..." → "positive" (text).

ALL / Counterfactuals (CF): the user is provided with a series of examples similar to the input query but with a different class prediction. Examples: q = {Education ≤ College} → "≤ 50K" and c = {Education ≥ Master} → "≥ 50K" (tabular); q = image → "3" and c = image → "8" (image); q = "The movie is not that bad" → "positive" and c = "The movie is that bad" → "negative" (text).

Furthermore, we systematically present a qualitative comparison of the explanations that also helps understand how to read the explanations returned by the different methods.

In this survey, we present explanations and explanation methods acting on the three principal data types recognized in the literature: tabular data, images and text [59]. In particular, for each of these data types, we distinguish the different types of explanations illustrated in Table 1. A table at the beginning of each subsequent section summarizes the explanation methods by grouping them according to the classification illustrated in Table 1. Besides, in every section we present the meaning of each type of explanation. The acronyms reported in capital letters in Table 1, in this section and in the following ones are used in the remainder of the work to quickly categorize the various explanations and explanation methods. We highlight that the nature of this work is tied to testing the available libraries and toolkits for XAI. Therefore, the presentation of the existing methods focuses on the most recent works (specifically from 2018 to the date of writing) and on those papers providing a usable implementation that is nowadays widely adopted.

All the experiments in the next sections are performed on a server with GPU: 1x Tesla K80 (compute capability 3.7, 2496 CUDA cores, 12 GB GDDR5 VRAM) and CPU: 1x single-core hyper-threaded Xeon (1 core, 2 threads) with 16 GB of RAM, or on a server with CPU: 16x Intel(R) Xeon(R) Gold 5120 CPU @ 2.20GHz (64 bits) and 63 GB of RAM. The code for reproducing the results is available at https://github.com/kdd-lab/XAI-Survey.
Fig. 1: Existing taxonomy for the classification of explanation methods.
In this section, we synthetically recall the existing taxonomies and classifications of XAI methods present in the literature [59,3,50,13,105,26] to allow the reader to complement the proposed explanation-based categorization of explanation methods. We summarize the fundamental distinctions adopted to annotate the methods in Figure 1.
The first distinction separates explainable by design methods from black-box explanation methods:
– Explainable by design methods are INtrinsically (IN) explainable methods that return a decision, and the reasons for the decision are directly accessible because the model is transparent.
– Black-box explanation methods are Post-Hoc (PH) explanation methods that provide explanations for a non-interpretable model that takes the decisions.
The second differentiation distinguishes post-hoc explanation methods into global and local:
– Global (G) explanation methods aim at explaining the overall logic of a black-box model. Therefore the explanation returned is a global, complete explanation valid for any instance;
– Local (L) explainers aim at explaining the reasons for the decision of a black-box model on a specific instance.
The third distinction categorizes the methods into model-agnostic and model-specific:
– Model-Agnostic (A) explanation methods can be used to interpret any type of black-box model;
– Model-Specific (S) explanation methods can be used to interpret only a specific type of black-box model.
To provide the reader with a self-contained review of XAI methods, we complete this section by rephrasing succinctly and unambiguously the definitions of explanation, interpretability, transparency, and complexity:
– Explanation [13,59] is an interface between humans and an AI decision-maker that is both comprehensible to humans and an accurate proxy of the AI. Consequently, explainability is the ability to provide a valid explanation.
– Interpretability [59], or comprehensibility [51], is the ability to explain or to provide the meaning in terms understandable to a human. Interpretability and comprehensibility are normally tied to the evaluation of the model complexity.
– Transparency [13], or equivalently understandability or intelligibility, is the capacity of a model to be interpretable itself. Thus, the model allows a human to understand its functioning without explaining its internal structure or the algorithmic means by which the model processes data internally.
– Complexity [42] is the degree of effort required by a user to comprehend an explanation. The complexity can consider the user background or eventual time limitations necessary for the understanding.
The validity and the utility of explanation methods should be evaluated in terms of goodness, usefulness, and satisfaction of the explanations. In the following, we describe a selection of established methodologies for the evaluation of explanation methods, both from the qualitative and the quantitative point of view. Moreover, depending on the kind of explainers under analysis, additional evaluation criteria may be used.
Qualitative evaluation is important to understand the actual usability of explanations from the point of view of the end-user: whether they satisfy human curiosity, convey meaning, and support safety, social acceptance and trust. In [42] a systematization of evaluation criteria into three major categories is proposed:
1. Functionally-grounded metrics aim to evaluate the interpretability by exploiting some formal definitions that are used as proxies. They do not require humans for validation. The challenge is to define the proxy to employ, depending on the context. As an example, we can validate the interpretability of a model by showing the improvements w.r.t. another model already proven to be interpretable by human-based experiments.
2. Application-grounded evaluation methods require human experts able to validate the specific task and explanation under analysis [124,114]. They are usually employed in specific settings. For example, if the model is an assistant in the decision making process of doctors, the validation is done by the doctors.
3. Human-grounded metrics evaluate the explanations through humans who are not experts. The goal is to measure the overall understandability of the explanation in simplified tasks [78,73]. This validation is most appropriate for testing general notions of the quality of an explanation.
Moreover, in [42,43] several other aspects are considered: the form of the explanation; the number of elements the explanation contains; the compositionality of the explanation, such as the ordering of FI values; the monotonicity between the different parts of the explanation; uncertainty and stochasticity, which take into account how the explanation was generated, such as the presence of random generation or sampling.
In quantitative evaluation, the focus is on the performance of the explainer and on how close the explanation method f is to the black-box model b. Concerning quantitative evaluation we can consider two different types of criteria:
1. Completeness w.r.t. the black-box model. The metrics aim at evaluating how closely f approximates b.
2. Completeness w.r.t. a specific task. The evaluation criteria are tailored for a particular task or behavior.
In the first criterion, we group the metrics that are often used in the literature [102,103,58,115]. One of the most used metrics in this setting is the fidelity, which aims to evaluate how good f is at mimicking the black-box decisions. There are different specializations of fidelity, depending on the type of explainer under analysis [58]. For example, in methods where a surrogate model g is created to mimic b, fidelity compares the predictions of b and g on the instances Z used to train g.
Another measure of completeness w.r.t. b is the stability, which aims at validating how consistent the explanations are for similar records. The higher the value, the better the model is at presenting similar explanations for similar inputs. Stability can be evaluated by exploiting a local Lipschitz estimate [8]: L_x = max_{x' ∈ N_x} ||e_x − e_{x'}|| / ||x − x'||, where x is the explained instance, e_x its explanation, and N_x is a neighborhood of instances x' similar to x.
Besides the synthetic ground truth experimentation proposed in [55], a strategy to validate the correctness of an explanation e = f(b, x) is to remove the features that the explanation method f found important and observe how the performance of b degrades. These metrics are called deletion and insertion [97]. The intuition behind deletion is that removing the "cause" will force the black-box to change its decision. Among the deletion methods there is the faithfulness [8], which is tailored for FI explainers. It aims to validate whether the relevance scores indicate true importance: we expect higher importance values for attributes that greatly influence the final prediction. Given a black-box b and the feature importance e extracted from an importance-based explainer f, the faithfulness method incrementally removes each of the attributes deemed important by f. At each removal, the effect on the performance of b is evaluated. These values are then employed to compute the overall correlation between feature importance and model performance. This metric corresponds to a value between −1 and 1. The monotonicity is an implementation of an insertion method: it evaluates the effect on b of incrementally adding each attribute in order of increasing importance. In this case, we expect that the black-box performance increases by adding more and more features, thereby resulting in monotonically increasing model performance. Finally, other standard metrics, such as accuracy, precision and recall, are often evaluated to test the performance of the explanation methods. The running time is also an important evaluation criterion. An implementation of the faithfulness is available in aix360, presented in Section 7.
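The deletion-style faithfulness metric can be made concrete with a short sketch. This is not the aix360 implementation: it assumes a scikit-learn-style classifier `model` exposing predict_proba, a precomputed importance vector `importance` for the instance `x`, and a `baseline` vector (e.g., the training means) used to "remove" features one at a time.

```python
import numpy as np
from scipy.stats import pearsonr

def faithfulness(model, x, importance, baseline):
    """Correlation between feature importances and the confidence drop observed
    when each feature is individually replaced by a baseline value."""
    x = np.asarray(x, dtype=float)
    p = model.predict_proba(x.reshape(1, -1))[0]
    cls = int(np.argmax(p))          # index of the predicted class for x
    p_orig = p[cls]                  # original confidence on that class
    drops = []
    for i in range(len(x)):
        x_pert = x.copy()
        x_pert[i] = baseline[i]      # "remove" feature i
        p_pert = model.predict_proba(x_pert.reshape(1, -1))[0, cls]
        drops.append(p_orig - p_pert)
    corr, _ = pearsonr(np.abs(np.asarray(importance, dtype=float)), np.asarray(drops))
    return corr                      # in [-1, 1]; higher means importances track true influence
```

A typical choice for the baseline is `baseline = X_train.mean(axis=0)`; other deletion variants replace features with zeros or with samples from the marginal distribution.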
Table 2: Summary of methods for explaining black-boxes for tabular data. The methods are sorted by explanation type: Features Importance (FI), Rule-Based (RB), Counterfactuals (CF), Prototypes (PR), and Decision Tree (DT). For every method we report the data type on which it can be applied, only tabular (TAB) or any data (ANY); whether it is an Intrinsic model (IN) or a Post-Hoc one (PH); whether it is a Local method (L) or a Global one (G); and finally whether it is model-Agnostic (A) or model-Specific (S).
Type | Name | Ref. | Authors | Year | Data Type | IN/PH | G/L | A/S | Code
FI | shap | [84] | Lundberg et al. | 2017 | ANY | PH | G/L | A | link
FI | lime | [102] | Ribeiro et al. | 2016 | ANY | PH | L | A | link
FI | lrp | [17] | Bach et al. | 2015 | ANY | PH | L | A | link
FI | dalex | [19] | Biecek et al. | 2020 | ANY | PH | L/G | A | link
FI | nam | [6] | Agarwal et al. | 2020 | TAB | PH | L | S | link
FI | ciu | [9] | Anjomshoae et al. | 2020 | TAB | PH | L | A | link
FI | maple | [99] | Plumb et al. | 2018 | TAB | PH/IN | L | A | link
RB | anchor | [103] | Ribeiro et al. | 2018 | TAB | PH | L/G | A | link
RB | lore | [58] | Guidotti et al. | 2018 | TAB | PH | L | A | link
RB | slipper | [34] | Cohen et al. | 1999 | TAB | IN | L | S | link
RB | lri | [123] | Weiss et al. | 2000 | TAB | IN | L | S | -
RB | mlrule | [39] | Domingos et al. | 2008 | TAB | IN | G/L | S | link
RB | rulefit | [48] | Friedman et al. | 2008 | TAB | IN | G/L | S | link
RB | scalable-brl | [127] | Yang et al. | 2017 | TAB | IN | G/L | A | -
RB | rulematrix | [88] | Ming et al. | 2018 | TAB | PH | G/L | A | link
RB | ids | [78] | Lakkaraju et al. | 2016 | TAB | IN | G/L | S | link
DT | trepan | [36] | Craven et al. | 1996 | TAB | PH | G | S | link
DT | dectext | [22] | Boz et al. | 2002 | TAB | PH | G | S | -
DT | msft | [31] | Chipman et al. | 1998 | TAB | PH | G | S | -
DT | cmm | [41] | Domingos et al. | 1998 | TAB | PH | G | S | -
DT | sta | [132] | Zhou et al. | 2016 | TAB | PH | G | S | -
RB | skoperule | [48] | Gardin et al. | 2020 | TAB | PH | L/G | A | link
RB | glocalx | [107] | Setzu et al. | 2019 | TAB | PH | L/G | A | link
PR | mmd-critic | [74] | Kim et al. | 2016 | ANY | IN | G | S | link
PR | protodash | [61] | Gurumoorthy et al. | 2019 | TAB | IN | G | A | link
PR | tsp | [116] | Tan et al. | 2020 | TAB | PH | L | S | -
PR | ps | [20] | Bien et al. | 2011 | TAB | IN | G/L | S | -
CF | cem | [40] | Dhurandhar et al. | 2018 | ANY | PH | L | S | link
CF | dice | [91] | Mothilal et al. | 2020 | ANY | PH | L | A | link
CF | face | [100] | Poyiadzi et al. | 2020 | ANY | PH | L | A | -
CF | cfx | [7] | Albini et al. | 2020 | TAB | PH | L | S | -

In this Section we present a selection of approaches for explaining decision systems acting on tabular data. In particular, we present the following types of explanations based on:
Features Importance (FI, Section 4.1),
Rules (RB, Section 4.2),
Prototypes (PR) and
Counterfactuals (CF) (Section 4.3).
Table 2 summarizes and categorizes the explainers. After the presentation of the explanation methods, we report experiments obtained from their application on two datasets: adult (https://archive.ics.uci.edu/ml/datasets/adult) and german (https://archive.ics.uci.edu/ml/datasets/statlog+(german+credit+data)). We trained the following ML models: Logistic Regression (LG), XGBoost (XGB), and CatBoost (CAT).

Feature importance is one of the most popular types of explanation returned by local explanation methods. The explainer assigns to each feature an importance value which represents how much that particular feature was important for the prediction under analysis. Formally, given a record x, an explainer f(·) models a feature importance explanation as a vector e = {e_1, e_2, ..., e_m}, in which each value e_i ∈ e is the importance of the i-th feature for the decision made by the black-box model b(x). For understanding the contribution of each feature, the sign and the magnitude of each value e_i are considered. W.r.t. the sign, if e_i < 0, the feature contributes negatively to the outcome y; otherwise, if e_i > 0, the feature contributes positively. The magnitude, instead, represents how great the contribution of the feature is to the final prediction y. In particular, the greater the value of |e_i|, the greater its contribution. Hence, when e_i = 0 the i-th feature shows no contribution to the output decision. An example of a feature-based explanation is e = {age = 0.8, income = 0.0, education = −0.2}, y = deny: age is the most important feature for the decision deny, income is not affecting the outcome, and education has a small negative contribution.

Fig. 2: TOP: lime applied to the same record for adult (a/b) and german (c/d): a/c are the LG model explanations and b/d the CAT model explanations. All the models correctly predicted the output class. BOTTOM: force plots returned by shap explaining XGB on two records of adult: (e) labeled as class 1 (>50K) and (f) labeled as class 0 (≤50K). Only the features that contributed more (i.e., higher shap values) to the classification are reported.

LIME, Local Interpretable Model-agnostic Explanations [102], is a model-agnostic explanation approach which returns explanations as feature importance vectors. The main idea of lime is that the explanation may be derived locally from records generated randomly in the neighborhood of the instance to be explained. The key factor is that it samples instances both in the vicinity of x (which have a high weight) and far away from x (low weight), exploiting π_x, a proximity measure able to capture the locality. We denote by b the black-box and by x the instance we want to explain. To learn the local behavior of b, lime draws samples weighted by π_x. It samples these instances around x by drawing nonzero elements of x uniformly at random. This gives lime a perturbed sample of instances {z ∈ R^d} to feed to the model b and obtain b(z). They are then used to train the explanation model g(·): a sparse linear model on the perturbed samples. The local feature importance explanation consists of the weights of the linear model. A number of papers focus on overcoming the limitations of lime, providing several variants of it. dlime [130] is a deterministic version in which the neighbors are selected from the training data by agglomerative hierarchical clustering. ilime [45] randomly generates the synthetic neighborhood using weighted instances. alime [108] runs the random data generation only once at "training time". kl-lime [96] adopts a Kullback-Leibler divergence to explain Bayesian predictive models. qlime [23] also considers nonlinear relationships using a quadratic approximation.

In Figure 2 we report examples of lime explanations relative to our experimentation on adult (top) and german (bottom); for reproducibility reasons, we fixed the random seed. We fed the same record into two black-boxes, and then we explained it. Interestingly, for adult, lime considers a similar set of features as important (even if with different values of importance) for the two models: out of 6 features, only one differs. A different scenario is obtained by the application of lime on german: different features are considered necessary by the two models. However, the confidence of the prediction between the two models is quite different: both of them predict the output label correctly, but CAT has a higher value, suggesting that this could be the cause of the differences between the two explanations.
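As a concrete illustration, the following is a minimal usage sketch of the lime library on tabular data; the variable names (model, X_train, feature_names, class_names, x) are assumptions, not the actual experimental code.

```python
from lime.lime_tabular import LimeTabularExplainer

# X_train: training data as a numpy array; model: any fitted classifier
explainer = LimeTabularExplainer(
    X_train,
    feature_names=feature_names,
    class_names=class_names,
    mode="classification",
)

# Explain a single record x: lime perturbs x, queries the black-box,
# and fits a weighted sparse linear model on the perturbed samples.
exp = explainer.explain_instance(x, model.predict_proba, num_features=6)
print(exp.as_list())  # [(feature condition, importance weight), ...]
```

The returned weights are the local feature importances described above; their sign and magnitude are read exactly as in the example e = {age, income, education} → deny.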
SHAP, SHapley Additive exPlanations [84], is a local agnostic explanation method which comes with several explanation models. All of them compute shap values: a unified measure of feature importance based on the Shapley values, a concept from cooperative game theory (we refer the interested reader to https://christophm.github.io/interpretable-ml-book/shapley.html). In particular, the different explanation models proposed by shap differ in how they approximate the computation of the shap values. All the explanation models provided by shap are called additive feature attribution methods and respect the following definition: g(z') = φ_0 + Σ_{i=1}^{M} φ_i z'_i, where z' ≈ x is a simplified binary input, z' ∈ {0,1}^M, the φ_i ∈ R are effects assigned to each feature, and M is the number of simplified input features. shap has three properties: (i) local accuracy, meaning that g(x) matches b(x); (ii) missingness, which allows features with x_i = 0 to have no attributed impact on the shap values; (iii) consistency, meaning that if a model changes so that the marginal contribution of a feature value increases (or stays the same), the shap value also increases (or stays the same). The construction of the shap values allows to employ them both locally, in which each observation gets its own set of shap values, and globally, by exploiting collective shap values. There are 5 strategies to compute shap values: KernelExplainer, LinearExplainer, TreeExplainer, GradientExplainer, and DeepExplainer. In particular, the KernelExplainer is an agnostic method while the others are specifically designed for different kinds of ML models.

Fig. 3: shap applied on adult: a record labelled >50K (top-left) and one labelled ≤50K (top-right), obtained applying the TreeExplainer on an XGB model and then the decision plot, in which all the input features are shown. At the bottom, the application of shap to explain the outcome of a set of records by XGB on adult; the interaction values among the features are reported.

In our experiments with shap we applied: (i) the LinearExplainer to the LG models, (ii) the TreeExplainer to the XGB models and (iii) the KernelExplainer to the CAT models. In Figure 2 we report the application of shap on adult through force plots. The plot shows how each feature contributes to pushing the output model value away from the base value, which is an average among the training dataset's output model values. The red features are pushing the output value higher while the ones in blue are pushing it lower. For each feature, the actual value for the record under analysis is reported. Only the features with the highest shap values are shown in this plot. In the first force plot, the features that are pushing the value higher are contributing more to the output value, as can be noted by comparing the base value (0.18) with the higher actual output value. In the second force plot, the output value is pushed below the base value, and it is interesting to see that only Age, Relationship and Hours Per Week are contributing to pushing it lower. Figure 3 (left and center) depicts the decision plots: in this case, we can see the contribution of all the input features in decreasing order of importance. In particular, the line represents the feature importance for the record under analysis, and it starts at the corresponding observation's predicted value. In the first plot, predicted as class >50K, the feature Occupation is the most important, followed by Age and Relationship. For the second plot, instead, Age, Relationship and Hours Per Week are the most important features. Besides the local explanations, shap also offers a global interpretation of the model driven by the local interpretations. Figure 3 (right) reports a global decision plot that represents the feature importance of 30 records of adult. Each line represents a record, and the predicted value determines the color of the line.
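A hedged sketch of the TreeExplainer workflow described above (not the exact code used to produce the figures), assuming a fitted tree-based classifier `model` and a pandas DataFrame `X` with the adult features:

```python
import shap

# model: a fitted tree ensemble (e.g., an XGBoost classifier); X: feature matrix
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)   # one importance value per feature per record
                                         # (may be a list of arrays for some model types)

# Local view: force plot for a single record (expected_value is the base value)
shap.force_plot(explainer.expected_value, shap_values[0], X.iloc[0, :], matplotlib=True)

# Global view: summary of shap values over the whole dataset
shap.summary_plot(shap_values, X)
```

The force plot corresponds to the red/blue bars discussed above, while summary and decision plots aggregate the per-record shap values into a global picture.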
DALEX [19] is a post-hoc, local and global, agnostic explanation method. Regarding local explanations, dalex contains an implementation of a variable attribution approach [104]. It consists of a decomposition of the model's predictions, in which each decomposition can be seen as a local gradient and used to identify the contribution of each attribute. Moreover, dalex contains the ceteris-paribus profiles, which allow for a what-if analysis by examining the influence of a variable while fixing the others. Regarding the global explanations, dalex contains different exploratory tools: model performance measures, variable importance measures, residual diagnostics, and partial dependence plots.

Fig. 4: Explanations of dalex for two records of adult, b(x) = 0 (≤50K) (left) and b(x) = 1 (>50K) (right), to explain an XGB model in the form of Shapley values (top) and break-down plots (bottom). The y-axis reports the feature importance, the x-axis the positive/negative contribution.

Fig. 5: TOP: results of ebm on adult: overall global explanation (left), example of a global explanation for education number (right). BOTTOM: local explanations of ebm on adult: left, a record classified as 1 (>50K); right, a record classified as 0 (≤50K).

In Figure 4 we report some local explanations obtained by the application of dalex to an XGB model on adult. On the left are reported two explanation plots for a record classified as class >50K. On the top, there is a visualization based on Shapley values, which highlights Age (35 years old) as the most important feature, followed by occupation. At the bottom, there is a break-down plot, in which the green bars represent positive changes in the mean predictions, while the red ones are negative changes. The plot also shows the intercept, which is the overall mean value of the predictions. It is interesting to see that Age and occupation are the most important features that positively contributed to the prediction in both plots. In contrast, Sex is positively important for the Shapley values but negatively important for the break-down plot. On the right part of Figure 4 we report a record classified as ≤50K. In this case, there are important differences in the features considered most important by the two methods: for the Shapley values, Age and Relationship are the two most important features, while in the break-down plot Hours Per Week is the most important one.
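A minimal sketch of how the dalex Python package is typically driven; `model`, `X`, `y` and `x_obs` are assumed to be a fitted classifier, the training data, the labels and a single observation, respectively (this is an illustration, not the paper's experimental code).

```python
import dalex as dx

explainer = dx.Explainer(model, X, y, label="XGB on adult")

# Local explanations of one observation: Shapley values and a break-down plot
shap_parts = explainer.predict_parts(x_obs, type="shap")
bd_parts = explainer.predict_parts(x_obs, type="break_down")
print(bd_parts.result)   # per-variable contributions, including the intercept
bd_parts.plot()          # the green/red break-down chart discussed above

# Global tooling: permutation-based variable importance
vi = explainer.model_parts()
vi.plot()
```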
CIU, Contextual Importance and Utility [9], is a local, agnostic explanation method. ciu is based on the idea that the context, i.e., the set of input values being tested, is a key factor in generating faithful explanations. The authors suggest that a feature that may be important in a context may be irrelevant in another one. ciu explains the model's outcome based on the contextual importance (CI), which approximates the overall importance of a feature in the current context, and on the contextual utility (CU), which estimates how good the current feature values are for a given output class. Technically, ciu computes the values of CI and CU by exploiting Monte Carlo simulations. We highlight that this method does not require creating a simpler model to employ for deriving the explanations.
NAM, Neural Additive Models [6], is a different extension of gam. This method aims to combine the performance of powerful models, such as deep neural networks, with the inherent intelligibility of generalized additive models. The result is a model able to learn graphs that describe how the prediction is computed. nam trains multiple deep neural networks in an additive fashion such that each neural network attends to a single input feature.

Fig. 6: Explanations of anchor and lore for two records of adult to explain an XGB model: for each record, the anchor rule, the lore factual rule, and the lore counterfactual rules are reported.
Decision rules give the end-user an explanation about the reasons that lead to the final prediction. The majority of explanation methods for tabular data fall in this category since decision rules are human-readable. A decision rule r, also called factual or logic rule [58], has the form p → y, in which p is a premise, composed of a Boolean condition on feature values, while y is the consequence of the rule. In particular, p is a conjunction of split conditions of the form x_i ∈ [v_i^(l), v_i^(u)], where x_i is a feature and v_i^(l), v_i^(u) are lower and upper bound values in the domain of x_i extended with ±∞. An instance x satisfies r, or r covers x, if every Boolean condition of p evaluates to true for x. If the instance x to explain satisfies p, the rule p → y represents a candidate explanation of the decision g(x) = y. Moreover, if the interpretable predictor mimics the behavior of the black-box in the neighborhood of x, we further conclude that the rule is a candidate local explanation of b(x) = g(x) = y. We highlight that, in the context of rules, we can also find the so-called counterfactual rules [58]. Counterfactual rules have the same structure as decision rules, with the only difference that the consequence y of the rule is different w.r.t. b(x) = y. They are important to explain to the end-user what should be changed to obtain a different output. An example of a rule explanation is r = {age < 25, income < 30k, education ≤ Bachelor} → deny; the record {age = 18, income = 15k, education = Highschool} satisfies the rule above. A possible counterfactual rule, instead, can be: r = {income > 30k, education ≥ Bachelor} → allow.

ANCHOR [103] is a model-agnostic system that outputs rules as explanations. The approach's name comes from the output rules, called anchors. The idea is that, for decisions on which the anchor holds, changes in the rest of the instance's feature values do not change the outcome. Formally, given a record x, r is an anchor if r(x) = b(x). To obtain the anchors, anchor perturbs the instance x, obtaining a set of synthetic records employed to extract anchors with precision above a user-defined threshold. First, since the synthetic generation of the dataset may lead to a massive number of samples, anchor exploits a multi-armed bandit algorithm [72]. Second, since the number of all possible anchors is exponential, anchor uses a bottom-up approach and a beam search. Figure 6 reports some rules obtained by applying anchor to an XGB model on adult. The first rule has a high precision and involves Relationship and Education Num, which are the features highlighted by most of the explanation models proposed so far. In particular, in this case, for having a classification >50K, the Relationship should be husband and the Education Num at least a bachelor degree. Education Num can also be found in the second rule, in which case it has to be less than or equal to College, followed by the Maritial Status, which can be anything other than married with a civilian. This rule has an even better precision.
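To make the rule formalism above concrete, here is a minimal sketch of how a factual rule p → y and its coverage test can be represented in code; the feature names and thresholds are illustrative assumptions, not rules taken from the experiments, and this is not how anchor or lore store rules internally.

```python
import math
from dataclasses import dataclass
from typing import Dict, Tuple

@dataclass
class Rule:
    premises: Dict[str, Tuple[float, float]]  # feature -> (lower, upper) bounds
    consequence: str                          # predicted label y

    def covers(self, record: Dict[str, float]) -> bool:
        # r covers x if every split condition v_l <= x_i <= v_u evaluates to true
        return all(lo <= record[f] <= up for f, (lo, up) in self.premises.items())

# Illustrative rule: Age >= 34 and EducationNum >= 13  ->  ">50K"
r = Rule(premises={"Age": (34, math.inf), "EducationNum": (13, math.inf)},
         consequence=">50K")

x = {"Age": 41, "EducationNum": 14, "HoursPerWeek": 45}
print(r.covers(x), "->", r.consequence)   # True -> >50K
```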
Fig. 7: skoperules global explanations of XGB on adult: on the left, a rule for class >50K; on the right, a rule for class ≤50K (with the corresponding precision, recall and coverage).

LORE, LOcal Rule-based Explainer [58], is a local agnostic method that provides faithful explanations in the form of rules and counterfactual rules. lore is tailored explicitly for tabular data. It exploits a genetic algorithm for creating the neighborhood of the record to explain. Such a neighborhood produces a more faithful and dense representation of the vicinity of x w.r.t. lime. Given a black-box b and an instance x, with b(x) = y, lore first generates a synthetic set Z of neighbors through a genetic algorithm. Then, it trains a decision tree classifier g on this set labeled with the black-box outcome b(Z). From g, it retrieves an explanation that consists of two components: (i) a factual decision rule, which corresponds to the path on the decision tree followed by the instance x to reach the decision y, and (ii) a set of counterfactual rules, which have a different classification w.r.t. y. This counterfactual rule set shows the conditions that can be varied on x in order to change the output decision. In Figure 6 we report the factual and counterfactual rules of lore for the explanation of the same records shown for anchor. It is interesting to note that, differently from anchor and the other models proposed above, lore explanations focus more on Education Num, Occupation, Capital Gain and Capital Loss, while the features about the relationship are not present.
RuleMatrix [88] is a post-hoc agnostic explainer tailored for the visualization of the extracted rules. First, given a training dataset and a black-box model, rulematrix executes a rule induction step, in which a rule list is extracted by sampling the input data and their labels as predicted by the black-box. Then, the extracted rules are filtered based on thresholds of confidence and support. Finally, rulematrix outputs a visual representation of the rules. The user interface allows for several analyses based on plots and metrics, such as fidelity.
One of the most popular ways of generating rules is extracting them from a decision tree. In particular, due to the method's simplicity and interpretability, decision trees are used to explain the overall behavior of black-box models. Many works in this setting are model-specific in order to exploit some structural information of the black-box model under analysis.
TREPAN [36] is a model-specific global explainer tailored for neural networks. Given a neural network b, trepan generates a decision tree g that approximates the network by maximizing the gain ratio and the model fidelity.
DecText [22] is a global model-specific explainer tailored for neural networks. The aim of dectext is to find the most relevant features. To achieve this goal, dectext resembles trepan, with the difference that it considers four different splitting methods. Moreover, it also adopts a pruning strategy based on fidelity to reduce the size of the final explanation tree. In this way, dectext can maximize the fidelity while keeping the model simple.
MSFT [31] is a model-specific global post-hoc explanation method that outputs decision trees starting from random forests. It is based on the observation that, even if random forests contain hundreds of different trees, these trees are quite similar, differing only in a few nodes. Hence, the authors proposed dissimilarity metrics to summarize the random forest trees using a clustering method. Then, for each cluster, an archetype is retrieved as an explanation.
CMM, Combined Multiple Model procedure [41], is a model-specific global post-hoc explanation method for tree ensembles. The key point of cmm is data enrichment. In fact, given an input dataset X, cmm first modifies it n times. On the n variants of the dataset, it learns a black-box. Then, random records are generated and labeled using a bagging strategy on the black-boxes. In this way, the authors are able to increase the size of the dataset used to build the final decision tree.
STA, Single Tree Approximation [132], is a model-specific global post-hoc explanation method tailored for random forests, in which the decision tree used as an explanation is constructed by exploiting hypothesis testing to find the best splits.
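The decision-tree surrogates above share a common core idea that can be sketched with scikit-learn. This is not trepan (there is no gain-ratio/fidelity-driven expansion and no synthetic sampling), just a plain global surrogate tree fitted on the black-box labels; `black_box` and `X` are assumed to be a fitted classifier and its training data.

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Label the data with the black-box predictions, not the ground truth
y_bb = black_box.predict(X)

surrogate = DecisionTreeClassifier(max_depth=4)  # a small depth keeps the tree readable
surrogate.fit(X, y_bb)

# Fidelity: how often the surrogate reproduces the black-box decisions
fidelity = accuracy_score(y_bb, surrogate.predict(X))
print(f"surrogate fidelity w.r.t. the black-box: {fidelity:.3f}")
```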
SkopeRules (https://skope-rules.readthedocs.io/en/latest/skope_rules.html) is a post-hoc, agnostic model, both global and local, based on the rulefit [48] idea of defining an ensemble method and then extracting the rules from it. skope-rules employs fast algorithms such as bagging or gradient boosted decision trees. After extracting all the possible rules, skope-rules removes redundant or too similar rules based on a similarity threshold. Differently from rulefit, the scoring method does not solve the L1 regularization; instead, the weights are assigned depending on the precision score of the rule. We can employ skope-rules in two ways: (i) as an explanation method for the input dataset, which describes, by rules, the characteristics of the dataset; (ii) as a transparent method, by outputting the rules employed for the prediction. Moreover, with skope-rules it is possible to explain, using rules, the entire dataset without considering the output labels, or to obtain a set of rules for each output class. We tested both options, but we report only the case of rules for each class: in Figure 7 we report the rule extracted with the highest precision and recall for each class of adult. Similarly to the models analyzed so far, we can find Relationship and Education among the features in the rules. In particular, for the first rule, for >50K, the Education has to be at least a Bachelor degree, while for the other class it has to be at least fifth or sixth grade. Interestingly, Capital Gain and Capital Loss are also mentioned, which were considered as important by few models, such as lore. We also tested skope-rules to create a rule-based classifier, obtaining a precision of 0.68 on adult.
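A minimal usage sketch of the skope-rules package (class SkopeRules in the skrules module); the dataset variables and parameter values are assumptions rather than the settings used in the experiments.

```python
from skrules import SkopeRules

# X, y: tabular features and binary labels (e.g., adult with income > 50K as the positive class)
clf = SkopeRules(
    feature_names=feature_names,  # used to render human-readable rules
    precision_min=0.6,            # keep only rules above these thresholds
    recall_min=0.05,
    n_estimators=30,
)
clf.fit(X, y)

# Each extracted rule comes with its performance (precision, recall, number of occurrences)
for rule, perf in clf.rules_[:3]:
    print(rule, perf)
```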
Scalable-BRL [127] is an interpretable probabilistic rule-based classifier that optimizes the posterior probability of a Bayesian hierarchical model over the rule lists. The theoretical part of this approach is based on [81]. The particularity of scalable-brl is that it is scalable, thanks to a specific bit-vector manipulation.
GLocalX [1] is a rule-based explanation method which exploits a novel approach: the local-to-global paradigm. The idea is to derive a global explanation by subsuming local logical rules. GLocalX starts from an array of factual rules and, following a hierarchical bottom-up fashion, merges rules covering similar records and expressing the same conditions. GLocalX finds the smallest possible set of rules that is (i) general, meaning that the rules should apply to a large subset of the dataset, and (ii) highly accurate. The final explanation proposed to the end-user is a set of rules. In [1] the authors validated the model in constrained settings: limited or no access to data or local explanations. A simpler version of GLocalX is presented in [107]: here, the final set of rules is selected through a scoring system based on rule generality, coverage, and accuracy.
A prototype, also called archetype or artifact, is an object representing a set of similar records. It can be (i) a record from the training dataset close to the input data x; (ii) a centroid of a cluster to which the input x belongs; or (iii) even a synthetic record, generated following some ad-hoc process. Depending on the explanation method considered, different definitions and requirements to find a prototype are considered. Prototypes serve as examples: the user understands the model's reasoning by looking at records similar to his/hers.

MMD-CRITIC [74] is a "before the model" methodology, in the sense that it only analyses the distribution of the dataset under analysis. It produces prototypes and criticisms as explanations for a dataset using the Maximum Mean Discrepancy (MMD). The former explain the dataset's general behavior, while the latter represent points that are not well explained by the prototypes. mmd-critic selects prototypes by measuring the difference between the distribution of the selected instances and the distribution of the whole dataset: the instances closest to the data distribution are called prototypes, and the farthest are called criticisms. mmd-critic shows only minority data points that differ substantially from the prototype but belong to the same category. For criticisms, mmd-critic selects points from parts of the dataset underrepresented by the prototypes, with an additional constraint to ensure the criticisms are diverse.
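The prototype-selection idea behind mmd-critic can be sketched with a simplified greedy procedure (this is not the authors' implementation): under an RBF kernel, repeatedly add the point that most reduces the squared MMD between the dataset and the current prototype set. `X` is assumed to be a small numpy array of records.

```python
import numpy as np

def rbf_kernel(A, B, gamma=0.5):
    # Pairwise RBF kernel between rows of A and rows of B
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def greedy_mmd_prototypes(X, n_prototypes, gamma=0.5):
    """Greedily pick points whose kernel mean best matches the data,
    i.e. points that (approximately) minimize MMD^2(data, prototypes)."""
    K = rbf_kernel(X, X, gamma)
    n = len(X)
    selected = []
    for _ in range(n_prototypes):
        best_j, best_score = None, -np.inf
        for j in range(n):
            if j in selected:
                continue
            cand = selected + [j]
            m = len(cand)
            # MMD^2 terms depending on the candidate set (the data-data term is constant)
            cross = K[:, cand].sum() / (n * m)            # data vs prototypes
            within = K[np.ix_(cand, cand)].sum() / m ** 2  # prototypes vs prototypes
            score = 2 * cross - within                      # maximizing this minimizes MMD^2
            if score > best_score:
                best_j, best_score = j, score
        selected.append(best_j)
    return X[selected], selected
```

Criticisms can then be chosen, in the same spirit, as the points worst represented by the selected prototypes.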
ProtoDash [61] is a variant of mmd-critic. It is an explainer that employs prototypical examples and criticisms to explain the input dataset. Differently w.r.t. mmd-critic, protodash associates non-negative weights with the prototypes, which indicate the importance of each prototype. In this way, it can reflect even some complicated structures.

Privacy-Preserving Explanations [21] is a local post-hoc agnostic explainability method which outputs prototypes and shallow trees as explanations. It is the first approach that considers the concept of privacy in explainability by producing privacy-protected explanations. To achieve a good trade-off between privacy and comprehensibility of the explanation, the authors construct the explainer by employing micro-aggregation to preserve privacy. In this way, the authors obtain a set of clusters, each with a representative record c_i, where i is the i-th cluster. From each cluster, a shallow decision tree is extracted to provide an exhaustive explanation while having good comprehensibility due to the limited depth of the trees. When a new record x arrives, a representative record and its associated shallow tree are selected: in particular, the representative c_i closest to x is selected, depending on the decision of the black-box.

PS, Prototype Selection [20], is an interpretable model composed of two parts. First, ps seeks a set of prototypes that better represent the data under analysis. It uses a set cover optimization problem with some constraints on the properties the prototypes should have. Each record in the original input dataset D is then assigned to a representative prototype. Then, the prototypes are employed to learn a nearest neighbor rule classifier.

TSP, Tree Space Prototype [116], is a local, post-hoc and model-specific approach, tailored for explaining random forests and gradient boosted trees. The goal is to find prototypes in the tree space of the tree ensemble b. Given a notion of proximity between trees, with variants depending on the kind of ensemble, tsp is able to extract prototypes for each class. Different variants are proposed to allow the selection of a different number of prototypes for each class.

Counterfactuals describe a dependency on the external facts that led to a particular decision made by the black-box model. They focus on the differences to obtain the opposite prediction w.r.t. b(x) = y. Counterfactuals are often addressed as the prototypes' opposite. In [122] the general form a counterfactual explanation should have is formalized: b(x) = y was returned because the variables of x have values x_1, x_2, ..., x_n. Instead, if x had values x'_1, x'_2, ..., x'_n and all the other variables had remained constant, b(x') = ¬y would have been returned, where x' is the record x with the suggested changes. An ideal counterfactual should alter the values of the variables as little as possible to find the closest setting under which ¬y is returned instead of y. Regarding counterfactual explainers, we can divide them into three categories: exogenous, which generate the counterfactuals synthetically; endogenous, in which the counterfactuals are drawn from a reference population, and hence can produce more realistic instances w.r.t. the exogenous ones; and instance-based, which exploit a distance function to detect the decision boundary of the black-box.
There are several desiderata in this context: efficiency, robustness, diversity, actionability, and plausibility, among others [122,71,69]. To better understand the complex context and the many available possibilities, we refer the interested reader to [15,120,25]. In [25] a study is presented that evaluates the understandability of factual and counterfactual explanations. The authors analyzed the mental model theory, which states that people construct models that simulate the assertions described. They conducted experiments on a group of people, highlighting that people prefer reasoning using mental models and find it challenging to consider probability, calculus, and logic. There are many works in this area of research; hence, we briefly present only the most representative methods in this category.

MAPLE [99] is a post-hoc local agnostic explanation method that can also be used as a transparent model due to its internal structure. It combines random forests with feature selection methods to return feature importance based explanations. maple is based on two methods: SILO and DStump. SILO is employed to obtain a local training distribution based on the random forest leaves. DStump, instead, ranks the features by importance. maple considers the best k features from DStump to solve a weighted linear regression problem. In this case, the explanation consists of the coefficients of the local linear model, i.e., the estimated local effect of each feature.
CEM, Contrastive Explanations Method [40], is a local, post-hoc and model-specific explanation method, tailored for neural networks, which outputs contrastive explanations. cem has two components: Pertinent Positives (PP), which can be seen as prototypes and are the minimal and sufficient factors that have to be present to obtain the output y; and Pertinent Negatives (PN), which are counterfactual factors that should be minimally and necessarily absent. cem is formulated as an optimization problem over the perturbation variable δ. In particular, given x to explain, cem considers x' = x + δ, where δ is a perturbation applied to x. During the process, there are two values of δ to minimize: δ_p for the pertinent positives, and δ_n for the pertinent negatives. cem solves the optimization problem with a variant that employs an autoencoder to evaluate the closeness of x' to the data manifold. ceml [14] is also a Python toolbox for generating counterfactual explanations, suitable for ML models designed in TensorFlow, Keras, and PyTorch.
DICE, Diverse Counterfactual Explanations [91], is a local, post-hoc and agnostic method which solves an optimization problem with several constraints to ensure feasibility and diversity when returning counterfactuals. Feasibility is critical in the context of counterfactuals since it allows avoiding examples that are unfeasible. As an example, consider the case of a classifier that determines whether to grant loans. If the classifier denies the loan to an applicant with a low salary, the cause may be low income. However, a counterfactual such as "You have to double your salary" may be unfeasible, and hence it is not a satisfactory explanation. The feasibility is achieved by imposing some constraints on the optimization problem: the proximity constraint, from [122], the sparsity constraint, and then user-defined constraints. Besides feasibility, another essential factor is diversity, which provides different ways of changing the outcome class.
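A minimal usage sketch of the dice-ml package for the scenario just described; the DataFrame `df`, the outcome column name, the continuous feature names and the fitted sklearn `model` are assumptions for illustration.

```python
import dice_ml

# df: pandas DataFrame containing the features plus the outcome column
d = dice_ml.Data(dataframe=df,
                 continuous_features=["age", "hours_per_week", "capital_gain"],
                 outcome_name="income")
m = dice_ml.Model(model=model, backend="sklearn")  # model: fitted sklearn classifier
exp = dice_ml.Dice(d, m, method="random")

# Generate three diverse counterfactuals that flip the predicted class
query = df.drop(columns="income").iloc[[0]]
cfs = exp.generate_counterfactuals(query, total_CFs=3, desired_class="opposite")
cfs.visualize_as_dataframe(show_only_changes=True)
```

Feasibility constraints (e.g., features that must not change, or permitted ranges) can be passed to generate_counterfactuals to keep the suggested changes actionable.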
FACE, Feasible and Actionable Counterfactual Explanations [100], is a local, post-hoc agnostic explanation method that focuses on returning "achievable" counterfactuals. Indeed, face uncovers "feasible paths" for generating counterfactuals; these feasible paths are the shortest paths defined via density-weighted metrics. It can extract counterfactuals that are coherent with the input data distribution. face generates a graph over the data points, and the user can select the prediction, the density, the weights, and a condition function. face updates the graph according to these constraints and applies the shortest path algorithm to find all the data points that satisfy the requirements.
CFX [7] is a local, post-hoc, and model-specific method that generates counterfactual explanations for Bayesian network classifiers. The explanations are built from relations of influence between variables, indicating the reasons for the classification. In particular, this method's main achievement is that it can find pivotal factors for the classification task: these factors, if removed, would give rise to a different classification.
In this section we present some transparent methods tailored for tabular data. In particular, we first present some models which output feature importance, then methods which output rules.
EBM , Explainable Boosting Machine [93] is an interpretable ML algorithm. Technically, ebm is a variant of a Generalized Additive Model ( gam ) [64], i.e., a generalized linear model thatincorporates nonlinear forms of the predictors. For each feature, ebm uses a boosting procedure totrain the generalized linear model: it cycles over the features, in a round-robin fashion, to train onefeature function at a time and mitigate the effects of co-linearity. In this way, the model learns thebest set of feature functions, which can be exploited to understand how each feature contributesto the final prediction. ebm is implemented by the interpretml
We trained an ebm on adult. Figure 5 shows a global explanation reporting the importance of each feature used by ebm: we observe that Marital Status is the most important feature, followed by Relationship and Age. Figure 5 also shows an inspection of the feature Education Number, illustrating how the prediction score changes depending on the value of the feature, together with two examples of local explanations for ebm. For the first record, predicted as >50K, the most important feature is Education Num, whose value is Master for this record; for the second record, predicted as <=50K, the most important feature is Relationship. This feature is important for both records: in the first (Husband) it pushes the prediction higher, while in the second (Own-child) it pushes it lower.
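As an illustration, the following is a hedged sketch of how such an ebm and its global and local explanations can be obtained with the interpretml library; the variable names for the adult splits are placeholders.

```python
# Sketch using the interpret library; X_train, y_train, X_test, y_test are the adult splits (assumed).
from interpret import show
from interpret.glassbox import ExplainableBoostingClassifier

ebm = ExplainableBoostingClassifier()
ebm.fit(X_train, y_train)

show(ebm.explain_global())                       # per-feature importances and shape functions
show(ebm.explain_local(X_test[:2], y_test[:2]))  # local explanations for two records, as in Figure 5
```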
TED [65] is an intrinsically transparent approach that requires as input a training dataset in which each record is paired with its explanation. Explanations can be of any type, such as rules or feature importance. For the training phase, the framework allows using any ML model capable of dealing with multilabel classification. In this way, the model can classify the input record and correlate it with its explanation. A possible limitation of this approach is the need to create the explanations to feed the training phase. ted is implemented in aix360. SLIPPER [34] is a transparent rule learner based on a modified version of AdaBoost. It outputs compact and comprehensible rules by imposing constraints on the rule builder.
LRI [123] is a transparent rule learner that achieves good performance while providing interpretable rules as explanations. In lri, each class of the training set is represented by an unordered set of rules. The rules are obtained by an induction method that adaptively weights the cumulative error, without pruning. When a new record is considered, all the available rules are tested on it, and the output class is the one whose set of rules is most satisfied by the record under analysis.
MlRules [39] is a transparent rule-induction algorithm that solves classification tasks through probability estimation. Rule induction is performed with boosting strategies, while maximum likelihood estimation is applied for rule generation.
RuleFit [48] is a transparent rule learner that exploits an ensemble of trees. As a first step, it creates an ensemble model using gradient boosting. The rules are then extracted from the ensemble: each path in each tree is a rule. After the extraction, the rules are weighted according to an optimization problem based on L1 regularization.
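A simplified sketch of this two-step idea follows (not the reference implementation): each leaf of a gradient-boosted ensemble is treated as a binary rule, and an L1-regularized linear model selects and weights the rules. X_train and y_train are assumed training data.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder

# Step 1: tree ensemble; every leaf reached by a record acts as a binary rule feature.
gb = GradientBoostingClassifier(n_estimators=50, max_depth=3).fit(X_train, y_train)
leaves = gb.apply(X_train).reshape(len(X_train), -1)          # leaf index per tree
encoder = OneHotEncoder(handle_unknown="ignore").fit(leaves)
R_train = encoder.transform(leaves)                           # one binary column per rule

# Step 2: L1 regularization keeps only a small, readable subset of rules.
linear = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(R_train, y_train)
kept_rules = np.flatnonzero(linear.coef_[0])
print(f"{kept_rules.size} rules kept out of {R_train.shape[1]}")
```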
IDS, Interpretable Decision Sets [78], is a transparent and highly accurate model based on decision sets. Decision sets are sets of independent, short, accurate, and non-overlapping if-then rules; hence, each rule can be applied independently.
Table 3: Comparison on the fidelity and the faithfulness metrics of different explanation methods. For every evaluation we report the mean and the standard deviation over a subset of 50 test set records.
Fidelity (lime, shap, anchor, lore) and Faithfulness (lime, shap), per dataset and black-box:
adult — LG: 0.979 0.613 (0.37); XGB: 0.977 0.877 0.978 (0.49); CAT: 0.96 0.777 0.988 (0.37)
german — LG: (0.60) 0.19 (0.63); XGB: (0.21); CAT: 0.979 0.670 0.620 (0.32)
Table 4: Comparison on the stability metric. We report the mean and the standard deviation over a subset of 30 test records.
Dataset  Black-Box  lime           shap           anchor           lore
adult    LG         24.37 (2.74)   1.52 (4.49)    22.36 (8.37)     21.76 (11.80)
adult    XGB        10.16 (6.48)   2.17 (2.18)    26.53 (13.08)    30.01 (20.52)
adult    CAT        0.35 (0.43)    0.03 (0.01)    6.51 (4.40)      27.80 (70.05)
german   LG         18.87 (0.73)   19.01 (23.44)  101.07 (62.75)   622.12 (256.70)
german   XGB        26.08 (14.50)  38.43 (30.66)  121.40 (98.43)   725.81 (337.26)
german   CAT        2.49 (9.91)    15.92 (10.71)  123.79 (76.86)   756.70 (348.21)
We validated the explanation methods by considering the two most important metrics in the context of tabular data: fidelity and stability. In particular, we evaluated lime, shap, anchor, and lore. The fidelity results are reported in Table 3. The fidelity values are relatively high for all the methods, highlighting that the local surrogate models are good at mimicking their black-box models. Regarding the feature importance-based methods, lime shows higher fidelity than shap, especially on adult (since shap does not train a local surrogate, we evaluate its fidelity by learning a classifier on the sum of the shap values). In particular, shap has lower values for the CAT models (on both german and adult), suggesting that it may not be good at explaining this kind of ensemble model. Concerning the rule-based methods, the fidelity is high for both of them; however, anchor shows lower fidelity for the CAT model on german, a behavior similar to that of shap. We also compared lime and shap on faithfulness and monotonicity. Overall, we did not find any model to be monotonic, and hence we do not report those results. The faithfulness results are reported in Table 3. On adult, the faithfulness is quite low, especially for lime; the model with the highest faithfulness is CAT explained by shap. On german, instead, the values are higher, highlighting a better faithfulness overall; also on this dataset shap has better faithfulness than lime. Table 4 reports the results of the stability analysis. For this metric, a high value means high instability, i.e., we can obtain quite different explanations for similar inputs. None of the methods is remarkably stable according to this metric.
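For reference, one common formulation of the two metrics used here is sketched below: fidelity is the agreement between the surrogate and the black box on the synthetic neighborhood, while (in)stability is a local Lipschitz-style ratio comparing explanations of similar inputs (higher means less stable, as in Table 4). The function signatures are illustrative, not tied to any specific library.

```python
import numpy as np

def fidelity(black_box, surrogate, Z):
    """Fraction of neighborhood points Z on which the local surrogate agrees with the black box."""
    return np.mean(black_box.predict(Z) == surrogate.predict(Z))

def instability(explain, x, X_similar):
    """Local Lipschitz-style estimate: how much the explanation changes for similar inputs."""
    e_x = explain(x)
    return max(np.linalg.norm(e_x - explain(z)) / np.linalg.norm(x - z) for z in X_similar)
```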
Runtime Analysis
Table 5 shows the explanation runtime approximated as order of magnitude. Overall, feature importance explanation algorithms are faster than rule-based ones. In particular, shap is the most efficient, followed by lime. We remark that the computation time of lore depends on the number of neighbors generated by its genetic algorithm (in this case, we considered 1000 samples). anchor, instead, requires a minimum precision, as does skoperule (we selected a minimum precision of 0. ).
Discussion
In the context of tabular data, many explanation methods have been proposed. The most explored area is that of feature importance-based explanators, such as lime and shap. These methods provide an importance value for each feature of the input. They are suitable for domain experts who know the meaning of the features employed, but they may be too difficult for a common end-user to understand, especially when obtaining such importance values is complex. In contrast, rule-based explanations, prototypes, and counterfactuals are more suitable for the common end-user, due to their logical structure and the similarity-by-example they exploit. This is particularly true when decision rules are complemented by counterfactual ones, as in lore: the end-user can understand why she received a given outcome, and she also obtains a suggestion about what to change to achieve another classification. However, fewer methods have been proposed in this context than for feature importance explanations. In particular, the majority of rule-based and prototype-based explanators are intrinsic; the few post-hoc ones, on average, require more time to provide an explanation than feature importance methods. Regarding the post-hoc prototype-based models, there are some interesting approaches, but no code is available for them, highlighting that they are still at an early stage of development. During the past few years, counterfactuals have attracted particularly great interest. Overall, even if rules, prototypes, and counterfactuals seem to be the best solution, there are still several open questions and challenges in this research area, such as improving the efficiency and the accuracy of these explanation algorithms, as well as considering the constraints of the domain in which the model is employed.
Table 5: Explanation runtime expressed in seconds for explainers of tabular classifiers, approximated as order of magnitude.
Dataset Black-Box lime shap dalex anchor lore skoperule
adult
LG 0.1
90 2 15 100
XGB
german
LG 0.007
Table 6: Explainers for black-boxes classifying image data, sorted by explanation type: Saliency Maps (SM), Concept Attributions (CA), Counterfactuals (CF), and Prototypes (PR). For every method we indicate whether it is applicable to images (IMG) only or to ANY type of data, whether it is Intrinsic (IN) or Post-Hoc (PH), Local (L) or Global (G), and whether it is model-Agnostic (A) or model-Specific (S).
Type  Name          Ref.    Authors               Year  Data Type  IN/PH  G/L  A/S  Code
SM    shap          [84]    Lundberg et al.       2017  ANY        PH     L    A    link
SM    lime          [102]   Ribeiro et al.        2016  ANY        PH     L    A    link
SM    ε-lrp         [17]    Bach et al.           2015  ANY        PH     L    S    link
SM    intgrad       [115]   Sundararajan et al.   2017  ANY        PH     L    S    link
SM    deeplift      [110]   Shrikumar et al.      2017  ANY        PH     L    S    link
SM    smoothgrad    [112]   Smilkov et al.        2017  IMG        PH     L    S    link
SM    xrai          [70]    Kapishnikov et al.    2019  ANY        PH     L    S    link
SM    gradcam       [106]   Selvaraju et al.      2017  IMG        PH     L    S    link
SM    gradcam++     [27]    Chattopadhay et al.   2018  IMG        PH     L    S    link
SM    rise          [97]    Petsiuk et al.        2018  IMG        PH     L    S    link
CA    tcav          [75]    Kim et al.            2018  IMG        PH     L    A    link
CA    ace           [49]    Ghorbani et al.       2019  IMG        PH     G    A    link
CA    conceptshap   [129]   Yeh et al.            2020  IMG        PH     G    A    -
CA    cace          [54]    Goyal et al.          2019  IMG        IN     G    A    -
CF    cem           [40]    Dhurandhar et al.     2018  IMG        PH     L    A    link
CF    abele         [57]    Guidotti et al.       2020  IMG        PH     L    A    link
CF    l2x           [29]    Chen et al.           2018  ANY        PH     L    A    link
CF    guided proto  [118]   Van Looveren et al.   2019  IMG        PH     L    A    link
PR    mmd-critic    [74]    Kim et al.            2016  ANY        IN     G    A    link
PR    -             [76]    Koh et al.            2017  ANY        PH     L    A    link
PR    protonet      [28]    Chen et al.           2019  IMG        IN     G    S    link
This section presents the state-of-the-art solutions proposing explanations for decision systems acting on image data. In particular, we distinguish the following types of explanations: Saliency Maps (SM, Section 5.1), Concept Attribution (CA, Section 5.2), Prototypes (PR, Section 5.3), and Counterfactuals (CF, Section 5.4). Table 6 summarizes and categorizes the explanation methods acting on image data.
Fig. 8: Examples of saliency maps obtained with the algorithms presented in Section 5.1 on various datasets. The first row contains the original images, with the class predicted by the original model reported on top of each.
For the experiments, we considered three datasets: mnist, cifar in its 10-class flavor, and imagenet. We chose these datasets because they are the most widely used and they offer different types of classes and image dimensions. On these three datasets we trained the models most used in the literature to evaluate explanation methods: for mnist and cifar a CNN with two convolutional and two linear layers, and for imagenet the VGG16 network [111].
A Saliency Map (SM) is an image in which a pixel's brightness represents how salient the pixel is. Formally, a SM is modeled as a matrix S whose dimensions are the sizes of the image we want to explain, and whose values s_ij are the saliency values of the pixels ij: the greater the value of s_ij, the higher the saliency of that pixel. To visualize a SM we can use, for example, a divergent color map ranging from red to blue: a positive value (red) means that pixel ij contributed positively to the classification, while a negative one (blue) means that it contributed negatively. There are two strategies for creating SMs: the first assigns a saliency value to every pixel; the second segments the image into groups of pixels and then assigns a saliency value to each group.
LIME, already presented in Section 4, can also be used to retrieve SMs for classifiers working on images. For images, the perturbation is done by segmentation. More in detail, lime divides the input image into segments called super-pixels. Then it creates the neighborhood by randomly substituting the super-pixels with a uniform, possibly neutral, color. This neighborhood is then fed into the black-box, and a sparse linear model is learned on top. An example of such a super-pixel explanation is shown in Figure 8. The super-pixel segmentation is critical to obtain a good explanation: for small-resolution images, the segmentation in lime does not work out of the box, resulting in the algorithm selecting the whole image as a single super-pixel, and the user needs to tune the segmentation parameters to obtain a decent result. Recently, much research has improved and extended lime [109,96,130,23].
Datasets: mnist: http://yann.lecun.com/exdb/mnist/, cifar: , and imagenet: http://image-net.org/
ε-LRP, Layer-wise Relevance Propagation [17], is a model-specific method which produces post-hoc local explanations for any type of data. ε-lrp explains the classifier's decisions by decomposition. The ε-lrp redistribution process was introduced for feed-forward neural networks [12]. Mathematically, it redistributes the prediction y backwards, using local redistribution rules, until it assigns a relevance score R_i to each pixel value. Let a_i be the neuron activations at layer l, R_j the relevance scores associated with the neurons at layer l+1, and w_ij the weight connecting neuron i to neuron j. The simple ε-lrp rule redistributing relevance from layer l+1 to layer l is $R_i = \sum_j \frac{a_i w_{ij}}{\sum_{i'} a_{i'} w_{i'j} + \epsilon} R_j$, where the small stabilization term ε is added to prevent division by zero. Intuitively, this rule redistributes relevance proportionally from layer l+1 to each neuron in l based on the connection weights.
The final explanation is the relevance of the input layer. Figure 8 shows some examples of ε-lrp in the third row. As with all pixel-wise explanation methods, the algorithm works very well on mnist, while larger images are harder to address. A variant of ε-lrp is spray [80], which builds a spectral clustering on top of the local instance-based ε-lrp explanations. Similar work is done in [82]: it starts from the ε-lrp of the input instance and finds the LRP attribution relevance for a single input of interest x.
INTGRAD, Integrated Gradients [115], is a model-specific method that produces post-hoc local explanations for any type of data. intgrad utilizes the gradients of the black-box together with sensitivity techniques like ε-lrp; for this reason, it can be applied only to differentiable models. Formally, given the black-box b and the instance x, let x′ be a baseline input. intgrad constructs a path from x′ to x and computes the gradients at points along the path; for images, for example, the points are obtained by overlapping x on x′ and gradually modifying the opacity of x. The integrated gradients are obtained by cumulating the gradients of these points. Formally, the integrated gradient along the i-th dimension for an input x and baseline x′ is defined as $e_i(x) = (x_i - x'_i) \int_{0}^{1} \frac{\partial b(x' + \alpha (x - x'))}{\partial x_i}\, d\alpha$, where $\partial b(x)/\partial x_i$ is the gradient of b(x) along the i-th dimension. An example of intgrad explanations is in Figure 8: the saliency maps obtained tend to have more uniform pixels than those of ε-lrp. As shown before, ε-lrp highlights that, when predicting the class "deer", the most salient regions are in the background. However, an arbitrary choice of the baseline can cause issues: a black baseline image, for example, could cause the method to lower the importance of black pixels in the source image. This problem is due to the difference between the image's pixels and the baseline, $(x_i - x'_i)$, in the integral equation. Expected Gradients [46] tries to overcome this problem by averaging intgrad over different baselines.
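A minimal numerical sketch of the integral above follows, assuming a differentiable PyTorch classifier b that accepts a batch of inputs and returns class logits; the Riemann approximation with a mean over the path points is one standard choice.

```python
import torch

def integrated_gradients(b, x, baseline, target, steps=50):
    # points along the straight-line path from the baseline x' to the input x
    alphas = torch.linspace(0.0, 1.0, steps).view(-1, *([1] * x.dim()))
    path = (baseline + alphas * (x - baseline)).requires_grad_(True)  # shape (steps, *x.shape)
    b(path)[:, target].sum().backward()
    avg_grad = path.grad.mean(dim=0)        # approximates the integral of the gradients
    return (x - baseline) * avg_grad        # e_i(x) = (x_i - x'_i) * integral term
```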
DEEPLIFT [110] is a model-specific and data-agnostic explainer which produces post-hoc local explanations. It computes SMs in a backward fashion, similarly to ε-lrp, but it uses a baseline reference, like intgrad. deeplift uses the slope, instead of the gradient, to describe how the output y = b(x) changes as the input x differs from a baseline x′. As in ε-lrp, an attribution value r is assigned to each unit i of the neural network, going backward from the output y; this attribution represents the relative effect of the unit activated at the original network input x compared to the activation at the baseline reference x′. deeplift computes the starting values of the last layer L as the difference between the outputs on the input and on the baseline, and then uses the recursive rule $r_i^{(l)} = \sum_j \frac{a_{ji} - a'_{ji}}{\sum_{i'} a_{ji'} - \sum_{i'} a'_{ji'}}\, r_j^{(l+1)}$, with $a_{ji} = w_{ji}^{(l+1,l)} x_i^{(l)}$ and $a'_{ji} = w_{ji}^{(l+1,l)} x_i'^{(l)}$, to compute the attribution values of layer l from those of layer l+1, down to the input layer; here $w_{ji}^{(l+1,l)}$ are the weights of the network between layer l and layer l+1, and the a are the activation values. As for intgrad, picking a baseline is not trivial and might require domain experts. The SMs obtained with deeplift are very similar to those obtained with ε-lrp (Figure 8).
SMOOTHGRAD [112] is a post-hoc, model-specific, and data-agnostic explanation method. SMs tend to be noisy, especially pixel-wise ones, and smoothgrad tries to overcome this problem by smoothing the noise in the SMs. Usually, a SM is created directly from the gradient of the model's output y w.r.t. the input, ∂y/∂x. smoothgrad augments this process by smoothing the gradients with a Gaussian noise kernel: it takes x, applies Gaussian noise to it, and retrieves the SM of every perturbed image using the gradient; the final SM is the average of these. Formally, given a saliency method f(x) which produces a saliency map, its smoothed version can be expressed as $\hat{f}(x) = \frac{1}{n} \sum_{k=1}^{n} f\big(x + \mathcal{N}(0, \sigma^2)\big)$, where n is the number of samples and $\mathcal{N}(0, \sigma^2)$ is the Gaussian noise. In [4,5] some weaknesses of smoothgrad are shown: people tend to evaluate SMs based on what they expect to see. For example, in a bird image we want to see the shape of a bird, but this does not mean that this is what the network is looking at. Figure 9 highlights this problem: we obtained the SMs by taking the gradient of the output w.r.t. the input, and then we applied smoothgrad. We observe that the SMs completely change their behavior, moving towards the subject of the image.
Fig. 9: Visual comparison of saliency maps obtained by taking the gradient of the output y w.r.t. the input image x (center) and with smoothgrad (bottom). For the three images in the center, the saliency map changes drastically: in all three cases smoothgrad focuses on the subject of the image, completely changing the original values; this is true also for the seashore image on the far right.
Fig. 10: (Top) Explanations of deep-shap on mnist. (Bottom) Explanations of grad-shap on imagenet.
dlime: https://github.com/rehmanzafar/dlime_experiments
The baseline x′ is generally chosen as a zero matrix or vector; for the image domain, the baseline is generally a black or a white image.
SHAP, presented in Section 4, has two explainers that can be employed for deep networks tailored for image classification: deep-shap and grad-shap. deep-shap is a high-speed approximation algorithm for shap values in deep learning models that builds on a connection with deeplift. The implementation differs from the original deeplift by using as baseline a distribution of background samples instead of a single value, and by using Shapley equations to linearize non-linear components of the black-box such as max, softmax, products, divisions, etc. grad-shap, instead, is based on intgrad and smoothgrad [115,112]. intgrad values are a bit different from shap values and require a single reference value to integrate from; as an adaptation to make them approximate shap values, grad-shap reformulates the integral as an expectation and combines that expectation with sampling reference values from the background dataset, as done in smoothgrad. We tested both deep-shap and grad-shap experimentally, and the results are shown in Figure 10: deep-shap outputs a saliency map explaining every class of the input image, while grad-shap produces a pixel-wise saliency map similar to those shown before.
XRAI [70] is based on intgrad and inherits its properties. Differently from intgrad, xrai first over-segments the image and then iteratively tests each region's importance, fusing smaller regions into larger segments based on attribution scores. It is divided into three steps: segmentation, attribution, and region selection. The segmentation is repeated several times with different segments to reduce the dependency on the image segmentation. For attribution, xrai uses intgrad with black and white baselines averaged. Finally, to select regions, xrai leverages the fact that, given two regions, the one whose attributions sum to the more positive value should be more important to the classifier: starting with an empty mask, it selectively adds the regions that yield the maximum gain in total attribution per area. The saliency maps obtained from xrai are very different from those already presented; Figure 8 shows some examples. As with all segmentation-based methods, xrai performs at its best with high-resolution images, but it still obtains good results on low-resolution ones.
GRADCAM [106] is a model-specific post-hoc local explainer for image data. It uses the gradient information flowing into the last convolutional layer of a CNN to assign saliency values to each neuron for a particular decision. Convolutional layers naturally retain spatial information that is lost in fully-connected layers, so we can expect the last convolutional layers to offer the best compromise between high-level semantics and detailed spatial information. To create the SM, gradcam takes the feature maps a produced by the last convolutional layer and computes the gradient of the output for a particular class y^c w.r.t. every feature map activation k, i.e., ∂y^c/∂a^k. This returns a tensor of dimensions [k, v, u], where k is the number of feature maps and u, v are the height and width of the maps. gradcam computes a saliency value for every feature map by pooling over the spatial dimensions, and the final heatmap is calculated as a weighted sum of these values. Notice that this results in a coarse heatmap of the same size as the convolutional feature maps; an up-sampling technique is applied to the final result to produce a map of the initial image dimensions.
From Figure 8 it is clear that these coarse-grained heatmaps are very characteristic of gradcam; they highlight very different parts of the image compared to the other methods.
GRADCAM++ [27] extends gradcam to solve some of its issues. The spatial footprint of the objects in an image is essential for gradcam's visualizations to be robust: if there are multiple objects with slightly different orientations or views, different feature maps may be activated with differing spatial footprints, and the ones with smaller footprints fade away in the final sum. gradcam++ fixes this problem by taking a weighted average of the pixel-wise gradients. In particular, gradcam++ reformulates gradcam by explicitly coding the structure of the weights $\alpha_k^c$ as $\alpha_k^c = \sum_i \sum_j w_{ij}^{kc} \cdot \mathrm{ReLU}\!\left(\partial y^c / \partial a_{ij}^k\right)$, where ReLU is the Rectified Linear Unit activation function and $w_{ij}^{kc}$ are the weighting coefficients of the pixel-wise gradients for class c and convolutional feature map $a^k$. The idea is that $\alpha_k^c$ captures the importance of a particular activation map $a^k$, and positive gradients are preferred so as to indicate visual features that increase the output neuron's activation rather than those that suppress it.
RISE [97] is a model-agnostic method which produces post-hoc local explanations on image data. To produce a saliency map for an image x, rise generates N random masks $M_i \in [0,1]$ from random noise. The input image x is element-wise multiplied with these masks $M_i$, and the result is fed to the model. The saliency map is obtained as a linear combination of the masks $M_i$, weighted by the black-box predictions on the corresponding masked inputs. The intuition is that $b(x \odot M_i)$ is high when the pixels preserved by mask $M_i$ are essential.
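A compact sketch of one simple variant of this masking scheme (binary coarse masks, bilinearly upsampled) is shown below; predict_proba is an assumed function mapping a batch of H×W×C images to class probabilities, and the grid size, number of masks, and keep-probability are the usual hyper-parameters.

```python
import numpy as np
from scipy.ndimage import zoom

def rise_saliency(predict_proba, x, target, n_masks=2000, grid=7, p=0.5):
    H, W = x.shape[:2]
    saliency = np.zeros((H, W))
    for _ in range(n_masks):
        coarse = (np.random.rand(grid, grid) < p).astype(float)   # random coarse binary mask
        mask = zoom(coarse, (H / grid, W / grid), order=1)        # smooth upsampling to image size
        score = predict_proba((x * mask[..., None])[None])[0, target]
        saliency += score * mask                                  # weight the mask by b(x ⊙ M)
    return saliency / (n_masks * p)
```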
Qualitative and Quantitative Comparison of Saliency Maps
In Figure 8 we report the SMs obtained for every method tested. The segmentation used by lime is very poor with small images, as in some cases it results in super-pixels as big as the whole image; on the other hand, the maps produced by xrai are much clearer. For the majority of images the SMs are very similar among the various explainers, but we can observe conflicts. For instance, on cifar we can assume that the background is useful to predict the class deer, but we do not know how: some explainers highlight the top of the background while others the bottom, so it is difficult to understand. Moving to bigger images, these conflicts become more evident. Let us look at the ice hockey image. The class in the dataset here is "puck", the hockey disk. lime highlights the ice as important, while other methods (xrai and gradcam++) highlight the stick of the player; gradcam highlights the fans, and rise the hockey player. Thus, for the same image, we can obtain very different explanations. Moving to the second image from imagenet (the mask), we can observe that all the methods capture the same pattern: a straw hat in the background triggered the class "shower cap", while the correct one was "mask". In the "seashore" image of imagenet we have an island in the sea; the top three predicted classes are seashore (0.…), … (0.04), and cliff (0.…). lime, smoothgrad, rise, and gradcam were fooled into indicating the promontory as important for the class "seashore". We can conclude that SMs are very fragile when multiple classes are present in the image, even if these classes have very low predicted probability.
To further investigate the performance of the analyzed methods, we computed the deletion and insertion metrics discussed in Section 3.2. For a query image, we substitute pixels in order of the importance scores given by the explanation method: for insertion, we start from a blurred image and slowly insert pixels, while for deletion we substitute pixels with black ones. For every substitution, we query the black-box with the modified image, obtaining an accuracy. The final score is the area under the curve (AUC) [62] of the accuracy as a function of the percentage of removed/inserted pixels. Figure 11 shows an example of this metric computed on the hockey image of imagenet. For every dataset, we performed this computation on a set of 100 samples and then averaged. The results are shown in Table 7. Insertion scores decrease as the image dimension of the dataset grows, because the images carry more information and more pixels have to be inserted to raise the accuracy; deletion scores decrease as well, arguably because, with more information, it is easier to decrease the accuracy. The best methods are highlighted in bold: rise is the best in three out of five experiments, followed by deeplift and ε-lrp. Segmentation-based methods (lime, xrai, gradcam, gradcam++) struggle with low-resolution images.
Fig. 11: Example of the Insertion (left) and Deletion (right) metric computed for lime on the hockey image. The area under the curve is 0.2156 for Deletion and 0.5941 for Insertion.
Table 7: Insertion (left) and Deletion (right) metrics expressed as AUC of accuracy vs. percentage of inserted/removed pixels on mnist, cifar, and imagenet, for lime, ε-lrp, intgrad, deeplift, smoothgrad, xrai, gradcam, gradcam++, and rise.
Most ML models are designed to operate on low-level features, like edges and lines in a picture, that do not correspond to the high-level concepts humans can easily understand. As pointed out in [4,128], feature-based explanations applied to state-of-the-art complex black-box models can yield non-sensible explanations. Concept-based explainability constructs the explanation based on human-defined concepts rather than representing the inputs in terms of features and internal model (activation) states. This idea of high-level features is more familiar to humans, who are therefore more likely to accept it. For example, a low-level explanation for images assigns a saliency value to every pixel; although it is possible to look at every pixel and inspect its numerical value, such values make no sense to humans: we do not say that the 5th pixel of an image has a value of 28. Instead, a CA method quantifies, for example, how much the concept "stripes" has contributed to the prediction of the class "zebra".
Formally, given a set of images $[x^{(1)}, x^{(2)}, \ldots, x^{(i)}]$ belonging to a concept C, CA methods can be thought of as a function $f : (b, [x^{(i)}]) \rightarrow e$ which assigns a score e to the concept C based on the predictions and internal values of the black-box b on the set $[x^{(i)}]$.
TCAV, Testing with Concept Activation Vectors [75], is a model-agnostic method that produces post-hoc global explanations for image classifiers. tcav provides a quantitative measure of how important a concept is for the prediction. Every concept is represented by a particular vector, called a Concept Activation Vector (CAV), which interprets an internal state of a neural network in terms of human-friendly concepts. tcav uses directional derivatives to quantify the degree to which a user-defined concept is vital to a classification result, for example how sensitive a prediction of "zebra" is to the presence of "stripes". tcav requires two main ingredients: (i) concept-containing inputs and negative samples (random inputs), and (ii) a pre-trained ML model on which the concepts are tested. The concept-containing and random inputs are fed into the model to obtain the predictions, in order to test how well the trained model captured a particular concept: a linear classifier is trained to distinguish the activations of the network due to concept-containing vs. random inputs, and the result of this training is the concept activation vectors (CAVs). Once the CAVs are defined, the directional derivative of the class probability along the CAVs can be computed for each instance that belongs to a class. The "concept importance" for a class is then computed as the fraction of the class instances that are positively activated by the concept-containing inputs vs. random inputs. In Figure 12 we show an example of a tcav explanation: the user collects some images of concepts, like "ice", "hockey player", and "fans", and tcav computes a score for each of them, telling us which one has more impact on the prediction of the query image.
Fig. 12: tcav scores for three concepts (ice, hockey players, and cheering people/fans) for the class puck of imagenet. On the left the query image; in the center some samples of the images tested as concepts; on the right the histogram of the scores with error bars. The hockey player image has been classified as puck, but the saliency maps differ greatly across methods; here we can see that the ice and the hockey players are important concepts, while the background fans are not significant.
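The core of the tcav score can be sketched as follows, assuming that the activations of a chosen layer for concept and random images, and the gradients of the class logit w.r.t. that layer for the class instances, have already been extracted from the network (all array names are placeholders).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def tcav_score(acts_concept, acts_random, grads_class):
    """acts_*: (n, d) layer activations; grads_class: (m, d) gradients of the class logit
    w.r.t. the same layer, one row per instance of the class under study."""
    X = np.vstack([acts_concept, acts_random])
    y = np.r_[np.ones(len(acts_concept)), np.zeros(len(acts_random))]
    cav = LogisticRegression(max_iter=1000).fit(X, y).coef_[0]   # concept activation vector
    # fraction of class instances whose prediction increases along the CAV direction
    return np.mean(grads_class @ cav > 0)
```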
ACE, Automated Concept-based Explanation [49], is the evolution of tcav that does not need any concept example: it can discover concepts automatically. It takes training images and segments them using a segmentation method; these super-pixels are fed into the black-box model as if they were input images and are clustered in the activation space. Then, as in tcav, we can obtain how much these clusters contributed to the prediction of a class.
ConceptSHAP [129] defines an importance score for each discovered concept. Similar to ace, conceptshap aims at having concepts consistently clustered in coherent spatial regions. conceptshap finds the importance of each individual concept from a set of m concept vectors $C_s = \{c_1, c_2, \ldots, c_m\}$ by utilizing Shapley values. CaCE, Causal Concept Effect [54], is another variation of tcav. It looks at the causal effect of the presence or absence of high-level concepts on the deep learning model's prediction. tcav can suffer from confounding of concepts, which can happen if the training data instances have multiple classes, even with a low correlation. cace can be computed exactly if the concepts of interest are changed by intervening in the counterfactual data generation.
Another possible explanation for images is to produce prototypical images that best represent aparticular class. Human reasoning is often prototype-based, using representative examples as a basisfor categorization and decision-making. Similarly, prototype explanation models use representativeexamples to explain and cluster data.
MMD-CRITIC [74], already presented in Section 4, can be applied to retrieve image prototypes and criticisms. Figure 13 presents an application of mmd-critic on cifar. We can extract some interesting knowledge from these outputs: for example, among the criticism images, the planes are all on a white background or have a form different from the usual one, so we can conclude that in cifar most planes appear in the sky and have a passenger-airplane shape.
Fig. 13: Criticisms (left) and prototypes (right) output by mmd-critic on cifar. Among the criticisms there are many planes on a white background, so the sky background is important for the plane class.
PROTONET [28] is a model-specific, intrinsically interpretable model that produces global explanations on image data. It identifies prototypical parts of images (named prototypes) and then uses them to classify, hence making the classification process interpretable. A special architecture is needed to produce prototypes. The network learns from the training set a limited number of prototypical parts useful for classifying a new image: the model identifies several parts of the test image that look like prototypical parts of some training images, and then it makes the prediction based on a weighted combination of the similarity scores between parts of the image and the learned prototypes. The performance is comparable to the actual state of the art, but with more interpretability.
Influence Functions [76] are another variant for building prototypes. Instead of building prototypical images for a class, this approach finds the images most responsible for a given prediction using influence functions, a classic technique from robust statistics that traces a model's prediction through the learning algorithm back to its training data, thereby identifying the training points most responsible for a given prediction. Visualizing the training points most responsible for a prediction can provide deeper insights into model behavior.
Fig. 14: (a) Explanation of cem on mnist: query in the center, Pertinent Negative on the left, Pertinent Positive on the right. (b) Explanation of guidedproto on mnist: from left to right, the query and the closest counterfactuals labeled as 6 and as 8. (c) Explanation of abele on mnist: query on the left, SM on the right; green/yellow areas can be changed without impacting the outcome.
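To make the prototype selection of mmd-critic above more concrete, here is a greedy sketch that repeatedly adds the data point that most reduces the squared Maximum Mean Discrepancy between the prototype set and the full dataset; the RBF kernel and the plain greedy search are simplifying assumptions, not the exact procedure of [74].

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def mmd_prototypes(X, n_prototypes=10, gamma=None):
    K = rbf_kernel(X, gamma=gamma)
    n = len(X)
    selected = []
    for _ in range(n_prototypes):
        best_j, best_cost = None, np.inf
        for j in range(n):
            if j in selected:
                continue
            S = selected + [j]
            m = len(S)
            # MMD^2 terms that depend on the prototype set S (the constant data-data term is omitted)
            cost = K[np.ix_(S, S)].sum() / m**2 - 2.0 * K[:, S].sum() / (n * m)
            if cost < best_cost:
                best_j, best_cost = j, cost
        selected.append(best_j)
    return X[selected], selected
```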
Counterfactuals are another type of explanation for images. Their application to images is similar to that already described for tabular data in Section 4.4. As output, counterfactual methods for images produce samples of images similar to the original one but with an altered prediction; some methods output only the pixel variation, others the whole altered image.
Guided Prototypes, Interpretable Counterfactual Explanations Guided by Prototypes (guidedproto) [118], is a model-agnostic method to find interpretable counterfactuals. guidedproto perturbs the input image to find the image closest to the original one but with a different classification, by minimizing with gradient descent an objective loss function $L = c \cdot L_{pred} + \beta L_1 + L_2$. The first term, $c \cdot L_{pred}$, encourages the perturbed instance to be predicted in a class different from that of x, while the other terms are regularizers. In Figure 14 we show an example of application of guidedproto on mnist: it is interesting to notice how easy it is to change the digit class with very few focused pixels.
CEM, Contrastive Explanation Method [40], already presented in Section 4, can also be applied to image data. For images, Pertinent Positives (PP) and Pertinent Negatives (PN) are the pixels that lead to the same or to a different class w.r.t. the original instance. To create PPs and PNs, a feature-wise perturbation is performed, keeping the perturbations sparse and close to the original instance through an objective function that contains an elastic-net regularizer $\beta L_1 + L_2$. An autoencoder is trained to reconstruct images of the training set, so that the perturbed instance lies close to the training data manifold. In fact, in Figure 14 we can see that very few pixels are returned as explanations on mnist.
L2X [29] finds the pixels that change the classification. It is based on learning a function that extracts a subset of the most informative features for each given sample using Mutual Information. l2x adopts a variational approximation to efficiently compute the Mutual Information and assigns a value to each group of pixels, called a patch: if the value is positive, the group contributed positively to the prediction; otherwise, it contributed negatively.
Table 8: Explanation runtime expressed in seconds for explainers of image classifiers, approximated as order of magnitude.
Dataset   Black-Box  lime  ε-lrp  intgrad  deeplift  smoothgrad  xrai  gradcam  gradcam++  rise  tcav  mmd-critic  cem  guidedproto  abele
mnist     CNN        1     1      0.03     2         0.04        1     0.1      0.1        0.5   -     -           580  11           2000
cifar     CNN        10    1      0.06     1         0.07        1.5   0.15     0.15       2     -     277         -    -            1800
imagenet  VGG16      50    2      5        3         0.8         18    0.25     0.25       21    300   -           -    -            -
ABELE , Adversarial black-box Explainer generating Latent Exemplars) [56], is a local, model-agnostic explainer that produces explanations composed of: (i) a set of exemplar and counter-exemplar images, and (ii) a saliency map. The end-user can understand the classification by lookingat images similar to those under analysis that received the same prediction or a different one.Moreover, by exploiting the SM, it is possible to understand the areas of the images that cannotbe changed and varied without impacting the outcome. abele exploits an adversarial autoencoder(AAE) to generate the record’s local neighborhood to explain x . It builds the neighborhood on alatent local decision tree, which mimics the behavior of b . Finally, exemplars and counter-exemplarsare selected, exploiting the rules extracted from the decision tree. The SM is obtained by a pixel-by-pixel difference between x and the exemplars. In Figure 14 we have an example of applicationof abele on mnist . Green and yellow areas can change without impacting the black-box outcome,while the gray areas must remain the same to have the same prediction. Runtime Analysis
Runtime Analysis
Table 8 shows the explanation runtime approximated as order of magnitude. We notice that gradcam and gradcam++ are the fastest methods, especially for big models like the VGG network. In general, pixel-wise saliency explanations are easier to obtain, while segmentation slows the computation down considerably, especially for high-resolution images. CA, CF, and PR methods are very slow compared to SM methods, because these algorithms require additional training or employ some search algorithm.
Discussion
When dealing with images, the most widespread explanations are Saliency Maps (Section 5.1). The literature presents a multitude of methods capable of producing this type of explanation. A problem with saliency maps is confirmation bias [4]; moreover, humans do not think in terms of pixels. Saliency Maps provide explanations in terms of pixels, which are low-level features useful only to an expert user who wants to check the robustness of the black-box. For a general audience, there is the need to build explanations in terms of higher-level features called concepts. This is the goal of Concept Attribution-based explanations (Section 5.2): for a concept selected by a human team, these methods compute a score that evaluates the probability that the selected concept has influenced the prediction. Concept-based explanations are a very recent type of explanation for images and still have room for improvement; they are a first step in the direction of human-like explanations. Human-friendly concepts make it possible to build straightforward and useful explanations. Humans still need to map images to concepts, but it is a small price to pay to improve human-machine interaction. Other approaches are based on the idea of producing examples to support the explanation. Prototypes and Counterfactuals (Sections 5.3 and 5.4) are two similar types of explanations but with very different meanings: the goal of prototypes is to produce an example that reflects the common properties of a class, while the goal of counterfactuals is to produce examples similar to the input but with a different predicted class. The former is useful for model inspection, while the latter improves the user experience. In particular, counterfactuals are more user-friendly since they highlight the changes to make to obtain the desired prediction.
For text data, we can distinguish the following types of explanations: Saliency Maps (SM), described in Section 6.1, Attention-Based methods (AB), described in Section 6.2, and Other Methods, detailed in Section 6.3. Additional details are available in [37]. Table 9 summarizes the explanation methods acting on text data.
Table 9: Summary of methods for opening and explaining black-boxes.
Type   Name       Ref.    Authors              Year  Data Type  IN/PH  G/L  A/S  Code
SH     lime       [102]   Ribeiro et al.       2016  ANY        PH     L    A    link
SH     intgrad    [115]   Sundararajan et al.  2017  ANY        PH     L    S    link
SH     l2x        [29]    Chen et al.          2018  ANY        PH     L    A    link
SH     deeplift   [110]   Shrikumar et al.     2017  ANY        PH     L    S    link
SH     lionets    [89]    Mollas et al.        2019  ANY        PH     L    S    link
AB     -          [83]    Li et al.            2014  TXT        PH     L    S    -
AB     exbert     [66]    Hoover et al.        2019  TXT        PH     L    S    link
AB     -          [119]   Vaswani et al.       2017  TXT        PH     L    S    -
Other  anchor     [103]   Ribeiro et al.       2018  TXT        PH     L    A    link
Other  quint      [2]     Abujabal et al.      2017  TXT        PH     L    S    -
Other  criage     [98]    Pezeshkpour et al.   2019  TXT        PH     L    S    link
Other  lasts      [60]    Guidotti et al.      2020  TXT        PH     L    S    -
Other  xspells    [79]    Lampridis et al.     2020  TXT        PH     L    S    link
Other  -          [101]   Rajani et al.        2019  TXT        PH     L    S    -
Other  doctorxai  [94]    Panigutti et al.     2020  ANY        PH     L    S    -
Fig. 15: Example of sentence highlighting: on top we show the scores produced by IntGrad, and below, in order, those of LIME, DeepLift, and the baseline, which consists of multiplying the input with the gradient w.r.t. the input. The sentence is taken from imdb.
Text, unlike tabular and image data, does not have a structure. The variety and complexity of tasks related to text are enormous, and this field is known in the literature as Natural Language Processing (NLP) [33]. In the following, we analyze text classification in detail because, among information retrieval, machine translation, and question answering, it is the main task for which XAI methods exist in the literature. Examples of text classification are sentiment analysis, topic labeling, and spam detection. Text classification is the process of assigning tags or categories to text according to its content: using labeled examples as training data, a ML model can learn the associations between pieces of text and a particular output called a tag. Tags can be thought of as labels that distinguish different types of text; for sentiment analysis, possible tags are positive, negative, or neutral. XAI techniques are generally applied to understand which words are the most relevant for a specific tag assignment. We experimented on three datasets: sst, imdb, and yelp. We selected these datasets because they are the most used for sentiment classification and have different dimensions. On these datasets we trained different black-box models, and for every explainer we present an example of application on one or more datasets. As seen in Section 5.1, saliency-based explanations are prevalent because they provide visually perceptive explanations.
Saliency highlighting consists of applying saliency maps to text by assigning to every word a score based on the importance that that word had in the final prediction. Formally, a Sentence Highlighting (SH) is modeled as a vector s which explains a classification y = b(x) of a black-box b on x: the dimensions of s are the words present in the sentence x we want to explain, and the value s_i is the saliency value of the word i. The greater the value of s_i, the bigger the importance of that word. A positive value indicates a positive contribution towards y, while a negative one means that the word contributed negatively. Some examples are reported in Figure 15. To obtain such an explanation, it is possible to adapt some of the saliency map methods presented in Section 5.1.
Datasets: sst: https://nlp.stanford.edu/sentiment/index.html, imdb: https://ai.stanford.edu/~amaas/data/sentiment/, yelp:
Table 10: Deletion (right) and Insertion (left) metrics computed on Sentence Highlighting for the sst, imdb, and yelp datasets, for intgrad, lime, deeplift, and gradient × input.
LIME [102], presented in Section 4, can be applied to text with a modification of the perturbation of the original input. Given an input sentence x, lime creates a neighborhood of sentences by replacing one or more words with spaces; another possible variation is to substitute words with similar ones instead of removing them.
INTGRAD [115], presented in Section 4, can also be exploited to explain text classifiers. Gradient-based methods are generally challenging to apply to NLP models because the vectors representing every word are usually averaged into a single sentence vector, and the explainer cannot easily redistribute the signal back to the original word vectors. intgrad is immune to this problem because the saliency values are computed as a difference with respect to a baseline: intgrad computes the saliency value of a single word as the difference from the sentence without it. For a fair comparison, we substituted the words with spaces, as done for lime.
DEEPLIFT [110], presented in Section 4, can also be applied to text following the same principle as intgrad. For the experiments, we adopted the same preprocessing used for lime and intgrad.
L2X [29] can produce a SH explanation for text; in particular, for text, the patches are now groups of words.
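As a usage illustration, a hedged sketch with the lime package follows; `pipe` is assumed to be a text-classification pipeline whose predict_proba accepts raw strings.

```python
from lime.lime_text import LimeTextExplainer

explainer = LimeTextExplainer(class_names=["negative", "positive"])
exp = explainer.explain_instance("Read the book, forget the movie!",
                                 pipe.predict_proba,   # assumed sklearn-style pipeline
                                 num_features=6)
print(exp.as_list())   # [(word, weight), ...] -- the sentence-highlighting scores
```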
Qualitative and Quantitative Comparison of Sentence Highlighting
Besides the methods presented above, we also tested a baseline method. This baseline, named Gradient × Input, takes the gradient of the black-box output w.r.t. the input and multiplies these values by the input values. The results are shown in Figure 15. The highlighted words are very different among the various methods: intgrad and lime are the ones that output meaningful explanations, while deeplift struggles to diversify from the baseline. We also measured deletion and insertion and report the results in Table 10. For both metrics, performance is very poor across all the methods; however, removing a single word barely changes the meaning of the sentence.
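For completeness, the Gradient × Input baseline can be sketched as follows for a PyTorch text classifier; `model.embedding` and `forward_from_embeddings` are assumed hooks into the model, not part of any specific library.

```python
import torch

def gradient_x_input(model, token_ids, target):
    emb = model.embedding(token_ids).detach().requires_grad_(True)  # (seq_len, emb_dim), assumed layer
    logits = model.forward_from_embeddings(emb.unsqueeze(0))        # assumed method
    logits[0, target].backward()
    return (emb.grad * emb).sum(dim=-1)   # one saliency score per word
```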
Attention was proposed in [126] to improve model performance. The authors showed, through an attention layer, which parts of the image contributed most to producing the caption. Attention is a layer placed on top of the model that, for each pixel ij of the image x, generates a positive weight α_ij, the attention weight. This value can be interpreted as the probability that pixel ij is the right place to focus on in order to produce the next word of the caption. Attention mechanisms allow models to look over all the information the original sentence holds and to learn the context [125,18]. Therefore, they have caught the interest of XAI researchers, who started using these weights as an explanation: the explanation e of the instance x is composed of the set of attention values α, one for each feature x_i. Attention is nowadays a delicate topic: while it is clear that it improves the performance of models, it is less clear whether it helps gain interpretability and what its relationship with the model outputs is [67].
Attention Based Sentence Highlighting [83] is an AB mechanism that produces a heatmap explanation similar to the one used for SMs. The scores are computed for every word of the sentence by using the attention layer of the black-box: the weights α_ij of the attention layer are used as scores, and the higher the score, the redder the highlighting.
Attention Matrix [30] looks at the dependencies between words to produce explanations. It is a self-attention method, sometimes called intra-attention: attentionmatrix relates different positions of a single sequence to compute its internal representation. The attention of a sentence x composed of N words can be understood as an N × N matrix, where each row and each column represent a word of the input sentence; the values of the matrix are the attention values of every possible combination of the tokens. This matrix is a representation of values pointing from each word to every other word [119] (see Figure 16). We can also visualize this matrix with a focus on the connection between words [66], as in Figure 17, where the thickness of the lines is the self-attention value between two tokens.
Fig. 16: Saliency heat-map matrix generated with the method presented in [30]. The rows and the columns of the matrix correspond to the words in the sentence "Read the book, forget the movie!"; each value of the matrix shows the attention weight α_ij of the annotation of the i-th word w.r.t. the j-th.
Fig. 17: Representation of the attention in BERT for a sentence taken from imdb, using the visualization of [66]. The greater the attention between two words, the thicker the line; here only the attention related to the word "sucks" is selected.
Runtime Analysis
NLP models are usually very large, resulting in poor performance in terms of runtime. Apart from Attention Matrix methods, which are practically instantaneous, we notice that for all the datasets the times are pretty much the same, in the order of magnitude of ten seconds; the runtime of the methods is independent of the dataset size.
Discussion
Explanations for text data are at a very early stage compared to tabular data and images. The majority of the methods focus on low-level feature explanations, giving a score to the words that make up the sentence. As said for image explanations in Section 5, these low-level feature explanations are useful to check the model's robustness, not to give a useful explanation to the final, inexpert user. Natural Language Processing is a very complex field, and finding a human-friendly explanation is challenging. Researchers are working in the direction of creating explanations with high-level concepts [113] and of using humans to augment these types of concepts [101], as done for Concept Attribution.
There are other methods that are important to mention when talking about XAI for text or sequential data.
ANCHOR, presented in Section 4, can be adapted to text by perturbing a sentence through the substitution of words with the token UNK (unknown). For example, it can show how "sucks" contributed to the negative prediction of a sentence and how, when coupled with "love", the sentence prediction switches to positive.
Natural Language Explanation verbalizes explanations in natural human language. Naturallanguage can be generated with complex deep learning models , e.g., by training a model with naturallanguage explanations and coupling with a generative model [101]. Besides, it can also be generatedusing a simple template-based approach [2].
XSPELLS [79] is a model-agnostic explainer returning exemplar and counter-exemplar sentences as explanation. It re-implements abele for text data by using LSTM layers in the autoencoder. Exemplars and counter-exemplars are selected by exploiting the rules extracted from the decision tree learned in the latent space.
LASTS, Local Agnostic Shapelet-based Time Series explainer [60], is a variation of abele for time series; since a text can be interpreted as a time series, we report this work here. As explanation, lasts returns exemplar and counter-exemplar time series together with shapelet-based rules. Shapelets are locally discriminative subsequences characterizing the classification. An example of a rule is: "If these shapelets are present and these others are not, then x is classified as y". DOCTORXAI [94] is a local, post-hoc, model-agnostic explainer acting on sequential data in the medical setting. In particular, it exploits a medical ontology to perturb the data and generate neighbors. doctorxai is designed for healthcare data, but it can theoretically be applied to every type of sequential data with an ontology.
A significant number of toolboxes for ML explanation have been proposed during the last few years. In the following, we report the most popular Python toolkits with a brief description of the explanation models they provide. AIX360 [16] contains intrinsic, post-hoc, local, and global explainers, and it can be used with every kind of input dataset. Regarding local post-hoc explanations, different methods are implemented, such as lime [102], shap [84], cem [40], cem-maf [85], and protodash [61]. Another interesting method proposed in this toolkit is ted [65,38], which provides intrinsic local explanations as well as global explanations based on rules.
CaptumAI is a library built for PyTorch models. CaptumAI divides the available algorithms into three categories: Primary Attribution, which contains methods able to evaluate the contribution of each input feature to the output of a model, e.g., intgrad [115], grad-shap [84], deeplift [110], lime [102], and gradcam [106] (a minimal usage sketch is given below); Layer Attribution, which focuses on the contribution of each neuron of a given layer, e.g., gradcam [106] and layer-deeplift [110]; and Neuron Attribution, which analyzes the contribution of each input feature to the activation of a particular hidden neuron, e.g., neuron-intgrad [115] and neuron-grad-shap [84]. InterpretML [93] contains intrinsic and post-hoc methods for Python and R. InterpretML is particularly interesting due to the intrinsic methods it provides: Explainable Boosting Machine (ebm), Decision Tree, and Decision Rule List. These methods offer a user-friendly visualization of the explanations, with several local and global charts. InterpretML also contains the most popular methods, such as lime and shap. DALEX [19] is an R and Python package that provides post-hoc and model-agnostic explainers allowing local and global explanations; it is tailored for tabular data and is able to produce different kinds of visualization plots.
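The following is a hedged sketch of a typical CaptumAI Primary Attribution call; `net` and `inputs` are an assumed PyTorch classifier and an input batch, and the zero baseline is just one common choice.

```python
import torch
from captum.attr import IntegratedGradients

ig = IntegratedGradients(net)                 # net: an assumed PyTorch classifier
baseline = torch.zeros_like(inputs)           # black-image baseline (one common choice)
attributions = ig.attribute(inputs, baselines=baseline,
                            target=int(net(inputs).argmax(dim=1)[0]))  # explain the predicted class
```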
Alibi provides intrinsic and post-hoc models. It can be used with any type of input dataset, both for classification and for regression tasks. Alibi provides a set of counterfactual explanation methods, such as cem, and, interestingly, an implementation of anchor [103]. Regarding global explanation methods, Alibi contains ale (Accumulated Local Effects) [11], which is a method based on partial dependence plots [59].
FAT-Forensics takes into account fairness, accountability, and transparency. Regarding intrinsic explainability, it provides methods to assess explainability under three perspectives: data, models, and predictions. For accountability, it offers a set of techniques that assess privacy, security, and robustness; for fairness, it contains methods for bias detection. What-If Tool is a toolkit providing a visual interface with which it is possible to experiment without coding. Moreover, it can work directly with ML models built on
Cloud AI Platform (https://cloud.google.com/ai-platform). It contains a variety of approaches to obtain feature attribution values, such as shap [84], intgrad [115], and smoothgrad [112].
Toolkit repositories: AIX360: https://github.com/Trusted-AI/AIX360; CaptumAI: https://captum.ai/; InterpretML: https://github.com/interpretml/interpret; Alibi: https://github.com/SeldonIO/alibi; FAT-Forensics: https://github.com/fat-forensics/fat-forensics; What-If Tool: https://github.com/pair-code/what-if-tool.

Conclusion

This paper has presented a survey of the latest advances in XAI methods, following a categorization based on data types and explanation strategies. We measured and evaluated a set of benchmarks for each explanation technique, comparing them from both the quantitative and the qualitative point of view.

Our literature review revealed interesting trends in the strategies proposed for explanation. For tabular data, feature importance is the most widely adopted strategy, particularly for Explainable-by-Design solutions and model-agnostic black box explanations. Rule-based explanations are gaining attention since their logic formalization enables a deeper understanding of the AI model's internal decisions. Recently, methods that explain in terms of counterfactuals have been yielding interesting results. For image data, the most widely adopted technique is the creation of Saliency Maps, which translate the feature relevance approach for tabular data to the image domain by highlighting the portions of the image that are relevant for the AI model outcome. However, other approaches, such as Concept Attribution, Prototypes, and Counterfactuals, have been on the rise in recent years. Explanation techniques are still limited for text data, but it is nevertheless possible to highlight a few trends. We recall Sentence Highlighting, which, similarly to feature importance for tabular data, assigns a weight to the portions of the input that contributed, positively or negatively, to the outcome.

Across the different data types, different approaches tend to use similar strategies. This is also evident if we look at the internals of these algorithms. For example, several methods exploit the generation of a synthetic neighborhood around an instance to reconstruct the local distribution of data around the point to investigate. This stochastic generation is at the base of several methods, and it also explains their low performance on the stability measure (see Table 3). Another frequent strategy consists of learning a surrogate model from partial training data (sometimes created from the neighborhood generation). This approach tries to bring the benefits of intrinsic methods to the context of black box explanation.

In recent years the contributions on Explainable AI topics have been constantly growing, particularly in AI and ML. However, there is still a restricted number of contributions focusing on the comparison of these methods. The definition of a unifying metric for measuring the efficacy of explanation strategies is difficult, particularly when human-grounded evaluations are addressed. We believe that the next years of research will focus more on the human side, emphasizing human-machine interaction and aligning the generation of the explanation with the cognitive model of the final user. Some preliminary results in this direction are presented in [55,68,63]. We believe that XAI must be addressed more in the development of future AI applications, and we hope that this work can help in that development.
Acknowledgements
This work is partially supported by the European Community H2020 programme under the funding schemes: INFRAIA-1-2014-2015 Res. Infr. G.A. 871042 SoBigData++, G.A. 952026 HumanE AI Net, G.A. 825619 AI4EU, G.A. 834756 XAI.

References
1. Glocalx – from local to global explanations of black box ai models. ArXiv, 2021.
2. A. Abujabal, R. S. Roy, M. Yahya, and G. Weikum. Quint: Interpretable question answering over knowledge bases. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 61–66, 2017.
3. A. Adadi and M. Berrada. Peeking inside the black-box: A survey on explainable artificial intelligence (xai). IEEE Access, 6:52138–52160, 2018.
4. J. Adebayo, J. Gilmer, M. Muelly, I. Goodfellow, M. Hardt, and B. Kim. Sanity checks for saliency maps. In Advances in Neural Information Processing Systems, pages 9505–9515, 2018.
5. J. Adebayo, M. Muelly, I. Liccardi, and B. Kim. Debugging tests for model explanations. arXiv preprint arXiv:2011.05429, 2020.
6. R. Agarwal, N. Frosst, X. Zhang, R. Caruana, and G. E. Hinton. Neural additive models: Interpretable machine learning with neural nets. arXiv preprint arXiv:2004.13912, 2020.
7. E. Albini, A. Rago, P. Baroni, and F. Toni. Relation-based counterfactual explanations for bayesian network classifiers. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI (2020, To Appear), 2020.
8. D. Alvarez Melis and T. Jaakkola. Towards robust interpretability with self-explaining neural networks. Advances in Neural Information Processing Systems, 31:7775–7784, 2018.
9. S. Anjomshoae, T. Kampik, and K. Främling. Py-ciu: A python library for explaining machine learning predictions using contextual importance and utility. In IJCAI-PRICAI 2020 Workshop on Explainable Artificial Intelligence (XAI), 2020.
10. S. Anjomshoae, A. Najjar, D. Calvaresi, and K. Främling. Explainable agents and robots: Results from a systematic literature review. In , pages 1078–1088. International Foundation for Autonomous Agents and Multiagent Systems, 2019.
11. D. W. Apley and J. Zhu. Visualizing the effects of predictor variables in black box supervised learning models. arXiv preprint arXiv:1612.08468, 2016.
12. L. Arras, G. Montavon, K.-R. Müller, and W. Samek. Explaining recurrent neural network predictions in sentiment analysis. arXiv preprint arXiv:1706.07206, 2017.
13. A. B. Arrieta, N. Díaz-Rodríguez, J. Del Ser, A. Bennetot, S. Tabik, A. Barbado, S. García, S. Gil-López, D. Molina, R. Benjamins, et al. Explainable artificial intelligence (xai): Concepts, taxonomies, opportunities and challenges toward responsible ai. Information Fusion, 58:82–115, 2020.
14. A. Artelt. Ceml: Counterfactuals for explaining machine learning models – a python toolbox, 2019–2020.
15. A. Artelt and B. Hammer. On the computation of counterfactual explanations – a survey. arXiv preprint arXiv:1911.07749, 2019.
16. V. Arya, R. K. E. Bellamy, P.-Y. Chen, A. Dhurandhar, M. Hind, S. C. Hoffman, S. Houde, Q. V. Liao, R. Luss, A. Mojsilović, S. Mourad, P. Pedemonte, R. Raghavendra, J. Richards, P. Sattigeri, K. Shanmugam, M. Singh, K. R. Varshney, D. Wei, and Y. Zhang. One explanation does not fit all: A toolkit and taxonomy of ai explainability techniques, 2019.
17. S. Bach, A. Binder, G. Montavon, F. Klauschen, K.-R. Müller, and W. Samek. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PloS one, 10(7):e0130140, 2015.
18. D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
19. P. Biecek and T. Burzykowski. Explanatory model analysis, 2020. Data Science Series.
20. J. Bien and R. Tibshirani. Prototype selection for interpretable classification. The Annals of Applied Statistics, pages 2403–2424, 2011.
21. A. Blanco-Justicia, J. Domingo-Ferrer, S. Martínez, and D. Sánchez. Machine learning explainability via microaggregation and shallow decision trees. Knowledge-Based Systems, 2020.
22. O. Boz. Extracting decision trees from trained neural networks. In Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, 2002.
23. S. Bramhall, H. Horn, M. Tieu, and N. Lohia. Qlime – a quadratic local interpretable model-agnostic explanation approach. SMU Data Science Review, 3(1):4, 2020.
24. R. M. Byrne. Counterfactuals in explainable artificial intelligence (xai): Evidence from human reasoning. In IJCAI, pages 6276–6282, 2019.
25. R. M. Byrne and P. Johnson-Laird. If and or: Real and counterfactual possibilities in their truth and probability. Journal of Experimental Psychology: Learning, Memory, and Cognition, 46(4):760, 2020.
26. D. V. Carvalho, E. M. Pereira, and J. S. Cardoso. Machine learning interpretability: A survey on methods and metrics. Electronics, 8(8):832, 2019.
27. A. Chattopadhay, A. Sarkar, P. Howlader, and V. N. Balasubramanian. Grad-cam++: Generalized gradient-based visual explanations for deep convolutional networks. In , pages 839–847. IEEE, 2018.
28. C. Chen, O. Li, D. Tao, A. Barnett, C. Rudin, and J. K. Su. This looks like that: deep learning for interpretable image recognition. In Advances in neural information processing systems, pages 8930–8941, 2019.
29. J. Chen, L. Song, M. Wainwright, and M. Jordan. Learning to explain: An information-theoretic perspective on model interpretation. 2018.
30. J. Cheng, L. Dong, and M. Lapata. Long short-term memory-networks for machine reading. arXiv preprint arXiv:1601.06733, 2016.
31. H. Chipman, E. George, and R. McCulloh. Making sense of a forest of trees. Computing Science and Statistics, 1998.
32. A. Chouldechova. Fair prediction with disparate impact: A study of bias in recidivism prediction instruments. Big data, 5(2):153–163, 2017.
33. K. Chowdhary. Natural language processing. In Fundamentals of Artificial Intelligence, pages 603–649. Springer, 2020.
34. W. W. Cohen and Y. Singer. A simple, fast, and effective rule learner. AAAI/IAAI, 99(335-342):3, 1999.
35. G. Comandè. Regulating algorithms' regulation? first ethico-legal principles, problems, and opportunities of algorithms. In Transparent Data Mining for Big and Small Data, pages 169–206. Springer, 2017.
36. M. Craven and J. W. Shavlik. Extracting tree-structured representations of trained networks. In Advances in neural information processing systems, pages 24–30, 1996.
37. M. Danilevsky, K. Qian, R. Aharonov, Y. Katsis, B. Kawas, and P. Sen. A survey of the state of explainable ai for natural language processing. arXiv preprint arXiv:2010.00711, 2020.
38. S. Dash, O. Gunluk, and D. Wei. Boolean decision rules via column generation. In Advances in Neural Information Processing Systems, pages 4655–4665, 2018.
39. K. Dembczyński, W. Kotłowski, and R. Słowiński. Maximum likelihood rule ensembles. In Proceedings of the 25th international conference on Machine learning, pages 224–231, 2008.
40. A. Dhurandhar, P.-Y. Chen, R. Luss, C.-C. Tu, P. Ting, K. Shanmugam, and P. Das. Explanations based on the missing: Towards contrastive explanations with pertinent negatives. In Advances in Neural Information Processing Systems, 2018.
41. P. Domingos. Knowledge discovery via multiple models. Intelligent Data Analysis, 1998.
42. F. Doshi-Velez and B. Kim. Towards a rigorous science of interpretable machine learning. arXiv preprint arXiv:1702.08608, 2017.
43. F. Doshi-Velez and B. Kim. Considerations for evaluation and generalization in interpretable machine learning. In Explainable and interpretable models in computer vision and machine learning, pages 3–17. Springer, 2018.
44. F. K. Došilović, M. Brčić, and N. Hlupić. Explainable artificial intelligence: A survey. In , pages 0210–0215. IEEE, 2018.
45. R. ElShawi, Y. Sherif, M. Al-Mallah, and S. Sakr. Ilime: Local and global interpretable model-agnostic explainer of black-box decision. In European Conference on Advances in Databases and Information Systems, pages 53–68. Springer, 2019.
46. G. Erion, J. D. Janizek, P. Sturmfels, S. Lundberg, and S.-I. Lee. Learning explainable models using attribution priors. arXiv preprint arXiv:1906.10670, 2019.
47. A. A. Freitas. Comprehensible classification models: a position paper. ACM SIGKDD explorations newsletter, 15(1):1–10, 2014.
48. J. Friedman and B. E. Popescu. Predictive learning via rule ensembles. The Annals of Applied Statistics, 2:916–954, 2008.
49. A. Ghorbani, J. Wexler, J. Y. Zou, and B. Kim. Towards automatic concept-based explanations. In Advances in Neural Information Processing Systems, pages 9277–9286, 2019.
50. L. H. Gilpin, D. Bau, B. Z. Yuan, A. Bajwa, M. Specter, and L. Kagal. Explaining explanations: An overview of interpretability of machine learning. In , pages 80–89. IEEE, 2018.
51. M. Gleicher. A framework for considering comprehensibility in modeling. Big data, 4(2):75–88, 2016.
52. R. Goebel, A. Chander, K. Holzinger, F. Lecue, Z. Akata, S. Stumpf, P. Kieseberg, and A. Holzinger. Explainable ai: the new 42? In International cross-domain conference for machine learning and knowledge extraction, pages 295–303. Springer, 2018.
53. B. Goodman and S. Flaxman. Eu regulations on algorithmic decision-making and a "right to explanation". In ICML workshop on human interpretability in machine learning (WHI 2016), New York, NY. http://arxiv.org/abs/1606.08813v1, 2016.
54. Y. Goyal, A. Feder, U. Shalit, and B. Kim. Explaining classifiers with causal concept effect (cace). arXiv preprint arXiv:1907.07165, 2019.
55. R. Guidotti. Evaluating local explanation methods on ground truth. Artificial Intelligence, page 103428, 2020.
56. R. Guidotti, A. Monreale, S. Matwin, and D. Pedreschi. Black box explanation by learning image exemplars in the latent feature space. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 189–205. Springer, 2019.
57. R. Guidotti, A. Monreale, S. Matwin, and D. Pedreschi. Explaining image classifiers generating exemplars and counter-exemplars from latent representations. In AAAI, pages 13665–13668, 2020.
58. R. Guidotti, A. Monreale, S. Ruggieri, D. Pedreschi, F. Turini, and F. Giannotti. Local rule-based explanations of black box decision systems. CoRR, 2018.
59. R. Guidotti, A. Monreale, S. Ruggieri, F. Turini, F. Giannotti, and D. Pedreschi. A survey of methods for explaining black box models. ACM computing surveys (CSUR), 51(5):1–42, 2018.
60. R. Guidotti, A. Monreale, F. Spinnato, D. Pedreschi, and F. Giannotti. Explaining any time series classifier. In CogMI, page 1. IEEE, 2020.
61. K. S. Gurumoorthy, A. Dhurandhar, G. Cecchi, and C. Aggarwal. Efficient data representation by selecting prototypes with importance weights. In , pages 260–269. IEEE, 2019.
62. D. J. Hand and R. J. Till. A simple generalisation of the area under the roc curve for multiple class classification problems. Machine learning, 45(2):171–186, 2001.
63. P. Hase and M. Bansal. Evaluating explainable ai: Which algorithmic explanations help users predict model behavior? arXiv preprint arXiv:2005.01831, 2020.
64. T. J. Hastie and R. J. Tibshirani. Generalized additive models, volume 43. CRC press, 1990.
65. M. Hind, D. Wei, M. Campbell, N. C. Codella, A. Dhurandhar, A. Mojsilović, K. Natesan Ramamurthy, and K. R. Varshney. Ted: Teaching ai to explain its decisions. In Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society, pages 123–129, 2019.
66. B. Hoover, H. Strobelt, and S. Gehrmann. exbert: A visual analysis tool to explore learned representations in transformers models. arXiv preprint arXiv:1910.05276, 2019.
67. S. Jain and B. C. Wallace. Attention is not explanation. arXiv preprint arXiv:1902.10186, 2019.
68. J. V. Jeyakumar, J. Noor, Y.-H. Cheng, L. Garcia, and M. Srivastava. How can i explain this to you? an empirical study of deep neural network explanation methods. Advances in Neural Information Processing Systems, 33, 2020.
69. K. Kanamori et al. Dace: Distribution-aware counterfactual explanation by mixed-integer linear optimization. In IJCAI-20, International Joint Conferences on Artificial Intelligence Organization, 2020.
70. A. Kapishnikov, T. Bolukbasi, F. Viégas, and M. Terry. Xrai: Better attributions through regions. In Proceedings of the IEEE International Conference on Computer Vision, pages 4948–4957, 2019.
71. A.-H. Karimi et al. Model-agnostic counterfactual explanations for consequential decisions. In International Conference on Artificial Intelligence and Statistics. PMLR, 2020.
72. M. N. Katehakis and A. F. Veinott Jr. The multi-armed bandit problem: decomposition and computation. Mathematics of Operations Research, 12, 1987.
73. B. Kim, C. M. Chacha, and J. A. Shah. Inferring team task plans from human meetings: A generative modeling approach with logic-based prior. Journal of Artificial Intelligence Research, 2015.
74. B. Kim, R. Khanna, and O. Koyejo. Examples are not enough, learn to criticize! criticism for interpretability. NIPS'16, 2016.
75. B. Kim, M. Wattenberg, J. Gilmer, C. Cai, J. Wexler, F. Viegas, et al. Interpretability beyond feature attribution: Quantitative testing with concept activation vectors (tcav). In International conference on machine learning, pages 2668–2677. PMLR, 2018.
76. P. W. Koh and P. Liang. Understanding black-box predictions via influence functions. arXiv preprint arXiv:1703.04730, 2017.
77. A. Kurenkov. Lessons from the pulse model and discussion. The Gradient, 2020.
78. H. Lakkaraju, S. H. Bach, and J. Leskovec. Interpretable decision sets: A joint framework for description and prediction. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, pages 1675–1684, 2016.
79. O. Lampridis, R. Guidotti, and S. Ruggieri. Explaining sentiment classification with synthetic exemplars and counter-exemplars. In International Conference on Discovery Science, pages 357–373. Springer, 2020.
80. S. Lapuschkin, S. Wäldchen, A. Binder, G. Montavon, W. Samek, and K.-R. Müller. Unmasking clever hans predictors and assessing what machines really learn. Nature communications, 10(1):1–8, 2019.
81. B. Letham, C. Rudin, T. H. McCormick, D. Madigan, et al. Interpretable classifiers using rules and bayesian analysis: Building a better stroke prediction model. The Annals of Applied Statistics, 9(3):1350–1371, 2015.
82. H. Li, Y. Tian, K. Mueller, and X. Chen. Beyond saliency: understanding convolutional neural networks from saliency prediction on layer-wise relevance propagation. Image and Vision Computing, 83:70–86, 2019.
83. J. Li, W. Monroe, and D. Jurafsky. Understanding neural networks through representation erasure. arXiv preprint arXiv:1612.08220, 2016.
84. S. M. Lundberg and S.-I. Lee. A unified approach to interpreting model predictions. In Advances in neural information processing systems, pages 4765–4774, 2017.
85. R. Luss, P.-Y. Chen, A. Dhurandhar, P. Sattigeri, Y. Zhang, K. Shanmugam, and C.-C. Tu. Generating contrastive explanations with monotonic attribute functions. arXiv preprint arXiv:1905.12698, 2019.
86. D. Martens, B. Baesens, T. Van Gestel, and J. Vanthienen. Comprehensible credit scoring models using rule extraction from support vector machines. European journal of operational research, 183(3):1466–1476, 2007.
87. T. Miller. Explanation in artificial intelligence: Insights from the social sciences. Artificial Intelligence, 267:1–38, 2019.
88. Y. Ming, H. Qu, and E. Bertini. Rulematrix: Visualizing and understanding classifiers with rules. IEEE transactions on visualization and computer graphics, 25(1):342–352, 2018.
89. I. Mollas, N. Bassiliades, and G. Tsoumakas. Lionets: local interpretation of neural networks through penultimate layer decoding. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 265–276. Springer, 2019.
90. C. Molnar. Interpretable Machine Learning. Lulu.com, 2020.
91. R. K. Mothilal, A. Sharma, and C. Tan. Explaining machine learning classifiers through diverse counterfactual explanations. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, 2020.
92. W. J. Murdoch, C. Singh, K. Kumbier, R. Abbasi-Asl, and B. Yu. Definitions, methods, and applications in interpretable machine learning. Proceedings of the National Academy of Sciences, 116(44):22071–22080, 2019.
93. H. Nori, S. Jenkins, P. Koch, and R. Caruana. Interpretml: A unified framework for machine learning interpretability. arXiv preprint arXiv:1909.09223, 2019.
94. C. Panigutti, A. Perotti, and D. Pedreschi. Doctor xai: an ontology-based approach to black-box sequential data classification explanations. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, pages 629–639, 2020.
95. F. Pasquale. The black box society: The secret algorithms that control money and information. Harvard University Press, 2015.
96. T. Peltola. Local interpretable model-agnostic explanations of bayesian predictive models via kullback-leibler projections. arXiv preprint arXiv:1810.02678, 2018.
97. V. Petsiuk, A. Das, and K. Saenko. Rise: Randomized input sampling for explanation of black-box models. arXiv preprint arXiv:1806.07421, 2018.
98. P. Pezeshkpour, Y. Tian, and S. Singh. Investigating robustness and interpretability of link prediction via adversarial modifications. arXiv preprint arXiv:1905.00563, 2019.
99. G. Plumb, D. Molitor, and A. S. Talwalkar. Model agnostic supervised local explanations. In Advances in Neural Information Processing Systems, 2018.
100. R. Poyiadzi, K. Sokol, R. Santos-Rodriguez, T. De Bie, and P. Flach. Face: feasible and actionable counterfactual explanations. In Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society, 2020.
101. N. F. Rajani, B. McCann, C. Xiong, and R. Socher. Explain yourself! leveraging language models for commonsense reasoning. arXiv preprint arXiv:1906.02361, 2019.
102. M. T. Ribeiro, S. Singh, and C. Guestrin. "Why should I trust you?" Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, pages 1135–1144, 2016.
103. M. T. Ribeiro, S. Singh, and C. Guestrin. Anchors: High-precision model-agnostic explanations. In AAAI, volume 18, pages 1527–1535, 2018.
104. M. Robnik-Šikonja and I. Kononenko. Explaining classifications for individual instances. IEEE Transactions on Knowledge and Data Engineering, 20(5), 2008.
105. W. Samek, G. Montavon, A. Vedaldi, L. K. Hansen, and K.-R. Müller. Explainable AI: interpreting, explaining and visualizing deep learning, volume 11700. Springer Nature, 2019.
106. R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra. Grad-cam: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE international conference on computer vision, pages 618–626, 2017.
107. M. Setzu, R. Guidotti, A. Monreale, and F. Turini. Global explanations with local scoring. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 159–171. Springer, 2019.
108. S. M. Shankaranarayana and D. Runje. Alime: Autoencoder based approach for local interpretability. In International Conference on Intelligent Data Engineering and Automated Learning, pages 454–463. Springer, 2019.
109. S. Shi, X. Zhang, and W. Fan. A modified perturbed sampling method for local interpretable model-agnostic explanation. arXiv preprint arXiv:2002.07434, 2020.
110. A. Shrikumar, P. Greenside, and A. Kundaje. Learning important features through propagating activation differences. arXiv preprint arXiv:1704.02685, 2017.
111. K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
112. D. Smilkov, N. Thorat, B. Kim, F. Viégas, and M. Wattenberg. Smoothgrad: removing noise by adding noise. arXiv preprint arXiv:1706.03825, 2017.
113. S. Srivastava, I. Labutov, and T. Mitchell. Joint concept learning and semantic parsing from natural language explanations. In Proceedings of the 2017 conference on empirical methods in natural language processing, pages 1527–1536, 2017.
114. A. Suissa-Peleg, D. Haehn, S. Knowles-Barley, V. Kaynig, T. R. Jones, A. Wilson, R. Schalek, J. W. Lichtman, and H. Pfister. Automatic neural reconstruction from petavoxel of electron microscopy data. Microscopy and Microanalysis, 2016.
115. M. Sundararajan, A. Taly, and Q. Yan. Axiomatic attribution for deep networks. arXiv preprint arXiv:1703.01365, 2017.
116. S. Tan, M. Soloviev, G. Hooker, and M. T. Wells. Tree space prototypes: Another look at making tree ensembles interpretable. In Proceedings of the 2020 ACM-IMS on Foundations of Data Science Conference, pages 23–34, 2020.
117. E. Tjoa and C. Guan. A survey on explainable artificial intelligence (xai): towards medical xai. arXiv preprint arXiv:1907.07374, 2019.
118. A. Van Looveren and J. Klaise. Interpretable counterfactual explanations guided by prototypes. arXiv preprint arXiv:1907.02584, 2019.
119. A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008, 2017.
120. S. Verma, J. Dickerson, and K. Hines. Counterfactual explanations for machine learning: A review. arXiv preprint arXiv:2010.10596, 2020.
121. S. Wachter, B. Mittelstadt, and L. Floridi. Why a right to explanation of automated decision-making does not exist in the general data protection regulation. International Data Privacy Law, 7(2):76–99, 2017.
122. S. Wachter, B. Mittelstadt, and C. Russell. Counterfactual explanations without opening the black box: Automated decisions and the gdpr. Harv. JL & Tech., 2017.
123. S. M. Weiss and N. Indurkhya. Lightweight rule induction. In Proceedings of the Seventeenth International Conference on Machine Learning, ICML '00, pages 1135–1142. Morgan Kaufmann Publishers Inc., 2000.
124. J. J. Williams, J. Kim, A. Rafferty, S. Maldonado, K. Z. Gajos, W. S. Lasecki, and N. Heffernan. Axis: Generating explanations at scale with learnersourcing and machine learning. In Proceedings of the Third (2016) ACM Conference on Learning @ Scale, L@S '16, 2016.
125. Z. Wu and D. C. Ong. Context-guided bert for targeted aspect-based sentiment analysis. arXiv preprint arXiv:2010.07523, 2020.
126. K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, and Y. Bengio. Show, attend and tell: Neural image caption generation with visual attention. In International conference on machine learning, pages 2048–2057, 2015.
127. H. Yang, C. Rudin, and M. Seltzer. Scalable bayesian rule lists. In International Conference on Machine Learning, pages 3921–3930. PMLR, 2017.
128. M. Yang and B. Kim. Bim: Towards quantitative evaluation of interpretability methods with ground truth. arXiv preprint arXiv:1907.09701, 2019.
129. C.-K. Yeh, B. Kim, S. Arik, C.-L. Li, T. Pfister, and P. Ravikumar. On completeness-aware concept-based explanations in deep neural networks. Advances in Neural Information Processing Systems, 33, 2020.
130. M. R. Zafar and N. M. Khan. Dlime: A deterministic local interpretable model-agnostic explanations approach for computer-aided diagnosis systems. arXiv preprint arXiv:1906.10263, 2019.
131. Y. Zhang and X. Chen. Explainable recommendation: A survey and new perspectives. arXiv preprint arXiv:1804.11192, 2018.
132. Y. Zhou and G. Hooker. Interpreting models via single tree approximation. arXiv preprint arXiv:1610.09036.