Towards Unifying Feature Attribution and Counterfactual Explanations: Different Means to the Same End
Ramaravind K. Mothilal
Microsoft Research [email protected]
Divyat Mahajan
Microsoft Research [email protected]
Chenhao Tan
University of Colorado [email protected]
Amit Sharma
Microsoft Research [email protected]
ABSTRACT
To explain a machine learning model, there are two main approaches: feature attributions that assign an importance score to each input feature, and counterfactual explanations that provide input examples with minimal changes to alter the model's prediction. To unify these approaches, we provide an interpretation based on the actual causality framework and present two key results in terms of their use. First, we present a method to generate feature attribution explanations from a set of counterfactual examples. These feature attributions convey how important a feature is to changing the classification outcome of a model, especially on whether a subset of features is necessary and/or sufficient for that change, which feature attribution methods are unable to provide. Second, we show how counterfactual examples can be used to evaluate the goodness of an attribution-based explanation in terms of its necessity and sufficiency. As a result, we highlight the complementarity of these two approaches. Our evaluation on three benchmark datasets — Adult-Income, LendingClub, and German-Credit — confirms the complementarity. Feature attribution methods like LIME and SHAP and counterfactual explanation methods like Wachter et al. and DiCE often do not agree on feature importance rankings. In addition, by restricting the features that can be modified for generating counterfactual examples, we find that the top-k features from LIME or SHAP are often neither necessary nor sufficient explanations of a model's prediction. Finally, we present a case study of different explanation methods on a real-world hospital triage problem.
As complex machine learning (ML) models are being deployed in high-stakes domains like finance and healthcare, explaining why they make a certain prediction has emerged as a critical task. Explanations of an ML model's prediction have found many uses, including to understand the most important features [33, 41], discover any unintended bias [46], debug the model [28], increase trust [27, 31], and provide recourse suggestions for unfavorable predictions [55]. There are two main kinds of explanations: attribution-based and counterfactual-based. Attribution-based explanations provide a score or ranking over features, conveying the relative importance of each feature to the model's output. Example methods include local function approximation using linear models [41] and game-theoretic attribution such as Shapley values [33]. The second kind, counterfactual-based explanations, instead generates examples that have an alternative model output with minimum changes in the input features, known as counterfactual examples (CFs) [55]. Because of the differences in the type of output and how they are generated, attribution- and CF-based explanations are largely studied independently of each other.

In this paper, we demonstrate the fundamental connections between attribution-based and counterfactual-based explanations (see Fig. 1). First, we show how counterfactual-based explanations can be used to evaluate attribution-based explanations on key properties. In particular, we consider necessity (is a feature value necessary for the model's output?) and sufficiency (is the feature value sufficient for generating the model output?). Second, we propose a simple method by which counterfactual-based explanations can generate an importance ranking for features, just like attribution-based explanations, and study the correlation between these feature importance scores. Therefore, rather than being separate, they are complementary methods towards the same goal.
Figure 1: Complementarity of explanation methods.

To provide a formal connection, we introduce to the explanation literature the framework of actual causality [16], which reasons about the causes of a particular event, unlike the more common causal inference setting that estimates the effect of a particular event [39, 42]. Using actual causality, we provide a definition of a model explanation and propose two desirable properties of an explanation: necessity and sufficiency for generating a given model output. A good explanation should satisfy both, but we find that current explanation methods optimize either one of them. CF-based methods like Wachter et al. (henceforth “WachterCF”) and DiCE [37] find examples that highlight the necessary feature values for a given model output, whereas attribution-based methods like LIME [41] and SHAP [33] focus on the sufficiency of a feature value. Thus, the actual causality framework underscores their complementarity: we need both necessity and sufficiency for a good explanation.
Our empirical analysis, using LIME and SHAP as examples of attribution-based methods and WachterCF and DiCE as examples of counterfactual-based methods, confirms this complementarity. First, we show that counterfactual-based methods can be used to evaluate explanations from LIME and SHAP. By allowing only a specific feature to change in generating CFs, we can evaluate the necessity of the feature's value for the model's predicted output. Similarly, by generating CFs that change all but a specific feature, we can evaluate the sufficiency of the feature's value for causing the model's outcome. On benchmark datasets related to income or credit predictions (Adult-Income, German-Credit and LendingClub), we find that the top-ranked features from LIME and SHAP are often neither necessary nor sufficient. In particular, for Adult-Income and German-Credit, more counterfactuals can be generated by using features except the top-3 than by using any of the top-3 features, and it is easy to generate counterfactuals even if one of the top-ranked features is not changed at all.

Second, we show that CF examples can be used to generate feature importance scores that complement the scores from LIME and SHAP. The scores from DiCE and WachterCF do not always agree with those from attribution-based methods: DiCE and WachterCF tend to assign relatively higher scores to low-ranked features from LIME and SHAP, likely because it is possible to generate valid CFs using those features as well. Ranks generated by the four methods also disagree: not only do attribution-based methods disagree with counterfactual-based methods, but LIME and SHAP also disagree on many features, and so do WachterCF and DiCE.

Our results reveal the importance of considering multiple explanation methods to understand the prediction of an ML model. Different methods have different objectives (and empirical approximations); hence, a single method may not convey the full picture. To demonstrate the value of considering multiple kinds of explanation, we analyze a high-dimensional real-world dataset that has over 200 features, where the ML model's task is to predict whether a patient will be admitted to a hospital. The differences observed above are magnified: an analyst may reach widely varying conclusions about the ML model depending on which explanation method they choose.
DiCE considers triage features as the most important, LIME considers chief-complaint features as the most important, while SHAP identifies demographic features as the most important. We also find odd results with LIME on necessity: changing the 3rd most important feature produces more valid CFs than changing the most important feature.

To summarize, we make the following contributions:
• A unifying framework for attribution-based and counterfactual explanations using actual causality;
• A method to evaluate attribution-based methods on the necessity and sufficiency of their top-ranked features;
• An empirical investigation of explanations using both commonly used datasets and a high-dimensional dataset.
We discuss the desirable properties that any explanation method should have, the two main types of explanations, and how different explanation methods compare to each other. There is also important work on building intelligible models by design [8, 32, 43] that we do not discuss here.
Explanations serve a variety of purposes, including debugging for the model developer, evaluating properties for an auditor, and providing recourse and trust for an end individual. Therefore, it is natural that explanations have multiple desirable properties based on the context. Sokol and Flach [50] and Miller [35] list the different properties that an explanation should ideally adhere to. Different works have evaluated the soundness (truthfulness to the ML model), completeness (generalizability to other examples), parsimony, and actionability of explanations. In general, counterfactual-based methods optimize soundness over completeness, while methods that summarize data to produce an attribution score are less sound but optimize for completeness.

In comparison, the notions of necessity and sufficiency of a feature value for a model's output are less studied. In natural language processing (NLP), sufficiency and comprehensiveness have been defined based on the output probability in the context of rationale evaluation (e.g., whether a subset of words leads to the same predicted probability as the full text) [7, 12, 56]. By using a formal framework of actual causality [16], we define necessity and sufficiency metrics for explaining any ML model, provide a method using counterfactual examples to compute them, and evaluate common explanation methods on them.
The majority of work in explainable ML provides attribution-based explanations [48, 52]. Feature attribution methods are local explanation techniques that assign importance scores to features based on certain criteria, such as by approximating the local decision boundary [41] or estimating the Shapley value [33]. A feature's score captures its contribution to the predicted value of an instance. In contrast, counterfactual explanations [9, 13, 20, 40, 54, 55] are minimally tweaked versions of the original input that lead to a different predicted outcome than the original prediction. In addition to proximity to the original input, it is important to ensure feasibility [21], real-time response [45], and diversity among counterfactuals [37, 44].

We provide a unified view of these two explanations. They need not be considered separate: counterfactuals can provide another way to generate feature attributions, as suggested by Sharma et al. [47] and Barocas et al. [6]. We extend this intuition by conducting an extensive empirical study on the attributions generated by counterfactuals, and by comparing them to other attribution-based methods. In addition, we introduce a formal causality framework to show how different explanation methods simply correspond to different notions of a feature “causing” the model output: counterfactuals focus on the necessity of a feature while other methods tend to focus on its sufficiency to cause the model output.
Let 𝑓(𝒙) be a machine learning model and 𝒙 denote a vector of 𝑑 features, (𝑥_1, 𝑥_2, ..., 𝑥_𝑑). Given input 𝒙 and the output 𝑓(𝒙), a common explanation task is to determine which features are responsible for this particular prediction.

Though both attribution-based and counterfactual-based methods aim to explain a model's output at a given input, the difference and similarity in their implications are not clear. While feature attributions highlight features that are important in terms of their contributions to the model prediction, this does not imply that changing important features is sufficient or necessary to lead to a different (desired) outcome. Similarly, while CF explanations provide insights for reaching a different outcome, the features changed may not include the most important features of feature attribution methods.

Below we show that while these explanation methods may appear distinct, they are all motivated by the same principle of whether a feature is a “cause” of the model's prediction, and to what extent. We provide a formal framework based on actual causality [16] to interpret them.

We first define an actual cause and how it can be used to explain an event. In our case, the classifier's prediction is an event, and the input features are the potential causes of the event. According to Halpern [16], causes of an event are defined w.r.t. a structural causal model (SCM) that defines the relationship between the potential causes and the event. In our case, the learnt ML model 𝑓 is the SCM (𝑀) that governs how the prediction output is generated from the input features. The structure of the SCM consists of each feature as a node that causes other intermediate nodes (e.g., different layers of a neural network), which finally lead to the output node. We assume that the feature values are generated from an unknown process governed by a set of parameters that we collectively denote as 𝒖, or the context. Together, (𝑀, 𝒖) define a specific configuration of the input 𝒙 and the output 𝑓(𝒙) of the model.

For simplicity, the following definitions assume that individual features are independent of each other, and thus any feature can be changed without changing other features. However, in explanation goals such as algorithmic recourse it is important to consider the causal dependencies between features themselves [21, 25, 34]; we leave such considerations for future work.
Definition 3.1 (Actual Cause, original definition [16]). A subset of feature values 𝒙_𝑗 = 𝑎 is an actual cause of the model output 𝑓(𝒙_{−𝑗} = 𝑏, 𝒙_𝑗 = 𝑎) = 𝑦* under the causal setting (𝑀, 𝒖) if all the following conditions hold:
(1) Given (𝑀, 𝒖), 𝒙_𝑗 = 𝑎 and 𝑓(𝒙_{−𝑗} = 𝑏, 𝒙_𝑗 = 𝑎) = 𝑦*.
(2) There exists a subset of features 𝑊 ⊆ 𝒙_{−𝑗} such that if 𝑊 is set to 𝑤′, then (𝒙_𝑗 ← 𝑎, 𝑊 ← 𝑤′) ⇒ (𝑦 = 𝑦*) and (𝒙_𝑗 ← 𝑎′, 𝑊 ← 𝑤′) ⇒ 𝑦 ≠ 𝑦* for some value 𝑎′.
(3) 𝒙_𝑗 is minimal, namely, there is no strict subset 𝒙_𝑠 ⊂ 𝒙_𝑗 such that 𝒙_𝑠 = 𝑎_𝑠 satisfies conditions 1 and 2, where 𝑎_𝑠 ⊂ 𝑎.

In the notation above, 𝒙_𝑖 ← 𝑣 denotes that 𝒙_𝑖 is intervened on and set to the value 𝑣, irrespective of its observed value under (𝑀, 𝒖). Intuitively, a feature value 𝒙_𝑗 = 𝑎 is an actual cause of 𝑦* if under some value 𝑏′ of the other features 𝒙_{−𝑗}, there exists a value 𝑎′ ≠ 𝑎 such that 𝑓(𝒙_{−𝑗} = 𝑏′, 𝑎′) ≠ 𝑦* and 𝑓(𝒙_{−𝑗} = 𝑏′, 𝑎) = 𝑦*.

For instance, consider a linear model over three binary features, 𝑓(𝑥_1, 𝑥_2, 𝑥_3) = 𝐼(𝑤_1𝑥_1 + 𝑤_2𝑥_2 + 𝑤_3𝑥_3 ≥ 𝜃), where the positive weights are such that a positive prediction requires 𝑥_1 = 1 but 𝑥_1 alone is not enough (𝑤_2 + 𝑤_3 < 𝜃 and 𝑤_1 < 𝜃 ≤ 𝑤_1 + 𝑤_2), and an observed prediction of 𝑦 = 1. Here a feature value 𝑥_𝑗 = 1 can be an actual cause of 𝑦 = 1 even if changing it alone would not change the prediction; a simpler and stricter notion is the but-for cause.

Definition 3.2 (But-for Cause).
A subset of feature values 𝒙_𝑗 = 𝑎 is a but-for cause of the model output 𝑓(𝒙_{−𝑗} = 𝑏, 𝒙_𝑗 = 𝑎) = 𝑦* under the causal setting (𝑀, 𝒖) if it is an actual cause and the empty set 𝑊 = ∅ satisfies condition 2.

That is, changing the value of 𝒙_𝑗 alone changes the prediction of the model at 𝒙. On the linear model above, we now obtain a better picture: 𝑥_1 is a but-for cause for 𝑦 = 1 in every context where 𝑦 = 1, whereas 𝑥_2 and 𝑥_3 are but-for causes for 𝑦 = 1 only in the special contexts where the weighted sum exceeds the threshold only because of the feature in question.
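As a concrete check of the but-for condition on a model of this form, the following sketch enumerates all contexts of a three-binary-feature threshold model. The weights and threshold are assumed illustrative values chosen only to satisfy the constraints stated above; they are not coefficients given in the text.

import itertools

# Illustrative weights and threshold (assumed values): x1 is required to
# cross the threshold (w2 + w3 < theta) but is not sufficient on its own
# (w1 < theta <= w1 + w2).
W = [0.4, 0.2, 0.1]
THETA = 0.5

def f(x):
    """Linear threshold model over three binary features."""
    return int(sum(w * xi for w, xi in zip(W, x)) >= THETA)

def is_but_for_cause(x, j):
    """x_j's observed value is a but-for cause of f(x) if flipping x_j
    alone (keeping all other features fixed) changes the output."""
    x_flipped = list(x)
    x_flipped[j] = 1 - x_flipped[j]
    return f(x_flipped) != f(x)

# Enumerate all contexts (feature configurations) with a positive prediction.
for x in itertools.product([0, 1], repeat=3):
    if f(x) == 1:
        causes = [j + 1 for j in range(3) if is_but_for_cause(x, j)]
        print(x, "-> but-for causes:", causes)

Under these assumed weights, x_1 appears as a but-for cause in every positively predicted context, while x_2 and x_3 appear only in the contexts where the sum barely clears the threshold, matching the discussion above.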
While the but-for condition captures the necessity of a particular feature subset for the obtained model output, it does not capture sufficiency. Sufficiency means that setting a feature subset 𝒙_𝑗 ← 𝑎 will always lead to the given model output, irrespective of the values of the other features. To capture sufficiency, therefore, we need an additional condition:

𝒙_𝑗 ← 𝑎 ⇒ 𝑦 = 𝑦*   ∀ 𝒖 ∈ 𝑈    (1)

That is, for the feature subset value 𝒙_𝑗 = 𝑎 to be a sufficient cause, the above statement should be valid in all possible contexts. Based on the above definitions, we are now ready to define an ideal explanation that combines the idea of an actual cause and sufficiency.

Definition 3.3 (Ideal Model Explanation). A subset of feature values 𝒙_𝑗 = 𝑎 is an explanation for a model output 𝑦* relative to a set of contexts 𝑈, if
(1) Existence: There exists a context 𝒖 ∈ 𝑈 such that 𝒙_𝑗 = 𝑎 and 𝑓(𝒙_{−𝑗} = 𝑏, 𝒙_𝑗 = 𝑎) = 𝑦*.
(2) Necessity: For each context 𝒖 ∈ 𝑈 where 𝒙_𝑗 = 𝑎 and 𝑓(𝒙_{−𝑗} = 𝑏, 𝒙_𝑗 = 𝑎) = 𝑦*, some feature subset 𝒙_𝑠𝑢𝑏 ⊆ 𝒙_𝑗 is an actual cause under (𝑀, 𝒖) (satisfies conditions 1–3 from Definition 3.1).
(3) Sufficiency: For all contexts 𝒖′ ∈ 𝑈, 𝒙_𝑗 ← 𝑎 ⇒ 𝑦 = 𝑦*.
(4) Minimality: 𝒙_𝑗 is minimal, namely, there is no strict subset 𝒙_𝑠 ⊂ 𝒙_𝑗 such that 𝒙_𝑠 = 𝑎_𝑠 satisfies conditions 1–3 above, where 𝑎_𝑠 ⊂ 𝑎.

This definition captures the intuitive meaning of an explanation. For a given feature 𝑥, condition 2 states that the feature affects the output (the output changes if the feature is changed under certain conditions), and condition 3 states that as long as the feature is unchanged, the output cannot be changed. In practice, however, it is rare to find such clean explanations of an ML model's output. Even in our simple linear model above, no feature is sufficient to cause the output.

For most realistic ML models, an ideal explanation is impractical. Therefore, we now describe the concept of partial explanations [16], which relaxes the necessity and sufficiency conditions to consider the fraction of contexts over which these conditions are valid. Partial explanations are characterized by two metrics. The first metric captures the extent to which a feature value is necessary to cause the model's (original) output:

𝛼 = Pr(𝒙_𝑗 = 𝑎 is a cause of 𝑦* | 𝒙_𝑗 = 𝑎, 𝑦 = 𝑦*)    (2)

where 'is a cause' means that 𝒙_𝑗 = 𝑎 satisfies Definition 3.1. The second metric captures sufficiency using the conditional probability of the outcome given the feature's value:

𝛽 = Pr(𝑦 = 𝑦* | 𝒙_𝑗 ← 𝑎)    (3)

where 𝒙_𝑗 ← 𝑎 denotes an intervention to set 𝒙_𝑗 to 𝑎. Both probabilities are over the set of contexts. Combined, they can be called the (𝛼, 𝛽) goodness of an explanation. When both 𝛼 = 𝛽 = 1, we obtain an ideal explanation: 𝛼 = 1 implies that 𝒙_𝑗 = 𝑎 is a necessary cause of 𝑦 = 𝑦*, and 𝛽 = 1 implies that 𝒙_𝑗 = 𝑎 is a sufficient cause of 𝑦 = 𝑦*. In other words, a feature value 𝒙_𝑗 = 𝑎 is a good explanation for a model's output 𝑦* if the feature value is an actual cause of the outcome and 𝑦 = 𝑦* with high probability whenever 𝒙_𝑗 = 𝑎.

Armed with the (𝛼, 𝛽) goodness-of-explanation metrics, we now show how common explanation methods can be considered as special cases of the above framework.
Counterfactual-based explanations.
First, we show how counterfactual-based explanations relate to (𝛼, 𝛽): when only but-for causes (instead of general actual causes) are allowed, 𝛼 and 𝛽 capture the intuition behind counterfactuals. Given 𝑦* and a candidate feature subset 𝒙_𝑗, 𝛼 corresponds to the fraction of contexts where 𝒙_𝑗 is a but-for cause. That means, keeping everything else constant and only changing 𝒙_𝑗, how often does the classifier's outcome change? Eqn. 2 reduces to

𝛼_CF = Pr((𝒙_𝑗 ← 𝑎′ ⇒ 𝑦 ≠ 𝑦*) | 𝒙_𝑗 = 𝑎, 𝑦 = 𝑦*)    (4)

where the probability is over a reasonable set of contexts (e.g., all possible values for discrete features and a bounded region around the original feature value for continuous features). By definition, each of the perturbed inputs above that changes the value of 𝑦 can be considered a counterfactual example [55]. Counterfactual explanation methods aim to find the smallest perturbation in the feature values that changes the output, and correspondingly the modified feature subset 𝒙_𝑗 is a but-for cause of the output. 𝛼_CF provides a metric to summarize the outcomes of all such perturbations and to rank any feature subset for its necessity in generating the original model output. In practice, however, computing 𝛼 is computationally prohibitive, and therefore explanation methods empirically find a set of counterfactual examples and allow (manual) analysis of the found counterfactuals. In §4, we will see how we can develop a feature importance score using counterfactuals that is inspired by the 𝛼_CF formulation.

𝛽 corresponds to the fraction of contexts where 𝒙_𝑗 = 𝑎 is sufficient to keep 𝑦 = 𝑦*. That corresponds to the degree of sufficiency of the feature subset: keep 𝒙_𝑗 constant but change everything else and check how often the outcome remains the same. While not common, such a perturbation can be considered a special case of the counterfactual generation process, where we specifically restrict change in the given feature set. A similar idea is explored in (local) anchor explanations (Ribeiro et al.). It is also related to pertinent positives and pertinent negatives [13].
Attribution-based explanations.
Next, we show the connection of attribution-based explanations with (𝛼, 𝛽). 𝛽 is defined as in Eqn. 3, the fraction of all contexts where 𝒙_𝑗 ← 𝑎 leads to 𝑦 = 𝑦*. Depending on how we define the set of all contexts, we obtain different local attribution-based explanations. The total number of contexts is 2^𝑚 for 𝑚 binary features and is infinite for continuous features. For ease of exposition, we consider binary features below.

LIME can be interpreted as estimating 𝛽 for a restricted set of contexts (random samples) near the input point. Rather than checking Eqn. 1 for each of the randomly sampled points and estimating 𝛽 using Eqn. 3, it uses linear regression to estimate 𝛽(𝑎, 𝑦*) − 𝛽(𝑎′, 𝑦*). Note that the linear regression estimate E[𝑌 | 𝒙_𝑗 = 𝑎] − E[𝑌 | 𝒙_𝑗 = 𝑎′] is equivalent to Pr[𝑌 = 1 | 𝒙_𝑗 = 𝑎] − Pr[𝑌 = 1 | 𝒙_𝑗 = 𝑎′] for a binary 𝑦. It estimates effects for all features at once using linear regression, assuming that each feature's importance is independent.

Shapley value-based methods take a different approach. The Shapley value for a feature is defined as the number of times that including the feature leads to the observed outcome, averaged over all possible configurations of the other input features. That is, they define the valid contexts for a feature value as all valid configurations of the other features (of size 2^{𝑚−1}). The intuition is to see, at different values of the other features, whether the given feature value is sufficient to cause the desired model output 𝑦*. The goal of estimating Shapley values corresponds to the equation for 𝛽 described above (with an additional term comparing it to the baseline).

Note how the selection of contexts effectively defines the type of attribution-based explanation method [25, 51]. For example, we may weigh the contexts based on their likelihood to obtain a probability distribution over contexts, leading to feasible attribution-based explanations [2].
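To make the connection between Shapley values and 𝛽 concrete, the sketch below computes exact Shapley values for a small binary-feature model by enumerating coalitions, with absent features set to an all-zero baseline. The model, weights, and baseline are assumptions chosen purely for illustration.

from itertools import combinations
from math import factorial

def shapley_values(f, x, baseline, n):
    """Exact Shapley values by enumerating all coalitions of the other
    features; absent features are set to their baseline values."""
    def value(coalition):
        # Evaluate f with features in `coalition` taken from x, rest from baseline.
        z = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return f(z)

    phi = [0.0] * n
    for j in range(n):
        others = [i for i in range(n) if i != j]
        for size in range(len(others) + 1):
            for S in combinations(others, size):
                weight = factorial(size) * factorial(n - size - 1) / factorial(n)
                phi[j] += weight * (value(set(S) | {j}) - value(set(S)))
    return phi

# Illustrative model: the same thresholded linear form as before (assumed weights).
f = lambda z: int(0.4 * z[0] + 0.2 * z[1] + 0.1 * z[2] >= 0.5)
print(shapley_values(f, x=[1, 1, 1], baseline=[0, 0, 0], n=3))

By construction, the resulting values sum to the difference between the model output at 𝒙 and at the baseline, reflecting how often including each feature value pushes the prediction to 𝑦* across configurations of the other features.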
Example and practical implications.
The above analysis indicates that different explanation methods optimize for either 𝛼 or 𝛽: counterfactual explanations are inspired by the 𝛼_CF metric, and attribution-based methods like LIME and SHAP by the 𝛽 metric. Since 𝛽 focuses on the power of a feature to lead to the observed outcome and 𝛼 on its power to change the outcome conditional on the (feature, outcome) pair already being observed, the two metrics need not be the same. For example, consider a model 𝑦 = 𝐼(𝑤_1𝑥_1 + 𝑤_2𝑥_2 ≥ 𝜃), where 𝑥_1, 𝑥_2 ∈ [0, 1] are continuous features and 𝑤_1 = 0.45 is substantially larger than 𝑤_2, and an input point with 𝑦 = 1. To explain this prediction, LIME or SHAP will assign high importance to 𝑥_1 compared to 𝑥_2 since it has the higher coefficient value of 0.45. Counterfactuals would also give importance to 𝑥_1 (e.g., reduce 𝑥_1 by 0.12 to obtain 𝑦 = 0) but additionally to 𝑥_2 (e.g., reduce 𝑥_2 to 0.49), depending on how the loss function from the original input is defined (which defines the set of contexts for 𝛼). Appendix A.1 shows the importance scores by different methods for this example.
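A small numerical sketch of this example follows; only the coefficient 0.45 is given above, so the second coefficient, the threshold, and the original input are assumed values chosen so that both of the suggested counterfactual moves flip the prediction.

# Toy model from the example above; w2, the threshold, and the input are
# assumptions (only w1 = 0.45 is specified in the text).
W1, W2, THRESHOLD = 0.45, 0.10, 0.50

def predict(x1, x2):
    return int(W1 * x1 + W2 * x2 >= THRESHOLD)

x1, x2 = 1.0, 1.0                      # assumed original input, predicted y = 1
print(predict(x1, x2))                 # 1
# A counterfactual that changes the high-coefficient feature ...
print(predict(x1 - 0.12, x2))          # 0
# ... and one that changes only the low-coefficient feature.
print(predict(x1, 0.49))               # 0

Even though 𝑥_2 has the smaller coefficient, changing it alone suffices to flip the prediction here, which is exactly the kind of necessity that 𝛼_CF rewards and coefficient-based scores overlook.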
Therefore, a good explanation ideally needs both a high 𝛼 and a high 𝛽 to provide the two different facets. Our framework suggests that there is value in evaluating both qualities for an explanation method, and in general in considering both types of explanations for their complementary value in understanding a model's output. In the following, we propose methods for evaluating the necessity (𝛼_CF) and sufficiency (𝛽) of an explanation and study their implications on real-world datasets.

To connect attribution-based methods with counterfactual explanations, we propose two methods. The first measures the necessity and sufficiency of any attribution-based explanation using counterfactuals, and the second creates feature importance scores using counterfactual examples.

For our empirical evaluation, we looked for explanation methods that are publicly available on GitHub. For attribution-based methods, we use the two most popular open-source libraries, LIME [41] and SHAP [33]. For counterfactual methods, we choose based on their popularity and whether a method supports generating CFs using user-specified feature subsets (a requirement for our experiments). Alibi [22], AIX360 [5], DiCE [37], and MACE [20] are the most popular on GitHub, but only DiCE explicitly supports CFs from feature subsets (more details about method selection are in Suppl. A.2). We also implemented the seminal method from Wachter et al. for CF explanations, calling it WachterCF.
Attribution-based methods.
For a given test instance 𝒙 and an ML model 𝑓(·), LIME perturbs its feature values and uses the perturbed samples to build a local linear model 𝑔 of complexity Ω(𝑔). The coefficients of the linear model are used as explanations 𝜁, and larger coefficients imply higher importance. Formally, LIME generates explanations by optimizing the following loss, where L measures how closely 𝑔 approximates 𝑓 in the neighborhood 𝜋_𝒙 of 𝒙:

𝜁(𝒙) = argmin_{𝑔 ∈ 𝐺} L(𝑓, 𝑔, 𝜋_𝒙) + Ω(𝑔)    (5)

SHAP, on the other hand, assigns an importance score to a feature based on Shapley values, which are computed using that feature's average marginal contribution across different coalitions of all features.
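For reference, a minimal sketch of how the two libraries are typically invoked on a tabular classifier is shown below. The variable names (X_train, feature_names, model, x) are placeholders, and exact arguments may differ across library versions.

import numpy as np
import shap
from lime.lime_tabular import LimeTabularExplainer

# X_train, feature_names, model, and x are placeholders for training data,
# a fitted classifier with predict_proba, and an instance to explain.

# LIME: local linear approximation around x (Eqn. 5).
lime_explainer = LimeTabularExplainer(
    X_train, feature_names=feature_names, mode="classification")
lime_exp = lime_explainer.explain_instance(
    x, model.predict_proba, num_features=len(feature_names))
print(lime_exp.as_list())              # (feature, coefficient) pairs

# SHAP: KernelExplainer with a feature-wise median background, matching the
# setting described in the Implementation Details below.
background = np.median(X_train, axis=0).reshape(1, -1)
shap_explainer = shap.KernelExplainer(model.predict_proba, background)
shap_values = shap_explainer.shap_values(x.reshape(1, -1))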
Counterfactual generation methods.
For counterfactual explanations, the method from Wachter et al. optimizes the following loss, where 𝒄 is a counterfactual example:

𝒄* = argmin_𝒄 yloss(𝑓(𝒄), 𝑦) + 𝜆 dist(𝒄, 𝒙)    (6)

The two additive terms in the loss minimize (1) yloss(·), the loss between the ML model 𝑓(·)'s prediction and the desired outcome 𝑦, and (2) the distance between 𝒄 and the test instance 𝒙. For obtaining multiple CFs for the same input, we simply re-initialize the optimization with a new random seed. As a result, this method may not be able to find unique CFs. The second method, DiCE, handles the issue of multiple unique CFs by introducing a diversity term to the loss, based on determinantal point processes [24]. It returns a diverse set of nCF counterfactuals by solving a combined optimization problem over multiple CFs, where 𝒄_𝑖 is a counterfactual example:

C(𝒙) = argmin_{𝒄_1,...,𝒄_nCF} (1/nCF) Σ_{𝑖=1}^{nCF} yloss(𝑓(𝒄_𝑖), 𝑦) + (𝜆_1/nCF) Σ_{𝑖=1}^{nCF} dist(𝒄_𝑖, 𝒙) − 𝜆_2 dpp_diversity(𝒄_1, ..., 𝒄_nCF)    (7)

Suppose 𝑦* = 𝑓(𝒙_𝑗 = 𝑎, 𝒙_{−𝑗} = 𝑏) is the output of a classifier 𝑓 for input 𝒙. To measure the necessity of a feature value 𝒙_𝑗 = 𝑎 for the model output 𝑦*, we would like to operationalize Eqn. 4. A simple way is to use a method for generating counterfactual explanations, but restrict it such that only 𝒙_𝑗 can be changed. The fraction of times that changing 𝒙_𝑗 leads to a valid counterfactual example indicates the extent to which 𝒙_𝑗 = 𝑎 is necessary for the current model output 𝑦*. That is, if we can change the model's output by changing 𝒙_𝑗, it means that the feature values in 𝒙_𝑗 are necessary to generate the model's original output. Necessity is thus defined as

Necessity = ( Σ_{𝑖, 𝒙_𝑗 ≠ 𝑎} CF_𝑖 ) / (nCF · 𝑁)    (8)

where CF_𝑖 counts a generated counterfactual only if it is valid and unique, and 𝑁 is the total number of test instances, for each of which nCF counterfactuals are requested.

For the sufficiency condition from Eqn. 3, we adopt the reverse approach. Rather than changing 𝒙_𝑗, we fix it to its original value and let all other features vary. If no unique valid counterfactual examples are generated, then it implies that 𝒙_𝑗 = 𝑎 is sufficient for causing the model output 𝑦*. If not, then (1 − the fraction of times that unique CFs are generated) tells us about the extent of sufficiency of 𝒙_𝑗 = 𝑎. In practice, even when using all the features, we may not obtain 100% success in generating valid counterfactuals. Therefore, we modify the sufficiency metric to compare the fraction of unique CFs generated using all features to the fraction of unique CFs generated while keeping 𝒙_𝑗 constant:

Sufficiency = ( Σ_𝑖 CF_𝑖 ) / (nCF · 𝑁) − ( Σ_{𝑖, 𝒙_𝑗 ← 𝑎} CF_𝑖 ) / (nCF · 𝑁)    (9)

In addition to evaluating properties of attribution-based explainers, counterfactual explanations offer a natural way of generating feature attribution scores, based on the extent to which a feature value is necessary for the outcome. The intuition comes from Eqn. 4: a feature that is changed more often when generating counterfactual examples must be an important feature. Below we describe the methods WachterCF_FA and DiCE_FA that generate attribution scores from a set of counterfactual examples.

To explain the output 𝑦* = 𝑓(𝒙), the DiCE_FA algorithm proceeds by generating a diverse set of nCF counterfactual examples for the input 𝒙, where nCF is the number of counterfactuals. To generate multiple CFs using WachterCF, we run the optimization in Eqn. 6 multiple times with random initialization, as suggested by Wachter et al. A feature 𝑥_𝑗 that is important in changing a predicted outcome is more likely to be changed frequently in the nCF CFs than a feature 𝑥_𝑘 that is less important. For each feature, therefore, the attribution score is the fraction of CF examples that have a modified value of the feature. To generate a local explanation, the attribution score is averaged over multiple values of nCF, typically going from 1 to 8. To obtain a global explanation, this attribution score is averaged over many test inputs.

We use three common datasets from the explainable ML literature.
• Adult-Income. This dataset [23] is based on the 1994 Census database and contains information like Age, Gender, Marital Status, Race, Education Level, Occupation, Work Class and Weekly Work Hours. It is available online as part of the UCI machine learning repository. The task is to determine if the income of a person would be higher than $50,000 (1) or not (0). We process the dataset using techniques proposed by prior work [58] and obtain a total of 8 features.
• LendingClub. Lending Club is a peer-to-peer lending company, which helps in linking borrowers and investors. We use the data about loans from LendingClub for the duration 2007-2011 and use techniques proposed by prior works [10, 19, 53] for processing the data. We arrive at 8 features, with the task to classify the payment of the loan by a person (1) versus no payment of the loan (0).
• German-Credit. German-Credit [1] consists of various features like Credit Amount, Credit History, Savings, etc. regarding people who took loans from a bank. We utilize all the features present in the dataset for the task of credit risk prediction, whether a person has good credit risk (1) or bad credit risk (0).
Implementation Details.
We trained ML models for the different datasets in PyTorch and use the default parameters of LIME and DiCE in all our experiments unless specified otherwise. We use the same value of 𝜆 for both DiCE (Eqn. 7) and WachterCF (Eqn. 6) and set 𝜆 to 1.0. For SHAP, we used its KernelExplainer interface with the median value of features as the background dataset. As SHAP's KernelExplainer is slow with a large background dataset, we used the median instead (see issues 391 and 451 on SHAP's GitHub repository: https://github.com/slundberg/shap/issues). However, the choice of KernelExplainer and our background dataset setting can limit the strength of SHAP, and we leave further exploration of different configurations of SHAP to future work.

Note that DiCE's hyperparameters for proximity and diversity in CFs are important. For instance, the diversity term enforces that different features change their values in different counterfactuals; otherwise we may obtain multiple duplicate counterfactual examples that change the same feature. Results in the main paper are based on the default hyperparameters in DiCE, but our results are robust to different choices of these hyperparameters (see Suppl. A.3).

We start by examining the necessity and sufficiency of top features derived with feature attribution methods through counterfactual generation. Namely, we measure whether we can generate valid CFs by changing only the 𝑘-th most important feature (necessity) or by changing other features except the 𝑘-th most important feature (sufficiency). Remember that necessity and sufficiency are defined with respect to the original output. For example, if changing a feature can vary the predicted outcome, then this feature is necessary for the original prediction.
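A minimal sketch of these two measurements follows. It assumes a hypothetical helper generate_cfs(x, n_cf, features_to_vary) that wraps any counterfactual generator (e.g., DiCE or WachterCF) and returns the unique, valid counterfactual examples it could find when only the listed features are allowed to change; the helper and its signature are illustrative, not an actual library API.

def necessity(instances, feature, generate_cfs, n_cf=4):
    """Eqn. 8: fraction of requested CFs found when only `feature` may change."""
    found = sum(len(generate_cfs(x, n_cf=n_cf, features_to_vary=[feature]))
                for x in instances)
    return found / (n_cf * len(instances))

def sufficiency(instances, feature, all_features, generate_cfs, n_cf=4):
    """Eqn. 9: drop in unique valid CFs when `feature` is held at its value."""
    others = [f for f in all_features if f != feature]
    found_all = sum(len(generate_cfs(x, n_cf=n_cf, features_to_vary=all_features))
                    for x in instances)
    found_fixed = sum(len(generate_cfs(x, n_cf=n_cf, features_to_vary=others))
                      for x in instances)
    denom = n_cf * len(instances)
    return found_all / denom - found_fixed / denom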
Are important features necessary?
Given top features identified by feature attribution methods (LIME and SHAP), we investigate whether we can change the prediction outcome by using only the 𝑘-th most important feature, where 𝑘 ∈ {1, 2, 3}. We choose small 𝑘 since the number of features is small in these datasets. Specifically, we measure the average percentage of unique and valid counterfactuals generated using DiCE and WachterCF for 200 random test instances by fixing other features and changing only the 𝑘-th most important feature. This analysis helps us understand whether the top features from LIME or SHAP are necessary to produce the current model output. Fig. 2a shows the results for different datasets when asked to generate different numbers of CFs. While we produced CFs for nCF ∈ {1, 2, 4, 6, 8}, we show results only for 1, 4, and 8 for brevity. To provide a benchmark, we also consider the case where we use all the other features that are not in the top three.

Our results in Fig. 2a suggest that the top features are mostly unnecessary for the original prediction: changing them is less likely to alter the predicted outcome. For instance, in German-Credit, none of the top features have a necessity above 50%, and it is often below 30%. In comparison, features outside the top three can almost always achieve nearly 100%. This is likely related to the fact that there are 20 features in German-Credit, but it still highlights the limited utility of explanations that focus on the top features from feature attribution methods. Similar results also show up in Adult-Income, though not as salient as in German-Credit.

In LendingClub, we do find that the top feature is relatively higher on the necessity metric. Upon investigation, we find this dataset has a categorical feature grade with seven levels, which is assigned by the lending company as an indicator of loan repayment. The loan grade is designed based on a combination of factors including credit score. Since the quality of loan grade is highly correlated with loan repayment status, both LIME and SHAP give a high importance score to this feature for most test instances – they assign it the highest score for 98% and 73% of the test instances respectively. As a result, changing LIME's top-1 feature is enough to get almost perfect unique valid CFs when generating one counterfactual. However, the necessity of a single feature quickly reduces as we generate more CFs. Even in this dataset where there is a dominant feature, the features other than the top-3 become more necessary than the top feature (grade) for nCF > 1. We also repeat the above analysis by allowing all features up to the top-𝑘 to be changed (Suppl. A.4) and find that the necessity of the top-𝑘 subset increases, but is still less than 100% for nCF > 1. That is, changing all top-3 ranked features is also not enough to generate counterfactuals for all input examples, especially for the higher-dimensional German-Credit.

Figure 2: The 𝑦-axis represents the necessity and sufficiency measures at a particular nCF, as defined in §4.2. Fig. 2a shows the results when we are only allowed to change the 𝑘-th most important feature (𝑘 = 1, 2, 3) or the other features, while Fig. 2b shows the results when we fix the 𝑘-th most important feature (𝑘 = 1, 2, 3) but are allowed to change other features. While necessity is generally aligned with the feature ranking derived from LIME/SHAP, the most important features often cannot lead to changes in the model output on their own. In almost all cases, “rest” achieves better success in producing CFs using both DiCE and WachterCF. For sufficiency, none of these top features are sufficient to preserve the original model output. DiCE and WachterCF differ the most for LendingClub with nCF > 1, where the latter's difficulty in generating multiple unique CFs increases the measured sufficiency of a feature.

Are important features sufficient?
Similar to necessity, we measure the sufficiency of the top features from attribution-based methods by fixing the 𝑘-th most important feature and allowing DiCE and WachterCF to change the other features. If the 𝑘-th most important feature is sufficient for the original prediction, we would expect a low success rate in generating valid CFs with the other features, and our sufficiency measure would take high values.

Fig. 2b shows the opposite. We find that the validity is close to 100% (hence very low sufficiency) across the values of nCF without changing the 𝑘-th most important feature based on LIME or SHAP in Adult-Income and German-Credit. In comparison, for LendingClub, while fixing the 2nd or 3rd most important feature does not affect the perfect validity, fixing the most important feature does decrease the validity when generating more than one CF using DiCE. This result again highlights the dominance of grade in LendingClub. However, even in this case, the sufficiency metric is still below 20%. Sufficiency results using WachterCF are similarly low, except for LendingClub when nCF > 1. Here WachterCF, with only random initialization and no explicit diversity term in its loss, could not generate multiple unique CFs (without changing the most important feature) for many inputs, and therefore the measured sufficiency is relatively higher. We also repeat the above analysis by fixing all the top-𝑘 features and get similarly low sufficiency results (see Suppl. A.4).

Implications.
These results qualify the interpretation of “important” features returned by common attribution methods like LIME or SHAP. Highly ranked features may often be neither necessary nor sufficient, and our results suggest that these properties become weaker for top-ranked features as the number of features in a dataset increases. In any practical scenario, hence, it is important to check whether necessity or sufficiency is desirable for an explanation. While we saw mostly consistent results with DiCE and WachterCF, the results on LendingClub indicate that the method used to generate CFs matters too. Defining the loss function with or without diversity corresponds to a different set of contexts over which necessity or sufficiency is estimated, which needs to be decided based on the application. Generally, whenever there are multiple kinds of attribution rankings to choose from, these results show the value of using CFs to evaluate them.

Figure 3: In Fig. 3a, feature indexes on the 𝑥-axis are based on the ranking from LIME. SHAP mostly agrees with LIME, but less important features based on LIME can have high feature importance based on WachterCF_FA and DiCE_FA. Fig. 3b shows the correlation of feature importance scores from different methods: LIME and SHAP are more similar to each other than to DiCE_FA and WachterCF_FA. In German-Credit, the correlation with DiCE_FA can become negative as nCF grows. (Fig. 3a: average feature importance scores at nCF=4; Fig. 3b: correlation between feature importance scores.)

As discussed in §4, counterfactual methods can not only evaluate, but also generate their own feature attribution rankings based on how often a feature is changed in the generated CFs. In this section, we compare the feature importance scores from DiCE_FA and WachterCF_FA to those from LIME and SHAP, and investigate how they can provide additional, complementary information about an ML model.
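The sketch below illustrates how such CF-derived attribution scores can be computed and compared against LIME/SHAP scores. It assumes the counterfactuals are available as arrays of feature values; the Pearson correlation uses scipy, and all variable names are placeholders.

import numpy as np
from scipy.stats import pearsonr

def cf_attribution_scores(x, cf_examples, atol=1e-8):
    """Fraction of counterfactual examples in which each feature was changed,
    following the WachterCF_FA / DiCE_FA construction."""
    x = np.asarray(x, dtype=float)
    cfs = np.asarray(cf_examples, dtype=float)     # shape: (nCF, d)
    changed = ~np.isclose(cfs, x, atol=atol)       # True where a feature differs
    return changed.mean(axis=0)                    # per-feature change frequency

# cf_scores, lime_scores, shap_scores are placeholder per-feature score vectors
# (e.g., averaged over test instances and values of nCF):
# r_lime, _ = pearsonr(cf_scores, np.abs(lime_scores))
# r_shap, _ = pearsonr(cf_scores, np.abs(shap_scores))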
Correlation with LIME or SHAP feature importance.
We start by examining how the importance scores from different methods vary for different features and datasets. Fig. 3a shows the average feature importance score across 200 random test instances when nCF = 4. For LIME and SHAP, we take the absolute value of the feature importance score to indicate contribution. LIME and SHAP agree very well for Adult-Income and LendingClub. While they mostly agree in German-Credit, there are some bumps indicating disagreements. In comparison, DiCE_FA and WachterCF_FA are less similar to LIME than to SHAP. This is especially salient in the high-dimensional German-Credit dataset. The features that are ranked 13th and 18th by LIME — the number of existing credits a person holds at the bank and the number of people liable to provide maintenance — are the top two important features by DiCE_FA's scores. They are ranked 1st and 2nd, respectively, by DiCE_FA in 98% of the test instances. Similarly, the 16th-ranked feature by LIME, maximum credit amount, is the most important feature by WachterCF_FA.

We then compute the Pearson correlation between these average feature importance scores derived with different explanation methods in Fig. 3b for different nCF. We find that LIME and SHAP agree on the feature importance on average for all three datasets, similar to what was observed in Fig. 3a at nCF = 4. The correlation is especially strong for Adult-Income and LendingClub, each of which has only 8 features.

Comparing CF-based and feature attribution methods, we find that they are well correlated in LendingClub. This, again, can be attributed to the dominance of grade: all methods consider grade an important feature. In Adult-Income, the correlation of CF-based methods with SHAP and LIME decreases as nCF increases. This is not surprising since at higher nCF, DiCE changes diverse features of different importance levels (according to LIME or SHAP) to get CFs, while WachterCF does so to a lesser extent with random initializations. For instance, in Fig. 3a at nCF = 4, the feature that is ranked 6th on average by LIME, hours-per-week, is changed by WachterCF almost to the same extent as the top-3 features. Similarly, DiCE varies this feature almost twice as much as the feature sex, which is ranked 4th on average. Hence, we can expect that the average frequency of changing the most important feature would decrease with increasing nCF, and less important features would start to vary more (see §5). By highlighting the less important features as per LIME or SHAP, DiCE_FA and WachterCF_FA focus on finding different subsets of necessary features that can change the model output. In particular, even without a diversity loss, WachterCF_FA varies less important features to get valid CFs. In comparison, LIME and SHAP tend to prefer sufficiency of features in contributing to the original model output.

This trend is amplified in the German-Credit dataset, which has the highest number of features: the correlation between DiCE_FA and LIME or SHAP is below 0.25 for all values of nCF and can even become negative as nCF increases. We hypothesize that this is due to the number of features. German-Credit has 20 features, and in general with increasing feature set size, we find that DiCE is able to generate CFs even with less important features of LIME or SHAP. Even though WachterCF_FA varies less important features, as shown for nCF = 4 in Fig. 3a, it has a relatively moderate correlation with LIME/SHAP. This implies that attribution-based and CF-based methods agree more when CFs are generated without diversity. Interestingly, WachterCF_FA and DiCE_FA correlate less with each other than WachterCF_FA correlates with LIME/SHAP, indicating the multiple variations possible in generating CFs over high-dimensional data. Further, LIME and SHAP also agree less in German-Credit compared to the other datasets, suggesting that datasets with few features such as Adult-Income and LendingClub may provide limited insight into understanding explanation methods in practice, especially as real-world datasets tend to be high-dimensional.
Differences in feature ranking.
Feature importance scores can be difficult to compare and interpret; therefore many visualization tools show the ranking of features based on importance. Fig. 4 (for DiCE_FA) and Fig. 5 (for WachterCF_FA) show the mean difference in the rankings induced by feature importance scores from different explanation methods for each feature, computed over 200 test inputs. We also perform paired 𝑡-tests to test whether there is a significant difference between rankings from different methods for the same feature. This analysis allows us to see the local differences in feature rankings beyond the average feature importance score.

Figure 4: Correlation between the importance ranking of a feature across instances by LIME, SHAP, and DiCE, shown separately for (a) Adult-Income, (b) LendingClub, and (c) German-Credit. The 𝑥-axis denotes the mean difference in the rankings for each feature over all the test inputs. Stars denote significance levels using p-values (****: 𝑝 < 10^-4, ***: 𝑝 < 10^-3, **: 𝑝 < 10^-2, *: 𝑝 < 5×10^-2).

Figure 5: Correlation between the importance ranking of a feature across instances by LIME, SHAP, and WachterCF, shown separately for (a) Adult-Income, (b) LendingClub, and (c) German-Credit. The 𝑥-axis denotes the mean difference in the rankings for each feature over all the test inputs. Stars denote significance levels using p-values (****: 𝑝 < 10^-4, ***: 𝑝 < 10^-3, **: 𝑝 < 10^-2, *: 𝑝 < 5×10^-2).

For most features across all datasets, we find that the feature rankings on individual inputs can be significantly different. In other words, the differences between explanation methods are magnified if we focus on feature ranking. This is true even when comparing LIME and SHAP, which otherwise show a high positive correlation in average (global) feature importance score. For instance, in Adult-Income, LIME consistently ranks marital status and sex higher than SHAP, while SHAP tends to rank work class, race, and occupation higher. Interestingly, they tend to agree on the ranking of continuous features, i.e., hours per week and age. As expected, LIME and DiCE provide different rankings for all features, while SHAP and DiCE differ on all except marital status. Similarly, we see a large difference in feature rankings for the German-Credit and LendingClub datasets.
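The per-feature ranking comparison above can be reproduced with a paired t-test over per-instance ranks; the sketch below assumes two arrays holding the rank that each method assigns to the same feature on the same 200 test inputs.

import numpy as np
from scipy.stats import ttest_rel

def rank_difference(ranks_a, ranks_b):
    """Mean rank difference and paired t-test p-value for one feature.

    ranks_a, ranks_b: per-instance ranks of the same feature under two
    explanation methods (e.g., LIME vs. DiCE_FA), aligned by test input.
    """
    ranks_a, ranks_b = np.asarray(ranks_a), np.asarray(ranks_b)
    stat, p_value = ttest_rel(ranks_a, ranks_b)
    return (ranks_a - ranks_b).mean(), p_value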
Implications.
Feature importance rankings by counterfactuals are quite different from those of attribution-based methods like LIME/SHAP. In particular, they focus more on the less important features from LIME/SHAP, and this trend accentuates as the number of feature dimensions increases. In settings where the necessity of features is important (e.g., algorithmic recourse for individuals), attribution rankings from CFs may be more appropriate than standard attribution-based methods, and our method makes it possible to generate them.

At the same time, attributions from both kinds of explanation methods are sensitive to implementation details. While we expected significant differences between DiCE_FA and the two attribution-based methods based on the global feature importance scores from Fig. 3, we also find significant differences between LIME and SHAP on individual inputs, and between DiCE_FA and WachterCF_FA on aggregate importances. In general, our results demonstrate the difficulty of building a single, ideal explanation method. Explanations capture different theoretical notions such as necessity and sufficiency, which is why DiCE_FA disagrees in its ranking on almost all features with LIME and SHAP.

To understand the complementarities between different explanation methods on realistic datasets, we present a case study using a real-world hospital admission prediction problem with 222 features. Predicting patients who are likely to be admitted during emergency visits helps hospitals better allocate their resources, provide appropriate medical interventions, and improve patient treatment rates [4, 11, 14, 15, 18, 29, 36, 38]. Given the importance of the decision, it is critical that the predictions from an ML model be explainable to doctors in the emergency department. We leverage the dataset and models from Hong et al., who use a variety of ML models including XGBoost and deep neural networks to predict hospital admission at the emergency department (ED) from “triage” and demographic information, and other data collected during previous ED visits.
Data and model training.
Figure 6: Mean rankings of different types of feature groups (demographics, triage, and chief complaints) by DiCE_FA, LIME, and SHAP. The lower the ranking, the more important the features are.

We use the ML model based on triage features, demographic features, and chief-complaints information from Hong et al. Triage features consist of 13 variables indicating the severity of ailments when a patient arrives at the ED. This model also uses 9 demographic features, including race, gender, and religion, and 200 binary features indicating the presence of various chief complaints. As a result, this dataset has many more features than Adult-Income, LendingClub, and German-Credit. We refer to this dataset as HospitalTriage. We reproduce the deep neural network used by Hong et al., which has two hidden layers with 300 and 100 neurons respectively. The model achieves a precision and recall of 0.81 each and an AUC of 0.87 on the test set. Further, we used a 50% sample of the original data, consisting of 252K data points, for model training, as the authors show that the accuracy saturates beyond this point. We sample 200 instances from the test set over which we evaluate the attribution methods.
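A minimal sketch of an admission classifier with this architecture is shown below; the layer sizes follow the two-hidden-layer (300, 100) description above, while the input dimension, activation choice, output layer, and training details are assumptions.

import torch.nn as nn

def build_admission_model(input_dim: int) -> nn.Module:
    """Feed-forward classifier with two hidden layers of 300 and 100 units,
    mirroring the HospitalTriage architecture described above."""
    return nn.Sequential(
        nn.Linear(input_dim, 300),
        nn.ReLU(),
        nn.Linear(300, 100),
        nn.ReLU(),
        nn.Linear(100, 2),   # admitted vs. not admitted (assumed output layer)
    )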
We start with the featureranking produced by different methods to help familiarize with thisreal-world dataset. We then replicate the experiments in §5 and §6.We focus on DiCE in this comparison as WachterCF can struggleto generate multiple unique valid CFs when 𝑛𝐶𝐹 > FA , LIME,and SHAP using the same method as in §6. Fig. 6 shows the dis-tribution of mean rankings of different types of features in Hos-pitalTriage according to our feature attribution methods . Thisdataset has three category of features — demographics, triage andchief complaints. We find that SHAP ranks binary chief-complaintsfeatures much higher on average than DiCE FA and LIME ( 𝑟𝑎𝑛𝑘 ∝ 𝑖𝑚𝑝𝑜𝑟𝑡𝑎𝑛𝑐𝑒 ). Though DiCE FA and LIME disagree on demographicsand triage features rankings, they both have similar mean rankingson chief-complaints features which constitutes 90% of the features.Hence, DiCE FA and LIME has a relatively higher correlation (seeFig. 8b) compared to any other methods.Furthermore, DiCE FA considers demographics and triage fea-tures more important as compared to the chief-complaints features,since the former features have smaller rank ( <
80) on average. Incontrast, LIME assigns them a larger rank. This has implications infairness: when the ML model is evaluated based on LIME alone, themodel would be seen as fair since chief-complaints features con-tribute more to the prediction on average. However, DiCE FA andSHAP shows that demographic features can also be changed to altera prediction, raising questions about making decisions based on sen-sitive features. Indeed, Hong et al. [17] present a low-dimensionalXGBoost model by identifying features using information gain as We assign features the maximum of the ranks when there is a tie. DiCE FA ’s andLIME’s rankings are invariant to the treatment of ties whereas SHAP’s is. We choosethe maximum to better distinguish different methods’ rankings. owards Unifying Feature Attribution and Counterfactual Explanations: Different Means to the Same End Woodstock ’18, June 03–05, 2018, Woodstock, NY metric. They find that 5 out of 9 demographic details – insurancestatus, marital status, employment status, race, and gender, and6 out of 13 triage features are identified as important in their re-fined model. On the other hand, only 8 out of 200 chief-complaintsfeatures are found important. Necessity and sufficiency.
Necessity and sufficiency. Next, we replicate the experiments from §5 for HospitalTriage to understand the necessity and sufficiency of the important features of LIME and SHAP in generating CFs. The trend for SHAP in Fig. 7 is similar to what was observed in Fig. 2a: changing the more important features is more likely to generate valid CFs and hence indicates higher necessity (green line). However, in the case of LIME, we observe that the third most important feature leads to more CFs, almost double the number obtained from the first or the second feature alone. The reason is that in around 26% of the test instances, LIME rates the Emergency Severity Index (ESI) as the third most important feature. ESI is a categorical feature indicating the level of severity assigned by the triage nurse [17]. DiCE_FA considers this feature important for changing the outcome prediction and ranks it among the top-10 features for more than 60% of the test instances. ESI is also one of the top-3 features by the information gain metric in the refined XGBoost model from Hong et al.

The sufficiency results (Fig. 7) are similar to Fig. 2b: none of the top-3 features is sufficient for generating CFs. At nCF = 1, the same number of valid counterfactuals (100%) can be generated while keeping the 1st, 2nd, or 3rd feature fixed as when changing all features. Similarly, the same number of valid counterfactuals (68%) can be generated at nCF = 8, irrespective of whether the top-k features are changed or not. Note that the overall fraction of valid counterfactuals generated decreases as nCF increases, indicating that it is harder to generate diverse counterfactuals for this dataset. We expect the lack of sufficiency of top-ranked features to hold in many datasets as the number of features increases.

Figure 7: Necessity and sufficiency measures at a particular nCF, as defined in §4.2, for the HospitalTriage data.
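For reference, the feature-restricted counterfactual queries behind these necessity and sufficiency measurements can be expressed through DiCE's features_to_vary argument. The sketch below uses a small synthetic stand-in for HospitalTriage and DiCE's model-agnostic random-search backend rather than our exact configuration; all feature names and values are illustrative.

```python
import numpy as np
import pandas as pd
import dice_ml
from sklearn.ensemble import RandomForestClassifier

# Tiny synthetic stand-in for HospitalTriage; feature names are illustrative only.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "esi": rng.uniform(1, 5, 500),            # triage severity score
    "age": rng.uniform(18, 90, 500),
    "systolic_bp": rng.uniform(90, 180, 500),
})
df["admit"] = ((df["esi"] < 2.5) | (df["age"] > 70)).astype(int)

clf = RandomForestClassifier(random_state=0).fit(df.drop(columns="admit"), df["admit"])

data = dice_ml.Data(dataframe=df, continuous_features=["esi", "age", "systolic_bp"],
                    outcome_name="admit")
model = dice_ml.Model(model=clf, backend="sklearn")
explainer = dice_ml.Dice(data, model, method="random")

query = df.drop(columns="admit").iloc[[0]]
top_k = ["esi"]                                # e.g., the top-ranked feature from LIME/SHAP
rest = [c for c in query.columns if c not in top_k]

# Necessity-style query: counterfactuals obtained by changing only the top-k features.
cfs_topk = explainer.generate_counterfactuals(
    query, total_CFs=4, desired_class="opposite", features_to_vary=top_k)

# Sufficiency-style query: counterfactuals obtained while the top-k features stay fixed.
cfs_rest = explainer.generate_counterfactuals(
    query, total_CFs=4, desired_class="opposite", features_to_vary=rest)
cfs_topk.visualize_as_dataframe(show_only_changes=True)
```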
Similarity between feature importance from different methods. Fig. 8b shows the correlation of feature importance scores derived from the different methods. Different from what was observed for other datasets in Fig. 3b, LIME and SHAP have almost zero correlation between their feature rankings on HospitalTriage. This observation resonates with prior work demonstrating the instability and lack of robustness of these feature attribution methods, i.e., they can differ significantly when used to explain complex nonlinear models with high-dimensional data [3, 26, 49, 57]. In the case of HospitalTriage, the importance scores given by LIME and SHAP are indeed very different for most of the features. For instance, SHAP assigns close to zero weight to many binary chief-complaint features in most of the test instances, while LIME assigns diverse importance scores. Fig. 8a shows the absolute feature attribution scores of the different methods at nCF = 4. While DiCE_FA agrees more with SHAP than with LIME for the other datasets (except LendingClub, where all methods agreed due to a dominating feature), here we obtain the reverse trend. DiCE_FA has a relatively weaker correlation with SHAP in the case of HospitalTriage, echoing the difference observed for chief complaints in Fig. 6. In particular, at nCF = 8, they have no correlation on average feature rankings. At higher nCF, DiCE varies a larger number of binary features, most of which are assigned very low weights by SHAP, hence the disagreement.
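The pairwise similarities in Fig. 8b can be summarized by correlating, per test instance, the importance-score vectors produced by two methods and then averaging; a minimal sketch with placeholder scores (Pearson correlation is used here purely for illustration) follows.

```python
import numpy as np
from scipy.stats import pearsonr

def mean_pairwise_correlation(scores_a, scores_b):
    """scores_a, scores_b: (n_instances, n_features) importance scores from two
    explanation methods; returns the mean per-instance correlation."""
    return float(np.mean([pearsonr(a, b)[0] for a, b in zip(scores_a, scores_b)]))

# Placeholder scores for 5 test instances and 10 features.
rng = np.random.default_rng(0)
lime_scores = np.abs(rng.normal(size=(5, 10)))
shap_scores = np.abs(rng.normal(size=(5, 10)))
print(mean_pairwise_correlation(lime_scores, shap_scores))
```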
Implications. To summarize, analyzing the feature attribution methods on a real-world problem highlights both the complementarity of, and the differences among, these methods. First, the highest-ranked features from attribution-based methods like LIME are not sufficient, and are not always the most necessary, for causing the original model output; more valid counterfactuals can be generated by varying a feature with a larger rank than by varying those with smaller ranks. Second, there are substantial differences in feature importance scores from the different methods, to the extent that they can completely change the interpretation of a model with respect to properties like fairness. Unlike the previous low-dimensional datasets, even LIME and SHAP demonstrate substantial differences in global feature importance scores. DiCE_FA rankings strike a balance between the two methods: DiCE_FA agrees with SHAP on demographics features and with LIME on chief-complaints features. Finally, similar to the results in §6, DiCE_FA distributes feature importance more equally, especially for the features with larger rank from LIME and SHAP.

Figure 8: (a) Average feature importance scores (nCF=4); (b) Correlation between feature importance scores. In Fig. 8a, feature indexes on the x-axis are based on the ranking from LIME. SHAP presents very different outcomes from LIME, and their feature importance scores show much smaller variation than DiCE_FA's. Fig. 8b directly compares feature importance scores from different methods: the correlation between LIME and SHAP is much weaker than in Fig. 3b.

Our work represents the first empirical attempt to unify explanation methods based on feature attribution and counterfactual generation. We provide a framework based on actual causality to interpret these two approaches. Through an empirical investigation on a variety of datasets, we demonstrate intriguing similarities and differences between these methods. Our results show that it is not enough to focus on only the top features identified by feature attribution methods such as LIME and SHAP. They are neither sufficient nor necessary. Other features are (sometimes more) meaningful and can potentially provide actionable changes.

We also find significant differences in the feature importance induced from different explanation methods. While the feature importance induced from DiCE and WachterCF can be highly correlated with LIME and SHAP on low-dimensional datasets such as Adult-Income, they become more different as the feature dimension grows. Even in German-Credit with 20 features, they can show no or even negative correlation when generating multiple CFs. Interestingly, we noticed differences even among methods of the same kind (LIME vs. SHAP and WachterCF_FA vs. DiCE_FA), indicating that more work is needed to understand the empirical properties of explanation methods on high-dimensional datasets.

Our study highlights the importance of using different explanation methods, and of future work to find which explanation methods are more appropriate for a given question. There can be many valid questions that motivate a user to look for explanations [30]. Even for the specific question of which features are important, the definition of importance can still vary, for example, actual causes vs. but-for causes.
It is important for our research community to avoid the one-size-fits-all temptation that there exists a uniquely best way to explain a model. Overall, while it is a significant challenge to leverage the complementarity of different explanation methods, we believe that the existence of different explanation methods provides exciting opportunities for combining these explanations.

REFERENCES
[1] Accessed 2019. UCI Machine Learning Repository. German credit dataset. https://archive.ics.uci.edu/ml/support/statlog+(german+credit+data)
[2] Kjersti Aas, Martin Jullum, and Anders Løland. 2019. Explaining individual predictions when features are dependent: More accurate approximations to Shapley values. arXiv preprint arXiv:1903.10464 (2019).
[3] David Alvarez-Melis and Tommi S Jaakkola. 2018. On the robustness of interpretability methods. arXiv preprint arXiv:1806.08049 (2018).
[4] Rajiv Arya, Grant Wei, Jonathan V McCoy, Jody Crane, Pamela Ohman-Strickland, and Robert M Eisenstein. 2013. Decreasing length of stay in the emergency department with a split emergency severity index 3 patient flow model. Academic Emergency Medicine 20, 11 (2013), 1171–1179.
[5] Vijay Arya, Rachel KE Bellamy, Pin-Yu Chen, Amit Dhurandhar, Michael Hind, Samuel C Hoffman, Stephanie Houde, Q Vera Liao, Ronny Luss, Aleksandra Mojsilović, et al. 2019. One explanation does not fit all: A toolkit and taxonomy of AI explainability techniques. arXiv preprint arXiv:1909.03012 (2019).
[6] Solon Barocas, Andrew D Selbst, and Manish Raghavan. 2020. The hidden assumptions behind counterfactual explanations and principal reasons. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency. 80–89.
[7] Samuel Carton, Anirudh Rathore, and Chenhao Tan. 2020. Evaluating and Characterizing Human Rationales. In Proceedings of EMNLP.
[8] Rich Caruana, Yin Lou, Johannes Gehrke, Paul Koch, Marc Sturm, and Noemie Elhadad. 2015. Intelligible models for healthcare: Predicting pneumonia risk and hospital 30-day readmission. In Proceedings of KDD.
[9] Susanne Dandl, Christoph Molnar, Martin Binder, and Bernd Bischl. 2020. Multi-objective counterfactual explanations. In International Conference on Parallel Problem Solving from Nature. Springer, 448–469.
[10] Kevin Davenport. 2015. Lending Club Data Analysis Revisited with Python. https://kldavenport.com/lending-club-data-analysis-revisted-with-python/
[11] Thomas Desautels, Jacob Calvert, Jana Hoffman, Melissa Jay, Yaniv Kerem, Lisa Shieh, David Shimabukuro, Uli Chettipally, Mitchell D Feldman, Chris Barton, et al. 2016. Prediction of sepsis in the intensive care unit with minimal electronic health record data: a machine learning approach. JMIR Medical Informatics 4, 3 (2016), e28.
[12] Jay DeYoung, Sarthak Jain, Nazneen Fatema Rajani, Eric Lehman, Caiming Xiong, Richard Socher, and Byron C. Wallace. 2019. ERASER: A Benchmark to Evaluate Rationalized NLP Models. In Proceedings of ACL.
[13] Amit Dhurandhar, Pin-Yu Chen, Ronny Luss, Chun-Chen Tu, Paishun Ting, Karthikeyan Shanmugam, and Payel Das. 2018. Explanations based on the missing: Towards contrastive explanations with pertinent negatives. In Advances in Neural Information Processing Systems. 592–603.
[14] Andrea Freyer Dugas, Thomas D Kirsch, Matthew Toerper, Fred Korley, Gayane Yenokyan, Daniel France, David Hager, and Scott Levin. 2016. An electronic emergency triage system to improve patient distribution by critical outcomes. The Journal of Emergency Medicine 50, 6 (2016), 910–918.
[15] Julian S Haimovich, Arjun K Venkatesh, Abbas Shojaee, Andreas Coppi, Frederick Warner, Shu-Xia Li, and Harlan M Krumholz. 2017. Discovery of temporal and disease association patterns in condition-specific hospital utilization rates. PloS One 12, 3 (2017), e0172049.
[16] Joseph Y Halpern. 2016. Actual Causality. MIT Press.
[17] Woo Suk Hong, Adrian Daniel Haimovich, and R Andrew Taylor. 2018. Predicting hospital admission at emergency department triage using machine learning. PloS One 13, 7 (2018), e0201016.
[18] Steven Horng, David A Sontag, Yoni Halpern, Yacine Jernite, Nathan I Shapiro, and Larry A Nathanson. 2017. Creating an automated trigger for sepsis clinical decision support at emergency department triage using machine learning. PloS One 12, 4 (2017), e0174708.
[19] JFdarre. 2015. Project 1: Lending Club's data. https://rpubs.com/jfdarre/119147
[20] Amir-Hossein Karimi, Gilles Barthe, Borja Balle, and Isabel Valera. 2020. Model-agnostic counterfactual explanations for consequential decisions. In International Conference on Artificial Intelligence and Statistics. PMLR, 895–905.
[21] Amir-Hossein Karimi, Bernhard Schölkopf, and Isabel Valera. 2020. Algorithmic Recourse: from Counterfactual Explanations to Interventions. arXiv preprint arXiv:2002.06278 (2020).
[22] Janis Klaise, Arnaud Van Looveren, Giovanni Vacanti, and Alexandru Coca. 2019. Alibi: Algorithms for monitoring and explaining machine learning models. https://github.com/SeldonIO/alibi
[23] Ronny Kohavi and Barry Becker. 1996. UCI Machine Learning Repository. https://archive.ics.uci.edu/ml/datasets/adult
[24] Alex Kulesza and Ben Taskar. 2012. Determinantal point processes for machine learning. arXiv preprint arXiv:1207.6083 (2012).
[25] I Elizabeth Kumar, Suresh Venkatasubramanian, Carlos Scheidegger, and Sorelle Friedler. 2020. Problems with Shapley-value-based explanations as feature importance measures. arXiv preprint arXiv:2002.11097 (2020).
[26] Vivian Lai, Jon Z Cai, and Chenhao Tan. 2019. Many Faces of Feature Importance: Comparing Built-in and Post-hoc Feature Importance in Text Classification. arXiv preprint arXiv:1910.08534 (2019).
[27] Vivian Lai and Chenhao Tan. 2019. On Human Predictions with Explanations and Predictions of Machine Learning Models: A Case Study on Deception Detection. In Proceedings of FAT*.
[28] Himabindu Lakkaraju, Stephen H Bach, and Jure Leskovec. 2016. Interpretable decision sets: A joint framework for description and prediction. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 1675–1684.
[29] Scott Levin, Matthew Toerper, Eric Hamrock, Jeremiah S Hinson, Sean Barnes, Heather Gardner, Andrea Dugas, Bob Linton, Tom Kirsch, and Gabor Kelen. 2018. Machine-learning-based electronic triage more accurately differentiates patients with respect to clinical outcomes compared with the emergency severity index. Annals of Emergency Medicine 71, 5 (2018), 565–574.
[30] Q Vera Liao, Daniel Gruen, and Sarah Miller. 2020. Questioning the AI: Informing Design Practices for Explainable AI User Experiences. In Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems. 1–15.
[31] Zachary C Lipton. 2018. The mythos of model interpretability. Queue 16, 3 (2018), 31–57.
[32] Yin Lou, Rich Caruana, and Johannes Gehrke. 2012. Intelligible models for classification and regression. In Proceedings of KDD.
[33] Scott M Lundberg and Su-In Lee. 2017. A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems. 4765–4774.
[34] Divyat Mahajan, Chenhao Tan, and Amit Sharma. 2019. Preserving causal constraints in counterfactual explanations for machine learning classifiers. arXiv preprint arXiv:1912.03277 (2019).
[35] Tim Miller. 2018. Explanation in artificial intelligence: Insights from the social sciences. Artificial Intelligence (2018).
[36] Karel GM Moons, Andre Pascal Kengne, Mark Woodward, Patrick Royston, Yvonne Vergouwe, Douglas G Altman, and Diederick E Grobbee. 2012. Risk prediction models: I. Development, internal validation, and assessing the incremental value of a new (bio)marker. Heart 98, 9 (2012), 683–690.
[37] Ramaravind K Mothilal, Amit Sharma, and Chenhao Tan. 2020. Explaining machine learning classifiers through diverse counterfactual explanations. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency. 607–617.
[38] Ziad Obermeyer and Ezekiel J Emanuel. 2016. Predicting the future—big data, machine learning, and clinical medicine. The New England Journal of Medicine (2016).
[39] Judea Pearl, Madelyn Glymour, and Nicholas P Jewell. 2016. Causal Inference in Statistics: A Primer. John Wiley & Sons.
[40] Rafael Poyiadzi, Kacper Sokol, Raul Santos-Rodriguez, Tijl De Bie, and Peter Flach. 2020. FACE: feasible and actionable counterfactual explanations. In Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society. 344–350.
[41] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2016. "Why should I trust you?" Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 1135–1144.
[42] Donald B Rubin. 2005. Causal inference using potential outcomes: Design, modeling, decisions. J. Amer. Statist. Assoc. (2005).
[43] Cynthia Rudin. 2019. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nature Machine Intelligence 1, 5 (2019), 206–215.
[44] Chris Russell. 2019. Efficient search for diverse coherent explanations. In Proceedings of the Conference on Fairness, Accountability, and Transparency. 20–28.
[45] Maximilian Schleich, Zixuan Geng, Yihong Zhang, and Dan Suciu. 2021. GeCo: Quality Counterfactual Explanations in Real Time. arXiv preprint arXiv:2101.01292 (2021).
[46] Shubham Sharma, Jette Henderson, and Joydeep Ghosh. 2019. Certifai: Counterfactual explanations for robustness, transparency, interpretability, and fairness of artificial intelligence models. arXiv preprint arXiv:1905.07857 (2019).
[47] Shubham Sharma, Jette Henderson, and Joydeep Ghosh. 2020. Certifai: A common framework to provide explanations and analyse the fairness and robustness of black-box models. In Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society. 166–172.
[48] Avanti Shrikumar, Peyton Greenside, and Anshul Kundaje. 2017. Learning important features through propagating activation differences. arXiv preprint arXiv:1704.02685 (2017).
[49] Dylan Slack, Sophie Hilgard, Emily Jia, Sameer Singh, and Himabindu Lakkaraju. 2019. How can we fool LIME and SHAP? Adversarial Attacks on Post hoc Explanation Methods. arXiv preprint arXiv:1911.02508 (2019).
[50] Kacper Sokol and Peter Flach. 2020. Explainability fact sheets: a framework for systematic assessment of explainable approaches. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency. 56–67.
[51] Mukund Sundararajan and Amir Najmi. 2019. The many Shapley values for model explanation. arXiv preprint arXiv:1908.08474 (2019).
[52] Mukund Sundararajan, Ankur Taly, and Qiqi Yan. 2017. Axiomatic attribution for deep networks. arXiv preprint arXiv:1703.01365 (2017).
[53] Sarah Tan, Rich Caruana, Giles Hooker, and Yin Lou. 2017. Detecting bias in black-box models using transparent model distillation. arXiv preprint arXiv:1710.06169 (2017).
[54] Berk Ustun, Alexander Spangher, and Yang Liu. 2019. Actionable recourse in linear classification. In Proceedings of the Conference on Fairness, Accountability, and Transparency. 10–19.
[55] Sandra Wachter, Brent Mittelstadt, and Chris Russell. 2017. Counterfactual explanations without opening the black box: Automated decisions and the GDPR. Harv. JL & Tech. 31 (2017), 841.
[56] Mo Yu, Shiyu Chang, Yang Zhang, and Tommi S Jaakkola. 2019. Rethinking cooperative rationalization: Introspective extraction and complement control. arXiv preprint arXiv:1910.13294 (2019).
[57] Yujia Zhang, Kuangyan Song, Yiming Sun, Sarah Tan, and Madeleine Udell. 2019. "Why Should You Trust My Explanation?" Understanding Uncertainty in LIME Explanations. arXiv preprint arXiv:1904.12991 (2019).
[58] Haojun Zhu. 2016. Predicting Earning Potential using the Adult Dataset. https://rpubs.com/H_Zhu/235617
A SUPPLEMENTARY MATERIALS
A.1 Explanation Scores: A Simple Example
Method                  x1      x2
LIME                    0.34    0.07
SHAP (median BG)        1.0     0.0
SHAP (train data BG)    0.69    0.28
DiCE_FA                 –       –
WachterCF_FA            –       –

Table 1: Explaining a linear threshold model of the form y = I(w1·x1 + w2·x2 ≥ b) at a particular input point. x1 and x2 are continuous features randomly sampled from a uniform distribution. The second and third columns show an explanation method's score for x1 and x2 respectively. For SHAP, the scores are shown with both the median data point and the entire training data as the background (BG) sample, in the second and third rows respectively. Unlike the attribution-based methods (LIME and SHAP), the counterfactual-based methods (DiCE_FA and WachterCF_FA) give almost equal importance to x2 even though its coefficient in the target model is much smaller than x1's coefficient.
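For concreteness, a counterfactual-based importance score of the kind reported in the last two rows can be sketched as the fraction of counterfactual examples in which each feature changes (in the spirit of DiCE_FA); the input and counterfactual values below are purely illustrative.

```python
import numpy as np

def cf_feature_importance(x, counterfactuals, tol=1e-6):
    """Counterfactual-based importance in the spirit of DiCE_FA: the fraction of
    counterfactual examples in which each feature differs from the original input.
    x: (n_features,) original input; counterfactuals: (n_cf, n_features)."""
    changed = np.abs(counterfactuals - x) > tol
    return changed.mean(axis=0)

# Toy usage: 4 counterfactuals for a 2-feature input.
x = np.array([0.8, 0.8])
cfs = np.array([[0.3, 0.8],
                [0.2, 0.1],
                [0.4, 0.2],
                [0.3, 0.3]])
print(cf_feature_importance(x, cfs))   # [1.0, 0.75]
```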
A.2 Choosing Counterfactual Explanation Methods
We surveyed publicly available counterfactual explanation methods on GitHub that satisfy two criteria for our experiments: (a) support for generating counterfactuals using a subset of features, and (b) support for generating multiple counterfactuals. While a few methods could in theory be altered to generate CFs using a feature subset [5, 20, 22, 45], we filter them out since it is not clear how to do so in practice without making significant changes to the original libraries. Similarly, we filter out methods that do not explicitly support generating multiple CFs [5, 22].

Further, some libraries require substantial pre-processing to make comparison with other libraries feasible. For instance, while MACE [20] can generate multiple CFs, it requires extensive conversion to logic formulae to include any ML model other than the few standard models provided by the authors. Similarly, it is not clear how GeCo [45], written entirely in Julia, could be altered to generate CFs with a feature subset (or how to use it to explain Python-based ML models and compare it with the other explanation methods, which are mostly Python-based). DiCE [37] and MOC [9] are the only two libraries that directly satisfy both criteria. Further, the seminal counterfactual method by Wachter et al. (WachterCF) could also be easily implemented. Though WachterCF, by default, provides only a single counterfactual, its optimization can be run with multiple random seeds to generate multiple counterfactuals. Since we faced several compatibility issues, such as transferring models between DiCE and MOC (the two libraries are based in Python and R respectively), we chose DiCE and WachterCF as our two counterfactual methods to compare against the two feature attribution methods, LIME and SHAP.
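As a rough sketch of the multi-seed strategy described above, the snippet below minimizes a Wachter-style loss (a prediction term plus an L1 distance penalty) from several random initializations using a generic optimizer; the toy model, loss weights, and optimizer are illustrative and not the configuration used in our experiments.

```python
import numpy as np
from scipy.optimize import minimize

def wachter_cfs(predict_proba, x, target=0.8, lam=10.0, n_seeds=4, seed=0):
    """Generate multiple counterfactuals for input x by minimizing
    lam * (f(x') - target)^2 + ||x' - x||_1 from different random starting points."""
    rng = np.random.default_rng(seed)
    loss = lambda xp: lam * (predict_proba(xp) - target) ** 2 + np.abs(xp - x).sum()
    cfs = []
    for _ in range(n_seeds):
        x0 = x + rng.normal(scale=0.5, size=x.shape)    # random initialization
        cfs.append(minimize(loss, x0, method="Nelder-Mead").x)
    return np.unique(np.round(cfs, 3), axis=0)          # keep only unique solutions

# Toy black-box model: positive-class probability from a logistic score.
predict_proba = lambda xp: 1.0 / (1.0 + np.exp(-(2.0 * xp[0] - 1.0 * xp[1])))
print(wachter_cfs(predict_proba, x=np.array([-1.0, 1.0])))
```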
A.3 Validity and Stability of DiCE
Table 2 shows the mean percentage validity of DiCE with its default hyperparameters. DiCE has two main hyperparameters, proximity_weight and diversity_weight, controlling the closeness of counterfactuals to the test instance and the diversity of the counterfactuals respectively. Their default values are 0.5 for proximity_weight and 1.0 for diversity_weight. These two parameters have an inherent trade-off [37], and hence we vary only diversity_weight to examine how sensitive the feature importance scores derived from DiCE_FA are to this hyperparameter. Figure 9 shows the results: DiCE_FA is not sensitive to this hyperparameter, and the different hyperparameter versions have a correlation of above 0.96 on all datasets.
Table 2: The second column shows the mean percentage of unique and valid CFs found at each nCF ∈ {1, 2, 4, 6, 8} for the datasets given in the first column. The mean validity is computed over a random sample of 200 test instances for each dataset. The third column shows the number of test instances for which all the CFs found are unique and valid at different nCF.

Figure 9: Correlation between different versions of DiCE_FA at different hyperparameters. The pink line corresponds to the correlation between feature importance derived from DiCE versions with 1.0 and 0.25 as the diversity_weight respectively. Similarly, the gray and blue lines correspond to the 1.0 and 0.5, and the 0.25 and 0.5 diversity_weight pairs respectively. All DiCE_FA versions exhibit high pairwise correlation (> 0.96) on all datasets.
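Schematically, the sensitivity check behind Fig. 9 re-runs DiCE with different diversity_weight values and correlates the resulting DiCE_FA scores. In the sketch below, exp is assumed to be a gradient-based dice_ml.Dice explainer, query a test instance, and importance_from_cfs a hypothetical helper that computes the fraction of counterfactuals in which each feature changes; the argument names follow the DiCE library but may differ across versions.

```python
from scipy.stats import pearsonr

scores = {}
for dw in [0.25, 0.5, 1.0]:                       # diversity_weight settings compared in Fig. 9
    cf_set = exp.generate_counterfactuals(        # exp, query: assumed to exist as above
        query, total_CFs=4, desired_class="opposite",
        proximity_weight=0.5, diversity_weight=dw)
    scores[dw] = importance_from_cfs(cf_set, query)   # hypothetical DiCE_FA scoring helper

# Pairwise correlations between DiCE_FA scores at two diversity_weight settings,
# corresponding to the lines in Fig. 9.
for a, b in [(1.0, 0.25), (1.0, 0.5), (0.25, 0.5)]:
    print(a, b, pearsonr(scores[a], scores[b])[0])
```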
A.4 Necessity and Sufficiency
Figure 10 shows the necessity and sufficiency metrics when we allow all features up to the top-k features to change (for necessity) or to remain fixed (for sufficiency). Necessity increases for the top-k features, but sufficiency remains identical to the setting in the main paper.
Figure 10: (a) Necessity and (b) Sufficiency. The y-axis represents the necessity and sufficiency measures at a particular nCF, as defined in §4.2. Fig. (a) shows the results when we are allowed to change only up to the k-th most important feature (k = 1, 2, 3) or only the other features, while Fig. (b) shows the results when we keep the features up to the k-th most important one fixed (k = 1, 2, 3).