Feature relevance quantification in explainable AI: A causal problem
Dominik Janzing, Lenon Minorics, and Patrick Blöbaum
Amazon Research Tübingen, Germany  {janzind, minorics, bloebp}@amazon.com
25 November 2019
Abstract
We discuss promising recent contributions on quantifying feature relevance using Shapley values, where we observed some confusion about which probability distribution is the right one for dropped features. We argue that the confusion stems from not carefully distinguishing between observational and interventional conditional probabilities, and we attempt a clarification based on Pearl's seminal work on causality. We conclude that unconditional rather than conditional expectations provide the right notion of dropping features, in contradiction to the theoretical justification of the software package SHAP. Parts of SHAP are unaffected because unconditional expectations (which we argue to be conceptually right) are used as an approximation for the conditional ones, which encouraged others to 'improve' SHAP in a way that we believe to be flawed.
Despite several impressive success stories of deep learning, not only researchers in the field have been shocked more recently by the lack of robustness of algorithms that were actually believed to be powerful. Image classifiers, for instance, fail spectacularly once the images are subjected to adversarial changes that appear minor to humans, see e.g. Goodfellow et al. (2015); Sharif et al. (2016); Kurakin et al. (2018); Eykholt et al. (2018); Brown et al. (2018). Understanding these failures is challenging since it is hard to analyze which features were decisive for the classification in a particular case. However, lack of robustness is only one of several different motivations for making artificial intelligence interpretable. The demand for fair decisions, e.g., Dwork et al. (2012); Kilbertus et al. (2017); Barocas et al. (2018), also requires an understanding of algorithms. In this case, it may even be the subject of legal and ethical discussions why an algorithm came to a certain conclusion.

To formalize the problem, we describe the input/output behaviour as a function f : X_1 × ... × X_n → R, where X_1, ..., X_n denote the ranges of some input variables (X_1, ..., X_n) =: X (discrete or continuous), while we assume the target variable Y to be real valued for reasons that will become clear later. Given one particular input x := (x_1, ..., x_n), we want to quantify to what extent each x_j is 'responsible' for the output f(x_1, ..., x_n). This question only makes sense, of course, after specifying what the input should be compared to instead. Let us first consider the case where x is compared to some 'baseline' element x', which has been studied in the literature mostly for the case of real-valued inputs and differentiable f. Based on a hypothetical scenario where only some of the baseline values x'_j are replaced with x_j while others are kept, one wants to quantify to what extent each component j contributes to the difference f(x) − f(x'). The focus of the present paper, however, is a scenario where the baseline is defined by the expectation E[f(X)] over some distribution P_X. To explain the relevance of each j for the difference f(x) − E[f(X)], one considers a scenario where only some values are kept and the remaining ones are averaged over some probability distribution. The main contribution of this paper is to discuss which distribution is the right one. Recalling the difference between interventional and observational conditional distributions in the field of causality, we explain why we disagree with the interesting proposal of Lundberg and Lee (2017) in this regard. Further, we argue that our criticism is irrelevant for any software that 'approximates' the conditional expectation (which we consider conceptually wrong) by the unconditional expectation, as proposed by Lundberg and Lee (2017).

The paper is structured as follows. Section 2 summarizes results from the literature regarding axioms for feature attribution for the case where there is a unique baseline reference input. Here, integrated gradients and Shapley values (as the generalization to discrete input) are the unique attribution functions for the stated set of axioms. Section 3 discusses the attribution problem for the case where one averages over unused features as in Lundberg and Lee (2017), and then we present our criticism.
We think that the big overlap of the present paper with existing literature is justified by aiming at this clarification only, while keeping the clarification as self-contained as possible. In particular, the very general discussion of Datta et al. (2016) contains all the ideas of this work at least implicitly, but since it appeared before Lundberg and Lee (2017), it could not explicitly discuss the conceptual problems raised by the latter. Our view on marginalization over unused features is supported by Datta et al. (2016) for similar reasons. In Section 4 we present different experiments which illustrate our arguments.

The growth of deep neural networks recently motivated many researchers to investigate feature attribution, see e.g. Shrikumar et al. (2016) for DeepLIFT, Binder et al. (2016) for Layer-wise Relevance Propagation (LRP), Ribeiro et al. (2016) for Local Interpretable Model-agnostic Explanations (LIME), and Chattopadhyay et al. (2019) for gradient-based methods. For a summary of common architecture-agnostic methods, see Molnar (2019). We first discuss two closely related concepts that arise from an axiomatic approach.
Sundararajan et al. (2017) investigated the attribution of x_i to the difference

f(x) − f(x'),    (1)

where x' is a given baseline. Under the assumption that f is differentiable almost everywhere, they defined the attribution of x_i to (1) as

IntegratedGrads_i(x; f) := (x_i − x'_i) ∫_{α=0}^{1} ∂f(x' + α(x − x')) / ∂x_i dα.

Contrary to LIME, DeepLIFT and LRP, this attribution method has the advantage that all of the following properties are satisfied (see Sundararajan et al. (2017) and Aas et al. (2019)):
1. Completeness: If atr_i(x; f) denotes the attribution of x_i to (1), then Σ_i atr_i(x; f) = f(x) − f(x').
2. Sensitivity: If f does not depend on x_i, then atr_i(x; f) = 0.
3. Implementation Invariance: If f and f' are equal for all inputs, then atr_i(x; f) = atr_i(x; f') for all i. (Note that this axiom is pointless if it refers to properties of functions rather than properties of algorithms. We have listed it for completeness and for consistency with the literature.)
4. Linearity: For a, b ∈ R it holds that atr_i(x; a f_1 + b f_2) = a · atr_i(x; f_1) + b · atr_i(x; f_2).
5. Symmetry-Preserving: If f is symmetric in components i and j, and x_i = x_j and x'_i = x'_j, then atr_i(x; f) = atr_j(x; f); see Sundararajan et al. (2017, Proposition 1).

Integrated gradients can be generalized by integrating the gradient along an arbitrary path γ instead of the straight line. Such an attribution method is called a path method, and the following theorem holds.

Theorem 1 ((Friedman, 2004, Theorem 1) and (Sundararajan et al., 2017, Theorem 1)). If an attribution method satisfies the properties Completeness, Sensitivity, Implementation Invariance and Linearity, then the attribution method is a convex combination of path methods. Furthermore, integrated gradients is the only path method that is symmetry preserving.
Notice that convex combinations of path methods can also be symmetry preserving even if the attribution method is not given by integrated gradients.
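To make the definition above concrete, here is a minimal numerical sketch (our own illustration, not code from any of the cited papers): it approximates the path integral by a midpoint Riemann sum and the partial derivatives by central finite differences; the function names and the toy function f are hypothetical.

import numpy as np

def integrated_gradients(f, x, x_baseline, steps=200, eps=1e-6):
    """Approximate IntegratedGrads_i(x; f) for all i by a midpoint Riemann sum
    along the straight line from the baseline x' to x, with finite-difference
    gradients of a black-box f: R^n -> R."""
    x, x_baseline = np.asarray(x, float), np.asarray(x_baseline, float)
    n = x.size
    grads = np.zeros(n)
    for alpha in (np.arange(steps) + 0.5) / steps:   # midpoints of [0, 1]
        point = x_baseline + alpha * (x - x_baseline)
        for i in range(n):
            shift = np.zeros(n)
            shift[i] = eps
            grads[i] += (f(point + shift) - f(point - shift)) / (2 * eps)
    return (x - x_baseline) * grads / steps

# Completeness check on a toy function: attributions should sum to f(x) - f(x').
f = lambda z: z[0] * z[1] + z[1] ** 2
x, x_prime = np.array([1.0, 2.0]), np.array([0.0, 0.0])
attr = integrated_gradients(f, x, x_prime)
print(attr, attr.sum(), f(x) - f(x_prime))   # sum of attributions ≈ 6.0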
To assess feature relevance relative to the average, Lundberg and Lee (2017) use a concept that relies on first defining an attribution for binary functions, or, equivalently, functions with subsets as input ('set functions'). We first explain this concept and describe in Section 3 how it solves the attribution relative to the expectation. Assume we are given a set with n elements, say U := {1, ..., n}, and a function g : 2^U → R with g(U) ≠ 0 and g(∅) = 0. We then ask to what extent each single j ∈ U contributes to g(U). A priori, the contribution of each j depends on the order in which more elements are included. We can thus define the contribution of j, given T ⊆ U, by

C(j | T) := g(T ∪ {j}) − g(T)

(note that it can be negative and can also exceed g(U)). With

φ_i := Σ_{T ⊆ U\{i}} C(i | T) / ( n · binom(n−1, |T|) ),    (2)

it then holds that

g(U) = Σ_{i=1}^{n} φ_i.

The quantity φ_i is called the Shapley value (Shapley, 1953) of i, which can be considered the average contribution of i to g(U). At first glance, Shapley values only solve the attribution problem for binary inputs, by canonically identifying subsets T with binary words in {0,1}^n. To show that Shapley values also solve the above attribution problem, one can simply define a set function by

g(T) := f_T(x_T) − f(x'),

for any subset T ⊆ {1, ..., n}. Here, f_T is the 'simplified' function with the reduced input x_T obtained from f when all remaining features are taken from the baseline input x', that is, f_∅(x_∅) = f(x').

Since Shapley values also satisfy Completeness, Sensitivity, Implementation Invariance and Linearity (Aas et al., 2019) with respect to the binary function defined by the set function g, they are given by a convex combination of path methods. Furthermore, Shapley values with respect to g are symmetry-preserving, but do not coincide with integrated gradients.

Different ways of feature attribution based on Shapley values were recently investigated by Sundararajan and Najmi (2019). Their main consideration is feature relevance relative to an auxiliary baseline, but feature attribution relative to the expectation (according to an arbitrary distribution) is also mentioned. Furthermore, Sundararajan and Najmi (2019) already discussed that Shapley values based on conditional distributions can assign unimportant features non-zero attribution. However, Sundararajan and Najmi (2019) did not consider the problem from a causal perspective.

Which distribution for dropped features?

We now want to attribute the difference between f(x) and the expectation E[f(X)] to individual features. Explaining why the output for one particular input x deviates strongly from the average output is particularly interesting for understanding 'outliers'. Let us introduce some notation first. For any T ⊆ U, let E[f(x_T, X_T̄) | X_T = x_T] denote the conditional expectation of f, given X_T = x_T. By E[f(x_T, X_T̄)] we denote the expectation of f(x_T, X_T̄) over X_T̄ without conditioning on X_T = x_T. Let us call this expression 'marginal expectation' henceforth. Accordingly, we now discuss two different options for defining 'simplified functions' f_T where all features from T̄ are dropped:

f_T(x) := E[f(x_T, X_T̄) | X_T = x_T]    (3)

or

f_T(x) := E[f(x_T, X_T̄)]?    (4)

Lundberg and Lee (2017) propose (3), but since it is difficult to compute, they approximate it by (4), which they justify by the simplifying assumption of feature independence.
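Before turning to the causal discussion, formula (2) can be evaluated directly for any set function g with g(∅) = 0. The following is our own sketch (exponential in n and meant only for small examples); the function names and the toy set function are hypothetical.

from itertools import combinations
from math import comb

def shapley_values(g, n):
    """Exact Shapley values phi_1, ..., phi_n of a set function g on U = {1, ..., n},
    following formula (2): phi_i = sum over T ⊆ U\{i} of C(i|T) / (n * binom(n-1, |T|)),
    with contribution C(i|T) = g(T ∪ {i}) - g(T)."""
    universe = range(1, n + 1)
    phi = {}
    for i in universe:
        rest = [j for j in universe if j != i]
        phi[i] = sum(
            (g(frozenset(T) | {i}) - g(frozenset(T))) / (n * comb(n - 1, len(T)))
            for size in range(n) for T in combinations(rest, size)
        )
    return phi

# Sanity check: the Shapley values add up to g(U).
g = lambda S: len(S) ** 2          # toy set function with g(∅) = 0
phi = shapley_values(g, 3)
print(phi, sum(phi.values()))      # {1: 3.0, 2: 3.0, 3: 3.0}, 9.0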
Using the set function g(T) := f_T(x) − f_∅(x), they compute Shapley values φ_i according to (2). We will argue that using (4) rather than (3) is conceptually the right thing in the first place. Our clarification is supposed to prevent others from 'improving' SHAP by finding an approximation for the conditional expectation that is better than the marginal expectation, as done, for instance, by Aas et al. (2019) and Lundberg et al. (2018). (Note that TreeExplainer in SHAP has meanwhile been changed accordingly.)

To explain our arguments, let us first explain why marginal expectations occur naturally in the field of causal inference.
Observational versus interventional conditional distributions
The main ideas of this paragraph can already be found in Datta et al. (2016) in a more general and abstract form, see also Friedman (2001) and Zhao and Hastie (2019), but we want to rephrase them in a way that optimally prepares the reader for the discussion below. Assume we are given the causal structure shown in Figure 1. Further, assume we are interested in how the expectation of Y changes when we manually set X_1 to some value x_1. This is not given by E[Y | X_1 = x_1], because observing X_1 = x_1 also changes the distribution of X_2, X_3 due to the dependences between X_1 and X_2, X_3 (which are generated by the common cause Z). This way, the difference between E[Y] and E[Y | X_1 = x_1] is not only due to the influence of X_1, but can also be caused by the influence of X_2, X_3. The impact of setting X_1 to x_1 is instead captured by Pearl's do-operator (Pearl, 2000), which yields

E[Y | do(X_1 = x_1)] = ∫ E[Y | x_1, x_2, x_3] p(x_2, x_3) dx_2 dx_3.    (5)

This can easily be verified using the backdoor criterion (Pearl, 2000), since (phrased in Pearl's language) the variables X_2, X_3 'block the backdoor path' X_1 ← Z → ⋯ → Y. Observations from Z are not needed; we may therefore assume Z to be latent, which we have indicated by white color.

Figure 1: A simple causal structure where the observational conditional p(y | x_1) does not correctly describe how Y changes after intervening on X_1, because the common cause Z 'confounds' the relation between X_1 and Y.

For our purpose, two observations are important. First, (5) does not contain the conditional distribution given X_1 = x_1. Replacing p(x_2, x_3) with p(x_2, x_3 | x_1) in (5) would yield the observational conditional expectation E[Y | X_1 = x_1], which we are not interested in. In other words, the intervention on X_1 breaks the dependences to X_2, X_3. The second observation that is crucial for us is that the dependences between X_2, X_3 are kept; they are unaffected by the intervention on X_1.
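The difference between the two quantities is easy to see in a small simulation of the structure in Figure 1; the linear mechanisms, coefficients, and variable names below are arbitrary choices of ours for illustration only.

import numpy as np

rng = np.random.default_rng(0)
N = 200_000

# Structural equations: Z is a latent common cause of X1, X2, X3; Y depends on the X's.
Z  = rng.normal(size=N)
X1 = Z + rng.normal(size=N)
X2 = Z + rng.normal(size=N)
X3 = Z + rng.normal(size=N)
Y  = X1 + 2 * X2 + 3 * X3 + rng.normal(size=N)

x1 = 1.0

# Observational conditional E[Y | X1 ≈ x1]: select samples where X1 is close to x1,
# which also shifts X2 and X3 via the common cause Z.
observational = Y[np.abs(X1 - x1) < 0.05].mean()

# Interventional E[Y | do(X1 = x1)]: replay the mechanism for Y with X1 forced to x1
# while X2, X3 keep their unconditioned values -- the adjustment formula (5).
interventional = (x1 + 2 * X2 + 3 * X3 + rng.normal(size=N)).mean()

print(observational, interventional)   # ≈ 3.5 vs ≈ 1.0: Z confounds X1 and Y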
Why observational conditionals are flawed

Let us start with a simple example.
Example 1 (irrelevant feature). Assume we have f(x_1, x_2) = x_1. Obviously, the feature X_2 is irrelevant. Let both X_1, X_2 be binary and

p(x_1, x_2) = 1/2 if x_1 = x_2, and 0 otherwise.

(1) With conditional expectations:

f_∅(x) = E[f(X_1, X_2)] = 1/2    (6)
f_{1}(x) = E[f(x_1, X_2) | x_1] = x_1    (7)
f_{2}(x) = E[f(X_1, x_2) | x_2] = x_2    (8)
f_{1,2}(x) = f(x_1, x_2) = x_1    (9)

Therefore,

C(2 | ∅) = f_{2}(x) − f_∅(x) = x_2 − 1/2
C(2 | {1}) = f_{1,2}(x) − f_{1}(x) = x_1 − x_1 = 0.

Hence, the Shapley value for X_2 reads

φ_2 = (1/2)(x_2 − 1/2 + x_1 − x_1) = x_2/2 − 1/4 ≠ 0.

(2) With marginal expectations:

f_∅(x) = E[f(X_1, X_2)] = 1/2    (10)
f_{1}(x) = E[f(x_1, X_2)] = x_1    (11)
f_{2}(x) = E[f(X_1, x_2)] = 1/2    (12)
f_{1,2}(x) = f(x_1, x_2) = x_1.    (13)

We then obtain

C(2 | ∅) = f_{2}(x) − f_∅(x) = 0
C(2 | {1}) = f_{1,2}(x) − f_{1}(x) = 0,

which yields φ_2 = 0.
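The two variants of Example 1 can be reproduced with a few lines of code (our own sketch; the function names are hypothetical).

from itertools import product

# Example 1: X1, X2 binary and perfectly correlated, p(x1, x2) = 1/2 if x1 = x2,
# 0 otherwise; f(x1, x2) = x1, so feature 2 is irrelevant for f.
p = {(0, 0): 0.5, (1, 1): 0.5, (0, 1): 0.0, (1, 0): 0.0}
f = lambda x1, x2: x1
x = (1, 1)                      # the particular input to be explained

def simplified(T, conditional):
    """f_T(x): drop the features outside T, averaging either over the conditional
    distribution given X_T = x_T (eq. (3)) or over the marginal distribution (eq. (4))."""
    total, value = 0.0, 0.0
    for x1, x2 in product([0, 1], repeat=2):
        w = p[(x1, x2)]
        if conditional and ((1 in T and x1 != x[0]) or (2 in T and x2 != x[1])):
            w = 0.0             # discard samples inconsistent with the retained features
        total += w
        value += w * f(x[0] if 1 in T else x1, x[1] if 2 in T else x2)
    return value / total

for conditional in (True, False):
    fT = {T: simplified(set(T), conditional) for T in [(), (1,), (2,), (1, 2)]}
    phi2 = 0.5 * ((fT[(2,)] - fT[()]) + (fT[(1, 2)] - fT[(1,)]))
    print('conditional:' if conditional else 'marginal:  ', 'phi_2 =', phi2)
# conditional: phi_2 = 0.25  (nonzero although f ignores x2)
# marginal:    phi_2 = 0.0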
The example proves the following result, which was already discussed in Sundararajan and Najmi (2019):

Lemma 1 (failure of Sensitivity). When the relevance φ_i is defined via the 'simplified' functions f_T(x) := E[f(x_T, X_T̄) | X_T = x_T] from (3), then φ_i ≠ 0 does not imply that f depends on x_i.

The example is particularly worrisome because we mentioned earlier that Shapley values satisfy the axiom of sensitivity, while Lemma 1 seems to claim the opposite. To resolve this paradox, note that the Shapley values refer to binary functions (or set functions): reading (6) to (9) as the values of a binary function g̃ with inputs (z_1, z_2) = 00, 10, 01, 11, we clearly observe that g̃ also depends on the second bit. This way, the Shapley values do not violate sensitivity for g̃, but we certainly care about 'sensitivity for f'. Note that this distinction between the binary function g̃ and f is crucial, although in our example f is binary itself. Fortunately, the second bit is irrelevant for the binary function g̃ defined by (10) to (13), and we do not obtain the above paradox.

To assess the impact of changing the inputs of f, we now switch to a more causal language and state that we consider the inputs of an algorithm as causes of the output. Although this remark seems trivial, it is necessary to emphasize that we are not talking about the causal relations between any features in the real world outside the computer (where the attribute predicted by Y may be the cause of the features), but only about causality of this technical input/output system. To facilitate this view, we formally distinguish between the true features X̃_1, ..., X̃_n obtained from the objects and the corresponding features X_1, ..., X_n plugged into the algorithm. This way, we are able to talk about a hypothetical scenario where the inputs are changed compared to the true features. Let us first consider the causal structure in Figure 2, top, where the inputs are determined by the true features. In contrast, Figure 2, bottom, shows the causal structure after an intervention on X_1, X_2 has adjusted these variables to fixed values x_1, x_2.

We now consider the impact of a hypothetical intervention which leaves the remaining components unaffected. They are therefore sampled from their natural joint distribution without conditioning. Similar to the above paragraph, we then obtain

E[Y | do(X_T = x_T)] = E[f(x_T, X_T̄)].    (14)

Our formal separation between the true values of the features X̃_j of some object and the corresponding inputs X_j of the algorithm allows us to be agnostic about the causal relations between the true features in the real world (accordingly, Y is the output of the system and not a property of the external world); the fact that the inputs X_1, ..., X_n cause the output Y is the only causal knowledge needed to compute (14). Since the interventional expectations coincide with the marginal expectations, we have thus justified the use of marginal expectations for the Shapley values from the causal perspective.

Figure 2: Top: Causal structure of our prediction scenario: the output Y is determined by the inputs X_1, ..., X_n. In the usual learning scenario these inputs coincide with the features X̃_1, ..., X̃_n of some object, that is, X_j = X̃_j.
Bottom: To evaluate the impact of some inputs, say X_1, X_2, on the output Y, we consider a hypothetical scenario where we adjust these inputs to some fixed values x_1, x_2 and sample the remaining inputs from the usual joint distribution P_{X_3,...,X_n}.

Figure 3: Values and probabilities of two independent random variables X_1, X_2 and of the function f(X_1, X_2) = X_1 + X_2:

Probability        X_1   X_2   f = X_1 + X_2
(1 − p)(1 − q)      1     1     2
(1 − p) · q         1     2     3
p · (1 − q)         2     1     3
p · q               2     2     4

The problem with the symmetry axiom
We briefly rephrase Example 4.9 of Sundararajan and Najmi (2019), showing that the symmetry axiom is violated when Shapley values are used for quantifying the influence relative to conditional or marginal expectations. Figure 3 shows the values and probabilities of two random variables X_1 and X_2 and the values of the function f(X_1, X_2) = X_1 + X_2. As explained by Sundararajan and Najmi (2019), for the input (x_1, x_2) = (2, 2) the value x_1 gets attribution (1 − p) and x_2 gets attribution (1 − q). Therefore, if p ≠ q, x_1 and x_2 get different attributions, although f is symmetric. They conclude that this is a violation of symmetry. Since X_1 and X_2 are independent, this problem occurs regardless of whether one defines the simplified function f_T with respect to marginal or conditional expectations. One can argue, however, that this result makes intuitive sense, because the value x_j that is farther from its mean contributes more to the fact that f(x_1, x_2) deviates from its mean. If we even have x_1 = E[X_1], we would certainly say that x_1 does not contribute to the deviation from the mean at all. For this reason we do not follow Sundararajan and Najmi (2019) in regarding this phenomenon as a problem of this kind of attribution analysis. Recall furthermore that we have already mentioned that the symmetry axiom does hold for the corresponding binary function defined by including or not including certain features (simply because symmetry holds for Shapley values). For the above example this binary function is indeed asymmetric. To check this, define

g̃(z_1, z_2) := E[f(x_T, X_T̄)],

where T is the set of all j for which z_j = 1. This function is not symmetric in z_1 and z_2, since we have, for instance,

g̃(1, 0) = x_1 + E[X_2] ≠ g̃(0, 1) = x_2 + E[X_1].

Numerical Evidence
In this section, we show numerically that the marginal expectation E[f(x_T, X_T̄)] is a better choice than E[f(x_T, X_T̄) | X_T = x_T] to quantify the attribution of each observation x_j of a particular input x = (x_1, ..., x_n) to f(x) − E[f(X)].

As explained by Aas et al. (2019, Section 2.3), the implementation of KernelSHAP (Lundberg and Lee, 2017) consists of two parts:

1. Using a representation of Shapley values as the solution of a weighted least squares problem for a computationally tractable approximation.

2. Approximation of g(T).

By Charnes et al. (1988), the Shapley values for the set function g are given as the solution (φ_1, ..., φ_n) of

min_{φ_1,...,φ_n} Σ_{T ⊆ U} [ g(T) − Σ_{j ∈ T} φ_j ]^2 k(U, T),    (15)

where

k(U, T) = (|U| − 1) / ( binom(|U|, |T|) · |T| · (|U| − |T|) )

are the Shapley kernel weights. Since k(U, U) = ∞, we use the constraint Σ_j φ_j = g(U), or, for numerical calculation, we set k(U, U) to a large number. Since the power set of U consists of 2^n elements, the computation time of the Shapley values increases exponentially. KernelSHAP therefore samples subsets of U according to the probability distribution induced by the Shapley kernel weights.

As discussed in the previous sections, Lundberg and Lee (2017) define f_T(x) = E[f(x_T, X_T̄) | X_T = x_T]. To evaluate the conditional expectation, they assume feature independence (or weak dependence) to obtain E[f(x_T, X_T̄) | X_T = x_T] ≈ E[f(x_T, X_T̄)] and use the approximation

f_{T, KernelSHAP}(x) ≈ (1/K) Σ_{k=1}^{K} f(x_T, x^k_T̄),    (16)

where x^k_T̄, k = 1, ..., K, are samples from X_T̄.

To show in an experimental setup that the marginal expectation is a better choice, we consider functions f for which we can calculate the attribution of x_j analytically. This is possible for linear functions

f(x) = α_0 + Σ_i α_i x_i,  α_i ∈ R,

since f(x) − E[f(X)] = Σ_i α_i (x_i − E[X_i]) and hence the attribution of x_j is α_j (x_j − E[X_j]). Our experiments are divided into the following setups:

1. We assume that the feature vector X follows a multivariate Gaussian distribution.

2. We use a kernel estimation to approximate the conditional expectation.

For the experiments, we use the KernelExplainer class of the Python SHAP package from Lundberg and Lee (2017) to calculate Shapley values with respect to the marginal expectation and the R package SHAPR, in which the methodology of Aas et al. (2019) is implemented, to calculate Shapley values with respect to the conditional distribution. Notice that calculating Shapley values is also possible for non-linear functions. Further, approximating the marginal expectation is computationally inexpensive compared to the approximation of the conditional expectation with kernel estimation.

Multivariate Gaussian distribution

If X ∼ N(µ, Σ) with some mean vector µ and covariance matrix Σ, it holds that

P(X_T̄ | X_T = x_T) = N(µ_{T̄|T}, Σ_{T̄|T})

(see Aas et al. (2019, Section 3.1)), where

µ_{T̄|T} = µ_T̄ + Σ_{T̄T} Σ_{TT}^{-1} (x_T − µ_T)
Σ_{T̄|T} = Σ_{T̄T̄} − Σ_{T̄T} Σ_{TT}^{-1} Σ_{TT̄},

with µ partitioned into (µ_T, µ_T̄) and Σ partitioned into the blocks Σ_{TT}, Σ_{TT̄}, Σ_{T̄T}, Σ_{T̄T̄}. Hence, we can approximate the conditional expectation by sampling X_T̄ directly from this distribution. We simulate Gaussian data and run the experiment for different numbers of features.
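The following is a minimal sketch of how the two simplified functions can be estimated in this Gaussian setting (our own code with hypothetical names; it is not the SHAP or SHAPR implementation). It assumes f accepts a batch of inputs and, in the conditional case, samples X_T̄ from N(µ_{T̄|T}, Σ_{T̄|T}) as above.

import numpy as np

def conditional_gaussian(mu, Sigma, T, x_T):
    """Mean and covariance of X_Tbar given X_T = x_T for X ~ N(mu, Sigma),
    using the partitioned-Gaussian formulas above. T lists the conditioned indices."""
    Tbar = [i for i in range(len(mu)) if i not in T]
    S_TT_inv = np.linalg.pinv(Sigma[np.ix_(T, T)])
    S_bT = Sigma[np.ix_(Tbar, T)]
    mu_c = mu[Tbar] + S_bT @ S_TT_inv @ (x_T - mu[T])
    Sigma_c = Sigma[np.ix_(Tbar, Tbar)] - S_bT @ S_TT_inv @ Sigma[np.ix_(T, Tbar)]
    return Tbar, mu_c, Sigma_c

def estimate_f_T(f, x, mu, Sigma, T, conditional, n_samples=10_000, seed=0):
    """Monte Carlo estimate of f_T(x): marginal expectation (4) or conditional
    expectation (3) when X is multivariate Gaussian. f maps a batch (m, n) to (m,);
    x, mu, Sigma are numpy arrays."""
    rng = np.random.default_rng(seed)
    T = list(T)
    Tbar = [i for i in range(len(mu)) if i not in T]
    if not Tbar:                               # nothing is dropped
        return float(f(x[None, :])[0])
    if conditional and T:
        Tbar, m, S = conditional_gaussian(mu, Sigma, T, x[T])
        dropped = rng.multivariate_normal(m, S, size=n_samples)
    else:
        dropped = rng.multivariate_normal(mu, Sigma, size=n_samples)[:, Tbar]
    batch = np.tile(x, (n_samples, 1))
    batch[:, Tbar] = dropped
    return float(f(batch).mean())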
For every experiment with a multivariate Gaussian distribution, we set the intercept to 0, i.e. α_0 = 0.

Dimension n = 3.
In the first 3-dimensional experiment, we let α_1 = 0 and choose in every run α_2 and α_3 independently from the standard normal distribution. Further, we let µ = (0, 0, 0)^T and Σ = c c^T, where we choose the entries of c in every run independently from the standard normal distribution and also choose x randomly in every run. The number of runs and the sample size of X is 1000. Figure 4 shows the errors φ_j − contr_j(x) of the Shapley values φ_j with respect to the set function g(T) = E[f(x_T, X_T̄)] − E[f(X)] (blue) and the set function g(T) = E[f(x_T, X_T̄) | X_T = x_T] − E[f(X)] (red). The very precise results for the marginal expectation come mainly from feature 1.

Dimension n = 10.
In the 10-dimensional experiment, we take almost the same setting, with the difference that we set the first 3 coefficients to zero, i.e. α_1 = α_2 = α_3 = 0. Again, the very precise results for the marginal expectation come from the features whose coefficients we set to 0.

Figure 4: Histogram showing the error of the Shapley values for the multivariate Gaussian distribution in the 3-dimensional (left) and 10-dimensional (right) setting with α_0 = 0. Blue: error using the marginal expectation; red: error using the conditional expectation.

Kernel estimation

If we have no information about the underlying distribution, it is hard to approximate the conditional distribution sufficiently well. However, in low dimensions kernel estimates can provide a good approximation. We take the kernel estimation method from Aas et al. (2019) to show how strongly the Shapley values w.r.t. the conditional expectation deviate from α_j (x_j − E[X_j]). Their approximation is as follows:

1. Let Σ_T be the covariance matrix of our sample from X_T. For each point x^i of the sample, calculate the Mahalanobis distance (see Mahalanobis (1936))

dist_T(x, x^i) := sqrt( (x_T − x^i_T)' Σ_T^{-1} (x_T − x^i_T) / |T| ),

where (x_T − x^i_T)' denotes the transpose of (x_T − x^i_T).

2. Calculate the kernel weights

w_T(x, x^i) := exp( − dist_T(x, x^i)^2 / (2σ^2) ).

Here, σ > 0 is a bandwidth which has to be specified.

3. Sort the weights w_T(x, x^i) in increasing order and let x̃^i be the corresponding ordered sampling instances. Then, approximate g(T) by

g_cond(T) := ( Σ_{i=1}^{K} w_T(x, x̃^i) f(x̃^i_T̄, x_T) ) / ( Σ_{i=1}^{K} w_T(x, x̃^i) ).

(A code sketch of this estimator is given after the experiment description below.) For the experiment, we use the real data set
Human Activity Recognition Using Smartphones Data Set (see Anguita et al. (2013)) from the UCI repository. The data set consists of 561 features with a training sample size of 7352 and a test sample size of 2947. In this experiment, we merge these two samples together and therefore our sample size is 10299. We randomly take 4 features and train a linear model with 3 of these features as inputs and with the 4th feature as target. We do not consider the label of the data set (which is a daily activity performed by the human), but the different features have the true label as a common cause. Notice that we are not interested in the quality of the model, but rather in a model for which the ground truth of the attribution is known (because we can simply look at the linear model obtained).

Afterwards, we calculate the Shapley values with SHAP and SHAPR (with σ set to 0.1 in SHAPR, which is the default value) using the first 1000 samples and approximate the expected value E[X_j] using the whole data set. The observation x is also randomly picked from the data, and we run this experiment 1000 times. Figure 5 shows the histogram of the error φ_j − contr_j(x) for the marginal expectation (blue) and the conditional expectation (red).

Figure 5: Histogram showing the error of the Shapley values for the data set Human Activity Recognition Using Smartphones Data Set. Blue: error using the marginal expectation; red: error using the conditional expectation.
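For concreteness, here is a sketch of the kernel-weighted conditional estimator described in the three steps above, in our own simplified reading with hypothetical names (it is not the SHAPR code, and the weighting details are an assumption based on the description above).

import numpy as np

def kernel_conditional_f_T(f, x, X_samples, T, sigma=0.1, K=None):
    """Estimate E[f(x_T, X_Tbar) | X_T = x_T] by Mahalanobis-kernel weighting in the
    spirit of Aas et al. (2019): distances on the retained features T, Gaussian kernel
    weights, and a weighted average of f over the sampled dropped features.
    f maps a batch (m, n) to (m,); x and X_samples are numpy arrays."""
    T = list(T)
    Tbar = [i for i in range(X_samples.shape[1]) if i not in T]
    # Step 1: scaled Mahalanobis distance between x_T and each sample x^i_T.
    Sigma_T = np.atleast_2d(np.cov(X_samples[:, T], rowvar=False))
    diff = X_samples[:, T] - x[T]
    dist2 = np.einsum('ij,jk,ik->i', diff, np.linalg.pinv(Sigma_T), diff) / len(T)
    # Step 2: Gaussian kernel weights with bandwidth sigma.
    w = np.exp(-dist2 / (2.0 * sigma ** 2))
    # Step 3: optionally keep only the K samples with the largest weights.
    if K is not None:
        keep = np.argsort(w)[-K:]
        X_samples, w = X_samples[keep], w[keep]
    # Weighted average of f with x_T fixed and the dropped features from the samples.
    batch = np.tile(x, (X_samples.shape[0], 1))
    batch[:, Tbar] = X_samples[:, Tbar]
    return float(np.average(f(batch), weights=w))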
Conclusion

In this work we considered the problem of attributing the output for one particular multivariate input to the individual features. We argued that there is a misconception also in recent proposals for feature attribution, because they use observational conditional distributions rather than interventional distributions. Our arguments are phrased in terms of the causal language introduced by Pearl (2000). We argue that parts of the package SHAP from Lundberg and Lee (2017) are unaffected by this misconception (although the corresponding theory part of the paper suffers from this issue), since they 'approximate' the observational expectations by an expression that would have been the right one in the first place. We think that this clarification is important since other authors tried to 'improve' the SHAP package in a way that we consider conceptually flawed. Moreover, we revisited some properties that were stated as desirable in the context of attribution analysis. If stated in a too vague manner, they leave some room for interpretation. We argued, for instance, why we think that our attribution method satisfies a reasonable symmetry property, since attribution via interventional probabilities has been criticised for violating allegedly desirable symmetry properties.
Acknowledgements:
The authors would like to thank Scott Lundberg and Anders Løland for their valuable feedback and Atalanti Mastakouri for remarks on the presentation.
References
K. Aas, M. Jullum, and A. Løland. Explaining individual predictions when features are dependent: More accurate approximations to Shapley values. arXiv:1903.10464, 2019.

D. Anguita, A. Ghio, L. Oneto, X. Parra, and J. L. Reyes-Ortiz. A public domain dataset for human activity recognition using smartphones. In European Symposium on Artificial Neural Networks (ESANN), pages 24–26, April 2013.

S. Barocas, M. Hardt, and A. Narayanan. Fairness and Machine Learning. fairmlbook.org, 2018.

A. Binder, G. Montavon, S. Lapuschkin, K.-R. Müller, and W. Samek. Layer-wise relevance propagation for neural networks with local renormalization layers. In Artificial Neural Networks and Machine Learning – ICANN, volume 9887, 2016.

T. B. Brown, D. Mané, A. Roy, M. Abadi, and J. Gilmer. Adversarial patch. arXiv:1712.09665, 2018.

A. Charnes, B. Golany, M. Keane, and J. Rousseau. Extremal principle solutions of games in characteristic function form: Core, Chebychev and Shapley value generalizations. Econometrics of Planning and Efficiency, 11:123–133, 1988.

A. Chattopadhyay, P. Manupriya, A. Sarkar, and V. Balasubramanian. Neural network attributions: A causal perspective. In K. Chaudhuri and R. Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 981–990, Long Beach, California, USA, 2019. PMLR.

A. Datta, S. Sen, and Y. Zick. Algorithmic transparency via quantitative input influence: Theory and experiments with learning systems. In 2016 IEEE Symposium on Security and Privacy (SP), pages 598–617, 2016.

C. Dwork, M. Hardt, T. Pitassi, O. Reingold, and R. Zemel. Fairness through awareness. In Proceedings of the 3rd Innovations in Theoretical Computer Science Conference, ITCS '12, pages 214–226, New York, NY, USA, 2012. ACM. doi: 10.1145/2090236.2090255.

K. Eykholt, I. Evtimov, E. Fernandes, B. Li, A. Rahmati, C. Xiao, A. Prakash, T. Kohno, and D. Song. Robust physical-world attacks on deep learning visual classification. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1625–1634, 2018.

E. J. Friedman. Paths and consistency in additive cost sharing. International Journal of Game Theory, 32(4):501–518, 2004.

J. H. Friedman. Greedy function approximation: A gradient boosting machine. Annals of Statistics, pages 1189–1232, 2001.

I. J. Goodfellow, J. Shlens, and C. Szegedy. Explaining and harnessing adversarial examples. In International Conference on Learning Representations (ICLR), 2015. URL http://arxiv.org/abs/1412.6572.

N. Kilbertus, M. Rojas-Carulla, G. Parascandolo, M. Hardt, D. Janzing, and B. Schölkopf. Avoiding discrimination through causal reasoning. In Proceedings from the conference "Neural Information Processing Systems 2017", pages 656–666. Curran Associates, Inc., December 2017.

A. Kurakin, I. J. Goodfellow, and S. Bengio. Adversarial examples in the physical world. In Artificial Intelligence Safety and Security, pages 99–112. Chapman and Hall/CRC, 2018.

S. Lundberg and S. Lee. A unified approach to interpreting model predictions. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 4765–4774. Curran Associates, Inc., 2017.

S. Lundberg, G. Erion, and S. Lee. Consistent individualized feature attribution for tree ensembles. arXiv:1802.03888, 2018.

P. C. Mahalanobis. On the generalised distance in statistics. In Proceedings of the National Institute of Sciences of India, April 1936.

C. Molnar. Interpretable Machine Learning. 2019. URL https://christophm.github.io/interpretable-ml-book/.

J. Pearl. Causality. Cambridge University Press, 2000.

M. Ribeiro, S. Singh, and C. Guestrin. "Why should I trust you?": Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '16, pages 1135–1144, New York, NY, USA, 2016.

L. Shapley. A value for n-person games. Contributions to the Theory of Games (AM-28), 2, 1953.

M. Sharif, S. Bhagavatula, L. Bauer, and M. K. Reiter. Accessorize to a crime: Real and stealthy attacks on state-of-the-art face recognition. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, pages 1528–1540, 2016.

A. Shrikumar, P. Greenside, and A. Kundaje. Not just a black box: Learning important features through propagating activation differences. In ICML (arXiv:1605.01713), 2016.

M. Sundararajan and A. Najmi. The many Shapley values for model explanation. arXiv:1908.08474, 2019.

M. Sundararajan, A. Taly, and Q. Yan. Axiomatic attribution for deep networks. In Proceedings of the 34th International Conference on Machine Learning, volume 70, pages 3319–3328, August 2017.

Q. Zhao and T. Hastie. Causal interpretations of black-box models. 2019.