Feature relevance quantification in explainable AI: A causal problem
Dominik Janzing, Lenon Minorics, and Patrick Blöbaum
Amazon Research Tübingen, Germany  {janzind, minorics, bloebp}@amazon.com
25 November 2019
Abstract
We discuss promising recent contributions on quantifying feature relevance using Shapley values, where we observed some confusion about which probability distribution is the right one for dropped features. We argue that the confusion stems from not carefully distinguishing between observational and interventional conditional probabilities, and we attempt a clarification based on Pearl's seminal work on causality. We conclude that unconditional rather than conditional expectations provide the right notion of dropping features, in contradiction to the theoretical justification of the software package SHAP. Parts of SHAP are unaffected because unconditional expectations (which we argue to be conceptually right) are used as an approximation for the conditional ones, which encouraged others to 'improve' SHAP in a way that we believe to be flawed.
Despite several impressive success stories of deep learning, not only researchers in the field have been shocked more recently by the lack of robustness of algorithms that were actually believed to be powerful. Image classifiers, for instance, fail spectacularly once the images are subjected to adversarial changes that appear minor to humans, see e.g. Goodfellow et al. (2015); Sharif et al. (2016); Kurakin et al. (2018); Eykholt et al. (2018); Brown et al. (2018). Understanding these failures is challenging since it is hard to analyze which features were decisive for the classification in a particular case. However, lack of robustness is only one of several different motivations for making artificial intelligence interpretable. The demand for fair decisions, e.g., Dwork et al. (2012); Kilbertus et al. (2017); Barocas et al. (2018), also requires an understanding of algorithms. In this case, it may even be the subject of legal and ethical discussions why an algorithm came to a certain conclusion.

To formalize the problem, we describe the input/output behaviour as a function f : X_1 × ... × X_n → R, where X_1, ..., X_n denote the ranges of some input variables (X_1, ..., X_n) =: X (discrete or continuous), while we assume the target variable Y to be real valued for reasons that will become clear later. Given one particular input x := (x_1, ..., x_n), we want to quantify to what extent each x_j is 'responsible' for the output f(x_1, ..., x_n). This question only makes sense, of course, after specifying what the input should be compared to instead. Let us first consider the case where x is compared to some 'baseline' element x', which has been studied in the literature mostly for the case of real-valued inputs and differentiable f. Based on a hypothetical scenario where only some of the baseline values x'_j are replaced with x_j while others are kept, one wants to quantify to what extent each component j contributes to the difference f(x) − f(x'). The focus of the present paper, however, is a scenario where the baseline is defined by the expectation E[f(X)] over some distribution P_X. To explain the relevance of each j for the difference f(x) − E[f(X)], one considers a scenario where only some values are kept and the remaining ones are averaged over some probability distribution. The main contribution of this paper is to discuss which distribution is the right one. Recalling the difference between interventional and observational conditional distributions in the field of causality, we explain why we disagree with the interesting proposal of Lundberg and Lee (2017) in this regard. Further, we argue that our criticism is irrelevant for any software that 'approximates' the conditional expectation (which we consider conceptually wrong) by the unconditional expectation, as proposed by Lundberg and Lee (2017).

The paper is structured as follows. Section 2 summarizes results from the literature regarding axioms for feature attribution for the case where there is a unique baseline reference input. Here, integrated gradients and Shapley values (as the generalization to discrete input) are the unique attribution functions for the stated set of axioms. Section 3 discusses the attribution problem for the case where one averages over unused features as in Lundberg and Lee (2017), and then we present our criticism.
We think that the big overlap of the present paper with existing literature is justified by aiming at this clarification only, while keeping the clarification as self-contained as possible. In particular, the very general discussion of Datta et al. (2016) contains all the ideas of this work at least implicitly, but since it appeared before Lundberg and Lee (2017), it could not explicitly discuss the conceptual problems raised by the latter. Our view on marginalization over unused features is supported by Datta et al. (2016) for similar reasons. In Section 4 we present different experiments which illustrate our arguments.

The growth of deep neural networks recently motivated many researchers to investigate feature attribution, see e.g. Shrikumar et al. (2016) for DeepLIFT, Binder et al. (2016) for Layer-wise Relevance Propagation (LRP), Ribeiro et al. (2016) for Local Interpretable Model-agnostic Explanations (LIME), and Chattopadhyay et al. (2019) for gradient-based methods. For a summary of common architecture-agnostic methods, see Molnar (2019). We first discuss two closely related concepts that arise from an axiomatic approach.
Sundararajan et al. (2017) investigated the attribution of x_i to the difference

f(x) − f(x'),    (1)

where x' is a given baseline. Under the assumption that f is differentiable almost everywhere, they defined the attribution of x_i to (1) as

IntegratedGrads_i(x; f) := (x_i − x'_i) ∫_{α=0}^{1} ∂f(x' + α(x − x')) / ∂x_i dα.

Contrary to LIME, DeepLIFT and LRP, this attribution method has the advantage that all of the following properties are satisfied (see Sundararajan et al. (2017) and Aas et al. (2019)):
1. Completeness: If atr_i(x; f) denotes the attribution of x_i to (1), then Σ_i atr_i(x; f) = f(x) − f(x').
2. Sensitivity: If f does not depend on x_i, then atr_i(x; f) = 0.
3. Implementation Invariance: If f and f' are equal for all inputs, then atr_i(x; f) = atr_i(x; f') for all i. (Note that this axiom is pointless if it refers to properties of functions rather than properties of algorithms. We have listed it for completeness and for consistency with the literature.)
4. Linearity: For a, b ∈ R it holds that atr_i(x; a f_1 + b f_2) = a · atr_i(x; f_1) + b · atr_i(x; f_2).
5. Symmetry-Preserving: If f is symmetric in components i and j, and x_i = x_j and x'_i = x'_j, then atr_i(x; f) = atr_j(x; f); see Sundararajan et al. (2017, Proposition 1).

Integrated gradients can be generalized by integrating the gradient along an arbitrary path γ instead of the straight line. Such an attribution method is called a path method, and the following theorem holds.

Theorem 1 ((Friedman, 2004, Theorem 1) and (Sundararajan et al., 2017, Theorem 1)). If an attribution method satisfies the properties Completeness, Sensitivity, Implementation Invariance and Linearity, then the attribution method is a convex combination of path methods. Furthermore, integrated gradients is the only path method that is symmetry preserving.
Notice that convex combinations of path methods can also be symmetry preserving even if the attribution method is not given by integrated gradients.
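To make the definition above concrete, here is a minimal numerical sketch (our own illustration, not code from any of the cited papers): it approximates the path integral by a midpoint Riemann sum and the partial derivatives by central finite differences; the function names and the toy function f are hypothetical.

import numpy as np

def integrated_gradients(f, x, x_baseline, steps=200, eps=1e-6):
    """Approximate IntegratedGrads_i(x; f) for all i by a midpoint Riemann sum
    along the straight line from the baseline x' to x, with finite-difference
    gradients of a black-box f: R^n -> R."""
    x, x_baseline = np.asarray(x, float), np.asarray(x_baseline, float)
    n = x.size
    grads = np.zeros(n)
    for alpha in (np.arange(steps) + 0.5) / steps:   # midpoints of [0, 1]
        point = x_baseline + alpha * (x - x_baseline)
        for i in range(n):
            shift = np.zeros(n)
            shift[i] = eps
            grads[i] += (f(point + shift) - f(point - shift)) / (2 * eps)
    return (x - x_baseline) * grads / steps

# Completeness check on a toy function: attributions should sum to f(x) - f(x').
f = lambda z: z[0] * z[1] + z[1] ** 2
x, x_prime = np.array([1.0, 2.0]), np.array([0.0, 0.0])
attr = integrated_gradients(f, x, x_prime)
print(attr, attr.sum(), f(x) - f(x_prime))   # sum of attributions ≈ 6.0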
To assess feature relevance relative to the average, Lundberg and Lee (2017) use a concept that relies on first defining an attribution for binary functions, or, equivalently, functions with subsets as input ('set functions'). We first explain this concept and describe in Section 3 how it solves the attribution relative to the expectation. Assume we are given a set with n elements, say U := {1, ..., n}, and a function g : 2^U → R with g(U) ≠ 0 and g(∅) = 0. We then ask to what extent each single j ∈ U contributes to g(U). A priori, the contribution of each j depends on the order in which more elements are included. We can thus define the contribution of j, given T ⊆ U, by

C(j | T) := g(T ∪ {j}) − g(T)

(note that it can be negative and can also exceed g(U)). With

φ_i := Σ_{T ⊆ U\{i}} C(i | T) / ( n · binom(n−1, |T|) ),    (2)

it then holds that

g(U) = Σ_{i=1}^{n} φ_i.

The quantity φ_i is called the Shapley value (Shapley, 1953) of i, which can be considered the average contribution of i to g(U). At first glance, Shapley values only solve the attribution problem for binary inputs, by canonically identifying subsets T with binary words in {0,1}^n. To show that Shapley values also solve the above attribution problem, one can simply define a set function by

g(T) := f_T(x_T) − f(x'),

for any subset T ⊆ {1, ..., n}. Here, f_T is the 'simplified' function with the reduced input x_T obtained from f when all remaining features are taken from the baseline input x', that is, f_∅(x_∅) = f(x').

Since Shapley values also satisfy Completeness, Sensitivity, Implementation Invariance and Linearity (Aas et al., 2019) with respect to the binary function defined by the set function g, they are given by a convex combination of path methods. Furthermore, Shapley values with respect to g are symmetry-preserving, but do not coincide with integrated gradients.

Different ways of feature attribution based on Shapley values were recently investigated by Sundararajan and Najmi (2019). Their main consideration is feature relevance relative to an auxiliary baseline, but feature attribution relative to the expectation (according to an arbitrary distribution) is also mentioned. Furthermore, Sundararajan and Najmi (2019) already discussed that Shapley values based on conditional distributions can assign unimportant features non-zero attribution. However, Sundararajan and Najmi (2019) did not consider the problem from a causal perspective.

Which distribution for dropped features?

We now want to attribute the difference between f(x) and the expectation E[f(X)] to individual features. Explaining why the output for one particular input x deviates strongly from the average output is particularly interesting for understanding 'outliers'. Let us introduce some notation first. For any T ⊆ U, let E[f(x_T, X_T̄) | X_T = x_T] denote the conditional expectation of f, given X_T = x_T. By E[f(x_T, X_T̄)] we denote the expectation of f(x_T, X_T̄) over X_T̄ without conditioning on X_T = x_T. Let us call this expression 'marginal expectation' henceforth. Accordingly, we now discuss two different options for defining 'simplified functions' f_T where all features from T̄ are dropped:

f_T(x) := E[f(x_T, X_T̄) | X_T = x_T]    (3)

or

f_T(x) := E[f(x_T, X_T̄)]?    (4)

Lundberg and Lee (2017) propose (3), but since it is difficult to compute, they approximate it by (4), which they justify by the simplifying assumption of feature independence.
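Before turning to the causal discussion, formula (2) can be evaluated directly for any set function g with g(∅) = 0. The following is our own sketch (exponential in n and meant only for small examples); the function names and the toy set function are hypothetical.

from itertools import combinations
from math import comb

def shapley_values(g, n):
    """Exact Shapley values phi_1, ..., phi_n of a set function g on U = {1, ..., n},
    following formula (2): phi_i = sum over T ⊆ U\{i} of C(i|T) / (n * binom(n-1, |T|)),
    with contribution C(i|T) = g(T ∪ {i}) - g(T)."""
    universe = range(1, n + 1)
    phi = {}
    for i in universe:
        rest = [j for j in universe if j != i]
        phi[i] = sum(
            (g(frozenset(T) | {i}) - g(frozenset(T))) / (n * comb(n - 1, len(T)))
            for size in range(n) for T in combinations(rest, size)
        )
    return phi

# Sanity check: the Shapley values add up to g(U).
g = lambda S: len(S) ** 2          # toy set function with g(∅) = 0
phi = shapley_values(g, 3)
print(phi, sum(phi.values()))      # {1: 3.0, 2: 3.0, 3: 3.0}, 9.0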
Using the set function g(T) := f_T(x) − f_∅(x), they compute Shapley values φ_i according to (2). We will argue that using (4) rather than (3) is conceptually the right thing in the first place. Our clarification is supposed to prevent others from 'improving' SHAP by finding an approximation for the conditional expectation that is better than the marginal expectation, as done, for instance, by Aas et al. (2019) and Lundberg et al. (2018). (Note that TreeExplainer in SHAP has meanwhile been changed accordingly.)

To explain our arguments, let us first explain why marginal expectations occur naturally in the field of causal inference.
Observational versus interventional conditional distributions
The main ideas of this paragraph can already be found in Datta et al. (2016) in a more general and abstract form, see also Friedman (2001) and Zhao and Hastie (2019), but we want to rephrase them in a way that optimally prepares the reader for the discussion below. Assume we are given the causal structure shown in Figure 1. Further, assume we are interested in how the expectation of Y changes when we manually set X_1 to some value x_1. This is not given by E[Y | X_1 = x_1], because observing X_1 = x_1 also changes the distribution of X_2, X_3 due to the dependences between X_1 and X_2, X_3 (which are generated by the common cause Z). This way, the difference between E[Y] and E[Y | X_1 = x_1] is not only due to the influence of X_1, but can also be caused by the influence of X_2, X_3. The impact of setting X_1 to x_1 is instead captured by Pearl's do-operator (Pearl, 2000), which yields

E[Y | do(X_1 = x_1)] = ∫ E[Y | x_1, x_2, x_3] p(x_2, x_3) dx_2 dx_3.    (5)

This can easily be verified using the backdoor criterion (Pearl, 2000), since (phrased in Pearl's language) the variables X_2, X_3 'block the backdoor path' X_1 ← Z → ⋯ → Y. Observations from Z are not needed; we may therefore assume Z to be latent, which we have indicated by white color.

Figure 1: A simple causal structure where the observational conditional p(y | x_1) does not correctly describe how Y changes after intervening on X_1, because the common cause Z 'confounds' the relation between X_1 and Y.

For our purpose, two observations are important. First, (5) does not contain the conditional distribution given X_1 = x_1. Replacing p(x_2, x_3) with p(x_2, x_3 | x_1) in (5) would yield the observational conditional expectation E[Y | X_1 = x_1], which we are not interested in. In other words, the intervention on X_1 breaks the dependences to X_2, X_3. The second observation that is crucial for us is that the dependences between X_2, X_3 are kept; they are unaffected by the intervention on X_1.
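The difference between the two quantities is easy to see in a small simulation of the structure in Figure 1; the linear mechanisms, coefficients, and variable names below are arbitrary choices of ours for illustration only.

import numpy as np

rng = np.random.default_rng(0)
N = 200_000

# Structural equations: Z is a latent common cause of X1, X2, X3; Y depends on the X's.
Z  = rng.normal(size=N)
X1 = Z + rng.normal(size=N)
X2 = Z + rng.normal(size=N)
X3 = Z + rng.normal(size=N)
Y  = X1 + 2 * X2 + 3 * X3 + rng.normal(size=N)

x1 = 1.0

# Observational conditional E[Y | X1 ≈ x1]: select samples where X1 is close to x1,
# which also shifts X2 and X3 via the common cause Z.
observational = Y[np.abs(X1 - x1) < 0.05].mean()

# Interventional E[Y | do(X1 = x1)]: replay the mechanism for Y with X1 forced to x1
# while X2, X3 keep their unconditioned values -- the adjustment formula (5).
interventional = (x1 + 2 * X2 + 3 * X3 + rng.normal(size=N)).mean()

print(observational, interventional)   # ≈ 3.5 vs ≈ 1.0: Z confounds X1 and Y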
Why observational conditionals are flawed

Let us start with a simple example.
Example 1 (irrelevant feature). Assume we have f(x_1, x_2) = x_1. Obviously, the feature X_2 is irrelevant. Let both X_1, X_2 be binary and

p(x_1, x_2) = 1/2 if x_1 = x_2, and 0 otherwise.

(1) With conditional expectations:

f_∅(x) = E[f(X_1, X_2)] = 1/2    (6)
f_{1}(x) = E[f(x_1, X_2) | x_1] = x_1    (7)
f_{2}(x) = E[f(X_1, x_2) | x_2] = x_2    (8)
f_{1,2}(x) = f(x_1, x_2) = x_1    (9)

Therefore,

C(2 | ∅) = f_{2}(x) − f_∅(x) = x_2 − 1/2
C(2 | {1}) = f_{1,2}(x) − f_{1}(x) = x_1 − x_1 = 0.

Hence, the Shapley value for X_2 reads

φ_2 = (1/2)(x_2 − 1/2 + x_1 − x_1) = x_2/2 − 1/4 ≠ 0.

(2) With marginal expectations:

f_∅(x) = E[f(X_1, X_2)] = 1/2    (10)
f_{1}(x) = E[f(x_1, X_2)] = x_1    (11)
f_{2}(x) = E[f(X_1, x_2)] = 1/2    (12)
f_{1,2}(x) = f(x_1, x_2) = x_1.    (13)

We then obtain

C(2 | ∅) = f_{2}(x) − f_∅(x) = 0
C(2 | {1}) = f_{1,2}(x) − f_{1}(x) = 0,

which yields φ_2 = 0.
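The two variants of Example 1 can be reproduced with a few lines of code (our own sketch; the function names are hypothetical).

from itertools import product

# Example 1: X1, X2 binary and perfectly correlated, p(x1, x2) = 1/2 if x1 = x2,
# 0 otherwise; f(x1, x2) = x1, so feature 2 is irrelevant for f.
p = {(0, 0): 0.5, (1, 1): 0.5, (0, 1): 0.0, (1, 0): 0.0}
f = lambda x1, x2: x1
x = (1, 1)                      # the particular input to be explained

def simplified(T, conditional):
    """f_T(x): drop the features outside T, averaging either over the conditional
    distribution given X_T = x_T (eq. (3)) or over the marginal distribution (eq. (4))."""
    total, value = 0.0, 0.0
    for x1, x2 in product([0, 1], repeat=2):
        w = p[(x1, x2)]
        if conditional and ((1 in T and x1 != x[0]) or (2 in T and x2 != x[1])):
            w = 0.0             # discard samples inconsistent with the retained features
        total += w
        value += w * f(x[0] if 1 in T else x1, x[1] if 2 in T else x2)
    return value / total

for conditional in (True, False):
    fT = {T: simplified(set(T), conditional) for T in [(), (1,), (2,), (1, 2)]}
    phi2 = 0.5 * ((fT[(2,)] - fT[()]) + (fT[(1, 2)] - fT[(1,)]))
    print('conditional:' if conditional else 'marginal:  ', 'phi_2 =', phi2)
# conditional: phi_2 = 0.25  (nonzero although f ignores x2)
# marginal:    phi_2 = 0.0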
The example proves the following result, which was already discussed in Sundararajan and Najmi (2019):

Lemma 1 (failure of Sensitivity). When the relevance φ_i is defined via the 'simplified' functions f_T(x) := E[f(x_T, X_T̄) | X_T = x_T] from (3), then φ_i ≠ 0 does not imply that f depends on x_i.

The example is particularly worrisome because we mentioned earlier that Shapley values satisfy the axiom of sensitivity, while Lemma 1 seems to claim the opposite. To resolve this paradox, note that the Shapley values refer to binary functions (or set functions): reading (6) to (9) as the values of a binary function g̃ with inputs (z_1, z_2) = 00, 10, 01, 11, we clearly observe that g̃ also depends on the second bit. This way, the Shapley values do not violate sensitivity for g̃, but we certainly care about 'sensitivity for f'. Note that this distinction between the binary function g̃ and f is crucial, although in our example f is binary itself. Fortunately, the second bit is irrelevant for the binary function g̃ defined by (10) to (13), and we do not obtain the above paradox.

To assess the impact of changing the inputs of f, we now switch to a more causal language and state that we consider the inputs of an algorithm as causes of the output. Although this remark seems trivial, it is necessary to emphasize that we are not talking about the causal relations between any features in the real world outside the computer (where the attribute predicted by Y may be the cause of the features), but only about causality of this technical input/output system. To facilitate this view, we formally distinguish between the true features X̃_1, ..., X̃_n obtained from the objects and the corresponding features X_1, ..., X_n plugged into the algorithm. This way, we are able to talk about a hypothetical scenario where the inputs are changed compared to the true features. Let us first consider the causal structure in Figure 2, top, where the inputs are determined by the true features. In contrast, Figure 2, bottom, shows the causal structure after an intervention on X_1, X_2 has adjusted these variables to fixed values x_1, x_2.

We now consider the impact of a hypothetical intervention which leaves the remaining components unaffected. They are therefore sampled from their natural joint distribution without conditioning. Similar to the above paragraph, we then obtain

E[Y | do(X_T = x_T)] = E[f(x_T, X_T̄)].    (14)

Our formal separation between the true values of the features X̃_j of some object and the corresponding inputs X_j of the algorithm allows us to be agnostic about the causal relations between the true features in the real world (accordingly, Y is the output of the system and not a property of the external world); the fact that the inputs X_1, ..., X_n cause the output Y is the only causal knowledge needed to compute (14). Since the interventional expectations coincide with the marginal expectations, we have thus justified the use of marginal expectations for the Shapley values from the causal perspective.

Figure 2: Top: Causal structure of our prediction scenario: the output Y is determined by the inputs X_1, ..., X_n. In the usual learning scenario these inputs coincide with the features X̃_1, ..., X̃_n of some object, that is, X_j = X̃_j.
Bottom: To evaluate the impact of some inputs, say X_1, X_2, on the output Y, we consider a hypothetical scenario where we adjust these inputs to some fixed values x_1, x_2 and sample the remaining inputs from the usual joint distribution P_{X_3,...,X_n}.

Figure 3: Values and probabilities of two independent random variables X_1, X_2 and of the function f(X_1, X_2) = X_1 + X_2:

Probability        X_1   X_2   f = X_1 + X_2
(1 − p)(1 − q)      1     1     2
(1 − p) · q         1     2     3
p · (1 − q)         2     1     3
p · q               2     2     4

The problem with the symmetry axiom
We briefly rephrase Example 4.9 of Sundararajan and Najmi (2019), showing that the symmetry axiom is violated when Shapley values are used for quantifying the influence relative to conditional or marginal expectations. Figure 3 shows the values and probabilities of two random variables X_1 and X_2 and the values of the function f(X_1, X_2) = X_1 + X_2. As explained by Sundararajan and Najmi (2019), for the input (x_1, x_2) = (2, 2) the value x_1 gets attribution (1 − p) and x_2 gets attribution (1 − q). Therefore, if p ≠ q, x_1 and x_2 get different attributions, although f is symmetric. They conclude that this is a violation of symmetry. Since X_1 and X_2 are independent, this problem occurs regardless of whether one defines the simplified function f_T with respect to marginal or conditional expectations. One can argue, however, that this result makes intuitive sense, because the value x_j that is farther from its mean contributes more to the fact that f(x_1, x_2) deviates from its mean. If we even have x_1 = E[X_1], we would certainly say that x_1 does not contribute to the deviation from the mean at all. For this reason we do not follow Sundararajan and Najmi (2019) in regarding this phenomenon as a problem of this kind of attribution analysis. Recall furthermore that we have already mentioned that the symmetry axiom does hold for the corresponding binary function defined by including or not including certain features (simply because symmetry holds for Shapley values). For the above example this binary function is indeed asymmetric. To check this, define

g̃(z_1, z_2) := E[f(x_T, X_T̄)],

where T is the set of all j for which z_j = 1. This function is not symmetric in z_1 and z_2, since we have, for instance,

g̃(1, 0) = x_1 + E[X_2] ≠ g̃(0, 1) = x_2 + E[X_1].

Numerical Evidence
In this section, we show numerically that the marginal expectation E[f(x_T, X_T̄)] is a better choice than E[f(x_T, X_T̄) | X_T = x_T] to quantify the attribution of each observation x_j of a particular input x = (x_1, ..., x_n) to f(x) − E[f(X)].

As explained by Aas et al. (2019, Section 2.3), the implementation of KernelSHAP (Lundberg and Lee, 2017) consists of two parts:

1. Using a representation of Shapley values as the solution of a weighted least squares problem for a computationally tractable approximation.

2. Approximation of g(T).

By Charnes et al. (1988), the Shapley values for the set function g are given as the solution (φ_1, ..., φ_n) of

min_{φ_1,...,φ_n} Σ_{T ⊆ U} [ g(T) − Σ_{j ∈ T} φ_j ]^2 k(U, T),    (15)

where

k(U, T) = (|U| − 1) / ( binom(|U|, |T|) · |T| · (|U| − |T|) )

are the Shapley kernel weights. Since k(U, U) = ∞, we use the constraint Σ_j φ_j = g(U), or, for numerical calculation, we set k(U, U) to a large number. Since the power set of U consists of 2^n elements, the computation time of the Shapley values increases exponentially. KernelSHAP therefore samples subsets of U according to the probability distribution induced by the Shapley kernel weights.

As discussed in the previous sections, Lundberg and Lee (2017) define f_T(x) = E[f(x_T, X_T̄) | X_T = x_T]. To evaluate the conditional expectation, they assume feature independence (or weak dependence) to obtain E[f(x_T, X_T̄) | X_T = x_T] ≈ E[f(x_T, X_T̄)] and use the approximation

f_{T, KernelSHAP}(x) ≈ (1/K) Σ_{k=1}^{K} f(x_T, x^k_T̄),    (16)

where x^k_T̄, k = 1, ..., K, are samples from X_T̄.

To show in an experimental setup that the marginal expectation is a better choice, we consider functions f for which we can calculate the attribution of x_j analytically. This is possible for linear functions

f(x) = α_0 + Σ_i α_i x_i,  α_i ∈ R,

since f(x) − E[f(X)] = Σ_i α_i (x_i − E[X_i]) and hence the attribution of x_j is α_j (x_j − E[X_j]). Our experiments are divided into the following setups:

1. We assume that the feature vector X follows a multivariate Gaussian distribution.

2. We use a kernel estimation to approximate the conditional expectation.

For the experiments, we use the KernelExplainer class of the Python SHAP package from Lundberg and Lee (2017) to calculate Shapley values with respect to the marginal expectation and the R package SHAPR, in which the methodology of Aas et al. (2019) is implemented, to calculate Shapley values with respect to the conditional distribution. Notice that calculating Shapley values is also possible for non-linear functions. Further, approximating the marginal expectation is computationally inexpensive compared to the approximation of the conditional expectation with kernel estimation.

Multivariate Gaussian distribution

If X ∼ N(µ, Σ) with some mean vector µ and covariance matrix Σ, it holds that

P(X_T̄ | X_T = x_T) = N(µ_{T̄|T}, Σ_{T̄|T})

(see Aas et al. (2019, Section 3.1)), where

µ_{T̄|T} = µ_T̄ + Σ_{T̄T} Σ_{TT}^{-1} (x_T − µ_T)
Σ_{T̄|T} = Σ_{T̄T̄} − Σ_{T̄T} Σ_{TT}^{-1} Σ_{TT̄},

with µ partitioned into (µ_T, µ_T̄) and Σ partitioned into the blocks Σ_{TT}, Σ_{TT̄}, Σ_{T̄T}, Σ_{T̄T̄}. Hence, we can approximate the conditional expectation by sampling X_T̄ directly from this distribution. We simulate Gaussian data and run the experiment for different numbers of features.
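The following is a minimal sketch of how the two simplified functions can be estimated in this Gaussian setting (our own code with hypothetical names; it is not the SHAP or SHAPR implementation). It assumes f accepts a batch of inputs and, in the conditional case, samples X_T̄ from N(µ_{T̄|T}, Σ_{T̄|T}) as above.

import numpy as np

def conditional_gaussian(mu, Sigma, T, x_T):
    """Mean and covariance of X_Tbar given X_T = x_T for X ~ N(mu, Sigma),
    using the partitioned-Gaussian formulas above. T lists the conditioned indices."""
    Tbar = [i for i in range(len(mu)) if i not in T]
    S_TT_inv = np.linalg.pinv(Sigma[np.ix_(T, T)])
    S_bT = Sigma[np.ix_(Tbar, T)]
    mu_c = mu[Tbar] + S_bT @ S_TT_inv @ (x_T - mu[T])
    Sigma_c = Sigma[np.ix_(Tbar, Tbar)] - S_bT @ S_TT_inv @ Sigma[np.ix_(T, Tbar)]
    return Tbar, mu_c, Sigma_c

def estimate_f_T(f, x, mu, Sigma, T, conditional, n_samples=10_000, seed=0):
    """Monte Carlo estimate of f_T(x): marginal expectation (4) or conditional
    expectation (3) when X is multivariate Gaussian. f maps a batch (m, n) to (m,);
    x, mu, Sigma are numpy arrays."""
    rng = np.random.default_rng(seed)
    T = list(T)
    Tbar = [i for i in range(len(mu)) if i not in T]
    if not Tbar:                               # nothing is dropped
        return float(f(x[None, :])[0])
    if conditional and T:
        Tbar, m, S = conditional_gaussian(mu, Sigma, T, x[T])
        dropped = rng.multivariate_normal(m, S, size=n_samples)
    else:
        dropped = rng.multivariate_normal(mu, Sigma, size=n_samples)[:, Tbar]
    batch = np.tile(x, (n_samples, 1))
    batch[:, Tbar] = dropped
    return float(f(batch).mean())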
For every experiment with a multivariate Gaussian distribution, we set the intercept to 0, i.e. α_0 = 0.

Dimension n = 3.
In the first 3-dimensional experiment, we let α_1 = 0 and choose in every run α_2 and α_3 independently from the standard normal distribution. Further, we let µ = (0, 0, 0)^T and Σ = c c^T, where we choose the entries of c in every run independently from the standard normal distribution and also choose x randomly in every run. The number of runs and the sample size of X is 1000. Figure 4 shows the errors φ_j − contr_j(x) of the Shapley values φ_j with respect to the set function g(T) = E[f(x_T, X_T̄)] − E[f(X)] (blue) and the set function g(T) = E[f(x_T, X_T̄) | X_T = x_T] − E[f(X)] (red). The very precise results for the marginal expectation come mainly from feature 1.

Dimension n = 10.
In the 10-dimensional experiment, we take almost the same setting, with the difference that we set the first 3 coefficients to zero, i.e. α_1 = α_2 = α_3 = 0. Again, the very precise results for the marginal expectation come from the features whose coefficients we set to 0.

Figure 4: Histogram showing the error of the Shapley values for the multivariate Gaussian distribution in the 3-dimensional (left) and 10-dimensional (right) setting with α_0 = 0. Blue: error using the marginal expectation; red: error using the conditional expectation.

Kernel estimation

If we have no information about the underlying distribution, it is hard to approximate the conditional distribution sufficiently well. However, in low dimensions kernel estimates can provide a good approximation. We take the kernel estimation method from Aas et al. (2019) to show how strongly the Shapley values w.r.t. the conditional expectation deviate from α_j (x_j − E[X_j]). Their approximation is as follows:

1. Let Σ_T be the covariance matrix of our sample from X_T. For each point x^i of the sample, calculate the Mahalanobis distance (see Mahalanobis (1936))

dist_T(x, x^i) := sqrt( (x_T − x^i_T)' Σ_T^{-1} (x_T − x^i_T) / |T| ),

where (x_T − x^i_T)' denotes the transpose of (x_T − x^i_T).

2. Calculate the kernel weights

w_T(x, x^i) := exp( − dist_T(x, x^i)^2 / (2σ^2) ).

Here, σ > 0 is a bandwidth which has to be specified.

3. Sort the weights w_T(x, x^i) in increasing order and let x̃^i be the corresponding ordered sampling instances. Then, approximate g(T) by

g_cond(T) := ( Σ_{i=1}^{K} w_T(x, x̃^i) f(x̃^i_T̄, x_T) ) / ( Σ_{i=1}^{K} w_T(x, x̃^i) ).

(A code sketch of this estimator is given after the experiment description below.) For the experiment, we use the real data set
Human Activity Recognition Using Smartphones Data Set (see Anguita et al. (2013)) from the UCI repository. The data set consists of 561 features with a training sample size of 7352 and a test sample size of 2947. In this experiment, we merge these two samples together and therefore our sample size is 10299. We randomly take 4 features and train a linear model with 3 of these features as inputs and with the 4th feature as target. We do not consider the label of the data set (which is a daily activity performed by the human), but the different features have the true label as a common cause. Notice that we are not interested in the quality of the model, but rather in a model for which the ground truth of the attribution is known (because we can simply look at the linear model obtained).

Afterwards, we calculate the Shapley values with SHAP and SHAPR (with σ set to 0.1 in SHAPR, which is the default value) using the first 1000 samples and approximate the expected value E[X_j] using the whole data set. The observation x is also randomly picked from the data, and we run this experiment 1000 times. Figure 5 shows the histogram of the error φ_j − contr_j(x) for the marginal expectation (blue) and the conditional expectation (red).

Figure 5: Histogram showing the error of the Shapley values for the data set Human Activity Recognition Using Smartphones Data Set. Blue: error using the marginal expectation; red: error using the conditional expectation.
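For concreteness, here is a sketch of the kernel-weighted conditional estimator described in the three steps above, in our own simplified reading with hypothetical names (it is not the SHAPR code, and the weighting details are an assumption based on the description above).

import numpy as np

def kernel_conditional_f_T(f, x, X_samples, T, sigma=0.1, K=None):
    """Estimate E[f(x_T, X_Tbar) | X_T = x_T] by Mahalanobis-kernel weighting in the
    spirit of Aas et al. (2019): distances on the retained features T, Gaussian kernel
    weights, and a weighted average of f over the sampled dropped features.
    f maps a batch (m, n) to (m,); x and X_samples are numpy arrays."""
    T = list(T)
    Tbar = [i for i in range(X_samples.shape[1]) if i not in T]
    # Step 1: scaled Mahalanobis distance between x_T and each sample x^i_T.
    Sigma_T = np.atleast_2d(np.cov(X_samples[:, T], rowvar=False))
    diff = X_samples[:, T] - x[T]
    dist2 = np.einsum('ij,jk,ik->i', diff, np.linalg.pinv(Sigma_T), diff) / len(T)
    # Step 2: Gaussian kernel weights with bandwidth sigma.
    w = np.exp(-dist2 / (2.0 * sigma ** 2))
    # Step 3: optionally keep only the K samples with the largest weights.
    if K is not None:
        keep = np.argsort(w)[-K:]
        X_samples, w = X_samples[keep], w[keep]
    # Weighted average of f with x_T fixed and the dropped features from the samples.
    batch = np.tile(x, (X_samples.shape[0], 1))
    batch[:, Tbar] = X_samples[:, Tbar]
    return float(np.average(f(batch), weights=w))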
Conclusion

In this work we considered the problem of attributing the output for one particular multivariate input to the individual features. We argued that there is a misconception also in recent proposals for feature attribution, because they use observational conditional distributions rather than interventional distributions. Our arguments are phrased in terms of the causal language introduced by Pearl (2000). We argue that parts of the package SHAP from Lundberg and Lee (2017) are unaffected by this misconception (although the corresponding theory part of the paper suffers from this issue), since they 'approximate' the observational expectations by an expression that would have been the right one in the first place. We think that this clarification is important since other authors tried to 'improve' the SHAP package in a way that we consider conceptually flawed. Moreover, we revisited some properties that were stated as desirable in the context of attribution analysis. If stated in a too vague manner, they leave some room for interpretation. We argued, for instance, why we think that our attribution method satisfies a reasonable symmetry property, since attribution via interventional probabilities has been criticised for violating allegedly desirable symmetry properties.
Acknowledgements:
The authors would like to thank Scott Lundberg and Anders Løland for their valuable feedback and Atalanti Mastakouri for remarks on the presentation.
References
K. Aas, M. Jullum, and A. Løland. Explaining individual predictions when features are dependent: More accurate approximations to Shapley values. arXiv:1903.10464, 2019.

D. Anguita, A. Ghio, L. Oneto, X. Parra, and J. L. Reyes-Ortiz. A public domain dataset for human activity recognition using smartphones. In European Symposium on Artificial Neural Networks (ESANN), pages 24–26, April 2013.

S. Barocas, M. Hardt, and A. Narayanan. Fairness and Machine Learning. fairmlbook.org, 2018.

A. Binder, G. Montavon, S. Lapuschkin, K.-R. Müller, and W. Samek. Layer-wise relevance propagation for neural networks with local renormalization layers. In Artificial Neural Networks and Machine Learning – ICANN, volume 9887, 2016.

T. B. Brown, D. Mané, A. Roy, M. Abadi, and J. Gilmer. Adversarial patch. arXiv:1712.09665, 2018.

A. Charnes, B. Golany, M. Keane, and J. Rousseau. Extremal principle solutions of games in characteristic function form: Core, Chebychev and Shapley value generalizations. Econometrics of Planning and Efficiency, 11:123–133, 1988.

A. Chattopadhyay, P. Manupriya, A. Sarkar, and V. Balasubramanian. Neural network attributions: A causal perspective. In K. Chaudhuri and R. Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 981–990, Long Beach, California, USA, 2019. PMLR.

A. Datta, S. Sen, and Y. Zick. Algorithmic transparency via quantitative input influence: Theory and experiments with learning systems. In 2016 IEEE Symposium on Security and Privacy (SP), pages 598–617, 2016.

C. Dwork, M. Hardt, T. Pitassi, O. Reingold, and R. Zemel. Fairness through awareness. In Proceedings of the 3rd Innovations in Theoretical Computer Science Conference, ITCS '12, pages 214–226, New York, NY, USA, 2012. ACM. doi: 10.1145/2090236.2090255.

K. Eykholt, I. Evtimov, E. Fernandes, B. Li, A. Rahmati, C. Xiao, A. Prakash, T. Kohno, and D. Song. Robust physical-world attacks on deep learning visual classification. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1625–1634, 2018.

E. J. Friedman. Paths and consistency in additive cost sharing. International Journal of Game Theory, 32(4):501–518, 2004.

J. H. Friedman. Greedy function approximation: A gradient boosting machine. Annals of Statistics, pages 1189–1232, 2001.

I. J. Goodfellow, J. Shlens, and C. Szegedy. Explaining and harnessing adversarial examples. In International Conference on Learning Representations (ICLR), 2015. URL http://arxiv.org/abs/1412.6572.

N. Kilbertus, M. Rojas-Carulla, G. Parascandolo, M. Hardt, D. Janzing, and B. Schölkopf. Avoiding discrimination through causal reasoning. In Proceedings from the conference "Neural Information Processing Systems 2017", pages 656–666. Curran Associates, Inc., December 2017.

A. Kurakin, I. J. Goodfellow, and S. Bengio. Adversarial examples in the physical world. In Artificial Intelligence Safety and Security, pages 99–112. Chapman and Hall/CRC, 2018.

S. Lundberg and S. Lee. A unified approach to interpreting model predictions. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 4765–4774. Curran Associates, Inc., 2017.

S. Lundberg, G. Erion, and S. Lee. Consistent individualized feature attribution for tree ensembles. arXiv:1802.03888, 2018.

P. C. Mahalanobis. On the generalised distance in statistics. In Proceedings of the National Institute of Sciences of India, April 1936.

C. Molnar. Interpretable Machine Learning. 2019. URL https://christophm.github.io/interpretable-ml-book/.

J. Pearl. Causality. Cambridge University Press, 2000.

M. Ribeiro, S. Singh, and C. Guestrin. "Why should I trust you?": Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '16, pages 1135–1144, New York, NY, USA, 2016.

L. Shapley. A value for n-person games. Contributions to the Theory of Games (AM-28), 2, 1953.

M. Sharif, S. Bhagavatula, L. Bauer, and M. K. Reiter. Accessorize to a crime: Real and stealthy attacks on state-of-the-art face recognition. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, pages 1528–1540, 2016.

A. Shrikumar, P. Greenside, and A. Kundaje. Not just a black box: Learning important features through propagating activation differences. In ICML (arXiv:1605.01713), 2016.

M. Sundararajan and A. Najmi. The many Shapley values for model explanation. arXiv:1908.08474, 2019.

M. Sundararajan, A. Taly, and Q. Yan. Axiomatic attribution for deep networks. In Proceedings of the 34th International Conference on Machine Learning, volume 70, pages 3319–3328, August 2017.

Q. Zhao and T. Hastie. Causal interpretations of black-box models. 2019.