Why did the distribution change?
Kailash Budhathoki, Dominik Janzing, Patrick Blöbaum, and Hoiyi Ng
Amazon
{kaibud, janzind, bloebp, nghoiyi}@amazon.com
March 1, 2021
Abstract
We describe a formal approach based on graphical causal models to identify the "root causes" of the change in the probability distribution of variables. After factorizing the joint distribution into conditional distributions of each variable, given its parents (the "causal mechanisms"), we attribute the change to changes of these causal mechanisms. This attribution analysis accounts for the fact that mechanisms often change independently and sometimes only some of them change. Through simulations, we study the performance of our distribution change attribution method. We then present a real-world case study identifying the drivers of the difference in the income distribution between men and women.
Changes to a probability distribution are common in many real-world domains that are part of a changing environment. For example, during COVID-19, most retailers most likely observed a shift in the distribution of the inventory level of their products, as certain products had unusually high demand (e.g., masks and hand sanitizers) while the supply of some other products was limited due to suppliers suspending manufacturing. To be able to respond effectively (either proactively or retrospectively) to similar situations, it is not only important to identify if and where the distribution changed, but also to know why the distribution changed. In recent years, several techniques have been developed to either automatically detect changes in the underlying distribution from a sequence of observations (Pollak, 1985; Kifer et al., 2004), or determine whether two data samples come from the same distribution (Chakravarti et al., 1967; Scholz and Stephens, 1987; Snedecor and Cochran, 1989; Gretton et al., 2012). However, a formal way to identify the "root causes" of a distribution change seems to be missing.

In this work, we consider a system of n variables X_1, ..., X_n. In a supply chain, for instance, these variables can represent different business metrics, such as demand forecast, labour cost, shipment cost, and inventory level, to name a few. Typical causal questions on these variables are interventional in nature, e.g. "What would be the impact on X_k if we were to intervene on X_j?" Here we are interested in a slightly different question, namely "Which mechanisms are responsible for the change in the joint distribution P_{X_1,...,X_n}, or the marginal distribution P_{X_k} of one of the variables X_k?" For example, in a supply chain, we might be interested in understanding the drivers of week-over-week changes in the distribution of inventory level (or a summary statistic, such as its mean across different products).
To this end, we build upon graphical causal models (Pearl, 2009). Given a causal graph G of variables X_1, ..., X_n, assuming the causal Markov condition (Spirtes et al., 2000), we can factorise the joint distribution into causal conditionals, i.e.

P_{X_1,...,X_n} = \prod_{j=1}^{n} P_{X_j | PA_j},

where P_{X_j | PA_j} denotes the causal mechanism of variable X_j given its direct parents PA_j in the causal graph. Each parent-child relationship captured by P_{X_j | PA_j} represents an autonomous physical mechanism; we can change one such relationship without affecting the others. Thus it is plausible to attribute any change in the joint distribution, or in the marginal distribution of some target variable, to the change in some of the causal mechanisms. Based on this insight, we develop a formal approach to define the quantitative contribution of each mechanism to the overall change. In particular, we use the Shapley value concept (Shapley, 1953) from cooperative game theory to cope with the fact that the impact of changing a mechanism depends on which other mechanisms have been changed already.

The paper is structured as follows. In Section 2, we formalise changes to causal mechanisms. Section 3 presents a proposal to attribute the change in the joint distribution. In Section 4, we describe a formal method to attribute the change in the marginal distribution to causal mechanisms. Section 5 discusses the practical implications of applying these attribution proposals. In Section 7, we report results from simulations and present a case study in identifying the drivers of the difference in the income distribution between men and women. Finally, we conclude in Section 8.
We consider probabilistic causal models that incorporate probability to infer causal relationships between variables. Suppose that we have a collection of n random variables (X_1, ..., X_n) =: X. The underlying causal graph G of these variables is a directed acyclic graph in which a directed edge from X_i to X_j indicates that X_i causes X_j directly. A joint distribution P_X is said to be compatible with the causal graph G if P_X can be generated following the edges in G. More formally, P_X is compatible with G if

P_X = \prod_{j=1}^{n} P_{X_j | PA_j}.   (1)

Definition 1 (Probabilistic Causal Model)
A probabilistic causal model is a pair C := ⟨G, P_X⟩ that consists of a causal graph G, and a joint distribution P_X over the variables in G that is compatible with G. (This idea of autonomy of mechanisms has a long history; see Pearl (2009, Section 1.3) and Peters et al. (2017, Section 2.2) for historical notes.)
Given a probabilistic causal model, the causal Markov assumption (Spirtes et al., 2000) allows us to factorise the joint distribution P_X into causal mechanisms P_{X_j | PA_j} at each node X_j. Each causal mechanism P_{X_j | PA_j} remains invariant to interventions (external influences) on other variables. With this, we can formally define what causal mechanism changes entail.

Definition 2 (Mechanism Changes)
Mechanism changes to a causal model C := ⟨G, P_X⟩ on a subset of variables X_T, indexed by a change set T ⊆ {1, ..., n}, transform C into C^T := ⟨G, P^T_X⟩, where

P^T_X = \prod_{j \in T} \tilde{P}_{X_j | PA_j} \prod_{j \notin T} P_{X_j | PA_j}

is a new joint distribution obtained by replacing the "old" causal mechanism P_{X_j | PA_j} at each node X_j, where j ∈ T, with the "new" causal mechanism \tilde{P}_{X_j | PA_j}.

The following example illustrates the idea above in a formal setting where causal relationships between variables are represented in terms of structural equations (Pearl, 2009).
Example 1
Consider a causal model consisting of two variables, i.e. C = ⟨X_1 → X_2, P_{X_1, X_2}⟩, induced by the structural equations X_1 := N_1 and X_2 := X_1 + N_2, where the independent unobserved noise terms N_j ∼ N(0, 1) are distributed according to a standard Normal distribution. A typical structure-preserving intervention (Eberhardt and Scheines, 2007) changes either the parent-child functional relationship, or the distribution of the unobserved noise term. Consider an intervention that changes the relationship between X_1 and X_2 from the linear function to a non-linear function represented by the new structural assignment X_2 := X_1^2 + N_2. Whereas this changes the causal mechanism of X_2 from P_{X_2 | X_1} to \tilde{P}_{X_2 | X_1}, the underlying causal graph remains the same. As such, we have a new causal model C^{{2}} = ⟨X_1 → X_2, P^{{2}}_{X_1, X_2}⟩, where P^{{2}}_{X_1, X_2} = P_{X_1} \tilde{P}_{X_2 | X_1}.

The "new" joint distribution P^T_X can also be seen as the post-intervention joint distribution P^{do(T)}_X, where do(T) represents mechanism changes through stochastic interventions at each node X_j indexed by T (Correa and Bareinboim, 2020). In particular, we only consider cases where the change in the joint distribution is a result of systematic replacements of independent physical mechanisms. Other cases, e.g. adversarial perturbations, can also change the joint distribution arbitrarily, which we do not consider here. Moreover, we consider the problem setting where the joint distribution changes, but not the causal graph.

As the joint distribution P_{X_1,...,X_n} is a composition of independent causal mechanisms P_{X_j | PA_j}, it is plausible to attribute any change in the joint distribution to the change in some of the causal mechanisms. We would like to compute the contribution of each node (potentially due to the change in its causal mechanism) to the change in the joint distribution. To this end, we first need a measure that quantifies the change in the joint distribution.
A natural choice for quantifying the change in the joint distribution is a divergence measure, as divergences measure the "distance" between two probability distributions. In this work, we consider the Kullback-Leibler (KL) divergence, also called relative entropy (Cover and Thomas, 2006). Let P and Q be two probability distributions of a discrete random variable defined on the same probability space X. Then the KL divergence from Q to P is defined as

D(P || Q) := \sum_{x \in X} P(x) \log \frac{P(x)}{Q(x)}.

If P and Q are the distributions of a continuous random variable, then the summation is replaced by an integral. The KL divergence is particularly suitable for our purpose as it is additive over the independent factors of the joint distribution. That is, by generalising the chain rule to more than two variables using the causal Markov condition, we get an additive decomposition of the KL divergence from the joint distribution P_X to \tilde{P}_X (Cover and Thomas, 2006, Theorem 2.5.3):

D(\tilde{P}_X || P_X) = \sum_{j=1}^{n} D(\tilde{P}_{X_j | PA_j} || P_{X_j | PA_j}).

Thus, the contribution of each node X_j to the KL divergence from the joint distribution P_X to \tilde{P}_X is the KL divergence from its causal mechanism P_{X_j | PA_j} to \tilde{P}_{X_j | PA_j}. In other words, each node, due to the change in its causal mechanism, contributes independently to the KL divergence from the joint distribution P_X to \tilde{P}_X. The definition below formalises this observation.

Definition 3 Suppose that the causal mechanism of a node X_j changes from P_{X_j | PA_j} to \tilde{P}_{X_j | PA_j}. Then the contribution of node X_j to the KL divergence from the joint distribution P_X to \tilde{P}_X is D(\tilde{P}_{X_j | PA_j} || P_{X_j | PA_j}).

The KL divergence is always non-negative; that is, for any P and Q, it holds that D(P || Q) ≥ 0. Therefore, the contribution of a node X_j to the change in the joint distribution, measured in terms of the KL divergence, cannot be negative. If the causal mechanism of a variable did not change, then its contribution will be zero.

We should add a remark, however, that the KL divergence between conditionals also depends on the distribution of the parents, not only on the conditionals. Formally, the KL divergence from P_{X_j | PA_j} to \tilde{P}_{X_j | PA_j} is defined as

D(\tilde{P}_{X_j | PA_j} || P_{X_j | PA_j}) := E_{pa_j ∼ \tilde{P}_{PA_j}} [ D(\tilde{P}_{X_j | pa_j} || P_{X_j | pa_j}) ].

This is not really problematic, however, as the marginal distribution \tilde{P}_{PA_j} of the parents is only used for averaging the KL divergences D(\tilde{P}_{X_j | pa_j} || P_{X_j | pa_j}). As such, each node X_j uses the corresponding parent distribution \tilde{P}_{PA_j} from the same joint distribution \tilde{P}_X.

Estimating the KL divergence in high-dimensional settings is a challenging problem. For some parametric families, such as the exponential family of distributions (which includes the Gaussian, Poisson, Binomial, Multinomial, and Beta, as well as many others), closed-form expressions exist for computing the KL divergence. If a non-parametric estimator is desired, then we can use the k-nearest-neighbour-based estimator of KL divergence (Wang et al., 2009), which is asymptotically unbiased and mean-square consistent assuming i.i.d. samples.

One may also wonder whether the asymmetry of the KL divergence creates a non-intuitive interpretation, as the impact of each mechanism change differs according to the direction of comparison. From an inferential standpoint, however, one direction seems preferable. When a distribution changes in time, there is a natural inferential asymmetry, since one would rather consider the likelihood of new data with respect to the old model than vice versa.
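Returning to the additive decomposition above, here is a minimal numerical check on a hypothetical two-variable chain X1 → X2 with binary variables (all probability values below are made up for illustration); only the mechanism of X1 changes, so X2's term in the decomposition vanishes:

```python
import numpy as np

def kl(p, q):
    # KL divergence D(p || q) for discrete distributions given as 1-D arrays.
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

# "Old" model over the chain X1 -> X2: marginal p1 and conditional
# p2g1[x1] = P(X2 | X1 = x1).
p1 = np.array([0.5, 0.5])
p2g1 = np.array([[0.9, 0.1],
                 [0.2, 0.8]])
# "New" model: only the mechanism of X1 changed.
p1_new = np.array([0.8, 0.2])
p2g1_new = p2g1.copy()

# Joint distributions P(x1, x2) = P(x1) P(x2 | x1).
joint = p1[:, None] * p2g1
joint_new = p1_new[:, None] * p2g1_new

# Additive decomposition of the joint KL divergence into per-mechanism terms:
# D(P~_X || P_X) = D(P~_X1 || P_X1) + E_{x1 ~ P~_X1} D(P~_{X2|x1} || P_{X2|x1}).
lhs = kl(joint_new.ravel(), joint.ravel())
term1 = kl(p1_new, p1)
term2 = sum(p1_new[a] * kl(p2g1_new[a], p2g1[a]) for a in range(2))
assert abs(lhs - (term1 + term2)) < 1e-12
# X2's mechanism did not change, so its contribution is exactly zero:
assert term2 == 0.0
```

The whole joint KL divergence is thus attributed to X1, as Definition 3 prescribes.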
In many scenarios, we might rather be interested in why the marginal distribution of one target variable changed, instead of the change in the joint distribution of all variables. In the next section, using a concept from game theory, we formalise how to attribute the change in the marginal distribution of a target variable to each node in the causal graph.
Suppose that the marginal distribution of a target variable X_k changes from P_{X_k} to \tilde{P}_{X_k}. The causal Markov condition allows us to compute the marginal distribution of any variable in the causal graph by marginalising (summing) over all independent causal mechanisms, excluding that of the variable itself. Formally, given a causal model C = ⟨G, P_X⟩, the marginal distribution P_{X_k} of the variable X_k can be computed by first factorising the joint distribution using the causal Markov condition, and then marginalising over all other variables, i.e.

P_{X_k} = \sum_{x_1, ..., x_{k-1}, x_{k+1}, ..., x_n} P_{X_1, ..., X_n} = \sum_{x_1, ..., x_{k-1}, x_{k+1}, ..., x_n} \prod_{j=1}^{n} P_{X_j | PA_j}.

The marginal distribution of X_k given the change set T is then given by

P^T_{X_k} = \sum_{x_1, ..., x_{k-1}, x_{k+1}, ..., x_n} \prod_{j \in T} \tilde{P}_{X_j | PA_j} \prod_{j \notin T} P_{X_j | PA_j}.

It is therefore reasonable to assume that any change in the marginal distribution P_{X_k} of the variable X_k is most likely due to the change in some of the causal mechanisms P_{X_j | PA_j}. Unlike in the case of the joint distribution change, however, the additive property of the KL divergence cannot be leveraged directly for the attribution here, due to the marginalisation.

A natural way to compute the contribution of each node to the change in the marginal distribution of the target, from P_{X_k} to \tilde{P}_{X_k}, is then as follows: replace each "old" mechanism P_{X_j | PA_j} by the "new" mechanism \tilde{P}_{X_j | PA_j} in succession. Each replacement changes the marginal distribution of the target X_k, which can be used to compute the contribution of the corresponding node. The amount of change, however, depends on the causal mechanisms that have already been replaced. In other words, the contribution of a node X_j to the change in the marginal distribution of the target X_k depends on the order in which we replace the causal mechanisms; the resulting attribution procedure is hence in danger of becoming arbitrary.
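To see the order dependence concretely, consider a hypothetical two-variable linear-Gaussian model in which both mechanisms change and the summary of interest is the variance of the target; the coefficients (old: a = b = 1; new: a = 2, b = 0.5) are made-up illustrative numbers:

```python
from itertools import permutations

# Variance of the target X2 in the SCM X1 := a*N1, X2 := b*X1 + N2 (unit-variance
# noises), as a function of which mechanisms have been replaced by their "new"
# versions. Old: (a, b) = (1, 1); new: (a, b) = (2, 0.5).
def var_x2(changed):
    a = 2.0 if 1 in changed else 1.0
    b = 0.5 if 2 in changed else 1.0
    return (b * a) ** 2 + 1.0  # Var(X2) = b^2 * Var(X1) + Var(N2)

for order in permutations([1, 2]):
    done = set()
    steps = []
    for j in order:
        before = var_x2(done)
        done.add(j)
        steps.append((j, var_x2(done) - before))
    print(order, steps)
# order (1, 2): node 1 contributes +3.0, node 2 contributes -3.0
# order (2, 1): node 2 contributes -0.75, node 1 contributes +0.75
```

The per-node contributions differ sharply between the two replacement orders, even though each order sums to the same total change (here zero). Symmetrising over all orderings resolves this ambiguity, which is exactly what the Shapley value formalises.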
The Shapley value (Shapley, 1953) from cooperative game theory provides a principled approach to mitigate the arbitrariness in the attribution procedure due to the dependence on the ordering of replacements. In particular, it removes the arbitrariness by symmetrising over all orderings. We briefly summarise the Shapley value here.

Let N := {1, ..., n} be a set of n players and ν : 2^N → R be a set function that associates a real-valued payoff to a coalition of players T ⊆ N, with ν(∅) = 0, where ∅ denotes the empty set. We assume that players will cooperate to form a grand coalition N. The goal is then to "fairly" assign the resulting payoff ν(N) to each player j in N.

Let σ : N → N denote a permutation of the players N. All permutations of the set N with n elements form a symmetric group S_n. Suppose that each player enters into a coalition one by one in the ordering σ(1), σ(2), ..., σ(n) to eventually form a grand coalition. Let N_prec(j) denote the set of players that precede player j in the ordering, i.e. N_prec(j) := {i ∈ N | σ^{-1}(i) < σ^{-1}(j)}. Note that σ^{-1}(i) gives player i's position. Then the change in the payoff of a coalition when a player joins the coalition is the marginal contribution of the player to the coalition,

C_ν(j | N_prec(j)) := ν(N_prec(j) ∪ {j}) − ν(N_prec(j)).

The marginal contributions of all players then sum up to the payoff of the grand coalition ν(N), i.e.

\sum_{j=1}^{n} C_ν(j | N_prec(j)) = ν(N).   (2)

Unfortunately, the marginal contribution of a player j depends on the order σ in which players enter into a coalition. To remove the dependence on the ordering σ, Shapley's solution (Shapley, 1953) is to assign each player j ∈ N its average marginal contribution over all permutations, i.e.

φ_j(ν) := \frac{1}{n!} \sum_{σ \in S_n} [ ν(N_prec(j) ∪ {j}) − ν(N_prec(j)) ]
       = \sum_{T ⊆ N \setminus \{j\}} \frac{|T|! (n − |T| − 1)!}{n!} ( ν(T ∪ {j}) − ν(T) )
       = \sum_{T ⊆ N \setminus \{j\}} \frac{1}{n \binom{n−1}{|T|}} ( ν(T ∪ {j}) − ν(T) ),

where the second line follows from counting the number of permutations (out of n!) for which N_prec(j) = T. The Shapley value is also desirable because it is the unique solution satisfying the following axioms, which capture the notion of fairness.

Efficiency
The Shapley values of all players sum up to the payoff of the grand coalition. That is, \sum_{j=1}^{n} φ_j(ν) = ν(N) for any set function ν.

Symmetry
If two players contribute the same amount to every coalition of other players, then their Shapley values are equal. Formally, for two players i and j, if ν(T ∪ {i}) = ν(T ∪ {j}) for all coalitions T ⊆ N \ {i, j}, then we have φ_i(ν) = φ_j(ν).

Null Player
A player that does not contribute to any coalition gets a zero Shapley value. Formally, for a player j, if ν(T ∪ {j}) = ν(T) for all coalitions T ⊂ N \ {j}, then φ_j(ν) = 0.

Additivity
The Shapley value computed from the sum of two set functions is the same as the sum of the Shapley values computed using the individual set functions. That is, for any two set functions ν_1 and ν_2, we have φ_j(ν_1 + ν_2) = φ_j(ν_1) + φ_j(ν_2).

Computing the Shapley value of a player has a worst-case time complexity of O(2^n). However, efficient approximations exist (Lundberg and Lee, 2017).

Slightly abusing notation, we use the index set {1, ..., n} =: N to refer to the corresponding variables X_1, ..., X_n. Then any coalition T ⊂ N represents the change set for mechanism changes to the "old" causal model. Let P^T_{X_k} denote the marginal distribution of X_k obtained by marginalising the joint distribution P^T_X in the new causal model C^T := ⟨G, P^T_X⟩. Naturally, for T = ∅, the causal model does not change, i.e. C^T = C.

First consider a scenario where the change in the marginal distribution of the target X_k is quantified by the KL divergence. The quantity that we want to attribute to each node is then the KL divergence D(\tilde{P}_{X_k} || P_{X_k}). To this end, we define the marginal contribution of a node X_j given a change set T as

C_D(j | T) := D(P^{T ∪ {j}}_{X_k} || P_{X_k}) − D(P^T_{X_k} || P_{X_k}).

In other words, the marginal contribution quantifies how much replacing the causal mechanism of variable X_j contributes to the KL divergence D(\tilde{P}_{X_k} || P_{X_k}), given that we have already replaced the causal mechanisms of the variables in the change set T. Then the Shapley value of the variable X_j is

φ_j(D) := \sum_{T ⊆ N \setminus \{j\}} \frac{1}{n \binom{n−1}{|T|}} C_D(j | T),   (3)

which gives the "fair" contribution of X_j to the change in the marginal distribution of the target X_k, measured in terms of D(\tilde{P}_{X_k} || P_{X_k}). Note that the Shapley value contribution φ_j(D) can be negative.
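The sum in (3) can be evaluated exactly by enumerating subsets; a minimal sketch for a generic set function (exponential in n, as noted above), checked here on a made-up additive game:

```python
from itertools import combinations
from math import comb

def shapley_values(n, payoff):
    # Exact Shapley values for players 1..n and a set function `payoff`
    # mapping a frozenset of players to a real number (payoff(frozenset()) = 0).
    # Implements phi_j = sum over T ⊆ N\{j} of (nu(T ∪ {j}) - nu(T)) / (n * C(n-1, |T|)).
    players = list(range(1, n + 1))
    phi = {}
    for j in players:
        others = [i for i in players if i != j]
        total = 0.0
        for size in range(n):
            for T in combinations(others, size):
                T = frozenset(T)
                total += (payoff(T | {j}) - payoff(T)) / (n * comb(n - 1, size))
        phi[j] = total
    return phi

# Sanity check on an additive game: each player's Shapley value is its own weight.
weights = {1: 2.0, 2: -1.0, 3: 0.5}
phi = shapley_values(3, lambda T: sum(weights[i] for i in T))
print(phi)  # ≈ {1: 2.0, 2: -1.0, 3: 0.5}
```

For this additive game the values recover each player's own weight and sum to ν(N), in line with the efficiency axiom; in our setting, `payoff` would be the (estimated) change in the target quantity under a given change set.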
This is completely reasonable, as the marginal contribution C_D(j | T) can be negative; replacing the causal mechanism of a variable can bring the "new" marginal distribution of the target closer to the "old" marginal distribution P_{X_k}, thereby lowering the KL divergence relative to not replacing it.

Due to the efficiency property, the Shapley values of all variables sum up to the KL divergence of the marginal distribution from P_{X_k} to \tilde{P}_{X_k}, i.e.

\sum_{j=1}^{n} φ_j(D) = D(\tilde{P}_{X_k} || P_{X_k}).

Note that the equality above may not hold strictly when we approximate the Shapley values of the players. For a worst-case analysis of the approximation error of Shapley values, we refer the interested reader to Charnes et al. (1988) and Fatima et al. (2008).

Instead of the overall change in the marginal distribution as measured by the KL divergence, we might be interested in the change of some property or summary of it (e.g. mean, median, variance, skew). For instance, it is not uncommon for a retailer to ask, "Why did the mean inventory level go down?", as maintaining an inventory requires upfront investment. Amongst many possibilities, that could be due to a change in the distribution of the demand forecast, or some changes in the algorithms (mathematical functions). The definition below provides a rather general way to attribute marginal distribution change.
Definition 4 (Marginal Distribution Change Attribution)
Let ψ denote any functional defined on the marginal distribution. Given a change set T, the marginal contribution of a node X_j to the change in the functional of the marginal distribution of the target X_k, i.e. Δψ := ψ(\tilde{P}_{X_k}) − ψ(P_{X_k}), is

C_ψ(j | T) := ψ(P^{T ∪ {j}}_{X_k}) − ψ(P^T_{X_k}),

and its Shapley value contribution to Δψ is given by

φ_j(ψ) := \sum_{T ⊆ N \setminus \{j\}} \frac{1}{n \binom{n−1}{|T|}} C_ψ(j | T).

The marginal contribution C_ψ(j | T) quantifies how much replacing the causal mechanism of X_j contributes to the change in the summary of P_{X_k}, given that we have already replaced the causal mechanisms of the variables in the change set T. As the marginal contribution of a variable can be negative, the Shapley value contribution φ_j(ψ) can also be negative. In the example below, we show how to attribute the change in the mean of the target variable to each node.

Example 2
Let E_{X_k ∼ P_{X_k}}[X_k] denote the mean of the variable X_k under the distribution P_{X_k}. The Shapley value contribution of a node X_j to the change in the mean (due to the change in the marginal distribution) of the target node X_k is

φ_j(E) := \sum_{T ⊆ N \setminus \{j\}} \frac{1}{n \binom{n−1}{|T|}} C_E(j | T),

where the marginal contribution C_E(j | T) of X_j given a change set T is defined as

C_E(j | T) := E_{X_k ∼ P^{T ∪ {j}}_{X_k}}[X_k] − E_{X_k ∼ P^T_{X_k}}[X_k].

The Shapley values of all nodes sum up to the change in the mean of the target node. That is, the following holds:

\sum_{j=1}^{n} φ_j(E) = E_{X_k ∼ \tilde{P}_{X_k}}[X_k] − E_{X_k ∼ P_{X_k}}[X_k].

Next we discuss practical aspects of the distribution change attribution solutions we have presented thus far.
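Before moving on, Example 2 can be checked by Monte-Carlo estimation: sample the target from the SCM with the mechanisms in T replaced, and plug the sample means into C_E(j | T). The two-node SCM and the shift sizes below are made-up illustrations (new X1 := N1 + 1, new X2 := X1 + N2 + 2, so the ground-truth contributions are 1 and 2):

```python
import numpy as np
from itertools import combinations
from math import comb

rng = np.random.default_rng(7)
m = 200_000  # Monte-Carlo sample size per estimate

def mean_x2(T, m=m):
    # Estimate E[X2] when the mechanisms indexed by the change set T use their
    # "new" versions (illustrative mean-shift-only changes).
    n1, n2 = rng.standard_normal(m), rng.standard_normal(m)
    x1 = n1 + (1.0 if 1 in T else 0.0)
    x2 = x1 + n2 + (2.0 if 2 in T else 0.0)
    return float(np.mean(x2))

def shapley_mean(n=2):
    phi = {}
    for j in (1, 2):
        others = [i for i in (1, 2) if i != j]
        phi[j] = sum(
            (mean_x2(set(T) | {j}) - mean_x2(set(T))) / (n * comb(n - 1, len(T)))
            for size in range(n) for T in combinations(others, size)
        )
    return phi

phi = shapley_mean()
print(phi)  # ≈ {1: 1.0, 2: 2.0}; the values sum to the total mean change of ≈ 3
```

For these shift-only changes the contributions are additive, so the Shapley values recover the individual shifts up to Monte-Carlo error.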
To apply our attribution methods, we require a causal graph and its causal conditionals. Existing techniques for sound and complete causal structure learning (Spirtes et al., 2000; Pearl, 2009), however, can only recover the Markov equivalence class of DAGs, assuming faithfulness. With additional assumptions on the data-generating process, it is possible to recover the exact causal graph (Peters et al., 2017; van de Geer and Bühlmann, 2013). A promising approach is to combine domain knowledge with conditional independence tests (Mastakouri et al., 2019). In some cases, we can also perform controlled randomised experiments to identify the exact DAG from the Markov equivalence class (Dubois et al., 2020).

Once we have the causal graph, we can then estimate the causal conditionals directly from data using techniques from high-dimensional statistics (Bühlmann and van de Geer, 2011; Wainwright, 2019). Note that we are given two samples of the same or different sizes (e.g. data from the week before Christmas, and data from the Christmas week). Then for each node X_j, we estimate P_{X_j | PA_j} from the first sample, and \tilde{P}_{X_j | PA_j} from the second sample. In the context of distribution-change attribution, however, sampling variability can lead to spurious results when we directly plug in the estimated causal conditionals. Even if the two samples are drawn from the same joint distribution P_{X_j, PA_j}, we will most likely estimate two different causal conditionals (even with regularisation) because of sampling variability. Contrary to expectation, X_j will then be attributed a non-zero value.

If we knew that the causal mechanism P_{X_j | PA_j} did not change, it would then make sense to learn the causal mechanism from the combined sample, as the quality of the learned causal conditional would also improve due to the increased sample size.
Moreover, we could then directly attribute a zero contribution to the node. This raises the question: how do we detect causal mechanism changes from two samples? In other words, are there any statistically testable implications of causal mechanism changes?

There exists a large body of work, e.g. Chakravarti et al. (1967); Scholz and Stephens (1987); Snedecor and Cochran (1989); Gretton et al. (2012), on statistical hypothesis tests for determining whether the difference between two distributions is statistically significant, given two samples, each drawn from a separate distribution. Those are not applicable in our setting, as we are interested in the difference between the conditional distributions, not the marginals. In a recent work, Huang et al. (2020) extend the PC algorithm (Spirtes et al., 2000) for causal discovery from non-stationary or heterogeneous data, where one of the steps involves detecting changing causal mechanisms. Here we adapt their method.

Assume that the causal mechanisms P_{X_j | PA_j} can be written as functions of a time or domain index A (see Figure 1). If the causal graph is induced by a functional causal model, then quantities such as the functional models, noise levels, etc. that may change over time or across domains can be written as functions of A. Under these assumptions, if the causal mechanism P_{X_j | PA_j} remains the same across the values of A, then the conditional independence test X_j ⊥⊥ A | PA_j suffices to detect changes to the causal mechanism P_{X_j | PA_j}.

Figure 1: The assumed causal graph to detect mechanism changes for a parent-child relationship.

Let D_t denote the m_t-by-n matrix containing the sample from time (or domain) t ∈ {1, 2}, and D denote the m-by-n matrix obtained by vertically concatenating the D_t's, where m = m_1 + m_2. That is, we have

D := [ D_1 ; D_2 ].

We can construct A directly from data. The key idea is to assign the same value of A to the units in the sample from the same time (or domain) t.
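As an illustration, here is a sketch of the construction of the stacked data and the index variable A (with +1 for the first sample and −1 for the second, as formalised next), together with a simple stand-in for the conditional independence test. The paper uses the kernel-based test of Zhang et al. (2011); the stand-in below, a pooled linear regression followed by a two-sample t-test on residuals, is a crude parametric substitute that only detects mean-shift changes, and the data-generating numbers are made up:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
m1 = m2 = 2000

# Two samples of (PA_j, X_j): the mechanism X_j := PA_j + N shifts by +1 in sample 2.
pa1 = rng.standard_normal(m1); x1 = pa1 + rng.standard_normal(m1)
pa2 = rng.standard_normal(m2); x2 = pa2 + rng.standard_normal(m2) + 1.0

# Stack the samples and build the index variable A (+1 for sample 1, -1 for sample 2).
pa = np.concatenate([pa1, pa2]); x = np.concatenate([x1, x2])
A = np.concatenate([np.ones(m1), -np.ones(m2)])

# Stand-in for the CI test X_j ⊥⊥ A | PA_j: regress X_j on PA_j over the pooled
# data and check whether the residuals differ across the two values of A.
slope, intercept = np.polyfit(pa, x, 1)
resid = x - (slope * pa + intercept)
t, p = stats.ttest_ind(resid[A > 0], resid[A < 0])
print(p < 0.01)  # True here: the dependence of X_j on A given PA_j is detected
```

A kernel-based test would replace the regression-plus-t-test step and detect arbitrary (not just mean-shift) mechanism changes.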
For clarity, with a slight abuse of notation, we interchangeably use A for the variable as well as the data vector. Each entry a_i of the m-by-1 vector A is assigned

a_i := +1 if i ≤ m_1, and −1 otherwise.

Using the columns of the combined data matrix D and the index vector A, we then test whether each variable X_j is conditionally independent of A given its direct parents PA_j in the causal graph, i.e. X_j ⊥⊥ A | PA_j. If the conditional independence holds, then the causal mechanism P_{X_j | PA_j} did not change across the values of A. On the other hand, if X_j is dependent on A given PA_j, then its causal mechanism P_{X_j | PA_j} changes with A. Note that the functional relationship between the variable X_j and the time (or domain) index A is unknown. It is therefore important to use a non-parametric conditional independence test. In this work, we use the kernel-based conditional independence test (Zhang et al., 2011).

The path-specific effect of X on Y via a path π is the degree to which an interventional change in X would change the marginal distribution of Y if that change were transmitted only via π. If an indicator (or context) variable A represents samples from the distributions P and \tilde{P}, then computing the path-specific effect of A on any node via the direct path simply measures the distance between the "old" causal mechanism and the "new" causal mechanism of the node. Arguably, we then capture the causal influence of external factors (abstracted by A) on the node (Janzing et al., 2013). To discuss the relation to the strength of causal arrows defined in Janzing et al. (2013), we need to label the two distributions by an additional variable V attaining values v and v̄, with edges to all X_j whose mechanisms change. Their strengths would be D(P_{X_j | PA_j} || [ p(v) P_{X_j | PA_j} + p(v̄) \tilde{P}_{X_j | PA_j} ]).

Most feature-based interpretability techniques assume that features independently co-cause the target (Lundberg and Lee, 2017; Janzing et al., 2020).
In particular, they consider how the marginal distribution of the target changes w.r.t. interventional changes in the features. They cannot attribute causal mechanism changes, as one must assume that the causal mechanism of the target does not change to make them work for multiple datasets.

Causal discovery from multiple contexts (Mooij et al., 2020) finds the union of the causal graphs in each dataset (or context) by jointly modelling the context variables (A_1, ..., A_k) and the observed variables (X_1, ..., X_n). While it also works on samples from different distributions, and hence might appear related, these are two different problems.

Mechanism changes can also be represented with stochastic interventions (Pearl, 2009; Correa and Bareinboim, 2020). We briefly discussed the connection in the last paragraph of Section 2. Computing the effect of a stochastic intervention, however, does not tell us how much the corresponding mechanism change contributed to a target quantity summarising the distribution change.

Overall, existing methods are not suitable for distribution change attribution, where our goal is to attribute a target quantity summarising distribution change (e.g. change in mean, KL divergence) to the change in the causal mechanism (the conditional distribution of a node given its direct causes) of each variable.

In experiments, we study the performance of our approach for attributing the change in the marginal distribution, and its application to real-world data. In particular, through simulations, we study the performance of our attribution method when we learn causal mechanisms from samples, w.r.t. sample size and the magnitude of causal mechanism change. On real-world data, we assess whether the results are sensible.
We consider a setting where n − X , . . . , X n − co-cause a target variable X n . Their un-derlying causal graph is shown in Figure 2. We choosethis simple, yet representative, causal graph for simula-tion because computing the Shapley values analytically forsummaries of distributional changes is not trivial for rathercomplex graphs. X ... X n − X n Figure 2: The causal graph used in simulations, whereindependent inputs X , . . . , X n − co-cause the target X n .Let N w ∼ N ( µ w , ) denote an independent Gaussiannoise with mean µ w and unit variance for each w ∈{ , . . . , n } . Suppose that their causal graph is induced bythe following structural assignments: X w : = N w for w ∈ { , . . . , n − } and X n : = X + . . . + X n − + N n . We refer to the structural causal model (SCM) above by C . Suppose that their joint distribution P X ,..., X n changes to˜ P X ,..., X n due to the changes in the structural assignments.In the new SCM ˜ C , we have the following assignments: X w : = N w + λ w for w ∈ { , . . . , n − } and X n : = X + . . . + X n − + N n + λ n , where λ w is a scalar that shifts the mean of the correspond-ing variable X w for each w ∈ { , . . . , n } , and further definedas λ w : = (cid:40) λ if S w =
10 otherwise, (4)where S w is a Bernoulli random variable with a probabilityof success p = .
5. That is, S w determines whether thecausal mechanism of the corresponding variable X w ispotentially subject to change. If S w =
1, then the valueof λ subsequently dictates the magnitude of the changein the causal mechanism of X w . Note that even if S w = X w doesnot change if λ =
0. With rejection sampling, we ensurethat at least one causal mechanism changes, i.e. λ w (cid:54) = w ∈ { , . . . , n } . This way, we can change thecausal mechanism of a random subset of variables through S w , and study the performance of our attribution methodw.r.t λ .8ue to changes in the causal mechanisms of variables,the marginal distribution of the target X n also changes: C : X n ∼ N ( µ + . . . + µ n , n ) ˜ C : X n ∼ N ( µ + . . . + µ n + λ + . . . + λ n , n ) . Let P X n and ˜ P X n denote the marginal distributions of X n inSCMs C and ˜ C respectively. We measure the change in themarginal distribution of X n by the difference in its mean,i.e. ∆ E : = E X n ∼ ˜ P Xn [ X n ] − E X n ∼ P Xn [ X n ]= λ + . . . + λ n . With some algebraic manipulation, we can show that thecontribution of each variable X w —due to the change in itscausal mechanism—to ∆ E is then given by φ w ( E ) = λ w (5)These closed-form expressions provide the ground truthfor our evaluation. As it should, we see that the followingholds: φ ( E ) + . . . + φ n ( E ) = ∆ E . First we generate two set of samples of same size fromthe two SCMs C and ˜ C stated above. From each sample, welearn (estimate) the SCM assuming that the causal graphis known. As the SCM has an additive unobserved noiseterm, it is possible to estimate both the function and thenoise from data with regression. We generate a samplefrom the joint distribution induced by the learned SCM.Then we estimate the mean of the marginal distributionby the sample average. Finally, we compute the Shapleyvalue of each variable X w . To measure the quality of theestimated Shapley values against the ground truth, we usethe (cid:96) norm, otherwise known as Manhattan distance.First we study the performance of our attribution methodagainst the magnitude parameter λ . To this end, for a givenvalue of λ , we generate 100 pair of SCMs ( C , ˜ C ) with µ w chosen according to the Uniform distribution U ( − , ) foreach w ∈ { , . . . 
, n}, where n is chosen uniformly at random from { , , , }. From each SCM in the pair (C, C̃), we generate 100 samples, each containing 1000 observations. Note that we learn the SCM from each sample, and then estimate the Shapley values from a sample drawn from the learned SCM, which we repeat 100 times.

Figure 3: The ℓ1 distance between the ground truth and the estimated Shapley values when the underlying linear structural causal model is estimated with the linear regression model versus the gradient boosted trees regression model, at various values of λ (the magnitude of causal mechanism change). The standard error bars for the linear regression model are invisible as they are too narrow.

We therefore report the average ℓ1 distance (with standard error) over the 100 × 100 runs at each value of λ. With the right regression model (linear regression), the estimated Shapley values are very close to the ground truth regardless of the magnitude of causal mechanism change λ, indicated by a close-to-zero ℓ1 distance. With gradient boosted trees (from the xgboost Python package with default hyperparameters and 100 trees), however, the estimated Shapley values, on average, deviate from the ground truth as λ increases. This is expected, as inferring the right model is harder if the function class is less restricted a priori. The Shapley values estimated using a non-linear model therefore deviate more from the ground truth than those estimated using a linear model. We also observe that the uncertainty in the Shapley value estimation increases when the model estimation is less accurate: the error bars for gradient boosted trees are wide, whereas those for linear regression are so narrow that they are invisible in the figure.

In Figure 4, we show the results when we vary the sample size, but randomly choose the magnitude parameter λ according to U( , ) for each SCM pair.
We observe that the Shapley values from the linear regression model are close to the ground truth even at a relatively small sample size of 500, with an average ℓ1 distance of 0.29 and a standard error of 0.21. While the performance of the XGBoost regression model certainly improves with increasing sample size, it does not match that of the linear regression model.
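As a numerical cross-check of the ground-truth construction above, the following minimal sketch (our own illustrative code with hypothetical names, not the paper's implementation) simulates one SCM pair from the linear chain model and verifies that the closed-form Shapley values φ_w(E) = λ S_w add up to the empirical mean shift ΔE:

```python
import numpy as np

rng = np.random.default_rng(0)
n, lam = 5, 2.0
mu = rng.uniform(-1.0, 1.0, size=n)       # noise means of the original SCM C
S = rng.integers(0, 2, size=n)            # S_w = 1 marks a changed mechanism in C~
if S.sum() == 0:                          # rejection step: at least one change
    S[rng.integers(n)] = 1

def sample_target(shift, size=200_000):
    """Sample X_n from the chain X_w = X_(w-1) + N_w with N_w ~ N(mu_w + shift_w, 1);
    X_n is then simply the sum of all n noise terms."""
    return rng.normal(mu + shift, 1.0, size=(size, n)).sum(axis=1)

delta_E = sample_target(lam * S).mean() - sample_target(np.zeros(n)).mean()
phi = lam * S                             # closed-form ground truth: phi_w(E) = lam * S_w
assert abs(phi.sum() - delta_E) < 0.05    # Shapley values add up to the mean shift
```

Up to sampling noise, the empirical ΔE matches λ(S_1 + · · · + S_n), mirroring Equation (5).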
Figure 4: The ℓ1 distance between the ground truth and the estimated Shapley values when the underlying linear structural causal model is estimated with linear regression versus XGBoost regression, at various sample sizes.

Next we present a case study on the Adult Census Income dataset, where we use our proposal to identify the drivers of the difference in the income distribution between men and women. The dataset contains 32,561 records from the 1994 census on annual income in the United States (available at http://archive.ics.uci.edu/ml/datasets/Adult). In addition to whether the annual income of an individual is greater than fifty thousand USD, it contains 14 other socio-economic attributes. We consider a subset of non-redundant attributes for analysis, namely education and occupation, which directly affect income and also act as a proxy for the other attributes. After removing the rows with missing values, we end up with 30,718 rows.

As the number of variables is small, we combine causal discovery with domain knowledge to construct the causal graph. First, from the combined records of men and women, we discover the skeleton graph using the PC algorithm (Spirtes et al., 2000) with a kernel-based conditional independence test (Zhang et al., 2011) at a significance level of 0.05.

Figure 5: (left) Skeleton graph discovered using the PC algorithm on a subset of variables (education, occupation, income) from the Adult Census Income dataset. (right) Causal graph derived from the skeleton by orienting the undirected edges using domain knowledge.
Figure 6: Shapley value contribution of each variable (due to the potential difference in its causal mechanism) to the difference in the mean annual income between men and women (μ_men − μ_women). Each horizontal bar represents the mean Shapley value contribution, and each line represents the bias-corrected and accelerated bootstrap confidence interval at a 95% confidence level, over 100 resamples. The horizontal bars of education and income are almost invisible as their mean Shapley value contributions are close to zero.

Second, we orient the undirected edges using domain knowledge: education affects an individual's occupation and income, etc. Therefore, education confounds both occupation and income. We then have a causal graph as shown in Figure 5 (right).

Since our goal is to identify the drivers of the difference in the income distribution between men and women, as a sanity check, we perform a two-sample test to determine whether the income distribution is, indeed, different between men and women in the dataset. Under the null hypothesis that the incomes of men and women come from the same distribution, the Kolmogorov–Smirnov two-sample test yields an extremely small p-value, so we can safely reject the null hypothesis.

All three variables are categorical. In particular, the target variable (income) is binary (>50K or not). We use the empirical distribution as the causal mechanism of the root node (education). For each non-root node (occupation and income), we learn its causal mechanism using an XGBoost classifier with 100 gradient boosted trees. To detect causal mechanism changes, we use a kernel-based conditional independence test (Gretton et al., 2012) with a delta kernel at a significance level of 0.05. We then attribute the difference in the mean annual income between men and women.

The result of our method is shown in Figure 6. We find that the change in the causal mechanism of occupation, P_{occupation | education}, is the main driver of the difference in the mean annual income between men and women.
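The two-sample sanity check can be sketched with `scipy.stats.ks_2samp`; the samples below are synthetic stand-ins (the actual study compares the income column of the two groups):

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(2)
# Synthetic stand-ins for the two income samples (lognormal, shifted location)
income_men = rng.lognormal(mean=10.8, sigma=0.5, size=2000)
income_women = rng.lognormal(mean=10.5, sigma=0.5, size=2000)

res = ks_2samp(income_men, income_women)
reject = res.pvalue < 0.05   # reject the null of a common distribution
print(reject)
```

With a genuine location shift of this size, the p-value is far below any conventional significance level, so the null hypothesis is rejected.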
This is also often cited in public discourse as a reason why we need more participation of women in the labour market to empower them economically (Giuliano, 2017). Educational choices of men and women differ: science-related subjects are attended more by men than by women (Wang and Degol, 2017; Tellhed et al., 2016). In the UC Berkeley gender bias study from 1975, for instance, it was found that, compared to men, women tended to apply to departments that are more crowded, less well funded, and that frequently offer poorer professional employment prospects (Bickel et al., 1975). Graduates of science-related subjects are also known to earn more than others (Deming and Noray, 2018). Moreover, women often take on non-professional responsibilities, such as parenting and caring for family relatives, that affect their career and earnings (Jolly et al., 2014; Budig, 2006). Therefore, the income distribution of men and women differs even when they have the same level of education. As a representative case, we show the empirical conditional probability distribution of occupation given "Bachelor"-educated men versus women in Figure 7, which also corroborates our attribution result. The finding that the subject of education plays a significant role in occupation and income deserves further study with more detailed datasets, which goes beyond the scope of the present paper.
We presented a formal approach to identify the drivers of distribution change using graphical causal models. The key idea is that, given a causal graph, we can factorise the joint distribution into independent causal conditionals. Any change in the joint distribution, or in the marginal distribution of any target variable thereof, can then be attributed to changes in some of the causal conditionals. We illustrated our method on both simulated and real-world datasets.

(A comprehensive review of the literature on this topic, along with data and visualisations, is available at https://ourworldindata.org/female-labor-supply.)

Figure 7: Conditional probability distribution of occupation given "Bachelor"-educated men versus women, across the occupation categories in the dataset (Adm-clerical, Armed-Forces, Craft-repair, Exec-managerial, Farming-fishing, Handlers-cleaners, Machine-op-inspct, Other-service, Priv-house-serv, Prof-specialty, Protective-serv, Sales, Tech-support, Transport-moving).

In Section 5, we showed how to detect causal mechanism changes from data, given the underlying causal graph, using conditional independence tests. In many tasks, e.g. exploratory data analysis or designing and evaluating robust models, knowing which conditionals change is already sufficient. One might then ask, "Why do we need to quantify the contribution from each mechanism?" In a system with a large number of variables embedded in a changing environment, causal mechanisms change all the time. A supply chain is one such example, where continuous deployments of changes to constituent subsystems (e.g. forecasting, buying) are common. It is too costly (e.g. in time and personnel) to look at all variables whose mechanisms change.
Quantifying how much each variable contributed to the change allows us to focus on the few most relevant variables.

Finally, our attribution approach requires a causal graph, which may not be identifiable from observational data. If the causal graph is not identifiable, the Shapley values will not be identifiable either. The robustness of our attribution approach to causal graph misspecification therefore deserves further research.
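To make the quantification concrete, exact Shapley values over mechanisms can be computed by full subset enumeration whenever the number of changed mechanisms is small. The sketch below uses hypothetical names and an additive toy game; for such additive games (as in our linear simulation), each Shapley value reduces to the variable's own shift:

```python
from itertools import combinations
from math import factorial

def shapley_values(players, v):
    """Exact Shapley values of the set function v by enumerating all subsets."""
    n = len(players)
    phi = {}
    for p in players:
        rest = [q for q in players if q != p]
        total = 0.0
        for k in range(n):
            # Shapley weight for coalitions of size k not containing p
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for T in combinations(rest, k):
                total += weight * (v(set(T) | {p}) - v(set(T)))
        phi[p] = total
    return phi

# Additive toy game: changing mechanism w shifts the target mean by shift[w]
shift = {"education": 0.0, "occupation": 1.5, "income": -0.5}
v = lambda T: sum(shift[w] for w in T)      # mean shift when mechanisms in T change
phi = shapley_values(list(shift), v)        # recovers shift itself for additive games
```

For n mechanisms this costs O(2^n) evaluations of v; in practice, the set function is estimated from data rather than given in closed form.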
Acknowledgements
The authors thank Dr. David Afshartous for useful comments.

References
P. J. Bickel, E. A. Hammel, and J. W. O'Connell. Sex bias in graduate admissions: Data from Berkeley. Science, 187(4175):398–404, 1975.
M. J. Budig. Gender, self-employment, and earnings: The interlocking structures of family and professional status. Gender & Society, 20(6):725–753, 2006.
P. Bühlmann and S. van de Geer. Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer Publishing Company, Incorporated, 1st edition, 2011.
I. M. Chakravarti, R. G. Laha, and J. Roy. Handbook of methods of applied statistics, 1967.
A. Charnes, B. Golany, M. Keane, and J. Rousseau. Extremal Principle Solutions of Games in Characteristic Function Form: Core, Chebychev and Shapley Value Generalizations, pages 123–133. Springer Netherlands, 1988.
J. Correa and E. Bareinboim. A calculus for stochastic interventions: Causal effect identification and surrogate experiments. Proceedings of the AAAI Conference on Artificial Intelligence, 34(06):10093–10100, 2020.
T. M. Cover and J. A. Thomas. Elements of Information Theory (Wiley Series in Telecommunications and Signal Processing). Wiley-Interscience, USA, 2006.
D. J. Deming and K. L. Noray. STEM careers and the changing skill requirements of work. Working Paper 25065, National Bureau of Economic Research, 2018.
J. Dubois, H. Oya, J. M. Tyszka, M. Howard, F. Eberhardt, and R. Adolphs. Causal mapping of emotion networks in the human brain: Framework and initial findings. Neuropsychologia, 145, 2020.
F. Eberhardt and R. Scheines. Interventions and causal inference. Philosophy of Science, 74:981–995, 2007.
S. S. Fatima, M. Wooldridge, and N. R. Jennings. A linear approximation method for the Shapley value. Artificial Intelligence, 172(14):1673–1699, 2008.
P. Giuliano. Gender: An historical perspective. NBER Working Papers 23635, National Bureau of Economic Research, Inc, 2017.
A. Gretton, K. Borgwardt, M. Rasch, B. Schölkopf, and A. Smola. A kernel two-sample test. Journal of Machine Learning Research, 13:723–773, 2012.
J. J. Heckman, J. E. Humphries, and G. Veramendi. Returns to education: The causal effects of education on earnings, health, and smoking. Journal of Political Economy, 126(S1):197–246, 2018.
B. Huang, K. Zhang, J. Zhang, J. Ramsey, R. Sanchez-Romero, C. Glymour, and B. Schölkopf. Causal discovery from heterogeneous/nonstationary data. Journal of Machine Learning Research, 21(89):1–53, 2020.
D. Janzing, D. Balduzzi, M. Grosse-Wentrup, and B. Schölkopf. Quantifying causal influences. The Annals of Statistics, 41(5):2324–2358, 2013.
D. Janzing, L. Minorics, and P. Bloebaum. Feature relevance quantification in explainable AI: A causal problem. In Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, volume 108 of Proceedings of Machine Learning Research, pages 2907–2916. PMLR, 2020.
S. Jolly, K. A. Griffith, R. DeCastro, A. Stewart, and P. Ubel. Gender differences in time spent on parenting and domestic responsibilities by high-achieving young physician-researchers. Annals of Internal Medicine, 160(5):344–353, 2014.
D. Kifer, S. Ben-David, and J. Gehrke. Detecting change in data streams. In Proceedings of the Thirtieth International Conference on Very Large Data Bases - Volume 30, VLDB '04, pages 180–191. VLDB Endowment, 2004.
S. M. Lundberg and S.-I. Lee. A unified approach to interpreting model predictions. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS'17, pages 4768–4777, Red Hook, NY, USA, 2017. Curran Associates Inc.
A. Mastakouri, B. Schölkopf, and D. Janzing. Selecting causal brain features with a single conditional independence test per feature. In Advances in Neural Information Processing Systems 32, pages 12532–12543. Curran Associates, Inc., 2019.
J. M. Mooij, S. Magliacane, and T. Claassen. Joint causal inference from multiple contexts. Journal of Machine Learning Research, 21(99):1–108, 2020.
J. Pearl. Causality: Models, Reasoning and Inference. Cambridge University Press, New York, NY, USA, 2nd edition, 2009.
J. Peters, D. Janzing, and B. Schölkopf. Elements of Causal Inference – Foundations and Learning Algorithms. MIT Press, Cambridge, MA, USA, 2017.
M. Pollak. Optimal detection of a change in distribution. Annals of Statistics, 13(1):206–227, 1985.
F. W. Scholz and M. A. Stephens. K-sample Anderson–Darling tests. Journal of the American Statistical Association, 82(399):918–924, 1987.
L. S. Shapley. A value for n-person games. Contributions to the Theory of Games (AM-28), 2, 1953.
G. W. Snedecor and W. G. Cochran. Statistical Methods. Iowa State University Press, 8th edition, 1989.
P. Spirtes, C. Glymour, and R. Scheines. Causation, Prediction, and Search. MIT Press, 2000.
U. Tellhed, M. Bäckström, and F. Björklund. Will I fit in and do well? The importance of social belongingness and self-efficacy for explaining gender differences in interest in STEM and HEED majors. Sex Roles, 77, 2016.
S. van de Geer and P. Bühlmann. ℓ0-penalized maximum likelihood for sparse directed acyclic graphs. Annals of Statistics, 41(2):536–567, 2013.
M. J. Wainwright. High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 2019.
M. T. Wang and J. L. Degol. Gender gap in science, technology, engineering, and mathematics (STEM): Current knowledge, implications for practice, policy, and future directions.