arXiv [econ.TH]

Causality: a Decision-Theoretic Foundation∗

Pablo Schenone†

This version: April 7, 2020. First version: September 1, 2017
Abstract
We propose a decision-theoretic model akin to that of Savage [17] that is useful for defining causal effects. Within this framework, we define what it means for a decision maker (DM) to act as if the relation between two variables is causal. Next, we provide axioms on preferences that are equivalent to the existence of a (unique) directed acyclic graph (DAG) that represents the DM's preferences. The notion of representation has two components: the graph factorizes the conditional independence properties of the DM's subjective beliefs, and arrows point from cause to effect. Finally, we explore the connection between our representation and models used in the statistical causality literature (for example, that of Pearl [14]).
Keywords: causality, decision theory, subjective expected utility, axioms, representation theorem, intervention preferences, Bayesian graphs
JEL classification:
D80, D81

∗ I wish to thank David Ahn, Arjada Bardhi, Jeff Ely, Simone Galperti, Bart Lipman, Marciano Siniscalchi, and Tristan Tomala for insightful discussions on the paper. All remaining errors are, of course, my own.

† Division of the Humanities and Social Sciences, California Institute of Technology, Pasadena, CA. E-mail: [email protected].

1 Introduction
Consider an econometrician (say, Alex) who is interested in the causal relation among the intellectual ability (A), education level (E), and lifetime earnings (L) of a representative citizen (say, Mr. Kane). Alex maintains that both education and ability cause lifetime earnings, so that the observed dependence of L on E includes both the direct causal effect of E and A on L. Alex's objective is to quantify the direct causal effect that E – and E alone – has on L. There is a plethora of models that Alex can use to isolate the direct causal effect of education on lifetime earnings; see, for example, Rosenbaum and Rubin [16], Pearl [14], and the recent books by Imbens and Rubin [9] and Hernán and Robins [7].

For Alex to decide which model best quantifies the causal effect of E on L, Alex must first have in mind a definition of what causality means; then, Alex should choose the model of causality that best fits her definition. A natural definition of causality is that a variable (e.g., education) causes another variable (e.g., lifetime earnings) if a ceteris paribus policy on the presumed cause affects the distribution of the presumed consequence. That is, education causes earnings if a policy that exogenously changes education levels without affecting any other variable affects the distribution of earnings. As such, assessing causality requires information beyond that which is contained in observed correlation structures. Unfortunately, applied economics research is generally observational: researchers typically do not have access to a laboratory where ceteris paribus changes in causes can be implemented as a means of quantifying their effect on consequences. Instead, economists often must content themselves with observations of the realized values for education, earnings, and ability, rather than direct and exogenous interventions on these variables.
Economists are thus forced to conduct causal analysis based on independence structures alone, which generates a natural disconnect with our understanding of causality as a phenomenon that transcends the information contained in correlations. (While, technically, the term correlation denotes a linear statistical dependency, in this paper we use the term in its vernacular sense to denote any sort of statistical dependence.) The observational nature of applied economics research implies that there is a …

For any two variables (say, X and Y), the statement "X causes Y" (in the sense of our decision-theoretic definition) is true if, and only if, the representing DAG includes a direct arrow from X to Y. We also provide conditions under which our decision-theoretic definition of causality can be quantified purely in terms of the correlation structure of the relevant variables. This bridges the gap between our understanding of causation as a phenomenon centered around policy interventions and the practical limitation that economics is observational in nature. As such, this exercise yields a model that can be used in empirical research together with a microeconomic foundation for why such a model should be used. Further investigation of the differences between our representation and Pearl's model provides additional insights into Pearl's model.

The remainder of the paper is organized as follows. Section 2 provides a heuristic overview of the paper based on the three-variable example in this introduction. Section 3 outlines the model, and Section 4 provides our formal definition of causality. Sections 5 and 6 present our axioms and representation, respectively. Section 7 contains our two representation theorems, and finally, Section 8 contains a literature review. Since this paper combines elements from different literatures, we do not yet have the tools nor the specific language to provide a detailed literature review; hence, we delay it.
Readers already familiar with the Pearl model, Markov representations, and the do-probability formalism can skip to Section 8 without loss of continuity.

This paper connects ideas from three areas of study: decision theory, Bayesian graphs, and statistical causality. This section shows how these literatures come together in our paper, using the three-variable example from the introduction (the study of the Ability, Education, and Lifetime earnings of a representative agent). We begin by illustrating how axiomatic exercises help provide foundations for statistical models – in this case, models of causality. We then provide a bird's-eye view of our model. We also briefly discuss the causal model most closely related to the one our axioms identify, that of Pearl [14]. In doing so, we highlight some concerns with that model and how our model addresses them.

On axiomatics.
Axiomatic exercises are useful as a guideline for selecting among different numerical models for empirical research. The purpose of such an exercise is to provide a link between some numerical model and the way a rational decision maker (henceforth, DM) approaches the issue of interest (in this case, causality). In this paper, the DM is the analyst's econometric model. Indeed, an econometric model is a tool that takes a problem of decision making under uncertainty – should a firm produce high or low output? Should we give a patient drug A or drug B? Should we implement tax reform X or maintain the status quo? – and, after a suitable treatment of the uncertainty (say, estimating parameters in some regression), produces a recommendation: the firm should produce high output, the patient should not be given the drug, tax reform X is preferred to the status quo. The role of the DM's beliefs is played by the probability laws the researcher feeds into the numerical model (presumably obtained via some treatment of available data), and the role of axioms is normative; they provide rules that the econometric model should follow when selecting among the different alternatives. Henceforth, all references to a "DM" refer to the researcher's statistical model, and all references to the "DM's beliefs" refer to the probability distributions fed into the statistical model.
The decision-theoretic framework.
In our decision-theoretic model, a DM is faced with a multidimensional state space, each dimension of which is called a variable, and solves a two-period problem. First, the DM chooses a policy intervention. The role of policy interventions is to distinguish which variables are chosen by nature and which variables are chosen by the DM. In our example, X = A × E × L is the state space, and a policy intervention is an element of P = (A ∪ {∅}) × (E ∪ {∅}) × (L ∪ {∅}), where ∅ means that the variable's value is selected by nature in the second period. For example, p = (∅, College, ∅) is a policy where both ability and lifetime earnings are chosen by nature in the second period, whereas education is chosen by the DM and takes value p_E = college; in such a case, we say that education is an intervened variable. Each choice of policy then determines a state space over the nonintervened variables; in this example, the second-period state space is A × L. In the second period, the DM takes this new state space as given and behaves as in the standard Savage model; that is, the DM chooses among monetary acts defined over A × L.

In this framework, we can define what causality means. Intuitively, we say that our DM, i.e., our statistical model, treats education as a cause of lifetime earnings if there is a ceteris paribus intervention on the education variable that affects the distribution of lifetime earnings. Formally, this amounts to finding two policies (say, p and p′) that satisfy the following four conditions: (i) education is intervened under p and p′ at different levels (say, p_E = college and p′_E = PhD); (ii) lifetime earnings are not intervened under either p or p′ (i.e., p_L = p′_L = ∅); (iii) all other variables are intervened at a constant level under both p and p′ (i.e., p_A = p′_A); and (iv) the DM's beliefs over lifetime earnings following the choice of p are different from the DM's beliefs over lifetime earnings following p′.
In this case, policies p and p′ amount to a ceteris paribus intervention on education that impacts the distribution of lifetime earnings, thus suiting our intuitive understanding of the phrase "education causes lifetime earnings". Section 4 formally presents this choice-theoretic definition of causality.

The DM's choice over policies in the first period and of acts over nonintervened variables in the second period define both the DM's causal view of the world and the DM's beliefs over A × E × L. However, nothing so far disciplines how causality and beliefs are related. Our axioms (Axioms 1, 2, and 3) discipline how our definition of causality interacts with the DM's beliefs. Intuitively, Axiom 1 states that variables must be logically independent, while Axioms 2 and 3 jointly state that the causes of a variable are the most proximal sources of information about that variable. Section 5 formally presents the axioms.

We prove two theorems. Theorem 1 identifies a unique class of models that is consistent with both our definition of causality and our axioms. Theorem 2 takes as given the model identified by Theorem 1 and shows that causal effects can be quantified in terms of correlation structures alone if, and only if, we consider an auxiliary axiom (Axiom 4). This result makes the model applicable to empirical research, since it states that what we want to calculate – causal effects – may be computed in terms of what we can calculate – independence structures.

Graphical methods and correlation structures.
Consider a distribution (say, µ) over the variables in our example. Arrange the variables in a graph using the following criterion: for any two variables X and Y, if X is not adjacent to Y – that is, if there is no link from X to Y nor from Y to X – then X is independent of Y conditional on their respective parents – that is, conditional on the variables that directly point towards X or Y. Graphs constructed in this way encode all conditional independence properties of µ (see, for example, Dawid [1], Geiger et al. [3], Lauritzen et al. [12]). Furthermore, given any probability distribution over a set of variables, the conditional independence structure of that distribution can always be represented by some DAG as described above.

For example, suppose that µ is a distribution over E, A, and L that can be factorized as follows:

µ(A, E, L) = µ(A) µ(E | A) µ(L | A)     (1)

Equation 1 implies that under µ, Education and Lifetime earnings are independent of each other once we condition on Ability. Thus, µ can be represented by either of the graphs depicted in Figure 1: in both graphs, E and L are nonadjacent, and the set of parents of {E, L} is {A}.

Figure 1: Two directed acyclic graphs that represent µ. (a) Graph G_µ represents µ: µ(A, E, L) = µ(A) µ(E | A) µ(L | A). (b) Graph Ĝ_µ also represents µ: µ(A, E, L) = µ(E) µ(A | E) µ(L | A).

Importantly, DAGs are a depiction of conditional independence; as such, arrows have no causal connotations. Note that both graphs represent µ. If arrows had causal meaning, then education would have both an indirect causal effect on earnings (via the path E → A → L in Ĝ_µ) and no causal effect on earnings (evidenced by the lack of a directed path from E to L in G_µ), which is a contradiction.

From correlation to causality.
While arrows in a DAG carry no causal connotations, they nevertheless form the backbone of many causal models (see, for example, Pearl [14] and follow-up work). Pearl's model circumvents this problem by using three primitives: a distribution µ over the relevant variables, a graph G that represents µ, and a Markov representation of µ that is compatible with G (for a formal definition of Markov representations, see Section 7.2, which can be read independently of the rest of the paper). Pearl's model provides no definition of what causality is but assumes that, whatever causality means, it can be represented by these three objects. Under this assumption, Pearl provides tools for quantifying causal effects in terms of independence structures.

Our model takes a complementary view to Pearl's model. First, we provide a formal definition of causality. Second, we provide normative axioms for how an econometric model should treat uncertainty. That causality can be represented via a DAG is therefore a theorem rather than an assumption. Furthermore, we provide the exact set of conditions under which causality may also be represented by a Markov representation. Again, that causality can be represented by a Markov representation is a theorem, not an assumption. Furthermore, comparing both sets of axioms showcases exactly the additional assumptions required to represent causality as in Pearl's model, relative to a treatment of causality based purely on DAGs.

The following notation is used throughout this paper. The set N = {1, ..., N} is a set of indexes. For each J ⊂ N, let {X_j : j ∈ J} be a family of sets indexed by J. We denote by X_J = Π_{j ∈ J} X_j the Cartesian product of the family and by x_J = (x_j)_{j ∈ J} a canonical element of X_J; furthermore, we use X to denote X_N to simplify notation. Moreover, all complements are taken with respect to N: if J ⊂ N, then J^A ≡ N ∖ J.
Given any set J ⊂ N and any x ∈ X, we use x_{−J} = x_{J^A} and X_{−J} = X_{J^A} to simplify notation. Finally, if J ⊂ N and E ⊂ X_J, then 1_E : X_J → {0, 1} denotes the indicator function of the event E; that is, 1_E(x_J) = 1 ⇔ x_J ∈ E.

The following notation refers to the graph-theoretic component of the model. A directed graph is a pair (V, E) such that V is a (finite) set of nodes and E ⊂ V × V is the set of edges. If two nodes i and j satisfy (i, j) ∈ E, we simplify the notation by writing i → j. Moreover, the set of parents of a node v ∈ V is the set Pa(v) = {v′ ∈ V : (v′, v) ∈ E}. A node v′ ∈ V is a descendant of a node v ∈ V whenever a directed path exists from v to v′; formally, whenever a sequence (v_1, ..., v_T) ∈ V^T exists such that v_1 = v, v_t is a parent of v_{t+1} for each t ∈ {1, ..., T − 1}, and v_T = v′. Analogously, v′ is an ancestor of v whenever v is a descendant of v′. A directed graph is a DAG if, and only if, for all v ∈ V, v is not a descendant of v. We denote by D(v) the set of descendants of v and by ND(v) the set of nondescendants.

3.2 Model description

Our DM faces a variant of the standard Savage problem. The state space is S = Π_{i=1}^N X_i, where each X_i is finite. We make this assumption for technical simplicity because causality is orthogonal to whether state spaces are finite or infinite. We let N = {1, ..., N}, and we call each i ∈ N a variable. The set A = ℝ^S is the set of Savage acts, and the DM has preferences ≻ over A.

However, our problem differs from Savage's since we incorporate policies that affect the states. This added language allows us to distinguish correlations from other types of relations among variables. The set of intervention policies is P = Π_{i=1}^N (X_i ∪ {∅}). The interpretation is as follows. Let a policy p ∈ P be such that p_i = ∅ for some i ∈ N. Then, this policy leaves variable i unaffected; that is, i is determined as it would have been in a standard Savage world.
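The policy space just defined can be sketched in a few lines of code for the three-variable example. All variable names and values below are illustrative, not part of the model; None plays the role of ∅ ("chosen by nature").

```python
# Illustrative sketch of the policy space P = Π_i (X_i ∪ {∅}) and of the
# second-period state space over the nonintervened variables N(p).
from itertools import product

A = ["low", "high"]              # ability (hypothetical values)
E = ["hs", "college", "phd"]     # education
L = ["under100k", "over100k"]    # lifetime earnings

P = list(product(A + [None], E + [None], L + [None]))  # policy space

def nonintervened(p, names=("A", "E", "L")):
    """N(p): the variables that policy p leaves to nature."""
    return [n for p_i, n in zip(p, names) if p_i is None]

p = (None, "college", None)      # intervene on education only
assert p in P
assert nonintervened(p) == ["A", "L"]   # second-period acts are over A × L
```

In this sketch, a policy that intervenes on every variable leaves an empty second-period state space, while the null policy (None, None, None) reproduces the standard Savage problem over A × E × L.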
However, if for some j ∈ N we have p_j = x_j ∈ X_j, then policy p forces variable j to take the value x_j; that is, the value of variable j is not determined as it would have been in a Savage problem but is chosen by the DM. Therefore, each policy implies a collection of interventions on the state space. Our model is one where the DM first chooses a policy from the set of all policies and then chooses a Savage act from the set of acts defined over the nonintervened variables.

We now define the primitive choice domain for our DM. Let p ∈ P be any policy, and let N(p) = {i ∈ N : p_i = ∅}. That is, N(p) are the variables that p leaves unaffected. Furthermore, let A(p) ≡ ℝ^{X_{N(p)}} be the set of acts defined over the variables that p leaves unaffected. Then, the primitive domain of choice for the DM is the set {(p, a) : p ∈ P, a ∈ A(p)}. That is, our DM's problem is to select an intervention policy and a Savage act over the nonintervened variables. We endow this DM with a preference relation ⪰̄ on {(p, a) : p ∈ P, a ∈ A(p)}.

Given ⪰̄, each p induces an intervention preference on A(p): for each p ∈ P and each f, g ∈ A(p), we say that f ≻_p g if, and only if, (p, f) ⪰̄ (p, g). Since our axioms focus on the DM's intervention preferences, it is convenient to express intervention preferences explicitly in terms of the values at which the DM intervenes on the variables. For each policy p ∈ P, if p_{−N(p)} = x_{−N(p)}, we use ≻_{x_{−N(p)}} to denote ≻_p. The special case where p = (∅, ..., ∅), so that no variables are intervened on, corresponds to the DM's preferences in a standard Savage world. For such a p, we use ≻_{(∅,...,∅)} = ≻ for notational simplicity.

From intervention preferences, we obtain intervention beliefs.
For each p ∈ P, we say that ≻_p has a belief representation if there is a probability distribution µ_p on X_{N(p)} such that for all events E, F ⊂ X_{N(p)}, µ_p(E) > µ_p(F) if, and only if, 1_E ≻_p 1_F. When such a representation exists, we say that µ_p is an intervention belief.

Intervention preferences (resp. beliefs) resemble Savage conditional preferences (resp. beliefs) but have important differences. Savage conditional preferences capture betting behavior conditional on the DM observing that a certain event was realized, whereas intervention preferences (beliefs) capture betting behavior after a controlled intervention on the relevant variables. To illustrate the difference, consider Example 1 below. Conditional preferences (resp. beliefs) are statements about item [1.], whereas intervention preferences (resp. beliefs) are statements about item [2.]. These are clearly different statements that do not imply one another. Therefore, we need language to distinguish these two decision problems, and intervention preferences provide such language.

Example 1.
Let f and g be acts over Mr. Kane's lifetime earnings defined as follows. Act f pays $1 if Mr. Kane's lifetime earnings are greater than $100K per year and −$1 otherwise. Act g is the opposite: it pays −$1 if lifetime earnings are greater than $100K per year and $1 otherwise. Consider the following statements:

1. "Having observed that Mr. Kane earned a college degree (of his own free will and ability), Alex prefers f to g."

2. "Having forced Mr. Kane to obtain a college degree (regardless of his desire or ability to do so), Alex prefers f to g."

Because this paper is concerned with understanding what a rational agent's approach to causality is, the role of axioms is exclusively normative. Whether actual humans adhere to these axioms is orthogonal to this paper. While the counterfactual-based setup presented above may seem hard to test in a laboratory with actual human subjects, this is not the objective of the exercise at hand. Since the DM in our paper is an analyst's econometric model of the world, the only question that matters is whether the analyst finds the axioms normatively appealing. Moreover, econometric models (as opposed to human subjects in a laboratory) are naturally built to handle counterfactual analysis of the sort presented above.
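The gap between observing and forcing in Example 1 can be made numerical. The sketch below uses a deliberately extreme, hypothetical specification in which ability drives both education and earnings while education has no direct effect on earnings, so any observed dependence of earnings on education is pure confounding; all numbers are made up for illustration.

```python
# Sketch: conditioning on education (item 1) vs. forcing education (item 2).
# Confounded fork: ability A causes both education E and earnings L.
from itertools import product

A, E, L = ["lo", "hi"], ["hs", "col"], ["poor", "rich"]
pA = {"lo": 0.5, "hi": 0.5}
pE_A = {"lo": {"hs": 0.9, "col": 0.1}, "hi": {"hs": 0.1, "col": 0.9}}
pL_A = {"lo": {"poor": 0.8, "rich": 0.2}, "hi": {"poor": 0.2, "rich": 0.8}}

mu = {(a, e, l): pA[a] * pE_A[a][e] * pL_A[a][l] for a, e, l in product(A, E, L)}

def cond_L_given_E(e):
    """Observational beliefs about L after observing education e: µ(l | e)."""
    pe = sum(mu[a, e, l] for a in A for l in L)
    return {l: sum(mu[a, e, l] for a in A) / pe for l in L}

def do_L_given_E(e):
    """Interventional beliefs: forcing E leaves ability, hence L, unchanged."""
    return {l: sum(pA[a] * pL_A[a][l] for a in A) for l in L}

assert cond_L_given_E("col") != cond_L_given_E("hs")   # observing is informative
assert do_L_given_E("col") == do_L_given_E("hs")       # forcing is not
```

Under this specification, Alex's bets after observing a college degree differ from her bets after forcing one, which is exactly why conditional and intervention preferences must be kept as separate primitives.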
In this section, we introduce the definition of causal effect, which formalizes the intuitive definition given in Section 2. We begin by introducing the definition of intervention independence. Consider a set of variables K and two variables i, j ∉ K. Informally, i is K-independent of j if, after eliminating the possibility that i and j are related through variables in K, the choice of acts over i is insensitive to interventions on j. Formally, we say that i is K-independent of j if the following holds: for all x_K ∈ X_K, all x_j, x′_j ∈ X_j, and all f, g ∈ ℝ^{X_i},

f ≻_{x_j, x_K} g ⇔ f ≻_{x′_j, x_K} g,
f ≻_{x_j, x_K} g ⇔ f ≻_{x_K} g.

The first condition indicates that, having intervened on K at value x_K, intervening on j at different values does not affect the DM's choice of acts in ℝ^{X_i}. The second condition indicates that, having intervened on K, the ability to intervene on j at all, regardless of the values at which it is intervened on, does not affect the DM's choice of acts in ℝ^{X_i}. Note that the second of these conditions implies the first. Indeed, if the second condition holds, then for all x_j, x′_j,

f ≻_{x_j, x_K} g ⇔ f ≻_{x_K} g ⇔ f ≻_{x′_j, x_K} g,

so the first condition also holds. This motivates the formal definition of intervention independence.

Definition 1.
For all i, j ∈ N and all K such that i, j ∉ K, we say that variable i is K-independent of variable j if, for all f, g ∈ ℝ^{X_i}, all x_j ∈ X_j, and all x_K ∈ X_K,

f ≻_{x_j, x_K} g ⇔ f ≻_{x_K} g.
To illustrate Definition 1, consider a DM who believes that Ability has a direct impact on Education and that education has a direct impact on Lifetime earnings but that ability has no direct impact on lifetime earnings. This is depicted in Figure 2 below. If a, a′ ∈ A are two ability levels and f, g ∈ ℝ^L are two acts on lifetime earnings, we might have the DM behave as follows: f ≻_a g and g ≻_{a′} f. This reversal indicates that L is not ∅-independent of A, which is intuitive: interventions on A affect beliefs about E, and beliefs about E affect beliefs about L. However, this is an effect of A on L that is mediated through E. As such, we do not want to use this as a basis to claim that A causes L. The correct way to capture the causal effect of A on L is to regard intervention preferences ≻_{(a,e)} as a function of a for each fixed e ∈ E. In other words, we want to ask whether L is {E}-independent of A. This motivates the formal definition of causal effect.
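This discussion can be sketched numerically for the chain A → E → L, with all probabilities made up for illustration: interventions on A alone shift the distribution of L, but once E is intervened on at a fixed level, they do not.

```python
# Sketch of K-independence (Definition 1) in the chain A → E → L of Figure 2:
# L is not ∅-independent of A, but L is {E}-independent of A.
pE_A = {"lo": {"hs": 0.8, "col": 0.2}, "hi": {"hs": 0.3, "col": 0.7}}
pL_E = {"hs": {"poor": 0.6, "rich": 0.4}, "col": {"poor": 0.2, "rich": 0.8}}

def dist_L(do_a, do_e=None):
    """Distribution of L after intervening A at do_a (and possibly E at do_e)."""
    if do_e is not None:                 # E intervened: the path from A is cut
        return pL_E[do_e]
    return {l: sum(pE_A[do_a][e] * pL_E[e][l] for e in pL_E)
            for l in ["poor", "rich"]}

# Non-ceteris-paribus interventions on A shift the distribution of L ...
assert dist_L("lo") != dist_L("hi")
# ... but holding E fixed, they do not: L is {E}-independent of A.
assert dist_L("lo", do_e="col") == dist_L("hi", do_e="col")
```

The mediated effect through E is exactly what Definition 2 below filters out by intervening on all variables other than the presumed cause and consequence.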
Figure 2: Variable A has no direct causal effect on L, but non-ceteris paribus interventions on A affect L through E.

Definition 2.
For all i, j ∈ N, we say that variable j causes variable i if i is not {i, j}^A-independent of j. Let Ca(i) = {j ∈ N : j causes i} denote the causal set of i. Finally, we say that j is an indirect cause of i if there is a sequence j_1, ..., j_T such that j_t causes j_{t+1} for all t ∈ {1, ..., T − 1}, j_1 = j, and j_T = i. We denote the set of indirect causes of a variable i by ICa(i).

Finally, if a variable i is such that Ca(i) = ∅, we say that i is an exogenous primitive; otherwise, we say that it is an endogenous variable. Indeed, when a DM forms a causal model of the world, the set of primitives of such a model is precisely the set of variables that are not caused by any other variable in the model. Exogenous primitives are relevant in our discussion of Axiom 1.

We conclude this section by defining the causal graph associated with a preference ⪰̄. Causal graphs are an integral part of our representation, which is introduced in Section 6. Given ⪰̄, draw a graph by letting the set of nodes be the set of variables and the set of arrows be defined by the causal sets, that is, by letting j → i ⇔ j ∈ Ca(i). This graph is well defined because Ca(i) is well defined for each i ∈ N. We denote such a graph as G(⪰̄).

Definition 3.
Let ⪰̄ be a preference and {Ca(i) : i ∈ N} be the collection of causal sets derived from ⪰̄. Define G(⪰̄) = (V, E) by setting V = N and E = {(j, i) : j ∈ Ca(i)}.

Our axioms are normative statements about how the DM should treat uncertainty as a function of the DM's causal model. Hence, our axioms tackle variations of the following question: given the DM's causal graph as per Definition 3, what normative restrictions should we impose on the DM's intervention beliefs? As such, the axioms are about conditional independence properties of the various intervention preferences ≻_p. Since the act notation for conditional independence is somewhat burdensome, we use the following simplifying notation.

Definition 4.
Let i ∈ N and let J, K, H ⊂ N be disjoint sets such that i ∉ J ∪ K ∪ H. We say that i is independent of J conditional on K after intervening on H if the following is true for all x_{J ∪ K ∪ H} ∈ X_{J ∪ K ∪ H} and all f, g ∈ ℝ^{X_i}:

x_K f ≻_{x_H} x_K g ⇔ x_K x_J f ≻_{x_H} x_K x_J g.     (2)

When the above holds, we write

i ⊥_H J | K.     (3)

In the case in which J is a singleton, J = {j}, we simply write

i ⊥_H j | K.     (4)

In terms of behavior, conditional independence says the following: i and J are independent if a DM would never pay for information about J when the DM's task is to predict i. Imagine a DM intervened on the variables in H at a specific level x_H. For instance, the DM might have carried out a controlled experiment, or this could simply be a thought experiment. Imagine further that the DM observed a specific realization of the variables in K. In this context, if the DM had to choose between f and g, he would have to compare x_K f with x_K g using preferences ≻_{x_H}. To aid the DM's decision, someone offers to reveal to the DM the value of the variables in J for a fee ε > 0. Is there an ε small enough that the DM would purchase this information? If the DM bought this information about J, then his problem becomes to compare x_K x_J f with x_K x_J g using preferences ≻_{x_H}. Since the definition states that the DM's choice in both situations is the same, the information is useless. Thus, the DM would not accept any price ε > 0.

We say that ≻_p is a (monotone) subjective expected utility preference if there exists a unique probability distribution µ_p ∈ Δ(X_{N(p)}) and a (monotone increasing) function u_p : ℝ → ℝ such that for all acts f, g ∈ ℝ^{X_{N(p)}}, condition (5) below holds. There are many axiomatizations of monotone expected utility preferences that fit the framework of our model, such as Gul [4], Fishburn [2], and Theorem 3 in Karni [10], among others. We let the reader select their preferred axiomatization.

f ≻_p g ⇔ Σ_{x_{N(p)} ∈ X_{N(p)}} u_p(f(x_{N(p)})) µ_p(x_{N(p)}) > Σ_{x_{N(p)} ∈ X_{N(p)}} u_p(g(x_{N(p)})) µ_p(x_{N(p)}).     (5)

Assumption 1.
For each J ⊂ N, the following are true.

i- For each p ∈ P, the preferences ≻_p are monotone subjective expected utility preferences.

ii- The state space is complete: for all i, j ∈ N, all x_{N∖{i}} ∈ X_{N∖{i}}, and all f, g ∈ ℝ^{X_i}, if j ∈ Ca(i), then f ≻_{x_{N∖{i}}} g ⇔ x_j f ≻_{x_{N∖{i,j}}} x_j g.

iii- There are no null states: for all x ∈ X, 1_{{x}} ≻ 0.

iv- Policies do not affect preferences: for all x, y ∈ ℝ and all p, p′ ∈ P, x 1_{X_{N(p)}} ≻_p y 1_{X_{N(p)}} ⇔ x 1_{X_{N(p′)}} ≻_{p′} y 1_{X_{N(p′)}}.

We treat Assumption 1 as a maintained assumption rather than as an axiom per se. Below, we examine each of the restrictions in Assumption 1.

As noted above, many axiomatizations exist that deliver item [i.], each with its own advantages and disadvantages. We let readers choose their preferred axiomatization of monotone expected utility. The importance of item [i.] is that all intervention preferences are probabilistically sophisticated, so intervention beliefs are always well defined. Item [iii.] states that being paid $1 if realization x ∈ X occurs is strictly preferred to receiving $0 for sure, thus guaranteeing that all states receive positive probability. Item [iv.] rules out the possibility that policies have a direct impact on the Bernoulli utility indexes, thus making u_p = u_{p′} for all p, p′ ∈ P.

Item [ii.] in Assumption 1 says that the state space is complete: given two variables (say, i and j), the state space includes all variables that could mediate effects between i and j. Once all variables k ≠ i, j are intervened on, either i causes j, j causes i, or i and j are independent. If j causes i, since all possible confounding variables have been intervened on, observing that x_j ∈ X_j was realized or intervening on variable j and shifting its value to x_j should lead to the same preference over ℝ^{X_i}. Violations of this axiom are reasonable only if the state space is missing some potential confounding variables.
In line with Savage [17], we assume that the state space is complete, so there are no missing confounding variables.

(Footnote to item iv: This assumption is not strictly needed, but it simplifies notation in some proofs. Since causality is orthogonal to whether Bernoulli utilities are constant in P, we feel comfortable maintaining this assumption.)

That the state space is complete has the following implication for the representation of preferences. Take two variables (say, i and j) and assume that all other variables are intervened on at a level x_{−{i,j}}. Let µ_{x_{−{i,j}}} be the belief representation of ≻_{x_{−{i,j}}}. In this two-variable environment, correlation and causation should coincide. Therefore, if i causes j, conditioning on j or intervening on j should lead
to the same posteriors about i. Namely,

µ_{x_{−{i,j}}}(x_i | x_j) = µ_{x_{−{i}}}(x_i).     (6)

Importantly, this relation is not symmetric. If j does not cause i, then the symmetric expression µ_{x_{−{i,j}}}(x_j | x_i) = µ_{x_{−{i,j}}, x_i}(x_j) is false: the left-hand side is a nonconstant function of x_i, whereas the right-hand side is constant in x_i. Thus, under complete state spaces, equation (6) identifies the direction of causality. We will use this observation in Section 6 when defining when a graph represents a preference.

That the state space is complete does not imply that the DM must know what all the relevant variables are. For instance, assume that Alex from the introduction is worried that the interaction among ability, education, and lifetime earnings might be affected by some other variable. Concretely, Alex thinks that some other variable might influence education: Alex does not know what this variable is but believes that it exists. For concreteness, denote this variable as an "unknown but possibly existing variable". Assumption 1 says that Alex's state space should include such a variable. Therefore, the state space should not be A × E × L but rather A × E × L × U, where U stands for "unknown but possibly existing variable". In short, Assumption 1 does allow the econometrician to add variables that act as proxies for unknown shocks to the system. Indeed, modeling a potential unknown confounder as exogenous noise shocks is a common way to proceed in empirical studies.

Assumption 1 states that the DM is probabilistically sophisticated but is silent about the statistical properties of causal sets. Axioms 1 through 3 discipline how causation interacts with the DM's probabilistic beliefs. In other words, they discipline how correlation and causation interact.

Axiom 1. For all I ⊂ N, there exists i ∈ I such that Ca(i) ∩ I = ∅.

Axiom 1 states that variables are logically independent of each other.
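The condition in Axiom 1 can be checked mechanically on a collection of causal sets; as a sketch (with hypothetical causal sets), it fails exactly when the causal graph of Definition 3 contains a cycle, since a cycle yields a set of variables none of which is exogenous relative to the others.

```python
# Sketch: checking Axiom 1 on causal sets Ca(i) from Definition 2. Axiom 1
# requires that every nonempty I ⊂ N contain some i with Ca(i) ∩ I = ∅.
from itertools import chain, combinations

def satisfies_axiom_1(Ca):
    variables = list(Ca)
    subsets = chain.from_iterable(
        combinations(variables, r) for r in range(1, len(variables) + 1))
    return all(any(not (Ca[i] & set(I)) for i in I) for I in subsets)

acyclic = {"A": set(), "E": {"A"}, "L": {"E"}}   # chain A → E → L
cyclic  = {"A": {"L"}, "E": {"A"}, "L": {"E"}}   # causal cycle A → E → L → A

assert satisfies_axiom_1(acyclic)
assert not satisfies_axiom_1(cyclic)   # I = {A, E, L} has no exogenous element
```

The brute-force enumeration over subsets is exponential and purely illustrative; it is the logical content of the axiom, not an algorithm the paper proposes.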
If the DM is asked to explain the relation between variables in I, and only those in I, Axiom 1 states that the DM has an explanation that involves at least one exogenous primitive relative to I. Models without primitives describe identities rather than relations among logically independent variables. Therefore, Axiom 1 states that the DM's state space includes only logically independent variables.

A potential critique of this axiom is that certain systems are inherently cyclical. For instance, the relation among the speed of a car, the distance traveled by the car, and the time traveled by the car is inherently circular: any two determine the third. The problem with this system is that speed is not caused by distance and time traveled; rather, speed is defined in terms of distance and time traveled. Therefore, the model includes variables that are not logically independent of one another. The correct model for analyzing this situation is one in which the only variables are the time and distance traveled by the car, as these are the only logically independent variables. In this sense, the assumption that no causal cycles exist is sensible.

A related critique of Axiom 1 is that it precludes the DM from viewing the world as a system of recursive structural equations. As such, Axiom 1 could be seen as precluding the DM from reasoning in terms of equilibrium equations (see, for example, the critique in Heckman and Pinto [6]). This assessment stems from interpreting functional relations as causal relations. However, the equations in a model (in particular, equilibrium equations) are succinct descriptions of the specific values that the variables may obtain; they say nothing of how those values are achieved. As such, causality and equilibrium equations are orthogonal issues.

To make the above discussion concrete, consider a general equilibrium model with aggregate demand curve D and aggregate supply curve S.
The equilibrium is defined as follows: $(p^*, q^*)$ constitutes an equilibrium if $D(p^*) = q^*$ and $S(p^*) = q^*$. Note that this is a definition; as such, the equilibrium price and equilibrium quantity are not logically independent. These equations describe the values one should expect for prices and quantities but are silent regarding the mechanism that generated them. This silence motivates the equilibrium convergence literature. For example, a tâtonnement convergence process is compatible with the general equilibrium equations without invoking feedback loops: a DM posits that prices in period $t$ cause quantities in period $t$ (via consumer/producer optimization) and that quantities in period $t$ cause prices in period $t+1$. That the process converges, so that $p_t = p_{t+1} = p^*$ and $q_t = q_{t+1} = q^*$, is orthogonal to the issue of causation. In short, one should not mistake functional equations, which simply describe relations between variables, for causal statements.

Axiom 2.
For all $i \in N$, if $J \subset Ca(i)$ and $H \subset N \setminus \{i\}$ is disjoint from $J$, then $i \not\perp_H (Ca(i) \setminus (J \cup H)) \mid J$.

Axiom 2 captures the following normative property about causation: the causes of a variable (say, $i$) are the most proximal sources of information about $i$. Suppose that one were to slice the set $Ca(i)$ into three slices: $J$, $H$ and $Ca(i) \setminus J \setminus H$. If one intervened on slice $H$ and observed the value of variables in $J$, would one pay for information about the remaining variables $j \in Ca(i) \setminus H \setminus J$? If $j$ is the most proximal source of information about $i$, then there should be an $\varepsilon > 0$ that the DM is willing to pay in exchange for information on the value of these variables $j$. If $j$ were rendered useless for prediction after information on the other causes were obtained, then $j$ would not itself be a proximal source of information about $i$ and therefore should not be called a "cause" of $i$. As a final remark, note that Axiom 2 is symmetric in the following sense: the only fundamental sources of information about $i$ are causes of $i$ and those variables that are directly caused by $i$.

Axiom 3. $(\forall i, j \in N)$, $(\forall K, J \subset N \setminus \{i, j\})$, $(J \cap K = \emptyset)$, if $i \notin Ca(j)$ and $j \notin Ca(i)$, then $i \perp\!\!\!\perp j \mid (Ca(i) \cup Ca(j) \cup J) \setminus K$.

While Axiom 2 describes the conditional independence properties of variables that are directly related to each other, Axiom 3 analyzes the independence properties of variables that are not directly related to each other. Axiom 3 states that two variables that do not cause each other are independent of each other once we condition on any set that includes $i$'s and $j$'s causes.

To understand Axiom 3's normative appeal, consider the DAG in Figure 3 below, where arrows point from cause to effect. If a DM had to predict the value of $i$ and knew the realizations $x_b$ and $x_c$ (and perhaps some other noncauses, say $k$), should the DM pay for information about the realization of $j$?
Our axiom says the DM should not pay for this information, which is quite sensible: once the DM knows the values of $b$ and $c$, he knows all there is to know about the relation between $i$ and $j$. Thus, extra information on $j$ is useless for predicting the value of $i$.

Figure 3: $i$ and $j$ are independent conditional on their respective causes.

A more general analysis of Axiom 3 proceeds in three steps. First, it is normatively appealing to say that $i$ and $j$ are not independent: because $i$ causes $c$ and $c$ causes $j$, it stands to reason that any information we have about $i$ will (via $c$) provide information about $j$. Similarly, $b$ provides another link between $i$ and $j$: since $b$ is a common cause of both, any information we have about $i$ should allow us to make inferences about $b$ and, in turn, inferences about $j$. Second, because neither $i$ causes $j$ nor vice versa, any information $i$ provides about $j$ will be mediated by some variable. Third, the mediating variable will either be a cause of $i$, a cause of $j$, or both. Indeed, if $i$ provided information about $j$ that is not mediated by any cause of $j$, then $i$ would be providing information about $j$ that is more proximal than the information contained by any cause of $j$. Therefore, $i$ should itself be a cause of $j$, which it is not. Combining these three observations implies that if we condition on the causes of both $i$ and $j$, then $i$ and $j$ should be conditionally independent. Finally, Axiom 3 states that the same condition remains true if we make the choice of $f$ vs. $g$ contingent on some additional noncauses, $J$.

While Axioms 1 through 3 are our basic axioms, Axiom 4 is a supplementary axiom that is relevant for Theorem 2. We present it here in the interest of containing all axioms and their corresponding discussions within a single section.

Axiom 4. $(\forall i \in N)$, $(\forall J \subset N \setminus \{i\})$, $(\forall f, g \in \mathbb{R}^{X_i})$, $(\forall x_{Ca(i) \cup J} \in X_{Ca(i) \cup J})$,
$$x_{Ca(i)}\, f \succ x_{Ca(i)}\, g \iff x_{Ca(i) \setminus J}\, f \succ_{x_J} x_{Ca(i) \setminus J}\, g.$$
Axiom 4 states that the following two decision problems are equivalent. Given a variable $i$ and acts $f, g \in \mathbb{R}^{X_i}$, the first problem is to choose $f$ or $g$ when their payments are contingent on the causes of $i$ obtaining a particular value, $x_{Ca(i)}$. In the second decision problem, the DM intervenes on a subset of causes of $i$ (say, moving $J \subset Ca(i)$ to the value $x_J$), and the payments of $f$ and $g$ are now contingent on the values of the nonintervened causes, $x_{Ca(i) \setminus J}$, being realized. From a numerical standpoint, both these situations result in the same value for the causes of $i$ (namely, $x_{Ca(i)}$); the difference is how those values are achieved. In the first problem, it is simply by selecting a standard Savage conditional act, while in the second problem, it is by a combination of interventions and Savage conditional acts. Because Axiom 4 requires that these two problems be treated identically, Axiom 4 implies that the only aspect of interventions that matters is the value the intervention sets for the variable. In other words, the act of intervening on a variable does not, in itself, change the DM's structural view of the world.

We use Figure 4 below to illustrate the normative appeal of Axiom 4.

Figure 4: Observing or intervening on $j$ makes the DM update differently about $k$. This difference in updating may affect the DM's beliefs about $i$.

First, we explain why Axiom 4 involves sets that are weakly larger than $Ca(i)$. Suppose that a DM has to choose between two acts over $i$ (say, $f, g \in \mathbb{R}^{X_i}$), the payments of which are contingent on $j$ taking value $x_j$. That is, the DM has to choose between $x_j f$ and $x_j g$. Note that $\{j\}$ is a strict subset of $Ca(i)$. Observing that $j$ takes the value $x_j$ gives the DM information about the value of $k$; in turn, this information about $k$ gives the DM information about $w$, which ultimately gives the DM information about $i$.
Thus, observing that $j$ took the value $x_j$ is informative about $i$ in two ways: directly, because $j \in Ca(i)$, and indirectly, via $k$ and $w$. If the DM intervenes on $j$ at value $x_j$, the DM receives the same direct information about $i$ but loses the indirect information mediated via $k$ and $w$. Thus, the DM could say that $x_j f \succ x_j g$ but $g \succ_{x_j} f$. Clearly, observing $x_j$ and intervening on variable $j$ to move it to value $x_j$ are different problems in terms of the DM's updating.

Now, consider the situation above but where the payments of $f$ and $g$ involve all causes of $i$: $j$ and $w$. That is, for some $x_j$ and $x_w$, the DM chooses between $x_j, x_w\, f$ and $x_j, x_w\, g$. For concreteness, suppose that $x_j, x_w\, f \succ x_j, x_w\, g$. If the DM intervened on $j$ and shifted it to the value $x_j$ and then had to choose between $x_w f$ and $x_w g$, would the DM lose any information? In other words, if the DM could either intervene and shift the value of $j$ to $x_j$ or simply condition his choice on value $x_j$ being realized, is there an $\varepsilon > 0$ the DM would pay to face one problem rather than the other? The answer is no: $w$ is observed to take value $x_w$; therefore, any information that $j$ could indirectly provide about $i$ through $w$ is still directly captured in the observation of $x_w$. Thus, intervening on $j$ entails no information loss relative to simply observing that $j$ took value $x_j$. The DM has the same information in both problems and should thus treat the problems equivalently. This is precisely what Axiom 4 requires.

Both of the above discussions addressed $J \subset Ca(i)$, but to complete our discussion of Axiom 4, we must allow that $J$ contains noncauses of $i$. Axiom 4 states that once we know the value of all the causes of $i$, intervening on variables that are not causes of $i$ is uninformative about $i$. In Figure 4, if an act's payments are contingent on $x_w$ and $x_j$, then intervening and shifting the value of $k$ to some $x_k$ is uninformative about $i$.

Representation
In this section, we define the representation we seek for $\bar{\succ}$. Since $\bar{\succ}$ will ultimately be associated with a collection of probability distributions, we proceed in two steps. First, we define what it means for a DAG to represent a single probability distribution. Then, we generalize to a family of probability distributions. For a reminder of our graph-theoretic notation, see Section 3.1.

Lauritzen et al. [12] provide a definition for when a DAG represents a probability distribution, say $\mu \in \Delta(\Pi_{i \in N} X_i)$. The objective of such a definition is to graphically represent the conditional independence structure of $\mu$. Let $\mu \in \Delta(\Pi_{i \in N} X_i)$, and let $G = (\{1, \ldots, N\}, E)$ be a DAG. The chain rule implies the following:
$$(\forall x \in \Pi_{i \in N} X_i), \quad \mu(x) = \Pi_{i=1}^{N}\, \mu(x_i \mid ND(i)). \quad (7)$$
Now, consider the DAG in Figure 5 below.

Figure 5: A DAG representing the distribution $\mu(a, b, w, j, i, k, z) = \mu(a)\mu(w)\mu(b \mid w, a)\mu(j \mid a)\mu(i \mid w, j)\mu(k \mid i)\mu(z \mid k)$.

In a DAG such as the one above, an arrow between two nodes indicates that the two nodes are never statistically independent. In this way, arrows encode which variables provide fundamental information about other variables, in the sense that the information transmitted by the source is not contained in any other variable. For instance, the DAG in Figure 5 conveys that $w$ and $j$ contain fundamental information about $i$ and thus that $i$ is never independent of $\{w, j\}$. Similarly, $i$ is never independent of its direct descendant, $k$. Now, consider a variable that is an ancestor of $i$; for example, $a$. Clearly, $a$ and $i$ are not independent: $a$ provides fundamental information about $j$, which provides fundamental information about $i$. However, any information that $a$ has about $i$ is implicitly encoded in $j \in Pa(i)$. Indeed, if $a$ carried fundamental information about $i$, there should be an arrow $a \to i$, but such an arrow is absent.
Similarly, $b$ provides information about $i$: $b$ is informative about $\{a, w\}$, both of which are informative about $i$. However, any information that $b$ has about $i$ is encoded in $\{j, w\}$. What this implies is that once we condition on the parents of $i$ (in this case, $\{w, j\}$), all nondescendants of $i$ are conditionally independent of $i$. Therefore, the terms $\mu(x_i \mid ND(i))$ in equation (7) simplify to $\mu(x_i \mid Pa(i))$. This observation motivates Definition 5 below.

Definition 5.
Let $\mu \in \Delta(\Pi_{i \in N} X_i)$. A DAG $(\{1, \ldots, N\}, E)$ represents $\mu$ if, and only if, the following hold:

i. $(\forall x \in \Pi_{i \in N} X_i)$, $\mu(x) = \Pi_{i=1}^{N}\, \mu(x_i \mid Pa(i))$;

ii. for all $(T_i)_{i \in N}$ with $T_i \subset Pa(i)$, if $\mu(x) = \Pi_{i=1}^{N}\, \mu(x_i \mid T_i)$, then $(\forall i \in N)$, $T_i = Pa(i)$.

Definition 5 makes two statements. First, a DAG represents a probability distribution if, and only if, the DAG summarizes the conditional independence properties of $\mu$ in the sense discussed previously. Second, the set of parents is the smallest set that allows for such a decomposition. Indeed, consider a set of nodes $V = \{a, b, c\}$ and a probability distribution $\mu(x_a, x_b, x_c) = \mu(x_a)\mu(x_b)\mu(x_c)$. Since all variables are statistically independent, both DAGs in Figure 6 represent this $\mu$. Indeed, both $\mu(x_a, x_b, x_c) = \mu(x_a)\mu(x_b \mid x_a)\mu(x_c \mid x_a)$ and $\mu(x_a, x_b, x_c) = \mu(x_a)\mu(x_b)\mu(x_c)$ are true statements. However, the first representation includes irrelevant arrows: the minimality requirement prevents this.

Figure 6: Both DAGs above represent the same probability distribution, $\mu(x_a, x_b, x_c) = \mu(x_a)\mu(x_b)\mu(x_c)$, but the top one includes irrelevant arrows.

Now, suppose $\succ$ were the DM's Savage preference defined on $\mathbb{R}^X$. Under Assumption 1, there is a well-defined belief representation of $\succ$, $\mu \in \Delta(X)$. We can then say that a graph $G$ represents $\succ$ if $G$ represents $\mu$, in the sense of Definition 5. However, Definition 5 is insufficient to define when a graph represents a preference $\bar{\succ}$. Indeed, a preference $\bar{\succ}$ is associated with the collection of induced Savage preferences $\{\succ_p : p \in P\}$. As such, $\bar{\succ}$ is associated with a family of beliefs, rather than a single belief, as in Savage's model.
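As a numerical illustration of the minimality clause in Definition 5 and the Figure 6 example, the following sketch uses three independent binary variables with illustrative marginals (the specific numbers are assumptions, not from the paper; numpy is assumed available). Both factorizations reproduce $\mu$, but the conditional $\mu(x_b \mid x_a)$ collapses to $\mu(x_b)$, so the arrow $a \to b$ is irrelevant and minimality selects the empty graph.

```python
import numpy as np

# Three mutually independent binary variables, as in Figure 6.
pa, pb, pc = 0.3, 0.6, 0.5
ma = np.array([1 - pa, pa])                       # µ(x_a)
mb = np.array([1 - pb, pb])                       # µ(x_b)
mc = np.array([1 - pc, pc])                       # µ(x_c)
mu = np.einsum('a,b,c->abc', ma, mb, mc)          # joint µ

mu_a = mu.sum(axis=(1, 2))
mu_b = mu.sum(axis=(0, 2))
mu_b_given_a = mu.sum(axis=2) / mu_a[:, None]     # µ(x_b | x_a)

# Factorization with the extra arrow a -> b, and the minimal one.
with_arrow = np.einsum('a,ab,c->abc', mu_a, mu_b_given_a, mc)
no_arrow = np.einsum('a,b,c->abc', mu_a, mu_b, mc)

print(np.allclose(with_arrow, mu), np.allclose(no_arrow, mu))  # True True
# ... but µ(x_b | x_a) = µ(x_b): the arrow carries no information,
# so minimality rules the first factorization out.
print(np.allclose(mu_b_given_a, mu_b[None, :]))                # True
```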
Thus, to define when a DAG represents preferences $\bar{\succ}$, we first define what it means for a DAG to represent a collection of probability distributions rather than a single probability distribution.

To define when a DAG represents preferences $\bar{\succ}$, we first define the truncation of a DAG. Let $G = (V, E)$ be a DAG, and let $W \subsetneq V$. The $W$-truncated DAG, $G_W$, is the DAG obtained by eliminating all nodes in $W$, together with their incoming and outgoing arrows. Formally, $G_W = (V \setminus W,\ E \cap (W^c \times W^c))$, where $W^c = V \setminus W$. This DAG is a useful representation of intervention beliefs. After variables in $W$ are intervened on, they no longer form part of the DM's statistical model; they are now deterministic objects that are statistically uninformative about the value of their parents. Thus, we exclude these variables from the corresponding DAG. In the context of our running example, if Alex observes that Mr. Kane obtained a college degree, his education is no longer random, but Alex can still make inferences about Mr. Kane's intellectual ability. Thus, education remains a legitimate element of Alex's statistical model. However, if Mr. Kane's education is intervened on and shifted to "college degree", then his education level is no longer random and, furthermore, is uninformative about his ability level. Thus, we exclude education from the DM's post-intervention model.
Figure 7: Right: the full econometric model; we know that $E$ is informative about $L$ because knowing $E$ is informative of its cause, $A$. Left: once $E$ is intervened on, it is no longer part of the econometric DAG, since $E$ is uninformative about its causes.

We can now define when a DAG, $G$, represents a preference $\bar{\succ}$. We say that a graph represents a preference if the appropriately truncated subgraph represents the corresponding intervention preference and the arrows are consistent with the direction of causality. This is formally presented in Definition 6 below.

Definition 6.
Let $G = (N, E)$ be a DAG and $\bar{\succ}$ be a DM's preference. Assume that for each $T \subset N$ and each $x_T \in X_T$, $\succ_{x_T}$ has a well-defined belief representation; let $\mu_{x_T}$ be the corresponding belief representation. We say that $G$ represents $\bar{\succ}$ if the following are true for each $T \subset N$ and each $x \in X$:

i. $G_T$ represents $\mu_{x_T}$;

ii. if $(i, j) \in E$, then $\mu_{x_{-\{i,j\}}}(x_j \mid x_i) = \mu_{x_{-\{j\}}}(x_j)$.

Note that nothing in this section is related to causality. Indeed, the statement that a graph represents a probability distribution is purely a statement about statistical independence. As such, the representation of a probability by a DAG is a statement about correlation, not causation. At this point in the exposition, DAGs used to represent probability distributions and DAGs used to represent causal statements are completely unrelated. It is precisely the job of Theorems 1 and 2 to show the exact conditions under which a DAG can simultaneously represent the DM's beliefs and the DM's causal model.
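The truncation operation used in condition (i) of Definition 6 is mechanical, and a minimal sketch may help fix ideas (the DAG encoding as node/edge sets is an illustrative device, not the paper's formalism). On Alex's model, truncating by $W = \{E\}$ reproduces the right panel of Figure 7: only the arrow $A \to L$ survives.

```python
def truncate(nodes, edges, w):
    """Return the W-truncated DAG G_W: drop every node in W together
    with all of its incoming and outgoing arrows."""
    keep = set(nodes) - set(w)
    return keep, {(i, j) for (i, j) in edges if i in keep and j in keep}

# Alex's model: A -> E, A -> L, E -> L; intervene on education E.
nodes = {"A", "E", "L"}
edges = {("A", "E"), ("A", "L"), ("E", "L")}
kept_nodes, kept_edges = truncate(nodes, edges, {"E"})
print(kept_nodes, kept_edges)  # {'A', 'L'} and the single arrow A -> L
```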
Results
Our first theorem is Theorem 1, stated below.
Theorem 1.
Let $\bar{\succ}$ satisfy Assumption 1. The following are equivalent:

i. Axioms 1 through 3 hold;

ii. $(\exists G)$ such that $G$ is a DAG and represents $\bar{\succ}$ in the sense of Definition 6.

Furthermore, if $G$ represents $\bar{\succ}$, then $G = G(\bar{\succ})$.

The literature on Bayesian graphs assumes that causal DAGs fulfill a dual role: they represent both causal assumptions and assumptions on conditional independence. This is clearly a joint assumption about how the analyst defines causality and how the analyst's definition of causality interacts with statements of conditional independence. Theorem 1 states the exact conditions under which a DAG can fulfill this dual role.

In particular, the uniqueness result implies that Definition 2 is the only definition of causality that satisfies our axioms. Suppose that a researcher has a definition of causality in mind (say, $C$) such that statements of the form "$i$ causes $j$ according to criterion $C$" are well defined. If $C$ satisfies our axioms, then $C$ can be represented via a DAG, $G$, such that $i \to j$ holds if, and only if, "$i$ causes $j$ according to criterion $C$". The uniqueness claim in Theorem 1 says that $G = G(\bar{\succ})$. Therefore, $i \to j$ also holds if, and only if, $i$ causes $j$ in the sense of Definition 2. Thus, $C$ must coincide with Definition 2.

Theorem 1 also provides a foundation for unifying and structuring our understanding of causation. The theorem states that any formal discussion of causality (as understood by Definition 2) begins with two items: a collection of probability laws, $\{\mu_p \in \Delta(X_{N(p)}) : p \in P\}$, and a DAG, $G$, that represents those laws. Models that include these components can legitimately be called models of "causation" regardless of any other details the model might include. However, models that cannot be phrased in terms of intervention beliefs and their representing DAG are not models of causality (again, as understood by Definition 2).
In short, researchers who find our axioms normatively appealing and who agree that Definition 2 is a sensible definition of causal effect are encouraged to use DAG-based models for conducting causal inference. Researchers who find our axioms normatively unappealing, or who disagree with Definition 2 as a sensible definition of causal effect, are encouraged to stay away from DAG-based models. In this way, Theorem 1 provides a foundation for selecting among models with which to empirically study causal effects.

In this section, we consider the following question. Let $\mu \in \Delta(X)$ be the DM's beliefs elicited from his Savage preference and $\mu_p$ be the DM's beliefs elicited from an intervention preference $\succ_p$. When can we express $\mu_p$ as a function of $\mu$? Theorem 2 in this section answers this question, and Proposition 2 in Appendix B provides further results along the same lines.

Answering this question makes the model applicable to empirical research. When $\mu_p$ is expressed in terms of $\mu$ (henceforth, when $\mu_p$ is identified), any information that allows a DM to update his Savage beliefs, $\mu$, also allows the DM to update his intervention beliefs, $\mu_p$. If an analyst had access to a perfectly controlled setting, the analyst could directly estimate each $\mu_p$, and models of causal inference would be unnecessary. However, most empirical work in economics is observational, in the sense that direct policy interventions are unavailable to the researcher. Proposition 2 and Theorem 2 bridge the gap between intervention beliefs – what the econometrician wants to calculate – and standard conditional probabilities – what the econometrician can calculate.

When added to Axioms 1 through 3, Axiom 4 yields a model in which different intervention beliefs, $\mu_p$, can be expressed in terms of $\mu$. In what follows, we remind the reader of Axiom 4 and illustrate Theorem 2 by means of two simple examples. Then, we state and discuss the general form of Theorem 2.

Axiom 4.
$(\forall i \in N)$, $(\forall J \subset N \setminus \{i\})$, $(\forall f, g \in \mathbb{R}^{X_i})$, $(\forall x_{Ca(i) \cup J} \in X_{Ca(i) \cup J})$,
$$x_{Ca(i)}\, f \succ x_{Ca(i)}\, g \iff x_{Ca(i) \setminus J}\, f \succ_{x_J} x_{Ca(i) \setminus J}\, g.$$

Example 2.
Blake is an econometrician who believes that ability causes both education and lifetime earnings and that education causes lifetime earnings. This model is graphically depicted in Figure 8. To understand the direct effect of education on lifetime earnings, Blake has to understand how $\mu_{(a,e)}(\cdot)$ changes with $e \in E$ for each fixed $a \in A$. However, Blake cannot access a controlled environment, so Blake has no data on $\mu_{(a,e)}$. Under Axiom 4, though, data from controlled environments are unnecessary. When $J = \{A, E\}$, Axiom 4 implies $\mu_{(a,e)}(\cdot) = \mu(\cdot \mid a, e)$. Thus, the direct causal effect of education on lifetime earnings is calculated by computing how $\mu(\cdot \mid a, e)$ varies with $e$ for each value of $a$. Note that $\mu(\cdot \mid a, e)$ is a standard conditional probability, and data on this quantity are available in observational datasets. Blake can therefore use data from outside a controlled environment to form his intervention beliefs.
Figure 8: Causal effects are identified: $\mu_{(a,e)}(l) = \mu(l \mid a, e)$.

Example 2 above provides a simple case in which one can directly substitute conditionals for interventions. Example 3 below shows an example where slightly more work is involved in identifying intervention beliefs.
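Before turning to that example, Example 2's identity $\mu_{(a,e)}(l) = \mu(l \mid a, e)$ can be checked by simulation. The structural equations below are illustrative assumptions consistent with Figure 8 ($A \to E$, $A \to L$, $E \to L$), not taken from the paper; the point is only that the observational conditional and the intervention belief agree up to sampling error.

```python
import random

random.seed(0)

def draw(do_a=None, do_e=None):
    # Toy binary structural model for Figure 8: A -> E, A -> L, E -> L.
    # Functional forms and probabilities are illustrative assumptions.
    a = do_a if do_a is not None else random.randint(0, 1)
    e = do_e if do_e is not None else (1 if random.random() < 0.2 + 0.6 * a else 0)
    l = 1 if random.random() < 0.1 + 0.3 * a + 0.4 * e else 0
    return a, e, l

n = 200_000
obs = [draw() for _ in range(n)]

# Observational conditional µ(l = 1 | a = 1, e = 1).
sel = [l for (a, e, l) in obs if a == 1 and e == 1]
cond = sum(sel) / len(sel)

# Intervention belief µ_{(a,e)}(l = 1): force A = 1 and E = 1.
do_prob = sum(draw(do_a=1, do_e=1)[2] for _ in range(n)) / n

print(abs(cond - do_prob) < 0.02)  # the two estimates agree
```

The agreement here reflects that, once all causes of $L$ are conditioned on, no back-door path is left; in Example 3's graph the same naive substitution would fail.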
Example 3.
Charlie is a colleague of Blake. However, Charlie believes that people are not born with intrinsic ability. On the contrary, education causes ability, and ability is the sole cause of lifetime earnings. Charlie's causal DAG is depicted in Figure 9. Charlie is interested in studying the indirect effect that education policies have on lifetime earnings, which can be done by applying Axiom 4 twice. First, set $J = \{E\}$, $i = A$ to obtain $\mu_e(a) = \mu(a \mid e)$ for each $(a, e) \in A \times E$. Second, set $J = \{E\}$, $i = L$ to obtain $\mu_e(l \mid a) = \mu(l \mid a)$ for each $(e, a, l) \in E \times A \times L$. Finally, we obtain the following derivation:
$$\mu_e(l) = \sum_a \mu_e(l, a) = \sum_a \mu_e(l \mid a)\, \mu_e(a) = \sum_a \mu(l \mid a)\, \mu(a \mid e).$$
Thus, calculating the indirect effect of $E$ on $L$ requires computing $\mu(l \mid a)$ and $\mu(a \mid e)$, both of which can be computed with data from observational studies. Even if access to a controlled environment is unavailable, the identification of $\mu_e$ implies that such data are unnecessary.
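The chain-rule identification above can also be checked by simulation. The structural equations below are illustrative assumptions consistent with Figure 9 ($E \to A \to L$), not from the paper; the check compares the identified quantity $\sum_a \mu(l \mid a)\mu(a \mid e)$, estimated from observational draws only, against a direct intervention on $E$ in the simulator.

```python
import random

random.seed(1)

def draw(do_e=None):
    # Toy binary structural model for Figure 9: E -> A -> L.
    # Functional forms and probabilities are illustrative assumptions.
    e = do_e if do_e is not None else random.randint(0, 1)
    a = 1 if random.random() < 0.3 + 0.5 * e else 0
    l = 1 if random.random() < 0.2 + 0.6 * a else 0
    return e, a, l

n = 200_000
obs = [draw() for _ in range(n)]   # observational sample of (e, a, l)

def p(pred, given):
    # Empirical conditional probability from the observational sample.
    sel = [x for x in obs if given(x)]
    return sum(1 for x in sel if pred(x)) / len(sel)

mu_l_a = {a: p(lambda x: x[2] == 1, lambda x, a=a: x[1] == a) for a in (0, 1)}
mu_a_e = p(lambda x: x[1] == 1, lambda x: x[0] == 1)   # µ(a = 1 | e = 1)

# Chain-rule identification of µ_{e=1}(l = 1) from observational data alone.
identified = mu_l_a[1] * mu_a_e + mu_l_a[0] * (1 - mu_a_e)

# Ground truth obtained by intervening on E directly in the simulator.
direct = sum(draw(do_e=1)[2] for _ in range(n)) / n

print(abs(identified - direct) < 0.02)  # the two estimates agree
```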
Figure 9: The indirect causal effect of $E$ on $L$ is identified: $\mu_e(l) = \sum_a \mu(l \mid a)\, \mu(a \mid e)$.

The examples highlight two simple cases in which intervention beliefs are identified. First, if $j$ is a cause of $i$, then the direct causal effect that $j$ has on $i$ is identified via the formula $\mu_{x_{-\{i,j\}}, x_j}(x_i) = \mu(x_i \mid x_j, x_{Ca(i) \setminus \{j\}})$. Thus, one can obtain the direct causal effect of $j$ on $i$ by conditioning on all causes of $i$ and analyzing how that conditional probability varies with $x_j$. Similarly, if $j$ causes $k$, $k$ causes $i$, and this is the only connection between $j$ and $i$, the indirect causal effect of $j$ on $i$ is calculated by following the chain rule: $\mu_{x_j}(x_i) = \sum_{x_k} \mu(x_i \mid x_k)\, \mu(x_k \mid x_j)$. However, other intervention beliefs may also be identified. The rest of this section is devoted to understanding the exact conditions under which intervention beliefs are identified. Before we can present Theorem 2, we need to define two auxiliary concepts:
Markov representations and do-probabilities. We introduce these by means of a simple numerical example first, where we argue how Markov representations are related to a specific view of causality. We then provide a formal definition of both Markov representations and do-probabilities and state Theorem 2.

Consider a distribution $\mu$ and a graph $G$ as in Figure 10: ability is a common cause of education and lifetime earnings, and education is a further cause of lifetime earnings.
Figure 10: A graph, $G$, representing a distribution $\mu(A, E, L) \equiv \mu(A)\mu(E \mid A)\mu(L \mid A, E)$.

In a Markov representation of $\mu$ that is compatible with $G$, each variable is assumed to be written as a deterministic function of its parents in the graph and its own idiosyncratic noise. One way to understand such functions is as production functions: in this example, ability is exogenous, education is stochastically "produced" with units of ability, and lifetime earnings are stochastically "produced" with units of education and ability. For example, the following is a Markov representation of a distribution $\mu$ compatible with the graph $G$ in Figure 10:
$$A = \alpha, \quad \alpha \sim U(0,1),$$
$$E = A + \varepsilon, \quad \varepsilon \sim U(0,1),$$
$$L = A + \lambda + E, \quad \lambda \sim U(0,1).$$
In particular, the distribution of $L$ conditional on $E = e$ is given by the following expression:
$$\Pr(L \leq l \mid E = e) = \Pr\big(\alpha + \lambda + (\alpha + \varepsilon) \leq l \mid (\alpha + \varepsilon) = e\big).$$
One may express the above system of equations purely in terms of the independent noise shocks $\alpha$, $\varepsilon$ and $\lambda$ and thus obtain the joint distribution $\mu \in \Delta(A \times E \times L)$.

Do-probabilities are calculated from Markov representations, and they are the tool Pearl's model uses to quantify causal effects. Suppose that we want to calculate the direct effect of education on lifetime earnings; that is, we want to calculate what weight arrow $E \to L$ carries without including the effect of path $E \leftarrow A \to L$. Doing this requires eliminating the dependence of $E$ on $A$. To eliminate the dependence of $E$ on $A$, we first eliminate the equation that determines the value of $E$ in the Markov representation, namely, $E = A + \varepsilon$. Second, any time that $E$ would appear in any other equation, we treat $E$ as a deterministic value.
Finally, we calculate the joint distribution of $A$ and $L$ for each deterministic value of $E$. Concretely, the distribution of $L$ do-$E$ is obtained by solving the following system, where $E$ is treated as a number rather than a random variable:
$$A = \alpha, \quad \alpha \sim U(0,1),$$
$$L = A + \lambda + E, \quad \lambda \sim U(0,1),$$
which, purely in terms of the random shocks, simplifies to the following:
$$A = \alpha, \quad \alpha \sim U(0,1),$$
$$L = \alpha + \lambda + E, \quad \lambda \sim U(0,1).$$
In the expression above, the distribution of $L$ do-$E = e$, denoted as $\mu(L \mid do(E = e))$, is given by $\mu(L \leq l \mid do(E = e)) = \mu(\alpha + \lambda \leq l - e)$. The dependence of this expression on the value $e$ indicates that education indeed has a causal effect on $L$; furthermore, this expression differs from $\mu(L \mid E = e)$ because $\mu(L \mid do(E = e))$ disregards the confounding effect of the arch $E \leftarrow A \to L$.

Theorem 2 shows that under Axioms 1, 2, 3 and
4, the DM's beliefs, $\mu$, can be represented by a Markov representation. In particular, intervention beliefs exactly coincide with do-probabilities, as illustrated above, so all the tools from Pearl [14] apply to intervention beliefs.

Markov representations, do-probabilities, and Theorem 2. Below, we present a formal definition of a Markov representation.
Definition 7.
Let $\mu \in \Delta(X)$. For each $i \in N$, let $\mu_i$ be the marginal over $X_i$. For each $i \in N$, let $\varepsilon_i$ be a random variable with range $\mathcal{E}_i$, let $G$ be the DAG defined by a family of sets of parents $(Pa(i))_{i \in N}$, and let $h_i$ be a function $h_i : X_{Pa(i)} \times \mathcal{E}_i \to X_i$. Let $\phi$ be the joint distribution of the vector $(\varepsilon_1, \ldots, \varepsilon_N)$. A Markov representation of $\mu$ compatible with $G$ is a tuple $((h_1, \ldots, h_N), (\varepsilon_1, \ldots, \varepsilon_N))$ that satisfies the following:

• $(\forall i, j)$, $\varepsilon_i$ is independent of $\varepsilon_j$;

• $\mu$ can be recovered implicitly as a solution to the following system of equations:
$$\mu_i(x_i) = \phi\big(\{\varepsilon : h_i(x_{Pa(i)}, \varepsilon_i) = x_i\}\big), \quad (i \in \{1, \ldots, N\}). \quad (8)$$

Markov representations are used in statistical causality to numerically represent causal effects (see Pearl [14]) via do-probabilities, defined below.

Definition 8. Let $\mu \in \Delta(X)$ be a probability distribution, $G$ be a DAG that represents $\mu$, and $((h_1, \ldots, h_N), (\varepsilon_1, \ldots, \varepsilon_N))$ be a Markov representation of $\mu$ compatible with $G$. Given two disjoint sets of variables, $I$ and $J$, the do-probability $\mu(x_I \mid do(x_J))$ is calculated as follows:

1. For all $j \in J$, eliminate from system (8) in Definition 7 all the formulas $\mu_j(x_j) = \phi(\{\varepsilon : h_j(x_{Pa(j)}, \varepsilon_j) = x_j\})$.

2. For each $i \notin J$ and for each $j \in Pa(i) \cap J$, input value $x_j$ into the corresponding formula in system (8) of Definition 7.

3. Calculate the probability of realization $x_I$ in the model resulting from applying steps 1 and 2 above.

Having defined Markov representations and do-probabilities, we can now state Theorem 2.
Theorem 2.
Let $\bar{\succ}$ satisfy Assumption 1, and let $(\mu_{x_I})_{I \subset N}$ be the subjective beliefs elicited from $\bar{\succ}$. The following statements are equivalent:

• Axioms 1 through 3 and Axiom 4 hold;

• There exists a Markov representation of $\mu$, $(G, (h_1, \ldots, h_N), (\varepsilon_1, \ldots, \varepsilon_N))$, such that

– $(\forall J \subset N)$, $(\forall x_J \in X_J)$: $\mu_{x_J} = \mu(\cdot \mid do(x_J)) \in \Delta(X_{-J})$;

– $G$ represents $\bar{\succ}$.

Furthermore, if $G$ represents $\bar{\succ}$, then $G = G(\bar{\succ})$.

The crucial contribution of Theorem 2 is that it clarifies the role of do-probabilities in the understanding of causal effects. Do-probabilities are often presented as the definition of a causal effect. As Pearl writes in [15]: "the definition of a 'cause' is clear and crisp; variable X is a probabilistic-cause of variable Y if $P(y \mid do(x)) \neq P(y)$ for some values x and y." Theorem 1 states that one can legitimately represent causal effects based on interventions via a DAG that may, nonetheless, be incompatible with any system of do-probabilities. The causal DAG will be compatible with a set of do-probabilities only when Axiom 4 is added to the list of basic axioms. This result is analogous to the exercise conducted by Machina-Schmeidler [13]: just as expected utility and probabilistic sophistication can be behaviorally separated, we show that the graph-theoretic aspects of Pearl-like models can be separated from the do-probability formalism. The substantive assumptions about causality are conveyed by the DAG, while do-probabilities represent an additional assumption about when interventions and simple observations can be used interchangeably. In short, the notion of causality represented by a do-probability is strictly stronger than the notion of causality represented by a DAG.

Theorem 2 further clarifies that Axiom 4 is the fundamental property that links do-probabilities with intervention beliefs. When defining Markov representations, the functions $h_i(\cdot)$ are not indexed by whether their arguments have been observed or intervened on.
The functions $h_i(\cdot)$ only concern the numerical values of their arguments and not the method through which these numerical values are obtained. This is an implicit assumption of Pearl's model, and it is delivered by Axiom 4. Furthermore, note that Definition 7 implicitly requires that the Markov representation that defines do-probabilities has a unique solution. While this characteristic has sometimes been pointed to as a limitation of the theory (see Halpern [5]), under Axiom 4, this requirement is without loss of generality.

Finally, while do-probabilities are commonly referred to as the causal effect of one variable on another, it is important to be cautious with language. Do-probabilities reflect the effect that an intervention on a set of variables has on the whole system of equations; that is, do-probabilities capture both the direct and indirect effects of interventions. For example, consider the DAG in Figure 11. This DAG states that there is no direct causal effect of $A$ on $C$; however, $\Pr(x_C \mid do(x_A)) = \Pr(x_C \mid x_A)$, which is a nontrivial function of $x_A$. Indeed, intervening on $A$ has an effect on $B$, which in turn affects $C$. In this example, $\Pr(x_C \mid do(x_A))$ captures this indirect effect. In line with our definition of causal effects, the causal effect of $A$ on $C$ is given by how $\Pr(x_C \mid do(x_A, x_B))$ changes with $x_A$. In this case, $\Pr(x_C \mid do(x_A, x_B))$ is a constant function of $x_A$, which is consistent with $A$ having no direct causal impact on $C$.

Figure 11: $A$ has no direct causal effect on $C$, but $\Pr(x_C \mid do(x_A))$ is a nontrivial function of $x_A$.

In economic theory, the work most closely related to ours is a series of papers by Spiegler ([18], [19], [20]). The main difference is the focus of the papers. Spiegler's work does not provide a definition of the term "causal effect", except that it can be represented via a DAG that satisfies two properties.
First, the DAG factorizes the correlation structure in the DM's beliefs; second, the arrows in the DAG are interpreted as pointing from cause to effect. Given these assumptions, Spiegler asks what types of mistakes a DM with a misspecified causal model might make. In our paper, we first define what a causal relation is and then seek to understand which axioms on behavior allow us to represent causal effects in the language of DAGs that factorize the DM's beliefs. The uniqueness claim in Theorem 1 provides the point of contact between the two papers. Under our definition of causality, a DAG can simultaneously factorize the DM's beliefs while retaining a causal interpretation only if Axioms 1 through 3 hold. Furthermore, under the axioms, a graph $G$ both represents a DM's correlation structure and is interpreted causally (in the sense that arrows point from cause to effect) only if the definition of causal effect is as in Definition 2.

In decision theory, Karni ([10], [11]) explores models where a DM can affect the states that are realized. In those papers, the primitive objects are a set of actions and a set of consequences. States of nature are defined as mappings from actions that a DM might take to consequences that arise from those actions; that the mapping Action → Outcome is stochastic reflects that states are stochastically realized. A DM can affect the states that occur by making an appropriate choice of action. This idea is similar to our idea of a policy intervention, since a policy $p$ can be seen as an action that the DM takes that affects the realization of states. Indeed, we can map Karni's set of actions to our set $P$ and Karni's outcomes to realizations of our state space, $X$, so that a Karni state is a mapping $s : P \to X$. The main difference arises in that we impose – objectively – a consistency condition: if a policy $p$ intervenes and shifts variable $j$ to value $x_j$, a state $s$ cannot map this policy to a realization $x_j' \neq x_j$.
Karni has a version of this condition, but it is imposed subjectively on preferences. Moreover, the focus of Karni's papers is not on using these ideas to discuss causal effects or to understand what types of models reflect normative definitions of causality. Rather, Karni focuses on obtaining subjective expected utility representations of his preferences. For this reason, while a formal connection exists, the substance of the research agenda is different.

The statistics and computer science literature includes research that uses graphical methods to represent the conditional independence structure of any given joint probability law (see Dawid [1], Geiger et al. [3], and Lauritzen et al. [12]). Specifically, Dawid [1] and Geiger et al. [3] show that, given a probability distribution over a set of variables, p(·), and given a graph G that represents p, the D-separation criterion for graphs (see Definitions 11 and 9) summarizes the independence structure of p. Our proof relies on the one-to-one correspondence between variables that satisfy the D-separation criterion and variables that are conditionally independent. This is the main point of contact between our paper and that body of work. Lauritzen et al. [12] provide alternative graphical tests for D-separation that may be used to obtain alternative proofs of our results.

In causal statistics, the most closely related papers are those in the Bayesian networks literature (see Spirtes [21], Pearl [14], and follow-up work). Two main points of contact exist between that literature and our paper. First, the statistical causality literature offers no formal definition of the term "causal relation", and the exact meaning of this phrase is left to the researcher's common sense. As Pearl states, "The first step in this analysis is to construct a causal diagram such as the one given in Fig. [1] (sic), which represents the investigator's understanding of the major causal influences among measurable quantities in the domain" and later,
"The purpose of the paper is not to validate or repudiate such domain-specific assumptions but, rather, to test whether a given set of assumptions is sufficient for quantifying causal effects from non-experimental data, for example, estimating the total effect of fumigants on yields". Second, the numerical value of the causal effect of one variable on another (e.g., education on lifetime earnings) is given by the do-probability formalism. As Pearl writes in [15], "the definition of a 'cause' is clear and crisp; variable X is a probabilistic-cause of variable Y if P(y | do(x)) ≠ P(y) for some values x and y." By contrast, we show that, under Axioms 1 through 4, there exists a unique definition of causal effect that is both representable via a DAG and consistent with an interventionist perspective on causality. Thus, we show that causal models based on causal diagrams implicitly impose a specific definition of causality. Moreover, Axioms 1 through 4 neither imply nor are implied by a representation of causality in terms of do-probabilities. Contrary to Pearl's quote, do-probabilities neither define nor are defined by the definition of causality embodied by the causal diagram. Theorem 2 shows that, under Axioms 1 through 4, causal effects are represented via a DAG that is compatible with the do-probability formulas. This makes explicit the fundamental restrictions imposed by using do-probabilities to numerically quantify causal effects.

References

[1] A. P. Dawid. Conditional independence in statistical theory. Journal of the Royal Statistical Society, Series B (Methodological), pages 1–31, 1979.
[2] P. C. Fishburn. Preference-based definitions of subjective probability. The Annals of Mathematical Statistics, 38(6):1605–1617, 1967.
[3] D. Geiger, T. Verma, and J. Pearl. Identifying independence in Bayesian networks. Networks, 20(5):507–534, 1990.
[4] F. Gul. Savage's theorem with a finite number of states. Journal of Economic Theory, 1992.
[5] J. Y. Halpern. Axiomatizing causal reasoning. Journal of Artificial Intelligence Research, 12:317–337, 2000.
[6] J. Heckman and R. Pinto. Causal analysis after Haavelmo. Econometric Theory, 31(1):115–151, 2015.
[7] M. A. Hernan and J. M. Robins. Causal Inference. CRC Press, Boca Raton, FL, 2010.
[8] Y. Huang and M. Valtorta. Pearl's calculus of intervention is complete. arXiv preprint arXiv:1206.6831, 2012.
[9] G. W. Imbens and D. B. Rubin. Causal Inference in Statistics, Social, and Biomedical Sciences. Cambridge University Press, 2015.
[10] E. Karni. Subjective expected utility theory without states of the world. Journal of Mathematical Economics, 42(3):325–342, 2006.
[11] E. Karni. States of nature and the nature of states. Economics & Philosophy, 33(1):73–90, 2017.
[12] S. L. Lauritzen, A. P. Dawid, B. N. Larsen, and H.-G. Leimer. Independence properties of directed Markov fields. Networks, 20(5):491–505, 1990.
[13] M. J. Machina and D. Schmeidler. A more robust definition of subjective probability. Econometrica, pages 745–780, 1992.
[14] J. Pearl. Causal diagrams for empirical research. Biometrika, 82(4):669–688, 1995.
[15] J. Pearl. Bayesianism and causality, or, why I am only a half-Bayesian. In Foundations of Bayesianism, pages 19–36. Springer, 2001.
[16] P. R. Rosenbaum and D. B. Rubin. The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1):41–55, 1983.
[17] L. J. Savage. The Foundations of Statistics. Courier Corporation, 1972.
[18] R. Spiegler. Bayesian networks and boundedly rational expectations. The Quarterly Journal of Economics, 131(3):1243–1290, 2016.
[19] R. Spiegler. Data monkeys: A procedural model of extrapolation from partial statistics. The Review of Economic Studies, 84(4):1818–1841, 2017.
[20] R. Spiegler. Can agents with causal misperceptions be systematically fooled? Journal of the European Economic Association, 2018.
[21] P. Spirtes, C. N. Glymour, R. Scheines, D. Heckerman, C. Meek, G. Cooper, and T. Richardson. Causation, Prediction, and Search. MIT Press, 2000.
A Proofs
Proposition 1.
Let ≻̄ = (≻_J)_{J⊂N} be a DM's preferences, and let G(≻̄) = (N, E) be the directed graph defined by setting Pa(i) = Ca(i) for each i ∈ N. If ≻̄ satisfies Assumption 1, then the following are true:
• If G = (N, F) is a directed graph that represents ≻̄, then (j, i) ∈ F ⇒ j ∈ Ca(i).
• If G = (N, F) is a directed graph that represents ≻̄, then j ∈ Ca(i) ⇒ (j, i) ∈ F or i ∈ Ca(j).

Proof. Let ≻̄ be as in the statement of the proposition, G(≻̄) be the directed graph defined by setting Pa(i) = Ca(i) for each i ∈ N, and G = (N, F) be any other directed graph that represents ≻̄. For each I ⊂ N and each realization x_I ∈ X_I, let µ_{x_I} ∈ Δ(X_{−I}) represent beliefs obtained from ≻_{x_I}.

We first show that j ∈ Ca(i) ⇒ (j, i) ∈ F or i ∈ Ca(j). If j ∈ Ca(i), then the function T : X_j → ℝ defined as T(x_j) = µ_{x_{−{i,j}}, x_j}(x_i) is not constant in x_j. Additionally, by Assumption 1, µ_{x_{−{i,j}}}(x_i | x_j) = T(x_j). Thus, i and j are not independent after intervening on {i, j}^A. Because G represents ≻̄, G_{−{i,j}} represents ≻_{−{i,j}}. Thus, either (i, j) ∈ F or (j, i) ∈ F (if not, G_{−{i,j}} would treat i and j as independent, which is a contradiction). If (j, i) ∈ F, the proof concludes. Therefore, let (j, i) ∉ F, so that (i, j) ∈ F. Because G represents ≻̄, this means that µ_{x_{−{i,j}}}(x_j | x_i) = µ_{x_{−{j}}}(x_j). By definition, the above equation says i ∈ Ca(j), as desired.

We now show that (j, i) ∈ F ⇒ j ∈ Ca(i). First, note that for all x ∈ X, µ_{x_{−{i,j}}}(x_i, x_j) = µ_{x_{−{i,j}}}(x_j) µ_{x_{−{i,j}}}(x_i | x_j). Because G represents ≻̄, (j, i) ∈ F and the minimality condition in Definition 5 jointly imply that i and j are not independent after intervening on {i, j}^A. That is, µ_{x_{−{i,j}}}(x_i | x_j) is not constant in x_j.
Moreover, because G represents ≻̄ and (j, i) ∈ F, we obtain that µ_{x_{−{i,j}}}(x_i | x_j) = µ_{x_{−{i}}}(x_i). Therefore, there is a value of x_{−{j}} for which T(x_j) = µ_{x_{−{i}}}(x_i) is not constant in x_j. Therefore, j ∈ Ca(i). ∎

Remark 1.
Without Axiom 1, any representing graph must include the causal links in the sense of Definition 2 (i.e., (j, i) ∈ F ⇒ j ∈ Ca(i)), but F could omit some arrows. However, only arrows involved in 2-cycles are omitted.

Before proving Theorem 1, we need two lemmas. Let i be a variable and I, J be two disjoint sets of variables that do not contain i. It is known from Dawid ([1]) and Pearl ([14]) that i is independent of I conditional on J if, and only if, J D-separates {i} from I (see below for a definition of D-separation). The next two lemmas prove that, for each variable i, Ca(i) D-separates {i} from all sets J that satisfy J ⊂ ND(i), where ND(i) is the set of nondescendants of i. Furthermore, Ca(i) is the smallest set that has this property.

Definition 9.
Let I, J, K ⊂ N be three disjoint sets of variables. We say that K D-separates I from J if for each undirected path between a variable in I and a variable in J, one of the following properties holds:
• There is a node w along the path such that w is a collider (that is, there are nodes w_1, w_2 in the path such that w_1 → w ← w_2), such that w ∉ K and K ⊂ ND(w).
• There is a node w along the path such that w is not a collider and such that w ∈ K.

Lemma 1.
Fix K ⊂ N and x_K ∈ X_K. Let G_K represent ≻_{x_K}. For each i ∈ N, Ca(i)∖K D-separates {i} from ND(i)∖K ≡ {ĵ ∈ K^A : i is not an indirect cause of ĵ}.

Proof. Choose j ∈ {ĵ ∈ K^A : i is not an indirect cause of ĵ}. Choose an undirected trail t from j to i. That is, t = (i_0, ..., i_N), where i_0 = j, i_N = i, and, for each n ∈ {1, ..., N}, either (i_{n−1}, i_n) ∈ E or (i_n, i_{n−1}) ∈ E. First, since i is not an indirect cause of j, t cannot be a directed path from i to j. That is, t cannot be such that (i_n, i_{n−1}) ∈ E for each n. Second, if t is a directed path from j to i (that is, (i_{n−1}, i_n) ∈ E for each n), then t is blocked by i_{N−1} ∈ Ca(i)∖K. Third, assume that t is not directed in either direction. Then, t has colliders and/or tail-to-tail nodes. Let i_n be the last node that is either a collider or a tail-to-tail node, and let q = (i_n, ..., i_N) be the trail starting at i_n. By the definition of i_n, q must be directed. Assume that q is directed from i_n to i. Then, i_n is tail to tail, and t is blocked by i_{N−1}. Finally, assume that q is directed from i to i_n. Then, i_n is a collider. If i_n ∈ Ca(i)∖K, then the edge (i_n, i) together with q forms a cycle. Thus, i_n ∉ Ca(i)∖K. By a similar argument, no descendant of i_n can be in Ca(i)∖K. Therefore, i_n blocks t. Since each trail joining j to i is blocked, this concludes the proof. ∎

Lemma 2.
Fix K ⊂ N, x_K ∈ X_K, and i ∈ K^A. Let G_K represent ≻_{x_K}. If T ⊂ K^A satisfies that T D-separates {i} from ND(i), then Ca(i)∖K ⊂ T.

Proof.
Let K, i, and T be as in the statement of the lemma. Assume that w ∈ Ca(i)∖K. Then, w ∈ ND(i), because otherwise G_K would not be acyclic. Consider the path w → i. Then, T can D-separate this path only if w ∈ T. Thus, Ca(i)∖K ⊂ T. ∎

Lemma 3.
Assume that Axiom 3 holds. Then, (∀i ∈ N), (∀j such that i ∉ Ca(j)), and (∀K ⊂ N∖{i, j}): i ⊥⊥ (Ca(j)∖K∖Ca(i)) | (Ca(i)∖K).

Proof.
Let i ∈ N be arbitrarily chosen and j, K be as in the statement of the lemma. Define the following sets:
Ca_0(j) = {t : t ∈ ICa(j), Ca(t) = ∅, and t ∉ Ca(i)}
Ca_1(j) = {t : t ∈ ICa(j), Ca(t) ⊂ Ca_0(j), and t ∉ Ca(i)}
(...)
Ca_k(j) = {t : t ∈ ICa(j), Ca(t) ⊂ ∪_{n=0}^{k−1} Ca_n(j), and t ∉ Ca(i)}.
Axiom 3 implies the following: i ⊥⊥ Ca_0(j) | Ca(i). The above follows because i ∉ Ca_0(j) (since i ∉ ICa(j)) and because Ca(Ca_0(j)) = ∅. By induction, Axiom 3 implies the following for all k: i ⊥⊥ ∪_{n=0}^{k} Ca_n(j) | Ca(i). We have already shown that the above is true for k = 0. Assume that it is true for some value k. We need to show that this is true for k + 1. Note that Ca(Ca_{k+1}(j)) ⊂ ∪_{n=0}^{k} Ca_n(j) by definition. Then, by Axiom 3, i ⊥⊥ Ca_{k+1}(j) | Ca(i) ∪ (∪_{n=0}^{k} Ca_n(j)). Moreover, according to the inductive hypothesis, i ⊥⊥ ∪_{n=0}^{k} Ca_n(j) | Ca(i). Thus, i ⊥⊥ (∪_{n=0}^{k} Ca_n(j) ∪ Ca_{k+1}(j)) | Ca(i). Since Ca(j) = ∪_{n≥0} Ca_n(j), this completes the proof. ∎

As a corollary of Lemma 3, we obtain the Markov property i ⊥⊥ j | Ca(i)∖K whenever i ∉ ICa(j). Indeed, by the intersection property of conditional independence, i ⊥⊥ j | Ca(i) ∪ Ca(j) and i ⊥⊥ Ca(j) | Ca(i) imply i ⊥⊥ j | Ca(i).

Theorem 1.
Let ≻̄ satisfy Assumption 1. The following are equivalent:
• Axioms 1 through 3 hold;
• (∃G) such that G is a DAG and represents ≻̄.
Furthermore, if G represents ≻̄, then G = G(≻̄).

Proof. The uniqueness claim is proven in Proposition 1.

We now prove that the axioms imply the existence of a representation. Without loss of generality, label the variables such that j < i implies j ∈ ND(i). Construct G by setting Pa(i) = Ca(i). By Axiom 1, G is acyclic. Indeed, if for some length k ∈ ℕ there were a cycle e = ((i_1, i_2), (i_2, i_3), ..., (i_k, i_1)), then i_1 would be an indirect cause of itself. Choose any set K ⊂ N and any realization x_K ∈ X_K. We need to show that µ_{x_K}(x_{−K}) = Π_{i∉K} µ_{x_K}(x_i | x_{Ca(i)∖K}). By our enumeration, {j ∉ K : j < i} ⊂ {j ∈ N : i is not an indirect cause of j}. Through Lemma 3, Axiom 3 implies that µ_{x_K}(x_i | x_{{j∉K : j<i}}) = µ_{x_K}(x_i | x_{Ca(i)∖K}). By the chain rule, we know that µ_{x_K}(x_{−K}) = Π_{i=1, i∉K}^{N} µ_{x_K}(x_i | x_{{j∉K : j<i}}). Combining the last two claims, µ_{x_K}(x_{−K}) = Π_{i=1, i∉K}^{N} µ_{x_K}(x_i | x_{Ca(i)∖K}), which is what we wanted to prove. We now prove the minimality of Ca(i). Assume that J ⊊ Ca(i). Axiom 2 states that i is not independent of Ca(i)∖K∖J conditional on J. Thus, the factorization formula is minimal.

Now, suppose that G is a DAG that represents ≻̄. By our uniqueness claim, without loss of generality, G is such that Pa(i) = Ca(i). By contrapositive, that G is acyclic implies that Axiom 1 holds. If Axiom 1 did not hold, there would exist i and a sequence (i, i_1, ..., i_T, i) such that i ∈ Ca(i_1), for all t ∈ {1, ..., T−1}, i_t ∈ Ca(i_{t+1}), and i_T ∈ Ca(i). Thus, ((i, i_1), ..., (i_{t−1}, i_t), ..., (i_T, i)) is a cycle in G. Axiom 2 holds by the minimality requirement in the definition of representation. Indeed, if there were K, J such that i ⊥⊥ (Ca(i)∖K∖J) | J, then the representation would not be minimal.
To see that Axiom 3 holds, note that Ca(i) ∪ Ca(j)∖K blocks all paths from i to j in G_K. Indeed, assume that p is an undirected path from i to j that is not blocked, and enumerate p = (i, i_1, ..., i_T, j). Because p is not blocked, i_1 ∉ Ca(i) and i_T ∉ Ca(j). Indeed, if either i_1 ∈ Ca(i) or i_T ∈ Ca(j), then either i_1 is not a collider or i_T is not a collider, thus implying that p is blocked by Ca(i) ∪ Ca(j). Therefore, p has a collider. Let n be the smallest number such that i_n is a collider and m be the largest number such that i_m is a collider (possibly n = m). Note that because G is acyclic, i_n ∉ Ca(i) and i_m ∉ Ca(j). Because of this, and because p is not blocked, the following must be true:
• (i) i_n ∈ Ca(j),
• (ii) i_m ∈ Ca(i).
Then, the directed path that goes from i to i_n, jumps to j, returns to i_m, and skips back to i (formally, the path q = (i, i_1, ..., i_n, j, i_T, i_{T−1}, ..., i_m, i)) is a cycle. This constitutes a contradiction. Thus, every path p from i to j is blocked by Ca(i) ∪ Ca(j). Thus, Axiom 3 holds. Similarly, Ca(i) ∪ {j}∖K blocks all paths from i to Ca(j)∖K in G_K. Indeed, let p be an undirected path from i to Ca(j), and assume that p is not blocked by Ca(i) ∪ {j}. Enumerate p = (i, i_1, ..., i_T, k), where k ∈ Ca(j). Because j ∈ ND(i), p cannot be directed from i to k. If i_1 ∈ Ca(i), then i_1 blocks p. If i_1 ∉ Ca(i), since p is not directed, p has a collider. Let n be the smallest number such that i_n is a collider. First, note that i_n ∉ Ca(i), because this would constitute a cycle. Second, if i_n = j, then j would be a descendant of i, a contradiction. Thus, i_n ∉ Ca(i) ∪ {j}, and hence p is blocked. Thus, Axiom 3 holds. ∎

Theorem 2. Let ≻̄ satisfy Assumption 1, and let (µ_{x_I})_{I⊂N} be the subjective beliefs elicited from ≻̄. The following are equivalent:
• Axioms 1 through 4 hold;
• ∃ a Markov representation of µ, (G, (h_1, ..., h_N), (ε_1, ..., ε_N)), such that
– (∀J ⊂ N), (∀x_J ∈ X_J): µ_{x_J} = µ(· | do(x_J)) ∈ Δ(X_{−J}); and
– G represents ≻̄.
Furthermore, if G represents ≻̄, then G = G(≻̄).

Proof. The uniqueness claim was proven in Proposition 1.

We first show that the axioms imply the representation. By Theorem 1, Axioms 1 through 3 imply that there exists a DAG G such that G represents ≻̄. For each i ∈ N, let Pa(i) be the set of parents of i in G. Note that Pa(i) = Ca(i) by the uniqueness claim. For each i ∈ N, let ε_i ~ U[0,1]. For each realization x_i ∈ X_i and each x_{Pa(i)} ∈ X_{Pa(i)}, let I(x_i, x_{Pa(i)}) ⊂ [0,1] be an interval of length µ_{x_{Pa(i)}}(x_i). Because Σ_{x_i ∈ X_i} µ_{x_{Pa(i)}}(x_i) = 1, the intervals I(·, x_{Pa(i)}) can be chosen to form a partition of [0,1]. Fix any variable i ∈ N, and let h_i(x_{Pa(i)}, ε_i) = Σ_{x_i ∈ X_i} x_i · 1{ε_i ∈ I(x_i, x_{Pa(i)})}. By construction, (G, (h_1, ..., h_N), (ε_1, ..., ε_N)) is a Markov representation of the beliefs elicited from ≻̄. Choose any J ⊂ N and any i ∉ J. By Axiom 4, for each x_i ∈ X_i and each x_{Ca(i)∪J} ∈ X_{Ca(i)∪J}, we obtain

µ_{x_J}(x_i | x_{Ca(i)∖J}) = µ(x_i | x_{Ca(i)}). (9)

Our Markov representation implies

µ(x_i | x_{Ca(i)}) = φ({ε_i : h_i(x_{Ca(i)}, ε_i) = x_i}) = µ(x_i | do(x_J), x_{Ca(i)∖J}). (10)

By (9) and (10), µ_{x_J}(x_i | x_{Ca(i)∖J}) = µ(x_i | do(x_J), x_{Ca(i)∖J}). Because G represents ≻̄, for each x ∈ X,

µ_{x_J}(x_{−J}) = Π_{i=1, i∉J}^{N} µ_{x_J}(x_i | x_{Ca(i)∖J}) = Π_{i=1, i∉J}^{N} µ(x_i | do(x_J), x_{Ca(i)∖J}) = µ(x_{−J} | do(x_J)).
Thus, µ_{x_J}(·) = µ(· | do(x_J)) ∈ Δ(X_{−J}).

We now show that the representation implies the axioms. If there exists a DAG G that represents ≻̄, then that Axioms 1 and 2 hold is proven in Theorem 1. Let i ∈ N, J ⊂ N∖{i}, f, g ∈ ℝ^{X_i}, x_J ∈ X_J, and x_{Ca(i)∖J} ∈ X_{Ca(i)∖J} be arbitrarily selected. We know from the Markov representation that for each x_i ∈ X_i, µ^i_{x_J}(x_i | x_{Ca(i)∖J}) = µ^i(x_i | x_{Ca(i)}), where µ^i denotes the marginal of µ on X_i. Thus, Axiom 4 holds. ∎

Proposition 2 is a direct consequence of Theorem 2 and Theorem 3, which is stated and proven below.

Theorem 3.
Let µ̄ = {µ_p : p ∈ P} be a collection of intervention beliefs, and let G be a DAG that represents µ̄. If equations (11) and (12) hold, then Axiom 4 holds.

Proof. Let µ̄ and G be as in the theorem. Let i ∈ N and J ⊂ N∖{i}. We want to show that µ(x_i | x_{Ca(i)}) = µ_{x_J}(x_i | x_{Ca(i)∖J}). Let J* ≡ J ∩ Ca(i); that is, J* are those variables in J that are direct causes of i. Thus, we need to show that µ(x_i | x_{Ca(i)}) = µ_{x_J}(x_i | x_{Ca(i)∖J*}); we do this in two steps.

First, we show that µ(x_i | x_{Ca(i)}) = µ_{x_{J*}}(x_i | x_{Ca(i)∖J*}). To see this, note that Ca(i)∖J* blocks any path from {i} to J* in graph G_{(J∖J*) in, (J*) out}. Indeed, let p be any path from i to some j ∈ J* in graph G_{(J∖J*) in, (J*) out}. Write p = (i_0, ..., i_T), where i_0 = i and i_T = j. Because j ∈ Ca(i), p cannot be a directed path from i to j; otherwise, G would have a cycle. Similarly, p cannot be a directed path from j to i, since G_{(J∖J*) in, (J*) out} has no arrows emerging from j. Therefore, p has a collider or a tail-to-tail node. Let w be the first node that is either a collider or a tail-to-tail node. First, assume w is tail to tail. Then, p is of the form i ← i_1 (...) ← w → (...) j. Then, i_1 ∈ Ca(i)∖J*: indeed, i_1 ∈ Ca(i) and i_1 ∉ J* (since there are no arrows emerging from nodes in J*). Furthermore, i_1 is not a collider. Then, i_1 blocks p. Now, assume that w is a collider rather than tail to tail. Then, p is of the form i → i_1 (...) → w ← (...) j. Then, w is a descendant of i, so neither w nor any descendant of w is in Ca(i). A fortiori, neither w nor any descendant of w is in Ca(i)∖J*. Thus, w blocks p. Therefore, by formula (11), we have µ(x_i | x_{Ca(i)}) = µ_{x_{J*}}(x_i | x_{Ca(i)∖J*}).

Second, we show that µ_{x_{J*}}(x_i | x_{Ca(i)∖J*}) = µ_{x_{J*∪(J∖J*)}}(x_i | x_{Ca(i)∖J*}).
This is because Ca(i) blocks all paths between i and J∖J* in graph G_{(J∖J*)(Ca(i)∖J*) in}. Note that if J∖J* contains only nondescendants of i, then the result is a direct consequence of Lemma 1. Let p be a path (not necessarily directed) between i and j ∈ J∖J*. By contradiction, assume that j ∈ J∖J* is a descendant of i. Then, j ∉ Ca(i), and j is not an ancestor of any node in Ca(i). Therefore, j ∈ (J∖J*)(Ca(i)∖J*), so there are no arrows into j. Therefore, no path from i to j can be directed in either direction, so there is at least one collider or tail-to-tail node. Let w be the first such node, and assume that w is a collider. Then, p is of the form i → (...) → w ← (...) ← j. Then, neither w nor any descendant of w can be in Ca(i), so p is blocked by Ca(i). Alternatively, w is a tail-to-tail node. Then, p is of the form i ← i_1 (...) ← w → (...) ← j (with possibly w = i_1). Then, i_1 ∈ Ca(i), and i_1 is not a collider. Thus, Ca(i) = J* ∪ (Ca(i)∖J*) blocks p. Thus, by formula (12), µ_{x_{J*}}(x_i | x_{Ca(i)∖J*}) = µ_{x_{J*∪(J∖J*)}}(x_i | x_{Ca(i)∖J*}) = µ_{x_J}(x_i | x_{Ca(i)∖J}). Combining this with the first step, we conclude that µ(x_i | x_{Ca(i)}) = µ_{x_J}(x_i | x_{Ca(i)∖J}), as desired. ∎

B The Rules of Causal Calculus
Theorem 2 in Section 7.2 shows a formalism (do-probability) that is useful for identifying causal effects. Under such a formalism, Pearl [14] shows the following two results. First, if a DAG satisfies certain properties, then intervening on a variable and conditioning on a variable generate the same distribution; that is, p(x_i | x_j, do(x_k)) = p(x_i | x_j, x_k). Second, if a DAG satisfies certain other properties, then one can drop interventions altogether; that is, p(x_i | x_j, do(x_k)) = p(x_i | x_j). These two results are generally known as the "rules of causal calculus". Huang and Valtorta [8] go a step further: they show that these rules are complete. That is, any identification theorem that is true can be proven by iterative application of these rules of causal calculus.

However, there remains an open question: are do-probabilities the only model of causality under which these rules apply, or are these rules consistent with other models of causality? Using our framework, we prove that the do-probability model is the only model of causal effects for which these two rules are complete.

Explaining and proving the result above requires two definitions: we need to define specific truncations of a DAG, and we need the definition of a blocked path. We provide these definitions and then formally state the result. Appendix B.1 discusses the intuition for why we need these definitions.

Definition 10.
Given G and three disjoint sets of variables I, J, K ⊂ N, the truncated DAGs G_{I in}, G_{I in, J out}, and G_{I in, J(K) in} are defined as follows:
1. G_{I in} is obtained from G by eliminating all arrows pointing to nodes in I;
2. G_{I in, J out} is obtained from G by eliminating all arrows emerging from nodes in J and all arrows pointing to nodes in I;
3. G_{I in, J(K) in} is obtained by eliminating all arrows pointing to nodes in J(K) and I, where J(K) is the set of J nodes that are not ancestors of any K node in G_{I in}.

The following figures show the base DAG, G, and its corresponding truncations. In all cases, J = {J_1, J_2}, I = {I}, K = {K}.

[Figure 12: Different truncations of a DAG. (a) The base DAG, G (with J_1 → K → J_2, J_1 → I ← J_2). (b) DAG G_{I in}, obtained by eliminating all arrows into I. (c) DAG G_{I in, J out}, obtained by (i) eliminating all arrows into I and (ii) eliminating all arrows emerging from J_1 and J_2. (d) DAG G_{I in, J(K) in}, obtained by (i) eliminating all arrows into I and (ii) eliminating all arrows into J_2, since J_2 is the only J node that is not an ancestor of a K node.]

For the following definition, suppose that Q is an undirected path between two nodes, i.e., a collection of nodes, regardless of directionality, and that q is a node on Q. For example, Figure 12 shows an undirected path Q = (J_1, I, J_2, K) from J_1 to K. We say that Q has converging arrows at q if there exist nodes q_1 and q_2 that are adjacent to q in Q such that q_1 → q ← q_2. For example, the path Q = (J_1, I, J_2, K) has converging arrows at I. We say that Q does not have converging arrows at q if for all nodes q_1 and q_2 that are adjacent to q in Q, either q_1 → q → q_2 or q_1 ← q → q_2 holds. For example, Q = (J_1, I, J_2, K) does not have converging arrows at J_2.

Definition 11.
Let I, J, K be three disjoint sets of variables, and let Q be an undirected path between a node in I and a node in J. We say that K blocks Q if there exists a node q on Q such that one of the following conditions holds:
• Q has converging arrows at q, and neither q nor any of its descendants is in K; or
• Q does not have converging arrows at q, and q ∈ K.

Below, we state the two rules of causal calculus (in the language of our model) and Proposition 2.
Rule 1. (Exchanging intervention and observation.) Let I_1, I_2, I_3, I_4 be disjoint sets of variables. If I_2 ∪ I_4 blocks all paths from I_1 to I_3 in graph G_{I_2 in, I_3 out}, then

µ_{x_{I_2}, x_{I_3}}(x_{I_1} | x_{I_4}) = µ_{x_{I_2}}(x_{I_1} | x_{I_3}, x_{I_4}). (11)

Rule 2. (Eliminating interventions.) Let I_1, I_2, I_3, I_4 be disjoint sets of variables. If I_2 ∪ I_4 blocks all paths from I_1 to I_3 in graph G_{I_2 in, I_3(I_4) in}, then

µ_{x_{I_2}, x_{I_3}}(x_{I_1} | x_{I_4}) = µ_{x_{I_2}}(x_{I_1} | x_{I_4}). (12)

Proposition 2.
Let ≻̄ satisfy Axioms 1 through 3, let G represent ≻̄, and let {µ_p : p ∈ P} be the DM's intervention beliefs. Then, the following statements are equivalent.
• ≻̄ satisfies Axiom 4.
• Rules 1 and 2 hold. Furthermore, if µ_p is identified for some p ∈ P, then the identification is obtained by iterative application of these two rules.

With Proposition 2, we can refer back to Example 3 in Section 7.2 and obtain the identification result by applying Rules 1 and 2. In Rule 2, set I_1 = {L}, I_2 = ∅, I_3 = {E}, and I_4 = {A}. The corresponding truncated DAG is G itself. In G, A blocks the unique path from E to L, since no converging arrows exist at A. Thus, µ_e(l | a) = µ(l | a). Similarly, in Rule 1, set I_1 = {A}, I_2 = ∅, I_3 = {E}, and I_4 = ∅. In the truncated graph that results, E is isolated from all other variables, so any path from E to A is blocked; thus, µ_e(a) = µ(a | e). These two conclusions yield the identification µ_e(l) = Σ_a µ(l | a) µ(a | e).

B.1 On the relevance of blocks

In what follows, we intuitively explain why the notion of a block is relevant for analyzing conditional independence. Furthermore, we provide intuition for why the truncations in Figure 13 are the relevant truncations for identifying intervention beliefs.

To illustrate the notion of a block, see Figure 13 below. The singleton {K} blocks all paths from J_1 to J_2. Indeed, one such path is J_1 → K → J_2. This path is blocked by {K} because (i) the path has no converging arrows at K, and (ii) K ∈ {K}. The other path from J_1 to J_2 is J_1 → I ← J_2. This path is blocked by {K} because I is a node along the path such that there are converging arrows at I, but neither I nor any of its descendants is in {K}.

[Figure 13: Different truncations of a DAG. (a) The base DAG, G (with J_1 → K → J_2, J_1 → I ← J_2). (b) DAG G_{I in}, obtained by eliminating all arrows into I. (c) DAG G_{I in, J out}, obtained by (i) eliminating all arrows into I and (ii) eliminating all arrows emerging from J_1 and J_2. (d) DAG G_{I in, J(K) in}, obtained by (i) eliminating all arrows into I and (ii) eliminating all arrows into J_2, since J_2 is the only J node that is not an ancestor of a K node.]

The notion of a block is a graphical depiction of conditional independence. Indeed, that a path exists between two sets of variables, I and J, implies that I and J are (a priori) statistically dependent: any variable w present on a path from I to J may potentially act as a correlating device between I and J.

In particular, the position of a variable w on a path between I and J is relevant to the way in which w correlates these variables. Say that there is a path i → w ← j, where i ∈ I and j ∈ J; i.e., there is a path joining I and J that has converging arrows at w. This implies that observations of w (and its descendants) are informative about i and j simultaneously. However, interventions on w are useless for the purposes of predicting the value of either i or j, since neither w nor any of its descendants is a cause of either i or j. By contrast, if there is a path of the form i → w → j or i ← w → j (i.e., a path with nonconverging arrows), then we know that both observations of and interventions on w are useful for predicting the values of i and j, although in different ways. In the case in which i ← w → j, observing or intervening on w provides the same joint information about i and j, since w is a common direct cause of i and j. However, if i → w → j, intervening on w provides information about j (since w is a direct cause of j) but provides no information about i (since w is neither a direct nor an indirect cause of i). In this case, intervening on w breaks down the statistical dependence of i and j in a way that is different from simply conditioning on observations of w.
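The blocking notion above coincides with D-separation, which can be checked mechanically via the moralized-ancestral-graph test of Lauritzen et al. [12]. As a sanity check (not part of the formal development; all function and variable names are ours, for illustration), the following minimal Python sketch implements that test and confirms the patterns discussed here: in the chain E → A → L of the identification example, {A} blocks the unique path from E to L, while conditioning on a collider opens a path.

```python
from collections import defaultdict

def ancestral_set(parents, nodes):
    """Return `nodes` together with all of their ancestors."""
    seen, stack = set(), list(nodes)
    while stack:
        v = stack.pop()
        if v not in seen:
            seen.add(v)
            stack.extend(parents[v])
    return seen

def d_separated(edges, I, J, K):
    """True iff K D-separates I from J in the DAG given by directed `edges`,
    using the moralized-ancestral-graph criterion (Lauritzen et al. [12])."""
    parents = defaultdict(set)
    for u, v in edges:                      # edge u -> v
        parents[v].add(u)
    anc = ancestral_set(parents, set(I) | set(J) | set(K))
    # Moralize the induced subgraph: link each node to its parents and
    # "marry" every pair of parents of a common child; drop directions.
    nbrs = defaultdict(set)
    for v in anc:
        ps = parents[v] & anc
        for p in ps:
            nbrs[v].add(p)
            nbrs[p].add(v)
            for q in ps:
                if q != p:
                    nbrs[p].add(q)
    # Delete K, then check whether any undirected path joins I to J.
    reach, stack = set(I), list(I)
    while stack:
        v = stack.pop()
        for w in nbrs[v] - set(K) - reach:
            reach.add(w)
            stack.append(w)
    return not (reach & set(J))

chain = [("E", "A"), ("A", "L")]            # E -> A -> L
collider = [("X", "Z"), ("Y", "Z")]         # X -> Z <- Y
print(d_separated(chain, {"E"}, {"L"}, {"A"}))     # True: {A} blocks the chain
print(d_separated(chain, {"E"}, {"L"}, set()))     # False: path E-A-L is open
print(d_separated(collider, {"X"}, {"Y"}, set()))  # True: collider blocks itself
print(d_separated(collider, {"X"}, {"Y"}, {"Z"}))  # False: conditioning opens it
```

The moralization step is precisely the graphical counterpart of the converging-arrows clause in Definition 11: marrying the parents of a conditioned-on collider records that observing the collider correlates its causes.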