A Combinatorial Solution to Causal Compatibility
Thomas C. [email protected]
Perimeter Institute for Theoretical Physics, Waterloo, Ontario, Canada, N2L 2Y5
University of Waterloo, Waterloo, Ontario, Canada, N2L 3G1
October 26, 2020
Abstract
Within the field of causal inference, it is desirable to learn the structure of causal relationships holding between a system of variables from the correlations that these variables exhibit; a sub-problem of which is to certify whether or not a given causal hypothesis is compatible with the observed correlations. A particularly challenging setting for assessing causal compatibility is in the presence of partial information; i.e. when some of the variables are hidden/latent. This paper introduces the possible worlds framework as a method for deciding causal compatibility in this difficult setting. We define a graphical object called a possible worlds diagram, which compactly depicts the set of all possible observations. From this construction, we demonstrate explicitly, using several examples, how to prove causal incompatibility. In fact, we use these constructions to prove causal incompatibility where no other techniques have been able to. Moreover, we prove that the possible worlds framework can be adapted to provide a complete solution to the possibilistic causal compatibility problem. Even more, we also discuss how to exploit graphical symmetries and cross-world consistency constraints in order to implement a hierarchy of necessary compatibility tests that we prove converges to sufficiency.
Keywords: causal inference, causal compatibility, quantum non-classicality

1 Introduction
A theory of causation specifies the effects of actions with absolute necessity. On the other hand, a probabilistic theory encodes degrees of belief and makes predictions based on limited information. A common fallacy is to interpret correlation as causation; opening an umbrella has never caused it to rain, although the two are strongly correlated. Numerous paradoxical and catastrophic consequences are unavoidable when probabilistic theories and theories of causation are confused. Nonetheless, Reichenbach's principle asserts that correlations must admit causal explanation; after all, the fear of getting wet causes one to open an umbrella.

In recent decades, a concerted effort has been put into developing a formal theory for probabilistic causation [43, 54]. Integral to this formalism is the concept of a causal structure. A causal structure is a directed acyclic graph, or DAG, which encodes hypotheses about the causal relationships among a set of random variables. A causal model is a causal structure when equipped with an explicit description of the parameters which govern the causal relationships. Given a multivariate probability distribution for a set of variables and a proposed causal structure, the causal compatibility problem aims to determine the existence or non-existence of a causal model for the given causal structure which can explain the correlations exhibited by the variables. More generally, the objective of causal discovery is to enumerate all causal structure(s) compatible with an observed distribution. Perhaps unsurprisingly, causal inference has applications in a variety of academic disciplines including economics, risk analysis, epidemiology, bioinformatics, and machine learning [29, 42, 43, 48, 62].

For physicists, a consideration of causal influence is commonplace; the theory of special/general relativity strictly prohibits causal influences between space-like separated regions of space-time [57]. Famously, in response to Einstein, Podolsky, and Rosen's [19] critique on the completeness of quantum theory, Bell [7] derived an observational constraint, known as Bell's inequality, which must be satisfied by all hidden variable models which respect the causal hypothesis of relativity. Moreover, Bell demonstrated the existence of quantum-realizable correlations which violate Bell's inequality [7]. Recently, it has been appreciated that Bell's theorem can be understood as an instance of causal inference [61].
Contemporary quantum foundations maintains two closely related causal inference research programs. The first is to develop a theory of quantum causal models in order to facilitate a causal description of quantum theory and to better understand the limitations of quantum resources [3, 6, 13, 17, 25, 30, 36, 38, 44, 47, 60]. The second is the continued study of classical causal inference with the purpose of distinguishing genuinely quantum behaviors from those which admit classical explanations [1, 2, 11, 23, 24, 25, 50, 58, 60]. In particular, the results of [30] suggest that causal structures which support quantum non-classicality are uncommon and typically large in size; therefore, systematically finding new examples of such causal structures will require the development of new algorithmic strategies. As a consequence, quantum foundations research has relied upon, and contributed to, the techniques and tools used within the field of causal inference [13, 30, 50, 60]. The results of this paper are concerned exclusively with the latter research program of classical causal inference, but do not rule out the possibility of a generalization to quantum causal inference.

When all variables in a probabilistic system are observed, checking the compatibility status between a joint distribution and a causal structure is relatively easy; compatibility holds if and only if all conditional independence constraints implied by graphical d-separation relations hold [39, 43]. Unfortunately, in more realistic situations there are ethical, economic, or fundamental barriers preventing access to certain statistically relevant variables, and it becomes necessary to hypothesize the existence of latent/hidden variables in order to adequately explain the correlations expressed by the visible/observed variables [22, 43, 60].
In the presence of latent variables, and in the absence of interventional data, the causal compatibility problem, and by extension the subject of causal inference as a whole, becomes considerably more difficult.

In order to overcome these difficulties, numerous simplifications have been invoked by various authors in order to make partial progress. A particularly popular simplification strategy has been to consider alternative classes of graphical causal models which can act as surrogates for DAG causal models; e.g. MC-graphs [34], summary graphs [59], or maximal ancestral graphs (MAGs) [46, 63]. While these approaches are certainly attractive from a practical perspective (efficient algorithms such as FCI [54] or RFCI [16] exist for assessing causal compatibility with MAGs, for instance), they nevertheless fail to fully capture all constraints implied by DAG causal models with latent variables [21]. The forthcoming formalism is concerned with assessing the causal compatibility of DAG causal structures directly, therefore avoiding these shortcomings.

Nevertheless, when considering DAG causal structures directly (henceforth just causal structures), making assumptions about the nature of the latent variables and the parameters which govern them can simplify the problem [28, 53, 56]. For instance, when the latent variables are assumed to have a known and finite cardinality, it becomes possible to articulate the causal compatibility problem as a finite system of polynomial equalities and inequalities with a finite list of unknowns for which non-linear quantifier elimination methods, such as Cylindrical Algebraic Decomposition [31], can provide a complete solution. Unfortunately, these techniques are only computationally tractable in the simplest of situations. Other techniques from algebraic geometry have been used in simple scenarios to approach the causal compatibility problem as well [27, 28, 35].
When no assumptions about the nature of the latent variables are made, there are a plethora of methods for deriving novel [...] and sufficient [37] for determining compatibility.

Footnote: For a concrete and relevant example of this weakness, note that there are observable distributions incompatible with the DAG causal structure in Figure 11 (which admits of no observable d-separation relations), whereas its associated MAG is compatible with all observed distributions. An analogous statement happens to be true of the DAG causal structure in Figure 13.

Footnote: The cardinality of a random variable is the size of its sample space.

In contrast with the aforementioned algebraic techniques, the purpose of this paper is to present the possible worlds framework, which offers a combinatorial solution to the causal compatibility problem in the presence of latent variables. Importantly, this framework can only be applied when the cardinalities of the visible variables are known to be finite. This framework is inspired by the twin networks of Pearl [43], parallel worlds of Shpitser [52], and by some original drafts of the Inflation Technique paper [60]. The possible worlds framework accomplishes three things. First, we prove its conceptual advantages by revealing that a number of disparate instances of causal incompatibility become unified under the same premise. Second, we provide a closed-form algorithm for completely solving the possibilistic causal compatibility problem. To demonstrate the utility of this method, we provide a solution to an unsolved problem originally reported in [21]. Third, we show that the possible worlds framework provides a hierarchy of tests, much like the Inflation Technique, which completely solves the probabilistic causal compatibility problem.

Unfortunately, the computational complexity of the proposed probabilistic solution is prohibitively large in many practical situations. Therefore, the contributions of this work are primarily conceptual.
Nevertheless, it is possible that these complexity issues are intrinsic to the problem being considered. Notably, the hierarchy of tests presented here has an asymptotic rate of convergence commensurate to the only other complete solution to the probabilistic compatibility problem, namely the hierarchy of tests provided in [37]. Moreover, unlike the Inflation Technique, if a distribution is compatible with a causal structure, then the hierarchy of tests provided here has the advantage of returning a causal model which generates that distribution.

This paper is organized as follows: Section 2 begins with a review of the mathematical formalism behind causal modeling, including a formal definition of the causal compatibility problem, and also introduces the notations to be used throughout the paper. Afterwards, Section 3 introduces the possible worlds framework and defines its central object of study: a possible worlds diagram. Section 4 applies the possible worlds framework to prove possibilistic incompatibility between several distributions and corresponding causal structures, culminating in an algorithm for exactly solving the possibilistic causal compatibility problem. Finally, Section 5 establishes a hierarchy of tests which completely solve the probabilistic causal compatibility problem. Moreover, Section 5.1 articulates how to utilize internal symmetries in order to [...]

Footnote: Regarding the latent variables, Appendix B.2 demonstrates that the latent variables can be assumed to have finite cardinality without loss of generality whenever the visible variables have finite cardinality.
This review section is segmented into three portions. First, Section 2.1 defines directed graphs and their properties. Second, Section 2.2 introduces the notation and terminology regarding probability distributions to be used throughout the remainder of this article. Finally, Section 2.3 defines the notion of a causal model and formally introduces the causal compatibility problem.
Definition 1. A directed graph $\mathcal{G}$ is an ordered pair $\mathcal{G} = (\mathcal{Q}, \mathcal{E})$ where $\mathcal{Q}$ is a finite set of vertices and $\mathcal{E}$ is a set of edges, i.e. ordered pairs of vertices $\mathcal{E} \subseteq \mathcal{Q} \times \mathcal{Q}$. If $(q, u) \in \mathcal{E}$ is an edge, denoted as $q \to u$, then $u$ is a child of $q$ and $q$ is a parent of $u$. A directed path of length $k$ is a sequence of vertices $q^{(1)} \to q^{(2)} \to \cdots \to q^{(k)}$ connected by directed edges. For a given vertex $q$, $\mathsf{pa}_{\mathcal{G}}(q)$ denotes its parents and $\mathsf{ch}_{\mathcal{G}}(q)$ its children. If there is a directed path from $q$ to $u$ then $q$ is an ancestor of $u$ and $u$ is a descendant of $q$; the set of all ancestors of $q$ is denoted $\mathsf{an}_{\mathcal{G}}(q)$ and the set of all descendants is denoted $\mathsf{des}_{\mathcal{G}}(q)$. The definitions of parents, children, ancestors and descendants of a single vertex $q$ are applied disjunctively to sets of vertices $Q \subseteq \mathcal{Q}$:

$$\mathsf{ch}_{\mathcal{G}}(Q) = \bigcup_{q \in Q} \mathsf{ch}_{\mathcal{G}}(q), \qquad \mathsf{pa}_{\mathcal{G}}(Q) = \bigcup_{q \in Q} \mathsf{pa}_{\mathcal{G}}(q), \tag{1}$$

$$\mathsf{an}_{\mathcal{G}}(Q) = \bigcup_{q \in Q} \mathsf{an}_{\mathcal{G}}(q), \qquad \mathsf{des}_{\mathcal{G}}(Q) = \bigcup_{q \in Q} \mathsf{des}_{\mathcal{G}}(q). \tag{2}$$

A directed graph is acyclic if there is no directed path of length $k > 1$ from $q$ back to $q$ for any $q \in \mathcal{Q}$, and cyclic otherwise. For example, Figure 1 depicts the difference between cyclic and acyclic directed graphs.

Definition 2. The subgraph of $\mathcal{G} = (\mathcal{Q}, \mathcal{E})$ induced by $\mathcal{W} \subset \mathcal{Q}$, denoted $\mathsf{sub}_{\mathcal{G}}(\mathcal{W})$, is given by

$$\mathsf{sub}_{\mathcal{G}}(\mathcal{W}) = (\mathcal{W}, \mathcal{E} \cap (\mathcal{W} \times \mathcal{W})), \tag{3}$$

i.e. the graph obtained by taking all edges from $\mathcal{E}$ which connect members of $\mathcal{W}$.

[Figure 1 panels: (a) A directed cyclic graph. (b) A directed acyclic graph.]
Figure 1: The difference between a directed cyclic graph and a directed acyclic graph.
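The graph-theoretic notions of Definitions 1 and 2 can be sketched in a few lines of Python. This is only an illustrative sketch: the dictionary-of-children representation, the function names, and the two example graphs are assumptions made here (the edges of Figure 1 are not reproduced). Following Definition 1 literally, a path of length 1 is a single vertex, so each vertex is counted among its own ancestors and descendants.

```python
from typing import Dict, Hashable, Iterable, Set

# A directed graph as a map from each vertex to its set of children (assumed encoding).
Graph = Dict[Hashable, Set[Hashable]]


def parents(graph: Graph, q: Hashable) -> set:
    """pa_G(q): all vertices u with an edge u -> q."""
    return {u for u, kids in graph.items() if q in kids}


def reachable(graph: Graph, q: Hashable) -> set:
    """Vertices reachable from q along at least one edge (paths of length k > 1)."""
    seen, stack = set(), list(graph[q])
    while stack:
        u = stack.pop()
        if u not in seen:
            seen.add(u)
            stack.extend(graph[u])
    return seen


def descendants(graph: Graph, q: Hashable) -> set:
    """des_G(q), including q itself via the trivial length-1 path."""
    return reachable(graph, q) | {q}


def ancestors(graph: Graph, q: Hashable) -> set:
    """an_G(q), including q itself via the trivial length-1 path."""
    return {u for u in graph if q in descendants(graph, u)}


def is_acyclic(graph: Graph) -> bool:
    """True if no vertex can reach itself along a path of length k > 1."""
    return all(q not in reachable(graph, q) for q in graph)


def induced_subgraph(graph: Graph, W: Iterable[Hashable]) -> Graph:
    """sub_G(W): keep the vertices in W and only the edges between them (Equation 3)."""
    W = set(W)
    return {q: graph[q] & W for q in W}


# Hypothetical example graphs (not the ones drawn in Figure 1).
cyclic = {1: {2}, 2: {3}, 3: {1}}
acyclic = {1: {2, 3}, 2: {4}, 3: {4}, 4: {5}, 5: set()}
```

Storing children rather than edge pairs keeps the reachability search short; for large graphs one would cache the reachable sets instead of recomputing them per vertex.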
Definition 3 (Probability Theory). A probability space is a triple $(\Omega, \Xi, P)$ where the state space $\Omega$ is the set of all possible outcomes, $\Xi \subseteq 2^{\Omega}$ is the set of events forming a $\sigma$-algebra over $\Omega$, and $P$ is a $\sigma$-additive function from events to probabilities such that $P(\Omega) = 1$.

Definition 4 (Probability Notation). For a collection of random variables $X_{\mathcal{I}} = \{X_1, X_2, \ldots, X_k\}$ indexed by $i \in \mathcal{I} = \{1, 2, \ldots, k\}$ where each $X_i$ takes values from $\Omega_i$, a joint distribution $P_{\mathcal{I}} = P_{12 \ldots k}$ assigns probabilities to outcomes from $\Omega_{\mathcal{I}} = \prod_{i \in \mathcal{I}} \Omega_i$. The event that each $X_i$ takes value $x_i$, referred to as a valuation of $X_{\mathcal{I}}$, is denoted as

$$P_{\mathcal{I}}(x_{\mathcal{I}}) = P_{12 \ldots k}(x_1 x_2 \ldots x_k) = P(X_1 = x_1, X_2 = x_2, \ldots, X_k = x_k). \tag{4}$$

A point distribution $P_{\mathcal{I}}(y_{\mathcal{I}}) = 1$ for a particular event $y_{\mathcal{I}} \in \Omega_{\mathcal{I}}$ is expressed using square brackets,

$$P_{\mathcal{I}}(y_{\mathcal{I}}) = 1 \;\Leftrightarrow\; P_{\mathcal{I}}(x_{\mathcal{I}}) = [y_{\mathcal{I}}](x_{\mathcal{I}}) = \delta(y_{\mathcal{I}}, x_{\mathcal{I}}) = \prod_{i \in \mathcal{I}} \delta(y_i, x_i). \tag{5}$$

The set of all probability distributions over $\Omega_{\mathcal{I}}$ is denoted as $\mathcal{P}_{\mathcal{I}}$. Let $k_i$ denote the cardinality or size of $\Omega_i$. If $X_i$ is discrete, then $k_i = |\Omega_i|$, otherwise $X_i$ is continuous and $k_i = \infty$.

Footnote: A valuation is a particular type of event in $\Xi$ where the random variables take on definite values.

A causal model represents a complete description of the causal mechanisms underlying a probabilistic process. Formally, a causal model is a pair of objects $(\mathcal{G}, \mathcal{P})$, which will be defined in turn. First, $\mathcal{G}$ is a directed acyclic graph $(\mathcal{Q}, \mathcal{E})$, whose vertices $q \in \mathcal{Q}$ represent random variables $X_{\mathcal{Q}} = \{X_q \mid q \in \mathcal{Q}\}$. The purpose of a causal structure is to graphically encode the causal relationships between the variables. Explicitly, if $q \to u \in \mathcal{E}$ is an edge of the causal structure, $X_q$ is said to have causal influence on $X_u$. Consequently, the causal structure predicts that given complete knowledge of a valuation of the parental variables $X_{\mathsf{pa}_{\mathcal{G}}(u)} = \{X_q \mid q \in \mathsf{pa}_{\mathcal{G}}(u)\}$, the random variable $X_u$ should become independent of its non-descendants [43]. With this observation as motivation, the causal parameters $\mathcal{P}$ of a causal model are a family of conditional probability distributions $P_{q \mid \mathsf{pa}_{\mathcal{G}}(q)}$ for each $q \in \mathcal{Q}$. In the case that $q$ has no parents in $\mathcal{G}$, the distribution is simply unconditioned. The purpose of the causal parameters is to predict a joint distribution $P_{\mathcal{Q}}$ on the configurations $\Omega_{\mathcal{Q}}$ of a causal structure,

$$\forall x_{\mathcal{Q}} \in \Omega_{\mathcal{Q}}, \quad P_{\mathcal{Q}}(x_{\mathcal{Q}}) = \prod_{q \in \mathcal{Q}} P_{q \mid \mathsf{pa}_{\mathcal{G}}(q)}(x_q \mid x_{\mathsf{pa}_{\mathcal{G}}(q)}). \tag{6}$$

Figure 2: The causal structure $\mathcal{G}$ in this figure encodes a causal hypothesis about the causal relationships between the visible variables $\mathcal{V} = \{v_1, v_2, v_3, v_4, v_5\}$ and the latent variables $\mathcal{L} = \{\ell_1, \ell_2, \ell_3\}$; e.g. a visible vertex $v$ experiences a direct causal influence from each of its parents, both its two visible parents $\mathsf{vpa}_{\mathcal{G}}(v)$ and its two latent parents $\mathsf{lpa}_{\mathcal{G}}(v)$. Throughout this paper, visible variables and edges connecting them are colored blue whereas all latent variables and all other edges are colored red.

If the hypotheses encoded within a causal structure $\mathcal{G}$ are correct, then the observed distribution over $\Omega_{\mathcal{Q}}$ should factorize according to Equation 6. Unfortunately, as discussed in Section 1, there are often ethical, economic, or fundamental obstacles preventing access to all variables of a system.
In such cases, it is customary to partition the vertices of a causal structure into two disjoint sets; the visible (observed) vertices $\mathcal{V}$, and the latent (unobserved) vertices $\mathcal{L}$ (for example, see Figure 2). Additionally, we denote the visible parents of any vertex $q \in \mathcal{V} \cup \mathcal{L}$ as $\mathsf{vpa}_{\mathcal{G}}(q) = \mathcal{V} \cap \mathsf{pa}_{\mathcal{G}}(q)$ and analogously the latent parents as $\mathsf{lpa}_{\mathcal{G}}(q) = \mathcal{L} \cap \mathsf{pa}_{\mathcal{G}}(q)$.

Footnote: It is seldom necessary to make the distinction between the random variable $X_q$ and the index/vertex $q$; this paper henceforth treats them as synonymous.

Footnote: This is known as the local Markov property.

In the presence of latent variables, Equation 6 still makes a prediction about the joint distribution $P_{\mathcal{V} \cup \mathcal{L}}(x_{\mathcal{V}}, \lambda_{\mathcal{L}})$ over the visible and latent variables, albeit an experimenter attempting to verify or discredit a causal hypothesis only has access to the marginal distribution $P_{\mathcal{V}}(x_{\mathcal{V}})$. If $\Omega_{\mathcal{L}}$ is continuous,

$$\forall x_{\mathcal{V}} \in \Omega_{\mathcal{V}}, \quad P_{\mathcal{V}}(x_{\mathcal{V}}) = \int_{\lambda_{\mathcal{L}} \in \Omega_{\mathcal{L}}} \mathrm{d}P_{\mathcal{V} \cup \mathcal{L}}(x_{\mathcal{V}}, \lambda_{\mathcal{L}}). \tag{7}$$

If $\Omega_{\mathcal{L}}$ is discrete,

$$\forall x_{\mathcal{V}} \in \Omega_{\mathcal{V}}, \quad P_{\mathcal{V}}(x_{\mathcal{V}}) = \sum_{\lambda_{\mathcal{L}} \in \Omega_{\mathcal{L}}} P_{\mathcal{V} \cup \mathcal{L}}(x_{\mathcal{V}}, \lambda_{\mathcal{L}}). \tag{8}$$

A natural question arises; in the absence of information about the latent variables $\mathcal{L}$, how can one determine whether or not their causal hypotheses are correct? The principal purpose of this paper is to provide the reader with methods for answering this question.

In general, other than being a directed acyclic graph, there are no restrictions placed on a causal structure with latent variables. Nonetheless, [21] demonstrates that every causal structure $\mathcal{G}$ can be converted into a standard form that is observationally equivalent to $\mathcal{G}$ where the latent variables are exogenous (have no parents) and whose children sets are isomorphic to the facets of a simplicial complex over $\mathcal{V}$. Appendix A summarizes the relevant results from [21] necessary for making this claim. Additionally, Appendix B demonstrates that any finite distribution $P_{\mathcal{V}}$ which satisfies the causal hypotheses (i.e. Equation 7) can be generated using deterministic causal parameters for the visible variables and moreover, the cardinalities of the latent variables can be assumed finite. Altogether, Appendices A and B suggest that without loss of generality, we can simplify the causal compatibility problem as follows:

Definition 5 (Functional Causal Model). A (finite) functional causal model for a causal structure $\mathcal{G} = (\mathcal{V} \cup \mathcal{L}, \mathcal{E})$ is a triple $(\mathcal{G}, \mathcal{F}_{\mathcal{V}}, \mathcal{P}_{\mathcal{L}})$ where

$$\mathcal{F}_{\mathcal{V}} = \{ f_v : \Omega_{\mathsf{pa}_{\mathcal{G}}(v)} \to \Omega_v \mid v \in \mathcal{V} \} \tag{9}$$

are deterministic functions for the visible variables $\mathcal{V}$ in $\mathcal{G}$, and

$$\mathcal{P}_{\mathcal{L}} = \{ P_\ell : \Omega_\ell \to [0, 1] \mid \ell \in \mathcal{L} \} \tag{10}$$

are finite probability distributions for the latent variables $\mathcal{L}$ in $\mathcal{G}$. A functional causal model defines a probability distribution $P_{\mathcal{V}} : \Omega_{\mathcal{V}} \to [0, 1]$,

$$\forall x_{\mathcal{V}} \in \Omega_{\mathcal{V}}, \quad P_{\mathcal{V}}(x_{\mathcal{V}}) = \prod_{\ell \in \mathcal{L}} \sum_{\lambda_\ell \in \Omega_\ell} P_\ell(\lambda_\ell) \prod_{v \in \mathcal{V}} \delta\big(x_v, f_v(x_{\mathsf{vpa}_{\mathcal{G}}(v)}, \lambda_{\mathsf{lpa}_{\mathcal{G}}(v)})\big). \tag{11}$$

Footnote: This paper adopts the notational convention of using $\lambda_\ell \in \Omega_\ell$ for valuations of latent variables $\ell \in \mathcal{L}$ to differentiate them from valuations $x_v \in \Omega_v$ of observed variables $v \in \mathcal{V}$.

Footnote: Appendix A.1 briefly discusses what it means for two causal structures to be observationally equivalent.

Footnote: We prove this result in Appendix B by generalizing the proof techniques used in [50].

Definition 6 (The Causal Compatibility Problem). Given a causal structure $\mathcal{G} = (\mathcal{V} \cup \mathcal{L}, \mathcal{E})$ and a distribution $P_{\mathcal{V}}$ over the visible variables $\mathcal{V}$, the causal compatibility problem is to determine if there exists a functional causal model $(\mathcal{G}, \mathcal{F}_{\mathcal{V}}, \mathcal{P}_{\mathcal{L}})$ (defined in Definition 5) such that Equation 11 reproduces $P_{\mathcal{V}}$. If such a functional causal model exists, then $P_{\mathcal{V}}$ is said to be compatible with $\mathcal{G}$; otherwise $P_{\mathcal{V}}$ is incompatible with $\mathcal{G}$. The set of all compatible distributions on $\mathcal{V}$ for a causal structure $\mathcal{G}$ is denoted $\mathcal{M}_{\mathcal{V}}(\mathcal{G})$.

Consider the causal structure in Figure 3a, denoted $\mathcal{G}_{3a}$.
For the sake of concreteness, suppose one is promised the latent variables are sampled from a binary sample space, i.e. $k_\mu = k_\nu = 2$. Let $z_\mu = P_\mu(0_\mu)$ and $z_\nu = P_\nu(0_\nu)$. The causal hypothesis $\mathcal{G}_{3a}$ predicts (via Equation 11) that observable events $(x_a, x_b, x_c) \in \Omega_a \times \Omega_b \times \Omega_c$ will be distributed according to

$$P_{abc} = z_\mu z_\nu [\mathsf{obs}_{abc}(0_\mu 0_\nu)] + z_\mu (1 - z_\nu) [\mathsf{obs}_{abc}(0_\mu 1_\nu)] + (1 - z_\mu) z_\nu [\mathsf{obs}_{abc}(1_\mu 0_\nu)] + (1 - z_\mu)(1 - z_\nu) [\mathsf{obs}_{abc}(1_\mu 1_\nu)], \tag{12}$$

where $\mathsf{obs}_{abc}(\lambda_\mu \lambda_\nu) \in \Omega_a \times \Omega_b \times \Omega_c$ is shorthand for the observed event generated by the autonomous functions $f_a, f_b, f_c$ for each $(\lambda_\mu, \lambda_\nu) \in \Omega_\mu \times \Omega_\nu$. In the case of $\mathcal{G}_{3a}$,

$$\mathsf{obs}_{abc}(\lambda_\mu \lambda_\nu) = \big(f_a(\lambda_\mu),\; f_b(f_a(\lambda_\mu), \lambda_\nu),\; f_c(f_b(f_a(\lambda_\mu), \lambda_\nu), \lambda_\nu)\big). \tag{13}$$

For each distinct realization $(\lambda_\mu, \lambda_\nu) \in \Omega_\mu \times \Omega_\nu$ of the latent variables, one can consider a possible world wherein the values $\lambda_\mu, \lambda_\nu$ are not sampled according to the respective distributions $P_\mu, P_\nu$, but instead take on definite values. From the perspective of counterfactual reasoning, each world is modelling a distinct counterfactual assignment of the latent variables, but not the visible variables. In this particular example, there are $k_\mu \times k_\nu = 2 \times 2 = 4$ possible worlds (Figure 3b). Critically, regardless of the deterministic functional relationships $f_a, f_b, f_c$, there are identifiable consistency constraints that must hold between these worlds. For example, $a$ is determined by a function $f_a : \Omega_\mu \to \Omega_a$ and thus the observed value for $a$ in the yellow $(0_\mu 0_\nu)$-world must be exactly the same as the observed value for $a$ in the green $(0_\mu 1_\nu)$-world. This cross-world consistency constraint is illustrated in Figure 3c by embedding each possible world [...] $\lambda_\mu \to a$ subgraphs. It is important to remark that not all cross-world consistency constraints are captured by this diagram; the value of $b$ in the yellow $(0_\mu 0_\nu)$-world must match the value of $b$ in the orange $(1_\mu 0_\nu)$-world if the value of $a$ in both possible worlds is the same.

Footnote: It is conceivable that this framework, and its associated diagrammatic notation, could be extended to accommodate counterfactual assignments to the visible variables as well. Such an extension could be useful for assessing compatibility with interventional data, in addition to the purely observational data being considered here.

Footnote: This diagrammatic convention is explained in more depth shortly by Definition 7 and the associated Figure 4.

For comparison, in the original causal structure $\mathcal{G}_{3a}$, the vertices represented random variables sampled from distributions associated with causal parameters; whereas in the possible worlds diagram of Figure 3c, every valuation, including the latent valuations, is predetermined by the functional dependences $f_a, f_b, f_c$. For example, Figure 3d populates Figure 3c with the observable events generated by the following functional dependences,

$$\begin{aligned}
&f_a(0_\mu) = 0_a, \quad f_a(1_\mu) = 1_a, \\
&f_b(0_a 0_\nu) = 3_b, \quad f_b(0_a 1_\nu) = 1_b, \quad f_b(1_a 0_\nu) = 2_b, \quad f_b(1_a 1_\nu) = 0_b, \\
&f_c(3_b 0_\mu 0_\nu) = 0_c, \quad f_c(1_b 0_\mu 1_\nu) = 1_c, \quad f_c(2_b 1_\mu 0_\nu) = 2_c, \quad f_c(0_b 1_\mu 1_\nu) = 3_c.
\end{aligned} \tag{14}$$

The utility of Figure 3d is in its simultaneous account of Equation 14, the causal structure $\mathcal{G}_{3a}$ and the cross-world consistency constraints that $\mathcal{G}_{3a}$ induces. Nonetheless, Figure 3d fails to specify the probabilities $z_\mu, z_\nu$ associated with the latent events. In Section 4, we utilize diagrams analogous to Figure 3d to tackle the causal compatibility problem. Before doing so, this paper needs to formally define the possible worlds framework.

[Figure 3 panels: (a) An example causal structure $\mathcal{G}_{3a}$. (b) The possible worlds picture for $\mathcal{G}_{3a}$. (c) Identifying consistency constraints among possible worlds for $\mathcal{G}_{3a}$. (d) Populating a possible worlds diagram with the deterministic functions $f_a, f_b, f_c$ in Equation 14.]

Figure 3: A causal structure $\mathcal{G}_{3a}$ and the creation of the possible worlds diagram when $k_\mu = k_\nu = 2$.

Figure 4: A vertex of a possible worlds diagram dissected: the valuation $\omega_v \in \Omega_v$, the original variable $v \in \mathcal{V}$, and colors indicating world membership.

Definition 7 (The Possible Worlds Framework). Let $\mathcal{G} = (\mathcal{V} \cup \mathcal{L}, \mathcal{E})$ be a causal structure with visible variables $\mathcal{V}$ and latent variables $\mathcal{L}$. Let $\mathcal{F}_{\mathcal{V}}$ be a set of functional parameters for $\mathcal{V}$ defined exactly as in Equation 9. The possible worlds diagram for the pair $(\mathcal{G}, \mathcal{F}_{\mathcal{V}})$ is a directed acyclic graph $\mathcal{D}$ satisfying the following properties:

1. (Valuation Vertices) Each vertex in $\mathcal{D}$ consists of three pieces (consult Figure 4 for clarity):
   (a) a subscript $q \in \mathcal{V} \cup \mathcal{L}$ corresponding to a vertex in $\mathcal{G}$ (indicated inside a small circle in the bottom-right corner),
   (b) an integer $\omega$ corresponding to a possible valuation/outcome $\omega_q$ of $q$ where $\omega_q \in \{0_q, 1_q, \ldots\} = \Omega_q$ (indicated inside the square of each vertex),
   (c) and a decoration in the form of colored outlines indicating which worlds (defined below) the vertex is a member of.
2. (Ancestral Isomorphism) For every valuation vertex $\omega_q$ in $\mathcal{D}$, the ancestral subgraph of $\omega_q$ in $\mathcal{D}$ is isomorphic to the ancestral subgraph of $q$ in $\mathcal{G}$ under the map $\omega_q \mapsto q$,
   $$\mathsf{sub}_{\mathcal{D}}(\mathsf{an}_{\mathcal{D}}(\omega_q)) \simeq \mathsf{sub}_{\mathcal{G}}(\mathsf{an}_{\mathcal{G}}(q)). \tag{15}$$
3. (Consistency) Each valuation vertex $x_v$ of a visible variable $v \in \mathcal{V}$ is consistent with the output of the functional parameter $f_v \in \mathcal{F}_{\mathcal{V}}$ when applied to the valuation vertices $\mathsf{pa}_{\mathcal{D}}(x_v)$,
   $$x_v = f_v(\mathsf{pa}_{\mathcal{D}}(x_v)). \tag{16}$$
4. (Uniqueness) For each latent variable $\ell \in \mathcal{L}$, and for every valuation $\lambda_\ell \in \Omega_\ell$, there exists a unique valuation vertex in $\mathcal{D}$ corresponding to $\lambda_\ell$. Unlike latent valuation vertices, the valuations of visible variables $x_v \in \Omega_v$ may be repeated in (or absent from) $\mathcal{D}$ depending on the form of $\mathcal{F}_{\mathcal{V}}$. In such cases, duplicated $x_v$'s are always uniquely distinguished by world membership (colored outline).
5. (Worlds) A world is a subgraph of $\mathcal{D}$ that is isomorphic to $\mathcal{G}$ under the map $\omega_q \mapsto q$. Let $\mathsf{wor}(\lambda_{\mathcal{L}}) \subseteq \mathcal{D}$ denote the world containing the valuation $\lambda_{\mathcal{L}} \in \Omega_{\mathcal{L}}$. Furthermore, for any subset $V \subseteq \mathcal{V}$ of visible variables, let $\mathsf{obs}_V(\lambda_{\mathcal{L}}) \in \Omega_V$ denote the observed event supported by $\mathsf{wor}(\lambda_{\mathcal{L}})$.
6. (Completeness) For every valuation of the latent variables $\lambda_{\mathcal{L}} \in \Omega_{\mathcal{L}}$, there exists a subgraph corresponding to $\mathsf{wor}(\lambda_{\mathcal{L}})$.

Footnote: The order of the colored outlines is arbitrary.

Footnote: Every valuation vertex belongs to at least one world.

Footnote: Readers who are familiar with the Inflation technique [60] will recognize this ancestral isomorphism property from the definition of an Inflation of a causal structure. The critical difference between a possible worlds diagram and an Inflation is that vertices in the former represent valuations of variables whereas vertices in the latter represent independent copies of the variables.
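The construction in Definition 7 can be made concrete for the example of Figure 3 with a short Python sketch. Several choices below are assumptions made for illustration rather than the paper's algorithm: the dictionary names, the encoding of valuations as integers, and in particular the reading of Equation 13 in which $f_c$ depends only on $b$ and $\lambda_\nu$ (so the four entries of $f_c$ in Equation 14 are keyed by those two arguments). A valuation vertex is identified by its variable together with the values of that variable's latent ancestors, which is one way to realize the ancestral isomorphism and uniqueness properties.

```python
from itertools import product

# Parents of each visible variable, read off Equation 13 (assumed reading).
PARENTS = {"a": ("mu",), "b": ("a", "nu"), "c": ("b", "nu")}
LATENTS = {"mu": (0, 1), "nu": (0, 1)}   # binary latent sample spaces
ORDER = ("a", "b", "c")                  # a topological order of the visible variables

# Functional parameters from Equation 14; only the entries that occur are listed.
F = {
    "a": {(0,): 0, (1,): 1},
    "b": {(0, 0): 3, (0, 1): 1, (1, 0): 2, (1, 1): 0},
    "c": {(3, 0): 0, (1, 1): 1, (2, 0): 2, (0, 1): 3},
}


def latent_ancestors(v):
    """Latent variables with a directed path into the visible variable v."""
    found = set()
    for p in PARENTS.get(v, ()):
        found |= {p} if p in LATENTS else latent_ancestors(p)
    return found


def evaluate_world(assignment):
    """Evaluate every visible variable for one definite latent assignment."""
    values = dict(assignment)
    for v in ORDER:
        values[v] = F[v][tuple(values[p] for p in PARENTS[v])]
    return values


def diagram():
    """Return (worlds, vertices): the observed event of each world, and the
    distinct visible valuation vertices keyed by (variable, latent ancestry)."""
    worlds, vertices = {}, {}
    for lam in product(*LATENTS.values()):
        assignment = dict(zip(LATENTS, lam))
        values = evaluate_world(assignment)
        worlds[lam] = tuple(values[v] for v in ORDER)
        for v in ORDER:
            ancestry = tuple(sorted((l, assignment[l]) for l in latent_ancestors(v)))
            vertices[(v, ancestry)] = values[v]
    return worlds, vertices


WORLDS, VERTICES = diagram()
```

Under these assumptions the sketch yields four worlds and two, four and four valuation vertices for $a$, $b$ and $c$ respectively, matching the vertex counts visible in Figure 3c. Note that `F["c"]` is deliberately partial: inputs that never arise in any world do not appear in the diagram.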
It is important to remark that although a possible worlds diagram $\mathcal{D}$ can be constructed from the pair $(\mathcal{G}, \mathcal{F}_{\mathcal{V}})$, the two mathematical objects are not equivalent; the functional parameters $\mathcal{F}_{\mathcal{V}}$ can contain superfluous information that never appears in $\mathcal{D}$. We return to this subtle but crucial observation in Section 5.1.

The essential purpose of the possible worlds construction is as a diagrammatic tool for calculating the observational predictions of a functional causal model. Lemma 1 captures this essence.

Lemma 1. Given a functional causal model $(\mathcal{G} = (\mathcal{V} \cup \mathcal{L}, \mathcal{E}), \mathcal{F}_{\mathcal{V}}, \mathcal{P}_{\mathcal{L}})$ (see Definition 5), let $\mathcal{D}$ be the possible worlds diagram for $(\mathcal{G}, \mathcal{F}_{\mathcal{V}})$. The causal compatibility criterion (Equation 11) for $\mathcal{G}$ is equivalent to a probabilistic sum over worlds in $\mathcal{D}$:

$$P_{\mathcal{V}} = \sum_{\lambda_{\mathcal{L}} \in \Omega_{\mathcal{L}}} \prod_{\ell \in \mathcal{L}} P_\ell(\lambda_\ell) \, [\mathsf{obs}_{\mathcal{V}}(\lambda_{\mathcal{L}})]. \tag{17}$$

The remainder of this paper explores the consequences of adopting the possible worlds framework as a method for tackling the causal compatibility problem.

Footnote: The uniqueness property guarantees that each world $\mathsf{wor}(\lambda_{\mathcal{L}})$ is uniquely determined by $\lambda_{\mathcal{L}}$.

Footnote: Sometimes it is useful to construct an incomplete possible worlds diagram; for example, Figure 10.

Figure 5: A causal structure $\mathcal{G}_5$ with three visible vertices $\mathcal{V} = \{a, b, c\}$ and two latent vertices $\mathcal{L} = \{\mu, \nu\}$.

Section 3 introduced the possible worlds framework as a technique for calculating the observable predictions of a functional causal model by means of Lemma 1. In this section, we use the possible worlds framework to develop a combinatorial algorithm for completely solving the possibilistic causal compatibility problem.
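Before turning to supports, the sum over worlds in Lemma 1 (Equation 17) can be illustrated numerically. The following Python sketch is an illustration under stated assumptions, not the paper's implementation: the function name `distribution_from_worlds`, the table `OBS` (the observed events obtained by evaluating Equation 14 world by world, with $f_c$ read as in Equation 13), and the values $z_\mu = 1/3$, $z_\nu = 1/4$ are all hypothetical choices.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product


def distribution_from_worlds(obs, latent_dists):
    """Equation 17: weight each world's point distribution [obs(lambda)]
    by the product of the latent probabilities of its latent valuation.

    obs          -- function from a tuple of latent values to an observed event
    latent_dists -- one dict {latent value: probability} per latent variable
    """
    P = defaultdict(Fraction)  # missing events default to probability 0
    for lam in product(*[d.keys() for d in latent_dists]):
        weight = Fraction(1)
        for dist, value in zip(latent_dists, lam):
            weight *= dist[value]
        P[obs(lam)] += weight
    return dict(P)


# Observed events (a, b, c) per world (mu, nu), assumed from Equation 14.
OBS = {(0, 0): (0, 3, 0), (0, 1): (0, 1, 1), (1, 0): (1, 2, 2), (1, 1): (1, 0, 3)}

# Hypothetical latent distributions: z_mu = 1/3 and z_nu = 1/4.
P_MU = {0: Fraction(1, 3), 1: Fraction(2, 3)}
P_NU = {0: Fraction(1, 4), 1: Fraction(3, 4)}

P_ABC = distribution_from_worlds(OBS.__getitem__, [P_MU, P_NU])
```

With these numbers the four worlds receive weights $z_\mu z_\nu = 1/12$, $z_\mu(1 - z_\nu) = 1/4$, $(1 - z_\mu) z_\nu = 1/6$ and $(1 - z_\mu)(1 - z_\nu) = 1/2$, reproducing the four terms of Equation 12. Exact fractions are used so that normalization can be checked without rounding error.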
Definition 8. Given a probability distribution $P_{\mathcal{V}} : \Omega_{\mathcal{V}} \to [0, 1]$, its support $\sigma(P_{\mathcal{V}})$ is defined as the subset of events which are possible,

$$\sigma(P_{\mathcal{V}}) = \{ x_{\mathcal{V}} \in \Omega_{\mathcal{V}} \mid P_{\mathcal{V}}(x_{\mathcal{V}}) > 0 \}. \tag{18}$$

An observed distribution $P_{\mathcal{V}}$ is said to be possibilistically compatible with $\mathcal{G}$ if there exists a functional causal model $(\mathcal{G}, \mathcal{F}_{\mathcal{V}}, \mathcal{P}_{\mathcal{L}})$ for which Equation 11 produces a distribution with the same support as $P_{\mathcal{V}}$. The possibilistic variant of the causal compatibility problem is naturally related to the probabilistic causal compatibility problem defined in Definition 6; if a distribution is possibilistically incompatible with $\mathcal{G}$, then it is also probabilistically incompatible. We now proceed to apply the possible worlds framework to prove possibilistic incompatibility between a number of distribution/causal structure pairs.

Consider the causal structure $\mathcal{G}_5$ depicted in Figure 5. For $\mathcal{G}_5$, the causal compatibility criterion (Equation 11) takes the form

$$P_{abc}(x_a x_b x_c) = \sum_{\lambda_\mu \in \Omega_\mu} \sum_{\lambda_\nu \in \Omega_\nu} P_\mu(\lambda_\mu) P_\nu(\lambda_\nu) \, \delta(x_a, f_a(\lambda_\mu)) \, \delta(x_b, f_b(\lambda_\mu, \lambda_\nu)) \, \delta(x_c, f_c(\lambda_\nu)). \tag{19}$$

The following family of distributions, for arbitrary $x_b, y_b \in \Omega_b$,

$$P^{(20)}_{abc} = z [0_a x_b 0_c] + (1 - z) [1_a y_b 1_c], \quad 0 < z < 1, \tag{20}$$

are incompatible with $\mathcal{G}_5$. Traditionally, distributions like $P^{(20)}_{abc}$ are proven incompatible on the basis that they violate an independence constraint that is implied by $\mathcal{G}_5$ [43], namely,

$$\forall P_{abc} \in \mathcal{M}(\mathcal{G}_5), \quad P_{ac}(x_a x_c) = P_a(x_a) P_c(x_c). \tag{21}$$

[Figure 6 panels: (a) An incomplete possible worlds diagram for $\mathcal{G}_5$ initialized by $P^{(20)}_{abc}$. The worlds are colored: $\mathsf{wor}(0_\mu 0_\nu)$ green, $\mathsf{wor}(1_\mu 1_\nu)$ violet. (b) Considering possible worlds produces a contradiction with $P^{(20)}_{abc}$. The additional worlds are colored: $\mathsf{wor}(0_\mu 1_\nu)$ orange, $\mathsf{wor}(1_\mu 0_\nu)$ yellow.]

Figure 6: The possible worlds diagram for $\mathcal{G}_5$ (Figure 5) is incompatible with $P^{(20)}_{abc}$ (Equation 20).

Intuitively, $\mathcal{G}_5$ provides no latent mechanism by which $a$ and $c$ can attempt to correlate (or anti-correlate). We now prove the possibilistic incompatibility of the support $\sigma(P^{(20)}_{abc})$ with $\mathcal{G}_5$ using the possible worlds framework.

Proof.
Proof by contradiction; assume that a functional causal model $\mathcal{F}_{\mathcal{V}} = \{f_a, f_b, f_c\}$ for $\mathcal{G}_5$ exists such that Equation 19 produces $P^{(20)}_{abc}$. Since there are two distinct valuations of the joint variables $abc$ in $P^{(20)}_{abc}$, namely $0_a x_b 0_c$ and $1_a y_b 1_c$, consider each as being sampled from two possible worlds. Without loss of generality, let $0_\mu 0_\nu \in \Omega_\mu \times \Omega_\nu$ denote any valuation of the latent variables such that $\mathsf{obs}_{abc}(0_\mu 0_\nu) = 0_a x_b 0_c$. Similarly, let $1_\mu 1_\nu \in \Omega_\mu \times \Omega_\nu$ denote any valuation of the latent variables such that $\mathsf{obs}_{abc}(1_\mu 1_\nu) = 1_a y_b 1_c$. Using these observations, initialize a possible worlds diagram using $\mathsf{wor}(0_\mu 0_\nu)$, colored green, and $\mathsf{wor}(1_\mu 1_\nu)$, colored violet, as seen in Figure 6a. In order to complete Figure 6a, one simply needs to specify the behavior of $b$ in two of the "off-diagonal" worlds, namely $\mathsf{wor}(0_\mu 1_\nu)$, colored orange, and $\mathsf{wor}(1_\mu 0_\nu)$, colored yellow (see Figure 6b). Regardless of this choice, the observed event $\mathsf{obs}_{ac}(0_\mu 1_\nu) = 0_a 1_c$ in the orange world $\mathsf{wor}(0_\mu 1_\nu)$ predicts $P_{ac}(0_a 1_c) > 0$, which contradicts $P^{(20)}_{abc}$. Therefore, because the proof technique did not rely on the value of $0 < z < 1$, $P^{(20)}_{abc}$ is possibilistically incompatible with $\mathcal{G}_5$.

Footnote: There is no loss of generality in choosing $0_\mu 0_\nu$ and $1_\mu 1_\nu$ (instead of $0_\mu 1_\nu$ and $1_\mu 0_\nu$) as the valuations for the worlds because the valuation "labels" associated with latent events are arbitrary. The valuations can not be $0_\mu 1_\nu$ and $1_\mu 1_\nu$ because of the cross-world consistency constraint $\mathsf{obs}_c(0_\mu 1_\nu) = \mathsf{obs}_c(1_\mu 1_\nu) = f_c(1_\nu)$.

Footnote: The probabilities associated to each world by Lemma 1 can always be assumed positive, because otherwise, those valuations would be excluded from the latent sample space $\Omega_{\mathcal{L}}$.

Figure 7: The Instrumental Scenario.

[Figure 8 panels: (a) Worlds $\mathsf{wor}(0_\mu 0_\nu)$ and $\mathsf{wor}(1_\mu 1_\nu)$ are initialized by the observed events in Equation 23. (b) Populating the events in $\mathsf{wor}(0_\mu 1_\nu)$ and $\mathsf{wor}(1_\mu 0_\nu)$ leads to a contradiction with Equation 23.]

Figure 8: A possible worlds diagram for $\mathcal{G}_7$ (Figure 7). The worlds are colored: $\mathsf{wor}(0_\mu 0_\nu)$ yellow, $\mathsf{wor}(1_\mu 1_\nu)$ orange, $\mathsf{wor}(1_\mu 0_\nu)$ violet, $\mathsf{wor}(0_\mu 1_\nu)$ green.

The causal structure $\mathcal{G}_7$ depicted in Figure 7 is known as the Instrumental Scenario [8, 40, 41]. For $\mathcal{G}_7$, Equation 11 takes the form

$$P_{abc}(x_a x_b x_c) = \sum_{\lambda_\mu \in \Omega_\mu} \sum_{\lambda_\nu \in \Omega_\nu} P_\mu(\lambda_\mu) P_\nu(\lambda_\nu) \, \delta(x_a, f_a(\lambda_\mu)) \, \delta(x_b, f_b(x_a, \lambda_\nu)) \, \delta(x_c, f_c(x_b, \lambda_\nu)). \tag{22}$$

The following family of distributions,

$$P^{(23)}_{abc} = z [0_a 0_b 0_c] + (1 - z) [1_a 0_b 1_c], \quad 0 < z < 1, \tag{23}$$

are possibilistically incompatible with $\mathcal{G}_7$. The Instrumental Scenario $\mathcal{G}_7$ is different from $\mathcal{G}_5$ in that there are no observable conditional independence constraints which can prove the possibilistic incompatibility of $P^{(23)}_{abc}$. Instead, the possibilistic incompatibility of $P^{(23)}_{abc}$ is traditionally witnessed by an Instrumental inequality originally derived in [41],

$$\forall P_{abc} \in \mathcal{M}(\mathcal{G}_7), \quad P_{bc|a}(0_b 0_c \mid 0_a) + P_{bc|a}(0_b 1_c \mid 1_a) \le 1. \tag{24}$$

Independently of Equation 24, we now prove possibilistic incompatibility of $P^{(23)}_{abc}$ with $\mathcal{G}_7$ using the possible worlds framework.

Proof.
Proof by contradiction; assume that a functional model $\mathcal{F}_{\mathcal{V}} = \{f_a, f_b, f_c\}$ for $\mathcal{G}$ exists such that Equation 22 produces $P^{(23)}_{abc}$ (Equation 23). Analogously to the proof in Section 4.1, there are only two distinct valuations of the joint variables $abc$, namely $0_a 0_b 0_c$ and $1_a 0_b 1_c$. Therefore, define two worlds: one where $\mathrm{obs}_{abc}(0_\mu 0_\nu) = 0_a 0_b 0_c$ and another where $\mathrm{obs}_{abc}(1_\mu 1_\nu) = 1_a 0_b 1_c$. Using these two worlds, a possible worlds diagram can be initialized as in Figure 8a, where $\mathrm{wor}(0_\mu 0_\nu)$ is colored yellow and $\mathrm{wor}(1_\mu 1_\nu)$ is colored orange. In order to complete the possible worlds diagram of Figure 8a, one first needs to specify how $b$ behaves in two possible worlds: $\mathrm{wor}(0_\mu 1_\nu)$, colored green, and $\mathrm{wor}(1_\mu 0_\nu)$, colored violet,

$$\mathrm{obs}_b(1_\mu 0_\nu) = f_b(1_a 0_\nu) = {?}_b, \qquad \mathrm{obs}_b(0_\mu 1_\nu) = f_b(0_a 1_\nu) = {?}_b. \tag{25}$$

By appealing to $P^{(23)}_{abc}$, it must be that $\mathrm{obs}_b(1_\mu 0_\nu) = \mathrm{obs}_b(0_\mu 1_\nu) = 0_b$, as no other valuations for $b$ are in the support of $P^{(23)}_{abc}$. Finally, the remaining 'unknown' observations for $c$ in the violet world, $\mathrm{obs}_c(1_\mu 0_\nu) = f_c(0_b 0_\nu)$, and in the green world, $\mathrm{obs}_c(0_\mu 1_\nu) = f_c(0_b 1_\nu)$, are determined respectively by the behavior of $c$ in the yellow $\mathrm{wor}(0_\mu 0_\nu)$ and orange $\mathrm{wor}(1_\mu 1_\nu)$ worlds, as depicted in Figure 8b. Explicitly,

$$\mathrm{obs}_c(1_\mu 0_\nu) = f_c(0_b 0_\nu) = \mathrm{obs}_c(0_\mu 0_\nu) = 0_c, \qquad \mathrm{obs}_c(0_\mu 1_\nu) = f_c(0_b 1_\nu) = \mathrm{obs}_c(1_\mu 1_\nu) = 1_c. \tag{26}$$

Therefore the observed events in the green and violet worlds are fixed to be,

$$\mathrm{obs}_{abc}(1_\mu 0_\nu) = 1_a 0_b 0_c, \qquad \mathrm{obs}_{abc}(0_\mu 1_\nu) = 0_a 0_b 1_c. \tag{27}$$

Unfortunately, neither of these events is in the support of $P^{(23)}_{abc}$, which is a contradiction; therefore $P^{(23)}_{abc}$ is possibilistically incompatible with $\mathcal{G}$.

Notice that unlike the proof from Section 4.1, here we needed to appeal to the cross-world consistency constraints (Equation 26) demanded by the possible worlds framework.
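The argument above can also be checked mechanically. The following is a minimal sketch, not the author's implementation: it assumes binary visible variables, takes the two events of Equation 23 to be $0_a 0_b 0_c$ and $1_a 0_b 1_c$, assigns one latent value per observed event, and replaces back-tracking with a plain exhaustive search over $f_a$, $f_b$, $f_c$. The function name `instrumental_compatible` is hypothetical.

```python
from itertools import product

def instrumental_compatible(support):
    """Exhaustive possibilistic check for the Instrumental Scenario.

    `support` lists the distinct observed events (a, b, c), all binary.
    The "diagonal" world i (mu = nu = i) must realise support[i], and every
    off-diagonal world (mu != nu) must also produce an event in the support.
    """
    n_worlds = len(support)
    allowed = set(support)
    latent = range(n_worlds)
    # f_a : mu -> a,  f_b : (a, nu) -> b,  f_c : (b, nu) -> c
    b_keys = list(product((0, 1), latent))
    c_keys = list(product((0, 1), latent))
    for f_a in product((0, 1), repeat=n_worlds):
        for b_vals in product((0, 1), repeat=len(b_keys)):
            f_b = dict(zip(b_keys, b_vals))
            for c_vals in product((0, 1), repeat=len(c_keys)):
                f_c = dict(zip(c_keys, c_vals))

                def obs(mu, nu):
                    a = f_a[mu]
                    b = f_b[(a, nu)]
                    c = f_c[(b, nu)]
                    return (a, b, c)

                if any(obs(i, i) != support[i] for i in latent):
                    continue  # the initial worlds are not reproduced
                if all(obs(m, n) in allowed for m in latent for n in latent):
                    return True  # a completed, consistent diagram exists
    return False

# Assumed support of Equation 23, and a compatible support for comparison.
print(instrumental_compatible([(0, 0, 0), (1, 0, 1)]))
print(instrumental_compatible([(0, 0, 0), (1, 1, 1)]))
```

With only 1024 candidate models for two events, exhaustive search is affordable here; for larger supports the back-tracking described in the text would prune far earlier.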
16 a byρµ ν Figure 9: The Bell causal structure has variables a, b ‘measuring’ hidden variable ρ with ‘measurement settings’ x, y determined independently of ρ . a b x y µ ν ρ a b x y µ ν ρ a ? b ? Figure 10: An incomplete possible worlds diagram for the Bell structure G (Figure 9)initialized by the observed events obs xaby (0 µ ρ ν ) = 0 x a b y and obs xaby (1 µ ρ ν ) =1 x a b y . The worlds are colored: wor (0 µ ρ ν ) green, wor (1 µ ρ ν ) violet, wor (1 µ ρ ν )magenta, wor (0 µ ρ ν ) yellow, and wor (0 µ ρ ν ) orange.17 .3 The Bell Structure Consider the causal structure G depicted in Figure 9 known as the Bell structure [7].From the perspective of causal inference, Bell’s theorem [7] states that any distributioncompatible with G must satisfy an inequality constraint known as a Bell inequality.For example, the inequality due to Clauser, Horne, Shimony and Holt, referred to asthe CHSH inequality, constrains correlations held between a and b as x, y vary [15] , ∀ P xaby ∈ M ( G ) , S = (cid:104) ab | x y (cid:105) + (cid:104) ab | x y (cid:105) + (cid:104) ab | x y (cid:105) − (cid:104) ab | x y (cid:105) , | S | ≤ S = 2 √ S = 4 using Popescu-Rohrlich box correlations [49]. The following distribution isan example of a Popescu-Rohrlich box correlation, P (29) xaby = 18 ([0 x a b y ] + [0 x a b y ] + [0 x a b y ] + [0 x a b y ]++[1 x a b y ] + [1 x a b y ] + [1 x a b y ] + [1 x a b y ]) . (29)Unlike G , there are conditional independence constraints placed on correlations com-patible with G , namely the no-signaling constraints P a | xy = P a | x and P b | xy = P b | y .Because P (29) xaby satisfies the no-signaling constraints, the incompatibility of P (29) xaby with G is traditionally proven using Equation 28. We now proceed to prove its incompat-ibility using the possible worlds framework. Proof.
Proof by contradiction; assume that a functional causal model F V = { f a , f b , f x , f y } for G exists which supports P (29) xaby and use the possible worlds framework. Unlike theprevious proofs, we only need to consider a subset of the events in P (29) xaby to initialize apossible worlds diagram. Consider the following pair of events and associated latentvaluations which support them , obs xaby (0 µ ρ ν ) = 0 a b x y , obs xaby (1 µ ρ ν ) = 1 a b x y . (30)Using Equation 30, initialize the possible worlds diagram in Figure 10 with worlds wor (0 µ ρ ν ) colored green and wor (1 µ ρ ν ) colored violet. An unavoidable contradic-tion arises when attempting to populate the values for f a (0 x ρ ) in the yellow world wor (0 µ ρ ν ) and f b (0 y ρ ) in the magenta world wor (1 µ ρ ν ). First, the observed event obs xaby (0 µ ρ ν ) = 0 x ? a b y in the yellow world wor (0 µ ρ ν ) must belong to the listof possible events prescribed by P (29) xaby ; a quick inspection leads one to recognize that The two variable correlation is defined as (cid:104) ab | x x x y (cid:105) = (cid:80) i,j =1 ( − i + j P ab | xy ( i a j b | x x x y ). Clearly, the values of λ µ and λ ν that support these worlds must be unique. Less obvious isthe possibility for these worlds to share a λ ρ value. Albeit if they do, the event 0 x a b y becomespossible, contradicting P (29) xaby as well. bcµ νρ Figure 11: The Triangle structure G involving three visible variables V = { a, b, c } each sharing a pair of latent variables from L = { µ, ν, ρ } .the only possibility is obs a (0 µ ρ ν ) = f a (0 x ρ ) = 1 a . An analogous argument in themagenta world wor (1 µ ρ ν ) proves that obs b (1 µ ρ ν ) = f b (0 y ρ ) = 0 b . Therefore, theobserved event in the orange world wor (0 µ ρ ν ) must be, obs abcd (0 µ ρ ν ) = 0 x a b y , (31)and therefore P xaby (0 x a b y ) > P (29) xaby . Therefore, P (29) xaby is possi-bilistically incompatible with G . 
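As a numerical companion to the discussion of the CHSH quantity and the no-signaling constraints, the sketch below builds a Popescu-Rohrlich box and evaluates both. It assumes the common convention $a \oplus b = x \cdot y$ with uniformly distributed settings (eight events of weight 1/8), which may differ from the labelling of Equation 29 by a relabelling of outcomes; the function names are hypothetical.

```python
from fractions import Fraction
from itertools import product

def pr_box():
    """P(x, a, b, y) for a PR box: a XOR b = x AND y, uniform settings."""
    return {
        (x, a, b, y): Fraction(1, 8)
        for x, a, b, y in product((0, 1), repeat=4)
        if (a ^ b) == (x & y)
    }

def correlator(dist, x, y):
    """<ab | x y> = sum over a, b of (-1)**(a + b) * P(a, b | x, y)."""
    norm = sum(p for (x_, _, _, y_), p in dist.items() if (x_, y_) == (x, y))
    signed = sum(
        (-1) ** (a + b) * p
        for (x_, a, b, y_), p in dist.items()
        if (x_, y_) == (x, y)
    )
    return signed / norm

def chsh(dist):
    """CHSH combination of the four setting-dependent correlators."""
    return (correlator(dist, 0, 0) + correlator(dist, 0, 1)
            + correlator(dist, 1, 0) - correlator(dist, 1, 1))

def marginal_a(dist, a, x, y):
    """P(a | x, y); no-signaling requires this to be independent of y."""
    norm = sum(p for (x_, _, _, y_), p in dist.items() if (x_, y_) == (x, y))
    hit = sum(p for (x_, a_, _, y_), p in dist.items() if (x_, a_, y_) == (x, a, y))
    return hit / norm

box = pr_box()
print(chsh(box))  # algebraic maximum of the CHSH quantity
print(all(marginal_a(box, a, x, 0) == marginal_a(box, a, x, 1)
          for a in (0, 1) for x in (0, 1)))
```

Because the box satisfies no-signaling while reaching the algebraic maximum of $S$, conditional independence tests alone cannot detect its incompatibility, which is why the text turns to either Equation 28 or the possible worlds argument.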
Consider the causal structure $\mathcal{G}$ depicted in Figure 11, known as the Triangle structure. The Triangle has been studied extensively in recent decades [10, 12, 23, 24, 30, 37, 55, 58, 60]. The following family of distributions are possibilistically incompatible with $\mathcal{G}$,

$$P^{(32)}_{abc} = p_1 [1_a 0_b 0_c] + p_2 [0_a 1_b 0_c] + p_3 [0_a 0_b 1_c], \qquad \sum_{i=1}^{3} p_i = 1, \quad p_i > 0. \tag{32}$$

Proof.
Proof by contradiction: assume that a functional causal model F V = { f a , f b , f c } for G exists supporting P (32) abc and use the possible worlds framework. For each distinctevent in P (32) abc , consider a world in which it happens definitely. Explicitly define, obs abc (0 µ ρ ν ) = 1 a b c , (33) obs abc (1 µ ρ ν ) = 0 a b c , (34) obs abc (2 µ ρ ν ) = 0 a b c , (35)corresponding to the exterior worlds in Figure 12. Consider magenta world wor (0 µ ρ ν )with partially specified observation obs abc (0 µ ρ ν ) =? a ? b c . Recalling P (32) abc , whenever The proof holds if the probabilities of the events in P (29) xaby are any positive value. The Inflation Technique first proved the incompatibility between P (32) abc and G . b c µ ν ρ a b c µ ν ρ a b c µ ν ρ a ? b ? c ? a ? b ? c ? Figure 12: An incomplete possible worlds diagram for the Triangle structure G (Figure 11) initialized by the triplet of observed events in Equation 35. The worldsare colored: wor (0 µ ν ρ ) brown, wor (1 µ ν ρ ) yellow, wor (2 µ ν ρ ) orange, wor (0 µ ν ρ )magenta, wor (2 µ ν ρ ) blue, wor (0 µ ν ρ ) violet, and wor (0 µ ν ρ ) green.20 b c dµ νρ Figure 13: The Evans Causal Structure G . c takes value 1 c , both a and b take the value 0; i.e. 0 a b . Therefore, it must be thatthe observed event in the magenta world wor (0 µ ρ ν ) is obs abc (0 µ ρ ν ) = 0 a b c . Ananalogous argument holds for other worlds, obs abc (0 µ ρ ν ) =? a ? b c ⇒ obs abc (0 µ ρ ν ) = 0 a b c , obs abc (2 µ ρ ν ) =? a b ? c ⇒ obs abc (2 µ ρ ν ) = 0 a b c , obs abc (0 µ ρ ν ) = 1 a ? b ? c ⇒ obs abc (0 µ ρ ν ) = 1 a b c . (36)However, the conclusions drawn by Equation 36 predict the observed event the incentral, green world wor (0 µ ρ ν ) must be, obs abc (0 µ ρ ν ) = 0 a b c , (37)and therefore P abc (0 a b c ) > P (32) abc . Therefore, P (32) abc is possibilisti-cally incompatible with G . Consider the causal structure in Figure 13, denoted G . 
This causal structure wasfirst mentioned by Evans [21], along with two others, as one for which no existingtechniques were able to prove whether or not it was saturated; that is, whether ornot all distributions were compatible with it. Here it is shown that there are indeeddistributions which are possibilistically incompatible with G using the frameworkof possible worlds diagrams. As such, this framework currently stands as the mostpowerful method for deciding possibilistic compatibility.Consider the family of distributions with three possible events: P (38) abcd = p [0 a b c y d ] + p [1 a b c d ] + p [0 a b c d ] , (cid:88) i =1 p i = 1 , p i > . (38)Regardless of the values for p , p , p (and y d ∈ Ω d arbitrary), P (38) abcd is incompatiblewith G . Proof.
Proof by contradiction. First assume that a deterministic model F V = { f a , f b , f c , f d } for P (38) abcd exists and adopt the possible worlds framework. Let wor ( i µ i ν i ρ ) for i ∈ b c d µ ν ρ a b c dyµ ν ρ a b c d µ ν ρ a ? b ? c ? d ? b ? c ? d ? Figure 14: A possible worlds diagram for G initialized by the distribution inEquation 38. The worlds are colored: wor (0 µ ν ρ ) magenta, wor (1 µ ν ρ ) orange, wor (2 µ ν ρ ) yellow, wor (1 µ ν ρ ) violet, and wor (1 µ ν ρ ) green.22 , , } index the possible worlds which support the events observed in P abcd , obs abcd (0 µ ν ρ ) = 0 a b c y d , obs abcd (1 µ ν ρ ) = 1 a b c d , obs abcd (2 µ ν ρ ) = 0 a b c d . (39)Only two additional possible worlds are necessary for achieving a contradiction. Con-sulting Figure 14 for details, these possible worlds are wor (1 µ ν ρ ) colored violet and wor (1 µ ν ρ ) colored green. Notice that the determined value for a must be the samein both worlds as it is independent of λ ν : x a = f a (1 µ ρ ) = obs a (1 µ ν ρ ) = obs a (1 µ ν ρ ) . (40)There are only two possible values for x a in any world, namely x a = 0 a or x a = 1 a asgiven by P (38) abcd . First suppose that x a = 0 a . Then in the violet world wor (1 µ ν ρ ), thevalue of b , to be obs b (1 µ ν ρ ) = f b (0 a ν ) = 0 b is completely constrained by consistencywith the magenta world wor (0 µ ν ρ ). Therefore, obs ab (1 µ ν ρ ) = 0 a b . By analogouslogic, in the violet world the value of c is constrained to be obs c (1 µ ν ρ ) = f c (0 b µ ) =0 c by the orange world wor (1 µ ν ρ ). Therefore, obs abc (1 µ ν ρ ) = 0 a b c , which is acontradiction because 0 a b c is an impossible event in P (38) abcd . Therefore, it must be that x a = 1 a . An unavoidable contradiction follows from attempting to populate the greenworld wor (1 µ ν ρ ) in Figure 14 with the established knowledge that obs a (1 µ ν ρ ) =1 a . 
The value of obs b (1 µ ν ρ ) = f b (1 a ν ) has yet to be specified by any possibleworlds, but choosing f b (1 a ν ) = 1 b would yield an impossible event obs a (1 µ ν ρ ) =1 a b . Therefore, it must be that f b (1 a ν ) = 0 b and obs a (1 µ ν ρ ) = 1 a b . Similarly, theorange world wor (1 µ ν ρ ) fixes f c (0 b µ ) = 1 c and therefore obs abc (1 µ ν ρ ) = 1 a b c .Finally, the yellow world wor (2 µ ν ρ ) already determines obs d (1 µ ν ρ ) = f d (0 c ν ρ ) =1 d and therefore one concludes that, obs abcd (1 µ ν ρ ) = 1 a b c d , (41)which is an impossible event in P (38) abcd . This contradiction implies that no functionalmodel F V = { f a , f b , f c , f d } exists and therefore P (38) abcd is possibilistically incompatiblewith G .To reiterate, there are currently no other methods known [21] which are capableof proving the incompatibility of any distribution with G . Therefore, the possi-ble worlds framework can be seen as the state-of-the-art technique for determiningpossibilistic causation. It is worth noting we have also proven the non-saturation of the other two causal structuresmention in [21] using analogous proofs. .6 Necessity and Sufficiency Throughout this section, we explored a number of proofs of possibilistic incompatibil-ity using the possible worlds framework. Moreover, the above examples communicatea systematic algorithm for deciding possibilistic compatibility. Given a distribution P V with support σ ( P V ) ⊂ Ω V , and a causal structure G = ( V ∪ L , E ), the followingalgorithm sketch determines if P V is possibilistically compatible with G .1. Let W = | σ ( P V ) | < | Ω V | denote the number of possible events provided by P V .2. For each 1 ≤ i ≤ W , create a possible world wor (cid:16) λ ( i ) L (cid:17) where λ ( i ) L = { i (cid:96) | (cid:96) ∈ L} ,thus defining the latent sample space Ω L .3. Attempt to complete the possible worlds diagram D initialized by the worlds (cid:110) wor (cid:16) λ ( i ) L (cid:17)(cid:111) Wi =1 .4. 
If an impossible event $x_{\mathcal{V}} \notin \sigma(P_{\mathcal{V}})$ is produced by any "off-diagonal" world $\mathrm{wor}(\ldots i_{\ell} \ldots j_{\ell'} \ldots)$ where $i \neq j$, or if a cross-world consistency constraint is broken, back-track.

Upon completing the search, there are two possibilities. The first possibility is that the algorithm returns a completed, consistent possible worlds diagram $D$. Then by Lemma 1, $P_{\mathcal{V}}$ is possibilistically compatible with $\mathcal{G}$. The second possibility is that an unavoidable contradiction arises, and $P_{\mathcal{V}}$ is not possibilistically compatible with $\mathcal{G}$.

Footnote: A simple C implementation of the above pseudo-algorithm for boolean visible variables ($|\Omega_v| = 2, \forall v \in \mathcal{V}$) can be found at github.com/tcfraser/possibilistic causality. In particular, the provided software can output a DIMACS formatted CNF file for usage in most popular boolean satisfiability solvers.

In Section 4, we demonstrated that the possible worlds framework was capable of providing a complete possibilistic solution to the causal compatibility problem. If, however, a given distribution $P_{\mathcal{V}}$ happens to satisfy a causal hypothesis on a possibilistic level, can the possible worlds framework be used to determine if $P_{\mathcal{V}}$ satisfies the causal hypothesis on a probabilistic level as well? In this section, we answer this question affirmatively. In particular, we provide a hierarchy of feasibility tests for probabilistic compatibility which converges exactly. In addition, we illustrate that a possible worlds diagram is the natural data structure for algorithmically implementing this converging hierarchy.

5.1 Symmetry and Superfluity

This aforementioned hierarchy of tests, to be explained in Section 5.3, relies on the enumeration of all probability distributions $P_{\mathcal{V}}$ which admit uniform functional causal models $(\mathcal{G}, \mathcal{F}_{\mathcal{V}}, P_{\mathcal{L}})$ for fixed cardinalities $k_{\mathcal{V} \cup \mathcal{L}} = \{k_q = |\Omega_q| \mid q \in \mathcal{V} \cup \mathcal{L}\}$.
Afunctional causal model is uniform if the probability distributions P (cid:96) ∈ P L over thelatent variables are uniform distributions; P (cid:96) : Ω (cid:96) → k − (cid:96) . Section 5.2 discusses whyuniform functional causal models are worth considering, whereas in this section, wediscuss how to efficiently enumerate all probability distributions P V that are uniformlygenerated from fixed cardinalities k V∪L .One method for generating all such distributions is to perform a brute force enu-meration of all deterministic strategies F V for fixed cardinalities k V∪L . Depending onthe details of the causal structure, the number of deterministic functions of this formis poly-exponential in the cardinalities k V∪L . This method is inefficient because isfails to consider that many distinct deterministic strategies produce the exact samedistribution P V . There are two optimizations that can be made to avoid regenerationsof the same distribution P V while enumerating all deterministic strategies F V . Theseoptimizations are best motivated by an example using the possible worlds framework.Consider the causal structure G a in Figure 15a with visible variables V = { a, b, c } and latent variables L = { µ, ν } . Furthermore, for concreteness, suppose that k µ = k ν = k a = k a = 2 and k c = 4. Finally let F V = { f a , f b , f c } be such that, f a (0 µ ) = 0 a , f a (1 µ ) = 1 a , f b (0 µ ) = 0 b , f b (1 µ ) = 1 b ,f c (0 a b ν ) = 2 c , f c (0 a b ν ) = 0 c , f c (1 a b ν ) = 3 c , f c (1 a b ν ) = 1 c f c (0 a b ν ) = 0 c , f c (0 a b ν ) = 1 c , f c (1 a b ν ) = 2 c , f c (1 a b ν ) = 3 c . (42)The possible worlds diagram D for G a generated by Equation 42 is depicted in Fig-ure 15b. If the latent valuations are distributed uniformly, the probability distributionassociated with Figure 15b (as given by Equation 17) is equal to, P abc = 14 ([ wor (0 µ ν )] + [ wor (0 µ ν )] + [ wor (1 µ ν )] + [ wor (1 µ ν )])= 14 ([0 a b c ] + [0 a b c ] + [1 a b c ] + [1 a b c ]) . 
(43)

The first optimization comes from noticing that Equation 42 specifies how $c$ would respond if provided with the valuation $1_a 0_b 0_\nu$ of its parents, namely $f_c(1_a 0_b 0_\nu) = 3_c$. Nonetheless, this hypothetical scenario is excluded from Figure 15b (crossed out in the figure) because the functional model in Equation 42 never produces an opportunity for $a$ to be different from $b$. Consequently, the functional dependences in Equation 42 contain superfluous information irrelevant to the observed probability distribution in Equation 43.

Therefore, a brute force enumeration of deterministic strategies would regenerate Equation 43 several times, once for each assignment of $c$'s behavior in these superfluous scenarios. It is possible to avoid these regenerations by using an unpopulated possible worlds diagram $\tilde{D}$ as a data structure and performing a brute force enumeration of all consistent valuations of $\tilde{D}$.

The second optimization comes from noticing that Equation 43 contains many symmetries. Notably, independently permuting the latent valuations, $\pi_\mu : 0_\mu \leftrightarrow 1_\mu$ or $\pi_\nu : 0_\nu \leftrightarrow 1_\nu$, leaves the observed distribution in Equation 43 invariant, but maps the functional dependences $\mathcal{F}_{\mathcal{V}}$ of Equation 42 to different functional dependences $\mathcal{F}^{\pi_\mu}_{\mathcal{V}}$ and $\mathcal{F}^{\pi_\nu}_{\mathcal{V}}$. These symmetries are reflected as permutations of the worlds, as depicted in Figures 15c and 15d.

Analogously, it is possible to avoid these regenerations by first pre-computing the induced action on $\tilde{D}$, and thus an induced action on $\mathcal{F}_{\mathcal{V}}$, under the permutation group $S_{\mathcal{L}} = \prod_{\ell \in \mathcal{L}} \mathrm{perm}(\Omega_\ell)$. Then, using the permutation group $S_{\mathcal{L}}$, one only needs to generate one representative from each equivalence class of possible worlds diagrams $D$ under $S_{\mathcal{L}}$.

Importantly, the optimizations illuminated above, namely ignoring superfluous specifications and exploiting symmetries, are universal; they can be applied for any causal structure.
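Both redundancies can be observed directly on a small instance. The sketch below is an illustration under stated assumptions rather than the paper's enumeration procedure: it uses the structure of Figure 15a ($a$ and $b$ depend on $\mu$; $c$ depends on $a$, $b$, $\nu$) but with every variable binary, including $c$, to keep the search tiny. All function and variable names are hypothetical.

```python
from itertools import product

C_KEYS = list(product((0, 1), repeat=3))  # parent valuations (a, b, nu) of c

def observed(f_a, f_b, f_c):
    """Sorted events (a, b, c) of the four equally likely worlds (mu, nu)."""
    events = []
    for mu, nu in product((0, 1), repeat=2):
        a, b = f_a[mu], f_b[mu]
        events.append((a, b, f_c[(a, b, nu)]))
    return tuple(sorted(events))

def count_strategies_and_distributions():
    """Brute-force count of strategies versus distinct observed distributions."""
    strategies, distinct = 0, set()
    for f_a in product((0, 1), repeat=2):
        for f_b in product((0, 1), repeat=2):
            for c_vals in product((0, 1), repeat=len(C_KEYS)):
                strategies += 1
                distinct.add(observed(f_a, f_b, dict(zip(C_KEYS, c_vals))))
    return strategies, len(distinct)

# Example strategy with a = b = mu, so the parents (1, 0, nu) of c never occur.
f_a, f_b = (0, 1), (0, 1)
f_c = {key: key[0] ^ key[2] for key in C_KEYS}

# Superfluity: alter c's response to an unreachable parent valuation.
f_c_alt = dict(f_c)
f_c_alt[(1, 0, 0)] ^= 1

# Symmetry: relabel 0_mu <-> 1_mu, which changes f_a and f_b but not f_c.
f_a_perm, f_b_perm = f_a[::-1], f_b[::-1]

print(observed(f_a, f_b, f_c) == observed(f_a, f_b, f_c_alt))
print(observed(f_a, f_b, f_c) == observed(f_a_perm, f_b_perm, f_c))
print(count_strategies_and_distributions())
```

Enumerating consistent valuations of an unpopulated diagram, and keeping one representative per orbit of the latent permutation group, would visit each observed distribution far fewer times than this naive loop does.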
Additionally, the possible worlds framework intuitively excludes superfluous cases and directly embodies the observational symmetries, making a possible worlds diagram the ideal data structure for performing a search over observed distributions.

Footnote: As a special case, causal networks (which are causal structures where all variables are exogenous or endogenous) contain no superfluous scenarios.

The purpose of this section is to motivate why it is always possible to approximate any functional causal model $(\mathcal{G}, \mathcal{F}_{\mathcal{V}}, P_{\mathcal{L}})$ with another functional causal model $(\mathcal{G}, \tilde{\mathcal{F}}_{\mathcal{V}}, \tilde{P}_{\mathcal{L}})$ which has latent events $\lambda_{\mathcal{L}} \in \tilde{\Omega}_{\mathcal{L}}$ uniformly distributed. Unsurprisingly, an accurate approximation of this form will require an increase in the cardinality $|\tilde{\Omega}_{\mathcal{L}}| > |\Omega_{\mathcal{L}}|$ of the latent variables.

Definition 9 (Rational Distributions). A discrete probability distribution $P$ over $\Omega$ is rational if every probability assigned to events in $\Omega$ by $P$ is rational,

$$\forall \lambda \in \Omega, \quad P(\lambda) = \frac{n_\lambda}{d_\lambda}, \quad \text{where } n_\lambda, d_\lambda \in \mathbb{Z}. \tag{44}$$

Definition 10 (Distance Metric for Distributions). Given two probability distributions $P, \tilde{P}$ over the same sample space $\Omega$, the distance $\Delta(P, \tilde{P})$ between $P$ and $\tilde{P}$ is defined as,

$$\Delta(P, \tilde{P}) = \sum_{x \in \Omega} \left| P(x) - \tilde{P}(x) \right|. \tag{45}$$

Figure 15: Every permutation $\pi_\ell : \Omega_\ell \to \Omega_\ell$ of valuations on the latent variables maps a possible worlds diagram to another possible worlds diagram with the same observed events. (a) A causal structure $\mathcal{G}_a$ with three visible variables $\mathcal{V} = \{a, b, c\}$ and two latent variables $\mathcal{L} = \{\mu, \nu\}$. (b) A possible worlds diagram for $\mathcal{G}_a$. The crossed out vertex is excluded because it fails to satisfy the ancestral isomorphism property. (c) The image of Figure 15b under the permutation $0_\mu \leftrightarrow 1_\mu$. (d) The image of Figure 15b under the permutation $0_\nu \leftrightarrow 1_\nu$.
The worlds are colored: $\mathrm{wor}(0_\mu 0_\nu)$ green, $\mathrm{wor}(0_\mu 1_\nu)$ orange, $\mathrm{wor}(1_\mu 0_\nu)$ yellow, and $\mathrm{wor}(1_\mu 1_\nu)$ violet.

Theorem 2. Let $P_\ell : \Omega_\ell \to [0, 1]$ be any discrete probability distribution on $\Omega_\ell$, then there exists a rational approximation $\tilde{P}_\ell : \Omega_\ell \to [0, 1]$,

$$\forall \lambda_\ell \in \Omega_\ell, \quad \tilde{P}_\ell(\lambda_\ell) = \frac{1}{|\Omega_u|} \sum_{\omega_u \in \Omega_u} \delta(\lambda_\ell, g(\omega_u)), \tag{46}$$

where $g : \Omega_u \to \Omega_\ell$ is deterministic and $\Delta(P_\ell, \tilde{P}_\ell) \le (|\Omega_\ell| - 1) / |\Omega_u|$.

Proof. The proof is illustrated in Figure 16. In the special case that $|\Omega_\ell| = 1$, the proof is trivial; $g$ simply maps all values of $\omega_u$ to the singleton $\lambda_\ell \in \Omega_\ell$. Otherwise, the proof follows from a construction of $g$ using inverse uniform sampling. Given some ordering $1_\ell < 2_\ell < \cdots$ of $\Omega_\ell$ and ordering $1_u < 2_u < \cdots$ of $\Omega_u$, compute the cumulative distribution function $P_{\le \ell}(\lambda_\ell) = \sum_{\lambda'_\ell \le \lambda_\ell} P_\ell(\lambda'_\ell)$. Then the function $g : \Omega_u \to \Omega_\ell$ is defined as,

$$g(\omega_u) = \min \{ \lambda_\ell \in \Omega_\ell \mid P_{\le \ell}(\lambda_\ell)\, |\Omega_u| \ge \omega_u \}. \tag{47}$$

Consequently, the proportion of $\omega_u \in \Omega_u$ values which map to $\lambda_\ell \in \Omega_\ell$ has error $\varepsilon(\lambda_\ell)$,

$$\varepsilon(\lambda_\ell) = |\Omega_u|\, P_\ell(\lambda_\ell) - \left| g^{-1}(\lambda_\ell) \right|, \tag{48}$$

where $|\varepsilon(\lambda_\ell)| \le 1$ for all $\lambda_\ell \in \Omega_\ell$, with the exception of the minimum ($1_\ell$) and maximum ($|\Omega_\ell|_\ell$) values where $|\varepsilon(\lambda_\ell)| \le 1/2$. Therefore, the proof follows from a direct computation of the distance $\Delta(P_\ell, \tilde{P}_\ell)$,

$$\begin{aligned}
\Delta(P_\ell, \tilde{P}_\ell) &= \sum_{\lambda_\ell \in \Omega_\ell} \left| P_\ell(\lambda_\ell) - \tilde{P}_\ell(\lambda_\ell) \right|, && (49) \\
&= \sum_{\lambda_\ell \in \Omega_\ell} \left| P_\ell(\lambda_\ell) - \frac{1}{|\Omega_u|} \left| g^{-1}(\lambda_\ell) \right| \right|, && (50) \\
&= \frac{1}{|\Omega_u|} \sum_{\lambda_\ell \in \Omega_\ell} |\varepsilon(\lambda_\ell)|, && (51) \\
&\le \frac{1}{|\Omega_u|} \left( |\Omega_\ell| - 2 + 2 \cdot \tfrac{1}{2} \right), && (52) \\
&= \frac{|\Omega_\ell| - 1}{|\Omega_u|}. && (53)
\end{aligned}$$

Figure 16: Theorem 2: Approximately sampling a non-uniform distribution using inverse sampling techniques.

In terms of the causal compatibility problem, Theorem 2 suggests that if an observed distribution $P_{\mathcal{V}}$ is compatible with $\mathcal{G}$, and there exists a functional causal model $(\mathcal{G}, \mathcal{F}_{\mathcal{V}}, P_{\mathcal{L}})$ which reproduces $P_{\mathcal{V}}$ (via Equation 11), then it must be close to a rational distribution $\tilde{P}_{\mathcal{V}}$ generated by a functional causal model $(\mathcal{G}, \tilde{\mathcal{F}}_{\mathcal{V}}, \tilde{P}_{\mathcal{L}})$ wherein the probability distributions for the latent variables $\tilde{P}_{\mathcal{L}}$ are uniform. The following theorem proves this.
Theorem 3.
Let $(\mathcal{G}, \mathcal{F}_{\mathcal{V}}, P_{\mathcal{L}})$ be a functional causal model with cardinalities $c_\ell = |\Omega_\ell|$ for the latent variables producing distribution $P_{\mathcal{V}}$. Then there exists a functional causal model $(\mathcal{G}, \tilde{\mathcal{F}}_{\mathcal{V}}, \tilde{P}_{\mathcal{L}})$ with cardinalities $k_\ell = |\tilde{\Omega}_\ell|$ for the latent variables producing $\tilde{P}_{\mathcal{V}}$ where the distributions $\tilde{P}_{\mathcal{L}} = \{ U_\ell : \tilde{\Omega}_\ell \to k_\ell^{-1} \mid \ell \in \mathcal{L} \}$ over the latent variables are uniform. In particular, the distance between $P_{\mathcal{V}}$ and $\tilde{P}_{\mathcal{V}}$ is bounded by,

$$\Delta(P_{\mathcal{V}}, \tilde{P}_{\mathcal{V}}) \le \varepsilon = \sum_{n=1}^{L} \frac{1}{n!} \left( \frac{L (C - 1)}{K} \right)^{n} \in O\!\left( \frac{LC}{K} \right), \tag{54}$$

where $C = \max\{ c_\ell \mid \ell \in \mathcal{L} \}$, $K = \min\{ k_\ell \mid \ell \in \mathcal{L} \}$, and $L = |\mathcal{L}|$ is the number of latent variables.

Proof. The proof relies on Theorem 2 and can be found in Appendix C.
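The inverse-sampling construction behind Theorem 2, on which Theorem 3 relies, can be sketched in a few lines. This is an illustration under stated assumptions, not the construction of Appendix C: the threshold is shifted by one half (comparing against $\omega_u - 1/2$), which is an assumption made here so that each cumulative count is rounded to the nearest integer, and exact rational arithmetic is used to avoid floating-point ties. The function names are hypothetical.

```python
from fractions import Fraction

def inverse_sample_map(probs, n_uniform):
    """Deterministic g: {1..n_uniform} -> outcome indices, via the CDF.

    Assumption: outcome i is chosen as the first index whose scaled CDF
    reaches the midpoint w - 1/2, i.e. cumulative counts are rounded to
    the nearest integer.
    """
    cdf, running = [], Fraction(0)
    for p in probs:
        running += Fraction(p)
        cdf.append(running)
    g = []
    for w in range(1, n_uniform + 1):
        midpoint = Fraction(2 * w - 1, 2)
        g.append(next(i for i, F in enumerate(cdf) if F * n_uniform >= midpoint))
    return g

def induced_distribution(g, n_outcomes):
    """Rational distribution obtained by pushing a uniform variable through g."""
    return [Fraction(g.count(i), len(g)) for i in range(n_outcomes)]

def distance(p, q):
    """The distance of Definition 10: sum of absolute differences."""
    return sum(abs(Fraction(a) - Fraction(b)) for a, b in zip(p, q))

target = [Fraction(49, 100), Fraction(51, 100)]
g = inverse_sample_map(target, 10)
approx = induced_distribution(g, len(target))
print(approx, distance(target, approx), Fraction(len(target) - 1, 10))
```

Increasing `n_uniform` tightens the approximation at the rate $(|\Omega_\ell| - 1)/|\Omega_u|$ stated in Theorem 2, which is the per-latent-variable ingredient of the bound in Equation 54.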
In Section 5.1, we discussed how to take advantage of the symmetries of a possible worlds diagram and the superfluities within a set of functional parameters $\mathcal{F}_{\mathcal{V}}$ in order to optimally search over functional models. In Section 5.2, we discussed how to approximate any functional causal model $(\mathcal{G}, \mathcal{F}_{\mathcal{V}}, P_{\mathcal{L}})$ using one with uniform latent probability distributions. Here we combine these insights into a hierarchy of probabilistic compatibility tests for the causal compatibility problem.

Definition 11.
Given a causal structure $\mathcal{G}$, and given cardinalities $k_{\mathcal{L}} = \{ k_\ell = |\Omega_\ell| \mid \ell \in \mathcal{L} \}$ for the latent variables, define the uniformly induced distributions, denoted as $\mathcal{U}^{(k_{\mathcal{L}})}_{\mathcal{V}}(\mathcal{G})$, as the set of all distributions $\tilde{P}_{\mathcal{V}} \in \mathcal{M}_{\mathcal{V}}(\mathcal{G})$ which admit a uniform functional model $(\mathcal{G}, \mathcal{F}_{\mathcal{V}}, P_{\mathcal{L}})$ with cardinalities $k_{\mathcal{L}}$.

Footnote: The cardinalities for the visible variables, $k_{\mathcal{V}} = \{ k_v = |\Omega_v| \mid v \in \mathcal{V} \}$, are also assumed to be known.

Recall that Section 5.1 demonstrates a method, using the possible worlds framework, for efficient generation of the entirety of $\mathcal{U}^{(k_{\mathcal{L}})}_{\mathcal{V}}(\mathcal{G})$.

Lemma 4.
The uniformly induced distributions $\mathcal{U}^{(k_{\mathcal{L}})}_{\mathcal{V}}(\mathcal{G})$ form an $\varepsilon$-dense set in $\mathcal{M}_{\mathcal{V}}(\mathcal{G})$,

$$P_{\mathcal{V}} \in \mathcal{M}_{\mathcal{V}}(\mathcal{G}) \implies \exists\, \tilde{P}_{\mathcal{V}} \in \mathcal{U}^{(k_{\mathcal{L}})}_{\mathcal{V}}(\mathcal{G}), \quad \Delta(P_{\mathcal{V}}, \tilde{P}_{\mathcal{V}}) \le \varepsilon \in O\!\left( \frac{LC}{K} \right), \tag{55}$$

where $\varepsilon$ is a function of $K = \min\{ k_\ell \mid \ell \in \mathcal{L} \}$, the number of latent variables $L = |\mathcal{L}|$, and $C = \max\{ c_\ell \mid \ell \in \mathcal{L} \}$, where $c_\ell$ is the minimum upper bound placed on the cardinalities of the latent variable $\ell$ by Theorem 9.

Proof. Since $c_{\mathcal{L}} = \{ c_\ell \mid \ell \in \mathcal{L} \}$ are minimum upper bounds placed on the cardinalities of the latent variables by Theorem 9, any $P_{\mathcal{V}} \in \mathcal{M}_{\mathcal{V}}(\mathcal{G})$ must admit a functional causal model with cardinalities for the latent variables at most $c_{\mathcal{L}}$. Then by Theorem 3, there exists a uniform causal model producing $\tilde{P}_{\mathcal{V}} \in \mathcal{U}^{(k_{\mathcal{L}})}_{\mathcal{V}}(\mathcal{G})$ within a distance $\varepsilon$ given by Equation 54.

Lemma 4 forms the basis of the following compatibility test.

Theorem 5 (The Causal Compatibility Test of Order $K$). For a probability distribution $P_{\mathcal{V}}$ and a causal structure $\mathcal{G}$, the causal compatibility test of order $K = \min\{ k_\ell \mid \ell \in \mathcal{L} \}$ is defined as the following question:

Does there exist a uniformly induced distribution $\tilde{P}_{\mathcal{V}} \in \mathcal{U}^{(k_{\mathcal{L}})}_{\mathcal{V}}(\mathcal{G})$ such that $\Delta(P_{\mathcal{V}}, \tilde{P}_{\mathcal{V}}) \le \varepsilon(K)$?

As $K \to \infty$, the distance tends to zero, $\varepsilon(K) \to 0$, and the sensitivity of the test increases. If $P_{\mathcal{V}} \notin \mathcal{M}_{\mathcal{V}}(\mathcal{G})$, then $P_{\mathcal{V}}$ will fail the test for some finite $K$. If $P_{\mathcal{V}} \in \mathcal{M}_{\mathcal{V}}(\mathcal{G})$, then $P_{\mathcal{V}}$ will pass the test for all $K$. Moreover, for fixed $K$, the test can readily return the functional causal model behind the best approximation $\tilde{P}_{\mathcal{V}}$.

First notice that Theorem 5 achieves the same rate of convergence as [37]. Unlike the result of [37], Theorem 5 returns a functional model which approximates $P_{\mathcal{V}}$. It is interesting to remark that the distance bound $\varepsilon \in O(LC/K)$ in Equation 55 depends on $C = \max\{ c_\ell \mid \ell \in \mathcal{L} \}$, where $c_\ell$ is the minimum upper bound placed on the cardinalities of the latent variable $\ell$ by Theorem 9.
As conjectured in Appendix B, it is likely that there are tighter bounds that can be placed on these cardinalities for certain causal structures. Therefore, further research into lowering these bounds will improve the performance of Theorem 5.

Footnote: Here $\varepsilon(K)$ is the value for $\varepsilon$ provided by Lemma 4.

Conclusion
In conclusion, this paper examined the abstract problem of causal compatibility for causal structures with latent variables. Section 3 introduced the framework of possible worlds in an effort to provide solutions to the causal compatibility problem. Central to this framework is the notion of a possible worlds diagram, which can be viewed as a hybrid between a causal structure and the functional parameters of a causal model. It does not, however, convey any information about the probability distributions over the latent variables.

In Section 4, we utilized the possible worlds framework to prove possibilistic incompatibility of a number of examples. In addition, we demonstrated the utility of our approach by resolving an open problem associated with one of Evans' [21] causal structures. Particularly, we have shown the causal structure in Figure 13 is incompatible with the distribution in Equation 38. Section 4 concluded with an algorithm for completely solving the possibilistic causal compatibility problem.

In Section 5, we discussed how to efficiently search through the observational equivalence classes of functional parameters using a possible worlds diagram as a data structure. Afterwards, we derived bounds on the distance between compatible distributions and uniformly induced ones. By combining these results, we provide a hierarchy of necessary tests for probabilistic causal compatibility which converge in the limit.
Foremost, I must thank my supervisor Robert W. Spekkens for his unwavering support and encouragement. Second, I would like to sincerely thank Elie Wolfe for our numerous and lengthy discussions. Without him or his research, this paper simply would not exist. Finally, I thank the two anonymous referees for providing insight necessary for significantly improving this paper.
References [1] S. Abramsky and A. Brandenburger. “The Sheaf-Theoretic Structure Of Non-Locality and Contextuality”. In:
New J. Phys
Phys. Rev. A
85 (6 June 2012), p. 062114. [3] J.-M. A. Allen et al. “Quantum common causes and quantum causal models”. In: arXiv:1609.09487 (2016). [4] J.-D. Bancal, N. Gisin, and S. Pironio. “Looking for symmetric Bell inequalities”. In:
J. Phys. A
Discrete & Computational Geometry
Physical Review A
Physics
ArXiv e-prints (Jan. 2013).arXiv: .[9] Bradley, Hax, and Magnanti.
Applied Mathematical Programming . Addison-Wesley, 1977. Chap. 4, pp. 143–144.[10] C. Branciard et al. “Bilocal versus nonbilocal correlations in entanglement-swapping experiments”. In:
Phys. Rev. A
85 (3 Mar. 2012), p. 032119.[11] R. Chaves. “Polynomial bell inequalities”. In:
Physical review letters
New J. Phys.
Nature communications
Lett. MathPhys.
Phys. Rev. Lett.
23 (15 Oct. 1969), pp. 880–884.[16] D. Colombo et al. “Learning high-dimensional directed acyclic graphs with la-tent and selection variables”. In:
The Annals of Statistics (2012), pp. 294–321.[17] F. Costa and S. Shrapnel. “Quantum causal modelling”. In:
New Journal of Physics
J. Combin. Theor. A
Phys. Rev.
47 (10 May 1935),pp. 777–780.[20] R. J. Evans. “Graphical methods for inequality constraints in marginalizedDAGs”. In:
ArXiv e-prints (Sept. 2012). arXiv: .[21] R. J. Evans. “Graphs for margins of Bayesian networks”. In:
Scandinavian Journal of Statistics arXiv:1501.02103 (2015). [23] T. Fraser and E. Wolfe. “Causal Compatibility Inequalities Admitting of Quantum Violations in the Triangle Structure”. In:
Phys. Rev. A 98, 022113 (2018) (Sept. 19, 2017). arXiv: .[24] T. Fritz. “Beyond Bell’s Theorem: Correlation Scenarios”. In:
New J. Phys
Comm. Math. Phys.
IEEE Trans. Info. Theor.
Algebraic Geometry of Bayesian Networks. 2003. eprint: arXiv:math/0301255.
[28] D. Geiger and C. Meek. “Graphical Models and Exponential Families”. In: ArXiv e-prints (Jan. 2013). arXiv: .
[29] O. Goudet et al. “Causal Generative Neural Networks”. In: ArXiv e-prints (Nov. 2017). arXiv: .
[30] J. Henson, R. Lal, and M. F. Pusey. “Theory-independent limits on correlations from generalized Bayesian networks”. In: New Journal of Physics.
[31] M. Jirstrand. Cylindrical algebraic decomposition-an introduction. Linköping University, 1995.
[32] C. Jones, E. C. Kerrigan, and J. Maciejowski. Equality set projection: A new algorithm for the projection of polytopes in halfspace representation. Tech. rep. Cambridge University Engineering Dept, 2004.
[33] D. J. Kavvadias and E. C. Stavropoulos. “An Efficient Algorithm for the Transversal Hypergraph Generation”. In: J. Graph Algor. Applic.
Bernoulli
ArXiv e-prints (June 2015). arXiv: .
[36] M. S. Leifer and R. W. Spekkens. “Towards a Formulation of Quantum Theory as a Causally Neutral Theory of Bayesian Inference”. In: ArXiv e-prints (July 2011). arXiv: .
[37] M. Navascues and E. Wolfe. “The inflation technique solves completely the classical inference problem”. In: ArXiv e-prints (July 2017). arXiv: .
[38] O. Oreshkov, F. Costa, and C. Brukner. “Quantum correlations with no causal order”. In: Nat. Comm. arXiv:1105.4464.
[39] J. Pearl. “A Constraint Propagation Approach to Probabilistic Reasoning”. In: ArXiv e-prints (Mar. 2013). arXiv: .
[40] J. Pearl. “On the Testability of Causal Models with Latent and Instrumental Variables”. In: ArXiv e-prints (Feb. 20, 2013). arXiv: .
[41] J. Pearl. “On the Testability of Causal Models with Latent and Instrumental Variables”. In: (Aug. 1995), pp. 435–443.
[42] J. Pearl. “Causal inference in statistics: An overview”. In: Stat. Surv.
[43] J. Pearl. Causality: Models, Reasoning, and Inference. Cambridge University Press, 2009.
[44] J. Pienaar and Č. Brukner. “A graph-separation theorem for quantum causal models”. In: New Journal of Physics
ArXiv e-prints (Jan. 2017). arXiv: .
[46] T. Richardson, P. Spirtes, et al. “Ancestral graph Markov models”. In: The Annals of Statistics
Nature Physics 11 (May 2015), pp. 414–420. arXiv: .
[48] J. M. Robins, M. A. Hernan, and B. Brumback. Marginal structural models and causal inference in epidemiology. 2000.
[49] D. Rohrlich and S. Popescu. “Nonlocality as an axiom for quantum theory”. In: quant-ph/9508009 (1995).
[50] D. Rosset, N. Gisin, and E. Wolfe. “Universal bound on the cardinality of local hidden variables in networks”. In: Quantum Information and Computation, Vol. 18, No. 11 & 12 (2018) 0910-0926 (Sept. 3, 2017). arXiv: .
[51] A. Schrijver. Theory of Linear and Integer Programming. Wiley, Apr. 27, 1998. 484 pp.
[52] I. Shpitser and J. Pearl. “Complete identification methods for the causal hierarchy”. In: Journal of Machine Learning Research
Directed Cyclic Graphical Representations of Feedback Models. 2013. eprint: arXiv:1302.4982.
[54] P. Spirtes, C. N. Glymour, and R. Scheines. Causation, prediction, and search. MIT Press, 2000.
[55] B. Steudel and N. Ay. “Information-theoretic inference of common ancestors”. In: Entropy
Foundations and Trends® in Machine Learning
General relativity. University of Chicago Press, 2010.
[58] M. Weilenmann and R. Colbeck. “Non-Shannon inequalities in the entropy vector approach to causal structures”. In: arXiv:1605.02078 (2016).
[59] N. Wermuth et al. “Probability distributions with summary graph structure”. In: Bernoulli
Journal of Causal Inference
New J. Phys
Bioinformatics
Journal of Machine Learning Research
Figure 17: The causal structures of (a) and (b) are observationally equivalent. (a) A direct cause from $v_1$ to $v_2$. (b) A shared common cause $\ell$ between $v_1$ and $v_2$.
A Simplifying Causal Structures
A.1 Observational Equivalence
From an experimental perspective, a causal model $(\mathcal{G}, \mathcal{P})$ has the ability to predict the effects of interventions; by manually tinkering with the configuration of a system, one can learn more about the underlying mechanisms than from observations alone [43]. When interventions become impossible, because experimentation is expensive or unethical for example, it becomes possible for distinct causal structures to admit the same set of compatible correlations. An important topic in the study of causal inference is the identification of observationally equivalent causal structures. Two causal structures $\mathcal{G}$ and $\mathcal{G}'$ are observationally equivalent, or simply equivalent, if they share the same set of compatible models, $\mathcal{M}_{\mathcal{V}}(\mathcal{G}) = \mathcal{M}_{\mathcal{V}}(\mathcal{G}')$. For example, the direct cause causal structure in Figure 17a is observationally equivalent to the common cause causal structure in Figure 17b. Identifying observationally equivalent causal structures is of fundamental importance to the causal compatibility problem; if a distribution $P_{\mathcal{V}}$ is known to satisfy the hypotheses of $\mathcal{G}$, and $\mathcal{M}_{\mathcal{V}}(\mathcal{G}) = \mathcal{M}_{\mathcal{V}}(\mathcal{G}')$, then it will also satisfy the hypotheses of $\mathcal{G}'$.
A.2 Exo-Simplicial Causal Structures
In general, other than being a directed acyclic graph, there are no restrictions placed on a causal structure with latent variables. Nonetheless, [21] demonstrated a number of transformations on causal structures which leave $\mathcal{M}_{\mathcal{V}}(\mathcal{G})$ invariant. Two of these transformations are the subject of interest for this section. The first concerns itself with latent vertices that have parents, while the second concerns itself with parent-less latent vertices that share children. Each will be taken in turn.
Definition 12 (See Defn. 3.6 [21]). Given a causal structure $\mathcal{G} = (\mathcal{V} \cup \mathcal{L}, \mathcal{E})$ with latent vertex $\ell \in \mathcal{L}$, the exogenized causal structure $\mathrm{exo}_{\mathcal{G}}(\ell)$ is formed by taking $\mathcal{E}$ and (i) adding an edge $p \to c$ for every $p \in \mathrm{pa}_{\mathcal{G}}(\ell)$ and $c \in \mathrm{ch}_{\mathcal{G}}(\ell)$ if not already present, and (ii) deleting all edges of the form $p \to \ell$ where $p \in \mathrm{pa}_{\mathcal{G}}(\ell)$. If $\mathrm{pa}_{\mathcal{G}}(\ell)$ is empty, $\mathrm{exo}_{\mathcal{G}}(\ell) = \mathcal{G}$.
Figure 18: Examples of causal structures which are not exo-simplicial. (a) A latent vertex with observable parents. (b) A latent vertex with latent parents. (c) A latent vertex with no children. (d) Latent vertices with nested children.
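The construction in Definition 12 can be sketched in a few lines of Python. This is a minimal illustration rather than code from [21] or from this paper: the edge-set representation, the function name `exogenize`, and the vertex labels in the usage example are all assumptions made for the sketch.

```python
def exogenize(edges, latent):
    """Return the edge set of exo_G(latent) as described in Definition 12.

    `edges` is a set of (parent, child) pairs describing a DAG; this
    representation and all names here are hypothetical, chosen for illustration.
    """
    parents = {p for (p, c) in edges if c == latent}
    children = {c for (p, c) in edges if p == latent}
    new_edges = set(edges)
    # (i) add an edge p -> c for every parent p and child c of the latent vertex
    new_edges |= {(p, c) for p in parents for c in children}
    # (ii) delete every edge p -> latent
    new_edges -= {(p, latent) for p in parents}
    return new_edges


# Hypothetical structure: three parents feed a latent "l" that has two children.
example = {("v1", "l"), ("v2", "l"), ("v3", "l"), ("l", "v4"), ("l", "v5")}
exogenized = exogenize(example, "l")
```

After the call, `l` has no parents and each of `v1`, `v2`, `v3` points directly at `v4` and `v5`. Applying the function to a vertex without parents returns the edge set unchanged, matching the last sentence of the definition.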
Lemma 6 (See Lem. 3.7 [21]). Given a causal structure $\mathcal{G} = (\mathcal{V} \cup \mathcal{L}, \mathcal{E})$ with latent vertex $\ell \in \mathcal{L}$, then $\mathcal{M}_{\mathcal{V}}(\mathrm{exo}_{\mathcal{G}}(\ell)) = \mathcal{M}_{\mathcal{V}}(\mathcal{G})$.
Proof. See proof of Lem. 3.7 from [21].
The concept of exogenization is best understood with an example.
Example 1. Consider the causal structure $\mathcal{G}_a$ in Figure 18a. In $\mathcal{G}_a$, the latent variable $\ell$ has parents $\mathrm{pa}(\ell) = \{v_1, v_2, v_3\}$ and children $\mathrm{ch}(\ell) = \{v_4, v_5\}$. Since the sample space $\Omega_\ell$ is unknown, its cardinality could be arbitrarily large or infinite. As a result, $\ell$ has an unbounded capacity to inform its children of the valuations of its parents, e.g. $v_4$ can have complete knowledge of $v_1$ through $\ell$ and therefore adding the edge $v_1 \to v_4$ has no observational impact. Applying similar reasoning to all parents of $\ell$, i.e. applying Lemma 6, one converts $\mathcal{G}_a$ to the observationally equivalent, exogenized causal structure $\mathrm{exo}_{\mathcal{G}_a}(\ell)$ depicted in Figure 19.
Lemma 6 can be applied recursively to each latent variable $\ell \in \mathcal{L}$ in order to transform any causal structure $\mathcal{G}$ into an observationally equivalent one wherein the latent variables have no parents (i.e. are exogenous). Notice that the process of exogenization also works when latent vertices have latent parents, as is the case in Figure 18b. Also, when a latent vertex $\ell$ has no children, the process of exogenization disconnects $\ell$ from the rest of the causal structure, where it can be ignored with no observational impact due to Equation 7.
Figure 19: The exogenized causal structure $\mathrm{exo}_{\mathcal{G}_a}(\ell)$.
The next observationally invariant transformation requires the exogenization procedure to have been applied first. In Figure 18d, $\ell_1$ and $\ell_2$ are exogenous latent variables where $\mathrm{ch}_{\mathcal{G}_d}(\ell_1) \subset \mathrm{ch}_{\mathcal{G}_d}(\ell_2)$. Therefore, because the sample space $\Omega_{\ell_2}$ is unspecified, $\ell_2$ has the capacity to emulate any dependence that the children of $\ell_1$ might have on $\ell_1$. This idea is captured by Lemma 7.
Lemma 7 (See Lem. 3.8 [21]). Let $\mathcal{G}$ be a causal structure with latent vertices $\ell, \ell' \in \mathcal{L}$ where $\ell \neq \ell'$.
If $\mathrm{pa}_{\mathcal{G}}(\ell) = \mathrm{pa}_{\mathcal{G}}(\ell') = \emptyset$ and $\mathrm{ch}_{\mathcal{G}}(\ell') \subseteq \mathrm{ch}_{\mathcal{G}}(\ell)$, then $\mathcal{M}_{\mathcal{V}}(\mathcal{G}) = \mathcal{M}_{\mathcal{V}}(\mathrm{sub}_{\mathcal{G}}(\mathcal{V} \cup \mathcal{L} - \{\ell'\}))$.
Proof. See proof of Lem. 3.8 from [21].
An immediate corollary of Lemma 7 is that the latent variables $\{\ell \mid \ell \in \mathcal{L}\}$, which are isomorphic to their children $\{\mathrm{ch}(\ell) \mid \ell \in \mathcal{L}\}$, are isomorphic to the facets of a simplicial complex over the visible variables.
Definition 13. An (abstract) simplicial complex, $\Delta$, over a finite set $V$ is a collection of non-empty subsets of $V$ such that:
1. $\{v\} \in \Delta$ for all $v \in V$; and
2. if $C_1 \subseteq C_2 \subseteq V$, then $C_2 \in \Delta \Rightarrow C_1 \in \Delta$.
The maximal subsets with respect to inclusion are called the facets of the simplicial complex.
In [21], this concept led to the invention of mDAGs (or marginal directed acyclic graphs), a hybrid between a directed acyclic graph and a simplicial complex. In this work, we refrain from adopting the formalism of mDAGs and instead continue to consider causal structures as entirely directed acyclic graphs. Nevertheless, Lemmas 6 and 7 demonstrate that for the purposes of the causal compatibility problem, the latent variables of a causal structure can be assumed to be exogenous and to have children forming the facets of a simplicial complex. Causal structures which adhere to this characterization will be referred to as exo-simplicial causal structures. Figure 20 depicts four exo-simplicial causal structures respectively equivalent to the causal structures in Figure 18.
Figure 20: Examples of exo-simplicial causal structures, panels (a) to (d), which are observationally equivalent to their respective counterparts in Figure 18.
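The reduction licensed by Lemma 7 amounts to keeping only the latent variables whose child sets are maximal with respect to inclusion. The following is a minimal Python sketch of that pruning step, assuming exogenization has already been applied so that every latent has no parents. The dictionary representation, the function name `prune_nested_latents`, and the labels in the example are hypothetical and chosen only for illustration.

```python
def prune_nested_latents(children):
    """Keep one exogenous latent per inclusion-maximal child set.

    `children` maps each latent label to the set of its visible children.
    All latents are assumed to be exogenous (no parents).
    """
    kept = {}
    # Visit larger child sets first, so any superset is examined before its subsets.
    for latent in sorted(children, key=lambda l: (-len(children[l]), str(l))):
        child_set = frozenset(children[latent])
        if not child_set:
            continue  # a childless latent is disconnected and can be ignored
        if any(child_set <= other for other in kept.values()):
            continue  # Lemma 7: a retained latent already covers these children
        kept[latent] = child_set
    return kept


# Hypothetical example: l2 is nested inside l1, and l4 has no children.
latents = {
    "l1": {"v1", "v2", "v3"},
    "l2": {"v2", "v3"},
    "l3": {"v3", "v4"},
    "l4": set(),
}
facets = prune_nested_latents(latents)
```

Here only `l1` and `l3` survive. When two latents have identical child sets, the sketch keeps the one that sorts first, which is consistent with Lemma 7 since either may be removed.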
B Simplifying Causal Parameters
Recall that a causal model $(\mathcal{G}, \mathcal{P})$ consists of a causal structure $\mathcal{G}$ and causal parameters $\mathcal{P}$. Appendix A simplified the causal compatibility problem by revealing that each causal structure $\mathcal{G}$ can be replaced with an observationally equivalent exo-simplicial causal structure $\mathcal{G}'$ such that $\mathcal{M}_{\mathcal{V}}(\mathcal{G}) = \mathcal{M}_{\mathcal{V}}(\mathcal{G}')$. The purpose of this section is to simplify the causal compatibility problem in three ways. Section B.1 demonstrates that the visible causal parameters $\{P_{v \mid \mathrm{pa}(v)} \mid v \in \mathcal{V}\}$ of a causal model can be assumed to be deterministic without observational impact. Section B.2 shows that if the observed distribution is finite (i.e. $|\Omega_{\mathcal{V}}| < \infty$), one only needs to consider finite probability distributions for the latent variables. Moreover, explicit upper bounds on the cardinalities of the latent variables can be computed.
B.1 Determinism
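As a small illustration of the determinism claim, a stochastic response $P(x \mid a)$ can be rewritten as a deterministic function of the parent value and an independent noise variable. The sketch below is a toy example under stated assumptions rather than the paper's construction: it takes the noise $e$ to be uniform on $[0, 1)$, uses inverse-CDF thresholds, and all function and variable names are hypothetical. It then checks exactly, with rational arithmetic, that integrating out the noise recovers the original conditional distribution.

```python
from fractions import Fraction


def deterministic_response(cond_dist):
    """Build f(a, e) that reproduces P(x | a) when e is uniform on [0, 1).

    `cond_dist` maps each parent value a to a list of Fractions giving the
    probabilities of outcomes 0, 1, ..., k-1 (hypothetical representation).
    """
    def f(a, e):
        cumulative = Fraction(0)
        for outcome, prob in enumerate(cond_dist[a]):
            cumulative += prob
            if e < cumulative:  # e falls inside this outcome's interval
                return outcome
        raise ValueError("e must lie in [0, 1)")
    return f


def integrate_out_noise(f, cond_dist):
    """Integrate f(a, e) over uniform e exactly by splitting [0, 1) at every threshold."""
    cuts = {Fraction(0), Fraction(1)}
    for probs in cond_dist.values():
        cumulative = Fraction(0)
        for prob in probs:
            cumulative += prob
            cuts.add(cumulative)
    cuts = sorted(cuts)
    recovered = {a: [Fraction(0)] * len(probs) for a, probs in cond_dist.items()}
    for lo, hi in zip(cuts, cuts[1:]):
        for a in cond_dist:
            # f(a, .) is constant on [lo, hi), so evaluating at lo suffices
            recovered[a][f(a, lo)] += hi - lo
    return recovered


# Hypothetical binary response with one binary parent.
P = {0: [Fraction(1, 4), Fraction(3, 4)], 1: [Fraction(2, 3), Fraction(1, 3)]}
f = deterministic_response(P)
recovered = integrate_out_noise(f, P)
```

In the language of Lemma 8, `e` plays the role of the local noise variable $E_{e_v}$, which is subsequently absorbed into a pre-existing latent variable.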
Lemma 8. If $P_{\mathcal{V}} \in \mathcal{M}_{\mathcal{V}}(\mathcal{G})$ and $\mathcal{G}$ is exo-simplicial (see Appendix A), then without loss of generality, the causal parameters $P_{v \mid \mathrm{pa}_{\mathcal{G}}(v)}$ over the observed variables can be assumed to be deterministic, and consequently,
$$\forall x_{\mathcal{V}} \in \Omega_{\mathcal{V}}, \quad P_{\mathcal{V}}(x_{\mathcal{V}}) = \prod_{\ell \in \mathcal{L}} \int_{\lambda_\ell \in \Omega_\ell} \mathrm{d}P_\ell(\lambda_\ell) \prod_{v \in \mathcal{V}} \delta\big(x_v, f_v(x_{\mathrm{vpa}_{\mathcal{G}}(v)}, \lambda_{\mathrm{lpa}_{\mathcal{G}}(v)})\big). \tag{56}$$
Proof. Since $P_{\mathcal{V}} \in \mathcal{M}_{\mathcal{V}}(\mathcal{G})$, by definition, there exists a joint distribution $P_{\mathcal{V} \cup \mathcal{L}}$ (or density $\mathrm{d}P_{\mathcal{V} \cup \mathcal{L}}$) admitting marginal $P_{\mathcal{V}}$ via Equation 7. Since the joint distribution satisfies Equation 6, it is possible to associate to each observed variable $X_v$ an independent random variable $E_{e_v}$ and a measurable function $f_v : \Omega_{\mathrm{vpa}_{\mathcal{G}}(v)} \times \Omega_{\mathrm{lpa}_{\mathcal{G}}(v)} \times \Omega_{e_v} \to \Omega_v$ such that for all $v \in \mathcal{V}$,
$$X_v = f_v\big(X_{\mathrm{vpa}_{\mathcal{G}}(v)}, \Lambda_{\mathrm{lpa}_{\mathcal{G}}(v)}, E_{e_v}\big). \tag{57}$$
Therefore, by promoting each $e_v$ to the status of a latent variable in $\mathcal{G}$ and adding an edge $e_v \to v$ to $\mathcal{E}$, each $X_v$ becomes a deterministic function of its parents. Finally, making use of the fact that $\mathcal{G}$ is exo-simplicial, every error variable $e_v$ has its children $\mathrm{ch}_{\mathcal{G}}(e_v) = \{v\}$ nested inside the children of at least one other pre-existing latent variable. Therefore, by applying Lemma 7, $e_v$ is eliminated and one recovers the original $\mathcal{G}$.
Essentially, Lemma 8 indicates that any non-determinism due to local noise variables $E_{e_v}$ can be emulated by the behavior of the latent variables $\mathcal{L}$.
B.2 The Finite Bound for Latent Cardinalities
In [50], it was shown that if the visible variables have finite cardinality (i.e. $k_{\mathcal{V}} = |\Omega_{\mathcal{V}}|$ is finite), then for a particular class of causal structures known as causal networks, the cardinalities of the latent variables could be assumed to be finite as well. A causal network is a causal structure where all latent variables have no parents (are exogenous) and all visible variables either have no parents or no children [37]. The purpose of this section is to generalize the results of [50] to the case of exo-simplicial causal structures. Although the proof techniques presented here are similar to those of [50], the best upper bounds placed on $k_{\mathcal{L}} = |\Omega_{\mathcal{L}}|$ depend more intimately on the form of $\mathcal{G}$. It is also anticipated that the upper bounds presented here are sub-optimal, much like those of [50]. It is also worth noting that the results presented here hold independently of whether or not Lemma 8 is applied.
Theorem 9. Let $(\mathcal{G}, \mathcal{P})$ be a causal model with (possibly infinite) cardinalities $k_{\mathcal{L}} = \{k_\ell \mid \ell \in \mathcal{L}\}$ for the latent variables such that
$$\forall x_{\mathcal{V}} \in \Omega_{\mathcal{V}}, \quad P_{\mathcal{V}}(x_{\mathcal{V}}) = \prod_{\ell \in \mathcal{L}} \int_{\lambda_\ell \in \Omega_\ell} \mathrm{d}P_\ell(\lambda_\ell) \prod_{v \in \mathcal{V}} P_{v \mid \mathrm{pa}(v)}(x_v \mid x_{\mathrm{vpa}(v)} \lambda_{\mathrm{lpa}(v)}) \tag{58}$$
produces the distribution $P_{\mathcal{V}}$. Then there exists a causal model $(\mathcal{G}, \mathcal{P}')$ reproducing $P_{\mathcal{V}}$ with cardinalities $k_{\mathcal{L}} = \{k_\ell \mid \ell \in \mathcal{L}\}$ where each $k_\ell$ is finite.
Proof. The following proof considers each latent variable $\xi \in \mathcal{L}$ independently and obtains a value for $k_\ell$ in each case. Let $\mathcal{L}' = \mathcal{L} - \{\xi\}$ denote the set of latent variables with $\xi$ removed. Let $\mathrm{d}P_{\mathcal{L}'} = \prod_{\ell \in \mathcal{L}'} \mathrm{d}P_\ell$ be a probability density over $\Omega_{\mathcal{L}'}$ and consider the conditional probability distribution $P_{\mathcal{V} \mid \xi}(x_{\mathcal{V}} \mid \lambda_\xi)$ given $\lambda_\xi$,
$$P_{\mathcal{V} \mid \xi}(x_{\mathcal{V}} \mid \lambda_\xi) = \int_{\Omega_{\mathcal{L}'}} \mathrm{d}P_{\mathcal{L}'}(\lambda_{\mathcal{L}'}) \prod_{v \in \mathcal{V}} P_{v \mid \mathrm{pa}(v)}(x_v \mid x_{\mathrm{vpa}(v)} \lambda_{\mathrm{lpa}(v)}). \tag{59}$$
Figure 21: A causal structure $\mathcal{G}$ that helps in visualizing the proof of Theorem 9. The figure uses vertex labels $a$ through $g$ together with $\xi, \nu, \mu, \rho$, and is annotated with $\mathcal{V} = \{a, b, c, d, e, f, g\}$, $D = \{a, b, c, d, g\}$, $D^c = \{e, f\}$, $\bar{D} = \{f\}$, $\bar{D}^c = \{a, b\}$, $A = \{a, b, c\}$, $B = \{d, g\}$.
Consulting Figure 21 for clarity, define the district $D \subseteq \mathcal{V}$ of $\xi$ to be the maximal set of visible vertices $v$ in $\mathcal{G}$ for which there exists an undirected path from $v$ to $\xi$ with alternating visible/latent vertices. Let $D^c = \mathcal{V} - D$, $\bar{D} = \mathrm{pa}(D) - D$ and $\bar{D}^c = \mathrm{pa}(D^c) - D^c$. The district $D$ has the property that $P_{\mathcal{V} \mid \xi}$ factorizes over $D, D^c$ [21],
$$P_{\mathcal{V} \mid \xi}(x_{\mathcal{V}} \mid \lambda_\xi) = P_{D \mid \bar{D} \xi}(x_D \mid x_{\bar{D}} \lambda_\xi) \, P_{D^c \mid \bar{D}^c}(x_{D^c} \mid x_{\bar{D}^c}). \tag{60}$$
For varying $\lambda_\xi$, consider a vector representation $p_{\lambda_\xi}$ of the conditional distribution $P_{D \mid \bar{D} \xi}(x_D \mid x_{\bar{D}} \lambda_\xi)$ and define $U = \{p_{\lambda_\xi} \mid \lambda_\xi \in \Omega_\xi\}$.
By construction, the center of mass $p^*$ of $U$ represents $P_{D \mid \bar{D}}(x_D \mid x_{\bar{D}})$,
$$p^* = \int_{\Omega_\xi} \mathrm{d}P_\xi(\lambda_\xi) \, p_{\lambda_\xi}, \tag{61}$$
$$P_{D \mid \bar{D}}(x_D \mid x_{\bar{D}}) = \int_{\Omega_\xi} \mathrm{d}P_\xi(\lambda_\xi) \, P_{D \mid \bar{D} \xi}(x_D \mid x_{\bar{D}} \lambda_\xi). \tag{62}$$
Therefore, by a variant of Carathéodory's theorem due to Fenchel [5], if $U$ is compact and connected, then $p^*$ can be written as a finite convex decomposition,
$$p^* = \sum_{j=1}^{\mathrm{aff}(U)} w_j p_j, \quad \sum_j w_j = 1, \quad \forall i, \; w_i \geq 0, \tag{63}$$
where each $p_j \in U$ and $\mathrm{aff}(U)$ is the affine dimension of $U$. Then by letting $\Omega_\xi = \{1_\xi, 2_\xi, \ldots, \mathrm{aff}(U)_\xi\}$ be a finite sample space for $\xi$ distributed according to $P_\xi(\lambda_\xi) = w_{\lambda_\xi}$, by Equations 58, 59, 60 and 62,
$$P_{\mathcal{V}}(x_{\mathcal{V}}) = \sum_{\lambda_\xi \in \Omega_\xi} P_\xi(\lambda_\xi) \, P_{\mathcal{V} \mid \xi}(x_{\mathcal{V}} \mid \lambda_\xi). \tag{64}$$
Therefore, causal parameters exist reproducing $P_{\mathcal{V}}$ with cardinality $k_\xi = \mathrm{aff}(U)$.
What remains is to show that $U$ is compact and connected, and to find a bound on $\mathrm{aff}(U)$. Because of normalization constraints on each $p_{\lambda_\xi}$, $U$ is bounded. Moreover, [50] demonstrates that $U$ can be taken to be closed as well. Again consulting Figure 21 for clarity, partition $D$ into subsets $A = \mathrm{des}(\xi) \cap D$ and $B = D - A$. This partitioning enables one to identify the following linear equality constraint placed on all points $p_{\lambda_\xi}$:
$$\sum_{x_A \in \Omega_A} P_{D \mid \bar{D} \xi}(x_D \mid x_{\bar{D}} \lambda_\xi) \tag{65}$$
$$= \sum_{x_A \in \Omega_A} P_{A \mid B \bar{D} \xi}(x_A \mid x_B x_{\bar{D}} \lambda_\xi) \, P_{B \mid \bar{D} \xi}(x_B \mid x_{\bar{D}} \lambda_\xi) \tag{66}$$
$$= P_{B \mid \bar{D} \xi}(x_B \mid x_{\bar{D}} \lambda_\xi) \tag{67}$$
$$= P_{B \mid \bar{D}}(x_B \mid x_{\bar{D}}), \tag{68}$$
where the last equality holds because $B$ is independent of $\xi$ given $\bar{D}$: every path from $b \in B$ to $\xi$ must pass through an unconditioned collider in $A$, and therefore the d-separation relation $B \perp \{\xi\} \mid \bar{D}$ holds [43]. Furthermore, note that if $U$ is not connected, it can be made connected by a scheme due to [50] which adds noisy variants of each $p_{\lambda_\xi}$ to $U$. Simply include a noise parameter $\nu \in [0, 1]$ in an extended latent value $\lambda'_\xi = (\lambda_\xi, \nu)$ and adjust the response functions for variables in $A$ such that
$$P_{A \mid B \bar{D} \xi}(x_A \mid x_B x_{\bar{D}} \lambda_\xi \nu) = (1 - \nu) \, P_{A \mid B \bar{D} \xi}(x_A \mid x_B x_{\bar{D}} \lambda_\xi) + \frac{\nu}{|\Omega_A|}. \tag{69}$$
For each degree of noise $0 \leq \nu \leq 1$, Equation 69 defines a noisy model $p_{\lambda_\xi, \nu}$ which is added to $U$. As special cases, no noise ($\nu = 0$) yields $p_{\lambda_\xi, 0} = p_{\lambda_\xi} \in U$, and complete noise ($\nu = 1$) yields $p_{\lambda_\xi, 1}$, representing $P_{B \mid \bar{D}}(x_B \mid x_{\bar{D}}) / |\Omega_A| \in U$, which is independent of $\lambda_\xi$. Therefore, $U$ is connected, since every point of $U$ is joined to this common point by varying $\nu$. Finally, the affine dimension $\mathrm{aff}(U)$ is at most the affine dimension of $P_{D \mid \bar{D}}$ with the degrees of freedom associated with satisfying Equation 68 removed [50]. Therefore,
$$k_\xi = \mathrm{aff}(U) \leq \mathrm{aff}\big(P_{D \mid \bar{D}}\big) - \mathrm{aff}\big(P_{B \mid \bar{D}}\big). \tag{70}$$
Proof of Theorem 3
Proof.