Information-theoretic inference of common ancestors
Bastian Steudel and Nihat Ay
MPI for Mathematics in the Sciences, Leipzig, Germany
Santa Fe Institute, New Mexico, USA

June 19, 2018
Abstract
A directed acyclic graph (DAG) partially represents the conditional independence structure among observations of a system if the local Markov condition holds, that is, if every variable is independent of its non-descendants given its parents. In general, there is a whole class of DAGs that represents a given set of conditional independence relations. We are interested in properties of this class that can be derived from observations of a subsystem only. To this end, we prove an information-theoretic inequality that allows for the inference of common ancestors of observed parts in any DAG representing some unknown larger system. More explicitly, we show that a large amount of dependence in terms of mutual information among the observations implies the existence of a common ancestor that distributes this information. Within the causal interpretation of DAGs, our result can be seen as a quantitative extension of Reichenbach's Principle of Common Cause to more than two variables. Our conclusions are valid also for non-probabilistic observations such as binary strings, since we state the proof for an axiomatized notion of 'mutual information' that includes the stochastic as well as the algorithmic version.
Causal relations among components X_1, ..., X_n of a system are commonly modeled in terms of a directed acyclic graph (DAG) in which there is an edge X_i → X_j whenever X_i is a direct cause of X_j. Further, it is usually assumed that information about the causal structure can be obtained through interventions in the system. However, there are situations in which interventions are not feasible (too expensive, unethical or physically impossible) and one faces the problem of inferring causal relations from observational data only. To this end, postulates linking observations to the underlying causal structure have been employed, one of the most fundamental being the causal Markov condition [1, 2]. It connects the underlying causal structure to conditional independencies among the observations. Explicitly, it states that every observation is independent of its non-effects given its direct causes. It formalizes the intuition that the only relevant components of a system for a given observation are its direct causes.

In terms of DAGs, the causal Markov condition states that a DAG can only be a valid causal model of a system if every node is independent of its non-descendants given its parents. The graph is then said to fulfill the local Markov condition [3]. Consider for example the causal hypothesis X → Y ← Z on three observations X, Y and Z. Assuming the causal Markov condition, the hypothesis implies that X and Z are independent. Violation of this independence then allows one to exclude this causal hypothesis. But note that in general there are many DAGs that fulfill the local Markov condition with respect to a given set of conditional independence relations. For example, all three DAGs X → Y → Z, X ← Y → Z and X ← Y ← Z encode that X is independent of Z given Y, and it cannot be decided from information on conditional independences alone which is the true causal model. Nevertheless, properties that are shared by all valid DAGs (e.g. an edge between X and Y in the example) provide information about the underlying causal structure.

Figure 1: Two causal hypotheses for which the causal Markov condition does not imply conditional independencies among the observations X_1, X_2 and X_3. Thus they cannot be distinguished using qualitative criteria like the common cause principle (unobserved variables are indicated as dots). However, the model on the right can be excluded if the dependence among the X_i exceeds a certain bound.

The causal Markov condition is only expected to hold for a given set of observations if all relevant components of a system have been observed, that is, if there are no confounders (causes of more than two observations that have not been measured). It can then be proven by assuming a functional model of causality [1, 4, 5]. As an example, consider the observations X_1, ..., X_n to be jointly distributed random variables. In this case, the causal Markov condition can be derived for a given DAG on X_1, ..., X_n from two assumptions: (1) every variable X_i is a deterministic function of its parents and an independent (possibly unobserved) noise variable N_i, and (2) the noise variables N_i are jointly independent. However, in this paper we assume that our observations provide only partial knowledge about a system and ask for structural properties common to all DAGs that represent the independencies of some larger set of elements.
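As a minimal numerical sketch of assumptions (1) and (2), the following Python snippet samples from a functional model of the collider X → Y ← Z and estimates mutual information empirically. The binary alphabet, the 10% noise level and the XOR mechanism are arbitrary choices made for illustration, not taken from the paper; the point is only that X and Z come out independent, as the causal Markov condition predicts, while conditioning on the common effect Y couples them.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Functional model for the collider X -> Y <- Z, following assumptions (1) and (2):
# each variable is a deterministic function of its parents and an independent
# noise term.  All concrete choices below are hypothetical.
X = rng.integers(0, 2, n)                     # X = N_X
Z = rng.integers(0, 2, n)                     # Z = N_Z
flip = (rng.random(n) < 0.1).astype(int)      # N_Y
Y = X ^ Z ^ flip                              # Y = f(X, Z, N_Y)

def mi(a, b):
    """Empirical mutual information (bits) of two binary arrays."""
    joint = np.bincount(2 * a + b, minlength=4).reshape(2, 2) / len(a)
    pa, pb = joint.sum(1, keepdims=True), joint.sum(0, keepdims=True)
    nz = joint > 0
    return float((joint[nz] * np.log2(joint[nz] / (pa @ pb)[nz])).sum())

print("I(X:Z)       ", round(mi(X, Z), 4))                   # close to 0
print("I(X:Z | Y=0) ", round(mi(X[Y == 0], Z[Y == 0]), 4))   # clearly positive
```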
To motivate our result, assume first that our observation consists of only two jointly distributed random variables X_1 and X_2 which are stochastically dependent. Reichenbach [6] postulated already in 1956 that the dependence of X_1 and X_2 needs to be explained by (at least) one of the following cases: X_1 is a cause of X_2, or X_2 is a cause of X_1, or there exists a common cause of X_1 and X_2. This link between dependence and the underlying causal structure is known as Reichenbach's principle of common cause. It is easily seen that if we assume X_1 and X_2 to be part of some unknown larger system whose causal structure is described by a DAG G, then the causal Markov condition for G implies the principle of common cause. Moreover, we can subsume all three cases of the principle if we formally allow a node to be an ancestor of itself, and arrive at the

Common cause principle.
If two observations X_1 and X_2 are dependent, then they must have a common ancestor in any DAG modeling some possibly larger system.
Our main result is an information-theoretic inequality that enables us to generalize this principle to more than two variables. It leads to the
Extended common cause principle (informal version).
Consider n observations X_1, ..., X_n and a number c with 2 ≤ c ≤ n. If the dependence of the observations exceeds a bound that depends on c, then in any DAG modeling some possibly larger system there exist c nodes out of X_1, ..., X_n that have a common ancestor.

Thus, structural information can be obtained by exploiting the degree of dependence on the subsystem, and we would like to emphasize that, in contrast to the original common cause principle, the above criterion provides means to distinguish among cases with the same independence structure of the observed variables. This is illustrated in Figure 1.

Above, the extended common cause principle is stated without making explicit the kind of observations we consider and how dependence is quantified. In the main case we have in mind, the observations are jointly distributed random variables and dependence is quantified by the mutual information [7] function. Then the extended common cause principle (Theorem 10) relates stochastic dependence to a property of all Bayesian networks that include the observations. However, the result holds for more general observations (such as binary strings) and for more general notions of mutual information (such as algorithmic mutual information [8]). Therefore we introduce an 'axiomatized' version of mutual information in the following section and describe how it can be connected to a DAG. Then, in Section 3 we prove a theorem on the decomposition of information about subsets of a DAG, from which the extended common cause principle follows as a corollary. Apart from a larger area of applicability, we think that an abstract proof based on an axiomatized notion of information better illustrates that the result is independent of the notion of 'probability'. It only relies on the basic properties of (stochastic) mutual information (see Definition 1). Finally, in Section 4 we describe the result in more detail within different contexts and relate it to the notions of redundancy and synergy that were introduced in the area of neural information processing.
Before introducing a general notion of mutual information, let us describe how it is connected to a DAG in the stochastic setting. Assume we are given an observation of n discrete random variables X_1, ..., X_n in terms of their joint probability distribution p(X_1, ..., X_n). Write [n] = {1, ..., n} and for a subset S ⊆ [n] let X_S be the random variable associated with the tuple (X_i)_{i∈S}. Assume further that a directed acyclic graph (DAG) G is associated with the nodes X_1, ..., X_n that fulfills the local Markov condition [3]: for all i (1 ≤ i ≤ n)

X_i ⊥⊥ X_{nd_i} | X_{pa_i},   (1)

where nd_i and pa_i denote the subsets of indices corresponding to the non-descendants and to the parents of X_i in G. The tuple (G, p(X_[n])) is called a Bayesian net [9], and the conditional independence relations imply the factorization of the joint probability distribution

p(x_1, ..., x_n) = ∏_{i∈[n]} p(x_i | x_{pa_i}),

where small letters x_i stand for values taken by the random variables X_i. From this factorization it follows that the joint information measured in terms of Shannon entropy [7] decomposes into a sum of individual conditional entropies

H(X_1, ..., X_n) = ∑_{i=1}^n H(X_i | X_{pa_i}).   (2)

Shannon entropy can be considered as an absolute measure of information. However, in many cases only a notion of information relative to another observation may be available. For example, in the case of continuous random variables, (differential) entropy can be negative and hence may not be a good measure of information. Therefore we would like to formulate our results based on a relative measure, such as mutual information, which, moreover, induces a notion of independence in a natural way. This can be achieved by introducing a specially designated variable Y relative to which information will be quantified. Y can for example be thought of as providing a noisy measurement of the X_[n] (Fig. 2(a)). Then, with respect to a joint probability distribution p(Y, X_[n]), we can transform the decomposition of entropies into a decomposition of mutual information [7]

I(Y : X_[n]) ≥ ∑_{i=1}^n I(Y : X_i | X_{pa_i}).   (3)

Figure 2: The graph in (a) shows a DAG on nodes X_1, ..., X_n whose observation is modeled by a leaf node Y (e.g. a noisy measurement). Figure (b) shows a DAG-model of observed elements O_1 and O_2, each a subset of its nodes.

For a proof and a condition for equality see Lemma 3 below. In the case of discrete variables, Shannon entropy H(X_i) can be seen as the mutual information of X_i with a copy of itself: H(X_i) = I(X_i : X_i). Therefore we can always choose p(Y | X_[n]) such that Y = X_[n] and the decomposition of entropies in (2) is recovered. We are interested in decompositions as in (2) and (3), since their violation allows us to exclude possible DAG structures.
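As a quick numerical illustration of the decomposition (2) (and, choosing Y = X_[n], of (3)), the following sketch builds the joint distribution of a small chain X1 → X2 → X3 from hypothetical conditional probability tables and checks that the joint entropy equals the sum of conditional entropies along the graph. All numbers are invented for illustration.

```python
import itertools
import math

# Hypothetical conditional probability tables for a chain X1 -> X2 -> X3 over bits.
p_x1 = {0: 0.5, 1: 0.5}
p_x2_given_x1 = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}
p_x3_given_x2 = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.4, 1: 0.6}}

# Joint distribution from the factorization p(x1) p(x2|x1) p(x3|x2).
joint = {}
for x1, x2, x3 in itertools.product((0, 1), repeat=3):
    joint[(x1, x2, x3)] = p_x1[x1] * p_x2_given_x1[x1][x2] * p_x3_given_x2[x2][x3]

def H(indices):
    """Shannon entropy (bits) of the marginal on the given coordinates."""
    marg = {}
    for x, p in joint.items():
        key = tuple(x[i] for i in indices)
        marg[key] = marg.get(key, 0.0) + p
    return -sum(p * math.log2(p) for p in marg.values() if p > 0)

# Decomposition (2): H(X1,X2,X3) = H(X1) + H(X2|X1) + H(X3|X2).
lhs = H([0, 1, 2])
rhs = H([0]) + (H([0, 1]) - H([0])) + (H([1, 2]) - H([1]))
print(round(lhs, 6), round(rhs, 6))   # identical
```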
However, note that the above relations are not yet very useful, since they require, through the assumption of the local Markov condition, that we have observed all relevant variables of a system. Before we relax this assumption in the next section, we introduce mutual information measures on general observations.

Definition 1 (measure of mutual information). Given a finite set of elements O, a measure of mutual information on O is a three-argument function on the power set,

I : 2^O × 2^O × 2^O → R,  (A, B, C) ↦ I(A : B | C),

such that for disjoint sets A, B, C, D ⊆ O it holds:

I(A : ∅) = 0   (normalization)
I(A : B | C) ≥ 0   (non-negativity)
I(A : B | C) = I(B : A | C)   (symmetry)
I(A : (B ∪ C) | D) = I(A : B | C ∪ D) + I(A : C | D)   (chain rule).

We say A is independent of B given C and write (A ⊥⊥ B | C) iff I(A : B | C) = 0. Further, we will generally omit the empty set as a third argument and substitute the union by a comma; hence we write I(A : B) instead of I(A : B | ∅) and I(A : B, C) instead of I(A : B ∪ C).

Of course, mutual information of discrete as well as of continuous random variables is included in the above definition. Further, in Section 4.2 we will discuss a recently developed theory of causal inference [4] based on algorithmic mutual information of binary strings. (Mutual information of composed quantum systems satisfies the definition as well, because it can be defined in formal analogy to classical information theory if Shannon entropy is replaced by the von Neumann entropy of a quantum state. The properties of mutual information stated above have been used to single out quantum physics from a whole class of no-signaling theories [10].) We now state two properties of mutual information that we need later on.

Lemma 2 (properties of mutual information). Let I be a measure of mutual information on a set of elements O. Then

(i) (data processing inequality) For three disjoint sets A, B, C ⊆ O,
I(A : C | B) = 0  ⟹  I(A : B) ≥ I(A : C).

(ii) (increase through conditioning on independent sets) For three disjoint sets A, B, C ⊆ O,
I(A : C | B) = 0  ⟹  I(Y : A | B) ≤ I(Y : A | B, C),   (4)
where Y ⊆ O is an arbitrary set disjoint from the rest. Further, the difference is given by I(A : C | B, Y).

Proof: (i) Using the chain rule two times,

I(A : B) = I(A : B) + I(A : C | B) = I(A : B, C) = I(A : C) + I(A : B | C) ≥ I(A : C),

where the last inequality follows from non-negativity of I. To prove (ii) we again use the chain rule:

I(Y : A | B) − I(Y : A | B, C) = I(Y : A | B) − I(Y, C : A | B) + I(A : C | B) = −I(A : C | B, Y) ≤ 0. □

As in the stochastic setting, we can connect a DAG to the conditional independence relation that is induced by mutual information: we say that a DAG on a given set of observations fulfills the local Markov condition if every node is independent of its non-descendants given its parents. Furthermore, we show in Appendix A that the induced independence relations are sufficiently nice, in the sense that they satisfy the semi-graphoid axioms [11]. This is useful because it implies that a DAG that fulfills the local Markov condition is an efficient partial representation of the conditional independence structure. Namely, conditional independence relations can be read off the graph with the help of a criterion called d-separation [1] (see Appendix A for details).
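To see Definition 1 and Lemma 2 at work in the familiar stochastic case, here is a small sketch (hypothetical probabilities, variables named A, B, C) that computes I(A : B | C) from a joint distribution via entropies and numerically checks the chain rule and the data processing inequality on a Markov chain A → B → C.

```python
import math

# A discrete instance of Definition 1: I(A:B|C) computed from a joint
# distribution over three named binary variables forming a Markov chain
# A -> B -> C.  Probabilities are hypothetical.
names = ("A", "B", "C")

def flip(x, eps):
    return {x: 1 - eps, 1 - x: eps}

joint = {}
for a in (0, 1):
    for b, pb in flip(a, 0.1).items():
        for c, pc in flip(b, 0.2).items():
            joint[(a, b, c)] = 0.5 * pb * pc

def H(subset):
    marg = {}
    for x, p in joint.items():
        key = tuple(x[names.index(v)] for v in subset)
        marg[key] = marg.get(key, 0.0) + p
    return -sum(p * math.log2(p) for p in marg.values() if p > 0)

def I(A, B, C=""):
    """Conditional mutual information I(A:B|C) in bits (sets given as strings)."""
    A, B, C = tuple(A), tuple(B), tuple(C)
    return H(A + C) + H(B + C) - H(A + B + C) - H(C)

# Chain rule:  I(A : B,C) = I(A : B | C) + I(A : C)
print(round(I("A", "BC"), 6), round(I("A", "B", "C") + I("A", "C"), 6))
# Lemma 2(i):  I(A : C | B) = 0 for the chain, and indeed I(A:B) >= I(A:C)
print(round(I("A", "C", "B"), 6), round(I("A", "B"), 6), round(I("A", "C"), 6))
```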
We conclude with a general formulation of the decomposition of mutual information that we already described in the probabilistic case.

Lemma 3 (decomposition of mutual information). Let I be a measure of mutual information on elements O_[n] = {O_1, ..., O_n} and Y. Further, let G be a DAG with node set O_[n] that fulfills the local Markov condition. Then

I(Y : O_[n]) ≥ ∑_{i=1}^n I(Y : O_i | O_{pa_i})   (5)

with equality if conditioning on Y preserves the independences of the local Markov condition, that is, if for all i

O_i ⊥⊥ O_{nd_i} | (O_{pa_i}, Y).   (6)

Proof: Assume the O_i are ordered topologically with respect to G. The proof is by induction on n. The lemma is trivially true if n = 1, with equality. Assume that it holds for k − 1 < n. It is easy to see that the graph G_k with nodes O_[k] that is obtained from G by deleting all but the first k nodes fulfills the local Markov condition with respect to O_[k]. By the chain rule

I(Y : O_[k]) = I(Y : O_[k−1]) + I(Y : O_k | O_[k−1]),

and we are left to show that I(Y : O_k | O_[k−1]) ≥ I(Y : O_k | O_{pa_k}). Since the local Markov condition holds, we have O_k ⊥⊥ O_{[k−1] \ pa_k} | O_{pa_k}, and the inequality follows by applying (4). Further, by property (ii) of the previous lemma, equality holds if for every k: O_k ⊥⊥ O_{[k−1] \ pa_k} | (O_{pa_k}, Y), which is implied by (6). □

In the next section we derive a similar inequality in the case in which only the mutual information of Y with a subset of the nodes O_[n] is known.

3 Partial information about a system
We have shown that the information about elements of a system described by a DAG decomposes if the graph fulfills the local Markov condition. In this section we derive a similar decomposition in cases where not all elements of a system have been observed. This decomposition will of course depend on specific properties of G and, in turn, enable us to exclude certain DAGs as models of the total system whenever we observe a violation of such a decomposition. More precisely, we are interested in properties of the class of DAG-models of a set of observations that we define as follows (see Figure 2(b)).

Definition 4 (DAG-model of observations). An observation of elements O_[n] = {O_1, ..., O_n} with respect to a reference object Y and mutual information measure I is given by the values of I(Y : O_S) for every subset S ⊆ [n]. A DAG G with nodes X together with a measure of mutual information I_G on X is a DAG-model of an observation if the following holds:

(i) each observation O_i is a subset of the nodes X of G,
(ii) G fulfills the local Markov condition with respect to I_G,
(iii) I_G is an extension of I, that is, I_G(Y : O_S) = I(Y : O_S) for all S ⊆ [n],
(iv) Y is a leaf node (no descendants) of G.

The first three conditions state that, given the causal Markov condition, G is a valid hypothesis on the causal relations among components of some larger system including the O_[n] that is consistent with the observed mutual information values. Condition (iv) is merely a technical condition due to the special role of Y as an observation of the O_[n] external to the system.

As an example, if the O_i and Y are random variables with joint distribution p(O_[n], Y), a DAG-model G with nodes X is given by the graph structure of a Bayesian net with joint distribution p(X) such that the marginal on O_[n] and Y equals p(O_[n], Y). Moreover, if Y is a copy of O_[n], then an observation in our sense is given by the values of the Shannon entropy H(O_S) for every subset S ⊆ [n].

The general question posed in this paper can then be formulated as follows: what can be learned from an observation given by the values I(Y : O_S) about the class of DAG-models? As a first step we present a property of mutual information about independent elements.

Lemma 5 (submodularity of I). If the O_i are mutually independent, that is, I(O_i : O_{[n]\i}) = 0 for all i, then the function [n] ⊇ S ↦ −I(Y : O_S) is submodular; that is, for two sets S, T ⊆ [n],

I(Y : O_S) + I(Y : O_T) ≤ I(Y : O_{S∪T}) + I(Y : O_{S∩T}).

Proof: For two subsets
S, T ⊆ [n], write S′ = S \ (S ∩ T) and T′ = T \ (S ∩ T). Using the chain rule we have

I(Y : O_{S∪T}) + I(Y : O_{S∩T}) = I(Y : O_S) + I(Y : O_{T′} | O_S) + I(Y : O_{S∩T})
  ≥ I(Y : O_S) + I(Y : O_{T′} | O_{S∩T}) + I(Y : O_{S∩T})
  = I(Y : O_S) + I(Y : O_T),

where the inequality follows from property (4) of mutual information. □

Hence, a violation of submodularity allows one to reject mutual independence among the O_i and therefore to exclude the DAG without any edges from the class of possible DAG-models (the local Markov condition would imply mutual independence).

We now broaden the applicability of the above lemma based on a result for submodular functions from [12]: we assume that there are unknown objects X = {X_1, ..., X_r} which are mutually independent and that the observed elements O_i ⊆ X are subsets of them (see Figure 3(a)). In contrast to the previous lemma, it is not required anymore that the O_i are mutually independent themselves. It turns out that the way the information about the O_i decomposes allows for the inference of intersections among the sets O_i, namely

Proposition 6 (decomposition of information about sets of independent elements). Let X = {X_1, ..., X_r} be mutually independent objects, that is, I(X_j : X_{[r]\j}) = 0 for all j. Let O_[n] = {O_1, ..., O_n}, where each O_i ⊆ X is a non-empty subset of X. For every i ∈ [n] let d_i be maximal such that O_i has non-empty intersection with d_i − 1 sets out of O_[n] distinct from O_i. Then the information about the O_[n] can be bounded from below by

I(Y : O_[n]) ≥ ∑_{i=1}^n (1/d_i) I(Y : O_i).   (7)

For an illustration see Figure 3(a). Even though the proposition is actually a corollary of the following theorem, its proof is given in Appendix B since it is, unlike the theorem, independent of graph-theoretic notions.

Figure 3: (a) shows four subsets O_1, ..., O_4 of independent elements X_1, ..., X_r 'observed by' Y. The intersection of any three sets O_i is empty, hence d_i ≤ 2 for i = 1, ..., 4 and I(Y : O_[4]) ≥ (1/2) ∑_{i=1}^4 I(Y : O_i). (b) shows a DAG-model in gray. The observed elements O_1, ..., O_4 are subsets of its nodes. One can check that the DAG does not imply any conditional independencies among the O_i (e.g. with the help of the d-separation criterion, see Appendix A). Nevertheless, there is no common ancestor of all four observations (∩_{i=1}^4 an(O_i) = ∅). Since Y only depends on the O_i, inequality (10) of Theorem 7 implies I(Y : O_[4]) ≥ ∑_{i=1}^4 (1/d_i) I(Y : O_i).

As a trivial example consider the case where O_1 = O_2 = O ⊆ X are identical subsets. Then d_1 = d_2 = 2 and

I(Y : O) = (1/2) I(Y : O_1) + (1/2) I(Y : O_2),

hence equality holds in (7). In general, if there is an element in O_i that is also contained in k − 1 other sets O_j, then d_i ≥ k and we account for this redundancy by dividing the single information I(Y : O_i) by at least k.

Independent elements can always be modeled as root nodes of a DAG. The following theorem, which is our main result, generalizes the proposition by connecting the information about observations O_i to the intersection structure of associated ancestral sets. For a given DAG G, a set of nodes A is called ancestral if for every edge v → w in G such that w is in A, also v is in A. Further, for a subset of nodes S, we denote by an(S) the smallest ancestral set that contains S. Elements of an(S) will be called ancestors of S.
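Before stating the theorem, here is a small numerical check of inequality (7). The setup (five independent fair bits, the particular overlapping subsets O_i and the function defining Y) is made up for illustration; the script computes the coefficients d_i combinatorially and verifies the bound by exact enumeration.

```python
import itertools
import math
from collections import Counter

# Hypothetical setup for Proposition 6: r = 5 mutually independent fair bits,
# overlapping observed subsets O_i, and Y an arbitrary function of the bits.
r = 5
subsets = {1: (0, 1), 2: (1, 2), 3: (2, 3), 4: (3, 4)}   # indices of the X_j in each O_i

def y_of(x):
    # arbitrary "measurement" of the system (not from the paper)
    return (x[0] ^ x[1] ^ x[2], x[2] & x[3], x[4])

def entropy(dist):
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def I_Y(indices):
    """I(Y : O_S) in bits, computed exactly by enumerating all 2^r states."""
    p_joint, p_o, p_y = Counter(), Counter(), Counter()
    for x in itertools.product((0, 1), repeat=r):
        p = 2.0 ** -r
        o, y = tuple(x[j] for j in indices), y_of(x)
        p_joint[(o, y)] += p
        p_o[o] += p
        p_y[y] += p
    return entropy(p_o) + entropy(p_y) - entropy(p_joint)

# Structural coefficients d_i: the largest number of subsets sharing one element.
d = {i: max(sum(j in o for o in subsets.values()) for j in oi)
     for i, oi in subsets.items()}

lhs = I_Y(tuple(range(r)))            # I(Y : O_[n]); here the union of the O_i is all of X
rhs = sum(I_Y(oi) / d[i] for i, oi in subsets.items())
print(d, round(lhs, 3), round(rhs, 3), lhs >= rhs - 1e-9)
```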
Theorem 7 (decomposition of ancestral information). Let G be a DAG-model of an observation of elements O_[n] = {O_1, ..., O_n}. For every i let d_i be the maximal number such that the intersection of an(O_i) with d_i − 1 distinct sets an(O_{i_1}), ..., an(O_{i_{d_i−1}}) is non-empty. Then the information about all ancestors of O_[n] can be bounded from below by

I(Y : an(O_[n])) ≥ ∑_{i=1}^n (1/d_i) I(Y : an(O_i)) ≥ ∑_{i=1}^n (1/d_i) I(Y : O_i).   (8)

Furthermore, if Y depends on the whole system X only through the O_[n], that is,

Y ⊥⊥ X \ (O_[n] ∪ {Y}) | O_[n],   (9)

we obtain an inequality containing only known values of mutual information:

I(Y : O_[n]) ≥ ∑_{i=1}^n (1/d_i) I(Y : O_i).   (10)

The proof is given in Appendix C and an example is illustrated in Figure 3(b). If all quantities except the structural parameters d_i are known, inequality (10) can be used to obtain information about the intersection structure among the O_i that is encoded in the d_i, provided that the independence assumption (9) holds. Even if (9) does not hold, but information on an upper bound of I(Y : an(O_[n])) is available (e.g. in terms of the entropy of Y), information about the intersection structure may be obtained from (8). The following corollary additionally provides a bound on the minimum information about ancestral sets.

Corollary 8 (inference of common ancestors, local version). Given an observation of elements O_[n] = {O_1, ..., O_n}, assume that for natural numbers c = (c_1, ..., c_n) with 1 ≤ c_i ≤ n − 1 we observe

ε_c := ∑_{i=1}^n (1/c_i) I(Y : O_i) − I(Y : an(O_[n])) > 0.   (11)

Let G be an arbitrary DAG-model of the observation. For every O_i, let A_{c_i+1} be the set of common ancestors in G of O_i and of at least c_i elements of O_[n] different from O_i. Then the joint information about all common ancestors can be bounded from below by

I(Y : ∪_{i=1}^n A_{c_i+1}) ≥ (∑_{i=1}^n 1/c_i − 1)^{−1} ε_c > 0.

In particular, for some index i ∈ [n] we must have A_{c_i+1} ≠ ∅; hence there exists a common ancestor of O_i and of at least c_i elements of O_[n] different from O_i.

The proof is given in Appendix D. Theorem 7 and its corollary are our most general results, but for ease of interpretation we illustrate them in the next section only in the special case in which all c_i are equal (Corollary 9), to obtain a lower bound on the information about all common ancestors of at least c + 1 elements O_i.

To conclude this section, we ask what is the maximum amount of information that one can expect to obtain about the intersection structure of ancestral sets of a DAG-model of an observation. The main requirement for a DAG-model G is that it fulfills the local Markov condition with respect to some larger set X of elements. This will remain true if we add nodes and arbitrary edges in a way that keeps G acyclic. Therefore, if G contains a common ancestor of c elements we can always construct a DAG-model G′ that contains a common ancestor of more than c elements (e.g. the DAG-model on the right-hand side of Fig. 1 can be transformed into the one on the left-hand side). We conclude that without adding minimality requirements for the DAG-models (such as the causal faithfulness assumption [2]), only assertions on ancestors of a minimal number of nodes can be made.
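The structural quantities entering Theorem 7 are purely graph-theoretic and easy to compute for any candidate DAG-model. The sketch below (graph and observed subsets are hypothetical) computes the ancestral sets an(O_i) and the coefficients d_i, i.e. the weights appearing on the right-hand sides of (8) and (10).

```python
# Sketch: ancestral sets an(O_i) and coefficients d_i of Theorem 7 for a
# candidate DAG-model.  Nodes are labeled by strings; parents[v] lists the
# parents of v.  The graph and the observed subsets are made up.
parents = {
    "U": [], "X1": ["U"], "X2": ["U"], "X3": [],
    "X4": ["X3"], "Y": ["X1", "X2", "X4"],
}
observed = {1: {"X1"}, 2: {"X2"}, 3: {"X3", "X4"}}

def ancestral_set(nodes):
    """Smallest ancestral set an(S): S together with all of its ancestors."""
    an, stack = set(nodes), list(nodes)
    while stack:
        for p in parents[stack.pop()]:
            if p not in an:
                an.add(p)
                stack.append(p)
    return an

an_sets = {i: ancestral_set(o) for i, o in observed.items()}

# d_i = size of the largest collection of ancestral sets (including an(O_i))
# that share a common element.
def d(i):
    return max(sum(x in a for a in an_sets.values()) for x in an_sets[i])

for i in observed:
    print(i, sorted(an_sets[i]), d(i))
# Here an(O_1) and an(O_2) share the hidden node "U", so d_1 = d_2 = 2,
# while an(O_3) is disjoint from the others and d_3 = 1.
```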
4 Structural implications of redundancy and synergy

The results of the last section can be related to the notions of redundancy and synergy. In the context of neuronal information processing, it has been proposed [13] to capture the redundancy and synergy of elements O_[n] = {O_1, ..., O_n} with respect to another element Y using the function

r(Y) := ∑_{i=1}^n I(Y : O_i) − I(Y : O_[n]),   (12)

where I is a measure of mutual information. Thus r relates the information that Y has about the single elements to the information about the whole set.

If the sum of informations about the single O_i is larger than the information about the whole set (r(Y) > 0), the O_[n] are said to be redundant with respect to Y. This may be the case if Y 'contains' information that is shared by multiple O_i. In general, if the O_i do not share any information, that is, if they are mutually independent, then they cannot be redundant with respect to any Y (this follows from Lemma 5).

On the other hand, if the information of Y about the whole set of elements is larger than about its single elements (r(Y) < 0), the O_[n] are called synergistic with respect to Y. This may for example be the case if Y is generated through a function Y = f(O_1, ..., O_n) and the function value contains little information about each argument (as is the case for the parity function, see below). If, instead, Y is a copy of the O_[n], then r(Y) ≥ 0 and the O_[n] are not synergistic with respect to Y.

To connect our results to these notions of redundancy and synergy, we introduce the following version of r parametrized by a parameter c ∈ {1, ..., n}:

r_c(Y) := (1/c) ∑_{i=1}^n I(Y : O_i) − I(Y : O_[n]).   (13)

Intuitively, if r_c(Y) > 0 for large c, then the O_i are highly redundant with respect to Y. Corollary 8 of the last section implies that high redundancy implies common ancestors of many O_i.

Corollary 9 (redundancy explained structurally). Let an observation of elements O_[n] = {O_1, ..., O_n} be given by the values of I(Y : O_S) for any subset S ⊆ [n]. If r_c(Y) > 0, then in any DAG-model of the observation in which Y depends on X only through O_[n], there exists a common ancestor of at least c + 1 elements of O_[n].

(We formulate the independence assumption as Y ⊥⊥ X̃ | O_[n], where X̃ denotes all nodes of the DAG-model different from the nodes in O_[n] and Y. Note that this assumption does not hold in the original context in which r has been introduced: there, Y is the observation of a stimulus that is presented to some neuronal system and the O_i represent the responses of (areas of) neurons to this stimulus.)
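A small sketch of how Corollary 9 can be applied in the stochastic case. The generative model (a hidden fair bit with three slightly noisy copies and a majority-vote readout Y, which depends on the system only through the observations) is invented for illustration; the script evaluates r_c(Y) exactly and reports the values.

```python
import itertools
import math
from collections import Counter

# Hypothetical model: a hidden fair bit U, three noisy copies O_1, O_2, O_3
# (flip probability 0.02), and Y = majority vote of the O_i.
eps = 0.02
joint = Counter()                      # keys: (o1, o2, o3, y)
for u in (0, 1):
    for o in itertools.product((0, 1), repeat=3):
        p = 0.5 * math.prod((1 - eps) if oi == u else eps for oi in o)
        y = int(sum(o) >= 2)           # majority vote
        joint[o + (y,)] += p

def H(idx):
    marg = Counter()
    for x, p in joint.items():
        marg[tuple(x[i] for i in idx)] += p
    return -sum(p * math.log2(p) for p in marg.values() if p > 0)

def I_Y(idx):                          # I(Y : O_S); Y sits at position 3
    return H(idx) + H((3,)) - H(tuple(idx) + (3,))

n = 3
for c in range(1, n):
    r_c = sum(I_Y((i,)) for i in range(n)) / c - I_Y(tuple(range(n)))
    print(c, round(r_c, 3))
# r_2 > 0 here, so Corollary 9 certifies a common ancestor of at least three
# of the O_i in every DAG-model -- consistent with the hidden bit U.
```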
In the following two subsections we discuss this result in more detail for the cases in which the observed elements are discrete random variables and binary strings.

Let X_[n] = {X_1, ..., X_n} and Y be discrete random variables with joint distribution p(X_[n], Y), and let I denote the usual measure of mutual information given by the Kullback-Leibler divergence of p from its factorized distribution [7]. If Y = X_[n] is a copy of the X_[n], then I(Y : X_[n]) = H(X_[n]), where H denotes the Shannon entropy. In this case the redundancy r(X_[n]) is equal to the multi-information [14] of the X_[n]. Moreover, r_c gives rise to a parametrized version of multi-information,

I_c(X_1, ..., X_n) := (1/c) ∑_{i=1}^n H(X_i) − H(X_[n]),

and from Corollary 8 we obtain

Theorem 10 (lower bound on entropy of common ancestors). Let X_[n] be jointly distributed discrete random variables. If I_c(X_[n]) > 0, then, in any Bayesian net containing the X_[n], there exists a common ancestor of strictly more than c variables out of the X_[n]. Moreover, the entropy of the set A_{c+1} of all common ancestors of more than c variables is lower bounded by

H(A_{c+1}) ≥ (c / (n − c)) I_c(X_[n]).

We continue with some remarks to illustrate the theorem:

(a) Setting c = 1, the theorem states that, up to a factor 1/(n − 1), the multi-information I_1 is a lower bound on the entropy of common ancestors of at least two variables. In particular, if I_1(X_[n]) > 0, then any DAG on the X_[n] that fulfills the local Markov condition must have at least one edge.

(b) Conversely, the entropy of the common ancestors of all the elements X_1, ..., X_n is lower bounded by (n − 1) I_{n−1}(X_[n]). This bound is not trivial whenever I_{n−1}(X_[n]) > 0,
which is for example the case if the X_i are only slightly disturbed copies of some not necessarily observed random variable (see the example below).

(c) We emphasize that the inferred common ancestors can be among the elements X_i themselves. Unobserved common ancestors can only be inferred by postulating assumptions on the causal influences among the X_i. If, for example, all the X_i were measured simultaneously, a direct causal influence among the X_i can be excluded and any dependence or redundancy has to be attributed to unobserved common ancestors.

(d) Finally, note that the criterion I_c > 0 is used in the theorem in an optimal way. By this we mean that one can construct distributions p(X_[n]) such that I_c(X_[n]) = 0 for a given c and no common ancestors of c + 1 nodes have to exist.

We conclude this section with two examples.

Example (three variables):
Let X_1, X_2 and X_3 be three binary variables, each with maximal entropy H(X_i) = log 2. Then I_2(X_1, X_2, X_3) > 0 if and only if the joint entropy H(X_1, X_2, X_3) is strictly less than (3/2) log 2. In this case, there must exist a common ancestor of all three variables in any Bayesian net that contains them. In particular, any Bayesian net corresponding to the DAG on the right-hand side of Figure 1 can be excluded as a model.
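A numerical version of this example, with the three variables taken to be slightly noisy copies of a hidden fair coin (the noise level is chosen only for illustration). The script checks the criterion I_2 > 0, i.e. H(X_1, X_2, X_3) < (3/2) log 2, working in bits so that log 2 = 1.

```python
import itertools
import math

# Hypothetical distribution: X1, X2, X3 are independent noisy copies
# (flip probability 0.02) of a hidden fair coin U, so each marginal is
# uniform and H(X_i) = 1 bit = log 2.
eps = 0.02
p = {}
for x in itertools.product((0, 1), repeat=3):
    p[x] = sum(0.5 * math.prod((1 - eps) if xi == u else eps for xi in x)
               for u in (0, 1))

H_joint = -sum(q * math.log2(q) for q in p.values() if q > 0)
I_2 = 0.5 * 3 - H_joint          # (1/2) sum_i H(X_i) - H(X1,X2,X3), in bits
print(round(H_joint, 3), round(I_2, 3))
# H_joint stays below 1.5 bits, so I_2 > 0: by Theorem 10 every Bayesian net
# containing X1, X2, X3 has a common ancestor of all three variables, with
# entropy at least (2/(3-2)) * I_2 = 2 * I_2.
```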
Example (synchrony and interaction among random variables): Let X_1 = X_2 = · · · = X_n be identical random variables with non-vanishing entropy h. Then in particular I_{n−1}(X_[n]) = h/(n − 1) > 0, hence there must exist a common ancestor of all n nodes in any Bayesian net that contains them.

In contrast to the synchronized case, let X_1, X_2, ..., X_n be binary random variables taking values in {−1, 1} and assume that the joint distribution is of pure n-interaction (this terminology is motivated by the general framework of interaction spaces proposed and investigated by Darroch et al. [15] and used by Amari [16] within information geometry), that is, for some β ≠ 0 it has the form

p_β(x_1, ..., x_n) := (1/Z_β) exp(β x_1 x_2 ⋯ x_n),

where Z_β is a normalization constant. It can be shown that there exists a Bayesian net including the X_[n] in which common ancestors of at most two variables exist. This is illustrated in Figure 4 for three variables and in the limiting case β = ∞, in which each X_i is uniformly distributed and X_3 = X_1 · X_2. We found it somewhat surprising that, contrary to synchronization, higher order interaction among observations does not require common ancestors of many variables.

Figure 4: Higher order interaction among observed random variables can be explained by a Bayesian net in which only common ancestors of two variables exist. More precisely, all random variables are binary with values in {−1, 1}, the unobserved common ancestors U_{ij} are mutually independent and uniformly distributed, and each observation X_i is the product of the values of its two ancestors: X_1 = U_{12} U_{13}, X_2 = U_{12} U_{23}, X_3 = U_{13} U_{23}. The resulting marginal distribution p(X_1, X_2, X_3) is of higher order interaction: it is related to the parity function, p(X_1 = x_1, X_2 = x_2, X_3 = x_3) = 1/4 if x_1 x_2 x_3 = 1 and zero otherwise.

In some situations it is not convenient or straightforward to summarize an observation in terms of a joint probability distribution of random variables. Consider for example cases in which the data comes from repeated observations under varying conditions (e.g. time series). A related situation is given if the number of samples is low. Janzing and Schölkopf [4] argue that causal inference in these situations should still be possible, provided that the observations are sufficiently complex. To this end, they developed a framework for causal inference from single observations that we now describe briefly. Assume we have observed two objects A and B in nature (e.g. two carpets) and we encoded these observations into binary strings a and b. If the descriptions of the observations in terms of the strings a and b are sufficiently complex and sufficiently similar (e.g. the same pattern on the carpets), one would expect an explanation of this similarity in terms of a mechanism that relates these two strings in nature (are the carpets produced by the same company?). It is necessary that the descriptions are sufficiently complex, as an example of [4] illustrates: assume the two observed strings are equal to the first hundred digits of the binary expansion of π; hence they can be generated independently by a simple rule. In this case, the similarity of the two strings would not be considered as strong evidence for the existence of a causal link. To exclude such cases, the Kolmogorov complexity [17] K(s) of a string s has been used as a measure of complexity. It is defined as the length of the shortest program that prints out s on a universal (prefix-free) Turing machine.
With this definition, strings that can be generated using a simple rule, such as the constant string s = 00⋯0 or the first n digits of the binary expansion of π, are considered simple, whereas it can be shown that a random string of length n is complex with high probability. Kolmogorov complexity can be transformed into a function on sets of strings by choosing a suitable concatenation function ⟨·, ·⟩, such that K(s_1, ..., s_n) = K(⟨s_1, ⟨s_2, ..., ⟨s_{n−1}, s_n⟩ ...⟩⟩).

The algorithmic mutual information [8] of two strings a and b is then equal to the sum of the lengths of the shortest programs that generate each string separately minus the length of the shortest program that generates the strings a and b together:

I(a : b) ⁺= K(a) + K(b) − K(a, b),

where ⁺= stands for equality up to an additive constant that depends on the choice of the universal Turing machine. In analogy to Reichenbach's principle of common cause, [4] postulates a causal relation among a and b whenever I(a : b) is large, which is the case if the complexities of the strings are large and both strings together can be generated by a much shorter program than the programs that describe them separately.

In formal analogy to the probabilistic case, algorithmic mutual information can be extended to a conditional version defined for sets of strings A, B, C ⊆ {s_1, ..., s_n} as

I(A : B | C) ⁺= K(A ∪ C) + K(B ∪ C) − K(A ∪ B ∪ C) − K(C).

I(A : B | C) is the mutual information between the strings of A and the strings of B if a shortest program that prints the strings in C has been provided as an additional input. Based on this notion of conditional mutual information, the causal Markov condition can be formulated in the algorithmic setting. It can be proven [4] to hold for a directed acyclic graph G on strings s_1, ..., s_n if every s_i can be computed by a simple program on a universal Turing machine from its parents and an additional string n_i such that the n_i are mutually independent. Without going into the details, we sum up by stating that DAGs on strings can be given a causal interpretation, and it is therefore interesting to infer properties of the class of possible DAGs that represent the algorithmic conditional independence relations.

In the algorithmic setting, our result can be stated as follows.

Theorem 11 (inference of common ancestors of strings). Let O_[n] = {s_1, ..., s_n} be a set of binary strings. If for a number c, 1 ≤ c ≤ n − 1,

(1/c) ∑_{i=1}^n K(s_i) − K(s_1, ..., s_n) ⁺≥ 0,

then there must exist a common ancestor of at least c + 1 strings out of O_[n] in any DAG-model of the O_[n]. (Here ⁺≥ means up to an additive constant dependent only on the choice of a universal Turing machine, on c and on n.)

Proof: As described, algorithmic mutual information is an information measure in our sense only up to an additive constant depending on the choice of the universal Turing machine. However, one can check that in this case the decomposition of mutual information (Theorem 7) holds up to an additive constant that depends additionally on the number of strings n and the chosen parameter c. The result on Kolmogorov complexities follows by choosing Y = (s_1, ..., s_n), since K(s_i) ⁺= I(Y : s_i). □

Thus, highly redundant strings require a common ancestor in any DAG-model.
Since the Kolmogorov complexity of a string s is uncomputable, we have argued in recent work [5] that it can be substituted by a measure of complexity in terms of the length of a compressed version of s with respect to a chosen compression scheme (instead of a universal Turing machine), and the above result should still hold approximately.
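A rough illustration of this substitution, not taken from [5]: using zlib compressed lengths in place of K, the quantity of Theorem 11 can be evaluated for concrete strings. The data and the choice of compressor are arbitrary, and the additive constants of the theorem are of course not controlled by this heuristic.

```python
import random
import zlib

def C(*strings):
    """Compressed length (bytes) of the concatenation -- a crude stand-in for K."""
    return len(zlib.compress(b"".join(strings), 9))

# Hypothetical observations: three strings that share a long common block
# (their "common ancestor") plus short individual parts.  All data is made up.
rng = random.Random(0)
shared = bytes(rng.randrange(256) for _ in range(2000))
s = [shared + bytes(rng.randrange(256) for _ in range(200)) for _ in range(3)]

c = 2
redundancy = sum(C(si) for si in s) / c - C(*s)
print([C(si) for si in s], C(*s), round(redundancy, 1))
# A clearly positive value suggests -- up to the unknown additive constants and
# the imperfections of the compressor -- a common ancestor of all three strings
# in any DAG-model, here the shared block they were built from.
```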
We saw that large redundancy implies common ancestors of many elements, and we may wonder whether structural information can be obtained from synergy in a similar way. This seems not to be possible, since synergy is related to more fine-grained information (information about the mechanisms), as the following example shows. Assume the observations O_[n] are mutually independent. Then any DAG is a valid DAG-model, since the local Markov condition will always be satisfied. We also know that r(Y) ≤ 0, but it turns out that the amount of synergy crucially depends on the way that Y has processed the information of the O_[n] (and therefore not on a structural property among the O_[n] themselves). To see this, let the observations O_i be binary random variables which are mutually independent and uniformly distributed, such that

p(O_[n]) = ∏_{i=1}^n p(O_i)  and  p(O_i = 1) = p(O_i = 0) = 1/2.

Further, let Y = (O_i ⊕ O_j)_{i<j} be the collection of all pairwise XORs. Then r(Y) = −(n − 1) log 2. On the other hand, if Y = O_1 ⊕ · · · ⊕ O_n, then r(Y) = −log 2 only.

Nevertheless, it is an easy observation that synergy with respect to Y can be related to an increase of redundancy after conditioning on Y. Since I(· : · | Y) is a measure of mutual information as well, we define a conditioned version of r in a canonical way as

r_c(Z | Y) := (1/c) ∑_{i=1}^n I(Z : O_i | Y) − I(Z : O_[n] | Y),

with respect to some observation Z. If I can be evaluated on non-disjoint subsets, that is, if we can choose Z = O_[n], we have the following

Proposition 12 (synergy from increased redundancy induced by conditioning). Let O_[n] = {O_1, ..., O_n} and Y be arbitrary elements on which a mutual information function I is defined. Then

r_c(Y) = r_c(O_[n]) − r_c(O_[n] | Y),

hence if conditioning on Y increases the redundancy of O_[n] with respect to itself, then r_c(Y) < 0 and the O_[n] are synergistic with respect to Y.

Proof: Using the chain rule, we derive

r_c(O_[n]) − r_c(O_[n] | Y) = r_c(Y) − r_c(Y | O_[n]) = r_c(Y),

where the last equality follows because r_c(Y | O_[n]) = 0. □

Continuing the example of binary random variables above, mutual independence of the O_[n] is equivalent to r(O_[n]) = 0 and therefore, using the proposition, r(Y) = −r(O_[n] | Y). Thus, if Y = O_1 ⊕ · · · ⊕ O_n,

r(Y) = −r(O_[n] | Y) = H(O_[n] | Y) − ∑_{i=1}^n H(O_i | Y) = −log 2,

as already noted above.
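A direct numerical check of the two synergy values just discussed, for n = 4 independent fair bits (working in bits, so log 2 = 1):

```python
import itertools
import math
from collections import Counter

def r(joint, n):
    """r(Y) = sum_i I(Y:O_i) - I(Y:O_[n]) in bits; Y is the last coordinate."""
    def H(idx):
        marg = Counter()
        for x, p in joint.items():
            marg[tuple(x[i] for i in idx)] += p
        return -sum(p * math.log2(p) for p in marg.values() if p > 0)
    def I(idx):
        return H(idx) + H((n,)) - H(tuple(idx) + (n,))
    return sum(I((i,)) for i in range(n)) - I(tuple(range(n)))

n = 4
def pairwise_xors(o):
    return tuple(o[i] ^ o[j] for i in range(n) for j in range(i + 1, n))
def parity(o):
    return o[0] ^ o[1] ^ o[2] ^ o[3]

for label, f in (("pairwise XORs", pairwise_xors), ("full parity", parity)):
    joint = Counter()
    for o in itertools.product((0, 1), repeat=n):    # independent fair bits
        joint[o + (f(o),)] += 2.0 ** -n
    print(label, round(r(joint, n), 3))              # -(n-1) = -3.0, and -1.0
```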
Based on a generalized notion of mutual information, we proved an inequality describing the decomposition of information about a whole set into the sum of information about its parts. The decomposition depends on a structural property, namely the existence of common ancestors in a DAG. We connected the result to the notions of redundancy and synergy and concluded that large redundancy implies the existence of common ancestors in any DAG-model. Specialized to the case of discrete random variables, this means that large stochastic dependence in terms of multi-information needs to be explained through a common ancestor (in a Bayesian net) acting as a broadcaster of information.

Much work has been done already that examined the restrictions imposed on observations by graphical models that include latent variables. Pearl [1, 18] already investigated constraints imposed by the special instrumental variable model. Also Darroch et al. [15] and recently Sullivant et al. [19] looked at linear Gaussian graphical models and determined constraints in terms of the entries of the covariance matrix describing the data (tetrad constraints). Further, methods of algebraic statistics were applied (e.g. [20]) to derive constraints that are induced by latent variable models directly on the level of probabilities. In general this does not seem to be an easy task due to the large number of variables involved, and information-theoretic quantities allow for relatively easy derivations of 'macroscopic' constraints (see also [21]). Finally, we think that the general methodology of connecting concepts such as synergy and redundancy of observations to properties of the class of possible DAG-models is interesting, especially in the light of their causal interpretation.

A Semi-graphoid axioms and d-separation

Consider the conditional independence relation that is induced by an information measure on a set of objects (A ⊥⊥ B | C ⇔ I(A : B | C) = 0). Then

Lemma 13 (general independence satisfies semi-graphoid axioms). The relation of (conditional) independence induced by an information measure I on elements O satisfies the semi-graphoid axioms: for disjoint subsets W, X, Y and Z of O it holds

(1) X ⊥⊥ Y | Z ⇒ Y ⊥⊥ X | Z   (symmetry)
(2) X ⊥⊥ (Y, W) | Z ⇒ X ⊥⊥ Y | Z and X ⊥⊥ W | Z   (decomposition)
(3) X ⊥⊥ (Y, W) | Z ⇒ X ⊥⊥ Y | (Z, W)   (weak union)
(4) X ⊥⊥ W | (Z, Y) and X ⊥⊥ Y | Z ⇒ X ⊥⊥ (W, Y) | Z   (contraction)

The proof is immediate using non-negativity and the chain rule of mutual information. In the probabilistic context, the axiomatic approach to conditional independence has been presented by Dawid [11]. The above lemma is important since it implies that a DAG that fulfills the local Markov condition with respect to a set of objects is an efficient partial representation of the conditional independence structure among the observations. Namely, conditional independence relations can be read off the graph with the help of a criterion called d-separation [1]. This is the content of the following theorem, but before stating it we recall the definition of d-separation: two sets of nodes A and B of a DAG are d-separated given a set C disjoint from A and B if every undirected path between A and B is blocked by C. A path described by the ordered tuple of nodes (x_1, x_2, ..., x_r) with x_1 ∈ A and x_r ∈ B is blocked if at least one of the following is true:

(1) there is an i such that x_i ∈ C and x_{i−1} → x_i → x_{i+1} or x_{i−1} ← x_i ← x_{i+1} or x_{i−1} ← x_i → x_{i+1},
(2) there is an i such that x_i and its descendants are not in C and x_{i−1} → x_i ← x_{i+1}.

Theorem 14 (equivalence of Markov conditions). Let I be a measure of mutual information on elements O_[n] = {O_1, ..., O_n} and let G be a DAG with node set O_[n]. Then the following two properties are equivalent:

(1) (local Markov condition) Every node O_i of G is independent of its non-descendants O_{nd_i} given its parents O_{pa_i}: O_i ⊥⊥ O_{nd_i} | O_{pa_i}.
(2) (global Markov condition) For every three disjoint sets of nodes A, B and C such that A is d-separated from B given C in G, it holds that A ⊥⊥ B | C.

Proof: (1) → (2). Since the dependence measure I satisfies the semi-graphoid axioms (Lemma 13), we can apply Theorem 2 in Verma & Pearl [22], which asserts that the DAG is an I-map, or in other words that d-separation relations represent a subset of the (conditional) independences that hold for the given objects. (2) → (1) holds because the non-descendants of a node are d-separated from the node itself by the parents. □

In general there may hold additional conditional independence relations among the observations that are not implied by the local Markov condition together with the semi-graphoid axioms.
In fact, it is well known that there are so-called non-graphical probability distributions whose conditional independence structure cannot be completely represented by any DAG.

B Proof of Proposition 6

We have shown in Lemma 5 the submodularity of I(Y : ·) with respect to independent sets. The rest of the proof is along the lines of the proof of Corollary I in [12]. First, by iteratively applying the chain rule for mutual information we obtain

I(Y : X_[r]) = ∑_{i=0}^{r−1} I(Y : X_{i+1} | X_[i]).   (14)

Without loss of generality we can assume that every X_i is part of at least one set O_k for some k. Let n_i be the total number of subsets O_k containing X_i. By definition of d_k, for every k with X_i ∈ O_k it holds that n_i ≤ d_k, and we obtain

∑_{O_j : X_i ∈ O_j} 1/d_j ≤ n_i · max_{O_j : X_i ∈ O_j} 1/d_j ≤ 1.   (15)

Putting (14) and (15) together we get

I(Y : O_[n]) = I(Y : X_[r]) = ∑_{i=1}^r I(Y : X_i | X_[i−1])
  ≥ ∑_{i=1}^r I(Y : X_i | X_[i−1]) (∑_{O_j : X_i ∈ O_j} 1/d_j)
  (a)= ∑_{j=1}^n (1/d_j) ∑_{X_i ∈ O_j} I(Y : X_i | X_[i−1])
  (b)≥ ∑_{j=1}^n (1/d_j) ∑_{X_i ∈ O_j} I(Y : X_i | X_[i−1] ∩ O_j)
  (c)= ∑_{j=1}^n (1/d_j) I(Y : O_j),

where the first inequality uses (15) and non-negativity of I, (a) is obtained by exchanging summations, and (b) uses the property of I that conditioning on independent objects can only increase mutual information (inequality (4) applied to X_i ⊥⊥ (X_[i−1] \ O_j) | (X_[i−1] ∩ O_j)). This is the point at which submodularity of I is used, since it is actually equivalent to (4), as can be seen from the proof of Lemma 5. Finally, (c) is an application of the chain rule to the elements of each O_j separately.

C Proof of Theorem 7

By assumption O_i ⊆ X, and the DAG G with node set X fulfills the local Markov condition. For each O_i denote by an_G(O_i) the smallest ancestral set in G containing O_i. An easy observation that we need in the proof is the fact that two ancestral sets A and B are independent given their intersection:

A \ B ⊥⊥ B \ A | A ∩ B.   (16)

This is implied by d-separation using Theorem 14. We first prove the inequality

I(Y : an_G(O_[n])) ≥ ∑_{i=1}^n (1/d_i) I(Y : an_G(O_i)).   (17)

From this the inequalities of the theorem follow directly: (8) holds since I(Y : an(O_i)) ≥ I(Y : O_i) using the monotonicity of I (implied by the chain rule and non-negativity). Further, (10) is a direct consequence of (17) together with the independence assumption (9), since by the chain rule

I(Y : an_G(O_[n])) = I(Y : O_[n]) + I(Y : an_G(O_[n]) \ O_[n] | O_[n]) = I(Y : O_[n]),

where the last equality is a consequence of (9).

The proof of (17) is by induction on the number of elements in A = an_G(O_[n]). If A = ∅, nothing has to be proven. Assume now that (17) holds for collections Õ_[n] = {Õ_1, ..., Õ_n} such that Ã = ∪_{i=1}^n an(Õ_i) is of cardinality at most k − 1. Let O_[n] be a set of observations such that A is of cardinality k. From O_[n] we construct a new collection Õ_[n] as follows: w.l.o.g. assume m := d_1 > 0; in particular O_1 is non-empty, and moreover, by definition of d_1 and after reordering of the O_i, we can assume that the intersection V := ∩_{i=1}^m an_G(O_i) is non-empty. Note that V itself is an ancestral set. We define Õ_i = O_i \ V for all 1 ≤ i ≤ n and denote by G̃ the modified graph that is obtained from G by removing all elements of V. Further, denote by Ĩ(A : B | C) := I(A : B | C, V) a modified measure of mutual information obtained by conditioning on V.
One checks easily that the graph G̃ fulfills the local Markov condition with respect to the independence relation induced by Ĩ and is a DAG-model of the elements Õ_[n]. Hence, by the induction assumption,

Ĩ(Y : an_{G̃}(Õ_[n])) ≥ ∑_{i=1}^n (1/d̃_i) Ĩ(Y : an_{G̃}(Õ_i)),   (18)

where d̃_i is defined similarly to d_i, but with respect to the elements Õ_i and G̃, and the sum is over all non-empty Õ_i. By construction of Ĩ and Õ_[n], the left-hand side of (18) is equal to

Ĩ(Y : an_{G̃}(Õ_[n])) = I(Y : an_G(O_[n]) \ V | V) = I(Y : an_G(O_[n])) − I(Y : V).   (19)

The right-hand side of (18) can be rewritten as

∑_{i=1}^n (1/d̃_i) Ĩ(Y : an_{G̃}(Õ_i)) (a)≥ ∑_{i=1}^n (1/d_i) Ĩ(Y : an_{G̃}(Õ_i))
  (b)= ∑_{i=1}^m (1/d_i) I(Y : an_G(O_i) \ V | V) + ∑_{i=m+1}^n (1/d_i) I(Y : an_G(O_i) | V)
  (c)≥ ∑_{i=1}^m (1/d_i) I(Y : an_G(O_i) \ V | V) + ∑_{i=m+1}^n (1/d_i) I(Y : an_G(O_i)),

where (a) follows because d_i ≥ d̃_i by definition, and (b) follows because an_G(O_i) ∩ V = ∅ for i > m. Hence by (16) V and an_G(O_i) are independent, and therefore conditioning on V only increases mutual information, as proven in Lemma 2, and inequality (c) follows. We continue by rewriting the first m summands of the right-hand side using the chain rule:

∑_{i=1}^m (1/d_i) I(Y : an_G(O_i) \ V | V) = ∑_{i=1}^m (1/d_i) [I(Y : an_G(O_i)) − I(Y : V)]
  ≥ [∑_{i=1}^m (1/d_i) I(Y : an_G(O_i))] − I(Y : V),

where the inequality holds because ∑_{i=1}^m 1/d_i ≤ 1. Altogether we obtain

∑_{i=1}^n (1/d̃_i) Ĩ(Y : an_{G̃}(Õ_i)) ≥ ∑_{i=1}^n (1/d_i) I(Y : an_G(O_i)) − I(Y : V).

Since we have shown in (18) and (19) that the left-hand side can be bounded from above by I(Y : an_G(O_[n])) − I(Y : V), we observe that I(Y : V) cancels and (17) is proven.

D Proof of Corollary 8

Proof: Let G be a DAG-model of the observation of O_[n] = {O_1, ..., O_n}. We construct a new DAG G′ by removing the objects of A := ∪_{i=1}^n A_{c_i+1}. Since A is an ancestral set, G′ fulfills the local Markov condition with respect to the mutual information measure obtained by conditioning on A. We apply Theorem 7 to G′ and the observations O′_[n] = {O_1 \ A, ..., O_n \ A} to get

I(Y : an_{G′}(O′_[n]) | A) ≥ ∑_{i=1}^n (1/c_i) I(Y : O′_i | A).   (20)

Using assumption (11) and the chain rule for mutual information we obtain

I(Y : A) = I(Y : an_G(O_[n])) − I(Y : an_G(O_[n]) \ A | A)
  (a)= I(Y : an_G(O_[n])) − I(Y : an_{G′}(O′_[n]) | A)
  (b)≤ ∑_{i=1}^n (1/c_i) [I(Y : O_i) − I(Y : O′_i | A)] − ε_c
  (c)≤ ∑_{i=1}^n (1/c_i) I(Y : A) − ε_c,

where in (a) we used the definition of the O′_i and for (b) we plugged in inequalities (11) and (20). Finally, (c) holds because

I(Y : O_i) − I(Y : O′_i | A) = I(Y : O_i ∩ A | O′_i) + I(Y : O′_i) − I(Y : O′_i | A)
  = I(Y : O_i ∩ A | O′_i) + I(Y : A) − I(Y : A | O′_i) ≤ I(Y : A),

where the chain rule has been applied multiple times. The corollary now follows by solving for I(Y : A). □

References

[1] J. Pearl, Causality. Cambridge University Press, 2000.
[2] P. Spirtes, C. Glymour, and R. Scheines, Causation, Prediction, and Search, 2nd ed. (Adaptive Computation and Machine Learning). The MIT Press, 2001.
[3] S. L. Lauritzen, Graphical Models. Oxford Statistical Science Series, Oxford University Press, 1996.
[4] D. Janzing and B.
Schölkopf, "Causal inference using the algorithmic Markov condition," IEEE Trans. Inf. Theory, vol. 56, Oct. 2010.
[5] B. Steudel, D. Janzing, and B. Schölkopf, "Causal Markov condition for submodular information measures," Proceedings of COLT 2010, arXiv:1002.4020, 2010.
[6] H. Reichenbach, The Direction of Time. University of California Press, 1956.
[7] T. M. Cover and J. A. Thomas, Elements of Information Theory. Wiley-Interscience, 2nd ed., July 2006.
[8] P. Gács, J. T. Tromp, and P. M. Vitányi, "Algorithmic statistics," IEEE Trans. Inf. Theory, vol. 47, pp. 2443–2463, 2001.
[9] J. Pearl, Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 1988.
[10] M. Pawłowski, T. Paterek, D. Kaszlikowski, V. Scarani, A. Winter, and M. Żukowski, "Information causality as a physical principle," Nature, vol. 461, pp. 1101–1104, Oct. 2009.
[11] A. P. Dawid, "Conditional independence in statistical theory," Journal of the Royal Statistical Society, Series B (Methodological), vol. 41, no. 1, pp. 1–31, 1979.
[12] M. Madiman and P. Tetali, "Information inequalities for joint distributions, with interpretations and applications," IEEE Trans. Inf. Theory, 2008.
[13] E. Schneidman, S. Still, M. J. Berry II, and W. Bialek, "Network information and connected correlations," Phys. Rev. Lett., vol. 91, 2003.
[14] M. Studený and J. Vejnarová, "The multiinformation function as a tool for measuring stochastic dependence," in M. I. Jordan (ed.), Learning in Graphical Models, pp. 261–297, 1998.
[15] J. N. Darroch, S. L. Lauritzen, and T. P. Speed, "Markov fields and log-linear interaction models for contingency tables," Annals of Statistics, vol. 8, pp. 522–539, 1980.
[16] S. I. Amari, "Information geometry on hierarchy of probability distributions," IEEE Trans. Inf. Theory, vol. 47, no. 5, pp. 1701–1711, 2001.
[17] M. Li and P. Vitányi, An Introduction to Kolmogorov Complexity and Its Applications. Texts and Monographs in Computer Science, Springer-Verlag, 2007.
[18] J. Pearl, "On the testability of causal models with latent and instrumental variables," in UAI, pp. 435–443, 1995.
[19] S. Sullivant, K. Talaska, and J. Draisma, "Trek separation for Gaussian graphical models," arXiv:0812.1938, 2009.
[20] E. Riccomagno and J. Q. Smith, "Algebraic causality: Bayes nets and beyond," arXiv:0709.3377, Sep. 2007.
[21] N. Ay, "A refinement of the common cause principle," Discrete Applied Mathematics, vol. 157, pp. 2439–2457, 2009.
[22] T. Verma and J. Pearl, "Causal networks: Semantics and expressiveness."