Marginal log-linear parameters for graphical Markov models
MMarginal log-linear parameters for graphical Markov models
Robin J. Evans
Department of StatisticsUniversity of Washington [email protected]
Thomas S. Richardson
Department of StatisticsUniversity of Washington [email protected]
October 1, 2018
Abstract
Marginal log-linear (MLL) models provide a flexible approach to multivariate dis-crete data. MLL parametrizations under linear constraints induce a wide variety ofmodels, including models defined by conditional independences. We introduce a sub-class of MLL models which correspond to Acyclic Directed Mixed Graphs (ADMGs)under the usual global Markov property. We characterize for precisely which graphsthe resulting parametrization is variation independent. The MLL approach providesthe first description of ADMG models in terms of a minimal list of constraints. Theparametrization is also easily adapted to sparse modelling techniques, which we illus-trate using several examples of real data.
Keywords: acyclic directed mixed graph; discrete graphical model; marginal log-linearparameter; parsimonious modelling; variation independence.
Models defined by conditional independence constraints are central to many methodsin multivariate statistics, and in particular to graphical models (Darroch et al., 1980;Whittaker, 1990). In the case of discrete data, marginal log-linear (MLL) parameters canbe used to parametrize a broad range of models, including some graphical classes andmodels for conditional independence (Rudas et al., 2010; Forcina et al., 2010). Theseparameters are defined by considering a sequence, M , M , . . . , M k , of margins of thedistribution which respects inclusion (i.e. M i precedes M j if M i ⊂ M j ), with each suchsequence giving rise to a smooth parametrization of the saturated model. Useful sub-models can be induced by setting some of the parameters to zero, or more generally byrestricting attention to a linear or affine subset of the parameter space.The flexibility present in this scheme presents a challenge both in terms of interpretingthe resulting model and performing model selection, for which a tractable search space istypically required. We describe a sub-class of marginal log-linear models corresponding toa class of graphs known as acyclic directed mixed graphs (ADMGs), which contain directed1 a r X i v : . [ s t a t . M E ] O c t G .( → ) and bidirected ( ↔ ) edges, subject to the constraint that there are no cycles of directededges; an example is given in Figure 1. The relationship between the MLL models andADMGs is analogous to that between ordinary log-linear models and undirected graphs:log-linear models give a very rich class of models to choose from, since their number growsdoubly-exponentially as the number of variables increases; undirected graphs provide anatural and more manageable subset of models with which to work (Darroch et al., 1980).The patterns of independence described by ADMGs arise naturally in the context ofgenerating processes in which not all variables are observed. To illustrate this, consider therandomized encouragement design carried out by McDonald et al. (1992) to investigatethe effect of computer reminders for doctors on take-up of influenza vaccinations, andconsequent morbidity in patients. The study involved 2,861 patients; here we focus on thefollowing fields: (Re) patient’s doctor sent a card asking to Re mind them about flu vaccine (randomized); (Va) patient Va ccinated against influenza; (Y) the endpoint: patient was not hospitalized with flu; (Ag) Ag e of patient: 0 = ‘65 and under’, 1 = ‘over 65’; (Co) patient has C hronic O bstructive Pulmonary Disease (COPD), as measured at base-line. The graphs in Figure 2 represent two possible data generating processes. Under bothstructures, whether or not a patient’s doctor received a reminder note is independentof the baseline variables age (Ag) and COPD status (Co), as would be expected underrandomization. Further the absence of an edge Re → Y encodes the assumption thatwhether or not a reminder (Re) was received only influences the final outcome (Y) viawhether or not a patient received a flu vaccination (Va). Both structures also assume thatthere are unobserved confounding factors between vaccination and COPD, and betweenCOPD and the final outcome. However, the graph in Figure 2(b) supposes that there is noadditional confounding between Va and Y. As a consequence the generating process givenin (b) implies the additional restriction that Re ⊥⊥ Y | Va , Ag. 
(We make no assumptionsabout the state spaces of the variables H, H and H , since these factors are unobserved.)2e Va(a) YAg CoH Re Va(b) YAg CoH H Figure 2: Two different generating processes for the flu vaccine encouragement design(red vertices are unobserved): both graphs imply Re ⊥⊥ Ag , Co; however (b) also impliesRe ⊥⊥ Y | Va , Ag.Re Va(a) YAg Co Re Va(b) YAg CoFigure 3: Two ADMGs representing the conditional independence restrictions on theobserved margin implied by the corresponding graphs in Figure 2.In Figure 3 we show the ADMGs corresponding to the generating processes in Figure2. These graphs only contain observed variables, but by including bidirected edges ( ↔ )they encode the same observable conditional independence relations; see § Richardson (2003) described local and global Markov properties for ADMGs, while Richard-son (2009) described a parametrization for discrete random variables via a collection of3onditional probabilities of the form P ( X H = 0 | X T = x T ). However, although Richard-son’s parametrization is simple, it does not naturally lead to parsimonious sub-models.In addition, the parameters are subject to variation dependence constraints, in the sensethat setting some parameters to particular values may restrict the valid range of otherparameters; this makes maximum likelihood fitting, for example, more challenging (Evansand Richardson, 2010). To illustrate this point, consider the graph G in Figure 1 as anexample; it encodes the model under which X ⊥⊥ X and X ⊥⊥ X | X . Richardson’sparametrization consists in this case (for binary random variables) of the probabilities P ( X = 0) P ( X = 0 | X = x ) P ( X = 0 , X = 0 | X = x ) P ( X = 0) P ( X = 0 | X = x ) P ( X = 0 , X = 0 | X = x , X = x )where x , x ∈ { , } . A disadvantage of this parametrization is that, for instance, thejoint probabilities P ( X = 0 , X = 0 | X = x ) are bounded above by the marginalprobabilities P ( X = 0 | X = x ). Consequently, from the point of view of parameterinterpretation, it makes little sense to consider the joint probabilities in isolation. Forexample, strong (conditional) correlation between X and X is present when the jointprobability is large relative to the marginals.However, replacing the joint probabilities P ( X = 0 , X = 0 | X = x ) with theconditional odds ratios P ( X = 0 , X = 0 | X = x ) · P ( X = 1 , X = 1 | X = x ) P ( X = 1 , X = 0 | X = x ) · P ( X = 0 , X = 1 | X = x ) , x ∈ { , } (and similarly for P ( X = 0 , X = 0 | X = x , X = x )) yields a variation independentparametrization, the odds ratio measuring dependence without reference to marginal dis-tributions. This means that if we wish to define a prior distribution over the univariateprobabilities and the odds ratios, we may, if appropriate, simply use a product of uni-variate distributions; similarly, to fit a generalized linear model with these parametersas joint responses, we need only use simple univariate link functions. We will see thatthis approach to discrete parametrizations can be generalized using marginal log-linearparameters.In Section 2 we introduce marginal log-linear (MLL) parameters and some of theirproperties, while Section 3 gives background theory about ADMGs and the parametriza-tion of Richardson (2009). The development of MLL parameters for ADMG models ispresented in Section 4, resulting in a parametrization we refer to as ingenuous (since itarises naturally, but ‘natural parametrization’ already has a particular meaning). 
We alsoshow that this parametrization can always be embedded in a larger one corresponding toa complete graph and the saturated model, where some of the parameters in the biggermodel are linearly constrained. In Section 5 we classify for which models the ingenuousparametrization is variation independent, since this can facilitate interpretation of the re-sulting coefficients. In Section 6 we discuss approaches to sparse modelling using MLLs inthe context of several additional datasets and a simulation. Longer proofs are in Section7. 4 Marginal Log-Linear Parameters
We consider collections of random variables ( X v ) v ∈ V with finite index set V , taking valuesin finite discrete probability spaces ( X v ) v ∈ V under a strictly positive probability measure P ; without loss of generality, X v = { , , . . . , | X v | − } . For A ⊆ V we let X A ≡ × v ∈ A ( X v ), X ≡ X V and similarly X A ≡ ( X v ) v ∈ A , X ≡ X V and x A ≡ ( x v ) v ∈ A , x ≡ x V . In addition˜ X is the subset of X which does not contain the last possible element in any co-ordinate;that is ˜ X v = { , , . . . , | X v | − } , and ˜ X = × v ∈ V ( ˜ X v ). We use p A ( x A ) ≡ P ( X A = x A ) and p A | B ( x A | x B ) ≡ P ( X A = x A | X B = x B ), for particular instantiations of x .Following Bergsma and Rudas (2002), we define a general class of parameters on dis-crete distributions. The definition relies upon abstract collections of subsets, so it may behelpful to the reader to keep in mind that the sets M i ∈ M are margins, or subsets, ofthe distribution over V , and each set L i is a collection of effects in the margin M i . A pair( L, M i ) corresponds to a log-linear interaction over the set L , within the margin M i . Definition 2.1.
For L ⊆ M ⊆ V , the pair ( L, M ) is an ordered pair of subsets of V . Let P be a collection of such pairs, and define M ≡ { M | ( L, M ) ∈ P for some L } , to be the collection of margins in P . If M = { M , . . . , M k } , write L i ≡ { L | ( L, M i ) ∈ P } , for the set of effects present in the margin M i . We say that the collection P is hierarchical if the ordering on M may be chosen so that if i < j , then M j (cid:42) M i and also L ∈ L j ⇒ L (cid:42) M i ; the second condition is equivalent to saying that each L is associated only withthe first margin M of which it is a subset. We say the collection is complete if everynon-empty subset of V is an element of precisely one set L i .The term ‘hierarchical’ is used because each log-linear interaction is defined in the firstpossible margin in an ascending class, and ‘complete’ because all interactions are present.Some authors (Rudas et al., 2010; Lupparelli et al., 2009) consider only collections whichare complete. Definition 2.2.
For each M ⊆ V and x M ∈ X M , define the functions λ ML ( x L ) by theidentity log p M ( x M ) ≡ (cid:88) L ⊆ M λ ML ( x L ) , subject to the identifiability constraint that for every ∅ (cid:54) = L ⊆ M , x L ∈ X L and v ∈ L , (cid:88) x v ∈ X v λ ML ( x L \{ v } , x v ) = 0;5hat is, the sum over the support of each variable is zero. We call λ ML ( x L ) a marginallog-linear parameter .Note that the constant λ M ∅ is determined by the values of the other parameters andthe fact that the probabilities p M ( x M ) sum to one. In the sequel we will always assumethat L is non-empty.The term ‘marginal log-linear parameter’ is coined by analogy with ordinary log-linearparameters, which correspond to the special case M = V . The following result providesan explicit expression for λ ML ( x L ). Lemma 2.3.
For L ⊆ M ⊆ V and x L ∈ X L we have λ ML ( x L ) = 1 | X M | (cid:88) y M ∈ X M log p M ( y M ) (cid:89) v ∈ L (cid:0) | X v | I { x v = y v } − (cid:1) . (1)This result is elementary, and its proof is omitted.For a collection of ordered pairs of subsets P (see Definition 2.1), we let˜Λ( P ) = { λ ML ( x L ) | ( L, M ) ∈ P , x L ∈ ˜ X L } be the collection of marginal log-linear parameters associated with P . Note that we avoidthe redundancy created by the identifiability constraint by only considering x L ∈ ˜ X L .The definition of a marginal log-linear parameter we give is equivalent to the recursiveone given in Bergsma and Rudas (2002); since both expositions are somewhat abstract, weinvite the reader to consult the examples below for additional intuition. In particular notethat for binary random variables, the product in (1) is always ±
1. Bergsma and Rudas(2002, Theorem 2) show that any collection ˜Λ( P ) where P is hierarchical and completesmoothly parametrizes the saturated model, that is, it parametrizes the set of all positivedistributions on X .The restriction that the parameters must sum to zero is required for identifiability,but different constraints can be used in its place. We might instead require that λ ML ( x L )be zero whenever any entry of x L is zero (or some other selected value); this is seen inMarchetti and Lupparelli (2011), for example, and its use would not substantially affectany of the results in this paper. We will write λ ML to mean the collection { λ ML ( x L ) | x L ∈ X L } ; the expression λ ML = 0denotes that we are setting all the parameters in this collection to 0. Example 2.4.
The classical log-linear parameters for a discrete distribution over a set ofvariables V are { λ VL | L ⊆ V } . Example 2.5.
Up to trivial transformations, the multivariate logistic parameters ofGlonek and McCullagh (1995) are { λ LL | L ⊆ V } .6 xample 2.6. Let V = { , , } and assume all random variables are binary. Write P ≡ P ( X = 0 , X = 0 , X = 1), and P ≡ P ( X = 1), etc. Then λ (0) = 12 log P P , which, up to a multiplicative constant, is the logit of the probability of the event { X = 0 } .Also, λ (0) = 14 log P P P P and λ (0 ,
0) = 14 log P P P P , the log odds product and log odds ratio between X and X respectively.If instead X is ternary, we obtain λ (0) = 13 log P P P ,λ (0) = 16 log P P P P P P and λ (0 ,
0) = 16 log P P P P P P . Here λ (0) contrasts the probability P ( X = 0) with the geometric mean of the proba-bilities P ( X = 1) and P ( X = 2). On the other hand, up to constants, λ (0 ,
0) is anaverage of the two log odds ratioslog P P P P log P P P P , and so gives a contrast between P ( X = X = 0) and other joint probabilities in a waywhich generalizes the binary log odds ratio and provides a measure of dependence; inparticular note that λ (0 ,
0) = 0 if X ⊥⊥ X .Here we have written, for example, 12 instead of { , } ; similarly, for sets A and B wesometimes write AB for A ∪ B , and aB for { a } ∪ B . The next result relates marginal log-linear parameters to conditional independences; it isfound as Lemma 1 in Rudas et al. (2010) and Equation (6) of Forcina et al. (2010).
Lemma 2.7.
For any disjoint sets A , B and C , where C may be empty, A ⊥⊥ B | C ifand only if λ ABCA (cid:48) B (cid:48) C (cid:48) = 0 for every ∅ (cid:54) = A (cid:48) ⊆ A, ∅ (cid:54) = B (cid:48) ⊆ B, C (cid:48) ⊆ C. The special case of C = ∅ (giving marginal independence) is proved in the context ofmultivariate logistic parameters by Kauermann (1997).7 xample 2.8. Take a complete and hierarchical parametrization of 3 variables, λ λ λ λ λ λ λ . Then we can force X ⊥⊥ X by setting λ = 0. Similarly X ⊥⊥ X | X corresponds tosetting λ = λ = 0.The following lemma shows that under conditional independence constraints, certainMLL parameters defined within different margins are equal. Lemma 2.9.
Suppose that A ⊥⊥ B | C , and A is non-empty. Then for any D ⊆ C , λ ABCAD ( x AD ) = λ ACAD ( x AD ) , for each x AD ∈ X AD . The proof of this result is found in Section 7.1.
We introduce basic graphical concepts used to describe the global Markov property andparametrization schemes.
Definition 3.1. A directed mixed graph G consists of a set of vertices V , and both directed( → ) and bidirected ( ↔ ) edges. Edges of the same type and orientation may not berepeated, but there may be multiple edges of different types between a pair of vertices.A path in G is a sequence of adjacent edges, without repetition of a vertex; a pathmay be empty, or equivalently consist of only one vertex. The first and last vertices ona path are the endpoints (these are not distinct if the path is empty); other vertices onthe path are non-endpoints . The graph G in Figure 1, for example, contains the path1 → → ↔
3, with endpoints 1 and 3, and non-endpoints 2 and 4. A directed path isone in which all the edges are directed ( → ) and are oriented in the same direction, whereasa bidirected path consists entirely of bidirected edges.A directed cycle is a non-empty sequence of edges of the form v → · · · → v . An acyclic directed mixed graph (ADMG) is one which contains no directed cycles. Definition 3.2.
For a graph G and a subset of its vertices A ⊆ V , we denote by G A the induced subgraph formed by A ; that is, the graph containing the vertices A , and the edgesin G whose endpoints are both in A . Definition 3.3.
Let a and d be vertices in a mixed graph G . If a = d , or there is a directedpath from a to d , we say that a is an ancestor of d , and that d is a descendant of a . Thesets of ancestors of d and descendants of a are denoted an G ( d ) and de G ( a ) respectively. Ifthere is a directed path from a to d containing precisely one edge ( a → d ) then a is calleda parent of d ; the set of vertices which are parents of d is written pa G ( d ).8he district of a , denoted dis G ( a ), is the set containing a and all vertices which areconnected to a by a bidirected path. These definitions are applied disjunctively to sets ofvertices, so that, for example,pa G ( W ) ≡ (cid:91) w ∈ W pa G ( w ) , dis G ( W ) ≡ (cid:91) w ∈ W dis G ( w ) . A set of vertices A is ancestral if A = an G ( A ); that is, A contains all its own ancestors. Example 3.4.
Consider the graph G in Figure 1. We havean G (4) = { , , } an G ( { , } ) = { , , } . The district of 3 is the set { , , } , and since 3 has no parents, pa G (3) = ∅ .Note that by the definitions of some authors, vertices are not their own ancestors(Lauritzen, 1996). The above notations may be shortened on induced subgraphs so thatpa A ≡ pa G A , and similarly for other definitions. In some cases where the meaning is clear,we will dispense with the subscript altogether.We use the now standard notation of Dawid (1979), and represent the statement ‘ X is independent of Y given Z under a probability measure P ’, for random variables X , Y and Z , by X ⊥⊥ Y | Z [ P ]. If P is unambiguous, this part is dropped, and if Z is emptywe write simply X ⊥⊥ Y . Finally, we abuse notation in the usual way: v and X v are usedinterchangeably as both a vertex and a random variable; likewise A denotes both a vertexset and X A . A Markov property associates a particular set of independence relations with a graph.A non-endpoint vertex c on a path is a collider on the path if the edges preceding andsucceeding c on the path have an arrowhead at c , for example → c ← or ↔ c ← ; otherwise c is a non-collider . A path between vertices a and b in a mixed graph is said to be blockedgiven a set C if either(i) there is a non-collider on the path in C , or(ii) there is a collider on the path which is not in an G ( C ).If all paths from a to b are blocked by C , then a and b are said to be m-separated given C . Sets A and B are said to be m-separated given C if every a ∈ A and every b ∈ B arem-separated given C . This naturally extends the d-separation criterion of Pearl (1988) tographs with bidirected edges.A probability measure P on X is said to satisfy the global Markov property for G if forevery triple of disjoint sets of vertices A , B and C , A is m-separated from B given C in G = ⇒ X A ⊥⊥ X B | X C [ P ].9he model associated with an ADMG G is simply the set of distributions that obey theglobal Markov property for G . Proposition 3.5.
If a path m-connects x and y given Z in G then every vertex on thepath is in an G ( { x, y } ∪ Z ) .Proof. This follows from the definition of m-connection.
Example 3.6.
Consider the graph G in Figure 1. There are two paths between thevertices 1 and 4, π : 1 → → π : 1 → ↔ ↔ C = { } . π is blocked because 2 is a non-collider on the path and isin C , while π is blocked because 3 is a collider on the path and is not in an G ( C ) = { , } .Hence { } and { } are m-separated given { } in G .One can similarly see that { } and { } are m-separated given C = ∅ , and that no otherm-separations hold for this graph. Thus a joint distribution P obeys the global Markovproperty for G if and only if X ⊥⊥ X | X [ P ] and X ⊥⊥ X [ P ].By similar arguments the independences associated with the ADMGs in Figure 2 mayalso be read off. This subsection defines the parameters of Richardson (2009) for multivariate discrete dis-tributions satisfying the global Markov property for an ADMG.
Definition 3.7.
Let G be an ADMG with vertex set V . We say that a collection ofvertices W ⊆ V is barren if for each v ∈ W , we have W ∩ de G ( v ) = { v } ; in other words v has no non-trivial descendants in W . For an arbitrary set of vertices U , the maximalsubset with no non-trivial descendants in U is denoted barren G ( U ).A head is a collection of vertices H which is connected by bidirected paths in G an( H ) and is barren in G . We write H ( G ) for the collection of heads in G . The tail of a head H is the set tail G ( H ) ≡ pa G (dis an( H ) ( H )) ∪ (dis an( H ) ( H ) \ H ) . Thus the tail of H is the set of vertices in V \ H connected to a vertex in H by a path onwhich every vertex is a collider and an ancestor of a vertex in H . We typically write T for a tail, provided it is clear which head it belongs to. Proposition 3.8.
Let H be a head. Then (i) H = barren G ( H ∪ tail G ( H )) ; (ii) tail G ( H ) ⊆ an G ( H ) . roof. Immediate from the respective definitions.Richardson (2009) shows that discrete distributions obeying the global Markov prop-erty for an ADMG G are parametrized by the conditional probabilities: (cid:110) P ( X H = x H | X T = x T ) (cid:12)(cid:12)(cid:12) H ∈ H , T = tail G ( H ) , x H ∈ ˜ X H , x T ∈ X T (cid:111) . This is achieved via factorizations based on head-tail pairs; let ≺ be the partial orderingon heads such that H i ≺ H j if H i ⊂ an G ( H j ) and H i (cid:54) = H j . This is well defined, sinceotherwise G would contain a directed cycle. Then let [ · ] G be a function which partitionssets of vertices into heads by repeatedly removing heads which are maximal under ≺ .Then P satisfies the global Markov property for G if and only if it obeys the factoriza-tions P ( X A = x A ) = (cid:89) H ∈ [ A ] G P ( X H = x H | X T = x T ) (2)for ancestral sets of vertices A ; see Richardson (2009) for details. In the case of a directedacyclic graph (DAG), this corresponds to the probability distribution of each vertex con-ditional on its parents. Example 3.9.
Consider again the ADMG G in Figure 1; its head-tail pairs ( H, T ) are(1 , ∅ ), (2 , , ∅ ), (23 , ,
2) and (34 , G are therefore parametrized by p (0) p | (0 | x ) p (0) p | (0 , , | x ) p | (0 | x ) p | (0 , | x , x ) , for x , x ∈ { , } , as mentioned in the Introduction. Given a discrete model defined by a set of conditional independence constraints, it isnatural to consider it as a sub-model of the saturated model, which contains all positiveprobability distributions. In a setting where the model is graphical, it becomes equallynatural to think of the graph as a subgraph of a complete graph, by which we meana graph containing at least one edge between every pair of vertices. We can obtain acomplete graph from an incomplete one by inserting edges between each pair of verticeswhich lack one, but this leaves a choice of edge type and orientation. These choices mayaffect how much of the structure and spirit of the original graph is retained; we will requirethat a completion preserves the heads of the original graph, which helps to preserve thestructure of the parametrization.
Definition 3.10.
Given an ADMG G and a supergraph ¯ G , we call ¯ G a head-preservingcompletion of G if ¯ G is complete, and H ( G ) ⊆ H ( ¯ G ).11 2 3 4Figure 4: A head-preserving completion, ¯ G , of the ADMG in Figure 1.It is easy to see that a head-preserving completion always exists; for example, if weadd in a bidirected edge between every pair of vertices which are not joined by an edge,then it is clear that barren sets in G will remain barren in ¯ G , and bidirected connectedsets in G will remain bidirected connected in ¯ G .Note that it is not necessary for every pair of vertices to be joined by an edge inorder for a graph to represent the saturated model, however we will require this for ourcompletions. Example 3.11.
Figure 4 shows a head-preserving completion of the ADMG in Figure 1.
Proposition 3.12. If ¯ G is a head-preserving completion of G then an G ( v ) ⊆ an ¯ G ( v ) . Inparticular, if a set A is ancestral in ¯ G then A is also ancestral in G .Proof. This follows because G contains a subset of the edges in ¯ G . We now use the marginal log-linear parameters defined in Section 2 to parametrize theADMG models discussed in Section 3.
Definition 4.1.
Consider an ADMG G with head-tail pairs ( H i , T i ) over some index i ,and let M i = H i ∪ T i . Further, let L i = { A | H i ⊆ A ⊆ H i ∪ T i } . This collection of marginsand associated effects is the ingenuous parametrization of G , denoted P ing ( G ). Example 4.2.
We return again to the ADMG G in Figure 1; the head-tail pairs are(1 , ∅ ), (2 , , ∅ ), (23 , ,
2) and (34 , L Lemma 4.3.
For any ADMG G , there is an ordering on the margins M i of the ingenuousparametrization P ing ( G ) which is hierarchical.Proof. Firstly we show that for distinct heads H i and H j , the collections L i and L j aredisjoint. To see this, assume for a contradiction that there exists A such that H i ⊆ A ⊆ H i ∪ T i and H j ⊆ A ⊆ H j ∪ T j . Since H i (cid:54) = H j , assume without loss of generality thatthere exists v ∈ H i ∩ H cj ⊆ A .Then v ∈ H j ∪ T j implies that v ∈ T j , and thus there is a directed path from v tosome w ∈ H j . Now, w / ∈ H i , since v, w ∈ H i would imply that H i is not barren. But if w ∈ H j ∩ H ci , then by the same argument as above we can find a directed path from w tosome x ∈ H i . Then v → · · · → w → · · · → x is a directed path between elements of H i ,which is a contradiction. Thus L i and L j are disjoint.Now, consider the partial ordering ≺ of heads defined in Section 3.2: H i ≺ H j whenever H i ⊂ an G ( H j ) and H i (cid:54) = H j . Any total ordering which respects this partial ordering ishierarchical, because each set A ∈ L i is a subset of the ancestors of H i .We proceed to show that the ingenuous parameters for an ADMG G characterize theset of distributions which obey the global Markov property with respect to G . Lemma 4.4.
For any sets M and L ⊆ M , the collection of MLL parameters { λ MA ( x A ) | L ⊆ A ⊆ M, x M ∈ ˜ X M } , together with the ( | L | − -dimensional marginal distributions of X L conditional on X M \ L ,smoothly parametrizes the distribution of X L conditional on X M \ L . A proof is given in Section 7.2.We now come to the main result of this section.
Theorem 4.5.
The ingenuous parametrization ˜Λ( P ing ( G )) of an ADMG G parametrizesprecisely those distributions P obeying the global Markov property with respect to G . roof. We proceed by induction. Again we use the partial ordering ≺ on heads fromSection 3.2. For the base case, we know that singleton heads { h } with empty tails areparametrized by the logits λ hh .Now, suppose that we wish to find the distribution of a head H conditional on its tail T . Assume that we have the distribution of all heads H (cid:48) which precede H , conditional ontheir respective tails; we claim this is sufficient to give the ( | H | − H conditional on T .Let v ∈ H , and let C = H \{ v } be a ( | H |− A = an G ( H ) \ { v } is ancestral, since v cannot have (non-trivial) descendants in an G ( H );in particular C ∪ T ⊆ A . Theorem 4 of Richardson (2009) states that the factorization inequation (2) holds for every ancestral set, so p A ( x A ) = (cid:89) H (cid:48) ∈ [ A ] G T (cid:48) =tail( H ) p H (cid:48) | T (cid:48) ( x H (cid:48) | x T (cid:48) ) . But all the probabilities in the product are known by our induction hypothesis, and themarginal distribution of C conditional on T is given by the distribution of A .The ingenuous parametrization, by definition, contains λ H ∪ TA for H ⊆ A ⊆ H ∪ T , andthus the result follows from Lemma 4.4. Example 4.6.
Returning to our running example, the graph G in Figure 1 correspondsto the model (cid:110) P (cid:12)(cid:12)(cid:12) X ⊥⊥ X | X [ P ] and X ⊥⊥ X [ P ] (cid:111) . Theorem 4.5 tells us that this collection of distributions is precisely characterized by theingenuous parameters for G , λ λ λ λ λ λ λ λ λ λ λ λ . The results above show that the ingenuous parameters for an ADMG G , like Richardson’sparameters, provide precisely the information required to reconstruct a distribution obey-ing the global Markov property for G . However, it is difficult to use this parametrization inpractice unless we can evaluate the likelihood, which requires us to make explicit the mapwhich we have implicitly defined from the ingenuous parameters to the joint probabilitydistribution under the model. For example, for the parameters in Richardson (2009) thereis an explicit map from the parameters back to the joint distribution using a generalizationof M¨obius inversion. This was used by Evans and Richardson (2010) to fit these modelsvia maximum likelihood. In contrast, the map from ingenuous parameters to the jointdistribution cannot be written in closed form.14n alternative approach is to consider the ingenuous parametrization as part of alarger, complete parametrization of the saturated model, such that the additional param-eters are constrained to be zero under the sub-model defined by G . This enables us to fitthe model using Lagrange-type algorithms, as in Evans and Forcina (2011). Theorem 4.7.
Let G be an ADMG, and ¯ G a head-preserving completion of G . Theingenuous parametrization of G corresponds to setting λ ML = 0 for ( L, M ) ∈ P ing ( ¯ G ) whenever L does not appear as an effect in P ing ( G ) . In particular,these constraints define the set of distributions which satisfy the global Markov propertywith respect to G . The proof of this result is found in Section 7.3
Example 4.8.
Consider again the ADMG G in Figure 1; a possible head-preservingcompletion ¯ G (shown in Figure 4) is obtained by adding the edges 1 → →
4. Theingenuous parametrization for ¯ G is M L P ing ( ¯ G ) but not in P ing ( G ) are 13, 14, and 124, and indeed thesub-model defined by G corresponds to setting λ = λ = λ = 0;under this model the following equalities hold by Lemma 2.9: λ = λ λ = λ . Removing the zero parameters in P ing ( ¯ G ) and renaming two others according to the aboveequations returns us to the ingenuous parametrization of G .Theorem 4.7 shows that we can fit the model defined by G by maximum likelihoodsimply by maximizing the log-likelihood subject to λ = λ = λ = 0. In particular,this approach always provides a list of independent constraints which characterize themodel.An obvious question which arises is whether any completion of a graph will lead to acomplete parametrization with the property of Theorem 4.7. We can obtain a counterex-ample by considering the complete graph ˜ G in Figure 5, which has ingenuous parametriza-tion 15 2 3 4Figure 5: A complete ADMG, ˜ G , of which G is a subgraph, but whose ingenuousparametrization does not contain the model described by G as a linear sub-space be-cause the associated completion is not head-preserving. M L G in Figure 1 is a subgraph of ˜ G , and corresponds to the model obtained bysetting λ = λ = λ = 0; however, these last two parameters do not appear in theingenuous parametrization of ˜ G , and so there is no way to enforce the sub-model as alinear constraint.˜ G is, of course, not head-preserving. Such completions may still lead to parametriza-tions which satisfy the property of Theorem 4.7: for example, if the edge 1 → { , , } , but the sub-model correspondsto λ = 0, which is a parameter in the complete graph. Rudas et al. (2010) parametrize chain graph models of multivariate regression type, alsoknown as type IV chain graph models, using marginal log-linear parameters. Type IVchain graph models are a special case of ADMG models, in the sense that by replacingthe undirected edges in a type IV chain graph with bidirected edges, the global Markovproperty on the resulting ADMG is equivalent to the Markov property for the chain graph(see Drton, 2009). The graphs in Figure 6 are examples of Type IV models. However,there are models in the class of ADMGs which do not correspond to any chain graph, suchas the one described by G in Figure 1.The parametrization of Rudas et al. (2010) uses different choices of margins to theingenuous parametrization, though their parameters can be shown to be equal to theparameters considered here under the global Markov property, using Lemma 2.9. Thusthe variation dependence properties of that parametrization are identical to those of theingenuous parametrization (see next section). Forcina et al. (2010) provide an algorithmwhich gives a range of ‘admissible’ margins in which collections of conditional independence16onstraints may be defined.Marchetti and Lupparelli (2011) also parametrize type IV chain graph models in asimilar manner to Rudas et al. (2010), in that case using multivariate logistic contrasts. As discussed in the introduction, the interpretation of parameters and the construction ofprior distributions is simpler when parameters are variation independent.
Definition 5.1.
Let θ i , for i = 1 , . . . , k be a collection of parameters such that θ i takesall values in the set Θ i . We say that the vector θ = ( θ , . . . , θ k ) is variation independent if θ can take every value in the set Θ × · · · × Θ k .Bergsma and Rudas (2002) characterize precisely which hierarchical and completeparametrizations are variation independent, using a notion they call ordered decompos-ability. We now do this for ingenuous parametrizations. Definition 5.2.
A collection of sets M = { M , . . . , M k } is incomparable if M i (cid:42) M j forevery i (cid:54) = j .A collection M of incomparable subsets of V is decomposable if it has at most twoelements, or there is an ordering M , . . . , M k on the elements of M wherein for each i = 3 , . . . , k , there exists j i < i such that (cid:32) i − (cid:91) l =1 M l (cid:33) ∩ M i = M j i ∩ M i . This is also known as the running intersection property .A collection M of (possibly comparable) subsets is ordered decomposable if it has atmost two elements, or there is an ordering M , . . . , M k such that M i (cid:42) M j for i > j , andfor each i = 3 , . . . , k , the inclusion maximal elements of { M , . . . , M i } form a decomposablecollection. We say that a collection P of parameters is ordered decomposable if there isan ordering on the margins M which is both hierarchical and ordered decomposable.The following example is found in Bergsma and Rudas (2002). Example 5.3.
Let M = { , , , } . In order to have a hierarchical ordering of thesemargins it is clear that the set 123 must come last, but there is no way to order the col-lection of inclusion maximal margins { , , } such that it has the running intersectionproperty. Thus M is not ordered decomposable.The next result links variation independence to ordered decomposability. Theorem 5.4 (Bergsma and Rudas (2002), Theorem 4) . Let P be a parametrization whichis hierarchical and complete. Then the parameters ˜Λ( P ) are variation independent if andonly if P is ordered decomposable.
17 2 3(a) 1 2 3(b)0 1 2 3 4(c)Figure 6: (a) a graph with a variation dependent ingenuous parametrization; (b) a Markovequivalent graph to (a) with a variation independent ingenuous parametrization; (c) agraph with no variation independent MLL parametrization.As previously noted, the ingenuous parametrization is not complete in general, and sowe cannot apply the above result directly to characterize its variation dependence. How-ever, by constructing complete parametrizations of which the ingenuous parametrizationsare linear sub-models, we can obtain the following.
Theorem 5.5.
The ingenuous parametrization for an ADMG G is variation independentif and only if G contains no heads of size greater than or equal to 3. The proof of this result is found in Section 7.4.
Example 5.6.
The graph G in Figure 1 has maximum head size 2, and therefore theassociated ingenuous parametrization is variation independent.Likewise the graphs in Figure 3(a) and (b) contain no heads of size greater than 2,so that the resulting ingenuous parameters are variation independent. Note that this wasnot true of the parameters given by Richardson (2009). Example 5.7.
The bidirected 3-chain shown in Figure 6(a) has the head 123, and there-fore its ingenuous parametrization is variation dependent. This can easily be seen directly:in the binary case, for example, if the parameters λ (0) and λ (0) are chosen to be verylarge, this induces very strong dependence between the variables X and X , and be-tween X and X respectively. If these correlations are chosen to be too large, then it isimpossible for X and X to be marginally independent, which is implied by the graph.Observe that we could use the Markov equivalent graph in Figure 6(b), which has noheads of size 3, and thus obtain a variation independent parametrization of the same model.However, if we add incident arrows as shown in Figure 6(c), we obtain a graph where such atrick is not possible. In fact this third graph has no variation independent parametrizationin the Bergsma and Rudas framework, since it requires λ = λ = λ = 0, andthese margins cannot be ordered in a way which satisfies the running intersection property18 234Figure 7: A bidirected 4-cycle.(see Example 5.3).In general, it would be sensible for a statistician concerned about variation dependenceto choose a graph from the Markov equivalence class created by their model which has thesmallest possible maximum head size. This could be achieved by reducing the number ofbidirected edges in the graph, where possible; see, for example, Ali et al. (2005) and Drtonand Richardson (2008b) for algorithms for finding the graph with the minimal number ofarrowheads in a given Markov equivalence class. Example 5.8.
The bidirected 4-cycle, shown in Figure 7, contains a head of size 4, andso its ingenuous parametrization is variation dependent. However, there is a marginallog-linear parametrization of this model which is ordered decomposable, and thereforevariation independent. The 4-cycle is precisely the model with X ⊥⊥ X and X ⊥⊥ X .Set M = { , , } , with L = { , , } L = { , , } L = P ( { , , , } ) \ ( L ∪ L );here P ( A ) denotes the power set of A . This gives a hierarchical, complete and ordereddecomposable parametrization, so the parameters are variation independent. The 4-cyclecorresponds exactly to setting λ = λ = 0, and it follows that the remaining parametersare still variation independent under this constraint.This approach to parametrization, which considers disconnected sets, is discussed indetail by Lupparelli et al. (2009). It produces a variation independent parametrizationfor graphs where the disconnected sets do not overlap, and may well be preferable tothe ingenuous parametrization in these cases. In sparser graphs however, it does notseem as useful; as mentioned above, some graphs have no variation independent MLLparametrization. 19 Parsimonious Modelling with Marginal Log-Linear Pa-rameters
The number of parameters in a model associated with a sparse graph containing bidirectededges can, in certain cases, be relatively large. In a purely bidirected graph, the parametercount depends upon the number of connected sets of vertices; in the case of a chainof bidirected edges such as that shown in Figure 11(a), this means that the number ofparameters grows quadratically in the length of the chain.The parametrization of Richardson (2009), and its special case for purely bidirectedgraphs (see Drton and Richardson, 2008a) does not present us with any obvious method ofreducing the parameter count whilst preserving the conditional independence structure.In contrast, there are well established methods for sparse modelling with other classesof graphical models. In the case of an undirected graph with binary random variables,restricting to one parameter for each vertex and each edge leads to a Boltzmann Machine(Ackley et al., 1985). Rudas et al. (2006) use marginal log-linear parameters to providea sparse parametrization of a DAG model, again restricting to one parameter for eachvertex and edge.As we will see from the following examples, the ingenuous parametrization allowsus to fit graphical models with a large number of parameters, and then remove higher-order interactions to obtain a more parsimonious model whilst preserving the conditionalindependence structure of the original graph.
We first return to the McDonald et al. (1992) study considered in the Introduction. Allvariables are binary, and (excepting Age) are coded as 0 = false, 1 = true; we add con-straints to our model sequentially, recording the results in the analysis of deviance Table1. The ADMG in Figure 3(a) represents the constraint Ag , Co ⊥⊥ Re; it fits well, having adeviance of 2.54 on 3 degrees of freedom. The smaller model for 3(b) encodesAg , Co ⊥⊥ Re Y ⊥⊥ Re | Va , Ag;note that these precise independences cannot be represented by a DAG or chain graph (ofany of the types considered by Drton (2009)). It also fits well (deviance 7.66 on 7 d.f.), sowe may prefer it on the grounds of simplicity.The ingenuous parametrization in this case contains some higher order effects, includ-ing the 5-way interaction between all variables. Setting λ ML = 0 for | L | ≥ p = 0 . onstraint Figure Add. Dev. d.f. Total Dev. Ag , Co ⊥⊥ Re 3(a) 2.54 3 2.54Y ⊥⊥ Re | Va , Ag 3(b) 5.11 7 7.66no 4- and 5-way params 2.22 12 9.88no 3-way params 8.39 19 18.28Table 1: Analysis of deviance table of models considered for influenza data. Constraintsare added sequentially from top to bottom; the last three columns give the additionaldeviance for the constraint, the total degrees of freedom and the total deviance of themodels respectively. S E S E (a) S E S E (b)Figure 8: Graphs for the twins data for models corresponding to (a) a common gene and(b) separate genes affecting the prevalence of frozen shoulder and tennis elbow. Hakim et al. (2003) investigate genetic effects on the presence or absence of two soft tissuedisorders, frozen shoulder and tennis elbow, based on a study in pairs of monozygotic anddizygotic twins; the data are reproduced in Ekholm et al. (2012). We have count data fora 5-way contingency table over the variables S i and E i , indicators of whether twin i inthe pair suffers from frozen shoulder and tennis elbow respectively, i ∈ { , } , and T , anindicator of whether the pair are monozygotic or dizygotic twins. There are a total of 866observations for monozygotic pairs, and 963 for dizygotic pairs; twin 1 corresponds to thetwin who was born first.We first fitted the model T ⊥⊥ ( S , S , E , E ) to test whether the zygosity of the twinshas any effect on the other variables; we obtained a deviance of 16.4 on 15 degrees offreedom, suggesting that there is no evidence that T is related to the other variables.Note that this contradicts the conclusions of Ekholm et al. (2012), but they use additionalassumptions to obtain more powerful tests.Collapsing to a 4-way table over ( S , S , E , E ), we consider the complete bidirectedmodel in Figure 8(a). A further simplifying assumption is to impose symmetry between thetwins in each pair, on the basis that we do not expect any association between the preva-lence of the disorders and which twin was born first. Using the ingenuous parametrization21or the graph in Figure 8(a), which is itself symmetric with respect to the individual twins,this amounts to six independent linear constraints, and gives a deviance of 0.59 comparedto the saturated model on four variables; there is therefore no evidence to reject symmetry.Now, a hypothesis of interest is whether a common gene is responsible for the increasedrisk of the two disorders, or the genetic effects are separate and independent. 
In the lattercase we would expect the data to be explained by the model encoded by the graph inFigure 8(b), and therefore to observe the marginal independences E ⊥⊥ S and E ⊥⊥ S (see Drton and Richardson, 2008a, for more details). This amounts to the constraint λ E S E S = λ E S E S = 0; the first equality already holds by symmetry, so only one additionalconstraint is imposed.This model has a deviance of 8.41 on 7 degrees of freedom, which is not rejected in alikelihood ratio test with the saturated model ( p = 0 . The Netherlands Kinship Panel Survey (NKPS) is an ongoing study which collects lon-gitudinal information on several thousand Dutch individuals and their families (Dykstraet al., 2005, 2007). One question asked of both the primary respondents ( anchors ) andtheir partners is “How is your health in general?”, with possible responses of ‘excellent’,‘good’, ‘good nor poor’, ‘poor’ and ‘very poor’. We combined ‘good nor poor’, ‘poor’ and‘very poor’ into one category to avoid small counts.Two waves of data are currently available, from 2002–04 and 2006–07. We only consid-ered anchors who had the same partner in both waves, and such that both the individualand the partner answered the health question in both waves. Let A i and P i denote theresponse of the anchor and partner respectively for wave i ∈ { , } . In total there are n = 2 ,
318 data points, classified into a 3 × × × λ A P A P A P (1 ,
0) = λ A P A P A P (0 , . This model has a deviance of 89.98, which when compared to the tail of a χ distributiongives p = 1 . × − ; thus the symmetry model is a poor fit to the data, and is rejected. Thelack of exchangeability is probably due to selection bias in the sampling of the anchors, aswell as the different ways in which the anchors and their partners were asked the question:22 P A P (a) A P A P (b)Figure 9: Graphs for the NKPS data; responses of A nchor and P artner regarding theirassessment of health; subscripts indicate time. (a) a complete graph; (b) a subgraph whichimplies P ⊥⊥ A | P .anchors were asked about their health as part of a face-to-face interview, whereas thepartners were only asked to complete a survey. See Siemiatycki (1979) for an analysis ofdifferences resulting from survey mode.If instead we remove the edge A → P and fit the graph in Figure 9(b), we obtainan explanation of the data which is not rejected at the 5% level (deviance 19.09 on 12degrees of freedom, p = 0 . P ⊥⊥ A | P . This graph is the only subgraph of the complete graph in Figure 9(a) whichleads to a good fit; in particular the model created by removing the edge P → A isstrongly rejected, which is one manifestation of the asymmetry between individuals andtheir partners.Note that we could also have obtained the independence P ⊥⊥ A | P , for instance, byusing a DAG with topological ordering P , A , P , A , but the resulting parametrizationwould have made it much more difficult to enforce the symmetry constraint tested above. Drton and Richardson (2008a) examine responses to seven questions relating to trustand social institutions, taken from the US General Social Survey between 1975 and 1994.Briefly, the seven questions were:
Trust.
Can most people be trusted?
Helpful.
Do you think most people are usually helpful?
MemUn, MemCh.
Are you a member of a labour union / church?
ConLegis, ConClerg, ConBus.
Do you have confidence in congress / organized reli-gion / business?In that paper, the model given by the graph in Figure 10 is shown to adequately explainthe data, having a deviance of 32.67 on 26 degrees of freedom, when compared with the23rustHelpfulMemChMemUn ConClergConBus ConLegisFigure 10: Markov model for trust data given in Drton and Richardson (2008a).saturated model. The authors also provide an undirected graphical model which has onemore edge than the graph in Figure 10, and yet has 62 fewer parameters. It too gives agood fit to the data, having a deviance of 87.62 on 88 degrees of freedom. Both graphswere chosen by backwards stepwise selection methods; see Drton and Richardson (2008a)for details.For practical and theoretical reasons, the bidirected model may be preferred to theundirected one, even though the latter appears to be much more parsimonious. One mayconsider the dependence between the responses given to a questionnaire to be manifesta-tions of unmeasured characteristics of the respondent, such as their political beliefs. Sucha system can be well represented by a bidirected graph, through its marginal independencestructure and connection to latent variable models, but not necessarily by an undirectedone, which induces conditional independences. Note that, since models defined by undi-rected and bidirected graphs are not nested, there is no a priori reason to expect the twomethods to give a similar graphical structure.The greater parsimony of the undirected model (when defined purely by conditionalindependences) is due to its hierarchical nature: if we remove an edge between two vertices a and b , then this corresponds to requiring that λ VA = 0 for every effect A containing both a and b . Removing that edge in a bidirected model may correspond merely to setting λ abab = 0 and nothing else, depending upon the other edges present. Using the ingenuousparametrization, it is easy to constrain additional higher order terms to be zero to obtainsub-models of the set of distributions obeying the global Markov property.Starting with the model in Figure 10 and fixing the 4-, 5-, 6- and 7-way interactionterms to be zero increases the deviance to 84.18 on 81 degrees of freedom; none of the 4-way interaction parameters was found to be significant on its own. Furthermore, removing21 of the remaining 25 three-way interaction terms increases the deviance to 111.48 on102 degrees of freedom; using an asymptotic χ approximation gives a p-value of 0.245,so this model is not contradicted by the data. The only parameters retained are theone-dimensional marginal probabilities, the two-way interactions corresponding to edges24 2 3 · · · k (a)1 2 3 · · · kh h h h k − (b)Figure 11: (a) A bidirected k -chain and (b) a DAG with latent variables ( h , . . . , h k − )generating the same observable conditional independence structure.in Figure 10, and the following three-way interactions:MemUn , ConClerg , ConBus Helpful , MemUn , MemChTrust , ConLegis , ConBus MemCh , ConClerg , ConBus . This model retains the marginal independence structure of Drton and Richardson’s model,but provides a good fit with only 25 parameters, rather than the original 101.A similar analysis, for different data, is performed by Lupparelli et al. (2009, page 573);again they find an undirected graphical model to be much more parsimonious than anybidirected one, but obtain comparable fits by removing statistically insignificant higher-order parameters.
We saw in the earlier examples that we were often able to remove higher order interactionparameters without compromising the goodness of fit. Here we explore this phenomenafurther via simulations.Consider the DAG with latent variables shown in Figure 11(b); over the observed vari-ables, the conditional independences which hold are exactly those given by the bidirectedchain in Figure 11(a).We randomly generated 1,000 distributions from this DAG model with k = 6, whereeach latent variable was given three states, and each observed variable two. The probabilityof each observed variable being zero, conditional on each state of its parents, was anindependent uniform random draw on (0 , o 5−way Parameters Increase in Deviance D en s i t y . . . . . . No 4−way Parameters
Increase in Deviance D en s i t y . . . . . . . No 3−way Parameters
Increase in Deviance D en s i t y . . . . . . . . Figure 12: Histograms showing the increase in deviance caused by setting to zero (a) the5- and 6-way interaction parameters; (b) the 4-, 5- and 6-way interaction parameters; (c)the 3-, 4-, 5- and 6-way interaction parameters. Plots are based on 1 ,
000 datasets, each ofsize 10 , χ with3, 6 and 10 degrees of freedom respectively.The histogram in Figure 12(a) demonstrates that the deviance increase from settingthe 5- and 6-way interaction parameters to zero (a total of three parameters) was notdistinguishable from that which would be observed under the null hypothesis that theseparameters are zero. The deviance increase from setting the 4-, 5- and 6-way interactionsto zero appeared to have only a slightly heavier tail than the associated χ -distribution,as suggested by the outliers in Figure 12(b). Removing the 3-way interactions in additionto this caused a dramatic increase in the deviance, as may be observed from the heavy tailof the histogram in Figure 12(c). This illustrates that the ingenuous parametrization canbe used to produce more parsimonious model descriptions than would be possible usingRichardson’s parameters.Note that under the process which generated these models, each of these interactionparameters was non-zero almost surely. As the sample size increases the power of alikelihood ratio test for a fixed distribution tends to one, so it must be the case that asimulation such as the above would, for large enough data sets, show significant deviationfrom the associated χ distributions. However, even at a fairly large sample size of 10,000,a limited effect was observed in Figures 12(a) and (b), and the examples above with realdata suggest that higher order interactions are often not particularly useful in practice fordescribing data. 26 Proofs
Proof of Lemma 2.9.
Using the independence, we have p ABC ( x ABC ) = p AC ( x AC ) · p B | C ( x B | x C ) . Thus applying Lemma 2.3, λ ABCAD ( x AD ) = 1 | X ABC | (cid:88) y ABC ∈ X ABC (log p AC ( y AC ) + log p B | C ( y B | y C )) (cid:89) v ∈ A ∪ D (cid:0) | X v | I { x v = y v } − (cid:1) . We can split this sum into terms involving p AC ( y AC ) and those involving p B | C ( y B | y C ).For the first of these,1 | X ABC | (cid:88) y ABC ∈ X ABC log p AC ( y AC ) (cid:89) v ∈ A ∪ D (cid:0) | X v | I { x v = y v } − (cid:1) = 1 | X AC | · | X B | (cid:88) y B ∈ X B (cid:88) y AC ∈ X AC log p AC ( y AC ) (cid:89) v ∈ A ∪ D (cid:0) | X v | I { x v = y v } − (cid:1) = 1 | X AC | (cid:88) y AC ∈ X AC log p AC ( y AC ) (cid:89) v ∈ A ∪ D (cid:0) | X v | I { x v = y v } − (cid:1) = λ ACAD ( x AC ) , because the summand has no dependence on y B . For the latter,1 | X ABC | (cid:88) y ABC ∈ X ABC log p B | C ( y B | y C ) (cid:89) v ∈ A ∪ D (cid:0) | X v | I { x v = y v } − (cid:1) = 1 | X ABC | (cid:88) y BC ∈ X BC log p B | C ( y B | y C ) (cid:88) y A ∈ X A (cid:89) v ∈ A ∪ D (cid:0) | X v | I { x v = y v } − (cid:1) . Now for any w ∈ A , the inner part of this term is (cid:88) y A ∈ X A (cid:89) v ∈ A ∪ D (cid:0) | X v | I { x v = y v } − (cid:1) = (cid:88) y A \{ w } (cid:88) y w (cid:89) v ∈ A ∪ D (cid:0) | X v | I { x v = y v } − (cid:1) = (cid:88) y A \{ w } (cid:89) v ∈ ( A ∪ D ) \{ w } (cid:0) | X v | I { x v = y v } − (cid:1) (cid:88) y w ∈ X w (cid:0) | X w | I { x w = y w } − (cid:1) = 0 , because the innermost summand is | X w | − y w , and − | X w | − We first need the following result. 27 emma 7.1.
For L ⊆ M ⊆ V with N ≡ M \ L , define κ L | N ( x L | x N ) ≡ (cid:88) L ⊆ A ⊆ M λ MA ( x A ) . Then κ L | N ( x L | x N ) = 1 | X L | (cid:88) y M ∈ X M y N = x N log p ( y M ) (cid:89) v ∈ L (cid:0) | X v | I { x v = y v } − (cid:1) . Proof.
Applying Lemma 2.3, we have κ L | N ( x L | x N )= (cid:88) L ⊆ A ⊆ M | X M | (cid:88) y M ∈ X M log p M ( y M ) (cid:89) v ∈ A (cid:0) | X v | I { x v = y v } − (cid:1) = 1 | X M | (cid:88) y M ∈ X M log p M ( y M ) (cid:88) L ⊆ A ⊆ M (cid:89) v ∈ A (cid:0) | X v | I { x v = y v } − (cid:1) = 1 | X M | (cid:88) y M ∈ X M log p M ( y M ) (cid:88) L ⊆ A ⊆ M (cid:89) v ∈ L (cid:0) | X v | I { x v = y v } − (cid:1) (cid:89) v ∈ A \ L (cid:0) | X v | I { x v = y v } − (cid:1) = 1 | X M | (cid:88) y M ∈ X M log p M ( y M ) (cid:89) v ∈ L (cid:0) | X v | I { x v = y v } − (cid:1) (cid:88) B ⊆ N (cid:89) v ∈ B (cid:0) | X v | I { x v = y v } − (cid:1) . Now, consider the value of the inner sum, for a fixed y M . In the case that there is some w ∈ N with x w (cid:54) = y w , then (cid:88) B ⊆ N (cid:89) v ∈ B (cid:0) | X v | I { x v = y v } − (cid:1) = (cid:88) B ⊆ N \{ w } (cid:89) v ∈ B (cid:0) | X v | I { x v = y v } − (cid:1) + (cid:89) v ∈ B ∪{ w } (cid:0) | X v | I { x v = y v } − (cid:1) = (cid:88) B ⊆ N \{ w } (cid:34) (cid:89) v ∈ B (cid:0) | X v | I { x v = y v } − (cid:1) − (cid:89) v ∈ B (cid:0) | X v | I { x v = y v } − (cid:1)(cid:35) = 0 . Alternatively, if x N = y N , then (cid:88) B ⊆ N (cid:89) v ∈ B (cid:0) | X v | I { x v = y v } − (cid:1) = (cid:88) B ⊆ N (cid:89) v ∈ B ( | X v | − | X N | by the binomial theorem. Thus κ L | N ( x L | x N ) = 1 | X L | (cid:88) y M ∈ X M y N = x N log p ( y M ) (cid:89) v ∈ L (cid:0) | X v | I { x v = y v } − (cid:1) , since X M = X L × X N . 28 roof of Lemma 4.4. Let N ≡ M \ L , and pick some x L ∈ ˜ X L and x N ∈ X N ; for A ⊆ L ,let A be a vector of length | L | with a 1 in position j if the j th element of L is in A , and 0otherwise. Define the local | L | -way log-linear interaction parameter between x L + L and x L conditional on x N as (cid:88) A ⊆ L ( − | L \ A | log p L | N ( x L + A | x N );note that since x L ∈ ˜ X L , x L + A ∈ X L . We will first show that we can construct allthese local | L | -way log-linear interaction parameters using the parameters given in thestatement of the lemma. As in Lemma 7.1, let κ L | N ( x L | x N ) ≡ (cid:80) L ⊆ A ⊆ M λ MA ( x A ), andnote that (cid:88) A ⊆ L ( − | L \ A | κ L | N ( x L + A | x N )= ( − | L | | X L | (cid:88) y L ∈ X L log p M ( y L , x N ) (cid:88) A ⊆ L ( − | A | (cid:89) v ∈ L (cid:16) | X v | I { x v + I { v ∈ A } = y v } − (cid:17) follows directly from Lemma 7.1. Now consider the inner sum; if for some w ∈ L , y w / ∈{ x w , x w + 1 } , then (cid:88) A ⊆ L ( − | A | (cid:89) v ∈ L (cid:16) | X v | I { x v + I { v ∈ A } = y v } − (cid:17) = (cid:88) A ⊆ L \{ w } ( − | A | (cid:34)(cid:89) v ∈ L (cid:16) | X v | I { x v + I { v ∈ A } = y v } − (cid:17) − (cid:89) v ∈ L (cid:16) | X v | I { x v + I { v ∈ A ∪{ w }} = y v } − (cid:17)(cid:35) = 0 , because the value of the outer indicator function is 0 in both terms when v = w , while theinner indicator functions are the same for all other v . Alternatively, if y w ∈ { x w , x w + 1 } for all w ∈ L , then define B ( A ) ≡ { v ∈ L | x v + I { v ∈ A } = y v } . The map A (cid:55)→ B ( A ) is a one-to-one map from P ( L ), the power set of L , to itself, i.e. anautomorphism. Note that D ≡ B ( A ) (cid:52) A = { v ∈ L | x v = y v } is independent of A . 
Since
\[
|A| + 2\,|B(A) \setminus A| = |B(A)| + |A \mathbin{\triangle} B(A)| = |B(A)| + |D|,
\]
we can rewrite the sum over subsets as
\begin{align*}
\sum_{A \subseteq L} (-1)^{|A|} \prod_{v \in L} \bigl(|\mathcal{X}_v|\,\mathbb{I}\{x_v + \mathbb{I}\{v \in A\} = y_v\} - 1\bigr)
&= \sum_{A \subseteq L} (-1)^{|B(A)| + |D|} \prod_{v \in L} \bigl(|\mathcal{X}_v|\,\mathbb{I}\{v \in B(A)\} - 1\bigr) \\
&= (-1)^{|D|} \sum_{B \subseteq L} (-1)^{|B|} \prod_{v \in L} \bigl(|\mathcal{X}_v|\,\mathbb{I}\{v \in B\} - 1\bigr) \\
&= (-1)^{|D|} (-1)^{|L|} \sum_{B \subseteq L} \prod_{v \in B} \bigl(|\mathcal{X}_v| - 1\bigr)
= (-1)^{|D|} (-1)^{|L|} \prod_{v \in L} |\mathcal{X}_v| = (-1)^{|D|} (-1)^{|L|}\,|\mathcal{X}_L|.
\end{align*}
Then, substituting this back into the original expression and noting that the two $(-1)^{|L|}$ factors cancel out,
\begin{align*}
\sum_{A \subseteq L} (-1)^{|L \setminus A|} \kappa_{L|N}(x_L + \mathbf{1}_A \mid x_N)
&= \sum_{D \subseteq L} (-1)^{|D|} \log p_M(x_L + \mathbf{1}_{L \setminus D},\, x_N) \\
&= \sum_{D \subseteq L} (-1)^{|D|} \bigl[\log p_{L|N}(x_L + \mathbf{1}_{L \setminus D} \mid x_N) + \log p_N(x_N)\bigr] \\
&= \sum_{D \subseteq L} (-1)^{|D|} \log p_{L|N}(x_L + \mathbf{1}_{L \setminus D} \mid x_N),
\end{align*}
where the terms in $\log p_N(x_N)$ cancel because of the lack of dependence upon $D$. This is the (conditional) local $|L|$-way log-linear interaction. The collection of all the (conditional) local $|L|$-way log-linear interactions, together with the (conditional) $(|L|-1)$-dimensional marginal distributions, determines the (conditional) $|L|$-way table (Csiszár, 1975; Rudas, 1998).
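The identity just established is easy to verify numerically in a small example. The sketch below (again ours, with purely illustrative names; it assumes three binary variables with $M = \{a,b,c\}$, $L = \{a,b\}$, $N = \{c\}$ and $x_L = (0,0)$) computes $\kappa_{L|N}$ as the sum of marginal log-linear parameters, and checks that its alternating sum equals the conditional local two-way interaction, i.e. the log odds ratio of $a$ and $b$ given $c$.

import itertools
import numpy as np

rng = np.random.default_rng(1)
p = rng.dirichlet(np.ones(8)).reshape(2, 2, 2)   # joint p(a, b, c), all binary

def mll(p_margin, margin, effect, x):
    """lambda^{margin}_{effect}(x) via the formula of Lemma 2.3 (binary case)."""
    total = 0.0
    for y in itertools.product(range(2), repeat=len(margin)):
        w = 1.0
        for v in effect:
            w *= 2 * (x[v] == y[margin.index(v)]) - 1
        total += np.log(p_margin[y]) * w
    return total / p_margin.size

def kappa(x_ab, x_c):
    """kappa_{L|N}(x_L | x_N): sum of lambda^M_A over L subseteq A subseteq M."""
    x = {'a': x_ab[0], 'b': x_ab[1], 'c': x_c}
    return sum(mll(p, ('a', 'b', 'c'), eff, x)
               for eff in [('a', 'b'), ('a', 'b', 'c')])

x_c = 0
# Alternating sum over A subseteq L of kappa evaluated at x_L + 1_A.
lhs = sum((-1) ** (2 - sum(ones)) * kappa(ones, x_c)
          for ones in itertools.product(range(2), repeat=2))
# Conditional local two-way interaction: the log odds ratio of a, b given c.
p_ab_c = p[:, :, x_c] / p[:, :, x_c].sum()
rhs = np.log(p_ab_c[1, 1] * p_ab_c[0, 0] / (p_ab_c[1, 0] * p_ab_c[0, 1]))
print(np.isclose(lhs, rhs))   # True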
We require the following lemma.

Lemma 7.2.
Let $\bar{G}$ be a head-preserving completion of $G$, and let $H \in \mathcal{H}(G)$ have tails $T$ and $\bar{T}$ in $G$ and $\bar{G}$ respectively. Then under the global Markov property for $G$,
\[
H \perp\!\!\!\perp (\bar{T} \setminus T) \mid T \quad [P].
\]

Proof.
Let $\pi$ be a path in $G$ from some $h \in H$ to $t \in \bar{T} \setminus T$, and assume without loss of generality that $\pi$ does not intersect $H$ or $\bar{T} \setminus T$ other than at its endpoints. By Proposition 3.5, every vertex on $\pi$ is in $\mathrm{an}_G(\{h, t\} \cup T) \subseteq \mathrm{an}_G(H \cup \bar{T})$. Since $\bar{G}$ is complete, if $v \in \mathrm{an}_{\bar{G}}(H \cup \bar{T})$ then $v \in H \cup \bar{T}$, so $H \cup \bar{T}$ is ancestral in $\bar{G}$. By Proposition 3.12, $H \cup \bar{T}$ is also ancestral in $G$, thus every vertex on $\pi$ is in $H \cup \bar{T}$.

By Proposition 3.8, $\bar{T} \subseteq \mathrm{an}_{\bar{G}}(H)$, so $H \cup \bar{T} = \mathrm{an}_{\bar{G}}(H)$. However, since $H$ forms a head in $\bar{G}$, $H$ is barren in $\bar{G}$. Thus in $\bar{G}$, no proper descendant of a vertex in $H$ is on $\pi$, and by Proposition 3.12 this also holds in $G$.

Now let $y$ be the first vertex after $h$ on $\pi$ that is not in $T$; such a $y$ exists since $t \notin T$. By construction, any vertices between $h$ and $y$ on $\pi$ are in $T$, hence are colliders on $\pi$ and ancestors of $H$ in $G$ (by Proposition 3.8). Thus $y \in \mathrm{dis}_G(H) \cup \mathrm{pa}_G(\mathrm{dis}_G(H))$. If $y \in \mathrm{an}_G(H)$ then $y \in T$, which is a contradiction, hence $y \in \mathrm{dis}_G(H)$ and $y \notin \mathrm{an}_G(H)$. As shown above, $y$ is not a descendant of a vertex in $H$, so $H \cup \{y\}$ forms a head in $G$. Since $\bar{G}$ is a head-preserving completion, it follows that $H \cup \{y\}$ also forms a head in $\bar{G}$, and thus $y \notin \mathrm{an}_{\bar{G}}(H) = H \cup \bar{T}$; but this is a contradiction.

Proof of Theorem 4.7. Let $(H, \bar{T})$ be a head-tail pair in $\bar{G}$. There are three possibilities for how this pair relates to $G$: if $(H, \bar{T})$ is also a head-tail pair in $G$, then there is no work to be done; otherwise either (i) $H$ is not a head in $G$, or (ii) $H$ is a head in $G$ but $\bar{T}$ is not its tail.

If (i) holds, then we claim that under $G$, $\lambda^{H \cup \bar{T}}_{A} = 0$ for all $H \subseteq A \subseteq H \cup \bar{T}$. To see this, first note that $H$ is a barren set in $\bar{G}$, and since $H$ is maximally connected, all of its elements are joined by bidirected edges in $\bar{G}$. Since $G$ contains a subset of the edges in $\bar{G}$, $H$ is also barren in $G$; since $H$ is not a head in $G$, this means that $H = K \cup L$ for disjoint non-empty sets $K$ and $L$ with no edges directly connecting them. But this implies that $K$ and $L$ are m-separated conditional on $\bar{T}$, and thus $X_K \perp\!\!\!\perp X_L \mid X_{\bar{T}}$ under the Markov property for $G$. Then, by Lemma 2.7, these parameters are all identically zero under $G$.

If (ii) holds, then $H$ is a head in both $G$ and $\bar{G}$, but $\bar{T} \equiv \mathrm{tail}_{\bar{G}}(H) \supset \mathrm{tail}_{G}(H) \equiv T$. Then $\lambda^{H \cup \bar{T}}_{A} = 0$ for all $H \subseteq A \subseteq H \cup \bar{T}$ such that $A \cap (\bar{T} \setminus T) \neq \emptyset$; this follows from Lemma 7.2 and an application of Lemma 2.7.

We have shown that all parameters corresponding to effects not found in $\mathcal{P}_{\mathrm{ing}}(G)$ are identically zero under $G$. The vanishing of these parameters defines the correct sub-model, but note that some of the margins in $\mathcal{P}_{\mathrm{ing}}(\bar{G})$ which we have not yet considered are not the same as those in $\mathcal{P}_{\mathrm{ing}}(G)$. These remaining cases are again from (ii), but where $H \subseteq A \subseteq H \cup T$; in this case $\lambda^{H \cup \bar{T}}_{A} = \lambda^{H \cup T}_{A}$ under $G$, again due to Lemma 7.2, this time combined with Lemma 2.9.

Thus we have shown that under $G$, all the ingenuous parameters for $\bar{G}$ are either zero or equal to ingenuous parameters for $G$. Combined with Theorem 4.5, this shows that those constraints define the model.

We first prove the following graphical result.
Lemma 7.3.
Let $G$ be an ADMG containing at least one head of size 3 or more. Then $G$ also contains two heads of the form $\{v_1, v_2\}$ and $\{v_2, v_3\}$, where $\{v_1, v_2, v_3\}$ is barren.

Proof. Suppose not; let $G$ be an ADMG which violates this condition, and let $H$ be a head in $G$ of size $k \geq 3$.
Pick three vertices $w_1, w_2, w_3$ in $H$. By the definition of a head, we can pick a bidirected path $\pi$, through $\mathrm{an}_G(H)$, from $w_1$ to $w_2$; assume that $\pi$ contains no other element of $H$, otherwise shorten the path and redefine $w_1$ or $w_2$. Then create a similar path $\rho$ from $w_2$ to $w_3$; again assume that $\rho$ contains no other element of $H$, else shorten the path and redefine $w_3$. If $w_1$ lies on $\rho$ then we can swap the roles of $w_1$ and $w_3$ to get the desired configuration.

According to our assumption that the result is false, at least one of $\{w_1, w_2\}$ or $\{w_2, w_3\}$ is not a head; assume the former without loss of generality. This implies that $\pi$ must pass through at least one vertex $v$ which is not an ancestor of $\{w_1, w_2\}$. If there is more than one such vertex, then choose one which has no proper descendants on the path $\pi$. By the construction of $\pi$ we have $v \in \mathrm{an}_G(H) \setminus H$.

Then let $W$ be the set of vertices on $\pi$, and $H^* \equiv \mathrm{barren}_G(W)$. Since $W$ is $\leftrightarrow$-connected, $H^*$ must be a head, and $\{w_1, w_2, v\} \subseteq H^*$. Thus we have created a head distinct from $H$, of size at least 3, which is contained in the set of ancestors of $H$.

The assumption we have made implies that we must be able to repeat this process indefinitely, with each head being contained in the ancestors of the previous head. To see that we never obtain the same head twice, note that there is a non-empty directed path from $v \in H^*$ to $H$; but $H$ is contained within the ancestors of any previous heads in the sequence, so if $H^*$ had appeared before, this would imply that $H^*$ was not barren. Since $H$ has a finite set of ancestors, this apparently infinite recursion of distinct heads is a contradiction.

Definition 7.4.
Let $A$ be an ancestral set in an ADMG $G$, and let $v \in \mathrm{barren}_G(A)$. The Markov blanket for $v$ in $A$ is the set
\[
\mathrm{mb}(v, A) \equiv \mathrm{pa}_A(\mathrm{dis}_A(v)) \cup \bigl(\mathrm{dis}_A(v) \setminus \{v\}\bigr).
\]
In particular, under the ordered local Markov property for $G$,
\[
v \perp\!\!\!\perp A \setminus \bigl(\mathrm{mb}(v, A) \cup \{v\}\bigr) \mid \mathrm{mb}(v, A). \tag{3}
\]
Note that (3) holds for every $v$ and ancestral set $A$ (with $v \in \mathrm{barren}_G(A)$) if and only if the global Markov property for $G$ holds (Richardson, 2003).
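For concreteness, a minimal sketch of how the Markov blanket of Definition 7.4 can be computed from an edge-list representation of an ADMG is given below; the representation, the function names and the toy graph are our own illustrative choices and are not taken from the paper.

from collections import deque

# ADMG stored as two edge sets: di_edges contains directed edges (x, y)
# meaning x -> y, and bi_edges contains bidirected edges x <-> y.

def district(v, A, bi_edges):
    """dis_A(v): vertices reachable from v by bidirected edges inside A."""
    A = set(A)
    seen, queue = {v}, deque([v])
    while queue:
        u = queue.popleft()
        for x, y in bi_edges:
            for a, b in ((x, y), (y, x)):
                if a == u and b in A and b not in seen:
                    seen.add(b)
                    queue.append(b)
    return seen

def markov_blanket(v, A, di_edges, bi_edges):
    """mb(v, A) = pa_A(dis_A(v)) | (dis_A(v) - {v}), for A ancestral
    and v barren in A."""
    d = district(v, A, bi_edges)
    parents = {x for x, y in di_edges if y in d and x in A}
    return (parents | d) - {v}

# Toy ADMG: 1 -> 2 -> 4 and 3 <-> 4.  A = {1, 2, 3, 4} is ancestral and
# 4 is barren in A, so (3) gives 4 _||_ 1 | mb(4, A).
di_edges = {(1, 2), (2, 4)}
bi_edges = {(3, 4)}
print(markov_blanket(4, {1, 2, 3, 4}, di_edges, bi_edges))   # {2, 3}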
Proof of Theorem 5.5. ($\Leftarrow$). Suppose that $G$ contains no heads of size $\geq 3$, and let $1, \ldots, n$ be a topological ordering on the vertices of $G$. We will construct a complete, hierarchical and variation independent parametrization of the saturated model, and then show that under the global Markov property for $G$ it is equivalent to the ingenuous parametrization.

Let $\mathcal{M}_i \subseteq \mathcal{M}$ be the margins which involve only the vertices in $[i] = \{1, \ldots, i\}$. Assume for induction that $\mathcal{M}_{i-1}$ includes the set $[i-1]$; the case $i = 1$ is trivial.

Now, let the heads involving $i$ contained within $[i]$ be $H_0 = \{i\}$, $H_1 = \{j_1, i\}, \ldots, H_k = \{j_k, i\}$, where $j_1 < \cdots < j_k < i$ (possibly with $k = 0$). Call the associated tails $T_0, \ldots, T_k$. We have $\mathrm{barren}_G(\mathrm{dis}_{[i]}(i)) = \{j_k, i\}$, since $\mathrm{barren}_G(\mathrm{dis}_{[i]}(i))$ is a head, and cannot have size $\geq 3$.
This also implies that $(H_k \cup T_k) \setminus \{i\} = \mathrm{mb}(i, [i])$, where $\mathrm{mb}(v, A)$ is the Markov blanket of $v$ in the ancestral set $A$.

Now, since the ordering is topological, $A_k \equiv [i]$ is an ancestral set, and the ordered local Markov property shows that
\[
i \perp\!\!\!\perp A_k \setminus \bigl(\mathrm{mb}(i, A_k) \cup \{i\}\bigr) \mid \mathrm{mb}(i, A_k), \qquad \text{i.e.}\quad i \perp\!\!\!\perp A_k \setminus (H_k \cup T_k) \mid (H_k \cup T_k) \setminus \{i\}.
\]
Then for all $\{i\} \subseteq C \subseteq A_k$ such that $C \cap \mathrm{de}_G(j_k) \neq \emptyset$,
\[
\lambda^{A_k}_{C} = \lambda^{H_k \cup T_k}_{C} \quad \text{if } H_k \subseteq C \subseteq H_k \cup T_k, \qquad\qquad \lambda^{A_k}_{C} = 0 \quad \text{otherwise},
\]
where the first equality follows from the independence and Lemma 2.9, and the second from the above independence and Lemma 2.7.

Now set $A_{k-1} = A_k \setminus \mathrm{de}_G(j_k)$. Then $A_{k-1}$ is ancestral and contains $i$, so applying the ordered local Markov property again gives, for any $\{i\} \subseteq C \subseteq A_{k-1}$ such that $C \cap \mathrm{de}_G(j_{k-1}) \neq \emptyset$,
\[
\lambda^{A_{k-1}}_{C} = \lambda^{H_{k-1} \cup T_{k-1}}_{C} \quad \text{if } H_{k-1} \subseteq C \subseteq H_{k-1} \cup T_{k-1}, \qquad\qquad \lambda^{A_{k-1}}_{C} = 0 \quad \text{otherwise}.
\]
Continuing this approach gives exactly one parameter for each subset $C$ of $[i]$ containing $i$ and some descendant of any of $j_1, \ldots, j_k$. Lastly, let $A_0 = A_1 \setminus \mathrm{de}_G(j_1)$. Then for $\{i\} \subseteq C \subseteq A_0$,
\[
\lambda^{A_0}_{C} = \lambda^{H_0 \cup T_0}_{C} \quad \text{if } \{i\} \subseteq C \subseteq \{i\} \cup T_0, \qquad\qquad \lambda^{A_0}_{C} = 0 \quad \text{otherwise}.
\]
Now, add the margins $A_0 \subset \cdots \subset A_k = [i]$; since these all contain $\{i\}$, they are not a subset of any existing margin. Further, each set $C$ we associate with $A_l$ contains a vertex which is not in $A_{l-1}$. Thus the addition of these margins and their associated effects keeps our parametrization complete and hierarchical. Setting $\mathcal{M}_i = \mathcal{M}_{i-1} \cup \{A_0, \ldots, A_k\}$, there are at most two maximal subsets among the margins up to $A_l$ (namely $[i-1]$ and $A_l$);
thus $\mathcal{M}_i$ is clearly also ordered decomposable, and so the parameters are variation independent.

Furthermore, we have shown that under the global Markov property for $G$, these parameters are equal to the ingenuous parameters or are identically zero. Thus the ingenuous parameters must also be variation independent.

($\Rightarrow$). Our construction will assume the random variables are binary; the general case is a trivial but tedious extension. Suppose that $G$ has a head of size $\geq 3$, and assume for a contradiction that its ingenuous parametrization is variation independent.
Then by Lemma 7.3, there exist two heads $H_1 = \{v_1, v_2\}$ and $H_2 = \{v_2, v_3\}$ such that $\{v_1, v_2, v_3\}$ is barren. Let $H_3 \equiv \{v_1, v_3\}$, noting that this set may or may not be a head. Also let $T_i = \mathrm{tail}_G(H_i)$, where if $H_3$ is not a head, this set is taken to be the tail that $H_3$ would have if there were a bidirected edge between $v_1$ and $v_3$. Further, let $A \equiv \mathrm{an}_G(\{v_1, v_2, v_3\}) \setminus \{v_1, v_2, v_3\}$.

Now choose $\lambda^{B_i}_{C_i} = 0$, where $B_i = \{v_i\} \cup \mathrm{tail}_G(\{v_i\})$ and $\{v_i\} \subseteq C_i \subseteq B_i$; this sets every $v_i$ to be uniform on $\{0, 1\}$ for each instantiation of its tail.

Similarly, by choosing $\lambda^{H_1 \cup T_1}_{C}(0)$ to be large and positive for each $H_1 \subseteq C \subseteq H_1 \cup T_1$, we can force $v_1$ and $v_2$ to be arbitrarily highly correlated conditional on $T_1$, and therefore conditional on $A$. We can do the same for $v_2$ and $v_3$, so for any $0 < \epsilon < \tfrac{1}{2}$:
\[
\begin{array}{c|cc}
 & v_2 = 0 & v_2 = 1 \\ \hline
v_1 = 0 & \tfrac{1}{2} - \epsilon & \epsilon \\
v_1 = 1 & \epsilon & \tfrac{1}{2} - \epsilon
\end{array}
\qquad\qquad
\begin{array}{c|cc}
 & v_3 = 0 & v_3 = 1 \\ \hline
v_2 = 0 & \tfrac{1}{2} - \epsilon & \epsilon \\
v_2 = 1 & \epsilon & \tfrac{1}{2} - \epsilon
\end{array}
\]
where these tables are understood to show the two-way marginal distributions conditional on each instantiation $x_A$ of $A$.

But now either $\lambda^{H_3 \cup T_3}_{C} = 0$ by design (because $H_3$ is not a head, and $v_1$ and $v_3$ are independent conditional on their 'tail'), or we can choose this to be the case by the assumption of variation independence. This implies that $v_1$ and $v_3$ are independent conditional on $A$. Thus
\begin{align*}
\tfrac{1}{4} &= P(v_1 = 1,\, v_3 = 0 \mid A = x_A) \\
&= P(v_1 = 1,\, v_2 = 0,\, v_3 = 0 \mid A = x_A) + P(v_1 = 1,\, v_2 = 1,\, v_3 = 0 \mid A = x_A) \\
&\leq P(v_1 = 1,\, v_2 = 0 \mid A = x_A) + P(v_2 = 1,\, v_3 = 0 \mid A = x_A) = 2\epsilon,
\end{align*}
which is a contradiction whenever $\epsilon < \tfrac{1}{8}$. Thus the parameters are variation dependent.
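The arithmetic driving this contradiction can also be checked numerically. The sketch below (ours; it simply uses scipy's linear-programming routine as a convenient tool, and the variable layout is illustrative) maximises $P(v_1 = 1, v_3 = 0)$ over all joint distributions whose $(v_1, v_2)$ and $(v_2, v_3)$ margins are the $\epsilon$-tables above, and confirms that the maximum stays well below the value $1/4$ forced by conditional independence of $v_1$ and $v_3$ with uniform margins.

import numpy as np
from scipy.optimize import linprog

eps = 0.05
cells = [(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)]

def pair_constraint(i, j, vi, vj):
    """Row selecting the joint cells with coordinate i equal to vi and j to vj."""
    return [1.0 if (cell[i] == vi and cell[j] == vj) else 0.0 for cell in cells]

A_eq, b_eq = [[1.0] * 8], [1.0]            # probabilities sum to one
for (i, j) in [(0, 1), (1, 2)]:            # fix the (v1,v2) and (v2,v3) tables
    for vi in (0, 1):
        for vj in (0, 1):
            A_eq.append(pair_constraint(i, j, vi, vj))
            b_eq.append(0.5 - eps if vi == vj else eps)

# Maximise P(v1 = 1, v3 = 0) over all joint laws with these two-way margins.
c = [-1.0 if (cell[0] == 1 and cell[2] == 0) else 0.0 for cell in cells]
res = linprog(c, A_eq=np.array(A_eq), b_eq=np.array(b_eq), bounds=(0, 1))
print(-res.fun)   # at most 2*eps = 0.1, well below the 0.25 independence requires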
Acknowledgements

This research was supported by the U.S. National Science Foundation grant CNS-0855230 and U.S. National Institutes of Health grant R01 AI032475. The Netherlands Kinship Panel Study is funded by grant 480-10-009 from the Major Investments Fund of the Netherlands Organisation for Scientific Research (NWO), and by the Netherlands Interdisciplinary Demographic Institute (NIDI), Utrecht University, the University of Amsterdam and Tilburg University. We thank McDonald, Hui and Tierney for giving us permission to use their flu vaccine data.

Our thanks go to Tamás Rudas for helpful discussions, and to Antonio Forcina for discussions and the use of his computer programmes. Finally, we thank two anonymous referees and an associate editor for their thorough reading of an earlier draft, and very useful suggestions.
References
D. H. Ackley, G. E. Hinton, and T. J. Sejnowski. A learning algorithm for Boltzmann machines. Cognitive Science, 9:147–169, 1985.

R. A. Ali, T. S. Richardson, P. Spirtes, and J. Zhang. Towards characterizing Markov equivalence classes for directed acyclic graphs with latent variables. In Proceedings of the 21st Conference on Uncertainty in Artificial Intelligence, pages 10–17, 2005.

W. P. Bergsma and T. Rudas. Marginal models for categorical data. Ann. Statist., 30(1):140–159, 2002.

I. Csiszár. I-divergence geometry of probability distributions and minimization problems. Annals of Probability, 3(1):146–158, 1975.

J. N. Darroch, S. L. Lauritzen, and T. P. Speed. Markov fields and log-linear models for contingency tables. Ann. Statist., 8:522–539, 1980.

A. P. Dawid. Conditional independence in statistical theory (with discussion). J. Roy. Statist. Soc. Ser. B, 41:1–31, 1979.

M. Drton. Discrete chain graph models. Bernoulli, 15(3):736–753, 2009.

M. Drton and T. S. Richardson. Binary models for marginal independence. J. Roy. Statist. Soc. Ser. B, 70(2):287–309, 2008a.

M. Drton and T. S. Richardson. Graphical methods for efficient likelihood inference in Gaussian covariance models. J. Mach. Learn. Res., 9:893–914, 2008b.

P. A. Dykstra, M. Kalmijn, T. C. M. Knijn, A. E. Komter, A. C. Liefbroer, and C. H. Mulder. Codebook of the Netherlands Kinship Panel Study, a multi-actor, multi-method panel study on solidarity in family relationships. Wave 1. NKPS Working Paper No. 4, 2005.

P. A. Dykstra, M. Kalmijn, T. C. M. Knijn, A. E. Komter, A. C. Liefbroer, and C. H. Mulder. Codebook of the Netherlands Kinship Panel Study, a multi-actor, multi-method panel study on solidarity in family relationships. Wave 2. NKPS Working Paper No. 6, 2007.

A. Ekholm, J. Jokinen, J. W. McDonald, and P. W. F. Smith. A latent class model for bivariate binary responses from twins. J. Roy. Statist. Soc. Ser. C, 61(3), 2012.

R. J. Evans and A. Forcina. Two algorithms for fitting constrained marginal models. arXiv preprint arXiv:1110.2894, 2011.

R. J. Evans and T. S. Richardson. Maximum likelihood fitting of acyclic directed mixed graphs to binary data. In Proceedings of the 26th Conference on Uncertainty in Artificial Intelligence, 2010.

A. Forcina, M. Lupparelli, and G. M. Marchetti. Marginal parameterizations of discrete models defined by a set of conditional independencies. Journal of Multivariate Analysis, 101:2519–2527, 2010.

G. F. V. Glonek and P. McCullagh. Multivariate logistic models. J. Roy. Statist. Soc. Ser. B, 57(3):533–546, 1995.

A. J. Hakim, L. F. Cherkas, T. D. Spector, and A. J. MacGregor. Genetic associations between frozen shoulder and tennis elbow: a female twin study. Rheumatology, 42(6):739–742, 2003.

G. Kauermann. A note on multivariate logistic models for contingency tables. Austral. J. Statist., 39(3):261–276, 1997.

S. L. Lauritzen. Graphical Models. Clarendon Press, Oxford, UK, 1996.

M. Lupparelli, G. M. Marchetti, and W. P. Bergsma. Parameterizations and fitting of bi-directed graph models to categorical data. Scand. J. Statist., 36:559–576, 2009.

G. M. Marchetti and M. Lupparelli. Chain graph models of multivariate regression type for categorical data. Bernoulli, 17(3):827–844, 2011.

C. J. McDonald, S. L. Hui, and W. M. Tierney. Effects of computer reminders for influenza vaccination on morbidity during influenza epidemics. MD Computing, 9(5):304, 1992.

J. Pearl. Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann, 1988.

T. S. Richardson. Markov properties for acyclic directed mixed graphs. Scand. J. Statist., 30(1):145–157, 2003.

T. S. Richardson. A factorization criterion for acyclic directed mixed graphs. In Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence, 2009.

T. S. Richardson and P. Spirtes. Ancestral graph Markov models. Ann. Statist., 30:962–1030, 2002.

T. Rudas. Odds Ratios in the Analysis of Contingency Tables. Sage Publications, 1998.

T. Rudas, W. P. Bergsma, and R. Németh. Parameterization and estimation of path models for categorical data. In Proceedings in Computational Statistics, 17th Symposium, pages 383–394. Physica-Verlag HD, 2006.

T. Rudas, W. P. Bergsma, and R. Németh. Marginal log-linear parameterization of conditional independence models. Biometrika, 97:1006–1012, 2010.

J. Siemiatycki. A comparison of mail, telephone, and home interview strategies for household health surveys. American Journal of Public Health, 69:238–245, 1979.

N. Wermuth. Probability distributions with summary graph structure. Bernoulli, 17(3):845–879, 2011.

J. Whittaker. Graphical Models in Applied Multivariate Statistics. John Wiley & Sons, Chichester, 1990.