Gaussian Approximation of Collective Graphical Models
Li-Ping Liu, liuli@eecs.oregonstate.edu
Daniel Sheldon, sheldon@cs.umass.edu
Thomas G. Dietterich, tgd@eecs.oregonstate.edu

School of EECS, Oregon State University, Corvallis, OR 97331 USA
University of Massachusetts, Amherst, MA 01002 and Mount Holyoke College, South Hadley, MA 01075
Abstract
The Collective Graphical Model (CGM) models a population of independent and identically distributed individuals when only collective statistics (i.e., counts of individuals) are observed. Exact inference in CGMs is intractable, and previous work has explored Markov Chain Monte Carlo (MCMC) and MAP approximations for learning and inference. This paper studies Gaussian approximations to the CGM. As the population grows large, we show that the CGM distribution converges to a multivariate Gaussian distribution (GCGM) that maintains the conditional independence properties of the original CGM. If the observations are exact marginals of the CGM or marginals that are corrupted by Gaussian noise, inference in the GCGM approximation can be computed efficiently in closed form. If the observations follow a different noise model (e.g., Poisson), then expectation propagation provides efficient and accurate approximate inference. The accuracy and speed of GCGM inference is compared to the MCMC and MAP methods on a simulated bird migration problem. The GCGM matches or exceeds the accuracy of the MAP method while being significantly faster.
1. Introduction
Consider a setting in which we wish to model the behavior of a population of independent and identically distributed (i.i.d.) individuals but where we can only observe collective count data. For example, we might wish to model the relationship between education, sex, housing, and income from census data. For privacy reasons, the Census Bureau only releases count data such as the number of people having a given level of education or the number of men living in a particular region. Another example concerns modeling the behavior of animals from counts of (anonymous) individuals observed at various locations and times. This arises in modeling the migration of fish and birds.

The CGM is constructed by first defining the individual model, a graphical model describing a single individual. Let C and S be the clique set and the separator set of a junction tree constructed from the individual model. Then, we define N copies of this individual model to create a population of N i.i.d. individuals. This permits us to define count variables n_A, where n_A(i_A) is the number of individuals for which clique A ∈ C ∪ S is in configuration i_A. The counts n = (n_A : A ∈ C ∪ S) are the sufficient statistics of the individual model. After marginalizing away the individuals, the CGM provides a model for the joint distribution of n.

In typical applications of CGMs, we make noisy observations y that depend on some of the n variables, and we seek to answer queries about the distribution of some or all of the n conditioned on these observations. Let y = (y_D : D ∈ D), where D is a set of cliques from the individual graphical model and y_D contains counts of settings of clique D. We require each D ⊆ A for some clique A ∈ C ∪ S of the individual model. In addition to its usual role in graphical models, inference of the distribution of n also serves to estimate the parameters of the individual model (e.g., the E step in EM learning), because n are sufficient statistics of the individual model. Inference for CGMs is much more difficult than for the individual model. Unlike the individual model, many conditional distributions in the CGM do not have a closed form. The space of possible configurations of the CGM is very large, because each count variable n_i can take values in {0, . . . , N}.

The original CGM paper, Sheldon and Dietterich (2011), introduced a Gibbs sampling algorithm for sampling from P(n | y). Subsequent experiments showed that this exhibits slow mixing times, which motivated Sheldon, Sun, Kumar, and Dietterich (2013) to introduce an efficient algorithm for computing a MAP approximation based on minimizing a tractable convex approximation of the CGM distribution. Although the MAP approximation still scales exponentially in the domain size L of the individual-model variables, it was fast enough to permit fitting CGMs via EM on modest-sized instances (L = 49). However, given that we wish to apply this to problems where L = 1000, we need a method that is even more efficient.

This paper introduces a Gaussian approximation to the CGM. Because the count variables n_C have a multinomial distribution, it is reasonable to apply the Gaussian approximation. However, this approach raises three questions. First, is the Gaussian approximation asymptotically correct? Second, can it maintain the sparse dependency structure of the CGM distribution, which is critical to efficient inference? Third, how well does it work with natural (non-Gaussian) observation distributions for counts, such as the Poisson distribution? This paper answers these questions by proving an asymptotically correct Gaussian approximation for CGMs.
It shows that this approximation, when done correctly, is able to preserve the dependency structure of the CGM. And it demonstrates that, by applying expectation propagation (EP), non-Gaussian observation distributions can be handled. The result is a CGM inference procedure that gives good accuracy and achieves significant speedups over previous methods.

Beyond CGMs, our main result highlights a remarkable property of discrete graphical models: the asymptotic distribution of the vector of sufficient statistics is a Gaussian graphical model with the same conditional independence properties as the original model.
2. Problem Statement and Notation
Consider a graphical model defined on the graph G = (V, E) with n nodes and clique set C. Denote the random variables by X_1, . . . , X_n. Assume for simplicity that all variables take values in the same domain X of size L. Let x ∈ X^n be a particular configuration of the variables, and let x_C be the subvector of variables belonging to C. For each clique C ∈ C, let φ_C(x_C) be a non-negative potential function. Then the probability model is:

p(x) = (1/Z) ∏_{C∈C} φ_C(x_C)
     = exp( Σ_{C∈C} Σ_{i_C ∈ X^{|C|}} θ_C(i_C) · I(x_C = i_C) − Q(θ) ).    (1)

The second line shows the model in exponential-family form (Wainwright & Jordan, 2008), where I(π) is an indicator variable for the event or expression π, and θ_C(i_C) = log φ_C(i_C) is an entry of the vector of natural parameters. The function Q(θ) = log Z is the log-partition function. Given a fixed set of parameters θ and any subset A ⊆ V, the marginal distribution μ_A is the vector with entries μ_A(i_A) = Pr(X_A = i_A) for all possible i_A ∈ X^{|A|}. In particular, we will be interested in the clique marginals μ_C and the node marginals μ_i := μ_{{i}}.
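As a concrete illustration of Eq. (1), the following sketch builds a small model by brute-force enumeration and checks its normalization. This is not code from the paper; the three-node chain, the random log-potentials, and all names are illustrative assumptions.

```python
# A minimal sketch of Eq. (1) for a 3-node chain X1 - X2 - X3 with domain
# size L = 3.  The cliques are {1,2} and {2,3}; theta_C = log phi_C.
import itertools
import numpy as np

rng = np.random.default_rng(0)
L = 3
theta_12 = rng.normal(size=(L, L))   # log-potential on clique {1,2}
theta_23 = rng.normal(size=(L, L))   # log-potential on clique {2,3}

def unnormalized_logp(x):
    x1, x2, x3 = x
    return theta_12[x1, x2] + theta_23[x2, x3]

# Log-partition function Q(theta) = log Z by brute-force enumeration.
configs = list(itertools.product(range(L), repeat=3))
Q = np.logaddexp.reduce([unnormalized_logp(x) for x in configs])

def p(x):
    """Probability of a single configuration under Eq. (1)."""
    return np.exp(unnormalized_logp(x) - Q)

assert np.isclose(sum(p(x) for x in configs), 1.0)
```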
Junction Trees. Our development relies on the existence of a junction tree (Lauritzen, 1996) on the cliques of C to write the relevant CGM and GCGM distributions in closed form. Henceforth, we assume that such a junction tree exists and denote it by T. In practice, this means that one may need to add fill-in edges to the original model to obtain the triangulated graph G, of which C is the set of maximal cliques. This is a clear limitation for graphs with high tree-width. Our methods apply directly to trees and are most practical for low tree-width graphs. Since we use few properties of the junction tree directly, we review only the essential details here and refer the reader to Lauritzen (1996) for further details.

Let C and C' be two cliques that are adjacent in T; their intersection S = C ∩ C' is called a separator. Let S be the set of all separators of T, and let ν(S) be the number of times S appears as a separator, i.e., the number of different edges (C, C') in T for which S = C ∩ C'.
The CGM Distribution. Fix a sample size N and let x^1, . . . , x^N be N i.i.d. random vectors distributed according to the graphical model G. For any set A ⊆ V and particular setting i_A ∈ X^{|A|}, define the count

n_A(i_A) = Σ_{m=1}^{N} I(x^m_A = i_A).    (2)

Let n_A = (n_A(i_A) : i_A ∈ X^{|A|}) be the complete vector of counts for all possible settings of the variables in A. In particular, let n_u := n_{{u}} be the vector of node counts. Also, let n = (n_A : A ∈ C ∪ S) be the combined vector of all clique and separator counts; these are the sufficient statistics of a sample of size N from the graphical model. The distribution over this vector is the CGM distribution.
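The count vectors of Eq. (2) can be formed directly from a sample of individuals. The sketch below (our own illustration, not the paper's code) samples N individuals from a three-node chain and aggregates them into clique counts n_12, n_23 and separator counts n_2; the model and all names are assumptions.

```python
# Aggregate N i.i.d. individuals from a 3-node chain into the count
# vectors of Eq. (2).  Individuals are sampled by brute-force enumeration.
import itertools
import numpy as np

rng = np.random.default_rng(0)
L, N = 3, 1000
theta_12 = rng.normal(size=(L, L))
theta_23 = rng.normal(size=(L, L))

configs = np.array(list(itertools.product(range(L), repeat=3)))
logp = theta_12[configs[:, 0], configs[:, 1]] + theta_23[configs[:, 1], configs[:, 2]]
probs = np.exp(logp - np.logaddexp.reduce(logp))

# Draw N individuals and form the counts n_{12}, n_{23}, n_2.
sample = configs[rng.choice(len(configs), size=N, p=probs)]
n_12 = np.zeros((L, L), dtype=int)
n_23 = np.zeros((L, L), dtype=int)
np.add.at(n_12, (sample[:, 0], sample[:, 1]), 1)
np.add.at(n_23, (sample[:, 1], sample[:, 2]), 1)
n_2 = np.bincount(sample[:, 1], minlength=L)

# Counts built this way automatically satisfy the constraints in Eq. (4).
assert n_12.sum() == N
assert np.all(n_12.sum(axis=0) == n_2) and np.all(n_23.sum(axis=1) == n_2)
```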
Proposition 1  Let n be the vector of (clique and separator) sufficient statistics of a sample of size N from the discrete graphical model (1). The probability mass function of n is given by p(n; θ) = h(n) f(n; θ), where

f(n; θ) = exp( Σ_{C∈C, i_C ∈ X^{|C|}} θ_C(i_C) · n_C(i_C) − N Q(θ) )    (3)

h(n) = N! · [ ∏_{S∈S} ∏_{i_S ∈ X^{|S|}} ( n_S(i_S)! )^{ν(S)} ] / [ ∏_{C∈C} ∏_{i_C ∈ X^{|C|}} n_C(i_C)! ]
       · ∏_{S∼C∈T, i_S ∈ X^{|S|}} I( n_S(i_S) = Σ_{i_{C∖S}} n_C(i_S, i_{C∖S}) )
       · ∏_{C∈C} I( Σ_{i_C ∈ X^{|C|}} n_C(i_C) = N ).    (4)

Denote this distribution by CGM(N, θ).

Here, the notation S ∼ C ∈ T means that S is adjacent to C in T. This proposition was first proved in nearly this form by Sundberg (1975) (see also Lauritzen (1996)). Proposition 1 differs from those presentations by writing f(n; θ) in terms of the original parameters θ instead of the clique and separator marginals {μ_C, μ_S}, and by including hard constraints in the base measure h(n). The hard constraints enforce consistency of the sufficient statistics of all cliques on their adjacent separators, and were treated implicitly prior to Sheldon & Dietterich (2011). A proof of the equivalence between our expression for f(n; θ) and the expressions from prior work is given in the supplementary material. Dawid & Lauritzen (1993) refer to the same distribution as the hyper-multinomial distribution due to the fact that it follows conditional independence properties analogous to those in the original graphical model.
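For intuition about the base measure in Eq. (4), here is a hedged sketch of log h(n) for the three-node chain, where the single separator {2} has ν({2}) = 1. The function and argument names are illustrative, not from the paper.

```python
# log h(n) from Eq. (4) for the chain X1 - X2 - X3.  n_12 and n_23 are
# L x L integer count arrays, n_2 is the length-L separator count vector.
import numpy as np
from scipy.special import gammaln

def log_h(n_12, n_23, n_2, N):
    # Hard constraints: cliques agree with the separator and sum to N.
    consistent = (n_12.sum() == N and n_23.sum() == N
                  and np.all(n_12.sum(axis=0) == n_2)
                  and np.all(n_23.sum(axis=1) == n_2))
    if not consistent:
        return -np.inf
    # N! * prod(n_S!)^{nu(S)} / prod(n_C!), in log space via gammaln.
    return (gammaln(N + 1)
            + gammaln(n_2 + 1.0).sum()
            - gammaln(n_12 + 1.0).sum()
            - gammaln(n_23 + 1.0).sum())
```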
Proposition 2  Let A, B ∈ S ∪ C be two sets that are separated by the separator S in T. Then n_A ⊥⊥ n_B | n_S.

Proof:
The probability model p(n; θ) factors over the clique and separator count vectors n_C and n_S. The only factors where two different count vectors appear together are the consistency constraints, where n_S and n_C appear together if S is adjacent to C in T. Thus, the CGM is a graphical model with the same structure as T, from which the claim follows. □
3. Approximating CGM by the Normal Distribution
In this section, we will develop a Gaussian approximation, the GCGM, of the CGM and show that it is the asymptotically correct distribution as N goes to infinity. We then show that the GCGM has the same conditional independence structure as the CGM, and we explicitly derive the conditional distributions. These allow us to use Gaussian message passing in the GCGM as a practical approximate inference method for CGMs.

We will follow the most natural approach of approximating the CGM distribution by a multivariate Gaussian with the same mean and covariance matrix. The moments of the CGM distribution follow directly from those of the indicator variables of the individual model: Fix an outcome x = (x_1, . . . , x_n) from the individual model and for any set A ⊆ V let I_A = ( I(x_A = i_A) : i_A ∈ X^{|A|} ) be the vector of all indicator variables for that set. The mean and covariance of any such vectors are given by

E[I_A] = μ_A    (5)
cov(I_A, I_B) = ⟨μ_{A,B}⟩ − μ_A μ_B^T.    (6)

Here, the notation ⟨μ_{A,B}⟩ refers to the matrix whose (i_A, i_B) entry is the marginal probability Pr(X_A = i_A, X_B = i_B). Note that Eq. (6) follows immediately from the definition of covariance for indicator variables, which is easily seen in the scalar form: cov(I(X_A = i_A), I(X_B = i_B)) = Pr(X_A = i_A, X_B = i_B) − Pr(X_A = i_A) Pr(X_B = i_B). Eq. (6) also covers the case when A ∩ B is nonempty. In particular, if A = B = {u}, then we recover cov(I_u, I_u) = diag(μ_u) − μ_u μ_u^T, which is the covariance matrix for the marginal multinomial distribution of I_u.

From the preceding arguments, it becomes clear that the covariance matrix for the full vector of indicator variables has a simple block structure. Define I = (I_A : A ∈ C ∪ S) to be the vector concatenation of all the clique and separator indicator variables, and let μ = (μ_A : A ∈ C ∪ S) = E[I] be the corresponding vector concatenation of marginals. Then it follows from (6) that the covariance matrix is

Σ := cov(I, I) = Σ̂ − μμ^T,    (7)

where Σ̂ is the matrix whose (A, B) block is the marginal distribution ⟨μ_{A,B}⟩. In the CGM model, the count vector n can be written as n = Σ_{m=1}^{N} I^m, where I^1, . . . , I^N are i.i.d. copies of I. As a result, the moments of the CGM are obtained by scaling the moments of I by N. We thus arrive at the natural moment-matching Gaussian approximation of the CGM.
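The moments in Eqs. (5)-(7) and the resulting moment-matching Gaussian can be computed by enumeration for small models. The sketch below is an illustration under the same assumed three-node chain; it is not the paper's implementation. Comparing gcgm_mean and gcgm_cov with empirical moments of simulated counts gives a quick sanity check of the approximation.

```python
# Build mu = E[I], Sigma = E[I I^T] - mu mu^T for the indicator vector
# I = (I_{12}, I_{23}, I_2) of a 3-node chain, then scale by N.
import itertools
import numpy as np

rng = np.random.default_rng(0)
L, N = 3, 1000
theta_12 = rng.normal(size=(L, L))
theta_23 = rng.normal(size=(L, L))

configs = list(itertools.product(range(L), repeat=3))
logp = np.array([theta_12[x1, x2] + theta_23[x2, x3] for x1, x2, x3 in configs])
probs = np.exp(logp - np.logaddexp.reduce(logp))

def indicator(x):
    x1, x2, x3 = x
    I_12 = np.zeros((L, L)); I_12[x1, x2] = 1
    I_23 = np.zeros((L, L)); I_23[x2, x3] = 1
    I_2 = np.zeros(L); I_2[x2] = 1
    return np.concatenate([I_12.ravel(), I_23.ravel(), I_2])

I_all = np.array([indicator(x) for x in configs])       # (#configs, 2L^2 + L)
mu = probs @ I_all                                       # Eq. (5)
Sigma = (I_all.T * probs) @ I_all - np.outer(mu, mu)     # Eqs. (6)-(7)
gcgm_mean, gcgm_cov = N * mu, N * Sigma                  # moment matching
```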
Definition 1  The Gaussian CGM, denoted
GCGM(N, θ), is the multivariate normal distribution N(Nμ, NΣ), where μ is the vector of all clique and separator marginals of the graphical model with parameters θ, and Σ is defined in Equation (7).

In the following theorem, we show that the GCGM is asymptotically correct and that it is a Gaussian graphical model, which will lead to efficient inference algorithms.
Theorem 1
Let n^N ∼ CGM(N, θ) for N = 1, 2, . . . . Then the following are true:

(i) The GCGM is asymptotically correct. That is, as N → ∞ we have

(1/√N) (n^N − Nμ) →_D N(0, Σ).    (8)

(ii) The GCGM is a Gaussian graphical model with the same conditional independence structure as the CGM. Let z ∼ GCGM(N, θ) and let A, B ∈ C ∪ S be two sets that are separated by separator S in T. Then z_A ⊥⊥ z_B | z_S.

Proof:
Part (i) is a direct application of the multivariate central limit theorem to the random vector n^N, which, as noted above, is a sum of i.i.d. random vectors I^1, . . . , I^N with mean μ and covariance Σ (Feller, 1968).

Part (ii) is a consequence of the fact that these conditional independence properties hold for each n^N (Proposition 2), so they also hold in the limit as N → ∞. While this is intuitively clear, it seems to require further justification, which is provided in the supplementary material. □

The goal is to use inference in the GCGM as a tractable approximate alternative inference method for CGMs. However, it is very difficult to compute the covariance matrix Σ over all cliques. In particular, note that the (C, C') block requires the joint marginal ⟨μ_{C,C'}⟩, and if C and C' are not adjacent in T this is hard to compute. Fortunately, we can sidestep the problem completely by leveraging the graph structure from Part (ii) of Theorem 1 to write the distribution as a product of conditional distributions whose parameters are easy to compute (this effectively means working with the inverse covariance matrix instead of Σ). We then perform inference by Gaussian message passing on the resulting model.

A challenge is that Σ is not full rank, so the GCGM distribution as written is degenerate and does not have a density. This can be seen by noting that any vector n ∼ CGM(N; θ) with nonzero probability satisfies the affine consistency constraints from Eq. (4) (for example, each vector n_C and n_S sums to the population size N), and that these affine constraints also hold with probability one in the limiting distribution. To fix this, we instead use a linear transformation T to map z to a reduced vector z̃ = Tz such that the reduced covariance matrix Σ̃ = TΣT^T is invertible. The work by Loh & Wainwright (2013) proposed a minimal representation of the graphical model in (1), and the corresponding random variable has a full-rank covariance matrix. We will find a transformation T to project our indicator variable I into that form. Then TI (as well as Tn and Tz) will have a full-rank covariance matrix.

Denote by C^+ the maximal and non-maximal cliques in the triangulated graph. Note that each D ∈ C^+ must be a subset of some A ∈ C ∪ S, and each subset of A is also a clique in C^+. For every D ∈ C^+, let X_D = (X ∖ {L})^{|D|} denote the space of possible configurations of D after excluding the largest value, L, from the domain of each variable in D. The corresponding random variable Ĩ in the minimal representation is defined as (Loh & Wainwright, 2013):

Ĩ = ( I(x_D = i_D) : i_D ∈ X_D, D ∈ C^+ ).    (9)

Ĩ_D can be calculated linearly from I_A when D ⊆ A via the matrix T_{D,A} whose (i_D, i_A) entry is defined as

T_{D,A}(i_D, i_A) = I(i_D ∼_D i_A),    (10)

where ∼_D means that i_D and i_A agree on the setting of the variables in D. It follows that Ĩ_D = T_{D,A} I_A. The whole transformation T can be built in blocks as follows: For every D ∈ C^+, choose A ∈ C ∪ S and construct the T_{D,A} block via (10). Set all other blocks to zero. Due to the redundancy of I, there might be many ways of choosing A for D and any one will work as long as D ⊆ A.
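The blocks T_{D,A} of Eq. (10) are simple indicator matrices. The following sketch constructs them for small cliques; the helper name and its interface are our own assumptions, not the paper's code.

```python
# T_{D,A}(i_D, i_A) = 1 iff i_D and i_A agree on the variables in D,
# where i_D ranges over X_D (the largest value dropped from each variable).
import itertools
import numpy as np

def T_block(D_pos, A_size, L):
    """Build the block T_{D,A}.

    D_pos  : positions (within the tuple indexing A) of the variables in D.
    A_size : number of variables in A.
    """
    A_configs = list(itertools.product(range(L), repeat=A_size))
    D_configs = list(itertools.product(range(L - 1), repeat=len(D_pos)))
    T = np.zeros((len(D_configs), len(A_configs)))
    for r, i_D in enumerate(D_configs):
        for c, i_A in enumerate(A_configs):
            T[r, c] = all(i_A[p] == i_D[k] for k, p in enumerate(D_pos))
    return T

L = 3
T_2_in_12 = T_block([1], 2, L)      # D = {2} inside A = {1,2}: (L-1) x L^2
T_12_in_12 = T_block([0, 1], 2, L)  # D = {1,2} itself: (L-1)^2 x L^2
```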
Proposition 3  Define T as above, and define z̃ = Tz, z̃_{A^+} = (z̃_D : D ⊆ A), A ∈ C ∪ S. Then

(i) If A, B ∈ C ∪ S are separated by S in T, it holds that z̃_{A^+} ⊥⊥ z̃_{B^+} | z̃_{S^+}.

(ii) The covariance matrix of z̃ has full rank.

Proof:
In the appendix, we show that for any A ∈ C ∪ S, I_A can be linearly recovered from Ĩ_{A^+} = (Ĩ_D : D ⊆ A). So there is a linear bijection between I_A and Ĩ_{A^+} (the mapping from I_A to Ĩ_{A^+} has been shown in the definition of T). The same linear bijection also exists between n_A and ñ_{A^+} = Σ_{m=1}^{N} Ĩ^m_{A^+}, and between z_A and z̃_{A^+}.

Proof of (i): Since z_A ⊥⊥ z_B | z_S, it follows that z̃_{A^+} ⊥⊥ z̃_{B^+} | z_S because z̃_{A^+} and z̃_{B^+} are deterministic functions of z_A and z_B respectively. Since z_S is a deterministic function of z̃_{S^+}, the same property holds when we condition on z̃_{S^+} instead of z_S.

Proof of (ii): The bijection between I and Ĩ indicates that the model representation of Loh & Wainwright (2013) defines the same model as (1). By Loh & Wainwright (2013), Ĩ has a full-rank covariance matrix, and so do ñ and z̃. □

With this result, the GCGM can be decomposed into conditional distributions, and each distribution is a non-degenerate Gaussian distribution.

Now let us consider the observations y = {y_D, D ∈ D}, where D is the set of cliques for which we have observations. We require each D ∈ D to be a subset of some clique C ∈ C. When choosing a distribution p(y_D | z_C), a modeler has substantial flexibility. For example, p(y_D | z_C) can be noiseless, y_D(i_D) = Σ_{i_{C∖D}} z_C(i_D, i_{C∖D}), which permits closed-form inference. Or p(y_D | z_C) can consist of independent noisy observations: p(y_D | z_C) = ∏_{i_D} p( y_D(i_D) | Σ_{i_{C∖D}} z_C(i_D, i_{C∖D}) ). With a little work, p(y_D | z_C) can be represented by p(y_D | z̃_{C^+}).

We describe how to decompose the GCGM for the special case when the original graphical model G is a tree. We assume that only counts of single nodes are observed. In this case, we can marginalize out edge (clique) counts z_{u,v} and retain only node (separator) counts z_u. Because the GCGM has a normal distribution, marginalization is easy. The conditional distribution is then defined only on node counts. With the definition of z̃ in Proposition 3 and the property of conditional independence, we can write

p(z̃_1, . . . , z̃_n) = p(z̃_r) ∏_{(u,v)∈E} p(z̃_v | z̃_u).    (11)

Here r ∈ V is an arbitrarily-chosen root node, and E is the set of directed edges of G oriented away from r. The marginalization of the edges greatly reduces the size of the inference problem, and a similar technique is also applicable to general GCGMs.

We now specify the parameters of the Gaussian conditional densities p(z̃_v | z̃_u) in Eq. (11). Assume the blocks T_{u,u} and T_{v,v} are defined as in (10). Let μ̃_u = T_{u,u} μ_u be the marginal vector of node u without its last entry, and let ⟨μ̃_{u,v}⟩ = T_{u,u} ⟨μ_{u,v}⟩ T_{v,v}^T be the marginal matrix over edge (u, v), minus the final row and column. Then the mean and covariance matrix of the joint distribution are

η := N [μ̃_u ; μ̃_v],    N [[diag(μ̃_u), ⟨μ̃_{u,v}⟩], [⟨μ̃_{v,u}⟩, diag(μ̃_v)]] − (1/N) ηη^T.    (12)

The conditional density p(z̃_v | z̃_u) is obtained by standard Gaussian conditioning formulas.

If we need to infer z_{u,v} from some distribution q(z̃_u, z̃_v), we first calculate the distribution p(z̃_{u,v} | z̃_u, z̃_v). This time we assume the blocks T_{{u,v}^+, {u,v}} = (T_{D,{u,v}} : D ⊆ {u,v}) are defined as in (10).
We can find the mean and variance of p(z̃_u, z̃_v, z̃_{u,v}) by applying the linear transformation T_{{u,v}^+, {u,v}} to the mean and variance of z_{u,v}. Standard Gaussian conditioning formulas then give the conditional distribution p(z̃_{u,v} | z̃_u, z̃_v). Then we can recover the distribution of z_{u,v} from the distribution p(z̃_{u,v} | z̃_u, z̃_v) q(z̃_u, z̃_v).
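The "standard Gaussian conditioning formulas" referred to above are the usual conditional mean and covariance of a partitioned Gaussian. A minimal sketch, with illustrative names, is given below; in the GCGM it would be applied with the joint moments from Eq. (12).

```python
# Condition a jointly Gaussian vector (z_u, z_v) on z_u = z_u_value.
import numpy as np

def condition(mean, cov, d_u, z_u_value):
    """Return mean and covariance of z_v | z_u = z_u_value.

    mean, cov : joint moments of the stacked vector (z_u, z_v);
    d_u       : dimension of z_u (the first block).
    """
    m_u, m_v = mean[:d_u], mean[d_u:]
    S_uu = cov[:d_u, :d_u]
    S_uv = cov[:d_u, d_u:]
    S_vv = cov[d_u:, d_u:]
    gain = np.linalg.solve(S_uu, S_uv).T   # equals S_vu S_uu^{-1} (cov is symmetric)
    cond_mean = m_v + gain @ (z_u_value - m_u)
    cond_cov = S_vv - gain @ S_uv
    return cond_mean, cond_cov
```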
Remark: Our reasoning gives a completely different way of deriving some of the results of Loh & Wainwright (2013) concerning the sparsity pattern of the inverse covariance matrix of the sufficient statistics of a discrete graphical model. The conditional independence in Proposition 2 for the factored GCGM density translates directly to the sparsity pattern in the precision matrix
Γ = Σ̃^{-1}. Unlike the reasoning of Loh & Wainwright, we derive the sparsity directly from the conditional independence properties of the asymptotic distribution (which are inherited from the CGM distribution) and the fact that the CGM and GCGM share the same covariance matrix.
4. Inference with Noisy Observations
We now consider the problem of inference in the GCGM when the observations are noisy. Throughout the remainder of the paper, we assume that the individual model, and hence the CGM, is a tree. In this case, the cliques correspond to edges and the separators correspond to nodes. We will also assume that only the nodes are observed. For notational simplicity, we will assume that every node is observed (with noise). (It is easy to marginalize out unobserved nodes if any.) From now on, we use uv instead of {u, v} to represent an edge clique. Finally, we assume that the entries have been dropped from the vector z as described in the previous section so that it has the factored density described in Eq. (11).

Denote the observation variable for node u by y_u, and assume that it has a Poisson distribution. In the (exact) CGM, this would be written as y_u ∼ Poisson(n_u). However, in our GCGM, this instead has the form

y_u ∼ Poisson(λ z_u),    (13)

where z_u is the corresponding continuous variable and λ determines the amount of noise in the distribution. Denote the vector of all observations by y. Note that the missing entry of z_u must be reconstructed from the remaining entries when computing the likelihood.

With Poisson observations, there is no longer a closed-form solution to message passing in the GCGM. We address this by applying Expectation Propagation (EP) with the Laplace approximation. This method has been previously applied to nonlinear dynamical systems by Ypma and Heskes (2005).

In the GCGM with observations, the potential on each edge (u, v) ∈ E is defined as

ψ(z_u, z_v) = p(z_v, z_u) p(y_v | z_v) p(y_u | z_u)   if u is the root,
ψ(z_u, z_v) = p(z_v | z_u) p(y_v | z_v)               otherwise.    (14)

We omit the subscripts on ψ for notational simplicity. The joint distribution of (z_v, z_u) has the mean and covariance shown in (12).

With EP, the model approximates the potential on edge (u, v) ∈ E with a normal distribution in the context q^{∖uv}(z_u) and q^{∖uv}(z_v). The context for edge (u, v) is defined as

q^{∖uv}(z_u) = ∏_{(u,v')∈E, v'≠v} q_{uv'}(z_u)    (15)
q^{∖uv}(z_v) = ∏_{(u',v)∈E, u'≠u} q_{u'v}(z_v),    (16)

where each q_{uv'}(z_u) and q_{u'v}(z_v) has the form of a normal density.

Let ξ(z_u, z_v) = q^{∖uv}(z_u) q^{∖uv}(z_v) ψ(z_u, z_v). The EP update of q_{uv}(z_u) and q_{uv}(z_v) is computed as

q_{uv}(z_u) = proj_{z_u}[ξ(z_u, z_v)] / q^{∖uv}(z_u)    (17)
q_{uv}(z_v) = proj_{z_v}[ξ(z_u, z_v)] / q^{∖uv}(z_v).    (18)

The projection operator, proj, is computed in two steps. First, we find a joint approximating normal distribution via the Laplace approximation, and then we project this onto each of the random variables z_u and z_v. In the Laplace approximation step, we need to find the mode of log ξ(z_u, z_v) and calculate its Hessian at the mode to obtain the mean and variance of the approximating normal distribution:

μ^ξ_{uv} = argmax_{(z_u, z_v)} log ξ(z_u, z_v)    (19)
Σ^ξ_{uv} = ( −∇² log ξ(z_u, z_v) |_{(z_u, z_v) = μ^ξ_{uv}} )^{-1}.    (20)

The optimization problem in (19) is solved by optimizing first over z_u and then over z_v. The optimal value of z_u can be computed in closed form in terms of z_v, since only normal densities are involved. Then the optimal value of z_v is found via gradient methods (e.g., BFGS). The function log ξ(z_u, z_v) is concave, so we can always find the global optimum.
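A minimal sketch of the Laplace step in Eqs. (19)-(20): find the mode of a concave log ξ and take the inverse of the negative Hessian at the mode as the covariance. Here log ξ combines a Gaussian context term with a Poisson likelihood, mimicking Eqs. (13)-(14); the dimensions, the numbers, and the use of BFGS from scipy are illustrative assumptions, and for simplicity the sketch optimizes the mode jointly rather than using the closed-form elimination of z_u described above.

```python
import numpy as np
from scipy.optimize import minimize

m = np.array([40.0, 60.0])      # mean of the Gaussian context term (illustrative)
S = np.diag([25.0, 36.0])       # its covariance (illustrative)
y = np.array([38.0, 65.0])      # Poisson observations, lambda = 1

def neg_log_xi(z):
    if np.any(z <= 0):
        return np.inf           # Poisson mean must stay positive
    prior = 0.5 * (z - m) @ np.linalg.solve(S, z - m)
    lik = np.sum(z - y * np.log(z))   # -log Poisson(y | z) up to constants
    return prior + lik

mode = minimize(neg_log_xi, x0=m, method="BFGS").x   # Eq. (19)
hess = np.linalg.inv(S) + np.diag(y / mode**2)       # -grad^2 log xi at the mode
cov = np.linalg.inv(hess)                            # Eq. (20)
```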
Note that this decomposition approach only depends on the tree structure of the model and hence will work for any observation distribution.

At the mode, we find the mean and variance of the normal distribution approximating p(z_u, z_v | y) via (19) and (20). With this distribution, the edge counts can be inferred with the method of Section 3.2. In the projection step in (17) and (18), this distribution is projected onto one of z_u or z_v by marginalizing out the other.

What is the computational complexity of inference with the GCGM? When inferring node counts, we must solve the optimization problem and compute a fixed number of matrix inverses. Each matrix inverse takes O(L^3) time. In the Laplace approximation step, each gradient calculation takes O(L^2) time. Suppose m iterations are needed. In the outer loop, suppose we must perform r passes of EP message passing and each pass sweeps through the whole tree. Then the overall time is O(r |E| max(mL^2, L^3)). The maximization problem in the Laplace approximation is smooth and concave, so it is relatively easy. In our experiments, EP usually converges within 10 iterations.

In the task of inferring edge counts, we only consider the complexity of calculating the mean, as this is all that is needed in our applications. This part is solved in closed form, with the most time-consuming operation being the matrix inversion. By exploiting the simple structure of the covariance matrix of z_{uv}, we can obtain an inference method with time complexity of O(L^3).
5. Experimental Evaluation
In this section, we evaluate the performance of our method and compare it to the MAP approximation of Sheldon, Sun, Kumar, and Dietterich (2013). The evaluation data are generated from the bird migration model introduced in Sheldon et al. (2013). This model simulates the migration of a population of N birds on an L = ℓ × ℓ map. The entire population is initially located in the bottom left corner of the map. Each bird then makes independent migration decisions for T = 20 time steps. The transition probability from cell i to cell j at each time step is determined by a logistic regression equation that employs four features. These features encode the distance from cell i to cell j, the degree to which cell j falls near the path from cell i to the destination cell in the upper right corner, the degree to which cell j lies in the direction toward which the wind is blowing, and a factor that encourages the bird to stay in cell i. Let w denote the parameter vector for this logistic regression formula. In this simulation, the individual model for each bird is a T-step Markov chain X = (X_1, . . . , X_T) where the domain of each X_t consists of the L cells in the map. The CGM variables n = (n_1, n_{1,2}, n_2, . . . , n_T) are vectors containing the counts of the number of birds in each cell at time t (length L) and the number of birds moving from cell i to cell j between times t and t+1 (length L^2). We will refer to these as the "node counts" (N) and the "edge counts" (E). At each time step t, the data generation model generates an observation vector y_t of length L which contains noisy counts of birds at all map cells at time t, n_t. The observed counts are generated by a Poisson distribution with unit intensity.

We consider two inference tasks. In the first task, the parameters of the model are given, and the task is to infer the expected value of the posterior distribution over n_t for each time step t given the observations y_1, . . . , y_T (aka "smoothing"). We measure the accuracy of the node counts and edge counts separately.

An important experimental issue is that we cannot compute the true posterior expectations of the node and edge counts. Of course we have the values generated during the simulation, but because of the noise introduced into the observations, these are not necessarily the expected values of the posterior. Instead, we estimate the expected values by running the MCMC method (Sheldon & Dietterich, 2011) for a burn-in period of 1 million Gibbs iterations and then collecting samples from 10 million Gibbs iterations and averaging the results. We evaluate the accuracy of the approximate methods as the relative error ||n_app − n_mcmc|| / ||n_mcmc||, where n_app is the approximate estimate and n_mcmc is the value obtained from the Gibbs sampler. In each experiment, we report the mean and standard deviation of the relative error computed from 10 runs. Each run generates a new set of values for the node counts, edge counts, and observation counts and requires a separate MCMC baseline run.

We compare our method to the approximate MAP method introduced by Sheldon et al. (2013). By treating counts as continuous and approximating the log factorial function, their MAP method finds the approximate mode of the posterior distribution by solving a convex optimization problem.
Their work shows that the MAP method is much more efficient than the Gibbs sampler and produces inference results and parameter estimates very similar to those obtained from long MCMC runs.

The second inference task is to estimate the parameters w of the transition model from the observations. This is performed via Expectation Maximization, where our GCGM method is applied to compute the E step. We compute the relative error with respect to the true model parameters.

Table 1 compares the inference accuracy of the approximate MAP and GCGM methods. In this table, we fixed L = 36, set the logistic regression coefficient vector w = (1, 2, 2, 2), and varied the population size N ∈ {36, 360, 1080, 3600}. At the smallest population size, the MAP approximation is slightly better, although the result is not statistically significant. This makes sense, since the Gaussian approximation is weakest when the population size is small. At all larger population sizes, the GCGM gives much more accurate results. Note that the MAP approximation exhibits much higher variance as well.
Table 1. Relative error in estimates of node counts ("N") and edge counts ("E") for different population sizes N.
           N = 36        N = 360       N = 1080      N = 3600
MAP(N)     .173 ± .020   .066 ± .015   .064 ± .012   .069 ± .013
MAP(E)     .350 ± .030   .164 ± .030   .166 ± .027   .178 ± .025
GCGM(N)    .184 ± .018   .039 ± .007   .017 ± .003   .009 ± .002
GCGM(E)    .401 ± .026   .076 ± .008   .034 ± .003   .017 ± .002

Our second inference experiment is to vary the magnitude of the logistic regression coefficients. With large coefficients, the transition probabilities become more extreme (closer to 0 and 1), and the Gaussian approximation should not work as well. We fixed N = 1080 and L = 36 and evaluated three different parameter vectors: w_{0.5} = (0.5, 1, 1, 1), w_1 = (1, 2, 2, 2), and w_2 = (2, 4, 4, 4). Table 2 shows that for w_{0.5} and w_1, the GCGM is much more accurate, but for w_2, the MAP approximation gives a slightly better result, although it is not statistically significant based on 10 trials.
Table 2. Relative error in estimates of node counts ("N") and edge counts ("E") for different settings of the logistic regression parameter vector w.

           w_{0.5}       w_1           w_2
MAP(N)     .107 ± .014   .064 ± .012   .018 ± .004
MAP(E)     .293 ± .038   .166 ± .027   .031 ± .004
GCGM(N)    .013 ± .002   .017 ± .003   .024 ± .004
GCGM(E)    .032 ± .004   .034 ± .003   .037 ± .005

Our third inference experiment explores the effect of varying the size of the map. This increases the size of the domain for each of the random variables and also increases the number of values that must be estimated (as well as the amount of evidence that is observed). We vary L ∈ {16, 25, 36, 49}. We scale the population size accordingly, by setting N = 30L. We use the coefficient vector w_1. The results in Table 3 show that for the smallest map, both methods give similar results. But as the number of cells grows, the relative error of the MAP approximation grows rapidly, as does the variance of the result. In comparison, the relative error of the GCGM method barely changes.
Table 3. Relative inference error for different map sizes L.
           L = 16        L = 25        L = 36        L = 49
MAP(N)     .011 ± .005   .025 ± .007   .064 ± .012   .113 ± .015
MAP(E)     .013 ± .004   .056 ± .012   .166 ± .027   .297 ± .035
GCGM(N)    .017 ± .003   .017 ± .003   .017 ± .003   .020 ± .003
GCGM(E)    .024 ± .002   .027 ± .003   .034 ± .003   .048 ± .005

We now turn to measuring the relative accuracy of the methods during learning. In this experiment, we set L = 16 and vary the population size over N ∈ {16, 160, 480, 1600}. After each EM iteration, we compute the relative error as ||w_learn − w_true|| / ||w_true||, where w_learn is the parameter vector estimated by the learning methods and w_true is the parameter vector that was used to generate the data. Figure 1 shows the training curves for the three parameter vectors w_{0.5}, w_1, and w_2. The results are consistent with our previous experiments. For small population sizes (N = 16 and N = 160), the GCGM does not do as well as the MAP approximation. In some cases, it overfits the data. For N = 16, the MAP approximation also exhibits overfitting. For w_2, which creates extreme transition probabilities, we also observe that the MAP approximation learns faster, although the GCGM eventually matches its performance with enough EM iterations.

Our final experiment measures the CPU time required to perform inference. In this experiment, we varied L ∈ {16, 36, 64, 100, 144} and set N = 100L. We used parameter vector w. We measured the CPU time consumed to infer the node counts and the edge counts.
Figure 1. EM convergence curves for different feature coefficients and population sizes. Each panel plots relative error against EM iteration for the MAP and GCGM methods at population sizes 16, 160, 480, and 1600; the panels correspond to w = (0.5, 1, 1, 1), w = (1, 2, 2, 2), and w = (2, 4, 4, 4).
Figure 2.
A comparison of inference run time with different numbers of cells L. (Running time in seconds versus domain size L ∈ {16, 36, 64, 100, 144}, for MAP, GCGM, and GCGM inferring node counts only.)

The MAP method infers the node and edge counts jointly, whereas the GCGM first infers the node counts and then computes the edge counts from them. We report the time required for computing just the node counts and also the total time required to compute the node and edge counts. Figure 2 shows that the running time of the MAP approximation is much larger than the running time of the GCGM approximation. For all values of L except 16, the GCGM is, on average, more than 6 times faster than the MAP approximation. The plot also reveals that the computation time of GCGM is dominated by estimating the node counts. A detailed analysis of the implementation indicates that the Laplace optimization step is the most time-consuming.

In summary, the GCGM method achieves relative error that matches or is smaller than that achieved by the MAP approximation. This is true both when measured in terms of estimating the values of the latent node and edge counts and when estimating the parameters of the underlying graphical model. The GCGM method does this while running more than a factor of 6 faster. The GCGM approximation is particularly good when the population size is large and when the transition probabilities are not near 0 or 1. Conversely, when the population size is small or the probabilities are extreme, the MAP approximation gives better answers, although the differences were not statistically significant based on only 10 trials. A surprising finding is that the MAP approximation has much larger variance in its answers than the GCGM method.
6. Concluding Remarks
This paper has introduced the Gaussian approximation (GCGM) to the Collective Graphical Model (CGM). We have shown that for the case where the observations only depend on the separators, the GCGM is the limiting distribution of the CGM as the population size N → ∞. We showed that the GCGM covariance matrix maintains the conditional independence structure of the CGM, and we presented a method for efficiently inverting this covariance matrix. By applying expectation propagation, we developed an efficient algorithm for message passing in the GCGM with non-Gaussian observations. Experiments on a bird migration simulation showed that the GCGM method is at least as accurate as the MAP approximation of Sheldon et al. (2013), that it exhibits much lower variance, and that it is 6 times faster to compute.

Acknowledgement
This material is based upon work supported by the National Science Foundation under Grant No. 1125228.
References
Dawid, A. P. and Lauritzen, S. L. Hyper Markov laws in the statistical analysis of decomposable graphical models. The Annals of Statistics, 21(3):1272–1317, 1993.

Feller, W. An Introduction to Probability Theory and Its Applications, Vol. 2. Wiley, 1968.

Lauritzen, S. L. Graphical Models. Oxford University Press, USA, 1996.

Loh, Po-Ling and Wainwright, Martin J. Structure estimation for discrete graphical models: Generalized covariance matrices and their inverses. The Annals of Statistics, 41(6):3022–3049, 2013.

Sheldon, Daniel and Dietterich, Thomas G. Collective Graphical Models. In NIPS 2011, 2011.

Sheldon, Daniel, Sun, Tao, Kumar, Akshat, and Dietterich, Thomas G. Approximate Inference in Collective Graphical Models. In Proceedings of ICML 2013, pp. 9, 2013.

Sundberg, R. Some results about decomposable (or Markov-type) models for multidimensional contingency tables: distribution of marginals and partitioning of tests. Scandinavian Journal of Statistics, 2(2):71–79, 1975.

Wainwright, M. J. and Jordan, M. I. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1-2):1–305, 2008.

Ypma, Alexander and Heskes, Tom. Novel approximations for inference in nonlinear dynamical systems using expectation propagation. Neurocomputing, 69(1-3):85–99, 2005.
A. Proof of Proposition 1
The usual way of writing the CGM distribution is to replace f(n; θ) in Eq. (3) by

f'(n; θ) = [ ∏_{C∈C, i_C ∈ X^{|C|}} μ_C(i_C)^{n_C(i_C)} ] / [ ∏_{S∈S, i_S ∈ X^{|S|}} ( μ_S(i_S)^{n_S(i_S)} )^{ν(S)} ].    (21)

We will show that f(n; θ) = f'(n; θ) for any n such that h(n) > 0 by showing that both describe the probability of an ordered sample with sufficient statistics n. Indeed, suppose there exists some ordered sample X = (x^1, . . . , x^N) with sufficient statistics n. Then it is clear from inspection of Eq. (3) and Eq. (21) that f(n; θ) = ∏_{m=1}^{N} p(x^m; θ) = f'(n; θ) by the junction tree reparameterization of p(x; θ) (Wainwright & Jordan, 2008). It only remains to show that such an X exists whenever h(n) > 0. This is exactly what was shown by Sheldon & Dietterich (2011): for junction trees, the hard constraints of Eq. (4), which enforce local consistency on the integer count variables, are equivalent to the global consistency property that there exists some ordered sample X with sufficient statistics equal to n. (Since these are integer count variables, the proof is quite different from the similar theorem that local consistency implies global consistency for marginal distributions.) We briefly note two interesting corollaries to this argument. First, by the same reasoning, any reparameterization of p(x; θ) that factors in the same way can be used to replace f(n; θ) in the CGM distribution. Second, we can see that the base measure h(n) is exactly the number of different ordered samples with sufficient statistics equal to n.
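For concreteness, the reparameterization argument can be written out for a three-node chain; this is an illustrative special case supplied here, not an equation taken from the paper.

```latex
% Illustrative special case: chain X_1 - X_2 - X_3 with cliques {1,2}, {2,3},
% separator {2}, and nu({2}) = 1.  The junction tree reparameterization of a
% single individual is p(x) = mu_{12}(x_1,x_2) mu_{23}(x_2,x_3) / mu_2(x_2),
% so multiplying over an ordered sample with sufficient statistics n gives Eq. (21):
\prod_{m=1}^{N} p(x^m;\theta)
  = \frac{\prod_{i_1,i_2} \mu_{12}(i_1,i_2)^{\,n_{12}(i_1,i_2)}
          \;\prod_{i_2,i_3} \mu_{23}(i_2,i_3)^{\,n_{23}(i_2,i_3)}}
         {\prod_{i_2} \mu_{2}(i_2)^{\,n_{2}(i_2)}}
  = f'(\mathbf{n};\theta).
```

B. Proof of Theorem 1: Additional Details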