A Direct Sampler for G-Wishart Variates
Alex Lenkoski*
Norwegian Computing Center
April 5, 2013
Abstract
The G-Wishart distribution is the conjugate prior for precision matrices that encode the conditional independencies of a Gaussian graphical model. While the distribution has received considerable attention, posterior inference has proven computationally challenging, in part due to the lack of a direct sampler. In this note, we rectify this situation. The existence of a direct sampler offers a host of new possibilities for the use of G-Wishart variates. We discuss one such development by outlining a new transdimensional model search algorithm, which we term double reversible jump, that leverages this sampler to avoid normalizing constant calculation when comparing graphical models. We conclude with two short studies meant to investigate our algorithm's validity.
* Corresponding author address: Alex Lenkoski, Norsk Regnesentral, P.O. Box 114 Blindern, NO-0314 Oslo, Norway. E-mail: [email protected]

1 Introduction
The Gaussian graphical model (GGM) has received widespread consideration (see Jones et al., 2005), and estimators obeying graphical constraints in standard Gaussian sampling were proposed as early as Dempster (1972). Initial incorporation of GGMs in Bayesian estimation largely focused on decomposable graphs (Dawid and Lauritzen, 1993), since prior distributions then factorize into products of Wishart distributions. Roverato (2002) generalizes the Hyper-Inverse Wishart distribution to arbitrary graphs and, by consequence, specifies a conjugate prior for sparse precision matrices K. Atay-Kayis and Massam (2005) further develop this prior and outline a Monte Carlo (MC) method that enables the computation of Bayes factors. Following Letac and Massam (2007) and Rajaratnam et al. (2008), Lenkoski and Dobra (2011) term this distribution the G-Wishart, and propose computational improvements to direct model comparison and model search.

The desire to embed the G-Wishart distribution in more complicated hierarchical frameworks, particularly those involving latent Gaussianity, exposed difficulties with the MC approximation (see Dobra et al., 2011; Wang and Li, 2012; Cheng and Lenkoski, 2012, for discussion). These difficulties were partly related to numerical instability (Wang and Li, 2012), but were also methodological, as a realization of K was needed from the current model in order to update other hierarchical parameters (Cheng and Lenkoski, 2012). At the time, a host of Markov chain Monte Carlo (MCMC) methods had been proposed (Piccioni, 2000; Mitsakakis et al., 2011; Dobra and Lenkoski, 2011; Dobra et al., 2011), as well as an accept/reject sampler (Wang and Carvalho, 2010), which Dobra et al. (2011) show suffers from very low acceptance probabilities even in moderate dimensional problems. Despite these developments, no way of reliably sampling directly from a G-Wishart distribution had been proposed.

We rectify this situation. Our direct sampler is quite similar to the block Gibbs sampler of Piccioni (2000) and involves sampling a standard Wishart variate from a full model and using the iterative proportional scaling (IPS) algorithm (Dempster, 1972) to then place this variate in the correct space. Our approach differs critically, however, from the block Gibbs sampler of Piccioni (2000) in that sampling occurs first, and independently of previous samples, with the subsequent application of the IPS algorithm relative to a fixed target.

The existence of a direct sampler considerably expands the usefulness of the G-Wishart distribution. We provide one example of this by proposing a new method of moving through the space of GGMs. The reversible jump algorithms developed in Dobra and Lenkoski (2011) and Dobra et al. (2011) provided a means of model averaging K in the context of more involved Bayesian models. As noted by Wang and Li (2012), these approaches still require the use of unstable MC approximations of prior normalizing constants. With a direct sampler, we are now able to resolve this issue by proposing a new transdimensional algorithm that combines the concept behind the exchange algorithm (Murray et al., 2006) with reversible jump MCMC (Green, 1995), which we call double reversible jump.

The article is organized as follows. In Section 2 we review the G-Wishart distribution and propose the direct sampler. Section 3 develops the new double reversible jump algorithm. In Section 4 we provide two short examples meant to confirm the validity of our new approach. We conclude in Section 5.

2 The G-Wishart distribution
2.1 Definition and conditional distributions

Suppose that we collect data 𝒟 = {Z^(1), ..., Z^(n)} such that Z^(j) ∼ N_p(0, K^{-1}) independently for j ∈ {1, ..., n}, where K ∈ P_p, the space of p × p symmetric positive definite matrices. This sample has likelihood

pr(𝒟 | K) = (2π)^{-np/2} |K|^{n/2} exp( −(1/2) ⟨K, U⟩ ),

where ⟨A, B⟩ = tr(A′B) denotes the trace inner product and U = ∑_{i=1}^n Z^(i) Z^(i)′.

Further suppose that G = (V, E) is a conditional independence graph, where V = {1, ..., p} and E ⊂ V × V. As in Cheng and Lenkoski (2012), we will slightly abuse notation throughout by writing (i, j) ∈ G to indicate that the edge (i, j) is in the edge set E. Associated with G is a subspace P_G ⊂ P_p such that K ∈ P_G implies that K ∈ P_p and K_ij = 0 whenever (i, j) ∉ G. The G-Wishart distribution (Roverato, 2002; Atay-Kayis and Massam, 2005) W_G(δ, D) assigns probability to K ∈ P_G as

pr(K | δ, D, G) = (1 / I_G(δ, D)) |K|^{(δ−2)/2} exp( −(1/2) ⟨K, D⟩ ),  K ∈ P_G.

This distribution is conjugate (Roverato, 2002), and thus

pr(K | δ, D, G, 𝒟) = W_G(δ + n, D + U).

Let C = {C_1, ..., C_J} be a clique decomposition of the graph G. For our purposes we assume that this decomposition is maximally complete. We thus have that K_{C_j} ∈ P_{|C_j|} for each j ∈ {1, ..., J} and K ∈ P_G. We define the function B_{C_j}(·) by

B_{C_j}(K ∖ K_{C_j}) = K_{C_j, V∖C_j} K_{V∖C_j}^{-1} K_{V∖C_j, C_j}.

Then, given K ∼ W_G(δ, D) and any clique C_j of the graph G, Roverato (2002) proves that

K_{C_j} | K ∖ K_{C_j} ∼ W(δ, D_{C_j}, B_{C_j}(K ∖ K_{C_j})),   (1)

where, in general, we write K ∼ W(δ, D, B) to denote any matrix for which K − B ∼ W(δ, D). Equation (1) thereby gives the conditional distributions for an overlapping partition of E and proves critical to the developments below.

2.2 The IPS algorithm and the block Gibbs sampler

As above, let C_j be one of the cliques of G. For A ∈ P_{|C_j|}, define the transformation

T_{C_j, A} : P_G → P_G   (2)

where

[T_{C_j, A}(K)]_{C_j} = A + B_{C_j}(K ∖ K_{C_j}),

while [T_{C_j, A}(K)]_{lk} = K_{lk} if either l or k is not in C_j. Lenkoski and Dobra (2011) use (2) to determine

K̂_G = argmax_{K ∈ P_G} |K|^{(δ−2)/2} exp( −(1/2) ⟨K, D⟩ )

via an algorithm known as iterative proportional scaling (IPS), following the work of Dempster (1972). The IPS algorithm works by constructing a chain K^(0), K^(1), ... such that K^(0) = I_p and K^(s) is determined from K^(s−1) through the update

K^(s) = T_{C_J, D_{C_J}^{-1}} ∘ ... ∘ T_{C_1, D_{C_1}^{-1}} (K^(s−1)).

Eventually K^(s) converges to K̂_G; see Lauritzen (1996) for an in-depth discussion of the properties of the IPS algorithm.

The IPS algorithm takes deterministic updates and therefore converges to a unique matrix. Piccioni (2000) extends the IPS idea to create an MCMC sampler for W_G(δ, D). The block Gibbs sampler of Piccioni (2000) works by starting with a K^(0) ∈ P_G and constructing a chain K^(1), K^(2), ... via the update

K^(s) = T_{C_J, K̃_J} ∘ ... ∘ T_{C_1, K̃_1} (K^(s−1)),

where K̃_j is sampled from a W(δ, D_{C_j}). We thus see that each subblock C_j is sampled from its full conditional according to (1), satisfying the requirements of a Gibbs sampler.

2.3 A direct sampler

We borrow ideas from Section 2.2 to specify a direct sampler for W_G(δ, D). First sample K* ∼ W(δ, D) and determine Σ = (K*)^{-1}. Set K^(0) = I_p and construct a chain K^(1), K^(2), ... where K^(s) is updated from K^(s−1) via

K^(s) = T_{C_J, Σ_{C_J}^{-1}} ∘ ... ∘ T_{C_1, Σ_{C_1}^{-1}} (K^(s−1)).   (3)

Eventually K^(s) will converge to a matrix K ∈ P_G.
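To make the construction concrete, here is a minimal numpy/scipy sketch of the sampler defined by (3). The function names (transform_clique, sample_gwishart), the fixed sweep count, and the use of scipy's Wishart generator are our own illustrative choices, and the clique decomposition is assumed to be supplied by the caller. The sketch uses the standard fact that, on the full graph, W_G(δ, D) is an ordinary Wishart with degrees of freedom δ + p − 1 and scale matrix D^{-1}.

import numpy as np
from scipy.stats import wishart

def transform_clique(K, C, A):
    """Apply the map T_{C,A} of (2): set the C-block of K to A plus B_C,
    leaving all other entries unchanged."""
    p = K.shape[0]
    C = np.asarray(C)
    rest = np.setdiff1d(np.arange(p), C)
    if rest.size:
        # B_C(K \ K_C) = K_{C, V\C} K_{V\C}^{-1} K_{V\C, C}
        B = K[np.ix_(C, rest)] @ np.linalg.solve(K[np.ix_(rest, rest)],
                                                 K[np.ix_(rest, C)])
    else:
        B = 0.0
    out = K.copy()
    out[np.ix_(C, C)] = A + B
    return out

def sample_gwishart(delta, D, cliques, n_sweeps=50):
    """Direct sampler of Section 2.3 (a sketch): draw K* under the full model,
    then run IPS-style sweeps (3) toward the fixed targets inv(Sigma_{C_j})."""
    p = D.shape[0]
    # On the full graph, W_G(delta, D) is a standard Wishart with
    # df = delta + p - 1 and scale D^{-1} in scipy's parameterization.
    K_star = wishart.rvs(df=delta + p - 1, scale=np.linalg.inv(D))
    Sigma = np.linalg.inv(K_star)
    K = np.eye(p)
    for _ in range(n_sweeps):  # fixed sweep count stands in for a convergence test
        for C in cliques:
            C = np.asarray(C)
            K = transform_clique(K, C, np.linalg.inv(Sigma[np.ix_(C, C)]))
    return K

For the four-cycle of Section 4.1, for example, the clique list is cliques = [[0, 1], [0, 2], [1, 3], [2, 3]] in 0-indexed form.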
We note that the key difference between our algorithm and that of Piccioni (2000) is the point at which random sampling occurs. In the block Gibbs sampler, new matrices are sampled in each step of the IPS update according to the appropriate conditional distribution. In our framework, sampling occurs first, relative to the full model and independently of all previous samples, and the IPS is then run with a fixed target.

The question remains what properties K has inherited from K*. Note that, by the nature of these updates, we have

K_{C_j} − B_{C_j}(K ∖ K_{C_j}) = K*_{C_j} − B_{C_j}(K* ∖ K*_{C_j}) = Σ_{C_j}^{-1}

for j ∈ {1, ..., J}. This fact is critical. By properties of standard Wishart variates, we know that

K*_{C_j} − B_{C_j}(K* ∖ K*_{C_j}) ∼ W(δ, D_{C_j});

since this matrix has not changed, we similarly have that

K_{C_j} − B_{C_j}(K ∖ K_{C_j}) ∼ W(δ, D_{C_j})

or, equivalently,

K_{C_j} | K ∖ K_{C_j} ∼ W(δ, D_{C_j}, B_{C_j}(K ∖ K_{C_j}))

for all j ∈ {1, ..., J}. We note that several properties of K* are not shared by K. For instance, let F = C_1 ∪ C_2 ⊂ V. We have that

K*_F − B_F(K* ∖ K*_F) ∼ W(δ, D_F),

while this does not hold for K, since K_{lk} = 0 for any l ∈ C_1 and k ∈ C_2 ∖ C_1. Thus, while the conditional distributions of K* are not fully transferred by (3), those that are relevant for W_G(δ, D) variates are retained.

The fact that K ∼ W_G(δ, D) then follows from Brook (1964). In particular, we have specified a sampler that has positive density over P_G, and the conditional distributions along a complete partition of the parameter set E correspond to those of a W_G(δ, D).

2.4 An alternative to the IPS algorithm

The sampler discussed in Section 2.3 relied on the IPS algorithm to move from K* ∈ P_p to K ∈ P_G. While the IPS is useful in illuminating the properties of K ∈ P_G, it is computationally burdensome. This is for two reasons, the first of which is the requirement that the clique decomposition C be both determined and stored, an NP-hard problem. Further, the matrix B_{C_j}(K ∖ K_{C_j}) must be determined at each step for all j, an action that requires a solve involving K_{V∖C_j}. If the cliques of G are small, this matrix will be nearly p × p.

Hastie et al. (2009) discuss an alternative algorithm to the one described in Section 2.2, which can be modified to determine K ∈ P_G from K* ∈ P_p (Moghaddam et al., 2009, discuss its use in determining K̂_G). It works in the following manner (a code sketch follows the listing):

1. Set W = Σ.
2. For j = 1, ..., p:
   a. Let N_j ⊂ V be the set of neighbors of node j in G. Form W_{N_j} and Σ_{N_j, j} and solve β̂*_j = W_{N_j}^{-1} Σ_{N_j, j}.
   b. Form β̂_j ∈ R^{p−1} by copying the elements of β̂*_j to the appropriate locations and putting zeroes in those locations not connected to j in G.
   c. Replace W_{j, −j} and W_{−j, j} with W_{−j, −j} β̂_j.
3. Repeat step 2 until convergence.
4. Return K = W^{-1}.
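Under our assumptions (the graph passed as a boolean adjacency matrix adj; the function name, tolerance, and sweep cap are illustrative choices, not part of the original algorithm), a minimal numpy sketch of this procedure is:

import numpy as np

def project_to_PG(Sigma, adj, tol=1e-10, max_sweeps=1000):
    """Map Sigma = inv(K*) to K in P_G via the updates of Hastie et al. (2009):
    at convergence W agrees with Sigma on the diagonal and the edges of G,
    and K = inv(W) has zeros off the graph."""
    p = Sigma.shape[0]
    W = Sigma.copy()
    for _ in range(max_sweeps):
        W_old = W.copy()
        for j in range(p):
            N = np.flatnonzero(adj[:, j])      # neighbors of node j
            beta = np.zeros(p)
            if N.size:
                # beta*_j = W_{N_j}^{-1} Sigma_{N_j, j}
                beta[N] = np.linalg.solve(W[np.ix_(N, N)], Sigma[N, j])
            mask = np.arange(p) != j
            # W_{-j, j} <- W_{-j, -j} beta_j; the zeros in beta handle step b
            w = W[np.ix_(mask, mask)] @ beta[mask]
            W[mask, j] = w
            W[j, mask] = w
        if np.max(np.abs(W - W_old)) < tol:
            break
    return np.linalg.inv(W)

Replacing the IPS sweeps in sample_gwishart above with K = project_to_PG(Sigma, adj) yields the same draw while avoiding the clique decomposition entirely.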
3 The double reversible jump algorithm

The direct sampler discussed in Section 2.3 opens the possibility for a host of new applications of the G-Wishart distribution in hierarchical Bayesian modeling. We focus on the problem of constructing a computationally efficient algorithm for mixing over the posterior pr(K, G | 𝒟), thereby forming a model averaged estimate of K. We build upon the reversible jump algorithms developed in Dobra and Lenkoski (2011) and further extended in Dobra et al. (2011).

Let G be given and suppose that K ∼ W_G(δ + n, D + U). Let Φ be the upper triangular matrix such that Φ′Φ = K, its Cholesky decomposition. The transformation from K to Φ has Jacobian

J(K → Φ) = ∏_{i=1}^p Φ_ii^{ν_i^G},

where ν_i^G = |{j : (i, j) ∈ G, i < j}| (Roverato, 2002). Working with Φ is useful when K ∈ P_G, since its primary restriction is that

Φ_ij = −(1/Φ_ii) ∑_{l=1}^{i−1} Φ_li Φ_lj   (4)

for any (i, j) ∉ G. Otherwise, Φ_ii ∈ R^+ while Φ_ij ∈ R for (i, j) ∈ G. We refer to the completion of Φ as the action of using (4) to augment a matrix for which only the elements of G are specified.
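As a sketch, completion can be implemented by sweeping (4) over the non-edges in column-major order, so that every term on the right-hand side has already been set. The function below, whose name is our own, assumes the free elements (the diagonal and the edges of G) are already in place.

import numpy as np

def complete_cholesky(Phi, adj):
    """Fill the non-free entries of the upper triangular Phi via (4), so that
    K = Phi' Phi satisfies K_ij = 0 for every (i, j) outside G."""
    p = Phi.shape[0]
    for j in range(p):          # columns left to right
        for i in range(j):      # rows top to bottom
            if not adj[i, j]:
                # Phi_ij = -(1/Phi_ii) * sum_{l < i} Phi_li Phi_lj
                Phi[i, j] = -(Phi[:i, i] @ Phi[:i, j]) / Phi[i, i]
    return Phi

This ordering guarantees that Φ_li and Φ_lj for l < i are final before Φ_ij is computed.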
Dobra and Lenkoski (2011) use this representation to move between neighboring graphs in the context of a larger MCMC. Suppose that (K, G) is the current state of an MCMC chain, where K ∈ P_G, and we would like to attempt moving to G̃, which we assume to be equal to G except for the additional edge (l, m). The algorithm of Dobra and Lenkoski (2011) first determines Φ, the Cholesky factor of K, samples γ ∼ N(Φ_lm, σ_g²) and forms Φ̃, where Φ̃_ij = Φ_ij for i = j or (i, j) ∈ G, while Φ̃_lm = γ. Φ̃ is then completed according to G̃. This proposal is then accepted with probability min{α, 1}, where

α = exp( −(1/2) ⟨K̃ − K, D + U⟩ ) Φ_ll √(2π) σ_g exp( (γ − Φ_lm)² / (2σ_g²) ) × I_G(δ, D) / I_G̃(δ, D).   (5)

Subsequent to this move, the matrix K has typically been updated according to the accepted graph using MCMC methods, for instance the block Gibbs sampler.

Several embellishments of the algorithm of Dobra and Lenkoski (2011) have been developed: asymmetric model moves in the graph space and permutation of the elements of K to increase acceptance (Dobra et al., 2011), conditional Bayes factors that obviate the need for reversible jump when comparing neighboring graphs (Wang and Li, 2012), and notions of sparse Cholesky decompositions and node reorderings that reduce the time spent computing Φ (Cheng and Lenkoski, 2012).

Each of these developments has proven to yield some improvement in performance in certain situations. However, the most important technical problem with (5) revolves around the calculation of the normalizing constants I_G and I_G̃. These factors require MC approximation (Atay-Kayis and Massam, 2005), which Wang and Li (2012) rather convincingly show fails in high dimensions.

Wang and Li (2012) propose an alternative approach, which borrows ideas from the exchange algorithm (Murray et al., 2006) and the double Metropolis-Hastings algorithm (Liang, 2010) to approximate this ratio. Unfortunately, the double Metropolis-Hastings algorithm is not exact, though the approximation used by Wang and Li (2012) appears to work well in practice for neighboring graphs. We note that the approach of Wang and Li (2012) is not feasible if the graphs are not neighbors.

The exchange algorithm (Murray et al., 2006) has proven a useful tool for general MCMC when working with models where the likelihood has an intractable normalizing constant. Wang and Li (2012) discuss how to use the concept behind the exchange algorithm to aid model comparison, where prior distributions, like those of the G-Wishart, have a similar unknown normalizing constant that varies according to the model. Unfortunately, without a direct sampler, Wang and Li (2012) relied on the block Gibbs sampler to propose a version of the double Metropolis-Hastings algorithm (Liang, 2010). This approach should only be considered approximate, whereas the original exchange algorithm avoids normalizing constant calculations and still yields correct MCMC transition probabilities.

With the existence of a direct sampler for W_G variates, however, we may use a modification of the exchange algorithm to avoid the normalizing constants in (5). We call this new approach double reversible jump.

Suppose that G is the graph in the current state of an MCMC procedure and that we propose a new graph G̃. For the moment, assume that G̃ is a neighbor of G with the additional edge (l, m) ∈ G̃; we discuss the relaxation of this assumption in Section 5. The double reversible jump algorithm then proceeds as follows:

1. Sample K ∼ W_G(δ + n, D + U) and form Φ, its Cholesky decomposition. Let ϑ = Φ_lm.

2. Sample K̃⁰ ∼ W_G̃(δ, D) and form Φ̃⁰. Let

   ϑ̃ = −(1/Φ̃⁰_ll) ∑_{r=1}^{l−1} Φ̃⁰_rl Φ̃⁰_rm.
3. Sample γ ∼ N(ϑ, σ_g²) and set γ̃ = Φ̃⁰_lm.
4. Form Φ̃ where Φ̃_ij = Φ_ij for (i, j) ∈ G or i = j, and set Φ̃_lm = γ. Complete Φ̃ according to G̃ and set K̃ = Φ̃′Φ̃.
5. Form Φ⁰ where Φ⁰_ij = Φ̃⁰_ij for (i, j) ∈ G or i = j, and Φ⁰_lm = ϑ̃. Complete Φ⁰ according to G and set K⁰ = (Φ⁰)′Φ⁰.

6. Accept the move from G to G̃ with probability min{1, α}, where

α = [ exp( −(1/2) ⟨K̃ − K, D + U⟩ ) / exp( −(1/2) ⟨K̃⁰ − K⁰, D⟩ ) ] × (Φ_ll / Φ̃⁰_ll) × exp( ((γ − ϑ)² − (γ̃ − ϑ̃)²) / (2σ_g²) ).

We see that the double reversible jump algorithm considers switching from (K, G, K̃⁰, G̃) to the alternative (K̃, G̃, K⁰, G) by performing two reversible jump moves: one between (K, G) and (K̃, G̃) according to the posterior parameters δ + n and D + U, and the other between (K̃⁰, G̃) and (K⁰, G) according to the prior parameters δ and D. By doing so, the prior normalizing constants in (5) cancel, making double reversible jump the transdimensional equivalent of the exchange algorithm of Murray et al. (2006).
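To see the cancellation explicitly, factor α into the two reversible jump ratios. Schematically (our bookkeeping, introduced only for this illustration), the posterior-level move contributes

α_post(G → G̃) = [ I_G(δ, D) / I_G̃(δ, D) ] · R_post,

which is (5) with R_post collecting the normalizing-constant-free terms, while the prior-level move contributes

α_prior(G̃ → G) = [ I_G̃(δ, D) / I_G(δ, D) ] · R_prior.

Multiplying,

α = α_post · α_prior = R_post · R_prior,

so neither I_G(δ, D) nor I_G̃(δ, D) is ever evaluated; the only new requirement is a draw from W_G̃(δ, D), which the direct sampler of Section 2.3 supplies.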
4 Examples

4.1 A sanity check

We begin with a simple sanity check to ensure that the direct sampler of Section 2.3 returns identical results to the block Gibbs sampler when both are run for an exceedingly long time. We set p = 4 and G = C_4, the four-cycle, where edges (1, 4) and (2, 3) are missing. We then consider sampling from W_{C_4}(δ, D), where we set δ = 103 and took D to be a fixed 4 × 4 positive definite matrix, randomly generated to resemble the posterior distribution after observing 100 samples drawn from a N(0, I). We then ran the block Gibbs sampler as well as the direct sampler for 10 million iterations each, with an additional one million iterations for the block Gibbs sampler as burn-in. The expectation of K taken over the samples from the block Gibbs sampler and the expectation taken over the samples from the proposed direct sampler were identical to the precision we considered. Further, we note that all other comparisons we could consider (for instance element-wise variances, quantiles of determinants, and medians) likewise returned identical results. Different fixed graphs and choices of δ or D do not affect the results.

Since 10 million samples of the block Gibbs sampler after a one million sample burn-in should be expected to characterize a W_{C_4}(δ, D) distribution, this brief study appears to confirm that our proposed sampler is indeed a direct sampler for G-Wishart variates.
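A scaled-down version of this check can be scripted as follows, reusing transform_clique and sample_gwishart from the sketch in Section 2.3. The matrix D below is our own stand-in, generated as the text describes, and the run lengths are far shorter than the 10 million iterations used in the study.

import numpy as np
from scipy.stats import wishart

def block_gibbs_sweep(K, delta, D, cliques, rng):
    """One sweep of Piccioni's block Gibbs sampler: resample each clique
    block from its full conditional (1)."""
    for C in cliques:
        C = np.asarray(C)
        A = wishart.rvs(df=delta + len(C) - 1,
                        scale=np.linalg.inv(D[np.ix_(C, C)]), random_state=rng)
        K = transform_clique(K, C, A)
    return K

rng = np.random.default_rng(1)
p, delta = 4, 103
Z = rng.standard_normal((100, p))
D = np.eye(p) + Z.T @ Z                      # resembles a posterior scale matrix
cliques = [[0, 1], [0, 2], [1, 3], [2, 3]]   # four-cycle: (1,4), (2,3) absent (1-indexed)

n_iter, burn = 20000, 1000
K_bg = np.eye(p)
bg_mean = np.zeros((p, p))
ds_mean = np.zeros((p, p))
for s in range(n_iter + burn):
    K_bg = block_gibbs_sweep(K_bg, delta, D, cliques, rng)
    if s >= burn:
        bg_mean += K_bg / n_iter
        ds_mean += sample_gwishart(delta, D, cliques) / n_iter
print(np.max(np.abs(bg_mean - ds_mean)))     # should shrink as n_iter grows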
4.2 The Iris data

Both Roverato (2002) and Atay-Kayis and Massam (2005) use Fisher's Iris Virginica dataset to confirm their approximations of I_G. These data consist of four measurements, Sepal Length (SL), Sepal Width (SW), Petal Length (PL) and Petal Width (PW), taken on 50 iris plants. We use these data to compare the double reversible jump algorithm to an exhaustive scoring of all models using the Monte Carlo approximation of Atay-Kayis and Massam (2005). We set δ = 3, D = I_p and σ_g = 1.

For all 64 models in the graph space, we determine the model probability by running the MC approximation of Atay-Kayis and Massam (2005) for one million iterations. Further, we run the double reversible jump algorithm for five million iterations and discard the first 100,000 iterations as burn-in. This takes approximately ten minutes on a 2.8 GHz desktop running Linux and should be recognized as an extremely long chain. Table 1 shows the pairwise edge probabilities returned by the two methods. As shown in the table, the estimated edge probabilities using the two approaches agree. We note that if the double reversible jump chain is run for less time, say 50,000 iterations (which takes approximately 6 seconds), results are nearly, but not perfectly, identical. Model moves are accepted in 23.2% of attempts; alternative choices of σ_g appear to have a marginal effect on this acceptance level. See Section 5 for a discussion of ways to improve mixing through more involved schemes.

These results indicate that the double reversible jump, coupled with the direct sampler, enables model averaged estimates of K to be formed without needing the unstable MC approximation of Atay-Kayis and Massam (2005).

5 Discussion

We have proposed a direct sampler for G-Wishart variates, which promises to dramatically improve the usefulness of this distribution. In this note we have focused on using this sampler to develop a transdimensional MCMC algorithm that requires no normalizing constant evaluations. While this is a promising first step, there are considerable additional avenues for development.

While the direct sampler performs well, in our mind the entire process is still too slow. In high dimensions, the majority of computing time is spent moving from K* ∈ P_p to K ∈ P_G. While the development in Section 2.4 is considerably faster (and dramatically more stable) than the use of the IPS algorithm, we feel that there must be potential for further improvements. Connecting with the rapid development of procedures for forming glasso (Friedman et al., 2008) estimators would be fruitful in improving the efficiency of the sampler in high dimensions, since this action can be phrased as a constrained optimization problem.

Rodriguez et al. (2011) consider embedding the G-Wishart distribution inside Dirichlet processes and related structures from nonparametric Bayesian methods. However, decomposable graphs were used, since a direct sampler, which is critical in the posterior sampling of nonparametric models, was unavailable for nondecomposable models. It is now possible to consider the use of general graphical models in Bayesian nonparametric approaches.

Our development of the double reversible jump algorithm was partially to show how the direct sampler could be used to avoid prior normalizing constant evaluations when comparing models. A host of embellishments could be made. When comparing neighboring graphs, for instance, conditional Bayes factors could be computed as in Wang and Li (2012) or Cheng and Lenkoski (2012). In our mind, a more promising avenue for development would be to construct a procedure for global moves in the graph space.
In order to work properly, we feel that global moves must be coupled with better proposals in the double reversible jump scheme. Relating these proposals to some of the guidelines in Rue and Held (2005) could prove useful in this regard. The ability to make large, focused moves in the graph space will be critical to extending the G-Wishart distribution to truly high dimensional problems.

References
Atay-Kayis, A. and Massam, H. (2005). A Monte Carlo method for computing the marginal likelihood in nondecomposable Gaussian graphical models. Biometrika, 92:317-335.

Brook, D. (1964). On the distinction between the conditional probability and the joint probability approaches in the specification of nearest-neighbour systems. Biometrika, 51:481-483.

Cheng, Y. and Lenkoski, A. (2012). Hierarchical Gaussian graphical models: Beyond reversible jump. Electronic Journal of Statistics, 6:2309-2331.

Dawid, A. P. and Lauritzen, S. L. (1993). Hyper Markov laws in the statistical analysis of decomposable graphical models. Annals of Statistics, 21:1272-1317.

Dempster, A. P. (1972). Covariance selection. Biometrics, 28:157-175.

Dobra, A. and Lenkoski, A. (2011). Copula Gaussian graphical models and their application to modeling functional disability data. Annals of Applied Statistics, 5:969-993.

Dobra, A., Lenkoski, A., and Rodriguez, A. (2011). Bayesian inference for general Gaussian graphical models with application to multivariate lattice data. Journal of the American Statistical Association, 106:1418-1433.

Friedman, J., Hastie, T., and Tibshirani, R. (2008). Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3):432-441.

Green, P. J. (1995). Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika, 82:711-732.

Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference and Prediction. Springer.

Jones, B., Carvalho, C., Dobra, A., Hans, C., Carter, C., and West, M. (2005). Experiments in stochastic computation for high-dimensional graphical models. Statistical Science, 20:388-400.

Lauritzen, S. L. (1996). Graphical Models. Oxford University Press.

Lenkoski, A. and Dobra, A. (2011). Computational aspects related to inference in Gaussian graphical models with the G-Wishart prior. Journal of Computational and Graphical Statistics, 20:140-157.

Letac, G. and Massam, H. (2007). Wishart distributions for decomposable graphs. Annals of Statistics, 35:1278-1323.

Liang, F. (2010). A double Metropolis-Hastings sampler for spatial models with intractable normalizing constants. Journal of Statistical Computation and Simulation, 80:1007-1022.

Mitsakakis, N., Massam, H., and Escobar, M. D. (2011). A Metropolis-Hastings based method for sampling from the G-Wishart distribution in Gaussian graphical models. Electronic Journal of Statistics, 5:18-30.

Moghaddam, B., Marlin, B. M., Khan, M. E., and Murphy, K. P. (2009). Accelerating Bayesian structural inference for non-decomposable Gaussian graphical models. In Conference on Neural Information Processing Systems.

Murray, I., Ghahramani, Z., and MacKay, D. (2006). MCMC for doubly-intractable distributions. In Proceedings of the 22nd Annual Conference on Uncertainty in Artificial Intelligence.

Piccioni, M. (2000). Independence structure of natural conjugate densities to exponential families and the Gibbs sampler. Scandinavian Journal of Statistics, 27:111-127.

Rajaratnam, B., Massam, H., and Carvalho, C. M. (2008). Flexible covariance estimation in graphical Gaussian models. Annals of Statistics, 36:2818-2849.

Rodriguez, A., Dobra, A., and Lenkoski, A. (2011). Sparse covariance estimation in heterogeneous samples. Electronic Journal of Statistics, 5:981-1014.

Roverato, A. (2002). Hyper inverse Wishart distribution for non-decomposable graphs and its application to Bayesian inference for Gaussian graphical models. Scandinavian Journal of Statistics, 29:391-411.

Rue, H. and Held, L. (2005). Gaussian Markov Random Fields. Chapman & Hall.

Wang, H. and Carvalho, C. M. (2010). Simulation of hyper-inverse Wishart distributions for non-decomposable graphs. Electronic Journal of Statistics, 4:1470-1475.

Wang, H. and Li, S. Z. (2012). Efficient Gaussian graphical model determination under G-Wishart prior distributions. Electronic Journal of Statistics, 6:168-198.

Acknowledgement
Alex Lenkoski's work is funded by Statistics for Innovation, (sfi)².