A Hybrid Approximation to the Marginal Likelihood
Eric Chuu Debdeep Pati Anirban Bhattacharya
Texas A&M University Texas A&M University Texas A&M University
Abstract
Computing the marginal likelihood or evidence is one of the core challenges in Bayesian analysis. While there are many established methods for estimating this quantity, they predominantly rely on using a large number of posterior samples obtained from a Markov Chain Monte Carlo (MCMC) algorithm. As the dimension of the parameter space increases, however, many of these methods become prohibitively slow and potentially inaccurate. In this paper, we propose a novel method in which we use the MCMC samples to learn a high probability partition of the parameter space and then form a deterministic approximation over each of these partition sets. This two-step procedure, which comprises both a probabilistic and a deterministic component, is termed a Hybrid approximation to the marginal likelihood. We demonstrate its versatility in a plethora of examples with varying dimension and sample size, and we also highlight the Hybrid approximation's effectiveness in situations where there is either a limited number of MCMC samples or only approximate MCMC samples available.
Model selection and model averaging are among the most important inferential goals in Bayesian statistics. These goals inherently rely on evaluating model uncertainty, which in turn comes down to calculating the marginal likelihood of competing models. This makes accurate and efficient computation of the marginal likelihood an important problem.

Suppose we observe data $y$ with a likelihood function $p(y \mid u)$ indexed by $u$ coming from some parameter space $U$. Provided that the prior distribution over the unknowns is specified, the marginal likelihood or evidence can be written as

$$p(y) = \int_U p(y \mid u)\, p(u)\, du. \qquad (1)$$

Barring specific conjugate settings, the marginal likelihood is analytically intractable in practice and poses a computationally challenging problem. Since numerical integration becomes infeasible beyond moderate dimension, Monte Carlo approximations present an alternative solution. In much of the literature devoted to estimating this quantity, the recurring idea is to form an asymptotically unbiased approximation of (1) using MCMC samples. However, running an MCMC algorithm in addition to forming a Monte Carlo approximation can quickly accrue error as the dimension of the parameter space increases. Consequently, these algorithms may require an exceedingly large number of samples in order to form accurate estimates. In many problems, however, obtaining MCMC samples is time-consuming and potentially unreliable.

Proceedings of the 24th International Conference on Artificial Intelligence and Statistics (AISTATS) 2021, San Diego, California, USA. PMLR: Volume 130. Copyright 2021 by the author(s).
As such, the need for an approximation that does not rely too heavily on both the quantity and quality of the MCMC samples is evident.

Some commonly used algorithms include Laplace's method (Tierney and Kadane, 1986), which assumes that the posterior distribution can be approximated with a normal distribution, and the Harmonic Mean estimator (Newton and Raftery, 1994), which is easy to implement but has been shown to be unstable and can have infinite variance (Newton and Raftery, 1994; Raftery et al., 2007). In a similar vein, the Adjusted Harmonic Mean estimator (Lenk, 2009) and the Corrected Arithmetic Mean estimator (Pajor, 2017) leverage the harmonic mean and arithmetic mean identities, respectively, with the idea of sampling from high posterior probability regions of the parameter space to improve upon the original estimators. Annealed Importance Sampling (Neal, 2001) uses a dynamic importance sampling function that sequentially transitions through intermediate distributions to the target distribution.
Other popular algorithms include Chib's method (Chib, 1995; Chib and Jeliazkov, 2001), Bridge Sampling (Meng and Wong, 1996), Warp Bridge Sampling (Meng and Schilling, 2002), and Nested Sampling (Skilling, 2006). See Friel and Wyse (2012) for a more comprehensive overview and discussion of these algorithms. There have also been recent developments in variational inference techniques that provide alternative ways to approximate and bound the marginal likelihood (Rezende and Mohamed, 2015; Salimans et al., 2015).

In contrast to these methods, we propose a novel approach which can be thought of as a hybrid between probabilistic and deterministic procedures. A high-level view of our method can be broken down into two major steps: (i) the MCMC samples are used to learn a partition of the parameter space $U$, and (ii) with this partition, we then make a deterministic approximation to the log posterior on each of the partition sets. In essence, we seek to exploit the assumption that the posterior distribution will be far from a uniform-looking distribution and will instead show concentration around some parameter. If the partition obtained from the MCMC samples can identify areas of high posterior mass by carving up these regions more finely, then we are better equipped to make an accurate approximation to the log posterior over each of these regions. Given the use of a probabilistic procedure in step (i), coupled with a deterministic calculation in step (ii), we refer to the resulting approximation to the marginal likelihood as the Hybrid estimator.

Our contribution fundamentally provides a way to bypass the need for a large number of posterior samples for accurate computation of the marginal likelihood.
In many applications, evaluating the likelihood can be extremely time consuming, so collecting many posterior samples in such cases is prohibitively expensive in both time and computation. The typical guarantees for MCMC-based estimates of the marginal likelihood are asymptotic in the number of posterior samples. Our approach instead only uses the MCMC samples to learn a skeleton of the posterior distribution, which then simplifies the subsequent calculation; hence, the Hybrid estimator establishes a scalable framework for computing the evidence in high dimensional problems.

The paper is organized as follows. In Section 2, we motivate each step in the algorithm and provide a formal statement of the Hybrid approximation scheme. In Section 3, we demonstrate the performance of the Hybrid estimator in a variety of simulation studies and compare the results with some of the aforementioned estimators. Finally, in Section 4, we briefly discuss some details for extensions and future work.
We introduce some preliminary notation. Let $\gamma$ be a probability density with respect to the Lebesgue measure on $\mathbb{R}^d$ given by

$$\gamma(u) = \frac{e^{-\Phi(u)}\,\pi(u)}{Z}, \qquad u \in U \subseteq \mathbb{R}^d.$$

When $\Phi(\cdot)$ corresponds to a negative log-likelihood function and $\pi(\cdot)$ a prior distribution, $\gamma(\cdot)$ is the corresponding posterior distribution, although such an interpretation is not necessary for our approach. Then, the marginal likelihood has the following form,

$$Z = \int_U e^{-\Psi(u)}\, du, \qquad (2)$$

where $\Psi(u) = \Phi(u) + \{-\log \pi(u)\}$ is the negative log-posterior. As stated before, while we can evaluate $\Psi$, we are unable to compute the integral in (2). We can address this problem using the two sub-routines mentioned in the previous section. First, we find a partition of the parameter space that gives more attention to (i.e., more finely partitions) regions of the posterior that have high posterior mass. Next, we propose a suitable approximation for $\Psi$ that allows for easier evaluation of the integral over each of the partition sets learned from the previous step. Used in conjunction with each other, these steps give us a way to approximate $Z$ by computing a simplified version of the integral over partition sets of the parameter space that have ideally taken into account the assumed non-uniform nature of the posterior distribution.

We first elaborate on our strategy to replace $\Psi$ with an approximation $\widehat{\Psi}$. Our starting point is the following observation: fix $q \in (0, 1)$ small and let $A \subseteq U$ be a compact subset with $\gamma(A) \ge 1 - q$. Rearranging this inequality, one obtains $1 - q \le \gamma(A) = Z^{-1} \int_A e^{-\Psi(u)}\, du \le$
$1$, leading to the two-sided bound

$$\int_A e^{-\Psi(u)}\, du \;\le\; Z \;\le\; \frac{1}{1-q} \int_A e^{-\Psi(u)}\, du. \qquad (3)$$

We then make the following approximation

$$\log Z \approx F_A := \log\Big[ \int_A e^{-\Psi(u)}\, du \Big]. \qquad (4)$$

From Eq. (3), it is immediate that $|\log Z - F_A| \le \log\{1/(1-q)\} \approx q$ for $q$ small. Henceforth, we aim to estimate the quantity $F_A$. This initial approximation step can be thought of as compactifying the parameter space to reduce its entropy. Even if $U$ itself is compact, $\gamma$ can be highly concentrated in a region $A$ with $\mathrm{vol}(A) \ll \mathrm{vol}(U)$, particularly when the posterior exhibits concentration (Ghosal and Van Der Vaart, 2007), and it is judicious to eliminate such low posterior-probability regions.

Having compactified the integral domain, our general plan is to replace $\Psi$ with a suitable approximation $\widehat{\Psi}$ on the compact set $A$. In this article, we specifically focus on a piecewise constant approximation of the form

$$\widehat{\Psi}(u) = \sum_{k=1}^{K} c^{\star}_k \, \mathbb{1}_{A_k}(u), \qquad (5)$$

where $\mathcal{A} = \{A_1, \ldots, A_K\}$ is a partition of $A$, i.e., $A = \bigcup_{k=1}^{K} A_k$ and $A_k \cap A_{k'} = \emptyset$ for all $k \ne k'$, and $c^{\star}_k$ is a representative value of $\Psi$ within the partition set $A_k$. To simplify the ensuing calculations, we further restrict ourselves to dyadic partitions in this article so that each of the partition sets is rectangular, $A_k = \prod_{l=1}^{d} [a^{(l)}_k, b^{(l)}_k]$. This leads to the approximation

$$\int_A e^{-\Psi(u)}\, du \approx \int_A e^{-\widehat{\Psi}(u)}\, du = \sum_{k=1}^{K} e^{-c^{\star}_k}\, \mu(A_k), \qquad (6)$$

where $\mu(B) = \int_B du$ denotes the $d$-dimensional volume of a set $B$. We eventually define

$$\widehat{F}_A := \log\Big[ \int_A e^{-\widehat{\Psi}(u)}\, du \Big] = \log\Big[ \sum_{k=1}^{K} e^{-c^{\star}_k}\, \mu(A_k) \Big] \qquad (7)$$

to be our estimator of $F_A$, and hence of $\log Z$. The choice of the piecewise constant approximation is motivated both by its approximation capabilities (Binev et al., 2005) as well as the analytic tractability of the approximating integral in Eq. (7).
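As a concrete sanity check of Eqs. (4)-(7), the following sketch (our own illustration, not taken from the paper) evaluates the piecewise-constant estimate $\widehat{F}_A$ for a one-dimensional standard normal target, where $Z = 1$ and hence $\log Z = 0$. An equal-width grid with midpoint representatives stands in for the tree-based partition introduced later.

```python
import math

def psi(u):
    # negative log of the standard normal density e^{-u^2/2} / sqrt(2*pi)
    return 0.5 * u * u + 0.5 * math.log(2.0 * math.pi)

def f_hat(a=-5.0, b=5.0, K=200):
    # piecewise-constant estimate of log Z over A = [a, b] with K equal cells,
    # using Psi evaluated at each cell midpoint as the representative value c_k
    width = (b - a) / K
    logs = []
    for k in range(K):
        mid = a + (k + 0.5) * width
        logs.append(-psi(mid) + math.log(width))   # log(e^{-c_k} * mu(A_k))
    # log-sum-exp over the K cells for numerical stability
    m = max(logs)
    return m + math.log(sum(math.exp(x - m) for x in logs))

print(f_hat())   # close to the true value log Z = 0
```

With a fine partition and a compact set $A$ capturing essentially all of the mass, the estimate agrees with $\log Z$ to high accuracy, illustrating that the error is controlled by the partition quality and the truncation level $q$.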
We remark here that the integral remains tractable if a piecewise linear approximation is employed, suggesting a natural generalization of our estimator.

Since $F_A$ is a non-linear functional of $\Psi$, it is reasonable to question the validity of the approximation in Eq. (6), or equivalently, the approximation of $F_A$ with $\widehat{F}_A$: even if $\widehat{\Psi}$ is a good approximation to $\Psi$, it is not immediately clear if the same should be true of $\widehat{F}_A$. Using an interpolation trick, we show below that the approximation error $|\widehat{F}_A - F_A|$ can be bounded in terms of a specific distance between $\widehat{\Psi}$ and $\Psi$. Define

$$F(t) = \log\Big[ \int_A e^{-\{t \Psi(u) + (1-t) \widehat{\Psi}(u)\}}\, du \Big], \qquad t \in [0, 1].$$

Clearly, $F(0) = \widehat{F}_A$ and $F(1) = F_A$, so that

$$F_A - \widehat{F}_A = F(1) - F(0) = \int_0^1 F'(t)\, dt.$$

Computing $F'$, we get

$$F'(t) = -\frac{\int_A \{\Psi(u) - \widehat{\Psi}(u)\}\, e^{-\{t \Psi(u) + (1-t) \widehat{\Psi}(u)\}}\, du}{\int_A e^{-\{t \Psi(u) + (1-t) \widehat{\Psi}(u)\}}\, du} = -\,\mathbb{E}_{U \sim \pi_t}\big[ \Psi(U) - \widehat{\Psi}(U) \big],$$

where $\pi_t$ is the probability density on $A$ given by

$$\pi_t(u) \propto e^{-\{t \Psi(u) + (1-t) \widehat{\Psi}(u)\}}, \qquad u \in A.$$

Using the integral representation, we can now bound the approximation error,

$$|F_A - \widehat{F}_A| \le \sup_{t \in [0,1]} \Big| \mathbb{E}_{U \sim \pi_t}\big[ \Psi(U) - \widehat{\Psi}(U) \big] \Big|.$$

Interestingly, note that $\pi_1 = \gamma|_A$ is our target density restricted to $A$, and $\pi_0(u) \propto e^{-\widehat{\Psi}(u)}\, \mathbb{1}_A(u)$ has normalizing constant $e^{\widehat{F}_A}$. The collection of densities $\{\pi_t\}$ can therefore be thought of as continuously interpolating between $\pi_0$ and $\pi_1$. Piecing together the various approximations, we arrive at the following result.

Proposition 1.
For any compact subset $A \subseteq U$ with $\gamma(A) \ge 1 - q$, we have

$$|\widehat{F}_A - \log Z| \le \sup_{t \in [0,1]} \Big| \mathbb{E}_{U \sim \pi_t}\big[ \Psi(U) - \widehat{\Psi}(U) \big] \Big| + \log\Big( \frac{1}{1-q} \Big).$$

The first term on the right hand side above can be further bounded by $\|\Psi - \widehat{\Psi}\|_{\infty} := \sup_{u \in A} |\Psi(u) - \widehat{\Psi}(u)|$. This conclusion is not restricted to the piecewise constant approximation and can be used for other approximations, such as the piecewise linear one.

Next, we address the task of obtaining a suitable partition of the parameter space. Clearly, traditional quadrature methods would render this method ineffective, requiring the number of function evaluations to grow exponentially with $d$. Furthermore, with a posterior distribution that exhibits any degree of concentration, there will indubitably be regions of $U$ where the posterior probability is close to 0. From a computationally mindful standpoint, it then makes sense to focus on more finely partitioned regions of $U$ that have high posterior probability. With this in mind, we turn to using samples from $\gamma$ to obtain such a partition. Specifically, let $u_1, \ldots, u_J$ be approximate samples from $\gamma$, e.g., the output of an MCMC procedure. We treat $\{(u_j, \Psi(u_j))\}_{j=1}^{J}$ as covariate-response pairs and feed them to a tree-based model such as CART (Breiman, 1984), implemented in the R package rpart
Figure 1:
Left: a bivariate normal distribution truncated to the first orthant. Right: a density of the form $\gamma(u) \propto \exp(-n\, u^\top u)\, \pi(u)$, where $u \in [0,1]^2$ and $\pi(\cdot)$ is the uniform measure on $[0,1]^2$, for a fixed value of $n$. Both plots show 5000 MCMC samples with the partition returned from fitting a CART model.

(Therneau and Atkinson, 2019), to obtain a dyadic partition. While the MCMC samples are typically used to construct Monte Carlo averages, we use them to construct a high probability partition of the parameter space. We assume the capability to evaluate $\Psi$, which is a very mild assumption since obtaining samples from $\gamma$ using even a basic sampler like Metropolis–Hastings requires evaluating $\Psi$. Finally, the above procedure implicitly suggests taking the compactification $A$ to be a bounding box determined by the range of the posterior samples,

$$A = \bigotimes_{l=1}^{d} \Big[ \min_{1 \le j \le J} u^{(l)}_j,\; \max_{1 \le j \le J} u^{(l)}_j \Big],$$

where $u^{(l)}_j$ is the $l$th component of $u_j$.

Before moving into higher dimensions, we provide an illustration of the process described in the previous section in 2 dimensions, where the partitioning can be easily visualized. Suppose $\gamma$ is a density supported on $U \subseteq \mathbb{R}^2$, and $u_j \sim \gamma$ for $j = 1, \ldots, J$. Forming the pairs $\{(u_j, \Psi(u_j))\}_{j=1}^{J}$, we then fit a CART model to these points and extract the decision rules, which form a dyadic partition of the aforementioned bounding box $A \subseteq U$. Denote the partition as $\mathcal{A} = \{A_1, \ldots, A_K\}$.
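The partition-learning step can be sketched in Python. As an assumption on tooling, we use scikit-learn's DecisionTreeRegressor as a stand-in for rpart (the paper itself works in R), and recover each rectangular leaf cell by walking the fitted tree downward from the bounding box of the samples.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Draw stand-in "posterior" samples from a 2-d standard normal and form
# the (u_j, Psi(u_j)) covariate-response pairs; here Psi is the exact
# negative log-density, mirroring the paper's setup.
U = rng.standard_normal((5000, 2))
psi = 0.5 * (U ** 2).sum(axis=1) + np.log(2 * np.pi)

tree = DecisionTreeRegressor(max_leaf_nodes=32).fit(U, psi)

def leaf_boxes(t, lo, hi):
    # recover the rectangular leaf cells (and their fitted constants)
    # by traversing the tree, starting from the bounding box [lo, hi]
    boxes = []
    def walk(node, lo, hi):
        if t.children_left[node] == -1:            # leaf node
            boxes.append((lo.copy(), hi.copy(), t.value[node].item()))
            return
        f, s = t.feature[node], t.threshold[node]
        hi2 = hi.copy(); hi2[f] = s
        walk(t.children_left[node], lo, hi2)       # u[f] <= s goes left
        lo2 = lo.copy(); lo2[f] = s
        walk(t.children_right[node], lo2, hi)      # u[f] > s goes right
    walk(0, lo, hi)
    return boxes

boxes = leaf_boxes(tree.tree_, U.min(axis=0), U.max(axis=0))
print(len(boxes))   # one rectangular cell per leaf, partitioning A
```

Because every split is axis-aligned, the recovered cells tile the bounding box exactly, and cells are smallest where the samples (and hence the posterior mass) concentrate.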
Plotting the sampled points and overlaying the partition learned from the regression tree, we observe in Figure 1 that areas of $U$ with a high concentration of points coincide with regions that are more finely partitioned by the regression tree. Taking $\gamma$ to be a posterior distribution, we see that this behavior of more finely partitioning areas of greater posterior mass is desirable in producing a better approximation. Equipped with the partition $\mathcal{A}$, we need only determine the representative point of each partition set in order to form the approximation to $\Psi$.

Recall that CART fits a constant for each point within a given partition set. At any given stage, the CART model will search for the optimal predictor value, $u = (u_1, u_2)$, on which to partition the remaining points such that the sum of squares error (SSE) between the response, $\Psi(u)$, and the predicted constant is minimized. In particular, to partition data into two regions $A_1$ and $A_2$, the objective function is given by

$$\mathrm{SSE} = \sum_{u_i \in A_1} \big( \Psi(u_i) - c_1 \big)^2 + \sum_{u_i \in A_2} \big( \Psi(u_i) - c_2 \big)^2. \qquad (8)$$

Upon minimization of the SSE, the resulting partition sets $A_1$ and $A_2$ have fitted values $c_1$ and $c_2$, respectively. For each partition set $A_k \in \mathcal{A}$, a natural choice for the representative point $c^{\star}_k$ is the fitted value for $A_k$ produced by the tree-fitting algorithm. Following this two-step process of using CART to obtain both the partition and the fitted values for each of the partition sets and then plugging these into Eq. (6), we obtain the Hybrid approximation to the marginal likelihood.

We consider the following conjugate normal model:

$$y_i \mid \mu, \sigma^2 \sim N(\mu, \sigma^2), \qquad \mu \mid \sigma^2 \sim N(m_0, \sigma^2 / w_0), \qquad \sigma^2 \sim \mathrm{IG}(r_0/2, s_0/2),$$

where $\mathrm{IG}(\cdot, \cdot)$ denotes the inverse-gamma distribution. In order to compute the Hybrid estimator, we require samples from the posterior distribution and a way to evaluate $\Psi$. In this example, the posterior distribution of $u = (\mu, \sigma^2)$ is known, and since the likelihood and prior are specified, the evaluation of $\Psi$ is straightforward.
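A minimal sketch of evaluating $\Psi(u) = -\log p(y \mid u) - \log \pi(u)$ for this conjugate model follows; the hyperparameter values m0, w0, r0, s0 below are illustrative placeholders, not the paper's settings.

```python
import math

# illustrative hyperparameters for the normal inverse-gamma prior
m0, w0, r0, s0 = 0.0, 1.0, 1.0, 1.0

def log_ig(x, shape, rate):
    # log density of the inverse-gamma distribution IG(shape, rate)
    return (shape * math.log(rate) - math.lgamma(shape)
            - (shape + 1) * math.log(x) - rate / x)

def psi(mu, sig2, y):
    # Psi(u) = -(log-likelihood + log-prior) for u = (mu, sigma^2)
    n = len(y)
    loglik = (-0.5 * n * math.log(2 * math.pi * sig2)
              - sum((yi - mu) ** 2 for yi in y) / (2 * sig2))
    logprior = (-0.5 * math.log(2 * math.pi * sig2 / w0)
                - w0 * (mu - m0) ** 2 / (2 * sig2)
                + log_ig(sig2, r0 / 2, s0 / 2))
    return -(loglik + logprior)

y = [0.3, -1.2, 0.8, 0.1]
print(psi(0.0, 1.0, y))   # a finite negative log-posterior value
```

Each posterior sample $u_j = (\mu_j, \sigma^2_j)$ is then paired with `psi(mu_j, sig2_j, y)` before being passed to the tree.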
With this architecture in place, we feed the pairs $\{(u_j, \Psi(u_j))\}_{j=1}^{J}$ through CART to obtain a partition over the parameter space and each partition set's representative point. Then, we use Eq. (6) to compute the final approximation. Table 1 shows results for the Hybrid estimator and a number of other competing methods. Here, the

Table 1:
Normal Inverse-Gamma Example. We report the mean, standard deviation, average error (AE, truth - estimated), and the root mean squared error (RMSE), taken over 100 replications. Each replication has 50 observations and 1000 posterior samples. The true log marginal likelihood is -113.143. Estimators include the Harmonic Mean estimator (HME), Corrected Arithmetic Mean estimator (CAME), Bridge Sampling estimator (BSE), and the Hybrid estimator (HybE).
Method       Mean       SD      AE      RMSE
log HME     -104.762   0.733   -8.381   8.431
log CAME    -112.704   0.048   -0.439   0.441
log BSE     -113.143   0.006    0.000   0.006
log HybE    -113.029   0.025   -0.114   0.117

true log marginal likelihood can be computed in closed form, so we have direct comparisons to the ground truth. All estimators except for the Harmonic Mean estimator give accurate approximations to the log marginal likelihood.
Until this point, the representative point within each partition set has simply been the fitted value returned by the CART model. When $\{(u_j, \Psi(u_j))\}_{j=1}^{J}$ is fed into the tree, it attempts to optimize the sum of squared errors as in Eq. (8). Note, however, that our eventual objective is to best approximate the functional $\log \int_A e^{-\Psi}$, and it is not unreasonable to suspect that the optimal value for $A_k$ chosen by the regression tree model may not be a suitable choice for our end goal, especially in higher dimensions. Simulations in higher dimensions indeed confirm this. Before suggesting a remedy, we offer some additional understanding of the approximation mechanism that guides us toward an improved choice. To that end, write $\widehat{F}_A$ from Eq. (7) as

$$\widehat{F}_A = \log\Big[ \sum_{k=1}^{K} e^{-c^{\star}_k}\, p_k \Big] + \log \mu(A) =: \widehat{G} + \log \mu(A),$$

where recall $\mu(B) = \mathrm{vol}(B)$ is the Lebesgue measure of a Borel set $B$, and we define $p_k := \mu(A_k)/\mu(A)$. We can also write $F_A = G + \log \mu(A)$, with

$$G := \log\Big[ \frac{1}{\mu(A)} \int_A e^{-\Psi(u)}\, du \Big] = \log\Big[ \sum_{k=1}^{K} \frac{p_k}{\mu(A_k)} \int_{A_k} e^{-\Psi(u)}\, du \Big] = \log\Big[ \sum_{k=1}^{K} e^{-\bar{c}_k}\, p_k \Big],$$

where

$$e^{-\bar{c}_k} = \frac{1}{\mu(A_k)} \int_{A_k} e^{-\Psi(u)}\, du = \mathbb{E}_{U_k \sim \mathrm{Unif}(A_k)}\big[ e^{-\Psi(U_k)} \big].$$

Thus, for $\widehat{G}$ to approximate $G$, we would ideally like each $c^{\star}_k$ chosen so that $e^{-c^{\star}_k}$ targets $e^{-\bar{c}_k}$. Importantly, the above exercise suggests the appropriate scale on which to perform the approximation: rather than working on the linear scale as in Eq. (8), it is potentially advantageous to work on the exponential scale.

Based on the above discussion, we define a family of objective functions

$$Q_k(c) = \sum_{u \in A_k} \frac{\big| e^{-\Psi(u)} - e^{-c} \big|}{e^{-\Psi(u)}}, \qquad c \in \mathbb{R}, \qquad (9)$$

one for each partition set $A_k$ returned by the tree, and set $c^{\star}_k = \operatorname{argmin}_c Q_k(c)$. We experimented with a
We experimented with anumber of different metrics before zeroing in on theabove relative error criterion in the exponential scale.Minimizing (9) is a weighted (cid:96) problem and admits aclosed-form solution.Thus, our overall algorithm can be summarized as fol-lows. We obtain samples u , . . . , u J from γ , and feed tp u j , Ψ p u j qu Jj “ through a tree to partition the bound-ing box A of the samples. Then, rather than usingthe default fitted values returned by the tree, we takethe representative value c ‹ k within each A k as the min-imizer of Q k . These c ‹ k s are then used to compute p F A as in (7) – note that p F A can be stably computed usingthe log-sum-exp trick. Finally, we declare p F A as log p Z ,our estimator of log Z . Algorithm 1
Hybrid Approximation
Input:
A sampler for $\gamma$ and a method for evaluating $\Psi$
Output:
An estimate of the log marginal likelihood

Sample $\{u_j\}_{j=1}^{J}$ from $\gamma$
Fit $\{(u_j, \Psi(u_j))\}_{j=1}^{J}$ using a regression tree
From the fitted tree, extract the dyadic partition $\mathcal{A} = \{A_1, A_2, \ldots, A_K\}$ of the bounding box $A = \bigotimes_{l=1}^{d} \big[ \min_j u^{(l)}_j, \max_j u^{(l)}_j \big]$ determined by the samples, with $A_k = \prod_{l=1}^{d} [a^{(l)}_k, b^{(l)}_k]$
for $k \in \{1, \ldots, K\}$ do
    $c^{\star}_k \leftarrow \operatorname{argmin}_c Q_k(c)$
    $\widehat{Z}_k \leftarrow e^{-c^{\star}_k} \prod_{l=1}^{d} \big( b^{(l)}_k - a^{(l)}_k \big)$
end for
Use the log-sum-exp trick to compute the final estimator, $\log \widehat{Z} = \text{log-sum-exp}\big( \log \widehat{Z}_1, \ldots, \log \widehat{Z}_K \big)$
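Putting the pieces together, the following is an end-to-end sketch of the algorithm on a 2-d standard normal target (so $\log Z = 0$), again using scikit-learn's regression tree as a stand-in for rpart. For the representative values, we take the closed-form minimizer of the weighted $\ell_1$ problem in Eq. (9) to be a weighted median of the cell's $e^{-\Psi}$ values; this specific closed form is our reading of the paper's remark, not a formula stated in the text.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from scipy.special import logsumexp

rng = np.random.default_rng(1)
U = rng.standard_normal((5000, 2))                     # stand-in "MCMC" samples from gamma
psi = 0.5 * (U ** 2).sum(axis=1) + np.log(2 * np.pi)   # exact Psi = -log gamma, so log Z = 0

tree = DecisionTreeRegressor(max_leaf_nodes=64).fit(U, psi)
t = tree.tree_

def cells(node, lo, hi, idx):
    # walk the fitted tree, returning (lower corner, upper corner,
    # sample indices) for every rectangular leaf cell A_k
    if t.children_left[node] == -1:
        return [(lo, hi, idx)]
    f, s = t.feature[node], t.threshold[node]
    hi2 = hi.copy(); hi2[f] = s
    lo2 = lo.copy(); lo2[f] = s
    return (cells(t.children_left[node], lo, hi2, idx[U[idx, f] <= s])
            + cells(t.children_right[node], lo2, hi, idx[U[idx, f] > s]))

def weighted_median(y, w):
    # closed-form minimizer of sum_i w_i * |x - y_i| over x
    order = np.argsort(y)
    cw = np.cumsum(w[order])
    return y[order][np.searchsorted(cw, 0.5 * cw[-1])]

terms = []
for lo, hi, idx in cells(0, U.min(axis=0), U.max(axis=0), np.arange(len(U))):
    yvals = np.exp(-psi[idx])                        # e^{-Psi} at the cell's samples
    x_star = weighted_median(yvals, 1.0 / yvals)     # minimizes Q_k of Eq. (9)
    terms.append(np.log(x_star) + np.sum(np.log(hi - lo)))   # log(e^{-c_k*} mu(A_k))

log_Z_hat = logsumexp(terms)                         # log-sum-exp over the K cells
print(log_Z_hat)
```

On this toy target the estimate lands near $\log Z = 0$; the remaining error reflects both the coarseness of the 64-cell partition and the truncation to the bounding box.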
Figure 2:
Boxplots of the error (truth - estimate) for the log marginal likelihood in the MVN-IG (left, true log p(y): -303.8482) and truncated MVN (right, true log p(y): -250.2755) examples. Both examples correspond to 20-dimensional parameter spaces. Results are reported over 100 simulations, with 100 observations. Estimates are based on 45 MCMC samples. The results correspond to the natural logarithm of each of the estimators.

In the following experiments, we present a variety of problem settings. First, we consider the linear regression model under different prior specifications where the true marginal likelihood is known, so we can easily verify the accuracy of any subsequent approximations. We then extend the application of the Hybrid estimator to examples in which the parameter is a $d \times d$ covariance matrix, thus showcasing its versatility even when the parameter space is non-Euclidean. We examine the performance of the Hybrid estimator alongside competing methods and focus primarily on situations where the posterior samples are either few in number or non-exact. In addition to the Hybrid estimator (HybE), we examine the following additional estimators: the Bridge Sampling estimator (BSE), Warp Bridge Sampling estimator (WBSE), Harmonic Mean estimator (HME), and Corrected Arithmetic Mean estimator (CAME). The BSE and WBSE results are obtained using the bridgesampling package (Gronau et al., 2020). Corresponding calculations and formulae for posterior parameters and analytical marginal likelihoods are given in the Supplement.

We emphasize that in the experiments provided in this section we seek to mimic scenarios where posterior sampling is highly expensive and/or mixing is poor. By considering a small number of samples as the input for these marginal likelihood estimation algorithms, we provide a realistic scenario for the regime in which we wish to operate.
In the examples in Sections 3.1 and 3.3, we sample directly from the exact posterior distribution, while in Section 3.4, we use approximate samples from the posterior distribution.

Consider the following setup of the linear regression model, $y = X\beta + \varepsilon$, where $y \in \mathbb{R}^n$, $X \in \mathbb{R}^{n \times d}$, $\beta \in \mathbb{R}^d$, and $\varepsilon \sim N(0, \sigma^2 I_n)$. In the next two examples, we consider different prior distributions on $\beta$ and $\sigma^2$.

We assume a multivariate normal inverse-gamma (MVN-IG) prior on $(\beta, \sigma^2)$, where $\beta \mid \sigma^2 \sim N_d(\mu_\beta, \sigma^2 V_\beta)$ and $\sigma^2 \sim \mathrm{IG}(a_0, b_0)$. Given this choice of the prior, the posterior distribution is known to be $\beta \mid \sigma^2, y \sim N\big( \mu_n, \sigma^2 V_n \big)$, $\sigma^2 \mid y \sim \mathrm{IG}(a_n, b_n)$. In our simulation, we take $d =$
$19$, so that $u = (\beta, \sigma^2) \in \mathbb{R}^{20}$. Since the log marginal likelihood in this example is well known, we evaluate each of the estimates against the true value. In Figure 2, we plot the errors for each of the estimators when only 45 MCMC samples are used for each approximation. The accuracy and standard error of the Hybrid estimator are clearly superior to those of the well-established estimators.

Next, we place a multivariate normal prior on $\beta$ truncated to the first orthant. In particular, $\beta \sim N_d(0, \sigma^2 \lambda^{-2} I_d) \cdot \mathbb{1}_{[0,\infty)^d}$, where $\sigma^2, \lambda$ are known. This produces a posterior distribution of the form

$$\beta \mid y \sim N_d\big( \beta \mid Q^{-1} b,\, Q^{-1} \big) \cdot \mathbb{1}_{[0,\infty)^d},$$

where $Q, b$ are defined in the Supplement. Then, the marginal likelihood can be written as

$$p(y) = \int_R N(y \mid X\beta, \sigma^2 I_n)\; 2^d\, N_d(\beta \mid 0, \sigma^2 \lambda^{-2} I_d)\, d\beta = C \cdot \int_R \det(Q)^{1/2}\, e^{-\frac{1}{2} (\beta - Q^{-1} b)^\top Q\, (\beta - Q^{-1} b)}\, d\beta.$$

Here, $R = [0, \infty)^d$ and $C$ is a known constant term. Note that in this case, however, the integral is not analytically available, which prevents the marginal likelihood from being easily computed. Botev (2016) uses a minimax tilting method to calculate the normalizing constant of truncated normal distributions and shows that the proposed estimator has the vanishing relative error property (Kroese et al., 2011). In light of this, we accept Botev's estimator as the true marginal likelihood in the following experiments. The TruncatedNormal package (Botev and Belzile, 2019) provides samples from truncated normal distributions, so posterior samples from $\beta \mid y$ are readily available.

In Figure 2, we present the simulation results for the case when $d =$
$20$. Each approximation uses 45 MCMC samples, and we compare the results against the true log marginal likelihood. Once again, the HybE outperforms the other estimators, reinforcing its ability to deal with a scarce number of samples. Provided with a sufficiently large number of samples, however, the BSE and CAME are both eventually able to produce more accurate results than the HybE.
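For reference, the closed-form MVN-IG log marginal likelihood against which these experiments compare can be computed as follows. This is the standard conjugate-regression evidence formula, written with our own notation for the hyperparameters; the paper's exact hyperparameter values are in its Supplement.

```python
import numpy as np
from scipy.special import gammaln

def log_evidence(y, X, mu_b, V_b, a0, b0):
    # closed-form log p(y) for y = X beta + eps with the MVN-IG prior
    # beta | sigma^2 ~ N(mu_b, sigma^2 V_b), sigma^2 ~ IG(a0, b0)
    n = len(y)
    V_b_inv = np.linalg.inv(V_b)
    V_n = np.linalg.inv(V_b_inv + X.T @ X)                  # posterior covariance factor
    mu_n = V_n @ (V_b_inv @ mu_b + X.T @ y)                 # posterior mean
    a_n = a0 + n / 2.0
    b_n = b0 + 0.5 * (y @ y + mu_b @ V_b_inv @ mu_b
                      - mu_n @ np.linalg.inv(V_n) @ mu_n)
    return (-0.5 * n * np.log(2 * np.pi)
            + 0.5 * (np.linalg.slogdet(V_n)[1] - np.linalg.slogdet(V_b)[1])
            + a0 * np.log(b0) - a_n * np.log(b_n)
            + gammaln(a_n) - gammaln(a0))
```

A call such as `log_evidence(y, X, np.zeros(d), np.eye(d), 1.0, 1.0)` then yields the ground truth used to score each estimator.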
The examples have thus far dealt with parameters in Euclidean space. In the next set of examples, we move beyond the usual Euclidean space and consider parameters in $\mathbb{R}^{d \times d}$. In particular, let $x_1, \ldots, x_n \overset{iid}{\sim} N_d(0, \Sigma)$, where $\Sigma \in \mathbb{R}^{d \times d}$. Then the likelihood can be written as follows,

$$L(\Sigma) = (2\pi)^{-nd/2} \det(\Sigma)^{-n/2}\, e^{-\mathrm{tr}(\Sigma^{-1} S)/2}, \qquad (10)$$

where $S = \sum_{i=1}^{n} x_i x_i^\top$. For simplicity, we consider a conjugate inverse-Wishart (IW) prior, $W^{-1}(\Lambda_0, \nu)$, for $\Sigma$, where $\Lambda_0$ is a positive definite $d \times d$ matrix and $\nu > d - 1$. The posterior distribution is then $W^{-1}(\Lambda_0 + S, \nu + n)$, and we can compute the marginal likelihood in closed form.

Note that despite being able to sample from the posterior distribution, we cannot yet carry out the Hybrid approximation algorithm. Since posterior samples are drawn from a sub-manifold of $\mathbb{R}^{d \times d}$, if we were to proceed as usual to obtain a partition over $\mathbb{R}^{d \times d}$, there would be no guarantee that a given point within the partition could be used to reconstruct a valid covariance matrix. As such, we circumvent this issue by taking the Cholesky factorization of $\Sigma$, so that $\Sigma = T T^\top$, where $T$ is a lower triangular matrix with positive diagonal entries $t_{jj}$ for $j = 1, \ldots, d$.

Under this transformation, we can define $\Psi(T) = -\log L(T) - \log \pi(T)$, where Eq. (10) gives us

$$L(T) = (2\pi)^{-nd/2} \det(T)^{-n}\, e^{-\mathrm{tr}((T T^\top)^{-1} S)/2}.$$

Conveniently, the determinant of the Jacobian matrix $J$ of this transformation is well known: $|J| = 2^d \prod_{j=1}^{d} t_{jj}^{d+1-j}$. By the change of variables formula, the induced prior on $T$ is

$$\pi(T) = C_{\Lambda_0, \nu}\, \det(T)^{-(\nu + d + 1)}\, e^{-\mathrm{tr}((T T^\top)^{-1} \Lambda_0)/2} \times 2^d \prod_{j=1}^{d} t_{jj}^{d+1-j},$$

where $C_{\Lambda_0, \nu} = \det(\Lambda_0)^{\nu/2} / \big( 2^{\nu d/2}\, \Gamma_d(\nu/2) \big)$, and $\Gamma_d(\cdot)$ is the multivariate gamma function. Obtaining posterior samples of $T$ is trivial, as we can simply draw $\Sigma$ from $W^{-1}(\Lambda_0 + S, \nu + n)$ and then take the lower Cholesky factor.
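The evaluation of $\Psi(T)$ under this reparameterization can be sketched as follows; this is our own illustration of the formulas above, with the Jacobian term included in the induced prior.

```python
import numpy as np
from scipy.special import multigammaln

def psi_T(T, S, Lam, nu, n):
    # Psi(T) = -log L(T) - log pi(T) for the inverse-Wishart model after
    # the Cholesky reparameterization Sigma = T T', with Jacobian
    # |J| = 2^d prod_j t_jj^{d+1-j}
    d = T.shape[0]
    Sigma_inv = np.linalg.inv(T @ T.T)
    tjj = np.diag(T)
    log_det_T = np.sum(np.log(tjj))
    # log-likelihood from Eq. (10), using det(Sigma) = det(T)^2
    loglik = (-0.5 * n * d * np.log(2 * np.pi) - n * log_det_T
              - 0.5 * np.trace(Sigma_inv @ S))
    # induced prior on T: IW density in Sigma = T T' times the Jacobian
    logC = (0.5 * nu * np.linalg.slogdet(Lam)[1]
            - 0.5 * nu * d * np.log(2) - multigammaln(0.5 * nu, d))
    logprior = (logC - (nu + d + 1) * log_det_T
                - 0.5 * np.trace(Sigma_inv @ Lam)
                + d * np.log(2)
                + np.sum((d + 1 - np.arange(1, d + 1)) * np.log(tjj)))
    return -(loglik + logprior)

# example call with toy inputs
d = 2
T0 = np.linalg.cholesky(np.eye(d) + 0.5)
print(psi_T(T0, 3.0 * np.eye(d), np.eye(d), 4.0, 5))
```

Pairing each posterior draw $T_j$ with `psi_T(T_j, S, Lam, nu, n)` then puts the covariance example in exactly the covariate-response form required by the tree-fitting step.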
With this general setup in place, it is worth noting that even with another prior on $\Sigma$, we can carry out the entire algorithm, provided that we have a way to sample from the posterior of $\Sigma$ and a way to compute the Jacobian of the transformation.

In Figure 3, we present the results, in which each approximation uses 25 MCMC samples. The boxplots of each approximation's errors solidify the robustness of the Hybrid estimator, which produces accurate and low-variance estimates. Although the BSE and WBSE both cover the true log marginal likelihood value, it is apparent that these estimators suffer from stability and convergence issues that are not present in the Hybrid estimator.

In the following examples, we extend the previous analysis of covariance matrices to the graphical modeling context. Gaussian graphical models are a popular tool to learn the dependence structure among variables of interest. Consider independent and identically distributed vectors $x_1, x_2, \ldots, x_n$ drawn from a $d$-variate normal distribution with mean vector 0 and a sparse inverse covariance matrix $\Omega$. If the variables $i$ and $j$ do not share an edge in a graph $G$, then $\Omega_{ij} =$
$0$. Hence, an undirected (or concentration) graphical model corresponding to $G$ restricts the inverse covariance matrix $\Omega$ to a linear subspace of the cone of positive definite matrices. A probabilistic framework for learning the dependence structure and the graph $G$ requires specification of a prior distribution for $(\Omega, G)$. Conditional on $G$, a hyper-inverse Wishart (HIW) distribution (Dawid and Lauritzen, 1993) on $\Sigma = \Omega^{-1}$ and the corresponding induced class of distributions on $\Omega$ (Roverato, 2000) are attractive choices of priors.

Denoted by $\mathrm{HIW}_G(\delta, B)$, the hyper-inverse Wishart distribution is a distribution on the cone of $d \times d$ positive definite matrices with parameters $\delta > 0$ and a $d \times d$ positive definite matrix $B$. Refer to equations (4) and (5) of Roverato (2000) for the form of
Figure 3:
Boxplots of the error (truth - estimate) for the unrestricted covariance (left, true log p(y): -673.7057) and graphical model (right, true log p(y): -506.3061) examples. Results are reported over 100 simulations, with 100 observations and 25 MCMC samples. For the IW example, we consider $4 \times 4$ covariance matrices with 10 free parameters. For the HIW example, we consider precision matrices with 10 free parameters. Note that we do not include BSE results in the HIW example because the Bridge Sampling algorithm fails to converge with only 25 MCMC samples.

the density. When $G$ is decomposable, an alternative parameterization is given by the Cholesky decomposition $T^\top T$ of $\Omega = \Sigma^{-1}$. Provided that the vertices of $G = (V, E)$ are enumerated according to a perfect vertex elimination scheme, the upper triangular matrix $T$ has the same zero pattern as $\Omega$. Since the likelihood function is identical to the one given in Eq. (10), we need only compute the induced prior on $T$ to complete the definition of $\Psi(T)$. Following Roverato (2000), the determinant of the Jacobian matrix $J$ of this transformation is given by $|J| = 2^d \prod_{i=1}^{d} t_{ii}^{\nu_i + 1}$, where the $i$th row of $T$ has exactly $\nu_i$ nonzero off-diagonal entries; writing $\mathrm{ne}(v_i) = \{ v_j : (v_i, v_j) \in E \}$ for the neighbors of $v_i$, we have $\nu_i = |\mathrm{ne}(v_i) \cap \{v_{i+1}, \ldots, v_d\}|$. The induced joint density of the elements of $T$, i.e., $t_{sr}$ for $s < r$ with the edge $(v_s, v_r) \in E$, and $t_{ii}$, $i = 1, \ldots, d$, is given by

$$\pi(T) = \Big[ \prod_{i=1}^{d} \frac{2^{1 - (\delta + \nu_i)/2}}{\Gamma\big( (\delta + \nu_i)/2 \big)}\, t_{ii}^{\delta + \nu_i - 1}\, e^{-t_{ii}^2/2}\, \mathbb{1}(t_{ii} > 0) \Big] \times \Big[ \prod_{(r,s)\,:\, r > s,\ (v_s, v_r) \in E} \frac{1}{\sqrt{2\pi}}\, e^{-t_{sr}^2/2} \Big].$$

Since we are able to sample from the posterior distribution, $\mathrm{HIW}_G(\delta + n, B + S)$, where $S = \sum_{i=1}^{n} x_i x_i^\top$, we are well equipped to compute the Hybrid estimator. For this example, we take $B = I$ and a fixed value of $\delta$, and in Figure 3, we present the errors for the different estimators when 25 MCMC samples are used for each approximation.
Even with limited MCMC samples, the HybE retains its ability to produce reliable results that do not exhibit the high variance that we see in the WBSE. As the number of MCMC samples increases, the WBSE stabilizes and eventually beats the HybE.

Up until now, we have assumed that asymptotically exact samples from the posterior distribution are available to be used as input for the proposed approximation. In fact, for all previous numerical experiments, we have used samples drawn from the exact posterior distribution, ridding us of the need for burn-in or thinning. We now investigate how these algorithms perform when we only have approximate posterior samples. As a demonstration, we revisit the MVN-IG example in Section 3.1.1 and consider the case where $\beta \in \mathbb{R}^9$. We construct the following mean field approximation to the posterior distribution, $q(\beta, \sigma^2) = q(\beta)\, q(\sigma^2)$, where

$$q(\beta) \equiv \prod_{i=1}^{3} N\big( \mu^{(i)}_n, \tilde{\sigma}^2 V^{(i)}_n \big), \qquad q(\sigma^2) \equiv \mathrm{IG}(a_n, b_n).$$

Here, we have split the original 9-dimensional normal distribution into a product of 3-dimensional normal distributions, with the mean and covariance components extracted from the true posterior parameters. In particular, $\mu^{(1)}_n = (\mu_{n,1}, \mu_{n,2}, \mu_{n,3})$, $\mu^{(2)}_n = (\mu_{n,4}, \mu_{n,5}, \mu_{n,6})$, and $\mu^{(3)}_n = (\mu_{n,7}, \mu_{n,8}, \mu_{n,9})$. Each $V^{(i)}_n$ is defined as the corresponding $3 \times 3$ diagonal block of $V_n$, and $\tilde{\sigma}^2$ is the posterior mean of $\sigma^2$.

In Figure 4 below, we observe that even with non-exact posterior samples, the Hybrid approximation produces accurate estimates, with an average error of 0.449 over 100 replications, compared to average errors of 0.698 and 1.035 for the CAME and BSE, respectively. While the latter two estimators have lower variance than the Hybrid approximations, neither covers the true marginal likelihood.

Figure 4:
Boxplots of the error (truth - estimate) for the MVN-IG example ($\beta \in \mathbb{R}^9$) with approximate posterior samples. For each of the 100 replications, we used 100 observations and 100 approximate posterior samples. The true log marginal likelihood is -147.3245.

In this paper, we developed a novel algorithm that combines a variety of ideas to efficiently estimate the marginal likelihood. By first using a regression tree to identify high probability regions of the parameter space and then leveraging numerical integration ideas to obviate the need to trust the quality of the MCMC samples, we are able to construct an approximation that scales well with both the dimension and the complexity of the parameter space. From the simulation studies, we see that the Hybrid estimator is both accurate and reliable, providing robust approximations in situations where MCMC samples are either scarce or non-exact. Our contribution thus has practical value: even in higher dimensions, and in instances where generating MCMC samples is expensive and/or non-exact, the Hybrid estimator still delivers promising results.

Furthermore, the Hybrid approximation scheme outlined in this paper lays the groundwork for future work in a number of possible directions. One area of potential refinement is the construction of the partition of the parameter space. While we used CART for its convenience and interpretability, we found that the default objective function for CART was unsuitable for determining the representative point of each partition set, and we had to solve an additional optimization problem to obtain these points.
Instead of this roundabout two-step approach, where we used CART to learn the partition and the objective function in Eq. (9) to identify representative points, we could modify the CART objective function to directly target the desired objective.

Another aspect of the current algorithm that can be further developed is the formulation of the local approximation to $\Psi$ in each of the partition sets. The piecewise constant approximation in Eq. (5), though providing encouraging results, is a rather simplistic way to approximate $\Psi$, particularly when moving to higher dimensions. A natural extension to the constant approximation is to use a local Taylor expansion to introduce higher order terms, giving piecewise linear and quadratic approximations.
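To see why such an extension can help, consider a one-dimensional toy problem (not from the paper): approximating $\int_0^2 e^{\Psi(u)}\,du$ with $\Psi(u) = -u^2$ over an equal-width partition, using the left endpoint of each set as its representative point. The first-order version integrates $\exp\{\Psi(u_k) + \Psi'(u_k)(u - u_k)\}$ in closed form over each set.

```python
import math

# Piecewise local approximations to I = integral of exp(Psi(u)) on [0, 2],
# with Psi(u) = -u^2 standing in for the log posterior; the true value
# is (sqrt(pi) / 2) * erf(2).
psi = lambda u: -u * u
dpsi = lambda u: -2.0 * u          # derivative used by the first-order version

K, lo, hi = 10, 0.0, 2.0
h = (hi - lo) / K

const_est, lin_est = 0.0, 0.0
for k in range(K):
    a = lo + k * h                 # representative point of the k-th partition set
    const_est += math.exp(psi(a)) * h           # piecewise-constant (as in Eq. 5)
    g = dpsi(a)                                 # slope of the local Taylor expansion
    # closed form of the integral of exp(psi(a) + g * (u - a)) over [a, a + h],
    # with the g -> 0 limit handled explicitly
    lin_est += math.exp(psi(a)) * (h if abs(g) < 1e-12
                                   else (math.exp(g * h) - 1.0) / g)

truth = 0.5 * math.sqrt(math.pi) * math.erf(2.0)
```

On this toy problem the linear version substantially reduces the error of the constant version; a quadratic term would likewise only require a one-dimensional Gaussian integral in closed form per set.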
Acknowledgements
The authors are grateful to the anonymous reviewers for providing valuable comments and suggestions. The authors also extend special thanks to Donald Chung for insightful discussion about code optimizations.
References
Peter Binev, Albert Cohen, Wolfgang Dahmen, Ronald DeVore, and Vladimir Temlyakov. Universal algorithms for learning theory part I: Piecewise constant functions. Journal of Machine Learning Research, 6:1297–1321, 2005.

Z. I. Botev. The normal law under linear restrictions: simulation and estimation via minimax tilting. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79(1):125–148, 2016. doi: 10.1111/rssb.12162.

Zdravko Botev and Leo Belzile. TruncatedNormal: Truncated Multivariate Normal and Student Distributions, 2019. URL https://CRAN.R-project.org/package=TruncatedNormal. R package version 2.1.

Leo Breiman. Classification and Regression Trees. Wadsworth International Group, 1984.

Siddhartha Chib. Marginal likelihood from the Gibbs output. Journal of the American Statistical Association, 90(432):1313–1321, 1995. doi: 10.1080/01621459.1995.10476635.

Siddhartha Chib and Ivan Jeliazkov. Marginal likelihood from the Metropolis–Hastings output. Journal of the American Statistical Association, 96(453):270–281, 2001. doi: 10.1198/016214501750332848.

A. Philip Dawid and Steffen L. Lauritzen. Hyper Markov laws in the statistical analysis of decomposable graphical models. The Annals of Statistics, pages 1272–1317, 1993.

Nial Friel and Jason Wyse. Estimating the evidence - a review. Statistica Neerlandica, 66(3):288–308, 2012. doi: 10.1111/j.1467-9574.2011.00515.x.

Subhashis Ghosal and Aad van der Vaart. Convergence rates of posterior distributions for noniid observations. The Annals of Statistics, 35(1):192–223, 2007.

Quentin F. Gronau, Henrik Singmann, and Eric-Jan Wagenmakers. bridgesampling: An R package for estimating normalizing constants. Journal of Statistical Software, 92(10):1–29, 2020. doi: 10.18637/jss.v092.i10.

Dirk P. Kroese, Thomas Taimre, and Zdravko I. Botev. Handbook of Monte Carlo Methods. Wiley-Blackwell, 2011.

Peter Lenk. Simulation pseudo-bias correction to the harmonic mean estimator of integrated likelihoods. Journal of Computational and Graphical Statistics, 18(4):941–960, 2009. doi: 10.1198/jcgs.2009.08022.

Xiao-Li Meng and Stephen Schilling. Warp bridge sampling. Journal of Computational and Graphical Statistics, 11(3):552–586, 2002. doi: 10.1198/106186002457.

Xiao-Li Meng and Wing Hung Wong. Simulating ratios of normalizing constants via a simple identity: a theoretical exploration. Statistica Sinica, 6:831–860, 1996.

Radford M. Neal. Annealed importance sampling. Statistics and Computing, 11(2):125–139, 2001. doi: 10.1023/a:1008923215028.

Michael A. Newton and Adrian E. Raftery. Approximate Bayesian inference with the weighted likelihood bootstrap. Journal of the Royal Statistical Society: Series B (Methodological), 56(1):3–26, 1994. doi: 10.1111/j.2517-6161.1994.tb01956.x.

Anna Pajor. Estimating the marginal likelihood using the arithmetic mean identity. Bayesian Analysis, 12(1):261–287, 2017. doi: 10.1214/16-ba1001.

Adrian E. Raftery, Michael A. Newton, Jaya M. Satagopan, and Pavel N. Krivitsky. Estimating the integrated likelihood via posterior simulation using the harmonic mean identity. Bayesian Statistics, 8:1–45, 2007.

Danilo Rezende and Shakir Mohamed. Variational inference with normalizing flows. In Francis Bach and David Blei, editors, Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pages 1530–1538, Lille, France, 07–09 Jul 2015. PMLR.

Alberto Roverato. Cholesky decomposition of a hyper inverse Wishart matrix. Biometrika, 87(1):99–112, 2000.

Tim Salimans, Diederik Kingma, and Max Welling. Markov chain Monte Carlo and variational inference: Bridging the gap. In Francis Bach and David Blei, editors, Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pages 1218–1226, Lille, France, 07–09 Jul 2015. PMLR.

John Skilling. Nested sampling for general Bayesian computation. Bayesian Analysis, 1(4):833–859, 2006. doi: 10.1214/06-ba127.

Terry Therneau and Beth Atkinson. rpart: Recursive Partitioning and Regression Trees, 2019. URL https://CRAN.R-project.org/package=rpart. R package version 4.1-15.

Luke Tierney and Joseph B. Kadane. Accurate approximations for posterior moments and marginal densities. Journal of the American Statistical Association, 81(393):82–86, 1986. doi: 10.1080/01621459.1986.10478240.
Supplementary Materials
As part of the proposed Hybrid approximation algorithm, we used the CART model implemented in the rpart package (Therneau and Atkinson, 2019) in R for the tree-fitting process. Under the default settings, we obtained the partition of the parameter space from the fitted model. The rest of the Hybrid approximation algorithm, as outlined in Algorithm 1, is implemented in R and C++.

We obtain estimates from a number of competing methods, such as the Harmonic Mean estimator (HME), Corrected Arithmetic Mean estimator (CAME), Bridge Sampling estimator (BSE), and Warp Bridge Sampling estimator (WBSE). For the CAME, we use the importance sampling estimator shown in Eq. (19) of Pajor (2017). Moreover, since the conjugate normal model and multivariate normal inverse-gamma model in Sections 2.3.1 and 3.1.1, respectively, are very similar to those given in Pajor (2017), we select the same importance distributions given in the aforementioned paper. For the truncated multivariate normal distribution example in Section 3.1.2, we take the importance distribution to be a truncated multivariate normal distribution with the mean and variance components estimated from the posterior samples. For the BSE and WBSE, we rely on the implementation provided by the bridgesampling package in R; see Gronau et al. (2020) for more details.

Below, we present the calculations and formulae associated with each of the examples included in Section 3. If available in closed form, we include the expressions for the posterior distributions and the marginal likelihoods. In instances where the marginal likelihood cannot be written analytically, we turn to existing packages that provide specialized approximations that have been shown in the literature to be accurate. In addition, we provide the true parameter values used to generate the data, as well as the prior hyperparameter settings used to obtain the true log marginal likelihood values.
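For intuition, the two-step procedure can be caricatured in one dimension in a few lines of Python: bin posterior-like samples to form a partition (a crude stand-in for the CART fit), pick a representative point in each set (here, simply the sample nearest the set's midpoint, rather than the solution of the optimization in Eq. (9)), and sum the piecewise-constant contributions on the log scale.

```python
import math
import random

# 1-D caricature of the Hybrid estimator for gamma(u) = exp(Psi(u)),
# Psi(u) = -u^2/2, whose integral is sqrt(2 * pi).
random.seed(1)
psi = lambda u: -0.5 * u * u

samples = sorted(random.gauss(0.0, 1.0) for _ in range(500))  # stand-in for MCMC draws
K = 20
lo, hi = samples[0], samples[-1]
h = (hi - lo) / K

log_terms = []
for k in range(K):
    a, b = lo + k * h, lo + (k + 1) * h
    inside = [u for u in samples if a <= u < b] or [0.5 * (a + b)]
    # crude stand-in for the representative-point optimization:
    # the sample closest to the midpoint of the partition set
    u_star = min(inside, key=lambda u: abs(u - 0.5 * (a + b)))
    log_terms.append(psi(u_star) + math.log(h))

# log-sum-exp of the piecewise-constant contributions
m = max(log_terms)
log_evidence = m + math.log(sum(math.exp(t - m) for t in log_terms))
```

The true log evidence here is $\tfrac{1}{2}\log(2\pi) \approx 0.919$, which the estimate approaches as the number of samples and partition sets grows.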
In the conjugate normal model in Section 2.3.1, the posterior distribution of $(\mu, \sigma^2)$ is well known. In particular, $\mu \mid \sigma^2, y_n \sim N(m_n, \sigma^2 / w_n)$ and $\sigma^2 \mid y_n \sim \mathrm{IG}(r_n / 2, s_n / 2)$, with posterior parameters defined as follows:
\[
m_n = \frac{n \bar{y} + w_0 m_0}{n + w_0}, \qquad w_n = w_0 + n, \qquad r_n = r_0 + n,
\]
\[
s_n = s_0 + \sum_{i=1}^{n} (y_i - \bar{y})^2 + \left( \frac{n w_0}{n + w_0} \right) (\bar{y} - m_0)^2.
\]
With this in place, the marginal likelihood can be computed in closed form,
\[
p(y) = \pi^{-n/2} \left( \frac{w_0}{w_n} \right)^{1/2} \frac{\Gamma(r_n / 2)}{\Gamma(r_0 / 2)} \cdot \frac{s_0^{r_0 / 2}}{s_n^{r_n / 2}}. \tag{11}
\]
Each of the $n = 100$ observations was drawn from a normal distribution with mean 30 and variance 4. The prior hyperparameters were $m_0$, $w_0$, $r_0$, and $s_0 = 3$; plugging these into Eq. (11), we computed the true log marginal likelihood to be -113.143.
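Eq. (11) is straightforward to implement directly; the Python sketch below uses purely illustrative hyperparameter values rather than those of the experiment, and can be checked against the Student-t predictive density and the chain rule of marginal likelihoods.

```python
import math

def log_marginal_normal(y, m0, w0, r0, s0):
    """Closed-form log marginal likelihood (Eq. 11) for the conjugate model
    y_i ~ N(mu, s2), mu | s2 ~ N(m0, s2 / w0), s2 ~ IG(r0 / 2, s0 / 2)."""
    n = len(y)
    ybar = sum(y) / n
    wn, rn = w0 + n, r0 + n
    sn = s0 + sum((yi - ybar) ** 2 for yi in y) \
         + (n * w0 / (n + w0)) * (ybar - m0) ** 2
    return (-0.5 * n * math.log(math.pi)
            + 0.5 * (math.log(w0) - math.log(wn))
            + math.lgamma(rn / 2) - math.lgamma(r0 / 2)
            + (r0 / 2) * math.log(s0) - (rn / 2) * math.log(sn))
```

For a single observation the marginal is a Student-t density, and for several observations the result must agree with the sequential product of one-step predictives; both identities exercise every posterior update above.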
Recall the linear regression setup given in Sections 3.1 and 3.1.1. Here, the posterior distribution can be shown to be of the following form: $\beta \mid \sigma^2, y \sim N(\mu_n, \sigma^2 V_n)$ and $\sigma^2 \mid y \sim \mathrm{IG}(a_n, b_n)$, with posterior parameters
\[
\mu_n = V_n \big( X^\top y + V_\beta^{-1} \mu_\beta \big), \qquad V_n = \big( X^\top X + V_\beta^{-1} \big)^{-1}, \qquad a_n = a + n/2,
\]
and $b_n = b + \tfrac{1}{2} \big( y^\top y + \mu_\beta^\top V_\beta^{-1} \mu_\beta - \mu_n^\top V_n^{-1} \mu_n \big)$. Then the marginal likelihood can be computed directly to be
\[
p(y) = \frac{1}{(2\pi)^{n/2}} \cdot \frac{b^{a}}{b_n^{a_n}} \cdot \frac{\Gamma(a_n)}{\Gamma(a)} \cdot \frac{\det(V_n)^{1/2}}{\det(V_\beta)^{1/2}}. \tag{12}
\]
Each of the 100 observations was drawn from a $d$-dimensional normal distribution according to the linear regression model presented in Section 3.1. In the experiments, we took $d = 19$, and the prior hyperparameters were $\mu_\beta = 0_d$ and $V_\beta = I_d$, with $b = 1$. The true value of $\beta$ is shown as a heatmap in Figure 5, and $\sigma^2 = 4$. Plugging these into Eq. (12), we computed the true log marginal likelihood to be -303.8482.

Figure 5:
True value of $\beta$; each component is represented as a tile and takes on a value between -10 and 10. Values closer to 10 are red and values closer to -10 are white.

With the truncated multivariate normal prior given in Section 3.1.2, we obtain the following form of the posterior distribution of $\beta$,
\[
\beta \mid y \sim N_d\big( \beta \mid Q^{-1} b, Q^{-1} \big) \cdot \mathbf{1}_{[0,1]^d},
\]
with posterior parameters $Q = \sigma^{-2} \big( X^\top X + \lambda I_d \big)$ and $b = \sigma^{-2} X^\top y$. Each of the $n = 100$ observations was drawn from a $d$-dimensional normal distribution according to the linear regression model presented in Section 3.1. In the experiments, we took $d = 20$, and the prior hyperparameters were $\sigma^2$ and $\lambda = 0.25$. The true value of $\beta$ is shown as a heatmap in Figure 6. Due to the intractable marginal likelihood, we used the TruncatedNormal package to compute the true log marginal likelihood to be -250.2755.

Figure 6:
True value of $\beta$; each component is represented as a tile and takes on a value between 0 and 1. Values closer to 1 are red and values closer to 0 are white.

We consider the inverse-Wishart prior on $\Sigma$, $\mathcal{W}^{-1}(\Lambda, \nu)$, where $\Lambda$ is a positive definite $d \times d$ matrix and $\nu > d - 1$:
\[
\pi(\Sigma) = C_{\Lambda, \nu} \det(\Sigma)^{-(\nu + d + 1)/2} \, e^{-\mathrm{tr}(\Sigma^{-1} \Lambda)/2},
\]
where $C_{\Lambda, \nu} = \det(\Lambda)^{\nu/2} / \big( 2^{\nu d / 2} \Gamma_d(\nu / 2) \big)$. Here, $\Gamma_d(\cdot)$ is the multivariate gamma function, given by
\[
\Gamma_d(a) = \pi^{d(d-1)/4} \prod_{j=1}^{d} \Gamma\big( a + (1 - j)/2 \big),
\]
where $\Gamma(\cdot)$ is the ordinary gamma function. Our choice of the prior admits the following closed form marginal likelihood:
\[
\int L(\Sigma)\, \pi(\Sigma)\, d\Sigma = \frac{\Gamma_d\big( (n + \nu)/2 \big)}{\pi^{nd/2}\, \Gamma_d(\nu / 2)} \cdot \frac{\det(\Lambda)^{\nu/2}}{\det(\Lambda + S)^{(n + \nu)/2}}. \tag{13}
\]
In each of the 100 replications, we took $d = 4$ and drew $n = 100$ observations from a 4-dimensional normal distribution with mean vector $0_d$ and a fixed covariance matrix $\Sigma$. The prior hyperparameters were $\Lambda = I$ and $\nu = 5$. Plugging these into Eq. (13), we computed the true log marginal likelihood to be -673.7057.
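Eq. (13) is easy to evaluate once the multivariate gamma function and a log-determinant are available; the self-contained Python sketch below uses illustrative inputs rather than the experiment's data.

```python
import math

def log_multigamma(d, a):
    """log of the multivariate gamma function Gamma_d(a)."""
    return (d * (d - 1) / 4) * math.log(math.pi) \
        + sum(math.lgamma(a + (1 - j) / 2) for j in range(1, d + 1))

def logdet(A):
    """log-determinant of a small symmetric PD matrix via Cholesky."""
    d = len(A)
    L = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return 2.0 * sum(math.log(L[i][i]) for i in range(d))

def log_marginal_iw(S, n, Lam, nu):
    """Eq. (13): log marginal likelihood for x_i ~ N(0, Sigma),
    Sigma ~ IW(Lam, nu); S = sum_i x_i x_i^T."""
    d = len(Lam)
    LpS = [[Lam[i][j] + S[i][j] for j in range(d)] for i in range(d)]
    return (log_multigamma(d, (n + nu) / 2) - log_multigamma(d, nu / 2)
            - 0.5 * n * d * math.log(math.pi)
            + (nu / 2) * logdet(Lam) - ((n + nu) / 2) * logdet(LpS))
```

For $d = 1$ the formula reduces to a one-dimensional inverse-gamma integral, which provides a direct numerical check.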
We first introduce some notation to help us obtain a closed form for the marginal likelihood of a decomposable graph $G$. For an $n \times d$ matrix $X$, $X_C$ is defined as the submatrix of $X$ consisting of columns with indices in the clique $C$. Write $X_{n \times d} = (x^1, x^2, \dots, x^d)$, where $x^i$ is the $i$-th column of $X$. If $C = \{ i_1, i_2, \dots, i_{|C|} \}$, where $1 \le i_1 < i_2 < \dots < i_{|C|} \le d$, then $X_C = (x^{i_1}, x^{i_2}, \dots, x^{i_{|C|}})$. For any square matrix $A = (a_{ij})_{d \times d}$, define $A_C = (a_{ij})_{|C| \times |C|}$ with $i, j \in C$, where the order of the entries carries over into the new submatrix $A_C$. Therefore, $X_C^\top X_C = (X^\top X)_C$.

Decomposable graphs correspond to a special kind of sparsity pattern in $\Sigma$, henceforth denoted $\Sigma_G$. Suppose we have a $\mathrm{HIW}_G(b, D)$ distribution on the cone of $d \times d$ positive definite matrices, with degrees of freedom $b$ and a $d \times d$ positive definite matrix $D$, such that the joint density factorizes on the junction tree of the given decomposable graph $G$ as
\[
p(\Sigma_G \mid b, D) = \frac{\prod_{C \in \mathcal{C}} p(\Sigma_C \mid b, D_C)}{\prod_{S \in \mathcal{S}} p(\Sigma_S \mid b, D_S)}, \tag{14}
\]
where for each $C \in \mathcal{C}$, $\Sigma_C \sim \mathrm{IW}_{|C|}(b, D_C)$ with density
\[
p(\Sigma_C \mid b, D_C) \propto |\Sigma_C|^{-(b + 2|C|)/2} \,\mathrm{etr}\!\left\{ -\tfrac{1}{2} \Sigma_C^{-1} D_C \right\}, \tag{15}
\]
where $|C|$ is the cardinality of the clique $C$ and $\mathrm{etr}(\cdot) = \exp\{ \mathrm{tr}(\cdot) \}$. Here $\mathrm{IW}_d(b, D)$ is the inverse-Wishart distribution with degrees of freedom $b$ and a fixed $d \times d$ positive definite matrix $D$, with normalizing constant
\[
\left| \tfrac{D}{2} \right|^{(b + d - 1)/2} \Big/ \, \Gamma_d\!\left( \tfrac{b + d - 1}{2} \right).
\]
Note that we can establish equivalence to the parameterization used in Section 3.3.1 by taking $\delta = b + d - 1$. The likelihood is
\[
f(X \mid \Sigma_G) = (2\pi)^{-nd/2} \, \frac{\prod_{C \in \mathcal{C}} |\Sigma_C|^{-n/2} \,\mathrm{etr}\!\big( -\tfrac{1}{2} \Sigma_C^{-1} X_C^\top X_C \big)}{\prod_{S \in \mathcal{S}} |\Sigma_S|^{-n/2} \,\mathrm{etr}\!\big( -\tfrac{1}{2} \Sigma_S^{-1} X_S^\top X_S \big)}. \tag{16}
\]
The $\mathrm{HIW}_G(b, D)$ density can be written as
\[
f(\Sigma_G \mid G) = \frac{\prod_{C \in \mathcal{C}} p(\Sigma_C \mid b, D_C)}{\prod_{S \in \mathcal{S}} p(\Sigma_S \mid b, D_S)}
= \frac{\prod_{C \in \mathcal{C}} \left| \tfrac{D_C}{2} \right|^{\frac{b + |C| - 1}{2}} \Gamma_{|C|}\!\big( \tfrac{b + |C| - 1}{2} \big)^{-1} \, |\Sigma_C|^{-\frac{b + 2|C|}{2}} \,\mathrm{etr}\!\big( -\tfrac{1}{2} \Sigma_C^{-1} D_C \big)}{\prod_{S \in \mathcal{S}} \left| \tfrac{D_S}{2} \right|^{\frac{b + |S| - 1}{2}} \Gamma_{|S|}\!\big( \tfrac{b + |S| - 1}{2} \big)^{-1} \, |\Sigma_S|^{-\frac{b + 2|S|}{2}} \,\mathrm{etr}\!\big( -\tfrac{1}{2} \Sigma_S^{-1} D_S \big)}.
\]
Then it is straightforward to obtain the marginal likelihood of the decomposable graph $G$,
\[
f(X \mid G) = (2\pi)^{-nd/2} \, \frac{h(G, b, D)}{h(G, b + n, D + S)} = (2\pi)^{-nd/2} \, \frac{\prod_{C \in \mathcal{C}} w(C)}{\prod_{S \in \mathcal{S}} w(S)}, \tag{17}
\]
where
\[
h(G, b, D) = \frac{\prod_{C \in \mathcal{C}} \left| \tfrac{D_C}{2} \right|^{\frac{b + |C| - 1}{2}} \Gamma_{|C|}\!\big( \tfrac{b + |C| - 1}{2} \big)^{-1}}{\prod_{S \in \mathcal{S}} \left| \tfrac{D_S}{2} \right|^{\frac{b + |S| - 1}{2}} \Gamma_{|S|}\!\big( \tfrac{b + |S| - 1}{2} \big)^{-1}},
\]
\[
w(C) = |D_C|^{\frac{b + |C| - 1}{2}} \, \big| D_C + X_C^\top X_C \big|^{-\frac{b + n + |C| - 1}{2}} \, 2^{\frac{n |C|}{2}} \, \frac{\Gamma_{|C|}\!\big( \tfrac{b + n + |C| - 1}{2} \big)}{\Gamma_{|C|}\!\big( \tfrac{b + |C| - 1}{2} \big)}.
\]
Conditional on the graph $G$, represented in Figure 7, we considered a hyper-inverse Wishart prior on $\Sigma = \Omega^{-1}$, $\mathrm{HIW}_G(\delta, B)$, where the prior hyperparameters were $B = I$ and $\delta = 3$. We then drew $n = 100$ observations from a 5-dimensional normal distribution with mean vector $0$ and a sparse inverse covariance matrix $\Omega$, where the dependence structure in $\Omega$ was specified by the graph $G$.

Figure 7: In the undirected graph $G$, with vertex set $V = \{1, 2, 3, 4, 5\}$, the $(i, j)$-th box is black if the corresponding edge is present in $G$ and white otherwise.

Using the formula for the marginal likelihood derived in Eq. (17), we computed the true log marginal likelihood to be -506.3061.
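Eq. (17) can be sketched directly in terms of cliques and separators; in the Python sketch below, the graph is supplied as index lists, and the toy structure used in the check is assumed for illustration (it is not the graph of Figure 7). The $w(\cdot)$ terms follow the definition above.

```python
import math

def log_multigamma(d, a):
    """log of the multivariate gamma function Gamma_d(a)."""
    return (d * (d - 1) / 4) * math.log(math.pi) \
        + sum(math.lgamma(a + (1 - j) / 2) for j in range(1, d + 1))

def logdet(A):
    """log-determinant of a small symmetric PD matrix via Cholesky."""
    d = len(A)
    L = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return 2.0 * sum(math.log(L[i][i]) for i in range(d))

def log_w(C, S, n, b, D):
    """log w(C) for a clique or separator C (a list of vertex indices)."""
    c = len(C)
    DC = [[D[i][j] for j in C] for i in C]
    DS = [[D[i][j] + S[i][j] for j in C] for i in C]  # D_C + X_C^T X_C
    return (0.5 * (b + c - 1) * logdet(DC)
            - 0.5 * (b + n + c - 1) * logdet(DS)
            + 0.5 * n * c * math.log(2.0)
            + log_multigamma(c, (b + n + c - 1) / 2)
            - log_multigamma(c, (b + c - 1) / 2))

def log_marginal_hiw(S, n, d, cliques, seps, b, D):
    """Eq. (17): log f(X | G) for a decomposable G with given cliques/separators."""
    out = -0.5 * n * d * math.log(2 * math.pi)
    out += sum(log_w(C, S, n, b, D) for C in cliques)
    out -= sum(log_w(Sp, S, n, b, D) for Sp in seps)
    return out
```

A useful sanity check: for the edgeless graph on two vertices (two singleton cliques, no separators), the result must factor into the product of two univariate inverse-gamma marginals.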
For the multivariate normal inverse-gamma example in Section 3.4, where we drew approximate posterior samples from the mean field approximation of the posterior distribution, each of the $n = 100$ observations was first drawn from a $d$-dimensional normal distribution according to the linear regression model presented in Section 3.1. In the experiments, we took $d = 9$, and the prior hyperparameters were $\mu_\beta = 0_d$ and $V_\beta = I_d$, with $b = 1$. The posterior distribution of $\beta$ was approximated by a product of 3-dimensional normal distributions, each with mean and covariance components $(\mu_n^{(i)}, V_n^{(i)})$ for $i = 1, 2, 3$. These were extracted from the true posterior parameters $(\mu_n, V_n)$ (defined in Section 5.2.2) in the following way: $\mu_n = \big( \mu_n^{(1)\top}, \mu_n^{(2)\top}, \mu_n^{(3)\top} \big)^\top$, and each $V_n^{(i)}$ is the $i$-th $3 \times 3$ diagonal block of $V_n$. The true value of $\beta$ is shown as a heatmap in Figure 8, and $\sigma^2 = 4$. Using the formula for the marginal likelihood derived in Eq. (12), we computed the true log marginal likelihood to be -147.3245.

Figure 8:
True value of $\beta$.
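The extraction described above amounts to slicing $\mu_n$ and the diagonal blocks of $V_n$; a minimal Python sketch, with stand-in numbers rather than the fitted posterior parameters, is:

```python
# Extract the block means and covariances (mu_n^(i), V_n^(i)) of the
# mean-field factors from the full posterior parameters (mu_n, V_n).
# The values below are illustrative stand-ins, not fitted quantities.
d, blk = 9, 3
mu_n = [float(i) for i in range(1, d + 1)]
V_n = [[1.0 if i == j else 0.1 for j in range(d)] for i in range(d)]

factors = []
for s in range(0, d, blk):
    idx = range(s, s + blk)
    mu_i = [mu_n[i] for i in idx]                     # mu_n^{(i)}
    V_i = [[V_n[i][j] for j in idx] for i in idx]     # 3 x 3 diagonal block of V_n
    factors.append((mu_i, V_i))
```

Each `(mu_i, V_i)` pair then parameterizes one 3-dimensional normal factor of $q(\beta)$; the off-block entries of $V_n$ are discarded, which is exactly the independence assumption the mean field approximation imposes.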