Approximate Bayesian Inference via Sparse grid Quadrature Evaluation for Hierarchical Models
Joshua Hewitt and Jennifer A. Hoeting
Colorado State University
Abstract:
We combine conditioning techniques with sparse grid quadrature rules to develop a computationally efficient method to approximate marginal, but not necessarily univariate, posterior quantities, yielding approximate Bayesian inference via Sparse grid Quadrature Evaluation (BISQuE) for hierarchical models. BISQuE reformulates posterior quantities as weighted integrals of conditional quantities, such as densities and expectations. Sparse grid quadrature rules allow computationally efficient approximation of high dimensional integrals, which appear in hierarchical models with many hyperparameters. BISQuE reduces computational effort relative to standard Markov chain Monte Carlo methods by at least two orders of magnitude on several applied and illustrative models. We also briefly discuss using BISQuE to apply Integrated Nested Laplace Approximations (INLA) to models with more hyperparameters than is currently practical.
Keywords:
Approximate Bayesian inference, computational statistics, hierarchical models, parallel computing, sparse grid quadrature
Computationally efficient posterior approximation remains a key challenge in applied Bayesian analyses, especially for hierarchical models. Hierarchical Bayesian models allow flexible modeling of complex data, but make posterior inference challenging because simple, conjugate distributions are typically unavailable. Posterior densities, expectations, and other quantities involve computing integrals that often require numerical approximation. The required approximations can be computationally expensive or challenging since many hierarchical models include many unknown parameters, thus integrals are defined over high dimensional state spaces. Sampling-based approaches, like Markov chain Monte Carlo (MCMC) methods, are widely used because they are generally reliable and relatively simple to implement (Gelfand and Smith, 1990). However, MCMC approximations can be computationally expensive for many models. Full conditional posterior distributions required by a Gibbs sampler can be difficult to sample efficiently or lead to highly correlated Monte Carlo samplers. As a result, if $n$ dependent samples are drawn via MCMC methods, the stochastic approximation error rate can often be higher than the error $O_p(n^{-1/2})$ for direct Monte Carlo approximations. Alternate approaches for Bayesian approximation are available via a range of stochastic and deterministic methods, including Laplace and Integrated Nested Laplace approximations (Rue, Martino and Chopin, 2009; Tierney and Kadane, 1986), classical quadrature-based approximations (Naylor and Smith, 1982), Variational Bayes (Attias, 2000), and Approximate Bayesian Computation (Rubin, 1984; Tavaré, Balding, Griffiths and Donnelly, 1997). Generally, each method is motivated by computational issues and structures found in different classes of models, so no method is necessarily well-suited for all hierarchical models.
In particular, technical limitations of Integrated Nested Laplace Approximations (INLA) and classical quadrature motivate us to develop a strategy to yield approximate Bayesian Inference via Sparse grid Quadrature Evaluation (BISQuE) for a wider range of hierarchical models.

INLA approximates marginal posterior distributions by using a discrete numerical integration grid of hyperparameters to average over Laplace approximations of conditional posterior densities. The method is developed for models that link observations to latent Gaussian variables through link functions, similar to generalized linear models. The INLA approximation enables fast inference for a wide range of scientifically relevant models. However, it can sometimes be difficult to reformulate models to have the latent Gaussian structure required by the INLA framework. Additionally, the numerical integration in INLA can become computationally infeasible for models with many hyperparameters. The latter issue is a limitation shared by classical quadrature-based approximations for posterior quantities.

Classical quadrature methods can approximate marginal posterior distributions and expectations for general Bayesian models, but like INLA, the models must have relatively small dimension (Naylor and Smith, 1982). Quadrature methods approximate an integral by evaluating its integrand at deterministic nodes and weighting the results. Nodes and weights are chosen using known information about the shape of the integrand. However, classical quadrature methods have limited practical use for approximate Bayesian inference.
Classical quadrature methods integrate over all unknown parameters, not just hyperparameters, and the size of the integration grids suffers from the curse of dimensionality, growing exponentially as parameters are added to models.

More recent quadrature literature formalized theory and methods that yield sparse integration grids, thereby mitigating the curse of dimensionality for quadrature approximations of high dimensional integrals (Gerstner and Griebel, 1998; Novak and Ritter, 1996, 1999). In statistics, sparse grid quadrature methods have been used to approximate likelihoods that involve high dimensional integrals, as can arise from econometric models (Heiss and Winschel, 2008). Sparse grid quadrature has also been used to approximate posterior expectations, densities, and integration constants for non-linear inverse problems with normal errors (Emery and Johnson, 2012; Schillings and Schwab, 2013), to estimate Kullback-Leibler information gains to solve Bayesian experimental design problems (Long, Scavino, Tempone and Wang, 2013), and to accelerate computations for specific non-linear Kalman filters (Arasaratnam and Haykin, 2009; Jia, Xin and Cheng, 2012). By comparison, we consider approximate Bayesian posterior inference more generally.

We propose reformulating Bayesian posterior quantities, such as densities and expectations, so that they can be efficiently approximated by combining conditioning techniques with sparse grid quadrature methods. Our reformulation lets us apply sparse grid quadrature methods to hierarchical Bayesian models with non-Gaussian structures and potentially many hyperparameters. The resulting computational approach greatly reduces computation time as compared to MCMC approaches for many models, including fully non-Gaussian models. Our framework is very flexible and can be applied in other contexts. For example, it can also potentially be combined with INLA to allow fast inference for latent Gaussian models with many hyperparameters.

We briefly review quadrature and sparse grid methods (Section 2), then introduce the Bayesian Inference via Sparse grid Quadrature Evaluation (BISQuE) strategy to yield approximate inference for hierarchical Bayesian models (Section 3). Our method reduces the computational effort required to approximate posterior densities, means, and variances in examples where traditional MCMC methods are relatively slow (Section 4). We conclude with discussions of extensions and other directions for future work (Section 5).
Let $f(\boldsymbol{x})$ be a map from a $d$-dimensional space $S$ onto the real line $\mathbb{R}$, and let $w(\boldsymbol{x})$ be a weight function with the same support. The integral

$$I(f) = \int_S f(\boldsymbol{x}) \, w(\boldsymbol{x}) \, d\boldsymbol{x} \tag{1}$$

may be approximated via the weighted sum

$$\hat{I}(f) = \sum_{\ell=1}^{k_i} f\big(\boldsymbol{x}^{(i,\ell)}\big) \, w^{(i,\ell)} \tag{2}$$

for some choice of summation length $k_i \in \mathbb{N}$, nodes $\mathcal{A}_i = \{\boldsymbol{x}^{(i,\ell)} : \ell = 1, \dots, k_i\} \subset S$, and weights $\mathcal{W}_i = \{w^{(i,\ell)} : \ell = 1, \dots, k_i\} \subset \mathbb{R}$. We will use the index $i$ shortly. The approximation (2) is called a quadrature rule if the integration domain $S$, weight function $w$, and desired approximation accuracy or computational cost are used with specific procedures to specify $k_i$, $\mathcal{A}_i$, and $\mathcal{W}_i$ (Givens and Hoeting, 2013, Section 5.3). The number of nodes and weights $k_i$ balances the approximation error in (2) with the approximation's computational cost. Large $k_i$ can yield more accurate approximation (or even exact evaluation) of (1), but at potentially high computational cost. In practice, sequences of increasingly accurate quadrature rules defined by $(k_1, \mathcal{A}_1, \mathcal{W}_1), (k_2, \mathcal{A}_2, \mathcal{W}_2), \dots$ such that $k_1 < k_2 < \dots$ can be used to estimate and control approximation error (Laurie, 1985). Quadrature rules can yield highly accurate approximations for integrals $I(f)$ of smooth functions $f$ defined on $S$, but computational efficiency is difficult to achieve if $S$ has high dimension.

For multidimensional $S$, product rules are the simplest quadrature rules to construct, but these suffer from the curse of dimensionality. Product rules are formed by iteratively applying univariate quadrature rules along each dimension of $S$ to approximate (1); they are aptly named because their nodes $\mathcal{A}_i$ are a Cartesian product of nodes from the underlying univariate quadrature rules (cf. Novak and Ritter, 1996).
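As a concrete instance of the weighted-sum form (2), the sketch below (Python with NumPy; the integrand and node count are our own illustrative choices, not taken from the paper) applies a Gauss-Hermite rule, whose weight function is $w(x) = e^{-x^2}$ on $S = \mathbb{R}$:

```python
import numpy as np

# Gauss-Hermite quadrature: nodes and weights for integrals of the form
# I(f) = \int f(x) exp(-x^2) dx over the real line.
k = 5
nodes, weights = np.polynomial.hermite.hermgauss(k)

# A k-node Gaussian rule integrates polynomials up to degree 2k - 1 exactly,
# so f(x) = x^4 is integrated without error: I(f) = 3 * sqrt(pi) / 4.
f = lambda x: x ** 4
I_hat = float(np.sum(f(nodes) * weights))
exact = 3.0 * np.sqrt(np.pi) / 4.0
```

Increasing $k$ traces out exactly the kind of sequence $(k_1, \mathcal{A}_1, \mathcal{W}_1), (k_2, \mathcal{A}_2, \mathcal{W}_2), \dots$ used above to estimate and control approximation error.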
To be precise, let $S$ be the product space $S = S_1 \times \cdots \times S_d$ of one-dimensional, $\sigma$-finite measure spaces $S_1, \dots, S_d$, and let the weight function $w(\boldsymbol{x})$ be the product $w(\boldsymbol{x}) = \prod_{j=1}^d w_j(x_j)$ of weight functions $w_1(x_1), \dots, w_d(x_d)$ that are respectively defined on $S_1, \dots, S_d$. If the target integral (1) is well defined, then Fubini's theorem implies (1) may be evaluated as an iterated integral. Iterated integration allows approximation by applying univariate quadrature rules along each dimension of $S$. Define $U_1^{i_1}, \dots, U_d^{i_d}$ to be univariate quadrature rules that respectively approximate integrals on $S_1, \dots, S_d$ with $k_{i_1}, \dots, k_{i_d}$ nodes $\mathcal{A}_1^{i_1}, \dots, \mathcal{A}_d^{i_d}$ and weights $\mathcal{W}_1^{i_1}, \dots, \mathcal{W}_d^{i_d}$. The product rule that approximates (1) is defined via

$$\big(U_1^{i_1} \otimes \cdots \otimes U_d^{i_d}\big)(f) = \sum_{\ell_1=1}^{k_{i_1}} \cdots \sum_{\ell_d=1}^{k_{i_d}} f\big(x_1^{(i_1,\ell_1)}, \dots, x_d^{(i_d,\ell_d)}\big) \, w_1^{(i_1,\ell_1)} \cdots w_d^{(i_d,\ell_d)}. \tag{3}$$

The product rule (3) is a special case of the general approximation form (2) because the nested sum in (3) may be re-expressed as a single sum over an enumeration of the quadrature nodes $(x_1^{(i_1,\ell_1)}, \dots, x_d^{(i_d,\ell_d)}) \in \mathcal{A}_1^{i_1} \times \cdots \times \mathcal{A}_d^{i_d}$ that uses aggregated weights $w_1^{(i_1,\ell_1)} \cdots w_d^{(i_d,\ell_d)}$. The product rule (3) requires evaluation of $f$ at $\big|\mathcal{A}_1^{i_1} \times \cdots \times \mathcal{A}_d^{i_d}\big| = k_{i_1} \cdots k_{i_d}$ nodes. The number of quadrature nodes grows exponentially as $d \uparrow \infty$ if $f$ is explored equally in all dimensions, i.e., if $k_{i_1} = \cdots = k_{i_d}$. The curse of dimensionality for product rules can be partially mitigated by exploring $f$ unequally in different dimensions, but this approach is only practical if $f$ is extremely smooth in some dimensions.

By comparison, sparse grid quadrature rules are computationally efficient approximations for integrals on multidimensional $S$.
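The Cartesian-product construction (3), with its aggregated weights and exponential node count, can be sketched as follows (Python; the integrand and sizes are illustrative assumptions):

```python
import itertools
import numpy as np

def product_rule(univariate_rules):
    """Build Cartesian-product nodes and aggregated weights, as in (3)."""
    node_sets, weight_sets = zip(*univariate_rules)
    nodes = np.array(list(itertools.product(*node_sets)))
    weights = np.array([np.prod(ws) for ws in itertools.product(*weight_sets)])
    return nodes, weights

d = 3
rule_1d = np.polynomial.hermite.hermgauss(5)      # 5 nodes per dimension
nodes, weights = product_rule([rule_1d] * d)      # 5**3 = 125 nodes in total

# Approximate \int (x1^2 + x2^2 + x3^2) exp(-||x||^2) dx over R^3;
# the exact value is (3/2) * pi^(3/2).
f = lambda x: np.sum(x ** 2, axis=-1)
I_hat = float(np.dot(f(nodes), weights))
```

Adding a fourth dimension with the same univariate rule would multiply the node count by 5 again, which is the curse of dimensionality in miniature.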
Novak and Ritter (1996, 1999) use the Smolyak (1963) formula to combine univariate quadrature rules $U_1^{i_1}, \dots, U_d^{i_d}$ in a computationally efficient approximation (2) of (1). The Smolyak formula specifies a linear combination $A(q, d)$ of product rules (3) that approximates (1) via

$$A(q, d)(f) = \sum_{q-d+1 \le |\boldsymbol{i}| \le q} (-1)^{q-|\boldsymbol{i}|} \binom{d-1}{q-|\boldsymbol{i}|} \big(U_1^{i_1} \otimes \cdots \otimes U_d^{i_d}\big)(f), \tag{4}$$

in which $q \ge d$ and $|\boldsymbol{i}| = i_1 + \cdots + i_d$. The Smolyak rule (4) is also a special case of the general approximation form (2) because, as with (3), the underlying quadrature nodes and weights may be enumerated and aggregated. The constant $q \in \mathbb{N}$ is called the rule's level and most directly controls the accuracy and computational cost of the approximation in applications. The Smolyak rule (4) is called a sparse grid quadrature rule if each of the $j = 1, \dots, d$ univariate quadrature rules has nested nodes in the sense that $\mathcal{A}_j^1 \subset \mathcal{A}_j^2 \subset \cdots$. The rule (4) requires evaluation of $f$ at the nodes

$$\mathcal{A}(q, d) = \bigcup_{q-d+1 \le |\boldsymbol{i}| \le q} \mathcal{A}_1^{i_1} \times \cdots \times \mathcal{A}_d^{i_d}.$$

Adopting the convention that $\mathcal{A}_j^1 = \{x_j\}$ for some base point $x_j \in S_j$, nesting implies $\mathcal{A}(q, d)$ is a sparse subset of the nodes used by the product rule $(U_1^q \otimes \cdots \otimes U_d^q)(f)$.
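The combination formula (4) can be sketched directly (Python; for simplicity this toy uses non-nested Gauss-Hermite rules with $k_i = i$ nodes at level $i$, whereas practical sparse grids use nested families such as the Genz-Keister rules mentioned later):

```python
import itertools
from math import comb
import numpy as np

def univariate_rule(level):
    # Level-i univariate rule: a Gauss-Hermite rule with i nodes
    # (an illustrative, non-nested level-to-size map).
    return np.polynomial.hermite.hermgauss(level)

def product_rule_eval(f, levels):
    """Evaluate the product rule (U^{i_1} x ... x U^{i_d})(f) from (3)."""
    rules = [list(zip(*univariate_rule(i))) for i in levels]
    total = 0.0
    for combo in itertools.product(*rules):
        x = np.array([node for node, _ in combo])
        w = np.prod([weight for _, weight in combo])
        total += f(x) * w
    return total

def smolyak(f, q, d):
    """Smolyak rule A(q, d)(f): the signed combination of product rules in (4)."""
    total = 0.0
    for i in itertools.product(range(1, q + 1), repeat=d):
        excess = q - sum(i)                 # q - |i|
        if 0 <= excess <= d - 1:            # i.e., q - d + 1 <= |i| <= q
            total += (-1) ** excess * comb(d - 1, excess) * product_rule_eval(f, i)
    return total

# \int (x1^2 + x2^2) exp(-||x||^2) dx = pi; the level q = 3 rule recovers it.
f = lambda x: x[0] ** 2 + x[1] ** 2
approx = smolyak(f, q=3, d=2)
```

With nested univariate rules, the product grids in (4) overlap heavily, which is what makes $\mathcal{A}(q, d)$ sparse relative to the full product grid.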
The sparse grid quadrature rule (4) mitigates the curse of dimensionality by creating sparse integration grids relative to product rules, but requires $f$ to satisfy stricter smoothness properties in exchange. Novak and Ritter (1999) present growth rates, bounds, and approximations for the number of quadrature nodes $k = |\mathcal{A}(q, d)|$ under different scenarios. Novak and Ritter (1996) also show that the approximation error's order of convergence is

$$|I(f) - A(q, d)(f)| = O\big(k^{-r} (\log k)^{(d-1)(r+1)}\big)$$

if $f$ has a bounded mixed derivative $f^{(r,\dots,r)}$. Even more precisely, Novak and Ritter (1999) show that $I(f) = A(q, d)(f)$ if $f$ is a polynomial with bounded total degree, i.e., that the approximation (4) is exact for the integral (1). In practice, the sparse grid quadrature rule (4) is most computationally efficient for functions $f$ that behave approximately as polynomials with relatively low total degree. In statistical contexts, this is similar to saying that the rule (4) is most useful for polynomial surfaces $f$ that are mainly driven by main effects and low order interaction terms. We will satisfy this requirement for computational efficiency in our application by appealing, in part, to the Bayesian central limit theorem to claim that many posterior surfaces and other quantities can be well approximated by the product of a Gaussian weight function $w(\boldsymbol{x})$ with a relatively low-order correction term $f$.

We combine conditioning techniques with sparse grid quadrature rules to develop specialized, computationally efficient formulas like (4) that approximate Bayesian posterior inference for marginal quantities. For example, when used to approximate marginal posterior densities, our method will yield a weighted mixture of full conditional posterior distributions.
Below, we briefly motivate the Bayesian Inference via Sparse grid Quadrature Evaluation (BISQuE) approximation strategy by arguing that it can be computationally inefficient to use sparse grid quadrature rules to directly approximate posterior quantities (Section 3.1). Our motivation simultaneously highlights the general strategy used to apply sparse grid quadrature rules to Bayesian models as well as key technical issues addressed by BISQuE. Then, the remainder of Section 3 defines the family of posterior quantities to which BISQuE applies (Section 3.2), the BISQuE approximation (Section 3.3), and a nested integration technique that is useful for applying BISQuE to models that lack closed form expressions of posterior densities (Section 3.4).
Consider a generic hierarchical Bayesian model. Let $\boldsymbol{X} \in \Omega$ be a sample of continuous, discrete, or mixed random variables from an arbitrary process. Define a conditional probability model for $\boldsymbol{X}$ such that

$$\boldsymbol{X} \mid \boldsymbol{\theta}_1, \boldsymbol{\theta}_2 \sim f(\boldsymbol{X} \mid \boldsymbol{\theta}_1, \boldsymbol{\theta}_2), \qquad (\boldsymbol{\theta}_1, \boldsymbol{\theta}_2) \sim f(\boldsymbol{\theta}_1, \boldsymbol{\theta}_2) \tag{5}$$

for parameters $\boldsymbol{\theta}_1 \in \Omega_1$ and $\boldsymbol{\theta}_2 \in \Omega_2$. Many Bayesian models can be written like (5). For example, many hierarchical Bayesian models add conditional independence assumptions and hierarchical structure to (5) so that

$$f(\boldsymbol{X} \mid \boldsymbol{\theta}_1, \boldsymbol{\theta}_2) = f(\boldsymbol{X} \mid \boldsymbol{\theta}_1), \qquad f(\boldsymbol{\theta}_1, \boldsymbol{\theta}_2) = f(\boldsymbol{\theta}_1 \mid \boldsymbol{\theta}_2) f(\boldsymbol{\theta}_2).$$

Non-hierarchical models also fit within our framework (5). For example, Bayesian formulations of some linear regression models specify prior independence between regression coefficients $\boldsymbol{\theta}_1$ and variance components $\boldsymbol{\theta}_2$, thus define $f(\boldsymbol{\theta}_1, \boldsymbol{\theta}_2) = f(\boldsymbol{\theta}_1) f(\boldsymbol{\theta}_2)$.

The marginal posterior density $f(\boldsymbol{\theta}_1 \mid \boldsymbol{X})$ is often of interest in posterior inference. The density may be computed by integrating $\boldsymbol{\theta}_2$ out of the joint posterior density

$$f(\boldsymbol{\theta}_1 \mid \boldsymbol{X}) = \int f(\boldsymbol{\theta}_1, \boldsymbol{\theta}_2 \mid \boldsymbol{X}) \, d\boldsymbol{\theta}_2. \tag{6}$$

Sparse grid quadrature rules (4) yield weighted-sum approximations (2) of (6) by introducing a weight function $w(\boldsymbol{\theta}_1, \boldsymbol{\theta}_2, \boldsymbol{X})$ and proceeding via

$$f(\boldsymbol{\theta}_1 \mid \boldsymbol{X}) = \int \frac{f(\boldsymbol{\theta}_1, \boldsymbol{\theta}_2 \mid \boldsymbol{X})}{w(\boldsymbol{\theta}_1, \boldsymbol{\theta}_2, \boldsymbol{X})} \, w(\boldsymbol{\theta}_1, \boldsymbol{\theta}_2, \boldsymbol{X}) \, d\boldsymbol{\theta}_2 \approx \sum_{\ell=1}^{k_i} \frac{f\big(\boldsymbol{\theta}_1, \boldsymbol{\theta}_2^{(i,\ell)} \mid \boldsymbol{X}\big)}{w\big(\boldsymbol{\theta}_1, \boldsymbol{\theta}_2^{(i,\ell)}, \boldsymbol{X}\big)} \, w^{(i,\ell,\boldsymbol{\theta}_1)}, \tag{7}$$

in which quadrature nodes $\boldsymbol{\theta}_2^{(i,\ell)}$ and weights $w^{(i,\ell,\boldsymbol{\theta}_1)}$ are determined by applying the Smolyak formula (4) to a collection of univariate quadrature rules that are appropriate for the support of $\boldsymbol{\theta}_2$.
For fixed $\boldsymbol{\theta}_1 \in \Omega_1$, the Gaussian approximation to $f(\boldsymbol{\theta}_1, \boldsymbol{\theta}_2 \mid \boldsymbol{X})$ will often be a sensible default choice for the weight function $w(\boldsymbol{\theta}_1, \boldsymbol{\theta}_2, \boldsymbol{X})$ since the weight ratio $f/w$ in (7) accounts for deviations from normality in $f(\boldsymbol{\theta}_1, \boldsymbol{\theta}_2 \mid \boldsymbol{X})$.

The direct marginal posterior density approximation (7) has two key inefficiencies that the BISQuE approximation completely avoids or minimizes. First, the weight function $w$ depends on $\boldsymbol{\theta}_1$, which implies a separate weight function must be used to approximate $f(\boldsymbol{\theta}_1 \mid \boldsymbol{X})$ at each $\boldsymbol{\theta}_1 \in \Omega_1$. Second, the approximation (7) assumes $f(\boldsymbol{\theta}_1, \boldsymbol{\theta}_2 \mid \boldsymbol{X})$ is computable. Oftentimes, the joint posterior density $f(\boldsymbol{\theta}_1, \boldsymbol{\theta}_2 \mid \boldsymbol{X})$ is only known in closed form up to a proportionality constant because the density's integration constant requires numerical approximation for many Bayesian models. While sparse grid quadrature rules could approximate the integration constant, BISQuE is able to avoid or reduce the computational cost of the approximation.

We develop BISQuE to approximate marginal posterior quantities $h(\boldsymbol{\theta}_1; \boldsymbol{X})$ of hierarchical models (5) that are defined implicitly with respect to a function or random variable $h(\boldsymbol{\theta}_1, \boldsymbol{\theta}_2; \boldsymbol{X})$ via

$$h(\boldsymbol{\theta}_1; \boldsymbol{X}) = \int h(\boldsymbol{\theta}_1, \boldsymbol{\theta}_2; \boldsymbol{X}) \, f(\boldsymbol{\theta}_2 \mid \boldsymbol{X}) \, d\boldsymbol{\theta}_2. \tag{8}$$

For example, the construction (8) defines the marginal posterior density $h(\boldsymbol{\theta}_1; \boldsymbol{X}) = f(\boldsymbol{\theta}_1 \mid \boldsymbol{X})$ when $h(\boldsymbol{\theta}_1, \boldsymbol{\theta}_2; \boldsymbol{X}) = f(\boldsymbol{\theta}_1 \mid \boldsymbol{\theta}_2, \boldsymbol{X})$. The posterior marginal density $f(\boldsymbol{\theta}_2 \mid \boldsymbol{X})$ and all other marginal posterior quantities may be formed by switching the roles of $\boldsymbol{\theta}_1$ and $\boldsymbol{\theta}_2$. In comparison to the definition (6) used in the direct sparse grid approximation (7), the BISQuE construction (8) uses conditioning to express the joint posterior density in conditional form, as $f(\boldsymbol{\theta}_1, \boldsymbol{\theta}_2 \mid \boldsymbol{X}) = f(\boldsymbol{\theta}_1 \mid \boldsymbol{\theta}_2, \boldsymbol{X}) f(\boldsymbol{\theta}_2 \mid \boldsymbol{X})$.
The construction (8) allows us to develop sparse grid quadrature rules with weight functions $w(\boldsymbol{\theta}_2, \boldsymbol{X})$ that only depend on $\boldsymbol{\theta}_2$ (Section 3.3), thus addresses the first technical issue described in Section 3.1.

The BISQuE construction (8) allows one set of quadrature nodes and weights to be reused to approximate many posterior quantities. For example, (8) defines the posterior mean $h(\boldsymbol{\theta}_1; \boldsymbol{X}) = \mathrm{E}[g(\boldsymbol{\theta}_1) \mid \boldsymbol{X}]$ when $h(\boldsymbol{\theta}_1, \boldsymbol{\theta}_2; \boldsymbol{X}) = \mathrm{E}[g(\boldsymbol{\theta}_1) \mid \boldsymbol{\theta}_2, \boldsymbol{X}]$. Again, the approach relies on conditioning as

$$\mathrm{E}[g(\boldsymbol{\theta}_1) \mid \boldsymbol{X}] = \mathrm{E}_{\boldsymbol{\theta}_2 \mid \boldsymbol{X}}\big\{\mathrm{E}[g(\boldsymbol{\theta}_1) \mid \boldsymbol{\theta}_2, \boldsymbol{X}]\big\} = \int \mathrm{E}[g(\boldsymbol{\theta}_1) \mid \boldsymbol{\theta}_2, \boldsymbol{X}] \, f(\boldsymbol{\theta}_2 \mid \boldsymbol{X}) \, d\boldsymbol{\theta}_2.$$

Posterior predictive distributions, variances and higher central moments, cumulative distribution functions, and model selection criteria such as the deviance information criterion (DIC, Spiegelhalter, Best, Carlin and van der Linde, 2002) and the Watanabe-Akaike information criterion (WAIC, Watanabe, 2010) can also be expressed through one or more applications of (8). For example, the posterior variance $\mathrm{Var}(g(\boldsymbol{\theta}_1) \mid \boldsymbol{X})$ can be approximated by using the law of total variance to introduce expectations with respect to $f(\boldsymbol{\theta}_2 \mid \boldsymbol{X})$ via

$$\mathrm{Var}(g(\boldsymbol{\theta}_1) \mid \boldsymbol{X}) = \mathrm{E}_{\boldsymbol{\theta}_2 \mid \boldsymbol{X}}\big[\mathrm{Var}(g(\boldsymbol{\theta}_1) \mid \boldsymbol{\theta}_2, \boldsymbol{X})\big] + \mathrm{E}_{\boldsymbol{\theta}_2 \mid \boldsymbol{X}}\big[\big(\mathrm{E}[g(\boldsymbol{\theta}_1) \mid \boldsymbol{\theta}_2, \boldsymbol{X}] - \mathrm{E}[g(\boldsymbol{\theta}_1) \mid \boldsymbol{X}]\big)^2\big], \tag{9}$$

for which

$$h(\boldsymbol{\theta}_1, \boldsymbol{\theta}_2; \boldsymbol{X}) = \mathrm{Var}(g(\boldsymbol{\theta}_1) \mid \boldsymbol{\theta}_2, \boldsymbol{X}) + \big(\mathrm{E}[g(\boldsymbol{\theta}_1) \mid \boldsymbol{\theta}_2, \boldsymbol{X}] - \mathrm{E}[g(\boldsymbol{\theta}_1) \mid \boldsymbol{X}]\big)^2. \tag{10}$$

Note that here the marginal posterior expectation $\mathrm{E}[g(\boldsymbol{\theta}_1) \mid \boldsymbol{X}]$ must be approximated before (9). We present expressions for the other quantities in Supplement Section A.

We modify the integral form (1) to enable the use of sparse grid quadrature rules (4) to approximate marginal posterior quantities (8) of hierarchical Bayesian models (5). While we define marginal posterior quantities by integrating functions over the posterior density $f(\boldsymbol{\theta}_2 \mid \boldsymbol{X})$, numerical integration methods often use transformations to increase computational stability and efficiency.
Thus, we develop quadrature rules that integrate over $f(\boldsymbol{\nu} \mid \boldsymbol{X})$, where $\boldsymbol{\nu} = T(\boldsymbol{\theta}_2) \in \mathbb{R}^p$ is defined by a monotone transformation to a real coordinate space $T : \Omega_2 \to \mathbb{R}^p$. Consider a transformed density

$$f(\boldsymbol{\nu} \mid \boldsymbol{X}) = f\big(T^{-1}(\boldsymbol{\nu}) \mid \boldsymbol{X}\big) \, \big|J\big(T^{-1}(\boldsymbol{\nu})\big)\big|,$$

where $|J(T^{-1}(\boldsymbol{\nu}))|$ is the determinant of the Jacobian for the transformation $T^{-1}$. We propose using sparse grid quadrature rules (4) to derive quadrature nodes and weights that approximate marginal posterior quantities (8) via the BISQuE approximation

$$h(\boldsymbol{\theta}_1; \boldsymbol{X}) = \int h\big(\boldsymbol{\theta}_1, T^{-1}(\boldsymbol{\nu}); \boldsymbol{X}\big) \frac{f(\boldsymbol{\nu} \mid \boldsymbol{X})}{w(\boldsymbol{\nu}, \boldsymbol{X})} \, w(\boldsymbol{\nu}, \boldsymbol{X}) \, d\boldsymbol{\nu} \approx \sum_{\ell=1}^{k_i} h\big(\boldsymbol{\theta}_1, \boldsymbol{\theta}_2^{(i,\ell)}; \boldsymbol{X}\big) \, \tilde{w}^{(i,\ell)}, \tag{11}$$

in which

$$\tilde{w}^{(i,\ell)} = \frac{f\big(\boldsymbol{\nu}^{(i,\ell)} \mid \boldsymbol{X}\big)}{w\big(\boldsymbol{\nu}^{(i,\ell)}, \boldsymbol{X}\big)} \, w^{(i,\ell)},$$

$w(\boldsymbol{\nu}, \boldsymbol{X})$ is a weight function; and $w^{(i,\ell)}$, $\boldsymbol{\nu}^{(i,\ell)}$, and $\boldsymbol{\theta}_2^{(i,\ell)} = T^{-1}\big(\boldsymbol{\nu}^{(i,\ell)}\big)$ are respectively quadrature weights, nodes, and back-transformed nodes.

Sparse grid quadrature theory implies the computational efficiency of the approximation (11) relies on several statistical and numerical assumptions. The weight function $w(\boldsymbol{\nu}, \boldsymbol{X})$ should approximate the transformed density $f(\boldsymbol{\nu} \mid \boldsymbol{X})$ and have known, computationally efficient, nested quadrature rules. In particular, such quadrature rules have been developed for Gaussian weight functions (Genz and Keister, 1996). Thus, we appeal to Bayesian analogs of the central limit theorem to justify proposing the Gaussian approximation $f_G(\boldsymbol{\nu} \mid \boldsymbol{X})$ at the posterior mode of $f(\boldsymbol{\nu} \mid \boldsymbol{X})$ as a sensible default choice for a weight function for many Bayesian models. This approximation holds if the sample size is large, the dimension of the model is fixed, and both the prior and likelihood are twice differentiable near the mode of the posterior distribution (Berger, 1985, pg. 224–225). Sparse grid quadrature rules will also be most efficient if the integrand $h(\boldsymbol{\theta}_1, T^{-1}(\boldsymbol{\nu}); \boldsymbol{X}) f(\boldsymbol{\nu} \mid \boldsymbol{X}) / w(\boldsymbol{\nu}, \boldsymbol{X})$ in (11) can be well-approximated by a low-order polynomial in $\boldsymbol{\nu}$. This requirement is easier to satisfy if $w(\boldsymbol{\nu}, \boldsymbol{X})$ approximates $f(\boldsymbol{\nu} \mid \boldsymbol{X})$ well and $h(\boldsymbol{\theta}_1, T^{-1}(\boldsymbol{\nu}); \boldsymbol{X})$ is slowly varying with respect to $\boldsymbol{\nu}$.

Standardizing the BISQuE approximation (11) weights $\tilde{w}^{(i,\ell)}$ can address part of the second technical issue described in Section 3.1. For example, in some Bayesian models both the joint $f(\boldsymbol{\theta}_1, \boldsymbol{\theta}_2 \mid \boldsymbol{X})$ and marginal $f(\boldsymbol{\theta}_2 \mid \boldsymbol{X})$ posterior densities are known only up to a proportionality constant, but the full conditional posterior $f(\boldsymbol{\theta}_1 \mid \boldsymbol{\theta}_2, \boldsymbol{X})$ is available in closed form (Section 4). Marginal posterior probabilities and expectations cannot be computed without either approximating the proportionality constant or using numerical approximation techniques that implicitly cancel the constant. We propose using standardized weights $\tilde{w}^{(i,\ell)}_* = \tilde{w}^{(i,\ell)} / \sum_{j=1}^{k_i} \tilde{w}^{(i,j)}$ that sum to one in order to approximate marginal posterior quantities $h(\boldsymbol{\theta}_1; \boldsymbol{X})$ like $f(\boldsymbol{\theta}_1 \mid \boldsymbol{X})$ by implicitly cancelling the unknown integration constants. The result borrows ideas from importance sampling (Givens and Hoeting, 2013, pg. 181). An alternate definition for posterior quantities,

$$h(\boldsymbol{\theta}_1; \boldsymbol{X}) = \frac{\int h(\boldsymbol{\theta}_1, \boldsymbol{\theta}_2; \boldsymbol{X}) \, f(\boldsymbol{\theta}_2 \mid \boldsymbol{X}) \, d\boldsymbol{\theta}_2}{\int f(\boldsymbol{\theta}_2 \mid \boldsymbol{X}) \, d\boldsymbol{\theta}_2}, \tag{12}$$

is equivalent to the original construction (8) since $\int f(\boldsymbol{\theta}_2 \mid \boldsymbol{X}) \, d\boldsymbol{\theta}_2 = 1$. Plugging BISQuE approximations (11) into the numerator and denominator of (12) yields quadrature approximations with standardized weights via

$$\frac{\int h(\boldsymbol{\theta}_1, \boldsymbol{\theta}_2; \boldsymbol{X}) \, f(\boldsymbol{\theta}_2 \mid \boldsymbol{X}) \, d\boldsymbol{\theta}_2}{\int f(\boldsymbol{\theta}_2 \mid \boldsymbol{X}) \, d\boldsymbol{\theta}_2} \approx \frac{\sum_{\ell=1}^{k_i} h\big(\boldsymbol{\theta}_1, \boldsymbol{\theta}_2^{(i,\ell)}; \boldsymbol{X}\big) \, \tilde{w}^{(i,\ell)}}{\sum_{j=1}^{k_i} \tilde{w}^{(i,j)}} = \sum_{\ell=1}^{k_i} h\big(\boldsymbol{\theta}_1, \boldsymbol{\theta}_2^{(i,\ell)}; \boldsymbol{X}\big) \, \tilde{w}^{(i,\ell)}_*. \tag{13}$$
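To make the standardized weights concrete, the toy sketch below (Python; the model, priors, and all numerical choices are our own illustrative assumptions, not the paper's examples) applies this strategy with $\boldsymbol{\theta}_1 = \mu$, $\boldsymbol{\theta}_2 = \sigma^2$, and $\nu = T(\sigma^2) = \log \sigma^2$ for the model $X_i \sim \mathcal{N}(\mu, \sigma^2)$ with $f(\mu) \propto 1$ and $f(\sigma^2) \propto 1/\sigma^2$. The marginal $f(\mu \mid \boldsymbol{X})$ becomes a weighted mixture of the Gaussian full conditionals $f(\mu \mid \nu, \boldsymbol{X})$, with standardized weights that cancel the unknown normalizing constant:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(loc=2.0, scale=1.5, size=25)      # simulated data
n, xbar = X.size, X.mean()
S = np.sum((X - xbar) ** 2)

# Unnormalized log posterior of nu = log(sigma^2) after integrating mu out:
# log f(nu | X) = -(n - 1)/2 * nu - S/2 * exp(-nu) + const.
log_post_nu = lambda nu: -0.5 * (n - 1) * nu - 0.5 * S * np.exp(-nu)

# Gaussian weight function w(nu, X): centered at the posterior mode of nu,
# with variance from the inverse negative Hessian (both closed form here).
nu_hat = np.log(S / (n - 1))
s2 = 2.0 / (n - 1)

# Quadrature nodes on the weight function's scale (a dense Gauss-Hermite rule
# stands in for the nested rules used in practice).
t, w = np.polynomial.hermite.hermgauss(15)
nu_nodes = nu_hat + np.sqrt(2.0 * s2) * t

# Standardized weights: ratios f(nu | X) / w(nu, X) times quadrature weights,
# normalized so unknown proportionality constants cancel.
log_w_tilde = log_post_nu(nu_nodes) + 0.5 * (nu_nodes - nu_hat) ** 2 / s2 + np.log(w)
w_star = np.exp(log_w_tilde - log_w_tilde.max())
w_star /= w_star.sum()

def marginal_density(mu):
    # f(mu | X) as a weighted mixture of conditionals f(mu | nu, X),
    # each of which is N(xbar, exp(nu)/n) under this model.
    sd = np.sqrt(np.exp(nu_nodes) / n)
    comps = np.exp(-0.5 * ((mu - xbar) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))
    return float(np.sum(w_star * comps))
```

In this toy model the exact marginal is a Student-$t$ density, so the mixture can be checked against a closed form; in realistic models the same nodes and weights are reused across every quantity of the form (8).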
Standardization also allows approximations of $f(\boldsymbol{\theta}_1 \mid \boldsymbol{X})$ to integrate exactly to one. Table 1 summarizes the BISQuE approach outlined in this section as it would be applied when using a Gaussian approximation to the transformed posterior density to approximate posterior quantities (8).

While hierarchical Bayesian models (5) typically have closed form expressions for the likelihood $f(\boldsymbol{X} \mid \boldsymbol{\theta}_1, \boldsymbol{\theta}_2)$ and prior $f(\boldsymbol{\theta}_1, \boldsymbol{\theta}_2)$, many models do not have closed form expressions for the posterior densities $f(\boldsymbol{\theta}_2 \mid \boldsymbol{X})$ and $f(\boldsymbol{\theta}_1 \mid \boldsymbol{\theta}_2, \boldsymbol{X})$. Lack of closed form expressions is a concern related to the second technical issue described in Section 3.1. We propose a nested numerical integration scheme to address the concern and allow application of BISQuE to a wider range of models. Recall that for a fixed dataset $\boldsymbol{X}$, the joint posterior density $f(\boldsymbol{\theta}_1, \boldsymbol{\theta}_2 \mid \boldsymbol{X})$ is often only known up to a proportionality constant since

$$f(\boldsymbol{\theta}_1, \boldsymbol{\theta}_2 \mid \boldsymbol{X}) = \frac{f(\boldsymbol{\theta}_1, \boldsymbol{\theta}_2, \boldsymbol{X})}{f(\boldsymbol{X})} \propto f(\boldsymbol{X} \mid \boldsymbol{\theta}_1, \boldsymbol{\theta}_2) f(\boldsymbol{\theta}_1, \boldsymbol{\theta}_2)$$

and the marginal density $f(\boldsymbol{X})$ often requires prohibitively expensive numerical approximation.

The densities $f(\boldsymbol{\theta}_2 \mid \boldsymbol{X})$ and $f(\boldsymbol{\theta}_1 \mid \boldsymbol{\theta}_2, \boldsymbol{X})$ may be derived (and ultimately approximated) indirectly, by factoring the joint density $f(\boldsymbol{\theta}_1, \boldsymbol{\theta}_2, \boldsymbol{X})$ into components $g_1(\boldsymbol{\theta}_1, \boldsymbol{\theta}_2; \boldsymbol{X})$ and $g_2(\boldsymbol{\theta}_2; \boldsymbol{X})$ such that

$$f(\boldsymbol{\theta}_1, \boldsymbol{\theta}_2, \boldsymbol{X}) = g_1(\boldsymbol{\theta}_1, \boldsymbol{\theta}_2; \boldsymbol{X}) \, g_2(\boldsymbol{\theta}_2; \boldsymbol{X}). \tag{14}$$

The factored joint density (14) implies

$$f(\boldsymbol{\theta}_2 \mid \boldsymbol{X}) = \int f(\boldsymbol{\theta}_1, \boldsymbol{\theta}_2 \mid \boldsymbol{X}) \, d\boldsymbol{\theta}_1 = \frac{g_2(\boldsymbol{\theta}_2; \boldsymbol{X}) \, C(\boldsymbol{\theta}_2)}{f(\boldsymbol{X})} \tag{15}$$

and

$$f(\boldsymbol{\theta}_1 \mid \boldsymbol{\theta}_2, \boldsymbol{X}) = \frac{f(\boldsymbol{\theta}_1, \boldsymbol{\theta}_2 \mid \boldsymbol{X})}{f(\boldsymbol{\theta}_2 \mid \boldsymbol{X})} = \frac{g_1(\boldsymbol{\theta}_1, \boldsymbol{\theta}_2; \boldsymbol{X})}{C(\boldsymbol{\theta}_2)}, \tag{16}$$

for which the integration constant $C(\boldsymbol{\theta}_2)$ must be approximated numerically and is specified via

$$C(\boldsymbol{\theta}_2) = \int g_1(\boldsymbol{\theta}_1, \boldsymbol{\theta}_2; \boldsymbol{X}) \, d\boldsymbol{\theta}_1. \tag{17}$$

The alternate expressions (15) and (16) allow BISQuE to approximate posterior inference for models that lack closed form expressions for the densities $f(\boldsymbol{\theta}_2 \mid \boldsymbol{X})$ and $f(\boldsymbol{\theta}_1 \mid \boldsymbol{\theta}_2, \boldsymbol{X})$. Standardized BISQuE weights $\tilde{w}^{(i,\ell)}_*$ implicitly cancel the unknown factor $f(\boldsymbol{X})$, and standard quadrature techniques can efficiently approximate the integration constant (17) when the parameter vector $\boldsymbol{\theta}_1$ has small dimension. The parameters $\boldsymbol{\theta}_1$ and $\boldsymbol{\theta}_2$ can often be defined or repartitioned to satisfy this requirement because the hierarchical model (5) places few restrictions on the parameters; we use this flexibility in Section 4. The added computational cost of the nested integration (17) is minimized because the integration constant (17) only needs to be approximated relatively few times, specifically, at the quadrature nodes and when developing the weight function, e.g., the Gaussian approximation at the posterior mode.

We demonstrate the benefits of the BISQuE approximation (11) on data that are typically analyzed with standard, Gibbs sampling techniques for approximate Bayesian posterior inference. We approximate posterior inference for a fully non-Gaussian capture-recapture model (Section 4.1), a spatial Gaussian process model (Section 4.2), and a more complex, applied spatial Gaussian process model for climate teleconnection (Section 4.3). Posterior distributions in the first and third examples respectively require integration over 8- and 5-dimensional parameter vectors $\boldsymbol{\theta}_2$.
Posterior approximations for the second and third examples have computational complexity that is $O(MN^3)$ in the number of spatial observations $N$ and the number of points $M$ at which the posterior distribution is explored, thus computational strategies like BISQuE that reduce the number of points required for posterior approximation can be extremely beneficial.

We compare posterior inference and computational effort between standard Gibbs sampling techniques and BISQuE. Computational effort is measured indirectly with respect to computation time. All computations are conducted on a modest workstation with eight logical processors. We use parallelization to compute the $k_i$ mixture components of the BISQuE approximation and to draw posterior predictive samples via composition sampling in the spatial examples (cf. Banerjee, Carlin and Gelfand, 2015, pg. 126). For each posterior quantity, the level $q$ for the underlying sparse grid quadrature rule (4) is chosen to be the smallest value (i.e., the simplest approximation) such that the posterior density approximations have converged. The number of Gibbs steps used in each approximation is similarly chosen. The BISQuE approximation also requires specification of univariate quadrature rules, for which we choose nested Gauss-Hermite rules (Genz and Keister, 1996).

Givens and Hoeting (2013, example 7.7) analyze data from a capture-recapture study conducted in New Zealand. The study's research goal was to estimate the total number of pups in a fur seal colony, $N \in \mathbb{N}$. Researchers visited the colony $I = 7$ times throughout the course of a single season. In each visit, the researchers captured and marked all of the fur seal pups present, noting the total number of pups captured in each visit $\boldsymbol{c} = (c_1, \dots, c_I) \in \mathbb{N}^I$ in addition to the number of newly captured pups $m_1, \dots, m_I \in \mathbb{N}$. The data are analyzed using a Bayesian model for capture-recapture data (18), and posterior distributions are approximated with a Gibbs sampler.
Gibbs sampling is particularly inefficient for this model because one pair of hyperparameters has high posterior correlation and is only weakly identified by the data. By comparison, the BISQuE strategy (11) approximates posterior quantities for this model with substantially less computational effort.

The model (18) assumes the total population size $N$ remains fixed during the time period of the study (i.e., the model assumes a closed population). Let $r = \sum_{i=1}^I m_i$ be the total number of pups captured during the study. Givens and Hoeting (2013) introduce a vector $\boldsymbol{\alpha} = (\alpha_1, \dots, \alpha_I) \in [0, 1]^I$ with capture probabilities for each census attempt and discuss modeling the data with the hierarchical model

$$f(\boldsymbol{c}, r \mid N, \boldsymbol{\alpha}) \propto \frac{N!}{(N - r)!} \prod_{i=1}^I \alpha_i^{c_i} (1 - \alpha_i)^{N - c_i}, \quad f(N) \propto 1/N, \quad \alpha_i \mid \theta_1, \theta_2 \sim \mathrm{Beta}(\theta_1, \theta_2) \;\; \text{for } i = 1, \dots, I, \quad f(\theta_1, \theta_2) \propto \exp\{-(\theta_1 + \theta_2)/c_0\}, \tag{18}$$

in which $(\theta_1, \theta_2)$ are hyperparameters for the capture probabilities and $c_0$ is a fixed prior rate constant. We use the Beta distribution's mean–sample size parameterization to increase the identifiability of the hyperparameters. Specifically, let $U_1 = \mathrm{logit}\left(\theta_1/(\theta_1 + \theta_2)\right)$ and $U_2 = \log(\theta_1 + \theta_2)$, and fix $U_2 = 5$. Givens and Hoeting (2013) use standard Gibbs sampling approaches to draw posterior samples for model parameters. The full conditional posterior distributions $f(N \mid \boldsymbol{c}, r, \boldsymbol{\alpha}, \theta_1, \theta_2)$ and $f(\boldsymbol{\alpha} \mid \boldsymbol{c}, r, N, \theta_1, \theta_2)$ are conjugate and easy to sample. Posterior samples for $U_1$ are drawn using Metropolis steps. The sampler is run for 100,000 iterations, taking 298 seconds to complete; posterior inference uses the final 50,000 samples.

We use the BISQuE strategy to approximate the posterior marginal densities $f(N \mid \boldsymbol{c}, r)$, $f(\alpha_i \mid \boldsymbol{c}, r)$, and $f(U_1 \mid \boldsymbol{c}, r)$. Table 2 connects this example's notation to that used with BISQuE.
When used as the BISQuE conditioning variable $\boldsymbol{\theta}_2$, we map parameters to the real line by using log transforms with $N - r$ and logit transforms with the capture probabilities $\boldsymbol{\alpha}$. We also rely on the Gaussian approximation to the negative binomial distribution in order to justify using $N$ as a conditioning variable $\boldsymbol{\theta}_2$ in BISQuE. Almost all conditional and marginal posterior densities required for BISQuE are computable in closed form up to a proportionality constant; refer to Givens and Hoeting (2013, eqs. 7.16, 7.17) and Supplement Section C.1 for details. The posterior $f(U_1 \mid \boldsymbol{c}, r)$ requires approximation via the nested integration strategies of Section 3.4.
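All that BISQuE needs at each quadrature node is a cheap evaluation of the model's unnormalized log densities. A sketch of the likelihood term from (18) (Python; the counts below are made-up placeholders, not the fur seal data):

```python
import numpy as np
from math import lgamma

def log_lik(N, alpha, c, r):
    """Unnormalized log-likelihood from (18):
    log[ N!/(N - r)! * prod_i alpha_i^{c_i} (1 - alpha_i)^{N - c_i} ]."""
    alpha = np.asarray(alpha, dtype=float)
    c = np.asarray(c, dtype=float)
    return (lgamma(N + 1) - lgamma(N - r + 1)
            + float(np.sum(c * np.log(alpha) + (N - c) * np.log1p(-alpha))))

# Hypothetical counts for I = 7 census visits (placeholders only).
c = np.array([30, 22, 29, 26, 31, 32, 35])
m = np.array([30, 8, 17, 7, 9, 8, 5])    # newly captured pups per visit
r = int(m.sum())                          # total distinct pups captured

ll = log_lik(N=120, alpha=np.full(7, 0.25), c=c, r=r)
```

Working with `lgamma` rather than raw factorials keeps the evaluation numerically stable for large $N$, which matters when the quadrature nodes for $\log(N - r)$ reach far into the tails.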
Posterior inference via BISQuE is effectively identical to posterior inference via Gibbs sampling, but is computed with substantially less effort. Gibbs sampling takes 298 seconds to complete on our test machine, whereas the BISQuE approximations require a total of 5 seconds (Table 2), and the posterior densities are nearly identical (Figures 1 and 2).
We work with data simulated from a geostatistical spatial model. Gibbs sampling is computationally expensive for such models because it involves decomposing spatially-structured covariance matrices in R^{N×N} at each Gibbs iteration, where N is the number of observations. Let {X(s)}_{s∈D} be a random field whose stochasticity is defined by a mean-zero Gaussian process on a continuous spatial domain D ⊂ R². Let the covariance Cov(X(s), X(t)) between random variates X(s), X(t) be specified by the isotropic Matérn covariance function, defined via

κ(s, t; σ², ρ, ν) = σ² [2^{1−ν}/Γ(ν)] (‖s − t‖/ρ)^ν K_ν(‖s − t‖/ρ),

in which ‖·‖ is the Euclidean norm and K_ν is the modified Bessel function of the second kind with order ν >
0, which governs the smoothness of the process; σ² > 0 is the scale and ρ > 0 is the range. The vector of observations X = (X(s1), …, X(sN))^T ∈ R^N at locations S = {s1, …, sN} ⊂ D is normally distributed, X ~ N(0, Σ). The covariance matrix Σ ∈ R^{N×N} is spatially structured, with entries Σ_ij = κ(s_i, s_j; σ², ρ, ν). The Gaussian process assumption allows estimation of the field {X(s)}_{s∈D} at unobserved locations S0 = {s01, …, s0M} ⊂ D via kriging, which uses conditional normal distributions for the unobserved responses. Standard Bayesian hierarchical modeling techniques for spatial data (e.g., Banerjee et al., 2015, Chapter 6) use conjugate or weakly informative priors for the covariance parameters, specified via

σ² ~ Inverse-Gamma(a, b),
ρ ~ Uniform(L_ρ, U_ρ),
ν ~ Uniform(L_ν, U_ν).

We simulate one dataset with N = 300 locations, sampled uniformly from the unit square D = [0, 1]², and with covariance parameters (σ², ρ, ν) = (1, ·, ·). We predict the field {X(s)}_{s∈D} at M = 400 unobserved, gridded locations S0 ⊂ D. The priors are specified via (a, b, L_ρ, U_ρ, L_ν, U_ν) = (2, ·, ·, ·, ·, ·). Standard techniques approximate posterior distributions with a Gibbs sampler and composition sampling (e.g., Banerjee et al., 2015, Chapter 6). Conjugate distributions are used to sample the scale σ² and unobserved field values X0 = (X(s01), …, X(s0M)) ∈ R^M, but Metropolis steps are used for the range ρ and smoothness ν parameters. The Gibbs sampler is used to draw 60,000 posterior samples for the covariance parameters, taking 2,043 seconds to complete; posterior inference uses the final 30,000 iterations. After drawing posterior samples for the covariance parameters, composition sampling is used to draw samples for the unobserved field values X0 in parallel, taking 608 seconds to complete (Banerjee et al., 2015, pg. 126).

We use the BISQuE strategy to approximate the posterior density f(X0 | X).
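The isotropic Matérn covariance function above can be evaluated directly with standard special functions; a minimal sketch (the function name is ours) follows.

```python
import numpy as np
from scipy.special import gamma, kv

def matern(s, t, sigma2, rho, nu):
    """Isotropic Matern covariance kappa(s, t; sigma^2, rho, nu)."""
    d = np.linalg.norm(np.asarray(s, dtype=float) - np.asarray(t, dtype=float))
    if d == 0.0:
        return sigma2                      # kappa -> sigma^2 as ||s - t|| -> 0
    u = d / rho
    return sigma2 * 2.0 ** (1.0 - nu) / gamma(nu) * u ** nu * kv(nu, u)
```

With ν = 1/2 the function reduces to the exponential covariance σ² exp(−‖s − t‖/ρ), a convenient check on the implementation.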
Sparse grid quadrature techniques are used to directly approximate the marginal posterior densities f(σ² | X), f(ρ | X), and f(ν | X). Table 2 connects this example's notation to that used with BISQuE. When used as the BISQuE conditioning variable θ2, we map covariance parameters to the real line by log-transforming the scale parameter σ² and logit-transforming the range ρ and smoothness ν parameters. All conditional and marginal posterior densities required for
BISQuE are computable in closed form up to a proportionality constant; refer to Banerjee et al. (2015, eqs. 2.15–2.16) for details.

Posterior inference via BISQuE and sparse grid quadrature is effectively identical to posterior inference via Gibbs sampling, but is computed with substantially less effort. Drawing posterior covariance parameter samples takes 2,043 seconds and composition sampling takes an additional 608 seconds, whereas the BISQuE and sparse grid quadrature approximations take a total of 238 seconds (Table 2), and posterior inference is nearly identical (Figures 3 and 4).
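The kriging step in this example, conditional normal prediction at unobserved locations given the observed field, can be sketched as follows (names are ours; any covariance function, such as a Matérn, may be passed in).

```python
import numpy as np

def krige(X_obs, S_obs, S_new, cov, jitter=1e-10):
    """Conditional mean and covariance of a mean-zero Gaussian process at new
    locations, given observations X_obs at S_obs. cov(s, t) is any covariance
    function; a small jitter stabilizes the linear solve."""
    n = len(S_obs)
    K = np.array([[cov(a, b) for b in S_obs] for a in S_obs]) + jitter * np.eye(n)
    K_no = np.array([[cov(a, b) for b in S_obs] for a in S_new])   # cross-covariances
    K_nn = np.array([[cov(a, b) for b in S_new] for a in S_new])
    A = np.linalg.solve(K, K_no.T).T        # K_no K^{-1}
    mu = A @ np.asarray(X_obs, dtype=float)
    Sigma = K_nn - A @ K_no.T
    return mu, Sigma
```

The predictor interpolates the data: at an observed location the conditional mean equals the observation and the conditional variance collapses toward zero.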
While most spatial data can be modeled under the assumption that distant points are uncorrelated, large-scale atmospheric circulations can induce dependence between fields separated by large distances. The resulting climate phenomena, known as teleconnections, may be modeled using remote effects spatial process (RESP) models, which can improve teleconnection-based predictions of seasonal precipitation (Hewitt, Hoeting, Done and Towler, 2018). The RESP model is given by

Y(s, t) = x^T(s, t) β + w(s, t) + γ(s, t),     (19)

which uses a stochastic teleconnection term

γ(s, t) = ∫_{D_Z} z(r, t) α(s, r) dr     (20)

to extend standard geostatistical regression models for a process {Y(s, t) : s ∈ D_Y, t ∈ T} defined on a continuous spatial domain D_Y for discrete times T. Regression coefficients β and spatially-correlated variation w(s, t) are augmented by (20), which uses doubly-indexed random effects α(s, r) to aggregate the impact of remote covariates {z(r, t) : r ∈ D_Z, t ∈ T}, such as sea surface temperatures, on a distant response, such as the standardized deviation Y(s, t) from mean seasonal precipitation. The authors adopt the climate science convention that mean precipitation is treated as known, and the standardized deviation Y(s, t) is the scientifically interesting response variable to model.

The RESP model uses two isotropic Matérn covariance functions κ(s, s′; σ²_w, ρ_w, ν_w) and κ(r, r′; σ²_α, ρ_α, ν_α), and a nugget effect σ²_ε, to define Gaussian processes that model the spatial variation {w(s, t) : s ∈ D_Y} and teleconnection effects {α(s, r) : s ∈ D_Y, r ∈ D_Z}. The Matérn smoothness parameters ν_w and ν_α are treated as fixed, and standard priors are used to model the remaining regression coefficients β and covariance parameters σ²_w, ρ_w, σ²_ε, σ²_α, and ρ_α (cf. Section 4.2.1).

We follow Hewitt et al.
(2018) and use the RESP model to analyze Colorado precipitation data in a statistical downscaling-like scenario. The RESP model regresses standardized deviations Y(s, t) from mean Colorado precipitation observed at 240 locations s ∈ D_Y onto local surface temperatures x(s, t) and Pacific Ocean sea surface temperatures z(r, t). The model is fit to Winter averages from 1981–2012, and an ordinal response Ỹ(s, t) ∈ {v1, …, vm} is predicted for Winter 2013, given the covariate values x(s, t) and z(r, t) for t = 2013. The distribution for Ỹ(s, t) is induced by known cut points c0(s), …, cm(s) and defined such that P(Ỹ(s, t) = vi) = P(c_{i−1}(s) < Y(s, t) < c_i(s)). In this application, the ordinal response Ỹ(s, t) represents below average (v1), about average (v2), or above average (v3) precipitation.

Hewitt et al. (2018) construct a Gibbs sampler that approximates posterior distributions for the RESP model (19). Gibbs sampling is computationally expensive for the RESP model because two spatially-structured covariance matrices must be decomposed at each Gibbs iteration. Let Y denote all observations Y(s, t) from t = 1981, …, 2012; Y0 denote all unobserved responses Y(s, t) at t = 2013; and Ỹ denote all unobserved ordinal responses Ỹ(s, t) at t = 2013. Conjugate distributions are used to sample the regression parameters β, scales σ²_w and σ²_α, and continuous predictions Y0; and Metropolis steps are used for the ranges ρ_w and ρ_α. The Gibbs sampler is used to draw 41,000 posterior samples for the regression and covariance parameters, taking 8,331 seconds to complete; posterior inference discards the first 1,000 iterations as the chain mixes quickly, but requires many iterations to control Monte Carlo integration error. Composition sampling is then used to draw samples for the predicted response Y0 in parallel, taking 755 seconds to complete.
The continuous posterior predictive density f(Y0 | Y) is discretized after sampling to approximate f(Ỹ | Y) by using the empirical quantiles of historical precipitation as cut points c0(s), …, cm(s).

We use the BISQuE strategy to approximate the posterior predictive densities f(Y0 | Y) and f(Ỹ | Y). In particular, we use the BISQuE strategy to directly approximate f(Ỹ | Y) by letting h(θ1, θ2; X) in (11) be the conditional cumulative distribution function for Y0. Table 2 connects this example's notation to that used with BISQuE. When used as the BISQuE conditioning variable θ2, we map covariance parameters to the real line by log-transforming scale parameters σ² and logit-transforming range parameters ρ. All conditional and marginal posterior densities required for BISQuE are computable in closed form up to a proportionality constant; refer to Hewitt et al. (2018) for distributional results.

Posterior inference via BISQuE is effectively identical to posterior inference via Gibbs sampling, but is computed with substantially less effort. Drawing posterior covariance parameter samples takes 8,331 seconds and composition sampling takes an additional 755 seconds, whereas the BISQuE approximations take a total of 118 seconds (Table 2), and posterior inference is nearly identical (e.g., Figure 5). The approximate BISQuE and Gibbs posterior masses P̂(Ỹ(s, t) = vi | Y) agree to at least two decimal places for all 240 locations s ∈ D_Y and values v1, v2, v3; additional computing effort can further reduce approximation errors, but offers limited practical benefit because the discretization is coarse.
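The discretization step, converting a continuous conditional predictive distribution into ordinal category probabilities via cut points, can be sketched as follows. This is an illustrative sketch that assumes a Gaussian conditional distribution and arbitrary cut points; the paper's cut points come from empirical precipitation quantiles.

```python
import numpy as np
from scipy.stats import norm

def ordinal_probs(mu, sd, cuts):
    """P(Y_tilde = v_i) = P(c_{i-1} < Y < c_i) for a Gaussian conditional
    distribution of Y; cuts = (c_0, ..., c_m) with c_0 = -inf and c_m = +inf."""
    cdf = norm.cdf(np.asarray(cuts, dtype=float), loc=mu, scale=sd)
    return np.diff(cdf)                    # successive CDF differences
```

Because the conditional CDF plays the role of h(θ1, θ2; X) in (11), evaluating it at quadrature nodes and mixing with the BISQuE weights gives the approximate posterior masses directly.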
We combine conditioning techniques with sparse grid quadrature rules to develop approximate Bayesian Inference via Sparse grid Quadrature Evaluation (BISQuE). Approximations (11) are developed by reformulating Bayesian posterior quantities, such as densities and expectations, so that they may be approximated as weighted mixtures of conditional quantities h(θ1, θ2; X). The integration nodes and weights from sparse grid quadrature rules are used to build mixing weights w^(i,ℓ) and conditioning values θ2^(i,ℓ). In a similar manner as general quadrature techniques and importance sampling methods, the final BISQuE approximation weights w̃^(i,ℓ) use weight ratios f(ν^(i,ℓ) | X)/w(ν^(i,ℓ), X) to align the "theoretical distribution" f(ν | X) with the "sampling distribution" w(ν, X) (Givens and Hoeting, 2013, pgs. 143, 181). Nested integration strategies enable BISQuE approximations (11) when models do not have closed form expressions for required components (Section 3.4). Posterior approximation via BISQuE is deterministic and computationally efficient, offering faster computation than MCMC methods for a wide range of models (5) and posterior quantities (8). In our applications, we find that BISQuE often reduces overall computing time by at least two orders of magnitude and yields nearly identical inference to standard MCMC approaches (Section 4).

The BISQuE approximation is similar to, and can be combined with, Integrated Nested Laplace Approximations (INLA) for latent Gaussian models (Rue et al., 2009). Combining BISQuE with INLA can yield an approximation technique that scales better to models with more hyperparameters.
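The weight-ratio idea, placing quadrature nodes under an approximating "sampling distribution" and realigning them with the true posterior through the ratio, can be illustrated on a one-dimensional toy problem. The sketch below is ours: the toy posterior, the Gaussian approximation, and the function names are all illustrative, and a one-dimensional Gauss-Hermite rule stands in for a sparse grid rule.

```python
import numpy as np
from numpy.polynomial.hermite import hermgauss

def log_p(theta):
    """Toy unnormalized log posterior: here f(theta | X) is N(0.5, 1)."""
    return -0.5 * (theta - 0.5) ** 2

def posterior_mean(level=20):
    """Standardized-weight approximation of E[theta | X]: nodes are placed
    under a Gaussian approximation g = N(0, 1) to the posterior, and the
    weight ratio p(theta) / g(theta) realigns the rule with the true
    (unnormalized) posterior, in the spirit of the standardized weights (13)."""
    t, w = hermgauss(level)                 # rule for the weight exp(-t^2)
    theta = np.sqrt(2.0) * t                # rescale nodes so g = N(0, 1)
    log_g = -0.5 * theta ** 2               # g up to a normalizing constant
    w_tilde = w * np.exp(log_p(theta) - log_g)
    w_tilde /= w_tilde.sum()                # standardize the weights
    return np.sum(w_tilde * theta)
```

Even though the nodes are centered on the wrong mean, standardizing the ratio-adjusted weights recovers the correct posterior mean, here 0.5, without ever computing the posterior's normalizing constant.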
Similar to INLA, our framework will be most efficient when used to approximate low-dimensional posterior quantities, like marginal densities or joint densities with computationally tractable closed form expressions (e.g., f(X0 | X) in Section 4.2). However, BISQuE does not require that a model have a latent Gaussian structure and is thus applicable to a broad class of models, such as the population estimation model of Section 4.1.

We can combine the BISQuE approximation (11) and INLA because both methods use conditioning and integration grids to yield fast, deterministic posterior approximation. In terms of the general hierarchical model (5), INLA specifies a hierarchical parameter model f(θ1, θ2) = f(θ1 | θ2) f(θ2) in which f(θ1 | θ2) is Gaussian and f(θ2) is a prior distribution for relatively low-dimensional hyperparameters θ2. Rue et al. (2009) define θ1 = (θ11, …, θ1i, …, θ1n), develop an integration grid, and use Laplace approximations for f(θ1i | θ2, X) and f(θ2 | X) to approximate the marginal posterior density f(θ1i | X). The nested Laplace approximations can be embedded in the BISQuE approximation (11), yielding posterior approximation that uses an alternate integration grid to INLA. The embedding can be beneficial because sparse grid quadrature rules allow for more computationally efficient approximation in models with higher dimensional hyperparameters θ2. Specifically, Rue et al. (2009) suggest creating integration grids for models with high-dimensional θ2 by using central composite design (CCD) methods, an experimental design and response surface technique for approximating second order surfaces with relatively few function evaluations (Box and Wilson, 1951).
When integration is the main concern, sparse grid quadrature methods can require substantially fewer integration nodes in high dimensions (Novak and Ritter, 1999, Table 2, ℓ = 3) than CCD-based grids (Sanchez and Sanchez, 2005, Table 3).

Our BISQuE approximation advances Bayesian computing for hierarchical models, but open questions remain for wider application of the method. Notably, our approximation requires the ability to evaluate h(θ1, θ2; X) quickly, so it may often be limited to marginal posterior inference for θ1 with relatively small dimension. Our approximation also relies on the availability of nested quadrature rules for θ2. It is difficult to develop quadrature rules for discrete variables, thus practical use of our approximation may be limited to models with parameters θ2 defined on continuous spaces Ω2. Fast convergence of our approximation also relies on the availability of accurate approximations to f(θ2 | X). If the BISQuE approximation (11) has not converged, intuition about numerical integration suggests the resulting approximation will likely underestimate posterior variability (Rue et al., 2009). However, Rue et al. (2009, Section 6.5) also point out that f(θ2 | X) often becomes increasingly Gaussian as the dimension of θ1 grows, since the Bayesian structure will increase variability and regularity with the dimension, which will help accelerate convergence.

The BISQuE methodology suggests continued development in several areas. Additional diagnostics should be developed for wider practical application of the BISQuE approximation (11). The approximation's convergence can be monitored by checking the approximation's stability as the level q of the underlying sparse grid quadrature rule (4) is increased (Laurie, 1985). However, this does not necessarily provide a diagnostic that can assess how well conditioned a model (5) or posterior quantity (8) is for use with BISQuE.
Drawing from importance sampling, studying the weight ratio f(ν^(i,ℓ) | X)/w(ν^(i,ℓ), X) in (11) at quadrature nodes ν^(i,ℓ) may help diagnose practical issues. Theoretical smoothness properties of h(θ1, θ2; X) or concentration of the posterior density f(θ2 | X) may also provide insight into the conditioning for specific models.

Software is available for implementing BISQuE approximations. We have developed the bisque package for R, which computes BISQuE approximations for user-specified models. Custom implementations of BISQuE can also be developed for specific, high performance applications with the use of software libraries, including the mvQuad package for R and the SGMGA libraries for C and C++ (Burkardt, 2007; Weiser, 2016). These libraries contain tables and routines that compute sparse grid quadrature nodes and weights if w(ν, X) is a member of a standard family of weight functions (Givens and Hoeting, 2013, Table 5.6).

Supplementary materials
Additional information and supporting material for this article are available online at the journal's website.
Acknowledgements
This material is based upon work supported by the National Science Foundation undergrant number AGS–1419558. Any opinions, findings, and conclusions or recommendationsexpressed in this material are those of the authors and do not necessarily reflect the viewsof the National Science Foundation.
References
Arasaratnam, I. and Haykin, S. (2009) Cubature Kalman Filters. IEEE Transactions on Automatic Control, 1254–1269.

Attias, H. (2000) A Variational Bayesian Framework for Graphical Models. Advances in Neural Information Processing Systems, 209–215.

Banerjee, S., Carlin, B. P. and Gelfand, A. E. (2015) Hierarchical Modeling and Analysis for Spatial Data. Boca Raton, FL: CRC Press, second edn.

Berger, J. O. (1985) Statistical Decision Theory and Bayesian Analysis. New York: Springer Science+Business Media, LLC, second edn.

Box, G. E. P. and Wilson, K. B. (1951) On the Experimental Attainment of Optimum Conditions. Journal of the Royal Statistical Society Series B, 1–45.

Burkardt, J. (2007) Sparse Grid Mixed Growth Anisotropic Rules. URL: https://people.sc.fsu.edu/~jburkardt/cpp_src/sgmga/sgmga.html.

Emery, A. F. and Johnson, K. C. (2012) Practical considerations when using sparse grids with Bayesian inference for parameter estimation. Inverse Problems in Science and Engineering, 591–608.

Gelfand, A. E. and Smith, A. F. (1990) Sampling-Based Approaches to Calculating Marginal Densities. Journal of the American Statistical Association, 398–409.

Genz, A. and Keister, B. D. (1996) Fully Symmetric Interpolatory Rules for Multiple Integrals over Infinite Regions with Gaussian Weight. Journal of Computational and Applied Mathematics, 299–309.

Gerstner, T. and Griebel, M. (1998) Numerical Integration Using Sparse Grids. Numerical Algorithms, 209–232.

Givens, G. H. and Hoeting, J. A. (2013) Computational Statistics. Hoboken, NJ: John Wiley & Sons, Inc., second edn.

Heiss, F. and Winschel, V. (2008) Likelihood approximation by numerical integration on sparse grids. Journal of Econometrics, 62–80.

Hewitt, J., Hoeting, J. A., Done, J. M. and Towler, E. (2018) Remote effects spatial process models for modeling teleconnections. Environmetrics.

Jia, B., Xin, M. and Cheng, Y. (2012) Sparse-grid quadrature nonlinear filtering. Automatica, 327–341.

Laurie, D. P. (1985) Practical error estimation in numerical integration. Journal of Computational and Applied Mathematics, 425–431.

Long, Q., Scavino, M., Tempone, R. and Wang, S. (2013) Fast estimation of expected information gains for Bayesian experimental designs based on Laplace approximations. Computer Methods in Applied Mechanics and Engineering, 24–39.

Naylor, J. C. and Smith, A. F. M. (1982) Applications of a Method for the Efficient Computation of Posterior Distributions. Journal of the Royal Statistical Society, Series C, 214–225.

Novak, E. and Ritter, K. (1996) High dimensional integration of smooth functions over cubes. Numerische Mathematik, 79–97.

— (1999) Simple Cubature Formulas with High Polynomial Exactness. Constructive Approximation, 499–522.

Rubin, D. B. (1984) Bayesianly Justifiable and Relevant Frequency Calculations for the Applied Statistician. The Annals of Statistics, 1151–1172.

Rue, H., Martino, S. and Chopin, N. (2009) Approximate Bayesian Inference for Latent Gaussian Models by Using Integrated Nested Laplace Approximations. Journal of the Royal Statistical Society Series B, 319–392.

Sanchez, S. M. and Sanchez, P. J. (2005) Very large fractional factorials and central composite designs. ACM Transactions on Modeling and Computer Simulation, 362–377.

Schillings, C. and Schwab, C. (2013) Sparse, adaptive Smolyak quadratures for Bayesian inverse problems. Inverse Problems.

Smolyak, S. A. (1963) Quadrature and interpolation formulas for tensor products of certain classes of functions. Doklady Akademii Nauk SSSR, 1042–1045.

Spiegelhalter, D. J., Best, N. G., Carlin, B. P. and van der Linde, A. (2002) Bayesian measures of model complexity and fit. Journal of the Royal Statistical Society Series B, 583–639.

Tavare, S., Balding, D. J., Griffiths, R. C. and Donnelly, P. (1997) Inferring Coalescence Times From DNA Sequence Data. Genetics, 505–518.

Tierney, L. and Kadane, J. B. (1986) Accurate Approximations for Posterior Moments and Marginal Densities. Journal of the American Statistical Association, 82–86.

Watanabe, S. (2010) Asymptotic Equivalence of Bayes Cross Validation and Widely Applicable Information Criterion in Singular Learning Theory. Journal of Machine Learning Research, 3571–3594.

Weiser, C. (2016) mvQuad: Methods for Multivariate Quadrature. URL: http://cran.r-project.org/package=mvQuad.
Figure 1: Fur seals example: A) BISQuE (x) and Gibbs (□) approximations to the posterior density for the total number of fur seal pups f(N | c, r) are nearly identical. B) BISQuE (—) and Gibbs (- · -) approximations to the joint posterior density f(U | c, r) are nearly identical.
Figure 2: Fur seals example: BISQuE (—) and Gibbs (- · -) approximations to the posterior densities f(α_i | c, r) are nearly identical.
Figure 3: Spatial model example: Relative differences between BISQuE and Gibbs approximations, ∆(Y) = (Y_BISQuE − Y_Gibbs)/Y_Gibbs × 100%, for the field {X(s)}_{s∈D} at unobserved locations S0. Nearly all (95%) relative differences in the posterior mean (A) are less than 5.5% (median = 0.4%); relative differences in the mean are artificially large in regions where the posterior mean is near 0. All relative differences in the posterior standard errors (B) are below 3.3% (median = 1.4%).
Figure 4: Spatial model example: A, B, C) Sparse grid quadrature (—) and Gibbs (- · -) approximations to the posterior densities for the spatial covariance parameters (σ², ρ, ν) are nearly identical. The true values of the parameters are marked by grey vertical lines. D) BISQuE (—) and Gibbs (- · -) approximations to the posterior density for the new observation X(s0) are nearly identical at s0 = (·, ·).
Figure 5: BISQuE and Gibbs approximations to the mode of the discretized posterior predictive distribution f(Ỹ | Y) are nearly identical.
Table 1: Summary of steps to develop a BISQuE approximation.

1. Write the posterior quantity of interest in BISQuE form (8). Computable approximations or exact expressions must exist for the components h(θ1, θ2; X) and f(θ2 | X). Section 3.4 proposes nested integration strategies (15) and (16) if approximation is necessary; nested Laplace approximations can also be used for components in latent Gaussian models (cf. Rue et al., 2009).

2. Select a transformation ν = T(θ2) to map θ2 ∈ Ω2 to ν ∈ R^p. Favor transformations T that yield an approximately Gaussian posterior density f(ν | X).

3. Apply the BISQuE approximation that uses unstandardized (11) or standardized (13) weights. The level q ∈ N of the underlying sparse grid quadrature rule (4) determines the integration nodes ν^(i,ℓ) and weights w^(i,ℓ).

4. Increase the level q of the underlying quadrature rule (4) until the approximation (11) or (13) converges. Nested quadrature rules allow the level q approximation to reduce computational cost by reusing quadrature nodes and weight ratios from the level q − 1 approximation.
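The steps above can be sketched end-to-end on a toy model. The sketch below is ours: X | θ ~ N(θ, 1) with prior θ ~ N(0, 1), so the target marginal f(x) is N(0, 2) and can be checked exactly; a one-dimensional Gauss-Hermite rule stands in for a sparse grid rule, and the levels visited are illustrative.

```python
import numpy as np
from numpy.polynomial.hermite import hermgauss
from scipy.stats import norm

def marginal_density(x, tol=1e-8, max_level=60):
    """Approximate f(x) = integral of f(x | theta) f(theta) dtheta as a
    quadrature-weighted mixture of conditional densities (step 1), using
    nodes already on the real line (step 2), and raising the rule's level
    until successive approximations agree (step 4)."""
    prev = None
    for level in range(5, max_level + 1, 5):
        t, w = hermgauss(level)             # nodes/weights for exp(-t^2)
        theta = np.sqrt(2.0) * t            # rescale to the N(0, 1) prior
        # step 3: weighted mixture of conditionals f(x | theta)
        est = np.sum(w / np.sqrt(np.pi) * norm.pdf(x, loc=theta, scale=1.0))
        if prev is not None and abs(est - prev) < tol:
            return est, level
        prev = est
    return prev, max_level
```

For this toy model the loop settles quickly, and the result matches the closed-form marginal density of N(0, 2) to high accuracy.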
Table 2: Definitions of the parameters and posterior quantities for the BISQuE approximations in Section 4. X = (c, r) for the fur seals example (Section 4.1), and X = Y for the remote effects spatial process model example (RESP, Section 4.3). The marginal posterior densities for the covariance parameters (σ², ρ, ν) in the spatial example (Section 4.2) are computed using sparse grid quadrature methods (4) to directly marginalize the joint posterior distribution f(σ², ρ, ν | X) at each evaluation point. Computation times are also presented. For the RESP example, let θ* = (σ²_w, σ²_ε, σ²_α, ρ_w, ρ_α) and I_i(s) = (c_{i−1}(s), c_i(s)).

                                                                        Time (sec.)
Example    h(θ1; X)        h(θ1, θ2; X)                  θ1    θ2          BISQuE  Gibbs
Fur seals  f(N | c, r)     f(N | θ2, c, r)               N     (α, U)      0.3     298
           f(α_i | c, r)   f(α_i | θ2, c, r)             α_i   (N, U)      0.1     298
           f(U | c, r)     f(U | θ2, c, r)               U     α           ·       298
Spatial    E[X0 | X]       E[X0 | θ2, X]                 X0    (σ², ρ, ν)  6       2,651
           Var(X0 | X)     (10)                          X0    (σ², ρ, ν)  6       2,651
           f(X0 | X)       f(X0 | θ2, X)                 X0    (σ², ρ, ν)  6       2,651
           f(σ² | X)       N/A                           σ²    (ρ, ν)      74      2,043
           f(ρ | X)        N/A                           ρ     (σ², ν)     74      2,043
           f(ν | X)        N/A                           ν     (σ², ρ)     74      2,043
RESP       f(Y0 | Y)       f(Y0 | θ2, Y)                 Y0    θ*          118     9,086
           f(Ỹ | Y)        P(Y(s, t) ∈ I_i(s) | θ2, Y)   Y0    θ*          ·       ·