Approximating posteriors with high-dimensional nuisance parameters via integrated rotated Gaussian approximation
BY W. VAN DEN BOOM

Yale-NUS College, National University of Singapore, 16 College Avenue West
[email protected]

G. REEVES AND D. B. DUNSON

Department of Statistical Science, Duke University, Box 90251, Durham, North Carolina 27708, U.S.A.
[email protected] [email protected]

SUMMARY
Posterior computation for high-dimensional data with many parameters can be challenging. This article focuses on a new method for approximating posterior distributions of a low- to moderate-dimensional parameter in the presence of a high-dimensional or otherwise computationally challenging nuisance parameter. The focus is on regression models and the key idea is to separate the likelihood into two components through a rotation. One component involves only the nuisance parameters, which can then be integrated out using a novel type of Gaussian approximation. We provide theory on approximation accuracy that holds for a broad class of forms of the nuisance component and priors. Applying our method to simulated and real data sets shows that it can outperform state-of-the-art posterior approximation approaches.
Some key words: Bayesian statistics; Dimensionality reduction; Marginal inclusion probability; Nuisance parameter; Posterior approximation; Support recovery; Variable selection.
1. INTRODUCTION
Consider the regression model

y ∼ N(Xβ + η, σ²I_n), (1)

where y is an n-dimensional vector of observations, X is an n × p design matrix, β is a p-dimensional parameter of interest, η is an n-dimensional nuisance parameter, and σ² is the error variance. The nuisance parameter can for instance capture the effect of a large set of covariates not included in X, or of errors that have a non-Gaussian distribution. Our goal is to conduct Bayesian inference on the model in (1) when p is of moderate size, with the focus on the posterior

π(β | y) = ∫ π(β, η | y) dη = {1/π(y)} ∫ π(y | β, η) π(β, η) dη. (2)

The integrals in (2) and π(y) are intractable to approximate accurately for certain priors π(β, η), with direct approximations such as Laplace's method producing inaccurate results and Monte Carlo sampling being daunting computationally. Our key idea is to transform the hard problem with nuisance parameter η in a principled way to a p-dimensional one which can be written as a linear model including only β. Then, a low-dimensional inference technique can be applied to this p-dimensional model. The transformation uses a novel type of Gaussian approximation based on a data rotation to integrate out η from (1).

Section 3 discusses models that are special cases of (1). Applications of the model in (1) include epidemiology studies in which y is a health outcome, X consists of exposures and key clinical or demographic factors of interest, and η is the effect of high-dimensional biomarkers. The goal is inference on the effect of the exposures and the clinical or demographic covariates, but adjusting for the high-dimensional biomarkers. For example, η may result from genetic factors, such as single-nucleotide polymorphisms, and we want to control for these factors in identifying an environmental main effect. It is often impossible to isolate the impact of individual genetic factors in such studies, so we consider these effects as nuisance parameters. Another use of (1) is for computation of posterior inclusion probabilities in high-dimensional Bayesian variable selection as detailed in §3·2.

Settings with a high-dimensional nuisance parameter η that is not of primary interest and only a moderate number p of parameters of interest are more and more common. Unfortunately, the complexity of η can make accurate approximation of π(β | y) in (2) challenging even when p = 1. One naive approach is to ignore the nuisance parameter η by setting it to zero. The result can be problematic as omitting η changes the interpretation of the parameter of interest β, which therefore might take on a different value. For example, η might capture the effect of covariates and it can be important to adjust for them to avoid misleading conclusions on β.

Many posterior approximation methods exist, including Monte Carlo (George & McCulloch, 1993, 1997; O'Hara & Sillanpää, 2009), variational Bayes (Carbonetto & Stephens, 2012; Ormerod et al., 2017), integrated nested Laplace approximations (Rue et al., 2009), and expectation propagation (Hernández-Lobato et al., 2015). However, these methods can be computationally expensive, do not apply to our setting, or lack theoretical results regarding approximation accuracy. A notable exception to the latter is the fast posterior approximation algorithm of Huggins et al. (2017), which comes with bounds on the approximation error under conditions on the prior such as log-concavity, Gaussianity, and smoothness. The class of priors that we allow on β and η is much larger. Our method and its analysis for instance apply to dimensionality reduction and shrinkage priors such as spike-and-slab, horseshoe, and Laplace distributions.

The main computational bottleneck of our method is calculation of the mean and variance of a nuisance term, for which one can choose any suitable algorithm. As a result, the computational cost of our method is comparable to that of the fast algorithm chosen for this step.
2. INTEGRATED ROTATED GAUSSIAN APPROXIMATION

2·1. Notation and assumptions

Denote the multivariate Gaussian distribution with mean µ and covariance Σ by N(µ, Σ), and its density function evaluated at a by N(a | µ, Σ). Denote the distribution of a conditional on b by Π(a | b) and its density, with respect to some dominating measure, evaluated at a by π(a | b). We assume that β and η are a priori independent so that Π(β, η) = Π(β) Π(η). We treat X and σ² as known constants unless otherwise noted. Assume that p ≤ n. We assume that X is full rank to simplify the exposition, but our method also applies to rank-deficient X.

2·2. Description of the method

We integrate out η from (1) by splitting the model into two parts, one of which does not involve β. A data rotation provides such a model split. Specifically, consider as rotation matrix the n × n orthogonal matrix Q from the QR decomposition of X. Define the n × p matrix R and the n × (n − p) matrix S by (R, S) = Q. Then, the columns of R form an orthonormal basis for the column space of X since X is full rank by assumption (Golub & Van Loan, 1996). Since Q is orthogonal, the columns of S form an orthonormal basis for the orthogonal complement of the column space of X. Therefore, SᵀX = 0_{(n−p)×p}, an (n − p) × p matrix of zeros, which can also be derived from the fact that QᵀX is upper triangular.

By the rotational invariance of the Gaussian distribution and QᵀQ = I_n, Qᵀy ∼ N(QᵀXβ + Qᵀη, σ²I_n) is distributionally equivalent to (1). This rotated model splits as

Rᵀy ∼ N(RᵀXβ + Rᵀη, σ²I_p), (3a)
Sᵀy ∼ N(Sᵀη, σ²I_{n−p}), (3b)

using SᵀX = 0_{(n−p)×p}. This transformation motivates a two-stage approach in which one first computes Π(η | Sᵀy) from submodel (3b) and then uses this distribution as an updated prior for the projected nuisance term Rᵀη in submodel (3a). Following this approach, the posterior of β can be expressed as Π(β | y) ∝ Π(β) ∫ N(Rᵀy | RᵀXβ + Rᵀη, σ²I_p) dΠ(Rᵀη | Sᵀy).

In practice, Π(Rᵀη | Sᵀy) may be intractable to compute exactly because of the complexity of Π(η). To alleviate this challenge, we consider an approximation Π̂(Rᵀη | Sᵀy), which then leads to an approximation for the posterior of β:

Π̂(β | y) ∝ Π(β) ∫ N(Rᵀy | RᵀXβ + Rᵀη, σ²I_p) dΠ̂(Rᵀη | Sᵀy). (4)

All distributions, densities, and probabilities resulting from this approximation carry a hat to distinguish them from their exact counterparts.

A Gaussian approximation is analytically convenient:

Π̂(Rᵀη | Sᵀy) = N(µ̂, Σ̂), (5)

where µ̂ and Σ̂ are estimates of the mean and covariance of Π(Rᵀη | Sᵀy), respectively. In this case, (4) simplifies as

Π̂(β | y) ∝ Π(β) N(Rᵀy | RᵀXβ + µ̂, σ²I_p + Σ̂). (6)

Only β is unknown such that the computational problems with (1) resulting from the complexity of Π(η) have been resolved in (6). Furthermore, (6) is equivalent to a Gaussian linear model with observations Rᵀy − µ̂, design matrix RᵀX, and parameter β. We have reduced a model with a potentially challenging nuisance parameter to a low-dimensional one with the nuisance integrated out, while controlling for the effect of the nuisance parameter in a principled manner. Algorithm 1 summarizes our method when the Gaussian approximation from (5) is used.

Algorithm 1. Integrated rotated Gaussian approximation.
Input: Data (y, X)
1. Compute the QR decomposition of X to obtain the rotation matrix Q = (R, S).
2. Compute the estimates µ̂ and Σ̂ for the mean and covariance of Π(Rᵀη | Sᵀy) based on submodel (3b) using an algorithm of choice.
3. Approximate the posterior Π(β | y) according to (6).
Output: The approximate posterior Π̂(β | y)

2·3. Relation to other methods

Algorithm 1 has resemblances with other approximation methods. Integrated nested Laplace approximations (Rue et al., 2009) also approximate a nested part of a Bayesian model by a Gaussian distribution, but with important differences. A Laplace approximation is applied without a data rotation and is done at two, rather than one, nested levels of the model. Moreover, a Laplace approximation matches the mode and curvature of the approximating Gaussian while (5) matches the moments. Laplace's method (Tierney & Kadane, 1986) requires a continuous target distribution, and integrated nested Laplace approximations assume a conditionally Gaussian prior on some parameters. Our Gaussian approximation needs no such conditions on priors but assumes a Gaussian error distribution. Indeed, the nuisance parameter η has a very non-Gaussian prior in our cases of interest.

The approximation in (5) aims to match the first two moments of the exact Π(Rᵀη | Sᵀy). Such matching is the principle behind expectation consistent inference (Opper & Winther, 2005). Our method matches moments for the nuisance parameter but not for the parameter of interest β. This differs from applications of the expectation consistent framework in which moment matching is pervasive, such as in expectation propagation (Hernández-Lobato et al., 2015). Implementations of expectation propagation are usually not able to capture dependence among dimensions of the posterior while our method allows for dependence in the p-dimensional β.

Effectively, our method integrates out the nuisance parameter η approximately. Integrating out nuisance parameters from the likelihood is not new (Berger et al., 1999), including doing so approximately (Severini, 2011). Previous approximations, however, do not apply a data rotation and consider cases where the distribution on the nuisance parameter is regular enough so that a Laplace approximation can be applied. Our method does not need such regularity conditions.

The rotation Q is similar to the projection in the Frisch-Waugh-Lovell theorem (Stachurski, 2016, Theorem 11.2.1) for least-squares estimation of a parameter subset. Our method applies beyond least squares. Also, our estimation of the nuisance parameter through the rotation is merely an intermediate step for inference on β. Our method reduces to the algorithm from van den Boom et al. (2015) when considering the variable selection example in §3·2 with p = 1.

2·4. Estimating σ² and hyperparameters

So far, we have treated σ² as fixed and known. In practice, σ² usually needs to be estimated, as well as any unknown parameters in the prior on η. This estimation fits naturally into Step 2 of Algorithm 1 as the methods that can be used there frequently come with such estimation procedures: see for instance §S5 and §S6 of the Supplementary Material. The resulting estimates can then be plugged into Step 3. By doing so, only the (n − p)-dimensional submodel (3b) informs the estimates of these parameters and not the p-dimensional submodel (3a). We expect (3b) to contain the vast majority of information on the unknown parameters if (n − p) ≫ p, which is often the case in scenarios of interest.
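To make the three steps concrete, the following is a minimal R sketch of Algorithm 1. The function name irga_reduce and the helper fit_nuisance are hypothetical; fit_nuisance stands in for the Step 2 algorithm of choice (for example, vector approximate message passing as in §3·1) and is assumed to return the estimates µ̂ and Σ̂ of the mean and covariance of Π(Rᵀη | Sᵀy).

# Minimal sketch of Algorithm 1. `fit_nuisance` is a hypothetical user-supplied
# routine for Step 2 that takes S^T y together with S and R and returns
# list(mu_hat, Sigma_hat), the estimated mean and covariance of R^T eta | S^T y.
irga_reduce <- function(y, X, sigma2, fit_nuisance) {
  p <- ncol(X)
  # Step 1: full QR decomposition of X, giving the rotation Q = (R, S).
  Q <- qr.Q(qr(X), complete = TRUE)
  R <- Q[, seq_len(p), drop = FALSE]
  S <- Q[, -seq_len(p), drop = FALSE]
  # Step 2: estimate the mean and covariance of Pi(R^T eta | S^T y)
  # from submodel (3b), using the algorithm of choice.
  est <- fit_nuisance(drop(crossprod(S, y)), S, R)
  # Step 3: the reduced model (6) is a Gaussian linear model in beta with
  # observations R^T y - mu_hat, design matrix R^T X, and error covariance
  # sigma2 * I_p + Sigma_hat.
  list(y_red = drop(crossprod(R, y)) - est$mu_hat,
       X_red = crossprod(R, X),
       err_cov = sigma2 * diag(p) + est$Sigma_hat)
}

Step 3 is left generic here because the reduced model in (6) can be combined with whatever prior Π(β) and low-dimensional posterior method the application calls for.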
3. EXAMPLES OF NUISANCE PARAMETERS η

3·1. Adjusting for high-dimensional covariates

Section 3 provides some examples of the general setting of model (1) that demonstrate the utility of the integrated rotated Gaussian approximation described in Algorithm 1. As a first example, consider the setting where η = Zα with Z a known n × q feature matrix and α an unknown q-dimensional parameter with q ≫ n. Then, the model in (1) becomes y ∼ N(Xβ + Zα, σ²I_n), so that we are adjusting for high-dimensional covariates Z in performing inference on the coefficients β on the predictors X of interest. One way to deal with the fact that the number of covariates q exceeds the number of observations n is by inducing sparsity in α via its prior Π(α). We consider the spike-and-slab prior, α_j ∼ λN(0, ψ) + (1 − λ)δ(0) independently for j = 1, . . . , q, where λ = pr(α_j ≠ 0) is the prior inclusion probability, ψ the slab variance, and δ(0) a point mass at zero. By specifying Π(α), we have also defined Π(η) = Π(Zα). Since each Π(α_j) is a mixture of a point mass and a Gaussian, Π(α) and thus Π(η) are mixtures of 2^q Gaussians. As a result, computation of π(β | y) in (2) involves summing over these 2^q components. This is infeasible for large q.

Algorithm 1 provides an approximation Π̂(β | y) while avoiding the exponential computational cost. Step 2 in Algorithm 1 requires choice of an estimation algorithm. Substituting η = Zα into (3b) yields Sᵀy ∼ N(SᵀZα, σ²I_{n−p}), which is a linear model with (n − p) observations and design matrix SᵀZ. As such, methods for linear regression with spike-and-slab priors can produce an approximation to Π(α | Sᵀy) and thus the estimates µ̂ and Σ̂ in (5). We choose vector approximate message passing (Rangan et al., 2017), detailed in §S5 of the Supplementary Material, to approximate Π(α | Sᵀy) because of its computational scalability and accuracy. As an alternative, we consider the debiased lasso (Javanmard & Montanari, 2013) in §5·2 to approximate Π(α | Sᵀy), as detailed in §S6 of the Supplementary Material.

The q-dimensional distribution Π(α | Sᵀy) is possibly highly non-Gaussian, being a mixture of Gaussians. At the same time, the p-dimensional distribution Π(RᵀZα | Sᵀy) = Π(Rᵀη | Sᵀy) can be nearly Gaussian, such that the approximation in (5) is accurate, as discussed in §4·2.

3·2. Bayesian variable selection

For a second application of (1), consider the linear model y ∼ N(Aθ, σ²I_n) where A is a known n × r design matrix and θ an unknown r-dimensional parameter. Variable selection is the problem of determining which entries of θ are non-zero. Modeling the data in a Bayesian fashion provides a natural framework to evaluate statistical evidence via the posterior Π(θ | y). A standard variable selection prior Π(θ) is the spike-and-slab prior defined by θ_j ∼ λN(0, ψ) + (1 − λ)δ(0) independently for j = 1, . . . , r. As in §3·1, the cost of computing the exact posterior with a spike-and-slab prior grows exponentially in r. Therefore, computation of Π(θ | y) is infeasible for r beyond moderate size. A variety of approximation methods exist for larger r, including Monte Carlo (George & McCulloch, 1993, 1997; O'Hara & Sillanpää, 2009), variational Bayes (Carbonetto & Stephens, 2012; Ormerod et al., 2017), and expectation propagation (Hernández-Lobato et al., 2015).

Monte Carlo methods do not scale well with the number of predictors r. For r even moderately large, the number of 2^r possible non-zero subsets of θ is so huge that there is no hope of visiting more than a vanishingly small proportion of models. The result is high Monte Carlo error in estimating posterior probabilities, with almost all models assigned zero probability as they are never visited. As an alternative to Monte Carlo sampling, fast approximation approaches for Bayesian variable selection include variational Bayes (Carbonetto & Stephens, 2012; Ormerod et al., 2017) and expectation propagation (Hernández-Lobato et al., 2015). Their accuracy, however, does not come with theory guarantees. Our method, which applies to variable selection as detailed in the next paragraph, allows for theoretical analysis, as shown in §4.

Interest is often in whether θ_j ≠ 0 (j = 1, . . . , r), as measured by the posterior inclusion probability pr(θ_j ≠ 0 | y). Algorithm 1 can estimate pr(θ_j ≠ 0 | y): let p < r elements from θ constitute β and let the other q = r − p elements in θ constitute α. Then, Aθ = Xβ + Zα where X and Z consist of the respective columns in A, and Π(α, β) = Π(α)Π(β) since Π(θ) = ∏_{j=1}^r Π(θ_j). This set-up is the same as in §3·1 such that Algorithm 1 applies to approximate Π(β | y) as in §3·1. Assuming θ_j is contained in β, an approximation of Π(θ_j | y) can be obtained as a marginal distribution of Π̂(β | y). Repeating Algorithm 1 with different splits of θ into β and α provides estimates of all pr(θ_j ≠ 0 | y) (j = 1, . . . , r). Computations for these different splits can run in parallel; a code sketch of this splitting strategy is given at the end of this section. The number of CPU cores is often less than the number of variables r. Then, computation time to obtain all p̂r(θ_j ≠ 0 | y) is a tradeoff between the length p of β, which affects the cost of each execution of Algorithm 1, and the number r/p of executions of Algorithm 1. The complexity of computing all p̂r(θ_j ≠ 0 | y) is O(r² log r) if p = O(log r), as detailed in the next paragraph.

Step 1 of Algorithm 1 is the QR decomposition of an n × p matrix, which has complexity O(np²) (Golub & Van Loan, 1996). Step 2 involves vector approximate message passing applied to a linear model with n − p observations and q parameters, which has a complexity of O{(n − p + K) q min(n − p, q)} where K is the number of message passing iterations, as detailed in §S5 of the Supplementary Material. Computing Sᵀy and SᵀZ, which are the observations and design matrix in (3b), and computing µ̂ and Σ̂ in (5) from the message passing output is O(n²q). Computing Step 3 with the spike-and-slab prior Π(β) is O(2^p p²), ignoring dependence on n. The complexity of obtaining all p̂r(θ_j ≠ 0 | y) by applying Algorithm 1 r/p times is thus O{(r/p)(q + 2^p p²)} = O{(r/p)(r − p + 2^p p²)}, ignoring dependence on n and K. For p = O(log r), this complexity reduces to O(r² log r).

3·3. Non-parametric adjustment for covariates

As a last example, let η_i = (g ∘ f)(z_i) (i = 1, . . . , n) where g : R → R is a known, differentiable, non-linear function, f : R^q → R is an unknown function, g ∘ f : R^q → R is g composed with f, and z_i is a q-dimensional feature vector. Then, η_i provides a non-parametric adjustment for the covariate z_i in performing inferences on the effect of x_i on y_i. Take f's prior as a Gaussian process, which induces a prior Π(η). Algorithm 1 applies if a Gaussian approximation Π̂(Rᵀη | Sᵀy) is available: submodel (3b) reduces to Sᵀy ∼ N{SᵀG(F), σ²I_{n−p}} where F = {f(z_1), . . . , f(z_n)}ᵀ and G(F) = {g(F_1), . . . , g(F_n)}ᵀ, which is a non-linear Gaussian model as studied in Steinberg & Bonilla (2014). Linearizing G using a first-order Taylor series yields a Gauss-Newton algorithm for a Laplace approximation of Π(F | Sᵀy), as detailed in §S7 of the Supplementary Material. Based on that approximation, compute µ̂ and Σ̂ in (5), for instance by sampling F from the Laplace approximation Π̂(F | Sᵀy) and computing the sample mean and covariance of RᵀG(F) since Rᵀη = RᵀG(F).
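The splitting strategy of §3·2 can be sketched as follows in R, reusing the hypothetical irga_reduce function from §2. The helpers fit_alpha and spike_slab_post are assumed stand-ins: fit_alpha for the Step 2 algorithm applied to submodel (3b) with design matrix SᵀZ (for example vector approximate message passing), and spike_slab_post for an exact spike-and-slab posterior computation that enumerates the 2^p models of the reduced p-dimensional model (6) and returns marginal inclusion probabilities.

# Sketch of the variable selection strategy of Section 3.2: estimate all r
# posterior inclusion probabilities by applying Algorithm 1 to splits of theta.
# `fit_alpha` and `spike_slab_post` are hypothetical stand-ins (see text).
irga_inclusion <- function(y, A, sigma2, p = 4) {
  r <- ncol(A)
  incl <- numeric(r)
  splits <- split(seq_len(r), ceiling(seq_len(r) / p))
  for (idx in splits) {  # each split can run on a separate CPU core
    X <- A[, idx, drop = FALSE]   # current parameters of interest, beta
    Z <- A[, -idx, drop = FALSE]  # remaining parameters form alpha
    red <- irga_reduce(y, X, sigma2,
                       function(Sty, S, R)
                         fit_alpha(Sty, crossprod(S, Z), crossprod(R, Z)))
    # Exact spike-and-slab posterior on the reduced p-dimensional model (6).
    incl[idx] <- spike_slab_post(red$y_red, red$X_red, red$err_cov)
  }
  incl
}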
4. ANALYSIS OF INTEGRATED ROTATED GAUSSIAN APPROXIMATION

4·1. Approximation accuracy

This section provides theoretical guarantees on the accuracy of our posterior approximation framework. We begin with a general upper bound in terms of the accuracy of the approximation for the projected nuisance parameter. For this, denote the distribution of the p-dimensional a + b where b ∼ N(0, σ²I_p) by Π(a) ∗ N_{σ²}. Define the Kullback-Leibler divergence from Π(a) to Π(b) as D{Π(a) ‖ Π(b)} = ∫ log{π(a)/π(b)} dΠ(a).

At a high level, it is clear that the accuracy of the approximation Π̂(β | y) defined in (4) depends on the accuracy of the approximation Π̂(Rᵀη | Sᵀy). The following result quantifies the nature of this dependence in the setting where the data are generated from the prior predictive distribution. This result applies generally for any approximation Π̂(Rᵀη | Sᵀy) and thus includes the Gaussian approximation (5) used in Algorithm 1 as a special case.

THEOREM 1. Let y be distributed according to the model in (1) with β ∼ Π(β) and η ∼ Π(η) distributed according to their priors. Conditional on any realization Sᵀy, the posterior approximation Π̂(β | y) described in (4) satisfies

E[D{Π(β | y) ‖ Π̂(β | y)} | Sᵀy] ≤ D{Π(Rᵀη | Sᵀy) ∗ N_{σ²} ‖ Π̂(Rᵀη | Sᵀy) ∗ N_{σ²}},

where the expectation on the left is with respect to the conditional distribution of y given Sᵀy.

A particularly useful property of Theorem 1 is that the upper bound does not depend in any way on the prior Π(β). This differs from some of the related work on posterior approximation, such as Huggins et al. (2017), which requires additional smoothness constraints, and thus excludes certain priors such as the spike-and-slab prior in §3·1. Another useful property of Theorem 1 is that it does not require any assumptions about the extent to which the exact posterior Π(Rᵀη | Sᵀy) is concentrated about the ground truth. As a consequence, this result is relevant for non-asymptotic settings where there may be high uncertainty about η.

4·2. Accuracy of the Gaussian approximation

Next, we provide theoretical justification for a Gaussian approximation to Π(Rᵀη | Sᵀy) by showing that such an approximation can be accurate even if the prior on η is non-Gaussian. We focus on the setup of §3·1 where η = Zα with a known n × q feature matrix Z and unknown parameter vector α. In this setting, the projected nuisance parameter is RᵀZα where RᵀZ is a p × q matrix with p ≪ q.

As a motivating example, consider the setting where the conditional distribution Π(α | Sᵀy) is a product measure with uniformly bounded second moments. Under regularity assumptions on the columns of RᵀZ, the multivariate central limit theorem implies that the distribution of the p-dimensional projection RᵀZα is close to the Gaussian distribution with the same mean and covariance. Hence, a Gaussian approximation for RᵀZα is well motivated even though entries of α are possibly non-Gaussian.

More realistically, one may envision settings where the entries of Π(α | Sᵀy) are not independent but are weakly correlated on average. In this case, the usual central limit theorem does not hold because one can construct counterexamples in which the normalized sum of dependent but uncorrelated variables is far from Gaussian. Nevertheless, a classic result due to Diaconis & Freedman (1984) suggests that these counterexamples are atypical. Specifically, if one considers a weighted linear combination of the entries in α, then approximate Gaussianity holds for most choices of the weights, where most is quantified with respect to the uniform measure on the sphere. The implications of this phenomenon have been studied extensively in the context of statistical inference (Hall & Li, 1993; Leeb, 2013) and recent work by Meckes (2012) and Reeves (2017) provides approximation bounds for the setting of multidimensional linear projections.

In the context of our approximation framework, these results imply that a Gaussian approximation is accurate for most, but not necessarily all, instances of the p × q feature matrix RᵀZ. To make this statement mathematically precise, we consider the expected behavior when the rows of Z are drawn independently from the q-dimensional Gaussian distribution N(0, Λ) where Λ is positive definite. As in the rest of the paper, we assume that X is fixed and arbitrary. Under these assumptions, the rows of the projected matrices RᵀZ and SᵀZ are independent with the same distribution as in Z.

Our results depend on certain properties of the conditional distribution Π(α | Sᵀy, SᵀZ). Let ξ and Ψ denote the mean and covariance of Π(α | Sᵀy, SᵀZ), respectively. Define

m₁ = E{ |‖Λ^{1/2}(α − ξ)‖²/tr(ΛΨ) − 1| | Sᵀy, SᵀZ },  m₂ = tr{(ΛΨ)²}/{tr(ΛΨ)}².

The term m₁ provides a measure of the concentration of ‖Λ^{1/2}(α − ξ)‖² about its mean and satisfies 0 ≤ m₁ ≤ 2. The term m₂ provides a measure of the average correlation between the entries of Λ^{1/2}α and satisfies 1/q ≤ m₂ ≤ 1, with equality on the left when ΛΨ is proportional to the identity matrix and equality on the right when ΛΨ has rank one.
Given estimates ξ̂ and Ψ̂ that are functions of Sᵀy and SᵀZ, we consider the Gaussian approximation

Π̂(Rᵀη | Sᵀy, SᵀZ) = N{RᵀZξ̂, tr(ΛΨ̂)I_p}. (7)

The covariance is chosen independently of RᵀZ and depends only on a scalar summary of the estimated covariance. The following result bounds the accuracy of this approximation in terms of m₁ and m₂ and the accuracy of the estimated mean and covariance.

THEOREM 2. Conditional on any Sᵀy and SᵀZ, the Gaussian approximation in (7) satisfies

E_{RᵀZ}[D{Π(Rᵀη | Sᵀy, SᵀZ) ∗ N_{σ²} ‖ Π̂(Rᵀη | Sᵀy, SᵀZ) ∗ N_{σ²}}] ≤ δ₁ + δ₂,

where the expectation is with respect to RᵀZ and

δ₁ = 3p[ m₁^{1/2} log^{1/2}{1 + tr(ΛΨ)/σ²} + m₂^{1/4} + m₂^{1/4}{1 + tr(ΛΨ)/σ²}^{1/2} p^{1/4} ],
δ₂ = p‖Λ^{1/2}(ξ − ξ̂)‖²/(2σ²) + p{tr(ΛΨ)^{1/2} − tr(ΛΨ̂)^{1/2}}²/(2σ²).

One of the main takeaways from Theorem 2 is that the Gaussian approximation is accurate under very mild conditions on Π(α | Sᵀy), namely that the terms m₁ and m₂ are small. Combining this result with Theorem 1 provides an upper bound on the accuracy of the approximation to the posterior of β described in Algorithm 1. Although Theorem 2 bounds the expected performance under a Gaussian distribution on RᵀZ, the non-negativity of the Kullback-Leibler divergence means that convergence in expectation implies convergence in probability.

4·3. Variable selection consistency

Finally, we provide guarantees for variable selection consistency. Let the set γ ⊂ {1, . . . , p} contain all indices j such that β_j ≠ 0. Define γ₀ analogously for a non-random β₀. Variable selection consistency as in Fernández et al. (2001) and Liang et al. (2008) means that, for y ∼ N(Xβ₀, σ²I_n), the posterior probability of the true model γ₀ converges to one, pr(γ = γ₀ | y) → 1 as n → ∞, where p does not change with n. It is desirable for a posterior approximation to inherit this property. The Monte Carlo approximations discussed in §3·2 inherit it, and Theorem 3 below shows that our approximation does too provided that Π̂(Rᵀη | Sᵀy) concentrates appropriately. Relatedly, Ormerod et al. (2017) established such consistency for their variational Bayes algorithm. More recently, K. Ray and B. Szabó (arXiv:1904.07150) showed optimal convergence rates of variable selection using variational Bayes with different priors than we consider here.

Let the |γ|-dimensional vector β_γ consist of the elements in β with indices in γ, and the n × |γ| matrix X_γ consist of the columns of X with indices in γ. Then, specifying Π(γ) and Π(β_γ | γ) defines Π(β). We consider g-priors (Zellner, 1986):

β_γ | γ ∼ N{0, σ²g_n(X_γᵀX_γ)⁻¹}, g_n ∈ (0, ∞). (8)

Liang et al. (2008) showed variable selection consistency for priors of this form. Our approximation inherits this property under the additional assumption (9) on Π̂(Rᵀη | Sᵀy) and g_n.

THEOREM 3. Let Π(β_γ) be the g-prior on β_γ from (8). Assume that g_n, Π(γ), and X satisfy pr(γ = γ₀) > 0, lim_{n→∞} ‖{I_n − X_γ(X_γᵀX_γ)⁻¹X_γᵀ}Xβ₀‖²/n > 0 for any γ not containing γ₀, g_n → ∞, and log(g_n)/n → 0, which are standard assumptions used in Fernández et al. (2001) and Liang et al. (2008) as detailed in §S4 of the Supplementary Material.
Let y be distributed according to the data-generating model in (1) with β and η fixed to β₀ and η₀, respectively. Assume that Π̂(Rᵀη | Sᵀy) concentrates appropriately in that

‖Rᵀη − Rᵀη₀‖²/log g_n → 0, (9)

in probability with respect to Rᵀη ∼ Π̂(Rᵀη | Sᵀy) and y. Let Π̂(β | y) be as in (4). Then, p̂r(γ = γ₀ | y) → 1 in probability with respect to y as n → ∞.
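The diagnostics m₁ and m₂ of Theorem 2 can be evaluated numerically from any approximation of Π(α | Sᵀy, SᵀZ), so the tightness of the Gaussian approximation can be monitored in practice. The following is a small sketch in R; the function name irga_diagnostics and the posterior draws alpha_draws are assumptions for illustration, with Λ, ξ, and Ψ as defined in §4·2. It computes m₂ in closed form and m₁ by Monte Carlo.

# Evaluate the Theorem 2 diagnostics m1 and m2 from posterior summaries.
# Lambda: q x q covariance of the rows of Z; xi, Psi: posterior mean and
# covariance of alpha; alpha_draws: M x q matrix of posterior draws of alpha.
irga_diagnostics <- function(Lambda, xi, Psi, alpha_draws) {
  LP <- Lambda %*% Psi
  tr_LP <- sum(diag(LP))
  m2 <- sum(LP * t(LP)) / tr_LP^2      # tr{(Lambda Psi)^2} / tr(Lambda Psi)^2
  U <- chol(Lambda)                    # upper triangular with U^T U = Lambda
  centred <- sweep(alpha_draws, 2, xi)
  v <- rowSums((centred %*% t(U))^2)   # ||Lambda^{1/2}(alpha - xi)||^2 per draw
  m1 <- mean(abs(v / tr_LP - 1))
  c(m1 = m1, m2 = m2)
}

Small values of both outputs indicate that the bound δ₁ in Theorem 2 is small, so that the moment-matched Gaussian in (7) is trustworthy.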
5. SIMULATION STUDIES AND APPLICATIONS

5·1. Non-parametric adjustment for covariates

Consider the set-up from §3·3 with g(a) = a³ and q = 1. We assign f : R → R a zero-mean Gaussian process prior with a squared exponential covariance function such that cov{f(z_i), f(z_j)} = exp{−(z_i − z_j)²/2} (i, j = 1, . . . , n), and β ∼ N(0, I_p). Set n = 100, p = 3, and σ² = 1. We draw the rows of X independently from N(0_{p×1}, Φ) where Φ is a Toeplitz matrix whose first row consists of increasing powers of a constant in (0, 1), so that the columns of X are correlated. The features z_i (i = 1, . . . , n) equal the ith element of the first column of X. Generate y according to (1) with f equal to a draw from its prior distribution and β = (4, −, )ᵀ.

We approximate the posterior Π(β | y) using a random walk Metropolis-Hastings algorithm on f with 10,000 burnin and 90,000 recorded iterations. We marginalize out β since Π(β | f, y) is analytically available, allowing approximation of π(β | y) with samples from Π(f | y). Algorithm 1 also provides Π̂(β | y) per §3·3. Lastly, ignoring the non-parametric nuisance parameter by setting η_i = (g ∘ f)(z_i) = 0 yields a simpler approximation. The Metropolis-Hastings algorithm took 6 minutes while our method finished in 2 seconds. The resulting posterior density estimates for β_j (j = 1, . . . , p) are in Fig. 1. Taking the Metropolis-Hastings estimate as the gold standard, our method yields an approximation that matches the location and spread of the posterior better than the result from ignoring the non-parametric nuisance term η.

Fig. 1: Marginal posterior density estimates from the simulation in §5·1. The solid line is the Metropolis-Hastings estimate of π(β_j | y), the thick dotted line the estimate π̂(β_j | y) from Algorithm 1, and the thin dotted line the estimate resulting from ignoring the nuisance parameter.
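For concreteness, the data-generating step of this simulation can be sketched in R as follows. The Toeplitz decay rate 0.9 and the last two entries of beta are illustrative assumptions rather than the paper's exact settings; the cubic link matches g(a) = a³ above.

# Sketch of the Section 5.1 data-generating process. The decay rate 0.9 and
# the values -4 and 2 in beta are hypothetical stand-ins.
set.seed(1)
n <- 100; p <- 3; sigma2 <- 1
g <- function(a) a^3                            # cubic link, per the text
Phi <- toeplitz(0.9^(1:p))                      # first row: powers of 0.9
X <- matrix(rnorm(n * p), n, p) %*% chol(Phi)   # rows drawn from N(0, Phi)
z <- X[, 1]                                     # features: first column of X
K <- exp(-outer(z, z, "-")^2 / 2)               # squared exponential covariance
f <- drop(t(chol(K + 1e-8 * diag(n))) %*% rnorm(n))  # draw f from its GP prior
beta <- c(4, -4, 2)                             # only the first entry is from the text
y <- drop(X %*% beta) + g(f) + sqrt(sigma2) * rnorm(n)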
5·2. Diabetes data

We consider the diabetes data from Efron et al. (2004) as it is a popular example of variable selection with collinear predictors (Park & Casella, 2008; Polson et al., 2013). The outcome y measures disease progression one year after baseline for n = 442 patients with diabetes. The r = 64 predictors come from 10 covariates with their squares and interactions. The outcome and predictors are standardized to have zero mean and unit norm. Consider the variable selection set-up from §3·2 with λ = 1/ and ψ = 1. We estimate the posterior inclusion probabilities pr(θ_j ≠ 0 | y) (j = 1, . . . , r) using 1) a Gibbs sampler with 10,000 burnin and 90,000 recorded iterations, 2) and 3) Algorithm 1 as described in §3·2 with, respectively, vector approximate message passing and the debiased lasso in Step 2, using p = 4 as suggested by p = O(log r) and parallelization across 8 CPU cores, 4) the expectation propagation scheme from Hernández-Lobato et al. (2015), and 5) the variational Bayes algorithm from Carbonetto & Stephens (2012). The implementations of expectation propagation and variational Bayes used are the R code from https://jmhl.org/publications/ dated January 2010 and the R package 'varbvs' version 2.5-7, respectively. Results from the variational Bayes algorithm by Ormerod et al. (2017) are omitted as the method from Carbonetto & Stephens (2012) outperforms it in the scenarios that we consider.

Since the error variance is unknown, we assign it the prior 1/σ² ∼ Ga(1, 1), a gamma distribution with unit shape and rate parameter. The Gibbs sampler incorporates this prior. Algorithm 1 estimates σ² as described in §2·4 and in §S5 and §S6 of the Supplementary Material. Expectation propagation estimates σ² by maximizing approximate evidence (Hernández-Lobato et al., 2015). The R package 'varbvs' (Carbonetto & Stephens, 2012) uses approximate maximum likelihood for σ² within the variational Bayes method.

As discussed in §3·2, determining whether posterior inclusion probabilities from a Gibbs sampler are accurate is non-trivial. Overlapping batch means (Flegal & Jones, 2010, §3) estimate their average Monte Carlo standard error as 0.0015 in this application.

Table 1 focuses on the errors in the posterior inclusion probability estimates. A given approximation error is worse when the inclusion probability is close to zero or one than when it is moderate. We therefore transform the probabilities to log odds. Our method with vector approximate message passing outperforms expectation propagation and variational Bayes as its error is lowest in Table 1, though at a higher computational cost. Our method is slowest but still considerably faster than the Gibbs sampler, which took 11 minutes to run. Since the debiased lasso yielded the worst approximation, we do not consider it in the remainder of this article.

Table 1: Summary statistics of the absolute difference between the Gibbs sampler estimates and the approximations of the posterior log odds of inclusion for the application in §5·2, and computation time in seconds, for IRGA with vector approximate message passing, IRGA with the debiased lasso, expectation propagation, and variational Bayes.

Table 2: Posterior inclusion probabilities for the demographic factors from the application in §5·3. IRGA stands for integrated rotated Gaussian approximation.

                                          Population
Method              Gender   Utahn of European ancestry   Finnish   Tuscan   Yoruba
IRGA                 0.83               0.96                0.96     0.92     0.00
Ignoring the SNPs    0.73               0.07                0.04     0.20     0.49

5·3. Controlling for single-nucleotide polymorphisms

The Geuvadis dataset from Lappalainen et al. (2013) contains gene expression data from lymphoblastoid cell lines of n = 462 individuals from the 1000 Genomes Project along with roughly 38 million single-nucleotide polymorphisms (SNPs). We focus on the gene E2F2, Ensembl ID ENSG00000007968, as it plays a key role in the cell cycle (Attwooll et al., 2004). Our focus is on assessing whether expression differs between populations, even after adjusting for genetic variation between individuals. Specifically, we compare people of British descent with the four other populations given in Table 2. If such differences occur, they can be presumed to be due to environmental factors that differ between these populations and that relate to E2F2 expression. We therefore consider the set-up from §3·1 with y the E2F2 gene expressions, X demographic factors, and Z containing SNPs we would like to control for.

The demographics in X are gender and the 4 populations with British as the reference group. The matrix X thus has p = 5 columns. The covariates Z consist of q = 10,000 SNPs selected using sure independence screening (Fan & Lv, 2008) as vector approximate message passing on all 38 million SNPs was infeasible. We standardize y and the columns of X and Z to have zero mean and unit standard deviation. To complete the set-up from §3·1, set λ = n/(10q) and ψ = 1/n for the spike-and-slab prior on α, while Π(β) is also a spike-and-slab prior. Vector approximate message passing estimates σ² using the prior 1/σ² ∼ Ga(1, 1) and employs damping to achieve convergence in this application, as described in §S5 of the Supplementary Material. The resulting posterior inclusion probabilities for the demographic factors are in Table 2.
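The screening step can be sketched as follows; the objects Z_all and y are hypothetical placeholders for the full SNP matrix and the standardized outcome, and ranking by absolute marginal correlation is the sure independence screening rule of Fan & Lv (2008).

# Sure independence screening: keep the q SNPs with the largest absolute
# marginal correlation with the outcome (Fan & Lv, 2008).
# Z_all (an n x 38e6 SNP matrix) and y are hypothetical placeholder objects.
sis_screen <- function(Z_all, y, q = 10000) {
  score <- abs(drop(cor(Z_all, y)))
  keep <- order(score, decreasing = TRUE)[seq_len(q)]
  scale(Z_all[, keep])   # standardize the retained columns
}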
6. DISCUSSION

Although our focus was on approximating the posterior of β under a Bayesian approach, our method marginalizes out nuisance parameters from the likelihood for β as an intermediate step. This approximate likelihood from (6) is also useful in frequentist inference, though it requires one to define a prior for the nuisance parameter η.
ACKNOWLEDGMENT

This work was partially supported by the National Institute of Environmental Health Sciences of the U.S. National Institutes of Health, the Singapore Ministry of Education Academic Research Fund, and the Laboratory for Analytic Sciences. Any opinions, findings, conclusions, or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the Laboratory for Analytic Sciences and/or any agency or entity of the United States Government.

SUPPLEMENTARY MATERIAL
Supplementary material includes the proofs for §4, a corollary to Theorem 1, details of vector approximate message passing and the Laplace approximation for §3·3, and additional simulation studies. The R code for the numerical results is available at https://github.com/willemvandenboom/IRGA.

REFERENCES

Attwooll, C., Denchi, E. L. & Helin, K. (2004). The E2F family: Specific functions and overlapping interests. The EMBO Journal, 4709–4716.
Berger, J. O., Liseo, B. & Wolpert, R. L. (1999). Integrated likelihood methods for eliminating nuisance parameters. Statistical Science, 1–28.
Carbonetto, P. & Stephens, M. (2012). Scalable variational inference for Bayesian variable selection in regression, and its accuracy in genetic association studies. Bayesian Anal., 73–108.
Diaconis, P. & Freedman, D. (1984). Asymptotics of graphical projection pursuit. Ann. Stat., 793–815.
Efron, B., Hastie, T., Johnstone, I. & Tibshirani, R. (2004). Least angle regression. Ann. Stat., 407–499.
Fan, J. & Lv, J. (2008). Sure independence screening for ultrahigh dimensional feature space. J. R. Statist. Soc. B, 849–911.
Fernández, C., Ley, E. & Steel, M. F. (2001). Benchmark priors for Bayesian model averaging. J. Econom., 381–427.
Flegal, J. M. & Jones, G. L. (2010). Batch means and spectral variance estimators in Markov chain Monte Carlo. Ann. Stat., 1034–1070.
George, E. I. & McCulloch, R. E. (1993). Variable selection via Gibbs sampling. J. Am. Statist. Assoc., 881–889.
George, E. I. & McCulloch, R. E. (1997). Approaches for Bayesian variable selection. Stat. Sin., 339–374.
Golub, G. H. & Van Loan, C. F. (1996). Matrix Computations, 3rd ed. Baltimore: Johns Hopkins University Press.
Hall, P. & Li, K.-C. (1993). On almost linearity of low dimensional projections from high dimensional data. Ann. Stat., 867–889.
Hernández-Lobato, J. M., Hernández-Lobato, D. & Suárez, A. (2015). Expectation propagation in linear regression models with spike-and-slab priors. Mach. Learn., 437–487.
Huggins, J., Adams, R. P. & Broderick, T. (2017). PASS-GLM: Polynomial approximate sufficient statistics for scalable Bayesian GLM inference. In Advances in Neural Information Processing Systems 30, pp. 3611–3621.
Javanmard, A. & Montanari, A. (2013). Confidence intervals and hypothesis testing for high-dimensional statistical models. In Advances in Neural Information Processing Systems 26, pp. 1187–1195.
Lappalainen, T., Sammeth, M., Friedländer, M. R., 't Hoen, P. A. C., Monlong, J., Rivas, M. A., Gonzàlez-Porta, M., Kurbatova, N., Griebel, T., Ferreira, P. G., Barann, M., Wieland, T., Greger, L., van Iterson, M., Almlöf, J., Ribeca, P., Pulyakhina, I., Esser, D., Giger, T., Tikhonov, A., Sultan, M., Bertier, G., MacArthur, D. G., Lek, M., Lizano, E., Buermans, H. P. J., Padioleau, I., Schwarzmayr, T., Karlberg, O., Ongen, H., Kilpinen, H., Beltran, S., Gut, M., Kahlem, K., Amstislavskiy, V., Stegle, O., Pirinen, M., Montgomery, S. B., Donnelly, P., McCarthy, M. I., Flicek, P., Strom, T. M., Lehrach, H., Schreiber, S., Sudbrak, R., Carracedo, Á., Antonarakis, S. E., Häsler, R., Syvänen, A.-C., van Ommen, G.-J., Brazma, A., Meitinger, T., Rosenstiel, P., Guigó, R., Gut, I. G., Estivill, X. & Dermitzakis, E. T. (2013). Transcriptome and genome sequencing uncovers functional variation in humans. Nature, 506–511.
Leeb, H. (2013). On the conditional distributions of low-dimensional projections from high-dimensional data. Ann. Stat., 464–483.
Liang, F., Paulo, R., Molina, G., Clyde, M. A. & Berger, J. O. (2008). Mixtures of g priors for Bayesian variable selection. J. Am. Statist. Assoc., 410–423.
Meckes, E. (2012). Projections of probability distributions: A measure-theoretic Dvoretzky theorem. In Lecture Notes in Mathematics. Berlin: Springer, pp. 317–326.
O'Hara, R. B. & Sillanpää, M. J. (2009). A review of Bayesian variable selection methods: What, how and which. Bayesian Anal., 85–117.
Opper, M. & Winther, O. (2005). Expectation consistent approximate inference. J. Mach. Learn. Res., 2177–2204.
Ormerod, J. T., You, C. & Müller, S. (2017). A variational Bayes approach to variable selection. Electron. J. Stat., 3549–3594.
Park, T. & Casella, G. (2008). The Bayesian lasso. J. Am. Statist. Assoc., 681–686.
Polson, N. G., Scott, J. G. & Windle, J. (2013). The Bayesian bridge. J. R. Statist. Soc. B, 713–733.
Rangan, S., Schniter, P. & Fletcher, A. K. (2017). Vector approximate message passing. In IEEE International Symposium on Information Theory, pp. 1588–1592.
Reeves, G. (2017). Conditional central limit theorems for Gaussian projections. In IEEE International Symposium on Information Theory, pp. 3045–3049.
Rue, H., Martino, S. & Chopin, N. (2009). Approximate Bayesian inference for latent Gaussian models by using integrated nested Laplace approximations. J. R. Statist. Soc. B, 319–392.
Severini, T. A. (2011). Frequency properties of inferences based on an integrated likelihood function. Stat. Sin., 433–447.
Stachurski, J. (2016). A Primer in Econometric Theory. Cambridge: MIT Press.
Steinberg, D. M. & Bonilla, E. V. (2014). Extended and unscented Gaussian processes. In Advances in Neural Information Processing Systems 27, pp. 1251–1259.
Tierney, L. & Kadane, J. B. (1986). Accurate approximations for posterior moments and marginal densities. J. Am. Statist. Assoc., 82–86.
van den Boom, W., Dunson, D. & Reeves, G. (2015). Quantifying uncertainty in variable selection with arbitrary matrices. In IEEE International Workshop on Computational Advances in Multi-Sensor Adaptive Processing, pp. 385–388.
Zellner, A. (1986). On assessing prior distributions and Bayesian regression analysis with g-prior distributions. In Bayesian Inference and Decision Techniques: Essays in Honor of Bruno de Finetti, P. K. Goel & A. Zellner, eds. Amsterdam: North-Holland/Elsevier, pp. 233–243.

Supplementary material for
Approximating posteriors with high-dimensional nuisance parameters via integrated rotated Gaussian approximation

BY W. VAN DEN BOOM
Yale-NUS College, National University of Singapore, 16 College Avenue West
[email protected]

G. REEVES AND D. B. DUNSON

Department of Statistical Science, Duke University, Box 90251, Durham, North Carolina 27708, U.S.A.
[email protected] [email protected]

S1. PROOF OF THEOREM 1

LEMMA S1. Let P(a, b) and Q(a, b) be probability measures defined on the same space that have the same a-marginal, that is, P(a) = Q(a). Then,

E_{P(b)}[D{P(a | b) ‖ Q(a | b)}] ≤ E_{P(a)}[D{P(b | a) ‖ Q(b | a)}].
Proof. Using the chain rule for Kullback-Leibler divergence (Cover & Thomas, 2006, Theorem 2.5.3) in two different ways leads to

D{P(a, b) ‖ Q(a, b)} = E_{P(b)}[D{P(a | b) ‖ Q(a | b)}] + D{P(b) ‖ Q(b)}
 = E_{P(a)}[D{P(b | a) ‖ Q(b | a)}] + D{P(a) ‖ Q(a)}.

Hence, the desired result follows from the fact that D{P(b) ‖ Q(b)} is non-negative, and the assumption P(a) = Q(a), which implies that D{P(a) ‖ Q(a)} = 0. □

Proof of Theorem 1. The distributions Π(β, Rᵀy | Sᵀy) = Π(Rᵀy | Sᵀy, β) Π(β | Sᵀy) and Π̂(β, Rᵀy | Sᵀy) = Π̂(Rᵀy | Sᵀy, β) Π(β | Sᵀy) have the same β-marginal Π(β | Sᵀy). Hence, we can apply Lemma S1 with P(a, b) = Π(β, Rᵀy | Sᵀy) and Q(a, b) = Π̂(β, Rᵀy | Sᵀy):

E[D{Π(β | y) ‖ Π̂(β | y)} | Sᵀy] = E_{Π(Rᵀy | Sᵀy)}[D{Π(β | Rᵀy, Sᵀy) ‖ Π̂(β | Rᵀy, Sᵀy)}]
 ≤ E_{Π(β | Sᵀy)}[D{Π(Rᵀy | β, Sᵀy) ‖ Π̂(Rᵀy | β, Sᵀy)}].

Let Π(a) ∗ Π(b) denote the distribution of a + b. Then, (3a) provides

Π(Rᵀy | β, Sᵀy) = Π(Rᵀη | Sᵀy) ∗ N(RᵀXβ, σ²I_p),
Π̂(Rᵀy | β, Sᵀy) = Π̂(Rᵀη | Sᵀy) ∗ N(RᵀXβ, σ²I_p).

Combining the last two displays yields

E[D{Π(β | y) ‖ Π̂(β | y)} | Sᵀy] ≤ E_{Π(β | Sᵀy)}[D{Π(Rᵀη | Sᵀy) ∗ N(RᵀXβ, σ²I_p) ‖ Π̂(Rᵀη | Sᵀy) ∗ N(RᵀXβ, σ²I_p)}].

Since the Kullback-Leibler divergence is invariant to one-to-one transformations (Kullback & Leibler, 1951, Corollary 4.1), the Kullback-Leibler divergence is constant with respect to RᵀXβ. The required result follows from setting RᵀXβ equal to zero and dropping the expectation in the right-hand side of the last display. □
S2. COROLLARY TO THEOREM 1

Theorem 1 considered how close our approximation Π̂(β | y) is to the posterior Π(β | y). Alternatively, one may be interested in a scenario where the nuisance parameter η equals η₀, and one would like to do inference without interference from the nuisance term using Π(β | y, η₀), even though η₀ is unknown.

Define the squared quadratic Wasserstein distance between the distributions Π(a) and Π(b) as W₂²{Π(a), Π(b)} = inf E(‖a − b‖²), where ‖·‖ denotes the Euclidean norm and the infimum is over all joint distributions on (a, b) such that a ∼ Π(a) and b ∼ Π(b).

LEMMA S2. Let P and Q be distributions on R^p. For any σ² > 0,

D{P ∗ N(0, σ²I_p) ‖ Q ∗ N(0, σ²I_p)} ≤ W₂²(P, Q)/(2σ²).

Proof. Let Π(a, b) be any coupling on R^p × R^p satisfying the marginal constraints Π(a) = P(a) and Π(b) = Q(b). By the convexity of Kullback-Leibler divergence (Cover & Thomas, 2006, Theorem 2.7.2), Jensen's inequality provides

D{P ∗ N(0, σ²I_p) ‖ Q ∗ N(0, σ²I_p)} ≤ E_{Π(a,b)}[D{N(a, σ²I_p) ‖ N(b, σ²I_p)}] = E_{Π(a,b)}(‖a − b‖²)/(2σ²),

where the equality follows from inserting the Gaussian densities into the definition of the Kullback-Leibler divergence. Recalling the definition of the quadratic Wasserstein distance and choosing the infimum over all couplings Π(a, b) of P and Q gives the stated result. □

COROLLARY 1. Let Π̂(β | y) be as in (4). Let y be distributed according to the data-generating model in (1) with β ∼ Π(β) distributed according to its prior and η fixed to η₀. Then,

E[D{Π(β | y, η₀) ‖ Π̂(β | y)} | Sᵀy] ≤ E_{Π̂(Rᵀη | Sᵀy)}(‖Rᵀη − Rᵀη₀‖² | Sᵀy)/(2σ²).

In particular, under the Gaussian approximation Π̂(Rᵀη | Sᵀy) from (5),

E[D{Π(β | y, η₀) ‖ Π̂(β | y)} | Sᵀy] ≤ {‖Rᵀη₀ − µ̂‖² + tr(Σ̂)}/(2σ²).

Proof. Evaluating Theorem 1 with Lemma S2, Π(η) = δ(η₀), a point mass at η₀, and recalling the definition of the quadratic Wasserstein distance provides the first inequality. For the second inequality, (5) provides Rᵀη − Rᵀη₀ | Sᵀy ∼ N(µ̂ − Rᵀη₀, Σ̂). Evaluating the right-hand side of the first inequality with this distribution provides the second inequality. □

Corollary 1 links two different quantities of interest. The left-hand side is the difference between our approximation Π̂(β | y) and the exact posterior Π(β | y, η₀). The right-hand side involves the average squared deviation of the distribution Π̂(Rᵀη | Sᵀy) from Rᵀη₀. This deviation can be small while the average squared deviation of Π(η | Sᵀy) from η₀ is large: the n-dimensional η can have a potentially high-dimensional distribution while the p-dimensional term Rᵀη is a projection onto the low-dimensional column space of R. In Corollary 1, y is distributed according to (1) with β ∼ Π(β) while η is fixed to η₀. That β and η are treated differently is a result of their different treatment in Algorithm 1.

Consider asymptotic analysis where, for a sequence of instances of (1), n → ∞ and interest is in the properties of Π̂(β | y) as n → ∞. If Π̂(Rᵀη | Sᵀy) contracts around the value Rᵀη₀ as n → ∞, Corollary 1 shows that the posterior approximation from our method converges to the posterior Π(β | y, η₀) based on the likelihood from (1) with η equal to η₀. This convergence is in terms of Kullback-Leibler divergence, which bounds dissimilarity measures commonly used in asymptotic analyses of Bayesian posteriors. For instance, Bernstein-von Mises theorems often use total variation distance (Bontemps, 2011), which Pinsker's inequality bounds by the square root of the Kullback-Leibler divergence. The finite-sample analysis of Corollary 1 therefore gives rise to asymptotic properties of the approximate posterior Π̂(β | y) if E_{Π̂(Rᵀη | Sᵀy)}(‖Rᵀη − Rᵀη₀‖² | Sᵀy) → 0. Such asymptotic results for Π̂(β | y) differ from usual Bayesian asymptotics due to the set-up of Corollary 1: the data-generating process involves β ∼ Π(β) rather than fixing β to a value. By contrast, η is fixed to η₀ in the data-generating process of Corollary 1 rather than distributed according to its prior.
S3. PROOF OF THEOREM 2

To simplify notation, define a = q^{1/2}Λ^{1/2}(α − ξ), b_a = q^{1/2}Λ^{1/2}(ξ̂ − ξ), and H = q^{−1/2}RᵀZΛ^{−1/2}, such that the entries of H are independent with distribution N(0, 1/q) and Ha = RᵀZ(α − ξ). Also,

∆ = D(Π(Ha | Sᵀy, SᵀZ) ∗ N_{σ²} ‖ N[Hb_a, {tr(ΛΨ̂) + σ²}I_p]),
∆₁ = D(Π(Ha | Sᵀy, SᵀZ) ∗ N_{σ²} ‖ N[0, {tr(ΛΨ) + σ²}I_p]),
∆₂ = D(N[0, {tr(ΛΨ) + σ²}I_p] ‖ N[Hb_a, {tr(ΛΨ̂) + σ²}I_p]).

Here, N[Hb_a, {tr(ΛΨ̂) + σ²}I_p] is a shifted version of the Gaussian approximation in (7). We will show that ∆ equals the divergence in Theorem 2. ∆₁ is the Kullback-Leibler divergence from the target distribution to the Gaussian approximation evaluated with the true, rather than the estimated, mean and covariance. ∆₂ depends on the mismatch in the estimates ξ̂, captured by b_a, and Ψ̂.

LEMMA S3. Conditional on any Sᵀy and SᵀZ, E_H(∆₁) ≤ δ₁ with δ₁ as in Theorem 2.

Proof. Since E(Ha | Sᵀy, SᵀZ) = 0, cov(Ha | Sᵀy, SᵀZ) = E(HaaᵀHᵀ), where we drop the conditioning on Sᵀy and SᵀZ for notational convenience. By the law of total expectation, E(HaaᵀHᵀ) = E{E(HaaᵀHᵀ | H)} = E{H cov(a) Hᵀ}. Inserting the definition of a and recalling cov(α) = Ψ yields E{H cov(a) Hᵀ} = E(H qΛ^{1/2}ΨΛ^{1/2} Hᵀ). Since E(H_{ij}H_{kl}) equals 1/q if (i, j) = (k, l) and 0 otherwise, E(H qΛ^{1/2}ΨΛ^{1/2} Hᵀ) = tr(qΛ^{1/2}ΨΛ^{1/2}) I_p/q. The cyclic property of the trace now provides cov(Ha | Sᵀy, SᵀZ) = tr(qΛ^{1/2}ΨΛ^{1/2}) I_p/q = tr(ΛΨ) I_p. Thus, the mean and covariance of both distributions in ∆₁ are matched. Therefore, Theorem 2 from Reeves (2017) evaluated with ε = 1 and C = 3 yields

E_H(∆₁) ≤ 3p^{1/2} log^{1/2}{1 + q^{−1}E(‖a‖²)/σ²} [E{|‖a‖² − E(‖a‖²)|}/E(‖a‖²)]^{1/2}
 + 3p^{3/4} {E(|aᵀa′|²)^{1/2}/E(‖a‖²)}^{1/2}
 + 3p^{5/4} {1 + q^{−1}E(‖a‖²)/σ²}^{1/2} {E(|aᵀa′|)/E(‖a‖²)}^{1/2}, (S1)

where a′ is an independent copy of a. The remainder of this proof is simplifying this bound. Since E(a) = 0 and cov(a) = qΛ^{1/2}ΨΛ^{1/2},

q^{−2}E(|aᵀa′|²) = q^{−2}E(aᵀa′a′ᵀa) = q^{−2}tr{E(aaᵀa′a′ᵀ)} = q^{−2}tr{cov(a)²} = tr(Λ^{1/2}ΨΛΨΛ^{1/2}) = tr{(ΛΨ)²},

and

q^{−1}E(‖a‖²) = q^{−1}tr{cov(a)} = q^{−1}tr(qΛ^{1/2}ΨΛ^{1/2}) = tr(ΛΨ).

Therefore,

E{|‖a‖² − E(‖a‖²)|}/E(‖a‖²) = E{|‖a‖²/E(‖a‖²) − 1|} = E{|‖Λ^{1/2}(α − ξ)‖²/tr(ΛΨ) − 1|} = m₁,

and, by Jensen's inequality,

{E(|aᵀa′|)/E(‖a‖²)}² ≤ E(|aᵀa′|²)/{E(‖a‖²)}² = q^{−2}E(|aᵀa′|²)/{tr(ΛΨ)}² = m₂.

Inserting the last three displays and p^{1/2} ≤ p^{3/4} ≤ p into (S1) provides the required result. □

LEMMA S4. Conditional on any Sᵀy and SᵀZ, E_H(∆₂) ≤ δ₂ with δ₂ as in Theorem 2.

Proof. Combining (7), Lemma S2, and the evaluation of the quadratic Wasserstein distance between two Gaussians from Dowson & Landau (1982) yields

∆₂ ≤ (1/2σ²)[‖Hb_a‖² + tr{tr(ΛΨ)I_p + tr(ΛΨ̂)I_p − 2 tr(ΛΨ)^{1/2}tr(ΛΨ̂)^{1/2}I_p}]
 = (1/2σ²)[‖Hb_a‖² + p{tr(ΛΨ)^{1/2} − tr(ΛΨ̂)^{1/2}}²]. (S2)

Recalling b_a = q^{1/2}Λ^{1/2}(ξ̂ − ξ),

E_H(‖Hb_a‖²) = E_H{‖q^{1/2}HΛ^{1/2}(ξ − ξ̂)‖²} = q{Λ^{1/2}(ξ − ξ̂)}ᵀ E_H(HᵀH) Λ^{1/2}(ξ − ξ̂) = p‖Λ^{1/2}(ξ − ξ̂)‖²,

where the last equality follows from E(HᵀH) = pI_q/q. Taking the expectation of (S2) with respect to H and inserting the last display yields the required result. □

Proof of Theorem 2. Let π₀ denote the density function of Π(Ha | Sᵀy, SᵀZ) ∗ N_{σ²} and let E₀(·) denote the expectation with respect to this distribution. Let υ ∼ π₀. By the definition of the Kullback-Leibler divergence,

∆ = E₀[log{π₀(υ)/N[υ | Hb_a, {tr(ΛΨ̂) + σ²}I_p]}]
 = E₀[log{π₀(υ)/N[υ | 0, {tr(ΛΨ) + σ²}I_p]}] + E₀[log{N[υ | 0, {tr(ΛΨ) + σ²}I_p]/N[υ | Hb_a, {tr(ΛΨ̂) + σ²}I_p]}],

where the first term equals ∆₁. Taking the expectation with respect to H yields

E_H(∆) = E_H(∆₁) + E_H(E₀[log{N[υ | 0, {tr(ΛΨ) + σ²}I_p]/N[υ | Hb_a, {tr(ΛΨ̂) + σ²}I_p]}]). (S3)

Denote the expectation with respect to υ ∼ N[0, {tr(ΛΨ) + σ²}I_p] by E₁(·). The mean and covariance of E_H{Π(Ha | Sᵀy, SᵀZ) ∗ N_{σ²}} and N[0, {tr(ΛΨ) + σ²}I_p] are the same, as confirmed in the proof of Lemma S3, and the expectation of the logarithm of the Gaussian density only depends on the mean and covariance of υ. Therefore,

E_H(E₀[log(N[υ | 0, {tr(ΛΨ) + σ²}I_p])]) = E₁[log(N[υ | 0, {tr(ΛΨ) + σ²}I_p])]. (S4)

Also, expanding the square inside the Gaussian density and noting E₀(υ) = 0 yields

E_H(E₀[log(N[υ | Hb_a, {tr(ΛΨ̂) + σ²}I_p])]) = E_H(E₀[log(N[υ | 0, {tr(ΛΨ̂) + σ²}I_p])] − ‖Hb_a‖²/[2{tr(ΛΨ̂) + σ²}]).

Again using that the expectation of the logarithm of a Gaussian density only depends on the mean and covariance of υ provides

E_H(E₀[log(N[υ | Hb_a, {tr(ΛΨ̂) + σ²}I_p])]) = E_H(E₁[log(N[υ | 0, {tr(ΛΨ̂) + σ²}I_p])] − ‖Hb_a‖²/[2{tr(ΛΨ̂) + σ²}]) = E_H(E₁[log(N[υ | Hb_a, {tr(ΛΨ̂) + σ²}I_p])]),

where the last equality follows from completing the square and E₁(υ) = 0. Inserting the last display and (S4) into (S3), and recalling the definition of the Kullback-Leibler divergence, shows E_H(∆) = E_H(∆₁) + E_H(∆₂).

Both distributions in the Kullback-Leibler divergence ∆ are equal to their respective distributions in the divergence in Theorem 2 shifted by Hq^{1/2}Λ^{1/2}ξ = RᵀZξ. Since the Kullback-Leibler divergence is invariant to one-to-one transformations (Kullback & Leibler, 1951, Corollary 4.1), ∆ equals the divergence in Theorem 2. Also, H is a deterministic function of RᵀZ such that taking the expectation with respect to one or the other yields the same result. Therefore, E_H(∆) equals the left-hand side of Theorem 2. The required result is thus E_H(∆) ≤ δ₁ + δ₂, which inserting Lemmas S3 and S4 into the last display provides. □
S4. PROOF OF THEOREM 3

Let $P_\gamma = X_\gamma(X_\gamma^TX_\gamma)^{-1}X_\gamma^T$ denote the orthogonal projection onto the column space of $X_\gamma$. The assumptions in Theorem 3 in addition to (9) are
\[
\mathrm{pr}(\gamma = \gamma_0) > 0, \tag{S5a}
\]
\[
\lim_{n\to\infty} \frac{\|(I_n - P_\gamma)X\beta_0\|^2}{n} > 0 \quad\text{for any } \gamma \text{ not containing } \gamma_0, \tag{S5b}
\]
\[
g_n \to \infty, \tag{S5c}
\]
\[
\frac{\log g_n}{n} \to 0. \tag{S5d}
\]
Assumption (S5a) is a basic prerequisite as otherwise $\mathrm{pr}(\gamma = \gamma_0 \mid y) = 0$. Assumption (S5b) is analogous to Equation A.4 from Fernández et al. (2001). Previous literature (Fernández et al., 2001; Liang et al., 2008) required $g_n$ to grow appropriately with $n$, estimated $g_n$ via empirical Bayes, or placed an appropriate prior on $g_n$ to obtain consistency. We focus on the first case by assuming (S5c) and (S5d). Condition (S5b) ensures that any model that does not contain the true one has posterior probability converging to zero. The fact that supersets of the true model are also discarded follows from the $g$-prior, which favours smaller subsets.

LEMMA S5. $P_{\gamma_0} - P_\gamma = RR^T(P_{\gamma_0} - P_\gamma)$.

Proof. Recall that $S^TX = 0_{(n-p)\times p}$ and that $Q$ is orthogonal so that $QQ^T = I_n$. Therefore,
\[
RR^TX = RR^TX + S(S^TX) = (RR^T + SS^T)X = (QQ^T)X = I_nX = X,
\]
where the third equality follows from $Q = (R, S)$. Considering $RR^TX = X$ columnwise and recalling $P_\gamma = X_\gamma(X_\gamma^TX_\gamma)^{-1}X_\gamma^T$ yields $RR^TP_\gamma = P_\gamma$ for any $\gamma$, including $\gamma_0$. □

Proof of Theorem 3. Conditional on $\gamma$ and $\eta$, the set-up is a normal–normal model as both the prior $\Pi(\beta_\gamma \mid \gamma)$ from (8) and the likelihood from (1) are Gaussian. The corresponding marginal likelihood follows as
\[
\pi(y \mid \gamma, \eta) = \int \pi(y \mid \beta_\gamma, \gamma, \eta)\,\pi(\beta_\gamma \mid \gamma)\,d\beta_\gamma
= (2\pi\sigma^2)^{-n/2}(g_n + 1)^{-|\gamma|/2}\exp\left\{-\frac{1}{2\sigma^2}\left(\|z\|^2 - \frac{g_n}{g_n + 1}z^TP_\gamma z\right)\right\},
\]
where $z = y - \eta$ and $|\gamma|$ denotes the number of elements in $\gamma$. The logarithm of the Bayes factor of the true model $\gamma_0$ over $\gamma$ conditional on $\eta$ is thus
\[
\log\mathrm{BF}_{\gamma_0:\gamma} = \log\frac{\pi(y \mid \gamma_0, \eta)}{\pi(y \mid \gamma, \eta)} = \frac{|\gamma| - |\gamma_0|}{2}\log(g_n + 1) + \frac{g_n}{2\sigma^2(g_n + 1)}h_\gamma(z), \tag{S6}
\]
where $h_\gamma(z) = z^T(P_{\gamma_0} - P_\gamma)z$. By assumption (S5a), the required result follows if $\log\mathrm{BF}_{\gamma_0:\gamma} \to \infty$ in probability for each $\gamma \neq \gamma_0$, while $\log\mathrm{BF}_{\gamma_0:\gamma_0} = 0$.

Since $z = y - \eta$, $z \sim N(\nu, \sigma^2I_n)$ where $\nu = X\beta_0 + \eta_0 - \eta$. Then, by Theorems 5.2a and 5.2c from Rencher & Schaalje (2008) and the fact that the trace of a projection matrix equals the dimensionality of its target space,
\[
E\{h_\gamma(z) \mid \eta\} = \sigma^2\,\mathrm{tr}(P_{\gamma_0} - P_\gamma) + \nu^T(P_{\gamma_0} - P_\gamma)\nu = \sigma^2(|\gamma_0| - |\gamma|) + \nu^T(P_{\gamma_0} - P_\gamma)\nu, \tag{S7a}
\]
\[
\mathrm{var}\{h_\gamma(z) \mid \eta\} = 2\sigma^4\,\mathrm{tr}\{(P_{\gamma_0} - P_\gamma)^2\} + 4\sigma^2\,\nu^T(P_{\gamma_0} - P_\gamma)^2\nu
\leq 2\sigma^4\{\mathrm{tr}(P_{\gamma_0}) + \mathrm{tr}(P_\gamma)\} + 4\sigma^2\|(P_{\gamma_0} - P_\gamma)\nu\|^2
= 2\sigma^4(|\gamma_0| + |\gamma|) + 4\sigma^2\|(P_{\gamma_0} - P_\gamma)\nu\|^2. \tag{S7b}
\]
We analyse the asymptotic behaviour of $h_\gamma(z)$ by bounding this expectation and variance. The first term of each right-hand side in (S7) does not grow with $n$. Let us bound the second terms. Inserting $\nu = X\beta_0 + \zeta$ where $\zeta = \eta_0 - \eta$ and expanding the square yields
\[
\nu^T(P_{\gamma_0} - P_\gamma)\nu = (X\beta_0)^T(P_{\gamma_0} - P_\gamma)X\beta_0 + 2\zeta^T(P_{\gamma_0} - P_\gamma)X\beta_0 + \zeta^T(P_{\gamma_0} - P_\gamma)\zeta.
\]
Inserting $P_{\gamma_0}X\beta_0 = X\beta_0$ and Lemma S5 provides
\[
\nu^T(P_{\gamma_0} - P_\gamma)\nu = \|(I_n - P_\gamma)X\beta_0\|^2 + 2\zeta^TRR^T(I_n - P_\gamma)X\beta_0 + \zeta^TRR^T(P_{\gamma_0} - P_\gamma)\zeta.
\]
Applying the Cauchy–Schwarz inequality and $|\zeta^TRR^T(P_{\gamma_0} - P_\gamma)\zeta| \leq \zeta^TRR^T\zeta = \|R^T\zeta\|^2$,
\[
\nu^T(P_{\gamma_0} - P_\gamma)\nu \geq \|(I_n - P_\gamma)X\beta_0\|^2 - 2\|R^T\zeta\|\,\|R^T(I_n - P_\gamma)X\beta_0\| - \|R^T\zeta\|^2,
\]
\[
\nu^T(P_{\gamma_0} - P_\gamma)\nu \leq \|(I_n - P_\gamma)X\beta_0\|^2 + 2\|R^T\zeta\|\,\|R^T(I_n - P_\gamma)X\beta_0\| + \|R^T\zeta\|^2.
\]
Since the columns of $R$ form an orthonormal basis for the column space of $X$, $\|R^T(I_n - P_\gamma)X\beta_0\| = \|(I_n - P_\gamma)X\beta_0\|$ such that
\[
\nu^T(P_{\gamma_0} - P_\gamma)\nu \geq \left\{\|(I_n - P_\gamma)X\beta_0\| - 2\|R^T\zeta\|\right\}\|(I_n - P_\gamma)X\beta_0\| - \|R^T\zeta\|^2, \tag{S8a}
\]
\[
\nu^T(P_{\gamma_0} - P_\gamma)\nu \leq \left\{\|(I_n - P_\gamma)X\beta_0\| + 2\|R^T\zeta\|\right\}\|(I_n - P_\gamma)X\beta_0\| + \|R^T\zeta\|^2. \tag{S8b}
\]
For the second term of the right-hand side in (S7b), consider $\nu = X\beta_0 + \zeta$ and
\[
\|(P_{\gamma_0} - P_\gamma)\nu\| = \|(I_n - P_\gamma)X\beta_0 + (P_{\gamma_0} - P_\gamma)\zeta\|.
\]
By the triangle inequality,
\[
\|(P_{\gamma_0} - P_\gamma)\nu\| \leq \|(I_n - P_\gamma)X\beta_0\| + \|(P_{\gamma_0} - P_\gamma)RR^T\zeta\|,
\]
where the equality follows from $P_{\gamma_0}X\beta_0 = X\beta_0$ and Lemma S5. Also, $\|(P_{\gamma_0} - P_\gamma)RR^T\zeta\| \leq \|RR^T\zeta\| = \|R^T\zeta\|$ since $R^TR = I_p$. Therefore, $\|(P_{\gamma_0} - P_\gamma)\nu\| \leq \|(I_n - P_\gamma)X\beta_0\| + \|R^T\zeta\|$. Inserting into (S7b) provides
\[
\mathrm{var}\{h_\gamma(z) \mid \eta\}^{1/2} \leq \left\{2\sigma^4(|\gamma_0| + |\gamma|) + 4\sigma^2\|(P_{\gamma_0} - P_\gamma)\nu\|^2\right\}^{1/2}
\leq 2^{1/2}\sigma^2(|\gamma_0| + |\gamma|)^{1/2} + 2\sigma\|(P_{\gamma_0} - P_\gamma)\nu\|
\leq 2^{1/2}\sigma^2(|\gamma_0| + |\gamma|)^{1/2} + 2\sigma\left\{\|(I_n - P_\gamma)X\beta_0\| + \|R^T\zeta\|\right\}. \tag{S9}
\]
Since $\zeta = \eta_0 - \eta$, assumptions (9) and (S5d) imply
\[
\frac{\|R^T\zeta\|^2}{\log g_n} \to 0, \qquad \frac{\|R^T\zeta\|^2}{n} \to 0 \tag{S10}
\]
in probability. Let us consider $\gamma \neq \gamma_0$ that contain $\gamma_0$, that is $\gamma_0 \subset \gamma$, and $\gamma$ that do not contain $\gamma_0$, that is $\gamma_0 \not\subset \gamma$, separately.

First, consider the case where $\gamma$ does not contain $\gamma_0$. Assumptions (S5b), (S7a), (S8a), and (S10) imply $E\{h_\gamma(z) \mid \eta\}/\|(I_n - P_\gamma)X\beta_0\| \to \infty$. On the other hand, $\lim_{n\to\infty}\mathrm{var}\{h_\gamma(z) \mid \eta\}^{1/2}/\|(I_n - P_\gamma)X\beta_0\| \leq 2\sigma$ by (S5b), (S9), and (S10). Therefore, $\lim_{n\to\infty}h_\gamma(z)/n > 0$ with probability tending to one by Chebyshev's inequality and (S5b). Under assumption (S5d), it then follows from (S6) that $\log\mathrm{BF}_{\gamma_0:\gamma} \to \infty$ in probability.

Next, consider the case where $\gamma$ contains $\gamma_0$. In this setting, $P_\gamma X\beta_0 = X\beta_0$ and thus $(I_n - P_\gamma)X\beta_0 = 0_{n\times 1}$. Therefore, (S7a) with (S8b), and (S9) reduce to
\[
E\{h_\gamma(z) \mid \eta\} \leq \sigma^2(|\gamma_0| - |\gamma|) + \|R^T\zeta\|^2,
\]
\[
\mathrm{var}\{h_\gamma(z) \mid \eta\}^{1/2} \leq 2^{1/2}\sigma^2(|\gamma_0| + |\gamma|)^{1/2} + 2\sigma\|R^T\zeta\|.
\]
Chebyshev's inequality and (S10) thus provide $\lim_{n\to\infty}h_\gamma(z)/\log g_n = 0$ with probability tending to one. We conclude from (S6) that $\log\mathrm{BF}_{\gamma_0:\gamma} \to \infty$ in probability because of assumption (S5c) and $|\gamma| > |\gamma_0|$.

We have shown $\log\mathrm{BF}_{\gamma_0:\gamma} \to \infty$ in probability whenever $\gamma \neq \gamma_0$. The required result follows from this as noted earlier in this proof. □
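To make the quantities in (S6) concrete, the sketch below computes the conditional log Bayes factor under the $g$-prior by direct projection. It is a minimal numerical illustration, not part of the proof; the function names and test values are ours.

```python
import numpy as np

def proj(X_gamma):
    # Orthogonal projection onto the column space of X_gamma.
    return X_gamma @ np.linalg.pinv(X_gamma)

def log_bf(y, eta, X, gamma0, gamma, g_n, sigma2):
    # log Bayes factor of gamma0 over gamma given eta, as in (S6);
    # gamma0 and gamma are lists of column indices of X.
    z = y - eta
    h = z @ (proj(X[:, gamma0]) - proj(X[:, gamma])) @ z
    return (0.5 * (len(gamma) - len(gamma0)) * np.log(g_n + 1)
            + g_n / (2 * sigma2 * (g_n + 1)) * h)

# Toy check: the true model gamma0 = {0} is favoured over gamma = {1, 2}.
rng = np.random.default_rng(1)
n = 200
X = rng.normal(size=(n, 5))
y = 2.0 * X[:, 0] + rng.normal(size=n)
print(log_bf(y, np.zeros(n), X, [0], [1, 2], g_n=float(n), sigma2=1.0))
```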
S5. VECTOR APPROXIMATE MESSAGE PASSING

S5·1. Derivation

To motivate the steps of vector approximate message passing in Algorithm S1 below, we derive the algorithm as an approximation to sum-product message passing (Bishop, 2006, § 8·4; Rangan et al., 2016, § III-B). Consider the linear model $y \sim N(X\beta, \sigma^2I_n)$ where $y$ is an $n$-dimensional vector of observations, $X$ an $n\times p$ design matrix, $\beta$ a $p$-dimensional vector of parameters, and $\sigma^2$ the error variance. We assume that the entries of $\beta$ are a priori independent such that $\pi(\beta) = \prod_{j=1}^p \pi(\beta_j)$. The goal is to approximate the posterior
\[
\pi(\beta \mid y) \propto \pi(\beta)\pi(y \mid \beta) = \pi(\beta)\,N(y \mid X\beta, \sigma^2I_n) = \pi(\beta)\,\delta(\beta - \tilde\beta)\,N(y \mid X\tilde\beta, \sigma^2I_n), \tag{S11}
\]
where $\delta$ is the Dirac delta function and $\tilde\beta$ is thus a copy of $\beta$. This copying of $\beta$ gives rise to an extra variable node in the corresponding factor graph in Fig. S1.

Fig. S1: The factor graph representation of (S11), with factor nodes (squares) $\pi(\beta)$, $\delta(\beta - \tilde\beta)$, and $N(y \mid X\tilde\beta, \sigma^2I_n)$, and variable nodes (circles) $\beta$ and $\tilde\beta$. This figure is an edited version of Fig. 1 from Rangan et al. (2016).

Let $\mu_{\pi\to\beta}$ and $\mu_{\delta\to\beta}$ denote the messages to the variable node $\beta$, $\mu_{\delta\to\tilde\beta}$ and $\mu_{N\to\tilde\beta}$ the messages to the variable node $\tilde\beta$, and $\mu_{\beta\to\delta}$ and $\mu_{\tilde\beta\to\delta}$ the messages to the factor node $\delta(\beta - \tilde\beta)$. By the general expression for a message from a factor to a variable node (Bishop, 2006, Equation 8.69),
\[
\mu_{\delta\to\tilde\beta}(\tilde\beta) = \int \delta(\beta - \tilde\beta)\,\mu_{\beta\to\delta}(\beta)\,d\beta = \mu_{\beta\to\delta}(\tilde\beta), \qquad
\mu_{\delta\to\beta}(\beta) = \int \delta(\beta - \tilde\beta)\,\mu_{\tilde\beta\to\delta}(\tilde\beta)\,d\tilde\beta = \mu_{\tilde\beta\to\delta}(\beta). \tag{S12}
\]
The beliefs at the variable nodes are the products of the incoming messages,
\[
b(\beta) \propto \mu_{\pi\to\beta}(\beta)\,\mu_{\delta\to\beta}(\beta) = \pi(\beta)\,\mu_{\delta\to\beta}(\beta), \qquad
b(\tilde\beta) \propto \mu_{\delta\to\tilde\beta}(\tilde\beta)\,\mu_{N\to\tilde\beta}(\tilde\beta) = \mu_{\beta\to\delta}(\tilde\beta)\,N(y \mid X\tilde\beta, \sigma^2I_n),
\]
where the last equality uses (S12). Combining these beliefs with the general expression for a message from a variable to a factor node (Bishop, 2006, Equation 8.66) and Fig. S1 yields
\[
\mu_{\beta\to\delta}(\beta) = \mu_{\pi\to\beta}(\beta) \propto \frac{b(\beta)}{\mu_{\delta\to\beta}(\beta)}, \qquad
\mu_{\tilde\beta\to\delta}(\tilde\beta) = \mu_{N\to\tilde\beta}(\tilde\beta) \propto \frac{b(\tilde\beta)}{\mu_{\delta\to\tilde\beta}(\tilde\beta)} = \frac{b(\tilde\beta)}{\mu_{\beta\to\delta}(\tilde\beta)},
\]
where the last equality follows from (S12).

The last two displays provide a message-passing algorithm. Initialize $\mu_{\delta\to\beta}(\beta)$. Then, iterate the updates
\[
b(\beta) \propto \pi(\beta)\,\mu_{\delta\to\beta}(\beta), \tag{S13a}
\]
\[
\mu_{\beta\to\delta}(\beta) \propto b(\beta)/\mu_{\delta\to\beta}(\beta), \tag{S13b}
\]
\[
b(\tilde\beta) \propto \mu_{\beta\to\delta}(\tilde\beta)\,N(y \mid X\tilde\beta, \sigma^2I_n), \tag{S13c}
\]
\[
\mu_{\delta\to\beta}(\tilde\beta) = \mu_{\tilde\beta\to\delta}(\tilde\beta) \propto b(\tilde\beta)/\mu_{\beta\to\delta}(\tilde\beta), \tag{S13d}
\]
where the last equality is from (S12). Since the graph in Fig. S1 is a tree, the beliefs $b(\beta)$ converge to the exact posterior $\pi(\beta \mid y)$ after one iteration. This exact algorithm can however be expensive to compute for certain $\pi(\beta)$ if $p$ is large. Vector approximate message passing approximates (S13) to reduce computational cost.

Initialize $\mu_{\delta\to\beta}(\beta) = N(\beta \mid r_0, t_0I_p)$. At the $k$th iteration, approximate $b(\beta)$ by $N(\beta \mid \hat\beta_k, s_kI_p)$ where $\hat\beta_k = E_{b(\beta)}(\beta)$ and $s_k = \mathrm{tr}\{\mathrm{cov}_{b(\beta)}(\beta)\}/p$. Applying (S13a) provides Step 3a of Algorithm S1.

Since $\mu_{\delta\to\beta}(\beta) \approx N(\beta \mid r_k, t_kI_p)$ and $b(\beta) \approx N(\beta \mid \hat\beta_k, s_kI_p)$, the resulting approximation to $\mu_{\beta\to\delta}(\beta)$ is Gaussian too by (S13b). Denote this Gaussian approximation by $N(\tilde\beta \mid \tilde r_k, \tilde t_kI_p)$. Step 3b states the update equations for $\tilde r_k$ and $\tilde t_k$ derived from (S13b).

With $\mu_{\beta\to\delta}(\tilde\beta) \approx N(\tilde\beta \mid \tilde r_k, \tilde t_kI_p)$, $b(\tilde\beta)$ from (S13c) is Gaussian too. We further approximate $b(\tilde\beta)$ by requiring its covariance to be proportional to the identity matrix. Let $b(\tilde\beta) \approx N(\tilde\beta \mid \hat{\tilde\beta}_k, \tilde s_kI_p)$. The updates follow from (S13c) as
\[
\hat{\tilde\beta}_k = (\tilde t_kX^TX + \sigma^2I_p)^{-1}(\tilde t_kX^Ty + \sigma^2\tilde r_k), \tag{S14a}
\]
\[
\tilde s_k = \frac{\sigma^2\tilde t_k}{p}\,\mathrm{tr}\left\{(\tilde t_kX^TX + \sigma^2I_p)^{-1}\right\}. \tag{S14b}
\]
These involve an inversion of a $p\times p$ matrix, which is expensive to compute if $p$ is large. We can however rewrite these expressions to make their computation faster. Let $X = UDV^T$ denote a singular-value decomposition with an $n\times\min(n,p)$ matrix $U$, a $\min(n,p)\times\min(n,p)$ diagonal matrix $D$, and $V$ a $p\times\min(n,p)$ matrix such that $U^TU = V^TV = I_{\min(n,p)}$. Substituting $X = UDV^T$ yields
\[
(\tilde t_kX^TX + \sigma^2I_p)^{-1} = (\tilde t_kVD^2V^T + \sigma^2I_p)^{-1} = \frac{1}{\sigma^2}\left(\frac{\tilde t_k}{\sigma^2}VD^2V^T + I_p\right)^{-1} = \frac{1}{\sigma^2}\left\{I_p - V\left(\frac{\sigma^2}{\tilde t_k}D^{-2} + V^TV\right)^{-1}V^T\right\} = \frac{1}{\sigma^2}\left[I_p - V\left\{\frac{\sigma^2}{\tilde t_k}D^{-2} + I_{\min(n,p)}\right\}^{-1}V^T\right], \tag{S15}
\]
where $D^2 = DD$, $D^{-2} = D^{-1}D^{-1}$, and the third equality follows from the Woodbury matrix identity. Substituting $X^T = VDU^T$ and (S15) provides
\[
(\tilde t_kX^TX + \sigma^2I_p)^{-1}\tilde t_kX^Ty = \frac{\tilde t_k}{\sigma^2}\left[I_p - V\left\{\frac{\sigma^2}{\tilde t_k}D^{-2} + I_{\min(n,p)}\right\}^{-1}V^T\right]VDU^Ty = \frac{\tilde t_k}{\sigma^2}V\left[I_{\min(n,p)} - \left\{\frac{\sigma^2}{\tilde t_k}D^{-2} + I_{\min(n,p)}\right\}^{-1}\right]DU^Ty,
\]
where the last equality uses $V^TV = I_{\min(n,p)}$. Since the expression inside the square brackets consists only of diagonal matrices, we can write it as a single fraction to obtain
\[
(\tilde t_kX^TX + \sigma^2I_p)^{-1}\tilde t_kX^Ty = \frac{\tilde t_k}{\sigma^2}V\left\{I_{\min(n,p)} + \frac{\tilde t_k}{\sigma^2}D^2\right\}^{-1}DU^Ty = V\left\{\frac{\sigma^2}{\tilde t_k}I_{\min(n,p)} + D^2\right\}^{-1}DU^Ty,
\]
where the first equality follows from multiplying both the numerator and the denominator of the diagonal-matrix fraction by $(\tilde t_k/\sigma^2)D^2$. Combining the last display with (S14a) and (S15) provides
\[
\hat{\tilde\beta}_k = V\left\{\frac{\sigma^2}{\tilde t_k}I_{\min(n,p)} + D^2\right\}^{-1}DU^Ty + \left[I_p - V\left\{\frac{\sigma^2}{\tilde t_k}D^{-2} + I_{\min(n,p)}\right\}^{-1}V^T\right]\tilde r_k
= \tilde r_k + V\left\{\frac{\sigma^2}{\tilde t_k}I_{\min(n,p)} + D^2\right\}^{-1}\left(DU^Ty - D^2V^T\tilde r_k\right)
= \tilde r_k + V\left(\frac{\sigma^2}{\tilde t_k}D^{-1} + D\right)^{-1}\left(U^Ty - DV^T\tilde r_k\right), \tag{S16}
\]
where we used that $\{\sigma^2D^{-2}/\tilde t_k + I_{\min(n,p)}\}^{-1} = \{\sigma^2I_{\min(n,p)}/\tilde t_k + D^2\}^{-1}D^2$ and that $D$ is a diagonal matrix. This update for $\hat{\tilde\beta}_k$ only involves matrix multiplications and inversions of diagonal matrices.

For $\tilde s_k$, substitute (S15) into (S14b) such that
\[
\tilde s_k = \frac{\tilde t_k}{p}\,\mathrm{tr}\left[I_p - V\left\{\frac{\sigma^2}{\tilde t_k}D^{-2} + I_{\min(n,p)}\right\}^{-1}V^T\right] = \tilde t_k - \frac{\tilde t_k}{p}\,\mathrm{tr}\left[\left\{\frac{\sigma^2}{\tilde t_k}D^{-2} + I_{\min(n,p)}\right\}^{-1}V^TV\right],
\]
where the last equality uses that the trace is invariant under cyclic permutations. Since $V^TV = I_{\min(n,p)}$, the last expression reduces to
\[
\tilde s_k = \tilde t_k - \frac{\tilde t_k}{p}\,\mathrm{tr}\left[\left\{\frac{\sigma^2}{\tilde t_k}D^{-2} + I_{\min(n,p)}\right\}^{-1}\right] = \tilde t_k\left[1 - \frac{1}{p}\,\mathrm{tr}\left\{D\left(\frac{\sigma^2}{\tilde t_k}D^{-1} + D\right)^{-1}\right\}\right],
\]
where the last equality uses that the argument of the trace is diagonal. This display with (S16) constitutes Step 3c.
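The equivalence between (S14a) and (S16) is easy to check numerically. The following sketch, with arbitrary test values of our choosing, compares the direct $p\times p$ solve with the singular-value-decomposition form.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 50, 20
X = rng.normal(size=(n, p))
y = rng.normal(size=n)
r_tld = rng.normal(size=p)        # plays the role of \tilde{r}_k
t_tld, sigma2 = 0.7, 1.3          # arbitrary \tilde{t}_k and sigma^2

# Direct evaluation of (S14a): a p x p linear solve.
direct = np.linalg.solve(t_tld * X.T @ X + sigma2 * np.eye(p),
                         t_tld * X.T @ y + sigma2 * r_tld)

# Evaluation via (S16): only diagonal matrices are inverted.
U, d, Vt = np.linalg.svd(X, full_matrices=False)
w = 1.0 / (sigma2 / (t_tld * d) + d)   # {sigma^2 D^{-1}/t_tld + D}^{-1}
via_svd = r_tld + Vt.T @ (w * (U.T @ y - d * (Vt @ r_tld)))

print(np.allclose(direct, via_svd))    # True
```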
Recall $\mu_{\beta\to\delta}(\tilde\beta) \approx N(\tilde\beta \mid \tilde r_k, \tilde t_kI_p)$ and $b(\tilde\beta) \approx N(\tilde\beta \mid \hat{\tilde\beta}_k, \tilde s_kI_p)$. We would like to update $\mu_{\delta\to\beta}(\beta) \approx N(\beta \mid r_{k+1}, t_{k+1}I_p)$, where we have incremented the iteration counter. Step 3d now follows from (S13d) in the same way as Step 3b followed from (S13b).

Algorithm S1. Vector approximate message passing.
Input: Data $(y, X)$
1. Compute the singular-value decomposition $X = UDV^T$.
2. Initialize $r_0$ and $t_0$.
3. For $k = 0, \ldots, K$ do:
 a. Set $\hat\beta_{k,j} = E(\beta_j \mid r_{k,j}, t_k)$ and $s_k = \sum_{j=1}^p \mathrm{var}(\beta_j \mid r_{k,j}, t_k)/p$, where the density of $\beta_j$ is proportional to $\pi(\beta_j)N(\beta_j \mid r_{k,j}, t_k)$ for $j = 1, \ldots, p$.
 b. Set $1/\tilde t_k = 1/s_k - 1/t_k$ and $\tilde r_k = (t_k\hat\beta_k - s_kr_k)/(t_k - s_k)$.
 c. Set $\hat{\tilde\beta}_k = \tilde r_k + V(\sigma^2D^{-1}/\tilde t_k + D)^{-1}(U^Ty - DV^T\tilde r_k)$ and $\tilde s_k = \tilde t_k[1 - \mathrm{tr}\{D(\sigma^2D^{-1}/\tilde t_k + D)^{-1}\}/p]$.
 d. Set $1/t_{k+1} = 1/\tilde s_k - 1/\tilde t_k$ and $r_{k+1} = (\tilde t_k\hat{\tilde\beta}_k - \tilde s_k\tilde r_k)/(\tilde t_k - \tilde s_k)$.
Output: Approximate posterior $N(\hat\beta_K, s_KI_p)$.

S5·2. Computational complexity

The computational complexity of the singular-value decomposition is $O\{np\min(n, p)\}$ (Rangan et al., 2016, § I-E). The steps inside each iteration are $O(p)$ except for Step 3c, which is $O\{p\min(n, p)\}$ if $U^Ty$ is precomputed. The computational complexity of Algorithm S1 is thus $O\{(n + K)p\min(n, p)\}$. In practice, we do not always run Algorithm S1 for all $K$ iterations: we stop it once the innovation $\|\hat\beta_k - \hat\beta_{k-1}\|$ becomes small enough, indicating convergence.

S5·3. Estimating $\sigma^2$

So far, we have treated $\sigma^2$ as fixed and known. In practice, estimation of $\sigma^2$ is often required, and methods available for Step 2 of Algorithm 1 often provide such estimation. For instance, Vila & Schniter (2011) detail how $\sigma^2$ can be estimated when using approximate message passing. We add a step to Algorithm S1 to estimate $\sigma^2$ when required: consider the prior $1/\sigma^2 \sim \mathrm{Ga}(a_0, b_0)$ for some shape parameter $a_0$ and rate parameter $b_0$. Then, the full conditional posterior for $1/\sigma^2$ in Algorithm S1 at iteration $k$ is
\[
1/\sigma^2 \mid \hat\beta_k \sim \mathrm{Ga}\left(a_0 + \frac{n}{2},\; b_0 + \frac{\|y - X\hat\beta_k\|^2}{2}\right).
\]
At each iteration, we update $\sigma^2$ such that $1/\sigma^2$ matches the mean of this full conditional:
\[
\sigma^2_k = \frac{b_0 + \|y - X\hat\beta_k\|^2/2}{a_0 + n/2} \quad (k = 1, \ldots, K),
\]
between Steps 3a and 3b of Algorithm S1.

S5·4. Dampened updates

If vector approximate message passing fails to converge, which can happen for certain matrices $X$ with a challenging collinearity structure, damping of the updates can induce convergence, like it does in approximate message passing (Rangan et al., 2014). In this article, we only dampen the updates for the single-nucleotide polymorphism application in the main text. Let $\rho \in (0, 1]$ denote the damping constant, with $\rho = 1$ representing no damping. Then, the dampened version of Algorithm S1 follows by replacing Steps 3a and 3c by
\[
\hat\beta_{k,j} = (1 - \rho)\hat\beta_{k-1,j} + \rho\,E(\beta_j \mid r_{k,j}, t_k), \qquad
s_k = (1 - \rho)s_{k-1} + \rho\sum_{j=1}^p \mathrm{var}(\beta_j \mid r_{k,j}, t_k)/p;
\]
and
\[
\hat{\tilde\beta}_k = (1 - \rho)\hat{\tilde\beta}_{k-1} + \rho\left\{\tilde r_k + V(\sigma^2D^{-1}/\tilde t_k + D)^{-1}(U^Ty - DV^T\tilde r_k)\right\},
\]
\[
\tilde s_k = (1 - \rho)\tilde s_{k-1} + \rho\,\tilde t_k\left[1 - \mathrm{tr}\{D(\sigma^2D^{-1}/\tilde t_k + D)^{-1}\}/p\right];
\]
respectively, for $k > 0$.
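To make Algorithm S1 concrete, here is a minimal sketch in Python for a Bernoulli–Gaussian, that is spike-and-slab, prior, including the stopping rule of § S5·2 and the damping of § S5·4. The initialization, test data, and parameter values are illustrative assumptions rather than recommendations, and $\sigma^2$ is taken as known; § S5·3 describes how it could be updated within the loop.

```python
import numpy as np
from scipy.stats import norm
from scipy.special import expit

def spike_slab_moments(r, t, lam, psi):
    # Posterior mean and variance of beta_j under the prior
    # lam * N(0, psi) + (1 - lam) * delta_0, given r ~ N(beta_j, t).
    log_odds = (np.log(lam / (1 - lam))
                + norm.logpdf(r, 0.0, np.sqrt(psi + t))
                - norm.logpdf(r, 0.0, np.sqrt(t)))
    w = expit(log_odds)                   # pr(beta_j != 0 | r)
    v = psi * t / (psi + t)               # slab conditional variance
    m = psi * r / (psi + t)               # slab conditional mean
    mean = w * m
    return mean, w * (v + m ** 2) - mean ** 2

def vamp(y, X, sigma2, lam, psi, K=100, rho=1.0, tol=1e-8):
    n, p = X.shape
    U, d, Vt = np.linalg.svd(X, full_matrices=False)      # Step 1
    Uty = U.T @ y
    r, t = np.zeros(p), 100.0 * psi                       # Step 2 (heuristic)
    b_hat, s = np.zeros(p), psi
    b_tld, s_tld = np.zeros(p), psi
    for k in range(K):
        b_old = b_hat
        rho_k = rho if k > 0 else 1.0                     # no damping at k = 0
        mean, var = spike_slab_moments(r, t, lam, psi)    # Step 3a, dampened
        b_hat = (1 - rho_k) * b_hat + rho_k * mean
        s = (1 - rho_k) * s + rho_k * var.mean()
        t_tld = 1.0 / (1.0 / s - 1.0 / t)                 # Step 3b
        r_tld = (t * b_hat - s * r) / (t - s)
        denom = sigma2 / t_tld + d ** 2                   # Step 3c via (S16)
        b_new = r_tld + Vt.T @ ((d / denom) * (Uty - d * (Vt @ r_tld)))
        s_new = t_tld * (1.0 - np.sum(d ** 2 / denom) / p)
        b_tld = (1 - rho_k) * b_tld + rho_k * b_new       # dampened, S5.4
        s_tld = (1 - rho_k) * s_tld + rho_k * s_new
        t = 1.0 / (1.0 / s_tld - 1.0 / t_tld)             # Step 3d
        r = (t_tld * b_tld - s_tld * r_tld) / (t_tld - s_tld)
        if k > 0 and np.linalg.norm(b_hat - b_old) < tol: # stopping, S5.2
            break
    return b_hat, s               # approximate posterior N(b_hat, s * I_p)

# Illustrative usage on synthetic data.
rng = np.random.default_rng(1)
n, p = 100, 200
X = rng.normal(size=(n, p)) / np.sqrt(n)
beta = np.zeros(p)
beta[:5] = 5.0
y = X @ beta + 0.5 * rng.normal(size=n)
print(vamp(y, X, sigma2=0.25, lam=0.05, psi=25.0)[0][:6].round(2))
```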
S6. DEBIASED LASSO
Consider the linear model $y \sim N(X\beta, \sigma^2I_n)$ as in § S5 with the spike-and-slab prior $\beta_j \sim \lambda N(0, \psi) + (1 - \lambda)\delta_0$ independently for $j = 1, \ldots, p$, where $\delta_0$ denotes a point mass at zero. Denote the lasso estimator of $\beta$ by $\hat\beta_{\text{lasso}}(y)$: here, we use the smallest lasso regularization parameter that results in at most $\lfloor\lambda p\rfloor$ nonzero coefficients, where $\lambda p$ is the expected number of nonzero elements of $\beta$ under its spike-and-slab prior. The lasso algorithm from Efron et al. (2004) allows for efficient computation of $\hat\beta_{\text{lasso}}(y)$ under this constraint on the regularization parameter.

As the number of predictors is less than the sample size in the application considered, we assume $p \leq n$. Then, we can set the matrix $M$ in Javanmard & Montanari (2013) equal to $\hat\Sigma^{-1}$ where $\hat\Sigma = X^TX/n$. The debiased lasso estimator follows as (Javanmard & Montanari, 2013, Equation 5)
\[
\hat\beta_{\text{unbiased}}(y) = \hat\beta_{\text{lasso}}(y) + \frac{1}{n}MX^T\left\{y - X\hat\beta_{\text{lasso}}(y)\right\}.
\]
Theorem 2.1 from Javanmard & Montanari (2013) motivates
\[
\beta \mid y \sim N\left\{\hat\beta_{\text{unbiased}}(y),\; \frac{\sigma^2}{n}M\hat\Sigma M^T\right\}
\]
as a posterior approximation based on the debiased lasso. The error variance $\sigma^2$ is unknown in that application. We therefore estimate it by $(b_0 + \|y - X\hat\beta_{\text{unbiased}}(y)\|^2/2)/(a_0 + n/2)$, analogously to § S5·3.
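A sketch of this construction, assuming $p \leq n$ and using the least-angle-regression path implementation from scikit-learn for the lasso step, could look as follows; the function name and the choice of library are ours.

```python
import numpy as np
from sklearn.linear_model import lars_path

def debiased_lasso_approx(y, X, lam, sigma2):
    # Gaussian posterior approximation of S6; assumes p <= n and that
    # Sigma_hat = X^T X / n is invertible.
    n, p = X.shape
    k_max = int(np.floor(lam * p))      # at most floor(lam * p) nonzeros
    # Lasso path by least angle regression (Efron et al., 2004); take the
    # smallest penalty yielding at most k_max nonzero coefficients.
    _, _, coefs = lars_path(X, y, method="lasso")
    nnz = np.count_nonzero(coefs, axis=0)
    beta_lasso = coefs[:, np.nonzero(nnz <= k_max)[0].max()]
    Sigma_hat = X.T @ X / n
    M = np.linalg.inv(Sigma_hat)        # the choice M = Sigma_hat^{-1}
    beta_unb = beta_lasso + M @ X.T @ (y - X @ beta_lasso) / n
    cov = (sigma2 / n) * M @ Sigma_hat @ M.T
    return beta_unb, cov                # beta | y approx N(beta_unb, cov)

# Illustrative usage.
rng = np.random.default_rng(4)
n, p = 100, 10
X = rng.normal(size=(n, p))
y = 2.0 * X[:, 0] + rng.normal(size=n)
mean, cov = debiased_lasso_approx(y, X, lam=0.3, sigma2=1.0)
print(mean.round(2))
```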
S7. LAPLACE APPROXIMATION

We follow Steinberg & Bonilla (2014). Define $F = \{f(z_1), \ldots, f(z_n)\}^T$. Since $f$ has a Gaussian process prior, $F \sim N(\mu, \Sigma)$ for some $n$-dimensional mean $\mu$ and an $n\times n$ covariance matrix $\Sigma$. The first-order Taylor series of $G$ around an $n$-dimensional vector $m$ is $G(F) \approx G(m) + J_m(F - m)$, where $J_m$ is the Jacobian matrix of $G(F)$ evaluated at $m$. The corresponding approximate likelihood from the non-linear Gaussian model $S^Ty \sim N\{S^TG(F), \sigma^2I_{n-p}\}$ follows as $\hat\pi_m(S^Ty \mid F) = N\{S^Ty \mid S^TG(m) + S^TJ_m(F - m), \sigma^2I_{n-p}\}$, which yields the approximate posterior
\[
\hat\pi_m(F \mid S^Ty) = N\left(\Sigma^*_m\left[\frac{1}{\sigma^2}J_m^TSS^T\{y - G(m) + J_mm\} + \Sigma^{-1}\mu\right],\; \Sigma^*_m\right),
\]
where $\Sigma^*_m = (J_m^TSS^TJ_m/\sigma^2 + \Sigma^{-1})^{-1}$. The posterior mean suggests the iterative update
\[
m_{t+1} = (1 - \rho_t)m_t + \rho_t\,\Sigma^*_{m_t}\left[\frac{1}{\sigma^2}J_{m_t}^TSS^T\{y - G(m_t) + J_{m_t}m_t\} + \Sigma^{-1}\mu\right],
\]
where $t$ is the iteration number and $\rho_t$ the learning rate. This update produces a dampened Gauss–Newton algorithm. Since the mean of a Gaussian equals its mode, $m_t$ targets the mode of the exact posterior as $t \to \infty$. Therefore, $\hat\pi_{m_\infty}(F \mid S^Ty)$ provides a Laplace approximation $\hat\Pi(F \mid S^Ty)$.
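A minimal sketch of this dampened Gauss–Newton iteration is given below. The map $G$, its Jacobian, the constant learning rate, and the toy inputs are placeholders for the problem-specific choices.

```python
import numpy as np

def laplace_mode(y, S, mu, Sigma, G, jac, sigma2, rho=0.5, iters=100, tol=1e-8):
    # Dampened Gauss-Newton search for the posterior mode of F (section S7).
    # G maps an n-vector to an n-vector; jac(m) returns its n x n Jacobian.
    P = S @ S.T                           # the projection S S^T
    Sigma_inv = np.linalg.inv(Sigma)
    m = mu.copy()                         # start at the prior mean
    for _ in range(iters):
        J = jac(m)
        Sigma_star = np.linalg.inv(J.T @ P @ J / sigma2 + Sigma_inv)
        target = Sigma_star @ (J.T @ P @ (y - G(m) + J @ m) / sigma2
                               + Sigma_inv @ mu)
        m_new = (1 - rho) * m + rho * target
        if np.linalg.norm(m_new - m) < tol:
            return m_new, Sigma_star
        m = m_new
    return m, Sigma_star                  # mode and Laplace covariance

# Illustrative usage with a toy nonlinearity G(F) = F + 0.1 F^3.
rng = np.random.default_rng(3)
n, p = 30, 2
S_mat, _ = np.linalg.qr(rng.normal(size=(n, n - p)))  # orthonormal columns
G = lambda F: F + 0.1 * F ** 3
jac = lambda m: np.diag(1.0 + 0.3 * m ** 2)
y = rng.normal(size=n)
m_hat, Sigma_star = laplace_mode(y, S_mat, np.zeros(n), np.eye(n), G, jac, 1.0)
print(m_hat[:3].round(3))
```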
Fig. S2: Median (dot) and interquartile ranges (x) of the absolute differences between the posterior inclusion probabilities (PIP) and their approximations from the simulation in § S8·1. Integrated rotated Gaussian approximation is in black, expectation propagation in blue, and variational Bayes in red.

Fig. S3: Median (dot) and interquartile ranges (x) of the computation times for the results in Fig. S2. Integrated rotated Gaussian approximation is in black, expectation propagation in blue, and variational Bayes in red.

S8. ADDITIONAL SIMULATION STUDIES

S8·1. Variable selection on a correlated subset

Consider the set-up from the main text with the same priors on the elements of $\beta$ as on the elements of $\alpha$, $n = 100$, $p = 2$, $\psi = 1$, $\lambda = p/(p + q)$, and $\sigma^2$ fixed. Generate the elements of $X$ and $Z$ independently from $N(0, 1)$, then reassign the second column of $X$ to a weighted sum of the first and second columns of $X$ to induce correlation, and lastly standardize the columns of $X$ and $Z$ to have zero mean and unit standard deviation. Generate $y$ according to (1) with $\alpha = (0, \ldots, 0)^T$ and a fixed $\beta$ whose first element equals one. Then, compute the posterior inclusion probabilities for $\beta$ using Algorithm 1 with vector approximate message passing in Step 2 as described in § S5·1, and using expectation propagation and variational Bayes as in the main text, treating $\sigma^2$ as known. Do this for $q = 1, 2, \ldots, 15$ with exact computation of the posterior inclusion probabilities as reference. For large $q$, exact computation takes too long. Therefore, use a Gibbs sampler with 10,000 burn-in and 90,000 recorded iterations to compute reference posterior inclusion probabilities for $q = 15, 30, \ldots, 960$. Repeat the above 20 times for each $q$.

Figs. S2 through S5 contain the results and computation times. Integrated rotated Gaussian approximation has the lowest approximation error, although the difference with expectation propagation is less pronounced in Fig. S4 as approximation error from the method and Monte Carlo error from the Gibbs sampler are mixed. Comparing $q = 15$ in Fig. S2 and Fig. S4 shows that the Monte Carlo error is of noticeable size compared to the approximation error of our method and expectation propagation. Integrated rotated Gaussian approximation deals with the fact that the columns of $X$ are correlated since $\beta$ is treated separately in Algorithm 1. Expectation propagation and variational Bayes do not make such a distinction between the elements of $\alpha$ and $\beta$. Variational Bayes consistently has the highest approximation error and the shortest computation time.

Fig. S4: Median (dot) and interquartile ranges (x) of the absolute differences between the posterior inclusion probability (PIP) approximation and their Gibbs sampler estimate from the simulation in § S8·1. Integrated rotated Gaussian approximation is in black, expectation propagation in blue, and variational Bayes in red.

Fig. S5: Median (dot) and interquartile ranges (x) of the computation times for the results in Fig. S4. Integrated rotated Gaussian approximation is in black, expectation propagation in blue, and variational Bayes in red.
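For reference, a sketch of a data-generating mechanism in the spirit of this simulation is given below; the mixing weight $w$ used to correlate the columns of $X$, the value of $\beta$, and the noise level are illustrative assumptions.

```python
import numpy as np

def generate_s8_1_data(n=100, p=2, q=30, w=0.5, sigma=0.5, seed=0):
    # Data generation in the spirit of S8.1; w, beta, and sigma are
    # illustrative assumptions, not the values used in the study.
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, p))
    Z = rng.normal(size=(n, q))
    X[:, 1] = w * X[:, 0] + (1 - w) * X[:, 1]  # correlate the columns of X
    for A in (X, Z):                           # standardize the columns
        A -= A.mean(axis=0)
        A /= A.std(axis=0)
    beta = np.ones(p)                          # hypothetical beta
    y = X @ beta + sigma * rng.normal(size=n)  # alpha = 0 in (1)
    return y, X, Z
```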
Fig. S6: Median (dot) and interquartile ranges (x) of the absolute differences between the posterior inclusion probability (PIP) approximation and their Gibbs sampler estimate from the simulation in § S8·2. Integrated rotated Gaussian approximation is in black, expectation propagation in blue, and variational Bayes in red.

Fig. S7: Median (dot) and interquartile ranges (x) of the computation times for the results in Fig. S6. Integrated rotated Gaussian approximation is in black, expectation propagation in blue, and variational Bayes in red.

S8·2. Variable selection with a random design matrix

Consider the set-up from the main text with $n = 100$, $\lambda = 40/r$, $\psi = 1$, and $\sigma^2$ fixed. Generate $\theta$ by randomly selecting 40 elements of $\theta$ to be nonzero and drawing them from $N(0, 1)$. The elements of $A$ are drawn independently from $N(0, 1)$, after which the columns of $A$ are standardized to have zero mean and unit standard deviation. Sample $y$ according to $y \sim N(A\theta, \sigma^2I_n)$. We repeat the random generation of $\theta$, $A$, and $y$ 20 times for each $r = 60, 120, 240, 480, 960$. Estimate the posterior inclusion probabilities using the same methods as in the main text, treating $\sigma^2$ as known and without considering the debiased lasso. Algorithm 1 is used with $p = \lfloor\log(r)\rfloor$ as in the main text. Figs. S6 and S7 contain the results: integrated rotated Gaussian approximation attains the lowest approximation error for the smaller values of $r$, when Monte Carlo error is also lower because $r$ is smaller, albeit at a higher computational cost.

S8·3. Variable selection with gene expressions

In § S8·2, $A$ was simulated. Let us instead set $A$ equal to 3,571 expression levels from the leukemia data from Golub (1999) available in the supplementary data of Friedman et al. (2010). Then, $n = 72$ and $r = 3{,}571$. More importantly, the predictors are now highly dependent in a complex, non-linear, and non-Gaussian way: for instance, some columns of $A$ are strongly correlated. The rest of the simulation, which we repeat 10 times, is the same as in § S8·2.
The results are in Table S1. Expectation propagation and our method achieve similar performance, with similar error sizes. On the other hand, expectation propagation is over twice as fast. Variational Bayes takes an order of magnitude less computation time than expectation propagation but yields worse approximations.

Table S1: Summary statistics of the absolute difference between the Gibbs sampler estimates and the approximations of the posterior log odds of inclusion, and median computation time in seconds, for the simulation study in § S8·3.
REFERENCES

BISHOP, C. M. (2006). Pattern Recognition and Machine Learning. New York: Springer.
BONTEMPS, D. (2011). Bernstein–von Mises theorems for Gaussian regression with increasing number of regressors. Ann. Stat., 2557–2584.
COVER, T. M. & THOMAS, J. A. (2006). Elements of Information Theory. Wiley Series in Telecommunications and Signal Processing. Wiley-Interscience, 2nd ed.
DOWSON, D. & LANDAU, B. (1982). The Fréchet distance between multivariate normal distributions. J. Multivar. Anal., 450–455.
EFRON, B., HASTIE, T., JOHNSTONE, I. & TIBSHIRANI, R. (2004). Least angle regression. Ann. Stat., 407–499.
FERNÁNDEZ, C., LEY, E. & STEEL, M. F. (2001). Benchmark priors for Bayesian model averaging. J. Econom., 381–427.
FRIEDMAN, J., HASTIE, T. & TIBSHIRANI, R. (2010). Regularization paths for generalized linear models via coordinate descent. J. Stat. Softw.
GOLUB, T. R. (1999). Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science, 531–537.
JAVANMARD, A. & MONTANARI, A. (2013). Confidence intervals and hypothesis testing for high-dimensional statistical models. In Advances in Neural Information Processing Systems 26. pp. 1187–1195.
KULLBACK, S. & LEIBLER, R. A. (1951). On information and sufficiency. Ann. Math. Stat., 79–86.
LIANG, F., PAULO, R., MOLINA, G., CLYDE, M. A. & BERGER, J. O. (2008). Mixtures of g priors for Bayesian variable selection. J. Am. Statist. Assoc., 410–423.
RANGAN, S., SCHNITER, P. & FLETCHER, A. (2014). On the convergence of approximate message passing with arbitrary matrices. In IEEE International Symposium on Information Theory. pp. 236–240.
RANGAN, S., SCHNITER, P. & FLETCHER, A. K. (2016). Vector approximate message passing. arXiv:1610.03082v2.
REEVES, G. (2017). Conditional central limit theorems for Gaussian projections. In IEEE International Symposium on Information Theory. pp. 3045–3049.
RENCHER, A. C. & SCHAALJE, G. B. (2008). Linear Models in Statistics. Hoboken: Wiley-Interscience, 2nd ed.
STEINBERG, D. M. & BONILLA, E. V. (2014). Extended and unscented Gaussian processes. In Advances in Neural Information Processing Systems 27. pp. 1251–1259.
VILA, J. & SCHNITER, P. (2011). Expectation-maximization Bernoulli–Gaussian approximate message passing. In Asilomar Conference on Signals, Systems and Computers. pp. 799–803.