A Fast Linear Regression via SVD and Marginalization
Philip Greengard∗, Andrew Gelman†, Aki Vehtari‡

November 9, 2020
Abstract
We describe a numerical scheme for evaluating the posterior moments of Bayesian linear regression models with partial pooling of the coefficients. The principal analytical tool of the evaluation is a change of basis from coefficient space to the space of singular vectors of the matrix of predictors. After this change of basis and an analytical integration, we reduce the problem of finding moments of a density over $k + m$ dimensions to finding moments of an $m$-dimensional density, where $k$ is the number of coefficients and $k + m$ is the dimension of the posterior. Moments can then be computed using, for example, MCMC, the trapezoid rule, or adaptive Gaussian quadrature. An evaluation of the SVD of the matrix of predictors is the dominant computational cost and is performed once during the precomputation stage. We demonstrate numerical results of the algorithm. The scheme described in this paper generalizes naturally to multilevel and multi-group hierarchical regression models where normal-normal parameters appear.

∗ Columbia University
† Department of Statistics and Department of Political Science, Columbia University
‡ Department of Computer Science, Aalto University

1 Introduction

Linear regression is a ubiquitous tool for statistical modeling in a range of applications including social sciences, epidemiology, biochemistry, and environmental sciences ([Gelman et al., 2013, Gelman and Hill, 2007, Greenland, 2000, Merlo et al., 2005, Bardini et al., 2017]).

A common bottleneck in applied statistical modeling workflows is the computational cost of model evaluation. Since posterior distributions in statistical models are often high dimensional and computationally intractable, various techniques have been used to approximate posterior moments. Standard approaches often involve a variety of techniques including Markov chain Monte Carlo (MCMC) or using a suitable approximation of the posterior.

In this paper, we describe an approach for reducing the computational costs for a particular class of regression models: those that contain parameters $\theta \in \mathbb{R}^k$ such that $\theta$ has a normal prior and normal likelihood. These models represent only a subset of regression models that appear in applications. We focus our attention in this paper on normal-normal models because they have well known analytical properties and are more computationally tractable than the vast majority of multilevel models. A broader class of models, including logistic regression, contains distributions that are less amenable to the techniques of this paper and will require other analytical and computational tools. Mathematically, marginalization of normal-normal parameters is well known and has been applied to the posterior by, for example, [Lindley and Smith, 1972]. Our contribution is to provide a stable, accurate, and fast algorithm for marginalization.

The primary numerical tool used in the algorithm is the singular value decomposition (SVD) of the data matrix. As a mathematical and statistical tool, the SVD has been known since at least 1936 (see [Eckart and Young, 1936]). Use of the SVD as a practical and efficient numerical algorithm only started gaining popularity much later, with the first widely used scheme introduced in [Golub and Kahan, 1965].
Due in large part to advances in computing power, use of the SVD as a tool in applied mathematics, statistics, and data science has gained significant popularity in recent years; however, efficient evaluation of SVDs and related matrix decompositions is still an active area of research (see [Hastie et al., 2015], [Halko et al., 2011], [Shamir et al., 2016]).

Schemes similar to ours are used in the software packages lme4 ([Bates et al., 2015]) and INLA ([Rue et al., 2017]). There are several differences between the problems they address and their computational techniques, and those that we discuss here. While lme4 finds maximum likelihood and restricted maximum likelihood estimates, our goal is to find posterior moments. The software package INLA uses a Laplace approximation of the posterior for a general choice of likelihood functions, whereas our algorithm is focused on fast and accurate solutions for only a particular class of densities: those with normal-normal parameters.

The approach presented in this paper analytically marginalizes the normal-normal parameters of a model using a change of variables. After marginalization, posterior moments can be computed using standard techniques on the lower-dimensional density. In particular, for a model that contains $k + m$ total variables, $k$ of which are normal-normal, our scheme converts the problem of evaluating expectations of a density in $k + m$ dimensions to finding expectations of an $m$-dimensional density. After marginalization, we evaluate the $m$-dimensional posterior density in $O(k)$ operations. Without the change of variables, standard evaluation of marginal densities that relies on determinant evaluation requires at least $O(k^2)$ operations.

We illustrate our scheme on the problem of evaluating the marginal expectations of the unnormalized density

$$q(\sigma_1, \sigma_2, \beta) = \sigma_1^{-(k+1)} \sigma_2^{-n} \exp\left( -\gamma \log^2(\sigma_1) - \frac{\sigma_2^2}{2} - \frac{\|X\beta - y\|^2}{2\sigma_2^2} - \frac{\|\beta\|^2}{2\sigma_1^2} \right), \quad (1)$$

where $\gamma > 0$, $\sigma_1, \sigma_2 > 0$, and $\beta \in \mathbb{R}^k$. We assume that $X$ is a fixed $n \times k$ matrix, $y \in \mathbb{R}^n$ is fixed, and the normalizing constant of (1) is unknown. For fixed $n, k \in \mathbb{N}$, the algorithm is nearly identical when $X$ is an $n \times k$ matrix to when $X$ is a $k \times n$ matrix. In the case where $k \gg n$, see [Kwon et al., 2011] for a similar approach. Using the standard notation of Bayesian models, density $q$ is the unnormalized posterior of the model

$$\sigma_1 \sim \mathrm{lognormal}(0,\, 1/\sqrt{2\gamma}), \quad \sigma_2 \sim \mathrm{normal}_+(0,\, 1), \quad \beta \sim \mathrm{normal}(0,\, \sigma_1), \quad y \sim \mathrm{normal}(X\beta,\, \sigma_2). \quad (2)$$

In Appendix A, we include Stan code that can be used to sample from density (1). Statistical model (2) is a standard model of Bayesian statistics and appears when seeking to model an outcome, $y$, as a linear combination of related predictors, the columns of $X$. In [Gelman and Hill, 2007], these models are described in detail and are used in the estimation of the distribution of radon levels in houses in Minnesota.

Density (1) is also closely related to posterior densities that appear in genome-wide association studies (GWAS; see [Zhu and Stephens, 2017], [Meuwissen et al., 2001], [Azevedo et al., 2015]), which can be used to identify genomic regions containing genes linked with a specific trait, such as height. Using the notation of (1), each row of the matrix $X$ corresponds to a person, each column of $X$ represents a genomic location, entries of $X$ indicate genotypes, and $y$ corresponds to the trait. Due to technical advances in genome sequencing over the last ten years, it is now feasible to collect large amounts of sequencing data. GWAS models can contain data on up to millions of people and often between hundreds and thousands of genome locations (see [Linner et al., 2019]). As a result, efficient computational tools are required for model evaluation.

The number of operations required by the scheme of this paper scales like $O(nk \cdot \min(n, k))$ with a small constant. The key analytical tool is a change of variables of $\beta$ such that the terms

$$-\frac{\|X\beta - y\|^2}{2\sigma_2^2} - \frac{\|\beta\|^2}{2\sigma_1^2} \quad (3)$$

in (1) are converted to a diagonal quadratic form in $\mathbb{R}^k$. After that change of variables, expectations over $q$ are analytically converted from integrals over $\mathbb{R}^{k+2}$ to integrals over $\mathbb{R}^2$. The remaining 2-dimensional integrals can be computed to high accuracy using classical numerical techniques including, for example, adaptive Gaussian quadrature or even the 2-dimensional trapezoid rule.

The schemes used to evaluate the expectations of (1) generalize naturally to the evaluation of expectations of multilevel and multigroup posterior distributions including, for example, the two-group posterior of the form

$$q(\sigma_0, \sigma_1, \sigma_2, \mu, \beta) = \exp\left( -\frac{\|X\beta - y\|^2}{2\sigma_0^2} - \frac{1}{2\sigma_1^2}\sum_{i=1}^{k_1} (\mu - \beta_i)^2 - \frac{1}{2\sigma_2^2}\sum_{i=k_1+1}^{k_1+k_2} \beta_i^2 \right), \quad (4)$$

where $X$ is an $n \times k$ matrix, $y \in \mathbb{R}^n$, and $k_1$ and $k_2$ are non-negative integers satisfying $k_1 + k_2 = k$. In general, we collect the non-normal-normal parameters of a model in a vector $t \in \mathbb{R}^m$. For models where $m$ is large, MCMC can be used to evaluate the $m$-dimensional expectations, with, for example, Stan [Carpenter et al., 2017]. The $m$-dimensional distribution has two qualities that make it preferable to its high-dimensional counterpart. First, it requires only $O(k)$ operations to evaluate the integrand, and second, the geometry of the $m$-dimensional marginal distribution allows for more efficient sampling.
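To make the setup concrete, the following R sketch (ours; the paper's implementation is in Fortran) generates data of the kind used in the experiments of Section 6 and evaluates the log of the unnormalized density (1) directly. The value $\gamma = 8$ follows Section 6; the sizes n and k here are arbitrary.

# Sketch: simulate data and evaluate log q of (1) directly.
set.seed(1)
n <- 100; k <- 10; gamma <- 8
X     <- matrix(rnorm(n * k), n, k)    # iid standard normal entries
beta0 <- rnorm(k)
y     <- drop(X %*% beta0) + rnorm(n)

log_q <- function(sigma1, sigma2, beta) {
  -(k + 1) * log(sigma1) - n * log(sigma2) -
    gamma * log(sigma1)^2 - sigma2^2 / 2 -
    sum((drop(X %*% beta) - y)^2) / (2 * sigma2^2) -
    sum(beta^2) / (2 * sigma1^2)
}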
The structure of this paper is as follows. In the following section, we describe the analytic integration that transforms (1) from a $(k+2)$-dimensional problem to a 2-dimensional problem. Section 3 includes formulas that allow for the evaluation of posterior moments using the 2-dimensional density. In Sections 4 and 5 we provide formulas for evaluating covariances of (1). In Section 6, we discuss the numerical results of the implementation of the algorithm. Conclusions and generalizations of the algorithm of this paper are presented in Section 7. Appendix A provides Stan code that can be used to sample from (1), and Appendix B includes proofs of the formulas of this paper.

2 Analytic Integration of β

In this section, we describe how we analytically marginalize the normal-normal parameter $\beta$ of density (1). We include proofs of all formulas in Appendix B.

We start by marginalizing $\beta$ using a change of variables that converts the quadratic forms in (1) into diagonal quadratic forms. The resulting integral in the new variable, $z$, is Gaussian, and the coefficients of $z_i$ and $z_i^2$ are available analytically. The change of variables is given by the right orthogonal matrix of the singular value decomposition (SVD) of $X$. That is, we set

$$z = V^t \beta \quad (5)$$

where the SVD of $X$ is

$$X = U D V^t. \quad (6)$$

We define $\lambda_i$ to be the $i$th element of the diagonal of $D$. The elements of the diagonal need not be sorted. After this change of variables, we obtain the following identity for the last two terms of (1). A proof can be found in Lemma B.5 in Appendix B.

Formula 2.1.

$$-\frac{\|X\beta - y\|^2}{2\sigma_2^2} - \frac{\|\beta\|^2}{2\sigma_1^2} = a_0 + \sum_{i=1}^{k} \left( -a_{2,i} \left( z_i - \frac{a_{1,i}}{2a_{2,i}} \right)^2 + \frac{a_{1,i}^2}{4a_{2,i}} \right) \quad (7)$$

where

$$a_{2,i} = \frac{\lambda_i^2}{2\sigma_2^2} + \frac{1}{2\sigma_1^2}, \quad (8)$$

$$a_{1,i} = \frac{w_i}{\sigma_2^2}, \quad (9)$$

and

$$a_0 = -\frac{y^t y}{2\sigma_2^2}, \quad (10)$$

where

$$w = V^t X^t y. \quad (11)$$

After performing the change of variables $z = V^t\beta$ and using (7), we now have an expression for density (1) in a form that allows us to use the well-known properties of a Gaussian with diagonal covariance. The following identity uses these properties and provides a formula for analytically reducing expectations of (1) from integrals over $k + 2$ dimensions to integrals over 2 dimensions. After the formula is applied, we have a new density, $\tilde{q}$, over only 2 dimensions. See Theorem B.6 in Appendix B for a proof.

Formula 2.2. For all $\sigma_1, \sigma_2 > 0$ we have

$$\int_{\mathbb{R}^k} q(\sigma_1, \sigma_2, \beta)\, d\beta = \tilde{q}(\sigma_1, \sigma_2) \quad (12)$$

where $\tilde{q}(\sigma_1, \sigma_2)$ is defined by the formula

$$\tilde{q}(\sigma_1, \sigma_2) = \sigma_1^{-(k+1)} \sigma_2^{-n} \exp\left( -\gamma \log^2(\sigma_1) - \frac{\sigma_2^2}{2} + a_0 + \sum_{i=1}^{k} \frac{a_{1,i}^2}{4a_{2,i}} \right) \pi^{k/2} \prod_{i=1}^{k} a_{2,i}^{-1/2} \quad (13)$$

where $a_{2,i}$ is defined in (8), $a_{1,i}$ is defined in (9), $a_0$ is defined in (10), and $\gamma$ is a constant.
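Continuing the sketch above, the following R code (our illustration of Formula 2.2, not the paper's Fortran) evaluates $\log \tilde{q}$ of (13) from the SVD quantities. Up to an additive constant, and up to prior terms that enter through the model block, it agrees with the Stan function q_tilde_lpdf of Appendix A.

s   <- svd(X)                        # X = U D V^t, equation (6)
lam <- s$d                           # singular values (diagonal of D)
V   <- s$v
w   <- drop(t(V) %*% t(X) %*% y)     # w = V^t X^t y, equation (11)
yty <- sum(y^2)

log_q_tilde <- function(sigma1, sigma2) {
  a2 <- lam^2 / (2 * sigma2^2) + 1 / (2 * sigma1^2)   # equation (8)
  a1 <- w / sigma2^2                                  # equation (9)
  a0 <- -yty / (2 * sigma2^2)                         # equation (10)
  -(k + 1) * log(sigma1) - n * log(sigma2) -
    gamma * log(sigma1)^2 - sigma2^2 / 2 +
    a0 + sum(a1^2 / (4 * a2)) +
    (k / 2) * log(pi) - sum(log(a2)) / 2   # log of pi^(k/2) * prod(a2^(-1/2))
}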
Remark 2.1. Certain Bayesian models might contain correlated priors on $\beta$ that will result in posteriors such as (28) of Section 7. For such models, we perform a change of variables that uses the fact that two quadratic forms over $\beta$ can be simultaneously diagonalized.

We include in Figure 1 a plot of the density of $q$ as a function of $\sigma_1$ and $\beta_1$ for fixed $\sigma_2$ and randomly chosen $X$ and $y$. Figure 2 shows a plot of $q$ as a function of $\sigma_2$ and $\beta_1$ for fixed $\sigma_1$. Figure 3 provides an illustration of $\tilde{q}$, obtained after the change of variables and marginalization described in this section.

3 Evaluation of Posterior Moments

Now that we have reduced the $(k+2)$-dimensional density $q$ to the 2-dimensional density $\tilde{q}$, it remains to recover the posterior moments of $q$ using $\tilde{q}$. We first observe that moments of $\sigma_1$ and $\sigma_2$ with respect to $q$ are equivalent to moments of $\sigma_1$ and $\sigma_2$ over $\tilde{q}$. That is,

$$E_q(\sigma_1) = E_{\tilde{q}}(\sigma_1) \quad (14)$$

and

$$E_q(\sigma_2) = E_{\tilde{q}}(\sigma_2). \quad (15)$$

As for moments of $\beta$, we use (13) and standard properties of Gaussians to obtain the following formula.
Formula 3.1. For all $\sigma_1, \sigma_2 > 0$,

$$\int_{\mathbb{R}^k} z_i\, q(\sigma_1, \sigma_2, \beta)\, d\beta = \frac{a_{1,i}}{2a_{2,i}}\, \tilde{q}(\sigma_1, \sigma_2) \quad (16)$$

where $q$ is defined in (1), $\tilde{q}$ is defined in (13), $a_{2,i}$ is defined in (8), and $a_{1,i}$ is defined in (9).

As an immediate consequence of (16), we are able to evaluate the posterior expectation of $z$ as an expectation of a 2-dimensional density:

$$E_q(z_i) = E_{\tilde{q}}\left( \frac{a_{1,i}}{2a_{2,i}} \right). \quad (17)$$
Figure 1: Density of $q$ (see (1)) on a log scale with respect to $\sigma_1$ and $\beta_1$ ($\sigma_2 = 1.0$ fixed), where $\gamma = 8$, $n = 100$, $k = 10$, and data were randomly generated.
Figure 2: Density of $q$ (see (1)) on a log scale with respect to $\sigma_2$ and $\beta_1$ ($\sigma_1 = 1.0$ fixed), for the same parameters as Figure 1.
Figure 3: Density of $\tilde{q}$ (see (13)) on a log scale, using the same $q$ as Figure 1, where $n = 100$, $k = 10$, and data were randomly generated.

We then transform those expectations back to expectations over the desired basis, $\beta$, using the matrix $V$ computed in (6). Specifically, using linearity of expectation and (17), we know

$$E_q((\beta_1, \ldots, \beta_k)^t) = E_q(V V^t (\beta_1, \ldots, \beta_k)^t) = V E_q(V^t (\beta_1, \ldots, \beta_k)^t) = V E_q((z_1, \ldots, z_k)^t) = V E_{\tilde{q}}\left( \left( \frac{a_{1,1}}{2a_{2,1}}, \ldots, \frac{a_{1,k}}{2a_{2,k}} \right)^t \right). \quad (18)$$
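The following R sketch, continuing the ones above, recovers the posterior means via (14), (15), (17), and (18) using a tensor-product trapezoid rule; 200 nodes per direction follows Remark 6.1, while the integration bounds are our assumption, chosen so that $\tilde{q}$ is negligible at the boundary.

m <- 200
sig1 <- seq(0.05, 5, length.out = m)   # grid for sigma1 (bounds assumed)
sig2 <- seq(0.05, 5, length.out = m)   # grid for sigma2 (bounds assumed)

lq <- outer(sig1, sig2, Vectorize(log_q_tilde))
qt <- exp(lq - max(lq))                # scaled q-tilde; avoids underflow

tw <- function(x) { h <- diff(x); c(h, 0) / 2 + c(0, h) / 2 }  # trapezoid weights
W  <- outer(tw(sig1), tw(sig2))        # tensor-product quadrature weights

Z        <- sum(W * qt)                             # unnormalized mass of q-tilde
E_sigma1 <- sum(W * qt * sig1) / Z                  # (14); sig1 varies down rows
E_sigma2 <- sum(W * qt * rep(sig2, each = m)) / Z   # (15); sig2 varies across columns

# E(z_i) = E_qtilde(a_{1,i} / (2 a_{2,i})), equation (17)
Ez <- sapply(seq_len(k), function(i) {
  ratio <- outer(sig1, sig2, Vectorize(function(s1, s2)
    (w[i] / s2^2) / (2 * (lam[i]^2 / (2 * s2^2) + 1 / (2 * s1^2)))))
  sum(W * qt * ratio) / Z
})
E_beta <- drop(V %*% Ez)               # equation (18)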
4 Higher Moments of β

In addition to facilitating the rapid evaluation of posterior means, the change of variables described in Section 2 is also useful for the evaluation of higher moments. Equation (7) shows that after the change of variables from $\beta$ to $z$, the resulting density is a Gaussian in $z$ with a diagonal covariance matrix. Additionally, for each $z_i$, using equation (7) and standard properties of Gaussians, we have the following identity.

Formula 4.1. For all $\sigma_1, \sigma_2 > 0$, we have

$$\int_{\mathbb{R}^k} (z_i - \mu_{z_i})^2\, q(\sigma_1, \sigma_2, \beta)\, d\beta = (2a_{2,i})^{-1}\, \tilde{q}(\sigma_1, \sigma_2) \quad (19)$$

where $\mu_{z_i}$ is the expectation of $z_i$, $\tilde{q}$ is defined in (13), and $a_{2,i}$ is defined in (8).

The second moments of the posterior of $\beta$ are obtained as a linear transformation of the posterior variances of $z$. In particular, denoting the expectation of $\beta$ by $\mu_\beta$ and the expectation of $z$ by $\mu_z$, we have

$$E((\beta - \mu_\beta)(\beta - \mu_\beta)^t) = V V^t E((\beta - \mu_\beta)(\beta - \mu_\beta)^t) V V^t = V E(V^t (\beta - \mu_\beta)(\beta - \mu_\beta)^t V) V^t = V E((z - \mu_z)(z - \mu_z)^t) V^t. \quad (20)$$

We observe that due to the independence of all $z_i$,

$$E((z - \mu_z)(z - \mu_z)^t) \quad (21)$$

is diagonal, and we can therefore evaluate the $k \times k$ posterior covariance matrix of $\beta$ by evaluating $\mathrm{var}(z_i)$ for $i = 1, \ldots, k$ and then applying two orthogonal matrices. Specifically, combining Formula 4.1 and (20), we obtain

$$\mathrm{cov}(\beta) = V\, E_{\tilde{q}}\left( \mathrm{diag}\left( (2a_{2,1})^{-1}, \ldots, (2a_{2,k})^{-1} \right) \right) V^t. \quad (22)$$

5 Higher Moments of σ₁ and σ₂

Higher moments of $\sigma_1$ and $\sigma_2$ with respect to $q$ can be evaluated directly as higher moments of $\sigma_1$ and $\sigma_2$ with respect to $\tilde{q}$. That is, for all $j \in \{1, 2, \ldots\}$, we have

$$E_q((\sigma_1 - \mu_{\sigma_1})^j) = E_{\tilde{q}}((\sigma_1 - \mu_{\sigma_1})^j) \quad (23)$$

and

$$E_q((\sigma_2 - \mu_{\sigma_2})^j) = E_{\tilde{q}}((\sigma_2 - \mu_{\sigma_2})^j). \quad (24)$$

In particular, for $j = 2$, we obtain

$$\mathrm{var}_q(\sigma_1) = \mathrm{var}_{\tilde{q}}(\sigma_1) \quad (25)$$

and

$$\mathrm{var}_q(\sigma_2) = \mathrm{var}_{\tilde{q}}(\sigma_2). \quad (26)$$
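Continuing the same R sketch, Formula 4.1 with (22), and equations (25)-(26), give the posterior covariance of $\beta$ and the variances of $\sigma_1$ and $\sigma_2$:

# E_qtilde((2 a_{2,i})^{-1}) for each i, per Formula 4.1
Evar_z <- sapply(seq_len(k), function(i) {
  g <- outer(sig1, sig2, Vectorize(function(s1, s2)
    1 / (2 * (lam[i]^2 / (2 * s2^2) + 1 / (2 * s1^2)))))
  sum(W * qt * g) / Z
})
cov_beta <- V %*% diag(Evar_z) %*% t(V)            # equation (22)

S1 <- matrix(sig1, m, m)                           # sigma1 over the grid
S2 <- matrix(sig2, m, m, byrow = TRUE)             # sigma2 over the grid
var_sigma1 <- sum(W * qt * (S1 - E_sigma1)^2) / Z  # equation (25)
var_sigma2 <- sum(W * qt * (S2 - E_sigma2)^2) / Z  # equation (26)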
Algorithm 1: Evaluation of posterior expectations of normal-normal models

1. Compute the SVD of the matrix $X$ (see (6)).
2. Compute $w$ (see (11)) and $y^t y$ (see (10)).
3. Construct an evaluator for the density $\tilde{q}$ of (13).
4. Evaluate first and second moments with respect to $\tilde{q}$: $E_{\tilde{q}}(\sigma_1)$, $E_{\tilde{q}}(\sigma_2)$, $E_{\tilde{q}}(a_{1,i}/(2a_{2,i}))$.
5. Compute $E(\beta)$ via formula (18).
6 Numerical Experiments

Algorithm 1 was implemented in Fortran. We used the GFortran compiler on a 2.6 GHz 6-Core Intel Core i7 MacBook Pro. All examples were run in double precision arithmetic.

The matrix $X$ and vector $y$ were randomly generated as follows. Each entry of $X$ was generated as an independent Gaussian with mean 0 and variance 1. The vector $y$ was created by first randomly generating a vector $\beta \in \mathbb{R}^k$, each entry of which is an independent Gaussian with mean 0 and variance 1. The vector $y$ was then set to the value of $X\beta + \epsilon$, where $\epsilon \in \mathbb{R}^n$ contains standard normal iid entries. We generated $y$ this way in order to ensure that the $E(\beta_i)$ were not all small in magnitude. We set $\gamma$ of (1) to 8.

In Table 2 and Figure 5, we compare the performance of Algorithm 1 to two alternative schemes for computing posterior expectations: one in which we analytically marginalize via equation (12) and then integrate the 2-dimensional density via MCMC using Stan, and another in which we use Stan's MCMC sampling on the full $(k+2)$-dimensional posterior. When using MCMC with Stan, we took 10,000 posterior draws. In Table 2 and Figure 5 we denote Algorithm 1 by "SVD-Trap". The algorithm that uses Stan on the marginal 2-dimensional density is labeled "SVD-MCMC", and "MCMC" corresponds to the algorithm that uses only MCMC sampling in Stan.

In the appendix, we include Stan code to sample from the marginal density $\tilde{q}$ of (13).
Remark 6.1. In the numerical integration stage of Algorithm 1, we use the trapezoid rule with 200 nodes in each direction. Because the integrand is smooth and vanishes near the boundary, convergence of the integral is super-algebraic when using the trapezoid rule (see [Stoer and Bulirsch, 1992]). A rectangular grid with 200 points in each direction is satisfactory for obtaining approximately double precision accuracy. In problems with large numbers of non-normal-normal parameters, MCMC algorithms such as Hamiltonian Monte Carlo or other methods can be used.

In Tables 1 and 2, $n$ and $k$ denote the size of the $n \times k$ random matrix $X$. The column labeled "max error" provides the maximum absolute error of the expectations of $\sigma_1$, $\sigma_2$, and $\beta_i$ for $i \in \{1, \ldots, k\}$. The true solution was evaluated using the trapezoid rule with 500 nodes in each direction in extended precision.

In Table 1, "precompute time (s)" denotes the time in seconds of all computations until numerical integration. These times are dominated by the cost of the SVD (36). The total time of the numerical integration in addition to the matrix-vector product (18) is given in "integrate time (s)". The final column of Table 1, "total time (s)", provides the total time of precomputation and integration.

Table 1: Scaling of computation times for evaluation of expectations of $q$ (see (1)) using Algorithm 1.

  n       k     max error   precompute time (s)   integrate time (s)   total time (s)
  50      5     –           0.01                  0.01                 0.02
  100     10    –           0.02                  0.01                 0.03
  500     20    –           0.04                  0.01                 0.05
  1000    50    –           0.09                  0.03                 0.12
  5000    100   –           0.29                  0.05                 0.34
  10000   500   –           –                     –                    –

Figure 4: Scaling of computation times for evaluation of posterior expectations of $q$ (see (1)) using Algorithm 1 as a function of $k$, with $n = 10,000$.

Table 2: Scaling of computation times for evaluation of expectations of $q$ (see (1)) using three different algorithms: i) SVD-Trap: Algorithm 1 of this paper; ii) SVD-MCMC: marginalization with MCMC integration of $\tilde{q}$ using Stan; and iii) MCMC: full MCMC integration of $q$ using Stan.

                SVD-Trap            SVD-MCMC            MCMC
  n      k      time (s)   error    time (s)   error    time (s)   error
  100    100    0.16       –        11         –        16         –
  200    100    0.16       –        11         –        24         –
  500    100    0.23       –        12         –        40         –
  –      100    0.25       –        12         –        88         –
  –      100    0.30       –        14         –        617        –
  –      100    0.65       –        13         –        –          –

Figure 5: Scaling of computation times for evaluation of posterior expectations of $q$ (see (1)) as a function of sample size $n$, with $k = 100$. The three algorithms compared are i) SVD-Trap: Algorithm 1 of this paper; ii) SVD-MCMC: marginalization with MCMC integration of $\tilde{q}$ using Stan; and iii) MCMC: full MCMC integration of $q$ using Stan.
7 Conclusions and Generalizations

In this paper, we present a numerical scheme for the evaluation of the expectations of a particular class of distributions that appear in Bayesian statistics: posterior distributions of linear regression problems with normal-normal parameters. The scheme presented generalizes naturally to several classes of distributions that appear frequently in Bayesian statistics. We list several examples of posteriors whose expectations can be evaluated using this method.

1. The choice of priors for $\sigma_1$ and $\sigma_2$ in this document were lognormal and half-normal. This choice did not substantially impact the algorithm and can be generalized. Adaptive Gaussian quadrature can be used for the numerical integration step of the algorithm for a more general choice of prior on $\sigma_1$ and $\sigma_2$.

2. Multilevel regression problems with more than two levels.

3. Regression problems with multiple groups, such as the two-group model with posterior

$$\exp\left( -\frac{\|X\beta - y\|^2}{2\sigma_0^2} - \frac{1}{2\sigma_1^2}\sum_{i=1}^{k_1} (\mu - \beta_i)^2 - \frac{1}{2\sigma_2^2}\sum_{i=k_1+1}^{k_1+k_2} \beta_i^2 \right) \quad (27)$$

where $X$ is an $n \times k$ matrix, $y \in \mathbb{R}^n$, and $k_1$ and $k_2$ are non-negative integers satisfying $k_1 + k_2 = k$.

4. Regression problems with correlated priors on $\beta$:

$$\exp\left( -\frac{\|X_1\beta - y\|^2}{2\sigma_2^2} - \frac{\|X_2\beta\|^2}{2\sigma_1^2} \right). \quad (28)$$

For regression problems with large numbers of non-normal-normal parameters, marginal expectations can be computed using, for example, MCMC in Stan. For such problems, the algorithm of this paper would convert an MCMC evaluation from $k + m$ dimensions to $m$ dimensions, where $k$ is the number of normal-normal parameters.

Acknowledgements
The authors are grateful to Ben Bales and Mitzi Morris for useful discussions.
A Code
The following Stan code allows for sampling from the distribution corresponding to the probability density function proportional to (1). The parameter and model blocks follow model (2), with $\gamma = 8$ hard-coded as in Section 6.

data {
  int n;
  int k;
  vector[n] y;
  matrix[n, k] X;
}
parameters {
  real<lower=0> sigma1;
  real<lower=0> sigma2;
  vector[k] beta;
}
model {
  // model (2) with gamma = 8; the lognormal scale is 1 / sqrt(2 * gamma)
  sigma1 ~ lognormal(0, 0.25);
  sigma2 ~ normal(0, 1);   // half-normal via the <lower=0> constraint
  beta ~ normal(0, sigma1);
  y ~ normal(X * beta, sigma2);
}
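As a hypothetical usage example (cmdstanr and the file name are our assumptions, not part of the paper), the program above can be run from R as follows, reusing the simulated X and y from the earlier sketches:

library(cmdstanr)
mod <- cmdstan_model("normal_normal.stan")   # file containing the program above
fit <- mod$sample(data = list(n = n, k = k, y = as.numeric(y), X = X))
fit$summary(variables = c("sigma1", "sigma2"))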
The following Stan program samples from the marginal density $\tilde{q}$ (see (13)). The data input yty corresponds to $y^t y$ of (10), lam is the vector of singular values of $X$, and w is the vector $w$ in equation (11). We include R code for computing yty, lam, and w after the following Stan code.

functions {
  real q_tilde_lpdf(real sig1, real sig2, vector w, vector lam, real yty,
                    int k, int n) {
    // a2 here equals 2 * a_{2,i} of (8); constant factors are dropped
    vector[min(n, k)] a2 = square(lam) / square(sig2) + 1 / square(sig1);
    real sol = sum(square(w) ./ a2) / (2 * sig2^4) - sum(log(a2)) / 2
               - yty / (2 * square(sig2));
    sol += -min(n, k) * log(sig1) - n * log(sig2);
    return sol;
  }
}
data {
  int n;
  int k;
  vector[min(n, k)] w;
  vector[min(n, k)] lam;
  real yty;
  matrix[min(n, k), k] V;
}
parameters {
  real<lower=0> sig1;
  real<lower=0> sig2;
}
model {
  // priors on sig1 and sig2 from model (2), with gamma = 8
  sig1 ~ lognormal(0, 0.25);
  sig2 ~ normal(0, 1);
  target += q_tilde_lpdf(sig1 | sig2, w, lam, yty, k, n);
}
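A minimal sketch of the described precomputation (our code, assuming $n \ge k$ so that min(n, k) = k):

s   <- svd(X)                        # X = U D V^t, equation (36)
lam <- s$d                           # singular values of X
w   <- drop(t(s$v) %*% t(X) %*% y)   # w = V^t X^t y, equation (11)
yty <- sum(y^2)                      # y^t y of (10)
stan_data <- list(n = nrow(X), k = ncol(X), w = w, lam = lam,
                  yty = yty, V = s$v)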
B Proofs
In this appendix, we include proofs of the formulas provided in this paper. For increased readability, this appendix is self-contained.
B.1 Mathematical Preliminaries and Notation
In this section, we introduce notation and elementary mathematical identities that will be used throughout the remainder of this appendix. We define $C \in \mathbb{R}$ by the equation

$$C = \int_{\sigma_1 \in \mathbb{R}_+} \int_{\sigma_2 \in \mathbb{R}_+} \int_{\beta \in \mathbb{R}^k} q(\sigma_1, \sigma_2, \beta)\, d\beta\, d\sigma_2\, d\sigma_1, \quad (29)$$

and define $E(\sigma_1)$, $E(\sigma_2)$, and $E(\beta_i)$ by the formulas

$$E(\sigma_1) = \frac{1}{C} \int_{\sigma_1 \in \mathbb{R}_+} \int_{\sigma_2 \in \mathbb{R}_+} \int_{\beta \in \mathbb{R}^k} \sigma_1\, q(\sigma_1, \sigma_2, \beta)\, d\beta\, d\sigma_2\, d\sigma_1, \quad (30)$$

$$E(\sigma_2) = \frac{1}{C} \int_{\sigma_1 \in \mathbb{R}_+} \int_{\sigma_2 \in \mathbb{R}_+} \int_{\beta \in \mathbb{R}^k} \sigma_2\, q(\sigma_1, \sigma_2, \beta)\, d\beta\, d\sigma_2\, d\sigma_1, \quad (31)$$

and

$$E(\beta_i) = \frac{1}{C} \int_{\sigma_1 \in \mathbb{R}_+} \int_{\sigma_2 \in \mathbb{R}_+} \int_{\beta \in \mathbb{R}^k} \beta_i\, q(\sigma_1, \sigma_2, \beta)\, d\beta\, d\sigma_2\, d\sigma_1 \quad (32)$$

for $i \in \{1, 2, \ldots, k\}$. We provide algorithms for the evaluation of (29), (30), (31), and (32).

We will denote by $\mathbb{1}$ the vector of ones,

$$\mathbb{1} = (1, 1, \ldots, 1)^t. \quad (33)$$

We denote the $i$th component of a vector $v$ by $v_i$.

The following two well-known identities give the normalizing constant and expectation of a Gaussian distribution.
Lemma B.1. For all $\mu \in \mathbb{R}$ and $\sigma > 0$, we have

$$\sqrt{2\pi}\,\sigma = \int_{\mathbb{R}} e^{-\frac{(\beta - \mu)^2}{2\sigma^2}}\, d\beta. \quad (34)$$
Lemma B.2. For all $\mu \in \mathbb{R}$ and $\sigma > 0$, we have

$$\mu\sqrt{2\pi}\,\sigma = \int_{\mathbb{R}} \beta\, e^{-\frac{(\beta - \mu)^2}{2\sigma^2}}\, d\beta. \quad (35)$$

B.2 Analytic Integration of β

We denote the singular value decomposition of $X$ by

$$X = U D V^t \quad (36)$$

where $U$ is an orthogonal $n \times k$ matrix, $V$ is an orthogonal $k \times k$ matrix, and $D$ is a $k \times k$ diagonal matrix. We define $z \in \mathbb{R}^k$ by the formula

$$z = V^t \beta. \quad (37)$$

The following lemma, which will be used in the proof of Lemma B.5, gives an expression for the second-to-last term of the exponent in (1) after a change of variables.
Lemma B.3. For all $\beta \in \mathbb{R}^k$ and $y \in \mathbb{R}^n$,

$$-\frac{\|X\beta - y\|^2}{2\sigma_2^2} = -\frac{y^t y}{2\sigma_2^2} + \sum_{i=1}^{k} \left( -\frac{\lambda_i^2}{2\sigma_2^2} z_i^2 + \frac{w_i}{\sigma_2^2} z_i \right) \quad (38)$$

where

$$w = V^t X^t y, \quad (39)$$

$z$ is defined in (37), and $\lambda_i$ is the $i$th entry on the diagonal of $D$ (see (36)).

Proof. Clearly,

$$\|X\beta - y\|^2 = \beta^t X^t X \beta - 2 y^t X \beta + y^t y. \quad (40)$$

Substituting (36) and (37) into (40), we obtain

$$\|X\beta - y\|^2 = \beta^t (U D V^t)^t (U D V^t) \beta - 2 y^t X V V^t \beta + y^t y = (\beta^t V) D^2 (V^t \beta) - 2 (V^t X^t y)^t z + y^t y, \quad (41)$$

where $z$ is defined in (37). Substituting (39) and (37) into (41), we have

$$\|X\beta - y\|^2 = z^t D^2 z - 2 w^t z + y^t y. \quad (42)$$

Equation (38) follows immediately from (42). □

The following lemma provides an equation for the last term of the exponent in (1). The identity will be used in Lemma B.5.
Lemma B.4.
For all $\sigma_1 > 0$,

$$-\frac{\|\beta\|^2}{2\sigma_1^2} = \sum_{i=1}^{k} -\frac{z_i^2}{2\sigma_1^2} \quad (43)$$

where $\beta \in \mathbb{R}^k$, $z$ is defined in (37), and $V$ is defined in (36).
Proof. Clearly,

$$\frac{\|\beta\|^2}{2\sigma_1^2} = \frac{1}{2\sigma_1^2} (V z)^t (V z) = \frac{z^t z}{2\sigma_1^2} \quad (44)$$

where $V$ is defined in (36). Equation (43) follows immediately from (44). □

The following formula combines Lemma B.3 and Lemma B.4 to convert the final two terms of (1) into a Gaussian in $k$ dimensions.

Lemma B.5.

$$-\frac{\|X\beta - y\|^2}{2\sigma_2^2} - \frac{\|\beta\|^2}{2\sigma_1^2} = a_0 + \sum_{i=1}^{k} \left( -a_{2,i} \left( z_i - \frac{a_{1,i}}{2a_{2,i}} \right)^2 + \frac{a_{1,i}^2}{4a_{2,i}} \right) \quad (45)$$

where

$$a_{2,i} = \frac{\lambda_i^2}{2\sigma_2^2} + \frac{1}{2\sigma_1^2}, \quad (46)$$

$$a_{1,i} = \frac{w_i}{\sigma_2^2}, \quad (47)$$

and

$$a_0 = -\frac{y^t y}{2\sigma_2^2}, \quad (48)$$

where $z$ is defined in (37), $w$ is defined in (39), and $V$ is defined in (36).

Proof. By combining Lemma B.3 and Lemma B.4, we have

$$-\frac{\|X\beta - y\|^2}{2\sigma_2^2} - \frac{\|\beta\|^2}{2\sigma_1^2} = a_0 + \sum_{i=1}^{k} \left( -a_{2,i} z_i^2 + a_{1,i} z_i \right). \quad (49)$$

We obtain equation (45) by completing the square in equation (49), using $-a_{2,i} z_i^2 + a_{1,i} z_i = -a_{2,i} (z_i - \frac{a_{1,i}}{2a_{2,i}})^2 + \frac{a_{1,i}^2}{4a_{2,i}}$. □

The following theorem is the principal analytical apparatus of this note. It provides a formula for the $k$-dimensional integrals that appear in (29), (30), and (31).
Theorem B.6. For all $\sigma_1, \sigma_2 > 0$,

$$\int_{\mathbb{R}^k} q(\sigma_1, \sigma_2, \beta)\, d\beta = \tilde{q}(\sigma_1, \sigma_2) \quad (50)$$

where $\tilde{q}(\sigma_1, \sigma_2)$ is defined by the formula

$$\tilde{q}(\sigma_1, \sigma_2) = \sigma_1^{-(k+1)} \sigma_2^{-n} \exp\left( -\gamma \log^2(\sigma_1) - \frac{\sigma_2^2}{2} + a_0 + \sum_{i=1}^{k} \frac{a_{1,i}^2}{4a_{2,i}} \right) \pi^{k/2} \prod_{i=1}^{k} a_{2,i}^{-1/2} \quad (51)$$

where $a_{2,i}$ is defined in (46), $a_{1,i}$ is defined in (47), and $a_0$ is defined in (48).
Proof. Using (1), clearly

$$\int_{\mathbb{R}^k} q(\sigma_1, \sigma_2, \beta)\, d\beta = \sigma_1^{-(k+1)} \sigma_2^{-n} \int_{\mathbb{R}^k} \exp\left( -\gamma \log^2(\sigma_1) - \frac{\sigma_2^2}{2} - \frac{\|X\beta - y\|^2}{2\sigma_2^2} - \frac{\|\beta\|^2}{2\sigma_1^2} \right) d\beta. \quad (52)$$

Performing the change of variables (37), which has unit Jacobian since $V$ is orthogonal, and substituting (45) into (52), we have

$$\int_{\mathbb{R}^k} q(\sigma_1, \sigma_2, \beta)\, d\beta = \sigma_1^{-(k+1)} \sigma_2^{-n} \exp\left( -\gamma \log^2(\sigma_1) - \frac{\sigma_2^2}{2} + a_0 + \sum_{i=1}^{k} \frac{a_{1,i}^2}{4a_{2,i}} \right) \int_{\mathbb{R}^k} \exp\left( \sum_{i=1}^{k} -a_{2,i} \left( z_i - \frac{a_{1,i}}{2a_{2,i}} \right)^2 \right) dz. \quad (53)$$

Since the integrand on the right side of (53) is a Gaussian in each $z_i$, equation (50) follows from applying Lemma B.1 to (53). □

The following theorem provides a formula for the expectation of $z$ (see (37)). We use this formula, in combination with an orthogonal transformation, to obtain the expectation of $\beta$.
Theorem B.7. For all $\sigma_1, \sigma_2 > 0$,

$$\int_{\mathbb{R}^k} (V^t \beta)_i\, q(\sigma_1, \sigma_2, \beta)\, d\beta = \frac{a_{1,i}}{2a_{2,i}}\, \tilde{q}(\sigma_1, \sigma_2) \quad (54)$$

where $q$ is defined in (1), $\tilde{q}$ is defined in (51), $a_{1,i}$ is defined in (47), $a_{2,i}$ is defined in (46), and $a_0$ is defined in (48).

Proof. Combining (53) and (37), we have

$$\int_{\mathbb{R}^k} (V^t \beta)_i\, q(\sigma_1, \sigma_2, \beta)\, d\beta = \sigma_1^{-(k+1)} \sigma_2^{-n} \exp\left( -\gamma \log^2(\sigma_1) - \frac{\sigma_2^2}{2} + a_0 + \sum_{j=1}^{k} \frac{a_{1,j}^2}{4a_{2,j}} \right) \int_{\mathbb{R}^k} z_i \exp\left( \sum_{j=1}^{k} -a_{2,j} \left( z_j - \frac{a_{1,j}}{2a_{2,j}} \right)^2 \right) dz. \quad (55)$$

Applying Lemma B.2 to (55), we obtain (54). □

References

[Azevedo et al., 2015] Azevedo, Camila Ferreira, et al. "Ridge, Lasso and Bayesian additive-dominance genomic models." BMC Genetics. 16, 105 (2015).

[Bardini et al., 2017] Bardini, R., G. Politano, A. Benso, S. Di Carlo. "Multi-level and hybrid modelling approaches for systems biology." Computational and Structural Biotechnology Journal. 15 (2017).

[Bates et al., 2015] Bates, Douglas, Martin Machler, Ben Bolker, Steve Walker. "Fitting Linear Mixed-Effects Models Using lme4." Journal of Statistical Software. 67 (1). (2015).

[Carpenter et al., 2017] Carpenter, Bob, Andrew Gelman, Matthew D. Hoffman, Daniel Lee, Ben Goodrich, Michael Betancourt, Marcus Brubaker, Jiqiang Guo, Peter Li, Allen Riddell. "Stan: A Probabilistic Programming Language." Journal of Statistical Software. 76 (1). (2017).

[Eckart and Young, 1936] Eckart, Carl and Gale Young. "The Approximation of One Matrix by Another of Lower Rank." Psychometrika. 1 (3). (1936).

[Gelman et al., 2013] Gelman, Andrew, John B. Carlin, Hal S. Stern, David B. Dunson, Aki Vehtari, and Donald B. Rubin. Bayesian Data Analysis. Third edition. New York: CRC Press, 2013.

[Gelman and Hill, 2007] Gelman, Andrew and Jennifer Hill. Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge, UK: Cambridge University Press, 2007.

[Golub and Kahan, 1965] Golub, Gene and William Kahan. "Calculating the Singular Values and Pseudo-Inverse of a Matrix." J. SIAM Numer. Anal. Ser. B. 2 (1965).

[Greenland, 2000] Greenland, Sander. "Principles of Multilevel Modelling." International Journal of Epidemiology. 29 (2000).

[Halko et al., 2011] Halko, N., P. G. Martinsson, J. A. Tropp. "Finding Structure with Randomness: Probabilistic Algorithms for Constructing Approximate Matrix Decompositions." SIAM Rev. 53 (2). (2011).

[Hastie et al., 2015] Hastie, Trevor, Rahul Mazumder, Jason D. Lee, Reza Zadeh. "Matrix Completion and Low-Rank SVD via Fast Alternating Least Squares." The Journal of Machine Learning Research. 16 (1). (2015).

[Kwon et al., 2011] Kwon, Soonil, Xiaofei Yan, Jinrui Cui, Jie Yao, Kai Yang, Donald Tsiand, Xiaohui Li, Jerome Rotter, Xiuqing Guo. "Application of Bayesian Regression with Singular Value Decomposition Method in Association Studies for Sequence Data." BMC Proceedings. 5 (9). (2011).

[Lindley and Smith, 1972] Lindley, D. V., and A. F. M. Smith. "Bayes Estimates for the Linear Model." Journal of the Royal Statistical Society B. 34, 1–41. (1972).

[Linner et al., 2019] Karlsson Linnér, Richard, et al. "Genome-wide association analyses of risk tolerance and risky behaviors in over 1 million individuals identify hundreds of loci and shared genetic influences." Nature Genetics. 51, 2. (2019).

[Merlo et al., 2005] Merlo, J., B. Chaix, M. Yang, J. Lynch, and L. Rastam. "A Brief Conceptual Tutorial of Multilevel Analysis in Social Epidemiology: Linking the Statistical Concept of Clustering to the Idea of Contextual Phenomenon." J Epidemiol Community Health. 59 (2005).

[Meuwissen et al., 2001] Meuwissen, T. H., B. J. Hayes and M. E. Goddard. "Prediction of Total Genetic Value Using Genome-Wide Dense Marker Maps." Genetics. 157 (2001).

[Rue et al., 2017] Rue, Havard, Andrea Riebler, Sigrunn H. Sorbye, Janine B. Illian, Daniel P. Simpson, Finn K. Lindgren. "Bayesian Computing with INLA: A Review." Annual Review of Statistics and Its Application. 4 (2017).

[Shamir et al., 2016] Shamir, Ohad. "Fast Stochastic Algorithms for SVD and PCA: Convergence Properties and Convexity." Proceedings of the 33rd ICML. New York, NY. (2016).

[Stoer and Bulirsch, 1992] Stoer, Josef and Roland Bulirsch. Introduction to Numerical Analysis. 2nd ed. Springer-Verlag, 1992.

[Zhu and Stephens, 2017] Zhu, Xiang and Matthew Stephens. "Bayesian Large-Scale Multiple Regression with Summary Statistics from Genome-Wide Association Studies." The Annals of Applied Statistics. 11 (3). (2017).