Variational Inference for Shrinkage Priors: The R package vir
Suchit Mehrotra and Arnab Maity
Department of Statistics, North Carolina State University
Abstract
We present vir, an R package for variational inference with shrinkage priors. Our package implements variational and stochastic variational algorithms for linear and probit regression models, the use of which is a common first step in many applied analyses. We review variational inference and show how the derivation for a Gibbs sampler can be easily modified to derive a corresponding variational or stochastic variational algorithm. We provide simulations showing that, at least for a normal linear model, variational inference can lead to similar uncertainty quantification as the corresponding Gibbs samplers, while estimating the model parameters at a fraction of the computational cost. Our timing experiments show situations in which our algorithms converge faster than the frequentist LASSO implementations in glmnet while simultaneously providing superior parameter estimation and variable selection. Hence, our package can be utilized to quickly explore different combinations of predictors in a linear model, while providing accurate uncertainty quantification in many applied situations. The package is implemented natively in R and RcppEigen, which has the benefit of bypassing the substantial operating system specific overhead of linking external libraries to work efficiently with R.

Keywords: Bayesian statistics, big data, Gibbs sampling, shrinkage priors, variational inference
1 Introduction

Bayesian statistics assumes that all information about a parameter of interest is contained in a probability distribution called the posterior. In most modern problems, the posterior is hard to compute due to an intractable normalizing constant, and approximations to it are necessary to conduct inference. The most popular choice for approximating the posterior distribution has been the use of Markov Chain Monte Carlo (MCMC) algorithms [Robert and Casella, 2013]. MCMC algorithms work by constructing a Markov chain which has the posterior as its stationary distribution. Once the stationary distribution is reached, samples collected from the chain serve as an approximation of the posterior distribution.

A variety of MCMC algorithms exist, with the two most important being the Metropolis-Hastings algorithm [Metropolis et al., 1953, Hastings, 1970] and the Gibbs sampler [Geman and Geman, 1984, Gelfand and Smith, 1990]. The use of these algorithms is simplified by software which implements them in an automatic fashion by allowing the user to define a model; the R packages rjags [Plummer et al., 2019] and r2winbugs [Sturtz et al., 2005] implement Gibbs samplers, while the package rstan [Stan Development Team, 2018] uses the No-U-Turn sampler [Hoffman and Gelman, 2014] for Hamiltonian Monte Carlo [Betancourt, 2017].

Unfortunately, an issue with MCMC based approaches is that they scale poorly with dataset size or when the number of parameters is large. Consequently, modern work in MCMC has focused on dealing with these limitations. To name a few, these methods work by exploiting modern computing architecture such as GPUs [Terenin et al., 2019], splitting the dataset into smaller chunks (data sharding) and running independent chains asynchronously [Terenin et al., 2020] or combining the results after convergence [Scott et al., 2016, Srivastava et al., 2015], compressing the data before analysis [Guhaniyogi and Dunson, 2015], sub-sampling data to approximate expensive likelihoods [Quiroz et al., 2018, Ma et al., 2015], or using low-rank proposals for high dimensional parameter spaces [Saibaba et al., 2019]. Most of these approaches are not readily available for use in statistical software, something which inhibits the use of the Bayesian paradigm in many applied problems of interest.

An alternative to sampling based approaches for posterior approximation is to use variational inference (VI) [Blei et al., 2017a], which casts the problem into an optimization framework. The main idea is to find an optimal distribution from a family of densities which is closest to the posterior based on some distance measure, where the family is chosen to balance computational tractability and the quality of the posterior approximation. Since variational inference is an optimization based approach, it has the advantage that it can be easily scaled to large data sets using stochastic optimization [Hoffman et al., 2013].
At present, only a few R packages exist which implement variational methods: rstan, which utilizes the STAN framework to implement automatic differentiation variational inference (ADVI), but includes a warning that their implementation is unstable and subject to change [Stan Development Team, 2018, Kucukelbir et al., 2017], and the package varbvs [Carbonetto et al., 2012], which incorporates methods for variable selection using spike-and-slab priors for normal and logistic linear models.

In this chapter we present a new R package, vir, which includes a set of variational and stochastic variational algorithms to estimate parameters in linear and probit regression with shrinkage priors. We incorporate the normal (ridge), double-exponential (LASSO) [Park and Casella, 2008], and horseshoe [Carvalho et al., 2010] priors to conduct variable selection and inference, a problem which arises as a first step in almost all applied analyses. Our package adds to the R ecosystem by providing a suite of computationally efficient variational algorithms, which scale with both the number of parameters, by allowing independence assumptions between regression coefficients, and the number of data points, by utilizing stochastic optimization methods. We implement the algorithms natively in R and RcppEigen, which has the benefit of bypassing the substantial operating system specific overhead of linking external libraries to work efficiently with R. Through our simulation studies, we show that the variational algorithms presented in this chapter are competitive with the popular glmnet package [Simon et al., 2011], which is widely used for variable selection, in both computation time and variable selection accuracy. Additionally, our simulation studies calculate empirical coverage probabilities for the regression coefficients, showing that the variational algorithms have the potential to recover the correct coverage in many applied scenarios.

The rest of this chapter proceeds as follows: Section 2 provides a short review of Gibbs sampling, Section 3 reviews relevant details for variational and stochastic variational inference, Section 4 reviews the use and implementation of our package, Sections 5 and 6 contain numerical studies comparing vir with glmnet, and Section 7 provides a short discussion of our results.

2 Gibbs Sampling

As discussed in Section 1, the core problem in conducting Bayesian inference for a parameter, $\theta$, is the calculation of its posterior distribution conditioned on the data, $y$: $p(\theta \mid y)$. This quantity is rarely available in closed form, and approximations to it are necessary to estimate functions of $\theta$ that may be of interest. The variational algorithms implemented in this package are an example of such approximation algorithms, but their use in the Bayesian literature pales in comparison to the use of Markov Chain Monte Carlo (MCMC) methods. In this section, we focus our discussion of MCMC on Gibbs sampling because of its close relationship with variational inference: the derivations necessary to derive and implement a Gibbs sampler can be extended to derive the corresponding variational algorithms. For a comprehensive treatment of MCMC algorithms, we refer the reader to Robert and Casella [2013] and Brooks et al. [2011].

Gibbs sampling is one of the most popular MCMC algorithms in use today.
It is a widely applicable special case of the Metropolis-Hastings algorithm and is straightforward to use when the full conditional distributions of each parameter are easy to sample from.

To understand how these algorithms are implemented, first note that the posterior distribution, $p(\theta \mid y)$, can also be written as $p(\theta_1, \ldots, \theta_P \mid y)$. We then choose an initial state of the Markov chain for $\theta$, $(\theta_1^{(0)}, \ldots, \theta_P^{(0)})$, and update this state one element at a time by sampling from its full conditional, the distribution of that parameter conditioned on all others. If we continue this procedure for many iterations, updating each element one at a time in no particular order, the Markov chain will converge to the posterior distribution of interest, and storing the state at the end of each iteration after a 'burn-in' period will give us approximate draws from $p(\theta \mid y)$; the full description of the algorithm is given in Algorithm 1.
Algorithm 1: Gibbs Sampler
Input: integers burn_in and n_iter, with burn_in < n_iter
Result: samples from a posterior approximation
Set $\theta^{(0)} = (\theta_1^{(0)}, \ldots, \theta_P^{(0)})$;
for $i \leftarrow 0$ to n_iter do
    $\theta_1^{(i+1)} \sim p(\theta_1 \mid \theta_2^{(i)}, \theta_3^{(i)}, \ldots, \theta_P^{(i)})$;
    $\theta_2^{(i+1)} \sim p(\theta_2 \mid \theta_1^{(i+1)}, \theta_3^{(i)}, \ldots, \theta_P^{(i)})$;
    ...
    $\theta_p^{(i+1)} \sim p(\theta_p \mid \theta_1^{(i+1)}, \ldots, \theta_{p-1}^{(i+1)}, \theta_{p+1}^{(i)}, \ldots, \theta_P^{(i)})$;
    ...
    $\theta_P^{(i+1)} \sim p(\theta_P \mid \theta_1^{(i+1)}, \ldots, \theta_{P-1}^{(i+1)})$;
end
Output: $\{\theta^{(burn\_in + 1)}, \ldots, \theta^{(n\_iter)}\}$
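As a concrete illustration of Algorithm 1, the R sketch below cycles through the full conditionals of a normal linear model with a ridge (normal) prior on the coefficients, Gamma priors on the error precision and the prior precision, and stores the post burn-in draws. It is a minimal, self-contained example written for this review, not the sampler implemented in vir; the hyperparameters a0 and b0 are arbitrary choices.

# Gibbs sampler sketch for y = X b + e, e ~ N(0, tau^{-1} I),
# b ~ N(0, lambda^{-1} I), tau ~ Gamma(a0, b0), lambda ~ Gamma(a0, b0).
gibbs_ridge <- function(y, X, n_iter = 2000, burn_in = 500, a0 = 0.01, b0 = 0.01) {
  N <- nrow(X); P <- ncol(X)
  b <- rep(0, P); tau <- 1; lambda <- 1
  draws <- matrix(NA_real_, n_iter - burn_in, P)
  XtX <- crossprod(X); Xty <- crossprod(X, y)
  for (i in seq_len(n_iter)) {
    # full conditional of b: multivariate normal
    prec <- tau * XtX + lambda * diag(P)
    Sigma <- chol2inv(chol(prec))
    mu <- Sigma %*% (tau * Xty)
    b <- as.vector(mu + t(chol(Sigma)) %*% rnorm(P))
    # full conditional of the error precision tau: gamma
    resid <- y - X %*% b
    tau <- rgamma(1, a0 + N / 2, b0 + sum(resid^2) / 2)
    # full conditional of the prior precision lambda: gamma
    lambda <- rgamma(1, a0 + P / 2, b0 + sum(b^2) / 2)
    if (i > burn_in) draws[i - burn_in, ] <- b
  }
  draws
}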
An important technique to allow the use of a Gibbs sampler in a model is to use conjugate priors for the parameters, which lead to each full conditional being of the same form as the prior distribution. All the models we consider in this chapter have priors with this conjugate relationship.

Finding a conjugate prior for a distribution is always possible for regular exponential families (Bernardo and Smith [2000], Proposition 5.4), where the probability density (mass) function for a random variable, $y$, can be written in the form

$$ f(y \mid \theta) = h(y) \exp\{ \langle \delta_y(\theta) \mid t(y) \rangle - A[\delta_y(\theta)] \}, \qquad (1) $$

where $\delta_y(\theta)$ is called the natural parameter of the distribution, $t(y)$ is a vector of sufficient statistics, and $\langle \cdot \mid \cdot \rangle$ is an inner product. It should be noted that $t(y)$ lies in a general vector space and its elements can be scalars, vectors, or matrices, with the inner product $\langle \cdot \mid \cdot \rangle$ defined to be the sum of the inner products of the respective spaces. For example, a $p$-dimensional multivariate normal distribution for a random variable $y$ has sufficient statistics $y \in \mathbb{R}^p$ and $yy^T \in \mathbb{R}^{p \times p}$. Hence, $t(y) \in \mathbb{R}^p \times \mathbb{R}^{p \times p}$ with $\langle \cdot \mid \cdot \rangle = \langle \cdot \mid \cdot \rangle_{\mathbb{R}^p} + \langle \cdot \mid \cdot \rangle_{\mathbb{R}^{p \times p}}$.

For any distribution that can be written in the form of (1), there exists a conjugate prior for the parameter $\theta$ that can be written in the same form:

$$ p(\theta \mid \alpha) = h(\theta) \exp\{ \langle \alpha \mid t(\theta) \rangle - A(\alpha) \}, $$

where the sufficient statistic is $t(\theta) = \left( \delta_y(\theta),\; -A[\delta_y(\theta)] \right)$, and $\alpha$, the natural parameter of the prior distribution, can be partitioned into $\alpha = (\alpha_1, \alpha_2)^T$, with $\alpha_1$ being of the same dimension as $\delta_y(\theta)$ and $\alpha_2$ being a scalar. Assuming we have $N$ independent and identically distributed samples from $f(y \mid \theta)$, multiplying the likelihood with the prior we see that

$$
\begin{aligned}
p(\theta \mid y_1, \ldots, y_N, \alpha)
 &\propto \prod_{n=1}^{N} p(y_n \mid \theta)\, p(\theta \mid \alpha) \\
 &\propto \prod_{n=1}^{N} \exp\{ \langle \delta_y(\theta) \mid t(y_n) \rangle - A[\delta_y(\theta)] \}\; h(\theta) \exp\{ \langle \alpha \mid t(\theta) \rangle - A(\alpha) \} \\
 &\propto h(\theta) \exp\{ \langle \delta_\theta^* \mid t(\theta) \rangle - A(\alpha) \},
\end{aligned}
$$

where

$$ \delta_\theta^* = \left( \sum_{n=1}^{N} t(y_n) + \alpha_1,\; N + \alpha_2 \right), \qquad (2) $$

and we use the fact that $-A[\delta_y(\theta)] = \langle -A[\delta_y(\theta)] \mid 1 \rangle_{\mathbb{R}}$.

Such manipulations with exponential families are of critical importance when deriving variational algorithms. We will see that writing the full conditionals of a Gibbs sampler in a natural parameter exponential family form simplifies the derivation of the stochastic gradient descent algorithms implemented in the package.
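As a small worked instance of (1) and (2), take i.i.d. observations $y_n \sim \mathrm{N}(0, \tau^{-1})$ with a Gamma prior on the precision $\tau$; the hyperparameters $a$ and $b$ below are generic and the example is included purely for illustration.

$$
\begin{aligned}
f(y_n \mid \tau) &= (2\pi)^{-1/2} \exp\Big\{ \big(-\tfrac{\tau}{2}\big)\, y_n^2 + \tfrac{1}{2}\log\tau \Big\},
\qquad t(y_n) = y_n^2,\;\; \delta_y(\tau) = -\tfrac{\tau}{2},\;\; A[\delta_y(\tau)] = -\tfrac{1}{2}\log\tau, \\
p(\tau \mid \alpha) &\propto \exp\Big\{ \alpha_1 \big(-\tfrac{\tau}{2}\big) + \alpha_2\, \tfrac{1}{2}\log\tau \Big\}
 = \mathrm{Gamma}\Big( \tfrac{\alpha_2}{2} + 1,\; \tfrac{\alpha_1}{2} \Big), \\
\delta_\tau^* &= \Big( \textstyle\sum_{n=1}^{N} y_n^2 + \alpha_1,\; N + \alpha_2 \Big)
 \;\;\Longrightarrow\;\;
 \tau \mid y_{1:N} \sim \mathrm{Gamma}\Big( a + \tfrac{N}{2},\; b + \tfrac{1}{2}\textstyle\sum_{n=1}^{N} y_n^2 \Big),
\end{aligned}
$$

with $a = \alpha_2/2 + 1$ and $b = \alpha_1/2$. Reading off (2) therefore reproduces the familiar Gamma posterior, which matches the form of the Gamma full conditionals used for the precision parameters later in the chapter.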
3 Variational Inference

MCMC methods are known to scale poorly to large datasets, both in the number of observations ($N$) and the number of parameters ($P$). Variational inference [Blei et al., 2017b, Bishop, 2006, Murphy, 2012] aims to alleviate these issues by approximating the probability distribution of interest using optimization instead of sampling. As will be seen in our numerical experiments in Section 5, when compared with Gibbs sampling, these methods yield similar results for many quantities of interest while approximating the posterior at a fraction of the computational cost. In the following sections we review some of the salient details of variational inference while referring the reader to Blei et al. [2017b], Hoffman et al. [2013], Bishop [2006], and Murphy [2012] for a thorough review.

For the rest of this section, let $p(\theta \mid Y)$ be the posterior distribution of interest, with $\theta$ being the parameter and $Y$ being the observed data, $Y = (y_1, \ldots, y_N)$. While MCMC aims to sample from this distribution, VI aims to approximate it by minimizing the Kullback-Leibler (KL) divergence over a family of candidate densities, $\mathcal{D}$. The optimization problem is

$$
q^*(\theta) = \arg\min_{q(\theta) \in \mathcal{D}} \mathrm{KL}\{q(\theta), p(\theta \mid Y)\}
 = \arg\min_{q(\theta) \in \mathcal{D}} \int q(\theta) \log\left\{ \frac{q(\theta)}{p(\theta \mid Y)} \right\} d\theta. \qquad (3)
$$

By utilizing Bayes' rule, we can write (3) as a maximization problem:

$$
\begin{aligned}
\arg\min_{q(\theta) \in \mathcal{D}} \mathrm{KL}\{q(\theta), p(\theta \mid Y)\}
 &= \arg\min_{q(\theta) \in \mathcal{D}} \left\{ \mathbb{E}_q[\log q(\theta)] - \mathbb{E}_q[\log p(\theta \mid Y)] \right\} \\
 &= \arg\min_{q(\theta) \in \mathcal{D}} \left\{ \mathbb{E}_q[\log q(\theta)] - \mathbb{E}_q[\log p(\theta, Y)] + \log p(Y) \right\} \\
 &= \arg\min_{q(\theta) \in \mathcal{D}} \left\{ \mathbb{E}_q[\log q(\theta)] - \mathbb{E}_q[\log p(\theta, Y)] \right\} \\
 &= \arg\max_{q(\theta) \in \mathcal{D}} \left\{ \mathbb{E}_q[\log p(\theta, Y)] - \mathbb{E}_q[\log q(\theta)] \right\}, \qquad (4)
\end{aligned}
$$

where the second to last line drops $\log p(Y)$ because it does not depend on $\theta$ and the last multiplies the equation by negative one. We call the term being maximized in (4) the evidence lower bound (ELBO) because it is a lower bound for the logarithm of the marginal distribution of $Y$, $\log p(Y)$, which is also called the evidence in the machine learning literature [Blei et al., 2017b].

The tractability of the optimization problem depends on the family of densities under consideration; the more complicated the family of densities, the harder the optimization problem will become. Additionally, once we find the optimal density, we will need to calculate expectations and quantiles with respect to it, which pushes us towards simpler approximations. One of the most popular choices in variational inference is to assume that subsets of the parameter vector, $\theta = (\theta_1^T, \ldots, \theta_P^T)^T$, are independent, where the subsets are chosen for computational convenience. Hence,

$$ \mathcal{D} = \left\{ q(\theta) : q(\theta) = \prod_{p=1}^{P} q_p(\theta_p) \right\}. \qquad (5) $$

The class of densities in (5) does not make an assumption regarding the optimal distributions for each $\theta_p$ except for the fact that they are independent. Additionally, the groups within $\theta$ can be selected with a particular structure in mind, with a focus on grouping correlated parameters together.

Coordinate ascent is the most popular approach for finding the optimal distribution under the mean-field restrictions in (5). This approach iteratively optimizes the ELBO with respect to each factorized density, while holding all others constant. The resulting algorithm is dubbed coordinate ascent variational inference (CAVI). The derivations for the optimal density are given in various texts [Bishop, 2006, Blei et al., 2017b, Murphy, 2012] and we state the result below for convenience. First, note that the ELBO with respect to the factor, $q_p(\theta_p)$, can be written as

$$
\begin{aligned}
\mathrm{ELBO}[q_p(\theta_p)] &= \mathbb{E}_{q(\theta)}[\log p(\theta, Y)] - \sum_{j=1}^{P} \mathbb{E}_{q_j(\theta_j)}[\log q_j(\theta_j)] \\
 &= \mathbb{E}_{q_p(\theta_p)}\!\left[ \mathbb{E}_{q_r(\theta_r)}\{\log p(\theta, Y)\} \right] - \mathbb{E}_{q_p(\theta_p)}[\log q_p(\theta_p)] + \text{const} \\
 &= -\mathrm{KL}\!\left[ q_p(\theta_p),\; \exp\{ \mathbb{E}_{q_r(\theta_r)}[\log p(\theta, Y)] \} \right] + \text{const},
\end{aligned}
$$

where we define $q_r(\theta_r)$ as the variational distribution for the rest of the parameters: $q_r(\theta_r) = \prod_{j \ne p} q_j(\theta_j)$. Since the KL divergence between two probability distributions is always non-negative, the negative KL divergence is maximized when the divergence between the two densities is equal to zero. Hence the optimal variational density for the $p$th factor is

$$
\begin{aligned}
q_p^*(\theta_p) &\propto \exp\{ \mathbb{E}_{q_r(\theta_r)}[\log p(\theta, Y)] \}, \qquad (6) \\
 &\propto \exp\{ \mathbb{E}_{q_r(\theta_r)}[\log p(\theta_p \mid \theta_r, Y)] \}. \qquad (7)
\end{aligned}
$$

Consequently, the optimal variational density for the $p$th coordinate is a function of the full conditional distribution of the parameter, a density whose calculation is already required for a Gibbs sampler.

It should be noted that, in most situations, the difficulty of calculating the expectations in (6) is a direct function of the simplicity of the variational family in (5). Assuming that a larger number of parameters independently factor in the posterior simplifies the calculations of the expectations of, for example, the inner product of two vector parameters.
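To make the updates in (7) concrete, the R sketch below runs CAVI for a normal linear model with a ridge prior under a full mean-field factorization $q(b)\,q(\tau)\,q(\lambda)$, cycling through closed-form updates of the variational parameters. It is a minimal illustration built around the result above, with arbitrary Gamma hyperparameters a0 and b0 and a crude convergence check; it is not the lm_ridge_cavi() implementation in vir.

# CAVI sketch for y = X b + e, e ~ N(0, tau^{-1} I), b ~ N(0, lambda^{-1} I),
# tau ~ Gamma(a0, b0), lambda ~ Gamma(a0, b0); q(b) q(tau) q(lambda) mean-field.
cavi_ridge <- function(y, X, max_iter = 100, tol = 1e-6, a0 = 0.01, b0 = 0.01) {
  N <- nrow(X); P <- ncol(X)
  XtX <- crossprod(X); Xty <- crossprod(X, y)
  e_tau <- 1; e_lambda <- 1; mu_old <- rep(Inf, P)
  for (iter in seq_len(max_iter)) {
    # q(b): multivariate normal, the expected full conditional as in (7)
    Sigma <- chol2inv(chol(e_tau * XtX + e_lambda * diag(P)))
    mu <- as.vector(Sigma %*% (e_tau * Xty))
    # q(tau): gamma; E||y - X b||^2 = ||y - X mu||^2 + tr(X'X Sigma)
    e_rss <- sum((y - X %*% mu)^2) + sum(XtX * Sigma)
    e_tau <- (a0 + N / 2) / (b0 + e_rss / 2)
    # q(lambda): gamma; E||b||^2 = ||mu||^2 + tr(Sigma)
    e_lambda <- (a0 + P / 2) / (b0 + (sum(mu^2) + sum(diag(Sigma))) / 2)
    if (max(abs(mu - mu_old)) < tol) break
    mu_old <- mu
  }
  list(mu = mu, Sigma = Sigma, e_tau = e_tau, e_lambda = e_lambda)
}

Each update mirrors the corresponding full conditional of a Gibbs sampler, with expectations of the other parameters substituted for sampled values; this is exactly the correspondence between (7) and the full conditionals noted above.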
If we further assume that the complete conditionals of each parameter given all others are in the exponential family (1), and require the individual factors in the mean-field family (5) to be in the same exponential family, we can scale variational inference to large datasets that do not fit in memory. This approach, termed Stochastic Variational Inference (SVI) [Hoffman et al., 2013], utilizes gradient based optimization to maximize the ELBO instead of using the coordinate ascent algorithm presented in Section 3.2. SVI compares favorably to CAVI in models that have local variables; this means that for each data point $y_n$ there exists a latent parameter $z_n$ that needs to be estimated. In such cases, both Gibbs sampling and the CAVI algorithm process the entire dataset every iteration.

For the rest of this section, we present SVI in the context of probit regression with a normal (ridge) prior. This model has the hierarchy

$$
\begin{aligned}
y_n &= I(z_n > 0), \\
z_n &\sim \mathrm{N}(x_n^T b, 1), \\
b &\sim \mathrm{N}(0, \lambda^{-1} I), \\
\lambda &\sim \mathrm{G}(a_\lambda, b_\lambda). \qquad (8)
\end{aligned}
$$

In this situation, the observed data is $y_n$, and each $y_n$ has a corresponding local variable, $z_n$, which allows the use of the probit link in the calculation of $P(y_n = 1) = \Phi(x_n^T b)$, where $\Phi(\cdot)$ is the CDF of the standard normal distribution. In a Gibbs sampler and a CAVI algorithm, we would have to update each $z_n$ before we can update the parameter vector $b$, estimation of which is of primary interest.

It is well known that the complete conditional distributions of the parameters in (8) are given by

$$
\begin{aligned}
z_n \mid \cdot &\sim \begin{cases} \mathrm{N}_+\!\left(x_n^T b, 1\right) & \text{if } y_n = 1 \\ \mathrm{N}_-\!\left(x_n^T b, 1\right) & \text{if } y_n = 0, \end{cases} \\
b \mid \cdot &\sim \mathrm{N}\!\left( (X^T X + \lambda I)^{-1} X^T z,\; (X^T X + \lambda I)^{-1} \right), \\
\lambda \mid \cdot &\sim \mathrm{G}\!\left( a_\lambda + \frac{P}{2},\; \frac{\lVert b \rVert^2}{2} + b_\lambda \right),
\end{aligned}
$$

where, for example, $b \mid \cdot$ denotes the distribution of $b$ conditioned on all other parameters in the model, $X = (x_1^T, \ldots, x_N^T)^T$, and $\mathrm{N}_+$ and $\mathrm{N}_-$ are truncated normal distributions on $(0, \infty)$ and $(-\infty, 0]$, respectively. To derive the stochastic variational algorithm, we restrict the variational distribution for $b$ to be a normal distribution, i.e., $q_b(b) = \mathrm{N}(\mu_b, \Sigma_b)$.
Because the normal distribution is part of the exponential family, it can be written in the form of (1):

$$ q(b \mid \delta_b(\theta)) = h(b) \exp\{ \langle \delta_b(\theta) \mid t_b(b) \rangle - A[\delta_b(\theta)] \}, \qquad (9) $$

where

$$
t_b(b) = \begin{pmatrix} b \\ b b^T \end{pmatrix}
\quad \text{and} \quad
\delta_b(\theta) = \begin{pmatrix} \Sigma_b^{-1} \mu_b \\ -\tfrac{1}{2}\Sigma_b^{-1} \end{pmatrix}. \qquad (10)
$$

Now, looking at the full conditional for $b$, we can also put it into exponential family form as

$$ p(b \mid X, z, \lambda) = h(b) \exp\{ \langle \delta_b(X, z, \lambda) \mid t_b(b) \rangle - A[\delta_b(X, z, \lambda)] \}. \qquad (11) $$

Instead of taking the gradient of the ELBO with respect to $\delta_b(\theta)$, Hoffman et al. [2013] utilize the natural gradient [Amari, 1998], which accounts for the geometry of the parameter space. Applying their derivation for the natural gradient of the ELBO to our case (Hoffman et al. [2013], Equation 14) gives us the natural gradient with respect to $\delta_b(\theta)$:

$$
\begin{aligned}
\nabla_{\delta_b(\theta)} \mathrm{ELBO} &= \mathbb{E}_q[\delta_b(X, z, \lambda)] - \delta_b(\theta) \\
 &= \mathbb{E}_q \begin{pmatrix} \sum_{n=1}^{N} z_n x_n \\ -\tfrac{1}{2}\left( \sum_{n=1}^{N} x_n x_n^T + \lambda I \right) \end{pmatrix} - \delta_b(\theta) \\
 &= \begin{pmatrix} \sum_{n=1}^{N} \mathbb{E}_q[z_n]\, x_n \\ -\tfrac{1}{2}\left( \sum_{n=1}^{N} x_n x_n^T + \mathbb{E}_q[\lambda] I \right) \end{pmatrix} - \delta_b(\theta). \qquad (12)
\end{aligned}
$$

This expression can be used in a gradient descent algorithm where, at each iteration, the natural parameters of the variational distribution, $\delta_b(\theta)$, are updated using the following formula:

$$
\begin{aligned}
\delta_b(\theta)^{(t)} &= \delta_b(\theta)^{(t-1)} + \rho_t \left[ \mathbb{E}_q[\delta_b(X, z, \lambda)] - \delta_b(\theta)^{(t-1)} \right] \\
 &= (1 - \rho_t)\, \delta_b(\theta)^{(t-1)} + \rho_t\, \mathbb{E}_q[\delta_b(X, z, \lambda)], \qquad (13)
\end{aligned}
$$

where $\rho_t$ is a predetermined step size.

Note that the gradient in (12) requires the processing of all data points due to the summation over $N$. Instead of calculating the gradient with respect to the full data, we can calculate an approximation which is equal to the full gradient in expectation, and follow the iterative procedure in (13). If the sequence of step sizes, $\rho_t$, meets the conditions in (14), $\delta_b(\theta)^{(t)}$ will converge to a local optimum:

$$ \sum_t \rho_t = \infty \quad \text{and} \quad \sum_t \rho_t^2 < \infty. \qquad (14) $$

A noisy estimate of the gradient can be calculated by using only one uniformly sampled data point replicated $N$ times. This means that the summation in (12) would change to the value for one data point multiplied by $N$; that is, the parameter update in (13) becomes

$$ \delta_b(\theta)^{(t)} = (1 - \rho_t)\, \delta_b(\theta)^{(t-1)} + \rho_t \left\{ N\, \mathbb{E}_q[\delta_b(x_n, z_n)] \right\}, \qquad (15) $$

with

$$ N\, \mathbb{E}_q[\delta_b(x_n, z_n)] = \begin{pmatrix} N\, \mathbb{E}_q[z_n]\, x_n \\ -\tfrac{1}{2}\left( N x_n x_n^T + \mathbb{E}_q[\lambda] I \right) \end{pmatrix}. $$

Finally, while the stochastic gradient updates in (15) are only computed for one data point at a time, the methodology can be extended to process $S$ data points simultaneously. The only difference would be to average the individual results from each data point. The update step becomes

$$ \delta_b(\theta)^{(t)} = (1 - \rho_t)\, \delta_b(\theta)^{(t-1)} + \frac{\rho_t}{S} \left\{ N \sum_{s \in \mathcal{S}} \mathbb{E}_q[\delta_b(x_s, z_s)] \right\}, $$

which implies that

$$ N \sum_{s \in \mathcal{S}} \mathbb{E}_q[\delta_b(x_s, z_s)] = \begin{pmatrix} N X_{\mathcal{S}}^T \mathbb{E}_q[z_{\mathcal{S}}] \\ -\tfrac{1}{2}\left( N X_{\mathcal{S}}^T X_{\mathcal{S}} + S\, \mathbb{E}_q[\lambda] I \right) \end{pmatrix}, $$

where $\mathcal{S}$ is the set of sub-sampled data points, $S = |\mathcal{S}|$, and $X_{\mathcal{S}}$ and $z_{\mathcal{S}}$ are $X$ and $z$ restricted to the corresponding indices in $\mathcal{S}$.
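The minibatch update above is compact enough to sketch directly. The R code below runs the stochastic natural-parameter updates for $q_b(b)$ in the probit ridge model (8), with the simplification that $\mathbb{E}_q[\lambda]$ is refreshed from a CAVI-style update at every iteration; it is an illustration of the derivation, not the probit_ridge_svi() implementation in vir, and the hyperparameters a_l and b_l are arbitrary.

# SVI sketch for the probit ridge model in (8): minibatch natural-gradient
# updates of q(b) = N(mu, Sigma), following (13)-(15).
svi_probit_ridge <- function(y, X, batch_size = 10, n_iter = 1000,
                             const_rhot = 0.01, a_l = 0.01, b_l = 0.01) {
  N <- nrow(X); P <- ncol(X)
  # natural parameters of q(b): d1 = Sigma^{-1} mu, d2 = -Sigma^{-1} / 2
  d1 <- rep(0, P); d2 <- -0.5 * diag(P); e_lambda <- 1
  for (t in seq_len(n_iter)) {
    idx <- sample.int(N, batch_size)
    Xs <- X[idx, , drop = FALSE]
    Sigma <- chol2inv(chol(-2 * d2)); mu <- as.vector(Sigma %*% d1)
    # E[z_s] under q: truncated normal means, depending on y_s
    m <- as.vector(Xs %*% mu)
    ez <- ifelse(y[idx] == 1, m + dnorm(m) / pnorm(m), m - dnorm(m) / pnorm(-m))
    # noisy natural-gradient target, scaled up to the full dataset size
    t1 <- (N / batch_size) * crossprod(Xs, ez)
    t2 <- -0.5 * ((N / batch_size) * crossprod(Xs) + e_lambda * diag(P))
    d1 <- (1 - const_rhot) * d1 + const_rhot * as.vector(t1)
    d2 <- (1 - const_rhot) * d2 + const_rhot * t2
    # simple closed-form refresh of E[lambda] given the current q(b)
    e_lambda <- (a_l + P / 2) / (b_l + (sum(mu^2) + sum(diag(Sigma))) / 2)
  }
  Sigma <- chol2inv(chol(-2 * d2))
  list(mu = as.vector(Sigma %*% d1), Sigma = Sigma)
}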
4 The vir Package

The package vir contains implementations of the CAVI and SVI algorithms for univariate and multivariate regression models with the normal and probit link. For the univariate linear models, we derive and implement a Gibbs sampler, and the CAVI and SVI algorithms for the ridge (normal), LASSO (double-exponential) [Park and Casella, 2008], and horseshoe [Carvalho et al., 2010, Makalic and Schmidt, 2015] priors, while the multivariate linear models are implemented with non-informative priors for the regression coefficients and a factor model for the covariance structure.

Table 1: List of functions implemented in the vir package. The functions have the format model_prior_algorithm. Consequently, if an analyst wishes to use a Gibbs sampler to fit a multivariate linear model with a non-informative prior, they could call the function mv_lm_uninf_gibbs(). If they wished to fit a univariate probit model with a horseshoe prior using CAVI, they would call the function probit_hs_cavi(). Each function is implemented in C++ using the RcppEigen package.

Model                             Priors                    Algorithms
Normal: lm                        Ridge: ridge              Gibbs: gibbs
Probit: probit                    LASSO: lasso              CAVI: cavi
Multivariate Normal: mv_lm        Horseshoe: hs             SVI: svi
Multivariate Probit: mv_probit    Non-informative: uninf
Each function is implemented in C++ and leverages the Rcpp [Eddelbuettel and Francois, 2011] and RcppEigen [Bates et al., 2013] packages to optimize performance. The functions in the package are named according to the link function, lm for normal linear models and probit for binary regression, followed by the shrinkage prior type: ridge, lasso, hs, or uninf; and then the algorithm: gibbs, cavi, or svi. Therefore, if an analyst wishes to use the SVI algorithm with a linear model and horseshoe prior, they can call the function lm_hs_svi() to analyze the data. Table 1 contains a summary of the names of the functions in the package.

All functions take, as arguments, a matrix of predictors, X, and a vector or matrix of responses, y or Y. The variational algorithms utilize two assumptions regarding the regression coefficients. The first assumes that the regression coefficients are correlated, and sets the optimal distribution of the parameter to be a multivariate normal, that is, q(b) = N(mu_b, Sigma_b). Alternatively, we also implement a version of the algorithm which assumes that all regression coefficients are independent, i.e., q(b) = prod_{p=1}^P q(b_p), where each component, q(b_p), is a univariate normal distribution. The use of these two options is problem dependent; assuming the full correlation structure may lead to superior variance estimates at the expense of slower computation time.

Below we provide an example use case, analyzing a simulated dataset with a univariate normal linear model (16). We first demonstrate the use of CAVI with a normal prior. As described earlier, the function to be called in this situation is lm_ridge_cavi(); the documentation for which can be seen by executing ?lm_ridge_cavi in the R console. We install the package from GitHub and start by simulating a small dataset and fitting the model.

> devtools::install_github("suchitm/vir")
> library(vir)
> set.seed(42)
> X = matrix(nrow = 100, ncol = 5, rnorm(5 * 100))
> colnames(X) = paste0("X", 1:5)
> b = rnorm(5)
> y = rnorm(1) + X %*% b + rnorm(100)
> ridge_cavi_fit = lm_ridge_cavi(y, X, n_iter = 100, rel_tol = 0.0001)

Each of the algorithms outputs a nested list. The first layer contains an element for each of the parameters and the ELBO, while the second lists the form of the optimal distribution (normal, gamma, etc.) and the corresponding optimal parameters. Since the normal ridge model has four parameters, the first level of the list has the names:

> names(ridge_cavi_fit)
[1] "b0"     "b"      "tau"    "lambda" "elbo"

where b0 is the intercept term, b is a vector of regression coefficients, tau is the error precision, and lambda is the prior precision. The optimal distribution of the regression coefficients, b, is a multivariate normal with mean, mu, and covariance matrix, Sigma; the names for that parameter element contain the distribution type and corresponding parameters.

> names(ridge_cavi_fit$b)
[1] "dist"      "mu"        "sigma_mat"

> ridge_cavi_fit$b$dist
[1] "multivariate normal"

> ridge_cavi_fit$b$mu
[1]  0.87668826  0.92639922  0.08866204  0.10254637 -0.75907903

> round(ridge_cavi_fit$b$sigma_mat, 4)
        [,1]    [,2]    [,3]    [,4]    [,5]
[1,]  0.0110 -0.0006  0.0017 -0.0006 -0.0010
[2,] -0.0006  0.0144 -0.0011 -0.0005  0.0017
[3,]  0.0017 -0.0011  0.0115  0.0007 -0.0010
[4,] -0.0006 -0.0005  0.0007  0.0156 -0.0025
[5,] -0.0010  0.0017 -0.0010 -0.0025  0.0118

Point estimates and credible intervals for the coefficients can be obtained with the function summary_vi().
> summary_vi(ridge_cavi_fit, level = 0.95, coef_names = colnames(X))
             Estimate      Lower       Upper
Intercept -0.17491248 -0.3901667  0.04034176
X1         0.87668826  0.6711879  1.08218858
X2         0.92639922  0.6909048  1.16189365
X3         0.08866204 -0.1218641  0.29918818
X4         0.10254637 -0.1424468  0.34753954
X5        -0.75907903 -0.9717937 -0.54636432
Using the model fit, one can generate estimates and corresponding credible intervals for new data by utilizing the predict_lm_vi() function.

> X_test = matrix(nrow = 5, ncol = 5, rnorm(25))
> predict_lm_vi(ridge_cavi_fit, X_test)
$estimate
[1] -1.0235563  2.3194540 -0.5776915 -0.6762113 -1.2336189

$ci
           [,1]      [,2]
[1,] -3.1989515 1.1518389
[2,]  0.1016273 4.5372806
[3,] -2.7163884 1.5610055
[4,] -2.8682399 1.5158172
[5,] -3.4223356 0.9550978
There are three material differences between the CAVI and SVI implementations, with the first being the stopping criteria. The CAVI algorithms use a relative tolerance criterion (rel_tol) comparing the ELBO at the current iteration with the ELBO five iterations in the past. On the other hand, the SVI algorithms do not terminate based on a stopping criterion, but will instead run until the maximum number of iterations (n_iter) is reached. For each sub-sample of the data, a noisy estimate of the ELBO is calculated and stored, which can then be used to assess convergence of the algorithms.

Second, the step sizes, $\rho_t$, in stochastic gradient algorithms have to follow the schedule outlined in (14). We implement two ways of meeting the necessary conditions. The default approach allows setting a constant step size via the parameter const_rhot, which keeps the step size constant for each iteration of the algorithm. Alternatively, we also allow the user to specify a schedule based on the equation $\rho_t = (t + \omega)^{-\kappa}$, where $\omega \ge 0$ and $\kappa \in (0.5, 1]$ is called the forgetting rate [Hoffman et al., 2013]. Finally, SVI algorithms utilize sub-samples of the data, and the size of these samples has to be set via the batch_size parameter. Below we show how the corresponding SVI algorithm for normal linear regression with a normal prior can be implemented in our package; in this case we set a constant learning rate of 0.1.

> ridge_svi_fit = lm_ridge_svi(
+   y, X, n_iter = 1000, verbose = FALSE, batch_size = 10, const_rhot = 0.1
+ )

It should be noted that there exists an interaction between the batch size and the step size for any stochastic gradient descent algorithm. For our numerical experiments in Section 5, we use a constant learning rate of 0.01, which seemed to work well for our purposes. Consequently, this is the default rate for our algorithms. The choice of adaptive learning rates is an open area of research in the stochastic optimization literature [Schaul et al., 2013, Kingma and Ba, 2014] and we leave their implementation to later iterations of our package.
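The interaction between the schedule and a constant rate is easy to visualize; the snippet below evaluates $\rho_t = (t + \omega)^{-\kappa}$ for two illustrative settings of $\omega$ and $\kappa$ (any $\kappa \in (0.5, 1]$ satisfies the conditions in (14)) and compares them with a constant rate. The particular values used are arbitrary choices for illustration.

# Step-size schedules rho_t = (t + omega)^(-kappa) for a few settings.
schedule <- function(t, omega, kappa) (t + omega)^(-kappa)
t <- 1:1000
rho_slow <- schedule(t, omega = 1, kappa = 0.51)   # decays slowly; larger early steps
rho_fast <- schedule(t, omega = 10, kappa = 1.0)   # decays quickly; more conservative
# A constant step size (const_rhot) trades asymptotic convergence guarantees
# for simplicity, which is why it is paired with a fixed number of iterations.
rho_const <- rep(0.01, length(t))
matplot(t, cbind(rho_slow, rho_fast, rho_const), type = "l", log = "y",
        xlab = "iteration", ylab = expression(rho[t]))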
5 Simulation Studies

In this section we provide a simulation study comparing the ridge and LASSO penalties implemented in glmnet to a Gibbs sampler and the CAVI and SVI algorithms for linear and probit regression with ridge, LASSO, and horseshoe priors. For each set of simulations we generate D = 50 datasets for our comparisons.

For the linear regression comparisons, we simulate from the model

$$ y = b_0 \mathbf{1} + X b + e, \qquad e \sim \mathrm{N}(0, \sigma^2 I), \qquad (16) $$

where $b_0 \sim \mathrm{N}(0, 1)$, the $n$th row of $X$, $x_n \sim \mathrm{N}(0, V)$ with $V_{j,j'} = \mathrm{cov}(x_j, x_{j'}) = 0.5^{|j - j'|}$, $b \sim \mathrm{N}(0, I)$, and $\sigma^2$ is chosen to set the signal-to-noise ratio in the data to one. We hold the number of predictors $P$ constant at 75 and vary the size of the dataset, $N$. We set 80% of the values in $b$ to zero, which motivates our comparison of the shrinkage priors below.
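The design in (16) can be generated in a few lines of R; the sketch below mirrors the description above (predictor covariance $0.5^{|j-j'|}$, 80% of the coefficients set to zero, and the noise scale chosen so the signal-to-noise ratio is one), though the exact simulation code used for the study may differ in details.

# Simulate one dataset from the design in (16).
sim_linear <- function(N, P = 75, prop_zero = 0.8, rho = 0.5) {
  V <- rho^abs(outer(1:P, 1:P, "-"))           # cov(x_j, x_j') = 0.5^|j - j'|
  X <- matrix(rnorm(N * P), N, P) %*% chol(V)  # rows x_n ~ N(0, V)
  b <- rnorm(P)
  b[sample(P, floor(prop_zero * P))] <- 0      # 80% of coefficients are zero
  signal <- as.vector(X %*% b)
  sigma <- sd(signal)                          # SNR = var(signal) / sigma^2 = 1
  y <- rnorm(1) + signal + rnorm(N, sd = sigma)
  list(y = y, X = X, b = b, sigma = sigma)
}
dat <- sim_linear(N = 1000)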
We compare the mean squared error for the coefficients (MSE), coverage of 95% credible intervals (Cov.), and the relative mean squared prediction error (MSPE) for predictions on a test set of size 500 simulated from the model in (16). The formulas to calculate our metrics are below:

$$
\mathrm{MSE} = \frac{1}{DP} \sum_{d=1}^{D} \sum_{p=1}^{P} (b_p - \hat{b}_p)^2, \qquad
\mathrm{MSPE} = \sqrt{\frac{\lVert Xb - X\hat{b} \rVert_2^2}{\lVert Xb \rVert_2^2}}, \qquad
\mathrm{COV} = \frac{1}{DP} \sum_{d=1}^{D} \sum_{p=1}^{P} I(l_p \le b_p \le u_p),
$$

where $b_p$ is the true coefficient, $\hat{b}_p$ is the corresponding estimated value, $X$ is the predictor matrix for the test set, and $(l_p, u_p)$ is the credible set for the $p$th predictor.

The results for our simulations are given in Table 2.

Table 2: Simulation results for the normal linear model.

                       N = 100                  N = 1000                 N = 5000
               MSE    Cov.   MSPE       MSE    Cov.   MSPE       MSE    Cov.   MSPE
Ridge
  GLM-1SE      0.160  -      0.917      0.038  -      0.732      0.012  -      0.708
  GLM-Min      0.160  -      0.913      0.039  -      0.732      0.012  -      0.707
  Gibbs-Corr   0.125  0.941  0.856      0.024  0.948  0.718      0.005  0.951  0.700
  CAVI-Corr    0.135  0.946  0.868      0.024  0.949  0.718      0.005  0.952  0.700
  CAVI-Indep   0.122  0.913  0.853      0.024  0.871  0.718      0.005  0.871  0.700
  SVI-Corr     0.135  0.946  0.866      0.026  0.943  0.721      0.008  0.893  0.703
  SVI-Indep    0.122  0.913  0.854      0.025  0.868  0.720      0.006  0.819  0.704
LASSO
  GLM-1SE      0.129  -      0.859      0.029  -      0.721      0.010  -      0.705
  GLM-Min      0.131  -      0.864      0.029  -      0.721      0.010  -      0.705
  Gibbs-Corr   0.120  0.968  0.852      0.017  0.966  0.713      0.004  0.963  0.700
  CAVI-Corr    0.133  0.937  0.863      0.018  0.958  0.713      0.004  0.959  0.700
  CAVI-Indep   0.106  0.901  0.833      0.017  0.911  0.712      0.004  0.897  0.700
  SVI-Corr     0.129  0.938  0.859      0.019  0.952  0.714      0.006  0.908  0.701
  SVI-Indep    0.091  0.917  0.814      0.014  0.930  0.709      0.005  0.877  0.701
HS
  Gibbs-Corr   0.097  0.945  0.809      0.010  0.975  0.705      0.002  0.983  0.698
  CAVI-Corr    0.113  0.909  0.834      0.011  0.956  0.706      0.002  0.967  0.698
  CAVI-Indep   0.099  0.888  0.813      0.010  0.931  0.706      0.002  0.941  0.698
  SVI-Corr     0.110  0.908  0.831      0.011  0.954  0.706      0.002  0.961  0.698
  SVI-Indep    0.099  0.888  0.812      0.010  0.933  0.705      0.002  0.937  0.698

As can be seen, all implementations of the variational algorithms are competitive with glmnet and the Gibbs sampler in regards to MSE and MSPE. The variational algorithms for linear regression with the horseshoe prior provide superior parameter estimation with respect to the LASSO or ridge priors, along with outperforming the ridge and LASSO glmnet implementations.

With regards to coverage, the variational algorithms which assume posterior independence between the regression parameters have the lowest coverage among the Bayesian approaches. However, the coverage improves as N increases for the horseshoe prior, which has approximately 94% coverage for the 95% credible sets measured. On the other hand, the variational algorithms which estimate a correlated structure among the parameters seem to retain coverage similar to that of the Gibbs samplers while being estimated at a fraction of the computational cost. Therefore, while variational algorithms are, in general, known to underestimate the posterior variance of the model parameters, this does not seem to be an issue in the simulation settings considered. Additionally, the SVI and CAVI algorithms perform comparably in terms of point and variance estimation, providing confidence for using SVI approaches in the context of a linear regression where the data does not fit in memory.

We compare the probit regression algorithms in the package using a similar design as in Section 5.1, with the modification that $z_n = b_0 + x_n^T b$ and $P(y_n = 1) = \Phi(z_n)$ for the probit link or $P(y_n = 1) = \mathrm{expit}(z_n)$ for the logit link.
Because glmnet implements the logistic link instead of the probit, we simulate from both the logistic and probit models and compare the algorithms using only one design with N = 500 and P = 50, and set 40 of the parameters to zero.

We calculate the mean squared error for the parameter estimates of each of the algorithms when the data generating process is the same as the one assumed by the model (logistic for glmnet and probit for vir), and compare the coverage of the parameter estimates in the probit case for the Gibbs sampler and the variational algorithms. We also compare the variable selection quality of the respective algorithms by using the RAND index [Rand, 1971], thinking of the variable selection problem as one of calculating the dissimilarities between two cluster indicator vectors. Finally, we compare the predictive capacity of our models by using the area under the precision recall curve (AUC-PR).
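For concreteness, the RAND index between the true and estimated sparsity patterns can be computed by treating each binary indicator vector as a two-cluster partition of the predictors; the helper below is a generic illustration of that calculation and is not a function exported by vir.

# RAND index between two binary "selected / not selected" indicator vectors.
rand_index <- function(truth, selected) {
  stopifnot(length(truth) == length(selected))
  pairs <- combn(length(truth), 2)
  same_truth <- truth[pairs[1, ]] == truth[pairs[2, ]]
  same_sel <- selected[pairs[1, ]] == selected[pairs[2, ]]
  mean(same_truth == same_sel)  # proportion of concordant pairs
}
# Toy example: two of five truly nonzero coefficients, one false selection.
rand_index(c(1, 1, 0, 0, 0), c(1, 0, 0, 0, 1))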
The results for our simulation are given in Table 3.

Table 3: Simulation results for binary regression with shrinkage priors.

                        Logit                          Probit
               MSE    RAND   AUC-PR       MSE    Coverage  RAND   AUC-PR
Ridge
  GLM-1SE      0.079  0.673  0.881        -      -         0.673  0.924
  GLM-Min      0.174  0.673  0.885        -      -         0.673  0.922
  Gibbs-Corr   -      0.852  0.885        0.023  0.951     0.850  0.930
  CAVI-Corr    -      0.772  0.886        0.034  0.700     0.706  0.929
  CAVI-Indep   -      0.671  0.886        0.035  0.588     0.607  0.929
  SVI-Corr     -      0.765  0.885        0.031  0.692     0.692  0.929
  SVI-Indep    -      0.660  0.885        0.032  0.568     0.588  0.929
LASSO
  GLM-1SE      0.180  0.819  0.894        -      -         0.794  0.934
  GLM-Min      0.174  0.621  0.898        -      -         0.591  0.938
  Gibbs-Corr   -      0.866  0.892        0.024  0.970     0.863  0.936
  CAVI-Corr    -      0.811  0.895        0.016  0.782     0.790  0.939
  CAVI-Indep   -      0.779  0.895        0.019  0.740     0.763  0.939
  SVI-Corr     -      0.810  0.894        0.017  0.782     0.790  0.939
  SVI-Indep    -      0.846  0.898        0.019  0.779     0.829  0.942
HS
  Gibbs-Corr   -      0.845  0.900        0.012  0.981     0.875  0.943
  CAVI-Corr    -      0.871  0.901        0.010  0.856     0.869  0.944
  CAVI-Indep   -      0.869  0.901        0.010  0.845     0.873  0.944
  SVI-Corr     -      0.871  0.901        0.009  0.857     0.868  0.944
  SVI-Indep    -      0.868  0.901        0.009  0.845     0.870  0.944

In both simulation settings, the Gibbs samplers remain the most consistent in terms of the RAND index. The CAVI and SVI algorithms with the horseshoe prior perform comparably to the Gibbs samplers, but the variational algorithms with the ridge prior perform meaningfully worse, while the ones with the LASSO perform only slightly worse. All algorithms perform similarly in terms of AUC-PR, with the horseshoe prior being slightly better than the others.

In terms of parameter estimation in the probit simulation case, we see that the estimation quality of the variational algorithms is comparable to the Gibbs samplers for the LASSO and horseshoe priors, while being worse for the ridge prior. As with the linear regression simulations, the performance of the SVI and CAVI algorithms is similar, giving us confidence in their use for large scale regression problems.

Unfortunately, when it comes to coverage, the variational algorithms meaningfully underperform the corresponding Gibbs samplers, likely due to the mean-field assumption that the latent variables for each data point are uncorrelated with the regression coefficients. Fasano et al. [2019] show that a mean-field approach for probit regression causes the regression estimates to be shrunk towards zero, and propose a variational algorithm in the spirit of the Gibbs sampler developed by Holmes et al. [2006] to remedy the issue. However, their proposed algorithm requires the prior for the regression coefficients to be fixed a priori, and their updates for the latent variables require the whole data set. Consequently, their algorithms cannot be readily extended to the adaptive shrinkage priors considered in this article, and their CAVI algorithm cannot be easily extended to utilize stochastic optimization. Therefore, we recommend that the estimated posterior variance of the regression coefficients be viewed with skepticism for the variational algorithms and note that if one desires accurate uncertainty quantification in this situation, one may, ironically, be able to utilize the bootstrap [Chen et al., 2018].
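A nonparametric bootstrap along these lines is straightforward to sketch: resample rows of the data, refit the variational approximation, and use the spread of the refitted posterior means as an uncertainty estimate. The snippet below is a minimal illustration of the idea, shown with lm_ridge_cavi() (whose interface and output structure were described in Section 4) on the y and X simulated there; the same recipe applies to the probit fits, and it is not a procedure provided by the package itself.

# Percentile bootstrap for the regression coefficients around a variational fit.
boot_ci <- function(y, X, B = 200, level = 0.95) {
  N <- length(y)
  est <- replicate(B, {
    idx <- sample.int(N, N, replace = TRUE)             # resample rows
    fit <- lm_ridge_cavi(y[idx], X[idx, , drop = FALSE],
                         n_iter = 100, rel_tol = 0.0001)
    fit$b$mu                                            # variational posterior mean
  })
  alpha <- (1 - level) / 2
  t(apply(est, 1, quantile, probs = c(alpha, 1 - alpha)))
}
boot_intervals <- boot_ci(y, X)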
6 Timing Comparisons

A motivating factor for the development of our package was the computational efficiency provided by variational methods relative to MCMC; we aimed to scale linear regression with shrinkage priors to large datasets. While it is known that variational algorithms outperform their MCMC counterparts computationally, in this section we provide timing comparisons of our implementations for linear and binary regression to those in the package glmnet. Before we proceed to a discussion of the relative timing results for each model, it should be noted that the variational algorithms provide relatively accurate uncertainty quantification for normal linear models while the glmnet implementations provide only a point estimate; in fact, frequentist uncertainty quantification for L1-penalized regression is not straightforward. Moreover, glmnet implements a path algorithm for estimating the regression coefficients at various values of the tuning parameter, which is not easily extendable to utilize stochastic optimization. Therefore, to obtain point estimates for stochastic optimization based approaches for the LASSO, one would have to perform cross-validation to find the optimal value of the tuning parameter, an issue that does not exist for variational algorithms.

We conducted two sets of timing experiments. First, we fixed N at 1000 and varied P from 100 to 800 (with 80% of the coefficients set to zero) to estimate how our methods scale with the number of predictors. Second, we fixed P at 100 and varied N from 1000 to 50000 to estimate how our methods scale with dataset size. We simulated five different datasets and calculated the time to convergence for cv.glmnet() and the CAVI algorithms with a fixed relative tolerance. We include the SVI algorithms only in the varying N case for the binary regressions, since the primary reason to use SVI is its applicability in large N settings where there are a large number of local variables that need to be updated.

Figure 1 shows our results for the timing experiments when N is fixed and P is varied. For the linear models, we see that the time to convergence increases for each model as the number of predictors is increased. Interestingly, we see that, for the settings considered, our CAVI implementations for normal linear models converge faster than the path algorithm implemented in cv.glmnet(). It can also be seen that assuming an independence structure among the regression coefficients can lead to computational benefits as the number of predictors increases. In the case of the probit regression models, we see, as expected, that the time to convergence for the CAVI algorithms increases with the number of predictors. However, perplexingly, the cv.glmnet() implementation does not seem to increase in computational complexity as the number of predictors increases.

Figure 2 shows the time to convergence when P is fixed and N is varied.
Figure 1: Timing results as the number of predictors is varied with the sample size fixed. Panels show the normal and binary models; time to convergence (seconds) is plotted for CAVI-Corr, CAVI-Indep, and GLM, with mean and median times indicated.
Figure 2: Timing results when the sample size is varied with the number of predictors fixed at 100. Panels show the normal and binary models; time to convergence (seconds) is plotted for CAVI-Corr, CAVI-Indep, GLM, SVI-Corr, and SVI-Indep, with mean and median times indicated.

We see that, in the normal linear case, the CAVI implementations remain competitive with cv.glmnet() as the sample size increases. It should be noted that the CAVI-Indep implementation is slower than CAVI-Corr because a large error term needs to be calculated for each predictor in each iteration of the algorithm. In the binary regression cases, the CAVI implementations were substantially slower than the glmnet implementations and were omitted from the plot. However, since Bayesian binary regression is an appropriate use case for the SVI algorithm even when the data fits in memory, we see that the use of such algorithms for a fixed batch size and number of iterations does not lead to an increase in computation time as the size of the dataset increases.
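Timing comparisons of this kind are easy to reproduce on a given machine; the sketch below measures one fit of cv.glmnet() against one fit of lm_ridge_cavi() on a single simulated dataset. Actual timings will vary with hardware, convergence tolerances, and the BLAS configuration, so this is only meant as a template rather than a replication of the results above.

# Rough single-run timing comparison on one simulated dataset.
library(glmnet)
library(vir)
set.seed(1)
N <- 5000; P <- 100
X <- matrix(rnorm(N * P), N, P)
b <- c(rnorm(20), rep(0, P - 20))
y <- as.vector(X %*% b + rnorm(N))
system.time(cv.glmnet(X, y, alpha = 1))                          # LASSO path + CV
system.time(lm_ridge_cavi(y, X, n_iter = 1000, rel_tol = 1e-4))  # CAVI ridge fit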
7 Discussion

In this chapter we proposed a new R package, vir, for computationally efficient Bayesian linear regression with shrinkage priors. We compared its performance to glmnet, which is one of the most widely used tools for quickly performing variable selection and prediction in linear regression.

We conducted a simulation study which showed that variational algorithms can be relied upon for variable selection and that their performance is comparable to MCMC based approaches for parameter estimation and uncertainty quantification in normal linear models. Hence, we have provided a new tool in the Bayesian toolbox for quickly exploring a large number of models before doing a final MCMC based analysis. Second, our timing comparisons showed situations in which our package outperforms state of the art approaches in glmnet, while at the same time providing approximate uncertainty quantification for the coefficients. Additionally, in the normal linear model case, our approach does not require the use of the bootstrap to do uncertainty quantification, which creates substantial computational overhead for frequentist penalization-based approaches to linear regression.

Future work will focus on increasing the number of variational algorithms implemented in our package. At present, we focused on linear and binary regression, and will extend the algorithms to count and survival data. We also aim to add additional methods for clustering with Dirichlet processes and non-parametric regression models. Finally, while variational algorithms are one way to approximate a posterior distribution for large datasets, our long term goal is to create a package which solves common problems using a variety of methodology for big data analysis in Bayesian models (stochastic gradient MCMC, GPU acceleration, etc.).

References
Amari, S.-I. (1998). Natural gradient works efficiently in learning. Neural Computation.

Bates, D. and Eddelbuettel, D. (2013). Fast and elegant numerical linear algebra using the RcppEigen package. Journal of Statistical Software.

Bernardo, J. M. and Smith, A. F. M. (2000). Bayesian Theory, volume 405. John Wiley & Sons.

Betancourt, M. (2017). A conceptual introduction to Hamiltonian Monte Carlo. arXiv preprint arXiv:1701.02434.

Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.

Blei, D. M., Kucukelbir, A., and McAuliffe, J. D. (2017a). Variational inference: A review for statisticians. Journal of the American Statistical Association.

Blei, D. M., Kucukelbir, A., and McAuliffe, J. D. (2017b). Variational inference: A review for statisticians. Journal of the American Statistical Association.

Brooks, S., Gelman, A., Jones, G., and Meng, X.-L. (2011). Handbook of Markov Chain Monte Carlo. CRC Press.

Carbonetto, P., Stephens, M., et al. (2012). Scalable variational inference for Bayesian variable selection in regression, and its accuracy in genetic association studies. Bayesian Analysis.

Carvalho, C. M., Polson, N. G., and Scott, J. G. (2010). The horseshoe estimator for sparse signals. Biometrika.

Chen, Y.-C., Wang, Y. S., and Erosheva, E. A. (2018). On the use of bootstrap with variational inference: Theory, interpretation, and a two-sample test example. The Annals of Applied Statistics.

Eddelbuettel, D. and Francois, R. (2011). Rcpp: Seamless R and C++ integration. Journal of Statistical Software.

Fasano, A., Durante, D., and Zanella, G. (2019). Scalable and accurate variational Bayes for high-dimensional binary regression models. arXiv preprint.

Gelfand, A. E. and Smith, A. F. (1990). Sampling-based approaches to calculating marginal densities. Journal of the American Statistical Association.

Geman, S. and Geman, D. (1984). Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 721–741.

Guhaniyogi, R. and Dunson, D. B. (2015). Bayesian compressed regression. Journal of the American Statistical Association.

Hastings, W. K. (1970). Monte Carlo sampling methods using Markov chains and their applications. Biometrika.

Hoffman, M. D., Blei, D. M., Wang, C., and Paisley, J. (2013). Stochastic variational inference. Journal of Machine Learning Research.

Hoffman, M. D. and Gelman, A. (2014). The No-U-Turn sampler: Adaptively setting path lengths in Hamiltonian Monte Carlo. The Journal of Machine Learning Research.

Holmes, C. C. and Held, L. (2006). Bayesian auxiliary variable models for binary and multinomial regression. Bayesian Analysis.

Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Kucukelbir, A., Tran, D., Ranganath, R., Gelman, A., and Blei, D. M. (2017). Automatic differentiation variational inference. Journal of Machine Learning Research.

Ma, Y.-A., Chen, T., and Fox, E. (2015). A complete recipe for stochastic gradient MCMC. In Advances in Neural Information Processing Systems, pages 2917–2925.

Makalic, E. and Schmidt, D. F. (2015). A simple sampler for the horseshoe estimator. IEEE Signal Processing Letters.

Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H., and Teller, E. (1953). Equation of state calculations by fast computing machines. The Journal of Chemical Physics.

Murphy, K. P. (2012). Machine Learning: A Probabilistic Perspective. MIT Press.

Park, T. and Casella, G. (2008). The Bayesian lasso. Journal of the American Statistical Association.

Plummer, M., Stukalov, A., and Denwood, M. (2019). rjags: Bayesian graphical models using MCMC. R package.

Quiroz, M., Kohn, R., Villani, M., and Tran, M.-N. (2018). Speeding up MCMC by efficient data subsampling. Journal of the American Statistical Association.

Rand, W. M. (1971). Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association.

Robert, C. and Casella, G. (2013). Monte Carlo Statistical Methods. Springer Science & Business Media.

Saibaba, A. K., Bardsley, J., Brown, D. A., and Alexanderian, A. (2019). Efficient marginalization-based MCMC methods for hierarchical Bayesian inverse problems. SIAM/ASA Journal on Uncertainty Quantification.

Schaul, T., Zhang, S., and LeCun, Y. (2013). No more pesky learning rates. In International Conference on Machine Learning, pages 343–351.

Scott, S. L., Blocker, A. W., Bonassi, F. V., Chipman, H. A., George, E. I., and McCulloch, R. E. (2016). Bayes and big data: The consensus Monte Carlo algorithm. International Journal of Management Science and Engineering Management.

Simon, N., Friedman, J., Hastie, T., and Tibshirani, R. (2011). Regularization paths for Cox's proportional hazards model via coordinate descent. Journal of Statistical Software.

Srivastava, S., Cevher, V., Dinh, Q., and Dunson, D. B. (2015). WASP: Scalable Bayes via barycenters of subset posteriors. In Artificial Intelligence and Statistics, pages 912–920.

Stan Development Team (2018). RStan: the R interface to Stan. R package version 2.17.3.

Sturtz, S., Ligges, U., and Gelman, A. E. (2005). R2WinBUGS: A package for running WinBUGS from R. Journal of Statistical Software.

Terenin, A., Dong, S., and Draper, D. (2019). GPU-accelerated Gibbs sampling: A case study of the horseshoe probit model. Statistics and Computing.

Terenin, A., Simpson, D., and Draper, D. (2020). Asynchronous Gibbs sampling. In Artificial Intelligence and Statistics.