Model Based Screening Embedded Bayesian Variable Selection for Ultra-high Dimensional Settings
Dongjin Li, Somak Dutta and Vivekananda Roy∗
Department of Statistics, Iowa State University, Ames, IA 50010.
Abstract
We develop a Bayesian variable selection method, called SVEN, based on a hierarchical Gaussian linear model with priors placed on the regression coefficients as well as on the model space. Sparsity is achieved by using degenerate spike priors on inactive variables, whereas Gaussian slab priors are placed on the coefficients of the important predictors, making the posterior probability of a model available in explicit form (up to a normalizing constant). Strong model selection consistency is shown to be attained when the number of predictors grows nearly exponentially with the sample size, and even when the norm of mean effects solely due to the unimportant variables diverges, which is a novel attractive feature. An appealing byproduct of SVEN is the construction of novel model weight adjusted prediction intervals. Embedding a unique model based screening and using fast Cholesky updates, SVEN produces a highly scalable computational framework to explore gigantic model spaces, rapidly identify the regions of high posterior probability, and make fast inference and prediction. A temperature schedule guided by our model selection consistency derivations is used to further mitigate multimodal posterior distributions. The performance of SVEN is demonstrated through a number of simulation experiments and a real data example from a genome wide association study with over half a million markers.
Key words:
GWAS, hierarchical model, posterior prediction, shrinkage, spike and slab, stochastic search, subset selection.

∗Corresponding address: Department of Statistics, 2438 Osborn Dr, Ames, IA, 50011.

1 Introduction
In almost every scientific discipline, rapid collection of sophisticated data has been booming due to recent advancements in technology. In biology, for example, automated sequencing tools have made whole genome sequencing possible in a cost effective manner, thus providing variations of millions of single nucleotides between individuals. On the other hand, because phenotypic data are typically collected via carefully conducted scientific experiments or other observational studies, the number of observations remains on the smaller side, giving rise to regression problems where the number of variables p far exceeds the sample size n. Nevertheless, only a few of these variables are believed to be associated with the response. Thus, variable selection plays a crucial role in modern scientific discoveries.

Classical approaches to variable selection problems are based on regularization methods. A variety of methods using different penalization techniques have been proposed for variable selection in linear models, such as the lasso (Tibshirani, 1996; Datta and Zou, 2017), SCAD (Fan and Li, 2001; Kim et al., 2008), the elastic net (Zou and Hastie, 2005), the adaptive lasso (Zou, 2006), the octagonal shrinkage and clustering algorithm for regression (Bondell and Reich, 2008), L0-penalized best subset regression (Bertsimas et al., 2016; Huang et al., 2018) and others. These methods achieve sparsity by penalizing either the effect sizes or the model sizes, but rarely both. Several Bayesian variable selection methods exploit the connection between penalized estimators and the modes of Bayesian posterior densities under suitably chosen prior distributions on the regression coefficients. Examples include the lasso-Laplace prior connection (Tibshirani, 1996), the hierarchical Bayesian lasso (Park and Casella, 2008) and other works by Kyung et al. (2010), Xu and Ghosh (2015) and Roy and Chakraborty (2017).

Another popular approach to Bayesian variable selection is to integrate the penalties on the effect size and the model size via prior distributions. To that end, auxiliary indicator variables indicating the presence or absence of each variable are introduced to obtain a 'spike and slab' prior on the regression coefficients. Here the 'spike' corresponds to the probability mass concentrated at zero or around zero for the variables vulnerable to deletion, and the 'slab' specifies prior uncertainty for the coefficients of the other variables. Analysis using such models determines (selects) the most promising variables by summarizing the posterior density of the indicator variables and/or the regression coefficients. The seminal works of Mitchell and Beauchamp (1988) and George and McCulloch (1993, 1997) developed a hierarchy of priors over the regression coefficients and the latent indicators and used the Gibbs sampler to identify promising models in the low dimensional setup (see also Yuan and Lin, 2005; Ishwaran and Rao, 2005; Liang et al., 2008; Johnson and Rossell, 2012). Several of these methods have been recently modified and extended to the ultra-high dimensional setup. Narisetty and He (2014) pioneered the theoretical study of Bayesian variable selection in the ultra-high dimensional setup, Ročková and George (2014) introduced an EM algorithm for fast exploration of high-posterior models, Yang et al.
(2016) studied model selection consistency and computational complexity when a g-prior is placed on the regression coefficients, Shin et al. (2018) extended the popular non-local priors to model selection and modified the stochastic shotgun model search algorithm (Hans et al., 2007), while Zhou and Guan (2019) and Zanella and Roberts (2019) implemented Metropolis-Hastings algorithms with an iterative complex factorization and a tempered Gibbs sampler, respectively, for estimating posterior model probabilities.

From a practical standpoint, in the ultra-high dimensional setup, where the number of variables (p) is much larger than the sample size (n), variable screening is generally performed to reduce the number of variables before applying any of the aforementioned variable selection methods for choosing important variables. The classical approaches, as well as Narisetty and He (2014), resort to a two stage procedure where they first use frequentist screening algorithms (Fan and Lv, 2008; Wang and Leng, 2016) to reduce the dimension of the problem and then perform variable selection. Shin et al. (2018), as well as Cao et al. (2020), fuse frequentist iterated sure independence screening into their stochastic search algorithms. However, these screening methods are frequentist procedures that are not guaranteed to be faithful to the Bayesian model in practice.

In this work, we extend the classical variable selection model of Mitchell and Beauchamp (1988) to the ultra-high dimensional setting. Following the path laid by Narisetty and He (2014), we derive posterior consistency results. By considering zero (exact spike) inflated mixture priors for the regression coefficients, we are able to introduce sparsity and relax some assumptions of Narisetty and He (2014). Furthermore, we develop a novel methodology for variable selection in the spirit of the stochastic shotgun search algorithm (Hans et al., 2007) with an embedded screening that is faithful to the hierarchical Bayesian model. We develop a sophisticated computational framework that allows us to consider larger search neighborhoods and compute exact unnormalized posterior probabilities, in contrast to Shin et al. (2018). Furthermore, in order to recover models with large posterior probabilities and mitigate the posterior multimodality associated with variable selection models, we use a temperature schedule that is guided by our posterior model selection consistency asymptotics. We call this Bayesian method and the computational framework selection of variables with embedded screening (SVEN). Keeping prediction of future observations in mind, we develop novel methods for computing an approximate posterior predictive distribution and prediction intervals. In particular, using SVEN we construct two types of prediction intervals, called Z-prediction intervals and Monte Carlo prediction intervals.

The rest of the paper is laid out as follows. In Section 2 we describe the hierarchical Bayesian variable selection model and prove strong model selection consistency results (Section 2.1), and we develop the SVEN framework (Section 2.2) and prediction methods (Section 3). We perform detailed simulation studies in Section 4 and compare our methods to several other popular Bayesian and frequentist methods. In Section 5 we analyze a massive dataset from an agricultural experiment, with n in the thousands and p over 546,000, where among the Bayesian methods used for comparison only our method is able to perform variable selection on the whole data.
We also show the practical usefulness of our method in obtaining the posterior predictive distribution and prediction intervals for the yield of novel crop varieties. We conclude in Section 6 with some discussion and future research directions. A supplementary document containing the proofs of the theoretical results and some computational details is available, with its sections referenced here with the prefix 'S'. The methodology proposed here is implemented in an accompanying R package 'bravo' for Bayesian screening and variable selection.

2 Bayesian variable selection with embedded screening

2.1 The hierarchical model

2.1.1 Model specification and the posterior of the models

Let y = (y_1, ..., y_n) denote an n × 1 vector of response values and Z = (Z_1, ..., Z_p) an n × p design matrix of p potential predictors, with vector of partial regression coefficients µ ≡ (µ_1, ..., µ_p). We use a latent indicator vector γ = (γ_1, ..., γ_p) ∈ {0, 1}^p to denote a model such that the jth predictor is included in the regression model if and only if γ_j = 1. Corresponding to this binary vector, the size of a model γ is denoted by |γ| = Σ_{j=1}^p γ_j. Also, for a model γ, let Z_γ be the n × |γ| sub-matrix of Z that consists of the columns of Z corresponding to model γ, and let µ_γ be the vector that contains the regression coefficients for model γ. In the first stage of the Bayesian hierarchical mixture model we assume that the conditional distribution of y given Z, γ, µ_0, µ and σ² is n-dimensional Gaussian, given by

y | Z, γ, µ_0, µ, σ² ∼ N_n(µ_0 1_n + Z_γ µ_γ, σ² I_n),   (1)

where µ_0 is the intercept term and σ² > 0 is the conditional variance. Thus (1) indicates that each γ corresponds to a Gaussian linear regression model y = µ_0 1_n + Z_γ µ_γ + ε, where the residual vector ε ∼ N_n(0, σ² I_n). However, because the original covariates could have unbalanced scales, a common approach is to reparameterize the above model using a scaled covariate matrix. To that end, suppose Z̄ is the vector of column means of Z and D is the p × p diagonal matrix whose ith diagonal entry is the sample standard deviation of Z_i (the ith column of Z), and let X = (Z − 1_n Z̄⊤) D^{−1} denote the scaled covariate matrix. Also let β = Dµ and β_0 = µ_0 + Z̄⊤µ. The Bayesian hierarchical regression model after reparameterization is given by

y | β, β_0, σ², γ ∼ N_n(1_n β_0 + X_γ β_γ, σ² I),   (2a)
β_j | β_0, σ², γ ∼ind N(0, γ_j λ^{−1} σ²) for j = 1, ..., p,   (2b)
(β_0, σ²) | γ ∼ f(β_0, σ²) ∝ 1/σ²,   (2c)
γ | w ∼ f(γ | w) = w^{|γ|} (1 − w)^{p − |γ|}.   (2d)

In this hierarchical setup a popular non-informative prior is set for (β_0, σ²) in (2c), and a conjugate independent normal prior is used on β given γ in (2b), with λ > 0 controlling the precision of the prior independently of the scales of measurement. Note that under this prior, if a covariate is not included in the model, the prior on the corresponding regression coefficient degenerates at zero. In (2d) an independent Bernoulli prior is set for γ, where w ∈ (0, 1) reflects the prior inclusion probability of each predictor. We assume λ and w are known non-random functions of n and p.

The hierarchical model (2) with centered X allows us to obtain the distribution of y given γ in closed form by integrating out β_0, β_γ and σ² (Roy et al., 2018, section S6).
Consequently, the marginal likelihood function of γ is given by

L(γ | y) = ∫_{R_+} ∫_{R^{|γ|}} ∫_R f(y | γ, σ², β_0, β_γ) f(β_γ | γ, σ², β_0) f(σ², β_0) dβ_0 dβ_γ dσ² = c_n λ^{|γ|/2} |A_γ|^{−1/2} R_γ^{−(n−1)/2},   (3)

where A_γ = X⊤_γ X_γ + λI, |A_γ| is the determinant of A_γ,

R_γ = ỹ⊤ỹ − ỹ⊤ X_γ A_γ^{−1} X⊤_γ ỹ = ỹ⊤ỹ − β̃⊤_γ A_γ β̃_γ = ỹ⊤ (I + λ^{−1} X_γ X⊤_γ)^{−1} ỹ   (4)

is the ridge residual sum of squares, ỹ = y − ȳ 1_n, ȳ = Σ_{i=1}^n y_i / n, β̃_γ = A_γ^{−1} X⊤_γ ỹ, and c_n = Γ((n−1)/2)/π^{(n−1)/2} is the normalizing constant.

In order to identify the important variables, we use the (marginal) posterior distribution of γ. Thanks to the explicit form of the marginal likelihood (3), this posterior density is given by

f(γ | y) ∝ f(y | γ) f(γ) ∝ λ^{|γ|/2} |A_γ|^{−1/2} R_γ^{−(n−1)/2} w^{|γ|} (1 − w)^{p−|γ|}.

It is often convenient to work with the log of the posterior density, which is given by

log f(γ | y) = const + (|γ|/2) log λ − (1/2) log |A_γ| − ((n−1)/2) log R_γ + |γ| log(w/(1 − w)).   (5)
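To make (5) concrete, the following is a minimal R sketch (ours, not the bravo implementation; the function name and arguments are our own) evaluating the unnormalized log posterior of a given model. It assumes X has centered columns scaled so that X⊤_j X_j = n, and that γ is supplied as a nonempty vector of column indices.

```r
log_post <- function(X, y, gamma, lambda, w) {
  n <- nrow(X)
  ytilde <- y - mean(y)
  Xg <- X[, gamma, drop = FALSE]
  U <- chol(crossprod(Xg) + lambda * diag(length(gamma)))    # A_gamma = U'U
  v <- backsolve(U, crossprod(Xg, ytilde), transpose = TRUE) # U^{-T} Xg' ytilde
  Rg <- sum(ytilde^2) - sum(v^2)                             # ridge residual SS, eq. (4)
  0.5 * length(gamma) * log(lambda) - sum(log(diag(U))) -    # (|g|/2) log(lambda) - (1/2) log|A_gamma|
    0.5 * (n - 1) * log(Rg) + length(gamma) * log(w / (1 - w))
}
```

The Cholesky factor U gives both log |A_γ| (twice the sum of the log diagonal entries) and R_γ, so a single factorization suffices; Section 2.2 exploits the same idea much more aggressively.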
Remark 1. It is important to note that the regression model (1) on the original covariate scale should be used for prediction, instead of (2), because the hierarchical prior (2b) is defined under the assumption that 1⊤_n X = 0 and X⊤_j X_j = n for all j.

Remark 2. In this work we assume w is fixed. However, a popular alternative is to assign a Beta prior on w, i.e., let w ∼ f(w) ∝ w^{a−1} (1 − w)^{b−1} for some a, b > 0. Then it is possible to integrate out w from (2d) to obtain the marginal prior distribution of γ given by f(γ) = B(|γ| + a, p − |γ| + b)/B(a, b), where B(·, ·) is the beta function. This replaces the last term in (5) by log f(γ).

Remark 3. As an alternative to the independent normal prior (2b), it is also possible to consider Zellner's g-prior (Zellner, 1986) on β_γ given by β_γ | γ, σ² ∼ N_{|γ|}(0, g σ² (X⊤_γ X_γ)^{−1}), provided that for every k ≤ n − 1 all n × k submatrices of X have full column rank and we restrict the support of the prior distribution on γ to models of size at most n − 1. Assuming that g is a non-random function of n and p, the marginal posterior of γ is then given by

f_g(γ | y) ∝ [ỹ⊤ỹ − (g/(g + 1)) ỹ⊤ X_γ (X⊤_γ X_γ)^{−1} X⊤_γ ỹ]^{−(n−1)/2} w^{|γ|} (1 − w)^{p−|γ|} (1 + g)^{−|γ|/2} I(|γ| < n),

where the priors on β_0 and σ² are assumed to be the same as in (2c).

Ideally, as the sample size increases we would like the posterior of γ to concentrate more and more on the important variables. Several works have alluded to asymptotic guarantees for strong model selection consistency in ultra-high dimensional regression, where p is allowed to vary subexponentially with n, i.e., min{n, p} → ∞ and (log p)/n → 0. Here, strong model selection consistency means that the posterior probability of the true set of variables converges to 1 as n tends to infinity. Under shrinking and diffusing priors, Narisetty and He (2014) developed explicit scaling laws for the hyper-parameters that are sufficient for strong model selection consistency. On the other hand, Shin et al. (2018), and more recently Cao et al. (2020), established sufficient conditions for strong model selection consistency under non-local type priors (Johnson and Rossell, 2012). The Bayesian hierarchical model (2) is similar to Narisetty and He's (2014) model with the crucial distinction that the spike prior is degenerate: P(β_i = 0 | γ_i = 0) = 1. Consequently, although most assumptions used here for selection consistency are similar to those made by Narisetty and He (2014), we are able to relax some of the conditions to allow for more noisy unimportant variables. In the next section we describe strong model selection consistency results for (2).
2.1.2 Strong model selection consistency

We consider the ultra-high dimensional setting where the number of variables p is allowed to vary subexponentially with the sample size. As established by Narisetty and He (2014), the slab precision λ also needs to vary with n for strong model selection consistency. In order to state the assumptions and the main results, we use the following notation. Abusing notation, we interchangeably use a model γ either as a p-dimensional binary vector or as the set of indices of the non-zero entries of the binary vector. For models γ and s, γ^c denotes the complement of the model γ, and γ ∨ s and γ ∧ s denote the union and intersection of γ and s, respectively. For two real sequences (a_n) and (b_n), a_n ∼ b_n means a_n/b_n → c for some constant c > 0; a_n ≽ b_n (or b_n ≼ a_n) means b_n = O(a_n); a_n ≻ b_n (or b_n ≺ a_n) means b_n = o(a_n). Also, for any matrix A, let α_min(A) and α_max(A) denote its minimum and maximum eigenvalues, respectively, and let α*_min(A) be its minimum nonzero eigenvalue. Again abusing notation, for two real numbers a and b, a ∨ b and a ∧ b denote max(a, b) and min(a, b), respectively. Define r_γ = rank(X_γ) and, for ν > 0, r*_γ = r_γ ∧ u_n(ν), where

u_n(ν) = p ∧ (n / ((2 + ν) log p))  and  η_nm(ν) = inf_{|γ| ≤ u_n(ν)} α*_min(X⊤_γ X_γ / n).

Finally, with t denoting the true model and P_γ the orthogonal projection onto the column space of X_γ, for any fixed positive integer J define ∆_n(J) = inf{ ‖(I − P_γ) X_t β_t‖² : |γ| ≤ J|t|, t ⊄ γ }. We assume the following conditions, in which δ > 0 is a fixed constant.

C1. min{n, p} → ∞ and (log p)/n → 0.

C2. The slab precision λ varies with n in such a way that the conditional prior variance σ²/λ of an active coefficient diverges, at a rate governed by δ.

C3. ‖X_{t^c} β_{t^c}‖² = O(log p), where β_{t^c} collects the (small) coefficients of the unimportant variables.
C4. For δ given in C2, there exists J > 1/δ such that ∆_n(J) ≻ log(√n ∨ p), and for some ν < δ and κ < (J − 1)δ/2,

η_nm(ν) ≽ ((n ∨ p^δ)/(n/λ)) ∨ p^{−κ}.

C5.
For some positive constants a and b,

a < α_min(X⊤_t X_t / n) ≤ α_max(X⊤_t X_t / n) < b for all n.

Condition C2 states that the conditional distribution of β_i given γ_i = 1 is diffuse, in the sense that its conditional prior variance goes to infinity at a particular rate. Condition C3 greatly relaxes the boundedness assumption on ‖X_{t^c} β_{t^c}‖ in Narisetty and He (2014) by slightly strengthening the identifiability condition C4. Yang et al. (2016) obtained similar results under g-priors on β, but, as mentioned by them, our independence prior is 'a more realistic choice'. Moreover, Yang et al. (2016) assumed that α_min(X⊤_γ X_γ / n) is bounded away from zero for all models γ of size at most O(n / log p), which is unrealistic because, for example, even when the entries of X are iid N(0, 1), inf_{1 ≤ i ≤ p} X⊤_i X_i / n converges to zero in probability. Because of the degenerate form of the spike priors, the regularity assumptions on the submatrices of the design matrix X in C4 relax the assumptions on the bound on their largest eigenvalues. Narisetty and He (2014) showed that if the rows of X are independent isotropic sub-Gaussian random vectors then C4 holds with overwhelmingly large probability (see also Chen and Chen, 2008; Kim et al., 2012; Shin et al., 2018). The regularity assumption C5 on the true model is standard and has been used in both Narisetty and He (2014) and Cao et al. (2020) without being explicitly stated.

Note that condition C3 does not explicitly specify the true model t, and the relaxation allowing higher noise ‖X_{t^c} β_{t^c}‖ warrants a validation of the identifiability of t. To that end, suppose on the contrary that it is possible to include some variables, say s from t^c, into the true model and still maintain conditions C1–C5 for both t and t ∨ s as true models for every n. Then condition C4 with γ = t (now excluding the apparently true variables s) would imply

‖(I − P_t) X_s β_s‖² = ‖(I − P_t)(X_t β_t + X_s β_s)‖² = ‖(I − P_t) X_{t∨s} β_{t∨s}‖² ≻ log(p ∨ √n).

Here the first equality follows from the fact that P_t X_t = X_t. But because I − P_t is symmetric and idempotent,

‖X_s β_s‖ ≥ ‖(I − P_t) X_s β_s‖ ≻ √(log(p ∨ √n)) ≥ √(log p).   (6)

However, condition C3 for t ∨ s implies ‖X_{t^c ∧ s^c} β_{t^c ∧ s^c}‖ ≼ √(log p). Together with (6) this implies that

‖X_{t^c} β_{t^c}‖ = ‖X_s β_s + X_{t^c ∧ s^c} β_{t^c ∧ s^c}‖ ≥ ‖X_s β_s‖ − ‖X_{t^c ∧ s^c} β_{t^c ∧ s^c}‖ ≻ √(log p),

which contradicts condition C3. We now present the strong model selection consistency results.

Theorem 1.
Assume conditions C1–C5 hold and that σ² is known. Then the posterior probability of the true model, f(t | y, σ²), converges to 1 in probability as the sample size n approaches ∞.

Proof. The proof is given in Section S4 of the supplementary materials.

Note that the statement of Theorem 1 is equivalent to (1 − f(t | y, σ²))/f(t | y, σ²) → 0 in probability as n → ∞. The proof of Theorem 1 also provides the rate of convergence, given by

(1 − f(t | y, σ²))/f(t | y, σ²) ≼ exp{−v n} + ρ_n + ρ_n^{(J−1)|t|/2} + exp{−v′(∆_n(J) − v″ log(√n ∨ p))}

with probability greater than 1 − [2 exp{−c n} + 2 exp{−c′ log p} + exp{−c″ ∆_n(J)}] for some positive constants v, v′, v″, c, c′ and c″, where ρ_n = p^{−δ/2} ∧ (p^{δ/2}/√n). It is encouraging that, despite relaxing the boundedness condition on ‖X_{t^c} β_{t^c}‖, the rate of convergence remains the same as in Narisetty and He (2014).

However, in practice σ² is typically unknown. In this case we need a further assumption, which assigns prior probability zero to the set M̃ = {γ : r_γ > r_t + n/[(2 + ν′) log p]} for some ν′ > ν ∨ (2/δ).
C6. For some ν > 0 and ν′ > ν ∨ (2/δ), P(γ ∈ M̃) = 0.

This condition is the same as in Narisetty and He (2014) and is also equivalent to the assumptions on the prior model sizes in Shin et al. (2018) and Cao et al. (2020).
Theorem 2.
Assume conditions C1–C6 hold. Then the posterior probability of the true model, f(t | y), converges to 1 in probability as the sample size n approaches ∞.

Proof. The proof is given in Section S5 of the supplementary materials.

Note that the strong consistency results also imply that, with probability tending to one, the true model is the posterior mode, that is, P(t = arg max_γ f(γ | y)) → 1 as n → ∞. However, in finite samples this need not be true. Furthermore, when the regularity conditions do not hold, there may be multiple models with large posterior probabilities even for large n. Thus, we would like to discover models with practically large posterior probability values. However, in ultra-high dimensional problems, traditional computational methods based on Markov chain Monte Carlo (MCMC) algorithms have poor performance. Thus we next describe SVEN to explore the posterior distribution f(γ | y). In particular, SVEN will be used to discover high probability regions and to find the maximum a posteriori (MAP) model arg max_γ f(γ | y).

2.2 The SVEN framework

2.2.1 Stochastic shotgun search algorithms

Hans et al. (2007) proposed the stochastic shotgun search (SSS) algorithm for recovering models with large posterior probabilities. To that end, for a given model γ let nbd(γ) = γ⁺ ∪ γ° ∪ γ⁻ denote a neighborhood of γ, where γ⁺ is an "added" set containing all the models with one of the p − |γ| remaining covariates added to the current model γ, γ⁻ is a "deleted" set obtained by removing one variable from γ, and γ° is a "swapped" set containing the models with one of the variables from γ replaced by one variable from γ^c. The SSS algorithm starts with an initial model g^{(0)}, and for k = 1, 2, ...:

- (SSS1) Compute f(γ | y) for all γ ∈ nbd(g^{(k−1)}).
- (SSS2) Separately sample s⁺ from g^{(k−1)+}, s° from g^{(k−1)°} and s⁻ from g^{(k−1)−}, with probabilities proportional to f(· | y).
- (SSS3) Sample g^{(k)} from {s⁺, s°, s⁻} with probabilities proportional to f(s⁺ | y), f(s° | y) and f(s⁻ | y), respectively.

After running for some prespecified large number of iterations, the algorithm declares the discovered model with the largest (unnormalized) posterior probability as the MAP model. Hans et al. (2007) note that the sampling probabilities in (SSS1) and (SSS2) can be replaced by the Bayesian information criterion (BIC) and that the sampling weights can be computed in parallel.

Following the success of SSS, Shin et al. (2018) proposed further improvements. Note that Shin et al. (2018) use non-local priors, and so the posterior probabilities f(γ | y) are not available analytically. In fact, they resort to a computationally expensive Laplace approximation, which suggests that exact numerical computation of these quantities is also not straightforward (see also Cao et al., 2020). Also, in ultra-high dimensional problems, SSS may not be scalable due to its implementation. Thus Shin et al. (2018) propose a simplified stochastic shotgun search with screening (S5), obtained by dropping the "swapped" set from consideration and, moreover, by screening out variables from the "added" set. (Note that, in high dimensions, the number of "swapped" models is much larger than the numbers of "added" and "deleted" models.) For screening, borrowing ideas from the frequentist correlation screening of Fan and Lv (2008), they propose computing the least squares residuals from a regression of y on X_γ and computing the absolute correlations between each column of X_{γ^c} and the residuals.
They then propose keeping those models in the "added" set corresponding to the largest few of the absolute correlations. This greatly reduces the burden of computing f(γ | y) for all γ in the "added" set. However, in their R package BayesS5, the authors have used ridge residuals with unit ridge penalty instead of the least squares residuals. Nevertheless, the S5 algorithm has been useful for exploring the posterior distribution of γ (Cao et al., 2020).

2.2.2 The SVEN algorithm

In the variable selection model (2), the Gaussian conjugacy provides analytically tractable forms for f(γ | y) up to a normalizing constant. We also show that f(γ | y) can be rapidly computed for the swapped models, thereby allowing us to include the swapped models in the neighborhood. We thus develop a stochastic shotgun algorithm with (posterior) model based screening, and we develop scalable statistical computations for drawing fast Bayesian inference and prediction.

In order to describe the SVEN algorithm, we first describe how to compute the unnormalized posterior probabilities in the (SSS1) step. To that end, compute ζ = X⊤ỹ as D^{−1} Z⊤ỹ once and for all. Next, suppose we have a current model γ and we want to compute the posterior probabilities of each model in γ⁺. Suppose U_γ is the upper triangular Cholesky factor of X⊤_γ X_γ + λI and let v_γ = U_γ^{−⊤} X⊤_γ ỹ. In the algorithm below, scalar addition to a vector, division between two vectors and other arithmetical and algebraic operations on vectors are interpreted as entry-wise operations, as implemented in most statistical software (e.g., in R). Then:

1. Compute S ← U_γ^{−⊤} X⊤_γ by using forward substitution.
2. Update S ← S Z D^{−1}. [There is no need to center Z here because the rows of S are orthogonal to 1_n.]
3. Compute S₁ as the vector of sums of squares of each column of S. Note that S is a |γ| × p matrix, so these sums of squares should be computed without storing another |γ| × p matrix.
4. Set S₂ ← √(n + λ − S₁), where the arithmetical operations are performed entrywise on the vector. In this operation, the entries corresponding to the variables in γ are ignored.
5. Compute S₃ ← (ζ − S⊤ v_γ)/S₂.
6. Compute S₄ ← log det(U_γ) + log S₂.
7. Compute S₅ ← ‖ỹ‖² − ‖v_γ‖² − S₃².
8. Compute S₆ ← 0.5(|γ| + 1) log λ − S₄ − 0.5(n − 1) log S₅ + (|γ| + 1) log(w/(1 − w)).

Then for all i ∉ γ, the ith entry of S₆ above contains the unnormalized log posterior probability (5) of the model obtained by including variable i in γ; the other entries are ignored. A self-contained sketch of these steps is given below.
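The following is a minimal R sketch (ours, not the bravo implementation) of steps 1–8. For simplicity it multiplies by the dense standardized X in step 2 instead of by Z D^{−1}, and it assumes |γ| ≥ 1 and that diag(X⊤X) = n.

```r
log_post_added <- function(X, ytilde, gamma, lambda, w) {
  n <- nrow(X)
  zeta <- drop(crossprod(X, ytilde))                        # X' ytilde, computed once overall
  Xg <- X[, gamma, drop = FALSE]
  U  <- chol(crossprod(Xg) + lambda * diag(length(gamma)))  # A_gamma = U'U
  vg <- backsolve(U, crossprod(Xg, ytilde), transpose = TRUE)
  S  <- backsolve(U, t(Xg), transpose = TRUE)               # step 1: U^{-T} Xg'
  S  <- S %*% X                                             # step 2 (X in place of Z D^{-1})
  S1 <- colSums(S^2)                                        # step 3
  S2 <- sqrt(pmax(n + lambda - S1, 0))                      # step 4: new Cholesky pivots
  S3 <- (zeta - drop(crossprod(S, vg))) / S2                # step 5: new entries of v
  S4 <- sum(log(diag(U))) + log(S2)                         # step 6: (1/2) log|A| of each added model
  S5 <- sum(ytilde^2) - sum(vg^2) - S3^2                    # step 7: ridge residual SS
  S6 <- 0.5 * (length(gamma) + 1) * log(lambda) - S4 -      # step 8: log f(. | y) from (5)
        0.5 * (n - 1) * log(S5) + (length(gamma) + 1) * log(w / (1 - w))
  S6[gamma] <- NA                                           # entries for variables already in gamma are ignored
  S6
}
```

In effect, steps 4–7 perform a rank-one Cholesky border update of A_γ for every candidate variable simultaneously, so a single triangular solve prices the entire "added" set.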
For each model in γ⁻, its posterior probability can be computed easily because |γ| is typically small. Furthermore, for each γ′ ∈ γ⁻ we can use the above algorithm to compute the unnormalized posterior probabilities of the models γ″ in γ′⁺. Thus we can compute the (unnormalized) posterior probabilities of each model in nbd(γ).

Given the current model γ, the complexity of computing the (unnormalized) f(γ′ | y) for all γ′ ∈ nbd(γ) by the above algorithm is O(|γ|² n + |γ|³ + |γ| ‖Z‖₀ + |γ| p + p), where ‖Z‖₀ denotes the number of non-zero elements of Z. Since |γ| is practically bounded, the computational complexity is simply O((n ∨ p) + ‖Z‖₀). If, in addition, Z is sparse, as in the genome-wide association study example in Section 5, the complexity of computing all posterior probabilities in nbd(γ) is linear in both n and p. Finally, note that the additional memory requirement of the above algorithm, beyond storing the Z matrix, is practically O(n ∨ p). Also, different steps, including step 2 of the above algorithm, can be performed in parallel using a distributed computing architecture.

Using the above algorithm as the foundation, we now discuss the SVEN algorithm. Suppose T₁ < T₂ < ⋯ < T_m is a given temperature schedule and let g^{(0)} denote the empty model (i.e., the model without any predictor included). Then, for i = 1, 2, ..., m:

- Set g^{(i,0)} to be the empty model. Then for k = 1, ..., N:
- (SVEN1) [Same as (SSS1)] Compute f(g′ | y) for all g′ ∈ nbd(g^{(i,k−1)}).
- (SVEN2) [Screening step] Consider at most
20 highest probability neighboring models. That is, construct the set M_k ⊆ nbd(g^{(i,k−1)}) with |M_k| ≤ 20 such that g′ ∈ M_k only if

f(g′ | y) > ϱ max_{g″ ∈ nbd(g^{(i,k−1)})} f(g″ | y)  and  f(g′ | y) ≥ f(g″ | y) for all g″ ∈ nbd(g^{(i,k−1)}) ∩ M_k^c,

where ϱ is some prespecified small number.

- (SVEN3) [Shotgun step] Assign the weight f(g′ | y)^{1/T_i} to each model g′ ∈ M_k. Sample a model from M_k using these weights and set it as g^{(i,k)}.

Our ability to efficiently compute the posterior probability of all neighboring models allows us to implement the screening (SVEN2) directly using the objective function f(γ | y). This is a key difference between SVEN and the S5 algorithm of Shin et al. (2018). Because models with large probabilities could be separated by models with very low probabilities, a temperature schedule is used. Such tempering is quite common in simulated annealing (Kirkpatrick et al., 1983) and has also been used in Shin et al. (2018). In order to choose a temperature schedule, we turn to our asymptotic results from Section S4. In particular, the theory indicates that the log-posterior probabilities of good models with small model size are separated by roughly O(log p). Thus, in order to facilitate jumps between these models, we set T_m = log p + log log p, where the additional log log p is a heuristic adjustment common in numerical computations. The remaining temperatures are chosen to be equally spaced between 1 and T_m.

Note that at every temperature we start the SVEN algorithm at the empty model, and the runs at different temperatures are kept separate. Because the stochastic shotgun might have a tendency to wander off to obscure valleys containing a large number of variables, especially under high temperature, running the temperatures separately avoids getting trapped in such a valley. Most good models have small size, and so they can be explored relatively early when the algorithm is started multiple times from the empty model.

Note that our algorithm does not require explicitly storing the matrix X. Indeed, in many applications Z is sparse and can be stored efficiently in memory, whereas the matrix X is always dense. Overall, our method is extremely memory efficient, and we are able to directly perform variable selection with significantly larger p than the other methods can handle.

In addition to the MAP model, our method also provides the posterior probabilities of all the models explored by the algorithm and facilitates approximate Bayesian model averaging (Shin et al., 2018). To that end, we sort the models {g^{(i,k)}, 1 ≤ i ≤ m, 1 ≤ k ≤ N} according to decreasing posterior probabilities and retain the best (highest probability) K models γ^{(1)}, γ^{(2)}, ..., γ^{(K)}, where K is chosen so that f(γ^{(K)} | y)/f(γ^{(1)} | y) > ε for a prespecified tolerance ε. Then we assign the weight

w_i = f(γ^{(i)} | y) / Σ_{k=1}^K f(γ^{(k)} | y)   (7)

to the model γ^{(i)}. We define the approximate marginal inclusion probability of the jth variable as π̂_j = Σ_{k=1}^K w_k I(γ^{(k)}_j = 1) and define the weighted average model (WAM) as the model containing the variables j with π̂_j > 0.5; a small sketch of these computations is given below. Note that if SVEN were allowed to run indefinitely to explore all 2^p models and ε were set to zero, then the WAM would be theoretically identical to the median probability model (Barbieri and Berger, 2004).
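As an illustration, the following small R sketch (a hypothetical helper of ours) computes the weights (7), the approximate marginal inclusion probabilities and the WAM, assuming the retained models are stored as a list of index vectors together with their unnormalized log posterior values from (5).

```r
wam <- function(models, logpost, p) {
  wts <- exp(logpost - max(logpost))     # stabilized; equivalent to using f itself in (7)
  wts <- wts / sum(wts)                  # the weights w_i of (7)
  pip <- numeric(p)                      # approximate marginal inclusion probabilities
  for (k in seq_along(models)) pip[models[[k]]] <- pip[models[[k]]] + wts[k]
  list(weights = wts, pip = pip, wam = which(pip > 0.5))
}
```

Subtracting the maximum before exponentiating avoids underflow, since the log posteriors of competing models in ultra-high dimensions are typically large negative numbers.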
However, computing the median probability model is infeasible when p ≫ n because enumerating the posterior probabilities of all models γ is practically impossible.

In the literature, mostly the MAP model (more precisely, the discovered MAP model) is used for prediction. In the next section we develop methods for point and interval predictions using the top models γ^{(k)} with their associated weights w_k.

3 Prediction

The posterior predictive distribution of the response y∗ at a new covariate vector z∗ ∈ R^p, conditional on the observed covariate matrix Z and the hyper-parameters λ and w, is given by

f(y∗ | y) = Σ_γ ∫_{S_γ} f(y∗ | z∗, γ, µ₀, µ_γ, σ²) f(γ, µ₀, µ_γ, σ² | y, Z) dµ₀ dµ_γ dσ²,   (8)

where f(y∗ | z∗, γ, µ₀, µ_γ, σ²) is the density of N(µ₀ + µ⊤_γ z∗_γ, σ²) as given in (1), f(γ, µ₀, µ_γ, σ² | y, Z) is the joint posterior density of (γ, µ₀, µ_γ, σ²) given (y, Z) deduced from the hierarchical model (2), and S_γ = (0, ∞) × R^{|γ|} × R. Note that the distribution (8) is not tractable. However, as shown later in this section, the posterior predictive mean and variance of y∗ can be expressed as (posterior) expectations of analytically available functions of γ. Also, samples from an approximation of (8) can be drawn using our SVEN algorithm. Using these approaches, we now propose two methods for computing approximate posterior prediction intervals for y∗.

3.1 Z-prediction intervals

In this section we describe some approximations to E(y∗ | y) and Var(y∗ | y) and use them to construct an interval for y∗. To that end, from (2) we observe that β₀ and β_γ are conditionally independent given y, γ, σ² and Z, with

β₀ | y, Z, γ, σ² ∼ N(ȳ, σ²/n)  and  β_γ | y, Z, γ, σ² ∼ N(A_γ^{−1} X⊤_γ ỹ, σ² A_γ^{−1}),   (9)

where A_γ = X⊤_γ X_γ + λI as defined in Section 2.1.1. Consequently, the full conditional distribution of (µ₀, µ_γ) is a (|γ| + 1)-dimensional multivariate Gaussian distribution given by

(µ₀, µ_γ) | σ², γ, y ∼ N( (ȳ − Z̄⊤_γ F_γ D_γ X⊤_γ ỹ, F_γ D_γ X⊤_γ ỹ), σ² [ (n^{−1} + Z̄⊤_γ F_γ Z̄_γ, −Z̄⊤_γ F_γ); (−F_γ Z̄_γ, F_γ) ] ),   (10)

where Z̄_γ and D_γ are the sub-vector of Z̄ and the sub-matrix of D, respectively, corresponding to the model γ, and F_γ = D_γ^{−1} A_γ^{−1} D_γ^{−1}. Also,

σ² | γ, y ∼ IG((n − 1)/2, R_γ/2),   (11)

where IG(a, b) denotes an inverse gamma random variable with density f(σ²) ∝ (σ²)^{−a−1} exp(−b/σ²), and R_γ is defined in (4). Next, let z̃_γ = z∗_γ − Z̄_γ and note that E(σ² | γ, y) = R_γ/(n − 3).
Thus, using the iterated expectation and variance formulas, we have

E(y∗ | y) = E[E{y∗ | γ, σ², µ₀, µ, y} | y] = E[E{µ₀ + µ⊤_γ z∗_γ | σ², γ, y} | y] = ȳ + E[z̃⊤_γ F_γ D_γ X⊤_γ ỹ | y]   (12a)

and

Var(y∗ | y) = E(Var{y∗ | γ, σ², µ₀, µ, y} | y) + Var(E{y∗ | γ, σ², µ₀, µ, y} | y)
  = E(σ² | y) + Var(µ₀ + µ⊤_γ z∗_γ | y)
  = E[E(σ² | γ, y) | y] + E[Var{µ₀ + µ⊤_γ z∗_γ | σ², γ, y} | y] + Var[E(µ₀ + µ⊤_γ z∗_γ | σ², γ, y) | y]
  = E[ (R_γ/(n − 3)) (1 + n^{−1} + z̃⊤_γ F_γ z̃_γ) | y ] + Var[z̃⊤_γ F_γ D_γ X⊤_γ ỹ | y].   (12b)

From (12a) and (12b) we see that both E(y∗ | y) and Var(y∗ | y) can be expressed as posterior expectations of analytically available functions of γ. However, because the posterior of γ is not entirely available, we propose using the models γ^{(1)}, ..., γ^{(K)} obtained from SVEN as described in Section 2.2.2, with weights w₁, ..., w_K respectively, to approximate these expectations and variances. We can use the resulting approximate posterior predictive mean and variance of y∗ to obtain a (1 − α) prediction interval for y∗ as Ê(y∗ | y) ∓ z_{α/2} V̂ar(y∗ | y)^{1/2}, where z_{α/2} is the (1 − α/2)th standard normal quantile. We call this interval the Z-prediction interval (Z-PI). Also, the posterior predictive mean is used as a point estimate of y∗. In the next section, we describe an alternative method for computing a prediction interval for y∗ using Monte Carlo simulation.

3.2 Monte Carlo prediction intervals

A prediction interval for y∗ can also be constructed using Monte Carlo (MC) samples generated from the posterior predictive distribution (8). Specifically, a (1 − α) prediction interval for y∗ is given by [F^{−1}_{y∗|y}(α/2), F^{−1}_{y∗|y}(1 − α/2)], where F^{−1}_{y∗|y}(α) denotes the αth quantile of the distribution (8). We now describe a method for sampling from an approximation of (8) using SVEN. To that end, we consider f̃(y∗ | y) given by

f̃(y∗ | y) = Σ_{i=1}^K w_i ∫_{S_{γ^{(i)}}} f(y∗ | z∗, γ^{(i)}, µ₀, µ_{γ^{(i)}}, σ²) f(µ₀, µ_{γ^{(i)}}, σ² | γ^{(i)}, y, Z) dµ₀ dµ_{γ^{(i)}} dσ²,   (13)

where the w_i are defined in (7) and γ^{(1)}, γ^{(2)}, ..., γ^{(K)} are the K highest probability models obtained by SVEN as described in Section 2.2.2. Thus, f̃(y∗ | y) is the posterior predictive pdf f(y∗ | y) given in (8), except that the marginal posterior of γ is replaced by a mixture distribution over the models chosen by SVEN. Samples from (13) can be drawn as follows. First, we sample γ from the top K models with P(γ = γ^{(k)}) = w_k (1 ≤ k ≤ K). Given γ, we then sample σ² from (11). Next, given γ and σ², we sample β₀ and β_γ from (9). Then we compute µ_γ = D_γ^{−1} β_γ and µ₀ = β₀ − Z̄⊤_γ µ_γ, which are samples from (10). Finally, we generate y∗ from N(µ₀ + µ⊤_γ z∗_γ, σ²). We repeat the above process a large number of times and construct a (1 − α) MC prediction interval (MC-PI) for y∗ as [F̃^{−1}(α/2), F̃^{−1}(1 − α/2)], where F̃^{−1}(·) denotes the empirical quantile based on these samples; a small sketch is given below. In practice, one generally wants prediction intervals at several new covariate vectors z∗.
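A minimal R sketch of this sampling scheme for a single new covariate vector z∗ is given below. This is our own helper, not the bravo implementation; it assumes the SVEN output is available as a list `models` of index vectors with weights `wts` from (7), and it recomputes D and Z̄ from the raw covariates Z.

```r
predict_mc <- function(zstar, models, wts, Z, y, lambda, nsim = 10000) {
  n <- length(y)
  Zbar <- colMeans(Z); Dsd <- apply(Z, 2, sd)
  X <- sweep(sweep(Z, 2, Zbar, "-"), 2, Dsd, "/")             # standardized design
  ytilde <- y - mean(y)
  ks <- sample(seq_along(models), nsim, replace = TRUE, prob = wts)
  out <- numeric(nsim)
  for (s in seq_len(nsim)) {
    g  <- models[[ks[s]]]
    Xg <- X[, g, drop = FALSE]
    U  <- chol(crossprod(Xg) + lambda * diag(length(g)))      # A_gamma = U'U
    v  <- backsolve(U, crossprod(Xg, ytilde), transpose = TRUE)
    Rg <- sum(ytilde^2) - sum(v^2)                            # R_gamma from (4)
    s2 <- Rg / rchisq(1, df = n - 1)                          # sigma^2 from (11)
    b0 <- rnorm(1, mean(y), sqrt(s2 / n))                     # beta_0 from (9)
    bg <- backsolve(U, v + rnorm(length(g)) * sqrt(s2))       # beta_gamma from (9)
    mg <- bg / Dsd[g]                                         # mu_gamma = D_gamma^{-1} beta_gamma
    m0 <- b0 - sum(Zbar[g] * mg)                              # mu_0 = beta_0 - Zbar_gamma' mu_gamma
    out[s] <- rnorm(1, m0 + sum(zstar[g] * mg), sqrt(s2))     # draw y* from (1)
  }
  out
}
## e.g., a 95% MC-PI: quantile(predict_mc(zstar, models, wts, Z, y, lambda), c(0.025, 0.975))
```

Note that drawing β_γ as U^{−1}(v + σε) with ε ∼ N(0, I) yields exactly the N(A_γ^{−1}X⊤_γỹ, σ²A_γ^{−1}) distribution in (9), since U^{−1}U^{−⊤} = A_γ^{−1}.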
In Section S1 of the supplementary materials, we describe a computationally efficient way of drawing multiple samples from (13) using the above method, and thus of simultaneously computing prediction intervals at several new covariate vectors z∗.

4 Simulation studies

In this section, we study the performance of our SVEN method through several numerical experiments and compare it with some other existing methods. The competing variable selection methods we consider are S5 (R package: BayesS5), EMVS (R package: EMVS), fastBVSR and three penalization methods: the LASSO, the Elastic Net with a fixed elastic mixing parameter α (R package: glmnet) and SCAD (R package: ncvreg). As also noted in Shin et al. (2018), we could not include BASAD (Narisetty and He, 2014) because of its high computational burden and our ultra-high dimensional examples. As in Table 1 of Ročková and George (2014), we run EMVS with v₁ = 1000 and three increasing choices of v₀ (denoted EMVS₀, EMVS₁ and EMVS₂), the largest being v₀ = 1. For fastBVSR, the results are obtained using 100,000 MCMC iterations after a burn-in of 10,000 steps. For S5 the hyperparameters are tuned using a function provided in BayesS5. Moreover, we denote by piMOM and peMOM, respectively, the product inverse-moment and the product exponential moment non-local priors used under S5. In addition, for piMOM and peMOM, we use MAP and LS to denote the MAP estimator and the least squares estimator from the MAP model, respectively. Under SVEN, both the MAP and WAM models, as described in Section 2.2, are considered. For SVEN, we use N = 200 and the temperature schedule described in Section 2.2.2 with m = 9. Also, for SVEN, the ridge estimator β̃_γ is used to estimate the regression coefficients for the MAP and the WAM models.
Table 1: Independent predictors (Section 4.1.1).

| Method | MSPE | MSE_β | Coverage probability (%) | Average model size | FDR (%) | FNR (%) | Jaccard index (%) |
|---|---|---|---|---|---|---|---|
| SVEN (WAM) | 0.6387 | 0.0083 | 100 | 5 | 0 | 0 | 100 |
| SVEN (MAP) | 0.6387 | 0.0083 | 100 | 5 | 0 | 0 | 100 |
| piMOM (MAP) | 0.6384 | 0.0081 | 100 | 5 | 0 | 0 | 100 |
| peMOM (MAP) | 0.6384 | 0.0080 | 100 | 5 | 0 | 0 | 100 |
| piMOM (LS) | 0.6387 | 0.0083 | 100 | 5 | 0 | 0 | 100 |
| peMOM (LS) | 0.6387 | 0.0083 | 100 | 5 | 0 | 0 | 100 |
| fastBVSR | 0.6478 | 0.0091 | 100 | 5.09 | 1.45 | 0 | 98.55 |

Our numerical studies are conducted in six different simulation settings, described below.
4.1 Simulation settings

4.1.1 Independent predictors

In this example, the entries of X are generated independently from N(0, 1). The coefficients are specified so that only the first five predictors are active, with β₁, β₂ ∈ (0, 1), β₃ = 1, β₄, β₅ ∈ (1, 2), and β_j = 0 for all j > 5.

4.1.2 Compound symmetry

This example is taken from Example 3 in Wang (2009) and Example 2 in Wang and Leng (2016). The rows of X are generated independently from N_p(0, (1 − ρ) I_p + ρ 1_p 1⊤_p) for a fixed common correlation ρ ∈ (0, 1). The regression coefficients are set as β_j = 5 for j = 1, ..., 5 and β_j = 0 otherwise. A short sketch generating both of these designs is given below.
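For concreteness, here is a short R sketch generating the two designs above; the value of ρ shown is illustrative, while n and p follow the settings described at the end of this section.

```r
set.seed(1)
n <- 400; p <- 20000; rho <- 0.6           # rho is an illustrative value
X_ind <- matrix(rnorm(n * p), n, p)         # 4.1.1: independent N(0,1) entries
## 4.1.2: rows iid N_p(0, (1 - rho) I_p + rho 1 1'), via a shared latent factor
z <- rnorm(n)
X_cs <- sqrt(rho) * z + sqrt(1 - rho) * matrix(rnorm(n * p), n, p)
```

The one-factor construction works because Cov(√ρ z_i + √(1−ρ) e_{ij}, √ρ z_i + √(1−ρ) e_{ik}) = ρ for j ≠ k, with unit marginal variances, avoiding any p × p covariance factorization.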
Table 2: Compound symmetry (Section 4.1.2).

| Method | MSPE | MSE_β | Coverage probability (%) | Average model size | FDR (%) | FNR (%) | Jaccard index (%) |
|---|---|---|---|---|---|---|---|
| SVEN (WAM) | 48.3069 | 1.1912 | 100 | 5 | 0 | 0 | 100 |
| SVEN (MAP) | 48.3069 | 1.1892 | 100 | 5 | 0 | 0 | 100 |
| piMOM (MAP) | 48.2277 | 1.0018 | 100 | 5 | 0 | 0 | 100 |
| peMOM (MAP) | 50.1528 | 3.5669 | 94 | 4.96 | 0.37 | 1.2 | 98.5 |
| piMOM (LS) | 48.3069 | 1.1892 | 100 | 5 | 0 | 0 | 100 |
| peMOM (LS) | 50.2789 | 3.8758 | 94 | 4.96 | 0.37 | 1.2 | 98.5 |
| fastBVSR | 50.0479 | 2.5620 | 100 | 5.78 | 9.54 | 0 | 90.46 |

Table 3: Autoregressive correlation (Section 4.1.3).

| Method | MSPE | MSE_β | Coverage probability (%) | Average model size | FDR (%) | FNR (%) | Jaccard index (%) |
|---|---|---|---|---|---|---|---|
| SVEN (WAM) | 2.1521 | 0.0173 | 100 | 3 | 0 | 0 | 100 |
| SVEN (MAP) | 2.1521 | 0.0173 | 100 | 3 | 0 | 0 | 100 |
| piMOM (MAP) | 2.1519 | 0.0172 | 100 | 3 | 0 | 0 | 100 |
| peMOM (MAP) | 2.1515 | 0.0168 | 100 | 3 | 0 | 0 | 100 |
| piMOM (LS) | 2.1521 | 0.0173 | 100 | 3 | 0 | 0 | 100 |
| peMOM (LS) | 2.1521 | 0.0173 | 100 | 3 | 0 | 0 | 100 |
| fastBVSR | 2.1961 | 0.0187 | 100 | 3.03 | 0.75 | 0 | 99.25 |

Table 4: Group structure with 3 groups (Section 4.1.5).

| Method | MSPE | MSE_β | Coverage probability (%) | Average model size | FDR (%) | FNR (%) | Jaccard index (%) |
|---|---|---|---|---|---|---|---|
¹ λ = n/p, w = √n/p; ² λ = 200 with a larger w.

Table 5: Factor model with 2 factors (Section 4.1.4).

| Method | MSPE | MSE_β | Coverage probability (%) | Average model size | FDR (%) | FNR (%) | Jaccard index (%) |
|---|---|---|---|---|---|---|---|
| SVEN (WAM) | 42.9106 | 0.3892 | 100 | 5 | 0 | 0 | 100 |
| SVEN (MAP) | 42.9103 | 0.3891 | 100 | 5 | 0 | 0 | 100 |
| piMOM (MAP) | 42.8731 | 0.3724 | 100 | 5 | 0 | 0 | 100 |
| peMOM (MAP) | 42.9491 | 0.4211 | 100 | 5.01 | 0.17 | 0 | 99.83 |
| piMOM (LS) | 42.9103 | 0.3891 | 100 | 5 | 0 | 0 | 100 |
| peMOM (LS) | 42.9361 | 0.4083 | 100 | 5.01 | 0.17 | 0 | 99.83 |
| fastBVSR | 67.0982 | 19.9837 | 87 | 6.14 | 18.52 | 3.60 | 79.89 |

Table 6: Extreme correlation (Section 4.1.6).

| Method | MSPE | MSE_β | Coverage probability (%) | Average model size | FDR (%) | FNR (%) | Jaccard index (%) |
|---|---|---|---|---|---|---|---|
| SVEN (WAM) | 14.0754 | 0.1571 | 100 | 5 | 0 | 0 | 100 |
| SVEN (MAP) | 14.0757 | 0.1569 | 100 | 5 | 0 | 0 | 100 |
| piMOM (MAP) | 14.0732 | 0.1547 | 100 | 5 | 0 | 0 | 100 |
| peMOM (MAP) | 14.0750 | 0.1562 | 100 | 5 | 0 | 0 | 100 |
| piMOM (LS) | 14.0757 | 0.1569 | 100 | 5 | 0 | 0 | 100 |
| peMOM (LS) | 14.0757 | 0.1569 | 100 | 5 | 0 | 0 | 100 |
| fastBVSR | 31.2771 | 32.1993 | 97 | 6.55 | 18.65 | 0.6 | 81.03 |

4.1.3 Autoregressive correlation

The autoregressive correlation structure is commonly observed in time series data, where the correlation between observations depends on the time lag between them. In this example, we use an AR(1) structure in which variables further apart are less correlated. Following Example 2 in Wang and Leng (2016), X_j = ρ X_{j−1} + (1 − ρ²)^{1/2} z_j for 2 ≤ j ≤ p, where X₁ and the z_j (2 ≤ j ≤ p) are iid N_n(0, I_n). We use a fixed ρ ∈ (0, 1), and three of the regression coefficients are set to the nonzero values 3, 1.5 and 2, with β_j = 0 for all other j.

4.1.4 Factor model

This example is from Meinshausen and Bühlmann (2006) and Wang and Leng (2016). With a fixed number of factors, K, we first generate a p × K matrix F whose entries are iid standard normal. Then the rows of X are independently generated from N_p(0, F F⊤ + I_p). We fix K = 2, and the regression coefficients are set to be the same as in Section 4.1.2.

4.1.5 Group structure

This special correlation structure arises when variables are grouped together, in the sense that variables from the same group are highly correlated. This example is similar to Wang and Leng (2016) and to Example 4 of Zou and Hastie (2005), where 15 true variables are assigned to 3 groups. We generate three groups of five predictors each, with the predictors in group i given by z_i + ζ_{i,m} for m = 0, 1, 2, 3, 4, where the z_i are iid N_n(0, I_n) and the ζ_{i,m} are iid Gaussian noise vectors with a small common variance, for 1 ≤ i ≤ 3. The regression coefficients are set as β_j = 3 for the 15 grouped variables and β_j = 0 otherwise.

4.1.6 Extreme correlation

This challenging example is Example 6 of Wang and Leng (2016). We first simulate z_j, j = 1, ..., p, and w_j, j = 1, ..., 5, independently from the multivariate standard normal distribution N_n(0, I_n). The covariates are then generated as X_j = (z_j + w_j)/√2 for j = 1, ..., 5 and X_j = (z_j + Σ_{i=1}^5 w_i)/2 for j = 6, ..., p. Setting the number of true covariates to 5 and letting β_j = 5 for j = 1, ..., 5 and β_j = 0 for j = 6, ..., p, the correlation between the response and the unimportant covariates is around 2.5/√3 times larger than that between the response and the true covariates, making it difficult to identify the important covariates.

4.2 Results

Our simulation experiments are conducted using 100 simulated pairs of training and testing data sets. For each of the simulation settings introduced above, we set p = 20,000 and generate training and testing data sets of size n = 400 each. The error variance σ² is determined by setting the theoretical R² = 90% (Wang, 2009). The hyperparameters w and λ are chosen to be √n/p and n/p, respectively, except for the group structure, where we also use λ = 200 and a larger value of w
to account for the high within-group correlation and the relatively large true model size.

In order to evaluate the performance of the proposed method, we compute the following metrics: (1) the mean squared prediction error based on the testing data (MSPE); (2) the mean squared error between the estimated regression coefficients and the true coefficients (MSE_β); (3) the coverage probability, defined as the proportion of the selected models containing the true model; (4) the average model size, calculated as the average number of predictors included in the selected models over all replications; (5) the false discovery rate (FDR); (6) the false negative rate (FNR); and (7) the Jaccard index, defined as the size of the intersection divided by the size of the union of the selected model and the true model. A sketch of the selection metrics appears below. All computations are done using single-threaded R on a workstation with two 2.6 GHz 8-core Intel E5-2640 v3 processors and 128 GB RAM.
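The following R sketch computes metrics (3) and (5)–(7) for a single replication, with `sel` and `true` the index sets of the selected and true models. The FNR convention shown (proportion of true variables missed) is one common choice and is our assumption, not spelled out in the text.

```r
covers  <- function(sel, true) all(true %in% sel)   # contributes to (3) after averaging
fdr     <- function(sel, true) if (length(sel) == 0) 0 else length(setdiff(sel, true)) / length(sel)
fnr     <- function(sel, true) length(setdiff(true, sel)) / length(true)  # assumed convention
jaccard <- function(sel, true) length(intersect(sel, true)) / length(union(sel, true))
```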
The averages of these metrics over the replications are presented in Tables 1–6. For the peMOM and piMOM priors, the difference between MAP and LS arises only in MSPE and MSE_β and not in the other metrics. We observe from the tables that SVEN and S5 in general perform much better than EMVS, fastBVSR and the three frequentist penalized methods. In most settings, the penalized methods result in many false discoveries, yet attain similar or worse coverage probabilities compared to the Bayesian methods. Since the estimates of β from EMVS are not sparse, it has higher MSE_β than SVEN and S5. As observed in Tables 5 and 6, fastBVSR yields large values of MSPE and MSE_β due to poor estimates of β. Also, SVEN yields competitive prediction errors and has better FDR and Jaccard indices in every case other than the group structure.

For the group structure (Table 4), where there is high correlation between the variables within the same group, SVEN with w = √n/p and λ = n/p and S5 both pick up only one representative variable from each group, resulting in a high false negative rate and an average model size around three. Although elastic net regression successfully includes all the important variables, it also includes a large number of unimportant variables and thus leads to a very high false discovery rate. However, by increasing the shrinkage to λ = 200 and increasing the prior inclusion probability w, SVEN stands out from its competitors. In fact, if important predictors are anticipated to be highly correlated, this prior information can be incorporated by choosing a larger value for λ.

In addition, we compare the computing times of S5 (with both piMOM and peMOM priors) and SVEN, and find that SVEN hits the MAP model faster than S5. The details are provided in Section S2.
5 Real data analysis

We examine the practical performance of our proposed method by applying it to a real data example. Cook et al. (2012) conducted a genome-wide association study on starch, protein and kernel oil content in maize. The original field trial at Clayton, NC in 2006 consisted of more than 5,000 inbred lines and check varieties, primarily coming from a diverse inbred line panel consisting of 282 founding lines (Flint-Garcia et al., 2005). Because the dataset comes from a field trial, the responses could be spatially autocorrelated. Thus we use a random row-column adjustment to obtain the adjusted phenotypes of the varieties. However, marker information is available for only a few thousand of these varieties.

5.1 Comparison with other methods after screening

We compare our method to S5, fastBVSR and the three penalized regression methods (LASSO, Elastic Net and SCAD). Since both the R packages
BayesS5 (version 1.31) and glmnet (version 2.0-18) do not work on this massive data set, we perform a screening of these markers before
conducting variable selection, so as to reduce the dimension of the data. We randomly split the data into a training set comprising all but 100 of the genotyped lines and a testing set of size 100. We then use the high dimensional ordinary least squares projection (HOLP) screening method (Wang and Leng, 2016) to retain a few thousand markers. Note that the training sets are formed by controlling the minor allele frequency (MAF) of each marker to be no less than 1.5%. Because markers tend to be highly correlated, we use SVEN with w = 1/p but with two choices of λ — λ = √n (high shrinkage) and λ = n/p (low shrinkage) — and with m = 3 and N = 50, for selecting the markers. In our experience, for other intermediate values of λ that we have tried, both the model size and the MSPE lie between the respective reported values. We repeat the entire process 50 times, each time computing the MSPE and the model size for each method. The peMOM non-local prior in S5 failed to provide any result even after 100 hours of running, and S5 with the piMOM prior failed to provide a result in three cases. The fastBVSR algorithm ran successfully in only 39 of the 50 cases; the complex iterative factorization at the core of fastBVSR encountered floating point errors in the remaining 11 cases and could not produce any result. In contrast, SVEN faced no difficulties and produced the results within a reasonable time.

[Figure 1: Boxplots of the MSPE using SVEN, S5, fastBVSR, LASSO and Elastic Net after screening. Average model sizes: LASSO 1764.98, Elastic Net 1799.14, SCAD 16.62, fastBVSR 711.33, S5 (piMOM) 27.57, SVEN(MAP)² 9.66, SVEN(WAM)² 9.32, SVEN(MAP)¹ 18.84, SVEN(WAM)¹ 18.48. ¹ w = 1/p, λ = √n; ² w = 1/p, λ = n/p.]

The boxplots of these MSPEs are shown in Figure 1, along with the average model sizes. Overall, SVEN, S5 and SCAD perform significantly better than the lasso, the elastic net regression and fastBVSR, producing smaller MSPE with smaller model sizes. Moreover, SVEN and S5 produce comparable MSPE values, but SVEN results in more parsimonious models. SVEN with high shrinkage produces slightly smaller MSPE but roughly double the model size compared with low shrinkage.

5.2 Analysis of the whole data

Unlike the other variable selection methods, SCAD and SVEN can be successfully applied directly to the whole data set without any pre-screening. We ran SVEN 50 times again with the temperature schedule described in Section 2.2.2 with m = 3 and N = 100 iterations per temperature, each time starting with a different random seed. Initially, we use w = 1/p and try several values of λ as in Section 5.1. The best models from these 50 runs vary, suggesting that the posterior surface is severely multimodal. With λ = n/p, we find that although the sizes of these best models remain around nine, the number of unique markers included in at least one of these 50 best models is over 30 (for SCAD both of these numbers were even larger). Other, larger values of λ produce even larger models and more unique variables. Interestingly, taking a further look into the markers identified, we discovered that the presence of some of these markers in a model is always accompanied by the absence of certain other markers. More specifically, some pairs and triplets of the markers are never included simultaneously in the MAP models, but the frequencies at which they are selected add up to 50. Thus, to achieve more parsimonious models, we reduce w to 1/p² and use λ = n/p. Using such a small w, the sizes of the best models from each run reduce to around four, and the number of unique markers that are included at least once in the 50 best models comes down to eight. To verify our conjecture on the correlations between these markers, we calculated the pairwise partial correlations between these eight markers.
It turns out that the pairs of markers that are never included in the same model are indeed more highly partially correlated than the other pairs. Figure 2 displays the partial correlations for these markers; the size of a node indicates the number of times the corresponding marker is included in one of the 50 best models, and pairs of markers that are never included or excluded jointly are joined by a line segment. A sketch of the partial correlation computation is given below.

[Figure 2: Graph of the selected markers and their corresponding partial correlations using w = 1/p² and λ = n/p. The SNP accession numbers of the selected markers are: 1 = 5 151885291, 2 = 5 197591528, 3 = 5 200552088, 4 = 6 7585863, 5 = 7 153216557, 6 = 9 142949160, 7 = 10 72608193, and 8 = 10 110298386.]

Note that the partial correlation between the connected markers is at least 29%, whereas the largest partial correlation between markers that are not connected is around 18%. The inclusion frequencies of the pairs of connected markers add up to 50. Note also that the fifth and sixth markers are not grouped with other markers because their inclusion or exclusion is not related to the inclusion or exclusion of any other marker. Thus SVEN is able to identify pairs of markers that have a similar effect on the response.
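The pairwise partial correlations can be read off the inverse of the correlation matrix of the selected markers; a minimal R sketch, with M the n × 8 matrix of the eight markers, is:

```r
partial_cor <- function(M) {
  P <- -cov2cor(solve(cor(M)))  # off-diagonals: partial correlation of each pair
  diag(P) <- 1                  # given all the remaining markers
  P
}
```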
5.3 Prediction intervals

Next, we study the performance and the widths of the 90% and 95% Z-PIs and MC-PIs described in Section 3. To that end, we randomly split the entire data into a training set, under the constraint that the MAF of each marker is at least 1.5%, and a testing set. We also remove any duplicated markers from the training set, which results in a slightly smaller p of about 544,000. We generate 10,000 samples from the approximate posterior predictive distribution (13) to compute the MC-PIs. We find that the Z-PIs and the MC-PIs attain identical coverage rates, which are 91% and 95% for the 90% and 95% prediction intervals, respectively. The boxplots of the widths of the 200 intervals from each method are presented in Figure 3. We find that the widths of the Z-PIs are less variable than those of the MC-PIs. It is encouraging to see that, despite the non-normality of the posterior predictive distribution, the Z-PIs are better than the simulation based intervals.

[Figure 3: Boxplots of the widths of MC-PIs (grey) and Z-PIs (white).]

6 Discussion

In this article, we introduce a Bayesian variable selection method with embedded screening for ultrahigh-dimensional settings. The model used here is a hierarchical model with well known spike and slab priors on the regression coefficients. The use of a degenerate spike prior for inactive variables not only results in sparse estimates of the regression coefficients and (much) lower computational burden, it also allows us to establish strong model selection consistency under somewhat weaker conditions than Narisetty and He (2014). In particular, we prove that the posterior probability of the true model converges to one even when the norm of the mean effects solely due to the unimportant variables diverges. On the other hand, our method crucially hinges on the fact that model probabilities are available in closed form (up to a normalizing constant), which is due to the use of Gaussian slab priors on the active covariates. We propose a scalable variable selection algorithm with an inbuilt screening method that efficiently explores the huge model space and rapidly finds the MAP model. The screening is model based, in the sense that it is performed on a set of candidate models rather than on the set of potential variables. The algorithm also incorporates temperature control into a neighborhood based stochastic search method. We use fast Cholesky updates to efficiently compute the (unnormalized) posterior probabilities of the neighboring models. Since the mean and variance of the posterior predictive distribution are shown to be means of analytically available functions of the models, a byproduct of the proposed method is the construction of novel prediction intervals for future observations. Both Z based intervals and simulation based intervals are derived and compared. In the real data analysis, we observe that the Z based prediction intervals attain the same coverage rates as, while being narrower than, the Monte Carlo intervals. The extensive simulation studies in Section 4 and the real data analysis in Section 5 demonstrate the superiority of the proposed method compared with other state of the art methods, even though the hyperparameters in the proposed method are not carefully tuned.
6 Discussion

In this article, we introduce a Bayesian variable selection method with embedded screening for ultrahigh-dimensional settings. The model used here is a hierarchical model with the well known spike and slab priors on the regression coefficients. The use of the degenerate spike prior for inactive variables not only results in sparse estimates of the regression coefficients and a (much) lighter computational burden, but also allows us to establish strong model selection consistency under somewhat weaker conditions than Narisetty and He (2014). In particular, we prove that the posterior probability of the true model converges to one even when the norm of the mean effects solely due to the unimportant variables diverges. On the other hand, our method crucially hinges on the fact that model probabilities are available in closed form (up to a normalizing constant), which is due to the use of Gaussian slab priors on the active covariates. We propose a scalable variable selection algorithm with an inbuilt screening method that efficiently explores the huge model space and rapidly finds the MAP model. The screening is model based in the sense that it is performed on a set of candidate models rather than on the set of potential variables. The algorithm also incorporates temperature control into a neighborhood based stochastic search method. We use fast Cholesky updates to efficiently compute the (unnormalized) posterior probabilities of the neighboring models. Since the mean and the variance of the posterior predictive distribution are shown to be means of analytically available functions of the models, a byproduct of the proposed method is the construction of novel prediction intervals for future observations. Both Z based intervals and simulation based intervals are derived and compared. In the real data analysis, we observe that the Z based prediction intervals attain the same coverage rates as, but are narrower than, the Monte Carlo intervals. The extensive simulation studies in Section 4 and the real data analysis in Section 5 demonstrate the superiority of the proposed method compared with other state of the art methods, even though the hyperparameters in the proposed method are not carefully tuned. Among the Bayesian methods used for comparison, the package associated with the proposed algorithm seems to be the only one that can be directly applied to datasets of dimension as high as the one analyzed here with the computing resources mentioned before.

Based on the Cholesky update described in Section 2.2.2, SVEN can be extended to accommodate the determinantal point process prior (Kojima and Komaki, 2016) on $\gamma$ given by $p(\gamma \mid \omega) \propto \omega^{|\gamma|}\,|X_\gamma^\top X_\gamma|$, where $\omega > 0$ (a small computational sketch is given at the end of this section). Variable selection and consistency of the resulting posteriors for high dimensional generalized linear models are considered in Liang et al. (2013). It would be interesting to extend our method to the generalized linear regression model setup. The dataset we have used comes from an agricultural field trial and hence the observations are expected to be spatially autocorrelated. Although we have used a two stage procedure by first obtaining spatially adjusted genotypic effects, our model can be extended to include spatial random effects (Dutta and Mondal, 2014). Also, in many applications, the covariates may have a non-linear effect on the response, and our method could be extended to additive models.
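Because $|X_\gamma^\top X_\gamma|$ equals the squared product of the diagonal entries of the Cholesky factor of $X_\gamma^\top X_\gamma$, the extra determinant term required by the determinantal point process prior comes essentially for free from the kind of factorization SVEN already maintains. A minimal sketch of the prior evaluation (an illustration of the idea only, not part of the proposed implementation; it assumes $X_\gamma$ has full column rank):

```r
# Unnormalized log prior mass of a model under the determinantal point
# process prior: log p(gamma | omega) = |gamma| * log(omega)
#                                       + log det(Xg' Xg) + const.
log_dpp_prior <- function(X, gamma, omega) {
  Xg <- X[, gamma, drop = FALSE]   # columns of the candidate model
  U  <- chol(crossprod(Xg))        # requires full column rank
  length(gamma) * log(omega) + 2 * sum(log(diag(U)))
}
```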
Supplemental materials
The supplemental materials contain additional details on computations and proofs of the theoretical results stated in the paper.
References
Barbieri, M. M. and Berger, J. O. (2004), "Optimal predictive model selection," The Annals of Statistics, 32, 870–897.
Bertsimas, D., King, A., and Mazumder, R. (2016), "Best subset selection via a modern optimization lens," The Annals of Statistics, 44, 813–852.
Bondell, H. and Reich, B. (2008), "Simultaneous regression shrinkage, variable selection, and supervised clustering of predictors with OSCAR," Biometrics, 64, 115–123.
Cao, X., Khare, K., and Ghosh, M. (2020), "High-dimensional posterior consistency for hierarchical non-local priors in regression," Bayesian Analysis, 15, 241–262.
Chen, J. and Chen, Z. (2008), "Extended Bayesian information criteria for model selection with large model spaces," Biometrika, 95, 759–771.
Cook, J. P., McMullen, M. D., Holland, J. B., Tian, F., Bradbury, P., Ross-Ibarra, J., Buckler, E. S., and Flint-Garcia, S. A. (2012), "Genetic architecture of maize kernel composition in the nested association mapping and inbred association panels," Plant Physiology, 158, 824–834.
Datta, A. and Zou, H. (2017), "Cocolasso for high-dimensional error-in-variables regression," The Annals of Statistics, 45, 2400–2426.
Dutta, S. and Mondal, D. (2014), "An h-likelihood method for spatial mixed linear model based on intrinsic autoregressions," Journal of the Royal Statistical Society: Series B (Statistical Methodology), 77, 699–726.
Fan, J. and Li, R. (2001), "Variable selection via nonconcave penalized likelihood and its oracle properties," Journal of the American Statistical Association, 96, 1348–1360.
Fan, J. and Lv, J. (2008), "Sure independence screening for ultrahigh dimensional feature space," Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70, 849–911.
Flint-Garcia, S. A., Thuillet, A.-C., Yu, J., Pressoir, G., Romero, S. M., Mitchell, S. E., Doebley, J., Kresovich, S., Goodman, M. M., and Buckler, E. S. (2005), "Maize association population: a high-resolution platform for quantitative trait locus dissection," The Plant Journal, 44, 1054–1064.
George, E. I. and McCulloch, R. E. (1993), "Variable selection via Gibbs sampling," Journal of the American Statistical Association, 88, 881–889.
— (1997), "Approaches for Bayesian variable selection," Statistica Sinica, 7, 339–373.
Hans, C., Dobra, A., and West, M. (2007), "Shotgun stochastic search for 'large p' regression," Journal of the American Statistical Association, 102, 507–516.
Huang, J., Jiao, Y., Liu, Y., and Lu, X. (2018), "A constructive approach to L0 penalized regression," The Journal of Machine Learning Research, 19, 403–439.
Ishwaran, H. and Rao, J. S. (2005), "Spike and slab variable selection: frequentist and Bayesian strategies," The Annals of Statistics, 33, 730–773.
Johnson, V. E. and Rossell, D. (2012), "Bayesian model selection in high-dimensional settings," Journal of the American Statistical Association, 107, 649–660.
Kim, Y., Choi, H., and Oh, H.-S. (2008), "Smoothly clipped absolute deviation on high dimensions," Journal of the American Statistical Association, 103, 1665–1673.
Kim, Y., Kwon, S., and Choi, H. (2012), "Consistent model selection criteria on high dimensions," Journal of Machine Learning Research, 13, 1037–1057.
Kirkpatrick, S., Gelatt, C. D., and Vecchi, M. P. (1983), "Optimization by simulated annealing," Science, 220, 671–680.
Kojima, M. and Komaki, F. (2016), "Determinantal point process priors for Bayesian variable selection in linear regression," Statistica Sinica, 26, 97–117.
Kyung, M., Gill, J., Ghosh, M., and Casella, G. (2010), "Penalized regression, standard errors, and Bayesian lassos," Bayesian Analysis, 5, 369–412.
Laurent, B. and Massart, P. (2000), "Adaptive estimation of a quadratic functional by model selection," The Annals of Statistics, 28, 1302–1338.
Liang, F., Paulo, R., Molina, G., Clyde, M. A., and Berger, J. O. (2008), "Mixtures of g priors for Bayesian variable selection," Journal of the American Statistical Association, 103, 410–423.
Liang, F., Song, Q., and Yu, K. (2013), "Bayesian subset modeling for high-dimensional generalized linear models," Journal of the American Statistical Association, 108, 589–606.
Meinshausen, N. and Bühlmann, P. (2006), "High-dimensional graphs and variable selection with the lasso," The Annals of Statistics, 34, 1436–1462.
Mitchell, T. J. and Beauchamp, J. J. (1988), "Bayesian variable selection in linear regression," Journal of the American Statistical Association, 83, 1023–1032.
Narisetty, N. N. and He, X. (2014), "Bayesian variable selection with shrinking and diffusing priors," The Annals of Statistics, 42, 789–817.
Park, T. and Casella, G. (2008), "The Bayesian lasso," Journal of the American Statistical Association, 103, 681–686.
Ročková, V. and George, E. I. (2014), "EMVS: The EM approach to Bayesian variable selection," Journal of the American Statistical Association, 109, 828–846.
Roy, V. and Chakraborty, S. (2017), "Selection of tuning parameters, solution paths and standard errors for Bayesian lassos," Bayesian Analysis, 12, 753–778.
Roy, V., Tan, A., and Flegal, J. (2018), "Estimating standard errors for importance sampling estimators with multiple Markov chains," Statistica Sinica, 28, 1079–1101.
Shin, M., Bhattacharya, A., and Johnson, V. E. (2018), "Scalable Bayesian variable selection using nonlocal prior densities in ultrahigh-dimensional settings," Statistica Sinica, 28, 1053–1078.
Tibshirani, R. (1996), "Regression shrinkage and selection via the lasso," Journal of the Royal Statistical Society: Series B (Methodological), 58, 267–288.
Wang, H. (2009), "Forward regression for ultra-high dimensional variable screening," Journal of the American Statistical Association, 104, 1512–1524.
Wang, X. and Leng, C. (2016), "High dimensional ordinary least squares projection for screening variables," Journal of the Royal Statistical Society: Series B (Statistical Methodology), 78, 589–611.
Xu, X. and Ghosh, M. (2015), "Bayesian variable selection and estimation for group lasso," Bayesian Analysis, 10, 909–936.
Yang, Y., Wainwright, M. J., and Jordan, M. I. (2016), "On the computational complexity of high-dimensional Bayesian variable selection," The Annals of Statistics, 44, 2497–2532.
Yuan, M. and Lin, Y. (2005), "Efficient empirical Bayes variable selection and estimation in linear models," Journal of the American Statistical Association, 100, 1215–1225.
Zanella, G. and Roberts, G. (2019), "Scalable importance tempering and Bayesian variable selection," Journal of the Royal Statistical Society: Series B (Statistical Methodology), 81, 489–517.
Zellner, A. (1986), "On assessing prior distributions and Bayesian regression analysis with g-prior distributions," in Bayesian Inference and Decision Techniques: Essays in Honor of Bruno de Finetti, eds. Goel, P. K. and Zellner, A., Elsevier Science, 233–243.
Zhou, Q. and Guan, Y. (2019), "Fast model-fitting of Bayesian variable selection regression using the iterative complex factorization algorithm," Bayesian Analysis, 14, 573–594.
Zou, H. (2006), "The adaptive lasso and its oracle properties," Journal of the American Statistical Association, 101, 1418–1429.
Zou, H. and Hastie, T. (2005), "Regularization and variable selection via the elastic net," Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67, 301–320.

Supplement to "Model Based Screening Embedded Bayesian Variable Selection for Ultra-high Dimensional Settings"

Dongjin Li, Somak Dutta and Vivekananda Roy

S1 Efficient computations for multiple predictions
We describe in this section how we efficiently generate multiple $y^*$ in order to obtain the empirical posterior predictive distribution and compute the Monte Carlo prediction intervals at several new covariate vectors $z^{*(1)}, \ldots, z^{*(L)}$. Recall from Section 2.2.2 that for a model $\gamma$, $U_\gamma$ is the upper triangular Cholesky factor of $X_\gamma^\top X_\gamma + \lambda I$ and $v_\gamma = U_\gamma^{-\top} X_\gamma^\top \tilde{y}$. The detailed procedure is described below.
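For a single model, these two quantities can be computed directly with one Cholesky factorization and a triangular solve, as in the following R sketch (a plain recomputation shown only for clarity; the implementation in the paper instead updates $U_\gamma$ as variables enter or leave the model):

```r
# U_gamma and v_gamma for a model gamma given as column indices;
# ytilde is the centered response and lambda the prior precision.
chol_quantities <- function(X, gamma, ytilde, lambda) {
  Xg <- X[, gamma, drop = FALSE]
  U  <- chol(crossprod(Xg) + lambda * diag(length(gamma)))
  v  <- backsolve(U, crossprod(Xg, ytilde), transpose = TRUE)  # U^{-T} Xg' ytilde
  list(U = U, v = v)
}
```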
Algorithm: Generate multiple $y^*$

1. Sample $N$ models with replacement from the best $K$ models returned by SVEN, with probabilities proportional to $w_i$ defined in (7), for $i = 1, \ldots, K$.
2. From the models sampled in step 1, find the unique models $\gamma_1, \ldots, \gamma_M$ such that $\sum_{m=1}^{M} S_m = N$, where $S_m$ denotes the number of sampled models identical to $\gamma_m$.
3. Compute $U_{\gamma_m}$ and $v_{\gamma_m}$ for $m = 1, \ldots, M$.
4. For $m = 1, \ldots, M$ and for $j = 1, \ldots, S_m$:
   (a) Sample $\sigma^2$ from $\mathrm{IG}\left((n-1)/2,\; R_{\gamma_m}/2\right)$.
   (b) Sample $e_i$ from $N(0, \sigma^2)$ for $i = 1, \ldots, |\gamma_m|$, and set $e = (e_1, \ldots, e_{|\gamma_m|})^\top$.
   (c) Compute $\mu_{\gamma_m} = D_{\gamma_m}^{-1}\left(U_{\gamma_m}^{-1}(v_{\gamma_m} + e)\right)$.
   (d) Sample $\mu$ from $N\left(\bar{y} - \bar{Z}_{\gamma_m}^\top \mu_{\gamma_m},\; \sigma^2/n\right)$.
   (e) For $\ell = 1, \ldots, L$, generate $y^*$ from $N\left(\mu + z^{*(\ell)}_{\gamma_m}\mu_{\gamma_m},\; \sigma^2\right)$.
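The inner loop of the algorithm can be sketched in R as follows (a hypothetical, self-contained rendering of steps 4(a)–(e) for one sampled model; the inputs `U_g`, `v_g`, `R_g`, the scaling matrix `D_g`, and the model-restricted new covariates `zstar` are assumed to have been prepared as in Section 2.2.2):

```r
# One pass of steps 4(a)-(e): draws one y* for each new covariate vector.
draw_ystar <- function(U_g, v_g, R_g, D_g, ybar, Zbar_g, zstar, n) {
  k      <- length(v_g)
  sigma2 <- 1 / rgamma(1, shape = (n - 1) / 2, rate = R_g / 2)  # IG draw
  e      <- rnorm(k, mean = 0, sd = sqrt(sigma2))
  mu_g   <- solve(D_g, backsolve(U_g, v_g + e))   # D^{-1} U^{-1} (v + e)
  mu     <- rnorm(1, ybar - sum(Zbar_g * mu_g), sqrt(sigma2 / n))
  rnorm(nrow(zstar), mean = mu + as.vector(zstar %*% mu_g),
        sd = sqrt(sigma2))
}
```

Repeating the call $S_m$ times for each unique model and pooling the draws yields the empirical posterior predictive distribution.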
S2 Timing comparison with S5

We examine the computation time it takes for SVEN to hit the MAP model for the first time, and compare it with S5 under the piMOM and the peMOM priors. We simulate the data according to Section 4.1.3, where $Z$ has an AR(1) structure. We consider five different $(n, p)$ pairs in which $p$ grows with $n$. For each $(n, p)$ pair, we record the computation times over 10 replicates. For SVEN, we use $w = \sqrt{n}/p$, $\lambda = n/p$ and $N = 50$, with the temperature schedule described in Section 2.2.2 and $m = 3$. Again, S5 is implemented using the R package BayesS5 with its default tuning parameters and only one repetition.
Figure S1 shows the median computation times SVEN and S5 take to first hit the MAP model, excluding the preprocessing steps, which are negligible. Both algorithms attain the same MAP model for all the data sets. In general, SVEN hits the MAP model faster than S5 for both small and large numbers of variables. Moreover, compared to S5, the computation time for SVEN increases at a slower rate as $p$ gets larger.
Figure S1: The median computation time to first hit the MAP model for SVEN and for S5 under the peMOM and piMOM priors.

S3 Preliminary results

Let $R^*_\gamma = \tilde{Y}^\top(I - P_\gamma)\tilde{Y}$, which is the residual sum of squares obtained by ordinary least squares, and also let $Q_\gamma = \lambda^{|\gamma|/2}\,|X_\gamma^\top X_\gamma + \lambda I|^{-1/2}$. Before proving the model selection consistency stated in Theorems 1 and 2, we first provide some preliminary results: a bound on $Q_\gamma/Q_t$, which will be used to bound the posterior ratio of a given model $\gamma$ to the true model $t$, and a bound on the difference between $R_t$ and $R^*_t$.

Lemma 1.
For any model $\gamma \neq t$,
$$\frac{Q_\gamma}{Q_t} \le v'\,(n\eta_{nm}(\nu)/\lambda)^{-(r^*_\gamma - r_t)/2}\,(\eta_{nm}(\nu))^{-|t \wedge \gamma^c|/2},$$
where $v' > 0$ is a constant.

Proof. Because the nonzero eigenvalues of $X_\gamma^\top X_\gamma$ and $X_\gamma X_\gamma^\top$ are identical, it follows that $Q_\gamma = \lambda^{|\gamma|/2}\,|X_\gamma^\top X_\gamma + \lambda I|^{-1/2} = |I + \lambda^{-1}X_\gamma X_\gamma^\top|^{-1/2}$. We first show that
$$\frac{Q_\gamma}{Q_{\gamma \wedge t}} \le (n\eta_{nm}(\nu)/\lambda)^{-(r^*_\gamma - r_{\gamma \wedge t})/2}.$$
There are two cases, depending on whether $|\gamma| \le u_n(\nu)$ or $|\gamma| > u_n(\nu)$.

Case 1:
Suppose $|\gamma| \le u_n(\nu)$. We then have
$$\frac{Q_\gamma}{Q_{\gamma \wedge t}} = \frac{\left|I + \lambda^{-1}X_{\gamma \wedge t}X_{\gamma \wedge t}^\top + \lambda^{-1}X_{\gamma \wedge t^c}X_{\gamma \wedge t^c}^\top\right|^{-1/2}}{\left|I + \lambda^{-1}X_{\gamma \wedge t}X_{\gamma \wedge t}^\top\right|^{-1/2}} = \left|I + \lambda^{-1}X_{\gamma \wedge t^c}^\top\left(I + \lambda^{-1}X_{\gamma \wedge t}X_{\gamma \wedge t}^\top\right)^{-1}X_{\gamma \wedge t^c}\right|^{-1/2}.$$
Next, using the Sherman–Morrison–Woodbury matrix identity, we have
$$\left(I + \lambda^{-1}X_{\gamma \wedge t}X_{\gamma \wedge t}^\top\right)^{-1} = I - X_{\gamma \wedge t}\left(X_{\gamma \wedge t}^\top X_{\gamma \wedge t} + \lambda I\right)^{-1}X_{\gamma \wedge t}^\top.$$
Thus, letting $E = X_{\gamma \wedge t}^\top X_{\gamma \wedge t}$, $F = X_{\gamma \wedge t}^\top X_{\gamma \wedge t^c}$ and $G = X_{\gamma \wedge t^c}^\top X_{\gamma \wedge t^c}$, we have
$$\frac{Q_\gamma}{Q_{\gamma \wedge t}} = \left|\lambda^{-1}\left\{G + \lambda I - F^\top(E + \lambda I)^{-1}F\right\}\right|^{-1/2}. \tag{S1}$$
However, note that $G + \lambda I - F^\top(E + \lambda I)^{-1}F$ is the Schur complement of $E + \lambda I$ in
$$H = \begin{pmatrix} E + \lambda I & F \\ F^\top & G + \lambda I \end{pmatrix} = \begin{pmatrix} X_{\gamma \wedge t}^\top X_{\gamma \wedge t} + \lambda I & X_{\gamma \wedge t}^\top X_{\gamma \wedge t^c} \\ X_{\gamma \wedge t^c}^\top X_{\gamma \wedge t} & X_{\gamma \wedge t^c}^\top X_{\gamma \wedge t^c} + \lambda I \end{pmatrix},$$
so that the smallest eigenvalue of $G + \lambda I - F^\top(E + \lambda I)^{-1}F$ is at least the smallest eigenvalue of $H$, which is, in turn, at least $n\eta_{nm}(\nu) + \lambda$, because $H$ can be obtained by applying one permutation to the rows and columns of $X_\gamma^\top X_\gamma + \lambda I$. Consequently, from (S1), we finally have
$$\frac{Q_\gamma}{Q_{\gamma \wedge t}} \le \left(\lambda^{-1}(\lambda + n\eta_{nm}(\nu))\right)^{-r_{\gamma \wedge t^c}/2} \le (n\eta_{nm}(\nu)/\lambda)^{-(r^*_\gamma - r_{\gamma \wedge t})/2},$$
because $|\gamma \wedge t^c| \ge r_{\gamma \wedge t^c} \ge r_\gamma - r_{\gamma \wedge t} \ge r^*_\gamma - r_{\gamma \wedge t}$.

Case 2: If $|\gamma| > u_n(\nu)$, write $\gamma = \gamma' \vee \gamma''$, where $\gamma'$ and $\gamma''$ are disjoint, $|\gamma'| \le u_n(\nu)$ and $\gamma' \wedge t = \gamma \wedge t$. Then $Q_{\gamma \wedge t} = Q_{\gamma' \wedge t}$ and
$$Q_\gamma = \left|I + \lambda^{-1}X_{\gamma'}X_{\gamma'}^\top + \lambda^{-1}X_{\gamma''}X_{\gamma''}^\top\right|^{-1/2} \le \left|I + \lambda^{-1}X_{\gamma'}X_{\gamma'}^\top\right|^{-1/2} = Q_{\gamma'}.$$
Since $\gamma = \gamma' \vee \gamma''$, $r_\gamma \ge r_{\gamma'}$, implying $r^*_\gamma \ge r^*_{\gamma'}$. Also, $\gamma' \wedge t = \gamma \wedge t$. Hence,
$$\frac{Q_\gamma}{Q_{\gamma \wedge t}} \le \frac{Q_{\gamma'}}{Q_{\gamma' \wedge t}} \le (n\eta_{nm}(\nu)/\lambda)^{-(r^*_{\gamma'} - r_{\gamma' \wedge t})/2} \le (n\eta_{nm}(\nu)/\lambda)^{-(r^*_\gamma - r_{\gamma \wedge t})/2}.$$
Furthermore,
$$\frac{Q_{\gamma \wedge t}}{Q_t} = \left|\left(I + \lambda^{-1}X_{\gamma \wedge t}X_{\gamma \wedge t}^\top\right)^{-1}\left(I + \lambda^{-1}X_{\gamma \wedge t}X_{\gamma \wedge t}^\top + \lambda^{-1}X_{\gamma^c \wedge t}X_{\gamma^c \wedge t}^\top\right)\right|^{1/2} = \left|I + \lambda^{-1}X_{\gamma^c \wedge t}^\top\left(I + \lambda^{-1}X_{\gamma \wedge t}X_{\gamma \wedge t}^\top\right)^{-1}X_{\gamma^c \wedge t}\right|^{1/2} \le \left|I + \lambda^{-1}X_{\gamma^c \wedge t}^\top X_{\gamma^c \wedge t}\right|^{1/2} \le v'\,(n/\lambda)^{|\gamma^c \wedge t|/2},$$
for some $v' > 0$, where the second to last inequality holds because $I + \lambda^{-1}X_{\gamma \wedge t}X_{\gamma \wedge t}^\top \ge I$, and the last inequality holds since condition C3 is in force and $X_{\gamma^c \wedge t}$ is a submatrix of $X_t$. Since $X_t$ is full rank, $r_t = r_{\gamma \wedge t} + r_{\gamma^c \wedge t} = r_{\gamma \wedge t} + |\gamma^c \wedge t|$. Thus, finally, we have
$$\frac{Q_\gamma}{Q_t} \le v'\,(n\eta_{nm}(\nu)/\lambda)^{-r^*_\gamma/2}\,(n\eta_{nm}(\nu)/\lambda)^{(r_t - |\gamma^c \wedge t|)/2}\,(n/\lambda)^{|\gamma^c \wedge t|/2} = v'\,(n\eta_{nm}(\nu)/\lambda)^{-(r^*_\gamma - r_t)/2}\,(\eta_{nm}(\nu))^{-|\gamma^c \wedge t|/2}.$$
Then, using Lemma 1, we have the following corollary.
Corollary 1.
For any model $\gamma \neq t$,
$$\mathrm{PR}(\gamma, t) = \frac{f(\gamma \mid Y, \sigma^2)}{f(t \mid Y, \sigma^2)} \le v'\,(n\eta_{nm}(\nu)/\lambda)^{-(r^*_\gamma - r_t)/2}\,(\eta_{nm}(\nu))^{-|\gamma^c \wedge t|/2}\,b_n^{|\gamma| - |t|}\exp\left\{-\frac{1}{2\sigma^2}(R_\gamma - R_t)\right\},$$
where $b_n = w/(1-w) \sim p^{-1}$, and $v' > 0$ is a constant.

Proof. The posterior of the model $\gamma$ under (2a)–(2d) is given by
$$f(\gamma \mid Y, \sigma^2) \propto \exp\left\{-\frac{1}{2\sigma^2}\left(\tilde{Y}^\top\tilde{Y} - \tilde{Y}^\top X_\gamma\left(X_\gamma^\top X_\gamma + \lambda I\right)^{-1}X_\gamma^\top\tilde{Y}\right)\right\}\lambda^{|\gamma|/2}\left|X_\gamma^\top X_\gamma + \lambda I\right|^{-1/2}w^{|\gamma|}(1-w)^{p-|\gamma|} = Q_\gamma\,b_n^{|\gamma|}(1-w)^p\exp\left\{-\frac{R_\gamma}{2\sigma^2}\right\}.$$
Hence, by Lemma 1,
$$\mathrm{PR}(\gamma, t) = \frac{f(\gamma \mid Y, \sigma^2)}{f(t \mid Y, \sigma^2)} = \frac{Q_\gamma}{Q_t}\,b_n^{|\gamma| - |t|}\exp\left\{-\frac{1}{2\sigma^2}(R_\gamma - R_t)\right\} \le v'\,(n\eta_{nm}(\nu)/\lambda)^{-(r^*_\gamma - r_t)/2}\,(\eta_{nm}(\nu))^{-|\gamma^c \wedge t|/2}\,b_n^{|\gamma| - |t|}\exp\left\{-\frac{1}{2\sigma^2}(R_\gamma - R_t)\right\}.$$
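For concreteness, the corresponding unnormalized log posterior of a model with $\sigma^2$ integrated out (the closed form $f(\gamma \mid Y) = c_{n,p}\,Q_\gamma\,b_n^{|\gamma|}(1-w)^p R_\gamma^{-(n-1)/2}$ used later in the proof of Theorem 2) can be evaluated in a few lines of R (a direct, non-updating evaluation for illustration only):

```r
# log f(gamma | Y) up to an additive constant, with
# Q_gamma = lambda^{|gamma|/2} * |Xg'Xg + lambda I|^{-1/2}.
log_model_post <- function(X, ytilde, gamma, lambda, w) {
  n    <- length(ytilde)
  Xg   <- X[, gamma, drop = FALSE]
  U    <- chol(crossprod(Xg) + lambda * diag(length(gamma)))
  v    <- backsolve(U, crossprod(Xg, ytilde), transpose = TRUE)
  R    <- sum(ytilde^2) - sum(v^2)                  # R_gamma
  logQ <- 0.5 * length(gamma) * log(lambda) - sum(log(diag(U)))
  logQ + length(gamma) * log(w / (1 - w)) - 0.5 * (n - 1) * log(R)
}
```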
Lemma 2. For any sequence $h_n \to \infty$, we have $P(R_t - R^*_t > h_n) \le \exp(-c'nh_n/\lambda)$ for some $c' > 0$.

Proof. Since $(n/\lambda)I + (X_t^\top X_t/n)^{-1} \ge (n/\lambda)I$ and $1_n^\top X = 0$, the Sherman–Morrison–Woodbury identity implies
$$0 \le R_t - R^*_t = Y^\top X_t\left[(X_t^\top X_t)^{-1} - \left(\lambda I + X_t^\top X_t\right)^{-1}\right]X_t^\top Y = Y^\top X_t(X_t^\top X_t)^{-1}\left(\lambda^{-1}I + (X_t^\top X_t)^{-1}\right)^{-1}(X_t^\top X_t)^{-1}X_t^\top Y \le (n/\lambda)^{-1}\,Y^\top WY,$$
where $W = nX_t(X_t^\top X_t)^{-2}X_t^\top$ has rank $|t|$ and, by condition C5, has bounded eigenvalues. We want to show that
$$P(R_t - R^*_t > h_n) \le P\left(Y^\top WY > n\lambda^{-1}h_n\right) \le \exp\left(-c'n\lambda^{-1}h_n\right).$$
Next, since $1_n^\top X = 0$ and $X_t^\top WX_t = nI$, we have
$$Y^\top WY = \left(\beta_t^\top X_t^\top + \beta_{t^c}^\top X_{t^c}^\top + \epsilon^\top\right)W\left(X_t\beta_t + X_{t^c}\beta_{t^c} + \epsilon\right) = n\beta_t^\top\beta_t + 2\beta_t^\top X_t^\top WX_{t^c}\beta_{t^c} + \beta_{t^c}^\top X_{t^c}^\top WX_{t^c}\beta_{t^c} + 2\beta_t^\top X_t^\top W\epsilon + 2\beta_{t^c}^\top X_{t^c}^\top W\epsilon + \epsilon^\top W\epsilon.$$
To bound the tail probability, we split the proof into the following five steps.

(i) First, we show that $|\beta_t^\top X_t^\top WX_{t^c}\beta_{t^c}| \precsim \sqrt{n\log p}$. It is clear that $|\beta_t^\top X_t^\top WX_{t^c}\beta_{t^c}| \le \|X_t\beta_t\|\,\|X_{t^c}\beta_{t^c}\|\,\alpha_{\max}(W)$, and $\|X_t\beta_t\|^2 = \beta_t^\top X_t^\top X_t\beta_t = n\beta_t^\top(X_t^\top X_t/n)\beta_t \le nc_0\|\beta_t\|^2$ for some constant $c_0 > 0$. By condition C5 we also know that $\alpha_{\max}(W)$ is bounded. Then, with $\|X_{t^c}\beta_{t^c}\|^2 \precsim \log p$ from condition C3, we have $|\beta_t^\top X_t^\top WX_{t^c}\beta_{t^c}| \le c_1\sqrt{n\log p}$ for some constant $c_1 > 0$.

(ii) Next, we show that $|\beta_{t^c}^\top X_{t^c}^\top WX_{t^c}\beta_{t^c}| \precsim \log p$. Condition C3 and the fact that $W$ has bounded eigenvalues imply that $|\beta_{t^c}^\top X_{t^c}^\top WX_{t^c}\beta_{t^c}| \le \|X_{t^c}\beta_{t^c}\|^2\,\alpha_{\max}(W) \le c_2\log p$.

(iii) Next, we show that $P(\epsilon^\top W\epsilon \ge a) \le P(c_3\chi^2_{|t|} \ge a)$ for all $a > 0$ and $n \ge 1$, and for some $c_3 > 0$, where $\chi^2_{|t|}$ is distributed as $\chi^2$ with $|t|$ degrees of freedom. To that end, let $W = P\Lambda P^\top$, where $P$ is an orthogonal matrix and $\Lambda = \mathrm{diag}\{\lambda_1, \ldots, \lambda_n\}$ is the diagonal matrix of the eigenvalues of $W$. Then, since $P^\top\epsilon \sim N(0, \sigma^2 I)$, $\epsilon^\top W\epsilon = \sum_{i=1}^{|t|}\lambda_iG_i^2$, where $G_i \overset{iid}{\sim} N(0, \sigma^2)$, $i = 1, \ldots, |t|$. Since the eigenvalues of $W$ are bounded, $\lambda_i\sigma^2 \le c_3$ for some $c_3 > 0$ for all $i = 1, \ldots, |t|$. Hence, $P(\epsilon^\top W\epsilon \ge a) = P\left(\sum_{i=1}^{|t|}\lambda_iG_i^2 \ge a\right) \le P(c_3\chi^2_{|t|} \ge a)$.

(iv) Next, we show that $P(|\beta_{t^c}^\top X_{t^c}^\top W\epsilon| \ge a) \le P(c_4\chi_{|t|} \ge a/\sqrt{\log p})$ for all $a > 0$, $n \ge 1$, and for some $c_4 > 0$, where $\chi_{|t|}$ is distributed as $\chi$ with $|t|$ degrees of freedom. To that end, note that by the Cauchy–Schwarz inequality, $|\beta_{t^c}^\top X_{t^c}^\top W\epsilon| \le \|\beta_{t^c}^\top X_{t^c}^\top\|\,\|W\epsilon\|$, and $\|W\epsilon\|$ is stochastically dominated by a constant multiple of a $\chi_{|t|}$-distributed random variable, since $W = P\Lambda P^\top$ has rank $|t|$ and bounded eigenvalues. Hence, by condition C3, there exists $c_4 > 0$ such that
$$P\left(|\beta_{t^c}^\top X_{t^c}^\top W\epsilon| \ge a\right) \le P\left(\|\beta_{t^c}^\top X_{t^c}^\top\|\,\|W\epsilon\| \ge a\right) \le P\left(c_4(\log p)^{1/2}\chi_{|t|} \ge a\right).$$

(v) Thus, for all sufficiently large $n$, we have
$$Y^\top WY \le 2\beta_t^\top X_t^\top W\epsilon + 2\beta_{t^c}^\top X_{t^c}^\top W\epsilon + \epsilon^\top W\epsilon + n\beta_t^\top\beta_t + 2c_1\sqrt{n\log p} + c_2\log p.$$
Now note that $2\beta_t^\top X_t^\top W\epsilon = 2n\beta_t^\top(X_t^\top X_t)^{-1}X_t^\top\epsilon \sim N(0, nC_n)$, where $C_n = 4\sigma^2\beta_t^\top\left(X_t^\top X_t/n\right)^{-1}\beta_t$ is bounded. Also note that $n\lambda^{-1}h_n - n\beta_t^\top\beta_t - 2c_1\sqrt{n\log p} - c_2\log p > n\lambda^{-1}h_n/2$ and $n\lambda^{-1}h_n/\log p > 1$ for sufficiently large $n$. Thus, for sufficiently large $n$, we have
$$P\left(Y^\top WY > n\lambda^{-1}h_n\right) \le P\left(2\beta_t^\top X_t^\top W\epsilon + 2\beta_{t^c}^\top X_{t^c}^\top W\epsilon + \epsilon^\top W\epsilon > \frac{n\lambda^{-1}h_n}{2}\right) \le P\left(2\beta_t^\top X_t^\top W\epsilon > \frac{n\lambda^{-1}h_n}{6}\right) + P\left(2\beta_{t^c}^\top X_{t^c}^\top W\epsilon > \frac{n\lambda^{-1}h_n}{6}\right) + P\left(\epsilon^\top W\epsilon > \frac{n\lambda^{-1}h_n}{6}\right)$$
$$\le P\left(\frac{2\beta_t^\top X_t^\top W\epsilon}{\sqrt{nC_n}} > \frac{\sqrt{n}\,h_n}{6\sqrt{C_n}\,\lambda}\right) + P\left(c_4^2\chi^2_{|t|} > \frac{1}{36}\frac{(n\lambda^{-1}h_n)^2}{\log p}\right) + P\left(c_3\chi^2_{|t|} > \frac{nh_n}{6\lambda}\right) \le \exp\left(-c''n\lambda^{-1}h_n\right) + \exp\left(-c'''n\lambda^{-1}h_n\right) + \exp\left(-c''''n\lambda^{-1}h_n\right) \le \exp\left(-c'n\lambda^{-1}h_n\right),$$
for some constants $c', c'', c'''$ and $c'''' > 0$.

S4 Proof of Theorem 1
To prove the model selection consistency, we use the same strategy as in Narisetty and He (2014) and divide the set of models into the following subsets:

(i) Unrealistically large models: $M_1 = \{\gamma : r_\gamma > u_n\}$, the models of rank greater than $u_n$. Abusing notation, we use $u_n$ and $u_n(\nu)$ interchangeably.

(ii) Over-fitted models: $M_2 = \{\gamma : \gamma \supset t,\ r_\gamma \le u_n\}$, the models of rank at most $u_n$ which include all the important variables and at least one unimportant variable.

(iii) Large models: $M_3 = \{\gamma : \gamma \not\supset t,\ J|t| < r_\gamma \le u_n\}$, that is, the models which miss one or more important variables and have rank greater than $J|t|$ but at most $u_n$, for some fixed positive integer $J$.

(iv) Under-fitted models: $M_4 = \{\gamma : \gamma \not\supset t,\ r_\gamma \le J|t|\}$, the models which have rank at most $J|t|$ and miss at least one important variable.

We aim to show that $\sum_{\gamma \in M_k}\mathrm{PR}(\gamma, t) \overset{P}{\longrightarrow} 0$ for each $k = 1, 2, 3, 4$, with $\sigma^2$ known.

S4.1 Unrealistically large models
We first want to prove that the sum of the posterior ratios $\mathrm{PR}(\gamma, t)$ over $\gamma \in M_1$ converges to zero exponentially fast. Note that $M_1$ is empty if $p < n/((2+\nu)\log p)$: in that case $u_n(\nu) = p$ and $r_\gamma \le p$, which contradicts the definition of $M_1$. We first identify a set of events that occur with vanishingly small probability. Note that for any $s > 0$,
$$P\left[\cup_{\gamma \in M_1}\left\{R_t - R_\gamma > n(1+2s)\sigma^2\right\}\right] \le P\left[R_t > n(1+2s)\sigma^2\right] \le P\left[R^*_t > (1+s)n\sigma^2\right] + P\left[R_t - R^*_t > sn\sigma^2\right] = P\left[\frac{R^*_t}{n\sigma^2} - 1 > s\right] + P\left[R_t - R^*_t > sn\sigma^2\right] \le \exp\{-cn\} + \exp\{-c'sn^2\sigma^2/\lambda\}, \tag{S2}$$
for some $c, c' > 0$, due to Lemma A.2 of Narisetty and He (2014) and Lemma 2.

Now we consider the term $n\eta_{nm}(\nu)/\lambda$. Condition C2 indicates that $n\eta_{nm}(\nu)/\lambda \precsim n \vee p^{2+2\delta}$, because $\eta_{nm}(\nu)$ is the smallest eigenvalue of a correlation matrix, i.e., $\eta_{nm}(\nu) < 1$, and condition C4 implies that $n\eta_{nm}(\nu)/\lambda \succsim n \vee p^{2+2\delta}$; that is,
$$\left(n \vee p^{2+2\delta}\right) \precsim n\eta_{nm}(\nu)/\lambda \precsim \left(n \vee p^{2+2\delta}\right). \tag{S3}$$
We then restrict our attention to the high probability event $\cap_{\gamma \in M_1}\{R_t - R_\gamma \le n(1+2s)\sigma^2\}$ for $s < \delta/(2(2+\delta))$. Note that, in this case, the upper bound (S2) on the probability of the complement of this event is bounded by $\exp\{-c''n\}$ for some $c'' > 0$. First, by Lemma 1, we have
$$\sum_{\gamma \in M_1}\mathrm{PR}(\gamma, t) \precsim \sum_{\gamma \in M_1}v'\,(n\eta_{nm}(\nu)/\lambda)^{-(r^*_\gamma - r_t)/2}\,(\eta_{nm}(\nu))^{-|\gamma^c \wedge t|/2}\,b_n^{|\gamma| - |t|}\,e^{n(1+2s)/2} \precsim \sum_{\gamma \in M_1}p^{-(1+\delta)(u_n - |t|)}\,(\eta_{nm}(\nu))^{-|t|/2}\,b_n^{|\gamma| - |t|}\,e^{n(1+2s)/2},$$
because for all $\gamma \in M_1$, $r^*_\gamma = r_\gamma \wedge u_n = u_n$, $|\gamma^c \wedge t| \le |t|$, and condition C4 is in force. Recall that $b_n \sim p^{-1}$; thus $(1 + b_n)^p \sim e$. Also, by condition C1, $p = \exp(nd_n)$ for some $d_n \to 0$. Then, since by condition C4, $u_n = n/((2+\nu)\log p) \ge n/((2+\delta)\log p)$ as $\nu < \delta$, we have
$$\sum_{\gamma \in M_1}\mathrm{PR}(\gamma, t) \precsim \sum_{\gamma \in M_1}e^{-(1+\delta)(u_n - |t|)\log p}\,b_n^{|\gamma| - |t|}\,(\eta_{nm}(\nu))^{-|t|/2}\,e^{n(1+2s)/2} \precsim \sum_{\gamma \in M_1}e^{-\frac{(1+\delta)n}{2+\delta}}\,e^{n(1+2s)/2}\,p^{\kappa|t|/2}\,b_n^{|\gamma| - |t|} \precsim e^{-\frac{(1+\delta)n}{2+\delta}}\,e^{n(1+2s)/2}\,p^{(1+\kappa/2)|t|}\sum_{|\gamma| = u_n}^{p}\binom{p}{|\gamma|}b_n^{|\gamma|} \precsim e^{-\frac{(1+\delta)n}{2+\delta}}\,e^{n(1+2s)/2}\,e^{n(1+\kappa/2)|t|d_n}(1 + b_n)^p \precsim e^{-v''n} \longrightarrow 0,$$
as $n \to \infty$ for some $v'' > 0$, provided $s$ satisfies $(1+2s)/2 < (1+\delta)/(2+\delta)$, i.e., $s < \delta/(2(2+\delta))$. Therefore, we have
$$\sum_{\gamma \in M_1}\mathrm{PR}(\gamma, t) \overset{P}{\longrightarrow} 0. \tag{S4}$$

S4.2 Over-fitted models
Models in $M_2$ include all the important variables plus one or more unimportant variables. For $\gamma \in M_2$,
$$R^*_t - R^*_\gamma = Y^\top(I - P_t)Y - Y^\top(I - P_\gamma)Y = \left\|(P_\gamma - P_t)(X_t\beta_t + X_{t^c}\beta_{t^c} + \epsilon)\right\|^2 = \left\|(P_\gamma - P_t)X_{t^c}\beta_{t^c} + (P_\gamma - P_t)\epsilon\right\|^2 \le \left(\|X_{t^c}\beta_{t^c}\| + \sqrt{\epsilon^\top(P_\gamma - P_t)\epsilon}\right)^2.$$
Since $\epsilon^\top(P_\gamma - P_t)\epsilon/\sigma^2 \sim \chi^2_{r_\gamma - r_t}$, for any $x > 0$ and for some $\sqrt{1/2} < v < 1$, we have, for all sufficiently large $n$,
$$P\left[R^*_t - R^*_\gamma > \sigma^2(2+3x)(r_\gamma - r_t)\log p\right] \le P\left[\left(\|X_{t^c}\beta_{t^c}\| + \sqrt{\epsilon^\top(P_\gamma - P_t)\epsilon}\right)^2 > \sigma^2(2+3x)(r_\gamma - r_t)\log p\right]$$
$$= P\left[\frac{\epsilon^\top(P_\gamma - P_t)\epsilon}{\sigma^2(2+3x)(r_\gamma - r_t)\log p} > 1 - \frac{2\|X_{t^c}\beta_{t^c}\|}{\sqrt{\sigma^2(2+3x)(r_\gamma - r_t)\log p}} - \frac{\|X_{t^c}\beta_{t^c}\|^2}{\sigma^2(2+3x)(r_\gamma - r_t)\log p}\right] \le P\left[\epsilon^\top(P_\gamma - P_t)\epsilon > \sigma^2(2 + 3v^2x)(r_\gamma - r_t)\log p\right]$$
$$\le P\left[\chi^2_{r_\gamma - r_t} - (r_\gamma - r_t) > \left\{(2 + 3v^2x)\log p - 1\right\}(r_\gamma - r_t)\right] \le P\left\{\chi^2_{r_\gamma - r_t} - (r_\gamma - r_t) > 2\sqrt{(r_\gamma - r_t)\left[(r_\gamma - r_t)(1+x)\log p + a\right]} + 2\left[(r_\gamma - r_t)(1+x)\log p + a\right]\right\}$$
$$\le \exp\left\{-(r_\gamma - r_t)(1+x)\log p - a\right\} \le c_3\exp\left\{-(1+x)(r_\gamma - r_t)\log p\right\} = c_3\,p^{-(1+x)(r_\gamma - r_t)}, \tag{S5}$$
where $c_3 = \exp(-a) > 0$, $a$ is a constant such that
$$2\sqrt{(r_\gamma - r_t)\left[(r_\gamma - r_t)(1+x)\log p + a\right]} + 2\left[(r_\gamma - r_t)(1+x)\log p + a\right] < \left\{(2 + 3v^2x)\log p - 1\right\}(r_\gamma - r_t),$$
and the second to last inequality uses the tail bound of Laurent and Massart (2000). Now, consider $0 < s < \delta/8$ and define the event
$$E(\gamma) := \left\{R_t - R_\gamma > \sigma^2(1+4s)(r_\gamma - r_t)\log p\right\} \subset \left\{R_t - R_\gamma > \sigma^2(1+2s)(r_\gamma - r_t)\log p\right\}.$$
Then, for a fixed dimension $d > r_t$, consider the event $U(d) := \cup_{\{\gamma: r_\gamma = d\}}E(\gamma)$. Since $R_\gamma \ge R^*_\gamma$, we have
$$P[U(d)] \le P\left[\cup_{\{\gamma: r_\gamma = d\}}\left\{R_t - R_\gamma > \sigma^2(1+2s)(d - r_t)\log p\right\}\right] \le P\left[\cup_{\{\gamma: r_\gamma = d\}}\left\{R_t - R^*_\gamma > \sigma^2(1+2s)(d - r_t)\log p\right\}\right]$$
$$\le P\left[\cup_{\{\gamma: r_\gamma = d\}}\left\{R^*_t - R^*_\gamma > \sigma^2(2+3s)(d - r_t)\log p\right\}\right] + P\left[R_t - R^*_t > s\sigma^2(d - r_t)\log p\right] \le \sum_{\gamma: r_\gamma = d}P\left[R^*_t - R^*_\gamma > \sigma^2(2+3s)(d - r_t)\log p\right] + P\left[R_t - R^*_t > s\sigma^2(d - r_t)\log p\right]$$
$$\le \sum_{\gamma: r_\gamma = d}c_3\,p^{-(1+s)(d - r_t)} + \exp\left\{-c'ns\sigma^2(d - r_t)(\log p)/\lambda\right\} \le c_3\,p^{-(1+s)(d - r_t)}\,p^{d - r_t} + \exp\left\{-c_4s(d - r_t)\log p\right\} = c_3\,p^{-s(d - r_t)} + p^{-c_4s(d - r_t)} \le c_5\,p^{-c_6s(d - r_t)},$$
for some $c_5, c_6 > 0$, where the fifth and the sixth inequalities hold due to (S5), Lemma 2, condition C2, and the fact that the event $\left\{R^*_t - R^*_\gamma > \sigma^2(2+3s)(d - r_t)\log p\right\}$ depends only on the projection matrix $P_{\gamma \wedge t^c}$, so that the union $\cup_{\{\gamma: r_\gamma = d\}}$ can be written over a smaller set of events indexed by $P_{\gamma \wedge t^c}$. Note that since there exist at most $p^k$ subspaces of rank $k$, the cardinality of such projections is at most $p^{d - r_t}$.
Next, we consider the union of all such events $U(d)$; that is,
$$P\left[\cup_{\{d > r_t\}}U(d)\right] \le \sum_{\{d > r_t\}}P[U(d)] \le c_5\sum_{d > r_t}p^{-c_6s(d - r_t)} \le c_5\sum_{d - r_t = 1}^{\infty}p^{-c_6s(d - r_t)} \le c_5\,\frac{p^{-c_6s}}{1 - p^{-c_6s}} = \frac{c_5}{p^{c_6s} - 1} \longrightarrow 0$$
as $n \to \infty$. Note that $r^*_\gamma = r_\gamma$, as $r_\gamma \le u_n$ for $\gamma \in M_2$. Then, again restricting to the high probability event $\cap_{\{d > r_t\}}U(d)^c$, by Lemma 1, (S3), and the fact that $\gamma^c \wedge t$ is empty, we have
$$\sum_{\gamma \in M_2}\mathrm{PR}(\gamma, t) \precsim \sum_{\gamma \in M_2}(n\eta_{nm}(\nu)/\lambda)^{-(r^*_\gamma - r_t)/2}\,b_n^{|\gamma| - |t|}\exp\left\{\frac{1}{2\sigma^2}(R_t - R_\gamma)\right\} \precsim \sum_{\gamma \in M_2}\left(p^{-(2+2\delta)} \wedge n^{-1}\right)^{(r_\gamma - r_t)/2}b_n^{|\gamma| - |t|}\,p^{(1+4s)(r_\gamma - r_t)}$$
$$= \sum_{\gamma \in M_2}\left(p^{1+\delta} \vee \sqrt{n}\right)^{-(r_\gamma - r_t)}b_n^{|\gamma| - |t|}\,p^{(1+4s)(r_\gamma - r_t)} = \sum_{\gamma \in M_2}\left(p^{\delta - 4s} \vee \sqrt{n}\,p^{-1-4s}\right)^{-(r_\gamma - r_t)}b_n^{|\gamma| - |t|} \precsim \sum_{\gamma \in M_2}\left(p^{-\delta/2} \wedge \frac{p^{1+\delta/2}}{\sqrt{n}}\right)^{(r_\gamma - r_t)}b_n^{|\gamma| - |t|}$$
$$\precsim \left(p^{-\delta/2} \wedge \frac{p^{1+\delta/2}}{\sqrt{n}}\right)\sum_{|\gamma| = |t|+1}^{p}\binom{p}{|\gamma| - |t|}b_n^{|\gamma| - |t|} \precsim \left(p^{-\delta/2} \wedge \frac{p^{1+\delta/2}}{\sqrt{n}}\right)(1 + b_n)^p \sim \rho_n \longrightarrow 0,$$
as $n \to \infty$, where $\rho_n = \left(p^{-\delta/2} \wedge p^{1+\delta/2}/\sqrt{n}\right)$. In the above, we used the facts that $\rho_n \le 1$ and $r_\gamma - r_t \ge 1$; note also that $\delta - 4s \ge \delta/2$ since $0 < s < \delta/8$. Hence, we have
$$\sum_{\gamma \in M_2}\mathrm{PR}(\gamma, t) \overset{P}{\longrightarrow} 0. \tag{S6}$$

S4.3 Large models

For models in $M_3$, where the rank is greater than $J|t|$ and one or more important variables are not included, similar to what we have shown in Section S4.2, for $0 < s < \delta/8$ we use the event
$$E(\gamma) \subset \left\{R_t - R_{\gamma \vee t} > \sigma^2(1+2s)(r_\gamma - r_t)\log p\right\}.$$
Then we consider the union of such events $U(d) = \cup_{\{\gamma: r_\gamma = d\}}E(\gamma)$ for $d > J|t|$. Using (S5) with the fact that $r_{\gamma \vee t} \ge r_\gamma$ and Lemma 2, we have
$$P[U(d)] \le P\left[\cup_{\{\gamma: r_\gamma = d\}}\left\{R_t - R_{\gamma \vee t} > \sigma^2(1+2s)(r_\gamma - r_t)\log p\right\}\right] \le P\left[\cup_{\{\gamma: r_\gamma = d\}}\left\{R_t - R^*_{\gamma \vee t} > \sigma^2(1+2s)(r_\gamma - r_t)\log p\right\}\right]$$
$$\le \sum_{\{\gamma: r_\gamma = d\}}P\left[R^*_t - R^*_{\gamma \vee t} > \sigma^2(2+3s)(d - r_t)\log p\right] + P\left[R_t - R^*_t > s\sigma^2(d - r_t)\log p\right] \le c_3\,p^{-(1+s)(d - r_t)}\,p^d + e^{-c'ns\sigma^2(d - r_t)(\log p)/\lambda} \le c_7\,p^{-c_8d},$$
for some $c_7, c_8 > 0$. Then,
$$P\left[\cup_{\{d > J|t|\}}U(d)\right] \le \sum_{d > J|t|}P[U(d)] \le \sum_{d > J|t|}c_7\,p^{-c_8d} \longrightarrow 0 \quad \text{as } n \to \infty.$$
Now, restricting our attention to the high probability event $\cap_{\{d > J|t|\}}U(d)^c$, we have
$$\sum_{\gamma \in M_3}\mathrm{PR}(\gamma, t) \precsim \sum_{\gamma \in M_3}(n\eta_{nm}(\nu)/\lambda)^{-(r^*_\gamma - r_t)/2}\,(\eta_{nm}(\nu))^{-|\gamma^c \wedge t|/2}\,b_n^{|\gamma| - |t|}\exp\left\{\frac{1}{2\sigma^2}(R_t - R_\gamma)\right\} \precsim \sum_{\gamma \in M_3}\left(p^{1+\delta} \vee \sqrt{n}\right)^{-(r_\gamma - r_t)}(\eta_{nm}(\nu))^{-|t|/2}\,b_n^{|\gamma| - |t|}\,p^{(1+4s)(r_\gamma - r_t)}$$
$$\precsim \sum_{\gamma \in M_3}\left(p^{\delta - 4s} \vee \sqrt{n}\,p^{-1-4s}\right)^{-(r_\gamma - r_t)}p^{\kappa|t|/2}\,b_n^{|\gamma| - |t|} \precsim \sum_{\gamma \in M_3}\left(p^{-\delta/2} \wedge \frac{p^{1+\delta/2}}{\sqrt{n}}\right)^{r_\gamma - r_t}p^{\kappa|t|/2}\,b_n^{|\gamma| - |t|}$$
$$\precsim \left(p^{-\delta/2} \wedge \frac{p^{1+\delta/2}}{\sqrt{n}}\right)^{(J-1)r_t + 1}p^{\delta(J-1)|t|/2}\sum_{\gamma \in M_3}b_n^{|\gamma| - |t|} \precsim \rho_n^{(J-1)r_t + 1}\,p^{\delta(J-1)|t|/2}\,(1 + b_n)^p \left(\sim \rho_n^{(J-1)|t|/2}\right) \longrightarrow 0$$
as $n \to \infty$. In the above, we used the fact that $\kappa < (J-1)\delta/2$ by condition C4. Note that $\rho_n^{r_\gamma - r_t} \le \rho_n^{(J-1)r_t + 1}$, because $r_\gamma > J|t| = Jr_t$ and $\rho_n \le 1$. Thus, we have
$$\sum_{\gamma \in M_3}\mathrm{PR}(\gamma, t) \overset{P}{\longrightarrow} 0. \tag{S7}$$

S4.4 Under-fitted models
First, we will prove that for $c_1 \in (0, 1)$,
$$P\left[\cup_{\gamma \in M_4}\left\{R_\gamma - R_t < \Delta_n(1 - c_1)\right\}\right] \longrightarrow 0,$$
where $\Delta_n \equiv \Delta_n(J)$ is defined in condition C4. Since $1_n^\top X = 0$, by conditions C3 and C4 we have
$$R^*_\gamma - R^*_{\gamma \vee t} = \left\|(P_{\gamma \vee t} - P_\gamma)Y\right\|^2 = \left\|(P_{\gamma \vee t} - P_\gamma)X_t\beta_t + (P_{\gamma \vee t} - P_\gamma)X_{t^c}\beta_{t^c} + (P_{\gamma \vee t} - P_\gamma)\epsilon\right\|^2 \ge \left(\left\|(P_{\gamma \vee t} - P_\gamma)X_t\beta_t\right\| - \left\|(P_{\gamma \vee t} - P_\gamma)\epsilon\right\|\right)^2 = \left(\left\|(I - P_\gamma)X_t\beta_t\right\| - \left\|(P_{\gamma \vee t} - P_\gamma)\epsilon\right\|\right)^2 \ge \left(\sqrt{\Delta_n} - \left\|(P_{\gamma \vee t} - P_\gamma)\epsilon\right\|\right)^2,$$
for all large $n$, the contribution of $X_{t^c}\beta_{t^c}$ being negligible by condition C3. Since $\|P_t\epsilon\|^2/\sigma^2 \sim \chi^2_{r_t}$, for any $v' \in (0, 1)$ we have
$$P\left[\cup_{\gamma \in M_4}\left\{R^*_\gamma - R^*_{\gamma \vee t} < (1 - v')^2\Delta_n\right\}\right] \le P\left[\cup_{\gamma \in M_4}\left\{\left(\sqrt{\Delta_n} - \left\|(P_{\gamma \vee t} - P_\gamma)\epsilon\right\|\right)^2 < (1 - v')^2\Delta_n\right\}\right] \le P\left[\cup_{\gamma \in M_4}\left\{\left\|(P_{\gamma \vee t} - P_\gamma)\epsilon\right\|^2 > v'^2\Delta_n\right\}\right] \le P\left[\|P_t\epsilon\|^2 > v'^2\Delta_n\right] \le e^{-c_2\Delta_n}, \tag{S8}$$
for some constant $c_2 > 0$. We also have, for any $v' \in (0, 1)$,
$$P\left[\cup_{\gamma \in M_4}\left\{R^*_{\gamma \vee t} - R_{\gamma \vee t} < -\Delta_nv'/2\right\}\right] < e^{-c_2\Delta_n},$$
for some constant $c_2 > 0$. To see this, let $X_{\gamma \vee t} = U_{n \times r}\Lambda_{r \times r}V^\top_{r \times |\gamma \vee t|}$ be the SVD of $X_{\gamma \vee t}$, where $r = \mathrm{rank}(X_{\gamma \vee t})$. Then $P_{\gamma \vee t} = UU^\top$ is the projection matrix onto the column space of $X_{\gamma \vee t}$, and thus, using $1_n^\top X = 0$ and equation (4) in the main paper, we have
$$R_{\gamma \vee t} - R^*_{\gamma \vee t} = Y^\top\left(I + \lambda^{-1}U\Lambda^2U^\top\right)^{-1}Y - Y^\top\left(I - UU^\top\right)Y = \lambda Y^\top U\left(\lambda I + \Lambda^2\right)^{-1}U^\top Y \le \left(n\eta_{nm}(\nu)/\lambda\right)^{-1}Y^\top UU^\top Y, \tag{S9}$$
where the last inequality holds because $\lambda I + \Lambda^2 \ge \Lambda^2 \ge n\eta_{nm}(\nu)I$. Since the rank of $U$ is at most $(J+1)|t|$, by (S9) and (S3) we have
$$P\left[\cup_{\gamma \in M_4}\left\{R^*_{\gamma \vee t} - R_{\gamma \vee t} < -\frac{\Delta_nv'}{2}\right\}\right] \le P\left[\cup_{\gamma \in M_4}\left\{\left(n\lambda^{-1}\eta_{nm}(\nu)\right)^{-1}Y^\top UU^\top Y > \frac{\Delta_nv'}{2}\right\}\right] \precsim \exp\left\{-v_1n\lambda^{-1}\eta_{nm}(\nu)\Delta_n\right\}p^{(J+1)|t|} \precsim \exp\left\{-p^{2\delta}\Delta_n + (J+1)|t|\log p\right\} \precsim e^{-c_2\Delta_n}. \tag{S10}$$
The last inequality above holds because, by condition C4, $(J+1)|t|\log p/\Delta_n \longrightarrow 0$ as $n \to \infty$. Then, since $R_\gamma \ge R^*_\gamma$, from (S8) and (S10) we have, for any $v \in (0, 1)$,
$$P\left[\cup_{\gamma \in M_4}\left\{R_\gamma - R_{\gamma \vee t} < \Delta_n(1 - v)\right\}\right] \le P\left[\cup_{\gamma \in M_4}\left\{R^*_\gamma - R^*_{\gamma \vee t} < \Delta_n(1 - v/2)\right\}\right] + P\left[\cup_{\gamma \in M_4}\left\{R^*_{\gamma \vee t} - R_{\gamma \vee t} < -\Delta_nv/2\right\}\right] \le 2e^{-c_2\Delta_n} \longrightarrow 0, \tag{S11}$$
for some constant $c_2 > 0$. Due to (S11) and Lemma 2 with condition C2, for $0 < c_1 = 3v < 1$ we have
$$P\left[\cup_{\gamma \in M_4}\left\{R_\gamma - R_t < \Delta_n(1 - c_1)\right\}\right] \le P\left[\cup_{\gamma \in M_4}\left\{R_\gamma - R_{\gamma \vee t} < \Delta_n(1 - v)\right\}\right] + P\left[\cup_{\gamma \in M_4}\left\{R_{\gamma \vee t} - R_t < -2\Delta_nv\right\}\right] \le P\left[\cup_{\gamma \in M_4}\left\{R_\gamma - R_{\gamma \vee t} < \Delta_n(1 - v)\right\}\right] + P\left[\cup_{\gamma \in M_4}\left\{R_t - R_{\gamma \vee t} > 2\Delta_nv\right\}\right]$$
$$\le \exp\{-c_2\Delta_n\} + P\left[R_t - R^*_t > \Delta_nv\right] + P\left[\cup_{\gamma \in M_4}\left\{R^*_t - R^*_{\gamma \vee t} > \Delta_nv\right\}\right] \le \exp\{-c_2\Delta_n\} + \exp\{-c'\Delta_n\} + P\left[\chi^2_{J|t|} > \Delta_nv\right] \le 4\exp\{-c_9\Delta_n\} \longrightarrow 0,$$
for some constant $c_9 > 0$. Therefore, restricting to the high probability event $\{R_\gamma - R_t \ge \Delta_n(1 - c_1)\ \forall\,\gamma \in M_4\}$, by Corollary 1 and (S3) we get
$$\sum_{\gamma \in M_4}\mathrm{PR}(\gamma, t) \precsim \sum_{\gamma \in M_4}(n\eta_{nm}(\nu)/\lambda)^{-(r^*_\gamma - r_t)/2}\,(\eta_{nm}(\nu))^{-|\gamma^c \wedge t|/2}\,b_n^{|\gamma| - |t|}\exp\left\{-\frac{1}{2\sigma^2}(R_\gamma - R_t)\right\} \precsim \sum_{\gamma \in M_4}\left(p^{2+2\delta} \vee n\right)^{|t|/2}p^{(2+\delta)|t|/2}\,b_n^{|\gamma| - |t|}\exp\left\{-\frac{\Delta_n(1 - c_1)}{2\sigma^2}\right\}, \tag{S12}$$
because $r_t - r^*_\gamma < r_t = |t|$ and $\eta_{nm}(\nu) = (n\eta_{nm}(\nu)/\lambda)/(n/\lambda) \succsim p^{-(2+\delta)}$ due to conditions C2 and C4. Then, by (S12), we have
$$\sum_{\gamma \in M_4}\mathrm{PR}(\gamma, t) \precsim \exp\left\{-\frac{1}{2\sigma^2}\left(\Delta_n(1 - c_1) - \sigma^2|t|\log\left(p^{2+2\delta} \vee n\right) - \sigma^2(2+\delta)|t|\log p\right)\right\}\sum_{\gamma \in M_4}b_n^{|\gamma|} \precsim \exp\left\{-\frac{1}{2\sigma^2}\left[\Delta_n(1 - c_1) - \sigma^2|t|\log\left(p^{4+3\delta} \vee np^{2+\delta}\right)\right]\right\}(1 + b_n)^p \precsim \exp\left\{-\frac{1}{2\sigma^2}\left(\Delta_n(1 - c_1) - c_{10}\tau_n\right)\right\} \longrightarrow 0 \tag{S13}$$
as $n \to \infty$, where $c_{10} > 0$ and $\tau_n = 5(1 + \delta)\log(\sqrt{n} \vee p)$. To see the last inequality, we consider two cases. First, if $\sqrt{n} \le p$, then $\tau_n = 5(1+\delta)\log p$ and $np^{2+\delta} < p^{4+\delta} < p^{4+3\delta}$, and thus $\log\left(p^{4+3\delta} \vee np^{2+\delta}\right) = \log\left(p^{4+3\delta}\right) < \log\left(p^{5(1+\delta)}\right) = \tau_n$. Next, if $\sqrt{n} > p$, then $\tau_n = 5(1+\delta)(\log n)/2$ and $p^{4+3\delta} \vee np^{2+\delta} < n^{(4+3\delta)/2}$, so that $\log\left(p^{4+3\delta} \vee np^{2+\delta}\right) < \frac{4+3\delta}{2}\log n < \tau_n$. Since $\tau_n/\Delta_n \to 0$ by condition C4, we conclude that $\sum_{\gamma \in M_4}\mathrm{PR}(\gamma, t) \overset{P}{\longrightarrow} 0$, which completes the proof of Theorem 1.

S5 Proof of Theorem 2
Next, we will show that with a prior on $\sigma^2$ as in (2c), the model selection consistency holds under the assumption that $P(\gamma \in \widetilde{M}_1) = 0$. Note that since $\log p = o(n)$ and $\nu' > \nu$, we have $M_1 \subset \widetilde{M}_1$ eventually. Thus $P(\gamma \in M_1) = 0$ for all large $n$. Therefore, we shall show that $\sum_{\gamma \in \widetilde{M}_k}\widetilde{\mathrm{PR}}(\gamma, t) \overset{P}{\longrightarrow} 0$ for $k = 2, 3, 4$, where $\widetilde{M}_k = M_k \cap \widetilde{M}_1^c$ and $\widetilde{\mathrm{PR}}(\gamma, t) \equiv P(\gamma \mid Y)/P(t \mid Y)$. By (2d) and (3) of the main paper, we have
$$f(\gamma \mid Y) = c_{n,p}\,Q_\gamma\,b_n^{|\gamma|}(1 - w)^p\,R_\gamma^{-(n-1)/2}.$$
By condition C2 and Lemma 1, we then get
$$\widetilde{\mathrm{PR}}(\gamma, t) \precsim (n\eta_{nm}(\nu)/\lambda)^{-(r^*_\gamma - r_t)/2}\,(\eta_{nm}(\nu))^{-|t \wedge \gamma^c|/2}\,b_n^{|\gamma| - |t|}\,(R_\gamma/R_t)^{-(n-1)/2}. \tag{S15}$$
Define $\zeta_n := R_t/(n\sigma^2) - 1$. Due to our Lemma 2 and Lemma A.2(ii) of Narisetty and He (2014), for $\varphi > 0$ we have
$$P(|\zeta_n| > \varphi) = P\left(\left|\frac{R^*_t}{n\sigma^2} - 1 + \frac{R_t - R^*_t}{n\sigma^2}\right| > \varphi\right) \le P\left(\left|\frac{R^*_t}{n\sigma^2} - 1\right| > \frac{\varphi}{2}\right) + P\left(R_t - R^*_t \ge \frac{\varphi n\sigma^2}{2}\right) \le 2\exp(-c_{11}n), \tag{S16}$$
for some positive quantity $c_{11}$ depending on $\varphi$. From (S15) we have
$$\widetilde{\mathrm{PR}}(\gamma, t) \precsim (n\eta_{nm}(\nu)/\lambda)^{-(r^*_\gamma - r_t)/2}\,(\eta_{nm}(\nu))^{-|t \wedge \gamma^c|/2}\,b_n^{|\gamma| - |t|}\left(1 + \frac{R_\gamma - R_t}{n\sigma^2(1 + \zeta_n)}\right)^{-\frac{n-1}{2}}. \tag{S17}$$
Define $z_n := (r_\gamma - r_t)\log p/n$. Note that for models in $\widetilde{M}_2$, $r_\gamma > r_t$. Since condition C6 is in force, we have $z_n < 1/(2 + \nu')$, and we choose $s > 0$ and $\tilde{\varphi} > 0$ such that $(1 + 4s)/\{(1 - \tilde{\varphi})(2 + \nu')\} < 1$ and
$$0 < \frac{(1 + 4s)/(1 - \tilde{\varphi})}{1 - (1 + 4s)/[(1 - \tilde{\varphi})(2 + \nu')]} < 2 + \delta,$$
which is possible since $\nu'\delta > 2$. Consequently,
$$x_n := -\log\left(1 - \frac{(1 + 4s)z_n}{1 - \tilde{\varphi}}\right) < \frac{(1 + 4s)z_n}{(1 - \tilde{\varphi})\left\{1 - (1 + 4s)z_n/(1 - \tilde{\varphi})\right\}} < (2 + \delta)z_n, \tag{S18}$$
where the first inequality follows from the fact that $-\log(1 - x) < x/(1 - x)$ for $0 < x < 1$. In a similar way as in Section S4.2, we only consider the high probability event $\left\{\cap_{\{d > r_t\}}U(d)^c\right\} \cap \left\{|\zeta_n| < \tilde{\varphi}\right\}$, where $U(d)$ is defined as in Section S4.2. Note that on $U(d)^c$, $(R_\gamma - R_t)/(n\sigma^2) > -(1 + 4s)z_n$, and that on $\widetilde{M}_2$, $\gamma^c \wedge t$ is empty. Then, due to (S17), (S18), and (S3), we obtain
$$\sum_{\gamma \in \widetilde{M}_2}\widetilde{\mathrm{PR}}(\gamma, t) \precsim \sum_{\gamma \in \widetilde{M}_2}(n\eta_{nm}(\nu)/\lambda)^{-(r^*_\gamma - r_t)/2}\,b_n^{|\gamma| - |t|}\exp\left\{\left(\frac{n - 1}{2}\right)x_n\right\} \precsim \sum_{\gamma \in \widetilde{M}_2}\left(p^{1+\delta} \vee \sqrt{n}\right)^{-(r_\gamma - r_t)}b_n^{|\gamma| - |t|}\exp\left\{\left(\frac{n}{2}\right)x_n\right\} \precsim \sum_{\gamma \in \widetilde{M}_2}\left(p^{1+\delta} \vee \sqrt{n}\right)^{-(r_\gamma - r_t)}b_n^{|\gamma| - |t|}\,p^{(1 + \delta/2)(r_\gamma - r_t)} \precsim \rho_n \longrightarrow 0,$$
as $n \to \infty$, where $\rho_n$ is defined in Section S4.2. Also, following the proof for large models in Section S4.3, we can show that
$$\sum_{\gamma \in \widetilde{M}_3}\widetilde{\mathrm{PR}}(\gamma, t) \overset{P}{\longrightarrow} 0.$$
For under-fitted models in $\widetilde{M}_4$, if $\Delta_n = o(n)$, then, similarly to (S12) and (S13), restricting to the high probability event $\{R_\gamma - R_t \ge \Delta_n(1 - c_1)\} \cap \{|\zeta_n| < \tilde{\varphi}\}$, we get
$$\sum_{\gamma \in \widetilde{M}_4}\widetilde{\mathrm{PR}}(\gamma, t) \precsim \sum_{\gamma \in \widetilde{M}_4}(n\eta_{nm}(\nu)/\lambda)^{|t|/2}\,(\eta_{nm}(\nu))^{-|\gamma^c \wedge t|/2}\,b_n^{|\gamma| - |t|}\left(1 + \frac{R_\gamma - R_t}{n\sigma^2(1 + \zeta_n)}\right)^{-\frac{n-1}{2}} \precsim \sum_{\gamma \in \widetilde{M}_4}\left(p^{2+2\delta} \vee n\right)^{|t|/2}p^{(2+\delta)|t|/2}\,b_n^{|\gamma| - |t|}\exp\left\{-\frac{\Delta_n(1 - c_1)}{2\sigma^2(1 + \tilde{\varphi})}\right\}$$
$$\precsim \exp\left\{-\frac{1}{2\sigma^2}\left(\frac{\Delta_n(1 - c_1)}{1 + \tilde{\varphi}} - \sigma^2|t|\log\left(p^{2+2\delta} \vee n\right) - \sigma^2(2 + \delta)|t|\log p\right)\right\} \precsim \exp\left\{-\frac{1}{2\sigma^2}\left(\Delta_n(1 - c_1') - c_{12}\tau_n\right)\right\} \longrightarrow 0,$$
as $n \to \infty$, for some $c_1' \in (0, 1)$ and $c_{12} > 0$. If $\Delta_n \sim n$, then, by taking $\tilde{\varphi} < 1/2$, we have, for some $v_2 > 0$ and $c' > 0$,
$$\sum_{\gamma \in \widetilde{M}_4}\widetilde{\mathrm{PR}}(\gamma, t) \precsim \left(p^{2+2\delta} \vee n\right)^{|t|/2}p^{(1+\delta/2)|t|}\left(1 + \frac{n(1 - c')}{4n\sigma^2}\right)^{-\frac{n-1}{2}} \precsim (p \vee n)^{(2+3\delta)|t|}\,e^{-v_2n} \longrightarrow 0,$$
as $n \to \infty$. This completes the proof of Theorem 2.