Bayesian Variable Selection for Ultrahigh-dimensional Sparse Linear Models
Minerva Mukhopadhyay (Bethune College, Kolkata-700006, India; [email protected]) and Subhajit Dutta (Indian Institute of Technology, Kanpur-208016, India; [email protected])

Abstract.
We propose a Bayesian variable selection procedure for ultrahigh-dimensional linear regression models. The number of regressors involved in regression, p_n, is allowed to grow exponentially with n. Assuming the true model to be sparse, in the sense that only a small number of regressors contribute to this model, we propose a set of priors suitable for this regime. The model selection procedure based on the proposed set of priors is shown to be variable selection consistent when all the 2^{p_n} models are considered. In the ultrahigh-dimensional setting, selection of the true model among all the 2^{p_n} possible ones involves prohibitive computation. To cope with this, we present a two-step model selection algorithm based on screening and Gibbs sampling. The first step of screening discards a large set of unimportant covariates, and retains a smaller set containing all the active covariates with probability tending to one. In the next step, we search for the best model among the covariates obtained in the screening step. This procedure is computationally quite fast, simple and intuitive. We demonstrate competitive performance of the proposed algorithm for a variety of simulated and real data sets when compared with several frequentist, as well as Bayesian, methods.

Keywords:
Model selection consistency, Screening consistency, Gibbs sampling.
Variable selection in the ultrahigh-dimensional setup is a flourishing area of contemporary research. It has become more important with the increasing availability of data in various fields like genetics, finance, machine learning, etc. Sparsity has frequently been identified as an underlying feature of such data sets. For example, in genome wide association studies (GWAS), “a phenotype is measured for a large panel of individuals, and a large number of single nucleotide polymorphisms (SNPs) throughout the genome are genotyped in all these participants. The goal is to identify SNPs that are statistically associated with the phenotype and ultimately to build statistical models to capture the effect of genetics on the phenotype” (Rosset (2013)). One such data set is the metabolic quantitative trait loci data, which consists of 10000 SNPs that are close to the regulatory regions (predictor variables) over a total of 50 participants (observations). A previous study by Song and Liang (2015) identified two particular SNPs to be important and significant. We study this data set in detail in a later section.

Several methods have been proposed to model high-dimensional data sets in both the frequentist and the Bayesian paradigms. Frequentist solutions to this problem are often based on penalized likelihood, among which variants of LASSO like the elastic net of Zou and Hastie (2005), the group LASSO of Yuan and Lin (2006) and the adaptive LASSO of Zou (2006) are worth mentioning. Another important frequentist solution to this problem involves a screening algorithm to first reduce the data dimension, and then use some classical method on this reduced data.
This idea is implemented in sure independence screening (SIS) of Fan and Lv (2008), iterative SIS (ISIS) of Fan and Song (2010), forward selection based screening of Wang (2009), nonparametric independence screening (NIS) of Fan et al. (2011), iterative varying-coefficient screening (IVIS) of Song et al. (2014), etc. Other ways of approaching this problem are by using the smoothly clipped absolute deviation (SCAD) of Fan and Li (2001), the Dantzig selector of Candès and Tao (2007), the modified EBIC of Chen and Chen (2012), etc. A detailed and nice review of most of these methods is contained in the paper by Fan and Lv (2010).

In the Bayesian literature, popular methods include the empirical Bayes variable selection of George and Foster (2000), and the spike and slab variable selection of Ishwaran and Rao (2005). Among recent developments, the methods of Bondell and Reich (2012), Liang et al. (2013), Song and Liang (2015) and Castillo et al. (2015) use the idea of penalized credible regions to accomplish variable selection in the ultrahigh-dimensional setting. While Castillo et al. (2015) have proved theoretical results related to posterior consistency for the regression parameter, Liang et al. (2013) have shown the equivalence of posterior consistency and model selection consistency under appropriate sparsity assumptions. The authors of Narisetty and He (2014) claim to prove the ‘strongest selection consistency result’ using the spike and slab prior in the Bayesian framework. They introduce shrinking and diffusing priors, and establish strong selection consistency of their approach.
In all of the above studies, the authors have considered the case where log p_n = o(n).

Note that the algorithms for computing the posterior distribution for the spike and slab prior are routine for small values of p_n and n, but the resulting computations are quite intensive in higher dimensions due to the large number of possible models. Several authors have developed MCMC algorithms that can cope with larger numbers of covariates, but truly high-dimensional models are still ‘out of reach of fully Bayesian methods at the present time’ (see Castillo et al. (2015)).

In this paper, we propose a Bayesian method for model selection, and examine model selection consistency for the same under the assumption of sparsity. In cases where p_n >> n, the number of competing models is so large that one first requires a screening algorithm to discard unimportant covariates. We present a two-step model selection procedure based on a screening algorithm and Gibbs sampling. The first step of the algorithm is shown to achieve screening consistency in the sense that it discards a large set of unimportant covariates with probability tending to one.

The objective of the present work is three-fold. First, to develop a method which is suitable for ultrahigh-dimensional models. Secondly, to provide a faster and intuitive model selection algorithm. Finally, to keep the method and the algorithm as simple as possible. The proposed set of priors has the advantage of generating closed form expressions for the marginals, which makes the method as tractable as a simple penalized likelihood method, such as the Bayesian information criterion (BIC). To the best of our knowledge, this is the first work in the area of Bayesian variable selection which can accommodate cases with log p_n = O(n). The selection algorithm we adopt is simple and intuitive, and it makes the selection procedure quite fast.
Further, its good performance is supported by theoretical results.

In Section 2, the prior setup and the model selection algorithm are described in detail. Section 3 contains the theoretical results, including model selection consistency of the proposed set of priors and consistency of the proposed algorithm. In Sections 4 and 5, we validate the performance of the proposed algorithm using simulated and real data sets, respectively. Proofs of the main results are provided in Section 6; proofs of the other results and mathematical details are provided in a supplementary file.
Suppose we have n data points, each consisting of p_n regressors {x_{1,i}, x_{2,i}, ..., x_{p_n,i}} and a response y_i, with i = 1, 2, ..., n. The response vector y_n is modeled as

    y_n = X_n β + e_n,    (2.1)

where X_n is the n × p_n design matrix, β = (β_1, β_2, ..., β_{p_n})' is the vector of corresponding regression parameters and e_n is the vector of regression errors. We consider a sparse situation, where only a small number of regressors contributes to the model, while p_n >> n. For simplicity, we assume that the design matrix X_n is non-stochastic and e_n ~ N(0, σ² I_n).

The space of all the 2^{p_n} models is denoted by A, and indexed by α. Here, each α consists of a subset of size p_n(α) (0 ≤ p_n(α) ≤ p_n) of the set {1, 2, ..., p_n}, indicating which regressors are selected in the model. Under M_α, with α ∈ A, y_n is modeled as

    M_α : y_n = X_α β_α + e_n,

where X_α is a sub-matrix of X_n consisting of the p_n(α) columns specified by α, and β_α is the corresponding vector of regression coefficients. When M_α is true, we assume that all the elements of β_α are non-zero. We consider the problem of selecting the model M_α, with α ∈ A, which best explains the data. The true data generating model, denoted by M_{α_c}, is assumed to be an element of A, and is expressed as

    M_{α_c} : y_n = µ_n + e_n = X_{α_c} β_{α_c} + e_n,

where µ_n is the regression of y_n given X_n. The dimension of M_{α_c}, denoted by p(α_c), is assumed to be a small number, free of n.

In a Bayesian approach, each model M_α is assigned a prior probability p(M_α), and the corresponding set of parameters θ_α = (β_α, σ²) involved in M_α is also assigned
a prior distribution p(θ_α | M_α). Given the set of priors, one computes the posterior probability of each model. The posterior probability of the model M_α is given by

    p(M_α | y_n) = p(M_α) m_α(y_n) / Σ_{α ∈ A} p(M_α) m_α(y_n),

where

    m_α(y_n) = ∫ p(y_n | θ_α, M_α) p(θ_α | M_α) dθ_α

is the marginal density of y_n, p(y_n | θ_α, M_α) is the density of y_n given the model parameters θ_α, and p(θ_α | M_α) is the prior density of θ_α under M_α. We consider the procedure that selects the model in A with the highest posterior probability.

We denote the rank of the design matrix of model M_α by r_α, i.e., r(X'_α X_α) = r_α, and also refer to r_α as the rank of M_α. For two numbers a and b, the notations a ∨ b and a ∧ b are used to denote max{a, b} and min{a, b}, respectively. For α, α* ∈ A, the notations X_{α∨α*} and X_{α∧α*} are used to denote the sub-matrices of X formed by columns corresponding to either X_α or X_{α*} (or both), and by columns which are common to both X_α and X_{α*}, respectively. For two square matrices A and B of the same order, A ≤ B means that B − A is positive semidefinite.

On each model M_α with α ∈ A, we assign the Bernoulli prior probability

    P(M_α) = q_n^{p_n(α)} (1 − q_n)^{p_n − p_n(α)}, with q_n = 1/p_n.

Given a model M_α, we consider the conjugate prior on β_α as

    β_α | σ², M_α ~ N(0, g_n σ² I_{p_n(α)}),

where g_n is a hyperparameter which depends on n. When σ² is unknown, we consider the popular Jeffreys prior π(σ²) ∝ 1/σ².

The Bernoulli prior probability is widely used as a model prior probability because of its property of favoring, or penalizing, models of large, or small, dimensions. The choice q_n = 1/p_n has previously been considered by Narisetty and He (2014). This prior is
This prior is Bayesian Variable Selection for Ultrahigh-dimensional Settings particularly useful for sparse regression models, where it is known in advance that thetrue model is small-dimensional, and p n is quite large.The use of the inverse gamma prior for error variance is fairly conventional in theliterature of model selection (see, e.g., Johnson and Rossell (2012), Narisetty and He(2014)). Jeffreys prior is the limit of inverse gamma prior, as both the hyperparame-ters involved in inverse gamma prior approach zero. The property of invariance underreparametrization makes it suitable as a prior on the scale parameter.We choose a simple set of priors. Except for the choice of g n , we completely specifythe set of priors. We do not provide any specific choice of g n , rather indicate the optimalorder which is necessary to achieve consistency. The posterior probabilities generatedusing this set of priors is of a closed form, which makes the resulting method easilyapplicable. Our model selection procedure is quite simple as it chooses the model with the highestposterior probability among all competing models. In the next section, we will show thatthe proposed set of priors is model selection consistent in the sense that the posteriorprobability of the true model goes to one. However, identifying the model with thehighest posterior probability still remains a challenging task for ultrahigh-dimensionaldata. As p n = exp { O ( n ) } , it is impossible to evaluate all the 2 p n models in the modelspace even for small values of n . For example, if n = 5, the model space can be of orderexp(45), which is a huge number. Therefore, we need to develop a screening algorithm,which reduces the model space to a tractable one. In other words, we need to discarda set of ‘unimportant’ variables at the beginning using some suitable algorithm. Afterimplementation of the algorithm, ideally, we will be left with a smaller set of variableswhich includes all the active covariates. 
We describe the algorithm in detail below.

The Two-step Algorithm.

Screening: The first step discards a large set of unimportant covariates. Here, we use the fact that the number of regressors in the true model, p(α_c), is very small and free of n. First, we choose an integer d such that d/K ≤ p(α_c) < d, where K is a positive constant free of n. We will choose the best model in the class of all models of dimension d. Thus, among the 2^{p_n} models, we only compare (p_n choose d) models. Once d is selected, we proceed along the following steps:

1. Initialization. Choose any model M_α of dimension d, and calculate its marginal.

2. Evaluation. Consider each of the covariates present in M_α individually. Let x be a covariate of M_α. Replace x with each of the covariates which are not present in M_α one by one, and compute the marginal density. Update M_α by replacing x with the covariate that yields the highest marginal density, say x*, if x* ≠ x; retain x otherwise. Do the same for the other (d − 1) active covariates of M_α as well.

3. Replication. Repeat the previous step N times, where N is a moderately large number.

In the next section, we will show that if N is moderately large, the screening algorithm finally selects a supermodel of the true model with probability tending to one.

Model Selection:
Once the screening algorithm selects a model, say M_{α*}, we discard all the regressors that are not present in M_{α*}. In the next step, we apply the Gibbs sampling algorithm to select the best model among the 2^d models which can be formed by the d regressors present in M_{α*}. The sampling scheme that we use is completely described in Chipman et al. (2001, Section 3.5), in the section on Gibbs sampling algorithms under the conjugate setup. Note that the Gibbs sampling algorithm chooses models directly following a Markov chain with the ratio of the posterior probabilities as the transition kernel, and the set of regressors obtained at the end of the screening step contains all the active covariates with probability tending to one. Therefore, after sufficient iterations, the algorithm must select the model with the highest posterior probability, i.e., the true model.
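The two-step procedure can be sketched as follows. This is our own illustrative implementation, not the authors' code: the closed-form marginal assumes the conjugate prior of Section 2 with σ² = 1 known, and the second step exhaustively scores all 2^d submodels instead of running the Gibbs sampler (which is feasible when d is small). The choices of d, g and the swap order are ours.

```python
import numpy as np
from itertools import chain, combinations

def log_marginal(X, y, cols, g):
    """Log marginal of y (up to constants) under beta ~ N(0, g*I), sigma^2 = 1:
    -0.5 log|I + g X'X| - 0.5 R, with R = y'y - y'X (I/g + X'X)^{-1} X'y."""
    k = len(cols)
    if k == 0:
        return -0.5 * (y @ y)
    Xa = X[:, cols]
    G = Xa.T @ Xa
    _, logdet = np.linalg.slogdet(np.eye(k) + g * G)
    R = y @ y - y @ Xa @ np.linalg.solve(np.eye(k) / g + G, Xa.T @ y)
    return -0.5 * logdet - 0.5 * R

def screen(X, y, d, g, n_sweeps, rng):
    """Step 1: keep a model of size d; repeatedly try swapping each member
    for each non-member, accepting any swap that improves the marginal."""
    p = X.shape[1]
    model = list(rng.choice(p, size=d, replace=False))
    best = log_marginal(X, y, model, g)
    for _ in range(n_sweeps):
        for j in range(d):
            for cand in range(p):
                if cand in model:
                    continue
                trial = model[:j] + [cand] + model[j + 1:]
                score = log_marginal(X, y, trial, g)
                if score > best:
                    best, model = score, trial
    return sorted(model)

def select(X, y, kept, g, p_n):
    """Step 2: among all 2^d submodels of the screened set, pick the one
    maximizing marginal plus Bernoulli model prior (-p_n(alpha) log(p_n - 1))."""
    submodels = chain.from_iterable(combinations(kept, k) for k in range(len(kept) + 1))
    return max(submodels,
               key=lambda m: log_marginal(X, y, list(m), g) - len(m) * np.log(p_n - 1))

rng = np.random.default_rng(1)
n, p_n, active = 50, 60, [3, 17]
X = rng.standard_normal((n, p_n))
y = X[:, active] @ np.array([2.0, 2.0]) + rng.standard_normal(n)
kept = screen(X, y, d=5, g=n, n_sweeps=3, rng=rng)
chosen = select(X, y, kept, g=n, p_n=p_n)
```

With a signal this strong, `kept` should contain both active covariates after a few sweeps, and `chosen` should typically coincide with the true model, illustrating the screening and selection consistency discussed in the next section.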
Remark 2.1.
Note that the total number of models among which the algorithm selects the best one is (p_n choose d) + 2^d. Thus, if we have some idea about the actual number of active covariates, we can use it to choose d as small as possible. A small choice of d makes the algorithm much faster.
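To put the reduction of Remark 2.1 in perspective (with illustrative numbers of our own choosing, not from the paper): for p_n = 1000 and d = 10, the two-step search chooses among roughly 10^23 models rather than 2^1000 ≈ 10^301.

```python
from math import comb, log10

p_n, d = 1000, 10
two_step = comb(p_n, d) + 2**d   # models compared by the two-step algorithm
print(f"log10(two-step count) = {log10(two_step):.1f}")   # about 23.4
print(f"log10(2^p_n)          = {p_n * log10(2):.1f}")    # about 301.0
```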
Remark 2.2.
In the evaluation step of the screening algorithm, we update the chosen model d(p_n − d) times. If we repeat the evaluation step N times, then N d(p_n − d) updates take place. Therefore, it is enough to choose a moderately large N.

Note that N d(p_n − d) is also the computational complexity of the screening step. Even if we consider all the 2^d competing models for comparison in the second stage, the total computational complexity of the proposed algorithm would be N d(p_n − d) + 2^d, which is linear in p_n and much smaller than 2^{p_n}.

This section is dedicated to asserting consistency results of the proposed method of model selection. We consider the cases with known, as well as unknown, error variance σ² separately, stating clearly the assumptions required in each case.

Often one has enough data to estimate the variance σ² properly; alternatively, being independent of the design matrix, σ² may be estimated from earlier data sets. In such cases, σ² may be assumed to be known. In this subsection, we discuss results for the case with known σ². Given σ² and g_n, the posterior probability of model M_α satisfies

    P(M_α | y_n) ∝ (p_n − 1)^{−p_n(α)} | I_{p_n(α)} + g_n X'_α X_α |^{−1/2} exp{ −R*_α / (2σ²) },

where R*_α = y'_n { I_n − X_α ( I_{p_n(α)}/g_n + X'_α X_α )^{−1} X'_α } y_n. Our results for the case with known σ² are based on the following set of assumptions.

(A1) The number of regressors p_n = exp{b n^{1−r}}, where 0 ≤ r < 1 and b is any number free of n.

(A2) The true model M_{α_c} is unique. There exist constants τ*_max and τ*_min, free of n, such that n τ*_min I_{p(α_c)} ≤ X'_{α_c} X_{α_c} ≤ n τ*_max I_{p(α_c)}.

(A3) Let τ_max and τ_min be the highest and lowest non-zero eigenvalues of X'_n X_n / n. Then τ_max ≤ p_n^{|z_n|} with z_n → 0, and τ_min ≥ p_n^{−|w_n|} with w_n → 0.

(A4) Fix a constant K > 1, and let ∆_n = {δ n^{1−s}} ∨ {σ² p(α_c) log p_n} with δ > 0 and 0 < s ≤ 0.5. Let A_1 = { α ∈ A : M_{α_c} ⊄ M_α, r_α ≤ K p(α_c) }, µ_n = X_{α_c} β_{α_c}, and let P_n(α) be the projection matrix onto the span of X_α. We assume that

    inf_{α ∈ A_1} µ'_n ( I − P_n(α) ) µ_n > ∆_n.

(A5) The hyperparameter g_n is such that n g_n = p_n^δ, for some δ ≥ 5/(K − 1), where K is as stated in assumption (A4) above.

Assumptions (A1) and (A5) describe the setup and our choice of the hyperparameter g_n, respectively. Assumption (A2) states that the true model is unique, and that it includes a set of independent regressors. The matrix X'_{α_c} X_{α_c} corresponding to the true model depends on n, but not on p_n. Therefore, we allow the eigenvalues of the true model to vary only with n. Assumption (A3) is also quite general, as we allow the eigenvalues of X'_n X_n to vary with both n and p_n. This is more reasonable since the dimension of X'_n X_n depends on both n and p_n.

Assumption (A4) is commonly termed an identifiability condition for model selection. The quantity µ'_n (I − P_n(α)) µ_n may be interpreted as the distance of the α-th model from the true model. For consistent model selection, it is necessary for the true model to keep a distance from other models; otherwise, the true model may not be identifiable. Moreno et al. (2015, Lemma 3) proved that lim_{n→∞} µ'_n (I − P_n(α)) µ_n / n > 0 over non-supermodels of low rank, and fixed a threshold value for the case when log p_n = b n with b > 0. When log p_n = b n^{1−r} with r > 0, the threshold is not even of order n, and therefore, the condition is trivially satisfied (by Moreno et al. (2015, Lemma 3)).

The consistency results are split into two parts. Model selection consistency of the proposed set of priors is shown in Section 3.1.1, and consistency of the model selection algorithm is shown in Section 3.1.2.

If the true model is among the candidate models in the model space, it is natural to check whether a model selection procedure can identify the true model with probability tending to one. This property, known as ‘model selection consistency’, requires P(M_{α_c} | y_n) → 1 in probability, which is equivalent to showing that

    Σ_{α ∈ A \ {α_c}} P(M_α | y_n) / P(M_{α_c} | y_n) → 0 in probability.    (3.1)

We now state the result on model selection consistency for the case where σ² is known.

Theorem 3.1.
Consider the model (2.1) with known σ². Under assumptions (A1)-(A5), the method based on the proposed set of priors is model selection consistent.

Remark 3.1.
The proof of model selection consistency for known σ² (see Section 6.2) only requires p_n → ∞, and does not explicitly require n → ∞. As log p_n ≤ b n, an appropriately large p_n and only a moderately large n are sufficient for good performance of this set of priors in practice. From this point of view, the proposed set of priors is suitable for high-dimensional, medium sample size settings.

In the proposed model selection algorithm, at the end of the screening step one will be left with a model of dimension d. We claim that this step of screening is consistent in the sense that the model chosen at the end of it, say M_{α*}, may not be unique, but it will be a supermodel of the true model, i.e., M_{α_c} ⊆ M_{α*} with probability tending to one. We now consider the following result.

Theorem 3.2.
Let d be an integer such that d/K ≤ p(α_c) < d, and let M_{α_1} and M_{α_2} be a supermodel and a non-supermodel of M_{α_c} of dimension d, respectively. If σ² is known and assumptions (A1)-(A5) hold, then for the proposed set of priors we have

    sup_{α_1, α_2} P(M_{α_2} | y_n) / P(M_{α_1} | y_n) → 0 in probability,

or, equivalently, sup_{α_1, α_2} m_{α_2}(y_n) / m_{α_1}(y_n) → 0 in probability under M_{α_c}.

This theorem states that under M_{α_c}, the posterior probability of a supermodel of M_{α_c} of dimension d is much higher than the posterior probability of a non-supermodel of the same dimension. As the algorithm selects a model on the basis of the marginal density, which is equivalent to the posterior probability when models of the same dimension are considered, it is expected that a supermodel will be selected after some iterations.

After a supermodel is selected in the first stage, we only consider the d regressors present in it, and search for the best model among the 2^d possible ones formed by these d regressors. As the Gibbs sampling algorithm in the second step chooses models on the basis of posterior probabilities, consistency of this part is immediate from the screening consistency and the model selection consistency of the proposed set of priors.

When the error variance σ² is unknown, we assign the standard non-informative Jeffreys prior π(σ²) ∝ 1/σ². In this case, the posterior probability of any model M_α satisfies

    p(M_α | y_n) ∝ (p_n − 1)^{−p_n(α)} | I + g_n X'_α X_α |^{−1/2} ( R*_α )^{−n/2}.

As σ² is unknown, we need to modify assumptions (A1)-(A5) of the previous subsection. The modified assumptions are stated below:

(B1) For some positive integer M free of n, rank(X'_n X_n) ≤ M.

(B2) The number of regressors p_n = exp{b n^{1−r}}, where 0 ≤ r < 1 and b is any number free of n. For r = 0, we need b < [ξ(1 − ξ){(1 + ξ)(M − p(α_c))}^{−1}] ∧ {4 p(α_c)}^{−1} for some 1/(K − 1) < ξ ≤ 1.

(B3) The hyperparameter g_n is such that n g_n = p_n^δ, for some δ ≥ 7ξ/(1 − ξ) + 2/(K − 1).

(B4) Assumption (A4) holds with the threshold ∆_n = {σ² p(α_c) log p_n} ∨ {δ n^{1−s}} for some 0 < s ≤ 0.5.
Assumption (B2) imposes some additional restrictions on the dimension p_n when it is of the order exp{O(n)}. Unlike the case of known σ², here we fail to accommodate every p_n of the order exp{O(n)} (recall assumption (A1)); rather, we impose a multiplicative constant b such that log p_n ≤ b n. Assumption (B3) indicates that we need a slightly larger value of g_n in order to achieve consistency when the parameter σ² is unknown. Finally, assumption (B4) is the same as assumption (A4) with a partially changed threshold value. Nevertheless, the implications of the assumption and its importance remain the same here. We now state the result on model selection consistency for an unknown value of σ².
Theorem 3.3.
Consider the model (2.1) with unknown σ². Under assumptions (B1)-(B4), the method based on the proposed set of priors is model selection consistent.

We do not present a separate result on screening consistency of the algorithm stated in Section 2.3 for the case where σ² is unknown. A result similar to Theorem 3.2 can be stated here, and a proof similar to that of Theorem 3.2 (i.e., the case with known σ²; see Section 6.2) can also be presented in this respect using assumptions (B1)-(B4) instead of (A1)-(A5).

We validate the performance of the proposed method of model selection using a wide variety of simulated data sets. Under different simulation schemes, we present the proportion of times a model selection algorithm selects the true model.
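Before turning to the simulations, note that the unknown-σ² posterior score above is as easy to compute as the known-σ² one: integrating out σ² under the Jeffreys prior merely replaces the exponential factor by (R*_α)^{−n/2}. A minimal sketch of our own (the choice g = n is an assumed placeholder, not the paper's):

```python
import numpy as np

def log_score_unknown_sigma(X, y, cols, g, p_n):
    """Unnormalized log posterior with Jeffreys prior on sigma^2:
    -p_n(alpha) log(p_n - 1) - 0.5 log|I + g X'X| - (n/2) log R*_alpha."""
    n, k = len(y), len(cols)
    Xa = X[:, cols]
    G = Xa.T @ Xa
    _, logdet = np.linalg.slogdet(np.eye(k) + g * G)
    R = y @ y - y @ Xa @ np.linalg.solve(np.eye(k) / g + G, Xa.T @ y)
    return -k * np.log(p_n - 1) - 0.5 * logdet - 0.5 * n * np.log(R)

rng = np.random.default_rng(3)
n, p_n = 60, 30
X = rng.standard_normal((n, p_n))
y = 2.0 * X[:, 5] + 2.0 * X[:, 12] + rng.standard_normal(n)
true_score = log_score_unknown_sigma(X, y, [5, 12], g=n, p_n=p_n)
noise_score = log_score_unknown_sigma(X, y, [0, 1], g=n, p_n=p_n)
print(true_score > noise_score)  # the true model scores higher
```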
Our method:
The model selection algorithm we follow is completely described in Section 2.3. The number of regressors selected at the first stage, d, is taken to be a fixed fraction of n. In the screening step, we choose g_n to be a power of p_n divided by n (in line with assumption (A5)), and in the second step of model selection, we choose g_n as a function of d.

Other methods:
As we mentioned in the Introduction, there are several methods for variable selection from the classical, as well as the Bayesian, perspectives. We consider some of the more competitive methods for comparison. Among the classical methods, we consider three approaches based on iterative sure independence screening (ISIS), namely, ISIS-LASSO-BIC, ISIS-SCAD-BIC and ISIS-MCP-BIC. Here an initial set of variables is first selected by ISIS, and then a step of penalized regression is carried out using the least absolute shrinkage and selection operator (LASSO), smoothly clipped absolute deviation (SCAD), or minimax concave penalty (MCP, Zhang (2010)), with the regularization parameter tuned by the BIC. Among the Bayesian competitors, we consider methods based on Bayesian credible regions (BCR.marg and BCR.joint, Bondell and Reich (2012)) and the Bayesian shrinking and diffusing prior (BASAD, Narisetty and He (2014)). We have used R codes for all the methods. For ISIS, we have implemented codes from the R package SIS. The R codes for BCR are obtained from the first author's website, while the first author of Narisetty and He (2014) kindly shared the codes for BASAD with us. There are two versions of BASAD: one is exact, while the other is an approximation.
Simulation setup.
We consider two values of n, namely, 50 and 100. For n = 50, we choose p_n = 100 and 500, while for n = 100 we choose p_n = 500, 1000 and 2000. The model y_n = µ_n + e_n is considered as the true model, where µ_n = X_{α_c} β_{α_c}. The vector β_{α_c} is assumed to be sparse, i.e., there are only p(α_c) components in β_{α_c} with p(α_c) << p_n, and these p(α_c) components are chosen randomly from the set of indices {1, ..., p_n}. When p_n is less than or equal to 500 we set the number of active regressors p(α_c) = 5, while p(α_c) = 10 for higher values of p_n. The p(α_c) values of β_{α_c} are taken to be equal (say, β_0), and we fix a common constant value of β_0 = 2.

Each data vector x_i of the design matrix X_n = (x_1, ..., x_n)' is assumed to follow the Gaussian distribution with mean 0 and covariance Σ_{p_n} for i = 1, ..., n. The covariance structure of Σ_{p_n} = ((σ_ij)) for 1 ≤ i, j ≤ p_n is taken to be of the following four types:

Case 1. (Identity) Σ_{p_n} = I, i.e., there is no correlation among the covariates.

Case 2. (Block Dependence) Σ_{p_n} has a block covariance setting, where the active covariates have common correlation ρ = 0.25, the inactive covariates have common correlation ρ = 0.75, and each pair of active and inactive covariates has correlation ρ = 0.

Case 3. (Equi-correlation) Σ_{p_n} = (1 − ρ)I + ρ11', where 1 is the p_n-dimensional vector of ones and ρ is a common correlation. This exhibits a strong dependence structure uniformly among the covariates.

Case 4. (Autoregressive) Here, we take σ_ii = 1 for 1 ≤ i ≤ p_n, and σ_ij = ρ^{|i−j|} for 1 ≤ i ≠ j ≤ p_n with 0 < ρ < 1. Clearly, we have a decaying correlation structure depending on the distance |i − j|: as the distance increases, the correlation decreases.

Let e_n ~ f_n, where f_n denotes an n-dimensional multivariate distribution. We have considered two choices for f_n, namely, the standard multivariate Gaussian distribution and a heavy-tailed distribution, namely, the multivariate t distribution. Note that moments of order 2 or higher fail to exist for the t distribution considered here. In the tables below, we report the proportion of times each method selected the true model in 200 random iterations.
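The four covariance settings can be generated as follows. This is our own sketch, not the authors' code: the Case 2 correlations (0.25 / 0.75 / 0) follow the text, while rho = 0.5 in Cases 3 and 4 is an assumed illustrative value.

```python
import numpy as np

def make_sigma(case, p, active, rho=0.5):
    """Covariance matrix Sigma_{p_n} for simulation Cases 1-4.
    Case 2: 0.25 within the active block, 0.75 within the inactive block,
    0 across blocks. rho for Cases 3-4 is an illustrative choice."""
    if case == 1:                                   # identity
        return np.eye(p)
    if case == 2:                                   # block dependence
        S = np.zeros((p, p))
        act = np.isin(np.arange(p), active)
        S[np.ix_(act, act)] = 0.25
        S[np.ix_(~act, ~act)] = 0.75
        np.fill_diagonal(S, 1.0)
        return S
    if case == 3:                                   # equi-correlation
        return (1 - rho) * np.eye(p) + rho * np.ones((p, p))
    idx = np.arange(p)                              # Case 4: autoregressive decay
    return rho ** np.abs(idx[:, None] - idx[None, :])

def simulate(n, p, active, beta0, case, rng):
    """Draw the rows of X from N(0, Sigma) and y from the sparse true model."""
    S = make_sigma(case, p, active)
    X = rng.multivariate_normal(np.zeros(p), S, size=n)
    beta = np.zeros(p)
    beta[active] = beta0
    return X, X @ beta + rng.standard_normal(n)

rng = np.random.default_rng(0)
X, y = simulate(n=50, p=100, active=[2, 7, 11, 30, 55], beta0=2.0, case=4, rng=rng)
```

Substituting multivariate t errors for the Gaussian ones in `simulate` gives the heavy-tailed setting described above.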
Simulation results.
In this simulated regime, we have provided some extra information to BASAD and BCR. When we implemented BCR tuned with BIC, as well as BASAD tuned with BIC, we specified the exact number of non-zero components, i.e., the information of p(α_c). Further, we notice that the covariance structure of Case 2 becomes singular for p_n = 1000 or higher, and we have therefore restricted it to p_n = 500 or less.

Table 1: Proportion of times the true model is selected by each method for n = 50

Gaussian error
                    Case 1          Case 2          Case 3          Case 4
Methods / p_n       100    500      100    500      100    500      100    500
ISIS-SCAD-BIC       0.710  0.325    0.015  0.000    0.715  0.355    0.560  0.150
ISIS-MCP-BIC        0.525  0.135    0.010  0.000    0.485  0.155    0.230  0.020
BCR.marg-BIC        0.280  0.010    0.395  0.015    0.255  0.000    0.180  0.000
BASAD               0.930  0.445    0.865  0.150    0.925  0.430    0.880  0.605
BASAD-BIC           -      -        -      -        -      -        -      -

t error
                    Case 1          Case 2          Case 3          Case 4
Methods / p_n       100    500      100    500      100    500      100    500
ISIS-SCAD-BIC       0.385  0.330    0.020  0.000    0.355  0.335    0.315  0.230
ISIS-MCP-BIC        0.325  0.280    0.025  0.000    0.305  0.285    0.225  0.150
BCR.marg-BIC        0.180  0.005    0.300  0.000    0.200  0.000    0.180  0.000
BASAD               0.680  0.270    0.600  0.105    0.705  0.335    0.645  0.375
BASAD-BIC           -      -        -      -        -      -        -      -
Proposed            0.640  -        -      -        -      -        -      -
Among the three ISIS-based methods of variable selection, we have reported results for SCAD and MCP only; LASSO usually over-estimates β_{α_c}, and we have not reported it in our numerical study. For the other two methods, we observe (see Table 1) that SCAD performed uniformly better than MCP. It is also clear from Table 1 that ISIS is affected drastically when the dependence structure varies among the different sets of covariates (Case 2). Moreover, the proportion of times it selects the true model decreased significantly for heavy-tailed errors. This is explained by the fact that ISIS relies on directly computing covariances between the variables.

Generally, the Bayesian methods turn out to be more robust than the moment based approaches. Among the Bayesian methods, BASAD clearly performed the best for n = 50 with p_n = 100, and for n = 100 with p_n = 500. We observed that BASAD-BIC led to an improved performance over BASAD in some cases, which is unlike what one would expect given that it was supplied the exact value of p(α_c). However, the performance of BASAD falls drastically for higher values of p_n (see Table 1). Note that BASAD needs to compute the inverse of a covariance matrix for each model, which is computationally prohibitive for such high-dimensional data. To resolve this problem, the authors use a block covariance structure to simplify some of the matrix computations, and this can be one of the main reasons behind the poor performance. For BCR, we observe that the joint version leads to singularity in several iterations from p_n = 1000 onwards; therefore, we have reported results for the more stable marginal version only. The strength of our proposed method is reaffirmed by this numerical study, especially for higher values of p_n. Clearly, there is a systematic improvement of the proposed method over BASAD when we move from p_n = 100 to p_n = 500 for n = 50 across several covariance structures, and for both error distributions.

For n = 100, we consider three values of p_n: 500, 1000 and 2000.
Again, we observe that the proportion of times ISIS based methods select the true model decreased significantly when we consider t errors instead of Gaussian, as well as for Case 2.

Table 2: Proportion of times the true model is selected by each method for n = 100.

Gaussian error
                   |      Case 1       | Case 2 |      Case 3       |      Case 4
Methods ↓  p_n →   |  500  1000  2000  |  500   |  500  1000  2000  |  500  1000  2000
ISIS-SCAD-BIC      | 0.850 0.460 0.320 | 0.000  | 0.835 0.430 0.275 | 0.610 0.145 0.037
ISIS-MCP-BIC       | 0.670 0.255 0.230 | 0.000  | 0.615 0.240 0.156 | 0.065 0.000 0.000
BCR.marg           | 0.280 0.000 0.000 | 0.275  | 0.245 0.000 0.000 | 0.125 0.000 0.000
BASAD              | 0.985 0.240 0.000 | 0.935  | 0.975 0.270 0.000 | 0.975 0.300 0.000
BASAD-BIC          |

t error
                   |      Case 1       | Case 2 |      Case 3       |      Case 4
Methods ↓  p_n →   |  500  1000  2000  |  500   |  500  1000  2000  |  500  1000  2000
ISIS-SCAD-BIC      | 0.435 0.375 0.325 | 0.000  | 0.450 0.360 0.304 | 0.400 0.260 0.130
ISIS-MCP-BIC       | 0.385 0.325 0.290 | 0.000  | 0.400 0.325 0.275 | 0.255 0.180 0.080
BCR.marg           | 0.265 0.000 0.000 | 0.000  | 0.205 0.000 0.000 | 0.095 0.000 0.000
BASAD              | 0.905 0.060 0.000 | 0.105  | 0.875 0.115 0.000 | 0.755 0.212 0.000
BASAD-BIC          |

BASAD performs best for the first three covariance structures when p_n = 500, irrespective of the distribution of the errors. However, the proposed method outperforms BASAD for Case 4, and we now observe an improvement over BASAD even for p_n = 500. For p_n = 1000 with n = 100, this proportion again falls drastically for BASAD, while our method yields a more stable performance. Interestingly, methods based on ISIS lead to comparable results in some of the cases for such high-dimensional data. We again observe a systematic improvement of the marginal version over BCR.joint for p_n = 2000 (whenever the joint version yielded a valid result), but the overall performance of BCR is not very good compared to the other Bayesian methods. The strength of our proposed method is clear from Table 2 with higher values of p_n. In particular, when p_n = 2000 we observe that only our method leads to a non-zero value for this proportion among the Bayesian methods that we have studied in this paper.

To check the sensitivity of our method to the value of β_{α_c}, we have done a further simulation study. We consider Case 1 (Σ_{p_n} = I) with normal errors for n = 100, and two choices of β_{α_c}: first, a set of decaying values of β_{α_c}, and second, a set of increasing values of β_{α_c}. The increments are chosen so that p(α_c) = 5 for p_n = 500, and p(α_c) = 10 for p_n = 1000 and 2000. The results are summarized in Table 3 below.

Table 3: Proportion of times the true model is selected by each method for n = 100.

                   | Decaying β_{α_c}  | Increasing β_{α_c}
Methods ↓  p_n →   |  500  1000  2000  |  500  1000  2000
ISIS-SCAD-BIC      | 0.660 0.400 0.244 | 0.820 0.465 0.330
ISIS-MCP-BIC       | 0.630 0.260 0.000 | 0.675 0.265 0.190
BCR.marg           | 0.135 0.000 0.000 | 0.275 0.000 0.000
BASAD              | 0.985 0.140 0.000 | 0.980 0.275 0.000
BASAD-BIC          | 0.995 0.300 0.000 |
The good performance of the proposed method is further reaffirmed by the numerical results of this table. However, we observe an improvement in BASAD-BIC for the latter choice of β_{α_c} when p_n = 1000 compared to the former, which indicates that this method is more sensitive to the actual values of the βs than the proposed method. We further notice that SCAD leads to a higher proportion than MCP for larger values of p_n.

We have analyzed two data sets in this section. For each of these data sets, more details can be found in the paper by Song and Liang (2015).
The first example is related to a metabolic quantitative trait loci experiment which links single nucleotide polymorphism (SNP) data to metabolomics data. The predictors come from a GWAS of the candidate genes for alanine aminotransferase enzyme elevation in liver, along with mass spectroscopy metabolomics data. A total of 10000 SNPs are pre-selected as candidate predictors, and the number of subjects included in the data set is 50. The genotype of each SNP is coded as 0, 1 and 2 for homozygous rare, heterozygous and homozygous common alleles, respectively. A particular metabolite bin that discriminates well between the disease statuses of the clinical trial's participants is selected as the response variable.

The SAM approach of Song and Liang (2015) selected two SNPs, rs17041311 and rs17392161. The SNP rs7896824 has the same genotype as the SNP rs17041311, while the SNP rs17392161 shares the same genotype with eleven other SNPs rs17390419, rs12328732, rs2164473, rs322664, rs17415876, rs16950829, rs6607364, rs829156, rs829157, rs2946537 and rs9756 across all the 50 subjects. We implement the algorithm for our proposed method starting from d = 5 up to d = 50 (which is the maximum possible value that d may attain). From our analysis, the proposed method identifies all these SNPs (two from the first group, and all twelve from the second group) from d = 25 onwards. We further observe that the proposed method consistently identifies a new set of SNPs consisting of rs6704330 and rs12744386; this is a novel set of SNPs which was not detected in the earlier study.

For the sake of comparison, we implement all the competing methods from our simulations, namely, ISIS-SCAD-BIC, ISIS-MCP-BIC, BCR.marg-BIC, BASAD and BASAD-BIC on this data. We first fix a value of the model size (d), and then a model selection method is used to obtain a subset of the predictor variables.
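The duplicate-genotype structure noted above (SNPs sharing an identical genotype vector across all 50 subjects, such as rs17041311 and rs7896824) can be detected by hashing the columns of the genotype matrix. A minimal NumPy sketch; the function name and the toy matrix are ours, not part of the original analysis.

```python
import numpy as np
from collections import defaultdict

def identical_genotype_groups(G):
    """Return groups (lists of column indices) of SNPs whose genotype
    vectors, coded 0/1/2 across all subjects, are identical."""
    groups = defaultdict(list)
    for j in range(G.shape[1]):
        # The raw bytes of a column serve as a hashable key.
        groups[np.ascontiguousarray(G[:, j]).tobytes()].append(j)
    return [g for g in groups.values() if len(g) > 1]

# Toy example: columns 0 and 2 carry the same genotypes over 3 subjects.
G = np.array([[0, 1, 0],
              [2, 0, 2],
              [1, 1, 1]])
print(identical_genotype_groups(G))  # [[0, 2]]
```

With such groups in hand, one representative per group can be kept before model selection, since members of a group are indistinguishable in the regression.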
To assess the relative performance of these methods, we compute both the mean and the median square errors based on leave-one-out cross-validation (CV). For all the methods, the values of the mean square errors turn out to be quite high. Therefore, we use the median square errors for comparison. For increasing values of d, Figure 1 below gives us an idea about the overall performance of each of these methods. Clearly, BASAD yields the lowest median square errors, while the performance of our proposal is the second best. For BASAD-BIC and BCR, surprisingly, we observe an abrupt increase in the value of the error corresponding to d = 50.
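The leave-one-out CV criterion used above can be sketched as follows: for a fixed support returned by a selection method, refit ordinary least squares n times, each time predicting the held-out observation, and report the median of the squared prediction errors. This is an illustrative sketch in NumPy (the function name and the toy data are ours, not from the paper).

```python
import numpy as np

def loo_median_sq_error(X, y, support):
    """Median leave-one-out squared prediction error of the OLS fit
    restricted to the columns listed in `support`."""
    n = len(y)
    Xs = X[:, support]
    errs = np.empty(n)
    for i in range(n):
        mask = np.arange(n) != i        # drop the i-th observation
        beta, *_ = np.linalg.lstsq(Xs[mask], y[mask], rcond=None)
        errs[i] = (y[i] - Xs[i] @ beta) ** 2
    return np.median(errs)

# Toy check: with noiseless data and n larger than the support size,
# the leave-one-out errors are essentially zero.
rng = np.random.default_rng(0)
X = rng.standard_normal((30, 10))
y = X[:, [1, 4]] @ np.array([2.0, -1.0])
print(loo_median_sq_error(X, y, [1, 4]))  # ~0 up to rounding error
```

The median (rather than the mean) is used above because, as noted in the text, a few large prediction errors inflate the mean square error for all methods.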
[Figure 1 here: six panels, one per method (ISIS-SCAD-BIC, ISIS-MCP-BIC, Proposed, BASAD, BASAD-BIC, BCR.marg-BIC), each plotting the CV median square error against the model size (10 to 50).]
Figure 1: Comparison of the different methods using median square errors
This data set is related to a polymerase chain reaction experiment. A total of 60 samples, from 31 female and 29 male mice, are used to monitor the expression levels of 22575 genes. Some physiological phenotypes, including numbers of phosphoenolpyruvate carboxykinase, glycerol-3-phosphate acyltransferase and stearoyl-CoA desaturase 1, are measured by quantitative real-time polymerase chain reaction. In this data, the relationship between the gene expression levels (predictors) and phosphoenolpyruvate carboxykinase (response) is studied. The gene expression data is standardized before the statistical analysis. To analyze this data, we repeat the same procedure as above. Both BASAD and BCR could not be implemented for this data due to memory overflow with p_n = 22575. Figure 2 gives us the overall picture of the performance of the other
[Figure 2 here: three panels, one per method (ISIS-SCAD-BIC, ISIS-MCP-BIC, Proposed), each plotting the CV median square error against the model size (10 to 50).]
Figure 2: Comparison of the different methods using median square errors

methods, and they all yield quite low median square errors. Clearly, ISIS-MCP leads to the lowest errors, and the proposed method performs marginally better than ISIS-SCAD for d = 15 and 25. However, the maximum difference in errors between the proposed method and both the ISIS based methods is less than 0.11 over all values of d.

For simplicity of presentation, we drop the suffixes of p_n, g_n, X_n and p_n(α). In this section, we present auxiliary results which are used in proving the main results.
Lemma 6.1. (a) Let X be an n × p matrix such that X′X has non-zero eigenvalues φ_1, φ_2, ..., φ_r, r ≤ n. Then |I + X′X| = (1 + φ_1)(1 + φ_2)···(1 + φ_r), where |A| is the determinant of the matrix A.
(b) Let X_1 be a sub-matrix of rank r_1 of the matrix X, constructed by taking a subset of the columns of X. If X′X has the non-zero eigenvalues φ_1, φ_2, ..., φ_m (r_1 ≤ m), then (1 + φ_min)^{r_1} ≤ |I + X_1′X_1| ≤ (1 + φ_max)^{r_1}, where φ_max and φ_min are respectively the largest and smallest eigenvalues of X′X.
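Part (a) of Lemma 6.1 is easy to verify numerically: for a random n × p matrix with n < p, the determinant of I + X′X equals the product of (1 + φ_i) over the non-zero eigenvalues φ_i of X′X. A quick NumPy illustration (not from the paper; the 1e-8 threshold for "non-zero" is our choice):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 5, 8                          # n < p, so X'X (p x p) has rank at most n
X = rng.standard_normal((n, p))
G = X.T @ X

eigvals = np.linalg.eigvalsh(G)      # eigenvalues of the symmetric matrix X'X
phi = eigvals[eigvals > 1e-8]        # the non-zero eigenvalues
det_side = np.linalg.det(np.eye(p) + G)
prod_side = np.prod(1.0 + phi)

assert phi.size <= n                      # at most n non-zero eigenvalues
assert np.isclose(det_side, prod_side)    # |I + X'X| = prod of (1 + phi_i)
```

The identity holds because the unit eigenvalues of I + X′X coming from the null space of X′X contribute factors of 1 to the determinant.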
Lemma 6.2. If W follows a non-central χ² distribution with degrees of freedom (df) r and non-centrality parameter (ncp) λ, then

P(W > t) ≤ P(χ²_r > (t − λ)/2) + P(Z > (t − λ)/(4√λ)),

where χ²_r denotes a central χ² random variable with df r and Z ∼ N(0, 1).

Lemma 6.3.
Let M_α, α ∈ A, be any model, and let M_{α′} be a model such that nτ′_min I ≤ X′_{α′} X_{α′} ≤ nτ′_max I for some τ′_min and τ′_max free of n. Then under assumption (A3),

|I + gX′_α X_α|^{−1/2} |I + gX′_{α′} X_{α′}|^{1/2} ≤ c (ng)^{(r_{α′} − r_α)/2} τ′_max^{r_{α′}/2} τ′_min^{−(r_α + r)/2} τ_max^{r/2}

for sufficiently large p, where r = r(X_{α∧α′}) and c > 0 is some constant.

Lemma 6.4. If M_{α_c} is the true model, then under the setup (2.1) and assumptions (A2) and (A5), for any ǫ > 0, the probabilities of the following three events
(a) R*_{α_c} − R_{α_c} > ǫ, (b) R_{α_c} > n(1 + ǫ)σ², and (c) R_{α_c} < n(1 − ǫ)σ²,
tend to 0 exponentially in n.

Lemma 6.5.
Let A_1 be the set of all supermodels of the true model M_{α_c}, let A*_1 be the subset of A_1 containing models of rank at most d, for some d > p(α_c), and let A_2 = {α : M_{α_c} ⊄ M_α, r_α > K p(α_c)}. Then the following statements hold.
(a) For any R > 2, with probability tending to 1,

max_{α∈A_1} (R_{α_c} − R_α) ≤ Rσ²(r_α − p(α_c)) log p.

(b) For any α ∈ A*_1 and any ǫ > 0, the probability that R*_α − R_α > ǫ tends to 0 exponentially in n.
(c) For R > 2K/(K − 1), with probability tending to 1,

max_{α∈A_2} (R_{α_c} − R_{α_c ∨ α}) ≤ Rσ²(r_α − p(α_c)) log p.

Lemma 6.6.
Let y_n = μ_n + e_n with e_n ∼ N(0, σ²I) and μ′_n μ_n = O(n). For any h_n such that h_n = n^k for some 1/2 < k < 1, we have |μ′_n e_n| = o_p(h_n).

The proofs of Lemmas 6.1–6.6 are given in the supplementary file.

Proof of Theorem 3.1. The ratio of the posterior probabilities of any model to the true model is given by

P(M_α | y_n) / P(M_{α_c} | y_n) = (p_n − 1)^{−(p_n(α) − p(α_c))} exp{−(R*_α − R*_{α_c})/(2σ²)} |I + g_n X′_{α_c} X_{α_c}|^{1/2} / |I + g_n X′_α X_α|^{1/2}.   (6.1)

We split A into three subclasses as follows:
(i) Supermodels of the true model, A_1 = {α : M_{α_c} ⊂ M_α}.
(ii) Non-supermodels of large dimension, A_2 = {α : M_{α_c} ⊄ M_α; r_α > K p(α_c)}, where r_α is the rank of X_α.
(iii) Non-supermodels of small to moderate rank, A_3 = {α : M_{α_c} ⊄ M_α; r_α ≤ K p(α_c)}.
We prove (3.1) separately for models in A = A_i, for i = 1, 2, 3.

Case I: Supermodels (α ∈ A_1). First, we obtain a uniform upper bound for the ratio of the posterior probabilities of any model M_α and M_{α_c}, given in (6.1). Note that

R*_{α_c} − R*_α ≤ R*_{α_c} − R_α = (R*_{α_c} − R_{α_c}) + (R_{α_c} − R_α),

where R_α = y′_n {I − X_α(X′_α X_α)^− X′_α} y_n and R*_α ≥ R_α. By part (a) of Lemma 6.4, we have R*_{α_c} − R_{α_c} = o_p(1). By part (a) of Lemma 6.5, for α ∈ A_1 and some R = 2(1 + ǫ), ǫ > 0, we have, with probability tending to 1,

max_{α∈A_1} (R_{α_c} − R_α) ≤ Rσ²(r_α − p(α_c)) log p.

Therefore, for any ǫ > 0,

exp{−(R*_α − R*_{α_c})/(2σ²)} ≤ p^{R(r_α − p(α_c))/2 + o_p(1)}.

Again, by Lemma 6.3 and assumptions (A2)–(A3), we have

|I + gX′_α X_α|^{−1/2} |I + gX′_{α_c} X_{α_c}|^{1/2} ≤ c*(ngτ_min)^{−(r_α − p(α_c))/2} τ^{p(α_c)/2} ≤ c*(ngτ_min)^{−(r_α − p(α_c))/2} p^{o(1)},

where c* is some appropriate constant. Therefore, summing the ratio of posterior probabilities over M_α ∈ A_1, we have

Σ_{α∈A_1} P(M_α | y_n) / P(M_{α_c} | y_n) ≤ Σ_{α∈A_1} c* p^{(1+ǫ)(r_α − p(α_c)) + o_p(1) + o(1)} (p − 1)^{−(p_n(α) − p(α_c))} (ngτ_min)^{−(r_α − p(α_c))/2}
≤ Σ_{α∈A_1} (p^{δ/3} / p^{δ})^{r_α − p(α_c)} (p − 1)^{−(p_n(α) − p(α_c))},

for some suitably chosen ǫ > 0. This is due to the fact that we can choose ǫ so that the term ǫ + o_p(1) + o(1) < δ/3 for sufficiently large p. By assumption (A2), the true model is unique and of full rank, and therefore r_α − p(α_c) ≥ 1. Thus, the above expression is less than

p^{−2δ/3} Σ_{q=1}^{p−p(α_c)} (p − p(α_c) choose q) p^{−q} ≤ p^{−2δ/3} {(1 + 1/p)^{p − p(α_c)} − 1},

and this tends to 0 as p → ∞.

Case II: Non-supermodels of large dimension (α ∈ A_2). We split R*_{α_c} − R*_α as before and use the fact that R_{α∨α_c} ≤ R_α. Thus, we have

R*_{α_c} − R*_α ≤ R*_{α_c} − R_α ≤ (R*_{α_c} − R_{α_c}) + (R_{α_c} − R_{α∨α_c}).

From part (c) of Lemma 6.5, with probability tending to 1, R_{α_c} − R_{α∨α_c} ≤ Rσ²(r_α − p(α_c)) log p for R = 2(1 + s) with s > 1/(K − 1). Therefore,

Σ_{α∈A_2} P(M_α | y_n) / P(M_{α_c} | y_n) ≤ Σ_{α∈A_2} c* p^{(1+s)(r_α − p(α_c)) + o_p(1) + o(1)} (p − 1)^{−(p_n(α) − p(α_c))} (ngτ_min)^{−(r_α − p(α_c))/2}
≤ Σ_{α∈A_2} c* (p^{1 + 1/(5(K−1))} / √(ng))^{r_α − p(α_c)} p^{p(α_c)} (p − 1)^{−p_n(α)},

for an appropriately chosen c* and s, so that s + o_p(1) + o(1) ≤ 1/(5(K − 1)) for sufficiently large n. As r_α − p(α_c) > (K − 1)p(α_c), the above expression is less than

c* (p^{1 + 1/(5(K−1))} / √(ng))^{(K−1)p(α_c)} p^{p(α_c)} Σ_{q = Kp(α_c)+1}^{p} (p choose q) p^{−q}.   (6.2)

Using the condition on δ, the product of the first two terms of (6.2) converges to 0 as p → ∞. However, the third term is dominated by Σ_{q=1}^{p} (p choose q) (p − 1)^{−q} ≤ (1 + 1/(p − 1))^p, which converges to e as p → ∞. These facts together imply that (6.2) converges to 0 as p → ∞.

Case III: Non-supermodels of small to moderate rank (α ∈ A_3). As in the previous case, we have

R*_{α_c} − R*_α ≤ (R*_{α_c} − R_{α_c}) + (R_{α_c} − R_α).

By part (a) of Lemma 6.4, we have R*_{α_c} − R_{α_c} = o_p(1). Next, consider the remaining term. We have

R_α − R_{α_c} = y′_n (P_n(α_c) − P_n(α)) y_n = μ′_n (P_n(α_c) − P_n(α)) μ_n + 2μ′_n (P_n(α_c) − P_n(α)) e_n + e′_n (P_n(α_c) − P_n(α)) e_n.

Note that μ′_n (P_n(α_c) − P_n(α)) μ_n = μ′_n (I − P_n(α)) μ_n > ∆ by assumption (A4).
Again,

μ′_n (P_n(α_c) − P_n(α)) e_n = μ′_n (P_n(α_c) − P_n(α∨α_c)) e_n + μ′_n (P_n(α∨α_c) − P_n(α)) e_n ≥ −2|μ′_n e_n|.

By Lemma 6.6, |μ′_n e_n| = o_p(h_n) for h_n = n^d with some 1/2 < d < 1. Finally, we get

e′_n (P_n(α_c) − P_n(α)) e_n ≥ −e′_n (P_n(α∨α_c) − P_n(α_c)) e_n,

as e′_n (P_n(α∨α_c) − P_n(α)) e_n ≥ 0. As P_n(α∨α_c) − P_n(α_c) is an idempotent matrix, we have e′_n (P_n(α∨α_c) − P_n(α_c)) e_n ≤ e′_n P_n(α_c) e_n for any α ∈ A_3 (see Section 2.3.2 of Yanai et al. (2011)). Also, e′_n P_n(α_c) e_n = O_p(1), since it follows the σ²χ² distribution with df p(α_c). Combining all these facts and using assumption (A4), we have

R*_{α_c} − R*_α ≤ −∆(1 + o_p(1)).

Further, from Lemma 6.3, the ratio of determinants in the last term of (6.1) is less than c*(√(ng) τ_max)^{p(α_c)} for an appropriately chosen c* > 0. Therefore,

Σ_{α∈A_3} P(M_α | y_n) / P(M_{α_c} | y_n) ≤ c* (p √(ng) τ_max)^{p(α_c)} exp{−∆(1 + o_p(1))/(2σ²)} Σ_{q=1}^{p} (p choose q) p^{−q}.   (6.3)

For sufficiently large p, we have c* (p √(ng) τ_max)^{p(α_c)} ≤ p^{(1+δ/2)p(α_c)}. By assumption (A4), exp{−∆/(2σ²)} ≤ p^{−2p(α_c)}. Thus the product of the first two terms on the right hand side (rhs) of (6.3) converges to zero, whereas the last term is bounded above by (1 + 1/p)^p, which converges to e. Using the above facts, it is evident that the rhs of (6.3) is of the order p^{−(1−δ/2)p(α_c)}. As p → ∞, the result follows.

Proof of Theorem 3.2. First note that M_{α_1} and M_{α_2} are of dimension d, and d is a constant free of n. Therefore, the ranks r_{α_1} and r_{α_2} of both models are also free of n. We now have

sup_{α_1,α_2} m(M_{α_1} | y_n) / m(M_{α_2} | y_n) = sup_{α_1,α_2} exp{−(R*_{α_1} − R*_{α_2})/(2σ²)} |I + gX′_{α_1} X_{α_1}|^{−1/2} |I + gX′_{α_2} X_{α_2}|^{1/2}.   (6.4)

By assumptions (A2) and (A3) and Lemma 6.3, we get

sup_{α_1,α_2} |I + gX′_{α_2} X_{α_2}|^{1/2} / |I + gX′_{α_1} X_{α_1}|^{1/2} ≤ (√(ng))^{r_{α_2} − r_{α_1}} τ_max^{(r + p(α_c))/2} τ_min^{−(r + r_{α_1})/2} (1 + ξ_n) ≤ (√(ng))^{r_{α_2} − r_{α_1} + o(1)},

where r = rank(X_{α_1 ∧ α_2}). We also have

R*_{α_2} − R*_{α_1} ≤ R*_{α_2} − R_{α_1} = (R*_{α_2} − R_{α_2}) + (R_{α_2} − R_{α_c}) + (R_{α_c} − R_{α_1}).

From part (b) of Lemma 6.5, we get sup_{α_2} (R*_{α_2} − R_{α_2}) = o_p(1). By the properties of projection matrices, R_{α_2} − R_{α_c} ≤ 0. Next consider R_{α_1} − R_{α_c}, which is equal to

y′_n (P_n(α_c) − P_n(α_1)) y_n = μ′_n (P_n(α_c) − P_n(α_1)) μ_n + 2μ′_n (P_n(α_c) − P_n(α_1)) e_n + e′_n (P_n(α_c) − P_n(α_1)) e_n.

We now have

μ′_n (P_n(α_c) − P_n(α_1)) e_n = μ′_n (P_n(α_c) − P_n(α_1∨α_c)) e_n + μ′_n (P_n(α_1∨α_c) − P_n(α_1)) e_n ≥ −2|μ′_n e_n|.

By Lemma 6.6, |μ′_n e_n| = o_p(h_n) for h_n = n^k with some 1/2 < k < 1. Finally,

e′_n (P_n(α_c) − P_n(α_1)) e_n ≥ −e′_n (P_n(α_1∨α_c) − P_n(α_c)) e_n,

as e′_n (P_n(α_1∨α_c) − P_n(α_1)) e_n ≥ 0. Note that e′_n (P_n(α_1∨α_c) − P_n(α_c)) e_n ≤ e′_n P_n(α_c) e_n, and e′_n P_n(α_c) e_n = O_p(1).

Again, by assumption (A4), we have μ′_n (P_n(α_c) − P_n(α_1)) μ_n ≥ ∆. Combining the above statements and using assumption (A5), from (6.4) we have

sup_{α_1,α_2} m(M_{α_1} | y_n) / m(M_{α_2} | y_n) ≤ (√(ng))^{p(α_c)} p^{o(1)} exp{−∆(1 + o_p(1))/(2σ²)} ≤ p^{−(1 − δ/2 + o_p(1)) p(α_c)},

which converges to 0 as p → ∞.

Supplementary Material (https://drive.google.com/open?id=0By7-ldtnmyfvUjlpWUZnaGpwNWs). The
Supplementary Material contains proofs of all the Lemmas and Theorem 3.3. A pdf copy is available at the link mentioned above.

References
Bondell, H. D. and Reich, B. J. (2012). "Consistent high-dimensional Bayesian variable selection via penalized credible regions." J. Amer. Statist. Assoc., 107(500): 1610–1624.

Candès, E. and Tao, T. (2007). "Rejoinder: The Dantzig selector: statistical estimation when p is much larger than n." Ann. Statist., 35(6): 2392–2404.

Castillo, I., Schmidt-Hieber, J., and van der Vaart, A. (2015). "Bayesian linear regression with sparse priors." Ann. Statist., 43(5): 1986–2018.

Chen, J. and Chen, Z. (2012). "Extended BIC for small-n-large-P sparse GLM." Statist. Sinica, 22(2): 555–574.

Chipman, H., George, E. I., and McCulloch, R. E. (2001). "The practical implementation of Bayesian model selection." In Model Selection, volume 38 of IMS Lecture Notes Monogr. Ser., 65–134. Inst. Math. Statist., Beachwood, OH.

Fan, J., Feng, Y., and Song, R. (2011). "Nonparametric independence screening in sparse ultra-high-dimensional additive models." J. Amer. Statist. Assoc., 106(494): 544–557.

Fan, J. and Li, R. (2001). "Variable selection via nonconcave penalized likelihood and its oracle properties." J. Amer. Statist. Assoc., 96(456): 1348–1360.

Fan, J. and Lv, J. (2008). "Sure independence screening for ultrahigh dimensional feature space." J. R. Stat. Soc. Ser. B Stat. Methodol., 70(5): 849–911.

Fan, J. and Lv, J. (2010). "A selective overview of variable selection in high dimensional feature space." Statist. Sinica, 20(1): 101–148.

Fan, J. and Song, R. (2010). "Sure independence screening in generalized linear models with NP-dimensionality." Ann. Statist., 38(6): 3567–3604.

George, E. I. and Foster, D. P. (2000). "Calibration and empirical Bayes variable selection." Biometrika, 87(4): 731–747.

Ishwaran, H. and Rao, J. S. (2005). "Spike and slab variable selection: frequentist and Bayesian strategies." Ann. Statist., 33(2): 730–773.

Johnson, V. E. and Rossell, D. (2012). "Bayesian model selection in high-dimensional settings." J. Amer. Statist. Assoc., 107(498): 649–660.

Liang, F., Song, Q., and Yu, K. (2013). "Bayesian subset modeling for high-dimensional generalized linear models." J. Amer. Statist. Assoc., 108(502): 589–606.

Moreno, E., Girón, J., and Casella, G. (2015). "Posterior model consistency in variable selection as the model dimension grows." Statist. Sci., 30(2): 228–241.

Narisetty, N. N. and He, X. (2014). "Bayesian variable selection with shrinking and diffusing priors." Ann. Statist., 42(2): 789–817.

Rosset, S. (2013). "Practical sparse modeling: an overview and two examples from genetics." Chapter 3 in Practical Applications of Sparse Modeling, I. Rish et al. (eds.), MIT Press.

Song, Q. and Liang, F. (2015). "A split-and-merge Bayesian variable selection approach for ultrahigh dimensional regression." J. R. Stat. Soc. Ser. B Stat. Methodol., 77(5): 947–972.

Song, R., Yi, F., and Zou, H. (2014). "On varying-coefficient independence screening for high-dimensional varying-coefficient models." Statist. Sinica, 24(4): 1735–1752.

Wang, H. (2009). "Forward regression for ultra-high dimensional variable screening." J. Amer. Statist. Assoc., 104(488): 1512–1524.

Yanai, H., Takeuchi, K., and Takane, Y. (2011). Projection Matrices, Generalized Inverse Matrices, and Singular Value Decomposition. Statistics for Social and Behavioral Sciences. Springer, New York.

Yuan, M. and Lin, Y. (2006). "Model selection and estimation in regression with grouped variables." J. R. Stat. Soc. Ser. B Stat. Methodol., 68(1): 49–67.

Zhang, C.-H. (2010). "Nearly unbiased variable selection under minimax concave penalty." Ann. Statist., 38(2): 894–942.

Zou, H. (2006). "The adaptive lasso and its oracle properties." J. Amer. Statist. Assoc., 101(476): 1418–1429.

Zou, H. and Hastie, T. (2005). "Regularization and variable selection via the elastic net." J. R. Stat. Soc. Ser. B Stat. Methodol., 67(2): 301–320.