Non-penalized variable selection in high-dimensional linear model settings via generalized fiducial inference
SSubmitted to the Annals of Statistics arXiv:
NON-PENALIZED VARIABLE SELECTION INHIGH-DIMENSIONAL LINEAR MODEL SETTINGS VIAGENERALIZED FIDUCIAL INFERENCE
By Jonathan P Williams and Jan Hannig
University of North Carolina at Chapel Hill
Standard penalized methods of variable selection and parameterestimation rely on the magnitude of coefficient estimates to decidewhich variables to include in the final model. However, coefficientestimates are unreliable when the design matrix is collinear. To over-come this challenge an entirely new perspective on variable selectionis presented within a generalized fiducial inference framework. Thisnew procedure is able to effectively account for linear dependenciesamong subsets of covariates in a high-dimensional setting where p can grow almost exponentially in n , as well as in the classical settingwhere p ≤ n . It is shown that the procedure very naturally assignssmall probabilities to subsets of covariates which include redundan-cies by way of explicit L minimization. Furthermore, with a typicalsparsity assumption, it is shown that the proposed method is con-sistent in the sense that the probability of the true sparse subset ofcovariates converges in probability to 1 as n → ∞ , or as n → ∞ and p → ∞ . Very reasonable conditions are needed, and little restrictionis placed on the class of possible subsets of covariates to achieve thisconsistency result.
1. Introduction.
A strategy for developing variable selection proce-dures with desirable consistency properties entails exploiting some distin-guishing property of the theoretical true data generating model. For exam-ple, standard penalized methods of variable selection within a linear modelframework such as LASSO of Tibshirani (1996), SCAD of Fan and Li (2001),and the Dantzig Selector of Candes and Tao (2007) rely on the magnitudeof the coefficients in the true data generating model being relatively largerthan those of the other coefficients. Johnson and Rossell (2012) use thisproperty to construct nonlocal prior densities over all subsets of covariates.The defining property of their nonlocal density is that it takes the value ofzero for subsets containing a covariate with a zero-valued coefficient.We propose a more desirable way for eliminating redundancies from thesample space of candidate subsets which does not explicitly rely on coefficient
MSC 2010 subject classifications:
Keywords and phrases: variable selection, best subset selection, high dimensional re-gression, L minimization, generalized fiducial inference a r X i v : . [ s t a t . M E ] F e b J. WILLIAMS AND J. HANNIG magnitudes. That is, any candidate true model should be non-redundant inthe sense that it contains the minimal amount of information necessary forexplaining and/or predicting the observed data. One such criterion to ex-ploit this non-redundancy property is that the only subsets with nonzeroposterior probability should be those which cannot be predicted to somechosen precision by a subset of fewer covariates. Such a criterion requiresconstructing a probability distribution on the space of candidate models,which is consistent with a Bayesian or fiducial variable selection paradigm.The literature on high-dimensional linear models is vast, but we hope to con-tribute to it by using this setting to build a foundation for a fresh perspectiveon variable selection.Recent work in the Bayesian high-dimensional linear model setting in-cludes Roˇckov´a and George (2016) who develop methods for separable andnon-separable spike-and-slab penalized estimation, the credible set approachof Bondell and Reich (2012), and Narisetty and He (2014) who propose amethod based on shrinking and diffusing coefficient priors in which the vari-ance of the priors are sample size dependent. Lai, Hannig and Lee (2015)layout framework for penalized estimation within a GFI approach.Ghosh and Ghattas (2015) provide insights into complications in Bayesianvariable selection. Namely, the size of the sample space (2 p ) is often toolarge to compute all model probabilities, and even typically larger than canreasonably be sampled by Markov chain Monte Carol (MCMC) methods.Thus, the nonlocal prior approach of Johnson and Rossell (2012) can achieveasymptotic consistency (where other approaches can only achieve pairwiseconsistency) because it is able to effectively eliminate a large enough portionof the 2 p subsets from the sample space. To illustrate this point, considerthe following simple example. Let (1) Y ∼ N n (cid:16) β · x (1) + · · · + β p · x ( p ) , σ I n (cid:17) , where β j ∈ R and x ( j ) ∈ R n for j ∈ { , . . . , p } , and σ >
0. Further, supposethat the true but unknown values of ( β , β , β , . . . , β p ) (cid:48) are ( b , b , , . . . , (cid:48) .Within the nonlocal prior framework, the only subsets with non-negligibleposterior probability are contained in the set (cid:8) { x (1) } , { x (2) } , { x (1) , x (2) } (cid:9) .When viewed as a prior density on the coefficients, nonlocal priors assignzero prior density to the true parameter value when the true parameter valueis zero. From a Bayesian perspective this is philosophically problematic, butvery insightful for consistency of model selection. The insight lends itself tothe question: What other properties might the true model have which canbe exploited to develop a statistical procedure with the ability to effectivelyeliminate subsets from the sample space? FI VARIABLE SELECTION In addressing this question, we build our proposed methods from theidea that any candidate true model, as determined by the actual non-zeroparameter values, should be non-redundant in the sense that it contains theminimal amount of information necessary for explaining and/or predictingthe observed data. We denote such subsets of the parameter space as ε - admissible , and define them precisely in Definition 2.1. Then, using theabove nonlocal prior example, the entire model space { x (1) , x (2) , x (3) } forinstance, is not ε - admissible because it can be perfectly predicted by thesmaller subset { x (1) , x (2) } .To further illustrate the intuition behind our proposal, consider an ex-ample where x (2) is highly collinear with all of x (3) , . . . , x ( p ) but is not correlated with x (1) , and that the true values of ( β , β , β , . . . , β p ) (cid:48) are( b , b , b , . . . , b p ) (cid:48) with b j (cid:54) = 0 for all j ∈ { , . . . , p } . In this case, assum-ing strong enough collinearity, ∃ c ∈ R with c · x (2) ≈ b · x (2) + · · · + b p · x ( p ) ,i.e., (cid:13)(cid:13)(cid:0) b · x (1) + · · · + b p · x ( p ) (cid:1) − (cid:0) b · x (1) + c · x (2) (cid:1)(cid:13)(cid:13) < ε where ε > { x (1) , . . . , x ( p ) } is not ε - admissible , but would be assignednonzero posterior probability in the nonlocal prior framework.We construct a posterior-like probability distribution over all subsets,which assigns negligible probability to elements that are not ε - admissible .In constructing the posterior-like probability distribution we adopt a gener-alized fiducial inference (GFI) approach because it has similar to an objectiveBayes interpretation with data driven priors, gives a systematic method ofconstructing a distribution function given a data generating equation suchas a linear model, and it does not suffer from the issue of arbitrary normal-izing constants which arise in many objective Bayesian priors (Berger et al.,2001). In this manuscript we will provide a gentle introduction to GFI. Afuller account of GFI is given in the recent review paper Hannig et al. (2016).An advantage of both our approach and the nonlocal prior approach ofJohnson and Rossell (2012) is that in addition to providing theoretical guar-antees, our statistical procedures yield estimates of the posterior distribu-tion over subsets of covariates. This is in contrast to frequentist penaliza-tion based methods or Bayesian procedures fully dedicated to maximuma posteriori probability (MAP) estimation. Such methods do not yield theposterior probability of a chosen model (i.e., the relative probability, giventhe observed data, of a given model against competing models). 
Further-more, Ghosh and Ghattas (2015) argue that joint summaries of subsets ofcovariates are more robust to collinearity.With our approach to constructing a posterior-like distribution whoseprobability mass function value is negligible for subsets of the parameter J. WILLIAMS AND J. HANNIG space which are not ε - admissible , we are able to show that the probabilityof the true data generating model converges to 1 asymptotically in n and p .This consistency is shown to be true even with p growing almost exponen-tially in n . The reason being that the true model yields a stronger signalsince it no longer has to compete within an overly redundant sample space.The paper is organized as follows. Section 2 serves to introduce the gen-eral methodology and computational algorithm for carrying out our variableselection procedure based on a recent algorithm for explicit L minimization(Bertsimas, King and Mazumder, 2016), which is fast enough to be used onreal data. The conditions needed for the main results, and the main resultsare presented and discussed in Section 3. Proofs are organized in the ap-pendix. We demonstrate the empirical performance of our procedure andcompare it to other Bayesian and frequentist methods in simulation setupson synthetic data in Section 4. Computer code implementing our procedureis provided at https://jonathanpw.github.io/software.html .
2. Methodology.
As described in the previous section, our idea behindexploiting a non-redundancy property of the true data generating modelrelies on constructing a probability distribution concentrated on what wedenote as ε - admissible subsets. This object is defined precisely in Definition2.1, but first an aside on the notation used throughout the paper.Let Y be an n -dimensional random vector, X an n × p matrix with columnsscaled to have unit norm, and β a fixed p -dimensional vector with nonzero(or active) components indexed by the subset M o ⊂ { , . . . , p } , with (2) Y ∼ N n (cid:16) X M o β M o , ( σ M o ) I n (cid:17) . The design matrix denoted by X M o is defined as the matrix composed ofonly those columns of X corresponding to the index set M o . The subscript ‘ o ’refers to the interpretation of M o as corresponding to the ‘oracle’ subset ofcovariates. Moreover, β M o denotes the true values of the oracle coefficients,while β M o is understood as a random vector whose uncertainty resides in notknowing the true coefficients β M o . For any subset M the vector β M refersto the projection of the column space of X M on the true coefficients β M o ,that is, β M = ( X (cid:48) M X M ) − X (cid:48) M X M o β M o = E y ( (cid:98) β M ). Lastly, σ M o > σ M o is a random variablewhose distribution expresses the uncertainty from not knowing σ M o , underthe oracle model.The objective is to construct a statistical procedure which can be shown,asymptotically and demonstrated empirically, to be able to identify M o asthe index set of the oracle model within the sample space of all 2 p candidate FI VARIABLE SELECTION subsets M ⊂ { , . . . , p } . For each index set, M , in the sample space theconditional sampling distribution of the data is assumed as (3) Y | β M , σ M ∼ N n (cid:16) X M β M , σ M I n (cid:17) . The centerpiece of our methodology is then the following definition. Thefunction | · | denotes the absolute value function if its argument is scalar-valued, and denotes the cardinality function if its argument is set-valued.The norms (cid:107) · (cid:107) and (cid:107) · (cid:107) refer, respectfully, to the usual L and L normsdefined on finite-dimensional Euclidean spaces. Definition . Assume fixed ε > . A given β M coupled within someindex subset M ⊂ { , . . . , p } is said to be ε - admissible if and only if h ( β M ) =1 , where (4) h ( β M ) := I (cid:110) (cid:107) X (cid:48) ( X M β M − Xb min ) (cid:107) ≥ ε (cid:111) , and b min solves min b ∈ R p (cid:107) X (cid:48) ( X M β M − Xb ) (cid:107) subject to (cid:107) b (cid:107) ≤ | M | − . Observe that this definition is consistent with the heuristic descriptionof ε - admissible subsets given in the previous section. In particular, if thesubset of covariates indexed by M is linearly dependent or if one of thecomponents of β M is zero, then h ( β M ) = 0. The subtlety in this definitionis assuming an appropriately chosen ε which is able to strike an optimalbalance for distinguishing signal from noise. Intuitively, ε = ε ( n, p, M ), i.e.,is a function of the amount of information available given by n , the difficultyof the problem represented by p , and information about a given M beingconsidered such as | M | . For instance, if | M | > n then h ( β M ) = 0 because X M cannot have full rank. In this case any ε > ε matters a lot if | M | ≤ n . The choice of ε is a major focus of Section3 where the main results of the paper are presented, and from where wesuggest the following default choice: (5) ε = Λ M (cid:98) σ M (cid:16) n . 
| M | log( pπ ) . − p o (cid:17) + , where Λ M := tr (cid:0) ( H M X ) (cid:48) H M X (cid:1) and (cid:98) σ M := RSS M / ( n − | M | ) with RSS M := y (cid:48) ( I n − H M ) y and H M := X M ( X (cid:48) M X M ) − X (cid:48) M , and the vector y an obser-vation from the true model (2). The parameter p o represents prior beliefabout | M o | , the number of covariates in the true model M o . In practice, avalue of p o can be directly specified or selected by cross-validation. A built-incross-validation procedure is included in the accompanying software to thispaper. Details are provided with the simulation study in Section 4. J. WILLIAMS AND J. HANNIG
Within the h function in Definition 2.1 the quantity (cid:107) X (cid:48) ( X M β M − Xb min ) (cid:107) represents the difference in prediction for a subset M against allsubsets with fewer covariates. This measure of distance has been adaptedfrom Candes and Tao (2007), but they deal with the error (cid:107) X (cid:48) ( y − Xb ) (cid:107) ∞ over b ∈ R p . This is very different from using X M β M in place of y becausethe former results in a noiseless measure of distance. To illustrate, observethat E y (cid:0) (cid:107) X (cid:48) ( Y − Xb ) (cid:107) (cid:1) = (cid:107) X (cid:48) ( X M o β M o − Xb ) (cid:107) + σ M o · p, where E y ( · ) is used to denote the expectation taken with respect to thesampling distribution of the data Y .There are various reasons for using the quantity X (cid:48) ( X M β M − Xb ) fromthe Dantzig selector (Candes and Tao, 2007) versus simply the difference inpredictions ( X M β M − Xb ) as in the LASSO (Tibshirani, 1996). One reasonis that X (cid:48) ( X M β M − Xb ) accounts for difference in predictions as well ascorrelations with the explanatory data, as discussed in Berk (2008). If thedifference in predictions is small but is highly correlated with the designmatrix, then it is likely that the smaller subset of covariates is unable toaccount for the effect of one or more of the covariates in M . Thus, using X (cid:48) ( X M β M − Xb ) instead of just the difference in predictions is a methodof controlling for potential omitted variable effects which could incorrectlyfind a close fitting subset to M . Another advantage of X (cid:48) ( X M β M − Xb ) isthat it is invariant under orthogonal transformations of the design matrix,as pointed out in Candes and Tao (2007).Now that the foundation for ε - admissible subsets of the parameter spacehas been laid out, it remains to show how Definition 2.1 can be coupledwith a likelihood based approach for constructing a probability distributionover index subsets M ⊂ { , . . . , p } . This is a common strategy for Bayesianapproaches, i.e., construct a prior density with desirable properties for vari-able selection and then couple the prior with a likelihood function of thedata to study the resulting posterior distribution. However, it is not clearwhat sort of a prior to use within our ε - admissible subsets approach, and re-cent developments in generalized fiducial inference (GFI) offer a systematicmethod of deriving objective Bayes-like posterior distributions. To illustrateas in Hannig et al. (2016), suppose that some data Y = G ( U, θ ) for somedeterministic data generating equation G ( · , · ), some parameters θ , and somerandom component U whose distribution is independent of θ and is com-pletely known. The generalized fiducial distribution of θ is then given by r ( θ | y ) = f ( y, θ ) J ( y, θ ) (cid:82) Θ f ( y, θ (cid:48) ) J ( y, θ (cid:48) ) dθ (cid:48) , where f is the likelihood function and FI VARIABLE SELECTION J ( y, θ ) = D (cid:18) ddθ G ( u, θ ) (cid:12)(cid:12)(cid:12) u = G − ( y,θ ) (cid:19) with D ( A ) = (det A (cid:48) A ) . The component J ( y, θ ) is termed the J acobian because it results from inverting the data generating equation on the data.We are committing a slight abuse of notation as r ( θ | y ) is not a conditionaldensity in the usual sense. 
Instead, we are using this notation to stress thatthe generalized fiducial distribution is a function of the observed data y .To make matters concrete in the linear model setting of (3), the pa-rameters are θ = ( β M , σ M ), the data generating equation is specified as G (cid:0) U, ( β M , σ M ) (cid:1) = X M β M + σ M U where U ∼ N n (0 , I n ), and the Jacobianterm reduces to J (cid:0) y, ( β M , σ M ) (cid:1) = σ − M | det( X (cid:48) M X M ) | RSS M . Thus, r (cid:0) ( β M , σ M ) | y (cid:1) ∝ σ − nM e − (cid:107) y − XM βM (cid:107) σ M σ − M | det( X (cid:48) M X M ) | RSS M · h ( β M ) , where the factor of h ( β M ) appears in the likelihood from only considering ε - admissible subsets of the parameter space. Accordingly, as is done with aBayesian posterior density and in Section 3 of Hannig et al. (2016), define theGFI probability of a given subset M to be proportional to the normalizingconstant of r (cid:0) ( β M , σ M ) | y (cid:1) . That is, r ( M | y ) := (cid:82) f (cid:0) y, ( β M , σ M ) (cid:1) J (cid:0) y, ( β M , σ M ) (cid:1) h ( β M ) d ( σ M , β M ) p (cid:80) j =1 (cid:80) | M | = j (cid:82) f (cid:0) y, ( β M , σ M ) (cid:1) J (cid:0) y, ( β M , σ M ) (cid:1) h ( β M ) d ( σ M , β M ) ∝ (cid:90) R pM (cid:90) ∞ h ( β M ) | det( X (cid:48) M X M ) | RSS M σ n +1 M e ( y − XM βM ) (cid:48) ( y − XM βM )2 σ M dσ M dβ M , which simplifies to (6) r ( M | y ) ∝ π | M | Γ (cid:16) n − | M | (cid:17) RSS − ( n −| M |− ) M E( h ( β M )) , where the expectation is taken with respect to the location-scale multivariateT distribution, (7) t n −| M | (cid:16) (cid:98) β M , RSS M n − | M | ( X (cid:48) M X M ) − (cid:17) with (cid:98) β M := ( X (cid:48) M X M ) − X (cid:48) M y . Notice that the quantity E( h ( β M )) is a func-tion of the observed data y .Observe that (6) expresses the relative likelihood of the subset M overall 2 p possible subsets. The expression can be described as a product oftwo terms, the first being comprised of information from the sampling dis-tribution of the data and largely driven by the residual sum of squares,RSS M , and the second having to do with the ε - admissibility of β M , in the J. WILLIAMS AND J. HANNIG form of E( h ( β M )). Thus, the support of r ( M | y ) in (6) is dominated by the ε - admissible subsets, as desired.Section 3 provides the conditions and supporting lemmas and theoremsneeded to show that r ( M o | Y ) → n, p → ∞ . First however,a few remarks are provided about computing r ( M | y ) on actual data.2.1. Remarks on computation.
With a probability distribution now de-fined over ε - admissible subsets, it must be demonstrated that r ( M | y ) in(6) can be efficiently computed. There are two main computational issues todeal with. The first is to evaluate h ( β M ) for a given β M , and the second is tosample subsets M via pseudo-marginal based MCMC. The computationalcomplexity and the need for pseudo-marginal based MCMC arises becauseneither h ( β M ) nor E( h ( β M )) have a closed form solution.To evaluate h ( β M ) for a given β M we adapt an explicit L minimizationalgorithm introduced in Bertsimas, King and Mazumder (2016). The authorsstate that their algorithm borrows ideas from projected gradient descent andmethods in first-order convex optimization, and solves problems of the formmin b ∈ R p g ( b ) subject to (cid:107) b (cid:107) ≤ κ , where g ( b ) ≥ (cid:107)∇ g ( b ) − ∇ g ( (cid:101) b ) (cid:107) ≤ l (cid:107) b − (cid:101) b (cid:107) . The algorithm isnot guaranteed to find a global optimum (unless formal optimality tests arerun, which can take a long time), but Bertsimas, King and Mazumder (2016)provide provable guarantees that the algorithm will converge to a first-orderstationary point, which is defined as a vector ˜ b ∈ R p with (cid:107) ˜ b (cid:107) ≤ κ whichsatisfies ˜ b = ˜ b − l ∇ g (˜ b ). Paraphrasing from Bertsimas, King and Mazumder(2016), their algorithm detects the active set after a few iterations, and thentakes additional time to estimate the coefficient values to a high accuracylevel. In our application of their algorithm we are not first-most interestedin finding a global optimum. To evaluate h ( β M ), we need only determine ifmin b ∈ R p (cid:107) X (cid:48) ( X M β M − Xb ) (cid:107) is smaller than ε (as in (5)). For β M which arenot ε - admissible , the objective function, (cid:107) X (cid:48) ( X M β M − Xb ) (cid:107) , can be madesmall very quickly via our implementation of the L minimization algorithm.To illustrate how consider the following specifics of our implementation. Theprecise details regarding the algorithm can be found accompanying our soft-ware documentation at https://jonathanpw.github.io/software.html .First, to estimate E( h ( β M )) we use a sample mean of sample vectorsdrawn from the location-scale multivariate T distribution in (7). This mul-tivariate T distribution is centered at the least squares estimator, (cid:98) β M , andmultivariate theory suggests that (cid:98) β M will on average be close to the coeffi-cients β M . By warm starting the L minimization algorithm at (cid:98) β M with thesmallest coefficient removed, subsets corresponding to β M with at least one FI VARIABLE SELECTION zero coefficient typically yield h ( β M ) = 0 within a few steps of the algorithm.Second, as per the definition of h ( · ) in (4) the objective function is min-imized over all b ∈ R p with (cid:107) b (cid:107) ≤ | M | −
1. Hence, the κ required for the L minimization algorithm from Bertsimas, King and Mazumder (2016) isnaturally chosen for us as κ = | M | −
1. Knowing how to choose κ greatly re-duces the L optimization problem. Moreover, our implementation is furthersimplified by the fact that the closest prediction to X M β M for a given M isguaranteed to have | M | − h ( β M ) need not be minimized over all b ∈ R p with (cid:107) b (cid:107) ≤ | M | −
1, butcan be minimized over all b ∈ R p with (cid:107) b (cid:107) = | M | − M via pseudo-marginal based MCMC. We do this by using the Grouped IndependenceMetropolis Hastings (GIMH) algorithm from Andrieu and Roberts (2009),but originally introduced in Beaumont (2003). The reason standard MCMCtechniques do not apply is that there is no obvious closed form expressionfor the probability mass function (6) because of the expectation, E( h ( β M )),in the expression. As described in Andrieu and Roberts (2009) such situa-tions warrant introducing a latent variable to yield analytical expressions oreasier implementation.In the case of r ( M | y ) in (6), we introduce the latent location-scale multi-variate T vector in (7) from within the expectation E( h ( β M )). Our pseudo-marginal based MCMC is carried out by sampling an index subset M alongwith sampling some pre-specified number, N , of multivariate T vectors (cor-responding to M ) from (7). The sample of multivariate T vectors, say B , isthen used to compute the sample mean estimate of E( h ( β M )). Accordingly,we define a joint Markov chain on ( M, B ), but discard B to obtain samplesfrom the marginal distribution of M . As argued in Andrieu and Roberts(2009), this is a valid MCMC sampling strategy, but is known to suffer fromslower mixing than if we were able to integrate the β M out of the massfunction r ( M | y ) in (6), i.e., analytically evaluate E( h ( β M )). However, this isnot possible given the h function in (4). Additionally, the mixing associatedwith pseudo-marginal approaches is known to be poor when the number ofimportance samples ( N , the sample size of B ) is small. These practical bot-tlenecks outline avenues for future research. Nonetheless, we demonstrate inSection 4 that our computational strategies are efficient enough to be imple-mented on actual data, in comparison to other common penalized likelihoodand Bayesian approaches.
3. Theoretical results.
The main objective of this section is to showunder what conditions, asymptotically, r ( M o | Y ) in (6) will converge to 1, J. WILLIAMS AND J. HANNIG particularly if p >> n . The ε - admissible subsets approach is able to achievesuch a strong consistency result because the resulting sample space is ef-fectively reduced to only those subsets with no redundancies. The essenceof the mathematical result is that the space of ε - admissible sets is smallenough that the true model can be detected. This addresses the issue raisedin Ghosh and Ghattas (2015) that high-dimensional settings often lead to ar-bitrarily small probabilities for all models (including the true model) simplybecause there are too many models to consider.3.1. Discussion of the conditions.
The first two conditions, Condition3.1 and Condition 3.2, are to ensure that the true model, M o , is identifiable.Observe from (4) that ε is used to control the sensitivity and specificityfor identifying ε - admissible subsets. In particular, if ε is too large, then h ( β M o ) will incorrectly be set to zero implying that β M o is not ε - admissible .Condition 3.1 specifies how large ε can be whilst the true model remainsidentifiable. This condition turns out to be critically important in actual dataapplications because computing h ( β M o ) is closely related to the comparisonin equation (8). Condition . For large n and p , (8) 118 (cid:107) X (cid:48) ( X M o β M o − Xb min ) (cid:107) ≥ ε where b min solves min b ∈ R p (cid:107) X (cid:48) ( X M o β M o − Xb ) (cid:107) subject to (cid:107) b (cid:107) ≤ | M o | − . Condition 3.2 is born from Lemma A.1 which is an important necessaryresult for the main result of this paper, Theorem 3.9. The term log( n ) γ represents the sparsity assumption for the true model, i.e., the number ofcovariates in the true model must not exceed log( n ) γ for some fixed scalar γ >
0. The γ parameter indicates that the asymptotic results remain trueif the true model grows faster than log( n ), but not faster than some powerof log( n ). In finite-sample applications, γ has no consequence and can beignored.The constant α ∈ (0 ,
1) reflects the only explicit restriction needed onthe sample space of 2 p subsets to show that r ( M o | Y ) → n and p , Theorem 3.9. The residual sum of squares term in r ( M | Y )in (6) cannot be controlled (as a ratio to r ( M o | Y )) for arbitrary subsets M with | M | = O ( n ) because the column span of X M includes y ∈ R n when rank( X M ) = n . To eliminate such subsets from the sample space,Condition 3.2 requires that only subsets of size | M | ≤ n α can be givennonzero probability. However, recall from Definition 2.1 that h ( β M ) = 0 if FI VARIABLE SELECTION | M | > n because in this case the columns of X M must be linearly dependent.Accordingly, all subsets M with | M | > n are given zero probability, bydefinition. Evidenced by this fact, the only explicit restriction placed on thesample space is that subsets M with | M | ∈ ( n α , n ) are excluded. In sparsesettings it is assumed that | M o | << n anyway, so neglecting such subsets isreasonable. Convergence to the true model M o will be quicker for smaller α because there are less models to consider, but too small of an α will exclude M o from the sample space.In Condition 3.2, and for the remainder of this section assume that γ > γ = 1, and α ∈ (0 , α = .
5, have been chosen and fixed atappropriate values.
Condition . The true model M o satisfies | M o | ≤ log( n ) γ , and lim n →∞ p →∞ min (cid:110) ∆ M | M o | log( p ) : M (cid:54) = M o , | M | ≤ | M o | (cid:111) = ∞ , lim inf n →∞ p →∞ n − α log( p ) > , and log( p ) < n − | M o | −
14 log( n ) γ , for large n and p , where ∆ M := (cid:107) X M o β M o − H M X M o β M o (cid:107) as in Lai, Hannigand Lee (2015). This is a slightly weaker version of condition (11) in Lai, Hannig andLee (2015). They relate Condition 3.2 to the sparse Riesz condition (Zhangand Huang, 2008) which requires that the eigenvalues of X (cid:48) M X M /n areuniformly bounded away from 0 and ∞ . Essentially, ∆ M is a measure ofhow distinct the true model predictions X M o β M o are from their projectiononto the column space of X M for M (cid:54) = M o and | M | ≤ | M o | . Recall that H M := X M ( X (cid:48) M X M ) − X (cid:48) M . In particular, if X M is orthogonal to X M o ,then ∆ M = (cid:107) X M o β M o (cid:107) which will be much larger than the denominator, | M o | log( p ). The requirements of this condition are reasonable because ∆ M grows very fast for M such that M o (cid:54)⊆ M .Condition 3.2 is important for being able to identify the true modelamongst other models M with | M | ≤ | M o | . The next two conditions ad-dress the requirements for M with | M | > | M o | , which primarily rely on thefact that such subsets are not ε - admissible .Conditions 3.3 and 3.4 demonstrate how large ε needs to be to achieve theconsistency result of the main theorem. Condition 3.3 states that for subsetsof covariates with redundancies, ε needs to be larger than the difference inprojections of the true model prediction, X M o β M o , onto M and onto a strictsubset of M . This condition facilitates the intuition that the variable selec-tion procedure will not concentrate on subsets M with redundant covariates. J. WILLIAMS AND J. HANNIG
If a given subset M is not ε - admissible , then the difference in projectionswill be small so that the condition is easily achieved. Condition . For any M with | M | > | M o | , for large n or p , (cid:107) X (cid:48) ( H M − H M ( − ) X M o β M o (cid:107) < ε, where H M ( − is the projection matrix for M after omitting the covariatewhich minimizes (cid:107) X (cid:48) ( H M − H M ( − ) X M o β M o (cid:107) . In fact, if M o ⊂ M with | M o | < | M | , then H M X M o β M o = H M ( − X M o β M o in which case Condition 3.3 holds trivially.Lastly, Condition 3.4 describes the rate at which ε needs to grow to achievethe consistency of the main result. The distinction between Condition 3.3and Condition 3.4 is that the former provides a necessary lower bound forarguing that E( h ( β M )) vanishes for M such that | M | > | M o | , while thelatter provides the rate at which ε must grow to achieve the consistencyresult of Theorem 3.9. The terms which compete with ε arise in the proofsof Lemma A.1 and Theorem 3.9. Recall that (cid:98) σ M := RSS M / ( n − | M | ), whereRSS M is the classical residual sum of squares for model M , and that Λ M :=tr (cid:0) ( H M X ) (cid:48) H M X (cid:1) . Condition . lim n →∞ p →∞ min | M |≤ n α ε M (cid:98) σ M + D | M o | − ϕ ( M, n, p )log( n ) = ∞ , where ϕ ( M, n, p ) := 4 e n α + ( D + (1 + 4 e ) log( p )) | M | + log (cid:16) | M | − n −| M | (cid:17) , and D = log (cid:16) π − nα +2 n (cid:17) . Additionally, ε M (cid:98) σ M < n −| M | for all M with | M | ≤ n α . The Λ M term arises from Lemmas 3.5 and 3.6. It is intimately relatedto the presence of collinearity amongst the covariates, and Condition 3.4implies that ε must account for collinearity by controlling for Λ M . Observethat if X is orthogonal, then Λ M = | M | .3.2. Main result.
The first two results are lemmas which are needed inthe proofs of Theorems 3.7 and 3.8. Lemma 3.5 illustrates the rate at which β M concentrates around its mean, (cid:98) β M , the least squares estimator, andLemma 3.6 illustrates the rate at which (cid:98) β M concentrates around its mean, FI VARIABLE SELECTION E y ( (cid:98) β M ). Theorem 3.7 uses these two lemmas to bound the rate at which β M concentrates around E y ( (cid:98) β M ) for subsets M with | M | > | M o | . This yields anupper bound on E( h ( β M )) with a probabilistic guarantee, and implies thatE( h ( β M )) vanishes for large n and p , for large non- ε - admissible subsets.The proofs are relegated to Appendix A. Lemma . For any fixed c ∈ (0 , assume | M | ≤ c n , and choose n and p such that ε M (cid:98) σ M < n −| M | . If β M ∼ t n −| M | (cid:16) (cid:98) β M , RSS M n − | M | ( X (cid:48) M X M ) − (cid:17) , where (cid:98) β M = ( X (cid:48) M X M ) − X (cid:48) M y , then P (cid:16) (cid:107) X (cid:48) X M ( β M − (cid:98) β M ) (cid:107) ≥ ε (cid:17) ≤ | M | (cid:98) σ M √ Λ M e − ε M (cid:98) σ M √ πε (1 − n −| M | ) . In the next lemma, P y is used to denote the probability measure associatedwith the sampling distribution of the data Y . Lemma . Assume | M | < n , and Y | β M , σ M ∼ N n (cid:16) X M β M , σ M I n (cid:17) .Then the classical least squares estimator (cid:98) β M ∼ N ( E y ( (cid:98) β M ) , σ M ( X (cid:48) M X M ) − ) ,and P y (cid:16) (cid:107) X (cid:48) X M (cid:0) (cid:98) β M − E y ( (cid:98) β M ) (cid:1) (cid:107) ≥ ε (cid:17) ≤ | M | σ M √ Λ M √ πε e − ε σ M Λ M . Combining these two lemmas gives the following non-asymptotic concen-tration result for models that are larger than the true model. Recall thatthe expectation E( h ( β M )) depends on the observed data y . The followingtwo theorems study the frequentist behavior of this quantity with respectto the sampling distribution of Y . Theorem . For any fixed c ∈ (0 , suppose | M o | < | M | ≤ c n ,choose n and p such that ε M (cid:98) σ M < n −| M | , and assume that ε satisfies Con-dition 3.3. Then P y (cid:18) E ( h ( β M )) ≤ | M | (cid:98) σ M √ Λ M √ πε (1 − n −| M | ) e − ε M (cid:98) σ M (cid:19) ≥ − | M | σ M √ Λ M √ επe ε σ M Λ M . The next result is a probabilistic guarantee the true model is ε - admissible given ε satisfies Condition 3.1. This result is a statement that M o is identi-fiable. J. WILLIAMS AND J. HANNIG
Theorem . For any fixed c ∈ (0 , suppose | M o | ≤ c n , choose n and p such that ε Mo (cid:98) σ Mo < n −| M o | , and assume that ε satisfies Condition3.1. Then P y (cid:32) E ( h ( β M o )) ≥ − p M o (cid:98) σ M o (cid:112) Λ M o √ πε (1 − n − p Mo ) e ε Mo (cid:98) σ Mo (cid:33) ≥ − p M o σ M o (cid:112) Λ M o √ επe ε σ Mo Λ Mo . The following result is the main result of the paper. It shows that theratio of the generalized fiducial probability of the true model to the sumover that of all other subsets of covariates M satisfying | M | ≤ n α willconverge to 1 in probability for large n and p . Note that the restriction tosubsets M with | M | ≤ n α is a stronger restriction than | M | ≤ c n , which issufficient for Theorems 3.7 and 3.8. The reason being that the main result isstronger than the results of these two theorems. In fact, Theorems 3.7 and3.8 are non-asymptotic results that hold for each fixed model M separately,while Theorem 3.9 is an asymptotic result which applies uniformly over all | M | ≤ n α . Just like with a conditional distribution, r ( M | Y ) is obtained byreplacing the observed data y with the random variable Y (random withrespect to the sampling distribution), in (6). Theorem . Given Conditions 3.1-3.4, the true model M o satisfies, r ( M o | Y ) (cid:80) n α j =1 (cid:80) M : | M | = j r ( M | Y ) P y − → as n → ∞ or n, p → ∞ . Although this is an asymptotic result, many of the ingredients that areused in its proof are non-asymptotic concentration results which are validif Conditions 3.1-3.4 are satisfied. Therefore, it can be expected that whenthese conditions are satisfied, in finite-sample situations the generalized fidu-cial distribution will concentrate on the true model M o . This expectationis indeed validated by the empirical performance of the procedure, which isdemonstrated in the following section.
4. Simulation results.
This section serves to demonstrate the empir-ical performance of our algorithm on synthetic data. It is comprised of es-sentially two simulation setups. The first setup, similar to that presented inJohnson and Rossell (2012), compares our procedure to the nonlocal priorapproach, the spike and slab LASSO of Roˇckov´a and George (2016), the elas-tic net as implementated in the
P ython module scikit-learn (Pedregosaet al., 2011), and to the SCAD as implementated in the R package ncvreg (Breheny and Huang, 2011). The authors of the nonlocal prior and the spike FI VARIABLE SELECTION and slab LASSO, respectively, have made available the R packages mombf and SSL for implementing their methods.The second setup illustrates a critical difference between our ε - admissible subsets approach and the nonlocal prior approach. Namely, for highly collinearfinite-sample settings in which the true model is not uniquely expressed,given the level of noise in the data (i.e., σ M o ), we demonstrate that ourapproach concentrates (in the sense of the MAP estimator) on subsets withfewer covariates without sacrificing prediction error. The intuition for whythis should be the case was discussed in Section 1.4.1. Simulation setup 1.
Here we generate 2000 data vectors y accordingto model (2) with M o consisting of 8 covariates corresponding to β M o =( − . , − , − . , − . , . , . , , . (cid:48) , and σ M o = 1. The n × p design matrix X is generated with rows from the N p (0 , Σ) distribution, where the diagonalcomponents Σ ii = 1 and the off-diagonal components Σ ij = ρ for i (cid:54) = j . Thefirst 1000 y correspond to an independent design with ρ = 0, while the last1000 y correspond to ρ = .
25, as in the simulation setup of Johnson andRossell (2012). Note that 2000 design matrices X are generated, and one y is generated from each design. The sample size n is set at n = 100, and p = 100 , , , ,
500 are all considered.We implement our algorithm on each of the 2000 synthetic data sets for15000 MCMC steps with the first 5000 discarded. Squared coefficient esti-mates from elastic net (using scikit-learn ) added by n − serve as MCMCcovariate proposal weights. The default ε in (5) is used, and we implementeda 10-fold cross-validation scheme for choosing our tuning parameter p o (priorto starting the algorithm). The cross-validation consists of breaking the datainto 10 folds (with a different set of 10 observations held out at each foldsince n = 100), and implementing our MCMC algorithm separately for each p o in the grid { , , . . . , } , on each of the 10 training sets. Each of the 10implementations of the MCMC on each of the 10 training sets is run for200 steps with the first 100 steps discarded ( N = 30 is set during the cross-validation procedure). Squared nonzero coefficient estimates from elastic net(using scikit-learn ) serve as MCMC covariate proposal weights withinthe cross-validation procedure. The MAP estimated subset for each MCMCchain is used to compute the Bayesian information criterion (BIC) on theheld-out test set, and the computed BIC values are averaged over the 10test sets, for each p o ∈ { , , . . . , } . The p o corresponding to the minimumaverage test set BIC is then selected.Finally, for our implementation of the algorithm post-selection of p o , thenumber of importance samples for estimating E( h ( β M )) within each step of J. WILLIAMS AND J. HANNIG the algorithm is set at N = 100 which, through empirical experimentation,seems to be enough. All competing variable selection procedures are imple-mented using existing software at default specifications. The one exceptionis that the nonlocal prior procedure is set to run for 5000 steps, as is the casein the simulation setup of Johnson and Rossell (2012). The nonlocal priorprocedure/software did not scale well for increased p , and required over aweeks worth of parallel computations on a computing cluster to obtain theresults for the first simulation setup. The tuning parameters for all methodsare chosen with the default cross-validation procedures provided with thesoftware. Lastly, as in the simulation section for Roˇckov´a and George (2016)their λ is set at 1 (with a grid of 20 λ values ending at 50), and theiradaptive (best performing) procedure is used with θ ∼ Beta(1 , p ).Figure 1 shows results of the first simulation setup. The first row of plotsdisplays the average generalized fiducial probability of the true model (i.e.,average r ( M o | y )), or the average posterior probability of the true model forthe Bayesian nonlocal procedure (i.e., average P ( M o | y )), over the 1000 syn-thetic data sets (for ρ = 0 and ρ = .
25, respectively). Conditional on thedata, these plots address the consistency of the procedures with respect tothe uncertainty from not knowing M o . This generalized fiducial or Bayesian-like consistency is that which is dealt with in Theorem 3.9. Note that fre-quentist and MAP estimators do not yield posterior probability estimates,and thus cannot be compared to in the first row of plots.The second row of Figure 1 shows the average proportion of correct modelselections over the 1000 synthetic data sets (for ρ = 0 and ρ = .
25, respec-tively). For the GFI and the Bayesian procedures the MAP subset is takento be the estimator of the true model, and in the frequentist procedures theestimated model is considered to be the subset of covariates with nonzerocoefficient estimates. These plots address the consistency of the procedureswith respect to repeated sampling (i.e., frequentist) uncertainty. Finally, thethird row of Figure 1 presents the average root mean squared error (RMSE)over the 1000 synthetic data sets (for ρ = 0 and ρ = .
25, respectively).The MAP estimated model is used to compute the RMSE for the GFI andBayesian procedures. For all procedures, the RMSE is computed on an out-of-sample test set of 100 observations.Note that the criterions used in the first two rows of Figure 1 are verystrict. They only reflect instances when the procedures are exactly correct,and count the procedure as incorrect if it is missing even one covariate fromthe true model or includes even one spurious covariate. Often the elastic netand SCAD are able to identify all of the true covariates but estimate ex-tra coefficients to be nonzero. This results in poor identification of the true
FI VARIABLE SELECTION subset, and worse out-of-sample prediction error. The remaining procedures(including our ε - admissible subsets method) struggle to identify the two co-variates with smallest coefficient magnitudes, but typically do not introducemore than one or two false positives.
100 200 300 400 500 p r ( M o | y ) o r P ( M o | y ) = 0
100 200 300 400 500 p r ( M o | y ) o r P ( M o | y ) = . 25
100 200 300 400 500 p P r o p o r t i o n o f c o rr e c t m o d e l s e l e c t i o n s admissible subsetsnonlocal priorSSLelastic netSCAD
100 200 300 400 500 p P r o p o r t i o n o f c o rr e c t m o d e l s e l e c t i o n s
100 200 300 400 500 p R M S E
100 200 300 400 500 p R M S E Fig 1 . The average r ( M o | y ) , or average P ( M o | y ) is displayed in the first row, the averageproportion of correct model selections in the second row, and the average RMSE in thethird. Averages are over 1000 synthetic data sets (for ρ = 0 and ρ = . , respectively).For the GFI and Bayesian procedures the MAP subset is used as the estimator of the truemodel, and in the frequentist procedures the estimated model is considered to be the subset ofcovariates with nonzero coefficient estimates. The RMSE is computed on an out-of-sampletest set of 100 observations. J. WILLIAMS AND J. HANNIG
Our ε - admissible subsets procedure evidently performs the best at as-signing highest posterior probability to the true model. And even in com-parison to the frequentist-oriented metrics, proportion of correct model se-lections and out-of-sample RMSE, our ε - admissible subsets procedure per-forms more or less on par with the best performing methods considered.Note that in collinear design settings (i.e., ρ >
0) it may be the case thata strict subset of the true data generating model is identified as the ‘true’model within the ε - admissible framework. This is meaningful because itmanifests the fact that the true model may not be minimal (i.e., containsredundant predictors) in collinear, finite-sample settings. Furthermore, itexplains the larger difference in proportion of true model selections betweenthe ε - admissible and nonlocal prior performances in the ρ = .
25 case (versusthe ρ = 0 case), which is accompanied by only a very small change in thedifference between RMSE performance. This phenomenon is illustrated in amore extreme case of collinearity in the next simulation setup.4.2. Simulation setup 2.
The ε - admissible subsets approach has beendeveloped in this paper as a method of obtaining a posterior-like distributionwhich effectively eliminates (i.e., assigns negligible probability to) all subsetswith redundancies. To illustrate that our constructed methods do just that,consider the following setup in which the true data generating model lacksuniqueness for the small sample size n = 30: (9) Y ∼ N n (cid:16) · x (1) + 1 · x (2) + · · · + 1 · x (9) , I n (cid:17) , where x (1) , x (2) , x (3) iid ∼ N n (0 , I n ), and x (4) ∼ N n (cid:16) . · x (1) , . I n (cid:17) x (5) ∼ N n (cid:16) . · x (2) , . I n (cid:17) x (6) ∼ N n (cid:16) − . · x (3) , . I n (cid:17) x (7) ∼ N n (cid:16) x (1) + x (3) , . I n (cid:17) x (8) ∼ N n (cid:16) x (2) − x (3) , . I n (cid:17) x (9) ∼ N n (cid:16) x (1) + x (2) + x (3) , . I n (cid:17) With standard deviations of .1 and a model error standard deviation of1, covariates x (4) , . . . , x (9) can all be approximately expressed as a linearcombination of x (1) , x (2) , x (3) . Accordingly, with a small increase in errorvariance, model (9) can be re-expressed using various combinations of the9 predictors. However, observe that a subset with 4 or more predictors pre-dominantly contains redundant information. FI VARIABLE SELECTION Recall from Section 1 that the nonlocal prior approach of Johnson andRossell (2012) is designed to assign negligible probabilities to subsets con-taining predictor(s) with coefficients of zero. So, in theory, the full subset { x (1) , . . . , x (9) } will remain the best candidate for the true model within thenonlocal prior framework. In fact, this is demonstrated to be the case inTable 1 which shows the performance of both the nonlocal prior and the ε - admissible subsets approach on 1000 data vectors y generated accordingto (9), with each covariate having a ‘true’ coefficient of 1. Note that as inthe first simulation setup 1000 design matrices X are generated to generatethe 1000 y vectors. model size RMSE r ( M MAP | y ) or P ( M MAP | y ) ε - admissible subsets 3.476 1.138 .365nonlocal prior 8.997 1.197 .333 Table 1
The average number of covariates in the MAP estimator, M MAP , ( | M MAP | ) is presentedin the first column, the average RMSE in the second, and the average r ( M MAP | y ) or P ( M MAP | y ) in the third. Averages are over 1000 synthetic data sets from model (9). TheRMSE is computed on an out-of-sample test set of 30 observations. Table 1 shows that the MAP estimate for the ε - admissible subsets ap-proach contains 3-4 covariates, on average, and that in fact the averageRMSE is smaller than that of the nonlocal prior approach. Indeed, the MAPestimates for the nonlocal prior procedure typically includes all 9 covariateseven though the y vectors can be mostly explained by only 3 of the pre-dictors. This simple simulation illustrates a pivotal difference between thenonlocal prior and ε - admissible subsets approaches. With p = 9 the im-plication of not discriminating against redundant subsets may seem trivial.However, the 2 p size of the sample space grows rapidly in p and thus, putsexponentially more burden on procedures which do not discriminate basedon redundancies. This is reflected by comparing the differences in strength ofasymptotic consistency achieved for the two procedures. Though, the consis-tency result of the nonlocal prior method from Johnson and Rossell (2012)is argued to be stronger in an as-of-yet unpublished manuscript by Shin,Bhattacharya and Johnson (2015).
5. Concluding remarks.
In this paper we have developed a new per-spective for variable selection to exploit a non-redundancy property of atrue data generating model. The basic idea calls for defining a true model asone which contains minimal amount of information necessary for explainingand/or predicting the observed data. The difference between our definitionof a true model and the usual definition arises only in finite-sample appli-cations, and was illustrated in Section 4.2. Within our variable selectionframework, this definition allows us to show that the posterior-like proba- J. WILLIAMS AND J. HANNIG bility of the true model converges to 1 asymptotically, even with p growingalmost exponentially in n , with the intuition that redundancies in the sam-ple space are very effectively eliminated. Moreover, our empirical simulationresults are consistent with this strong consistency result, and as desired,it is demonstrated in a situation of high collinearity that the ε - admissible subsets approach yields a posterior-like distribution which is concentratedover subsets with fewer covariates, without sacrificing prediction error.A non-redundancy property of a true data generating model is seeminglygeneral enough to extend to variable or feature selection problems beyondthe linear model setting, but would seem infeasible if it could not be de-veloped for the linear model setting. Thus, the goal of this paper has beento establish the potential feasibility of exploiting such a property. In futurework we hope to extend our methods to more complicated settings. References.
Andrieu, C. and
Roberts, G. (2009). The psuedo-marginal approach for efficient montecarlo computations.
The Annals of Statistics Beaumont, M. A. (2003). Estimation of population growth or decline in geneticallymonitored populations.
Genetics
Berger, J. O. , Pericchi, L. R. , Ghosh, J. , Samanta, T. , De Santis, F. , Berger, J. and
Pericchi, L. (2001). Objective Bayesian methods for model selection: introductionand comparison.
Lecture Notes-Monograph Series
Berk, R. A. (2008).
Statistical Learning from a Regression Perspective . Springer, NewYork, NY.
Bertsimas, D. , King, A. and
Mazumder, R. (2016). Best Subset Selection via a ModernOptimization Lens.
The Annals of Statistics Bondell, H. D. and
Reich, B. J. (2012). Consistent high-dimensional Bayesian variableselection via penalized credible regions.
Journal of the American Statistical Association
Breheny, P. and
Huang, J. (2011). Coordinate descent algorithms for nonconvex pe-nalized regression, with applications to biological feature selection.
Annals of AppliedStatistics Candes, E. and
Tao, T. (2007). The Dantzig Selector: Statistical estimation when p ismuch greater than n . The Annals of Statistics Fan, J. and
Li, R. (2001). Variable Selection via Nonconcave Penalized Likelihood andits Oracle Properities.
Journal of the American Statistical Association Ghosh, J. and
Ghattas, A. E. (2015). Bayesian Variable Selection Under Collinearity.
The American Statistician Hannig, J. , Iyer, H. , Lai, R. C. S. and
Lee, T. C. M. (2016). Generalized FiducialInference: A Reviev and New Results.
Journal of American Statistical Association
Jameson, G. J. O. (2013). Inequalities for gamma function ratios.
American Math.Monthly
Johnson, V. E. and
Rossell, D. (2012). Bayesian Model Selection in High-DimensionalSettings.
Journal of the American Statistical Association .FI VARIABLE SELECTION Lai, R. C. S. , Hannig, J. and
Lee, T. C. M. (2015). Generalized Fiducial Inferencefor Ultrahigh-Dimensional Regression.
Journal of the American Statistical Association
Lan, H. , Chen, M. , Flowers, J. B. , Yandell, B. S. , Stapleton, D. S. , Mata, C. M. , Mui, E. T. , Flowers, M. T. , Schueler, K. L. , Manly, K. F. , Williams, R. W. , Kendziorski, K. and
Attie, A. D. (2006). Combined expresssion trait correlationsand expression quantitative trait locus mapping.
PLoS Genetics . Luo, S. and
Chen, Z. (2013). Extended BIC for Linear Regression Models With Diverg-ing Number of Relevant Features and High or Ultra-High Feature Spaces.
Journal ofStatistical Planning and Inference
Narisetty, N. N. and
He, X. (2014). Bayesian variable selection with shrinking anddiffusing priors.
The Annals of Statistics Pedregosa, F. , Varoquaux, G. , Gramfort, A. , Michel, V. , Thirion, B. , Grisel, O. , Blondel, M. , Prettenhofer, P. , Weiss, R. , Dubourg, V. , Vanderplas, J. , Pas-sos, A. , Cournapeau, D. , Brucher, M. , Perrot, M. and
Duchesnay, E. (2011).Scikit-learn: Machine Learning in Python.
Journal of Machine Learning Research Rossell, D. and
Telesca, D. (2017). Nonlocal Priors for High-Dimensional Estimation.
Journal of the American Statistical Association
Roˇckov´a, V. and
George, E. I. (2016). The Spike-and-Slab LASSO.
Journal of theAmerican Statistical Association . Shin, M. , Bhattacharya, A. and
Johnson, V. E. (2015). Scalable Bayesian variableselection using nonlocal prior densities in ultrahigh-dimensional settings. arXiv preprintarXiv:1507.07106 . Tibshirani, R. (1996). Regression Shrinkage and Selection via the Lasso.
Journal of theRoyal Statistical Society Yuan, M. and
Lin, Y. (2005). Efficient Empirical Bayes Variable Selection and Estimationin Linear Models.
Journal of the American Statistical Association
Zhang, C. and
Huang, J. (2008). The Sparsity and Bias of the Lasso Selection in High-Dimensional Linear Regression.
The Annals of Statistics APPENDIX A: PROOFS
Proof of Lemma 3.5.
From the distributional assumption on β M it followsthat T := (cid:115) n − | M | RSS M ( X (cid:48) M X M ) ( β M − (cid:98) β M ) ∼ t n −| M | (0 , I | M | ) . Thus, (cid:107) X (cid:48) X M ( β M − (cid:98) β M ) (cid:107) = (cid:107) AT (cid:107) RSS M n − | M | = (cid:107) QDW (cid:48) T (cid:107) RSS M n − | M | where A = X (cid:48) X M ( X (cid:48) M X M ) − is a p × | M | matrix which has the singularvalue decomposition A = QDW (cid:48) for Q and W each orthogonal matrices. Bythe definition of the multivariate T distribution, T = (cid:114) n − | M | V Z = ⇒ W (cid:48) T = (cid:114) n − | M | V W (cid:48) Z =: (cid:114) n − | M | V (cid:101) Z =: (cid:101) T , where V ∼ χ n −| M | and Z, (cid:101) Z ∼ MVN(0 , I | M | ). Then J. WILLIAMS AND J. HANNIG (cid:107) AT (cid:107) = (cid:107) QD (cid:101) T (cid:107) = (cid:101) T (cid:48) D (cid:48) D (cid:101) T = | M | (cid:88) (cid:101) T i λ i ≤ Λ M max (cid:101) T j where Λ M = (cid:80) | M | λ i , and λ i is the i th eigenvalue of A (cid:48) A . Observe that Λ M also has the more intuitive expression Λ M = tr( A (cid:48) A ) = tr (cid:0) ( H M X ) (cid:48) H M X (cid:1) which illustrates that it is intimately related to the presence of collinearityamongst the covariates.Recall that (cid:98) σ M := RSS M / ( n − | M | ). Then P (cid:16) (cid:107) X (cid:48) X M ( β M − (cid:98) β M ) (cid:107) ≥ ε (cid:17) = P (cid:16) (cid:107) AT (cid:107) RSS M n − | M | ≥ ε (cid:17) ≤ P (cid:16)
12 Λ M max (cid:101) T j RSS M n − | M | ≥ ε (cid:17) = P (cid:16) n − | M | V max Z j (cid:98) σ M ≥ ε M (cid:17) . Since V ∼ χ n −| M | , P (cid:16) (cid:107) X (cid:48) X M ( β M − (cid:98) β M ) (cid:107) ≥ ε (cid:17) ≤ (cid:90) ∞ P (cid:16) max | Z j | ≥ (cid:112) εv/ ( n − | M | )3 (cid:98) σ M √ Λ M (cid:17) v n −| M | − e − v n −| M | Γ( n −| M | ) dv ≤ (cid:90) ∞ | M | (cid:88) j =1 P (cid:16) | Z j | ≥ (cid:112) εv/ ( n − | M | )3 (cid:98) σ M √ Λ M (cid:17) v n −| M | − e − v n −| M | Γ( n −| M | ) dv ≤ | M | (cid:98) σ M (cid:112) Λ M ( n − | M | ) √ πε (cid:18) ε M (cid:98) σ M ( n −| M | ) (cid:19) n −| M | − · Γ( n −| M |− )Γ( n −| M | )(10) where the last inequality follows because for the standard normal CDF, Φfor x >
0, Φ( − x ) ≤ √ π e − x x . To simplify the bound, observe first that Jameson (2013) gives (11) Γ( n −| M |− )Γ( n −| M | ) ≤ (cid:112) n − | M | ) n − | M | − . Second, observe that for 0 ≤ x ≤ n , (cid:16) xn (cid:17) − n ≤ e − x + x n = e − x (1 − x n ) ≤ e − x . By assumption | M | ≤ c n , and ε M (cid:98) σ M < n −| M | which implies that FI VARIABLE SELECTION (cid:32) ε M (cid:98) σ M ( n −| M | ) (cid:33) − n −| M | + ≤ e − ε M (cid:98) σ M . Therefore, applying (11) and (12) to (10) gives P (cid:16) (cid:107) X (cid:48) X M ( β M − (cid:98) β M ) (cid:107) ≥ ε (cid:17) ≤ | M | (cid:98) σ M √ Λ M √ πε (1 − n −| M | ) e − ε M (cid:98) σ M . (cid:4) Proof of Lemma 3.6.
Proof of Lemma 3.6.
Similar to the proof of Lemma 3.5. $\square$
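As an aside, the two facts driving the proof of Lemma 3.5, the trace identity for $\Lambda_M$ and the eigenvalue bound on $\|AT\|^2$, are easy to check numerically. The short sketch below is illustrative only; the dimensions, seed, and variable names are our own arbitrary choices and are not from the paper.

    import numpy as np

    # Check two facts used in the proof of Lemma 3.5:
    # (i)  Lambda_M = tr(A'A) = tr((H_M X)'(H_M X)) for A = X'X_M (X_M'X_M)^{-1/2},
    # (ii) ||A t||^2 <= Lambda_M * max_j (W't)_j^2, where A = Q D W' is the SVD of A.
    rng = np.random.default_rng(0)
    n, p, m = 50, 10, 4                                    # sample size, covariates, |M|
    X = rng.standard_normal((n, p))
    XM = X[:, :m]                                          # design matrix of the subset M

    G = XM.T @ XM
    evals, evecs = np.linalg.eigh(G)
    G_inv_half = evecs @ np.diag(evals ** -0.5) @ evecs.T  # symmetric (X_M'X_M)^{-1/2}
    A = X.T @ XM @ G_inv_half                              # the p x |M| matrix A

    H_M = XM @ np.linalg.inv(G) @ XM.T                     # projection onto the span of X_M
    Lambda_M = np.trace(A.T @ A)
    print(np.isclose(Lambda_M, np.trace((H_M @ X).T @ (H_M @ X))))   # trace identity

    Q, D, Wt = np.linalg.svd(A, full_matrices=False)       # A = Q diag(D) W'
    for _ in range(1000):
        t = rng.standard_normal(m)
        assert np.sum((A @ t) ** 2) <= Lambda_M * np.max((Wt @ t) ** 2) + 1e-9
    print("||At||^2 <= Lambda_M * max_j (W't)_j^2 held in all draws")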
Proof of Theorem 3.7.
Recall that $h(\beta_M) = I\bigl\{\|X'(X_M\beta_M - Xb_{\min})\| \ge \varepsilon\bigr\}$, where $b_{\min}$ solves $\min_{b\in\mathbb R^p}\|X'(X_M\beta_M - Xb)\|$ subject to $\|b\|_0 \le |M|-1$. Then
\[
\tag{13}
\mathrm E(h(\beta_M)) = P\bigl(\|X'(X_M\beta_M - Xb_{\min})\| \ge \varepsilon\bigr)
\le P\bigl(\|X'(X_M\beta_M - X\,\mathrm E_y(\hat\beta_{M(-1)}))\| \ge \varepsilon\bigr),
\]
where $\hat\beta_{M(-1)}$ is the least squares estimator corresponding to the subset of covariates $M$ with one covariate removed, so that $\|\mathrm E_y(\hat\beta_{M(-1)})\|_0 \le |M|-1$. To refine the bound on $\mathrm E(h(\beta_M))$, decompose the last expression in (13) using the triangle inequality as follows:
\begin{align}
P\bigl(\|X'(X_M\beta_M - X_{M(-1)}\mathrm E_y(\hat\beta_{M(-1)}))\| \ge \varepsilon\bigr)
&\le P\bigl(\|X'X_M(\beta_M - \hat\beta_M)\| \ge \tfrac{\varepsilon}{3}\bigr) \nonumber\\
&\quad + I\bigl\{\|X'X_M(\hat\beta_M - \mathrm E_y(\hat\beta_M))\| \ge \tfrac{\varepsilon}{3}\bigr\} \nonumber\\
&\quad + I\bigl\{\|X'(H_M - H_{M(-1)})X_{M_o}\beta_{M_o}\| \ge \tfrac{\varepsilon}{3}\bigr\},\tag{14}
\end{align}
where $H_M := X_M(X_M'X_M)^{-1}X_M'$. Observe that by Lemma 3.5 the first term on the right hand side satisfies
\[
P\bigl(\|X'X_M(\beta_M - \hat\beta_M)\| \ge \tfrac{\varepsilon}{3}\bigr)
\le \frac{3|M|\,\hat\sigma_M\sqrt{2\Lambda_M}}{\sqrt{\pi}\,\varepsilon\,\bigl(1-\frac{2}{n-|M|}\bigr)}\,e^{-\frac{\varepsilon^2}{18\hat\sigma_M^2\Lambda_M}}.
\]
Note that the last two terms in (14) are written as indicator functions because the probability statement accounts only for the uncertainty in $\beta_M$. The second term, however, results from the uncertainty in observing $Y$. Accordingly, by Lemma 3.6,
\[
P_y\Bigl(I\bigl\{\|X'X_M(\hat\beta_M - \mathrm E_y(\hat\beta_M))\| \ge \tfrac{\varepsilon}{3}\bigr\} = 0\Bigr)
\ge 1 - \frac{3|M|\,\sigma_M\sqrt{2\Lambda_M}}{\sqrt{\pi}\,\varepsilon}\,e^{-\frac{\varepsilon^2}{18\sigma_M^2\Lambda_M}}.
\]
Recall that $P_y$ is used to denote the probability measure associated with the sampling distribution of $Y$. The third term in (14) must be zero by Condition 3.3. $\square$
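The $L_0$-constrained minimization that defines $h(\beta_M)$ can be made concrete with a brute-force enumeration over supports, which is feasible only for toy dimensions. The sketch below is illustrative; the function name h_indicator, the dimensions, the collinearity construction, and the threshold are our own choices, not the paper's.

    from itertools import combinations
    import numpy as np

    def h_indicator(X, M, beta_M, eps):
        """I{ ||X'(X_M beta_M - X b_min)|| >= eps }, with b_min found by enumerating
        every support of size |M|-1 and solving the corresponding least squares problem."""
        target = X.T @ (X[:, M] @ beta_M)                  # the vector X'X_M beta_M
        best = np.inf
        for S in combinations(range(X.shape[1]), len(M) - 1):
            XS = X.T @ X[:, list(S)]                       # columns of X'X on the support S
            coef, *_ = np.linalg.lstsq(XS, target, rcond=None)
            best = min(best, np.linalg.norm(target - XS @ coef))
        return int(best >= eps)

    rng = np.random.default_rng(1)
    X = rng.standard_normal((40, 8))
    X[:, 1] = X[:, 0] + 0.01 * rng.standard_normal(40)     # two nearly collinear columns
    M = [0, 1, 2]                                          # a redundant candidate subset
    print(h_indicator(X, M, np.array([1.0, 1.0, 2.0]), eps=1.0))   # typically 0 here

In high dimensions this enumeration is of course intractable; the point is only to make the definitions of $b_{\min}$ and $h(\beta_M)$ operational.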
Proof of Theorem 3.8.
Recall that $h(\beta_{M_o}) = I\bigl\{\|X'(X_{M_o}\beta_{M_o} - Xb'_{\min})\| \ge \varepsilon\bigr\}$, where $b'_{\min}$ solves $\min_{b\in\mathbb R^p}\|X'(X_{M_o}\beta_{M_o} - Xb)\|$ subject to $\|b\|_0 \le |M_o|-1$. To show the desired result, let $b_{\min}$ be the solution to $\min_{b\in\mathbb R^p}\|X'(X_{M_o}\mathrm E_y(\hat\beta_{M_o}) - Xb)\|$ subject to $\|b\|_0 \le |M_o|-1$. Then observe that
\[
\|X'(X_{M_o}\mathrm E_y(\hat\beta_{M_o}) - Xb_{\min})\|
\le \|X'(X_{M_o}\mathrm E_y(\hat\beta_{M_o}) - Xb'_{\min})\|
\le \|X'X_{M_o}(\mathrm E_y(\hat\beta_{M_o}) - \beta_{M_o})\| + \|X'(X_{M_o}\beta_{M_o} - Xb'_{\min})\|.
\]
Note the difference between $b_{\min}$ and $b'_{\min}$ here. The rightmost term is the quantity of interest because it will become $\mathrm E(h(\beta_{M_o}))$ in the next few steps. The term on the left of the inequality corresponds to the quantity in Condition 3.1.

Adding and subtracting $\hat\beta_{M_o}$ inside the first term on the right side of the second inequality, and applying the triangle inequality, gives
\[
I\bigl\{\|X'(X_{M_o}\mathrm E_y(\hat\beta_{M_o}) - Xb_{\min})\| \ge \varepsilon\bigr\}
\le P\bigl(\|X'X_{M_o}(\beta_{M_o} - \hat\beta_{M_o})\| \ge \tfrac{\varepsilon}{3}\bigr)
+ I\bigl\{\|X'X_{M_o}(\hat\beta_{M_o} - \mathrm E_y(\hat\beta_{M_o}))\| \ge \tfrac{\varepsilon}{3}\bigr\}
+ \mathrm E(h(\beta_{M_o})),
\]
and by applying Lemma 3.5 (writing $p_{M_o} := |M_o|$),
\[
I\bigl\{\|X'(X_{M_o}\mathrm E_y(\hat\beta_{M_o}) - Xb_{\min})\| \ge \varepsilon\bigr\}
\le \frac{3\,p_{M_o}\,\hat\sigma_{M_o}\sqrt{2\Lambda_{M_o}}}{\sqrt{\pi}\,\varepsilon\,\bigl(1-\frac{2}{n-p_{M_o}}\bigr)}\,e^{-\frac{\varepsilon^2}{18\hat\sigma_{M_o}^2\Lambda_{M_o}}}
+ I\bigl\{\|X'X_{M_o}(\hat\beta_{M_o} - \mathrm E_y(\hat\beta_{M_o}))\| \ge \tfrac{\varepsilon}{3}\bigr\}
+ \mathrm E(h(\beta_{M_o})),
\]
where the middle term is written as an indicator function because the probability statement accounts only for the uncertainty in $\beta_{M_o}$. This indicator is 0, by Lemma 3.6, with probability exceeding (15). Therefore, since Condition 3.1 implies that the indicator on the left side of the inequality is 1,
\[
1 - \frac{3\,p_{M_o}\,\hat\sigma_{M_o}\sqrt{2\Lambda_{M_o}}}{\sqrt{\pi}\,\varepsilon\,\bigl(1-\frac{2}{n-p_{M_o}}\bigr)}\,e^{-\frac{\varepsilon^2}{18\hat\sigma_{M_o}^2\Lambda_{M_o}}} \le \mathrm E(h(\beta_{M_o}))
\]
with probability exceeding
\[
\tag{15}
1 - \frac{3\,p_{M_o}\,\sigma_{M_o}\sqrt{2\Lambda_{M_o}}}{\sqrt{\pi}\,\varepsilon}\,e^{-\frac{\varepsilon^2}{18\sigma_{M_o}^2\Lambda_{M_o}}},
\]
due to the uncertainty in observing $Y$. $\square$
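Two elementary inequalities do much of the work in the bounds above and in Lemma A.1 below: the Mills-ratio bound for the standard normal CDF and the inequality $(1+x/n)^{-n}\le e^{-x/2}$. A quick numerical confirmation (ours; the grids are arbitrary) is as follows.

    import numpy as np
    from scipy.stats import norm

    # Mills-ratio bound: Phi(-x) <= exp(-x^2/2) / (x * sqrt(2*pi)) for x > 0.
    x = np.linspace(0.05, 10, 500)
    mills_ok = np.all(norm.cdf(-x) <= np.exp(-x ** 2 / 2) / (x * np.sqrt(2 * np.pi)))

    # (1 + x/n)^(-n) <= exp(-x/2) for 0 <= x <= n/2.
    n = 200.0
    x2 = np.linspace(0.0, n / 2, 500)
    exp_ok = np.all((1 + x2 / n) ** (-n) <= np.exp(-x2 / 2))

    print(mills_ok, exp_ok)   # expect: True True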
The following lemma is needed for the proof of the main result, Theorem 3.9.

Lemma A.1. Assume all conditions and notation of Theorem 3.9, and without loss of generality suppose $\sigma_{M_o} = 1$, where $\sigma_{M_o}$ is the true but unknown error standard deviation. Then the following holds.

Case 1. This case pertains to subsets $M$ with $|M| \le |M_o|$:
\[
P_y\Biggl(\,\bigcap_{j=1}^{|M_o|}\bigcap_{\mathcal M_j}\Bigl\{Y : \Bigl(\tfrac{\mathrm{RSS}_{M_o}}{\mathrm{RSS}_M}\Bigr)^{\frac{n-j-1}{2}} \le e^{-|M_o|\log(p)}\Bigr\}\Biggr) \ge 1 - V_1,
\]
where $\mathcal M_j := \{M\ne M_o : |M| = j\}$,
\[
V_1 := \max_{\substack{M\ne M_o\\ |M|\le|M_o|}}\Biggl\{\frac{6\,|M_o|\,e^{-\frac{\Delta_M}{72}+|M_o|\log(p)}}{\sqrt{2\pi\Delta_M}}
+ |M_o|\,e^{-\frac{\Delta_M}{12}+\frac{|M_o|}{2}+|M_o|\log(p)}
+ |M_o|\,e^{-\frac{\xi_{n,|M_o|}(n-|M_o|-1)\Delta_M}{48\log(n)^{\gamma}\log(p)}+\frac{n-|M_o|}{2}+|M_o|\log(p)}\Biggr\},
\]
and $\xi_{n,j} := 1 - \frac{4\log(n)^{\gamma}\log(p)}{n-j-1} \to 1$.

Case 2. This case pertains to subsets $M$ with $|M_o|+1 \le |M| \le n^{\alpha}$:
\[
P_y\Biggl(\,\bigcap_{j=|M_o|+1}^{n^{\alpha}}\bigcap_{M:|M|=j}\Bigl\{Y : \Bigl(\tfrac{\mathrm{RSS}_{M_o}}{\mathrm{RSS}_M}\Bigr)^{\frac{n-j-1}{2}} \le e^{4e^{2}(n^{\alpha}+j\log(p))}\Bigr\}\Biggr) \ge 1 - V_2,
\]
where
\[
V_2 := n^{\alpha}\exp\Bigl\{-2b_n\Bigl[n^{\alpha}-\tfrac{|M_o|}{4b_n}+\Bigl(1-\tfrac{1}{2b_n}-\tfrac{1}{4b_n\log(p)}\Bigr)(|M_o|+1)\log(p)\Bigr]\Bigr\}
+ \frac{n^{\alpha}\,e^{-\frac{n-n^{\alpha}-|M_o|}{2}+n^{\alpha}\log(p)}}{\sqrt{\pi(n-n^{\alpha}-|M_o|)}},
\]
and $b_n := \frac{n-n^{\alpha}-|M_o|}{n-|M_o|-1} \to 1$.

Note that $V_1$ and $V_2$ both vanish for large $n$ and $p$ by Condition 3.2. This condition also ensures that $\xi_{n,j}\in(0,1)$, which is needed in the proof of this lemma.
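Before giving the proof, it may help to see the two regimes of the lemma in a small simulation. The sketch below is ours; all sizes, coefficients, and seeds are arbitrary. It draws data from a fixed true model $M_o$ with $\sigma_{M_o}=1$ and compares the ratio $\mathrm{RSS}_{M_o}/\mathrm{RSS}_M$ for an under-fitted subset (Case 1, ratio pulled well below 1 by $\Delta_M$) and an over-fitted subset (Case 2, ratio only slightly above 1).

    import numpy as np

    rng = np.random.default_rng(2)
    n, p = 200, 20
    X = rng.standard_normal((n, p))
    M_o = [0, 1, 2]
    beta_o = np.array([2.0, -1.5, 1.0])

    def rss(X, M, y):
        """Residual sum of squares from least squares on the columns in M."""
        XM = X[:, M]
        resid = y - XM @ np.linalg.lstsq(XM, y, rcond=None)[0]
        return float(resid @ resid)

    under, over = [], []
    for _ in range(500):
        y = X[:, M_o] @ beta_o + rng.standard_normal(n)     # sigma_{M_o} = 1, as in the lemma
        r_o = rss(X, M_o, y)
        under.append(r_o / rss(X, [0, 1], y))                # Case 1: a true covariate dropped
        over.append(r_o / rss(X, M_o + [3, 4], y))           # Case 2: two noise covariates added
    print(round(np.mean(under), 3), round(np.mean(over), 3)) # roughly 0.5 versus just above 1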
Proof of Lemma A.1.
There are two cases to consider.
Case 1
For $M\in\mathcal M_j$ with $j\le|M_o|$, let $\xi_{n,j} = 1 - \frac{4\log(n)^{\gamma}\log(p)}{n-j-1}$, which lies in $(0,1)$ by Condition 3.2 for large $n$ and $p$. Then
\[
\tag{16}
P_y\Bigl(\Bigl(\tfrac{\mathrm{RSS}_{M_o}}{\mathrm{RSS}_M}\Bigr)^{\frac{n-j-1}{2}} > \xi_{n,j}^{\frac{n-j-1}{2}}\Bigr)
= P_y\bigl(U'(I_n-H_{M_o})U/\xi_{n,j} > \Delta_M + 2\sqrt{\Delta_M}\,Z + U'(I_n-H_M)U\bigr),
\]
since by assumption $Y = X_{M_o}\beta_{M_o} + U$ with $U\sim \mathrm N_n(0, I_n)$, and so
\[
\mathrm{RSS}_{M_o} = U'(I_n-H_{M_o})U
\quad\text{and}\quad
\mathrm{RSS}_M = \Delta_M + 2\beta_{M_o}'X_{M_o}'(I_n-H_M)U + U'(I_n-H_M)U = Z_{\Delta_M} + U'(I_n-H_M)U,
\]
where $Z_{\Delta_M} := \Delta_M + 2\sqrt{\Delta_M}\,Z$, $Z\sim\mathrm N(0,1)$, and $\Delta_M := \beta_{M_o}'X_{M_o}'(I_n-H_M)X_{M_o}\beta_{M_o}$. Continuing in (16) by subtracting $\chi^2_{n-|M_o|}$ from both sides of the inequality gives
\begin{align}
P_y\Bigl(\tfrac{\mathrm{RSS}_{M_o}}{\mathrm{RSS}_M} > \xi_{n,j}\Bigr)
&= P_y\bigl(\chi^2_{n-|M_o|}/\xi_{n,j} - \chi^2_{n-|M_o|} > Z_{\Delta_M} + \chi^2_{n-j} - \chi^2_{n-|M_o|}\bigr) \nonumber\\
&= P_y\bigl(\chi^2_{n-|M_o|}(1/\xi_{n,j}-1) > Z_{\Delta_M} + \chi^2_{|M_o|} - \chi^2_{j}\bigr) \nonumber\\
&\le P_y\bigl(\chi^2_{n-|M_o|}(1/\xi_{n,j}-1) > Z_{\Delta_M} - \chi^2_{j}\bigr).\tag{17}
\end{align}
The last inequality follows because the $\chi^2_{|M_o|}$ random variable is non-negative, and removing it simplifies the remaining argument. Then
\begin{align*}
P_y\Bigl(\tfrac{\mathrm{RSS}_{M_o}}{\mathrm{RSS}_M} > \xi_{n,j}\Bigr)
&\le P_y\bigl(\xi_{n,j}\chi^2_{j} + \chi^2_{n-|M_o|}(1-\xi_{n,j}) - 2\xi_{n,j}\sqrt{\Delta_M}\,Z > \xi_{n,j}\Delta_M\bigr)\\
&\le P_y\bigl(|Z| > \sqrt{\Delta_M}/6\bigr) + P_y\bigl(\chi^2_{j} > \Delta_M/3\bigr) + P_y\Bigl(\chi^2_{n-|M_o|} > \tfrac{\xi_{n,j}\Delta_M}{3(1-\xi_{n,j})}\Bigr).
\end{align*}
For the second and third terms, apply the Chernoff bound, and evaluate the moment generating function for the $\chi^2_{j}$ and $\chi^2_{n-|M_o|}$ distributions at $1/4$. Accordingly,
\[
P_y\Bigl(\tfrac{\mathrm{RSS}_{M_o}}{\mathrm{RSS}_M} > \xi_{n,j}\Bigr)
\le P_y\bigl(Z < -\sqrt{\Delta_M}/6\bigr) + e^{-\frac{\Delta_M}{12}+\frac{j}{2}} + e^{-\frac{\xi_{n,j}\Delta_M}{12(1-\xi_{n,j})}+\frac{n-|M_o|}{2}}.
\]
Finally, the remaining probability can be controlled by the bound for the CDF of a standard normal random variable: for $x>0$, $\Phi(-x)\le\frac{1}{\sqrt{2\pi}}\frac{e^{-x^2/2}}{x}$. Hence,
\[
P_y\Bigl(\tfrac{\mathrm{RSS}_{M_o}}{\mathrm{RSS}_M} > \xi_{n,j}\Bigr)
\le \frac{6\,e^{-\frac{\Delta_M}{72}}}{\sqrt{2\pi\Delta_M}} + e^{-\frac{\Delta_M}{12}+\frac{|M_o|}{2}} + e^{-\frac{\xi_{n,|M_o|}(n-|M_o|-1)\Delta_M}{48\log(n)^{\gamma}\log(p)}+\frac{n-|M_o|}{2}},
\]
where the last inequality follows by observing that $j\le|M_o|$ and recalling the expression for $\xi_{n,j}$.

Therefore, the probability that $\bigl(\frac{\mathrm{RSS}_{M_o}}{\mathrm{RSS}_M}\bigr)^{\frac{n-j-1}{2}} > \xi_{n,j}^{\frac{n-j-1}{2}}$ is satisfied for some $M$ with $|M|\le|M_o|$ is
\[
P_y\Biggl(\,\bigcup_{j=1}^{|M_o|}\bigcup_{\mathcal M_j}\Bigl\{\Bigl(\tfrac{\mathrm{RSS}_{M_o}}{\mathrm{RSS}_M}\Bigr)^{\frac{n-j-1}{2}} > \xi_{n,j}^{\frac{n-j-1}{2}}\Bigr\}\Biggr)
\le \sum_{j=1}^{|M_o|}\binom{p}{j}\max_{\mathcal M_j} P_y\Bigl(\Bigl(\tfrac{\mathrm{RSS}_{M_o}}{\mathrm{RSS}_M}\Bigr)^{\frac{n-j-1}{2}} > \xi_{n,j}^{\frac{n-j-1}{2}}\Bigr).
\]
Note that
\[
\tag{18}
\binom{p}{j} = \frac{p(p-1)\cdots(p-j+1)}{j!} = \frac{p^{j}\bigl(1-\frac{1}{p}\bigr)\cdots\bigl(1-\frac{j-1}{p}\bigr)}{j!} \le p^{j}.
\]
In fact, Luo and Chen (2013) show that if $\log(j)/\log(p)\to\delta$ as $p\to\infty$, for some $\delta>0$, then $\log\binom{p}{j} = j\log(p)(1-\delta)(1+o(1))$. Thus,
\[
P_y\Biggl(\,\bigcup_{j=1}^{|M_o|}\bigcup_{\mathcal M_j}\Bigl\{\Bigl(\tfrac{\mathrm{RSS}_{M_o}}{\mathrm{RSS}_M}\Bigr)^{\frac{n-j-1}{2}} > \xi_{n,j}^{\frac{n-j-1}{2}}\Bigr\}\Biggr)
\le \sum_{j=1}^{|M_o|}\max_{\mathcal M_j}\Biggl\{\frac{6\,e^{-\frac{\Delta_M}{72}+j\log(p)}}{\sqrt{2\pi\Delta_M}}
+ e^{-\frac{\Delta_M}{12}+\frac{|M_o|}{2}+j\log(p)}
+ e^{-\frac{\xi_{n,|M_o|}(n-|M_o|-1)\Delta_M}{48\log(n)^{\gamma}\log(p)}+\frac{n-|M_o|}{2}+j\log(p)}\Biggr\}.
\]
Since $\xi_{n,|M_o|}\to 1$, this bound vanishes by Condition 3.2. Therefore, with probability exceeding one minus the above bound,
\[
\Bigl(\frac{\mathrm{RSS}_{M_o}}{\mathrm{RSS}_M}\Bigr)^{\frac{n-j-1}{2}}
\le \Bigl(1 - \frac{4\log(n)^{\gamma}\log(p)}{n-j-1}\Bigr)^{\frac{n-j-1}{2}}
\le e^{-2\log(n)^{\gamma}\log(p)}
\]
uniformly over all $M$ such that $|M|\le|M_o|$.
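The bound (18) on the number of candidate subsets of each size is elementary but worth a sanity check with exact integer arithmetic; the values of $p$ and $j$ below are arbitrary choices of ours.

    from math import comb

    p = 5000
    print(all(comb(p, j) <= p ** j for j in range(1, 60)))   # expect: True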
Case 2
Consider any subset $M$ with $|M_o| < |M| \le n^{\alpha}$ for some positive constant $\alpha < 1$, and let $\{a_n\}$ be an arbitrary sequence of numbers. To begin, repeating the steps in (17), but subtracting $\chi^2_{n-j}/\xi_{n,j}$ on both sides instead of $\chi^2_{n-|M_o|}$, and replacing the label $\xi_{n,j}$ with $a_n$, yields
\[
P_y\Bigl(\Bigl(\tfrac{\mathrm{RSS}_{M_o}}{\mathrm{RSS}_M}\Bigr)^{\frac{n-j-1}{2}} > a_n^{\frac{n-j-1}{2}}\Bigr)
\le P_y\bigl(\chi^2_{j} > a_n Z_{\Delta_M} + \chi^2_{n-j}(a_n - 1)\bigr),
\]
where $Z_{\Delta_M} = \Delta_M + 2\sqrt{\Delta_M}\,Z$, $Z\sim\mathrm N(0,1)$, and $\Delta_M = \beta_{M_o}'X_{M_o}'(I_n-H_M)X_{M_o}\beta_{M_o}$. Since $M_o\subset M$ implies $\Delta_M = 0$, the above bound can be simplified by including in the subset $M$ any covariates in $M_o$ not already included in $M$. Accordingly, let $M' := M\cup M_o$, which includes $j+l$ covariates, where $l\in\{0,\ldots,|M_o|\}$ is the number of covariates not shared by $M$ and $M_o$. Then, because $\mathrm{RSS}_{M'} \le \mathrm{RSS}_M$,
\[
P_y\Bigl(\Bigl(\tfrac{\mathrm{RSS}_{M_o}}{\mathrm{RSS}_M}\Bigr)^{\frac{n-j-1}{2}} > a_n^{\frac{n-j-1}{2}}\Bigr)
\le P_y\Bigl(\Bigl(\tfrac{\mathrm{RSS}_{M_o}}{\mathrm{RSS}_{M'}}\Bigr)^{\frac{n-j-1}{2}} > a_n^{\frac{n-j-1}{2}}\Bigr)
\le P_y\bigl(\chi^2_{j+l} > \chi^2_{n-j-l}(a_n-1)\bigr),
\]
and for any nonnegative $s\in\mathbb R$,
\begin{align}
P_y\Bigl(\tfrac{\mathrm{RSS}_{M_o}}{\mathrm{RSS}_M} > a_n\Bigr)
&\le P_y\bigl(\{\chi^2_{j+l} > s(a_n-1)\}\cap\{\chi^2_{n-j-l}\ge s\}\bigr)
+ P_y\bigl(\{\chi^2_{j+l} > \chi^2_{n-j-l}(a_n-1)\}\cap\{\chi^2_{n-j-l}< s\}\bigr) \nonumber\\
&\le P_y\bigl(\chi^2_{j+l} > s(a_n-1)\bigr) + P_y\bigl(\chi^2_{n-j-l} < s\bigr).\tag{19}
\end{align}
Consider each of these last two terms in turn. For the first term, apply the Chernoff bound, and evaluate the moment generating function for the $\chi^2_{j+l}$ distribution at $1/4$. That gives
\[
P_y\bigl(\chi^2_{j+l} > s(a_n-1)\bigr) \le 2^{\frac{j+l}{2}}\,e^{-\frac{s(a_n-1)}{4}} \le e^{-\frac{s(a_n-1)}{4}+\frac{j+l}{2}}.
\]
For the second term in (19), write out the expression to evaluate the probability explicitly, and then apply the simple bound $e^{-x}\le 1$ for $x\ge 0$. Noting that $s>0$,
\[
P_y\bigl(\chi^2_{n-j-l} < s\bigr)
\le \frac{s^{\frac{n-j-l}{2}}}{2^{\frac{n-j-l}{2}}\,\Gamma\bigl(\frac{n-j-l}{2}\bigr)\,\frac{n-j-l}{2}}
\le \frac{1}{\sqrt{\pi(n-j-l)}}\Bigl(\frac{e\cdot s}{n-j-l}\Bigr)^{\frac{n-j-l}{2}},
\]
where the last inequality follows from the well-known Stirling lower bound on the gamma function,
\[
\Gamma(x) \ge \sqrt{2\pi}\,x^{x-\frac12}e^{-x}\quad\text{for } x>0.
\]
It is clear from the last expression that for the probability to vanish, $e\cdot s$ must grow more slowly than $n-j-l$. Accordingly, choosing $s = \frac{n-j-l}{e^{2}}$ gives
\[
P_y\Bigl(\chi^2_{n-j-l} < \frac{n-j-l}{e^{2}}\Bigr) \le \frac{e^{-\frac{n-j-l}{2}}}{\sqrt{\pi(n-j-l)}}.
\]
Combining the two bounds for (19) now yields
\[
P_y\Bigl(\tfrac{\mathrm{RSS}_{M_o}}{\mathrm{RSS}_M} > a_n\Bigr)
\le e^{-\frac{(n-j-l)(a_n-1)}{4e^{2}}+\frac{j+l}{2}} + \frac{e^{-\frac{n-j-l}{2}}}{\sqrt{\pi(n-j-l)}}.
\]
It only remains to choose the smallest $a_n$ such that the first term in the bound vanishes exponentially fast, so that the cumulative probability will vanish over all subsets $M$ with $|M|\le n^{\alpha}$. Accordingly, it should become apparent shortly that a good choice is
\[
\tag{20}
a_n = 1 + \frac{8e^{2}(n^{\alpha}+j\log(p))}{n-j-1}.
\]
The probability that $\bigl(\frac{\mathrm{RSS}_{M_o}}{\mathrm{RSS}_M}\bigr)^{\frac{n-j-1}{2}} > a_n^{\frac{n-j-1}{2}}$ is satisfied for some $M$ with $|M_o| < |M| \le n^{\alpha}$ is
\[
P_y\Biggl(\,\bigcup_{j=|M_o|+1}^{n^{\alpha}}\bigcup_{M:|M|=j}\Bigl\{\Bigl(\tfrac{\mathrm{RSS}_{M_o}}{\mathrm{RSS}_M}\Bigr)^{\frac{n-j-1}{2}} > a_n^{\frac{n-j-1}{2}}\Bigr\}\Biggr)
\le \sum_{j=|M_o|+1}^{n^{\alpha}}\binom{p}{j}\max_{M:|M|=j} P_y\Bigl(\Bigl(\tfrac{\mathrm{RSS}_{M_o}}{\mathrm{RSS}_M}\Bigr)^{\frac{n-j-1}{2}} > a_n^{\frac{n-j-1}{2}}\Bigr).
\]
Thus, bounding the binomial coefficient as in (18), and substituting (20) for $a_n$, yields
\begin{align*}
P_y\Biggl(\,\bigcup_{j=|M_o|+1}^{n^{\alpha}}\bigcup_{M:|M|=j}\Bigl\{\Bigl(\tfrac{\mathrm{RSS}_{M_o}}{\mathrm{RSS}_M}\Bigr)^{\frac{n-j-1}{2}} > a_n^{\frac{n-j-1}{2}}\Bigr\}\Biggr)
&\le \sum_{j=|M_o|+1}^{n^{\alpha}}\Biggl[\exp\Bigl\{-2b_n\Bigl[n^{\alpha}-\tfrac{l}{4b_n}+\Bigl(1-\tfrac{1}{2b_n}-\tfrac{1}{4b_n\log(p)}\Bigr)j\log(p)\Bigr]\Bigr\}
+ \frac{e^{-\frac{n-j-|M_o|}{2}+j\log(p)}}{\sqrt{\pi(n-j-|M_o|)}}\Biggr]\\
&\le n^{\alpha}\exp\Bigl\{-2b_n\Bigl[n^{\alpha}-\tfrac{|M_o|}{4b_n}+\Bigl(1-\tfrac{1}{2b_n}-\tfrac{1}{4b_n\log(p)}\Bigr)(|M_o|+1)\log(p)\Bigr]\Bigr\}
+ \frac{n^{\alpha}\,e^{-\frac{n-n^{\alpha}-|M_o|}{2}+n^{\alpha}\log(p)}}{\sqrt{\pi(n-n^{\alpha}-|M_o|)}},
\end{align*}
where $b_n := \frac{n-n^{\alpha}-|M_o|}{n-|M_o|-1} \to 1$. Note that this bound vanishes by Condition 3.2. Therefore, with probability exceeding one minus the above bound,
\[
\Bigl(\frac{\mathrm{RSS}_{M_o}}{\mathrm{RSS}_M}\Bigr)^{\frac{n-j-1}{2}}
\le \Bigl(1 + \frac{8e^{2}(n^{\alpha}+j\log(p))}{n-j-1}\Bigr)^{\frac{n-j-1}{2}}
\le e^{4e^{2}(n^{\alpha}+j\log(p))},
\]
uniformly over all $M$ such that $|M_o| < |M| \le n^{\alpha}$. $\square$
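The chi-square tail bounds used repeatedly in this proof are straightforward to confirm numerically. The sketch below is ours; the degrees of freedom and grids are arbitrary, and the constant $(n-j-l)/e^{2}$ in the lower-tail check reflects the choice of $s$ made above.

    import numpy as np
    from scipy.stats import chi2

    # Chernoff bound: P(chi2_k > t) <= 2^(k/2) * exp(-t/4), checked on a log scale.
    ok_upper = True
    for k in [5, 20, 100, 400]:
        t = np.linspace(0.5, 10 * k, 200)
        ok_upper &= bool(np.all(chi2.logsf(t, df=k) <= (k / 2) * np.log(2) - t / 4 + 1e-10))

    # Lower-tail bound: P(chi2_k < k/e^2) <= exp(-k/2) / sqrt(pi * k).
    ok_lower = all(
        chi2.cdf(k / np.e ** 2, df=k) <= np.exp(-k / 2) / np.sqrt(np.pi * k)
        for k in [5, 20, 100, 400]
    )
    print(ok_upper, ok_lower)   # expect: True True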
Proof of Theorem 3.9.
Without loss of generality suppose $\sigma_{M_o} = 1$, where $\sigma_{M_o}$ is the true but unknown error standard deviation. For $j\in\{1,\ldots,p\}$ define the classes of subsets $\mathcal M_j := \{M\ne M_o : |M|=j\}$, and recall that $\hat\sigma_M^2 := \mathrm{RSS}_M/(n-|M|)$. It will first be shown that for any subset of covariates $M\ne M_o$ the ratio $\frac{r(M\mid Y)}{r(M_o\mid Y)}$ vanishes in probability for large $n$ and $p$. Accordingly, for any $M\in\mathcal M_j$,
\begin{align*}
\frac{r(M\mid Y)}{r(M_o\mid Y)}
&= \pi^{\frac{j-|M_o|}{2}}\,\frac{\Gamma\bigl(\frac{n-j}{2}\bigr)}{\Gamma\bigl(\frac{n-|M_o|}{2}\bigr)}\,
\frac{\mathrm{RSS}_{M_o}^{\frac{n-|M_o|-1}{2}}}{\mathrm{RSS}_{M}^{\frac{n-j-1}{2}}}\,
\frac{\mathrm E(h(\beta_M))}{\mathrm E(h(\beta_{M_o}))}\\
&= \pi^{\frac{j-|M_o|}{2}}\,\frac{\Gamma\bigl(\frac{n-j}{2}\bigr)}{\Gamma\bigl(\frac{n-|M_o|}{2}\bigr)}\,
\Bigl(\frac{\mathrm{RSS}_{M_o}}{\mathrm{RSS}_M}\Bigr)^{\frac{n-j-1}{2}}\,
\mathrm{RSS}_{M_o}^{\frac{j-|M_o|}{2}}\,
\frac{\mathrm E(h(\beta_M))}{\mathrm E(h(\beta_{M_o}))}.
\end{align*}
Before proceeding with the rest of the proof, the following notation is needed. $V_1$ and $V_2$ are as stated in Lemma A.1. As in Theorem 3.8, define
\[
V_3 := \frac{3\,p_{M_o}\,\sigma_{M_o}\sqrt{2\Lambda_{M_o}}}{\sqrt{\pi}\,\varepsilon}\,e^{-\frac{\varepsilon^2}{18\sigma_{M_o}^2\Lambda_{M_o}}},
\]
and, corresponding to Theorem 3.7, define
\[
V_4 := \sum_{j=1}^{n^{\alpha}}\sum_{M\in\mathcal M_j}\frac{3|M|\,\sigma_M\sqrt{2\Lambda_M}}{\sqrt{\pi}\,\varepsilon}\,e^{-\frac{\varepsilon^2}{18\sigma_M^2\Lambda_M}}
\le \sum_{j=1}^{n^{\alpha}}\max_{\mathcal M_j}\frac{3j\,\sigma_M\sqrt{2\Lambda_M}}{\sqrt{\pi}\,\varepsilon}\,e^{-\frac{\varepsilon^2}{18\sigma_M^2\Lambda_M}+j\log(p)},
\]
by bounding the binomial coefficient as in (18). Note that $V_4$ then vanishes by Condition 3.4. Further, recall that $\mathrm{RSS}_{M_o}\sim\chi^2_{n-|M_o|}$, so by the Chernoff bound (evaluating the moment generating function at $1/4$),
\[
P_y\bigl(\chi^2_{n-|M_o|} > 3(n-|M_o|)\bigr) \le e^{-\frac{n-|M_o|}{4}} =: V_5.
\]
With this notation it is now possible to account for all of the uncertainty due to $Y$. Accordingly, by Theorem 3.8, with probability exceeding $1 - V_3 - V_5$,
\[
\tag{21}
\frac{r(M\mid Y)}{r(M_o\mid Y)}
\le \frac{\Gamma\bigl(\frac{n-j}{2}\bigr)}{\Gamma\bigl(\frac{n-|M_o|}{2}\bigr)}\,
\Bigl(\frac{\mathrm{RSS}_{M_o}}{\mathrm{RSS}_M}\Bigr)^{\frac{n-j-1}{2}}\,
\bigl(3\pi(n-|M_o|)\bigr)^{\frac{j-|M_o|}{2}}\,
\frac{\mathrm E(h(\beta_M))}{1 - g(M_o,n,p)},
\]
where
\[
g(M,n,p) := \frac{3j\,\hat\sigma_M\sqrt{2\Lambda_M}}{\sqrt{\pi}\,\varepsilon\,\bigl(1-\frac{2}{n-j}\bigr)}\,e^{-\frac{\varepsilon^2}{18\hat\sigma_M^2\Lambda_M}}.
\]
Further, fix $A\in(0,1)$, and take $n$ and $p$ sufficiently large so that $g(M_o,n,p) < A$.

Suppose $j\le|M_o|$. As in Jameson (2013), the ratio of gamma functions can be bounded by
\[
\frac{\Gamma\bigl(\frac{n-j}{2}\bigr)}{\Gamma\bigl(\frac{n-|M_o|}{2}\bigr)}
= \frac{\Gamma\bigl(\frac{n-|M_o|}{2}+\frac{|M_o|-j}{2}\bigr)}{\Gamma\bigl(\frac{n-|M_o|}{2}\bigr)}
\le \frac{n-|M_o|}{2}\Bigl(\frac{n-j}{2}\Bigr)^{\frac{|M_o|-j}{2}-1}.
\]
Applying Lemma A.1 to bound the ratio of residual sums of squares, and bounding the expectation by 1, with probability exceeding $1 - V_1 - V_3 - V_5$, (21) implies
\[
\tag{22}
\frac{r(M\mid Y)}{r(M_o\mid Y)}
\le \frac{n-|M_o|}{2}\Bigl(\frac{n-j}{2}\Bigr)^{\frac{|M_o|-j}{2}-1}
\frac{\bigl(3\pi(n-|M_o|)\bigr)^{\frac{j-|M_o|}{2}}}{1-A}\,e^{-2\log(n)^{\gamma}\log(p)}
\le \frac{e^{-2\log(n)^{\gamma}\log(p)}}{1-A},
\]
where the last inequality follows for all $n$ sufficiently large relative to $|M_o|$. Since $n\gg|M_o|$, assume without loss of generality that $n$ is sufficiently large.

Conversely, suppose $|M_o| < j \le n^{\alpha}$. In this setting, as in Jameson (2013), the ratio of gamma functions can be bounded by
\[
\Biggl[\frac{\Gamma\bigl(\frac{n-j}{2}\bigr)}{\Gamma\bigl(\frac{n-|M_o|}{2}\bigr)}\Biggr]^{-1}
= \frac{\Gamma\bigl(\frac{n-|M_o|}{2}\bigr)}{\Gamma\bigl(\frac{n-j}{2}\bigr)}
= \frac{\Gamma\bigl(\frac{n-j}{2}+\frac{j-|M_o|}{2}\bigr)}{\Gamma\bigl(\frac{n-j}{2}\bigr)}
\ge \Bigl(\frac{n-j-1}{2}\Bigr)^{\frac{j-|M_o|}{2}}.
\]
Applying Theorem 3.7 to bound the expectation, and applying Lemma A.1 to bound the ratio of residual sums of squares, with probability exceeding $1 - V_2 - V_3 - V_4 - V_5$, (21) implies
\begin{align}
\frac{r(M\mid Y)}{r(M_o\mid Y)}
&\le e^{4e^{2}(n^{\alpha}+j\log(p))}\Bigl(\frac{n-j-1}{2}\Bigr)^{-\frac{j-|M_o|}{2}}
\bigl(3\pi(n-|M_o|)\bigr)^{\frac{j-|M_o|}{2}}\,\frac{g(M,n,p)}{1-A} \nonumber\\
&\le \frac{3j\,\hat\sigma_M\sqrt{2\Lambda_M}\;
e^{-\frac{\varepsilon^2}{18\hat\sigma_M^2\Lambda_M}+4e^{2}(n^{\alpha}+j\log(p))+\frac{j-|M_o|}{2}\log\bigl(\frac{6\pi(n-|M_o|)}{n-n^{\alpha}-2}\bigr)}}
{\sqrt{\pi}\,\varepsilon\,\bigl(1-\frac{2}{n-j}\bigr)(1-A)}.\tag{23}
\end{align}
Notice that Theorem 3.7 is being applied here with $n^{\alpha}$ in place of $c\,n$. This can be done without loss of generality because $n^{\alpha}$ grows slower than $c\,n$ for any choices of $\alpha, c\in(0,1)$. It remains to show that $\sum_{j=1}^{n^{\alpha}}\sum_{M\in\mathcal M_j}\frac{r(M\mid Y)}{r(M_o\mid Y)}$ vanishes in probability for large $n$ and $p$. Apply the bounds in (22) and (23) in the following argument:
\[
\sum_{j=1}^{n^{\alpha}}\sum_{M\in\mathcal M_j}\frac{r(M\mid Y)}{r(M_o\mid Y)}
\le \underbrace{\sum_{j=1}^{|M_o|}\binom{p}{j}\max_{M\in\mathcal M_j}\frac{r(M\mid Y)}{r(M_o\mid Y)}}_{=:\,S_1}
+ \underbrace{\sum_{j=|M_o|+1}^{n^{\alpha}}\binom{p}{j}\max_{|M|=j}\frac{r(M\mid Y)}{r(M_o\mid Y)}}_{=:\,S_2}
\]