Non-penalized variable selection in high-dimensional linear model settings via generalized fiducial inference
SSubmitted to the Annals of Statistics arXiv:
NON-PENALIZED VARIABLE SELECTION INHIGH-DIMENSIONAL LINEAR MODEL SETTINGS VIAGENERALIZED FIDUCIAL INFERENCE
By Jonathan P Williams and Jan Hannig
University of North Carolina at Chapel Hill
Standard penalized methods of variable selection and parameterestimation rely on the magnitude of coefficient estimates to decidewhich variables to include in the final model. However, coefficientestimates are unreliable when the design matrix is collinear. To over-come this challenge an entirely new perspective on variable selectionis presented within a generalized fiducial inference framework. Thisnew procedure is able to effectively account for linear dependenciesamong subsets of covariates in a high-dimensional setting where p can grow almost exponentially in n , as well as in the classical settingwhere p ≤ n . It is shown that the procedure very naturally assignssmall probabilities to subsets of covariates which include redundan-cies by way of explicit L minimization. Furthermore, with a typicalsparsity assumption, it is shown that the proposed method is con-sistent in the sense that the probability of the true sparse subset ofcovariates converges in probability to 1 as n → ∞ , or as n → ∞ and p → ∞ . Very reasonable conditions are needed, and little restrictionis placed on the class of possible subsets of covariates to achieve thisconsistency result.
1. Introduction.
A strategy for developing variable selection proce-dures with desirable consistency properties entails exploiting some distin-guishing property of the theoretical true data generating model. For exam-ple, standard penalized methods of variable selection within a linear modelframework such as LASSO of Tibshirani (1996), SCAD of Fan and Li (2001),and the Dantzig Selector of Candes and Tao (2007) rely on the magnitudeof the coefficients in the true data generating model being relatively largerthan those of the other coefficients. Johnson and Rossell (2012) use thisproperty to construct nonlocal prior densities over all subsets of covariates.The defining property of their nonlocal density is that it takes the value ofzero for subsets containing a covariate with a zero-valued coefficient.We propose a more desirable way for eliminating redundancies from thesample space of candidate subsets which does not explicitly rely on coefficient
MSC 2010 subject classifications:
Keywords and phrases: variable selection, best subset selection, high dimensional re-gression, L minimization, generalized fiducial inference a r X i v : . [ s t a t . M E ] F e b J. WILLIAMS AND J. HANNIG magnitudes. That is, any candidate true model should be non-redundant inthe sense that it contains the minimal amount of information necessary forexplaining and/or predicting the observed data. One such criterion to ex-ploit this non-redundancy property is that the only subsets with nonzeroposterior probability should be those which cannot be predicted to somechosen precision by a subset of fewer covariates. Such a criterion requiresconstructing a probability distribution on the space of candidate models,which is consistent with a Bayesian or fiducial variable selection paradigm.The literature on high-dimensional linear models is vast, but we hope to con-tribute to it by using this setting to build a foundation for a fresh perspectiveon variable selection.Recent work in the Bayesian high-dimensional linear model setting in-cludes Roˇckov´a and George (2016) who develop methods for separable andnon-separable spike-and-slab penalized estimation, the credible set approachof Bondell and Reich (2012), and Narisetty and He (2014) who propose amethod based on shrinking and diffusing coefficient priors in which the vari-ance of the priors are sample size dependent. Lai, Hannig and Lee (2015)layout framework for penalized estimation within a GFI approach.Ghosh and Ghattas (2015) provide insights into complications in Bayesianvariable selection. Namely, the size of the sample space (2 p ) is often toolarge to compute all model probabilities, and even typically larger than canreasonably be sampled by Markov chain Monte Carol (MCMC) methods.Thus, the nonlocal prior approach of Johnson and Rossell (2012) can achieveasymptotic consistency (where other approaches can only achieve pairwiseconsistency) because it is able to effectively eliminate a large enough portionof the 2 p subsets from the sample space. To illustrate this point, considerthe following simple example. Let (1) Y ∼ N n (cid:16) β · x (1) + · · · + β p · x ( p ) , σ I n (cid:17) , where β j ∈ R and x ( j ) ∈ R n for j ∈ { , . . . , p } , and σ >
0. Further, supposethat the true but unknown values of ( β , β , β , . . . , β p ) (cid:48) are ( b , b , , . . . , (cid:48) .Within the nonlocal prior framework, the only subsets with non-negligibleposterior probability are contained in the set (cid:8) { x (1) } , { x (2) } , { x (1) , x (2) } (cid:9) .When viewed as a prior density on the coefficients, nonlocal priors assignzero prior density to the true parameter value when the true parameter valueis zero. From a Bayesian perspective this is philosophically problematic, butvery insightful for consistency of model selection. The insight lends itself tothe question: What other properties might the true model have which canbe exploited to develop a statistical procedure with the ability to effectivelyeliminate subsets from the sample space? FI VARIABLE SELECTION In addressing this question, we build our proposed methods from theidea that any candidate true model, as determined by the actual non-zeroparameter values, should be non-redundant in the sense that it contains theminimal amount of information necessary for explaining and/or predictingthe observed data. We denote such subsets of the parameter space as ε - admissible , and define them precisely in Definition 2.1. Then, using theabove nonlocal prior example, the entire model space { x (1) , x (2) , x (3) } forinstance, is not ε - admissible because it can be perfectly predicted by thesmaller subset { x (1) , x (2) } .To further illustrate the intuition behind our proposal, consider an ex-ample where x (2) is highly collinear with all of x (3) , . . . , x ( p ) but is not correlated with x (1) , and that the true values of ( β , β , β , . . . , β p ) (cid:48) are( b , b , b , . . . , b p ) (cid:48) with b j (cid:54) = 0 for all j ∈ { , . . . , p } . In this case, assum-ing strong enough collinearity, ∃ c ∈ R with c · x (2) ≈ b · x (2) + · · · + b p · x ( p ) ,i.e., (cid:13)(cid:13)(cid:0) b · x (1) + · · · + b p · x ( p ) (cid:1) − (cid:0) b · x (1) + c · x (2) (cid:1)(cid:13)(cid:13) < ε where ε > { x (1) , . . . , x ( p ) } is not ε - admissible , but would be assignednonzero posterior probability in the nonlocal prior framework.We construct a posterior-like probability distribution over all subsets,which assigns negligible probability to elements that are not ε - admissible .In constructing the posterior-like probability distribution we adopt a gener-alized fiducial inference (GFI) approach because it has similar to an objectiveBayes interpretation with data driven priors, gives a systematic method ofconstructing a distribution function given a data generating equation suchas a linear model, and it does not suffer from the issue of arbitrary normal-izing constants which arise in many objective Bayesian priors (Berger et al.,2001). In this manuscript we will provide a gentle introduction to GFI. Afuller account of GFI is given in the recent review paper Hannig et al. (2016).An advantage of both our approach and the nonlocal prior approach ofJohnson and Rossell (2012) is that in addition to providing theoretical guar-antees, our statistical procedures yield estimates of the posterior distribu-tion over subsets of covariates. This is in contrast to frequentist penaliza-tion based methods or Bayesian procedures fully dedicated to maximuma posteriori probability (MAP) estimation. Such methods do not yield theposterior probability of a chosen model (i.e., the relative probability, giventhe observed data, of a given model against competing models). 
Further-more, Ghosh and Ghattas (2015) argue that joint summaries of subsets ofcovariates are more robust to collinearity.With our approach to constructing a posterior-like distribution whoseprobability mass function value is negligible for subsets of the parameter J. WILLIAMS AND J. HANNIG space which are not ε - admissible , we are able to show that the probabilityof the true data generating model converges to 1 asymptotically in n and p .This consistency is shown to be true even with p growing almost exponen-tially in n . The reason being that the true model yields a stronger signalsince it no longer has to compete within an overly redundant sample space.The paper is organized as follows. Section 2 serves to introduce the gen-eral methodology and computational algorithm for carrying out our variableselection procedure based on a recent algorithm for explicit L minimization(Bertsimas, King and Mazumder, 2016), which is fast enough to be used onreal data. The conditions needed for the main results, and the main resultsare presented and discussed in Section 3. Proofs are organized in the ap-pendix. We demonstrate the empirical performance of our procedure andcompare it to other Bayesian and frequentist methods in simulation setupson synthetic data in Section 4. Computer code implementing our procedureis provided at https://jonathanpw.github.io/software.html .
2. Methodology.
As described in the previous section, our idea behindexploiting a non-redundancy property of the true data generating modelrelies on constructing a probability distribution concentrated on what wedenote as ε - admissible subsets. This object is defined precisely in Definition2.1, but first an aside on the notation used throughout the paper.Let Y be an n -dimensional random vector, X an n × p matrix with columnsscaled to have unit norm, and β a fixed p -dimensional vector with nonzero(or active) components indexed by the subset M o ⊂ { , . . . , p } , with (2) Y ∼ N n (cid:16) X M o β M o , ( σ M o ) I n (cid:17) . The design matrix denoted by X M o is defined as the matrix composed ofonly those columns of X corresponding to the index set M o . The subscript ‘ o ’refers to the interpretation of M o as corresponding to the ‘oracle’ subset ofcovariates. Moreover, β M o denotes the true values of the oracle coefficients,while β M o is understood as a random vector whose uncertainty resides in notknowing the true coefficients β M o . For any subset M the vector β M refersto the projection of the column space of X M on the true coefficients β M o ,that is, β M = ( X (cid:48) M X M ) − X (cid:48) M X M o β M o = E y ( (cid:98) β M ). Lastly, σ M o > σ M o is a random variablewhose distribution expresses the uncertainty from not knowing σ M o , underthe oracle model.The objective is to construct a statistical procedure which can be shown,asymptotically and demonstrated empirically, to be able to identify M o asthe index set of the oracle model within the sample space of all 2 p candidate FI VARIABLE SELECTION subsets M ⊂ { , . . . , p } . For each index set, M , in the sample space theconditional sampling distribution of the data is assumed as (3) Y | β M , σ M ∼ N n (cid:16) X M β M , σ M I n (cid:17) . The centerpiece of our methodology is then the following definition. Thefunction | · | denotes the absolute value function if its argument is scalar-valued, and denotes the cardinality function if its argument is set-valued.The norms (cid:107) · (cid:107) and (cid:107) · (cid:107) refer, respectfully, to the usual L and L normsdefined on finite-dimensional Euclidean spaces. Definition . Assume fixed ε > . A given β M coupled within someindex subset M ⊂ { , . . . , p } is said to be ε - admissible if and only if h ( β M ) =1 , where (4) h ( β M ) := I (cid:110) (cid:107) X (cid:48) ( X M β M − Xb min ) (cid:107) ≥ ε (cid:111) , and b min solves min b ∈ R p (cid:107) X (cid:48) ( X M β M − Xb ) (cid:107) subject to (cid:107) b (cid:107) ≤ | M | − . Observe that this definition is consistent with the heuristic descriptionof ε - admissible subsets given in the previous section. In particular, if thesubset of covariates indexed by M is linearly dependent or if one of thecomponents of β M is zero, then h ( β M ) = 0. The subtlety in this definitionis assuming an appropriately chosen ε which is able to strike an optimalbalance for distinguishing signal from noise. Intuitively, ε = ε ( n, p, M ), i.e.,is a function of the amount of information available given by n , the difficultyof the problem represented by p , and information about a given M beingconsidered such as | M | . For instance, if | M | > n then h ( β M ) = 0 because X M cannot have full rank. In this case any ε > ε matters a lot if | M | ≤ n . The choice of ε is a major focus of Section3 where the main results of the paper are presented, and from where wesuggest the following default choice: (5) ε = Λ M (cid:98) σ M (cid:16) n . 
| M | log( pπ ) . − p o (cid:17) + , where Λ M := tr (cid:0) ( H M X ) (cid:48) H M X (cid:1) and (cid:98) σ M := RSS M / ( n − | M | ) with RSS M := y (cid:48) ( I n − H M ) y and H M := X M ( X (cid:48) M X M ) − X (cid:48) M , and the vector y an obser-vation from the true model (2). The parameter p o represents prior beliefabout | M o | , the number of covariates in the true model M o . In practice, avalue of p o can be directly specified or selected by cross-validation. A built-incross-validation procedure is included in the accompanying software to thispaper. Details are provided with the simulation study in Section 4. J. WILLIAMS AND J. HANNIG
Within the h function in Definition 2.1 the quantity (cid:107) X (cid:48) ( X M β M − Xb min ) (cid:107) represents the difference in prediction for a subset M against allsubsets with fewer covariates. This measure of distance has been adaptedfrom Candes and Tao (2007), but they deal with the error (cid:107) X (cid:48) ( y − Xb ) (cid:107) ∞ over b ∈ R p . This is very different from using X M β M in place of y becausethe former results in a noiseless measure of distance. To illustrate, observethat E y (cid:0) (cid:107) X (cid:48) ( Y − Xb ) (cid:107) (cid:1) = (cid:107) X (cid:48) ( X M o β M o − Xb ) (cid:107) + σ M o · p, where E y ( · ) is used to denote the expectation taken with respect to thesampling distribution of the data Y .There are various reasons for using the quantity X (cid:48) ( X M β M − Xb ) fromthe Dantzig selector (Candes and Tao, 2007) versus simply the difference inpredictions ( X M β M − Xb ) as in the LASSO (Tibshirani, 1996). One reasonis that X (cid:48) ( X M β M − Xb ) accounts for difference in predictions as well ascorrelations with the explanatory data, as discussed in Berk (2008). If thedifference in predictions is small but is highly correlated with the designmatrix, then it is likely that the smaller subset of covariates is unable toaccount for the effect of one or more of the covariates in M . Thus, using X (cid:48) ( X M β M − Xb ) instead of just the difference in predictions is a methodof controlling for potential omitted variable effects which could incorrectlyfind a close fitting subset to M . Another advantage of X (cid:48) ( X M β M − Xb ) isthat it is invariant under orthogonal transformations of the design matrix,as pointed out in Candes and Tao (2007).Now that the foundation for ε - admissible subsets of the parameter spacehas been laid out, it remains to show how Definition 2.1 can be coupledwith a likelihood based approach for constructing a probability distributionover index subsets M ⊂ { , . . . , p } . This is a common strategy for Bayesianapproaches, i.e., construct a prior density with desirable properties for vari-able selection and then couple the prior with a likelihood function of thedata to study the resulting posterior distribution. However, it is not clearwhat sort of a prior to use within our ε - admissible subsets approach, and re-cent developments in generalized fiducial inference (GFI) offer a systematicmethod of deriving objective Bayes-like posterior distributions. To illustrateas in Hannig et al. (2016), suppose that some data Y = G ( U, θ ) for somedeterministic data generating equation G ( · , · ), some parameters θ , and somerandom component U whose distribution is independent of θ and is com-pletely known. The generalized fiducial distribution of θ is then given by r ( θ | y ) = f ( y, θ ) J ( y, θ ) (cid:82) Θ f ( y, θ (cid:48) ) J ( y, θ (cid:48) ) dθ (cid:48) , where f is the likelihood function and FI VARIABLE SELECTION J ( y, θ ) = D (cid:18) ddθ G ( u, θ ) (cid:12)(cid:12)(cid:12) u = G − ( y,θ ) (cid:19) with D ( A ) = (det A (cid:48) A ) . The component J ( y, θ ) is termed the J acobian because it results from inverting the data generating equation on the data.We are committing a slight abuse of notation as r ( θ | y ) is not a conditionaldensity in the usual sense. 
Instead, we are using this notation to stress thatthe generalized fiducial distribution is a function of the observed data y .To make matters concrete in the linear model setting of (3), the pa-rameters are θ = ( β M , σ M ), the data generating equation is specified as G (cid:0) U, ( β M , σ M ) (cid:1) = X M β M + σ M U where U ∼ N n (0 , I n ), and the Jacobianterm reduces to J (cid:0) y, ( β M , σ M ) (cid:1) = σ − M | det( X (cid:48) M X M ) | RSS M . Thus, r (cid:0) ( β M , σ M ) | y (cid:1) ∝ σ − nM e − (cid:107) y − XM βM (cid:107) σ M σ − M | det( X (cid:48) M X M ) | RSS M · h ( β M ) , where the factor of h ( β M ) appears in the likelihood from only considering ε - admissible subsets of the parameter space. Accordingly, as is done with aBayesian posterior density and in Section 3 of Hannig et al. (2016), define theGFI probability of a given subset M to be proportional to the normalizingconstant of r (cid:0) ( β M , σ M ) | y (cid:1) . That is, r ( M | y ) := (cid:82) f (cid:0) y, ( β M , σ M ) (cid:1) J (cid:0) y, ( β M , σ M ) (cid:1) h ( β M ) d ( σ M , β M ) p (cid:80) j =1 (cid:80) | M | = j (cid:82) f (cid:0) y, ( β M , σ M ) (cid:1) J (cid:0) y, ( β M , σ M ) (cid:1) h ( β M ) d ( σ M , β M ) ∝ (cid:90) R pM (cid:90) ∞ h ( β M ) | det( X (cid:48) M X M ) | RSS M σ n +1 M e ( y − XM βM ) (cid:48) ( y − XM βM )2 σ M dσ M dβ M , which simplifies to (6) r ( M | y ) ∝ π | M | Γ (cid:16) n − | M | (cid:17) RSS − ( n −| M |− ) M E( h ( β M )) , where the expectation is taken with respect to the location-scale multivariateT distribution, (7) t n −| M | (cid:16) (cid:98) β M , RSS M n − | M | ( X (cid:48) M X M ) − (cid:17) with (cid:98) β M := ( X (cid:48) M X M ) − X (cid:48) M y . Notice that the quantity E( h ( β M )) is a func-tion of the observed data y .Observe that (6) expresses the relative likelihood of the subset M overall 2 p possible subsets. The expression can be described as a product oftwo terms, the first being comprised of information from the sampling dis-tribution of the data and largely driven by the residual sum of squares,RSS M , and the second having to do with the ε - admissibility of β M , in the J. WILLIAMS AND J. HANNIG form of E( h ( β M )). Thus, the support of r ( M | y ) in (6) is dominated by the ε - admissible subsets, as desired.Section 3 provides the conditions and supporting lemmas and theoremsneeded to show that r ( M o | Y ) → n, p → ∞ . First however,a few remarks are provided about computing r ( M | y ) on actual data.2.1. Remarks on computation.
With a probability distribution now de-fined over ε - admissible subsets, it must be demonstrated that r ( M | y ) in(6) can be efficiently computed. There are two main computational issues todeal with. The first is to evaluate h ( β M ) for a given β M , and the second is tosample subsets M via pseudo-marginal based MCMC. The computationalcomplexity and the need for pseudo-marginal based MCMC arises becauseneither h ( β M ) nor E( h ( β M )) have a closed form solution.To evaluate h ( β M ) for a given β M we adapt an explicit L minimizationalgorithm introduced in Bertsimas, King and Mazumder (2016). The authorsstate that their algorithm borrows ideas from projected gradient descent andmethods in first-order convex optimization, and solves problems of the formmin b ∈ R p g ( b ) subject to (cid:107) b (cid:107) ≤ κ , where g ( b ) ≥ (cid:107)∇ g ( b ) − ∇ g ( (cid:101) b ) (cid:107) ≤ l (cid:107) b − (cid:101) b (cid:107) . The algorithm isnot guaranteed to find a global optimum (unless formal optimality tests arerun, which can take a long time), but Bertsimas, King and Mazumder (2016)provide provable guarantees that the algorithm will converge to a first-orderstationary point, which is defined as a vector ˜ b ∈ R p with (cid:107) ˜ b (cid:107) ≤ κ whichsatisfies ˜ b = ˜ b − l ∇ g (˜ b ). Paraphrasing from Bertsimas, King and Mazumder(2016), their algorithm detects the active set after a few iterations, and thentakes additional time to estimate the coefficient values to a high accuracylevel. In our application of their algorithm we are not first-most interestedin finding a global optimum. To evaluate h ( β M ), we need only determine ifmin b ∈ R p (cid:107) X (cid:48) ( X M β M − Xb ) (cid:107) is smaller than ε (as in (5)). For β M which arenot ε - admissible , the objective function, (cid:107) X (cid:48) ( X M β M − Xb ) (cid:107) , can be madesmall very quickly via our implementation of the L minimization algorithm.To illustrate how consider the following specifics of our implementation. Theprecise details regarding the algorithm can be found accompanying our soft-ware documentation at https://jonathanpw.github.io/software.html .First, to estimate E( h ( β M )) we use a sample mean of sample vectorsdrawn from the location-scale multivariate T distribution in (7). This mul-tivariate T distribution is centered at the least squares estimator, (cid:98) β M , andmultivariate theory suggests that (cid:98) β M will on average be close to the coeffi-cients β M . By warm starting the L minimization algorithm at (cid:98) β M with thesmallest coefficient removed, subsets corresponding to β M with at least one FI VARIABLE SELECTION zero coefficient typically yield h ( β M ) = 0 within a few steps of the algorithm.Second, as per the definition of h ( · ) in (4) the objective function is min-imized over all b ∈ R p with (cid:107) b (cid:107) ≤ | M | −
1. Hence, the κ required for the L minimization algorithm from Bertsimas, King and Mazumder (2016) isnaturally chosen for us as κ = | M | −
1. Knowing how to choose κ greatly re-duces the L optimization problem. Moreover, our implementation is furthersimplified by the fact that the closest prediction to X M β M for a given M isguaranteed to have | M | − h ( β M ) need not be minimized over all b ∈ R p with (cid:107) b (cid:107) ≤ | M | −
1, butcan be minimized over all b ∈ R p with (cid:107) b (cid:107) = | M | − M via pseudo-marginal based MCMC. We do this by using the Grouped IndependenceMetropolis Hastings (GIMH) algorithm from Andrieu and Roberts (2009),but originally introduced in Beaumont (2003). The reason standard MCMCtechniques do not apply is that there is no obvious closed form expressionfor the probability mass function (6) because of the expectation, E( h ( β M )),in the expression. As described in Andrieu and Roberts (2009) such situa-tions warrant introducing a latent variable to yield analytical expressions oreasier implementation.In the case of r ( M | y ) in (6), we introduce the latent location-scale multi-variate T vector in (7) from within the expectation E( h ( β M )). Our pseudo-marginal based MCMC is carried out by sampling an index subset M alongwith sampling some pre-specified number, N , of multivariate T vectors (cor-responding to M ) from (7). The sample of multivariate T vectors, say B , isthen used to compute the sample mean estimate of E( h ( β M )). Accordingly,we define a joint Markov chain on ( M, B ), but discard B to obtain samplesfrom the marginal distribution of M . As argued in Andrieu and Roberts(2009), this is a valid MCMC sampling strategy, but is known to suffer fromslower mixing than if we were able to integrate the β M out of the massfunction r ( M | y ) in (6), i.e., analytically evaluate E( h ( β M )). However, this isnot possible given the h function in (4). Additionally, the mixing associatedwith pseudo-marginal approaches is known to be poor when the number ofimportance samples ( N , the sample size of B ) is small. These practical bot-tlenecks outline avenues for future research. Nonetheless, we demonstrate inSection 4 that our computational strategies are efficient enough to be imple-mented on actual data, in comparison to other common penalized likelihoodand Bayesian approaches.
3. Theoretical results.
The main objective of this section is to showunder what conditions, asymptotically, r ( M o | Y ) in (6) will converge to 1, J. WILLIAMS AND J. HANNIG particularly if p >> n . The ε - admissible subsets approach is able to achievesuch a strong consistency result because the resulting sample space is ef-fectively reduced to only those subsets with no redundancies. The essenceof the mathematical result is that the space of ε - admissible sets is smallenough that the true model can be detected. This addresses the issue raisedin Ghosh and Ghattas (2015) that high-dimensional settings often lead to ar-bitrarily small probabilities for all models (including the true model) simplybecause there are too many models to consider.3.1. Discussion of the conditions.
The first two conditions, Condition3.1 and Condition 3.2, are to ensure that the true model, M o , is identifiable.Observe from (4) that ε is used to control the sensitivity and specificityfor identifying ε - admissible subsets. In particular, if ε is too large, then h ( β M o ) will incorrectly be set to zero implying that β M o is not ε - admissible .Condition 3.1 specifies how large ε can be whilst the true model remainsidentifiable. This condition turns out to be critically important in actual dataapplications because computing h ( β M o ) is closely related to the comparisonin equation (8). Condition . For large n and p , (8) 118 (cid:107) X (cid:48) ( X M o β M o − Xb min ) (cid:107) ≥ ε where b min solves min b ∈ R p (cid:107) X (cid:48) ( X M o β M o − Xb ) (cid:107) subject to (cid:107) b (cid:107) ≤ | M o | − . Condition 3.2 is born from Lemma A.1 which is an important necessaryresult for the main result of this paper, Theorem 3.9. The term log( n ) γ represents the sparsity assumption for the true model, i.e., the number ofcovariates in the true model must not exceed log( n ) γ for some fixed scalar γ >
0. The γ parameter indicates that the asymptotic results remain trueif the true model grows faster than log( n ), but not faster than some powerof log( n ). In finite-sample applications, γ has no consequence and can beignored.The constant α ∈ (0 ,
1) reflects the only explicit restriction needed onthe sample space of 2 p subsets to show that r ( M o | Y ) → n and p , Theorem 3.9. The residual sum of squares term in r ( M | Y )in (6) cannot be controlled (as a ratio to r ( M o | Y )) for arbitrary subsets M with | M | = O ( n ) because the column span of X M includes y ∈ R n when rank( X M ) = n . To eliminate such subsets from the sample space,Condition 3.2 requires that only subsets of size | M | ≤ n α can be givennonzero probability. However, recall from Definition 2.1 that h ( β M ) = 0 if FI VARIABLE SELECTION | M | > n because in this case the columns of X M must be linearly dependent.Accordingly, all subsets M with | M | > n are given zero probability, bydefinition. Evidenced by this fact, the only explicit restriction placed on thesample space is that subsets M with | M | ∈ ( n α , n ) are excluded. In sparsesettings it is assumed that | M o | << n anyway, so neglecting such subsets isreasonable. Convergence to the true model M o will be quicker for smaller α because there are less models to consider, but too small of an α will exclude M o from the sample space.In Condition 3.2, and for the remainder of this section assume that γ > γ = 1, and α ∈ (0 , α = .
5, have been chosen and fixed atappropriate values.
Condition . The true model M o satisfies | M o | ≤ log( n ) γ , and lim n →∞ p →∞ min (cid:110) ∆ M | M o | log( p ) : M (cid:54) = M o , | M | ≤ | M o | (cid:111) = ∞ , lim inf n →∞ p →∞ n − α log( p ) > , and log( p ) < n − | M o | −
14 log( n ) γ , for large n and p , where ∆ M := (cid:107) X M o β M o − H M X M o β M o (cid:107) as in Lai, Hannigand Lee (2015). This is a slightly weaker version of condition (11) in Lai, Hannig andLee (2015). They relate Condition 3.2 to the sparse Riesz condition (Zhangand Huang, 2008) which requires that the eigenvalues of X (cid:48) M X M /n areuniformly bounded away from 0 and ∞ . Essentially, ∆ M is a measure ofhow distinct the true model predictions X M o β M o are from their projectiononto the column space of X M for M (cid:54) = M o and | M | ≤ | M o | . Recall that H M := X M ( X (cid:48) M X M ) − X (cid:48) M . In particular, if X M is orthogonal to X M o ,then ∆ M = (cid:107) X M o β M o (cid:107) which will be much larger than the denominator, | M o | log( p ). The requirements of this condition are reasonable because ∆ M grows very fast for M such that M o (cid:54)⊆ M .Condition 3.2 is important for being able to identify the true modelamongst other models M with | M | ≤ | M o | . The next two conditions ad-dress the requirements for M with | M | > | M o | , which primarily rely on thefact that such subsets are not ε - admissible .Conditions 3.3 and 3.4 demonstrate how large ε needs to be to achieve theconsistency result of the main theorem. Condition 3.3 states that for subsetsof covariates with redundancies, ε needs to be larger than the difference inprojections of the true model prediction, X M o β M o , onto M and onto a strictsubset of M . This condition facilitates the intuition that the variable selec-tion procedure will not concentrate on subsets M with redundant covariates. J. WILLIAMS AND J. HANNIG
If a given subset M is not ε - admissible , then the difference in projectionswill be small so that the condition is easily achieved. Condition . For any M with | M | > | M o | , for large n or p , (cid:107) X (cid:48) ( H M − H M ( − ) X M o β M o (cid:107) < ε, where H M ( − is the projection matrix for M after omitting the covariatewhich minimizes (cid:107) X (cid:48) ( H M − H M ( − ) X M o β M o (cid:107) . In fact, if M o ⊂ M with | M o | < | M | , then H M X M o β M o = H M ( − X M o β M o in which case Condition 3.3 holds trivially.Lastly, Condition 3.4 describes the rate at which ε needs to grow to achievethe consistency of the main result. The distinction between Condition 3.3and Condition 3.4 is that the former provides a necessary lower bound forarguing that E( h ( β M )) vanishes for M such that | M | > | M o | , while thelatter provides the rate at which ε must grow to achieve the consistencyresult of Theorem 3.9. The terms which compete with ε arise in the proofsof Lemma A.1 and Theorem 3.9. Recall that (cid:98) σ M := RSS M / ( n − | M | ), whereRSS M is the classical residual sum of squares for model M , and that Λ M :=tr (cid:0) ( H M X ) (cid:48) H M X (cid:1) . Condition . lim n →∞ p →∞ min | M |≤ n α ε M (cid:98) σ M + D | M o | − ϕ ( M, n, p )log( n ) = ∞ , where ϕ ( M, n, p ) := 4 e n α + ( D + (1 + 4 e ) log( p )) | M | + log (cid:16) | M | − n −| M | (cid:17) , and D = log (cid:16) π − nα +2 n (cid:17) . Additionally, ε M (cid:98) σ M < n −| M | for all M with | M | ≤ n α . The Λ M term arises from Lemmas 3.5 and 3.6. It is intimately relatedto the presence of collinearity amongst the covariates, and Condition 3.4implies that ε must account for collinearity by controlling for Λ M . Observethat if X is orthogonal, then Λ M = | M | .3.2. Main result.
The first two results are lemmas which are needed inthe proofs of Theorems 3.7 and 3.8. Lemma 3.5 illustrates the rate at which β M concentrates around its mean, (cid:98) β M , the least squares estimator, andLemma 3.6 illustrates the rate at which (cid:98) β M concentrates around its mean, FI VARIABLE SELECTION E y ( (cid:98) β M ). Theorem 3.7 uses these two lemmas to bound the rate at which β M concentrates around E y ( (cid:98) β M ) for subsets M with | M | > | M o | . This yields anupper bound on E( h ( β M )) with a probabilistic guarantee, and implies thatE( h ( β M )) vanishes for large n and p , for large non- ε - admissible subsets.The proofs are relegated to Appendix A. Lemma . For any fixed c ∈ (0 , assume | M | ≤ c n , and choose n and p such that ε M (cid:98) σ M < n −| M | . If β M ∼ t n −| M | (cid:16) (cid:98) β M , RSS M n − | M | ( X (cid:48) M X M ) − (cid:17) , where (cid:98) β M = ( X (cid:48) M X M ) − X (cid:48) M y , then P (cid:16) (cid:107) X (cid:48) X M ( β M − (cid:98) β M ) (cid:107) ≥ ε (cid:17) ≤ | M | (cid:98) σ M √ Λ M e − ε M (cid:98) σ M √ πε (1 − n −| M | ) . In the next lemma, P y is used to denote the probability measure associatedwith the sampling distribution of the data Y . Lemma . Assume | M | < n , and Y | β M , σ M ∼ N n (cid:16) X M β M , σ M I n (cid:17) .Then the classical least squares estimator (cid:98) β M ∼ N ( E y ( (cid:98) β M ) , σ M ( X (cid:48) M X M ) − ) ,and P y (cid:16) (cid:107) X (cid:48) X M (cid:0) (cid:98) β M − E y ( (cid:98) β M ) (cid:1) (cid:107) ≥ ε (cid:17) ≤ | M | σ M √ Λ M √ πε e − ε σ M Λ M . Combining these two lemmas gives the following non-asymptotic concen-tration result for models that are larger than the true model. Recall thatthe expectation E( h ( β M )) depends on the observed data y . The followingtwo theorems study the frequentist behavior of this quantity with respectto the sampling distribution of Y . Theorem . For any fixed c ∈ (0 , suppose | M o | < | M | ≤ c n ,choose n and p such that ε M (cid:98) σ M < n −| M | , and assume that ε satisfies Con-dition 3.3. Then P y (cid:18) E ( h ( β M )) ≤ | M | (cid:98) σ M √ Λ M √ πε (1 − n −| M | ) e − ε M (cid:98) σ M (cid:19) ≥ − | M | σ M √ Λ M √ επe ε σ M Λ M . The next result is a probabilistic guarantee the true model is ε - admissible given ε satisfies Condition 3.1. This result is a statement that M o is identi-fiable. J. WILLIAMS AND J. HANNIG
Theorem . For any fixed c ∈ (0 , suppose | M o | ≤ c n , choose n and p such that ε Mo (cid:98) σ Mo < n −| M o | , and assume that ε satisfies Condition3.1. Then P y (cid:32) E ( h ( β M o )) ≥ − p M o (cid:98) σ M o (cid:112) Λ M o √ πε (1 − n − p Mo ) e ε Mo (cid:98) σ Mo (cid:33) ≥ − p M o σ M o (cid:112) Λ M o √ επe ε σ Mo Λ Mo . The following result is the main result of the paper. It shows that theratio of the generalized fiducial probability of the true model to the sumover that of all other subsets of covariates M satisfying | M | ≤ n α willconverge to 1 in probability for large n and p . Note that the restriction tosubsets M with | M | ≤ n α is a stronger restriction than | M | ≤ c n , which issufficient for Theorems 3.7 and 3.8. The reason being that the main result isstronger than the results of these two theorems. In fact, Theorems 3.7 and3.8 are non-asymptotic results that hold for each fixed model M separately,while Theorem 3.9 is an asymptotic result which applies uniformly over all | M | ≤ n α . Just like with a conditional distribution, r ( M | Y ) is obtained byreplacing the observed data y with the random variable Y (random withrespect to the sampling distribution), in (6). Theorem . Given Conditions 3.1-3.4, the true model M o satisfies, r ( M o | Y ) (cid:80) n α j =1 (cid:80) M : | M | = j r ( M | Y ) P y − → as n → ∞ or n, p → ∞ . Although this is an asymptotic result, many of the ingredients that areused in its proof are non-asymptotic concentration results which are validif Conditions 3.1-3.4 are satisfied. Therefore, it can be expected that whenthese conditions are satisfied, in finite-sample situations the generalized fidu-cial distribution will concentrate on the true model M o . This expectationis indeed validated by the empirical performance of the procedure, which isdemonstrated in the following section.
4. Simulation results.
This section serves to demonstrate the empir-ical performance of our algorithm on synthetic data. It is comprised of es-sentially two simulation setups. The first setup, similar to that presented inJohnson and Rossell (2012), compares our procedure to the nonlocal priorapproach, the spike and slab LASSO of Roˇckov´a and George (2016), the elas-tic net as implementated in the
P ython module scikit-learn (Pedregosaet al., 2011), and to the SCAD as implementated in the R package ncvreg (Breheny and Huang, 2011). The authors of the nonlocal prior and the spike FI VARIABLE SELECTION and slab LASSO, respectively, have made available the R packages mombf and SSL for implementing their methods.The second setup illustrates a critical difference between our ε - admissible subsets approach and the nonlocal prior approach. Namely, for highly collinearfinite-sample settings in which the true model is not uniquely expressed,given the level of noise in the data (i.e., σ M o ), we demonstrate that ourapproach concentrates (in the sense of the MAP estimator) on subsets withfewer covariates without sacrificing prediction error. The intuition for whythis should be the case was discussed in Section 1.4.1. Simulation setup 1.
Here we generate 2000 data vectors y accordingto model (2) with M o consisting of 8 covariates corresponding to β M o =( − . , − , − . , − . , . , . , , . (cid:48) , and σ M o = 1. The n × p design matrix X is generated with rows from the N p (0 , Σ) distribution, where the diagonalcomponents Σ ii = 1 and the off-diagonal components Σ ij = ρ for i (cid:54) = j . Thefirst 1000 y correspond to an independent design with ρ = 0, while the last1000 y correspond to ρ = .
25, as in the simulation setup of Johnson andRossell (2012). Note that 2000 design matrices X are generated, and one y is generated from each design. The sample size n is set at n = 100, and p = 100 , , , ,
500 are all considered.We implement our algorithm on each of the 2000 synthetic data sets for15000 MCMC steps with the first 5000 discarded. Squared coefficient esti-mates from elastic net (using scikit-learn ) added by n − serve as MCMCcovariate proposal weights. The default ε in (5) is used, and we implementeda 10-fold cross-validation scheme for choosing our tuning parameter p o (priorto starting the algorithm). The cross-validation consists of breaking the datainto 10 folds (with a different set of 10 observations held out at each foldsince n = 100), and implementing our MCMC algorithm separately for each p o in the grid { , , . . . , } , on each of the 10 training sets. Each of the 10implementations of the MCMC on each of the 10 training sets is run for200 steps with the first 100 steps discarded ( N = 30 is set during the cross-validation procedure). Squared nonzero coefficient estimates from elastic net(using scikit-learn ) serve as MCMC covariate proposal weights withinthe cross-validation procedure. The MAP estimated subset for each MCMCchain is used to compute the Bayesian information criterion (BIC) on theheld-out test set, and the computed BIC values are averaged over the 10test sets, for each p o ∈ { , , . . . , } . The p o corresponding to the minimumaverage test set BIC is then selected.Finally, for our implementation of the algorithm post-selection of p o , thenumber of importance samples for estimating E( h ( β M )) within each step of J. WILLIAMS AND J. HANNIG the algorithm is set at N = 100 which, through empirical experimentation,seems to be enough. All competing variable selection procedures are imple-mented using existing software at default specifications. The one exceptionis that the nonlocal prior procedure is set to run for 5000 steps, as is the casein the simulation setup of Johnson and Rossell (2012). The nonlocal priorprocedure/software did not scale well for increased p , and required over aweeks worth of parallel computations on a computing cluster to obtain theresults for the first simulation setup. The tuning parameters for all methodsare chosen with the default cross-validation procedures provided with thesoftware. Lastly, as in the simulation section for Roˇckov´a and George (2016)their λ is set at 1 (with a grid of 20 λ values ending at 50), and theiradaptive (best performing) procedure is used with θ ∼ Beta(1 , p ).Figure 1 shows results of the first simulation setup. The first row of plotsdisplays the average generalized fiducial probability of the true model (i.e.,average r ( M o | y )), or the average posterior probability of the true model forthe Bayesian nonlocal procedure (i.e., average P ( M o | y )), over the 1000 syn-thetic data sets (for ρ = 0 and ρ = .
25, respectively). Conditional on thedata, these plots address the consistency of the procedures with respect tothe uncertainty from not knowing M o . This generalized fiducial or Bayesian-like consistency is that which is dealt with in Theorem 3.9. Note that fre-quentist and MAP estimators do not yield posterior probability estimates,and thus cannot be compared to in the first row of plots.The second row of Figure 1 shows the average proportion of correct modelselections over the 1000 synthetic data sets (for ρ = 0 and ρ = .
25, respec-tively). For the GFI and the Bayesian procedures the MAP subset is takento be the estimator of the true model, and in the frequentist procedures theestimated model is considered to be the subset of covariates with nonzerocoefficient estimates. These plots address the consistency of the procedureswith respect to repeated sampling (i.e., frequentist) uncertainty. Finally, thethird row of Figure 1 presents the average root mean squared error (RMSE)over the 1000 synthetic data sets (for ρ = 0 and ρ = .
25, respectively).The MAP estimated model is used to compute the RMSE for the GFI andBayesian procedures. For all procedures, the RMSE is computed on an out-of-sample test set of 100 observations.Note that the criterions used in the first two rows of Figure 1 are verystrict. They only reflect instances when the procedures are exactly correct,and count the procedure as incorrect if it is missing even one covariate fromthe true model or includes even one spurious covariate. Often the elastic netand SCAD are able to identify all of the true covariates but estimate ex-tra coefficients to be nonzero. This results in poor identification of the true
FI VARIABLE SELECTION subset, and worse out-of-sample prediction error. The remaining procedures(including our ε - admissible subsets method) struggle to identify the two co-variates with smallest coefficient magnitudes, but typically do not introducemore than one or two false positives.
100 200 300 400 500 p r ( M o | y ) o r P ( M o | y ) = 0
100 200 300 400 500 p r ( M o | y ) o r P ( M o | y ) = . 25
100 200 300 400 500 p P r o p o r t i o n o f c o rr e c t m o d e l s e l e c t i o n s admissible subsetsnonlocal priorSSLelastic netSCAD
100 200 300 400 500 p P r o p o r t i o n o f c o rr e c t m o d e l s e l e c t i o n s
100 200 300 400 500 p R M S E
100 200 300 400 500 p R M S E Fig 1 . The average r ( M o | y ) , or average P ( M o | y ) is displayed in the first row, the averageproportion of correct model selections in the second row, and the average RMSE in thethird. Averages are over 1000 synthetic data sets (for ρ = 0 and ρ = . , respectively).For the GFI and Bayesian procedures the MAP subset is used as the estimator of the truemodel, and in the frequentist procedures the estimated model is considered to be the subset ofcovariates with nonzero coefficient estimates. The RMSE is computed on an out-of-sampletest set of 100 observations. J. WILLIAMS AND J. HANNIG
Our ε - admissible subsets procedure evidently performs the best at as-signing highest posterior probability to the true model. And even in com-parison to the frequentist-oriented metrics, proportion of correct model se-lections and out-of-sample RMSE, our ε - admissible subsets procedure per-forms more or less on par with the best performing methods considered.Note that in collinear design settings (i.e., ρ >
0) it may be the case thata strict subset of the true data generating model is identified as the ‘true’model within the ε - admissible framework. This is meaningful because itmanifests the fact that the true model may not be minimal (i.e., containsredundant predictors) in collinear, finite-sample settings. Furthermore, itexplains the larger difference in proportion of true model selections betweenthe ε - admissible and nonlocal prior performances in the ρ = .
25 case (versusthe ρ = 0 case), which is accompanied by only a very small change in thedifference between RMSE performance. This phenomenon is illustrated in amore extreme case of collinearity in the next simulation setup.4.2. Simulation setup 2.
The ε - admissible subsets approach has beendeveloped in this paper as a method of obtaining a posterior-like distributionwhich effectively eliminates (i.e., assigns negligible probability to) all subsetswith redundancies. To illustrate that our constructed methods do just that,consider the following setup in which the true data generating model lacksuniqueness for the small sample size n = 30: (9) Y ∼ N n (cid:16) · x (1) + 1 · x (2) + · · · + 1 · x (9) , I n (cid:17) , where x (1) , x (2) , x (3) iid ∼ N n (0 , I n ), and x (4) ∼ N n (cid:16) . · x (1) , . I n (cid:17) x (5) ∼ N n (cid:16) . · x (2) , . I n (cid:17) x (6) ∼ N n (cid:16) − . · x (3) , . I n (cid:17) x (7) ∼ N n (cid:16) x (1) + x (3) , . I n (cid:17) x (8) ∼ N n (cid:16) x (2) − x (3) , . I n (cid:17) x (9) ∼ N n (cid:16) x (1) + x (2) + x (3) , . I n (cid:17) With standard deviations of .1 and a model error standard deviation of1, covariates x (4) , . . . , x (9) can all be approximately expressed as a linearcombination of x (1) , x (2) , x (3) . Accordingly, with a small increase in errorvariance, model (9) can be re-expressed using various combinations of the9 predictors. However, observe that a subset with 4 or more predictors pre-dominantly contains redundant information. FI VARIABLE SELECTION Recall from Section 1 that the nonlocal prior approach of Johnson andRossell (2012) is designed to assign negligible probabilities to subsets con-taining predictor(s) with coefficients of zero. So, in theory, the full subset { x (1) , . . . , x (9) } will remain the best candidate for the true model within thenonlocal prior framework. In fact, this is demonstrated to be the case inTable 1 which shows the performance of both the nonlocal prior and the ε - admissible subsets approach on 1000 data vectors y generated accordingto (9), with each covariate having a ‘true’ coefficient of 1. Note that as inthe first simulation setup 1000 design matrices X are generated to generatethe 1000 y vectors. model size RMSE r ( M MAP | y ) or P ( M MAP | y ) ε - admissible subsets 3.476 1.138 .365nonlocal prior 8.997 1.197 .333 Table 1
The average number of covariates in the MAP estimator, M MAP , ( | M MAP | ) is presentedin the first column, the average RMSE in the second, and the average r ( M MAP | y ) or P ( M MAP | y ) in the third. Averages are over 1000 synthetic data sets from model (9). TheRMSE is computed on an out-of-sample test set of 30 observations. Table 1 shows that the MAP estimate for the ε - admissible subsets ap-proach contains 3-4 covariates, on average, and that in fact the averageRMSE is smaller than that of the nonlocal prior approach. Indeed, the MAPestimates for the nonlocal prior procedure typically includes all 9 covariateseven though the y vectors can be mostly explained by only 3 of the pre-dictors. This simple simulation illustrates a pivotal difference between thenonlocal prior and ε - admissible subsets approaches. With p = 9 the im-plication of not discriminating against redundant subsets may seem trivial.However, the 2 p size of the sample space grows rapidly in p and thus, putsexponentially more burden on procedures which do not discriminate basedon redundancies. This is reflected by comparing the differences in strength ofasymptotic consistency achieved for the two procedures. Though, the consis-tency result of the nonlocal prior method from Johnson and Rossell (2012)is argued to be stronger in an as-of-yet unpublished manuscript by Shin,Bhattacharya and Johnson (2015).
5. Concluding remarks.
In this paper we have developed a new per-spective for variable selection to exploit a non-redundancy property of atrue data generating model. The basic idea calls for defining a true model asone which contains minimal amount of information necessary for explainingand/or predicting the observed data. The difference between our definitionof a true model and the usual definition arises only in finite-sample appli-cations, and was illustrated in Section 4.2. Within our variable selectionframework, this definition allows us to show that the posterior-like proba- J. WILLIAMS AND J. HANNIG bility of the true model converges to 1 asymptotically, even with p growingalmost exponentially in n , with the intuition that redundancies in the sam-ple space are very effectively eliminated. Moreover, our empirical simulationresults are consistent with this strong consistency result, and as desired,it is demonstrated in a situation of high collinearity that the ε - admissible subsets approach yields a posterior-like distribution which is concentratedover subsets with fewer covariates, without sacrificing prediction error.A non-redundancy property of a true data generating model is seeminglygeneral enough to extend to variable or feature selection problems beyondthe linear model setting, but would seem infeasible if it could not be de-veloped for the linear model setting. Thus, the goal of this paper has beento establish the potential feasibility of exploiting such a property. In futurework we hope to extend our methods to more complicated settings. References.
Andrieu, C. and
Roberts, G. (2009). The psuedo-marginal approach for efficient montecarlo computations.
The Annals of Statistics Beaumont, M. A. (2003). Estimation of population growth or decline in geneticallymonitored populations.
Genetics
Berger, J. O. , Pericchi, L. R. , Ghosh, J. , Samanta, T. , De Santis, F. , Berger, J. and
Pericchi, L. (2001). Objective Bayesian methods for model selection: introductionand comparison.
Lecture Notes-Monograph Series
Berk, R. A. (2008).
Statistical Learning from a Regression Perspective . Springer, NewYork, NY.
Bertsimas, D. , King, A. and
Mazumder, R. (2016). Best Subset Selection via a ModernOptimization Lens.
The Annals of Statistics Bondell, H. D. and
Reich, B. J. (2012). Consistent high-dimensional Bayesian variableselection via penalized credible regions.
Journal of the American Statistical Association
Breheny, P. and
Huang, J. (2011). Coordinate descent algorithms for nonconvex pe-nalized regression, with applications to biological feature selection.
Annals of AppliedStatistics Candes, E. and
Tao, T. (2007). The Dantzig Selector: Statistical estimation when p ismuch greater than n . The Annals of Statistics Fan, J. and
Li, R. (2001). Variable Selection via Nonconcave Penalized Likelihood andits Oracle Properities.
Journal of the American Statistical Association Ghosh, J. and
Ghattas, A. E. (2015). Bayesian Variable Selection Under Collinearity.
The American Statistician Hannig, J. , Iyer, H. , Lai, R. C. S. and
Lee, T. C. M. (2016). Generalized FiducialInference: A Reviev and New Results.
Journal of American Statistical Association
Jameson, G. J. O. (2013). Inequalities for gamma function ratios.
American Math.Monthly
Johnson, V. E. and
Rossell, D. (2012). Bayesian Model Selection in High-DimensionalSettings.
Journal of the American Statistical Association .FI VARIABLE SELECTION Lai, R. C. S. , Hannig, J. and
Lee, T. C. M. (2015). Generalized Fiducial Inferencefor Ultrahigh-Dimensional Regression.
Journal of the American Statistical Association
Lan, H. , Chen, M. , Flowers, J. B. , Yandell, B. S. , Stapleton, D. S. , Mata, C. M. , Mui, E. T. , Flowers, M. T. , Schueler, K. L. , Manly, K. F. , Williams, R. W. , Kendziorski, K. and
Attie, A. D. (2006). Combined expresssion trait correlationsand expression quantitative trait locus mapping.
PLoS Genetics . Luo, S. and
Chen, Z. (2013). Extended BIC for Linear Regression Models With Diverg-ing Number of Relevant Features and High or Ultra-High Feature Spaces.
Journal ofStatistical Planning and Inference
Narisetty, N. N. and
He, X. (2014). Bayesian variable selection with shrinking anddiffusing priors.
The Annals of Statistics Pedregosa, F. , Varoquaux, G. , Gramfort, A. , Michel, V. , Thirion, B. , Grisel, O. , Blondel, M. , Prettenhofer, P. , Weiss, R. , Dubourg, V. , Vanderplas, J. , Pas-sos, A. , Cournapeau, D. , Brucher, M. , Perrot, M. and
Duchesnay, E. (2011).Scikit-learn: Machine Learning in Python.
Journal of Machine Learning Research Rossell, D. and
Telesca, D. (2017). Nonlocal Priors for High-Dimensional Estimation.
Journal of the American Statistical Association
Roˇckov´a, V. and
George, E. I. (2016). The Spike-and-Slab LASSO.
Journal of theAmerican Statistical Association . Shin, M. , Bhattacharya, A. and
Johnson, V. E. (2015). Scalable Bayesian variableselection using nonlocal prior densities in ultrahigh-dimensional settings. arXiv preprintarXiv:1507.07106 . Tibshirani, R. (1996). Regression Shrinkage and Selection via the Lasso.
Journal of theRoyal Statistical Society Yuan, M. and
Lin, Y. (2005). Efficient Empirical Bayes Variable Selection and Estimationin Linear Models.
Journal of the American Statistical Association
Zhang, C. and
Huang, J. (2008). The Sparsity and Bias of the Lasso Selection in High-Dimensional Linear Regression.
The Annals of Statistics APPENDIX A: PROOFS
Proof of Lemma 3.5.
From the distributional assumption on β M it followsthat T := (cid:115) n − | M | RSS M ( X (cid:48) M X M ) ( β M − (cid:98) β M ) ∼ t n −| M | (0 , I | M | ) . Thus, (cid:107) X (cid:48) X M ( β M − (cid:98) β M ) (cid:107) = (cid:107) AT (cid:107) RSS M n − | M | = (cid:107) QDW (cid:48) T (cid:107) RSS M n − | M | where A = X (cid:48) X M ( X (cid:48) M X M ) − is a p × | M | matrix which has the singularvalue decomposition A = QDW (cid:48) for Q and W each orthogonal matrices. Bythe definition of the multivariate T distribution, T = (cid:114) n − | M | V Z = ⇒ W (cid:48) T = (cid:114) n − | M | V W (cid:48) Z =: (cid:114) n − | M | V (cid:101) Z =: (cid:101) T , where V ∼ χ n −| M | and Z, (cid:101) Z ∼ MVN(0 , I | M | ). Then J. WILLIAMS AND J. HANNIG (cid:107) AT (cid:107) = (cid:107) QD (cid:101) T (cid:107) = (cid:101) T (cid:48) D (cid:48) D (cid:101) T = | M | (cid:88) (cid:101) T i λ i ≤ Λ M max (cid:101) T j where Λ M = (cid:80) | M | λ i , and λ i is the i th eigenvalue of A (cid:48) A . Observe that Λ M also has the more intuitive expression Λ M = tr( A (cid:48) A ) = tr (cid:0) ( H M X ) (cid:48) H M X (cid:1) which illustrates that it is intimately related to the presence of collinearityamongst the covariates.Recall that (cid:98) σ M := RSS M / ( n − | M | ). Then P (cid:16) (cid:107) X (cid:48) X M ( β M − (cid:98) β M ) (cid:107) ≥ ε (cid:17) = P (cid:16) (cid:107) AT (cid:107) RSS M n − | M | ≥ ε (cid:17) ≤ P (cid:16)
12 Λ M max (cid:101) T j RSS M n − | M | ≥ ε (cid:17) = P (cid:16) n − | M | V max Z j (cid:98) σ M ≥ ε M (cid:17) . Since V ∼ χ n −| M | , P (cid:16) (cid:107) X (cid:48) X M ( β M − (cid:98) β M ) (cid:107) ≥ ε (cid:17) ≤ (cid:90) ∞ P (cid:16) max | Z j | ≥ (cid:112) εv/ ( n − | M | )3 (cid:98) σ M √ Λ M (cid:17) v n −| M | − e − v n −| M | Γ( n −| M | ) dv ≤ (cid:90) ∞ | M | (cid:88) j =1 P (cid:16) | Z j | ≥ (cid:112) εv/ ( n − | M | )3 (cid:98) σ M √ Λ M (cid:17) v n −| M | − e − v n −| M | Γ( n −| M | ) dv ≤ | M | (cid:98) σ M (cid:112) Λ M ( n − | M | ) √ πε (cid:18) ε M (cid:98) σ M ( n −| M | ) (cid:19) n −| M | − · Γ( n −| M |− )Γ( n −| M | )(10) where the last inequality follows because for the standard normal CDF, Φfor x >
0, Φ( − x ) ≤ √ π e − x x . To simplify the bound, observe first that Jameson (2013) gives (11) Γ( n −| M |− )Γ( n −| M | ) ≤ (cid:112) n − | M | ) n − | M | − . Second, observe that for 0 ≤ x ≤ n , (cid:16) xn (cid:17) − n ≤ e − x + x n = e − x (1 − x n ) ≤ e − x . By assumption | M | ≤ c n , and ε M (cid:98) σ M < n −| M | which implies that FI VARIABLE SELECTION (cid:32) ε M (cid:98) σ M ( n −| M | ) (cid:33) − n −| M | + ≤ e − ε M (cid:98) σ M . Therefore, applying (11) and (12) to (10) gives P (cid:16) (cid:107) X (cid:48) X M ( β M − (cid:98) β M ) (cid:107) ≥ ε (cid:17) ≤ | M | (cid:98) σ M √ Λ M √ πε (1 − n −| M | ) e − ε M (cid:98) σ M . (cid:4) Proof of Lemma 3.6.
Proof of Lemma 3.6.
Similar to the proof of Lemma 3.5. $\square$
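As an aside, the two facts driving the proof of Lemma 3.5, the trace identity for $\Lambda_M$ and the eigenvalue bound on $\|AT\|^2$, are easy to check numerically. The short sketch below is illustrative only; the dimensions, seed, and variable names are our own arbitrary choices and are not from the paper.

    import numpy as np

    # Check two facts used in the proof of Lemma 3.5:
    # (i)  Lambda_M = tr(A'A) = tr((H_M X)'(H_M X)) for A = X'X_M (X_M'X_M)^{-1/2},
    # (ii) ||A t||^2 <= Lambda_M * max_j (W't)_j^2, where A = Q D W' is the SVD of A.
    rng = np.random.default_rng(0)
    n, p, m = 50, 10, 4                                    # sample size, covariates, |M|
    X = rng.standard_normal((n, p))
    XM = X[:, :m]                                          # design matrix of the subset M

    G = XM.T @ XM
    evals, evecs = np.linalg.eigh(G)
    G_inv_half = evecs @ np.diag(evals ** -0.5) @ evecs.T  # symmetric (X_M'X_M)^{-1/2}
    A = X.T @ XM @ G_inv_half                              # the p x |M| matrix A

    H_M = XM @ np.linalg.inv(G) @ XM.T                     # projection onto the span of X_M
    Lambda_M = np.trace(A.T @ A)
    print(np.isclose(Lambda_M, np.trace((H_M @ X).T @ (H_M @ X))))   # trace identity

    Q, D, Wt = np.linalg.svd(A, full_matrices=False)       # A = Q diag(D) W'
    for _ in range(1000):
        t = rng.standard_normal(m)
        assert np.sum((A @ t) ** 2) <= Lambda_M * np.max((Wt @ t) ** 2) + 1e-9
    print("||At||^2 <= Lambda_M * max_j (W't)_j^2 held in all draws")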
Proof of Theorem 3.7.
Recall that $h(\beta_M) = I\bigl\{\|X'(X_M\beta_M - Xb_{\min})\| \ge \varepsilon\bigr\}$, where $b_{\min}$ solves $\min_{b\in\mathbb R^p}\|X'(X_M\beta_M - Xb)\|$ subject to $\|b\|_0 \le |M|-1$. Then
\[
\tag{13}
\mathrm E(h(\beta_M)) = P\bigl(\|X'(X_M\beta_M - Xb_{\min})\| \ge \varepsilon\bigr)
\le P\bigl(\|X'(X_M\beta_M - X\,\mathrm E_y(\hat\beta_{M(-1)}))\| \ge \varepsilon\bigr),
\]
where $\hat\beta_{M(-1)}$ is the least squares estimator corresponding to the subset of covariates $M$ with one covariate removed, so that $\|\mathrm E_y(\hat\beta_{M(-1)})\|_0 \le |M|-1$. To refine the bound on $\mathrm E(h(\beta_M))$, decompose the last expression in (13) using the triangle inequality as follows:
\begin{align}
P\bigl(\|X'(X_M\beta_M - X_{M(-1)}\mathrm E_y(\hat\beta_{M(-1)}))\| \ge \varepsilon\bigr)
&\le P\bigl(\|X'X_M(\beta_M - \hat\beta_M)\| \ge \tfrac{\varepsilon}{3}\bigr) \nonumber\\
&\quad + I\bigl\{\|X'X_M(\hat\beta_M - \mathrm E_y(\hat\beta_M))\| \ge \tfrac{\varepsilon}{3}\bigr\} \nonumber\\
&\quad + I\bigl\{\|X'(H_M - H_{M(-1)})X_{M_o}\beta_{M_o}\| \ge \tfrac{\varepsilon}{3}\bigr\},\tag{14}
\end{align}
where $H_M := X_M(X_M'X_M)^{-1}X_M'$. Observe that by Lemma 3.5 the first term on the right hand side satisfies
\[
P\bigl(\|X'X_M(\beta_M - \hat\beta_M)\| \ge \tfrac{\varepsilon}{3}\bigr)
\le \frac{3|M|\,\hat\sigma_M\sqrt{2\Lambda_M}}{\sqrt{\pi}\,\varepsilon\,\bigl(1-\frac{2}{n-|M|}\bigr)}\,e^{-\frac{\varepsilon^2}{18\hat\sigma_M^2\Lambda_M}}.
\]
Note that the last two terms in (14) are written as indicator functions because the probability statement accounts only for the uncertainty in $\beta_M$. The second term, however, results from the uncertainty in observing $Y$. Accordingly, by Lemma 3.6,
\[
P_y\Bigl(I\bigl\{\|X'X_M(\hat\beta_M - \mathrm E_y(\hat\beta_M))\| \ge \tfrac{\varepsilon}{3}\bigr\} = 0\Bigr)
\ge 1 - \frac{3|M|\,\sigma_M\sqrt{2\Lambda_M}}{\sqrt{\pi}\,\varepsilon}\,e^{-\frac{\varepsilon^2}{18\sigma_M^2\Lambda_M}}.
\]
Recall that $P_y$ is used to denote the probability measure associated with the sampling distribution of $Y$. The third term in (14) must be zero by Condition 3.3. $\square$
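The $L_0$-constrained minimization that defines $h(\beta_M)$ can be made concrete with a brute-force enumeration over supports, which is feasible only for toy dimensions. The sketch below is illustrative; the function name h_indicator, the dimensions, the collinearity construction, and the threshold are our own choices, not the paper's.

    from itertools import combinations
    import numpy as np

    def h_indicator(X, M, beta_M, eps):
        """I{ ||X'(X_M beta_M - X b_min)|| >= eps }, with b_min found by enumerating
        every support of size |M|-1 and solving the corresponding least squares problem."""
        target = X.T @ (X[:, M] @ beta_M)                  # the vector X'X_M beta_M
        best = np.inf
        for S in combinations(range(X.shape[1]), len(M) - 1):
            XS = X.T @ X[:, list(S)]                       # columns of X'X on the support S
            coef, *_ = np.linalg.lstsq(XS, target, rcond=None)
            best = min(best, np.linalg.norm(target - XS @ coef))
        return int(best >= eps)

    rng = np.random.default_rng(1)
    X = rng.standard_normal((40, 8))
    X[:, 1] = X[:, 0] + 0.01 * rng.standard_normal(40)     # two nearly collinear columns
    M = [0, 1, 2]                                          # a redundant candidate subset
    print(h_indicator(X, M, np.array([1.0, 1.0, 2.0]), eps=1.0))   # typically 0 here

In high dimensions this enumeration is of course intractable; the point is only to make the definitions of $b_{\min}$ and $h(\beta_M)$ operational.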
Proof of Theorem 3.8.
Recall that $h(\beta_{M_o}) = I\bigl\{\|X'(X_{M_o}\beta_{M_o} - Xb'_{\min})\| \ge \varepsilon\bigr\}$, where $b'_{\min}$ solves $\min_{b\in\mathbb R^p}\|X'(X_{M_o}\beta_{M_o} - Xb)\|$ subject to $\|b\|_0 \le |M_o|-1$. To show the desired result, let $b_{\min}$ be the solution to $\min_{b\in\mathbb R^p}\|X'(X_{M_o}\mathrm E_y(\hat\beta_{M_o}) - Xb)\|$ subject to $\|b\|_0 \le |M_o|-1$. Then observe that
\[
\|X'(X_{M_o}\mathrm E_y(\hat\beta_{M_o}) - Xb_{\min})\|
\le \|X'(X_{M_o}\mathrm E_y(\hat\beta_{M_o}) - Xb'_{\min})\|
\le \|X'X_{M_o}(\mathrm E_y(\hat\beta_{M_o}) - \beta_{M_o})\| + \|X'(X_{M_o}\beta_{M_o} - Xb'_{\min})\|.
\]
Note the difference between $b_{\min}$ and $b'_{\min}$ here. The rightmost term is the quantity of interest because it will become $\mathrm E(h(\beta_{M_o}))$ in the next few steps. The term on the left of the inequality corresponds to the quantity in Condition 3.1.

Adding and subtracting $\hat\beta_{M_o}$ inside the first term on the right side of the second inequality, and applying the triangle inequality, gives
\[
I\bigl\{\|X'(X_{M_o}\mathrm E_y(\hat\beta_{M_o}) - Xb_{\min})\| \ge \varepsilon\bigr\}
\le P\bigl(\|X'X_{M_o}(\beta_{M_o} - \hat\beta_{M_o})\| \ge \tfrac{\varepsilon}{3}\bigr)
+ I\bigl\{\|X'X_{M_o}(\hat\beta_{M_o} - \mathrm E_y(\hat\beta_{M_o}))\| \ge \tfrac{\varepsilon}{3}\bigr\}
+ \mathrm E(h(\beta_{M_o})),
\]
and by applying Lemma 3.5 (writing $p_{M_o} := |M_o|$),
\[
I\bigl\{\|X'(X_{M_o}\mathrm E_y(\hat\beta_{M_o}) - Xb_{\min})\| \ge \varepsilon\bigr\}
\le \frac{3\,p_{M_o}\,\hat\sigma_{M_o}\sqrt{2\Lambda_{M_o}}}{\sqrt{\pi}\,\varepsilon\,\bigl(1-\frac{2}{n-p_{M_o}}\bigr)}\,e^{-\frac{\varepsilon^2}{18\hat\sigma_{M_o}^2\Lambda_{M_o}}}
+ I\bigl\{\|X'X_{M_o}(\hat\beta_{M_o} - \mathrm E_y(\hat\beta_{M_o}))\| \ge \tfrac{\varepsilon}{3}\bigr\}
+ \mathrm E(h(\beta_{M_o})),
\]
where the middle term is written as an indicator function because the probability statement accounts only for the uncertainty in $\beta_{M_o}$. This indicator is 0, by Lemma 3.6, with probability exceeding (15). Therefore, since Condition 3.1 implies that the indicator on the left side of the inequality is 1,
\[
1 - \frac{3\,p_{M_o}\,\hat\sigma_{M_o}\sqrt{2\Lambda_{M_o}}}{\sqrt{\pi}\,\varepsilon\,\bigl(1-\frac{2}{n-p_{M_o}}\bigr)}\,e^{-\frac{\varepsilon^2}{18\hat\sigma_{M_o}^2\Lambda_{M_o}}} \le \mathrm E(h(\beta_{M_o}))
\]
with probability exceeding
\[
\tag{15}
1 - \frac{3\,p_{M_o}\,\sigma_{M_o}\sqrt{2\Lambda_{M_o}}}{\sqrt{\pi}\,\varepsilon}\,e^{-\frac{\varepsilon^2}{18\sigma_{M_o}^2\Lambda_{M_o}}},
\]
due to the uncertainty in observing $Y$. $\square$
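Two elementary inequalities do much of the work in the bounds above and in Lemma A.1 below: the Mills-ratio bound for the standard normal CDF and the inequality $(1+x/n)^{-n}\le e^{-x/2}$. A quick numerical confirmation (ours; the grids are arbitrary) is as follows.

    import numpy as np
    from scipy.stats import norm

    # Mills-ratio bound: Phi(-x) <= exp(-x^2/2) / (x * sqrt(2*pi)) for x > 0.
    x = np.linspace(0.05, 10, 500)
    mills_ok = np.all(norm.cdf(-x) <= np.exp(-x ** 2 / 2) / (x * np.sqrt(2 * np.pi)))

    # (1 + x/n)^(-n) <= exp(-x/2) for 0 <= x <= n/2.
    n = 200.0
    x2 = np.linspace(0.0, n / 2, 500)
    exp_ok = np.all((1 + x2 / n) ** (-n) <= np.exp(-x2 / 2))

    print(mills_ok, exp_ok)   # expect: True True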
The following lemma is needed for the proof of the main result, Theorem 3.9.

Lemma A.1. Assume all conditions and notation of Theorem 3.9, and without loss of generality suppose $\sigma_{M_o} = 1$, where $\sigma_{M_o}$ is the true but unknown error standard deviation. Then the following holds.

Case 1. This case pertains to subsets $M$ with $|M| \le |M_o|$:
\[
P_y\Biggl(\,\bigcap_{j=1}^{|M_o|}\bigcap_{\mathcal M_j}\Bigl\{Y : \Bigl(\tfrac{\mathrm{RSS}_{M_o}}{\mathrm{RSS}_M}\Bigr)^{\frac{n-j-1}{2}} \le e^{-|M_o|\log(p)}\Bigr\}\Biggr) \ge 1 - V_1,
\]
where $\mathcal M_j := \{M\ne M_o : |M| = j\}$,
\[
V_1 := \max_{\substack{M\ne M_o\\ |M|\le|M_o|}}\Biggl\{\frac{6\,|M_o|\,e^{-\frac{\Delta_M}{72}+|M_o|\log(p)}}{\sqrt{2\pi\Delta_M}}
+ |M_o|\,e^{-\frac{\Delta_M}{12}+\frac{|M_o|}{2}+|M_o|\log(p)}
+ |M_o|\,e^{-\frac{\xi_{n,|M_o|}(n-|M_o|-1)\Delta_M}{48\log(n)^{\gamma}\log(p)}+\frac{n-|M_o|}{2}+|M_o|\log(p)}\Biggr\},
\]
and $\xi_{n,j} := 1 - \frac{4\log(n)^{\gamma}\log(p)}{n-j-1} \to 1$.

Case 2. This case pertains to subsets $M$ with $|M_o|+1 \le |M| \le n^{\alpha}$:
\[
P_y\Biggl(\,\bigcap_{j=|M_o|+1}^{n^{\alpha}}\bigcap_{M:|M|=j}\Bigl\{Y : \Bigl(\tfrac{\mathrm{RSS}_{M_o}}{\mathrm{RSS}_M}\Bigr)^{\frac{n-j-1}{2}} \le e^{4e^{2}(n^{\alpha}+j\log(p))}\Bigr\}\Biggr) \ge 1 - V_2,
\]
where
\[
V_2 := n^{\alpha}\exp\Bigl\{-2b_n\Bigl[n^{\alpha}-\tfrac{|M_o|}{4b_n}+\Bigl(1-\tfrac{1}{2b_n}-\tfrac{1}{4b_n\log(p)}\Bigr)(|M_o|+1)\log(p)\Bigr]\Bigr\}
+ \frac{n^{\alpha}\,e^{-\frac{n-n^{\alpha}-|M_o|}{2}+n^{\alpha}\log(p)}}{\sqrt{\pi(n-n^{\alpha}-|M_o|)}},
\]
and $b_n := \frac{n-n^{\alpha}-|M_o|}{n-|M_o|-1} \to 1$.

Note that $V_1$ and $V_2$ both vanish for large $n$ and $p$ by Condition 3.2. This condition also ensures that $\xi_{n,j}\in(0,1)$, which is needed in the proof of this lemma.
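Before giving the proof, it may help to see the two regimes of the lemma in a small simulation. The sketch below is ours; all sizes, coefficients, and seeds are arbitrary. It draws data from a fixed true model $M_o$ with $\sigma_{M_o}=1$ and compares the ratio $\mathrm{RSS}_{M_o}/\mathrm{RSS}_M$ for an under-fitted subset (Case 1, ratio pulled well below 1 by $\Delta_M$) and an over-fitted subset (Case 2, ratio only slightly above 1).

    import numpy as np

    rng = np.random.default_rng(2)
    n, p = 200, 20
    X = rng.standard_normal((n, p))
    M_o = [0, 1, 2]
    beta_o = np.array([2.0, -1.5, 1.0])

    def rss(X, M, y):
        """Residual sum of squares from least squares on the columns in M."""
        XM = X[:, M]
        resid = y - XM @ np.linalg.lstsq(XM, y, rcond=None)[0]
        return float(resid @ resid)

    under, over = [], []
    for _ in range(500):
        y = X[:, M_o] @ beta_o + rng.standard_normal(n)     # sigma_{M_o} = 1, as in the lemma
        r_o = rss(X, M_o, y)
        under.append(r_o / rss(X, [0, 1], y))                # Case 1: a true covariate dropped
        over.append(r_o / rss(X, M_o + [3, 4], y))           # Case 2: two noise covariates added
    print(round(np.mean(under), 3), round(np.mean(over), 3)) # roughly 0.5 versus just above 1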
Proof of Lemma A.1.
There are two cases to consider.
Case 1
For $M\in\mathcal M_j$ with $j\le|M_o|$, let $\xi_{n,j} = 1 - \frac{4\log(n)^{\gamma}\log(p)}{n-j-1}$, which lies in $(0,1)$ by Condition 3.2 for large $n$ and $p$. Then
\[
\tag{16}
P_y\Bigl(\Bigl(\tfrac{\mathrm{RSS}_{M_o}}{\mathrm{RSS}_M}\Bigr)^{\frac{n-j-1}{2}} > \xi_{n,j}^{\frac{n-j-1}{2}}\Bigr)
= P_y\bigl(U'(I_n-H_{M_o})U/\xi_{n,j} > \Delta_M + 2\sqrt{\Delta_M}\,Z + U'(I_n-H_M)U\bigr),
\]
since by assumption $Y = X_{M_o}\beta_{M_o} + U$ with $U\sim \mathrm N_n(0, I_n)$, and so
\[
\mathrm{RSS}_{M_o} = U'(I_n-H_{M_o})U
\quad\text{and}\quad
\mathrm{RSS}_M = \Delta_M + 2\beta_{M_o}'X_{M_o}'(I_n-H_M)U + U'(I_n-H_M)U = Z_{\Delta_M} + U'(I_n-H_M)U,
\]
where $Z_{\Delta_M} := \Delta_M + 2\sqrt{\Delta_M}\,Z$, $Z\sim\mathrm N(0,1)$, and $\Delta_M := \beta_{M_o}'X_{M_o}'(I_n-H_M)X_{M_o}\beta_{M_o}$. Continuing in (16) by subtracting $\chi^2_{n-|M_o|}$ from both sides of the inequality gives
\begin{align}
P_y\Bigl(\tfrac{\mathrm{RSS}_{M_o}}{\mathrm{RSS}_M} > \xi_{n,j}\Bigr)
&= P_y\bigl(\chi^2_{n-|M_o|}/\xi_{n,j} - \chi^2_{n-|M_o|} > Z_{\Delta_M} + \chi^2_{n-j} - \chi^2_{n-|M_o|}\bigr) \nonumber\\
&= P_y\bigl(\chi^2_{n-|M_o|}(1/\xi_{n,j}-1) > Z_{\Delta_M} + \chi^2_{|M_o|} - \chi^2_{j}\bigr) \nonumber\\
&\le P_y\bigl(\chi^2_{n-|M_o|}(1/\xi_{n,j}-1) > Z_{\Delta_M} - \chi^2_{j}\bigr).\tag{17}
\end{align}
The last inequality follows because the $\chi^2_{|M_o|}$ random variable is non-negative, and removing it simplifies the remaining argument. Then
\begin{align*}
P_y\Bigl(\tfrac{\mathrm{RSS}_{M_o}}{\mathrm{RSS}_M} > \xi_{n,j}\Bigr)
&\le P_y\bigl(\xi_{n,j}\chi^2_{j} + \chi^2_{n-|M_o|}(1-\xi_{n,j}) - 2\xi_{n,j}\sqrt{\Delta_M}\,Z > \xi_{n,j}\Delta_M\bigr)\\
&\le P_y\bigl(|Z| > \sqrt{\Delta_M}/6\bigr) + P_y\bigl(\chi^2_{j} > \Delta_M/3\bigr) + P_y\Bigl(\chi^2_{n-|M_o|} > \tfrac{\xi_{n,j}\Delta_M}{3(1-\xi_{n,j})}\Bigr).
\end{align*}
For the second and third terms, apply the Chernoff bound, and evaluate the moment generating function for the $\chi^2_{j}$ and $\chi^2_{n-|M_o|}$ distributions at $1/4$. Accordingly,
\[
P_y\Bigl(\tfrac{\mathrm{RSS}_{M_o}}{\mathrm{RSS}_M} > \xi_{n,j}\Bigr)
\le P_y\bigl(Z < -\sqrt{\Delta_M}/6\bigr) + e^{-\frac{\Delta_M}{12}+\frac{j}{2}} + e^{-\frac{\xi_{n,j}\Delta_M}{12(1-\xi_{n,j})}+\frac{n-|M_o|}{2}}.
\]
Finally, the remaining probability can be controlled by the bound for the CDF of a standard normal random variable: for $x>0$, $\Phi(-x)\le\frac{1}{\sqrt{2\pi}}\frac{e^{-x^2/2}}{x}$. Hence,
\[
P_y\Bigl(\tfrac{\mathrm{RSS}_{M_o}}{\mathrm{RSS}_M} > \xi_{n,j}\Bigr)
\le \frac{6\,e^{-\frac{\Delta_M}{72}}}{\sqrt{2\pi\Delta_M}} + e^{-\frac{\Delta_M}{12}+\frac{|M_o|}{2}} + e^{-\frac{\xi_{n,|M_o|}(n-|M_o|-1)\Delta_M}{48\log(n)^{\gamma}\log(p)}+\frac{n-|M_o|}{2}},
\]
where the last inequality follows by observing that $j\le|M_o|$ and recalling the expression for $\xi_{n,j}$.

Therefore, the probability that $\bigl(\frac{\mathrm{RSS}_{M_o}}{\mathrm{RSS}_M}\bigr)^{\frac{n-j-1}{2}} > \xi_{n,j}^{\frac{n-j-1}{2}}$ is satisfied for some $M$ with $|M|\le|M_o|$ is
\[
P_y\Biggl(\,\bigcup_{j=1}^{|M_o|}\bigcup_{\mathcal M_j}\Bigl\{\Bigl(\tfrac{\mathrm{RSS}_{M_o}}{\mathrm{RSS}_M}\Bigr)^{\frac{n-j-1}{2}} > \xi_{n,j}^{\frac{n-j-1}{2}}\Bigr\}\Biggr)
\le \sum_{j=1}^{|M_o|}\binom{p}{j}\max_{\mathcal M_j} P_y\Bigl(\Bigl(\tfrac{\mathrm{RSS}_{M_o}}{\mathrm{RSS}_M}\Bigr)^{\frac{n-j-1}{2}} > \xi_{n,j}^{\frac{n-j-1}{2}}\Bigr).
\]
Note that
\[
\tag{18}
\binom{p}{j} = \frac{p(p-1)\cdots(p-j+1)}{j!} = \frac{p^{j}\bigl(1-\frac{1}{p}\bigr)\cdots\bigl(1-\frac{j-1}{p}\bigr)}{j!} \le p^{j}.
\]
In fact, Luo and Chen (2013) show that if $\log(j)/\log(p)\to\delta$ as $p\to\infty$, for some $\delta>0$, then $\log\binom{p}{j} = j\log(p)(1-\delta)(1+o(1))$. Thus,
\[
P_y\Biggl(\,\bigcup_{j=1}^{|M_o|}\bigcup_{\mathcal M_j}\Bigl\{\Bigl(\tfrac{\mathrm{RSS}_{M_o}}{\mathrm{RSS}_M}\Bigr)^{\frac{n-j-1}{2}} > \xi_{n,j}^{\frac{n-j-1}{2}}\Bigr\}\Biggr)
\le \sum_{j=1}^{|M_o|}\max_{\mathcal M_j}\Biggl\{\frac{6\,e^{-\frac{\Delta_M}{72}+j\log(p)}}{\sqrt{2\pi\Delta_M}}
+ e^{-\frac{\Delta_M}{12}+\frac{|M_o|}{2}+j\log(p)}
+ e^{-\frac{\xi_{n,|M_o|}(n-|M_o|-1)\Delta_M}{48\log(n)^{\gamma}\log(p)}+\frac{n-|M_o|}{2}+j\log(p)}\Biggr\}.
\]
Since $\xi_{n,|M_o|}\to 1$, this bound vanishes by Condition 3.2. Therefore, with probability exceeding one minus the above bound,
\[
\Bigl(\frac{\mathrm{RSS}_{M_o}}{\mathrm{RSS}_M}\Bigr)^{\frac{n-j-1}{2}}
\le \Bigl(1 - \frac{4\log(n)^{\gamma}\log(p)}{n-j-1}\Bigr)^{\frac{n-j-1}{2}}
\le e^{-2\log(n)^{\gamma}\log(p)}
\]
uniformly over all $M$ such that $|M|\le|M_o|$.
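The bound (18) on the number of candidate subsets of each size is elementary but worth a sanity check with exact integer arithmetic; the values of $p$ and $j$ below are arbitrary choices of ours.

    from math import comb

    p = 5000
    print(all(comb(p, j) <= p ** j for j in range(1, 60)))   # expect: True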
Case 2
Consider any subset $M$ with $|M_o| < |M| \le n^{\alpha}$ for some positive constant $\alpha < 1$, and let $\{a_n\}$ be an arbitrary sequence of numbers. To begin, repeating the steps in (17), but subtracting $\chi^2_{n-j}/\xi_{n,j}$ on both sides instead of $\chi^2_{n-|M_o|}$, and replacing the label $\xi_{n,j}$ with $a_n$, yields
\[
P_y\Bigl(\Bigl(\tfrac{\mathrm{RSS}_{M_o}}{\mathrm{RSS}_M}\Bigr)^{\frac{n-j-1}{2}} > a_n^{\frac{n-j-1}{2}}\Bigr)
\le P_y\bigl(\chi^2_{j} > a_n Z_{\Delta_M} + \chi^2_{n-j}(a_n - 1)\bigr),
\]
where $Z_{\Delta_M} = \Delta_M + 2\sqrt{\Delta_M}\,Z$, $Z\sim\mathrm N(0,1)$, and $\Delta_M = \beta_{M_o}'X_{M_o}'(I_n-H_M)X_{M_o}\beta_{M_o}$. Since $M_o\subset M$ implies $\Delta_M = 0$, the above bound can be simplified by including in the subset $M$ any covariates in $M_o$ not already included in $M$. Accordingly, let $M' := M\cup M_o$, which includes $j+l$ covariates, where $l\in\{0,\ldots,|M_o|\}$ is the number of covariates not shared by $M$ and $M_o$. Then, because $\mathrm{RSS}_{M'} \le \mathrm{RSS}_M$,
\[
P_y\Bigl(\Bigl(\tfrac{\mathrm{RSS}_{M_o}}{\mathrm{RSS}_M}\Bigr)^{\frac{n-j-1}{2}} > a_n^{\frac{n-j-1}{2}}\Bigr)
\le P_y\Bigl(\Bigl(\tfrac{\mathrm{RSS}_{M_o}}{\mathrm{RSS}_{M'}}\Bigr)^{\frac{n-j-1}{2}} > a_n^{\frac{n-j-1}{2}}\Bigr)
\le P_y\bigl(\chi^2_{j+l} > \chi^2_{n-j-l}(a_n-1)\bigr),
\]
and for any nonnegative $s\in\mathbb R$,
\begin{align}
P_y\Bigl(\tfrac{\mathrm{RSS}_{M_o}}{\mathrm{RSS}_M} > a_n\Bigr)
&\le P_y\bigl(\{\chi^2_{j+l} > s(a_n-1)\}\cap\{\chi^2_{n-j-l}\ge s\}\bigr)
+ P_y\bigl(\{\chi^2_{j+l} > \chi^2_{n-j-l}(a_n-1)\}\cap\{\chi^2_{n-j-l}< s\}\bigr) \nonumber\\
&\le P_y\bigl(\chi^2_{j+l} > s(a_n-1)\bigr) + P_y\bigl(\chi^2_{n-j-l} < s\bigr).\tag{19}
\end{align}
Consider each of these last two terms in turn. For the first term, apply the Chernoff bound, and evaluate the moment generating function for the $\chi^2_{j+l}$ distribution at $1/4$. That gives
\[
P_y\bigl(\chi^2_{j+l} > s(a_n-1)\bigr) \le 2^{\frac{j+l}{2}}\,e^{-\frac{s(a_n-1)}{4}} \le e^{-\frac{s(a_n-1)}{4}+\frac{j+l}{2}}.
\]
For the second term in (19), write out the expression to evaluate the probability explicitly, and then apply the simple bound $e^{-x}\le 1$ for $x\ge 0$. Noting that $s>0$,
\[
P_y\bigl(\chi^2_{n-j-l} < s\bigr)
\le \frac{s^{\frac{n-j-l}{2}}}{2^{\frac{n-j-l}{2}}\,\Gamma\bigl(\frac{n-j-l}{2}\bigr)\,\frac{n-j-l}{2}}
\le \frac{1}{\sqrt{\pi(n-j-l)}}\Bigl(\frac{e\cdot s}{n-j-l}\Bigr)^{\frac{n-j-l}{2}},
\]
where the last inequality follows from the well-known Stirling lower bound on the gamma function,
\[
\Gamma(x) \ge \sqrt{2\pi}\,x^{x-\frac12}e^{-x}\quad\text{for } x>0.
\]
It is clear from the last expression that for the probability to vanish, $e\cdot s$ must grow more slowly than $n-j-l$. Accordingly, choosing $s = \frac{n-j-l}{e^{2}}$ gives
\[
P_y\Bigl(\chi^2_{n-j-l} < \frac{n-j-l}{e^{2}}\Bigr) \le \frac{e^{-\frac{n-j-l}{2}}}{\sqrt{\pi(n-j-l)}}.
\]
Combining the two bounds for (19) now yields
\[
P_y\Bigl(\tfrac{\mathrm{RSS}_{M_o}}{\mathrm{RSS}_M} > a_n\Bigr)
\le e^{-\frac{(n-j-l)(a_n-1)}{4e^{2}}+\frac{j+l}{2}} + \frac{e^{-\frac{n-j-l}{2}}}{\sqrt{\pi(n-j-l)}}.
\]
It only remains to choose the smallest $a_n$ such that the first term in the bound vanishes exponentially fast, so that the cumulative probability will vanish over all subsets $M$ with $|M|\le n^{\alpha}$. Accordingly, it should become apparent shortly that a good choice is
\[
\tag{20}
a_n = 1 + \frac{8e^{2}(n^{\alpha}+j\log(p))}{n-j-1}.
\]
The probability that $\bigl(\frac{\mathrm{RSS}_{M_o}}{\mathrm{RSS}_M}\bigr)^{\frac{n-j-1}{2}} > a_n^{\frac{n-j-1}{2}}$ is satisfied for some $M$ with $|M_o| < |M| \le n^{\alpha}$ is
\[
P_y\Biggl(\,\bigcup_{j=|M_o|+1}^{n^{\alpha}}\bigcup_{M:|M|=j}\Bigl\{\Bigl(\tfrac{\mathrm{RSS}_{M_o}}{\mathrm{RSS}_M}\Bigr)^{\frac{n-j-1}{2}} > a_n^{\frac{n-j-1}{2}}\Bigr\}\Biggr)
\le \sum_{j=|M_o|+1}^{n^{\alpha}}\binom{p}{j}\max_{M:|M|=j} P_y\Bigl(\Bigl(\tfrac{\mathrm{RSS}_{M_o}}{\mathrm{RSS}_M}\Bigr)^{\frac{n-j-1}{2}} > a_n^{\frac{n-j-1}{2}}\Bigr).
\]
Thus, bounding the binomial coefficient as in (18), and substituting (20) for $a_n$, yields
\begin{align*}
P_y\Biggl(\,\bigcup_{j=|M_o|+1}^{n^{\alpha}}\bigcup_{M:|M|=j}\Bigl\{\Bigl(\tfrac{\mathrm{RSS}_{M_o}}{\mathrm{RSS}_M}\Bigr)^{\frac{n-j-1}{2}} > a_n^{\frac{n-j-1}{2}}\Bigr\}\Biggr)
&\le \sum_{j=|M_o|+1}^{n^{\alpha}}\Biggl[\exp\Bigl\{-2b_n\Bigl[n^{\alpha}-\tfrac{l}{4b_n}+\Bigl(1-\tfrac{1}{2b_n}-\tfrac{1}{4b_n\log(p)}\Bigr)j\log(p)\Bigr]\Bigr\}
+ \frac{e^{-\frac{n-j-|M_o|}{2}+j\log(p)}}{\sqrt{\pi(n-j-|M_o|)}}\Biggr]\\
&\le n^{\alpha}\exp\Bigl\{-2b_n\Bigl[n^{\alpha}-\tfrac{|M_o|}{4b_n}+\Bigl(1-\tfrac{1}{2b_n}-\tfrac{1}{4b_n\log(p)}\Bigr)(|M_o|+1)\log(p)\Bigr]\Bigr\}
+ \frac{n^{\alpha}\,e^{-\frac{n-n^{\alpha}-|M_o|}{2}+n^{\alpha}\log(p)}}{\sqrt{\pi(n-n^{\alpha}-|M_o|)}},
\end{align*}
where $b_n := \frac{n-n^{\alpha}-|M_o|}{n-|M_o|-1} \to 1$. Note that this bound vanishes by Condition 3.2. Therefore, with probability exceeding one minus the above bound,
\[
\Bigl(\frac{\mathrm{RSS}_{M_o}}{\mathrm{RSS}_M}\Bigr)^{\frac{n-j-1}{2}}
\le \Bigl(1 + \frac{8e^{2}(n^{\alpha}+j\log(p))}{n-j-1}\Bigr)^{\frac{n-j-1}{2}}
\le e^{4e^{2}(n^{\alpha}+j\log(p))},
\]
uniformly over all $M$ such that $|M_o| < |M| \le n^{\alpha}$. $\square$
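The chi-square tail bounds used repeatedly in this proof are straightforward to confirm numerically. The sketch below is ours; the degrees of freedom and grids are arbitrary, and the constant $(n-j-l)/e^{2}$ in the lower-tail check reflects the choice of $s$ made above.

    import numpy as np
    from scipy.stats import chi2

    # Chernoff bound: P(chi2_k > t) <= 2^(k/2) * exp(-t/4), checked on a log scale.
    ok_upper = True
    for k in [5, 20, 100, 400]:
        t = np.linspace(0.5, 10 * k, 200)
        ok_upper &= bool(np.all(chi2.logsf(t, df=k) <= (k / 2) * np.log(2) - t / 4 + 1e-10))

    # Lower-tail bound: P(chi2_k < k/e^2) <= exp(-k/2) / sqrt(pi * k).
    ok_lower = all(
        chi2.cdf(k / np.e ** 2, df=k) <= np.exp(-k / 2) / np.sqrt(np.pi * k)
        for k in [5, 20, 100, 400]
    )
    print(ok_upper, ok_lower)   # expect: True True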
Proof of Theorem 3.9.
Without loss of generality suppose $\sigma_{M_o} = 1$, where $\sigma_{M_o}$ is the true but unknown error standard deviation. For $j\in\{1,\ldots,p\}$ define the classes of subsets $\mathcal M_j := \{M\ne M_o : |M|=j\}$, and recall that $\hat\sigma_M^2 := \mathrm{RSS}_M/(n-|M|)$. It will first be shown that for any subset of covariates $M\ne M_o$ the ratio $\frac{r(M\mid Y)}{r(M_o\mid Y)}$ vanishes in probability for large $n$ and $p$. Accordingly, for any $M\in\mathcal M_j$,
\begin{align*}
\frac{r(M\mid Y)}{r(M_o\mid Y)}
&= \pi^{\frac{j-|M_o|}{2}}\,\frac{\Gamma\bigl(\frac{n-j}{2}\bigr)}{\Gamma\bigl(\frac{n-|M_o|}{2}\bigr)}\,
\frac{\mathrm{RSS}_{M_o}^{\frac{n-|M_o|-1}{2}}}{\mathrm{RSS}_{M}^{\frac{n-j-1}{2}}}\,
\frac{\mathrm E(h(\beta_M))}{\mathrm E(h(\beta_{M_o}))}\\
&= \pi^{\frac{j-|M_o|}{2}}\,\frac{\Gamma\bigl(\frac{n-j}{2}\bigr)}{\Gamma\bigl(\frac{n-|M_o|}{2}\bigr)}\,
\Bigl(\frac{\mathrm{RSS}_{M_o}}{\mathrm{RSS}_M}\Bigr)^{\frac{n-j-1}{2}}\,
\mathrm{RSS}_{M_o}^{\frac{j-|M_o|}{2}}\,
\frac{\mathrm E(h(\beta_M))}{\mathrm E(h(\beta_{M_o}))}.
\end{align*}
Before proceeding with the rest of the proof, the following notation is needed. $V_1$ and $V_2$ are as stated in Lemma A.1. As in Theorem 3.8, define
\[
V_3 := \frac{3\,p_{M_o}\,\sigma_{M_o}\sqrt{2\Lambda_{M_o}}}{\sqrt{\pi}\,\varepsilon}\,e^{-\frac{\varepsilon^2}{18\sigma_{M_o}^2\Lambda_{M_o}}},
\]
and, corresponding to Theorem 3.7, define
\[
V_4 := \sum_{j=1}^{n^{\alpha}}\sum_{M\in\mathcal M_j}\frac{3|M|\,\sigma_M\sqrt{2\Lambda_M}}{\sqrt{\pi}\,\varepsilon}\,e^{-\frac{\varepsilon^2}{18\sigma_M^2\Lambda_M}}
\le \sum_{j=1}^{n^{\alpha}}\max_{\mathcal M_j}\frac{3j\,\sigma_M\sqrt{2\Lambda_M}}{\sqrt{\pi}\,\varepsilon}\,e^{-\frac{\varepsilon^2}{18\sigma_M^2\Lambda_M}+j\log(p)},
\]
by bounding the binomial coefficient as in (18). Note that $V_4$ then vanishes by Condition 3.4. Further, recall that $\mathrm{RSS}_{M_o}\sim\chi^2_{n-|M_o|}$, so by the Chernoff bound (evaluating the moment generating function at $1/4$),
\[
P_y\bigl(\chi^2_{n-|M_o|} > 3(n-|M_o|)\bigr) \le e^{-\frac{n-|M_o|}{4}} =: V_5.
\]
With this notation it is now possible to account for all of the uncertainty due to $Y$. Accordingly, by Theorem 3.8, with probability exceeding $1 - V_3 - V_5$,
\[
\tag{21}
\frac{r(M\mid Y)}{r(M_o\mid Y)}
\le \frac{\Gamma\bigl(\frac{n-j}{2}\bigr)}{\Gamma\bigl(\frac{n-|M_o|}{2}\bigr)}\,
\Bigl(\frac{\mathrm{RSS}_{M_o}}{\mathrm{RSS}_M}\Bigr)^{\frac{n-j-1}{2}}\,
\bigl(3\pi(n-|M_o|)\bigr)^{\frac{j-|M_o|}{2}}\,
\frac{\mathrm E(h(\beta_M))}{1 - g(M_o,n,p)},
\]
where
\[
g(M,n,p) := \frac{3j\,\hat\sigma_M\sqrt{2\Lambda_M}}{\sqrt{\pi}\,\varepsilon\,\bigl(1-\frac{2}{n-j}\bigr)}\,e^{-\frac{\varepsilon^2}{18\hat\sigma_M^2\Lambda_M}}.
\]
Further, fix $A\in(0,1)$, and take $n$ and $p$ sufficiently large so that $g(M_o,n,p) < A$.

Suppose $j\le|M_o|$. As in Jameson (2013), the ratio of gamma functions can be bounded by
\[
\frac{\Gamma\bigl(\frac{n-j}{2}\bigr)}{\Gamma\bigl(\frac{n-|M_o|}{2}\bigr)}
= \frac{\Gamma\bigl(\frac{n-|M_o|}{2}+\frac{|M_o|-j}{2}\bigr)}{\Gamma\bigl(\frac{n-|M_o|}{2}\bigr)}
\le \frac{n-|M_o|}{2}\Bigl(\frac{n-j}{2}\Bigr)^{\frac{|M_o|-j}{2}-1}.
\]
Applying Lemma A.1 to bound the ratio of residual sums of squares, and bounding the expectation by 1, with probability exceeding $1 - V_1 - V_3 - V_5$, (21) implies
\[
\tag{22}
\frac{r(M\mid Y)}{r(M_o\mid Y)}
\le \frac{n-|M_o|}{2}\Bigl(\frac{n-j}{2}\Bigr)^{\frac{|M_o|-j}{2}-1}
\frac{\bigl(3\pi(n-|M_o|)\bigr)^{\frac{j-|M_o|}{2}}}{1-A}\,e^{-2\log(n)^{\gamma}\log(p)}
\le \frac{e^{-2\log(n)^{\gamma}\log(p)}}{1-A},
\]
where the last inequality follows for all $n$ sufficiently large relative to $|M_o|$. Since $n\gg|M_o|$, assume without loss of generality that $n$ is sufficiently large.

Conversely, suppose $|M_o| < j \le n^{\alpha}$. In this setting, as in Jameson (2013), the ratio of gamma functions can be bounded by
\[
\Biggl[\frac{\Gamma\bigl(\frac{n-j}{2}\bigr)}{\Gamma\bigl(\frac{n-|M_o|}{2}\bigr)}\Biggr]^{-1}
= \frac{\Gamma\bigl(\frac{n-|M_o|}{2}\bigr)}{\Gamma\bigl(\frac{n-j}{2}\bigr)}
= \frac{\Gamma\bigl(\frac{n-j}{2}+\frac{j-|M_o|}{2}\bigr)}{\Gamma\bigl(\frac{n-j}{2}\bigr)}
\ge \Bigl(\frac{n-j-1}{2}\Bigr)^{\frac{j-|M_o|}{2}}.
\]
Applying Theorem 3.7 to bound the expectation, and applying Lemma A.1 to bound the ratio of residual sums of squares, with probability exceeding $1 - V_2 - V_3 - V_4 - V_5$, (21) implies
\begin{align}
\frac{r(M\mid Y)}{r(M_o\mid Y)}
&\le e^{4e^{2}(n^{\alpha}+j\log(p))}\Bigl(\frac{n-j-1}{2}\Bigr)^{-\frac{j-|M_o|}{2}}
\bigl(3\pi(n-|M_o|)\bigr)^{\frac{j-|M_o|}{2}}\,\frac{g(M,n,p)}{1-A} \nonumber\\
&\le \frac{3j\,\hat\sigma_M\sqrt{2\Lambda_M}\;
e^{-\frac{\varepsilon^2}{18\hat\sigma_M^2\Lambda_M}+4e^{2}(n^{\alpha}+j\log(p))+\frac{j-|M_o|}{2}\log\bigl(\frac{6\pi(n-|M_o|)}{n-n^{\alpha}-2}\bigr)}}
{\sqrt{\pi}\,\varepsilon\,\bigl(1-\frac{2}{n-j}\bigr)(1-A)}.\tag{23}
\end{align}
Notice that Theorem 3.7 is being applied here with $n^{\alpha}$ in place of $c\,n$. This can be done without loss of generality because $n^{\alpha}$ grows slower than $c\,n$ for any choices of $\alpha, c\in(0,1)$. It remains to show that $\sum_{j=1}^{n^{\alpha}}\sum_{M\in\mathcal M_j}\frac{r(M\mid Y)}{r(M_o\mid Y)}$ vanishes in probability for large $n$ and $p$. Apply the bounds in (22) and (23) in the following argument:
\[
\sum_{j=1}^{n^{\alpha}}\sum_{M\in\mathcal M_j}\frac{r(M\mid Y)}{r(M_o\mid Y)}
\le \underbrace{\sum_{j=1}^{|M_o|}\binom{p}{j}\max_{M\in\mathcal M_j}\frac{r(M\mid Y)}{r(M_o\mid Y)}}_{=:\,S_1}
+ \underbrace{\sum_{j=|M_o|+1}^{n^{\alpha}}\binom{p}{j}\max_{|M|=j}\frac{r(M\mid Y)}{r(M_o\mid Y)}}_{=:\,S_2}
\]