A Risk Ratio Comparison of $\ell_0$ and $\ell_1$ Penalized Regression
Kory D. Johnson, Dongyu Lin, Lyle H. Ungar, Dean P. Foster, and Robert A. Stine
The Wharton School, University of Pennsylvania; AT&T Labs; The School of Engineering and Applied Science, University of Pennsylvania
June 27, 2018
Abstract
There has been an explosion of interest in using $\ell_1$-regularization in place of $\ell_0$-regularization for feature selection. We present theoretical results showing that while $\ell_1$-penalized linear regression never outperforms $\ell_0$-regularization by more than a constant factor, in some cases using an $\ell_1$ penalty is infinitely worse than using an $\ell_0$ penalty. We also compare algorithms for solving these two problems and show that although solutions can be found efficiently for the $\ell_1$ problem, the "optimal" $\ell_1$ solutions are often inferior to $\ell_0$ solutions found using classic greedy stepwise regression. Furthermore, we show that solutions obtained by solving the convex $\ell_1$ problem can be improved by selecting the best of the $\ell_1$ models (fit for a range of regularization penalties) using an $\ell_0$ criterion. In other words, an approximate solution to the right problem can be better than the exact solution to the wrong problem.

Keywords:
Variable Selection, Streaming Feature Selection, Regularization, Stepwise Regression, Submodularity
1 Introduction

In the past decade, a rich literature has developed around $\ell_1$-regularization for linear regression, including the lasso (Tibshirani, 1996), LARS (Efron et al., 2004), the fused lasso (Tibshirani et al., 2005), the elastic net (Zou and Hastie, 2005), the grouped lasso (Yuan and Lin, 2006), the adaptive lasso (Zou, 2006), and the relaxed lasso (Meinshausen, 2007). These methods, like the $\ell_0$-penalized regression methods that preceded them (Akaike, 1974; Schwarz, 1978; Foster and George, 1994), address variable selection problems in which there is a large set of potential features, only a few of which are likely to be helpful. This type of sparsity is common in machine learning tasks, such as predicting disease based on thousands of genes, or predicting the topic of a document based on the occurrences of hundreds of thousands of words.

$\ell_1$-regularization is popular because, unlike the $\ell_0$ regularization historically used for feature selection in regression problems, the $\ell_1$ penalty gives rise to a convex problem that can be solved efficiently by convex optimization methods. $\ell_1$ methods have given reasonable results on a number of data sets, but there has been no careful analysis of how they perform when compared to $\ell_0$ methods. This paper provides a formal analysis of the two methods and shows that $\ell_1$ can give arbitrarily worse models. We offer some intuition as to why this is the case, namely that $\ell_1$ shrinks coefficients too much and does not zero out enough of them, and we suggest how to use an $\ell_0$ penalty with $\ell_1$ optimization.

We study the problem of selecting predictive features from a large feature space. We assume the classic normal linear model
\[
y = X\beta + \epsilon, \qquad \epsilon \sim N_n(0, \sigma^2 I_n),
\]
with $n$ observations $y = (y_1, \ldots, y_n)'$ and $p$ features $x_1, \ldots, x_p$, where $X = (x_1, \ldots, x_p)$ is an $n \times p$ design matrix and $\beta = (\beta_1, \ldots, \beta_p)'$ is the vector of coefficients. We expect most of the elements of $\beta$ to be 0; hence, generating good predictions requires identifying the small subset of predictive features. This standard linear model pervades the statistics and machine learning literature. In modern applications $p$ can approach millions, making the selection of an appropriate subset of these features essential for prediction. The size and scope of these problems raise concerns about both the speed and the statistical robustness of the selection procedure: it must be fast enough to be computationally feasible, and it must find signal without over-fitting the data.

The traditional statistical approach to this problem, the $\ell_0$-regularization problem, finds an estimator that minimizes the $\ell_0$-penalized sum of squared errors,
\[
\arg\min_{\beta}\ \big\{\|y - X\beta\|^2 + \lambda\|\beta\|_{\ell_0}\big\}, \tag{1}
\]
where $\|\beta\|_{\ell_0} = \sum_{i=1}^p I\{\beta_i \neq 0\}$ counts the number of nonzero coefficients. However, this problem is NP-hard (Natarajan, 1995). A tractable alternative relaxes the $\ell_0$ penalty to the $\ell_1$ norm, $\|\beta\|_{\ell_1} = \sum_{i=1}^p |\beta_i|$, and seeks
\[
\arg\min_{\beta}\ \big\{\|y - X\beta\|^2 + \lambda\|\beta\|_{\ell_1}\big\}. \tag{2}
\]
This is known as the $\ell_1$-regularization problem (Tibshirani, 1996); it is convex and can be solved efficiently by a variety of methods (Tibshirani, 1996; Efron et al., 2004; Candes and Tao, 2007).

We assess our models using the predictive risk function
\[
R(\beta, \hat\beta) = E_\beta\|\hat{y} - E(y \mid X)\|^2 = E_\beta\|X\hat\beta - X\beta\|^2. \tag{3}
\]
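As a concrete illustration of the two optimization problems (our own sketch, not code from the paper; the data sizes, the penalty level, and the helper `forward_stepwise` are illustrative assumptions), the convex problem (2) can be handed to an off-the-shelf lasso solver, while the NP-hard problem (1) is typically approximated greedily:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p, k, sigma = 100, 200, 4, 1.0                # illustrative sizes
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:k] = 3.0                                   # sparse true coefficients
y = X @ beta + sigma * rng.standard_normal(n)

# l1 problem (2): convex, solved exactly by coordinate descent.
lasso = Lasso(alpha=0.1).fit(X, y)
print("lasso support size:", int(np.sum(lasso.coef_ != 0)))

# l0 problem (1): NP-hard; greedy forward stepwise is a standard approximation.
def forward_stepwise(X, y, steps):
    active, resid = [], y.copy()
    for _ in range(steps):
        j = int(np.argmax(np.abs(X.T @ resid)))  # feature most correlated with residual
        active.append(j)
        b, *_ = np.linalg.lstsq(X[:, active], y, rcond=None)
        resid = y - X[:, active] @ b             # refit, then update the residual
    return sorted(active)

print("stepwise support:", forward_stepwise(X, y, k))
```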
We are interested in the ratios of the risks of the estimates produced by these two criteria. Unlike in-sample measures of fit, predictive risk captures the relevant prediction error on future observations, ignoring the irreducible variance; smaller risk implies better expected prediction performance. It is an ideal metric for analyzing testing or out-of-sample error when the parameter distribution is assumed known. Much of the recent literature has focused instead on selection consistency: whether or not the true variables can be identified in the limit. In real applications, however, ubiquitous multicollinearity makes predictors hard to separate into "true" and "false" ones. Here we focus on predictive accuracy and advocate the concept of predictive risk.

Our first result, given below as Theorems 1 and 2 in Section 2, is that $\ell_0$ estimates provide more accurate predictions than $\ell_1$ estimates do, in the sense of minimax risk ratios. This is illustrated in Figure 1. Proofs of these theorems are in Appendix A.

• $\inf_{\gamma_0}\sup_\beta R(\beta, \hat\beta_{\ell_0}(\gamma_0))/R(\beta, \hat\beta_{\ell_1}(\gamma_1))$ is bounded by a small constant; furthermore, it is close to one for most $\gamma_1$, especially the large $\gamma_1$ used in sparse systems.

• $\inf_{\gamma_1}\sup_\beta R(\beta, \hat\beta_{\ell_1}(\gamma_1))/R(\beta, \hat\beta_{\ell_0}(\gamma_0))$ tends to infinity quadratically in $\gamma_0$; in an extremely sparse system, the $\ell_1$ estimate may perform arbitrarily badly.

• The $\ell_1$ estimate $\hat\beta_{\ell_1}(\gamma_1)$ is more likely to have a larger risk than the $\ell_0$ estimate $\hat\beta_{\ell_0}(\gamma_0)$.

A detailed discussion of the risk ratios is presented in Section 2, along with other advantages of $\ell_0$ regularization. Our other comparative results show that applying the $\ell_0$ criterion along an $\ell_1$ subset search path finds the best performing model (Section 2.3.1), and that running stepwise regression and the lasso on a reduced NP-hard example shows that stepwise regression gives better solutions (Section 2.3.2).

We compare $\ell_0$ and $\ell_1$ penalties under three assumptions about the structure of the feature matrix $X$: independence, incoherence (near independence), and NP-hardness. For independence, we prove the theoretical results mentioned above. For near independence, we find that $\ell_1$-penalized regression followed by $\ell_0$ selection outperforms $\ell_1$ selection alone. For the NP-hard case, we find that if one could do the search, the risk ratio could be arbitrarily bad for $\ell_1$ relative to $\ell_0$.

The predictive risk (3) is the relevant component of the expected squared error loss in predicting a new observation, as the following standard decomposition makes clear. For increased generality, let $E[y] = \eta$ and let $H_X$ be the projection onto the column space of $X$. Then, for an independent future observation $y^*$,
\[
E\|y^* - X\hat\beta\|^2 = E\|y^* - \eta\|^2 + E\|\eta - H_X\eta\|^2 + E\|H_X\eta - X\hat\beta\|^2
= \underbrace{n\sigma^2}_{\text{common error}} + \underbrace{\|(I - H_X)\eta\|^2}_{\text{wrong } X} + \underbrace{E\|H_X\eta - X\hat\beta\|^2}_{\text{predictive risk}}.
\]
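The decomposition is easy to verify by simulation; the following minimal Monte Carlo sketch (entirely our own illustration, with an arbitrary misspecified mean $\eta$) checks that the three terms add up for the least squares estimator:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, sigma, reps = 50, 5, 1.0, 20000
X = rng.standard_normal((n, p))
H = X @ np.linalg.inv(X.T @ X) @ X.T             # projection onto col(X)
eta = 2.0 * rng.standard_normal(n)               # E[y], deliberately not in col(X)

total = 0.0
for _ in range(reps):
    y = eta + sigma * rng.standard_normal(n)     # training sample
    y_new = eta + sigma * rng.standard_normal(n) # independent future sample
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    total += np.sum((y_new - X @ b) ** 2)
lhs = total / reps

common = n * sigma**2                            # unavoidable error
wrong_x = np.sum(((np.eye(n) - H) @ eta) ** 2)   # error from the wrong X
pred_risk = p * sigma**2                         # E||H eta - X b||^2 for least squares
print(lhs, common + wrong_x + pred_risk)         # the two numbers should nearly agree
```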
Figure 1: Left: The gray area shows the feasible region for the log risk ratio; the log risk ratio is above zero when $\ell_0$ produces a better fit, and the graph shows that most of the time $\ell_0$ is better. The estimators compared are calibrated to have the same risk at $\beta = 0$, i.e., $R(0, \hat\beta_{\ell_0}(\gamma_0)) = R(0, \hat\beta_{\ell_1}(\gamma_1))$. Middle: This graph traces out the bottom envelope of the left-hand graph (but takes the reciprocal risk ratio and no longer uses the logarithmic scale). The dashed line displays $\sup_\beta R(\beta, \hat\beta_{\ell_0}(\gamma_0))/R(\beta, \hat\beta_{\ell_1}(\gamma_1))$ for $\gamma_0$ calibrated to have the same risk at zero as $\gamma_1$; this maximum ratio tends to 1 as $\gamma_1 \to \infty$ (the sparse case). With an optimal choice of $\gamma_0$, $\inf_{\gamma_0}\sup_\beta R(\beta, \hat\beta_{\ell_0}(\gamma_0))/R(\beta, \hat\beta_{\ell_1}(\gamma_1))$ (solid line) behaves similarly; in particular, its supremum over $\gamma_1$ is bounded by 1.8. Right: This graph traces out the upper envelopes of the left-hand graph on a normal scale. As $\gamma_1 \to \infty$, $\sup_\beta R(\beta, \hat\beta_{\ell_1}(\gamma_1))/R(\beta, \hat\beta_{\ell_0}(\gamma_0))$ tends to $\infty$, both for $\gamma_0$ calibrated at $\beta = 0$ and for $\gamma_0$ minimizing the maximum risk ratio.

The first term, the common error, is unavoidable regardless of the method being used. All methods we consider, namely linear methods based on $X$, suffer the error from an incorrect $X$. Since $X$ is given, it is more instructive to consider the projection of $\eta$ onto the column space of $X$, defining $X\beta = H_X\eta$. Ignoring these two forms of error leaves the predictive risk function (3).

Predictive risk has guided selection procedures such as Mallows' $C_p$ and RIC. The former results from an unbiased estimate of the predictive risk, while the latter provides minimax control of the risk inflation during model selection. We maintain this minimax viewpoint and show that, in terms of the removable variation in prediction, $\ell_0$ performs better than $\ell_1$.

2 $\ell_0$ solutions give more accurate predictions

Suppose that $\hat\beta$ is an estimator of $\beta$. For this section we assume $X$ is orthogonal (wavelets, Fourier transforms, and principal components, for example, all give orthogonal designs). The $\ell_0$ problem (1) can then be solved by simply picking those predictors whose least squares estimates satisfy $|\hat\beta_i| > \gamma_0$, where the cutoff $\gamma_0$ depends on the penalty $\lambda$ in (1). It was shown (Donoho and Johnstone, 1994; Foster and George, 1994) that $\lambda = 2\sigma^2\log p$ is optimal in the sense that it asymptotically minimizes the maximum predictive risk inflation due to selection.

Let
\[
\hat\beta_{\ell_0}(\gamma_0) = \big(\hat\beta_1 I\{|\hat\beta_1| > \gamma_0\}, \ldots, \hat\beta_p I\{|\hat\beta_p| > \gamma_0\}\big)' \tag{5}
\]
be the $\ell_0$ estimator that solves (1), and let the $\ell_1$ solution to (2) be
\[
\hat\beta_{\ell_1}(\gamma_1) = \big(\mathrm{sign}(\hat\beta_1)(|\hat\beta_1| - \gamma_1)_+, \ldots, \mathrm{sign}(\hat\beta_p)(|\hat\beta_p| - \gamma_1)_+\big)', \tag{6}
\]
where the $\hat\beta_i$ are the least squares estimates.

We are interested in the ratios of the risks of these two estimates,
\[
\frac{R(\beta, \hat\beta_{\ell_0}(\gamma_0))}{R(\beta, \hat\beta_{\ell_1}(\gamma_1))} \quad\text{and}\quad \frac{R(\beta, \hat\beta_{\ell_1}(\gamma_1))}{R(\beta, \hat\beta_{\ell_0}(\gamma_0))},
\]
that is, we want to know how much the risk is inflated when the other criterion is used. The smaller the risk ratio, the less risky (and hence better) the numerator estimate is compared to the denominator estimate; in particular, a risk ratio less than one implies that the numerator estimate is better. Formally, we have the following theorems, whose proofs are given in Appendix A:
Theorem 1.
There exists a constant $C$ such that for any $\gamma_0 \ge 0$,
\[
\inf_{\gamma_1}\sup_\beta \frac{R(\beta, \hat\beta_{\ell_1}(\gamma_1))}{R(\beta, \hat\beta_{\ell_0}(\gamma_0))} \ \ge\ C + \frac{\gamma_0^2}{2}.
\]
That is, given $\gamma_0$, for any $\gamma_1$ there exist $\beta$'s for which the ratio becomes extremely large. Contrast this with the protection provided by $\ell_0$:

Theorem 2.
There exists a constant $C > 0$ such that for any $\gamma_1 > 0$,
\[
\inf_{\gamma_0}\sup_\beta \frac{R(\beta, \hat\beta_{\ell_0}(\gamma_0))}{R(\beta, \hat\beta_{\ell_1}(\gamma_1))} \ \le\ 1 + C\gamma_1^{-1}.
\]
That is, for any $\gamma_1$ we can pick the $\ell_0$ cutoff so that we perform almost as well as $\ell_1$, even in the worst case.

The above theorems can certainly be strengthened, as suggested by the bounds shown in Figure 1, but at the cost of complicating the proofs. We conjecture that there exist constants $r > 1$ and $C_1, C_2, C_3 > 0$ such that
\[
\inf_{\gamma_1}\sup_\beta \frac{R(\beta, \hat\beta_{\ell_1}(\gamma_1))}{R(\beta, \hat\beta_{\ell_0}(\gamma_0))} \ \ge\ C_1\gamma_0^{\,r}, \tag{7}
\]
\[
\inf_{\gamma_0}\sup_\beta \frac{R(\beta, \hat\beta_{\ell_0}(\gamma_0))}{R(\beta, \hat\beta_{\ell_1}(\gamma_1))} \ \le\ 1 + C_2\gamma_1 e^{-C_3\gamma_1^2}. \tag{8}
\]
These theorems say that for any $\gamma_1$ chosen by the algorithm, we can always adapt $\gamma_0$ so that $\hat\beta_{\ell_0}(\gamma_0)$ outperforms $\hat\beta_{\ell_1}(\gamma_1)$ most of the time and loses only a little for some $\beta$'s; but for any $\gamma_0$ chosen, no $\gamma_1$ can perform consistently well on all $\beta$'s.

Because the orthogonality assumption makes the risk functions additive across coordinates (see appendix equations (13) and (14)), we focus on the behavior of a single coefficient $\beta_i$. The risk functions are also symmetric in $\beta$, so only the case $\beta_i \ge 0$ needs to be considered. As Figure 2 illustrates, given $\gamma_1$ we can pick a $\gamma_0$ such that the risk ratio is below 1 for most $\beta$ except around $(\gamma_0 + \gamma_1)/2$, yet this ratio does not exceed one by more than a small factor, even in the worst case.
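Since Appendix A derives the single-coordinate risks (13) and (14) in closed form, the ratio in Figure 2 can be tabulated directly. The following sketch is our own code, using the calibration $\gamma_0 = \gamma_1 + 4\log(\gamma_1)/\gamma_1$ from the figure caption; the grid of $\beta$ values is an assumption:

```python
import numpy as np
from scipy.stats import norm

def risk_l0(beta, g):                            # hard-thresholding risk, eq. (13)
    return ((g - beta) * norm.pdf(g - beta) + (g + beta) * norm.pdf(g + beta)
            + norm.cdf(beta - g) + beta**2 * norm.cdf(g - beta)
            + (1 - beta**2) * norm.sf(g + beta))

def risk_l1(beta, g):                            # soft-thresholding risk, eq. (14)
    return ((-g - beta) * norm.pdf(g - beta) + (beta - g) * norm.pdf(g + beta)
            + (g**2 + 1) * norm.cdf(beta - g) + beta**2 * norm.cdf(g - beta)
            + (g**2 + 1 - beta**2) * norm.sf(g + beta))

for g1 in (2.0, 4.0, 6.0, 10.0):
    g0 = g1 + 4 * np.log(g1) / g1                # calibration used in Figure 2
    beta = np.linspace(0.0, 2 * (g0 + g1), 500)
    ratio = risk_l0(beta, g0) / risk_l1(beta, g1)
    print(g1, float(ratio.max()))                # worst-case ratio shrinks toward 1 as g1 grows
```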
Figure 2: For each $\gamma_1$, we let $\gamma_0 = \gamma_1 + 4\log(\gamma_1)/\gamma_1$. The curves show the ratio of the $\ell_0$ risk to the $\ell_1$ risk against $\beta/(\gamma_0 + \gamma_1)$ for $\gamma_1 = 2, 4, 6, 10$. This choice of $\gamma_0$ makes the risk ratio small at $\beta \approx 0$ and for $\beta \ge \gamma_0$; it is inflated only around $\beta/(\gamma_0 + \gamma_1) = 1/2$, and only very little, especially when $\gamma_1$ is large.

The intuition for why $\ell_0$ fares better than $\ell_1$ in the risk ratio results is that $\ell_1$ must make a "devil's choice" between shrinking the coefficients too much and putting in too many spurious features. $\ell_0$-penalized regression avoids this dilemma.

2.2 $\ell_1$ shrinks coefficients too much
Figure 3:
Left: The $\ell_0$ estimate keeps the least squares value beyond the cutoff, but the $\ell_1$ estimate always shrinks the least squares estimate by a fixed amount. Middle: the simulated model has only one true feature, with true $\beta = 1$, and a thousand spurious features. We compute the average lasso estimate of $\beta$ for a fixed number of features included in the model (as an index of the $\ell_1$ penalty) over several trials; the mean estimate $\bar{\hat\beta}_{\ell_1}$ is always shrunk by at least 20% in this experiment. Right: The Cauchy density has heavier tails than the Laplacian density. Thus, a Laplacian prior tends to shrink large values of $\beta$.

From a frequentist's point of view, the $\ell_1$ estimator (6) shrinks the coefficients and thus is biased (Figure 3, left). In practice, $\hat\beta_{\ell_1}$ can be substantially shrunk towards zero when the system is sparse, as shown in the middle panel of Figure 3.

From a Bayesian's perspective, the $\ell_1$ penalty is equivalent to putting a Laplacian prior on $\beta$ (Tibshirani, 1996; Efron et al., 2004), while the $\ell_0$ penalty can be approximated by Cauchy priors (Johnstone and Silverman, 2005; Foster and Stine, 2005). The right panel of Figure 3 shows that the Cauchy distribution has a much heavier tail than the Laplacian distribution. This implies that when the true $\beta$ is far from 0, the $\ell_1$ penalty will substantially shrink the estimate toward zero.

The bias caused by the shrinkage increases the predictive risk proportionally to the square of the amount of shrinkage. The sparser the problem, the greater the shrinkage, and thus the larger the risk. These results show that in theory the $\ell_0$ estimate has a lower risk and provides a more accurate solution. Empirically, stepwise regression performs well on large data sets, where a sparse solution is particularly preferred (George and Foster, 2000; Foster and Stine, 2004; Zhou et al., 2006).

2.3.1 $\ell_1$ optimization using an $\ell_0$ criterion

We can use the LARS algorithm to generate a set of candidate solutions and then use the $\ell_0$ criterion to find the best of the solutions along the regularization path. We evaluated this method as follows. We simulated $y$ from a thousand features, only 4 of which have nonzero contributions, plus random noise distributed as $N(0, 1)$; both the training set and the test set have size $n = 100$. We applied the lasso algorithm implemented by LARS to this synthetic data set. At each step along the regularization path, the algorithm selects a subset $\mathcal{C} \subset \{1, \ldots, 1000\}$ of features to include in the model. We then adopt a modified RIC criterion suggested in George and Foster (2000),
\[
\|y - X_{\mathcal{C}}\hat\beta_{\mathcal{C}}\|^2 + \sum_{q=1}^{|\mathcal{C}|} 2\log(p/q)\,\sigma^2, \tag{9}
\]
to find an optimal $\mathcal{C}$. The crucial point is that the coefficient estimate $\hat\beta_{\mathcal{C}}$ used in (9) is the least squares estimate obtained by fitting $y$ on $X_{\mathcal{C}} = (x_j)_{j \in \mathcal{C}}$, and not the lasso estimate $\hat\beta_{\ell_1}$ provided by the algorithm. We also use this least squares estimate in the out-of-sample calculations. We compare two cases: the $x_j$ generated independently of each other, so that $X'X$ is nearly diagonal, and the $x_j$ generated with pairwise correlation $\rho = 0.64$. As shown in Figure 4, in the independent case the model that minimizes the $\ell_0$-penalized error (9) contains exactly the four true variables and outperforms every model on the lasso path out of sample; Figure 5 shows the corresponding results for the correlated case.
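A sketch of the selection procedure follows (our own implementation; `lars_path` from scikit-learn stands in for the LARS code used in the paper, and all simulation constants are illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import lars_path

rng = np.random.default_rng(2)
n, p, k, sigma = 100, 1000, 4, 1.0
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:k] = 4.0
y = X @ beta + sigma * rng.standard_normal(n)

# LARS traces the lasso path; coefs has one column of coefficients per step.
alphas, _, coefs = lars_path(X, y, method="lasso")

best_crit, best_set = np.inf, None
for step in range(coefs.shape[1]):
    C = np.flatnonzero(coefs[:, step])           # active subset at this step
    if C.size == 0:
        continue
    # Refit by least squares: criterion (9) uses the LS estimate, not the lasso estimate.
    b, *_ = np.linalg.lstsq(X[:, C], y, rcond=None)
    rss = np.sum((y - X[:, C] @ b) ** 2)
    # Modified RIC penalty from (9): sum_{q=1}^{|C|} 2 log(p/q) sigma^2.
    penalty = 2 * sigma**2 * np.sum(np.log(p / np.arange(1, C.size + 1)))
    if rss + penalty < best_crit:
        best_crit, best_set = rss + penalty, C
print("selected subset:", best_set)              # ideally the four true features
```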
Figure 4: $\ell_0$ penalties help find the best model (independent predictors case). $y$ is simulated from one thousand features, only four of which have nonzero contributions, plus an $N(0, 1)$ error. Both the training set and the test set have size $n = 100$. Each step of the LARS algorithm gives a set of features with nonzero coefficient estimates. We compute the least squares (LS) estimates on this subset and the modified RIC criterion (9) on the training set. We also compare the out-of-sample root mean squared errors using the LS estimates and the lasso estimates along this LARS path. The features are independently generated. The model that minimizes the $\ell_0$-penalized error has exactly four variables in it. It also outperforms any of the $\ell_1$ models out of sample on this data set.
Figure 5: $\ell_0$ penalties help find the best model (correlated predictors case). The setup is exactly the same as in Figure 4 except that each pair of features has correlation $\rho = 0.64$. In this case, the optimal model under the modified RIC criterion has a slightly better RMSE than the best $\ell_1$ model. The lasso out-of-sample RMSE is typically minimized only once the model has included more than 50 features.

In the correlated case there is thus little separating the out-of-sample performance of the $\ell_0$-picked model and the best lasso model, but the lasso adds around 50 more spurious variables. Thus, by combining the computational efficiency of an $\ell_1$ algorithm with the sparsity guaranteed by the $\ell_0$ penalty, we can easily select an accurate model without cross validation.

2.3.2 $\ell_0$ and NP-hardness

The $\ell_0$ problem is NP-hard and hence, at least in theory, intractable. (In practice, of course, people often use approximate solutions to problems that in the worst case are NP-hard.) One of the attractions of $\ell_1$-regularization is that it is convex, hence solvable in polynomial time. In this section, we compare how the two approaches fare on a known NP-hard regression problem.

We start with a simple constructive argument that the risk ratio of $\ell_1$ to $\ell_0$ can be arbitrarily bad. Construct data as follows. Pick a large number of independent features $z_j$. Construct new features $x_1 = z_1 + \epsilon z_2$ and $x_2 = z_1 - \epsilon z_2$, and let $y = (z_1 + z_2)/2$. Then $y$ is reproduced exactly by the two features $x_1$ and $x_2$ alone, but only with coefficients of order $1/\epsilon$. Include the rest of the features $z_j$, $j > 2$, as spurious predictors. As $\epsilon \to 0$, the $\ell_1$ norm of the sparse two-feature solution diverges, so the lasso trades it away for a far less sparse solution, and the risk of $\ell_1$ relative to $\ell_0$ grows without bound.

We next turn to the NP-hard problem itself, using the exact cover by 3-sets construction behind Natarajan's (1995) hardness result: let $y = \mathbf{1}_n$, let $X$ be an $n \times p$ binary matrix with each column having exactly three nonzero elements (so $\|x_i\|^2 = 3$), let $\beta$ be a $p \times 1$ coefficient vector, and for a given $\epsilon > 0$ seek
\[
\min_\beta \|\beta\|_{\ell_0} \quad \text{s.t.} \quad \|y - X\beta\|^2 < \epsilon. \tag{10}
\]
Note that if there is an exact solution to this problem, the number of features chosen is $n/3$. Since the problem is NP-hard, we compare an approximate solution to the $\ell_0$ problem with the exact solution to the $\ell_1$ problem. To this end, we applied the lasso and forward stepwise regression for various $n$. For small $n$, we took the full collection of 3-subsets, i.e., $p = \binom{n}{3}$; for larger $n$, we took $p = 10n$. Tables 1 and 2 list the number of subsets included in each model. Forward stepwise regression always finds fewer subsets, and hence a better solution, than the lasso.
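The experiment is straightforward to replicate in outline. The sketch below is our own reconstruction, with $\epsilon = 1/4$ as in Table 1; the lasso penalty grid and iteration limits are assumptions:

```python
import numpy as np
from itertools import combinations
from sklearn.linear_model import Lasso

n, eps = 9, 0.25
cols = list(combinations(range(n), 3))           # all 3-subsets, p = C(n, 3)
X = np.zeros((n, len(cols)))
for j, c in enumerate(cols):
    X[list(c), j] = 1.0                          # exactly three ones per column
y = np.ones(n)                                   # an exact cover uses n/3 columns

# Lasso: relax the penalty until the residual constraint in (10) is met.
coef = np.zeros(len(cols))
for alpha in np.geomspace(1.0, 1e-4, 60):
    coef = Lasso(alpha=alpha, fit_intercept=False, max_iter=10000).fit(X, y).coef_
    if np.sum((y - X @ coef) ** 2) < eps:
        break
print("lasso support size:", int(np.sum(coef != 0)))

# Forward stepwise regression on the same problem.
active, resid = [], y.copy()
for _ in range(n):
    if np.sum(resid ** 2) < eps:
        break
    j = int(np.argmax(X.T @ resid))              # column best aligned with the residual
    active.append(j)
    b, *_ = np.linalg.lstsq(X[:, active], y, rcond=None)
    resid = y - X[:, active] @ b
print("stepwise support size:", len(active))     # n/3 when an exact cover exists
```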
Table 1: The number of subsets chosen by the lasso and by forward stepwise regression with $\epsilon = 1/4$, for $n = 9, 12, 15, 18, 21, 24, 27, 30$. All 3-subsets were considered, i.e., $p = \binom{n}{3}$. Forward stepwise regression always selects the fewest possible number of subsets, namely $n/3$.

Method      n = 99   n = 240   n = 540   n = 990   n = 1500
Lasso           93       219       504       812       1372
Stepwise        40        96       223       364        595

Table 2: The number of subsets chosen by the lasso and by forward stepwise regression with $\epsilon = 1/4$ and $p = 10n$.

All of our experiments on both synthetic and real data sets show that greedy search algorithms such as stepwise regression, aimed at minimizing the $\ell_0$-regularized error, provide sparser results. This is because $\ell_0$ penalizes the lack of sparsity directly, while $\ell_1$ does not. It is easy to construct an example where $\ell_1$ picks a solution with a smaller $\ell_1$ norm but a less sparse support (Candes et al., 2007).

3 Discussion

In many statistical contexts, the $\ell_0$ regularization criterion is superior to that of $\ell_1$ regularization; $\ell_0$ generally provides a more accurate solution and controls the false discovery rate better. $\ell_1$ can give arbitrarily worse predictive accuracy than $\ell_0$, since $\ell_1$ regularization tends to shrink coefficients too much and to include many spurious features. Computationally, $\ell_1$ appears more attractive: convex programming makes the computation feasible and efficient. In practice, however, approximate solutions to the $\ell_0$ problem are often better than exact solutions to the $\ell_1$ problem. The best properties of the two methods can be combined. Superior results were obtained by using convex optimization of the $\ell_1$ problem to generate a set of candidate models (the regularization path generated by LARS) and then selecting the best model by minimizing the $\ell_0$-penalized training error.

Appendices
A Risk Ratio Proofs
We will drop the $\gamma$'s when the context is clear, denoting $\hat\beta_{\ell_0}(\gamma_0)$ by $\hat\beta_{\ell_0}$ and $\hat\beta_{\ell_1}(\gamma_1)$ by $\hat\beta_{\ell_1}$ for simplicity. Without loss of generality, we assume $X'X = I$ and $\sigma = 1$. The $\ell_0$ risk can be written as
\[
R(\beta, \hat\beta_{\ell_0}) = E_\beta\|X\beta - X\hat\beta\|^2 = E_\beta\sum_{i=1}^p \|x_i\|^2(\beta_i - \hat\beta_i)^2
= E_\beta\sum_{i=1}^p\left(\Big(\frac{x_i'\epsilon}{\|x_i\|}\Big)^2 I\{|\hat\beta_i| > \gamma_0\} + (\|x_i\|\beta_i)^2 I\{|\hat\beta_i| \le \gamma_0\}\right) \tag{11}
\]
\[
= \sum_{i=1}^p\Big\{\sigma^2 E_\beta\big[Z_i^2\, I\{|\beta_i + \sigma Z_i| > \gamma_0\}\big] + (\|x_i\|\beta_i)^2\, P(|\beta_i + \sigma Z_i| \le \gamma_0)\Big\},
\]
where $Z_i = x_i'\epsilon/(\sigma\|x_i\|) \sim N(0, 1)$ for $i = 1, \ldots, p$. Similarly, the $\ell_1$ risk can be written as
\[
R(\beta, \hat\beta_{\ell_1}) = E_\beta\sum_{i=1}^p\left(\Big(\frac{x_i'\epsilon}{\|x_i\|} - \gamma_1\Big)^2 I\{\hat\beta_i > \gamma_1\} + \Big(\frac{x_i'\epsilon}{\|x_i\|} + \gamma_1\Big)^2 I\{\hat\beta_i < -\gamma_1\} + (\|x_i\|\beta_i)^2 I\{|\hat\beta_i| \le \gamma_1\}\right) \tag{12}
\]
\[
= \sum_{i=1}^p\Big\{E_\beta\big[(\sigma Z_i - \gamma_1)^2 I\{\beta_i + \sigma Z_i > \gamma_1\} + (\sigma Z_i + \gamma_1)^2 I\{\beta_i + \sigma Z_i < -\gamma_1\}\big] + (\|x_i\|\beta_i)^2\, P(|\beta_i + \sigma Z_i| \le \gamma_1)\Big\}.
\]

Specifically, we consider the case $p = 1$. Let $\Phi(z) = P(Z \le z)$ and $\tilde\Phi(z) = P(Z > z)$ be the lower and upper tail probabilities of the standard normal distribution. The two risk functions can then be written explicitly as
\[
R(\beta, \hat\beta_{\ell_0}) = \int_{\gamma_0 - \beta}^{\infty} z^2\phi(z)\,dz + \int_{-\infty}^{-\gamma_0 - \beta} z^2\phi(z)\,dz + \beta^2\big[\Phi(\gamma_0 - \beta) - \tilde\Phi(\gamma_0 + \beta)\big]
\]
\[
= (\gamma_0 - \beta)\phi(\gamma_0 - \beta) + (\gamma_0 + \beta)\phi(\gamma_0 + \beta) + \Phi(\beta - \gamma_0) + \beta^2\Phi(\gamma_0 - \beta) + (1 - \beta^2)\tilde\Phi(\gamma_0 + \beta), \tag{13}
\]
\[
R(\beta, \hat\beta_{\ell_1}) = \int_{\gamma_1 - \beta}^{\infty} (z - \gamma_1)^2\phi(z)\,dz + \int_{-\infty}^{-\gamma_1 - \beta} (z + \gamma_1)^2\phi(z)\,dz + \beta^2\big[\Phi(\gamma_1 - \beta) - \tilde\Phi(\gamma_1 + \beta)\big]
\]
\[
= (-\gamma_1 - \beta)\phi(\gamma_1 - \beta) + (\beta - \gamma_1)\phi(\gamma_1 + \beta) + (\gamma_1^2 + 1)\Phi(\beta - \gamma_1) + \beta^2\Phi(\gamma_1 - \beta) + (\gamma_1^2 + 1 - \beta^2)\tilde\Phi(\gamma_1 + \beta). \tag{14}
\]

We list a few Gaussian tail bounds that we will use in the proofs; detailed discussion can be found in related articles (Feller, 1968; Donoho and Johnstone, 1994; Foster and George, 1994; Abramovich et al., 2006).

Lemma 1. For any $z > 0$:
1. $\phi(z)(z^{-1} - z^{-3}) \le \tilde\Phi(z) \le \phi(z)\,z^{-1}$;
2. $\tilde\Phi(z) \le \tfrac{1}{2}e^{-z^2/2}$;
3. $\phi(z)\big(z^{-1} - z^{-3} + 1\cdot 3\, z^{-5} - 1\cdot 3\cdot 5\, z^{-7} + \cdots + (-1)^k\, 1\cdot 3\cdots(2k-1)\, z^{-2k-1}\big)$ overestimates $\tilde\Phi(z)$ if $k$ is even, and underestimates $\tilde\Phi(z)$ if $k$ is odd.
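These classical Mills-ratio bounds are easy to sanity-check numerically (a small illustration of Lemma 1, not part of the original proofs; the grid of $z$ values is an arbitrary choice):

```python
import numpy as np
from scipy.stats import norm

z = np.linspace(0.5, 8.0, 400)
phi, tail = norm.pdf(z), norm.sf(z)              # density and upper tail
assert np.all(phi * (1/z - 1/z**3) <= tail)      # lower bound in item 1
assert np.all(tail <= phi / z)                   # upper bound in item 1
assert np.all(tail <= 0.5 * np.exp(-z**2 / 2))   # item 2
print("Lemma 1 bounds hold on the grid")
```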
Lemma 2. For large enough $\gamma_0 > 0$,
\[
\inf_{\gamma_1}\sup_\beta \frac{R(\beta, \hat\beta_{\ell_1})}{R(\beta, \hat\beta_{\ell_0})} \ >\ \frac{\gamma_0^2}{2}. \tag{15}
\]

Proof.
It suffices to show that for any fixed $\gamma_0$ and any $\gamma_1$,
\[
\sup_\beta \frac{R(\beta, \hat\beta_{\ell_1})}{R(\beta, \hat\beta_{\ell_0})} \ >\ \frac{\gamma_0^2}{2}.
\]
First suppose $\gamma_1 \ge \gamma_0/\sqrt{2}$, and let $\beta_n = (n+1)\gamma_0$. Since soft thresholding moves the least squares estimate by $\gamma_1$ whenever the threshold is exceeded,
\[
|\hat\beta_{\ell_1} - \hat\beta_{LS}| \ \ge\ |\hat\beta_{\ell_1} - \hat\beta_{LS}|\, I\{\hat\beta_{LS} > \gamma_1\} = \gamma_1 I\{\hat\beta_{LS} > \gamma_1\} \ \ge\ \frac{\gamma_0}{\sqrt{2}}\, I\{\hat\beta_{LS} > \gamma_1\},
\]
so that $E|\hat\beta_{\ell_1} - \hat\beta_{LS}|^2 > \tfrac{1}{2}\gamma_0^2\, P(\hat\beta_{LS} > \gamma_1)$. Thus, for large enough $\gamma_0$,
\[
E|\hat\beta_{\ell_1} - \beta_n|^2 \ \ge\ E|\hat\beta_{\ell_1} - \hat\beta_{LS}|^2 - E|\hat\beta_{LS} - \beta_n|^2 \ >\ \frac{\gamma_0^2}{2}\,\Phi\big((n+1)\gamma_0 - \gamma_1\big) - \Phi\big(\gamma_1 - (n+1)\gamma_0\big).
\]
On the other hand, with $Z \sim N(0, 1)$, equation (13) evaluated at $\beta_n$ gives
\[
R(\beta_n, \hat\beta_{\ell_0}) = -n\gamma_0\,\phi(n\gamma_0) + (n+2)\gamma_0\,\phi((n+2)\gamma_0) + \Phi(n\gamma_0) + (n+1)^2\gamma_0^2\,\tilde\Phi(n\gamma_0) + \big(1 - (n+1)^2\gamma_0^2\big)\tilde\Phi((n+2)\gamma_0)
\]
\[
\le\ 1 + \Big(-n\gamma_0 - \frac{1}{n\gamma_0} + \frac{(n+1)^2\gamma_0}{n}\Big)\phi(n\gamma_0) + \Big((n+2)\gamma_0 + \frac{1 - (n+1)^2\gamma_0^2}{(n+2)\gamma_0}\Big)\phi((n+2)\gamma_0)
\ \le\ 1 + \big(n^{-1} + 2e^{-(n+1)\gamma_0^2}\big)\gamma_0^2\,\phi(n\gamma_0),
\]
using the tail bounds of Lemma 1. Hence
\[
\frac{R(\beta_n, \hat\beta_{\ell_1})}{R(\beta_n, \hat\beta_{\ell_0})} \ \ge\ \frac{\tfrac{1}{2}\gamma_0^2\,\Phi\big((n+1)\gamma_0 - \gamma_1\big) - \Phi\big(\gamma_1 - (n+1)\gamma_0\big)}{1 + \big(n^{-1} + 2e^{-(n+1)\gamma_0^2}\big)\gamma_0^2\,\phi(n\gamma_0)}.
\]
Letting $n \to \infty$,
\[
\sup_\beta \frac{R(\beta, \hat\beta_{\ell_1})}{R(\beta, \hat\beta_{\ell_0})} \ \ge\ \lim_{n\to\infty}\frac{R(\beta_n, \hat\beta_{\ell_1})}{R(\beta_n, \hat\beta_{\ell_0})} \ \ge\ \frac{\gamma_0^2}{2}. \tag{16}
\]

For $0 \le \gamma_1 < \gamma_0/\sqrt{2}$, we consider $\beta = 0$ and write
\[
R_0(\gamma_0) = R(0, \hat\beta_{\ell_0}(\gamma_0)) = 2\gamma_0\phi(\gamma_0) + 2\tilde\Phi(\gamma_0), \tag{17}
\]
\[
R_1(\gamma_1) = R(0, \hat\beta_{\ell_1}(\gamma_1)) = -2\gamma_1\phi(\gamma_1) + 2(\gamma_1^2 + 1)\tilde\Phi(\gamma_1). \tag{18}
\]
Consider first $0 < c \le \gamma_1 < \gamma_0/\sqrt{2}$, where $c$ is large enough that, by Lemma 1, $\tilde\Phi(z) - \phi(z)(1/z - 1/z^3 + 1/z^5) \ge 0$ for all $z \ge c$. Since $\gamma_1 < \gamma_0/\sqrt{2}$ implies $\phi(\gamma_1) \ge \phi(\gamma_0)e^{\gamma_0^2/4}$, we have
\[
R_1(\gamma_1) \ \ge\ -2\gamma_1\phi(\gamma_1) + 2(\gamma_1^2 + 1)\big(\gamma_1^{-1} - \gamma_1^{-3} + \gamma_1^{-5}\big)\phi(\gamma_1) = 2\gamma_1^{-5}\phi(\gamma_1) \ \ge\ 2^{7/2}\gamma_0^{-5}\,\phi(\gamma_0)\,e^{\gamma_0^2/4},
\]
while
\[
R_0(\gamma_0) \ \le\ 2\big(\gamma_0 + \gamma_0^{-1}\big)\phi(\gamma_0).
\]
Thus, for large enough $\gamma_0$,
\[
\frac{R_1(\gamma_1)}{R_0(\gamma_0)} \ \ge\ \frac{2^{5/2}\,e^{\gamma_0^2/4}}{\gamma_0^6 + \gamma_0^4} \ >\ \gamma_0^2.
\]
Lastly, since
\[
\frac{d}{d\gamma_1}R_1(\gamma_1) = -4\phi(\gamma_1) + 4\gamma_1\tilde\Phi(\gamma_1) < 0,
\]
$R_1$ is decreasing, so for any $0 \le \gamma_1 \le c$ we have
\[
\frac{R_1(\gamma_1)}{R_0(\gamma_0)} \ \ge\ \frac{R_1(c)}{R_0(\gamma_0)} \ >\ \gamma_0^2
\]
for large enough $\gamma_0$. $\square$

Proof of Theorem 1.
Let
\[
C = \min_{\gamma_0 > 0}\Big\{\frac{2^{5/2}\,e^{\gamma_0^2/4}}{\gamma_0^6 + \gamma_0^4} - \frac{\gamma_0^2}{2}\Big\},
\]
which is finite. Combining this constant with Lemma 2 and the nonnegativity of the risk ratio yields the theorem. $\square$

Lemma 3.
There exist $M > 0$ and a constant $C > 0$ such that for all $\gamma_1 > M$,
\[
\inf_{\gamma_0}\sup_\beta \frac{R(\beta, \hat\beta_{\ell_0})}{R(\beta, \hat\beta_{\ell_1})} \ \le\ 1 + C\gamma_1^{-1}. \tag{19}
\]
It suffices to show that, for a suitable choice of $\gamma_0$, for all $\beta \ge 0$ we have
\[
\frac{R(\beta, \hat\beta_{\ell_0})}{R(\beta, \hat\beta_{\ell_1})} \ \le\ 1 + C\gamma_1^{-1}. \tag{20}
\]
The proof proceeds by bounding the risks over different ranges of $\beta$. Before giving these bounds, we relate $R(\beta, \hat\beta_{\ell_0})$ to $R(\beta, \hat\beta_{\ell_1})$ by writing $\gamma_0 = \gamma_1 + \Delta\gamma$ and expanding (13) around $\gamma_1$:
\[
R(\beta, \hat\beta_{\ell_0}) = R(\beta, \hat\beta_{\ell_1}) + 2\gamma_1\phi(\gamma_1 - \beta) + 2\gamma_1\phi(\gamma_1 + \beta) - \gamma_1^2\Phi(\beta - \gamma_1) - \gamma_1^2\tilde\Phi(\gamma_1 + \beta)
\]
\[
\qquad - (\gamma_1^2 - 2\beta\gamma_1)\phi(\gamma_1 - \beta)\,\Delta\gamma - (\gamma_1^2 + 2\beta\gamma_1)\phi(\gamma_1 + \beta)\,\Delta\gamma + o(\Delta\gamma).
\]
We can now provide separate bounds for $\beta$ within the following regions:
1. $0 \le \beta \le \gamma_1 - \sqrt{2\log(\gamma_1^2/2)}$;
2. $\gamma_1 - \sqrt{2\log(\gamma_1^2/2)} < \beta \le \gamma_1 + \sqrt{2\log\gamma_1}$;
3. $\beta > \gamma_1 + \sqrt{2\log\gamma_1}$.

Proof for case 1, $0 \le \beta \le \gamma_1 - \sqrt{2\log(\gamma_1^2/2)}$. Use the trivial estimator $\hat\beta_{\ell_0} = 0$, i.e., set $\gamma_0 = \infty$. Then $R(\beta, \hat\beta_{\ell_0}) = \beta^2$, while, with $t = \sqrt{2\log(\gamma_1^2/2)}$ and using $\gamma_1 - \beta \ge t$ together with Lemma 1,
\[
R(\beta, \hat\beta_{\ell_1}) \ >\ \beta^2\, P\big(|\beta + Z| \le \gamma_1\big) \ \ge\ \beta^2\big(1 - 2\tilde\Phi(t)\big) \ \ge\ \beta^2\big(1 - e^{-t^2/2}\big) = \beta^2\Big(1 - \frac{2}{\gamma_1^2}\Big).
\]
Therefore
\[
\frac{R(\beta, \hat\beta_{\ell_0})}{R(\beta, \hat\beta_{\ell_1})} \ \le\ \frac{\beta^2}{\beta^2(1 - 2/\gamma_1^2)} = \frac{\gamma_1^2}{\gamma_1^2 - 2} = 1 + \frac{2}{\gamma_1^2 - 2} = 1 + o(\gamma_1^{-1})
\]
for $\gamma_1 > \sqrt{2}$. (For $\beta = 0$ both risks vanish and the ratio is interpreted as 1, which can be justified by a limit argument.)

Proof for case 2, $\gamma_1 - \sqrt{2\log(\gamma_1^2/2)} < \beta \le \gamma_1 + \sqrt{2\log\gamma_1}$. Take $\Delta\gamma = \frac{1}{2\gamma_1}$ in the expansion above. We want to replace $\phi(\gamma_1 + \beta)$ by $\phi(\gamma_1 - \beta)$, which requires the coefficient on $\phi(\gamma_1 + \beta)$ to be positive; indeed, for $\beta \le \gamma_1 + \sqrt{2\log\gamma_1}$,
\[
2\gamma_1 + 2\Delta\gamma - \gamma_1^2\Delta\gamma - 2\beta\gamma_1\Delta\gamma = 2\gamma_1 + \frac{1}{\gamma_1} - \frac{\gamma_1}{2} - \beta \ >\ \frac{\gamma_1}{2} + \frac{1}{\gamma_1} - \sqrt{2\log\gamma_1} \ >\ 0
\]
for $\gamma_1 \ge 2$. Making the replacement,
\[
R(\beta, \hat\beta_{\ell_0}) - R(\beta, \hat\beta_{\ell_1}) \ <\ \big(4\gamma_1 + 4\Delta\gamma - \gamma_1^2\Delta\gamma\big)\phi(\gamma_1 - \beta) + o(\gamma_1^{-1}) \ <\ \Big(4\gamma_1 + \frac{2}{\gamma_1}\Big)\phi(\gamma_1 - \beta) + o(\gamma_1^{-1}).
\]
Next, we need a lower bound on $R(\beta, \hat\beta_{\ell_1})$. From (14), using $\Phi(\beta - \gamma_1) = 1 - \Phi(\gamma_1 - \beta)$ and $\tilde\Phi(\gamma_1 + \beta) = \Phi(-\gamma_1 - \beta)$,
\[
R(\beta, \hat\beta_{\ell_1}) = (-\gamma_1 - \beta)\phi(\gamma_1 - \beta) + (\beta - \gamma_1)\phi(\gamma_1 + \beta) + (\gamma_1^2 + 1) - (\gamma_1^2 + 1 - \beta^2)\Phi(\gamma_1 - \beta) + (\gamma_1^2 + 1 - \beta^2)\Phi(-\gamma_1 - \beta).
\]
We separate into two cases. If $\beta \ge \gamma_1 + 1$, the two $\Phi$ terms combine to a nonnegative quantity and
\[
R(\beta, \hat\beta_{\ell_1}) \ >\ -(\gamma_1 + \beta)\phi(\gamma_1 - \beta) + \gamma_1^2 + 1 \ \ge\ \gamma_1^2 - \gamma_1 \ >\ 0
\]
for $\gamma_1 > 1$. If $\beta < \gamma_1 + 1$,
\[
R(\beta, \hat\beta_{\ell_1}) \ >\ -(\gamma_1 + \beta)\phi(\gamma_1 - \beta) + (\gamma_1^2 + 1) - (\gamma_1^2 + 1 - \beta^2)\Phi(\gamma_1 - \beta) \ >\ -2\gamma_1 + \Big(\gamma_1 - \sqrt{2\log(\gamma_1^2/2)}\Big)^2 \ >\ 0
\]
for $\gamma_1 \ge 4$. In either case the above bounds yield
\[
\frac{R(\beta, \hat\beta_{\ell_0})}{R(\beta, \hat\beta_{\ell_1})} \ \le\ 1 + \frac{\big(4\gamma_1 + 2\gamma_1^{-1}\big)\phi(\gamma_1 - \beta) + o(\gamma_1^{-1})}{-2\gamma_1 + \big(\gamma_1 - \sqrt{2\log(\gamma_1^2/2)}\big)^2} = 1 + o(\gamma_1^{-1}), \quad \text{for } \gamma_1 \ge 4.
\]

Proof for case 3, $\beta > \gamma_1 + \sqrt{2\log\gamma_1}$. Let $\Delta\gamma = 0$, i.e., $\gamma_0 = \gamma_1$. Then, since $\phi(\gamma_1 - \beta) \le \phi(\sqrt{2\log\gamma_1})$ and $\Phi(\beta - \gamma_1) \ge \Phi(\sqrt{2\log\gamma_1})$ in this region,
\[
R(\beta, \hat\beta_{\ell_0}) - R(\beta, \hat\beta_{\ell_1}) = 2\gamma_1\phi(\gamma_1 - \beta) + 2\gamma_1\phi(\gamma_1 + \beta) - \gamma_1^2\Phi(\beta - \gamma_1) - \gamma_1^2\tilde\Phi(\gamma_1 + \beta)
\]
\[
<\ 4\gamma_1\,\phi\big(\sqrt{2\log\gamma_1}\big) - \gamma_1^2\,\Phi\big(\sqrt{2\log\gamma_1}\big) = \frac{4}{\sqrt{2\pi}} - \gamma_1^2\,\Phi\big(\sqrt{2\log\gamma_1}\big) \ <\ 0
\]
for $\gamma_1 \ge 2$, so in this region the $\ell_0$ risk is actually smaller than the $\ell_1$ risk.

Proof of Lemma 3.
Using the case bounds above, take $M = 2$ and let $C$ be the constant suppressed in the $o(\gamma_1^{-1})$ terms. $\square$

Proof of Theorem 2.
For $\gamma_1 < M$ we know that there exists some $\epsilon > 0$ such that $R(\beta, \hat\beta_{\ell_1}(\gamma_1)) \ge \epsilon$ for all $\beta$. If we use the trivial estimator with $\gamma_0 = 0$ (i.e., least squares), it has risk 1. Hence we can pick $C = \max(1/\epsilon, C_0)$, where $C_0$ is the constant from Lemma 3, and Theorem 2 follows. $\square$

References

Abramovich, F., Benjamini, Y., Donoho, D. L., and Johnstone, I. M. (2006). Adapting to unknown sparsity by controlling the false discovery rate. The Annals of Statistics, 34(2):584–653.
Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19(6):716–723.
Candes, E. and Tao, T. (2007). The Dantzig selector: statistical estimation when p is much larger than n. The Annals of Statistics, 35(6):2313–2351.
Candes, E. J., Wakin, M. B., and Boyd, S. P. (2007). Enhancing sparsity by reweighted l1 minimization.
Donoho, D. L. and Johnstone, I. M. (1994). Ideal spatial adaptation by wavelet shrinkage. Biometrika, 81(3):425–455.
Efron, B., Hastie, T., Johnstone, I., and Tibshirani, R. (2004). Least angle regression. The Annals of Statistics, 32(2):407–451.
Feller, W. (1968). An Introduction to Probability Theory and Its Applications, volume 1. John Wiley & Sons, Inc.
Foster, D. P. and George, E. I. (1994). The risk inflation criterion for multiple regression. The Annals of Statistics, 22(4):1947–1975.
Foster, D. P. and Stine, R. A. (2004). Variable selection in data mining: Building a predictive model for bankruptcy. Journal of the American Statistical Association, pages 303–313.
Foster, D. P. and Stine, R. A. (2005). PolyShrink: An adaptive variable selection procedure that is competitive with Bayes experts.
George, E. I. and Foster, D. P. (2000). Calibration and empirical Bayes variable selection. Biometrika, 87:731–747.
Johnstone, I. M. and Silverman, B. W. (2005). Empirical Bayes selection of wavelet thresholds. The Annals of Statistics, 33(4):1700–1752.
Meinshausen, N. (2007). Relaxed lasso. Computational Statistics & Data Analysis, 52(1):374–393.
Natarajan, B. K. (1995). Sparse approximate solutions to linear systems. SIAM Journal on Computing, 24(2):227–234.
Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statistics, 6(2):461–464.
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B (Methodological), pages 267–288.
Tibshirani, R., Saunders, M., Rosset, S., Zhu, J., and Knight, K. (2005). Sparsity and smoothness via the fused lasso. Journal of the Royal Statistical Society, Series B (Statistical Methodology), 67(1):91–108.
Yuan, M. and Lin, Y. (2006). Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society, Series B (Statistical Methodology), 68(1):49–67.
Zhou, J., Foster, D. P., Stine, R. A., and Ungar, L. H. (2006). Streamwise feature selection. Journal of Machine Learning Research, 7:1861–1885.
Zou, H. (2006). The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101:1418–1429.
Zou, H. and Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B (Statistical Methodology), 67(2):301–320.