Strong rules for discarding predictors in lasso-type problems
Robert Tibshirani, Jacob Bien, Jerome Friedman, Trevor Hastie, Noah Simon, Jonathan Taylor, Ryan J. Tibshirani
November 25, 2010

∗ Departments of Statistics and Health Research and Policy, Stanford University, Stanford CA 94305, USA. E-mail: [email protected]
Abstract
We consider rules for discarding predictors in lasso regression and related problems, for computational efficiency. El Ghaoui et al. (2010) propose "SAFE" rules, based on univariate inner products between each predictor and the outcome, that guarantee a coefficient will be zero in the solution vector. This provides a reduction in the number of variables that need to be entered into the optimization. In this paper, we propose strong rules that are not foolproof but rarely fail in practice. These are very simple, and can be complemented with simple checks of the Karush-Kuhn-Tucker (KKT) conditions to ensure that the exact solution to the convex problem is delivered. These rules offer a substantial savings in both computational time and memory, for a variety of statistical optimization problems.
1 Introduction

Our focus here is on statistical models fit using ℓ1 regularization. We start with penalized linear regression. Consider a problem with N observations and p predictors, and let y denote the N-vector of outcomes and X the N × p matrix of predictors, with jth column x_j and ith row x_i. For a set of indices A = {j_1, . . . , j_k}, we write X_A to denote the N × k submatrix X_A = [x_{j_1}, . . . , x_{j_k}], and we write b_A = (b_{j_1}, . . . , b_{j_k}) for a vector b. We assume that the predictors and outcome are centered, so we can omit an intercept term from the model. The lasso (Tibshirani 1996) solves the optimization problem

    β̂ = argmin_β (1/2)‖y − Xβ‖₂² + λ‖β‖₁,    (1)

where λ ≥ 0. A main reason for using the lasso is that the ℓ1 penalty tends to give exact zeros in β̂, and therefore it performs a kind of variable selection. Now suppose we knew, a priori, before solving (1), that a subset of the variables S ⊆ {1, . . . , p} will have zero coefficients in the solution, that is, β̂_S = 0. Then we could solve problem (1) with the design matrix replaced by X_{S^c}, where S^c = {1, . . . , p} \ S, for the remaining coefficients β̂_{S^c}. If S is relatively large, then this could result in a substantial computational savings.

El Ghaoui et al. (2010) construct such a set S of "screened" or "discarded" variables by looking at the inner products |x_j^T y|, j = 1, . . . , p. The authors use a clever argument to derive a surprising set of rules called "SAFE", and show that applying these rules can reduce both time and memory in the overall computation. In related work, Wu et al. (2009) study ℓ1-penalized logistic regression and build a screened set S based on similar inner products. However, their construction does not guarantee that the variables in S actually have zero coefficients in the solution, and so after fitting on X_{S^c}, the authors check the Karush-Kuhn-Tucker (KKT) optimality conditions for violations. In the case of violations, they weaken their set S, and repeat this process. Also, Fan & Lv (2008) study the screening of variables based on their inner products in the lasso and related problems, but not from an optimization point of view. Their screening rules may again set coefficients to zero that are nonzero in the solution; however, the authors argue that under certain situations this can lead to better performance in terms of estimation risk.

In this paper, we propose strong rules for discarding predictors in the lasso and other problems that involve lasso-type penalties. These rules discard many more variables than the SAFE rules, but are not foolproof, because they can sometimes exclude variables from the model that have nonzero coefficients in the solution. Therefore we rely on the KKT conditions to ensure that we are indeed computing the correct coefficients in the end. Our method is most effective for solving problems over a grid of λ values, because we can apply our strong rules sequentially down the path, which results in a considerable reduction in computational time.
Generally speaking, the power of the proposed rules stems from the fact that:

• the set of discarded variables S tends to be large and violations rarely occur in practice, and

• the rules are very simple and can be applied to many different problems, including the elastic net, lasso-penalized logistic regression, and the graphical lasso.

In fact, violations of the proposed rules are so rare that for a while a group of us were trying to establish that they were foolproof. At the same time, others in our group were looking for counter-examples [hence the large number of co-authors!]. After many flawed proofs, we finally found some counter-examples to the strong sequential bound (although not to the basic global bound). Despite this, the strong sequential bound turns out to be extremely useful in practice.

Here is the layout of this paper. In Section 2 we review the SAFE rules of El Ghaoui et al. (2010) for the lasso. The strong rules are introduced and illustrated in Section 3 for this same problem. We demonstrate that the strong rules rarely make mistakes in practice, especially when p ≫ N. In Section 4 we give a condition under which the strong rules do not erroneously discard predictors (and hence the KKT conditions do not need to be checked). We discuss the elastic net and penalized logistic regression in Sections 5 and 6. Strong rules for more general convex optimization problems are given in Section 7, and these are applied to the graphical lasso. In Section 8 we discuss how the strong sequential rule can be used to speed up the solution of convex optimization problems, while still delivering the exact answer. We also cover implementation details of the strong sequential rule in our glmnet algorithm (coordinate descent for lasso-penalized generalized linear models). Section 9 contains some final discussion.

2 The SAFE rules

The basic SAFE rule of El Ghaoui et al. (2010) for the lasso is defined as follows: fitting at λ, we discard predictor j if

    |x_j^T y| < λ − ‖x_j‖₂ ‖y‖₂ (λ_max − λ)/λ_max,    (2)

where λ_max = max_j |x_j^T y| is the smallest λ for which all coefficients are zero. The authors derive this bound by looking at a dual of the lasso problem (1). This is:

    θ̂ = argmax_θ G(θ) = (1/2)‖y‖₂² − (1/2)‖y + θ‖₂²
    subject to |x_j^T θ| ≤ λ for j = 1, . . . , p.    (3)

The relationship between the primal and dual solutions is θ̂ = Xβ̂ − y, and

    x_j^T θ̂ ∈ {+λ}      if β̂_j > 0
             {−λ}      if β̂_j < 0
             [−λ, λ]   if β̂_j = 0    (4)

for each j = 1, . . . , p. Here is a sketch of the argument: first we find a dual feasible point of the form θ = sy (s a scalar), and hence γ = G(sy) represents a lower bound for the value of G at the solution. Therefore we can add the constraint G(θ) ≥ γ to the dual problem (3) and nothing will be changed. For each predictor j, we then find

    m_j = max_θ |x_j^T θ| subject to G(θ) ≥ γ.

If m_j < λ (note the strict inequality), then certainly at the solution |x_j^T θ̂| < λ, which implies that β̂_j = 0 by (4). Finally, noting that s = λ/λ_max produces a dual feasible point, and rewriting the condition m_j < λ, gives the rule (2).

In addition to the basic SAFE bound, the authors also derive a more complicated but somewhat better bound that they call "recursive SAFE" (RECSAFE). As we will show, the SAFE rules have the advantage that they will never discard a predictor when its coefficient is truly nonzero. However, they discard far fewer predictors than the strong sequential rule, introduced in the next section.
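To make rule (2) concrete, here is a minimal sketch in Python/NumPy (our own illustration, not El Ghaoui et al.'s code); it returns a boolean mask of predictors that can be safely dropped before solving (1):

```python
import numpy as np

def safe_discard(X, y, lam):
    """Basic SAFE rule (2): mask of predictors guaranteed to have
    zero coefficients in the lasso solution at lam."""
    scores = np.abs(X.T @ y)                 # |x_j^T y|
    lam_max = scores.max()                   # entry point of the path
    col_norms = np.linalg.norm(X, axis=0)    # ||x_j||_2
    thresh = lam - col_norms * np.linalg.norm(y) * (lam_max - lam) / lam_max
    return scores < thresh                   # True => discard x_j
```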
3 Strong rules for the lasso

Our basic (or global) strong rule for the lasso problem (1) discards predictor j if

    |x_j^T y| < 2λ − λ_max,    (5)

where as before λ_max = max_j |x_j^T y|.

When the predictors are standardized (‖x_j‖₂ = 1 for each j), it is not difficult to see that the right-hand side of (2) is always smaller than the right-hand side of (5), so that in this case the SAFE rule is always weaker than the basic strong rule. This follows since λ_max ≤ ‖y‖₂, so that

    λ − ‖y‖₂ (λ_max − λ)/λ_max ≤ λ − (λ_max − λ) = 2λ − λ_max.

Figure 1 illustrates the SAFE and basic strong rules in an example. When the predictors are not standardized, the ordering between the two bounds is not as clear, but the strong rule still tends to discard more variables in practice, unless the predictors have wildly different marginal variances.

While (5) is somewhat useful, its sequential version is much more powerful. Suppose that we have already computed the solution β̂(λ_0) at λ_0, and wish to discard predictors for a fit at λ < λ_0. Defining the residual r = y − Xβ̂(λ_0), our strong sequential rule discards predictor j if

    |x_j^T r| < 2λ − λ_0.    (6)

Before giving a detailed motivation for these rules, we first demonstrate their utility. Figure 2 shows some examples of the application of the SAFE and strong rules. There are four scenarios with various values of N and p; in the first three panels the X matrix is dense, while it is sparse in the bottom right panel. The population correlation among the features is zero, positive, negative and zero in the four panels. Finally, 25% of the coefficients are non-zero, with a standard Gaussian distribution. In the plots, we are fitting along a path of decreasing λ values, and the plots show the number of predictors left after screening at each stage. We see that the SAFE and RECSAFE rules only exclude predictors near the beginning of the path; the strong rules are much more effective.
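In code, the two strong rules are essentially one-liners. The sketch below is again our own Python/NumPy illustration, following the notation of (5) and (6), where `beta0` is the already-computed solution at λ_0:

```python
import numpy as np

def strong_discard_global(X, y, lam):
    """Basic strong rule (5)."""
    scores = np.abs(X.T @ y)
    lam_max = scores.max()
    return scores < 2 * lam - lam_max          # True => discard

def strong_discard_seq(X, y, beta0, lam0, lam):
    """Strong sequential rule (6), given the solution beta0 at lam0 > lam."""
    r = y - X @ beta0                          # residual at lam0
    return np.abs(X.T @ r) < 2 * lam - lam0    # True => discard
```

Since the rules can fail, any discarded predictor must later be re-checked against the KKT conditions (7); Section 8 describes how this is organized in glmnet.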
[Figure 1: inner products of 10 predictors with the current residual, plotted against λ, with the SAFE and basic strong bounds marked.]

Figure 1: SAFE and basic strong bounds in an example with 10 predictors, labelled at the right. The plot shows the inner product of each predictor with the current residual, with the predictors in the model having maximal inner product equal to ±λ. The dotted vertical line is drawn at λ_max; the broken vertical line is drawn at λ. The strong rule keeps only one predictor.

We now give some motivation for the strong rule (5) and, later, the sequential rule (6). We start with the KKT conditions for the lasso problem (1). These are

    x_j^T (y − Xβ̂) = λ · s_j    (7)

for j = 1, . . . , p, where s_j is a subgradient of |β̂_j|:

    s_j ∈ {+1}      if β̂_j > 0
          {−1}      if β̂_j < 0
          [−1, 1]   if β̂_j = 0.    (8)

Let c_j(λ) = x_j^T {y − Xβ̂(λ)}, where we emphasize the dependence on λ. Suppose in general that we could assume

    |c_j′(λ)| ≤ 1,    (9)

where c_j′ is the derivative with respect to λ, and we ignore possible points of non-differentiability. This would allow us to conclude that

    |c_j(λ_max) − c_j(λ)| = | ∫_λ^{λ_max} c_j′(t) dt |    (10)
                          ≤ ∫_λ^{λ_max} |c_j′(t)| dt ≤ λ_max − λ,    (11)

and so

    |c_j(λ_max)| < 2λ − λ_max ⇒ |c_j(λ)| < λ ⇒ β̂_j(λ) = 0,

the last implication following from the KKT conditions (7) and (8). The strong rule (5) then follows, as β̂(λ_max) = 0, so that |c_j(λ_max)| = |x_j^T y|.

Where does the slope condition (9) come from? The product rule applied to (7) gives

    c_j′(λ) = s_j(λ) + λ · s_j′(λ).    (12)
[Figure 2: four panels, each plotting the number of predictors left after filtering against the number of predictors in the model, with curves for SAFE, strong/global, strong/seq and RECSAFE. Panel titles: Dense Nocorr (N = 200, p = 5000); Dense Pos (N = 200, p = 5000); Dense Neg (N = 200, p = 5000); Sparse Nocorr (N = 500, p = 50000).]
Figure 2:
Lasso regression: results of different rules applied to four different scenarios. There are four scenarios with various values of N and p; in the first three panels the X matrix is dense, while it is sparse in the bottom right panel. The population correlation among the features is zero, positive, negative and zero in the four panels. Finally, 25% of the coefficients are non-zero, with a standard Gaussian distribution. In the plots, we are fitting along a path of decreasing λ values, and the plots show the number of predictors left after screening at each stage. A broken line with unit slope is added for reference. The proportion of variance explained by the model is shown along the top of the plot. There were no violations of any of the rules in any of the four scenarios.
[Figure 3: two panels plotting the number of predictors left after filtering against the number of predictors in the model; left panel: equal population variance; right panel: unequal population variance. Curves for SAFE, strong/global, strong/seq and RECSAFE.]
Figure 3:
Lasso regression: results of different rules when the predictors are not standardized. The scenario in the left panel is the same as in the top left panel of Figure 2, except that the features are not standardized before fitting the lasso. In the data generation for the right panel, each feature is scaled by a random factor between 1 and 50, and again, no standardization is done.

Since |s_j(λ)| ≤ 1, condition (9) can be obtained if we simply drop the second term above. For an active variable, that is, β̂_j(λ) ≠ 0, we have s_j(λ) = sign{β̂_j(λ)}, and continuity of β̂_j(λ) with respect to λ implies s_j′(λ) = 0. But s_j′(λ) ≠ 0 in general for inactive variables, and hence the bound (9) can fail, which makes the strong rule (5) imperfect. It is from this point of view (writing out the KKT conditions, taking a derivative with respect to λ, and dropping a term) that we derive strong rules for ℓ1-penalized logistic regression and more general problems.

In the lasso case, condition (9) has a more concrete interpretation. From Efron et al. (2004), we know that each coordinate of the solution β̂_j(λ) is a piecewise linear function of λ, hence so is each inner product c_j(λ). Therefore c_j(λ) is differentiable at any λ that is not a kink, the kinks being the points at which variables enter or leave the model. In between kinks, condition (9) is really just a bound on the slope of c_j(λ). The idea is that if we assume the absolute slope of c_j(λ) is at most 1, then we can bound the amount that c_j(λ) changes as we move from λ_max to a value λ. Hence if the initial inner product c_j(λ_max) starts too far from the maximal achieved inner product, then it cannot "catch up" in time. An illustration is given in Figure 4.

The argument for the strong bound (intuitively, an argument about slopes) uses only local information, and so it can be applied to solving (1) on a grid of λ values. Hence by the same argument as before, the slope assumption (9) leads to the strong sequential rule (6).

It is interesting to note that

    |x_j^T r| < λ    (13)

is just the KKT condition for excluding a variable in the solution at λ. The strong sequential bound is 2λ − λ_0 = λ − (λ_0 − λ), and we can think of the extra term λ_0 − λ as a buffer to account for the fact that |x_j^T r| may increase as we move from λ_0 to λ. Note also that as λ → λ_0, the strong sequential rule becomes the KKT condition (13), so that in effect the sequential rule at λ_0 "anticipates" the KKT conditions at λ.

In summary, it turns out that the key slope condition (9) very often holds, but can be violated for short stretches, especially when p ≈ N and for small values of λ in the "overfit" regime of a lasso problem. In the next section we provide an example that shows a violation of the slope bound (9), which breaks the strong sequential rule (6). We also give a condition on the design matrix X under which the bound (9) is guaranteed to hold. However, in simulations in that section, we find that these violations are rare in practice and virtually non-existent when p ≫ N.

[Figure 4: the inner product c_j = x_j^T(y − Xβ̂) for one predictor, plotted against λ, with λ and λ_max marked.]

Figure 4:
Illustration of the slope bound (9) leading to the strong rule (6). The inner product c_j is plotted in red as a function of λ, restricted to only one predictor for simplicity. The slope of c_j between λ_max and λ is bounded in absolute value by 1, so the most it can rise over this interval is λ_max − λ. Therefore, if it starts below λ − (λ_max − λ) = 2λ − λ_max, it cannot possibly reach the critical level by λ.

4 Violations of the strong rules

Here we demonstrate a counter-example to both the slope bound (9) and the strong sequential rule (6). We believe that a counter-example for the basic strong rule (5) can also be constructed, but we have not yet found one. Such an example is somewhat more difficult to construct because it would require that the average slope exceed 1 from λ_max to λ, rather than exceeding 1 over short stretches of λ values.

We took N = 50 and p = 30, with the entries of y and X drawn independently from a standard normal distribution. Then we centered y and the columns of X, and standardized the columns of X. As Figure 5 shows, the slope of c_j(λ) = x_j^T {y − Xβ̂(λ)} exceeds 1 in absolute value for all λ ∈ [λ_1, λ_0], for the predictor j = 2. Moreover, if we were to use the solution at λ_0 to eliminate predictors for the fit at λ_1, then we would eliminate the 2nd predictor based on the bound (6). But this is clearly a problem, because the 2nd predictor enters the model at λ_1. By continuity, we can choose λ_1 and λ_0 in small intervals around these values and still break the strong sequential rule.

Tibshirani & Taylor (2010) prove a general result that can be used to give the following sufficient condition for the unit slope bound (9). Under this condition, both the basic and the sequential strong rules are guaranteed not to fail. Recall that a matrix A is diagonally dominant if |A_ii| ≥ Σ_{j≠i} |A_ij| for all i. Their result gives us the following:

Theorem 1.
Suppose that X is N × p, with N ≥ p, and of full rank. If

    (X^T X)^{−1} is diagonally dominant,    (14)

then the slope bound (9) holds at all points where c_j(λ) is differentiable, for j = 1, . . . , p, and hence the strong rules (5), (6) never produce violations.

Proof. Tibshirani & Taylor (2010) consider a generalized lasso problem

    argmin_β (1/2)‖y − Xβ‖₂² + λ‖Dβ‖₁,    (15)

where D is a general m × p penalty matrix. In the proof of their "boundary lemma", Lemma 1, they show that if rank(X) = p and D(X^T X)^{−1} D^T is diagonally dominant, then the dual solution û(λ) corresponding to problem (15) satisfies

    |û_j(λ) − û_j(λ_0)| ≤ |λ − λ_0|

for any j = 1, . . . , m and λ, λ_0. By piecewise linearity of û_j(λ), this means that |û_j′(λ)| ≤ 1, except at the kink points. Furthermore, when D = I, problem (15) is simply the lasso, and it turns out that the dual solution û_j(λ) is exactly the inner product c_j(λ) = x_j^T {y − Xβ̂(λ)}. This proves the slope bound (9) under the condition that (X^T X)^{−1} is diagonally dominant.

Finally, the kink points are countable and hence form a set of Lebesgue measure zero. Therefore c_j(λ) is differentiable almost everywhere, and the integrals in (10) and (11) make sense. This proves that the strong rules (5) and (6) never fail.

We note a similarity between condition (14) and the positive cone condition used in Efron et al. (2004). It is not hard to see that the positive cone condition implies (14), and actually (14) is a much weaker condition, because it doesn't require looking at every possible subset of columns.

[Figure 5: inner products c_j = x_j^T(y − Xβ̂) plotted against λ over a short stretch of the path, with vertical lines at λ_1 and λ_0.]

Figure 5:
Example of a violation of the slope bound (9), which breaks the strong sequential rule (6). The entries of y and X were generated as independent, standard normal random variables with N = 50 and p = 30. (Hence there is no underlying signal.) The lines with slopes ±1 are the envelope of maximal inner products achieved by predictors in the model for each λ. For clarity we only show a short stretch of the solution path. The rightmost vertical line is drawn at λ_0, and we are considering the new value λ_1 < λ_0, the vertical line to its left. The horizontal line is the bound in (6). In the top right part of the plot, the inner product path for the predictor j = 2 is drawn in red; it starts below the bound, but enters the model at λ_1. The slope of the red segment between λ_1 and λ_0 exceeds 1 in absolute value. A dotted line with slope −1 is drawn beside the red segment for reference.
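Condition (14) is straightforward to check numerically for a given design. The sketch below is our own illustration, not from the paper:

```python
import numpy as np

def satisfies_condition_14(X, tol=1e-12):
    """Check condition (14): is (X^T X)^{-1} diagonally dominant?
    Assumes X is N x p with N >= p and full column rank."""
    M = np.linalg.inv(X.T @ X)
    diag = np.abs(np.diag(M))
    off_sums = np.abs(M).sum(axis=1) - diag    # sum over j != i of |M_ij|
    return bool(np.all(diag + tol >= off_sums))
```

An orthonormal design passes trivially (here M = I), as do the two examples discussed next.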
A simple model in which diagonal dominance holds is when the columns of X are orthonormal, because then X^T X = I. But the diagonal dominance condition (14) certainly holds outside of the orthogonal design case. We give two such examples below.

• Equi-correlation model.
Suppose that ‖x_j‖₂ = 1 for all j, and x_j^T x_k = r for all j ≠ k. Then the inverse of X^T X is

    (X^T X)^{−1} = (1/(1 − r)) · [ I − (r/(1 + r(p − 1))) · 11^T ],

where 1 is the vector of all ones. This is diagonally dominant as long as r ≥ 0.

• Haar basis model.
Suppose that the columns of X form a Haar basis, the simplest example being

    X = [ 1 0 0 ... 0
          1 1 0 ... 0
          . . .
          1 1 1 ... 1 ],    (16)

the lower triangular matrix of ones. Then (X^T X)^{−1} is diagonally dominant. This arises, for example, in the one-dimensional fused lasso, where we solve

    argmin_β (1/2) Σ_{i=1}^N (y_i − β_i)² + λ Σ_{i=2}^N |β_i − β_{i−1}|.

If we transform this problem to the parameters α_1 = β_1 and α_i = β_i − β_{i−1} for i = 2, . . . , N, then we get a lasso with design X as in (16).

The slope bound (9) possesses an interesting connection to a concept called the "irrepresentable condition". Let us write A for the set of active variables at λ,

    A = {j : β̂_j(λ) ≠ 0},

and ‖b‖_∞ = max_i |b_i| for a vector b. Then, using the work of Efron et al. (2004), we can express the slope condition (9) as

    ‖X_{A^c}^T X_A (X_A^T X_A)^{−1} sign(β̂_A)‖_∞ ≤ 1,    (17)

where by X_A^T and X_{A^c}^T we really mean (X_A)^T and (X_{A^c})^T, and the sign is applied element-wise.

On the other hand, a common condition appearing in work on the model selection properties of the lasso, in both the finite-sample and asymptotic settings, is the so-called "irrepresentable condition" (Zhao & Yu 2006, Meinshausen & Bühlmann 2006, Wainwright 2006), which is closely related to the concept of "mutual incoherence" (Fuchs 2005, Tropp 2006). Roughly speaking, if β_T denotes the nonzero coefficients in the true model, then the irrepresentable condition is that

    ‖X_{T^c}^T X_T (X_T^T X_T)^{−1} sign(β_T)‖_∞ ≤ 1 − ε    (18)

for some 0 < ε ≤ 1. As T is associated with the true model, we can put a probability distribution on it and a probability distribution on sign(β_T), and then show that, with high probability, certain designs X are mutually incoherent (18). For example, Candes & Plan (2009) suppose that k is sufficiently small, that T is drawn from the uniform distribution on k-sized subsets of {1, . . . , p}, and that each entry of sign(β_T) is equal to +1 or −1 with probability 1/2, independently of each other. Under this model, they show that designs X with max_{j≠k} |x_j^T x_k| = O(1/log p) satisfy the irrepresentable condition with very high probability.

Unfortunately the same types of arguments cannot be applied directly to (17). A distribution on T and sign(β_T) induces a different distribution on A and sign(β̂_A), via the lasso optimization procedure. Even if the distributions of T and sign(β_T) are very simple, the distributions of A and sign(β̂_A) can become quite complicated. Still, it does not seem hard to believe that confidence in (18) translates to some amount of confidence in (17). Luckily for us, we do not need the slope bound (17) to hold exactly or with any specified level of probability, because we are using it as a computational tool and can simply revert to checking the KKT conditions when it fails.

We generated Gaussian data with N = 100, varying values of the number of predictors p, and pairwise correlation 0.5 between the predictors. One quarter of the coefficients were non-zero, with the indices of the nonzero predictors randomly chosen and their values equal to ±2. We fit the lasso for 80 equally spaced values of λ from λ_max to 0, and recorded the number of violations of the strong sequential rule. Figure 6 shows the number of violations (out of p predictors) averaged over 100 simulations; we plot against the percent variance explained instead of λ, since the former is more meaningful. Since the vertical axis is the total number of violations, we see that violations are quite rare in general, never averaging more than 0.3 out of p predictors. They are more common near the right end of the path. They also tend to occur when p is fairly close to N. When p ≫ N (p = 500 or 1000 here), there were no violations. Not surprisingly, then, there were no violations in the numerical examples in this paper, since they all have p ≫ N.

Looking at (13), one might expect that taking a finer grid of λ values should produce fewer violations of the rule. However, we have not found this to be the case: the number of violations as a function of λ stays about the same.
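A simulation in this spirit is easy to set up. The sketch below is our own illustration (using scikit-learn's lasso_path rather than glmnet, and not the paper's exact design); note that scikit-learn minimizes (1/(2N))‖y − Xβ‖₂² + α‖β‖₁, so α = λ/N matches the objective in (1):

```python
import numpy as np
from sklearn.linear_model import lasso_path

rng = np.random.default_rng(0)
N, p = 100, 50
X = rng.standard_normal((N, p))
beta_true = np.zeros(p)
nz = rng.choice(p, p // 4, replace=False)            # a quarter of coefficients
beta_true[nz] = 2 * rng.choice([-1.0, 1.0], p // 4)  # values +/- 2, as in the text
y = X @ beta_true + rng.standard_normal(N)
Xc = X - X.mean(axis=0)
X = Xc / np.linalg.norm(Xc, axis=0)                  # center and standardize
y = y - y.mean()

lam_max = np.max(np.abs(X.T @ y))
grid = np.linspace(lam_max, 0.01 * lam_max, 80)      # decreasing grid
alphas, coefs, _ = lasso_path(X, y, alphas=grid / N)
lams = alphas * N                                    # aligned with coefs columns

violations = 0
for k in range(1, len(lams)):
    lam0, lam = lams[k - 1], lams[k]
    r = y - X @ coefs[:, k - 1]                      # residual at lam0
    discard = np.abs(X.T @ r) < 2 * lam - lam0       # sequential rule (6)
    violations += int(np.sum(discard & (coefs[:, k] != 0)))
print("violations of rule (6) along the path:", violations)
```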
[Figure 6: total number of violations plotted against percent variance explained, with curves for p = 20, 50, 100, 500 and 1000.]

Figure 6: Total number of violations (out of p predictors) of the strong sequential rule, for simulated data with N = 100 and varying values of p. A sequence of models is fit, with decreasing values of λ as we move from left to right. The features are uncorrelated. The results are averages over 100 simulations.

5 The elastic net

In the elastic net we solve the problem

    minimize_β (1/2)‖y − Xβ‖₂² + (1/2)λ₂‖β‖₂² + λ₁‖β‖₁.    (19)

(This differs from the original form of the "naive" elastic net in Zou & Hastie (2005) by the factors of 1/2, just for notational convenience.) Letting

    X* = [ X ; √λ₂ · I ],   y* = [ y ; 0 ]    (20)

(stacking rows), we can write (19) as

    minimize_β (1/2)‖y* − X*β‖₂² + λ₁‖β‖₁.    (21)

In this form we can apply the SAFE rule (2) to obtain a rule for discarding predictors. Now |x*_j^T y*| = |x_j^T y|, ‖x*_j‖₂ = √(‖x_j‖₂² + λ₂), and ‖y*‖₂ = ‖y‖₂. Hence the global rule for discarding predictor j is

    |x_j^T y| < λ₁ − ‖y‖₂ √(‖x_j‖₂² + λ₂) · (λ_max − λ₁)/λ_max.    (22)

Note that the glmnet package uses the parametrization ((1 − α)λ, αλ) rather than (λ₂, λ₁). With this parametrization the basic SAFE rule has the form

    |x_j^T y| < αλ − ‖y‖₂ √(‖x_j‖₂² + (1 − α)λ) · (λ_max − λ)/λ_max.    (23)

The strong screening rules turn out to be the same as for the lasso. With the glmnet parametrization, the global rule is simply

    |x_j^T y| < α(2λ − λ_max),    (24)

while the sequential rule is

    |x_j^T r| < α(2λ − λ_0).    (25)

Figure 7 shows results for the elastic net with standard independent Gaussian data, N = 100, p = 1000, for three values of α. There were no violations in any of these figures, i.e. no predictor was discarded that had a non-zero coefficient at the actual solution. Again we see that the strong sequential rule performs extremely well, leaving only a small number of excess predictors at each stage.
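In the glmnet parametrization, rules (24) and (25) are again one-liners. A minimal sketch (our own illustration; in particular, taking λ_max = max_j |x_j^T y| / α as the path's entry point is our reading of (24), not something stated explicitly above):

```python
import numpy as np

def enet_strong_global(X, y, alpha, lam):
    """Elastic net global strong rule (24), glmnet parametrization.
    Assumes lam_max = max_j |x_j^T y| / alpha (our reading)."""
    scores = np.abs(X.T @ y)
    lam_max = scores.max() / alpha
    return scores < alpha * (2 * lam - lam_max)

def enet_strong_seq(X, r, alpha, lam0, lam):
    """Elastic net sequential strong rule (25); r is the residual at lam0."""
    return np.abs(X.T @ r) < alpha * (2 * lam - lam0)
```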
[Figure 7: three panels, one for each of three values of the mixing parameter α, plotting the number of predictors left after filtering against the number of predictors in the model, with curves for SAFE, strong/global and strong/seq.]

Figure 7: Elastic net: results for different rules for three different values of the mixing parameter α. In the plots, we are fitting along a path of decreasing λ values, and the plots show the number of predictors left after screening at each stage. The proportion of variance explained by the model is shown along the top of the plot. There were no violations of any of the rules in the three scenarios.

6 Screening rules for logistic regression
Here we have a binary response y_i ∈ {0, 1}, and we model

    Pr(Y = 1|x) = 1/(1 + exp(−β_0 − x^T β)).    (26)

Letting p_i = Pr(Y = 1|x_i), the penalized negative log-likelihood is

    ℓ(β_0, β) = −Σ_i [y_i log p_i + (1 − y_i) log(1 − p_i)] + λ‖β‖₁.    (27)

El Ghaoui et al. (2010) derive an exact global rule for discarding predictors, based on the inner products between y and each predictor, using the same kind of dual argument as in the Gaussian case. Here we investigate the analogue of the strong rules (5) and (6). The subgradient equation for logistic regression is

    X^T (y − p(β)) = λ · sign(β).    (28)

This leads to the global rule: letting p̄ = ȳ1 and λ_max = max_j |x_j^T (y − p̄)|, we discard predictor j if

    |x_j^T (y − p̄)| < 2λ − λ_max.    (29)

The sequential version, starting at λ_0, uses p = p(β̂_0(λ_0), β̂(λ_0)): we discard predictor j if

    |x_j^T (y − p)| < 2λ − λ_0.    (30)

Figure 8 shows the result of these rules in an example, the newsgroup document classification problem (Lang 1995). We used the training set cultured from these data by Koh et al. (2007). The response is binary, and indicates a subclass of topics; the predictors are binary, and indicate the presence of particular tri-gram sequences. The predictor matrix has 0.05% nonzero values; the dataset is available as a saved R data object.

Results are shown for the new global rule (29) and the new sequential rule (30). We were unable to compute the logistic regression global SAFE rule for this example using our R language implementation, as this had a very long computation time. But in smaller examples it performed much like the global SAFE rule in the Gaussian case. Again we see that the strong sequential rule (30), after computing the inner product of the residuals with all predictors at each stage, allows us to discard the vast majority of the predictors before fitting. There were no violations of either rule in this example.

Some approaches to penalized logistic regression, such as the glmnet package, use a weighted least squares iteration within a Newton step. For these algorithms, an alternative approach to discarding predictors would be to apply one of the Gaussian rules within the weighted least squares iteration.
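A sketch of rules (29) and (30) in Python (our own illustration; `beta0_hat` and `beta_hat` denote the fitted intercept and coefficient vector at λ_0):

```python
import numpy as np

def logistic_strong_global(X, y, lam):
    """Global rule (29) for l1-penalized logistic regression."""
    p_bar = np.full(y.shape, y.mean())          # null-model probabilities
    scores = np.abs(X.T @ (y - p_bar))
    lam_max = scores.max()
    return scores < 2 * lam - lam_max           # True => discard

def logistic_strong_seq(X, y, beta0_hat, beta_hat, lam0, lam):
    """Sequential rule (30), given the fit (beta0_hat, beta_hat) at lam0."""
    p_hat = 1.0 / (1.0 + np.exp(-(beta0_hat + X @ beta_hat)))
    return np.abs(X.T @ (y - p_hat)) < 2 * lam - lam0
```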
[Figure 8: Newsgroup data: number of predictors left after filtering plotted against the number of predictors in the model, with curves for strong/global and strong/seq.]
Figure 8:
Logistic regression: results for the newsgroup example, using the new global rule (29) and the new sequential rule (30). The broken black curve is the 45° line, drawn on the log scale.
Wu et al. (2009) used |x_j^T (y − p̄)| to screen predictors (SNPs) in genome-wide association studies, where the number of variables can exceed a million. Since they only anticipated models with, say, k < 15 terms, they selected a small multiple, say 10k, of SNPs and computed the lasso solution path down to k terms. All the screened SNPs could then be checked for violations to verify that the solution found was global.

7 Strong rules for general problems

Suppose that we have a convex problem of the form

    minimize_β [ f(β) + λ · Σ_{k=1}^K g(β_k) ],    (31)

where f and g are convex functions, f is differentiable, and β = (β_1, β_2, . . . , β_K), with each β_k being a scalar or vector. Suppose further that the subgradient equation for this problem has the form

    ∇_{β_k} f(β) + λ · s_k = 0,   k = 1, 2, . . . , K,    (32)

where each subgradient variable s_k satisfies ‖s_k‖_q ≤ A, and ‖s_k‖_q = A whenever β̂_k ≠ 0 (here ‖·‖_q is a norm). Suppose that we have two values λ < λ_0, and corresponding solutions β̂(λ_0), β̂(λ). Then, following the same logic as in Section 3, we can derive the general strong rule: discard β_k if

    ‖∇_{β_k} f(β̂(λ_0))‖_q < (1 + A)λ − Aλ_0.    (33)

This can be applied either globally or sequentially. In the lasso regression setting, it is easy to check that this reduces to the rules (5), (6) with A = 1.

The rule (33) has many potential applications. For example, in the graphical lasso for sparse inverse covariance estimation (Friedman et al. 2007), we observe N multivariate normal observations of dimension p, with mean 0 and covariance Σ, with observed empirical covariance matrix S. Letting Θ = Σ^{−1}, the problem is to maximize the penalized log-likelihood

    log det Θ − tr(SΘ) − λ‖Θ‖₁    (34)

over non-negative definite matrices Θ. The penalty ‖Θ‖₁ sums the absolute values of the entries of Θ; we assume that the diagonal is not penalized. The subgradient equation is

    Σ − S − λ · Γ = 0,    (35)

where Γ_ij ∈ sign(Θ_ij). One could apply the rule (33) elementwise, and this would be useful for an optimization method that operates elementwise. This gives a rule of the form |S_ij − Σ̂_ij(λ_0)| < 2λ − λ_0. However, the graphical lasso algorithm proceeds in a blockwise fashion, optimizing one whole row and column at a time.
[Figure 9: number of rows/columns left after filtering plotted against the number of rows/columns in the model, with curves for strong/global and strong/seq.]

Figure 9:
Strong global and sequential rules applied to the graphical lasso. A broken line with unit slope is added for reference.

Hence for the graphical lasso, it is more effective to discard entire rows and columns at once. For each row i, let s, σ and Γ denote S_{i,−i}, Σ_{i,−i} and Γ_{i,−i}, respectively. Then the subgradient equation for one row has the form

    σ − s − λ · Γ = 0.    (36)

Now given two values λ < λ_0, and the solution Σ̂ at λ_0, we form the sequential rule: discard row i if

    max |σ̂ − s| < 2λ − λ_0.    (37)

If this rule is satisfied, we discard the entire ith row and column of Θ, and hence set them to zero (but retain the ith diagonal element). Figure 9 shows an example with N = 100, p = 300, and standard independent Gaussian variates. No violations of the rule occurred.

Finally, we note that strong rules can be derived in a similar way for other problems, such as the group lasso (Yuan & Lin 2007). In particular, if X_ℓ denotes the N × p_ℓ block of the design matrix corresponding to the features in the ℓth group, then the strong sequential rule is simply

    ‖X_ℓ^T r‖₂ < 2λ − λ_0.

When this holds, we set β̂_ℓ = 0.
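As an illustration of the general rule (33) with A = 1 and ‖·‖_q = ‖·‖₂, here is a sketch of the group lasso and graphical lasso screens in Python (our own code, assuming groups are given as lists of column indices):

```python
import numpy as np

def group_strong_seq(X, r, groups, lam0, lam):
    """Group lasso sequential strong rule: discard group l if
    ||X_l^T r||_2 < 2*lam - lam0, where r is the residual at lam0."""
    return [np.linalg.norm(X[:, g].T @ r) < 2 * lam - lam0 for g in groups]

def glasso_row_discard(S, Sigma_hat, lam0, lam):
    """Row/column screen (37) for the graphical lasso: entry i is True if
    row and column i of Theta can be set to zero (diagonal retained)."""
    p = S.shape[0]
    out = np.zeros(p, dtype=bool)
    for i in range(p):
        mask = np.arange(p) != i          # off-diagonal entries of row i
        out[i] = np.max(np.abs(Sigma_hat[i, mask] - S[i, mask])) < 2 * lam - lam0
    return out
```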
8 Implementation and numerical studies

The strong sequential rule (6) can be used to provide potential speed improvements in convex optimization problems. Generically, given a solution β̂(λ_0) and considering a new value λ < λ_0, let S(λ) be the indices of the predictors that survive the screening rule (6): we call this the strong set. Denote by E the eligible set of predictors. Then a useful strategy would be:

1. Set E = S(λ).

2. Solve the problem at value λ using only the predictors in E.

3. Check the KKT conditions at this solution for all predictors. If there are no violations, we are done. Otherwise add the predictors that violate the KKT conditions to the set E, and repeat steps 2 and 3.

Depending on how the optimization is done in step 2, this can be quite effective. Now in the glmnet procedure, coordinate descent is used, with warm starts over a grid of decreasing values of λ. In addition, an "ever-active" set of predictors A(λ) is maintained, consisting of the indices of all predictors that have had a non-zero coefficient for some λ′ greater than the current value λ under consideration. The solution is first found for this active set; then the KKT conditions are checked for all predictors. If there are no violations, then we have the solution at λ; otherwise we add the violators into the active set and repeat.

The two strategies above are very similar, with one using the strong set S(λ) and the other using the ever-active set A(λ). Figure 10 shows the active and strong sets for an example. Although the strong rule greatly reduces the total number of predictors, it contains more predictors than the ever-active set; accordingly, violations occur more often for the ever-active set than for the strong set. This effect is due to the high correlation between features and the fact that the signal variables have coefficients of the same sign. It also occurs with logistic regression with lower correlations, say 0.2.

In light of this, we find that using both A(λ) and S(λ) can be advantageous. For glmnet we adopt the following combined strategy (see the sketch after this list):

1. Set E = A(λ).

2. Solve the problem at value λ using only the predictors in E.

3. Check the KKT conditions at this solution for all predictors in S(λ). If there are violations, add these predictors into E, and go back to step 2 using the current solution as a warm start.

4. Check the KKT conditions for all predictors. If there are no violations, we are done. Otherwise add these violators into A(λ), recompute S(λ), and go back to step 2 using the current solution as a warm start.

Note that violations in step 3 are fairly common, while those in step 4 are rare. Hence the fact that the size of S(λ) is ≪ p can make this an effective strategy.
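Here is a sketch of this combined strategy in Python (our own illustration, not glmnet's implementation; `cd_solve` is a hypothetical inner solver, e.g. coordinate descent, minimizing (1) over a restricted set of predictors from a warm start):

```python
import numpy as np

def lasso_path_strong(X, y, lams, cd_solve, tol=1e-9):
    """Combined strategy of Section 8, over a decreasing grid `lams`.
    cd_solve(X, y, lam, beta, E) is a hypothetical solver updating beta
    over the predictors in E only, warm-started at beta. For simplicity,
    step 4 below does not recompute the strong set after adding violators."""
    p = X.shape[1]
    beta = np.zeros(p)
    ever_active = np.zeros(p, dtype=bool)
    lam0 = np.max(np.abs(X.T @ y))                   # lambda_max: beta = 0 there
    sols = []
    for lam in lams:
        # strong set S(lam) from rule (6), using the solution at previous lam0
        strong = np.abs(X.T @ (y - X @ beta)) >= 2 * lam - lam0
        E = ever_active.copy()                       # step 1: E = A(lam)
        while True:
            beta = cd_solve(X, y, lam, beta, E)      # step 2: solve on E
            grad = np.abs(X.T @ (y - X @ beta))
            viol = strong & ~E & (grad > lam + tol)  # step 3: check S(lam)
            if not viol.any():
                viol = ~E & (grad > lam + tol)       # step 4: check everything
            if not viol.any():
                break
            E |= viol                                # add violators, warm start
        ever_active |= beta != 0
        sols.append(beta.copy())
        lam0 = lam
    return sols
```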
[Figure 10: two panels (the right a zoomed version of the left) plotting the number of predictors against the number of predictors in the model, with curves for the ever-active set and strong/seq.]

Figure 10:
Gaussian lasso setting, N = 200, p = 20,000, with high pairwise correlation between the features. The first 50 predictors have positive, decreasing coefficients. Shown are the number of predictors left after applying the strong sequential rule (6) and the number that have ever been active (i.e., have had a non-zero coefficient in the solution) for values of λ larger than the current value. A broken line with unit slope is added for reference. The right-hand plot is a zoomed version of the left plot.
We implemented this strategy and compared it to the standard glmnet algorithm in a variety of problems, shown in Tables 1–3. Details are given in the table captions. We see that the new strategy offers a speedup factor of five or more in some cases, and never seems to slow things down.

The strong sequential rules also have the potential for space savings. With a large dataset, one could compute the inner products {x_j^T r}_{j=1}^p offline to determine the strong set of predictors, and then carry out the intensive optimization steps in memory using just this subset of the predictors.

9 Discussion

In this paper we have proposed strong global and sequential rules for discarding predictors in statistical convex optimization problems such as the lasso. When combined with checks of the KKT conditions, these can offer substantial improvements in speed while still yielding the exact solution. We plan to include these rules in a future version of the glmnet package.

The RECSAFE method uses the solution at a given point λ_0 to derive a rule for discarding predictors at λ < λ_0. Here is another way to (potentially) apply the SAFE rule in a sequential manner. Suppose that we have β̂_0 = β̂(λ_0) and r = y − Xβ̂_0, and we consider the fit at λ < λ_0. Defining

    λ_0 = max_j (|x_j^T r|)    (38)

(which agrees with the KKT conditions at λ_0), we discard predictor j if

    |x_j^T r| < λ − ‖r‖₂ ‖x_j‖₂ (λ_0 − λ)/λ_0.    (39)

We have been unable to prove the correctness of this rule, and do not know if it is infallible. At the same time, we have not been able to find a numerical example in which it fails.
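Rule (39) is just the basic SAFE rule (2) with y replaced by the residual at λ_0. A minimal sketch (our own illustration, with the same caveat as above that this rule is unproven):

```python
import numpy as np

def seq_safe_discard(X, r, lam):
    """Candidate sequential SAFE rule (39); r is the residual at lam0,
    and lam0 = max_j |x_j^T r| by (38)."""
    scores = np.abs(X.T @ r)
    lam0 = scores.max()
    thresh = lam - np.linalg.norm(r) * np.linalg.norm(X, axis=0) * (lam0 - lam) / lam0
    return scores < thresh
```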
Acknowledgements: We thank Stephen Boyd for his comments, and Laurent El Ghaoui and his co-authors for sharing their paper with us before publication and for helpful feedback on their work. The first author was supported by National Science Foundation Grant DMS-9971405 and National Institutes of Health Contract N01-HV-28183.
References
Candes, E. J. & Plan, Y. (2009), 'Near-ideal model selection by ℓ1 minimization', Annals of Statistics 37(5), 2145–2177.

Efron, B., Hastie, T., Johnstone, I. & Tibshirani, R. (2004), 'Least angle regression', Annals of Statistics 32(2), 407–499.

El Ghaoui, L., Viallon, V. & Rabbani, T. (2010), Safe feature elimination in sparse supervised learning, Technical Report UC/EECS-2010-126, EECS Dept., University of California at Berkeley.

Fan, J. & Lv, J. (2008), 'Sure independence screening for ultra-high dimensional feature space', Journal of the Royal Statistical Society Series B, to appear.

Friedman, J., Hastie, T., Hoefling, H. & Tibshirani, R. (2007), 'Pathwise coordinate optimization', Annals of Applied Statistics 1(2), 302–332.

Fuchs, J. (2005), 'Recovery of exact sparse representations in the presence of noise', IEEE Transactions on Information Theory 51(10), 3601–3608.

Koh, K., Kim, S.-J. & Boyd, S. (2007), 'An interior-point method for large-scale l1-regularized logistic regression', Journal of Machine Learning Research 8, 1519–1555.

Lang, K. (1995), Newsweeder: Learning to filter netnews, in 'Proceedings of the Twelfth International Conference on Machine Learning (ICML)', pp. 331–339.

Meinshausen, N. & Bühlmann, P. (2006), 'High-dimensional graphs and variable selection with the lasso', Annals of Statistics 34, 1436–1462.

Tibshirani, R. (1996), 'Regression shrinkage and selection via the lasso', Journal of the Royal Statistical Society Series B 58(1), 267–288.

Tibshirani, R. J. & Taylor, J. (2010), The solution path of the generalized lasso. Submitted.

Tropp, J. (2006), 'Just relax: Convex programming methods for identifying sparse signals in noise', IEEE Transactions on Information Theory 52(3), 1030–1051.

Wainwright, M. (2006), Sharp thresholds for high-dimensional and noisy sparsity recovery using ℓ1-constrained quadratic programming (lasso), Technical report, Statistics and EECS Depts., University of California at Berkeley.

Wu, T. T., Chen, Y. F., Hastie, T., Sobel, E. & Lange, K. (2009), 'Genomewide association analysis by lasso penalized logistic regression', Bioinformatics 25(6), 714–721.

Yuan, M. & Lin, Y. (2007), 'Model selection and estimation in regression with grouped variables', Journal of the Royal Statistical Society, Series B 68(1), 49–67.

Zhao, P. & Yu, B. (2006), 'On model selection consistency of the lasso', Journal of Machine Learning Research 7, 2541–2563.

Zou, H. & Hastie, T. (2005), 'Regularization and variable selection via the elastic net', Journal of the Royal Statistical Society Series B 67(2), 301–320.

Table 1: Glmnet timings (seconds) for fitting a lasso problem in the Gaussian setting. In the first four columns, there are p = 100,000 predictors and N = 200 observations, with 30 nonzero coefficients having the same value and alternating signs; signal-to-noise ratio equal to 3. In the rightmost column, the data matrix is sparse, consisting of just zeros and ones, with a small fraction of the values equal to 1; there are p = 50,000 predictors and N = 500 observations, with 25% of the coefficients nonzero, having a Gaussian distribution; signal-to-noise ratio equal to 4.3.

                        Population correlation
    Method              0.0    0.25   0.5    0.75   Sparse
    glmnet              4.07   6.13   9.50   17.70  4.14
    with seq-strong     2.50   2.54   2.62   2.98   2.52
Table 2: Glmnet timings (seconds) for fitting an elastic net problem, by mixing parameter α. There are p = 100,000 predictors and N = 200 observations, with 30 nonzero coefficients having the same value and alternating signs; signal-to-noise ratio equal to 3.

Table 3: Glmnet timings (seconds) for fitting a lasso-penalized logistic regression problem. Here the data matrix is sparse, consisting of just zeros and ones, with a small fraction of the values equal to 1. There are p = 50,000 predictors and N = 800 observations, with 30% of the coefficients nonzero, with the same value and signs alternating; Bayes error equal to 3%.