Screening Rules for Lasso with Non-Convex Sparse Regularizers
Alain Rakotomamonjy, Gilles Gasso, Joseph Salmon
Université Rouen Normandie, Criteo AI Lab, INSA Rouen Normandie, Université de Montpellier
Correspondence to: alain.rakotomamonjy <fi[email protected]>
February 20, 2019

Abstract
Leveraging the convexity of the Lasso problem, screening rules help accelerate solvers by discarding irrelevant variables during the optimization process. However, because they provide better theoretical guarantees for identifying relevant variables, several non-convex regularizers for the Lasso have been proposed in the literature. This work is the first to introduce a screening rule strategy into a non-convex Lasso solver. The approach we propose is based on an iterative majorization-minimization (MM) strategy that includes a screening rule in the inner solver and a condition for propagating screened variables between MM iterations. In addition to improving the efficiency of solvers, we also provide guarantees that the inner solver identifies the zero components of its critical point in finite time. Our experimental analysis illustrates the significant computational gain brought by the new screening rule compared to classical coordinate descent or proximal gradient descent methods.
1. Introduction
Sparsity-inducing penalties are classical tools in statistical machine learning, especially in settings where the data available for learning is scarce and high-dimensional. In addition, when the solution of the learning problem is known to be sparse, using those penalties yields models that can leverage this prior knowledge. The
Lasso (Tibshirani, 1996) and the
Basis Pursuit (Chen et al., 2001; Chen & Donoho, 1994) were the first approaches that employed the ℓ1-norm penalty for inducing sparsity. While the Lasso has had a great impact on the machine learning and signal processing communities, with several success stories (Shevade & Keerthi, 2003; Donoho, 2006; Lustig et al., 2008; Ye & Liu, 2012), it also comes with some theoretical drawbacks (e.g., biased estimates of the large coefficients of the model). Hence, several authors have proposed non-convex penalties that better approximate the ℓ0 (pseudo)norm, the latter being the original measure of sparsity, although it leads to an NP-hard learning problem. The most commonly used non-convex penalties are the Smoothly Clipped Absolute Deviation (SCAD) (Fan & Li, 2001), the
Log Sum Penalty (LSP) (Candès et al., 2008), the capped-ℓ1 penalty (Zhang, 2010), and the Minimax Concave Penalty (MCP) (Zhang et al., 2010). We refer the interested reader to Soubies et al. (2017) for a discussion of the pros and cons of such non-convex formulations. From an optimization point of view, the
Lasso can benefit from a large palette of algorithms, ranging from block-coordinate descent (Friedman et al., 2007; Fu, 1998) to (accelerated) proximal gradient approaches (Beck & Teboulle, 2009). In addition, by leveraging convex optimization theory, some of these algorithms can be further accelerated by combining them with sequential or dynamic safe screening rules (El Ghaoui et al., 2012; Bonnefoy et al., 2015; Fercoq et al., 2015), which allow some useless variables to be safely set to zero before terminating (or even sometimes before starting) the algorithm. While learning sparse models with non-convex penalties is seemingly more challenging, block-coordinate descent (Breheny & Huang, 2011; Mazumder et al., 2011) or iterative shrinkage-thresholding (Gong et al., 2013) algorithms can be extended to those learning problems. Another popular way of handling non-convexity is the majorization-minimization (MM) principle (Hunter & Lange, 2004), which consists in iteratively minimizing a majorization of a (non-convex) objective function. When applied to non-convex sparsity-enforcing penalties, the MM scheme leads to solving a sequence of weighted Lasso problems (Candès et al., 2008; Gasso et al., 2009; Mairal, 2013).

In this paper, we propose a screening strategy that can be applied when dealing with non-convex penalties. As far as we know, this work is the first attempt in that direction. For that, we consider an MM framework, and in this context our contributions are:
• the definition of an MM algorithm that produces a sequence of iterates known to converge towards a critical point of our non-convex learning problem,
• the proposition of a duality-gap-based screening rule for the weighted Lasso, which is the core algorithm of our MM framework,
• the introduction of conditions allowing screened variables to be propagated from one MM iteration to the next,
• an empirical demonstration that our screening strategy indeed improves the running time of MM algorithms with respect to block-coordinate descent or proximal gradient descent methods.
2. Global non-convex framework
We introduce the problem we are interested in and its first-order optimality conditions, and we propose an MM approach for its resolution.
We consider solving the least-squares regression problem with a generic penalty of the form
$$\min_{w \in \mathbb{R}^d} \ \tfrac{1}{2}\|y - Xw\|_2^2 + \sum_{j=1}^d r_\lambda(|w_j|), \qquad (1)$$
where y ∈ R^n is the target vector, X = [x_1, …, x_d] ∈ R^{n×d} is the design matrix with column-wise features x_j, w is the coefficient vector of the model, and the map r_λ : R_+ → R_+ is concave and differentiable on [0, +∞), with a regularization parameter λ > 0. In addition, we assume that r_λ(|·|) is a lower semi-continuous function. Note that most penalty functions, such as SCAD, MCP or log sum (see their definitions in Table 1), satisfy this property.
We rely on tools such as Fréchet subdifferentials and limiting subdifferentials (Kruger, 2003; Rockafellar & Wets, 2009; Mordukhovich et al., 2006), which are well suited for non-smooth and non-convex optimization: if a vector w* belongs to the set of minimizers (not necessarily global) of Problem (1), then the following Fermat condition holds (Clarke, 1989; Kruger, 2003):
$$X^\top (y - Xw^\star) \in \partial\Big(\sum_{j=1}^d r_\lambda(|w_j^\star|)\Big), \qquad (2)$$
with ∂r_λ(·) the Fréchet subdifferential of r_λ, assuming it exists at w*. This is in particular the case for the MCP, log-sum and SCAD penalties presented in Table 1. As an illustration, we present below the optimality conditions for MCP and log sum.
Example 1. For the MCP penalty (see Table 1 for its definition and subdifferential), it is easy to show that ∂r_λ(0) = [−λ, λ]. Hence, the Fermat condition becomes
$$\begin{cases} -x_j^\top(y - Xw^\star) = 0 & \text{if } |w_j^\star| > \lambda\theta, \\ |x_j^\top(y - Xw^\star)| \le \lambda & \text{if } w_j^\star = 0, \\ -x_j^\top(y - Xw^\star) + \lambda\,\mathrm{sign}(w_j^\star) - \tfrac{w_j^\star}{\theta} = 0 & \text{otherwise.} \end{cases} \qquad (3)$$
Example 2. For the log-sum penalty, one can explicitly compute ∂r_λ(0) = [−λ/θ, λ/θ] and leverage the smoothness of r_λ(|w|) for |w| > 0 to compute ∂r_λ(|w|). The above necessary condition then translates into
$$\begin{cases} -x_j^\top(y - Xw^\star) + \dfrac{\lambda\,\mathrm{sign}(w_j^\star)}{\theta + |w_j^\star|} = 0 & \text{if } w_j^\star \neq 0, \\[4pt] |x_j^\top(y - Xw^\star)| \le \dfrac{\lambda}{\theta} & \text{if } w_j^\star = 0. \end{cases} \qquad (4)$$
As we can see, the first-order optimality conditions lead to simple equations and inclusions.
Remark 1. There exists a critical parameter λ_max such that 0 is a critical point of the primal problem for all λ ≥ λ_max. This parameter depends on the subdifferential of r_λ at 0. For the MCP penalty we have λ_max ≜ max_j |x_j⊤ y|, and for the log-sum penalty λ_max ≜ θ max_j |x_j⊤ y|. From now on, we assume that λ ≤ λ_max to avoid such irrelevant local solutions.
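As a concrete illustration of Remark 1, λ_max can be computed directly from the data. The following minimal sketch (the helper name is ours, not from the paper) does so for the MCP and log-sum penalties.

```python
import numpy as np

def lambda_max(X, y, penalty="mcp", theta=1.0):
    """Smallest lambda for which w = 0 is a critical point of Problem (1).

    For MCP, lambda_max = max_j |x_j^T y|; for the log-sum penalty,
    lambda_max = theta * max_j |x_j^T y| (see Remark 1).
    """
    corr = np.max(np.abs(X.T @ y))
    return theta * corr if penalty == "logsum" else corr
```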
Table 1.
Common non-convex penalties with their subdifferentials. Here λ > 0 and θ > 0 (θ > 1 for MCP, θ > 2 for SCAD).

Log sum: r_λ(|w|) = λ log(1 + |w|/θ);
∂r_λ(|w|) = [−λ/θ, λ/θ] if w = 0, and {λ sign(w)/(θ + |w|)} if w ≠ 0.

MCP: r_λ(|w|) = λ|w| − w²/(2θ) if |w| ≤ λθ, and θλ²/2 if |w| > λθ;
∂r_λ(|w|) = [−λ, λ] if w = 0, {λ sign(w) − w/θ} if 0 < |w| ≤ λθ, and {0} if |w| > λθ.

SCAD: r_λ(|w|) = λ|w| if |w| ≤ λ, −(w² − 2θλ|w| + λ²)/(2(θ−1)) if λ < |w| ≤ λθ, and λ²(θ+1)/2 if |w| > λθ;
∂r_λ(|w|) = [−λ, λ] if w = 0, {λ sign(w)} if 0 < |w| ≤ λ, {(θλ sign(w) − w)/(θ−1)} if λ < |w| ≤ λθ, and {0} if |w| > λθ.

There exist several majorization-minimization algorithms for solving non-smooth and non-convex problems involving sparsity-inducing penalties (Gasso et al., 2009; Gong et al., 2013; Mairal, 2013). In this work, we focus on MM algorithms that provably converge to a critical point of Problem (1), such as those described by Kang et al. (2015). Their main mechanism is to iteratively build a majorizing surrogate of the objective function that is easier to minimize than the original learning problem. For non-convex penalties that are either fully concave or can be written as the sum of a convex and a concave function, the idea is to linearize the concave part; the next iterate is then obtained by minimizing the resulting surrogate. Since r_λ(|·|) is concave and differentiable on [0, +∞), at any iterate k we have
$$r_\lambda(|w_j|) \le r_\lambda(|w_j^k|) + r'_\lambda(|w_j^k|)\,\big(|w_j| - |w_j^k|\big).$$
To take advantage of the MM convergence properties (Kang et al., 2015), we also majorize the objective function by adding a proximal term, and our algorithm boils down to the following iterative process:
$$w^{k+1} = \arg\min_{w \in \mathbb{R}^d} \ \tfrac{1}{2}\|y - Xw\|_2^2 + \tfrac{1}{2\alpha}\|w - w^k\|_2^2 + \sum_{j=1}^d r'_\lambda(|w_j^k|)\,|w_j|, \qquad (5)$$
where α > 0 is a user-defined parameter controlling the strength of the proximal regularization. Note that as α → ∞, the above problem recovers the reweighted ℓ1 iterations investigated by Candès et al. (2008) and Gasso et al. (2009). Moreover, when using w^k = 0 (e.g., when evaluating the first λ in a path-wise fashion, for k = 0), it recovers the Elastic-net penalty (Zou & Hastie, 2005).
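For concreteness, the derivatives r'_λ that define the weights of the surrogate problem (5) can be written down directly from Table 1. The minimal sketch below (function names are ours, not from the paper) does so for the log-sum and MCP penalties.

```python
import numpy as np

def logsum_grad(w_abs, lam, theta):
    # r_lam(|w|) = lam * log(1 + |w|/theta)  ->  r'_lam(|w|) = lam / (theta + |w|)
    return lam / (theta + w_abs)

def mcp_grad(w_abs, lam, theta):
    # r_lam(|w|) = lam|w| - w^2/(2 theta) for |w| <= lam*theta, constant beyond
    # ->  r'_lam(|w|) = max(lam - |w|/theta, 0)
    return np.maximum(lam - w_abs / theta, 0.0)

def mm_weights(w, lam, theta, grad=logsum_grad):
    # weights lambda_j^k = r'_lam(|w_j^k|) of the surrogate problem (5)
    return grad(np.abs(w), lam, theta)
```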
3. Proximal Weighted Lasso: coordinate descent and screening
As stated above, the screening rule we have developed for regression with non-convex sparsity-enforcing penalties is based on the iterative minimization of Proximal Weighted Lasso problems. Hereafter, we briefly show that such subproblems can be solved by optimizing coordinate-wise; we then derive a duality-gap-based rule to screen coordinates.
Solving the following primal problem (which encompasses Problem (5) as a special case) will prove useful in the application of our MM framework:
$$\min_{w \in \mathbb{R}^d} \ P_\Lambda(w) \triangleq \tfrac{1}{2}\|y - Xw\|_2^2 + \tfrac{1}{2\alpha}\|w - w'\|_2^2 + \sum_{j=1}^d \lambda_j |w_j|, \qquad (6)$$
where w' is a pre-defined vector of R^d and Λ = (λ_1, …, λ_d)⊤, with λ_1 ≥ 0, …, λ_d ≥ 0, gathers the regularization parameters (allowing λ_j = 0 will be of interest later). In the sequel, we refer to Problem (6) as the Proximal Weighted Lasso (PWL) problem.
Problem (6) can be solved by a proximal gradient descent algorithm (Beck & Teboulle, 2009), but in order to benefit from screening rules, coordinate descent algorithms are to be preferred. Friedman et al. (2010) proposed a coordinate-wise update for the Elastic-net penalty and, similarly, it can be shown that the following update holds for any coordinate w_j and λ_j ≥ 0:
$$w_j \leftarrow \frac{\mathrm{sign}(t_j)\,\max(0, |t_j| - \lambda_j)}{\|x_j\|^2 + 1/\alpha}, \quad \text{with } t_j = x_j^\top(y - Xw + x_j w_j) + \frac{w'_j}{\alpha}. \qquad (7)$$
Typically, coordinate descent algorithms visit all the variables in a cyclic way (another popular choice is to sample coordinates uniformly at random), unless some screening rule spares them unnecessary updates. Recent efficient screening methods rely on producing primal-dual approximate solutions and on defining tests based on these approximate solutions (Fercoq et al., 2015; Shibagaki et al., 2016; Ndiaye et al., 2017). While our approach follows the same road, it is not a straightforward extension of the works of Fercoq et al. (2015) and Shibagaki et al. (2016), because of the proximal regularizer.
The primal objective of the Proximal Weighted Lasso given in Equation (6) is convex (and lower bounded) and admits at least one global solution. To derive our screening tests, we investigate the dual formulation associated with this problem, which reads
$$\max_{s \in \mathbb{R}^n,\, v \in \mathbb{R}^d} \ D(s, v) \triangleq -\tfrac{1}{2}\|s\|_2^2 - \tfrac{\alpha}{2}\|v\|_2^2 + s^\top y - v^\top w' \qquad (8)$$
$$\text{s.t.} \quad |X^\top s - v| \preceq \Lambda, \qquad (9)$$
where the inequality ≼ is applied coordinate-wise. As a side result of the dual derivation (see the Appendix for details), the key condition for screening any primal variable w_j is
$$|x_j^\top s^\star - v_j^\star| < \lambda_j \ \Longrightarrow \ w_j^\star = 0,$$
with (s*, v*) a solution of the dual formulation (8). A screening test for the Proximal Weighted
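Before deriving the screening test, here is a minimal sketch of one cyclic pass of the coordinate-wise update (7); the function name and the residual bookkeeping are ours, and a practical solver would add convergence checks and screening.

```python
import numpy as np

def pwl_coordinate_sweep(X, y, w, w_center, lambdas, alpha):
    """One cyclic pass of coordinate descent on the Proximal Weighted Lasso (6).

    Applies w_j <- sign(t_j) max(0, |t_j| - lambda_j) / (||x_j||^2 + 1/alpha)
    with t_j = x_j^T (y - Xw + x_j w_j) + w_center_j / alpha   (Equation (7)),
    where w_center plays the role of w' (e.g. the previous MM iterate).
    """
    residual = y - X @ w
    for j in range(X.shape[1]):
        xj = X[:, j]
        t_j = xj @ (residual + xj * w[j]) + w_center[j] / alpha
        new_wj = np.sign(t_j) * max(abs(t_j) - lambdas[j], 0.0) / (xj @ xj + 1.0 / alpha)
        residual += xj * (w[j] - new_wj)   # keep the residual y - Xw up to date
        w[j] = new_wj
    return w
```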
Lasso problem can thus be derived if we can provide an upper bound on |x_j⊤ s* − v*_j| that is guaranteed to be strictly smaller than λ_j. Suppose we have an intermediate primal-dual triplet (ŵ, ŝ, v̂) with ŝ and v̂ dual feasible (meaning |X⊤ŝ − v̂| ≼ Λ); then we can derive the bound
$$|x_j^\top s^\star - v_j^\star| = |x_j^\top \hat{s} - \hat{v}_j + x_j^\top(s^\star - \hat{s}) - (v_j^\star - \hat{v}_j)| \le |x_j^\top \hat{s} - \hat{v}_j| + \|x_j\|\,\|s^\star - \hat{s}\| + |v_j^\star - \hat{v}_j|.$$
We now need an upper bound on the distance between the approximate and the optimal dual solutions to make this screening condition exploitable. Since the dual objective D(s, v) of Equation (8) is quadratic and strongly concave (see Nesterov (2004) for a precise definition of strong convexity/concavity), the following inequality holds:
$$D(\hat{s}, \hat{v}) \le D(s^\star, v^\star) + \nabla_s D(s^\star, v^\star)^\top(\hat{s} - s^\star) + \nabla_v D(s^\star, v^\star)^\top(\hat{v} - v^\star) - \tfrac{1}{2}\|\hat{s} - s^\star\|^2 - \tfrac{\alpha}{2}\|\hat{v} - v^\star\|^2,$$
where ∇_s D and ∇_v D denote the partial gradients of D with respect to s and v. Since the dual problem is a constrained maximization problem, the first-order optimality condition at (s*, v*) reads ∇_s D(s*, v*)⊤(s − s*) + ∇_v D(s*, v*)⊤(v − v*) ≤ 0 for all feasible (s, v); thus
$$D(s^\star, v^\star) - D(\hat{s}, \hat{v}) \ge \tfrac{1}{2}\|\hat{s} - s^\star\|^2 + \tfrac{\alpha}{2}\|\hat{v} - v^\star\|^2.$$
By strong duality, P_Λ(ŵ) ≥ P_Λ(w*) = D(s*, v*), hence
$$P_\Lambda(\hat{w}) - D(\hat{s}, \hat{v}) \ge \tfrac{1}{2}\|\hat{s} - s^\star\|^2 + \tfrac{\alpha}{2}\|\hat{v} - v^\star\|^2.$$
We can now use the duality gap to bound ‖ŝ − s*‖ and |v̂_j − v*_j|. Hence, given a primal-dual intermediate solution (ŵ, ŝ, v̂) with duality gap G_Λ(ŵ, ŝ, v̂) ≜ P_Λ(ŵ) − D(ŝ, v̂), the screening test for a variable j is
$$\underbrace{|x_j^\top \hat{s} - \hat{v}_j| + \sqrt{2\, G_\Lambda(\hat{w}, \hat{s}, \hat{v})}\,\Big(\|x_j\| + \tfrac{1}{\sqrt{\alpha}}\Big)}_{T_j^{(\lambda_j)}(\hat{w}, \hat{s}, \hat{v})} < \lambda_j, \qquad (10)$$
and when this test holds we can safely state that the j-th coordinate of w* is zero.
Finding approximate primal-dual solutions: In our case of interest, the Proximal Weighted
Lasso problem is solved in its primal form using a coordinate descent algorithm that optimizes one coordinate at a time. Hence, an approximate primal solution ŵ is readily available as the current iterate of the algorithm. From this primal solution, we now show how to obtain a dual feasible pair (ŝ, v̂) that can be used in the screening test.
One can check the following primal-dual link (for instance by deriving the first-order conditions of the maximization in Equation (16), or of the equivalent formulation (21); see the Appendix):
$$s^\star = y - Xw^\star \quad \text{and} \quad v^\star = \tfrac{1}{\alpha}(w^\star - w'),$$
with the constraints |x_j⊤ s* − v*_j| ≤ λ_j for all j ∈ [d]. Hence, a good approximation of the dual solution can be obtained by scaling the residual vector y − Xŵ so that it becomes dual feasible; indeed, the condition |x_j⊤ s − v_j| ≤ λ_j for all j ∈ [d] is guaranteed only at optimality. To avoid dividing by vanishing λ_j's, consider the set S = {j ∈ [d] : λ_j > 0} of associated indices and assume that it is non-empty. Then define
$$j^\dagger = \arg\max_{j \in \mathcal{S}} \ \underbrace{\frac{1}{\lambda_j}\Big| x_j^\top(y - X\hat{w}) - \tfrac{1}{\alpha}(\hat{w}_j - w'_j) \Big|}_{\rho(j)}. \qquad (11)$$
If ρ(j†) ≤ 1, then ŝ ≜ y − Xŵ and v̂ ≜ (1/α)(ŵ − w') are dual feasible and no scaling is needed. If ρ(j†) > 1, we define the approximate dual solution as ŝ = (y − Xŵ)/ρ(j†) and v̂ = (ŵ − w')/(α ρ(j†)), which are dual feasible. Hence, in practice, we compute our screening test using the triplet
$$\Big( \hat{w},\ \tfrac{y - X\hat{w}}{\max(1, \rho(j^\dagger))},\ \tfrac{\hat{w} - w'}{\alpha\,\max(1, \rho(j^\dagger))} \Big).$$
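Putting Equations (6), (8), (10) and (11) together, a sketch of the gap-based screening test could look as follows. Names are ours, and the expressions follow the reconstruction of the primal, dual and test given above; the special treatment of the λ_j = 0 components anticipates Remark 2 below.

```python
import numpy as np

def screening_test(X, y, w_hat, w_center, lambdas, alpha):
    """Return a boolean mask of coordinates that can be safely set to zero."""
    residual = y - X @ w_hat
    v_raw = (w_hat - w_center) / alpha

    # Scaling factor rho(j^dagger) of Equation (11), over S = {j : lambda_j > 0}.
    S = lambdas > 0
    rho = np.abs(X[:, S].T @ residual - v_raw[S]) / lambdas[S]
    scale = max(1.0, rho.max()) if S.any() else 1.0

    s_hat = residual / scale                 # dual feasible pair (s_hat, v_hat)
    v_hat = v_raw / scale
    corr = X.T @ s_hat
    v_hat[~S] = corr[~S]                     # components with lambda_j = 0 (Remark 2)

    # Duality gap G_Lambda = P_Lambda(w_hat) - D(s_hat, v_hat).
    primal = 0.5 * residual @ residual \
        + 0.5 / alpha * (w_hat - w_center) @ (w_hat - w_center) \
        + lambdas @ np.abs(w_hat)
    dual = -0.5 * s_hat @ s_hat - 0.5 * alpha * (v_hat @ v_hat) + s_hat @ y - v_hat @ w_center
    gap = max(primal - dual, 0.0)

    radius = np.sqrt(2.0 * gap)
    col_norms = np.linalg.norm(X, axis=0)
    scores = np.abs(corr - v_hat) + radius * (col_norms + 1.0 / np.sqrt(alpha))
    return scores < lambdas                  # test of Equation (10)
```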
Remark 2. As the above dual approximation is valid only for components with λ_j > 0, we add a special treatment for components j ∈ [d] \ S, setting v̂_j = x_j⊤ ŝ for such indices, which guarantees dual feasibility. Note that the case λ_j = 0 is not innocuous: for some non-convex penalties such as MCP or SCAD, the derivative r'_λ vanishes for large arguments, and this is one of their key statistical properties. Hence, the situation λ_j = 0 is very likely to occur in practice for these penalties.
Comparing with other screening tests: Our screening test exploits the duality gap and the strong concavity of the dual function, and the resulting test is similar to the one derived by Fercoq et al. (2015). Indeed, it can be shown that Equation (10) boils down to a test based on a sphere centered at ŝ with radius √(2 G_Λ(ŵ, ŝ, v̂)). The main difference lies in the threshold λ_j of the test, which varies per coordinate instead of being uniform. Similarly to the GAP test of Fercoq et al. (2015), our screening rule comes with theoretical guarantees.

Algorithm 1
Proximal Weighted Lasso (PWL)
Input: X, y, w (warm start), Λ, α, LstScreen
Output: w, r = y − Xw
k ← 0
repeat
  for each variable j ∉ LstScreen do
    w_j^{k+1} ← update of coordinate w_j^k using Equation (7)
  end for
  compute the duality gap, r = y − Xw^{k+1}, and the approximate dual variables
  for each variable j ∉ LstScreen do
    update the screening condition using Equation (10)
  end for
  k ← k + 1
until convergence

It can be shown that if we have a sequence {ŵ^k} of primal iterates converging to w*, i.e., lim_{k→∞} ŵ^k = w*, then ŝ^k and v̂^k defined as above converge to s* and v*. Moreover, we can state a property showing the ability of our screening rule to remove irrelevant variables after a finite number of iterations.
Property 1. Define the equicorrelation set (following a terminology introduced by Tibshirani (2013)) of the Proximal Weighted Lasso as E* ≜ {j ∈ [d] : |x_j⊤ s* − v*_j| = λ_j}, and let E^k ≜ {j ∈ [d] : |x_j⊤ ŝ^k − v̂^k_j| ≥ λ_j} be the set obtained at iteration k of an algorithm solving the Proximal Weighted Lasso. Then there exists an iteration k_0 ∈ N such that for all k ≥ k_0, E^k ⊂ E*.

Proof. Because ŵ^k, ŝ^k and v̂^k are convergent and strong duality holds, the duality gap converges towards zero. Now, for any given ε > 0, define k_0 such that for all k ≥ k_0 we have ‖ŝ^k − s*‖ ≤ ε, ‖v̂^k − v*‖_∞ ≤ ε and √(2 G_Λ) ≤ ε. For j ∉ E*, we have
$$|x_j^\top(\hat{s}^k - s^\star) - (\hat{v}^k_j - v^\star_j)| \le |x_j^\top(\hat{s}^k - s^\star)| + |\hat{v}^k_j - v^\star_j| \le \Big(\max_{j \notin E^\star}\|x_j\| + 1\Big)\,\varepsilon.$$
From the triangle inequality,
$$|x_j^\top \hat{s}^k - \hat{v}^k_j| \le |x_j^\top(\hat{s}^k - s^\star) - (\hat{v}^k_j - v^\star_j)| + |x_j^\top s^\star - v^\star_j| \le \Big(\max_{j \notin E^\star}\|x_j\| + 1\Big)\,\varepsilon + |x_j^\top s^\star - v^\star_j|.$$
Adding √(2 G_Λ)(‖x_j‖ + 1/√α) on both sides yields
$$T_j^{(\lambda_j)} \le \Big(2\max_{j \notin E^\star}\|x_j\| + 1 + \tfrac{1}{\sqrt{\alpha}}\Big)\,\varepsilon + |x_j^\top s^\star - v^\star_j|.$$
Now define the constant C ≜ min_{j∉E*} [λ_j − |x_j⊤ s* − v*_j|]; because j ∉ E*, we have C > 0. Hence, if we choose
$$\varepsilon < \frac{C}{2\max_{j \notin E^\star}\|x_j\| + 1 + \tfrac{1}{\sqrt{\alpha}}},$$
we obtain T_j^{(λ_j)} < λ_j, which means that variable j has been screened, hence j ∉ E^k. To conclude, j ∉ E* implies j ∉ E^k, which translates into E^k ⊂ E*.

This property thus tells us that all the zero variables of w* are correctly detected and screened by our screening rule within a finite number of iterations of the algorithm.
4. Screening rule for non-convex regularizers
Now that we have described the inner solver and its screening rule, we analyze how this rule can be integrated into a majorization-minimization (MM) approach.
Let us first discuss some properties of our MM algorithm. According to the first-order optimality condition of the Proximal Weighted
Lasso related to the MM problem at iteration k, the following inequality holds for any j ∈ [d]:
$$|x_j^\top s^{k,\star} - v_j^{k,\star}| \le \lambda_j^k,$$
where the superscript ⋆ denotes the optimal solution at iteration k. Owing to the convergence properties of the MM algorithm (Kang et al., 2015), we know that the sequence {w^k} converges towards a vector satisfying Equation (2). Owing to the continuity of r'_λ(|·|), we deduce that each sequence {λ_j^k} also converges towards some λ*_j. Thus, by taking limits in the above inequality, the following condition holds for w*_j = 0:
$$|x_j^\top s^\star - v_j^\star| \le \lambda_j^\star, \quad \text{with } \lambda_j^\star = r'_\lambda(|w_j^\star|).$$
This inequality essentially relates a vanishing primal component to the optimal dual variables at each iteration k. While it suggests that screening within each inner problem, by setting the weights of Equation (6) to λ_j^k = r'_λ(|w_j^k|), should improve the efficiency of the global MM solver, it does not tell us whether variables screened at iteration k will still be screened at the next iteration, since the λ_j^k's are also expected to vary between iterations.
Remark 3. In general, the behavior of a weight λ_j across MM iterations strongly depends on the initialization w^0 and on the optimum w*. For instance, if a variable w_j is initialized at 0 and w*_j is large, λ_j will tend to decrease across iterations; conversely, λ_j will tend to increase.

In what follows, we derive conditions allowing screened coefficients to be propagated from one iteration to the next within the MM framework.
Property 2.
Consider a Proximal Weighted
Lasso problem with weights {λ_j} and its primal-dual approximate solutions ŵ, ŝ and v̂, allowing to evaluate the screening test of Equation (10). Suppose that we are given a new set of weights Λ^ν = {λ^ν_j}_{j=1,…,d} defining a new Proximal Weighted Lasso problem. Given a primal-dual approximate solution defined by the triplet (ŵ^ν, ŝ^ν, v̂^ν) for the latter problem, a screening test for variable j reads
$$T_j^{(\lambda_j)}(\hat{w}, \hat{s}, \hat{v}) + \|x_j\|\,\big(a + \sqrt{2b}\big) + c + \sqrt{\tfrac{2b}{\alpha}} < \lambda_j^\nu, \qquad (12)$$
where T_j^{(λ_j)}(ŵ, ŝ, v̂) is the screening score for j at λ_j, and a, b and c are constants such that ‖ŝ^ν − ŝ‖ ≤ a, |G_Λ(ŵ, ŝ, v̂) − G_{Λ^ν}(ŵ^ν, ŝ^ν, v̂^ν)| ≤ b and |v̂^ν_j − v̂_j| ≤ c.

Proof. The screening test for the new Proximal Weighted
Lasso problem with weights Λ^ν can be written as
$$|x_j^\top \hat{s}^\nu - \hat{v}^\nu_j| + \sqrt{2\, G_{\Lambda^\nu}(\hat{w}^\nu, \hat{s}^\nu, \hat{v}^\nu)}\,\Big(\|x_j\| + \tfrac{1}{\sqrt{\alpha}}\Big) < \lambda_j^\nu.$$
Let us bound the terms on the left-hand side of this inequality. First,
$$|x_j^\top \hat{s}^\nu - \hat{v}^\nu_j| \le |x_j^\top \hat{s} - \hat{v}_j| + |x_j^\top(\hat{s}^\nu - \hat{s})| + |\hat{v}^\nu_j - \hat{v}_j| \le |x_j^\top \hat{s} - \hat{v}_j| + \|x_j\|\,\|\hat{s}^\nu - \hat{s}\| + |\hat{v}^\nu_j - \hat{v}_j|,$$
and
$$\sqrt{G_{\Lambda^\nu}} \le \sqrt{|G_{\Lambda^\nu} - G_\Lambda| + G_\Lambda} \le \sqrt{|G_{\Lambda^\nu} - G_\Lambda|} + \sqrt{G_\Lambda}, \qquad (13)$$
where the last inequality holds by applying the norm inequality ‖x‖_2 ≤ ‖x‖_1 to the two-dimensional vector of components [√(|G_{Λ^ν} − G_Λ|), √(G_Λ)] (we drop the dependence on (ŵ, ŝ, v̂) and (ŵ^ν, ŝ^ν, v̂^ν) for simplicity). Gathering the pieces, the left-hand side of the test above can be bounded by
$$|x_j^\top \hat{s} - \hat{v}_j| + \|x_j\|\,\|\hat{s}^\nu - \hat{s}\| + \sqrt{2 G_\Lambda}\Big(\|x_j\| + \tfrac{1}{\sqrt{\alpha}}\Big) + \sqrt{2|G_{\Lambda^\nu} - G_\Lambda|}\Big(\|x_j\| + \tfrac{1}{\sqrt{\alpha}}\Big) + |\hat{v}^\nu_j - \hat{v}_j|, \qquad (14)$$
which leads to the screening test
$$T_j^{(\lambda_j)}(\hat{w}, \hat{s}, \hat{v}) + \|x_j\|\,\big(a + \sqrt{2b}\big) + c + \sqrt{\tfrac{2b}{\alpha}} < \lambda_j^\nu, \qquad (15)$$
with T_j^{(λ_j)}(ŵ, ŝ, v̂) defined in Equation (10).

In order to make this screening test tractable, we first need an approximation ŝ^ν of the new dual solution, then an upper bound on ‖ŝ − ŝ^ν‖ and a bound on the difference in duality gaps |G_{Λ^ν} − G_Λ|.

Algorithm 2
MM algorithm for the Lasso with penalty r_λ(|w|)
Input: X, y, λ, w^0, α
k ← 0
repeat
  Λ^k = {λ_j^k}_{j=1,…,d} ← {r'_λ(|w_j^k|)}_{j=1,…,d}
  compute the approximate dual s^k and the duality gap G_{Λ^k} given w^k and Λ^k
  if needed then
    compute the screening scores T_j^{(λ_j)}(w^k, s^k, v^k) according to Eq. (10)
    store the reference duality gap and approximate dual variables
  else
    estimate the screening scores according to Eq. (12)
  end if
  LstScreen ← updated list of screened variables based on the scores above
  w^{k+1}, y − Xw^{k+1} ← PWL(X, y, w^k, Λ^k, α, LstScreen)
  k ← k + 1
until convergence

In practice, we define ŝ^ν = (y − Xŵ)/ρ^ν(j†) and then compute ‖ŝ − ŝ^ν‖ and |G_{Λ^ν} − G_Λ| exactly. Interestingly, since we keep the primal solution ŵ as our approximate primal solution for the new problem, computing ρ^ν(j) only costs an element-wise division, since y − Xŵ and the |x_j⊤(y − Xŵ)| have already been pre-computed at the previous MM iteration. Another interesting point is that, given pre-computed screening scores {T_j^{(λ_j)}(ŵ, ŝ, v̂)}, the screening test of Equation (12) does not involve any additional dot product and is thus cheap to compute.

When considering MM algorithms for handling non-convex penalties, we advocate using the weighted Lasso with screening as the solver for Equation (5), together with the screened-variable propagation condition given in Equation (12). The latter allows us to cheaply evaluate whether a variable can be screened before entering the inner solver. However, Equation (12) also needs the screening scores computed at some previous weights {λ_j}, and a trade-off has thus to be sought between the relevance of the test (which holds when λ^ν_j is not too far from λ_j) and the computational time needed to evaluate T_j^{(λ_j)}. In practice, we compute the exact scores T_j^{(λ_j)}(ŵ, ŝ, v̂) only every few iterations and apply Equation (12) for the remaining ones. This results in Algorithm 2, where the inner solvers, denoted PWL and run according to Algorithm 1, are warm-started with the previous outputs and provided with the list of already screened variables. In practice, α > 0 guarantees the theoretical convergence of the sequence {w^k}, and we have set α = 10 for all our experiments, leading to a very small regularization that allows large deviations from w^k.
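A sketch of the propagation test of Equation (12), assuming the reference scores T_j^{(λ_j)} and the bounds a, b, c have been computed as discussed above; the function name is ours and the constants follow the reconstruction of Equation (12) given here.

```python
import numpy as np

def propagate_screening(T_ref, lambdas_new, x_norms, a, b, c, alpha):
    """Evaluate Equation (12): variables whose propagated score stays below the
    new weights lambda_j^nu can be screened without recomputing a full duality
    gap for the new Proximal Weighted Lasso problem.

    T_ref   : screening scores T_j^(lambda_j) stored at the reference weights (Eq. (10))
    a, c    : bounds on ||s_hat^nu - s_hat|| and |v_hat_j^nu - v_hat_j|
    b       : bound on |G_Lambda - G_Lambda^nu|
    """
    slack = x_norms * (a + np.sqrt(2.0 * b)) + c + np.sqrt(2.0 * b / alpha)
    return T_ref + slack < lambdas_new
```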
Figure 1. Comparing running times for coordinate descent (CD) and proximal gradient algorithms, as well as for the screening-based CD and MM algorithms, under different tolerances on the stopping condition and different (samples, features) settings. For all experiments, there are 5 active variables. (Left) n = 50, d = 100, σ = 2. (Right) n = 500, d = 5000, σ = 2.
5. Numerical experiments
The goal of screening rules is to improve the efficiency of solvers by focusing only on the variables that are non-vanishing. In this section, we thus report the computational gains obtained thanks to our screening strategy.
Our goal is to compare the computational running time of different algorithms for computing a regularization path on Problem (1). For most relevant non-convex regularizers, the coordinate-wise update (Breheny & Huang, 2011) and the proximal operator (Gong et al., 2013) can be derived in closed form. For our experiments, we have used the log-sum penalty, which has a hyperparameter θ. Hence, our regularization path involves the parameters (λ_t, θ_t). The set {λ_t}_{t=0}^{N_t−1} is a decreasing sequence starting from λ_max, with N_t depending on the problem, and θ is taken in a small set of values. Our baseline method should have been an MM algorithm in which each subproblem of Equation (5) is solved using a coordinate descent algorithm without screening; owing to its very poor running time, we have omitted its performance. Two other competitors that directly address the non-convex learning problem have instead been investigated: the first one, denoted GIST, uses a majorization of the loss function and iterative shrinkage-thresholding (Gong et al., 2013), while the second one, named ncxCD, is a coordinate descent algorithm that directly handles non-convex penalties (Breheny & Huang, 2011; Mazumder et al., 2011).
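The path computation can be sketched as follows, assuming a hypothetical solver function implementing Algorithm 2; the geometric grid of λ_t values is only an illustrative choice, since the exact schedule is not specified here.

```python
import numpy as np

def regularization_path(X, y, theta, solver, n_lambdas=50):
    """Warm-started path over a decreasing grid of lambda values (sketch).

    `solver` is assumed to implement Algorithm 2 (e.g. an MM solver with
    screening); it takes (X, y, lam, theta, w_init) and returns the estimate.
    """
    lam_max = theta * np.max(np.abs(X.T @ y))   # log-sum penalty (Remark 1)
    lambdas = lam_max * np.geomspace(1.0, 1e-3, n_lambdas)
    w = np.zeros(X.shape[1])
    path = []
    for lam in lambdas:
        w = solver(X, y, lam, theta, w_init=w)  # warm start from the previous solution
        path.append(w.copy())
    return lambdas, np.array(path)
```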
We have named MM-screen our method, which screens within each Proximal Weighted Lasso and propagates screening scores; the genuine version, which drops the screening propagation (but keeps screening inside the inner solvers), is denoted MM-genuine. All algorithms are stopped when the optimality conditions described in Equation (4) are satisfied up to the same tolerance τ. Note that all algorithms have been implemented in Python/Numpy; hence, this may give GIST a slight computational advantage over coordinate descent approaches, which heavily benefit from the loop efficiency of low-level languages. For our MM approaches, we stop the inner Proximal Weighted Lasso solver when its duality gap falls below a fixed threshold, and screening is computed every few inner iterations; in the outer loop, we likewise perform the screening only every few iterations.

Our toy regression problem is built as follows. The entries of the design matrix X ∈ R^{n×d} are drawn i.i.d. from a zero-mean Gaussian distribution. For given n and d and a number p of active variables, the true coefficient vector w* is obtained as follows: the p non-zero positions are chosen at random and their values are drawn from a zero-mean, unit-variance Gaussian distribution, to which we add a fixed offset matching the sign of w*_j. Finally, the target vector is obtained as y = Xw* + e, where e is a random noise vector drawn from a Gaussian distribution with zero mean and standard deviation σ. For the toy problem, we have set N_t = 50.
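A minimal sketch of the toy data generation described above; the offset added to the non-zero coefficients is a placeholder value, as the exact one is not recoverable from the text.

```python
import numpy as np

def make_toy_problem(n, d, p, sigma, offset=0.5, seed=0):
    """Toy regression data of Section 5 (the offset value is a placeholder)."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, d))                      # zero-mean Gaussian entries
    w_star = np.zeros(d)
    support = rng.choice(d, size=p, replace=False)       # p randomly chosen active variables
    coefs = rng.standard_normal(p)
    w_star[support] = coefs + offset * np.sign(coefs)    # push coefficients away from zero
    y = X @ w_star + sigma * rng.standard_normal(n)
    return X, y, w_star
```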
Figure 1 presents the running time needed for the different algorithms to reach convergence under the different settings. We note that, irrespective of the setting, our screening rules reduce the computational time by a significant factor compared to a plain non-convex coordinate descent. The gain compared to GIST is most notable for high-precision and noisy problems.

Figure 2. Computational gain of propagating screening within MM iterations: (left) w.r.t. the number of features at a fixed KKT tolerance; (middle) w.r.t. the noise level at a fixed KKT tolerance; (right) w.r.t. the number of features at different tolerances in a low-noise regime.
Figure 3.
Running time for computing the regularization path on (left) the dense Leukemia dataset (n = 72, d = 7129) and (right) the sparse Newsgroup dataset (n = 961, d = 21319).

We have also analyzed the benefit brought by our screening propagation strategy. Figure 2 presents the gain in computation time when comparing
MM screening and
MM genuine. As we have kept the number of relevant features fixed, one can note that for a fixed amount of noise the gain is roughly constant over a wide range of dimensionality. Similarly, the gain is almost constant over a large range of noise levels, and the best benefit occurs in a low-noise setting. More interestingly, we have compared this gain for increasing precision in a low-noise situation, which is the classical noise level considered in the screening literature (Ndiaye et al., 2016; Tibshirani et al., 2012). We can note from the rightmost panel of Figure 2 that the more precision we require on the resolution of the learning problem, the more we benefit from screening propagation: the gain peaks at the tightest tolerance, and even for a looser tolerance the genuine screening approach remains slower than the full approach we propose.

We have also run the comparison on different real datasets. Figure 3 presents the results obtained on the Leukemia dataset, a dense dataset with n = 72 examples in dimension d = 7129. For the path computation, the set of λ_t has been fixed to N_t = 20 elements. Remark that the gain compared to ncvxCD varies with the tolerance, and a clear gain over GIST is also observed at high tolerance. On the right panel of Figure 3, we compare the gain in running time brought by the screening propagation rule on a real-world sparse dataset, Newsgroups, in which we have kept only two categories (religion and graphics), resulting in n = 961 and d = 21319. We can note that the gain is similar to what we observed on the toy problem.
6. Conclusion
We have presented the first screening rule strategy that handles sparsity-inducing non-convex regularizers. The approach we propose is based on a majorization-minimization framework in which each inner iteration solves a Proximal Weighted Lasso problem. We introduced a screening rule for this learning problem and a rule for propagating screened variables across MM iterations. Interestingly, our screening rule for the weighted Lasso is able to identify all the variables to be screened within a finite number of iterations. We have carried out several numerical experiments showing the benefits of the proposed approach compared to methods directly handling the non-convexity of the regularizers, and illustrating the situations in which our propagating-screening rule helps accelerate the solver.
References
Beck, A. and Teboulle, M. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.

Bonnefoy, A., Emiya, V., Ralaivola, L., and Gribonval, R. Dynamic screening: Accelerating first-order algorithms for the lasso and group-lasso. IEEE Transactions on Signal Processing, 63(19):5121–5132, 2015.

Boyd, S. and Vandenberghe, L. Convex Optimization. Cambridge University Press, New York, NY, USA, 2004.

Breheny, P. and Huang, J. Coordinate descent algorithms for nonconvex penalized regression, with applications to biological feature selection. The Annals of Applied Statistics, 5(1):232, 2011.

Candès, E. J., Wakin, M. B., and Boyd, S. P. Enhancing sparsity by reweighted ℓ1 minimization. Journal of Fourier Analysis and Applications, 14(5-6):877–905, 2008.

Chen, S. and Donoho, D. Basis pursuit. In Proceedings of the 1994 28th Asilomar Conference on Signals, Systems and Computers, 1994.

Chen, S. S., Donoho, D. L., and Saunders, M. A. Atomic decomposition by basis pursuit. SIAM Review, 43(1):129–159, 2001.

Clarke, F. H. Methods of Dynamic and Nonsmooth Optimization. SIAM, 1989.

Donoho, D. L. Compressed sensing. IEEE Transactions on Information Theory, 52(4):1289–1306, 2006.

El Ghaoui, L., Viallon, V., and Rabbani, T. Safe feature elimination in sparse supervised learning. Pacific Journal of Optimization, pp. 667–698, 2012.

Fan, J. and Li, R. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456):1348–1360, 2001.

Fercoq, O., Gramfort, A., and Salmon, J. Mind the duality gap: safer rules for the lasso. In Proceedings of the International Conference on Machine Learning, pp. 333–342, 2015.

Friedman, J., Hastie, T., Höfling, H., and Tibshirani, R. Pathwise coordinate optimization. The Annals of Applied Statistics, 1(2):302–332, 2007.

Friedman, J., Hastie, T., and Tibshirani, R. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1):1, 2010.

Fu, W. J. Penalized regressions: the bridge versus the lasso. Journal of Computational and Graphical Statistics, 7(3):397–416, 1998.

Gasso, G., Rakotomamonjy, A., and Canu, S. Recovering sparse signals with a certain family of nonconvex penalties and DC programming. IEEE Transactions on Signal Processing, 57(12):4686–4698, 2009.

Gong, P., Zhang, C., Lu, Z., Huang, J., and Ye, J. A general iterative shrinkage and thresholding algorithm for non-convex regularized optimization problems. In International Conference on Machine Learning, pp. 37–45, 2013.

Hunter, D. R. and Lange, K. A tutorial on MM algorithms. The American Statistician, 58(1):30–37, 2004.

Johnson, T. and Guestrin, C. Blitz: A principled meta-algorithm for scaling sparse optimization. In International Conference on Machine Learning, pp. 1171–1179, 2015.

Kang, Y., Zhang, Z., and Li, W.-J. On the global convergence of majorization minimization algorithms for nonconvex optimization problems. arXiv preprint arXiv:1504.07791, 2015.

Kruger, A. Y. On Fréchet subdifferentials. Journal of Mathematical Sciences, 116(3):3325–3358, 2003.

Lustig, M., Donoho, D. L., Santos, J. M., and Pauly, J. M. Compressed sensing MRI. IEEE Signal Processing Magazine, 25(2):72–82, 2008.

Mairal, J. Stochastic majorization-minimization algorithms for large-scale optimization. In Advances in Neural Information Processing Systems, pp. 2283–2291, 2013.

Mazumder, R., Friedman, J. H., and Hastie, T. SparseNet: Coordinate descent with nonconvex penalties. Journal of the American Statistical Association, 106(495):1125–1138, 2011.

Mordukhovich, B. S., Nam, N. M., and Yen, N. Fréchet subdifferential calculus and optimality conditions in nondifferentiable programming. Optimization, 55(5-6):685–708, 2006.

Ndiaye, E., Fercoq, O., Gramfort, A., and Salmon, J. GAP safe screening rules for sparse-group lasso. In Advances in Neural Information Processing Systems, pp. 388–396, 2016.

Ndiaye, E., Fercoq, O., Gramfort, A., and Salmon, J. Gap safe screening rules for sparsity enforcing penalties. Journal of Machine Learning Research, 18(128):1–33, 2017.

Nesterov, Y. Introductory Lectures on Convex Optimization, volume 87 of Applied Optimization. Kluwer Academic Publishers, Boston, MA, 2004.

Rockafellar, R. T. and Wets, R. J.-B. Variational Analysis, volume 317. Springer Science & Business Media, 2009.

Shevade, S. K. and Keerthi, S. S. A simple and efficient algorithm for gene selection using sparse logistic regression. Bioinformatics, 19(17):2246–2253, 2003.

Shibagaki, A., Karasuyama, M., Hatano, K., and Takeuchi, I. Simultaneous safe screening of features and samples in doubly sparse modeling. In International Conference on Machine Learning, pp. 1577–1586, 2016.

Soubies, E., Blanc-Féraud, L., and Aubert, G. A unified view of exact continuous penalties for ℓ2-ℓ0 minimization. SIAM Journal on Optimization, 27(3):2034–2060, 2017.

Tibshirani, R. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), pp. 267–288, 1996.

Tibshirani, R., Bien, J., Friedman, J., Hastie, T., Simon, N., Taylor, J., and Tibshirani, R. J. Strong rules for discarding predictors in lasso-type problems. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 74(2):245–266, 2012.

Tibshirani, R. J. The lasso problem and uniqueness. Electronic Journal of Statistics, 7:1456–1490, 2013.

Ye, J. and Liu, J. Sparse methods for biomedical data. ACM SIGKDD Explorations Newsletter, 14(1):4–15, 2012.

Zhang, C.-H. Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics, 38(2):894–942, 2010.

Zhang, T. Analysis of multi-stage convex relaxation for sparse regularization. Journal of Machine Learning Research, 11:1081–1107, 2010.

Zou, H. and Hastie, T. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2):301–320, 2005.
Supplementary Material of “Screening rules for Lasso with non-convex sparse regularizers”
Dual problem of the weighted Lasso minimization
Let us recall the weighted Lasso problem
$$\min_{w \in \mathbb{R}^d} \ \tfrac{1}{2}\|y - Xw\|_2^2 + \tfrac{1}{2\alpha}\|w - w'\|_2^2 + \sum_{j=1}^d \lambda_j |w_j|.$$
This is an Elastic-Net-type problem and can be expressed as
$$\min_{w \in \mathbb{R}^d} \ \tfrac{1}{2}\|\tilde{y} - \tilde{X} w\|_2^2 + \sum_{j=1}^d \lambda_j |w_j|, \quad \text{where} \quad \tilde{y} = \begin{bmatrix} y \\ w'/\sqrt{\alpha} \end{bmatrix} \in \mathbb{R}^{n+d} \ \text{ and } \ \tilde{X} = \begin{bmatrix} X \\ I/\sqrt{\alpha} \end{bmatrix} \in \mathbb{R}^{(n+d)\times d}.$$
Let ã_i = X̃_{i,:}⊤ and let φ_i(z_i) = ½(ỹ_i − z_i)² be the quadratic loss. Its convex conjugate (Boyd & Vandenberghe, 2004) is φ*_i(η_i) = max_{z_i} η_i z_i − φ_i(z_i), which gives φ*_i(η_i) = ½ η_i² + η_i ỹ_i. Note also that φ_i = φ**_i since φ_i is convex.
Following Johnson & Guestrin (2015), we derive the dual of the weighted problem through the following steps:
$$\min_{w \in \mathbb{R}^d} \tfrac{1}{2}\|\tilde{y} - \tilde{X}w\|_2^2 + \sum_{j=1}^d \lambda_j|w_j| = \min_{w} \sum_{i=1}^{n+d} \tfrac{1}{2}(\tilde{y}_i - \tilde{a}_i^\top w)^2 + \sum_{j=1}^d \lambda_j|w_j| = \min_{w} \sum_{i=1}^{n+d} \varphi_i(\tilde{a}_i^\top w) + \sum_{j=1}^d \lambda_j|w_j| = \min_{w} \sum_{i=1}^{n+d} \varphi_i^{**}(\tilde{a}_i^\top w) + \sum_{j=1}^d \lambda_j|w_j|$$
$$= \min_{w} \sum_{i=1}^{n+d} \max_{\eta_i} \big[(\tilde{a}_i^\top w)\,\eta_i - \varphi_i^*(\eta_i)\big] + \sum_{j=1}^d \lambda_j|w_j| \qquad (16)$$
$$= \min_{w} \max_{\eta \in \mathbb{R}^{n+d}} -\sum_{i=1}^{n+d} \varphi_i^*(\eta_i) + w^\top \tilde{X}^\top \eta + \sum_{j=1}^d \lambda_j|w_j| = \max_{\eta} -\sum_{i=1}^{n+d} \varphi_i^*(\eta_i) + \min_{w} \Big[ w^\top \tilde{X}^\top \eta + \sum_{j=1}^d \lambda_j|w_j| \Big]$$
$$= \max_{\eta \,:\, |\tilde{X}^\top \eta| \preceq \Lambda} \ -\tfrac{1}{2}\|\eta\|_2^2 - \eta^\top \tilde{y}. \qquad (17)$$
The dual objective function is obtained by substituting the expression of φ*_i and using the optimality condition of the inner problem
$$\min_{w \in \mathbb{R}^d} \ w^\top \tilde{X}^\top \eta + \sum_{j=1}^d \lambda_j|w_j|. \qquad (18)$$
This problem is separable, and its optimality condition with respect to any w_j reads, with g_j = (X̃⊤η)_j,
$$g_j + \lambda_j\,\mathrm{sign}(w_j) = 0 \ \text{ if } w_j \neq 0, \qquad |g_j| \le \lambda_j \ \text{ if } w_j = 0. \qquad (19)$$
The latter condition implies the coordinate-wise inequality constraint |X̃⊤η| ≼ Λ with Λ⊤ = (λ_1, …, λ_d). We can also easily check that g_j w_j + λ_j|w_j| = 0, so the objective of Equation (18) vanishes at the optimum.
Finally, let us decompose the dual vector as
$$\eta = \begin{bmatrix} -s \\ \sqrt{\alpha}\, v \end{bmatrix},$$
where s ∈ R^n and v ∈ R^d. Recalling the forms of ỹ and X̃, it is easy to see that the dual problem (17) becomes
$$\max_{s, v \,:\, |X^\top s - v| \preceq \Lambda} \ -\tfrac{1}{2}\|s\|_2^2 - \tfrac{\alpha}{2}\|v\|_2^2 + s^\top y - v^\top w'.$$
Moreover, from (19), the screening conditions hold:
$$|x_j^\top s - v_j| < \lambda_j \ \Longrightarrow \ w_j = 0, \qquad \forall j \in [d], \qquad (20)$$
recalling that x_j = X_{:,j} is the j-th covariate. In addition, the maximization in Equation (16) takes the form
$$\max_{s, v} \ -\tfrac{1}{2}\|s\|_2^2 - \tfrac{\alpha}{2}\|v\|_2^2 + s^\top (y - Xw) + v^\top (w - w'). \qquad (21)$$
Thus, given an optimal primal solution w*, we have s* = y − Xw* and α v* = w* − w'.
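As a quick sanity check of the derivation above, the following snippet builds an arbitrary primal point and a rescaled dual-feasible pair and verifies weak duality for the Proximal Weighted Lasso; the data, values and names are ours.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, alpha = 20, 30, 10.0
X, y = rng.standard_normal((n, d)), rng.standard_normal(n)
w_center, lam = np.zeros(d), np.full(d, 0.5)

w = 0.1 * rng.standard_normal(d)              # arbitrary primal point
s = y - X @ w                                 # candidate dual pair, rescaled below
v = (w - w_center) / alpha
rho = np.max(np.abs(X.T @ s - v) / lam)
s, v = s / max(1.0, rho), v / max(1.0, rho)   # now |X^T s - v| <= Lambda holds

primal = 0.5 * np.sum((y - X @ w) ** 2) \
    + np.sum((w - w_center) ** 2) / (2 * alpha) + lam @ np.abs(w)
dual = -0.5 * s @ s - alpha / 2 * (v @ v) + s @ y - v @ w_center
assert primal - dual >= -1e-10                # weak duality for the dual problem (17)
```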