Provably Convergent Working Set Algorithm for Non-Convex Regularized Regression
Alain Rakotomamonjy
LITIS, Univ. de Rouen and Criteo AI Lab, Paris [email protected]
Rémi Flamary
Univ. Côte d'Azur, CNRS, OCA Lagrange [email protected]
Gilles Gasso
LITIS, INSA de Rouen [email protected]
Joseph Salmon
IMAG, Université de Montpellier, CNRS, Montpellier, France [email protected]
Abstract
Owing to their statistical properties, non-convex sparse regularizers have attracted much interest for estimating a sparse linear model from high-dimensional data. Given that the solution is sparse, a working set strategy accelerates convergence by addressing the optimization problem through an iterative algorithm that increments the number of variables to optimize until the support of the solution is identified. While those methods have been well studied and theoretically supported for convex regularizers, this paper proposes a working set algorithm for non-convex sparse regularizers with convergence guarantees. The algorithm, named FireWorks, is based on a non-convex reformulation of a recent primal-dual approach and leverages the geometry of the residuals. Our theoretical guarantees derive from a lower bound on the objective function decrease between two inner solver iterations, and show convergence to a stationary point of the full problem. More importantly, we also show that convergence is preserved even when the inner solver is inexact, under sufficient decay of the error across iterations. Our experimental results demonstrate high computational gains when using our working set strategy compared to a full problem solver, for both block-coordinate descent and proximal gradient solvers.
Many real-world learning problems are of (very) high dimension. This is the case for natural language processing problems with very large vocabularies, or recommendation problems involving user-item features. In such cases, one way of addressing the learning problem is to consider sparsity-inducing penalties. Likewise, when the solution of a learning problem is known to be sparse, using those penalties yields models that can leverage this prior knowledge. The Lasso [27] and Basis Pursuit [6, 5] were the first approaches to employ the ℓ1-norm penalty for inducing sparsity. The Lasso model has enjoyed large practical success in the machine learning and signal processing communities [25, 9, 18, 32]. Nonetheless, it suffers from theoretical drawbacks (e.g., biased estimates for large coefficients of the model) which can be overcome by considering non-convex sparsity-inducing penalties. Those penalties provide continuous approximations of the ℓ0-(pseudo)norm, which is the true measure of sparsity. There exists a flurry of such penalties, like the Smoothly Clipped Absolute Deviation (SCAD) [10], the Log Sum Penalty (LSP) [4], the capped-ℓ1 penalty and the Minimax Concave Penalty (MCP) [33]. We refer the interested reader to [26] for a discussion of the pros and cons of such non-convex formulations.

In addition to theoretical statistical analyses, efforts have also been made to develop computationally efficient algorithms for non-convex regularized optimization problems. These include coordinate descent algorithms [3], proximal gradient descent [15] and Newton methods [31, 22]. However, all these methods share one kind of inefficiency: they spend a similar computational effort on each variable, even when some variables will end up being irrelevant (zero weight) in the final learnt model. In the non-convex setting, few methods have tried to lift this issue. One approach mixes importance sampling and randomized coordinate descent [11], while another seeks to safely screen features that are irrelevant [23]. Working set (also known as active set) strategies aim at focusing the computational effort on a subset of variables, making them highly relevant for optimization problems with sparse solutions, provided that the algorithm is able to quickly identify the "relevant" features. In the literature, several works on working set algorithms address this selection issue, mostly for convex optimization problems such as the Support Vector Machine problem [30, 13] or the Lasso problem [12, 28, 16, 20]. Working set strategies have been extended to non-convex sparse optimization problems [1, 2], but lack theoretical understanding.

In this work, inspired by the Blitz algorithm proposed by Johnson and Guestrin [16] (see also [19, 20] for its connection with safe screening rules), we propose a theoretically supported method for selecting the working set in non-convex regularized sparse optimization problems.
While Blitz is tailored to convex problems, leveraging primal-dual aspects of the ℓ1-regularized problem, we show that a similar algorithm can be designed by exploiting the key role of the residual in the sparse regression problem. Our algorithm provides a method for selecting the variables to integrate into a working set, together with a theoretical guarantee on the objective value decrease. Based on those results, we provide, as far as we know, the first convergence guarantee for a working set algorithm in a non-convex Lasso setting, and we show that this convergence property is preserved in an inexact setting. In summary, our contributions are the following:

• We propose a working set algorithm for non-convex regularized regression that selects the features to integrate in the model based on a so-called "feasible" residual;

• We show that the algorithm enjoys properties such as convergence to a stationary point, even when the inner solver is inexact, under sufficient decay of the error along the iterations;

• Our experimental results show that our FireWorks algorithm achieves substantial computational gains (up to two orders of magnitude) compared to the baseline approach and to another working set approach.
Notation
We denote by X ∈ R^{n×d} the design matrix. We write vectors of size d or size n in bold, e.g., y ∈ R^n or w ∈ R^d. We will consider several sets, noted in calligraphic mode. Sets of indices are mostly noted A, with A a subset of indices extracted from {1, …, d} and cardinality noted |A|. Given a set A, Ā denotes its complement in {1, …, d}. Sets defined by (unions of) function level sets are denoted C, with indices referring to the defining function. Vectors noted w_A are of size |A|, and we write w̃_A ∈ R^d for the vector with components w_{j,A} for all j ∈ A and 0 elsewhere.

We first introduce the non-convex Lasso problem we are interested in, as well as its first-order optimality conditions. We emphasize the form of the optimality conditions, which will be key for designing our working set algorithm.
We consider solving the problem of least-squares regression with a generic penalty of the form

  min_{w ∈ R^d} f(w) := (1/2) ||y − Xw||² + Σ_{j=1}^d r_λ(|w_j|),   (1)

where y ∈ R^n is a target vector, X = [x_1, …, x_d] ∈ R^{n×d} is the design matrix with column-wise features x_j ∈ R^n, w is the coefficient vector of the model, and the map r_λ : R_+ → R_+ is concave and differentiable on [0, +∞), with a regularization parameter λ > 0. In addition, we assume that r_λ(|w|) is a lower semi-continuous function. Note that most penalty functions, such as SCAD, MCP or log sum (see their definitions in Table 1), have this property, and that for those penalties f(·) is lower bounded.

  Log sum:  r_λ(|w|) = λ log(1 + |w|/θ)
            ∂r_λ(|w|) = [−λ/θ, λ/θ] if w = 0;  {λ sign(w)/(θ + |w|)} if w ≠ 0.

  MCP:      r_λ(|w|) = λ|w| − w²/(2θ) if |w| ≤ λθ;  θλ²/2 if |w| > λθ
            ∂r_λ(|w|) = [−λ, λ] if w = 0;  {λ sign(w) − w/θ} if 0 < |w| ≤ λθ;  {0} if |w| > λθ.

  SCAD:     r_λ(|w|) = λ|w| if |w| ≤ λ;  (−w² + 2θλ|w| − λ²)/(2(θ − 1)) if λ < |w| ≤ λθ;  λ²(1 + θ)/2 if |w| > λθ
            ∂r_λ(|w|) = [−λ, λ] if w = 0;  {λ sign(w)} if 0 < |w| ≤ λ;  {(θλ sign(w) − w)/(θ − 1)} if λ < |w| ≤ λθ;  {0} if |w| > λθ.

Table 1: Common non-convex penalties with their subdifferentials. Here λ > 0, θ > 0 (θ > 1 for MCP, θ > 2 for SCAD).

We consider tools such as Fréchet subdifferentials and limiting subdifferentials [17, 24, 21], well suited for non-smooth and non-convex optimization, so that a vector w* belongs to the set of minimizers (not necessarily global) of Problem (1) if the following Fermat condition holds [7, 17]:

  ∀j,  x_j^T (y − Xw*) ∈ ∂r_λ(|w*_j|),   (2)

with ∂r_λ(|·|) being the Fréchet subdifferential of r_λ(|·|), assuming it exists at w*. In particular, this is the case for the MCP, log sum and SCAD penalties presented in Table 1.
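To make Table 1 concrete, the three penalties and the threshold r'_λ(0) (the half-width of the subdifferential at zero, which drives the slab constraints used later) can be implemented in a few lines. This is an illustrative NumPy sketch; the function names are ours, not from the paper's code.

```python
import numpy as np

def log_sum(w, lam, theta):
    """Log sum penalty: lam * log(1 + |w| / theta)."""
    return lam * np.log1p(np.abs(w) / theta)

def mcp(w, lam, theta):
    """Minimax concave penalty (requires theta > 1)."""
    aw = np.abs(w)
    return np.where(aw <= lam * theta,
                    lam * aw - w ** 2 / (2 * theta),
                    theta * lam ** 2 / 2)

def scad(w, lam, theta):
    """SCAD penalty (requires theta > 2)."""
    aw = np.abs(w)
    mid = (-w ** 2 + 2 * theta * lam * aw - lam ** 2) / (2 * (theta - 1))
    return np.where(aw <= lam, lam * aw,
                    np.where(aw <= lam * theta, mid, lam ** 2 * (theta + 1) / 2))

def r_prime_0(penalty, lam, theta):
    """r'(0+): half-width of the subdifferential at zero (w = 0 rows of Table 1)."""
    return {"log_sum": lam / theta, "mcp": lam, "scad": lam}[penalty]
```

Each penalty is continuous at its regime boundaries (e.g., MCP equals θλ²/2 at |w| = λθ), which is a quick sanity check on the formulas.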
For the sake of clarity, we present the optimality conditions for MCP and log sum in the next two examples.

Example 1.
For the MCP penalty (see Table 1 for the definition and subdifferential), it is easy to show that ∂r_λ(|0|) = [−λ, λ]. Hence, the Fermat condition becomes

  −x_j^T (y − Xw*) = 0,                               if |w*_j| > λθ,
  −x_j^T (y − Xw*) + λ sign(w*_j) − w*_j/θ = 0,       if 0 < |w*_j| ≤ λθ,
  |x_j^T (y − Xw*)| ≤ λ,                              if w*_j = 0.    (3)

Example 2.
For the log sum penalty, one can explicitly compute ∂r_λ(|0|) = [−λ/θ, λ/θ] and leverage the smoothness of r_λ(|w|) for |w| > 0 to compute ∂r_λ(|w|). The condition in Equation (2) can then be written as:

  −x_j^T (y − Xw*) + λ sign(w*_j)/(θ + |w*_j|) = 0,   if w*_j ≠ 0,
  |x_j^T (y − Xw*)| ≤ λ/θ,                            if w*_j = 0.    (4)

As we can see, first-order optimality conditions lead to simple equations and inclusions. More interestingly, one can note that, regardless of the regularizer, the structure of the optimality condition for a weight w*_j = 0 depends on the correlation of the feature x_j with the optimal residual. Hence, these conditions can be used to define a region in which the optimal residual y − Xw* has to live.

Given a set A of m indices belonging to {1, …, d}, we denote as restricted problem the problem defined in Equation (1) but restricted to the columns of X indexed by A, leading thus to

  min_{w_A ∈ R^m} (1/2) ||y − X_A w_A||² + Σ_{j=1}^{|A|} r_λ(|w_{j,A}|).   (5)

Naturally, a vector w*_A minimizing this problem has to satisfy its own Fermat condition. However, the next proposition derives another necessary condition for achieving the Fermat condition, which will be useful in the sequel for characterizing optimality of w̃*_A on the full problem.

Proposition 1. If w*_A satisfies the Fermat condition of problem (5), then for all j ∈ A, we have |x_j^T (y − X_A w*_A)| ≤ r'_λ(0).

Proof.
At first, note that since the function r_λ is concave, its derivative is decreasing, hence r'_λ(w) ≤ r'_λ(0) for all w ≥ 0. Now, when j ∈ {j ∈ A : w*_{j,A} = 0}, the above inequality comes directly from the Fermat condition. When j ∈ {j ∈ A : w*_{j,A} ≠ 0}, then for a stationary point we must have x_j^T (y − X_A w*_A) = sign(w*_{j,A}) r'_λ(|w*_{j,A}|). Taking the absolute value of this equation and plugging in the inequality on the derivative concludes the proof.

Based on this latter condition, we define, for j ∈ {1, …, d}, the function h_j(a) := |x_j^T a| − r'_λ(0), the convex set C_j defined by the slab C_j := {a ∈ R^n : h_j(a) ≤ 0}, and C_j^= := {a ∈ R^n : h_j(a) = 0}. We now define the sets used for characterizing candidate stationary points of either Equation (1) or (5): C := ∩_{j=1}^d C_j and C_A := ∩_{j∈A} C_j. With these notations, the necessary optimality condition of Proposition 1 can be written as y − X_A w*_A ∈ C_A. Note that C is the dual feasible set for convex regularizers. Now, assume w*_A is a minimizer of its restricted problem; then w̃*_A (in R^d) satisfies the Fermat condition of the full problem if and only if we also have

  y − X w̃*_A ∈ C_Ā,   (6)

where Ā is the complement of A in {1, …, d}. Indeed, since w*_A is optimal for the restricted problem, the Fermat condition is already satisfied for all j ∈ A. Condition (6) then ensures that, for all j ∈ Ā, we have |x_j^T (y − X w̃*_A)| ≤ r'_λ(0), which is the Fermat condition at w̃*_j = 0 (recall that, by definition, w̃*_j = 0 for all j ∈ Ā). As they play a key role in our algorithm and for monitoring optimality, we define the distances of vectors r and s to the sets C_j and C_j^= as

  dist(r, C_j)     := min_{z ∈ R^n} ||z − r||   s.t.  h_j(z) ≤ 0,
  dist_S(s, C_j^=) := min_{z ∈ R^n} ||z − s||   s.t.  h_j(z) = 0.

From these definitions, we also note that, given a set A, the solution w_A of Equation (5) and r = y − X_A w_A, the index j* = argmax_{j ∈ Ā} dist(r, C_j) is the index of the most violated constraint among the non-active variables for the residual r.

A working set algorithm for solving problem (1) consists in sequentially solving a series of restricted problems, as defined in Equation (5), with a sequence of working sets A_1, A_2, …, A_k. The main difference among working set algorithms lies in the way the set is updated. For instance, [1, 2] select the most violated constraints (as defined above) in the non-active set for inclusion in the new working set, leading to the algorithm presented in the supplementary material. Flamary et al. [11] followed a similar approach but considered a randomized selection in which the probability of selecting j is related to dist(y − X_A w*_A, C_j).

The algorithm we propose is similar to Blitz [16], a working set algorithm dedicated to convex constrained optimization problems. As the problem we address is non-convex, the mathematical objects we manipulate need to be adapted. The procedure is presented in Algorithm 1. It starts by selecting a small subset of indices as the initial working set (either randomly or cleverly, for instance the indices with largest |x_j^T y|), and by choosing a vector s_1 ∈ C = ∩_{j=1}^d C_j, for instance s_1 = 0. From this vector, we generate a sequence {s_k} that plays a key role in the selection of the constraints to be integrated in the next restricted model. At iteration k, the approach first solves the restricted problem with the set A_k and then computes the residual r_k = y − X_{A_k} w*_{A_k}.
As noted in Equation (6), if r_k ∈ C_Ā_k, then the vector w̃*_{A_k} is a stationary point of the full problem. If r_k ∉ C_Ā_k, we need to update the working set A_k. To do so, we define s_{k+1} as the point of the segment [s_k, r_k] nearest to r_k that belongs to C. Then, we update the working set by prioritizing each coordinate j according to the distance of s_{k+1} to C_j^=.

Algorithm 1 FireWorks: Feasible Residual Working Set Algorithm
Input: {X, y}, initial working set A_1, s_1 ∈ C, a sequence τ_k (or a mechanism for defining τ_k), initial vector w̃_{A_1}
Output: w̃_{A_k}
for k = 1, 2, … do
    w_{A_k} = argmin_w (1/2)||y − X_{A_k} w||² + Σ_{j ∈ A_k} r_λ(|w_j|)    {warm-start solver with w_{A_{k−1}}}
    r_k = y − X_{A_k} w_{A_k}    {compute the current residual}
    α_k = max{α ∈ [0, 1] : α r_k + (1 − α) s_k ∈ C}
    s_{k+1} = α_k r_k + (1 − α_k) s_k    {define the most "feasible" residual}
    A_k ← A_k \ {j ∈ A_k : w_{j,A_k} = 0}    {prune the set of inactive features}
    compute τ_k    {e.g., sort dist_S(s_{k+1}, C_j^=) so as to keep a constant number of features to add}
    A_{k+1} = {j : dist_S(s_{k+1}, C_j^=) ≤ τ_k} ∪ A_k    {update working set}
end for
Build w̃_{A_k}

Figure 1: Illustrating the constraint selection. (left) Given three variables, we plot their associated slabs {C_j}; C is the intersection of the slabs, and the initial working set contains a single variable. (middle) After the first iteration, the residual r_1 satisfies the constraint of the active variable and thus lies in its feasible region. The segment [s_1, r_1] then gives the most feasible point s_2 ∈ C. If τ_1 is chosen so as to select only one feature, it is j = 1, and this feature joins the working set. (right) After optimizing over the new working set, the residual r_2 satisfies the constraints of both active variables and thus lies in the intersection of their slabs.

The constraint associated with j is included in the new working set if dist_S(s_{k+1}, C_j^=) ≤ τ_k. Here, τ_k is a strictly positive term that determines the number of constraints to be added to the current working set. In practice, we have chosen τ_k so that a fixed number of constraints is added to A_k for each k.

We provide the following intuition on why this algorithm works in practice. At first, note that by construction s_{k+1} is a convex combination of two vectors, one of which is the residual, which justifies its interpretation as a residual. However, the main difference between the s_k's and the r_k's is that the former always belong to C, and thus to any C_Ā, while r_k belongs to C only for a w̃*_{A_k} that is optimal for the full problem.
Then, when w*_{A_k} is a stationary point for the restricted problem but not for the full problem, r_k ∈ C_{A_k} but r_k ∉ C. Hence, s_{k+1} represents a residual candidate for optimality, and the constraints near this residual can be interpreted as the constraints that need to be relaxed by integrating the related features j into the working set (thus allowing w_j to be potentially non-zero at the next iteration). The indices j for which the distance of s_{k+1} to C_j^= is below the given threshold are then integrated into the working set. The mechanism for constraint selection is illustrated in Figure 1.

In this subsection, we provide some analysis of the proposed algorithm. At first, let us introduce an optimality condition for a vector solving the restricted problem.
Proposition 2.
Given a working set A_k and w*_{A_k} solving the related restricted problem, w̃*_{A_k} is also optimal for the full problem if and only if α_k = 1 (which also means s_{k+1} = r_k).

Proof. (sketch) Apply the fact that at optimality we have r_k ∈ C_Ā_k.

Now, we characterize the decrease in objective value obtained between two updates of the working set.

Proposition 3.
Assume that w_{A_k} and w_{A_{k+1}} are respectively the solutions of the restricted problems with the working sets A_k and A_{k+1}, where A_{k+1} = {j} ∪ A_k. Denote r_k = y − X_{A_k} w_{A_k}. Then we have the following inequality:

  ||w̃_{A_{k+1}} − w̃_{A_k}|| ≥ (1/||X||₂) dist(r_k, C_j).

Proof.
We have the following inequalities:

  ||r_{k+1} − r_k|| = ||X (w̃_{A_{k+1}} − w̃_{A_k})|| ≤ ||X||₂ ||w̃_{A_{k+1}} − w̃_{A_k}||.   (7)

Now recall that r_k ∉ C_{A_{k+1}}, since it violates the h_j constraint, while r_{k+1} ∈ C_{A_{k+1}}, as w_{A_{k+1}} has been optimized over A_{k+1}; in particular, h_j(r_{k+1}) ≤ 0. By definition of dist(r_k, C_j), either r_{k+1} is the minimizer of the corresponding projection problem, hence dist(r_k, C_j) = ||r_{k+1} − r_k||, or dist(r_k, C_j) ≤ ||r_{k+1} − r_k||. Plugging this latter inequality into Equation (7) concludes the proof.

Lemma 1.
At step k, consider a constraint C_j such that h_j(r_k) > 0. Then

  dist(r_k, C_j) ≥ ((1 − α_k)/α_k) τ_k.   (8)

The proof of this lemma is deferred to the supplementary material. From Proposition 3 and Lemma 1 above, we can ensure that the sequence {w̃_{A_k}} produced by Algorithm 1 converges towards a stationary point under mild conditions on the inner solver.

Proposition 4.
Suppose that for each step k, the algorithm solving the inner problem ensures a decrease in objective value of the form

  f(w̃_{A_{k+1}}) − f(w̃_{A_k}) ≤ −γ_k ||w̃_{A_{k+1}} − w̃_{A_k}||²,  with γ_k > 0 for all k.

For the inner solver, we also impose that, when solving the problem with the set A_{k+1}, it is warm-started with w_{A_k}. Assume also that ||X||₂ > 0 and that τ_k > 0. Then the sequence {α_k} produced by Algorithm 1 converges towards 1 and, for all j, lim_{k→∞} |x_j^T r_k| ≤ r'_λ(0).

Proof. Using the result of Proposition 3, Lemma 1 and the above assumption, we have

  f(w̃_{A_{k+1}}) ≤ f(w̃_{A_k}) − (γ_k/||X||₂²) ((1 − α_k)/α_k)² τ_k² ≤ f(w̃_{A_1}) − Σ_{ℓ=1}^k (γ_ℓ/||X||₂²) ((1 − α_ℓ)/α_ℓ)² τ_ℓ².   (9)

This means that Σ_{ℓ=1}^k (γ_ℓ/||X||₂²) ((1 − α_ℓ)/α_ℓ)² τ_ℓ² ≤ f(w̃_{A_1}) − f(w̃_{A_{k+1}}). Since f is bounded from below, the right-hand side is smaller than some positive constant, hence Σ_{ℓ=1}^∞ (γ_ℓ/||X||₂²) ((1 − α_ℓ)/α_ℓ)² τ_ℓ² < ∞. Since the latter sum is bounded, its general term tends to 0 as ℓ → ∞, and as γ_ℓ > 0 and τ_ℓ > 0, we have lim_{ℓ→∞} α_ℓ = 1. Now, using the definition of s_{k+1}, we have, for all j, x_j^T r_k = (1/α_k) x_j^T s_{k+1} − ((1 − α_k)/α_k) x_j^T s_k. Taking the absolute value, applying the triangle inequality and using the fact that s_k ∈ C for all k concludes the proof.

The above proposition ensures convergence to a stationary point under some conditions on the inner solver. Several algorithms satisfy this assumption.
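As an illustration of an inner solver meeting this requirement, here is a minimal proximal gradient method with a backtracking line search that enforces the sufficient decrease condition above. It is a sketch, not the authors' implementation: we show it with the MCP proximal step ("firm thresholding"), and we assume the step size stays below θ so that the proximal subproblems are convex.

```python
import numpy as np

def mcp_value(w, lam, theta):
    """MCP penalty value, summed over coordinates (Table 1)."""
    aw = np.abs(w)
    return np.where(aw <= lam * theta,
                    lam * aw - w ** 2 / (2 * theta),
                    theta * lam ** 2 / 2).sum()

def mcp_prox(v, step, lam, theta):
    """Closed-form proximal step for MCP (firm thresholding).
    Exact as long as step < theta, so each 1-d subproblem is convex."""
    av = np.abs(v)
    inner = np.sign(v) * np.maximum(av - step * lam, 0.0) / (1.0 - step / theta)
    return np.where(av > lam * theta, v, inner)

def inner_solver(X, y, w, lam, theta, sigma=0.1, tol=1e-6, max_iter=500):
    """Proximal gradient on f(w) = 0.5 ||y - Xw||^2 + sum_j r(|w_j|) with
    backtracking enforcing the sufficient decrease
        f(w_new) <= f(w) - (sigma / (2 step)) ||w_new - w||^2."""
    L = np.linalg.norm(X, 2) ** 2          # Lipschitz constant of the smooth part
    f = lambda u: 0.5 * np.sum((y - X @ u) ** 2) + mcp_value(u, lam, theta)
    fw = f(w)
    for _ in range(max_iter):
        grad = X.T @ (X @ w - y)
        step = 1.0 / L                     # assumed < theta
        while True:                        # halve the step until decrease holds
            w_new = mcp_prox(w - step * grad, step, lam, theta)
            if f(w_new) <= fw - sigma / (2 * step) * np.sum((w_new - w) ** 2):
                break
            step /= 2.0
        if np.linalg.norm(w_new - w) <= tol:
            return w_new
        w, fw = w_new, f(w_new)
    return w
```

With σ ∈ (0, 1) the backtracking loop terminates, since the decrease condition is guaranteed once the step is at most (1 − σ)/L.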
For instance, any first-order iterative algorithm which selects its step size 1/t_k through a line-search criterion of the form f(w_{k+1}) ≤ f(w_k) − (σ t_k/2) ||w_{k+1} − w_k||² for all k, where σ is a constant in the interval (0, 1), provides such a guarantee. This is the case of the generalized proximal algorithm of Gong et al. [15] or of proximal Newton approaches [22], assuming that the smooth part of f is differentiable with Lipschitz gradient and that r_λ(·) admits a proximal operator. As non-convex block-coordinate descent algorithms [3] can also be interpreted as proximal algorithms, they also satisfy the sufficient decrease condition under the same assumptions as proximal approaches.

Inexact inner solver   One key point when considering a meta-solver like Blitz [16] or a working set algorithm is that, for some approaches, theoretical properties hold only when the inner problem is solved exactly. This is for instance the case for the SimpleSVM algorithm of Vishwanathan et al. [30] or the active set algorithm proposed by Boisbunon et al. [1], whose convergence is based on non-cyclicity of the constraint selection (prohibiting pruning) and thus on the ability to solve the inner problem exactly. For the approach we propose, we show next that the distance between two consecutive inexact solutions of the inner problem is still lower bounded.
Proposition 5.
Let w_{A_k} and w_{A_{k+1}} be approximate solutions of the inner problem with working sets A_k and A_{k+1} respectively. Assume that w_{A_{k+1}} is obtained with a tolerance ξ_{k+1} ≤ τ_k on its Fermat condition (e.g., for the log sum penalty, the conditions of Equation (4) are satisfied up to ξ_{k+1}). Then the following inequality holds:

  ||w_{A_{k+1}} − w_{A_k}|| ≥ (1/||X||₂) (dist(r_k, C_j) − ξ_{k+1}).

Proof.
First note that if w_{A_{k+1}} is such that r_{k+1} ∈ C_{A_{k+1}}, then we are in the same situation as in Proposition 3 and the same proof applies. Let us assume then that r_{k+1} ∉ C_{A_{k+1}} and dist(r_{k+1}, C_j) ≤ ξ_{k+1}. Define u as the point of C_j achieving the distance of r_k to C_j, and p as the point minimizing the distance between r_{k+1} and the segment [u, r_k]. Then, owing to simple geometrical arguments and orthogonality, we have ||r_{k+1} − r_k||² = ||r_{k+1} − p||² + ||p − r_k||², and thus ||r_{k+1} − r_k|| ≥ ||r_k − p||. Now, because p belongs to the segment defined by u and r_k, we have

  ||r_{k+1} − r_k|| ≥ ||r_k − u|| − ||u − p|| ≥ dist(r_k, C_j) − ξ_{k+1},

where the last inequality comes from the fact that ||u − p|| = dist(r_{k+1}, C_j) ≤ ξ_{k+1}. Plugging this inequality into Equation (7) completes the proof.

Note that the above lower bound is meaningful only if the tolerance ξ_{k+1} is smaller than the distance of the residual to the set C_j. This is a reasonable assumption, since we expect r_k to violate C_j. We can now derive conditions for convergence towards a stationary point of the full problem.

Corollary 1.
Under the assumptions of Proposition 4, and assuming that the sequence of tolerances is such that Σ_k ξ_k < ∞, Algorithm 1 produces a sequence of iterates that converges towards a stationary point.

The proof follows the same steps as that of Proposition 4, with the addition of the convergent series of ξ_k, and has thus been omitted. Note that the assumption of a convergent sum of errors is a common one, notably in the proximal algorithm literature [8, 29]; it helps guarantee convergence towards an exact stationary point instead of approximate convergence (up to a tolerance τ).

Relation with maximum violated constraint.
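Both residual updates, the segment search of Algorithm 1 and the pure rescaling recovered when s_k is reset to 0, can be written in a few lines. A sketch with our own naming, where c = r'_λ(0) (c = λ/θ for the log sum penalty):

```python
import numpy as np

def fireworks_step(r, s, X, c):
    """Largest alpha in [0, 1] such that alpha*r + (1 - alpha)*s lies in
    C = {a : |x_j^T a| <= c for all j}, assuming s already lies in C."""
    alpha = 1.0
    for aj, bj in zip(X.T @ r, X.T @ s):
        if abs(aj) > c:  # slab j is violated at alpha = 1; find the crossing
            alpha = min(alpha, (np.sign(aj) * c - bj) / (aj - bj))
    return alpha

def maxvc_step(r, X, c):
    """Special case s_k = 0: a pure rescaling of r, with the scale set by
    the largest correlation |x_j^T r| (the most violated constraint)."""
    m = np.abs(X.T @ r).max()
    return 1.0 if m <= c else c / m
```

With s = 0, fireworks_step reduces exactly to maxvc_step, which makes the connection discussed below explicit.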
The mechanism we have proposed for updating the working set is based on the current residual r_k and a feasible residual s_k. By changing how s_{k+1} is defined, we can recover the classical maximal violated constraint approach. Indeed, if we set, in Algorithm 1, s_k = 0 for all k and s_{k+1} = α_k r_k with α_k ∈ [0, 1], then s_{k+1} is a rescaling of the current residual, and the scale is chosen so that s_{k+1} ∈ C. Using a simple inequality argument, it is straightforward to show that, for the log sum penalty, α_k = min(min_{j ∈ Ā} (λ/θ)/|x_j^T r_k|, 1), and the minimum over j occurs for the largest value of |x_j^T r_k|. Recall again that the polynomial convergence of this algorithm is guaranteed only for an exact inner solver and when no working set pruning (removing from the set A_k the variables whose weights are 0) occurs.

Set-up
We now present some numerical studies showing the computational gain achieved by our approach. As inner solvers and baseline algorithms, we have considered a proximal algorithm [15] and a block-coordinate descent approach [3]; they are respectively denoted GIST and BCD. They have been implemented in Python/Numpy, and the code will be shared online upon publication. We have integrated those baselines into the maximum-violating constraint (MaxVC) working set approach (algorithm in the appendix) and into our approach, denoted FireWorks (for FeasIble REsidual WORKing Set). For these approaches, we leverage the closed-form proximal operator available for several non-convex regularizers. For our experiments, we have used the log-sum penalty, whose hyperparameter θ was kept fixed. For all algorithms, the stopping criterion is based on the tolerance over the Fermat optimality condition, and the measure for comparison is the CPU running time of each algorithm. For all problems, we have set τ_k so as to add a fixed number of features into the working sets of MaxVC and of our FireWorks algorithm.

Table 2: Running time in seconds of different algorithms on different problems. The first column reports the data, the tolerance on the stopping criterion and the constant K such that λ = K max_j |x_j^T y| (the larger K, the sparser w* is). The small Toy dataset has n = 100, d = 1000 and p = 30; the large one has n = 1000, d = 5000 and p = 500. For each inner solver, we bold the most efficient algorithm. '−' denotes that the algorithm did not finish one iteration within the time budget. The number in parentheses is the number of non-zero weights in w*_A. All experiments have been run on a single core of an Intel Xeon CPU E5-2680 clocked at 2.4 GHz. Columns: GIST, MaxVC GIST, FireWorks GIST, BCD, MaxVC BCD, FireWorks BCD; rows: Toy small, Toy large, Leukemia, newsgroup-3, newsgroup-7 and Criteo under several tolerance and K settings. [Individual timing entries omitted.]

Toy problem
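The synthetic data described in the next paragraph can be generated as follows. This is a sketch: the exact offset added to the non-zero coefficients and the noise level are not recoverable from the text, so the 0.1 values below, like the function name, are our assumptions.

```python
import numpy as np

def make_toy(n, d, p, sigma=0.1, offset=0.1, seed=0):
    """Synthetic sparse regression data: i.i.d. standard Gaussian design,
    p non-zero coefficients pushed away from zero by +/- offset, and
    Gaussian noise of standard deviation sigma (offset and sigma assumed)."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, d))
    w = np.zeros(d)
    support = rng.choice(d, size=p, replace=False)  # random non-zero positions
    vals = rng.standard_normal(p)
    w[support] = vals + offset * np.sign(vals)      # shift away from zero
    y = X @ w + sigma * rng.standard_normal(n)
    return X, y, w
```

For the small toy setting of Table 2, this would be called as make_toy(100, 1000, 30).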
Here, the regression matrix X ∈ R^{n×d} has entries drawn i.i.d. from a standard Gaussian distribution (zero mean, unit variance). For given n, d and a number p of active variables, the true coefficient vector w* is obtained as follows: the p non-zero positions are chosen randomly, and their values are drawn from a zero-mean, unit-variance Gaussian distribution, to which a small offset of the same sign as w*_j is added. Finally, the target vector is obtained as y = Xw* + e, where e is zero-mean Gaussian noise with small standard deviation σ. For these problems, a fixed number of features is added to the working set at each iteration. Table 2 (top) presents the running time needed by the different algorithms to reach convergence under various settings. We note that our FireWorks algorithm is faster than the genuine inner solver and at least on par with the MaxVC approach, especially in settings where λ is properly tuned with respect to the number of variables, i.e., when the solution is not too sparse.

Real data
We have reported comparisons on three real datasets. The first one is the Leukemia dataset [14], which has a dense regression matrix with n = 72 and d = 7129. We have also considered sparse problems from the newsgroups dataset: keeping only the categories related to religion and graphics gives newsgroup-3, with n = 1441 and d = 26488, while keeping 7 comp categories gives newsgroup-7, with n = 4891 and d = 94414; both design matrices are sparse. We have also used a large-scale dataset, a subset of the Criteo Kaggle dataset, composed of millions of samples and features, with a correspondingly large number of non-zero elements. For Leukemia, we added a small number of features at each iteration, and larger numbers for the newsgroup and Criteo problems. Results reported in Table 2 (bottom) show that using FireWorks leads to a speedup of at least one order of magnitude compared to the baseline algorithm. For large λ, leading to sparse solutions, MaxVC is the most efficient approach on Leukemia, while for the large-scale datasets newsgroup-3 and newsgroup-7, FireWorks benefits from pruning and is substantially faster than all competitors. For Criteo, only the BCD working set algorithms are able to terminate in reasonable time, and FireWorks is more efficient than MaxVC.
Conclusions
We have introduced in this paper a working-set-based meta-algorithm for non-convex regularized regression. By mimicking the concept of a primal-dual approach, but in a non-convex setting, we were able to derive a novel rule for updating the variables optimized by an iterative incremental algorithm. From a theoretical point of view, we showed convergence of the algorithm, even when the inner problem is not solved exactly. This is in contrast with the classical maximal violating constraint approach, whose convergence is based on the exact resolution of each inner problem. Our experimental results show the computational gain achieved for a given solver when it is applied within our working set algorithm rather than directly on the full set of variables.
Broader Impact
We expect this work to benefit research and applications related to large-scale sparse learning problems. Since this is a methodological contribution, it is hard to foresee any societal consequences without reference to precise applications.
Acknowledgments
This work benefited from the support of the project OATMIL ANR-17-CE23-0012 and of the 3IA Côte d'Azur Investments ANR-19-P3IA-0002 of the French National Research Agency (ANR). This work was performed using computing resources of CRIANN (Normandy, France).
References

[1] A. Boisbunon, R. Flamary, and A. Rakotomamonjy. Active set strategy for high-dimensional non-convex sparse optimization problems. In ICASSP, pages 1517–1521. IEEE, 2014.
[2] A. Boisbunon, R. Flamary, A. Rakotomamonjy, A. Giros, and J. Zerubia. Large scale sparse optimization for object detection in high resolution images. In IEEE Workshop on Machine Learning for Signal Processing (MLSP), 2014.
[3] P. Breheny and J. Huang. Coordinate descent algorithms for nonconvex penalized regression, with applications to biological feature selection. Ann. Appl. Stat., 5(1):232, 2011.
[4] E. J. Candès, M. B. Wakin, and S. P. Boyd. Enhancing sparsity by reweighted ℓ1 minimization. J. Fourier Anal. Applicat., 14(5-6):877–905, 2008.
[5] S. Chen and D. Donoho. Basis pursuit. In Proceedings of the 1994 28th Asilomar Conference on Signals, Systems and Computers. IEEE, 1994.
[6] S. S. Chen, D. L. Donoho, and M. A. Saunders. Atomic decomposition by basis pursuit. SIAM Review, 43(1):129–159, 2001.
[7] F. H. Clarke. Methods of Dynamic and Nonsmooth Optimization. SIAM, 1989.
[8] P. L. Combettes and V. R. Wajs. Signal recovery by proximal forward-backward splitting. Multiscale Modeling & Simulation, 4(4):1168–1200, 2005.
[9] D. L. Donoho. Compressed sensing. IEEE Trans. Inf. Theory, 52(4):1289–1306, 2006.
[10] J. Fan and R. Li. Variable selection via nonconcave penalized likelihood and its oracle properties. J. Amer. Statist. Assoc., 96(456):1348–1360, 2001.
[11] R. Flamary, A. Rakotomamonjy, and G. Gasso. Importance sampling strategy for non-convex randomized block-coordinate descent. Pages 301–304, 2015.
[12] J. Friedman, T. J. Hastie, and R. Tibshirani. Regularization paths for generalized linear models via coordinate descent. J. Stat. Softw., 33(1):1–22, 2010.
[13] T. Glasmachers and C. Igel. Maximum-gain working set selection for SVMs. Journal of Machine Learning Research, 7(Jul):1437–1466, 2006.
[14] T. R. Golub, D. K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J. P. Mesirov, H. Coller, M. L. Loh, J. R. Downing, M. A. Caligiuri, et al. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, 286(5439):531–537, 1999.
[15] P. Gong, C. Zhang, Z. Lu, J. Huang, and J. Ye. A general iterative shrinkage and thresholding algorithm for non-convex regularized optimization problems. In ICML, pages 37–45, 2013.
[16] T. B. Johnson and C. Guestrin. Blitz: A principled meta-algorithm for scaling sparse optimization. In ICML, volume 37, pages 1171–1179, 2015.
[17] A. Ya. Kruger. On Fréchet subdifferentials. Journal of Mathematical Sciences, 116(3):3325–3358, 2003.
[18] M. Lustig, D. L. Donoho, J. M. Santos, and J. M. Pauly. Compressed sensing MRI. IEEE Signal Processing Magazine, 25(2):72–82, 2008.
[19] M. Massias, A. Gramfort, and J. Salmon. From safe screening rules to working sets for faster lasso-type solvers. In NIPS-OPT, 2017.
[20] M. Massias, A. Gramfort, and J. Salmon. Celer: a fast solver for the lasso with dual extrapolation. In ICML, volume 80, pages 3315–3324, 2018.
[21] B. S. Mordukhovich, N. M. Nam, and N. D. Yen. Fréchet subdifferential calculus and optimality conditions in nondifferentiable programming. Optimization, 55(5-6):685–708, 2006.
[22] A. Rakotomamonjy, R. Flamary, and G. Gasso. DC proximal Newton for nonconvex optimization problems. IEEE Transactions on Neural Networks and Learning Systems, 27(3):636–647, 2015.
[23] A. Rakotomamonjy, G. Gasso, and J. Salmon. Screening rules for lasso with non-convex sparse regularizers. In ICML, volume 97, pages 5341–5350, 2019.
[24] R. T. Rockafellar and R. J.-B. Wets. Variational Analysis, volume 317. Springer Science & Business Media, 2009.
[25] S. K. Shevade and S. S. Keerthi. A simple and efficient algorithm for gene selection using sparse logistic regression. Bioinformatics, 19(17):2246–2253, 2003.
[26] E. Soubies, L. Blanc-Féraud, and G. Aubert. A unified view of exact continuous penalties for ℓ2-ℓ0 minimization. SIAM J. Optim., 27(3):2034–2060, 2017.
[27] R. Tibshirani. Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B Stat. Methodol., 58(1):267–288, 1996.
[28] R. Tibshirani, J. Bien, J. Friedman, T. J. Hastie, N. Simon, J. Taylor, and R. J. Tibshirani. Strong rules for discarding predictors in lasso-type problems. J. R. Stat. Soc. Ser. B Stat. Methodol., 74(2):245–266, 2012.
[29] S. Villa, S. Salzo, L. Baldassarre, and A. Verri. Accelerated and inexact forward-backward algorithms. SIAM Journal on Optimization, 23(3):1607–1633, 2013.
[30] S. V. N. Vishwanathan, A. J. Smola, and M. N. Murty. SimpleSVM. In ICML, pages 760–767, 2003.
[31] R. Wang, N. Xiu, and S. Zhou. Fast Newton method for sparse logistic regression. arXiv preprint arXiv:1901.02768, 2019.
[32] J. Ye and J. Liu. Sparse methods for biomedical data. ACM SIGKDD Explorations Newsletter, 14(1):4–15, 2012.
[33] C.-H. Zhang. Nearly unbiased variable selection under minimax concave penalty. Ann. Statist., 38(2):894–942, 2010.
[34] T. Zhang. Analysis of multi-stage convex relaxation for sparse regularization. Journal of Machine Learning Research, 11(Mar):1081–1107, 2010.

Supplementary material
Provably Convergent Working Set Algorithm for Non-Convex Regularized Regression
The maximum-violating constraint algorithm is a simple algorithm that solves at each iteration a sub-problem over a subset of variables and then adds the variables that most violate the condition $y - X \tilde{w}^\star_{A_k} \in C_{\bar{A}_k}$, where "the most" is measured in terms of the distance to each set $C_j$, with $j \in \bar{A}_k$. Hence, at each iteration, we compute all these distances, sort them in descending order, and add to the current working set the $n_k$ variables yielding the largest distances. The algorithm is presented below.

Algorithm 2
Maximum Violating Constraints Algorithm
Input: $\{X, y\}$, active set $A_1$, number $n_k$ of variables to add at iteration $k$, initial vector $\tilde{w}_{A_1}$
Output: $\tilde{w}_{A_k}$
for $k = 1, 2, \ldots$ do
  $w_{A_k} = \arg\min_{w} \frac{1}{2}\|y - X_{A_k} w\|^2 + \sum_{j \in A_k} r_\lambda(|w_j|)$ {warm-start solver with $w_{A_{k-1}}$}
  $r_k = y - X_{A_k} w_{A_k}$ {compute the current residual}
  $v = \operatorname{argsort}_j \operatorname{dist}(r_k, C_j)$ in descending order
  $A_{k+1} = v[0:n_k] \cup A_k$ {update working set by adding the $n_k$ most violating variables}
end for
Build $\tilde{w}_{A_k}$

Proposition. Given a working set $A_k$ and $w^\star_{A_k}$ solving the related restricted problem, $\tilde{w}^\star_{A_k}$ is also optimal for the full problem if and only if $\alpha_k = 1$ (which also means $s_{k+1} = r_k$).

Proof. Assume that $w^\star_{A_k}$ and $\tilde{w}^\star_{A_k}$ are optimal respectively for the restricted and the full problem, and let us show that in this case $\alpha_k = 1$. Since $\tilde{w}^\star_{A_k}$ is optimal for the full problem, we have $|x_j^\top (y - X \tilde{w}^\star_{A_k})| \le r_\lambda'(0)$ for all $j \in \bar{A}_k$, which yields the equivalent statements $y - X \tilde{w}^\star_{A_k} \in C \Leftrightarrow y - X_{A_k} w^\star_{A_k} \in C \Leftrightarrow r_k \in C$, and thus $\alpha_k = 1$.
Conversely, assume that $\alpha_k = 1$ and let us show that $\tilde{w}^\star_{A_k}$ is optimal for the full problem. Since $\alpha_k = 1$, we have $s_{k+1} = r_k$ and thus $r_k \in C$. The latter means that $|x_j^\top (y - X_{A_k} w^\star_{A_k})| \le r_\lambda'(0)$ for all $j \in \bar{A}_k$, and thus $|x_j^\top (y - X \tilde{w}^\star_{A_k})| \le r_\lambda'(0)$ for all $j \in \bar{A}_k$. Given this last property and the definition of $\tilde{w}^\star_{A_k}$ from $w^\star_{A_k}$, we conclude that $\tilde{w}^\star_{A_k}$ is optimal for the full problem.
The proof follows similar steps as those given by Johnson and Guestrin [16].
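The selection rule of Algorithm 2 can be sketched as follows, assuming each constraint set is the slab $C_j = \{z : |x_j^\top z| \le t\}$ with $t = r_\lambda'(0)$, for which the Euclidean distance admits the closed form $\max(|x_j^\top r| - t, 0)/\|x_j\|$; the function name and interface are hypothetical:

```python
import numpy as np

def maxvc_select(X, r, active, t, n_k):
    """Hypothetical helper: add the n_k most violating inactive variables,
    assuming slab constraints C_j = {z : |x_j^T z| <= t}."""
    d = X.shape[1]
    inactive = np.setdiff1d(np.arange(d), active)
    corr = np.abs(X[:, inactive].T @ r)              # |x_j^T r| for inactive j
    norms = np.linalg.norm(X[:, inactive], axis=0)
    dist = np.maximum(corr - t, 0.0) / norms         # distance of r to each slab
    order = np.argsort(dist)[::-1]                   # sort in descending order
    return np.union1d(active, inactive[order[:n_k]]) # grow the working set
```

In the full algorithm this selection step alternates with warm-started inner solves restricted to the current working set.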
Lemma 1.
At step $k$, consider a constraint $C_j$ such that $h_j(r_k) > 0$. Then
$$\operatorname{dist}(r_k, C_j) \ge \frac{1 - \alpha_k}{\alpha_k} \tau_k. \qquad (10)$$
Proof. Denote by $j$ the index of a function such that $h_j(r_k) > 0$ and let $z_k \in \{z \in \mathbb{R}^n : h_j(z) = 0\}$ be the point of $C_j$ closest to $r_k$. Since $s_{k+1} = \alpha_k r_k + (1 - \alpha_k) s_k$, i.e., $r_k = \frac{1}{\alpha_k}\big(s_{k+1} - (1 - \alpha_k) s_k\big)$, the following equalities hold:
$$\operatorname{dist}(r_k, C_j) = \|z_k - r_k\| = \Big\| z_k - \tfrac{1}{\alpha_k}\big(s_{k+1} - (1 - \alpha_k) s_k\big) \Big\| = \Big\| -z_k + \tfrac{1}{\alpha_k} s_{k+1} - \tfrac{1 - \alpha_k}{\alpha_k} s_k \Big\| = \frac{1 - \alpha_k}{\alpha_k} \Big\| -\tfrac{\alpha_k}{1 - \alpha_k} z_k + \tfrac{1}{1 - \alpha_k} s_{k+1} - s_k \Big\|. \qquad (11)$$
By construction, $h_j(z_k) = 0$ since $z_k$ minimizes the distance, and $h_j(s_{k+1}) = 0$ since $j$ has been chosen as the index of the constraint preventing $s_{k+1}$ from moving further towards $r_k$ within $C$. The set $\{z : h_j(z) \le 0\}$ is convex, and the coefficients $-\frac{\alpha_k}{1 - \alpha_k}$ and $\frac{1}{1 - \alpha_k}$ sum to one without both being non-negative, so $u = -\frac{\alpha_k}{1 - \alpha_k} z_k + \frac{1}{1 - \alpha_k} s_{k+1}$ is an affine extrapolation rather than a convex combination of $z_k$ and $s_{k+1}$; writing $s_{k+1} = \alpha_k z_k + (1 - \alpha_k) u$ and using the convexity of $h_j$ gives $h_j(u) \ge 0$. On the other hand, $s_k \in C_j$ by construction, and moreover $\operatorname{dist}_S(s_k, C_j^=) \ge \tau_k$: indeed, since $h_j(r_k) > 0$, we have $j \notin A_k$, because by construction $r_k \in C_{A_k}$ ($w_{A_k}$ has been optimized over $A_k$), and $j \notin A_k$ implies $\operatorname{dist}_S(s_k, C_j^=) \ge \tau_k$. Now, as $h_j(u) \ge 0$ while $s_k$ lies at distance at least $\tau_k$ from the boundary $C_j^=$, the norm in (11), which equals $\|u - s_k\|$, is lower bounded by $\tau_k$, and hence
$$\operatorname{dist}(r_k, C_j) \ge \frac{1 - \alpha_k}{\alpha_k} \tau_k.$$
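As a quick numerical sanity check of bound (10), consider a single slab constraint with hand-picked points (a toy setting, not the paper's construction):

```python
import numpy as np

# Toy check of dist(r_k, C_j) >= (1 - alpha_k)/alpha_k * tau_k for a slab
# C = {z : |a^T z| <= t}; all values below are hand-picked for illustration.
a = np.array([1.0, 0.0])
t = 1.0
s_k = np.zeros(2)                      # dual point inside C, at margin tau from the boundary
r_k = np.array([3.0, 0.0])             # residual outside C

# Largest alpha such that s_k + alpha * (r_k - s_k) stays in C (here s_k = 0)
alpha = t / abs(a @ (r_k - s_k))       # -> 1/3
tau = (t - abs(a @ s_k)) / np.linalg.norm(a)           # distance of s_k to the boundary
dist = max(abs(a @ r_k) - t, 0.0) / np.linalg.norm(a)  # distance of r_k to the slab
bound = (1 - alpha) / alpha * tau
assert dist >= bound - 1e-12           # bound (10) holds (with equality here)
```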