Provably Convergent Working Set Algorithm for Non-Convex Regularized Regression
Alain Rakotomamonjy
LITIS, Univ. de Rouen and Criteo AI Lab, Paris [email protected]
Rémi Flamary
Univ. Côte d'Azur, CNRS, OCA Lagrange [email protected]
Gilles Gasso
LITIS, INSA de Rouen [email protected]
Joseph Salmon
IMAG, Université de Montpellier, CNRS, Montpellier, France [email protected]
Abstract
Owing to their statistical properties, non-convex sparse regularizers have attracted much interest for estimating a sparse linear model from high-dimensional data. Given that the solution is sparse, a working set strategy accelerates convergence by addressing the optimization problem through an iterative algorithm that increments the number of variables to optimize until the support of the solution is identified. While those methods have been well studied and theoretically supported for convex regularizers, this paper proposes a working set algorithm for non-convex sparse regularizers with convergence guarantees. The algorithm, named FireWorks, is based on a non-convex reformulation of a recent primal-dual approach and leverages the geometry of the residuals. Our theoretical guarantees derive from a lower bound on the objective function decrease between two inner solver iterations, and show convergence to a stationary point of the full problem. More importantly, we also show that convergence is preserved even when the inner solver is inexact, under sufficient decay of the error across iterations. Our experimental results demonstrate high computational gains when using our working set strategy compared to a full problem solver, for both block-coordinate descent and proximal gradient solvers.
Many real-world learning problems are of (very) high dimension. This is the case for natural language processing problems with very large vocabularies, or recommendation problems involving user-item features. In such cases, one way of addressing the learning problem is to consider sparsity-inducing penalties. Likewise, when the solution of a learning problem is known to be sparse, using those penalties yields models that can leverage this prior knowledge. The Lasso [27] and Basis Pursuit [6, 5] were the first approaches to employ the ℓ1-norm penalty for inducing sparsity. The Lasso model has enjoyed large practical success in the machine learning and signal processing communities [25, 9, 18, 32]. Nonetheless, it suffers from theoretical drawbacks (e.g., biased estimates for large coefficients of the model) which can be overcome by considering non-convex sparsity-inducing penalties. Those penalties provide continuous approximations of the ℓ0-(pseudo)norm, which is the true measure of sparsity. There exists a flurry of such penalties, like the Smoothly Clipped Absolute Deviation (SCAD) [10], the Log Sum Penalty (LSP) [4], the capped-ℓ1 penalty and the Minimax Concave Penalty (MCP) [33]. We refer the interested reader to [26] for a discussion of the pros and cons of such non-convex formulations.

In addition to theoretical statistical analyses, efforts have also been made to develop computationally efficient algorithms for non-convex regularized optimization problems. These include coordinate descent algorithms [3], proximal gradient descent [15] and Newton methods [31, 22]. However, all these methods share one kind of inefficiency: they spend a similar computational effort on each variable, even when some variables will end up being irrelevant (zero weight) in the final learnt model. In the non-convex setting, few methods have tried to lift this issue. One approach mixes importance sampling and randomized coordinate descent [11], while another seeks to safely screen features that are irrelevant [23]. Working set (also known as active set) strategies aim at focusing the computational effort on a subset of variables, making them highly relevant for optimization problems with sparse solutions, provided that the algorithm is able to quickly identify the "relevant" features. In the literature, several works on working set algorithms address this selection issue, mostly for convex optimization problems such as the Support Vector Machine problem [30, 13] or the Lasso problem [12, 28, 16, 20]. Working set strategies have been extended to non-convex sparse optimization problems [1, 2], but lack theoretical understanding.

In this work, inspired by the Blitz algorithm proposed by Johnson and Guestrin [16] (see also [19, 20] for its connection with safe screening rules), we propose a theoretically supported method for selecting the working set in non-convex regularized sparse optimization problems.
While Blitz is tailored to convex problems, leveraging primal-dual aspects of the ℓ1-regularized problem, we show that a similar algorithm can be designed by exploiting the key role of the residual in the sparse regression problem. Our algorithm provides a method for selecting the variables to integrate into a working set, together with a theoretical guarantee on the objective value decrease. Based on those results, we provide, as far as we know, the first convergence guarantee for a working set algorithm in a non-convex Lasso setting, and we show that this convergence property is preserved in an inexact setting. In summary, our contributions are the following:

• We propose a working set algorithm for non-convex regularized regression that selects the features to integrate in the model based on a so-called "feasible" residual;

• We show that the algorithm enjoys properties such as convergence to a stationary point, even when the inner solver is inexact, under sufficient decay of the error along the iterations;

• Our experimental results show that our FireWorks algorithm achieves substantial computational gains (up to two orders of magnitude) compared to the baseline approach and to another working set approach.
Notation
We denote by X ∈ R^{n×d} the design matrix. We write vectors of size d or size n in bold, e.g., y ∈ R^n or w ∈ R^d. We will consider several sets, noted in calligraphic mode. Sets of indices are mostly noted A, with A a subset of indices extracted from {1, …, d} and cardinality noted |A|. Given a set A, Ā denotes its complement in {1, …, d}. Sets defined by (unions of) function level sets are denoted C, with indices referring to the defining function. Vectors noted w_A are of size |A|, and we write w̃_A ∈ R^d for the vector with components w_{j,A} for all j ∈ A and 0 elsewhere.

We first introduce the non-convex Lasso problem we are interested in, as well as its first-order optimality conditions. We emphasize the form of the optimality conditions, which will be key for designing our working set algorithm.
We consider solving the problem of least-squares regression with a generic penalty of the form

  min_{w ∈ R^d} f(w) := (1/2) ||y − Xw||² + Σ_{j=1}^d r_λ(|w_j|),   (1)

where y ∈ R^n is a target vector, X = [x_1, …, x_d] ∈ R^{n×d} is the design matrix with column-wise features x_j ∈ R^n, w is the coefficient vector of the model, and the map r_λ : R_+ → R_+ is concave and differentiable on [0, +∞), with a regularization parameter λ > 0. In addition, we assume that r_λ(|w|) is a lower semi-continuous function. Note that most penalty functions, such as SCAD, MCP or log sum (see their definitions in Table 1), have this property, and that for those penalties f(·) is lower bounded.

  Log sum:  r_λ(|w|) = λ log(1 + |w|/θ)
            ∂r_λ(|w|) = [−λ/θ, λ/θ] if w = 0;  {λ sign(w)/(θ + |w|)} if w ≠ 0.

  MCP:      r_λ(|w|) = λ|w| − w²/(2θ) if |w| ≤ λθ;  θλ²/2 if |w| > λθ
            ∂r_λ(|w|) = [−λ, λ] if w = 0;  {λ sign(w) − w/θ} if 0 < |w| ≤ λθ;  {0} if |w| > λθ.

  SCAD:     r_λ(|w|) = λ|w| if |w| ≤ λ;  (−w² + 2θλ|w| − λ²)/(2(θ − 1)) if λ < |w| ≤ λθ;  λ²(1 + θ)/2 if |w| > λθ
            ∂r_λ(|w|) = [−λ, λ] if w = 0;  {λ sign(w)} if 0 < |w| ≤ λ;  {(θλ sign(w) − w)/(θ − 1)} if λ < |w| ≤ λθ;  {0} if |w| > λθ.

Table 1: Common non-convex penalties with their subdifferentials. Here λ > 0, θ > 0 (θ > 1 for MCP, θ > 2 for SCAD).

We consider tools such as Fréchet subdifferentials and limiting subdifferentials [17, 24, 21], well suited for non-smooth and non-convex optimization, so that a vector w* belongs to the set of minimizers (not necessarily global) of Problem (1) if the following Fermat condition holds [7, 17]:

  ∀j,  x_j^T (y − Xw*) ∈ ∂r_λ(|w*_j|),   (2)

with ∂r_λ(|·|) being the Fréchet subdifferential of r_λ(|·|), assuming it exists at w*. In particular, this is the case for the MCP, log sum and SCAD penalties presented in Table 1.
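To make Table 1 concrete, the three penalties and the threshold r'_λ(0) (the half-width of the subdifferential at zero, which drives the slab constraints used later) can be implemented in a few lines. This is an illustrative NumPy sketch; the function names are ours, not from the paper's code.

```python
import numpy as np

def log_sum(w, lam, theta):
    """Log sum penalty: lam * log(1 + |w| / theta)."""
    return lam * np.log1p(np.abs(w) / theta)

def mcp(w, lam, theta):
    """Minimax concave penalty (requires theta > 1)."""
    aw = np.abs(w)
    return np.where(aw <= lam * theta,
                    lam * aw - w ** 2 / (2 * theta),
                    theta * lam ** 2 / 2)

def scad(w, lam, theta):
    """SCAD penalty (requires theta > 2)."""
    aw = np.abs(w)
    mid = (-w ** 2 + 2 * theta * lam * aw - lam ** 2) / (2 * (theta - 1))
    return np.where(aw <= lam, lam * aw,
                    np.where(aw <= lam * theta, mid, lam ** 2 * (theta + 1) / 2))

def r_prime_0(penalty, lam, theta):
    """r'(0+): half-width of the subdifferential at zero (w = 0 rows of Table 1)."""
    return {"log_sum": lam / theta, "mcp": lam, "scad": lam}[penalty]
```

Each penalty is continuous at its regime boundaries (e.g., MCP equals θλ²/2 at |w| = λθ), which is a quick sanity check on the formulas.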
For the sake of clarity, we present the optimality conditions for MCP and log sum in the next two examples.

Example 1.
For the MCP penalty (see Table 1 for the definition and subdifferential), it is easy to show that ∂r_λ(|0|) = [−λ, λ]. Hence, the Fermat condition becomes

  −x_j^T (y − Xw*) = 0,                               if |w*_j| > λθ,
  −x_j^T (y − Xw*) + λ sign(w*_j) − w*_j/θ = 0,       if 0 < |w*_j| ≤ λθ,
  |x_j^T (y − Xw*)| ≤ λ,                              if w*_j = 0.    (3)

Example 2.
For the log sum penalty, one can explicitly compute ∂r_λ(|0|) = [−λ/θ, λ/θ] and leverage the smoothness of r_λ(|w|) for |w| > 0 to compute ∂r_λ(|w|). The condition in Equation (2) can then be written as:

  −x_j^T (y − Xw*) + λ sign(w*_j)/(θ + |w*_j|) = 0,   if w*_j ≠ 0,
  |x_j^T (y − Xw*)| ≤ λ/θ,                            if w*_j = 0.    (4)

As we can see, first-order optimality conditions lead to simple equations and inclusions. More interestingly, one can note that, regardless of the regularizer, the structure of the optimality condition for a weight w*_j = 0 depends on the correlation of the feature x_j with the optimal residual. Hence, these conditions can be used to define a region in which the optimal residual y − Xw* has to live.

Given a set A of m indices belonging to {1, …, d}, we denote as restricted problem the problem defined in Equation (1) but restricted to the columns of X indexed by A, leading thus to

  min_{w_A ∈ R^m} (1/2) ||y − X_A w_A||² + Σ_{j=1}^{|A|} r_λ(|w_{j,A}|).   (5)

Naturally, a vector w*_A minimizing this problem has to satisfy its own Fermat condition. However, the next proposition derives another necessary condition for achieving the Fermat condition, which will be useful in the sequel for characterizing optimality of w̃*_A on the full problem.

Proposition 1. If w*_A satisfies the Fermat condition of problem (5), then for all j ∈ A, we have |x_j^T (y − X_A w*_A)| ≤ r'_λ(0).

Proof.
At first, note that since the function r_λ is concave, its derivative is decreasing, hence r'_λ(w) ≤ r'_λ(0) for all w ≥ 0. Now, when j ∈ {j ∈ A : w*_{j,A} = 0}, the above inequality comes directly from the Fermat condition. When j ∈ {j ∈ A : w*_{j,A} ≠ 0}, then for a stationary point we must have x_j^T (y − X_A w*_A) = sign(w*_{j,A}) r'_λ(|w*_{j,A}|). Taking the absolute value of this equation and plugging in the inequality on the derivative concludes the proof.

Based on this latter condition, we define, for j ∈ {1, …, d}, the function h_j(a) := |x_j^T a| − r'_λ(0), the convex set C_j defined by the slab C_j := {a ∈ R^n : h_j(a) ≤ 0}, and C_j^= := {a ∈ R^n : h_j(a) = 0}. We now define the sets used for characterizing candidate stationary points of either Equation (1) or (5): C := ∩_{j=1}^d C_j and C_A := ∩_{j∈A} C_j. With these notations, the necessary optimality condition of Proposition 1 can be written as y − X_A w*_A ∈ C_A. Note that C is the dual feasible set for convex regularizers. Now, assume w*_A is a minimizer of its restricted problem; then w̃*_A (in R^d) satisfies the Fermat condition of the full problem if and only if we also have

  y − X w̃*_A ∈ C_Ā,   (6)

where Ā is the complement of A in {1, …, d}. Indeed, since w*_A is optimal for the restricted problem, the Fermat condition is already satisfied for all j ∈ A. Condition (6) then ensures that, for all j ∈ Ā, we have |x_j^T (y − X w̃*_A)| ≤ r'_λ(0), which is the Fermat condition at w̃*_j = 0 (recall that, by definition, w̃*_j = 0 for all j ∈ Ā). As they play a key role in our algorithm and for monitoring optimality, we define the distances of vectors r and s to the sets C_j and C_j^= as

  dist(r, C_j)     := min_{z ∈ R^n} ||z − r||   s.t.  h_j(z) ≤ 0,
  dist_S(s, C_j^=) := min_{z ∈ R^n} ||z − s||   s.t.  h_j(z) = 0.

From these definitions, we also note that, given a set A, the solution w_A of Equation (5) and r = y − X_A w_A, the index j* = argmax_{j ∈ Ā} dist(r, C_j) is the index of the most violated constraint among the non-active variables for the residual r.

A working set algorithm for solving problem (1) consists in sequentially solving a series of restricted problems, as defined in Equation (5), with a sequence of working sets A_1, A_2, …, A_k. The main difference among working set algorithms lies in the way the set is updated. For instance, [1, 2] select the most violated constraints (as defined above) in the non-active set for inclusion in the new working set, leading to the algorithm presented in the supplementary material. Flamary et al. [11] followed a similar approach but considered a randomized selection in which the probability of selecting j is related to dist(y − X_A w*_A, C_j).

The algorithm we propose is similar to Blitz [16], a working set algorithm dedicated to convex constrained optimization problems. As the problem we address is non-convex, the mathematical objects we manipulate need to be adapted. The procedure is presented in Algorithm 1. It starts by selecting a small subset of indices as the initial working set (either randomly or cleverly, for instance the indices with largest |x_j^T y|), and by choosing a vector s_1 ∈ C = ∩_{j=1}^d C_j, for instance s_1 = 0. From this vector, we generate a sequence {s_k} that plays a key role in the selection of the constraints to be integrated in the next restricted model. At iteration k, the approach first solves the restricted problem with the set A_k and then computes the residual r_k = y − X_{A_k} w*_{A_k}.
As noted in Equation (6), if r_k ∈ C_Ā_k, then the vector w̃*_{A_k} is a stationary point of the full problem. If r_k ∉ C_Ā_k, we need to update the working set A_k. To do so, we define s_{k+1} as the point of the segment [s_k, r_k] nearest to r_k that belongs to C. Then, we update the working set by prioritizing each coordinate j according to the distance of s_{k+1} to C_j^=.

Algorithm 1 FireWorks: Feasible Residual Working Set Algorithm
Input: {X, y}, initial working set A_1, s_1 ∈ C, a sequence τ_k (or a mechanism for defining τ_k), initial vector w̃_{A_1}
Output: w̃_{A_k}
for k = 1, 2, … do
    w_{A_k} = argmin_w (1/2)||y − X_{A_k} w||² + Σ_{j ∈ A_k} r_λ(|w_j|)    {warm-start solver with w_{A_{k−1}}}
    r_k = y − X_{A_k} w_{A_k}    {compute the current residual}
    α_k = max{α ∈ [0, 1] : α r_k + (1 − α) s_k ∈ C}
    s_{k+1} = α_k r_k + (1 − α_k) s_k    {define the most "feasible" residual}
    A_k ← A_k \ {j ∈ A_k : w_{j,A_k} = 0}    {prune the set of inactive features}
    compute τ_k    {e.g., sort dist_S(s_{k+1}, C_j^=) so as to keep a constant number of features to add}
    A_{k+1} = {j : dist_S(s_{k+1}, C_j^=) ≤ τ_k} ∪ A_k    {update working set}
end for
Build w̃_{A_k}

Figure 1: Illustrating the constraint selection. (left) Given three variables, we plot their associated slabs {C_j}; C is the intersection of the slabs, and the initial working set contains a single variable. (middle) After the first iteration, the residual r_1 satisfies the constraint of the active variable and thus lies in its feasible region. The segment [s_1, r_1] then gives the most feasible point s_2 ∈ C. If τ_1 is chosen so as to select only one feature, it is j = 1, and this feature joins the working set. (right) After optimizing over the new working set, the residual r_2 satisfies the constraints of both active variables and thus lies in the intersection of their slabs.

The constraint associated with j is included in the new working set if dist_S(s_{k+1}, C_j^=) ≤ τ_k. Here, τ_k is a strictly positive term that determines the number of constraints to be added to the current working set. In practice, we have chosen τ_k so that a fixed number of constraints is added to A_k for each k.

We provide the following intuition on why this algorithm works in practice. At first, note that by construction s_{k+1} is a convex combination of two vectors, one of which is the residual, which justifies its interpretation as a residual. However, the main difference between the s_k's and the r_k's is that the former always belong to C, and thus to any C_Ā, while r_k belongs to C only for a w̃*_{A_k} that is optimal for the full problem.
Then, when w*_{A_k} is a stationary point for the restricted problem but not for the full problem, r_k ∈ C_{A_k} but r_k ∉ C. Hence, s_{k+1} represents a residual candidate for optimality, and the constraints near this residual can be interpreted as the constraints that need to be relaxed by integrating the related features j into the working set (thus allowing w_j to be potentially non-zero at the next iteration). The indices j for which the distance of s_{k+1} to C_j^= is below the given threshold are then integrated into the working set. The mechanism for constraint selection is illustrated in Figure 1.

In this subsection, we provide some analysis of the proposed algorithm. At first, let us introduce an optimality condition for a vector solving the restricted problem.
Proposition 2.
Given a working set A_k and w*_{A_k} solving the related restricted problem, w̃*_{A_k} is also optimal for the full problem if and only if α_k = 1 (which also means s_{k+1} = r_k).

Proof. (sketch) Apply the fact that at optimality we have r_k ∈ C_Ā_k.

Now, we characterize the decrease in objective value obtained between two updates of the working set.

Proposition 3.
Assume that w_{A_k} and w_{A_{k+1}} are respectively the solutions of the restricted problems with the working sets A_k and A_{k+1}, where A_{k+1} = {j} ∪ A_k. Denote r_k = y − X_{A_k} w_{A_k}. Then we have the following inequality:

  ||w̃_{A_{k+1}} − w̃_{A_k}|| ≥ (1/||X||₂) dist(r_k, C_j).

Proof.
We have the following inequalities:

  ||r_{k+1} − r_k|| = ||X (w̃_{A_{k+1}} − w̃_{A_k})|| ≤ ||X||₂ ||w̃_{A_{k+1}} − w̃_{A_k}||.   (7)

Now recall that r_k ∉ C_{A_{k+1}}, since it violates the h_j constraint, while r_{k+1} ∈ C_{A_{k+1}}, as w_{A_{k+1}} has been optimized over A_{k+1}; in particular, h_j(r_{k+1}) ≤ 0. By definition of dist(r_k, C_j), either r_{k+1} is the minimizer of the corresponding projection problem, hence dist(r_k, C_j) = ||r_{k+1} − r_k||, or dist(r_k, C_j) ≤ ||r_{k+1} − r_k||. Plugging this latter inequality into Equation (7) concludes the proof.

Lemma 1.
At step k, consider a constraint C_j such that h_j(r_k) > 0. Then

  dist(r_k, C_j) ≥ ((1 − α_k)/α_k) τ_k.   (8)

The proof of this lemma is deferred to the supplementary material. From Proposition 3 and Lemma 1 above, we can ensure that the sequence {w̃_{A_k}} produced by Algorithm 1 converges towards a stationary point under mild conditions on the inner solver.

Proposition 4.
Suppose that for each step k, the algorithm solving the inner problem ensures a decrease in objective value of the form

  f(w̃_{A_{k+1}}) − f(w̃_{A_k}) ≤ −γ_k ||w̃_{A_{k+1}} − w̃_{A_k}||²,  with γ_k > 0 for all k.

For the inner solver, we also impose that, when solving the problem with the set A_{k+1}, it is warm-started with w_{A_k}. Assume also that ||X||₂ > 0 and that τ_k > 0. Then the sequence {α_k} produced by Algorithm 1 converges towards 1 and, for all j, lim_{k→∞} |x_j^T r_k| ≤ r'_λ(0).

Proof. Using the result of Proposition 3, Lemma 1 and the above assumption, we have

  f(w̃_{A_{k+1}}) ≤ f(w̃_{A_k}) − (γ_k/||X||₂²) ((1 − α_k)/α_k)² τ_k² ≤ f(w̃_{A_1}) − Σ_{ℓ=1}^k (γ_ℓ/||X||₂²) ((1 − α_ℓ)/α_ℓ)² τ_ℓ².   (9)

This means that Σ_{ℓ=1}^k (γ_ℓ/||X||₂²) ((1 − α_ℓ)/α_ℓ)² τ_ℓ² ≤ f(w̃_{A_1}) − f(w̃_{A_{k+1}}). Since f is bounded from below, the right-hand side is smaller than some positive constant, hence Σ_{ℓ=1}^∞ (γ_ℓ/||X||₂²) ((1 − α_ℓ)/α_ℓ)² τ_ℓ² < ∞. Since the latter sum is bounded, its general term tends to 0 as ℓ → ∞, and as γ_ℓ > 0 and τ_ℓ > 0, we have lim_{ℓ→∞} α_ℓ = 1. Now, using the definition of s_{k+1}, we have, for all j, x_j^T r_k = (1/α_k) x_j^T s_{k+1} − ((1 − α_k)/α_k) x_j^T s_k. Taking the absolute value, applying the triangle inequality and using the fact that s_k ∈ C for all k concludes the proof.

The above proposition ensures convergence to a stationary point under some conditions on the inner solver. Several algorithms satisfy this assumption.
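As an illustration of an inner solver meeting this requirement, here is a minimal proximal gradient method with a backtracking line search that enforces the sufficient decrease condition above. It is a sketch, not the authors' implementation: we show it with the MCP proximal step ("firm thresholding"), and we assume the step size stays below θ so that the proximal subproblems are convex.

```python
import numpy as np

def mcp_value(w, lam, theta):
    """MCP penalty value, summed over coordinates (Table 1)."""
    aw = np.abs(w)
    return np.where(aw <= lam * theta,
                    lam * aw - w ** 2 / (2 * theta),
                    theta * lam ** 2 / 2).sum()

def mcp_prox(v, step, lam, theta):
    """Closed-form proximal step for MCP (firm thresholding).
    Exact as long as step < theta, so each 1-d subproblem is convex."""
    av = np.abs(v)
    inner = np.sign(v) * np.maximum(av - step * lam, 0.0) / (1.0 - step / theta)
    return np.where(av > lam * theta, v, inner)

def inner_solver(X, y, w, lam, theta, sigma=0.1, tol=1e-6, max_iter=500):
    """Proximal gradient on f(w) = 0.5 ||y - Xw||^2 + sum_j r(|w_j|) with
    backtracking enforcing the sufficient decrease
        f(w_new) <= f(w) - (sigma / (2 step)) ||w_new - w||^2."""
    L = np.linalg.norm(X, 2) ** 2          # Lipschitz constant of the smooth part
    f = lambda u: 0.5 * np.sum((y - X @ u) ** 2) + mcp_value(u, lam, theta)
    fw = f(w)
    for _ in range(max_iter):
        grad = X.T @ (X @ w - y)
        step = 1.0 / L                     # assumed < theta
        while True:                        # halve the step until decrease holds
            w_new = mcp_prox(w - step * grad, step, lam, theta)
            if f(w_new) <= fw - sigma / (2 * step) * np.sum((w_new - w) ** 2):
                break
            step /= 2.0
        if np.linalg.norm(w_new - w) <= tol:
            return w_new
        w, fw = w_new, f(w_new)
    return w
```

With σ ∈ (0, 1) the backtracking loop terminates, since the decrease condition is guaranteed once the step is at most (1 − σ)/L.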
For instance, any first-order iterative algorithm which selects its step size 1/t_k through a line-search criterion of the form f(w_{k+1}) ≤ f(w_k) − (σ t_k/2) ||w_{k+1} − w_k||² for all k, where σ is a constant in the interval (0, 1), provides such a guarantee. This is the case of the generalized proximal algorithm of Gong et al. [15] or of proximal Newton approaches [22], assuming that the smooth part of f is differentiable with Lipschitz gradient and that r_λ(·) admits a proximal operator. As non-convex block-coordinate descent algorithms [3] can also be interpreted as proximal algorithms, they also satisfy the sufficient decrease condition under the same assumptions as proximal approaches.

Inexact inner solver   One key point when considering a meta-solver like Blitz [16] or a working set algorithm is that, for some approaches, theoretical properties hold only when the inner problem is solved exactly. This is for instance the case for the SimpleSVM algorithm of Vishwanathan et al. [30] or the active set algorithm proposed by Boisbunon et al. [1], whose convergence is based on non-cyclicity of the constraint selection (prohibiting pruning) and thus on the ability to solve the inner problem exactly. For the approach we propose, we show next that the distance between two consecutive inexact solutions of the inner problem is still lower bounded.
Proposition 5.
Let w_{A_k} and w_{A_{k+1}} be approximate solutions of the inner problem with working sets A_k and A_{k+1} respectively. Assume that w_{A_{k+1}} is obtained with a tolerance ξ_{k+1} ≤ τ_k on its Fermat condition (e.g., for the log sum penalty, the conditions of Equation (4) are satisfied up to ξ_{k+1}). Then the following inequality holds:

  ||w_{A_{k+1}} − w_{A_k}|| ≥ (1/||X||₂) (dist(r_k, C_j) − ξ_{k+1}).

Proof.
First note that if w_{A_{k+1}} is such that r_{k+1} ∈ C_{A_{k+1}}, then we are in the same situation as in Proposition 3 and the same proof applies. Let us assume then that r_{k+1} ∉ C_{A_{k+1}} and dist(r_{k+1}, C_j) ≤ ξ_{k+1}. Define u as the point of C_j achieving the distance of r_k to C_j, and p as the point minimizing the distance between r_{k+1} and the segment [u, r_k]. Then, owing to simple geometrical arguments and orthogonality, we have ||r_{k+1} − r_k||² = ||r_{k+1} − p||² + ||p − r_k||², and thus ||r_{k+1} − r_k|| ≥ ||r_k − p||. Now, because p belongs to the segment defined by u and r_k, we have

  ||r_{k+1} − r_k|| ≥ ||r_k − u|| − ||u − p|| ≥ dist(r_k, C_j) − ξ_{k+1},

where the last inequality comes from the fact that ||u − p|| = dist(r_{k+1}, C_j) ≤ ξ_{k+1}. Plugging this inequality into Equation (7) completes the proof.

Note that the above lower bound is meaningful only if the tolerance ξ_{k+1} is smaller than the distance of the residual to the set C_j. This is a reasonable assumption, since we expect r_k to violate C_j. We can now derive conditions for convergence towards a stationary point of the full problem.

Corollary 1.
Under the assumptions of Proposition 4, and assuming that the sequence of tolerances is such that Σ_k ξ_k < ∞, Algorithm 1 produces a sequence of iterates that converges towards a stationary point.

The proof follows the same steps as that of Proposition 4, with the addition of the convergent series of ξ_k, and has thus been omitted. Note that the assumption of a convergent sum of errors is a common one, notably in the proximal algorithm literature [8, 29]; it helps guarantee convergence towards an exact stationary point instead of approximate convergence (up to a tolerance τ).

Relation with maximum violated constraint.
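Both residual updates, the segment search of Algorithm 1 and the pure rescaling recovered when s_k is reset to 0, can be written in a few lines. A sketch with our own naming, where c = r'_λ(0) (c = λ/θ for the log sum penalty):

```python
import numpy as np

def fireworks_step(r, s, X, c):
    """Largest alpha in [0, 1] such that alpha*r + (1 - alpha)*s lies in
    C = {a : |x_j^T a| <= c for all j}, assuming s already lies in C."""
    alpha = 1.0
    for aj, bj in zip(X.T @ r, X.T @ s):
        if abs(aj) > c:  # slab j is violated at alpha = 1; find the crossing
            alpha = min(alpha, (np.sign(aj) * c - bj) / (aj - bj))
    return alpha

def maxvc_step(r, X, c):
    """Special case s_k = 0: a pure rescaling of r, with the scale set by
    the largest correlation |x_j^T r| (the most violated constraint)."""
    m = np.abs(X.T @ r).max()
    return 1.0 if m <= c else c / m
```

With s = 0, fireworks_step reduces exactly to maxvc_step, which makes the connection discussed below explicit.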
The mechanism we have proposed for updating the working set is based on the current residual r_k and a feasible residual s_k. By changing how s_{k+1} is defined, we can recover the classical maximal violated constraint approach. Indeed, if we set, in Algorithm 1, s_k = 0 for all k and s_{k+1} = α_k r_k with α_k ∈ [0, 1], then s_{k+1} is a rescaling of the current residual, and the scale is chosen so that s_{k+1} ∈ C. Using a simple inequality argument, it is straightforward to show that, for the log sum penalty, α_k = min(min_{j ∈ Ā} (λ/θ)/|x_j^T r_k|, 1), and the minimum over j occurs for the largest value of |x_j^T r_k|. Recall again that the polynomial convergence of this algorithm is guaranteed only for an exact inner solver and when no working set pruning (removing from the set A_k the variables whose weights are 0) occurs.

Set-up
We now present some numerical studies showing the computational gain achieved by our approach. As inner solvers and baseline algorithms, we have considered a proximal algorithm [15] and a block-coordinate descent approach [3]; they are respectively denoted GIST and BCD. They have been implemented in Python/Numpy, and the code will be shared online upon publication. We have integrated those baselines into the maximum-violating constraint (MaxVC) working set approach (algorithm in the appendix) and into our approach, denoted FireWorks (for FeasIble REsidual WORKing Set). For these approaches, we leverage the closed-form proximal operator available for several non-convex regularizers. For our experiments, we have used the log-sum penalty, whose hyperparameter θ was kept fixed. For all algorithms, the stopping criterion is based on the tolerance over the Fermat optimality condition, and the measure for comparison is the CPU running time of each algorithm. For all problems, we have set τ_k so as to add a fixed number of features into the working sets of MaxVC and of our FireWorks algorithm.

Table 2: Running time in seconds of different algorithms on different problems. The first column reports the data, the tolerance on the stopping criterion and the constant K such that λ = K max_j |x_j^T y| (the larger K, the sparser w* is). The small Toy dataset has n = 100, d = 1000 and p = 30; the large one has n = 1000, d = 5000 and p = 500. For each inner solver, we bold the most efficient algorithm. '−' denotes that the algorithm did not finish one iteration within the time budget. The number in parentheses is the number of non-zero weights in w*_A. All experiments have been run on a single core of an Intel Xeon CPU E5-2680 clocked at 2.4 GHz. Columns: GIST, MaxVC GIST, FireWorks GIST, BCD, MaxVC BCD, FireWorks BCD; rows: Toy small, Toy large, Leukemia, newsgroup-3, newsgroup-7 and Criteo under several tolerance and K settings. [Individual timing entries omitted.]

Toy problem
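The synthetic data described in the next paragraph can be generated as follows. This is a sketch: the exact offset added to the non-zero coefficients and the noise level are not recoverable from the text, so the 0.1 values below, like the function name, are our assumptions.

```python
import numpy as np

def make_toy(n, d, p, sigma=0.1, offset=0.1, seed=0):
    """Synthetic sparse regression data: i.i.d. standard Gaussian design,
    p non-zero coefficients pushed away from zero by +/- offset, and
    Gaussian noise of standard deviation sigma (offset and sigma assumed)."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, d))
    w = np.zeros(d)
    support = rng.choice(d, size=p, replace=False)  # random non-zero positions
    vals = rng.standard_normal(p)
    w[support] = vals + offset * np.sign(vals)      # shift away from zero
    y = X @ w + sigma * rng.standard_normal(n)
    return X, y, w
```

For the small toy setting of Table 2, this would be called as make_toy(100, 1000, 30).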
Here, the regression matrix X ∈ R^{n×d} has entries drawn i.i.d. from a standard Gaussian distribution (zero mean, unit variance). For given n, d and a number p of active variables, the true coefficient vector w* is obtained as follows: the p non-zero positions are chosen randomly, and their values are drawn from a zero-mean, unit-variance Gaussian distribution, to which a small offset of the same sign as w*_j is added. Finally, the target vector is obtained as y = Xw* + e, where e is zero-mean Gaussian noise with small standard deviation σ. For these problems, a fixed number of features is added to the working set at each iteration. Table 2 (top) presents the running time needed by the different algorithms to reach convergence under various settings. We note that our FireWorks algorithm is faster than the genuine inner solver and at least on par with the MaxVC approach, especially in settings where λ is properly tuned with respect to the number of variables, i.e., when the solution is not too sparse.

Real data
We have reported comparisons on three real datasets. The first one is the Leukemia dataset [14], which has a dense regression matrix with n = 72 and d = 7129. We have also considered sparse problems from the newsgroups dataset: keeping only the categories related to religion and graphics gives newsgroup-3, with n = 1441 and d = 26488, while keeping 7 comp categories gives newsgroup-7, with n = 4891 and d = 94414; both design matrices are sparse. We have also used a large-scale dataset, a subset of the Criteo Kaggle dataset, composed of millions of samples and features, with a correspondingly large number of non-zero elements. For Leukemia, we added a small number of features at each iteration, and larger numbers for the newsgroup and Criteo problems. Results reported in Table 2 (bottom) show that using FireWorks leads to a speedup of at least one order of magnitude compared to the baseline algorithm. For large λ, leading to sparse solutions, MaxVC is the most efficient approach on Leukemia, while for the large-scale datasets newsgroup-3 and newsgroup-7, FireWorks benefits from pruning and is substantially faster than all competitors. For Criteo, only the BCD working set algorithms are able to terminate in reasonable time, and FireWorks is more efficient than MaxVC.
Conclusions
We have introduced in this paper a working-set-based meta-algorithm for non-convex regularized regression. By mimicking the concept of a primal-dual approach, but in a non-convex setting, we were able to derive a novel rule for updating the variables optimized by an iterative incremental algorithm. From a theoretical point of view, we showed convergence of the algorithm, even when the inner problem is not solved exactly. This is in contrast with the classical maximal violating constraint approach, whose convergence is based on the exact resolution of each inner problem. Our experimental results show the computational gain achieved for a given solver when it is applied within our working set algorithm rather than directly on the full set of variables.
Broader Impact
We expect this work to benefit research and applications related to large-scale sparse learning problems. Since this is a methodological contribution, it is hard to foresee any societal consequences without reference to precise applications.
Acknowledgments
This work benefited from the support of the project OATMIL ANR-17-CE23-0012 and of the 3IA Côte d'Azur Investments ANR-19-P3IA-0002 of the French National Research Agency (ANR). This work was performed using computing resources of CRIANN (Normandy, France).
References

[1] A. Boisbunon, R. Flamary, and A. Rakotomamonjy. Active set strategy for high-dimensional non-convex sparse optimization problems. In ICASSP, pages 1517–1521. IEEE, 2014.
[2] A. Boisbunon, R. Flamary, A. Rakotomamonjy, A. Giros, and J. Zerubia. Large scale sparse optimization for object detection in high resolution images. In IEEE Workshop on Machine Learning for Signal Processing (MLSP), 2014.
[3] P. Breheny and J. Huang. Coordinate descent algorithms for nonconvex penalized regression, with applications to biological feature selection. Ann. Appl. Stat., 5(1):232, 2011.
[4] E. J. Candès, M. B. Wakin, and S. P. Boyd. Enhancing sparsity by reweighted ℓ1 minimization. J. Fourier Anal. Applicat., 14(5-6):877–905, 2008.
[5] S. Chen and D. Donoho. Basis pursuit. In Proceedings of the 1994 28th Asilomar Conference on Signals, Systems and Computers. IEEE, 1994.
[6] S. S. Chen, D. L. Donoho, and M. A. Saunders. Atomic decomposition by basis pursuit. SIAM Review, 43(1):129–159, 2001.
[7] F. H. Clarke. Methods of Dynamic and Nonsmooth Optimization. SIAM, 1989.
[8] P. L. Combettes and V. R. Wajs. Signal recovery by proximal forward-backward splitting. Multiscale Modeling & Simulation, 4(4):1168–1200, 2005.
[9] D. L. Donoho. Compressed sensing. IEEE Trans. Inf. Theory, 52(4):1289–1306, 2006.
[10] J. Fan and R. Li. Variable selection via nonconcave penalized likelihood and its oracle properties. J. Amer. Statist. Assoc., 96(456):1348–1360, 2001.
[11] R. Flamary, A. Rakotomamonjy, and G. Gasso. Importance sampling strategy for non-convex randomized block-coordinate descent. Pages 301–304, 2015.
[12] J. Friedman, T. J. Hastie, and R. Tibshirani. Regularization paths for generalized linear models via coordinate descent. J. Stat. Softw., 33(1):1–22, 2010.
[13] T. Glasmachers and C. Igel. Maximum-gain working set selection for SVMs. Journal of Machine Learning Research, 7(Jul):1437–1466, 2006.
[14] T. R. Golub, D. K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J. P. Mesirov, H. Coller, M. L. Loh, J. R. Downing, M. A. Caligiuri, et al. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, 286(5439):531–537, 1999.
[15] P. Gong, C. Zhang, Z. Lu, J. Huang, and J. Ye. A general iterative shrinkage and thresholding algorithm for non-convex regularized optimization problems. In ICML, pages 37–45, 2013.
[16] T. B. Johnson and C. Guestrin. Blitz: A principled meta-algorithm for scaling sparse optimization. In ICML, volume 37, pages 1171–1179, 2015.
[17] A. Ya. Kruger. On Fréchet subdifferentials. Journal of Mathematical Sciences, 116(3):3325–3358, 2003.
[18] M. Lustig, D. L. Donoho, J. M. Santos, and J. M. Pauly. Compressed sensing MRI. IEEE Signal Processing Magazine, 25(2):72–82, 2008.
[19] M. Massias, A. Gramfort, and J. Salmon. From safe screening rules to working sets for faster lasso-type solvers. In NIPS-OPT, 2017.
[20] M. Massias, A. Gramfort, and J. Salmon. Celer: a fast solver for the lasso with dual extrapolation. In ICML, volume 80, pages 3315–3324, 2018.
[21] B. S. Mordukhovich, N. M. Nam, and N. D. Yen. Fréchet subdifferential calculus and optimality conditions in nondifferentiable programming. Optimization, 55(5-6):685–708, 2006.
[22] A. Rakotomamonjy, R. Flamary, and G. Gasso. DC proximal Newton for nonconvex optimization problems. IEEE Transactions on Neural Networks and Learning Systems, 27(3):636–647, 2015.
[23] A. Rakotomamonjy, G. Gasso, and J. Salmon. Screening rules for lasso with non-convex sparse regularizers. In ICML, volume 97, pages 5341–5350, 2019.
[24] R. T. Rockafellar and R. J.-B. Wets. Variational Analysis, volume 317. Springer Science & Business Media, 2009.
[25] S. K. Shevade and S. S. Keerthi. A simple and efficient algorithm for gene selection using sparse logistic regression. Bioinformatics, 19(17):2246–2253, 2003.
[26] E. Soubies, L. Blanc-Féraud, and G. Aubert. A unified view of exact continuous penalties for ℓ2-ℓ0 minimization. SIAM J. Optim., 27(3):2034–2060, 2017.
[27] R. Tibshirani. Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B Stat. Methodol., 58(1):267–288, 1996.
[28] R. Tibshirani, J. Bien, J. Friedman, T. J. Hastie, N. Simon, J. Taylor, and R. J. Tibshirani. Strong rules for discarding predictors in lasso-type problems. J. R. Stat. Soc. Ser. B Stat. Methodol., 74(2):245–266, 2012.
[29] S. Villa, S. Salzo, L. Baldassarre, and A. Verri. Accelerated and inexact forward-backward algorithms. SIAM Journal on Optimization, 23(3):1607–1633, 2013.
[30] S. V. N. Vishwanathan, A. J. Smola, and M. N. Murty. SimpleSVM. In ICML, pages 760–767, 2003.
[31] R. Wang, N. Xiu, and S. Zhou. Fast Newton method for sparse logistic regression. arXiv preprint arXiv:1901.02768, 2019.
[32] J. Ye and J. Liu. Sparse methods for biomedical data. ACM SIGKDD Explorations Newsletter, 14(1):4–15, 2012.
[33] C.-H. Zhang. Nearly unbiased variable selection under minimax concave penalty. Ann. Statist., 38(2):894–942, 2010.
[34] T. Zhang. Analysis of multi-stage convex relaxation for sparse regularization. Journal of Machine Learning Research, 11(Mar):1081–1107, 2010.

Supplementary material
Provably Convergent Working Set Algorithm for Non-Convex Regularized Regression
The maximum-violating constraint algorithm is a simple algorithm that solves at each iteration a sub-problem over a subset of variables and then adds the variables that most violate the condition $y - X \tilde{w}^\star_{A_k} \in C_{\bar{A}_k}$, where "the most" is measured in terms of the distance to each set $C_j$, with $j \in \bar{A}_k$. Hence, at each iteration, we compute all these distances, sort them in descending order, and add to the current working set the $n_k$ variables yielding the largest distances. The algorithm is presented below.

Algorithm 2
Maximum Violating Constraints Algorithm
Input: $\{X, y\}$, active set $A_1$, number $n_k$ of variables to add at iteration $k$, initial vector $\tilde{w}_{A_1}$
Output: $\tilde{w}_{A_k}$
for $k = 1, 2, \ldots$ do
  $w_{A_k} = \arg\min_{w} \frac{1}{2}\|y - X_{A_k} w\|^2 + \sum_{j \in A_k} r_\lambda(|w_j|)$ {warm-start solver with $w_{A_{k-1}}$}
  $r_k = y - X_{A_k} w_{A_k}$ {compute the current residual}
  $v = \operatorname{argsort}_j \operatorname{dist}(r_k, C_j)$ in descending order
  $A_{k+1} = v[0:n_k] \cup A_k$ {update working set by adding the $n_k$ most violating variables}
end for
Build $\tilde{w}_{A_k}$

Proposition. Given a working set $A_k$ and $w^\star_{A_k}$ solving the related restricted problem, $\tilde{w}^\star_{A_k}$ is also optimal for the full problem if and only if $\alpha_k = 1$ (which also means $s_{k+1} = r_k$).

Proof. Assume that $w^\star_{A_k}$ and $\tilde{w}^\star_{A_k}$ are optimal respectively for the restricted and the full problem, and let us show that in this case $\alpha_k = 1$. Since $\tilde{w}^\star_{A_k}$ is optimal for the full problem, we have $|x_j^\top (y - X \tilde{w}^\star_{A_k})| \le r_\lambda'(0)$ for all $j \in \bar{A}_k$, which yields the equivalent statements $y - X \tilde{w}^\star_{A_k} \in C \Leftrightarrow y - X_{A_k} w^\star_{A_k} \in C \Leftrightarrow r_k \in C$, and thus $\alpha_k = 1$.
Conversely, assume that $\alpha_k = 1$ and let us show that $\tilde{w}^\star_{A_k}$ is optimal for the full problem. Since $\alpha_k = 1$, we have $s_{k+1} = r_k$ and thus $r_k \in C$. The latter means that $|x_j^\top (y - X_{A_k} w^\star_{A_k})| \le r_\lambda'(0)$ for all $j \in \bar{A}_k$, and thus $|x_j^\top (y - X \tilde{w}^\star_{A_k})| \le r_\lambda'(0)$ for all $j \in \bar{A}_k$. Given this last property and the definition of $\tilde{w}^\star_{A_k}$ from $w^\star_{A_k}$, we conclude that $\tilde{w}^\star_{A_k}$ is optimal for the full problem.
The proof follows similar steps as those given by Johnson and Guestrin [16].
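The selection rule of Algorithm 2 can be sketched as follows, assuming each constraint set is the slab $C_j = \{z : |x_j^\top z| \le t\}$ with $t = r_\lambda'(0)$, for which the Euclidean distance admits the closed form $\max(|x_j^\top r| - t, 0)/\|x_j\|$; the function name and interface are hypothetical:

```python
import numpy as np

def maxvc_select(X, r, active, t, n_k):
    """Hypothetical helper: add the n_k most violating inactive variables,
    assuming slab constraints C_j = {z : |x_j^T z| <= t}."""
    d = X.shape[1]
    inactive = np.setdiff1d(np.arange(d), active)
    corr = np.abs(X[:, inactive].T @ r)              # |x_j^T r| for inactive j
    norms = np.linalg.norm(X[:, inactive], axis=0)
    dist = np.maximum(corr - t, 0.0) / norms         # distance of r to each slab
    order = np.argsort(dist)[::-1]                   # sort in descending order
    return np.union1d(active, inactive[order[:n_k]]) # grow the working set
```

In the full algorithm this selection step alternates with warm-started inner solves restricted to the current working set.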
Lemma 1.
At step $k$, consider a constraint $C_j$ such that $h_j(r_k) > 0$. Then
$$\operatorname{dist}(r_k, C_j) \ge \frac{1 - \alpha_k}{\alpha_k} \tau_k. \qquad (10)$$
Proof. Denote by $j$ the index of a function such that $h_j(r_k) > 0$ and let $z_k \in \{z \in \mathbb{R}^n : h_j(z) = 0\}$ be the point of $C_j$ closest to $r_k$. Since $s_{k+1} = \alpha_k r_k + (1 - \alpha_k) s_k$, i.e., $r_k = \frac{1}{\alpha_k}\big(s_{k+1} - (1 - \alpha_k) s_k\big)$, the following equalities hold:
$$\operatorname{dist}(r_k, C_j) = \|z_k - r_k\| = \Big\| z_k - \tfrac{1}{\alpha_k}\big(s_{k+1} - (1 - \alpha_k) s_k\big) \Big\| = \Big\| -z_k + \tfrac{1}{\alpha_k} s_{k+1} - \tfrac{1 - \alpha_k}{\alpha_k} s_k \Big\| = \frac{1 - \alpha_k}{\alpha_k} \Big\| -\tfrac{\alpha_k}{1 - \alpha_k} z_k + \tfrac{1}{1 - \alpha_k} s_{k+1} - s_k \Big\|. \qquad (11)$$
By construction, $h_j(z_k) = 0$ since $z_k$ minimizes the distance, and $h_j(s_{k+1}) = 0$ since $j$ has been chosen as the index of the constraint preventing $s_{k+1}$ from moving further towards $r_k$ within $C$. The set $\{z : h_j(z) \le 0\}$ is convex, and the coefficients $-\frac{\alpha_k}{1 - \alpha_k}$ and $\frac{1}{1 - \alpha_k}$ sum to one without both being non-negative, so $u = -\frac{\alpha_k}{1 - \alpha_k} z_k + \frac{1}{1 - \alpha_k} s_{k+1}$ is an affine extrapolation rather than a convex combination of $z_k$ and $s_{k+1}$; writing $s_{k+1} = \alpha_k z_k + (1 - \alpha_k) u$ and using the convexity of $h_j$ gives $h_j(u) \ge 0$. On the other hand, $s_k \in C_j$ by construction, and moreover $\operatorname{dist}_S(s_k, C_j^=) \ge \tau_k$: indeed, since $h_j(r_k) > 0$, we have $j \notin A_k$, because by construction $r_k \in C_{A_k}$ ($w_{A_k}$ has been optimized over $A_k$), and $j \notin A_k$ implies $\operatorname{dist}_S(s_k, C_j^=) \ge \tau_k$. Now, as $h_j(u) \ge 0$ while $s_k$ lies at distance at least $\tau_k$ from the boundary $C_j^=$, the norm in (11), which equals $\|u - s_k\|$, is lower bounded by $\tau_k$, and hence
$$\operatorname{dist}(r_k, C_j) \ge \frac{1 - \alpha_k}{\alpha_k} \tau_k.$$
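As a quick numerical sanity check of bound (10), consider a single slab constraint with hand-picked points (a toy setting, not the paper's construction):

```python
import numpy as np

# Toy check of dist(r_k, C_j) >= (1 - alpha_k)/alpha_k * tau_k for a slab
# C = {z : |a^T z| <= t}; all values below are hand-picked for illustration.
a = np.array([1.0, 0.0])
t = 1.0
s_k = np.zeros(2)                      # dual point inside C, at margin tau from the boundary
r_k = np.array([3.0, 0.0])             # residual outside C

# Largest alpha such that s_k + alpha * (r_k - s_k) stays in C (here s_k = 0)
alpha = t / abs(a @ (r_k - s_k))       # -> 1/3
tau = (t - abs(a @ s_k)) / np.linalg.norm(a)           # distance of s_k to the boundary
dist = max(abs(a @ r_k) - t, 0.0) / np.linalg.norm(a)  # distance of r_k to the slab
bound = (1 - alpha) / alpha * tau
assert dist >= bound - 1e-12           # bound (10) holds (with equality here)
```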