Newton Method for Sparse Logistic Regression: Quadratic Convergence and Extensive Simulations
Rui Wang [email protected]
Department of Applied Mathematics, Beijing Jiaotong University, Beijing, P. R. China
Naihua Xiu [email protected]
Department of Applied Mathematics, Beijing Jiaotong University, Beijing, P. R. China
Shenglong Zhou [email protected]
School of Mathematics, University of Southampton, Southampton, UK
Abstract
Sparse logistic regression has developed tremendously over the last two decades, from its origin, the $\ell_1$-regularized version of Tibshirani (1996), to the sparsity-constrained models of Bahmani, Raj, and Boufounos (2013) and Plan and Vershynin (2013). This paper studies the sparsity-constrained logistic regression through the classical Newton method. We begin by analysing its first-order optimality condition to acquire a strong $\tau$-stationary point for some $\tau > 0$. This point enables us to derive an equivalent stationary equation system that can be solved efficiently by the Newton method. The proposed method, NSLR (an abbreviation for Newton method for sparse logistic regression), enjoys a very low computational complexity, a local quadratic convergence rate, and termination within finitely many steps. Numerical experiments on random and real data demonstrate its superior performance against seven state-of-the-art solvers.
Keywords:
Sparsity constrained logistic regression, Strong $\tau$-stationary point, Stationary equation, Newton method, Quadratic convergence rate
1. Introduction
Logistic regression is a well-known and effective tool for classification, with extensive applications ranging from machine learning, data mining, pattern recognition and medical science to statistics. It describes the relation between a sample $x$ and its associated binary response/label $y \in \{0, 1\}$ through the conditional probability
$$\Pr\{y \mid x, z\} = \frac{e^{y\langle x, z\rangle}}{1 + e^{\langle x, z\rangle}},$$
where $\Pr\{y \mid x, z\}$ is the conditional probability of the label $y$, given the sample $x$ and a parameter vector $z$. To find the maximum likelihood estimate of the parameter $z$, a set of $n$ independent and identically distributed samples $(x_i, y_i)$, $i = 1, \ldots, n$, is first drawn, where $x_i \in \mathbb{R}^p$ and $y_i \in \{0, 1\}$, yielding a joint likelihood of the parameter/classifier $z$ of interest. The maximum likelihood estimate is then obtained by addressing the classical logistic regression model,
$$\min_{z \in \mathbb{R}^p}\ \ell(z) := \frac{1}{n}\sum_{i=1}^{n}\Big\{\ln(1 + e^{\langle x_i, z\rangle}) - y_i\langle x_i, z\rangle\Big\}. \qquad (1)$$
This model usually performs well when the number $n$ of samples is larger than the number $p$ of features, because it is strictly convex and thus admits a unique optimal solution provided that the matrix $[x_1, x_2, \cdots, x_n]$ has full row rank. However, when $n < (\ll)\ p$, it usually suffers from over-fitting: the classifier obtained through (1) fits the model well on the training data (making the loss sufficiently small) but may behave poorly on unseen data.

Unfortunately, the case $n < (\ll)\ p$ occurs often in real applications. For instance, each gene expression data sample comprises thousands of genes, whilst common medical equipment can only obtain very limited samples. In image processing, each image contains a large number of pixels, far more than the number of observed images. Fortunately, despite the numerous features in those data, only a small portion is of importance. For example, apart from the classification task, micro-array experiments also attempt to identify a small set of informative genes (to distinguish tumour from normal tissue) in each gene expression data set. The purpose is to remove the irrelevant genes so as to simplify the inference. This naturally gives rise to the so-called sparse logistic regression.

Sparse logistic regression originated from the $\ell_1$-regularized logistic regression proposed by Tibshirani (1996),
$$\min_{z \in \mathbb{R}^p}\ \ell(z) + \nu\|z\|_1, \qquad (2)$$
where $\|z\|_1$ is the $\ell_1$-norm and $\nu > 0$. With the help of $\ell_1$-regularization, this model is capable of rendering a sparse solution, which allows us to capture the important features among all of them. A vector is called sparse if only a few of its entries are non-zero and the rest are zero. With the advances in sparse optimization over the past decade, (2) has been extensively extended to the following general model,
$$\min_{z \in \mathbb{R}^p}\ f_\phi(z) := \ell(z) + \phi_\nu(z), \qquad (3)$$
where the regularization function $\phi_\nu(z): \mathbb{R}^p \to \mathbb{R}$ is designed to pursue a sparse solution. Methods associated with model (3) are treated as relaxation/regularization methods from the perspective of optimization. They share the advantage that (3) is unconstrained, making most generic optimization methods for non-differentiable problems tractable.

An alternative is to consider logistic regression with a sparsity constraint, which was first studied by Bahmani, Raj, and Boufounos (2013) and Plan and Vershynin (2013) separately and then well investigated by Wang et al. (2017). The model they considered is
$$\min_{z \in \mathbb{R}^p}\ \ell(z), \quad \mathrm{s.t.}\ z \in S := \{z \in \mathbb{R}^p : \|z\|_0 \le s\}, \qquad (4)$$
where $S$ is the sparse set with $s \ll p$, and $\|z\|_0$ is the $\ell_0$ pseudo-norm of $z$, counting the number of nonzero elements of $z$. Apparently, this model suffers from the discreteness of the constraint set, which makes solving it NP-hard. However, the general case of this model, where $\ell(z)$ is replaced by a general function, has been extensively explored (see Beck and Hallak, 2015; Pan et al., 2015) since it was first introduced by Bahmani, Raj, and Boufounos (2013) and Beck and Eldar (2013). In statistics, model (4) with the logistic loss $\ell$ replaced by the squared-norm loss of linear regression is the so-called best subset selection (see Hastie et al., 2017; Mazumder et al., 2017; Hazimeh and Mazumder, 2018; Xie and Deng, 2018). These studies have presented fruitful results, which not only show that the sparsity-constrained model possesses many advantages over the regularized model, such as being parameter-free, ease of controlling the sparsity and low computational complexity in terms of algorithmic design, but also provide a series of tractable tools to address this NP-hard problem. Therefore, the work in this paper is carried out along the lines of this model.

Since a large number of methods have been proposed to deal with sparse optimization problems that contain the SLR as a special case, far beyond our scope of review, we only focus on those directly processing (3) and (4).

For the regularization model (3), different penalty functions $\phi_\nu$ yield different methods, which can be summarized in two categories based on the convexity of $\phi_\nu$.

Convex regularizations are mainly associated with the usage of the $\ell_1$ norm.

• $\phi_\nu(z) = \nu\|z\|_1$. The expectation maximization method (EM, see Figueiredo, 2003; Krishnapuram et al., 2005), at each iteration, finds a majorization/upper-bound function $F_\phi$ of $f_\phi$ and then generates a Newton step to minimize $F_\phi$; Andrew and Gao (2007) proposed an Orthant-Wise Limited-memory Quasi-Newton method (OWL-QN) where BFGS Quasi-Newton steps are carried out to minimize a quadratic approximation of $f_\phi$ at each iteration; Koh et al. (2007) took the interior-point method into account to form a truncated Newton interior-point method (TNIP); Shi et al. (2010) created a hybrid iterative shrinkage method (HIS) comprising two phases: iterative shrinkage (a fixed-point method) and an interior-point scheme; Yu et al. (2010) adopted OWL-QN but with their own direction-finding procedure to derive an improved OWL-QN*; Yuan et al. (2010) exploited the coordinate descent method with each coordinate being updated by a one-dimensional Newton direction (CDN).

• $\phi_\nu(z) = \nu_1\|z\|_1 + \nu_2\|z\|_2^2$, where $\nu = [\nu_1, \nu_2] > 0$ and $\|z\|_2$ is the $\ell_2$ norm. For this penalty, Liu et al. (2009b) created the well-known package SLEP, which mainly uses function values and gradients; Friedman et al. (2010) found a quadratic approximation $L$ of $\ell$ and applied the coordinate descent method to solve $L + \phi_\nu$. They packed the method into a solver GLMNET, which was then improved by Yuan et al. (2012) to obtain the improved GLMNET.

• $\phi_\nu(z) = \nu\|z\|_2^2 + \delta_{\|z\|_1 \le t}(z)$, where $t > 0$ and $\delta_{\|z\|_1 \le t}(z) = 0$ if $\|z\|_1 \le t$ and $+\infty$ otherwise. Model (3) with this penalty can be solved by the first-order method Lassplore (Liu et al., 2009a) or by SLEP (Liu et al., 2009b). When $\nu = 0$, namely, for the $\ell_1$-constrained logistic regression,
$$\min_{z \in \mathbb{R}^p}\ \ell(z), \quad \mathrm{s.t.}\ \|z\|_1 \le t, \qquad (5)$$
Lee et al. (2006) developed the IRLS-LARS scheme, where LARS (see Efron et al., 2004) is first introduced to address an $\ell_1$-constrained least squares problem and a direction analogous to the Newton direction is then chosen to produce the next iterate.

Nonconvex regularizations differ slightly.

• SCAD: Fan and Li (2001) first took advantage of the smoothly clipped absolute deviation (SCAD) as the penalty.

• One-step LLA: Zou and Li (2008) developed a one-step local linear approximation (LLA) algorithm in which, at each step, the LLA estimator is applied to adopt a sparse representation.

• Group Bridge: Huang et al. (2009) benefited from a group bridge for multiple regression problems where the covariates are grouped.

• GIST: Gong et al. (2013) designed a general iterative shrinkage and thresholding algorithm (GIST), which has the ability to process a group of nonconvex regularizations such as SCAD, the log-sum penalty (LSP, Candes et al., 2008), the capped-$\ell_1$ penalty (Zhang, 2010) and the minimax concave penalty (MCP, Zhang et al., 2010), to name a few.

• APG: Li and Lin (2015) proposed the accelerated proximal gradient method (APG) to deal with (3) with the capped-$\ell_1$ penalty.

• HONOR: Gong and Ye (2015) proposed an efficient hybrid optimization algorithm for non-convex regularized problems (HONOR), in which they effectively integrated a Quasi-Newton step and a gradient descent step.

• DC-PN: Rakotomamonjy et al. (2016) combined the idea of the difference of two convex functions and the proximal Newton optimization scheme (DC-PN) to solve (3). The regularization used in the model is a difference of two convex functions that equals the capped-$\ell_1$ penalty.

For the sparsity-constrained model (4), since its introduction by Bahmani et al. (2013) and Beck and Eldar (2013), various approaches have been proposed.

• GraSP: Bahmani et al. (2013) generalized compressive sampling matching pursuit (CoSaMP, see Needell and Tropp, 2009) to obtain the gradient support pursuit (GraSP). The latter reduces to the former when the logistic regression is replaced by linear regression. Notice that apart from solving (4), GraSP is also able to address the model
$$\min\ \ell(z) + (\nu/2)\|z\|_2^2, \quad \mathrm{s.t.}\ z \in S. \qquad (6)$$

• Logit-GOMP: Lozano et al. (2011) applied orthogonal matching pursuit (OMP, see Mallat and Zhang, 1993) to model (4), where the data is assumed to be classified into several groups, to develop a group OMP method (Logit-GOMP).

• PD: Lu and Zhang (2013) designed a penalty decomposition (PD) method in which a sequence of penalty subproblems is solved by a block coordinate descent method.

• GraHTP: Inspired by greedy algorithms for compressed sensing (Donoho, 2006) such as iterative hard thresholding (IHT, see Blumensath and Davies, 2009) and hard thresholding pursuit (HTP, see Foucart, 2011), Yuan et al. (2014, 2018) designed the gradient HTP methods GraHTP and FGraHTP.

• NTGP: Yuan and Liu (2014, 2017) benefited from the Newton method to establish the Newton greedy pursuit methods NTGP and QNTGP to solve model (6).

• IIHT: Pan et al. (2017) followed the IHT method of Blumensath and Davies (2009) and built a convergent improved IHT method (IIHT).

• GPGN: Wang et al. (2017) combined the projected gradient method and the Newton method to derive a greedy projected gradient-Newton method (GPGN). The Newton steps are only applied on a series of chosen subspaces.

• FNHTP: Chen and Gu (2017) proposed a fast Newton hard thresholding pursuit (FNHTP) that uses an unbiased stochastic estimator of the inverse Hessian matrix.

Notice that many of the above-mentioned algorithms involve the Newton method, a second-order method, and they can be summarized into three groups. The first group comprises EM, OWL-QN, OWL-QN*, CDN, TNIP, HIS and HONOR. Those algorithms adopt the Newton method to solve an unconstrained optimization problem. Most of them derive Newton directions by solving a linear equation system with at least $p$ variables and $p$ equations. The second group contains NTGP and GPGN, which perform Newton steps on chosen subspaces. The former updates the iterate at each step by processing a subproblem, namely $z^{k+1} = \mathrm{argmin}_{z\in S}\, Q(z; z^k)$, where $Q(z; z^k)$ is a local quadratic approximation of $\ell(z)$ at the current iterate $z^k$; this subproblem is then solved by the Newton method. By contrast, GPGN benefits from IHT to update the iterate and only imposes a Newton step on a subspace when two consecutive iterates share the same support set. The third group seeks an approximation to the inverse of the whole Hessian matrix, such as the unbiased stochastic Hessian estimator used in FNHTP, and hence has low computational complexity compared with the methods in group one. Inspired by those well-established works, we make use of the Newton method to solve (4) as well. However, the way we use this method differs from that of all of the above algorithms.
As mentioned above, in this paper we focus on model (4) via the classical Newton method. The main contributions are summarized as follows.

i) We start by considering the optimality condition of model (4), introducing a strong $\tau$-stationary point (see Definition 3 for more details), which turns out to be a local (even a global, under some circumstances) minimizer of problem (4) by Theorem 6. More importantly, a strong $\tau$-stationary point gives rise to the stationary equation (20), a key concept in this paper that makes the classical Newton method applicable.

ii) Differing from all of the above-mentioned algorithms, we apply the Newton method directly to the stationary equation, one kind of optimality condition of problem (4). Thanks to the nice properties of the stationary equation, our proposed Newton method, dubbed NSLR (an abbreviation for Newton method for the SLR), has a simple framework (see Table 1), making its implementation easy, and has a low computational complexity of $O(s^3 + s^2 n + np)$ per iteration. One reason for this low computational complexity is that we only solve a small linear equation system with $s$ variables and $s$ equations to update the Newton direction.

iii) In spite of the many well-established convergence results of the Newton method for continuous optimization, it is not trivial to carry out the convergence analysis of the Newton method for (4), a combinatorial optimization problem. We show that the generated sequence converges locally quadratically and that NSLR terminates within finitely many steps, whose number can be estimated as $O(\log_2(c/\sqrt{\epsilon}))$ (see Theorem 12 iii) for details), provided the chosen initial point is sufficiently close to a strong $\tau$-stationary point, where $c$ is a constant associated with the samples and the initial point, and $\epsilon$ is the stopping tolerance of NSLR.

iv) Finally, the efficiency of NSLR is demonstrated against seven state-of-the-art methods on a number of randomly generated and real data sets. Its fitting accuracy and computational speed are very competitive. In particular, in the high-dimensional data setting, NSLR outperforms the others in terms of computational time.

Now we would like to highlight the differences between the above-mentioned second-order methods and our proposed method. Firstly, NSLR exploits the Newton method to solve the stationary equation of the original problem (4) itself, which clearly differs from the methods in group one, which aim at solving unconstrained optimizations. Most of those methods are typically computationally slow because they address a large linear equation system at each step, with complexity roughly $O(p^3)$. By contrast, NSLR only tackles a relatively small linear equation system with complexity roughly $O(s^3)$. Secondly, unlike NTGP, which uses the Newton method to solve quadratic subproblems, no subproblems are considered in NSLR. Moreover, our proposed method enjoys a local quadratic convergence rate and terminates within finitely many steps. Compared with GPGN, which combines IHT procedures and Newton directions, NSLR always benefits from the Newton direction. Finally, differing from FNHTP, which seeks an unbiased stochastic Hessian to approximate the inverse of the whole Hessian matrix, NSLR only concentrates on a small sub-matrix of the whole Hessian, which results in a small-scale linear equation to be solved.
This paper is organized as follows. To explore the optimality conditions of (4), the next section introduces a strong $\tau$-stationary point (see Definition 3) and establishes its relationship with a local/global minimizer (see Theorem 6). This relation allows us to solve a stationary equation system (20) to acquire a local/global optimal solution of (4). Based on the stationary equation, Section 3 proposes our method NSLR, an abbreviation for Newton method for SLR, which turns out to have a simple algorithmic framework and low computational complexity. In Section 4, we establish the convergence results of the sequence generated by NSLR, including local quadratic convergence and termination within finitely many steps. In Section 5, the superior performance of NSLR is demonstrated against some state-of-the-art solvers on randomly generated and real data sets in high-dimensional scenarios. Concluding remarks are made in the last section.

To end this section, we define some notation. Let $[x_1, x_2, \cdots, x_n] =: X^\top \in \mathbb{R}^{p\times n}$ be the sample matrix and $y = (y_1, y_2, \cdots, y_n)^\top \in \mathbb{R}^n$ be the response vector. Denote $\mathbb{N}_p := \{1, 2, \cdots, p\}$. Let $|T|$ be the number of elements of $T \subseteq \mathbb{N}_p$ and $\overline{T} := \mathbb{N}_p \setminus T$ be the complement of $T$. Write $z_T \in \mathbb{R}^{|T|}$ (resp. $X_T$) for the sub-vector (resp. sub-matrix) of $z$ (resp. $X$) containing the elements (resp. columns) indexed by $T$. Similarly, $X_{I,T}$ is the sub-matrix containing the rows indexed by $I$ and the columns indexed by $T$. Let $0_{ab}$ stand for an $a\times b$ zero matrix and $I_a$ be an $a\times a$ identity matrix. For notational convenience, we write $0$ for any $0_{ab}$ if there is no ambiguity in the context, and we combine two vectors as $(x; y) := (x^\top\ y^\top)^\top$. We denote by $[z]^{\downarrow}_i$ the $i$-th largest (in absolute value) element of $z$. Let $\|\cdot\|$ denote the Frobenius norm for a matrix and the Euclidean norm for a vector, and let $\|\cdot\|_2$ denote the spectral norm for a matrix. Furthermore, $\lambda_{\min}(A)$ and $\lambda_{\max}(A)$ are respectively the minimal and maximal eigenvalues of a symmetric matrix $A$. Denote by $\mathbf{1}$ the vector with all entries equal to one.
2. Optimality and Stationary Equation
This section is devoted to investigating the optimality conditions and a stationary equation of (4). We first summarize some properties of its objective function $\ell(z)$, based on Wang et al. (2017), that will be useful throughout this paper.

Property 1 (Lemma 2.2-2.4, Lemma A.3, Wang et al., 2017) The objective function $\ell(z)$ of (4) is twice continuously differentiable and has the following basic properties:

i) The logistic loss function $\ell(z)$ is nonnegative, convex and strongly smooth on $\mathbb{R}^p$ with parameter $\lambda_x := \lambda_{\max}(X^\top X)/(4n)$, namely, for any $z, z^* \in \mathbb{R}^p$,
$$\ell(z) \le \ell(z^*) + \langle\nabla\ell(z^*), z - z^*\rangle + (\lambda_x/2)\|z - z^*\|^2. \qquad (7)$$

ii) The gradient $\nabla\ell(z)$ is given by
$$\nabla\ell(z) = X^\top(h(z) - y)/n, \qquad (8)$$
where $h(z)$ is the vector with $(h(z))_i = 1/(1 + e^{-\langle x_i, z\rangle})$, $i = 1, \ldots, n$. It satisfies
$$\|\nabla\ell(z) - \nabla\ell(z^*)\| \le \lambda_x\|z - z^*\|. \qquad (9)$$

iii) The Hessian matrix $\nabla^2\ell(z)$ is given by
$$\nabla^2\ell(z) = X^\top D(z)X/n, \qquad (10)$$
where $D(z)$ is the diagonal matrix with
$$(D(z))_{ii} = \frac{e^{\langle x_i, z\rangle}}{(1 + e^{\langle x_i, z\rangle})^2} \in (0, 1/4], \quad i = 1, \ldots, n.$$

iv) For any $z_1, z_2 \in \mathbb{R}^p$,
$$\|\nabla^2\ell(z_1) - \nabla^2\ell(z_2)\|_2 \le \gamma_x\|z_1 - z_2\|, \qquad (11)$$
where $\gamma_x := \frac{1}{2}\lambda_x\max_{i=1,\ldots,n}\|x_i\|$.
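For concreteness, the quantities in Property 1 can be evaluated directly. The following NumPy sketch (ours; the paper's own experiments use MATLAB, and the function names here are illustrative only) computes $\ell(z)$, $\nabla\ell(z)$ and $\nabla^2\ell(z)$ exactly as in (1), (8) and (10):

```python
import numpy as np

def logistic_loss(X, y, z):
    # l(z) = (1/n) * sum_i [ ln(1 + exp(<x_i, z>)) - y_i * <x_i, z> ]   -- model (1)
    t = X @ z
    return np.mean(np.logaddexp(0.0, t) - y * t)   # logaddexp(0, t) = ln(1 + e^t), stable

def logistic_grad(X, y, z):
    # grad l(z) = X^T (h(z) - y) / n, with h(z)_i = 1/(1 + exp(-<x_i, z>))   -- (8)
    h = 1.0 / (1.0 + np.exp(-(X @ z)))
    return X.T @ (h - y) / X.shape[0]

def logistic_hess(X, z):
    # Hessian = X^T D(z) X / n, with D(z)_ii = e^{t_i}/(1 + e^{t_i})^2 in (0, 1/4]   -- (10)
    h = 1.0 / (1.0 + np.exp(-(X @ z)))
    return (X.T * (h * (1.0 - h))) @ X / X.shape[0]
```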
Property 2 For (4), denoting $c := 2y - \mathbf{1}$, we have the results below.

i) $0 \le \min_{z\in S}\ell(z) \le (\ln(2) - 1/4) + \min_{z\in S}\|Xz - c\|^2/(4n)$.

ii) If $X^\top c \ne 0$, then there always exists a subset $T \subset \mathbb{N}_p$ with $r := |T| = \mathrm{rank}(X_T) \le s$ that satisfies $X_T^\top c \ne 0$. For such $T$, let the singular value decomposition of $X_T$ be $X_T = U\Lambda V^\top$, where $U \in \mathbb{R}^{n\times r}$ and $V \in \mathbb{R}^{r\times r}$ are column-orthogonal matrices and $\Lambda \in \mathbb{R}^{r\times r}$ is a diagonal matrix; then there is a $\bar z \in \mathbb{R}^p$ with $\bar z_T = V\Lambda^{-1}U^\top c$ and $\bar z_{\overline T} = 0$ such that $\bar z \in S$ and $\ell(\bar z) < \ell(0)$. If $X^\top c = 0$, then $0$ is a global minimizer of (4) and $0 \in \mathrm{argmin}_{z\in S}\|Xz - c\|^2$.

The relation between logistic regression and linear regression in Property 2 i) provides a hint for initializing a starting point (i.e., $z^0 \in \mathrm{argmin}_{z\in S}\|Xz - c\|^2$) in terms of algorithmic design. Property 2 ii) states that there are many non-trivial feasible solutions whose loss function values are strictly less than $\ell(0)$ if $X^\top c \ne 0$. Otherwise, $0$ is the global solution of (4), and meanwhile $0 \in \mathrm{argmin}_{z\in S}\|Xz - c\|^2$ due to $X^\top(X0 - c) = 0$, which means the second inequality of i) holds with equality.

To characterize the solutions of (4), we introduce the definition of a strong $\tau$-stationary point. To proceed, we first give the concept of the projection of a vector $z \in \mathbb{R}^p$ onto the sparse set $S$:
$$P_S(z) := \mathrm{argmin}\,\{\|z - x\| : x \in S\},$$
which sets all but the $s$ largest (in absolute value) components of $z$ to zero. Since the right-hand side may have multiple solutions, $P_S(z)$ is a set; for the sake of notational convenience, we write $P_S(z) = x$ when $P_S(z) = \{x\}$ is a singleton.

Definition 3 (Lu, 2015, Definition 3.2) A vector $z \in S$ is called a strong $\tau$-stationary point of (4) if there is a $\tau > 0$ such that
$$z = P_S(z - \tau\nabla\ell(z)). \qquad (12)$$

Recall that a $\tau$-stationary point is defined by Beck and Eldar (2013) through
$$z \in P_S(z - \tau\nabla\ell(z)).$$
Lemma 2.2 in (Beck and Eldar, 2013) tells us that $z$ is a $\tau$-stationary point if and only if $\|z\|_0 \le s$ and
$$|\nabla_i\ell(z)| \begin{cases} = 0, & i \in \mathrm{supp}(z), \\ \le [z]^{\downarrow}_s/\tau, & i \notin \mathrm{supp}(z). \end{cases} \qquad (13)$$
Similarly, we have the following properties of a strong $\tau$-stationary point.

Property 4 For a given $\tau > 0$, $z \in \mathbb{R}^p$ is a strong $\tau$-stationary point if and only if

i) $\|z\|_0 < s$ and $\nabla\ell(z) = 0$, or

ii) $\|z\|_0 = s$ and
$$|\nabla_i\ell(z)| \begin{cases} = 0, & i \in \mathrm{supp}(z), \\ < [z]^{\downarrow}_s/\tau, & i \notin \mathrm{supp}(z). \end{cases} \qquad (14)$$

The proof is similar to that of Lemma 2.2 by Beck and Eldar (2013) and is thus omitted here. We now emphasize the relationship between these two kinds of points.
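To make the projection $P_S$ and the stationarity test (12)/(14) concrete, here is a small NumPy sketch of ours (not part of the authors' code; note that ties in the $s$-th largest magnitude make $P_S$ set-valued, and this routine picks one element):

```python
import numpy as np

def project_sparse(z, s):
    # P_S(z): keep the s largest-in-magnitude entries of z, zero out the rest.
    out = np.zeros_like(z)
    keep = np.argpartition(np.abs(z), -s)[-s:]   # one selection among possible ties
    out[keep] = z[keep]
    return out

def is_strong_tau_stationary(z, grad, s, tau, tol=1e-10):
    # Tests the fixed-point condition z = P_S(z - tau * grad) of Definition 3.
    return np.linalg.norm(z - project_sparse(z - tau * grad, s)) <= tol
```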
Remark 5 The relationship between a strong $\tau$-stationary point and a $\tau$-stationary point can be described as follows:

i) If $\|z\|_0 < s$, then $z$ is a $\tau$-stationary point if and only if it is a strong $\tau$-stationary point, because of $\nabla\ell(z) = 0$.

ii) If $\|z\|_0 = s$, then a strong $\tau$-stationary point $z$ is also a $\tau$-stationary point. In the meantime, a $\tau$-stationary point $z$ is also a strong $\tau'$-stationary point for any $0 < \tau' < \tau$, since $|\nabla_i\ell(z)| \le [z]^{\downarrow}_s/\tau < [z]^{\downarrow}_s/\tau'$.

We use one simple example to illustrate ii) of the above remark. Consider $s = 2$ together with a small data set $(X, y)$ and a point $z^*$ whose gradient $\nabla\ell(z^*)$ vanishes on $\mathrm{supp}(z^*)$. One can verify that $z^* \in P_S(z^* - 2\nabla\ell(z^*))$ and $z^* = P_S(z^* - \tau\nabla\ell(z^*))$ for any given $0 < \tau < 2$. This means $z^*$ is a $\tau$-stationary point for any given $0 < \tau \le 2$ but a strong $\tau$-stationary point only for any given $0 < \tau < 2$.

Based on the strong $\tau$-stationary point, our first main result establishes the relationship between a local/global optimal solution and a strong $\tau$-stationary point of (4).
Theorem 6 Considering (4), the following results hold.

i) A global minimizer $z^*$ is a strong $\tau$-stationary point for any given $\tau \in (0, 1/\lambda_x)$.

ii) If $\|z^*\|_0 = s$, a strong $\tau$-stationary point $z^*$ for a given $\tau > 0$ is a local minimizer. If it further satisfies $\nabla\ell(z^*) = 0$, then it is also a global minimizer.

iii) If $\|z^*\|_0 < s$, a strong $\tau$-stationary point $z^*$ for a given $\tau > 0$ is a global minimizer.

The above relationship allows us to focus on a strong $\tau$-stationary point itself to pursue a 'good' solution of (4). Rewrite (12) as
$$\begin{cases} z = P_S(z - \tau d), \\ d = \nabla\ell(z), \end{cases} \qquad (15)$$
which is equivalent to
$$\begin{bmatrix} z - P_S(z - \tau d) \\ d - \nabla\ell(z) \end{bmatrix} = 0. \qquad (16)$$
Recall that the projection $P_S(z)$ sets all but the $s$ largest (in absolute value) components of $z$ to zero. To express the solution of $P_S(z - \tau d)$ explicitly, we define two sets
$$T(u; \tau) := \{i \in \mathbb{N}_p : |z_i - \tau d_i| \ge [z - \tau d]^{\downarrow}_s\} \text{ with } |T(u; \tau)| = s, \qquad (17)$$
$$\overline{T}(u; \tau) := \{i \in \mathbb{N}_p : i \notin T(u; \tau)\}, \qquad (18)$$
where $u = (z; d) \in \mathbb{R}^{2p}$. Clearly, $T(u, \tau) \cap \overline{T}(u, \tau) = \emptyset$ and $T(u, \tau) \cup \overline{T}(u, \tau) = \mathbb{N}_p$. One may discern that $T(u; \tau)$ may not be unique, so we denote by $\mathcal{T}_B(u; \tau)$ the collection of all $T(u; \tau)$ in (17). In other words, $T(u; \tau)$ is a particular element of $\mathcal{T}_B(u; \tau)$, i.e.,
$$T(u; \tau) \in \mathcal{T}_B(u; \tau).$$
Hereafter, if no confusion arises, we use the simple notation
$$T := T(u; \tau), \qquad \overline{T} := \overline{T}(u; \tau).$$
One should keep in mind that $T$ is associated with $u$ and $\tau$; for notational convenience, we drop the dependence. Based on this notation, for any $T \in \mathcal{T}_B(u; \tau)$, the projection can be expressed as
$$P_S(z - \tau d) = \begin{bmatrix} (z - \tau d)_T \\ 0 \end{bmatrix}, \qquad (19)$$
and thus (16) implies that the stationary equation
$$F_\tau(u; T) := \begin{bmatrix} d_T \\ z_{\overline T} \\ d_T - \nabla_T\ell(z) \\ d_{\overline T} - \nabla_{\overline T}\ell(z) \end{bmatrix} = 0 \qquad (20)$$
holds for any $T \in \mathcal{T}_B(u; \tau)$. It is worth mentioning that if $T \notin \mathcal{T}_B(u; \tau)$, then (19) is not correct and hence (20) may not hold.

The idea of the stationary equation refers to the results in Huang et al. (2018), in which the authors concentrated on $\ell_0$-regularized linear regression. In this paper, we aim to solve logistic regression with a sparsity constraint, and we provide a different theoretical analysis as well as some novel convergence results (see Section 4) from the perspective of optimization. The following theorem confirms the equivalence between (16) and (20).

Theorem 7 $z$ is a strong $\tau$-stationary point for a given $\tau > 0$ if and only if $F_\tau(u; T) = 0$ for all $T \in \mathcal{T}_B(u; \tau)$, where $u = (z; d)$ is as denoted above.

For $u = (z; d)$ and $\tau > 0$, if a $T \in \mathcal{T}_B(u; \tau)$ is given, then the Jacobian matrix of $F_\tau(u; T)$ takes the form
$$\nabla F_\tau(u; T) = \begin{bmatrix} 0_{ss} & 0_{sr} & I_s & 0_{sr} \\ 0_{rs} & I_r & 0_{rs} & 0_{rr} \\ -\nabla^2_{T,T}\ell(z) & -\nabla^2_{T,\overline T}\ell(z) & I_s & 0_{sr} \\ -\nabla^2_{\overline T,T}\ell(z) & -\nabla^2_{\overline T,\overline T}\ell(z) & 0_{rs} & I_r \end{bmatrix}, \qquad (21)$$
where $r := p - s$ and $\nabla^2_{T,T}\ell(z) := (\nabla^2\ell(z))_{T,T}$. It is worth mentioning that since $T$ is related to $u$, the Jacobian of $F_\tau(u; T)$ may differ from (21) if we treat $T$ as unknown. However, we are not concerned with the latter case and only focus on the case where $T \in \mathcal{T}_B(u; \tau)$ is given when it comes to the algorithmic design. We next show that, for any given $T$, the choice does not affect the overall structure and non-singularity of the matrix $\nabla F_\tau(u; T)$. To see this, we introduce the so-called $s$-regularity of $X$.

Definition 8 (Definition 2.2, Beck and Eldar, 2013) A matrix is $s$-regular if any $s$ of its columns are linearly independent.

If $X$ is $s$-regular, then the following quantity is well defined:
$$\lambda := \min_{0\ne z\in S}\frac{z^\top X^\top Xz}{z^\top z} = \min_{|T|\le s}\lambda_{\min}(X_T^\top X_T) > 0. \qquad (22)$$

Theorem 9 Suppose the matrix $X$ is $s$-regular. Then for any given $T \in \mathcal{T}_B(u; \tau)$ with a given $\tau > 0$, $\nabla F_\tau(u; T)$ is nonsingular on $\mathbb{R}^{2p}$ and its inverse matrix has the form
$$(\nabla F_\tau(u; T))^{-1} = \begin{bmatrix} (\nabla^2_{T,T}\ell)^{-1} & -(\nabla^2_{T,T}\ell)^{-1}\nabla^2_{T,\overline T}\ell & -(\nabla^2_{T,T}\ell)^{-1} & 0 \\ 0 & I_r & 0 & 0 \\ I_s & 0 & 0 & 0 \\ (\nabla^2_{\overline T,T}\ell)(\nabla^2_{T,T}\ell)^{-1} & R(z) & -(\nabla^2_{\overline T,T}\ell)(\nabla^2_{T,T}\ell)^{-1} & I_r \end{bmatrix}, \qquad (23)$$
where $\ell := \ell(z)$ and $R(z) := -\nabla^2_{\overline T,T}\ell\,(\nabla^2_{T,T}\ell)^{-1}\nabla^2_{T,\overline T}\ell + \nabla^2_{\overline T,\overline T}\ell$. Moreover, for any $z^* \in \mathbb{R}^p$ and a given $\delta > 0$,
$$\|(\nabla^2_{T,T}\ell(z))^{-1}\|_2 \le \mu(X, z^*, \delta) := \frac{e^{2\sqrt{n\lambda_x}(\|z^*\|+\delta)}}{\lambda/(4n)}, \quad \forall z \in N(z^*, \delta), \qquad (24)$$
which means $\|(\nabla F_\tau(u; T))^{-1}\|_2$ is bounded on $N(u^*, \delta)$.
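The objects in (17) and (20) are easy to form explicitly. Below is a NumPy sketch of ours (with hypothetical helper names) that selects one $T \in \mathcal{T}_B(u; \tau)$ and evaluates the residual $F_\tau(u; T)$, whose norm later serves as the algorithm's stopping quantity:

```python
import numpy as np

def select_T(z, d, tau, s):
    # One T(u; tau) from (17): indices of the s largest entries of |z - tau*d|.
    return np.sort(np.argpartition(np.abs(z - tau * d), -s)[-s:])

def F_tau(z, d, grad, T):
    # Residual (20): stacks d_T, z on the complement of T, and d - grad l(z),
    # the last piece split over T and its complement as in the display.
    Tbar = np.setdiff1d(np.arange(z.size), T)
    return np.concatenate([d[T], z[Tbar], (d - grad)[T], (d - grad)[Tbar]])
```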
3. Fast Newton Method
In this section, we aim to design an efficient algorithm to solve the stationary equation (20). Let $u^k = (z^k; d^k)$ be the $k$-th iterate and fix $\tau > 0$. Similarly to (17) and (18), the $k$-th index sets are defined as
$$T_k := T(u^k; \tau), \qquad \overline{T}_k := \overline{T}(u^k; \tau).$$
By Theorem 9, $\nabla F_\tau(u^k; T_k)$ is nonsingular under the $s$-regularity of the matrix $X$. Therefore, we naturally think of the Newton method for solving the system of stationary equations (20). The classical Newton iteration reads
$$\nabla F_\tau(u^k; T_k)(u^{k+1} - u^k) = -F_\tau(u^k; T_k). \qquad (25)$$
From the explicit formula of $F_\tau(u^k; T_k)$ in (20), we have
$$d^{k+1}_{T_k} = 0, \qquad z^{k+1}_{\overline T_k} = 0,$$
$$\begin{bmatrix} \nabla^2_{T_k,T_k}\ell(z^k) & \nabla^2_{T_k,\overline T_k}\ell(z^k) \\ \nabla^2_{\overline T_k,T_k}\ell(z^k) & \nabla^2_{\overline T_k,\overline T_k}\ell(z^k) \end{bmatrix}\begin{bmatrix} z^{k+1}_{T_k} - z^k_{T_k} \\ -z^k_{\overline T_k} \end{bmatrix} - \begin{bmatrix} 0 \\ d^{k+1}_{\overline T_k} \end{bmatrix} = -\begin{bmatrix} \nabla_{T_k}\ell(z^k) \\ \nabla_{\overline T_k}\ell(z^k) \end{bmatrix}. \qquad (26)$$
The third equation in (26) yields
$$\nabla^2_{T_k,T_k}\ell(z^k)\,z^{k+1}_{T_k} - \nabla^2_{T_k,\cdot}\ell(z^k)\,z^k = -\nabla_{T_k}\ell(z^k),$$
where $\nabla^2_{T,\cdot}\ell(z)$ is the sub-matrix of $\nabla^2\ell(z)$ containing the rows indexed by $T$. This implies that $z^{k+1}_{T_k}$ is a solution of the following linear equation:
$$\nabla^2_{T_k,T_k}\ell(z^k)\,v = \nabla^2_{T_k,\cdot}\ell(z^k)\,z^k - \nabla_{T_k}\ell(z^k) = \nabla^2_{T_k,T_{k-1}}\ell(z^k)\,z^k_{T_{k-1}} - \nabla_{T_k}\ell(z^k), \qquad (27)$$
where the second equality holds because $\mathrm{supp}(z^k) \subseteq T_{k-1}$ from (26). Overall, we update $u^{k+1}$ through (26) by
$$z^{k+1}_{T_k} = v^k, \quad z^{k+1}_{\overline T_k} = 0, \quad d^{k+1}_{T_k} = 0, \quad d^{k+1}_{\overline T_k} = \nabla_{\overline T_k}\ell(z^k) + \nabla^2_{\overline T_k,\cdot}\ell(z^k)(z^{k+1} - z^k), \qquad (28)$$
where $v^k$ is a solution of (27). If $\nabla^2_{T_k,T_k}\ell(z^k)$ is nonsingular, (28) means we do not need to calculate the inverse of the whole matrix $\nabla F_\tau(u^k; T_k)$ to update $u^{k+1}$, because it can be done by computing the inverse of $\nabla^2_{T_k,T_k}\ell(z^k)$ to update $z^{k+1}_{T_k}$ as
$$z^{k+1}_{T_k} = v^k = z^k_{T_k} - \big[\nabla^2_{T_k,T_k}\ell(z^k)\big]^{-1}\big[\nabla_{T_k}\ell(z^k) - \nabla^2_{T_k,\overline T_k}\ell(z^k)\,z^k_{\overline T_k}\big]. \qquad (29)$$
Hence we propose the Newton method for (4) that iteratively processes the stationary equation (20). The algorithmic framework is summarized in the following table.
Table 1: Framework of the algorithm.

NSLR: Newton method for the SLR (4)

Step 0
    Initialize $z^0$, $d^0 = \nabla\ell(z^0)$. Choose $\epsilon, \tau > 0$. Set $k \Leftarrow 0$.
Step 1 (Support Set Selection)
    Choose $T_k := T(u^k; \tau) \in \mathcal{T}_B(u^k; \tau)$ by (17).
Step 2 (Convergence Check)
    If $\|F_\tau(u^k; T_k)\| \le \epsilon$, then stop. Otherwise, go to Step 3.
Step 3 (Full Newton)
    Update $u^{k+1} = (z^{k+1}; d^{k+1})$ by (28), set $k \Leftarrow k + 1$ and go to Step 1.
Regarding the proposed algorithm NSLR, we have some comments.

i) For Step 0, Property 2 gives us a clue for finding a starting point $z^0 \in \mathrm{argmin}_{z\in S}\|Xz - c\|^2$, which, however, would consume much time. Therefore, we simply use $z^0 = 0$. In addition, we will show that if the starting point $z^0$ is chosen sufficiently close to a strong $\tau$-stationary point $z^*$ with $\|z^*\|_0 = s$, all $T_k$ will be identical and, furthermore, NSLR will terminate within finitely many steps (see Theorem 12). Interestingly, in the numerical experiments, the starting point $z^0$ need not be close to a strong $\tau$-stationary point for the proposed method to converge fast. For simplicity, one can just start from the origin.

ii) For Step 1, we only pick the $s$ largest elements (in absolute value) to form $T_k$, which allows us to use the mink function in MATLAB (2017b or later), whose computational complexity is $O(p + s\log s)$.

iii) For Step 3, to update $z^{k+1}_{T_k}$, one needs to solve the linear equation (27). The complexity of calculating $\nabla^2_{T,T}\ell(z) = X_T^\top D(z)X_T/n$ with $|T| = s$ is $O(sn + s^2 n)$, since $D(z)$ is diagonal. The complexity of solving a linear equation with $s$ equations and $s$ variables is at most $O(s^3)$. Therefore, the whole complexity of updating $z^{k+1}_{T_k}$ is $O(s^3 + s^2 n)$. For updating $d^{k+1}_{\overline T_k}$, since $X(z^{k+1} - z^k) = X_{T_k\cup T_{k-1}}(z^{k+1} - z^k)_{T_k\cup T_{k-1}}$, its computational complexity is $O(2sn)$; combining this with $\nabla^2_{\overline T_k,\cdot}\ell(z^k)(z^{k+1} - z^k)$ in (28), the complexity of updating $d^{k+1}_{\overline T_k}$ is $O(np + ns)$. Overall, the computational complexity of Step 3 is
$$O(s^3 + s^2 n + np),$$
which means the computation is quite fast if $\max\{s, n\} \ll p$.

iv) It is worth mentioning that, theoretically, under the assumption that $X$ is $s$-regular (see Theorem 9), the linear equation (27) always admits a unique solution. Numerically, to guard against a singular case, one can calculate $[\nabla^2_{T_k,T_k}\ell(z^k) + \mu I]^{-1}$ instead, with a small positive $\mu$. In fact, if we consider model (6), namely $\ell$ replaced by $\ell + \mu\|\cdot\|^2$, then (27) always has a unique solution without any assumptions.
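Putting Steps 0-3 together, one possible NSLR implementation can be sketched in NumPy as follows (ours, under the update rules above; the original solver is a MATLAB package, and `mu` realizes the small-ridge safeguard of comment iv)):

```python
import numpy as np

def nslr(X, y, s, tau=1.0, eps=1e-6, max_iter=100, mu=1e-8):
    # A sketch of NSLR (Table 1) with a fixed tau; Section 5 describes an
    # adaptive update of tau. mu guards against a numerically singular
    # reduced Hessian, cf. comment iv).
    n, p = X.shape
    z = np.zeros(p)                                    # Step 0: z^0 = 0
    h = 1.0 / (1.0 + np.exp(-(X @ z)))
    grad = X.T @ (h - y) / n
    d = grad.copy()                                    # d^0 = grad l(z^0)
    for _ in range(max_iter):
        # Step 1: T_k = indices of the s largest entries of |z - tau*d|   -- (17)
        T = np.sort(np.argpartition(np.abs(z - tau * d), -s)[-s:])
        Tbar = np.setdiff1d(np.arange(p), T)
        # Step 2: stop once ||F_tau(u^k; T_k)|| <= eps                    -- (20)
        if np.linalg.norm(np.concatenate([d[T], z[Tbar], d - grad])) <= eps:
            break
        # Step 3: solve the s x s reduced Newton system on the support T  -- (29)
        D = h * (1.0 - h)                              # diagonal of D(z^k)
        XT = X[:, T]
        H_TT = (XT.T * D) @ XT / n + mu * np.eye(s)
        rhs = grad[T] - (XT.T * D) @ (X[:, Tbar] @ z[Tbar]) / n
        z_new = np.zeros(p)
        z_new[T] = z[T] - np.linalg.solve(H_TT, rhs)
        # update d^{k+1} by (28): zero on T_k, linearized gradient elsewhere
        d = np.zeros(p)
        d[Tbar] = grad[Tbar] + (X[:, Tbar].T @ (D * (X @ (z_new - z)))) / n
        z = z_new
        h = 1.0 / (1.0 + np.exp(-(X @ z)))
        grad = X.T @ (h - y) / n                       # grad l(z^{k+1})
    return z
```

For instance, `z = nslr(X, y, s=100)` would return an (at most) 100-sparse classifier; only the $s\times s$ system on `T` is ever factorized, matching the $O(s^3 + s^2n + np)$ per-iteration cost in comment iii).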
4. Quadratic Convergence

Our first result states that if the sequence generated by NSLR converges, then it must converge to a $\tau$-stationary point of (4).

Theorem 10 For any given $\tau > 0$, let $u^\infty := (z^\infty; d^\infty)$ be any limit point of the sequence $\{u^k\}$ generated by NSLR. Then $z^\infty$ is a $\tau$-stationary point of (4).

Our next main result shows that if the initial point $u^0$ of the sequence $\{u^k\}$ is sufficiently close to a point $u^* = (z^*; \nabla\ell(z^*))$, where $z^*$ is a strong $\tau^*$-stationary point of (4) for a given $\tau^* > 0$, then the sequence converges to $u^*$ quadratically. Before that, we define some notation to ease the reading. Denote $\Gamma^* := \mathrm{supp}(z^*)$ and
$$\tau \in \begin{cases} (0, \tau^*], & \text{if } \|z^*\|_0 = s, \\ (0, \infty), & \text{if } \|z^*\|_0 < s, \end{cases} \qquad (30)$$
$$\delta^* := \frac{\min_{i\in\Gamma^*}|z^*_i| - \tau^*\max_{i\notin\Gamma^*}|d^*_i|}{\max\{1, \tau^*\}}, \qquad (31)$$
$$N_S(u^*, \delta^*) := \begin{cases} \{u = (z; d) \in \mathbb{R}^{2p} \mid z \in S,\ \|u - u^*\| < \delta^*\}, & \text{if } \delta^* \ne 0, \\ \{u = (z; d) \in \mathbb{R}^{2p} \mid z \in S\}, & \text{if } \delta^* = 0. \end{cases} \qquad (32)$$
Theorem 7 says that a strong $\tau^*$-stationary point $z^*$ is equivalent to $F_{\tau^*}(u^*; T^*) = 0$ for any $T^* \in \mathcal{T}_B(u^*; \tau^*)$. This implies $d^* = \nabla\ell(z^*)$. One can easily verify, by Property 4, that if $z^*$ is a strong $\tau^*$-stationary point, then $\delta^* > 0$ if $z^* \ne 0$, and $\delta^* = 0$ only if $z^* = 0$. The following lemma shows that any $z$ sufficiently close to $z^*$ shares the support set $\mathrm{supp}(z^*)$.

Lemma 11 Let $z^*$ be a strong $\tau^*$-stationary point of (4) for a given $\tau^* > 0$. Then for any $u \in N_S(u^*, \delta^*)$, the following results hold.

i) If $\|z^*\|_0 = s$, then for any given $0 < \tau \le \tau^*$,
$$\{\mathrm{supp}(z)\} = \mathcal{T}_B(u; \tau) = \mathcal{T}_B(u^*; \tau^*) = \{\mathrm{supp}(z^*)\}.$$

ii) If $\|z^*\|_0 < s$, then for any given $\tau > 0$, $\mathcal{T}_B(u; \tau) \subseteq \mathcal{T}_B(u^*; \tau^*)$ and
$$\mathrm{supp}(z^*) \subseteq (\mathrm{supp}(z) \cap T), \quad \forall\, T \in \mathcal{T}_B(u; \tau).$$

Now we are ready to state our main result.
Theorem 12 For (4), assume the matrix $X$ is $s$-regular. Let $z^*$ be a strong $\tau^*$-stationary point for a given $\tau^* > 0$ and $u^* = (z^*; d^*)$ with $d^* = \nabla\ell(z^*)$. Let $\tau$, $\delta^*$, $N_S(\cdot,\cdot)$ and $\mu(X, z^*, \delta^*) =: \mu^*$ be defined as in (30), (31), (32) and (24), respectively. Suppose that the initial point of the sequence $\{u^k\}$ generated by NSLR satisfies
$$u^0 \in N_S(u^*, \min\{\delta^*, \delta^*_x\}), \quad \text{with } \delta^*_x := (\sqrt{\lambda}\,\gamma_x\mu^*)^{-1}. \qquad (33)$$
Then for any $k \ge 0$, the following results hold.

i) $\lim_{k\to\infty}u^k = u^*$ with a quadratic convergence rate, namely,
$$\|u^{k+1} - u^*\| \le (0.5/\delta^*_x)\,\|u^k - u^*\|^2.$$

ii) If $\|z^*\|_0 = s$, it holds that $\mathcal{T}_B(u^k; \tau) = \{\mathrm{supp}(z^k)\} = \{\mathrm{supp}(z^*)\} = \mathcal{T}_B(u^*; \tau^*)$. If $\|z^*\|_0 < s$, it holds that $\mathrm{supp}(z^*) \subseteq (\mathrm{supp}(z^k) \cap T_k)$ for all $T_k \in \mathcal{T}_B(u^k; \tau)$.

iii) $\|F_\tau(u^{k+1}; T_{k+1})\| \le c_x\|u^k - u^*\|^2$, and NSLR terminates when
$$k \ge \big\lceil \log_2\!\big(\sqrt{c_x}\,\|u^0 - u^*\|\big) + \log_2\!\big(1/\sqrt{\epsilon}\big) \big\rceil,$$
where $\lceil a\rceil$ is the smallest integer no less than $a$ and $c_x := \big[(0.5/\delta^*_x) + 1.5\gamma_x\big]$.

Compared with the classical convergence results of the Newton method, Theorem 12 gives an explicit value of the neighbourhood and proves that NSLR terminates within finitely many steps.
5. Numerical Experiments
In this part, we conduct extensive numerical experiments with our algorithm NSLR, using MATLAB (R2017b) on a desktop with 8 GB of memory and an Intel Core i5 2.7 GHz CPU, against seven leading solvers on both synthetic data and real data.
We first conduct numerical experiments on synthetic data. In these simulations, we adopt two types of data generation. Lu and Zhang (2013) and Pan et al. (2017) used a model with identically and independently generated features $[x_1 \cdots x_n] \in \mathbb{R}^{p\times n}$, whilst Agarwal et al. (2010) and Bahmani et al. (2013) considered independent samples, each $x_i$ generated by an autoregressive process (Hamilton, 1994).

Example 1 (Independent Data (Lu and Zhang, 2013; Pan et al., 2017)) To generate the data labels $y \in \{0, 1\}^n$, we first randomly separate $\{1, \ldots, n\}$ into two parts $I_0$ and $I_1$, and set $y_i = 0$ for $i \in I_0$ and $y_i = 1$ for $i \in I_1$. The feature data is then produced by
$$x_i = y_i v_i + w_i, \quad i = 1, \ldots, n,$$
with $\mathbb{R} \ni v_i \sim N(0, 1)$ and $\mathbb{R}^p \ni w_i \sim N(0, I_p)$, where $N(0, I)$ is the normal distribution with mean zero and identity covariance. Since the sparse parameter $z^* \in \mathbb{R}^p$ is unknown, different values of $s\ (< n)$ will be tested to pursue a sparse solution.
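A NumPy sketch of one way to realize this scheme (our reading of the description: $v_i$ is a scalar standard normal per sample, added to every feature) is:

```python
import numpy as np

def gen_independent(n, p, seed=0):
    # Example 1: labels split into {0, 1}; features x_i = y_i * v_i + w_i
    # with scalar v_i ~ N(0, 1) broadcast over the p coordinates, w_i ~ N(0, I_p).
    rng = np.random.default_rng(seed)
    y = rng.permutation(np.r_[np.zeros(n // 2), np.ones(n - n // 2)])
    v = rng.standard_normal((n, 1))
    X = y[:, None] * v + rng.standard_normal((n, p))
    return X, y
```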
Example 2 (Correlated Data (Agarwal et al., 2010; Bahmani et al., 2013)) The sparse parameter $z^* \in \mathbb{R}^p$ has $s$ nonzero entries drawn independently from the standard Gaussian distribution. Each data sample $x_i = [x_{i1} \cdots x_{ip}]^\top$, $i = 1, \ldots, n$, is an independent instance of the random vector generated by an autoregressive process (see Hamilton, 1994) determined by
$$x_{i(j+1)} = \rho\,x_{ij} + \sqrt{1 - \rho^2}\;v_{ij}, \quad j = 1, \ldots, p - 1,$$
with $x_{i1} \sim N(0, 1)$, $v_{ij} \sim N(0, 1)$ and $\rho \in [0, 1)$ being the correlation parameter. The data labels $y \in \{0, 1\}^n$ are then drawn randomly according to the Bernoulli distribution with
$$\Pr\{y_i = 0 \mid x_i\} = \frac{1}{1 + e^{\langle x_i, z^*\rangle}}, \quad i = 1, \ldots, n.$$
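Likewise, the autoregressive design of Example 2 can be sketched as follows (ours; the parameter names are illustrative):

```python
import numpy as np

def gen_correlated(n, p, s, rho, seed=0):
    # Example 2: z* has s standard-normal nonzeros; features follow
    # x_{i,j+1} = rho * x_{ij} + sqrt(1 - rho^2) * v_{ij}, x_{i1} ~ N(0, 1);
    # labels obey Pr{y_i = 0 | x_i} = 1 / (1 + exp(<x_i, z*>)).
    rng = np.random.default_rng(seed)
    z_star = np.zeros(p)
    z_star[rng.choice(p, size=s, replace=False)] = rng.standard_normal(s)
    X = np.empty((n, p))
    X[:, 0] = rng.standard_normal(n)
    for j in range(p - 1):
        X[:, j + 1] = rho * X[:, j] + np.sqrt(1.0 - rho**2) * rng.standard_normal(n)
    y = (rng.random(n) > 1.0 / (1.0 + np.exp(X @ z_star))).astype(float)
    return X, y, z_star
```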
Example 3 (Real data) Eight real data sets are taken into consideration: arcene, colon-cancer, news20.binary, newsgroup, duke breast-cancer, leukemia, gisette and rcv1.binary, summarized in the following table, where only the last four data sets have testing data. Moreover, for arcene, colon-cancer, duke breast-cancer and leukemia, sample-wise normalization is conducted so that each sample has mean zero and variance one, and then feature-wise normalization is conducted so that each feature has mean zero and variance one (a sketch of this normalization follows the table). The other three data sets are feature-wise scaled to $[-1, 1]$. For rcv1.binary, we only use 20000 samples for the testing data. All $-1$s in the label classes $y$ are replaced by $0$.

Table 2: Details of eight real datasets.

Data name            n samples    p features    training size m1    testing size m2
arcene                     100        10,000                 100                 0
colon-cancer                62         2,000                  62                 0
news20.binary           19,996     1,355,191              19,996                 0
newsgroup               11,314       777,811              11,314                 0
duke breast-cancer          42         7,129                  38                 4
leukemia                    72         7,129                  38                34
gisette                  7,000         5,000               6,000             1,000
rcv1.binary             40,242        47,236              20,242            20,000

Data sources: http://archive.ics.uci.edu/ml/index.php, ∼cjlin/libsvmtools/datasets/ and https://web.stanford.edu/~hastie/glmnet_matlab/.
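The two-stage standardization mentioned above for arcene, colon-cancer, duke breast-cancer and leukemia can be sketched as follows (our illustration; the small constant guarding against zero variance is an assumption):

```python
import numpy as np

def standardize(A, eps=1e-12):
    # First sample-wise (rows: mean 0, variance 1), then feature-wise
    # (columns: mean 0, variance 1), as described in Example 3.
    A = (A - A.mean(axis=1, keepdims=True)) / (A.std(axis=1, keepdims=True) + eps)
    A = (A - A.mean(axis=0, keepdims=True)) / (A.std(axis=0, keepdims=True) + eps)
    return A
```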
For the parameter $\tau$, although Theorem 12 gives us a clue, it is still difficult to fix a proper value since $z^*$ is unknown. An alternative is to update $\tau$ adaptively. By (14), we have
$$\tau^* < \frac{[z^*]^{\downarrow}_s}{\max_{\{i:\, d^*_i\ne 0\}}|d^*_i|}, \quad \text{if } \|z^*\|_0 = s.$$
We therefore initialize $\tau$ with the fixed value $\tau_0 = 1$ and then decrease it by the following rule,
$$\tau_{k+1} = \begin{cases} 0.5\,\tau_k, & \text{if } \tau_k \ge \dfrac{[z^k]^{\downarrow}_s}{\max_{j\in\overline T_k}|d^k_j|} \text{ and } \|F_{\tau_k}(u^k; T_k)\| > 1/k, \\[1ex] \tau_k, & \text{otherwise}, \end{cases} \qquad (34)$$
where $T_k \in \mathcal{T}_B(u^k, \tau_k)$. The reason for keeping $\tau_{k+1} = \tau_k$ when $\|F_{\tau_k}(u^k; T_k)\| < 1/k$ is that this error then tends to zero at rate $1/k$, a desirable decreasing rate.

As for the initialization of NSLR, we set $z^0 = 0$ as described before. Since $\tau$ is updated as $\tau_k$, the stopping criterion is set as $\|F_{\tau_k}(u^k; T_k)\| < \epsilon$ for a small tolerance $\epsilon$. Notice that if the final $u^k$ satisfies $\|F_{\tau_k}(u^k; T_k)\| = 0$ and $\nabla\ell(z^k) = 0$, then it is a global minimizer of (4), owing to $z^k \in S$ and Theorem 6 iii).

Since there are many solvers able to address sparse logistic regression, we only focus on those programmed in MATLAB. Solvers whose codes are not available online or are written in other languages, such as R and C, are not selected for comparison. We thus choose 7 solvers mentioned in Subsection 1.2, which should be enough to make comprehensive comparisons. We summarize them in the following table.
Table 3: Benchmark Methods

Models                 (3) with convex φν    (3) with non-convex φν    (4)
First order method     SLEP                  APG, GIST                 GraSP
Second order method    IRLS-LARS             --                        NTGP, GPGN
For SLEP, we use it to solve (3) with $\phi_\nu(z) = \nu_1\|z\|_1 + \nu_2\|z\|_2^2$, whilst IRLS-LARS aims to solve the $\ell_1$-constrained logistic regression, namely (5). APG and GIST are taken to solve the capped-$\ell_1$ logistic regression with $\phi_\nu(z) = \nu_1\sum_i\min(|z_i|, \nu_2)$. We only use the non-monotone version of APG since its numerical performance is better than that of the monotone version (see Li and Lin, 2015). For GraSP, the version chosen here solves (4) directly instead of (6). Notice that the methods aiming at model (3) involve a penalty parameter $\nu$, whilst those tackling (4) need the sparsity level $s$. To make the results comparable, we adjust the default parameters $\nu$ of each method to guarantee that the generated solution $z$ satisfies $\|z\|_0 \le p/2$. We report the four indicators
$$(\ \ell(z),\ \|\nabla\ell(z)\|,\ \mathrm{SER},\ \mathrm{Time}\ )$$
to illustrate the performance of the methods, where Time (in seconds) is the CPU time, $z$ is the solution obtained by each method, and SER is the sign error rate defined by
$$\mathrm{SER} := \frac{1}{m}\sum_{i=1}^{m}\big|\,y_i - \mathrm{sign}(\langle x_i, z\rangle_+)\,\big|.$$
Here $\mathrm{sign}(a_+)$ is the sign of the projection of $a$ onto the non-negative half-line; namely, it returns $1$ if $a > 0$ and $0$ otherwise.
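Under our reading of the formula above, the SER indicator can be computed as:

```python
import numpy as np

def sign_error_rate(X, y, z):
    # SER = (1/m) * sum_i |y_i - sign_+(<x_i, z>)|, sign_+(a) = 1 if a > 0 else 0.
    return np.mean(np.abs(y - (X @ z > 0)))
```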
We now report the performance of the eight methods on the above three examples. To avoid randomness, we report average results over 10 independent trials for the first two examples, since they involve randomly generated data.

(a) Comparison on Example 1. To observe the influence of the sparsity level $s$ on the four greedy methods NSLR, GPGN, GraSP and NTGP, we fix $p = 10000$ with the sample size $n$ a fixed fraction of $p$, and vary the sparsity level $s$. As demonstrated in Figure 1, NSLR outperforms the others in terms of the lowest $\ell(z)$, $\|\nabla\ell(z)\|$ and SER and the shortest time, followed by GPGN. By contrast, GraSP always produces the worst results, which means this first-order method is not competitive against the other three, all of which are second-order methods.
[Figure 1 here: four panels plotting $\ell(z)$, $\|\nabla\ell(z)\|$, SER and Time against $s$ (from 600 to 1400 on the horizontal axes) for NSLR, GPGN, GraSP and NTGP.]

Figure 1: Comparison of four methods for Example 1 with $p = 10000$ and $n$ proportional to $p$.
To observe the influence of the ratio between the sample size $n$ and the number of features $p$ on all eight methods, we fix $p = 20000$, $s = 0.025p$ and vary the ratio $n/p$. Apart from recording the four indicators, we also report the number of non-zeros of the solution $z$ generated by each method. Here, we stop LARS when $s$ variables have been selected and its default stopping conditions are met, since LARS only adds one variable at each iteration (see Huang et al. (2018)). The penalty parameters $\nu_1, \nu_2$ are set to small fixed powers of ten for SLEP and APG, and to scaled $\mathrm{abs(randn)}$ values for GIST. As presented in Figure 2, in terms of the objective $\ell(z)$, the gradient norm $\|\nabla\ell(z)\|$ and SER, NSLR again delivers the best results, followed by GPGN and GIST. Obviously, LARS and SLEP produce undesirable results compared with the other methods. As for computational time, NSLR runs the fastest, while GraSP and APG run relatively slowly, taking over 1000 seconds when $n/p \ge 0.6$. Table 4 shows that only LARS attains a sparsity level $\|z\|_0$ lower than our NSLR. This is because LARS fails to recover the support and vanishes when $s = 500$ in this experiment (this phenomenon was also observed in Huang et al. (2018) and Garg and Khandekar (2009)).

[Figure 2 here: four panels plotting $\ell(z)$, $\|\nabla\ell(z)\|$, SER and Time against $n/p$ for all eight methods.]
Figure 2: Comparison of eight methods for Example 1 with $p = 20000$, $s = 0.025p$.

Table 4: $\|z\|_0$ of eight methods for Example 1 with $p = 20000$, $s = 0.025p$.

Method                      ‖z‖₀ at the seven tested ratios n/p
NSLR, GPGN, GraSP, NTGP     500   500   500   500   500   500   500
LARS
GIST
APG
SLEP
[Figure 3 here: four panels, (a) $\rho = 0$ through (d), plotting $\ell(z)$ against $n/p$ for all eight methods.]

Figure 3: Comparison of eight methods for Example 2 with $p = 1000$.

(b) Comparison on Example 2. To observe the influence of the correlation parameter $\rho$ on the eight methods, we set $p = 1000$ with $s$ proportional to $p$, and choose four values of $\rho$ (including $\rho = 0$). Figure 3 shows the average $\ell(z)$ obtained by the eight methods over a wide range of the ratio $n/p$. Apparently, at all four values of $\rho$, NSLR performs stably best. Moreover, the trends of these eight methods are generally consistent, which indicates that the correlation parameter has little influence on them. Therefore, we further fix $\rho$ and test $p \in \{10000, 20000, 30000\}$ with $n$ and two sparsity levels $s$ proportional to $p$. Here, the penalty parameters $\nu_1, \nu_2$ are again set to small fixed values for SLEP and APG, and to scaled $\mathrm{abs(randn)}$ values for GIST. As reported in Table 5, for the cases with $p = 20000$, LARS basically fails to render desirable solutions, producing the highest $\ell(z)$. It can be clearly seen that NSLR always provides the best accuracy while consuming the least time. More importantly, its $\|\nabla\ell(z)\|$ attains very high accuracy, which means global optimal solutions are achieved. By contrast, SLEP and LARS behave the worst, with the highest $\ell(z)$ and $\|\nabla\ell(z)\|$. As for computational speed, NTGP and LARS appear to be the slowest, while NSLR and GPGN run the fastest.
Table 5: Average results of eight methods for Example 2.

[Table 5 here: for each of $p = 10000$, $20000$ and $30000$, rows NSLR, GPGN, GraSP, NTGP, LARS, GIST, APG and SLEP report $\ell(z)$, $\|\nabla\ell(z)\|$, SER, Time(s) and $\|z\|_0$ under each of the two $(s, n)$ settings.]

(c) Comparison on Example 3. To observe the performance of all eight methods on real data, we select eight real data sets with different dimensions; the highest dimension is up to millions (see news20.binary). Table 6 reports the results of the eight methods on the four data sets without testing data ($m_2 = 0$): arcene, colon-cancer, news20.binary and newsgroup. For the last two data sets, LARS makes our desktop run out of memory, so its results are omitted there. The NaN obtained by GraSP on the data newsgroup means that $\|\nabla\ell(z)\|$ tends to infinity. Clearly, NSLR is more efficient than the others for all test instances. For example, NSLR uses only 9 seconds for the data newsgroup with $p = 777811$ features and achieves the smallest logistic loss with the sparsest solution.
Table 6: Results of eight methods for Example 3 with $m_2 = 0$.

[Table 6 here: rows NSLR, GPGN, GraSP, NTGP, LARS, GIST, APG and SLEP report $\ell(z)$, $\|\nabla\ell(z)\|$, SER, Time(s) and $\|z\|_0$ on arcene ($m = 100$, $p = 10000$) and colon-cancer ($m = 62$, $p = 2000$), and on news20.binary ($m = 19996$, $p = 1355191$) and newsgroup ($m = 11314$, $p = 777811$); the LARS entries for the last two data sets are "--".]
When the methods solve the data sets with testing data ($m_2 > 0$), namely duke breast-cancer, leukemia, gisette and rcv1.binary, the table is a little different. Since the testing data is taken into consideration, we add two indicators to illustrate the performance of each method: $\ell(z)$-test and SER-test. The results are reported in Table 7, where $\ell(z)$ and $\ell(z)$-test denote the objective function value on the training data and testing data respectively, and similarly for SER-train and SER-test. For duke breast-cancer and leukemia, LARS stops when the maximum number of iterations reaches $\min\{m_1, p\}$; hence its produced $\|z\|_0$ is smaller than for the other methods. We can see that NSLR guarantees good performance on the testing data as well as on the training data. It is capable of rendering the sparsest solutions with the smallest logistic loss while consuming the least time.

Table 7: Average results of eight methods for Example 3 with $m_2 > 0$.

[Table 7 here: rows NSLR, GPGN, GraSP, NTGP, LARS, GIST, APG and SLEP report $\|\nabla\ell(z)\|$, $\ell(z)$, $\ell(z)$-test, SER-train, SER-test, Time(s) and $\|z\|_0$ on duke breast-cancer ($m_1 = 38$, $m_2 = 4$, $p = 7129$), leukemia ($m_1 = 38$, $m_2 = 34$, $p = 7129$), gisette ($m_1 = 6000$, $m_2 = 1000$, $p = 5000$) and rcv1.binary ($m_1 = 20242$, $m_2 = 20000$, $p = 47236$).]
6. Conclusion
In this paper, we considered the sparsity-constrained logistic regression (4), which is a nonlinear combinatorial problem. Despite its NP-hardness, we benefited from the nice properties of the objective function and introduced the strong $\tau$-stationary point as an optimality condition of (4). This led to a stationary equation, which can be solved efficiently by the Newton method. It turned out that our proposed method NSLR has a relatively low computational complexity at each step, because only a small linear equation system with $s$ variables and $s$ equations needs to be solved to update the Newton direction. We further established its quadratic convergence to a strong $\tau$-stationary point and showed termination within finitely many steps. It is worth mentioning that we reasonably extended the classical Newton method for unconstrained problems to sparsity-constrained logistic regression. The numerical performance against several state-of-the-art methods demonstrated that NSLR is remarkably efficient and competitive in solving (4), especially in large-scale settings.

A drawback of NSLR is that the starting point is required to be close enough to its limit in order to establish the quadratic convergence results. A potential way to compensate for this is to adopt a line search scheme (see Yuan and Liu (2017), for example), which would guarantee that the sequence generated by NSLR converges to its limit; in that case, the closeness of the starting point to the limit would no longer be demanded. We leave this as future research.
Acknowledgments
We would like to acknowledge support for this project from the National Natural Science Foundation of China (11431002) and the 111 Project of China (B16002). We are also very grateful to Prof. Xiao-Tong Yuan for sharing with us their excellent package NTGP.

Appendix A.
In this appendix we prove all associated theorems from Section 2:
Proof of Property 2Proof i) Denote (cid:96) i ( z ) = log(1 + e t i ) − y i t i with t i = (cid:104) x i , z (cid:105) . By Taylor expansion of (cid:96) i ( z )at t i = 0, for any min { , t i } ≤ ς i ≤ max { , t i } , it holds (cid:96) ( z ) = 1 n n (cid:88) i =1 (cid:96) i ( z ) = 1 n n (cid:88) i =1 (cid:2) ln(2) + t i / t i / (2(2 + e ς i + e − ς i )) − y i t i (cid:3) ≤ n n (cid:88) i =1 (cid:2) ln(2) + t i / t i / (2 + e ς i + e − ς i ) − y i t i (cid:3) ≤ n n (cid:88) i =1 (cid:2) ln(2) + (1 / − y i ) t i + t i / (cid:3) = 1 n n (cid:88) i =1 (cid:2) ln(2) + [ t i − (2 y i − / − (2 y i − / (cid:3) = ln(2) + ( (cid:107) X z − c (cid:107) − (cid:107) c (cid:107) ) / (4 n )= ln(2) − / (cid:107) X z − c (cid:107) / (4 n ) , (35)where the last equality is due to y i ∈ { , } and thus 2 y i − ∈ {− , } . Then for any z ∈ S ,we must have min z ∈ S (cid:96) ( z ) ≤ (cid:96) ( z ), which results in the conclusion.24i) If X (cid:62) c (cid:54) = 0, the first claim holds clearly (In fact, a typical case is when r = 1). Thisindicates U (cid:62) c (cid:54) = 0. Direct calculation yields that (cid:107) X ¯ z − c (cid:107) = (cid:107) X T V Λ − U (cid:62) c − c (cid:107) = (cid:107) U Λ V (cid:62) V Λ − U (cid:62) c − c (cid:107) = (cid:107) U U (cid:62) c − c (cid:107) = (cid:107) U U (cid:62) c (cid:107) − (cid:104) U U (cid:62) c , c (cid:105) + (cid:107) c (cid:107) = (cid:107) c (cid:107) − (cid:107) U (cid:62) c (cid:107) (because of U (cid:62) U = I ) < (cid:107) c (cid:107) = n, which together with (35) concludes the the second claim. If X (cid:62) c = 0, then ∇ (cid:96) (0) = X (cid:62) ( h (0) − y ) /n = − X (cid:62) c /n = 0. This together with Property 1 i) suffices to (cid:96) ( z ) ≥ (cid:96) (0) + (cid:104)∇ (cid:96) (0) , z − (cid:105) = (cid:96) (0)for any z ∈ S , which means 0 is a global optimal solution. And one can easily check that0 ∈ argmin z ∈ S (cid:107) X z − c (cid:107) due to X (cid:62) ( X − c ) = 0. Proof of Theorem 6Proof i) We argue by contradiction and suppose that for any given τ ∈ (0 , /λ x ), z ∗ is nota strong τ -stationary point. Then there exists a z (cid:54) = z ∗ ∈ R p satisfies z = P S ( z ∗ − τ ∇ (cid:96) ( z ∗ )) , which together with the definition of projection on S imply that (cid:107) z − ( z ∗ − τ ∇ (cid:96) ( z ∗ )) (cid:107) ≤ (cid:107) z ∗ − ( z ∗ − τ ∇ (cid:96) ( z ∗ )) (cid:107) . This leads to (cid:104)∇ (cid:96) ( z ∗ ) , z − z ∗ (cid:105) ≤ − / (2 τ ) (cid:107) z − z ∗ (cid:107) . (36)It follows from Property 1 (i) and (36) that for any τ ∈ (0 , /λ x ), (cid:96) ( z ) ≤ (cid:96) ( z ∗ ) + (cid:104)∇ (cid:96) ( z ∗ ) , z − z ∗ (cid:105) + ( λ x / (cid:107) z − z ∗ (cid:107) ≤ (cid:96) ( z ∗ ) − / (2 τ ) (cid:107) z − z ∗ (cid:107) + ( λ x / (cid:107) z − z ∗ (cid:107) < (cid:96) ( z ∗ )which contradicts the assumption that z ∗ is a global minimizer of SLR (4).ii) For a given τ >
0, let z ∗ be a strong τ -stationary point with (cid:107) z ∗ (cid:107) = s . For any z ∈ N ( z ∗ , δ ) ∩ S with 0 < δ < [ z ] ↓ s , we have | z i | = | z ∗ i − ( z ∗ i − z i ) | ≥ | z ∗ i | − δ > [ z ] ↓ s − [ z ] ↓ s = 0 , ∀ i ∈ supp( z ∗ ) , which means supp( z ∗ ) ⊆ supp( z ). By (cid:107) z (cid:107) ≤ s and | supp( z ∗ ) | = (cid:107) z ∗ (cid:107) = s , we obtainsupp( z ∗ ) = supp( z ) , ∀ z ∈ N ( z ∗ , δ ) ∩ S. (37)25his implies that z i = z ∗ i = 0 for i / ∈ supp( z ∗ ). It follows that the convexity of (cid:96) andProperty 4 ii) ( ∇ i (cid:96) ( z ∗ ) = 0 , i ∈ supp( z ∗ )) (cid:96) ( z ) ≥ (cid:96) ( z ∗ ) + (cid:104)∇ (cid:96) ( z ∗ ) , z − z ∗ (cid:105) = (cid:96) ( z ∗ ) + (cid:88) i ∈ supp( z ∗ ) ∇ i (cid:96) ( z ∗ )( z i − z ∗ i ) + (cid:88) i/ ∈ supp( z ∗ ) ∇ i (cid:96) ( z ∗ )( z i − z ∗ i ) = (cid:96) ( z ∗ ) . Thus z ∗ is a local minimizer of SLR (4). If z ∗ further satisfies (cid:96) ( z ∗ ) = 0, then for any z ∈ S ,the convexity of (cid:96) suffices to (cid:96) ( z ) ≥ (cid:96) ( z ∗ ) + (cid:104)∇ (cid:96) ( z ∗ ) , z − z ∗ (cid:105) = (cid:96) ( z ∗ ) , (38)which proves the global optimality of z ∗ .iii) When (cid:107) z ∗ (cid:107) < s , Property 4 i) guarantees ∇ (cid:96) ( z ∗ ) = 0, which together with (38)yields the conclusion. Proof of Theorem 7Proof
Proof of Theorem 7

Proof We only prove the 'if' part, since the proof of the 'only if' part is quite straightforward. Suppose we have $F_\tau(u;T) = 0$ for all $T \in \mathcal{T}_B(u;\tau)$, namely,
$$z_{\overline T} = 0, \quad d_T = 0, \quad d = \nabla\ell(z). \qquad (39)$$
If $\mathcal{T}_B(u;\tau)$ is a singleton, by letting $T$ be the only entry of $\mathcal{T}_B(u;\tau)$, then
$$z - P_S(z - \tau\nabla\ell(z)) \overset{(39)}{=} z - P_S(z - \tau d) = \begin{bmatrix} z_T\\ z_{\overline T}\end{bmatrix} - \begin{bmatrix} (z-\tau d)_T\\ 0\end{bmatrix} \overset{(39)}{=} \begin{bmatrix} z_T - z_T\\ z_{\overline T} - 0\end{bmatrix} = 0,$$
which means $z$ is a strong $\tau$-stationary point. If $\mathcal{T}_B(u;\tau)$ has multiple elements, then by the definition (17) of $T(u;\tau)$ we have two cases: either $[z-\tau d]^\downarrow_s = [z - \tau d]^\downarrow_{s+1} > 0$ or $[z - \tau d]^\downarrow_s = 0$. We now exclude the first case. Without loss of any generality, we assume
$$|z_1 - \tau d_1| \ge \cdots \ge |z_s - \tau d_s| = |z_{s+1} - \tau d_{s+1}| = [z - \tau d]^\downarrow_s.$$
Let $T_1 = \{1, 2, \cdots, s\}$ and $T_2 = \{1, 2, \cdots, s-1, s+1\}$. Then $F_\tau(u;T_1) = F_\tau(u;T_2) = 0$ implies that $d_{T_1} = d_{T_2} = 0$ and $z_{\overline{T}_1} = z_{\overline{T}_2} = 0$, which lead to
$$|z_1| = |z_1 - \tau d_1| \ge \cdots \ge |z_s| = |z_s - \tau d_s| = |z_{s+1}| = |z_{s+1} - \tau d_{s+1}| = [z - \tau d]^\downarrow_s > 0.$$
This contradicts $z_{\overline{T}_1} = 0$ because of $s+1 \in \overline{T}_1$. Therefore, we have $[z - \tau d]^\downarrow_s = 0$. This together with the definition (17) of $T(z;\tau)$ derives $0 = [z - \tau d]^\downarrow_s \ge |z_i - \tau d_i| \overset{(39)}{=} |\tau d_i|$ for any $i \in \overline T$, which combining $d_T = 0$ renders $d = 0$ and hence $\nabla\ell(z) = 0$ from (39). Moreover, $[z]^\downarrow_s = [z - \tau d]^\downarrow_s = 0$, yielding $\|z\|_0 < s$. Consequently, $z$ is a strong $\tau$-stationary point from Property 4.

Proof of Theorem 9

Proof By elementary row and column operations, the matrix $\nabla F_\tau(u;T)$ can be reduced to a block diagonal form with the block $-\nabla^2_{T,T}\ell(z)$ together with identity blocks of total size $2p - s$. Clearly, we then have
$$\operatorname{rank}(\nabla F_\tau(u;T)) = \operatorname{rank}(\nabla^2_{T,T}\ell(z)) + 2p - s = \operatorname{rank}(X_T^\top D(z) X_T) + 2p - s.$$
If the matrix $X$ is $s$-regular, then for any $h \ne 0$ we have $X_T h \ne 0$ and $(X_T h)^\top D(z)(X_T h) > 0$ since $D(z) \succ 0$. This shows $X_T^\top D(z) X_T \succ 0$, yielding $\operatorname{rank}(X_T^\top D(z) X_T) = s$, so that $\nabla F_\tau(u;T)$ is nonsingular. Direct calculation yields the form of $(\nabla F_\tau(u;T))^{-1}$ as (23). It remains to verify the last statement of the theorem. For any given $z^* \in \mathbb{R}^p$ and $\delta^* > 0$, $z \in N(z^*, \delta^*)$ implies $\|z\| < \|z^*\| + \delta^*$ and hence
$$\|Xz\| \le \big(\lambda_{\max}(X^\top X)\big)^{1/2}\|z\| = 2\sqrt{n\lambda_x}\,\|z\| < 2\sqrt{n\lambda_x}\,(\|z^*\| + \delta^*) =: \sigma,$$
which leads to $\langle x_i, z\rangle \le \|Xz\| < \sigma,\ i = 1,\ldots,n$. This together with Property 1 iii) suffices to, for any $z \in N(z^*,\delta^*)$,
$$1/(4e^\sigma) \le e^\sigma/(1 + e^\sigma)^2 \le \lambda_{\min}(D(z)) \le \lambda_{\max}(D(z)) \le 1/4. \qquad (40)$$
We are ready to see that
$$\|(\nabla^2_{T,T}\ell(z))^{-1}\| = \lambda_{\max}\big((\nabla^2_{T,T}\ell(z))^{-1}\big) = \big[\lambda_{\min}(\nabla^2_{T,T}\ell(z))\big]^{-1} = \big[\lambda_{\min}(X_T^\top D(z) X_T/n)\big]^{-1} \le n\big[\lambda_{\min}(X_T^\top X_T)\,\lambda_{\min}(D(z))\big]^{-1} \le 4ne^\sigma/\lambda,$$
which completes the whole proof.
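Theorem 9 is what makes the Newton step cheap in practice: under $s$-regularity the block $\nabla^2_{T,T}\ell(z) = X_T^\top D(z) X_T/n$ is positive definite, so the update only requires solving an $s\times s$ linear system rather than factoring the full matrix $\nabla F_\tau(u;T)$. The following sketch of such a reduced step is a plausible rendering of update (28) under these assumptions, not the authors' exact implementation:

```python
import numpy as np

def newton_step_on_T(X, y, z, T):
    # One reduced Newton update on working set T (sketch of (28)):
    # solve with the s x s block H_TT = X_T^T D(z) X_T / n, zero out T-bar.
    n, p = X.shape
    h = 1.0 / (1.0 + np.exp(-(X @ z)))
    g = X.T @ (h - y) / n                      # gradient of l at z
    D = h * (1.0 - h)                          # diagonal of D(z), in (0, 1/4]
    Tb = np.setdiff1d(np.arange(p), T)
    H_TT = X[:, T].T @ (X[:, T] * D[:, None]) / n
    H_TTb = X[:, T].T @ (X[:, Tb] * D[:, None]) / n
    z_new = np.zeros(p)
    z_new[T] = z[T] - np.linalg.solve(H_TT, g[T] - H_TTb @ z[Tb])
    return z_new
```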
Appendix B.

In this appendix we prove the results from Section 4.

Proof of Theorem 10

Proof
Since $u^k \to u^\infty$, it holds that $z^k \to z^\infty$ and $d^k \to d^\infty$ as $k \to \infty$, which suffices to $z^\infty \in S$ due to $\|z^k\|_0 \le s$. Clearly, $\{T_k\}$ is bounded owing to $T_k \subseteq \mathbb{N}_p$ and $|T_k| = s$, and thus has a subsequence $\{T_{k_t}\}$ such that
$$T_{k_1} = T_{k_2} = \cdots =: T_\infty. \qquad (41)$$
Since $z^{k_t+1} \to z^\infty$ and $\operatorname{supp}(z^{k_t+1}) \subseteq T_{k_t} = T_\infty$, we must have
$$T_\infty \begin{cases} = \operatorname{supp}(z^\infty), & \text{if } \|z^\infty\|_0 = s,\\ \supset \operatorname{supp}(z^\infty), & \text{if } \|z^\infty\|_0 < s,\end{cases}$$
and hence
$$z^\infty_{\overline{T}_\infty} = 0. \qquad (42)$$
From (27) and (28), we obtain
$$\nabla^2_{T_\infty\cdot}\ell(z^\infty)\, z^\infty = \lim_{k_t\to\infty} \nabla^2_{T_{k_t}\cdot}\ell(z^{k_t})\, z^{k_t+1} = \lim_{k_t\to\infty}\big[\nabla^2_{T_{k_t}\cdot}\ell(z^{k_t})\, z^{k_t} - \nabla_{T_{k_t}}\ell(z^{k_t})\big] = \nabla^2_{T_\infty\cdot}\ell(z^\infty)\, z^\infty - \nabla_{T_\infty}\ell(z^\infty),$$
which yields
$$\nabla_{T_\infty}\ell(z^\infty) = 0. \qquad (43)$$
Again by (28), we have
$$d^\infty_{T_\infty} = \lim_{k_t\to\infty} d^{k_t+1}_{T_{k_t}} = 0, \qquad (44)$$
$$d^\infty_{\overline{T}_\infty} = \lim_{k_t\to\infty} d^{k_t+1}_{\overline{T}_{k_t}} = \lim_{k_t\to\infty}\big[\nabla_{\overline{T}_{k_t}}\ell(z^{k_t}) + \nabla^2_{\overline{T}_{k_t}\cdot}\ell(z^{k_t})(z^{k_t+1} - z^{k_t})\big] = \nabla_{\overline{T}_\infty}\ell(z^\infty). \qquad (45)$$
Now for any $i \in T_{k_t} \overset{(41)}{=} T_\infty$ and $j \in \overline{T}_{k_t} \overset{(41)}{=} \overline{T}_\infty$, we have
$$|z^\infty_i| \overset{(44)}{=} |z^\infty_i - \tau d^\infty_i| = \lim_{k_t\to\infty}|z^{k_t}_i - \tau d^{k_t}_i| \overset{(17)}{\ge} \lim_{k_t\to\infty}|z^{k_t}_j - \tau d^{k_t}_j| = |z^\infty_j - \tau d^\infty_j| \overset{(42)}{=} \tau|d^\infty_j|,$$
which leads to
$$[z^\infty]^\downarrow_s = \min_{i\in T_\infty}|z^\infty_i| \ge \tau|d^\infty_j| \overset{(45)}{=} \tau|\nabla_j\ell(z^\infty)|, \quad \forall j \in \overline{T}_\infty. \qquad (46)$$
If $\|z^\infty\|_0 = s$, then $T_\infty \overset{(42)}{=} \operatorname{supp}(z^\infty)$. Consequently, $[z^\infty]^\downarrow_s \ge \tau|\nabla_j\ell(z^\infty)|$ for all $j \notin \operatorname{supp}(z^\infty)$. If $\|z^\infty\|_0 < s$, then $[z^\infty]^\downarrow_s = 0$ and $\nabla\ell(z^\infty) = 0$ from (46) and (43). Those together with (13) show that $z^\infty$ is a $\tau$-stationary point.
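The limiting conditions (43) and (46) are precisely the $\tau$-stationarity test of (13): the gradient vanishes on the support and is bounded by $[z]^\downarrow_s/\tau$ off it. A small checker along these lines (names are illustrative; `g` stands for the gradient $\nabla\ell(z)$ and $p \ge s$ is assumed):

```python
import numpy as np

def is_tau_stationary(g, z, s, tau, tol=1e-10):
    # Conditions (13)/(46): grad vanishes on supp(z); off the support,
    # tau * |g_j| may not exceed the s-th largest entry of |z|.
    p = len(z)
    supp = np.flatnonzero(np.abs(z) > tol)
    z_s = np.sort(np.abs(z))[::-1][s - 1]      # [z] s-th largest magnitude
    off = np.setdiff1d(np.arange(p), supp)
    return (np.all(np.abs(g[supp]) <= tol)
            and np.all(tau * np.abs(g[off]) <= z_s + tol))
```

When $\|z\|_0 < s$ the threshold $z_s$ is zero, so the test collapses to $\nabla\ell(z) = 0$, exactly as in the proof.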
Before we prove Lemma 11, we need the following property.

Property 13 Let $z$ be a strong $\tau$-stationary point of (4). We have the following results.

i) If $\|z\|_0 = s$, then $\mathcal{T}_B(z;\tau)$ is a singleton, and for any $0 < \tau_0 \le \tau$,
$$\mathcal{T}_B(z;\tau_0) = \mathcal{T}_B(z;\tau) = \{\operatorname{supp}(z)\}.$$

ii) If $\|z\|_0 < s$, then $\mathcal{T}_B(z;\tau)$ has finitely many entries, namely,
$$\mathcal{T}_B(z;\tau) = \{T \subseteq \mathbb{N}_p : T \supset \operatorname{supp}(z),\ |T| = s\}.$$
Moreover, for any given $\tau_0 > 0$, $\mathcal{T}_B(z;\tau_0) = \mathcal{T}_B(z;\tau)$.

Proof
Since $z$ is a strong $\tau$-stationary point of (4), it follows from (21) that $d = \nabla\ell(z)$.

i) When $\|z\|_0 = s$, Property 4 ii) yields
$$|d_i| = |\nabla_i\ell(z)| \begin{cases} = 0, & i \in \operatorname{supp}(z),\\ < [z]^\downarrow_s/\tau, & i \notin \operatorname{supp}(z),\end{cases} \qquad (47)$$
which suffices to
$$|z_i - \tau d_i| = |z_i| \ge [z]^\downarrow_s > \tau|d_j| = |z_j - \tau d_j|$$
for any $i \in \operatorname{supp}(z)$, $j \notin \operatorname{supp}(z)$. The above inequalities together with $\|z\|_0 = s$ imply $\operatorname{supp}(z) \in \mathcal{T}_B(z;\tau)$. If there were another $T \in \mathcal{T}_B(z;\tau)$ such that $T \ne \operatorname{supp}(z)$, then there would exist a $j \in T\setminus\operatorname{supp}(z)$ and an $i \in \operatorname{supp}(z)\setminus T$. This means
$$[z]^\downarrow_s \overset{(47)}{>} \tau|d_j| = |z_j - \tau d_j| \overset{(17)}{\ge} |z_i - \tau d_i| \overset{(47)}{=} |z_i| \ge [z]^\downarrow_s,$$
which is a contradiction. Therefore, $\mathcal{T}_B(z;\tau) = \{T(z;\tau)\} = \{\operatorname{supp}(z)\}$ is a singleton. Again by Property 4, a strong $\tau$-stationary point $z$ is also a strong $\tau_0$-stationary point due to $\tau_0|\nabla_i\ell(z)| \le \tau|\nabla_i\ell(z)| < [z]^\downarrow_s$ for any $0 < \tau_0 \le \tau$; thus $\mathcal{T}_B(z;\tau_0) = \{T(z;\tau_0)\} = \{\operatorname{supp}(z)\}$.

ii) When $\|z\|_0 < s$, Property 4 i) yields $d = \nabla\ell(z) = 0$, which implies that $[z - \tau d]^\downarrow_s = [z]^\downarrow_s = 0$, so any $T(z;\tau)$ collects indices of the $s$ largest entries of $|z|$ and hence must satisfy $\operatorname{supp}(z) \subset T(z;\tau)$ due to $|T(z;\tau)| = s > \|z\|_0$. To form a $T(z;\tau)$, we need to pick the remaining $s - \|z\|_0$ indices from $\mathbb{N}_p\setminus\operatorname{supp}(z)$, with $C^{s-\|z\|_0}_{p-\|z\|_0}$ choices, yielding $C^{s-\|z\|_0}_{p-\|z\|_0}$ sets $T(z;\tau)$ that form $\mathcal{T}_B(z;\tau)$. Finally, $\nabla\ell(z) = 0$ suffices to $z$ being also a strong $\tau_0$-stationary point for any $\tau_0 > 0$. Similar choices to form $\mathcal{T}_B(z;\tau_0)$ lead to $\mathcal{T}_B(z;\tau_0) = \mathcal{T}_B(z;\tau)$.
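The counting step in part ii) can be made concrete: when $\|z\|_0 < s$ and $\nabla\ell(z) = 0$, every $s$-element superset of $\operatorname{supp}(z)$ is a legitimate working set, so $|\mathcal{T}_B(z;\tau)| = C^{s-\|z\|_0}_{p-\|z\|_0}$. A short sketch (hypothetical helper, zero-based indices):

```python
from itertools import combinations
from math import comb

def enumerate_TB(supp, p, s):
    # all working sets T with supp ⊆ T ⊆ {0,...,p-1} and |T| = s
    rest = [i for i in range(p) if i not in supp]
    return [tuple(sorted(list(supp) + list(extra)))
            for extra in combinations(rest, s - len(supp))]

supp, p, s = [1, 4], 8, 4
TB = enumerate_TB(supp, p, s)
assert len(TB) == comb(p - len(supp), s - len(supp))   # C(6, 2) = 15
```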
Proof of Lemma 11

Proof i) If $z^*$ is a strong $\tau^*$-stationary point of (4), then Theorem 7 shows $F_{\tau^*}(u^*;T^*) = 0$ for any $T^* \in \mathcal{T}_B(u^*;\tau^*)$. This means
$$z^*_{\overline{T}^*} = 0, \quad d^*_{T^*} = 0, \quad d^* = \nabla\ell(z^*). \qquad (48)$$
When $\|z^*\|_0 = s$, Property 13 i) already shows $\mathcal{T}_B(u^*;\tau^*) = \mathcal{T}_B(u^*;\tau) = \{\operatorname{supp}(z^*)\}$. This means $T^* = \operatorname{supp}(z^*)$. Next, we show that
$$\operatorname{supp}(z) = \operatorname{supp}(z^*). \qquad (49)$$
To prove the above equation, we only need to prove $\operatorname{supp}(z) \supseteq \operatorname{supp}(z^*)$, since $z \in S$ and $\|z^*\|_0 = s$. If it were not true, i.e., there were an $i \in \operatorname{supp}(z^*)$ but $i \notin \operatorname{supp}(z)$, then it would follow from the definition of $N_S(u^*,\delta^*)$ that
$$[z^*]^\downarrow_s \overset{(31)}{\ge} \delta^* \overset{(32)}{>} \|u - u^*\| \ge \|z - z^*\| \ge |0 - z^*_i| \ge [z^*]^\downarrow_s,$$
which is a contradiction. Again, $u \in N_S(u^*,\delta^*)$ suffices to
$$\big(|z_i - z^*_i| + |d_i - d^*_i| + |d_j - d^*_j|\big)^2 \le 3\big(|z_i - z^*_i|^2 + |d_i - d^*_i|^2 + |d_j - d^*_j|^2\big) \le 3\|u - u^*\|^2 < 3(\delta^*)^2. \qquad (50)$$
For any $T \in \mathcal{T}_B(u;\tau)$, we now prove $\operatorname{supp}(z^*) = T$. Considering any $i \in \operatorname{supp}(z^*)$ and $j \notin \operatorname{supp}(z^*)$, direct calculation derives the following chain of inequalities:
$$
\begin{aligned}
|z_i - \tau d_i| - |z_j - \tau d_j| &\overset{(49)}{=} |z_i - \tau d_i| - |\tau d_j| \ge |z_i| - \tau|d_i| - \tau|d_j|\\
&\overset{(48)}{=} |z_i - z^*_i + z^*_i| - \tau|d_i - d^*_i| - \tau|d_j - d^*_j + d^*_j|\\
&\ge |z^*_i| - |z_i - z^*_i| - \tau^*|d_i - d^*_i| - \tau^*|d_j - d^*_j| - \tau^*|d^*_j|\\
&\ge |z^*_i| - \max\{1, \tau^*\}\big(|z_i - z^*_i| + |d_i - d^*_i| + |d_j - d^*_j|\big) - \tau^*|d^*_j|\\
&\overset{(50)}{>} [z^*]^\downarrow_s - \sqrt{3}\max\{1, \tau^*\}\delta^* - \tau^*\max_{j\notin\operatorname{supp}(z^*)}|d^*_j| \overset{(31)}{\ge} 0.
\end{aligned}
$$
This means that any $i \in \operatorname{supp}(z^*)$ satisfies $i \in T$, namely, $\operatorname{supp}(z^*) \subseteq T$. Finally, $\|z^*\|_0 = s$ and $|T| = s$ suffice to $\operatorname{supp}(z^*) = T$. This, combining the arbitrariness of $T \in \mathcal{T}_B(u;\tau)$, indicates $\mathcal{T}_B(u;\tau) = \{\operatorname{supp}(z^*)\}$.

ii) For $\|z^*\|_0 < s$, if $z^* = 0$ the conclusion holds obviously due to $\operatorname{supp}(z^*) = \emptyset$. We now only focus on $z^* \ne 0$, which means $\delta^* > 0$. We first prove $\Gamma^* := \operatorname{supp}(z^*) \subseteq \operatorname{supp}(z)$. If it were not true, i.e., there were an $i \in \Gamma^*$ but $i \notin \operatorname{supp}(z)$, then by the definition of $N_S(u^*,\delta^*)$ as (32), it would follow that
$$\min_{i\in\Gamma^*}|z^*_i| \overset{(31)}{\ge} \delta^* > \|u - u^*\| \ge \|z - z^*\| \ge |0 - z^*_i| \ge \min_{i\in\Gamma^*}|z^*_i|,$$
which is a contradiction. We then prove $\Gamma^* \subseteq T$ for any $T \in \mathcal{T}_B(u;\tau)$, namely, $|z_i - \tau d_i| > |z_j - \tau d_j|$ for any $i \in \Gamma^*$, $j \notin \Gamma^*$. For any given $\tau > 0$ and $u \in N_S(u^*,\delta^*)$, we have
$$\big(|z_i - z^*_i| + |z_j - z^*_j| + |d_i - d^*_i| + |d_j - d^*_j|\big)^2 \le 4\big(|z_i - z^*_i|^2 + |z_j - z^*_j|^2 + |d_i - d^*_i|^2 + |d_j - d^*_j|^2\big) < 4(\delta^*)^2. \qquad (51)$$
Notice that $z^*$ is a strong $\tau^*$-stationary point with $\|z^*\|_0 < s$, so it must satisfy $d^* = \nabla\ell(z^*) = 0$ by Property 4 i). Direct calculation derives the following chain of inequalities:
$$
\begin{aligned}
|z_i - \tau d_i| - |z_j - \tau d_j| &\ge |z_i| - \tau|d_i| - |z_j| - \tau|d_j|\\
&= |z_i - z^*_i + z^*_i| - |z_j - z^*_j| - \tau|d_i| - \tau|d_j| \quad (\text{because of } z^*_j = 0,\ \forall j \notin \Gamma^*)\\
&\ge |z^*_i| - |z_i - z^*_i| - |z_j - z^*_j| - \tau^*|d_i - d^*_i| - \tau^*|d_j - d^*_j| \quad (\text{because of } d^* = 0)\\
&\ge |z^*_i| - \max\{1, \tau^*\}\big(|z_i - z^*_i| + |z_j - z^*_j| + |d_i - d^*_i| + |d_j - d^*_j|\big)\\
&> \min_{i\in\Gamma^*}|z^*_i| - 2\max\{1, \tau^*\}\delta^* \quad (\text{because of (51)})\\
&\ge 0. \quad (\text{because of (31)})
\end{aligned}
$$
This means that any $i \in \Gamma^*$ satisfies $i \in T$, namely, $\Gamma^* \subseteq T$.
Overall, we have proved $\Gamma^* \subseteq \operatorname{supp}(z) \cap T$ for all $T \in \mathcal{T}_B(u;\tau)$. Finally, the proof of Property 13 ii) says that $\mathcal{T}_B(z^*;\tau^*)$ contains every $T^*$ which satisfies $T^* \supseteq \Gamma^*$ and $|T^*| = s$. In particular, it covers all $T \in \mathcal{T}_B(u;\tau)$ because of $T \supseteq \Gamma^*$ and $|T| = s$, namely, $\mathcal{T}_B(u;\tau) \subseteq \mathcal{T}_B(u^*;\tau^*)$.

Proof of Theorem 12

Proof i) Since $z^*$ is a strong $\tau^*$-stationary point, $F_{\tau^*}(u^*;T^*) = 0$ for any $T^* \in \mathcal{T}_B(u^*;\tau^*)$ from Theorem 7, namely,
$$z^*_{\overline{T}^*} = 0, \quad d^*_{T^*} = 0, \quad d^* = \nabla\ell(z^*). \qquad (52)$$
Choose $T_0 \in \mathcal{T}_B(u^0;\tau)$. Lemma 11 and $u^0 \in N_S(u^*,\delta^*)$ suffice to $T_0 \in \mathcal{T}_B(u^0;\tau) \subseteq \mathcal{T}_B(u^*;\tau^*)$. This together with (52) derives
$$z^*_{\overline{T}_0} = 0, \quad d^*_{T_0} = \nabla_{T_0}\ell(z^*) = 0. \qquad (53)$$
For any $0 \le t \le 1$, denote
$$\begin{bmatrix} z(t)\\ d(t)\end{bmatrix} = u(t) := u^* + t(u^0 - u^*) = \begin{bmatrix} z^* + t(z^0 - z^*)\\ d^* + t(d^0 - d^*)\end{bmatrix}.$$
It is easy to verify that $u(t) \in N_S(u^*,\delta^*)$. This together with Property 1 iv) generates
$$\max\big\{\|\nabla^2_{T_0\cdot}\ell(z^0) - \nabla^2_{T_0\cdot}\ell(z(t))\|,\ \|\nabla^2_{\overline{T}_0\cdot}\ell(z^0) - \nabla^2_{\overline{T}_0\cdot}\ell(z(t))\|\big\} \le \|\nabla^2\ell(z^0) - \nabla^2\ell(z(t))\| \le \gamma_x\|z^0 - z(t)\| = (1-t)\gamma_x\|z^0 - z^*\|, \qquad (54)$$
where $\nabla^2_{T\cdot}\ell(z)$ is the sub-matrix of $\nabla^2\ell(z)$ containing the rows indexed on $T$. Moreover, by Taylor expansion, one has
$$\nabla\ell(z^0) - \nabla\ell(z^*) = \int_0^1 \nabla^2\ell(z(t))(z^0 - z^*)\,dt. \qquad (55)$$
It follows from $u^0 \in N_S(u^*, \min\{\delta^*, \delta^*_x\})$ that $\|z^0 - z^*\| \le \|u^0 - u^*\| < \delta^*$, which combining with (24) in Theorem 9 leads to
$$\|(\nabla^2_{T_0,T_0}\ell(z^0))^{-1}\| \le \mu^* = 4n\,e^{2\sqrt{n\lambda_x}(\|z^*\| + \delta^*)}/\lambda. \qquad (56)$$
Now we have the following chain of inequalities:
$$
\begin{aligned}
\|z^1 - z^*\| &= \big(\|z^1_{T_0} - z^*_{T_0}\|^2 + \|z^1_{\overline{T}_0} - z^*_{\overline{T}_0}\|^2\big)^{1/2} = \|z^1_{T_0} - z^*_{T_0}\|\\
&\overset{(28)}{=} \big\|z^0_{T_0} - z^*_{T_0} - (\nabla^2_{T_0,T_0}\ell(z^0))^{-1}\big(\nabla_{T_0}\ell(z^0) - \nabla^2_{T_0,\overline{T}_0}\ell(z^0)\, z^0_{\overline{T}_0}\big)\big\|\\
&\overset{(56)}{\le} \mu^*\big\|\nabla^2_{T_0,T_0}\ell(z^0)(z^0_{T_0} - z^*_{T_0}) - \nabla_{T_0}\ell(z^0) + \nabla^2_{T_0,\overline{T}_0}\ell(z^0)\, z^0_{\overline{T}_0}\big\|\\
&\overset{(53)}{=} \mu^*\big\|\nabla^2_{T_0,T_0}\ell(z^0)(z^0_{T_0} - z^*_{T_0}) - \nabla_{T_0}\ell(z^0) + \nabla_{T_0}\ell(z^*) + \nabla^2_{T_0,\overline{T}_0}\ell(z^0)(z^0_{\overline{T}_0} - z^*_{\overline{T}_0})\big\|\\
&\overset{(55)}{=} \mu^*\Big\|\nabla^2_{T_0\cdot}\ell(z^0)(z^0 - z^*) - \int_0^1\nabla^2_{T_0\cdot}\ell(z(t))(z^0 - z^*)\,dt\Big\|\\
&= \mu^*\Big\|\int_0^1\big[\nabla^2_{T_0\cdot}\ell(z^0) - \nabla^2_{T_0\cdot}\ell(z(t))\big](z^0 - z^*)\,dt\Big\|\\
&\le \mu^*\int_0^1\|\nabla^2_{T_0\cdot}\ell(z^0) - \nabla^2_{T_0\cdot}\ell(z(t))\|\,\|z^0 - z^*\|\,dt\\
&\overset{(54)}{\le} \mu^*\gamma_x\|z^0 - z^*\|^2\int_0^1(1-t)\,dt = 0.5\,\mu^*\gamma_x\|z^0 - z^*\|^2. \qquad (57)
\end{aligned}
$$
Next we estimate $\|d^1 - d^*\|$. One can verify that
$$\mu^*\lambda_x = \frac{4n\,e^{2\sqrt{n\lambda_x}(\|z^*\| + \delta^*)}}{\lambda}\,\lambda_x > \frac{\lambda_x}{\lambda/(4n)} = \frac{\lambda_{\max}(X^\top X)}{\lambda} \overset{(22)}{=} \frac{\lambda_{\max}(X^\top X)}{\min_{|T|\le s}\lambda_{\min}(X_T^\top X_T)} \ge 1. \qquad (58)$$
Similarly, we have the following chain of inequalities:
$$
\begin{aligned}
\|d^1 - d^*\| &= \big(\|d^1_{T_0} - d^*_{T_0}\|^2 + \|d^1_{\overline{T}_0} - d^*_{\overline{T}_0}\|^2\big)^{1/2} \overset{(28),(53)}{=} \|d^1_{\overline{T}_0} - d^*_{\overline{T}_0}\|\\
&= \|\nabla_{\overline{T}_0}\ell(z^0) + \nabla^2_{\overline{T}_0\cdot}\ell(z^0)(z^1 - z^0) - \nabla_{\overline{T}_0}\ell(z^*)\|\\
&= \|\nabla_{\overline{T}_0}\ell(z^0) - \nabla_{\overline{T}_0}\ell(z^*) - \nabla^2_{\overline{T}_0\cdot}\ell(z^0)(z^0 - z^*) + \nabla^2_{\overline{T}_0\cdot}\ell(z^0)(z^1 - z^*)\|\\
&\overset{(55)}{=} \Big\|\int_0^1\big[\nabla^2_{\overline{T}_0\cdot}\ell(z(t)) - \nabla^2_{\overline{T}_0\cdot}\ell(z^0)\big](z^0 - z^*)\,dt + \nabla^2_{\overline{T}_0\cdot}\ell(z^0)(z^1 - z^*)\Big\|\\
&\overset{(54)}{\le} \gamma_x\|z^0 - z^*\|^2\int_0^1(1-t)\,dt + \|\nabla^2_{\overline{T}_0\cdot}\ell(z^0)\|\,\|z^1 - z^*\|\\
&\le 0.5\,\gamma_x\|z^0 - z^*\|^2 + \|\nabla^2\ell(z^0)\|\,\|z^1 - z^*\| \le 0.5\,\gamma_x\|z^0 - z^*\|^2 + \lambda_x\|z^1 - z^*\|\\
&\overset{(57)}{\le} 0.5\,\gamma_x(1 + \mu^*\lambda_x)\|z^0 - z^*\|^2 \le \gamma_x\mu^*\lambda_x\|z^0 - z^*\|^2. \qquad (59)
\end{aligned}
$$
Based on (57) and (59), we have
$$\|u^1 - u^*\|^2 = \|z^1 - z^*\|^2 + \|d^1 - d^*\|^2 \le \tfrac{1}{4}(\gamma_x\mu^*)^2\|z^0 - z^*\|^4 + (\gamma_x\mu^*\lambda_x)^2\|z^0 - z^*\|^4 \overset{(58)}{\le} \tfrac{5}{4}(\gamma_x\mu^*\lambda_x)^2\|z^0 - z^*\|^4,$$
and hence, by the definition of $\delta^*_x$,
$$\|u^1 - u^*\| \le (0.5/\delta^*_x)\|z^0 - z^*\|^2 \le (0.5/\delta^*_x)\|u^0 - u^*\|^2, \qquad (60)$$
which gives rise to
$$\|u^1 - u^*\| \le (0.5/\delta^*_x)\|u^0 - u^*\|\cdot\|u^0 - u^*\| \le 0.5\,\|u^0 - u^*\|. \qquad (61)$$
The above inequality suffices to $\|u^1 - u^*\| \le \|u^0 - u^*\| < \min\{\delta^*, \delta^*_x\}$. In addition, $z^1_{\overline{T}_0} = 0$ from (28) means $z^1 \in S$. Then $u^1 \in N_S(u^*, \min\{\delta^*, \delta^*_x\})$. Similar reasons to derive (60) allow us to get $\|u^2 - u^*\| \le (0.5/\delta^*_x)\|u^1 - u^*\|^2$. By induction, one can conclude that $u^k \in N_S(u^*, \min\{\delta^*, \delta^*_x\})$ and
$$\|u^{k+1} - u^*\| \le (0.5/\delta^*_x)\|u^k - u^*\|^2. \qquad (62)$$
Hence $\lim_{k\to\infty} u^k = u^*$, and the sequence $\{z^k\}$ has a quadratic convergence rate.

ii) The above proof shows that $u^k \in N_S(u^*, \min\{\delta^*, \delta^*_x\})$ for any $k \ge 0$, which indicates $u^k \in N_S(u^*, \delta^*)$. This combining Lemma 11 directly derives the claim.

iii) From $u^k \in N_S(u^*, \min\{\delta^*, \delta^*_x\})$ and (62), we get
$$\|u^{k+1} - u^*\| \le 0.5\,\|u^k - u^*\|. \qquad (63)$$
By the last equation of (26), we have
$$d^{k+1} = \nabla\ell(z^k) + \nabla^2\ell(z^k)(z^{k+1} - z^k). \qquad (64)$$
In addition, it follows from ii), $T_{k+1} \in \mathcal{T}_B(u^*;\tau^*)$, Property 4 and (52) that
$$
\begin{aligned}
\|z^*\|_0 = s &\implies \operatorname{supp}(z^*) = T_{k+1} \implies z^*_{\overline{T}_{k+1}} = 0,\ d^*_{T_{k+1}} \overset{(52)}{=} 0,\\
\|z^*\|_0 < s &\implies \operatorname{supp}(z^*) \subset T_{k+1},\ d^* \overset{(52)}{=} \nabla\ell(z^*) \overset{\text{Property 4 i)}}{=} 0 \implies z^*_{\overline{T}_{k+1}} = 0,\ d^*_{T_{k+1}} = 0.
\end{aligned} \qquad (65)
$$
These give rise to
$$F_\tau(u^{k+1};T_{k+1}) \overset{(20)}{=} \begin{bmatrix} d^{k+1}_{T_{k+1}}\\ z^{k+1}_{\overline{T}_{k+1}}\\ d^{k+1} - \nabla\ell(z^{k+1})\end{bmatrix} \overset{(65)}{=} \begin{bmatrix} d^{k+1}_{T_{k+1}} - d^*_{T_{k+1}}\\ z^{k+1}_{\overline{T}_{k+1}} - z^*_{\overline{T}_{k+1}}\\ d^{k+1} - \nabla\ell(z^{k+1})\end{bmatrix} \overset{(64)}{=} \begin{bmatrix} d^{k+1}_{T_{k+1}} - d^*_{T_{k+1}}\\ z^{k+1}_{\overline{T}_{k+1}} - z^*_{\overline{T}_{k+1}}\\ \nabla\ell(z^k) + \nabla^2\ell(z^k)(z^{k+1} - z^k) - \nabla\ell(z^{k+1})\end{bmatrix}.$$
The identity $\nabla\ell(z^{k+1}) - \nabla\ell(z^k) = \int_0^1\nabla^2\ell(z^k + t(z^{k+1} - z^k))(z^{k+1} - z^k)\,dt$ enables us to derive that
$$
\begin{aligned}
\mathrm{I} &:= \|\nabla\ell(z^k) + \nabla^2\ell(z^k)(z^{k+1} - z^k) - \nabla\ell(z^{k+1})\|\\
&= \Big\|\int_0^1\big(\nabla^2\ell(z^k + t(z^{k+1} - z^k)) - \nabla^2\ell(z^k)\big)(z^{k+1} - z^k)\,dt\Big\|\\
&\le \int_0^1\|\nabla^2\ell(z^k + t(z^{k+1} - z^k)) - \nabla^2\ell(z^k)\|\,\|z^{k+1} - z^k\|\,dt\\
&\overset{(11)}{\le} \int_0^1 t\gamma_x\|z^{k+1} - z^k\|^2\,dt = 0.5\,\gamma_x\|z^{k+1} - z^k\|^2 \le 0.5\,\gamma_x\|u^{k+1} - u^k\|^2\\
&\le \gamma_x\big(\|u^{k+1} - u^*\|^2 + \|u^k - u^*\|^2\big) \overset{(63)}{\le} 1.25\,\gamma_x\|u^k - u^*\|^2.
\end{aligned}
$$
Next we estimate
$$\mathrm{II} := \big(\|d^{k+1}_{T_{k+1}} - d^*_{T_{k+1}}\|^2 + \|z^{k+1}_{\overline{T}_{k+1}} - z^*_{\overline{T}_{k+1}}\|^2\big)^{1/2} \le \|u^{k+1} - u^*\| \le (0.5/\delta^*_x)\|u^k - u^*\|^2.$$
Based on those facts, we obtain
$$\|F_\tau(u^{k+1};T_{k+1})\| = \sqrt{\mathrm{I}^2 + \mathrm{II}^2} \le \big[(0.5/\delta^*_x)^2 + (1.25\,\gamma_x)^2\big]^{1/2}\|u^k - u^*\|^2 = c_x\|u^k - u^*\|^2 \le c_x\,4^{-k}\|u^0 - u^*\|^2,$$
which suffices to the conclusion.
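To close, the whole argument of Theorem 12 can be watched in action: select the working set from the $s$ largest entries of $|z - \tau d|$, solve the $s\times s$ Newton system on it, zero out the complement, and repeat. The sketch below is a simplified rendering of scheme (26)-(28) on synthetic data (all names are illustrative, and it recomputes $d$ as the plain gradient each round rather than via (64)); the printed step sizes typically collapse at the quadratic rate predicted by (62):

```python
import numpy as np

def nslr_sketch(X, y, s, tau=0.5, iters=30, tol=1e-12):
    # Simplified Newton iteration for sparsity-constrained logistic regression.
    n, p = X.shape
    z = np.zeros(p)
    for _ in range(iters):
        h = 1.0 / (1.0 + np.exp(-(X @ z)))
        d = X.T @ (h - y) / n                             # gradient of l at z
        T = np.sort(np.argsort(np.abs(z - tau * d))[-s:]) # working set
        Tb = np.setdiff1d(np.arange(p), T)
        D = h * (1.0 - h)                                 # diagonal of D(z)
        H_TT = X[:, T].T @ (X[:, T] * D[:, None]) / n
        rhs = d[T] - (X[:, T].T @ (X[:, Tb] * D[:, None]) / n) @ z[Tb]
        z_new = np.zeros(p)
        z_new[T] = z[T] - np.linalg.solve(H_TT, rhs)
        gap = np.linalg.norm(z_new - z)
        print(f"step size {gap:.3e}")                     # shrinks fast near z*
        if gap < tol:
            return z_new
        z = z_new
    return z

rng = np.random.default_rng(1)
n, p, s = 200, 50, 5
X = rng.standard_normal((n, p))
z_true = np.zeros(p); z_true[:s] = 1.0
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-(X @ z_true)))).astype(float)
z_hat = nslr_sketch(X, y, s)
print("selected support:", np.flatnonzero(z_hat))
```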