An efficient Hessian based algorithm for solving large-scale sparse group Lasso problems
Yangjing Zhang ∗ Ning Zhang † Defeng Sun ‡ Kim-Chuan Toh § December 15, 2017
Abstract.
The sparse group Lasso is a widely used statistical model which encourages the sparsity both on a group and within the group level. In this paper, we develop an efficient augmented Lagrangian method for large-scale non-overlapping sparse group Lasso problems with each subproblem being solved by a superlinearly convergent inexact semismooth Newton method. Theoretically, we prove that, if the penalty parameter is chosen sufficiently large, the augmented Lagrangian method converges globally at an arbitrarily fast linear rate for the primal iterative sequence, the dual infeasibility, and the duality gap of the primal and dual objective functions. Computationally, we derive explicitly the generalized Jacobian of the proximal mapping associated with the sparse group Lasso regularizer and exploit fully the underlying second order sparsity through the semismooth Newton method. The efficiency and robustness of our proposed algorithm are demonstrated by numerical experiments on both the synthetic and real data sets.
Keywords.
Sparse group Lasso, generalized Jacobian, augmented Lagrangian method, semismooth Newton method
Mathematics Subject Classification.
∗ Department of Mathematics, National University of Singapore, 10 Lower Kent Ridge Road, Singapore 119076 ([email protected]).
† Department of Mathematics, National University of Singapore, 10 Lower Kent Ridge Road, Singapore 119076. Future Resilient Systems, Singapore-ETH Centre. The Singapore-ETH Centre (SEC) was established as a collaboration between ETH Zurich and National Research Foundation (NRF) Singapore (FI 370074011) under the auspices of the NRF's Campus for Research Excellence and Technological Enterprise (CREATE) programme ([email protected]).
‡ Department of Applied Mathematics, The Hong Kong Polytechnic University, Hung Hom, Hong Kong ([email protected]). On leave from Department of Mathematics, National University of Singapore.
§ Department of Mathematics, and Institute of Operations Research and Analytics, National University of Singapore, 10 Lower Kent Ridge Road, Singapore 119076 ([email protected]).

1 Introduction
In this paper, we aim to design a fast algorithm for solving the following sparse group Lasso (SGLasso) problem:

    min_{x ∈ R^n}  (1/2)‖Ax − b‖^2 + λ_1‖x‖_1 + λ_2 Σ_{l=1}^g w_l ‖x_{G_l}‖,        (1)

where A : R^n → R^m is a linear map, b ∈ R^m is the given response vector, λ_1 ≥ 0 and λ_2 ≥ 0 are given regularization parameters, and, for each l = 1, 2, . . . , g, w_l > 0 is a given weight.
The set G_l ⊆ {1, 2, . . . , n} contains the indices corresponding to the l-th group of features. We denote the restriction of the vector x to the index set G_l by x_{G_l}. Here ‖·‖_1 and ‖·‖ denote the ℓ_1 norm and the ℓ_2 norm, respectively. For convenience, we denote the SGLasso regularizer by the proper closed convex function p(x) := λ_1‖x‖_1 + λ_2 Σ_{l=1}^g w_l ‖x_{G_l}‖, ∀ x ∈ R^n.
In recent decades, high dimensional feature selection problems have become increasingly important, and penalized regression models have been proven to be particularly useful for these feature selection problems. For many such problems in real applications, the number of predictors n is much larger than the number of observations m. A notable example of the penalized regression model is the Lasso model that was first proposed by Tibshirani [41]. The problem (1) contains the Lasso problem as a special case if we take the parameter λ_2 = 0. On the other hand, by assuming that some prior information about the group structure of the underlying solution x is known, Yuan and Lin [45] proposed the group Lasso model, i.e., problem (1) with parameters λ_1 = 0 and λ_2 ≥ 0.
The group Lasso can select a small set of groups. However, it does not ensure sparsity within each group. For the purpose of achieving sparsity of groups and within each group, Friedman et al. [14] proposed the SGLasso model (1), potentially with overlaps between groups. Apart from the above penalized regression models, there exist a number of variants with different regularizers, such as the fused Lasso [42] and the network Lasso [17].
The SGLasso model has been widely applied to different fields, such as text processing, bioinformatics, signal interpretation, and object tracking (e.g., [20, 19, 32, 21, 10, 48]). Its wide ranging applications have inspired many researchers to design various algorithms for solving the SGLasso problem. These algorithms include the (accelerated) proximal gradient method (see e.g., [1, 45]), the (randomized) block coordinate descent algorithm (see e.g., [35, 38, 34]), and the alternating direction method of multipliers (see e.g., [2]). To the best of our knowledge, these existing algorithms are first order methods that are applied directly to the primal problem (1), and they hardly utilize any second order information. In contrast, we aim to design an efficient second order information based algorithm for solving the dual problem of the SGLasso problem (1). The motivations for adopting the dual approach will be explained in the following paragraphs.
Solving the SGLasso problem is especially challenging when there are overlapping groups because of the complex structure of the SGLasso regularizer p. The complicated composite structure of p generally makes it impossible to compute its proximal mapping analytically. However, the efficient computation of such a proximal mapping is indispensable to a number of algorithms, and many of the papers mentioned in the last paragraph thus considered the simpler case of the non-overlapping SGLasso problem. As a first attempt to design a Hessian based algorithm for the SGLasso problem, we will also focus on the simpler case of the non-overlapping SGLasso problem. The non-overlapping case can be treated as a preliminary study towards the final goal of designing a Hessian based algorithm for solving the overlapping SGLasso problem. For the rest of this paper, we make the following blanket assumption.

Assumption 1.1.
The different groups G_l, l = 1, 2, . . . , g, form a partition of {1, 2, . . . , n}, i.e., G_i ∩ G_j = ∅ for all 1 ≤ i < j ≤ g, and ∪_{l=1}^g G_l = {1, 2, . . . , n}.
In order to solve the non-overlapping SGLasso problem, we aim to use the semismooth Newton (SSN) augmented Lagrangian (Ssnal) framework
for solving the dual problem of (1). This approach is motivated by the superior numerical performance of the Ssnal method
when applied to the dual of the Lasso problem [24] and that of the fused Lasso problem [25]. We note that the objective functions of the Lasso and fused Lasso problems are piecewise linear-quadratic, and therefore, as proven in [24, 25], both the primal and dual iterates generated by the augmented Lagrangian method (ALM) are asymptotically superlinearly convergent. It is this attractive convergence property that leads to the impressive numerical performance of the Ssnal method.
However, the regularizer p in the objective function of the SGLasso problem (1) is no longer a polyhedral function due to the presence of the ℓ_2 norm. As a result, the asymptotic superlinear convergence of both the primal and dual iterative sequences generated by the ALM is no longer guaranteed to hold by the existing theoretical results. Fortunately, by leveraging on the recent advances made in Cui, Sun, and Toh [8] on the analysis of the asymptotic R-superlinear convergence of the ALM for convex composite conic programming, we are able to establish the global linear convergence (with an arbitrary rate) of the primal iterative sequence, the dual infeasibility, and the dual function values generated by the ALM for the SGLasso problem. With this convergence result, we could expect the ALM to be highly efficient for solving the SGLasso problem.
The remaining challenge of designing an efficient ALM to solve (1) is in solving the subproblem in each iteration. As inspired by the success in [24, 25], we will design a highly efficient SSN method for solving the subproblem in each ALM iteration. The effectiveness of the SSN method relies greatly on the efficient computation of the generalized Jacobian of the proximal mapping associated with the SGLasso regularizer p. Thus a major contribution of this paper is to analyse the structure of the generalized Jacobian and its efficient computation. As far as we know, the elements in the generalized Jacobian of the proximal mapping of p have not been derived before, and this paper aims to derive an explicit formula for them. We note that the SGLasso regularizer p enjoys the "prox-decomposition" property [43], similar to the fused Lasso regularizer (see [25]). With the "prox-decomposition" property and some necessary properties of the ℓ_1 norm and the ℓ_2 norm, we are able to derive an explicit formula for the generalized Jacobian of the proximal mapping of p. Based on the structure of the generalized Jacobian of the proximal mapping of p, we can derive a certain structured sparsity (which we name as the second order sparsity) of the Hessians associated with the objective function in each ALM subproblem to implement the SSN method efficiently. Moreover, the SSN method will be proven to have superlinear/quadratic convergence. In a nutshell, the globally fast linear convergence (with an arbitrary linear rate) of the ALM and the superlinear/quadratic convergence of the SSN method for solving each ALM subproblem can guarantee that our Ssnal method is highly efficient and robust for solving large-scale SGLasso problems.
The rest of this paper is organized as follows. Section 2 demonstrates the decomposition property of the SGLasso regularizer and provides theoretical conditions for ensuring the global fast linear convergence of the ALM. The explicit formulation of the generalized Jacobian of the proximal mapping of the SGLasso regularizer is derived in section 3. In section 4, we design the semismooth Newton based augmented Lagrangian method (Ssnal)
for solving the dual of the SGLasso problem (1) and derive our main convergence results. We will also present efficient techniques for implementing Ssnal.
Section 5 evaluates the performance of Ssnal
on both the synthetic and real data sets. Finally, concluding remarks are given in section 6.
Notation.
For a linear map A : R^n → R^m, we denote its adjoint by A^*. For any convex function p, we denote its conjugate function by p^*, i.e., p^*(x) = sup_z {⟨x, z⟩ − p(z)}. For each l ∈ {1, 2, . . . , g}, we define the linear operator P_l : R^n → R^{|G_l|} by P_l x = x_{G_l}. Let s := Σ_{l=1}^g |G_l|. Define P := [P_1; P_2; . . . ; P_g] : R^n → R^s and B := B_{λ_2,1} × · · · × B_{λ_2,g}, where B_{λ_2,l} := { u_l ∈ R^{|G_l|} | ‖u_l‖ ≤ λ_{2,l} } and λ_{2,l} := λ_2 w_l. For a given closed convex set Ω and a vector x, denote the distance of x to Ω by dist(x, Ω) := inf_{x'∈Ω} ‖x − x'‖ and the Euclidean projection of x onto Ω by Π_Ω(x) := argmin_{x'∈Ω} ‖x − x'‖. We define sign(·) in a component-wise fashion such that sign(t) = 1 if t > 0,
sign(t) = 0 if t = 0, and sign(t) = −1 if t < 0.
For any functions f and g, define (f ∘ g)(·) := f(g(·)). We denote the Hadamard product by ⊙. For a given vector x, supp(x) denotes the support of x, i.e., the set of indices i such that x_i ≠ 0. We denote the vector of all ones by e. For a matrix A and a vector a, we denote by diag(A) and Diag(a) the diagonal vector of A and the diagonal matrix whose diagonal elements are the components of a, respectively.

2 Preliminaries

In this section, we establish the decomposition property of the SGLasso regularizer p and present some general error bound results. The SGLasso problem (1) can be written equivalently as follows:

    (P)    min_{x ∈ R^n}  { h(x) := f(x) + p(x) },

where f(x) := (1/2)‖Ax − b‖^2, p(x) := ϕ(x) + φ(x), ϕ(x) := λ_1‖x‖_1, and φ(x) := Σ_{l=1}^g λ_{2,l} ‖x_{G_l}‖ with λ_{2,l} := λ_2 w_l, l = 1, 2, . . . , g. The dual problem of (P) takes the following form:

    (D)    max_{y,z}  { g(y, z) := −⟨b, y⟩ − (1/2)‖y‖^2 − p^*(z) }   s.t.  A^* y + z = 0.

In addition, the Karush-Kuhn-Tucker (KKT) optimality system associated with (P) and (D) is given by

    Ax − y − b = 0,   Prox_p(x + z) − x = 0,   A^* y + z = 0,        (2)

where the proximal mapping of p is defined by

    Prox_p(u) := argmin_x { p(x) + (1/2)‖x − u‖^2 },  ∀ u ∈ R^n.        (3)

For any given parameter t > 0 and any closed proper convex function h, the following Moreau identity will be frequently used:

    Prox_{th}(u) + t Prox_{h^*/t}(u/t) = u.        (4)

It is well known that the proximal mappings of the ℓ_1 norm and the ℓ_2 norm can be expressed as follows: for any given c > 0,

    Prox_{c‖·‖_1}(u) = sign(u) ⊙ max{|u| − c e, 0},
    Prox_{c‖·‖}(u) = (u/‖u‖) max{‖u‖ − c, 0} if u ≠ 0, and Prox_{c‖·‖}(u) = 0 otherwise.
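To make these two formulas concrete, the following is a minimal NumPy sketch of the proximal mappings of c‖·‖_1 and c‖·‖; the function names prox_l1 and prox_l2 are ours, not from the paper.

```python
import numpy as np

def prox_l1(u, c):
    """Proximal mapping of c*||.||_1: componentwise soft-thresholding."""
    return np.sign(u) * np.maximum(np.abs(u) - c, 0.0)

def prox_l2(u, c):
    """Proximal mapping of c*||.||_2: shrink the whole vector towards the origin."""
    norm_u = np.linalg.norm(u)
    if norm_u <= c:
        return np.zeros_like(u)
    return (1.0 - c / norm_u) * u
```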
We will show in the next lemma that the proximal mappings of the ℓ_1 norm and the ℓ_2 norm are strongly semismooth. The definition of semismoothness was first introduced by Mifflin [31] for functionals and extended to vector-valued functions in [33].

Lemma 2.1. For any c > 0, the proximal mappings Prox_{c‖·‖_1}(·) and Prox_{c‖·‖}(·) are strongly semismooth.

Proof. Since Prox_{c‖·‖_1}(·) is a Lipschitz continuous piecewise affine function, it follows from [11, Proposition 7.4.7] that Prox_{c‖·‖_1}(·) is strongly semismooth everywhere. Next, we focus on the proximal mapping Prox_{c‖·‖}(·). From the definition of Prox_{c‖·‖}(·) and the fact that the projection of any vector onto the second order cone, i.e., the epigraph of the ℓ_2 norm function, is strongly semismooth [5, Proposition 4.3], we can obtain the conclusion directly from [30, Theorem 4].

Next, we analyse the vital decomposition property, which is termed the "prox-decomposition" in [43], of the SGLasso regularizer p. In the next proposition, we show that the proximal mapping Prox_p(·) of p = ϕ + φ can be decomposed into the composition of the proximal mappings Prox_ϕ(·) and Prox_φ(·). With this decomposition property, we are able to compute Prox_p(·) in a closed form. This decomposition result was proved in [44, Theorem 1], which is mainly an extension of that for the fused Lasso regularizer in [13]. Here, we give another short proof based on the systematic investigation in [43].

Proposition 2.1.
Under Assumption 1.1, it holds that
    Prox_p(u) = Prox_φ ∘ Prox_ϕ(u),  ∀ u ∈ R^n.

Proof.
Under Assumption 1.1, the function p has a separable structure. Hence, the problem (3) is separable for each group. Therefore, it is sufficient to prove that

    Prox_{λ_1‖·‖_1 + λ_{2,l}‖·‖}(u_l) = Prox_{λ_{2,l}‖·‖} ∘ Prox_{λ_1‖·‖_1}(u_l),  ∀ u_l ∈ R^{|G_l|},  l = 1, 2, . . . , g.

By [43, Theorem 1], for each l ∈ {1, 2, . . . , g}, it suffices to show that

    ∂(λ_1‖u_l‖_1) ⊆ ∂(λ_1‖v_l‖_1),   v_l := Prox_{λ_{2,l}‖·‖}(u_l),  ∀ u_l ∈ R^{|G_l|}.

For any given u_l ∈ R^{|G_l|}, we discuss the following two cases.
Case 1: If ‖u_l‖ ≤ λ_{2,l}, then v_l = 0. It follows that ∂(λ_1‖v_l‖_1) = [−λ_1, λ_1]^{|G_l|}, which obviously contains ∂(λ_1‖u_l‖_1).
Case 2: If ‖u_l‖ > λ_{2,l}, then v_l = (1 − λ_{2,l}/‖u_l‖) u_l, which implies that sign(v_l) = sign(u_l). Thus, it holds that ∂(λ_1‖u_l‖_1) = ∂(λ_1‖v_l‖_1). Hence, the proof is completed.

Consider an arbitrary point u ∈ R^n. Based on the above proposition, we are now ready to compute Prox_p(u) explicitly. Let v := Prox_ϕ(u). For each group G_l, l = 1, 2, . . . , g, it holds that

    argmin_{x_{G_l}} { λ_{2,l}‖x_{G_l}‖ + (1/2)‖x_{G_l} − v_{G_l}‖^2 } = v_{G_l} − Π_{B_{λ_2,l}}(v_{G_l}).

That is, P_l Prox_φ(v) = P_l v − Π_{B_{λ_2,l}}(P_l v). Therefore, we have

    Prox_p(u) = Prox_φ(v) = v − P^* Π_B(P v).        (5)
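For illustration, here is a minimal NumPy sketch of Prox_p computed via Proposition 2.1 and (5): soft-thresholding followed by groupwise shrinkage. The function name prox_sglasso and the argument layout (groups as a list of index arrays) are our own conventions, not from the paper.

```python
import numpy as np

def prox_sglasso(u, groups, lam1, lam2, w):
    """Proximal mapping of p = lam1*||.||_1 + sum_l lam2*w[l]*||x_{G_l}||,
    computed as Prox_phi followed by groupwise l2-shrinkage, cf. (5).
    `groups` is a list of index arrays forming a partition of {0,...,n-1}."""
    v = np.sign(u) * np.maximum(np.abs(u) - lam1, 0.0)   # v = Prox_phi(u)
    x = np.empty_like(v)
    for l, G in enumerate(groups):
        c = lam2 * w[l]                                   # lambda_{2,l}
        nrm = np.linalg.norm(v[G])
        x[G] = 0.0 if nrm <= c else (1.0 - c / nrm) * v[G]
    return x
```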
For the rest of this section, we introduce some error bound results that will be used later in the convergence rate analysis. Define the proximal residual function R : R^n → R^n by

    R(x) := x − Prox_p(x − ∇f(x)),  ∀ x ∈ R^n,        (6)

and the set-valued map T : R^n ⇒ R^n by

    T(x) := { v ∈ R^n | v ∈ ∇f(x) + ∂p(x) },  ∀ x ∈ R^n.        (7)

Suppose that λ_1 + λ_2 > 0. Let Ω_P be the optimal solution set of (P). Since f is nonnegative on R^n, it is easy to obtain that h(x) → +∞ as ‖x‖ → +∞. Thus, Ω_P is a compact convex set. The first order optimality condition of (P) implies that x̄ ∈ Ω_P is equivalent to 0 ∈ T(x̄), which in turn is equivalent to R(x̄) = 0. It is proved in [46, Theorem 1] that the local error bound condition (in the sense of Luo and Tseng [28]) holds around the optimal set Ω_P, i.e., for every ξ ≥ inf_x h(x), there exist positive scalars κ and δ such that

    dist(x, Ω_P) ≤ κ ‖R(x)‖,  ∀ x ∈ R^n satisfying h(x) ≤ ξ and ‖R(x)‖ ≤ δ.        (8)

Therefore, by using the facts that Ω_P is compact and that R is continuous, we know that for any r > 0, there exists κ > 0 such that
    dist(x, Ω_P) ≤ κ ‖R(x)‖,  ∀ x ∈ R^n satisfying dist(x, Ω_P) ≤ r.        (9)

Furthermore, by mimicking the proofs in [9, Theorem 3.1] or [7, Proposition 2.4] and noting that Ω_P is a compact set, we can obtain the following result with no difficulty.

Proposition 2.2.
For any r > 0, there exists κ > 0 such that

    dist(x, Ω_P) ≤ κ dist(0, T(x)),  ∀ x ∈ R^n satisfying dist(x, Ω_P) ≤ r.

3 Generalized Jacobian of Prox_p(·)

In this section, we shall analyse the generalized Jacobian of the proximal mapping Prox_p(·) of the SGLasso regularizer p. From Proposition 2.1, for any u ∈ R^n, we have Prox_p(u) = Prox_φ(Prox_ϕ(u)). At first glance, we may try to apply the chain rule in deriving the generalized Jacobian of Prox_p(·). Indeed, it was illustrated in [39] that under certain conditions, the generalized Jacobian of a composite function can be obtained by the chain rule in a similar fashion as in finding the ordinary Jacobian of a composite smooth function. Specifically, if the conditions in [39, Lemma 2.1] hold, then we could have obtained by the chain rule the following B-subdifferential [6], which is a subset of the generalized Jacobian,

    ∂_B Prox_p(u) = { Θ̃ · Θ | Θ̃ ∈ ∂_B Prox_φ(v), Θ ∈ ∂_B Prox_ϕ(u), v = Prox_ϕ(u) }.

However, the conditions in [39, Lemma 2.1] may not hold in our context, and consequently the above equation is usually invalid. Therefore, the generalized Jacobian of Prox_p(·) is nontrivial to obtain, and we have to find an alternative surrogate to bypass this difficulty. The challenge just highlighted also appeared in [25] when analysing the generalized Jacobian of the proximal mapping of the fused Lasso regularizer. In that work, a general definition (see [25, Definition 1]) of "semismoothness with respect to a multifunction" was adopted, and such a multifunction was constructed to play the role of the generalized Jacobian. Here, we shall use the same strategy, and our task now is to identify such a multifunction.
Before characterizing the multifunction relating to the semismoothness, based on the fact in (5) that Prox_φ(v) = v − P^* Π_B(P v), ∀ v ∈ R^n, we define the following alternative for the generalized Jacobian of Prox_φ(·):

    ∂̂ Prox_φ(v) := { I − P^* Σ P | Σ = Diag(Σ_1, . . . , Σ_g), Σ_l ∈ ∂Π_{B_{λ_2,l}}(v_{G_l}), l = 1, 2, . . . , g }.

It can be observed that the main part of ∂̂ Prox_φ(·) is the block diagonal matrix Σ, of which each block is the generalized Jacobian of a projection operator onto an ℓ_2-norm ball. Since ∂Π_{B_{λ_2,l}}(·) admits a closed form expression, so does ∂̂ Prox_φ(·). Now, we are in a position to present the following multifunction M : R^n ⇒ R^{n×n} and regard it as the surrogate generalized Jacobian of Prox_p(·) at any u ∈ R^n:

    M(u) := { (I − P^* Σ P) Θ | Σ = Diag(Σ_1, . . . , Σ_g), Σ_l ∈ ∂Π_{B_{λ_2,l}}(v_{G_l}), l = 1, 2, . . . , g, v = Prox_ϕ(u), Θ ∈ ∂Prox_ϕ(u) }.        (10)

Remark 3.1.
In numerical computations, for any u ∈ R^n, one needs to construct at least one element in M(u) explicitly. For l = 1, 2, . . . , g and v_l ∈ R^{|G_l|}, the projection onto an ℓ_2-norm ball and its generalized Jacobian are given as follows, respectively:

    Π_{B_{λ_2,l}}(v_l) = λ_{2,l} v_l / ‖v_l‖  if ‖v_l‖ > λ_{2,l},  and  Π_{B_{λ_2,l}}(v_l) = v_l  otherwise;        (11)

    ∂Π_{B_{λ_2,l}}(v_l) =
        { (λ_{2,l}/‖v_l‖)(I − v_l v_l^T/‖v_l‖^2) },        if ‖v_l‖ > λ_{2,l},
        { I − t v_l v_l^T/(λ_{2,l})^2 | 0 ≤ t ≤ 1 },        if ‖v_l‖ = λ_{2,l},
        { I },                                              if ‖v_l‖ < λ_{2,l}.        (12)

Define a vector θ ∈ R^n and construct a matrix Θ ∈ ∂Prox_ϕ(u) as follows:

    θ_i = 0 if |u_i| ≤ λ_1, and θ_i = 1 otherwise,  i = 1, . . . , n.        (13)

Then it is obvious that
    Θ = Diag(θ) ∈ ∂Prox_ϕ(u).        (14)
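To illustrate Remark 3.1, the following NumPy sketch applies one element M of M(u) in (10) to a vector without forming the n × n matrix; at the ball boundary it takes t = 0 in (12). The function name and the group representation are our own.

```python
import numpy as np

def apply_surrogate_jacobian(u, d, groups, lam1, lam2, w):
    """Apply one element M of M(u) from (10) to a vector d, i.e. compute
    M d = (I - P^T Sigma P) Theta d, taking t = 0 in the boundary case of (12).
    `groups` is a list of index arrays partitioning {0, ..., n-1}."""
    theta = (np.abs(u) > lam1).astype(float)             # eqs. (13)-(14)
    v = np.sign(u) * np.maximum(np.abs(u) - lam1, 0.0)   # v = Prox_phi(u)
    h = theta * d                                        # h = Theta d
    out = np.zeros_like(d)
    for l, G in enumerate(groups):
        c, vl, hl = lam2 * w[l], v[G], h[G]
        nrm = np.linalg.norm(vl)
        if nrm > c:      # eq. (12), first case: Sigma_l = (c/nrm)(I - vl vl^T/nrm^2)
            out[G] = hl - (c / nrm) * (hl - vl * (vl @ hl) / nrm**2)
        else:            # ||vl|| <= c with t = 0: Sigma_l = I, so this block vanishes
            out[G] = 0.0
    return out
```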
The following main theorem of this section justifies why M(u) in (10) can be treated as the surrogate generalized Jacobian of Prox_p(·) at u. That is, it shows that the proximal mapping Prox_p is strongly semismooth on R^n with respect to the multifunction M defined in (10) (cf. Definition 1 in [25]).

Theorem 3.1. Assume that Assumption 1.1 holds. Let u ∈ R^n. Then the multifunction M defined in (10) is a nonempty compact valued upper-semicontinuous multifunction, and for any M ∈ M(u), M is symmetric and positive semidefinite. Moreover, for any M ∈ M(w) with w → u,

    Prox_p(w) − Prox_p(u) − M(w − u) = O(‖w − u‖^2).        (15)

Proof.
By Lemma 2.1, Proposition 2.1, and [11, Theorem 7.5.17], one can deduce that the point-to-set map M has nonempty compact images and is upper-semicontinuous, and that equation (15) holds. It remains to show that M is symmetric and positive semidefinite for any M ∈ M(u). Denote v := Prox_ϕ(u). Take M ∈ M(u) arbitrarily. Then, there exist Σ_l ∈ ∂Π_{B_{λ_2,l}}(v_{G_l}), l = 1, 2, . . . , g, and Θ = Diag(θ) ∈ ∂Prox_ϕ(u), defined in (12) and (14), respectively, such that

    M = Σ_{l=1}^g P_l^* (I − Σ_l) P_l Θ.

It suffices to show that P_l^* (I − Σ_l) P_l Θ is symmetric and positive semidefinite for any l ∈ {1, 2, . . . , g}. Denote the index sets

    Ξ_l := G_l ∩ supp(v) = { i ∈ G_l | θ_i = 1 },  l = 1, 2, . . . , g.        (16)

For simplicity, we write v_{G_l} as v_l in the following proof.
Case 1: ‖v_l‖ < λ_{2,l}. By (12), I − Σ_l = 0.
Case 2: ‖v_l‖ = λ_{2,l}. By (12), there exists some t ∈ [0, 1] such that
    P_l^* (I − Σ_l) P_l Θ = (t/(λ_{2,l})^2) (P_l^* v_l)(P_l^* v_l)^T Θ.

By the definition of P_l, we deduce that supp(P_l^* v_l) ⊆ Ξ_l. It follows from (16) that (P_l^* v_l)^T Θ = (P_l^* v_l)^T. That is,

    P_l^* (I − Σ_l) P_l Θ = (t/(λ_{2,l})^2) (P_l^* v_l)(P_l^* v_l)^T,

which is symmetric and positive semidefinite.
Case 3: ‖v_l‖ > λ_{2,l}. From (12) and the proof in Case 2, we have

    P_l^* (I − Σ_l) P_l Θ = P_l^* ( I − (λ_{2,l}/‖v_l‖)(I − v_l v_l^T/‖v_l‖^2) ) P_l Θ
                          = (1 − λ_{2,l}/‖v_l‖) P_l^* P_l Θ + (λ_{2,l}/‖v_l‖^3)(P_l^* v_l)(P_l^* v_l)^T Θ
                          = (1 − λ_{2,l}/‖v_l‖) P_l^* P_l Θ + (λ_{2,l}/‖v_l‖^3)(P_l^* v_l)(P_l^* v_l)^T.

Since both P_l^* P_l and Θ are diagonal, it holds that P_l^* (I − Σ_l) P_l Θ is symmetric. Furthermore, it is obvious that P_l^* P_l Θ is positive semidefinite. Therefore, the last equality implies that P_l^* (I − Σ_l) P_l Θ is positive semidefinite.
In summary, we have shown that M is symmetric and positive semidefinite.

4 A semismooth Newton augmented Lagrangian method for solving (D)

In this section, we shall design an inexact semismooth Newton based augmented Lagrangian method for solving problem (D), the dual of the SGLasso problem (1). Here we always assume that λ_1 + λ_2 > 0.
Write (D) equivalently in the following minimization form:

    min_{y,z}  ⟨b, y⟩ + (1/2)‖y‖^2 + p^*(z)   s.t.  A^* y + z = 0.        (17)

For σ > 0,
the augmented Lagrangian function associated with (17) is given by

    L_σ(y, z; x) = ⟨b, y⟩ + (1/2)‖y‖^2 + p^*(z) + (σ/2)‖A^* y + z − σ^{-1} x‖^2 − (1/(2σ))‖x‖^2.        (18)

The k-th iteration of the augmented Lagrangian method is given as follows:

    (y^{k+1}, z^{k+1}) ≈ argmin_{y,z} L_{σ_k}(y, z; x^k),
    x^{k+1} = x^k − σ_k (A^* y^{k+1} + z^{k+1}),   k ≥ 0.

In each iteration, the most expensive step is to solve the following subproblem:

    min_{y,z} L_{σ_k}(y, z; x^k).        (19)

Obviously, the subproblem (19) admits a unique optimal solution. For any y ∈ R^m, define

    ψ_k(y) := inf_z L_{σ_k}(y, z; x^k)
            = ⟨b, y⟩ + (1/2)‖y‖^2 + p^*(Prox_{p^*/σ_k}(σ_k^{-1} x^k − A^* y)) + (σ_k/2)‖Prox_p(σ_k^{-1} x^k − A^* y)‖^2 − (1/(2σ_k))‖x^k‖^2.        (20)

Then, (y^{k+1}, z^{k+1}) ≈ argmin_{y,z} L_{σ_k}(y, z; x^k) can be computed as follows:

    y^{k+1} ≈ argmin_y ψ_k(y)   and   z^{k+1} = Prox_{p^*/σ_k}(σ_k^{-1} x^k − A^* y^{k+1}).        (21)

Now, we propose an inexact augmented Lagrangian method for solving (17).
Algorithm 1: An inexact augmented Lagrangian method for solving (17)
Let σ_0 > 0 be a given parameter and choose (y^0, z^0, x^0) ∈ R^m × R^n × R^n. Iterate the following steps for k = 0, 1, . . . .
Step 1.
Compute

    (y^{k+1}, z^{k+1}) ≈ argmin_{y,z} L_{σ_k}(y, z; x^k)        (22)

via (21).
Step 2.
Compute

    x^{k+1} = x^k − σ_k (A^* y^{k+1} + z^{k+1}) = σ_k Prox_p(σ_k^{-1} x^k − A^* y^{k+1}).        (23)

Step 3.
Update σ_{k+1} ↑ σ_∞ ≤ ∞.

Given nonnegative summable sequences {ε_k} and {δ_k} such that δ_k < 1 for all k ≥ 0, we terminate the computation of the approximate solution (y^{k+1}, z^{k+1}) of (22) via the standard stopping criteria studied in [36]:

    (A)  L_{σ_k}(y^{k+1}, z^{k+1}; x^k) − inf_{y,z} L_{σ_k}(y, z; x^k) ≤ ε_k^2 / (2σ_k),
    (B)  L_{σ_k}(y^{k+1}, z^{k+1}; x^k) − inf_{y,z} L_{σ_k}(y, z; x^k) ≤ (δ_k^2 / (2σ_k)) ‖x^{k+1} − x^k‖^2.

Since ψ_k is strongly convex with modulus 1, one has the estimate

    L_{σ_k}(y^{k+1}, z^{k+1}; x^k) − inf_{y,z} L_{σ_k}(y, z; x^k) = ψ_k(y^{k+1}) − inf ψ_k ≤ (1/2)‖∇ψ_k(y^{k+1})‖^2.

Therefore, the above stopping criteria (A) and (B) can be replaced by the following easy-to-check criteria, respectively:

    (A')  ‖∇ψ_k(y^{k+1})‖ ≤ ε_k / √σ_k,
    (B')  ‖∇ψ_k(y^{k+1})‖ ≤ (δ_k / √σ_k) ‖x^{k+1} − x^k‖.
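To show how Algorithm 1 and the checkable criterion (A') fit together, here is a schematic NumPy sketch of the outer loop. It is only a sketch: the subproblem (22) is solved by plain gradient descent on ψ_k instead of the SSN method of section 4.2, the tolerance sequence and the σ-update rule are our own simple choices, and prox_sglasso is the helper from the sketch after (5).

```python
import numpy as np

def alm_sglasso(A, b, groups, lam1, lam2, w, sigma=1.0, outer=30):
    """Schematic version of Algorithm 1 with a crude inner solver for (22)."""
    m, n = A.shape
    x, y = np.zeros(n), np.zeros(m)
    normA2 = np.linalg.norm(A, 2) ** 2          # ||A||^2, used for a safe step size
    for k in range(outer):
        eps_k = 1.0 / (k + 1) ** 2              # nonnegative summable tolerances
        step = 1.0 / (1.0 + sigma * normA2)     # 1 / Lipschitz constant of grad psi_k
        for _ in range(5000):                   # inner loop: gradient descent on psi_k
            g = b + y - sigma * A @ prox_sglasso(x / sigma - A.T @ y,
                                                 groups, lam1, lam2, w)
            if np.linalg.norm(g) <= eps_k / np.sqrt(sigma):   # criterion (A')
                break
            y -= step * g
        # Step 2: multiplier update, cf. (23)
        x = sigma * prox_sglasso(x / sigma - A.T @ y, groups, lam1, lam2, w)
        # Step 3: increase the penalty parameter (a simple heuristic)
        sigma = min(10.0 * sigma, 1e8)
    return x, y
```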
4.1 Convergence rates of the augmented Lagrangian method

In this section, we shall analyse the linear convergence at an arbitrarily fast rate of the inexact augmented Lagrangian method for solving problem (17).
For the nonnegative summable sequence {ε_k} in the stopping criterion (A), we introduce a scalar α such that Σ_{k=0}^∞ ε_k ≤ α. Let r be any given positive scalar satisfying r > α. It follows from Proposition 2.2 that there exists a positive scalar κ such that

    dist(x, Ω_P) ≤ κ dist(0, T(x)),  ∀ x ∈ R^n satisfying dist(x, Ω_P) ≤ r.        (24)

The next lemma measures the distance of each primal iterate generated by Algorithm 1 to the optimal solution set Ω_P. The proof of Lemma 4.1 is mainly based on [8, Proposition 1(c)], which itself is an extension of [37, Theorem 2] and [29, Theorem 2.1]. Compared to the proof in [8, Proposition 1(c)], the following lemma uses (24) instead of the calmness condition of T^{-1} at the origin for some x̄ ∈ Ω_P.

Lemma 4.1. Suppose that the initial point x^0 ∈ R^n satisfies dist(x^0, Ω_P) ≤ r − α. Let {x^k} be any infinite sequence generated by Algorithm 1 under criteria (A) and (B). Then for all k ≥ 0, one has

    dist(x^{k+1}, Ω_P) ≤ μ_k dist(x^k, Ω_P),

where μ_k := [δ_k + (1 + δ_k) κ/√(κ^2 + σ_k^2)] / (1 − δ_k) and κ is from (24).

Proof. Denote the proximal point mapping by P_k := (I + σ_k T)^{-1}. Criterion (A), [36, Proposition 6], and [37, Equation (2.6)] imply that

    ‖x^k − Π_{Ω_P}(x^0)‖ ≤ ‖x^0 − Π_{Ω_P}(x^0)‖ + Σ_{i=0}^{k−1} ε_i ≤ ‖x^0 − Π_{Ω_P}(x^0)‖ + α,  ∀ k ≥ 0.

Consequently, dist(x^k, Ω_P) ≤ dist(x^0, Ω_P) + α ≤ r, ∀ k ≥ 0. Moreover, one has

    ‖P_k(x^k) − Π_{Ω_P}(x^k)‖ = ‖P_k(x^k) − P_k(Π_{Ω_P}(x^k))‖ ≤ ‖x^k − Π_{Ω_P}(x^k)‖ ≤ r,

which implies that dist(P_k(x^k), Ω_P) ≤ r, ∀ k ≥ 0. Additionally, it is shown in [37, Proposition 1(a)] that P_k(x^k) ∈ T^{-1}((x^k − P_k(x^k))/σ_k), ∀ k ≥ 0. Then it follows from (24) that

    dist(P_k(x^k), Ω_P) ≤ κ dist(0, T(P_k(x^k))) ≤ (κ/σ_k) ‖x^k − P_k(x^k)‖,  ∀ k ≥ 0.

Therefore, from the proof in [8, Proposition 1(c)], for all k ≥ 0,
we obtain that

    dist(P_k(x^k), Ω_P) ≤ (κ/√(κ^2 + σ_k^2)) dist(x^k, Ω_P),

and that

    ‖x^{k+1} − Π_{Ω_P}(P_k(x^k))‖ ≤ δ_k ‖x^{k+1} − Π_{Ω_P}(P_k(x^k))‖ + (δ_k + (1 + δ_k) κ/√(κ^2 + σ_k^2)) dist(x^k, Ω_P).

This, together with the fact that dist(x^{k+1}, Ω_P) ≤ ‖x^{k+1} − Π_{Ω_P}(P_k(x^k))‖ for all k ≥ 0,
completes the proof.
While the global convergence of Algorithm 1 follows from [36, 29] directly, the conditions required in [36, 29] to guarantee the local linear convergence of both {x^k} and {(y^k, z^k)} may no longer hold for the SGLasso problem due to the non-polyhedral property of the ℓ_2 norm function. Fortunately, the new results established in [8] on the convergence rates of the ALM allow us to establish the following theorem, which proves the global Q-linear convergence of the primal sequence {x^k} and the global R-linear convergence of the dual infeasibility and the dual objective values. Furthermore, the linear rates can be arbitrarily fast if the penalty parameter σ_k is chosen sufficiently large.

Theorem 4.1.
Let {(y^k, z^k, x^k)} be an infinite sequence generated by Algorithm 1 under stopping criterion (A). Then, the sequence {x^k} converges to some x̄ ∈ Ω_P, and the sequence {(y^k, z^k)} converges to the unique optimal solution of (D).
Furthermore, if criterion (B) is also executed in Algorithm 1 and the initial point x^0 ∈ R^n satisfies dist(x^0, Ω_P) ≤ r − α, then for all k ≥ 0, we have

    dist(x^{k+1}, Ω_P) ≤ μ_k dist(x^k, Ω_P),        (25a)
    ‖A^* y^{k+1} + z^{k+1}‖ ≤ μ'_k dist(x^k, Ω_P),        (25b)
    sup(D) − g(y^{k+1}, z^{k+1}) ≤ μ''_k dist(x^k, Ω_P),        (25c)

where

    μ_k := [δ_k + (1 + δ_k) κ/√(κ^2 + σ_k^2)] / (1 − δ_k),
    μ'_k := 1 / [(1 − δ_k) σ_k],
    μ''_k := [δ_k ‖x^{k+1} − x^k‖ + ‖x^{k+1}‖ + ‖x^k‖] / [2(1 − δ_k) σ_k],

and κ is from (24). Moreover, μ_k, μ'_k, and μ''_k all go to 0 if σ_k ↑ σ_∞ = +∞.

Proof. The statements on the global convergence just follow from [36, Theorem 5] or [8, Proposition 2]. Inequality (25a) is a direct consequence of Lemma 4.1. Inequality (25c) can be obtained by [8, Proposition 2(5b)]. From the updating formula (23) of x^{k+1}, we deduce that ‖A^* y^{k+1} + z^{k+1}‖ = σ_k^{-1} ‖x^{k+1} − x^k‖, which, together with [8, Lemma 3], implies that (25b) holds. This completes the proof.

Remark 4.1. Assume that all the conditions in Theorem 4.1 are satisfied. Since the primal objective function h is Lipschitz continuous on any compact set, there exists a constant L > 0 such that h is Lipschitz continuous on the set {x ∈ R^n | dist(x, Ω_P) ≤ r} with modulus L. Therefore, one can obtain from Theorem 4.1 that for all k ≥ 0,

    h(x^{k+1}) − inf(P) ≤ L dist(x^{k+1}, Ω_P) ≤ L μ_k dist(x^k, Ω_P).

This inequality, together with (25c) and the strong duality theorem, implies that

    h(x^{k+1}) − g(y^{k+1}, z^{k+1}) ≤ (L μ_k + μ''_k) dist(x^k, Ω_P),

which means that the duality gap converges to zero R-linearly at an arbitrary linear rate if σ_k is sufficiently large, and R-superlinearly if σ_k ↑ σ_∞ = +∞.

4.2 A semismooth Newton method for solving the subproblem (21)

In this subsection, we propose an efficient semismooth Newton (SSN) method for solving the subproblem (21). As already mentioned earlier, having an efficient method for solving (21) is critical to the efficiency of Algorithm 1. In each iteration, we have to solve the following problem for any given σ > 0 and x̃ ∈ R^n:

    min_y  { ψ(y) := ⟨b, y⟩ + (1/2)‖y‖^2 + p^*(Prox_{p^*/σ}(σ^{-1} x̃ − A^* y)) + (σ/2)‖Prox_p(σ^{-1} x̃ − A^* y)‖^2 }.        (26)

Note that ψ(·) is strongly convex and continuously differentiable with

    ∇ψ(y) = b + y − σ A Prox_p(σ^{-1} x̃ − A^* y).

Thus, the unique solution ȳ of (26) can be obtained by solving the following nonsmooth equation

    ∇ψ(y) = 0.        (27)

Generally, to solve

    F(x) = 0,

where F : R^m → R^m is a locally Lipschitz continuous function, one can employ the following SSN method:

    x^{k+1} = x^k − V_k^{-1} F(x^k),

where V_k ∈ ∂F(x^k), and ∂F(x^k) denotes the Clarke generalized Jacobian [6, Definition 2.6.1] of F at x^k. For more details about the SSN method, we refer the reader to [22, 23, 33, 40, 47] and the references therein. In particular, existing studies such as [33, 40] used the Clarke generalized Jacobian V_k ∈ ∂F(x^k) in the updating scheme and established correspondingly the convergence results of the SSN method.
We should point out again that characterizing ∂(∇ψ)(·) is a difficult task to accomplish. In section 3, we have constructed a multifunction M, which is used as a surrogate of the generalized Jacobian ∂Prox_p.
Besides, it is illustrated in Theorem 3.1 that Prox_p is strongly semismooth with respect to the multifunction M. Likewise, we define a multifunction V : R^m ⇒ R^{m×m} as follows:

    V(y) := { V | V = I + σ A M A^*, M ∈ M(σ^{-1} x̃ − A^* y) },

where M(·) is defined in (10). It follows from Theorem 3.1 and [11, Theorem 7.5.17] that (i) V is a nonempty compact valued upper-semicontinuous multifunction; (ii) ∇ψ is strongly semismooth on R^m with respect to the multifunction V; (iii) every matrix in the set V(·) is symmetric and positive definite.
With the above analysis, we are ready to design the following SSN method for solving (27).
Algorithm 2: A semismooth Newton method for solving (27)
Given μ ∈ (0, 1/2), η̄ ∈ (0, 1), τ ∈ (0, 1], δ ∈ (0, 1), and y^0 ∈ R^m, iterate the following steps for j = 0, 1, . . . .
Step 1.
Choose M_j ∈ M(σ^{-1} x̃ − A^* y^j). Let V_j = I + σ A M_j A^*. Solve the following linear system

    V_j d = −∇ψ(y^j)        (28)

exactly or by the conjugate gradient (CG) algorithm to find d^j such that ‖V_j d^j + ∇ψ(y^j)‖ ≤ min(η̄, ‖∇ψ(y^j)‖^{1+τ}).
Step 2. (Line search) Set α_j = δ^{m_j}, where m_j is the smallest nonnegative integer m for which

    ψ(y^j + δ^m d^j) ≤ ψ(y^j) + μ δ^m ⟨∇ψ(y^j), d^j⟩.

Step 3.
Set y^{j+1} = y^j + α_j d^j.
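For illustration, the following NumPy/SciPy sketch shows one way Algorithm 2 can be realized for a dense matrix A: the operator V_j is applied matrix-free inside CG, and the helpers prox_sglasso and apply_surrogate_jacobian are the ones sketched earlier (our own names, not from the paper). The line search exploits the fact that p is positively homogeneous, so p^* is an indicator function and the p^* term in (26) vanishes at the prox point.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

def ssn_solve_subproblem(A, b, x_tilde, sigma, groups, lam1, lam2, w,
                         newton_iters=50, tol=1e-8, mu=1e-4, delta=0.5):
    """Sketch of Algorithm 2 for the nonsmooth equation grad psi(y) = 0 in (27).
    psi(y) = <b,y> + ||y||^2/2 + (sigma/2)||Prox_p(u)||^2 + const, u = x_tilde/sigma - A^T y."""
    m, _ = A.shape
    y = np.zeros(m)

    def value_and_grad(y):
        u = x_tilde / sigma - A.T @ y
        px = prox_sglasso(u, groups, lam1, lam2, w)
        return u, b @ y + 0.5 * (y @ y) + 0.5 * sigma * (px @ px), b + y - sigma * A @ px

    for _ in range(newton_iters):
        u, val, grad = value_and_grad(y)
        if np.linalg.norm(grad) <= tol:
            break
        # Step 1: inexact Newton direction from (28); V_j = I + sigma*A*M_j*A^T is never formed.
        matvec = lambda d: d + sigma * (A @ apply_surrogate_jacobian(
            u, A.T @ d, groups, lam1, lam2, w))
        d, _ = cg(LinearOperator((m, m), matvec=matvec), -grad, maxiter=200)
        # Step 2: Armijo backtracking line search.
        alpha, slope = 1.0, grad @ d
        while value_and_grad(y + alpha * d)[1] > val + mu * alpha * slope and alpha > 1e-12:
            alpha *= delta
        # Step 3: update the iterate.
        y = y + alpha * d
    return y
```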
The following convergence theorem for Algorithm 2 can be obtained directly from [25, Theorem 3].

Theorem 4.2. Let {y^j} be the sequence generated by Algorithm 2. Then {y^j} is well-defined and converges to the unique solution ȳ of (26). Moreover, the convergence rate is at least superlinear:

    ‖y^{j+1} − ȳ‖ = O(‖y^j − ȳ‖^{1+τ}),

where τ ∈ (0, 1] is the parameter given in Algorithm 2.

4.3 Efficient techniques for solving the linear system (28)

In this section, we analyse the sparsity structure of the matrix in the linear system (28) and design sophisticated numerical techniques for solving the large-scale linear systems involved in the SSN method. These techniques were first applied in [24], which took full advantage of the second order sparsity of the underlying problem.
As can be seen, the most expensive step in each iteration of Algorithm 2 is in solving the linear system (28). Let (x̃, y) ∈ R^n × R^m and σ > 0 be given. The system takes the form

    (I + σ A M A^T) d = −∇ψ(y),        (29)

where A denotes the matrix representation of the linear operator A, and M ∈ M(u) with u = σ^{-1} x̃ − A^T y. With the fact that A is an m by n matrix and M is an n by n matrix, the cost of naively computing AMA^T is O(mn(m + n)). Similarly, for any vector d ∈ R^m, the cost of naively computing the matrix-vector product AMA^T d is O(n(m + n)). Since the cost of naively computing the coefficient matrix I + σAMA^T and that of multiplying a vector by the coefficient matrix I + σAMA^T are excessively demanding, common linear system solvers, such as the Cholesky decomposition and the conjugate gradient method, will be extremely slow (if possible at all) in solving the linear system (29) arising from large-scale problems. Therefore, it is critical for us to extract and exploit any structure present in the matrix AMA^T to dramatically reduce the cost of solving (29).
Next, we analyse the proof of Theorem 3.1 in detail in order to find the special structure of AMA^T, thereby reducing the computational cost mentioned above. Let v := Prox_ϕ(u). From the proof of Theorem 3.1, Case 1 and Case 2 (taking t = 0) are simple since the set M(u) contains a zero matrix. We can choose M = 0 so that I + σAMA^T = I. The sole challenge lies in Case 3. Here, we shall consider

    (1 − λ_{2,l}/‖v_l‖) A P_l^* P_l Θ A^T + (λ_{2,l}/‖v_l‖^3) A (P_l^* v_l)(P_l^* v_l)^T A^T.

Note that both P_l^* P_l and Θ are diagonal matrices whose diagonal elements are either 0 or 1. Therefore, the product P_l^* P_l Θ enjoys the same property. Moreover, we have supp(diag(P_l^* P_l)) = G_l and supp(diag(Θ)) = supp(v) by the definition of P_l, (13), and (14). Therefore,

    supp(diag(P_l^* P_l Θ)) = G_l ∩ supp(v) = Ξ_l,

where Ξ_l is the index set defined by (16) that corresponds to the non-zero elements of v in the l-th group. In other words, the diagonal matrix P_l^* P_l Θ is expected to contain only a few 1's on the diagonal. Consequently, the computational cost of A P_l^* P_l Θ A^T can be greatly reduced. Next, we observe that supp(P_l^* v_l) ⊆ Ξ_l. Thus, to compute A(P_l^* v_l), one just needs to consider those columns of A corresponding to the index set Ξ_l, thereby reducing the cost of computing A(P_l^* v_l) and that of A(P_l^* v_l)(P_l^* v_l)^T A^T. The following notation is introduced to express these techniques clearly. Denote the index set Ξ_> := { l | ‖v_l‖ > λ_{2,l}, l = 1, 2, . . . , g }, which corresponds to Case 3 in Theorem 3.1. For each l = 1, 2, . . . , g, let A_l ∈ R^{m×|Ξ_l|} be the sub-matrix of A with those columns in Ξ_l, and s_l := (P_l^* v_l)_{Ξ_l} ∈ R^{|Ξ_l|} be the sub-vector of P_l^* v_l restricted to Ξ_l.
Then, we deduce that

    AMA^T = Σ_{l∈Ξ_>} [ (1 − λ_{2,l}/‖v_l‖) A P_l^* P_l Θ A^T + (λ_{2,l}/‖v_l‖^3) A (P_l^* v_l)(P_l^* v_l)^T A^T ]
          = Σ_{l∈Ξ_>} [ (1 − λ_{2,l}/‖v_l‖) A_l A_l^T + (λ_{2,l}/‖v_l‖^3) (A_l s_l)(A_l s_l)^T ].        (30)

Therefore, the cost of computing AMA^T and that of the matrix-vector product AMA^T d for any d ∈ R^m are O(m^2(r_1 + r_2)) and O(m(r_1 + r_2)), respectively, where r_1 := Σ_{l∈Ξ_>} |Ξ_l| ≤ |supp(v)| and r_2 := |Ξ_>| ≤ g. We may refer to r_1 as the overall sparsity and r_2 as the group sparsity. In other words, the computational cost depends on the overall sparsity r_1, the group sparsity r_2, and the number of observations m. The number r_1 is presumably much smaller than n due to the fact that v = Prox_ϕ(u). Besides, the number of observations m is usually smaller than the number of predictors n in many applications. Even if n happens to be extremely large, one can still solve the linear system (29) efficiently via the (sparse) Cholesky factorization as long as r_1, r_2, and m are moderate.
In addition, if the optimal solution is so sparse that r_1 + r_2 ≪ m, then the cost of solving (29) can be reduced further. In this case, the coefficient matrix can be written as follows:

    I + σAMA^T = I + DD^T,

where D = [B, C] ∈ R^{m×(r_1+r_2)} with

    B_l := √(σ(1 − λ_{2,l}/‖v_l‖)) A_l ∈ R^{m×|Ξ_l|},   B := [B_l]_{l∈Ξ_>} ∈ R^{m×r_1},
    c_l := √(σ λ_{2,l}/‖v_l‖^3) (A_l s_l) ∈ R^m,   C := [c_l]_{l∈Ξ_>} ∈ R^{m×r_2}.

By the Sherman-Morrison-Woodbury formula, it holds that

    (I + σAMA^T)^{-1} = (I + DD^T)^{-1} = I − D(I + D^TD)^{-1}D^T.

In this case, the main cost is in computing I + D^TD at O(m(r_1 + r_2)^2) operations, as well as in factorizing the (r_1 + r_2) by (r_1 + r_2) matrix I + D^TD at the cost of O((r_1 + r_2)^3) operations.
Based on the above arguments, one can claim that the linear system (28) in each SSN iteration can be solved efficiently at low cost. In fact, based on our experience gathered from the numerical experiments in the next section, the computational costs are so low that the time taken to perform indexing operations, such as obtaining the sub-matrix A_l from A and the sub-vector (P_l^* v_l)_{Ξ_l} from P_l^* v_l for l ∈ Ξ_>, may become noticeably higher than the time taken to compute the matrix AMA^T itself. Fortunately, the group sparsity r_2 generally limits the number of such indexing operations needed when computing AMA^T.
Note that in the unlikely event that computing the Cholesky factorization of AMA^T or that of I + D^TD is expensive, such as when r_1 + r_2 and m are both large, one can employ the preconditioned conjugate gradient (PCG) method to solve the linear system (29) efficiently through exploiting the fast computation of the matrix-vector product AMA^T d for any given vector d.
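As an illustration of this second order sparsity, the following NumPy sketch assembles the low-rank factor D from (30) using only the columns of A indexed by each Ξ_l, and then solves (29) through the Sherman-Morrison-Woodbury identity; all function names and argument conventions are our own, and a practical implementation would switch between this route, a direct Cholesky factorization, and PCG as discussed above.

```python
import numpy as np

def solve_newton_system(A, u, rhs, sigma, groups, lam1, lam2, w):
    """Solve (I + sigma*A*M*A^T) d = rhs exploiting (30) and the SMW formula.
    u is the point sigma^{-1}*x_tilde - A^T*y at which an element M of M(u) is taken."""
    v = np.sign(u) * np.maximum(np.abs(u) - lam1, 0.0)      # v = Prox_phi(u)
    cols = []                                               # columns of D = [B, C]
    for l, G in enumerate(groups):
        c = lam2 * w[l]
        nrm = np.linalg.norm(v[G])
        if nrm <= c:                                        # Cases 1/2 (t = 0): block vanishes
            continue
        Xi = G[np.abs(u[G]) > lam1]                         # active set Xi_l, eq. (16)
        A_l = A[:, Xi]                                      # m x |Xi_l| sub-matrix
        s_l = v[Xi]                                         # sub-vector (P_l^* v_l)_{Xi_l}
        cols.append(np.sqrt(sigma * (1.0 - c / nrm)) * A_l)            # block B_l
        cols.append(np.sqrt(sigma * c / nrm**3) * (A_l @ s_l)[:, None])  # column c_l
    if not cols:                                            # M = 0, coefficient matrix is I
        return rhs.copy()
    D = np.hstack(cols)                                     # m x (r1 + r2)
    # SMW: (I + D D^T)^{-1} rhs = rhs - D (I + D^T D)^{-1} D^T rhs
    small = np.eye(D.shape[1]) + D.T @ D
    return rhs - D @ np.linalg.solve(small, D.T @ rhs)
```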
5 Numerical experiments

In this section, we compare the performance of our semismooth Newton augmented Lagrangian (Ssnal) method with the alternating direction method of multipliers (ADMM) and the state-of-the-art solver SLEP [27] for solving the SGLasso problem. Specifically, the function "sgLeastR" in the solver SLEP is used for comparison. ADMM was first proposed in [15, 16], and the implementation will be illustrated in section 5.1.
Since the primal problem (1) is unconstrained, it is reasonable to measure the accuracy of an approximate optimal solution (y, z, x) for problem (17) and problem (1) by the relative duality gap and dual infeasibility. Specifically, let

    pobj := (1/2)‖Ax − b‖^2 + λ_1‖x‖_1 + λ_2 Σ_{l=1}^g w_l ‖x_{G_l}‖   and   dobj := −⟨b, y⟩ − (1/2)‖y‖^2

be the primal and dual objective function values. Then the relative duality gap and the relative dual infeasibility are defined by

    η_G := |pobj − dobj| / (1 + |pobj| + |dobj|),   η_D := ‖A^* y + z‖ / (1 + ‖z‖).

For given error tolerances ε_D > 0 and ε_G > 0,
our algorithm
Ssnal will be terminated if

    η_D < ε_D and η_G < ε_G,        (31)

while the ADMM will be terminated if the above conditions hold or the maximum number of 10,000 iterations is reached.
By contrast, since SLEP does not produce the dual sequences {(y^k, z^k)}, the relative dual infeasibility cannot be used as a stopping criterion for SLEP. Therefore, we terminate SLEP if the relative difference of the objective values obtained by SLEP and Ssnal is less than ε_G, i.e.,

    η_P := (obj_S − pobj) / (1 + |obj_S| + |pobj|) < ε_G,

or the maximum number of 10,000 iterations is reached.
Here obj_S denotes the objective value obtained by SLEP. Note that the parameters for SLEP are set to their default values unless otherwise specified.
In our numerical experiments, we choose ε_D = ε_G = 10^{-6} unless otherwise specified. That is, the condition (31) for Ssnal becomes

    η_S := max{η_G, η_D} < 10^{-6}.

Similarly, the stopping condition for ADMM becomes

    η_A := max{η_G, η_D} < 10^{-6}.

In addition, we adopt the following weights for the model (1): w_l = √|G_l|, ∀ l = 1, 2, . . . , g. In the following tables, "S" stands for Ssnal, "P" for SLEP, and "A" for ADMM; "nnz" denotes the number of non-zero entries in the solution x obtained by Ssnal using the following estimation:

    nnz := min{ k | Σ_{i=1}^k |x̂_i| ≥ 0.999 ‖x‖_1 },

where x̂ is obtained by sorting x by magnitude in a descending order. We display the number of outer ALM iterations (in Algorithm 1) and the total number of inner SSN iterations (in Algorithm 2) of Ssnal in the format "outer iteration (inner iteration)" under the iteration column. The computation time is in the format "hours:minutes:seconds", and "00" in the time column means that the elapsed time is less than 0.5 seconds. All the experiments are run in Matlab (version 9.0) on a Windows workstation (24-core, Intel Xeon E5-2680 @ 2.50GHz, 128 Gigabytes of RAM).
5.1 Implementation of the ADMM

In this section, we study the implementation of the (inexact) semi-proximal alternating direction method of multipliers (sPADMM), which is an extension of the classic ADMM [15, 16]. This method is one of the most natural methods for solving (17) due to its separable structure. Generally, the framework of the sPADMM consists of the following iterations:

    y^{k+1} ≈ argmin_y L_σ(y, z^k; x^k) + (1/2)‖y − y^k‖^2_{S_1},
    z^{k+1} ≈ argmin_z L_σ(y^{k+1}, z; x^k) + (1/2)‖z − z^k‖^2_{S_2},
    x^{k+1} = x^k − τσ(A^* y^{k+1} + z^{k+1}),        (32)

where τ ∈ (0, (1+√5)/2), S_1 and S_2 are self-adjoint positive semidefinite linear operators, and L_σ is the augmented Lagrangian function defined in (18). The sPADMM is convergent under some mild conditions, and we refer the reader to [12, 4] for the convergence results. However, due to the lack of error bound conditions for the KKT system (2), the linear convergence rate of the sPADMM cannot be established from existing results.
In each iteration of (32), the first step is to minimize a function of y. In particular, y^{k+1} can be obtained by solving the following m × m linear system of equations:

    (σ^{-1} I + AA^* + S_1) y^{k+1} = −σ^{-1} b − A(z^k − σ^{-1} x^k) + S_1 y^k.

The dimension m is a moderate number in many statistical applications. Thus, in our implementation, the above equation was solved via the Cholesky factorization, and the proximal term S_1 was taken to be the zero matrix. In the event that computing the Cholesky factorization of σ^{-1}I + AA^* is expensive, one can choose S_1 judiciously so that the coefficient matrix becomes a positive definite diagonal matrix plus a low-rank matrix, which can be inverted efficiently via the Sherman-Morrison-Woodbury formula. We refer the reader to [4, section 7.1] for the details on how to choose S_1 appropriately.
The second step in (32) is to minimize a function of z. For the SGLasso problem, one would simply choose S_2 = 0. In this case, by the Moreau identity (4), z^{k+1} is updated by the following scheme:

    z^{k+1} = σ^{-1} x^k − A^* y^{k+1} − Prox_p(σ^{-1} x^k − A^* y^{k+1}),

where Prox_p is computable by Proposition 2.1. In summary, the two subproblems of (32) are solvable and consequently the framework (32) is easily implementable. Moreover, in order to improve the convergence speed numerically, we set the step-length τ in (32) to be 1.618 and tune the parameter σ according to the progress between the primal feasibility and the dual feasibility in the implementation.
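For concreteness, here is a minimal NumPy sketch of the iteration (32) with S_1 = S_2 = 0, a dense A, and a fixed penalty parameter σ; the adaptive tuning of σ used in the actual experiments is omitted, and prox_sglasso is the helper from the sketch after (5).

```python
import numpy as np

def admm_sglasso(A, b, groups, lam1, lam2, w, sigma=1.0, tau=1.618, iters=1000):
    """Two-block ADMM (sPADMM with S1 = S2 = 0) for the dual problem (17)."""
    m, n = A.shape
    y, z, x = np.zeros(m), np.zeros(n), np.zeros(n)
    # Factorize the y-subproblem matrix once (sigma is kept fixed in this sketch).
    L = np.linalg.cholesky(np.eye(m) / sigma + A @ A.T)
    for _ in range(iters):
        # y-update: solve (I/sigma + A A^T) y = -b/sigma - A (z - x/sigma)
        rhs = -b / sigma - A @ (z - x / sigma)
        y = np.linalg.solve(L.T, np.linalg.solve(L, rhs))
        # z-update via the Moreau identity (4): z = wvec - Prox_p(wvec)
        wvec = x / sigma - A.T @ y
        z = wvec - prox_sglasso(wvec, groups, lam1, lam2, w)
        # multiplier update with step-length tau
        x = x - tau * sigma * (A.T @ y + z)
    return x, y, z
```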
5.2 Numerical results on synthetic data

This section presents the tests of the three algorithms Ssnal, ADMM, and SLEP on various synthetic data constructed in the same way as in [38]. The data matrix A is generated randomly as an m × n matrix of normally distributed random numbers, and the number of groups g is chosen manually to be 100, 1000, and 10000. Then we partition {1, 2, . . . , n} into g groups such that the indices of components in each group are adjacent, i.e., G_1 contains the first |G_1| indices, G_2 the next |G_2| indices, and so on. The group sizes {|G_i|, i = 1, 2, . . . , g} are determined randomly such that each |G_i| is expected to be around the mean value n/g. Subsequently, the response vector b is constructed as b = Ax + ϵ, where ϵ is normally distributed random noise, x_{G_l} = (1, 2, . . . , 10, 0, . . . , 0)^T for l = 1, 2, . . . , 10, and x_{G_l} = 0 for all other groups. That is, the first 10 groups are the non-trivial groups, and the true number of non-zero elements of the underlying solution x is 100. The regularization parameters λ_1 = λ_2 are chosen to make the number of non-zero elements of the resulting solution close to the true number of 100.
Table 1 compares the numerical results of the three algorithms Ssnal, ADMM, and SLEP tested on different synthetic data. As can be seen from the table, the computational time of
Ssnal is less than that of ADMM and SLEP for most cases. The overall advantage in computational time suggests that our algorithm
Ssnal is efficient for solving the SGLasso problem with randomly generated data. Moreover, we observe from the table that ADMM is inefficient in solving the SGLasso problem with randomly generated large-scale data. A possible reason is that the first order method ADMM requires a large number of iterations to solve the problem to the required accuracy of 10^{-6}. The table also shows that our algorithm Ssnal can significantly outperform SLEP on problems with a large number of groups. In particular,
Ssnal is more than 5 times faster than SLEP for the high dimensional instance with problem size (m, n) = (1e3, 1e6)
and group number g = 10000. For this instance, the number of non-zero entries in the solution x is small, and we have highly conducive second order sparsity which we can fully exploit in the numerical computations outlined in section 4.3.

Table 1: The performances of Ssnal, ADMM, and SLEP on synthetic data. The regularization parameters are set as λ_1 = λ_2. Columns: size (m, n); g; λ_1; nnz; iteration and time for S (Ssnal), A (ADMM), and P (SLEP).

5.3 Numerical results on UCI data sets with randomly generated groups

This section presents the performances of the three algorithms
Ssnal, ADMM, and SLEP on large-scale UCI data sets [26] (A, b) that are originally obtained from the LIBSVM data sets [3]. In our numerical experiments, we follow [24] and apply the method in [18] to expand the original features of the data sets bodyfat, pyrim, and triazines using polynomial basis functions. For example, a polynomial basis function of order 7 is used to expand the features of the data set bodyfat, and then the expanded data set is named as bodyfat7. This naming convention is also used for pyrim5, triazines4, and housing7. As noted in [24, Table 1], these data sets are quite different in terms of the problem dimension and the largest eigenvalue of AA^*. For example, the instance log1p.E2006.train is relatively high-dimensional, with A having 16087 rows and a large leading eigenvalue of AA^*.
Next, we describe how the groups in each problem are specified. By reordering the components of the variable x if necessary, without loss of generality, we assume that the vector x can be partitioned into g groups where the indices of components in each group are adjacent. The group sizes {|G_l|, l = 1, 2, . . . , g} are determined randomly such that each |G_l| is around the mean value n/g. In the experiment, the average group size is about 300.
We tested the SGLasso problems with two different sets of regularization parameters which are chosen manually:

    (S1) λ_1 = λ_2 = γ‖A^*b‖_∞;
    (S2) λ_1 = 0. γ‖A^*b‖_∞, λ_2 = 9. γ‖A^*b‖_∞.

The parameter γ is chosen to produce a reasonable number of non-zero elements in the resulting solution x. Three values of γ are used for each UCI data set in our experiments.
Table 2 presents the comparison results of the three algorithms Ssnal, ADMM, and SLEP on 8 selected UCI data sets with regularization parameters specified as in (S1). As shown in the table, Ssnal
has succeeded in solving all instances within 1 minute, while SLEP failed to solve 10 cases. Although ADMM has also succeeded in solving all instances, its running time for each case is much longer than that of
Ssnal. In the majority of the cases,
Ssnal outperformed the first order methods ADMM and SLEP by a large margin. For example, for the instance
E2006.train with γ = 1e-7, Ssnal solved it to the desired accuracy in 3 seconds, ADMM took more than 8 minutes, while SLEP failed to solve it within 10000 steps. The numerical results show convincingly that our algorithm
Ssnal can solve SGLasso problems highly efficiently and robustly. Again, the superior performance of our
Ssnal algorithm can be attributed to our ability to extract and exploit the second order sparsity structure (in the SGLasso problem) within the SSN method to solve each ALM subproblem very efficiently.
Table 3 is the same as Table 2 but for the regularization parameters specified as in (S2). This table also shows that the computational time of
Ssnal is far less than that of ADMM and SLEP for almost all cases. Furthermore, for more difficult cases, such as those with large problem dimension (m, n) and large number of non-zero entries (nnz), the superiority of
Ssnal is even more striking compared to ADMM and SLEP. The results again demonstrate that our algorithm
Ssnal is highly efficient for solving SGLasso problems.
Figure 1 presents the performance profiles of
Ssnal, ADMM, and SLEP for all 48 tested problems, which are presented in Table 2 and Table 3. The meaning of the performance profiles is given as follows: a point (x, y) is on the performance curve of a particular method if and only if this method can solve up to the desired accuracy (100y)% of all the tested instances within at most x times of the fastest method for each instance. As can be seen, Ssnal outperforms ADMM and SLEP by a large margin for all tested UCI data sets with randomly generated groups. In particular, focusing on y = 40%, we can see from Figure 1 that Ssnal is around 30 times faster compared to ADMM and SLEP for over 60% of the tested instances.
Figure 1: Performance profiles of Ssnal, ADMM, and SLEP on UCI data sets with randomly generated groups.

Table 2: The performances of
Ssnal, ADMM, and SLEP on 8 selected UCI data sets with randomly generated groups. The regularization parameters are specified as in (S1). Columns: problem name (m, n); g; γ; nnz; iteration, time, and error (η_S | η_A | η_P) for S (Ssnal), A (ADMM), and P (SLEP).

Table 3: The performances of Ssnal, ADMM, and SLEP on 8 selected UCI data sets with randomly generated groups. The regularization parameters are specified as in (S2). The columns are the same as in Table 2.

5.4 Numerical results on UCI data sets with simulated groups

This section also makes use of the UCI data sets mentioned in section 5.3. Instead of specifying the groups randomly, we attempt to generate more meaningful groups in the following manner. Firstly, the classical Lasso (model (1) with λ_2 = 0) is solved with the accuracy of 10^{-6} to obtain a sparse solution x, and the computed solution x is sorted by magnitude in a descending order. Then, the first |G_1| largest variables are allocated to group 1, the next |G_2| variables are allocated to group 2, etc. Since this group membership is determined by the magnitude of each variable of the computed solution from the classical Lasso, we believe that this kind of group structure is more natural than that constructed randomly in the previous section. Besides, the group sizes {|G_l|, l = 1, 2, . . . , g} are determined randomly such that each |G_l| is around the mean value n/g. Compared to the last section, a different value of 30 is taken as the average group size for the diversity of the experiments.
To generate the solution from the classical Lasso used to decide on the group membership mentioned above, we take the medium value of γ in Table 4, e.g., γ = 1e-6 for the instance E2006.train. The regularization parameters for the classical Lasso are set as follows: λ_1 = γ‖A^*b‖_∞, λ_2 = 0. For the SGLasso problem, the regularization parameters follow three different strategies: (S1) and (S2) given in the previous section, and

    (S3) λ_1 = γ‖A^*b‖_∞, λ_2 = √λ_1 if λ_1 > 1, and λ_2 = λ_1 if λ_1 ≤ 1.

Ssnal has succeeded in solving all the 72 instances highly efficiently, while ADMM failed in 5 instances, and SLEP failed in 58 instances. Moreover, for those failed instances, we observe from the tables that SLEP terminated when the errors were still relatively large in most cases. The results may suggest that using only first order information is not enough for computing high accuracy solutions, while second order information can contribute to the fast convergence and high computational efficiency of a well designed second order SSN method. For the vast majority of the instances, the computational time of Ssnal is far less than that of ADMM and SLEP. Again, the results have demonstrated convincingly that our algorithm
Ssnal is capable of solving large-scale SGLasso problems to high accuracy very efficiently and robustly.
Figure 2 presents the performance profiles of
Ssnal, ADMM, and SLEP for all 72 tested problems, which are presented in Table 4, Table 5, and Table 6. From the figure, we find that
Ssnal not only solves all the tested instances to the desired accuracy, but also outperforms ADMM and SLEP by an obvious margin for these tested UCI data sets with simulated groups. Within 250 times of the running time of
Ssnal, ADMM can only solve approximately 80% of all the tested instances, while SLEP can only solve 20% of all the tested instances. We can safely claim that our algorithm
Ssnal can solve large-scale SGLasso problems to high accuracy very efficiently and robustly.

Figure 2: Performance profiles of
Ssnal, ADMM, and SLEP on UCI data sets with simulated groups.

Table 4: The performances of
Ssnal, ADMM, and SLEP on 8 selected UCI data sets with simulated groups. The regularization parameters are specified as in (S1). The columns are the same as in Table 2.

Table 5: The performances of Ssnal, ADMM, and SLEP on 8 selected UCI data sets with simulated groups. The regularization parameters are specified as in (S2). The columns are the same as in Table 2.

Table 6: The performances of Ssnal, ADMM, and SLEP on 8 selected UCI data sets with simulated groups. The regularization parameters are specified as in (S3). The columns are the same as in Table 2.

6 Conclusion

In this paper, we have developed a highly efficient semismooth Newton based augmented Lagrangian method
In this paper, we have developed a highly efficient semismooth Newton based augmented Lagrangian method Ssnal for solving large-scale non-overlapping sparse group Lasso problems. The elements in the generalized Jacobian of the proximal mapping associated with the sparse group Lasso regularizer were first derived, and the underlying second order sparsity structure was thoroughly analysed and utilised to achieve superior performance in the numerical implementation of Ssnal. Extensive numerical experiments have demonstrated that the proposed algorithm is highly efficient and robust, even on high-dimensional real data sets. Based on the superior performance of Ssnal for non-overlapping sparse group Lasso problems, we expect our algorithmic framework to be effective also for overlapping sparse group Lasso problems and other large-scale convex composite problems, which we leave to future studies.
Acknowledgments
The authors would like to thank Dr. Xudong Li and Ms. Meixia Lin for their help in the numerical implementations.

References

[1] A. Argyriou, C. A. Micchelli, M. Pontil, L. Shen, and Y. Xu. Efficient first order methods for linear composite regularizers. arXiv preprint arXiv:1104.1436, 2011.
[2] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–122, 2011.
[3] C.-C. Chang and C.-J. Lin. LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST), 2(3):27, 2011.
[4] L. Chen, D. F. Sun, and K.-C. Toh. An efficient inexact symmetric Gauss–Seidel based majorized ADMM for high-dimensional convex composite conic programming. Mathematical Programming, 161(1-2):237–270, 2017.
[5] X. D. Chen, D. F. Sun, and J. Sun. Complementarity functions and numerical experiments on some smoothing Newton methods for second-order-cone complementarity problems. Computational Optimization and Applications, 25(1):39–56, 2003.
[6] F. H. Clarke. Optimization and Nonsmooth Analysis. SIAM, 1990.
[7] Y. Cui, D. F. Sun, and K.-C. Toh. On the asymptotic superlinear convergence of the augmented Lagrangian method for semidefinite programming with multiple solutions. arXiv preprint arXiv:1610.00875, 2016.
[8] Y. Cui, D. F. Sun, and K.-C. Toh. On the R-superlinear convergence of the KKT residues generated by the augmented Lagrangian method for convex composite conic programming. arXiv preprint arXiv:1706.08800, 2017.
[9] Y. Dong. An extension of Luque's growth condition. Applied Mathematics Letters, 22(9):1390–1393, 2009.
[10] Y. C. Eldar and M. Mishali. Robust recovery of signals from a structured union of subspaces. IEEE Transactions on Information Theory, 55(11):5302–5316, 2009.
[11] F. Facchinei and J.-S. Pang. Finite-Dimensional Variational Inequalities and Complementarity Problems. Springer Science & Business Media, 2007.
[12] M. Fazel, T. K. Pong, D. F. Sun, and P. Tseng. Hankel matrix rank minimization with applications to system identification and realization. SIAM Journal on Matrix Analysis and Applications, 34(3):946–977, 2013.
[13] J. Friedman, T. Hastie, H. Höfling, R. Tibshirani, et al. Pathwise coordinate optimization. The Annals of Applied Statistics, 1(2):302–332, 2007.
[14] J. Friedman, T. Hastie, and R. Tibshirani. A note on the group lasso and a sparse group lasso. arXiv preprint arXiv:1001.0736, 2010.
[15] D. Gabay and B. Mercier. A dual algorithm for the solution of nonlinear variational problems via finite element approximation. Computers and Mathematics with Applications, 2(1):17–40, 1976.
[16] R. Glowinski and A. Marroco. Sur l'approximation, par éléments finis d'ordre un, et la résolution, par pénalisation-dualité d'une classe de problèmes de Dirichlet non linéaires. Revue française d'automatique, informatique, recherche opérationnelle. Analyse numérique, 9(R2):41–76, 1975.
[17] D. Hallac, J. Leskovec, and S. Boyd. Network lasso: Clustering and optimization in large graphs. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 387–396. ACM, 2015.
[18] L. Huang, J. Jia, B. Yu, B.-G. Chun, P. Maniatis, and M. Naik. Predicting execution time of computer programs using sparse polynomial regression. In Advances in Neural Information Processing Systems, pages 883–891, 2010.
[19] L. Jacob, G. Obozinski, and J.-P. Vert. Group lasso with overlap and graph lasso. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 433–440. ACM, 2009.
[20] R. Jenatton, J. Mairal, F. R. Bach, and G. R. Obozinski. Proximal methods for sparse hierarchical dictionary learning. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 487–494, 2010.
[21] D. Kong and C. Ding. Efficient algorithms for selecting features with arbitrary group constraints via group lasso. In 2013 IEEE 13th International Conference on Data Mining (ICDM), pages 379–388. IEEE, 2013.
[22] B. Kummer. Newton's method for non-differentiable functions. Advances in Mathematical Optimization, 45:114–125, 1988.
[23] B. Kummer. Newton's method based on generalized derivatives for nonsmooth functions: convergence analysis. In Advances in Optimization, pages 171–194. Springer, 1992.
[24] X. Li, D. F. Sun, and K.-C. Toh. A highly efficient semismooth Newton augmented Lagrangian method for solving Lasso problems. SIAM Journal on Optimization, 27, 2017.
[25] X. Li, D. F. Sun, and K.-C. Toh. On efficiently solving the subproblems of a level-set method for fused lasso problems. arXiv preprint arXiv:1706.08732, 2017.
[26] M. Lichman. UCI machine learning repository, 2013.
[27] J. Liu, S. Ji, J. Ye, et al. SLEP: Sparse Learning with Efficient Projections. Arizona State University, 6:491, 2009.
[28] Z.-Q. Luo and P. Tseng. Error bounds and convergence analysis of feasible descent methods: a general approach. Annals of Operations Research, 46(1):157–178, 1993.
[29] F. J. Luque. Asymptotic convergence analysis of the proximal point algorithm. SIAM Journal on Control and Optimization, 22(2):277–293, 1984.
[30] F. Meng, D. F. Sun, and G. Zhao. Semismoothness of solutions to generalized equations and the Moreau-Yosida regularization. Mathematical Programming, 104(2):561–581, 2005.
[31] R. Mifflin. Semismooth and semiconvex functions in constrained optimization. SIAM Journal on Control and Optimization, 15(6):959–972, 1977.
[32] J. Peng, J. Zhu, A. Bergamaschi, W. Han, D.-Y. Noh, J. R. Pollack, and P. Wang. Regularized multivariate regression for identifying master predictors with application to integrative genomics study of breast cancer. The Annals of Applied Statistics, 4(1):53, 2010.
[33] L. Qi and J. Sun. A nonsmooth version of Newton's method. Mathematical Programming, 58(1):353–367, 1993.
[34] Z. Qin, K. Scheinberg, and D. Goldfarb. Efficient block-coordinate descent algorithms for the group lasso. Mathematical Programming Computation, 5(2):143–169, 2013.
[35] P. Richtárik and M. Takáč. Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function. Mathematical Programming, 144(1-2):1–38, 2014.
[36] R. T. Rockafellar. Augmented Lagrangians and applications of the proximal point algorithm in convex programming. Mathematics of Operations Research, 1(2):97–116, 1976.
[37] R. T. Rockafellar. Monotone operators and the proximal point algorithm. SIAM Journal on Control and Optimization, 14(5):877–898, 1976.
[38] N. Simon, J. Friedman, T. Hastie, and R. Tibshirani. A sparse-group lasso. Journal of Computational and Graphical Statistics, 22(2):231–245, 2013.
[39] D. F. Sun. The strong second-order sufficient condition and constraint nondegeneracy in nonlinear semidefinite programming and their implications. Mathematics of Operations Research, 31(4):761–776, 2006.
[40] D. F. Sun and J. Sun. Semismooth matrix-valued functions. Mathematics of Operations Research, 27(1):150–169, 2002.
[41] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), pages 267–288, 1996.
[42] R. Tibshirani, M. Saunders, S. Rosset, J. Zhu, and K. Knight. Sparsity and smoothness via the fused lasso. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(1):91–108, 2005.
[43] Y.-L. Yu. On decomposing the proximal map. In Advances in Neural Information Processing Systems, pages 91–99, 2013.
[44] L. Yuan, J. Liu, and J. Ye. Efficient methods for overlapping group lasso. In Advances in Neural Information Processing Systems, pages 352–360, 2011.
[45] M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(1):49–67, 2006.
[46] H. Zhang, J. Jiang, and Z.-Q. Luo. On the linear convergence of a proximal gradient method for a class of nonsmooth convex minimization problems. Journal of the Operations Research Society of China, 1(2):163–186, 2013.
[47] X. Y. Zhao, D. F. Sun, and K.-C. Toh. A Newton-CG augmented Lagrangian method for semidefinite programming. SIAM Journal on Optimization, 20(4):1737–1765, 2010.
[48] Y. Zhou, J. Han, X. Yuan, Z. Wei, and R. Hong. Inverse sparse group lasso model for robust object tracking.