An Extrapolated Iteratively Reweighted ℓ1 Method with Complexity Analysis
Hao Wang (School of Information Science and Technology, ShanghaiTech University; [email protected])
Hao Zeng (School of Information Science and Technology, ShanghaiTech University; [email protected])
Jiashan Wang (Department of Mathematics, University of Washington; [email protected])

January 12, 2021
Abstract
The iteratively reweighted ℓ1 algorithm is a widely used method for solving various regularization problems, which generally minimize a differentiable loss function combined with a convex or nonconvex regularizer to induce sparsity in the solution. However, the convergence and complexity of iteratively reweighted ℓ1 algorithms are generally difficult to analyze, especially for regularizers that are not Lipschitz differentiable, such as ℓp norm regularization with 0 < p < 1. In this paper, we propose, analyze and test a reweighted ℓ1 algorithm combined with an extrapolation technique under the assumption that the objective satisfies the Kurdyka-Łojasiewicz (KL) property. Unlike existing iteratively reweighted ℓ1 algorithms with extrapolation, our method requires neither Lipschitz differentiability of the regularizer nor smoothing parameters in the weights that are bounded away from 0. We show that the whole sequence of iterates converges to a stationary point of the regularization problem and that the method has local linear complexity, a much stronger result than existing ones. Our numerical experiments show the efficiency of the proposed method.

Keywords: ℓp regularization; extrapolation techniques; iteratively reweighted ℓ1 methods; Kurdyka-Łojasiewicz property; non-Lipschitz regularization
Introduction

Recently, sparse regularization has received increasing attention among researchers due to its various important applications, e.g., compressed sensing, machine learning, and image processing [Figueiredo et al., 2007, Lustig et al., 2007, Jaggi, 2011, Mairal et al., 2010, 2007, Zeyde et al., 2010, Luo et al., 2017]. The goal of sparse regularization is to find sparse solutions of a mathematical model so that the model generalizes better to future data. A common approach is to add a regularizer term to the objective so that most components of the resulting solution are zero. In this paper, we focus on the ℓp-norm regularization problem of the form

min_{x ∈ R^n} F(x) := f(x) + λ‖x‖_p^p,   (P)

where f : R^n → R is continuously differentiable, p ∈ (0, 1)
and λ > 0, and the ℓp norm is defined as ‖x‖_p := (Σ_{i=1}^n |x_i|^p)^{1/p}. Compared with the ℓ1 regularizer, the ℓp regularizer is often believed to be a better approximation of the ℓ0 regularizer, i.e., the number of nonzeros of the involved vector. However, due to the nonsmooth and non-Lipschitz-differentiable nature of the ℓp norm, this problem is difficult to handle and analyze. In fact, it has been proven in Ge et al. [2011] to be strongly NP-hard.

The iteratively reweighted ℓ1 (IRL1) algorithm [Candes et al., 2008, Chen and Zhou, 2010, Chartrand and Yin, 2008, Yu and Pong, 2019, Lu et al., 2014, Wang et al., 2019] has been widely applied to solve regularization problems to induce sparsity in the solutions. It can easily handle various regularization terms, including the ℓp norm, log-sum [Lobo et al., 2007], SCAD [Fan and Li, 2001] and MCP [Zhang, 2010], by approximating the regularizer with a weighted ℓ1 norm in each iteration. For example, the technique proposed by Chen and Zhou [2010] and Lai and Wang [2011] adds a smoothing perturbation ε_i to each |x_i| to formulate an ε-approximation of the ℓp norm. In this case, the objective is replaced by

f(x) + λ Σ_{i=1}^n (|x_i| + ε_i)^p,   (1)

with prescribed ε > 0.
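To make the smoothing (1) and the reweighting it induces concrete, the following small NumPy sketch (ours, purely illustrative; the function names are not from the paper) evaluates the ε-smoothed ℓp term and the corresponding first-order weights, which appear as the coefficients of the linearization (2) below.

```python
import numpy as np

def smoothed_lp(x, eps, p):
    """The epsilon-smoothed l_p term sum_i (|x_i| + eps_i)^p, cf. (1)."""
    return np.sum((np.abs(x) + eps) ** p)

def irl1_weights(x, eps, p):
    """Weights p(|x_i| + eps_i)^(p-1) of the linearization, cf. (2)."""
    return p * (np.abs(x) + eps) ** (p - 1)

x = np.array([1.5, 0.0, -0.2])
eps = np.full(3, 0.1)
print(smoothed_lp(x, eps, 0.5))   # smooth surrogate of ||x||_p^p
print(irl1_weights(x, eps, 0.5))  # small |x_i| -> large weight
```

Note how a small |x_i| produces a large weight, so the next weighted ℓ1 subproblem pushes that component further toward zero; this is the mechanism behind the sparsity-inducing behavior of the reweighting scheme described next.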
At each iteration x^k, the iteratively reweighted ℓ1 method solves a subproblem in which each (|x_i| + ε_i)^p is replaced by its linearization at x^k,

p(|x_i^k| + ε_i)^{p−1} |x_i|.   (2)

Here a large ε can smooth out many local minimizers, while small values make the subproblems difficult to solve and leave the algorithm easily trapped in bad local minimizers. To obtain an accurate approximation of (P), Lu [2014] proposed a dynamic updating strategy that drives ε from a relatively large initial value to 0 as k → ∞. Recently, Wang et al. [2019] showed that the iterates generated by the IRL1 algorithm have locally stable sign values. Based on this, they also presented a novel updating strategy for ε_i that drives only the ε_i associated with the nonzero components of the limit point to 0 while keeping the others bounded away from 0.

Since Nesterov [1983] first proposed extrapolation techniques for gradient methods, many works have focused on analyzing and improving the convergence rate of first-order algorithms. In this approach, a linear combination of the previous two iterates is used to compute the next iterate. Nesterov's extrapolation techniques [Nesterov, 1983, 1998, 2009, 2013] have been widely applied to accelerate first-order methods for convex composite optimization problems; see, for example, Auslender and Teboulle [2006], Becker et al. [2011], Lan et al. [2011] and Tseng [2010]. Over the past decade, extrapolation has proven to be a successful acceleration approach for various algorithms. For example, Beck and Teboulle [2009] presented the fast iterative shrinkage-thresholding algorithm (FISTA) using this technique. As for IRL1 methods, Yu and Pong [2019] proposed several versions of IRL1 algorithms with extrapolation and analyzed their global convergence.

In view of the success of IRL1 combined with extrapolation, in this paper we propose and analyze the Extrapolated Proximal Iteratively Reweighted ℓ1 method (EIRL1) to solve the ℓp-norm regularization problem. We show the global convergence and the local complexity of the proposed method under the Kurdyka-Łojasiewicz (KL) property [Bolte et al., 2007a, 2014]; this property is a mild condition, generally believed to capture a broad spectrum of the local geometries that a nonconvex function can have, and it has been shown to hold ubiquitously for most practical functions. In particular, we show that our method converges to a first-order optimal solution of the ℓp regularization problem, and we establish local sublinear and linear convergence rates, a stronger result than most existing ones.

The proximal iteratively reweighted ℓ1 methods with extrapolation of Yu and Pong [2019] are the most closely related predecessors of our proposed method. The main differences of our work can be summarized as follows.

1. The algorithms in Yu and Pong [2019] are designed for solving problems with regularization term Σ_{i=1}^n r_i(|x_i|), where r_i is assumed to be smooth on R_{++}, concave and strictly increasing on R_+ with r_i(0) = 0. However, it is also assumed in Yu and Pong [2019] that lim_{|x_i|→0} r_i′(|x_i|) exists, meaning r_i is Lipschitz differentiable on R_{++}, which is not the case for the ℓp-norm regularization term.
2. The algorithms in Yu and Pong [2019] can be extended to the ℓp norm regularization problem by keeping the smoothing parameter ε bounded away from 0. In this case, the algorithms converge to an optimal solution of the approximated ℓp regularization problem instead of the original one. In contrast, our algorithm drives ε → 0.

Notation and Preliminaries

We denote by R and Q the sets of real and rational numbers, respectively. The set R^n is the real n-dimensional Euclidean space, with R^n_+ the nonnegative orthant of R^n and R^n_{++} the interior of R^n_+. On R^n, ‖·‖_p denotes the ℓp norm with p ∈ (0, +∞), i.e., ‖x‖_p = (Σ_{i=1}^n |x_i|^p)^{1/p}; note that for p ∈ (0, 1) this is not a true norm. If f : R^n → R̄ := R ∪ {+∞} is convex, then the subdifferential of f at x̄ is given by

∂f(x̄) := {z | f(x̄) + ⟨z, x − x̄⟩ ≤ f(x), ∀x ∈ R^n}.

In particular, for x ∈ R^n, ∂‖x‖_1 = {ξ ∈ R^n | ξ_i ∈ ∂|x_i|, i = 1, ..., n}. Given a lower semicontinuous function f, the limiting subdifferential at a ∈ dom f is defined as

∂̄f(a) := {z* = lim_{x^k → a, f(x^k) → f(a)} z^k, z^k ∈ ∂_F f(x^k)},

and the Fréchet subdifferential of f at a is defined as

∂_F f(a) := {z ∈ R^n | liminf_{x → a} [f(x) − f(a) − ⟨z, x − a⟩] / ‖x − a‖ ≥ 0}.

The Clarke subdifferential ∂_c f is the convex hull of the limiting subdifferential. It holds that ∂_F f(a) ⊂ ∂̄f(a) ⊂ ∂_c f(a). For convex functions, ∂f(a) = ∂_F f(a) = ∂̄f(a) = ∂_c f(a), and for differentiable f, ∂_F f(a) = ∂̄f(a) = ∂_c f(a) = {∇f(a)}.

For f : R^n → R and index sets A and I satisfying A ∪ I = {1, ..., n}, let f(x_A) be the function in the reduced space R^{|A|} obtained by fixing x_i = 0 for i ∈ I. For a, b ∈ R^n, a ≤ b means the inequality holds for each component, i.e., a_i ≤ b_i for i = 1, ..., n, and a ∘ b is the componentwise product of a and b, i.e., (a ∘ b)_i = a_i b_i for i = 1, ..., n. For a closed convex set χ ⊂ R^n, define the Euclidean distance of a point a ∈ R^n to χ as dist(a, χ) = min_{b ∈ χ} ‖a − b‖. Let {−1, 0, +1}^n be the set of vectors in R^n whose elements take values in {−1, 0, +1}. The support of x ∈ R^n is the set {i | x_i ≠ 0, i = 1, ..., n}.

The Kurdyka-Łojasiewicz property

The Kurdyka-Łojasiewicz property is applicable to a wide range of problems, such as nonsmooth semialgebraic minimization [Bolte et al., 2014], and serves as a basic assumption guaranteeing the convergence of many algorithms. For example, a series of convergence results for gradient descent methods is proved in Attouch et al. [2013] under the assumption that the objective satisfies the KL property. The definition of the Kurdyka-Łojasiewicz property is given below.
Definition 1 (Kurdyka-Łojasiewicz property). The function f : R^n → R ∪ {+∞} is said to have the Kurdyka-Łojasiewicz property at x* ∈ dom ∂̄f if there exist η ∈ (0, +∞], a neighborhood U of x* and a continuous concave function φ : [0, η) → R_+ such that:

(i) φ(0) = 0,
(ii) φ is C^1 on (0, η),
(iii) for all s ∈ (0, η), φ′(s) > 0,
(iv) for all x in U ∩ [f(x*) < f < f(x*) + η], the Kurdyka-Łojasiewicz inequality holds:

φ′(f(x) − f(x*)) dist(0, ∂̄f(x)) ≥ 1.

If f is smooth, then condition (iv) reduces to [Attouch et al., 2013]

‖∇(φ ∘ f)(x)‖ ≥ 1.
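As a concrete illustration of Definition 1 in the smooth case (our example, not from the original references), take f(x) = x²/2 on R with minimizer x* = 0 and φ(s) = √(2s). Then

φ′(s) = 1/√(2s),   φ′(f(x) − f(x*)) dist(0, ∂̄f(x)) = (1/√(x²)) · |x| = 1 ≥ 1 for all x ≠ 0,

so the KL inequality holds at x* with η = +∞. Note that φ(s) = √2 · s^{1/2} has the form c s^{1−θ} with θ = 1/2, anticipating the semialgebraic case discussed next.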
Of particular interest is the class of semialgebraic functions, which satisfy the KL property and cover most common mathematical programming objectives [Bolte et al., 2007a,b]. The definition of semialgebraic functions is provided below.
Definition 2 (Semialgebraic functions). A subset of R^n is called semialgebraic if it can be written as a finite union of sets of the form

{x ∈ R^n : h_i(x) = 0, q_i(x) < 0, i = 1, ..., p},

where the h_i, q_i are real polynomial functions. A function f : R^n → R ∪ {+∞} is semialgebraic if its graph is a semialgebraic subset of R^{n+1}.

Semialgebraic functions satisfy the KL property with φ(s) = c s^{1−θ} for some θ ∈ [0, 1) ∩ Q and some c > 0.
An Iteratively Reweighted ℓ1 Method with Extrapolation

In this section, we propose an extrapolated iteratively reweighted ℓ1 algorithm, hereinafter named EIRL1. The framework of this algorithm is presented in Algorithm 1.
Algorithm 1: Extrapolated Proximal Iteratively Reweighted ℓ1 Algorithm

Input: μ ∈ (0, 1), β > L_f, ε^0 ∈ R^n_{++}, 0 ≤ α_k ≤ ᾱ < 1, x^0.
Initialize: set k = 0, x^{−1} = x^0.
repeat
    Compute the new iterate:
        w_i^k = p(|x_i^k| + ε_i^k)^{p−1},   (3)
        y^k = x^k + α_k (x^k − x^{k−1}),   (4)
        x^{k+1} ← argmin_{x ∈ R^n} { ∇f(y^k)^T x + (β/2)‖x − y^k‖² + λ Σ_{i=1}^n w_i^k |x_i| }.   (5)
    Choose ε^{k+1} ≤ μ ε^k and 0 ≤ α_{k+1} ≤ ᾱ < 1.
    Set k ← k + 1.
until convergence

Define the smooth approximation F(x, ε) of F(x) with smoothing parameter ε, and the potential function ψ, as

F(x, ε) := f(x) + λ Σ_{i=1}^n (|x_i| + ε_i)^p,   ψ(x, y, ε) := F(x, ε) + (β/2)‖x − y‖².
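Since the weighted ℓ1 subproblem (5) is separable, it has the closed-form soft-thresholding solution x_i^{k+1} = sign(u_i) max(|u_i| − λw_i^k/β, 0) with u = y^k − ∇f(y^k)/β. The following Python sketch of Algorithm 1 uses this fact; it is our illustrative rendering, and the gradient oracle grad_f, the fixed choice of alpha and the default parameters are assumptions rather than specifications from the paper.

```python
import numpy as np

def soft_threshold(u, t):
    """Componentwise minimizer of 0.5*(x - u)**2 + t*|x|."""
    return np.sign(u) * np.maximum(np.abs(u) - t, 0.0)

def eirl1(grad_f, x0, p, lam, beta, eps0, mu=0.9, alpha=0.7, max_iter=500, tol=1e-6):
    """Sketch of Algorithm 1 (EIRL1) for min_x f(x) + lam * ||x||_p^p."""
    x_prev = x0.copy()
    x = x0.copy()                                   # x^{-1} = x^0
    eps = eps0.copy()
    for _ in range(max_iter):
        w = p * (np.abs(x) + eps) ** (p - 1)        # weights, cf. (3)
        y = x + alpha * (x - x_prev)                # extrapolation, cf. (4)
        u = y - grad_f(y) / beta
        x_new = soft_threshold(u, lam * w / beta)   # closed form of subproblem (5)
        if np.linalg.norm(x_new - x) <= tol * max(np.linalg.norm(x_new), 1.0):
            return x_new
        x_prev, x = x, x_new
        eps = mu * eps                              # eps^{k+1} <= mu * eps^k
    return x
```

For the least-squares loss f(x) = (1/2)‖Ax − y‖² used in the experiments below, one would pass grad_f = lambda x: A.T @ (A @ x - y).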
Before proceeding to the convergence analysis, we first provide some properties of our proposed method. In the remainder of this paper, we make the following assumptions about f and F.

Assumption 3. (i) f is Lipschitz differentiable with constant L_f ≥ 0. (ii) The initial point (x^0, ε^0) and β are chosen such that L(F^0) := {x | F(x) ≤ F^0} is bounded, where F^0 := F(x^0, ε^0) and β > L_f.

The following properties hold for Algorithm 1.

Lemma 4.
Suppose Assumption 3 holds and {x^k} is generated by Algorithm 1 for solving (P). Then the following statements hold.

(i) ψ(x^k, x^{k−1}, ε^k) − ψ(x^{k+1}, x^k, ε^{k+1}) ≥ (β/2)(1 − ᾱ²)‖x^k − x^{k−1}‖².
(ii) The sequence {x^k} ⊂ L(F^0) and is bounded.
(iii) lim_{k→∞} ‖x^{k+1} − x^k‖ = 0.
(iv) lim_{k→∞} ‖y^k − x^k‖ = 0 and lim_{k→∞} ‖y^{k−1} − x^k‖ = 0.

Proof. (i) Since x^{k+1} is the optimal solution of subproblem (5), there exists ξ^{k+1} ∈ ∂‖x^{k+1}‖_1 such that

0 = ∇f(y^k) + β(x^{k+1} − y^k) + λ w^k ∘ ξ^{k+1},   (6)

which, combined with the strong convexity of (5), yields

⟨∇f(y^k), x^{k+1}⟩ + (β/2)‖x^{k+1} − y^k‖² + λ Σ_{i=1}^n w_i^k |x_i^{k+1}|
  ≤ ⟨∇f(y^k), x^k⟩ + (β/2)‖x^k − y^k‖² + λ Σ_{i=1}^n w_i^k |x_i^k| − (β/2)‖x^{k+1} − x^k‖².   (7)

From the concavity of a ↦ a^p on R_{++}, we know that for any i ∈ {1, ..., n},

(|x_i^{k+1}| + ε_i^k)^p ≤ (|x_i^k| + ε_i^k)^p + p(|x_i^k| + ε_i^k)^{p−1}(|x_i^{k+1}| − |x_i^k|) = (|x_i^k| + ε_i^k)^p + w_i^k (|x_i^{k+1}| − |x_i^k|).

Summing this inequality over all i yields

Σ_{i=1}^n (|x_i^{k+1}| + ε_i^k)^p ≤ Σ_{i=1}^n (|x_i^k| + ε_i^k)^p + Σ_{i=1}^n w_i^k (|x_i^{k+1}| − |x_i^k|).   (8)

Combining (7) with (8),

⟨∇f(y^k), x^{k+1}⟩ + (β/2)‖x^{k+1} − y^k‖² + λ Σ_{i=1}^n (|x_i^{k+1}| + ε_i^k)^p
  ≤ ⟨∇f(y^k), x^k⟩ + (β/2)‖x^k − y^k‖² + λ Σ_{i=1}^n (|x_i^k| + ε_i^k)^p − (β/2)‖x^{k+1} − x^k‖².   (9)

It then follows that

F(x^{k+1}, ε^{k+1}) = f(x^{k+1}) + λ Σ_{i=1}^n (|x_i^{k+1}| + ε_i^{k+1})^p
  ≤ f(y^k) + ⟨∇f(y^k), x^{k+1} − y^k⟩ + (L_f/2)‖x^{k+1} − y^k‖² + λ Σ_{i=1}^n (|x_i^{k+1}| + ε_i^k)^p
  ≤ f(y^k) + ⟨∇f(y^k), x^{k+1} − y^k⟩ + (β/2)‖x^{k+1} − y^k‖² + λ Σ_{i=1}^n (|x_i^{k+1}| + ε_i^k)^p
  ≤ f(y^k) + ⟨∇f(y^k), x^k − y^k⟩ + (β/2)‖x^k − y^k‖² + λ Σ_{i=1}^n (|x_i^k| + ε_i^k)^p − (β/2)‖x^{k+1} − x^k‖²
  ≤ f(x^k) + λ Σ_{i=1}^n (|x_i^k| + ε_i^k)^p + (β/2)‖x^k − y^k‖² − (β/2)‖x^{k+1} − x^k‖²
  = F(x^k, ε^k) + (β/2)‖x^k − y^k‖² − (β/2)‖x^{k+1} − x^k‖²,

where the first inequality follows from the Lipschitz differentiability of f together with ε^{k+1} ≤ ε^k, the second inequality is by β > L_f, the third follows from (9), and the last is by the convexity of f. Since ‖x^k − y^k‖ = α_k‖x^k − x^{k−1}‖ by the definition of y^k, this means

F(x^{k+1}, ε^{k+1}) ≤ F(x^k, ε^k) + (β/2)α_k²‖x^k − x^{k−1}‖² − (β/2)‖x^{k+1} − x^k‖²,

which implies

ψ(x^k, x^{k−1}, ε^k) − ψ(x^{k+1}, x^k, ε^{k+1})
  = F(x^k, ε^k) + (β/2)‖x^k − x^{k−1}‖² − [F(x^{k+1}, ε^{k+1}) + (β/2)‖x^{k+1} − x^k‖²]
  ≥ (β/2)(1 − α_k²)‖x^k − x^{k−1}‖²
  ≥ (β/2)(1 − ᾱ²)‖x^k − x^{k−1}‖²   (10)

by {α_k} ⊂ [0, ᾱ]. We then deduce from (10) and 0 ≤ ᾱ < 1 that {F(x^k, ε^k) + (β/2)‖x^k − x^{k−1}‖²} is monotonically decreasing.
This proves part (i).

(ii) With x^0 = x^{−1}, we know that for all k ≥ 0,

F(x^k) ≤ F(x^k, ε^k) ≤ F(x^k, ε^k) + (β/2)‖x^k − x^{k−1}‖² ≤ F(x^0, ε^0) = F^0.

Under Assumption 3(ii), we conclude that {x^k} ⊂ L(F^0) and is bounded. This completes the proof of part (ii).

(iii) Summing both sides of (10) from k = 0 to t, we obtain

(β/2)(1 − ᾱ²) Σ_{k=0}^t ‖x^k − x^{k−1}‖² ≤ ψ(x^0, x^{−1}, ε^0) − ψ(x^{t+1}, x^t, ε^{t+1}) ≤ F(x^0, ε^0) − F̲,

where F̲ denotes a lower bound of F on the bounded level set L(F^0), yielding lim_{k→∞} ‖x^{k+1} − x^k‖ = 0. Therefore, part (iii) is true.

(iv) This part is straightforward from part (iii) by noticing that

y^k − x^k = α_k(x^k − x^{k−1}) → 0 and y^{k−1} − x^k = (x^{k−1} − x^k) + α_{k−1}(x^{k−1} − x^{k−2}) → 0. ∎

Using arguments similar to those of Wang et al. [2019] together with Lemma 4, we can also obtain the local stable support and sign results of Wang et al. [2019], which are listed below. They show that, after some iteration K, {x^k}_{k>K} stays in the same orthant and its nonzero components are bounded away from 0. Here A(x) := {i | x_i = 0} denotes the set of zero components of x and I(x) := {i | x_i ≠ 0} denotes its support.

Theorem 5.
Suppose Assumption 3 holds and {x^k} is generated by Algorithm 1. Then there exist C > 0 and K ∈ N such that the following statements hold.

(i) If w_i^k̃ > C/λ for some k̃, then x_i^k ≡ 0 for all k > k̃.

(ii) There exist index sets I* ∪ A* = {1, ..., n} such that I(x^k) ≡ I* and A(x^k) ≡ A* for any k > K.

(iii) For each i ∈ I* and any k > K,

|x_i^k| ≥ (C/(pλ))^{1/(p−1)} − ε_i^k > 0.   (11)

(iv) For any limit point x* of {x^k}, it holds that I(x*) = I*, A(x*) = A* and

|x_i*| ≥ (C/(pλ))^{1/(p−1)}, i ∈ I*.   (12)

(v) There exists s ∈ {−1, 0, +1}^n such that sign(x^k) ≡ s for any k > K.

Proof. The boundedness of {x^k} (Lemma 4(ii)) implies that {y^k} is also bounded. Hence there must exist C > 0 such that for all k,

‖∇f(y^k) + β(x^{k+1} − y^k)‖_∞ < C.   (13)

(i) If w_i^k̃ > C/λ for some k̃ ∈ N, then the optimality condition (6) implies x_i^{k̃+1} = 0; otherwise we would have |∇_i f(y^k̃) + β(x_i^{k̃+1} − y_i^k̃)| = λ w_i^k̃ > C, contradicting (13). The monotonicity of (·)^{p−1} and 0 + ε_i^{k̃+1} ≤ |x_i^k̃| + ε_i^k̃ yield

w_i^{k̃+1} = p(0 + ε_i^{k̃+1})^{p−1} ≥ p(|x_i^k̃| + ε_i^k̃)^{p−1} = w_i^k̃ > C/λ.

By induction, we know that x_i^k ≡ 0 for all k > k̃. This completes the proof of (i).

(ii) Suppose by contradiction that this statement is not true. Then there exists j ∈ {1, ..., n} such that {x_j^k} takes zero and nonzero values both infinitely many times. Hence, there exist subsequences S_0 ∪ S_1 = N with |S_0| = ∞ and |S_1| = ∞ such that x_j^k = 0 for all k ∈ S_0 and x_j^k ≠ 0 for all k ∈ S_1. Since {ε_j^k} is monotonically decreasing to 0, there exists k̃ ∈ S_0 such that

w_j^k̃ = p(|x_j^k̃| + ε_j^k̃)^{p−1} = p(ε_j^k̃)^{p−1} > C/λ.

It follows from (i) that x_j^k ≡ 0 for all k > k̃, which implies {k̃+1, k̃+2, ...} ⊂ S_0 and |S_1| < ∞. This contradicts the assumption |S_1| = ∞. Hence, (ii) is true.

(iii) Combining (i) and (ii), we know that for any i ∈ I* and k > K, w_i^k ≤ C/λ, which is equivalent to (11). This proves (iii).

(iv) By (ii), A* ⊆ A(x*) for any limit point x*. It follows from (ii) and (iii) that I* ⊆ I(x*) for any limit point x*. Hence, A* = A(x*) and I* = I(x*) for any limit point x*, since A* ∪ I* = {1, ..., n}. Letting k → ∞ in (11) and using ε^k → 0 gives (12).

(v) By (iii) and Lemma 4(iii), there exists a sufficiently large k̄ such that for any k > k̄,

|x_i^k| > ε̄ := (1/2)(C/(pλ))^{1/(p−1)} for all i ∈ I*,   (14)

and

‖x^{k+1} − x^k‖ < ε̄.   (15)

We prove (v) by contradiction. Assume there exists j ∈ I* such that the sign of x_j changes after k̄. Then there must be k̂ ≥ k̄ such that x_j^k̂ x_j^{k̂+1} < 0.
It follows that

‖x^{k̂+1} − x^k̂‖ ≥ |x_j^{k̂+1} − x_j^k̂| = √((x_j^{k̂+1})² + (x_j^k̂)² − 2 x_j^k̂ x_j^{k̂+1}) > √(ε̄² + ε̄²) = √2 ε̄,

where the last inequality is by (14). This contradicts (15); hence the iterates {x^k}_{k>k̄} all have the same sign. Without loss of generality, we can reselect K = k̄, and then (v) holds true. ∎

Defining χ as the set of all cluster points of {x^k}, we now show the global convergence of Algorithm 1.

Theorem 6 (Global convergence). Suppose Assumption 3 holds and {x^k} is generated by Algorithm 1. The following statements hold true.

(i) F attains the same value at every cluster point of {x^k}, i.e., there exists ζ ∈ R such that F(x*, 0) = ζ for any x* ∈ χ.
(ii) Each point x* ∈ χ is a stationary point of F(x, 0).

Proof. (i) For any x* ∈ χ with subsequence {x^k}_S → x*, from Lemma 4 we know that

lim_{k→∞, k∈S} F(x^k, ε^k) = lim_{k→∞, k∈S} [F(x^k, ε^k) + (β/2)‖x^k − x^{k−1}‖²] = lim_{k→∞, k∈S} ψ(x^k, x^{k−1}, ε^k).

By the monotonicity of ψ(x^k, x^{k−1}, ε^k) from Lemma 4(i), there is a unique limit value ζ, i.e., ζ = lim_{k→∞} F(x^k, ε^k), and since ε^k → 0, F(x*, 0) = ζ for any x* ∈ χ.
(ii) Let x* be a limit point of {x^k} with subsequence {x^k}_S → x*. We have, for any i ∈ I(x*),

∇_i f(x*) + λp|x_i*|^{p−1} sign(x_i*)
  = lim_{k→∞, k∈S} [∇_i f(x^k) + λp(|x_i^k| + ε_i^k)^{p−1} sign(x_i^k)]
  = lim_{k→∞, k∈S} [∇_i f(y^k) + λp(|x_i^k| + ε_i^k)^{p−1} sign(x_i^{k+1})]
  = lim_{k→∞, k∈S} −β(x_i^{k+1} − y_i^k)
  = 0,

where the first and second equalities are by Lemma 4(iii)-(iv) and ε^k → 0, the third equality is due to x^{k+1} satisfying the optimality condition of the subproblem for k > K,

∇_i f(y^k) + β(x_i^{k+1} − y_i^k) + λp(|x_i^k| + ε_i^k)^{p−1} sign(x_i^{k+1}) = 0, i ∈ I(x*),   (16)

and the last equality is by Lemma 4(iv). Therefore, x* is first-order optimal, completing the proof. ∎

To further analyze the properties of {(x^k, ε^k)}, denote δ_i = √ε_i (since ε_i is restricted to be nonnegative) and write F and ψ as functions of (x, δ) for simplicity, i.e.,

F(x, δ) = f(x) + λ Σ_{i=1}^n (|x_i| + δ_i²)^p,
ψ(x, y, δ) = f(x) + λ Σ_{i=1}^n (|x_i| + δ_i²)^p + (β/2)‖x − y‖².

We next analyze the convergence of {x^k} under the KL property. Notice that after the Kth iteration, the iterates {x_{I*}^k} remain in the interior of the same orthant of R^{|I*|} and are bounded away from the axes by Theorem 5. Hence we can assume that the reduced function F(x_{I*}, δ_{I*}) has the KL property at (x*_{I*}, 0_{I*}) ∈ R^{2|I*|}, which is a weaker condition than assuming the KL property of F(x, δ)
at (x*, 0) ∈ R^{2n}.

Assumption 7.
Suppose ψ has the KL property at every (x*_{I*}, x*_{I*}, 0_{I*}) ∈ R^{3|I*|}, ∀x* ∈ χ.

By Theorem 5, for all sufficiently large k, the components of {x_{I*}^k} are all uniformly bounded away from 0 and x_{A*}^k ≡ 0.
We have the following properties of ψ.

Lemma 8.
Let {x^k} be the sequence generated by Algorithm 1 and suppose Assumption 7 is satisfied. For sufficiently large k, the following statements hold.

(i) There exists D₂ > 0 such that

‖∇ψ(x_{I*}^k, x_{I*}^{k−1}, δ_{I*}^k)‖ ≤ D₂ (‖x_{I*}^k − x_{I*}^{k−1}‖ + ‖x_{I*}^{k−1} − x_{I*}^{k−2}‖ + ‖δ_{I*}^{k−1}‖₁ − ‖δ_{I*}^k‖₁),

so that lim_{k→∞} ‖∇ψ(x_{I*}^k, x_{I*}^{k−1}, δ_{I*}^k)‖ = 0.

(ii) {ψ(x_{I*}^k, x_{I*}^{k−1}, δ_{I*}^k)} is monotonically decreasing, and there exists D₁ > 0 such that

ψ(x_{I*}^k, x_{I*}^{k−1}, δ_{I*}^k) − ψ(x_{I*}^{k+1}, x_{I*}^k, δ_{I*}^{k+1}) ≥ D₁ ‖x_{I*}^k − x_{I*}^{k−1}‖².

(iii) ψ(x*_{I*}, x*_{I*}, 0_{I*}) = ζ := lim_{k→∞} ψ(x_{I*}^k, x_{I*}^{k−1}, δ_{I*}^k) on Γ, where Γ is the set of cluster points of {(x_{I*}^k, x_{I*}^{k−1}, δ_{I*}^k)}, i.e., Γ := {(x*_{I*}, x*_{I*}, 0_{I*}) | x* ∈ χ}.

Proof. (i) Notice that for sufficiently large k, the gradient of ψ at (x_{I*}^k, x_{I*}^{k−1}, δ_{I*}^k) is

∇_x ψ(x_{I*}^k, x_{I*}^{k−1}, δ_{I*}^k) = ∇f(x_{I*}^k) + β(x_{I*}^k − x_{I*}^{k−1}) + λ w_{I*}^k ∘ sign(x_{I*}^k),
∇_y ψ(x_{I*}^k, x_{I*}^{k−1}, δ_{I*}^k) = −β(x_{I*}^k − x_{I*}^{k−1}),
∇_δ ψ(x_{I*}^k, x_{I*}^{k−1}, δ_{I*}^k) = 2λ w_{I*}^k ∘ δ_{I*}^k.   (17)

We first derive an upper bound for ‖∇_x ψ(x_{I*}^k, x_{I*}^{k−1}, δ_{I*}^k)‖. The first-order optimality condition of the (k−1)st subproblem at x_{I*}^k is

∇f(y_{I*}^{k−1}) + β(x_{I*}^k − y_{I*}^{k−1}) + λ w_{I*}^{k−1} ∘ sign(x_{I*}^k) = 0.

Hence, we have

∇_x ψ(x_{I*}^k, x_{I*}^{k−1}, δ_{I*}^k) = ∇f(x_{I*}^k) − ∇f(y_{I*}^{k−1}) + β(y_{I*}^{k−1} − x_{I*}^{k−1}) + λ(w_{I*}^k − w_{I*}^{k−1}) ∘ sign(x_{I*}^k).   (18)

By the Lipschitz differentiability of f and the definition of y^{k−1},

‖∇f(x_{I*}^k) − ∇f(y_{I*}^{k−1})‖ ≤ L_f ‖x_{I*}^k − x_{I*}^{k−1} − α_{k−1}(x_{I*}^{k−1} − x_{I*}^{k−2})‖ ≤ L_f ‖x_{I*}^k − x_{I*}^{k−1}‖ + L_f ᾱ ‖x_{I*}^{k−1} − x_{I*}^{k−2}‖,   (19)

and

‖β(y_{I*}^{k−1} − x_{I*}^{k−1})‖ = β‖α_{k−1}(x_{I*}^{k−1} − x_{I*}^{k−2})‖ ≤ β ᾱ ‖x_{I*}^{k−1} − x_{I*}^{k−2}‖.   (20)

On the other hand, by Lagrange's mean value theorem, for each i ∈ I*,

|(w_i^k − w_i^{k−1}) · sign(x_i^k)| = |w_i^k − w_i^{k−1}|
  = |p(|x_i^k| + (δ_i^k)²)^{p−1} − p(|x_i^{k−1}| + (δ_i^{k−1})²)^{p−1}|
  = p(1 − p)(c_i^k)^{p−2} | |x_i^k| − |x_i^{k−1}| + (δ_i^k)² − (δ_i^{k−1})² |
  ≤ p(1 − p)(c_i^k)^{p−2} [ |x_i^k − x_i^{k−1}| + (δ_i^{k−1})² − (δ_i^k)² ]
  ≤ p(1 − p)(c_i^k)^{p−2} [ |x_i^k − x_i^{k−1}| + 2δ_i^0(δ_i^{k−1} − δ_i^k) ],

where the first equality uses x_i^k ≠ 0 and c_i^k lies between |x_i^k| + (δ_i^k)² and |x_i^{k−1}| + (δ_i^{k−1})². It then follows that

‖(w_{I*}^k − w_{I*}^{k−1}) ∘ sign(x_{I*}^k)‖ ≤ ‖(w_{I*}^k − w_{I*}^{k−1}) ∘ sign(x_{I*}^k)‖₁
  ≤ C̄ ( ‖x_{I*}^k − x_{I*}^{k−1}‖₁ + 2‖δ_{I*}^0‖_∞ (‖δ_{I*}^{k−1}‖₁ − ‖δ_{I*}^k‖₁) )
  ≤ C̄ ( √n ‖x_{I*}^k − x_{I*}^{k−1}‖ + 2‖δ_{I*}^0‖_∞ (‖δ_{I*}^{k−1}‖₁ − ‖δ_{I*}^k‖₁) ),   (21)

where the second inequality uses (c_i^k)^{p−2} ≤ (pλ/C)^{(p−2)/(1−p)} from Theorem 5, with C̄ := p(1 − p)(pλ/C)^{(p−2)/(1−p)}. Combining (18), (19), (20) and (21), we know

‖∇_x ψ(x_{I*}^k, x_{I*}^{k−1}, δ_{I*}^k)‖ ≤ (L_f + λC̄√n)‖x_{I*}^k − x_{I*}^{k−1}‖ + ᾱ(L_f + β)‖x_{I*}^{k−1} − x_{I*}^{k−2}‖ + 2λC̄‖δ_{I*}^0‖_∞ (‖δ_{I*}^{k−1}‖₁ − ‖δ_{I*}^k‖₁).   (22)

On the other hand, we have from (17) that

‖∇_y ψ(x_{I*}^k, x_{I*}^{k−1}, δ_{I*}^k)‖ = β‖x_{I*}^k − x_{I*}^{k−1}‖,   (23)

and

‖∇_δ ψ(x_{I*}^k, x_{I*}^{k−1}, δ_{I*}^k)‖ ≤ ‖∇_δ ψ(x_{I*}^k, x_{I*}^{k−1}, δ_{I*}^k)‖₁ = Σ_{i∈I*} 2λ w_i^k δ_i^k ≤ Σ_{i∈I*} 2λ (C/λ) (√μ/(1−√μ))(δ_i^{k−1} − δ_i^k) ≤ (2C√μ/(1−√μ)) (‖δ_{I*}^{k−1}‖₁ − ‖δ_{I*}^k‖₁),   (24)

where the second inequality is by Theorem 5 (w_i^k ≤ C/λ for i ∈ I*) and by δ^k ≤ √μ δ^{k−1}, which implies δ_i^k ≤ (√μ/(1−√μ))(δ_i^{k−1} − δ_i^k). Overall, we obtain from (22), (23) and (24) that part (i) holds true with

D₂ = max( β + L_f + λC̄√n, ᾱ(L_f + β), 2λC̄‖δ_{I*}^0‖_∞ + 2C√μ/(1−√μ) ).

The limit lim_{k→∞} ‖∇ψ(x_{I*}^k, x_{I*}^{k−1}, δ_{I*}^k)‖ = 0 then follows from Lemma 4(iii) and δ^k → 0.

Parts (ii) and (iii) follow directly from Lemma 4(i) with D₁ = β(1 − ᾱ²)/2. ∎

Theorem 9.
Let {x^k} be the sequence generated by Algorithm 1 and suppose Assumption 7 is satisfied. Then {x^k} converges to a stationary point of F(x, 0); moreover,

Σ_{k=0}^∞ ‖x^k − x^{k−1}‖ < ∞.

Proof. By Theorem 6, it suffices to show that {x^k} has a unique cluster point.

By Lemma 8, ψ(x_{I*}^k, x_{I*}^{k−1}, δ_{I*}^k) is monotonically decreasing and converges to ζ. If ψ(x_{I*}^{k₀}, x_{I*}^{k₀−1}, δ_{I*}^{k₀}) = ζ for some k₀, then from Lemma 8(ii) we know that x_{I*}^k = x_{I*}^{k₀} for all k > k₀, meaning {x^k} is eventually constant and equal to some x* ∈ χ, so the proof is done.

We next consider the case that ψ(x_{I*}^k, x_{I*}^{k−1}, δ_{I*}^k) > ζ for all k. Since ψ has the KL property at every (x*_{I*}, x*_{I*}, 0_{I*}) ∈ Γ, there exist a continuous concave function φ with η > 0 and a constant τ > 0 defining U = {(x_{I*}, y_{I*}, δ_{I*}) | dist((x_{I*}, y_{I*}, δ_{I*}), Γ) < τ} such that

φ′(ψ(x_{I*}, y_{I*}, δ_{I*}) − ζ) dist(0, ∂ψ(x_{I*}, y_{I*}, δ_{I*})) ≥ 1   (25)

for all (x_{I*}, y_{I*}, δ_{I*}) ∈ U ∩ {(x_{I*}, y_{I*}, δ_{I*}) | ζ < ψ(x_{I*}, y_{I*}, δ_{I*}) < ζ + η}. Since Γ is the set of limit points of {(x_{I*}^k, x_{I*}^{k−1}, δ_{I*}^k)}, by Lemma 4(ii) we have

lim_{k→∞} dist((x_{I*}^k, x_{I*}^{k−1}, δ_{I*}^k), Γ) = 0.

Hence, there exists k₁ ∈ N such that dist((x_{I*}^k, x_{I*}^{k−1}, δ_{I*}^k), Γ) < τ for any k > k₁. On the other hand, since {ψ(x_{I*}^k, x_{I*}^{k−1}, δ_{I*}^k)} is monotonically decreasing and converges to ζ, there exists k₂ ∈ N such that ζ < ψ(x_{I*}^k, x_{I*}^{k−1}, δ_{I*}^k) < ζ + η for all k > k₂. Letting k̄ = max{k₁, k₂} and noticing that ψ is smooth at (x_{I*}^k, x_{I*}^{k−1}, δ_{I*}^k) for all k > k̄, we know from (25) that

φ′(ψ(x_{I*}^k, x_{I*}^{k−1}, δ_{I*}^k) − ζ) ‖∇ψ(x_{I*}^k, x_{I*}^{k−1}, δ_{I*}^k)‖ ≥ 1 for all k ≥ k̄.

It follows that for any k ≥ k̄,

[φ(ψ(x_{I*}^k, x_{I*}^{k−1}, δ_{I*}^k) − ζ) − φ(ψ(x_{I*}^{k+1}, x_{I*}^k, δ_{I*}^{k+1}) − ζ)] · D₂[‖x_{I*}^k − x_{I*}^{k−1}‖ + ‖x_{I*}^{k−1} − x_{I*}^{k−2}‖ + ‖δ_{I*}^{k−1}‖₁ − ‖δ_{I*}^k‖₁]
  ≥ [φ(ψ(x_{I*}^k, x_{I*}^{k−1}, δ_{I*}^k) − ζ) − φ(ψ(x_{I*}^{k+1}, x_{I*}^k, δ_{I*}^{k+1}) − ζ)] ‖∇ψ(x_{I*}^k, x_{I*}^{k−1}, δ_{I*}^k)‖
  ≥ φ′(ψ(x_{I*}^k, x_{I*}^{k−1}, δ_{I*}^k) − ζ) ‖∇ψ(x_{I*}^k, x_{I*}^{k−1}, δ_{I*}^k)‖ · [ψ(x_{I*}^k, x_{I*}^{k−1}, δ_{I*}^k) − ψ(x_{I*}^{k+1}, x_{I*}^k, δ_{I*}^{k+1})]
  ≥ ψ(x_{I*}^k, x_{I*}^{k−1}, δ_{I*}^k) − ψ(x_{I*}^{k+1}, x_{I*}^k, δ_{I*}^{k+1})
  ≥ D₁ ‖x_{I*}^k − x_{I*}^{k−1}‖²,

where the first inequality is by Lemma 8(i), the second is by the concavity of φ, the third is by (25), and the fourth is by Lemma 8(ii). Rearranging, taking square roots of both sides, and using the inequality of arithmetic and geometric means, we have

‖x_{I*}^k − x_{I*}^{k−1}‖ ≤ √((D₂/D₁)[φ(ψ(x_{I*}^k, x_{I*}^{k−1}, δ_{I*}^k) − ζ) − φ(ψ(x_{I*}^{k+1}, x_{I*}^k, δ_{I*}^{k+1}) − ζ)]) × √(‖x_{I*}^k − x_{I*}^{k−1}‖ + ‖x_{I*}^{k−1} − x_{I*}^{k−2}‖ + ‖δ_{I*}^{k−1}‖₁ − ‖δ_{I*}^k‖₁)
  ≤ (D₂/D₁)[φ(ψ(x_{I*}^k, x_{I*}^{k−1}, δ_{I*}^k) − ζ) − φ(ψ(x_{I*}^{k+1}, x_{I*}^k, δ_{I*}^{k+1}) − ζ)] + (1/4)[‖x_{I*}^k − x_{I*}^{k−1}‖ + ‖x_{I*}^{k−1} − x_{I*}^{k−2}‖ + ‖δ_{I*}^{k−1}‖₁ − ‖δ_{I*}^k‖₁].
Subtracting (1/4)‖x_{I*}^k − x_{I*}^{k−1}‖ from both sides, we have

(1/2)‖x_{I*}^k − x_{I*}^{k−1}‖ ≤ (D₂/D₁)[φ(ψ(x_{I*}^k, x_{I*}^{k−1}, δ_{I*}^k) − ζ) − φ(ψ(x_{I*}^{k+1}, x_{I*}^k, δ_{I*}^{k+1}) − ζ)] + (1/4)(‖x_{I*}^{k−1} − x_{I*}^{k−2}‖ − ‖x_{I*}^k − x_{I*}^{k−1}‖) + (1/4)(‖δ_{I*}^{k−1}‖₁ − ‖δ_{I*}^k‖₁).

Summing this inequality from k̄ to t, we have

Σ_{k=k̄}^t (1/2)‖x_{I*}^k − x_{I*}^{k−1}‖ ≤ (D₂/D₁)[φ(ψ(x_{I*}^k̄, x_{I*}^{k̄−1}, δ_{I*}^k̄) − ζ) − φ(ψ(x_{I*}^{t+1}, x_{I*}^t, δ_{I*}^{t+1}) − ζ)] + (1/4)(‖x_{I*}^{k̄−1} − x_{I*}^{k̄−2}‖ − ‖x_{I*}^t − x_{I*}^{t−1}‖) + (1/4)(‖δ_{I*}^{k̄−1}‖₁ − ‖δ_{I*}^t‖₁).

Now letting t → ∞, we have ‖δ_{I*}^t‖₁ → 0, ‖x_{I*}^t − x_{I*}^{t−1}‖ → 0 and φ(ψ(x_{I*}^{t+1}, x_{I*}^t, δ_{I*}^{t+1}) − ζ) → φ(ζ − ζ) = φ(0) = 0. Therefore,

Σ_{k=k̄}^∞ ‖x_{I*}^k − x_{I*}^{k−1}‖ ≤ (2D₂/D₁) φ(ψ(x_{I*}^k̄, x_{I*}^{k̄−1}, δ_{I*}^k̄) − ζ) + (1/2)(‖x_{I*}^{k̄−1} − x_{I*}^{k̄−2}‖ + ‖δ_{I*}^{k̄−1}‖₁).   (26)

Since I* contains the support of {x^k}_{k≥k̄} by Theorem 5,

Σ_{k=k̄}^∞ ‖x^k − x^{k−1}‖ = Σ_{k=k̄}^∞ ‖x_{I*}^k − x_{I*}^{k−1}‖ < ∞.

This implies that {x^k} is a Cauchy sequence and consequently convergent. ∎

We now investigate the local convergence rate of Algorithm 1 by assuming that ψ has the KL property at (x*_{I*}, x*_{I*}, 0_{I*}) with φ in the KL definition taking the form φ(s) = c s^{1−θ} for some θ ∈ [0, 1) and c > 0.
By the discussion of semialgebraic functions above, since p ∈ Q, Σ_{i∈I*}(|x_i| + δ_i²)^p is semialgebraic around (x*_{I*}, 0_{I*}) by Wakabayashi [2008]. Therefore, ψ(x_{I*}, y_{I*}, δ_{I*}) is semialgebraic around (x*_{I*}, x*_{I*}, 0_{I*}) whenever f(x_{I*}) is semialgebraic in a neighborhood of x*_{I*}.

We now show that Algorithm 1 has local linear convergence.

Theorem 10.
Suppose {x^k} is generated by Algorithm 1 and converges to x*. Assume that ψ(x_{I*}, y_{I*}, δ_{I*}) has the KL property at (x*_{I*}, x*_{I*}, 0_{I*}) with φ in the KL definition taking the form φ(s) = c s^{1−θ} for some θ ∈ [0, 1) and c > 0. Then the following statements hold.

(i) If θ = 0, then there exists k₀ ∈ N such that x^k ≡ x* for any k > k₀.

(ii) If θ ∈ (0, 1/2], then there exist γ ∈ (0, 1), c₁ > 0 and c₂ > 0 such that

‖x^k − x*‖ < c₁ γ^k − c₂ ‖δ_{I*}^k‖₁   (27)

for sufficiently large k.

(iii) If θ ∈ (1/2, 1), then there exist c₃ > 0 and c₄ > 0 such that

‖x^k − x*‖ < c₃ k^{−(1−θ)/(2θ−1)} − c₄ ‖δ_{I*}^k‖₁   (28)

for sufficiently large k.

Proof. (i) If θ = 0, then φ(s) = cs and φ′(s) ≡ c. We claim that there must exist k₀ > 0 such that ψ(x_{I*}^{k₀}, x_{I*}^{k₀−1}, δ_{I*}^{k₀}) = ζ. Suppose by contradiction that this is not true, so that ψ(x_{I*}^k, x_{I*}^{k−1}, δ_{I*}^k) > ζ for all k. Since lim_{k→∞} x^k = x* and the sequence {ψ(x_{I*}^k, x_{I*}^{k−1}, δ_{I*}^k)} is monotonically decreasing to ζ by Lemma 8, the KL inequality implies that for all sufficiently large k,

c ‖∇ψ(x_{I*}^k, x_{I*}^{k−1}, δ_{I*}^k)‖ ≥ 1,

which contradicts ‖∇ψ(x_{I*}^k, x_{I*}^{k−1}, δ_{I*}^k)‖ → 0 from Lemma 8(i). Hence there exists k₀ ∈ N such that ψ(x_{I*}^k, x_{I*}^{k−1}, δ_{I*}^k) = ψ(x_{I*}^{k₀}, x_{I*}^{k₀−1}, δ_{I*}^{k₀}) = ζ for all k > k₀. We then conclude from Lemma 8(ii) that x_{I*}^{k+1} = x_{I*}^k for all k > k₀, meaning x^k ≡ x* for all k > k₀. This proves (i).

(ii)-(iii) Now consider θ ∈ (0, 1). If there exists k₀ ∈ N such that ψ(x_{I*}^{k₀}, x_{I*}^{k₀−1}, δ_{I*}^{k₀}) = ζ, then, using the same argument as in the proof of (i), {x^k} converges finitely. Thus we only need to consider the case that ψ(x_{I*}^k, x_{I*}^{k−1}, δ_{I*}^k) > ζ for all k.

Define S_k = Σ_{l=k}^∞ ‖x_{I*}^{l+1} − x_{I*}^l‖, which is finite by Theorem 9. It holds that

‖x_{I*}^k − x*_{I*}‖ = ‖x_{I*}^k − lim_{t→∞} x_{I*}^t‖ = ‖lim_{t→∞} Σ_{l=k}^t (x_{I*}^{l+1} − x_{I*}^l)‖ ≤ S_k.

Therefore, we only have to prove that S_k has the same upper bounds as in (27) and (28).

To derive the upper bound for S_k, we continue from (26). For any k > k̄ with k̄ defined in the proof of Theorem 9, we can use the same argument as in the derivation of (26) to obtain

S_k ≤ (2D₂/D₁) φ(ψ(x_{I*}^k, x_{I*}^{k−1}, δ_{I*}^k) − ζ) + (1/2)‖x_{I*}^{k−1} − x_{I*}^{k−2}‖ + (1/2)‖δ_{I*}^{k−1}‖₁
    = (2D₂/D₁) φ(ψ(x_{I*}^k, x_{I*}^{k−1}, δ_{I*}^k) − ζ) + (1/2)(S_{k−2} − S_{k−1}) + (1/2)‖δ_{I*}^{k−1}‖₁.   (29)

By the KL inequality with φ′(s) = c(1 − θ)s^{−θ}, for k > k̄,

c(1 − θ)(ψ(x_{I*}^k, x_{I*}^{k−1}, δ_{I*}^k) − ζ)^{−θ} ‖∇ψ(x_{I*}^k, x_{I*}^{k−1}, δ_{I*}^k)‖ ≥ 1.   (30)
On the other hand, using Lemma 8(i) and the definition of S_k, we see that for all sufficiently large k,

‖∇ψ(x_{I*}^k, x_{I*}^{k−1}, δ_{I*}^k)‖ ≤ D₂ (S_{k−2} − S_k + ‖δ_{I*}^{k−1}‖₁ − ‖δ_{I*}^k‖₁).   (31)

Combining (30) with (31), we have

(ψ(x_{I*}^k, x_{I*}^{k−1}, δ_{I*}^k) − ζ)^θ ≤ D₂ c(1 − θ)(S_{k−2} − S_k + ‖δ_{I*}^{k−1}‖₁ − ‖δ_{I*}^k‖₁).

Raising both sides to the power (1 − θ)/θ and scaling both sides by c, we obtain that for all k > k̄,

φ(ψ(x_{I*}^k, x_{I*}^{k−1}, δ_{I*}^k) − ζ) = c[ψ(x_{I*}^k, x_{I*}^{k−1}, δ_{I*}^k) − ζ]^{1−θ}
  ≤ c[D₂ c(1 − θ)(S_{k−2} − S_k + ‖δ_{I*}^{k−1}‖₁ − ‖δ_{I*}^k‖₁)]^{(1−θ)/θ}
  ≤ c[D₂ c(1 − θ)(S_{k−2} − S_k + ‖δ_{I*}^{k−1}‖₁)]^{(1−θ)/θ},   (32)

which combined with (29) yields

S_k ≤ C₃ [S_{k−2} − S_k + ‖δ_{I*}^{k−1}‖₁]^{(1−θ)/θ} + (1/2)[S_{k−2} − S_k + ‖δ_{I*}^{k−1}‖₁],   (33)

where C₃ := (2D₂c/D₁)(D₂ c(1 − θ))^{(1−θ)/θ}. It follows that

S_k + (√μ/(1−√μ))‖δ_{I*}^k‖₁
  ≤ C₃ [S_{k−2} − S_k + ‖δ_{I*}^{k−1}‖₁]^{(1−θ)/θ} + (1/2)[S_{k−2} − S_k + ‖δ_{I*}^{k−1}‖₁] + (√μ/(1−√μ))‖δ_{I*}^k‖₁
  ≤ C₃ [S_{k−2} − S_k + ‖δ_{I*}^{k−1}‖₁]^{(1−θ)/θ} + (1/2)[S_{k−2} − S_k + ‖δ_{I*}^{k−1}‖₁] + (√μ/(1−√μ))‖δ_{I*}^{k−1}‖₁
  ≤ C₃ [S_{k−2} − S_k + ‖δ_{I*}^{k−1}‖₁]^{(1−θ)/θ} + C₄ [S_{k−2} − S_k + ‖δ_{I*}^{k−1}‖₁],   (34)

with C₄ := 1/2 + √μ/(1−√μ).

For part (ii), θ ∈ (0, 1/2]. Notice that (1 − θ)/θ ≥ 1 and

S_{k−2} − S_k + ‖δ_{I*}^{k−1}‖₁ → 0.

Hence, there exists k₂ ≥ k̄ such that for any k ≥ k₂,

[S_{k−2} − S_k + ‖δ_{I*}^{k−1}‖₁]^{(1−θ)/θ} ≤ [S_{k−2} − S_k + ‖δ_{I*}^{k−1}‖₁].

This, combined with (34), yields

S_k + (√μ/(1−√μ))‖δ_{I*}^k‖₁ ≤ (C₃ + C₄)[S_{k−2} − S_k + ‖δ_{I*}^{k−1}‖₁]

for any k ≥ k₂. By δ_i^k ≤ √μ δ_i^{k−1}, we know that for i ∈ I*,

δ_i^{k−1} ≤ (√μ/(1−√μ))(δ_i^{k−2} − δ_i^{k−1}).   (35)

It follows that

S_k + (√μ/(1−√μ))‖δ_{I*}^k‖₁ ≤ (C₃ + C₄)[(S_{k−2} + (√μ/(1−√μ))‖δ_{I*}^{k−2}‖₁) − (S_k + (√μ/(1−√μ))‖δ_{I*}^k‖₁)].

Hence, writing Λ_k := S_k + (√μ/(1−√μ))‖δ_{I*}^k‖₁ and noting that Λ_k is nonincreasing,

Λ_k ≤ [(C₃ + C₄)/(C₃ + C₄ + 1)] Λ_{k−2} ≤ [(C₃ + C₄)/(C₃ + C₄ + 1)]^{⌊(k−k₂)/2⌋} Λ_{k₂} ≤ [(C₃ + C₄)/(C₃ + C₄ + 1)]^{(k−k₂)/2 − 1} Λ_{k₂}.

Setting γ := √((C₃ + C₄)/(C₃ + C₄ + 1)) ∈ (0, 1), this gives

S_k + (√μ/(1−√μ))‖δ_{I*}^k‖₁ ≤ γ^{k − k₂ − 2} Λ_{k₂}.

Therefore, for any k ≥ k₂,

‖x^k − x*‖ ≤ S_k ≤ c₁ γ^k − c₂ ‖δ_{I*}^k‖₁

with c₁ = γ^{−k₂−2}(S_{k₂} + (√μ/(1−√μ))‖δ_{I*}^{k₂}‖₁) and c₂ = √μ/(1−√μ), completing the proof of (ii).

For part (iii), θ ∈ (1/2, 1). Notice that (1 − θ)/θ < 1 and

S_{k−2} − S_k + ‖δ_{I*}^{k−1}‖₁ → 0.

Hence, there exists k₂ ≥ k̄ such that for any k ≥ k₂,

[S_{k−2} − S_k + ‖δ_{I*}^{k−1}‖₁] ≤ [S_{k−2} − S_k + ‖δ_{I*}^{k−1}‖₁]^{(1−θ)/θ}.

This, combined with (34), yields

S_k + (√μ/(1−√μ))‖δ_{I*}^k‖₁ ≤ (C₃ + C₄)[S_{k−2} − S_k + ‖δ_{I*}^{k−1}‖₁]^{(1−θ)/θ}

for any k ≥ k₂.
Combining this with (35) yields

S_k + (√μ/(1−√μ))‖δ_{I*}^k‖₁ ≤ (C₃ + C₄)[(S_{k−2} + (√μ/(1−√μ))‖δ_{I*}^{k−2}‖₁) − (S_k + (√μ/(1−√μ))‖δ_{I*}^k‖₁)]^{(1−θ)/θ}.   (36)

Raising both sides of (36) to the power θ/(1 − θ), we see further that for any k > k₂,

(S_k + (√μ/(1−√μ))‖δ_{I*}^k‖₁)^{θ/(1−θ)} ≤ C₅ [(S_{k−2} + (√μ/(1−√μ))‖δ_{I*}^{k−2}‖₁) − (S_k + (√μ/(1−√μ))‖δ_{I*}^k‖₁)]   (37)

with C₅ := (C₃ + C₄)^{θ/(1−θ)}.

Consider the "even" subsequence of {k₂, k₂ + 1, ...} and define {∆_t}_{t ≥ ⌈k₂/2⌉} with ∆_t := S_{2t} + (√μ/(1−√μ))‖δ_{I*}^{2t}‖₁. Then for all t ≥ ⌈k₂/2⌉, we have

∆_t^{θ/(1−θ)} ≤ C₅ (∆_{t−1} − ∆_t).

We proceed as in the proof of [Attouch and Bolte, 2009, Theorem 2] (starting from [Attouch and Bolte, 2009, Equation (13)]). Define h : (0, +∞) → R by h(s) = s^{−θ/(1−θ)} and let R ∈ (1, +∞). Take t ≥ N for some sufficiently large N, and assume first that h(∆_t) ≤ R h(∆_{t−1}). Rewriting the above inequality as

1 ≤ C₅ (∆_{t−1} − ∆_t) ∆_t^{−θ/(1−θ)},

we obtain that

1 ≤ C₅ (∆_{t−1} − ∆_t) h(∆_t) ≤ R C₅ (∆_{t−1} − ∆_t) h(∆_{t−1}) ≤ R C₅ ∫_{∆_t}^{∆_{t−1}} h(s) ds ≤ R C₅ [(1 − θ)/(2θ − 1)] [∆_t^{(1−2θ)/(1−θ)} − ∆_{t−1}^{(1−2θ)/(1−θ)}].

Thus, if we set μ₁ = (2θ − 1)/((1 − θ) R C₅) > 0 and ν = (1 − 2θ)/(1 − θ) < 0, we obtain

0 < μ₁ ≤ ∆_t^ν − ∆_{t−1}^ν.   (38)

Assume now that h(∆_t) > R h(∆_{t−1}) and set q = R^{−(1−θ)/θ} ∈ (0, 1). It follows immediately that ∆_t ≤ q ∆_{t−1}, and furthermore (recalling that ν is negative) we have

∆_t^ν ≥ q^ν ∆_{t−1}^ν, so ∆_t^ν − ∆_{t−1}^ν ≥ (q^ν − 1) ∆_{t−1}^ν.

Since q^ν − 1 > 0 and ∆_p^ν → +∞ as p → +∞, there exists μ₂ > 0 such that (q^ν − 1)∆_{p−1}^ν > μ₂ for all p ≥ N. Therefore we obtain that

∆_t^ν − ∆_{t−1}^ν ≥ μ₂.   (39)

If we set μ̂ = min{μ₁, μ₂} > 0,
one can combine (38) and (39) to obtain

∆_t^ν − ∆_{t−1}^ν ≥ μ̂ > 0 for all t ≥ N.

Summing these inequalities from N to some t greater than N, we obtain ∆_t^ν − ∆_N^ν ≥ μ̂(t − N), and consequently

∆_t ≤ [∆_N^ν + μ̂(t − N)]^{1/ν} ≤ C₆ t^{−(1−θ)/(2θ−1)},   (40)

for some C₆ > 0.
As for the "odd" subsequence of {k₂, k₂ + 1, ...}, we can define {∆_t} with ∆_t := S_{2t+1} + (√μ/(1−√μ))‖δ_{I*}^{2t+1}‖₁ and still show that (40) holds true.

Therefore, for all sufficiently large even k,

‖x_{I*}^k − x*_{I*}‖ ≤ S_k = ∆_{k/2} − (√μ/(1−√μ))‖δ_{I*}^k‖₁ ≤ 2^{(1−θ)/(2θ−1)} C₆ k^{−(1−θ)/(2θ−1)} − (√μ/(1−√μ))‖δ_{I*}^k‖₁.

For all sufficiently large odd k,

‖x_{I*}^k − x*_{I*}‖ ≤ S_k = ∆_{(k−1)/2} − (√μ/(1−√μ))‖δ_{I*}^k‖₁ ≤ 2^{(1−θ)/(2θ−1)} C₆ (k − 1)^{−(1−θ)/(2θ−1)} − (√μ/(1−√μ))‖δ_{I*}^k‖₁.

Overall, we have

‖x^k − x*‖ = ‖x_{I*}^k − x*_{I*}‖ ≤ c₃ k^{−(1−θ)/(2θ−1)} − c₄ ‖δ_{I*}^k‖₁

for sufficiently large k, where c₄ := √μ/(1−√μ) and c₃ is a suitable constant absorbing the factor 2^{(1−θ)/(2θ−1)} C₆ and the odd-index shift. This completes the proof. ∎
Numerical Experiments

In this section, we perform sparse signal recovery experiments (similar to those in Yu and Pong [2019], Zeng et al. [2016], Figueiredo et al. [2007], Wen et al. [2018]) to study the behavior of Algorithm 1. The goal of the experiments is to reconstruct a length-n sparse signal x from m incomplete observations with m < n. For this purpose, we first generate an m × n matrix A with i.i.d. standard Gaussian entries and orthonormalize its rows. We then set y = A x_true + ε, where the original signal x_true contains K randomly placed ±1 spikes and the noise ε ∈ R^m has i.i.d. Gaussian entries with variance σ = 10^{−}. Hence, the objective has

f(x) = (1/2)‖Ax − y‖²,

where A ∈ R^{m×n}, y ∈ R^m and x ∈ R^n.

In our experiments, we compare the performance of the proposed algorithm EIRL1 with IRL1, the iteratively reweighted ℓ2 algorithm (IRL2) and the iterative jumping thresholding (IJT) algorithm [Zeng et al., 2016] for solving the ℓp-norm optimization problem, and we study the role of α. All algorithms start from a random Gaussian vector x^0 with mean 0 and variance 1, and all use the same termination criterion: the number of iterations exceeds the limit, or

‖x^k − x^{k−1}‖ / ‖x^k‖ ≤ opttol,

where opttol is a small parameter with default value 10^{−6}.
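A minimal sketch of this data-generation procedure (ours; it assumes SciPy's orth for row orthonormalization, and the noise scale noise_std and the seed are free parameters, not values from the paper):

```python
import numpy as np
from scipy.linalg import orth

def generate_data(m, n, K, noise_std=1e-3, seed=0):
    """One sparse-recovery instance: y = A @ x_true + noise, rows of A orthonormal."""
    rng = np.random.default_rng(seed)
    A = orth(rng.standard_normal((n, m))).T            # m x n with orthonormal rows
    x_true = np.zeros(n)
    support = rng.choice(n, size=K, replace=False)
    x_true[support] = rng.choice([-1.0, 1.0], size=K)  # K random +/-1 spikes
    y = A @ x_true + noise_std * rng.standard_normal(m)
    return A, x_true, y
```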
We compare the computational efficiency of the algorithms for solving ℓp regularization problems with p = 2/ , / , /7. We fix the matrix size (m, n) = (2048, ), K = 200, λ = 0. , μ = 0. , β = 1, ε^0 = 1 and α = 0. . For each p, we generate 50 random datasets (A, x_true, y) and compute the average MSE with respect to x_true, that is,

MSE = (1/n)‖x^k − x_true‖².

Figure 1 depicts the average MSE over the number of iterations, which shows that EIRL1 converges faster than the other methods. In addition, it is worth noticing that IRL algorithms are able to solve the ℓp-norm regularization problem for any 0 < p < 1, while IJT is limited to those p for which the subproblem possesses an explicit solution.
[Figure 1: Average MSE versus the number of iterations for the IRL1, IRL2, EIRL1 and IJT algorithms, with panels (a), (b), (c) corresponding to the three values of p. For p = 1/ , p = 2/ and p = 6/7, EIRL1 converges faster than the other algorithms. For p = 6/7, there is no explicit subproblem solution for the IJT algorithm, so it is excluded from that panel.]

The role of α

Since EIRL1 involves a parameter α, a common question is how to select α. We therefore test EIRL1 with different values of α in situations with sparsity varying from small to large in our last experimental setting. In particular, we set K = 200, 500,
800 for x_true, representing situations with low, medium and high sparsity, respectively. For each K, we test EIRL1 with 5 different values of α: 0.0, 0.3, 0.5, 0.7 and 0.9. For each α, we generate 50 random data sets (A, x_true, y), compute the average MSE versus the iterates, and record the number of nonzero components in the solution found. We plot the evolution of the MSE and box-plots of the number of nonzero components for the different α in Figure 2. It shows that larger α gives faster convergence and finds sparser solutions.

[Figure 2: MSE evolution and box-plots of the number of nonzero components for α = 0.0, 0.3, 0.5, 0.7, 0.9; panels (a)-(b) for K = 200, (c)-(d) for K = 500 and (e)-(f) for K = 800. Larger α has better performance.]

Notice that we require 0 ≤ α < 1. To investigate the best choice of α, we test more carefully with the larger values α = 0.90, 0.93, 0.95, 0.97, 0.99 and K = 800 in Figure 3. It shows that α = 0. performs better than the other values of α.
[Figure 3: log(MSE) and box-plots of the number of nonzero components for α = 0.90, 0.93, 0.95, 0.97, 0.99 with K = 800.]

In Figure 4, we also test the dynamic update α_k = (k − 1)/(k + 2) for k ≥ 1 with α_0 = 0 for K = 800, which is called Nesterov's momentum coefficient [Nesterov, 1983]. We can see that this choice of α_k performs better than the fixed values α = 0.7 and 0.9.

[Figure 4: log(MSE) and box-plots of the number of nonzero components for Nesterov's momentum coefficient versus the fixed values α = 0.7 and 0.9, with K = 800.]
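For completeness, the dynamic coefficient can be generated as below and substituted for the fixed alpha in the EIRL1 sketch given earlier (our illustrative code). Note that α_k → 1 as k grows, which lies outside the regime 0 ≤ α_k ≤ ᾱ < 1 covered by the analysis above.

```python
def nesterov_alpha(k):
    """Nesterov's momentum coefficient: alpha_0 = 0, alpha_k = (k - 1)/(k + 2) for k >= 1.

    Caveat: alpha_k -> 1 as k grows, which falls outside the range
    0 <= alpha_k <= alpha_bar < 1 assumed in the convergence analysis.
    """
    return 0.0 if k == 0 else (k - 1) / (k + 2)
```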
References

Hedy Attouch and Jérôme Bolte. On the convergence of the proximal algorithm for nonsmooth functions involving analytic features. Mathematical Programming, 116(1):5–16, 2009.

Hedy Attouch, Jérôme Bolte, and Benar Fux Svaiter. Convergence of descent methods for semi-algebraic and tame problems: proximal algorithms, forward-backward splitting, and regularized Gauss-Seidel methods. Mathematical Programming, 137(1-2):91–129, 2013.

Alfred Auslender and Marc Teboulle. Interior gradient and proximal methods for convex and conic optimization. SIAM Journal on Optimization, 16(3):697–725, 2006.

Amir Beck and Marc Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.

Stephen R. Becker, Emmanuel J. Candès, and Michael C. Grant. Templates for convex cone problems with applications to sparse signal recovery. Mathematical Programming Computation, 3(3):165, 2011.

Jérôme Bolte, Aris Daniilidis, and Adrian Lewis. The Łojasiewicz inequality for nonsmooth subanalytic functions with applications to subgradient dynamical systems. SIAM Journal on Optimization, 17(4):1205–1223, 2007a.

Jérôme Bolte, Aris Daniilidis, Adrian Lewis, and Masahiro Shiota. Clarke subgradients of stratifiable functions. SIAM Journal on Optimization, 18(2):556–572, 2007b.

Jérôme Bolte, Shoham Sabach, and Marc Teboulle. Proximal alternating linearized minimization for nonconvex and nonsmooth problems. Mathematical Programming, 146(1):459–494, 2014.

Emmanuel J. Candes, Michael B. Wakin, and Stephen P. Boyd. Enhancing sparsity by reweighted ℓ1 minimization. Journal of Fourier Analysis and Applications, 14(5-6):877–905, 2008.

Rick Chartrand and Wotao Yin. Iteratively reweighted algorithms for compressive sensing. In 2008 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 3869–3872. IEEE, 2008.

Xiaojun Chen and Weijun Zhou. Convergence of reweighted ℓ1 minimization algorithms and unique solution of truncated ℓp minimization. Department of Applied Mathematics, The Hong Kong Polytechnic University, 2010.

Jianqing Fan and Runze Li. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456):1348–1360, 2001.

Mário A. T. Figueiredo, Robert D. Nowak, and Stephen J. Wright. Gradient projection for sparse reconstruction: Application to compressed sensing and other inverse problems. IEEE Journal of Selected Topics in Signal Processing, 1(4):586–597, 2007.

Dongdong Ge, Xiaoye Jiang, and Yinyu Ye. A note on the complexity of ℓp minimization. Mathematical Programming, 129(2):285–299, 2011.

Martin Jaggi. Sparse Convex Optimization Methods for Machine Learning. PhD thesis, ETH Zurich, 2011.

Ming-Jun Lai and Jingyue Wang. An unconstrained ℓq minimization with 0 < q ≤ 1 for sparse solution of underdetermined linear systems. SIAM Journal on Optimization, 21(1):82–101, 2011.

Guanghui Lan, Zhaosong Lu, and Renato D. C. Monteiro. Primal-dual first-order methods with O(1/ε) iteration-complexity for cone programming. Mathematical Programming, 126(1):1–29, 2011.

Miguel Sousa Lobo, Maryam Fazel, and Stephen Boyd. Portfolio optimization with linear and fixed transaction costs. Annals of Operations Research, 152(1):341–365, 2007.

Stanisław Łojasiewicz. Une propriété topologique des sous-ensembles analytiques réels. Les Équations aux Dérivées Partielles, 117:87–89, 1963.

Canyi Lu, Yunchao Wei, Zhouchen Lin, and Shuicheng Yan. Proximal iteratively reweighted algorithm with multiple splitting for nonconvex sparsity optimization. In Twenty-Eighth AAAI Conference on Artificial Intelligence, 2014.

Zhaosong Lu. Iterative reweighted minimization methods for ℓp regularized unconstrained nonlinear programming. Mathematical Programming, 147(1-2):277–307, 2014.

Weixin Luo, Wen Liu, and Shenghua Gao. A revisit of sparse coding based anomaly detection in stacked RNN framework. In Proceedings of the IEEE International Conference on Computer Vision, pages 341–349, 2017.

Michael Lustig, David Donoho, and John M. Pauly. Sparse MRI: The application of compressed sensing for rapid MR imaging. Magnetic Resonance in Medicine, 58(6):1182–1195, 2007.

Julien Mairal, Michael Elad, and Guillermo Sapiro. Sparse representation for color image restoration. IEEE Transactions on Image Processing, 17(1):53–69, 2007.

Julien Mairal, Francis Bach, Jean Ponce, and Guillermo Sapiro. Online learning for matrix factorization and sparse coding. Journal of Machine Learning Research, 11:19–60, 2010.

Yu Nesterov. Gradient methods for minimizing composite functions. Mathematical Programming, 140(1):125–161, 2013.

Yurii Nesterov. Introductory lectures on convex programming volume I: Basic course. Lecture Notes, 3(4):5, 1998.

Yurii Nesterov. Primal-dual subgradient methods for convex problems. Mathematical Programming, 120(1):221–259, 2009.

Yurii E. Nesterov. A method for solving the convex programming problem with convergence rate O(1/k²). Dokl. Akad. Nauk SSSR, 269:543–547, 1983.

Paul Tseng. Approximation accuracy, gradient methods, and error bound for structured convex optimization. Mathematical Programming, 125(2):263–295, 2010.

Seiichiro Wakabayashi. Remarks on semi-algebraic functions. 2008.

Hao Wang, Hao Zeng, and Jiashan Wang. Relating ℓp regularization and reweighted ℓ1 regularization. arXiv preprint arXiv:1912.00723, 2019.

Bo Wen, Xiaojun Chen, and Ting Kei Pong. A proximal difference-of-convex algorithm with extrapolation. Computational Optimization and Applications, 69(2):297–324, 2018.

Peiran Yu and Ting Kei Pong. Iteratively reweighted ℓ1 algorithms with extrapolation. Computational Optimization and Applications, 73(2):353–386, 2019.

Jinshan Zeng, Shaobo Lin, and Zongben Xu. Sparse regularization: Convergence of iterative jumping thresholding algorithm. IEEE Transactions on Signal Processing, 64(19):5106–5118, 2016.

Roman Zeyde, Michael Elad, and Matan Protter. On single image scale-up using sparse-representations. In International Conference on Curves and Surfaces, pages 711–730. Springer, 2010.

Cun-Hui Zhang. Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics, 38(2):894–942, 2010.