Accelerated Schemes for the $L_1/L_2$ Minimization
Chao Wang, Ming Yan, Yaghoub Rahimi, Yifei Lou
Abstract—In this paper, we consider the $L_1/L_2$ minimization for sparse recovery and study its relationship with the $L_1$-$\alpha L_2$ model. Based on this relationship, we propose three numerical algorithms to minimize this ratio model, two of which work as adaptive schemes and greatly reduce the computation time. Focusing on the two adaptive schemes, we discuss their connection to existing approaches and analyze their convergence. The experimental results demonstrate that the proposed algorithms are comparable to state-of-the-art methods in sparse recovery and work particularly well when the ground-truth signal has a high dynamic range. Lastly, we reveal some empirical evidence on the exact $L_1$ recovery under various combinations of sparsity, coherence, and dynamic range, which calls for theoretical justification in the future.

Index Terms—Sparsity, $L_1/L_2$, adaptive scheme, dynamic range.

C. Wang and Y. Lou are with the Department of Mathematical Sciences, University of Texas at Dallas, Richardson, TX 75080 USA (E-mail: [email protected], [email protected]). Y. Lou was partially supported by NSF Awards DMS 1522786 and 1846690. Y. Rahimi is with the School of Mathematics, Georgia Institute of Technology, Atlanta, GA 30332 USA (E-mail: [email protected]). M. Yan is with the Department of Computational Mathematics, Science and Engineering (CMSE) and the Department of Mathematics, Michigan State University, East Lansing, MI 48824 USA (E-mail: [email protected]). M. Yan was partially supported by NSF award DMS 1621798.
I. INTRODUCTION

In various science and engineering applications, one aims to seek a low-dimensional representation of high-dimensional data, and sparsity is a crucial assumption. For example, it is reasonable to assume in machine learning [1] that only a few features correspond to the response. In image processing [2], the restored images are often piecewise constant, which means that gradients are sparse. In non-negative matrix factorization [3], the low-rank decomposition enforces sparsity with respect to singular values.

Sparse signal recovery aims to find the sparsest solution of $Ax = b$, where $A \in \mathbb{R}^{m\times n}$ ($m \ll n$), $x \in \mathbb{R}^n$, and $b \in \mathbb{R}^m$. We assume that $A$ has full row rank and $b$ is nonzero. This problem is often referred to as compressed sensing (CS) [4], [5] in the sense that the sparse signal $x$ is compressible. Mathematically, it can be formulated as the $L_0$ minimization,
$$\min_{x\in\mathbb{R}^n} \|x\|_0 \quad \text{s.t.}\quad Ax = b. \tag{1}$$
Unfortunately, the $L_0$ problem is known to be NP-hard [6]. Various approaches to sparse recovery have been investigated. Greedy methods include orthogonal matching pursuit (OMP) [7], orthogonal least squares (OLS) [8], and compressive sampling matching pursuit (CoSaMP) [9]. However, these greedy methods often lack accuracy when $n$ is large. Alternatively, approximation/relaxation approaches to the $L_0$ norm have been sought. For example, convex relaxation, referred to as basis pursuit (BP) [10], replaces $L_0$ in (1) with the $L_1$ norm. Recently, nonconvex models have attracted a considerable amount of attention due to their sharper approximations of $L_0$ compared to the $L_1$ norm. Some popular nonconvex models include $L_p$ [11], [12], [13], $L_1$-$L_2$ [14], [15], transformed $L_1$ (TL1) [16], [17], [18], the nonnegative garrote [19], and capped-$L_1$ [20], [21], [22]. Except for $L_1$-$L_2$, all of these nonconvex models involve one parameter to be determined and adjusted for different types of sparse recovery problems.

In this paper, we study the ratio of $L_1$ and $L_2$ as a scale-invariant and parameter-free metric to approximate the desired scale-invariant $L_0$ norm. The ratio of $L_1$ and $L_2$ can be traced back to [23] as a sparsity measure, and its scale-invariance was explicitly mentioned in [24]. Esser et al. [25], [14] focused on nonnegative signals and established the equivalence between $L_1/L_2$ and $L_0$. The ratio model was later formulated as a nonlinear constraint that was solved by a lifted approach [26], [27]. Some applications of $L_1/L_2$ include blind deconvolution [28], [29] and sparse filtering [30], [31].

In our earlier work [32], we focused on a constrained minimization problem,
$$\min_{x\in\mathbb{R}^n} \frac{\|x\|_1}{\|x\|_2} \quad \text{s.t.}\quad Ax = b. \tag{2}$$
Theoretically, we proved that any $s$-sparse vector is a local minimizer of the $L_1/L_2$ model provided that a strong null space property (sNSP) condition holds.
Computationally, we considered minimizing (2) via the alternating direction method of multipliers (ADMM) [33]. In particular, we introduced two auxiliary variables and formed the augmented Lagrangian as
$$L(x, y, z; v, w) = \frac{\|z\|_1}{\|y\|_2} + I(Ax - b) + \frac{\rho_1}{2}\Big\|x - y + \frac{v}{\rho_1}\Big\|_2^2 + \frac{\rho_2}{2}\Big\|x - z + \frac{w}{\rho_2}\Big\|_2^2, \tag{3}$$
where $I(\cdot)$ is defined as
$$I(t) = \begin{cases} 0, & t = 0, \\ +\infty, & \text{otherwise}. \end{cases} \tag{4}$$
There is a closed-form solution for each sub-problem; please refer to [32] for more details.

This paper contributes three schemes to minimize (2). We demonstrate in experiments that the new schemes are computationally more efficient than the previous ADMM approach. The novelties of the paper are three-fold:
(1) Thanks to the new schemes, $L_1/L_2$ can effectively deal with sparse signals that have a high dynamic range, which is not the case for the ADMM approach;
(2) We reveal the connection of the proposed schemes to existing approaches, which helps to establish their convergence;
(3) Our empirical results shed light on the effects of sparsity, coherence, and dynamic range on sparse recovery, which is new in the CS literature.

The rest of the paper is organized as follows. Section II is devoted to a theoretical analysis of the relation between $L_1/L_2$ and $L_1$-$\alpha L_2$, which motivates three numerical schemes to minimize $L_1/L_2$. We interpret the proposed schemes in line with some existing approaches in Section III, followed by convergence analysis in Section IV. We conduct extensive experiments in Section V to demonstrate the performance of the $L_1/L_2$ model with the three minimization algorithms over state-of-the-art methods in sparse recovery. Section VI presents how the classic $L_1$ approach behaves under different dynamic ranges and how sparsity, coherence, and dynamic range interplay in sparse recovery. Finally, conclusions and future works are given in Section VII.

II. NUMERICAL SCHEMES
We establish in Proposition 1 a link between the constrained $L_1/L_2$ formulation (2) and the $L_1$-$\alpha L_2$ model, where $\alpha$ is a positive parameter. Immediately following this proposition, we develop a numerical algorithm for minimizing the ratio model. We further discuss two accelerated approaches in Section II-B.

Proposition 1.
Denote
$$\alpha^* := \inf_{x\in\mathbb{R}^n}\left\{\frac{\|x\|_1}{\|x\|_2}\ \ \text{s.t.}\ Ax = b\right\} \tag{5}$$
and
$$T(\alpha) := \inf_{x\in\mathbb{R}^n}\left\{\|x\|_1 - \alpha\|x\|_2\ \ \text{s.t.}\ Ax = b\right\}; \tag{6}$$
then we have
(a) if $T(\alpha) < 0$, then $\alpha > \alpha^*$;
(b) if $T(\alpha) \ge 0$, then $\alpha \le \alpha^*$;
(c) if $T(\alpha) = 0$, then $\alpha = \alpha^*$.

Proof.
Denote the feasible set of (5) by $F = \{x \mid Ax = b\}$. Since $b \neq 0$, we have $0 \notin F$.
(a) If $T(\alpha) < 0$, then there exists $x \in F$ such that $\|x\|_1 - \alpha\|x\|_2 < 0$, which implies that $\alpha > \|x\|_1/\|x\|_2$. Therefore, we have $\alpha > \alpha^*$.
(b) If $T(\alpha) \ge 0$, then for all $x \in F$ we have $\|x\|_1 - \alpha\|x\|_2 \ge 0$. So $\alpha \le \|x\|_1/\|x\|_2$ and hence $\alpha \le \inf_{x\in F}\|x\|_1/\|x\|_2 = \alpha^*$, i.e., $\alpha \le \alpha^*$.
(c) If $T(\alpha) = 0$, then by part (b) we get $\alpha \le \alpha^*$. Furthermore, there exists a sequence $\{x_n\} \subset F$ such that $\lim_{n\to\infty}(\|x_n\|_1 - \alpha\|x_n\|_2) = 0$. Since $x_n \in F$, we have $\|b\|_2 = \|Ax_n\|_2 \le \|A\|_2\|x_n\|_2$. Hence, $\{x_n\}$ is bounded away from zero, i.e., $\|x_n\|_2 \ge \|b\|_2/\|A\|_2$ for all $n$; then we get $\lim_{n\to\infty}(\|x_n\|_1 - \alpha\|x_n\|_2)/\|x_n\|_2 = 0$, which means $\alpha^* \le \lim_{n\to\infty}\|x_n\|_1/\|x_n\|_2 = \alpha$. Therefore, we have $\alpha = \alpha^*$.

A. Bisection Search
It follows from Proposition 1 that the optimal value of $L_1/L_2$ equals the value of $\alpha$ in the $L_1$-$\alpha L_2$ model if the objective value of $L_1$-$\alpha L_2$ is zero. That is to say, the optimal value of the ratio model is the root of $T(\alpha)$, which can be obtained by bisection search. Moreover, we have upper/lower bounds on $\alpha$, i.e., $\alpha \in [1, \sqrt{n}]$, since $\|x\|_2 \le \|x\|_1 \le \sqrt{n}\|x\|_2$ for all $x \in \mathbb{R}^n$ [34]. The procedure goes as follows: we start with an initial range $[1, \sqrt{n}]$ for $\alpha$ and an initial value $\alpha^{(0)}$ in between. Then, using this $\alpha^{(0)}$, we solve the $L_1$-$\alpha^{(0)} L_2$ minimization via the difference-of-convex algorithm (DCA) [35]; more details on the DCA implementation are given in Section II-B. Based on the objective value $T(\alpha^{(0)})$, we update the range of $\alpha$. Specifically, if $T(\alpha^{(0)}) = 0$, then we have found the minimum ratio, and the corresponding minimizer $x^*$ of the $L_1$-$\alpha^{(0)} L_2$ model is also a minimizer of the $L_1/L_2$ model. If $T(\alpha^{(0)}) > 0$, then we update the range to $[\alpha^{(0)}, \sqrt{n}]$. If $T(\alpha^{(0)}) < 0$, then the minimum ratio is smaller than $\alpha^{(0)}$, so we can shorten the range from $[1, \sqrt{n}]$ to $[1, \alpha^{(0)}]$. We can further shorten the interval to $\big[1, \|x^{(k+1)}\|_1/\|x^{(k+1)}\|_2\big]$, as the objective value of $L_1$-$\frac{\|x^{(k+1)}\|_1}{\|x^{(k+1)}\|_2} L_2$ would be less than or equal to zero in the next iteration. After the range is updated, we choose $\alpha^{(1)}$ as the midpoint of the two endpoints and iterate.

We summarize the entire process as Algorithm 1, in which the stopping criterion is that the difference between two adjacent $\alpha$ values is small enough. As the algorithmic scheme follows directly from bisection search, we refer to the algorithm as $L_1/L_2$-BS, or BS if the context is clear. The convergence of BS can be obtained in the same way that the bisection method converges. However, due to the nonconvex nature of the $L_1$-$\alpha L_2$ minimization (6), there is no guarantee of finding its global minimizer, and hence the solution to (5) may be suboptimal.
Algorithm 1 The $L_1/L_2$ minimization via bisection search ($L_1/L_2$-BS).

Input: $A \in \mathbb{R}^{m\times n}$, $b \in \mathbb{R}^m$, kMax, and $\epsilon \in \mathbb{R}$
Initialize: $x^{(0)}$, $\alpha^{(0)}$, $\mathrm{lb} = 1$, $\mathrm{ub} = \sqrt{n}$, and $k = 0$
while $k <$ kMax and $|\alpha^{(k)} - \alpha^{(k-1)}| > \epsilon$ do
    $x^{(k+1)} = \arg\min_{x\in\mathbb{R}^n}\{\|x\|_1 - \alpha^{(k)}\|x\|_2 \ \ \text{s.t.}\ Ax = b\}$
    if $\|x^{(k+1)}\|_1 - \alpha^{(k)}\|x^{(k+1)}\|_2 < 0$ then
        $\mathrm{ub} = \|x^{(k+1)}\|_1 / \|x^{(k+1)}\|_2$
    else if $\|x^{(k+1)}\|_1 - \alpha^{(k)}\|x^{(k+1)}\|_2 > 0$ then
        $\mathrm{lb} = \alpha^{(k)}$
    else
        break
    end if
    $\alpha^{(k+1)} = (\mathrm{ub} + \mathrm{lb})/2$
    $k = k + 1$
end while
return $x^{(k)}$
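For concreteness, the following MATLAB sketch implements the bisection driver of Algorithm 1. The subroutine handle solve_dca is hypothetical: it stands for any solver of the $L_1$-$\alpha L_2$ subproblem (6), e.g., the DCA described in Section II-B, and the function and variable names are our own, not the authors' released code.

% Minimal sketch of Algorithm 1 (L1/L2-BS).  The handle solve_dca is a
% hypothetical subroutine: solve_dca(A, b, alpha, x0) should return an
% (approximate) minimizer of ||x||_1 - alpha*||x||_2 subject to A*x = b.
function [x, alpha] = l1dl2_bs(A, b, solve_dca, kMax, eps)
    n = size(A, 2);
    lb = 1;  ub = sqrt(n);             % since ||x||_2 <= ||x||_1 <= sqrt(n)*||x||_2
    alpha = (lb + ub) / 2;  alpha_old = Inf;
    x = A' * ((A * A') \ b);           % minimum-norm feasible initial guess
    k = 0;
    while k < kMax && abs(alpha - alpha_old) > eps
        x = solve_dca(A, b, alpha, x);           % inner L1 - alpha*L2 solve
        T = norm(x, 1) - alpha * norm(x, 2);     % objective value T(alpha)
        if T < 0
            ub = norm(x, 1) / norm(x, 2);        % minimum ratio lies below alpha
        elseif T > 0
            lb = alpha;                          % minimum ratio lies above alpha
        else
            break                                % T(alpha) = 0: alpha is optimal
        end
        alpha_old = alpha;
        alpha = (lb + ub) / 2;                   % bisection step
        k = k + 1;
    end
end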
B. Adaptive Algorithms

The BS algorithm is computationally expensive, considering that the $L_1$-$\alpha L_2$ minimization is conducted multiple times. To speed it up, we discuss two variants of $L_1/L_2$-BS that update the parameter $\alpha$ iteratively while minimizing $\|x\|_1 - \alpha\|x\|_2$.

Following the DCA framework [36], [37] to minimize $\|x\|_1 - \alpha\|x\|_2$, we consider the objective function as the difference of two convex functions, i.e., $\min_{x\in\mathbb{R}^n} g(x) - h(x)$. By linearizing the second term $h(\cdot)$, the DCA iterates as follows,
$$x^{(k+1)} = \arg\min_{x\in\mathbb{R}^n}\ g(x) - \big\langle x, \nabla h(x^{(k)})\big\rangle. \tag{7}$$
Particularly for the $L_1$-$\alpha L_2$ model, we have
$$g(x) = \|x\|_1 + I(Ax - b) \quad\text{and}\quad h(x) = \alpha\|x\|_2, \tag{8}$$
thus leading to the DCA update
$$x^{(k+1)} = \arg\min_{x\in\mathbb{R}^n}\ g(x) - \Big\langle x, \frac{\alpha x^{(k)}}{\|x^{(k)}\|_2}\Big\rangle. \tag{9}$$
Now we consider updating $\alpha$ iteratively by the ratio of the current solution, leading to the following scheme,
$$\begin{cases} x^{(k+1)} = \arg\min_{x}\Big\{ g(x) - \Big\langle x, \frac{\alpha^{(k)} x^{(k)}}{\|x^{(k)}\|_2}\Big\rangle \Big\}, \\[1ex] \alpha^{(k+1)} = \|x^{(k+1)}\|_1 / \|x^{(k+1)}\|_2, \end{cases} \tag{10}$$
where $g$ is defined in (8). Notice that the $x$-subproblem in (10) is a linear programming (LP) problem, which unfortunately has no guarantee that an optimal solution exists (as the problem can be unbounded). To increase the robustness of the algorithm, we further incorporate a quadratic term into the linear problem, i.e.,
$$\begin{cases} x^{(k+1)} = \arg\min_{x}\Big\{ g(x) - \Big\langle x, \frac{\alpha^{(k)} x^{(k)}}{\|x^{(k)}\|_2}\Big\rangle + \frac{\beta}{2}\|x - x^{(k)}\|_2^2 \Big\}, \\[1ex] \alpha^{(k+1)} = \|x^{(k+1)}\|_1 / \|x^{(k+1)}\|_2. \end{cases} \tag{11}$$
We denote these two adaptive methods (10) and (11) by $L_1/L_2$-A1 and $L_1/L_2$-A2, respectively, or A1 and A2 for short. Both algorithms are summarized in Algorithm 2.

For the $x$-subproblem of $L_1/L_2$-A1, we convert it into an LP problem. Assume that $x = x^+ - x^-$, where $x^+ \ge 0$ and $x^- \ge 0$. Denote $\bar{x} = [x^+; x^-]$; then $Ax = b$ becomes $\bar{A}\bar{x} = b$ with $\bar{A} = [A\ \ -A]$. Therefore, the $x$-subproblem becomes
$$\min_{\bar{x} \ge 0}\ c^T\bar{x} \quad \text{s.t.}\quad \bar{A}\bar{x} = b, \tag{12}$$
where $c = \Big[\mathbf{1} - \frac{\alpha^{(k)} x^{(k)}}{\|x^{(k)}\|_2};\ \mathbf{1} + \frac{\alpha^{(k)} x^{(k)}}{\|x^{(k)}\|_2}\Big]$. We adopt the software Gurobi [38] to solve this LP problem.

The $x$-subproblem of $L_1/L_2$-A2 is a quadratic programming problem, which can be solved via ADMM. By introducing an auxiliary variable $y$, we have the augmented Lagrangian,
$$L_\rho(x, y; u) = \|y\|_1 + I(Ax - b) - \Big\langle x, \frac{\alpha^{(k)} x^{(k)}}{\|x^{(k)}\|_2}\Big\rangle + \frac{\beta}{2}\|x - x^{(k)}\|_2^2 + u^T(x - y) + \frac{\rho}{2}\|x - y\|_2^2. \tag{13}$$
Then the ADMM iteration goes as follows,
$$\begin{cases} x_{j+1} = \arg\min_x L_\rho(x, y_j; u_j), \\ y_{j+1} = \arg\min_y L_\rho(x_{j+1}, y; u_j), \\ u_{j+1} = u_j + \rho(x_{j+1} - y_{j+1}), \end{cases} \tag{14}$$
where the subscript $j$ indexes the inner loop, as opposed to the superscript $k$ for the outer iterations used in (11).
Algorithm 2 The $L_1/L_2$ minimization via the adaptive selection method ($L_1/L_2$-A1 or A2).

Input: $A \in \mathbb{R}^{m\times n}$, $b \in \mathbb{R}^m$, kMax, and $\epsilon \in \mathbb{R}$
Initialize: $x^{(0)}$, $\alpha^{(0)}$, and $k = 0$
while $k <$ kMax and $\|x^{(k)} - x^{(k-1)}\|_2 / \|x^{(k)}\|_2 > \epsilon$ do
    Update $\{x^{(k+1)}, \alpha^{(k+1)}\}$ by (10) for A1, or by (11) for A2
    $k = k + 1$
end while
return $x^{(k)}$

The $x$-subproblem of (14) is a projection problem that minimizes
$$\Big\| x - \frac{\beta x^{(k)} - u_j + \rho y_j + \frac{\alpha^{(k)} x^{(k)}}{\|x^{(k)}\|_2}}{\beta + \rho} \Big\|_2^2$$
under the constraint $Ax = b$. Since the closed-form solution of projecting a vector $z$ onto this constraint is
$$\mathrm{proj}(z) = z - A^T(AA^T)^{-1}(Az - b), \tag{15}$$
the $x$-update is given by
$$x_{j+1} = \mathrm{proj}\left( \frac{\beta x^{(k)} - u_j + \rho y_j + \frac{\alpha^{(k)} x^{(k)}}{\|x^{(k)}\|_2}}{\beta + \rho} \right).$$
The $y$-subproblem of (14) is equivalent to
$$y_{j+1} = \arg\min_y \left\{ \|y\|_1 + \frac{\rho}{2}\Big\|y - x_{j+1} - \frac{u_j}{\rho}\Big\|_2^2 \right\}.$$
It has a closed-form solution via soft shrinkage, i.e.,
$$y_{j+1} = \mathrm{shrink}\Big(x_{j+1} + \frac{u_j}{\rho}, \frac{1}{\rho}\Big), \tag{16}$$
with $\mathrm{shrink}(v, \mu) = \mathrm{sign}(v)\max(|v| - \mu, 0)$.
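To make the inner solver concrete, the following MATLAB sketch carries out one outer A2 step, i.e., the ADMM iteration (14) with the projection (15) and the soft shrinkage (16). It is a minimal illustration under our own naming and a fixed number of inner iterations, not the authors' released implementation.

% Minimal sketch of one outer iteration of L1/L2-A2: the ADMM loop (14)
% with projection (15) and soft shrinkage (16).  Names and the fixed
% inner iteration count are our own choices.
function [x, alpha] = a2_outer_step(A, b, xk, alphak, beta, rho, innerMax)
    proj   = @(z) z - A' * ((A * A') \ (A * z - b));     % projection onto {x : A*x = b}, Eq. (15)
    shrink = @(v, mu) sign(v) .* max(abs(v) - mu, 0);    % soft shrinkage, Eq. (16)
    g = alphak * xk / norm(xk, 2);    % gradient of the linearized alpha*||x||_2 term
    x = xk;  y = xk;  u = zeros(size(xk));
    for j = 1:innerMax
        x = proj((beta * xk - u + rho * y + g) / (beta + rho));   % x-update
        y = shrink(x + u / rho, 1 / rho);                         % y-update
        u = u + rho * (x - y);                                    % dual update
    end
    alpha = norm(x, 1) / norm(x, 2);  % adaptive alpha update in (11)
end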
III. CONNECTIONS TO PREVIOUS WORKS
We interpret the proposed adaptive methods (A1 and A2) in line with some existing approaches: parameter selection, generalized inverse power methods, and gradient-based methods. These connections contribute to the convergence analysis in Section IV.
A. Parameter Selection
Recall that in $L_1/L_2$-BS, the ratio $L_1/L_2$ is minimized when there exists a proper $\alpha^*$ such that $\|x^*\|_1 - \alpha^*\|x^*\|_2 = 0$ with $x^* = \arg\min_x\{\|x\|_1 - \alpha^*\|x\|_2\ \ \text{s.t.}\ Ax = b\}$. We can regard this process as a root-finding problem for $\alpha^*$, which often occurs in parameter selection. For example, in the discrepancy principle method [39], [40], [41], one aims to find a parameter $\alpha$ such that the resulting data-fitting term is close to the noise level. In particular, we represent this process by
$$\begin{cases} x^{(k+1)} = \arg\min_x f(x, \alpha^{(k)}), \\ \alpha^{(k+1)} = l(x^{(k+1)}, \alpha^{(k)}), \end{cases} \tag{17}$$
where $f(\cdot)$ is a general objective function to be minimized and $l(\cdot)$ is a certain scheme to update $\alpha$ so that the discrepancy principle holds. Typically, an inner loop is required to solve the $x$-subproblem, followed by updating this parameter in an outer iteration. We further denote the $j$-th inner iteration at the $k$-th outer iteration by
$$x_{j+1} = \Psi(x_j, \alpha^{(k)}) \tag{18}$$
for the $x$-subproblem in (17).

To speed up the process, Wen and Chan [40] proposed an adaptive scheme that updates the parameter during the inner loop such that the current data-fitting term equals the noise level. In other words, instead of updating $\alpha$ after minimizing $f$, they directly iterated
$$x_{j+1} = \Psi(x_j, \alpha_{j+1}) \tag{19}$$
in a way that $\{x_{j+1}, \alpha_{j+1}\}$ satisfies the discrepancy principle. In this way, only one loop is needed, as opposed to the inner/outer loops in (18). But it requires a closed-form solution for $x_{j+1}$ so that one can perform a one-dimensional search for $\alpha_{j+1}$.

The proposed BS scheme falls into the framework of (17) in that the searching range of the parameter is shortened at every outer iteration. However, $f$ in our BS method is the $L_1$-$\alpha L_2$ minimization, which does not have a closed-form solution. As opposed to (19), we consider updating
$$x_{j+1} = \Psi(x_j, \alpha_j) \tag{20}$$
prior to updating $\alpha$. In other words, we update $x_{j+1}$ based on $\alpha_j$ rather than $\alpha_{j+1}$, the latter of which was adopted in the parameter-selection method [40]. The rationale of (20) is to guarantee that $\{x_{j+1}, \alpha_{j+1}\}$ satisfies $\|x_{j+1}\|_1 - \alpha_{j+1}\|x_{j+1}\|_2 = 0$. The iterative scheme (20) is consistent with A1 or A2 (depending on the form of $\Psi$), if we change the notation from subscript $j$ to superscript $k$.

B. Generalized Inverse Power Methods
A standard technique for finding the smallest eigenvalue of a positive semi-definite symmetric matrix $B$ is the inverse power method [34], which requires iteratively solving the linear system
$$Bx^{(k+1)} = x^{(k)}. \tag{21}$$
The iteration converges to the eigenvector associated with the smallest eigenvalue of $B$, denoted by $x^*$. The smallest eigenvalue can then be evaluated by $\lambda = q(x^*)$, where $q(\cdot)$ is the Rayleigh quotient defined as $q(x) = \frac{\langle x, Bx\rangle}{\|x\|_2^2}$. Note that (21) is equivalent to the minimization problem
$$x^{(k+1)} = \arg\min_x\left\{\tfrac{1}{2}\langle x, Bx\rangle - \langle x^{(k)}, x\rangle\right\}. \tag{22}$$
It is well known in linear algebra [34], [42] that the eigenvectors of $B$ are critical points of $\min_x q(x)$ and that the smallest eigenvalue/eigenvector can be found by (22). This idea was naturally extended to the nonlinear case in [43], where a general quotient $q(x) = \frac{r(x)}{s(x)}$ is considered, with arbitrary functions $r(\cdot)$ and $s(\cdot)$. Similarly to (22), we have the corresponding scheme
$$x^{(k+1)} = \arg\min_x\left\{ r(x) - \langle \nabla s(x^{(k)}), x\rangle \right\}.$$
Following [43], we consider updating the eigenvalue $\lambda^{(k)}$ at each iteration to guarantee the algorithm's descent. In particular, the iterative scheme is given by
$$\begin{cases} x^{(k+1)} = \arg\min_x\left\{ r(x) - \lambda^{(k)}\langle \nabla s(x^{(k)}), x\rangle \right\}, \\[1ex] \lambda^{(k+1)} = \dfrac{r(x^{(k+1)})}{s(x^{(k+1)})}. \end{cases} \tag{23}$$
If we choose $r(x) = g(x)$, $s(x) = \|x\|_2$, and denote $\lambda$ by $\alpha$, then the generalized inverse power method (23) is exactly $L_1/L_2$-A1. In [44], a modified inverse power method was proposed via the steepest descent flow; its iteration incorporates a quadratic term into the objective function of the $x$-subproblem, which leads to $L_1/L_2$-A2.
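For reference, here is a small MATLAB sketch of the classical inverse power method (21)-(22) that the scheme (23) generalizes; it assumes a symmetric positive definite $B$ so that the linear solve is well defined, and the normalization step is added for numerical safety.

% Classical inverse power method (21)-(22) for the smallest eigenpair of a
% symmetric positive definite matrix B; (23) generalizes this by replacing
% the quadratic form with general functions r and s.
function [x, lambda] = inverse_power(B, x0, kMax)
    x = x0 / norm(x0);
    for k = 1:kMax
        x = B \ x;                       % solve B*x_new = x_old, Eq. (21)
        x = x / norm(x);                 % normalize to keep the iterates bounded
    end
    lambda = (x' * B * x) / (x' * x);    % Rayleigh quotient q(x)
end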
C. Gradient-based Methods

Definition 1. A critical point of a constrained optimization problem is a vector in the feasible set (satisfying the constraints) that is also a local maximum, local minimum, or saddle point of the objective function.

According to the Karush-Kuhn-Tucker (KKT) conditions, $x^* \neq 0$ is a critical point of (2) if and only if there exists a vector $s$ such that
$$\begin{cases} 0 \in \dfrac{\partial\|x^*\|_1}{\|x^*\|_2} - \dfrac{\|x^*\|_1}{\|x^*\|_2^2}\cdot\dfrac{x^*}{\|x^*\|_2} + A^T s, \\ Ax^* - b = 0. \end{cases} \tag{24}$$
By introducing $\hat{s} = \|x^*\|_2 \cdot s$, we have
$$\begin{cases} 0 \in \partial\|x^*\|_1 - \dfrac{\|x^*\|_1}{\|x^*\|_2^2}\, x^* + A^T\hat{s}, \\ Ax^* - b = 0. \end{cases} \tag{25}$$
The condition (25) is also an optimality condition of another optimization problem:
$$\min_x\ g(x) + w(x), \tag{26}$$
where $g(x)$ is from (8) and $w(x)$ is some function satisfying
$$\nabla w(x) = -\frac{\|x\|_1}{\|x\|_2^2}\, x. \tag{27}$$
Note that $w(\cdot)$ cannot be explicitly determined from (27). By applying the proximal gradient method (PGM) [45], [46], [47] to the model (26), we obtain the following scheme,
$$x^{(k+1)} = \mathrm{prox}_{\frac{1}{\beta} g}\left(x^{(k)} - \frac{1}{\beta}\nabla w(x^{(k)})\right), \tag{28}$$
where $\mathrm{prox}_g(y) = \arg\min_z\big\{ g(z) + \frac{1}{2}\|z - y\|_2^2\big\}$. This iterative scheme is the same as $L_1/L_2$-A2. As for $L_1/L_2$-A1, we can interpret it as a generalized conditional gradient method [48] that minimizes $g(x) + w(x)$ by
$$x^{(k+1)} = \arg\min_y\ \langle \nabla w(x^{(k)}), y\rangle + g(y).$$
IV. CONVERGENCE ANALYSIS
Following the discussion in Section III-C, we present the convergence analysis. We start with the convergence of A2, which is characterized in Theorem 1. To prove it, we need four lemmas, whose proofs are given in the Appendix.
Lemma 1 (Sufficient decrease). The sequence $\{x^{(k)}, \alpha^{(k)}\}$ produced by $L_1/L_2$-A2 satisfies
$$\alpha^{(k)} - \alpha^{(k+1)} \ge \frac{\beta}{2\|x^{(k+1)}\|_2}\,\|x^{(k+1)} - x^{(k)}\|_2^2, \quad \forall k > 0.$$
The next two lemmas (Lemma 2 and Lemma 3) discuss Lipschitz properties.
Lemma 2.
Define $L = \frac{1}{\|A^T(AA^T)^{-1}b\|_2}$. Then for any $x, y \in \mathbb{R}^n$ satisfying $Ax = Ay = b$, we have
$$\Big\|\frac{x}{\|x\|_2} - \frac{y}{\|y\|_2}\Big\|_2 \le L\,\|x - y\|_2.$$
Since the gradient of the $L_2$ norm is $\nabla\|x\|_2 = \frac{x}{\|x\|_2}$, Lemma 2 implies that the gradient of the Euclidean norm is Lipschitz continuous on the domain $\{x \mid Ax = b\}$. The next lemma concerns the Lipschitz property of the implicit function $w(\cdot)$ that satisfies (27).

Lemma 3.
Let $L$ be defined as in Lemma 2. For any $x, y \in \mathbb{R}^n$ satisfying $Ax = Ay = b$,
$$\|\nabla w(x) - \nabla w(y)\|_2 \le L_w\,\|x - y\|_2 \tag{29}$$
for $w$ satisfying (27) and $L_w = 2\sqrt{n}L$.

Lemma 4.
Given $g(\cdot)$ defined in (8) and supposing $w(\cdot)$ satisfies (27), we denote
$$\Phi(x) := \beta\left(x - \mathrm{prox}_{\frac{1}{\beta} g}\Big(x - \frac{1}{\beta}\nabla w(x)\Big)\right) \tag{30}$$
for an arbitrary $\beta > 0$. Then we have
(a) $\Phi(x^*) = 0$ if and only if $x^*$ is a critical point of (2);
(b) $\|\Phi(x) - \Phi(y)\|_2 \le L_\Phi\|x - y\|_2$ with $L_\Phi = L_w + 2\beta$, for any $x, y \in \mathbb{R}^n$ satisfying $Ax = Ay = b$.

It is stated in (28) that $L_1/L_2$-A2 can be expressed as $x^{(k+1)} = \mathrm{prox}_{\frac{1}{\beta} g}\big(x^{(k)} - \frac{1}{\beta}\nabla w(x^{(k)})\big)$. By the definition of $\Phi(\cdot)$ in (30) and the decrease property of $\|x\|_1/\|x\|_2$ in Lemma 1, we can interpret A2 as a gradient descent method,
$$x^{(k+1)} = x^{(k)} - \frac{1}{\beta}\Phi(x^{(k)}).$$
In the following theorem, we rely on Lemma 4 to show that the descent direction along $\Phi(\cdot)$ leads to convergence.

Theorem 1.
Let $\{x^{(k)}, \alpha^{(k)}\}$ be a sequence generated by $L_1/L_2$-A2. If $\{x^{(k)}\}$ is bounded, then there exists a subsequence that converges to a critical point of the ratio model (2).

Proof.
According to Lemma 1, $\alpha^{(k)}$ is decreasing and bounded from below, so there exists a scalar $\alpha^*$ such that $\alpha^{(k)} \to \alpha^*$. With the boundedness assumption on $\{x^{(k)}\}$, we get $\|x^{(k+1)} - x^{(k)}\|_2 \to 0$ from Lemma 1, which implies that $\|\Phi(x^{(k)})\|_2 \to 0$. The boundedness of $\{x^{(k)}\}$ also yields a convergent subsequence, i.e., $x^{(k_i)} \to x^*$. Therefore, we have
$$\|\Phi(x^*)\|_2 = \big\|\Phi(x^*) - \Phi(x^{(k_i)}) + \Phi(x^{(k_i)})\big\|_2 \le \big\|\Phi(x^*) - \Phi(x^{(k_i)})\big\|_2 + \big\|\Phi(x^{(k_i)})\big\|_2 \le L_\Phi\|x^{(k_i)} - x^*\|_2 + \big\|\Phi(x^{(k_i)})\big\|_2.$$
As $k_i \to \infty$, we get $\|\Phi(x^*)\|_2 = 0$ and hence $\Phi(x^*) = 0$. By Lemma 4, $\{x^{(k_i)}\}$ converges to a critical point.

Remark 1.
Theorem 1 does not require the step size $1/\beta$ to be small, which is typically needed for gradient-based methods. In our numerical tests, we can choose a small $\beta$ and still obtain good results.

Theorem 2.
Let $\{x^{(k)}, \alpha^{(k)}\}$ be a sequence generated by $L_1/L_2$-A1. If $\{x^{(k)}\}$ is bounded, then it has a convergent subsequence.

Proof.
Denote
$$z(x, x^{(k)}) := \|x\|_1 - \Big\langle x, \frac{\alpha^{(k)} x^{(k)}}{\|x^{(k)}\|_2}\Big\rangle.$$
Since $z(x^{(k)}, x^{(k)}) = 0$ by the definition of $\alpha^{(k)}$, the minimal value of $z(x, x^{(k)})$ subject to the constraint $\{x \mid Ax = b\}$ is less than or equal to zero. In particular, $z(x^{(k+1)}, x^{(k)}) \le 0$. As a result, by the Cauchy-Schwarz inequality, we have
$$\|x^{(k+1)}\|_1 \le \Big\langle x^{(k+1)}, \frac{\alpha^{(k)} x^{(k)}}{\|x^{(k)}\|_2}\Big\rangle \le \alpha^{(k)}\|x^{(k+1)}\|_2, \tag{31}$$
which implies $\alpha^{(k+1)} \le \alpha^{(k)}$. Since $\alpha^{(k)} \in [1, \sqrt{n}]$, the decreasing sequence $\{\alpha^{(k)}\}$ converges, i.e., $\alpha^{(k)} \to \alpha^*$. By the boundedness of $\{x^{(k)}\}$, it has a convergent subsequence, i.e., there exists a vector $x^*$ such that $x^{(k_i)} \to x^*$.

Remark 2.
The sufficient decrease property (Lemma 1) does not hold for $\beta = 0$, in which case $L_1/L_2$-A2 reduces to A1. Consequently, we cannot show that A1 converges to a critical point.

Remark 3.
According to Theorem 1 and Theorem 2, for both algorithms either the iterates diverge due to unboundedness or there exists a convergent subsequence. It is possible for the solution to be unbounded: for example, if $A$ has a zero column, then the corresponding entry can grow to $+\infty$ so that the ratio of $L_1$ and $L_2$ is minimized. In the numerical tests, we demonstrate empirically that $\{x^{(k)}\}$ is always bounded, and hence has a convergent subsequence, for general (random) matrices $A$.
V. NUMERICAL EXPERIMENTS
In this section, we compare the proposed algorithms with state-of-the-art methods in sparse recovery. All the numerical experiments are conducted on a desktop with an Intel i7-6700 CPU (3.4 GHz) and
MATLAB 9.

We focus on the sparse recovery problem with highly coherent matrices, on which standard $L_1$ models fail. Following the works of [15], [49], [50], we consider an oversampled discrete cosine transform (DCT), defined as $A = [a_1, a_2, \cdots, a_n] \in \mathbb{R}^{m\times n}$ with
$$a_j := \frac{1}{\sqrt{m}}\cos\Big(\frac{2\pi w j}{F}\Big), \quad j = 1, \cdots, n, \tag{32}$$
where $w$ is a random vector uniformly distributed in $[0, 1]^m$ and $F$ is a positive parameter that controls the coherence, in the sense that a larger $F$ yields a more coherent matrix. Throughout the experiments, we consider oversampled DCT matrices of this form with $m \ll n$. The ground truth $x \in \mathbb{R}^n$ is simulated as an $s$-sparse signal, where $s$ is the number of nonzero entries. As suggested in [50], we require a minimum separation of at least $2F$ in the support of $x$. As for the values of the nonzero elements, we follow the work of [51] and consider sparse signals with a high dynamic range. Define the dynamic range of a signal $x$ as $\Theta(x) = \frac{\max\{|x_s|\}}{\min\{|x_s|\}}$, which can be controlled by an exponential factor $D$. In particular, we simulate $x_s$ by the following MATLAB command,

xs = sign(randn(s,1)).*10.^(D*rand(s,1))

In the experiments, we set $D = 3$ and $5$, corresponding to $\Theta \approx 10^3$ and $10^5$, respectively. Note that randn and rand are the MATLAB commands for the Gaussian distribution $\mathcal{N}(0, 1)$ and the uniform distribution $\mathcal{U}(0, 1)$, respectively. To compare with our previous work [32] on the $L_1/L_2$ minimization, we also consider nonzero elements that follow the Gaussian distribution, i.e., $(x_s)_i \sim \mathcal{N}(0, 1)$, $i = 1, 2, \cdots, s$.

The fidelity of sparse signal recovery is assessed in terms of the success rate, defined as the number of successful trials over the total number of trials. When the relative error between the ground truth $x$ and the reconstructed solution $x^*$, i.e., $\frac{\|x^* - x\|_2}{\|x\|_2}$, is less than a prescribed threshold, we declare a success. Moreover, we categorize failures to recover the ground-truth signal as model failures or algorithm failures by comparing the objective function $f(\cdot)$ at the ground truth $x$ and at the restored solution $x^*$. If $f(x) > f(x^*)$, then $x$ is not a global minimizer of the model, so we regard it as a model failure. If $f(x) < f(x^*)$, then the algorithm does not reach a global minimizer; this is referred to as an algorithm failure. Similarly to success rates, we can define model-failure rates and algorithm-failure rates.
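The following MATLAB sketch reproduces this test setup. The problem sizes and the success threshold below are illustrative assumptions rather than the paper's exact values, and the support-sampling loop is our own simple way of enforcing the minimum separation of 2F.

% Sketch of the experimental setup: oversampled DCT matrix (32) and an
% s-sparse ground truth with minimum support separation 2F and dynamic
% range controlled by D.  Sizes and threshold are illustrative assumptions.
m = 64;  n = 1024;  F = 20;  s = 10;  D = 3;
w = rand(m, 1);                                   % uniform in [0,1]^m
A = cos(2*pi*w*(1:n)/F) / sqrt(m);                % oversampled DCT, Eq. (32)

supp = [];                                        % support with separation >= 2F
while numel(supp) < s
    cand = randi(n);
    if isempty(supp) || min(abs(supp - cand)) >= 2*F
        supp = [supp, cand];                      %#ok<AGROW>
    end
end
x = zeros(n, 1);
x(supp) = sign(randn(s,1)) .* 10.^(D*rand(s,1));  % high-dynamic-range amplitudes
b = A * x;

% A reconstruction xr is declared a success if its relative error is small:
% success = norm(xr - x) / norm(x) < 1e-3;        % 1e-3 is an assumed threshold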
A. Algorithmic Comparison

We present various computational aspects of the proposed algorithms, i.e., BS, A1, and A2, together with a comparison to our previous ADMM approach [32]. First of all, we demonstrate the convergence of all proposed algorithms using an example with $s = 15$, $F = 15$ (so the minimal separation is 30), and nonzero elements following the Gaussian distribution. Since the ratio model is solved via the $L_1$-$\alpha L_2$ model, we plot the values of $\|x^{(k)}\|_1 - \alpha^{(k-1)}\|x^{(k)}\|_2$ and $\alpha^{(k)}$ versus the iteration counter $k$ in Figure 1. For $L_1/L_2$-BS, we record the value at each outer iteration, and the stopping conditions are either that the maximum number of outer iterations reaches 10 or that $|\alpha^{(k)} - \alpha^{(k-1)}|$ falls below a small tolerance. For each iteration of A1, A2, and the inner loop of BS, the stopping criterion is that the relative error $\|x^{(k)} - x^{(k-1)}\|_2/\|x^{(k)}\|_2$ falls below a small tolerance. The left plot in Figure 1 illustrates the convergence of the three algorithms in the sense that $\|x^{(k)}\|_1 - \alpha^{(k)}\|x^{(k)}\|_2$ decreases. Both A1 and A2 are faster than BS, as BS starts with the large initial range $[1, \sqrt{n}]$ for $\alpha$, while A1 and A2 start with a good initial value $\alpha^{(0)} = \|x^{(0)}\|_1/\|x^{(0)}\|_2$, which is very close to the final optimal value $\alpha^*$. The right plot in Figure 1 examines the evolution of $\alpha^{(k)}$, which gradually becomes stable and approaches a similar value around 3.06 for all three algorithms. Figure 1 confirms the decrease property of $\alpha^{(k)}$ proved in Lemma 1.

Figure 1. Empirical analysis on convergence: $\|x^{(k)}\|_1 - \alpha^{(k-1)}\|x^{(k)}\|_2$ (left) and $\alpha^{(k)}$ (right) versus the iteration counter $k$ for BS, A1, and A2.

In Theorem 1, we require the sequence $\{x^{(k)}\}$ to be bounded for the convergence analysis. Here we aim at an empirical verification of the boundedness. In particular, we test various linear systems with $F \in \{1, 20\}$ and sparsity ranging from 2 to 22. In each setting, we randomly generate 50 pairs of ground-truth signals and linear systems and compute the $L_2$ norm of the solutions obtained by A1 and A2, along with the $L_2$ norm of the ground-truth signals. The mean values of these $L_2$ norms are plotted in Figure 2. As the maximum values are finite, the reconstructed signal is always bounded. Figure 2 also shows that the $L_2$ norms of A1 and A2 align quite well with the ground truth when the sparsity is below 14, no matter whether the system is coherent or not. When the matrix is highly coherent with more nonzero elements, both A1 and A2 give much larger values of the $L_2$ norm compared to the ground truth. This is because a larger $L_2$ norm gives rise to a smaller value of the ratio $L_1/L_2$ that we try to minimize. In any case, the solutions of both A1 and A2 are shown to be bounded.

Figure 2. The $L_2$ norm of the ground-truth vectors as well as the reconstructed solutions by A1 and A2 at coherence levels $F = 1$ (left) and $F = 20$ (right).

Next, we compare the three algorithms with our previous ADMM approach [32]. We consider $F = 1$ and $20$ with nonzero elements following the Gaussian distribution or having high dynamic ranges. We randomly simulate 50 trials for each sparsity level and compute the average success rates, algorithm-failure rates, and computation time. The Gaussian case is illustrated in Figure 3, showing that ADMM is the worst in terms of success rates, partly due to its high algorithm-failure rates. Here, $\rho_1 = \rho_2 = 2000$ for ADMM and $\beta = 1$, $\rho = 20$ for A2. In addition, BS achieves the highest success rates but is the slowest. Both A1 and A2 have similar performance to BS with much reduced computation time. Figure 4 examines the case of high dynamic ranges for the nonzero values in $x$ with $D = 3$ and $5$. Here we set $\beta$ and $\rho$ to smaller values for A2, while $\rho_1 = \rho_2 = 100$ for ADMM. Similar performance is observed as in the Gaussian case. In summary, we rate A1 as the most efficient algorithm for minimizing the ratio model, with a balanced performance between accuracy and computational cost. We also observe that all the algorithms tend to give better success rates at higher dynamic ranges, which seems counter-intuitive. We will revisit this phenomenon in Section VI.

Figure 3. Algorithmic comparison in the Gaussian distribution case with $F = 1$ (top) and $F = 20$ (bottom) in terms of success rates (left), algorithm-failure rates (middle), and computation time (right).

Figure 4. Success rates of different $L_1/L_2$ minimizing algorithms versus sparsity at coherence levels $F = 1$ (left) and $F = 20$ (right) as well as high dynamic ranges of $D = 3$ (top) and $D = 5$ (bottom).

B. Model Comparison
We intend to compare various sparsity-promoting models. Since the Gaussian case was examined in our previous work [32], we focus on the dynamic range here. We compare the proposed $L_1/L_2$ model with the following models: $L_1$ [10], $L_p$ [11], $L_1$-$L_2$ [49], [15], and TL1 [18]. We adopt $L_1/L_2$-A1 to solve the ratio model, as it is the most efficient algorithm according to the discussion in Section V-A. The initial guess for all nonconvex models is the $L_1$ solution obtained by Gurobi. We choose $p = 1/2$ for $L_p$ and $a = 10^{D-1}$ for TL1 when the range factor $D$ is known a priori.

Figure 5 plots the success rates for $F = 1, 20$ and $D = 3, 5$. We observe that TL1 is the best except in the low-coherence and low-dynamic-range case, where $L_p$ is the best; however, $L_p$ is the worst in the other cases. The $L_1/L_2$ model is always the second best. Note that the ratio model is parameter-free, while the performance of TL1 relies largely on the parameter $a$. Figure 6 examines the success rate of TL1 with different values of $a$. We choose $a = 10^{D-1}$ in the model comparison, which is almost the best among the tested values of $a$. If no such prior information on the dynamic range were available to tune $a$, the performance of TL1 might be worse than that of $L_1/L_2$.

Figure 5. Success rates of different models versus sparsity at coherence levels $F = 1$ (left) and $F = 20$ (right) as well as high dynamic ranges of $D = 3$ (top) and $D = 5$ (bottom).

Figure 6. Success rates of different $a$ values in the TL1 model at coherence levels $F = 1$ (left) and $F = 20$ (right) as well as high dynamic ranges of $D = 3$ (top) and $D = 5$ (bottom).

VI. DISCUSSIONS
Candès and Wakin [52] presented two principles in compressed sensing, i.e., sparsity and incoherence. We reported in our previous work [32] that higher coherence leads to better sparse recovery, which seems to contradict the current belief in CS. In this paper, we discuss the dynamic range and reveal its effect on the exact recovery via the $L_1$ approach. To the best of our knowledge, there has been little discussion of the dynamic range in the CS literature, except for [51]. We consider low-coherence matrices with $F = 1$ and highly coherent ones with $F = 20$. We record the success rates for different combinations of sparsity levels ($s = 2:4:22$) and dynamic ranges $D = 0:5$ in Table I. It shows that a higher dynamic range leads to a better performance, and the $L_1$ approach appears to be insensitive to $D$ for relatively sparse signals.

Now that there are three quantities that may contribute to the success of sparse recovery, i.e., sparsity, coherence, and dynamic range, we attempt a more comprehensive analysis by using the relative error $\|x^* - x\|_2/\|x\|_2$ instead of the success rate, as the latter depends on the success threshold. We plot in Figure 7 the mean and the standard deviation of the relative errors over 50 random trials versus the coherence level ($F = 1, \ldots, 20$). Based on Table I, we only consider sparsity levels $s \ge 18$ and dynamic ranges $D \ge 3$.
Table I
Success rate (%) in solving different dynamic ranges via the $L_1$ model at two coherence levels $F = 1$ and $F = 20$ (columns correspond to sparsity levels $s = 2, 6, 10, 14, 18, 22$).

F = 1:
        s = 2    6    10   14   18   22
  D = 0   100  100   80    4    0    0
  D = 1   100  100   80    4    0    0
  D = 2   100  100   80    4    0    0
  D = 3   100  100   80    4    0    0
  D = 4   100  100   86   16    0    0
  D = 5   100  100   88   38   12    0

F = 20:
        s = 2    6    10   14   18   22
  D = 0   100  100  100  100   50    0
  D = 1   100  100  100  100   52    0
  D = 2   100  100  100  100   52    0
  D = 3   100  100  100  100   52    0
  D = 4   100  100  100  100   54    0
  D = 5   100  100  100  100   76   16
In each subfigure of Figure 7, the curves decrease as $F$ increases, which means that higher coherence leads to better performance. This is consistent with the observation in [32]. As for the dynamic range, we observe in Figure 7 that a larger value of $D$ leads to a smaller relative error. Finally, sparsity affects the performance in the sense that smaller relative errors are achieved for sparser signals. These numerical phenomena have not been reported in the CS literature, and they motivate future theoretical justifications.

Figure 7. The relative errors $\|x^* - x\|_2/\|x\|_2$ produced by the $L_1$ approach for sparse signal recovery (panels: $s = 18, 20, 22$; curves: $D = 3, 5, 7$; horizontal axis: coherence parameter $F$).

VII. CONCLUSIONS AND FUTURE WORKS
We studied the scale-invariant and parameter-free $L_1/L_2$ minimization to promote sparsity. We presented three numerical algorithms to minimize this nonconvex model based on the relationship between $L_1/L_2$ and $L_1$-$\alpha L_2$ for certain $\alpha$. Experimental results compared the proposed algorithms with state-of-the-art methods in sparse recovery; in particular, the proposed algorithms work well when the ground-truth signal has a high dynamic range. Last but not least, we analyzed the behavior of the $L_1$ approach towards exact recovery with varying sparsity, coherence, and dynamic range. Future works include the theoretical analysis of the effect of a high dynamic range on sparse recovery as well as applications of the ratio model in image processing, such as blind deconvolution [28], [29].

APPENDIX
A. Proof of Lemma 1

Proof.
Based on the $x$-subproblem in (11), we get
$$\|x^{(k+1)}\|_1 - \Big\langle x^{(k+1)}, \frac{\alpha^{(k)} x^{(k)}}{\|x^{(k)}\|_2}\Big\rangle + \frac{\beta}{2}\|x^{(k+1)} - x^{(k)}\|_2^2 \le \|x^{(k)}\|_1 - \Big\langle x^{(k)}, \frac{\alpha^{(k)} x^{(k)}}{\|x^{(k)}\|_2}\Big\rangle + \frac{\beta}{2}\|x^{(k)} - x^{(k)}\|_2^2 = \|x^{(k)}\|_1 - \Big\langle x^{(k)}, \frac{\alpha^{(k)} x^{(k)}}{\|x^{(k)}\|_2}\Big\rangle.$$
After rearranging, we get the following inequality,
$$\|x^{(k+1)}\|_1 + \frac{\beta}{2}\|x^{(k+1)} - x^{(k)}\|_2^2 \le \|x^{(k)}\|_1 + \alpha^{(k)}\Big\langle x^{(k+1)} - x^{(k)}, \frac{x^{(k)}}{\|x^{(k)}\|_2}\Big\rangle \le \|x^{(k)}\|_1 + \alpha^{(k)}\big(\|x^{(k+1)}\|_2 - \|x^{(k)}\|_2\big) = \alpha^{(k)}\|x^{(k+1)}\|_2. \tag{33}$$
The second inequality is owing to the convexity of the Euclidean norm, and the last equality follows from the definition of $\alpha^{(k)}$. Lemma 1 is then obtained by dividing both sides of (33) by $\|x^{(k+1)}\|_2$.
B. Proof of Lemma 2

Proof.
Simple calculations lead to
$$\Big\|\frac{x}{\|x\|_2} - \frac{y}{\|y\|_2}\Big\|_2^2 = 1 - \frac{2\langle x, y\rangle}{\|x\|_2\|y\|_2} + 1 = \frac{1}{\|x\|_2\|y\|_2}\big(2\|x\|_2\|y\|_2 - 2\langle x, y\rangle\big) \le \frac{1}{\|x\|_2\|y\|_2}\big(\|x\|_2^2 + \|y\|_2^2 - 2\langle x, y\rangle\big) = \frac{1}{\|x\|_2\|y\|_2}\|x - y\|_2^2. \tag{34}$$
For any $x$ satisfying $Ax = b$, the minimal $L_2$ norm is attained by projecting the origin onto the feasible set $\{x \mid Ax = b\}$. It follows from the projection operator defined in (15) that
$$\|x\|_2 \ge \|\mathrm{proj}(0)\|_2 = \|A^T(AA^T)^{-1}b\|_2. \tag{35}$$
Combining (34) and (35), we get Lemma 2.

C. Proof of Lemma 3

Proof.
It is straightforward to have
$$\|\nabla w(x) - \nabla w(y)\|_2 = \Big\|\frac{\|x\|_1}{\|x\|_2^2}x - \frac{\|y\|_1}{\|y\|_2^2}y\Big\|_2 = \Big\|\frac{\|x\|_1}{\|x\|_2^2}x - \frac{\|x\|_1}{\|y\|_2^2}y + \frac{\|x\|_1}{\|y\|_2^2}y - \frac{\|y\|_1}{\|y\|_2^2}y\Big\|_2 \le \|x\|_1\Big\|\frac{x}{\|x\|_2^2} - \frac{y}{\|y\|_2^2}\Big\|_2 + \frac{1}{\|y\|_2}\big|\|x\|_1 - \|y\|_1\big|. \tag{36}$$
We bound the first term in (36) by calculating
$$\Big\|\frac{x}{\|x\|_2^2} - \frac{y}{\|y\|_2^2}\Big\|_2^2 = \frac{1}{\|x\|_2^2} + \frac{1}{\|y\|_2^2} - \frac{2\langle x, y\rangle}{\|x\|_2^2\|y\|_2^2} = \frac{\|x\|_2^2 + \|y\|_2^2 - 2\langle x, y\rangle}{\|x\|_2^2\|y\|_2^2} = \Big(\frac{\|x - y\|_2}{\|x\|_2\|y\|_2}\Big)^2,$$
and using $\|x\|_1 \le \sqrt{n}\|x\|_2$. Therefore, we get
$$\|x\|_1\Big\|\frac{x}{\|x\|_2^2} - \frac{y}{\|y\|_2^2}\Big\|_2 \le \sqrt{n}L\,\|x - y\|_2. \tag{37}$$
As for the second term in (36), it is bounded by
$$\frac{1}{\|y\|_2}\big|\|x\|_1 - \|y\|_1\big| \le \frac{1}{\|y\|_2}\|x - y\|_1 \le \frac{\sqrt{n}}{\|y\|_2}\|x - y\|_2 \le \sqrt{n}L\,\|x - y\|_2. \tag{38}$$
Combining (37) and (38), we obtain (29).

D. Proof of Lemma 4

Proof.
It is straightforward that
$$\Phi(x^*) = 0 \iff x^* = \mathrm{prox}_{\frac{1}{\beta} g}\Big(x^* - \frac{1}{\beta}\nabla w(x^*)\Big).$$
By the optimality condition [47], the latter relation holds if and only if there exists a vector $s$ such that
$$0 \in \partial\|x^*\|_1 + \nabla w(x^*) + \beta(x^* - x^*) + A^T s = \partial\|x^*\|_1 + \nabla w(x^*) + A^T s,$$
which implies that $x^*$ is a critical point of (26). It follows from (25) that the critical points of (26) coincide with those of (2), and hence $x^*$ is also a critical point of (2). According to the nonexpansiveness of the proximal operator and the Lipschitz continuity of $\nabla w$, we have
$$\|\Phi(x) - \Phi(y)\|_2 \le \beta\Big\|\mathrm{prox}_{\frac{1}{\beta} g}\Big(x - \frac{1}{\beta}\nabla w(x)\Big) - \mathrm{prox}_{\frac{1}{\beta} g}\Big(y - \frac{1}{\beta}\nabla w(y)\Big)\Big\|_2 + \beta\|x - y\|_2 \le \beta\Big\|\Big(x - \frac{1}{\beta}\nabla w(x)\Big) - \Big(y - \frac{1}{\beta}\nabla w(y)\Big)\Big\|_2 + \beta\|x - y\|_2 \le \|\nabla w(x) - \nabla w(y)\|_2 + 2\beta\|x - y\|_2 \le (L_w + 2\beta)\|x - y\|_2.$$
The lemma follows.
REFERENCES

[1] R. Tibshirani, "Regression shrinkage and selection via the lasso," J. R. Stat. Soc. Series B, vol. 58, no. 1, pp. 267-288, 1996.
[2] L. I. Rudin, S. Osher, and E. Fatemi, "Nonlinear total variation based noise removal algorithms," Physica D, vol. 60, no. 1-4, pp. 259-268, 1992.
[3] A. Berman and R. J. Plemmons, Nonnegative Matrices in the Mathematical Sciences. SIAM, 1994.
[4] D. L. Donoho et al., "Compressed sensing," IEEE Trans. Inf. Theory, vol. 52, no. 4, pp. 1289-1306, 2006.
[5] E. J. Candès, J. K. Romberg, and T. Tao, "Stable signal recovery from incomplete and inaccurate measurements," Comm. Pure Appl. Math., vol. 59, no. 8, pp. 1207-1223, 2006.
[6] B. K. Natarajan, "Sparse approximate solutions to linear systems," SIAM J. Comput., vol. 24, no. 2, pp. 227-234, 1995.
[7] Y. C. Pati, R. Rezaiifar, and P. S. Krishnaprasad, "Orthogonal matching pursuit: Recursive function approximation with applications to wavelet decomposition," in Asilomar Conf. Signals, Systems and Computers. IEEE, 1993, pp. 40-44.
[8] S. Chen, S. A. Billings, and W. Luo, "Orthogonal least squares methods and their application to non-linear system identification," Int. J. Control, vol. 50, no. 5, pp. 1873-1896, 1989.
[9] D. Needell and J. A. Tropp, "CoSaMP: Iterative signal recovery from incomplete and inaccurate samples," Appl. Comput. Harmon. Anal., vol. 26, no. 3, pp. 301-321, 2009.
[10] S. S. Chen, D. L. Donoho, and M. A. Saunders, "Atomic decomposition by basis pursuit," SIAM Rev., vol. 43, no. 1, pp. 129-159, 2001.
[11] R. Chartrand, "Exact reconstruction of sparse signals via nonconvex minimization," IEEE Signal Process. Lett., vol. 14, no. 10, pp. 707-710, 2007.
[12] Z. Xu, X. Chang, F. Xu, and H. Zhang, "L1/2 regularization: A thresholding representation theory and a fast solver," IEEE Trans. Neural Networks Learn. Syst., vol. 23, no. 7, pp. 1013-1027, 2012.
[13] M.-J. Lai, Y. Xu, and W. Yin, "Improved iteratively reweighted least squares for unconstrained smoothed lq minimization," SIAM J. Numer. Anal., vol. 51, no. 2, pp. 927-957, 2013.
[14] P. Yin, E. Esser, and J. Xin, "Ratio and difference of l1 and l2 norms and sparse representation with coherent dictionaries," Comm. Inf. Syst., vol. 14, no. 2, pp. 87-109, 2014.
[15] Y. Lou, P. Yin, Q. He, and J. Xin, "Computing sparse representation in a highly coherent dictionary based on difference of L1 and L2," J. Sci. Comput., vol. 64, no. 1, pp. 178-196, 2015.
[16] J. Lv, Y. Fan et al., "A unified approach to model selection and sparse recovery using regularized least squares," Annals of Stat., vol. 37, no. 6A, pp. 3498-3528, 2009.
[17] S. Zhang and J. Xin, "Minimization of transformed L1 penalty: Closed form representation and iterative thresholding algorithms," Comm. Math. Sci., vol. 15, pp. 511-537, 2017.
[18] ——, "Minimization of transformed L1 penalty: theory, difference of convex function algorithm, and robust application in compressed sensing," Math. Program., vol. 169, no. 1, pp. 307-336, 2018.
[19] L. Breiman, "Better subset regression using the nonnegative garrote," Technometrics, vol. 37, no. 4, pp. 373-384, 1995.
[20] D. Peleg and R. Meir, "A bilinear formulation for vector sparsity optimization," Signal Process., vol. 88, no. 2, pp. 375-389, 2008.
[21] T. Zhang, "Multi-stage convex relaxation for learning with sparse regularization," in Adv. Neural Inf. Proces. Syst., 2009, pp. 1929-1936.
[22] X. Shen, W. Pan, and Y. Zhu, "Likelihood-based selection and sharp parameter estimation," J. Am. Stat. Assoc., vol. 107, no. 497, pp. 223-232, 2012.
[23] P. O. Hoyer, "Non-negative sparse coding," in Proc. 12th IEEE Workshop on Neural Networks for Signal Process., 2002, pp. 557-565.
[24] N. Hurley and S. Rickard, "Comparing measures of sparsity," IEEE Trans. Inf. Theory, vol. 55, no. 10, pp. 4723-4741, 2009.
[25] E. Esser, Y. Lou, and J. Xin, "A method for finding structured sparse solutions to nonnegative least squares problems with applications," SIAM J. Imag. Sci., vol. 6, no. 4, pp. 2010-2046, 2013.
[26] E. Esser, T. Lin, R. Wang, and F. J. Herrmann, "A lifted l1/l2 constraint for sparse blind deconvolution," in the 77th EAGE Conf. Exhi., 2015.
[27] E. Esser, T. T. Lin, F. J. Herrmann, and R. Wang, "Resolving scaling ambiguities with the l1/l2 norm in a blind deconvolution problem with feedback," in IEEE 6th Int. Workshop Comput. Adv. Multi-Sensor Adapt. Process. (CAMSAP), 2015, pp. 365-368.
[28] D. Krishnan, T. Tay, and R. Fergus, "Blind deconvolution using a normalized sparsity measure," in IEEE Comput. Vision and Pattern Recognit. (CVPR), 2011, pp. 233-240.
[29] A. Repetti, M. Q. Pham, L. Duval, E. Chouzenoux, and J. C. Pesquet, "Euclid in a taxicab: Sparse blind deconvolution with smoothed l1/l2 regularization," IEEE Signal Process. Lett., vol. 22, no. 5, pp. 539-543, 2015.
[30] M. Q. Pham, B. Oudompheng, J. I. Mars, and B. Nicolas, "A noise-robust method with smoothed l1/l2 regularization for sparse moving-source mapping," Signal Process., vol. 135, pp. 96-106, 2017.
[31] X. Jia, M. Zhao, Y. Di, P. Li, and J. Lee, "Sparse filtering with the generalized lp/lq norm and its applications to the condition monitoring of rotating machinery," Mech. Syst. and Sig. Process., vol. 102, pp. 198-213, 2018.
[32] Y. Rahimi, C. Wang, H. Dong, and Y. Lou, "A scale invariant approach for sparse signal recovery," SIAM J. Sci. Comput., vol. 41, no. 6, pp. A3649-A3672, 2019.
[33] S. Boyd, N. Parikh, E. Chu, B. Peleato, J. Eckstein et al., "Distributed optimization and statistical learning via the alternating direction method of multipliers," Foundations and Trends in Machine Learning, vol. 3, no. 1, pp. 1-122, 2011.
[34] G. H. Golub and C. F. Van Loan, Matrix Computations. Baltimore, MD, USA: Johns Hopkins University Press, 1996, pp. 374-426.
[35] Y. Lou, S. Osher, and J. Xin, "Computational aspects of constrained l1-l2 minimization for compressive sensing," in Model. Comput. and Opt. Inf. Syst. and Manag. Sci. Springer, 2015, pp. 169-180.
[36] T. Pham-Dinh and H. A. Le-Thi, "A D.C. optimization algorithm for solving the trust-region subproblem," SIAM J. Optim., vol. 8, no. 2, pp. 476-505, 1998.
[37] ——, "The DC (difference of convex functions) programming and DCA revisited with DC models of real world nonconvex optimization problems," Annals Oper. Res., vol. 133, no. 1-4, pp. 23-46, 2005.
[38] Gurobi Optimization, Inc., "Gurobi optimizer reference manual," 2015.
[39] V. A. Morozov, Methods for Solving Incorrectly Posed Problems. Springer Science & Business Media, 2012.
[40] Y. Wen and R. H. Chan, "Parameter selection for total-variation-based image restoration using discrepancy principle," IEEE Trans. Image Process., vol. 21, no. 4, pp. 1770-1781, 2012.
[41] T. Teuber, G. Steidl, and R. H. Chan, "Minimization and parameter estimation for seminorm regularization models with I-divergence constraints," Inverse Prob., vol. 29, no. 3, p. 035007, 2013.
[42] L. N. Trefethen and D. Bau III, Numerical Linear Algebra. SIAM, 1997, vol. 50.
[43] M. Hein and T. Bühler, "An inverse power method for nonlinear eigenproblems with applications in 1-spectral clustering and sparse PCA," in Adv. Neural Inf. Proces. Syst., 2010, pp. 847-855.
[44] X. Bresson, T. Laurent, D. Uminsky, and J. V. Brecht, "Convergence and energy landscape for Cheeger cut clustering," in Adv. Neural Inf. Proces. Syst., 2012, pp. 1385-1393.
[45] N. Parikh, S. Boyd et al., "Proximal algorithms," Foundations and Trends in Optimization, vol. 1, no. 3, pp. 127-239, 2014.
[46] M. Fukushima and H. Mine, "A generalized proximal point algorithm for certain non-convex minimization problems," Int. J. Syst. Sci., vol. 12, no. 8, pp. 989-1000, 1981.
[47] P. L. Combettes and J.-C. Pesquet, "Proximal splitting methods in signal processing," in Fixed-Point Algorithms for Inverse Problems in Science and Engineering. Springer, 2011, pp. 185-212.
[48] K. Bredies, D. A. Lorenz, and P. Maass, "A generalized conditional gradient method and its connection to an iterative shrinkage method," Comput. Opt. Appl., vol. 42, no. 2, pp. 173-193, 2009.
[49] P. Yin, Y. Lou, Q. He, and J. Xin, "Minimization of l1-2 for compressed sensing," SIAM J. Sci. Comput., vol. 37, no. 1, pp. A536-A563, 2015.
[50] A. Fannjiang and W. Liao, "Coherence pattern-guided compressive sensing with unresolved grids," SIAM J. Imag. Sci., vol. 5, no. 1, pp. 179-202, 2012.
[51] D. A. Lorenz, "Constructing test instances for basis pursuit denoising," IEEE Trans. Signal Process., vol. 61, no. 5, pp. 1210-1214, 2013.
[52] E. J. Candès and M. B. Wakin, "An introduction to compressive sampling," IEEE Signal Process. Mag., vol. 25, no. 2, pp. 21-30, 2008.