A Scale Invariant Approach for Sparse Signal Recovery∗

Yaghoub Rahimi†, Chao Wang†‡, Hongbo Dong§, and Yifei Lou†

Abstract.
In this paper, we study the ratio of the $L_1$ and $L_2$ norms, denoted as $L_1/L_2$, to promote sparsity. Due to its non-convexity and non-linearity, this scale-invariant model has received little attention. Compared to popular models in the literature, such as the $L_p$ model for $p \in (0,1)$ and the transformed $L_1$ (TL1), this ratio model is parameter free. Theoretically, we present a strong null space property (sNSP) and prove that any sparse vector is a local minimizer of the $L_1/L_2$ model provided the sNSP condition holds. Computationally, we focus on a constrained formulation that can be solved via the alternating direction method of multipliers (ADMM). Experiments show that the proposed approach is comparable to state-of-the-art methods in sparse recovery. In addition, a variant of the $L_1/L_2$ model applied to the gradient is also discussed, with a proof-of-concept example of MRI reconstruction.

Key words. Sparsity, $L_1$, $L_2$, null space property, alternating direction method of multipliers, MRI reconstruction

AMS subject classifications.
1. Introduction.
Sparse signal recovery aims to find the sparsest solution of $Ax = b$, where $x \in \mathbb{R}^n$, $b \in \mathbb{R}^m$, and $A \in \mathbb{R}^{m\times n}$ with $m \ll n$. This problem is often referred to as compressed sensing (CS) in the sense that the sparse signal $x$ is compressible. Mathematically, this fundamental problem in CS can be formulated as
(1.1) $\min_{x\in\mathbb{R}^n} \|x\|_0 \quad \text{s.t.} \quad Ax = b$,
where $\|x\|_0$ is the number of nonzero entries in $x$. Unfortunately, (1.1) is NP-hard to solve [31]. A popular approach in CS is to replace $L_0$ by the convex $L_1$ norm, i.e.,
(1.2) $\min_{x\in\mathbb{R}^n} \|x\|_1 \quad \text{s.t.} \quad Ax = b$.
Computationally, there are various $L_1$ minimization algorithms, such as primal-dual [8], forward-backward splitting [34], and the alternating direction method of multipliers (ADMM) [4]. A major breakthrough in CS was the restricted isometry property (RIP) [6], which provides a sufficient condition under which minimizing the $L_1$ norm recovers the sparse signal. There is also a necessary and sufficient condition given in terms of the null space of the matrix $A$, thus referred to as the null space property (NSP); see Definition 1.1.
Definition 1.1. For any matrix $A \in \mathbb{R}^{m\times n}$, we say $A$ satisfies the null space property (NSP) of order $s$ if
(1.3) $\|v_S\|_1 < \|v_{\bar S}\|_1, \quad \forall\, v \in \ker(A)\setminus\{0\},\ \forall\, S \subset [n],\ |S| \le s$.

∗Submitted to the journal's Methods and Algorithms for Scientific Computing section December 18, 2018.
Funding:
YR and YL were partially supported by NSF grants DMS-1522786 and CAREER 1846690.
†Department of Mathematical Sciences, University of Texas at Dallas, Richardson, TX 75080 ([email protected], [email protected], [email protected]). ‡Corresponding author. §Department of Mathematics and Statistics, Washington State University, Pullman, WA 99164 ([email protected]).

Here $[n] := \{1, \ldots, n\}$, $\bar S$ is the complement of $S$, i.e., $[n]\setminus S$, and $x_S$ is defined by $(x_S)_i = x_i$ if $i \in S$ and $(x_S)_i = 0$ otherwise. The null space of $A$ is denoted by $\ker(A) := \{x \mid Ax = 0\}$.

Donoho and Huo [12] proved that every $s$-sparse signal $x \in \mathbb{R}^n$ is the unique solution to the $L_1$ minimization (1.2) if and only if $A$ satisfies the NSP of order $s$. The NSP quantifies the notion that vectors in the null space of $A$ should not be too concentrated on a small subset of indices. Since it is a necessary and sufficient condition, the NSP is widely used in proving other exact recovery guarantees. Note that the NSP is no longer necessary if "every $s$-sparse vector" is relaxed. A weaker sufficient condition for exact $L_1$ recovery was proved by Zhang [49]: if a vector $x^*$ satisfies $Ax^* = b$ and
(1.4) $\sqrt{\|x^*\|_0} < \frac{1}{2}\min_{v}\left\{\frac{\|v\|_1}{\|v\|_2} : v \in \ker(A)\setminus\{0\}\right\}$,
then $x^*$ is the unique solution to both (1.1) and (1.2). (The sufficient condition (1.4) is weaker than the one in (1.3).) Unfortunately, neither RIP nor NSP can be numerically verified for a given matrix [1, 38].

Alternatively, a computable condition for exact $L_1$ recovery is based on coherence, which is defined as
(1.5) $\mu(A) := \max_{i\ne j} \dfrac{|a_i^T a_j|}{\|a_i\|_2\,\|a_j\|_2}$
for a matrix $A = [a_1, \ldots, a_n]$. Donoho and Elad [11] and Gribonval [16] proved independently that if $x^*$ satisfies $Ax^* = b$ and
(1.6) $\|x^*\|_0 < \dfrac{1}{2}\left(1 + \dfrac{1}{\mu(A)}\right)$,
then $x^*$ is the optimal solution to both (1.1) and (1.2). Clearly, the coherence $\mu(A)$ is bounded in $[0,1]$. The condition (1.6) suggests that $L_1$ may not perform well for highly coherent matrices, i.e., $\mu(A) \sim 1$, as $\|x^*\|_0$ is then at most one, which seldom occurs simultaneously with $Ax^* = b$.

Other than the popular $L_1$ norm, there are a variety of regularization functionals that promote sparsity, such as $L_p$ [9, 43, 23], $L_1-L_2$ [44, 26], capped $L_1$ (CL1) [48, 37], and transformed $L_1$ (TL1) [29, 46, 47]. Most of these models are nonconvex, leading to difficulties in proving exact recovery guarantees and algorithmic convergence, but they tend to give better empirical results than the convex $L_1$ approach. For example, it was reported in [44, 26] that $L_p$ gives superior results for incoherent matrices (i.e., $\mu(A)$ is small), while $L_1-L_2$ is the best in the coherent scenario. In addition, TL1 is always the second best no matter whether the matrix is coherent or not [46, 47].

In this paper, we study the ratio of $L_1$ and $L_2$ as a scale-invariant model to approximate the desired $L_0$, which is scale-invariant itself. In the one-dimensional (1D) case (i.e., $n = 1$), the $L_1/L_2$ model coincides with the $L_0$ model if we use the convention $0/0 = 0$. The ratio of $L_1$ and $L_2$ was first proposed by Hoyer [20] as a sparseness measure and later highlighted in [21] as a scale-invariant model. However, it has received little attention due to the computational difficulties arising from its non-convexity. Previous works related the $L_1/L_2$ and $L_0$ models, but only for nonnegative signals [13, 44]. We aim to apply this ratio model to arbitrary signals. On the other hand, the $L_1/L_2$ minimization has an intrinsic drawback: it tends to produce one erroneously large coefficient while suppressing the other nonzero elements, since this reduces the ratio. To compensate for this drawback, it is helpful to incorporate a box constraint, which will also be addressed in this paper.

Now we turn to a sparsity-related assumption that the signal is sparse after a given transform, as opposed to the signal itself being sparse.
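The coherence bound (1.6) is easy to check numerically. Below is a minimal sketch; the helper name `mutual_coherence` is ours, not from the paper:

```python
import numpy as np

def mutual_coherence(A):
    """Compute mu(A) = max_{i != j} |a_i^T a_j| / (||a_i|| ||a_j||), as in (1.5)."""
    # Normalize each column to unit L2 norm.
    G = A / np.linalg.norm(A, axis=0, keepdims=True)
    gram = np.abs(G.T @ G)          # absolute normalized inner products
    np.fill_diagonal(gram, 0.0)     # exclude the diagonal i == j
    return gram.max()

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 50))
mu = mutual_coherence(A)
# By (1.6), any x* with ||x*||_0 < (1 + 1/mu)/2 is recovered exactly by L1.
sparsity_bound = 0.5 * (1 + 1 / mu)
```

Note that orthogonal columns give $\mu(A) = 0$, so the bound is vacuous only when columns are nearly parallel, matching the discussion of the highly coherent case above.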
This assumption is widely used in image processing. For example, a natural image, denoted by $u$, is mostly sparse after taking the gradient, and hence it is reasonable to minimize the $L_0$ norm of the gradient, i.e., $\|\nabla u\|_0$. To bypass the NP-hard $L_0$ norm, the convex relaxation replaces $L_0$ by $L_1$, where the $L_1$ norm of the gradient is the well-known total variation (TV) [36] of an image. A weighted $L_1-\alpha L_2$ model (for $\alpha > 0$) on the gradient was proposed in [27], with suggested choices of $\alpha$ for image denoising, deblurring, and MRI reconstruction. The ratio of $L_1$ and $L_2$ on the image gradient was used in deconvolution and blind deconvolution [22, 35]. We further adapt the proposed ratio model from sparse signal recovery to imaging applications, specifically focusing on MRI reconstruction.

The rest of the paper is organized as follows. Section 2 is devoted to the theoretical analysis of the $L_1/L_2$ model. In Section 3, we apply ADMM to minimize the ratio of $L_1$ and $L_2$, with two variants: incorporating a box constraint and applying the model to the image gradient. We conduct extensive experiments in Section 4 to demonstrate the performance of the proposed approaches against the state-of-the-art in sparse recovery and MRI reconstruction. Section 5 is a fun exercise, where we use the $L_1/L_2$ minimization to compute the right-hand side of the NSP condition (1.4), leading to an empirical upper bound of the exact $L_1$ recovery guarantee. Finally, conclusions and future work are given in Section 6.
2. Rationales of the $L_1/L_2$ model. We begin with a toy example to illustrate the advantages of $L_1/L_2$ over other alternatives, followed by some theoretical properties of the proposed model.

Define a matrix $A \in \mathbb{R}^{5\times 6}$ and a vector $b \in \mathbb{R}^5$, as in (2.1), such that any general solution of $Ax = b$ has the form $x = (t,\, t,\, t,\, 10-t,\, 10-t,\, t-9)^T$ for a scalar $t \in \mathbb{R}$. The sparsest solution occurs at $t = 0$, where the sparsity of $x$ is 3; some local solutions include $t = 10$ with sparsity 4 and $t = 9$ with sparsity 5. In Figure 1, we plot various objective functions with respect to $t$, including $L_1$, $L_p$ (for $p = 1/2$), $L_1-L_2$, and TL1 (for $a = 1$, as suggested in [47]). Note that all these functions are non-differentiable at $t = 0$, $9$, and $10$, where the sparsity of $x$ is strictly smaller than 6. The sparsest vector $x$ corresponding to $t = 0$ can only be found by minimizing TL1 and $L_1/L_2$, while the other models find $t = 10$ as a global minimum.

Recently, Tran and Webster [39] generalized the NSP to deal with sparsity-promoting metrics that are symmetric, separable, and concave, which unfortunately does not apply to $L_1/L_2$ (not separable); but this work motivates us to consider a stronger form of the NSP, as defined in Definition 2.1.

Fig. 1. The objective functions of the toy example over $t \in [-30, 30]$ for (a) $L_1$, (b) $L_p$ ($p = 1/2$), (c) TL1, (d) $L_1-L_2$, and (e) $L_1/L_2$, illustrating that only $L_1/L_2$ and TL1 find $t = 0$ as the global minimizer, but TL1 has a very narrow basin of attraction (thus sensitive to the initial guess and difficult to find the global solution).
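The toy example can be evaluated directly along the solution line, since every feasible point is parametrized by $t$. A short sketch (no explicit $A$ and $b$ are needed for this check):

```python
import numpy as np

def x_of_t(t):
    """A feasible point of Ax = b in the toy example, parametrized by t."""
    return np.array([t, t, t, 10 - t, 10 - t, t - 9], dtype=float)

def ratio(t):
    """The L1/L2 objective evaluated along the solution line."""
    x = x_of_t(t)
    return np.linalg.norm(x, 1) / np.linalg.norm(x, 2)

def sparsity(t):
    return np.count_nonzero(x_of_t(t))

# Candidate non-differentiable points: t = 0 (sparsity 3), t = 9 (sparsity 5),
# and t = 10 (sparsity 4). L1/L2 is smallest at the sparsest point t = 0.
vals = {t: ratio(t) for t in (0.0, 9.0, 10.0)}
```

Evaluating `vals` reproduces the ordering visible in panel (e) of Figure 1: the ratio at $t = 0$ is below the ratios at $t = 9$ and $t = 10$.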
Definition 2.1. For any matrix $A \in \mathbb{R}^{m\times n}$, we say $A$ satisfies a strong null space property (sNSP) of order $s$ if
(2.2) $(s+1)\,\|v_S\|_1 \le \|v_{\bar S}\|_1, \quad \forall\, v \in \ker(A)\setminus\{0\},\ \forall\, S \subset [n],\ |S| \le s$.

Note that Definition 2.1 is stronger than the original NSP in Definition 1.1, in the sense that if a matrix satisfies the sNSP then it also satisfies the original NSP. The following theorem says that any $s$-sparse vector is a local minimizer of $L_1/L_2$ provided the matrix has the sNSP of order $s$. The proof is given in the Appendix.
Theorem 2.2. Assume an $m \times n$ matrix $A$ satisfies the sNSP of order $s$. Then any $s$-sparse solution of $Ax = b$ ($b \ne 0$) is a local minimum of $L_1/L_2$ in the feasible space of $Ax = b$; i.e., there exists a positive number $t^* > 0$ such that for every $v \in \ker(A)$ with $0 < \|v\|_2 \le t^*$ we have
(2.3) $\dfrac{\|x\|_1}{\|x\|_2} \le \dfrac{\|x+v\|_1}{\|x+v\|_2}$.

Finally, we show that the optimal value of $L_1/L_2$ subject to $Ax = b$ is upper bounded by the same ratio restricted to the null space; see Proposition 1.

Proposition 1. For any $A \in \mathbb{R}^{m\times n}$ and $x \in \mathbb{R}^n$, we have
(2.4) $\inf_{z\in\mathbb{R}^n}\left\{\dfrac{\|z\|_1}{\|z\|_2}\ \middle|\ Az = Ax\right\} \;\le\; \inf_{z\in\mathbb{R}^n}\left\{\dfrac{\|z\|_1}{\|z\|_2}\ \middle|\ z \in \ker(A)\setminus\{0\}\right\}$.
Proof. Denote
(2.5) $\alpha^* = \inf_{z\in\mathbb{R}^n}\left\{\dfrac{\|z\|_1}{\|z\|_2}\ \middle|\ Az = Ax\right\}$.
For every $v \in \ker(A)\setminus\{0\}$ and $t \in \mathbb{R}$, we have
(2.6) $\alpha^* \le \dfrac{\|x+tv\|_1}{\|x+tv\|_2}$,
since $A(x+tv) = Ax$. Then we obtain
(2.7) $\lim_{t\to\infty} \dfrac{\|x+tv\|_1}{\|x+tv\|_2} = \lim_{t\to\infty}\dfrac{\|x/t+v\|_1}{\|x/t+v\|_2} = \dfrac{\|v\|_1}{\|v\|_2}$.
Therefore, for every $v \in \ker(A)\setminus\{0\}$,
(2.8) $\alpha^* \le \dfrac{\|v\|_1}{\|v\|_2}$,
which directly leads to the desired inequality (2.4).

Proposition 1 shows that the left-hand side of the inequality, which involves both the underlying signal $x$ and the system matrix $A$, can be upper bounded by the minimum ratio that only involves $A$.
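The limit (2.7) behind the proof is easy to sanity-check numerically: for any fixed $x$ and direction $v$, the ratio of $x + tv$ approaches $\|v\|_1/\|v\|_2$ as $t$ grows, and the ratio itself is scale invariant. A small sketch with arbitrary vectors (no sensing matrix is needed, since (2.7) holds for any $x$ and $v$):

```python
import numpy as np

def l1_over_l2(x):
    return np.linalg.norm(x, 1) / np.linalg.norm(x, 2)

rng = np.random.default_rng(1)
x = rng.standard_normal(50)
v = rng.standard_normal(50)

limit = l1_over_l2(v)
# Scale invariance: the ratio of c*v equals the ratio of v for any c > 0.
scaled = l1_over_l2(3.0 * v)
# As t -> infinity, ||x + t v||_1 / ||x + t v||_2 -> ||v||_1 / ||v||_2,
# since x/t vanishes; see (2.7).
gap = abs(l1_over_l2(x + 1e8 * v) - limit)
```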
3. Numerical schemes.
The proposed model is
(3.1) $\min_{x\in\mathbb{R}^n}\left\{\dfrac{\|x\|_1}{\|x\|_2} + I_{\{0\}}(Ax - b)\right\}$,
where $I_S(t)$ is the indicator function enforcing $t$ to lie in the set $S$, i.e.,
(3.2) $I_S(t) = \begin{cases} 0 & t \in S, \\ +\infty & \text{otherwise}. \end{cases}$
In Subsection 3.1, we detail the ADMM algorithm for minimizing (3.1), followed by a minor change to incorporate an additional box constraint in Subsection 3.2. We discuss in Subsection 3.3 another variant of $L_1/L_2$ on the gradient to deal with imaging applications.

3.1. $L_1/L_2$ minimization via ADMM. In order to apply ADMM [4] to solve (3.1), we introduce two auxiliary variables and rewrite (3.1) into an equivalent form,
(3.3) $\min_{x,y,z}\left\{\dfrac{\|z\|_1}{\|y\|_2} + I_{\{0\}}(Ax-b)\right\}$ s.t. $x = y$, $x = z$.
The augmented Lagrangian for (3.3) is
(3.4) $L_{\rho_1,\rho_2}(x,y,z;v,w) = \dfrac{\|z\|_1}{\|y\|_2} + I_{\{0\}}(Ax-b) + \langle v, x-y\rangle + \dfrac{\rho_1}{2}\|x-y\|_2^2 + \langle w, x-z\rangle + \dfrac{\rho_2}{2}\|x-z\|_2^2 = \dfrac{\|z\|_1}{\|y\|_2} + I_{\{0\}}(Ax-b) + \dfrac{\rho_1}{2}\left\|x-y+\dfrac{1}{\rho_1}v\right\|_2^2 + \dfrac{\rho_2}{2}\left\|x-z+\dfrac{1}{\rho_2}w\right\|_2^2$,
where the last equality holds up to constants independent of the primal variables.
The ADMM consists of the following five steps:
(3.5)
$x^{(k+1)} := \arg\min_x L_{\rho_1,\rho_2}(x, y^{(k)}, z^{(k)}; v^{(k)}, w^{(k)})$,
$y^{(k+1)} := \arg\min_y L_{\rho_1,\rho_2}(x^{(k+1)}, y, z^{(k)}; v^{(k)}, w^{(k)})$,
$z^{(k+1)} := \arg\min_z L_{\rho_1,\rho_2}(x^{(k+1)}, y^{(k+1)}, z; v^{(k)}, w^{(k)})$,
$v^{(k+1)} := v^{(k)} + \rho_1(x^{(k+1)} - y^{(k+1)})$,
$w^{(k+1)} := w^{(k)} + \rho_2(x^{(k+1)} - z^{(k+1)})$.

The update for $x$ is a projection onto the affine space $\{x : Ax = b\}$:
$x^{(k+1)} = \arg\min_x \left\{\dfrac{\rho_1+\rho_2}{2}\|x - f^{(k)}\|_2^2 \ \text{s.t.}\ Ax = b\right\} = \big(I - A^T(AA^T)^{-1}A\big) f^{(k)} + A^T(AA^T)^{-1} b$,
where
(3.6) $f^{(k)} = \dfrac{\rho_1}{\rho_1+\rho_2}\left(y^{(k)} - \dfrac{1}{\rho_1}v^{(k)}\right) + \dfrac{\rho_2}{\rho_1+\rho_2}\left(z^{(k)} - \dfrac{1}{\rho_2}w^{(k)}\right)$.

As for the $y$-subproblem, let $c^{(k)} = \|z^{(k)}\|_1$ and $d^{(k)} = x^{(k+1)} + v^{(k)}/\rho_1$; the minimization subproblem reduces to
(3.7) $y^{(k+1)} = \arg\min_y \dfrac{c^{(k)}}{\|y\|_2} + \dfrac{\rho_1}{2}\|y - d^{(k)}\|_2^2$.
If $d^{(k)} = 0$, then any vector $y$ with $\|y\|_2 = \big(c^{(k)}/\rho_1\big)^{1/3}$ is a solution. If $c^{(k)} = 0$, then $y = d^{(k)}$ is the solution. Now we consider $d^{(k)} \ne 0$ and $c^{(k)} \ne 0$. Taking the derivative of the objective with respect to $y$, we obtain
$\left(-\dfrac{c^{(k)}}{\|y\|_2^3} + \rho_1\right) y = \rho_1 d^{(k)}$.
As a result, there exists a positive number $\tau^{(k)} \ge 1$ such that $y = \tau^{(k)} d^{(k)}$. Given $d^{(k)}$, denote $\eta^{(k)} = \|d^{(k)}\|_2$. For $\eta^{(k)} > 0$, finding $y$ becomes a one-dimensional search for the parameter $\tau^{(k)}$. In other words, if we let $D^{(k)} = \dfrac{c^{(k)}}{\rho_1 (\eta^{(k)})^3}$, then $\tau^{(k)}$ is a root of
$F(\tau) := \tau^3 - \tau^2 - D^{(k)} = 0$.
The cubic-root formula shows that $F(\tau) = 0$ has only one real root, which is given by the closed-form solution
(3.8) $\tau^{(k)} = \dfrac{1}{3} + \dfrac{1}{3}\left(C^{(k)} + \dfrac{1}{C^{(k)}}\right)$, with $C^{(k)} = \sqrt[3]{\dfrac{27 D^{(k)} + 2 + \sqrt{(27 D^{(k)} + 2)^2 - 4}}{2}}$.
In summary,
(3.9) $y^{(k+1)} = \begin{cases} e^{(k)} & d^{(k)} = 0, \\ \tau^{(k)} d^{(k)} & d^{(k)} \ne 0, \end{cases}$
where $e^{(k)}$ is a random vector with $L_2$ norm equal to $\big(c^{(k)}/\rho_1\big)^{1/3}$.

Finally, the ADMM update for $z$ is
(3.10) $z^{(k+1)} = \mathrm{shrink}\left(x^{(k+1)} + \dfrac{w^{(k)}}{\rho_2},\ \dfrac{1}{\rho_2\|y^{(k+1)}\|_2}\right)$,
where shrink is the soft shrinkage operator,
(3.11) $\mathrm{shrink}(v, \mu)_i = \mathrm{sign}(v_i)\max(|v_i| - \mu, 0), \quad i = 1, 2, \ldots, n$.
We summarize the ADMM algorithm for solving the $L_1/L_2$ minimization problem in Algorithm 3.1.
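The closed form (3.8) can be verified against the cubic $F(\tau) = \tau^3 - \tau^2 - D$. A minimal sketch of the $y$-update root (the function name is ours):

```python
import numpy as np

def tau_root(D):
    """Real root (> 1 for D > 0) of tau^3 - tau^2 - D = 0, via the closed form (3.8)."""
    a = 27.0 * D + 2.0
    # a >= 2 for D >= 0, so the discriminant a^2 - 4 is nonnegative.
    C = ((a + np.sqrt(a * a - 4.0)) / 2.0) ** (1.0 / 3.0)
    return (1.0 + C + 1.0 / C) / 3.0

# The y-update then scales d by tau, where D = c / (rho1 * ||d||^3).
residuals = [tau_root(D) ** 3 - tau_root(D) ** 2 - D
             for D in (1e-6, 0.1, 1.0, 50.0)]
```

Since $F(1) = -D < 0$ and $F$ is increasing for $\tau > 2/3$, the returned root indeed lies above 1, matching the claim $\tau^{(k)} \ge 1$ in the derivation.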
Algorithm 3.1 The $L_1/L_2$ minimization via ADMM.
Input: $A \in \mathbb{R}^{m\times n}$, $b \in \mathbb{R}^m$, maximum iteration number Max, and tolerance $\epsilon > 0$.
while $k <$ Max and $\|x^{(k)} - x^{(k-1)}\|_2/\|x^{(k)}\|_2 > \epsilon$ do
  $x^{(k+1)} = \big(I - A^T(AA^T)^{-1}A\big) f^{(k)} + A^T(AA^T)^{-1} b$
  $y^{(k+1)} = e^{(k)}$ if $d^{(k)} = 0$, and $\tau^{(k)} d^{(k)}$ otherwise
  $z^{(k+1)} = \mathrm{shrink}\big(x^{(k+1)} + w^{(k)}/\rho_2,\ 1/(\rho_2\|y^{(k+1)}\|_2)\big)$
  $v^{(k+1)} = v^{(k)} + \rho_1(x^{(k+1)} - y^{(k+1)})$
  $w^{(k+1)} = w^{(k)} + \rho_2(x^{(k+1)} - z^{(k+1)})$
  $k = k + 1$
end while
return $x^{(k)}$

Remark. We can pre-compute the matrix $I - A^T(AA^T)^{-1}A$ and the vector $A^T(AA^T)^{-1}b$ in Algorithm 3.1. The complexity of this pre-computation is $O(m^2 n)$, including the matrix-matrix multiplication and a Cholesky decomposition for solving the linear system. In each iteration, we need a matrix-vector multiplication for the $x$-subproblem, which is of order $O(n^2)$. In the $y$-subproblem, the root finding is a one-dimensional search whose cost is negligible. The $z$-subproblem is an element-wise shrinkage and only takes $O(n)$. In summary, the computational complexity per iteration is $O(n^2)$. Parallel computing can further speed this up, thanks to the separability of the $z$-subproblem.

3.2. $L_1/L_2$ with box constraint. The $L_1/L_2$ model has an intrinsic drawback: it tends to produce one erroneously large coefficient while suppressing the other nonzero elements, since this reduces the ratio. To compensate for this drawback, it is helpful to incorporate a box constraint when lower/upper bounds of the underlying signal are known a priori. Specifically, we propose
(3.12) $\min_{x\in\mathbb{R}^n}\left\{\dfrac{\|x\|_1}{\|x\|_2} + I_{\{0\}}(Ax-b)\ \middle|\ x \in [c,d]\right\}$,
which is referred to as $L_1/L_2$-box.
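Algorithm 3.1 above can be sketched compactly in code. The following is our own Python rendering under default parameters $\rho_1 = \rho_2 = 1$, initialized at the feasible least-norm point rather than the $L_1$ solution; it is a sketch, not the authors' implementation:

```python
import numpy as np

def l1l2_admm(A, b, rho1=1.0, rho2=1.0, max_iter=2000, eps=1e-8):
    """ADMM sketch for min ||x||_1/||x||_2 s.t. Ax = b (Algorithm 3.1)."""
    m, n = A.shape
    AAt_inv = np.linalg.inv(A @ A.T)
    P = np.eye(n) - A.T @ AAt_inv @ A        # projector onto ker(A)
    q = A.T @ AAt_inv @ b                    # particular solution, A q = b
    x = q.copy()                             # feasible initial guess
    y, z = x.copy(), x.copy()
    v, w = np.zeros(n), np.zeros(n)
    for _ in range(max_iter):
        x_old = x
        f = (rho1 * (y - v / rho1) + rho2 * (z - w / rho2)) / (rho1 + rho2)
        x = P @ f + q                        # exact projection onto Ax = b
        # y-update: y = tau * d with tau^3 - tau^2 = D, D = c/(rho1 * eta^3)
        c, d = np.linalg.norm(z, 1), x + v / rho1
        eta = np.linalg.norm(d)
        if eta > 0 and c > 0:
            D = c / (rho1 * eta**3)
            a = 27.0 * D + 2.0
            C = ((a + np.sqrt(a * a - 4.0)) / 2.0) ** (1.0 / 3.0)
            y = (1.0 + C + 1.0 / C) / 3.0 * d
        elif c == 0:
            y = d
        else:                                # d = 0: any direction works
            y = np.full(n, (c / rho1) ** (1.0 / 3.0) / np.sqrt(n))
        # z-update: soft shrinkage with threshold 1/(rho2 * ||y||_2)
        r = x + w / rho2
        ny = np.linalg.norm(y)
        mu = 1.0 / (rho2 * ny) if ny > 0 else np.inf
        z = np.sign(r) * np.maximum(np.abs(r) - mu, 0.0)
        v = v + rho1 * (x - y)
        w = w + rho2 * (x - z)
        if np.linalg.norm(x - x_old) <= eps * np.linalg.norm(x):
            break
    return x

rng = np.random.default_rng(5)
A = rng.standard_normal((10, 20))
x_true = np.zeros(20)
x_true[3], x_true[11] = 1.0, -0.5
b = A @ x_true
x_hat = l1l2_admm(A, b, max_iter=500)
```

Because the $x$-update is an exact projection, every iterate is feasible by construction, which is one practical appeal of this splitting.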
Similar to (3.3), we consider the following form that enforces the box constraint on the variable $z$:
(3.13) $\min_{x,y,z}\left\{\dfrac{\|z\|_1}{\|y\|_2} + I_{\{0\}}(Ax-b)\right\}$ s.t. $x = y$, $x = z$, $z \in [c,d]$.
The only change needed to adapt Algorithm 3.1 to $L_1/L_2$-box is the $z$ update. The $z$-subproblem in (3.5) with the box constraint is
(3.14) $\min_z \dfrac{\|z\|_1}{\|y^{(k+1)}\|_2} + \dfrac{\rho_2}{2}\left\|x^{(k+1)} - z + \dfrac{1}{\rho_2}w^{(k)}\right\|_2^2$ s.t. $z \in [c,d]$.
Since (3.14) is a convex problem involving the $L_1$ norm, it has a closed-form solution given by soft shrinkage followed by projection onto the interval $[c,d]$. In particular, simple calculations show that
(3.15) $z^{(k+1)}_i = \min\{\max(\hat z_i, c), d\}, \quad i = 1, 2, \ldots, n$,
where $\hat z = \mathrm{shrink}(r, \nu)$, $r = x^{(k+1)} + w^{(k)}/\rho_2$, and $\nu = 1/(\rho_2\|y^{(k+1)}\|_2)$. If the box constraint $[c,d]$ is symmetric, i.e., $c = -d$ with $d > 0$, it follows from [2] that the update for $z$ can be expressed as
(3.16) $z^{(k+1)}_i = \mathrm{sign}(r_i)\min\{\max(|r_i| - \nu, 0), d\}, \quad i = 1, 2, \ldots, n$.

Remark. The existing literature on ADMM convergence [17, 19, 24, 33, 40, 41, 42] requires the objective to contain one separable function whose gradient is Lipschitz continuous. Obviously, $L_1/L_2$ does not satisfy this assumption, with or without the box constraint. Therefore, a theoretical convergence analysis is difficult. Instead, we show convergence empirically in Section 4 by plotting residual errors and objective functions, which gives strong support for theoretical analysis in the future.

3.3. $L_1/L_2$ on the gradient. We adapt the $L_1/L_2$ model to apply to the gradient, which enables us to deal with imaging applications. Let $u \in \mathbb{R}^{n\times m}$ be an underlying image of size $n \times m$. Denote by $A$ a linear operator that models a certain degradation process to obtain the measured data $f$. For example, $A$ can be a subsampling operator in the frequency domain, and recovering $u$ from $f$ is called MRI reconstruction. In short, the proposed gradient model is given by
(3.17) $\min_{u\in\mathbb{R}^{n\times m}} \dfrac{\|\nabla u\|_1}{\|\nabla u\|_2}$ s.t. $Au = f$, $u \in [0,1]$,
where $\nabla$ denotes the discrete gradient operator $\nabla u := \big([u_{ij} - u_{(i+1)j}],\ [u_{ij} - u_{i(j+1)}]\big)$ with periodic boundary conditions; hence the model is referred to as $L_1/L_2$-grad. The box constraint $0 \le u \le 1$ is natural for image intensities. We introduce auxiliary variables $d$, $h$, and $v$, leading to an equivalent problem,
(3.18) $\min_{u, d, h, v} \dfrac{\|d\|_1}{\|h\|_2}$ s.t. $Au = f$, $d = \nabla u$, $h = \nabla u$, $u = v$, $0 \le v \le 1$.
Note that we denote $d$ and $h$ in bold to indicate that they each have two components, corresponding to the horizontal and vertical derivatives.
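The box-constrained $z$-update (3.15) is soft shrinkage followed by clipping. A minimal sketch (function names are ours):

```python
import numpy as np

def shrink(v, mu):
    """Soft shrinkage (3.11): sign(v_i) * max(|v_i| - mu, 0)."""
    return np.sign(v) * np.maximum(np.abs(v) - mu, 0.0)

def z_update_box(r, nu, c, d):
    """Box-constrained z-update (3.15): shrink, then project onto [c, d]."""
    return np.clip(shrink(r, nu), c, d)

r = np.array([-3.0, -0.2, 0.0, 0.5, 4.0])
z = z_update_box(r, nu=0.4, c=-1.0, d=1.0)
# shrink gives [-2.6, 0.0, 0.0, 0.1, 3.6]; clipping to [-1, 1]
# then yields [-1.0, 0.0, 0.0, 0.1, 1.0].
```

For the symmetric box $[-d, d]$, this coincides with the single-pass formula (3.16), since shrinkage preserves signs.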
The augmented Lagrangian is expressed as
(3.19) $L(u, d, h, v; w, b, g, e) = \dfrac{\|d\|_1}{\|h\|_2} + \dfrac{\lambda}{2}\|Au - f - w\|_2^2 + \dfrac{\rho_1}{2}\|d - \nabla u - b\|_2^2 + \dfrac{\rho_2}{2}\|h - \nabla u - g\|_2^2 + \dfrac{\rho_3}{2}\|v - u - e\|_2^2 + I_{[0,1]}(v)$,
where $w, b, g, e$ are dual variables and $\lambda, \rho_1, \rho_2, \rho_3$ are positive parameters. The updates for $d$ and $h$ are the same as in Algorithm 3.1. Specifically for $h$, we consider $D^{(k)} = \dfrac{\|d^{(k)}\|_1}{\rho_2\|\nabla u^{(k+1)} + g^{(k)}\|_2^3}$, and hence $\tau^{(k)}$ is the root of the same cubic polynomial as in (3.8). By taking the derivative of (3.19) with respect to $u$, we obtain the $u$-update, i.e.,
(3.20) $u^{(k+1)} = \big(\lambda A^T A - (\rho_1 + \rho_2)\Delta + \rho_3 I\big)^{-1}\Big(\lambda A^T(f + w^{(k)}) + \rho_1 \nabla^T(d^{(k)} - b^{(k)}) + \rho_2 \nabla^T(h^{(k)} - g^{(k)}) + \rho_3(v^{(k)} - e^{(k)})\Big)$.
Note that for certain operators $A$, the inverse in the $u$-update (3.20) can be computed efficiently via the fast Fourier transform (FFT). The $v$-subproblem is a projection onto the interval $[0,1]$:
(3.21) $v^{(k+1)}_{ij} = \min\big\{\max(u^{(k+1)}_{ij} + e^{(k)}_{ij},\ 0),\ 1\big\}, \quad i = 1, \ldots, n,\ j = 1, \ldots, m$.
In summary, we present the ADMM algorithm for the $L_1/L_2$-grad model in Algorithm 3.2.
Algorithm 3.2 The $L_1/L_2$-grad minimization via ADMM.
Input: $f$, $A$, maximum iteration number Max, and tolerance $\epsilon > 0$.
while $k <$ Max and $\|u^{(k)} - u^{(k-1)}\|_2/\|u^{(k)}\|_2 > \epsilon$ do
  Solve $u^{(k+1)}$ via (3.20)
  Solve $v^{(k+1)}$ via (3.21)
  $h^{(k+1)} = $ a random vector (cf. (3.9)) if $\nabla u^{(k+1)} + g^{(k)} = 0$, and $\tau^{(k)}\big(\nabla u^{(k+1)} + g^{(k)}\big)$ otherwise
  $d^{(k+1)} = \mathrm{shrink}\big(\nabla u^{(k+1)} + b^{(k)},\ 1/(\rho_1\|h^{(k+1)}\|_2)\big)$
  $b^{(k+1)} = b^{(k)} + \nabla u^{(k+1)} - d^{(k+1)}$
  $g^{(k+1)} = g^{(k)} + \nabla u^{(k+1)} - h^{(k+1)}$
  $w^{(k+1)} = w^{(k)} + f - Au^{(k+1)}$
  $e^{(k+1)} = e^{(k)} + u^{(k+1)} - v^{(k+1)}$
  $k = k + 1$
end while
return $u^{(k)}$
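The $u$-update (3.20) is a linear solve whose operator is diagonalized by the FFT when $A$ samples in the Fourier domain; the key fact is that the periodic operator $\nabla^T\nabla$ (the negative Laplacian) is a convolution. A sketch verifying this diagonalization for the periodic forward differences used in (3.17):

```python
import numpy as np

def grad(u):
    """Periodic forward differences: (u_ij - u_(i+1)j, u_ij - u_i(j+1))."""
    return u - np.roll(u, -1, axis=0), u - np.roll(u, -1, axis=1)

def grad_T_grad(u):
    """Apply nabla^T nabla: the adjoint of (I - S) is (I - S^{-1})."""
    dx, dy = grad(u)
    return (dx - np.roll(dx, 1, axis=0)) + (dy - np.roll(dy, 1, axis=1))

n, m = 16, 12
rng = np.random.default_rng(2)
u = rng.standard_normal((n, m))

# Fourier symbol of nabla^T nabla: 4 sin^2(pi k / n) + 4 sin^2(pi l / m),
# since |1 - exp(2 pi i k / n)|^2 = 4 sin^2(pi k / n).
k = np.arange(n)[:, None]
l = np.arange(m)[None, :]
symbol = 4 * np.sin(np.pi * k / n) ** 2 + 4 * np.sin(np.pi * l / m) ** 2
via_fft = np.fft.ifft2(symbol * np.fft.fft2(u)).real
```

In the same way, the full operator $\lambda A^T A - (\rho_1+\rho_2)\Delta + \rho_3 I$ becomes a pointwise division in the Fourier domain when $A^T A$ is diagonalized by the FFT, which is why (3.20) is cheap for MRI-type sampling operators.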
4. Numerical experiments.
In this section, we carry out a series of numerical tests to demonstrate the performance of the proposed $L_1/L_2$ models together with the corresponding algorithms. All the numerical experiments are conducted on a standard desktop with an Intel i7-6700 CPU (3.4 GHz) and MATLAB.

We consider two types of sensing matrices: one is the oversampled discrete cosine transform (DCT) and the other is the Gaussian matrix. Specifically for the oversampled DCT, we follow the works of [14, 26, 45] to define $A = [a_1, a_2, \ldots, a_n] \in \mathbb{R}^{m\times n}$ with
(4.1) $a_j := \dfrac{1}{\sqrt{m}} \cos\left(\dfrac{2\pi w j}{F}\right), \quad j = 1, \ldots, n$,
where $w$ is a random vector uniformly distributed in $[0,1]^m$ and $F \in \mathbb{R}^+$ controls the coherence, in that a larger value of $F$ yields a more coherent matrix. In addition, we use $\mathcal{N}(0, \Sigma)$ (the multivariate normal distribution) to generate the Gaussian matrix, where the covariance matrix is $\Sigma = \{(1-r)\,\mathbb{1}(i=j) + r\}_{i,j}$ with a positive parameter $r$. This type of matrix is used in the TL1 paper [47], which mentioned that a larger $r$ value indicates a more difficult sparse recovery problem. Throughout the experiments, we consider sensing matrices with $m = 64$ rows. The ground truth $x \in \mathbb{R}^n$ is simulated as an $s$-sparse signal, where $s$ is the total number of nonzero entries. The support of $x$ is a random index set, and the values of the nonzero elements follow the standard normal distribution, i.e., $(x_s)_i \sim \mathcal{N}(0,1)$, $i = 1, 2, \ldots, s$. We then normalize the ground-truth signal to have maximum magnitude 1 so that we can examine the performance of the additional $[-1, 1]$ box constraint.

Due to the non-convex nature of the proposed $L_1/L_2$ model, the initial guess $x^{(0)}$ is very important and should be well chosen. A typical choice, used here, is the $L_1$ solution of (1.2). We adopt the commercial optimization software Gurobi [32] to minimize the $L_1$ norm via linear programming for the sake of efficiency. The stopping criterion is that the relative error of $x^{(k)}$ to $x^{(k-1)}$ falls below $10^{-8}$ or the iteration number exceeds $10n$.

We empirically demonstrate the convergence of the proposed ADMM algorithms in Figure 2. Specifically, we examine the $L_1/L_2$ minimization problem (3.1), where the sensing matrix is an oversampled DCT matrix with $F = 10$ and the ground-truth sparse vector has 12 nonzero elements. We also study MRI reconstruction from 7 radial lines as a particular sparse-gradient problem, which involves the $L_1/L_2$-grad minimization of (3.17) by Algorithm 3.2. There are two auxiliary variables $y$ and $z$ in $L_1/L_2$ such that $x = y = z$, while the two auxiliary variables $d$, $h$ in $L_1/L_2$-grad satisfy $\nabla u = d = h$. The top row of Figure 2 shows the values of $\|x^{(k)} - y^{(k)}\|_2$ and $\|x^{(k)} - z^{(k)}\|_2$ as well as $\|\nabla u^{(k)} - d^{(k)}\|_2$ and $\|\nabla u^{(k)} - h^{(k)}\|_2$, all plotted with respect to the iteration counter $k$. The bottom row of Figure 2 shows the objective functions, i.e., $\|x^{(k)}\|_1/\|x^{(k)}\|_2$ and $\|\nabla u^{(k)}\|_1/\|\nabla u^{(k)}\|_2$ for $L_1/L_2$ and $L_1/L_2$-grad, respectively. All the plots in Figure 2 decrease rapidly with respect to the iteration counter, which serves as heuristic evidence of algorithmic convergence.
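The oversampled DCT construction (4.1) and its coherence behavior can be sketched as follows; the helper names are ours, and the point is the trend in $F$, not the exact values:

```python
import numpy as np

def oversampled_dct(m, n, F, rng):
    """Columns a_j = cos(2*pi*w*j/F)/sqrt(m), w ~ U[0,1]^m, as in (4.1)."""
    w = rng.uniform(0.0, 1.0, size=(m, 1))
    j = np.arange(1, n + 1)[None, :]
    return np.cos(2.0 * np.pi * w * j / F) / np.sqrt(m)

def coherence(A):
    G = A / np.linalg.norm(A, axis=0, keepdims=True)
    gram = np.abs(G.T @ G)
    np.fill_diagonal(gram, 0.0)
    return gram.max()

rng = np.random.default_rng(3)
mu_low = coherence(oversampled_dct(64, 256, F=1, rng=rng))
mu_high = coherence(oversampled_dct(64, 256, F=20, rng=rng))
# A larger F compresses the column arguments, so neighboring columns become
# nearly parallel and the coherence approaches 1.
```

This matches the statement above that a larger $F$ yields a more coherent matrix.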
On the other hand, the objective functions in Figure 2 look oscillatory. This phenomenon hints at the difficulty of proving convergence theoretically, as one key step in a convergence proof requires showing that the objective function decreases monotonically [3, 42].

We now compare the proposed $L_1/L_2$ approach with other sparse recovery models: $L_1$, $L_p$ [9], $L_1-L_2$ [45, 26], and TL1 [47]. We choose $p = 0.5$ for $L_p$ and $a = 1$ for TL1. The initial guess for all the algorithms is the solution of the $L_1$ model. Both $L_1-L_2$ and TL1 are solved via the DCA, with the same stopping criterion as $L_1/L_2$, i.e., $\|x^{(k)} - x^{(k-1)}\|_2/\|x^{(k)}\|_2 \le 10^{-8}$. As for $L_p$, we follow the default settings in [9].

We evaluate the performance of sparse recovery in terms of the success rate, defined as the number of successful trials over the total number of trials. A success is declared if the relative error of the reconstructed solution $x^*$ to the ground truth $x$ is sufficiently small, i.e., $\|x^* - x\|_2/\|x\|_2 \le 10^{-3}$. We further categorize the failures of not recovering the ground truth into model failure and algorithm failure. In particular, we compare the objective function $F(\cdot)$ at the ground truth $x$ and at the reconstructed solution $x^*$. If $F(x) > F(x^*)$, then $x$ is not a global minimizer of the model, in which case we call it a model failure. On the other hand, $F(x) < F(x^*)$ implies that the algorithm does not reach a global minimizer, which is referred to as an algorithm failure. Although this type of analysis is not deterministic, it sheds some light on which direction to improve: the model or the algorithm. For example, it was reported in [30] that $L_1$ has the highest model-failure rates, which justifies the need for nonconvex models.

Fig. 2. Plots of residual errors and objective functions, empirically demonstrating the convergence of the proposed algorithms: $L_1/L_2$ in signal processing and $L_1/L_2$-grad with a box constraint for MRI reconstruction. Panels: (a) residual errors in $L_1/L_2$; (b) residual errors in $L_1/L_2$-grad; (c) objective functions of $L_1/L_2$; (d) objective functions of $L_1/L_2$-grad.

In Figure 3, we examine two coherence levels: $F = 5$ corresponds to relatively low coherence and $F = 20$ to higher coherence. The success rates of the various models reveal that $L_1/L_2$-box performs the best at $F = 5$ and is comparable to $L_1-L_2$ for the highly coherent case of $F = 20$. We also look at Gaussian matrices at two correlation levels $r$ (a smaller and a larger value). The $L_p$ model gives the best results for the Gaussian case, which is consistent with the literature [44, 26]. The proposed $L_1/L_2$-box model is the second best.

Fig. 3. Success rates, model failures, and algorithm failures for 6 algorithms ($L_1$, TL1, $L_p$, $L_1-L_2$, $L_1/L_2$, $L_1/L_2$-box) in the case of oversampled DCT matrices.
Comparing $L_1/L_2$ with and without the box constraint among the plots of success rates and model failures, we can draw the conclusion that the box constraint mitigates the inherent drawback of the $L_1/L_2$ model, thus improving the recovery rates. In addition, $L_1/L_2$ is the second lowest in terms of model-failure rates, while adding the box constraint increases the occurrence of algorithm failure compared to the version without the box. These two observations suggest a need to further improve the algorithms for minimizing $L_1/L_2$.

Finally, we provide the computation time for all the competing algorithms in Table 1, with the shortest time in each case highlighted in bold. The time for the $L_1$ method is not included, as all the other methods use the $L_1$ solution as the initial guess. TL1 is the fastest at relatively low sparsity levels, and the proposed $L_1/L_2$-box is the most efficient at higher sparsity levels. The computational times of all these methods are consistent across the DCT and Gaussian matrices. Table 1
Computation time (sec.) of 5 algorithms (TL1, $L_p$, $L_1-L_2$, $L_1/L_2$, $L_1/L_2$-box) at sparsity levels 2, 6, 10, 14, 18, and 22, together with the mean, for (a) DCT matrices ($F = 5$ and $F = 10$) and (b) Gaussian matrices (two values of $r$).

As a proof-of-concept example, we study an MRI reconstruction problem [28] to compare the performance of $L_1$, $L_1-L_2$, and $L_1/L_2$ on the gradient. The $L_1$ norm on the gradient corresponds to the TV model [15], and the $L_1-L_2$ difference on the gradient was recently proposed in [27]. We use the standard Shepp-Logan phantom as a test image, shown in Figure 5(a). The MRI measurements are obtained along several radial lines in the frequency domain (i.e., after taking the Fourier transform); an example of such a sampling scheme using 6 lines is shown in Figure 5(b). As this paper focuses on the constrained formulation, we do not consider noise, following the same setting as in the previous works [45, 27]. Since all the competing methods ($L_1$, weighted $L_1-\alpha L_2$, and $L_1/L_2$, all on the gradient) yield an exact recovery from 8 radial lines, we present the reconstruction results from 6 radial lines in Figure 5, which illustrates that the ratio model ($L_1/L_2$) gives much better results than the difference model. Figure 5 also reports a quantitative measure of performance, the relative error (RE) between the reconstructed and ground-truth images, which shows a significant improvement of the proposed $L_1/L_2$-grad over the classic filtered back projection (FBP) method and the two recent works using $L_1$ [15] and weighted $L_1-\alpha L_2$ [27] on the gradient. Note that state-of-the-art methods in MRI reconstruction [18, 30] have reported exact recovery from 7 radial lines.

Fig. 4. Success rates, model failures, and algorithm failures for 6 algorithms in the Gaussian matrix case.
5. Empirical validations.
A review article [7] indicated that two principles in CS are sparsity and incoherence, leaving the impression that a sensing matrix with smaller coherence makes sparse recovery easier. However, we observe through numerical results [25] (also given in Figure 6b) that a more coherent matrix gives higher recovery rates. This contradiction motivates us to collect empirical evidence in order to either support or refute the relevance of coherence to sparse recovery. Here we examine one piece of such evidence by minimizing the ratio of the $L_1$ and $L_\infty$ norms, which gives an upper bound for a sufficient condition of $L_1$ exact recovery; see (1.4). To avoid the trivial solution $\mathbf x = \mathbf 0$ of the problem $\min_{\mathbf x}\big\{\|\mathbf x\|_1/\|\mathbf x\|_\infty : A\mathbf x = \mathbf 0\big\}$, we incorporate a sum-to-one constraint. In other words, we define an expanded matrix $\tilde A = [A;\ \mathrm{ones}(1,n)]$ and an expanded vector $\tilde{\mathbf b} = [\mathbf 0;\ 1]$. We then adapt the proposed method to solve $\min_{\mathbf x}\big\{\|\mathbf x\|_1/\|\mathbf x\|_\infty : \tilde A\mathbf x = \tilde{\mathbf b}\big\}$. In Figure 6a, we plot the mean value of the ratios from 50 random realizations of matrices $A$ at each coherence level (controlled by $F$), which shows that the ratio actually decreases with respect to $F$. As the $L_0$ norm is bounded by the ratio (1.4), a smaller ratio indicates that it is more difficult to recover the signals. Therefore, Figure 6a is consistent with the common belief in CS. We postulate that an underlying reason for more coherent matrices giving better results is minimum separation (MS), as formally introduced in [5]. In Figure 6b, we enforce the minimum separation of two neighboring spikes to be 40, following the suggestion of $2F$ in [14] (we consider $F$ up to 20). In comparison, we also give the success rates of the $L_1$ recovery without any restriction on MS in Figure 6c. Note that we use exactly the same matrices in both cases (with and without MS).
Figure 6c does not show a clear pattern regarding how coherence affects the exact recovery, which supports our hypothesis that minimum separation plays an important role in sparse recovery. It will be our future work to analyze it thoroughly.
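The sum-to-one augmentation described above can be sketched in a few lines of Python (with an arbitrary Gaussian matrix as a stand-in for the sensing matrix): appending a row of ones to $A$ and a 1 to the right-hand side forces every feasible point to lie in $\ker(A)$ and to sum to one, which rules out $\mathbf x = \mathbf 0$.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 10, 40
A = rng.standard_normal((m, n))   # stand-in sensing matrix (illustrative)
b = np.zeros(m)                   # homogeneous system A x = 0

# expanded system: A~ = [A; ones(1, n)], b~ = [0; 1]
A_tilde = np.vstack([A, np.ones((1, n))])
b_tilde = np.append(b, 1.0)

# any feasible point of A~ x = b~, e.g. the least-squares solution,
# satisfies A x = 0 and sum(x) = 1, so x = 0 is infeasible
x = np.linalg.lstsq(A_tilde, b_tilde, rcond=None)[0]
print(A_tilde.shape, round(x.sum(), 6))
```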
6. Conclusions and future works.
In this paper, we have studied a novel $L_1/L_2$ minimization to promote sparsity. Two main benefits of $L_1/L_2$ are that it is scale invariant and parameter free. Two numerical algorithms based on the ADMM are formulated under the assumptions of sparse signals and sparse gradients, together with a variant incorporating an additional box constraint. The experimental results demonstrate the performance of the proposed approaches in comparison to the state-of-the-art methods in sparse recovery and MRI reconstruction. As a by-product, minimizing the ratio also gives an empirical upper bound towards $L_1$'s exact recovery, which motivates further investigations on exact recovery theories. Other future works include algorithmic improvements and convergence analysis. In particular, it is shown in Table 1 and Figures 3 and 4 that $L_1/L_2$ is not as fast as competing methods in CS and also has certain algorithmic failures, which calls for a more robust and more efficient algorithm. In addition, we have provided heuristic evidence of the ADMM's convergence in Figure 2, and it will be interesting to analyze it theoretically.

(Footnote to Section 5: We also observe that the ratio stagnates for larger $F$, which is probably because of the instability of the proposed method when the matrix becomes more coherent.)

Fig. 5. MRI reconstruction results from 6 radial lines in the frequency domain (2.57% measurements): (a) original, (b) sampling mask, (c) FBP (RE = 99.80%), (d) $L_1$ (RE = 39.42%), (e) $L_1$-$0.5L_2$ (RE = 38.43%), (f) $L_1/L_2$ (RE = 0.04%). The relative error (RE) is provided for each method.

Acknowledgements.
We would like to thank the editor and two reviewers for careful andthoughtful comments, which helped us greatly improve our paper. We also acknowledge the helpof Dr. Min Tao from Nanjing University, who suggested the reference on the ADMM convergence.
Fig. 6. The use of $\min_{\mathbf x}\{\|\mathbf x\|_1/\|\mathbf x\|_\infty : A\mathbf x = \mathbf 0\}$ as an upper bound for the $L_1$ recovery. (a) plots the mean of the ratios over 50 realizations, with the standard deviation indicated as vertical bars. (b) and (c) are success rates of $L_1$ recovery with and without minimum separation.

Appendix: proof of Theorem 2.2.
In order to prove Theorem 2.2, we study the function
(6.1) $g(t) = \dfrac{\|\mathbf x + t\mathbf v\|_1^2}{\|\mathbf x + t\mathbf v\|_2^2},$
where $\mathbf x \neq \mathbf 0$ and
(6.2) $\mathbf v \in \ker(A)\setminus\{\mathbf 0\}$ with $\|\mathbf v\|_2 = 1$.
Since both norms are nonnegative, $\mathbf x$ is a local minimizer of the ratio $\|\cdot\|_1/\|\cdot\|_2$ along the line $\mathbf x + t\mathbf v$ if and only if $g$ has a local minimum at $t = 0$. Notice that the denominator of the function $g$ is non-zero for all $t \in \mathbb R$. Otherwise, we would have $\mathbf x + t\mathbf v = \mathbf 0$ and hence $A\mathbf x + tA\mathbf v = \mathbf 0$. Since $A\mathbf x = \mathbf b$ and $A\mathbf v = \mathbf 0$, we would get $\mathbf b = \mathbf 0$, which is a contradiction, as we assume that $\mathbf b \neq \mathbf 0$ so that $\mathbf x = \mathbf 0$ is not a solution to $A\mathbf x = \mathbf b$. Therefore, the function $g$ is continuous everywhere. Next, we introduce the following lemma to analyze the $L_1$ term in the numerator of $g$.
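Before proceeding to the lemmas, the setup can be illustrated numerically (with a hypothetical random matrix and a 2-sparse signal, not data from the experiments above): along a null-space direction $\mathbf v$, the vector $\mathbf x + t\mathbf v$ never vanishes, so the objective ratio is well defined for every $t$.

```python
import numpy as np

rng = np.random.default_rng(3)
m, n = 10, 30
A = rng.standard_normal((m, n))        # illustrative sensing matrix
x = np.zeros(n)
x[[3, 17]] = [2.0, -1.5]               # a 2-sparse signal, so b = A x != 0

v = np.linalg.svd(A)[2][-1]            # an l2-normalized element of ker(A)
assert np.allclose(A @ v, 0, atol=1e-10)

def ratio(t):
    # objective ||x + t v||_1 / ||x + t v||_2 along the null-space line
    y = x + t * v
    return np.abs(y).sum() / np.linalg.norm(y)

ts = np.linspace(-5.0, 5.0, 1001)
vals = [ratio(t) for t in ts]
# x + t v never equals zero (A x != 0 while A v = 0), so the ratio is finite
assert all(np.isfinite(vals))
print(ratio(0.0))  # equals ||x||_1 / ||x||_2 = 3.5 / 2.5 = 1.4
```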
Lemma 6.1. For any $\mathbf x \in \mathbb R^n$ and $\mathbf v \in \mathbb R^n$ satisfying (6.2), denote $S$ as the support of $\mathbf x$ and $t_1 := \min_{i\in S}|x_i|$. We have
(6.3) $\|\mathbf x + t\mathbf v\|_1 = \|\mathbf x\|_1 + t\,\sigma_t(\mathbf v) > 0, \quad \forall\,|t| < t_1,$
where
(6.4) $\sigma_t(\mathbf v) = \sum_{i\in S} v_i\,\mathrm{sign}(x_i) + \mathrm{sign}(t)\,\|\mathbf v_{\bar S}\|_1.$
Proof.
Since $\mathbf x + t\mathbf v \neq \mathbf 0$ for all $t \in \mathbb R$, we have $\|\mathbf x + t\mathbf v\|_1 > 0$. It follows from (6.2) that $|v_i| \le 1$ for all $i$. Then we get $\mathrm{sign}(x_i + t v_i) = \mathrm{sign}(x_i)$ for all $i \in S$, as $|t v_i| < |x_i|$ for $|t| < t_1$. Therefore, we have
\begin{align*}
\|\mathbf x + t\mathbf v\|_1 &= \sum_{i\in S} |x_i + t v_i| + \sum_{i\notin S} |t|\,|v_i| \\
&= \sum_{i\in S} (x_i + t v_i)\,\mathrm{sign}(x_i) + |t|\,\|\mathbf v_{\bar S}\|_1 \\
&= \sum_{i\in S} x_i\,\mathrm{sign}(x_i) + t \sum_{i\in S} v_i\,\mathrm{sign}(x_i) + |t|\,\|\mathbf v_{\bar S}\|_1 \\
&= \|\mathbf x\|_1 + t\Big(\sum_{i\in S} v_i\,\mathrm{sign}(x_i) + \mathrm{sign}(t)\,\|\mathbf v_{\bar S}\|_1\Big),
\end{align*}
which implies (6.3), and hence Lemma 6.1 holds.

Notice that $\sigma_t(\mathbf v)$ relies only on the sign of $t$; i.e., it is constant for $t > 0$ and for $t < 0$. Therefore, $g(t)$ is differentiable on $0 < t < t_1$ and on $-t_1 < t < 0$ (for $|t| \ge t_1$ and $t \neq 0$, $g$ is not differentiable at the points where $x_i + t v_i = 0$). Some simple calculations lead to the derivative of $g$ for $0 < t < t_1$ and $-t_1 < t < 0$:
\begin{align*}
g'(t) &= \frac{d}{dt}\left(\frac{\big(\|\mathbf x\|_1 + t\,\sigma_t(\mathbf v)\big)^2}{\|\mathbf x\|_2^2 + 2t\langle\mathbf v_S,\mathbf x\rangle + t^2\|\mathbf v\|_2^2}\right) \\
&= \frac{2\sigma_t(\mathbf v)\big(\|\mathbf x\|_1 + t\,\sigma_t(\mathbf v)\big)\big(\|\mathbf x\|_2^2 + 2t\langle\mathbf v_S,\mathbf x\rangle + t^2\|\mathbf v\|_2^2\big) - \big(2\langle\mathbf v_S,\mathbf x\rangle + 2t\|\mathbf v\|_2^2\big)\big(\|\mathbf x\|_1 + t\,\sigma_t(\mathbf v)\big)^2}{\big(\|\mathbf x\|_2^2 + 2t\langle\mathbf v_S,\mathbf x\rangle + t^2\|\mathbf v\|_2^2\big)^2} \\
&= \frac{2\big(\|\mathbf x\|_1 + t\,\sigma_t(\mathbf v)\big)\Big[\sigma_t(\mathbf v)\big(\|\mathbf x\|_2^2 + 2t\langle\mathbf v_S,\mathbf x\rangle + t^2\|\mathbf v\|_2^2\big) - \big(\langle\mathbf v_S,\mathbf x\rangle + t\|\mathbf v\|_2^2\big)\big(\|\mathbf x\|_1 + t\,\sigma_t(\mathbf v)\big)\Big]}{\big(\|\mathbf x\|_2^2 + 2t\langle\mathbf v_S,\mathbf x\rangle + t^2\|\mathbf v\|_2^2\big)^2} \\
&= \frac{2\big(\|\mathbf x\|_1 + t\,\sigma_t(\mathbf v)\big)\Big[\big(\sigma_t(\mathbf v)\|\mathbf x\|_2^2 - \langle\mathbf v_S,\mathbf x\rangle\|\mathbf x\|_1\big) + \big(\sigma_t(\mathbf v)\langle\mathbf v_S,\mathbf x\rangle - \|\mathbf x\|_1\|\mathbf v\|_2^2\big)t\Big]}{\big(\|\mathbf x\|_2^2 + 2t\langle\mathbf v_S,\mathbf x\rangle + t^2\|\mathbf v\|_2^2\big)^2}. \tag{6.5}
\end{align*}
It follows from Lemma 6.1 that the first factor in the numerator of (6.5) is strictly positive, i.e., $\|\mathbf x\|_1 + t\,\sigma_t(\mathbf v) > 0$. Therefore, the sign of $g'$ depends on the second factor in the numerator. We further introduce two lemmas (Lemma 6.2 and Lemma 6.3) to study this factor.
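Lemma 6.1 is easy to verify numerically. The following sketch (random data for illustration) checks the identity $\|\mathbf x + t\mathbf v\|_1 = \|\mathbf x\|_1 + t\,\sigma_t(\mathbf v)$ for several $|t| < t_1 = \min_{i\in S}|x_i|$:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 8
S = [1, 4, 6]                              # support of x
x = np.zeros(n)
x[S] = rng.uniform(1.0, 2.0, len(S)) * rng.choice([-1, 1], len(S))
v = rng.uniform(-1.0, 1.0, n)              # |v_i| <= 1, as in the lemma
t1 = np.abs(x[S]).min()                    # t_1 = min_{i in S} |x_i|

def sigma(t):
    # sigma_t(v) = sum_{i in S} v_i sign(x_i) + sign(t) * ||v_{S^c}||_1
    Sc = [i for i in range(n) if i not in S]
    return v[S] @ np.sign(x[S]) + np.sign(t) * np.abs(v[Sc]).sum()

for t in np.linspace(-0.9 * t1, 0.9 * t1, 7):
    lhs = np.abs(x + t * v).sum()          # ||x + t v||_1
    rhs = np.abs(x).sum() + t * sigma(t)   # ||x||_1 + t * sigma_t(v)
    assert np.isclose(lhs, rhs)
```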
For any x , v ∈ R n and i ∈ [ n ] , we have n (cid:107) x (cid:107) − | x i |(cid:107) x (cid:107) ≥ ( n − (cid:88) j (cid:54) = i x j , (6.6) n (cid:107) v (cid:107) (cid:107) x (cid:107) ≥ (cid:107) x (cid:107) | (cid:104) v , x (cid:105) | . (6.7) Furthermore, if (cid:107) x (cid:107) = s , then the constant n in the inequalities can be reduced to s .Proof. Simple calculations show that n (cid:107) x (cid:107) − | x i |(cid:107) x (cid:107) = n (cid:88) j x j − | x i | (cid:88) j | x j | = ( n − (cid:88) j (cid:54) = i x j + (cid:88) j (cid:54) = i x j + ( n − x i − (cid:88) j (cid:54) = i | x i || x j | = ( n − (cid:88) j (cid:54) = i x j + (cid:88) j (cid:54) = i (cid:0) ( | x i | − | x j | ) + | x i || x j | (cid:1) ≥ ( n − (cid:88) j (cid:54) = i x j ≥ . (6.8)Therefore, we have (cid:80) i (cid:0) n (cid:107) x (cid:107) − | x i |(cid:107) x (cid:107) (cid:1) | v i | ≥ , which implies that(6.9) n (cid:107) v (cid:107) (cid:107) x (cid:107) ≥ (cid:107) x (cid:107) (cid:32)(cid:88) i | x i || v i | (cid:33) ≥ (cid:107) x (cid:107) | (cid:104) v , x (cid:105) | . Similarly, we can reduce the constant n to s , if we know (cid:107) x (cid:107) = s . Lemma
Lemma 6.3. Suppose that an $s$-sparse vector $\mathbf x$ satisfies $A\mathbf x = \mathbf b$ ($\mathbf b \neq \mathbf 0$) with its support on an index set $S$, and that the matrix $A$ satisfies the sNSP of order $s$. Define
(6.10) $t_2 := \inf_{\mathbf v, t}\left\{ \frac{\big|\sigma_t(\mathbf v)\|\mathbf x\|_2^2 - \langle\mathbf v_S,\mathbf x\rangle\|\mathbf x\|_1\big|}{\big|\sigma_t(\mathbf v)\langle\mathbf v_S,\mathbf x\rangle - \|\mathbf x\|_1\|\mathbf v\|_2^2\big|} \;\Big|\; \mathbf v \in \ker(A),\ \|\mathbf v\|_2 = 1,\ t \neq 0 \right\},$
where $\sigma_t(\mathbf v)$ is defined in (6.4). Then $t_2 > 0$.
Proof.
For any $\mathbf v \in \ker(A)$ with $\|\mathbf v\|_2 = 1$, it is straightforward that
(6.11) $\big|\sigma_t(\mathbf v)\langle\mathbf v_S,\mathbf x\rangle - \|\mathbf x\|_1\|\mathbf v\|_2^2\big| \le |\sigma_t(\mathbf v)|\,\|\mathbf v\|_\infty\|\mathbf x\|_1 + \|\mathbf x\|_1\|\mathbf v\|_2^2 \le \|\mathbf v\|_1\|\mathbf v\|_\infty\|\mathbf x\|_1 + \|\mathbf x\|_1\|\mathbf v\|_2^2 \le \|\mathbf v\|_1\|\mathbf x\|_1 + \|\mathbf x\|_1 \le \sqrt n\,\|\mathbf x\|_1 + \|\mathbf x\|_1,$
and
(6.12) $|\sigma_t(\mathbf v)| \ge \big|\mathrm{sign}(t)\big|\,\|\mathbf v_{\bar S}\|_1 - \Big|\sum_{i\in S} v_i\,\mathrm{sign}(x_i)\Big| \ge \|\mathbf v_{\bar S}\|_1 - \sum_{i\in S}|v_i| = \|\mathbf v_{\bar S}\|_1 - \|\mathbf v_S\|_1.$
It follows from the sNSP that $\|\mathbf v_{\bar S}\|_1 \ge (s+1)\|\mathbf v_S\|_1$, thus leading to the following two inequalities:
(6.13) $|\sigma_t(\mathbf v)| \ge \|\mathbf v_{\bar S}\|_1 - \|\mathbf v_S\|_1 \ge s\|\mathbf v_S\|_1 \quad\text{and}\quad |\sigma_t(\mathbf v)| \ge \|\mathbf v_{\bar S}\|_1 - \|\mathbf v_S\|_1 \ge \Big(1 - \frac{1}{s+1}\Big)\|\mathbf v_{\bar S}\|_1 = \frac{s}{s+1}\|\mathbf v_{\bar S}\|_1.$
Next we discuss two cases: $s = 1$ and $s > 1$.
(i) For $s = 1$: without loss of generality, we assume that the only nonzero element is $x_n \neq 0$, and hence we have
$$\big|\sigma_t(\mathbf v)\|\mathbf x\|_2^2 - \langle\mathbf v_S,\mathbf x\rangle\|\mathbf x\|_1\big| = \big|\big(v_n\,\mathrm{sign}(x_n) + \mathrm{sign}(t)\|\mathbf v_{\bar S}\|_1\big)x_n^2 - (v_n x_n)|x_n|\big| = \|\mathbf v_{\bar S}\|_1\, x_n^2.$$
We further discuss two cases: $|v_n| \ge 1/\sqrt n$ and $|v_n| < 1/\sqrt n$. If $|v_n| \ge 1/\sqrt n$, then $\|\mathbf v_{\bar S}\|_1 \ge (s+1)|v_n| \ge \frac{s+1}{\sqrt n}$ and hence
(6.14) $\big|\sigma_t(\mathbf v)\|\mathbf x\|_2^2 - \langle\mathbf v_S,\mathbf x\rangle\|\mathbf x\|_1\big| = \|\mathbf v_{\bar S}\|_1\|\mathbf x\|_2^2 \ge \frac{s+1}{\sqrt n}\|\mathbf x\|_2^2.$
If $|v_n| < 1/\sqrt n$, then we have $\|\mathbf v_{\bar S}\|_1 \ge 1 - |v_n| > 1 - \frac{1}{\sqrt n} = \frac{\sqrt n - 1}{\sqrt n}$ and
(6.15) $\big|\sigma_t(\mathbf v)\|\mathbf x\|_2^2 - \langle\mathbf v_S,\mathbf x\rangle\|\mathbf x\|_1\big| = \|\mathbf v_{\bar S}\|_1\|\mathbf x\|_2^2 \ge \frac{\sqrt n - 1}{\sqrt n}\|\mathbf x\|_2^2.$
Combining (6.14) and (6.15) with (6.11), we have
(6.16) $t_2 \ge \frac{\min\Big\{\frac{s+1}{\sqrt n}\|\mathbf x\|_2^2,\ \frac{\sqrt n - 1}{\sqrt n}\|\mathbf x\|_2^2\Big\}}{\sqrt n\,\|\mathbf x\|_1 + \|\mathbf x\|_1} > 0.$
(ii) For $s > 1$: we split into two cases. The first case is that $|v_j| < c$ for all $j \in S$ (we will determine the value of $c$ shortly). As a result, we get $\|\mathbf v_S\|_1 < sc$ and $\|\mathbf v_{\bar S}\|_1 \ge 1 - sc$, since $\|\mathbf v\|_1 \ge \|\mathbf v\|_2 = 1$. Some simple calculations lead to
\begin{align*}
\big|\sigma_t(\mathbf v)\|\mathbf x\|_2^2 - \langle\mathbf v_S,\mathbf x\rangle\|\mathbf x\|_1\big| &\ge |\sigma_t(\mathbf v)|\,\|\mathbf x\|_2^2 - |\langle\mathbf v_S,\mathbf x\rangle|\,\|\mathbf x\|_1 \\
&\ge \frac{s}{s+1}\|\mathbf v_{\bar S}\|_1\|\mathbf x\|_2^2 - \sum_{i\in S}|v_i||x_i|\,\|\mathbf x\|_1 \quad\text{(based on (6.13))} \\
&\ge \frac{s}{s+1}(1 - sc)\|\mathbf x\|_2^2 - c\sum_{i\in S}|x_i|\,\|\mathbf x\|_1 \\
&= \frac{s}{s+1}(1 - sc)\|\mathbf x\|_2^2 - c\|\mathbf x\|_1^2 \\
&\ge \frac{s}{s+1}(1 - sc)\|\mathbf x\|_2^2 - sc\|\mathbf x\|_2^2 \\
&= \frac{s}{s+1}\big(1 - (2s+1)c\big)\|\mathbf x\|_2^2.
\end{align*}
If we choose $c = \frac{1}{2s+2}$, then the above quantity is no less than $\frac{s\|\mathbf x\|_2^2}{(s+1)(2s+2)} > 0$.
The second case is that there exists $j \in S$ such that $|v_j| \ge c$, leading to
\begin{align*}
\big|\sigma_t(\mathbf v)\|\mathbf x\|_2^2 - \langle\mathbf v_S,\mathbf x\rangle\|\mathbf x\|_1\big| &\ge |\sigma_t(\mathbf v)|\,\|\mathbf x\|_2^2 - |\langle\mathbf v_S,\mathbf x\rangle|\,\|\mathbf x\|_1 \\
&\ge s\|\mathbf v_S\|_1\|\mathbf x\|_2^2 - \Big(\sum_{i\in S}|x_i||v_i|\Big)\|\mathbf x\|_1 \\
&= \sum_{i\in S}\big(s\|\mathbf x\|_2^2 - |x_i|\,\|\mathbf x\|_1\big)|v_i| \\
&\ge \big(s\|\mathbf x\|_2^2 - |x_j|\,\|\mathbf x\|_1\big)|v_j| \quad\text{(based on Lemma 6.2)} \\
&\ge c\big(s\|\mathbf x\|_2^2 - |x_j|\,\|\mathbf x\|_1\big) \\
&\ge c(s-1)\sum_{i\neq j} x_i^2 \quad\text{(based on Lemma 6.2)} \\
&\ge c(s-1)\min_{j\in S}\sum_{i\neq j} x_i^2. \tag{6.17}
\end{align*}
These two cases guarantee that $t_2 > 0$, i.e.,
(6.18) $t_2 \ge \frac{\min\Big\{c(s-1)\min_{j\in S}\sum_{i\neq j} x_i^2,\ \frac{s\|\mathbf x\|_2^2}{(s+1)(2s+2)}\Big\}}{\sqrt n\,\|\mathbf x\|_1 + \|\mathbf x\|_1} > 0.$
By (6.16) and (6.18), we get Lemma 6.3.
Now, we are ready to prove Theorem 2.2.
Proof.
According to (6.3), the first factor in the numerator of (6.5) is strictly positive, i.e., $\|\mathbf x\|_1 + t\,\sigma_t(\mathbf v) = \|\mathbf x + t\mathbf v\|_1 > 0$ for all $|t| < t_1$. As for the second factor, there exists a positive number $t_2$, defined in Lemma 6.3, such that
$$\big|\sigma_t(\mathbf v)\langle\mathbf v_S,\mathbf x\rangle - \|\mathbf x\|_1\|\mathbf v\|_2^2\big|\,|t| < \big|\sigma_t(\mathbf v)\|\mathbf x\|_2^2 - \langle\mathbf v_S,\mathbf x\rangle\|\mathbf x\|_1\big|$$
for all $|t| < t_2$ and $\mathbf v \in \ker(A)$ with $\|\mathbf v\|_2 = 1$. Moreover, we have
$$\mathrm{sign}\Big[\big(\sigma_t(\mathbf v)\|\mathbf x\|_2^2 - \langle\mathbf v_S,\mathbf x\rangle\|\mathbf x\|_1\big) + \big(\sigma_t(\mathbf v)\langle\mathbf v_S,\mathbf x\rangle - \|\mathbf x\|_1\|\mathbf v\|_2^2\big)t\Big] = \mathrm{sign}\big(\sigma_t(\mathbf v)\|\mathbf x\|_2^2 - \langle\mathbf v_S,\mathbf x\rangle\|\mathbf x\|_1\big).$$
Letting $t^* = \min\{t_1, t_2\}$, we have for any $t \in (0, t^*)$ and $\mathbf v \neq \mathbf 0$ that $\sigma_t(\mathbf v) > 0$:
\begin{align*}
\sigma_t(\mathbf v) &= \sum_{i\in S} v_i\,\mathrm{sign}(x_i) + \mathrm{sign}(t)\,\|\mathbf v_{\bar S}\|_1 = \sum_{i\in S} v_i\,\mathrm{sign}(x_i) + \|\mathbf v_{\bar S}\|_1 \\
&\ge \|\mathbf v_{\bar S}\|_1 - \|\mathbf v_S\|_1 \ge \max\Big\{s\|\mathbf v_S\|_1,\ \frac{s}{s+1}\|\mathbf v_{\bar S}\|_1\Big\} > 0. \tag{6.19}
\end{align*}
Also, (6.13) and Lemma 6.2 imply that
(6.20) $|\sigma_t(\mathbf v)|\,\|\mathbf x\|_2^2 \ge s\|\mathbf v_S\|_1\|\mathbf x\|_2^2 \ge \|\mathbf x\|_1\sum_{i\in S}|x_i||v_i| \ge \|\mathbf x\|_1\,|\langle\mathbf v_S,\mathbf x\rangle|,$
thus leading to
(6.21) $\sigma_t(\mathbf v)\|\mathbf x\|_2^2 - \langle\mathbf v_S,\mathbf x\rangle\|\mathbf x\|_1 \ge 0 \quad\text{for } \sigma_t(\mathbf v) > 0.$
As a result, we have $g'(t) \ge 0$ for $0 < t < t^*$. The function $g(t)$ is not differentiable at zero, but we can compute the one-sided derivative as follows:
(6.22) $g'(0^+) = \lim_{t\to 0^+}\frac{g(t) - g(0)}{t} = \frac{2\|\mathbf x\|_1\big(\sigma_{+1}(\mathbf v)\|\mathbf x\|_2^2 - \langle\mathbf v_S,\mathbf x\rangle\|\mathbf x\|_1\big)}{\|\mathbf x\|_2^4} \ge 0.$
Similarly, we can get $g'(t) \le 0$ for $-t^* < t < 0$ and $g'(0^-) \le 0$. Therefore, for any $0 < |t| < t^*$ we have $g(0) \le g(t)$, which implies that
(6.23) $\frac{\|\mathbf x + t\mathbf v\|_1}{\|\mathbf x + t\mathbf v\|_2} \ge \frac{\|\mathbf x\|_1}{\|\mathbf x\|_2}, \quad \forall\,|t| < t^*.$
Notice that $t^*$ does not depend on the choice of $\mathbf v$; therefore, the inequality holds for any $\mathbf v$ satisfying (6.2), which implies the result.

REFERENCES

[1]
A. S. Bandeira, E. Dobriban, D. G. Mixon, and W. F. Sawin , Certifying the restricted isometry propertyis hard , IEEE Trans. Inf. Theory, 59 (2013), pp. 3448–3450.[2]
A. Beck , First-Order Methods in Optimization , vol. 25, SIAM, 2017.[3]
J. Bolte, S. Sabach, and M. Teboulle , Proximal alternating linearized minimization for nonconvex andnonsmooth problems , Math. Program., 146 (2014), pp. 459–494.[4]
S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein , Distributed optimization and statistical learningvia the alternating direction method of multipliers , Found. Trends Mach. Learn., 3 (2011), pp. 1–122.[5]
E. J. Cand`es and C. Fernandez-Granda , Towards a mathematical theory of super-resolution , Comm. PureAppl. Math., 67 (2014), pp. 906–956.[6]
E. J. Cand`es, J. Romberg, and T. Tao , Stable signal recovery from incomplete and inaccurate measurements ,Comm. Pure Appl. Math., 59 (2006), pp. 1207–1223.[7]
E. J. Cand`es and M. B. Wakin , An introduction to compressive sampling , IEEE Signal Process. Mag., 25(2008), pp. 21–30.[8]
A. Chambolle and T. Pock , A first-order primal-dual algorithm for convex problems with applications toimaging , J. Math. Imaging and Vision, 40 (2011), pp. 120–145.[9]
R. Chartrand, Exact reconstruction of sparse signals via nonconvex minimization, IEEE Signal Process. Lett., 14 (2007), pp. 707–710.
[10]
A. Cohen, W. Dahmen, and R. DeVore , Compressed sensing and the best k-term approximation , J. Am.Math. Soc., 22 (2009), pp. 211–231.[11]
D. Donoho and M. Elad, Optimally sparse representation in general (nonorthogonal) dictionaries via $\ell^1$ minimization, Proc. Natl. Acad. Sci. USA, 100 (2003), pp. 2197–2202.
[12] D. L. Donoho and X. Huo, Uncertainty principles and ideal atomic decomposition, IEEE Trans. Inf. Theory, 47 (2001), pp. 2845–2862.
[13]
E. Esser, Y. Lou, and J. Xin , A method for finding structured sparse solutions to non-negative least squaresproblems with applications , SIAM J. Imaging Sci., 6 (2013), pp. 2010–2046.[14]
A. Fannjiang and W. Liao , Coherence pattern–guided compressive sensing with unresolved grids , SIAM J.Imaging Sci., 5 (2012), pp. 179–202.[15]
T. Goldstein and S. Osher, The split Bregman method for $L_1$-regularized problems, SIAM J. Imaging Sci., 2 (2009), pp. 323–343.
[16] R. Gribonval and M. Nielsen, Sparse representations in unions of bases, IEEE Trans. Inf. Theory, 49 (2003), pp. 3320–3325.
[17]
K. Guo, D. Han, and T.-T. Wu , Convergence of alternating direction method for minimizing sum of twononconvex functions with linear constraints , Int. J. of Comput. Math., 94 (2017), pp. 1653–1669.[18]
W. Guo and W. Yin, Edge guided reconstruction for compressive imaging, SIAM J. Imaging Sci., 5 (2012), pp. 809–834.
[19]
M. Hong, Z.-Q. Luo, and M. Razaviyayn , Convergence analysis of alternating direction method of multipliersfor a family of nonconvex problems , SIAM J. Optim., 26 (2016), pp. 337–364.[20]
P. O. Hoyer , Non-negative sparse coding , in Proc. IEEE Workshop Neural Networks Signal Proce., 2002,pp. 557–565.[21]
N. Hurley and S. Rickard, Comparing measures of sparsity, IEEE Trans. Inf. Theory, 55 (2009), pp. 4723–4741.
[22]
D. Krishnan, T. Tay, and R. Fergus , Blind deconvolution using a normalized sparsity measure , in IEEEConference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2011, pp. 233–240.[23]
M. J. Lai, Y. Xu, and W. Yin, Improved iteratively reweighted least squares for unconstrained smoothed $\ell_q$ minimization, SIAM J. Numer. Anal., 51 (2013), pp. 927–957.
[24]
G. Li and T. K. Pong , Global convergence of splitting methods for nonconvex composite optimization , SIAMJ. Optim., 25 (2015), pp. 2434–2460.[25]
Y. Lou, S. Osher, and J. Xin, Computational aspects of $L_1$-$L_2$ minimization for compressive sensing, in Model. Comput. & Optim. in Inf. Syst. & Manage. Sci., Adv. Intell. Syst. Comput., vol. 359, 2015, pp. 169–180.
[26] Y. Lou, P. Yin, Q. He, and J. Xin, Computing sparse representation in a highly coherent dictionary based on difference of $L_1$ and $L_2$, J. Sci. Comput., 64 (2015), pp. 178–196.
[27] Y. Lou, T. Zeng, S. Osher, and J. Xin, A weighted difference of anisotropic and isotropic total variation model for image processing, SIAM J. Imaging Sci., 8 (2015), pp. 1798–1823.
[28]
M. Lustig, D. L. Donoho, and J. M. Pauly , Sparse MRI: The application of compressed sensing for rapidMR imaging , Magnet. Reson. Med., 58 (2007), pp. 1182–1195.[29]
J. Lv and Y. Fan , A unified approach to model selection and sparse recovery using regularized least squares ,Ann. Appl. Stat., (2009), pp. 3498–3528.[30]
T. Ma, Y. Lou, and T. Huang, Truncated $L_1$-$L_2$ models for sparse recovery and rank minimization, SIAM J. Imaging Sci., 10 (2017), pp. 1346–1380.
[31] B. K. Natarajan, Sparse approximate solutions to linear systems, SIAM J. Comput., 24 (1995), pp. 227–234.
[32]
Gurobi Optimization, Gurobi Optimizer Reference Manual, 2015.
[33]
J.-S. Pang and M. Tao , Decomposition methods for computing directional stationary solutions of a class ofnonsmooth nonconvex optimization problems , SIAM J. Optim., 28 (2018), pp. 1640–1669.[34]
H. Raguet, J. Fadili, and G. Peyr´e , A generalized forward-backward splitting , SIAM J. Imaging Sci., 6(2013), pp. 1199–1226.[35]
A. Repetti, M. Q. Pham, L. Duval, E. Chouzenoux, and J.-C. Pesquet, Euclid in a taxicab: Sparse blind deconvolution with smoothed $\ell_1/\ell_2$ regularization, IEEE Signal Process. Lett., 22 (2015), pp. 539–543.
[36] L. Rudin, S. Osher, and E. Fatemi, Nonlinear total variation based noise removal algorithms, Physica D, 60 (1992), pp. 259–268.
[37]
X. Shen, W. Pan, and Y. Zhu , Likelihood-based selection and sharp parameter estimation , J. Am. Stat. Assoc.,107 (2012), pp. 223–232.[38]
A. M. Tillmann and M. E. Pfetsch , The computational complexity of the restricted isometry property,the nullspace property, and related concepts in compressed sensing , IEEE Trans. Inf. Theory, 60 (2014),pp. 1248–1259.[39]
H. Tran and C. Webster , Unified sufficient conditions for uniform recovery of sparse signals via nonconvexminimizations , arXiv preprint arXiv:1710.07348, (2017).[40]
F. Wang, W. Cao, and Z. Xu , Convergence of multi-block Bregman ADMM for nonconvex composite problems ,Sci. China Info. Sci., 61 (2018), pp. 122101:1–12.[41]
F. Wang, Z. Xu, and H.-K. Xu , Convergence of Bregman alternating direction method with multipliers fornonconvex composite problems , arXiv preprint arXiv:1410.8625, (2014).[42]
Y. Wang, W. Yin, and J. Zeng , Global convergence of ADMM in nonconvex nonsmooth optimization , J. Sci.Comput., 78 (2019), pp. 29–63.[43]
Z. Xu, X. Chang, F. Xu, and H. Zhang, $L_{1/2}$ regularization: A thresholding representation theory and a fast solver, IEEE Trans. Neural Netw. Learn. Syst., 23 (2012), pp. 1013–1027.
[44] P. Yin, E. Esser, and J. Xin, Ratio and difference of $l_1$ and $l_2$ norms and sparse representation with coherent dictionaries, Commun. Inf. Syst., 14 (2014), pp. 87–109.
[45]
P. Yin, Y. Lou, Q. He, and J. Xin, Minimization of $\ell_{1-2}$ for compressed sensing, SIAM J. Sci. Comput., 37 (2015), pp. A536–A563.
[46] S. Zhang and J. Xin, Minimization of transformed $l_1$ penalty: Closed form representation and iterative thresholding algorithms, Commun. Math. Sci., 15 (2017), pp. 511–537.
[47] S. Zhang and J. Xin, Minimization of transformed $l_1$ penalty: Theory, difference of convex function algorithm, and robust application in compressed sensing, Math. Program., Ser. B, 169 (2018), pp. 307–336.
[48] T. Zhang, Multi-stage convex relaxation for learning with sparse regularization, in Adv. Neural Inf. Process. Syst., 2009, pp. 1929–1936.
[49]