A framework to characterize performance of LASSO algorithms
Mihailo Stojnic
School of Industrial Engineering, Purdue University, West Lafayette, IN 47907, e-mail: [email protected]
Abstract
In this paper we consider solving noisy under-determined systems of linear equations with sparse solutions. A noiseless equivalent attracted enormous attention in recent years, above all due to the work of [12, 13, 25], where it was shown in a statistical and large dimensional context that a sparse unknown vector (of sparsity proportional to the length of the vector) can be recovered from an under-determined system via a simple polynomial $\ell_1$-optimization algorithm. [13] further established that even when the equations are noisy, one can, through an SOCP noisy equivalent of $\ell_1$, obtain an approximate solution that is (in an $\ell_2$-norm sense) no further than a constant times the noise from the sparse unknown vector. In our recent works [62, 63] we created a powerful mechanism that helped us characterize exactly the performance of $\ell_1$ optimization in the noiseless case (as shown in [61], and as it must be if the axioms of mathematics are well set, the results of [62, 63] are in absolute agreement with the corresponding exact ones from [25]). In this paper we design a mechanism, as powerful as those from [62, 63], that can handle the analysis of a LASSO type of algorithm (and many others) that can be (or typically are) used for “solving” noisy under-determined systems. Using the mechanism we then, in a statistical context, compute the exact worst-case $\ell_2$-norm distance between the unknown sparse vector and the approximate one obtained through such a LASSO. The obtained results match the corresponding exact ones obtained in [6, 26]. Moreover, as a by-product of our analysis framework we recognize the existence of an SOCP type of algorithm that achieves the same performance.

Index Terms: Noisy linear systems of equations; LASSO; SOCP; $\ell_1$-optimization; compressed sensing.

1 Introduction

In recent years the problem of finding sparse solutions of under-determined systems of linear equations has attracted enormous attention. Applications seem vast and appear to be growing almost on a daily basis (see, e.g., [4, 10, 14, 22, 30, 45, 49, 53, 55–57, 69, 71] and references therein). Given the substantial interest in the problem (and especially that it comes from a variety of different fields), one may assume that designing efficient algorithms that would solve it could be of far-reaching importance. To that end, we believe that a precise mathematical understanding of the phenomena that make certain algorithms work well would help solidify belief in their success in current and future applications. Moreover, it is possible that down the road it can also help expand further the range of their applications.

Moving along the same lines, we in this paper focus on studying mathematical properties of under-determined systems of linear equations and certain algorithms used to solve them. We start the story by introducing an idealized version of the problem that we plan to study. In its simplest form it amounts to finding a $k$-sparse $x$ such that

$Ax = y$ (1)

where $A$ is an $m \times n$ ($m < n$) matrix and $y$ is an $m \times 1$ vector (see Figure 1; here and in the rest of the paper, under a $k$-sparse vector we assume a vector that has at most $k$ nonzero components). Of course, the assumption will be that such an $x$ exists (clearly, the case of real interest is $k < m$). To make writing in the rest of the paper easier, we will assume the so-called linear regime, i.e. we will assume that $k = \beta n$ and that the number of equations is $m = \alpha n$, where $\alpha$ and $\beta$ are constants independent of $n$ (more on the non-linear regime, i.e.
on the regime when m is larger than linearly proportional to k can be found in e.g. [21, 34, 35]). km = A xy n Figure 1: Model of a linear system; vector x is k -sparseIf one has the freedom to design matrix A then the results from [2, 46, 52] demonstrated that the tech-niques from coding theory (based on coding/decoding of Reed-Solomon codes) can be employed to deter-mine any k -sparse x in (1) for any < α ≤ and any β ≤ α in polynomial time. It is relatively easy toshow that under the unique recoverability assumption β can not be greater than α . Therefore, as long as oneis concerned with the unique recovery of k -sparse x in (1) in polynomial time the results from [2, 46, 52]are optimal. The complexity of algorithms from [2, 46, 52] is roughly O ( n ) . In a similar fashion one can,instead of using coding/decoding techniques associated with Reed/Solomon codes, design the matrix andthe corresponding recovery algorithm based on the techniques related to coding/decoding of Expander codes(see e.g. [41, 42, 72] and references therein). In that case recovering x in (1) is significantly faster for largedimensions n . Namely, the complexity of the techniques from e.g. [41, 42, 72] (or their slight modifications)is usually O ( n ) which is clearly for large n significantly smaller than O ( n ) . However, the techniques basedon coding/decoding of Expander codes usually do not allow for β to be as large as α .On the other hand, if one has no freedom in choice of A designing the algorithms to find k -sparse x in(1) is substantially harder. In fact, when there is no choice in A the recovery problem (1) becomes NP-hard.Two algorithms 1) Orthogonal matching pursuit - OMP and 2)
Basis pursuit - ℓ -optimization (and theirdifferent variations) have been often viewed historically as solid heuristics for solving (1) (in recent yearsbelief propagation type of algorithms are emerging as strong alternatives as well). Roughly speaking, OMPalgorithms are faster but can recover smaller sparsity whereas the BP ones are slower but recover highersparsity. In a more precise way, under certain probabilistic assumptions on the elements of A it can beshown (see e.g. [51, 66, 67]) that if m = O ( k log( n )) OMP (or slightly modified OMP) can recover x in(1) with complexity of recovery O ( n ) . On the other hand a stage-wise OMP from [29] recovers x in (1)with complexity of recovery O ( n log n ) . Somewhere in between OMP and BP are recent improvementsCoSAMP (see e.g. [50]) and Subspace pursuit (see e.g. [23]), which guarantee (assuming the linear regime)that the k -sparse x in (1) can be recovered in polynomial time with m = O ( k ) equations which is the sameperformance guarantee established in [13, 25] for the BP.2e now introduce the BP concept (or, as we will refer to it, the ℓ -optimization concept; a slightmodification/adaptation of it will actually be the main topic of this paper). Variations of the standard ℓ -optimization from e.g. [15, 19, 60] as well as those from [24, 32, 37–39, 59] related to ℓ q -optimization, < q < are possible as well; moreover they can all be incorporated in what we will present below. The ℓ -optimization concept suggests that one can maybe find the k -sparse x in (1) by solving the following ℓ -norm minimization problem min k x k subject to A x = y . (2)As is then shown in [13] if α and n are given, A is given and satisfies the restricted isometry property (RIP)(more on this property the interested reader can find in e.g. [1, 5, 11–13, 58]), then any unknown vector x with no more than k = βn (where β is a constant dependent on α and explicitly calculated in [13]) non-zeroelements can indeed be recovered by solving (2). In a statistical and large dimensional context in [25] andlater in [63] for any given value of β the exact value of the maximum possible α was determined.As we mentioned earlier the above scenario is in a sense idealistic. Namely, it assumes that y in (2) wasobtained through (1). On the other hand in many applications only a noisy version of A x may be availablefor y (this is especially so in measuring type of applications) see, e.g. [12, 13, 40, 70]. When that happensone has the following equivalent to (1) (see, Figure 2) y = A x + v , (3)where v is an m × vector (often dubbed as the noise vector; the so-called ideal case presented above is ofcourse a special case of the noisy case). Finding the k -sparse x in (3) is now incredibly hard. Basically, one km = A xy + v noise Figure 2: Model of a linear system; vector x is k -sparseis looking for a k -sparse x such that (3) holds and on top of that v is unknown. Although the problem ishard there are various heuristics throughout the literature that one can use to solve it approximately. Belowwe restrict our attention to two groups of algorithms that we believe are the most relevant to the results thatwe will present.To introduce a bit or tractability in finding the k -sparse x in (3) one usually assumes certain amount ofknowledge about either x or v . As far as tractability assumptions on v are concerned one typically (andpossibly fairly reasonably in applications of interest) assumes that k v k is bounded (or highly likely to bebounded) from above by a certain known quantity. 
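Before turning to the algorithms themselves, it may help to have a concrete instance of the noisy model (3) in hand. The sketch below is our own illustrative code, not something provided in the paper; numpy, the function name, the default parameter values, and the placement of the nonzero components are all our assumptions. It draws $A$, a $k$-sparse $\tilde{x}$, and Gaussian noise $v$ in the linear regime $m = \alpha n$, $k = \beta n$; for such noise $\|v\|_2$ is highly likely to stay below a known quantity of order $\sigma\sqrt{m}$, which is exactly the kind of bound on the noise norm just mentioned.

```python
import numpy as np

# Illustrative sketch (not from the paper): draw one instance of the noisy
# linear model y = A x + v from (3) in the linear regime m = alpha*n, k = beta*n.
def generate_instance(n=200, alpha=0.5, beta=0.1, sigma=0.1, seed=0):
    rng = np.random.default_rng(seed)
    m, k = int(alpha * n), int(beta * n)
    A = rng.standard_normal((m, n))                    # m x n, i.i.d. standard normal
    x_tilde = np.zeros(n)
    x_tilde[n - k:] = np.abs(rng.standard_normal(k))   # k-sparse unknown vector
    v = sigma * rng.standard_normal(m)                 # noise vector, variance sigma^2
    y = A @ x_tilde + v
    return A, x_tilde, v, y
```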
The following second-order cone programming (SOCP)3nalogue to (2) is one of the approaches that utilizes such an assumption (see, e.g. [13]) min x k x k subject to k y − A x k ≤ r (4)where, r is a quantity such that k v k ≤ r (or r is a quantity such that k v k ≤ r is say highly likely).For example, in [13] a statistical context is assumed and based on the statistics of v , r was chosen suchthat k v k ≤ r happens with overwhelming probability (as usual, under overwhelming probability we inthis paper assume a probability that is no more than a number exponentially decaying in n away from ).Given that (4) is now among few almost standard choices when it comes to finding the x -sparse in (3),the literature on its properties is vast (see, e.g. [13, 28, 65] and references therein). Also, given that thisSOCP will not be the main topic of this paper we below briefly mention only what we consider to be themost influential work on this topic in recent years. Namely, in [13] the authors analyzed performance of (4)and showed a result similar in flavor to the one that holds in the ideal - noiseless - case. In a nutshell thefollowing was shown in [13]: let x be a βn -sparse vector such that (3) holds and let x socp be the solutionof (4). Then k x socp − x k ≤ Cr where β is a constant independent of n and C is a constant independentof n and of course dependent on α and β . This result in a sense establishes a noisy equivalent to the factthat a linear sparsity can be recovered from an under-determined system of linear equations. In an informallanguage, it states that a linear sparsity can be approximately recovered in polynomial time from a noisyunder-determined system with the norm of the recovery error guaranteed to be within a constant multipleof the noise norm. Establishing such a result is, of course, a feat in its own class, not only because of itstechnical contribution but even more so because of the amount of interest that it generated in the field.In this paper we will also consider an approximate recovery of the k -sparse x in (3). However, insteadof the above mentioned SOCP we will focus on a group of highly successful algorithms called LASSO (theLASSO algorithms, as well as the SOCP ones, are of course well known in the statistics community andthere is again a vast literature that covers their performance (see, e.g. [6,9,17,18,26,48,64,68] and referencestherein). There are many variants of LASSO but the following one is probably the most well known min x k y − A x k + λ lasso k x k . (5) λ lasso in (5) is a parameter to be chosen based on the amount of pre-knowledge one may have about A , v ,and/or x . The results that relate to the characterization of the approximation error of (5) that are similar to theSOCP ones mentioned above can be established (see, e.g. [7]). Of course, characterizing the performanceof the recovery algorithm through the norm-2 of the error vector is only one possible way among many(more on other measures of performance can be found in e.g. [9, 70]). In this paper we will develop anovel framework for performance characterization of the LASSO algorithms. 
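Both (4) and (5) are small convex programs and can be prototyped in a few lines. The sketch below is our own illustration; the cvxpy modeling package and the generic solver defaults are assumptions on our part, not something the paper uses or prescribes.

```python
import cvxpy as cp

# Illustrative prototypes (not from the paper) of the SOCP (4) and the LASSO (5).
def socp_recover(A, y, r):
    """min ||x||_1 subject to ||y - A x||_2 <= r, as in (4)."""
    x = cp.Variable(A.shape[1])
    cp.Problem(cp.Minimize(cp.norm(x, 1)),
               [cp.norm(y - A @ x, 2) <= r]).solve()
    return x.value

def lasso_recover(A, y, lam):
    """min ||y - A x||_2 + lam * ||x||_1, as in (5)."""
    x = cp.Variable(A.shape[1])
    cp.Problem(cp.Minimize(cp.norm(y - A @ x, 2) + lam * cp.norm(x, 1))).solve()
    return x.value
```

In both prototypes the quality of the output hinges on how $r$ and $\lambda_{lasso}$ are chosen, i.e. on the amount of pre-knowledge about $A$, $v$, and/or $x$.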
Among other things, in astatistical context, the framework will enable us to provide a precise characterization of the norm-2 of theapproximation error of the LASSO algorithms.While our main focus in this paper are algorithms from the LASSO group we mention that besidesthe SOCP and LASSO algorithms there are of course various other algorithms/heuristics that have beensuggested as possible alternatives throughout the literature in recent years. Such an alternative that gainedcertain amount of popularity is for example the so-called Dantzig selector introduced in [16]. The Dantzigselector amounts to solving the following optimization problem min k x k subject to k A T ( A x − y ) k ∞ ≤ C Dan , where C Dan is a carefully chosen parameter that of course should depend on A, v , and/or x . As a lin-4ar program the Danzig selector promises to be faster than SOCP or LASSO which are both quadraticprograms. On the other hand recent improvements in numerical implementations of LASSO’s and theirsolid approximate recovery abilities make them quite competitive as well (more on a thorough discus-sion/comparison, advantages/disdvatnages of the Dantzig selector and the LASSO algorithms can be foundin e.g. [3, 8, 31, 33, 43, 44, 47]).To facilitate the exposition and the easiness of following we will present our framework on a version ofthe LASSO from (5). Namely, we will consider, min x k y − A x k subject to k x k ≤ k ˜ x k (6)where ˜ x is the original k -sparse x that satisfies (3) (we just briefly mention that in a context that will beconsidered in this paper it is not that difficult to transform the LASSO from (6) to one that is structurallyequivalent to (5); however, we stop short of exploring this connection further before presenting our mainresults and only mention that a section towards the end of the paper will explore it in more detail.). We dohowever mention right here that in order to run (6) one does require the knowledge of k ˜ x k . In a sense thisrequirement is an equivalent to setting r and λ lasso in (4) and (5), respectively. In order to be maximallyeffective both r and λ lasso do require some amount of pre-knowledge about A , v , and/or x .Before we proceed further we briefly summarize the organization of the rest of the paper. In Section 2,we present a statistical framework for the performance analysis of the LASSO algorithms. To demonstrateits power we towards the end of Section 2, for any given α and β , compute the worst case norm-2 of theerror that (6) makes when used for approximate recovery of general sparse signals x from (3). In Section3 we then specialize results from Section 2 to the so-called signed vectors x . In Section 4 we discuss howthe LASSO from (6) can be connected to the LASSO from (5). In Section 5 we demonstrate that there isan SOCP algorithm (similar to the one given in (4)) that achieves the same performance as do (6) and acorresponding (5). In Section 6 we present results that we obtained through numerical experiments. Finally,in Section 7 we discuss obtained results. x In this section we create a statistical LASSO’s performance analysis framework. Before proceeding furtherwe will now explicitly state the major assumptions that we will make (the remaining ones, will be madeappropriately throughout the analysis). Namely, in the rest of the paper we will assume that the elementsof A are i.i.d. standard normal random variables. We will also assume that the elements of v are i.i.d.Gaussian random variables with zero mean and variance σ . 
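Under the statistical assumptions just stated, the variant (6) is again a small convex program whose error vector can be inspected directly for moderate $n$. The following sketch is once more our own illustration; cvxpy and all parameter choices are assumptions rather than anything the paper specifies. It solves (6) and returns both the approximate solution and the optimal objective value, so that the $\ell_2$ norm of the difference between the approximate solution and the original $\tilde{x}$ can be read off.

```python
import cvxpy as cp
import numpy as np

# Illustrative prototype (not from the paper) of the LASSO variant (6).
def constrained_lasso(A, y, x_tilde):
    """min ||y - A x||_2 subject to ||x||_1 <= ||x_tilde||_1, as in (6)."""
    x = cp.Variable(A.shape[1])
    prob = cp.Problem(cp.Minimize(cp.norm(y - A @ x, 2)),
                      [cp.norm(x, 1) <= np.linalg.norm(x_tilde, 1)])
    prob.solve()
    return x.value, prob.value

# x_hat, obj = constrained_lasso(A, y, x_tilde)
# recovery_error = np.linalg.norm(x_hat - x_tilde)   # the quantity studied below
```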
As stated earlier, we will assume that ˜ x isthe original x in (3) that we are trying to recover and that it is any k -sparse vector with a given fixedlocation of its nonzero elements and a given fixed combination of their signs. Since the analysis (andthe performance of (6)) will clearly be irrelevant with respect to what particular location and what particularcombination of signs of nonzero elements are chosen, we can for the simplicity of the exposition and withoutloss of generality assume that the components x , x , . . . , x n − k of x are equal to zero and the components x n − k +1 , x n − k +2 , . . . , x n of x are greater than or equal to zero. Moreover, throughout the paper we will callsuch an x k -sparse and positive. In a more formal way we will set ˜ x = ˜ x = · · · = ˜ x n − k = 0˜ x n − k +1 ≥ , ˜ x n − k +1 ≥ , . . . , ˜ x n ≥ . (7)5e also now take the opportunity to point out a rather obvious detail. Namely, the fact that ˜ x is positive isassumed for the purpose of the analysis. However, this fact is not known a priori and is not available to thesolving algorithm (this will of course change in Section 3).Once we establish the framework it will be clear that it can be used to characterize many of the LASSOfeatures. We will defer these details to a collection of forthcoming papers. In this paper we will present onlya small application that relates to a classical question of quantifying the approximation error that (6) makeswhen used to recover any k -sparse x that satisfies (3) and is from a set of x ’s with a given fixed location ofnonzero elements and a given fixed combination of their signs.Before proceeding further we will introduce a few definitions that will be useful in formalizing thisapplication as well as in conducting the entire analysis. As it is natural we start with the solution of (6). Let ˆ x be the solution of (6) and let w lasso ∈ R n be such that ˆ x = ˜ x + w lasso . (8)As an application of our framework we will compute the largest possible value of k ˆ x − ˜ x k = k w lasso k for any combination ( α, β ) . Or more rigorously, for any combination ( α, β ) , we will find a d lasso such that lim n →∞ P ( d lasso − ǫ ≤ max ˜ x k w lasso k ≤ d lasso + ǫ ) = 1 (9)for an arbitrarily small constant ǫ . However, before doing so we will first present the general framework.The framework that we will present will center around finding the optimal value of the objective function in(6) (of course in a probabilistic context). In the first of the following two subsections we will create a lowerbound on this optimal value. We will then afterwards in the second of the subsections create an upper boundon this optimal value. Naturally in the third subsection we will show that the two bounds actually match. Tomake further writing easier and clearer we set already here ζ obj = min x k y − A x k subject to k x k ≤ k ˜ x k . (10) ζ obj In this section we present the part of the framework that relates to finding a “high-probability” lower boundon ζ obj . To make arguments that will follow less tedious we will make an assumption that is significantlyweaker than what we will eventually prove. Namely, we will assume that there is a (if necessary, arbitrarilylarge) constant C w such that P ( k w lasso k ≤ C w ) ≥ − e − ǫ C w n . (11)To make our arguments flow more naturally, one should probably provide a direct proof of this statementright here. However, given the difficulty of the task ahead we refrain from that and assume that the statementis correct. 
Roughly speaking, what we assume is that k w lasso k is bounded by an arbitrarily large constant(of course we hope to create a machinery that can prove much more than (11)).We start by noting that if one knows that y = A ˜ x + v holds then (10) can be rewritten as min x k v + A ˜ x − A x k subject to k x k ≤ k ˜ x k . (12)6fter a small change of variables, x = ˜ x + w , (12) becomes min w k v − A w k subject to k ˜ x + w k ≤ k ˜ x k , (13)or in a more compact form min w k A v (cid:20) w σ (cid:21) k subject to k ˜ x + w k ≤ k ˜ x k , (14)where A v = (cid:2) − A v (cid:3) is now an m × ( n + 1) random matrix with i.i.d. standard normal components. Let S w ( σ, ˜ x , C w ) = { (cid:20) w σ (cid:21) ∈ R n +1 | k w k ≤ C w and k ˜ x + w k ≤ k ˜ x k } . (15)Further, let f obj ( σ, w ) = k A v (cid:20) w σ (cid:21) k (16)and set, ζ ( help ) obj = min [ w T σ ] T ∈ S w ( σ, ˜ x ,C w ) f obj ( σ, w ) = min [ w T σ ] T ∈ S w ( σ, ˜ x ,C w ) k A v (cid:20) w σ (cid:21) k = min [ w T σ ] T ∈ S w ( σ, ˜ x ,C w ) max k a k =1 a T A v (cid:20) w σ (cid:21) . (17)We now state a lemma from [36] that will be of use in what follows. Lemma 1. ( [36]) Let A be an m × n matrix with i.i.d. standard normal components. Let g and h be m × and n × vectors, respectively, with i.i.d. standard normal components. Also, let g be a standard normalrandom variable and let Φ ⊂ R n be an arbitrary subset. Then for all choices of real ψ φ P (min φ ∈ Φ max k a k =1 ( a T Aφ + k φ k g − ψ φ ) ≥ ≥ P (min φ ∈ Φ max k a k =1 ( k φ k m X i =1 g i a i + n X i =1 h i φ i − ψ φ ) ≥ . (18)Now, after applying Lemma 1 one has P ( min [ w T σ ] T ∈ S w ( σ, ˜ x ,C w ) ( f obj ( σ, w ) + q k w k + σ g ) ≥ ζ ( l ) obj )= P (cid:18) min [ w T σ ] T ∈ S w ( σ, ˜ x ,C w ) max k a k =1 (cid:18) a T A v (cid:20) w σ (cid:21) + q k w k + σ g (cid:19) ≥ ζ ( l ) obj (cid:19) ≥ P min [ w T σ ] T ∈ S w ( σ, ˜ x ,C w ) max k a k =1 q k w k + σ m X i =1 g i a i + n X i =1 h i w i + h n +1 σ ! ≥ ζ ( l ) obj ! . (19)In what follows we will analyze the following probability p l = P min [ w T σ ] T ∈ S w ( σ, ˜ x ,C w ) max k a k =1 q k w k + σ m X i =1 g i a i + n X i =1 h i w i + h n +1 σ ! ≥ ζ ( l ) obj ! , (20)which is of course nothing but the probability on the left-hand side of the inequality in (19). We will7ssentially show that for certain ζ ( l ) obj this probability is close to . That will rather obviously imply that wehave a “high probability” lower bound on ζ obj . To that end, we first note that the maximization over a istrivial and one obtains p l = P min [ w T σ ] T ∈ S w ( σ, ˜ x ,C w ) q k w k + σ k g k + n X i =1 h i w i ! + h n +1 σ ≥ ζ ( l ) obj ! . (21)To facilitate the exposition that will follow let ξ ( σ, g , h , ˜ x ) = min [ w T σ ] T ∈ S w ( σ, ˜ x ,C w ) q k w k + σ k g k + n X i =1 h i w i ! . (22)One should note here that, although present in the definition of S w , σ clearly does not have an impact onthe result of the above optimization. Now we split the analysis into two parts. The first one will be thedeterministic analysis of ξ ( σ, g , h , ˜ x ) and will be presented in Subsection 2.1.1. In the second part (that willbe presented in Subsection 2.1.2) we will use the results of such a deterministic analysis and continue theabove probabilistic analysis applying various concentration results. ξ ( σ, g , h , ˜ x ) In this section we compute ξ ( σ, g , h ) . 
We first rewrite the optimization problem from (22) in the followingpossibly clearer form ξ ( σ, g , h , ˜ x ) = min w q k w k + σ k g k + n X i =1 h i w i subject to k ˜ x + w k ≤ k ˜ x k q k w k + σ ≤ p C w + σ . (23)To remove the absolute values we introduce auxiliary variables t i , ≤ i ≤ n and transform the aboveproblem to ξ ( σ, g , h , ˜ x ) = min w , t q k w k + σ k g k + n X i =1 h i w i subject to n X i =1 t i ≤ k ˜ x k ˜ x i + w i − t i ≤ , n − k + 1 ≤ i ≤ n − ˜ x i − w i − t i ≤ , n − k + 1 ≤ i ≤ n w i − t i ≤ , ≤ i ≤ n − k − w i − t i ≤ , ≤ i ≤ n − k q k w k + σ ≤ p C w + σ . (24)8he Lagrange dual of the above problem then becomes L ( ν, λ (1) , λ (2) , w , t , γ ) = q k w k + σ k g k + n X i =1 h i w i + ν n X i =1 t i − ν k ˜ x k + n X i = n − k +1 λ (1) i (˜ x i + w i − t i )+ n X i = n − k +1 λ (2) i ( − ˜ x i − w i − t i )+ n − k X i =1 λ (1) i ( w i − t i )+ n − k X i =1 λ (2) i ( − w i − t i )+ γ ( q k w k + σ − p C w + σ ) . (25)After rearranging the terms we further have L ( ν, λ (1) , λ (2) , w , t , γ ) = q k w k + σ k g k + n X i =1 h i w i − ν k ˜ x k + n X i =1 t i ( ν − λ (1) i − λ (2) i )+ n X i = n − k +1 λ (1) i (˜ x i + w i )+ n X i = n − k +1 λ (2) i ( − ˜ x i − w i ) + n − k X i =1 λ (1) i w i − n − k X i =1 λ (2) i w i + γ ( q k w k + σ − p C w + σ ) . (26)After a few further arrangements we finally have L ( ν, λ (1) , λ (2) , w , t , γ ) = q k w k + σ ( k g k + γ ) + n X i =1 h i w i − ν k ˜ x k + n X i =1 t i ( ν − λ (1) i − λ (2) i )+ n X i = n − k +1 ( λ (1) i − λ (2) i )˜ x i + n X i =1 ( λ (1) i − λ (2) i ) w i − γ p C w + σ . (27)Setting ( ν − λ (1) i − λ (2) i ) = 0 , ≤ i ≤ n , (to insure that the dual is bounded) and combining (24) and (27)is enough to obtain ξ ( σ, g , h , ˜ x ) = max ν,λ (1) ,λ (2) ,γ min w , t L ( ν, λ (1) , λ (2) , w , t ) subject to λ ( i ) j ≥ , ≤ j ≤ n, ≤ i ≤ ν ≥ ν − λ (1) i − λ (2) i = 0 , ≤ i ≤ nγ ≥ , (28)where we of course use the fact that the strict duality obviously holds. After removing the minimizationover t we have ξ ( σ, g , h , ˜ x ) = max ν,λ (1) ,λ (2) ,γ min w L ( ν, λ (1) , λ (2) , w , γ ) subject to λ ( i ) ≥ , ≤ j ≤ n, ≤ i ≤ ν ≥ ν − λ (1) i − λ (2) i = 0 , ≤ i ≤ nγ ≥ . (29)9here L ( ν, λ (1) , λ (2) , w , γ ) = q k w k + σ ( k g k + γ )+ n X i =1 h i w i − ν k ˜ x k + n X i = n − k +1 ( λ (1) i − λ (2) i )˜ x i + n X i =1 ( λ (1) i − λ (2) i ) w i − γ p C w + σ . (30)The inner minimization over w is now doable. Setting the derivatives with respect to w i to zero one obtains w ( k g k + γ ) p k w k + σ + ( h + λ (1) − λ (2) ) = 0 , (31)where λ (1) = [ λ (1)1 , λ (1)2 , . . . , λ (1) n ] T , λ (2) = [ λ (2)1 , λ (2)2 , . . . , λ (2) n ] T . From (31) one then has w ( k g k + γ ) = − q k w k + σ ( h + λ (1) − λ (2) ) (32)or in a norm form k w k ( k g k + γ ) = ( k w k + σ ) k h + λ (1) − λ (2) k . (33)From (33) we then find k w sol k = σ k h + λ (1) − λ (2) k q ( k g k + γ ) − k h + λ (1) − λ (2) k , (34)and from (32) w sol = σ ( h + λ (1) − λ (2) ) q ( k g k + γ ) − k h + λ (1) − λ (2) k (35)where w sol is of course the solution of the inner minimization over w . Now, one should note that (34) and(35) are of course possible only if k g k + γ − k h + λ (1) − λ (2) k ≥ . 
Later in the paper we will recognize,that for λ (1) and λ (2) that are optimal in (29), validity of this condition essentially implies the regime (in ( α, β ) plane) where the worst-case k w k is finite with overwhelming probability (or equivalently, if for such λ (1) and λ (2) the condition is not valid then for the corresponding ( α, β ) the worst-case k w k is infinite withoverwhelming probability). Plugging the value of w sol from (35) back in (29) gives ξ ( σ, g , h , ˜ x ) = max ν,λ (1) ,λ (2) ,γ σ q ( k g k + γ ) − k h + λ (1) − λ (2) k − ν k ˜ x k + n X i = n − k +1 ( λ (1) i − λ (2) i )˜ x i − γ p C w + σ subject to λ ( i ) ≥ , ≤ j ≤ n, ≤ i ≤ ν ≥ ν − λ (1) i − λ (2) i = 0 , ≤ i ≤ n k g k + γ − k h + λ (1) − λ (2) k ≥ γ ≥ . (36)Let z (1) = [1 , , . . . , T . By plugging the constraint λ (1) = ν z (1) − λ (2) back into the objective functionand making sure that ν − λ (2) i ≥ , ≥ i ≥ n , one can remove λ (1) from the above optimization and get the10ollowing ξ ( σ, g , h , ˜ x ) = max ν,λ (2) ,γ σ q ( k g k + γ ) − k h + ν z (1) − λ (2) k − ν k ˜ x k + n X i = n − k +1 ( ν − λ (2) i )˜ x i − γ p C w + σ subject to ν ≥ ≤ λ (2) i ≤ ν, ≤ i ≤ n k g k + γ − k h + ν z (1) − λ (2) k ≥ γ ≥ . (37)Since we assumed that ˜ x i ≥ , n − k + 1 ≤ i ≤ n , and ˜ x i = 0 , ≤ i ≤ n − k one then from (37) has ξ ( σ, g , h , ˜ x ) = max ν,λ (2) ,γ σ q ( k g k + γ ) − k h + ν z (1) − λ (2) k − n X i = n − k +1 λ (2) i ˜ x i − γ p C w + σ subject to ν ≥ ≤ λ (2) i ≤ ν, ≤ i ≤ n k g k + γ − k h + ν z (1) − λ (2) k ≥ γ ≥ . (38)After a simple scaling of λ (2) one finds that the following is an equivalent to (38) ξ ( σ, g , h , ˜ x ) = max ν,λ (2) ,γ σ q ( k g k + γ ) − k h + ν z (1) − λ (2) k − n X i = n − k +1 λ (2) i ˜ x i − γ p C w + σ subject to ν ≥ ≤ λ (2) i ≤ ν, ≤ i ≤ n k g k + γ − k h + ν z (1) − λ (2) k ≥ γ ≥ . (39)Now, the maximization over γ can be done. After setting the derivative to zero one finds k g k + γ q ( k g k + γ ) − k h + ν z (1) − λ (2) k − p C w + σ = 0 (40)and after some algebra γ opt = s σ C w k h + ν z (1) − λ (2) k − k g k , (41)where of course γ opt would be the solution of (39) only if larger than or equal to zero. Alternatively ofcourse γ opt = 0 . Now, based on these two scenarios we distinguish two different optimization problems:11. The “overwhelming” optimization ξ ov ( σ, g , h , ˜ x ) = max ν,λ (2) σ q k g k − k h + ν z (1) − λ (2) k − n X i = n − k +1 λ (2) i ˜ x i subject to ν ≥ ≤ λ (2) i ≤ ν, ≤ i ≤ n. (42)2. The “non-overwhelming” optimization ξ nov ( σ, g , h , ˜ x ) = max ν,λ (2) p C w + σ k g k − C w k h + ν z (1) − λ (2) k − n X i = n − k +1 λ (2) i ˜ x i subject to ν ≥ ≤ λ (2) i ≤ ν, ≤ i ≤ n. (43)The “overwhelming” optimization is the equivalent to (39) if for its optimal values ˆ ν and d λ (2) holds s σ C w k h + ˆ ν z (1) − d λ (2) k ≤ k g k , (44)We now summarize in the following lemma the results of this subsection. Lemma 2.
Let ˆ ν and d λ (2) be the solutions of (42) and analogously let ˜ ν and g λ (2) be the solutions of (43).Let ξ ( σ, g , h , ˜ x ) be, as defined in (22), the optimal value of the objective function in (22). Then ξ ( σ, g , h , ˜ x ) = σ q k g k − k h + ˆ ν z (1) − d λ (2) k − P ni = n − k +1 d λ (2) i ˜ x i , if q σ C w k h + ˆ ν z (1) − d λ (2) k ≤ k g k p C w + σ k g k − C w k h + ˜ ν z (1) − g λ (2) k − P ni = n − k +1 g λ (2) i ˜ x i , otherwise . (45) Moreover, let ˆ w be the solution of (22). Then ˆ w ( σ, g , h , ˜ x ) = σ ( h +ˆ ν z (1) − d λ (2) ) q k g k −k h +ˆ ν z (1) − d λ (2) k , if q σ C w k h + ˆ ν z (1) − d λ (2) k ≤ k g k C w ( h +˜ ν z (1) − g λ (2) ) k h +˜ ν z (1) − g λ (2) k , otherwise , (46) and k ˆ w ( σ, g , h , ˜ x ) k = σ k h +ˆ ν z (1) − d λ (2) ) k q k g k −k h +ˆ ν z (1) − d λ (2) k , if q σ C w k h + ˆ ν z (1) − d λ (2) k ≤ k g k C w , otherwise . (47) Proof.
The first part follows trivially. The second one follows from (35) by choosing the optimal ˆ ν and d λ (2) or alternatively ˜ ν and g λ (2) . 12 .1.2 Concentration of ξ ( σ, g , h , ˜ x ) In this section we will show that ξ ( σ, g , h , ˜ x ) concentrates with high probability around its mean. To doso we will instead of looking at (45) look back at (22) which is the original definition of ξ ( σ, g , h , ˜ x ) .Now, before proceeding further we first recall on the following incredible result from [20] related to theconcentrations of Lipschitz functions of Gaussian random variables. Lemma 3 ( [20, 54]) . Let f lip ( · ) : R n −→ R be a Lipschitz function such that | f lip ( a ) − f lip ( b ) | ≤ c lip k a − b k . Let a be a vector comprised of i.i.d. zero-mean, unit variance Gaussian random variablesand let ǫ lip > . Then P ( | f lip ( a ) − Ef lip ( a ) | ≥ ǫ lip Ef lip ( a )) ≤ exp ( − ( ǫ lip Ef lip ( a )) c lip ) . (48)In the following lemma we will show that ξ ( σ, g , h , ˜ x ) is a Lipschitz function. To do so, we will, roughlyspeaking, assume that k w k in the definition of ξ ( σ, g , h , ˜ x ) is bounded by a large constants say C w . Werecall here that our goal in this paper, though, is much bigger than creating a “constant type” bound on k w k .Namely, we will actually establish the precise value that k w k takes in the worst case with overwhelmingprobability. Clearly, knowing that one could then use much better value than C w to upper bound k w k inthe definition of ξ ( σ, g , h , ˜ x ) . However, for the purposes of the concentration inequalities any constant (ofcourse independent of n ) is fine. In fact, any sub-root dependence on n would be fine too, it is just that inthat case “overwhelming” wouldn’t be negative exponential any more. Lemma 4.
Let $g$ and $h$ be $m$ and $n$ dimensional vectors, respectively, with i.i.d. standard normal variables as their components. Let $\sigma > 0$ be an arbitrary scalar. Let $\xi(\sigma, g, h, \tilde{x})$ be as in (22). Further let $\epsilon_{lip} > 0$ be any constant. Then

$P(|\xi(\sigma, g, h, \tilde{x}) - E\xi(\sigma, g, h, \tilde{x})| \geq \epsilon_{lip} E\xi(\sigma, g, h, \tilde{x})) \leq \exp\left\{-\frac{(\epsilon_{lip} E\xi(\sigma, g, h, \tilde{x}))^2}{2(C_w^2 + \sigma^2)}\right\}.$ (49)

Proof.
We start by setting f lip ( g (1) , h (1) ) = min [ w T σ ] T ∈ S w ( σ, ˜ x ,C w ) q k w k + σ k g (1) k + n X i =1 h (1) i w i ! . (50)Further, let w (1) lip be the solution of the minimization in (50). Then, clearly f lip ( g (1) , h (1) ) = q k w (1) lip k + σ k g (1) k + n X i =1 h (1) i ( w (1) lip ) i ! , (51)where ( w (1) lip ) i is the i -th index of w (1) lip . In an analogous fashion set f lip ( g (2) , h (2) ) = min [ w T σ ] T ∈ S w ( σ, ˜ x ,C w ) q k w k + σ k g (2) k + n X i =1 h (2) i w i ! , (52)13nd let w (2) lip be the solution of the minimization in (51). Then again clearly f lip ( g (2) , h (2) ) = q k w (2) lip k + σ k g (2) k + n X i =1 h (2) i ( w (2) lip ) i ! , (53)where of course ( w (2) lip ) i is the i -th index of w (2) lip . Now assume that f lip ( g (1) , h (1) ) = f lip ( g (2) , h (2) ) (ifthey are equal we are trivially done). Further let f lip ( g (1) , h (1) ) < f lip ( g (2) , h (2) ) (the rest of the argumentof course can trivially be flipped if f lip ( g (1) , h (1) ) > f lip ( g (2) , h (2) ) ). We then have | f lip ( g (2) , h (2) ) − f lip ( g (1) , h (1) ) | = f lip ( g (2) , h (2) ) − f lip ( g (1) , h (1) )= q k w (2) lip k + σ k g (2) k + n X i =1 h (2) i ( w (2) lip ) i ! − q k w (1) lip k + σ k g (1) k + n X i =1 h (1) i ( w (1) lip ) i ! ≤ q k w (1) lip k + σ k g (2) k + n X i =1 h (2) i ( w (1) lip ) i ! − q k w (1) lip k + σ k g (1) k + n X i =1 h (1) i ( w (1) lip ) i ! = q k w (1) lip k + σ ( k g (2) k − k g (1) k ) + n X i =1 ( h (2) i − h (1) i )( w (1) lip ) i ≤ q k w (1) lip k + σ ( k g (2) − g (1) k ) + k ( h (2) − h (1) k k w (1) lip k ≤ q k w (1) lip k + σ q k g (2) − g (1) k + k h (2) − h (1) k ≤ p C w + σ q k g (2) − g (1) k + k h (2) − h (1) k , (54)where the first inequality follows by sub-optimality of w (1) lip in (52). Connecting beginning and end in (54)and combining it with (50) one then has that ξ ( σ, g , h , ˜ x ) is Lipschitz with c lip = p C w + σ . (49) theneasily follows by Lemma 3.One then has that k h + ˆ ν z (1) − d λ (2) k and k h + ˜ ν z (1) − g λ (2) k concentrate as well which automaticallyimplies that ˆ w also concentrates. More formally, one then has analogues to (49) P ( |k h + ˆ ν z (1) − d λ (2) k − E k h + ˆ ν z (1) − d λ (2) k | ≥ ǫ ( norm )1 E k h + ˆ ν z (1) − d λ (2) k ) ≤ e − ǫ ( norm )2 n P ( |k h + ˜ ν z (1) − g λ (2) k − E k h + ˜ ν z (1) − g λ (2) k | ≥ ǫ ( norm )3 E k h + ˜ ν z (1) − g λ (2) k ) ≤ e − ǫ ( norm )4 n P ( |k ˆ w k − E k ˆ w k | ≥ ǫ ( w )1 E k ˆ w k ) ≤ e − ǫ ( w )2 n , (55)where as usual ǫ ( norm )1 > , ǫ ( norm )2 > , and ǫ ( w )1 > are arbitrarily small constants and ǫ ( norm )3 , ǫ ( norm )4 ,and ǫ ( w )2 are constant dependent on ǫ ( norm )1 > , ǫ ( norm )2 > , and ǫ ( w )1 > , respectively, but independentof n . 14ow, we return to the probabilistic analysis of (21). Combining (21), (22), and (49) we have p l = P min [ w T σ ] T ∈ S w ( σ, ˜ x ,C w ) q k w k + σ k g k + n X i =1 h i w i ! + h n +1 σ ≥ ζ ( l ) obj ! = P (cid:16) ξ ( σ, g , h , ˜ x ) + h n +1 σ ≥ ζ ( l ) obj (cid:17) ≥ (cid:18) − exp (cid:26) − ( ǫ lip Eξ ( σ, g , h , ˜ x )) C w + σ ) (cid:27)(cid:19) P (cid:16) (1 − ǫ lip ) Eξ ( σ, g , h , ˜ x ) + h n +1 σ ≥ ζ ( l ) obj (cid:17) . (56)Since h n +1 is a standard normal one easily has P ( h n +1 σ ≥ − ǫ ( h )1 √ n ) ≥ − e − ǫ ( h )2 n where ǫ ( h )1 > is anarbitrarily small constant and ǫ ( h )2 is a constant dependent on ǫ ( h )1 and σ but independent on n . 
By choosing ζ ( l ) obj = (1 − ǫ lip ) Eξ ( σ, g , h , ˜ x ) − ǫ ( h )1 √ n, (57)one then from (56) has p l = P min [ w T σ ] T ∈ S w ( σ, ˜ x ,C w ) q k w k + σ k g k + n X i =1 h i w i ! + h n +1 σ ≥ (1 − ǫ lip ) Eξ ( σ, g , h , ˜ x ) − ǫ ( ζ )1 √ n ! ≥ (cid:18) − exp (cid:26) − ( ǫ lip Eξ ( σ, g , h , ˜ x )) C w + σ ) (cid:27)(cid:19) (1 − e − ǫ ( h )2 n ) . (58)As stated after (20), (58) is conceptually enough to establish a “high probability” lower bound on ζ obj . Thenext few steps that formally do so are rather obvious but we include them for the completeness. Combining(19) and (58) we obtain P ( min [ w T σ ] T ∈ S w ( σ, ˜ x ,C w ) ( f obj ( σ, w ) + q k w k + σ g ) ≥ ζ ( l ) obj ) ≥ (cid:18) − exp (cid:26) − ( ǫ lip Eξ ( σ, g , h , ˜ x )) C w + σ ) (cid:27)(cid:19) (1 − e − ǫ ( h )2 n ) , , (59)where ζ ( l ) obj is as in (57). Now, one further has P ( min [ w T σ ] T ∈ S w ( σ, ˜ x ,C w ) ( f obj ( σ, w ) + q k w k + σ g ) ≥ ζ ( l ) obj ) ≤ P ( min [ w T σ ] T ∈ S w ( σ, ˜ x ,C w ) ( f obj ( σ, w )) + p C w + σ g ≥ ζ ( l ) obj ) . (60)Since g is a standard normal one easily again has P ( g p C w + σ ≤ ǫ ( g )1 √ n ) ≥ − e − ǫ g )1 n where ǫ ( g )1 > is an arbitrarily small constant and ǫ ( g )2 is a constant dependent on ǫ ( g )1 , σ , and C w but independent on n .Applying this to the first term on the right hand side of the above inequality one obtains P ( min [ w T σ ] T ∈ S w ( σ, ˜ x ,C w ) ( f obj ( σ, w )) + p C w + σ g ≥ ζ ( l ) obj ) ≤ P ( min [ w T σ ] T ∈ S w ( σ, ˜ x ,C w ) ( f obj ( σ, w )) ≥ ζ ( l ) obj − ǫ ( g )1 √ n ) + e − ǫ g )1 n . (61)15ow let ζ lowerobj = ζ lobj − ǫ ( g )1 √ n . From (57) then obviously ζ ( lower ) obj = (1 − ǫ lip ) Eξ ( σ, g , h , ˜ x ) − ǫ ( h )1 √ n − ǫ ( g )1 √ n. (62)Also let ǫ lower be a constant such that − e − ǫ lower n ≤ (cid:18) − exp (cid:26) − ( ǫ lip Eξ ( σ, g , h , ˜ x )) C w + σ ) (cid:27)(cid:19) (1 − e − ǫ ( h )2 n ) − e − ǫ g )1 n . (63)Then a combination of (11), (59), (60), (61), (62), and (63) gives P ( ζ obj ≥ ζ ( lower ) obj ) ≥ P ( ζ ( help ) obj ≥ ζ ( lower ) obj )(1 − e − ǫ C w n )= P ( min [ w T σ ] T ∈ S w ( σ, ˜ x ,C w ) ( f obj ( σ, w )) ≥ ζ ( lower ) obj ) ≥ (1 − e − ǫ lower n )(1 − e − ǫ C w n ) . (64)We summarize the results from this subsection in the following lemma. Lemma 5.
Let v be an n × vector of i.i.d. zero-mean variance σ Gaussian random variables and let A be an m × n matrix of i.i.d. standard normal random variables. Consider an ˜ x defined in (7) and a y defined in (3) for x = ˜ x . Let then ζ obj be as defined in (10) and let w be the solution of (14). Assume P ( k w k ≤ C w ) ≥ − e − ǫ C w n for an arbitrarily large constant C w and a constant ǫ C w > dependent on C w but independent of n . Then there is a constant ǫ lower > P ( ζ obj ≥ ζ ( lower ) obj ) ≥ (1 − e − ǫ lower n )(1 − e − ǫ C w n ) , (65) where ζ ( lower ) obj = (1 − ǫ lip ) Eξ ( σ, g , h , ˜ x ) − ǫ ( h )1 √ n − ǫ ( g )1 √ n, (66) ξ ( σ, g , h , ˜ x ) is as defined in (22), and ǫ lip , ǫ ( h )1 , ǫ ( g )1 are all positive arbitrarily small constants.Proof. Follows from the previous discussion. ζ obj In this section we present a general framework for finding a “high-probability” upper bound on ζ obj . To thatend, let r and C w up be positive scalars (in this subsection we present a general framework and take thesescalars to be arbitrary; however to make the bound as tight sa possible in the following subsection we willmake them take particular values). Now, if we can show that there is a w ∈ R n such that k ˜ x + w k ≤ k ˜ x k and k v − A w k ≤ r with overwhelming probability then r can act as an upper bound on ζ obj . We then startby looking at the following optimization problem min w k ˜ x + w k − k ˜ x k k A v (cid:20) w σ (cid:21) k ≤ r k w k ≤ C w up , (67)where A v is as defined right after (14). If we can show that with overwhelming probability the objectivevalue of the above optimization problem is negative then r will be a valid “high probability” upper-boundon ζ obj . Moreover, it will be achieved by a w for which it will hold that k w k ≤ C w up .16e now proceed in a fashion similar to the one from Subsection 2.1.1. To remove the absolute valueswe introduce auxiliary variables t i , ≤ i ≤ n , and transform the above problem to min w , t n X i =1 t i − k ˜ x k subject to ˜ x i + w i − t i ≤ , n − k + 1 ≤ i ≤ n − ˜ x i − w i − t i ≤ , n − k + 1 ≤ i ≤ n w i − t i ≤ , ≤ i ≤ n − k − w i − t i ≤ , ≤ i ≤ n − k k A v (cid:20) w σ (cid:21) k ≤ r k w k ≤ C w up . (68)We also slightly modify the first of the constraints from (67) in the following way min w , t , b n X i =1 t i − k ˜ x k subject to ˜ x i + w i − t i ≤ , n − k + 1 ≤ i ≤ n − ˜ x i − w i − t i ≤ , n − k + 1 ≤ i ≤ n w i − t i ≤ , ≤ i ≤ n − k − w i − t i ≤ , ≤ i ≤ n − k k b k ≤ r (cid:2) − A v (cid:3) (cid:20) w σ (cid:21) = b k w k ≤ C w up . (69)The Lagrange dual of the above problem then becomes L ( λ (1) , λ (2) , ν (1) , γ , γ , w , t , b ) = n X i =1 t i −k ˜ x k + n X i = n − k +1 λ (1) i (˜ x i + w i − t i )+ n X i = n − k +1 λ (2) i ( − ˜ x i − w i − t i )+ n − k X i =1 λ (1) i ( w i − t i )+ n − k X i =1 λ (2) i ( − w i − t i ) − ν (1) A w + ν (1) v σ − ν (1) b + γ ( n X i =1 b − r )+ γ ( k w k − C w up ) , (70)where ν (1) is × m row vector of Lagrange variables and λ (1) , λ (2) are as in previous sections. Afterrearranging terms we further have L ( λ (1) , λ (2) , ν (1) , γ , γ , w , t , b ) = −k ˜ x k + n X i =1 t i (1 − λ (1) i − λ (2) i ) + n X i = n − k +1 λ (1) i (˜ x i + w i )+ n X i = n − k +1 λ (2) i ( − ˜ x i − w i )+ n − k X i =1 λ (1) i w i − n − k X i =1 λ (2) i w i − ν (1) A w + ν (1) v σ − ν (1) b + γ ( n X i =1 b − r )+ γ ( k w k − C w up ) . 
(71)17fter a few further arrangements we finally have L ( λ (1) , λ (2) , ν (1) , γ , γ , w , t , b ) = n X i =1 t i (1 − λ (1) i − λ (2) i ) + n X i = n − k +1 ( λ (1) i − λ (2) i − x i + (( λ (1) − λ (2) ) T − ν (1) A ) w + ν (1) v σ − ν (1) b + γ ( n X i =1 b − r ) + γ ( k w k − C w up ) . (72)Setting ( ν − λ (1) i − λ (2) i ) = 0 , ≤ i ≤ n , (to insure that the dual is bounded) we have L ( λ (2) , ν (1) , γ , γ , w , b ) = − n X i = n − k +1 λ (2) i ˜ x i + (( z (1) − λ (2) ) T − ν (1) A ) w + ν (1) v σ − ν (1) b + γ ( n X i =1 b − r ) + γ ( k w k − C w up ) . (73)Finally we can write a dual problem to (69) max λ (2) ,ν (1) ,γ ,γ min w , b L ( λ (2) , ν (1) , γ , γ , w , b ) subject to ≤ λ (2) i ≤ , ≤ i ≤ nγ ≥ ,γ ≥ , (74)where we of course use the fact that the strict duality obviously holds. Now, we minimize over w by settingthe derivatives to zero d L ( λ (2) , ν (1) , γ , γ , w , b ) d w = (( z (1) − λ (2) ) T − ν (1) A ) T + 2 γ w . (75)From (75) we easily have w = (( z (1) − λ (2) ) T − ν (1) A ) T γ . (76)Plugging (75) back in (73) we further have L ( λ (2) , ν (1) , γ , γ , b ) = − n X i = n − k +1 λ (2) i ˜ x i − k ( z (1) − λ (2) ) T − ν (1) A k γ + ν (1) v σ − ν (1) b + γ ( n X i =1 b − r ) − γ C w up . (77)Now, we minimize L ( λ (1) , λ (2) , ν (1) , γ , γ , b ) over b by setting the derivatives to zero d L ( λ (1) , λ (2) , ν (1) , γ , γ , b ) d b = − ν (1) + 2 γ w . (78)From (78) we easily have b = ν (1) γ . (79)18lugging (79) back in (77) we have L ( λ (2) , ν (1) , γ , γ ) = − n X i = n − k +1 λ (2) i ˜ x i + − k ( z (1) − λ (2) ) T − ν (1) A k γ + ν (1) v σ − k ν (1) k γ − γ r − γ C w up , (80)and finally an equivalent to (69) max λ (2) ,ν (1) ,γ ,γ L ( λ (1) , λ (2) , ν (1) , γ , γ ) subject to ≤ λ (2) i ≤ , ≤ i ≤ nγ ≥ ,γ ≥ . (81)After doing the trivial maximization over γ and γ one obtains max λ (2) ,ν (1) − n X i = n − k +1 λ (2) i ˜ x i − C w up k ( z (1) − λ (2) ) T − ν (1) A k + ν (1) v σ − k ν (1) k r subject to ≤ λ (2) i ≤ , ≤ i ≤ n. (82)We rewrite (82) in a slightly more convenient form − min λ (2) ,ν (1) max k a k = C w up (( z (1) − λ (2) ) T − ν (1) A ) a − ν (1) v σ + k ν (1) k r + 2 n X i = n − k +1 λ (2) i ˜ x i subject to ≤ λ (2) i ≤ , ≤ i ≤ n. (83)Now let us define f ( up ) obj as − f ( up ) obj = − min λ (2) ,ν (1) max k a k = C w up (( z (1) − λ (2) ) T − ν (1) A ) a − ν (1) v σ + k ν (1) k r + 2 n X i = n − k +1 λ (2) i ˜ x i subject to ≤ λ (2) i ≤ , ≤ i ≤ n. (84)Any r such that lim n → P ( f ( up ) obj ≥
0) = 1 is then a valid “high-probability” upper bound. We now introduce a refinement of a lemma from [62] which itself is a slightly modified Lemma 1 (Lemma 1 is of course the backbone of the escape through a mesh theorem utilized in [63]).
Lemma 6.
Let A be an m × n matrix with i.i.d. standard normal components. Let g and h be m × and ( n + 1) × vectors, respectively, with i.i.d. standard normal components. Also, let g be a standard normalrandom variable and let Λ be a set such that Λ = ( λ (2) | ≤ λ (2) i ≤ , ≤ i ≤ n ) . Then P ( min λ (2) ∈ Λ ,ν (1) ∈ R n \ max k a k = C w up ( − ν (1) (cid:2) A v (cid:3) (cid:20) a σ (cid:21) + k ν (1) k g − ψ a ,λ (2) ,ν (1) ) ≥ ≥ P ( min λ (2) ∈ Λ ,ν (1) ∈ R n \ max k a k =1 ( k ν (1) k ( n X i =1 h i a i + h n +1 σ ) + q C w up + σ m X i =1 g i ν (1) i − ψ a ,λ (2) ,ν (1) ) ≥ . (85)19et ψ a ,λ (2) ,ν (1) = ǫ ( g )3 √ n k ν (1) k − a T ( z (1) − λ (2) ) − k ν (1) k r − n X i = n − k +1 λ (2) i ˜ x i , (86)with ǫ ( g )3 > being an arbitrarily small constant independent of n . The left-hand side of the inequality in(85) is then the following probability of interest p u = P ( min λ (2) ∈ Λ ,ν (1) ∈ R n \ max k a k = C w up ( k ν (1) k ( n X i =1 h i a i + h n +1 σ ) + q C w up + σ m X i =1 g i ν (1) i − ǫ ( g )3 √ n k ν (1) k + a T ( z (1) − λ (2) ) + k ν (1) k r + 2 n X i = n − k +1 λ (2) i ˜ x i ) ≥ . After solving the inner maximization over a and pulling out k ν k one has p u = P ( min λ (2) ∈ Λ ,ν ∈ R n \ ( C w up k h + 1 k ν (1) k ( z (1) − λ (2) ) k + h n +1 σ + q C w up + σ m X i =1 g i ν (1) i k ν (1) k − ǫ ( g )3 √ n + r + 2 n X i = n − k +1 λ (2) i k ν (1) k ˜ x i ) ≥ . After minimization of the second term over a unit norm vector we further have p u = P ( min λ (2) ∈ Λ ,ν ∈ R n \ ( C w up k h + 1 k ν (1) k ( z (1) − λ (2) ) k + h n +1 σ − q C w up + σ k g k − − ǫ ( g )3 √ n + r + 2 n X i = n − k +1 λ (2) i k ν (1) k ˜ x i ) ≥ . (87)Now we change variables so that ν = k ν (1) k and λ (2) = λ (2) k ν (1) k and redefine Λ by setting Λ (2) = { λ (2) ∈ R n | ≤ λ (2) i ≤ ν, ≤ i ≤ n } . (88)We also recall that z (1) remains as defined right after (36). Plugging all of this back in (87) gives us p u = P ( r + h n +1 σ − ǫ ( g )3 √ n − max λ (2) ∈ Λ (2) ,ν ≥ ( q C w up + σ k g k − C w up k h + ν z (1) − λ (2) ) k − n X i = n − k +1 λ (2) i ˜ x i ) ≥ . (89)Now, let ξ up ( σ, g , h , ˜ x , C w up ) = max λ (2) ∈ Λ (2) ,ν ≥ ( q C w up + σ k g k − C w up k h + ν z (1) − λ (2) ) k − n X i = n − k +1 λ (2) i ˜ x i ) . (90)In the following lemma we will show that ξ up ( σ, g , h , ˜ x , C w up ) is a Lipschitz function. Lemma 7.
Let $g$ and $h$ be $m$ and $n$ dimensional vectors, respectively, with i.i.d. standard normal variables as their components. Let $\sigma > 0$ be an arbitrary scalar. Let $\xi_{up}(\sigma, g, h, \tilde{x}, C_{w_{up}})$ be as in (90). Further let $\epsilon_{lip} > 0$ be any constant. Then

$P(|\xi_{up}(\sigma, g, h, \tilde{x}, C_{w_{up}}) - E\xi_{up}(\sigma, g, h, \tilde{x}, C_{w_{up}})| \geq \epsilon_{lip} E\xi_{up}(\sigma, g, h, \tilde{x}, C_{w_{up}})) \leq \exp\left\{-\frac{(\epsilon_{lip} E\xi_{up}(\sigma, g, h, \tilde{x}, C_{w_{up}}))^2}{2(C_{w_{up}}^2 + \sigma^2)}\right\}.$ (91)

Proof.
The proof will be similar to the corresponding one from Subsection 2.1.2. We start by setting f lip ( g (1) , h (1) ) = max λ (2) ∈ Λ (2) ,ν ≥ ( q C w up + σ k g (1) k − C w up k h (1) + ν z (1) − λ (2) ) k − n X i = n − k +1 λ (2) i ˜ x i ) . (92)Further, let ν ( lip ) and λ ( lip ) be the solutions of the minimization in (92). Then, clearly f lip ( g (1) , h (1) ) = ( q C w up + σ k g (1) k − C w up k h (1) + ν ( lip ) z (1) − λ ( lip ) ) k − n X i = n − k +1 λ ( lip ) i ˜ x i ) . (93)In an analogous fashion set f lip ( g (2) , h (2) ) = max λ (2) ∈ Λ (2) ,ν ≥ ( q C w up + σ k g (2) k − C w up k h (2) + ν z (1) − λ (2) ) k − n X i = n − k +1 λ (2) i ˜ x i ) , (94)and let ν ( lip ) and λ ( lip ) be the solutions of the minimization in (94). Then, clearly f lip ( g (2) , h (2) ) = ( q C w up + σ k g (2) k − C w up k h (2) + ν ( lip ) z (1) − λ ( lip ) ) k − n X i = n − k +1 λ ( lip ) i ˜ x i ) , (95)Now assume that f lip ( g (1) , h (1) ) = f lip ( g (2) , h (2) ) (if they are equal we are trivially done). Further let f lip ( g (1) , h (1) ) < f lip ( g (2) , h (2) ) (the rest of the argument of course can trivially be flipped if f lip ( g (1) , h (1) ) > lip ( g (2) , h (2) ) ). We then have | f lip ( g (2) , h (2) ) − f lip ( g (1) , h (1) ) | = f lip ( g (2) , h (2) ) − f lip ( g (1) , h (1) )= ( q C w up + σ k g (2) k − C w up k h (2) + ν ( lip ) z (1) − λ ( lip ) k − n X i = n − k +1 λ ( lip ) i ˜ x i ) − ( q C w up + σ k g (1) k − C w up k h (1) + ν ( lip ) z (1) − λ ( lip ) k − n X i = n − k +1 λ ( lip ) i ˜ x i ) ≤ ( q C w up + σ k g (2) k − C w up k h (2) + ν ( lip ) z (1) − λ ( lip ) k − n X i = n − k +1 λ ( lip ) i ˜ x i ) − ( q C w up + σ k g (1) k − C w up k h (1) + ν ( lip ) z (1) − λ ( lip ) k − n X i = n − k +1 λ ( lip ) i ˜ x i )= q C w up + σ ( k g (2) k − k g (1) k ) − C w up ( k h (2) + ν ( lip ) z (1) − λ ( lip ) k − k h (2) + ν ( lip ) z (1) − λ ( lip ) k ) ≤ q C w up + σ k g (2) − g (1) k + C w up ( k h (2) − h (1) k ) ≤ q C w up + σ q k g (2) − g (1) k + ( k h (2) − h (1) k ) , (96)where the first inequality follows by sub-optimality of ν lip and λ ( lip ) in (94). Connecting beginning andend in (96) and combining it with (92) one then has that ξ up ( σ, g , h , ˜ x , C w up ) is Lipschitz with c lip = p C w + σ . (91) then easily follows by Lemma 3.We continue by following the line of arguments right after (56). As stated there P ( h n +1 σ ≥ − ǫ ( h )1 √ n ) ≥ − e − ǫ ( h )2 n where ǫ ( h )1 > is an arbitrarily small constant and ǫ ( h )2 is a constant dependent on ǫ ( h )1 and σ but independent on n . Set r = ζ ( u ) obj = (1 + ǫ lip ) Eξ ( σ, g , h , ˜ x , C w up ) + ǫ ( h )1 √ n + ǫ ( g )3 √ n. (97)One then has after combing (89) and Lemma 7 p u = P ( r + h n +1 σ − ǫ ( g )3 √ n − max λ (2) ∈ Λ (2) ,ν ≥ ( q C w up + σ k g k − C w up k h + ν z (1) − λ (2) ) k − n X i = n − k +1 λ (2) i ˜ x i ) ≥ ≥ P ((1 + ǫ lip ) Eξ ( σ, g , h , ˜ x , C w up ) ≥ max λ (2) ∈ Λ (2) ,ν ≥ ( q C w up + σ k g k − C w up k h + ν z (1) − λ (2) ) k − n X i = n − k +1 λ (2) i ˜ x i ))(1 − e − ǫ ( h )2 n ) ≥ − exp ( − ( ǫ lip Eξ ( σ, g , h , ˜ x , C w up )) C w up + σ ) )! (1 − e − ǫ ( h )2 n ) . (98)As stated after (20), (98) is conceptually enough to establish a “high probability” upper bound on ζ obj . Whatis left is to connect it with (84). Combining (98), (85), and (84) we then obtain P ( f ( up ) obj ≥ ≥ − exp ( − ( ǫ lip Eξ ( σ, g , h , ˜ x , C w up )) C w up + σ ) )! 
$(1 - e^{-\epsilon_2^{(h)} n})(1 - e^{-\epsilon_4^{(g)} n}),$ (99)

where we used the fact that $g$ is standard normal and therefore $P(g - \epsilon_3^{(g)}\sqrt{n} \leq 0) \geq 1 - e^{-\epsilon_4^{(g)} n}$ for an arbitrarily small $\epsilon_3^{(g)} > 0$ and a constant $\epsilon_4^{(g)}$ dependent on $\epsilon_3^{(g)}$ but independent of $n$. We are now in a position to summarize the results from this subsection in the following lemma, which is essentially an “upper-bound” analogue to Lemma 5.

Lemma 8.
Let $v$ be an $m \times 1$ vector of i.i.d. zero-mean, variance $\sigma^2$ Gaussian random variables and let $A$ be an $m \times n$ matrix of i.i.d. standard normal random variables. Consider an $\tilde{x}$ defined in (7) and a $y$ defined in (3) for $x = \tilde{x}$. Let then $\zeta_{obj}$ be as defined in (10) and let $w$ be the solution of (14). Then there is a constant $\epsilon_{upper} > 0$ such that

$P(\zeta_{obj} \leq \zeta_{obj}^{(upper)}) \geq 1 - e^{-\epsilon_{upper} n},$ (100)

where

$\zeta_{obj}^{(upper)} = (1 + \epsilon_{lip}) E\xi_{up}(\sigma, g, h, \tilde{x}, C_{w_{up}}) + \epsilon_1^{(h)}\sqrt{n} + \epsilon_3^{(g)}\sqrt{n},$ (101)

$\xi_{up}(\sigma, g, h, \tilde{x}, C_{w_{up}})$ is as defined in (90), $\epsilon_{lip}$, $\epsilon_1^{(h)}$, $\epsilon_3^{(g)}$ are all positive arbitrarily small constants, and $C_{w_{up}}$ is a constant such that $\|w\|_2 \leq C_{w_{up}}$.

Proof. Follows from the previous discussion.
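Lemmas 5 and 8 sandwich $\zeta_{obj}$ between quantities that, up to arbitrarily small constants, are determined by the expectations $E\xi$ and $E\xi_{up}$. A simple way to get a numerical feel for the concentration that drives both lemmas is to solve (6) over many independent draws of $(A, v)$ and inspect the spread of the resulting optimal objective values. The sketch below is our own illustrative experiment (using cvxpy, with arbitrarily chosen problem sizes), not a computation performed in the paper.

```python
import numpy as np
import cvxpy as cp

# Illustrative experiment (not from the paper): empirical concentration of the
# optimal objective zeta_obj of (6) over independent draws of A and v.
def zeta_obj_samples(n=100, alpha=0.5, beta=0.1, sigma=0.1, trials=25, seed=0):
    rng = np.random.default_rng(seed)
    m, k = int(alpha * n), int(beta * n)
    vals = []
    for _ in range(trials):
        A = rng.standard_normal((m, n))
        x_tilde = np.zeros(n)
        x_tilde[n - k:] = np.abs(rng.standard_normal(k))
        y = A @ x_tilde + sigma * rng.standard_normal(m)
        x = cp.Variable(n)
        prob = cp.Problem(cp.Minimize(cp.norm(y - A @ x, 2)),
                          [cp.norm(x, 1) <= np.linalg.norm(x_tilde, 1)])
        prob.solve()
        vals.append(prob.value)
    return np.array(vals)

# vals = zeta_obj_samples()
# print(vals.mean(), vals.std())   # the relative spread shrinks as n grows
```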
In this section we specialize the general bounds introduced above and show how they can match each other.We will divide presentation in three subsections. In the first of the subsections we will make a connection tothe noiseless case and show how one can then remove the constraint from (45), (46), and (47). In the secondsubsection we will consider a w such that |k w k − k ˆ w k | ≥ ǫ w up k ˆ w k . We will then quantify how muchthe lower bound that can be computed for such a w through the framework presented in Section 2.1 deviatesfrom the optimal one obtained for ˆ w . In the last subsection we will then show that there will be a w suchthat the upper bound computed through the framework presented in Section 2.2 will deviate less. That willin essence establish that upper and lower bounds computed in the previous sections indeed match. We willthen draw conclusions as for the consequences which such a matching of the bounds leaves on a couple ofLASSO parameters. ℓ optimization In this subsection we establish a connection between the constraint in (45), (46), and (47) and the funda-mental performance characterization of ℓ optimization derived in [62] (and of course earlier in the contextof neighborly polytopes in [25]). We first recall on the condition from Lemma 2. The condition states s σ C w k h + ˆ ν z (1) − d λ (2) k ≤ k g k , (102)where C w is an arbitrarily large constant and ˆ ν and d λ (2) are the solutions of max σ q k g k − k h + ν z (1) − λ (2) k − n X i = n − k +1 λ (2) i ˜ x i subject to ≤ λ (2) i ≤ ν, ≤ i ≤ nν ≥ . (103)23ow we note the following equivalent to (103) for the case when nonzero components of ˜ x are infinite max σ q k g k − k h + ν z (1) − λ (2) k subject to ≤ λ (2) i ≤ ν, ≤ i ≤ n − kλ (2) i = 0 , n − k + 1 ≤ i ≤ nν ≥ . (104)Now, to make the new observations easily comparable to the corresponding ones from [61, 63] we set ¯ h = [ | h | (1)(1) , | h | (2)(2) , . . . , | h | ( n − k )( n − k ) , h n − k +1 , h n − k +2 , . . . , h n ] T , (105)where [ | h | (1)(1) , | h | (2)(2) , . . . , | h | ( n − k )( n − k ) ] are magnitudes of [ h , h , . . . , h n − k ] sorted in increasing order (possibleties in the sorting process are of course broken arbitrarily). Also we let z (2) be such that z (2) i = − z (1) i , n − k + 1 ≤ i ≤ n and z (2) i = z (1) i , ≤ i ≤ n − k . It is then relatively easy to see that the above optimizationproblem is equivalent to max σ q k g k − k ¯ h − ν z (2) + λ (2) k subject to ≤ λ (2) i ≤ ν, ≤ i ≤ n − kλ (2) i = 0 , n − k + 1 ≤ i ≤ nν ≥ . (106)Let ν ℓ and λ ( ℓ ) be the solution of the above maximization. Then, as we showed in [63] and [62], theinequality E k g k > E k ¯ h − ν ℓ z (2) + λ ( ℓ ) k (107)establishes the following fundamental performance characterization of the ℓ optimization algorithm from(2) that could be used instead of LASSO to recover x in (1) (which is a noiseless version of (3)) (1 − β w ) q π e − ( erfinv ( − αw − βw )) α w − √ erfinv ( 1 − α w − β w ) = 0 , (108)where of course α w = mn and β w = kn . As it is also shown in [63] and [62] both of the quantities under theexpected values in (107) nicely concentrate. Then with overwhelming probability one has that for any pair ( α, β ) that satisfies (or lies below) the above fundamental performance characterization of ℓ optimization k g k > k ¯ h − ν ℓ z (2) + λ ( ℓ ) k . 
(109)Moreover, since λ (2) i ≥ , n − k + 1 ≤ i ≤ n , in (103) one actually has that (109) implies that withoverwhelming probability k g k > k h + ˆ ν z (1) − d λ (2) k , (110)which for sufficiently large C w is the same as (102). We then in what follows assume that pair ( α, β ) is suchthat it satisfies the fundamental ℓ optimization performance characterization (or is in the region below it)and therefore proceed by ignoring the condition (102). (Strictly speaking, all our overwhelming probabilitiesbelow should be multiplied with an overwhelming probability that (108) holds; to maintain writing easierwe will skip this detail.) 24 .3.2 Deviation from the lower-bound In this subsection we show that k w lasso k can not deviate substantially from k ˆ w k without substantiallyaffecting the value of the lower bound on the objective in (6) that is derived in Section 2.1. To that end letus assume that there is a w off that is the solution of the LASSO from (6) (or to be slightly more precisethat is such that ˆ x = ˜ x + w off , where obviously ˆ x is the solution of (6)). Further, let |k w off k − k ˆ w k | ≥ ǫ w up k ˆ w k , where ǫ w up is an arbitrarily small constant.One can then proceed by repeating the same line of thought as in Section 2.1. The only difference willbe that now C w = k w off k and consequently in the definition of S w ( σ, ˜ x , C w ) , k w k ≤ C w changes to k w k = C w = k w off k . This difference will of course not affect the concept presented in Section 2.1. Theonly real consequence will be the change of (22). Adapted to the new scenario (22) becomes ξ off ( σ, g , h , ˜ x , w off ) = min w , t q k w off k + σ k g k + n X i =1 h i w i subject to n X i =1 t i ≤ k ˜ x k ˜ x i + w i − t i ≤ , n − k + 1 ≤ i ≤ n − ˜ x i − w i − t i ≤ , n − k + 1 ≤ i ≤ n w i − t i ≤ , ≤ i ≤ n − k − w i − t i ≤ , ≤ i ≤ n − k q k w k + σ ≤ q w off + σ . (111)One can then proceed further with solving the Lagrangian to obtain ξ off ( σ, g , h , ˜ x , w off ) = max λ (2) ∈ Λ (2) ,ν ≥ ( q w off + σ k g k − w off k h + ν z (1) − λ (2) ) k − n X i = n − k +1 λ (2) i ˜ x i ) . (112)Using the probabilistic arguments from Section 2.1 one then from Lemma 5 has that if w off is the solution of(6) then its objective value with overwhelming probability is lower bounded by (1 − ǫ lip ) Eξ off ( σ, g , h , ˜ x , w off ) ( ξ off ( σ, g , h , ˜ x , w off ) is structurally the same as ξ up ( σ, g , h , ˜ x , C w up ) from (90) and therefore easily con-centrates based on Lemma 7). We will now consider in parallel the following lower bound from (42) (clearly,choosing w off = ˆ w would make (112) equivalent to (42)). ξ ov ( σ, g , h , ˜ x ) = max ν ≥ ,λ (2) ∈ Λ (2) σ q k g k − k h + ν z (1) − λ (2) k − n X i = n − k +1 λ (2) i ˜ x i . (113)Now, let as usual ˆ ν and d λ (2) be the solutions of (113). Let ξ help ( σ, g , h , ˜ x , w off ) = q w off + σ k g k − w off k h + ˆ ν z (1) − d λ (2) k − n X i = n − k +1 d λ (2) i ˜ x i . (114)25hen ξ off ( σ, g , h , ˜ x , w off ) − ξ ov ( σ, g , h , ˜ x ) ≥ ξ help ( σ, g , h , ˜ x , w off ) − ξ ov ( σ, g , h , ˜ x )= q w off + σ k g k − w off k h + ˆ ν z (1) − d λ (2) k − σ q k g k − k h + ˆ ν z (1) − d λ (2) k . (115)For the simplicity let |k w off k − k ˆ w k | = ǫ w up k ˆ w k (this restriction is clearly more conservative than |k w off k − k ˆ w k | ≥ ǫ w up k ˆ w k ). Now, we switch to expectations and ignore all ǫ except ǫ w up . 
Since everyquantity that we will consider (see (55)) concentrates ǫ ’s in concentration inequalities can be made arbitrarilyclose to zero; moreover once ǫ w up is fixed all other ǫ ’s can be made arbitrarily small compared to ǫ w up . Also,we will show derivation for w off = (1 + ǫ w up ) k ˆ w k (the derivation for the case w off = (1 − ǫ w up ) k ˆ w k is completely analogous).Now, to facilitate writing we then set all ǫ ’s except ǫ w up to zero. We then have Eξ off ( σ, g , h , ˜ x , w off ) − Eξ ov ( σ, g , h , ˜ x ) ≥ Eξ help ( σ, g , h , ˜ x , w off ) − Eξ ov ( σ, g , h , ˜ x ) . = q (1 + ǫ w up ) ( E k ˆ w k ) + σ E k g k − (1 + ǫ w up ) E k ˆ w k E k h + ˆ ν z (1) − d λ (2) k − σ q ( E k g k ) − ( E k h + ˆ ν z (1) − d λ (2) k ) (116)where . = means that equality is not exact but for a fixed ǫ w up can be made as close to it as needed. In asimilar fashion we have E k ˆ w k . = σE k h + ˆ ν z (1) − d λ (2) k q ( E k g k ) − ( E k h + ˆ ν z (1) − d λ (2) k ) . (117)Before we proceed further we simplify the notation with the following change of variables. g E = E k g k h E = E k h + ˆ ν z (1) − d λ (2) k ξ E = σ q ( E k g k ) − ( E k h + ˆ ν z (1) − d λ (2) k ) = σ q g E − h E w E = E k ˆ w k = σh E q g E − h E . (118)From (118) one easily has h E = g E − ξ E σ w E = h E σ ξ E . (119)26hen a combination of (116), (118), and (119) gives Eξ help ( σ, g , h , ˜ x , w off ) − Eξ ov ( σ, g , h , ˜ x ) . = s (1 + ǫ w up ) ( h E σ ξ E ) + σ g E − (1 + ǫ w up ) w E h E − ξ E = (1 + ǫ w up ) g E σ ξ E s − ξ E (2 ǫ w up + ǫ w up )(1 + ǫ w up ) g E σ − (1 + ǫ w up ) g E σ ξ E + ǫ w up ξ E . (120)Now, assuming that ǫ w up is small (and recognizing that ξ E ≤ g E σ ) from (120) we have (1 + ǫ w up ) g E σ ξ E s − ξ E (2 ǫ w up + ǫ w up )(1 + ǫ w up ) g E σ − (1 + ǫ w up ) g E σ ξ E + ǫ w up ξ E ≈ (1 + ǫ w up ) g E σ ξ E (1 − ξ E (2 ǫ w up + ǫ w up )2(1 + ǫ w up ) g E σ ) − (1 + ǫ w up ) g E σ ξ E + ǫ w up ξ E = − ξ E (2 ǫ w up + ǫ w up )2(1 + ǫ w up ) + ǫ w up ξ E = 2 ǫ w up ξ E (1 + ǫ w up ) − ξ E (2 ǫ w up + ǫ w up )2(1 + ǫ w up ) = ξ E ǫ w up ǫ w up ) . (121)Combining (115), (120), and (121) we finally have Eξ off ( σ, g , h , ˜ x , w off ) − Eξ ov ( σ, g , h , ˜ x ) ≥ ǫ w up ǫ w up ) Eξ E ≥ ǫ w up ǫ w up ) Eξ ov ( σ, g , h , ˜ x ) (122)where the last inequality follows by noting that in the definition of ξ ov ( σ, g , h , ˜ x ) the elements of ˜ x and λ (2) are non-negative.Now, roughly speaking, (122) shows that if k w lasso k were to deviate from k ˆ w k the optimal valueof the objective in (6) would be higher than the lower bound derived in Section 2.1. We summarize theseobservations in the following lemma (essentially a deviating equivalent of Lemma 5 from Section 2.1). Lemma 9.
Let v be an n × vector of i.i.d. zero-mean variance σ Gaussian random variables and let A be an m × n matrix of i.i.d. standard normal random variables. Consider an ˜ x defined in (7) and a y defined in (3) for x = ˜ x . Let then ζ obj be as defined in (10) or (14) and let w off be the solution of (14).Let α and β be below the fundamental characterization (108) and let ˆ w be as defined in (46). Assume that |k w off k − k ˆ w k | ≥ ǫ w up k ˆ w k , where ǫ w up is an arbitrarily small but fixed constant. Then there wouldbe a constant ǫ off > , and arbitrarily small positive constants ǫ lip , ǫ ( h )1 , ǫ ( g )1 such that P ( ζ obj ≥ ζ ( off ) obj ) ≥ − e − ǫ off n , (123) where ζ ( off ) obj = (1 − ǫ lip )(1 + ǫ w up ǫ w up ) ) Eξ ov ( σ, g , h , ˜ x ) − ǫ ( h )1 √ n − ǫ ( g )1 √ n, (124) and ξ ov ( σ, g , h , ˜ x ) is as defined in (42) (or (113)).Proof. Follows from the previous discussion, discussion from Section 2.3.1, and a combination of (112),(115), (122), arguments right after (112), and Lemma 5.27 .3.3 Deviation of the upper bound
In this section we will show that k w lasso k can not deviate from k ˆ w k as much as it was assumed in theprevious section. To do so we will actually continue to assume that it can and then eventually reach acontradiction. As in previous section, let then |k w off k − k ˆ w k | ≥ ǫ w up k ˆ w k , where ǫ w up is an arbitrarilysmall constant. Further, let ξ dual ( σ, g , h , ˜ x ) be ξ dual ( σ, g , h , ˜ x ) = min d ≥ max ν,λ (2) p d + σ k g k − d k h + ν z (1) − λ (2) k − n X i = n − k +1 λ (2) i ˜ x i subject to ν ≥ ≤ λ (2) i ≤ ν, ≤ i ≤ n. (125)Rewriting (125) with a simple sign flipping turns out to be useful in what follows − ξ dual ( σ, g , h , ˜ x ) = max d ≥ min ν,λ (2) − p d + σ k g k + d k h + ν z (1) − λ (2) k + n X i = n − k +1 λ (2) i ˜ x i subject to ν ≥ ≤ λ (2) i ≤ ν, ≤ i ≤ n. (126)The following lemma provides a powerful tool to deal with (126). Lemma 10.
Let ξ dual ( σ, g , h , ˜ x ) be as defined in (126). Further, let − ξ ov ( σ, g , h , ˜ x ) = min ν,λ (2) max d ≥ − p d + σ k g k + d k h + ν z (1) − λ (2) k + n X i = n − k +1 λ (2) i ˜ x i subject to ν ≥ ≤ λ (2) i ≤ ν, ≤ i ≤ n. (127) Then ξ dual ( σ, g , h , ˜ x ) = ξ ov ( σ, g , h , ˜ x ) . (128) Proof.
After solving the inner maximization over d in (127) one has d opt = σ k h + ν z (1) − λ (2) k q k g k − k h + ν z (1) − λ (2) k . (129)Such a d then establishes that the right-hand side of (127) is indeed ξ ov ( σ, g , h , ˜ x ) , i.e, one has as in (42) − ξ ov ( σ, g , h , ˜ x ) = min ν,λ (2) − σ q k g k − k h + ν z (1) − λ (2) k + n X i = n − k +1 λ (2) i ˜ x i subject to ν ≥ ≤ λ (2) i ≤ ν, ≤ i ≤ n. (130)28ow we digress for a moment and consider the following optimization problem min ν,λ (2) , q , q − σ q + n X i = n − k +1 λ (2) i ˜ x i subject to k h + ν z (1) − λ (2) k ≤ q q + q ≤ k g k ν ≥ ≤ λ (2) i ≤ ν, ≤ i ≤ n. (131)Let − ξ (1) ov ( σ, g , h , ˜ x ) be the optimal value of its objective function. Let quadruplet ˆ ν, d λ (2) , ˆ q , ˆ q be thesolution of the above optimization problem. Then it must be k h + ˆ ν z (1) − d λ (2) k = ˆ q (132)and consequently ˆ q = q k g k − k h + ˆ ν z (1) − d λ (2) k − ξ (1) ov ( σ, g , h , ˜ x ) = − σ q k g k − k h + ˆ ν z (1) − d λ (2) k + n X i = n − k +1 d λ (2) i ˜ x i . (133)The above claim is rather obvious but for the completeness we sketch the argument that supports it. Assumethat k h + ˆ ν z (1) − d λ (2) k < ˆ q , then ˆ q < q k g k − k h + ˆ ν z (1) − d λ (2) k , and − ξ (1) ov ( σ, g , h ) would besmaller then the expression on the right-hand side of (133). Now, since (132) and (133) hold one has that − ξ (1) ov ( σ, g , h , ˜ x ) can be determined through the following equivalent to (131) − ξ (1) ov ( σ, g , h , ˜ x ) = min ν,λ (2) − σ q k g k − k h + ν z (1) − λ (2) k + n X i = n − k +1 λ (2) i ˜ x i subject to ν ≥ ≤ λ (2) i ≤ ν, ≤ i ≤ n (134)After comparing (130) and (134) we have − ξ (1) ov ( σ, g , h , ˜ x ) = − ξ ov ( σ, g , h , ˜ x ) . (135)Now, let us write the Lagrange dual of the optimization problem in (131). Let d and γ be Lagrangianvariables such that max d ≥ ,γ ≥ min ν,λ (2) , q , q − σ q + n X i = n − k +1 λ (2) i ˜ x i + d k h + ν z (1) − λ (2) k − d q + γ ( q + q ) − γ k g k subject to ν ≥ ≤ λ (2) i ≤ ν, ≤ i ≤ n. (136)29fter solving the inner minimization over q , q in (136) we have max d ≥ ,γ ≥ min ν,λ (2) − σ + d γ − γ k g k + n X i = n − k +1 λ (2) i ˜ x i + d k h + ν z (1) − λ (2) k subject to ν ≥ ≤ λ (2) i ≤ ν, ≤ i ≤ n. (137)Since the first two terms in the objective function in (137) do not involve neither ν nor λ (2) one can thenmaximize their sum over γ for any d . After that we finally have max d ≥ min ν,λ (2) − p σ + d k g k + n X i = n − k +1 λ (2) i ˜ x i + d k h + ν z (1) − λ (2) k subject to ν ≥ ≤ λ (2) i ≤ ν, ≤ i ≤ n. (138)Let − ξ (2) ov ( σ, g , h , ˜ x ) be the optimal value of the objective function in (138). Since (138) is the dual of (131)and since the strict duality obviously holds (the optimization problem in (131) is clearly convex) one has − ξ (2) ov ( σ, g , h , ˜ x ) = − ξ (1) ov ( σ, g , h , ˜ x ) . (139)On the other hand the optimization problem in (138) is the same as the one in (126) and therefore − ξ (2) ov ( σ, g , h , ˜ x ) = − ξ dual ( σ, g , h , ˜ x ) . (140)Connecting (135), (139), and (140) one finally has − ξ dual ( σ, g , h , ˜ x ) = − ξ ov ( σ, g , h , ˜ x ) (141)which is what is stated in (128). This concludes the proof.Let ˆ d, ˆ ν, d λ (2) be the solution of (125). Clearly, ˆ d = k ˆ w k = σ k h +ˆ ν z (1) − d λ (2) k q k g k −k h +ˆ ν z (1) − d λ (2) k and since allquantities concentrate E ˆ d = E k ˆ w k . = σ E k h +ˆ ν z (1) − d λ (2) k q E k g k − E k h +ˆ ν z (1) − d λ (2) k . 
Now, set C w up = E k ˆ w k in (90).Then a combination of (90), (125), and Lemma 10 gives Eξ up ( σ, g , h , ˜ x , E k ˆ w k ) = E max λ (2) ∈ Λ (2) ,ν ≥ ( p ( E k ˆ w k ) + σ k g k − E k ˆ w k k h + ν z (1) − λ (2) ) k − n X i = n − k +1 λ (2) i ˜ x i )= E max λ (2) ∈ Λ (2) ,ν ≥ ( q ( E ˆ d ) + σ k g k − E ˆ d k h + ν z (1) − λ (2) ) k − n X i = n − k +1 λ (2) i ˜ x i ) . = E min d ≥ max λ (2) ∈ Λ (2) ,ν ≥ ( p d + σ k g k − d k h + ν z (1) − λ (2) ) k − n X i = n − k +1 λ (2) i ˜ x i ) = Eξ ov ( σ, g , h , ˜ x ) . (142)Combining Lemma 14 and (142) one has that with overwhelming probability there is a w such that theobjective in (6) is upper bounded by a quantity arbitrarily close from above to Eξ ov ( σ, g , h , ˜ x ) . On the30ther hand Lemma 5 states that for any w such that |k w k − k ˆ w k k ≥ ǫ w up k ˆ w k , ǫ w up > , the objectivevalue of (6) is with overwhelming probability lower bounded by a quantity that is arbitrarily close frombelow to (1 + ǫ wup ǫ wup ) ) Eξ ov ( σ, g , h , ˜ x ) . Clearly then the assumption of Lemma 5 is unsustainable andone has that k w lasso k can not deviate substantially from k ˆ w k . This then implies that with overwhelmingprobability the objective value of (6) concentrates around Eξ ov ( σ, g , h , ˜ x ) and consequently that k w lasso k concentrates around E k ˆ w k . In this section we connect all of the above. We will summarize the results obtained so far in the followingtheorem.
Theorem 1.
Let v be an n × vector of i.i.d. zero-mean variance σ Gaussian random variables andlet A be an m × n matrix of i.i.d. standard normal random variables. Further, let g and h be m × and n × vectors of i.i.d. standard normals, respectively. Consider a k -sparse ˜ x defined in (7) and a y defined in (3) for x = ˜ x . Let the solution of (6) be ˆ x and let the so-called error vector of LASSO from(6) be w lasso = ˆ x − ˜ x . Let n be large and let constants α = mn and β = kn be below the fundamentalcharacterization (108). Consider the following optimization problem: ξ ov ( σ, g , h , ˜ x ) = max ν,λ (2) σ q k g k − k h + ν z (1) − λ (2) k − n X i = n − k +1 λ (2) i ˜ x i subject to ν ≥ ≤ λ (2) i ≤ ν, ≤ i ≤ n. (143) Let ˆ ν and d λ (2) be the solution of (143). Set k ˆ w k = σ k h + ˆ ν z (1) − d λ (2) k q k g k − k h + ˆ ν z (1) − d λ (2) k . (144) Then: P ((1 − ǫ ( lasso )1 ) Eξ ov ( σ, g , h , ˜ x ) ≤ k y − A ˆ x k ≤ (1 + ǫ ( lasso )1 ) Eξ ov ( σ, g , h , ˜ x )) = 1 − e − ǫ ( lasso )2 n (145) and P ((1 − ǫ ( lasso )1 ) E k ˆ w k ≤ k w lasso k ≤ (1 + ǫ ( lasso )1 ) E k ˆ w k ) = 1 − e − ǫ ( lasso )2 n , (146) where ǫ ( lasso )1 > is an arbitrarily small constant and ǫ ( lasso )2 is a constant dependent on ǫ ( lasso )1 and σ butindependent of n .Proof. Follows from the above discussion and a combination of (42), Lemma 2, discussion in Section 2.3.1,and Lemmas 5, 14, and 10.It may not be clear immediately but the result presented in the above theorem is incredibly powerful.Among other things, it enables one to precisely estimate the norm of the error vector in “noisy” under-determined systems of linear equations. Moreover, it can do so for any given k -sparse vector ˜ x . Furthermore,all of it is done through a transformation of the original LASSO from (6) to a much simpler optimizationprogram (143). While many quantities of interest in LASSO recovery can be computed through the mecha-nism presented above, below we focus only on quantities that relate to what we will call LASSO’s generic The results presented in the above theorem are fairly general and pertain to pretty much any possible scenarioone can imagine. Here we will focus on the so-called “worst-case” scenario or as we will refer to it “generic”performance scenario. We will now show that E k ˆ w k from Theorem 1 can be upper-bounded over the setof all ˜ x ’s. To that end let us assume that all nonzero components of ˜ x are infinite. The optimization problemfrom (143) then becomes ξ ( gob ) ov ( σ, g , h ) = max ν,λ (2) σ q k g k − k h + ν z (1) − λ (2) k subject to ν ≥ ≤ λ (2) i = 0 , n − k + 1 ≤ i ≤ n ≤ λ (2) i ≤ ν, ≤ i ≤ n − k. (147)Let ν gen and λ ( gen ) be the solution of (147) and let w gen be the error vector in case when all nonzerocompoenents of ˜ x are infinite (in Section 2.3.1 for a slightly changed version of (147) ν gen and λ ( gen ) werereferred to as ν ℓ and λ ( ℓ ) ). Now, let us assume that some of nonzero components of ˜ x in (143) are finite.And let as usual ˆ ν and d λ (2) be the solution of (143) and let k ˆ w k be the norm of the LASSO’s error vector.Since σ q k g k − k h + ˆ ν z (1) − d λ (2) k − n X i = n − k +1 d λ (2) i ˜ x i ≥ σ q k g k − k h + ν gen z (1) − λ ( gen ) k (148)and d λ (2) i ≥ , n − k + 1 ≤ i ≤ n , one has that k h + ˆ ν z (1) − d λ (2) k ≤ k h + ν gen z (1) − λ ( gen ) k . 
(149)

Furthermore, one then has for the norm of error vectors

\[
\|\hat{w}\|_2 \;=\; \frac{\sigma\,\|h+\hat{\nu}z^{(1)}-\hat{\lambda}^{(2)}\|_2}{\sqrt{\|g\|_2^2-\|h+\hat{\nu}z^{(1)}-\hat{\lambda}^{(2)}\|_2^2}} \;\le\; \frac{\sigma\,\|h+\nu_{gen}z^{(1)}-\lambda^{(gen)}\|_2}{\sqrt{\|g\|_2^2-\|h+\nu_{gen}z^{(1)}-\lambda^{(gen)}\|_2^2}} \;=\; \|w_{gen}\|_2. \qquad (150)
\]

Then the following generic equivalent to Theorem 1 can be established.

Theorem 2.
Assume the setup of Theorem 1. Consider the following optimization problem: ξ ( gen ) ov ( σ, g , h ) = min ν,λ (2) k h + ν z (1) − λ (2) k subject to ν ≥ λ (2) i = 0 , n − k + 1 ≤ i ≤ n ≤ λ (2) i ≤ ν, ≤ i ≤ n − k. (151)32 et ν gen and λ ( gen ) be the solution of (151). Set k w gen k = σ k h + ν gen z (1) − λ ( gen ) k q k g k − k h + ν gen z (1) − λ ( gen ) k . (152) Then: P ( ∃ w lasso |k w lasso k ∈ ((1 − ǫ ( lasso )1 ) E k w gen k , (1 + ǫ ( lasso )1 ) E k w gen k )) ≥ − e − ǫ ( lasso )2 n P ( k w lasso k ≤ (1 + ǫ ( lasso )1 ) E k w gen k )) ≥ − e − ǫ ( lasso )3 n , (153) where ǫ ( lasso )1 > is an arbitrarily small constant and ǫ ( lasso )2 and ǫ ( lasso )3 are constants dependent on ǫ ( lasso )1 and σ but independent of n .Proof. Follows from the above discussion, Theorem 1, and by noting that the optimization problems in(151) and (147) are equivalent.The following corollary then provides a quick way of computing the concentrating point of the “worstcase” norm of the error vector.
Corollary 1.
Assume the setup of Theorems 1 and 2. Let α = mn and β w = kn . Then P ( ∃ w lasso |k w lasso k ∈ ((1 − ǫ ( lasso )1 ) σ r α w α − α w , (1 + ǫ ( lasso )1 ) σ r α w α − α w )) ≥ − e − ǫ ( lasso )2 n P ( k w lasso k ≤ (1 + ǫ ( lasso )1 ) σ r α w α − α w ) ≥ − e − ǫ ( lasso )2 n , (154) where α w < α is such that (1 − β w ) q π e − ( erfinv ( − αw − βw )) α w − √ erfinv ( 1 − α w − β w ) = 0 (155) and ǫ ( lasso )1 > is an arbitrarily small constant and ǫ ( lasso )2 is a constant dependent on ǫ ( lasso )1 and σ butindependent of n .Proof. Let ¯ h and z (2) be as in Section 2.3.1. Then ξ ( gen ) ov ( σ, g , h ) = min ν,λ (2) k ¯ h − ν z (2) + λ (2) k subject to ν ≥ λ (2) i = 0 , n − k + 1 ≤ i ≤ n ≤ λ (2) i ≤ ν, ≤ i ≤ n − k, (156)33s equivalent to (154). Moreover E k w gen k . = σ Eξ ( gen ) ov ( σ, g , h ) q E k g k − Eξ ( gen ) ov ( σ, g , h ) . = σ r α w α − α w , (157)where α w m . = Eξ ( gen ) ov ( σ, g , h ) is one of the main contributions of [63]. The rest then trivially followsfrom (153).Using (155) and (147) one can then for any σ and any pair ( α, β w ) (that is below fundamental charac-terization (108)) determine the value of the worst case E k w lasso k as σ q α w α − α w . We present the obtainedresults in Figure 3. For several fixed values of worst case E k w lasso k we determine curves of points ( α, β w ) for which these fixed values are achieved (of course for any α that is below a curve the value of the corre-sponding worst case E k w lasso k is smaller). As can be seen from the plots the lower the norm-2 of the errorvector the smaller the allowable region for pairs ( α, β w ) .The results of the above corollary match those obtained in [7, 26] through a state evolution/bilief propa-gation type of analysis. The above corollary relates to the LASSO from (6) whereas the results from [7, 26]are derived for somewhat different LASSO from (5). However, as mentioned earlier, in Section 4 we willestablish a nice connection between the LASSO from (6) and one that is fairly similar to (5). α β w / α ( α , β w ) curves as functions of ρ =||w lasso || / σ , LASSO ρ→∞ρ =5 ρ =3 ρ =2 ρ =1 Figure 3: ( α, β w ) curves as functions of ρ = E k w lasso k σ for LASSO algorithm from (6)34 LASSO’s performance analysis framework – signed x In this section we show how the LASSO’s performance analysis framework developed in the previous sec-tion can be specialized to the case when signals are a priori known to have nonzero components of certainsign. All major assumptions stated at the beginning of the previous section will continue to hold in this sec-tion as well; namely, we will continue to consider matrices A with i.i.d. standard normal random variables;elements of v will again be i.i.d. Gaussian random variables with zero mean and variance σ . The maindifference, though, comes in the definition of ˜ x . We will in this section assume that ˜ x is the original x in(3) that we are trying to recover and that it is any k -sparse vector with a given fixed location of its nonzeroelements and with a priori known signs of its elements. Given the statistical context, it will be fairly easyto see later on that everything that we will present in this section will be irrelevant with respect to what par-ticular location and what particular combination of signs of nonzero elements are chosen. We therefore forthe simplicity of the exposition and without loss of generality assume that the components x , x , . . . , x n − k of x are equal to zero and the components x n − k +1 , x n − k +2 , . . . 
, x n of x are greater than or equal to zero.However, differently from what was assumed in the previous section, we now assume that this informationis a priori known. That essentially means that this information is also known to the solving algorithm. Theninstead of (6) one can consider its a better (“signed”) version min x k y − A x k subject to k x k ≤ k ˜ x k x i ≥ , ≤ i ≤ n. (158)In what follows we will mimic the procedure presented in the previous section, skip all the obviousparallels, and emphasize the points that are different. The framework that we will present below will againcenter around finding the optimal value of the objective function in (158). In the first of the following twosubsections we will create a lower bound on this optimal value (this will essentially amount to creating aprocedure that is analogous to the one presented in Section 2.1). We will then afterwards in the second ofthe subsections create an upper bound on this optimal value. As it was done in the case of general ˜ x in theprevious section we will in the third subsection show that the two bounds actually match. To make furtherwriting easier and clearer we set already here ζ obj + = min x k y − A x k subject to k x k ≤ k ˜ x k x i ≥ , ≤ i ≤ n. (159) ζ obj + In this section we present the part of the framework that relates to finding a “high-probability” lower boundon ζ + obj . As in the previous section we again assume that there is a (if necessary, arbitrarily large) constant C w such that P ( k w lasso k ≤ C w ) = 1 − e − ǫ C w n . (160)We again start by noting that if one knows that y = A ˜ x + v holds then (159) can be rewritten as min x k v + A ˜ x − A x k subject to k x k ≤ k ˜ x k x i ≥ , ≤ i ≤ n. (161)35fter a small change of variables, x = ˜ x + w , one has an equivalent to (14) min w k A v (cid:20) w σ (cid:21) k subject to n X i =1 w i ≤ x i + w i ≥ , ≤ i ≤ n, (162)where as earlier A v = (cid:2) − A v (cid:3) is an m × ( n + 1) random matrix with i.i.d. standard normal components.Let S + w ( σ, ˜ x , C w ) = { (cid:20) w σ (cid:21) ∈ R n +1 | k w k ≤ C w and n X i =1 w i ≤ and ˜ x i + w i ≥ , ≤ i ≤ n } . (163)Further, let f obj + ( σ, w ) = k A v (cid:20) w σ (cid:21) k (164)and set, ζ ( help ) obj + = min [ w T σ ] T ∈ S + w ( σ, ˜ x ,C w ) f obj + ( σ, w ) = min [ w T σ ] T ∈ S + w ( σ, ˜ x ,C w ) k A v (cid:20) w σ (cid:21) k = min [ w T σ ] T ∈ S + w ( σ, ˜ x ,C w ) max k a k =1 a T A v (cid:20) w σ (cid:21) . (165)Now, after applying Lemma 1 and following the procedure from the previous section one has P ( min [ w T σ ] T ∈ S + w ( σ, ˜ x ,C w ) ( f obj + ( σ, w ) + q k w k + σ g ) ≥ ζ ( l ) obj + ) ≥ p + l . (166)where p + l = P min [ w T σ ] T ∈ S + w ( σ, ˜ x ,C w ) q k w k + σ k g k + n X i =1 h i w i ! + h n +1 σ ≥ ζ ( l ) obj + ! . (167)As in previous section we will essentially show that for certain ζ ( l ) obj + this probability is close to which willimply that we have a “high probability” lower bound on ζ obj + . Let ξ + ( σ, g , h , ˜ x ) = min [ w T σ ] T ∈ S + w ( σ, ˜ x ,C w ) q k w k + σ k g k + n X i =1 h i w i ! . (168)Now we split the analysis into two parts. The first one will be the deterministic analysis of ξ + ( σ, g , h , ˜ x ) and will be presented in Subsection 3.1.1. In the second part (that will be presented in Subsection 3.1.2) wewill use the results of such a deterministic analysis and continue the above probabilistic analysis applyingvarious concentration results. 36 .1.1 Optimizing ξ + ( σ, g , h , ˜ x ) In this section we compute ξ + ( σ, g , h ) . 
We first rewrite the optimization problem from (168) in the follow-ing way ξ + ( σ, g , h , ˜ x ) = min w q k w k + σ k g k + n X i =1 h i w i subject to n X i =1 w i ≤ x i + w i ≥ , ≤ i ≤ n q k w k + σ ≤ p C w + σ . (169)The Lagrange dual of the above problem then becomes L ( ν, λ (2) , w , γ ) = q k w k + σ k g k + n X i =1 h i w i + ν n X i =1 w i − n X i = n − k +1 λ (2) i (˜ x i + w i ) − n − k X i =1 λ (2) i w i + γ ( q k w k + σ − p C w + σ ) . (170)After a few further arrangements we finally have L ( ν, λ (2) , w , γ ) = q k w k + σ ( k g k + γ ) + n X i =1 ( h i + ν − λ (2) i ) w i − n X i = n − k +1 λ (2) i ˜ x i − γ p C w + σ . (171)One can then write the following dual problem of (169) ξ + ( σ, g , h , ˜ x ) = max ν,λ (2) ,γ min w L ( ν, λ (2) , w ) subject to λ (2) i ≥ , ≤ i ≤ nν ≥ γ ≥ , (172)where we of course use the fact that the strict duality obviously holds. The inner minimization over w isnow doable. Setting the derivatives with respect to w i to zero one obtains w ( k g k + γ ) p k w k + σ + ( h + ν z (1) − λ (2) ) = 0 , (173)where z (1) and λ (2) are as defined in the previous section. From (173) one then has w ( k g k + γ ) = − q k w k + σ ( h + ν z (1) − λ (2) ) (174)or in a norm form k w k ( k g k + γ ) = ( k w k + σ ) k h + ν z (1) − λ (2) k . (175)37rom (175) we then find k w sol + k = σ k h + ν z (1) − λ (2) k q ( k g k + γ ) − k h + ν z (1) − λ (2) k , (176)and from (174) w sol + = σ ( h + ν z (1) − λ (2) ) q ( k g k + γ ) − k h + ν z (1) − λ (2) k (177)where w sol + is of course the solution of the inner minimization over w . As in the previous section, oneshould note that (176) and (177) are of course possible only if k g k + γ − k h + ν z (1) − λ (2) k ≥ . (Also,as in the previous section if for ν and λ (2) that are optimal in (172) the condition is not met then for thecorresponding ( α, β ) the worst-case k w k is infinite with overwhelming probability). Plugging the value of w sol + from (177) back in (172) gives ξ + ( σ, g , h , ˜ x ) = max ν,λ (2) ,γ σ q ( k g k + γ ) − k h + ν z (1) − λ (2) k − n X i = n − k +1 λ (2) i ˜ x i − γ p C w + σ subject to λ (2) i ≥ , ≤ i ≤ nν ≥ k g k + γ − k h + ν z (1) − λ (2) k ≥ γ ≥ . (178)Now, the maximization over γ can be done. After setting the derivative to zero one finds k g k + γ q ( k g k + γ ) − k h + ν z (1) − λ (2) k − p C w + σ = 0 (179)and after some algebra γ opt + = s σ C w k h + ν z (1) − λ (2) k − k g k , (180)where of course γ opt + would be the solution of (178) only if larger than or equal to zero. Alternatively ofcourse γ opt + = 0 . Now, based on these two scenarios we distinguish two different optimization problems:1. The “overwhelming” optimization — signed ˜ x ξ ov + ( σ, g , h , ˜ x ) = max ν,λ (2) σ q k g k − k h + ν z (1) − λ (2) k − n X i = n − k +1 λ (2) i ˜ x i subject to ν ≥ λ (2) i ≥ , ≤ i ≤ n. (181)38. The “non-overwhelming” optimization — signed ˜ x ξ nov + ( σ, g , h , ˜ x ) = max ν,λ (2) p C w + σ k g k − C w k h + ν z (1) − λ (2) k − n X i = n − k +1 λ (2) i ˜ x i subject to ν ≥ λ (2) i ≥ , ≤ i ≤ n. (182)The “overwhelming” optimization is the equivalent to (178) if for its optimal values c ν + and [ λ (2+) holds s σ C w k h + c ν + z (1) − [ λ (2+) k ≤ k g k , (183)We now summarize in the following lemma the results of this subsection. Lemma 11.
Let c ν + and [ λ (2+) be the solutions of (181) and analogously let f ν + and ] λ (2+) be the solutionsof (182). Let ξ + ( σ, g , h , ˜ x ) be, as defined in (168), the optimal value of the objective function in (168). Then ξ + ( σ, g , h , ˜ x ) = σ q k g k − k h + c ν + z (1) − [ λ (2+) k − P ni = n − k +1 [ λ (2+) i ˜ x i , if q σ C w k h + c ν + z (1) − [ λ (2+) k ≤ k g k p C w + σ k g k − C w k h + f ν + z (1) − ] λ (2+) k − P ni = n − k +1 ] λ (2+) i ˜ x i , otherwise . (184) Moreover, let d w + be the solution of (168). Then d w + ( σ, g , h , ˜ x ) = σ ( h + c ν + z (1) − \ λ (2+) ) q k g k −k h + c ν + z (1) − \ λ (2+) k , if q σ C w k h + c ν + z (1) − [ λ (2+) k ≤ k g k C w ( h + f ν + z (1) − ^ λ (2+) ) k h + f ν + z (1) − ^ λ (2+) k , otherwise , (185) and k d w + ( σ, g , h , ˜ x ) k = σ k h + c ν + z (1) − \ λ (2+) ) k q k g k −k h + c ν + z (1) − \ λ (2+) k , if q σ C w k h + c ν + z (1) − [ λ (2+) k ≤ k g k C w , otherwise . (186) Proof.
The first part follows trivially. The second one follows from (177) by choosing the optimal c ν + and [ λ (2+) or alternatively f ν + and ] λ (2+) . ξ + ( σ, g , h , ˜ x ) In this section we establish that ξ + ( σ, g , h , ˜ x ) concentrates with high probability around its mean. Thefollowing lemma is an analogue to Lemma 3.1.2 Lemma 12.
Let g and h be m and n dimensional vectors, respectively, with i.i.d. standard normal variablesas their components. Let σ > be an arbitrary scalar. Let ξ ( σ, g , h , ˜ x ) be as in (22). Further let ǫ lip > be any constant. Then P ( | ξ + ( σ, g , h , ˜ x ) − Eξ + ( σ, g , h , ˜ x ) | ≥ ǫ lip Eξ + ( σ, g , h , ˜ x )) ≤ exp (cid:26) − ( ǫ lip Eξ + ( σ, g , h , ˜ x )) C w + σ ) (cid:27) . (187)39 roof. It follows by literally repeating every step of proof of Lemma . The only difference is that one nowhas S + w ( σ, ˜ x , C w ) instead of S w ( σ, ˜ x , C w ) .Moreover one then has that k h + c ν + z (1) − [ λ (2+) k and k h + f ν + z (1) − ] λ (2+) k concentrate as well whichautomatically implies that d w + also concentrates. More formally, one then has analogues to (187) P ( |k h + c ν + z (1) − [ λ (2+) k − E k h + c ν + z (1) − [ λ (2+) k | ≥ ǫ ( norm )1 E k h + c ν + z (1) − [ λ (2+) k ) ≤ e − ǫ ( norm )2 n P ( |k h + f ν + z (1) − ] λ (2+) k − E k h + f ν + z (1) − ] λ (2+) k | ≥ ǫ ( norm )3 E k h + f ν + z (1) − ] λ (2+) k ) ≤ e − ǫ ( norm )4 n P ( |k d w + k − E k d w + k | ≥ ǫ ( w )1 E k d w + k ) ≤ e − ǫ ( w )2 n , (188)where as usual ǫ ( norm )1 > , ǫ ( norm )2 > , and ǫ ( w )1 > are arbitrarily small constants and ǫ ( norm )3 , ǫ ( norm )4 ,and ǫ ( w )2 are constant dependent on ǫ ( norm )1 > , ǫ ( norm )2 > , and ǫ ( w )1 > , respectively, but independentof n . After repeating every step between (55) and (64) one arrives to the following analogue to Lemma 5. Lemma 13.
Let v be an n × vector of i.i.d. zero-mean variance σ Gaussian random variables and let A be an m × n matrix of i.i.d. standard normal random variables. Consider an ˜ x defined in (7) and a y defined in (3) for x = ˜ x . Let then ζ obj + be as defined in (159) and let w + be the solution of (162). Assume P ( k w + k ≤ C w ) ≥ − e − ǫ C w n for an arbitrarily large constant C w and a constant ǫ C w > dependenton C w but independent of n . Then there is a constant ǫ lower > P ( ζ obj + ≥ ζ ( lower ) obj + ) ≥ (1 − e − ǫ lower n )(1 − e − ǫ C w n ) , (189) where ζ ( lower ) obj + = (1 − ǫ lip ) Eξ + ( σ, g , h , ˜ x ) − ǫ ( h )1 √ n − ǫ ( g )1 √ n, (190) ξ + ( σ, g , h , ˜ x ) is as defined in (168), and ǫ lip , ǫ ( h )1 , ǫ ( g )1 are all positive arbitrarily small constants.Proof. Follows from the previous discussion. ζ obj + In this section we present a general framework for finding a “high-probability” upper bound on ζ obj + . Tothat end, let r + and C w up + be positive scalars (as in Section 3.2, we in this subsection present a generalframework and take these scalars to be arbitrary; however to make the bound as tight sa possible in thefollowing subsection we will make them take particular values). As earlier, if we can show that there is a w ∈ R n such that k ˜ x + w k ≤ k ˜ x k and k v − A w k ≤ r + with overwhelming probability then r + can actas an upper bound on ζ obj + . We then start by looking at the following optimization problem min w k ˜ x + w k − k ˜ x k k A v (cid:20) w σ (cid:21) k ≤ r + ˜ x i + w i ≥ , ≤ i ≤ n k w k ≤ C w up + , (191)40here A v is as defined right after (14). If we can show that with overwhelming probability the objectivevalue of the above optimization problem is negative then r + will be a valid “high probability” upper-boundon ζ obj + . Moreover, it will be achieved by a w for which it will hold that k w k ≤ C w up + .First let us rewrite the objective value of the above optimization problem in a slightly more convenientform min x n X i =1 w i k A v (cid:20) w σ (cid:21) k ≤ r + ˜ x i + w i ≥ , ≤ i ≤ n k w k ≤ C w up + . (192)Now, we proceed in a fashion similar to the one from Subsection 3.1.1. We first do a slight modification ofthe first constraint from (192) in the following way min x n X i =1 w i k A v (cid:20) w σ (cid:21) k ≤ r ˜ x i + w i ≥ , ≤ i ≤ n k b k ≤ r (cid:2) − A v (cid:3) (cid:20) w σ (cid:21) = b k w k ≤ C w up + . (193)The Lagrange dual of the above problem then becomes L ( λ (2) , ν (1) , γ , γ , w , b ) = n X i =1 w i − n X i = n − k +1 λ (2) i (˜ x i + w i ) − n − k X i =1 λ (2) i w i − ν (1) A w + ν (1) v σ − ν (1) b + γ ( n X i =1 b − r ) + γ ( k w k − C w up + ) , (194)where ν (1) and λ (2) are vectors of Lagrange variables as in previous sections. After rearranging terms wefurther have L ( λ (2) , ν (1) , γ , γ , w , b ) = − n X i = n − k +1 λ (2) i ˜ x i + (( z (1) − λ (2) ) T − ν (1) A ) w + ν (1) v σ − ν (1) b + γ ( n X i =1 b − r ) + γ ( k w k − C w up + ) . (195)41inally we can write a dual problem to (193) max λ (2) ,ν (1) ,γ ,γ min w , b L ( λ (2) , ν (1) , γ , γ , w , b ) subject to λ (2) i ≥ , ≤ i ≤ nγ ≥ ,γ ≥ , (196)where we of course use the fact that the strict duality obviously holds. 
Now, after repeating all the stepsfrom (75) to (84) (wherever we had λ (2) we would now have λ (2) and there will be no upper bound oncomponents of λ (2) in the corresponding optimization problems) one obtains and analogue to (84) − min λ (2) ,ν (1) max k a k = C w up + (( z (1) − λ (2) ) T − ν (1) A ) a − ν (1) v σ + k ν (1) k r + + n X i = n − k +1 λ (2) i ˜ x i subject to λ (2) i ≥ , ≤ i ≤ n. (197)Now let us define f ( up ) obj + as − f ( up ) obj + = − min λ (2) ,ν (1) max k a k = C w up + (( z (1) − λ (2) ) T − ν (1) A ) a − ν (1) v σ + k ν (1) k r + + n X i = n − k +1 λ (2) i ˜ x i subject to λ (2) i ≥ , ≤ i ≤ n. (198)Any r + such that lim n → P ( f ( up ) obj + ≥
0) = 1 is then a valid “high-probability” upper bound. Set Λ (2+) = { λ (2) ∈ R n | λ (2) i ≥ , ≤ i ≤ n } , (199)and ξ up + ( σ, g , h , ˜ x , C w up ) = max λ (2) ∈ Λ (2+) ,ν ≥ ( q C w up + + σ k g k − C w up + k h + ν z (1) − λ (2) ) k − n X i = n − k +1 λ (2) i ˜ x i ) . (200)After further repeating all the steps between (84) and Lemma 14 (the only difference is that λ (2) i ∈ Λ (2+) inthe “signed” scenario) one then has the following “signed” analogue to Lemma 14 (which in essence givesa way of finding an r + such that lim n → P ( f ( up ) obj + ≥
0) = 1 ). Lemma 14.
Let v be an n × vector of i.i.d. zero-mean variance σ Gaussian random variables and let A be an m × n matrix of i.i.d. standard normal random variables. Consider an ˜ x defined in (7) and a y defined in (3) for x = ˜ x . Let then ζ obj + be as defined in (159) and let w + be the solution of (162). There isa constant ǫ upper > P ( ζ obj + ≤ ζ ( upper ) obj + ) ≥ − e − ǫ upper n , (201) where ζ ( upper ) obj + = (1 + ǫ lip ) Eξ up + ( σ, g , h , ˜ x , C w up + ) + ǫ ( h )1 √ n + ǫ ( g )3 √ n, (202) ξ up + ( σ, g , h , ˜ x , C w up + ) is as defined in (200), ǫ lip , ǫ ( h )1 , ǫ ( g )3 are all positive arbitrarily small constants, and C w up + is a constant such that k w + k ≤ C w up + .Proof. Follows from the previous discussion. 42 .3 Matching upper and lower bounds
In this section we specialize the general bounds introduced above and show how they match. We will againdivide presentation in three subsections. In the first of the subsections we will make a connection to thenoiseless “signed” case and show how one can then remove the constraint from (184), (185), and (186).In the second subsection we will consider a w such that |k w k − k d w + k | ≥ ǫ w up k d w + k . We will thenquantify how much the lower bound that can be computed for such a w through the framework presentedin Section 3.1 deviates from the optimal one obtained for d w + . In the last subsection we will then show thatthere will be a w such that the upper bound computed through the framework presented in Section 3.2 willdeviate less. That will in essence establish that upper and lower bounds computed in the previous sectionsindeed match. ℓ optimization of signed x In this subsection we establish a connection between the constraint in (184), (185), and (186) and the funda-mental performance characterization of ℓ optimization derived in [62] (and of course earlier in the contextof neighborly polytopes in [27]). We first recall on the condition from Lemma 11. The condition states s σ C w k h + c ν + z (1) − [ λ (2+) k ≤ k g k , (203)where C w is an arbitrarily large constant and c ν + and [ λ (2+) are the solutions of max σ q k g k − k h + ν z (1) − λ (2) k − n X i = n − k +1 λ (2) i ˜ x i subject to λ (2) i ≥ , ≤ i ≤ nν ≥ . (204)Now we note the following equivalent to (204) for the case when nonzero components of ˜ x are infinite max σ q k g k − k h + ν z (1) − λ (2) k subject to λ (2) i ≥ , ≤ i ≤ n − kλ (2) i = 0 , n − k + 1 ≤ i ≤ nν ≥ . (205)To make the new observations easily comparable to the corresponding ones from [61, 63] we set ¯ h + = [ h (1)(1) , h (2)(2) , . . . , h ( n − k )( n − k ) , h n − k +1 , h n − k +2 , . . . , h n ] T , (206)where [ h (1)(1) , h (2)(2) , . . . , h ( n − k ) ( n − k ) ] are [ h , h , . . . , h n − k ] sorted in increasing order (possible ties in thesorting process are of course broken arbitrarily). Also we let z (2) be as in the previous section, i.e. let it besuch that z (2) i = − z (1) i , n − k + 1 ≤ i ≤ n and z (2) i = z (1) i , ≤ i ≤ n − k . It is then relatively easy to see43hat the above optimization problem is equivalent to max σ q k g k − k ¯ h + − ν z (2) + λ (2) k subject to λ (2) i ≥ , ≤ i ≤ n − kλ (2) i = 0 , n − k + 1 ≤ i ≤ nν ≥ . (207)Let ν ℓ + and λ ( ℓ +) be the solution of the above maximization. Further, consider the following “signed”version of the standard ℓ -optimization min k x k subject to A x = yx i ≥ . (208)Then, as we showed in [63] and [62], the inequality E k g k > E k ¯ h + − ν ℓ + z (2) + λ ( ℓ +) k (209)establishes the following fundamental performance characterization of the ℓ optimization algorithm from(208) that could be used instead of LASSO from (158) to recover signed x in (1) (which is a noiselessversion of (3)) (1 − β + w ) q π e − ( erfinv (2 − α + w − β + w − α + w − √ erfinv (2 1 − α + w − β + w −
1) = 0 , (210)where of course α + w = mn and β + w = kn . As it is also shown in [63] and [62] both of the quantities under theexpected values in (209) nicely concentrate. Then with overwhelming probability one has that for any pair ( α, β ) that satisfies (or lies below) the fundamental performance characterization of ℓ optimization givenin (210) k g k > k ¯ h + − ν ℓ + z (2) + λ ( ℓ +) k . (211)Moreover, since λ (2) i ≥ , n − k + 1 ≤ i ≤ n , in (204) one actually has that (211) implies that withoverwhelming probability k g k > k h + c ν + z (1) − [ λ (2+) k , (212)which for sufficiently large C w is the same as (203). We then in what follows assume that pair ( α, β ) is suchthat it satisfies the “signed” fundamental ℓ optimization performance characterization (or is in the regionbelow it) from (210) and therefore proceed by ignoring condition (203). In this subsection we establish that k w lasso + k (of course, w lasso + = c x + − ˜ x , where c x + is the solution of(158)) can not deviate substantially from k d w + k without substantially affecting the value of the lower boundon the objective in (158) that is derived in Section 3.1. To that end let us assume that there is a w off + thatis the solution of the LASSO from (158) (or to be slightly more precise that is such that c x + = ˜ x + w off + ,where obviously c x + is the solution of (158)). Further, let |k w off + k − k d w + k | ≥ ǫ w up k d w + k , where ǫ w up is an arbitrarily small constant. 44ne can then write a “signed” analogue to (112) ξ off + ( σ, g , h , ˜ x , w off + ) = max λ (2) ∈ Λ (2+) ,ν ≥ ( q w off + σ k g k − w off k h + ν z (1) − λ (2) ) k − n X i = n − k +1 λ (2) i ˜ x i ) . (213)After repeating all the arguments between (112) and Lemma 9 one obtains the following analogue to Lemma9. Lemma 15.
Let v be an n × vector of i.i.d. zero-mean variance σ Gaussian random variables and let A be an m × n matrix of i.i.d. standard normal random variables. Consider an ˜ x defined in (7) and a y defined in (3) for x = ˜ x . Let then ζ obj + be as defined in (159) and let w off + be the solution of (162). Let α and β be below the fundamental characterization (210) and let d w + be as defined in (185). Assume that |k w off + k − k d w + k | ≥ ǫ w up k d w + k , where ǫ w up is an arbitrarily small but fixed constant. Then therewould be a constant ǫ off > , and arbitrarily small positive constants ǫ lip , ǫ ( h )1 , ǫ ( g )1 such that P ( ζ obj + ≥ ζ ( off ) obj + ) ≥ − e − ǫ off n , (214) where ζ ( off ) obj = (1 − ǫ lip )(1 + ǫ w up ǫ w up ) ) Eξ ov + ( σ, g , h , ˜ x ) − ǫ ( h )1 √ n − ǫ ( g )1 √ n, (215) and ξ ov + ( σ, g , h , ˜ x ) is as defined in (181).Proof. Follows from the discussion in Section 2.3.2.
In this section we establish that k w lasso + k can not deviate from k d w + k as much as it was assumed in theprevious section which is conceptually enough to make the bounds from Sections 3.1 and 3.2 match. Allarguments from Section 2.3.3 can be repeated again. The only difference will be that in all optimizationproblems from Section 2.3.3 one will now have no upper bound on λ (2) i , ≤ i ≤ n (this essentially amountsto using set Λ (2+) instead of set Λ (2) ). One then has a “signed” analogue to (142) Eξ up + ( σ, g , h , ˜ x , E k d w + k ) = E max λ (2) ∈ Λ (2+) ,ν ≥ ( q ( E k d w + k ) + σ k g k − E k d w + k k h + ν z (1) − λ (2) ) k − n X i = n − k +1 λ (2) i ˜ x i )= E max λ (2) ∈ Λ (2+) ,ν ≥ ( q ( E c d + ) + σ k g k − E c d + k h + ν z (1) − λ (2) ) k − n X i = n − k +1 λ (2) i ˜ x i ) . = E min d ≥ max λ (2) ∈ Λ (2+) ,ν ≥ ( p d + σ k g k − d k h + ν z (1) − λ (2) ) k − n X i = n − k +1 λ (2) i ˜ x i ) = Eξ ov + ( σ, g , h , ˜ x ) , (216)where c d + = k d w + k would be the solution of a “signed” analogue to (125). Following the arguments after(142) one then has that the assumption of Lemma 13 is unsustainable and that k w lasso + k can not deviatesubstantially from k d w + k . This then implies that with overwhelming probability the objective value of (158)concentrates around Eξ ov + ( σ, g , h , ˜ x ) and consequently that k w lasso + k concentrates around E k d w + k .45 .4 Connecting all pieces In this section we connect all of the above. The following theorem essentially does so.
Theorem 3.
Let v be an n × vector of i.i.d. zero-mean variance σ Gaussian random variables and let A be an m × n matrix of i.i.d. standard normal random variables. Further, let g and h be m × and n × vectors of i.i.d. standard normals, respectively. Consider a k -sparse ˜ x defined in (7) and a y definedin (3) for x = ˜ x . Let the solution of (158) be c x + and let the so-called error vector of LASSO from (158)be w lasso + = c x + − ˜ x . Let n be large and let constants α = mn and β = kn be below the fundamentalcharacterization (210). Consider the following optimization problem: ξ ov + ( σ, g , h , ˜ x ) = max ν,λ (2) σ q k g k − k h + ν z (1) − λ (2) k − n X i = n − k +1 λ (2) i ˜ x i subject to ν ≥ λ (2) i ≥ , ≤ i ≤ n. (217) Let c ν + and [ λ (2+) be the solution of (217). Set k d w + k = σ k h + c ν + z (1) − [ λ (2+) k q k g k − k h + c ν + z (1) − [ λ (2+) k . (218) Then: P ((1 − ǫ ( lasso )1 ) Eξ ov + ( σ, g , h , ˜ x ) ≤ k y − A ˆ x k ≤ (1+ ǫ ( lasso )1 ) Eξ ov + ( σ, g , h , ˜ x )) = 1 − e − ǫ ( lasso )2 n (219) and P ((1 − ǫ ( lasso )1 ) E k d w + k ≤ k w lasso + k ≤ (1 + ǫ ( lasso )1 ) E k d w + k ) = 1 − e − ǫ ( lasso )2 n , (220) where ǫ ( lasso )1 > is an arbitrarily small constant and ǫ ( lasso )2 is a constant dependent on ǫ ( lasso )1 and σ butindependent of n .Proof. Follows from the above discussion.
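For illustration, the claim of Theorem 3 can be probed numerically by solving the signed LASSO from (158) directly. The following is a minimal single-instance sketch; the use of the cvxpy package, the problem dimensions, and the random instance are assumptions made only for this illustration and are not part of the above analysis.

# A minimal sketch (assumed setup, not the paper's code): solve the signed LASSO
# from (158) on one random instance and record the error vector whose norm
# Theorem 3 predicts to concentrate.
import numpy as np
import cvxpy as cp

n, m, k, sigma = 400, 200, 40, 1.0                      # illustrative dimensions
A = np.random.randn(m, n)                               # i.i.d. standard normal matrix
x_tilde = np.zeros(n)
x_tilde[n - k:] = np.sqrt(n)                            # nonnegative nonzero components
y = A @ x_tilde + sigma * np.random.randn(m)            # noisy measurements as in (3)

x = cp.Variable(n)
problem = cp.Problem(cp.Minimize(cp.norm(y - A @ x, 2)),
                     [cp.norm(x, 1) <= np.linalg.norm(x_tilde, 1), x >= 0])
problem.solve()

w_lasso_plus = x.value - x_tilde                        # the error vector from Theorem 3
print(np.linalg.norm(w_lasso_plus))

Comparing the printed norm with (1 ± ǫ_1^{(lasso)}) E‖ŵ_+‖_2 from (220) (or, in the worst-case sense, with σ√(α_w^+/(α−α_w^+)) from Corollary 2 below) then provides a direct empirical check of the concentration claimed in the theorem.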
In this section we show how the results presented in the above theorem can be adapted to the so-called“worst-case” scenario or as we refer to it “generic performance” scenario. Repeating the line of argumentsfrom Section 2.4.1 one can establish the following generic equivalent to Theorem 3.
Theorem 4.
Assume the setup of Theorem 3. Consider the following optimization problem: ξ ( gob ) ov + ( σ, g , h ) = min ν,λ (2) k h + ν z (1) − λ (2) k subject to ν ≥ λ (2) i = 0 , n − k + 1 ≤ i ≤ nλ (2) i ≥ , ≤ i ≤ n − k. (221)46 et ν gen + and λ ( gen +) be the solution of (221). Set k w gen + k = σ k h + ν gen + z (1) − λ ( gen +) k q k g k − k h + ν gen + z (1) − λ ( gen +) k . (222) Then: P ( ∃ w lasso + |k w lasso + k ∈ ((1 − ǫ ( lasso )1 ) E k w gen + k , (1 + ǫ ( lasso )1 ) E k w gen + k )) ≥ − e − ǫ ( lasso )2 n P ( k w lasso + k ≤ (1 + ǫ ( lasso )1 ) E k w gen + k ) ≥ − e − ǫ ( lasso )3 n , (223) where ǫ ( lasso )1 > is an arbitrarily small constant and ǫ ( lasso )2 and ǫ ( lasso )3 are constants dependent on ǫ ( lasso )1 and σ but independent of n .Proof. Follows by the use of the same arguments that were used to establish Theorem 3.The following corollary then provides a quick way of computing the concentrating point of the “worstcase” norm of the error vector.
Corollary 2.
Assume the setup of Theorems 3 and 4. Let α = m/n and β_w^+ = k/n. Then

\[
P\Big(\exists\, w_{lasso+}\;\big|\; \|w_{lasso+}\|_2 \in \big((1-\epsilon_1^{(lasso)})\,\sigma\sqrt{\tfrac{\alpha_w^+}{\alpha-\alpha_w^+}},\;(1+\epsilon_1^{(lasso)})\,\sigma\sqrt{\tfrac{\alpha_w^+}{\alpha-\alpha_w^+}}\big)\Big) \ge 1-e^{-\epsilon_2^{(lasso)} n}
\]
\[
P\Big(\|w_{lasso+}\|_2 \le (1+\epsilon_1^{(lasso)})\,\sigma\sqrt{\tfrac{\alpha_w^+}{\alpha-\alpha_w^+}}\Big) \ge 1-e^{-\epsilon_3^{(lasso)} n}, \qquad (224)
\]

where α_w^+ < α is such that

\[
(1-\beta_w^+)\,\frac{\sqrt{\tfrac{2}{\pi}}\, e^{-\big(\mathrm{erfinv}\big(2\tfrac{1-\alpha_w^+}{1-\beta_w^+}-1\big)\big)^2}}{\alpha_w^+} \;-\; \sqrt{2}\,\mathrm{erfinv}\Big(2\tfrac{1-\alpha_w^+}{1-\beta_w^+}-1\Big) \;=\; 0, \qquad (225)
\]

ǫ_1^{(lasso)} > 0 is an arbitrarily small constant, and ǫ_2^{(lasso)} and ǫ_3^{(lasso)} are constants dependent on ǫ_1^{(lasso)} and σ but independent of n.

Proof. Follows by the use of the same arguments that were used to establish Corollary 1 and a recognition that the fundamental characterization of interest in the "signed" case is the one given in (210).

Based on the above corollary one can then for any σ and any pair (α, β_w^+) (that is below the fundamental characterization (225), or alternatively (210)) determine the value of the worst case E‖w_{lasso+}‖_2 as σ√(α_w^+/(α−α_w^+)). We present the obtained results in Figure 4. For several fixed values of the worst case E‖w_{lasso+}‖_2 we determine curves of points (α, β_w^+) for which these fixed values are achieved (of course, for any α that is below a curve the value of the corresponding worst case E‖w_{lasso+}‖_2 is smaller). As in the previous section, the lower the norm-2 of the error vector the smaller the allowable region for pairs (α, β_w^+). Also, as was the case in the previous section, the results of the above corollary match those obtained in [7, 26] through a state evolution/belief propagation type of analysis for the "signed" version of the LASSO from (5) (the signed version of the LASSO from (5), as expected, simply adds the positivity constraints on the components of x).

Figure 4: (α, β_w^+) curves as functions of ρ = E‖w_{lasso+}‖_2/σ for the LASSO algorithm from (158).

In this section we establish a connection between the LASSO algorithm from (6) that we analyzed in Section 2 and the more well known form of LASSO from (5). Instead of the well-known (5) we will consider a slight modification of it,

\[
\min_{x} \ \|y-Ax\|_2 + \lambda_{lasso}\|x\|_1. \qquad (226)
\]

Both LASSO's, (6) and (226) (as well as the one from (5)), rely on some type of prior knowledge that can be available about A, v, or x̃. In (6) we assumed that one knows ‖x̃‖_1 (of course, if one has no knowledge of ‖x̃‖_1 the LASSO from (6) simply can not be run). On the other hand, the LASSO from (226) (as well as the one from (5)) requires that one sets in advance the parameter λ_lasso, which can be a tough task if there is no a priori knowledge about A, v, or x̃. Now, even if there is some a priori available knowledge about these objects there are still many ways how one can set λ_lasso. We will below show a particular way of setting λ_lasso in (226) that can make the LASSO's from (6) and (226) essentially equivalent (of course, as long as one is interested in the performance measures discussed in this paper). In the interest of saving space we will sketch only the key arguments without going into tedious details similar to the ones presented in earlier sections. (All that we mention below can be made precise, though; in fact, one can pretty much reach the same level of exactness demonstrated in Sections 2 and 3; however, the length of the precise probabilistic arguments would equal, if not exceed, the length of the arguments presented in Sections 2 and 3.)

Now, let λ_lasso in (226) be such that λ_lasso = Eν̂ where ν̂ is the solution of (42). Then (226) becomes

\[
\min_{x} \ \|y-Ax\|_2 + E\hat{\nu}\,\|x\|_1, \qquad (227)
\]

or in a more convenient form

\[
\min_{x} \ \|y-Ax\|_2 + E\hat{\nu}\,\|x\|_1 - E\hat{\nu}\,\|\tilde{x}\|_1. \qquad (228)
\]

This could be rewritten in a way analogous to (14) as

\[
\min_{w} \ \Big\| A_v \begin{bmatrix} w \\ \sigma \end{bmatrix} \Big\|_2 + E\hat{\nu}\,\|\tilde{x}+w\|_1 - E\hat{\nu}\,\|\tilde{x}\|_1, \qquad (229)
\]

where A_v is as in (14).
One can then repeat all arguments from the beginning of Section 2.1 (essentiallythose before Section 2.1.1) to arrive at the following analogue of (23) ξ conn ( σ, g , h , ˜ x ) = min w q k w k + σ k g k + n X i =1 h i w i + E ˆ ν k ˜ x + w k − E ˆ ν k ˜ x k subject to q k w k + σ ≤ p C w + σ . (230)Now, one should note that E ˆ ν in the above optimization is chosen as the “optimal” (it is actually the concen-trating point of the optimal one; to make this really precise one would need to go through all the probabilisticarguments of Section 2 and plus some more) ν in the Lagrange dual of (23). One then has that argumentsfrom Section 2.1 (essentially an appropriate repetition of those that follow (23)) will produce the lowerbound on the objective of (228) that is with overwhelming probability arbitrarily close to the one derivedin Lemma 5. The arguments from Section 2.2 related to the upper bound can be trivially repeated as wellsince the negativity of the objective in (67) implies that r is also an upper bound on the optimal value of theobjective in (228). The matching arguments from Section 2.3 then follow as well. Now if one let w conn bethe solution of (229), then with overwhelming probability k w conn k concentrates around E k ˆ w k , where ˆ w is as defined in Theorem 1.For the signed case the arguments are the same, only instead of E ˆ ν in (227), (228), and (229) one shoulduse E c ν + where c ν + is the solution of (181). Also, as it is probably obvious, this time k w conn k concentratesaround E k d w + k where d w + is as defined in Theorem 3. In this section we show that there is an SOCP equivalent to the LASSO from (6) (as long as the norm-2 ofthe error vector is a performance measure of interest). To that end let us recall that an SOCP algorithm forfinding an approximation of ˜ x if A and y from (3) are known can be (see, e.g. [13]) min x k x k subject to k y − A x k ≤ r socp . (231)The choice of r socp critically impacts the outcome of the above optimization. In fact more is true, the choiceof r socp heavily depends on what type of approximation error one is looking for. As we have mentioned inSection 1 a popular choice for r socp is the smallest quantity that is with high probability larger than k v k .There are probably many reasons for such a choice. One of them is that it would with high probabilityguarantee that the original ˜ x in (3) is permissible in (231). Now, if one is looking for an x that will be closein norm-2 to the original ˜ x then it is not necessary to look for the original ˜ x (especially so given that findingoriginal ˜ x is in general pretty much impossible). So if one gives up on that then the value of r socp can goeven lower than the smallest quantity larger (with overwhelming probability) than k v k . One should also49ote that by lowering r socp one would give up not only possibility to find ˜ x (which is tiny anyway) but alsohighly likely more when it comes to the structure of the solution vector. This is of course a problem on itsown that requires a thorough discussion. However, since we now look only at the norm of the error vectoras a performance measure we stop short of pursuing this discussion here any further.To go along these lines we choose r socp = Eξ up ( σ, g , h , E k ˆ w k ) . = Eξ ov ( σ, g , h , ˜ x ) ≤ Eξ ( gob ) ov ( σ, g , h ) ≤ σ √ m . = E k v k . (232)Now, let x socp be the solution of (231) (with r socp as in (232)). Further let x socp = w socp + ˜ x . Let ˆ w be asdefined in Theorem 1. 
Then as shown in Sections 2.1, 2.2, and 2.3 k w socp k concentrates around E k ˆ w k .Basically, the argument is that if |k w socp k − k ˆ w k | ≥ ǫ w up k ˆ w k for a fixed arbitrarily small positive ǫ w up and k ˜ x + w socp k ≤ k ˜ x k then with overwhelming probability k y − A x socp k > r socp . On the other hand,as it was also established in Sections 2.1, 2.2, and 2.3, there is a w (which concentrates around E k ˆ w k with overwhelming probability) such that k ˜ x + w k ≤ k ˜ x k and k y − A x socp k ≤ r socp . This essentiallyestablishes that if one chooses r socp in (231) as suggested in (232) then the norm-2 of the error vector willbe the same as the norm-2 of the error vector one obtains through LASSO’s from (6) and (226) (the latterone of course with an appropriate choice of λ lasso ).As we hinted above what we presented here is only a characterization of a particular performance mea-sure of an SOCP algorithm (the same is of course true for the LASSO algorithms). How adequate is such aperformance measure is whole another story that we will explore in more detail elsewhere. In this section we present a set of numerical results related to the theoretical predictions that we derived inearlier sections. We will divide the presentation into two groups: 1) the set of results that will relate to thegeneral (unsigned) unknown sparse vectors and 2) the set of results that will relate to signed unknown sparsevectors. To make scaling easier in all experiments we set σ = 1 . We also assumed that nonzero componentsof ˜ x are all of equal and large magnitude. For the concreteness we set this magnitude to be √ n . For everysetup that we discuss below we ran numerical experiments. x In this subsection we will present numerical results that relate to the theoretical ones created in Sections 2and 4. We will consider two groups of ( α, β w ) regimes, one that we will refer to as the low ( α, β w ) regimeand the other that we will refer to as the high ( α, β w ) regime.
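Before presenting the two regimes, the following is a minimal single-instance sketch of the type of experiment used throughout this section: one solves the constrained LASSO from (6) and the penalized LASSO from (227) on a random instance and records the error norms. The cvxpy package, the dimensions, and the placeholder value standing in for Eν̂ are illustrative assumptions only.

# A minimal sketch (assumed setup): one random instance of the constrained LASSO (6)
# and of the penalized form (227). The value lam below is only a stand-in for the
# theoretical E nu-hat obtained from (42).
import numpy as np
import cvxpy as cp

n, m, k, sigma = 400, 200, 40, 1.0
A = np.random.randn(m, n)
x_tilde = np.zeros(n)
x_tilde[n - k:] = np.sqrt(n)                       # large equal-magnitude nonzero part
y = A @ x_tilde + sigma * np.random.randn(m)

# LASSO from (6): assumes the l1 norm of the unknown vector is known.
x1 = cp.Variable(n)
cp.Problem(cp.Minimize(cp.norm(y - A @ x1, 2)),
           [cp.norm(x1, 1) <= np.linalg.norm(x_tilde, 1)]).solve()

# LASSO from (227): penalized form with lambda_lasso set to (an estimate of) E nu-hat.
lam = 0.5                                          # hypothetical placeholder value
x2 = cp.Variable(n)
cp.Problem(cp.Minimize(cp.norm(y - A @ x2, 2) + lam * cp.norm(x2, 1))).solve()

print(np.linalg.norm(x1.value - x_tilde), np.linalg.norm(x2.value - x_tilde))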
2) Low ( α, β w ) regime — ρ = E k w lasso k σ = 2 We ran a carefully designed set of experiments intended to show a specific behavior of the LASSO’s from(6) and (227) in what we will refer to as the low ( α, β w ) regime. For α ∈ { . , . , . } we determinedthree values of β w from the contour LASSO line that corresponds to ρ = 2 in the figure given in Section 2.We then ran (6) assuming that k ˜ x k is known and (227) using theoretical value for E ˆ ν where, as mentionedin Section 4, ˆ ν is the solution of (42). We call the optimal value of the objective in (228) ζ conn (this valueis the optimal value of (227) shifted by a constant). Also, for this set of experiments we set n = 2000 .Obtained results are presented in Table 1. The theoretical values for any of the simulated quantities in anyof the simulated scenarios are given in parallel as bolded numbers. We observe a solid agreement betweenthe theoretical predictions and the results obtained through numerical experiments.
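The theoretical values of β_w referred to above (i.e., the points on the ρ = 2 contour of Figure 3) can be computed directly from Corollary 1. The sketch below assumes that (155) reads (1−β_w)√(2/π) e^{−(erfinv((1−α_w)/(1−β_w)))²}/α_w − √2 erfinv((1−α_w)/(1−β_w)) = 0 and uses standard root finding; both this reading of the formula and the numerical values are to be understood as illustrative.

# Sketch: for a given alpha and a target rho = E||w_lasso||_2 / sigma, recover the
# beta_w on the corresponding contour of Figure 3 by combining Corollary 1 with (155).
# The closed form used for (155) below is an assumed reading of the characterization.
import numpy as np
from scipy.special import erfinv
from scipy.optimize import brentq

def characterization(alpha_w, beta_w):
    # Left-hand side of (155); it equals zero exactly on the fundamental l1 curve.
    e = erfinv((1.0 - alpha_w) / (1.0 - beta_w))
    return (1.0 - beta_w) * np.sqrt(2.0 / np.pi) * np.exp(-e**2) / alpha_w - np.sqrt(2.0) * e

def beta_on_contour(alpha, rho):
    # beta_w for which the worst-case E||w_lasso||_2 equals rho * sigma (Corollary 1):
    # rho^2 = alpha_w / (alpha - alpha_w) pins alpha_w, and (155) then pins beta_w.
    alpha_w = alpha * rho**2 / (1.0 + rho**2)
    return brentq(lambda b: characterization(alpha_w, b), 1e-9, alpha_w - 1e-9)

print([round(beta_on_contour(a, rho=2.0), 4) for a in (0.3, 0.5, 0.7)])   # illustrative alphas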
Table 1: Experimental/theoretical results for the noisy recovery through LASSO's; σ = 1, ρ = E‖w_lasso‖_2 = 2; (6) and (227) were run a number of times with n = 2000. Columns: α, β_w/α, Eν̂, Eζ_conn/√n, E‖w_conn‖_2, Eζ_obj/√n, E‖w_lasso‖_2; simulated values are reported alongside the corresponding theoretical (bolded) ones.

2) High (α, β_w) regime — ρ = E‖w_lasso‖_2/σ = 3

We also ran a carefully designed set of experiments intended to show a specific behavior of the LASSO's from (6) and (227) in what we will refer to as the high (α, β_w) regime. For α ∈ { . , . , . } we now determined three values of β_w from the contour LASSO line that corresponds to ρ = 3 in the figure given in Section 2. We then again ran (6) assuming that ‖x̃‖_1 is known and (227) using the theoretical values for Eν̂. For the scenario when α = 0. we set n = 3000 while for the scenarios with the other two values of α we set n = 2000. Obtained results are presented in Table 2. The theoretical values for any of the simulated quantities in any of the simulated scenarios are again given in parallel as bolded numbers. We again observe a solid agreement between the theoretical predictions and the results obtained through numerical experiments.

Table 2: Experimental/theoretical results for the noisy recovery through LASSO's; σ = 1, ρ = E‖w_lasso‖_2 = 3; (6) and (227) were run a number of times. Columns: α, β_w/α, Eν̂, Eζ_conn/√n, E‖w_conn‖_2, Eζ_obj/√n, E‖w_lasso‖_2; simulated values are reported alongside the corresponding theoretical (bolded) ones.

Signed x

In this subsection we will present numerical results that relate to the theoretical ones created in Sections 3 and 4. We will again consider two groups of (α, β_w^+) regimes, one that we will refer to as the low (α, β_w^+) regime and the other that we will refer to as the high (α, β_w^+) regime.
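The theoretical β_w^+ values used in the signed experiments below can be obtained in exactly the same way, with (225) in place of (155); the sketch mirrors the beta_on_contour helper from the earlier snippet and again treats the closed form of (225) as an assumed reading.

# Signed analogue of the contour sketch above: only the erfinv argument changes,
# per the assumed reading of (225).
import numpy as np
from scipy.special import erfinv
from scipy.optimize import brentq

def characterization_plus(alpha_w, beta_w):
    e = erfinv(2.0 * (1.0 - alpha_w) / (1.0 - beta_w) - 1.0)
    return (1.0 - beta_w) * np.sqrt(2.0 / np.pi) * np.exp(-e**2) / alpha_w - np.sqrt(2.0) * e

def beta_plus_on_contour(alpha, rho):
    alpha_w = alpha * rho**2 / (1.0 + rho**2)
    return brentq(lambda b: characterization_plus(alpha_w, b), 1e-9, alpha_w - 1e-9)

print([round(beta_plus_on_contour(a, rho=2.0), 4) for a in (0.3, 0.5, 0.7)])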
2) Low ( α, β + w ) regime — ρ = E k w lasso + k σ = 2 We first ran a set of experiments intended to show a specific behavior of the LASSO’s from (158) and(227) in what we will refer to as the low ( α, β + w ) regime. For α ∈ { . , . , . } we determined threevalues of β + w from the contour LASSO line that corresponds to ρ = 2 in the figure given in Section 3. Wethen ran (158) assuming that k ˜ x k is known and (227) using theoretical value for E c ν + where, as mentionedin Section 4, c ν + is the solution of (181). Also when running (227) we now added positivity constraintson the elements of x . We call the optimal value of the objective in (228) ζ conn + . When α = 0 . we set n = 1500 while for the other two values of α we set n = 2000 . Obtained results are presented in Table 3.The theoretical values for any of the simulated quantities in any of the simulated scenarios are as usual givenin parallel as bolded numbers. We once again observe a solid agreement between the theoretical predictionsand the results obtained through numerical experiments.
Table 3: Experimental/theoretical results for the noisy recovery through LASSO's; σ = 1, ρ = E‖w_lasso+‖_2 = 2; (6) and (227) were run a number of times. Columns: α, β_w^+/α, Eν̂_+, Eζ_conn+/√n, E‖w_conn+‖_2, Eζ_obj+/√n, E‖w_lasso+‖_2; simulated values are reported alongside the corresponding theoretical (bolded) ones.

2) High (α, β_w^+) regime — ρ = E‖w_lasso+‖_2/σ = 3

As in the previous subsection we also ran a carefully designed set of experiments intended to show a specific behavior of the LASSO's from (158) and (227) in what we will refer to as the high (α, β_w^+) regime. Following further the methodology of the previous subsection, for α ∈ { . , . , . } we determined three values of β_w^+ from the contour LASSO line that corresponds to ρ = 3 in the figure given in Section 3. We then again ran (158) and (227) (when running (227) we of course again added the positivity constraints and we again used the theoretical value for Eν̂_+). When α = 0. we set n = 1500 while for the other two values of α we set n = 2000. Obtained results are presented in Table 4. The theoretical values for all quantities of interest are again given in parallel as bolded numbers.

Table 4: Experimental/theoretical results for the noisy recovery through LASSO's; σ = 1, ρ = E‖w_lasso+‖_2 = 3; (6) and (227) were run a number of times. Columns: α, β_w^+/α, Eν̂_+, Eζ_conn+/√n, E‖w_conn+‖_2, Eζ_obj+/√n, E‖w_lasso+‖_2; simulated values are reported alongside the corresponding theoretical (bolded) ones.

We once again observe a solid agreement between the theoretical predictions and the results obtained through numerical experiments.
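Finally, the theoretical (bolded) error norms that the tables compare against are, in the worst-case sense, of the form σ√(α_w/(α−α_w)) from Corollary 1; a self-contained sketch of that computation for a given (α, β_w) pair might look as follows (the characterization is again the assumed reading of (155), and the chosen (α, β_w) pair is purely illustrative).

# Sketch: theoretical concentration point of the worst-case ||w_lasso||_2 for a given
# (alpha, beta_w) pair, i.e. sigma * sqrt(alpha_w / (alpha - alpha_w)) with alpha_w
# solving (155) (assumed reading) for that beta_w. Valid only for alpha > alpha_w.
import numpy as np
from scipy.special import erfinv
from scipy.optimize import brentq

def characterization(alpha_w, beta_w):
    e = erfinv((1.0 - alpha_w) / (1.0 - beta_w))
    return (1.0 - beta_w) * np.sqrt(2.0 / np.pi) * np.exp(-e**2) / alpha_w - np.sqrt(2.0) * e

def predicted_error_norm(alpha, beta_w, sigma=1.0):
    alpha_w = brentq(lambda a: characterization(a, beta_w), beta_w + 1e-9, 1.0 - 1e-9)
    return sigma * np.sqrt(alpha_w / (alpha - alpha_w))

print(predicted_error_norm(alpha=0.7, beta_w=0.2))   # illustrative (alpha, beta_w) pair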
For example, quantifying the performance of LASSO or SOCP optimization problems in solving "noisy" systems with a special structure of the solution vector (block-sparse, binary, box-constrained, low-rank matrix, partially known locations of nonzero components, just to name a few), or "noisy" systems with noisy (or approximately sparse) solution vectors, can then easily be handled to an ultimate precision. In a series of forthcoming papers we will present some of these applications.

References

[1] R. Adamczak, A. E. Litvak, A. Pajor, and N. Tomczak-Jaegermann. Restricted isometry property of matrices with independent columns and neighborly polytopes by random sampling.
Preprint, 2009. Available at arXiv:0904.4723.
[2] M. Akcakaya and V. Tarokh. A frame construction and a universal distortion bound for sparse representations. IEEE Trans. on Signal Processing, 56(6), June 2008.
[3] M. S. Asif and J. Romberg. On the LASSO and Dantzig selector equivalence.
Constructive Approximation, 28(3), 2008.
[6] M. Bayati and A. Montanari. The dynamics of message passing on dense graphs, with applications to compressed sensing. Preprint. Available online at arXiv:1001.3448.
[7] M. Bayati and A. Montanari. The LASSO risk for Gaussian matrices. Preprint. Available online at arXiv:1008.2581.
[8] P. J. Bickel, Y. Ritov, and A. B. Tsybakov. Simultaneous analysis of Lasso and Dantzig selector. The Annals of Statistics, 37(4):1705–1732, 2009.
[9] F. Bunea, A. B. Tsybakov, and M. H. Wegkamp. Sparsity oracle inequalities for the Lasso. Electronic Journal of Statistics, 1:169–194, 2007.
[10] E. Candes. Compressive sampling. Proc. International Congress of Mathematicians, pages 1433–1452, 2006.
[11] E. Candes. The restricted isometry property and its implications for compressed sensing. Comptes Rendus de l'Academie des Sciences, Paris, Series I, 346, pages 589–592, 2008.
[12] E. Candes, J. Romberg, and T. Tao. Robust uncertainty principles: exact signal reconstruction from highly incomplete frequency information. IEEE Trans. on Information Theory, 52:489–509, December 2006.
[13] E. Candes, J. Romberg, and T. Tao. Stable signal recovery from incomplete and inaccurate measurements. Comm. Pure Appl. Math., 59:1207–1223, 2006.
[14] E. Candes and T. Tao. Decoding by linear programming. IEEE Trans. on Information Theory, 51:4203–4215, Dec. 2005.
[15] E. Candes, M. Wakin, and S. Boyd. Enhancing sparsity by reweighted ℓ1 minimization. J. Fourier Anal. Appl., 14:877–905, 2008.
[16] E. Candes and T. Tao. The Dantzig selector: statistical estimation when p is much larger than n.
Ann. Statist., 35(6):2313–2351, 2007.
[17] S.S. Chen and D. Donoho. Examples of basis pursuit. Proceedings of Wavelet Applications in Signal and Image Processing III, 1995.
[18] S.S. Chen, D. L. Donoho, and M. A. Saunders. Atomic decomposition by basis pursuit. SIAM Journal on Scientific Computing.
Lect. Notes Math., 50, 1976.
[21] G. Cormode and S. Muthukrishnan. Combinatorial algorithms for compressed sensing. SIROCCO, 13th Colloquium on Structural Information and Communication Complexity, pages 280–294, 2006.
[22] S. F. Cotter and B. D. Rao. Sparse channel estimation via matching pursuit with application to equalization. IEEE Trans. on Communications, 50(3), 2002.
[23] W. Dai and O. Milenkovic. Subspace pursuit for compressive sensing signal reconstruction.
Preprint ,page available at arXiv:0803.0811, March 2008.[24] M. E. Davies and R. Gribonval. Restricted isometry constants where ell-p sparse recovery can fail for < p ≤ Disc. Comput. Geometry , 35(4):617–652, 2006.[26] D. Donoho, A. Maleki, and A. Montanari. The noise-sensitiviy thase transition in compressed sensing.
Preprint , Apr. 2010. available on arXiv.[27] D. Donoho and J. Tanner. Neighborliness of randomly-projected simplices in high dimensions.
Proc.National Academy of Sciences , 102(27):9452–9457, 2005.[28] D. L. Donoho, M. Elad, and V. Temlyakov. Stable recovery of sparse overcomplete representations inthe presence of noise.
IEEE Transactions on Information Theory , 52(1):6–18, Jan 2006.[29] D. L. Donoho, Y. Tsaig, I. Drori, and J.L. Starck. Sparse solution of underdetermined linear equationsby stagewise orthogonal matching pursuit.
IEEE Signal Processing Magazine , 25(2), 2008.[31] B. Efron, T. Hastie, and R. Tibshirani. Discussion: The dantzig selector: statistical estimation when pis much larger than n.
Ann. Statist. , 35(6):2358–2364, 2007.[32] S. Foucart and M. J. Lai. Sparsest solutions of underdetermined linear systems via ell-q minimizationfor < q ≤ p is much larger than n . Ann. Statist. , 35(6):2385–2391, 2007.5434] A. Gilbert, M. J. Strauss, J. A. Tropp, and R. Vershynin. Algorithmic linear dimension reduction inthe l1 norm for sparse vectors. , 2006.[35] A. Gilbert, M. J. Strauss, J. A. Tropp, and R. Vershynin. One sketch for all: fast algorithms forcompressed sensing.
ACM STOC, pages 237–246, 2007.
[36] Y. Gordon. On Milman's inequality and random subspaces which escape through a mesh in R^n. Geometric Aspects of Functional Analysis, Isr. Semin. 1986-87, Lect. Notes Math., 1317, 1988.
[37] R. Gribonval and M. Nielsen. Sparse representations in unions of bases. IEEE Trans. Inform. Theory, 49(12):3320–3325, December 2003.
[38] R. Gribonval and M. Nielsen. On the strong uniqueness of highly sparse expansions from redundant dictionaries. In Proc. Int. Conf. Independent Component Analysis (ICA'04), LNCS. Springer-Verlag, September 2004.
[39] R. Gribonval and M. Nielsen. Highly sparse representations from dictionaries are unique and independent of the sparseness measure. Appl. Comput. Harm. Anal., 22(3):335–355, May 2007.
[40] J. Haupt and R. Nowak. Signal reconstruction from noisy random projections. IEEE Trans. Information Theory.
J. Roy. Statist. Soc. Ser. B, 71:127–142, 2009.
[44] V. Koltchinskii. The Dantzig selector and sparsity oracle inequalities. Bernoulli, 15(3):799–828, 2009.
[45] J. Mairal, F. Bach, J. Ponce, G. Sapiro, and A. Zisserman. Discriminative learned dictionaries for local image analysis. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2008.
[46] I. Maravic and M. Vetterli. Sampling and reconstruction of signals with finite rate of innovation in the presence of noise. IEEE Trans. on Signal Processing, 53(8):2788–2805, August 2005.
[47] N. Meinshausen, G. Rocha, and B. Yu. Discussion: A tale of three cousins: Lasso, L2Boosting and Dantzig. Ann. Statist., 35(6):2373–2384, 2007.
[48] N. Meinshausen and B. Yu. Lasso-type recovery of sparse representations for high-dimensional data. Ann. Statist., 37(1):246–270, 2009.
[49] O. Milenkovic, R. Baraniuk, and T. Simunic-Rosing. Compressed sensing meets bioinformatics: a new DNA microarray architecture. Information Theory and Applications Workshop, 2007.
[50] D. Needell and J. A. Tropp. CoSaMP: Iterative signal recovery from incomplete and inaccurate samples. Applied and Computational Harmonic Analysis, 26(3):301–321, 2009.
[51] D. Needell and R. Vershynin. Uniform uncertainty principles and signal recovery via regularized orthogonal matching pursuit.
Foundations of Computational Mathematics, 9(3):317–334, 2009.
[52] F. Parvaresh and B. Hassibi. Explicit measurements with almost optimal thresholds for compressed sensing. IEEE ICASSP, Mar.-Apr. 2008.
[53] F. Parvaresh, H. Vikalo, S. Misra, and B. Hassibi. Recovering sparse signals using sparse measurement matrices in compressed DNA microarrays. IEEE Journal of Selected Topics in Signal Processing, 2(3):275–285, June 2008.
[54] G. Pisier. Probabilistic methods in the geometry of Banach spaces. Springer Lecture Notes.
IEEE Signal Processing Magazine, 25(2):14–20, 2008.
[58] M. Rudelson and R. Vershynin. Geometric approach to error correcting codes and reconstruction of signals. International Mathematical Research Notices, 64:4019–4041, 2005.
[59] R. Saab, R. Chartrand, and O. Yilmaz. Stable sparse approximation via nonconvex optimization. ICASSP, IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, Apr. 2008.
[60] V. Saligrama and M. Zhao. Thresholded basis pursuit: Quantizing linear programming solutions for optimal support recovery and approximation in compressed sensing. 2008. Available on arXiv.
[61] M. Stojnic. A rigorous geometry-probability equivalence in characterization of ℓ1-optimization. Available at arXiv.
[62] M. Stojnic. Upper-bounding ℓ1-optimization weak thresholds. Available at arXiv.
[63] M. Stojnic. Various thresholds for ℓ1-optimization in compressed sensing. Submitted to IEEE Trans. on Information Theory, 2009. Available at arXiv:0907.3666.
[64] R. Tibshirani. Regression shrinkage and selection via the lasso. J. Royal Statist. Society, Series B, 58:267–288, 1996.
[65] J. Tropp. Just relax: Convex programming methods for identifying sparse signals in noise.
IEEE Transactions on Information Theory, 52(3):1030–1051, March 2006.
[66] J. Tropp and A. Gilbert. Signal recovery from random measurements via orthogonal matching pursuit. IEEE Trans. on Information Theory, 53(12):4655–4666, 2007.
[67] J. A. Tropp. Greed is good: algorithmic results for sparse approximations. IEEE Trans. on Information Theory, 50(10):2231–2242, 2004.
[68] S. van de Geer. High-dimensional generalized linear models and the lasso. Ann. Statist., 36(2):614–645, 2008.
[69] H. Vikalo, F. Parvaresh, and B. Hassibi. On sparse recovery of compressed DNA microarrays. Asilomar Conference, November 2007.
[70] M. J. Wainwright. Sharp thresholds for high-dimensional and noisy recovery of sparsity. Proc. Allerton Conference on Communication, Control, and Computing.