[PDF] Sensitivity of ℓ 1 minimization to parameter choice

Abstract

The use of generalized LASSO is a common technique for recovery of structured high-dimensional signals. Each generalized LASSO program has a governing parameter whose optimal value depends on properties of the data. At this optimal value, compressed sensing theory explains why LASSO programs recover structured high-dimensional signals with minimax order-optimal error. Unfortunately in practice, the optimal choice is generally unknown and must be estimated. Thus, we investigate stability of each LASSO program with respect to its governing parameter. Our goal is to aid the practitioner in answering the following question: given real data, which LASSO program should be used? We take a step towards answering this by analyzing the case where the measurement matrix is identity (the so-called proximal denoising setup) and we use ℓ 1 regularization. For each LASSO program, we specify settings in which that program is provably unstable with respect to its governing parameter. We support our analysis with detailed numerical simulations. For example, there are settings where a 0.1% underestimate of a LASSO parameter can increase the error significantly; and a 50% underestimate can cause the error to increase by a factor of 10 9 .

Full PDF

SSensitivity of (cid:96) minimization to parameter choice A ARON B ERK ∗ Dept. Mathematics, University of British Columbia ∗ Corresponding author: [email protected]

ANIV P LAN

Dept. Mathematics, University of British Columbiayaniv @ math.ubc.ca AND ¨O ZG ¨ UR Y ILMAZ

Dept. Mathematics, University of British Columbiaoyilmaz @ math.ubc.ca The use of generalized L

ASSO is a common technique for recovery of structured high-dimensional sig-nals. Each generalized L

ASSO program has a governing parameter whose optimal value depends onproperties of the data. At this optimal value, compressed sensing theory explains why L

ASSO programsrecover structured high-dimensional signals with minimax order-optimal error. Unfortunately in practice,the optimal choice is generally unknown and must be estimated. Thus, we investigate stability of eachL

ASSO program with respect to its governing parameter. Our goal is to aid the practitioner in answeringthe following question: given real data, which L ASSO program should be used?

We take a step towardsanswering this by analyzing the case where the measurement matrix is identity (the so-called proximaldenoising setup) and we use (cid:96) regularization. For each L ASSO program, we specify settings in whichthat program is provably unstable with respect to its governing parameter. We support our analysis withdetailed numerical simulations. For example, there are settings where a 0.1% underestimate of a L

ASSO parameter can increase the error signiﬁcantly; and a 50% underestimate can cause the error to increaseby a factor of 10 . Keywords : Parameter instability, Sparse proximal denoising, L

ASSO , Compressed sensing, Convex opti-mization

1. Introduction

A fundamental problem of signal processing centers the development and analysis of efﬁcacious meth-ods for structured signal recovery that are widely applicable in practice. Frequently in applications, thesignal is assumed to be structured according to some data model and measured by a particular acqui-sition method. For example, in image deblurring one might assume the objects of interest lie in thedual of a Besov space [28, 24], while in MRI applications, one might assume the images are sparsein a wavelet domain, and measured by subsampling their Fourier coefﬁcients [27]. There is extensiveliterature concerned with those applications in which the goal is to recover the ground-truth signal fromacquired measurements by a prescribed convex program that exploits the signal structure. For exam-ple, compressed sensing (CS) has demonstrated that a scale-invariant structure such as sparsity can becaptured by convex optimization. c (cid:13) The author 2019. All rights reserved. a r X i v : . [ c s . I T ] A p r of 52 BERK, A., PLAN, Y., YILMAZ, ¨O

The above paradigm can be put in the following mathematical language. Assume that K ⊆ R N is anonempty closed and convex set. Denote the gauge of K by (cid:107) x (cid:107) K : = inf { λ > x ∈ λ K } and observethat (cid:107) · (cid:107) K may be a norm for certain choices of K . Assume that a signal x ∈ R N is “structured” in thesense that (cid:107) x (cid:107) K is relatively small. Suppose A ∈ R m × N deﬁnes the linear measurement process anddeﬁne the measurements y = Ax + η z where z ∈ R m is a possibly stochastic noise vector with noiselevel η >

0. Here, 1 (cid:54) m , N < ∞ are integers and we do not yet make an assumption on the relativesize of m and N . For τ , σ , λ >

0, we deﬁne the following three generalized L ASSO programs, which areconvex, where the goal is to best approximate the original signal x .ˆ x ( τ ; y , A , K ) : = arg min (cid:110) (cid:107) y − Ax (cid:107) : x ∈ τ K (cid:111) (LS τ , K ) x (cid:93) ( λ ; y , A , K ) : = arg min (cid:110) (cid:107) y − Ax (cid:107) + λ (cid:107) x (cid:107) K : x ∈ R N (cid:111) (QP λ , K )˜ x ( σ ; y , A , K ) : = arg min (cid:110) (cid:107) x (cid:107) K : (cid:107) y − Ax (cid:107) (cid:54) σ (cid:111) (BP σ , K )For brevity of notation, when it is clear from context, we omit explicit dependence of ˆ x , ˜ x , x (cid:93) on y , A and K . We include below several examples of this general set-up:1. To obtain total variation (TV) denoising for [continuous-valued discrete] images, deﬁne for x ∈ R N × N , (cid:107) x (cid:107) BV : = (cid:107) x (cid:107) + ∑ α ∈ [ N ] ∑ β ∈ ν ( α ) | x α − x β | , where [ N ] = { , , . . . , N } and ν : [ N ] → P ([ N ] ) is the neighbour map that determines which“pixels” x β of the image are the neighbours of the pixel x α . If α = ( i , j ) and 2 (cid:54) i , j (cid:54) N − ν ( i , j ) = { ( i − , j ) , ( i , j − ) , ( i + , j ) , ( i , j + ) } with a variety of choicesfor the remaining indices. So deﬁned, x (cid:93) ( λ ; y , I , K ) is a well-known denoising model for two-dimensional images when A = I is the identity matrix and K : = {(cid:107) x (cid:107) BV (cid:54) } [38]. Instead deﬁning (cid:107) x (cid:107) BV : = (cid:107) x (cid:107) + ∑ N − i = | x i + − x i | for x ∈ R N , one obtains an equivalent denoising method for one-dimensional signals. With minor modiﬁcation of x (cid:93) ( λ ) to allow for A to act as a bounded linearoperator on x ∈ R N × N ( e.g., convolution with a Gaussian kernel), one may extend the model forimage deblurring [15].2. Say that x ∈ R N is s -sparse if x ∈ Σ Ns : = { x ∈ R N : (cid:107) x (cid:107) (cid:54) s } where (cid:107) x (cid:107) = { j : x j (cid:54) = } . Deﬁne K : = B N , suppose x ∈ R N is s -sparse for some s (cid:62) A ∈ R m × N is a Gaussianrandom matrix with A i j iid ∼ N ( , m − / ) . Then we obtain three common variants of the L ASSO that solve the “vanilla” CS problem: the constrained L

ASSO yielding ˆ x ( τ ; y , A , K ) , basis pursuitdenoise yielding ˜ x ( σ ; y , A , K ) , and the unconstrained L ASSO yielding x (cid:93) ( λ ; y , A , K ) .3. When A = I is the identity matrix, ( LS τ , K ) yields the orthogonal projection onto τ K , which wedenote by P τ K ( y ) : = ˆ x ( τ ; y , I , K ) . Similarly, ( QP λ , K ) yields the proximal operator for the gaugeinduced by K , which we denote by prox λ − K ( y ) : = x (cid:93) ( λ ; y , I , K ) . Proximal operators are theworkhorses of proximal algorithms. Projected gradient descent methods rely on P τ K ( y ) , whileprox λ − K ( y ) is central to proximal gradient descent methods.4. For example, suppose y = Φ x + η z where x is s -sparse, Φ ∈ R m × N is a Gaussian randommatrix with m (cid:28) N and η z is scaled normal random noise. A well-known way of solving for ENSITIVITY OF (cid:96) MINIMIZATION TO PARAMETER CHOICE x ( τ ; y , Φ , B N ) where B N is the unit (cid:96) ball, is to compute the following projected gradient descentscheme: x t + : = P τ B N ( x t − µ t ∇ (cid:107) Φ x − y (cid:107) ) .

5. Assume that x (cid:48) ∈ R N is s -sparse and let x = Ψ − x (cid:48) where Ψ is the orthonormal DFT matrix.Given y = x + η z , the vector ˆ x ( τ ; y , Ψ − , B N ) gives an analogue of running so-called constrainedproximal denoising in Fourier space.6. Consider a matrix x ∈ R N × N , let (cid:107) x (cid:107) ∗ denote its nuclear norm and deﬁne K : = { x ∈ R N × N : (cid:107) x (cid:107) ∗ (cid:54) } . Then ˜ x ( σ ) gives the standard optimization program for recovering a low-rank matrix x ∈ R N × N from measurements Ax : = (cid:104) A i , x (cid:105) = ∑ α ∈ [ N ] A i , α x α .In both the second and ﬁnal examples, the signal x does not (necessarily) belong to the structure set K . Instead, K serves as a kind of structural proxy. To clarify, K = B N in the second example, which isa structural proxy for sparse vectors in the sense that if x ∈ R N is s -sparse then (cid:107) x (cid:107) / (cid:107) x (cid:107) is relativelysmall compared to non-sparse vectors. A similar statement holds for low-rank matrices and the nuclearnorm, as in the ﬁnal example.Because of the myriad applications of this class of programs to real-world problems, it is imperativeto fully characterize the performance and stability of these algorithms. For example, the error rates ofˆ x ( τ ) are well-known when τ is equal to the optimal parameter choice, A is a subgaussian random matrixand K is a symmetric, closed, convex set containing the origin [21, 26, 31]. However, the error of theestimator ˆ x ( τ ) is not fully characterized in this setting for values of τ that are not the optimal choice.Similarly, there lacks a full comparison of the error behaviour between the three estimators ˆ x ( τ ) , ˜ x ( σ ) and x (cid:93) ( λ ) as a function of their governing parameters. It is an open question if there are settings inwhich one estimator is always preferable to another.Perhaps the most common example of where these programs are used is CS. CS is a provably stableand robust technique for simultaneous data acquisition and dimension reduction [21]. Take the linearmeasurement model y = Ax , where x ∈ R N is s -sparse. The now classical CS result [10, 11, 12,16, 17, 21] shows if A is suitably random and has m (cid:62) Cs log ( N / s ) rows, then one may efﬁcientlyrecover x from ( y , A ) . Numerical implementations of CS are commonly tied to one of three convex (cid:96) programs: constrained L ASSO , unconstrained L

ASSO , and quadratically constrained basis pursuit[40]. The advent of suitable fast and scalable algorithms has made the associated family of convex (cid:96) minimization problems extremely useful in practice [22, 23, 32, 40].Proximal Denoising (PD) is a simpliﬁcation of its more general CS counterpart, in which the mea-surement matrix is identity. PD uses convex optimization as a means to recover a structured signalcorrupted by additive noise. We deﬁne three convex programs for PD: constrained proximal denoising,basis pursuit proximal denoising, and unconstrained proximal denoising. To bear greatest relevance toCS, we assume that x is s -sparse, having no more than s non-zero entries, and that y = x + η z , where z iid ∼ N ( , ) and η >

0. For τ , σ , λ >

0, respectively,ˆ x ( τ ) : = arg min x ∈ R N (cid:8) (cid:107) y − x (cid:107) : (cid:107) x (cid:107) (cid:54) τ (cid:9) (LS ∗ τ )˜ x ( σ ) : = arg min x ∈ R N (cid:8) (cid:107) x (cid:107) : (cid:107) y − x (cid:107) (cid:54) σ (cid:9) (BP ∗ σ ) x (cid:93) ( λ ) : = arg min x ∈ R N (cid:8) (cid:107) y − x (cid:107) + λ (cid:107) x (cid:107) (cid:9) . (QP ∗ λ ) of 52 BERK, A., PLAN, Y., YILMAZ, ¨O

These are clear simpliﬁcations of ( LS τ , K ) , ( QP λ , K ) and ( BP σ , K ) introduced above, in which K = B N isthe (cid:96) ball and where we use ∗ to denote that the measurement matrix A ∈ R N × N is identity.Following the dicussion above, minimax order-optimal recovery results for CS and PD programsrely on the ability to make a speciﬁc choice of the program’s governing parameter ( i.e., “using anoracle”) [21]. However, the optimal choice of the governing parameter for these programs is generallyunknown in practice. Consequently, it is desirable that the error of the solution exhibit stability withrespect to variation of the parameter about its optimal setting. If the optimal choice of parameter yieldsorder-optimal recovery error, then one may hope that a “nearly” optimal choice of parameter admits“nearly” order-optimal recovery error, too, in the sense that the discrepancy in error is no greater thana multiplicative constant that depends smoothly on the discrepancy in parameter choice. For example,if R ( α ) is the mean-squared error of a convex program with parameter α >

0, and α ∗ > α , such as R ( α ) (cid:46) A ( α ) R ( α ∗ ) , where A : R → R + is a nonnegative smooth function with A ( α ∗ ) =

1. For example, the risk for ( QP ∗ λ ) satisﬁes this expression with A ( λ ) = ( λ / λ ∗ ) when λ (cid:62) λ ∗ .Unfortunately, such a hope cannot be guaranteed in general. We prove the existence of regimesin which PD programs exhibit parameter instability — small changes in parameter values can lead toblow-up in risk. Moreover, since the three versions of PD are equivalent in a sense ( cf. Proposition 2.4),one might think it does not matter which to choose in practice. However, in this paper we demonstrateregimes in which one program exhibits parameter instability, while the other two do not. For example,in the very sparse regime, our theory and simulations suggest not to use ( BP ∗ σ ) , while in the low-noiseregime, they suggest not to use ( LS ∗ τ ) . At the same time, we identify situations where PD programsperform well in theory and simulations alike.We explore the connection between PD and CS numerically, observing that our theoretical resultsfor PD are mirrored in the CS setup. This holds in both completely synthetic experiments, and for amore realistic example using the Shepp-Logan phantom. Thus, the theoretical results in this paper canhelp practitioners decide which program to use in CS problems with real data.

2. Summary of results to follow

This section contains three sibling results that simplify the main results in the next sections by consid-ering asymptotic versions of them. By “risk”, we mean the noise-normalized expected squared error(nnse) of an estimator. The risks for the estimators ˆ x ( τ ) , x (cid:93) ( λ ) and ˜ x ( σ ) are, respectively:ˆ R ( τ ; x , N , η ) : = η − E (cid:107) ˆ x ( τ ) − x (cid:107) , R (cid:93) ( λ ; x , N , η ) : = η − E (cid:107) x (cid:93) ( ηλ ) − x (cid:107) , ˜ R ( σ ; x , N , η ) : = η − E (cid:107) ˜ x ( σ ) − x (cid:107) . Denote Σ Ns : = { x ∈ R N : (cid:107) x (cid:107) (cid:54) s } where (cid:107) x (cid:107) gives the number of non-zero entries of x ; it is not anorm. Denote by R ∗ ( s , N ) the following optimally tuned worst-case risk for ( LS ∗ τ ) : R ∗ ( s , N ) : = sup x ∈ Σ Ns ˆ R ( (cid:107) x (cid:107) ; x , N , η ) = max x ∈ Σ Ns (cid:107) x (cid:107) = lim η → ˆ R ( x , N , η ) . ENSITIVITY OF (cid:96) MINIMIZATION TO PARAMETER CHOICE R ∗ ( s , N ) as a benchmark, noting it isorder-optimal in Proposition 2.5.In section 4, we show that ( LS ∗ τ ) exhibits an asymptotic phase transition in the low-noise regime.There is exactly one value τ ∗ of the governing parameter yielding minimax order-optimal error, with anychoice τ (cid:54) = τ ∗ yielding markedly worse behaviour. The intuition for this result is that ( LS ∗ τ ) is extremelysensitive to the value of τ in the low-noise regime, making empirical use of ( LS ∗ τ ) woefully unstable inthis regime.T HEOREM N → ∞ max x ∈ Σ Ns (cid:107) x (cid:107) = lim η → ˆ R ( τ ; x , N , η ) R ∗ ( s , N ) =  ∞ τ < τ ∗ τ = τ ∗ = ∞ τ > τ ∗ Next, in section 5, we show that ( QP ∗ λ ) exhibits an asymptotic phase transition. The worst-case riskover x ∈ Σ Ns is minimized for parameter choice λ ∗ = O ( (cid:112) log ( N / s )) [30]. While λ ∗ has no closedform expression, it satisﬁes λ ∗ / (cid:112) ( N ) N → ∞ −−−→ s ﬁxed (Proposition 5.3). Thus, we considerthe normalized parameter µ = λ / (cid:112) ( N ) . The risk R (cid:93) ( λ ; x , N , η ) is minimax order-optimal when µ > µ < HEOREM λ ( µ , N ) : = µ √ N for µ >

0. Then,lim N → ∞ sup x ∈ Σ Ns R (cid:93) ( λ ( µ , N ) ; x , N , η ) R ∗ ( s , N ) = (cid:40) O ( µ ) µ (cid:62) ∞ µ < ( BP ∗ σ ) is poorly behaved for all σ > x is very sparse.Namely, ˜ R ( σ ; x , N , η ) is asymptotically suboptimal for any σ > s / N is sufﬁciently small.T HEOREM N → ∞ sup x ∈ Σ Ns inf σ > ˜ R ( σ ; x , N , η ) R ∗ ( s , N ) = ∞ All numerical results are discussed in section 7, and proofs of most theoretical results are deferredto section 8. Next, we add two clariﬁcations. First, the three PD programs are equivalent in a sense.P

ROPOSITION (cid:54) = x ∈ R N and λ >

0. Where x (cid:93) ( λ ) solves ( QP ∗ λ ) , deﬁne τ : = (cid:107) x (cid:93) ( λ ) (cid:107) and σ : = (cid:107) y − x (cid:93) ( λ ) (cid:107) . Then x (cid:93) ( λ ) solves ( LS ∗ τ ) and ( BP ∗ σ ) .However, τ and σ have stochastic dependence on z , and this mapping may not be smooth. Thus,parameter stability of one program is not implied by that of another. Second, R ∗ ( s , N ) has the desirableproperty that it is computable up to multiplicative constants. The proof follows by [30] and standardbounds in [21]. We don’t claim novelty for this result, and defer its full proof to section 8.2.P ROPOSITION s (cid:62) , N (cid:62) η > y = x + η z for z ∈ R N with z i iid ∼ N ( , ) . Let M ∗ ( s , N ) : = inf x ∗ sup x ∈ Σ Ns η − (cid:107) x ∗ − x (cid:107) be the minimax risk over arbitrary estimators x ∗ = x ∗ ( y ) . There is c , C , C > N (cid:62) N = N ( s ) , with N (cid:62) cs log ( N / s ) (cid:54) M ∗ ( s , N ) (cid:54) inf λ > sup x ∈ Σ Ns R (cid:93) ( λ ; x , N , η ) (cid:54) C R ∗ ( s , N ) (cid:54) C s log ( N / s ) . of 52 BERK, A., PLAN, Y., YILMAZ, ¨O

Thus, in the simpliﬁed theorems above, we could have normalized by any of the above expressionsinstead of R ∗ ( s , N ) , because all three expressions are asymptotically equivalent up to constants. Incontrast, a consequence of Proposition 2.5 using Theorem 2.3 is thatinf σ > sup x ∈ Σ Ns ˜ R ( σ ; x , N , η ) (cid:62) sup x ∈ Σ Ns inf σ > ˜ R ( σ ; x , N , η ) (cid:29) R ∗ ( s , N ) . In particular, removing the parameters’ noise dependence destroys the equivalence attained in Proposi-tion 2.4.2.1

Related work

PD is a simple model that elucidates crucial properties of models in general [19]. As a central model fordenoising, it lays the groundwork for CS, deconvolution and inpainting problems [20]. A fundamentalsignal recovery phase transition in CS is predicted by geometric properties of PD [2], because theminimax risk for PD is equal to the statistical dimension of the signal class [30]. This quantity is ageneralized version of R ∗ ( s , N ) introduced above.Robustness of PD to inexact information is discussed brieﬂy in [30], wherein sensitivity to con-straint set perturbation is quantiﬁed, including an expression for right-sided stability of unconstrainedPD. Essentially, PD programs are proximal operators, a powerful tool in convex and non-convex opti-mization [7, 14]. For a thorough treatment of proximal operators and proximal point algorithms, werefer the reader to [6, 18, 37]. Thus is PD interesting in its own right, as argued in [30].Equivalence of the above programs is illuminated from several perspectives [6, 40, 30]. PD risk isconsidered with more general convex constraints [13]. A connection has been made between the riskof Unconstrained L ASSO and R (cid:93) ( λ ; x , N , η ) [4, 3]. In addition, there are near-optimal error boundsfor worst-case noise demonstrating that equality-constrained basis pursuit ( σ =

0) performs well underthe noisy CS model ( η (cid:54) =

0) [43]. It should be noted that these results do not contradict those of thiswork, as random noise can be expected to perform better than worst-case noise in general. Recently,a bound on the unconstrained L

ASSO

MSE has been proven, which is uniform in λ and uniform in x ∈ B Np [29, Thm 3.2]. Note that this also does not run contrary to the left-sided parameter instabilityresult mentioned above as the uniformity in λ is over a pre-speciﬁed interval chosen independently ofthe optimal parameter choice λ ∗ , and the assumption on signal structure is different.2.2 Notation

We use the standard notation for the Euclidean p norm, (cid:107) · (cid:107) p , for values p (cid:62)

1, and occasionally makeuse of the overloaded notation (cid:107) x (cid:107) : = { i ∈ [ N ] : x i (cid:54) = } to denote the number of nonzero entries ofa vector x . Let N ∈ N be an integer representing dimension. Let x ∈ Σ Ns ⊆ R N be an s -sparse signalwith support set T ⊆ [ N ] : = { , , . . . , N } , where s (cid:28) N and Σ Ns : = { x ∈ R N : 0 (cid:54) (cid:107) x (cid:107) (cid:54) s } denotesthe set of s -sparse vectors. We use x or x (cid:48) to denote an arbitrary s -sparse signal, whereas x denotes thesignal for a given problem. Let z ∈ R N be a normal random vector with covariance matrix equal to theidentity, z i iid ∼ N ( , ) . Denote by η ∈ ( , ) the standard deviation and suppose y = x + η z . Moreover,let Z ∼ N ( , ) denote a standard normal random variable. Denote B Np : = { x ∈ R N : (cid:107) x (cid:107) p (cid:54) } thestandard (cid:96) p ball and for a set C ⊆ R N , denote by γ C : = { γ x : x ∈ C } the scaling of C by γ . Alladditional notation shall be introduced in context. ENSITIVITY OF (cid:96) MINIMIZATION TO PARAMETER CHOICE

3. Main theoretical tools

In this section, we synthesize several known results from convex analysis and probability theory, somewith proof sketches to provide intuition. We outline notation to refer to common objects from convexanalysis. We introduce two well-known tools for characterizing the effective dimension of a set, and statea result that connects these tools with PD estimators [30]. We state a projection lemma that introducesa notion of ordering for projection operators. To our knowledge, this ﬁnal result in 3.1.1 novel. In3.2.1 we state two recent results giving reﬁned bounds on the Gaussian mean width of convex polytopesintersected with Euclidean balls [5].3.1

Tools from convex analysis

Let f : R N → R be a convex function and let x ∈ R N . Denote by ∂ f ( x ) the subdifferential of f at thepoint x , ∂ f ( x ) : = { v ∈ R N : ∀ y , f ( y ) (cid:62) f ( x ) + (cid:104) v , y − x (cid:105)} Note that ∂ f ( x ) is a nonempty, convex and compact set. Given A ⊆ R N and λ >

0, denote λ A : = { λ a : a ∈ A } , cone ( A ) : = { λ x : x ∈ A , λ (cid:62) } . For a nonempty set C and x ∈ R N , denote the distance of x to C by dist ( x , C ) : = inf w ∈ C (cid:107) x − w (cid:107) . If C is also closed and convex, then there exists a unique point in C attaining the minimum, denotedP C ( x ) : = arg min w ∈ C (cid:107) x − w (cid:107) . Denote by C ◦ : = { v | ∀ x ∈ C , (cid:104) v , x (cid:105) (cid:54) } the polar cone of C ; and deﬁne the statistical dimension [2] of C by D ( C ) : = E [ dist ( g , C ) ] , g ∼ N ( , I N ) The descent set of a non-empty convex set C at a point x ∈ R N is given by F C ( x ) : = { h : x + h ∈ C } .The tangent cone is given by T C ( x ) : = cl ( cone ( F C ( x ))) where cl denotes the closure operation; it is thesmallest closed cone containing the set of feasible directions. With these tools, we recall the result of[30] in the PD context, giving a precise characterization of the risk for ( LS ∗ τ ) .T HEOREM C be a non-empty closed and convex set, let x ∈ C be anarbitrary vector and assume that z ∼ N ( , I N ) . Thensup η > η E (cid:107) P C ( x + η z ) − x (cid:107) = D ( T C ( x ) ◦ ) . (3.1)In that work, the authors note D ( T C ( x ) ◦ ) ≈ w (cid:0) T C ( x ) ∩ B N (cid:1) , where w ( · ) denotes the Gaussian meanwidth. Speciﬁcally, Gaussian mean width gives a near-optimal characterization of the risk for ( LS ∗ τ ) .Thus, w ( · ) represents an effective dimension of a structured convex set [34, 35, 36].D EFINITION K ⊆ R N is givenby w ( K ) : = E sup x ∈ K (cid:104) x , g (cid:105) , g ∼ N ( , I N ) . of 52 BERK, A., PLAN, Y., YILMAZ, ¨O

Next, we include one set of conditions under which ˜ x ( σ ) lies in the descent cone of the structure set,yielding a useful norm inequality. This proposition is a simpliﬁcation of classical results found in [21].P ROPOSITION s (cid:62)

0, let x ∈ Σ Ns . Suppose y = x + η z where η > z ∈ R N with z i iid ∼ N ( , ) lies on the event E : = {(cid:107) z (cid:107) (cid:54) N − √ N } . If ˜ x ( σ ) solves ( BP ∗ σ ) with σ (cid:62) η √ N , then (cid:107) ˜ x (cid:107) (cid:54) (cid:107) x (cid:107) and (cid:107) ˜ x − x (cid:107) (cid:54) √ s (cid:107) ˜ x − x (cid:107) . Proof of Proposition 3.3.

Since σ (cid:62) N and (cid:107) z (cid:107) (cid:54) N − √ N (cid:54) √ N , it follows by ˜ x ( σ ) being theminimizer and x being in the feasible set that (cid:107) ˜ x ( σ ) (cid:107) < (cid:107) x (cid:107) on E . Hence, ˜ x − x ∈ T B N ( x ) , the (cid:96) tangent cone of x . By Lemma 3.1, one obtains the desired identity, (cid:107) ˜ x − x (cid:107) = (cid:107) h (cid:107) = (cid:107) h T (cid:107) + (cid:107) h T C (cid:107) (cid:54) (cid:104) sgn ( ˜ x T − x ) , h T (cid:105) − (cid:104) sgn x , h (cid:105) (cid:54) (cid:107) sgn ( ˜ x T − x ) − sgn ( x ) (cid:107) (cid:107) h (cid:107) (cid:54) √ s (cid:107) h (cid:107) . (cid:3) L EMMA (cid:96) descent cone characterization) Let x ∈ Σ Ns with non-empty support set T ⊆ [ N ] and deﬁne C : = (cid:107) x (cid:107) B N . Let T C ( x ) = cone ( F C ( x )) be the tangent cone of the scaled (cid:96) ballabout the point x and deﬁne the set K ( x ) : = { h ∈ R N : (cid:107) h T C (cid:107) (cid:54) −(cid:104) sgn ( x ) , h (cid:105)} . Then T C ( x ) = K ( x ) .3.1.1 Projection lemma.

We introduce a result that to our knowledge is novel: the projection lemma.Given z ∈ R N , this lemma orders the one-parameter family of projections z t : = P tK ( z ) as a function of t > K is a closed and convex set with 0 ∈ K . Namely, as depicted in Figure 1a, (cid:107) P tK ( z ) (cid:107) (cid:54) (cid:107) P uK ( z ) (cid:107) for 0 < t (cid:54) u < ∞ .This lemma has immediate consequences for the ability of proximal algorithms to recover the 0vector from corrupted measurements. Note that the set K need be neither symmetric nor origin-centered,but it must be convex, in general; we have included a pictorial counterexample in Figure 1b to depictwhy.L EMMA K ⊆ R n be a non-empty closed and convex set with 0 ∈ K , and ﬁx λ (cid:62)

1. For z ∈ R n , (cid:107) P K ( z ) (cid:107) (cid:54) (cid:107) P λ K ( z ) (cid:107) . The following is an alternative version of Lemma 3.2 which quickly follows.C

OROLLARY K ⊆ R n be a non-empty closed and convex set with 0 ∈ K and let (cid:107) · (cid:107) K be thegauge of K . Given y ∈ R n deﬁne x α : = arg min {(cid:107) x (cid:107) K : (cid:107) x − y (cid:107) (cid:54) α } Then (cid:107) x α (cid:107) is decreasing in α .R EMARK f ( t ) : = (cid:107) u t (cid:107) , where u t : = t P λ K ( z ) + ( − t ) P K ( z ) , and yields a growth rate of this derivative at t = t f ( t ) (cid:12)(cid:12)(cid:12)(cid:12) t = = (cid:104) z , z λ − z (cid:105) (cid:62) (cid:107) z λ − z (cid:107) λ − . ENSITIVITY OF (cid:96) MINIMIZATION TO PARAMETER CHOICE

Tools from probability theory

For a full treatment of the topics herein, we refer the reader to [21, 42, 41, 1]. We start by deﬁning sub-gaussian random variables and stating Hoeffding’s inequality, which characterizes how they concentratein high dimensions.D

EFINITION ψ -norm) The subgaussian norm of a random variable X is (cid:107) X (cid:107) ψ : = sup p (cid:62) p − / (cid:0) E | X | p (cid:1) / p A random variable X is subgaussian iff (cid:107) X (cid:107) ψ < ∞ .T HEOREM X i , i = , . . . n , be mean-zerosubgaussian random variables and let a ∈ R n . For t > P (cid:0)(cid:12)(cid:12) n ∑ i = a i X i (cid:12)(cid:12) (cid:62) t (cid:1) (cid:54) e · exp (cid:16) − t C ∑ ni = a i (cid:107) X i (cid:107) ψ (cid:17) One may deﬁne subexponential random variables in a way similar to subgaussian random variables.They, too, admit a concentration inequality.D

EFINITION ψ norm) The subexponential norm of a random variable is (cid:107) X (cid:107) ψ : = sup p (cid:62) p − (cid:0) E | X | p (cid:1) / p . A random variable X is subexponential iff (cid:107) X (cid:107) ψ < ∞ . zz z 𝜆 (a) z 𝜆 K λ K (b) F IG . 1: (a) A visualization of the lemma. Projecting z onto the outer and inner set gives z λ and z ,respectively; evidently, (cid:107) z (cid:107) (cid:54) (cid:107) z λ (cid:107) . (b) A counterexample using scaled (cid:96) p balls for some 0 < p < K must be convex in general. Here, z is projected inwards onto λ K , but towards a distalvertex when projected onto K .0 of 52 BERK, A., PLAN, Y., YILMAZ, ¨O T HEOREM X , . . . , X n be independent mean-zerosubexponential random variables. Then for all { a , . . . , a n } ∈ R n , P (cid:0) | n ∑ i = a i X i | (cid:62) t (cid:1) (cid:54) (cid:16) − C min (cid:110) t k (cid:107) a (cid:107) , tk (cid:107) a (cid:107) ∞ (cid:111)(cid:17) , t (cid:62) , k : = max i (cid:107) X i (cid:107) ψ Finally, we introduce a result of Borell, Tsirelson, Ibragimov, and Sudakov about Gaussian pro-cesses, which states that the supremum of a Gaussian process deﬁned over a topological space T behavesnearly like a normal random variable. For a proof of this result, we refer the reader to [1].T HEOREM T be a topological space and let { f t } t ∈ T be a centred( i.e., mean-zero) Gaussian process on T with (cid:107) f (cid:107) T : = sup t ∈ T | f t | σ T : = sup t ∈ T E (cid:2) | f t | (cid:3) such that (cid:107) f (cid:107) T is almost surely ﬁnite. Then E (cid:107) f (cid:107) T and σ T are both ﬁnite and for each u > P (cid:0) (cid:107) f (cid:107) T > E (cid:107) f (cid:107) T + u (cid:1) (cid:54) exp (cid:0) − u σ T (cid:1) . Reﬁned bounds on Gaussian mean width.

Two recent results yield improved upper- and lower-bounds on the GMW of convex polytopes intersected with Euclidean balls [5]. Each is integral todemonstrating ( BP ∗ σ ) parameter instability. The ﬁrst describes how local effective dimension of a convexhull scales with neighbourhood size.P ROPOSITION m (cid:62) N (cid:62)

2. Let T be the convex hull of 2 N points in R m andassume T ⊆ B m . Then for γ ∈ ( , ) , w ( T ∩ γ B m ) (cid:54) min (cid:8) (cid:113) max (cid:8) , log ( eN γ ) (cid:9) , γ (cid:112) min { m , N } (cid:9) The second result shows that Proposition 3.9 is tight up to multiplicative constants.P

ROPOSITION m (cid:62) N (cid:62)

2. Let γ ∈ ( , ] and assume for simplicitythat s = / γ is a positive integer such that s (cid:54) N /

5. Let T be the convex hull of the 2 N points {± M , . . . , ± M N } ⊆ S m − . Assume that for some real number κ ∈ ( , ) we have κ (cid:107) θ (cid:107) (cid:54) (cid:107) M θ (cid:107) for all θ ∈ R N such that (cid:107) θ (cid:107) (cid:54) s , Then w ( T ∩ γ B m ) (cid:62) ( √ / ) κ (cid:113) log ( N γ / ) . ( LS ∗ τ ) parameter instability We describe a parameter instability regime for ( LS ∗ τ ) , revealing a regime in which there is exactly onechoice of parameter τ ∗ > R ( τ ∗ ; x , N , η ) is minimax order-optimal. Speciﬁcally, Theorem4.1 shows that ˆ R ( τ ; x , N , η ) exhibits an asymptotic singularity in the limiting low-noise regime (bylow-noise regime, we mean hereafter the regime in which η → ENSITIVITY OF (cid:96) MINIMIZATION TO PARAMETER CHOICE

11 of 52In 7.1 we complement this asymptotic result with numerical simulations that contrast how the threerisks behave in a simpliﬁed experimental context. The numerics support that Theorem 4.1 providesaccurate intuition to guide how ( LS ∗ τ ) can be expected to perform in practice when the noise level issmall relative to the magnitude of the signal’s entries.The analogue of the classical CS result is included in our result as the special case τ = τ ∗ = (cid:107) x (cid:107) ( cf . Proposition 2.5). The cases for τ (cid:54) = τ ∗ may seem surprising initially, but can be understood withthe following key intuition: the approximation error is controlled by the effective dimension of theconstraint set.First, one should generally not expect good recovery when the signal lies outside the constraint set.When τ < τ ∗ , y lies outside of the constraint set with high probability in the limiting low-noise regime.Accordingly, there is a positive distance between the true signal and the recovered signal which may belower-bounded by a dimension-independent constant. Hence, the risk is determined by the reciprocal ofthe noise variance, growing unboundedly as η → τ > τ ∗ , y lies within the constraint set with high probability in the limitinglow-noise regime. Thus, the problem is essentially unconstrained in this setting, so the effective dimen-sion of the constraint set for the problem should be considered equal to that of the ambient dimension.In particular, one should expect that the error be proportional to N .T HEOREM ( LS ∗ τ ) parameter instability) Let s (cid:62) , η > x ∈ Σ Ns \ Σ Ns − . Given τ > η → ˆ R ( τ ; x , N , η ) =  ∞ τ < (cid:107) x (cid:107) R ∗ ( s , N ) τ = (cid:107) x (cid:107) N τ > (cid:107) x (cid:107) In summary, the surprising part of this result is that there is a sharp phase transition between twounstable regimes, with the optimal regime lying on the boundary of the two phases. We argue thissuggests that there is only one reasonable choice for τ in the low-noise regime. Observe, that Theorem4.1 connects with Theorem 2.1 by taking the limit of the problem as N → ∞ after ﬁrst restricting tosignals of a ﬁnite norm (arbitrarily, 1) so that the essence of the result is preserved. ( QP ∗ λ ) parameter instability We show that R (cid:93) ( λ ; x , N , η ) is smooth in the low-noise regime. This result becomes evident from theclosed-form expression for R (cid:93) ( λ ; s , N ) that emerges for this special case. At ﬁrst, this smoothness resultseems to stand in contrast to the “cusp-like” behaviour that we observe analytically and numerically forlim η → ˆ R ( τ ; x , N , η ) ( cf . Figure 3). However, R (cid:93) ( λ ; s , N ) possesses unfavourable dependence on N thatis elucidated in Theorem 5.2.Brieﬂy, if the governing parameter λ is too small, then the risk grows unboundedly as a power lawof N in high dimensions. This rate of growth implies that the risk is minimax suboptimal for such λ . Toour knowledge, this result is novel. In contrast, for all suitably large λ , R (cid:93) ( λ ; s , N ) admits the desirableproperty suggested in section 1: R (cid:93) ( λ ; s , N ) (cid:46) ( λ / λ ∗ ) R ∗ ( s , N ) . The result, stated in Theorem 5.4,essentially follows from known L ASSO bounds for RIP matrices: R ( λ ) (cid:54) λ s . Thus, in the low-noiseregime, R (cid:93) ( λ ; x , N , η ) exhibits a phase transition between order-optimal and suboptimal regimes.The numerics of section 7.2 suggest a viable constant for the growth rate of the risk when λ is toosmall, and support Theorem 5.4 in the case where λ is sufﬁciently large. These numerics also clarifythe role that the dimension-dependent growth rate serves in the stability of ( QP ∗ λ ) about λ ∗ .2 of 52 BERK, A., PLAN, Y., YILMAZ, ¨O

Smoothness of the risk

The ( QP ∗ λ ) estimator for a problem with noise level η > λ > ηλ . In particular, x (cid:93) ( ηλ ) is a smooth function with respect to the problem parame-ters, hence so is R (cid:93) ( λ ; x , N , η ) (being a composition of smooth functions). However, the closed formexpression for R (cid:93) ( λ ; x , N , η ) is unavailable, because the expectations involved are untractable in gen-eral. When the noise-level vanishes this is no longer true and we may compute an exact expression interms of λ , s and N for the risk. Speciﬁcally, we note that the smoothness result below is not special tothe case where η →

0, but is notable because of the closed form expression for the risk that is obtained.Moreover, the result is notable, because the closed form expression is equivalent (in some preciselydeﬁnable sense) to R (cid:93) ( λ ; x , N , η ) when η > x are all large( i.e., “the signal is well-separated from the noise”). We make this connection after the main resultsdiscussed below. In turn, this connects Theorem 5.2 and Theorem 5.4 to Theorem 2.2, where the ana-lytic expression is used to derive the so-called left-sided parameter instability and right-sided parameterstability results.P ROPOSITION R (cid:93) ( λ ; x , N , η ) smoothness) Let s (cid:62) , N (cid:62) , x ∈ Σ Ns and η >

0. For λ > η → R (cid:93) ( λ ; x , N , η ) = s ( + λ ) + ( N − s ) (cid:2) ( + λ ) Φ ( − λ ) − λ φ ( λ ) (cid:3) (5.1)R EMARK R (cid:93) ( λ ; s , N ) : = lim η → R (cid:93) ( λ ; x , N , η ) ;and deﬁne the function G ( λ ) : = ( + λ ) Φ ( − λ ) − λ φ ( λ ) for notational brevity, where φ and Φ denotethe standard normal pdf and cdf, respectively.An equivalence in behaviour is seen between the low-noise regime η → | x , j | → ∞ for j ∈ supp ( x ) with η >

0. For both programs, the noise level is “effectively” zero bycomparison to the size of the entries of x . This type of scale invariance allows us to re-state theprevious result as a max formulation.C OROLLARY s (cid:62) , N (cid:62) , x ∈ Σ Ns and η >

0. For λ > x ∈ Σ Ns R (cid:93) ( λ ; x , N , η ) = R (cid:93) ( λ ; s , N ) Left-sided parameter instability

We reveal an asymptotic regime in which R (cid:93) ( λ ; s , N ) is minimax suboptimal for all λ sufﬁciently small.The result follows from showing the risk derivative is large for all λ < ¯ λ when s is sufﬁciently smallrelative to N . Here, ¯ λ : = √ N is an Ansatz estimate of λ ∗ used to make the proof proceed cleanly.Finally, we show in what sense ¯ λ is asymptotically equivalent to λ ∗ in Proposition 5.3.The proof for the bound on the risk derivative follows by calculus and a standard estimate of Φ ( − λ ) in terms of φ ( λ ) . Its scaling with respect to the ambient dimension destroys the optimal behaviour of R (cid:93) ( λ ; x , N ) for all λ < ¯ λ . The proof of this result, stated in Theorem 5.2, follows immediately fromLemma 5.1 by the fundamental theorem of calculus.L EMMA s (cid:62)

1. For any ε ∈ ( , ) , there exists C > N = N ( s ) (cid:62) s so that for all N (cid:62) N − dd u (cid:12)(cid:12)(cid:12)(cid:12) u = − ε R (cid:93) ( u ¯ λ ; s , N ) (cid:62) CN ε ENSITIVITY OF (cid:96) MINIMIZATION TO PARAMETER CHOICE

13 of 52where ¯ λ = (cid:112) ( N ) is an estimate of the optimal parameter choice for ( QP ∗ λ ) .T HEOREM ( QP ∗ λ ) parameter instability) Under the conditions of the previous lemma, for ε ∈ ( , ) there exists a constant C > N (cid:62) N (cid:62) N , R (cid:93) (( − ε ) ¯ λ ; s , N ) (cid:62) C N ε log N . Though these results may initially seem surprising, we claim they are sensible when viewed incomparison to unregularized proximal denoising ( i.e., λ = x is unused and so one expects error be proporitional to the ambient dimension, as in section 4. In thelow-noise regime, the sensitivity of the program to λ is apparently ampliﬁed, and for λ > ( QP ∗ λ ) to behave similarly to unregularized proximal denoising, begetting risk that behaveslike a power law of N .P ROPOSITION N ∈ N with N (cid:62) s ∈ [ N ] and ¯ λ = √ N . For givenproblem data, suppose x (cid:93) ( λ ) solves ( QP ∗ λ ) , and let λ ∗ be the optimal parameter choice for R (cid:93) ( λ ; s , N ) .Then lim N → ∞ ¯ λλ ∗ = EMARK λ estimates the optimal parameter choice for ( QP ∗ λ ) in the following sense[30]. λ ∗ = O ( (cid:112) log ( N / s )) ≈ (cid:112) N = : ¯ λ Right-sided parameter stability

In the low-noise regime, R (cid:93) may still be order-optimal if λ is chosen large enough. Speciﬁcally, if λ = L λ ∗ for some L >

1, then R (cid:93) ( λ ; x , N ) is still minimax order-optimal. We claim no novelty forthe result of this section, but use it as a contrast to elucidate the previous theorem. Whereas for λ < ¯ λ we are penalized for under-regularizing in the low-noise regime in high dimensions, the theorem belowimplies that we are not penalized for over-regularizing.T HEOREM ( QP ∗ λ ) is parameter stable in the sense that for any λ > L = λ / λ ∗ >

1, thereis N = N ( s , λ ) (cid:62) N (cid:62) N , R (cid:93) ( λ ; s , N ) R ∗ ( s , N ) (cid:54) CL . Observe that the theorem still holds in the event that λ ∗ is replaced by ¯ λ . Thus, one may obtain theexact point of the phase transition, ¯ λ , observed in Theorem 2.2. In fact, with this note, Theorem 2.2follows as a direct consequence of the results of this section by letting N → ∞ . ( BP ∗ σ ) parameter instability The program ( BP ∗ σ ) is maximin suboptimal for very sparse vectors x . We show that ˜ R ( σ ; x , N , η ) scales as a power law of N for all σ >

0. This rate is signiﬁcantly worse than R ∗ ( s , N ) . When x isvery sparse and ( BP ∗ σ ) is underconstrained, then σ (cid:62) η N and 6.1 proves that ˜ R ( σ ; x , N , η ) = Ω ( √ N ) .4 of 52 BERK, A., PLAN, Y., YILMAZ, ¨O

When ( BP ∗ σ ) is overconstrained, then σ (cid:54) η √ N and 6.2, proves that ˜ R ( σ ; x , N , η ) = Ω ( N q ) for some q > x is very sparse.Intuitively, ( BP ∗ σ ) kills not only the noise, but also eliminates too much of the signal content whenunderconstrained and s is small compared to N . Because the signal is very sparse, destroying the sig-nal content is disastrous to the risk. When overconstrained, the remaining noise overwhelms the risk,because the off-support has size approximately equal to the ambient dimension.The above two steps are combined in Theorem 6.2 as a minimax formulation over all σ > x ∈ Σ Ns . In Theorem 6.3, this result is strengthened to a maximin statement over x ∈ Σ Ns and all σ > ( BP ∗ σ ) in empirical settings, we assure the reader that they are consistent. The type of parameter insta-bility described in this section occurs at very large dimensions, in the setting where s (cid:62) ( BP ∗ σ ) to recover even the 0 vector (arguably a desir-able property of a denoising program), many structured high-dimensional signals observed in practiceare not so sparse [in a basis] as to belong to the present regime. Nevertheless, this result serves as acaveat for the limits of a popular (cid:96) convex program.6.1 Underconstrained ( BP ∗ σ ) The proof of this result uses standard methods from CS and may be found in 8.8.L

EMMA s (cid:62) x ∈ Σ Ns \ Σ Ns − be an exactly s -sparse signal with | x j | (cid:38) N for all j ∈ supp ( x ) . If σ > η √ N , then there exists a constant C > N = N ( s ) (cid:62) N (cid:62) N then ˜ R ( σ ; x , N , η ) (cid:62) C √ N . Overconstrained ( BP ∗ σ ) The proof that ˜ R ( σ ; x , N , λ ) scales as a power law of N when σ (cid:54) η √ N proceeds by an involvedargument, hinging on two major steps. The ﬁrst step is to ﬁnd an event whose probability is lower-bounded by a universal constant, on which ( BP ∗ σ ) fails to recover the 0 vector when σ = η √ N . Then,Lemma 3.2 extends this result to all σ (cid:54) η √ N . At this point, one may obtain the minimax result ofTheorem 6.2, as well as a partial maximin result for all x ∈ Σ Ns on the restriction to σ (cid:54) η √ N . Then, tostrengthen these claims to a maximin result over all σ >

0, we prove a lemma that leverages elementaryproperties from convex analysis to show how the error of an estimator may be controlled by that of alower dimensional estimator from the same class.In this section, we state key results for building intuition and defer technical results and proofs to8.8.T

HEOREM C > , q ∈ ( , ) and N (cid:62) N (cid:62) N , s (cid:62) η > x ∈ Σ Ns inf σ (cid:54) η √ N ˜ R ( σ ; x , N , η ) (cid:62) CN q . By scaling, it is sufﬁcient to prove this result in the case where η =

1. The discussion below thusassumes y = x + z , while results are stated in full generality. The main result relies on provinginf σ (cid:54) √ N ˜ R ( σ ; x , N , ) (cid:62) CN q ENSITIVITY OF (cid:96) MINIMIZATION TO PARAMETER CHOICE

15 of 52when x ≡

0, trivially implying the equation before it. Thus, the problem now becomes that of recoveringthe 0 vector from standard normally distributed noise:˜ x ( σ ) = arg min {(cid:107) x (cid:107) : (cid:107) x − z (cid:107) (cid:54) σ } . Here and below, we denote the feasible set in ( BP ∗ σ ) by F ( z ; σ ) = B N ( z ; σ ) and use the notation F : = F ( z ; √ N ) . For λ > < α (cid:54) α < ∞ , deﬁne K i = λ B N ∩ α i B N to be the intersection of the (cid:96) -ballscaled by λ with the (cid:96) -ball scaled by α i for i = , σ = √ N , we prove a geometric lemma. A pictorial representation of this lemma appears inFigure 2, in which we have represented λ B N using Milman’s 2D representation of high-dimensional (cid:96) balls to facilitate the intuition for how they behave in the present context. The key to the proof ofTheorem 6.1 is the geometric lemma below, Lemma 6.2. It proves there exists an (cid:96) ball of radius λ that intersects the feasible set, hence a solution ˜ x ( σ ) must satisfy (cid:107) ˜ x ( σ ) (cid:107) (cid:54) λ . Further, it shows thatany vector in the ball λ B N which has small Euclidean norm does not intersect the feasible set. Thus, thesolution must have large Euclidean norm.Finally, this geometric lemma veriﬁes that the previous three conditions occur on an event occurringwith at least probability k >

0. As an immediate consequence, this lemma yields a lower risk bound,Corollary 6.1. The integers N (8.6)0 , N (8.9)0 are deﬁned in the technical results of 8.8.2.L EMMA K , K , F be deﬁned as above. Let N (6.2)0 : = max { N (8.6)0 , N (8.9)0 } bea universal constant and suppose N > N (6.2)0 . There are universal constants k = k ( N (6.2)0 ) > C , q > E such that1. K ∩ F (cid:54) = /0 2. K ∩ F = /0 3. α > C N q P ( E ) > k .C OROLLARY η >

0. There are universal constants C , q > N (cid:62) N (6.2)0 ,˜ R ( η √ N ; 0 , N , η ) (cid:62) CN q . Next we extend Corollary 6.1 from the case where σ = √ N to any positive σ (cid:54) √ N . The proof ofthis result follows near immediately from the projection lemma in Lemma 3.2. Thus, one ﬁnds ˜ x ( σ ) hasEuclidean norm at least as large as ˜ x ( √ N ) when ˜ x ( σ ) is an estimator of the 0 vector. z λ B α B α B F x(σ)~ F IG . 2: A visualization of the lemma. Weuse Milman’s 2D representation of high-dimensional (cid:96) balls to facilitate the intuition.In this setting, ˜ x ( σ ) must lie inside λ B N .On the event E described by the lemma, onesimultaneously ﬁnds K ∩ F (cid:54) = /0 and K ∩ F = /0.6 of 52 BERK, A., PLAN, Y., YILMAZ, ¨O L EMMA < σ < σ = √ N and x ≡

0. Deﬁne ˜ x ( σ ) , ˜ x ( σ ) as in ( BP ∗ σ ) for σ = σ , σ ,respectively. Then (cid:107) ˜ x ( σ ) (cid:107) (cid:62) (cid:107) ˜ x ( σ ) (cid:107) . Moreover, for N (cid:62) E (cid:107) ˜ x ( σ ) (cid:107) (cid:62) E (cid:107) ˜ x ( σ ) (cid:107) . Minimax results

We now have the tools to state a minimax instability result for ( BP ∗ σ ) . Informally, the best worst-caserisk scales as a power law of N in the very sparse regime. In particular, for s ﬁxed and N sufﬁcientlylarge, there is no choice of σ > HEOREM C > , q ∈ ( , ] , N (cid:62) N (cid:62) N , η (cid:62) s (cid:62)

1, inf σ > sup x ∈ Σ Ns ˜ R ( σ ; x , N , η ) (cid:62) CN q Maximin results

The ﬁnal result of this section establishes maximin parameter instability for all x ∈ Σ Ns and σ >

0. To dothis, we must show there exists a choice of signal x ∈ R N admitting no choice of σ > s -sparse signals with s (cid:62)

1. This will be enough to yield a choice of x whose recoveryis suboptimal over the whole parameter range.L EMMA ( BP ∗ σ ) , s (cid:62)

1) Let x ∈ Σ Ns with supp ( x ) ⊆ T ⊆ [ N ] , let y = x + ξ forsome ξ ∈ R N , let x : = ( x ) T C ∈ Σ N − s and ﬁx σ >

0. Let ˜ x = ˜ x ( σ ) ∈ R N be the solution of ( BP ∗ σ ) where x is the ground truth, and let ˜ x (cid:48) = ˜ x (cid:48) ( σ ) ∈ R N − s be the solution of ( BP ∗ σ ) where x is the ground truth.Then (cid:107) ˜ x T C (cid:107) (cid:62) (cid:107) ˜ x (cid:48) (cid:107) . An immediate consequence of this result is the following inequality between the Euclidean normsof the error vectors.C

OROLLARY h : = ˜ x − x and h (cid:48) : = ˜ x (cid:48) − x , where x , x , ˜ x , ˜ x (cid:48) are deﬁned as above. Then, (cid:107) h (cid:107) (cid:62) (cid:107) h (cid:48) (cid:107) . R EMARK N − s (cid:62) N (6.2)0 then ˜ x (cid:48) is parameter unstable for σ (cid:54) √ N − s and so ˜ x is, too. The ﬁx for thisslight mismatch is trivial, but technical. The result can be extended to the range σ (cid:54) √ N by adjustingthe constants in the proof of Lemma 6.2 and its constituents, leveraging the fact that ( N − s ) / N → N → ∞ and re-selecting N (6.2)0 if necessary. We omit the details of this technical exercise.We proceed under the assumption that the constants have been tuned to allow for ˜ x (cid:48) parameterinstability to imply ˜ x parameter instability for all σ (cid:54) √ N . Thus equipped, we state the followingmaximin parameter instability result for ( BP ∗ σ ) . The proof of this result proceeds by ﬁnding a signal x ∈ Σ Ns such that ˜ R ( σ ; x , N , η ) is suboptimal for all σ >

0. Since Lemma 6.1 applies only to signals x with at least one non-zero entry, one shows there exists such a signal which simultaneously admits ENSITIVITY OF (cid:96) MINIMIZATION TO PARAMETER CHOICE

17 of 52poor risk for σ (cid:54) η √ N and σ (cid:62) η √ N . For example, it is enough to take x : = Ne where e ∈ R N isthe ﬁrst standard basis vector.T HEOREM ( BP ∗ σ ) maximin suboptimality) There are universal constants C > , q ∈ ( , ] and N (cid:62) N (cid:62) N sup x ∈ Σ Ns inf σ > ˜ R ( σ ; x , N , η ) (cid:62) CN q . R EMARK x ; even still it is not possibleto choose σ to achieve order-optimal risk.

7. Numerical Results

Let P ∈ { ( LS ∗ τ ) , ( QP ∗ λ ) , ( BP ∗ σ ) } be a PD program with solution x ∗ ( ρ ) where ρ ∈ { τ , λ , σ } is the asso-ciated parameter. Given a signal x and noise η z , denote by L ( ρ ; x , N , η z ) the loss associated to P and deﬁne ρ ∗ = ρ ( x , η ) > ρ yielding best risk ( i.e., where E z L ( ρ ; x , N , η z ) isminimal). We say the normalized parameter ρ for the problem P is given by ρ : = ρ / ρ ∗ and note that ρ = L ( ρ ; x , N , η ˆ z ) ; by the law of large numbers, this riskestimates well an average of such losses over many realizations ˆ z . Finally, deﬁne the auxiliary function L ( ρ ; x , N , η ˆ z ) : = L ( ρρ ∗ ; x , N , η ˆ z ) .The plots in Figures Figure 3a, Figure 3b, Figure 4b, Figure 5a and Figure 6 visualize the averageloss, ¯ L ( ρ i ; x , N , η , k ) : = k k ∑ j = L ( ρ i ; x , N , η ˆ z i j ) (7.1)for each program, evaluated on a grid { ρ i } ni = of size n and plotted on a log-log scale, where L ( ρ ; x , N , η ˆ z ) = η − (cid:107) x ∗ ( ρ ) − x (cid:107) . Here, each of the nk realizations of the noise is distributed according to ˆ z i j ∼ N ( , ) , the noise level given by η and the signal by x where x = N ∑ si = e i with e i being the i thstandard basis vector. The grid { ρ i } ni = was logarithmically spaced and centered about ρ ( n + ) / = n always odd. The solutions to each PD problem were obtained using standard available methodsin Python: sklearn ’s minimize scalar function from the optimize module was used for solv-ing ( LS ∗ τ ) and ( BP ∗ σ ) [33], while the solution to ( QP ∗ λ ) was obtained via soft-thresholding. Finally, theoptimal values τ ∗ , λ ∗ and σ ∗ were either determined analytically ( e.g., τ ∗ = (cid:107) x (cid:107) ), or estimated on adense grid about an approximately optimal value for that parameter. Initial guesses for σ ∗ and λ ∗ were η √ N and (cid:112) ( N / s ) respectively.7.1 ( LS ∗ τ ) numerical simulations This section presents numerical simulations demonstrating parameter instability of ( LS ∗ τ ) in thelow-noise regime for two different ambient dimensions N = , . This repetition has the beneﬁt ofshowcasing the behaviour of ( LS ∗ τ ) at two different sparsity levels, as well as contrasting the behaviourof ( LS ∗ τ ) with ( QP ∗ λ ) and ( BP ∗ σ ) at relatively low and high dimensions. Using the notation above, n =

301 points and s = ( k , N ) = ( , ) for Figure 3a, while ( k , N ) = ( , ) for Figure 3b. In8 of 52 BERK, A., PLAN, Y., YILMAZ, ¨O (a) (b) F IG . 3: ( LS ∗ τ ) parameter instability in the low-noise regime. Average loss as per (7.1) for each programplotted on a log-log scale with respect to the normalized parameter. The data parameters for (a) are ( s , N , η , k , n ) = ( , , − , , ) and those for (b) are ( s , N , η , k , n ) = ( , , − , , ) .both regimes, x is quite sparse s / N ∼ − , − with entries that are well separated from the noise N / η ∼ , .We may glean several pieces of information from these two plots. Most notably, the ( LS ∗ τ ) parameterinstability manifests in very low dimensions, relative to practical problem sizes. Moreover, the curvefor ( LS ∗ τ ) average loss seems to approach something resembling the sharp asymptotic phase transitiondescribed by Theorem 4.1. One may also notice the behaviour of the other two programs in the low-noise regime. It is apparent that the magnitude of the derivative for the ( QP ∗ λ ) risk increases markedlyon the left-hand side of the optimal normalized parameter value ( i.e., below 1) between the N = and N = plots. This behaviour is consistent with the result in Theorem 5.2 that the left-sided risk scalesas a power law of N .Finally, we observe that ( BP ∗ σ ) develops a shape resembling the instability of ( LS ∗ τ ) when N = .We offer the plausible explanation that the relative sparsity of the signal is small ( s / N = / ) andthus this regime coincides with the regime in which ( BP ∗ σ ) develops parameter instability. Figure 5demonstrates that such an instability seems to occur in very large dimensions, a suspicion corroboratedby the remark at the end of 7.3.We observe that the parameter instability developed by ( BP ∗ σ ) seems to manifest in a way similar tothat of ( LS ∗ τ ) . This is interesting, because Theorem 6.3 shows that there is no good choice of parameter σ , though Figure 5 supports that there is a single best choice, albeit minimax suboptimal, when N ismoderately large.7.2 ( QP ∗ λ ) analytic plots We plot R (cid:93) ( λ ; s , N ) using the expressions derived in (5.1). Observe that the plot of the analytic expres-sion for R (cid:93) ( λ ; s , N ) agrees well with the simulations of R (cid:93) ( λ ; x , N , η ) in Figure 3 and Figure 5.In Figure 4a we plot R (cid:93) ( λ ; s , N ) for λ ∈ { − − , − − , } . It is evident from the reference lines y ∼ N / and y ∼ √ N that R (cid:93) ( u ¯ λ ; s , N ) scales like a power law of N for u <

1, while R (cid:93) ( λ ; s , N ) appearsto have approximately order-optimal growth. The derivatives of these three functions are visualized inFigure 4c, with plotted reference lines y = N / , √ N . Again, it is evident that the derivative scales asa power law of N for those risks with λ < ¯ λ . In Figure 4b we plot R (cid:93) ( λ ; s , N ) as a function of λ for N = . One may observe parameter instability for λ < λ ∗ , for example by comparison to the plotted ENSITIVITY OF (cid:96) MINIMIZATION TO PARAMETER CHOICE

19 of 52 N10 u=1-1e-02u=1-1e-03u=2O(N ),O(N ) (a) (b) N10 u=1-1e-02u=1-1e-03u=2N ,N (c) F IG . 4: ( QP ∗ λ ) parameter instability in the low-noise regime. All curves are generated analytically usingthe expressions obtained in section 5 and plotted on a log-log scale. (a) A plot of R (cid:93) ( u λ ∗ ; s , ) asa function of N for u ∈ { − − , − − , } . The lines y = N / / y = √ N /

20 are plotted forreference. (b) A plot of R (cid:93) ( λ ; s , ) as a function of λ . Two lines are plotted as reference for riskgrowth rate with respect to λ . (c) A plot of the magnitude of dd u R (cid:93) ( u λ ∗ ; s , ) as a function of N for u ∈ { − − , − − , } . The lines y = N / , y = √ N /

20 are plotted as reference.reference line y ∼ λ − . Similarly, one may observe right-sided parameter stability of R (cid:93) ( λ ; s , N ) bycomparing with the second plotted reference line, y ∼ λ . From these simulations one may observe thatchoosing λ = . λ ∗ accrues at least a 10 fold magniﬁcation of the error.Finally, we would like to clarify a potentially confusing issue. Though our theory for ( QP ∗ λ ) refersto λ ∗ only through its connection with ¯ λ , we were able to approximate λ ∗ empirically in our numericalsimulations. Accordingly, we have made reference to it when discussing parameter stability regimes.7.3 ( BP ∗ σ ) numerical simulations This section presents numerical simulations demonstrating parameter instability of ( BP ∗ σ ) in the regimewhere x is very sparse. Figure 5a is generated as described by the above procedure in section 7, withparameters ( N , s , η , k , n ) = ( , , , , ) , while Figure 5b was generated in a way that mirrors theproof of Theorem 6.3, with parameters ( s , η , k , n ) = ( , , , ) .The thrust of Figure 5a is to resolve parameter instability of ( BP ∗ σ ) about the optimal parameterchoice. Because the theory suggests that ˜ R ( σ ; x , N , η ) is surely resolved when the ambient dimension issufﬁciently large, we set N = ; this value was expected to resolve the instability, as per the discussionin 7.3.1 below. We limited the number of realizations and grid points because the problem size wascomputationally prohibitive. The minimal average loss observed on the plot was signiﬁcantly larger0 of 52 BERK, A., PLAN, Y., YILMAZ, ¨O (a) N ̃R̃σ opt ̃N);N)̃R smooth N .3 μ±σ (b) F IG . 5: ( BP ∗ σ ) parameter instability in the very sparse regime. (a) Data parameters: ( s , N , η , k , n ) =( , , , , ) . Average losses plotted on a log-log scale with respect to the normalized parameter.(b) Average best loss for ( BP ∗ σ ) as a function of N . Data parameters: ( s , η , k , n ) = ( , , , ) . Thefunction σ opt ( N ) was obtained as the value of σ bestowing minimal loss of the program for each N andrealization. These best losses were averaged, yielding average best loss. The standard deviation wascomputed for each N from the same loss realizations and included as a ribbon about the mean; functions y = N / and y = N / are included for reference.than the respective minimal average losses of ( LS ∗ τ ) and ( QP ∗ λ ) by a factor of 82 .

2, supporting thetheory. We also noticed a cusp-like behaviour, which would be an interesting object of further study.Figure 5b was generated so as to mirror the theory backing Theorem 6.3. Speciﬁcally, noise real-izations were constrained to the constant probability event {(cid:107) z (cid:107) − N ∈ ( . √ N , √ N ) } . Plotted in theﬁgure is the average best loss as a function N ,¯ L best ( N ; x , η , k , n ) : = k k ∑ j = min i ∈ [ n ] L ( σ i ( N ) ; x , N , η ˆ z i j ) . The domain for N ranges from 10 to 10 , computed on a logarithmically spaced grid composed of 51points. For each value N in the grid, the average loss was computed for n =

31 values of σ , each using k =

25 realizations ˆ z of the noise. The standard deviations of the best loss realizations were computed,and plotted as a grey ribbon about the average best loss. Included for reference is a smoothed versionof the average best loss, computed as a rolling window average. In addition, we have plotted two powerlaws of N that lower- and upper-bound the averaged best loss, and nearly bound that quantity up to afull standard deviation.7.3.1 Simulating theorem parameters.

Here we clarify the relationship between some of the con-stants appearing in the proofs of Theorem 6.2 and Theorem 6.3. We provide two examples of minimal N values guaranteeing parameter instability behaviour of ( BP ∗ σ ) for given parameter choices. The the-ory does not claim these values to be optimal, nor do we claim that the constants are tuned. In particular,these demonstrations seem rather pessimistic, especially by comparison with the numerical simulationsin Figure 5.The following values were determined by computing N : = max { N (8.4)0 ( a , C , L ) , N (8.7)0 ( C , L ) } forparticular choices of a , C , C and L , using their deﬁnitions in the technical results of 8.8.2. Thus, the ENSITIVITY OF (cid:96) MINIMIZATION TO PARAMETER CHOICE

21 of 52F IG . 6: Parameter stability of sparsePD programs when ( s , N , η , k , n ) =( , , , , ) . Plotted curvesrepresent average loss, plotted on a log-logscale.theory of section 6 guarantees parameter instability for all N (cid:62) N when N ≈ . ( a , C , C , L ) ≈ ( . , , , . ) or N ≈ . ( a , C , C , L ) ≈ ( . , . , , . ) . These numbers appear pessimistic, given that N is large, while ( C , C ) ≈ ( , ) implies the instabil-ity arises on the event {(cid:107) z (cid:107) − N ∈ ( √ N , √ N ) } , which occurs with relatively minute (but constant)probability. Thus, it may not be all that surprising that ( BP ∗ σ ) suboptimality is difﬁcult to ascertainempirically from a small number of realizations in only moderately large dimension when σ ≈ σ ∗ .7.4 Parameter stability in sparse proximal denoising

In this section we show numerical simulations in which the three programs appear to exhibit betterparameter stability. For these simulations, η ≈ . s = N = . Average loss was com-puted from k =

25 realizations for n =

401 grid points. As the noise is large, this setting lies (mostly)outside the regime in which ( LS ∗ τ ) and ( QP ∗ λ ) exhibit parameter instability. Moreover, the signal is notvery sparse, since s / N = .

25. Thus, this setting also lies outside the regime in which ( BP ∗ σ ) exhibitsparameter instability. Accordingly, smooth risk curves are seen for ( BP ∗ σ ) and ( QP ∗ λ ) . While ( QP ∗ λ ) and ( BP ∗ σ ) appear relatively gradual, ( LS ∗ τ ) appears at least to avoid a cusp-like point about τ / τ ∗ =

1. Thesedata visualized in Figure 6.7.5

Realistic denoising examples

Image-space denoising.

We visualize how proximal denoising behaves for a realistic denoisingproblem. The ground truth signal is the standard 512 × × x ∈ [ , ] N ⊆ R N , N =

786 432. The denoising is performed in image space. Speciﬁcally,the signal x is not sparse: 99 .

98% of its coefﬁcients are nonzero. We set y j = x , j + η z j ∈ R N where η = − , z j iid ∼ N ( , ) . The results of this example are displayed in Figure 7: the ground truthand noisy images in the top row, and quantitative results captured by plots of the average loss (7.1) inthe bottom row.The plots of average loss were generated from k =

25 realizations of noise z , with a logarithmi-cally spaced grid of n =

501 points centered about the optimal parameter value for each of the threeproximal denoising programs. The optimal parameter value for each program was determined ana-lytically where possible, or numerically using standard solvers [25]. A smooth approximating curveof the non-uniformly spaced point cloud of loss realizations was computed using radial basis function2 of 52

BERK, A., PLAN, Y., YILMAZ, ¨O −1 Normalized Parameter10 (BP *σ )(QP *λ )(LS *τ ) F IG . 7: Top (left-to-right):

The underlying signal is the 512 × × η = − ); the right-most image is corrupted byiid normally distributed noise ( η = [ , ] ; those of thenoisy images are scaled to this range for plotting. Bottom:

Average loss is plotted with respect to thenormalized parameter for ( LS ∗ τ ) , ( QP ∗ λ ) and ( BP ∗ σ ) respectively when η = − (left) and η = ( N , k , n ) = (

786 432 , , ) . Plotted lines are smoothed approximationsof loss realization data using multiquadric RBFs.approximation. The RBF approximation used multiquadric kernels with parameters ( ε rbf , µ rbf , n rbf ) =( − , − , ) . Here, ε rbf is the associated RBF scale parameter, µ rbf is a smoothing parameter and n rbf is the number of grid points at which to approximate [25]. The RBF parameters for the approxima-tion were selected so as to generate a smooth line that best represents the path about which the individual(noisy) data points concentrate.About the optimal average loss (where the normalized parameter is 1), an average difference of1 . τ results in a 4 . × fold difference in nnse on average when η = − . Incontrast, that error varies by no more than a factor of three in the large noise regime ( η = η = − upper bound those computed for η =

1. Theseresults suggest not to use ( LS ∗ τ ) for proximal denoising when η is small, even when the underlying dataare not sparse.7.5.2

1D denoising example.

In this section, we demonstrate parameter instability regimes for arealistic example of a 1D signal using wavelet domain denoising. Speciﬁcally, an s -sparse 1D signal x ∈ R N was generated in the Haar wavelet domain, where ( s , N ) = ( , ) . In the signal domain, iidnormal random noise was added to the signal to generate W − y : = W x + η z where η = N , N . Thedenoising problem was solved in the wavelet domain on a grid of size 501 centered about the optimalnormalized parameter and logarithmically spaced. Namely, the input to each program was y . The loss ENSITIVITY OF (cid:96) MINIMIZATION TO PARAMETER CHOICE

23 of 52was computed in the signal domain after applying the inverse transform to the estimated solution: L ( ρ ; x , N , η ˆ z ) : = η − (cid:107) W − ( x ∗ ( ρ ) − x ) (cid:107) A smooth approximation to the average loss ¯ L ( ρ i ; x , N , η , k ) was computed from k =

25 realizations ofthe noise using linear radial basis function approximation with parameters ( µ rbf , n rbf ) = ( . , ) .In Figure 8, we visualize how the three programs behave for denoising a 1D signal, sparse in the Haarwavelet domain, which has been corrupted by one of two different noise levels in the signal domain.The top row visualizes the ground truth signal with a realization of the corrupted signal for η = N / η = N /

10 (top-right). The bottom row visualizes the average loss with respect to thenormalized parameter of each program. In the high-noise regime (bottom-right), it is clearly seen that ( BP ∗ σ ) is the most parameter unstable about the optimal parameter choice. Moreover, the best averageloss for ( BP ∗ σ ) is greater than that for ( QP ∗ λ ) or ( LS ∗ τ ) , as suggested by the supporting theory. Wenote that ( QP ∗ λ ) also has an average loss greater than the minimal one, and suggest — noting the localvariability in the curve — that this is an artifact of the RBF approximation through the optimality region.In the moderate-noise regime, we see a situation in which ( QP ∗ λ ) appears to be the most parameterstable — again consistent with our reasoning that unconstrained programs should exhibit better stability.In contrast, ( LS ∗ τ ) is most parameter unstable below the optimal parameter, while it is ( BP ∗ σ ) that ismost parameter unstable above the optimal parameter. This behaviour may be indicative of a regimeintermediate to those we have previously discussed ( i.e., lying between strictly low-noise and strictlyvery sparse).With the grid in Figure 9, we intend to elucidate how parameter instability manifests for each pro-gram as a function of the normalized parameter, by visualizing the recovered signal for different valuesof the normalized parameter. The top plot shows the same average losses that are plotted in the bottom-left of Figure 8. The dotted lines at ρ = . , . , , / , ρ and each program, there is a corresponding plotin the grid that depicts the solution to the program for that normalized parameter value ρ , along with theoriginal signal x , which is depicted as a black dotted line in each of the 15 plots. When ρ is too smallfor ( BP ∗ σ ) and ( QP ∗ λ ) , it is clear that the noise fails to be thresholded away. In contrast, this occurs for ( LS ∗ τ ) when ρ is too large. On the other hand, the signal content is thresholded away by ( BP ∗ σ ) when ρ too large, and by ( LS ∗ τ ) when ρ is too small. Notice that this behaviour does not seem to occur with ( QP ∗ λ ) , further supporting that ( QP ∗ λ ) admits right-sided parameter stability.7.5.3 Wavelet-space denoising.

In this section, we demonstrate parameter instability regimes for arealistic example using proximal denoising of an image signal in a wavelet domain. Namely, noise isadded in the image domain, the data denoised in Haar wavelet space, and performance of the back-transformed estimator is evaluated in the image domain. The image was designed to resemble a Shepp-Logan phantom, but to admit a very sparse expansion in Haar wavelets. This modiﬁed phantom, whichwe coin the “Square Shepp-Logan phantom”, was created so as to be sparse enough to allow for bettervisualization of ( BP ∗ σ ) parameter instability. Speciﬁcally, if one were to generate the same ﬁgures forthe Shepp-Logan phantom, one would see that ( BP ∗ σ ) is less parameter stable than ( QP ∗ λ ) , but thatthe behaviour is markedly less pronounced than the behaviour we visualize in Figure 10 or Figure 12.Indeed this discrepancy results from the standard Shepp-Logan phantom being less sparse (having more4 of 52 BERK, A., PLAN, Y., YILMAZ, ¨O y y −1 Normalized Parameter10 (LS *τ )(BP *σ )(QP *λ ) 10 −1 Normalized Parameter10 (LS *τ )(BP *σ )(QP *λ ) F IG . 8: Haar wavelet space denoising of a 1D signal that is sparse in the Haar wavelet domain for twodifferent noise levels, η = N /

100 (left column) and η = N /

10 (right column).

Top: each plot containsa realization of the noisy 1D signal plotted in orange with the ground truth signal in blue.

Bottom: visualizations of the average loss curve, computed using RBF approximation. The parameters for thisexample are: ( s , N , k , n ) = ( , , , ) .non-zero entries) in its Haar wavelet transform than our modiﬁcation. An alternative demonstrationusing the standard Shepp-Logan phantom might proceed using a different transform domain in whichits representation is sparser.A corrupted Square Shepp-Logan phantom was obtained by adding iid noise z i , j iid ∼ N ( , ) to theimage pixels I = ( I i , j ) i , j , yielding y where y i , j = I i , j + η z i , j with η = − , . I i , j ∈ [ , ] isthe ( i , j ) th pixel of the uncorrupted Square Shepp-Logan phantom. The input signal to each recoveryprogram was the vectorized 2D Haar wavelet transform of y i , j : w = W ( y i , j ) i , j where W is the operatorconnoting a Haar wavelet transform to (vectorized) Haar wavelet coefﬁcients. Loss was computed in theimage domain, using the nnse of the inverse-transformed proximal denoising estimator. For example, theloss for ( BP ∗ σ ) is given by η − (cid:107) W − ( ˜ x ( σ )) − I (cid:107) . Average loss (7.1) was thus computed by averagingthe loss over k =

25 realizations of the noise z .The associated parameters of the problem are ( s , N , k , n ) = ( , , , ) , implying a rel-ative sparsity of 1 . [ , ] . Subsequent visualizationsdo not perform this rescaling so that a perceptual evaluation of the recovery is better facilitated.The plots in the bottom row of Figure 10 depict the average loss as a function of the normalizedparameter ρ of each program. For each of the k realizations, the loss was computed on a logarithmicallyspaced grid of n =

501 points about the optimal parameter. As in section 7.5.1, a smooth approxi-mating curve to the non-uniformly spaced point cloud of loss realizations was computed using RBFapproximation. The RBF approximation used multiquadric kernels with parameters ( ε rbf , µ rbf , n rbf ) =( − , − , ) [25]. The RBF parameters for the approximation were selected so as to generate a ENSITIVITY OF (cid:96) MINIMIZATION TO PARAMETER CHOICE

25 of 52smooth line that best represents the path about which the individual (noisy) data points concentrate,especially so as to resolve the behaviour of the loss about ρ = fold difference in nnse results from a less than2% perturbation of τ in the low-noise and very sparse regime ( η = − , s / N ≈ . ( BP ∗ σ ) is less stable than ( QP ∗ λ ) , especially for values of the normalized parametergreater than 1, as suggested by our theory. In the very sparse regime with large noise ( η = . ( BP ∗ σ ) is markedly more parameter unstable than ( LS ∗ τ ) or ( QP ∗ λ ) , especially for values of the normalizedparameter exceeding 1. Moreover, we observe that the minimal average loss for ( BP ∗ σ ) is greater thanthat for ( LS ∗ τ ) or ( QP ∗ λ ) . This numerical behaviour is consistent with our theoretical results.In Figure 11 and Figure 12 we depict estimator performance by visualizing the solution to eachprogram at speciﬁc values of the normalized parameter. The description of each ﬁgure is identical, butthe noise levels η differ between them. Speciﬁcally, for each program we show the recovered image andits pixel-wise nnse for values of the normalized parameter ρ = . , . , , / ,

2. The plot in the toprow of the ﬁgure depicts a loss curve for each program ( i.e., a curve generated from one realization ofthe noise z ), along with reference lines for the corresponding values of the normalized parameter whoserecovered image are visualized. The middle row contains a grid of 15 images; each column correspondsto a value of the normalized parameter as denoted by the title heading, while each row corresponds to aproximal denoising program as denoted by the labels along the left-most y -axis. The bottom grouping of15 images depicts the pixel-wise nnse, arranged identically to the middle row. Because the average losscurves were computed on a grid of n logarithmically spaced points centered about the optimal parametervalue, we do not visualize the recovered image for the exact values of ρ given above, but for those valuesrepresented by the coloured points seen in the plot of the top row. These points are sufﬁciently close tothe quoted values of ρ so as to visualize the program behaviour all the same.The numerics of Figure 11 occur in the low-noise regime ( η = − ) , and so, as expected, demon-strate parameter instability of ( LS ∗ τ ) . We note that pixel-wise nnse for ( BP ∗ σ ) is approximately 20 timesworse than ( QP ∗ λ ) when ρ ≈

2. Moreover, the pathologies (in the sense of pixel-wise nnse) of these lat-ter two programs appear similar. We also observe that the pixel-wise nnse varies more greatly for ( BP ∗ σ ) than for ( QP ∗ λ ) as ρ varies from 0 .

75 to 4 /

3. This is consistent with our theory for the behaviour of ( BP ∗ σ ) in the very sparse regime. The numerics of Figure 12 occur in the high-noise regime ( η = . ) .Failure of ( BP ∗ σ ) in the very sparse regime is seen from examining the solution itself. For example,when ρ <

1, pixel values of the solution to ( BP ∗ σ ) may reach more than 2 or even be negative. Thispathology manifests as large-magnitude pixelation in the corresponding plots of pixel-wise nnse. Catas-trophic failure of ( BP ∗ σ ) is observed for ρ >

1, in which the program fails to recover any semblance ofthe original image. Speciﬁcally, large σ shrinks the wavelet coefﬁcients to near the origin, enforcingfew non-zero components that are small in magnitude. This yields the rectangular pattern observed inthe solutions for ( BP ∗ σ ) (top-right of the middle row). In contrast, moderate deformation of the image isobserved for ρ (cid:54) = ( QP ∗ λ ) and ( LS ∗ τ ) .7.5.4 L ASSO

Example.

This section includes a realistic example comparing parameter instability ofL

ASSO programs in the very sparse regime, both in the low noise regime and when the noise is relativelylarge. Speciﬁcally, we assume the model y = Ax + z , x = W ( I ) BERK, A., PLAN, Y., YILMAZ, ¨O where W ( I ) connotes the 2D Haar wavelet transform of I , the 80 ×

80 square Shepp-Logan phan-tom. This image size was reduced from that of section 7.5.3, because using the full image for theexamples in this section would have been computationally prohibitive. The measurement matrix A ∈ R m × N has entries A i j iid ∼ Z / √ m where Z ∼ N ( , ) . The parameters for the problem are ( N , s , m ) =( , , ) , implying a sparsity ratio of 6 .

48% in the Haar wavelet domain, and a measurementmatrix aspect ratio of 48 .

46% with m (cid:38) s log ( N / s ) . The wavelet coefﬁcients x are recovered accordingto ( LS τ , K ) , ( BP σ , K ) and ( QP λ , K ) where K = B N .Given two signals x ∗ , x deﬁne the peak signal-to-noise ratio (psnr) bypsnr ( x ∗ , x ) : =

10 log (cid:16) max i ∈ [ N ] x i mse ( x ∗ , x ) (cid:17) , mse ( x ∗ , x ) : = N N ∑ i = ( x ∗ i − x i ) . (7.2)As with deﬁning loss L ( ρ ; x , N , η ˆ z i j ) , by abuse of notation we deﬁne psnr as a function of the normal-ized parameter ρ , psnr ( ρ ) : = psnr ( ρ ; x , N , η ˆ z i j ) : = psnr ( x ∗ ( ρ ) , x ) .In Figure 13, we compute average loss and average psnr as a function of the normalized parameter,where average loss is measured using nnse in the image domain, as in section 7.5.3 and average psnr isdeﬁned as in (7.2). The data was simulated for k =

25 realizations of noise for both η = · − (left)and η = . n =

301 points aboutthe optimal normalized parameter. For both average psnr and average loss, the resultant point cloud of7 525 points for each program was used as input for a multiquadric RBF approximation with parameters ( ε rbf , µ rbf , n rbf ) = ( − , , ) (except for ( LS τ , K ) when η = · − , for which the parameters were ( ε rbf , µ rbf , n rbf ) = ( − , − , ) , selected so as to properly resolve the cusp about ρ =

1) [25].About the optimal choice of normalized parameter, ρ =

1, an approximate 1 . · fold differencein average loss results from a 2 .

73% average perturbation of the normalized parameter for ( LS τ , K ) in the low-noise regime ( η = · − ). In contrast, the error difference is no more than 10 for theother two programs. In particular, ( LS τ , K ) undergoes an approximate 30 dB drop in psnr for this smallvariation. In the very sparse regime with η = . ( BP σ , K ) is the least stablewhen compared with the other two programs. If σ is 10% larger than the optimal choice, the psnr isapproximately halved. Moreover, its best average loss is observed to be strictly greater than that foreither of the other two programs. This observation mirrors the numerics for ( BP ∗ σ ) in section 7.5.3 andis consistent with the theoretical results of section 6.Finally, we observe that the numerics for ( QP λ , K ) exhibit parameter stability, though the data regimeis low-noise and very sparse (Figure 13, left). We claim this behaviour is not contrary to 5.2, and usethe following intuition from ( QP ∗ λ ) to elucidate. When λ < ¯ λ , Theorem 5.2 demonstrates parameterinstability for ( QP ∗ λ ) behaving as R (cid:93) ( λ ; s , N ) (cid:38) N ε where λ = ( − ε ) ¯ λ . This term dominates R (cid:93) ( λ ; s , N ) only for relatively high dimensional problems ( cf. section 7.2), roughly requiring that N ε (cid:38) s log ( N / s ) .By this heuristic, in the present example, the instability does not dominate for ε (cid:54) C · .

80 where C > ( QP ∗ λ ) is an appropriate choice only if λ > ¯ λ , while our numericssupport that λ < ¯ λ remains a safe choice for ( QP ∗ λ ) ( cf. section 7.5.3) and ( QP λ , K ) in relatively lowerdimensional problems. This observation is particularly advantageous given that parameter instabilitymay still be expected of both ( LS τ , K ) and ( BP σ , K ) .In Figure 14 and Figure 15, a grid of plots similar to those of section 7.5.3 were generated to visualize ENSITIVITY OF (cid:96) MINIMIZATION TO PARAMETER CHOICE

27 of 52the solution to each program as a function of the normalized parameter ρ ∈ { . , . , , / , } . Asbefore, the topmost image in each ﬁgure is a reference plot to depict the locations on the curve to whichthe displayed images correspond; however, these plots now depict psnr as a function of the normalizedparameter. The images displayed below the reference plot do not correspond exactly to the quotednormalized parameters, but to a closest approximation obtained from a logarithmically spaced gridof n =

301 points centered about the optimal parameter. These true normalized parameter values arevisualized as large coloured dots on the reference plot; that they approximate well the quoted normalizedparameter values for ρ is veriﬁed by their proximity to the black dotted lines in the reference plot.Showing the estimator corresponding to a normalized parameter has the twofold purpose of visualizingthe pathology of each program as its parameter varies, and demonstrating when a program is relativelyunstable in a given regime.The images in Figure 14 portray the setting of low-noise regime, with η = · − . Indeed, thereference plot displays a cusp for the ( LS τ , K ) loss that was characteristic of ( LS ∗ τ ) in the low-noiseregime. Moreover, one observes similar pathologies in both the recovered image for ( LS τ , K ) as wellas the point-wise nnse. The image recovered using ( LS τ , K ) is blurry for ρ <

1, which is indicative ofincompletely recovered wavelet coefﬁcients. For ρ >

1, the noise was not suppressed on the off-supportof the wavelet coefﬁcients yielding a noisy pixelated image in both the recovered image and the errorimage. This behaviour is observed to a signiﬁcantly lesser degree for the corresponding ( BP σ , K ) and ( QP λ , K ) images.The images in Figure 15 portray the very sparse regime where η = .

5. As is consistent with theasymptotics in section 6 for ( BP ∗ σ ) , it is difﬁcult to visualize ( BP σ , K ) parameter instability for relativelylow dimensional problems. We suspect that this instability would have been markedly more apparentwere it possible to run these simulations for the full 640 ×

640 square Shepp-Logan phantom image.Nevertheless, one observes that ( BP σ , K ) is the least stable of the three programs for ρ >

1. In particular,the visualized estimator and point-wise nnse both depict catastrophic failure of ( BP σ , K ) for ρ = BERK, A., PLAN, Y., YILMAZ, ¨O −1 Normalized parameter10 (BP *σ )(QP *λ )(LS *τ ) *σ )x *λ )x *τ )x F IG . 9: Wavelet space denoising of a 1D signal for different values of the normalized parameter when η ≈ Top:

The sections of the average-loss surface for which estimator recovery will be visu-alized are depicted by the dots which lie nearly on the blacked dotted lines, themselves located at ρ = . , . , , / , Bottom:

This group of ﬁfteen plots represents a program’s solution for a partic-ular value of the normalized parameter, arranged in a grid. Each row of the 15 plot grouping representsa program, as denoted by the legend label; each column a value of the normalized parameter, as deter-mined by the heading above the top row.

ENSITIVITY OF (cid:96) MINIMIZATION TO PARAMETER CHOICE

29 of 52 −1 Normalized Parameter10 (BP *σ )(QP *λ )(LS *τ ) −1 Normalized Parameter10 (BP *σ )(QP *λ )(LS *τ ) F IG . 10: Top (left-to-right):

The underlying signal is the 640 ×

640 Square Shepp-Logan phantomimage; the middle image is corrupted by iid normally distributed noise ( η = − ); the right-mostimage is corrupted by iid normally distributed noise ( η = . [ , ] ; those of the noisy images are scaled to [ , ] . Bottom:

Average loss is plotted with respectto the normalized parameter for ( LS ∗ τ ) , ( QP ∗ λ ) and ( BP ∗ σ ) respectively when η = − (left) and η = . ( s , N , k , n ) = ( , , , ) , implying relative sparsityof 1 . BERK, A., PLAN, Y., YILMAZ, ¨O −1 Normalized parameter10 (BP *σ )(QP *λ )(LS *τ ) ( B P * σ ) ( Q P * λ )( L S * τ ) F IG . 11: Wavelet space denoising of the square Shepp-Logan phantom for different values of the nor-malized parameter when η = − . Top:

The sections of the average-loss surface for which estimatorrecovery will be visualized are depicted by the dots which lie nearly on the blacked dotted lines, them-selves located at ρ = . , . , , / , Middle:

This group of ﬁfteen plots represents a program’ssolution for a particular value of the normalized parameter, arranged in a grid. Image pixel values arenot scaled to [ , ] ; their range is given by the associated colour bar. Bottom:

This group of ﬁfteenplots depicts pixel-wise nnse for each (program, normalized parameter) pairing. In both the middle andbottom groups, the program is denoted along the left-hand side, while the normalized parameter valueis denoted along the top row of each group.

ENSITIVITY OF (cid:96) MINIMIZATION TO PARAMETER CHOICE

31 of 52 −1 Normalized parameter10 (BP *σ )(QP *λ )(LS *τ ) F IG . 12: Wavelet space denoising of the square Shepp-Logan phantom for different values of the nor-malized parameter when η = . Top:

BERK, A., PLAN, Y., YILMAZ, ¨O −1 Normalized Parameter10 (BP σ )(QP λ )(LS τ ) 10 −1 Normalized Parameter10 (BP σ )(QP λ )(LS τ ) −1 Normalized Parameter102030405060 (BP σ )(QP λ )(LS τ ) 10 −1 Normalized Parameter81012141618 (BP σ )(QP λ )(LS τ ) F IG . 13: Average loss (top) and average psnr (bottom) vs. normalized parameter for ( LS τ , K ) , ( QP λ , K ) and ( BP σ , K ) respectively when η = · − (left) and η = . ( s , N , m , k , n ) = ( , , , , ) ; relative sparsity 6 . . ENSITIVITY OF (cid:96) MINIMIZATION TO PARAMETER CHOICE

33 of 52 −1 Normalized parameter0204060 (BP σ )(QP λ )(LS τ ) ( B P σ ) ( Q P λ )( L S τ ) ( B P σ ) ( Q P λ )( L S τ ) −3 −2 −1 −3 −2 −1 −3 −2 −1 −3 −2 −1 −3 −2 −1 −3 −2 −1 −3 −2 −1 −3 −2 −1 −3 −2 −1 −3 −2 −1 −1 −1 −3 −2 −1 −3 −2 −1 −3 −2 −1 F IG . 14: Wavelet space compressed sensing problem with the square Shepp-Logan phantom for differentvalues of the normalized parameter when ( s , N , m , η ) = ( , , , · − ) . Top:

The sectionsof the psnr surface for which estimator recovery will be visualized are depicted by the dots which lienearly on the black dotted lines, themselves located at ρ = . , . , , / , Middle:

This ﬁrst 3 × Bottom:

This 3 × BERK, A., PLAN, Y., YILMAZ, ¨O −1 Normalized parameter051015 (BP σ )(QP λ )(LS τ ) ( B P σ ) ( Q P λ )( L S τ ) −101 −0.50.00.51.0 0.00.5 0.00.20.4 −0.10−0.050.000.050.1001 −0.50.00.51.01.5 0.00.51.01.5 0.00.51.0 −0.250.000.250.500.75−0.250.000.250.500.75 0.00.5 −0.50.00.51.0 −0.50.00.51.0 −0.50.00.51.0 ( B P σ ) ( Q P λ )( L S τ ) F IG . 15: Wavelet space compressed sensing problem with the square Shepp-Logan phantom for differentvalues of the normalized parameter when ( s , N , m , η ) = ( , , , . ) . Top:

The sections of thepsnr surface for which estimator recovery will be visualized are depicted by the dots which lie nearlyon the black dotted lines, themselves located at ρ = . , . , , / , Middle:

This ﬁrst 3 × Bottom:

This 3 × ENSITIVITY OF (cid:96) MINIMIZATION TO PARAMETER CHOICE

35 of 52

8. Proofs

Proof of worst-case risk equivalence L EMMA x ∈ Σ Ns , (cid:107) x (cid:107) =

1. Then ˆ R ( τ ; τ x , N , η ) is an increasing function of τ (cid:62) Proof of Lemma 8.1.

Given y ( τ ) : = τ x + η z for η > z ∈ R N , z i iid ∼ N ( , ) , let ˆ x ( τ ) : = ˆ x ( τ ; y ( τ )) solve ˆ x ( τ ; y ( τ )) : = arg min x {(cid:107) y ( τ ) − x (cid:107) : (cid:107) x (cid:107) (cid:54) τ } Let K : = B N − x , a convex set containing the origin. Using a standard scaling property of orthogonalprojections, ˆ x ( τ ) − x = arg min w {(cid:107) η z − w (cid:107) : (cid:107) w + τ x (cid:107) (cid:54) τ } = P τ K ( η z ) . Hence, it follows by Lemma 3.2 that (cid:107) ˆ x ( τ ) − τ x (cid:107) is an increasing function of τ . (cid:3) P ROPOSITION η , τ > N (cid:62)

2. Thensup x ∈ Σ Ns ˆ R ( (cid:107) x (cid:107) ; x , N , η ) = max x ∈ Σ Ns (cid:107) x (cid:107) = lim τ → ∞ ˆ R ( τ ; τ x , N , η ) = max x ∈ Σ Ns (cid:107) x (cid:107) = lim η → ˆ R ( x , N , η ) . Proof of Proposition 8.1.

The ﬁrst equality is an immediate consequence of Lemma 8.1:sup x ∈ Σ Ns ˆ R ( (cid:107) x (cid:107) ; x , N , η ) = max x ∈ Σ Ns (cid:107) x (cid:107) = sup τ > ˆ R ( τ ; τ x , N , η ) = max x ∈ Σ Ns (cid:107) x (cid:107) = lim τ → ∞ ˆ R ( τ ; τ x , N , η ) . The second equality follows from a standard property of orthogonal projections, and the risk expressionderived in Lemma 8.1. For K : = B N − x ,max x ∈ Σ Ns (cid:107) x (cid:107) = lim τ → ∞ ˆ R ( τ ; τ x , N , η ) = max x ∈ Σ Ns (cid:107) x (cid:107) = lim τ → ∞ η − (cid:107) P τ K ( η z ) (cid:107) = max x ∈ Σ Ns (cid:107) x (cid:107) = lim τ → ∞ τ η (cid:107) P K ( τ − η z ) (cid:107) = max x ∈ Σ Ns (cid:107) x (cid:107) = lim ˜ η → ˜ η − (cid:107) P K ( ˜ η z ) (cid:107) = max x ∈ Σ Ns (cid:107) x (cid:107) = lim η → ˆ R ( x , N , η ) . (cid:3) Proof of ( LS ∗ τ ) optimal riskProof of Proposition 2.5. Directly from Theorem 3.1, R ∗ ( s , N ) = max x ∈ Σ Ns (cid:107) x (cid:107) = D ( T B N ( x ) ◦ ) BERK, A., PLAN, Y., YILMAZ, ¨O where D ( T B N ( x ) ◦ ) is the mean-squared distance to the polar of the (cid:96) descent cone. The operator D hasthe following desirable relation to the Gaussian mean width, where C is a non-empty convex cone [2,Prop 10.2]: w ( C ∩ S N − ) (cid:54) D ( C ◦ ) (cid:54) w ( C ∩ S N − ) + . Thus, it sufﬁces to lower- and upper-bound w (cid:0) T B N ∩ S N − (cid:1) . The desired upper bound is an elementarybut technical exercise using H¨older’s inequality, Stirling’s approximation and a bit of calculus. Thelower bound may be computed using Sudakov’s inequality and [21, Lemma 10.12]. It thereby followsthat cs log ( N / s ) (cid:54) D ( T B N ( x ) ◦ ) (cid:54) Cs log ( N / s ) . where c , C > cs log ( N / s ) (cid:54) R ∗ ( s , N ) (cid:54) Cs log ( N / s ) .From Theorem 5.4, R (cid:93) ( λ ∗ ; s , N ) (cid:54) Cs log N for any N (cid:62) N ( s ) with N ( s ) sufﬁciently large. Usingthe above equation gives, for c , C > Cs log N (cid:54) C cs log ( N / s ) (cid:54) C R ∗ ( s , N ) . Finally, observe that R (cid:93) ( λ ∗ ; s , N ) is trivially lower bounded by M ∗ ( s , N ) = Θ ( s log ( N / s )) [9]. (cid:3) Proof of (cid:96) tangent cone equivalenceProof of Lemma 3.1. First observe that the deﬁnition of F C ( x ) is equivalent to F C ( x ) = { h ∈ R N : h = z − x , (cid:107) z (cid:107) (cid:54) (cid:107) x (cid:107) } . Next, observe that K ( x ) is a cone. So, for left containment, it sufﬁces to show F C ( x ) ⊆ K ( x ) since thecone generated by a set is no larger than any cone containing that set. These two expressions: (cid:104) sgn ( x ) T , x (cid:105) = (cid:107) x (cid:107) (cid:62) (cid:107) z (cid:107) = (cid:107) z T (cid:107) + (cid:107) h T C (cid:107) (cid:107) z T (cid:107) = (cid:104) sgn ( z ) , z T (cid:105) (cid:62) (cid:104) sgn ( x ) , z T (cid:105) , are by deﬁnition of h = z − x ∈ F C ( x ) . They combine to yield left containment: (cid:107) h T C (cid:107) (cid:54) −(cid:104) sgn ( x ) , z T − x (cid:105) = −(cid:104) sgn ( x ) , h T (cid:105) = −(cid:104) sgn ( x ) , h (cid:105) . To show right containment, ﬁrst ﬁx w ∈ K ( x ) and select α (cid:62) z : = x + α w admits z j x j (cid:62) j ∈ T . Using α (cid:107) w T C (cid:107) (cid:54) − α (cid:104) sgn ( x ) , w T (cid:105) , we show (cid:107) z (cid:107) (cid:54) (cid:107) x (cid:107) implying that α w ∈ F C ( x ) , whence w ∈ T C ( x ) . Where h : = α w = z − x , (cid:107) z (cid:107) = (cid:107) z T (cid:107) + (cid:107) z T C (cid:107) = (cid:104) sgn ( z T ) , z T (cid:105) + (cid:107) h T C (cid:107) (cid:54) (cid:104) sgn ( z T ) , z T (cid:105) − (cid:104) sgn ( x ) , h T (cid:105) = (cid:104) sgn ( z T ) , z T (cid:105) − (cid:104) sgn ( x ) , h T (cid:105) + (cid:104) sgn ( x ) , x (cid:105) − (cid:104) sgn ( x ) , x (cid:105) = (cid:104) sgn ( x ) , x (cid:105) + (cid:104) sgn ( z T ) , z T (cid:105) − (cid:104) sgn ( x ) , z T (cid:105) = (cid:107) x (cid:107) + (cid:104) sgn ( z T ) − sgn ( x ) , z T (cid:105) = (cid:107) x (cid:107) where the latter equality follows from the fact that (cid:104) sgn ( z T ) − sgn ( x ) , z T (cid:105) (cid:54) = x j z j < j ∈ T , which goes against the initial assumption deﬁning α and z . Thus, w ∈ T C ( x ) and T C ( x ) = K ( x ) as desired. (cid:3) ENSITIVITY OF (cid:96) MINIMIZATION TO PARAMETER CHOICE

37 of 528.4

Proof of the projection lemmaProof of Lemma 3.2.

Deﬁne z α : = P α K ( z ) for α = , λ and deﬁne f ( t ) : = (cid:107) u t (cid:107) , where u t : = tz λ + ( − t ) z for t ∈ [ , ] . Our goal is to show dd t (cid:12)(cid:12) t = f ( t ) (cid:62)

0; this implies (cid:107) z λ (cid:107) (cid:62) (cid:107) z (cid:107) , because f is convex. Expanding f ( t ) , f ( t ) = t (cid:0) ( (cid:107) z λ (cid:107) − (cid:104) z , z λ (cid:105) + (cid:107) z (cid:107) (cid:1) + t (cid:0) (cid:104) z , z λ (cid:105) − (cid:107) z (cid:107) (cid:1) + (cid:107) z (cid:107) . So it is required to check the condition ( (cid:63) ) :dd t f ( t ) (cid:12)(cid:12)(cid:12)(cid:12) t = = (cid:2) t (cid:107) z λ − z (cid:107) + (cid:104) z , z λ − z (cid:105) (cid:12)(cid:12) t = = (cid:104) z , z λ − z (cid:105) ( (cid:63) ) (cid:62) C ( x ) is the projection of x onto a convex set C then for any y ∈ C , (cid:104) y − P C ( x ) , x − P C ( x ) (cid:105) (cid:54)

0. From the projection condition [6], we have • (cid:104) λ − z λ − z , z − z (cid:105) (cid:54) • (cid:104) λ z − z λ , z − z λ (cid:105) (cid:54) (cid:62) (cid:104) z λ − λ z , z − z (cid:105) + (cid:104) λ z − z λ , z − z λ (cid:105) = (cid:104) λ z − z λ , z − z λ (cid:105) = (cid:104) ( λ − ) z , z − z λ (cid:105) + (cid:107) z − z λ (cid:107) (cid:62) ( λ − ) (cid:104) z , z − z λ (cid:105) which is equivalent to (cid:104) z , z λ − z (cid:105) (cid:62)

0. Therefore, f is a convex function increasing on the interval t ∈ [ , ] , whence (cid:107) z (cid:107) (cid:54) (cid:107) z λ (cid:107) as desired. (cid:3) R EMARK (cid:107) z (cid:107) (cid:54) (cid:107) z λ (cid:107) ⇐⇒ (cid:107) z (cid:107) (cid:54) (cid:107) z (cid:107) (cid:107) z λ (cid:107) , one may instead prove the following chain, (cid:107) z (cid:107) (cid:54) (cid:104) z , z λ (cid:105) (cid:54) (cid:107) z (cid:107) (cid:107) z λ (cid:107) . The latter inequality is true by Cauchy-Schwarz, so it remains only to prove the former: (cid:104) z , z λ (cid:105) − (cid:107) z (cid:107) (cid:62) ⇐⇒ (cid:104) z , z λ − z (cid:105) (cid:62) . Rearranging shows this inequality is equivalent to ( (cid:63) ) , and the remainder of the proof proceeds as is.This remark is included for intuition, but this approach is less generalizable. For example, it does notyield the rate of growth observed in the remark at the end of 3.1.1.8.5 Elementary results from probability

We brieﬂy recall two aspects of how normal random vectors concentrate in high dimensions.P

ROPOSITION z ∈ R N with z i iid ∼ N ( , ) , ﬁx constants 0 < C < C < ∞ and deﬁne the event Z ± by Z ± : = { C √ N (cid:54) (cid:107) z (cid:107) − N (cid:54) C √ N } . There exists a constant p = p ( C , C ) > N (cid:62) N (cid:62) N , P (cid:0) Z ± (cid:1) (cid:62) p BERK, A., PLAN, Y., YILMAZ, ¨O

Proof of Proposition 8.2.

Deﬁne the χ N -distributed random variable X N : = (cid:107) z (cid:107) − N √ N . Since X N N → ∞ −−−→ N ( , ) by the central limit theorem, for any ε > N ∈ N such that (cid:12)(cid:12) P (cid:0) X N (cid:54) t (cid:1) − Φ ( t ) (cid:12)(cid:12) (cid:54) ε for all N (cid:62) N , where Φ is the standard normal cdf. One need merely choose ε > P (cid:0) Z ± (cid:1) (cid:62) Φ ( C ) − Φ ( C ) − ε = : p ( C , C ) > N for which the chain of inequalities is valid for all N (cid:62) N . (cid:3) C OROLLARY N , N ∈ N with N (cid:62) N (cid:62)

2. Let z ∈ R N with z i iid ∼ N ( , ) and deﬁne the event A N : = (cid:8) (cid:107) z (cid:107) (cid:54) N − √ N & (cid:107) z (cid:107) ∞ (cid:54) (cid:112) N (cid:9) There exists a real constant C = C ( N ) > P ( A N ) (cid:62) C . Proof of Corollary 8.1.

Given N , deﬁne the events E N : = {(cid:107) z (cid:107) (cid:54) N − √ N } and F N : = {(cid:107) z (cid:107) ∞ (cid:54) √ N } . Using the standard identity Φ ( − x ) (cid:54) φ ( x ) / x , we note that P ( F N ) (cid:62) − N P ( | Z | > (cid:112) N ) (cid:62) − (cid:113) π N log N > P ( A N ) = P ( E N F N ) = P ( E N | F N ) P ( F N ) (cid:62) P ( E N ) P ( F N ) (cid:62) C (cid:62) . (cid:3) We also recall that an event holding with high probability, intersected with an event occurring withconstant probability, still occurs with constant probability.P

ROPOSITION N (cid:62) E = E ( N ) is an event that holds with highprobability in the sense that P ( E N ) (cid:62) − p ( N ) for some function p ( N ) > N → ∞ p ( N ) =

0. Suppose also that for an event F = F ( N ) thereexists q > N (cid:62) P ( F ( N )) (cid:62) q . Then there exists a constant q (cid:48) > N (cid:62) P (cid:0) E ( N ) ∩ F (cid:1) (cid:62) q (cid:48) for all N (cid:62) N . Proof of Proposition 8.3.

The proof is very similar to that of Proposition 8.2. Simply choose athreshold ε > N (cid:62) P (cid:0) E ( N ) ∩ F (cid:1) (cid:62) q − p ( N ) (cid:62) q − ε = : q (cid:48) > N (cid:62) N . (cid:3) R EMARK p ( N ) as in Proposition 8.3 is p ( N ) ∼ O ( e − N ) when E N : = {| X − µ | (cid:54) t } for X a subgaussian random variable, E X = µ and t > ENSITIVITY OF (cid:96) MINIMIZATION TO PARAMETER CHOICE

39 of 528.6

Proof of ( LS ∗ τ ) parameter instabilityProof of Theorem 4.1. Let x ∈ Σ Ns with non-empty support and let τ > ( LS ∗ τ ) . Firstsuppose the parameter is chosen smaller than the optimal value, i.e., τ < (cid:107) x (cid:107) . The discrepancy of theguess, ρ : = |(cid:107) x (cid:107) − τ | = (cid:107) x (cid:107) − τ >

0, induces the instability.The solution ˆ x ( τ ) to ( LS ∗ τ ) satisﬁes 0 (cid:54) (cid:107) ˆ x ( τ ) (cid:107) (cid:54) τ by construction. Therefore, by the Cauchy-Schwarz inequality and an application of the triangle inequality, (cid:107) ˆ x ( τ ) − x (cid:107) (cid:62) N − (cid:107) ˆ x ( τ ) − x (cid:107) (cid:62) ρ N > . Accordingly, lim η → η (cid:107) ˆ x ( τ ) − x (cid:107) (cid:62) lim η → ρ N η = ∞ . Next assume τ is chosen too large, with discrepancy between the correct and actual guesses for theparameter again being denoted ρ = τ − (cid:107) x (cid:107) >

0. Two key pieces of intuition guide this result. Theﬁrst is that the error of approximation should be controlled by the effective dimension of the constraintset. The second suggests that y continues to lie within the constraint set for sufﬁciently small noiselevel, meaning recovery behaves as though it were unconstrained. Hence, the effective dimension of theproblem is that of the ambient dimension, and so one should expect the error to be proportional to N .First, we show that for η sufﬁciently small, y ∈ τ B N with high probability. Fix a sequence η j j → ∞ −−−→ y j : = x + η j z . Since (cid:107) z (cid:107) is subgaussian, Theorem 3.5 implies there is a constant C > P (cid:0) (cid:107) z (cid:107) (cid:62) t + N (cid:114) π (cid:1) (cid:54) P (cid:0)(cid:12)(cid:12) (cid:107) z (cid:107) − N (cid:114) π (cid:12)(cid:12) (cid:62) t (cid:1) (cid:54) e · exp (cid:0) − t CN (cid:1) . In order to satisfy x + η z ∈ τ B N , we need (cid:107) x + η z (cid:107) < τ , for which η (cid:107) z (cid:107) < ρ is sufﬁcient. Theprobability that this event does not occur is upper bounded by P (cid:0) (cid:107) z (cid:107) (cid:62) ρη (cid:1) (cid:54) P (cid:0) (cid:107) z (cid:107) (cid:62) t + N (cid:114) π (cid:1) (cid:54) e · exp (cid:0) − t CN (cid:1) For t = ρ / η − N (cid:113) π , and ˜ C > P (cid:18) (cid:107) z (cid:107) (cid:62) ρη (cid:19) (cid:54) e · exp (cid:32) − (cid:0) ρ / η − N (cid:112) / π (cid:1) CN (cid:33) (cid:46) ˜ C exp (cid:18) − ρ N η (cid:19) η → −−−→ . Let E j : = {(cid:107) z (cid:107) < ρη j } for j (cid:62)

1; their respective probabilities lower-bounded by p j : = − ˜ C exp ( − ρ / N η j ) .Given 0 < ε (cid:28)

1, denote by j the ﬁrst integer such that p j (cid:62) − ε for all j (cid:62) j . On E j with j (cid:62) j , y j ∈ τ B N so y j is the unique minimizer of ( LS ∗ τ ) , meaning:1 η (cid:107) ˆ x ( τ ) − x (cid:107) = (cid:107) z (cid:107) . BERK, A., PLAN, Y., YILMAZ, ¨O

The result follows by bounding the following expectations:lim η → η E (cid:107) ˆ x ( τ ) − x (cid:107) = lim j → ∞ E (cid:2) η − j (cid:107) ˆ x ( τ ) − x (cid:107) ( E j ) (cid:3) + E (cid:2) η − j (cid:107) ˆ x ( τ ) − x (cid:107) ( E Cj ) (cid:3) . ( (cid:63) )The ﬁrst term converges by dominated convergence theorem:lim j → ∞ E (cid:2) η − j (cid:107) ˆ x ( τ ) − x (cid:107) ( E j ) (cid:3) = lim j → ∞ E (cid:2) ( E j ) (cid:107) z (cid:107) (cid:3) = E (cid:2) (cid:107) z (cid:107) (cid:3) = N . On E Cj , (cid:107) ˆ x ( τ ) − x (cid:107) (cid:54) (cid:107) ˆ x ( τ ) − x (cid:107) (cid:54) ρ η , so by dominated convergence theorem,lim j → ∞ E (cid:2) η − j (cid:107) ˆ x ( τ ) − x (cid:107) ( E Cj ) (cid:3) (cid:54) lim j → ∞ E (cid:2) (cid:107) z (cid:107) ( E Cj ) (cid:3) = . This immediately yields the desired result,lim η → η E (cid:107) ˆ x ( τ ) − x (cid:107) = N . To prove the ﬁnal case where τ = (cid:107) x (cid:107) , set C = B N in (3.1) of Theorem Theorem 3.1. Then,lim η → η − E (cid:107) ˆ x ( τ ) − x (cid:107) = D ( T B N ( x ) ◦ ) = Θ ( s log ( N / s )) (cid:28) N . (cid:3) Proofs of ( QP ∗ λ ) resultsProof of Proposition 5.1. Because z is isotropic and iid, one can split the signal x = x + − x − into“positive” and “negative” components, and so it sufﬁces to consider the case where x , j (cid:62) j ∈ [ N ] . The heart of this proposition again relies on the fact that the noise limits to 0. In general, λ > ∼ O ( η √ log N ) , so we require only that | x , j | = O ( ) for j ∈ T = supp ( x ) .This requirement can be written x , j (cid:62) a > j ∈ T and some real number a >

0. Recall that theminimizer of ( QP ∗ λ ) is given by the soft-thresholding operator which we denote by x (cid:93) ( ηλ ) = S ηλ ( x + η z ) . Where k ∈ T , (cid:96) ∈ T C so that x , k (cid:62) a , x ,(cid:96) =

0, one has S ηλ ( x , k + η z k ) − x , k =  η ( z k − λ ) x , k > η ( λ − z k ) − x , k | x , k + η z k | (cid:54) ηλη ( z k + λ ) x , k < − η ( λ + z k ) S ηλ ( η z (cid:96) ) =  η ( z (cid:96) − λ ) z (cid:96) > λ | z (cid:96) | (cid:54) λη ( z (cid:96) + λ ) z (cid:96) < − λ and so independence of z j yieldslim η → η E (cid:107) x (cid:93) ( ηλ ) − x (cid:107) = lim η → s η E (cid:2) ( S ηλ ( x , k + η z k ) − x , k ) (cid:3) + lim η → N − s η E (cid:2) S ηλ ( z (cid:96) ) (cid:3) . Passing to a sequence η j →

0, there exists J ∈ N such that for all j (cid:62) J , S η j λ ( x , k + η j z k ) − x , k = η j ( z k − λ ) with high probability. ( (cid:63) ) ENSITIVITY OF (cid:96) MINIMIZATION TO PARAMETER CHOICE

41 of 52If this equality were true almost surely, it would follow that for k ∈ T η − E (cid:107) ( x (cid:93) ( ηλ ) − x ) T (cid:107) = E (cid:2) ( z T − λ ) (cid:3) = E [ z T (cid:3) + s λ = s ( + λ ) . Indeed this is still true in the case of ( (cid:63) ) with η →

0. In particular, using independence of z k for k ∈ T and denoting by E j the high probability event ( (cid:63) ) , we obtain by similar means as in the proof of Theorem4.1,lim η → η − E (cid:107) ( x (cid:93) ( ηλ ) − x ) T (cid:107) = lim j → ∞ s η j E (cid:2) η j ( z k − λ ) ( E j ) (cid:3) + s η j E (cid:2) ( x (cid:93) k ( η j λ ) − x , k ) ( E Cj ) (cid:3) = s ( + λ ) Next, deﬁne G ( λ ) : = ( + λ ) Φ ( − λ ) − λ φ ( λ ) . By independence of the entries of z T C , with any (cid:96) ∈ T C ,the second quantity is exactly computable aslim η → η − E (cid:107) ( x (cid:93) ( ηλ ) − x ) T C (cid:107) = ( N − s ) E (cid:2) S λ ( z (cid:96) ) (cid:3) = ( N − s ) G ( λ ) , where the ﬁnal equality is by deﬁnition of S λ and elementary calculations ( cf. remark remark 8.4).Therefore, as desired, lim η → η − E (cid:107) x (cid:93) ( ηλ ) − x (cid:107) = s ( + λ ) + ( N − s ) G ( λ ) . (cid:3) Proof of Corollary 5.1.

For 0 (cid:54) t (cid:54) s , where we deﬁne for simplicity of notation Σ N − : = /0, observethat sup x ∈ Σ Nt \ Σ Nt − R (cid:93) ( λ ; x , N , η ) = R (cid:93) ( λ ; t , N ) because the regime η → η > | x , j | → ∞ for j ∈ supp ( x ) (as shown explicitly in the proof of Proposition 8.1). Therefore,sup x ∈ Σ Ns R (cid:93) ( λ ; x , N , η ) = max (cid:54) t (cid:54) s sup x ∈ Σ Nt \ Σ Nt − R (cid:93) ( λ ; x , N , η )= max { R (cid:93) ( λ ; 0 , N ) , R (cid:93) ( λ ; s , N ) } = R (cid:93) ( λ ; s , N ) by linearity of the max argument and the fact that 1 + λ (cid:62) G ( λ ) for λ > (cid:3) Proof of ( QP ∗ λ ) parameter instability. We now prove Lemma Lemma 5.1.

Proof of Lemma Lemma 5.1.

By Proposition Proposition 5.1, R (cid:93) ( λ ; s , N ) = s ( + λ ) + ( N − s ) G ( λ ) .We prove the result by controlling G (cid:48) ( λ ) using integration by parts. Thus,dd λ G ( λ ) = λ Φ ( − λ ) − φ ( λ ) (cid:54) λ ( λ − λ + λ ) φ ( λ ) − φ ( λ ) = − λ − λ φ ( λ ) A simple substitution yields, for all N > exp (cid:16) ( − ε ) − (cid:17) ,dd u (cid:12)(cid:12)(cid:12)(cid:12) u = − ε G ( u ¯ λ ) (cid:54) (cid:20) − ( u ¯ λ ) − u ¯ λ φ ( u ¯ λ ) (cid:12)(cid:12)(cid:12)(cid:12) u = − ε = − ( − ε ) log ( N ) − ( − ε ) (cid:113) π log ( N ) N − ( − ε ) = : − γ ( N , ε ) N − ( − ε ) . BERK, A., PLAN, Y., YILMAZ, ¨O

Multiplying G (( − ε ) ¯ λ ) by N − s yields (cid:12)(cid:12)(cid:12)(cid:12) dd u R (cid:93) ( u ¯ λ ; s , N ) (cid:12)(cid:12)(cid:12)(cid:12) u = − ε (cid:62) ( N − s ) γ ( N , ε ) N − ( − ε ) − s ( − ε ) (cid:112) N = γ ( N , ε ) N ε − ε − s γ ( N , ε ) N − ( − ε ) − s ( − ε ) (cid:112) N (cid:62) CN ε for some constant C > N (cid:62) N , where N > exp (cid:16) ( − ε ) − (cid:17) is chosen sothat for all N (cid:62) N the following two conditions are satisﬁed:  ( N − s ) γ ( N , ε ) N − ( − ε ) (cid:62) s ( − ε ) √ N γ ( N , ε ) (cid:0) − sN (cid:1) (cid:62) s ( − ε ) N − ε + ε √ N + CN − ε + ε In this regime, one achieves unbounded growth of the risk as a power law of the ambient dimension. (cid:3) R EMARK x > Φ ( − x ) = (cid:90) ∞ x φ ( t ) d t = (cid:16) x − x (cid:17) φ ( x ) + (cid:90) ∞ x t φ ( t ) t d t (cid:54) (cid:16) x − x (cid:17) φ ( x ) + x − (cid:90) ∞ x t φ ( t ) d t = (cid:16) x − x + x (cid:17) φ ( x ) Proof of Theorem 5.2.

Deﬁne f ( u ) : = dd u R (cid:93) ( u ¯ λ ; s , N ) and F ( u ) : = R (cid:93) ( u ¯ λ ; s , N ) its anti-derivative. Theproof is an application of the fundamental theorem of calculus: F ( ) − F ( − ε ) = (cid:90) ε f ( − t ) d t (cid:54) − C (cid:90) ε N t d t = C − N ε log N . The result follows by substituting: R (cid:93) (( − ε ) ¯ λ ; s , N ) (cid:62) C N ε − N + R (cid:93) ( ¯ λ ; s , N ) (cid:62) C N ε log N where the latter inequality holds after taking N sufﬁciently large, and C > (cid:3) Proof of Proposition 5.3.

By Proposition 5.1, R (cid:93) ( λ ; s , N ) = s ( + λ ) + ( N − s ) G ( λ ) . We prove theresult by controlling G (cid:48) ( λ ) . One may lower bound G (cid:48) ( λ ) asdd λ G ( λ ) = λ Φ ( − λ ) − φ ( λ ) (cid:62) λ (cid:16) λλ + (cid:17) φ ( λ ) − φ ( λ ) = − φ ( λ ) λ + . This gives the following lower bound for dd λ R (cid:93) ( λ ; s , N ) :d R (cid:93) d λ ( λ ; s , N ) (cid:62) s λ − ( N − s ) φ ( λ ) λ + (cid:62) λ − N φ ( λ ) λ + = λ + (cid:0) λ ( λ + ) − N φ ( λ ) (cid:1) . Substituting ¯ λ gives a positive quantity, since N (cid:62) N + (cid:0)(cid:112) N ( N + ) − √ π (cid:1) > . ENSITIVITY OF (cid:96) MINIMIZATION TO PARAMETER CHOICE

43 of 52Consequently, λ ∗ < ¯ λ because λ ∗ is the value giving optimal risk and dd λ R (cid:93) ( λ ; s , N ) is increasing forall λ (cid:62) ¯ λ . Then it must be that | λ ∗ − ¯ λ | < ε for any ε > N is sufﬁciently large. Indeed,ﬁx ε >

0. By Lemma 5.1 there exists N (cid:62) N (cid:62) N ( QP ∗ λ ) is parameter unstable for λ < ¯ λ , yielding R (cid:93) ( λ ; s , N ) (cid:38) N ε . But R (cid:93) ( λ ∗ ; s , N ) (cid:54) CR ∗ ( s , N ) for N (cid:62) N by Proposition 2.5, wherewe re-choose N = N ( s ) if necessary. Thus, it must be that | λ ∗ − ¯ λ | < ε for all N (cid:62) N . In particular,lim N → ∞ ¯ λ / λ ∗ = (cid:3) R EMARK Φ ( − λ ) (cid:62) λλ + φ ( λ ) Let Z ∼ N ( , ) be a standard normal random variable and let S λ ( · ) denote soft-thresholding by λ > (cid:54) E (cid:2) S λ ( Z ) (cid:3) = (cid:90) ∞ λ ( z − λ ) φ ( λ ) d z = ( + λ ) Φ ( − λ ) − λ φ ( λ ) . Thus, ( + λ ) Φ ( − λ ) (cid:62) λ φ ( λ ) , giving the desired lower bound.8.7.2 Proof of ( QP ∗ λ ) right-sided stability. We next prove right-sided stability of ( QP ∗ λ ) . Proof of Theorem 5.4.

Given L = λ / λ ∗ >

1, deﬁne ¯ L = ¯ L ( s , N ) > λ = ¯ L ¯ λ = ¯ L √ N . Note lim N → ∞ ¯ L ( s , N ) = L , because ¯ λ is asymptotically equivalent to λ ∗ up to constants. A direct substitution of λ = ¯ L ¯ λ = ¯ L √ N in the analytic formula for R (cid:93) ( λ ; s , N ) yields the desired bound, noting that R (cid:93) ( λ ∗ ; s , N ) equals R ∗ ( s , N ) up to constants. Thus, there is C > N = N ( s ) (cid:62) N (cid:62) N R (cid:93) ( λ ; s , N ) (cid:54) s ( + L log N ) + N − s ¯ LN ¯ L √ π log N (cid:54) CL R ∗ ( s , N ) . (cid:3) Proofs of ( BP ∗ σ ) results Proof of underconstrained ( BP ∗ σ ) parameter instability. We prove parameter instability of ( BP ∗ σ ) in the underconstrained regime. Proof of Lemma 6.1.

By scaling, it sufﬁces to consider the case where η =

1. Deﬁne the event A N : = (cid:8) (cid:107) z (cid:107) (cid:54) N − √ N & (cid:107) z (cid:107) ∞ (cid:54) (cid:112) N (cid:9) . On A N , it follows from the KKT conditions, where h = ˜ x ( σ ) − x , that N (cid:54) σ = (cid:107) h (cid:107) − (cid:104) h , z (cid:105) + (cid:107) z (cid:107) (cid:54) (cid:107) h (cid:107) − (cid:104) h , z (cid:105) + N − √ N By Cauchy-Schwartz and deﬁnition of A N ,12 (cid:107) h (cid:107) (cid:62) √ N + (cid:104) h , z (cid:105) (cid:62) √ N − (cid:107) h (cid:107) (cid:107) z (cid:107) ∞ (cid:62) √ N − (cid:107) h (cid:107) (cid:112) N . BERK, A., PLAN, Y., YILMAZ, ¨O

Applying Proposition 3.3 and the binomial inequality 2 ab (cid:54) a + b gives √ N − (cid:107) h (cid:107) (cid:112) N (cid:62) √ N − √ s (cid:107) h (cid:107) (cid:112) N (cid:62) √ N − (cid:107) h (cid:107) − s log N Combining these two groups of inequalities gives (cid:107) h (cid:107) (cid:62) √ N − s log N . Hence, by Bayes’ rule andCorollary 8.1 there exist dimension independent constants C , C (cid:48) > E (cid:107) ˜ x ( σ ) − x (cid:107) (cid:62) P ( A N ) · E (cid:2) (cid:107) ˜ x ( σ ) − x (cid:107) | A N (cid:3) (cid:62) C (cid:48) (cid:0) √ N − s log N (cid:1) (cid:62) C √ N . The ﬁnal inequality follows by the assumption that N (cid:62) N ( s ) . (cid:3) Supporting propositions for the geometric lemma.

This section is dedicated to several resultsnecessary for the proof of Lemma 6.2, a main lemma in the proof of Theorem 6.2 and Theorem 6.3. Westate and prove these propositions in line.P

ROPOSITION C >

0. Let α = a N / and λ = L (cid:113) N log N . Where K : = λ B N ∩ α B N , thereexists a choice of universal constants a > L (cid:29) N = N (8.4)0 ( a , C , L ) (cid:62) N (8.4)0 ( a , C , L ) : = D / ( D − ) , D : = a L < , D : = (cid:16) C + a L (cid:17) < N > N w ( K ) (cid:62) ( a + C ) √ N . Proof of Proposition 8.4.

Since w ( K ) = E z sup q ∈ K (cid:104) q , z (cid:105) is the Gaussian mean width of K , we mayinvoke Proposition 3.10 to obtain a sufﬁcient chain of inequalities: E sup K (cid:104) q , z (cid:105) = w ( K ) (3.10) (cid:62) √ κλ (cid:115) log (cid:0) N α λ (cid:1) ( ∗ ) (cid:62) (cid:16) α + C (cid:17) √ N . In particular, Proposition 3.10 holds with κ =

1, since κ is the lower-RIP constant of the sensing matrixfor ( BP ∗ σ ) , which is the identity. We thus turn our attention to ( ∗ ) , which is equivalent tolog (cid:0) D √ N log N (cid:1) (cid:62) D log N , D : = a L , D : = (cid:16) C + a L (cid:17) Rearranging gives 12 + log D + log log N log N (cid:62) D and for D , D (cid:54)

1, this is certainly satisﬁed for N (cid:62) D / ( D − ) ( e.g., L =

11 imposes N (cid:38) when a = , C = N = N ( a , C , L ) as in the proposition statement sothat for all N (cid:62) N , as desired, w ( K ) (cid:62) (cid:16) a + C (cid:17) √ N . (cid:3) ENSITIVITY OF (cid:96) MINIMIZATION TO PARAMETER CHOICE

45 of 52P

ROPOSITION δ > c ∈ ( , ) . Let K = λ B N ∩ α B N be as deﬁned above. There are universalconstants ˜ D > , N (cid:62) N > N : = N (8.5)0 ( c , ˜ D , δ , L ) : = (cid:16) D L ( − c ) log (cid:0) δ (cid:1)(cid:17) there exists q ∈ K such that (cid:104) q , z (cid:105) (cid:62) cw ( K ) with probability at least 1 − δ , where z ∈ R N with z i iid ∼ N ( , ) . Proof of Proposition 8.5.

Note that K ⊆ R N is a topological space and deﬁne the centered Gaussianprocess f x : = (cid:104) x , g (cid:105) for g i iid ∼ N ( , ) . Observe that (cid:107) f (cid:107) K : = sup x ∈ K | f x | is almost surely ﬁnite. For any u > P (cid:0) sup x ∈ K |(cid:104) x , g (cid:105)| < w ( K ) − u (cid:1) (cid:54) exp (cid:0) − u σ K (cid:1) . by Theorem 3.8. Therefore, for c ∈ ( , ) , P (cid:0) sup x ∈ K |(cid:104) x , g (cid:105)| < cw ( K ) (cid:1) (cid:54) exp (cid:0) − ( − c ) w ( K ) σ K (cid:1) (cid:54) exp (cid:0) − ( − c ) L √ N log ( D √ N log N )

16 log N (cid:1) (cid:54) δ because σ K = sup x ∈ K E |(cid:104) x , g (cid:105)| = sup x ∈ K N ∑ i = x i E | g i | = sup x ∈ K (cid:107) x (cid:107) = α = √ N . A speciﬁc choice of q ∈ K follows by choosing the q ∈ K that realizes the supremum, since K isclosed. (cid:3) P ROPOSITION C , δ > Z − : = {(cid:107) z (cid:107) (cid:54) N + C √ N } for z ∈ R N with z i iid ∼ N ( , ) . There is a universal constant N = N (8.6)0 (cid:62) N (8.6)0 (cid:62) max { N (8.4)0 ( a , C , L ) , N (8.5)0 ( c , ˜ D , δ , L ) (cid:9) , and a universal constant k = k ( N (8.6)0 , δ ) > N (cid:62) N there is an event E ⊆ Z − satisfying K ∩ F (cid:54) = /0 on E and P ( E ) (cid:62) P ( Z − ) − δ . Proof of Proposition 8.6.

By Proposition 8.5, for any c ∈ ( , ) there is an event E that holds with high probability such thatsup q ∈ K (cid:104) q , z (cid:105) (cid:62) c w ( K ) on E . Subsequent statements are made on the restriction to E .As K is closed, there is q ∈ K realizing the supremum, whence (cid:104) q , z (cid:105) (cid:62) c w ( K ) . Now, choose C (cid:48) > C (cid:62) c − (cid:0) a + C (cid:1) − a . Then q ∈ K satisﬁes (cid:104) q , z (cid:105) (cid:62) c w ( K ) (cid:62) c (cid:16) a + C (cid:48) (cid:17) √ N (cid:62) (cid:16) a + C (cid:17) √ N . BERK, A., PLAN, Y., YILMAZ, ¨O

Now, because (cid:107) q (cid:107) (cid:54) α and q ∈ K , it holds on the event E ∩ Z − that (cid:16) a + C (cid:17) √ N (cid:62) (cid:107) q (cid:107) + (cid:0) (cid:107) z (cid:107) − N (cid:1) Combining the two previous chains of inequalities implies that (cid:107) q − z (cid:107) (cid:54) N Namely, there exists an event Z − ∩ E , such that q ∈ K ∩ F , so long as N (cid:62) N (8.6)0 . Because E holdswith high probability and the probability of Z − is lower-bounded by a universal constant, Proposition8.3 implies P ( Z − ∩ E ) (cid:62) k ( N (8.6)0 , δ ) for N (cid:62) N (8.6)0 , where N (8.6)0 (cid:62) max { N (8.4)0 ( a , C , L ) , N (8.5)0 ( c , ˜ D , δ , L ) (cid:9) . (cid:3) P ROPOSITION C > L (cid:62)

1. Set K : = λ B N ∩ α B N , where λ = L (cid:113) N log N . There is amaximal choice of α = α ( N ) > N (cid:62) w ( K ) (cid:54) C √ NProof of Proposition 8.7.

Since w ( K ) = E z sup q ∈ K (cid:104) q , z (cid:105) is the Gaussian mean width of K , we may invoke Proposition 3.9 toobtain a sufﬁcient chain of inequalities: w ( K ) (3.9) (cid:54) λ (cid:115) log ( eN α λ ) ( ∗∗ ) (cid:54) C √ N . The ﬁrst inequality follows by (3.9) immediately. Rearranging and substituting for λ , ( ∗∗ ) is equivalentto D log N (cid:62) log (cid:0) D α log N (cid:1) , D : = (cid:16) C L (cid:17) , D : = eL . This inequality is satisﬁed for any α with α (cid:54) N D D log N = : A ( C , N ) For example, one may choose α = LN D √ e log N , D : = C L . For such 0 < α (cid:54) A ( N ; C , L ) , it holds as desired that w ( K ) (cid:54) C √ N . (cid:3) R EMARK N so that A ( C , N ) is increasing for all N (cid:62) N . A quickcalculation reveals that N = N (8.7)0 ( C , L ) : = exp (cid:0) ( D ) − (cid:1) is sufﬁcient. ENSITIVITY OF (cid:96) MINIMIZATION TO PARAMETER CHOICE

47 of 52P

ROPOSITION δ >

0, and C >

1. Let K = λ B N ∩ α B N as above. There are universal constants˜ D > , N (cid:62) N > N : = N (8.8)0 ( C , ˜ D , δ , L ) : = (cid:16) D L ( C − ) log (cid:0) δ (cid:1)(cid:17) one has sup q ∈ K (cid:104) q , z (cid:105) (cid:54) Cw ( K ) with probability at least 1 − δ , where z ∈ R N with z i iid ∼ N ( , ) . Proof of Proposition 8.8.

Deﬁne the centered Gaussian process f x : = (cid:104) x , g (cid:105) for x ∈ K ⊆ R N , atopological space, and where g i iid ∼ N ( , ) . Observe (cid:107) f (cid:107) K = sup x ∈ K | f x | < ∞ almost surely. For any u > P (cid:0) sup x ∈ K |(cid:104) x , g (cid:105)| > w ( K ) + u (cid:1) (cid:54) exp (cid:16) − u σ K (cid:17) by Theorem 3.8. Hence, for C > P (cid:0) sup x ∈ K |(cid:104) x , g (cid:105)| > Cw ( K ) (cid:1) (cid:54) exp (cid:16) − ( C − ) w ( K ) σ K (cid:17) (cid:54) exp (cid:16) − ( C − ) L N log ( D √ N log N ) α log N (cid:17) (cid:54) δ because σ K = sup x ∈ K E |(cid:104) x , g (cid:105)| = sup x ∈ K N ∑ i = x i E | g i | = sup x ∈ K (cid:107) x (cid:107) = α (cid:54) α = √ N . Finally, for δ > C >

1, sup x ∈ K |(cid:104) x , g (cid:105)| (cid:54) Cw ( K ) with probability at least 1 − δ for any N (cid:62) N (8.8)0 . (cid:3) P ROPOSITION C , δ > Z + : = {(cid:107) z (cid:107) (cid:62) N + C √ N } where z ∈ R N with z i iid ∼ N ( , ) . There is a universal constant N : = N (8.9)0 (cid:62) N (8.9)0 (cid:62) max (cid:8) N (8.7)0 , N (8.8)0 (cid:9) . and a universal constant k = k ( N , δ ) > N (cid:62) N there is an event E ⊆ Z + satisfying K ∩ F = /0 on E and P ( E ) (cid:62) k : = P ( Z + ) − δ . Proof of Proposition 8.9.

By Proposition 8.8, for any 0 < c < E that holds withhigh probability such that sup q ∈ K (cid:104) q , z (cid:105) (cid:54) c w ( K ) on E . Because K is closed, there is q ∈ K realizingthe supremum when restricted to E , whence (cid:104) q , z (cid:105) (cid:54) sup q (cid:48) ∈ K (cid:104) q (cid:48) , z (cid:105) (cid:54) c w ( K ) . Now, choose C (cid:48) > (cid:54) C (cid:54) c C (cid:48) . Then q ∈ K satisﬁes (cid:104) q , z (cid:105) (cid:54) c w ( K ) (cid:54) c C (cid:48) √ N (cid:54) C √ N BERK, A., PLAN, Y., YILMAZ, ¨O

On the other hand, for any q (cid:48) ∈ F on the event Z + , C √ N (cid:54) (cid:107) q (cid:48) (cid:107) + (cid:107) z (cid:107) − N (cid:54) (cid:104) q (cid:48) , z (cid:105) whence K ∩ F = /0 on the event Z + ∩ E . Because E holds with high probability and the probabilityof Z + is lower-bounded by a universal constant, Proposition 8.3 implies P (cid:0) Z + ∩ E (cid:1) > k ( N (8.9)0 , δ ) for N (cid:62) N (8.6)0 where N (8.9)0 (cid:62) max (cid:8) N (8.7)0 , N (8.8)0 (cid:9) . (cid:3) Proof of the geometric lemma.

We now have the tools required for Lemma 6.2. For intuiton ofthe result, we refer the reader to Figure 2 in 6.2.

Proof of Lemma 6.2.

The proof of the ﬁrst two items follows trivially from Proposition 8.6 andProposition 8.9. Deﬁne the event E : = Z − ∩ E ∩ Z + ∩ E To prove the ﬁnal item, observe that P (cid:0) E (cid:1) (cid:62) P (cid:0) Z − ∩ Z + (cid:1) − δ (cid:62) k for all sufﬁciently large N . Thisis a direct consequence of Proposition 8.2 and Proposition 8.3.The proof of the third item follows from a note in Proposition 8.7. Speciﬁcally, the result holds forany choice of α satisfying0 < α (cid:54) A ( N ; C ; L ) = LN D √ e log N , D : = C L Hence, choose C , q > α > C N q for all N (cid:62) N (6.2)0 (cid:62) N (8.7)0 . (cid:3) Proofs for overconstrained suboptimality.

First we prove a key ingredient in the main resultsfor ˜ R ( σ ; x , N , η ) parameter instability. Then, we prove the lemma that extends ( BP ∗ σ ) parameter insta-bility from σ = √ N and x ≡ σ (cid:54) √ N and x ≡

0. Finally, we prove the restricted maximin result,yielding parameter instability for overconstrained ( BP ∗ σ ) . Proof of Corollary 6.1.

Restrict to the event E as given in the lemma and assume that N (cid:62) N (6.2)0 . K ∩ F is non-empty, so ˜ x ( σ ) ∈ K ∩ F by deﬁnition. K ∩ F = /0 thereby implies˜ x ( σ ) ∈ λ B N ∩ (cid:0) α B N \ α B N (cid:1) ∩ F = ( K \ K ) ∩ F . Whence follows (cid:107) ˜ x ( σ ) (cid:107) (cid:54) λ and α (cid:54) (cid:107) ˜ x ( σ ) (cid:107) (cid:54) α . Applying Bayes’ rule to the noise-normalizedrisk yields: ˜ R ( σ ; 0 , N , η ) (cid:62) P ( E ) η E (cid:2) (cid:107) ˜ x ( σ ) (cid:107) | E (cid:3) (cid:62) k C N q = : CN q . (cid:3) Proof of Lemma 6.3.

This result is an immediate consequence of Corollary 3.1. (cid:3)

ENSITIVITY OF (cid:96) MINIMIZATION TO PARAMETER CHOICE

49 of 52

Proof of Theorem 6.1.

Without loss of generality, assume η =

1. We may trivially lower-bound theminimax expression by considering only the case where x ≡ x ∈ Σ Ns inf σ (cid:54) √ N ˜ R ( σ ; x , N , ) (cid:62) inf σ (cid:54) √ N ˜ R ( σ ; 0 , N , ) Lemma 6.3 and Corollary 6.1 imply in turn,inf σ (cid:54) √ N ˜ R ( σ ; 0 , N , η ) (cid:62) ˜ R ( √ N ; 0 , N , η ) (cid:62) CN q for all N (cid:62) N , where N (cid:62) N (6.2)0 and C , q > (cid:3) Proof of minimax suboptimality.

We prove that ( BP ∗ σ ) is minimax suboptimal. Proof of Theorem 6.2.

Without loss of generality, take η =

1. Observe thatinf σ > sup x ∈ Σ Ns ˜ R ( σ ; x , N , ) = min (cid:110) inf σ (cid:54) √ N S ( σ ) , inf σ > √ N S ( σ ) (cid:111) where S ( σ ) : = sup x ∈ Σ Ns ˜ R ( σ ; x , N , ) . Next, assume N (cid:62) N (6.2)0 . Then one has inf σ > √ N S ( σ ) (cid:62) C √ N by Lemma 6.1. Moreover, a trivial lower bound, Lemma 6.3 and Corollary 6.1 successively implyinf σ (cid:54) √ N S ( σ ) (cid:62) inf σ (cid:54) √ N ˜ R ( σ ; 0 , N , ) (cid:62) ˜ R ( √ N ; 0 , N , ) (cid:62) C N q . In particular, there is a universal constant C > σ > sup x ∈ Σ Ns ˜ R ( σ ; x , N , ) (cid:62) min { C N q , C √ N } (cid:62) CN q . (cid:3) Proof of maximin suboptimality.

We prove that ( BP ∗ σ ) is maximin suboptimal. Proof of Lemma 6.4.

The proof is completed by the following chain of inequalities. The ﬁrst and lastequalities are by deﬁnition of the ( BP ∗ σ ) estimator. The ﬁrst inequality follows by relaxing the objective;the second inequality follows by relaxing the constraint condition. (cid:107) ˜ x T C (cid:107) = (cid:13)(cid:13) arg min {(cid:107) x (cid:107) : (cid:107) y − x (cid:107) (cid:54) σ } T C (cid:13)(cid:13) (cid:62) (cid:13)(cid:13) arg min {(cid:107) x T C (cid:107) : (cid:107) y − x (cid:107) (cid:54) σ } T C (cid:13)(cid:13) (cid:62) (cid:13)(cid:13) arg min {(cid:107) x T C (cid:107) : (cid:107) ( y − x ) T C (cid:107) (cid:54) σ } T C (cid:13)(cid:13) ≡ (cid:107) ˜ x (cid:48) (cid:107) (cid:3) Proof of Theorem 6.3.

We may trivially lower-bound the maximin expression by considering the casewhere x : = Ne where e is the ﬁrst standard basis vector. Without loss of generality, we may assumethat this entry is in the ﬁrst coordinate, and is at least N . Again without loss of generality, it sufﬁces toconsider the case where η =

1. We write the lower bound assup x ∈ Σ Ns inf σ > ˜ R ( σ ; x , N , ) (cid:62) inf σ > ˜ R ( σ ; x , N , ) . BERK, A., PLAN, Y., YILMAZ, ¨O If σ (cid:62) √ N , then the result follows by Lemma 6.1. Otherwise, it must be that σ (cid:54) √ N , in whichcase the result follows immediately by Lemma 6.4. In this latter case, we have implicitly assumedthat if σ ∈ ( √ N − , √ N ) , then the omitted technical exercise of adjusting constants in Corollary 6.1and its constituents has been carried out. For further detail on this caveat, see the remark immediatelysucceeding Corollary 6.2. (cid:3)

9. Conclusions

We have illustrated regimes in which each program is unstable. The theory of section 4, section 5 andsection 6 proves asymptotic results for each program, while the numerics of section 7 supports using theasymptotic behaviour as a basis for practical intuition. Thus, we hope these results inform practitionersabout which program to use.In section 4 and 7.1 we observe that ( LS ∗ τ ) exhibits parameter instability in the low-noise regime.The risk ˆ R ( τ ; x , N , η ) develops an asymptotic singularity as η →

0, blowing up for any τ (cid:54) = (cid:107) x (cid:107) ,where ˆ R ( (cid:107) x (cid:107) ; x , N , η ) attains minimax order-optimal error. Numerical simulations support thatˆ R ( τ ; x , N , η ) develops cusp-like behaviour in the low-noise regime, which agrees with the asymptoticsingularity of Theorem 4.1. Notably, ( LS ∗ τ ) parameter instability manifests in very low dimensionsrelative to practical problem sizes. Outside of the low-noise regime, ( LS ∗ τ ) appears to exhibit betterparameter stability, as exempliﬁed in Figure 6.In section 5 and section 7.2 we observe that ( QP ∗ λ ) exhibits left-sided parameter instability in thelow-noise regime. When λ < ¯ λ we prove that R (cid:93) ( λ ; s , N ) scales asymptotically as a power law of N .The suboptimal scaling of the risk manifests in relatively higher dimensional problems, as suggested byFigure 4a. Minimax order-optimal scaling of the risk when λ (cid:62) ¯ λ is clear from Figure 4b. The numericsof section 7 support that ( QP ∗ λ ) is generally the most stable of the three programs considered.In section 6 and 7.3 we observe that ( BP ∗ σ ) exhibits parameter instability in the very sparse regime.Notably, ˜ R ( σ ; x , N , η ) is maximin suboptimal for any choice of σ > s / N sufﬁciently small. Thisbehaviour is supported by Figure 5a, in which the best average loss of ( BP ∗ σ ) is a 82 . ( LS ∗ τ ) and ( QP ∗ λ ) . Further, the average loss for ( BP ∗ σ ) exhibits a clear cusp-like behaviour inFigure 5a, like for that of ( LS ∗ τ ) , which would be an interesting object of further study. Outside of thevery sparse regime, ( BP ∗ σ ) appears to exhibit parameter stability, as exempliﬁed in Figure 6.In section 7.5 we portray how estimators behave as a function of the normalized parameter for eachprogram. We show the kinds of pathologies from which these estimators suffer in unstable regimes, anddemonstrate that estimators for compressed sensing problems can exhibit similar pathologies (section7.5.4). These simulations support the intuition that our theory may be extended to the compressedsensing setting.Finally, we demonstrated the usefulness of Lemma 3.2. By this result, the size of ˜ x ( η √ N ) controlsthe size of ˜ x ( σ ) for σ (cid:54) η √ N when x ≡

0. This was key to demonstrating risk suboptimality forunderconstrained ( BP ∗ σ ) . Moreover, Lemma 3.2 was used to prove ˆ R ( τ ; τ x , N , η ) is an increasing func-tion of τ when (cid:107) x (cid:107) =

1. Thus, the projection lemma was particularly effective for proving minimaxorder-optimality of R ∗ ( s , N ) .Future works include extending the main results to the CS set-up and to more general atomic norms.These results may also extend to ones under more general noise models. Some of these extensions arein preparation by the authors. Lastly, it would be interesting to see what role parameter instability mightplay in proximal point algorithms and those algorithms relying on proximal operators. Conversely, itwould be useful to understand rigorously when a PD program exhibits parameter instability, and todetermine systematically the regime in which that instability arises. ENSITIVITY OF (cid:96) MINIMIZATION TO PARAMETER CHOICE

51 of 52

Funding

This work was supported by the Natural Sciences and Engineering Research Council of Canada (NSERC)[CGSD3-489677 to A.B., 22R23068 to Y.P., 22R82411 to O.Y., 22R68054 to O.Y.]; and the PaciﬁcInstitute for the Mathematical Sciences (PIMS) [CRG 33: HDDA to Y.P., CRG 33: HDDA to O.Y.].

Acknowledgements

We would like to thank Dr. Navid Ghadermarzy for a careful reading of the manuscript.R

EFERENCES [1] Robert J Adler and Jonathan E Taylor.

Random Fields and Geometry . Springer Science & Business Media,2009.[2] Dennis Amelunxen, Martin Lotz, Michael B McCoy, and Joel A Tropp. Living on the edge: Phase transitionsin convex programs with random data.

Inf. Inference , 3(3):224–294, 2014.[3] Mohsen Bayati and Andrea Montanari. The dynamics of message passing on dense graphs, with applicationsto compressed sensing.

IEEE Trans. Inform. Theory , 57(2):764–785, 2011.[4] Mohsen Bayati and Andrea Montanari. The LASSO risk for Gaussian matrices.

IEEE Trans. Inform. Theory ,58(4):1997–2017, 2012.[5] Pierre C Bellec. Localized Gaussian width of M -convex hulls with applications to LASSO and convex aggre-gation. arXiv preprint arXiv:1705.10696 , 2017.[6] Dimitri P Bertsekas, Angelia Nedi, Asuman E Ozdaglar, et al. Convex Analysis and Optimization . AthenaScientiﬁc, 2003.[7] J´erˆome Bolte, Patrick L Combettes, and J-C Pesquet. Alternating proximal algorithm for blind image recov-ery. In , pages 1673–1676. IEEE, 2010.[8] Christer Borell. The Brunn-Minkowski inequality in Gauss space.

Invent. Math. , 30(2):207–216, 1975.[9] Emmanuel J Candes and Mark A Davenport. How well can we estimate a sparse vector?

Appl. Comput.Harmon. Anal. , 34(2):317–323, 2013.[10] Emmanuel J Cand`es, Justin Romberg, and Terence Tao. Robust uncertainty principles: Exact signal recon-struction from highly incomplete frequency information.

IEEE Trans. Inform. Theory , 52(2):489–509, 2006.[11] Emmanuel J Candes, Justin K Romberg, and Terence Tao. Stable signal recovery from incomplete andinaccurate measurements.

Comm. Pure Appl. Math. , 59(8):1207–1223, 2006.[12] Emmanuel J Candes and Terence Tao. Near-optimal signal recovery from random projections: Universalencoding strategies?

IEEE Trans. Inform. Theory , 52(12):5406–5425, 2006.[13] Sourav Chatterjee et al. A new perspective on least squares under convex constraint.

Ann. Statist. , 42(6):2340–2381, 2014.[14] Patrick L Combettes and Jean-Christophe Pesquet. Proximal splitting methods in signal processing. In

Fixed-Point Algorithms for Inverse Problems in Science and Engineering , pages 185–212. Springer, 2011.[15] Ingrid Daubechies and Gerd Teschke. Variational image restoration by means of wavelets: Simultaneousdecomposition, deblurring, and denoising.

Appl. Comput. Harmon. Anal. , 19(1):1–16, 2005.[16] Mark A. Davenport, Marco F. Duarte, Yonina C. Eldar, and Gitta Kutyniok.

Introduction to CompressedSensing , page 164. Cambridge University Press, 2012.[17] David L Donoho. Compressed sensing.

IEEE Trans. Inform. Theory , 52(4):1289–1306, 2006.[18] Jonathan Eckstein and Dimitri P Bertsekas. On the DouglasRachford splitting method and the proximal pointalgorithm for maximal monotone operators.

Math. Program. , 55(1-3):293–318, 1992.[19] Michael Elad. Sparse and redundant representation modeling — What next?

IEEE Signal Processing Letters ,19(12):922–928, 2012.[20] Michael Elad, Mario AT Figueiredo, and Yi Ma. On the role of sparse and redundant representations in imageprocessing.

Proc. IEEE , 98(6):972–982, 2010.

BERK, A., PLAN, Y., YILMAZ, ¨O [21] Simon Foucart and Holger Rauhut.

A Mathematical Introduction to Compressive Sensing . Number 1 in 3.Birkh¨auser Basel, 2013.[22] Michael P Friedlander, Ives Macedo, and Ting Kei Pong. Gauge optimization and duality.

SIAM J. Optim. ,24(4):1999–2022, 2014.[23] Jerome Friedman, Trevor Hastie, and Rob Tibshirani. Regularization paths for generalized linear models viacoordinate descent.

J. Stat. Softw. , 33(1):1, 2010.[24] John B Garnett, Triet M Le, Yves Meyer, and Luminita A Vese. Image decompositions using boundedvariation and generalized homogeneous besov spaces.

Appl. Comput. Harmon. Anal. , 23(1):25–56, 2007.[25] Eric Jones, Travis Oliphant, Pearu Peterson, et al. SciPy: Open source scientiﬁc tools for Python, 2001–.[Online; accessed 14 February 2019].[26] Christopher Liaw, Abbas Mehrabian, Yaniv Plan, and Roman Vershynin. A simple tool for bounding thedeviation of random matrices on geometric sets. In

Geometric Aspects of Functional Analysis , pages 277–299. Springer, 2017.[27] Michael Lustig, David Donoho, and John M Pauly. Sparse mri: The application of compressed sensing forrapid mr imaging.

Magn. Reson. Med. , 58(6):1182–1195, 2007.[28] Yves Meyer.

Oscillating patterns in image processing and nonlinear evolution equations: the ﬁfteenth DeanJacqueline B. Lewis memorial lectures , volume 22. American Mathematical Soc., 2001.[29] L´eo Miolane and Andrea Montanari. The distribution of the lasso: Uniform control over sparse balls andadaptive parameter tuning. arXiv preprint arXiv:1811.01212 , 2018.[30] Samet Oymak and Babak Hassibi. Sharp MSE bounds for proximal denoising.

Found. Comput. Math. ,16(4):965–1029, 2016.[31] Samet Oymak, Christos Thrampoulidis, and Babak Hassibi. The squared-error of generalized LASSO: Aprecise analysis. In , pages1002–1009. IEEE, 2013.[32] Mee Young Park and Trevor Hastie. L -regularization path algorithm for generalized linear models. J. R.Stat. Soc. Ser. B. Stat. Methodol. , 69(4):659–677, 2007.[33] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer,R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay.Scikit-learn: Machine learning in Python.

J. Mach. Learn. Res. , 12:2825–2830, 2011.[34] Yaniv Plan and Roman Vershynin. Robust 1-bit compressed sensing and sparse logistic regression: A convexprogramming approach.

IEEE Trans. Inform. Theory , 59(1):482–494, 2013.[35] Yaniv Plan and Roman Vershynin. Dimension reduction by random hyperplane tessellations.

Discrete Com-put. Geom. , 51(2):438–461, 2014.[36] Yaniv Plan and Roman Vershynin. The generalized LASSO with non-linear observations.

IEEE Trans.Inform. Theory , 62(3):1528–1537, 2016.[37] R Tyrrell Rockafellar. Monotone operators and the proximal point algorithm.

SIAM J. Control Optim. ,14(5):877–898, 1976.[38] Leonid I Rudin, Stanley Osher, and Emad Fatemi. Nonlinear total variation based noise removal algorithms.

Phys. D , 60(1-4):259–268, 1992.[39] BS Tsirelson, IA Ibragimov, and VN Sudakov. Norms of Gaussian sample functions. In

Proceedings of theThird JapanUSSR Symposium on Probability Theory , pages 20–41. Springer, 1976.[40] Ewout Van Den Berg and Michael P Friedlander. Probing the pareto frontier for basis pursuit solutions.

SIAMJ. Sci. Comput. , 31(2):890–912, 2008.[41] Ramon van Handel. Probability in high dimension. Technical report, Princeton University NJ, 2014.[42] Roman Vershynin.

High-dimensional Probability: An Introduction with Applications in Data Science , vol-ume 47. Cambridge University Press, 2018.[43] P Wojtaszczyk. Stability and instance optimality for Gaussian measurements in compressed sensing.