The distribution of the Lasso: Uniform control over sparse balls and adaptive parameter tuning
ThThe distribution of the Lasso:Uniform control over sparse balls and adaptive parameter tuning
L´eo Miolane ? and Andrea Montanari † November 6, 2018
Abstract
The Lasso is a popular regression method for high-dimensional problems in which the number of parameters θ , . . . , θ N , is larger than the number n of samples: N > n . A useful heuristics relates the statistical properties ofthe Lasso estimator to that of a simple soft-thresholding denoiser, in a denoising problem in which the parameters ( θ i ) i ≤ N are observed in Gaussian noise, with a carefully tuned variance. Earlier work confirmed this picture inthe limit n, N → ∞ , pointwise in the parameters θ , and in the value of the regularization parameter.Here, we consider a standard random design model and prove exponential concentration of its empiricaldistribution around the prediction provided by the Gaussian denoising model. Crucially, our results are uniformwith respect to θ belonging to ‘ q balls, q ∈ [0 , , and with respect to the regularization parameter. This allowsto derive sharp results for the performances of various data-driven procedures to tune the regularization.Our proofs make use of Gaussian comparison inequalities, and in particular of a version of Gordon’s minimaxtheorem developed by Thrampoulidis, Oymak, and Hassibi, which controls the optimum value of the Lasso opti-mization problem. Crucially, we prove a stability property of the minimizer in Wasserstein distance, that allowsto characterize properties of the minimizer itself. Given data ( x i , y i ) , ≤ i ≤ n , with x i ∈ R N , y i ∈ R , the Lasso [48, 13] fits a linear model by minimizing the costfunction L λ ( θ ) = 12 n n X i =1 (cid:0) y i − h x i , θ i (cid:1) + λn | θ | = 12 n k y − Xθ k + λn | θ | . (1)Here X ∈ R n × N is the matrix with rows x , . . . , x n , y = ( y , . . . , y n ) , k v k denotes the ‘ norm of vector v , and | v | its ‘ norm. To fix normalizations, we will assume that the columns of X have ‘ norm o (1) . (Note thatthis normalization is different from the one that is sometimes adopted in the literature, but the two are completelyequivalent.)A large body of theoretical work supports the use of ‘ regularization in the high-dimensional regime n (cid:46) N ,when only a small subset of the coefficients θ are expected to be large. Broadly speaking, we can distinguish twotypes of theoretical approaches. A first line of work makes deterministic assumptions about the design matrix X ,such as the restricted isometry property and its generalizations [12, 10]. Under such conditions, minimax optimalestimation rates as well as oracle inequalities have been proved in a remarkable sequence of papers [11, 8, 52, 34, 37].As an example, assume that that the linear model is correct. Namely, y = Xθ ? + σz , (2)for σ ≥ , z ∼ N (0 , I n ) , and θ ? a vector with s non-zero entries. Then, a theorem of Bickel, Ritov and Tsy-bakov [8] implies that, with high probability, λ ≥ σ p c log N ⇒ k b θ λ − θ ? k ≤ Cs λ , (3)for some constants c , C that depend on the specific assumptions on the design. (The normalization of [8] isrecovered by setting σ = σ /n , where σ is the noise variance of [8].) ? D´epartement d’Informatique de l’ENS, ´Ecole Normale Sup´erieure, CNRS, PSL Research University & Inria, Paris, France. † Department of Electrical Engineering and Department of Statistics, Stanford University. a r X i v : . [ m a t h . S T ] N ov . . . . . . . . . δ . . . . . . . . . R i s k MMSEmin λ R ∗ ( λ )Lasso, λ = (cid:98) λ EST
Lasso, λ = (cid:98) λ SURE
Lasso, λ = (cid:98) λ k − CV Lasso, λ = σ √ N (a) Independent columns . . . . . . . . . δ . . . . . . . R i s k Lasso, λ = (cid:98) λ EST
Lasso, λ = (cid:98) λ SURE
Lasso, λ = σ √ N Lasso, λ = (cid:98) λ k − CV min λ N (cid:107) θ (cid:63) − (cid:98) θ λ (cid:107) (b) Correlated columns Figure 1:
Estimation risk of the Lasso for different choices of λ , as a function of δ . N = 8000 . Inboth plots, σ = 0 . . The true coefficients vector θ ? is chosen to be sN -sparse with s = 0 . . Theentries on the support of θ ? are drawn i.i.d. N (0 , . Cross-validation is carried out using folds.SURE is computed using the estimator b σ for the plot on the left, and the true value of σ on the right. Left:
A standard random design with ( X ij ) ∼ iid N (0 , /n ) . Right:
The rows of the design matrix X are i.i.d. Gaussian, with correlation structure given by anautoregressive process, see Eq. (20). Here we used φ = 2 .Unfortunately, this analysis provides limited insight into the choice of the regularization parameter λ which–in practice– can impact significantly the estimation accuracy. As an example, Fig. 3 reports the result of a smallsimulation in which we compare four different methods of selecting λ . The bound of Eq. (3) suggests to set λ = σ √ c log N . For the standard random design used in the left frame, the optimal constant is expected to be c =2 [16, 19]. We compare this method to three procedures that adapt the choice of λ to the data: cross validation(CV), Stein’s Unbiased Risk Estimate (SURE), and a procedure that minimizes an estimate of the risk (EST). We referto the next sections for further details on these methods. Note that all of these adaptive procedures significantlyoutperform the ‘theory driven’ λ : over a broad range of sample sizes n , the resulting estimation error is to times smaller. Further, the error achieved by these methods is quite close to the Bayes optimum.These empirical observations are not captured by the bound (3), or by similar results.An alternative style of analysis postulates an idealized model for the data and derives asymptotically exactresults. Throughout this paper we will consider the simplest of such models, by assuming that design matrix tohave i.i.d. entries X ij ∼ N (0 , /n ) . While this assumption is likely to be violated in practice, it allows to deriveuseful insights that are mathematically consistent, and susceptible of being generalized to a broader context. Thistype of analysis was first carried out in the context of the Lasso in [6] and then extended to a number of otherproblems, see e.g. [29, 47, 14, 46, 22, 44]. As an example, Figure 1 reports the predictions of this analysis for therisk of the three adaptive procedure for selecting λ . The agreement with the numerical simulations is excellent.Unfortunately, the results in [6] (and in follow-up work) do not allow to derive in a mathematically rigorousway curves such as the ones in Figure 1. In fact earlier results hold ‘pointwise’ over λ and hence do not apply toadaptive procedures to select λ . Further they provide asymptotic estimates ‘pointwise’ over θ , and hence do notallow to compute –for instance– minimax risk.In order to clarify these points, it is useful to overview informally the picture emerging from [6, 18]. Fix θ ∈ R N , λ ∈ R > , and let η ( x ; b ) = ( | x | − b ) + sign( x ) be the soft thresholding function. By the KKT conditionsthe Lasso estimator b θ λ satisfies b θ λ = η (cid:0)b θ dλ ; ατ (cid:1) , b θ dλ = b θ λ + ατλ X T ( y − X b θ λ ) , (4)where the vector b θ dλ is also referred to as the ‘debiased Lasso’ [55, 51, 26]. The above identity holds for arbitrary α, τ > . However, [6] predicts that the distribution of the debiased estimator b θ dλ simplifies dramatically forspecific choices of these parameters.Namely, let Θ be a random variable with distribution given by the empirical distribution of ( θ i ) i ≤ N (i.e., Θ = θ i with probability /N , for i ∈ { , . . . , N } ) and let Z ∼ N (0 , be independent of Θ . Define α ∗ , τ ∗ to be the2olution of the following system of equations (we refer to Section 3.1 for a discussion of existence and uniqueness): ( τ = σ + δ E h ( η (Θ + τ Z, ατ ) − Θ) i ,λ = ατ (cid:0) − δ P (cid:0)(cid:12)(cid:12) Θ + τ Z (cid:12)(cid:12) > ατ (cid:1)(cid:1) . 
(5)When α, τ are selected in this way, b θ dλ is approximately normal with mean θ ? (the true parameters vector) andvariance τ ∗ : b θ d ≈ N ( θ ? , τ ∗ I) . More precisely, for any test function f : R × R → R , with | f ( x ) − f ( y ) | ≤ L (1 + k x k + k y k ) k x − y k , almost surely, lim N →∞ N N X i =1 f ( θ ?i , b θ dλ,i ) = E (cid:8) f (Θ , Θ + τ ∗ Z ) (cid:9) , lim N →∞ N N X i =1 f ( θ ?i , b θ λ,i ) = E (cid:8) f (Θ , η (Θ + τ ∗ Z ; α ∗ τ ∗ )) (cid:9) . (6)This is an asymptotic result, which holds along sequences of problems with: ( i ) Converging aspect ratio n/N → δ ∈ (0 , ∞ ) ; ( ii ) Fixed regularization λ ∈ (0 , ∞ ) ; ( iii ) Parameter vectors θ ? = θ ? ( n ) whose empirical distributionconverges (weakly) to a limit law p Θ . As emphasized above, this does not allow deduce the behavior of the Lassowith adaptive choices of λ (there could be deviations from the above limits for exceptional values of λ ), or tocompute the minimax risk (there could be deviations for exceptional vectors θ ? ).The importance of establishing uniform convergence with respect to the regularization parameter λ was re-cently emphasized by Mousavi, Maleki, and Baraniuk [33]. Among other results, these authors derive a uniformconvergence statement for the related approximate message passing (AMP) algorithm. However, in order to es-tablish uniform convergence, they have to construct an ad-hoc smoothing of the quantity of interest, which isroughly equivalent to discretizing the corresponding tuning parameter.In this paper, we obtain uniform (in λ ) convergence results for the Lasso, hence providing a sound mathematicalbasis to the comparison of various adaptive procedures, as well as to the study of minimax risk.The rest of the paper is organized as follows. Section 2 reviews related work. We state our main theoreticalresults in Section 3. In Section 4 we apply these results to two types of statistical questions: estimating the riskand noise level, and selecting λ through adaptive procedures. Further, we illustrate our results in numerical sim-ulations. Finally, Section 5 outlines the main proof ideas, with most technical legwork deferred to the appendices. There is –by now– a substantial literature on determining exact asymptotics in high-dimensional statistical models,and a number of mathematical techniques have been developed for this task. We will only provide a few pointersfocusing on high-dimensional regression problems.The original proof of [6] was based on an asymptotically exact analysis of an approximate message passing(AMP) algorithm [5] that was first proposed in [18] to minimize the Lasso cost function. Variants of AMP havebeen developed in a number of contexts, opening the way to the analysis of various statistical estimation prob-lems. A short list includes generalized linear models [36], phase retrieval [40, 31], robust regression [14], logisticregression [44], generalized compressed sensing [7]. This approach is technically less direct than others, but hasthe advantage of providing an efficient algorithm, and is and not necessarily limited to convex problems (see [32]for a non-convex example).As mentioned above, our work was partially motivated by the recent results of Mousavi, Maleki, and Bara-niuk [33] that establish a form of uniformity for the AMP estimates –but not for the Lasso solution. 
It would beinteresting to understand whether the approach of [33] could also be used to obtain uniform results for the Lassoor other statistical estimators.Here we follow a different route that exploits powerful Gaussian comparison inequalities first proved by Gor-don [24, 25]. Gordon inequality allows to bound the distribution of a minimax value, i.e. the value of a randomvariable G ∗ = min i ≤ N max j ≤ M G ij , where ( G ij ) i ≤ N,j ≤ M is a Gaussian process, in terms of a similar quantityfor a ‘simpler’ Gaussian process. The use of Gordon’s inequality in this context was pioneered by Stojnic [43] andthen developed by a number of authors in the context of regularized regression [47], M-estimation [46], general-ized compressed sensing [1], binary compressed sensing [42] and so on. The key idea is to write the optimizationproblem of interest as a minimax problem, and then apply a suitable version of Gordon’s inequality. A matchingbound is obtained by convex duality and then a second application of Gordon’s inequality. In particular, convexityof the cost function of interest is a crucial ingredient.While the Gaussian comparison inequality provides direct access to the value of the optimization problem,understanding the properties of the estimator can be more challenging. In this paper we identify a property (that3e call local stability ) that allows to transfer information on the minimum (the Lasso cost) into information aboutthe minimizer (the Lasso estimator). We believe this strategy can be applied to other examples beyond the Lasso.Independently, a different approach based on leave-one-out techniques was developed by El Karoui in thecontext of ridge-regularized robust regression [29, 22].Finally, a parallel line of research determines exact asymptotics for Bayes optimal estimation, under a modelin which the coordinates of θ are i.i.d. with common distribution p Θ . In particular, the asymptotic Bayes optimalerror for linear regression with random designs was recently determined in [2, 38]. Of course –in general– Bayesoptimal estimation requires knowledge of the distribution p Θ , and is not computationally efficient. We will usethis Bayes-optimal error as a benchmark of our adaptive procedures. Generalizations of these results were alsoobtained in [3] for other regression problems. A successful approach to these models uses smart interpolationtechniques that generalize ideas in spin-glass theory. As stated above, we consider the standard linear model (2) where y = Xθ ? + σz , with noise z ∼ N (0 , I n ) , and X a Gaussian design: ( X i,j ) i ≤ n,j ≤ N i.i.d. ∼ N (0 , /n ) . The Lasso estimator is defined by b θ λ = arg min θ ∈ R N L λ ( θ ) . (7)(The minimizer is almost surely unique since the columns of X are in generic positions.) We set δ = n/N to be thenumber of samples per dimension. We are interested in uniform estimation over sparse vectors θ ? . Following [16,27] we formalize this notion using ‘ p -balls (which are convex sets only for p ≥ ). Definition 3.1
Define for p, ξ > the ‘ p -ball F p ( ξ ) = ( x ∈ R N (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) N N X i =1 | x i | p ≤ ξ p ) , and for s ∈ [0 , F ( s ) = (cid:8) x ∈ R N (cid:12)(cid:12) k x k ≤ sN (cid:9) . By Jensen’s inequality we have for p ≥ p > , F p ( ξ ) ⊂ F p ( ξ ) .Let φ ( x ) = e − x / √ π be the standard Gaussian density and Φ( x ) = R x −∞ φ ( t ) dt be the associated cumulativefunction. In the case of ‘ balls (sparse vectors), a crucial role is played by the following sparsity level. Definition 3.2
Define the critical sparsity as s max ( δ ) = δ max α ≥ ( − δ (cid:0) (1 + α )Φ( − α ) − αφ ( α ) (cid:1) α − (cid:0) (1 + α )Φ( − α ) − αφ ( α ) (cid:1) ) . The critical sparsity curve first appears in the seminal work by Donoho and Tanner on compressed sensing [20, 15].These authors consider the noiseless case ( z = 0 ) of model (2) and reconstruction via ‘ minimization (whichcorresponds to the λ → limit of the Lasso). They prove that ‘ minimization reconstructs exactly θ ? with highprobability, if k θ ? k ≤ N ( s max ( δ ) − ε ) , and fails with high probability if k θ ? k ≥ N ( s max ( δ ) + ε ) (for any ε > ).A second interpretation of the critical sparsity s max ( δ ) was given in [19, 50, 47]. For k θ ? k ≤ N ( s max ( δ ) − ε ) ,the Lasso achieves stable reconstruction. Namely, there exists M = M ( s, δ ) < ∞ for s < s max ( δ ) , such that,if k θ ? k ≤ N s , k b θ λ − θ ? k ≤ M ( s, δ ) σ . Our results provide a third interpretation: uniform limit laws for theLasso will be obtained on ‘ balls only for s < s max ( δ ) . 4 crucial role in our results is provided by the following max-min problem: max β ≥ min τ ≥ σ ψ λ ( β, τ ) , (8) ψ λ ( β, τ ) ≡ (cid:18) σ τ + τ (cid:19) β − β + 1 δ E min w ∈ R (cid:26) w τ β − βZw + λ | w + Θ | − λ | Θ | (cid:27) . The expectation above is with respect to (Θ , Z ) ∼ b µ θ ? ⊗ N (0 , , where b µ θ ? denotes the empirical distribution ofthe entries of the vector θ ? : b µ θ ? = 1 N N X i =1 δ θ ?i . Proposition 3.1
The max-min (8) is achieved at a unique couple ( β ∗ ( λ ) , τ ∗ ( λ )) . Moreover, ( τ ∗ ( λ ) , β ∗ ( λ )) is also the unique couple ( β, τ ) ∈ (0 , + ∞ ) that verify τ = σ + δ E h ( η (Θ + τ Z, τ λβ ) − Θ) i β = τ (cid:16) − δ E h η (Θ + τ Z, τλβ ) i(cid:17) . (9) We will also use the notation α ∗ ( λ ) = λ/β ∗ ( λ ) and s ∗ ( λ ) = E (cid:2) η (Θ + τ ∗ ( λ ) Z, τ ∗ ( λ ) α ∗ ( λ )) (cid:3) = P (cid:0) | Θ + τ ∗ ( λ ) Z | ≥ α ∗ ( λ ) τ ∗ ( λ ) (cid:1) . (10)We will sometimes omit the dependency on λ and write simply α ∗ , β ∗ , τ ∗ , s ∗ . The distribution µ ∗ λ definedbelow will correspond (see Theorem 3.1 in the next section) to the limit of the empirical distribution of the entriesof ( b θ λ , θ ? ) . Definition 3.3
We denote by µ ∗ λ the law of the couple (cid:0) η (cid:0) Θ + τ ∗ ( λ ) Z, α ∗ ( λ ) τ ∗ ( λ ) (cid:1) , Θ (cid:1) , where (Θ , Z ) ∼ b µ θ ? ⊗ N (0 , . We fix from now on < λ min ≤ λ max and D ⊂ R N that can be either F p ( ξ ) for some ξ, p > , or F ( s ) forsome s < s max ( δ ) . Our uniformity domain is defined by Ω = (cid:0) δ, σ, D , λ min , λ max (cid:1) . Namely, we will control b θ λ uniformly with respect to θ ? ∈ D and λ ∈ [ λ min , λ max ] , with n/N = δ . We will call constant any quantity thatonly depends on Ω . In absence of further specifications, C, c will be constants (that depend only on Ω ) that areallowed to change from one line to another.Our first result shows that the empirical distribution of the entries { ( b θ λ,i , θ ?i ) } i ≤ N is uniformly close to themodel µ ∗ λ . We quantify deviations using the Wasserstein distance. Recall that, given two probability measures µ, ν on R d with finite second moment, their Wasserstein distance of order is W ( µ, ν ) = (cid:16) inf γ ∈C ( µ,ν ) Z k x − y k γ (d x, d y ) (cid:17) / , (11)where the infimum is taken over all couplings of µ and ν . Note that W metrizes the convergence in Eq. (6).Namely lim n →∞ W ( µ n , µ ∗ ) = 0 if and only if, for any test function f : R × R → R , with | f ( x ) − f ( y ) | ≤ L (1 + k x k + k y k ) k x − y k , we have lim n →∞ R f ( x ) µ n (d x ) = R f ( x ) µ ∗ (d x ) [53]. It provides therefore a naturalway to extend earlier results to a non-asymptotic regime. Theorem 3.1
Assume that D = F p ( ξ ) for some ξ > and p > . Then there exists constants C, c > that only depend on Ω ,such that for all (cid:15) ∈ (0 , ]sup θ ? ∈D P sup λ ∈ [ λ min ,λ max ] W (cid:0)b µ ( b θ λ ,θ ? ) , µ ∗ λ (cid:1) ≥ (cid:15) ! ≤ C(cid:15) − max(1 ,a ) − N (1 /p − + exp (cid:0) − cN (cid:15) (cid:15) a log( (cid:15) ) − (cid:1) , where a = + p . Remark 1.
It is worth emphasizing in what sense Theorem 3.1 is uniform with respect to λ ∈ [ λ min , λ max ] and to θ ? ∈ D :• Uniformity with respect to λ . We bound (in probability) the maximum (over λ ) deviation between the empiricaldistribution b µ ( b θ λ ,θ ? ) and the predicted distribution µ ∗ λ . (The supremum over λ is ‘inside’ the probability.)• Uniformity with respect to θ ? . We bound the maximum probability (over θ ? ) of a deviation between b µ ( b θ λ ,θ ? ) and µ ∗ λ . (The supremum over θ ? is ‘outside’ the probability.)The reader might wonder whether it is possible to strengthen this result and bound the maximum deviation over θ ? (‘move the supremum over θ ? inside’). The answer is negative. In particular, we can choose the support of θ ? tocoincide with a submatrix of X with atypically small minimum singular value. This will result in larger estimationerror k b θ λ − θ ? k , and hence in a large Wasserstein distance W ( b µ ( b θ λ ,θ ? ) , µ ∗ λ ) . Remark 2.
Note that Theorem 3.1 does not hold for ‘ balls. This is probably a fundamental problem, since controlling W distance uniformly over ‘ balls is impossible even in the simple sequence model (or, equivalently, for orthogonaldesigns X ). Namely, consider the case in which we observe y i = θ ?i + z i , i ≤ N , where ( z i ) i ≤ N i.i.d. ∼ N (0 , τ ∗ ) , andwe try to estimate θ ? by computing b θ λ,i = η ( y i ; λ ) . Then there are vectors θ ? ∈ F ( s ) such that the empirical law b µ ( b θ λ ,θ ? ) does not concentrate in Wasserstein distance around its expectation µ ∗ λ , i.e. the law of (Θ , η (Θ + Z ; λ ) ) for G ∼ N (0 , τ ∗ ) .In order to see this, it is sufficient to consider the vector θ ? = ( N, N, . . . , kN, , . . . , . In Appendix F.1, we prove that (for this choice of θ ? ) there exists a constant c such that W ( b µ ( b θ λ ,θ ? ) , µ ∗ λ ) ≥ p k/N with probability at least − e − c k for all N large enough. We can think of several possibilities to overcome this intrinsic non-uniformity over ‘ balls. One option wouldbe to consider a weaker notion of distance between probability measures. Here we follow a different route, andprove uniform estimates over ‘ balls for several specific quantities of interest. In order to state these results, weintroduce the following quantities, which correspond to the risk and the prediction error (and are expressed interms of the solution ( τ ∗ , β ∗ ) of (9)) R ∗ ( λ ) = δ (cid:0) τ ∗ ( λ ) − σ (cid:1) , (12) P ∗ ( λ ) = β ∗ ( λ ) + 2 σ δ s ∗ ( λ ) − σ δ . (13) Theorem 3.2
Assume here that D is either F ( s ) or F p ( ξ ) for some ≤ s < s max ( δ ) and ξ > , p > . There exists constants C, c > that only depend on Ω , such that for all (cid:15) ∈ (0 , θ ? ∈D P sup λ ∈ [ λ min ,λ max ] (cid:16) N k b θ λ − θ ? k − R ∗ ( λ ) (cid:17) ≥ (cid:15) ! ≤ C(cid:15) N q e − cN(cid:15) , (14) sup θ ? ∈D P (cid:16) sup λ ∈ [ λ min ,λ max ] (cid:16) n k y − X b θ λ k − β ∗ ( λ ) (cid:17) ≥ (cid:15) (cid:17) ≤ C(cid:15) N q e − cN(cid:15) , (15) sup θ ? ∈D P sup λ ∈ [ λ min ,λ max ] (cid:16) n k X ( θ ? − b θ λ ) k − P ∗ ( λ ) (cid:17) ≥ (cid:15) ! ≤ C(cid:15) N q e − cN(cid:15) , (16) where q = 0 if D = F ( s ) and q = (1 /p − + if D = F p ( ξ ) . The statement (14) is proved in Appendix C.2, while (15)-(16) are proved in Appendix D.So far we focused on the Lasso estimator b θ λ . The debiased Lasso estimator is defined as b θ dλ = b θ λ + X T ( y − X b θ λ )1 − n k b θ λ k . p -values [55, 51, 26, 45], andprovide an explicit construction of the ‘direct observations’ model in the sense that b θ dλ is approximately distributedas N ( θ ? , τ ∗ I) . We let µ ( d ) λ be the law of the couple (cid:0) Θ + τ ∗ ( λ ) Z, Θ (cid:1) , where (Θ , Z ) ∼ ˆ µ θ ? ⊗ N (0 , . Theorem 3.3
Let b µ ( b θ dλ ,θ ? ) denote the empirical distribution (on R ) of the entries of ( b θ dλ , θ ? ) . There exists constants c, C > such that for all (cid:15) ∈ (0 , , sup θ ? ∈F ( ξ ) P (cid:16) sup λ ∈ [ λ min ,λ max ] W ( b µ ( b θ dλ ,θ ? ) , µ ( d ) λ ) ≥ (cid:15) (cid:17) ≤ C(cid:15) e − cN(cid:15) . Theorem 3.3 is proved in Section F.5.
In order to select the regularization parameter and to evaluate the quality of the Lasso solution b θ λ , it is usefulto estimate the risk and noise level. The paper [4] developed a suite of estimators of these quantities based onthe asymptotic theory of [6]. The same paper also proposed generalizations of these estimators to correlateddesigns. Here we revisit these estimators and prove stronger guarantees. First, we obtain quantitative boundon the consistency rate of our estimators. Second, our results are uniform over λ , which justifies using theseestimators to select λ .Let us start with the estimation of τ ∗ ( λ ) which plays a crucial role in the asymptotic theory. We define b τ ( λ ) = √ n k y − X b θ λ k n − k b θ λ k . We will see with Theorem F.1 presented in Appendix F.4 that lim
N,n →∞ N k b θ λ k = P ( | Θ + τ ∗ Z | ≥ τ ∗ λ/β ∗ ) ≡ s ∗ ( λ ) . Further, by Theorem 3.2, we have √ n k y − X b θ λ k = β ∗ ( λ ) + o n (1) . Recall that by (9) we have β ∗ ( λ ) = τ ∗ ( λ ) (cid:0) − δ s ∗ ( λ ) (cid:1) . We deduce b τ ( λ ) = τ ∗ ( λ ) + o n (1) . More precisely we have the following consistency result. Corollary 4.1
Assume here that D is either F ( s ) or F p ( ξ ) for some ≤ s < s max ( δ ) and ξ > , p > . There exists constants C, c > that only depend on Ω such that for all (cid:15) ∈ (0 , θ ? ∈D P sup λ ∈ [ λ min ,λ max ] | b τ ( λ ) − τ ∗ ( λ ) | ≥ (cid:15) ! ≤ C(cid:15) − N q exp (cid:0) − cN (cid:15) (cid:1) , where q = 0 if D = F ( s ) and q = (1 /p − + if D = F p ( ξ ) . We next consider estimating the ‘ error of the Lasso. Following [6], we define b R ( λ ) = b τ ( λ ) (cid:16) N k b θ λ k − (cid:17) + (cid:13)(cid:13) X T ( y − X b θ λ ) (cid:13)(cid:13) N (cid:0) − n k b θ λ k (cid:1) . orollary 4.2 Assume here that D is either F ( s ) or F p ( ξ ) for some ≤ s < s max ( δ ) and ξ > , p > . There exists constants C, c > such that for all (cid:15) ∈ (0 , , sup θ ? ∈D P (cid:16) sup λ ∈ [ λ min ,λ max ] (cid:12)(cid:12)(cid:12) b R ( λ ) − N k b θ λ − θ ? k (cid:12)(cid:12)(cid:12) ≥ (cid:15) (cid:17) ≤ C(cid:15) N q e − cN(cid:15) , where q = 0 if D = F ( s ) and q = (1 /p − + if D = F p ( ξ ) . Corollary 4.2 is proved in Appendix F.6. Since by Corollary 4.2, Corollary 4.1, Theorem 3.2 we have with highprobability b R ( λ ) ’ N k b θ λ − θ ? k ’ δ ( τ ∗ ( λ ) − σ ) ’ δ ( b τ ( λ ) − σ ) , the estimator b σ ( λ ) = b τ ( λ ) − Nn b R ( λ ) = b τ ( λ ) (cid:16) Nn − n k b θ λ k (cid:17) − (cid:13)(cid:13) X T ( y − X b θ λ ) (cid:13)(cid:13) n (cid:0) − n k b θ λ k (cid:1) (17)is a consistent estimator of the noise level σ . Corollary 4.3
There exists constants
C, c > that only depend on Ω , such that for all (cid:15) ∈ (0 , θ ? ∈D P sup λ ∈ [ λ min ,λ max ] (cid:12)(cid:12)b σ ( λ ) − σ (cid:12)(cid:12) > (cid:15) ! ≤ C(cid:15) N q e − cN(cid:15) . Finally, we consider the prediction error k Xθ ? − X b θ λ k . Stein Unbiased Risk Estimator (SURE) provides ageneral method to estimate the prediction error, see e.g. [41, 21, 49]. In the present case, it takes the form b P SURE ( λ ) = 1 n k y − X b θ λ k + 2 σ n k b θ λ k . (18)Tibshirani and Taylor [49] proved that b P SURE ( λ ) is an unbiased estimator of the prediction error, namely E { b P SURE ( λ ) } = 1 n k Xθ ? − X b θ λ k + σ . (19)The next result establishes consistency, uniformly over λ and θ ? , with quantitative concentration estimates. Corollary 4.4
Assume here that D is either F ( s ) or F p ( ξ ) for some ≤ s < s max ( δ ) and ξ > , p > . There exists constants C, c > that only depend on Ω such that for all (cid:15) ∈ (0 , θ ? ∈D P (cid:16) sup λ ∈ [ λ min ,λ max ] (cid:12)(cid:12)(cid:12) n k Xθ ? − X b θ λ k + σ − b P SURE ( λ ) (cid:12)(cid:12)(cid:12) ≥ (cid:15) (cid:17) ≤ C(cid:15) N q e − cN(cid:15) , where q = 0 if D = F ( s ) and q = (1 /p − + if D = F p ( ξ ) .The same result holds if σ in (18) is replaced by an estimator of the noise level satisfying the same consistencycondition as b σ defined by (17) (cf. Corollary 4.3). This corollary follows simply from Theorem F.1 and Theorem 3.2.
Remark 3.
Notice that exact unbiasedness of b P SURE ( λ ) only holds if the noise z in the linear model (2) is Gaus-sian [49]. In contrast, it is not hard to generalize the proofs in the present paper to include other noise distributions. λ As anticipated, we can use our uniform bounds to select λ through an adaptive procedure. We discuss here threesuch procedures, that have already been illustrated in Figure 1: ( i ) Selecting λ by minimizing the estimate b τ ( λ ) , wedenote this by b λ EST ; ( ii ) Select λ as to minimize Stein’s Unbiased Risk Estimate b P SURE ( λ ) , b λ SURE ; ( iii ) Select λ by k -fold cross-validation, b λ k -CV . We will next describe these procedures in greater detail, and state the correspondingguarantees. 8 inimization of b τ ( λ ) . Since the ‘ risk of the Lasso is by Theorem 3.2 approximately equal to R ∗ ( λ ) = δ ( τ ∗ ( λ ) − σ ) and since by Corollary 4.1, b τ is a consistent estimator (uniformly in λ ) of τ ∗ , a natural procedure for selecting λ is to minimize b τ . We then define b λ EST = arg min λ ∈ [ λ min ,λ max ] b τ ( λ ) . The next result is an immediate consequence of Theorem 3.2 and Corollary 4.1:
Proposition 4.1
Assume here that D is either F ( s ) or F p ( ξ ) for some ≤ s < s max ( δ ) and ξ > , p > . There exists constants C, c > that only depend on Ω such that for all (cid:15) ∈ (0 , θ ? ∈D P (cid:18) N k b θ b λ EST − θ ? k ≤ inf λ ∈ [ λ min ,λ max ] n N k b θ λ − θ ? k o + (cid:15) (cid:19) ≥ − C(cid:15) − N q exp (cid:0) − cN (cid:15) (cid:1) , where q = 0 if D = F ( s ) and q = (1 /p − + if D = F p ( ξ ) .Minimization of SURE. We define b λ SURE = arg min λ ∈ [ λ min ,λ max ] b P SURE ( λ ) . Here, it is understood that we can use either σ or b σ ( λ ) , cf. Eq. (17), in the definition of b P SURE . We deduce fromCorollary 4.4:
Proposition 4.2
Assume here that D is either F ( s ) or F p ( ξ ) for some ≤ s < s max ( δ ) and ξ > , p > . There exists constants C, c > that only depend on Ω such that for all (cid:15) ∈ (0 , θ ? ∈D P (cid:18) n k X b θ b λ SURE − Xθ ? k ≤ inf λ ∈ [ λ min ,λ max ] n n k X b θ λ − Xθ ? k o + (cid:15) (cid:19) ≥ − C(cid:15) − N q exp (cid:0) − cN (cid:15) (cid:1) , where q = 0 if D = F ( s ) and q = (1 /p − + if D = F p ( ξ ) .Cross-validation. We analyze now k -fold Cross Validation. Let k ≥ and define n k = n ( k − /k . We partitionthe rows of X in k groups: we obtain k -submatrices of size ( n/k ) × N that we denote X (1) , . . . , X ( k ) . Let us alsowrite for i ∈ { , . . . , k } , X ( - i ) for the submatrix of X obtained by removing the rows X ( i ) . We denote by y ( i ) , z ( i ) and y ( - i ) , z ( - i ) the corresponding subvectors of y and z .The estimator b R k -CV of the risk using k -fold cross validation if defined as follows. For i = 1 , . . . , k solve theLasso problem b θ iλ = arg min θ ∈ R N (cid:26) n k (cid:13)(cid:13)(cid:13) y ( - i ) − X ( - i ) θ (cid:13)(cid:13)(cid:13) + λn | θ | (cid:27) , and then compute b R k -CV ( λ ) = 1 N k X i =1 (cid:13)(cid:13)(cid:13) y ( i ) − X ( i ) b θ iλ (cid:13)(cid:13)(cid:13) . Finally, we set λ as follows b λ k -CV = arg min λ ∈ [ λ min ,λ max ] b R k -CV ( λ ) . The next Proposition shows that b R k -CV ( λ ) is equal to the true risk (shifted by δσ ) up to O ( k − / ) . Proposition 4.3
There exists constants c, C > that depend only on Ω , such that for all k ≥ such that s max (cid:0) ( k − δ/k (cid:1) > s in the case where D = F ( s ) , we have sup θ ? ∈D P sup λ ∈ [ λ min ,λ max ] (cid:12)(cid:12)(cid:12) b R k -CV ( λ ) − N k b θ λ − θ ? k − δσ (cid:12)(cid:12)(cid:12) ≥ C √ k ! ≤ Ck N q e − cN/k , where q = 0 if D = F ( s ) and q = (1 /p − + if D = F p ( ξ ) . N k b θ b λ k - CV − θ ? k ≤ inf λ ∈ [ λ min ,λ max ] N k b θ λ − θ ? k + O ( k − / ) . In this Section we compare numerically various different choices for the regularization parameter λ , namely b λ EST , b λ SURE and b λ k -CV , presented in the previous section. For these experiments we take the components θ ? , . . . , θ ?N tobe i.i.d. from P = s N (0 ,
1) + (1 − s ) δ . Within this probabilistic model, we can compare achieved by our various choice of λ to the Bayes optimal error(Minimal Mean Squared Error): MMSE N = min b θ E h(cid:13)(cid:13) θ ? − b θ ( y, X ) (cid:13)(cid:13) i = E h(cid:13)(cid:13) θ ? − E [ θ ? | y, X ] (cid:13)(cid:13) i , where the minimum is taken over all estimators b θ (i.e. measurable functions of X, y ). The limit of the MMSE hasbeen recently computed by [2] and [38]. Recall, that given two random variables
U, V , their mutual informationis the Kullback-Leibler divergence between their joint distribution and the product of the marginals: I ( U ; V ) ≡ D KL ( p U,V k p U × p V ) . Theorem 4.1 (
Information-theoretic limit, from [2, 38] ) Define the function Ψ δ,σ ( m ) = I P (cid:16) σ − m (cid:17) + δ (cid:16) log(1 + m ) − m m (cid:17) , where I P ( r ) = I (Θ; √ r Θ + Z ) for (Θ , Z ) ∼ P ⊗ N (0 , . Then, for almost every δ, σ > the function Ψ δ,σ admits a unique maximizer m ∗ ( δ, σ ) on R ≥ and MMSE N −−−−→ N →∞ δσ m ∗ ( δ, σ ) . Figure 1 reports the risk achieved by the various choices of λ as a function of the number of samples perdimension δ . We also compare the data-driven procedures of the previous section to the theory-driven choice λ = σ √ N . In the left frame, we consider uncorrelated random designs: X i,j i.i.d. ∼ N (0 , /n ) . On the right, weconsider i.i.d. Gaussian rows with covariance structure determined by an auto-regressive model. Explicitly, thecolumns ( X j ) ≤ j ≤ N of X are generated according to: X = u , X j +1 = 1 p φ (cid:0) φX j + u j (cid:1) (20)where u j i.i.d. ∼ N (0 , I /n ) and φ = 2 . For both types of designs, b λ EST , b λ SURE and b λ k -CV perform similarly, andsubstantially outperform the theoretical choice λ = σ √ N .For uncorrelated designs, the resulting risk is closely tracked by the asymptotic theory, and is surprisinglyclose to the asymptotic prediction for the Bayes risk MMSE N .While our theory does not cover the case of correlated designs, the qualitative behavior is remarkably similar.We also observed that in this case, the risk estimator b R ( λ ) is not consistent but its minimum is roughly located atthe same value of λ as for uncorrelated designs.Next we study adaptivity to sparsity. On Figure 2, we plot the risk as a function of the sparsity of the signal θ ? . We compare the three adaptive procedures (namely, b λ EST , b λ SURE and b λ k -CV ), to the following choice λ MM ( s ) = α σ r − δ M s ( α ) ,M s ( α ) = s (1 + α ) + 2(1 − s ) (cid:0) (1 + α )Φ( − α ) − αφ ( α ) (cid:1) ,α = arg min α ≥ M s ( α ) , s < s max ( δ ) is a nominal value for the sparsity (in Figure 2, we use s = 0 . ). The value λ MM ( s ) isexpected to be asymptotically minimax optimal over F ( s ) [19].Also in this example, adaptive procedures dramatically outperform the fixed choice λ = σ √ N , and alsothe minimax optimal λ at the nominal sparsity level. . . . . . s . . . . . R i s k / s MMSEmin λ R ∗ ( λ ) R ∗ ( λ ), λ = λ MM s Lasso, λ = b λ EST
Lasso, λ = b λ SURE
Lasso, λ = b λ k − CV Lasso, λ = λ MM s Lasso, λ = σ √ N Figure 2:
Risk of the Lasso for different choices of λ . N = 10000 , σ = 0 . , δ = 0 . . Here θ ? ischosen to be sN -sparse, and we vary the sparsity level s . The entries on the support of θ ? are i.i.d. N (0 , . Cross-validation is carried out using folds. SURE is computed using the estimator b σ . Theminimax regularization λ MM ( s ) is used at the nominal level s = 0 . . As mentioned above, our proofs are based on Gaussian comparison inequalities, and in particular on Gordon’smin-max theorem [24, 25]. In this section we review the application of this inequality to the Lasso as developedin [47]. We then discuss the limitations of earlier work, which does not characterize the empirical distribution ofthe Lasso estimator b θ λ (or need extra sparsity assumptions [35]) nor uniform bounds as in Theorem 3.1. A keychallenge is related to the fact that the Lasso cost function (1) is convex but not strongly convex. Hence, a smallchange in λ could cause a priori a large change in the minimizer b θ λ .In order to overcome these problems, we establish a property that we call ‘local stability.’ Namely, if theempirical distribution of ( b θ λ , θ ? ) deviates from our prediction, then the value of the optimization problem increasessignificantly. This implies that the empirical distribution is stable with respect to perturbations of the cost (e.g.changes in λ ). Gordon’s comparison is again crucial to prove this stability property.Finally, we describe how local stability is used to prove the theorems in the previous sections. A full descriptionof the proofs is provided in the appendices. It is more convenient (but equivalent) to study b w λ = b θ λ − θ ? instead of b θ λ . The vector b w λ is the minimizer of thecost function C λ ( w ) = 12 n k Xw − σz k + λn (cid:0) | w + θ ? | − | θ ? | (cid:1) . (21)Following [47], we rewrite the minimization of C λ as a saddle point problem: min w ∈ R N C λ ( w ) = min w ∈ R N max u ∈ R n (cid:26) n u T (cid:0) Xw − σz (cid:1) − n k u k + λn (cid:0) | w + θ ? | − | θ ? | (cid:1)(cid:27) . (22)11e apply the following Theorem from [47] which improves over Gordon’s Theorem [25] by exploiting convexduality. Theorem 5.1 (
Theorem 3 from [47] ) Let S w ⊂ R N and S u ⊂ R n be two compact sets and let Q : S w × S u → R be a continuous function. Let G = ( G i,j ) i.i.d. ∼ N (0 , , g ∼ N (0 , I N ) and h ∼ N (0 , I n ) be independent standard Gaussian vectors. Define C ∗ ( G ) = min w ∈ S w max u ∈ S u u T Gw + Q ( w, u ) ,L ∗ ( g, h ) = min w ∈ S w max u ∈ S u k u k g T w + k w k h T u + Q ( w, u ) . Then we have:• For all t ∈ R , P (cid:16) C ∗ ( G ) ≤ t (cid:17) ≤ P (cid:16) L ∗ ( g, h ) ≤ t (cid:17) . • If S w and S u are convex and if Q is convex concave, then for all t ∈ RP (cid:16) C ∗ ( G ) ≥ t (cid:17) ≤ P (cid:16) L ∗ ( g, h ) ≥ t (cid:17) . For the reader’s convenience, we provide in Appendix G.3 a proof of this theorem.Because of Gordon’s Theorem, it suffices now to study (see Corollary 5.1 below) for ( g, g , h ) ∼ N (0 , I N ) ⊗N (0 , ⊗ N (0 , I n ) . L λ ( w ) = 12 r k w k n + σ k h k√ n − n g T w + g σ √ n ! + λn | w + θ ? | − λn | θ ? | . (23) Corollary 5.1 ( a ) Let D ⊂ R N be a closed set. We have for all t ∈ RP (cid:16) min w ∈ D C λ ( w ) ≤ t (cid:17) ≤ P (cid:16) min w ∈ D L λ ( w ) ≤ t (cid:17) . ( b ) Let D ⊂ R N be a convex closed set. We have for all t ∈ RP (cid:16) min w ∈ D C λ ( w ) ≥ t (cid:17) ≤ P (cid:16) min w ∈ D L λ ( w ) ≥ t (cid:17) . Proof.
We will only prove the first point, since the second follows from the same arguments. Define for ( w, u ) ∈ R N × R n c λ ( w, u ) = 1 n u T Xw − σn u T z − n k u k + λn (cid:0) | w + θ ? | − | θ ? | (cid:1) ,l λ ( w, u ) = − n / k u k g T w + 1 n k u k g σ + r k w k n + σ h T un − n k u k + λn (cid:0) | w + θ ? | − | θ ? | (cid:1) . Notice that for all w ∈ R N , L λ ( w ) = max u ∈ R n l λ ( w, u ) and C λ ( w ) = max u ∈ R n c λ ( w, u ) .Let us suppose that X, z, g, h, g live on the same probability space and are independent. Let (cid:15) ∈ (0 , . Let σ max ( X ) denote the largest singular value of the matrix X . By tightness we can find K > such that the event n σ max ( X ) ≤ K, k z k ≤ K, k g k ≤ K, k h k ≤ K, | g | ≤ K o (24)has probability at least − (cid:15) . Let D ⊂ R N be a (non-empty, otherwise the result is trivial) closed set. Let us fix w ∈ D . On the event (24) C λ ( w ) and L λ ( w ) are both upper bounded by some non-random quantity R . Let now w ∈ D such that C λ ( w ) ≤ R . We have then λn | w + θ ? | ≤ R + λn | θ ? | , which implies that k w k is upper bounded12y some non-random quantity R . This implies that, on the event (24), the minimum of C λ over D is achieved on D ∩ B (0 , R ) . Similarly on (24) the minimum of L λ over D is achieved on D ∩ B (0 , R ) , for some non-randomquantity R . Without loss of generalities, one can assume R = R . On the event (24) we have min w ∈ D C λ ( w ) = min w ∈ D ∩ B (0 ,R ) C λ ( w ) = min w ∈ D ∩ B (0 ,R ) max u ∈ B (0 ,R ) c λ ( w, u ) , for some non-random R > . This gives that for all t ∈ R , we have P (cid:16) min w ∈ D C λ ( w ) ≤ t (cid:17) ≤ P (cid:16) min w ∈ D ∩ B (0 ,R ) max u ∈ B (0 ,R ) c λ ( w, u ) ≤ t (cid:17) + (cid:15) , and similarly P (cid:16) min w ∈ D ∩ B (0 ,R ) max u ∈ B (0 ,R ) l λ ( w, u ) ≤ t (cid:17) ≤ P (cid:16) min w ∈ D L λ ( w ) ≤ t (cid:17) + (cid:15) . Since the sets D ∩ B (0 , R ) and B (0 , R ) are compact, one can apply Theorem 5.1 to c λ and l λ and obtain: P (cid:16) min w ∈ D C λ ( w ) ≤ t (cid:17) ≤ P (cid:16) min w ∈ D L λ ( w ) ≤ t (cid:17) + 2 (cid:15) . The Corollary follows then from the fact one can take (cid:15) arbitrarily small. (cid:3)
In order to prove that (for instance) b w λ verifies with high probability some property, let’s say for instance that theempirical distribution of ( b θ λ = θ ? + b w λ , θ ? ) is close to µ ?λ , we define a set D (cid:15) ⊂ R N that contains all the vectorsthat do not verify this property, e.g. D (cid:15) = (cid:8) w ∈ R N (cid:12)(cid:12) W (cid:0)b µ ( θ ? + w,θ ? ) , µ ∗ λ (cid:1) ≥ (cid:15) (cid:9) , for some (cid:15) ∈ (0 , . The goalnow is to prove that with high probability min w ∈ D C λ ( w ) ≥ min w ∈ R N C λ ( w ) + (cid:15) , for some (cid:15) > . Using Gordon’s min-max Theorem (Corollary 5.1) we will be able to show P (cid:16) min w ∈ D (cid:15) C λ ( w ) ≤ min w ∈ R N C λ ( w ) + (cid:15) (cid:17) ≤ P (cid:16) min w ∈ D (cid:15) L λ ( w ) ≤ min w ∈ R N L λ ( w ) + (cid:15) (cid:17) + o N (1) . (25)Informally, this is a consequence of the following two remarks. First, by applying parts ( a ) and ( b ) of Corollary 5.1to the convex domain R N , we deduce that min w ∈ R N C λ ( w ) ≈ min w ∈ R N L λ ( w ) . Second, by applying part ( a ) tothe closed domain D , we obtain min w ∈ D (cid:15) C λ ( w ) (cid:38) min w ∈ D (cid:15) L λ ( w ) It remains now to study the cost function L λ , which is much simpler. This is done in Appendix B. The keystep will be to establish the following ‘local stability’ result (the next statement is an immediate consequence ofProposition B.1 and Theorem B.1 in the appendices. We prove in fact that the cost function L λ is strongly convexon a neighborhood of its minimizer.). Theorem 5.2
The minimizer w ∗ λ = arg min w L λ ( w ) exists and is almost surely unique. Further, there exists constants γ, c, C > that only depend on Ω such that for all θ ? ∈ D , all λ ∈ [ λ min , λ max ] and all (cid:15) ∈ (0 , P (cid:16) ∃ w ∈ R N , N k w − w ∗ λ k > (cid:15) and L λ ( w ) ≤ min v ∈ R N L λ ( v ) + γ(cid:15) (cid:17) ≤ C(cid:15) e − cn(cid:15) . We do not obtain an equally strong result for the cost function C λ ( w ) , but we prove the following statement,which is sufficient for obtaining uniform control (for the sake of argument, we focus here on the domain F p ( ξ ) and control of the empirical distribution). 13 eorem 5.3 Assume that D = F p ( ξ ) for some ξ, p > . There exists constants C, c, γ > that only depend on Ω such thatfor all (cid:15) ∈ (0 , ]sup λ ∈ [ λ min ,λ max ] sup θ ? ∈D P (cid:16) ∃ θ ∈ R N , W (cid:0)b µ ( θ,θ ? ) , µ ∗ λ (cid:1) ≥ (cid:15) and L λ ( θ ) ≤ min L λ + γ(cid:15) (cid:17) ≤ C(cid:15) − max(1 ,a ) exp (cid:0) − cN (cid:15) (cid:15) a log( (cid:15) ) − (cid:1) , where a = + p . Theorem 5.3 is proved in Appendix C.1.
For the sake of simplicity, we will illustrate the prove strategy by considering the empirical distribution of b w λ = b θ λ − θ ? , as the argument is similar for other quantities. According to Theorem 3.1, this should be well approximatedby µ λ that is the law of b Θ − Θ , when ( b Θ , Θ) ∼ µ ∗ λ , cf. Definition 3.3.As anticipated, Eq. (25) and Theorem 5.2, allow to control W ( b µ b w λ , µ λ ) for a fixed λ ( b µ b w λ denotes the empiricaldistribution of the entries of b w λ ). Namely, we can define D ε to be the set of vectors w such that W ( b µ w , µ λ ) ≥ ε > . We then prove that the minimizer w ∗ λ of L λ has empirical distribution close to µ λ , and therefore byTheorem 5.2, L λ ( w ) > L λ ( w ∗ λ ) + γ(cid:15) for all w ∈ D (cid:15) , with high probability. This imply that the right-hand sideof (25) is very small and we deduce that, with high probability, all minimizers or near minimizers of C λ ( w ) haveempirical distribution close to µ λ ,We now would like to prove Theorem 3.1 and show that with high probability b µ b w λ ≈ µ λ , uniformly in λ ∈ [ λ min , λ max ] . To do so, we apply the above argument for λ = λ , . . . , λ k , where λ , . . . , λ k is an (cid:15) -netof [ λ min , λ max ] . This implies that, with high probability for λ ∈ { λ , . . . , λ k } , W ( b µ b w λi , µ λ i ) ≤ ε . Next, for λ ∈ [ λ i , λ i +1 ] , we show that C λ i ( b w λ ) = min w ∈ R N C λ i ( w ) + O ( | λ i +1 − λ i | ) . Consequently if | λ i +1 − λ i | = O ( (cid:15) ) (using again Eq. (25) and Theorem 5.2), we obtain that W ( b µ b w λ , µ λ i ) = O ( (cid:15) ) and therefore W ( b µ b w λ , µ λ ) = O ( (cid:15) ) . We conclude that W ( b µ b w λ , µ λ ) = O ( (cid:15) ) for all λ ∈ [ λ min , λ max ] , with highprobability, which is the desired claim.If the strategy exposed above allows to obtain the risk of the Lasso and the empirical distribution of its coor-dinates, it is not enough to get its sparsity k b θ λ k or to obtain the empirical distribution of the debiased lasso b θ dλ = b θ λ + X T ( y − X b θ λ )1 − n k b θ λ k . Therefore, we will need to analyze the vector b v λ = 1 λ X T ( y − X b θ λ ) , which is a subgradient of the ‘ -norm at b θ λ . We are able to study b v λ using Gordon’s min-max Theorem because b v λ is the unique maximizer of v min w ∈ R N n n (cid:13)(cid:13) Xw − σz (cid:13)(cid:13) + λn v T ( w + θ ? ) o . The detailed analysis is done in Section E.
Acknowledgements
This work was partially supported by grants NSF DMS-1613091, NSF CCF-1714305 and NSF IIS-1741162 and ONRN00014-18-1-2729. 14 ppendix A: Study of the scalar optimization problem
In this section we study the scalar optimization problem (8): max β ≥ min τ ≥ σ (cid:18) σ τ + τ (cid:19) β − β + 1 δ E min w ∈ R (cid:26) w τ β − βZw + λ | w + Θ | − λ | Θ | (cid:27) , (26)where (Θ , Z ) ∼ P ⊗ N (0 , , for some probability distribution P with finite first moment: E P | Θ | < ∞ . Ofcourse, we will be mainly interested by the case where P = b µ θ ? , the empirical distribution of the entries of θ ? .Define ψ λ ( β, τ ) = (cid:18) σ τ + τ (cid:19) β − β + 1 δ E min w ∈ R (cid:26) w τ β − βZw + λ | w + Θ | − λ | Θ | (cid:27) . A.1 Basic properties of the scalar optimization problem
Lemma A.1 (
From [18] ) For all δ ∈ (0 , , the equation (1 + α )Φ( − α ) − αφ ( α ) = δ admits a unique positive solution α min = α min ( δ ) > . Proof.
Let ϕ : α (1 + α )Φ( − α ) − αφ ( α ) . ϕ is continuous on R ≥ , we have ϕ (0) = and ϕ (+ ∞ ) = 0 .It remains to show that ϕ is strictly decreasing on R ≥ . Compute ϕ ( α ) = 2 α Φ( − α ) − φ ( α ) and ϕ ( α ) =2Φ( − α ) > . Since ϕ (+ ∞ ) = 0 , we have that for all α ≥ , ϕ ( α ) < . ϕ is thus strictly decreasing on R ≥ . (cid:3) Let us define β max = β max ( δ, λ ) = λα min ( δ ) . More generally, we will always write α = λ/β . We prove in this section the following theorem and some auxiliaryresults. Theorem A.1
The max-min (8) is achieved at a unique couple ( β ∗ , τ ∗ ) and < β ∗ < β max . Moreover, ( τ ∗ , β ∗ ) is also the uniquecouple in (0 , + ∞ ) that verify τ = σ + δ E h ( η (Θ + τ Z, τ λβ ) − Θ) i β = τ (cid:16) − δ E h η (Θ + τ Z, τλβ ) i(cid:17) . (27) Lemma A.2 − λδ E | Θ | ≤ max β ≥ min τ ≥ σ ψ λ ( β, τ ) ≤ σ . Proof.
We have max β min τ ψ λ ( β, τ ) ≥ min τ ψ λ (0 , τ ) = − λδ E | Θ | . Then, by taking w = 0 one get max β ≥ min τ ≥ σ ψ λ ( β, τ ) ≤ max β ≥ min τ ≥ σ (cid:18) σ τ + τ (cid:19) β − β = σ . (cid:3) Define for α ≥ and y ∈ R , ‘ α ( y ) = min x ∈ R (cid:26)
12 ( y − x ) + α | x | (cid:27) and for Z ∼ N (0 , , x ∈ R , ∆ α ( x ) = E (cid:2) ‘ α ( x + Z ) − α | x | (cid:3) . emma A.3 E (cid:20) min w ∈ R (cid:26) w τ β − βZw + λ | w + Θ | − λ | Θ | (cid:27)(cid:21) = τ β E (cid:20) ∆ α (cid:16) Θ τ (cid:17)(cid:21) − βτ . (28) where α = λ/β . Proof.
Let β > and compute E (cid:20) min w ∈ R (cid:26) w τ β − βZw + λ | w + Θ | − λ | Θ | (cid:27)(cid:21) = − βτ βτ E min w ∈ R (cid:26)
12 ( w − τ Z ) + τ λβ | w + Θ | − τ λβ | Θ | (cid:27) = − βτ βτ E min w ∈ R (cid:26)
12 ( w − Z ) + α (cid:12)(cid:12)(cid:12) w + Θ τ (cid:12)(cid:12)(cid:12) − α (cid:12)(cid:12)(cid:12) Θ τ (cid:12)(cid:12)(cid:12)(cid:27) , where α = λ/β . Thus E (cid:20) min w ∈ R (cid:26) w τ β − βZw + λ | w + Θ | − λ | Θ | (cid:27)(cid:21) = − βτ βτ E (cid:20) ∆ α (cid:18) Θ τ (cid:19)(cid:21) . (cid:3) Lemma A.4 • If β > β max , ψ λ ( β, τ ) −−−−−→ τ → + ∞ −∞ . • If β = β max , ψ λ ( β, τ ) −−−−−→ τ → + ∞ − β − λδ E (cid:12)(cid:12) Θ (cid:12)(cid:12) . Proof.
By (28) and the fact that ∆ α (0) = + αφ ( α ) − ( α + 1)Φ( − α ) by Lemma F.13, we get that for all β, τ > ψ λ ( β, τ ) = σ β τ − β τ βδ (cid:18) δ αφ ( α ) − ( α + 1)Φ( − α ) (cid:19) + βδ ξ α ( τ ) , where α = λ/β and ξ α ( τ ) = τ E (cid:20)(cid:18) ∆ α (cid:18) Θ τ (cid:19) − ∆ α (0) (cid:19)(cid:21) . Using the definition of β max : if β > β max then α < α min and therefore δ + αφ ( α ) − ( α + 1)Φ( − α ) < . If β = β max , δ + αφ ( α ) − ( α + 1)Φ( − α ) = 0 . It remains to compute the limit of ξ α ( τ ) as τ → ∞ .Using the expression (see Lemma F.13) of the left-and right-derivatives of ∆ α at , we have almost-surely: τ (cid:18) ∆ α (cid:18) Θ τ (cid:19) − ∆ α (0) (cid:19) −−−−→ τ →∞ − α | Θ | . Suppose that E | Θ | < ∞ . By Lemma F.13, ∆ α is α -Lipschitz. Consequently, for all τ > : (cid:12)(cid:12)(cid:12)(cid:12) τ (cid:18) ∆ α (cid:18) Θ τ (cid:19) − ∆ α (0) (cid:19)(cid:12)(cid:12)(cid:12)(cid:12) ≤ α | Θ | . Since we have assumed that E | Θ | < ∞ we can apply the dominated convergence theorem to obtain that ξ α ( τ ) −−−−−→ τ → + ∞ − α E | Θ | . (cid:3) Define w ∗ ( α, τ ) = η (cid:0) Θ + τ Z, ατ (cid:1) − Θ .w ∗ ( α, τ ) is the minimizer of w w τ β − βZw + λ | w + Θ | (recall that we always write α = λ/β ).16 emma A.5 If β ≥ β max the equation τ = σ + 1 δ E h w ∗ ( α, τ ) i = σ + 1 δ E h ( η (Θ + τ Z, τ λβ ) − Θ) i . (29) does not admits any solution on (0 , + ∞ ) . For all β ∈ (0 , β max ) , the function ψ λ ( β, · ) admits a unique minimizer τ ∗ ( β ) on (0 , + ∞ ) that is also the unique solution of (29) . Moreover, α τ ∗ ( α ) is C ∞ on ( α min , + ∞ ) and forall α > α min (cid:12)(cid:12)(cid:12)(cid:12) ∂τ ∗ ∂α ( α ) (cid:12)(cid:12)(cid:12)(cid:12) ≤ ( α + 1) τ ∗ ( α ) δσ . Proof.
Most of this lemma was already proved in [18], we however provide a full proof for completeness. We haveto study the fixed point equation τ = σ + 1 δ E (cid:2) w ∗ ( α, τ ) (cid:3) = F α ( τ ) , where α = λ/β . We can compute F α explicitly: F α ( τ ) = σ + τ δ (cid:16) α + E (cid:2) ( x − α − (cid:0) Φ( α − x ) − Φ( − α − x ) (cid:1) − ( x + α ) φ ( α − x ) + ( x − α ) φ ( x + α ) (cid:3) (cid:17) , where we used the notation x = Θ τ . We can then compute the derivatives: F α ( τ ) = δ − (1 + α ) E [Φ( x − α ) + Φ( − x − α )] − δ − E [( x + α ) φ ( x − α ) − ( x − α ) φ ( − x − α )] ,F α ( τ ) = − δτ E (cid:2) x ( φ ( x − α ) − φ ( x + α )) (cid:3) ≤ .F α is therefore concave. By dominated convergence F α ( τ ) −−−−−→ τ → + ∞ δ (cid:0) (1 + α )Φ( − α ) − αφ ( α ) (cid:1) ( < if β < β max ≥ if β ≥ β max . Since F α (0) = σ > and by concavity of F α , the fixed point equation admits a unique solution τ ∗ ( α ) if and onlyis β ∈ (0 , β max ) . In that case we have also F α ( τ ∗ ( α )) < .Let us now assume that β ∈ (0 , β max ) . We have almost-surely ∂∂τ min w ∈ R (cid:26) w τ β + βZw + λ | w + Θ | (cid:27) = − β τ w ∗ ( α, τ ) . Since | w ∗ ( α, τ ) | ≤ ατ + τ | Z | , we have by derivation under the expectation ∂∂τ ψ λ ( β, τ ) = β − βσ τ − β δτ E (cid:2) w ∗ ( α, τ ) (cid:3) = β τ (cid:18) τ − (cid:18) σ + 1 δ E (cid:2) w ∗ ( α, τ ) (cid:3)(cid:19)(cid:19) . Consequently, τ ∗ ( β ) is the unique minimizer of ψ λ ( β, · ) over (0 , + ∞ ) .Let us now compute ∂τ ∗ ∂α . Since F α is a C ∞ function of τ , one can apply the implicit function theorem toobtain that the mapping α τ ∗ ( α ) is C ∞ and moreover: ∂τ ∗ ∂α ( α ) = ∂F α ∂α ( τ ∗ ( α ))1 − F α ( τ ∗ ( α )) . (30)Compute ∂F α ∂α ( τ ) = 2 τ δ E [ α (Φ( − α + x ) + Φ( − α − x )) − ( φ ( α − x ) + φ ( α + x ))] . One verify easily that − ≤ − φ (0) ≤ α Φ( − α ) − φ ( α ) ≤ δ τ ∂F α ∂α ( τ ) ≤ α . (31)17y concavity on has that F α ( τ ∗ ( α )) is smaller than the slope of the line between the points of coordinates (0 , σ ) and ( τ ∗ ( α ) , τ ∗ ( α )) : F α ( τ ∗ ( α )) ≤ − σ τ ∗ ( α ) . (32)From equations (30-31-32) we get (cid:12)(cid:12)(cid:12)(cid:12) ∂τ ∗ ∂α ( α ) (cid:12)(cid:12)(cid:12)(cid:12) ≤ τ ∗ ( α ) σ δ ( α + 1) . The result follows then from the fact that ∂τ ∗ ∂α ( α ) = 2 τ ∗ ( α ) ∂τ ∗ ∂α ( α ) . (cid:3) Define now Ψ λ : β min τ ≥ σ ψ λ ( β, τ ) . Lemma A.6
The function Ψ λ is differentiable on (0 , β max ) with derivative Ψ λ ( β ) = τ ∗ ( α ) − β − δ E [ Zw ∗ ( α, τ ∗ ( α ))] (33) = τ ∗ ( α ) (cid:18) − δ E (cid:20) Φ (cid:16) Θ τ ∗ ( α ) − α (cid:17) + Φ (cid:16) − Θ τ ∗ ( α ) − α (cid:17)(cid:21)(cid:19) − β . (34) Proof. Ψ λ is differentiable on (0 , β max ) (because of Lemma A.5) with derivative Ψ λ ( β ) = 12 (cid:18) σ τ ∗ ( α ) + τ ∗ ( α ) (cid:19) − β + 1 δ (cid:18) τ ∗ ( α ) E [ w ∗ ( α, τ ∗ ( α )) ] − E [ Zw ∗ ( α, τ ∗ ( α ))] (cid:19) (35) = τ ∗ ( α ) − β − δ E [ Zw ∗ ( α, τ ∗ ( α ))] , (36)because of (29). The second equality follows by Gaussian integration by parts. (cid:3) Corollary A.1
The function Ψ λ achieves its maximum over R ≥ at a unique β ∗ ∈ (0 , β max ) . Proof. Ψ λ is the minimum of a collection of -strongly concave functions: it is therefore -strongly concave andadmits thus a unique maximizer β ∗ over R ≥ . By Lemma A.4 we know that β ∗ < β max . Indeed, notice that max β Φ λ ( β ) ≥ Ψ λ (0) = − λδ E | Θ | . Lemme A.4 gives that β ∗ ∈ [0 , β max ) , because Ψ λ ( β max ) ≤ Ψ λ (0) − β < Ψ λ (0) . By dominated convergence: E (cid:20) Φ (cid:16) Θ τ ∗ ( α ) − α (cid:17) + Φ (cid:16) − Θ τ ∗ ( α ) − α (cid:17)(cid:21) −−−−→ β → + . Indeed, when β → + , α = λ/β → + ∞ and | Θ τ ∗ ( α ) | ≤ | Θ | σ . Therefore by Lemma A.6 we obtain lim inf β → + Ψ λ ( β ) ≥ σ > . By concavity, we deduce that β ∗ ∈ (0 , β max ) . (cid:3) Proposition A.1
The function λ β ∗ ( λ ) is C ∞ and is α − -Lipschitz over (0 , + ∞ ) . λ α ∗ ( λ ) is C ∞ over (0 , + ∞ ) andstrictly increasing. Proof.
Let us define γ ∗ ( λ ) = β ∗ ( λ ) /λ . γ ∗ ( λ ) is the unique maximizer of γ min τ ≥ σ (cid:18) σ τ + τ (cid:19) γ − λ γ + 1 δ E min w ∈ R (cid:26) w τ γ − γZw + | w + Θ | − | Θ | (cid:27) = h ( γ ) − λ γ , h is a concave C ∞ function on R > . γ ∗ ( λ ) is thus the unique solution of h ( γ ) − λγ = 0 , on R > . γ h ( γ ) − λγ is C ∞ with derivative γ h ( γ ) − λ < . Consequently, the implicit function theoremgives that the mapping λ ∈ R > γ ∗ ( λ ) is C ∞ and that ∂γ ∗ ∂λ ( λ ) = − γ ∗ ( λ ) λ − h ( γ ∗ ( λ )) < . One deduces that λ α ∗ ( λ ) = γ ∗ ( λ ) − is C ∞ and strictly increasing and that λ β ∗ ( λ ) = λγ ∗ ( λ ) is C ∞ .Moreover (cid:12)(cid:12)(cid:12)(cid:12) ∂β ∗ ∂λ ( λ ) (cid:12)(cid:12)(cid:12)(cid:12) = (cid:12)(cid:12)(cid:12)(cid:12) λ ∂γ ∗ ∂λ ( λ ) + γ ∗ ( λ ) (cid:12)(cid:12)(cid:12)(cid:12) ≤ γ ∗ ( λ ) ≤ α min . (cid:3) Proof of Theorem A.1.
By Corollary A.1, the maximum in β in (8) is achieved at a unique β ∗ ∈ (0 , β max ) . To this β ∗ corresponds a unique τ ∗ ( β ∗ ) that achieves the minimum in (8), by Lemma A.5. By (29) and (34) we obtain that ( τ ∗ ( β ∗ ) , β ∗ ) is solution of the system (27). Let now ( τ, β ) ∈ (0 , + ∞ ) be another solution of (27). τ is thereforesolution of (29) which gives that β ∈ (0 , β max ) and τ = τ ∗ ( β ) by Lemma A.5. The second equality in (27) givesthat Ψ λ ( β ) = 0 and thus that β = β ∗ by strong concavity of Ψ λ . We conclude ( τ, β ) = ( τ ∗ ( β ∗ ) , β ∗ ) . (cid:3) A.2 Control on β ∗ , τ ∗ The goal of this section is to show that β ∗ and τ ∗ remain bounded when θ ? varies in D . Theorem A.2
There exists constants β min , τ max > that only depend on Ω such that for all θ ? ∈ D and all λ ∈ [ λ min , λ max ] , β min ≤ β ∗ ( λ ) < β max and σ ≤ τ ∗ ( λ ) ≤ τ max . To prove Theorem A.2, we separate the case where D = F p ( ξ ) (where it follows from Lemma A.9 and Corol-lary A.2 below) from the case where D = F ( s ) (where it follows from Lemmas A.11 and A.12). A.2.1 Technical lemmasLemma A.7
We have max β ≥ min τ ≥ σ ψ λ ( β, τ ) = ψ λ ( β ∗ , τ ∗ ( β ∗ )) = 12 β ∗ + λδ E (cid:2) | w ∗ ( α ∗ , τ ∗ ( β )) + Θ | − | Θ | (cid:3) = 12 β ∗ + τ ∗ ( β ∗ ) λδ E (cid:20) H α ∗ (cid:18) Θ τ ∗ ( β ∗ ) (cid:19)(cid:21) , where H α ( x ) = ( x − α )Φ( − α + x ) + ( − x − α )Φ( − x − α ) + φ ( − x + α ) + φ ( x + α ) − | x | . Proof.
Using the optimality condition (29) of τ ∗ ( β ) , we have for all β ∈ (0 , β max ) ψ λ ( β, τ ∗ ( β )) = − β + βτ ∗ ( β ) − βδ E [ Zw ∗ ( α, τ ∗ ( β ))] + λδ E [ | w ∗ ( α, τ ∗ ( β )) + Θ | − | Θ | ] . At β ∗ the optimality condition (see (33)) reads β ∗ = τ ∗ ( β ∗ ) − δ E (cid:2) Zw ∗ ( α ∗ , τ ∗ ( β ∗ )) (cid:3) , thus ψ λ ( β ∗ , τ ∗ ( β ∗ )) = 12 β ∗ + λδ E (cid:2) | w ∗ ( α ∗ , τ ∗ ( β )) + Θ | − | Θ | (cid:3) . (37)Compute for α, τ > E (cid:12)(cid:12) w ∗ ( α, τ ) + Θ (cid:12)(cid:12) = E (cid:12)(cid:12) η (Θ + τ Z, ατ ) (cid:12)(cid:12) = τ E (cid:12)(cid:12)(cid:12) η (cid:16) Θ τ + Z, α (cid:17)(cid:12)(cid:12)(cid:12) . (38)19ow, for x ∈ R , E | η ( x + Z, α ) | = Z + ∞ α − x ( x + z − α ) φ ( z ) dz + Z − α − x −∞ ( − x − z − α ) φ ( z ) dz = ( x − α )Φ( x − α ) + ( − x − α )Φ( − x − α ) + φ ( α − x ) + φ ( α + x )= H α ( x ) + | x | . and we obtain the Lemma by putting this together with (38) and (37). (cid:3) The next Lemma summarizes the main properties of H α . Lemma A.8 H α is a continuous, even function and for x > H α ( x ) = Φ( x − α ) − Φ( − x − α ) − ∈ ( − , .H α is therefore -Lipschitz. H α admits a maximum at and H α (0) = 2 φ ( α ) − α Φ( − α ) > . Moreover H α ( x ) −−−−−→ x → + ∞ − α . A.2.2 On ‘ p -ballsLemma A.9 Assume that E (cid:2) | Θ | p (cid:3) ≤ ξ p for some ξ, p > . Then, there exists a constant β min = β min ( δ, λ min , ξ, p, σ ) suchthat for all λ ≥ λ min , < β min ≤ β ∗ ( λ ) < β max . Proof.
Let β ∈ (0 , β max ) . By Lemma A.6 we have Ψ λ ( β ) = τ ∗ ( β ) (cid:18) − δ E (cid:20) Φ (cid:16) Θ τ ∗ ( β ) − α (cid:17) + Φ (cid:16) − Θ τ ∗ ( β ) − α (cid:17)(cid:21)(cid:19) − β . The function g α : x Φ( x − α ) + Φ( − x − α ) is even, and increasing over R ≥ . Let K > such that ξ p K p σ p ≤ δ .By Markov’s inequality we have P (cid:18)(cid:12)(cid:12)(cid:12)(cid:12) Θ τ ∗ ( β ) (cid:12)(cid:12)(cid:12)(cid:12) ≥ K (cid:19) ≤ P (cid:18)(cid:12)(cid:12)(cid:12)(cid:12) Θ σ (cid:12)(cid:12)(cid:12)(cid:12) p ≥ K p (cid:19) ≤ K p σ p E | Θ | p ≤ δ . Thus E (cid:20) g α (cid:18) Θ τ ∗ ( β ) (cid:19)(cid:21) ≤ g α ( K ) + δ . As β → , α ≥ λ min /β → + ∞ . Since g α ( K ) −−−−−→ α → + ∞ there exists β = β ( K, λ min , δ ) > such that for all β ∈ (0 , β ) , g α ( K ) ≤ δ . Thus for all β ∈ (0 , β ) , Ψ λ ( β ) ≥ τ ∗ ( β ) − δ + δ δ ! − β ≥ σ − β . Let β min = min( σ , β ) . We conclude that for all β ∈ (0 , β min ) , Ψ λ ( β ) > . By concavity we have then that β ∗ ≥ β min . The other inequality β ∗ < β max was already proved in Corollary A.1. (cid:3) Corollary A.2
Assume that E (cid:2) | Θ | p (cid:3) ≤ ξ p for some ξ, p > . Then there exists a constant τ max = τ max ( ξ, p, δ, s, λ min , λ max ) such that for all λ ∈ [ λ min , λ max ] , σ ≤ τ ∗ ( β ∗ ( λ )) ≤ τ max . roof. Let t ≥ ξ . By Markov’s inequality we have P ( | Θ | ≥ t ) ≤ (cid:16) ξt (cid:17) p ≤ , since θ ? ∈ F p ( ξ ) . E (cid:20) H α ∗ (cid:18) Θ τ ∗ ( α ∗ ) (cid:19)(cid:21) = E (cid:20) ( | Θ | < t ) H α ∗ (cid:18) Θ τ ∗ ( α ∗ ) (cid:19)(cid:21) + E (cid:20) ( | Θ | ≥ t ) H α ∗ (cid:18) Θ τ ∗ ( α ∗ ) (cid:19)(cid:21) ≥ (cid:18) − (cid:18) ξt (cid:19) p (cid:19) (cid:18) H α ∗ (0) − tτ ∗ ( α ∗ ) (cid:19) − α ∗ (cid:18) ξt (cid:19) p , because by Lemma A.8, H α ∗ is -Lipschitz and for all x ∈ R , − α ∗ ≤ H α ∗ ( x ) ≤ H α ∗ (0) . Replacing H α ∗ (0) by itsexpression given by Lemma A.8 we get E (cid:20) H α ∗ (cid:18) Θ τ ∗ ( α ∗ ) (cid:19)(cid:21) ≥ (cid:0) φ ( α ∗ ) − α ∗ Φ( − α ∗ ) (cid:1) − (cid:18) ξt (cid:19) p (cid:0) α ∗ + 2( φ ( α ∗ ) − α ∗ Φ( − α ∗ )) (cid:1) − tτ ∗ ( α ∗ ) . Since α ∗ ≤ λ max /β min and φ ( α ∗ ) − α ∗ Φ( − α ∗ ) > (because α ∗ > α min ), we can find a constant t = t ( δ, σ, λ min , λ max , p, ξ ) ≥ ξ such that ( α ∗ + 2( φ ( α ∗ ) − α ∗ Φ( − α ∗ ))) (cid:18) ξt (cid:19) p ≤ φ ( α ∗ ) − α ∗ Φ( − α ∗ ) . For this choice of t we have then τ ∗ ( α ∗ ) λδ E (cid:20) H α ∗ (cid:18) Θ τ ∗ (cid:19)(cid:21) ≥ λδ τ ∗ ( α ∗ )( φ ( α ∗ ) − α ∗ Φ( − α ∗ )) − λtδ . Consequently by Lemma A.2 and Lemma A.7 we have β ∗ λδ τ ∗ ( φ ( α ∗ ) − α ∗ Φ( − α ∗ )) − λtδ ≤ ψ λ ( β ∗ , τ ∗ ( α ∗ )) ≤ σ , which finally gives τ ∗ ( α ∗ ) ≤ δσ λ − + tφ ( α max ) − α max Φ( − α max ) . (cid:3) A.2.3 On sparse balls
Define the critical function: M s : α s (1 + α ) + 2(1 − s ) (cid:0) (1 + α )Φ( − α ) − αφ ( α ) (cid:1) .M s corresponds to the worst mean squared error achievable by soft-thresholding with threshold α to estimate avector θ ? ∈ F ( s ) from the observations y = θ ? + w , where w ∼ N (0 , I N ) , see [17, 16, 27]. Lemma A.10 (
From [18] ) Assume that s < s max ( δ ) = δ max α ≥ ( − δ (cid:0) (1 + α )Φ( − α ) − αφ ( α ) (cid:1) α − (cid:0) (1 + α )Φ( − α ) − αφ ( α ) (cid:1) ) . Then there exists α ≥ such that M s ( α ) < δ . Proof.
Let s < s max ( δ ) . From the definition of s max ( δ ) , we can find α ∈ R such that δ − δ (cid:0) (1 + α )Φ( − α ) − αφ ( α ) (cid:1) α − (cid:0) (1 + α )Φ( − α ) − αφ ( α ) (cid:1) > s , which gives M s ( α ) < δ . (cid:3) We assume in this section that s < s max ( δ ) . Let us compute the derivatives M s ( α ) = 2 ( αs + 2(1 − s )( α Φ( − α ) − φ ( α ))) ,M s ( α ) = 2 ( s + (1 − s )2Φ( − α )) > . M s ( α ) = (cid:0) αM s ( α ) + M s ( α ) (cid:1) . Let α be the unique α > such that M s ( α ) = 0 and let α < α be such that M s ( α ) = M s ( α ) = δ . We can then easily plot the variations of M s : αM s M s M s α α α + ∞− M s ( α ) M s ( α ) + ∞ + ∞ δ δ ssM s ( α ) Lemma A.11
Let s < s max ( δ ) and assume that P (Θ = 0) ≤ s . Then, there exists a constant β min = β min ( δ, λ min , s, σ ) suchthat for all λ ≥ λ min < β min ≤ β ∗ ( λ ) < β max ( λ max , δ ) := λ max /α min ( δ ) . Proof.
We already proved in Corollary A.1 that β ∗ ( λ ) < λ/α min . For all < β < λ/α min , we have by Lemma A.6 Ψ λ ( β ) = τ ∗ ( β ) (cid:18) − δ E (cid:20) Φ (cid:16) Θ τ ∗ ( β ) − α (cid:17) + Φ (cid:16) − Θ τ ∗ ( β ) − α (cid:17)(cid:21)(cid:19) − β . The function g α : x Φ( x − α ) + Φ( − x − α ) is even, and increasing over R ≥ . Therefore Ψ λ ( β ) ≥ τ ∗ ( β ) δ (cid:0) δ − s − (1 − s )2Φ( − α ) (cid:1) − β . Let β = β ( λ min , δ, s ) > such that for all β ∈ (0 , β ) , − α ) ≤ ( δ − s ) . For all β ∈ (0 , β ) we have then Ψ λ ( β ) ≥ σ ( δ − s )2 δ − β . Let β min = min( σ ( δ − s )2 δ , β ) : for all β ∈ (0 , β min ) , Ψ λ ( β ) > . By concavity we conclude that β ∗ ≥ β min . (cid:3) Lemma A.12
Let s < s max ( δ ) and assume that P (Θ = 0) ≤ s . Then for all β, τ, λ > we have ψ λ ( β, τ ) ≥ βσ τ − β τ β δ (cid:0) δ − M s ( α ) (cid:1) . Proof.
By (28) we have for all β, τ > ψ λ ( β, τ ) = β (cid:18) σ τ + τ (cid:19) − β τ βδ E (cid:20) ∆ α (cid:16) Θ τ (cid:17) − (cid:21) . Since by Lemma F.13, ∆ α is even and non-increasing over R ≥ , we have E (cid:20) ∆ α (cid:16) Θ τ (cid:17) − (cid:21) ≥ s ∆ α (+ ∞ ) + (1 − s )∆ α (0) − − s α − s ) (cid:16)
12 + αφ ( α ) − (1 + α )Φ( − α ) (cid:17) −
12 = − M s ( α ) . (cid:3) emma A.13 Let s < s max ( δ ) and assume that P (Θ = 0) ≤ s . Then the following inequalities hold β ∗ τ ∗ ( β ∗ )( δ − M s ( α ∗ )) ≤ δ (cid:0) σ + β ∗ (cid:1) , (39) − τ ∗ ( β ∗ ) λM s ( α ∗ ) ≤ δσ , (40) τ ∗ ( β ∗ )( δ − M s ( α ∗ )) ≤ β ∗ . (41) Proof.
The inequality (39) simply follows from the previous lemma and from the fact that ψ λ ( β, τ ∗ ( β )) ≤ max β ≥ min τ ≥ σ ψ λ ( β, τ ) ≤ σ , by Lemma A.2. Let us prove (40). By Lemma A.7, we have ψ λ ( β ∗ , τ ∗ ( β ∗ )) ≥ λτ ∗ ( β ∗ ) δ E (cid:20) H α ∗ (cid:16) Θ τ ∗ ( β ∗ ) (cid:17)(cid:21) . Since by Lemma A.8, H α ∗ is even, decreasing on R ≥ , we have E (cid:20) H α ∗ (cid:16) Θ τ ∗ ( β ∗ ) (cid:17)(cid:21) ≥ sH α ∗ (+ ∞ ) + (1 − s ) H α ∗ (0)= − sα ∗ + 2(1 − s ) (cid:0) φ ( α ∗ ) − α ∗ Φ( − α ∗ ) (cid:1) = − M s ( α ∗ ) , which proves (40). To prove (41) we use the optimality condition at β ∗ : λ ( β ∗ ) = τ ∗ ( α ∗ ) (cid:18) − δ E (cid:20) Φ (cid:16) Θ τ ∗ ( α ∗ ) − α ∗ (cid:17) + Φ (cid:16) − Θ τ ∗ ( α ∗ ) − α ∗ (cid:17)(cid:21)(cid:19) − β ∗ . (42)The function x Φ( x − α ∗ ) + Φ( − x − α ∗ ) is even, increasing on R ≥ . Therefore E (cid:20) Φ (cid:16) Θ τ ∗ ( α ∗ ) − α ∗ (cid:17) + Φ (cid:16) − Θ τ ∗ ( α ∗ ) − α ∗ (cid:17)(cid:21) ≤ s + 2(1 − s )Φ( − α ∗ ) = 12 M s ( α ∗ ) . Combining this inequality with (42) leads to (41). (cid:3)
Proposition A.2
Let s < s max ( δ ) and assume that P (Θ = 0) ≤ s . Then, there exists a constant τ max = τ max ( δ, λ min , λ max , s, σ ) such that for all λ ∈ [ λ min , λ max ] , σ ≤ τ ∗ ( β ∗ ( λ )) ≤ τ max . Proof.
Let ( β ∗ , τ ∗ ) be the unique optimal couple and recall α ∗ = λ/β ∗ . We distinguish 3 cases: Case 1 : α ∗ ≥ α . In that case M s ( α ∗ ) ≤ M s ( α ) < δ . The inequality (41) gives τ ∗ ( β ∗ )( δ − M s ( α ∗ )) ≤ β ∗ ≤ β max , which gives τ ∗ ( β ∗ ) ≤ β max δ − M s ( α ) . Case 2 : α ∗ ∈ [( α + α ) / , α ] . In that case δ − M s ( α ∗ ) ≥ c > , for some constant c = c ( δ, s ) > . Now,by (39) β ∗ τ ∗ ( β ∗ )( δ − M s ( α ∗ )) ≤ δ ( σ + β ∗ ) . Therefore, τ ∗ ≤ δcβ min ( σ + β ) . Case 3 : α ∗ < ( α + α ) / . In that case M s ( α ∗ ) ≤ − c , for some constant c = c ( δ, s ) > . Consequentlyby (40) we get τ ∗ ( β ∗ ) ≤ σ δλc . (cid:3) .3 Dependency in λ Proposition A.3 • The mapping λ β ∗ ( λ ) is C ∞ and α − -Lipschitz on R > .• The mapping λ τ ∗ ( λ ) is C ∞ and M -Lipschitz on [ λ min , λ max ] , for some constant M (Ω) > . Proof.
The first point has already been by Proposition A.1. λ τ ∗ ( λ ) is the composition of the mappings λ α ∗ ( λ ) and α τ ∗ ( α ) , that are both C ∞ by Lemma A.5 and Proposition A.1. Compute the derivative: ∂τ ∗ ∂λ ( λ ) = ∂α ∗ ∂λ ( λ ) ∂τ ∗ ∂α ( α ∗ ( λ )) . Recall that α ∗ ( λ ) = λ/β ∗ ( λ ) . Thus (cid:12)(cid:12)(cid:12)(cid:12) ∂α ∗ ∂λ ( λ ) (cid:12)(cid:12)(cid:12)(cid:12) = (cid:12)(cid:12)(cid:12)(cid:12) β ∗ ( λ ) − ∂β ∗ ∂λ ( λ ) λβ ∗ ( λ ) (cid:12)(cid:12)(cid:12)(cid:12) ≤ β − + 2 α − λ max β − . By Lemma A.5, we have (cid:12)(cid:12)(cid:12)(cid:12) ∂τ ∗ ∂α ( α ∗ ( λ )) (cid:12)(cid:12)(cid:12)(cid:12) ≤ ( α ∗ ( λ ) + 1) τ ∗ ( α ∗ ) δσ . Since by Theorem A.2, τ ∗ ( α ∗ ) ≤ τ max (Ω) and α ∗ ≤ λ max /β min (Ω) , the derivative of τ ∗ with respect to λ isbounded on [ λ min , λ max ] . (cid:3) Appendix B: Study of Gordon’s optimization problem for c w λ In this section we study L λ defined by (23). Define, for w ∈ R N and β ≥ ‘ λ ( w, β ) = r k w k n + σ k h k√ n − n g T w + g σ √ n ! β − β + λn | w + θ ? | − λn | θ ? | . (43)So that L λ ( w ) = max β ≥ ‘ λ ( w, β ) . Let us define the vector w λ ∈ R N by w λ,i = η (cid:18) θ ?i + τ ∗ ( λ ) g i , τ ∗ ( λ ) λβ ∗ ( λ ) (cid:19) − θ ?i . (44)The goal of this section is to prove that, with high probability, the minimizer of L λ is close to w λ and that L λ isstrongly convex around w λ . Proposition B.1 L λ admits almost surely a unique minimizer w ∗ λ on R N . Proof. L λ is a convex function that goes to + ∞ at infinity, so it admits minimizers over R N . Case 1 : there exists a minimizer w such that q k w k n + σ k h k√ n − n g T w + g σ √ n > .In that case, there exist a neighborhood O w of w such that for all w ∈ O w a ( w ) := r k w k n + σ k h k√ n − n g T w + g σ √ n > . Thus for all w ∈ O w , L λ ( w ) = a ( w ) + λn | w + θ ? | − λn | θ ? | . Recall that the composition of a strictly convexfunction and a strictly increasing function is strictly convex. L λ is therefore strictly convex on O w because a isstrictly convex and remains strictly positive on O w and because x > x is strictly increasing. w is thus theonly minimizer of L λ . 24 ase 2 : for all minimizer w we have q k w k n + σ k h k√ n − n g T w + g σ √ n ≤ .Let w be a minimizer of L λ . The optimality condition gives − λ − r k w k n + σ k h k√ n − n g T w + g σ √ n ! + w q k w k n + σ k h k√ n − g ∈ ∂ | θ ? + w | . We obtain then ∈ ∂ | θ ? + w | which implies w = − θ ? : L λ has a unique minimizer. (cid:3) B.1 Local stability of Gordon’s optimization
Theorem B.1
There exists constants γ, c, C > that only depend on Ω such that for all θ ? ∈ D , all λ ∈ [ λ min , λ max ] and all (cid:15) ∈ (0 , P (cid:16) ∃ w ∈ R N , N k w − w λ k > (cid:15) and L λ ( w ) ≤ min v ∈ R N L λ ( v ) + γ(cid:15) (cid:17) ≤ C(cid:15) e − cn(cid:15) . We deduce from Theorem B.1 that for all (cid:15) ∈ (0 , with probability at least − C(cid:15) − e − cn(cid:15) , N k w ∗ λ − w λ k ≤ (cid:15) .From this we deduce easily that with the same probability | L λ ( w ∗ λ ) − L λ ( w λ ) | ≤ M (cid:15) , for some constant
M > ,which gives by Proposition F.1: Corollary B.1
Define L ∗ ( λ ) = ψ λ ( β ∗ ( λ ) , τ ∗ ( λ )) . (45) The exists constants c, C > that only depend on Ω such that P (cid:18)(cid:12)(cid:12)(cid:12)(cid:12) min w ∈ R N L λ ( w ) − L ∗ ( λ ) (cid:12)(cid:12)(cid:12)(cid:12) ≥ (cid:15) (cid:19) ≤ C(cid:15) e − nc(cid:15) . B.2 Proof of Theorem B.1
Proposition B.2
For all
R > there exists constants c, C > that only depend on (Ω , R ) , such that for all (cid:15) ∈ (0 , , ∀ θ ? ∈ D , ∀ λ ∈ [ λ min , λ max ] , P (cid:16) L λ ( w λ ) ≤ min k w k≤√ nR L λ ( w ) + (cid:15) (cid:17) ≥ − C(cid:15) e − cn(cid:15) . Proof.
Notices that it suffices to proves the proposition for (cid:15) smaller than some constant. Let θ ? ∈ D , λ ∈ [ λ min , λ max ] . Let R > and (cid:15) ∈ (cid:0) , min(1 , σ / (cid:3) . Define ‘ ◦ λ ( w, β ) = r k w k n + σ − n g T w ! β − β + λn | w + θ ? | − λn | θ ? | . On the event (cid:26)(cid:12)(cid:12)(cid:12)(cid:12) n k h k − (cid:12)(cid:12)(cid:12)(cid:12) ≤ (cid:15) (cid:27) \ (cid:26)(cid:12)(cid:12)(cid:12)(cid:12) g σ √ n (cid:12)(cid:12)(cid:12)(cid:12) ≤ (cid:15) (cid:27) , (46)which has probability at least − Ce − cn(cid:15) , we have, for all w ∈ B (0 , R √ n ) and β ∈ [0 , β max ] : | ‘ λ ( w, β ) − ‘ ◦ λ ( w, β ) | = β r k w k n + σ (cid:12)(cid:12)(cid:12)(cid:12) k h k√ n − (cid:12)(cid:12)(cid:12)(cid:12) + β (cid:12)(cid:12)(cid:12)(cid:12) g σ √ n (cid:12)(cid:12)(cid:12)(cid:12) ≤ β max p σ + R (cid:12)(cid:12)(cid:12)(cid:12) n k h k − (cid:12)(cid:12)(cid:12)(cid:12) + β max (cid:15) ≤ β max ( p σ + R + 1) | {z } K (cid:15) . ( β ∗ , τ ∗ ) = ( β ∗ ( λ ) , τ ∗ ( λ )) . We have on the event (46): min k w k≤ R √ n L λ ( w ) = min k w k≤ R √ n max β ≥ ‘ λ ( w, β ) ≥ min k w k≤ R √ n ‘ λ ( w, β ∗ ) ≥ min k w k≤ R √ n ‘ ◦ λ ( w, β ∗ ) − K(cid:15) .
Using the fact that for w ∈ B (0 , R √ n ) r k w k n + σ = min σ ≤ τ ≤√ σ + R ( k w k n + σ τ + τ ) , we obtain that min k w k≤ R √ n ‘ ◦ λ ( w, β ∗ ) = min σ ≤ τ ≤√ σ + R (cid:26) β ∗ (cid:18) σ τ + τ (cid:19) − β ∗ n min k w k≤ R √ n (cid:26) β ∗ τ k w k − β ∗ g T w + λ | w + θ ? | − λ | θ ? | (cid:27)(cid:27) . For all τ ∈ [ σ, √ σ + R ] the function g min k w k≤ R √ n (cid:26) β ∗ τ k w k − β ∗ g T w + λ | w + θ ? | − λ | θ ? | (cid:27) is β max R √ n -Lipschitz. Therefore F ( τ, g ) = β ∗ (cid:18) σ τ + τ (cid:19) − β ∗ n min k w k≤ R √ n (cid:26) β ∗ τ k w k − β ∗ g T w + λ | w + θ ? | − λ | θ ? | (cid:27) is β R n − -sub-Gaussian. Therefore there exists constants C, c > such that for all τ ∈ [ σ, √ σ + R ] , wehave P ( | F ( τ, g ) − E F ( τ, g ) | > (cid:15) ) ≤ Ce − cn(cid:15) . F ( · , g ) is almost-surely a β max (1 + R σ ) -Lipschitz function on [ σ, √ σ + R ] . Therefore, by an (cid:15) -net argument onecan find constants C, c > that only depend on (Ω , R ) , such that for all (cid:15) > the event ( sup τ ∈ [ σ, √ σ + R ] | F ( τ, g ) − E F ( τ, g ) | ≤ (cid:15) ) (47)has probability at least − C(cid:15) e − cn(cid:15) . On the event (47) we have then min τ ∈ [ σ, √ σ + R ] F ( τ, g ) ≥ min τ ∈ [ σ, √ σ + R ] E [ F ( τ, g )] − (cid:15) . Notice that for all τ > we have n E (cid:20) min k w k≤ R √ n (cid:26) β ∗ τ k w k − β ∗ g T w + λ | w + θ ? | − λ | θ ? | (cid:27)(cid:21) ≥ n N X i =1 E (cid:20) min w i ∈ R (cid:26) β ∗ τ w i − β ∗ g i w i + λ | w i + θ ?i | − λ | θ ?i | (cid:27)(cid:21) = 1 δ E (cid:20) min w ∈ R (cid:26) β ∗ τ w − β ∗ Zw + λ | w + Θ | − λ | Θ | (cid:27)(cid:21) , where the last expectation is with respect (Θ , Z ) ∼ b µ θ ? ⊗ N (0 , . Consequently on the event (46) and (47), wehave (1 + K ) (cid:15) + min k w k≤ R √ n L λ ( w ) ≥ min σ ≤ τ ≤√ σ + R ψ λ ( β ∗ , τ ) ≥ min σ ≤ τ ψ λ ( β ∗ , τ ) = ψ λ ( β ∗ , τ ∗ ) . By Proposition F.1 we have that ψ λ ( β ∗ , τ ∗ ) ≥ L λ ( w λ ) − (cid:15) , with probability at least − Ce − cn(cid:15) . Then, for all (cid:15) ∈ (0 , we have with probability at least − C(cid:15) e − cn(cid:15) min k w k≤ R √ n L λ ( w ) + ( K + 2) (cid:15) ≥ L λ ( w λ ) . (cid:3) emma B.1 Let f be a convex function on R N . Let w ∈ R N and r > . Suppose that f is γ -strongly convex on the ball B ( w, r ) , for some γ > . Assume that f ( w ) ≤ min x ∈ B ( w,r ) f ( x ) + (cid:15) , for some (cid:15) ≤ r γ . Then f admits a unique minimizer x ∗ over R N . We have x ∗ ∈ B ( w, r ) and therefore k x ∗ − w k ≤ γ (cid:15) . Moreover, for every x ∈ R N such that f ( x ) ≤ min f + (cid:15) we have k x − w k ≤ γ (cid:15) . Proof. f is convex on B ( w, r ) , it admits therefore a minimizer x ∗ on B ( w, r ) . By strong convexity we have k x ∗ − w k ≤ γ (cid:15) ≤ r . Consequently, x ∗ is in the interior of B ( w, r ) . By strong convexity, x ∗ is then the unique minimizer of f over R N .By strong convexity, for any x outside of B ( w, r ) we have f ( x ) > f ( x ∗ ) + 12 γ (cid:16) r (cid:17) ≥ f ( x ∗ ) + (cid:15) . Consequently, if f ( x ) ≤ min f + (cid:15) then x ∈ B ( w, r ) and thus k x − x ∗ k ≤ γ (cid:15) . (cid:3) Proof of Theorem B.1.
Let t = min( β min , σ ) . By Lemma F.1 the event (cid:26)(cid:12)(cid:12)(cid:12)(cid:12) k w λ k n − E k w λ k n (cid:12)(cid:12)(cid:12)(cid:12) ≤ t , g T w λ n ≤ E (cid:20) g T w λ n (cid:21) + t, k g k ≤ √ N , (cid:12)(cid:12)(cid:12) σg √ n (cid:12)(cid:12)(cid:12) ≤ β min , (1 − β min τ max ) ≤ k h k√ n ≤ (cid:27) (48)has probability at least − Ce − cn , for some constants C, c > . On the event (48) r k w λ k n + σ ≥ r E k w λ k n + σ − t ≥ τ ∗ − t . Therefore r k w λ k n + σ k h k√ n ≥ k h k√ n τ ∗ − k h k√ n t ≥ τ ∗ − β min − t ≥ τ ∗ − β min . Consequently, on the event (48) we have r k w λ k n + σ k h k√ n − n g T w λ + σg √ n ≥ τ ∗ − n E (cid:2) g T w λ (cid:3) − β min ≥ β min , because τ ∗ − n E [ g T w λ ] = β ∗ ≥ β min . Moreover, on the event (48) the function f : w r k w k n + σ k h k√ n − n g T w + g σ √ n is √ Nn + √ n -Lipschitz. We have seen above that on (48), f ( w λ ) ≥ β min . Thus we can find a constant r > such that on the event (48) we have for all w ∈ B ( w λ , r √ n ) f ( w ) > β min . By Lemma F.14, the function f is an -strongly convex on B ( w λ , r √ n ) , for some constant a > . For all w ∈ B ( w λ , r √ n ) we have L λ ( w ) = 12 f ( w ) + λn ( | w + θ ? | − | θ ? | ) . w ∈ B ( w λ , r √ n ) : ∇ (cid:16) f (cid:17) ( w ) = f ( w ) ∇ f ( w ) + ∇ f ( w ) ∇ f ( w ) T (cid:23) aβ min n I N , which means that L is γn -strongly convex on B ( w λ , r √ n ) , for some constant γ > .Notice that it suffices to prove Theorem B.1 for (cid:15) ∈ (0 , q ] for some constant q > . Let (cid:15) ∈ (0 , γr ) . Let nowapply Proposition B.2 with R = τ max + r : with probability at least − C(cid:15) e − cn(cid:15) n L λ ( w λ ) ≤ min k w k≤ R √ n L λ ( w ) + (cid:15) o . (49)Notice that on the event (48), k w λ k n ≤ E (cid:20) k w λ k n (cid:21) + t ≤ E (cid:20) k w λ k n (cid:21) + σ = τ ∗ ≤ τ . Therefore, on (48), B ( w λ , r √ n ) ⊂ B (0 , R √ n ) . Using then (49) we get L λ ( w λ ) ≤ min w ∈ B ( w λ ,r √ n ) L λ ( w ) + (cid:15) . Consequently, on the events (48) and (49) that have probability at least − C(cid:15) e − cn(cid:15) , Lemma B.1 gives that for all w ∈ R N such that L λ ( w ) ≤ min v ∈ R N L λ ( v ) + (cid:15) we have k w λ − w k ≤ nγ (cid:15) . (cid:3) Appendix C: Empirical distribution and risk of the Lasso
C.1 Proofs of local stability of the Lasso cost
C.1.1 Application of Gordon’s min-max TheoremProposition C.1
There exists constants c, C > that only depend on Ω such that for all closed set D ⊂ R N and for all (cid:15) ∈ (0 , , P (cid:18) min w ∈ D C λ ( w ) ≤ min w ∈ R N C λ ( w ) + (cid:15) (cid:19) ≤ P (cid:18) min w ∈ D L λ ( w ) ≤ min w ∈ R N L λ ( w ) + 3 (cid:15) (cid:19) + C(cid:15) e − cn(cid:15) . In order to prove this, we start by showing that the optimal Lasso cost concentrates around L ∗ ( λ ) . Proposition C.2
There exists constants c, C > that only depend on Ω such that for all closed set D ⊂ R N and for all (cid:15) ∈ (0 , , P (cid:18)(cid:12)(cid:12)(cid:12) min w ∈ R N C λ ( w ) − L ∗ ( λ ) (cid:12)(cid:12)(cid:12) ≥ (cid:15) (cid:19) ≤ C(cid:15) e − cn(cid:15) . Proof.
By Corollary 5.1, we have P (cid:18) min w ∈ R N C λ ( w ) − L ∗ ( λ ) ≥ (cid:15) (cid:19) ≤ P (cid:18) min w ∈ R N L λ ( w ) − L ∗ ( λ ) ≥ (cid:15) (cid:19) ≤ C(cid:15) e − cn(cid:15) , where the last inequality comes from Corollary B.1. The bound of the probability of the converse inequality isproved analogously. (cid:3) Proof of Proposition C.1.
Recall that L ∗ ( λ ) is defined by (45). Let D ⊂ R N be a closed set. P (cid:18) min w ∈ D C λ ( w ) ≤ min w ∈ R N C λ ( w ) + (cid:15) (cid:19) ≤ P (cid:18) min w ∈ D C λ ( w ) ≤ min w ∈ R N C λ ( w ) + (cid:15) and min w ∈ R N C λ ( w ) ≤ L ∗ ( λ ) + (cid:15) (cid:19) + P (cid:18) min w ∈ R N C λ ( w ) > L ∗ ( λ ) + (cid:15) (cid:19) ≤ P (cid:18) min w ∈ D C λ ( w ) ≤ L ∗ ( λ ) + 2 (cid:15) (cid:19) + C(cid:15) e − cn(cid:15) , P (cid:18) min w ∈ D C λ ( w ) ≤ L ∗ ( λ ) + 2 (cid:15) (cid:19) ≤ P (cid:18) min w ∈ D L λ ( w ) ≤ L ∗ ( λ ) + 2 (cid:15) (cid:19) . (50)We thus get P (cid:18) min w ∈ D C λ ( w ) ≤ min w ∈ R N C λ ( w ) + (cid:15) (cid:19) ≤ P (cid:18) min w ∈ D L λ ( w ) ≤ L ∗ ( λ ) + 2 (cid:15) (cid:19) + C(cid:15) e − cn(cid:15) ≤ P (cid:18) min w ∈ D L λ ( w ) ≤ min w ∈ R N L λ ( w ) + 3 (cid:15) (cid:19) + 2 P (cid:18) min w ∈ R N L λ ( w ) < L ∗ ( λ ) − (cid:15) (cid:19) + C(cid:15) e − cn(cid:15) ≤ P (cid:18) min w ∈ D L λ ( w ) ≤ min w ∈ R N L λ ( w ) + 3 (cid:15) (cid:19) + C(cid:15) e − cn(cid:15) , for some constants c, C > , because of Corollary B.1. (cid:3) C.1.2 Local stability of the empirical distribution of the Lasso estimator: proof of Theorem 5.3
For w ∈ R N , let µ ( w ) be the probability distribution over R defined by b µ ( w ) = 1 N N X i =1 δ ( w i + θ ?i ,θ ?i ) Theorem 5.3 follows from Proposition C.1 and the following Lemma.
Lemma C.1
Assume that D = F p ( ξ ) for some ξ, p > . There exists constants γ, c, C > that depend only on Ω , such thatfor all (cid:15) ∈ (0 , ] we have P (cid:18) min w ∈ D (cid:15) L λ ( w ) ≤ min w ∈ R N L λ ( w ) + 3 γ(cid:15) (cid:19) ≤ C(cid:15) − max(1 ,a ) exp (cid:0) − cN (cid:15) (cid:15) a log( (cid:15) ) − (cid:1) , where D (cid:15) = (cid:8) w ∈ R N (cid:12)(cid:12) W ( b µ ( w ) , µ ∗ λ ) ≥ (cid:15) (cid:9) and a = + p . Proof.
By Theorem B.1 and Proposition F.2 there exists constants γ, c, C > such that for all (cid:15) ∈ (0 , ] the event n ∀ w ∈ R N , L λ ( w ) ≤ min v ∈ R N L λ ( v ) + 3 γ(cid:15) = ⇒ N k w − w λ k ≤ (cid:15) o \ n W (cid:0) µ ∗ λ , b µ ( w λ ) (cid:1) ≤ (cid:15) o (51)has probability at least − C(cid:15) − exp (cid:0) − cn(cid:15) (cid:1) − C(cid:15) − a exp (cid:0) − cN (cid:15) (cid:15) a log( (cid:15) ) − (cid:1) ≥ − C(cid:15) − max(1 ,a ) exp (cid:0) − cN (cid:15) (cid:15) a log( (cid:15) ) − (cid:1) , where a = + p . On the event (51), we have for all w ∈ D (cid:15) : N k w − w λ k ≥ W (cid:0)b µ ( w ) , b µ ( w λ ) (cid:1) ≥ (cid:16) W ( b µ ( w ) , µ ∗ λ ) − W ( µ ∗ λ , b µ ( w λ )) (cid:17) ≥ (cid:15) . This gives that on the event (51), for all w ∈ D (cid:15) , L λ ( w ) > min v ∈ R N L λ ( v ) + 3 γ(cid:15) . The intersection of (51) with theevent (cid:8) min w ∈ D (cid:15) L λ ( w ) ≤ min w ∈ R N L λ ( w ) + 3 γ(cid:15) (cid:9) is therefore empty: the lemma is proved. (cid:3) C.1.3 Local stability of the risk of the Lasso estimator
We prove here the analog of Theorem 5.3 for the risk of the Lasso estimator.
Theorem C.1 for the risk of the Lasso estimator. There exists constants
C, c, γ > that only depend on Ω such that for all (cid:15) ∈ (0 , λ ∈ [ λ min ,λ max ] sup θ ? ∈D P (cid:18) ∃ θ ∈ R N , (cid:16) N k θ − θ ? k − R ∗ ( λ ) (cid:17) ≥ (cid:15) and L λ ( θ ) ≤ min L λ + γ(cid:15) (cid:19) ≤ C(cid:15) e − cN(cid:15) . Theorem C.1 follows from Proposition C.1 and the following Lemma.29 emma C.2
There exists constants γ, c, C > that only depend on Ω such that for all (cid:15) ∈ (0 , we have P (cid:18) min w ∈ D (cid:15) L λ ( w ) ≤ min w ∈ R N L λ ( w ) + 3 γ(cid:15) (cid:19) ≤ C(cid:15) e − cn(cid:15) , where D (cid:15) = (cid:26) w ∈ R N (cid:12)(cid:12)(cid:12)(cid:12)(cid:16) k w k − p N R ∗ ( λ ) (cid:17) ≥ N (cid:15) (cid:27) . Proof.
By Theorem B.1 and Lemma F.1 there exists constants γ, c, C > such that for all (cid:15) ∈ (0 , the event n ∀ w ∈ R N , L λ ( w ) ≤ min v ∈ R N L λ ( v ) + 3 γ(cid:15) = ⇒ N k w − w λ k ≤ (cid:15) o \ n(cid:0) k w λ k − p N R ∗ ( λ ) (cid:1) ≤ N (cid:15) o (52)has probability at least − C(cid:15) e − cn(cid:15) . On the event (52), we have for all w ∈ D (cid:15) : N k w − w λ k ≥ N (cid:0) k w k − k w λ k (cid:1) ≥ N (cid:0) √ N (cid:15) − √ N (cid:15) (cid:1) ≥ (cid:15) . This gives that on the event (52), for all w ∈ D (cid:15) , L λ ( w ) > min v ∈ R N L λ ( v ) + 3 γ(cid:15) . The intersection of (52) with theevent (cid:8) min w ∈ D (cid:15) L λ ( w ) ≤ min w ∈ R N L λ ( w ) + 3 γ(cid:15) (cid:9) is therefore empty: the lemma is proved. (cid:3) C.2 Uniform control over λ : proofs of Theorems 3.1 and 3.2- (14) C.2.1 Control of the ‘ -norm of the Lasso estimatorProposition C.3 Let ξ > , p > . Define K = 2 ξ + δσ λ min . Then ∀ θ ? ∈ F p ( ξ ) , P (cid:18) ∀ λ ≥ λ min , N (cid:12)(cid:12) b w λ (cid:12)(cid:12) ≤ KN (1 /p − + (cid:19) ≥ − e − n/ . Proof.
Since F p ( ξ ) ⊂ F p ( ξ ) for p ≥ p , it suffices to prove the Proposition for p ∈ (0 , : we suppose now to be inthat case. With probability at least − e − n/ we have k z k ≤ √ n and therefore min L λ ≤ L λ ( θ ? ) ≤ σ + λn | θ ? | for all λ ≥ . One has thus with probability at least − e − n/ , ∀ θ ? ∈ F ( ξ ) , ∀ λ > , λn | b θ λ | ≤ L λ ( b θ λ ) ≤ σ + λn | θ ? | , which implies that N | b θ λ | ≤ δσ λ + ξN /p − since N | θ ? | ≤ N (cid:0) P Ni =1 | θ ?i | p (cid:1) /p ≤ ξN /p − . (cid:3) Proposition C.4
Assume that < δ < and σ > . Let s < s max ( δ ) . Then, there exists constants c, K > such that ∀ θ ? ∈ F ( s ) , P (cid:18) ∀ λ ∈ [ λ min , λ max ] , n (cid:12)(cid:12) | b w λ + θ ? | − | θ ? | (cid:12)(cid:12) ≤ K (cid:19) ≥ − e − cn . Proposition C.4 follows from the arguments of [50] that we reproduce below.
Lemma C.3
Suppose that θ ? ∈ F ( s ) for some s < s max ( δ ) . There exists constants c, a > that only depend on ( δ, s ) suchthat with probability at least − e − cn , for all w ∈ R N such that | w + θ ? | − | θ ? | ≤ we have k Xw k ≥ a k w k . Proof.
Define K = D (cid:0) | · | , θ ? (cid:1) = [ r> n u ∈ R N (cid:12)(cid:12)(cid:12) | θ ? + ru | ≤ | θ ? | o , ‘ -norm at θ ? . Define ν min ( X, K ) = inf (cid:8) k Xx k (cid:12)(cid:12) x ∈ K , k x k = 1 (cid:9) , Let ω ( K ) be the Gaussian width of K : ω ( K ) = E " sup u ∈K , k u k =1 h g, u i , where the expectation is taken with respect to g ∼ N (0 , I N ) . The following result goes back to Gordon’swork, [24], [25]. It can be found in for instance [50] (Proposition 3.3). Proposition C.5
For all t ≥ , P (cid:18) √ n ν min ( X, K ) ≥ √ n − − ω ( K ) − t (cid:19) ≥ − e − t / . Recall that M s ( α ) = s (1 + α ) + 2(1 − s ) (cid:0) (1 + α )Φ( − α ) − αφ ( α ) (cid:1) is the “critical function” studied inSection A.2.3. Lemma C.4
For all α ≥ ω ( K ) ≤ N M s ( α ) = N (cid:0) s (1 + α ) + 2(1 − s ) (cid:0) (1 + α )Φ( − α ) − αφ ( α ) (cid:1)(cid:1) . Proof.
Let v ∈ ∂ | θ ? | . By convexity, we have for all w ∈ K we have h w, v i ≤ | w + θ ? | − | θ ? | ≤ . Now for all x ∈ R N and α ≥ k x − αv k = sup k w k =1 h x − αv, w i ≥ sup w ∈K , k w k =1 (cid:8) h w, x i − h w, αv i (cid:9) ≥ sup w ∈K , k w k =1 h w, x i . (53)Let S denote the support of θ ? . Let g ∼ N (0 , I N ) , α ≥ and define v i = ( sign ( θ ?i ) if i ∈ S ,α − g i ( | g i | ≤ α ) + sign ( g i ) ( | g i | > α ) otherwise.Notice that v ∈ ∂ | θ ? | , therefore by (53): ω ( K ) ≤ E "(cid:16) sup u ∈K , k u k =1 h g, u i (cid:17) ≤ E (cid:2) k g − αv k (cid:3) ≤ E X i ∈ S ( g i − α sign ( θ ?i )) + X i S η ( g i , α ) ≤ N s (1 + α ) + 2 N (1 − s )((1 + α )Φ( − α ) − αφ ( α )) . (cid:3) Since s ≤ s max ( δ ) , there exists (see Lemma A.10) α ≥ and t ∈ (0 , such that M s ( α ) ≤ δ (1 − t ) .Consequently ω ( K ) ≤ √ n (1 − t ) . Therefore, there exists some constants a, c > that only depends on s and δ such that P ( ν min ( X, K ) ≥ a ) ≥ − e − cn . On the above event, for all w ∈ K , k Xw k ≥ a k w k , which proves the Lemma. (cid:3) Proof of Proposition C.4.
Let us work on the event n ∀ w ∈ R N , | w + θ ? | − | θ ? | ≤ ⇒ (cid:13)(cid:13) Xw (cid:13)(cid:13) ≥ a k w k o \ n k z k ≤ √ n o , (54)which has probability at least − e − cn/ . Let λ ∈ [ λ min , λ max ] . Notice that on the event (54) we have min C λ ≤C λ (0) ≤ σ and therefore λn (cid:0) | b w λ + θ ? | − | θ ? | (cid:1) ≤ σ .
31e distinguish two cases:
Case 1 : | b w λ + θ ? | − | θ ? | ≥ . In that case we obtain n (cid:12)(cid:12) | b w λ + θ ? | − | θ ? | (cid:12)(cid:12) ≤ σ λ min . Case 2 : | b w λ + θ ? | − | θ ? | ≤ . In that case σ ≥ C λ ( b w λ ) ≥ n k X b w λ − σz k − λn | b w λ | ≥ n k X b w λ k − σ n k z k − λ √ Nn k b w λ k≥ a n k b w λ k − σ − λ √ δn k b w λ k . This implies that there exists a constant C = C ( s, δ, σ ) > such that √ n k b w λ k ≤ C (1 + λ ) . One conclude − Cδ − / (1 + λ max ) ≤ − n | b w λ | ≤ n (cid:0) | b w λ + θ ? | − | θ ? | (cid:1) ≤ . (cid:3) C.2.2 Lipschitz continuity of the limiting risk and empirical distributionProposition C.6
The function λ µ ∗ λ is M -Lipschitz on [ λ min , λ max ] with respect to the Wasserstein distance W , for someconstant M = M (Ω) > . Proof.
Let λ , λ ∈ [ λ min , λ max ] . W ( µ ∗ λ , µ ∗ λ ) ≤ E h ( η (Θ + τ ∗ ( λ ) Z, α ∗ ( λ ) τ ∗ ( λ )) − η (Θ + τ ∗ ( λ ) Z, α ∗ ( λ ) τ ∗ ( λ ))) i ≤ E h ( τ ∗ ( λ ) Z − τ ∗ ( λ ) Z ) + ( α ∗ ( λ ) τ ∗ ( λ ) − α ∗ ( λ ) τ ∗ ( λ )) i ≤ τ ∗ ( λ ) − τ ∗ ( λ )) + 2( α ∗ ( λ ) τ ∗ ( λ ) − α ∗ ( λ ) τ ∗ ( λ )) ≤ α ) ( τ ∗ ( λ ) − τ ∗ ( λ )) + 2 τ ( α ∗ ( λ ) − α ∗ ( λ )) . Since by Proposition A.3 the functions λ α ∗ ( λ ) and λ τ ∗ ( λ ) are both M -Lipschitz on [ λ min , λ max ] , for someconstant M = M (Ω) > , we obtain: W ( µ ∗ λ , µ ∗ λ ) ≤ M (1 + α + τ )( λ − λ ) , which proves the Lemma. (cid:3) Proposition C.7
The function λ R ∗ ( λ ) = δ ( τ ∗ ( λ ) − σ ) is M -Lipschitz on [ λ min , λ max ] , for some constant M = M (Ω) > . Proof.
This is a consequence of Proposition A.3. (cid:3)
C.2.3 Proofs of Theorems 3.1 and 3.2Lemma C.5
Assume that D is either F ( s ) or F p ( ξ ) for some s < s max ( δ ) and ξ ≥ , p > . Define q = ( (1 /p − + if D = F p ( ξ ) , if D = F ( s ) . Then there exists constants
K, C, c > that depend only on Ω such that for all θ ? ∈ D P (cid:16) ∀ λ, λ ∈ [ λ min , λ max ] , L λ ( b θ λ ) ≤ min x ∈ R N L λ ( x ) + KN q | λ − λ | (cid:17) ≥ − Ce − cn . (55)32 roof. K = K (Ω) > be a constant such that for all θ ? ∈ D , the event n ∀ λ ∈ [ λ min , λ max ] , n (cid:12)(cid:12) | b θ λ | − | θ ? | (cid:12)(cid:12) ≤ KN q o (56)has probability at least − Ce − cn . Such K exists by Propositions C.3 and C.4. On the event (56) we have for all λ, λ ∈ [ λ min , λ max ] : L λ ( b θ λ ) = L λ ( b θ λ ) + ( λ − λ ) 1 n | b θ λ |≤ L λ ( b θ λ ) + ( λ − λ ) 1 n | b θ λ | = L λ ( b θ λ ) + 1 n ( λ − λ ) (cid:0) | b θ λ | − | b θ λ | (cid:1) ≤ min θ ∈ R N L λ ( θ ) + 1 n | λ − λ | (cid:16)(cid:12)(cid:12) | b θ λ | − | θ ? | (cid:12)(cid:12) + (cid:12)(cid:12) | b θ λ | − | θ ? | (cid:12)(cid:12)(cid:17) ≤ min θ ∈ R N L λ ( θ ) + 2 KN q | λ − λ | . (cid:3) Theorem 3.1 and Theorem 3.2-(14) are proved the same way.
Proof of Theorem 3.1.
Let γ > as given by Theorem 5.3 and let K = K (Ω) > as given by Lemma C.5. Let M = M (Ω) > such that λ µ ∗ λ is M -Lipschitz with respect to the Wasserstein distance W on [ λ min , λ max ] ,as given by Proposition C.6.Let (cid:15) ∈ (0 , and define (cid:15) = min (cid:16) γ(cid:15) KN q , (cid:15)M +1 (cid:17) . Let k = (cid:6) ( λ max − λ min ) /(cid:15) (cid:7) . Define, for i = 0 , . . . , k : λ i = λ min + i(cid:15) . By Theorem 5.3, the event n ∀ i ∈ { , . . . , k } , ∀ θ ∈ R N , L λ i ( θ ) ≤ min x ∈ R N L λ i ( x ) + γ(cid:15) = ⇒ W ( b µ ( θ,θ ? ) , µ ∗ λ i ) ≤ (cid:15) o (57)has probability at least − kC(cid:15) − max(1 ,a ) e − cN(cid:15) (cid:15) a log( (cid:15) ) − ≥ − CN q (cid:15) − max(1 ,a ) − e − cN(cid:15) (cid:15) a log( (cid:15) ) − . Therefore,on the intersection of the event in (55) and the event (57) we have for all λ ∈ [ λ min , λ max ] L λ i ( b θ λ ) ≤ min x ∈ R N L λ i ( x ) + 2 KN q | λ − λ i | ≤ min x ∈ R N L λ i ( x ) + γ(cid:15) , where ≤ i ≤ k is such that λ ∈ [ λ i − , λ i ] . This implies (since we are on the event (57)) that W (cid:0)b µ ( b θ λ ,θ ? ) , µ ∗ λ i (cid:1) ≤ (cid:15) . We conclude by W (cid:0)b µ ( b θ λ ,θ ? ) , µ ∗ λ (cid:1) ≤ W (cid:0)b µ ( b θ λ ,θ ? ) , µ ∗ λ i (cid:1) + 2 W ( µ ∗ λ , µ ∗ λ i ) ≤ (cid:15) + 2 M ( λ − λ i ) ≤ (cid:15) . This proves the Theorem. (cid:3)
Proof of Theorem 3.2- (14) . Let γ > as given by Theorem C.1 and let K = K (Ω) > as given by Lemma C.5.Let M = M (Ω) > such that λ R ∗ ( λ ) is M -Lipschitz on [ λ min , λ max ] , as given by Proposition C.7.Let (cid:15) ∈ (0 , and define (cid:15) = min (cid:16) γ(cid:15) KN q , (cid:15)M +1 (cid:17) . Let k = (cid:6) ( λ max − λ min ) /(cid:15) (cid:7) . Define, for i = 0 , . . . , k : λ i = λ min + i(cid:15) . By Theorem C.1, the event (cid:26) ∀ i ∈ { , . . . , k } , ∀ θ ∈ R N , L λ i ( θ ) ≤ min x ∈ R N L λ i ( x ) + γ(cid:15) = ⇒ (cid:16) N k θ − θ ? k − R ∗ ( λ i ) (cid:17) ≤ (cid:15) (cid:27) (58)has probability at least − kC(cid:15) − e − cN(cid:15) ≥ − CN q (cid:15) − e − cN(cid:15) . Therefore, on the intersection of the event in (55)and the event (58) we have for all λ ∈ [ λ min , λ max ] L λ i ( b θ λ ) ≤ min θ ∈ R N L λ i ( θ ) + 2 KN q | λ − λ i | ≤ min θ ∈ R N L λ i ( θ ) + γ(cid:15) , ≤ i ≤ k is such that λ ∈ [ λ i − , λ i ] . This implies (since we are on the event (58)) that (cid:16) N k b θ λ − θ ? k − R ∗ ( λ i ) (cid:17) ≤ (cid:15) . We conclude by (cid:16) N k b θ λ − θ ? k − R ∗ ( λ ) (cid:17) ≤ (cid:16) N k b θ λ − θ ? k − R ∗ ( λ i ) (cid:17) + 2 (cid:16) R ∗ ( λ i ) − R ∗ ( λ ) (cid:17) ≤ (cid:15) + 2 M ( (cid:15) ) ≤ (cid:15) . This proves (14). (cid:3)
Appendix D: Study of the Lasso residual: proof of (15) - (16) This Section is devoted to the proof of (15)-(16) from Theorem 3.2. Let us define b u λ = X b w λ − σz = X b θ λ − y . b u λ is the unique maximizer of u min w ∈ R N n u T Xw − σu T z − k u k + λ ( | θ ? + w | − | θ ? | ) o . In Section D.2 below, we prove the following Theorem:
Theorem D.1
There exists constants c, C > such that for all (cid:15) ∈ (0 , , all θ ? ∈ D and all λ ∈ [ λ min , λ max ] P (cid:18)(cid:16) n k b u λ k − β ∗ ( λ ) (cid:17) ≥ (cid:15) (cid:19) ≤ C(cid:15) e − cn(cid:15) , and P (cid:18)(cid:16) n k b u λ + σz k − P ∗ ( λ ) (cid:17) ≥ (cid:15) (cid:19) ≤ C(cid:15) e − cn(cid:15) . Theorem 3.2-(15)-(16) will then be deduced from Theorem D.1 in Section D.2.
D.1 Study of Gordon’s optimization problem
Recall that g ∼ N (0 , I N ) and h ∼ N (0 , I n ) are independent standard Gaussian vectors. Define for ( w, u ) ∈ R N × R n , m λ ( w, u ) = − n / k u k g T w + 1 n / k w k h T u − σn u T z − n k u k + λn (cid:0) | w + θ ? | − | θ ? | (cid:1) , and U λ ( u ) = min w ∈ R N m λ ( w, u ) , e U λ ( u ) = m λ ( w λ , u ) , where w λ is defined by (44). Obviously we have U λ ( u ) ≤ e U λ ( u ) . We write also u λ = β ∗ ( λ ) τ ∗ ( λ ) (cid:16)p τ ∗ ( λ ) − σ h √ n − σ √ n z (cid:17) . (59) Lemma D.1
There exists constants
C, c > such that for all (cid:15) ∈ (0 , and any λ ∈ [ λ min , λ max ] we have with probability atleast − Ce − cn(cid:15) • e U λ is /n -strongly concave on R n and admits therefore a unique maximizer u ∗ λ over R n .• | max u ∈ R n e U λ ( u ) − L ∗ ( λ ) | ≤ (cid:15) , where L ∗ ( λ ) is defined by Corollary B.1.• n k u ∗ λ − u λ k ≤ (cid:15) . roof. By Lemma F.1 and Lemma F.2, n w T λ g concentrates around s ∗ ( λ ) δ which is greater than some constant γ > .Indeed s ∗ ( λ ) = E (cid:20) Φ (cid:16) Θ τ ∗ − α ∗ (cid:17) + Φ (cid:16) − Θ τ ∗ − α ∗ (cid:17)(cid:21) remains greater than some strictly positive constant while θ ? vary in D and λ vary in [ λ min , λ max ] . By Lemma F.1we have then that with probability at least − Ce − cn , w T λ g ≥ which implies that e U λ is /n -strongly concave.Let us compute max u ∈ R n e U λ ( u ) = max β ≥ n(cid:16)(cid:13)(cid:13)(cid:13) n k w λ k h − σ √ n z (cid:13)(cid:13)(cid:13) − n g T w λ (cid:17) β − β + λn (cid:0) | w λ + θ ? | − | θ ? | (cid:1)o = 12 (cid:18)(cid:13)(cid:13)(cid:13) n k w λ k h − σ √ n z (cid:13)(cid:13)(cid:13) − n g T w λ (cid:19) + λn (cid:0) | w λ + θ ? | − | θ ? | (cid:1) . By the concentration properties of w λ (see Section F.2), have that with probability at least − Ce − cn(cid:15) , | max u ∈ R n e U λ ( u ) − L ∗ ( λ ) | ≤ (cid:15) . One verify analogously that e U λ ( u λ ) ≥ L ∗ ( λ ) − (cid:15) with the same probability, which implies the thirdpoint by strong concavity. (cid:3) D.2 Proof of Theorem D.1
Let us only prove the second point since the first one follows from the same arguments. Let (cid:15) ∈ (0 , and define D (cid:15) = n u ∈ R n (cid:12)(cid:12)(cid:12) (cid:12)(cid:12) √ n k u + σz k − p P ∗ ( λ ) (cid:12)(cid:12) ≥ (cid:15) / o . Let us define for ( w, u ) ∈ R N × R n : c λ ( w, u ) = 1 n u T Xw − σn u T z − n k u k + λn (cid:0) | w + θ ? | − | θ ? | (cid:1) (60) Lemma D.2
We have almost surely min w ∈ R N max u ∈ R n c λ ( w, u ) = max u ∈ R n min w ∈ R N c λ ( w, u ) = c λ ( b w λ , b u λ ) . Proof.
By definition of b w λ and b u λ we have c λ ( b w λ , b u λ ) = min w ∈ R N max u ∈ R n c λ ( w, u ) ≥ max u ∈ R n min w ∈ R N c λ ( w, u ) . Let us prove the converse inequality. The optimality condition of b w λ gives that there exists v ∈ ∂ | θ ? + b w λ | suchthat X T b u λ + λv = X T ( X b w λ − σz ) + λv = 0 . The function w c λ ( w, b u λ ) is convex and n X T b u λ + λn v = 0 is a subgradient at b w λ . Therefore min w ∈ R N c λ ( w, b u λ ) = c λ ( b w λ , b u λ ) , which proves the lemma. (cid:3) We compute now P (cid:0)b u λ ∈ D (cid:15) (cid:1) = P (cid:16) max u ∈ D (cid:15) min w ∈ R N c λ ( w, u ) ≥ max u ∈ R n min w ∈ R N c λ ( w, u ) (cid:17) ≤ P (cid:16) max u ∈ D (cid:15) min w ∈ R N c λ ( w, u ) ≥ L ∗ ( λ ) − (cid:15) (cid:17) + P (cid:16) max u ∈ R n min w ∈ R N c λ ( w, u ) ≤ L ∗ ( λ ) − (cid:15) (cid:17) . By Lemma D.2 and Proposition C.2 we can bound P (cid:16) max u ∈ R n min w ∈ R N c λ ( w, u ) ≤ L ∗ ( λ ) − (cid:15) (cid:17) = P (cid:16) min w ∈ R N C λ ( w ) ≤ L ∗ ( λ ) − (cid:15) (cid:17) ≤ C(cid:15) e − cn(cid:15) . P (cid:16) max u ∈ D (cid:15) min w ∈ R N c λ ( w, u ) ≥ L ∗ ( λ ) − (cid:15) (cid:17) ≤ P (cid:16) max u ∈ D (cid:15) min w ∈ R N m λ ( w, u ) ≥ L ∗ ( λ ) − (cid:15) (cid:17) = 2 P (cid:16) max u ∈ D (cid:15) U λ ( u ) ≥ L ∗ ( λ ) − (cid:15) (cid:17) . Since U λ ≤ e U λ we obtain P (cid:16) max u ∈ D (cid:15) min w ∈ R N c λ ( w, u ) ≥ L ∗ ( λ ) − (cid:15) (cid:17) ≤ P (cid:16) max u ∈ D (cid:15) e U λ ( u ) ≥ L ∗ ( λ ) − (cid:15) (cid:17) . Let E be the event of Lemma D.1 above and let us work on the event E \ n(cid:12)(cid:12) √ n k u λ + σz k − p P ∗ ( λ ) (cid:12)(cid:12) ≤ (cid:15) / o , (61)which has probability at least − C(cid:15) e − cn(cid:15) (the fact that the second event in the intersection has this probabilityfollows from standard concentration arguments as in Section F.2). Let now u ∈ D (cid:15) , by the definition of D (cid:15) andthe event above we have √ n k u − u λ k ≥ (cid:15) / and thus √ n k u − u ∗ λ k ≥ (cid:15) / . By /n -strong concavity of e U λ weget e U λ ( u ) ≤ max u ∈ R n e U λ ( u ) − (cid:15) ≤ L ∗ ( λ ) − (cid:15) . Consequently P (cid:16) max u ∈ D (cid:15) e U λ ( u ) ≥ L ∗ ( λ ) − (cid:15) (cid:17) ≤ C(cid:15) e − cn(cid:15) , which proves the result. D.3 Uniform control over λ : proof of Theorem 3.2- (15) - (16) Let D be either F ( s ) for some s < s max ( δ ) or F p ( ξ ) for some ξ ≥ , p > . Let q = 0 if D = F ( s ) and q = (1 /p − + if D = F p ( ξ ) . By Propositions C.3 and C.4 there exists a constant K = K (Ω) such that the event n ∀ λ ∈ [ λ min , λ max ] , n (cid:12)(cid:12) | b w λ + θ ? | − | θ ? | (cid:12)(cid:12) ≤ KN q o (62)has probability at least − Ce − cn . Let us fix this constant K and let us write D K = n w ∈ R N (cid:12)(cid:12)(cid:12) n (cid:12)(cid:12) | w + θ ? | − | θ ? | (cid:12)(cid:12) ≤ KN q o . We define also U λ ( u ) = min w ∈ D K n n u T Xw − σn u T z − n k u k + λn (cid:0) | w + θ ? | − | θ ? | (cid:1)o . Lemma D.3
The function U λ is /n -strongly concave. On the event (62) , b u λ is the (unique) maximizer of U λ . Proof.
Let us work on the event (62) and let λ ∈ [ λ min , λ max ] . We have, by permutation of max and min: max u ∈ R n U λ ( u ) ≤ min w ∈ D K C λ ( w ) = C λ ( b w λ ) , because on the event (62), b w λ (the minimizer of C λ ) is in D K . By the optimality condition of b w λ , one verify easilythat U λ ( b u λ ) = C λ ( b w λ ) which proves the lemma. (cid:3) Theorem 3.2-(15)-(16) follow then easily from Theorem D.1 (by an (cid:15) -net argument as in the proof of Theo-rems 3.1 and 3.2-(14), see Section C.2) and the following Proposition:
Proposition D.1
Let q = 0 if D = F ( s ) and q = (1 /p − + if D = F p ( ξ ) . There exists constants C, c, κ > such that for all θ ? ∈ D the following event n ∀ λ, λ ∈ [ λ min , λ max ] , n k b u λ − b u λ k ≤ κN q | λ − λ | o has probability at least − Ce − cn . roof. Let us work on the event (62), which has probability at least − Ce − cn . Let λ, λ ∈ [ λ min , λ max ] . We have sup u ∈ R n (cid:12)(cid:12) U λ ( u ) − U λ ( u ) (cid:12)(cid:12) ≤ sup w ∈ D K (cid:12)(cid:12)(cid:12) λ − λ n ( | w + θ ? | − | θ ? | ) (cid:12)(cid:12)(cid:12) ≤ KN q | λ − λ | . Therefore U λ ( b u λ ) ≥ U λ ( b u λ ) − N q K | λ − λ | ≥ U λ ( b u λ ) − N q K | λ − λ | ≥ U λ ( b u λ ) − KN q | λ − λ | , which gives that n k b u λ − b u λ k ≤ KN q | λ − λ | by /n -strong concavity. (cid:3) Appendix E: Study of the subgradient b v λ The goal of this section is to analyze the vector b v λ = 1 λ X T ( y − X b θ λ ) , which is a subgradient of the ‘ -norm at b θ λ . Let us define B ∞ (0 ,
1) = n v ∈ R N (cid:12)(cid:12)(cid:12) k v k ∞ ≤ o . E.1 Main results
Let B = (cid:8) w ∈ R N | | w | ≤ | θ ? | +5 σ λ − n + K (cid:9) , where K > is some constant (depending only on Ω ) that willbe fixed later in the analysis (in fact K is the constant given by Lemma E.5). Notice that b w λ ∈ B , with probabilityat least − e − n/ . Define V λ ( v ) = min w ∈ B (cid:26) n k Xw − σz k + λn v T ( θ ? + w ) − λn | θ ? | (cid:27) . Lemma E.1
With probability at least − e − n/ we have for all λ ≥ λ min min w ∈ R N C λ ( w ) = max k v k ∞ ≤ V λ ( v ) and b v λ = − λ − X T ( X b w λ − σz ) is a maximizer of V λ . Proof.
Let us work on the event {k z k ≤ √ n } which has probability at least − e − n/ . On this event we have b w λ ∈ B and therefore min w ∈ R N C λ ( w ) = min w ∈ B C λ ( w ) = max k v k ∞ ≤ V λ ( v ) , where the permutation of the min-max is authorized by Proposition G.1. The optimality condition of b w λ gives that b v λ = − λ − X T ( X b w λ − σz ) ∈ ∂ | θ ? + b w λ | . Therefore b v T λ ( b w λ + θ ? ) = | b w λ + θ ? | . Using the optimality condition again we obtain V λ ( b v λ ) = min w ∈ B (cid:26) k Xw − σz k + λ b v T λ ( θ ? + w ) (cid:27) = 12 k X b w λ − σz k + λ b v T λ ( θ ? + b w λ )= 12 k X b w λ − σz k + λ | θ ? + b w λ | . Therefore b v λ achieves the optimal value. (cid:3) .1.1 The empirical law of the subgradient Let ν ∗ λ be the law of the couple (cid:18) − α ∗ ( λ ) τ ∗ ( λ ) (cid:16) η (cid:0) Θ + τ ∗ ( λ ) Z, α ∗ ( λ ) τ ∗ ( λ ) (cid:1) − Θ − τ ∗ ( λ ) Z (cid:17) , Θ (cid:19) , (63)where (Θ , Z ) ∼ b µ θ ? ⊗ N (0 , . For v ∈ R N we define b µ ( v,θ ? ) = 1 N N X i =1 δ ( v i ,θ ?i ) . Theorem E.1
Assume that D = F p ( ξ ) for some ξ, p > . There exists constants C, c > that only depend on Ω such that forall λ ∈ [ λ min , λ max ] and all (cid:15) ∈ (0 , ] , sup θ ? ∈D P (cid:16) W ( b µ ( b v λ ,θ ? ) , ν ∗ λ ) ≥ (cid:15) (cid:17) ≤ C(cid:15) − max(1 ,a ) exp (cid:0) − cN (cid:15) (cid:15) a log( (cid:15) ) − (cid:1) , where a = + p . Theorem E.1 is proved in Section E.3.2.
Theorem E.2
Let D be F p ( ξ ) for some ξ > and p > . For all (cid:15) ∈ (0 , ] , sup θ ? ∈D P (cid:16) sup λ ∈ [ λ min ,λ max ] W ( b µ ( b v λ ,θ ? ) , ν ∗ λ ) ≥ (cid:15) (cid:17) ≤ C(cid:15) − max(1 ,a ) − N (1 /p − + exp (cid:0) − cN (cid:15) (cid:15) a log( (cid:15) ) − (cid:1) , where a = + p . Theorem E.2 is deduced from Theorem E.1 in Section E.3.4.
E.1.2 The norm of the subgradient
Let us define κ ∗ ( λ ) = β ∗ ( λ ) λ (cid:16) δ − s ∗ ( λ ) − δ σ τ ∗ ( λ ) (cid:17) . (64) Theorem E.3
There exists a constant
C, c > such that for all λ ∈ [ λ min , λ max ] and all (cid:15) ∈ (0 , , sup θ ? ∈D P (cid:16)(cid:16) N k b v λ k − κ ∗ ( λ ) (cid:17) ≥ (cid:15) (cid:17) ≤ C(cid:15) e − cn(cid:15) . Theorem E.3 is proved in Section E.3.1. We deduce as before:
Theorem E.4
Let D be either F ( s ) for s < s max ( δ ) or F p ( ξ ) for some ξ > and p > . There exists constants C, c > suchthat for all (cid:15) ∈ (0 , , sup θ ? ∈D P (cid:16) sup λ ∈ [ λ min ,λ max ] (cid:16) N k b v λ k − κ ∗ ( λ ) (cid:17) ≥ (cid:15) (cid:17) ≤ C(cid:15) N q e − cn(cid:15) , where q = 0 if D = F ( s ) and q = (1 /p − + if D = F p ( ξ ) . Theorem E.4 is deduced from Theorem E.3 in Section E.3.4.38 .1.3 Upper bound on the sparsity of the Lasso estimator
Studying b v λ allows to get an upper-bound on the ‘ norm of b θ λ . Indeed if b θ λ,i = 0 then | b v λ,i | = 1 : therefore k b θ λ k ≤ (cid:8) i (cid:12)(cid:12) | b v λ,i | = 1 (cid:9) . For this reason, the following results will be used to prove Theorem F.1 in Appendix F.4. Theorem E.5
There exists constants
C, c > such that for all λ ∈ [ λ min , λ max ] and all (cid:15) ∈ (0 , , sup θ ? ∈D P (cid:16) N (cid:8) i (cid:12)(cid:12) | b v λ,i | ≥ − (cid:15) (cid:9) ≥ s ∗ ( λ ) + 2(1 + α max ) (cid:15) (cid:17) ≤ C(cid:15) e − cn(cid:15) . Theorem E.5 is proved in Section E.3.3.
Theorem E.6
Let D be either F ( s ) for s < s max ( δ ) or F p ( ξ ) for some ξ > and p > . We have for all (cid:15) ∈ (0 , , sup θ ? ∈D P (cid:16) ∃ λ ∈ [ λ min , λ max ] , N (cid:8) i (cid:12)(cid:12) | b v λ,i | = 1 (cid:9) ≥ s ∗ ( λ ) + (cid:15) (cid:17) ≤ C(cid:15) N q e − cn(cid:15) , where q = 0 if D = F ( s ) and q = (1 /p − + if D = F p ( ξ ) . Theorem E.6 is deduced from Theorem E.5 in Section E.3.4.
E.2 Gordon’s strategy for the subgradient
E.2.1 Application of Gordon’s Theorem
Let g ∼ N (0 , I N ) and h ∼ N (0 , I n ) be independent standard Gaussian vectors. We define: V λ ( v ) = min w ∈ B r n k w k + σ k h k√ n − n g T w + g σ √ n ! + λn v T ( w + θ ? ) − λn | θ ? | . The following Proposition is the analog of Corollary 5.1.
Proposition E.1
Let D ⊂ (cid:8) v ∈ R N (cid:12)(cid:12) k v k ∞ ≤ (cid:9) be a closed set.• We have for all t ∈ R P (cid:16) max v ∈ D V λ ( v ) ≥ t (cid:17) ≤ P (cid:16) max v ∈ D V λ ( v ) ≥ t (cid:17) . • If D is convex, then we have for all t ∈ RP (cid:16) max v ∈ D V λ ( v ) ≤ t (cid:17) ≤ P (cid:16) max v ∈ D V λ ( v ) ≤ t (cid:17) . Proof.
Let v ∈ R N . By Proposition G.1 one can permute the min-max and obtain: V λ ( v ) = min w ∈ B max u ∈ R n (cid:26) n u T Xw − σn u T z − n k u k + λn v T ( θ ? + w ) − λn | θ ? | (cid:27) = max u ∈ R n min w ∈ B (cid:26) n u T Xw − σn u T z − n k u k + λn v T ( θ ? + w ) − λn | θ ? | (cid:27) . Let D ⊂ (cid:8) v ∈ R N (cid:12)(cid:12) k v k ∞ ≤ (cid:9) be a closed set. We can the apply Gordon’s Theorem (Corollary G.1) in order tocompare max ( v,u ) ∈ D × R n min w ∈ B (cid:26) n u T Xw − σn u T z − n k u k + λn v T ( θ ? + w ) − λn | θ ? | (cid:27) , (65)39ith max ( v,u ) ∈ D × R n min w ∈ B (r k w k n + σ h T un − n / k u k g T w + g σ √ n − n k u k + λn v T ( θ ? + w ) − λn | θ ? | ) (66) = max v ∈ D min w ∈ B max u ∈ R n (r k w k n + σ h T un − n / k u k g T w + g σ √ n − n k u k + λn v T ( θ ? + w ) − λn | θ ? | ) , which is equal to max v ∈ D V λ ( v ) . Note that the maximums in (65) and (66) are not defined on compact sets (since D × R n is not bounded). One has therefore to follow the same procedure than for Corollary 5.1, and show thatthere exists a compact set K ⊂ R n such that with high probability, the maximum over u ∈ R n is achieved in K . For the sake of brevity we do not provide a complete execution of this argument and refer to the proof ofCorollary 5.1. (cid:3) E.2.2 Study of Gordon’s optimization problem
In this section we study the optimization problem max k v k ∞ ≤ V λ ( v ) . Let us define v λ = − α ∗ ( λ ) − τ ∗ ( λ ) − (cid:16) η (cid:0) θ ? + τ ∗ ( λ ) g, α ∗ ( λ ) τ ∗ ( λ ) (cid:1) − θ ? − τ ∗ ( λ ) g (cid:17) . The goal of this section is to prove:
Theorem E.7
There exists constants γ, c, C > that only depend on Ω such that for all θ ? ∈ D , all λ ∈ [ λ min , λ max ] and all (cid:15) ∈ (0 , P (cid:16) ∃ v ∈ B ∞ (0 , , N k v − v λ k ≥ (cid:15) and V λ ( v ) ≥ max v ∈ R N V λ ( v ) − γ(cid:15) (cid:17) ≤ C(cid:15) e − cn(cid:15) . Recall that w ∗ λ is by Lemma B.1 the unique minimizer of L λ over R N . Lemma E.2
With probability at least − e − n/ we have min w ∈ R N L λ ( w ) = max k v k ∞ ≤ V λ ( v ) and the vector v ∗ λ = − λ − r k w ∗ λ k n + σ k h k√ n − n g T w ∗ λ + g σ √ n ! + k h k√ n w ∗ λ p k w ∗ λ k /n + σ − g ! (67) verifies k v ∗ λ k ∞ ≤ and is a maximizer of V λ . Proof.
By Proposition G.1, one can switch the min-max: max k v k ∞ ≤ V λ ( v ) = min w ∈ B max k v k ∞ ≤ r n k w k + σ k h k√ n − n g T w + g σ √ n ! + λn v T ( w + θ ? ) − λn | θ ? | (68) = min w ∈ B L λ ( w ) . (69)Let us work on the event (cid:8) k h k ≤ √ n (cid:9) ∩ { g ≤ √ n } which has probability at least − e − n/ . We have λn ( | w ∗ λ + θ ? | − | θ ? | ) ≤ L λ ( w ∗ λ ) ≤ L λ (0) ≤ σ . This gives w ∗ λ ∈ B and thus max k v k ∞ ≤ V λ ( v ) = min w ∈ B L λ ( w ) =min w ∈ R N L λ ( w ) .The optimality condition of w ∗ λ gives that v ∗ λ = − λ − r k w ∗ λ k n + σ k h k√ n − n g T w ∗ + g σ √ n ! + k h k√ n w ∗ λ p k w ∗ λ k /n + σ − g ! ∈ ∂ | θ ? + w ∗ λ | . v ∗ T λ ( w ∗ λ + θ ? ) = | w ∗ λ + θ ? | . Using the optimality condition again we obtain V λ ( v ∗ λ ) = min w ∈ B r n k w k + σ k h k√ n − n g T w + g σ √ n ! + λn v ∗ T λ ( w + θ ? ) − λn | θ ? | = 12 r n k w ∗ λ k + σ k h k√ n − n g T w ∗ λ + g σ √ n ! + λn | w ∗ λ + θ ? | − λn | θ ? | = min w ∈ R N L λ ( w ) = max k v k ∞ ≤ V λ ( v ) . Therefore v ∗ λ achieves the optimal value. (cid:3) Proposition E.2
For all θ ? ∈ D and all λ ∈ [ λ min , λ max ] we have for all (cid:15) ∈ (0 , P (cid:16) N k v ∗ λ − v λ k ≥ (cid:15) (cid:17) ≤ C(cid:15) e − cn(cid:15) . Proof.
By Theorem B.1 we have for all (cid:15) ∈ (0 , P (cid:16) N k w ∗ λ − w λ k ≥ (cid:15) (cid:17) ≤ C(cid:15) e − cn(cid:15) , so we deduce the result from the expression (67) of v ∗ λ and the concentration properties of w λ (see Section F.2). (cid:3) By the same arguments used for proving Lemma E.2 it is not difficult to prove:
Lemma E.3
The function β ≥ min w ∈ B ‘ λ ( w, β ) (recall that ‘ λ is defined by Equation 43) admits a unique maximizer b ∗ λ over R ≥ and b ∗ λ = r k w ∗ λ k n + σ k h k√ n − n g T w ∗ λ + g σ √ n ! + . Moreover, for all (cid:15) ∈ (0 , we have P (cid:16) | b ∗ λ − β ∗ ( λ ) | > (cid:15) (cid:17) ≤ C(cid:15) e − cn(cid:15) . E.2.3 Proof of Theorem E.7
Let v ∈ R N such that k v k ∞ ≤ . We have by Proposition G.1 V λ ( v ) = min w ∈ B max β ≥ ( β r n k w k + σ k h k√ n − n g T w + g σ √ n ! − β λn v T ( w + θ ? ) − λn | θ ? | ) = max β ≥ min w ∈ B ( β r n k w k + σ k h k√ n − n g T w + g σ √ n ! − β λn v T ( w + θ ? ) − λn | θ ? | ) = max β ≥ min ≤ r ≤ R (cid:26) β p r + σ k h k√ n − r √ n k βg − λv k + β g σ √ n − β λn v T θ ? − λn | θ ? | (cid:27) , because the minimization over the direction of w is easy to perform. Let us define for κ > D κ = n v ∈ B ∞ (0 , , N k v − v ∗ λ k ≤ κ and V λ ( v ) ≥ max k v k ∞ ≤ V λ ( v ) − κ o . By concavity of V λ , D κ is convex. 41 roposition E.3 There exists a constant κ > such that with probability at least − Ce − cn we have ∀ v ∈ D κ , V λ ( v ) = e V λ ( v ) where e V λ : v min w ∈ R N ( max β ∈ [ β ∗ − κ,β ∗ + κ ] ( β r n k w k + σ k h k√ n − n g T w + g σ √ n ! − β ) + λn v T ( w + θ ? ) − λn | θ ? | ) . In order to prove Proposition E.3, we start with a Lemma:
Lemma E.4
For all v ∈ D κ , the function f v : β min w ∈ B ( β r n k w k + σ k h k√ n − n g T w + g σ √ n ! − β λn v T ( w + θ ? ) − λn | θ ? | ) admits a unique maximizer b λ ( v ) on [0 , + ∞ ) and one has | b λ ( v ) − b ∗ λ | ≤ κ/ . Proof.
Let v ∈ D κ . f v is -strongly concave so it admits a unique maximizer b λ ( v ) on R ≥ . We have f v ( b λ ( v )) = max β ≥ f v ( β ) = V λ ( v ) ≥ max k v k ∞ ≤ V λ ( v ) − κ , because v ∈ D κ . Notice now that f v ( b λ ( v )) ≤ min w ∈ B ‘ λ ( w, b λ ( v )) because v T ( w + θ ? ) ≤ | w + θ ? | . Permutingthe min-max (using Proposition G.1), we have max k v k ∞ ≤ V λ ( v ) = max β ≥ min w ∈ B ‘ λ ( w, β ) , where we recall that ‘ λ isdefined by (43). We get min w ∈ B ‘ λ ( w, b λ ( v )) ≥ max β ≥ min w ∈ B ‘ λ ( w, β ) − κ . The function β min w ∈ B ‘ λ ( w, β ) is -strongly concave and maximized (by Lemma E.3 above) at b ∗ λ , hence | b λ ( v ) − b ∗ λ | ≤ κ/ . (cid:3) Lemma E.5
There exist constants
K, κ > such that with probability at least − Ce − cn the following happens. For all β ≥ , v ∈ R N such that | β − b ∗ λ | ≤ κ and k v − v ∗ λ k ≤ √ N κ the minimum over R N of w β r n k w k + σ k h k√ n − n g T w + g σ √ n ! − β λn v T ( w + θ ? ) − λn | θ ? | is achieved on B (0 , √ N K ) . Proof.
The minimization with respect to the direction of w is easy to perform: w has to be a non-negative multipleof βg − λv . It remains thus to minimizes with respect to the norm of w . We have to show that under the conditionsof the lemma, the minimum of r ≥ β p r + σ − r k h k k βg − λv k (70)is achieved for r smaller than some constant. By Theorem B.1 and Lemma E.3 there exists a constant R > (forinstance R = τ max + 1 ) such that the event n k w ∗ λ k ≤ nR o \ n b ∗ λ ≥ β min / o \ n k h k ≥ √ n/ o (71)has probability at least − Ce − cn . Let us define the constants a = R √ R + σ < and κ = min √ δβ (1 − a )256 λ max , β min / , (1 − a ) √ δβ min λ max ! . Let us now work on the event (71). Let v ∈ R N and β ≥ such that k v − v ∗ λ k ≤ √ N κ and | β − b ∗ λ | ≤ κ . Wehave k g − λβ v k ≤ k g − λb ∗ λ v ∗ λ k + k λβ v − λb ∗ λ v ∗ λ k . k g − λb ∗ λ v ∗ λ k = k h k k w ∗ λ k / √ n q k w ∗ λ k n + σ ≤ k h k a , with probability at least − Ce − cn . Now k β v − b ∗ λ v ∗ λ k ≤ β ∗ k v − v ∗ λ k + k β v − b ∗ λ v k ≤ β min k v − v ∗ λ k + √ N min( β, b ∗ λ ) | β − b ∗ λ |≤ β min k v − v ∗ λ k + 16 √ Nβ | β − b ∗ λ | ≤ − a √ n . Putting all together: k h k k g − λβ v k ≤ a + 1 − a a < . This gives that the minimum of (70) is achieved for r ≤ σ √ − ((1+ a ) / . One can thus chose K = δσ √ − ((1+ a ) / . (cid:3) Proof of Proposition E.3.
Let us now fix a constant κ ∈ (0 , β min / that verify the statement of Lemma E.5.Let us work on the intersection of the event {| b ∗ λ − β ∗ ( λ ) | ≤ κ/ } with the event of Lemma E.5. This intersectionhas by Lemma E.3 and Lemma E.5 probability at least − Ce − cn .Let v ∈ D κ . By Lemma E.4 the unique maximizer b λ ( v ) of f v verify | b λ ( v ) − b ∗ λ | ≤ κ/ and therefore | b λ ( v ) − β ∗ | ≤ κ . Consequently V λ ( v ) = max β ∈ [ β ∗ − κ,β ∗ + κ ] min w ∈ B ( β r n k w k + σ k h k√ n − n g T w + g σ √ n ! − β λn v T ( w + θ ? ) − λn | θ ? | ) . Now, for β ∈ [ β ∗ − κ, β ∗ + κ ] , we have | β − b ∗ λ | ≤ κ . Since we are working on the event of Lemma E.5, we obtain V λ ( v ) = max β ∈ [ β ∗ − κ,β ∗ + κ ] min w ∈ R N ( β r n k w k + σ k h k√ n − n g T w + g σ √ n ! − β λn v T ( w + θ ? ) − λn | θ ? | ) , and Proposition E.3 follows from the permutation of the min − max using Proposition G.1. (cid:3) Lemma E.6
There exists a constant
C, c, γ > such that e V λ is γ/N -strongly concave, with probability at least − Ce − cn . Proof.
The function f ∗ : v ∈ R N max w ∈ R N ( v T w − nλ max β ∈ [ β ∗ − κ,β ∗ + κ ] ( β r n k w k + σ k h k√ n − n g T w + g σ √ n ! − β )) is the convex conjugate of the convex function f : w ∈ R N nλ max β ∈ [ β ∗ − κ,β ∗ + κ ] ( β r n k w k + σ k h k√ n − n g T w + g σ √ n ! − β ) = nλ ϕ r n k w k + σ k h k√ n − n g T w + g σ √ n ! , where ϕ is the C function ϕ ( x ) = x if x ∈ [ β ∗ − κ, β ∗ + κ ] , ( β ∗ − κ ) x − ( β ∗ − κ ) if x ≤ β ∗ − κ , ( β ∗ + κ ) x − ( β ∗ + κ ) if x ≥ β ∗ + κ .f is a proper closed convex function (because f is convex and its domain is R N ), therefore its convex conjugate f ∗ is also a proper closed convex function. The Fenchel-Moreau Theorem gives then that f = f ∗∗ . Let us compute43he gradient of f for w ∈ R N ∇ f ( w ) = nλ ϕ r n k w k + σ k h k√ n − n g T w + g σ √ n ! k h k√ n w/n q k w k n + σ − gn . It is not difficult to verify that there exists a constant L such that ∇ f is L -Lipschitz on R N , with probability atleast − Ce − cn . f = f ∗∗ is therefore /L -strongly smooth (see Definition G.1). By Proposition G.2 this givesthat f ∗ is /L strongly convex. One deduces then that e V λ is γ/N - strongly concave with γ = λ/ ( Lδ ) . (cid:3) Let < γ < / be a constant that verify the statement of Lemma E.6 and let κ > be a constant given byProposition E.3. Notice that it suffices to prove Theorem E.7 for (cid:15) small enough and let (cid:15) ∈ (0 , κ ) . P (cid:16) ∃ v ∈ B ∞ (0 , , N k v − v λ k > (cid:15) and V λ ( v ) ≥ max k v k ∞ ≤ V λ ( v ) − γ(cid:15) (cid:17) ≤ P (cid:16) ∃ v ∈ B ∞ (0 , , N k v − v ∗ λ k > (cid:15) and V λ ( v ) ≥ max k v k ∞ ≤ V λ ( v ) − γ(cid:15) (cid:17) + C(cid:15) e − cn(cid:15) ≤ P (cid:16) ∃ v ∈ D κ , N k v − v ∗ λ k > (cid:15) and V λ ( v ) ≥ max v ∈ D κ V λ ( v ) − γ(cid:15) (cid:17) + C(cid:15) e − cn(cid:15) , (72)because, if there exists v ∈ B ∞ (0 , such that N k v − v ∗ λ k > (cid:15) and V λ ( v ) ≥ max k v k ∞ ≤ V λ ( v ) − γ(cid:15) , we canconstruct e v ∈ D κ that verifies the same conditions. Indeed:• if N k v − v ∗ λ k ≤ κ , one simply take e v = v .• otherwise, e v = v ∗ λ + κ ( v − v ∗ λ ) / k v − v ∗ λ k is in D κ and by concavity V λ ( e v ) ≥ V λ ( v ) .Since with probability at least − Ce − cn we have V λ ( v ) = e V λ ( v ) for all v ∈ D κ and e V λ is γ/N -strongly concave,the probability in (72) above is less that Ce − cn . E.3 Proofs of the main results about the subgradient
Let us start with the analog of Proposition C.1 for the cost functions $V_\lambda$ and $\mathcal V_\lambda$:

Proposition E.4
There exist constants $c, C > 0$ that only depend on $\Omega$ such that for all closed sets $D \subset \mathbb{R}^N$ and all $\epsilon \in (0, 1]$,
$$\mathbb{P}\Big(\max_{v\in D} V_\lambda(v) \geq \max_{\|v\|_\infty\leq 1} V_\lambda(v) - \epsilon\Big) \leq \mathbb{P}\Big(\max_{v\in D} \mathcal V_\lambda(v) \geq \max_{\|v\|_\infty\leq 1} \mathcal V_\lambda(v) - \epsilon\Big) + C\epsilon^{-2}e^{-cn\epsilon^2}.$$
The proof of Proposition E.4 is omitted for the sake of brevity: it follows from exactly the same arguments as Proposition C.1.
E.3.1 The norm of $\hat v_\lambda$: proof of Theorem E.3

Lemma E.7 There exist constants $\gamma, c, C > 0$ that depend only on $\Omega$, such that for all $\epsilon \in (0, 1]$ we have
$$\mathbb{P}\Big(\max_{v\in D_\epsilon} V_\lambda(v) \geq \max_{\|v\|_\infty\leq 1} V_\lambda(v) - \gamma\epsilon\Big) \leq C\epsilon^{-2}e^{-cn\epsilon^2},$$
where $D_\epsilon = \big\{v \in B_\infty(0,1) \,\big|\, \big(\|v\| - \sqrt{N\kappa^*(\lambda)}\big)^2 \geq N\epsilon\big\}$ and $\kappa^*(\lambda)$ is defined by (64).

Proof.
Similarly to Proposition F.1 it is not difficult to prove that for all (cid:15) ∈ (0 , , P (cid:16)(cid:12)(cid:12)(cid:12) N k v λ k − κ ∗ ( λ ) (cid:12)(cid:12)(cid:12) > (cid:15) (cid:17) ≤ Ce − cN(cid:15) , c, C > . By Theorem E.7 there exists constants γ, c, C > such that for all (cid:15) ∈ (0 , theevent n ∀ v ∈ B ∞ (0 , , V λ ( v ) ≥ max k v k ∞ ≤ V λ ( v ) − γ(cid:15) = ⇒ N k v − v λ k ≤ (cid:15) o \ n(cid:0) k v λ k− p N κ ∗ ( λ ) (cid:1) ≤ N (cid:15) o (73)has probability at least C(cid:15) e − cn(cid:15) . On the event (73), we have for all v ∈ D (cid:15) : N k v − v λ k ≥ N (cid:0) k v k − k v λ k (cid:1) ≥ N (cid:0) √ N (cid:15) − √ N (cid:15) (cid:1) ≥ (cid:15) . This gives that on the event (73), for all v ∈ D (cid:15) , V λ ( v ) < max k v k ∞ ≤ V λ ( v ) − γ(cid:15) . The intersection of (73) with theevent (cid:8) max v ∈ D (cid:15) V λ ( v ) ≥ max k v k ∞ ≥ V λ ( v ) − γ(cid:15) (cid:9) is therefore empty: the lemma is proved. (cid:3) Proof of Theorem E.3.
Let $\gamma > 0$ be a constant satisfying the statement of Lemma E.7. Let $\epsilon \in (0, 1]$ and define
$$D_\epsilon = \big\{v \in B_\infty(0,1) \,\big|\, \big(\|v\| - \sqrt{N\kappa^*(\lambda)}\big)^2 \geq N\epsilon\big\}.$$
$D_\epsilon$ is a closed set.
$$\mathbb{P}\Big(\exists v\in B_\infty(0,1),\ \big|\tfrac1N\|v\|^2 - \kappa^*(\lambda)\big| \geq \epsilon\ \text{and}\ V_\lambda(v) \geq \max V_\lambda - \gamma\epsilon\Big) = \mathbb{P}\Big(\max_{v\in D_\epsilon} V_\lambda(v) \geq \max_{\|v\|_\infty\leq 1} V_\lambda(v) - \gamma\epsilon\Big)$$
$$\leq \mathbb{P}\Big(\max_{v\in D_\epsilon} \mathcal V_\lambda(v) \geq \max_{\|v\|_\infty\leq 1} \mathcal V_\lambda(v) - \gamma\epsilon\Big) + C\epsilon^{-2}e^{-cn\epsilon^2} \leq C\epsilon^{-2}e^{-cn\epsilon^2},$$
where we used successively Proposition E.4 and Lemma E.7. $\square$

E.3.2 The empirical law of $\hat v_\lambda$: proof of Theorem E.1

Theorem E.1 now follows from Proposition E.4 and the following lemma.
Lemma E.8
There exist constants $\gamma, c, C > 0$ that depend only on $\Omega$, such that for all $\epsilon \in (0, 1]$ we have
$$\mathbb{P}\Big(\max_{v\in D_\epsilon} V_\lambda(v) \geq \max_{\|v\|_\infty\leq 1} V_\lambda(v) - \gamma\epsilon\Big) \leq C\epsilon^{-2}e^{-cn\epsilon^2\log(1/\epsilon)^{-2}},$$
where $D_\epsilon = \big\{v \in B_\infty(0,1) \,\big|\, W_2\big(\hat\mu_{(v,\theta^\star)}, \nu^*_\lambda\big)^2 \geq \epsilon\big\}$.

Proof.
By Theorem E.7 and Proposition F.2 there exists constants γ, c, C > such that for all (cid:15) ∈ (0 , ] the event n ∀ v ∈ B ∞ (0 , , V λ ( v ) ≥ max k v k ∞ ≤ V λ ( v ) − γ(cid:15) = ⇒ N k v − v λ k ≤ (cid:15) o \ n W (cid:0) ν ∗ λ , b µ ( v λ ,θ ? ) (cid:1) ≤ (cid:15) o (74)has probability at least − C(cid:15) − exp (cid:0) − cn(cid:15) (cid:1) − C(cid:15) − a exp (cid:0) − cN (cid:15) (cid:15) a log( (cid:15) ) − (cid:1) ≥ − C(cid:15) − max(1 ,a ) exp (cid:0) − cN (cid:15) (cid:15) a log( (cid:15) ) − (cid:1) . On the event (74), we have for all v ∈ D (cid:15) : N k v − v λ k ≥ W (cid:0)b µ ( v,θ ? ) , b µ ( v λ ,θ ? ) (cid:1) ≥ (cid:16) W ( b µ ( v,θ ? ) , ν ∗ λ ) − W ( ν ∗ λ , b µ ( v λ ,θ ? ) ) (cid:17) ≥ (cid:15) . This gives that on the event (74), for all v ∈ D (cid:15) , V λ ( v ) < max k v k ∞ ≤ V λ ( v ) − γ(cid:15) . The intersection of (74) with theevent (cid:8) max v ∈ D (cid:15) V λ ( v ) ≥ max k v k ∞ ≥ V λ ( v ) − γ(cid:15) (cid:9) is therefore empty: the lemma is proved. (cid:3) Proof of Theorem E.1.
Let $\gamma > 0$ be a constant satisfying the statement of Lemma E.8. Let $\epsilon \in (0, 1]$ and define $D_\epsilon = \big\{v \in B_\infty(0,1) \,\big|\, W_2(\hat\mu_{(v,\theta^\star)}, \nu^*_\lambda)^2 \geq \epsilon\big\}$. $D_\epsilon$ is a closed set.
$$\mathbb{P}\Big(\exists v\in B_\infty(0,1),\ W_2(\hat\mu_{(v,\theta^\star)}, \nu^*_\lambda)^2 \geq \epsilon\ \text{and}\ V_\lambda(v) \geq \max V_\lambda - \gamma\epsilon\Big) = \mathbb{P}\Big(\max_{v\in D_\epsilon} V_\lambda(v) \geq \max_{\|v\|_\infty\leq 1} V_\lambda(v) - \gamma\epsilon\Big)$$
$$\leq \mathbb{P}\Big(\max_{v\in D_\epsilon} \mathcal V_\lambda(v) \geq \max_{\|v\|_\infty\leq 1} \mathcal V_\lambda(v) - \gamma\epsilon\Big) + Ce^{-cn\epsilon^2} \leq C\epsilon^{-2}e^{-cn\epsilon^2\log(1/\epsilon)^{-2}},$$
where we used successively Proposition E.4 and Lemma E.8. $\square$

E.3.3 Proof of Theorem E.5

Lemma E.9 There exist constants $\gamma, c, C > 0$ that depend only on $\Omega$, such that for all $\epsilon \in (0, 1]$ we have
$$\mathbb{P}\Big(\max_{v\in D_\epsilon} V_\lambda(v) \geq \max_{\|v\|_\infty\leq 1} V_\lambda(v) - \gamma\epsilon\Big) \leq C\epsilon^{-2}e^{-cn\epsilon^2},$$
where $D_\epsilon = \big\{v \in B_\infty(0,1) \,\big|\, \tfrac1N\#\{i \,|\, |v_i| \geq 1-\epsilon\} > s^*(\lambda) + 2(1+\alpha_{\max})\epsilon\big\}$.

Proof.
Let (cid:15) ∈ (0 , and define s (cid:15) = 1 N n i ∈ { , . . . , N } (cid:12)(cid:12)(cid:12) | v λ,i | ≥ − (cid:15) o .s (cid:15) is the mean of independent Bernoulli random variables. By Hoeffding’s inequality we have P (cid:16) s (cid:15) ≤ P (cid:0) | τ − ∗ Θ + Z | ≥ α ∗ − α ∗ (cid:15) (cid:1) + (cid:15) (cid:17) ≥ − e − N(cid:15) . Compute P (cid:0) | τ − ∗ Θ + Z | ≥ α ∗ − α ∗ (cid:15) (cid:1) = E (cid:20) Φ (cid:16) Θ τ ∗ ( λ ) − α ∗ ( λ ) + 2 α ∗ ( λ ) (cid:15) (cid:17) + Φ (cid:16) − Θ τ ∗ ( λ ) − α ∗ ( λ ) + 2 α ∗ ( λ ) (cid:15) (cid:17)(cid:21) ≤ s ∗ ( λ ) + 2 α max (cid:15) . We obtain P (cid:16) s (cid:15) ≤ s ∗ ( λ ) + 2 α max (cid:15) + (cid:15) (cid:17) ≥ − e − N(cid:15) . By Theorem E.7 there exists a constant γ > such thatthe event n ∀ v ∈ B ∞ (0 , , V λ ( v ) ≥ max k v k ∞ ≤ V λ ( v ) − γ(cid:15) = ⇒ N k v − v λ k < (cid:15) o \ n s (cid:15) ≤ s ∗ ( λ ) + α max (cid:15) + (cid:15) o has probability at least − C(cid:15) e − cn(cid:15) . We have on this event, for all v ∈ D (cid:15) , N k v − v λ k ≥ (cid:15) . Therefore, on theabove event we have max v ∈ D (cid:15) V λ ( v ) < max k v k ∞ ≤ V λ ( v ) − γ(cid:15) , which concludes the proof. (cid:3) Proof of Theorem E.5.
Let $\gamma > 0$ be a constant satisfying the statement of Lemma E.9. Let $\epsilon \in (0, 1]$ and define
$$D_\epsilon = \Big\{v \in B_\infty(0,1) \,\Big|\, \tfrac1N\#\{i \,|\, |v_i| \geq 1-\epsilon\} \geq s^*(\lambda) + 2(1+\alpha_{\max})\epsilon\Big\}.$$
$D_\epsilon$ is a closed set.
$$\mathbb{P}\Big(\tfrac1N\#\{i \,|\, |\hat v_{\lambda,i}| \geq 1-\epsilon\} \geq s^*(\lambda) + 2(1+\alpha_{\max})\epsilon\Big) \leq \mathbb{P}\Big(\max_{v\in D_\epsilon} V_\lambda(v) \geq \max_{\|v\|_\infty\leq 1} V_\lambda(v) - \gamma\epsilon\Big)$$
$$\leq \mathbb{P}\Big(\max_{v\in D_\epsilon} \mathcal V_\lambda(v) \geq \max_{\|v\|_\infty\leq 1} \mathcal V_\lambda(v) - \gamma\epsilon\Big) + C\epsilon^{-2}e^{-cn\epsilon^2} \leq C\epsilon^{-2}e^{-cn\epsilon^2},$$
where we used successively Proposition E.4 and Lemma E.9. $\square$

E.3.4 Uniform control over $\lambda$: proof of Theorems E.2, E.4 and E.6

Theorems E.2, E.4 and E.6 are deduced from Theorems E.1, E.3 and E.5 by an $\epsilon$-net argument, as we did to deduce Theorems 3.1 and 3.2 from Theorems 5.3 and C.1. Since the ideas are the same, we only present here the key argument:
Proposition E.5
Assume that $D$ is $F_0(s)$ or $F_p(\xi)$ for some $s < s_{\max}(\delta)$ and $\xi > 0$, $p > 0$. Let $q = 0$ if $D = F_0(s)$ and $q = (1/p - 1/2)_+$ if $D = F_p(\xi)$. Then there exist constants $K, C, c > 0$ that depend only on $\Omega$ such that for all $\theta^\star \in D$,
$$\mathbb{P}\Big(\forall \lambda_1, \lambda_2 \in [\lambda_{\min}, \lambda_{\max}],\ \tfrac1N\|\hat v_{\lambda_1} - \hat v_{\lambda_2}\|^2 \leq KN^q|\lambda_1 - \lambda_2|\Big) \geq 1 - Ce^{-cn}. \qquad (75)$$

Proof. By Proposition D.1, there exists a constant $K$ such that with probability at least $1 - Ce^{-cn}$ we have
$$\forall \lambda_1, \lambda_2 \in [\lambda_{\min}, \lambda_{\max}],\quad \tfrac1n\|\hat u_{\lambda_1} - \hat u_{\lambda_2}\|^2 \leq KN^q|\lambda_1 - \lambda_2|.$$
Notice now that $\hat v_\lambda = -\tfrac1\lambda X^T\hat u_\lambda$ and that with probability at least $1 - e^{-n/2}$, $\sigma_{\max}(X) \leq \delta^{-1/2} + 2$ (by Proposition G.6), which, combined with the above inequality, proves the proposition. $\square$

Appendix F: Some auxiliary results and proofs
F.1 Proof of Remark 2
Let k ≤ N and define the vector θ ? = ( N, N, . . . , kN, , . . . , . With the definitions given in Remark 2, weclaim that W ( b µ ( b θ λ ,θ ? ) , µ ∗ λ ) ≥ p k/N with probability at least − e − ck , for some constant c > . Indeed, considerthe case λ = 0 , τ ∗ = 1 , and let P , E denote probability and expectation with respect to the coupling that achievesthe Wasserstein distance. This is a coupling for a triple of random variables ( I, Θ , Z ) , with I ∼ Unif (cid:0) { , . . . , N } (cid:1) , (Θ , Z ) ∼ b µ θ ? ⊗ N (0 , , with W ( b µ ( b θ λ ,θ ? ) , µ ∗ λ ) = E (cid:8) ( θ I − Θ) (cid:9) + E (cid:8) ( θ I + z I − Θ − Z ) (cid:9) ≡ A + B . (76)We will proceed to bound separately the two terms above. Define δ i ≡ P (Θ = θ ?i | I = i ) , and δ max ≡ max i ≤ k δ i .Since Θ ∈ { , N, . . . , kN } with probability one, we have A ≥ N k X i =1 E (cid:8) ( θ ?I − Θ) (cid:12)(cid:12) I = i (cid:9) δ i ≥ N k X i =1 δ i ≥ N δ max . (77)For the second term, we have B ≥ k X i =1 E (cid:8) ( θ ?I + z I − Θ − Z ) I = i Θ= θ ?i (cid:9) = k X i =1 E (cid:8) ( z i − Z ) I = i Θ= θ ?i (cid:9) (78) = k X i =1 E (cid:8) ( z i − Z ) Θ= θ ?i (cid:9) − k X i =1 E (cid:8) ( z i − Z ) I = i Θ= θ ?i (cid:9) . (79)Note that, by the coupling definition, P (cid:0) I = i (cid:12)(cid:12) Θ = θ ?i (cid:1) = N P (cid:0) I = i ; Θ = θ ?i (cid:1) = 1 − δ i . Using the fact that Θ and Z are independent random variables, together with Cauchy-Schwartz inequality,we get B ≥ k X i =1 E (cid:8) ( z i − Z ) } P (Θ = θ ?i ) − k X i =1 E (cid:8) ( z i − Z ) Θ= θ ?i (cid:9) / P (cid:0) I = i ; Θ = θ ?i (cid:1) / (80) ≥ N k X i =1 E (cid:8) ( z i − Z ) } − N k X i =1 E (cid:8) ( z i − Z ) (cid:9) / P (cid:0) I = i (cid:12)(cid:12) Θ = θ ?i (cid:1) / (81) ≥ N k X i =1 E (cid:8) ( z i − Z ) } − N k X i =1 δ / i E (cid:8) ( z i − Z ) (cid:9) / (82) ≥ N k X i =1 (1 + z i ) − N k X i =1 δ / i (cid:0) z i + z i (cid:1) / . (83)where in the last step we used the fact that Z ∼ N (0 , . Using (3 + 6 x + x ) ≤ x ) , we thus conclude B ≥ − δ / N k X i =1 (1 + z i ) . (84)By concentration properties of chi-squared random variables, for any ε > , there exists c ( ε ) > such that, withprobability at least − e − ck we have k P ki =1 z i ≥ − ε . Hence, with the same probability W ( b µ ( b θ λ ,θ ? ) , µ ∗ λ ) ≥ N δ max + 2 kN (1 − δ / )(1 − ε ) (85) ≥ kN . (86)47e last inequality follows by lower bounding the first term for δ max > /N , and the second for δ max ≤ /N , andfixing ε a sufficiently small constant. F.2 Concentration properties of w λ We prove in this section concentrations of the norms and some scalar product of w λ . Lemma F.1
There exist constants $c, C > 0$ that only depend on $\Omega$ such that for all $t \geq 0$ the event
$$\Big\{\Big|\tfrac1n g^Tw_\lambda - \mathbb{E}\big[\tfrac1n g^Tw_\lambda\big]\Big| \leq t,\ \Big|\tfrac{\|w_\lambda\|^2}{n} - \mathbb{E}\Big[\tfrac{\|w_\lambda\|^2}{n}\Big]\Big| \leq t\ \text{and}\ \Big|\tfrac{|w_\lambda+\theta^\star|_1}{n} - \mathbb{E}\Big[\tfrac{|w_\lambda+\theta^\star|_1}{n}\Big]\Big| \leq t\Big\}$$
has probability at least $1 - Ce^{-ct^2n} - Ce^{-ctn}$.

Proof.
The function $g \mapsto w_\lambda = \big(\eta(\theta^\star_i + \tau^*(\lambda)g_i,\ \alpha^*(\lambda)\tau^*(\lambda)) - \theta^\star_i\big)_{1\leq i\leq N}$ is $\tau_{\max}$-Lipschitz. Consequently:
• $g \mapsto \tfrac{|w_\lambda+\theta^\star|_1}{n}$ is $\delta^{-1/2}n^{-1/2}\tau_{\max}$-Lipschitz. Therefore $\tfrac{|w_\lambda+\theta^\star|_1}{n}$ is $\tau_{\max}^2\delta^{-1}n^{-1}$-sub-Gaussian: for all $t \geq 0$,
$$\mathbb{P}\Big(\Big|\tfrac{|w_\lambda+\theta^\star|_1}{n} - \mathbb{E}\Big[\tfrac{|w_\lambda+\theta^\star|_1}{n}\Big]\Big| > t\Big) \leq 2e^{-nt^2\delta/(2\tau_{\max}^2)}.$$
• $g \mapsto \tfrac{\|w_\lambda\|}{\sqrt n}$ is $n^{-1/2}\tau_{\max}$-Lipschitz. Therefore $\tfrac{\|w_\lambda\|}{\sqrt n}$ is $\tau_{\max}^2 n^{-1}$-sub-Gaussian. Its expectation is bounded by $\mathbb{E}\tfrac{\|w_\lambda\|}{\sqrt n} \leq \big(\mathbb{E}\tfrac{\|w_\lambda\|^2}{n}\big)^{1/2} \leq \tau^* \leq \tau_{\max}$. By Proposition G.5, we obtain that $\tfrac{\|w_\lambda\|^2}{n}$ is $(Cn^{-1}, Cn^{-1})$-sub-Gamma for some constant $C$, and therefore for all $t \geq 0$,
$$\mathbb{P}\Big(\Big|\tfrac{\|w_\lambda\|^2}{n} - \mathbb{E}\Big[\tfrac{\|w_\lambda\|^2}{n}\Big]\Big| > t\Big) \leq 2e^{-cnt^2} + 2e^{-cnt}.$$
Now for $i \in \{1, \dots, N\}$, $g_iw_{\lambda,i} = \tau^*g_i^2 + g_i(w_{\lambda,i} - \tau^*g_i)$. We have $|w_{\lambda,i} - \tau^*g_i| \leq \alpha^*\tau^*$, $g_i$ is $1$-sub-Gaussian and $\mathbb{E}[|g_i|] = \sqrt{2/\pi} \leq 1$. Consequently, Lemma G.1 gives that $g_i(w_{\lambda,i} - \tau^*g_i)$ is $C\tau^{*2}\alpha^{*2}$-sub-Gaussian. This gives that $\tfrac1n g^T(w_\lambda - \tau^*g)$ concentrates exponentially fast around its mean. So does $\tfrac1n\|g\|^2$. $\square$

Lemma F.2
$$\tfrac1n\mathbb{E}\|w_\lambda\|^2 + \sigma^2 = \tau^*(\lambda)^2 \qquad\text{and}\qquad \tfrac1n\mathbb{E}\big[g^Tw_\lambda\big] = \tau^*(\lambda) - \beta^*(\lambda) = \tfrac{\tau^*(\lambda)}{\delta}s^*(\lambda).$$

Proof.
The first equality comes from Lemma A.5: since $(g_i) \overset{\text{i.i.d.}}{\sim} \mathcal N(0,1)$, we have $\mathbb{E}\|w_\lambda\|^2 = N\,\mathbb{E}\big[w^*(\alpha^*, \tau^*)^2\big]$. The second equality comes from the optimality condition for $\beta^*$, see Lemma A.6, and the definition (10) of $s^*(\lambda)$. $\square$
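The identities of Lemma F.2 only involve the scalar denoising problem, so they are easy to probe numerically once a pair $(\alpha, \tau)$ is given. The sketch below is our own illustration, not code from the paper: it uses placeholder values for $\alpha$, $\tau$ and a toy two-point prior for $\Theta$ (these are not the calibrated fixed-point values $\alpha^*(\lambda), \tau^*(\lambda)$), and checks that a Monte Carlo estimate of $\mathbb{E}[(\eta(\Theta + \tau Z, \alpha\tau) - \Theta)^2]$ agrees with deterministic quadrature.

```python
import numpy as np
from scipy import integrate

def eta(x, b):
    # Soft-thresholding denoiser: eta(x, b) = sign(x) * max(|x| - b, 0).
    return np.sign(x) * np.maximum(np.abs(x) - b, 0.0)

rng = np.random.default_rng(0)

# Placeholder scalar parameters (assumptions for illustration only).
alpha, tau = 1.2, 0.8
# Toy prior for Theta: P(Theta = 3) = 0.1, P(Theta = 0) = 0.9.
atoms, probs = np.array([3.0, 0.0]), np.array([0.1, 0.9])

# Monte Carlo estimate of E[(eta(Theta + tau Z, alpha tau) - Theta)^2].
m = 10**6
theta = rng.choice(atoms, size=m, p=probs)
z = rng.standard_normal(m)
mc = np.mean((eta(theta + tau * z, alpha * tau) - theta) ** 2)

# Same quantity by numerical integration against the Gaussian density.
phi = lambda z: np.exp(-z**2 / 2) / np.sqrt(2 * np.pi)
quad = sum(
    p * integrate.quad(lambda z, t=t: (eta(t + tau * z, alpha * tau) - t) ** 2 * phi(z),
                       -12, 12)[0]
    for t, p in zip(atoms, probs)
)

print(f"Monte Carlo : {mc:.5f}")
print(f"Quadrature  : {quad:.5f}")   # the two values should agree to ~1e-3
```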
The next proposition simply follows from Lemma F.4 and standard concentration arguments, so we omit its proof.

Proposition F.1
There exist constants $C, c > 0$ that only depend on $\Omega$ such that for all $\epsilon \in [0, 1]$,
$$\mathbb{P}\big(\big|\mathcal L_\lambda(w_\lambda) - \psi_\lambda(\beta^*(\lambda), \tau^*(\lambda))\big| > \epsilon\big) \leq Ce^{-cn\epsilon^2}.$$

F.3 Concentration of the empirical distribution

Proposition F.2
Let $\theta^\star \in F_p(\xi)$, where $p, \xi > 0$. Let $\mu = \hat\mu_{\theta^\star} \otimes \mathcal N(0,1)$ and let $\hat\mu$ be the empirical distribution of the entries of $(\theta^\star_i, g_i)_{1\leq i\leq N}$, where $g_1, \dots, g_N \overset{\text{i.i.d.}}{\sim} \mathcal N(0,1)$. Then there exist constants $C, c > 0$ that only depend on $\xi$ and $p$, such that for all $\epsilon \in (0, 1]$,
$$\mathbb{P}\big(W_2(\hat\mu, \mu)^2 > \epsilon\big) \leq C\epsilon^{-a}\exp\big(-cN\epsilon^2\epsilon^a\log(1/\epsilon)^{-2}\big),$$
where $a = \tfrac12 + \tfrac1p$. Before proving Proposition F.2, we will need two simple lemmas. For $r \geq 0$ and $x \in \mathbb{R}$ we use the notation
$$x_{|r} = \begin{cases} x & \text{if } -r \leq x \leq r, \\ r & \text{if } x \geq r, \\ -r & \text{if } x \leq -r.\end{cases}$$
Let $\mu_{|r}$ be the law of $(\Theta, Z_{|r})$ where $(\Theta, Z) \sim \hat\mu_{\theta^\star} \otimes \mathcal N(0,1)$.

Lemma F.3 $W_2(\mu, \mu_{|r})^2 \leq 2e^{-r^2/2}$.

Proof.
We have
$$W_2(\mu, \mu_{|r})^2 \leq \mathbb{E}\big[(Z - Z_{|r})^2\big] \leq \frac{2}{\sqrt{2\pi}}\int_r^{+\infty}(z-r)^2e^{-z^2/2}\,dz \leq 2e^{-r^2/2}. \qquad \square$$
Let $\hat\mu_{|r}$ be the empirical distribution of the entries of $(\theta^\star_i, g_{i|r})_{1\leq i\leq N}$.

Lemma F.4
With probability at least $1 - e^{-N\epsilon^2}$, we have $W_2(\hat\mu, \hat\mu_{|r})^2 \leq \epsilon + 2e^{-r^2/2}$.

Proof.
Obviously $W_2(\hat\mu, \hat\mu_{|r})^2 \leq \frac1N\sum_{i=1}^N(g_i - g_{i|r})^2$. The function $x \mapsto x - x_{|r}$ is $1$-Lipschitz, so the variables $(g_i - g_{i|r})^2$ are i.i.d. $(16, 4)$-sub-Gamma. Therefore for all $\epsilon \in [0, 1]$,
$$\mathbb{P}\Big(\frac1N\sum_{i=1}^N(g_i - g_{i|r})^2 > \mathbb{E}\big[(Z - Z_{|r})^2\big] + \epsilon\Big) \leq e^{-N\epsilon^2}.$$
And we conclude using $\mathbb{E}\big[(Z - Z_{|r})^2\big] \leq 2e^{-r^2/2}$, which we proved in the lemma above. $\square$

We now need some concentration results for empirical measures in Wasserstein distance. The next proposition follows from a direct application of Theorem 2 of [23] to distributions with bounded support. Notice that the results of [23] are much more general than this.
Proposition F.3
Let $A_1, \dots, A_m \overset{\text{i.i.d.}}{\sim} \nu$ be a collection of i.i.d. real random variables, bounded by some constant $r > 0$. Let
$$\hat\nu_m = \frac1m\sum_{i=1}^m\delta_{A_i}$$
be the empirical distribution of $A_1, \dots, A_m$. Then there exist two absolute constants $c, C > 0$ such that for all $t \geq 0$,
$$\mathbb{P}\big(W_2(\nu, \hat\nu_m)^2 \geq r^2t\big) \leq C\exp(-cmt^2).$$

Proof of Proposition F.2.
We are now going to couple µ | r with b µ | r . Let R > . Let k ≥ and let δ = 2 R/k .Define B l = (cid:2) − R + ( l − δ, − R + lδ (cid:1) , l = 1 , . . . , k . We define also B = ( −∞ , R ) ∪ [ R, + ∞ ) . For l = 0 , . . . k we write I l = { i | θ ?i ∈ B l } and N l = I l . Let t > . Let l ∈ { , . . . , k } . The random variables ( g i | r ) i ∈ I l are i.i.d. and bounded by r . By the propositionabove, one can couple i l ∼ Unif( I l ) with Z l ∼ N (0 , such that we have with probability at least − Ce − ct N . E (cid:2) ( Z l | r − g i l | r ) (cid:3) ≤ tr r NN l , where E denotes the expectation with respect to i l and Z l . Let j l ∼ Unif( I l ) independently of everything else.For l = 0 , we define ( i , Z ) ∼ Unif( I ) ⊗N (0 , , independently of everything else. We have with probabilityat least − Ce − ct N : E (cid:2) ( Z | r − g i | r ) (cid:3) = E (cid:2) Z | r (cid:3) + E (cid:2) g i | r (cid:3) ≤ tr r NN , where E denotes the expectation with respect to Z and i . Indeed, E (cid:2) g i | r (cid:3) = N P i ∈ I g i | r ≤ tr q NN with probability at least − Ce − ct N . The equality comes from the fact that Z and i are independent. Finally,we define j = i .Let us now define the random variable L whose law is given by P ( L = l ) = N l N , independently of everythingelse. Define ( Y = (cid:0) θ ?j L , Z L | r (cid:1) ,Y = (cid:0) θ ?i L , g i L | r (cid:1) . ( Y , Y ) is a coupling of ( µ | r , b µ | r ) . Let E denote the expectation with respect to ( i l , Z l ) ≤ l ≤ k and L . Then E k Y − Y k = k X l =0 N l N E h (cid:0) θ ?i l − θ ?j l (cid:1) + (cid:0) Z l | r − g i l | r (cid:1) i ≤ k X l =1 N l N r NN l tr + δ ! + N N tr r NN ! ≤ δ + √ ktr + 2 N N ≤ δ + √ ktr + 2 ξ p R p , with probability at least − C ( k + 1) e − ct N , where the last inequality comes from Markov’s inequality, since θ ? ∈ F p ( ξ ) .Let now (cid:15) ∈ (0 , ] . Let us chose r = p − (cid:15) ) , R = (cid:15) − /p and k = d (cid:15) − / − /p e ≤ (cid:15) − / − /p , so that δ = 2 R/k ≤ √ (cid:15) . Consequently E (cid:13)(cid:13) Y − Y k ≤ (4 + 2 ξ p ) (cid:15) + 2 √ (cid:15) − / − / (2 p ) | log( (cid:15) ) | t . So if we chose t = | log( (cid:15) ) | − (cid:15) + p we obtain P (cid:16) W ( µ | r , b µ | r ) ≤ (4 + 2 ξ p + 2 √ (cid:15) (cid:17) ≥ − C(cid:15) − /p − / exp( − cN (cid:15) (cid:15) / /p / log( (cid:15) ) ) . Combining this with Lemmas F.3 and F.4 proves the proposition. (cid:3)
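The rate in Proposition F.3 is easy to see in simulation. For one-dimensional samples, the squared $W_2$ distance between an empirical measure and the true distribution reduces to a comparison of order statistics with quantiles, which gives a short self-contained experiment. The code below is our own illustration (the truncated-Gaussian choice of $\nu$ and all numeric values are ours, not the paper's):

```python
import numpy as np
from scipy.stats import truncnorm

rng = np.random.default_rng(1)
r = 2.0                      # support bound: nu = standard Gaussian truncated to [-r, r]
nu = truncnorm(-r, r)

def w2_sq_empirical(sample, dist):
    # 1-d squared W2 distance between an empirical measure and dist,
    # via the quantile representation W2^2 = int_0^1 (F^{-1}(u) - Fhat^{-1}(u))^2 du.
    m = len(sample)
    u = (np.arange(m) + 0.5) / m
    return np.mean((np.sort(sample) - dist.ppf(u)) ** 2)

for m in [100, 1000, 10000]:
    est = np.mean([w2_sq_empirical(nu.rvs(size=m, random_state=rng), nu)
                   for _ in range(50)])
    print(f"m = {m:6d}   E[W2^2] ~ {est:.2e}   m * E[W2^2] ~ {m * est:.2f}")
# m * E[W2^2] stays of order one, matching the r^2 * t scale of Proposition F.3.
```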
F.4 Sparsity of the Lasso estimator
The goal of this section is to prove:

Theorem F.1
Assume here that $D$ is either $F_0(s)$ or $F_p(\xi)$ for some $0 \leq s < s_{\max}(\delta)$ and $\xi > 0$, $p > 0$. There exist constants $C, c > 0$ that only depend on $\Omega$, such that for all $\epsilon \in (0, 1]$,
$$\sup_{\theta^\star\in D}\ \mathbb{P}\Big(\sup_{\lambda\in[\lambda_{\min},\lambda_{\max}]}\Big|\tfrac1N\|\hat\theta_\lambda\|_0 - s^*(\lambda)\Big| \geq \epsilon\Big) \leq C\epsilon^{-2}N^qe^{-cN\epsilon^2},$$
where $q = 0$ if $D = F_0(s)$ and $q = (1/p - 1/2)_+$ if $D = F_p(\xi)$.

Since $\#\{i \,|\, |\hat v_{\lambda,i}| = 1\} \geq \|\hat\theta_\lambda\|_0$, Theorem E.6 gives that
$$\sup_{\theta^\star\in D}\ \mathbb{P}\Big(\exists\lambda\in[\lambda_{\min},\lambda_{\max}],\ \tfrac1N\|\hat\theta_\lambda\|_0 \geq s^*(\lambda) + \epsilon\Big) \leq C\epsilon^{-2}N^qe^{-cN\epsilon^2}. \qquad (87)$$
It remains to prove the converse lower bound in order to get Theorem F.1. We start with the following 'local stability' property of the Lasso cost:

Proposition F.4
There exist constants $C, c, \gamma > 0$ that only depend on $\Omega$ such that for all $\epsilon \in (0, 1]$,
$$\sup_{\lambda\in[\lambda_{\min},\lambda_{\max}]}\ \sup_{\theta^\star\in D}\ \mathbb{P}\Big(\exists\theta\in\mathbb{R}^N,\ \tfrac1N\|\theta\|_0 < s^*(\lambda) - \epsilon\ \text{and}\ L_\lambda(\theta) \leq \min L_\lambda + \gamma\epsilon\Big) \leq C\epsilon^{-2}e^{-cN\epsilon^2}.$$
Proposition F.4 is a consequence of Proposition C.1 and Lemma F.5 below.
Lemma F.5
There exist constants $\gamma, c, C > 0$ that only depend on $\Omega$ such that for all $\epsilon \in (0, 1]$ we have
$$\mathbb{P}\Big(\min_{w\in D_\epsilon}\mathcal L_\lambda(w) \leq \min_{w\in\mathbb{R}^N}\mathcal L_\lambda(w) + 3\gamma\epsilon\Big) \leq C\epsilon^{-2}e^{-cn\epsilon^2},$$
where $D_\epsilon = \big\{w\in\mathbb{R}^N \,\big|\, \tfrac1N\|w+\theta^\star\|_0 < s^*(\lambda) - \epsilon\big\}$.

Proof.
Define $x_\lambda = w_\lambda + \theta^\star = \big(\eta(\theta^\star_i + \tau^*g_i,\ \alpha^*\tau^*)\big)_{1\leq i\leq N}$, and for $r > 0$,
$$s_r = \frac1N\#\big\{i\in\{1,\dots,N\} \,\big|\, |x_{\lambda,i}| \geq r\big\}.$$
$s_r$ is a mean of independent Bernoulli random variables, so by Hoeffding's inequality we have
$$\mathbb{P}\Big(s_r \geq \mathbb{P}\big(|\Theta + \tau^*Z| \geq \alpha^*\tau^* + r\big) - \tfrac\epsilon2\Big) \geq 1 - e^{-N\epsilon^2/2}.$$
Compute
$$\mathbb{P}\big(|\Theta + \tau^*Z| \geq \alpha^*\tau^* + r\big) = \mathbb{E}\Big[\Phi\Big(\tfrac{\Theta}{\tau^*(\lambda)} - \alpha^*(\lambda) - \tfrac{r}{\tau^*(\lambda)}\Big) + \Phi\Big(-\tfrac{\Theta}{\tau^*(\lambda)} - \alpha^*(\lambda) - \tfrac{r}{\tau^*(\lambda)}\Big)\Big] \geq s^*(\lambda) - \tfrac r\sigma.$$
Let us choose $r = \sigma\epsilon/2$. We then have $\mathbb{P}\big(s_r \geq s^*(\lambda) - \epsilon\big) \geq 1 - e^{-N\epsilon^2/2}$. By Theorem B.1 there exists a constant $\gamma > 0$ such that the event
$$\Big\{\forall w\in\mathbb{R}^N,\ \mathcal L_\lambda(w) \leq \min_{v\in\mathbb{R}^N}\mathcal L_\lambda(v) + 3\gamma\epsilon \implies \tfrac1N\|w - w_\lambda\|^2 < \tfrac{\sigma^2\epsilon^3}{8}\Big\} \cap \big\{s_r \geq s^*(\lambda) - \epsilon\big\} \qquad (88)$$
has probability at least $1 - C\epsilon^{-2}e^{-cn\epsilon^2}$. On this event we have, for all $w \in D_\epsilon$,
$$\tfrac1N\|w - w_\lambda\|^2 = \tfrac1N\|w + \theta^\star - x_\lambda\|^2 \geq \epsilon\,r^2 = \tfrac{\sigma^2\epsilon^3}{4},$$
and consequently $\min_{w\in D_\epsilon}\mathcal L_\lambda(w) > \min_{w\in\mathbb{R}^N}\mathcal L_\lambda(w) + 3\gamma\epsilon$. We conclude
$$\mathbb{P}\Big(\min_{w\in D_\epsilon}\mathcal L_\lambda(w) \leq \min_{w\in\mathbb{R}^N}\mathcal L_\lambda(w) + 3\gamma\epsilon\Big) \leq C\epsilon^{-2}e^{-cn\epsilon^2}. \qquad \square$$
Using the same arguments that we used to deduce Theorems 3.1 and 3.2 (14) from Theorem 5.3 and Theorem C.1 in Section C.2, we deduce from Proposition F.4 that for all $\epsilon \in (0, 1]$,
$$\sup_{\theta^\star\in D}\ \mathbb{P}\Big(\exists\lambda\in[\lambda_{\min},\lambda_{\max}],\ \tfrac1N\|\hat\theta_\lambda\|_0 < s^*(\lambda) - \epsilon\Big) \leq C\epsilon^{-2}N^qe^{-cn\epsilon^2}.$$
This proves, together with (87), Theorem F.1.
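The Gaussian computation used above, rewriting $\mathbb{P}(|\Theta + \tau Z| \geq \alpha\tau + r)$ as an expectation of two Gaussian tail functions $\Phi$, is easy to sanity-check by simulation. A minimal sketch (ours; the values of $\alpha$, $\tau$, $r$ and the prior are placeholders, not the paper's calibrated parameters):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
alpha, tau, r = 1.0, 0.9, 0.1
atoms, probs = np.array([2.0, 0.0]), np.array([0.2, 0.8])   # toy prior for Theta

# Monte Carlo estimate of P(|Theta + tau Z| >= alpha * tau + r).
m = 10**6
theta = rng.choice(atoms, size=m, p=probs)
z = rng.standard_normal(m)
mc = np.mean(np.abs(theta + tau * z) >= alpha * tau + r)

# Closed form: E[ Phi(Theta/tau - alpha - r/tau) + Phi(-Theta/tau - alpha - r/tau) ].
closed = np.sum(probs * (norm.cdf(atoms / tau - alpha - r / tau)
                         + norm.cdf(-atoms / tau - alpha - r / tau)))

print(f"Monte Carlo: {mc:.5f}   closed form: {closed:.5f}")
```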
F.5 Proof of Theorem 3.3
Recall that the distributions µ ∗ λ and ν ∗ λ are respectively defined by Definition 3.3 and (63). Let (cid:15) ∈ (0 , . Fromnow, we will work on the event E = n ∀ λ ∈ [ λ min , λ max ] , W ( b µ ( b θ λ ,θ ? ) , µ ∗ λ ) + W ( b µ ( b v λ ,θ ? ) , ν ∗ λ ) ≤ (cid:15) o\ n ∀ λ ∈ [ λ min , λ max ] , (cid:12)(cid:12)(cid:12) N k b θ λ k − s ∗ ( λ ) (cid:12)(cid:12)(cid:12) + (cid:12)(cid:12)(cid:12) N (cid:8) i (cid:12)(cid:12) | b v λ,i | = 1 (cid:9) − s ∗ ( λ ) (cid:12)(cid:12)(cid:12) ≤ (cid:15) o , which has probability at least − C(cid:15) − e − cN(cid:15) from what we have just seen, and Theorems 3.1, E.2, F.1 and E.6.From now, E and P will denote the probability with respect to the empirical distributions of the entries of thevectors we study, and the variables that we couple with them. Let λ ∈ [ λ min , λ max ] . On the event E one cancouple (Θ x , Z x ) ∼ ˆ µ θ ? ⊗ N (0 , and (Θ v , Z v ) ∼ ˆ µ θ ? ⊗ N (0 , with (Θ , b Θ λ , b V λ , b Θ dλ ) which is sampled fromthe empirical distribution of the entries of ( θ ? , b θ λ , b v λ , b θ dλ ) , such that E h(cid:0) b Θ λ − η (Θ x + τ ∗ Z x , α ∗ τ ∗ ) (cid:1) + (cid:0) Θ − Θ x (cid:1) i ≤ (cid:15) , E h(cid:0) b V λ + 1 α ∗ τ ∗ (cid:0) η (Θ v + τ ∗ Z v , α ∗ τ ∗ ) − Θ v − τ ∗ Z v (cid:1)(cid:1) + (cid:0) Θ − Θ v (cid:1) i ≤ (cid:15) . Let E = n(cid:12)(cid:12) b Θ λ − η (Θ x + τ ∗ Z x , α ∗ τ ∗ ) (cid:12)(cid:12) ≤ (cid:15) , (cid:12)(cid:12) b V λ + 1 α ∗ τ ∗ (cid:0) η (Θ v + τ ∗ Z v , α ∗ τ ∗ ) − Θ v − τ ∗ Z v (cid:1)(cid:12)(cid:12) ≤ (cid:15) o . By Chebychev’s inequality, P ( E ) ≥ − C(cid:15) , for some constant C > . Let us also define the event E = n Θ x + τ ∗ Z x = α ∗ τ ∗ and Θ v + τ ∗ Z v = α ∗ τ ∗ o . Θ x + τ ∗ Z x and Θ v + τ ∗ Z v admit a density with respect to Lebesgue’s measure. Therefore P ( E ) = 1 . Lemma F.6
The event E = n | b Θ λ | 6∈ (0 , (cid:15) ] and | b V λ | 6∈ [1 − (cid:15) , o has probability at least − C(cid:15) . Proof.
We denote here by O ( (cid:15) ) quantities that are bounded by C(cid:15) , from some constant C . Since Θ x + τ ∗ Z x admits a density with respect to Lebesgue’s measure we have P (cid:16)(cid:12)(cid:12) η (Θ x + τ ∗ Z x , α ∗ τ ∗ ) (cid:12)(cid:12) (0 , (cid:15) ] (cid:17) = 1 − O ( (cid:15) ) . Consequently, since the events E has probability at least − O ( (cid:15) ) , we have P ( | b Θ λ | ∈ [0 , (cid:15) ])= P (cid:16) | b Θ λ | ∈ [0 , (cid:15) ] and | η (Θ x + τ ∗ Z x , α ∗ τ ∗ ) | 6∈ (0 , (cid:15) ] and (cid:12)(cid:12) b Θ λ − η (Θ x + τ ∗ Z x , α ∗ τ ∗ ) (cid:12)(cid:12) ≤ (cid:15) (cid:17) + O ( (cid:15) )= P (cid:16) η (cid:0) Θ x + τ ∗ Z x , α ∗ τ ∗ (cid:1) = 0 (cid:17) + O ( (cid:15) ) = s ∗ ( λ ) + O ( (cid:15) ) . Since P ( b Θ λ = 0) = s ∗ ( λ ) + O ( (cid:15) ) because we are working on E , we conclude that P (cid:0) | b Θ λ | ∈ (0 , (cid:15) ] (cid:1) = O ( (cid:15) ) .One can prove the same way that P (cid:0) | b V λ | ∈ [1 − (cid:15) , (cid:1) = O ( (cid:15) ) , which gives the desired result. (cid:3) emma F.7 The event E = n b Θ λ = 0 ⇐⇒ b V λ = sign( b Θ λ ) o has probability at least − C(cid:15) , for some constant C > . Proof.
Since b v λ ∈ ∂ | b θ λ | , b θ λ,i > implies that b v λ,i = sign( b θ λ,i ) . Thus P (cid:0) b Θ λ = 0 = ⇒ b V λ = sign( b θ λ,i ) (cid:1) = 1 . Wehave thus (cid:8) b Θ λ = 0 (cid:9) ⊂ (cid:8) | b V λ | = 1 (cid:9) . (89)On the event E we have (cid:12)(cid:12)(cid:12) N k b θ λ k − s ∗ ( λ ) (cid:12)(cid:12)(cid:12) + (cid:12)(cid:12)(cid:12) N (cid:8) i (cid:12)(cid:12) | b v λ,i | = 1 (cid:9) − s ∗ ( λ ) (cid:12)(cid:12)(cid:12) ≤ (cid:15) which gives P (cid:0) b Θ λ = 0 (cid:1) = s ∗ ( λ ) + O ( (cid:15) ) and P (cid:0) | b V λ | = 1 (cid:1) = s ∗ ( λ ) + O ( (cid:15) ) . We deduce then from (89) that P (cid:0) | b V λ | = 1 and b Θ λ = 0 (cid:1) = O ( (cid:15) ) and finally P (cid:0) b V λ = sign( b Θ λ ) = ⇒ b Θ λ = 0 (cid:1) ≥ − C(cid:15) . (cid:3) Lemma F.8
Let E = E ∩ E ∩ E ∩ E . The event E has probability at least − C(cid:15) and on E we have Θ v + τ ∗ Z v ≥ α ∗ τ ∗ ⇐⇒ Θ x + τ ∗ Z x ≥ α ∗ τ ∗ , and Θ v + τ ∗ Z v ≤ − α ∗ τ ∗ ⇐⇒ Θ x + τ ∗ Z x ≤ − α ∗ τ ∗ . Proof.
Since E , E , E and E have all a probability greater than − O ( (cid:15) ) , the event E = E ∩ E ∩ E ∩ E has probability at least − O ( (cid:15) ) . On E we have Θ v + τ ∗ Z v ≥ α ∗ τ ∗ ⇐⇒ Θ v + τ ∗ Z v − η (Θ v + τ ∗ Z v , α ∗ τ ∗ ) = α ∗ τ ∗ ⇐⇒ b V λ ≥ − (cid:15) (because we are on the event E ) ⇐⇒ b V λ = 1 (because we are on the event E ) ⇐⇒ b Θ λ > (because we are on the event E ) ⇐⇒ b Θ λ > (cid:15) (because we are on the event E ) ⇐⇒ η (Θ x + τ ∗ Z x , α ∗ τ ∗ ) > (because we are on the event E ) ⇐⇒ Θ x + τ ∗ Z x ≥ α ∗ τ ∗ (because we are on the event E ) . The second equivalence is proved exactly the same way. (cid:3)
Let us define X d = η (Θ x + τ ∗ Z x , α ∗ τ ∗ ) + Θ v + τ ∗ Z v − η (Θ v + τ ∗ Z v , α ∗ τ ∗ ) . We have E ( b Θ dλ − X d ) = 2 E h(cid:0) b Θ λ − η (Θ x + τ ∗ Z x , α ∗ τ ∗ ) (cid:1) i + 2 E h(cid:0) λ − n k b θ λ k b V λ − Θ v − τ ∗ Z v + η (Θ v + τ ∗ Z v , α ∗ τ ∗ ) (cid:1) i ≤ C(cid:15) , for some constant C > , because on the event E , N (cid:12)(cid:12) k b θ λ k − s ∗ ( λ ) (cid:12)(cid:12) ≤ (cid:15) , so λ − n k b θ λ k = λ − δ s ∗ ( λ ) + O ( (cid:15) ) = α ∗ τ ∗ + O ( (cid:15) ) . By Lemma F.8 above, we have on the event E , X d = ( Θ x + τ ∗ Z x if Θ x + τ ∗ Z x ≥ α ∗ τ ∗ or Θ x + τ ∗ Z x ≤ − α ∗ τ ∗ , Θ v + τ ∗ Z v otherwise . T x = (Θ x + τ ∗ Z x , Θ x ) and T v = (Θ v + τ ∗ Z v , Θ v ) .Since Θ x + τ x ∗ Z and Θ v + τ ∗ Z v have the same law and P (cid:0) Θ x + τ ∗ Z x ≥ α ∗ τ ∗ (cid:12)(cid:12) E (cid:1) = P (cid:0) Θ v + τ ∗ Z v ≥ α ∗ τ ∗ (cid:12)(cid:12) E (cid:1) (by Lemma F.8), we have P (cid:0) Θ x + τ ∗ Z x ≥ α ∗ τ ∗ (cid:12)(cid:12) E c (cid:1) = P (cid:0) Θ v + τ ∗ Z v ≥ α ∗ τ ∗ (cid:12)(cid:12) E c (cid:1) . Similarly we have P (cid:0) Θ x + τ ∗ Z x ≤ − α ∗ τ ∗ (cid:12)(cid:12) E c (cid:1) = P (cid:0) Θ v + τ ∗ Z v ≤ − α ∗ τ ∗ (cid:12)(cid:12) E c (cid:1) .One can therefore define two random variables e T x = ( e Θ x + τ ∗ e Z x , e Θ x ) and e T v = ( e Θ v + τ ∗ e Z v , e Θ v ) such that• conditionally on E c , e T x (respectively e T v ) and T x (respectively T v ) have the same law.• On the event E c , e Θ x + τ ∗ e Z x ≥ α ∗ τ ∗ ⇐⇒ e Θ v + τ ∗ e Z v ≥ α ∗ τ ∗ and e Θ x + τ ∗ e Z x ≤ − α ∗ τ ∗ ⇐⇒ e Θ v + τ ∗ e Z v ≤− α ∗ τ ∗ .We define then (cid:0) e X d , e Θ (cid:1) = T x on the event E provided that | Θ x + τ ∗ Z x | ≥ α ∗ τ ∗ ,T v on the event E provided that | Θ x + τ ∗ Z x | < α ∗ τ ∗ , e T x on the event E c provided that | e Θ x + τ ∗ e Z x | ≥ α ∗ τ ∗ , e T v on the event E c provided that | e Θ x + τ ∗ e Z x | < α ∗ τ ∗ . ( e X d , e Θ) ∼ µ dλ which is the law of (Θ + τ ∗ Z, Θ) where (Θ , Z ) ∼ b µ θ ? ⊗ N (0 , . Indeed, for every continuousbounded function f we have E [ f ( e X d , e Θ)] = E h E | Θ x + τ ∗ Z x |≥ α ∗ τ ∗ f ( T x ) i + E h E | Θ v + τ ∗ Z v | <α ∗ τ ∗ f ( T v ) i + E h E c | e Θ x + τ ∗ e Z x |≥ α ∗ τ ∗ f ( e T x ) i + E h E c | e Θ v + τ ∗ e Z v | <α ∗ τ ∗ f ( e T v ) i = E h E | Θ x + τ ∗ Z x |≥ α ∗ τ ∗ f ( T x ) i + E h E | Θ v + τ ∗ Z v | <α ∗ τ ∗ f ( T v ) i + E h E c | Θ x + τ ∗ Z x |≥ α ∗ τ ∗ f ( T x ) i + E h E c | Θ v + τ ∗ Z v | <α ∗ τ ∗ f ( T v ) i = E h | Θ x + τ ∗ Z x |≥ α ∗ τ ∗ f ( T x ) i + E h | Θ v + τ ∗ Z v | <α ∗ τ ∗ f ( T v ) i = E h | Θ x + τ ∗ Z x |≥ α ∗ τ ∗ f ( T x ) i + E h | Θ x + τ ∗ Z x | <α ∗ τ ∗ f ( T x ) i = E h f ( T x ) i . Let us now compute E h(cid:0) X d − e X d (cid:1) i = E h E c (cid:0) X d − e X d (cid:1) i ≤ C p P ( E c ) ≤ C(cid:15) , and E h(cid:0) e Θ − Θ (cid:1) i ≤ E h E (cid:0) Θ x − Θ (cid:1) i + E h E (cid:0) Θ v − Θ (cid:1) i + E h E c (cid:0) e Θ x − Θ (cid:1) i + E h E c (cid:0) e Θ v − Θ (cid:1) i ≤ (cid:15) + 2 C p P ( E c ) ≤ C(cid:15) .
Therefore E (cid:13)(cid:13) ( b Θ dλ , Θ) − ( e X d , e Θ) (cid:13)(cid:13) ≤ C(cid:15) and consequently W ( b µ ( b θ dλ ,θ ? ) , µ dλ ) ≤ C(cid:15) , on the event E whichhas probability at least − C(cid:15) − e − cN(cid:15) . F.6 Proof of Corollary 4.2
Let $\epsilon \in (0, 1]$. Let us work on the intersection of the events of Theorem F.1, Corollary 4.1 and Theorem E.3, which has probability at least $1 - C\epsilon^{-2}e^{-cN\epsilon^2}$. Let $\lambda \in [\lambda_{\min}, \lambda_{\max}]$.
$$\tfrac1N\big\|X^T(y - X\hat\theta_\lambda)\big\|^2 = \lambda^2\,\tfrac1N\|\hat v_\lambda\|^2 = \lambda^2\kappa^*(\lambda) + O(\epsilon).$$
We also have $1 - \tfrac1n\|\hat\theta_\lambda\|_0 = 1 - \tfrac1\delta s^*(\lambda) + O(\epsilon) = \beta^*(\lambda)/\tau^*(\lambda) + O(\epsilon)$. Therefore
$$\frac{\big\|X^T(y - X\hat\theta_\lambda)\big\|^2}{N\big(1 - \tfrac1n\|\hat\theta_\lambda\|_0\big)^2} = \Big(\frac{\lambda\tau^*(\lambda)}{\beta^*(\lambda)}\Big)^2\kappa^*(\lambda) + O(\epsilon) = \tau^*(\lambda)^2\big(\delta - s^*(\lambda)\big) - \delta\sigma^2 + O(\epsilon).$$
We get $\hat\tau(\lambda)^2 = \tau^*(\lambda)^2 + O(\epsilon)$ and $\tfrac1N\|\hat\theta_\lambda\|_0 = s^*(\lambda) + O(\epsilon)$. Consequently
$$\hat\tau(\lambda)^2\Big(\tfrac2N\|\hat\theta_\lambda\|_0 - 1\Big) = \tau^*(\lambda)^2\big(2s^*(\lambda) - 1\big) + O(\epsilon).$$
Putting everything together we obtain $\hat R(\lambda) = \delta\tau^*(\lambda)^2 - \delta\sigma^2 + O(\epsilon) = R^*(\lambda) + O(\epsilon)$, and we conclude using Theorem 3.2.
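For concreteness, the estimators manipulated in this proof can be sketched in code. The display above is damaged in this copy, so the definitions below are our reading rather than a verbatim transcription of Section 4: we take $\hat\tau(\lambda)^2$ to be the residual sum of squares $\tfrac1n\|y - X\hat\theta_\lambda\|^2$ corrected by the degrees-of-freedom factor $(1 - \|\hat\theta_\lambda\|_0/n)^2$, in the spirit of the risk and noise estimators of [4], and $\hat R(\lambda) = \delta\hat\tau(\lambda)^2 - \delta\sigma^2$ with $\sigma^2$ treated as known. sklearn's `Lasso(alpha=lam/n, fit_intercept=False)` matches the normalization of the cost (1).

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(3)
n, N, sigma = 400, 800, 0.5
delta = n / N
X = rng.standard_normal((n, N)) / np.sqrt(n)   # columns of unit l2 norm, as in (1)
theta_star = np.zeros(N); theta_star[:40] = 2.0
y = X @ theta_star + sigma * rng.standard_normal(n)

def risk_hat(lam):
    # Our reconstruction of (tau_hat, R_hat): degrees-of-freedom-corrected
    # residuals, with sigma^2 assumed known.
    fit = Lasso(alpha=lam / n, fit_intercept=False, max_iter=50000).fit(X, y)
    theta_hat = fit.coef_
    df = 1.0 - np.count_nonzero(theta_hat) / n
    tau_hat_sq = np.sum((y - X @ theta_hat) ** 2) / n / df**2
    return delta * (tau_hat_sq - sigma**2), theta_hat

for lam in [2.0, 4.0, 8.0]:
    r_hat, theta_hat = risk_hat(lam)
    true = np.sum((theta_hat - theta_star) ** 2) / N
    print(f"lambda = {lam:4.1f}   R_hat = {r_hat:.4f}   "
          f"||theta_hat - theta*||^2 / N = {true:.4f}")
```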
F.7 Proof of Proposition 4.3

Let $n_0 \in \{1, \dots, n\}$. We consider a random $n_0 \times N$ matrix $X_0$ and a random vector $z_0 = (z_{0,1}, \dots, z_{0,n_0})$ such that the $X_{0,i,j} \overset{\text{i.i.d.}}{\sim} \mathcal N(0, 1/n)$ and the $z_{0,i} \overset{\text{i.i.d.}}{\sim} \mathcal N(0,1)$ are independent, and independent of everything else.

Lemma F.9
There exist constants $\gamma, c, C > 0$ that only depend on $\Omega$ such that for all $\theta^\star \in D$, all $\lambda \in [\lambda_{\min}, \lambda_{\max}]$ and all $\epsilon \in (0, 1]$,
$$\mathbb{P}\Big(\exists w\in\mathbb{R}^N,\ \Big|\tfrac{1}{n_0}\|X_0w - \sigma z_0\|^2 - \tfrac1n\|w\|^2 - \sigma^2\Big| \geq \sqrt\epsilon\,\tfrac{n}{n_0}\ \text{and}\ \mathcal L_\lambda(w) \leq \min_{v\in\mathbb{R}^N}\mathcal L_\lambda(v) + \gamma\epsilon\Big) \leq C\epsilon^{-2}e^{-cn\epsilon^2}.$$

Proof.
The vector w λ is independent from X , z . Hence k X w λ − σz k = (cid:16) n k w λ k + σ (cid:17) χ (90)where χ is independent from w λ and follows a χ -squared distribution with n degrees of freedom. We havetherefore for all t ≥ P (cid:16)(cid:12)(cid:12) χ − n (cid:12)(cid:12) ≥ tn (cid:17) ≤ Ce − cn t + Ce − cn t , (91)for some constants c, C > . We know by Lemma F.1 and Lemma F.2 that n k w λ k concentrates exponentially fastaround τ ∗ ( λ ) − σ , which is (Theorem A.2) bounded by some constant. There exists therefore constants C, c > such that P (cid:16) n k w λ k + σ > C (cid:17) ≤ Ce − cn . (92)From (90)-(91) and (92) above, we deduce that for all t ≥ P (cid:16)(cid:12)(cid:12) n k X w λ − σz k − n k w λ k − σ (cid:12)(cid:12) > t (cid:17) ≤ Ce − cn t + Ce − cn t + Ce − cn , (93)for some constants c, C > . By Proposition G.6, we know that P (cid:0) σ max ( X ) > δ − / + p n /n + 1 (cid:1) ≤ e − n/ .Let (cid:15) ∈ (0 , . Let w ∈ R N such that k w − w λ k ≤ (cid:15)N . √ n k X w λ − σz k + 1 √ n k X w − σz k ≤ σ √ n k z k + σ max ( X ) √ n (cid:0) k w λ k + k w k (cid:1) ≤ C r nn for some constant C > , with probability at least − Ce − cn . Consequently (cid:12)(cid:12)(cid:12) n k X w λ − σz k − n k X w − σz k (cid:12)(cid:12)(cid:12) ≤ C √ nn (cid:12)(cid:12)(cid:12) k X w λ − σz k − k X w − σz k (cid:12)(cid:12)(cid:12) ≤ C √ nn k X ( w λ − w ) k≤ Cσ max ( X ) √ δ − (cid:15) nn ≤ C √ (cid:15) nn with probability at least − Ce − cn for some constant C > . Similarly, we have with probability at least − Ce − cn , (cid:12)(cid:12)(cid:12) n k w λ k − n k w k (cid:12)(cid:12)(cid:12) ≤ C √ (cid:15). We conclude that with probability at least − Ce − cn we have for all w ∈ R N such that k w − w λ k ≤ N (cid:15) , (cid:12)(cid:12)(cid:12) n k X w λ − σz k − n k w λ k − n k X w + σz k + 1 n k w k (cid:12)(cid:12)(cid:12) ≤ C √ (cid:15) (cid:0) nn (cid:1) ≤ C √ (cid:15) nn . Combining this with (93), we get that for all (cid:15) ∈ (0 , , P (cid:16) ∃ w ∈ R N , k w − w λ k ≤ N (cid:15) and (cid:12)(cid:12)(cid:12) n k X w + σz k − n k w k − σ (cid:12)(cid:12)(cid:12) > C √ (cid:15) nn (cid:17) ≤ Ce − cn(cid:15) .
55e conclude using Theorem B.1 that P (cid:16) ∃ w ∈ R N , (cid:12)(cid:12)(cid:12) n k X w − σz k − n k w k − σ (cid:12)(cid:12)(cid:12) ≥ √ (cid:15) nn and L λ ( w ) ≤ min v ∈ R N L λ ( v ) + γ(cid:15) (cid:17) ≤ C(cid:15) e − cn(cid:15) for some constants c, C, γ > . (cid:3) Using Proposition C.1, we deduce
Lemma F.10
There exists constants γ, c, C > that only depend on Ω such that for all θ ? in D and all λ ∈ [ λ min , λ max ] suchthat for all (cid:15) ∈ (0 , , P (cid:16) ∃ θ ∈ R N , (cid:12)(cid:12)(cid:12) n k X θ ? + σz − X θ k − n k θ − θ ? k − σ (cid:12)(cid:12)(cid:12) ≥ √ (cid:15) nn and L λ ( θ ) ≤ min L λ + γ(cid:15) (cid:17) ≤ C(cid:15) e − cn(cid:15) . We have b θ iλ = arg min θ ∈ R N (cid:26) n k (cid:13)(cid:13)(cid:13) y ( - i ) − X ( - i ) θ (cid:13)(cid:13)(cid:13) + λn | θ | (cid:27) = arg min θ ∈ R N (cid:26) n k (cid:13)(cid:13)(cid:13) X ( - i ) θ ? + σz ( - i ) − X ( - i ) θ (cid:13)(cid:13)(cid:13) + λn | θ | (cid:27) = arg min θ ∈ R N ( n (cid:13)(cid:13)(cid:13)r kk − X ( - i ) θ ? + r kk − σz ( - i ) − r kk − X ( - i ) θ (cid:13)(cid:13)(cid:13) + λn | θ | ) . b θ iλ is thus the minimizer of the Lasso cost (7) for δ ( k ) = k − k δ and σ ( k ) = p k/ ( k − σ . Let τ ( k ) ∗ ( λ ) be the τ ∗ defined by Theorem 3.1, but with δ ( k ) instead of δ and σ ( k ) instead of σ . Define the corresponding ‘risk’: R ( k ) ∗ ( λ ) = δ ( k ) (cid:0) τ ( k ) ∗ ( λ ) − ( σ ( k ) ) (cid:1) . It is not difficult to verify that the bounds on τ ∗ , β ∗ of Section A.2 are uniform with respect to δ and σ . Moreprecisely sup δ ∈ [ δ min ,δ max ] sup σ ∈ [ σ min ,σ max ] sup λ ∈ [ λ min ,λ max ] sup θ ? ∈D (cid:8) τ ∗ ( λ, δ, σ ) + β ∗ ( λ, δ, σ ) (cid:9) < + ∞ , where δ max , δ min , σ max , σ min > such that s max ( δ min ) > s if we are in the case D = F ( s ) . This gives that underthe assumptions of Proposition 4.3, τ ( k ) ∗ and R ( k ) ∗ are bounded for all k ≥ (that verify s max ( δ ( k − /k ) > s inthe case D = F ( s ) ) by some constant that depends only on Ω . Lemma F.11
There exist constants $C, c > 0$ that only depend on $\Omega$ such that for all $\theta^\star \in D$, all $i \in \{1, \dots, k\}$ and all $\epsilon \in (0, 1]$,
$$\mathbb{P}\Big(\sup_{\lambda\in[\lambda_{\min},\lambda_{\max}]}\Big|\tfrac kn\|y^{(i)} - X^{(i)}\hat\theta^i_\lambda\|^2 - \tfrac1n\|\hat\theta^i_\lambda - \theta^\star\|^2 - \sigma^2\Big| \geq \sqrt\epsilon\,k\Big) \leq CN^q\epsilon^{-2}e^{-cn\epsilon^2}.$$

Proof.
Let i ∈ { , . . . , k } . Let us define for θ ∈ R N , L ( - i ) λ ( θ ) = 12 n k (cid:13)(cid:13) y ( - i ) − X ( - i ) θ (cid:13)(cid:13) + λn | θ | . Let (cid:15) ∈ (0 , . Let η = γ(cid:15)KN q and M = d ( λ max − λ min ) /η e . Define for j ∈ { , . . . , M } , define λ j = min (cid:0) λ min + jη, λ max (cid:1) . We apply Lemma F.10 with n = n/k , X = X ( i ) and z = z ( i ) to obtain that the event E = n ∀ j ∈ { , . . . , M } , ∀ θ ∈ R N , L ( - i ) λ j ( θ ) ≤ min L ( - i ) λ j + γ(cid:15) = ⇒ (cid:12)(cid:12)(cid:12) kn k y ( i ) − X ( i ) θ k − n k θ − θ ? k − σ (cid:12)(cid:12)(cid:12) < √ (cid:15)k o has probability at least − M C(cid:15) e − c(cid:15) n . By Lemma C.5 the event E = n ∀ λ, λ ∈ [ λ min , λ max ] , L ( - i ) λ ( b θ iλ ) ≤ min x ∈ R N L ( - i ) λ ( x ) + KN q | λ − λ | o (94)56as probability at least − Ce − cn . On the event E , we have for all j ∈ { , . . . , k } and all λ ∈ [ λ j − , λ j ] L ( - i ) λ j ( b θ iλ ) ≤ min x ∈ R N L ( - i ) λ j ( x ) + KN q η ≤ min x ∈ R N L ( - i ) λ j ( x ) + γ(cid:15). We obtain that on E ∩ E , which has probability at least − CN q (cid:15) − e − cn(cid:15) ∀ λ ∈ [ λ min , λ max ] , (cid:12)(cid:12)(cid:12) kn k y ( i ) − X ( i ) b θ iλ k − n k b θ iλ − θ ? k − σ (cid:12)(cid:12)(cid:12) < √ (cid:15)k. (cid:3) Proposition F.5
There exist constants $c, C > 0$ that only depend on $\Omega$, such that for all $\theta^\star \in D$ and all $i \in \{1, \dots, k\}$,
$$\mathbb{P}\Big(\sup_{\lambda\in[\lambda_{\min},\lambda_{\max}]}\Big|\tfrac1N\|\hat\theta^i_\lambda - \theta^\star\|^2 - R^*(\lambda)\Big| \geq \tfrac{C}{\sqrt k}\Big) \leq CN^qk^2e^{-cN/k^2}.$$

Proof.
Let us fix i ∈ { , . . . , k } . By Proposition C.7, λ R ∗ ( λ ) is K -Lipschitz on [ λ min , λ max ] , for someconstant K > . By Propositions C.3 and C.4 there exists a constant K > such that the event E = n ∀ λ ∈ [ λ min , λ max ] , N (cid:12)(cid:12) | b θ iλ | − | θ ? | (cid:12)(cid:12) ≤ K N q and N (cid:12)(cid:12) | b θ λ | − | θ ? | (cid:12)(cid:12) ≤ K N q o (95)has probability at least − Ce − cn . Let us define η = min (cid:16) δ N q kK , K √ k (cid:17) and M = d ( λ max − λ min ) /η e . For all j ∈ { , . . . , M } , we write λ j = min (cid:0) λ min + jη, λ max (cid:1) .By Theorem 3.2 the event E = n sup λ ∈ [ λ min ,λ max ] (cid:12)(cid:12)(cid:12) N k b θ iλ − θ ? k − R ( k ) ∗ ( λ ) (cid:12)(cid:12)(cid:12) ≤ o \ n sup λ ∈ [ λ min ,λ max ] (cid:12)(cid:12)(cid:12) n (cid:13)(cid:13) y − X b θ λ (cid:13)(cid:13) − β ∗ ( λ ) (cid:12)(cid:12)(cid:12) ≤ o (96)has probability at least − CN q e − cN . By Lemma F.11, applied with (cid:15) = k − , E = n sup λ ∈ [ λ min ,λ max ] (cid:12)(cid:12)(cid:12) kn k y ( i ) − X ( i ) b θ iλ k − n k b θ iλ − θ ? k − σ (cid:12)(cid:12)(cid:12) ≤ o has probability at least − Ck N q e − cn/k . On the event E ∩ E , we have, for all λ ∈ [ λ min , λ max ] , L λ ( b θ iλ ) = 12 n (cid:13)(cid:13) y ( i ) − X ( i ) b θ iλ (cid:13)(cid:13) + 12 n (cid:13)(cid:13) y ( - i ) − X ( - i ) b θ iλ (cid:13)(cid:13) + λn | b θ iλ |≤ k (cid:0) σ + 1 n k θ ? − b θ iλ k + 1 (cid:1) + 12 n k (cid:13)(cid:13) y ( - i ) − X ( - i ) b θ iλ (cid:13)(cid:13) + λn | b θ iλ |≤ k (cid:0) σ + δ − ( R ( k ) ∗ ( λ ) + 1) (cid:1) + 12 n k (cid:13)(cid:13) y ( - i ) − X ( - i ) b θ λ (cid:13)(cid:13) + λn | b θ λ |≤ k (cid:0) σ + δ − ( R ( k ) ∗ ( λ ) + 1) (cid:1) + L λ ( b θ λ ) + (cid:16) n k − n (cid:17)(cid:13)(cid:13) y − X b θ λ (cid:13)(cid:13) ≤ L λ ( b θ λ ) + Ck for some constant C > . Let j ∈ { , . . . , M } . We have L λ ( b θ iλ ) = L λ j ( b θ iλ ) − λ j − λn | b θ iλ | and L λ ( b θ λ ) ≤ L λ ( b θ λ j ) = L λ j ( b θ λ j ) + λ − λ j n | b θ λ j | = min θ ∈ R N L λ j ( θ ) + λ − λ j n | b θ λ j | . So we get that on the event E ∩ E ∩ E , for all j ∈ { , . . . , M } and all λ ∈ [ λ j − , λ j ] , L λ j ( b θ iλ ) ≤ min θ ∈ R N L λ j ( θ ) + λ j − λn (cid:0) | b θ iλ j | − | b θ λ j | (cid:1) + Ck ≤ min θ ∈ R N L λ j ( θ ) + λ j − λδ K N q + Ck ≤ min θ ∈ R N L λ j ( θ ) + ηδ K N q + Ck ≤ min θ ∈ R N L λ j ( θ ) + C k , C > , because on E we have ∀ λ ∈ [ λ min , λ max ] , N (cid:12)(cid:12) | b θ iλ | − | b θ λ | (cid:12)(cid:12) ≤ K N q . By Theorem C.1,there exists constants C, c, γ > such that for all (cid:15) ∈ (0 , the event E = n ∀ j ∈ { , . . . , M } , ∀ θ ∈ R N , L λ j ( θ ) ≤ min L λ j + γ(cid:15) = ⇒ (cid:12)(cid:12)(cid:12) N k θ − θ ? k − R ∗ ( λ j ) (cid:12)(cid:12)(cid:12) ≤ √ (cid:15) o has probability at least − CM (cid:15) − e − cN(cid:15) . Consider the constant κ = C γ . If k ≥ κ , then (cid:15) = C γk ≤ and theevent E has probability at least − CM ke − cN/k . So we obtain that on the event E ∩ E ∩ E ∩ E , which hasprobability − CN q k e − cn/k , ∀ j ∈ { , . . . , M } , ∀ λ ∈ [ λ j − , λ j ] , (cid:12)(cid:12)(cid:12) N k b θ iλ − θ ? k − R ∗ ( λ j ) (cid:12)(cid:12)(cid:12) ≤ C √ k , for some constant C > . If now k < κ . Then on the event E we have ∀ j ∈ { , . . . , M } , ∀ λ ∈ [ λ j − , λ j ] , (cid:12)(cid:12)(cid:12) N k b θ iλ − θ ? k − R ∗ ( λ j ) (cid:12)(cid:12)(cid:12) ≤ sup λ ∈ [ λ min ,λ max ] R ( k ) ∗ ( λ ) + sup λ ∈ [ λ min ,λ max ] R ∗ ( λ ) + 1 ≤ C ≤ C √ κ √ k , where C is a constant. 
We conclude that (in both cases) there exists a constant C > such that ∀ j ∈ { , . . . , M } , ∀ λ ∈ [ λ j − , λ j ] , (cid:12)(cid:12)(cid:12) N k b θ iλ − θ ? k − R ∗ ( λ j ) (cid:12)(cid:12)(cid:12) ≤ C √ k , holds with probability at least − CN q k e − cN/k Proposition F.5 follows from the fact that for all λ ∈ [ λ j − , λ j ] , | R ∗ ( λ ) − R ∗ ( λ j ) | ≤ K | λ − λ j | ≤ √ k . (cid:3) Proof of Proposition 4.3.
We apply Lemma F.11 with $\epsilon = k^{-3}$ to obtain that with probability at least $1 - Ck^6N^qe^{-cn/k^6}$ we have
$$\forall\lambda\in[\lambda_{\min},\lambda_{\max}],\ \forall i\in\{1,\dots,k\},\quad \Big|\tfrac kn\|y^{(i)} - X^{(i)}\hat\theta^i_\lambda\|^2 - \tfrac1n\|\hat\theta^i_\lambda - \theta^\star\|^2 - \sigma^2\Big| \leq \tfrac{1}{\sqrt k}.$$
By summing these inequalities for $i = 1, \dots, k$ and using the triangle inequality, we get
$$\Big|\tfrac kn\sum_{i=1}^k\|y^{(i)} - X^{(i)}\hat\theta^i_\lambda\|^2 - \tfrac1n\sum_{i=1}^k\|\hat\theta^i_\lambda - \theta^\star\|^2 - k\sigma^2\Big| \leq \tfrac{k}{\sqrt k},$$
and then
$$\Big|\tfrac1N\sum_{i=1}^k\|y^{(i)} - X^{(i)}\hat\theta^i_\lambda\|^2 - \tfrac1k\sum_{i=1}^k\tfrac1N\|\hat\theta^i_\lambda - \theta^\star\|^2 - \delta\sigma^2\Big| \leq \tfrac{\delta}{\sqrt k}. \qquad (97)$$
By Proposition F.5, we have with probability at least $1 - CN^qk^3e^{-cN/k^2}$,
$$\forall\lambda\in[\lambda_{\min},\lambda_{\max}],\ \forall i\in\{1,\dots,k\},\quad \Big|\tfrac1N\|\hat\theta^i_\lambda - \theta^\star\|^2 - R^*(\lambda)\Big| \leq \tfrac{C}{\sqrt k}.$$
This implies (again by summing and using the triangle inequality) that
$$\Big|\tfrac1k\sum_{i=1}^k\tfrac1N\|\hat\theta^i_\lambda - \theta^\star\|^2 - R^*(\lambda)\Big| \leq \tfrac{C}{\sqrt k},$$
which, combined with (97), proves Proposition 4.3. $\square$
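Proposition 4.3 says that the $k$-fold cross-validation estimate tracks $R^*(\lambda) + \sigma^2$ up to $O(1/\sqrt k)$ errors, uniformly in $\lambda$. Here is a compact sketch of the estimator it analyzes; this is our own illustration, with fold handling and normalizations chosen to be the natural ones for the cost (1), not code from the paper.

```python
import numpy as np
from sklearn.linear_model import Lasso

def cv_risk(X, y, lam, k, seed=0):
    # k-fold CV: average over folds of the mean squared held-out residual.
    # Its expectation is close to R*(lambda) + sigma^2, per Proposition 4.3.
    n = len(y)
    idx = np.random.default_rng(seed).permutation(n)
    err = 0.0
    for fold in np.array_split(idx, k):
        mask = np.ones(n, dtype=bool); mask[fold] = False
        fit = Lasso(alpha=lam / n, fit_intercept=False, max_iter=50000)
        fit.fit(X[mask], y[mask])
        err += np.mean((y[fold] - X[fold] @ fit.coef_) ** 2)
    return err / k

rng = np.random.default_rng(4)
n, N, sigma = 500, 1000, 0.5
X = rng.standard_normal((n, N)) / np.sqrt(n)
theta = np.zeros(N); theta[:50] = 1.5
y = X @ theta + sigma * rng.standard_normal(n)

for lam in [2.0, 4.0, 8.0]:
    print(f"lambda = {lam:4.1f}   CV estimate of R*(lambda) + sigma^2: "
          f"{cv_risk(X, y, lam, k=5):.4f}")
```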
F.8 The scalar lasso

In this section we study
$$\ell_\alpha(y) = \min_{x\in\mathbb{R}}\Big\{\tfrac12(y-x)^2 + \alpha|x|\Big\}. \qquad (98)$$

Lemma F.12 The minimum (98) is achieved at the unique point $x^* = \eta(y, \alpha)$ and
$$\ell_\alpha(y) = \begin{cases} \tfrac12 y^2 & \text{if } -\alpha \leq y \leq \alpha, \\ \alpha y - \tfrac{\alpha^2}{2} & \text{if } y \geq \alpha, \\ -\alpha y - \tfrac{\alpha^2}{2} & \text{if } y \leq -\alpha.\end{cases}$$
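Lemma F.12 is elementary to verify numerically: minimize (98) by brute force over a fine grid and compare with the closed form. A quick self-contained check (ours):

```python
import numpy as np

def ell_alpha_closed(y, a):
    # Closed form of Lemma F.12.
    if -a <= y <= a:
        return 0.5 * y * y
    return a * abs(y) - 0.5 * a * a

def ell_alpha_grid(y, a, lo=-20.0, hi=20.0, m=2_000_001):
    # Brute-force minimization of x -> (y - x)^2 / 2 + a |x| over a grid.
    x = np.linspace(lo, hi, m)
    return np.min(0.5 * (y - x) ** 2 + a * np.abs(x))

for y in [-3.0, -0.4, 0.0, 0.7, 2.5]:
    c, g = ell_alpha_closed(y, a=1.0), ell_alpha_grid(y, a=1.0)
    print(f"y = {y:5.2f}   closed = {c:.6f}   grid = {g:.6f}")
```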
Suppose now that $y = x_0 + Z$, for some $x_0 \in \mathbb{R}$ and $Z \sim \mathcal N(0,1)$.

Lemma F.13 Define $\Delta_\alpha(x) = \mathbb{E}\big[\ell_\alpha(x + Z)\big] - \alpha|x|$. The function $\Delta_\alpha$ is continuous, even, decreasing on $\mathbb{R}_{\geq 0}$ and $2\alpha$-Lipschitz. Moreover
$$\Delta_\alpha(0) = \tfrac12 + \alpha\phi(\alpha) - (1+\alpha^2)\Phi(-\alpha), \qquad \lim_{x\to\pm\infty}\Delta_\alpha(x) = -\tfrac{\alpha^2}{2},$$
and $\Delta'_\alpha(0^+) = -\Delta'_\alpha(0^-) = -\alpha$.

Proof.
Since Z and − Z have the same law, one verify easily that ∆ α is an even function. We have for all x > α ( x ) = E [ ‘ α ( x + Z ) − α ] = E [ ( x + Z ∈ [ − α, α ])( x + Z − α )] ≤ .‘ α is convex, therefore x E [ ‘ α ( x + Z )] is non-decreasing. E [ ‘ α ( Z )] = 0 because ‘ α is an odd function.Consequently, for all x > − α ≤ E [ ‘ α ( x + Z ) − α ] = ∆ α ( x ) . This gives (recall that ∆ α is even and continuous over R ) that ∆ α is α -Lipschitz. From what we have seen above,we have also ∆ α (0 + ) = − ∆ α (0 − ) = − α . Compute now, using the fact that ‘ α is even: ∆ α (0) = E [ ‘ α ( Z )] = Z α z φ ( z ) dz + Z + ∞ α (2 αz − α ) φ ( z ) dz . By integration by parts Z α z φ ( z ) dz = h − zφ ( z ) i α + Z α φ ( z ) dz = − αφ ( α ) + 12 − Φ( − α ) Z + ∞ α (2 αz − α ) φ ( z ) dz = − α Φ( − α ) + 2 αφ ( α ) . Therefore ∆ α (0) = + αφ ( α ) − (1 + α )Φ( − α ) . We have almost-surely ‘ α ( x + Z ) − α | x | −−−−−→ x →±∞ − α . Thus, by dominated convergence lim x →±∞ ∆ α ( x ) = − α . (cid:3) F.9 A convexity lemma
Lemma F.14 The function $f: x \in \mathbb{R}^N \mapsto \sqrt{\tfrac{\|x\|^2}{n} + \sigma^2}$ is $\tfrac{\sigma^2}{n(R^2+\sigma^2)^{3/2}}$-strongly convex on $B(0, \sqrt n R)$.

Proof. Let $x, y \in B(0, \sqrt n R)$ and define for $t \in [0,1]$, $g(t) = f(z_t)$, where $z_t = tx + (1-t)y$. Compute
$$g'(t) = \frac{\tfrac1n(x-y)^Tz_t}{\sqrt{\tfrac{\|z_t\|^2}{n}+\sigma^2}},$$
and
$$g''(t) = \frac{\tfrac1n\|x-y\|^2\sqrt{\tfrac{\|z_t\|^2}{n}+\sigma^2} - \big(\tfrac1n(x-y)^Tz_t\big)^2\big(\tfrac{\|z_t\|^2}{n}+\sigma^2\big)^{-1/2}}{\tfrac{\|z_t\|^2}{n}+\sigma^2} = \frac{1}{\big(\tfrac{\|z_t\|^2}{n}+\sigma^2\big)^{3/2}}\Big(\tfrac1n\|x-y\|^2\Big(\tfrac{\|z_t\|^2}{n}+\sigma^2\Big) - \Big(\tfrac1n(x-y)^Tz_t\Big)^2\Big)$$
$$\geq \frac{\sigma^2}{\big(\tfrac{\|z_t\|^2}{n}+\sigma^2\big)^{3/2}}\cdot\tfrac1n\|x-y\|^2 \geq \frac{\|x-y\|^2}{n}\cdot\frac{\sigma^2}{(R^2+\sigma^2)^{3/2}},$$
where the first inequality uses Cauchy-Schwarz, $\big(\tfrac1n(x-y)^Tz_t\big)^2 \leq \tfrac{\|x-y\|^2}{n}\cdot\tfrac{\|z_t\|^2}{n}$. Consequently
$$tf(x) + (1-t)f(y) = tg(1) + (1-t)g(0) \geq g(t) + \tfrac12 t(1-t)\tfrac{\|x-y\|^2}{n}\cdot\frac{\sigma^2}{(R^2+\sigma^2)^{3/2}} = f(tx+(1-t)y) + \tfrac12 t(1-t)\tfrac{\|x-y\|^2}{n}\cdot\frac{\sigma^2}{(R^2+\sigma^2)^{3/2}}. \qquad \square$$
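The constant in Lemma F.14 can be probed numerically: along any segment inside $B(0, \sqrt n R)$, the second derivative of $t \mapsto f(z_t)$ computed above should be at least $\sigma^2\|x-y\|^2/(n(R^2+\sigma^2)^{3/2})$. A small randomized check (ours, with arbitrary values of $n$, $\sigma$, $R$):

```python
import numpy as np

rng = np.random.default_rng(5)
n, sigma, R = 50, 0.7, 2.0
f = lambda x: np.sqrt(np.dot(x, x) / n + sigma**2)

def random_point():
    # Uniform direction, random radius within the ball B(0, sqrt(n) R).
    v = rng.standard_normal(n)
    return v * (rng.uniform(0, 1) * np.sqrt(n) * R / np.linalg.norm(v))

worst = np.inf
for _ in range(1000):
    x, y = random_point(), random_point()
    d = x - y
    t = rng.uniform(0, 1)
    z = t * x + (1 - t) * y
    h = 1e-4
    # Second derivative of g(t) = f(t x + (1-t) y) by central differences.
    g2 = (f(z + h * d) - 2 * f(z) + f(z - h * d)) / h**2
    bound = (sigma**2 / (n * (R**2 + sigma**2) ** 1.5)) * np.dot(d, d)
    worst = min(worst, g2 - bound)

print(f"min of g''(t) minus the guaranteed bound: {worst:.3e}   (should be >= 0)")
```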
Appendix G: Toolbox

G.1 Notations recap
Recall that $X$ is an $n \times N$ random matrix with entries $X_{i,j} \overset{\text{i.i.d.}}{\sim} \mathcal N(0, 1/n)$. The random vectors $z \in \mathbb{R}^n$, $g \in \mathbb{R}^N$ and $h \in \mathbb{R}^n$ are standard Gaussian random vectors. The following table displays the main cost (or objective) functions used in this paper and their corresponding optimizers.

$L_\lambda(\theta) = \tfrac{1}{2n}\|X\theta - y\|^2 + \tfrac\lambda n|\theta|_1$, with optimizer $\hat\theta_\lambda$.
$C_\lambda(w) = \tfrac{1}{2n}\|Xw - \sigma z\|^2 + \tfrac\lambda n\big(|w+\theta^\star|_1 - |\theta^\star|_1\big)$, with optimizer $\hat w_\lambda$.
$U_\lambda(u) = \min_{w\in\mathbb{R}^N}\big\{\tfrac1n u^TXw - \tfrac\sigma n u^Tz - \tfrac{1}{2n}\|u\|^2 + \tfrac\lambda n\big(|\theta^\star+w|_1 - |\theta^\star|_1\big)\big\}$, with optimizer $\hat u_\lambda$.
$V_\lambda(v) = \min_{w\in B}\big\{\tfrac{1}{2n}\|Xw - \sigma z\|^2 + \tfrac\lambda n v^T(\theta^\star+w) - \tfrac\lambda n|\theta^\star|_1\big\}$, with optimizer $\hat v_\lambda$.
$\mathcal L_\lambda(w) = \tfrac12\Big(\sqrt{\tfrac{\|w\|^2}{n}+\sigma^2}\,\tfrac{\|h\|}{\sqrt n} - \tfrac1n g^Tw + \tfrac{g\sigma}{\sqrt n}\Big)_+^2 + \tfrac\lambda n|w+\theta^\star|_1 - \tfrac\lambda n|\theta^\star|_1$, with optimizer $w^*_\lambda$.
$\mathcal U_\lambda(u) = \min_{w\in\mathbb{R}^N}\big\{-\tfrac{1}{n^{3/2}}\|u\|g^Tw + \tfrac{1}{n^{3/2}}\|w\|h^Tu - \tfrac\sigma n u^Tz - \tfrac{1}{2n}\|u\|^2 + \tfrac\lambda n\big(|w+\theta^\star|_1 - |\theta^\star|_1\big)\big\}$, with optimizer $u^*_\lambda$.
$\mathcal V_\lambda(v) = \min_{w\in B}\Big\{\tfrac12\Big(\sqrt{\tfrac{\|w\|^2}{n}+\sigma^2}\,\tfrac{\|h\|}{\sqrt n} - \tfrac1n g^Tw + \tfrac{g\sigma}{\sqrt n}\Big)_+^2 + \tfrac\lambda n v^T(w+\theta^\star) - \tfrac\lambda n|\theta^\star|_1\Big\}$, with optimizer $v^*_\lambda$.

Table 1: Main cost/objective functions.

In the definition of $\mathcal V_\lambda$ above, $B = \big\{w \in \mathbb{R}^N \,\big|\, |w|_1 \leq |\theta^\star|_1 + 5\sigma\lambda^{-1}n + K\big\}$, where $K > 0$ is the constant given by Lemma E.5. The functions $\mathcal L_\lambda$, $\mathcal U_\lambda$ and $\mathcal V_\lambda$ are the "corresponding cost/objective functions" to $C_\lambda$, $U_\lambda$ and $V_\lambda$. A main part of the analysis is to show that $w^*_\lambda$, $u^*_\lambda$ and $v^*_\lambda$ are approximately equal to $w_\lambda$, $u_\lambda$ and $v_\lambda$ given by:

$w_\lambda = \eta\big(\theta^\star + \tau^*(\lambda)g,\ \alpha^*(\lambda)\tau^*(\lambda)\big) - \theta^\star$
$u_\lambda = \dfrac{\beta^*(\lambda)}{\tau^*(\lambda)}\Big(\sqrt{\tau^*(\lambda)^2 - \sigma^2}\,\dfrac{h}{\sqrt n} - \dfrac{\sigma}{\sqrt n}z\Big)$
$v_\lambda = -\alpha^*(\lambda)^{-1}\tau^*(\lambda)^{-1}\Big(\eta\big(\theta^\star + \tau^*(\lambda)g,\ \alpha^*(\lambda)\tau^*(\lambda)\big) - \theta^\star - \tau^*(\lambda)g\Big)$

Table 2: "Asymptotic optimizers"
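Table 2 is the heart of the analysis: the Lasso quantities on the left of Table 1 behave like explicit soft-thresholding expressions. The sketch below simulates this for $w_\lambda$. It solves our paraphrase of the fixed-point and calibration equations for $(\alpha, \tau)$ by simple iteration; the equations $\tau^2 = \sigma^2 + \mathbb{E}[(\eta(\Theta+\tau Z, \alpha\tau)-\Theta)^2]/\delta$ and $\lambda = \alpha\tau(1 - \mathbb{E}[\eta']/\delta)$ are the standard ones associated with results of this type and are written from memory, not copied from the paper. It then compares the Lasso risk $\|\hat\theta_\lambda - \theta^\star\|^2/N$ with the prediction $\delta(\tau^2 - \sigma^2)$; sklearn's `alpha = lam / n` matches the normalization of (1).

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(6)
n, N, sigma = 600, 1200, 0.5
delta = n / N
eps0, spike = 0.05, 2.0   # prior for theta*: P(Theta = spike) = eps0, else 0

def eta(x, b):
    return np.sign(x) * np.maximum(np.abs(x) - b, 0.0)

def state_evolution(alpha, iters=200, m=200000):
    # Iterate tau^2 = sigma^2 + E[(eta(Theta + tau Z, alpha tau) - Theta)^2] / delta,
    # then derive the lambda calibrated to the threshold parameter alpha.
    z = rng.standard_normal(m)
    theta = np.where(rng.uniform(size=m) < eps0, spike, 0.0)
    tau = 1.0
    for _ in range(iters):
        mse = np.mean((eta(theta + tau * z, alpha * tau) - theta) ** 2)
        tau = np.sqrt(sigma**2 + mse / delta)
    frac_on = np.mean(np.abs(theta + tau * z) > alpha * tau)   # E[eta'] = s*(lambda)
    lam = alpha * tau * (1.0 - frac_on / delta)                # calibration
    return tau, lam

alpha = 1.5                          # threshold parameter; lambda derived from it
tau, lam = state_evolution(alpha)

theta_star = np.where(rng.uniform(size=N) < eps0, spike, 0.0)
X = rng.standard_normal((n, N)) / np.sqrt(n)
y = X @ theta_star + sigma * rng.standard_normal(n)
fit = Lasso(alpha=lam / n, fit_intercept=False, max_iter=100000).fit(X, y)

emp = np.sum((fit.coef_ - theta_star) ** 2) / N
pred = delta * (tau**2 - sigma**2)
print(f"lambda = {lam:.3f}   tau = {tau:.3f}")
print(f"Lasso risk ||theta_hat - theta*||^2 / N : {emp:.4f}")
print(f"Prediction delta (tau^2 - sigma^2)      : {pred:.4f}")
```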
G.2 Convex analysis lemmas
Proposition G.1 (Corollary 37.3.2 from [39])
Let $C$ and $D$ be non-empty closed convex sets in $\mathbb{R}^m$ and $\mathbb{R}^n$, respectively, and let $f$ be a continuous finite concave-convex function on $C \times D$. If either $C$ or $D$ is bounded, one has
$$\inf_{v\in D}\sup_{u\in C} f(u,v) = \sup_{u\in C}\inf_{v\in D} f(u,v).$$

Definition G.1
A convex function $f$ over $\mathbb{R}^n$ is said to be
• $\gamma$-strongly convex if $x \mapsto f(x) - \tfrac\gamma2\|x\|^2$ is convex;
• $L$-strongly smooth if $f$ is differentiable everywhere and for all $x, y \in \mathbb{R}^n$ we have
$$f(y) \leq f(x) + (y-x)^T\nabla f(x) + \tfrac L2\|x-y\|^2.$$

Remark 4. If $f$ is convex, differentiable over $\mathbb{R}^n$, and $\nabla f$ is $L$-Lipschitz, then $f$ is $L$-strongly smooth. Indeed, if we take $x, y \in \mathbb{R}^n$ and define $h(t) = f((1-t)x + ty)$, we have
$$f(y) - f(x) = h(1) - h(0) = \int_0^1 h'(t)\,dt = \int_0^1 (y-x)^T\nabla f((1-t)x + ty)\,dt$$
$$= (y-x)^T\nabla f(x) + \int_0^1 (y-x)^T\big(\nabla f((1-t)x + ty) - \nabla f(x)\big)\,dt \leq (y-x)^T\nabla f(x) + \int_0^1 tL\|x-y\|^2\,dt \leq (y-x)^T\nabla f(x) + \tfrac L2\|x-y\|^2.$$

Proposition G.2
Let $f$ be a closed convex function over $\mathbb{R}^n$. Then $f$ is $\gamma$-strongly convex if and only if $f^*$ is $\tfrac1\gamma$-strongly smooth.

This result can be found in the book [54], see Corollary 3.5.11 on page 217 and Remark 3.5.3 below it. A more accessible presentation of this result can be found in [28].
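Proposition G.2, which the proof of Lemma E.6 relies on, can be illustrated in one dimension: for a $\gamma$-strongly convex $f$, the conjugate $f^*(s) = \sup_x\{sx - f(x)\}$ has second differences bounded by $1/\gamma$. A toy check (ours), with the example $f(x) = \tfrac\gamma2 x^2 + |x|$, for which $f^*(s) = (|s|-1)_+^2/(2\gamma)$ attains the bound:

```python
import numpy as np

gamma = 2.0
x = np.linspace(-10, 10, 20001)
f = 0.5 * gamma * x**2 + np.abs(x)          # gamma-strongly convex in x

s = np.linspace(-5, 5, 1001)
# Conjugate f*(s) = sup_x { s x - f(x) }, computed on the grid.
fstar = np.max(s[:, None] * x[None, :] - f[None, :], axis=1)

h = s[1] - s[0]
second = (fstar[2:] - 2 * fstar[1:-1] + fstar[:-2]) / h**2
print(f"max second difference of f*: {second.max():.4f}   "
      f"(1/gamma-strong smoothness bound: {1/gamma:.4f})")
```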
G.3 Gaussian min-max Theorem
In this section, we reproduce the proof of the tight Gaussian min-max comparison theorem from [47], for completeness but also because we need a slightly more general version of this result. We recall the classical Gordon min-max theorem from [24] (see also Corollary 3.13 from [30]):

Theorem G.1
Let X i,j and ( Y i,j ) , ≤ i ≤ n , ≤ j ≤ m be two (centered) Gaussian random vectors such that E X i,j = E Y i,j for all i, j , E X i,j X i,k ≥ E Y i,j Y i,k for all i, j, k , E X i,j X l,k ≤ E Y i,j Y l,k for all i = l and j, k . Then, for all real numbers λ i,j : P n \ i =1 m [ j =1 (cid:8) X i,j > λ i,j (cid:9) ≤ P n \ i =1 m [ j =1 (cid:8) Y i,j > λ i,j (cid:9) . Theorem G.2
Let D u ⊂ R n and D v ⊂ R m be two compact sets. Let Q : D u × D v → R be a continuous function. Let (cid:0) X ( u, v ) (cid:1) ( u,v ) ∈ D u × D v and (cid:0) Y ( u, v ) (cid:1) ( u,v ) ∈ D u × D v be two centered Gaussian processes. Suppose that the func-tions ( u, v ) X ( u, v ) and ( u, v ) Y ( u, v ) are continuous on D u × D v almost surely. Assume that E (cid:2) X ( u, v ) (cid:3) = E (cid:2) Y ( u, v ) (cid:3) for all ( u, v ) ∈ D u × D v , E (cid:2) X ( u, v ) X ( u, v ) (cid:3) ≥ E (cid:2) Y ( u, v ) Y ( u, v ) (cid:3) for all u ∈ D u , v, v ∈ D v , E (cid:2) X ( u, v ) X ( u , v ) (cid:3) ≤ E (cid:2) Y ( u, v ) Y ( u , v ) (cid:3) for all u, u ∈ D u , v, v ∈ D v such that u = u . Then for all t ∈ RP (cid:16) min u ∈ D u max v ∈ D v Y ( u, v ) + Q ( u, v ) ≤ t (cid:17) ≤ P (cid:16) min u ∈ D u max v ∈ D v X ( u, v ) + Q ( u, v ) ≤ t (cid:17) . Proof.
Define the random variable d = sup n d ∈ Q + (cid:12)(cid:12)(cid:12) ∀ ( z, z ) ∈ ( D u × D v ) , k z − z k ≤ d = ⇒ (cid:0) | X ( z ) − X ( z ) | ≤ (cid:15) and | Y ( z ) − Y ( z ) | ≤ (cid:15) (cid:1)o .X and Y are continuous on the compact set D u × D v and are therefore uniformly continuous on this set: d > almost surely. Let (cid:15) > . By tightness there exists a constant d > such that P ( d ≥ d ) ≥ − (cid:15) .Q is continuous and thus uniformly continuous on D u × D v : there exists δ ∈ (0 , d ] such that for all z, z ∈ D u × D v , k z − z k ≤ δ = ⇒ | Q ( z ) − Q ( z ) | ≤ (cid:15) .Let D δu (respectively D δv ) be a δ/ √ -net of D u (respectively D v ). D δu × D δv is thus a δ -net of D u × D v . ByTheorem G.1 we have for all t ∈ RP (cid:16) min u ∈ D δu max v ∈ D δv X ( u, v ) + Q ( u, v ) > t (cid:17) ≤ P (cid:16) min u ∈ D δu max v ∈ D δv Y ( u, v ) + Q ( u, v ) > t (cid:17) , which gives by taking the complementary: P (cid:16) min u ∈ D δu max v ∈ D δv Y ( u, v ) + Q ( u, v ) ≤ t (cid:17) ≤ P (cid:16) min u ∈ D δu max v ∈ D δv X ( u, v ) + Q ( u, v ) ≤ t (cid:17) . By construction of δ we have with probability at least − (cid:15) (cid:12)(cid:12)(cid:12) min u ∈ D δu max v ∈ D δv X ( u, v ) + Q ( u, v ) − min u ∈ D u max v ∈ D v X ( u, v ) + Q ( u, v ) (cid:12)(cid:12)(cid:12) ≤ (cid:15) , and similarly for Y . We have therefore, for all t ∈ RP (cid:16) min u ∈ D u max v ∈ D v Y ( u, v ) + Q ( u, v ) ≤ t − (cid:15) (cid:17) − (cid:15) ≤ P (cid:16) min u ∈ D u max v ∈ D v X ( u, v ) + Q ( u, v ) ≤ t + 2 (cid:15) (cid:17) + (cid:15) , P (cid:16) min u ∈ D u max v ∈ D v Y ( u, v ) + Q ( u, v ) ≤ t (cid:17) ≤ P (cid:16) min u ∈ D u max v ∈ D v X ( u, v ) + Q ( u, v ) ≤ t + 4 (cid:15) (cid:17) + 2 (cid:15) , which proves the theorem by taking (cid:15) → . (cid:3) Corollary G.1
Let D u ⊂ R n + n and D v ⊂ R m + m be compact sets and let Q : D u × D v → R be a continuous function.Let G = ( G i,j ) i.i.d. ∼ N (0 , , g ∼ N (0 , I n ) and h ∼ N (0 , I m ) be independent standard Gaussian vectors. For u ∈ R n + n and v ∈ R m + m we define ˜ u = ( u , . . . , u n ) and ˜ v = ( v , . . . , v m ) . Define C ∗ ( G ) = min u ∈ D u max v ∈ D v ˜ v T G ˜ u + Q ( u, v ) ,L ∗ ( g, h ) = min u ∈ D u max v ∈ D v k ˜ v k g T ˜ u + k ˜ u k h T ˜ v + Q ( u, v ) . Then we have:• For all t ∈ R , P (cid:16) C ∗ ( G ) ≤ t (cid:17) ≤ P (cid:16) L ∗ ( g, h ) ≤ t (cid:17) . • If D u and D v are convex and if Q is convex concave, then for all t ∈ RP (cid:16) C ∗ ( G ) ≥ t (cid:17) ≤ P (cid:16) L ∗ ( g, h ) ≥ t (cid:17) . Proof.
Let us consider the Gaussian processes: ( X ( u, v ) = k ˜ v k g T ˜ u + k ˜ u k h T ˜ v ,Y ( u, v ) = ˜ v T G ˜ u + k ˜ u kk ˜ v k z , where z ∼ N (0 , is independent from G . Let ( u, v ) , ( u , v ) ∈ D u × D v and compute E (cid:2) Y ( u, v ) Y ( u , v ) (cid:3) − E (cid:2) X ( u, v ) X ( u , v ) (cid:3) = k ˜ u kk ˜ v kk ˜ u kk ˜ v k + (˜ u T ˜ u )(˜ v T ˜ v ) − k ˜ v kk ˜ v k (˜ u T ˜ u ) − k ˜ u kk ˜ u k (˜ v T ˜ v )= (cid:0) k ˜ u kk ˜ u k − (˜ u T ˜ u ) (cid:1)(cid:0) k ˜ v kk ˜ v k − (˜ v T ˜ v ) (cid:1) ≥ . Therefore X and Y verify the covariance inequalities of Theorem G.2: one can apply Theorem G.2: P (cid:16) min u ∈ D u max v ∈ D v Y ( u, v ) + Q ( u, v ) ≤ t (cid:17) ≤ P (cid:16) min u ∈ D u max v ∈ D v Y ( u, v ) + Q ( u, v ) ≤ t (cid:17) , We have then P (cid:16) min u ∈ D u max v ∈ D v Y ( u, v ) + Q ( u, v ) ≤ t (cid:17) ≥ P (cid:16) min u ∈ D u max v ∈ D v Y ( u, v ) + Q ( u, v ) ≤ t (cid:12)(cid:12)(cid:12) z ≤ (cid:17) ≥ P (cid:16) min u ∈ D u max v ∈ D v ˜ v T G ˜ u + Q ( u, v ) ≤ t (cid:12)(cid:12)(cid:12) z ≤ (cid:17) = 12 P (cid:16) C ∗ ( G ) ≤ t (cid:17) , which proves that P (cid:16) min u ∈ D u max v ∈ D v ˜ v T G ˜ u + Q ( u, v ) ≤ t (cid:17) ≤ P (cid:16) min u ∈ D u max v ∈ D v k ˜ v k g T ˜ u + k ˜ u k h T ˜ v + Q ( u, v ) ≤ t (cid:17) . Let us suppose now that D u and D v are convex and that G is convex-concave. We now apply the inequality wejust proved, but with the role of u and v being switched (and − Q and − t instead of Q and t ): P (cid:16) min v ∈ D v max u ∈ D u ˜ v T G ˜ u − Q ( u, v ) ≤ − t (cid:17) ≤ P (cid:16) min v ∈ D v max u ∈ D u k ˜ v k g T ˜ u + k ˜ u k h T ˜ v − Q ( u, v ) ≤ − t (cid:17) , ( G, g, h ) and ( − G, − g, − h ) have the same law): P (cid:16) max v ∈ D v min u ∈ D u ˜ v T G ˜ u + Q ( u, v ) ≥ t (cid:17) ≤ P (cid:16) max v ∈ D v min u ∈ D u k ˜ v k g T ˜ u + k ˜ u k h T ˜ v + Q ( u, v ) ≥ t (cid:17) . By Proposition G.1, one can switch the min-max of the left-hand side, because Q is convex-concave and we areworking on convex sets D u and D v . For the right-hand side, we simply use the fact that: max v ∈ D v min u ∈ D u k ˜ v k g T ˜ u + k ˜ u k h T ˜ v + Q ( u, v ) ≤ min u ∈ D u max v ∈ D v k ˜ v k g T ˜ u + k ˜ u k h T ˜ v + Q ( u, v ) , to conclude the proof. (cid:3) G.4 Basic concentration results
We recall in this section some elementary concentration results, see Chapter 2 from [9] for a more detailed pre-sentation of these facts.
Definition G.2
A real random variable X is said to be• σ -sub-Gaussian if for every s ∈ R , log E e s ( X − E [ X ]) ≤ s σ , • ( v, c ) -sub-Gamma if for every s ∈ ( − /c, /c ) , log E e s ( X − E [ X ]) ≤ s v − c | s | ) . One deduces immediately from the above definition:
Proposition G.3
Let ( X , . . . , X n ) be independent real random variables. Define S = P ni =1 X i .• Suppose that for all i ∈ { , . . . , n } , X i is σ i -sub-Gaussian. Then S is P ni =1 σ i -sub-Gaussian.• Suppose that for all i ∈ { , . . . , n } , X i is ( v i , c i ) -sub-Gamma. Then S is (cid:0) P ni =1 v i , max c i (cid:1) -sub-Gamma. Proposition G.4
Let X be a real random variable.• if X is σ -sub-Gaussian, then for all t > P ( X − E [ X ] ≥ t ) ∨ P ( X − E [ X ] ≤ − t ) ≤ e − t σ , • if X is ( v, c ) -sub-Gamma, then for all t > P ( X − E [ X ] ≥ √ ct + vt ) ∨ P ( X − E [ X ] ≤ − ( √ ct + vt )) ≤ e − t . Remark 5.
The bound P ( X > √ vt + ct ) ≤ e − t implies that P ( X > t ) ≤ ( exp (cid:16) − t v (cid:17) for < t ≤ vc , exp (cid:0) − t c (cid:1) for t ≥ vc . Proposition G.5 If X is σ -sub-Gaussian and has mean µ , then X is a sub-Gamma random variable with parameters ( v = 16 σ + 4 µ σ ,c = 4 σ . roof. Let µ = E [ X ] and Y = X − µ . X = Y + 2 µY + µ . E (cid:2) ( Y ) (cid:3) = E [( X − µ ) ] ≤ σ , E (cid:2) ( Y ) q (cid:3) = E (cid:2) ( X − µ ) q (cid:3) ≤ q !(2 σ ) q = 12 q !(16 σ )(2 σ ) q − . By Bernstein’s inequality (see for instance Theorem 2.10 in [9]) log E e s ( Y − E [ Y ]) ≤ σ s − σ | s | ) . µY is µ σ -sub-Gaussian, therefore log E e µsY ≤ µ σ s and log E e s ( X − E [ X ]) = log E e s ( Y − E [ Y ])+2 µsY ≤
12 log E e s ( Y − E [ Y ]) + 12 log E e µsY ≤ σ s − σ s + 4 µ σ s ≤ (16 σ + 4 µ σ ) s − σ | s | .X is therefore a Sub-Gamma random variable with variance factor v = 16 σ + 4 µ σ and scale parameter c = 4 σ . (cid:3) Lemma G.1
Let X be a σ -sub-Gaussian random variable. Define m = E [ | X | ] . Let Y be a random variable bounded by .Then XY is m + 2 σ ) -sub-Gaussian. Proof.
We have | E [ XY ] | ≤ m , therefore E (cid:2) ( XY − E [ XY ]) q (cid:3) ≤ q − E [ X q ] + 2 q − m q ≤ q !(8 σ ) q + q !(4 m ) q ≤ q !(8 σ + 4 m ) q . (cid:3) G.5 Largest singular value of a Gaussian matrix
The largest singular value of a n × N matrix A is defined as σ max ( A ) = max k x k≤ k Ax k . The next classical result is a simple consequence of Slepian’s Lemma (see for instance [30], Section 3.3) and theclassical Gaussian concentration inequality (see for instance [9], Theorem 5.6).
Proposition G.6
Let $G$ be an $n \times N$ random matrix whose entries are i.i.d. $\mathcal N(0,1)$. For all $t \geq 0$ we have
$$\mathbb{P}\big(\sigma_{\max}(G) > \sqrt N + \sqrt n + t\big) \leq e^{-t^2/2}.$$
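Proposition G.6 is easy to see in simulation: the largest singular value of a Gaussian matrix concentrates just below $\sqrt N + \sqrt n$. A short check (ours; the dimensions and trial count are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(7)
n, N, trials = 200, 400, 200
smax = np.array([np.linalg.svd(rng.standard_normal((n, N)), compute_uv=False)[0]
                 for _ in range(trials)])

edge = np.sqrt(N) + np.sqrt(n)
print(f"sqrt(N) + sqrt(n)           = {edge:.2f}")
print(f"mean sigma_max over trials  = {smax.mean():.2f}")
print(f"P(sigma_max > edge + 1)     = {(smax > edge + 1).mean():.3f}   "
      f"(Gaussian bound: e^(-1/2) = {np.exp(-0.5):.3f})")
```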
Information and Inference: A Journal of the IMA, 3(3):224–294, 2014.
[2] Jean Barbier, Mohamad Dia, Nicolas Macris, and Florent Krzakala. The mutual information in random linear estimation. arXiv preprint arXiv:1607.02335, 2016.
[3] Jean Barbier, Florent Krzakala, Nicolas Macris, Léo Miolane, and Lenka Zdeborová. Optimal errors and phase transitions in high-dimensional generalized linear models. In Conference on Learning Theory, pages 728–731, 2018.
[4] Mohsen Bayati, Murat A Erdogdu, and Andrea Montanari. Estimating lasso risk and noise level. In Advances in Neural Information Processing Systems, pages 944–952, 2013.
[5] Mohsen Bayati and Andrea Montanari. The dynamics of message passing on dense graphs, with applications to compressed sensing. IEEE Transactions on Information Theory, 57(2):764–785, 2011.
[6] Mohsen Bayati and Andrea Montanari. The lasso risk for Gaussian matrices. IEEE Transactions on Information Theory, 58(4):1997–2017, 2012.
[7] Raphael Berthier, Andrea Montanari, and Phan-Minh Nguyen. State evolution for approximate message passing with non-separable functions. arXiv preprint arXiv:1708.03950, 2017.
[8] Peter J Bickel, Ya'acov Ritov, and Alexandre B Tsybakov. Simultaneous analysis of Lasso and Dantzig selector. The Annals of Statistics, 37(4):1705–1732, 2009.
[9] Stéphane Boucheron, Gábor Lugosi, and Pascal Massart. Concentration inequalities: A nonasymptotic theory of independence. Oxford University Press, 2013.
[10] Peter Bühlmann and Sara Van De Geer. Statistics for high-dimensional data: methods, theory and applications. Springer Science & Business Media, 2011.
[11] Emmanuel Candes and Terence Tao. The Dantzig selector: Statistical estimation when p is much larger than n. The Annals of Statistics, 35(6):2313–2351, 2007.
[12] Emmanuel J Candes and Terence Tao. Decoding by linear programming. IEEE Transactions on Information Theory, 51(12):4203–4215, 2005.
[13] Scott Chen and David L Donoho. Examples of basis pursuit. In Wavelet Applications in Signal and Image Processing III, volume 2569, pages 564–575. International Society for Optics and Photonics, 1995.
[14] David Donoho and Andrea Montanari. High dimensional robust M-estimation: Asymptotic variance via approximate message passing. Probability Theory and Related Fields, 166(3-4):935–969, 2016.
[15] David L Donoho. High-dimensional centrally symmetric polytopes with neighborliness proportional to dimension. Discrete & Computational Geometry, 35(4):617–652, 2006.
[16] David L Donoho and Iain M Johnstone. Minimax risk over $\ell_p$-balls for $\ell_q$-error. Probability Theory and Related Fields, 99(2):277–303, 1994.
[17] David L Donoho and Iain M Johnstone. Ideal spatial adaptation by wavelet shrinkage. Biometrika, 81(3):425–455, 1994.
[18] David L Donoho, Arian Maleki, and Andrea Montanari. Message-passing algorithms for compressed sensing.
Proceedings of the National Academy of Sciences, 106(45):18914–18919, 2009.
[19] David L Donoho, Arian Maleki, and Andrea Montanari. The noise-sensitivity phase transition in compressed sensing. IEEE Transactions on Information Theory, 57(10):6920–6941, 2011.
[20] David L Donoho and Jared Tanner. Neighborliness of randomly projected simplices in high dimensions. Proceedings of the National Academy of Sciences, 102(27):9452–9457, 2005.
[21] Bradley Efron. The estimation of prediction error: covariance penalties and cross-validation. Journal of the American Statistical Association, 99(467):619–632, 2004.
[22] Noureddine El Karoui. On the impact of predictor geometry on the performance on high-dimensional ridge-regularized generalized robust regression estimators. Probability Theory and Related Fields, 170(1-2):95–175, 2018.
[23] Nicolas Fournier and Arnaud Guillin. On the rate of convergence in Wasserstein distance of the empirical measure. Probability Theory and Related Fields, 162(3-4):707–738, 2015.
[24] Yehoram Gordon. Some inequalities for Gaussian processes and applications. Israel Journal of Mathematics, 50(4):265–289, 1985.
[25] Yehoram Gordon. On Milman's inequality and random subspaces which escape through a mesh in $\mathbb{R}^n$. In Geometric Aspects of Functional Analysis, pages 84–106. Springer, 1988.
[26] Adel Javanmard and Andrea Montanari. Confidence intervals and hypothesis testing for high-dimensional regression. The Journal of Machine Learning Research, 15(1):2869–2909, 2014.
[27] Iain M Johnstone. Function estimation and Gaussian sequence models. Unpublished manuscript, 2002.
[28] Sham M Kakade, Shai Shalev-Shwartz, and Ambuj Tewari. Regularization techniques for learning with matrices. Journal of Machine Learning Research, 13(Jun):1865–1890, 2012.
[29] Noureddine El Karoui. Asymptotic behavior of unregularized and ridge-regularized high-dimensional robust regression estimators: rigorous results. arXiv:1311.2445, 2013.
[30] Michel Ledoux and Michel Talagrand. Probability in Banach Spaces: isoperimetry and processes. Springer Science & Business Media, 2013.
[31] Junjie Ma, Ji Xu, and Arian Maleki. Optimization-based AMP for Phase Retrieval: The Impact of Initialization and $\ell_2$-regularization. arXiv preprint arXiv:1801.01170, 2018.
[32] Andrea Montanari and Emile Richard. Non-negative principal component analysis: Message passing algorithms and sharp asymptotics. IEEE Transactions on Information Theory, 62(3):1458–1484, 2016.
[33] Ali Mousavi, Arian Maleki, and Richard G Baraniuk. Consistent parameter estimation for lasso and approximate message passing. The Annals of Statistics, 45(6):2427–2454, 2017.
[34] Sahand N Negahban, Pradeep Ravikumar, Martin J Wainwright, Bin Yu, et al. A unified framework for high-dimensional analysis of $M$-estimators with decomposable regularizers. Statistical Science, 27(4):538–557, 2012.
[35] Ashkan Panahi and Babak Hassibi. A universal analysis of large-scale regularized least squares solutions. In Advances in Neural Information Processing Systems, pages 3381–3390, 2017.
[36] Sundeep Rangan. Generalized approximate message passing for estimation with random linear mixing. In Information Theory Proceedings (ISIT), 2011 IEEE International Symposium on, pages 2168–2172. IEEE, 2011.
[37] Garvesh Raskutti, Martin J Wainwright, and Bin Yu. Minimax rates of estimation for high-dimensional linear regression over $\ell_q$-balls. IEEE Transactions on Information Theory, 57(10):6976–6994, 2011.
[38] Galen Reeves and Henry D Pfister. The replica-symmetric prediction for compressed sensing with Gaussian matrices is exact. In
Information Theory (ISIT), 2016 IEEE International Symposium on, pages 665–669. IEEE, 2016.
[39] Ralph Tyrrell Rockafellar. Convex analysis. Princeton University Press, 2015.
[40] Philip Schniter and Sundeep Rangan. Compressive phase retrieval via generalized approximate message passing. IEEE Transactions on Signal Processing, 63(4):1043–1055, 2015.
[41] Charles M Stein. Estimation of the mean of a multivariate normal distribution. The Annals of Statistics, pages 1135–1151, 1981.
[42] Mihailo Stojnic. Recovery thresholds for $\ell_1$ optimization in binary compressed sensing. In Information Theory Proceedings (ISIT), 2010 IEEE International Symposium on, pages 1593–1597. IEEE, 2010.
[43] Mihailo Stojnic. A framework to characterize performance of lasso algorithms. arXiv preprint arXiv:1303.7291, 2013.
[44] Pragya Sur and Emmanuel J Candès. A modern maximum-likelihood theory for high-dimensional logistic regression. arXiv:1803.06964, 2018.
[45] Takashi Takahashi and Yoshiyuki Kabashima. A statistical mechanics approach to de-biasing and uncertainty estimation in lasso for random measurements. arXiv preprint arXiv:1803.09927, 2018.
[46] Christos Thrampoulidis, Ehsan Abbasi, and Babak Hassibi. Precise error analysis of regularized M-estimators in high dimensions. IEEE Transactions on Information Theory, 2018.
[47] Christos Thrampoulidis, Samet Oymak, and Babak Hassibi. Regularized linear regression: A precise analysis of the estimation error. In Conference on Learning Theory, pages 1683–1709, 2015.
[48] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), pages 267–288, 1996.
[49] Ryan J Tibshirani, Jonathan Taylor, et al. Degrees of freedom in lasso problems. The Annals of Statistics, 40(2):1198–1232, 2012.
[50] Joel A Tropp. Convex recovery of a structured signal from independent random linear measurements. In Sampling Theory, a Renaissance, pages 67–101. Springer, 2015.
[51] Sara Van de Geer, Peter Bühlmann, Ya'acov Ritov, Ruben Dezeure, et al. On asymptotically optimal confidence regions and tests for high-dimensional models. The Annals of Statistics, 42(3):1166–1202, 2014.
[52] Sara A Van De Geer, Peter Bühlmann, et al. On the conditions used to prove oracle results for the lasso. Electronic Journal of Statistics, 3:1360–1392, 2009.
[53] Cédric Villani. Optimal transport: old and new, volume 338. Springer Science & Business Media, 2008.
[54] Constantin Zalinescu.
Convex analysis in general vector spaces. World Scientific, 2002.
[55] Cun-Hui Zhang and Stephanie S Zhang. Confidence intervals for low dimensional parameters in high dimensional linear models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 76(1):217–242, 2014.