Sparse recovery by reduced variance stochastic approximation
aa r X i v : . [ s t a t . M L ] J un Sparse recovery by reduced variance stochastic approximation
Anatoli Juditsky * , Andrei Kulunchakov Hlib Tsyntseus , June 12, 2020
Abstract
In this paper, we discuss application of iterative Stochastic Optimization routines to theproblem of sparse signal recovery from noisy observation. Using Stochastic Mirror Descentalgorithm as a building block, we develop a multistage procedure for recovery of sparsesolutions to Stochastic Optimization problem under assumption of smoothness and quadraticminoration on the expected objective. An interesting feature of the proposed algorithm is itslinear convergence of the approximate solution during the preliminary phase of the routinewhen the component of stochastic error in the gradient observation which is due to badinitial approximation of the optimal solution is larger than the “ideal” asymptotic errorcomponent owing to observation noise “at the optimal solution.” We also show how one canstraightforwardly enhance reliability of the corresponding solution by using Median-of-Meanslike techniques.We illustrate the performance of the proposed algorithms in application to classical prob-lems of recovery of sparse and low rank signals in linear regression framework. We show,under rather weak assumption on the regressor and noise distributions, how they lead toparameter estimates which obey (up to factors which are logarithmic in problem dimensionand confidence level) the best known to us accuracy bounds.
In this paper, we consider the Stochastic Optimization problem of the form g ∗ = min x ∈ X (cid:8) g ( x ) = E { G ( x, ω ) } (cid:9) (1)where X is a given convex and closed subset of a Euclidean space E , G : X × Ω → R is a smoothconvex mapping, and E stands for the expectation with respect to unknown distribution of ω ∈ Ω(we assume that the corresponding expectation exists for every x ∈ X ). As it is usual in thissituation, we suppose that we have access to a stochastic “oracle” supplying “randomized”information about g ; we assume that the problem is solvable with the optimal solution x ∗ whichis sparse (we consider a general notion of sparsity structure of x ∗ as defined in Section 2.1 whichcomprises “usual” sparsity, group sparsity, and low rank matrix structures as basic examples). * LJK, Universit´e Grenoble Alpes, 700 Avenue Centrale, 38401 Domaine Universitaire de Saint-Martin-d’Hres,France, [email protected] Research of this author was supported by MIAI @ Grenoble Alpes (ANR-19-P3IA-0003). Universit´e Grenoble Alpes, Inria, CNRS, Grenoble INP, LJK, Grenoble, 38000, France, [email protected] LJK, Universit´e Grenoble Alpes, 700 Avenue Centrale, 38401 Domaine Universitaire de Saint-Martin-d’Hres,France, [email protected] s -sparse (i.e., with at most s nonvanishing components) vector x ∗ ∈ R n of regression coefficientsis to be recovered from the linear noisy observation η = Φ T x ∗ + σξ, (2)where Φ ∈ R N × n is the regression matrix, and ξ ∈ R N is zero-mean noise with unit covariancematrix; we are typically interested in the situation where the problem dimension is large, i.e.when n ≫ N . Note that the problem of sparse recovery from observation (2) with randomregressors (columns of the regression matrix Φ) φ i , i = 1 , ..., m can be cast as Stochastic Opti-mization. For instance, assuming that regressors φ i and noises ξ i , i = 1 , ..., N , are identicallydistributed, we may consider Stochastic Optimization problemmin x ∈ X n g ( x ) = E { ( η − φ T x ) } o (3)over s -sparse x ’s. There are essentially two approaches to solving (3). Note that observations η i and φ i provide us with unbiased estimates G ( x, ω i = [ φ i , η i ]) = k η i − φ Ti x k of the problemobjective g ( x ). Therefore, one can build a Sample Average Approximation (SAA) b g ( x ) = 1 N N X i =1 G ( x, ω i ) = N k η − Φ T x k of the objective g ( x ) of (3) and then solve the resulting Least Squares problem by a deterministicoptimization routine. A now standard approach to enhancing the sparsity of solutions is to useiterative thresholding [8, 26, 3, 23]. When applied to the linear regression problem (3), thistechnique amounts to using a gradient descent to minimize the Least Squares objective b g incombination with thresholding of approximate solutions to enforce sparsity. Another approachwhich refers to ℓ - and nuclear norm minimization allows to reduce problems of sparse or lowrank recovery to convex optimization. In particular, sparse recovery by Lasso and DantzigSelector has been extensively studied in the statistical literature [18, 13, 4, 53, 16, 14, 48, 22,17, 15, 31, 42, 35, 50, 20], among others). 
For instance, the celebrated Lasso estimate b x N, lasso in the sparse linear regression problem is a solution to the ℓ -penalized Least Squares problem b x N, lasso ∈ Argmin x (cid:8) N k η − Φ T x k + λ k x k (cid:9) (4)where λ ≥ ℓ - and nuclear norm minimization are proposed.In particular, recovery of any s -sparse (i.e., with at most s nonvanishing components) vector x ∗ is possible with “small error” if the empirical regressor covariance matrix b Σ = N ΦΦ T verifies acertain restricted conditioning assumption, e.g., Restricted Eigenvalue (RE) [4] or Compatibilitycondition [53]. The latter conditions roughly mean that if there is no “approximately s -sparsevectors,” i.e. vectors which are close to vectors with only s nonvanishing entries in the kernel of b Σ, and, moreover, for all vectors z which are “approximately sparse,” k b Σ z k ≥ λ k z k . The goodnews is that although these conditions are typically difficult to verify for individual matrices Φ,they are satisfied for several families of random matrices, such as Rademacher (with indepen-dent random ± R n , etc. For instance, when columns φ i of Φ are sampled independently2rom normal distribution φ i ∼ N (0 , Σ) with covariance matrix Σ with bounded diagonal ele-ments which satisfies κ Σ I (cid:22) Σ (here I is the n × n -identity matrix), κ Σ >
0, RE conditionholds with high probability for s as large as O (cid:16) Nκ Σ ln[ n ] (cid:17) [48]. The Restricted Strong Convexity (RSC) condition, analogous to the RE or Compatibilitycondition in this case, also ensures that iterative thresholding procedures converge linearly toan approximate solution with accuracy which is similar to that of Lasso or Dantzig Selectorestimation [3, 23] in this case.Another approach to solving (1) which refers to Stochastic Approximation (SA) may beused whenever there is a “stochastic oracle” providing an unbiased stochastic observation of thegradient ∇ g of the objective g of (1). For instance, note that the observable quantity ∇ G ( x, ω i ) = φ i ( φ Ti x − η i ) is an unbiased estimate of the gradient ∇ g ( x ) of the objective of (3), and so aniterative algorithm of Stochastic Approximation type can be used to build approximate solutionsto (3). In particular, different versions of Stochastic Approximation procedure were applied tosolve (3) under ℓ and sparsity constraint. Recall, that we are interested in high-dimensionalproblems, we are looking for bounds for recovery error which are “essentially independent”(logarithmic, at most) in problem dimension n . This requirement rules out the use of standard“Euclidean” Stochastic Approximation. Indeed, typical bounds for the expected inaccuracy E { g ( b x N ) } − g ∗ of Stochastic approximation contains the term proportional to σ E {k φ k } andthus proportional to n in the case of “dense” regressors with E {k φ k } = O ( n ). Therefore,unless regressors φ are sparse (or possess a special structure, e.g., when φ i are low rank matricesin the case of low rank matrix recovery), standard Stochastic Approximation leads to accuracybounds for sparse recovery which are proportional to dimension n of the parameter vector [47].In other words, our application calls for non-Euclidean Stochastic Approximation procedures,such as Stochastic Mirror Descent algorithm [44].In particular, [51, 52] study the properties of Stochastic Mirror Descent algorithm undersub-Gaussian noise assumption and show that approximate solution b x N after N iterations ofthe attain the bound g ( b x N ) − g ∗ = O (cid:16) σ p s log( n ) /N (cid:17) , often referred to as “slow rate” of sparserecovery. In order to improve the error estimates of Stochastic Approximation one may usemultistage algorithm under strong or uniform convexity assumption [32, 34, 24]. However, suchassumptions do not hold in the problems such as sparse linear regression problem, where theyare replaced by Restricted Strong Convexity conditions. For instance, the authors of [2] developa multistage procedure targeted at sparse recovery stochastic optimization problem (1) based onSMD algorithm of [30, 45] under bounded regressor and sub-Gaussian noise assumption. Theyshow, for instance, that when applied to the sparse linear regression, the ℓ -error k b x N − x ∗ k of the approximate solution b x N after N iterations of the proposed routine converges at the rate Here and in the sequel, we use notation A (cid:22) B for n × n symmetric matrices A and B such that B − A (cid:23) B − A is positive semidefinite. The reader acquainted with the compressive sensing theory will notice that the setting of the ℓ -recoveryproblem considered in this paper is rather different from the standard “setting,” but is rather similar in spirit tothat in [16, 14, 1, 6, 9]. 
Although, unlike [1, 6, 9] we do not assume any special structure of x ∗ apart from itssparsity, we suppose random regressors are independent of x ∗ , while in the “standard setting” one allows x ∗ todepend on the regressor matrix instead. Although the current setting seems less restrictive, we do not know anyresult stating that a recovery in this setting is possible under “essentially less restrictive” assumptions than thosefor the “standard” ℓ recovery. More generally, strong convexity of the objective associated with smoothness is a feature of the Euclideansetup. For instance, the conditioning of a smooth objective (the ratio of the Lipschitz constant of the gradient tothe constant of strong convexity) when measured with respect to the ℓ -norm cannot be less than n (the problemdimension) [34]. (cid:18) σκ Σ q s ln nN (cid:19) with high probability. While this “asymptotic” rate coincides with the best rateattainable by known to us algorithms for solving (3) the algorithm in [2] requires at least s ln[ n ] κ SMD iterations per stage, implying that the method in question can be used only if the numberof nonvanishing entries in the parameter vector is O (cid:18) κ Σ q N ln n (cid:19) (recall that the correspondinglimit is O (cid:16) Nκ Σ ln[ n ] (cid:17) for Lasso [48] and iterative thresholding procedures [3, 23]).Our goal in the present paper is to provide a refined analysis of Stochastic Approximationalgorithms for computing sparse solutions to (1) exploiting a variance reduction scheme utilizingin a special way smoothness of the problem objective. It allows to build a new acceleratedmultistage Stochastic Approximation algorithm. To give a flavor of the results we present below,we summarize the properties of the proposed procedure—Stochastic Mirror Descent for SparseRecovery (SMD-SR)—in the case of stochastic optimization problem (3) associated with sparselinear regression estimation problem. Let us assume that regressors φ i are a.s. bounded, i.e., k φ i k ∞ = O (1), the covariance matrix Σ = E { φ φ T } of regressors satisfies Σ (cid:23) κ Σ I ; we supposethat the noises ξ i are zero-mean with variance E { ξ i } ≤ σ , and that we are given R < ∞ and x ∈ R n such that E {k x − x ∗ k } ≤ R . • On the k -th stage of the SMD-SR algorithm we run N k iterations of the Mirror Descentrecursion and then “sparsify” the obtained approximate solution by zeroing out all but s entries of largest amplitudes. • SMD-SR algorithm operates in two phases. At the first (preliminary) phase we perform afixed number N k = O (cid:16) s ln nκ Σ (cid:17) of SMD iterations per stage so that the expected quadraticerror E {k b y N − x ∗ k } decreases linearly as total iteration count N grows with exponentproportional to κ Σ s ln n . When the expected quadratic error becomes O (cid:16) σ sκ Σ (cid:17) , we pass tothe second (asymptotic) phase of the method. • During the stages of the asymptotic phase, the number of iterations per stage grows as N k = 2 k N where k is the stage index, and the expected quadratic error decreases as O (cid:16) σ s ln nκ N (cid:17) where N is total iteration count.It may appear surprising that a stochastic algorithm converges linearly during the preliminaryphase, when the component of the error due to the observation noise is small (for instance, itconverges linearly in the “noiseless” case, cf. [47]) eliminating fast the initial error; its rateof convergence is similar to that of the deterministic gradient descent algorithm, when “fullgradient observation” ∇ g ( x ) is available. 
On the other hand, in the asymptotic regime, theprocedure attains the rate which is equivalent to the best known rates in this setting, and thisunder the model assumptions which are close to the weakest known today [23, 48].The paper is organized as follows. The analysis of the SMD-SR in the general setting is inSection 2. We define the general problem setting and introduce key notions used in the paperin Section 2.1. Then in Section 2.3 we reveal the multistage algorithm and study its basicproperties. Next, in Section 2.4 we show how sub-Gaussian confidence bounds for the error ofapproximate solutions can be obtained using an adopted analog of Median-of-Means approach.Finally, in Section 3 we discuss the properties of the method and conditions in which it leads However, this is not the best rate of estimation of sparse regression parameter vector, cf. [28]; the correspond-ing recovery routines have a different structure and do not rely on solving (3). In hindsight, the underlying idea can be seen as a generalization of the variance reduction device in [5].
4o “small error” solution when applied to sparse linear regression and low rank linear matrixrecovery problems.
Let E be a finite-dimensional real vector (Euclidean) space. Consider a Stochastic Optimizationproblem min x ∈ X [ E { G ( x, ω ) } ] (5)where X ⊂ E is a convex set with nonempty interior (a solid), ω is a random variable on aprobability space Ω with distribution P , and G : X × Ω → R . We suppose that the expectedobjective g ( x ) = E { G ( x, ω ) } is finite for all x ∈ X and is convex and differentiable on X . Let k · k be a norm on E , and let k · k ∗ be the conjugate norm, i.e., k s k ∗ = max x { s T x : k x k ≤ } , s ∈ E. We suppose that gradient ∇ g ( · ) of g ( · ) is Lipschitz-continuous on X : k∇ g ( x ′ ) − ∇ g ( x ) k ∗ ≤ Lk x − x ′ k , ∀ x, x ′ ∈ X, (6)that the problem is solvable with optimal value g ∗ = min x ∈ X g ( x ). Furthermore, we supposethat the optimal solution x ∗ to the problem is unique, and that g ( · ) satisfies quadratic growthcondition on X with respect to the Euclidean norm k · k [41], i.e., for all x ∈ Xg ( x ) − g ( x ∗ ) ≥ κ k x − x ∗ k . (7)In what follows, we assume that we have at our disposal a stochastic (gray box) oracle —a devicewhich can generate ω ∼ P and compute, for any x ∈ X a random unbiased estimation of ∇ g ( x ).From now on we make the following assumption about the structure of the gradient observation: Assumption [S1]. G ( · , ω ) is smooth on X , i.e., it is continuously differentiable on X foralmost all ω ∈ Ω , and k∇ G ( x, ω ) − ∇ G ( x ′ , ω ) k ∗ ≤ L ( ω ) k x − x ′ k with E {L ( ω ) } ≤ ν < ∞ . We assume that E {∇ G ( x, ω ) } = ∇ g ( x ) and E {k ∇ G ( x, ω ) − ∇ g ( x ) | {z } =: ζ ( x,ω ) k ∗ } ≤ ς ( x ) , ∀ x ∈ X, and, furthermore, there are ≤ κ , κ ′ < ∞ such that the bound holds: ς ( x ) ≤ κ ν [ g ( x ) − g ∗ − h∇ g ( x ∗ ) , x − x ∗ i ] | {z } =: V g ( x ∗ ,x ) + κ ′ E {k ζ ( x ∗ , ω ) k ∗ } | {z } =: ς ∗ . (8) In what follows ∇ G ( · , ω ) replaces notation ∇ x G ( · , ω ) for the gradient of G w.r.t. the first argument. emark. Assumption S1 and, in particular, bound (8) are essential to the subsequent de-velopments and certainly merit some comments. We postpone the corresponding discussion toSection 3 where we present several examples of observation models in which this assumption nat-urally holds. For now, let us note that relation (8) is rather characteristic to the case of smoothstochastic observation. Indeed, let us consider the situation where the Lipschitz constant L ( ω )is a.s. bounded, i.e. L ( ω ) ≤ ν . In this case we have ς ( x ) = E (cid:8) k∇ G ( x, ω ) − ∇ g ( x ) k ∗ (cid:9) ≤ (cid:16) E (cid:8) k∇ G ( x, ω ) − ∇ G ( x ∗ , ω ) k ∗ (cid:9) / + k∇ g ( x ) − ∇ g ( x ∗ ) k ∗ + E (cid:8) k∇ G ( x ∗ , ω ) − ∇ g ( x ∗ ) k ∗ (cid:9) / (cid:17) . On the other hand, by Lipschitz continuity of ∇ G ( · , ω ) we have G ( x, ω ) − G ( x ∗ , ω ) ≥ h∇ G ( x ∗ , ω ) , x − x ∗ i + (2 ν ) − k∇ G ( x, ω ) − ∇ G ( x ∗ , ω ) k ∗ implying that ς ( x ) ≤ (cid:16) [2 ν E { G ( x, ω ) − G ( x ∗ , ω ) − h∇ G ( x ∗ , ω ) , x − x ∗ i} ] / + [2 ν ( g ( x ) − g ( x ∗ ) − h∇ g ( x ∗ ) , x − x ∗ i )] / + ς ∗ (cid:17) ≤ ν [ g ( x ) − g ∗ − h∇ g ( x ∗ ) , x − x ∗ i ] + 2 ς ∗ . Sparsity structure.
In what follows we assume to be given a sparsity structure [27] on E —afamily P of projector mappings P = P on E with associated nonnegative weights π ( P ). For anonnegative real s we set P s = { P ∈ P : π ( P ) ≤ s } . Given s ≥ x ∈ E s -sparse if there exists P ∈ P s such that P x = x . We will make thefollowing assumption. Assumption [S2]
Given x ∈ X one can efficiently compute a “sparse approximation” of x —an optimal solution x s = sparse( x ) to the optimization problem min k x − z k over s -sparse z ∈ X (9) where k · k is the Euclidean norm: k z k = h z, z i / . Furthermore, for any s -sparse z ∈ E thenorm k · k satisfies k z k ≤ √ s k z k . In what follows we refer to x s as “sparsification of x .” We are mainly interested in thefollowing “standard examples”:1. “Vanilla” sparsity: in this case E = R n with the standard inner product, P is comprisedof projectors on all coordinate subspaces of R n , π ( P ) = rank( P ), and k · k = k · k .Assumption S2 clearly holds, for instance, when X is orthosymmetric, e.g., a ball of ℓ p -norm on R n , 1 ≤ p ≤ ∞ .2. Group sparsity: E = R n , and we partition the set { , ..., n } of indices into K nonoverlap-ping subsets I , ..., I K , so that to every x ∈ R n we associate blocks x k with correspondingindices in I k , k = 1 , ..., K . Now P is comprised of projectors P = P I onto subspaces E I = { [ x , ..., x K ] ∈ R n : x k = 0 ∀ k / ∈ I } associated with subsets I of the index set { , ..., K } . We set π ( P I ) = card I , and define k x k = P Kk =1 k x k k — block ℓ /ℓ -norm. Same as above, Assumption S2 holds in this case when X is “block-symmetric,” for in-stance, is a ball of block norm k · k . 6. Low rank sparsity structure: in this example E = R p × q with, for the sake of definiteness, p ≥ q , and the Frobenius inner product. Here P is the set of mappings P ( x ) = P ℓ xP r where P ℓ and P r are, respectively, q × q and p × p orthoprojectors, and k · k is the nuclearnorm k x k = P qi =1 σ i ( x ) where σ ( x ) ≥ σ ( x ) ≥ ... ≥ σ q ( x ) are singular values of x .In this case Assumption S2 holds due to the EckartYoung approximation theorem, itsuffices that X is a ball of a Schatten norm k x k r = ( P qi =1 σ ri ( x )) /r , 1 ≤ r ≤ ∞ .In what follows we suppose that the optimal solution x ∗ to problem (5) is s -sparse. Our objectiveis to build approximate solutions b x N to problem (5) utilizing N queries to the stochastic oracle.We quantify the accuracy of an approximate solution b x using the following risk measures: • Recovery risks: maximal over x ∗ ∈ X expected squared errorsRisk |·| ( b x | X ) = sup x ∗ ∈ X E {| b x − x ∗ | } where | · | stands for k · k - or k · k -norm, and ǫ -risks of recovery—the smallest maximalover x ∗ ∈ X radius of 1 − ǫ -confidence ball of norm | · | centered at b x :Risk |·| ,ǫ ( b x | X ) = inf (cid:26) r : sup x ∗ ∈ X Prob {| b x − x ∗ | ≥ r } ≤ ǫ (cid:27) • Prediction risks: maximal over x ∗ ∈ X expected suboptimalityRisk g ( b x | X ) = sup x ∗ ∈ X E { g ( b x ) } − g ∗ , of b x and the smallest maximal over x ∗ ∈ X − ǫ -confidence intervalRisk g,ǫ ( b x | X ) = inf (cid:26) r : sup x ∗ ∈ X Prob { g ( b x ) − g ∗ ≥ r } ≤ ǫ (cid:27) . In what follows, we use a generic notation C for an absolute constant; notation a . b meansthat the ratio a/b is bounded by an absolute constant. Notation and definitions.
Let ϑ : E → R be a continuously differentiable convex functionwhich is strongly convex with respect to the norm k · k , i.e., h∇ ϑ ( x ) − ∇ ϑ ( x ′ ) , x − x ′ i ≥ k x − x ′ k , ∀ x, x ′ ∈ E. From now on, w.l.o.g. we assume that ϑ ( x ) ≥ ϑ (0) = 0. We say that Θ is the constant ofquadratic growth of ϑ ( · ) if ∀ x ∈ E ϑ ( x ) ≤ Θ k x k . Clearly, Θ ≥ . If, in addition, Θ is “not too large,” and for any x ∈ X , a ∈ E and β > z ∈ X {h a, z i + βϑ ( z − x ) } can be easily computed we say that distance-generating function (d.-g.f.) ϑ is “prox-friendly.”We present choices of prox-friendly d.-g.f.’s relative to the norm used in application sections.7e also utilize associated Bregman divergence V x ( x, z ) = ϑ ( z − x ) − ϑ ( x − x ) − h∇ ϑ ( x − x ) , z − x i , ∀ z, x, x ∈ X. For Q ∈ R p × q we denote kQk ∞ = max ij | [ Q ] ij | ;for symmetric positive-definite Q ∈ R n × n and x ∈ R n we denote k x k Q = p x T Qx.
Stochastic Mirror Descent algorithm.
For x, x ∈ X , u ∈ E , and β > proximal mapping Prox β ( u, x ; x ) := argmin z ∈ X (cid:8) h u, z i + βV x ( x, z ) (cid:9) = argmin z ∈ X (cid:8) h u − β h∇ ϑ ( x − x ) , z i + βϑ ( z − x ) (cid:9) . (10)For i = 1 , , . . . , consider Stochastic Mirror Descent recursion x i = Prox β i − ( ∇ G ( x i − , ω i ) , x i − ; x ) , x ∈ X, (11)Here β i > i = 0 , , . . . , is a stepsize parameter to be defined later, and ω , ω , . . . are inde-pendent identically distributed (i.i.d.) realizations of random variable ω , corresponding to theoracle queries at each step of the algorithm.The approximate solution to problem (5) after N iterations is defined as weighted average b x N = " N X i =1 β − i − − N X i =1 β − i − x i . (12)The next result describes some useful properties of the recursion (11). Proposition 2.1
Suppose that SMD algorithm is applied to problem (5) in the situation de-scribed in this section. We assume that Assumption S1 holds and that initial condition x ∈ X is independent of ω i , i = 1 , , ... and such that E {k x − x ∗ k } ≤ R ; we use constant stepsizes β i ≡ β ≥ κ ν, i = 1 , , ..., m. Then approximate solution b x m = m P mi =1 x i after m steps of the algorithm satisfies E { g ( b x m ) } − g ∗ ≤ R m (cid:18) Θ β + κ ν β (cid:19) + 2 κ ′ ς ∗ β . (13) We assume to be given
R < ∞ and x ∈ X such that k x ∗ − x k ≤ R , along with problemparameters κ , κ ′ , ν, ς ∗ , κ and an upper bound ¯ s for signal sparsity. We are using the StochasticMirror Descent algorithm and apply the iterative approach of [34, 33] to improve its accuracybounds. The proposed Stochastic Mirror Descent algorithm for Sparse Recovery (SMD-SR)works in stages; the stages are split into two groups—phases—corresponding to two essentiallydifferent regimes of the method. We refer to the first phase as preliminary, and refer to thesecond as asymptotic. 8 lgorithm 1 [SMD-SR]
1. Preliminary phase
Initialization:
Set y = x , R = R . β = 2 κ ν, m = (cid:5) κ − ¯ s (8Θ κ + 1) ν (cid:4) (14)(here ⌋ a ⌊ stands for the smallest integer greater or equal to a ). Put K = (cid:23) log (cid:18) R κν κ ς ∗ ¯ s κ ′ (cid:19)(cid:22) and run K = min (cid:26)(cid:22) Nm (cid:23) , K (cid:27) stages of the preliminary phase (here ⌊ a ⌋ stands for the “usual” integer part – thelargest integer less or equal to a ). Stage k = 1 , ... : Compute approximate solution b x m ( y k − , β ) after m iterations ofSMD algorithm with constant stepsize parameter β , corresponding to the initialcondition x = y k − . Then define y k as “ s -sparsification” of b x m ( y k − , β ), i.e., y k = sparse( b x m ( y k − , β )). Output: define b y (1) = y K and b x (1) = b x m ( y K − , β ) as approximate solutions at the endof the phase.2. Set M = N − m K and m k = (cid:23)
512 ¯ s Θ ν κ κ k (cid:22) , k = 1 , ... If m > M terminate and output b y N = b y (1) and b x N = b x (1) as approximate solutions bythe procedure; otherwise, continue with stages of the asymptotic phase.Asymptotic phase Initialization:
Set y ′ = b y (1) and β k = 2 k ν κ , k = 1 , ... . Stage k = 1 , ... : Compute b x m k ( y ′ k − , β k ); same as above, define y ′ k = sparse( b x m k ( y ′ k − , β k )). Output:
Run K ′ = max ( k : k X i =1 m i ≤ M ) asymptotic stages and output b y N = y ′ K ′ and b x N = b x m K ( y ′ K ′ − , β K ′ ).Properties of the proposed procedure are summarized in the following statement. Theorem 2.1
In the situation of this section, suppose that N ≥ m so at least one preliminarystage of Algorithm 1 is completed. Then there is an absolute c > such that approximatesolutions b x N and b y N produced by the algorithm satisfy Risk g ( b x N | X ) ≤ κR ¯ s exp (cid:26) − cN κ Θ κ ¯ sν (cid:27) + C ς ∗ ¯ s κ ′ Θ κN , (15)Risk k·k ( b y N | X ) ≤ s Risk k·k ( b y N | X ) ≤ s Risk k·k ( b x N | X ) . R exp (cid:26) − cN κ Θ κ ¯ sν (cid:27) + Θ κ ′ ς ∗ ¯ s κ N . (16)9 .4 Enhancing reliability of SMD-SR solutions
In this section, our objective is to build approximate solutions to problem (5) utilizing Algorithm1 which obey “sub-Gaussian type” bounds on their ǫ -risks. Note that bounds (15) and (16) ofTheorem 2.1 do allow only for Chebyshev-type bounds for risks of b y N and b x N . Nevertheless,their confidence can be easily improved by applying, for instance, an adapted version of “median-of-means” estimate [44, 40]. Reliable recovery utilizing geometric median of SMD-SR solutions.
Suppose thatavailable sample of length N can be split into L independent samples of length M = N/L (for the sake of simplicity let us assume that N is a multiple of L ). We run Algorithm 1 oneach subsample thus obtaining L independent recoveries b x (1) M , ..., b x ( L ) M and compute “enhancedsolutions” using an aggregation procedure of geometric median-type. Note that we are in thesituation where Theorem 2.1 applies, meaning that approximate solutions b x (1) M , ..., b x ( L ) M satisfy ∀ ℓ E { g ( b x ( ℓ ) M ) } − g ∗ ≤ τ M := κR ¯ s exp (cid:26) − cM κ Θ κ ¯ sν (cid:27) + C ς ∗ ¯ s κ ′ Θ κM , (17)and so ∀ ℓ E {k b x ( ℓ ) M − x ∗ k } ≤ θ M := 2 κ τ M . R ¯ s exp (cid:26) − cM κ Θ κ ¯ sν (cid:27) + Θ κ ′ ς ∗ ¯ sκ M . (18)We are to select among b x ( ℓ ) M the solution which attains similar bounds “reliably.”1. The first reliable solution b x N, − ǫ of x ∗ is a “pure” geometric median of b x (1) M , ..., b x ( L ) M : weput b x N, − ǫ ∈ Argmin x L X ℓ =1 k x − b x ( ℓ ) M k , (19)and then define b y N, − ǫ = sparse( b x N, − ǫ ).Computing reliable solutions b x N, − ǫ and b y N, − ǫ as optimal solutions to (19) amounts tosolving a nontrivial optimization problem. A simpler reliable estimation can be computedby replacing the geometric median b x N, − ǫ by its “empirical counterparts” (note that, num-ber L of solutions to be aggregated is not large—it is typically order of ln[1 /ǫ ]).2. We can replace b x N, − ǫ with b x ′ N, − ǫ ∈ Argmin x ∈{ b x (1) M ,..., b x ( L ) M } L X ℓ =1 k x − b x ( ℓ ) M k and compute its sparse approximation b y ′ N, − ǫ = sparse( b x ′ N, − ǫ ).3. Another reliable solution (with slightly better guarantees) was proposed in [25]. Let i ∈{ , ..., L } , we set r ij = k b x ( i ) M − b x ( j ) M k r i (1) ≤ r i (2) ≤ ... ≤ r i ( L − corresponding order statistics (i.e., r i · ’s sorted in theincreasing order). We define reliable solution b x ′′ N, − ǫ = b x ( b i ) M where b i ∈ Argmin i ∈{ ,...,L } r i ⌉ L/ ⌈ (20)(here ⌉ a ⌈ stands for the smallest integer strictly greater than a ), and put b y ′′ N, − ǫ =sparse( b x ′′ N, − ǫ ). Theorem 2.2
Let ǫ ∈ (0 , ] , and let x N (resp. y N ) be one of reliable solutions b x − ǫ,N , b x ′ − ǫ,N and b x ′′ − ǫ,N (resp., b y − ǫ,N , b y ′ − ǫ,N and b y ′′ − ǫ,N ) described above using L = ⌋ α ln[1 /ǫ ] ⌊ independentapproximate solutions b x (1) M , ..., b x ( L ) M by Algorithm 1. When N ≥ Lm we have Risk k·k ,ǫ ( y N | X ) ≤ √ s Risk k·k ,ǫ ( y N | X ) ≤ √ s Risk k·k ,ǫ ( x N | X ) . R exp (cid:26) − cN κ Θ κ ¯ sν ln[1 /ǫ ] (cid:27) + ς ∗ ¯ sκ r Θ κ ′ ln[1 /ǫ ] N . (21)
Remark.
Notice that the term ln[1 /ǫ ] enters the bound (21) as a multiplier which is typicalfor accuracy estimates of solutions which relies upon median to enhance confidence. A betterdependence on reliability tolerance parameter with the corresponding term entering as in Θ +ln[1 /ǫ ] may be obtained for algorithm utilizing “trimmed” stochastic gradients [29]. Reliable solution aggregation.
Let us assume that two independent observation samplesof lengths N and K are available. In the present approach, we use the first sample to compute,same as in the construction presented above, L independent approximate SMD-SR solutions b x ( ℓ ) M , ℓ = 1 , ..., L , M = N/L . Then we “aggregate” b x (1) M , ..., b x ( L ) M —select the best of them in termsof the objective value g ( b x ( ℓ ) M ) by computing reliable estimations of differences g ( b x ( i ) M ) − g ( b x ( j ) M )using observations of the second subsample.The proposed procedure for reliable selection of the “best” solution b x ( ℓ ) M is as follows. Algorithm 2 [Reliable aggregation]
Initialization:
Algorithm parameters are ǫ ∈ (0 , ], L ′ ∈ Z + and m = K/L ′ (for thesake of simplicity we assume, as usual, that K = mL ′ ). We assume to be given L points b x (1) M , ..., b x ( L ) M (approximate solution of the first step).We compute b x ′′ N, − ǫ = b x ( b i ) M the reliable solution as defined in (20) and denote b I = { i , ..., i ⌉ L/ ⌈ } ,the set of indices of ⌉ L/ ⌈ closest to b x ′′ N, − ǫ points among b x (1) M , ..., b x ( L ) M . Comparison procedure:
We split the (second) sample ω K into L ′ independent subsamples ω ℓ , ℓ = 1 , ..., L ′ of size m . For all i ∈ b I we compute the index b v i = max j ∈ b I, j = i (cid:26) median ℓ [ b v ℓji ] − ρ ij (cid:27) The exact value of the numeric constant α is specific for each construction, and can be retrieved from theproof of the theorem. b v ℓji = 1 m m X k =1 (cid:10) ∇ G ( b x ( j ) M + t k ( b x ( i ) M − b x ( j ) M ) , ω ℓk ) , b x ( i ) M − b x ( j ) M (cid:11) , ℓ = 1 , ..., L ′ , are estimates of v ji = g ( b x ( i ) M ) − g ( b x ( j ) M ), t k = k − m , k = 1 , ..., m , and coefficients ρ ij > r ij = k b x ( i ) M − b x ( j ) M k . • Output: We say that x ( i ) M is admissible if b v i ≤
0. When the set of admissible b x ( i ) M ’s isnonempty we define the procedure output x N + K, − ǫ as one of admissible b x ( i ) M ’s, and define x N + K, − ǫ = b x (1) M otherwise.Now, consider the following (cf. Assumption S1 ) Assumption [S3].
There are ≤ χ, χ ′ < ∞ such that for any x ∈ X and z ∈ E the followingbound holds: E {h ζ ( x, ω ) , z i } ≤ k z k [ χ L ( g ( x ) − g ∗ ) + χ ′ ς ∗ ] (22) where L is the Lipschitz constant of the gradient ∇ g of g with respect to the Euclidean norm, k∇ g ( x ′ ) − ∇ g ( x ′′ ) k ≤ L k x ′ − x ′′ k , ∀ x ′ , x ′′ ∈ X. Theorem 2.3
Let Assumption S3 hold, and let τ M and θ M be as in (17) and (18) respectively.Further, in the situation of this section, let ǫ ∈ (0 , ] , L = ⌋ α ln[1 /ǫ ] ⌊ for large enough α , and let x N + K, − ǫ be an approximate solution by Algorithm 2 in which we set L ′ ≥ k /ε ] j and ρ ij = 2 r ij r L χm ( γ ( r ij ) + τ M ) + 2 r ij ς ∗ r χ ′ m where γ ( r ) = h r r χ L m + τ M i + 4 rζ ∗ r χ ′ m ! / . (23) Then
Risk g,ǫ ( x N + K, − ǫ | X ) ≤ ¯ γ := γ (8 θ M ) , In particular, when K = mL ′ ≥ c max n χ L ln[1 /ǫ ] κ , Nχ ′ Θ κ ′ ¯ s o for an appropriate absolute c > , onehas Risk g,ǫ ( x N + K, − ǫ | X ) . κR ¯ s exp (cid:26) − cN κ Θ κ ¯ sν ln[1 /ǫ ] (cid:27) + ς ∗ ¯ s Θ κ ′ ln[1 /ǫ ] κN . Let us consider the problem of recovery of a sparse signal x ∗ ∈ R n , n ≥
3, from independentand identically distributed observations η i = φ Ti x ∗ + σξ i , i = 1 , , ..., N φ i and ξ i mutually independent and such that E { φ i φ Ti } = Σ, κ Σ I (cid:22) Σ, and k Σ k ∞ ≤ υ ,with known κ Σ > υ ; we also assume that E { ξ i } = 0 and E { ξ i } ≤ x ∗ is s -sparse . Furthermore, we assume to be given a convex and closedsubset X of R n (e.g., a large enough ball of ℓ - or ℓ -norm centered at the origin) such that x ∗ ∈ X , along with R < ∞ and x ∈ X such that k x ∗ − x k ≤ R .We are about to apply Stochastic Optimization approach described in Section 2. To thisend, consider the Stochastic Optimization problemmin x ∈ X g ( x ) = E (cid:8) ( η − φ T x ) | {z } =: G ( x,ω =[ φ,η ]) (cid:9) . (24)Note that x ∗ is the unique optimal solution to the above problem. Indeed, we have g ( x ) = E (cid:8) [ φ T ( x ∗ − x ) + σξ ] (cid:9) = ( x − x ∗ ) T Σ( x − x ∗ ) | {z } =: k x − x ∗ k + σ . Observe that ∇ g ( x ) = Σ( x − x ∗ ) = E {∇ G ( x, ω ) } where ∇ G ( x, ω ) = φφ T ( x − x ∗ ) − σξφ, (25)and ζ ( x, ω ) = ∇ G ( x, ω ) − ∇ g ( x ) = [ φφ T − Σ]( x − x ∗ ) − σξφ. Further, function g is strongly convex with respect to ℓ -norm (and thus quadratically minorated)with parameter κ = κ Σ . We set k · k = k · k with k · k ∗ = k · k ∞ , and we use “ ℓ -proximal setup”of the SMD-SR algorithm with quadratically growing for n > ϑ ( x ) = e ln( n ) n ( p − − p ) /p k x k p , p = 1 + 1ln n , the corresponding Θ satisfying Θ ≤ e ln n .Note that ς ( x ) = E (cid:8) k∇ G ( x, ω ) − ∇ g ( x ) k ∞ (cid:9) = E (cid:8) k [ φφ T − Σ]( x − x ∗ ) − σξφ k ∞ (cid:9) . Therefore, in our present situation Assumption S1 reads E (cid:8) k [ φφ T − Σ]( x − x ∗ ) − σξφ k ∞ (cid:9) ≤ κ ν k x − x ∗ k + ς ∗ . (26)The following statement is a straightforward corollary of Theorems 2.1 and 2.2; it describes theproperties of approximate solutions by Algorithm 1 when applied to the observation model inthis section. We assume that problem parameters—values κ , ν, κ Σ , σ and an upper bound ¯ s on sparsity of x ∗ —are known. Proposition 3.1
Suppose that (26) holds.(i) Let the sample size N satisfy N ≥ m = (cid:23) ν ¯ sκ Σ (4 e κ ln[ n ] + 1) (cid:22) o at least one preliminary stage of Algorithm 1 is completed. Then approximate solutions b x N and b y N produced by the algorithm satisfy Risk k·k ( b y N | X ) ≤ s Risk k·k ( b x N | X ) . R exp (cid:26) − cN κ Σ κ ¯ sν ln n (cid:27) + νσ ¯ s ln nκ N , (27)Risk g ( b x N | X ) . κ Σ R ¯ s exp (cid:26) − cN κ Σ κ ¯ sν ln n (cid:27) + νσ ¯ s κ ′ ln nκ Σ N . (ii) Furthermore, when observation size satisfies N ≥ αm ln[1 /ǫ ] with large enough absolute α > , − ǫ reliable solutions b y N, − ǫ and b x N, − ǫ as defined in Section 2.4 satisfy Risk k·k ,ǫ ( b y N, − ǫ | X ) ≤ √ s Risk k·k ,ǫ ( b y N, − ǫ | X ) ≤ √ s Risk k·k ,ǫ ( b x N, − ǫ | X ) . R exp (cid:26) − cN κ Σ κ ¯ sν ln[1 /ǫ ] ln n (cid:27) + σ ¯ sκ Σ r ν ln[1 /ǫ ] ln nN , (28) with b x ′ N, − ǫ , b x ′′ N, − ǫ and b y ′ N, − ǫ , b y ′′ N, − ǫ verifying similar bounds. Note that Assumption S3 in the case of sparse linear regression holds when for some 1 ≤ χ < ∞∀ x ∈ X, z ∈ R n E n(cid:0) z T φ (cid:1) (cid:0) φ T ( x − x ∗ ) (cid:1) o ≤ χ k z k σ (Σ) k x − x ∗ k (29)where σ (Σ) is the principal eigenvalue (the spectral norm) of Σ. Indeed, in this case we have E (cid:8) ( z T ζ ( x, ω )) (cid:9) = E { (cid:0) z T { [ φφ T − Σ]( x − x ∗ ) − σφξ } (cid:1) } = E { (cid:0) z T [ φφ T − Σ]( x − x ∗ ) (cid:1) } + σ E { ξ (cid:0) z T φ (cid:1) }≤ E n(cid:0) z T φ (cid:1) (cid:0) φ T ( x − x ∗ ) (cid:1) o + σ k z k σ (Σ) ≤ k x − x ∗ k | {z } = g ( x ) − g ∗ χ k z k σ (Σ) + σ (Σ) σ k z k implying Assumption S3 with χ ′ = σ (Σ) /ν .The following result is a corollary of Theorem 2.3. Proposition 3.2
Suppose that (26) and (29) hold true, and let N ≥ c max (cid:26) κ ν ¯ sκ Σ ln[1 /ǫ ] ln n, χσ (Σ) κ Σ ln[1 /ǫ ] (cid:27) with large enough c > . Then aggregated solution x N, − ǫ (with K = N ) by Algorithm 2 satisfies Risk g,ǫ ( x N, − ǫ | X ) . κ Σ R ¯ s exp (cid:26) − cN κ Σ κ ¯ sν ln[1 /ǫ ] ln n (cid:27) + σ ν ¯ s ln[1 /ǫ ] ln nκ Σ N . (30)Note that when σ (Σ) = O ( ν ln n ) and κ and χ are both O (1) bounds (28) and (30) hold for N ≥ c ν ¯ sκ Σ ln[1 /ǫ ] ln n . 14 emark. Results of Propositions 3.1 and 3.2 merit some comments. If compared to nowstandard accuracy bounds for sparse recovery by ℓ -minimization [4, 10, 11, 31, 48, 50, 53, 14],to the best of our knowledge, (26) and (29) provide the most relaxed conditions under whichthe bounds such as (27)–(30) can be established. An attentive reader will notice a degradationof bounds (28) and (30) with respect to comparable results [20, 31, 48] as far as dependence infactors which are logarithmic in n and ǫ − is concerned—bound (21) depends on the productln[ n ] ln[1 /ǫ ] of these terms instead of the sum ln[ n ] + ln[ ǫ − ] in the “classical” results. Thisseems to be “an artifact” of the reliability enhancement approach using median of estimatorswe have adopted in this work, cf. the comment after Theorem 2.2. Nevertheless, it is rathersurprising to see that conditions on the regressor model in Proposition 3.1, apart from positivedefiniteness of regressor covariance matrix, essentially resume to (cf. (26)) E (cid:8) k φφ T z k ∞ (cid:9) . ν k z k ∀ z ∈ R n . Below we consider some examples of situations where bounds (26) and (29) hold with constantswhich are “almost” dimension-independent, i.e. are, at most, logarithmic in problem dimension.
When this is the case, and when observation count N satisfies N ≥ αm ln[1 /ǫ ] ln[ R/σ ] for largeenough absolute α , so that the preliminary phase of the algorithm is completed, the bounds ofPropositions 3.1 and 3.2 coincide (up to already mentioned logarithmic in n and 1 /ǫ factors)with the best accuracy bound available for sparse recovery in the situation in question. Sub-Gaussian regressors: suppose now that φ i ∼ SG (0 , S ), i.e., regressors φ i are sub-Gaussian with zero mean and matrix parameter S , meaning that E (cid:8) e u T φ (cid:9) ≤ e uT Su for all u ∈ R n .Let us assume that sub-Gaussianity matrix S is “similar” to the covariance matrix Σ of φ ,i.e. S (cid:22) µ Σ with some µ < ∞ . Note that E { ( φ T z ) } ≤ z T Sz ) ≤ µ k z k , and thus E { ( z T φφ T x ) } ≤ E { ( z T φ ) } / E { ( x T φ ) } / ≤ z T Sz x T Sx ≤ µ σ (Σ) k z k k x k , what is (29) with χ = 16 µ . Let us put ¯ υ = max i [ S ] ii . One easily verifies that in this case ν = E {k φ k ∞ } ≤ υ (ln[2 n ] + 1) ≤ µυ (ln[2 n ] + 1) , and E {k φ k ∞ } ≤ υ (ln [2 n ] + 2 ln[2 n ] + 2) ≤ µ υ (ln [2 n ] + 2 ln[2 n ] + 2) . As a result, we have ζ ( x ) = E (cid:8) k [ φφ T − Σ]( x − x ∗ ) − σξφ k ∞ (cid:9) ≤ h ( E {k φ k ∞ } ) / ( E { ( φ T ( x − x ∗ )) } ) / + σ ( E {k φ k ∞ } ) / + √ υ k x − x ∗ k Σ i ≤ hp υ (ln[2 n ] + 2) k x − x ∗ k S + σ p υ (ln[2 n ] + 1) + √ υ k x − x ∗ k Σ i ≤ (cid:0) µ p n ] + 2) + 1 (cid:1) υ k x − x ∗ k + 4 µυ (ln[2 n ] + 1) σ . whence, S1 holds with κ ν . µ υ ln n , κ ′ .
1, and ς ∗ . µυσ ln n . Note that a similar deterioration was noticed in [14]. In the case of “isotropic sub-Gaussian” regressors, see [38], the bounds of Proposition 3.1 are comparable tobounds of [37, Theorem 5] for Lasso recovery under relaxed moment assumptions on the noise ξ . Bounded regressors: we assume that k φ i k ∞ ≤ µ a.s.. One has ς ( x ) = E (cid:8) k [ φφ T − Σ]( x − x ∗ ) − σξφ k ∞ (cid:9) ≤ E n(cid:2) k φ k ∞ ( | φ T ( x − x ∗ ) | + σ | ξ | ) + k E { φφ T ( x − x ∗ ) k ∞ (cid:3) o ≤ µ E n(cid:2) | φ T ( x − x ∗ ) | + σ | ξ | + √ υ k x − x ∗ k Σ (cid:3) o ≤ µ + √ υ ) k x − x ∗ k Σ + 2 µ σ implying (26) with κ ν ≤ µ + √ υ ) and ς ∗ ≤ µ σ . In particular, this condition isstraightforwardly satisfied when φ j are sampled from an orthogonal system with uni-formly bounded elements, e.g., φ j = √ nψ κ j where { ψ j , j = 1 , ..., n } is a trigonometric orHadamard basis of R n , and κ j are independent and uniformly distributed over { , ..., n } .On the other hand, in this case, for z = x = ψ we have E { ( z T φφ T x ) } = E { ( ψ φφ T ψ ) } = n = n k ψ k = n k x k k z k , implying that (29) can only hold with χ = O ( n ) in this case.Besides this, when φ is a linear image of a Rademacher vector, i.e. φ = Aη where A ∈ R m × n and η has independent components [ η ] i ∈ {± } with Prob { [ η ] i = 1 } = Prob { [ η ] i = − } =1 /
2, one has Σ = AA T , and E { ( φ T x ) } ≤ k A T x k (cf. the case of sub-Gaussian regressorsabove). Thus, we have E { ( z T φφ T ( x − x ∗ )) } ≤ E { ( z T φ ) } / E { (( x − x ∗ ) T φ ) } / ≤ z T Σ z ( x − x ∗ ) T Σ( x − x ∗ ) ≤ σ (Σ) k z k k x − x ∗ k implying (29) with χ = 6. On the other hand, denoting µ = max j k Row j ( A ) k , we getProb {k φ k ∞ ≥ tµ } ≤ ne − t / , with E {k φ k ∞ } ≤ µ [ln[2 n ] + 1] and E {k φ k ∞ } ≤ µ [ln [2 n ] + 2 ln[2 n ] + 2] . Thus, ζ ( x ) = E (cid:8) k [ φφ T − Σ]( x − x ∗ ) − σξφ k ∞ (cid:9) ≤ h ( E {k φ k ∞ } ) / ( E { ( φ T ( x − x ∗ )) } ) / + σ ( E {k φ k ∞ } ) / + √ υ k x − x ∗ k Σ i ≤ (cid:20)q √ n ] + 2) µ k x − x ∗ k Σ + σ p n ] + 1) µ + √ υ k x − x ∗ k Σ (cid:21) ≤ µ (cid:0)q √ n ] + 2) + 1 (cid:1) k x − x ∗ k + 4 µ (ln[2 n ] + 1) σ what is (8) with κ ν . µ ln n and κ ′ ς ∗ . µ σ ln n .3. Scale mixtures:
Let us now assume that φ ∼ √ Zη, (31)where Z is a scalar a.s. positive random variable, and η ∈ R n is independent of Z withcovariance matrix E { ηη T } = Σ . Because E (cid:8) k φ k ∞ (cid:9) = E { Z } E (cid:8) k η k ∞ (cid:9) , E (cid:8) k φφ T z k ∞ (cid:9) = E { Z } E (cid:8) k ηη T z k ∞ (cid:9) E { φφ T } = E { Z } E { ηη T } , we conclude that if random vector η satisfies (26) with Σ substituted for Σ and E { Z } isfinite then a similar bound also holds for φ . It is obvious that if η satisfies (29) then E (cid:8) ( z T φφ T x ) (cid:9) = E { Z } E (cid:8) ( z T ηη T x ) (cid:9) ≤ E { Z } E { Z } χ k z k k x k ≤ χ E { Z } E { Z } σ (Σ) k z k k x k , and (29) holds for φ with χ replaced with χ E { Z } E { Z } .Let us consider the situation where η ∼ N (0 , Σ ) with positive definite Σ . In this case φ is referred to as Gaussian scale mixture with a standard example provided by n -variate t -distributions t n ( q, Σ ) (multivariate Student distributions with q degrees of freedom, see[36] and references therein). Here, by definition, t n ( q, Σ ) is the distribution of the randomvector φ = √ Zη with Z = q/ζ , where ζ is the independent of η random variable following χ -distribution with q degrees of freedom. One can easily see that all one-dimensionalprojections e T φ , k e k = 1, of φ are random variables with univariate t q -distribution.When φ i ∼ t n ( q, Σ ) with q >
4, we have for ζ ∼ χ q E (cid:26) qζ (cid:27) = qq − , E (cid:26) q ζ (cid:27) = 3 q ( q − q − , so that Σ = qq − Σ , and ς ( x ) = E (cid:8) k [ φφ T − Σ]( x − x ∗ ) − σξφ k ∞ (cid:9) . q − q − υ ln[ n ] k x − x ∗ k Σ + σ υ ln n implying (8) with κ , κ ′ . ς ∗ . σ υ ln n . Moreover, in this case E { ( z T φφ T x ) } = E { Z } E { z T ηη T x ) } ≤ E { Z } E { Z } k z k k x k ≤ q − q − σ (Σ) k z k k x k . Another example of Gaussian scale mixture (31) is the n -variate Laplace distribution L n ( λ, Σ ) [21] in which Z has exponential distribution with parameter λ . In this caseall one-dimensional projections e T φ , k e k = 1, of φ are Laplace random variables. If φ i ∼ L n ( λ, Σ ) one has ς ( x ) . υ ln[ n ] k x − x ∗ k Σ + σ υ ln n and E { ( z T φφ T x ) } . σ (Σ) k z k k x k . In this section we consider the problem of recovery of matrix x ∗ ∈ R p × q , from independent andidentically distributed observations η i = h φ i , x ∗ i + σξ i , i = 1 , , ..., N, (32)17ith φ i ∈ R p × q which are random independent over i with covariance operator Σ (definedaccording to Σ( x ) = E { φ h φ, x i} ). We assume that ξ i ∈ R are mutually independent andindependent of φ i with E { ξ i } = 0 and E { ξ i } ≤ E is the space of p × q matrices equipped with the Frobenius scalarproduct h a, b i = Tr (cid:0) a T b (cid:1) with the corresponding norm k a k = h a, a i / . For the sake of definiteness, we assume that p ≥ q ≥
2. Our choice for the norm k · k is the nuclear norm k x k = k σ ( x ) k where σ ( x ) is thesingular spectrum of x , so that the conjugate norm is the spectral norm k y k ∗ = k σ ( y ) k ∞ . Wesuppose that κ Σ k x k ≤ h x, Σ( x ) i ≤ υ k x k ∀ x ∈ R p × q , with known κ Σ > υ , we write κ Σ I (cid:22) Σ (cid:22) υI ; for x ∈ R p × q we denote k x k Σ = p h x, Σ( x ) i .Finally, we assume that matrix x ∗ is of rank s ≤ ¯ s ≤ q , and moreover, that we are given aconvex and closed subset X of R p × q such that x ∗ ∈ X , along with R < ∞ and x ∈ X satisfying k x ∗ − x k ≤ R .Consider the Stochastic Optimization problemmin x ∈ X g ( x ) = E (cid:8) ( η − h φ, x i ) | {z } =: G ( x,ω =[ φ,η ]) (cid:9) . (33)We are to apply SMD algorithm to solve (33) with the proximal setup associated with the nuclearnorm with quadratically growing for q ≥ ϑ ( x ) = 2 e ln(2 q ) q X j =1 σ rj ( x ) r , r = (cid:0)
12 ln[2 q ] (cid:1) − , (here σ j ( x ) are singular values of x ) with the corresponding parameter Θ ≤ C ln[2 q ] (cf. [46,Theorem 2.3]). Note that, in the premise of this section, g ( x ) = E (cid:8) ( σξ + h φ, x ∗ − x i ) (cid:9) = ( k x − x ∗ k + σ ) , with ∇ g ( x ) = Σ( x − x ∗ ) = E (cid:8) φ ( h φ, x − x ∗ i − σξ ) | {z } = ∇ G ( x,ω ) (cid:9) and ζ ( x, ω ) = ∇ G ( x, ω ) − ∇ g ( z ) = [ φ h φ, x − x ∗ i − Σ( x − x ∗ )] − σφξ. Let us now consider the case regressors φ i ∈ R p × q drawn independently from a sub-Gaussianensemble , φ i ∼ SG (0 , S ) with sub-Gaussian operator S . The latter means that E (cid:8) e h x,φ i (cid:9) ≤ e h x,S ( x ) i ∀ x ∈ R p × q with linear positive definite S ( · ). To show the bound of Theorems 2.1–2.3 in this case we needto verify that relationships (8) and (22) of Assumptions S1 and S3 are satisfied. To this end,let us assume that S is “similar” to the covariance operator Σ of φ , namely, S (cid:22) µ Σ with some µ < ∞ . This setting covers, for instance, the situation where the entries in the regressors matrix18 ∈ R p × q are standard Gaussian or Rademacher i.i.d. random variables (in these models, S = Σis the identity, and g ( x ) − g ( x ∗ ) = k x − x ∗ k ).Note that, more generally, when S (cid:22) µ Σ we have S (cid:22) µυI with E {k φ k ∗ } ≤ C µ υ ( p + q ) , cf. Lemma A.3 of the appendix, and E {h φ, x − x ∗ i } ≤ h x − x ∗ , S ( x − x ∗ ) i ≤ µ k x − x ∗ k for sub-Gaussian random variable h φ, x − x ∗ i ∼ SG (0 , h x − x ∗ , S ( x − x ∗ ) i ). Therefore, E (cid:8) k φ h φ, x − x ∗ i − Σ( x − x ∗ ) k ∗ (cid:9) ≤ E (cid:8) k φ h φ, x − x ∗ ik ∗ (cid:9) + 2 υ k x − x ∗ k ≤ E (cid:8) k φ k ∗ (cid:9) / E (cid:8) h φ, x − x ∗ i (cid:9) / + 2 υ k x − x ∗ k ≤ Cµ ( p + q ) υ k x − x ∗ k + 2 υ k x − x ∗ k . Taking into account that ν = E {k φ k ∗ } ≤ Cµυ ( p + q ) in this case, we have ς ( x ) = E (cid:8) k ζ ( x, ω ) k ∗ (cid:9) ≤ E (cid:8) k φ h φ, x − x ∗ i − Σ( x − x ∗ ) k ∗ (cid:9) + 2 σ E (cid:8) k φ k ∗ (cid:9) ≤ Cµ ( p + q ) + 1) υ [ g ( x ) − g ∗ ] + 2 Cµυ ( p + q ) σ | {z } = ς ∗ implying (8) with κ . µ and κ ′ . ∀ x ∈ X, z ∈ R p × q E n h φ, z i h φ, x i o ≤ E n h z, φ i o / E n h φ, x i o / ≤ h z, S ( z ) ih x, S ( x ) i ≤ µ υ k z k k x k , so that E n h z, ζ ( x, ω ) i o = E n h z, φ h φ, x − x ∗ i − Σ( x − x ∗ ) − σφξ i o = E n(cid:0) h z, φ ih φ, x − x ∗ i − h z, Σ( x − x ∗ ) i (cid:1) o + σ E { ξ h z, φ i }≤ E n h z, φ i h φ, x − x ∗ i o + σ υ k z k ≤ µ υ [ g ( x ) − g ∗ ] k z k + σ υ k z k implying the bound (22) with χ . µ ( p + q ) − and χ ′ . µ − ( p + q ) − . When substitutingthe above bounds for problem parameters into statements of Theorems 2.1–2.3 we obtain thefollowing statement summarizing the properties of the approximate solutions by the SMD-SRalgorithm utilizing observations (32). Proposition 3.3
In the situation of this section,(i) let the sample size N satisfy N ≥ α (cid:20) µ υ ( p + q )¯ s ln qκ Σ (cid:21) for an appropriate absolute α , implying that at least one preliminary stage of Algorithm 1 iscompleted. Then there is an absolute c > such that approximate solutions b x N and b y N producedby the algorithm satisfy Risk k·k ( b y N | X ) ≤ s Risk k·k ( b x N | X ) . R exp (cid:26) − cN κ Σ µ υ ( p + q )¯ s ln q (cid:27) + σ µυ ( p + q )¯ s ln qκ N ,
Risk g ( b x N | X ) . κ Σ R ¯ s exp (cid:26) − cN κ Σ µ υ ( p + q )¯ s ln q (cid:27) + σ µυ ( p + q )¯ s ln qκ Σ N . ii) Furthermore, when observation size satisfies N ≥ α ′ (cid:20) µ υ ( p + q )¯ s ln[1 /ǫ ] ln qκ Σ (cid:21) with large enough α ′ , − ǫ reliable solutions b y N, − ǫ and b x N, − ǫ defined in Section 2.4 satisfy forsome c ′ > k·k ,ǫ ( b y N, − ǫ | X ) ≤ √ s Risk k·k ,ǫ ( b y N, − ǫ | X ) ≤ √ s Risk k·k ,ǫ ( b x N, − ǫ | X ) . R exp (cid:26) − c ′ N κ Σ µ υ ( p + q )¯ s ln[1 /ǫ ] ln q (cid:27) + σ ¯ sκ Σ r µυ ( p + q ) ln[1 /ǫ ] ln qN , (34) with solutions b x ′ N, − ǫ , b x ′′ N, − ǫ and b y ′ N, − ǫ , b y ′′ N, − ǫ verifying analogous bounds. Finally, the fol-lowing bound holds for the aggregated solution x N, − ǫ (with K = N ) by Algorithm 2: Risk g,ǫ ( x N, − ǫ | X ) . κ Σ R ¯ s exp (cid:26) − c ′ N κ Σ µ υ ( p + q )¯ s ln[1 /ǫ ] ln q (cid:27) + σ µυ ( p + q )¯ s ln[1 /ǫ ] ln qκ Σ N .
Remark.
Let us now compare the bounds of the proposition to available accuracy estimatesfor low rank matrix recovery. Notice first, that when assuming that µ . s satisfies¯ s . N κ Σ ( p + q ) υ ln[1 /ǫ ] ln q . The above condition is essentially the same, up to logarithmic in 1 /ǫ factor, as the best conditionon rank of the signal to be recovered under which the recovery is exact in the case of exact—noiseless—observation [12, 49]. The risk bounds of Proposition 3.3 can be compared to thecorresponding accuracy bounds for recovery b x N, Lasso by Lasso with nuclear norm penalization,as in [42, 35]. For instance, when regressors φ i have i.i.d. N (0 ,
1) entries they state (cf. [42,Corollary 5]) that the k · k , ǫ -risk of the recovery satisfies the boundRisk k·k ,ǫ ( b x N, Lasso | X ) . σ r ( p + q ) N for ǫ ≥ exp {− ( p + q ) } . Observe that the above bound coincides, up to logarithmic in q and1 /ǫ factors with the second—asymptotic—term in the bound (34). This result is all the moresurprising if we recall that its validity is not limited to sub-Gaussian regressors—what we needin fact is the bound (cf. the remark after Proposition 3.2) E (cid:8) k φ h φ, z ik ∗ (cid:9) . ( p + q ) k x − x ∗ k . (35)For instance, one straightforwardly verifies that the latter bound holds, for instance, in thecase where regressor φ is a scale mixtures of matrices satisfying (35) (e.g., scale mixture ofsub-Gaussian matrices). A Proofs
A.1 Proof of Proposition 2.1
We start with a technical result on the SMD algorithm which we formulate in a more generalsetting of composite minimization. Specifically, assume that we aim at solving the problemmin x ∈ X [ f ( x ) = E x { G ( x, ω ) } + h ( x )] , (36)20here X and G are as in Section 2.1 and h is convex and continuous. We consider a moregeneral composite proximal mapping: for ζ ∈ E, x, x ∈ X , and β > β ( ζ, x ; x ) := argmin z ∈ X (cid:8) h ζ, z i + h ( z ) + βV x ( x, z ) (cid:9) = argmin z ∈ X (cid:8) h ζ − βϑ ′ ( x − x ) , z i + h ( z ) + βϑ ( z − x ) (cid:9) (37)and consider for i = 1 , , . . . Stochastic Mirror Descent recursion (11). Same as before, theapproximate solution after N iterations of the algorithm is defined as weighted average of x i ’saccording to (12). Obviously, to come back to the situation in Section 2.2 it suffices to put h ( x ) ≡
0. To alleviate notation and w.l.o.g. we assume that x = 0 and denote V ( x, z ) = V ( x, z ); we also denote ζ i = ∇ G ( x i − , ω i ) − ∇ g ( x i − )and ε ( x N , z ) = N X i =1 β − i − [ h∇ g ( x i − ) , x i − z i + h ( x i ) − h ( z )] + V ( x i − , x i ) , (38)with x N = ( x , . . . , x N ). In the sequel we use the following well known result which we provebelow for the sake of completeness. Proposition A.1
In the situation of this section, let β i ≥ L for all i = 0 , , ... , and let b x N bedefined in (12), where x i are iterations (11). Then for any z ∈ X we have " N X i =1 β − i − [ f ( b x N ) − f ( z )] ≤ N X i =1 β − i − [ f ( x i ) − f ( z )] ≤ ε ( x N , z ) ≤ V ( x , z ) − V ( x N , z ) + N X i =1 h h ζ i , z − x i − i β i − + k ζ i k ∗ β i − i (39) ≤ V ( x , z ) + N X i =1 h h ζ i , z i − − x i − i β i − + 32 k ζ i k ∗ β i − i , (40) where z i is a random vector with values in X depending only on x , ζ , . . . , ζ i . Proof of Proposition A.1. 1 o . Let x , . . . , x N be some points of X ; let ε i +1 ( z ) := h∇ g ( x i ) , x i +1 − z i + h h ′ ( x i +1 ) , x i +1 − z i + L V ( x i , x i +1 ) . Note that V ( x, z ) ≥ k x − z k due to the strong convexity of V ( x, · ). Thus, by convexity of g and h and the Lipschitz continuity of ∇ g we get for any z ∈ Xf ( x i +1 ) − f ( z ) = [ g ( x i +1 ) − g ( z )] + [ h ( x i +1 ) − h ( z )]= [ g ( x i +1 ) − g ( x i )] + [ g ( x i ) − g ( z )] + [ h ( x i +1 ) − h ( z )] ≤ [ h∇ g ( x i ) , x i +1 − x i i + L V ( x i , x i +1 )] + h∇ g ( x i ) , x i − z i + h ( x i +1 ) − h ( z ) ≤ h∇ g ( x i ) , x i +1 − z i + h h ′ ( x i +1 ) , x i +1 − z i + L V ( x i , x i +1 ) = ε i +1 ( z );i.e., the following inequality holds for any z ∈ X : f ( x i +1 ) − f ( z ) ≤ ε i +1 ( z ) . (41)21 o . Let us first prove inequality (39). In view of (37), the optimality condition for x i +1 in (11. a )has the form h∇ G ( x i , ω i +1 ) + h ′ ( x i +1 ) + β i [ ϑ ′ ( x i +1 ) − h ϑ ′ ( x i )] , z − x i +1 i ≥ , ∀ z ∈ X, or, equivalently, h∇ G ( x i , ω i +1 ) + h ′ ( x i +1 ) , x i +1 − z i ≤ β i h ϑ ′ ( x i +1 ) − ϑ ′ ( x i ) , z − x i +1 i = β i h V ′ ( x i , x i +1 ) , z − x i +1 i = β i [ V ( x i , z ) − V ( x i +1 , z ) − V ( x i , x i +1 )] , ∀ z ∈ X what results in h∇ g ( x i ) , x i +1 − z i + h h ′ ( x i +1 ) , x i +1 − z i ≤ β i [ V ( x i , z ) − V ( x i +1 , z ) − V ( x i , x i +1 )] −h ζ i +1 , x i +1 − z i . (42)It follows from (41) and condition β i ≥ L that f ( x i +1 ) − f ( z ) ≤ ε i +1 ( z ) ≤ h∇ g ( x i ) , x i +1 − z i + h h ′ ( x i +1 ) , x i +1 − z i + β i V ( x i , x i +1 ) . Together with (42), this inequality implies ε i +1 ( z ) ≤ β i [ V ( x i , z ) − V ( x i +1 , z ) − V ( x i , x i +1 )] − h ζ i +1 , x i +1 − z i . On the other hand, due to the strong convexity of V ( x, · ) we have h ζ i +1 , z − x i +1 i − β i V ( x i , x i +1 ) = h ζ i +1 , z − x i i + h ζ i +1 , x i − x i +1 i − β i V ( x i , x i +1 ) ≤ h ζ i +1 , z − x i i + k ζ i +1 k ∗ β i . Combining these inequalities, we obtain f ( x i +1 ) − f ( z ) ≤ ε i +1 ( z ) ≤ β i [ V ( x i , z ) − V ( x i +1 , z )] − h ζ i +1 , x i − z i + k ζ i +1 k ∗ β i (43)for all z ∈ X . Dividing (43) by β i and taking the sum over i from 0 to N − o . We now prove the bound (40). Applying Lemma 6.1 of [43] with z = x we get ∀ z ∈ X, N X i =1 β − i − h ζ i , z − z i − i ≤ V ( x , z ) + N X i =1 β − i − k ζ i k ∗ , (44)where z i = argmin z ∈ X (cid:8) − β − i − h ζ i , z i + V ( z i − , z ) (cid:9) depend only on z , ζ , . . . , ζ i . Further, N X i =1 β − i − h ζ i , z − x i − i = N X i =1 β − i − [ h ζ i , z i − − x i − i + h ζ i , z − z i − i ] ≤≤ V ( x , z ) + N X i =1 β − i − h ζ i , z i − − x i − i + β − i − k ζ i k ∗ . Combining this inequality with (39) we arrive at (40). (cid:3) The last equality follows from the following remarkable identity see, for instance, [19]: for any u, u ′ and w ∈ X h V ′ u ′ ( u, u ′ ) , w − u ′ i = V ( u, w ) − V ( u ′ , w ) − V ( u, u ′ ) . 
roof of Proposition 2.1. Note that, by definition, ν ≥ L and κ ≥
1, thus, Proposition A.1can be applied to the corresponding SMD recursion. When applying recursively bound (39) ofthe proposition with z = x ∗ and h ( x ) ≡ E { V x ( x i , x ∗ ) } is finite along with E {k x i − x ∗ k } , and so E {h ζ i +1 , x i − x ∗ i} = 0. Thus, after taking expectation we obtain m X i =1 [ E { g ( x i ) } − g ∗ ] ≤ β E { V x ( x , x ∗ ) − V x ( x m , x ∗ ) } + β − m X i =1 E {k ζ i k ∗ }≤ E { V x ( x , x ∗ ) − V x ( x m , x ∗ ) } + β − m X i =1 (cid:0) κ ν [ E { g ( x i − ) − h∇ g ( x ∗ ) , x i − − x ∗ i} − g ∗ ] + κ ′ ς ∗ (cid:1) , what, thanks to convexity of g , leads to (cid:20) − κ νβ (cid:21) m X i =1 [ E { g ( x i ) } − g ∗ ] + β E { V x ( x m , x ∗ ) }≤ β E { V x ( x , x ∗ ) } + κ νβ [ E { g ( x ) − h∇ g ( x ∗ ) , x − x ∗ i} − g ∗ ] + m κ ′ ς ∗ β . Because, due to convexity of g , g ( b x m ) ≤ m P mi =1 g ( x i ) and E { g ( x ) − h∇ g ( x ∗ ) , x − x ∗ i} − g ∗ ≤ ν E {k x − x ∗ k } ≤ νR we conclude that when β ≥ κ ν E { g ( b x m ) } − g ∗ ≤ m (cid:18) β Θ R + κ νβ E { V g ( x ∗ , x ) } (cid:19) + 2 κ ′ ς ∗ β ≤ R m (cid:18) Θ β + κ ν β (cid:19) + 2 κ ′ ς ∗ β what is (13). (cid:3) A.2 Proof of Theorem 2.1
We start with the following straightforward result:
Lemma A.1
Let x ∗ ∈ X ⊂ E be s -sparse, x ∈ X , and let x s = sparse( x ) —an optimal solutionto (9). We have k x s − x ∗ k ≤ √ s k x s − x ∗ k ≤ √ s k x − x ∗ k . (45) Proof.
The proof of the theorem relies upon the following characterization of the properties of the approximate solutions y_k, x_k, x′_k and y′_k.

Proposition A.2. Under the premise of Theorem 2.1,
(i) after k preliminary stages of the algorithm one has

  E{‖y_k − x_*‖²} ≤ 2s̄ E{‖y_k − x_*‖₂²} ≤ 2^{−k}R² + 32ς_*² s̄ κ′ κ^{−2}ν^{−1},  (46)
  E{g(x̂_{m̄}(y_{k−1}, β̄))} − g_* ≤ 2^{−k} κR² (16s̄)^{−1} + 2κ′ς_*² (κν)^{−1}.  (47)

In particular, upon completion of K = K̄ preliminary stages the approximate solutions x̂^{(1)} and ŷ^{(1)} satisfy

  E{‖ŷ^{(1)} − x_*‖²} ≤ 2s̄ E{‖ŷ^{(1)} − x_*‖₂²} ≤ 64ς_*² s̄ κ′ κ^{−2}ν^{−1},  (48)
  E{g(x̂^{(1)})} − g_* ≤ 4κ′ς_*² (κν)^{−1}.  (49)

(ii) Suppose that at least one asymptotic stage is complete. Let r_k² = 2^{−k}r_0², where r_0² = 64ς_*² s̄ κ′ κ^{−2}ν^{−1}. Then after k stages of the asymptotic phase one has

  E{‖y′_k − x_*‖²} ≤ 2s̄ E{‖y′_k − x_*‖₂²} ≤ r_k² = 2^{−k}r_0²,  (50)
  E{g(x̂_{m_k}(y′_{k−1}, β_k))} − g_* ≤ 4ς_*²κ′ β_k^{−1} ≤ 2^{−k+2} ς_*²κ′ (κν)^{−1}.  (51)

Proof of the proposition. 1°. We first show that under the premise of the proposition the following relationship holds for 1 ≤ k ≤ K̄:

  E{‖y_k − x_*‖²} ≤ R_k² := ½R_{k−1}² + 16ς_*² s̄ κ′ κ^{−2}ν^{−1}, R_0² = R².  (52)

Obviously, (52) implies (46) for all 1 ≤ k ≤ K̄. Observe that (52) clearly holds for k = 1. Let us now perform the recursive step k − 1 → k. Indeed, bound (13) of Proposition 2.1 implies that after m̄ iterations of the SMD with the stepsize parameter β̄ = 4κν satisfying (14) and initial condition x_0 such that E{‖x_0 − x_*‖²} ≤ R_{k−1}², one has

  E{g(x̂_{m̄})} − g_* ≤ (2/m̄)[4Θκν + ν/4] R_{k−1}² + κ′ς_*²(2κν)^{−1} ≤ [8Θκ + 1](ν/m̄) R_{k−1}² + κ′ς_*²(2κν)^{−1}.  (53)

Note that when m̄ ≥ 32κ^{−1}s̄(8Θκ + 1)ν we have (16s̄/κ)[8Θκ + 1]ν m̄^{−1} ≤ ½. Therefore, utilizing the bound (45) of Lemma A.1 together with the quadratic minoration property of g, we get

  E{‖y_k − x_*‖²} ≤ 2s̄ E{‖y_k − x_*‖₂²} ≤ 8s̄ E{‖x̂_{m̄} − x_*‖₂²} ≤ (16s̄/κ)[E{g(x̂_{m̄})} − g_*]
   ≤ (16s̄/κ)([8Θκ + 1](ν/m̄) R_{k−1}² + κ′ς_*²(2κν)^{−1}) ≤ R_k² := ½R_{k−1}² + 16ς_*² s̄ κ′ κ^{−2}ν^{−1}.

Moreover, by (52), R_{k−1}² ≤ 2^{−k+1}R² + 32ς_*² s̄ κ′ κ^{−2}ν^{−1}, so that

  E{g(x̂_{m̄}(y_{k−1}, β̄))} − g_* ≤ κR_{k−1}²(32s̄)^{−1} + κ′ς_*²(2κν)^{−1} ≤ 2^{−k} κR² (16s̄)^{−1} + 2κ′ς_*² (κν)^{−1},

which implies (47). Now, (48) and (49) follow straightforwardly by applying (46) and (47) with k = K̄ (recall that, by the definition of K̄, 2^{−K̄}R² ≤ 32ς_*² s̄ κ′ κ^{−2}ν^{−1}).

2°. Let us prove (50). Recall that at the beginning of the first stage of the second phase we have E{‖ȳ_0 − x_*‖²} ≤ r_0². Now, let us do the recursive step, i.e., assume that (50) holds for some 0 ≤ k − 1 < K′ and show that it holds for k. Because Θ ≥ 1, κ ≥ 1 and β_k = 2^{k+2}κν ≥ 4κν, k = 1, 2, ..., by (13),

  E{g(x̂_{m_k}(y′_{k−1}, β_k))} − g_* ≤ (2r_{k−1}²/m_k)(Θβ_k + κν²β_k^{−1}) + 2κ′ς_*²β_k^{−1}
   ≤ 4Θβ_k r_{k−1}² m_k^{−1} + 2κ′ς_*²β_k^{−1} ≤ 2^{−k+1} κ′ς_*² (κν)^{−1} + 2^{−k−1} κ′ς_*² (κν)^{−1} ≤ 2^{−k+2} κ′ς_*² (κν)^{−1},  (54)

where the bound on the first term uses m_k ≥ m̄_k := 1024·2^k s̄Θν. Observe that, by the quadratic minoration property,

  E{‖x̂_{m_k}(y′_{k−1}, β_k) − x_*‖₂²} ≤ (2/κ)[E{g(x̂_{m_k}(y′_{k−1}, β_k))} − g_*] ≤ 2^{−k+3} κ′ς_*² κ^{−2}ν^{−1},

so that by Lemma A.1,

  E{‖y′_k − x_*‖²} ≤ 8s̄ E{‖x̂_{m_k}(y′_{k−1}, β_k) − x_*‖₂²} ≤ 2^{−k}·64 s̄ κ′ς_*² κ^{−2}ν^{−1} = 2^{−k}r_0² = r_k²,

and (50) follows. Now (51) is an immediate consequence of (50) and (54). □

Proof of the theorem. 1°. Let us start with the situation where no asymptotic stage takes place. Because we have assumed that N is large enough, so that at least one preliminary stage took place, this can only happen when either m̄K̄ ≥ N/2 or m̄ ≥ N. Due to m̄ ≥ 32κ^{−1}s̄(8Θκ + 1)ν, in the first case K̄ ≥ N/(2m̄) ≥ cN(Θs̄ν)^{−1}, and by (48) we have (since the preliminary phase did not terminate earlier, 32ς_*² s̄ κ′ κ^{−2}ν^{−1} ≤ 2^{−K̄}R²):

  E{‖ŷ_N − x_*‖²} ≤ R_{K̄}² := 2^{−K̄}R² + 32ς_*² s̄ κ′ κ^{−2}ν^{−1} ≤ 2^{−K̄+1}R² ≤ R² exp{−cN(Θs̄ν)^{−1}}

for some absolute c > 0. Furthermore, due to (47) we also have in this case

  E{g(x̂_{m̄}(y_{K̄−1}, β̄))} − g_* ≤ 2^{−K̄} κR² (16s̄)^{−1} + 2κ′ς_*² (κν)^{−1} ≤ 2^{−K̄} κR² (8s̄)^{−1} ≤ (κR²/s̄) exp{−cN(Θs̄ν)^{−1}}.

Next, m̄ ≥ N implies that

  s̄ ≥ cN(Θν)^{−1}  (55)

for some absolute constant c, so that the approximate solutions upon completion of the preliminary phase satisfy (cf. (48), (49))

  E{‖ŷ_N − x_*‖²} ≤ 64ς_*² s̄ κ′ κ^{−2}ν^{−1} ≤ CΘκ′ς_*² s̄² (κ²N)^{−1},
  E{g(x̂^{(1)})} − g_* ≤ 4κ′ς_*² (κν)^{−1} ≤ CΘκ′ς_*² s̄ (κN)^{−1}.

2°. Now, let us suppose that at least one stage of the asymptotic phase was completed. Applying the bound (50) of Proposition A.2 we have E{‖y′_{K′} − x_*‖²} ≤ r_{K′}². When M < N/2 (M being the length of the asymptotic phase), the preliminary phase comprises at least N/2 iterations and, same as above, we have

  E{‖ŷ_N − x_*‖²} ≤ r_0² ≤ 2^{−K̄+2}R² ≤ CR² exp{−cN(Θs̄ν)^{−1}}

and

  E{g(x̂_{m_{K′}}(y′_{K′−1}, β_{K′}))} − g_* ≤ E{g(x̂^{(1)})} − g_* ≤ C(κR²/s̄) exp{−cN(Θs̄ν)^{−1}}.  (56)

When M ≥ N/2, since m_k ≤ Cm̄_k with m̄_k = 1024·2^k s̄Θν, we have

  N ≤ 2M ≤ C Σ_{k=1}^{K′} m̄_k ≤ C 2^{K′+1} s̄Θν.

We conclude that 2^{−K′} ≤ C s̄Θν N^{−1}, so that

  E{‖ŷ_N − x_*‖²} = E{‖ŷ_{K′} − x_*‖²} ≤ 2^{−K′} r_0² ≤ CΘκ′ς_*² s̄² (κ²N)^{−1}.

Finally, by (51),

  E{g(x̂_{m_{K′}}(y′_{K′−1}, β_{K′}))} − g_* ≤ 2^{−K′+2} ς_*²κ′ (κν)^{−1} ≤ C ς_*² s̄ κ′ Θ (κN)^{−1};

together with (56) this implies (15). □
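To fix ideas, here is a hedged sketch of the two-phase stage schedule analyzed above, reusing smd_stage and sparse_s from the sketches above. The switch point and all numerical constants are illustrative placeholders standing in for the quantities m̄, β̄, m_k, β_k, K̄ of the theorem, not the paper's actual parameter choices.

```python
def multistage_smd(x0, oracle, N, s, nu, kappa, Theta, lam, K_pre):
    """Schematic two-phase multistage SMD of Theorem 2.1: fixed stage length
    and stepsize during the preliminary phase (error halves per stage),
    doubling stage length and stepsize during the asymptotic phase.
    All constants are illustrative placeholders."""
    m_bar = int(32 * s * (8 * Theta * kappa + 1) * nu / kappa) + 1
    y, used = x0, 0
    for _ in range(K_pre):                          # preliminary phase
        if used + m_bar > N:
            return y
        y = sparse_s(smd_stage(y, oracle, m_bar, 4 * kappa * nu, lam), s)
        used += m_bar
    k = 1
    while True:                                     # asymptotic phase
        m_k = int(1024 * 2 ** k * s * Theta * nu) + 1
        if used + m_k > N:
            return y
        y = sparse_s(smd_stage(y, oracle, m_k, 2 ** (k + 2) * kappa * nu, lam), s)
        used += m_k
        k += 1
```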
A.3 Proof of Theorem 2.2

1°. By the Chebyshev inequality, for every ℓ = 1, ..., L,

  Prob{‖x̂^{(ℓ)}_M − x_*‖ ≥ 2θ_M} ≤ 1/4;  (57)

applying [40, Theorem 3.1] we conclude that

  Prob{‖x̂_{N,1−ǫ} − x_*‖ ≥ 2C_α θ_M} ≤ e^{−Lψ(α,1/4)}, where ψ(α, β) = (1−α) ln[(1−α)/(1−β)] + α ln[α/β]  (58)

and C_α = (1−α)√(1/(1−2α)). When choosing α = 2√3 − 3, which corresponds to C_α = 2, we obtain ψ(α, 1/4) = 0.107... > 1/10, so that

  Prob{‖x̂_{N,1−ǫ} − x_*‖ ≥ 4θ_M} ≤ ǫ

if L ≥ 10 ln[1/ǫ]. When combining this result with that of Lemma A.1 we arrive at the theorem statement for the solutions x̂_{N,1−ǫ} and ŷ_{N,1−ǫ}.

2°. The corresponding result for x̂′_{N,1−ǫ} and its "sparsification" ŷ′_{N,1−ǫ} is due to the following simple statement.

Proposition A.3. Let 0 < α < 1/2, let |·| be a norm on E, let z ∈ E, and let z_ℓ, ℓ = 1, ..., L, be independent and satisfy Prob{|z_ℓ − z| ≥ δ} ≤ β for some δ > 0 and β < α. Then for

  ẑ ∈ Argmin_{u ∈ {z_1,...,z_L}} Σ_{ℓ=1}^L |u − z_ℓ|,  (59)

it holds that Prob{|ẑ − z| ≥ C′_α δ} ≤ e^{−Lψ(α,β)} with C′_α = (3 − 2α)/(1 − 2α).

Proof. W.l.o.g. we may put δ = 1 and z = 0. The proof follows that of [40, Theorem 3.1], with Lemma 2.1 of [40] replaced with the following result.

Lemma A.2. Let z_1, ..., z_L ∈ E, and let ẑ be an optimal solution to (59). Let 0 < α < 1/2, and let |ẑ| ≥ C′_α. Then there exists a subset I of {1, ..., L} of cardinality card I > αL such that |z_ℓ| > 1 for all ℓ ∈ I.

Proof of the lemma. Let us assume, on the contrary, that |z_ℓ| ≤ 1 for ℓ = 1, ..., L̄ with L̄ ≥ (1−α)L, and write C = |ẑ|. Then

  Σ_{ℓ=1}^L |z_ℓ − ẑ| = Σ_{ℓ≤L̄} |z_ℓ − ẑ| + Σ_{ℓ>L̄} |z_ℓ − ẑ| ≥ L̄(C − 1) + Σ_{ℓ>L̄} [|z_ℓ| − C]
   ≥ Σ_{ℓ≤L̄} |z_ℓ| + L̄(C − 2) + Σ_{ℓ>L̄} |z_ℓ| − (L − L̄)C = Σ_{ℓ=1}^L |z_ℓ| + L̄(C − 2) − (L − L̄)C.

On the other hand, for any j ≤ L̄ we have Σ_{ℓ=1}^L |z_ℓ − z_j| ≤ Σ_{ℓ=1}^L |z_ℓ| + L, so that the optimality of ẑ implies

  L̄(C − 2) − (L − L̄)C ≤ L, that is, L̄ ≤ L(C + 1)/(2(C − 1)).

Along with L̄ ≥ (1−α)L this results in 1 − α ≤ (C + 1)/(2(C − 1)), same as C ≤ (3 − 2α)/(1 − 2α) = C′_α, which contradicts |ẑ| ≥ C′_α unless card I > αL. □
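In words, (59) selects the candidate minimizing the total distance to all other candidates, a "median point" of the candidate cloud. A minimal sketch (the function name is ours):

```python
import numpy as np

def select_median_candidate(candidates):
    """Pick z_hat per (59): the candidate minimizing the sum of
    distances to all candidates."""
    Z = np.asarray(candidates)                                # L x n array
    dists = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=2)
    return Z[np.argmin(dists.sum(axis=1))]
```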
For instance, when choosing α and β such that C′_α = 13/4 and C′_α/√β = 10, we obtain ψ(α, β) > 0, so that Lψ(α, β) ≥ ln[1/ǫ] for L = ⌈c ln[1/ǫ]⌉ with an appropriate absolute constant c. Because

  Prob{‖x̂^{(ℓ)}_M − x_*‖ ≥ θ_M/√β} ≤ β, ℓ = 1, ..., L,

by Proposition A.3 we conclude that

  Prob{‖x̂′_{1−ǫ,N} − x_*‖ ≥ 10θ_M} ≤ ǫ,

implying the statement of the theorem for x̂′_{1−ǫ,N} and ŷ′_{1−ǫ,N}.

3°. The proof of the claim for the solutions x̂″_{1−ǫ,N} and ŷ″_{1−ǫ,N} follows the lines of that of [25, Theorem 4]. We reproduce it here (with improved parameters of the procedure) to meet the needs of the proof of Theorem 2.3.

Let us denote by I(τ_M) the (possibly empty) subset of {1, ..., L} such that g(x̂^{(i)}_M) − g_* ≤ τ_M, and thus ‖x̂^{(i)}_M − x_*‖ ≤ 2θ_M, for i ∈ I(τ_M). Assuming the latter set is nonempty, we have ‖x̂^{(i)}_M − x̂^{(j)}_M‖ ≤ 4θ_M for all i, j ∈ I(τ_M). On the other hand, using (57) and the independence of the x̂^{(i)}_M, we conclude (cf., e.g., [39, Lemma 23]) that

  Prob{|I| ≥ ⌈L/2⌉} ≥ Prob{B(L, 3/4) ≥ ⌈L/2⌉} ≥ 1 − exp{−Lψ(⌈L/2⌉/L, 1/4)},

where ⌈a⌉ is the smallest integer ≥ a, B(N, p) is an (N, p)-binomial random variable, and ψ(·,·) is as in (58). When ε ≤ 1/2 and L ≥ c ln[1/ε] with an appropriate absolute constant c (in particular, L ≥ 16), we have

  Prob{|I| ≥ ⌈L/2⌉} ≥ 1 − e^{−0.14L} ≥ 1 − ε.

Therefore, if we denote by Ω_ε a subset of Ω^N such that |I(τ_M)| > L/2 whenever ω^N ∈ Ω_ε, we have P{Ω_ε} ≥ 1 − ε. Let now ω^N ∈ Ω_ε be fixed. Observe that the optimal value r̂ = r_{î}^{⌈L/2⌉} of (20) satisfies r̂ ≤ 4θ_M, and that among the ⌈L/2⌉ points closest to x̂″_{N,1−ǫ} there is at least one, call it x̂^{(ī)}_M, satisfying g(x̂^{(ī)}_M) − g_* ≤ τ_M and ‖x̂^{(ī)}_M − x_*‖ ≤ 2θ_M. We conclude that whenever ω^N ∈ Ω_ε one has

  ‖x̂″_{N,1−ǫ} − x_*‖ ≤ ‖x̂″_{N,1−ǫ} − x̂^{(ī)}_M‖ + ‖x̂^{(ī)}_M − x_*‖ ≤ 4θ_M + 2θ_M = 6θ_M,

implying that Prob{‖x̂″_{N,1−ǫ} − x_*‖ ≥ 6θ_M} ≤ ε whenever L ≥ c ln[1/ε]. □
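The selection rule used here picks, for each candidate, the distance to its ⌈L/2⌉-th nearest fellow candidate, and returns the candidate for which this radius is smallest. A sketch in the spirit of (20) (the function name and layout are ours):

```python
import numpy as np

def select_by_radius(candidates):
    """Selection rule in the spirit of (20): for each candidate compute the
    distance r_i to its ceil(L/2)-th nearest candidate, and return the
    candidate with the smallest such radius, together with that radius."""
    Z = np.asarray(candidates)
    L = Z.shape[0]
    dists = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=2)
    radii = np.sort(dists, axis=1)[:, int(np.ceil(L / 2))]  # sorted row includes self at 0
    i_hat = np.argmin(radii)
    return Z[i_hat], radii[i_hat]
```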
A.4 Proof of Theorem 2.3

1°. Let ω^N ∈ Ω_{ε/2}, the latter set being defined as in 3° of the proof of Theorem 2.2; we choose L ≥ c ln[2/ε] so that Prob{ω^N ∉ Ω_{ε/2}} ≤ ε/2. We denote by r̂ the optimal value of (20); recall that r̂ ≤ 4θ_M. Then for any i, j ∈ Î we have

  ‖x̂^{(i)}_M − x̂^{(j)}_M‖ ≤ 2r̂ ≤ 8θ_M,  (60)

and for some ī ∈ Î we have

  g(x̂^{(ī)}_M) − g_* ≤ τ_M,  (61)

where τ_M and θ_M are defined in (17) and (18), respectively. W.l.o.g. we may assume that x̂^{(ī)}_M is the minimizer of g(x) over the x̂^{(i)}_M, i ∈ Î.

Let us consider the aggregation procedure. From now on, all probabilities are computed with respect to the distribution P_K of the (second) sample ω^K, conditionally to the realization ω^N of the first sample (independent of ω^K). To alleviate the notation we drop the corresponding "conditional" indices.

The proof of the theorem relies on the following statement, which may be of independent interest.

Proposition A.4. Let U: [0,1] × Ω → R be continuously differentiable in the first argument and such that u(t) = E{U(t, ω)} is finite for all t ∈ [0,1], convex, and differentiable with Lipschitz-continuous derivative:

  |u′(t′) − u′(t)| ≤ ℳ|t′ − t|, ∀t, t′ ∈ [0,1].

In the situation in question, let ε ∈ (0,1], J ≥ 8 ln[2/ε], and t_i = (2i−1)/(2m), i = 1, ..., m. Consider the estimate

  v̂ = median_j[v̂_j], v̂_j = (1/m) Σ_{i=1}^m U′(t_i, ω_{ji}), j = 1, ..., J,

of the difference v = u(1) − u(0), using M = mJ independent realizations ω_{ji}, i = 1, ..., m, j = 1, ..., J. Then

  Prob{|v̂ − v| ≥ ρ} ≤ ε,  (62)

where

  ρ = (4m)^{−1}[√(2ℳ(u(1) − u_*)) + √(2ℳ(u(0) − u_*))] + (2/m)√(Σ_{i=1}^m E{[ζ(t_i)]²})

(here and below, ζ_j(t_i) = U′(t_i, ω_{ji}) − u′(t_i), ζ(t) stands for a generic variable with this distribution, and u_* = min_{0≤t≤1} u(t)). In particular, if for some µ ≥ ℳ,

  E{[ζ(t)]²} ≤ µ(u(t) − u_*) + ς²,  (63)

then

  Prob{|v̂ − v| ≥ ρ̄} ≤ ε,  (64)

where

  ρ̄ = 2√(2µ/m)[√(u(1) − u_*) + √(u(0) − u_*)] + 2ς/√m.

We postpone the proof of the proposition to the end of this section, and first finish the proof of the theorem.

2°. Denote v̂_{jī} = median_ℓ[v̂^ℓ_{jī}]. For j ∈ Î, j ≠ ī, let x(t) = x̂^{(j)}_M + t(x̂^{(ī)}_M − x̂^{(j)}_M). Note that U(t, ω) = G(x(t), ω) and u(t) = g(x(t)) satisfy the premise of Proposition A.4 with ℳ = L r_{jī}², where r_{jī} = ‖x̂^{(ī)}_M − x̂^{(j)}_M‖, µ = χL r_{jī}², and ς = √(χ′) ς_* r_{jī}. When applying the proposition with ε = ǫ/(2L), J = L′, and K = mL′, we conclude that for all j ∈ Î, j ≠ ī,

  Prob{|v̂_{jī} − v_{jī}| ≥ ̺_{jī}} ≤ ǫ/(2L),

implying that

  Prob{max_{j∈Î, j≠ī} [|v̂_{jī} − v_{jī}| − ̺_{jī}] ≥ 0} ≤ ǫ/2,  (65)

where

  ̺_{jī} = 2r_{jī}√(2χL/m)[√(g(x̂^{(ī)}_M) − g_*) + √(g(x̂^{(j)}_M) − g_*)] + 2r_{jī} ς_* √(χ′/m).

Let now Ω′_{ǫ/2} ⊂ Ω^K be such that for all ω^K ∈ Ω′_{ǫ/2},

  max_{j∈Î, j≠ī} [|v̂_{jī} − v_{jī}| − ̺_{jī}] < 0;

by (65), Prob{Ω′_{ǫ/2}} ≥ 1 − ǫ/2.

3°. Let us fix ω^K ∈ Ω′_{ǫ/2}; our current objective is to show that in this case the set of admissible x̂^{(i)}_M's is nonempty—it contains x̂^{(ī)}_M—and, moreover, all admissible x̂^{(j)}_M's satisfy the bound g(x̂^{(j)}_M) − g_* ≤ γ(r_{jī}), with γ(r) defined as in (23).

Let α, β, τ > 0, and let v(γ) = γ − τ − [α(√γ + √τ) + β]; then v(γ) > 0 for all γ ≥ γ₀ := (2α + √τ)² + 4β. Indeed, v(·) being nondecreasing for γ ≥ α²/4, it suffices to verify the inequality for γ = γ₀. Because

  2α + √τ + β/α ≥ √((2α + √τ)² + 4β),

we have

  4α² + 4α√τ + 2β > α(√((2α + √τ)² + 4β) + √τ),

and

  v(γ₀) = [(2α + √τ)² + 4β] − τ − α(√γ₀ + √τ) − β > 0.

Applying the above observation to ᾱ = 2α(r_{jī}), β̄ = 2β(r_{jī}), where α(r) = 2r√(2χL/m) and β(r) = 2r ς_* √(χ′/m), and to τ = τ_M, we conclude that whenever g(x̂^{(j)}_M) − g_* ≥ γ(r_{jī}),

  v_{jī} = g(x̂^{(ī)}_M) − g(x̂^{(j)}_M) ≤ τ_M − [g(x̂^{(j)}_M) − g_*] < −2̺_{jī}.  (66)

Therefore, for g(x̂^{(j)}_M) − g_* ≥ γ(r_{jī}),

  median_ℓ[v̂^ℓ_{jī}] − ρ_{jī} = [median_ℓ[v̂^ℓ_{jī}] − v_{jī}] + v_{jī} − ρ_{jī} < ̺_{jī} − 2̺_{jī} − ρ_{jī} < 0, ∀ω^K ∈ Ω′_{ǫ/2}.

Furthermore, for g(x̂^{(j)}_M) − g_* < γ(r_{jī}) we have

  median_ℓ[v̂^ℓ_{jī}] − ρ_{jī} ≤ ̺_{jī} − ρ_{jī} < 0, ∀ω^K ∈ Ω′_{ǫ/2},

and we conclude that x̂^{(ī)}_M is admissible. On the other hand, whenever g(x̂^{(j)}_M) − g_* ≥ γ(r_{jī}) we have v_{īj} > 2̺_{īj} (cf. (66)), and

  median_ℓ[v̂^ℓ_{īj}] − ρ_{īj} = [median_ℓ[v̂^ℓ_{īj}] − v_{īj}] + v_{īj} − ρ_{īj} > −̺_{īj} + 2̺_{īj} − ρ_{īj} ≥ 0, ∀ω^K ∈ Ω′_{ǫ/2}.

We conclude that x̂^{(j)}_M is not admissible if g(x̂^{(j)}_M) − g_* ≥ γ(r_{jī}) and ω^K ∈ Ω′_{ǫ/2}.

4°. Now we are done. Assume that [ω^N, ω^K] ∈ Ω_{ǫ/2} × Ω′_{ǫ/2} (which is the case with probability ≥ 1 − ǫ). We have r_{ij} ≤ 8θ_M for i, j ∈ Î by (60), and g(x̂^{(ī)}_M) − g_* ≤ τ_M for some admissible ī ∈ Î by (61). In this situation, all x̂^{(j)}_M, j ∈ Î, with g(x̂^{(j)}_M) − g_* ≥ γ(r_{jī}) are not admissible, implying that the suboptimality of the selected solution x_{N+K,1−ǫ} is bounded by γ(8θ_M) (γ(·) being nondecreasing), thus

  Risk_{g,ǫ}(x_{N+K,1−ǫ}|X) ≤ γ̄ = γ(8θ_M).

The "in particular" part of the statement of the theorem can be verified by direct substitution of the corresponding values of m, θ_M, and τ_M into the expression for γ̄. □

Proof of Proposition A.4.
Let us denote

  v̄ = E{v̂_j} = (1/m) Σ_{i=1}^m u′(t_i);

we have

  |v̂ − v| ≤ |v̂ − v̄| + |v̄ − v|.  (67)

1°. Note that

  v̂_j − v̄ = (1/m) Σ_{i=1}^m [U′(t_i, ω_{ji}) − u′(t_i)] = (1/m) Σ_{i=1}^m ζ_j(t_i),

so that E{(v̂_j − v̄)²} ≤ m^{−2} Σ_{i=1}^m E{[ζ_j(t_i)]²} =: υ². By the Chebyshev inequality, Prob{|v̂_j − v̄| ≥ 2υ} ≤ 1/4, and

  Prob{median_j[v̂_j] − v̄ ≥ 2υ} ≤ Prob{Σ_j 1{v̂_j − v̄ ≥ 2υ} ≥ J/2} ≤ Prob{B(J, 1/4) ≥ J/2} ≤ e^{−Jψ(1/2, 1/4)} ≤ e^{−0.14J},

where ψ(·,·) is defined in (58). Because the same bound holds for Prob{median_j[v̂_j] − v̄ ≤ −2υ}, we conclude that

  Prob{|v̂ − v̄| ≥ 2υ} = Prob{|median_j[v̂_j] − v̄| ≥ 2υ} ≤ 2e^{−J/8} ≤ ε  (68)

for J ≥ 8 ln(2/ε). Furthermore, if (63) holds we have (using u(t_i) ≤ max[u(0), u(1)] by convexity of u)

  E{(v̂_j − v̄)²} ≤ m^{−2} Σ_{i=1}^m [µ(u(t_i) − u_*) + ς²] ≤ (µ/m)[(u(1) − u_*) + (u(0) − u_*)] + ς²/m =: ῡ²,

implying (68) with υ replaced by ῡ:

  Prob{|v̂ − v̄| ≥ 2ῡ} ≤ 2e^{−J/8} ≤ ε.  (69)

2°. Next, we bound the difference v̄ − v. Let s_i = i/m, i = 0, ..., m, and r_i = u′(s_i) − u′(s_{i−1}) (note that r_i ≥ 0 by convexity of u). Since t_i is the midpoint of [s_{i−1}, s_i],

  δ_i := ∫_{s_{i−1}}^{s_i} [u′(s) − u′(t_i)] ds ≤ (1/4) r_i (s_i − s_{i−1}) = (4m)^{−1} r_i,

so that

  v − v̄ = Σ_{i=1}^m δ_i ≤ (4m)^{−1} Σ_{i=1}^m r_i = (4m)^{−1} [u′(1) − u′(0)].

Let now t_* ∈ [0,1] be a minimizer of u on [0,1]. By the standard properties of smooth convex functions we have

  |u′(0) − u′(t_*)|² ≤ 2ℳ[u(0) − u_* + t_* u′(t_*)] ≤ 2ℳ[u(0) − u_*]

and

  |u′(1) − u′(t_*)|² ≤ 2ℳ[u(1) − u_* − (1 − t_*) u′(t_*)] ≤ 2ℳ[u(1) − u_*].

We conclude that

  u′(1) − u′(0) ≤ |u′(1) − u′(t_*)| + |u′(t_*) − u′(0)| ≤ √(2ℳ[u(0) − u_*]) + √(2ℳ[u(1) − u_*]),

and

  v − v̄ ≤ (4m)^{−1}[u′(1) − u′(0)] ≤ (4m)^{−1}[√(2ℳ[u(0) − u_*]) + √(2ℳ[u(1) − u_*])].

The bound for v̄ − v is completely analogous, implying that

  |v − v̄| ≤ (4m)^{−1}[√(2ℳ(u(1) − u_*)) + √(2ℳ(u(0) − u_*))].

Substituting the latter bound and the bound (68) into (67), we obtain

  Prob{|v̂ − v| ≥ 2υ + |v̄ − v|} ≤ ε for J ≥ 8 ln(2/ε),

which implies (62). Replacing (68) with (69) in the above derivation (and noting that, due to µ ≥ ℳ and m ≥ 1, the resulting deviation does not exceed ρ̄), we obtain (64). □
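The estimator of Proposition A.4 is a median-of-means version of a midpoint-rule quadrature for v = u(1) − u(0) = ∫₀¹ u′(t) dt. A hedged sketch with a toy check; the interface `U_prime(t, rng)`, assumed to return one noisy realization of u′(t), is ours:

```python
import numpy as np

def mom_difference(U_prime, m, J, rng):
    """Median-of-means estimate of v = u(1) - u(0) as in Proposition A.4:
    average U'(t_i, omega) over the midpoints t_i = (2i-1)/(2m) within each
    of J blocks, then take the median of the J block averages."""
    t = (2 * np.arange(1, m + 1) - 1) / (2 * m)          # midpoint grid
    block_means = [np.mean([U_prime(ti, rng) for ti in t]) for _ in range(J)]
    return np.median(block_means)

# toy check: u(t) = (t - 0.3)^2, so u'(t) = 2(t - 0.3) and v = u(1) - u(0) = 0.4
rng = np.random.default_rng(1)
U_prime = lambda t, rng: 2 * (t - 0.3) + 0.5 * rng.normal()
print(mom_difference(U_prime, m=64, J=11, rng=rng))      # close to 0.4
```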
A.5 Proofs for Section 3.2

The following statement is essentially well known:
Lemma A.3
Let φ ∈ R^{p×q}, with q ≤ p for the sake of definiteness, be a random sub-Gaussian matrix, φ ~ SG(0, S), meaning that

  ∀x ∈ R^{p×q}: E{e^{⟨x,φ⟩}} ≤ e^{⟨x,S(x)⟩/2}.  (70)

Suppose that S ⪯ s̄I; then E{‖φ‖_*²} ≤ Cs̄(p + q) and E{‖φ‖_*⁴} ≤ C′s̄²(p + q)², where C and C′ are absolute constants.

Proof of the lemma. 1°. Let u ∈ R^q be such that ‖u‖₂ = 1. Then the random vector ζ = φu ∈ R^p is sub-Gaussian with ζ ~ SG(0, Q), that is, for any v ∈ R^p,

  E{e^{vᵀζ}} = E{e^{vᵀφu}} = E{e^{⟨uvᵀ, φ⟩}} ≤ e^{⟨uvᵀ, S(uvᵀ)⟩/2} = e^{vᵀQv/2},

where Q = Qᵀ ∈ R^{p×p}. Note that

  max_{‖v‖₂=1} vᵀQv = max_{‖v‖₂=1} ⟨uvᵀ, S(uvᵀ)⟩ ≤ max_{‖w‖=1} ⟨w, S(w)⟩ ≤ s̄.

Therefore, we have Q ⪯ s̄I, and Tr(Q) ≤ s̄p.

2°. Let Γ = {u ∈ R^q: ‖u‖₂ = 1}, let D_ǫ be a minimal ǫ-net, w.r.t. ‖·‖₂, in Γ, and let N_ǫ be the cardinality of D_ǫ. We claim that

  {uᵀφᵀφu ≤ υ ∀u ∈ D_ǫ} ⟹ {‖φᵀφ‖_* ≤ (1 − 2ǫ)^{−1}υ}.  (71)

Indeed, let the premise of (71) hold true; φᵀφ is symmetric, so let v̄ ∈ Γ be such that v̄ᵀφᵀφv̄ = ‖φᵀφ‖_*. There exists u ∈ D_ǫ such that ‖v̄ − u‖₂ ≤ ǫ, whence

  ‖φᵀφ‖_* = |v̄ᵀφᵀφv̄| ≤ 2‖φᵀφ‖_* ‖v̄ − u‖₂ + |uᵀφᵀφu| ≤ 2‖φᵀφ‖_* ǫ + υ

(note that the quadratic form zᵀAz is Lipschitz continuous on Γ, with constant 2‖A‖_* w.r.t. ‖·‖₂), whence ‖φᵀφ‖_* ≤ (1 − 2ǫ)^{−1}υ.

3°. We can straightforwardly build an ǫ-net D′ in Γ in such a way that the ‖·‖₂-distance between every two distinct points of the net is > ǫ, so that the balls B_v = {z ∈ R^q: ‖z − v‖₂ ≤ ǫ/2}, v ∈ D′, are mutually disjoint. Since the union of these balls belongs to B = {z ∈ R^q: ‖z‖₂ ≤ 1 + ǫ/2}, we get Card(D′)(ǫ/2)^q ≤ (1 + ǫ/2)^q, that is, N_ǫ ≤ Card(D′) ≤ (1 + 2/ǫ)^q.

Now we need the following well-known result (we present its proof at the end of this section for the sake of completeness).

Lemma A.4. Let ζ ~ SG(0, Q) be a sub-Gaussian random vector in R^n, i.e.,

  ∀t ∈ R^n: E{e^{tᵀζ}} ≤ e^{tᵀQt/2}.  (72)

Then for all x ≥ 0,

  Prob{‖ζ‖₂² ≥ Tr(Q) + 2√(xv) + 2xq̄} ≤ e^{−x},  (73)

where q̄ = max_i σ_i(Q) is the principal eigenvalue of Q and v = ‖Q‖_F² = Σ_i σ_i²(Q) is the squared Frobenius norm of Q. Thus, for any α > 0,

  Prob{‖ζ‖₂² ≥ Tr(Q)(1 + α^{−1}) + (2 + α)xq̄} ≤ e^{−x}.  (74)

Utilizing (74) with α = 1 we conclude that for every u ∈ Γ the random vector ζ = φu satisfies

  Prob{‖ζ‖₂² ≥ 2s̄p + 3s̄x} ≤ e^{−x}.  (75)

Let us set ǫ = 1/4; utilizing (75), we conclude that the probability of violating the premise of (71) with υ = 2s̄p + 3s̄x does not exceed N_ǫ e^{−x} ≤ exp{−x + q ln(1 + 2/ǫ)} = exp{−x + q ln 9}, so that

  Prob{‖φᵀφ‖_* ≥ 2s̄(2p + 3x)} ≤ exp{−x + q ln 9}.

Now we are done: recall that, with u₀ := s̄(4p + 6q ln 9),

  E{‖φ‖_*⁴} = E{‖φᵀφ‖_*²} = 2 ∫₀^∞ Prob{‖φᵀφ‖_* ≥ u} u du ≤ 2 ∫₀^∞ u min{exp{(4s̄p − u)/(6s̄) + q ln 9}, 1} du
   ≤ 2 ∫₀^{u₀} u du + 2 ∫_{u₀}^∞ u exp{(u₀ − u)/(6s̄)} du ≤ u₀² + 12s̄u₀ + 72s̄² ≤ C′s̄²(p + q)².

Similarly we get E{‖φ‖_*²} ≤ Cs̄(p + q) for an appropriate C.

4°. Let us now prove Lemma A.4. Note that for 0 ≤ t < 1/q̄ and η ∈ R^n, η ~ N(0, I) independent of ζ, we have by (72)

  E{e^{t‖ζ‖₂²/2}} = E{E_η{e^{√t⟨ζ,η⟩}}} = E_η{E{e^{√t⟨ζ,η⟩}}} ≤ E_η{e^{t⟨η,Qη⟩/2}} = E_η{e^{t⟨η,Dη⟩/2}} = Π_i E_{η_i}{e^{tη_i²q_i/2}} = Π_i (1 − tq_i)^{−1/2},

where D = Diag(q_1, ..., q_n) is the diagonal matrix of eigenvalues of Q. Recall that one has, cf. [7, Lemma 8],

  −ln(1 − tq_i) − tq_i ≤ t²q_i²/(2(1 − tq_i)) ≤ t²q_i²/(2(1 − tq̄)) for 0 ≤ t < 1/q̄.

On the other hand, for 0 ≤ t < 1/q̄,

  Prob{‖ζ‖₂² − Tr(Q) ≥ u} ≤ E{exp{(t/2)[‖ζ‖₂² − Σ_i q_i − u]}} ≤ exp{−tu/2 + t²v/(4(1 − tq̄))}.

When choosing t = 2√x/(√v + 2q̄√x) (< 1/q̄) and u = 2√(xv) + 2xq̄, we obtain

  Prob{‖ζ‖₂² ≥ Tr(Q) + 2√(xv) + 2xq̄} ≤ e^{−x},

which is (73). Because v ≤ Tr(Q)q̄, the latter bound also implies (74). □
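A quick Monte Carlo sanity check of the tail bound (73) in the Gaussian case Q = Diag(q), for which the sub-Gaussian condition (72) holds with equality; the script and its parameters are ours, for illustration only:

```python
import numpy as np

rng = np.random.default_rng(2)
n, trials, x = 20, 200_000, 3.0
q = rng.uniform(0.5, 2.0, size=n)                   # eigenvalues of Q (diagonal case)
zeta = rng.normal(size=(trials, n)) * np.sqrt(q)    # zeta ~ N(0, Q), hence SG(0, Q)
thr = q.sum() + 2 * np.sqrt(x * (q ** 2).sum()) + 2 * x * q.max()  # bound of (73)
freq = np.mean((zeta ** 2).sum(axis=1) >= thr)
print(freq, np.exp(-x))                             # empirical tail vs e^{-x}
```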
References

[1] B. Adcock, A. C. Hansen, C. Poon, and B. Roman. Breaking the coherence barrier: A new theory for compressed sensing. In Forum of Mathematics, Sigma, volume 5. Cambridge University Press, 2017.
[2] A. Agarwal, S. Negahban, and M. J. Wainwright. Stochastic optimization and sparse statistical recovery: Optimal algorithms for high dimensions. In Advances in Neural Information Processing Systems, pages 1538–1546, 2012.
[3] R. F. Barber and W. Ha. Gradient descent with non-convex constraints: local concavity determines convergence. Information and Inference: A Journal of the IMA, 7(4):755–806, 2018.
[4] P. Bickel, Y. Ritov, and A. Tsybakov. Simultaneous analysis of Lasso and Dantzig selector. The Annals of Statistics, 37(4):1705–1732, 2009.
[5] A. Bietti and J. Mairal. Stochastic optimization with variance reduction for infinite datasets with finite sum structure. In Advances in Neural Information Processing Systems, pages 1623–1633, 2017.
[6] J. Bigot, C. Boyer, and P. Weiss. An analysis of block sampling strategies in compressed sensing. IEEE Transactions on Information Theory, 62(4):2125–2139, 2016.
[7] L. Birgé and P. Massart. Minimum contrast estimators on sieves: exponential bounds and rates of convergence. Bernoulli, 4(3):329–375, 1998.
[8] T. Blumensath and M. E. Davies. Iterative hard thresholding for compressed sensing. Applied and Computational Harmonic Analysis, 27(3):265–274, 2009.
[9] C. Boyer, J. Bigot, and P. Weiss. Compressed sensing with structured sparsity and structured acquisition. Applied and Computational Harmonic Analysis, 46(2):312–350, 2019.
[10] E. Candes. Compressive sampling. In Proceedings of the International Congress of Mathematicians, volume 3, pages 1433–1452, Madrid, Spain, August 22–30, 2006.
[11] E. Candes. The restricted isometry property and its implications for compressed sensing. Comptes rendus de l'Académie des Sciences, Mathématique, 346(9-10):589–592, 2008.
[12] E. Candes and Y. Plan. Tight oracle bounds for low-rank matrix recovery from a minimal number of random measurements. IEEE Transactions on Information Theory, 57(4):2342–2359, 2011.
[13] E. Candes and T. Tao. The Dantzig selector: Statistical estimation when p is much larger than n. The Annals of Statistics, 35(6):2313–2351, 2007.
[14] E. J. Candes and Y. Plan. A probabilistic and RIPless theory of compressed sensing. IEEE Transactions on Information Theory, 57(11):7235–7254, 2011.
[15] E. J. Candes and Y. Plan. Tight oracle inequalities for low-rank matrix recovery from a minimal number of noisy random measurements. IEEE Transactions on Information Theory, 57(4):2342–2359, 2011.
[16] E. J. Candès and Y. Plan. Near-ideal model selection by ℓ1 minimization. The Annals of Statistics, 37(5A):2145–2177, 2009.
[17] E. J. Candès and B. Recht. Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9(6):717–772, 2009.
[18] E. J. Candes, J. K. Romberg, and T. Tao. Stable signal recovery from incomplete and inaccurate measurements. Communications on Pure and Applied Mathematics, 59(8):1207–1223, 2006.
[19] G. Chen and M. Teboulle. Convergence analysis of a proximal-like minimization algorithm using Bregman functions. SIAM Journal on Optimization, 3(3):538–543, 1993.
[20] A. Dalalyan and P. Thompson. Outlier-robust estimation of a sparse linear model using ℓ1-penalized Huber's M-estimator. In Advances in Neural Information Processing Systems, pages 13188–13198, 2019.
[21] T. Eltoft, T. Kim, and T.-W. Lee. On the multivariate Laplace distribution. IEEE Signal Processing Letters, 13(5):300–303, 2006.
[22] M. Fazel, E. Candes, B. Recht, and P. Parrilo. Compressed sensing and robust recovery of low rank matrices. In 42nd Asilomar Conference on Signals, Systems and Computers, pages 1043–1047. IEEE, 2008.
[23] R. Foygel Barber and H. Liu. Between hard and soft thresholding: optimal iterative thresholding algorithms. Information and Inference: A Journal of the IMA, 2019. iaz027.
[24] S. Ghadimi and G. Lan. Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization, II: shrinking procedures and optimal algorithms. SIAM Journal on Optimization, 23(4):2061–2089, 2013.
[25] D. Hsu and S. Sabato. Heavy-tailed regression with a generalized median-of-means. In International Conference on Machine Learning, pages 37–45, 2014.
[26] P. Jain, A. Tewari, and P. Kar. On iterative hard thresholding methods for high-dimensional M-estimation. In Advances in Neural Information Processing Systems, pages 685–693, 2014.
[27] A. Juditsky, F. K. Karzan, and A. Nemirovski. On a unified view of nullspace-type conditions for recoveries associated with general sparsity structures. Linear Algebra and its Applications, 441:124–151, 2014.
[28] A. Juditsky, F. K. Karzan, A. Nemirovski, and B. Polyak. Accuracy guaranties for ℓ1 recovery of block-sparse signals. The Annals of Statistics, 40(6):3077–3107, 2012.
[29] A. Juditsky, A. Nazin, A. Nemirovsky, and A. Tsybakov. Algorithms of robust stochastic optimization based on mirror descent method. Automation and Remote Control, 80(9):1607–1627, 2019.
[30] A. Juditsky, A. Nazin, A. Tsybakov, and N. Vayatis. Generalization error bounds for aggregation by mirror descent with averaging. In Advances in Neural Information Processing Systems, pages 603–610, 2006.
[31] A. Juditsky and A. Nemirovski. Accuracy guarantees for ℓ1-recovery. IEEE Transactions on Information Theory, 57(12):7818–7839, 2011.
[32] A. Juditsky and A. Nemirovski. First order methods for nonsmooth convex large-scale optimization, I: general purpose methods. In S. Sra, S. Nowozin, and S. J. Wright, editors, Optimization for Machine Learning. MIT Press, Cambridge, 2011.
[33] A. Juditsky and A. Nemirovski. First order methods for nonsmooth convex large-scale optimization, I: general purpose methods. In S. Sra, S. Nowozin, and S. J. Wright, editors, Optimization for Machine Learning, pages 121–148. MIT Press, 2011.
[34] A. Juditsky and Y. Nesterov. Deterministic and stochastic primal-dual subgradient algorithms for uniformly convex minimization. Stochastic Systems, 4(1):44–80, 2014.
[35] V. Koltchinskii, K. Lounici, and A. B. Tsybakov. Nuclear-norm penalization and optimal rates for noisy low-rank matrix completion. The Annals of Statistics, 39(5):2302–2329, 2011.
[36] S. Kotz and S. Nadarajah. Multivariate t-Distributions and Their Applications. Cambridge University Press, 2004.
[37] G. Lecué and M. Lerasle. Robust machine learning by median-of-means: theory and practice. Technical report, 2017. arXiv preprint arXiv:1711.10306.
[38] G. Lecué and S. Mendelson. Regularization and the small-ball method I: sparse recovery. The Annals of Statistics, 46(2):611–641, 2018.
[39] M. Lerasle and R. I. Oliveira. Robust empirical mean estimators. Technical report, 2011. arXiv preprint arXiv:1112.3914.
[40] S. Minsker. Geometric median and robust estimation in Banach spaces. Bernoulli, 21(4):2308–2335, 2015.
[41] I. Necoara, Y. Nesterov, and F. Glineur. Linear convergence of first order methods for non-strongly convex optimization. Mathematical Programming, 175(1-2):69–107, 2019.
[42] S. Negahban and M. J. Wainwright. Estimation of (near) low-rank matrices with noise and high-dimensional scaling. The Annals of Statistics, 39(2):1069–1097, 2011.
[43] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009.
[44] A. S. Nemirovski and D. B. Yudin. Complexity of Problems and Effectiveness of Methods of Optimization. Nauka, Moscow, 1979 (in Russian). English translation: Problem Complexity and Method Efficiency in Optimization, J. Wiley & Sons, New York, 1983.
[45] Y. Nesterov. Primal-dual subgradient methods for convex problems. Mathematical Programming, 120(1):221–259, 2009.
[46] Y. Nesterov and A. Nemirovski. On first-order algorithms for ℓ1/nuclear norm minimization. Acta Numerica, 22:509–575, 2013.
[47] N. Nguyen, D. Needell, and T. Woolf. Linear convergence of stochastic iterative greedy algorithms with sparse constraints. IEEE Transactions on Information Theory, 63(11):6869–6895, 2017.
[48] G. Raskutti, M. J. Wainwright, and B. Yu. Restricted eigenvalue properties for correlated Gaussian designs. Journal of Machine Learning Research, 11:2241–2259, 2010.
[49] B. Recht, M. Fazel, and P. A. Parrilo. Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM Review, 52(3):471–501, 2010.
[50] M. Rudelson and S. Zhou. Reconstruction from anisotropic random measurements. In Conference on Learning Theory, Workshop and Conference Proceedings, volume 23, pages 10.1–10.28, 2012.
[51] S. Shalev-Shwartz and A. Tewari. Stochastic methods for ℓ1-regularized loss minimization. Journal of Machine Learning Research, 12:1865–1892, 2011.
[52] N. Srebro, K. Sridharan, and A. Tewari. Smoothness, low noise and fast rates. In Advances in Neural Information Processing Systems, pages 2199–2207, 2010.
[53] S. van de Geer and P. Bühlmann. On the conditions used to prove oracle results for the Lasso. Electronic Journal of Statistics, 3:1360–1392, 2009.