On Monte-Carlo methods in convex stochastic optimization
DANIEL BARTL AND SHAHAR MENDELSON
Abstract.
We develop a novel procedure for estimating the optimizer of general convex stochastic optimization problems of the form $\min_{x \in \mathcal X} \mathbb E[F(x,\xi)]$, when the given data is a finite independent sample selected according to $\xi$. The procedure is based on a median-of-means tournament, and is the first procedure that exhibits the optimal statistical performance in heavy-tailed situations: we recover the asymptotic rates dictated by the central limit theorem in a non-asymptotic manner once the sample size exceeds some explicitly computable threshold. Additionally, our results apply in the high-dimensional setup, as the threshold sample size exhibits the optimal dependence on the dimension (up to a logarithmic factor). The general setting allows us to recover recent results on multivariate mean estimation and linear regression in heavy-tailed situations and to prove the first sharp, non-asymptotic results for the portfolio optimization problem.

Contents
1. Introduction and appetizer
2. Main results
2.1. Difficulties caused by non-Gaussian tails
2.2. What to expect when tails are Gaussian
2.3. Recovering Gaussian rates without sub-Gaussian tails
2.4. The procedure
2.5. Related literature
3. Applications
3.1. Multivariate mean estimation
3.2. Linear regression
3.3. Ridge regression
3.4. Portfolio optimization
4. On the smallest singular value of general random matrix ensembles
5. Proofs of the main results
5.1. Estimation error, the quadratic term
5.2. Estimation error, the multiplier term
5.3. Prediction error
5.4. Proof under a deterministic lower bound of the Hessian
6. Proofs for the portfolio optimization problem
7. Concluding remarks
Date: January 21, 2021.

2010 Mathematics Subject Classification.

Key words and phrases. Stochastic optimization, sample-path optimization, stochastic counterpart method, finite sample / non-asymptotic concentration inequality.
References

1. Introduction and appetizer
Stochastic optimization is widely used as a way of solving certain problems numerically. It appears in diverse areas of mathematics, with a generic convex stochastic optimization problem taking the following form: One is given a random variable $\xi$ whose range is a measurable space $\Xi$, a convex set of actions $\mathcal X \subseteq \mathbb R^d$, and a function $F \colon \mathcal X \times \Xi \to \mathbb R$ that is convex in its first argument. The objective is to solve the optimization problem
\[ \min_{x \in \mathcal X} f(x) \quad \text{where} \quad f(x) := \mathbb E[F(x,\xi)]. \tag{SO} \]
In typical situations, however, one does not have access to the function $f$ directly. Rather, the information one is given is the set of values $(F(\cdot,\xi_i))_{i=1}^N$, where $(\xi_i)_{i=1}^N$ is an independent sample, selected according to $\xi$ and of cardinality $N$. This type of random data is natural, for example, if the distribution of $\xi$ is not known and the only possibility is to sample it; or when an exact computation of $f$ is unfeasible and one relies on Monte-Carlo methods to evaluate it instead. We refer the reader to Shapiro, Dentcheva, and Ruszczyński [35] or to Kim, Pasupathy, and Henderson [12] for introductions on such aspects of stochastic optimization.

Regardless of the reason why one uses a random sample, the fundamental question remains unchanged: to what degree can (SO) be recovered when the given data is a random sample?

To be more accurate, assume that (SO) admits a unique optimal action $x^*$ and denote by $\widehat x_N^*$ a candidate for the optimal action that is selected using some sample-based procedure. Given a prescribed error $r >$
$0$, one seeks to bound the probability that the estimation error $\|\widehat x_N^* - x^*\|$ or the prediction error / optimality gap $f(\widehat x_N^*) - f(x^*)$ exceeds $r$ in terms of the sample size $N$.

It should be stressed that the norm $\|\cdot\|$ need not be the Euclidean norm. Rather, the right choice of $\|\cdot\|$ turns out to be a natural Hilbertian structure endowed by the Hessian of $f$. The reason for that is clarified in what follows.

The typical approach used to produce $\widehat x_N^*$ is called sample average approximation (SAA) and is denoted in what follows by $\widehat x_N^{\mathrm{SAA}}$. The choice is very natural: $\widehat x_N^{\mathrm{SAA}}$ is a minimizer of the empirical mean $\widehat f_N := \frac1N \sum_{i=1}^N F(\cdot,\xi_i)$.

Asymptotic properties of the SAA solution have been thoroughly investigated. Roughly and inaccurately put, the SAA solution behaves asymptotically as one would expect based on the central limit theorem. However, as we shall explain immediately, these asymptotic results can be misleading. In fact, unless one imposes highly restrictive integrability assumptions, when given a finite sample the SAA solution behaves poorly: it exhibits drastically weaker rates than what one may expect based on the asymptotic behaviour.

In contrast to the SAA solution, we propose a novel procedure that aims at selecting the optimal action when given a finite random sample. This procedure exhibits the best possible performance regarding the estimation and prediction errors, and it does so in completely heavy-tailed situations. Our results are based on the methods developed by G. Lugosi and the second named author in [18, 19, 20].
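Concretely, the SAA recipe amounts to replacing the unknown expectation by an empirical average and minimizing that instead. A minimal sketch in Python (our own illustration, using a toy one-dimensional $F$ and a grid search; this is not the procedure proposed in this paper):

```python
import numpy as np

def saa(F, candidates, sample):
    """Sample average approximation: minimize the empirical mean
    f_N(x) = (1/N) * sum_i F(x, xi_i) over a grid of candidate actions."""
    values = [np.mean([F(x, xi) for xi in sample]) for x in candidates]
    return candidates[int(np.argmin(values))]

# Toy instance: F(x, xi) = (x - xi)^2, whose SAA minimizer is the sample mean.
rng = np.random.default_rng(0)
sample = rng.normal(loc=2.0, scale=1.0, size=500)
grid = np.linspace(-5.0, 5.0, 2001)            # grid spacing 0.005
x_hat = saa(lambda x, xi: (x - xi) ** 2, grid, sample)
```

For $F(x,\xi) = (x-\xi)^2$ the empirical objective is minimized at the sample mean, which the grid search recovers up to the grid spacing.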
Before explaining what we mean by “highly restrictive integrability assumptions”and indicating the very poor behaviour of SAA in their absence, let us interject withan example of an important convex stochastic optimization problem. This examplewill accompany us throughout the article and will help concretize the results andassumptions of this article.
Example 1.1 (Portfolio optimization). Modern portfolio theory was initiated by Markowitz [21] and is among the central optimization problems in mathematical finance. Without going into details, the portfolio optimization problem has three ingredients:
• a $d$-dimensional random vector $X$, which is interpreted as (discounted) future prices of some stocks/goods;
• a random variable $Y$, which is interpreted as the (random) future payoff;
• a concave, increasing utility function $U \colon \mathbb R \to \mathbb R$.
We assume that the stocks $X$ are available for buying and selling today at the prices $\pi$, so that a trading strategy $x \in \mathbb R^d$ bears the cost $\langle \pi, x\rangle := \sum_{i=1}^d \pi_i x_i$ and gives the (random, future) payoff $\langle X, x\rangle$. After trading according to the strategy $x$, the investor's terminal wealth is $Y + \langle X - \pi, x\rangle$ and her goal is to maximize the expected utility, namely
\[ \max_{x \in \mathcal X} \mathbb E\big[U\big(Y + \langle X - \pi, x\rangle\big)\big], \]
where $\mathcal X \subseteq \mathbb R^d$ is closed and convex. (For instance $\mathcal X = [0,\infty)^d$ corresponds to short-selling constraints.) We refer, e.g., to Föllmer and Schied [7, Chapter 3] or to Shapiro, Dentcheva, and Ruszczyński [35, Section 1.4] for a more elaborate introduction to the portfolio optimization problem.

Setting $F(x,\xi) := \ell(-Y - \langle X - \pi, x\rangle)$ with $\xi := (X,Y)$ and $\ell := -U(-\,\cdot\,)$, the portfolio optimization problem is indeed a special instance of (SO).

It is important to stress that the portfolio optimization problem highlights the natural presence of heavy tails: for one, even if the input $X$ is light-tailed (e.g. Gaussian, as is the case in the Bachelier model), the composition with the utility function $U$ can render the problem heavy-tailed. This is particularly true for the exponential utility function $U = -\exp(-\,\cdot\,)$, which is arguably (among) the most important utility functions.
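The reduction of the portfolio problem to (SO) is a one-line change of sign. The following sketch (our own, with an exponential utility and hypothetical data) checks the identity $\ell(-Y - \langle X - \pi, x\rangle) = -U(Y + \langle X - \pi, x\rangle)$ numerically, so that minimizing $\mathbb E[F(x,\xi)]$ indeed maximizes the expected utility:

```python
import numpy as np

def U(w):
    """Exponential utility U(w) = -exp(-w) (concave, increasing)."""
    return -np.exp(-w)

def ell(t):
    """Loss ell := -U(-.), turning utility maximization into loss minimization."""
    return -U(-t)

def F(x, X, Y, pi):
    """F(x, xi) = ell(-Y - <X - pi, x>) with xi = (X, Y)."""
    return ell(-Y - np.dot(X - pi, x))

rng = np.random.default_rng(1)
X = rng.normal(size=3)      # future prices
pi = rng.normal(size=3)     # today's prices
Y = rng.normal()            # future payoff
x = rng.normal(size=3)      # a trading strategy
# Minimizing E[F(x, xi)] is the same as maximizing E[U(Y + <X - pi, x>)]:
lhs = F(x, X, Y, pi)
rhs = -U(Y + np.dot(X - pi, x))
```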
Additionally, $X$ itself is often heavy-tailed; for instance, in the famous Black-Scholes model $X$ is log-normal.

Before explaining the new procedure, it is worthwhile to outline what is known on the statistical performance of the sample average approximation method. As we already mentioned, the vast majority of known results are of an asymptotic nature. To ease notation we assume here and for the rest of this section that $\nabla^2 f(x^*) = \mathrm{Id}$ (recall that $x^*$ is the unique minimizer of $f$).

One can show that under suitable regularity and mild integrability assumptions, $\sqrt N(\widehat x_N^{\mathrm{SAA}} - x^*)$ is asymptotically a multivariate Gaussian with zero mean and covariance matrix $\mathrm{Cov}[\nabla F(x^*,\xi)]$, see e.g. [35, Chapter 5]. In particular, if we set
\[ \sigma^2 := \lambda_{\max}\big(\mathrm{Cov}[\nabla F(x^*,\xi)]\big) \]
to be the largest eigenvalue of the covariance matrix of the gradient of $F$ at $x^*$ and denote by $\|\cdot\|$ the Euclidean norm, it follows that
\[ \mathbb P\Big[\big\|\widehat x_N^{\mathrm{SAA}} - x^*\big\| \ge r\Big] \approx 2\exp\Big(-\frac{cNr^2}{\sigma^2}\Big) \quad \text{asymptotically as } N \to \infty. \]
This error rate is often used to calculate the minimal sample size $N$ required to guarantee that the estimation error is below the wanted threshold $r$ with some prescribed confidence $1-\delta$. However, as we shall explain in Section 2.1 below, unless (restrictive) integrability assumptions are imposed, the asymptotic exponential decay really does hold only asymptotically. Indeed, it may very well be possible that the non-asymptotic (or, finite sample) rate is only
\[ \mathbb P\Big[\big\|\widehat x_N^{\mathrm{SAA}} - x^*\big\| \ge r\Big] \le \frac{c\sigma^2}{Nr^2} \quad \text{for all } N, \]
with the probability in question decaying polynomially in $N$ rather than exponentially.

The meaning of this significant gap between asymptotic and finite sample behaviour is that the asymptotic estimate is misleading: while it suggests that $\frac{\sigma^2}{r^2}\cdot\log(\frac1\delta)$ samples are enough to guarantee a confidence of $1-\delta$, one actually needs $\frac{\sigma^2}{r^2\,\delta}$ samples. For small values of $\delta$ this gap in the required sample size is significant.

Our contribution is the following.
We construct a procedure $\widehat x_N^*$ that estimates the optimal action, but it is not SAA. Recalling that $\nabla^2 f(x^*) = \mathrm{Id}$ throughout this section for notational simplicity, given a finite sample, the procedure recovers the optimal Gaussian rate under modest integrability assumptions: for (small) $r > 0$,
\[ \mathbb P\Big[\big\|\widehat x_N^* - x^*\big\| \ge r\Big] \le 2\exp\Big(-\frac{CNr^2}{\sigma^2}\Big) \quad \text{whenever } N \ge N_0(r), \]
where $N_0(r)$ can be controlled explicitly and depends only on certain low-order moments of the random variables involved (see Theorem 2.9 for the precise statement). As a matter of fact, in typical situations
\[ N_0(r) \lesssim \max\Big\{ d\log(d),\ \frac{\mathrm{trace}\big(\mathrm{Cov}[\nabla F(x^*,\xi)]\big)}{r^2} \Big\}. \]
It is worthwhile mentioning that the prediction error is one order of magnitude smaller than the estimation error, namely, for $N \ge N_0(r)$,
\[ \mathbb P\Big[f(\widehat x_N^*) \ge f(x^*) + r^2\Big] \le 2\exp\Big(-\frac{CNr^2}{\sigma^2}\Big). \]
Although this is a trivial consequence of the first order condition for optimality if $\mathcal X = \mathbb R^d$, it is far less obvious if $\mathcal X$ is a proper subset of $\mathbb R^d$ and $x^*$ lies on the boundary of $\mathcal X$.

Remark 1.2.
While our procedure showcases the possibility of drastically improving the statistical properties of the SAA, this improvement does not come for free. A major advantage of SAA is its computational simplicity, and unfortunately, our procedure is the outcome of a (rather complex) tournament that takes place between the actions in $\mathcal X$. As a result, its computational cost is quite high (see Section 2.4). Tournament-based procedures are used in other natural statistical problems and there are ongoing attempts to obtain alternative procedures that maintain the tournament's optimal statistical performance in a computationally feasible way; see, for example, [6, 11]. Finding procedures that do the same for stochastic optimization is a worthwhile challenge.
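The tournament mentioned above is built on the median-of-means principle (detailed in Section 2.1): averages over disjoint blocks are combined through a median, so that a few wild observations cannot offset the estimate. A minimal univariate sketch, ours rather than the paper's procedure, with hypothetical heavy-tailed test data:

```python
import numpy as np

def median_of_means(sample, n_blocks):
    """Average the sample over n_blocks disjoint blocks and return the
    median of the block means; outliers corrupt only a few blocks, so
    the median remains stable even for heavy-tailed data."""
    sample = np.asarray(sample, dtype=float)
    usable = (len(sample) // n_blocks) * n_blocks  # drop the remainder
    blocks = sample[:usable].reshape(n_blocks, -1)
    return float(np.median(blocks.mean(axis=1)))

rng = np.random.default_rng(0)
# Pareto(3) data: finite variance, heavy (polynomial) tails; true mean 0.5
data = rng.pareto(3.0, size=10_000)
mom_estimate = median_of_means(data, n_blocks=30)
```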
Plan of the article.
We begin Section 2 with a detailed explanation of the devastating effect heavy-tailed random variables can have on SAA; we then formulate our main result (Theorem 2.9), discuss its application to the portfolio optimization problem, and survey related literature. Section 3 contains a description of several other applications of our main result. In Section 4 we lay the groundwork for the proof of Theorem 2.9 by establishing a high probability lower bound on the smallest singular value of a general random matrix ensemble (see Theorem 4.4 and Corollary 4.5), a result that is of independent interest. This lower bound will be used in Section 5, where we prove our main result. Proofs related to the portfolio optimization problem are presented in Section 6. Finally, Section 7 contains two concluding remarks.
2. Main results
2.1. Difficulties caused by non-Gaussian tails.
Let us revisit our claim that heavy tails drastically change the statistical performance of the sample average approximation. As a starting point, consider the more basic problem of estimating the mean $\mu := \mathbb E[\xi]$ of a one-dimensional, square integrable random variable $\xi$. It should be noted that by setting $F(x,\xi) := x^2 - 2\xi x$, one-dimensional mean estimation becomes a stochastic optimization problem.

Following the SAA approach, the corresponding estimator for the mean is $\widehat\mu_N := \frac1N\sum_{i=1}^N \xi_i$. The central limit theorem then guarantees that
\[ \mathbb P\big[|\widehat\mu_N - \mu| \ge r\big] \le 2\exp\Big(-\frac{cNr^2}{\sigma^2}\Big) \quad \text{asymptotically as } N \to \infty, \tag{2.1} \]
where $\sigma^2 := \mathrm{Var}[\xi]$ denotes the variance of $\xi$. On the other hand, invoking Markov's inequality to bound the probability in (2.1) for finite $N$ only implies that
\[ \mathbb P\big[|\widehat\mu_N - \mu| \ge r\big] \le \frac{\sigma^2}{Nr^2} \quad \text{for every } N. \tag{2.2} \]
To see that (2.2) is, in general, the best one can hope for, fix $r$ and $N$ such that $Nr^2 \ge$
$1$, and let $\xi$ be the symmetric random variable taking the values $\pm Nr$ with probability $\frac{1}{2N^2r^2}$ each and $0$ with probability $1 - \frac{1}{N^2r^2}$. Then $\mu = 0$ and $\sigma^2 = 1$. Moreover, given a sample of cardinality $N$, a straightforward computation shows that there is an absolute constant $C$ such that the following holds: with probability at least $\frac{C}{Nr^2}$, exactly one of the sample points is nonzero. On that event we clearly have $|\widehat\mu_N - \mu| = r$, showing that the estimate in Markov's inequality (2.2) is sharp (up to the absolute multiplicative constant $C$).

In order to improve (2.2) one needs to impose a stronger integrability assumption and, eventually, one can show that (2.1) holds for all $N$ provided that $\xi$ has sub-Gaussian tails in the sense that $\mathbb P[|\xi - \mathbb E[\xi]| \ge t]$ is at most of the order $\exp(-\frac{ct^2}{\sigma^2})$ for $t \ge c'\sigma$.

At this point, sub-Gaussian tails seem unavoidable if one's goal is to have finite sample estimates that match the asymptotic ones (as dictated by the central limit theorem). There is, however, one important possibility that so far has been neglected: we are free to come up with an alternative estimator instead of the empirical average. To explain, at an intuitive level, how this might be a way out of our predicament, note that the non-optimal performance of the empirical mean stems from the fact that, in the presence of heavy tails, some of the observations will have untypically large values. These observations, while few in number, offset the empirical mean from its true counterpart, and the hope is that getting rid of those outliers would lead to a better statistical performance. The so-called median-of-means estimator is a simple yet powerful estimator that does just that. It goes back at least to Nemirovsky and Yudin [27].

Partition the sample $\{1, \dots, N\} = \cup_{j=1}^n I_j$ into $n$ disjoint blocks $I_j$ of cardinality $m := \frac{3\sigma^2}{r^2}$ (w.l.o.g. assume that $n$ and $m$ are integers). By (2.2), we have that $\mathbb P[|\widehat\mu_{I_j} - \mu| \ge r] \le$
$\frac13$, where $\widehat\mu_{I_j} := \frac1m \sum_{i \in I_j} \xi_i$, and a basic Binomial calculation reveals that the probability that the majority of the $n$ blocks satisfy $|\widehat\mu_{I_j} - \mu| \ge r$ is of the order of $\exp(-cn)$. Since $n = \frac{Nr^2}{3\sigma^2}$, we conclude that
\[ \mathbb P\Big[\Big|\operatorname*{median}_{j=1,\dots,n} \widehat\mu_{I_j} - \mu\Big| \ge r\Big] \le 2\exp\Big(-\frac{CNr^2}{\sigma^2}\Big) \quad \text{for all } N. \]
In other words, the median-of-means estimator exhibits the best possible performance (2.1) (up to a multiplicative constant) under the sole assumption that $\xi$ has a finite second moment.

Appealing as this sounds, it is important to stress that the median is a one-dimensional object and has no simple vector-valued analogue. In fact, the question of an optimal multivariate mean estimation procedure, assuming only that the vector has a finite mean and covariance, remained open until it was resolved recently in [19]. In contrast, stochastic optimization is a multi-dimensional problem, and just like multivariate mean estimation, simply minimizing the functional $\operatorname{median}_j \frac1m \sum_{i \in I_j} F(\cdot,\xi_i)$ does not lead to an optimal estimator. What has a better chance of success is the tournament procedure, which happens to be a powerful extension of the idea of median-of-means. We will explain the procedure in Section 2.4 below.

2.2. What to expect when tails are Gaussian.
Ignoring for a second the difficulties caused by non-Gaussian tails, let us explain the kind of result one could hope for in general convex stochastic optimization problems and how the underlying dimension $d$ enters the picture. This will serve as our benchmark in what follows.

To that end, consider the case where $F$ is a quadratic function defined on the whole of $\mathbb R^d$, that is,
\[ F(x,\xi) := \langle b, x\rangle + \tfrac12 \langle Ax, x\rangle \quad \text{for } x \in \mathcal X := \mathbb R^d, \tag{2.3} \]
where $b = b(\xi)$ is a $d$-dimensional Gaussian vector with zero mean and $A = A(\xi)$ is a random positive definite symmetric $(d \times d)$-matrix specified in what follows. Although this example appears to be very special, it should not be considered as a toy example: by a second order Taylor expansion, every convex function is approximately a quadratic function, at least locally, around the minimizer.

By (2.3), it follows that
\[ \nabla^2 f(x^*) = \mathbb E[A], \quad \nabla F(x^*,\xi) = b \quad \text{and} \quad \nabla^2 F(x^*,\xi) = A, \]
and we assume throughout that $\mathbb E[A]$ is non-degenerate (i.e. $\mathbb E[A]$ has full rank). Setting $\|z\|^2 := \langle \nabla^2 f(x^*) z, z\rangle$ for $z \in \mathbb R^d$ and recalling that $b$ has zero mean, it is evident that $f = \frac12\|\cdot\|^2$; thus, the optimal action is given by $x^* = 0$.

Remark 2.1.
To get a clearer picture of this setup, it might help the reader to first consider the case $\nabla^2 f(x^*) = \mathrm{Id}$, and then $\|\cdot\|$ is just the Euclidean norm.

The advantage of using the quadratic function considered here is that the sample average approximation optimizer has a particularly simple form: $\widehat x_N^{\mathrm{SAA}}$ is any element satisfying
\[ \Big(\frac1N \sum_{i=1}^N A_i\Big)\, \widehat x_N^{\mathrm{SAA}} = -\frac1N \sum_{i=1}^N b_i. \tag{2.4} \]
To explain the statistical behavior of $\widehat x_N^{\mathrm{SAA}}$, let us first focus on the gradient and assume for the sake of simplicity that the Hessian is deterministic, that is, $A = \mathbb E[A]$. In that case, $\nabla^2 f(x^*)\, \widehat x_N^{\mathrm{SAA}}$ is Gaussian with mean $x^* = 0$ and covariance $\frac1N \mathrm{Cov}[\nabla F(x^*,\xi)]$. A straightforward computation (noting that $\|\cdot\| = \|\nabla^2 f(x^*)^{1/2}\,\cdot\|_2$) reveals the following: for the estimation error $\|\widehat x_N^{\mathrm{SAA}} - x^*\|$ to be smaller than $r$ with constant probability (say, with probability at least $\frac34$), it is necessary to have a sample size of cardinality larger than $N \ge N_G(r)$, where
\[ N_G(r) := \frac{1}{r^2}\, \mathrm{trace}\Big(\nabla^2 f(x^*)^{-1} \cdot \mathrm{Cov}[\nabla F(x^*,\xi)]\Big). \tag{2.5} \]
On the other hand, for large $N$, it follows from the concentration of a Lipschitz function of the Gaussian vector that the probability that the estimation error exceeds $r$ is of the order $2\exp(-\frac{cNr^2}{\sigma^2})$, where the variance parameter is
\[ \sigma^2 := \lambda_{\max}\Big(\nabla^2 f(x^*)^{-1} \cdot \mathrm{Cov}[\nabla F(x^*,\xi)]\Big). \tag{2.6} \]
To summarize, when the quadratic function has a deterministic Hessian, the minimal sample size needed to guarantee that the estimation error does not exceed $r$ with constant probability is $N_G(r)$, whereas the correct variance parameter (namely $\sigma^2$) dictates the high-probability regime. Note that the latter, of course, matches the variance parameter of [35, Chapter 5], as stated in Section 1.

In a next step, still within the setting of the quadratic function (2.3), let us remove the assumption that the Hessian is deterministic.
In that case, if one wishes to make any statement regarding the estimation (or, prediction) error, the empirical Hessian $\frac1N \sum_{i=1}^N A_i$ on the left hand side of (2.4) must not be degenerate. One can readily verify that, unless some specific assumptions are made, the empirical Hessian is singular with probability $1$ whenever $N < d$. Thus, $N \ge d$ is another restriction on the minimal sample size (though, at this point, it is far from obvious that a sample of size $d$ or proportional to $d$ would suffice to guarantee a non-degenerate Hessian with, say, constant probability).
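The rank obstruction described above is easy to observe numerically. A small sketch of our own, using rank-one summands $A_i = X_i X_i^\top$ (a hypothetical choice, made only so that singularity below $N = d$ is guaranteed):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 20

def empirical_hessian(N):
    """Empirical Hessian (1/N) sum_i A_i with rank-one summands
    A_i = X_i X_i^T built from i.i.d. Gaussian vectors X_i."""
    X = rng.normal(size=(N, d))
    return (X.T @ X) / N

# With N < d the empirical Hessian has rank at most N, hence is singular;
# for N substantially larger than d it is invertible almost surely.
rank_small = np.linalg.matrix_rank(empirical_hessian(10))
rank_large = np.linalg.matrix_rank(empirical_hessian(200))
```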
Following these observations one can make a very optimistic guess on the estimate one can hope to obtain: that there exists a procedure $\widehat x_N^*$ such that for $N \ge \max\{N_G(r), d\}$, with probability at least
\[ 1 - 2\exp\Big(-\frac{cNr^2}{\sigma^2}\Big), \]
we have that $\|\widehat x_N^* - x^*\| \le r$.

Remark 2.2.
This is indeed an optimistic guess, and is "very Gaussian". The minimal sample size $\max\{N_G(r), d\}$ is the result of rather trivial obstructions; their removal is necessary if the estimation error is to have any chance of being smaller than $r$ with constant probability. Furthermore, the variance term $\sigma^2$, which dictates the high probability regime (for large $N$), is effectively one-dimensional: it corresponds to the worst direction of the gradient (w.r.t. the norm $\|\cdot\|$).

Before we proceed with the main (affirmative) result of this article, let us conclude this section with a comment regarding the relation between the estimation error and the prediction error / the optimality gap, in a more general setup than the simple example we presented previously.

If $\mathcal X = \mathbb R^d$, a Taylor expansion and the first order condition for optimality of $x^*$ immediately imply that
\[ f(x) - f(x^*) = \tfrac12 \big\langle \nabla^2 f(y)(x - x^*),\, x - x^*\big\rangle, \]
where $y$ is some mid-point between $x^*$ and $x$. In particular, setting $\|B\|_{\mathrm{op}} := \sup_{z \in \mathbb R^d \,:\, \|z\| \le 1} \langle Bz, z\rangle$ to be the operator norm of a positive semi-definite $(d \times d)$-matrix $B$, and
\[ c_H := \sup_{y \in \mathcal X \,:\, \|y - x^*\| < 1} \big\|\nabla^2 f(y)\big\|_{\mathrm{op}}, \tag{2.7} \]
it is clear that $f(x) - f(x^*) \le c_H r^2$ whenever $\|x - x^*\| \le r$ and $r <$
$1$. Thus, one can make another highly optimistic guess: that there exists a procedure $\widehat x_N^*$ for which the prediction error / optimality gap is smaller than the estimation error by at least one order of magnitude.

What makes this guess optimistic (and nontrivial) is that the above argument crucially relies on the fact that $\mathcal X = \mathbb R^d$; or, more generally, that $x^*$ lies in the interior of $\mathcal X$. That need not be the case.

2.3. Recovering Gaussian rates without sub-Gaussian tails.
This section contains the main result of the article, formulated in Theorem 2.9 below. It provides affirmative answers to the optimistic guesses made in the previous section (under some mild assumptions, of course). The assumptions might appear technical at first glance, and to help the reader put them in context, each assumption will be explained in the case of portfolio optimization, and in a heuristic manner; the detailed analysis will be presented in Section 6.

Note that $\|\cdot\|_{\mathrm{op}}$ is indeed the operator norm from $(\mathbb R^d, \|\cdot\|)$ to $(\mathbb R^d, \|\cdot\|_*)$, where $\|\cdot\|_*$ is the dual norm of $\|\cdot\|$, i.e. $\|y\|_* := \sup_{\|x\| \le 1} \langle x, y\rangle$.
Assumption 2.3 (Convexity and coercivity). The following hold:
(a) $\mathcal X \subseteq \mathbb R^d$ is closed and convex;
(b) $x \mapsto F(x,\gamma)$ is convex and twice continuously differentiable for every $\gamma \in \Xi$;
(c) $F(x,\xi)$ and $\nabla^2 F(x,\xi)$ are integrable and $\nabla F(x,\xi)$ is square integrable for every $x \in \mathcal X$.
Further, there exists an optimal action $x^* \in \mathcal X$ that satisfies $f(x^*) = \inf_{x \in \mathcal X} f(x)$, and the seminorm induced by the Hessian of $f$ at $x^*$, given by
\[ \|z\|^2 := \langle \nabla^2 f(x^*) z, z\rangle \quad \text{for } z \in \mathbb R^d, \]
is a true norm (i.e. $\|z\| = 0$ implies $z = 0$).

Remark 2.4.
While $f$ clearly inherits convexity from $F$, it is not clear a priori that $f$ is twice continuously differentiable (in the same sense as $F$ if $\mathcal X$ is not open). This follows once Assumption 2.7 below is imposed, as we shall explain in the beginning of Section 5.

Note that the minimizer $x^*$ in Assumption 2.3 is unique, as $\|\cdot\|$ is a norm. In fact, the latter relates to a standard assumption in stochastic optimization, the so-called quadratic growth condition: that there is a constant $\kappa > 0$ such that
\[ f(x) \ge f(x^*) + \kappa \|x - x^*\|_2^2 \quad \text{for all } x \text{ close to } x^*. \]
Indeed, denoting by $\tilde\kappa := \lambda_{\min}(\nabla^2 f(x^*))$ the smallest eigenvalue of the Hessian of $f$ at $x^*$, we have that $\tilde\kappa > 0$ if and only if $\|\cdot\|$ is a true norm. Moreover, a Taylor expansion shows that the quadratic growth condition holds with constant $\kappa = \frac{\tilde\kappa}{2}$ for all $x$ in an infinitesimal neighbourhood of $x^*$ (or with constant $\kappa = \frac{\tilde\kappa}{4}$ in a sufficiently small neighbourhood). Conversely, at least when $x^*$ lies in the interior of $\mathcal X$, the quadratic growth condition readily implies $\tilde\kappa \ge \kappa$ and in particular that $\|\cdot\|$ is a true norm.

Let us now give an intuitive interpretation of Assumption 2.3 in the context of the portfolio optimization example. To ease notation, we shall assume that $\pi = \mathbb E[X] = 0$ and that $\mathrm{Cov}[X] = \mathrm{Id}$. The convexity and differentiability parts of the assumption are clear, and it is straightforward to verify that
\[ \nabla F(x,\xi) = -\ell'(-Y - \langle X, x\rangle) \cdot X, \qquad \nabla^2 F(x,\xi) = \ell''(-Y - \langle X, x\rangle) \cdot X \otimes X. \]
In particular,
\[ \|z\|^2 = \mathbb E\big[\ell''(-Y - \langle X, x^*\rangle) \cdot \langle X, z\rangle^2\big]. \]
Ignoring the $\ell''$-term inside the expectation for the moment, this would imply that $\|\cdot\| = \|\cdot\|_2$. In general, when accounting for the $\ell''$-term, a minor integrability assumption will be used to guarantee that $\|\cdot\|$ and $\|\cdot\|_2$ are equivalent norms.

If $\mathcal X$ is not open, we mean by "continuously differentiable" that there is a continuous function $\nabla F(\cdot,\gamma)$ such that $F(y,\gamma) - F(x,\gamma) = \int_0^1 \langle \nabla F(x + t(y - x), \gamma),\, y - x\rangle\, dt$.
A similar notion holds for $\nabla^2 F$ in the definition of "twice continuously differentiable".

In Section 2.2 we argued that unless the Hessian has a specific form, the empirical Hessian is singular whenever $N < d$. However, without further assumptions, believing that a sample of cardinality $d$ is enough to guarantee a non-degenerate empirical Hessian (say with constant probability) is too optimistic, certainly in the general setting we are interested in here. Degeneracy can happen even in dimension $d = 1$: if $\nabla^2 F(x^*,\xi)$ takes the value $\frac1\varepsilon$ with probability $\varepsilon > 0$ and $0$ otherwise, then $\|\cdot\|$ is simply the absolute value, regardless of the choice of $\varepsilon$. However, with probability $(1-\varepsilon)^N$, all observations in a sample of size $N$ are zero, and for small $\varepsilon$, e.g. $\varepsilon = \frac{1}{N^2}$, that probability converges to $1$ as $N \to \infty$.

As it happens, the following modest integrability assumption on the Hessian prevents such behavior.

Assumption 2.5 (Integrability of the Hessian). There is a constant $L$ such that $\mathbb E[\langle \nabla^2 F(x^*,\xi) z, z\rangle^2] \le L^2$ for every $z \in \mathbb R^d$ with $\|z\| = 1$.

Another way of formulating Assumption 2.5 is in the sense of norm-equivalence: for every $z \in \mathbb R^d$, we have that
\[ \mathbb E\big[\langle \nabla^2 F(x^*,\xi) z, z\rangle^2\big] \le L^2 \cdot \big(\mathbb E\big[\langle \nabla^2 F(x^*,\xi) z, z\rangle\big]\big)^2, \tag{2.8} \]
i.e. the $L_2$-norm and the $L_1$-norm of the random variables $\langle \nabla^2 F(x^*,\xi) z, z\rangle$ are equivalent. Therefore the constant $L$ pertains to the worst direction $z \in \mathbb R^d$ (and not e.g. the average over different directions). As such, $L$ typically does not depend on the dimension $d$.

Under Assumption 2.5 one can prove a lower bound on the smallest singular value of the empirical Hessian whenever $N \gtrsim d\log(d)$.

To check Assumption 2.5 in the portfolio optimization example, note that
\[ \mathbb E\big[\langle \nabla^2 F(x^*,\xi) z, z\rangle^2\big] = \mathbb E\big[\ell''(-Y - \langle X, x^*\rangle)^2 \cdot \langle X, z\rangle^4\big]. \]
Thus, Assumption 2.5 is a simple consequence of Hölder's inequality and a minor integrability condition.
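The $N \gtrsim d\log(d)$ phenomenon can be probed numerically. Below is a small experiment of our own (the parameters are arbitrary): Student-$t$ entries stand in for a heavy-tailed ensemble whose fourth moments are finite, so a norm equivalence in the spirit of (2.8) is plausible, and the smallest eigenvalue of the empirical Hessian stays bounded away from zero at the $d\log(d)$ sample-size scale:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 25
N = int(5 * d * np.log(d))   # sample size at the d*log(d) scale

# Heavy-tailed rank-one summands A_i = X_i X_i^T with Student-t(5) entries
X = rng.standard_t(df=5, size=(N, d))
H_emp = (X.T @ X) / N                          # empirical Hessian
lam_min = float(np.linalg.eigvalsh(H_emp)[0])  # smallest eigenvalue (ascending)
```

For this seed the empirical Hessian is well-conditioned even though the entries have polynomial tails; shrinking $N$ below $d$ would make `lam_min` exactly zero, as in the rank discussion of Section 2.2.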
Remark 2.6.
Let us stress that Assumption 2.5 is just a tractable way of ensuring that our argument works; it could be replaced by the more general assumption that $\nabla^2 F(x^*,\xi)$ satisfies a so-called stable lower bound, see Remark 7.2. The stable lower bound and its role in obtaining lower bounds on the smallest singular values of rather general random matrix ensembles is described in Theorem 4.4 and in Corollary 4.5. Note also that the reverse inequality to (2.8) is trivially true with constant $1$, by Hölder's inequality.

There is another reason why the minimal sample size should be at least a (large constant) multiple of $d$, namely, because $F$ need not be quadratic. In the example in Section 2.2, $F$ was a quadratic function, and as a result the Hessian did not depend on the action $x$. In general, when invoking a second order Taylor expansion, the Hessian does depend on some mid-point $x^* + t(x - x^*)$. At the same time, Assumption 2.5 only takes into account the Hessian at the optimizer; therefore, some
continuity assumption is needed if one is to control the deviation from quadratic, which is governed by
\[ \mathcal E_H(x) := \sup_{t \in [0,1]} \Big| \Big\langle \Big(\nabla^2 F(x^* + t(x - x^*), \xi) - \nabla^2 F(x^*,\xi)\Big)(x - x^*),\ x - x^* \Big\rangle \Big| \]
for $x \in \mathcal X$. Note that $\mathcal E_H(x)$ is likely to be of order $\|x - x^*\|^3$ under a suitable Lipschitz condition on the Hessian. Assumption 2.7 is there to ensure that $\mathcal E_H(x)$ is sufficiently small.

Assumption 2.7 (Continuity of the Hessian). There exist a radius $r_0 \in (0,1]$, an exponent $\alpha \in (0,$
$1]$ and a measurable function $K \colon \Xi \to [0,\infty)$ such that $\mathbb E[K(\xi)] < \infty$, and the following hold.
(a) For all $x, y \in \mathcal X$ with $\|x - x^*\|, \|y - x^*\| \le r_0$, we have that
\[ \big\|\nabla^2 F(x,\xi) - \nabla^2 F(y,\xi)\big\|_{\mathrm{op}} \le \|x - y\|^\alpha \cdot K(\xi). \]
(b) For some given constant $c_0$ (which will be specified in Theorem 2.9 and depends only on the parameter $L$ of Assumption 2.5) and for all $x \in \mathcal X$ with $\|x - x^*\| \le r_0$, we have that
\[ \mathbb P\Big[\mathcal E_H(x) \le \|x - x^*\|^2\Big] \ge 1 - c_0. \]
Assumption 2.7 implies that, setting
\[ N_{H,\mathcal E} := \frac{d}{\alpha} \log\Big(r_0^\alpha\, \mathbb E[K(\xi)] + 2\Big), \tag{2.9} \]
whenever $N \gtrsim N_{H,\mathcal E}$, the error caused by replacing $F$ by its quadratic approximation does not distort the outcome by too much.

Remark 2.8.
At a first glance it might seem as if part (b) of Assumption 2.7 follows from part (a). It is true that $\mathcal E_H(x) \le \|x - x^*\|^{2+\alpha} \cdot K(\xi)$, but there is one important difference: in typical situations, $\mathcal E_H$ does not depend on the dimension $d$, while $K$ does (we shall see this phenomenon in the portfolio optimization problem). In particular, estimating $\mathcal E_H$ using $K(\xi)$ will unnecessarily force the threshold radius $r_0$ to depend on the dimension, which is something we wish to avoid. On the other hand, the dimension-dependent term $\mathbb E[K(\xi)]$ only appears through a logarithmic factor in the minimal sample size.

Returning to the portfolio optimization problem, let us, for the sake of a simplified exposition, pretend that $\ell''$ is $1$-Lipschitz continuous, and recall that $\|\cdot\|$ and $\|\cdot\|_2$ are equivalent norms. Then
\[ \big|\big\langle (\nabla^2 F(x,\xi) - \nabla^2 F(y,\xi)) z, z\big\rangle\big| = \big|\ell''(-Y - \langle X, x\rangle) - \ell''(-Y - \langle X, y\rangle)\big| \cdot \langle X, z\rangle^2 \le |\langle X, x - y\rangle| \cdot \langle X, z\rangle^2. \tag{2.10} \]
Thus $\mathcal E_H(x) \le |\langle X, x - x^*\rangle|^3$ and, under some mild integrability assumption, the latter term behaves like $\|x - x^*\|^3$ on average. Markov's inequality and the fact that $\|\cdot\|$ and $\|\cdot\|_2$ are equivalent norms imply that
\[ \mathbb P\Big[\mathcal E_H(x) > \|x - x^*\|^2\Big] \le \frac{\mathbb E[\mathcal E_H(x)]}{\|x - x^*\|^2} \le c\, \frac{\mathbb E[\mathcal E_H(x)]}{\|x - x^*\|_2^2}, \]
which is of order $\|x - x^*\|$. On the other hand, (2.10) implies that $\|\nabla^2 F(x,\xi) - \nabla^2 F(y,\xi)\|_{\mathrm{op}} \le K(\xi) \cdot \|x - y\|$ for $K(\xi) := \|X\|_2^3$, which typically scales like $d^{3/2}$.

With all the definitions set in place, let us turn to the formulation of our main result. Recall that $N_G(r)$, $\sigma^2$, $c_H$, and $N_{H,\mathcal E}$ were defined in (2.5), (2.6), (2.7), and (2.9) respectively.

Theorem 2.9 (Estimation and prediction error). There are constants $c_0, c_1, c_2$ depending only on $L$ such that the following holds. Assume that Assumptions 2.3, 2.5, 2.7 hold, let $r \in (0, r_0)$ and consider
\[ N \ge c_1 \max\big\{ d\log(2d),\ N_{H,\mathcal E},\ N_G(r) \big\}. \]
Then there is a procedure $\widehat x_N^*$ such that, with probability at least
\[ 1 - 2\exp\Big(-c_2 N \min\Big\{1, \frac{r^2}{\sigma^2}\Big\}\Big), \]
we have that
\[ \|\widehat x_N^* - x^*\| \le r \quad \text{and} \quad f(\widehat x_N^*) \le f(x^*) + 2c_H \cdot r^2. \]
The procedure is described in Section 2.4.
Theorem 2.9 implies that our procedure recovers (up to multiplicative constants) the optimal asymptotic rates for the sample average approximation [35, Chapter 5] in a non-asymptotic fashion and when the random variables involved can be heavy-tailed.

To complete the heuristics pertaining to the portfolio optimization problem, one has to compute $N_G(r)$, $\sigma^2$ and $c_H$. Again, for the sake of simplicity we shall ignore $\ell'$ and $\ell''$ at every appearance (keeping in mind that some minor integrability assumptions guarantee that this simplification does not shift the results by too much from the truth). In this case, $\mathrm{Cov}[\nabla F(x^*,\xi)] = \mathrm{Id}$ and $\nabla^2 f(x^*) = \mathrm{Id}$, and in particular
\[ \sigma^2 = \lambda_{\max}(\mathrm{Id}) = 1 \quad \text{and} \quad N_G(r) = \frac{\mathrm{trace}(\mathrm{Id})}{r^2} = \frac{d}{r^2}. \]
Moreover, ignoring $\ell''$ also clearly implies that $c_H = 1$.

In Corollary 3.7 we specify all the assumptions that are needed to make this heuristic argument hold; but for now let us state a particularly simple case which is of interest in its own right: the exponential portfolio optimization in the Bachelier model.
Recall that in this model U ( · ) = − exp( − · ) is the exponential utility functionand X is zero-mean Gaussian. We assume that the covariance matrix of X is non-degenerate and that both Y and U (2 Y ) are integrable. Under these assumptions,we shall see that there exists a unique optimal action x ∗ ∈ X . Set¯ σ := E (cid:2) exp( − Y − h X, x ∗ i ) (cid:3) , and assume that Assumption 2.5 holds true. While the latter assumption can beverified via some integrability conditions (we shall see this in context of the generalportfolio optimization problem in Lemma 6.2), the obtained bounds may fail tobe sharp. To showcase that Assumption 2.5 can sometimes be easily verified byother means, consider for a moment Y = h X, ˜ x i + W for some ˜ x ∈ X and W thatis independent of X . Then x ∗ = ˜ x and Gaussian norm equivalence (i.e. there isan absolute constant C such that E [ h X, z i ] ≤ C E [ h X, z i ] for every z ∈ R d )together with independence of X and W readily implies that Assumption 2.5 issatisfied with L = C E [exp( W ) ] E [exp( W )] . Corollary 2.10 (Exponential portfolio optimization) . Under the above assump-tions, there are constants c , c , c , c > depending only on L and E [ | Y + h X, x ∗ i| ] such that the following holds. For r ∈ (0 , min { , c ¯ σ } ) and N ≥ c max n d ¯ σ r , d o , there is a procedure b x ∗ N such that, with probability at least − (cid:16) − c N min n , r ¯ σ o(cid:17) , we have that k b x ∗ N − x ∗ k ≤ r,u ( b x ∗ N ) ≥ u ( x ∗ ) − c r . Remark 2.11.
The origin of the (somewhat unfamiliar) term $d^{3/2}$ appearing in the minimal sample size in Corollary 2.10 is $N_{H,E}$. However, as our previous heuristics indicate, that term should be of order $d\log(d)$. As it happens, the source of this difference is the exponential utility function. Indeed, the heuristic presentation was based on a simplifying assumption: that $\ell'''$ was bounded by 1. That allowed us to conclude that $\mathbb{E}[K(\xi)]$ was polynomial in $d$, resulting in $N_{H,E}$ of order $d\log(d)$. Here, however, $\ell'''$ is the exponential function; the term $\mathbb{E}[K(\xi)]$ is actually of order $\exp(\sqrt d)$, resulting in $N_{H,E}$ of order $d\log(\exp(\sqrt d)) = d^{3/2}$.

It should be stressed that although the term $d^{3/2}$ in the minimal sample size of Corollary 2.10 could be off by a factor of $\sqrt d$, Corollary 2.10 is, to the best of our knowledge, the first non-asymptotic estimate for the exponential portfolio optimization problem.

In some of the examples we will present later, the Hessian additionally satisfies a deterministic lower bound:

Assumption 2.12 (Deterministic lower bound of the Hessian). There are $r_0 \in (0, 1]$ and $\varepsilon > 0$ such that
$$\nabla^2 F(x,\xi) \succeq \varepsilon\,\nabla^2 f(x^*)$$
for all $x \in \mathcal{X}$ with $\|x - x^*\| \le r_0$. Moreover, $\nabla f(x) = \mathbb{E}[\nabla F(x,\xi)]$ and $\nabla^2 f(x) = \mathbb{E}[\nabla^2 F(x,\xi)]$ for all $x \in \mathcal{X}$ with $\|x - x^*\| < r_0$.

When Assumption 2.12 holds true, the two Assumptions 2.5 and 2.7 that were imposed to control the smallest singular value of the Hessian are not needed, and Theorem 2.9 can be simplified as follows.

Theorem 2.13 (Estimation and prediction error, simplified). There are constants $c_1, c_2$ depending only on $\varepsilon$ such that the following holds. Assume that Assumptions 2.3 and 2.12 hold, let $r \in (0, r_0)$, and consider $N \ge c_1 N_G(r)$. Then there is a procedure $\hat x^*_N$ such that, with probability at least $1 - 2\exp\big(-c_2 N\min\{1,\, r^2/\sigma^2\}\big)$, we have that
$$\|\hat x^*_N - x^*\| \le r \quad\text{and}\quad f(\hat x^*_N) \le f(x^*) + 2c_H\cdot r^2.$$
Before presenting the procedure $\hat x^*_N$ in detail, let us compare the outcome of Theorem 2.9 with the current state of the art. Our focus is on recent non-asymptotic results by Oliveira and Thompson [28, 29]; a general literature review will be presented in Section 2.5.

For a clearer comparison, let us restate Theorem 2.9 (pertaining to the prediction error) as follows: given some fixed confidence level $\delta \in (0,1)$, $r > 0$, and $N \ge c_1\max\{d\log(2d),\, N_{H,E},\, N_G(r)\}$, the prediction error is bounded by $r^2$ with probability at least $1-\delta$ whenever the sample size $N$ satisfies
$$N \gtrsim \frac{\sigma^2}{r^2}\log\Big(\frac{1}{\delta}\Big). \tag{2.11}$$
In contrast, the main result of Oliveira and Thompson [29, Theorem 3] regarding general convex stochastic optimization problems is the following. There is a random variable $\hat\Sigma_N$ (which we shall not define here) that satisfies
$$\hat\Sigma_N \gtrsim d\cdot\Big(\mathbb{E}\big[\|\nabla F(x^*,\xi)\|^2\big] + \frac1N\sum_{i=1}^N \|\nabla F(x^*,\xi_i)\|^2\Big),$$
and in order to guarantee that the prediction error is bounded by $r^2$ with probability at least $1-\delta$, one should have
$$\mathbb{P}\Big[N \ge \frac{\hat\Sigma_N}{r^2}\log\Big(\frac{1}{\delta}\Big)\Big] \ge 1-\delta. \tag{2.12}$$
There are two major differences between these two results. The first one, which should not come as a great surprise following the discussion in Section 2.1, is that
$\hat\Sigma_N$ does not concentrate around its mean with high probability in heavy-tailed situations. Thus, (2.12) forces $N$ to grow like $(1/\delta)^p$ for some power $p$ that depends on the integrability of $\|\nabla F(x^*,\xi)\|$. For small $\delta$ (i.e. if one is interested in high confidence) this is in stark contrast to the order $\log(1/\delta)$ in (2.11). The second difference is the dependence of $\hat\Sigma_N$ on the dimension $d$. Indeed, even if we neglect the integrability issues and replace $\hat\Sigma_N$ by its mean, what we end up with is not the correct variance parameter (which appears in the central limit theorem). For instance, if $\|\cdot\|$ is the Euclidean norm, then
$$\hat\Sigma_N \gtrsim d\cdot\mathrm{trace}\big(\mathrm{Cov}[\nabla F(x^*,\xi)]\big),$$
which is at least $d$ times (possibly even $d^2$ times) larger than the true variance parameter $\sigma^2 = \lambda_{\max}(\mathrm{Cov}[\nabla F(x^*,\xi)])$. In high-dimensional problems such as the portfolio optimization problem, where $d$ is the number of stocks and is likely to be a large number, this difference is significant. As a result, even for moderate confidence levels $\delta$, the required sample size jumps up by a factor of $d$ (or even $d^2$) from what we would expect.

2.4. The procedure.
As we already explained previously, to have any hope of optimal performance, the procedure $\hat x^*_N$ cannot be the sample average approximation. Instead, $\hat x^*_N$ will be determined through median-of-means tournaments conducted between all $x \in \mathcal{X}$, following the method introduced in [20]. The first phase of the procedure returns a set of candidates, each with a small estimation error.

(Step 1) For some (small) tuning parameter $\theta$ to be specified later, set
$$n := \theta N\min\Big\{1,\, \frac{r^2}{\sigma^2}\Big\} \quad\text{and}\quad m := \frac{N}{n}.$$
Without loss of generality assume that $m$ and $n$ are integers. Partition
$$\{1,\dots,N\} = \bigcup_{j=1}^n I_j$$
into $n$ disjoint blocks $I_j$, each of equal cardinality $|I_j| = m$.

(Step 2) For every $x \in \mathcal{X}$, compute the empirical mean of $F(x,\xi)$ on the $j$-th block,
$$\hat f_{I_j}(x) := \frac1m\sum_{i\in I_j} F(x,\xi_i).$$
We then say that $x \in \mathcal{X}$ defeats $y \in \mathcal{X}$ on the $j$-th block if $\hat f_{I_j}(x) < \hat f_{I_j}(y)$, and that $x$ wins the match against $y$ if $\hat f_{I_j}(x) < \hat f_{I_j}(y)$ for more than $\frac n2$ indices $j$, i.e. if $x$ defeats $y$ on a majority of the blocks. Denote by $\tilde{\mathcal{X}}^*_N \subseteq \mathcal{X}$ the set of champions, i.e.
$$\tilde{\mathcal{X}}^*_N := \big\{x \in \mathcal{X} : x \text{ wins the match against every } y \in \mathcal{X} \text{ that satisfies } \|x - y\| \ge r\big\}.$$
The following proposition shows that elements in $\tilde{\mathcal{X}}^*_N$ satisfy the part of Theorem 2.9 pertaining to the estimation error.

Proposition 2.14 (Estimation error). In the setting of Theorem 2.9, with probability at least $1 - 2\exp(-c_0 n)$, we have that $x^* \in \tilde{\mathcal{X}}^*_N$ and every $x \in \tilde{\mathcal{X}}^*_N$ satisfies $\|x - x^*\| \le r$.

As we already explained, if $\mathcal{X} = \mathbb{R}^d$ or, more generally, if $x^*$ lies in the interior of $\mathcal{X}$, the first order condition for optimality immediately implies that $f(x) \le f(x^*) + c_H r^2$ for every $x \in \mathcal{X}$ with $\|x - x^*\| \le r$. In particular, in that case, it follows from Proposition 2.14 that any choice $\hat x^*_N \in \tilde{\mathcal{X}}^*_N$ satisfies the assertion of Theorem 2.9. However, in general, if one wishes to find $x \in \tilde{\mathcal{X}}^*_N$ with a small prediction error, one requires an additional procedure, which we describe now.

To simplify notation, assume without loss of generality that the set $\tilde{\mathcal{X}}^*_N$ has already been determined, and that we can run an additional second procedure, for which we are given a new (independent) sample $(F(\cdot,\xi_i))_{i=N+1}^{2N}$. Again partition {N+1, ...
, 2N} into $n$ disjoint blocks $I'_j$ of cardinality $|I'_j| = m$, with the same $n$ and $m$, and denote by $\hat f_{I'_j}(\cdot)$ the empirical mean on the block $I'_j$.

(Step 3) We say that $x \in \mathcal{X}$ wins its home match against $y \in \mathcal{X}$ if $\hat f_{I'_j}(x) \le \hat f_{I'_j}(y) + c_H r^2$ for more than $\frac n2$ indices $j$. Denote by $\hat{\mathcal{X}}^*_N$ the winners, i.e.,
$$\hat{\mathcal{X}}^*_N := \big\{x \in \tilde{\mathcal{X}}^*_N : x \text{ wins its home match against every } y \in \tilde{\mathcal{X}}^*_N\big\}.$$
In light of Proposition 2.14, the crucial advantage here is that, with high probability, matches are only carried out between competitors that are close to $x^*$. The following proposition shows that any $\hat x^*_N \in \hat{\mathcal{X}}^*_N$ satisfies the requirements in Theorem 2.9:

Proposition 2.15 (Prediction error). In the setting of Theorem 2.9, with probability at least $1 - 2\exp(-c_0 n)$, we have that $x^* \in \hat{\mathcal{X}}^*_N$ and every $x \in \hat{\mathcal{X}}^*_N$ satisfies $f(x) < f(x^*) + 2c_H r^2$.

2.5. Related literature.
Stochastic optimization and the statistical properties of the sample average approximation method have been studied intensively for several decades; it is therefore impossible to mention every single contribution. Instead, we refer to Kim, Pasupathy, and Henderson [12], Shapiro, Dentcheva, and Ruszczyński [35], Kleywegt, Shapiro, and Homem-de-Mello [13], Shapiro [34], or Homem-de-Mello and Bayraksan [10]. As we already mentioned, the statistical analysis in these works is always of an asymptotic nature. Let us also refer to Banholzer, Fliege, and Werner [2] for a recent study of asymptotic almost sure convergence rates and an up-to-date review of the asymptotic convergence analysis, and to Bertsimas, Gupta, and Kallus [4], who raise concerns that most statistical results for the sample average approximation are of asymptotic nature.

Shifting to a non-asymptotic analysis of the performance of the SAA, the available literature gets considerably less diverse. An early reference here is Pflug [30, 31], who relies on Talagrand's deviation inequality for the supremum of Gaussian processes (which is obviously suitable only in very special, light-tailed scenarios). Similar methods were adapted by Römisch [32] and Vogel [41] and (for the error of the value, i.e. $\min_{x\in\mathcal{X}} f(x)$) by Guigues, Juditsky, and Nemirovski in [9]. The two recent papers by Oliveira and Thompson [28, 29], which have been mentioned at the end of Section 2, contain the results that are closest to ours.

The portfolio optimization problem is often listed as a prime example of a stochastic optimization problem, see e.g. [35, Section 1.4]. As such, and with the exception of perhaps more applied studies like [8, 42, 44], existing estimates on the problem have been derived as applications of general results, much like in our work.

There are other (optimization) problems in mathematical finance that have been analyzed via sampling, such as the estimation of risk measures.
We refer the reader to Weber [43] (which relies on large deviation methods) for an early reference and to [3] for a more recent review. We believe that our methods can be applied to these types of problems as well, and defer this to future work.

We should mention that the statistical analysis of the SAA has close ties to the statistical analysis of problems in high-dimensional statistics, such as linear regression. But despite obvious similarities, the two fields have grown apart. The current focus in the statistical learning literature is on non-asymptotic statements that hold under increasingly relaxed assumptions. Among the outcomes of this approach were [20, 23, 24]: alternatives to the sample average approximation in statistical learning problems that recover the Gaussian rates in completely heavy-tailed situations. We believe that pursuing the same direction in the context of stochastic optimization would lead to intriguing questions. Indeed, there are many variants of (convex) stochastic optimization problems that are not covered in this article. Among these questions is, for example, chance constrained stochastic optimization, where the optimization takes place only over actions $x \in \mathcal{X}$ satisfying constraints of the form $\mathbb{E}[G(x,\xi)] \le$
0. We are confident that our methods can beadapted to these settings as well, though we shall leave this for future work.3.
Applications
Before continuing with the portfolio optimization problem (in its general form), we present three applications of Theorem 2.9.

3.1.
Multivariate mean estimation.
As a first application of Theorem 2.9, let us return to the problem of mean estimation, this time for a square integrable $d$-dimensional random vector $\xi$. Mean estimation is a stochastic optimization problem because
$$\mathbb{E}[\xi] = \mathrm{argmin}\big\{\mathbb{E}[\|x - \xi\|^2] : x \in \mathbb{R}^d\big\}.$$
This suggests that one should set
$$F(x,\xi) := \tfrac12\|x - \xi\|^2 \quad\text{for } x \in \mathcal{X} := \mathbb{R}^d,$$
so that $x^* = \mathbb{E}[\xi]$. (The factor $\tfrac12$ has only a normalizing purpose, to make the computations below clearer.) As an immediate consequence of Theorem 2.13 we obtain the following.

Corollary 3.1 (Multivariate mean estimation). There are absolute constants $C_1, C_2$ such that the following holds. Let $r > 0$ and let $N \ge C_1\,\mathrm{trace}(\mathrm{Cov}[\xi])/r^2$. Then there is a procedure $\hat x^*_N$ such that, with probability at least
$$1 - 2\exp\big(-C_2 N\min\{1,\, r^2/\lambda_{\max}(\mathrm{Cov}[\xi])\}\big),$$
we have that $\big\|\hat x^*_N - \mathbb{E}[\xi]\big\| \le r$.

Let us once again mention that the question of finding an estimator of the mean of a heavy-tailed random vector that exhibits Gaussian rates remained open until very recently: it was settled in [19, Theorem 1]; see also [17] for a recent survey. Corollary 3.1 recovers [19, Theorem 1].
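In dimension one, the median-of-means idea underlying the procedure can be illustrated in a few lines. The following sketch (pure Python; the names and the toy data are ours, not the paper's) shows its robustness to a single gross corruption, in contrast to the sample average:

```python
import statistics

def median_of_means(data, n_blocks):
    # partition the sample into n_blocks disjoint blocks of equal size
    # and return the median of the block-wise empirical means
    m = len(data) // n_blocks
    means = [statistics.fmean(data[j * m:(j + 1) * m]) for j in range(n_blocks)]
    return statistics.median(means)

data = [1.0] * 99 + [1000.0]     # one corrupted observation
mom = median_of_means(data, 10)  # the outlier ruins only one block mean
sam = statistics.fmean(data)     # the sample average is dragged far away
```

Here the corrupted observation inflates exactly one of the ten block means, and the median ignores it; the empirical mean, by contrast, is shifted by an order of magnitude.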
Proof of Corollary 3.1.
For every $x \in \mathcal{X}$, we have that
$$\nabla F(x,\xi) = x - \xi, \qquad \nabla^2 F(x,\xi) = \mathrm{Id}.$$
In particular $\|\cdot\| = \|\cdot\|_2$ and Assumption 2.3 is clearly satisfied. Actually, as the Hessian is deterministic and independent of the action $x$, we are in the setting of Theorem 2.13 and it remains to compute the parameters $N_G(r)$ and $\sigma^2$. To that end, it suffices to note that $\nabla^2 f(x^*)^{-1} = \mathrm{Id}$ and $\mathrm{Cov}[\nabla F(x^*,\xi)] = \mathrm{Cov}[\xi]$; hence
$$N_G(r) = \frac{1}{r^2}\,\mathrm{trace}(\mathrm{Cov}[\xi]), \qquad \sigma^2 = \lambda_{\max}(\mathrm{Cov}[\xi]).$$
The proof therefore follows from Theorem 2.13. $\square$
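For a concrete illustration of Steps 1 to 3 of the tournament in this mean-estimation example, here is a minimal one-dimensional sketch over a finite candidate grid (all names, the scalar setting, and the grid are our simplifying assumptions; the paper's procedure runs over the whole set $\mathcal{X}$, and Step 3 would use a fresh sample):

```python
import random

def block_means(F, x, xi, blocks):
    # empirical mean of F(x, .) on each block I_j
    return [sum(F(x, xi[i]) for i in I) / len(I) for I in blocks]

def wins(F, x, y, xi, blocks, slack=0.0):
    # x wins if its block mean beats y's on a majority of the blocks
    fx = block_means(F, x, xi, blocks)
    fy = block_means(F, y, xi, blocks)
    return sum(a < b + slack for a, b in zip(fx, fy)) > len(blocks) / 2

def mom_tournament(F, cands, xi, n_blocks, r, c_H=1.0):
    m = len(xi) // n_blocks
    blocks = [range(j * m, (j + 1) * m) for j in range(n_blocks)]
    # Step 2: champions beat every competitor at distance at least r
    champs = [x for x in cands
              if all(wins(F, x, y, xi, blocks)
                     for y in cands if abs(x - y) >= r)]
    # Step 3 (run here on the same sample for brevity; the paper draws a
    # fresh sample): keep the champions that win every home match
    return [x for x in champs
            if all(wins(F, x, y, xi, blocks, slack=c_H * r * r)
                   for y in champs)]

random.seed(1)
sample = [random.gauss(0.0, 1.0) for _ in range(600)]
grid = [i / 5 for i in range(-5, 6)]   # candidates -1.0, -0.8, ..., 1.0
winners = mom_tournament(lambda x, z: 0.5 * (x - z) ** 2,
                         grid, sample, n_blocks=10, r=0.4)
```

With the quadratic loss $F(x,\xi) = \tfrac12(x-\xi)^2$, the surviving winners cluster around the true mean 0, within the target radius $r$.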
Linear regression.
Linear regression is one of the fundamental problems studied in statistics. Given a one-dimensional random variable $Y$ and a $d$-dimensional random vector $X$, the task is to find the best possible forecast of $Y$ based on linear combinations of $X$. To be more precise, one seeks the minimizer of $x \mapsto \mathbb{E}[(\langle X, x\rangle - Y)^2]$ over $x \in \mathbb{R}^d$ (or a subset thereof). This problem clearly falls within the scope of this article by considering
$$F(x,\xi) := \tfrac12(\langle X, x\rangle - Y)^2 \quad\text{where}\quad \xi := (X, Y)$$
with
$\mathcal{X} \subseteq \mathbb{R}^d$. (The purpose of the factor $\tfrac12$ is again only convenience.) In order to lighten notation, we shall impose a standard assumption on $X$: that it is centred and isotropic. The latter means that its covariance matrix is the identity.

Assumption 3.2.
The set
$\mathcal{X} \subseteq \mathbb{R}^d$ is closed and convex, $X$ is centred and isotropic, and there is a constant $L_X$ such that
$$\mathbb{E}[\langle X, x\rangle^4] \le L_X^4\,\mathbb{E}[\langle X, x\rangle^2]^2 \tag{3.1}$$
for every $x \in \mathbb{R}^d$. Further, $\bar\sigma^2 := \mathbb{E}[(\langle X, x^*\rangle - Y)^4]^{1/2}$ is finite.

The assumption (3.1), which means that the $L_4$ and $L_2$ norms of linear forms of $X$ are equivalent, is a typical assumption made in high-dimensional statistics, and it is not too restrictive.
Corollary 3.3 (Linear Regression). If Assumption 3.2 is satisfied, there are constants $c_1$ and $c_2$ depending only on $L_X$ such that the following holds. Let $r > 0$ and
$$N \ge c_1\max\Big\{d\log(2d),\ \frac{d\cdot\bar\sigma^2}{r^2}\Big\}.$$
Then there is a procedure $\hat x^*_N$ such that, with probability at least $1 - 2\exp\big(-c_2 N\min\{1,\, r^2/\bar\sigma^2\}\big)$, we have that
$$\|\hat x^*_N - x^*\| \le r, \qquad \mathbb{E}_{X,Y}[(\langle X, \hat x^*_N\rangle - Y)^2] \le \mathbb{E}[(\langle X, x^*\rangle - Y)^2] + 2r^2.$$
Here, $\mathbb{E}_{X,Y}[\cdot]$ denotes the expectation taken only over $X$ and $Y$ (and, of course, not over the sample $(X_i, Y_i)_{i=1}^N$ used for the computation of $\hat x^*_N$).

Compared to the benchmark result on linear regression in a heavy-tailed scenario [20], Corollary 3.3 has an additional $\log(2d)$ term in the estimate on the sample size. This term is merely an artifact of the generality of our main result. Its origin lies in the matrix-Bernstein inequality, which we use in the process of bounding the smallest singular value of a general random matrix ensemble (namely $\nabla^2 F(x^*,\xi)$); that extra factor is not needed in this example and can easily be removed.

Proof of Corollary 3.3.
For every $x \in \mathcal{X}$, we have that
$$\nabla F(x,\xi) = (\langle X, x\rangle - Y)X, \qquad \nabla^2 F(x,\xi) = X\otimes X.$$
As $X$ is isotropic, this implies $\nabla^2 f(x^*) = \mathrm{Id}$ and in particular $\|\cdot\| = \|\cdot\|_2$. Also, as $(\langle X, x^*\rangle - Y)$ and $X$ are both in $L_4$ by assumption, the Cauchy-Schwarz inequality shows that $\nabla F(x^*,\xi)$ is square integrable. In particular, Assumption 2.3 is satisfied. Employing the norm equivalence (3.1) of $X$, we obtain
$$\mathbb{E}[\langle\nabla^2 F(x^*,\xi)z, z\rangle^2] = \mathbb{E}[\langle X, z\rangle^4] \le L_X^4\|z\|_2^4$$
for every $z \in \mathbb{R}^d$; thus Assumption 2.5 is satisfied with $L = L_X$. Finally, as the Hessian is independent of the action $x$, Assumption 2.7 is clearly satisfied with $\alpha = 1$ and $K \equiv 0$; in particular, $N_{H,E} \le d$.

It remains to compute the parameters $N_G(r)$, $\sigma^2$, and $c_H$. Using once again that the Hessian is independent of the action $x$, it is evident that $c_H = 1$. Turning to $\sigma^2$ and $N_G(r)$, recall that $\nabla^2 f(x^*) = \mathrm{Id}$, and let us estimate the largest eigenvalue and the trace of $\mathrm{Cov}[\nabla F(x^*,\xi)]$. For every $z \in \mathbb{R}^d$, the Cauchy-Schwarz inequality together with (3.1) imply
$$\langle\mathrm{Cov}[\nabla F(x^*,\xi)]z, z\rangle = \mathrm{Var}[(\langle X, x^*\rangle - Y)\langle X, z\rangle] \le \mathbb{E}\big[\big((\langle X, x^*\rangle - Y)\langle X, z\rangle\big)^2\big] \le \mathbb{E}[(\langle X, x^*\rangle - Y)^4]^{1/2}\,\mathbb{E}[\langle X, z\rangle^4]^{1/2} \le \bar\sigma^2 L_X^2\|z\|_2^2. \tag{3.2}$$
This clearly implies $\sigma^2 \le L_X^2\bar\sigma^2$ and, taking the standard Euclidean basis $z = e_i$ in (3.2), we conclude $\mathrm{Cov}[\nabla F(x^*,\xi)]_{ii} \le \bar\sigma^2 L_X^2$; hence
$$N_G(r) = \frac{1}{r^2}\sum_{i=1}^d \mathrm{Cov}[\nabla F(x^*,\xi)]_{ii} \le \frac{\bar\sigma^2 L_X^2\, d}{r^2}.$$
The proof now follows from Theorem 2.9. $\square$
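To make the roles of the two parameters concrete, here is a small numeric sketch of how $\sigma^2$ and $N_G(r)$ are read off from a covariance matrix (the diagonal covariance and its numbers are a hypothetical example of ours):

```python
# hypothetical diagonal of Cov[grad F(x*, xi)] in the isotropic setting;
# for a diagonal matrix the eigenvalues are simply the diagonal entries
cov_diag = [4.0, 1.0, 0.25, 0.25]
r = 0.5

sigma2 = max(cov_diag)        # sigma^2 = lambda_max(Cov[grad F(x*, xi)])
N_G = sum(cov_diag) / r ** 2  # N_G(r) = trace(Cov[grad F(x*, xi)]) / r^2
```

Note that $N_G(r)$ sees the full trace (a sum of $d$ terms), while the variance parameter $\sigma^2$ only sees the largest eigenvalue; this is exactly the gap between the correct variance parameter and trace-type quantities discussed in the comparison with [29].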
Ridge regression.
A popular modification of linear regression is ridge regression (also known as weight decay or $\ell_2$-regularized regression in the machine learning community). The idea is that one penalizes a large Euclidean norm of $x$ by setting
$$F(x,\xi) := (\langle X, x\rangle - Y)^2 + \|x\|_2^2 \quad\text{where}\quad \xi := (X, Y)$$
and $x \in \mathcal{X} \subseteq \mathbb{R}^d$. The aim of this sort of penalization is to counteract over-fitting.

In contrast to the estimate we obtained for linear regression, the factor $d\log(2d)$ in the minimal sample size is not needed here:

Corollary 3.4.
If Assumption 3.2 is satisfied, there are absolute constants $C_1$ and $C_2$ such that the following holds. Let $r > 0$ and set
$$N \ge C_1\,\frac{d\cdot\bar\sigma^2}{r^2}.$$
Then there is a procedure $\hat x^*_N$ that, with probability at least $1 - 2\exp\big(-C_2 N\min\{1,\, r^2/\bar\sigma^2\}\big)$, satisfies
$$\|\hat x^*_N - x^*\| \le r, \qquad \mathbb{E}_{X,Y}[(\langle X, \hat x^*_N\rangle - Y)^2] + \|\hat x^*_N\|_2^2 \le \mathbb{E}[(\langle X, x^*\rangle - Y)^2] + \|x^*\|_2^2 + 2r^2.$$

Proof.
For every $x \in \mathcal{X}$, we have that $\nabla^2 F(x,\xi) = 2X\otimes X + 2\,\mathrm{Id}$ and $\nabla^2 f(x) = 4\,\mathrm{Id}$; hence we are in the setting of Theorem 2.13. All that remains is to estimate the parameters $N_G(r)$ and $\sigma^2$, and this can be done exactly as in the proof of Corollary 3.3. $\square$

3.4. Portfolio optimization.
As a final example, we address the portfolio optimization problem in its general form; that is, we apply Theorem 2.9 to the problem
$$F(x,\xi) := \ell(V_x) \quad\text{where}\quad V_x := -Y - \langle X, x\rangle,\ \ \xi := (X, Y),$$
for $x \in \mathcal{X} \subseteq \mathbb{R}^d$, where $\mathcal{X}$ is closed and convex. Moreover, $\ell:\mathbb{R}\to\mathbb{R}$ is strictly convex, increasing, bounded from below, and we assume that it is three times continuously differentiable.

We shall impose the following two assumptions on the zero-mean random vector $X$. Firstly, assume that $X$ satisfies a (directional) $L_4$-$L_2$ norm equivalence, i.e. there is a constant $L_X$ such that
$$\mathbb{E}[\langle X, z\rangle^4] \le L_X^4\cdot\mathbb{E}[\langle X, z\rangle^2]^2 < \infty \tag{3.3}$$
for all $z \in \mathbb{R}^d$. In addition, we assume that the following no-arbitrage condition holds:
$$\mathbb{P}[\langle X, z\rangle < 0] > 0 \quad\text{for all } z \in \mathbb{R}^d\setminus\{0\}. \tag{3.4}$$

Remark 3.5.
The classical no-arbitrage condition used in mathematical finance reads as follows: for every z ∈ R^d, we have that P[⟨X, z⟩ < 0] = 0 implies that P[⟨X, z⟩ >
0] = 0;i.e. it is not possible to make profit without taking any risk. Under this condition,it is well-known that one can decompose R d = V ⊕ V ⊥ into an orthogonal sum suchthat P [ h X, v i < > v ∈ V \ { } and P [ h X, w i = 0] = 1 for all w ∈ V ⊥ ,see e.g. [7, Section 1.3]. In particular, replacing X by X ∩ V viewed as a subsetof R dim( V ) does not affect the outcome of the portfolio optimization problem butguarantees that (3.4) holds.Moreover, we shall assume that part (c) of Assumption 2.3, pertaining the inte-grability of F, ∇ F, ∇ F , is satisfied. By H¨older’s inequality and (3.3), this is thecase if, for example, ℓ ( V x ) , ℓ ′ ( V x ) , ℓ ′′ ( V x ) are integrable for every x ∈ X . Then f is real-valued, and standard arguments building on the no-arbitrage condition,show that f is strictly convex and coercive. In particular, a unique optimal action x ∗ ∈ X exists; see e.g. [7, Section 3.1].Finally, denoting by B ∗ the ball of radius 1 w.r.t. the norm k · k centered at x ∗ and restricted to X , we assume that¯ σ := E [( ℓ ′ ( V x ∗ ) ℓ ′′ ( V x ∗ ) ) ] ,v := E [ | V x ∗ | ] ,v := E [ ℓ ′′ ( V x ∗ ) ] ,v K := E [sup x ∈B ∗ ℓ ′′′ ( V x ) ] ,v E H := sup x ∈B ∗ E [sup t ∈ [0 , ℓ ′′′ ( V x ∗ + t ( x − x ∗ ) ) ] are all finite. Remark 3.6.
If, for instance, ℓ ′′′ is non-negative and increasing, the terms v K and v E H can be simplified as sup x ∈B ∗ ℓ ′′′ ( V x ) = ℓ ′′′ ( V x ∗ + k X k ∗ ) , sup t ∈ [0 , ℓ ′′′ ( V x ∗ + t ( x − x ∗ ) ) ≤ ℓ ′′′ ( V x ∗ + |h X, x − x ∗ i| )where k · k ∗ := sup x ∈ R d s.t. k x k≤ h x, ·i denotes the dual norm of k · k . (Note that if k · k is (equivalent to) the Euclidean norm, then its dual norm is (equivalent to) theEuclidean norm too).Under these assumptions, we obtain the following: Corollary 3.7 (Portfolio optimization) . There are constants c , c , c , c dependingonly on L X , v , v such that the following holds. For r ∈ (0 , min { , c v E H } ) and N ≥ c max n d · ¯ σ r , d log( d ( v K + 2)) o , there is a procedure b x ∗ N such that, with probability at least − (cid:16) − c N min n , r ¯ σ o(cid:17) , we have that k b x ∗ N − x ∗ k ≤ r,u ( b x ∗ N ) ≥ u ( x ∗ ) − c r . We postpone the proof of Corollary 3.7; it will be presented in Section 6, wherewe also show how to recover the estimate in the exponential portfolio optimizationproblem (i.e. Corollary 2.10).4.
On the smallest singular value of general random matrix ensembles
In the course of the analysis needed in the proof of Theorem 2.9, a crucial ingredient is showing that, on the sample, $\langle\nabla^2 F(x^*,\xi)(x - x^*), x - x^*\rangle$ is sufficiently large. Put differently, it is essential to derive a suitable lower bound on the smallest singular value of the empirical random matrix associated with $\nabla^2 F(x^*,\xi)$. Results of this type have been studied extensively [1, 14, 33, 36, 39, 45]; however (to the best of our knowledge), the focus was on random matrix ensembles that have some additional special structure, like iid rows/columns. Unfortunately, such special structure need not exist in our setting.

In this section, let $A$ be a real, square integrable, positive-semidefinite $(d\times d)$ random matrix and let $(A_i)_{i\ge 1}$ be independent copies of $A$. (In the context of this article, the case that interests us is $A = \nabla^2 F(x^*,\xi)$.) Denote its expectation by $\bar A := \mathbb{E}[A]$; the semi-norm endowed by $\bar A$ is
$$\|x\|^2 := \langle\bar A x, x\rangle = \mathbb{E}[\langle Ax, x\rangle],$$
and the corresponding unit sphere is $S := \{x \in \mathbb{R}^d : \|x\| = 1\}$. To simplify the presentation, assume that $\|\cdot\|$ is a true norm, i.e. $\|x\| = 0$ implies that $x = 0$. Next, for any $(d\times d)$-matrix $B$, its operator norm is given by
$$\|B\|_{op} := \max_{x,y\in S}\langle Bx, y\rangle.$$
Before stating the main result of this section (Theorem 4.4), let us begin with one of its outcomes.
Corollary 4.1.
Assume that there is a constant
$L > 0$ such that
$$\mathbb{E}[\langle Ax, x\rangle^2] \le L^2 \quad\text{for all } x \in S. \tag{4.1}$$
Then there are constants $c_1$ and $c_2$ that depend only on $L$ such that the following holds. Let $\gamma \in (0, 1)$ and assume that
$$N \ge c_1\,\frac{d}{\gamma^2}\log\Big(\frac{d}{\gamma}\Big).$$
Then, with probability at least $1 - 2\exp(-c_2 N\gamma^2)$, we have that
$$\lambda_{\min}\Big(\frac1N\sum_{i=1}^N A_i\Big) \ge (1-\gamma)\,\lambda_{\min}(\mathbb{E}[A]).$$
(Here, of course, $\lambda_{\min}$ denotes the smallest singular value.)

As the results of this section will be used in the proof of Theorem 2.9 for the construction of a median-of-means tournament, we also need a median-of-means version of Corollary 4.1. To that end, as before, let $N = nm$ for two integers $n$ and $m$, and consider a partition of $\{1,\dots,N\}$ into $n$ disjoint blocks $I_j$, each one of cardinality $|I_j| = m$. Throughout this article, we make the notational convention that the letter $j$ always refers to blocks, i.e. $j$ is always an element of $\{1,\dots,n\}$. In addition, $0 < C, C_1, C_2, \dots$ denote absolute constants independent of all parameters; they are allowed to change their values from line to line. We often encounter the so-called Rademacher random variables $(\varepsilon_i)_{i\ge 1}$, which are independent, symmetric random signs (i.e. P[ε_i = ±
1] = 1/2) that are also independent of all the other random variables that appear in the analysis; in particular, they are independent of $(A_i)_{i=1}^N$.

We have already seen in the introduction that the smallest empirical singular value of a random matrix cannot be bounded (even with constant probability) without imposing some sort of assumption. While an integrability assumption as in (4.1) will do the job, the following definition from [25] contains the essence of what is actually needed.

Definition 4.2 (Stable lower bound). A set $H \subseteq L_1$ of real-valued functions satisfies a stable lower bound with parameters $(m, \gamma, l, k)$ if the following holds: for every $h \in H$ and independent copies $(h_i)_{i\ge 1}$ of $h$, with probability at least $1 - 2\exp(-k)$, for all $J \subseteq \{1,\dots,m\}$ with $|J| \le l$, we have that
$$\frac1m\sum_{i\in\{1,\dots,m\}\setminus J} h_i \ge (1-\gamma)\,\mathbb{E}[h].$$
We say that a (symmetric, positive semi-definite, $d\times d$) random matrix $A$ satisfies a stable lower bound with parameters $(m, \gamma, l, k)$ if $H := \{\langle Ax, x\rangle : x \in S\}$ does (with the same parameters).

The stable lower bound can be seen as an extension of the small ball property. Recall that a set $H$ is said to satisfy a small ball property with parameters $(\kappa, \delta)$ if
$$\mathbb{P}\big[h \ge \kappa\cdot\mathbb{E}[h]\big] \ge \delta \quad\text{for all } h \in H.$$
This condition is used frequently in problems involving a quadratic term (see, e.g. [15, 22, 26]).
Remark 4.3.
Let H = {h Ax, x i : x ∈ S } . Then the following hold.(i) If H satisfies the small ball property with constants ( κ, δ ), then H satisfies astable lower bound with parameters( m, γ, k, l ) = (cid:16) m, − δκ , s δm, s δm (cid:17) for every m , where s , s > (ii) If H is bounded in L p for some p >
2, that is, E [ | h | p ] p ≤ L for all h ∈ H where L is a fixed constant, then H satisfies a stable lower bound with pa-rameters ( m, γ, k, l ) = (cid:16) m, γ, s mγ pp − , s mγ max { pp − , } (cid:17) where s , s > p and L .The first statement is an immediate consequence of a Binomial concentrationinequality and second statement can be found in [25, Section 2.1] together with amore thorough discussion and analysis of stable lower bounds.The following theorem is the main result of this section. To formulate it, recallthat for a random matrix A , E [ A ] is denoted by A . Theorem 4.4.
There are absolute constants $C_1$ and $C_2$ such that the following holds. Fix $\gamma, \tau \in (0, 1)$ and assume that

(a) the random matrix $A$ satisfies a stable lower bound with parameters $(m, \gamma, l, k)$, where $k \ge \max\{1, \log(1/\tau)\}$,

(b) the sample size satisfies
$$N \ge C_1\max\Big\{k\,\big\|\mathbb{E}[A\bar A^{-1}A]\big\|_{op},\ \frac{dm\tau}{k}\log\Big(\frac{\log(3d)\,\mathbb{E}\big[\|A\bar A^{-1}A\|_{op}\big]}{\gamma\tau k\,\|\mathbb{E}[A\bar A^{-1}A]\|_{op}}\Big)\Big\}.$$
Then, with probability at least $1 - 2\exp\big(-C_2 N\tau\min\{l/m,\, k/m\}\big)$, for every $x \in \mathbb{R}^d$ and every $J_j \subseteq I_j$ with $|J_j| \le l$ for all $j$, we have that
$$\Big|\Big\{j : \frac1m\sum_{i\in I_j\setminus J_j}\langle A_i x, x\rangle \ge (1-\gamma)\|x\|^2\Big\}\Big| \ge (1-\tau)\,n.$$

The formulation in Theorem 4.4 is tailor-made for the median-of-means analysis which is presented in the next section (as part of the proof of Theorem 2.9). It ensures a stability property that is stronger than a mere lower bound on the quadratic form (and therefore, a lower estimate on the smallest singular value): it gives a useful lower bound even if a proportion of the sample is arbitrarily modified or removed. This will be useful in the proof of Theorem 2.9 when passing from the Hessian evaluated at the optimizer to the Hessian evaluated at mid-points. In addition, Theorem 4.4 provides an almost isometric lower bound (i.e. when $\gamma$ is close to zero), while the analysis for Theorem 2.9 actually only requires an isomorphic lower bound.

Finally, let us formulate the following immediate consequence of Theorem 4.4.

Corollary 4.5.
There are absolute constants $C_1$ and $C_2$ such that the following holds. Fix $\gamma \in (0, 1)$ and assume that

(a) the random matrix $A$ satisfies a stable lower bound with parameters $(N, \gamma, l, k)$ for some $k \ge 1$,

(b) the sample size satisfies
$$N \ge C_1\max\Big\{k\,\big\|\mathbb{E}[A\bar A^{-1}A]\big\|_{op},\ \frac{dN}{k}\log\Big(\frac{\log(3d)\,\mathbb{E}\big[\|A\bar A^{-1}A\|_{op}\big]}{\gamma k\,\|\mathbb{E}[A\bar A^{-1}A]\|_{op}}\Big)\Big\}.$$
Then, with probability at least $1 - 2\exp\big(-C_2 N\min\{l/N,\, k/N\}\big)$, we have that
$$\lambda_{\min}\Big(\frac1N\sum_{i=1}^N A_i\Big) \ge (1-\gamma)\,\lambda_{\min}(\mathbb{E}[A]).$$

Proof.
Applying Theorem 4.4 with $n = 1$ (hence $m = N$), $\tau = 0.6$, and $J_j = \emptyset$ yields the following: with probability at least $1 - 2\exp(-C_2 N\min\{l/N, k/N\})$, for every $x \in \mathbb{R}^d$, we have that
$$\Big\langle\frac1N\sum_{i=1}^N A_i x,\, x\Big\rangle \ge (1-\gamma)\,\langle\mathbb{E}[A]x, x\rangle.$$
A twofold application of the extremal expression $\lambda_{\min}(\cdot) = \min_{x\in\mathbb{R}^d,\,\|x\|_2=1}\langle\cdot\, x, x\rangle$ of the smallest singular value completes the proof. $\square$

In order to recover Corollary 4.1 from Corollary 4.5 (and later also to apply Theorem 4.4 in the proof of Theorem 2.9), we need two simple observations, showing that under a norm equivalence assumption such as (4.1), the estimates on the sample size in Corollary 4.5 and Theorem 4.4 can be simplified.

Lemma 4.6.
We have that
$$1 \le \big\|\mathbb{E}[A\bar A^{-1}A]\big\|_{op} \le \mathbb{E}\big[\|A\bar A^{-1}A\|_{op}\big] \le d\cdot\big\|\mathbb{E}[A\bar A^{-1}A]\big\|_{op}. \tag{4.2}$$

Proof.
We start with the final inequality in (4.2). Recall that $\bar A = \mathbb{E}[A]$ and note that the relation $\|\cdot\| = \|\bar A^{1/2}\cdot\|_2$ transfers to the operator norm in the following sense: for any $(d\times d)$-matrix $B$, we have that
$$\|B\|_{op} = \|\bar A^{-1/2}B\bar A^{-1/2}\|_{2,op} \quad\text{where}\quad \|B\|_{2,op} := \max_{x,y\in\mathbb{R}^d,\,\|x\|_2,\|y\|_2\le 1}\langle Bx, y\rangle. \tag{4.3}$$
Indeed, $\{x\in\mathbb{R}^d : \langle\bar A x, x\rangle \le 1\}$ is the ellipsoid $\bar A^{-1/2}B_2^d$ (where $B_2^d$ is the unit ball w.r.t. the Euclidean norm). The norm $\|\cdot\|_{2,op}$ is the usual spectral norm which, for positive semidefinite matrices, equals the largest singular value $\lambda_{\max}(\cdot)$.

Setting $F := \bar A^{-1/2}A\bar A^{-1/2}$, which is positive semidefinite, we thus have
$$\mathbb{E}\big[\|A\bar A^{-1}A\|_{op}\big] = \mathbb{E}\big[\|F^2\|_{2,op}\big] = \mathbb{E}[\lambda_{\max}(F^2)] \le \sum_{i=1}^d \mathbb{E}[(F^2)_{ii}],$$
where the last inequality follows by bounding the largest singular value by the trace. On the other hand, for every $i = 1,\dots,d$, taking the Euclidean unit vector $x = y = e_i$ in (4.3) shows that
$$\big\|\mathbb{E}[A\bar A^{-1}A]\big\|_{op} = \big\|\mathbb{E}[F^2]\big\|_{2,op} \ge \mathbb{E}[(F^2)_{ii}],$$
and hence the last inequality of the lemma follows. The second inequality in (4.2) is trivial, and the first inequality follows from (4.3) and Jensen's inequality for matrices (which states that $\mathbb{E}[F^2] \succeq \mathbb{E}[F]^2 = \mathrm{Id}$). This completes the proof. $\square$

Lemma 4.7.
Assume that there is a constant $L$ such that $\mathbb{E}[\langle Ax, x\rangle^2] \le L^2$ for all $x \in S$. Then $\big\|\mathbb{E}[A\bar A^{-1}A]\big\|_{op} \le dL^2$.

Proof.
Recall that F := A − A A − and that, by (4.3), k E [ A A − A ] k op = k E [ F ] k op = max x ∈ R d s.t. k x k ≤ E [ h F x, x i ] . Moreover, for every x ∈ R d , we have h F x, x i = h F F x, F x i≤ k F k op k F x k , which, in combination with the Cauchy-Schwartz inequality, yields that k E [ A A − A ] k op ≤ max x ∈ R d s.t. k x k ≤ E [ k F k ] E [ k F x k ] . At this point, recall that { x ∈ R d : h A x, x i ≤ } is the ellipsoid A − B d , and notethat an equivalent formulation of (4.1) is E [ k F x k ] = E [ h F x, x i ] ≤ L E [ h F x, x i ]for every x ∈ R d , and that E [ h F x, x i ] = 1 for every x ∈ R d with k x k = 1. Hence,bounding k F k op = λ max ( F ) by the trace of F , E [ k F k ] ≤ d X i =1 E [( F ii ) ] = d X i =1 E [ h F e i , e i i ] ≤ dL . In conclusion k E [ A A − A ] k op ≤ dL , as claimed. (cid:3) Proof of Corollary 4.1.
By Remark 4.3 (ii), the random matrix A satisfies a stable lower bound with parameters ( N, γ, s N γ, s N γ ) for constants s and s depending only on L. The proof therefore follows from Corollary 4.5 together with Lemma 4.6 and Lemma 4.7. □

Throughout this article, we often aim to prove statements that are supposed to hold with high probability, uniformly over (uncountable) sets; for example, that for all x ∈ S, most coordinates of ( ⟨A_i x, x⟩ )_{i=1}^N are suitably large. The proofs of such statements follow (in principle) a recurring scheme: in a first step, we prove that a slightly stronger variant of the statement holds with high probability for a single element (i.e. Lemma 4.8 below). By a trivial union bound, this allows us to extend the validity of the statement to a finite set of high cardinality, and we shall choose such a set with an extra feature: it approximates the original set in a suitable sense (i.e. Lemma 4.10 below). It then remains to show that the oscillations caused by passing from the approximating set to the whole set do not distort the outcome by too much (i.e. Lemma 4.14 below).

From now on, we shall continue under the same assumptions used in the formulation of Theorem 4.4 without mentioning these assumptions at every instance.
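The median-of-means aggregation that underlies the whole tournament scheme is easy to sketch in code. The following toy implementation (all names and parameter choices are ours, not the paper's) recovers the mean of a heavy-tailed sample from block means:

```python
import numpy as np

def median_of_means(sample, n_blocks):
    """Split the sample into n_blocks equal blocks, average each block,
    and return the median of the block means."""
    m = len(sample) // n_blocks                      # block size
    blocks = sample[: m * n_blocks].reshape(n_blocks, m)
    return float(np.median(blocks.mean(axis=1)))

rng = np.random.default_rng(0)
# Heavy-tailed data: numpy's pareto(a) is Lomax with mean 1/(a-1);
# tail index 2.1 gives a finite but large variance.
true_mean = 1 / (2.1 - 1)
sample = rng.pareto(2.1, size=10_000)
mom_est = median_of_means(sample, n_blocks=50)
emp_mean = float(sample.mean())                      # for comparison
```

While the empirical mean is dragged around by the few largest observations, the median of block means is insensitive to a small number of corrupted blocks, which is the mechanism exploited throughout the proofs.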
Lemma 4.8.
There is an absolute constant C such that the following holds. For every x ∈ S, with probability at least 1 − 2 exp(−CτNk/m), for all choices J_j ⊆ I_j with |J_j| ≤ l, we have that

|{ j : (1/m) Σ_{i∈I_j∖J_j} ⟨A_i x, x⟩ ≥ 1 − γ }| ≥ (1 − τ/2) n.

In particular, for a set S̄ ⊆ S satisfying log(|S̄|) ≤ CτNk/(2m), the above statement holds uniformly over x ∈ S̄.

Proof. The proof follows from an application of Bennett's inequality and is essentially the same as the proof of [25, Lemma 4.3]. For completeness, we sketch the argument. Fix some x ∈ S and, for every j, set δ(j) := 0 if (1/m) Σ_{i∈I_j∖J_j} ⟨A_i x, x⟩ ≥ 1 − γ for all J_j ⊆ I_j with |J_j| ≤ l, and δ(j) := 1 otherwise; moreover, set δ := P[δ(j) = 1] ≤ exp(−k) ≤ ½.

If δ = 0, there is nothing to prove, so assume otherwise. Now make use of Bennett's inequality [5, Theorem 2.9]: for every u ≥ 2, with probability at least 1 − exp(−Cδn · u log(u)), we have that

|{ j : δ(j) = 1 }| ≤ uδn.

Apply this to u := τ/(2δ), which satisfies u ≥ exp(k/2) since τ ≥ 2 exp(−k/2) by assumption, and observe that

Cδn · u log(u) = ½ Cτn log(τ/(2δ)) ≥ ¼ Cτnk = ¼ CτNk/m.

On this event, |{ j : δ(j) = 1 }| ≤ uδn = τn/2. This completes the first part of the proof. The “in particular” part is a consequence of the union bound. □
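A quick simulation (with an arbitrary heavy-tailed distribution of our choosing; none of these parameters come from the paper) illustrates the content of Lemma 4.8: even after adversarially deleting the l largest entries of every block, the vast majority of blocks retains a large average:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, l = 100, 200, 5        # n blocks of size m; up to l entries deleted per block
gamma, tau = 0.5, 0.25       # slack and allowed fraction of bad blocks

# Nonnegative heavy-tailed stand-ins for <A_i x, x>; numpy's pareto(a)
# is Lomax with mean 1/(a-1), so multiplying by 1.5 normalizes to mean 1.
vals = rng.pareto(2.5, size=(n, m)) * 1.5

# Worst case for a lower bound: delete the l LARGEST entries of every block,
# then average the survivors over the full block size m.
srt = np.sort(vals, axis=1)
trimmed_means = srt[:, : m - l].sum(axis=1) / m
frac_good = float((trimmed_means >= 1 - gamma).mean())
```

Even with the deletions, nearly all blocks keep a trimmed mean above 1 − γ, matching the (1 − τ/2)n count in the lemma.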
As the next step, we choose a subset of S of high cardinality which “covers” S w.r.t. the natural metric.

Definition 4.9.
Let (S, d) be a metric space and let ρ > 0. A set S̄ ⊂ S is a ρ-cover of S with respect to the metric d if for every s ∈ S there is s̄ ∈ S̄ such that d(s, s̄) ≤ ρ.

For now, let C₁ be the absolute constant from Lemma 4.8 and C₂ the constant from assumption (b) in Theorem 4.4. Recall that we have the freedom to choose C₂ as we see fit, and set C₃ to be a constant specified in what follows.

Lemma 4.10.
Let

ρ := C₃ γτ ‖E[A Ā^{−1} A]‖_op / ( log(3d) · E[‖A Ā^{−1} A‖_op] ).

Then there is a ρ-cover S̄ ⊆ S (w.r.t. the norm ‖·‖) of log-cardinality at most C₁τNk/(2m).

Proof.
By a simple volumetric argument (see e.g. [38, Exercise 2.2.14]), there is a set S̄′ ⊆ S′ := { x ∈ ℝ^d : ‖x‖₂ = 1 } such that for every x ∈ S′ there is y = y(x) ∈ S̄′ with ‖x − y‖₂ ≤ ρ, with cardinality

log(|S̄′|) ≤ d log(3/ρ) ≤ C₄ d log( log(3d) E[‖A Ā^{−1} A‖_op] / ( γτ ‖E[A Ā^{−1} A]‖_op ) ),

where C₄ depends only on C₃. By assumption (b) on the sample size in Theorem 4.4, we conclude that log(|S̄′|) ≤ C₁τNk/(2m) once the constant C₂ (in assumption (b)) is chosen sufficiently large. Finally, the relation ‖·‖ = ‖Ā^{1/2} ·‖₂ readily implies that the set S̄ := { Ā^{−1/2} y : y ∈ S̄′ } satisfies the statement of the lemma. □

The final step in the proof consists of showing that the transition from a ρ-cover S̄ to the whole S does not distort the wanted outcome by too much. To that end, we fix from now on the ρ-cover S̄ of Lemma 4.10 and denote by y = y(x) ∈ S̄ the element satisfying ‖x − y‖ ≤ ρ. Also, for every x ∈ S, set

∆(x) := ⟨Ax, x⟩ − ⟨Ay, y⟩.

Remark 4.11.
The following two preliminary lemmas are stated here to maintain a chronological order within the proofs. However, it might be helpful to skip to (the proof of) Lemma 4.14, where their role is clarified, and return here afterwards.
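The ρ-cover of Definition 4.9, obtained abstractly in Lemma 4.10 via a volumetric argument, can also be built explicitly. The greedy procedure below (a standard construction; the dimension, radius, and point cloud are arbitrary choices for illustration) produces a cover of the Euclidean unit sphere whose size respects the volumetric bound (1 + 2/ρ)^d:

```python
import numpy as np

def greedy_cover(points, rho):
    """Greedy rho-cover: scan the points and keep any point that is
    farther than rho from all previously kept centers."""
    centers = []
    for p in points:
        if all(np.linalg.norm(p - c) > rho for c in centers):
            centers.append(p)
    return np.array(centers)

rng = np.random.default_rng(2)
pts = rng.normal(size=(2000, 3))
pts /= np.linalg.norm(pts, axis=1, keepdims=True)   # points on the unit sphere in R^3

rho = 0.5
cover = greedy_cover(pts, rho)
# Covering property: every point has a center within distance rho.
dist_to_cover = np.min(
    np.linalg.norm(pts[:, None, :] - cover[None, :, :], axis=2), axis=1
)
```

Because the kept centers are pairwise more than ρ apart, a packing argument bounds their number by (1 + 2/ρ)^d, mirroring the log-cardinality estimate used in the proofs.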
Lemma 4.12.
We have that E[|∆(x)|] ≤ 4ρ for every x ∈ S.

Proof. Fix some x ∈ S. As A is symmetric and positive semidefinite, it follows from the triangle inequality that

| ⟨Ax, x⟩^{1/2} − ⟨Ay, y⟩^{1/2} | ≤ ⟨A(x − y), x − y⟩^{1/2}.

Combined with the fact that |a² − b²| ≤ |a − b| · 2 max{a, b} for a, b ≥ 0, we have

|∆(x)| ≤ ⟨A(x − y), x − y⟩^{1/2} · 2 max{ ⟨Ax, x⟩^{1/2}, ⟨Ay, y⟩^{1/2} },

and by the Cauchy-Schwarz inequality,

E[|∆(x)|] ≤ 2 E[⟨A(x − y), x − y⟩]^{1/2} · ( E[⟨Ax, x⟩]^{1/2} + E[⟨Ay, y⟩]^{1/2} ) = 2 ‖x − y‖ · ( ‖x‖ + ‖y‖ ) = 4 ‖x − y‖ ≤ 4ρ,

where the second equality follows by definition of the norm ‖·‖ and the final one holds because x, y ∈ S. □

Lemma 4.13.
Let C_T be the absolute constant in Talagrand's concentration inequality [37]. Moreover, set b := γm/l, let (A_i)_{i=1}^N be independent copies of A, and set (ε_i)_{i≥1} to be independent Rademacher random variables that are independent of (A_i)_{i=1}^N. Then we have that

E[ sup_{x∈S} | (1/N) Σ_{i=1}^N ε_i ( |∆_i(x)| ∧ b ) | ] ≤ γτ/(8C_T)

once C₃ (the absolute constant of Lemma 4.10) is small enough.

Proof. As a preliminary step, we invoke the contraction inequality for Bernoulli processes [16, Corollary 3.17] conditionally on (A_i)_{i=1}^N, applied to the 1-Lipschitz map t
↦ |t| ∧ b, which passes through the origin. It follows that

E[ sup_{x∈S} | (1/N) Σ_{i=1}^N ε_i ( |∆_i(x)| ∧ b ) | ] ≤ E[ sup_{x∈S} | (1/N) Σ_{i=1}^N ε_i ∆_i(x) | ].

Rewriting ∆_i(x) = ⟨A_i(x + y), x − y⟩, we have that

| (1/N) Σ_{i=1}^N ε_i ∆_i(x) | = | ⟨ ( (1/N) Σ_{i=1}^N ε_i A_i ) (x + y), x − y ⟩ | ≤ 2ρ ‖ (1/N) Σ_{i=1}^N ε_i A_i ‖_op,

because ‖x + y‖ ≤ 2 and ‖x − y‖ ≤ ρ. Next, set F_i := ε_i Ā^{−1/2} A_i Ā^{−1/2} for every i and recall the relation between ‖·‖_op and the spectral norm ‖·‖₂ stated in (4.3). The matrix Bernstein inequality [40, Theorem I] implies that

E[ ‖ Σ_{i=1}^N ε_i A_i ‖_op ] = E[ ‖ Σ_{i=1}^N F_i ‖₂ ] ≤ ( C(d) N ‖E[F²]‖₂ )^{1/2} + C(d) E[ max_{1≤i≤N} ‖F_i‖₂² ]^{1/2},

where C(d) = 4(1 + 2 log(2d)) ≤ 22 log(2d).

Further, estimating the maximum by the sum and using that ‖F_i‖₂² = ‖F_i²‖₂, we trivially have

E[ max_{1≤i≤N} ‖F_i‖₂² ] ≤ N E[‖F²‖₂] = N E[‖A Ā^{−1} A‖_op].

Putting everything together, we therefore obtain

E[ sup_{x∈S} | (1/N) Σ_{i=1}^N ε_i ( |∆_i(x)| ∧ b ) | ] ≤ 2ρ C(d) ( ( ‖E[A Ā^{−1} A]‖_op / N )^{1/2} + ( E[‖A Ā^{−1} A‖_op] / N )^{1/2} ),

and the condition on the sample size in assumption (b) of Theorem 4.4, namely N ≥ ‖E[A Ā^{−1} A]‖_op, bounds both bracketed terms. Indeed, by Lemma 4.6, the expectation of the operator norm is always larger than the operator norm of the expectation, hence

E[ sup_{x∈S} | (1/N) Σ_{i=1}^N ε_i ( |∆_i(x)| ∧ b ) | ] ≤ 4ρ C(d) E[‖A Ā^{−1} A‖_op] / ‖E[A Ā^{−1} A]‖_op.

Recalling the value of ρ from Lemma 4.10 shows that the latter term is at most γτ/(8C_T) once C₃ is small enough. This completes the proof. □

Lemma 4.14.
There exists an absolute constant C such that the following holds. For every x ∈ S and every j, let

J*_j(x) := { the largest l coordinates of ( |∆_i(x)| )_{i∈I_j} } ⊆ I_j.

Then, with probability at least 1 − 2 exp(−CNτl/m), we have that

sup_{x∈S} |{ j : (1/m) Σ_{i∈I_j∖J*_j(x)} |∆_i(x)| > γ }| ≤ τn/2. (4.4)

Proof.
Set b := γm/l. We claim that, for every x ∈ S and every j,

(1/m) Σ_{i∈I_j∖J*_j(x)} |∆_i(x)| > γ implies (1/m) Σ_{i∈I_j} ( |∆_i(x)| ∧ b ) > γ. (4.5)

Indeed, if |∆_i(x)| ≤ b for all i ∈ I_j∖J*_j(x), the second sum in (4.5) is trivially at least as big as the first one. Otherwise, if there is i ∈ I_j∖J*_j(x) for which |∆_i(x)| > b, then, by definition of J*_j(x), there are at least l coordinates i ∈ I_j for which |∆_i(x)| > b. In particular, the second sum in (4.5) is at least (l/m) b = γ, and (4.5) holds. Therefore, it suffices to show that R ≤ γτ/2 holds with high probability, where

R := sup_{x∈S} (1/N) Σ_{i=1}^N ( |∆_i(x)| ∧ b ).

To that end, we apply Talagrand's concentration inequality for bounded empirical processes [37] (see also [5]): there is an absolute constant C_T such that

P[ R ≤ R₁ + C_T ( R₂ + R₃ + R₄ ) ] ≥ 1 − 2 exp(−u)

for every u ≥
0, where

R₁ := sup_{x∈S} E[ |∆(x)| ∧ b ],
R₂ := sup_{x∈S} E[ ( |∆(x)| ∧ b )² ]^{1/2} · ( u/N )^{1/2},
R₃ := sup_{x∈S} ‖ |∆(x)| ∧ b ‖_{L∞} · u/N,
R₄ := E[ sup_{x∈S} | (1/N) Σ_{i=1}^N ε_i ( |∆_i(x)| ∧ b ) | ].
Thus, to conclude the proof, let us show that for u = CNτl/m, the sum R₁ + C_T(R₂ + R₃ + R₄) is smaller than γτ/2.

We now proceed to bound R₁, …, R₄. First, recall that C₃ is the absolute constant of Lemma 4.10, which we are still able to choose as small as we want, and note that

ρ = C₃ γτ ‖E[A Ā^{−1} A]‖_op / ( log(3d) E[‖A Ā^{−1} A‖_op] ) ≤ C₃ γτ / log(3) ≤ 2 C₃ γτ,

where the first inequality holds by Lemma 4.6 and the second one by absorbing log(3) into the constant. By Lemma 4.12 we have that E[|∆(x)|] ≤ 4ρ for every x ∈ S; thus

R₁ ≤ 4ρ ≤ 8 C₃ γτ ≤ γτ/(8C_T)

once C₃ is small enough. For the terms R₂ and R₃, which involve u = CNτl/m, note that

E[ ( |∆(x)| ∧ b )² ] ≤ 4ρb for every x ∈ S.

Indeed, this follows from the trivial estimate ( |∆(x)| ∧ b )² ≤ |∆(x)| b and Lemma 4.12 once again. Recalling that b = γm/l, we therefore have

R₂ ≤ ( 4ρbu/N )^{1/2} ≤ ( 8 C₃ C γ²τ² )^{1/2} ≤ γτ/(8C_T)

once C₃ is small enough. Moreover,

R₃ ≤ bu/N = Cγτ ≤ γτ/(8C_T)

once C is small enough. Finally, by Lemma 4.13, we have R₄ ≤ γτ/(8C_T). This completes the proof. □

Proof of Theorem 4.4.
The statement of Theorem 4.4 is clearly homogeneous in x ∈ ℝ^d, hence it suffices to restrict to x ∈ S. The proof follows from a combination of Lemma 4.8 and Lemma 4.14. Indeed, using the notation of Lemma 4.14, for every x ∈ S, y = y(x), and every J_j ⊆ I_j with |J_j| ≤ l, we write

(1/m) Σ_{i∈I_j∖J_j} ⟨A_i x, x⟩ ≥ (1/m) Σ_{i∈I_j∖(J_j∪J*_j(x))} ⟨A_i x, x⟩ ≥ (1/m) Σ_{i∈I_j∖(J_j∪J*_j(x))} ⟨A_i y, y⟩ − (1/m) Σ_{i∈I_j∖(J_j∪J*_j(x))} |∆_i(x)|. (4.6)

By Lemma 4.8, with probability at least 1 − 2 exp(−CτNk/m), for every x, y, and sets J_j as above, we have that

(1/m) Σ_{i∈I_j∖(J_j∪J*_j(x))} ⟨A_i y, y⟩ ≥ 1 − γ on more than (1 − τ/2) n blocks.

Next, by Lemma 4.14, with probability at least 1 − 2 exp(−CτNl/m), for every x, y, and sets J_j as above, we have that

(1/m) Σ_{i∈I_j∖(J_j∪J*_j(x))} |∆_i(x)| ≤ γ on more than (1 − τ/2) n blocks.

Taking the intersection of the two high probability events yields the claim. □

5. Proofs of the main results
In addition to the notational conventions already explained in Section 4, set c, c₁, c₂, … to be constants that may depend on L (the parameter appearing in Assumption 2.5). As before, these constants may change their values from line to line. Moreover, 0 < θ < τ < 1 are constants that may depend on L as well. For the sake of a clearer presentation, rather than stating the explicit values of θ and τ now, we collect constraints on their values along the way.

Next recall that n = θN min{ 1, r²/σ² } and m = N/n, where we assume without loss of generality that m and n are integers. Thus, N = nm and m ≥ 1/θ. Finally, set

B*_r := { x ∈ X : ‖x − x*‖ ≤ r }, S*_r := { x ∈ X : ‖x − x*‖ = r }

to be the ball and sphere of radius r around x* restricted to X, respectively. Recall the constant r₀ of Assumption 2.7 and assume throughout that r ≤ r₀.

The proof of Theorem 2.9 relies on the following decomposition: for every j and every x ∈ X, a Taylor expansion implies that

f̂_{I_j}(x) − f̂_{I_j}(x*) = (1/m) Σ_{i∈I_j} ⟨∇F(x*, ξ_i), x − x*⟩ + ½ (1/m) Σ_{i∈I_j} ⟨∇²F(z_i, ξ_i)(x − x*), x − x*⟩ =: M_{x,x*}(j) + ½ Q_{x,x*}(j),

where the z_i are midpoints between x and x* (and each z_i may depend on ξ_i). For obvious reasons we call Q_{x,x*} the quadratic term and M_{x,x*} the multiplier term.

We start with the proof of the estimation error, formulated in Proposition 2.14. With the decomposition of f̂_{I_j}(x) − f̂_{I_j}(x*) into a multiplier and a quadratic term at hand, the strategy is to prove that x* defeats any competitor x ∈ S*_r on the j-th block (i.e. that f̂_{I_j}(x) − f̂_{I_j}(x*) > 0) by showing that the quadratic term is likely to be sufficiently positive and the multiplier term is likely not to be too negative:

Lemma 5.1.
There is a constant c such that the following holds. With probability at least 1 − 2 exp(−cτ²n), for every x ∈ S*_r, we have that

|{ j : ½ Q_{x,x*}(j) ≥ r²/8 }| ≥ (1 − τ) n, (5.1)

|{ j : M_{x,x*}(j) ≥ −r²/16 }| ≥ (1 − τ) n. (5.2)
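Before turning to the proofs, the decomposition f̂_{I_j}(x) − f̂_{I_j}(x*) = M_{x,x*}(j) + ½ Q_{x,x*}(j) can be checked exactly on a toy objective. For the hypothetical least-squares choice F(x, ξ) = ½(⟨a, x⟩ − y)² with ξ = (a, y), the Hessian aaᵀ does not depend on x, so the midpoints z_i play no role and the identity holds with equality:

```python
import numpy as np

rng = np.random.default_rng(3)
d, m = 5, 50
a = rng.normal(size=(m, d))        # one block I_j of samples xi_i = (a_i, y_i)
y = rng.normal(size=m)
x_star = rng.normal(size=d)        # stand-in for the optimizer x*
x = rng.normal(size=d)             # a competitor

def f_hat(z):
    """Block empirical objective (1/m) sum_i F(z, xi_i)."""
    return 0.5 * float(np.mean((a @ z - y) ** 2))

# Multiplier term: (1/m) sum_i <grad F(x*, xi_i), x - x*>.
grads = (a @ x_star - y)[:, None] * a
M = float(np.mean(grads @ (x - x_star)))
# Quadratic term: (1/m) sum_i <a_i a_i^T (x - x*), x - x*>.
Q = float(np.mean((a @ (x - x_star)) ** 2))

lhs = f_hat(x) - f_hat(x_star)
```

For a general twice-differentiable F the same identity holds with the Hessians evaluated at midpoints z_i, which is exactly the complication the proofs below have to control.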
Let us show that Lemma 5.1 implies Proposition 2.14:
Proof of Proposition 2.14.
On the high probability event from Lemma 5.1, for every x ∈ S*_r, we have that

f̂_{I_j}(x) − f̂_{I_j}(x*) ≥ r²/16 > 0 on more than (1 − 2τ) n blocks j.

Therefore, as τ < 1/4, on that event, x* wins the match against every x ∈ S*_r.

The extension to all x ∈ X with ‖x − x*‖ ≥ r is a simple consequence of convexity. Indeed, let x ∈ X with ‖x − x*‖ ≥ r and set

y := x* + ( r/‖x − x*‖ ) (x − x*) ∈ S*_r.

Note now that (for every sample) f̂_{I_j}(·) − f̂_{I_j}(x*) is a convex function which equals zero at x*. Hence, if this function is strictly positive at y, then, by convexity, it is also strictly positive on

{ x* + t(y − x*) : t ≥ 1 } ∩ X,

which is the subset of the ray that originates from x* and passes through y, consisting of the points that are “beyond” y. Taking t = ‖x − x*‖/r, we see that x* defeats x (at least) on the same blocks on which it defeats y. In conclusion, on the high probability event of the lemma, x* wins the match against x. □

Remark 5.2.
By Assumption 2.3, the functions F, ∇F and ∇²F are so-called Carathéodory functions and therefore are jointly measurable. Moreover, by Assumption 2.7 and the dominated convergence theorem, one can readily verify that f is twice continuously differentiable near x* with

∇f(x) = E[∇F(x, ξ)] and ∇²f(x) = E[∇²F(x, ξ)]

for all x ∈ X with ‖x − x*‖ < r₀. In particular, from this we get that ‖·‖² = E[⟨∇²F(x*, ξ) ·, ·⟩].

5.1. Estimation error, the quadratic term.

This subsection contains the proof of (5.1) from Lemma 5.1: we show that the quadratic term is likely to be at least of order r². The proof relies on the results of Section 4 and the strategy is the following. In a first step, we ignore the fact that the Hessian in the definition of Q_{x,x*} is evaluated at a midpoint between x* and x, considering instead the Hessian evaluated at the optimizer x*. We employ the median-of-means-type lower bound on the smallest singular value of the random matrix ∇²F(x*, ξ) established in Section 4, which is summarized in the following lemma.

Lemma 5.3.
There are constants s₁, c > 0 depending only on L such that the following holds. With probability at least 1 − 2 exp(−cτN), for every x ∈ S*_r and every choice of subsets J_j ⊆ I_j with |J_j| ≤ s₁m, we have that

|{ j : (1/m) Σ_{i∈I_j∖J_j} ⟨∇²F(x*, ξ_i)(x − x*), x − x*⟩ ≥ r²/2 }| ≥ (1 − τ/2) n. (5.3)

Proof.
We apply Theorem 4.4 with A := ∇²F(x*, ξ). By Remark 4.3, the random matrix A satisfies a stable lower bound with parameters

( m, γ, l, k ) = ( m, ¼, s₁m, s₂m )

for constants s₁, s₂ that depend only on L. Moreover, once θ is sufficiently small, we have that

k = s₂m ≥ s₂/θ ≥ max{ 2, log(2/τ) },

showing that assumption (a) in Theorem 4.4 is satisfied. Lemma 4.7 and Lemma 4.6 imply that

max{ ‖E[A Ā^{−1} A]‖_op, (dmτ/k) log( log(3d) E[‖A Ā^{−1} A‖_op] / ( γτ ‖E[A Ā^{−1} A]‖_op ) ) } ≤ max{ c₁ d, (dτ/s₂) log( 2d log(3d)/τ ) } ≤ c₁ d log(2d)

for a constant c₁ that depends only on L and τ. And, as τ depends only on L, c₁ actually depends only on L.

Recall that Theorem 2.9 has the requirement that N ≥ c₀ d log(2d) for a constant c₀ that may depend on L and which we are free to choose to be as large as we want. Doing so shows that assumption (b) of Theorem 4.4 holds as well. Setting c := C min{ s₁, s₂ }, where C is the constant from Theorem 4.4, it follows from that theorem that (5.3) holds with probability at least 1 − 2 exp(−cτN). □

From now on, we fix the constant s₁ from Lemma 5.3. As s₁ depends only on L, all constants θ, τ, c, c₁, … which are allowed to depend on L may also depend on s₁.

In the next step, we show that replacing the Hessian at a midpoint with the Hessian at x* does not come at a high cost. Clearly, in Lemma 5.3 we may arbitrarily modify / delete s₁m coordinates from each block j, and we shall argue in the following that the errors

E_{H,i}(x) = sup_{t∈[0,1]} | ⟨ ( ∇²F(x* + t(x − x*), ξ_i) − ∇²F(x*, ξ_i) )(x − x*), x − x* ⟩ |

are well behaved on the remaining blocks. As before, the proof has two components. In the first one, we analyze what happens for a single x ∈ S*_r.
Here the error is governed by the probability that E_H(x) is large, where we recall that r < r₀ and, by Assumption 2.7,

sup_{x∈S*_r} P[ E_H(x) ≥ r²/8 ] ≤ c₀

for a constant c₀ which we are free to choose depending on L; hence, c₀ may depend on s₁ as well.

Lemma 5.4.
There is a constant c such that, for every x ∈ S*_r and every j, with probability at least 1 − 2 exp(−cm), we have that

|{ i ∈ I_j : E_{H,i}(x) ≤ r²/8 }| ≥ (1 − s₁/2) m.

Proof.
Fix some x ∈ S*_r. Setting c₀ := s₁/4, Assumption 2.7 implies that

δ := P[ E_H(x) ≥ r²/8 ] ≤ s₁/4.

If δ = 0 there is nothing to prove, so assume otherwise and apply the binomial concentration inequality [5, Corollary 2.11]: there is an absolute constant C >
0, with probability at least 1 − − Cmδ min { u, u } ),we have that (cid:12)(cid:12)(cid:12)n i ∈ I j : E H ,i ( x ) ≥ r o(cid:12)(cid:12)(cid:12) ≤ mδ (1 + u ) . Applying this to u := s δ − ≥ c = Cs ). (cid:3) Next, let us show that most blocks have many indices i for which the errors E H ,i ( x ) are small. Lemma 5.5.
There is a constant c such that, for every x ∈ S*_r, with probability at least 1 − 2 exp(−cτN), we have that

|{ j : |{ i ∈ I_j : E_{H,i}(x) ≤ r²/8 }| ≥ (1 − s₁/2) m }| ≥ (1 − τ/2) n.

In particular, the statement holds uniformly over sets S̄*_r ⊆ S*_r of cardinality at most log(|S̄*_r|) ≤ cτN.

Proof. Fix x ∈ S*_r and, for every j, set

δ(j) := 0 if |{ i ∈ I_j : E_{H,i}(x) ≤ r²/8 }| ≥ (1 − s₁/2) m, and δ(j) := 1 otherwise.

By Lemma 5.4 we have that δ := P[δ(j) = 1] ≤ 2 exp(−c₁m) for a constant c₁. Now recall that m ≥ 1/θ and therefore δ ≤ ½ once θ is small enough (i.e. θ ≤ c₁/log(4) suffices). By Bennett's inequality, exactly as in the proof of Lemma 4.8, we have that for every u ≥ 2, with probability at least 1 − exp(−Cδnu log(u)),

|{ j : δ(j) = 1 }| ≤ uδn.

To complete the proof, set u := τ/(2δ), so that u ≥ (τ/4) exp(c₁m) ≥ 2, again once θ is small enough. □

The final step is to extend the outcome of the lemma from the net S̄*_r to the whole S*_r. To that end, recall that by Assumption 2.7,

‖∇²F(x, ξ) − ∇²F(y, ξ)‖_op ≤ ‖x − y‖^α · K(ξ) for all x, y ∈ B*_r.

Remark 5.6. In what follows we shall, from time to time, divide by E[K(ξ)]; therefore, we shall assume without loss of generality that E[K(ξ)] > 0. Note that if E[K(ξ)] = 0, then the Hessian is constant in B*_r and there is nothing to prove: E_H(x) = 0 for every x ∈ B*_r.

Lemma 5.7.
Let c be the constant of Lemma 5.5 and let C be an absolute constant to be specified later. Set

ρ := ( C τ s₁ / ( r^α E[K(ξ)] ) )^{1/α}.

Then S*_r contains a ρr-cover with respect to the norm ‖·‖, which is denoted by S̄*_r, and log(|S̄*_r|) ≤ cτN.

Proof. If ρ ≥ 1, the statement is easily verified, so assume that 0 < ρ < 1. By a standard volumetric argument (see e.g. [38, Exercise 2.2.14]) there is a ρr-cover S̄*_r ⊆ S*_r, and

log(|S̄*_r|) ≤ d log(3/ρ) ≤ (d/α) log( 3^α r^α E[K(ξ)] / (C τ s₁) ).

By assumption we have N ≥ c N_{H,E} ≡ c (d/α) log( r^α E[K(ξ)] + 2 ) for a constant c which we are free to choose large enough. Then log(|S̄*_r|) ≤ cτN, as claimed. □

Lemma 5.8.
Let S̄*_r be as in Lemma 5.7, and for every x ∈ S*_r set y = y(x) to be the nearest point to x in S̄*_r.

There is an absolute constant C such that the following holds. With probability at least 1 − 2 exp(−Cτ²n), for every x ∈ S*_r, we have that

|{ j : |{ i ∈ I_j : |E_{H,i}(x) − E_{H,i}(y)| ≥ r²/8 }| ≤ s₁m/2 }| ≥ (1 − τ/2) n.

Proof.
Let Ψ := sup x ∈S ∗ r n n X j =1 |{ i ∈ I j : |E H ,i ( x ) −E H ,i ( y ) |≥ r }|≥ s m and observe that it suffices to show that P [Ψ ≥ τ ] ≤ − Cτ n ). By thebounded differences inequality [5, Theorem 6.2], we have that for every u ≥ P [Ψ ≥ E [Ψ] + u ] ≤ − Cnu ) , and setting u := τ , all that remains to show is that E [Ψ] ≤ τ .Note that1 |{ i ∈ I j : |E H ,i ( x ) −E H ,i ( y ) |≥ r }|≥ s m ≤ r s m X i ∈ I j |E H ,i ( x ) − E H ,i ( y ) | for every j ; hence, Ψ ≤ r s sup x ∈S ∗ r N N X i =1 |E H ,i ( x ) − E H ,i ( y ) | . (5.4)To control the difference of the E ’s, for simpler notation, for ( t, z ) ∈ [0 , × S ∗ r set A tzi := ∇ F ( x ∗ + t ( z − x ∗ ) , ξ i ) − ∇ F ( x ∗ , ξ i ) , and observe that E H ,i ( z ) = sup t ∈ [0 , |h A tzi ( z − x ∗ ) , z − x ∗ i| for every z ∈ S ∗ r . Now, for every t ∈ [0 ,
1] and every x ∈ S ∗ r (and y = y ( x )), (cid:12)(cid:12) h A txi ( x − x ∗ ) , x − x ∗ i − h A tyi ( y − x ∗ ) , y − x ∗ i (cid:12)(cid:12) = (cid:12)(cid:12) h A txi ( x − x ∗ + y − x ∗ ) , x − y i − h ( A tyi − A txi )( y − x ∗ ) , y − x ∗ i (cid:12)(cid:12) ≤ ρr k A txi k op + r k A tyi − A txi k op , N MONTE-CARLO METHODS IN CONVEX STOCHASTIC OPTIMIZATION 37 where the last inequality holds by definition of the operator norm and by notingthat k x − y k ≤ ρr , k x − x ∗ + y − x ∗ k ≤ r , and k y − x ∗ k ≤ r . Invoking thesubadditivity of “sup t ( · )” and the triangle inequality, |E H i ( x ) − E H i ( x ) | ≤ ρr sup t ∈ [0 , k A txi k op + r sup t ∈ [0 , k A tyi − A txi k op ≤ ρr r α K ( ξ i ) + r ( rρ ) α K ( ξ i ) ≤ r ( ρr ) α K ( ξ i )(5.5)where the second inequality holds by Assumption 2.7 on the continuity of theHessian and as we may assume without loss of generality that ρ ≤ ρ ≤ ρ α ).Plugging (5.5) into (5.4) implies thatΨ ≤ r s · r ( ρr ) α N N X i =1 K ( ξ i )and therefore E [Ψ] ≤ ρ α r α E [ K ( ξ )] s . Setting ρ = (cid:16) C τ s r α E [ K ( ξ )] (cid:17) α just as in Lemma 5.7, and recalling that we are free to choose C small enough, itfollows that E [Ψ] ≤ τ , as required. (cid:3) We are now ready for the
Proof of Proposition 5.1, quadratic part.
For every x ∈ S ∗ r and j , set J ∗ j ( x ) := { largest s m coordinates of ( E H ,i ( x )) i ∈ I j } ⊆ I j . For every j , recalling the definition of E H ( x ) and the fact that ∇ F ( z, ξ ) is positivesemidefinite, it follows that Q x,x ∗ ( j ) ≥ m X i ∈ I j \ J ∗ j ( x ) inf t ∈ [0 , (cid:10) ∇ F ( x ∗ + t ( x − x ∗ ) , ξ i )( x − x ∗ ) , x − x ∗ (cid:11) ≥ m X i ∈ I j \ J ∗ j ( x ) (cid:10) ∇ F ( x ∗ , ξ i )( x − x ∗ ) , x − x ∗ (cid:11) − m X i ∈ I j \ J ∗ j ( x ) E H ,i ( x )=: A x ( j ) + B x ( j )As | J ∗ j ( x ) | ≤ s m for every x ∈ S ∗ r by definition, Lemma 5.3 implies that, withprobability at least 1 − − cτ n ), for every x ∈ S ∗ r , we have that A x ( j ) ≥ r (cid:16) − τ (cid:17) n of the blocks j. Moreover, combining Lemma 5.5 and Lemma 5.8 implies that, with probability atleast 1 − − cτ n ), for every x ∈ S ∗ r , we have that (cid:12)(cid:12)(cid:12)n j : (cid:12)(cid:12)(cid:8) i ∈ I j : E H ,i ( x ) ≤ r (cid:9)(cid:12)(cid:12) ≥ (1 − s ) m o(cid:12)(cid:12)(cid:12) ≥ (cid:16) − τ (cid:17) n, and on that event clearly B x ( j ) ≥ − r (cid:16) − τ (cid:17) n of the blocks j. In particular, combining the estimates on A x ( j ) and B x ( j ) gives: with probabilityat least 1 − − cτ n ), for every x ∈ S ∗ r , we have that Q x,x ∗ ( j ) ≥ r − τ ) n of the blocks j. This completes the proof. (cid:3)
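The convexity argument used in the proof of Proposition 2.14 above rests on an elementary fact: if g is convex with g(x*) = 0 and t ≥ 1, then g(x* + t(y − x*)) ≥ t·g(y), so positivity at y propagates along the whole ray beyond y. A numerical sanity check on an arbitrary convex test function (our own choice, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(4)
d = 4
B = rng.normal(size=(d, d))
P = B @ B.T + np.eye(d)            # positive definite, so the quadratic part is convex

x_star = rng.normal(size=d)

def g(z):
    """Convex function with g(x_star) = 0 (quadratic plus l1 term)."""
    w = z - x_star
    return float(w @ P @ w + np.abs(w).sum())

y = x_star + rng.normal(size=d)
ts = [1.0, 1.5, 3.0, 10.0]
# Along the ray through y, g grows at least linearly in t.
ratios = [g(x_star + t * (y - x_star)) / g(y) for t in ts]
```

The inequality follows by writing y as the convex combination (1 − 1/t)x* + (1/t)(x* + t(y − x*)) and using g(x*) = 0.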
5.2. Estimation error, the multiplier term.
This subsection contains the proof of (5.2) from Lemma 5.1, stating that the multiplier term

M_{x,x*}(j) = (1/m) Σ_{i∈I_j} ⟨∇F(x*, ξ_i), x − x*⟩

is likely to be at most of order r² in absolute value on most blocks. To ease notation, set

H := ∇²f(x*) = E[∇²F(x*, ξ)], G := Cov[∇F(x*, ξ)],

i.e., H is the Hessian of f at x* and G is the covariance matrix of the gradient of F(·, ξ) at x*. In particular, a straightforward computation shows that

N_G(r) = (1/r²) trace(H^{−1} G) = (1/r²) trace(H^{−1/2} G H^{−1/2}), (5.6)

σ² = λ_max(H^{−1} G) = λ_max(H^{−1/2} G H^{−1/2}) = ‖H^{−1/2} G H^{−1/2}‖_op. (5.7)

As before, we analyze what happens for a single x ∈ S*_r; the high probability estimate we obtain allows us to control a net in S*_r; we then show that passing from the net to the entire set S*_r does not distort the outcome by too much.

Lemma 5.9.
There is an absolute constant C such that, for every x ∈ S*_r, with probability at least 1 − 2 exp(−Cτ²n), we have that

|{ j : M_{x,x*}(j) ≥ −r²/32 }| ≥ (1 − τ/2) n.

In particular, the statement holds uniformly over a set S̄*_r ⊆ S*_r of cardinality at most log(|S̄*_r|) ≤ Cτ²n.

Proof. Fix some x ∈ S*_r and define, for 1 ≤ i ≤ N,

U_i := ⟨∇F(x*, ξ_i) − E[∇F(x*, ξ)], x − x*⟩ = ⟨∇F(x*, ξ_i), x − x*⟩ − ⟨∇f(x*), x − x*⟩.

If x* lies in the interior of X, the first order condition for optimality implies that ⟨∇f(x*), x − x*⟩ equals zero. In general, the first order condition implies that this term is non-negative. In either case, we get that

| (1/m) Σ_{i∈I_j} U_i | ≤ r²/32 implies M_{x,x*}(j) ≥ −r²/32.
To that end, consider first a single block j. As the U_i's are i.i.d. zero mean random variables, Markov's inequality together with the Cauchy-Schwarz inequality implies that

P[ | (1/m) Σ_{i∈I_j} U_i | ≥ r²/32 ] ≤ (32/r²) E[ | (1/m) Σ_{i∈I_j} U_i |² ]^{1/2} ≤ (32/r²) ( E[|U₁|²] / m )^{1/2}. (5.8)

By the definition of σ² (or rather, by the alternative expression in (5.7)) and as ‖x − x*‖ = r, we have that

E[|U₁|²] = ⟨G(x − x*), x − x*⟩ ≤ r²σ². (5.9)

Combining (5.8) and (5.9) and using that m ≥ σ²/(θr²), we conclude that

P[ | (1/m) Σ_{i∈I_j} U_i | ≥ r²/32 ] ≤ 32σ/(r√m) ≤ 32√θ ≤ τ/4

once θ is sufficiently small.

The claim now follows from a binomial estimate, just as in the proof of Lemma 5.4: the probability that a single block j has the wanted property is at least 1 − τ/4 (by the above); therefore, the probability that the number of desirable blocks j is smaller than the mean (which is at least (1 − τ/4)n) by more than τn/4 is at most 2 exp(−Cnτ²). □

Lemma 5.10.
Let C be the absolute constant of Lemma 5.9. Then there is anabsolute constant C and a set ¯ S ∗ r ⊆ S ∗ r with cardinality log( | ¯ S ∗ r | ) ≤ C nτ suchthat the following holds. Let ρ := (cid:16) trace( H − G ) C τ n (cid:17) . For every x ∈ S ∗ r there is y = y ( x ) ∈ ¯ S ∗ r with h G ( x − y ) , x − y i ≤ ρr and , (5.10) h∇ f ( x ∗ ) , x − y i ≥ . (5.11) Proof.
As a first step, we ignore (5.11) and construct a set ˜ S ∗ r with log-cardinalitysatisfying log( | ˜ S ∗ r | ) ≤ C nτ such that (5.10) holds with ρr instead of 2 ρr . Tothat end, observe that covering the sphere { x ∈ R d : h H x, x i = 1 } w.r.t. to the normendowed by G is equivalent to covering the Euclidean sphere { x ∈ R d : h x, x i =1 } w.r.t. the norm endowed by H − GH − . Hence, denoting by G the standardGaussian vector in R d , the dual Sudakov inequality (see e.g. [16, Theorem 3.18])guarantees the existence of a ρr cover of ˜ S ∗ r ⊆ S ∗ r with respect to the norm h G · , ·i such that log (cid:16) | ˜ S ∗ r | (cid:17) ≤ C (cid:16) E [ h H − GH − G , Gi ] ρ (cid:17) ; thus, for every x ∈ S ∗ r there is y = y ( x ) ∈ ˜ S ∗ r with h G ( y − x ) , y − x i ≤ ρr . Observethat E [ h H − GH − G , Gi ] = E [ k ( H − GH − ) Gk ] ≤ E [ k ( H − GH − ) Gk ]= trace( H − GH − ) = trace( H − G ) , (where the last equality was already observed in (5.6)). Thus,log (cid:16) | ˜ S ∗ r | (cid:17) ≤ C trace( H − G ) ρ = C C τ n ≤ C τ n, once C is sufficiency small.Next, we modify ˜ S ∗ r , ensuring that both equations (5.10) and (5.11) hold: forevery z ∈ ˜ S ∗ r , pick some y ( z ) ∈ argmin n h∇ f ( x ∗ ) , y i : y ∈ S ∗ r s.t. h G ( y − z ) , ( y − z ) i ≤ ρr o ;thus, y ( z ) is the minimizer of h∇ f ( x ∗ ) , y i with the ρr -ball (with respect to the norm h G · , ·i ) centred at z .It is straightforward to verify that the set¯ S ∗ r := { y ( z ) : z ∈ ˜ S ∗ r } satisfies the statement of the lemma. (cid:3) Lemma 5.11.
Let ¯ S ∗ r and y = y ( x ) be as in Lemma 5.10. There is an absoluteconstant C such that the following holds. With probability at least − − Cτ n ) ,for every x ∈ S ∗ r , we have that (cid:12)(cid:12)(cid:12)n j : M x,x ∗ ( j ) ≥ M y,x ∗ ( j ) − r o(cid:12)(cid:12)(cid:12) ≥ (cid:16) − τ (cid:17) n. Proof.
For every x ∈ X and j , set∆ x ( j ) := M x,x ∗ ( j ) − M y,x ∗ ( j )¯∆ x ( j ) := ∆ x ( j ) − E [∆ x ( j )]= 1 m X i ∈ I j h∇ F ( x ∗ , ξ i ) − E [ ∇ F ( x ∗ , ξ )] , x − y i . Recalling that E [∆ x ( j )] ≥ x ∈S ∗ r n n X j =1 {| ¯∆ x ( j ) |≥ r } , the statement of the lemma therefore follows if Ψ ≤ τ holds with probability atleast 1 − − Cτ n ).To that end, we once more rely on the bounded difference inequality [5, Theorem6.2]: for every u ≥
0, we have that

P[ Ψ ≥ E[Ψ] + u ] ≤ exp(−Cnu²).

Setting u := τ/4, all that is left is to show that E[Ψ] ≤ τ/4.
First, observe that 1{|a| ≥ b} ≤ |a|/b. Hence,

Ψ ≤ (32/r²) sup_{x∈S*_r} (1/n) Σ_{j=1}^{n} |∆̄_x(j)| ≤ (32/r²) ( sup_{x∈S*_r} E[|∆̄_x(1)|] + sup_{x∈S*_r} | (1/n) Σ_{j=1}^{n} ( |∆̄_x(j)| − E[|∆̄_x(j)|] ) | ) =: (32/r²) (R₁ + R₂).

In particular, E[Ψ] ≤ (32/r²)(R₁ + E[R₂]).

To estimate R₁, recall that ∆̄_x(j) is a sum of independent, zero mean random variables. Thus,

E[|∆̄_x(1)|] ≤ ( ⟨G(x − y), x − y⟩/m )^{1/2} ≤ r ( trace(H^{-1}G)/(Cτ²nm) )^{1/2}

for every x ∈ S*_r, where the second inequality follows from Lemma 5.10. Recalling that, by our assumptions, nm = N ≥ c₀N_G(r) ≡ c₀ trace(H^{-1}G)/r², we conclude that E[|∆̄_x(1)|] ≤ r²/(τ(Cc₀)^{1/2}). Therefore, R₁ ≤ r²/(τ(Cc₀)^{1/2}).

Turning to R₂, by symmetrization, the contraction theorem for Rademacher processes, and de-symmetrization (see e.g. [5, Section 11.3]), it is evident that

E[R₂] ≤ 2 E[ sup_{x∈S*_r} | (1/n) Σ_{j=1}^{n} ε_j |∆̄_x(j)| | ] ≤ 4 E[ sup_{x∈S*_r} | (1/n) Σ_{j=1}^{n} ε_j ∆̄_x(j) | ] ≤ 8 E[ sup_{x∈S*_r} | (1/n) Σ_{j=1}^{n} ∆̄_x(j) | ].

Moreover, by the definition of ∆̄_x(j),

E[R₂] ≤ 8 E[ sup_{x∈S*_r} | (1/N) Σ_{i=1}^{N} ⟨∇F(x*, ξ_i) − E[∇F(x*, ξ)], x − y⟩ | ]. (5.12)

Recall that ∥·∥ = ∥H^{1/2}·∥₂ and that ∥x − y∥ ≤ 2r. Therefore, for any z ∈ R^d,

sup_{x∈S*_r} |⟨z, x − y⟩| = sup_{x∈S*_r} |⟨H^{-1/2}z, H^{1/2}(x − y)⟩| ≤ 2r ∥H^{-1/2}z∥₂.

Together with the linearity of z ↦ ⟨z, x − y⟩ and (5.12),

E[R₂] ≤ 16r E[ ∥ (1/N) Σ_{i=1}^{N} H^{-1/2}( ∇F(x*, ξ_i) − E[∇F(x*, ξ)] ) ∥₂ ] ≤ 16r ( trace(Cov[H^{-1/2}∇F(x*, ξ)])/N )^{1/2}.

It remains to observe that

trace(Cov[H^{-1/2}∇F(x*, ξ)]) = trace(H^{-1/2}GH^{-1/2}) = trace(H^{-1}G).

Since N ≥ c₀N_G(r) ≡ c₀ trace(H^{-1}G)/r², it is evident that E[R₂] ≤ 16r²/√c₀, and combining the two estimates,

E[Ψ] ≤ (32/r²) ( r²/(τ(Cc₀)^{1/2}) + 16r²/√c₀ ) ≤ τ/2,

where the last inequality holds as soon as c₀ is large enough. □

Proof of Proposition 5.1, multiplier part.
Let S̄*_r be the cover defined in Lemma 5.10. By Lemma 5.9 (applied with τ/2 in place of τ), with probability at least 1 − exp(−cτ²n), for every y ∈ S̄*_r, we have that M_{y,x*}(j) ≥ −r²/32 on more than (1 − τ/2)n blocks j. Moreover, by Lemma 5.11 (again with τ/2), with probability at least 1 − exp(−cτ²n), for every x ∈ S*_r there is y ∈ S̄*_r such that M_{x,x*}(j) ≥ M_{y,x*}(j) − r²/32 on more than (1 − τ/2)n blocks j. Combining the two estimates, it follows that, with probability at least 1 − 2exp(−cτ²n), for every x ∈ S*_r,

M_{x,x*}(j) ≥ −r²/16 on more than (1 − τ)n blocks j,

which is exactly what we wanted to show. □

Prediction error.
In this section we shall prove Proposition 2.15, dealing with the prediction error. To that end, let

U*_r := { x ∈ B*_r : f(x) ≥ f(x*) + 2c_H r² }

be the set of all x ∈ B*_r that are in an "unfavorable position". Let us stress again that if x* lies in the interior of X, then U*_r is empty and the estimate on the prediction error holds automatically. We therefore assume that U*_r is not empty. The proof of Proposition 2.15 relies on the convexity of F: for any x, y ∈ X and j, we have that

f̂_{I′_j}(x) − f̂_{I′_j}(y) ≥ (1/m) Σ_{i∈I′_j} ⟨∇F(y, ξ_i), x − y⟩ =: M′_{x,y}(j).

In particular, this implies that
(i) x* wins its home match against x if M′_{x,x*}(j) ≥ −c_H r² on more than n/2 blocks j;
(ii) x does not win its home match against x* if M′_{x,x*}(j) > c_H r² on more than n/2 blocks j.

Thus, Proposition 2.15 is a consequence of the following lemma, which we shall prove in this section.
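The convexity inequality that drives the home-match criteria can be checked numerically. A minimal sketch with a toy quadratic loss (all names are illustrative; this is not the paper's procedure):

```python
import numpy as np

# Numeric check of the convexity inequality behind the home-match criteria:
# for the convex loss F(x, xi) = (x - xi)^2 / 2, every block objective satisfies
#   f_j(x) - f_j(y) >= (1/m) * sum_{i in I_j} <grad F(y, xi_i), x - y> =: M'_{x,y}(j).
rng = np.random.default_rng(0)
xi = rng.normal(size=(10, 30))  # n = 10 blocks I_j, each with m = 30 sample points

def f_blocks(x):
    # empirical objective on each block: (1/m) sum_{i in I_j} F(x, xi_i)
    return 0.5 * ((x - xi) ** 2).mean(axis=1)

def M_blocks(x, y):
    # multiplier term: grad F(y, xi) = (y - xi), paired with the direction x - y
    return ((y - xi) * (x - y)).mean(axis=1)

x, y = 1.3, -0.4
gap = f_blocks(x) - f_blocks(y) - M_blocks(x, y)  # nonnegative on every block
```

For the quadratic loss the first-order bound is exact up to the curvature term, so the gap equals ½(x − y)² on every block; for a general convex loss it is merely nonnegative.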
Lemma 5.12.
There is a constant c such that the following holds. With probability at least 1 − exp(−cτ²n), for every x ∈ B*_r and every y ∈ U*_r we have that

|{ j : M′_{x,x*}(j) ≥ −c_H r² }| ≥ (1 − τ)n, (5.13)
|{ j : M′_{y,x*}(j) > c_H r² }| ≥ (1 − τ)n. (5.14)

Just as in the analysis of the estimation error, Proposition 2.15 is an easy consequence of Lemma 5.12:

Proof of Proposition 2.15.
By Proposition 2.14 we have that with probability at least 1 − exp(−cτ²n), x* ∈ X̃*_N and X̃*_N ⊆ B*_r. In the following we argue conditionally on that high probability event, and recall that τ < 1/2. By Lemma 5.12, with probability at least 1 − exp(−cτ²n), we have that x* wins its home match against every competitor in B*_r (in particular, against every competitor in X̃*_N). Moreover, on the same event, every element in X̃*_N that is in an unfavorable position loses its home match against x*. Thus, x* ∈ X̂*_N and X̂*_N ⊆ B*_r \ U*_r, which is exactly what we wanted to show. □

The first part of Lemma 5.12 (namely (5.13)) is an immediate consequence of Lemma 5.1. Thus, in the following, we focus on the second part of Lemma 5.12 (namely (5.14)), dealing with U*_r. Our starting point is the following observation:

Lemma 5.13.
Let x ∈ U*_r. Then we have that ⟨∇f(x*), x − x*⟩ ≥ c_H r².

Proof. Since x ∈ U*_r we have that 2c_H r² ≤ f(x) − f(x*). A Taylor expansion around x* shows that there is a midpoint z such that

f(x) − f(x*) = ⟨∇f(x*), x − x*⟩ + ½⟨∇²f(z)(x − x*), x − x*⟩ ≤ ⟨∇f(x*), x − x*⟩ + c_H r².

Combining the two estimates yields ⟨∇f(x*), x − x*⟩ ≥ c_H r², as claimed. □
Thanks to Lemma 5.13, the proof of the second part of Lemma 5.12 (namely (5.14)) follows the same path as the proof of the bound on the multiplier term in the context of the estimation error. Thus, we shall only sketch the argument, for the sake of completeness.
Lemma 5.14.
There is an absolute constant C such that, for every x ∈ U*_r, with probability at least 1 − exp(−Cτ²n), we have that

|{ j : M′_{x,x*}(j) ≥ c_H r² }| ≥ (1 − τ)n. (5.15)

In particular, the statement holds uniformly for sets Ū*_r ⊆ U*_r whose cardinality satisfies log(|Ū*_r|) ≤ Cτ²n.

Proof. Fix x ∈ U*_r. By Lemma 5.13, we have E[⟨∇F(x*, ξ), x − x*⟩] ≥ c_H r². Thus, exactly as in the proof of Lemma 5.9, we conclude that

P[ M′_{x,x*}(j) ≤ c_H r² ] ≤ √θ/c_H ≤ √θ ≤ τ

(as long as θ is small enough; the second inequality holds because c_H ≥ 1), and the claim follows with probability at least 1 − exp(−Cτ²n). □

Lemma 5.15.
Let C be the absolute constant of Lemma 5.14. Then there is an absolute constant C′ and a set Ū*_r ⊆ U*_r whose cardinality satisfies log(|Ū*_r|) ≤ Cτ²n, such that the following holds. Let

ρ := ( trace(H^{-1}G)/(C′τ²n) )^{1/2}.

For every x ∈ U*_r there is y = y(x) ∈ Ū*_r with

⟨G(x − y), x − y⟩^{1/2} ≤ 2ρr, and (5.16)
⟨∇f(x*), x − y⟩ ≥ 0. (5.17)

Proof. Just as in Lemma 5.10, we can construct a set B̃*_r ⊆ B*_r satisfying (5.16) with 2ρr replaced by ρr. The modification of that set is again similar: for z ∈ B̃*_r, define

y(z) ∈ argmin{ ⟨∇f(x*), y⟩ : y ∈ U*_r s.t. ⟨G(y − z), y − z⟩^{1/2} ≤ ρr },

with the convention y(z) := y₀ for some fixed y₀ ∈ U*_r if the above set is empty. Then Ū*_r := { y(z) : z ∈ B̃*_r } satisfies the statement of the lemma. □

Finally, fix the set Ū*_r of Lemma 5.15 and for x ∈ U*_r denote by y = y(x) ∈ Ū*_r the best approximation in the cover constructed in Lemma 5.15.

Lemma 5.16.
There is an absolute constant C such that the following holds. With probability at least 1 − exp(−Cτ²n), for every x ∈ U*_r, we have that

|{ j : M′_{x,x*}(j) ≥ M′_{y,x*}(j) − c_H r² }| ≥ (1 − τ)n.

Proof. Recall that c_H ≥ 1. With this in hand, the argument is a repetition of the proof of Lemma 5.11 and is therefore omitted. □
Proof of Lemma 5.12.
Let us prove (5.13). Note that Lemma 5.1 (without any modifications in the proof) yields the following: with probability at least 1 − exp(−cτ²n), for all x ∈ B*_r, we have that M′_{x,x*}(j) ≥ −c_H r² on more than (1 − τ)n blocks j. In particular, this holds for U*_r ⊆ B*_r. As for (5.14), a combination of Lemma 5.14 and Lemma 5.16 shows that with probability at least 1 − 2exp(−cτ²n), for every x ∈ U*_r, we have that M′_{x,x*}(j) > c_H r² on more than (1 − τ)n blocks j. This completes the proof. □
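The arguments above repeatedly count the blocks j on which a block average clears a threshold. A minimal numeric sketch of that block statistic for a heavy-tailed sample (helper names are hypothetical, not the paper's procedure):

```python
import numpy as np

def block_means(values, n_blocks):
    """Split the sample into n_blocks (almost) equal blocks; return each block mean."""
    return np.array([b.mean() for b in np.array_split(np.asarray(values, float), n_blocks)])

def fraction_of_blocks_above(values, n_blocks, threshold):
    """Fraction of blocks j whose block mean is at least `threshold`."""
    return float(np.mean(block_means(values, n_blocks) >= threshold))

rng = np.random.default_rng(1)
sample = rng.standard_t(df=3, size=4000)  # heavy-tailed increments with mean zero
frac = fraction_of_blocks_above(sample, n_blocks=100, threshold=-0.5)
```

Even with only three finite moments, the vast majority of block means stay above the threshold, which is exactly what the (1 − τ)n-block counts in the proofs exploit.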
Proof under a deterministic lower bound of the Hessian.
To conclude this section, let us prove Theorem 2.13. There are only very few modifications needed in the proof of our main results, as we explain in what follows. In the proof of Theorem 2.9, the only place where the requirement

N ≥ c max{ d log(2d), N_H, N_{E,H} }

was used is in the statement of Lemma 5.1 pertaining to the quadratic term, namely (5.1). However, Assumption 2.12 clearly implies that, with probability 1, for every x ∈ S*_r, we have that

|{ j : ½Q_{x,x*}(j) ≥ εr² }| = n.

Thus, the only modification that is needed is to prove the part of Lemma 5.1 pertaining to the multiplier term (namely (5.2)) with εr² instead of r². Inspecting the proof, one readily sees that this is possible once the constants θ, τ, c, c₀, … are allowed to depend on ε.

6. Proofs for the portfolio optimization problem
The proof of Corollary 3.7.
The proof builds on several lemmas stated below, making the heuristic computations explained in the introduction rigorous. Throughout, we work under the assumptions made in Section 3.4. Note that

∇F(x, ξ) = −ℓ′(V_x)·X,  ∇²F(x, ξ) = ℓ′′(V_x)·X ⊗ X,

where we recall that V_x = −Y − ⟨X, x⟩ for x ∈ X. Finally, denote by

∥z∥²_X := E[⟨X, z⟩²] = ⟨Cov[X]z, z⟩ = ∥Cov[X]^{1/2}z∥²

the norm endowed by X. Note that ∥·∥_X is indeed a norm by the no-arbitrage condition.

Lemma 6.1.
There are constants c₁, c₂ > 0, depending only on L_X, v₁, v₂, such that c₁∥·∥_X ≤ ∥·∥ ≤ c₂∥·∥_X. In particular, ∥·∥ is a true norm.
Proof.
We start with the second inequality. Hölder's inequality (with exponent 3 and conjugate exponent 3/2) implies that

∥z∥² = E[ℓ′′(V_{x*})⟨X, z⟩²] ≤ E[ℓ′′(V_{x*})³]^{1/3} E[⟨X, z⟩³]^{2/3}

for every z ∈ R^d. The first term is bounded by v₁^{1/3}, and norm equivalence of X (see (3.3)) implies that the second term is bounded by L_X²∥z∥²_X.

We continue with the first inequality. The Paley–Zygmund inequality together with norm equivalence of X implies that

P[ ⟨X, z⟩² ≥ ½∥z∥²_X ] ≥ (1 − ½)² E[⟨X, z⟩²]²/E[⟨X, z⟩⁴] ≥ 1/(4L_X⁴) (6.1)

for every z ∈ R^d \ {0}. Moreover, setting ε := min{ ℓ′′(u) : |u| ≤ 8L_X⁴E[|V_{x*}|] }, an application of Markov's inequality shows that P[ℓ′′(V_{x*}) < ε] ≤ 1/(8L_X⁴). In combination with (6.1), we conclude that

∥z∥² ≡ E[ℓ′′(V_{x*})⟨X, z⟩²] ≥ ε∥z∥²_X/(16L_X⁴).

Finally, as noted previously, the no-arbitrage condition (3.4) immediately implies that ∥·∥_X is a true norm. Hence, by the above, ∥·∥ is a true norm as well. This completes the proof. □

Lemma 6.2.
Assumption 2.5 is satisfied with a constant L depending on L_X, v₁, v₂.

Proof. An application of Hölder's inequality (with exponents 3/2 and 3) implies that

E[⟨∇²F(x*, ξ)z, z⟩²] = E[ℓ′′(V_{x*})²⟨X, z⟩⁴] ≤ E[ℓ′′(V_{x*})³]^{2/3} E[⟨X, z⟩¹²]^{1/3}

for every z ∈ R^d. The first term is v₁^{2/3} by definition and, by norm equivalence of X (see (3.3)) and Lemma 6.1, the second term is bounded uniformly in { z ∈ R^d : ∥z∥ ≤ 1 } by a constant depending on L_X, v₁, v₂. □

Lemma 6.3.
Let L be the parameter from Assumption 2.5. Then σ² ≤ √L·σ̄² and N_G(r) ≤ √L·σ̄²·d/r².

Proof. We make the preliminary claim that Cov[∇F(x*, ξ)] ⪯ σ̄²√L·∇²f(x*). Indeed, for every z ∈ R^d,

⟨Cov[∇F(x*, ξ)]z, z⟩ ≤ E[ℓ′(V_{x*})²⟨X, z⟩²] = E[ (ℓ′(V_{x*})²/ℓ′′(V_{x*})) · ℓ′′(V_{x*})⟨X, z⟩² ] ≤ σ̄² E[(ℓ′′(V_{x*})⟨X, z⟩²)²]^{1/2},

where the last step follows from Hölder's inequality. Moreover, by Assumption 2.5,

E[(ℓ′′(V_{x*})⟨X, z⟩²)²]^{1/2} = E[⟨∇²F(x*, ξ)z, z⟩²]^{1/2} ≤ √L·∥z∥² = √L·⟨∇²f(x*)z, z⟩,

hence the preliminary claim follows. As both the largest singular value and the trace are monotone w.r.t. the positive semidefinite order, the statement of the lemma follows. □

We need the following simple auxiliary lemma. Recall that ∥·∥_* denotes the dual norm of ∥·∥.

Lemma 6.4.
There is a constant c depending on L_X, v₁ such that E[∥X∥_*⁶] ≤ c d³.

Proof. By Lemma 6.1 we have ∥·∥_X ≤ c′∥·∥ for a constant c′ depending on L_X, v₁. This immediately implies that ∥·∥_* ≤ c′∥·∥_{X,*}, where ∥·∥_{X,*} is the dual norm of ∥·∥_X. As the norm ∥·∥_X is endowed by Cov[X], its dual norm ∥·∥_{X,*} is endowed by Cov[X]^{-1}, whence

∥X∥_{X,*} = ∥Y∥₂ for Y := Cov[X]^{-1/2}X.

Applying Hölder's inequality (in its version for three random variables, with exponents 3, 3, 3) shows that

E[∥Y∥₂⁶] = Σ_{i,j,k=1}^{d} E[Y_i²Y_j²Y_k²] ≤ Σ_{i,j,k=1}^{d} E[Y_i⁶]^{1/3} E[Y_j⁶]^{1/3} E[Y_k⁶]^{1/3}.

It remains to note that Y satisfies the same norm equivalence as X does, and therefore, denoting by e_i the i-th standard Euclidean unit vector,

E[Y_i⁶]^{1/3} = E[⟨Y, e_i⟩⁶]^{1/3} ≤ L_X² E[⟨Y, e_i⟩²] = L_X² Cov[Y]_{ii} = L_X².

Combining everything completes the proof. □
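The whitening step in the proof above, passing to Y := Cov[X]^{-1/2}X, can be sketched numerically; a Cholesky factor gives one valid choice of the inverse square root (names below are illustrative):

```python
import numpy as np

# Whitening as in the proof of Lemma 6.4: pass from X to Y with Cov[Y] = I.
rng = np.random.default_rng(2)
d = 4
A = rng.normal(size=(d, d)) + d * np.eye(d)   # mixing matrix, full rank
X = rng.normal(size=(20000, d)) @ A.T         # rows are samples with Cov ~ A A^T

cov = np.cov(X, rowvar=False)                 # empirical covariance of X
L_chol = np.linalg.cholesky(cov)              # cov = L L^T
Y = X @ np.linalg.inv(L_chol).T               # Y_i = L^{-1} X_i, so Cov[Y] = I
cov_Y = np.cov(Y, rowvar=False)
```

Any W with W·Cov[X]·Wᵀ = I whitens; the symmetric square root Cov[X]^{-1/2} used in the text is another such choice.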
Lemma 6.5.
There is a constant c > 0 depending on L_X, v₁ such that, for every x, y ∈ B*, we have that

P[ E_H(x) ≥ ¼∥x − x*∥² ] ≤ c v_{E_H} · ∥x − x*∥, (6.2)
∥∇²F(x, ξ) − ∇²F(y, ξ)∥_op ≤ ∥x − y∥ · K(ξ), (6.3)

where K(ξ) satisfies E[K(ξ)] ≤ c v_K d^{3/2}.

Proof. As a preliminary observation, note that a Taylor expansion gives

⟨(∇²F(x, ξ) − ∇²F(y, ξ))z, z⟩ = ℓ′′′(V_{x+t(y−x)})⟨X, x − y⟩⟨X, z⟩² (6.4)

for every z ∈ R^d, where t ∈ [0, 1] is some number depending on x, y, ξ.

We start by proving (6.2). For every x ∈ B*, by (6.4) and the definition of E_H, we have that

E_H(x) ≡ sup_{t∈[0,1]} |⟨(∇²F(x* + t(x − x*), ξ) − ∇²F(x*, ξ))(x − x*), x − x*⟩| ≤ sup_{t∈[0,1]} |ℓ′′′(V_{x*+t(x−x*)})| · |⟨X, x − x*⟩|³.

In particular, applying Hölder's inequality, the norm equivalence of X from (3.3), and Lemma 6.1, we obtain

E[E_H(x)] ≤ v_{E_H} E[⟨X, x − x*⟩⁶]^{1/2} ≤ v_{E_H} L_X³∥x − x*∥³_X ≤ c v_{E_H} ∥x − x*∥³

for a constant c depending on L_X, v₁. The claim (6.2) therefore follows from Markov's inequality.

We now prove (6.3). The definition of the operator norm together with the inequality |⟨X, x − y⟩| ≤ ∥X∥_*∥x − y∥ (which holds by definition of the dual norm ∥·∥_*) shows that (6.4) implies

∥∇²F(x, ξ) − ∇²F(y, ξ)∥_op ≤ sup_{x̃,ỹ∈B*, t∈[0,1]} |ℓ′′′(V_{x̃+t(ỹ−x̃)})| · ∥X∥_*³ · ∥x − y∥ =: K(ξ) · ∥x − y∥.

This proves (6.3). It remains to control the expectation of K(ξ). For every x̃, ỹ ∈ B* and t ∈ [0, 1], x̃ + t(ỹ − x̃) ∈ B* by convexity of X, whence K(ξ) ≤ sup_{x∈B*}|ℓ′′′(V_x)| · ∥X∥_*³. Therefore Hölder's inequality and the definition of v_K imply that

E[K(ξ)] ≤ v_K E[∥X∥_*⁶]^{1/2}.

By Lemma 6.4, the last term is bounded by c d^{3/2} for a constant c depending on L_X, v₁. This completes the proof. □

Lemma 6.6.
There is a constant c depending on L_X, v₁ such that c_H ≤ c v_{E_H}.

Proof. We need to show that ∥E[∇²F(x, ξ)]∥_op ≤ c v_{E_H} for every x ∈ B*. Fix such x. From the Taylor expansion (6.4) we get

∇²F(x, ξ) = ∇²F(x*, ξ) + ℓ′′′(V_{x*+t(x−x*)})⟨X, x − x*⟩ X ⊗ X

for some t ∈ [0, 1] which depends on x and ξ. The expectation of the first term equals ∇²f(x*), and the operator norm of the latter equals 1. To estimate the operator norm of the expectation of the second term, let z ∈ R^d with ∥z∥ ≤ 1. Then Hölder's inequality (in its version for three random variables, with exponents 2, 6, 3) implies

E[ℓ′′′(V_{x*+t(x−x*)})⟨X, x − x*⟩⟨X, z⟩²] ≤ v_{E_H} · E[⟨X, x − x*⟩⁶]^{1/6} · E[⟨X, z⟩⁶]^{1/3} ≤ v_{E_H} · L_X∥x − x*∥_X · L_X²∥z∥²_X,

where the last inequality follows from norm equivalence of X from (3.3). Finally, recalling that ∥x − x*∥ ≤ 1 and ∥z∥ ≤ 1, the proof is completed by an application of Lemma 6.1, which states that ∥·∥_X ≤ c′∥·∥ for a constant c′ depending on L_X, v₁. □

We are now ready for the
Proof of Corollary 3.7.
Regarding Assumption 2.3: convexity and differentiability are clearly satisfied, and integrability holds by assumption. Moreover, by Lemma 6.1, ∥·∥ is a true norm. Assumption 2.5 follows from Lemma 6.2. Lemma 6.5 shows that Assumption 2.7 is satisfied for α = 1 and

r := min{ 1, c/v_{E_H} },

where c is a constant depending on the constant of Lemma 6.5 and the parameter L of Assumption 2.5; hence c depends on L_X, v₁, v₂. Moreover, Lemma 6.5 also gives N_{E,H} ≤ c′d log(dv_K + 2) for a constant c′ depending on L_X, v₁. The parameters N_G(r), σ², and c_H are bounded in Lemma 6.3 and Lemma 6.6, respectively. Combining everything completes the proof. □

The proof of Corollary 2.10.
First note that the σ̄ of Corollary 2.10 and the σ̄ of Corollary 3.7 coincide. Moreover, as X is Gaussian with non-degenerate covariance matrix, the no-arbitrage condition (3.4) readily follows. Further recall the well-known Gaussian norm equivalence

E[⟨X, z⟩^p]^{1/p} ≤ C√p · E[⟨X, z⟩²]^{1/2}

for every z ∈ R^d and every p ≥ 2, where C is an absolute constant. In particular (3.3) holds, and our assumption that U(2Y) is integrable implies part (c) of Assumption 2.3. Along the same lines as Gaussian norm equivalence, the following holds.

Lemma 6.7.
There is a constant c depending on L and v ≡ E[|V_{x*}|] such that

E[∥X∥_*¹²] ≤ c d⁶,  E[exp(4∥X∥_*)] ≤ exp(c√d),
E[⟨X, x − x*⟩¹²] ≤ c∥x − x*∥¹²,  E[exp(4|⟨X, x − x*⟩|)] ≤ c

for every x ∈ B*.

Proof. The two statements involving ∥X∥_* follow from similar arguments as given in Lemma 6.4, noting that Cov[X]^{-1/2}X is standard Gaussian. The two statements involving ⟨X, x − x*⟩ follow from Gaussian norm equivalence and the bound ∥·∥_X ≤ c′∥·∥ from Lemma 6.1, for a constant c′ depending on L, v. □

Proof of Corollary 2.10.
The only modifications needed pertain to Lemma 6.5 and Lemma 6.6, where terms can be simplified due to special features of the exponential function.
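For orientation before the details: with the exponential loss, the objective is F(x, ξ) = exp(V_x) with V_x = −Y − ⟨X, x⟩, so the sample-average gradient is −exp(V_x)·X and the Hessian is exp(V_x)·X ⊗ X, as recorded at the start of Section 6. A hedged numeric sketch (hypothetical function names, synthetic data; not the paper's code):

```python
import numpy as np

def saa_grad_hess(x, X, Y):
    """Sample-average gradient and Hessian of x -> mean exp(-Y - <X, x>)."""
    V = -Y - X @ x          # V_x for each sample
    w = np.exp(V)           # exp loss: l' = l'' = exp
    grad = -(w[:, None] * X).mean(axis=0)
    hess = (w[:, None, None] * X[:, :, None] * X[:, None, :]).mean(axis=0)
    return grad, hess

rng = np.random.default_rng(3)
N, d = 2000, 3
X = rng.normal(0.05, 0.2, size=(N, d))   # stand-in asset returns
Y = rng.normal(0.0, 0.1, size=N)         # stand-in benchmark
grad, hess = saa_grad_hess(np.zeros(d), X, Y)
```

The sample Hessian is an average of positive multiples of rank-one matrices X ⊗ X, hence positive semidefinite, which is the convexity used throughout the proofs.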
We start with Lemma 6.5. As ℓ′′′ = exp is increasing and as exp(a + b) = exp(a)exp(b) for a, b ∈ R, we may use Remark 3.6 to get

E_H(x) ≤ exp(V_{x*}) · exp(|⟨X, x − x*⟩|) · |⟨X, x − x*⟩|³,
∥∇²F(x, ξ) − ∇²F(y, ξ)∥_op ≤ exp(V_{x*}) · exp(∥X∥_*) · ∥X∥_*³ · ∥x − y∥ =: K(ξ)∥x − y∥.

It remains to bound the expectation of all terms. To that end, applying Hölder's inequality (in its version for three random variables, with exponents 2, 4, 4) and Lemma 6.7 gives

E[E_H(x)] ≤ E[exp(2V_{x*})]^{1/2} E[exp(4|⟨X, x − x*⟩|)]^{1/4} E[⟨X, x − x*⟩¹²]^{1/4} ≤ σ̄²c∥x − x*∥³, (6.5)
E[K(ξ)] ≤ E[exp(2V_{x*})]^{1/2} E[exp(4∥X∥_*)]^{1/4} E[∥X∥_*¹²]^{1/4} ≤ σ̄²exp(c√d), (6.6)

for a constant c depending on L, v. In particular, (6.5) shows that Assumption 2.7 is satisfied for

r := min{ 1, c/σ̄² },

where c is a constant depending on L, v. In combination with (6.6), this implies that

N_{E,H} ≤ d log(r σ̄²exp(c√d)) ≤ d log(c exp(c√d)) ≤ c′d^{3/2}

for a constant c′ depending on L, v. Similar (but simpler) arguments show that c_H can be bounded in terms of L, v. This completes the proof. □

Concluding remarks
Remark 7.1 (Dependence of the procedure on parameters). All the parameters (e.g., σ², ∥·∥, L, N_G, etc.) that are required in the formulation of Theorem 2.9 (the only parameters needed in the definition of the procedure x̂*_N are c_H, L, σ² and r) depend on the unknown optimal action x*. While an a priori knowledge of the parameters seems unrealistic, there are various ways around this problem. It should be stressed that any type of estimate on these parameters suffices to ensure that the procedure performs well. For example, finding some σ̂² such that σ̂² ≤ σ² ≤ 2σ̂² is enough for our purposes, and estimating σ² within a constant multiplicative factor is a considerably simpler task than the ones we have to deal with in the analysis of the procedure x̂*_N. Alternatively, one can replace the parameters with the (local) worst-case scenario; for example, instead of σ², to consider

σ̄² := sup_{x close to x*} σ²(x),

where σ²(x) is defined just as σ² but with gradient and Hessian evaluated at x rather than at x*. Finally, it is much simpler to test whether a solution is a good one than to produce a candidate. Therefore, one can increase the sample size and test the candidates that are produced. Once the sample size passes the critical threshold from Theorem 2.9, a good candidate will be identified. All of these are standard methods and there are plenty of other alternatives to tackle such issues. We shall not pursue these aspects further in this article.
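As an illustration of the remark's first suggestion, a quantity of the form trace(H^{-1}G), which drives the threshold N_G(r), admits a simple plug-in surrogate from gradient samples at a pilot point. A minimal sketch (hypothetical names; not the paper's procedure, and only accurate up to the constant factors the remark says are sufficient):

```python
import numpy as np

def trace_plugin(grad_samples, hess_estimate):
    """Plug-in surrogate for trace(H^{-1} G): G replaced by the sample covariance
    of gradient samples at a pilot point, H by an estimated Hessian there."""
    G_hat = np.cov(np.asarray(grad_samples, dtype=float), rowvar=False)
    return float(np.trace(np.linalg.solve(hess_estimate, G_hat)))

rng = np.random.default_rng(4)
d = 3
grad_samples = rng.normal(size=(5000, d))        # stand-in gradients with Cov ~ I
estimate = trace_plugin(grad_samples, np.eye(d)) # so trace(H^{-1} G) ~ d
```

With 5000 samples the estimate concentrates near the true value d = 3, which is more than enough precision for a constant-factor bound.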
Remark 7.2 (On the integrability of the Hessian). In the course of the proof of Theorem 2.9, the only place Assumption 2.5 was used was in Lemma 5.3, and there it was used twice. Firstly, by Remark 4.3, Assumption 2.5 guarantees the existence of three constants s₀, s₁, s₂ > 0, depending only on L, such that, for every m ≥ 1,

∇²F(x*, ξ) satisfies a stable lower bound with parameters (m, s₀, s₁m, s₂m). (7.1)

(In fact, one can choose s₀ = 1, as we did for notational purposes.) On the other hand, by Lemma 4.6 and Lemma 4.7, Assumption 2.5 implies that the minimal sample size from Theorem 4.4,

N_H := max{ ∥E[(A₀^{-1/2}AA₀^{-1/2})²]∥_op , log( log(3d)·E[∥A₀^{-1/2}AA₀^{-1/2}∥²_op] / ∥E[(A₀^{-1/2}AA₀^{-1/2})²]∥_op ) }

(where A := ∇²F(x*, ξ) and A₀ := ∇²f(x*)), can be bounded by c·d log(2d) for a constant c depending only on L. In particular, we see that Theorem 2.9 remains valid if Assumption 2.5 is replaced by assumption (7.1) together with the requirement that the sample size exceeds c N_H (in that case the constants c₀, c₁, c₂ appearing in Theorem 2.9 depend on s₀, s₁, s₂).

Acknowledgements:
Daniel Bartl is grateful for financial support through the Vienna Science and Technology Fund (WWTF) project MA16-021 and the Austrian Science Fund (FWF) project P28661. Part of this work was conducted while Shahar Mendelson was visiting the Faculty of Mathematics, University of Vienna, and the Erwin Schrödinger Institute, Vienna. He would also like to thank Jungo Connectivity for its support.
References

[1] R. Adamczak, A. Litvak, A. Pajor, and N. Tomczak-Jaegermann. Quantitative estimates of the convergence of the empirical covariance matrix in log-concave ensembles. Journal of the American Mathematical Society, 23(2):535–561, 2010.
[2] D. Banholzer, J. Fliege, and R. Werner. On almost sure rates of convergence for sample average approximations. SIAM Journal on Optimization, 2017.
[3] D. Bartl and L. Tangpi. Non-asymptotic rates for the estimation of risk measures. arXiv preprint arXiv:2003.10479, 2020.
[4] D. Bertsimas, V. Gupta, and N. Kallus. Robust sample average approximation. Mathematical Programming, 171(1-2):217–282, 2018.
[5] S. Boucheron, G. Lugosi, and P. Massart. Concentration inequalities: A nonasymptotic theory of independence. Oxford University Press, 2013.
[6] Y. Cherapanamjeri, N. Tripuraneni, P. L. Bartlett, and M. I. Jordan. Optimal mean estimation without a variance. arXiv preprint arXiv:2011.12433, 2020.
[7] H. Föllmer and A. Schied. Stochastic finance: an introduction in discrete time. Walter de Gruyter, 2011.
[8] H. Ghodrati and Z. Zahiri. A Monte Carlo simulation technique to determine the optimal portfolio. Management Science Letters, 4(3):465–474, 2014.
[9] V. Guigues, A. Juditsky, and A. Nemirovski. Non-asymptotic confidence bounds for the optimal value of a stochastic program. Optimization Methods and Software, 32(5):1033–1058, 2017.
[10] T. Homem-de Mello and G. Bayraksan. Monte Carlo sampling-based methods for stochastic optimization. Surveys in Operations Research and Management Science, 19(1):56–85, 2014.
[11] S. B. Hopkins. Mean estimation with sub-gaussian rates in polynomial time. Annals of Statistics, 48(2):1193–1213, 2020.
[12] S. Kim, R. Pasupathy, and S. G. Henderson. A guide to sample average approximation. pages 207–243, 2015.
[13] A. J. Kleywegt, A. Shapiro, and T. Homem-de Mello. The sample average approximation method for stochastic discrete optimization. SIAM Journal on Optimization, 12(2):479–502, 2002.
[14] V. Koltchinskii and S. Mendelson. Bounding the smallest singular value of a random matrix without concentration. International Mathematics Research Notices, 2015(23):12991–13008, 2015.
[15] G. Lecué and S. Mendelson. Regularization and the small-ball method I: sparse recovery. The Annals of Statistics, 46(2):611–641, 2018.
[16] M. Ledoux and M. Talagrand. Probability in Banach Spaces: isoperimetry and processes. Springer Science & Business Media, 2013.
[17] G. Lugosi and S. Mendelson. Mean estimation and regression under heavy-tailed distributions: A survey. Foundations of Computational Mathematics, 19(5):1145–1190, 2019.
[18] G. Lugosi and S. Mendelson. Near-optimal mean estimators with respect to general norms. Probability Theory and Related Fields, 175(3-4):957–973, 2019.
[19] G. Lugosi and S. Mendelson. Sub-Gaussian estimators of the mean of a random vector. Annals of Statistics, 47(2):783–794, 2019.
[20] G. Lugosi and S. Mendelson. Risk minimization by median-of-means tournaments. Journal of the European Mathematical Society, 22(3):925–965, 2020.
[21] H. Markowitz. Portfolio selection. The Journal of Finance, 7:77–91, 1952.
[22] S. Mendelson. Learning without concentration. In Conference on Learning Theory, pages 25–39, 2014.
[23] S. Mendelson. On aggregation for heavy-tailed classes. Probability Theory and Related Fields, 168(3-4):641–674, 2017.
[24] S. Mendelson. An unrestricted learning procedure. Journal of the ACM (JACM), 66(6):1–42, 2019.
[25] S. Mendelson. Extending the scope of the small-ball method. Studia Mathematica, 2020+.
[26] S. Mendelson. On the geometry of random polytopes. In Geometric Aspects of Functional Analysis, pages 187–198. Springer, 2020.
[27] A. Nemirovsky. Problem complexity and method efficiency in optimization.
[28] R. I. Oliveira and P. Thompson. Sample average approximation with heavier tails I: non-asymptotic bounds with weak assumptions and stochastic constraints. arXiv preprint arXiv:1705.00822, 2017.
[29] R. I. Oliveira and P. Thompson. Sample average approximation with heavier tails II: localization in stochastic convex optimization and persistence results for the Lasso. arXiv preprint arXiv:1711.04734, 2017.
[30] G. C. Pflug. Stochastic programs and statistical data. Annals of Operations Research, 85:59–78, 1999.
[31] G. C. Pflug. Stochastic optimization and statistical inference. Handbooks in Operations Research and Management Science, 10:427–482, 2003.
[32] W. Römisch. Stability of stochastic programming problems. Handbooks in Operations Research and Management Science, 10:483–554, 2003.
[33] M. Rudelson. Random vectors in the isotropic position. Journal of Functional Analysis, 164(1):60–72, 1999.
[34] A. Shapiro. Monte Carlo sampling methods. Handbooks in Operations Research and Management Science, 10:353–425, 2003.
[35] A. Shapiro, D. Dentcheva, and A. Ruszczyński. Lectures on stochastic programming: modeling and theory. SIAM, 2014.
[36] N. Srivastava and R. Vershynin. Covariance estimation for distributions with 2 + ε moments. The Annals of Probability, 41(5):3081–3111, 2013.
[37] M. Talagrand. Sharper bounds for Gaussian and empirical processes. The Annals of Probability, pages 28–76, 1994.
[38] M. Talagrand. Upper and lower bounds for stochastic processes: modern methods and classical problems, volume 60. Springer Science & Business Media, 2014.
[39] K. Tikhomirov. Sample covariance matrices of heavy-tailed distributions. International Mathematics Research Notices, 2018(20):6254–6289, 2018.
[40] J. Tropp. The expected norm of a sum of independent random matrices: An elementary approach. pages 173–202, 2016.
[41] S. Vogel. Universal confidence sets for solutions of optimization problems. SIAM Journal on Optimization, 19(3):1467–1488, 2008.
[42] Q. Wang and H. Sun. Sparse Markowitz portfolio selection by using stochastic linear complementarity approach. Journal of Industrial & Management Optimization, 14(2):541, 2018.
[43] S. Weber. Distribution-invariant risk measures, entropy, and large deviations. Journal of Applied Probability, 44(1):16–40, 2007.
[44] H. Xu and D. Zhang. Monte Carlo methods for mean-risk optimization and portfolio selection. Computational Management Science, 9(1):3–29, 2012.
[45] P. Yaskov. Sharp lower bounds on the least singular value of a random matrix without the fourth moment condition. Electronic Communications in Probability, 20, 2015.
Department of Mathematics, University of Vienna, Austria
Email address: [email protected]

Mathematical Sciences Institute, The Australian National University, Canberra, Australia
Email address: