Sample Complexity of Sample Average Approximation for Conditional Stochastic Optimization
YIFAN HU∗, XIN CHEN∗, AND NIAO HE∗

∗Department of Industrial and Enterprise Systems Engineering (ISE), University of Illinois at Urbana-Champaign (UIUC), Urbana, IL ([email protected], [email protected], [email protected]).

Abstract.
In this paper, we study a class of stochastic optimization problems, referred to as Conditional Stochastic Optimization (CSO), in the form $\min_{x\in\mathcal{X}} \mathbb{E}_{\xi}\big[f_\xi\big(\mathbb{E}_{\eta|\xi}[g_\eta(x,\xi)]\big)\big]$, which finds a wide spectrum of applications, including portfolio selection, reinforcement learning, robust learning, causal inference, and so on. Assuming availability of samples from the distribution $P(\xi)$ and samples from the conditional distribution $P(\eta|\xi)$, we establish the sample complexity of the sample average approximation (SAA) for CSO under a variety of structural assumptions, such as Lipschitz continuity, smoothness, and error bound conditions. We show that the total sample complexity improves from $O(d/\epsilon^4)$ to $O(d/\epsilon^3)$ when assuming smoothness of the outer function, and further to $O(1/\epsilon^2)$ when the empirical function satisfies the quadratic growth condition. We also establish the sample complexity of a modified SAA when $\xi$ and $\eta$ are independent. Several numerical experiments further support our theoretical findings.

Key words. stochastic optimization, sample average approximation, large deviations theory
AMS subject classifications.
1. Introduction.
Decision-making in the presence of uncertainty has been a fundamental and long-standing challenge in many fields of science and engineering. In recent years, extensive research efforts have been devoted to the design and theory of efficient algorithms for solving the classical stochastic optimization (SO) problem

(1.1)  $\min_{x\in\mathcal{X}} F(x) := \mathbb{E}_{\xi}[f(x,\xi)]$,

ranging from convex to non-convex objectives, from first-order to second-order methods, and from sub-linear to linear convergent algorithms; see, e.g., [5] and references therein for a comprehensive survey. Here $\mathcal{X}\subseteq\mathbb{R}^d$ is the decision set and $f(x,\xi)$ is some cost function dependent on the random vector $\xi$. In general, (1.1) cannot be computed analytically or solved exactly, even when the underlying distribution of the random vector $\xi$ is known, and one has to resort to Monte Carlo sampling techniques. An important Monte Carlo method, the sample average approximation (SAA, also known as empirical risk minimization in the machine learning community), is widely used to solve (1.1), assuming availability of samples from the underlying distribution. SAA works by solving an approximation of the original problem:

(1.2)  $\min_{x\in\mathcal{X}} \hat F_n(x) := \frac{1}{n}\sum_{i=1}^{n} f(x,\xi_i)$,

where $\xi_1,\dots,\xi_n$ are i.i.d. samples generated from the distribution of $\xi$. Note that $\hat F_n(x)$ converges pointwise to $F(x)$ with probability 1 as $n$ goes to infinity. Finite-sample convergence of SAA for SO has been well established. The seminal work [20] proved that for general Lipschitz continuous objectives, SAA requires a sample complexity of $O(d/\epsilon^2)$ to obtain an $\epsilon$-optimal solution to the stochastic optimization problem. [35] proved that for strongly convex and Lipschitz continuous objectives, the sample complexity of SAA is $O(1/\epsilon)$. Detailed results can be found in the books, e.g., [37] and [34].

More generally, SAA is also a popular computational tool for solving multi-stage stochastic programming (MSP) problems. In its general form, an MSP finds a sequence of decisions $\{x_t\}_{t=0}^{T}$ that minimizes the nested expectation

(1.3)  $\min_{x_0\in\mathcal{X}_0} f_0(x_0) + \mathbb{E}_{\xi_1}\Big[\inf_{x_1} f_1(x_1,\xi_1) + \mathbb{E}_{\xi_2|\xi_1}\big[\cdots + \mathbb{E}_{\xi_T|\xi_{T-1}}\big[\inf_{x_T} f_T(x_T,\xi_T)\big]\big]\Big]$,

where $T$ is the number of decision periods, $\xi_1,\dots,\xi_T$ can be considered as a random process, and the decision $x_t$ is a function of the history of the process up to time $t$. Similarly, the SAA approach works by first generating a large scenario tree with conditional sampling and then proceeding with stage-based or scenario-based decomposition methods [30, 32, 33]. When extended to the multi-stage case, the finite sample analysis indicates that the total number of samples, or scenarios, needed to achieve an $\epsilon$-optimal solution to the original problem (1.3) grows exponentially as the number of stages increases [38, 37].
In particular, for general three-stage stochastic problems, the sample complexity of SAA cannot be smaller than $O(d^2/\epsilon^4)$; this holds true even if the cost functions in all stages are linear and the random vectors are stage-wise independent, as discussed in [36].

In this paper, we study an intermediate class of problems, referred to as Conditional Stochastic Optimization (CSO), that sits in between the classical SO and the MSP. The problem of interest takes the following general form:

(1.4)  $\min_{x\in\mathcal{X}} F(x) := \mathbb{E}_{\xi}\big[f_\xi\big(\mathbb{E}_{\eta|\xi}[g_\eta(x,\xi)]\big)\big]$.

Here $\mathcal{X}$ is the domain of the decision variable $x\in\mathbb{R}^d$; $f_\xi(\cdot):\mathbb{R}^k\to\mathbb{R}$ is a continuous cost function dependent on the random vector $\xi$, and $g_\eta(\cdot,\xi):\mathbb{R}^d\to\mathbb{R}^k$ is a vector-valued continuous cost function dependent on both random vectors $\xi$ and $\eta$. The inner expectation is with respect to $\eta$ given $\xi$, and the outer expectation is with respect to $\xi$. Similar to the classical stochastic optimization, we do not assume any knowledge of the underlying distribution $P(\xi)$ nor of the conditional distribution $P(\eta|\xi)$. Instead, we assume availability of samples from the distribution $P(\xi)$ and samples from the conditional distribution $P(\eta|\xi)$ for any given $\xi$.

CSO is more general than the classical stochastic optimization as it captures dynamic randomness and involves a conditional expectation. It takes the SO as a special case when $g_\eta(x,\xi)$ is the identity function in $x$. On the other hand, it is less complicated than the MSP (in particular, the three-stage case), as it seeks a static decision and is not subject to non-anticipativity constraints.

The goal of this paper is to analyze the sample complexity of SAA for solving CSO, which can be constructed as follows based on conditional sampling:

(1.5)  $\min_{x\in\mathcal{X}} \hat F_{nm}(x) := \frac{1}{n}\sum_{i=1}^{n} f_{\xi_i}\Big(\frac{1}{m}\sum_{j=1}^{m} g_{\eta_{ij}}(x,\xi_i)\Big)$,

where $\{\xi_i\}_{i=1}^{n}$ are i.i.d. samples generated from $P(\xi)$ and $\{\eta_{ij}\}_{j=1}^{m}$ are i.i.d. samples generated from the conditional distribution $P(\eta|\xi_i)$ for a given outer sample $\xi_i$. We would like to examine the total number of samples $T = nm + n$ required for the SAA (1.5) to achieve an $\epsilon$-optimal solution to the original CSO problem (1.4).

We also consider a special case of the CSO problem (1.4), when the random vectors $\xi$ and $\eta$ are independent:

(1.6)  $\min_{x\in\mathcal{X}} F(x) := \mathbb{E}_{\xi}\big[f_\xi\big(\mathbb{E}_{\eta}[g_\eta(x,\xi)]\big)\big]$.

One could still approximate (1.6) by the SAA (1.5), mimicking the conditional sampling scheme and using a different set of samples $\{\eta_{i1},\dots,\eta_{im}\}$ from the distribution of $\eta$ for each $\xi_i$. However, since the inner expectation is no longer a conditional expectation, there is no necessity to estimate it with different realizations of $\eta$ for each $\xi_i$. Hence, an alternative way to approximate (1.6) is through a modified SAA:

(1.7)  $\min_{x\in\mathcal{X}} \hat F_{nm}(x) := \frac{1}{n}\sum_{i=1}^{n} f_{\xi_i}\Big(\frac{1}{m}\sum_{j=1}^{m} g_{\eta_{j}}(x,\xi_i)\Big)$,

where $\{\xi_i\}_{i=1}^{n}$ are i.i.d. samples generated from the distribution of $\xi$ and $\{\eta_j\}_{j=1}^{m}$ are i.i.d. samples generated from the distribution of $\eta$. As a result, the component functions $f_{\xi_i}\big(\frac{1}{m}\sum_{j=1}^{m} g_{\eta_j}(x,\xi_i)\big)$, $i=1,\dots,n$, become dependent since they share the same $\{\eta_j\}_{j=1}^{m}$, making (1.7) very different from (1.5). In this case, the total number of samples becomes $T = n + m$. We refer to this sampling scheme as independent sampling.
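To make the two sampling schemes concrete, the following minimal Python sketch builds the empirical objectives (1.5) and (1.7) for a toy instance; the quadratic outer cost, the additive inner cost, and all distributional parameters are illustrative assumptions, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 3  # decision dimension (illustrative)

# Toy cost functions (assumptions for illustration only):
# outer f_xi(u) = ||u||^2, inner g_eta(x, xi) = x + xi + eta (so k = d).
f = lambda u, xi: np.sum(u ** 2)
g = lambda x, xi, eta: x + xi + eta

def saa_conditional(x, n, m):
    """Empirical objective (1.5): for each outer sample xi_i, draw m
    conditional inner samples eta_ij ~ P(eta | xi_i)."""
    total = 0.0
    for _ in range(n):
        xi = rng.normal(size=d)                      # xi_i ~ P(xi)
        etas = xi + rng.normal(size=(m, d))          # eta_ij ~ P(eta | xi_i), a toy choice
        inner_mean = np.mean([g(x, xi, e) for e in etas], axis=0)
        total += f(inner_mean, xi)
    return total / n                                 # consumes T = n*m + n samples

def saa_independent(x, n, m):
    """Modified empirical objective (1.7): one shared pool {eta_j}
    reused for every xi_i (valid when xi and eta are independent)."""
    xis = rng.normal(size=(n, d))                    # xi_i ~ P(xi)
    etas = rng.normal(size=(m, d))                   # eta_j ~ P(eta), shared across i
    vals = []
    for xi in xis:
        inner_mean = np.mean([g(x, xi, e) for e in etas], axis=0)
        vals.append(f(inner_mean, xi))
    return np.mean(vals)                             # consumes T = n + m samples

x0 = np.zeros(d)
print(saa_conditional(x0, n=50, m=20), saa_independent(x0, n=50, m=20))
```

The only structural difference between the two estimators is whether the inner samples are redrawn per outer sample or shared, which is exactly what drives the different sample counts $T = nm + n$ versus $T = n + m$.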
Notably, CSO can be used to model a variety of applications, including portfolio selection [16], robust supervised learning [7], reinforcement learning [7, 8], personalized medical treatment [44], instrumental variable regression [27], and so on. We discuss some of these examples in detail below.
Robust Supervised Learning.
Incorporating priors on invariance and robustness into supervised learning procedures is crucial for computer vision and speech recognition [28, 3]. Taking image classification as an example, we would like to build a classifier that is both accurate and invariant to certain kinds of data transformation, such as rotation or perturbation. Let $\xi_1=(a_1,b_1),\dots,\xi_n=(a_n,b_n)$ be a set of input data, where $a_i$ is the feature vector and $b_i$ is the label. A plausible way to achieve such consistency is to consider the class of robust linear classifiers, say $f(x,x_0,\xi) = \mathbb{E}_{\eta|\xi\sim\mu(\sigma(a))}[x^\top\eta + x_0]$ for given image data $\xi$, obtained by averaging the prediction over all possible transformations $\sigma(a)$, and then to find the best fit by minimizing the regularized expected risk:

$\min_{(x,x_0)} \mathbb{E}_{\xi=(a,b)}\Big[\ell\big(b,\ \mathbb{E}_{\eta|\xi}[\eta^\top x + x_0]\big)\Big] + \nu\|x\|^2$.

Here $\ell(\cdot,\cdot)$ is some loss function, $\nu>0$ is a regularization parameter, and $\mu(\cdot)$ is a given distribution (e.g., uniform) over the transformations. Clearly, such problems belong to the category of CSO.
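As an illustration of how the inner conditional expectation arises from data augmentation, the short sketch below evaluates this robust objective with a logistic loss and small random rotations of two-dimensional features; the loss, the rotation model, and all parameters are assumptions made for this example only.

```python
import numpy as np

rng = np.random.default_rng(1)

def random_rotations(a, m, scale=0.1):
    """Draw m augmented copies of the 2-D feature vector a under small random
    rotations; this plays the role of sampling eta ~ P(eta | xi)."""
    thetas = rng.normal(0.0, scale, size=m)
    cos, sin = np.cos(thetas), np.sin(thetas)
    rots = np.stack([np.stack([cos, -sin], axis=1),
                     np.stack([sin,  cos], axis=1)], axis=1)   # shape (m, 2, 2)
    return rots @ a                                            # shape (m, 2)

def robust_risk(x, x0, data, m=32, nu=0.01):
    """SAA of the robust objective: average the linear prediction over the
    augmented copies (inner expectation), then apply a logistic loss."""
    risk = 0.0
    for a, b in data:                        # each (a_i, b_i) is one outer sample xi_i
        etas = random_rotations(a, m)        # inner conditional samples
        avg_pred = etas.mean(axis=0) @ x + x0
        risk += np.log1p(np.exp(-b * avg_pred))
    return risk / len(data) + nu * np.dot(x, x)

data = [(rng.normal(size=2), rng.choice([-1.0, 1.0])) for _ in range(100)]
print(robust_risk(np.array([1.0, -0.5]), 0.0, data))
```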
Reinforcement Learning.
Policy evaluation is a fundamental task in Markov decision processes and reinforcement learning. Consider a discounted Markov decision process characterized by the tuple $\mathcal{M} := (\mathcal{S},\mathcal{A},P,r,\gamma)$, where $\mathcal{S}$ is a finite state space, $\mathcal{A}$ is a finite action space, $P(s,a,s')$ represents the (unknown) transition probability from state $s$ to $s'$ given action $a$, $r(s,a):\mathcal{S}\times\mathcal{A}\to\mathbb{R}$ is a reward function, and
$\gamma\in(0,1)$ is a discount factor. Given a stochastic policy $\pi(a|s)$, the goal of policy evaluation is to estimate the value function $V^\pi(s) := \mathbb{E}\big[\sum_{k=0}^{\infty}\gamma^k r(s_k,a_k)\,\big|\,s_0=s\big]$ under the policy. It is well known that $V^\pi(\cdot)$ is a fixed point of the Bellman equation [1],

$V^\pi(s) = \mathbb{E}_{s'|a,s}\big[r(s,a) + \gamma V^\pi(s')\big]$.

To estimate the value function $V^\pi(s)$, one could resort to minimizing the mean squared Bellman error [40, 7], namely,

$\min_{V(\cdot):\mathcal{S}\to\mathbb{R}} \mathbb{E}_{s\sim\mu(\cdot),\,a\sim\pi(\cdot|s)}\Big[\big(r(s,a) - \mathbb{E}_{s'|a,s}[V(s) - \gamma V(s')]\big)^2\Big]$.

Here $\mu(\cdot)$ is the stationary distribution. This minimization problem can be viewed as a special case of CSO. Recently, [8] showed that finding the optimal policy can also be formulated as an optimization problem of a similar form by exploiting the smoothed Bellman optimality equation. Again, the resulting problem falls under the category of CSO.
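A minimal sketch of the resulting SAA for policy evaluation on a toy tabular MDP is given below: the outer samples are state-action pairs and, for each pair, several next states are drawn from the simulator, which plays the role of the conditional distribution $P(\eta|\xi)$. The MDP, the sample sizes, and the tabular parameterization are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
S, A, gamma = 5, 2, 0.9
P = rng.dirichlet(np.ones(S), size=(S, A))      # toy transition kernel P(s' | s, a)
r = rng.normal(size=(S, A))                     # toy reward table r(s, a)

def empirical_msbe(V, n=200, m=10):
    """SAA of the mean squared Bellman error under a uniform random policy:
    outer samples (s_i, a_i), inner conditional samples s'_{ij} ~ P(. | s_i, a_i)."""
    err = 0.0
    for _ in range(n):
        s, a = rng.integers(S), rng.integers(A)           # outer sample xi_i = (s, a)
        next_states = rng.choice(S, size=m, p=P[s, a])    # inner samples eta_ij
        inner = np.mean(V[s] - gamma * V[next_states])    # estimates E[V(s) - gamma V(s')]
        err += (r[s, a] - inner) ** 2
    return err / n

V = np.zeros(S)                                           # tabular value-function parameters
print(empirical_msbe(V))
```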
Uplift Modeling.
Uplift modeling aims at estimating individual treatment effects, and it has been widely studied in the causal inference literature and used for personalized medical treatment and targeted marketing [18, 44]. In an individual uplift model, the goal is to estimate the effect of a treatment on an individual with feature vector $x$, which can be represented by $u(x) := \mathbb{E}[y\,|\,x, t=1] - \mathbb{E}[y\,|\,x, t=-1]$, where $t\in\{\pm 1\}$ indicates whether a treatment has been given to the individual, and $y\in\mathcal{Y}\subseteq\mathbb{R}$ represents the outcome. In practice, obtaining joint labels $(y,t)$ can be difficult, whereas obtaining one label (either $t$ or $y$) of the individual is relatively easier. [44] considered an individual uplift model that assumes availability of only one label from the joint labels and estimates the unknown label with $p(y|x) = \sum_{t\in\{\pm 1\}} p(y|x,t)\,p(t|x)$. They showed that the individual uplift $u(x)$ is equivalent to the optimal solution of the following least-squares problem:

$\min_{u\in L^2(p)} \mathbb{E}_{x\sim p(x)}\Big[\big(\mathbb{E}_{w|x}[w]\cdot u(x) - \mathbb{E}_{z|x}[z]\big)^2\Big]$,

where $L^2(p) = \{f:\mathcal{X}\to\mathbb{R}\ |\ \mathbb{E}_{x\sim p(x)}[f(x)^2] < \infty\}$ is a function space, and $w$ and $z$ are two auxiliary random variables whose conditional densities are given by $p(z=\bar z\,|\,x) = p(y=\bar z\,|\,x) + p(y=-\bar z\,|\,x)$ and $p(w=\bar w\,|\,x) = p(t=\bar w\,|\,x) + p(t=-\bar w\,|\,x)$. If we further restrict $u(\cdot)$ to a finite-dimensional parameterization, then the above problem becomes a special case of CSO.

For these applications, there are many settings in which samples can be generated according to our assumptions. For instance, in robust supervised learning and uplift modeling, there are multiple samples from $P(\eta|\xi)$ available for any given $\xi$.

A closely related class of problems, called stochastic composition optimization, has been extensively studied in the literature; see, e.g., [45, 31, 12, 41], to name just a few. This class of problems takes the following form:

(1.8)  $\min_{x\in\mathcal{X}} f\circ g(x) := \mathbb{E}_{\xi}\big[f_\xi\big(\mathbb{E}_{\eta}[g_\eta(x)]\big)\big]$,

where $f(u) := \mathbb{E}_{\xi}[f_\xi(u)]$ and $g(x) := \mathbb{E}_{\eta}[g_\eta(x)]$. Although the two problems, (1.8) and (1.4), share some similarities in that both objectives are represented by nested expectations, they are fundamentally different in two aspects: (i) the inner randomness $\eta$ in (1.4) is conditionally dependent on the outer randomness $\xi$, while the inner expectation in (1.8) is taken over the marginal distribution of $\eta$; (ii) the inner random function $g_\eta(x,\xi)$ in (1.4) depends on both $\xi$ and $\eta$. As a result, unlike (1.8), the CSO problem (1.4) cannot be formulated as a composition of two deterministic functions, due to the dependence between the inner and outer randomness. Another key distinction from (1.8) is that we assume availability of samples from the distributions $P(\xi)$ and $P(\eta|\xi)$, rather than samples from the joint distribution $P(\xi,\eta)$. These two distinctions lead to a drastic difference in the SAA construction and in the sample complexity analysis of the two types of problems, as we will show in the rest of the paper.

When solving either (1.8) or (1.4), most of the existing work is devoted to developing stochastic oracle-based algorithms and their convergence analysis.
Related work includes two-timescale [31, 45, 41, 42] and single-timescale [13] stochastic approximation algorithms for solving problem (1.8), variance-reduced algorithms for solving the SAA counterpart of (1.8) [22, 17, 39], and a primal-dual functional stochastic approximation algorithm for solving problem (1.4) [7]. These methods usually require convexity of the objective in order to obtain an $\epsilon$-optimal solution (and, when solving the SAA problem itself, convexity conditions are often necessary for obtaining a global minimizer). Our work differs from the ones listed above in that we mainly focus on establishing the sample complexity of SAA itself, rather than designing efficient algorithms to solve the resulting SAA.

We point out that our paper is in the same strain as a series of papers [20, 38, 36, 29, 12, 9, 2, 23] centered on the sample average approximation approach for stochastic programs. In particular, [9] derived a central limit theorem for the SAA of the stochastic composition optimization problem (1.8), and [12] established the rate of convergence. Despite these developments, the study of the basic SAA approach and its finite sample complexity remains unexplored for the general CSO problem (1.4), and even for the special case (1.6). We aim to close this gap in this paper.

In this paper, we formally analyze the sample complexity of the corresponding SAA approach for solving CSO. Our contributions are summarized as follows and in Table 1.1.
(a) We establish the first sample complexity results of the SAA in (1.5) for the CSO problem (1.4) under several structural assumptions:
(i) both $f_\xi$ and $g_\eta$ are Lipschitz continuous;
(ii) in addition to (i), $f_\xi$ is Lipschitz smooth;
(iii) in addition to (i), the empirical function satisfies the Hölderian error bound condition;
(iv) in addition to (i), $f_\xi$ is Lipschitz smooth and the empirical function satisfies the Hölderian error bound condition.
None of these assumptions requires convexity of the underlying objective function. Note that the Hölderian error bound (HEB) condition [4], which includes the quadratic growth (QG) condition [19] as a special case, is a much weaker assumption than strong convexity and holds for many nonconvex problems in machine learning applications [6]. We show that, for general Lipschitz continuous problems, the sample complexity of SAA improves from $O(d/\epsilon^4)$ to $O(d/\epsilon^3)$ when assuming smoothness; for problems satisfying the QG condition, the sample complexity of SAA improves from $O(1/\epsilon^3)$ to $O(1/\epsilon^2)$ when assuming smoothness. This is very different from the classical results on SO and MSP, where Lipschitz smoothness plays no essential role in the sample complexity [20, 36]. Our results are built on traditional large deviations theory and stability arguments, while leveraging several bias-variance decomposition techniques, in order to fully exploit the specific structure of CSO and other structural assumptions.
Table 1.1
Sample complexity of SAA methods

Problem            | Assumption on f_ξ(·) | Assumption on F̂_n or F̂_nm | Conditional sampling | Independent sampling
SO [20]            | -                    | -                           | O(d/ε^2)             | -
SO [35]            | -                    | Strongly Convex             | O(1/ε)               | -
MSP (T = 3) [36]   | -                    | -                           | O(d^2/ε^4)           | O(d^2/ε^4)
CSO                | -                    | -                           | O(d/ε^4)             | O(d/ε^2)
CSO                | Smooth               | -                           | O(d/ε^3)             |
CSO                | -                    | Quadratic Growth            | O(1/ε^3)             | O(d/ε^2)
CSO                | Smooth               | Quadratic Growth            | O(1/ε^2)             |

(F̂_n or F̂_nm = empirical objective; ε = accuracy; d = dimension; Conditional = conditional sampling; Independent = independent sampling.)

(b) We analyze the sample complexity of the modified SAA in (1.7) for the special case (1.6), where $\xi$ and $\eta$ are independent. We show that the total sample complexity of the modified SAA is $O(d/\epsilon^2)$ for general Lipschitz continuous problems. The existence of the QG condition only improves the complexity of the outer samples from $O(d/\epsilon^2)$ to $O(1/\epsilon)$, yet the overall complexity is dominated by the complexity of the inner samples, which is $O(d/\epsilon^2)$. Our complexity result matches the asymptotic rate established in [9], even without assuming smoothness of the outer and inner functions, and is unimprovable.

(c) We conduct simulations of the SAA approach on several examples, including logistic regression, least absolute value (LAV) regression, and its smoothed counterpart, under some modifications. Our simulation results indicate that solving the nonsmooth LAV regression requires more samples than solving its smooth counterpart to achieve the same accuracy. We also observe that when the variance of the inner randomness is relatively large, for a fixed budget $T$, setting $n = O(\sqrt{T})$ outer samples seems to perform best for logistic regression, which matches our theory. Although both the conditional sampling and independent sampling schemes can be applied to the special case (1.6), with nearly matching sample complexity in situation (iv) (see the last row in Table 1.1), our simulations show that the independent sampling scheme exhibits better performance in practice.

The remainder of this paper is organized as follows. In Section 2, we introduce notations and preliminaries. In Section 3, we state the basic assumptions and analyze the mean squared error of the Monte Carlo estimation. In Section 4, we present the main results on the sample complexity of SAA for CSO under different structural assumptions. In Section 5, we provide results for the special case when $\xi$ and $\eta$ are independent. Numerical results are given in Section 6.
2. Preliminaries.
For convenience, we collect here some notation that will be used throughout the paper. We also introduce some mathematical tools and propositions that are necessary for the later discussion. For simplicity, we restrict our attention to the $\ell_2$-norm, denoted by $\|\cdot\|$. Similar results on sample complexity with respect to other norms can be obtained with minor modifications of the analysis.

Let $\mathcal{X}\subseteq\mathbb{R}^d$ be the decision set. We say $\mathcal{X}$ has a finite diameter $D_{\mathcal{X}}$ if $\|x_1-x_2\|\le D_{\mathcal{X}}$ for all $x_1,x_2\in\mathcal{X}$. For $\upsilon>0$, $\{x_l\}_{l=1}^{Q}$ is said to be a $\upsilon$-net of $\mathcal{X}$ if $x_l\in\mathcal{X}$ for all $l=1,\dots,Q$ and the following holds: for every $x\in\mathcal{X}$ there exists $l(x)\in\{1,\dots,Q\}$ such that $\|x-x_{l(x)}\|\le\upsilon$. If $\mathcal{X}$ has a finite diameter $D_{\mathcal{X}}$, then for any $\upsilon>0$ there exists a $\upsilon$-net of $\mathcal{X}$, and the size of the $\upsilon$-net is bounded by $Q\le O\big((D_{\mathcal{X}}/\upsilon)^d\big)$ [37].

A function $f:\mathcal{X}\to\mathbb{R}$ is said to be $L$-Lipschitz continuous if there exists a constant $L>0$ such that $|f(x_1)-f(x_2)|\le L\|x_1-x_2\|$ for all $x_1,x_2\in\mathcal{X}$. The function $f:\mathcal{X}\to\mathbb{R}$ is said to be $S$-Lipschitz smooth if it is continuously differentiable and its gradient is $S$-Lipschitz continuous. This also implies that for all $x_1,x_2\in\mathcal{X}$,

$|f(x_1)-f(x_2)-\nabla f(x_2)^\top(x_1-x_2)| \le \frac{S}{2}\|x_1-x_2\|^2$.

If a continuously differentiable function $f:\mathcal{X}\to\mathbb{R}$ satisfies, for all $x_1,x_2\in\mathcal{X}$,

$f(x_1)-f(x_2)-\nabla f(x_2)^\top(x_1-x_2) \ge \frac{\mu}{2}\|x_1-x_2\|^2$,

then $f$ is called $\mu$-strongly convex when $\mu>0$, convex when $\mu=0$, and $\mu$-weakly convex when $\mu<0$.

Definition 2.1. Let $f:\mathcal{X}\to\mathbb{R}$ be a function with compact domain $\mathcal{X}$ whose optimal solution set $\mathcal{X}^*$ is nonempty. $f(\cdot)$ satisfies the $(\mu,\delta)$-Hölderian error bound condition if there exist $\delta\ge 0$ and $\mu>0$ such that for all $x\in\mathcal{X}$,

$f(x) - \min_{x\in\mathcal{X}} f(x) \ge \mu \inf_{z\in\mathcal{X}^*}\|x-z\|^{1+\delta}$.

In particular, when $\delta=1$, we say $f$ satisfies the quadratic growth (QG) condition.

The Hölderian error bound condition is also known as the Łojasiewicz inequality [4]. When $\delta=1$, the condition implies quadratic growth of the function value near any local minimum. The QG condition is a weaker assumption than strong convexity, and a function satisfying it need not be convex. When $f(\cdot)$ is convex, the QG condition is also referred to as optimal strong convexity [24] and semi-strong convexity [14].

Cramér's large deviation theorem will be used frequently, so we list it as a lemma below based on the result in [20]. We extend the result to random vectors and provide the proof in Appendix Section A. Lemma 2.1.
Let X , · · · , X n be i.i.d samples of zero mean random variable X with finite variance σ . For any (cid:15) > , it holds P (cid:18) n n (cid:88) i =1 X i ≥ (cid:15) (cid:19) ≤ exp( − nI ( (cid:15) )) , where I ( (cid:15) ) := sup t ∈ R { t(cid:15) − log M ( t ) } is the rate function of random variable X , and M ( t ) := E e tX is the moment generating function of X . For any δ > , there exists (cid:15) > , for any (cid:15) ∈ (0 , (cid:15) ) , I ( (cid:15) ) ≥ (cid:15) (2+ δ ) σ . If X is a zero-mean sub-Gaussian, then P ( n (cid:80) ni =1 X i ≥ (cid:15) ) ≤ exp( − n(cid:15) σ ) , ∀ (cid:15) > . If X is a zero-mean random vector in R k such that E (cid:107) X (cid:107) = σ < ∞ , then forany δ > , there exists (cid:15) > , for any (cid:15) ∈ (0 , (cid:15) ) , P (cid:18)(cid:13)(cid:13)(cid:13)(cid:13) n n (cid:88) i =1 X i (cid:13)(cid:13)(cid:13)(cid:13) ≥ (cid:15) (cid:19) ≤ k exp (cid:18) − n(cid:15) (2 + δ ) σ (cid:19) . We will also use the simple fact that for any random variables Y and Z , if randomvariable W ≤ X := Y + Z , then for any (cid:15) > P ( W > (cid:15) ) ≤ P ( X > (cid:15) ) ≤ P ( Y > (cid:15) ) + P ( Z > (cid:15) ). Lastly, throughout the paper, we call x (cid:15) ∈ X an (cid:15) -optimal solutionto the problem min x ∈X F ( x ), if F ( x (cid:15) ) − min x ∈X F ( x ) ≤ (cid:15) . . Mean Squared Error of SAA Estimator for CSO. In this section, wemake the basic assumptions and analyze the mean squared error of the Monte Carloestimate of the function value f ( x ) at a given point. Recall the problem (1.4):min x ∈X F ( x ) := E ξ (cid:104) f ξ (cid:16) E η | ξ [ g η ( x, ξ )] (cid:17)(cid:105) , where f ξ ( · ) : R k → R , g η ( · , ξ ) : R d → R k are random functions. Recall its SAAcounterpart (1.5): min x ∈X ˆ F nm ( x ) := 1 n n (cid:88) i =1 f ξ i (cid:18) m m (cid:88) j =1 g η ij ( x, ξ i ) (cid:19) . We denote x ∗ and ˆ x nm the optimal solutions to the CSO and the SAA problems, re-spectively. We are interested in estimating the probability of ˆ x nm being an (cid:15) -optimalsolution to the CSO problem, namely P ( F (ˆ x nm ) − F ( x ∗ ) ≤ (cid:15) ), for an arbitrary accu-racy (cid:15) > P ( ξ ) and conditional distribution P ( η | ξ ) for any given ξ , and we makethe following basic assumptions: Assumption
3.1. We assume that
(a) the decision set $\mathcal{X}\subseteq\mathbb{R}^d$ has a finite diameter $D_{\mathcal{X}}>0$;
(b) $f_\xi(\cdot)$ is $L_f$-Lipschitz continuous and $g_\eta(\cdot,\xi)$ is $L_g$-Lipschitz continuous for any given $\xi$ and $\eta$;
(c) for all $x\in\mathcal{X}$, $f(x,\xi)$ is Borel measurable in $\xi$, and $g_\eta(x,\xi)$ is Borel measurable in $\eta$ for all $\xi$;
(d) $\sigma_f^2 := \max_{x\in\mathcal{X}} \mathbb{V}_\xi\big(f_\xi(\mathbb{E}_{\eta|\xi}[g_\eta(x,\xi)])\big) < \infty$;
(e) $\sigma_g^2 := \max_{x\in\mathcal{X},\,\xi} \mathbb{E}_{\eta|\xi}\|g_\eta(x,\xi) - \mathbb{E}_{\eta|\xi}\, g_\eta(x,\xi)\|^2 < \infty$;
(f) $|f_\xi(\cdot)| \le M_f$ and $\|g_\eta(\cdot,\xi)\| \le M_g$ for any $\xi$ and $\eta$.

Assumption (f) on the boundedness of function values is implied by assumptions (a) and (b). Assumptions (d) and (e) on the boundedness of variances are commonly used for sample complexity analysis in the literature. Assumptions (b) and (c) together imply that the functions $f_\xi$ and $g_\eta(x,\xi)$ are Carathéodory functions [21]. Although the parameters $L_f$, $L_g$, $\sigma_f$, and $\sigma_g$ could depend on the dimensions $d$ and $k$, we treat them as given constants throughout the paper.

In this subsection, we analyze the mean squared error (MSE) of the estimator $\hat F_{nm}(x)$, i.e., the SAA objective (or the empirical objective), for estimating the true objective function $F(x)$ at a given $x$. The MSE can be decomposed into the sum of the squared bias and the variance of the estimator:

(3.1)  $\mathrm{MSE}\big(\hat F_{nm}(x)\big) := \mathbb{E}\,|\hat F_{nm}(x) - F(x)|^2 = \big(\mathbb{E}\,\hat F_{nm}(x) - F(x)\big)^2 + \mathbb{V}\big(\hat F_{nm}(x)\big)$.

We have the following lemmas bounding the bias and the variance.
Lemma
3.1. Let $\{\eta_j\}_{j=1}^{m}$ be conditional samples from $P(\eta|\xi)$ given $\xi\sim P(\xi)$. Under Assumption 3.1, for any fixed $x\in\mathcal{X}$ that is independent of $\xi$ and $\{\eta_j\}_{j=1}^{m}$, it holds that

(3.2)  $\Big|\,\mathbb{E}_{\{\xi,\{\eta_j\}_{j=1}^{m}\}}\Big[f_\xi\Big(\frac{1}{m}\sum_{j=1}^{m} g_{\eta_j}(x,\xi)\Big) - f_\xi\big(\mathbb{E}_{\eta|\xi}\, g_\eta(x,\xi)\big)\Big]\Big| \le \frac{L_f\sigma_g}{\sqrt{m}}$.

If, additionally, $f_\xi(\cdot)$ is $S$-Lipschitz smooth, we have

(3.3)  $\Big|\,\mathbb{E}_{\{\xi,\{\eta_j\}_{j=1}^{m}\}}\Big[f_\xi\Big(\frac{1}{m}\sum_{j=1}^{m} g_{\eta_j}(x,\xi)\Big) - f_\xi\big(\mathbb{E}_{\eta|\xi}\, g_\eta(x,\xi)\big)\Big]\Big| \le \frac{S\sigma_g^2}{2m}$.
Define X j := g η j ( x, ξ ) − E η | ξ g η ( x, ξ ) and ¯ X := (cid:80) mj =1 X j /m . It follows E { η j } mj =1 | ξ [ ¯ X ] = 0 by definition, and E { η j } mj =1 | ξ [ (cid:107) ¯ X (cid:107) ] ≤ σ g /m by Assumption 3.1(d). E { η j } mj =1 | ξ ∇ f ξ (cid:0) E η | ξ g η ( x, ξ ) (cid:1) (cid:62) (cid:0) m (cid:80) mj =1 X j ( x ) (cid:1) = 0 since x is independent of { η j } mj =1 .The results then follow directly by invoking the Lipschitz continuity and smoothnessand taking expectations. Lemma
3.2. Under Assumption 3.1, it holds that

$\mathbb{V}\big(\hat F_{nm}(x)\big) \le \frac{\sigma_f^2}{n} + \frac{4M_fL_f\sigma_g}{n\sqrt{m}}$.
We first introduce ˆ F n ( x ) := n (cid:80) ni =1 f ξ i (cid:0) E η | ξ i [ g η ( x, ξ i )] (cid:1) . It follows fromthe independence among { ξ i } ni =1 that V ( ˆ F n ( x )) ≤ σ f n . By definition we have V (cid:16) ˆ F nm ( x ) (cid:17) − V (cid:16) ˆ F n ( x ) (cid:17) = 1 n (cid:104) E ( ˆ F m ( x ) ) − ( E ˆ F m ( x )) (cid:105) − n (cid:104) ( E ( ˆ F ( x ) ) − ( E ˆ F ( x )) (cid:105) = 1 n (cid:104) E ( ˆ F m ( x ) ) − E ( ˆ F ( x ) ) (cid:105) + 1 n (cid:104) ( E ˆ F ( x )) − ( E ˆ F m ( x )) (cid:105) , where ˆ F m ( x ) := f ξ (cid:0) m (cid:80) mj =1 g η j ( x, ξ ) (cid:1) and ˆ F ( x ) := f ξ (cid:0) E η | ξ g η ( x, ξ ) (cid:1) . From As-sumption 3.1(b) and Lemma 3.1, we have E ( ˆ F m ( x ) ) − E ( ˆ F ( x ) ) ≤ M f E | ˆ F m ( x ) − ˆ F ( x ) | ≤ M f L f σ g / √ m . In addition, ( E ˆ F ( x )) − ( E ˆ F m ( x )) ≤ M f L f σ g / √ m .Hence, we obtain the desired result.The following result on the mean squared error follows naturally by (3.1). Theorem
3.1. Under Assumption 3.1, we have

(3.4)  $\mathrm{MSE}\big(\hat F_{nm}(x)\big) \le \frac{L_f^2\sigma_g^2}{m} + \frac{1}{n}\Big(\sigma_f^2 + \frac{4M_fL_f\sigma_g}{\sqrt{m}}\Big)$.

If, additionally, $f_\xi(\cdot)$ is $S$-Lipschitz smooth, the mean squared error is further bounded by

(3.5)  $\mathrm{MSE}\big(\hat F_{nm}(x)\big) \le \frac{S^2\sigma_g^4}{4m^2} + \frac{1}{n}\Big(\sigma_f^2 + \frac{4M_fL_f\sigma_g}{\sqrt{m}}\Big)$.

Unlike the classical stochastic optimization, the SAA objective of CSO is no longer unbiased. The estimation error of the SAA objective therefore comes from both bias and variance. A key observation from Theorem 3.1 is that Lipschitz smoothness of $f_\xi(\cdot)$ is essential for reducing the bias and can potentially be exploited to improve the sample complexity of SAA.

We point out that [15] also considers the problem of estimating the expected value of a nonlinear function of a conditional expectation, i.e., $\mathbb{E}[f(\mathbb{E}[\zeta|\xi])]$. Their setting is slightly different from ours, as they restrict $f$ to be one-dimensional and assume that $f$ has a finite number of discontinuous or non-differentiable points and is thrice differentiable with finite derivatives at all continuous points. They provide an asymptotic bound $O(1/m^2 + 1/n)$ on the mean squared error of their nested estimator based on a Taylor expansion. Here we focus on a general continuous outer function $f_\xi(\cdot)$ and show that Lipschitz smoothness of $f_\xi(\cdot)$ is sufficient to achieve a similar error bound with finite samples.
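As a quick numerical illustration of the bias behavior in Theorem 3.1, the following sketch estimates by Monte Carlo the bias of the inner plug-in estimate at a fixed decision point, for a nonsmooth and a smooth outer function on a toy one-dimensional instance; the specific choices of $f_\xi$, the inner noise, and the sample sizes are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(3)
sigma_g = 2.0

def bias_estimate(f_outer, m, trials=200_000):
    """Monte Carlo estimate of E[f(u_bar_m)] - f(E[u]) at a fixed decision point,
    where u_bar_m is the average of m inner samples with mean 0 and variance
    sigma_g^2 (drawn directly as N(0, sigma_g^2/m))."""
    u_bar = rng.normal(0.0, sigma_g / np.sqrt(m), size=trials)
    return np.mean(f_outer(u_bar)) - f_outer(0.0)

for m in (10, 100, 1000):
    b_abs = bias_estimate(np.abs, m)     # nonsmooth outer f(u) = |u|: bias = sigma_g*sqrt(2/(pi*m))
    b_sq = bias_estimate(np.square, m)   # smooth outer f(u) = u^2:  bias = sigma_g^2/m
    print(f"m={m:5d}   bias |u|: {b_abs:.4f} (~1/sqrt(m))   bias u^2: {b_sq:.4f} (~1/m)")
```

The nonsmooth case decays at the $O(1/\sqrt m)$ rate of (3.2), while the smooth case decays at the $O(1/m)$ rate of (3.3), matching the bias terms in (3.4) and (3.5).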
4. Sample Complexity of SAA for Conditional Stochastic Optimization.
In this section, we analyze the number of samples required for the solution tothe SAA (1.5) to be (cid:15) -optimal of the CSO problem (1.4), with high probability.We consider two general cases: (i) when the objective is Lipschitz continuous and(ii) when the empirical objective satisfies the H¨olderian error bound condition. Inthe former case, we establish a uniform convergence analysis based on concentrationinequalities to bound P ( F (ˆ x nm ) − F ( x ∗ ) ≥ (cid:15) ), and in the latter case, we provide astability analysis. In both cases, we further take into account two scenarios, with andwithout the Lipschitz smoothness assumption of the outer function f ξ ( · ). We first consider the case when the objective is Lipschitz continuous and prove theuniform convergence.
Theorem
Under Assumption 3.1, for any δ > ,there exists (cid:15) > such that for (cid:15) ∈ (0 , (cid:15) ) , when m ≥ L f σ g /(cid:15) , we have P (cid:18) sup x ∈X | ˆ F nm ( x ) − F ( x ) | > (cid:15) (cid:19) ≤ O (1) (cid:18) L f L g D X (cid:15) (cid:19) d exp (cid:18) − n(cid:15) δ )( σ f + 4 M f L f σ g ) (cid:19) . (4.1) If additionally, f ξ ( · ) is S -Lipschitz smooth, then (4.1) holds as long as m ≥ Sσ g /(cid:15) .Proof. We construct a υ -net to get rid of the supreme over x and use a con-centration inequality to bound the probability. First, we pick a υ -net { x l } Ql =1 on thedecision set X , such that L f L g υ = (cid:15)/
4, thus Q ≤ O (1)( L g L f D X (cid:15) ) d . Note that { x l } Ql =1 has no randomness. By definition of υ -net, we have ∀ x ∈ X , ∃ l ( x ) ∈ { , , · · · , Q } ,s.t. || x − x l ( x ) || ≤ υ = (cid:15)/ L f L g . Invoking Lipschitz continuity of f ξ and g η , weobtain | ˆ F nm ( x ) − ˆ F nm ( x l ( x ) ) | ≤ (cid:15) , | F ( x ) − F ( x l ( x ) ) | ≤ (cid:15) . Hence, for any x ∈ X , | ˆ F nm ( x ) − F ( x ) |≤ | ˆ F nm ( x ) − ˆ F nm ( x l ( x ) ) | + | ˆ F nm ( x l ( x ) ) − F ( x l ( x ) ) | + | F ( x l ( x ) ) − F ( x ) |≤ (cid:15) | ˆ F nm ( x l ( x ) ) − F ( x l ( x ) ) | ≤ (cid:15) l ∈{ , , ··· ,Q } | ˆ F nm ( x l ) − F ( x l ) | . t follows that(4.2) P (cid:18) sup x ∈X | ˆ F nm ( x ) − F ( x ) | > (cid:15) (cid:19) ≤ P (cid:18) max l ∈{ , , ··· ,Q } | ˆ F nm ( x l ) − F ( x l ) | > (cid:15) (cid:19) ≤ Q (cid:88) l =1 P (cid:18) | ˆ F nm ( x l ) − F ( x l ) | > (cid:15) (cid:19) . Define Z i ( l ) := f ξ i ( m (cid:80) mj =1 g η ij ( x l , ξ i )) − F ( x l ), then Z ( l ) , Z ( l ) , · · · , Z n ( l ) are i.i.d.random variables. Denote their expectation as E Z ( l ). Then Z i ( l ) − E Z ( l ) is a zero-mean random variable.If max l E Z ( l ) ≤ (cid:15)/
4, by Lemma 2.1, we have(4.3) P (cid:18) ˆ F nm ( x l ) − F ( x l ) > (cid:15) (cid:19) ≤ P (cid:18) ˆ F nm ( x l ) − F ( x l ) > (cid:15) E Z ( l ) (cid:19) = P (cid:18) n n (cid:88) i =1 [ Z i ( l ) − E Z ( l )] > (cid:15) (cid:19) ≤ exp (cid:18) − n(cid:15) δ + 2) V ( Z ( l )) (cid:19) . Similarly, we could show that if max l E Z ( l ) ≥ − (cid:15)/ P (cid:18) F ( x l ) − ˆ F nm ( x l ) > (cid:15) (cid:19) ≤ exp (cid:18) − n(cid:15) δ + 2) V ( Z ( l )) (cid:19) . Based on Lemma 3.1, we have, for Lipschitz continuous f ξ ( · ), | E Z ( l ) | ≤ L f σ g / √ m , ∀ l = 1 , · · · , Q ; for Lipschitz smooth f ξ ( · ), | E Z ( l ) | ≤ Sσ g / m , ∀ l = 1 , · · · , Q . Thus,max l E Z ( l ) ≤ (cid:15)/ m is sufficiently large. By analysis of Theorem3.1, we know V ( Z ( l )) ≤ σ f + 4 M f L f σ g / √ m ≤ σ f + 4 M f L f σ g . Plugging into (4.2)with Q ≤ O (1)( L g L f D X (cid:15) ) d , we obtain the desired result.Since ˆ F nm (ˆ x nm ) − ˆ F nm ( x ∗ ) ≤
0, we have

$P\big(F(\hat x_{nm}) - F(x^*) \ge \epsilon\big) = P\big([F(\hat x_{nm}) - \hat F_{nm}(\hat x_{nm})] + [\hat F_{nm}(\hat x_{nm}) - \hat F_{nm}(x^*)] + [\hat F_{nm}(x^*) - F(x^*)] \ge \epsilon\big)$
(4.5)  $\le P\big(F(\hat x_{nm}) - \hat F_{nm}(\hat x_{nm}) \ge \epsilon/2\big) + P\big(\hat F_{nm}(x^*) - F(x^*) \ge \epsilon/2\big)$.

Invoking Theorem 4.1, we immediately have the following result.

Corollary 4.1. Under Assumption 3.1, for any $\delta>0$, there exists $\epsilon_0>0$ such that for $\epsilon\in(0,\epsilon_0)$, when $m \ge 64L_f^2\sigma_g^2/\epsilon^2$,

(4.6)  $P\big(F(\hat x_{nm}) - F(x^*) > \epsilon\big) \le O(1)\Big(\frac{L_fL_gD_{\mathcal{X}}}{\epsilon}\Big)^{d}\exp\Big(-\frac{n\epsilon^2}{64(2+\delta)(\sigma_f^2 + 4M_fL_f\sigma_g)}\Big)$.

If, additionally, $f_\xi(\cdot)$ is $S$-Lipschitz smooth, then (4.6) holds as long as $m \ge 4S\sigma_g^2/\epsilon$.

It further implies the following sample complexity result.

Corollary 4.2. With probability at least $1-\alpha$, the solution to the SAA problem is $\epsilon$-optimal for the original CSO problem if the sample sizes $n$ and $m$ satisfy

$n \ge O(1)\,\frac{\sigma_f^2 + 4M_fL_f\sigma_g}{\epsilon^2}\Big[d\log\Big(\frac{L_fL_gD_{\mathcal{X}}}{\epsilon}\Big) + \log\Big(\frac{1}{\alpha}\Big)\Big]$,

and $m \ge 64L_f^2\sigma_g^2/\epsilon^2$ under Assumption 3.1, or $m \ge 4S\sigma_g^2/\epsilon$ if $f_\xi(\cdot)$ is also Lipschitz smooth. Ignoring the log factors, under Assumption 3.1 the total sample complexity of SAA for achieving an $\epsilon$-optimal solution is $T = mn + n = O(d/\epsilon^4)$; when $f_\xi(\cdot)$ is Lipschitz smooth, the total sample complexity reduces to $T = mn + n = O(d/\epsilon^3)$.

The above result indicates that, in general, the sample complexity of SAA for the CSO problem is $O(d/\epsilon^4)$ when assuming only Lipschitz continuity of the functions $f_\xi$ and $g_\eta$. The sample complexity drops to $O(d/\epsilon^3)$ when additionally assuming Lipschitz smoothness of the outer function $f_\xi$. Notice that the complexity depends only linearly on the dimension of the decision set. This is quite different from three-stage stochastic optimization: in [36], for a three-stage stochastic program, the authors showed that the sample sizes for estimating the second and the third stages each need to be at least $O(d/\epsilon^2)$, leading to a total of $O(d^2/\epsilon^4)$ samples, to guarantee uniform convergence even for stage-wise independent random variables.

In this subsection, we consider the case when the empirical function satisfies the Hölderian error bound condition, which includes the quadratic growth condition and strong convexity as special cases. The error bound condition has been widely studied recently in the context of (stochastic) oracle-based algorithms for faster convergence; see, e.g., [19, 11, 43] and references therein. To the best of our knowledge, very few papers have exploited the Hölderian error bound condition for the SAA approach and analyzed the sample complexity under such a condition. We show that the CSO problem under the Hölderian error bound condition yields smaller orders of sample complexity for the SAA approach. We make the following two assumptions throughout this subsection.
Assumption
4.1. The empirical function $\hat F_{nm}(x)$ satisfies the $(\mu,\delta)$-Hölderian error bound condition with $\mu>0$, $\delta\ge 0$, i.e., for all $x\in\mathcal{X}$,

$\hat F_{nm}(x) - \min_{x\in\mathcal{X}}\hat F_{nm}(x) \ge \mu \inf_{z\in\mathcal{X}^*_{nm}}\|x-z\|^{1+\delta}$,

where $n,m$ are any positive integers and $\mathcal{X}^*_{nm}$ is the optimal solution set of the empirical objective function $\hat F_{nm}(x)$ over $\mathcal{X}$.

Assumption 4.2.
The empirical function ˆ F nm has a unique minimizer ˆ x nm on X , for any n and m . An interesting special case of Assumption 4.1 is the quadratic growth (QG) condi-tion when δ = 1. QG condition is actually satisfied by a wide spectrum of objectives,such as strongly convex functions, general strongly convex functions composed withpiecewise linear functions, general piecewise convex quadratic functions, etc. Thereare also many other specific examples arising in machine learning applications thatsatisfy the QG condition, including logistic loss composed with linear functions andneural networks with linear activation functions, see [6, 19], and reference therein.Another interesting case is the polyhedral error bound condition when δ = 0, which is nown to hold true for many piecewise linear loss functions [4]. For both cases, thesefunctions are not necessarily strongly convex nor convex. Relevant problems withSAA objective ˆ F nm satisfying the QG condition are discussed in Appendix Section D.Assumption 4.2 could be restricted and less straightforward to verify. In general,for a non-strictly convex empirical objective function, the optimal solution is notnecessarily unique. Yet, it is not exclusive to strictly convex functions. We illustrateone such example below. Lastly, we point out that when ˆ F nm ( x ) is strongly convex,for example, l regularized convex empirical objective, the above assumptions holdnaturally. In the following, we give some examples when ˆ F nm ( x ) satisfies the QGcondition. Example 1.
Consider the following one-dimensional function

$F(x) = \mathbb{E}_{\xi}\big[\big(\mathbb{E}_{\eta|\xi}[\eta]\,x\big)^2 + 3\sin^2\big(\mathbb{E}_{\eta|\xi}[\eta]\,x\big)\big]$,

where $\xi$ and $\eta$ can be any random vectors that satisfy $\eta\,|\,\xi \ge \sqrt{\mu}$ with probability 1. Denoting $\bar\eta_i = \frac{1}{m}\sum_{j=1}^{m}\eta_{ij}$, the empirical function is given by

$\hat F_{nm}(x) = \frac{1}{n}\sum_{i=1}^{n} \bar\eta_i^2 x^2 + \frac{3}{n}\sum_{i=1}^{n}\sin^2(\bar\eta_i x)$.

It can be easily verified that $\hat F_{nm}(x)$ satisfies the QG condition with parameter $\mu>0$ and that $\hat F_{nm}(x)$ has a unique minimizer $x^*=0$ for any $m,n$.

Example 2.
Consider the robust logistic regression problem with the objective

(4.7)  $F(x) = \mathbb{E}_{\xi=(a,b)}\big[\log\big(1+\exp(-b\,\mathbb{E}_{\eta|\xi}[\eta]^\top x)\big)\big]$,

where $a\in\mathbb{R}^d$ is a random feature vector, $b\in\{1,-1\}$ is the label, and $\eta = a + \mathcal{N}(0,\sigma^2 I_d)$ is a perturbed noisy observation of the input feature vector $a$. The empirical objective function $\hat F_{nm}(x)$ is given by

(4.8)  $\hat F_{nm}(x) = \frac{1}{n}\sum_{i=1}^{n}\log\Big(1+\exp\Big(-b_i\,\frac{1}{m}\sum_{j=1}^{m}\eta_{ij}^\top x\Big)\Big)$.

We show in Appendix Section D that $\hat F_{nm}(x)$ satisfies the QG condition on any compact convex set. Note that the minimizer of a general empirical objective function is not always unique. However, the Hessian of $\hat F_{nm}(x)$ shows that $\hat F_{nm}(x)$ is strictly convex if $\frac{1}{m}\sum_{j=1}^{m}\eta_{ij} \neq 0$ for all $i$, which is satisfied with high probability. Thus, $\hat F_{nm}(x)$ has a unique minimizer with high probability.

Next, we present our main result on the sample complexity of SAA.
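A minimal sketch of how the empirical objective (4.8) can be formed and solved with CVXPY (the solver used in the experiments of Section 6) is given below; the data-generating setup mirrors the description there, but the specific constants and the radius of the decision set are illustrative assumptions.

```python
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(4)
d, n, m, sigma_eta = 10, 200, 20, 1.0

# Conditional sampling: for each outer sample xi_i = (a_i, b_i), draw m noisy
# observations eta_ij ~ N(a_i, sigma_eta^2 I_d) and average them.
x_true = rng.normal(size=d)
A = rng.normal(size=(n, d))                                            # feature vectors a_i
b = np.sign(A @ x_true)                                                # labels b_i in {+1, -1}
Eta_bar = A + rng.normal(0, sigma_eta, size=(n, m, d)).mean(axis=1)    # (1/m) sum_j eta_ij

# Empirical objective (4.8): (1/n) sum_i log(1 + exp(-b_i * eta_bar_i^T x)),
# minimized over a norm ball used as the decision set.
x = cp.Variable(d)
objective = cp.sum(cp.logistic(cp.multiply(-b, Eta_bar @ x))) / n
problem = cp.Problem(cp.Minimize(objective), [cp.norm(x, 2) <= 100])
problem.solve()
print("optimal empirical objective:", problem.value)
```

With a fixed budget $T = nm + n$, varying the split between $n$ and $m$ in this construction reproduces the sample-allocation comparison reported in Section 6.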
Theorem 4.2. Under Assumptions 3.1, 4.1, and 4.2, for any $\epsilon>0$, we have

(4.9)  $P\big(F(\hat x_{nm}) - F(x^*) \ge \epsilon\big) \le \frac{1}{\epsilon}\Big(L_fL_g\Big(\frac{2L_fL_g}{\mu n}\Big)^{1/\delta} + \frac{2L_f\sigma_g}{\sqrt m}\Big)$.

If, additionally, $f_\xi(\cdot)$ is $S$-Lipschitz smooth, then we further have

(4.10)  $P\big(F(\hat x_{nm}) - F(x^*) \ge \epsilon\big) \le \frac{1}{\epsilon}\Big(L_fL_g\Big(\frac{2L_fL_g}{\mu n}\Big)^{1/\delta} + \frac{S\sigma_g^2}{m}\Big)$.

Different from the previous section, we use a stability argument to exploit the error bound condition. As shown in Lemma 3.1, the empirical function is a biased estimator of the original function due to the composition of $f_\xi(\cdot)$ and $g_\eta(\cdot,\xi)$. Introducing a perturbed set of samples allows us to reduce some of the dependence in the randomness. We define a bias term that will be used later in the proof:

(4.11)  $\Delta(m) := \begin{cases} \dfrac{L_f\sigma_g}{\sqrt m}, & f_\xi(\cdot)\ \text{is}\ L_f\text{-Lipschitz continuous},\\[4pt] \dfrac{S\sigma_g^2}{2m}, & f_\xi(\cdot)\ \text{is additionally}\ S\text{-Lipschitz smooth}.\end{cases}$

Below we provide the detailed proof of Theorem 4.2.
Proof.
Recall that x ∗ and ˆ x nm are the minimizers of F ( x ) and ˆ F nm ( x ), respec-tively. It’s clear that x ∗ has no randomness, and ˆ x nm is a function of { ξ i } ni =1 , { η ij } mj =1 .We decompose the error F (ˆ x nm ) − F ( x ∗ ) in three terms, and analyze each term below: F (ˆ x nm ) − F ( x ∗ ) = F (ˆ x nm ) − ˆ F nm (ˆ x nm ) (cid:124) (cid:123)(cid:122) (cid:125) := E + ˆ F nm (ˆ x nm ) − ˆ F nm ( x ∗ ) (cid:124) (cid:123)(cid:122) (cid:125) := E + ˆ F nm ( x ∗ ) − F ( x ∗ ) (cid:124) (cid:123)(cid:122) (cid:125) := E . First, we use a stability argument and Lemma 3.1 to bound E E = E [ F (ˆ x nm ) − ˆ F nm (ˆ x nm )]. Define(4.12) ˆ F ( k ) nm ( x ) := 1 n n (cid:88) i (cid:54) = k f ξ i (cid:18) m m (cid:88) j =1 g η ij ( x, ξ i ) (cid:19) + 1 n f ξ (cid:48) k (cid:18) m m (cid:88) j =1 g η (cid:48) kj ( x, ξ (cid:48) k ) (cid:19) as the empirical function by replacing the k th outer sample ξ k with another i.i.d outersample ξ (cid:48) k , and replacing the corresponding inner samples { η kj } mj =1 with { η (cid:48) kj } mj =1 ,which are sampled from the conditional distribution of P ( η | ξ (cid:48) k ) for a given sample ξ (cid:48) k .Denote ˆ x ( k ) nm := argmin x ∈X ˆ F ( k ) nm ( x ). We decompose E E = E [ F (ˆ x nm ) − ˆ F nm (ˆ x nm )]into three terms: E E = E (cid:20) n n (cid:88) k =1 F (ˆ x nm ) − n n (cid:88) k =1 f ξ k (cid:18) E η | ξ k g η (ˆ x ( k ) nm , ξ k ) (cid:19)(cid:21) + E (cid:20) n n (cid:88) k =1 f ξ k (cid:18) E η | ξ k g η (ˆ x ( k ) nm , ξ k ) (cid:19) − n n (cid:88) k =1 f ξ k (cid:18) m m (cid:88) j =1 g η kj (ˆ x ( k ) nm , ξ k ) (cid:19)(cid:21) + E (cid:20) n n (cid:88) k =1 f ξ k (cid:18) m m (cid:88) j =1 g η kj (ˆ x ( k ) nm , ξ k ) (cid:19) − ˆ F nm (ˆ x nm ) (cid:21) . (4.13)Note that E [ F (ˆ x nm )] = E [ F (ˆ x ( k ) nm )] since ξ k and ξ (cid:48) k are i.i.d, which implies that ˆ x nm and ˆ x ( k ) nm follow an identical distribution. Since ˆ x ( k ) nm is independent of ξ k , E [ F (ˆ x ( k ) nm )] = E [ f ξ k ( E η | ξ k g (ˆ x ( k ) nm , ξ k ))] for any k . Then the first term in (4.13) is 0. As ˆ x ( k ) nm isindependent of { η kj } mj =1 , the second term in (4.13) could be bounded by Lemma 3.1,it holds(4.14) E (cid:20) f ξ k (cid:18) E η | ξ k g η (ˆ x ( k ) nm , ξ k ) (cid:19) − f ξ k (cid:18) m m (cid:88) j =1 g η kj (ˆ x ( k ) nm , ξ k ) (cid:19)(cid:21) ≤ ∆( m ) . or the third term in (4.13), by definition it implies(4.15)ˆ F nm (ˆ x ( k ) nm ) − ˆ F nm (ˆ x nm ) = ˆ F ( k ) nm (ˆ x ( k ) nm ) − ˆ F ( k ) nm (ˆ x nm )+ 1 n f ξ k (cid:18) m m (cid:88) j =1 g η kj (ˆ x ( k ) nm , ξ k ) (cid:19) − n f ξ k (cid:18) m m (cid:88) j =1 g η kj (ˆ x nm , ξ k ) (cid:19) + 1 n f ξ (cid:48) k (cid:18) m m (cid:88) j =1 g η (cid:48) kj (ˆ x nm , ξ (cid:48) k ) (cid:19) − n f ξ (cid:48) k (cid:18) m m (cid:88) j =1 g η (cid:48) kj (ˆ x ( k ) nm , ξ (cid:48) k ) (cid:19) . By Lipschitz continuity of f ξ and g η and that ˆ F ( k ) nm (ˆ x ( k ) nm ) − ˆ F ( k ) nm (ˆ x nm ) ≤
0, it holds(4.16) ˆ F nm (ˆ x ( k ) nm ) − ˆ F nm (ˆ x nm ) ≤ n L f L g || ˆ x ( k ) nm − ˆ x nm || . Since ˆ x nm is the unique minimizer of ˆ F nm ( x ), and ˆ F nm ( x ) satisfies QG condition withparameter µ , we have(4.17) ˆ F nm (ˆ x ( k ) nm ) − ˆ F nm (ˆ x nm ) ≥ µ || ˆ x ( k ) nm − ˆ x nm || δ . Combining with (4.16), we obtain(4.18) || ˆ x ( k ) nm − ˆ x nm || ≤ (cid:18) L f L g µn (cid:19) /δ . By Lipschitz continuity of f ξ ( · ) and g η ( · , ξ ), and definition of ˆ F nm (ˆ x nm ), we obtain E (cid:20) n n (cid:88) k =1 f ξ k (cid:18) m m (cid:88) j =1 g η kj (ˆ x ( k ) nm , ξ k ) (cid:19) − ˆ F nm (ˆ x nm ) (cid:21) ≤ L f L g (cid:18) L f L g µn (cid:19) /δ . (4.19)Combining (4.13), (4.19), and (4.14), we obtain(4.20) E E ≤ L f L g (cid:18) L f L g µn (cid:19) /δ + ∆( m ) . Second, by optimality of ˆ x nm of ˆ F nm , we have(4.21) E E = E [ ˆ F nm (ˆ x nm ) − ˆ F nm ( x ∗ )] ≤ . Next, we bound E E . Define ˆ F n ( x ) := n (cid:80) ni =1 f ξ i (cid:0) E η | ξ i [ g η ( x, ξ i )] (cid:1) . Notice that x ∗ is independent of { η ij } mj =1 for any i = { , · · · , n } and E [ ˆ F n ( x ∗ ) − F ( x ∗ )] = 0. ByLemma 3.1, it holds(4.22) E E = E [ ˆ F nm ( x ∗ ) − ˆ F n ( x )] + E [ ˆ F n ( x ) − F ( x )] ≤ ∆( m );Combining (4.20), (4.21), (4.22), with Markov inequality, we obtain the desired re-sult.The sample complexity of SAA under the H¨olderian error bound condition followsdirectly. orollary Under Assumption 4.1 and 4.2, with probability at least − α ,the solution to the SAA problem is (cid:15) -optimal to the original CSO problem if the samplesizes n and m satisfy that n ≥ (2 L f L g ) δ +1 µ ( α(cid:15) ) δ , m ≥ (cid:40) L f σ g α (cid:15) , Under Assumption 3.1 , Sσ g α(cid:15) , f ξ ( · ) is also Lipschitz smooth.Hence, the total sample complexity of SAA for achieving an (cid:15) -optimal solution is atmost T = mn + n = O (1 /(cid:15) δ +2 ); when f ξ ( · ) is Lipschitz smooth, the total samplecomplexity reduces to T = mn + n = O (1 /(cid:15) δ +1 ) . In particular, when the empirical function is strongly convex or satisfies the QGcondition, i.e., Assumption 4.1 with δ = 1, this leads to the total sample complexity of O (1 /(cid:15) ) for Lipschitz continuous case and O (1 /(cid:15) ) for Lipschitz smooth case, respec-tively. From the above corollary, the error bound condition only affects the samplecomplexity of the outer samples, and the sample size decreases as δ decreases. As δ gets closer to zero, the sample complexity will essentially be dominated by the innersample size.A key difference between the results in Theorems 4.1 and 4.2 lies in the dependenceon the problem dimension d and confidence level α . While the sample complexityunder the H¨olderian error bound condition is dimension-free, the dependence on theconfidence level 1 − α grows from O (log(1 /α )) to O (1 /α δ ). This is similar to classicalresults on stochastic optimization for strongly convex objectives [35]. Theorem 4.2could also be used to derive a dimensional free sample complexity of l regularizedSAA for a general convex CSO problem. See Appendix Section E for more details.
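To connect these rates with the sample-allocation strategies used in the numerical experiments of Section 6, the following back-of-the-envelope calculation (not from the paper) converts the sample-size requirements above into a rule for splitting a fixed budget $T \approx nm$ between outer and inner samples under the QG condition ($\delta = 1$).

```latex
% Under QG, the corollary above requires roughly
%   n = O(1/\epsilon),  m = O(1/\epsilon^2)   (Lipschitz continuous f_\xi),
%   n = O(1/\epsilon),  m = O(1/\epsilon)     (Lipschitz smooth f_\xi).
% Eliminating \epsilon from T \approx nm gives the budget splits
\begin{align*}
\text{nonsmooth: } & T \approx nm = O(1/\epsilon^{3})
  \;\Rightarrow\; \epsilon = O(T^{-1/3}),\quad n = O(T^{1/3}),\quad m = O(T^{2/3}),\\
\text{smooth: }    & T \approx nm = O(1/\epsilon^{2})
  \;\Rightarrow\; \epsilon = O(T^{-1/2}),\quad n = O(T^{1/2}),\quad m = O(T^{1/2}).
\end{align*}
```

This is the reason why $n = O(\sqrt{T})$ is the allocation predicted, and observed, to work best for the smooth logistic-regression experiment in Section 6.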
5. Sample Complexity of SAA for CSO with Independent Random Variables.
In this section, we consider the special case of CSO when the randomvariables ξ and η are independent. The objective then simplifies to:(5.1) min x ∈X F ( x ) := E ξ [ f ξ ( E η [ g η ( x, ξ )])] . This is similar yet slightly more general than (1.8), the compositional objective con-sidered in [42, 41]. Note that the inner cost function we consider here is dependenton both ξ and η , and thus cannot be written as a composition of two deterministicfunctions.The sample complexity of SAA under the conditional sampling setting achieved inSection 4 applies to this setting since it can be viewed as a special case of the former.However, since the inner expectation is no longer a conditional expectation, we nowconsider an alternative modified SAA, using the independent sampling scheme, inwhich we use the same set of samples to estimate the inner expectation. The procedureof the independent sampling scheme for solving (5.1) works as follows: first generate n i.i.d. samples { ξ i } ni =1 from the distribution of ξ ; and m i.i.d samples { η j } mj =1 fromthe distribution of η , then solve the following approximation problem:(5.2) min x ∈X ˆ F nm ( x ) := 1 n n (cid:88) i =1 f ξ i (cid:18) m m (cid:88) j =1 g η j ( x, ξ i ) (cid:19) . As a result, the total sample complexity becomes T = m + n . In recent workby [9], the authors established a central limit theorem result for the SAA (5.2) with = n . In particular, they have shown that for Lipschitz smooth functions f ξ ( · ) and g η ( · , ξ ) = g η ( · ), the SAA estimator converges in distribution as follows: √ m (cid:18) min x ∈X ˆ F mm ( x ) − min x ∈X F ( x ) (cid:19) → Z ( W )where W ( · ) = ( W ( · ) , W ( · )) is a zero-mean Brownian process with certain covariancefunctions and Z ( · ) is a function that depends on the first order information. Thisresult only yields an asymptotic convergence rate of order O (1 / √ m ) for the SAAwith m = n . Below, we will provide a finite sample analysis for SAA and establishrefined sample complexity results based on concentration inequality techniques.In the SAA problem (5.2), the component functions f ξ i (cid:0) m (cid:80) mj =1 g η j ( x, ξ i ) (cid:1) sharethe same random vectors { η j } mj =1 and are dependent. This is distinct from the SAA(1.5) considered in the previous section. Because of this key difference, the previousanalysis will no longer apply to this modified SAA. We will resort to a differentanalysis for deriving the sample complexity. Similarly, we consider two structuralassumptions, when the empirical objective is only known to be Lipschitz continuousand when the empirical objective also satisfies the error bound condition. We firstconsider the case when the objective is Lipschitz continuous. We make the samebasic assumptions of the Lipschitz continuity of f ξ ( · ) and g η ( · , ξ ) and boundedness ofvariances as described in Assumption 3.1. Our main result is summarized below. Theorem
Under the independent sampling scheme and Assumption 3.1, forany δ > , there exists an (cid:15) > such that for any (cid:15) ∈ (0 , (cid:15) ) , it holds (5.3) P (cid:18) sup x ∈X | ˆ F nm ( x ) − F ( x ) | > (cid:15) (cid:19) ≤O (1) (cid:18) L f L g D X (cid:15) (cid:19) d (cid:18) exp (cid:18) − n(cid:15) δ + 2) σ f (cid:19) + nk exp (cid:18) − m(cid:15) δ + 2) L f σ g (cid:19)(cid:19) . Here, d is the dimension of the decision set, and k is the dimension of the range offunction g .Proof. First, we pick a υ -net { x l } Ql =1 on the decision set X , such that L f L g υ = (cid:15)/
4. Using a similar argument in the proof of Theorem 4.1, we obtain(5.4) P (cid:18) sup x ∈X | ˆ F nm ( x ) − F ( x ) | > (cid:15) (cid:19) ≤ Q (cid:88) l =1 P (cid:18) | ˆ F nm ( x l ) − F ( x l ) | > (cid:15) (cid:19) ≤ Q (cid:88) l =1 P (cid:18) | ˆ F nm ( x l ) − ˆ F n ( x l ) | > (cid:15) (cid:19) + Q (cid:88) l =1 P (cid:18) | ˆ F n ( x l ) − F ( x l ) | > (cid:15) (cid:19) . By Lipschitz continuity of f ξ ( x ) and Lemma 2.1, we have(5.5) P (cid:18) | ˆ F nm ( x l ) − ˆ F n ( x l ) | ≥ (cid:15) (cid:19) ≤ n (cid:88) i =1 P (cid:18) || m m (cid:88) j =1 g η j ( x l , ξ i ) − E η g η ( x l , ξ i ) || ≥ (cid:15) L f (cid:19) ≤ nk exp (cid:18) − m(cid:15) δ + 2) L f σ g (cid:19) . y Lemma 2.1, we obtain(5.6) P (cid:18) | ˆ F n ( x l ) − F ( x l ) | ≥ (cid:15) (cid:19) ≤ (cid:18) − n(cid:15) δ + 2) σ f (cid:19) . Combining with the fact that Q ≤ O (1)( L g L f D X (cid:15) ) d , we obtain the desired result.Invoking the relation in (4.5), the above theorem implies the following: Corollary
Under Assumption 3.1, with probability at least − α , the so-lution to the modified SAA problem (5.2) is (cid:15) -optimal to the original problem (5.1) ifthe sample sizes n and m satisfy n ≥ O (1) σ f (cid:15) (cid:20) d log (cid:18) L f L g D X (cid:15) (cid:19) + log (cid:18) α (cid:19) (cid:21) ,m ≥ O (1) L f σ g (cid:15) (cid:20) d log (cid:18) L f L g D X (cid:15) (cid:19) + log (cid:18) α (cid:19) + log ( nk ) (cid:21) . Ignoring the log factors, under Assumption 3.1, the total sample complexity of themodified SAA for achieving an (cid:15) -optimal solution is T = m + n = O ( d/(cid:15) ) . Note that this sample complexity is significantly smaller than that for the gen-eral CSO. The O ( d/(cid:15) ) sample complexity also matches the lower bounds on samplecomplexity of SAA for classical stochastic optimization with Lipschitz continuousobjectives [25]; therefore, this result is unimprovable without further assumptions. We now con-sider the case when the empirical objective satisfies Assumption 4.1 and 4.2, i.e.,the empirical objective ˆ F nm ( x ) satisfies the error bound condition and has a uniqueminimizer for any integers n, m . Our main result is summarized as follows. Theorem
Under Assumptions 3.1, 4.1, and 4.2, for any (cid:15) > and υ > ,we have (5.7) P ( F (ˆ x nm ) − F ( x ∗ ) ≥ (cid:15) ) ≤ (cid:15) (cid:18) L f L g (cid:18) L f L g µn (cid:19) /δ + O (1) L f M g (cid:112) d log( D X /υ ) √ m + L f σ g √ m + 2 υL f L g (cid:19) . The solution to the modified SAA problem (5.2) is (cid:15) -optimal to the problem (5.1) withprobability at least − α , if υ = (cid:15)α L f L g , and the sample sizes n and m satisfy that (5.8) n ≥ (2 L f L g ) δ +1 µ ( α(cid:15) ) δ , m ≥ max (cid:40)(cid:18) L f σ g α(cid:15) (cid:19) , O (1) (cid:18) L f M g α(cid:15) (cid:19) d log (cid:18) D X L f L g α(cid:15) (cid:19)(cid:41) . Similar to Theorem 4.2, the outer sample size is independent of dimension anddecreases as δ decreases. As δ gets closer to zero, the sample complexity will essentiallybe dominated by the inner sample size. In particular, when the empirical functionsatisfies the QG condition or is strongly convex, i.e., Assumption 4.1 holds with δ = 1,the outer sample size is reduced from O ( d/(cid:15) ) in the Lipschitz continuous case to O (1 /(cid:15) ). Yet, the total sample complexity remains O ( d/(cid:15) ).For a CSO problem with independent random vectors (5.1), both SAA approaches,through conditional sampling, or independent sampling, can be applied to solve the roblem. Comparing Theorem 4.2 and Theorem 5.2, when smoothness and the qua-dratic growth condition are satisfied, the sample complexities of these two SAA ap-proaches achieve the same order O (1 /(cid:15) ), except for an extra O ( d ) factor for theindependent sampling. Interestingly, for a given small dimension d and the same sam-ple budget T , the independent sampling might outperform the conditional samplingscheme since the constant factor in the sample complexity of conditional samplingis much larger. The numerical experiment on our testing cases in the next sectionfurther supports the finding.In contrast to the sample complexity established in Section 4 for the conditionalsampling setting, a notable difference here is that the Lipschitz smoothness conditiondoes not necessarily help reduce the sample complexity. This result aligns with thecentral limit theorem established in [9]. One of the reasons arises from the inter-dependence among the component functions in the modified SAA objective, leadingto extra variance. Because of that, the analysis requires sophisticated arguments tohandle the dependence and is much more involved . We defer the proof to AppendixSection B. Remark
Although the overall O (1 /(cid:15) ) sample complexity cannot be furtherimproved in general, it is worth pointing out that, for some interesting specific in-stances, the modified SAA could achieve lower sample complexity than what is de-scribed from theory. We illustrate this from the following example.Example 3. For γ >
0, consider the following problemmin x ∈X F ( x ) := H ( E η [ x + η ] , γ ) + ( E η [ x + η ]) , where η ∼ N (0 , σ η ) and H ( · , γ ) is the Huber function, i.e.,(5.9) H ( x, γ ) = | x | − γ for | x | > γ. γ x for | x | ≤ γ. Note that here f ξ ( x ) := f ( x ) = H ( x, γ ) + x is deterministic, and g η ( x, ξ ) = x + η .When γ > f ( x ) is 1 /γ -Lipschitz smooth. When γ → f ( x ) → | x | + x , whichis no longer differentiable. In this example, x ∗ = argmin x ∈X F ( x ) = − E η , F ∗ =min x ∈X F ( x ) = 0. The empirical objective becomes ˆ F m ( x ) = H ( x + ¯ η, γ ) + ( x + ¯ η ) , where ¯ η = m (cid:80) mj =1 η j . Thus, ˆ x m = argmin x ∈X ˆ F m ( x ) = − ¯ η . We show that the errorof SAA satisfies(5.10) 0 ≤ E F (ˆ x m ) − F ( x ∗ ) − (cid:18) σ η γm erf (cid:18)(cid:115) γ m σ η (cid:19) + σ η m (cid:19) ≤ (cid:114) σ η πm exp (cid:18) − mγ σ η (cid:19) , where erf( x ) := √ π (cid:82) x exp( − x ) dx . As a result, when γ → γ → E F (ˆ x m ) − F ( x ∗ ) = (cid:114) σ η πm + σ η m . For completeness, we provide detailed derivation in Appendix Section C. This exampleshows that the SAA error improves from O (1 / √ m ) to O (1 /m ) as the objective tran-sits from nonsmooth to smooth. When γ →
0, the function is no longer Lipschitz smooth, and the $O(1/\sqrt m)$ bound for this setting is indeed tight. It remains an interesting open problem to identify sufficient conditions for achieving theoretically better sample complexity under the independent sampling scheme.

[Figure 6.1. Logistic regression, conditional sampling, dimension d = 10; panels (b) and (c) correspond to σ_η/σ_ξ = 10 and σ_η/σ_ξ = 100, respectively.]
6. Numerical Experiments.
In this section, we conduct numerical experiments based on two applications, logistic regression and robust regression, to demonstrate the performance of SAA for solving CSO problems. For a fixed sample budget $T$, we adopt different sample allocation strategies for $(m, n)$ and compute the corresponding accuracy of the SAA estimators. We repeat 30 runs for each sample allocation and report the average performance. The SAA problems are solved by CVXPY 1.0.9 [10].

We first consider the robust logistic regression problem in Example 2. The problem is formulated in (4.7), and its SAA counterpart is of the form (4.8) with domain $\mathcal{X} = \{x \in \mathbb{R}^d : \|x\|_2 \leq r\}$ for a fixed radius $r$. Note that from Example 2, $f$ is Lipschitz smooth, $\hat{F}_{nm}(x)$ satisfies the QG condition on any compact convex set, and with high probability it has a unique minimizer for large $n$. Theorem 4.2 implies that the theoretically optimal sample allocation strategy is $n = O(\sqrt{T})$ and $m = O(\sqrt{T})$.

In the experiment, we set $d = 10$, and the samples of $\xi = (a, b)$ and $\eta$ are generated as follows: $a_i \sim \mathcal{N}(0, \sigma_\xi^2 I_d)$, $b_i = \pm 1$ according to the sign of $a_i^\top x^*$, and $\eta_{ij} \sim \mathcal{N}(a_i, \sigma_\eta^2 I_d)$. We set $\sigma_\xi = 1$ and consider three cases, $\sigma_\eta \in \{0.1, 10, 100\}$, corresponding to low, medium, and high variance from the inner randomness. For a range of sample budgets $T$, four sample allocation strategies of the form $n = [T^a]$ (with $m = [T/n]$) are considered. We then compute the average estimation error $F(\hat{x}_{nm}) - F^*$ over 30 runs and its standard deviation. The results are summarized in Figure 6.1, where the $x$-axis denotes the sample budget $T$ and the $y$-axis shows the estimation error. Each curve represents a sampling scheme, showing the average error and the upper confidence bound.

The trend in Figure 6.1(a)–(c) shows that when the inner variance is relatively large, setting $n = O(T^{1/2})$ consistently outperforms the other sampling strategies, which matches our analysis. The error bars suggest that a larger number of outer samples results in a smaller deviation of the estimation accuracy.

We now examine the robust regression problem, where the objective is no longer Lipschitz smooth. The problem is as follows:
$$
(6.1) \qquad \min_{x\in\mathcal{X}} F(x) = \mathbb{E}_{\xi=(a,b)} \big| \mathbb{E}_{\eta\mid\xi}\, \eta^\top x - b \big|,
$$
where $a \in \mathbb{R}^d$ is a random feature vector, $b \in \mathbb{R}$ is the label, $\eta = a + \mathcal{N}(0, \sigma_\eta^2 I_d)$ is a perturbed noisy observation of the input feature vector $a$, and the domain is $\mathcal{X} = \{x \in \mathbb{R}^d : \|x\|_2 \leq r\}$. For comparison purposes, we also consider the smoothed version of this problem based on the Huber function:
$$
\min_{x\in\mathcal{X}} F_\gamma(x) = \mathbb{E}_{\xi=(a,b)} H\big( \mathbb{E}_{\eta\mid\xi}\, \eta^\top x - b,\; \gamma \big),
$$
where $\gamma > 0$ is the smoothing parameter. The corresponding empirical objectives are
$$
\hat{F}_{nm}(x) = \frac{1}{n}\sum_{i=1}^n \Big| \frac{1}{m}\sum_{j=1}^m \eta_{ij}^\top x - b_i \Big|, \qquad \hat{F}^{\gamma}_{nm}(x) = \frac{1}{n}\sum_{i=1}^n H\Big( \frac{1}{m}\sum_{j=1}^m \eta_{ij}^\top x - b_i,\; \gamma \Big).
$$
Theorem 4.1 and Theorem 4.2 indicate that Lipschitz smoothness of the outer function $f_\xi(x)$ helps reduce the inner sample size required to achieve the same level of accuracy. For a given budget $T$, the theoretically optimal sample allocation strategies for these two problems are $n = O(T^{1/3})$ and $n = O(T^{1/2})$, respectively.

In our experiment, we set $d = 20$. Samples of $\xi = (a, b)$ and $\eta$ are generated as follows: $a_i \sim \mathcal{N}(0, \sigma_\xi^2 I_d)$, $b_i = a_i^\top x^*$, and $\eta_{ij} \sim \mathcal{N}(a_i, \sigma_\eta^2 I_d)$; a minimal code sketch of this setup is given below.
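For concreteness, the following minimal sketch, ours and not the authors' experimental code, generates data as described above and solves the SAA of (6.1) or its Huber-smoothed counterpart with CVXPY. The radius `r`, the specific allocation, and the helper names are illustrative assumptions; note that CVXPY's built-in `huber` differs from $H(\cdot,\gamma)$ in (5.9) by a factor of $2\gamma$, which is rescaled in the objective.

```python
import numpy as np
import cvxpy as cp

def generate_data(n, m, d, sigma_xi, sigma_eta, x_star, rng):
    # a_i ~ N(0, sigma_xi^2 I_d), b_i = a_i^T x*, eta_ij ~ N(a_i, sigma_eta^2 I_d).
    A = rng.normal(0.0, sigma_xi, size=(n, d))
    b = A @ x_star
    Eta_bar = A + rng.normal(0.0, sigma_eta, size=(n, m, d)).mean(axis=1)
    return Eta_bar, b  # Eta_bar[i] = (1/m) sum_j eta_ij

def solve_saa(Eta_bar, b, r, gamma=None):
    # SAA of (6.1) (absolute value loss) or of its Huber-smoothed version.
    n, d = Eta_bar.shape
    x = cp.Variable(d)
    resid = Eta_bar @ x - b
    if gamma is None:
        obj = cp.sum(cp.abs(resid)) / n
    else:
        # cvxpy's huber(y, M) equals y^2 for |y| <= M and 2M|y| - M^2 otherwise,
        # i.e. 2*gamma*H(y, gamma) in the notation of (5.9), hence the rescaling.
        obj = cp.sum(cp.huber(resid, gamma)) / (2 * gamma * n)
    prob = cp.Problem(cp.Minimize(obj), [cp.norm(x, 2) <= r])
    prob.solve()
    return x.value

rng = np.random.default_rng(0)
d, T, r = 20, 10_000, 10.0               # r is a placeholder radius for the ball constraint
x_star = rng.normal(size=d); x_star *= r / (2 * np.linalg.norm(x_star))
n = int(round(T ** (1 / 3))); m = T // n  # e.g. the n = O(T^{1/3}) allocation
Eta_bar, b = generate_data(n, m, d, sigma_xi=1.0, sigma_eta=10.0, x_star=x_star, rng=rng)
x_hat = solve_saa(Eta_bar, b, r, gamma=None)
```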
As in the previous experiment, we measure the average error and the upper confidence bound for both problems, for a range of sample budgets $T$, under four different sample allocation strategies, over 30 runs. We also consider two smoothness parameters, $\gamma \in \{0.1, 10\}$. The results are summarized in Figure 6.2.

Figure 6.2(a)–(c) shows that setting $n = O(\sqrt{T})$ indeed yields almost the best accuracy for absolute value loss minimization, which again matches our analysis. The overall performance of SAA for the original problem and for the smoothed problems behaves quite similarly in this case, yet solving the smoothed problem yields much better accuracy under the same budget. This also supports our theoretical finding that the sample complexity is lower for smooth problems.

In the final experiment, we consider a modified logistic regression example that falls into the special case with independent inner and outer randomness:
$$
\min_{x\in\mathcal{X}} F(x) = \mathbb{E}_{\xi=(a,b)} \log\big(1 + \exp\big(-b\, (\mathbb{E}_\eta \eta + a)^\top x\big)\big),
$$
where $a \sim \mathcal{N}(0, \sigma_\xi^2 I_d) \in \mathbb{R}^d$ is a random feature vector, $b \in \{\pm 1\}$, and $\eta \sim \mathcal{N}(0, \sigma_\eta^2 I_d)$ is the noise. The empirical function $\hat{F}_{nm}(x)$ for the two sampling schemes is of the form
$$
\hat{F}_{nm}(x) = \frac{1}{n}\sum_{i=1}^n \log\bigg(1 + \exp\Big(-b_i \Big(\frac{1}{m}\sum_{j=1}^m \eta_{ij} + a_i\Big)^\top x\Big)\bigg).
$$
When employing the independent sampling scheme, we generate $\{\eta_j\}_{j=1}^m$ and let $\eta_{ij} = \eta_j$ for all $i$. For a given total budget $T$, the inner sample size is set to $m = T/n$ under conditional sampling and $m = T - n$ under independent sampling (see the sketch below). In the experiment we set $\sigma_\xi = 1$ and $\sigma_\eta = 10$, consider two choices of the dimension $d$, and generate the samples accordingly. For any given sample budget $T$, we compare the performance of the two sampling schemes under different choices of the outer sample size $n$, varying from 0 to 10000.

Figure 6.3(a) illustrates the comparison when $d = 10$ and $T = 10000$. The bell shape in Figure 6.3(a) reflects a clear bias-variance tradeoff between different choices of $n$ and $m$.
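The following minimal NumPy/CVXPY sketch, ours and for illustration only, contrasts the two ways of forming the inner-sample averages used in this comparison: conditional sampling draws fresh inner samples for every outer sample, while independent sampling reuses a single batch $\{\eta_j\}$ across all outer samples. The radius, the fixed $n$, and the helper names are illustrative assumptions; the budget accounting $m = T/n$ versus $m = T - n$ follows the description above.

```python
import numpy as np
import cvxpy as cp

def logistic_saa(A, b, Eta_bar, r=10.0):
    # Minimize (1/n) sum_i log(1 + exp(-b_i (eta_bar_i + a_i)^T x)) over ||x||_2 <= r (r is a placeholder).
    n, d = A.shape
    x = cp.Variable(d)
    z = cp.multiply(b, (Eta_bar + A) @ x)
    prob = cp.Problem(cp.Minimize(cp.sum(cp.logistic(-z)) / n), [cp.norm(x, 2) <= r])
    prob.solve()
    return x.value

def inner_averages(n, m, d, sigma_eta, rng, independent):
    if independent:
        eta = rng.normal(0.0, sigma_eta, size=(m, d))                    # one shared batch {eta_j}
        return np.tile(eta.mean(axis=0), (n, 1))                         # same average for every i
    return rng.normal(0.0, sigma_eta, size=(n, m, d)).mean(axis=1)       # fresh eta_ij for each i

rng = np.random.default_rng(0)
d, T, n, sigma_xi, sigma_eta = 10, 10_000, 100, 1.0, 10.0
x_star = rng.normal(size=d)
A = rng.normal(0.0, sigma_xi, size=(n, d))
b = np.sign(A @ x_star)
x_cond = logistic_saa(A, b, inner_averages(n, T // n, d, sigma_eta, rng, independent=False))
x_ind = logistic_saa(A, b, inner_averages(n, T - n, d, sigma_eta, rng, independent=True))
```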
Fig. 6.2. Error of SAA for the absolute value loss and the Huber loss, dimension $d = 20$. Panels (a)–(c): absolute value loss with $\sigma_\eta/\sigma_\xi = 0.1, 10, 100$; panels (d)–(f): Huber loss with $\gamma = 0.1$; panels (g)–(i): Huber loss with $\gamma = 10$.

Fig. 6.3. Comparison of the conditional sampling and independent sampling schemes. Panel (a): $d = 10$, $T = 10000$, varying outer sample size $n$; panel (b): various $d$ and $T$.
In Figure 6.3(b), we report the best performance (obtained by choosing the best $n$) of the two sampling schemes for the two choices of $d$ and for $T$ ranging from 1000 to 50000. Figure 6.3(b) shows that the independent sampling scheme always achieves a smaller error for this logistic regression problem. The gap between the two schemes decreases as the dimension increases, which also matches our analysis.

7. Conclusion. In this paper, we introduce the class of conditional stochastic optimization problems and provide a sample complexity analysis of sample average approximation under different structural assumptions. Our results show that the overall sample complexity can be significantly reduced under the Lipschitz smoothness condition, which is very different from the theory of classical stochastic optimization and multi-stage stochastic programming. By exploiting error bound conditions, the sample complexity can be further reduced. To the best of our knowledge, these are the first non-asymptotic sample complexity results established in the context of conditional stochastic optimization. For future work, we will investigate stochastic approximation algorithms for solving this family of problems and establish their sample complexities.
Appendix A. Proof of Propositions.

A.1. Proof of Lemma 2.1.
Proof.
The proof for a one-dimensional random variable was given in [20] using the Chernoff bound. Based on that result, we consider the case when $X$ is a zero-mean random vector in $\mathbb{R}^k$. Denote $X_i = (X_i^1, X_i^2, \cdots, X_i^k)^\top$ for $i = 1, \cdots, n$, $\sigma_j^2 = \mathbb{V}(X^j)$, $z_j = \big(\sum_{l=1}^k \sigma_l^2\big)/\sigma_j^2$, and let $I_j(\cdot)$ be the rate function of the $j$th coordinate of the random vector $X$. We have
$$
(A.1) \qquad \mathbb{P}(\|\bar{X}\| \geq \epsilon) = \mathbb{P}\Big( \sum_{j=1}^k (\bar{X}^j - \mathbb{E}X^j)^2 \geq \epsilon^2 \Big) \leq \sum_{j=1}^k \mathbb{P}\Big( (\bar{X}^j)^2 \geq \frac{\epsilon^2}{z_j} \Big) = \sum_{j=1}^k \mathbb{P}\Big( |\bar{X}^j| \geq \frac{\epsilon}{\sqrt{z_j}} \Big) \leq \sum_{j=1}^k 2\exp\Big( -n \min\Big\{ I_j\Big(\frac{\epsilon}{\sqrt{z_j}}\Big),\; I_j\Big(-\frac{\epsilon}{\sqrt{z_j}}\Big) \Big\} \Big).
$$
By the one-dimensional case of Lemma 2.1 and the definition of $z_j$, we get
$$
\mathbb{P}(\|\bar{X}\| \geq \epsilon) \leq \sum_{j=1}^k 2\exp\Big( -\frac{n\epsilon^2}{(\delta+2)\, z_j \sigma_j^2} \Big) = 2k \exp\Big( -\frac{n\epsilon^2}{(\delta+2) \sum_{j=1}^k \sigma_j^2} \Big).
$$
Using the fact that $\sum_{j=1}^k \sigma_j^2 \leq \mathbb{E}\|X\|^2$, we obtain the desired result.

Appendix B. Proof of Theorem 5.2.
Convergence Analysis.
We follow a decomposition similar to the one used in proving Theorem 4.2 and use the same notation, such as $\hat{F}^{(k)}_{nm}(x)$ and $\hat{x}^{(k)}_{nm}$ for the perturbed empirical function and its minimizer, except that we replace all the $\eta_{kj}$ with $\eta_j$ for $k = 1, \cdots, n$ and replace the conditional expectation $\mathbb{E}_{\eta\mid\xi}$ with $\mathbb{E}_\eta$. Unfortunately, one immediately notices that Lemma 3.1 is no longer applicable for bounding the second term in (4.13),
$$
\mathbb{E}\bigg[ \frac{1}{n}\sum_{k=1}^n f_{\xi_k}\Big( \mathbb{E}_\eta g_\eta(\hat{x}^{(k)}_{nm}, \xi_k) \Big) - \frac{1}{n}\sum_{k=1}^n f_{\xi_k}\Big( \frac{1}{m}\sum_{j=1}^m g_{\eta_j}(\hat{x}^{(k)}_{nm}, \xi_k) \Big) \bigg],
$$
because the minimizer $\hat{x}^{(k)}_{nm}$ now depends on $\{\eta_j\}_{j=1}^m$. Below we provide the detailed proof of Theorem 5.2.

Proof.
Define $\mathcal{E} := F(\hat{x}_{nm}) - \hat{F}_{nm}(\hat{x}_{nm})$, and let
$$
\hat{F}^{(k)}_{nm}(x) := \frac{1}{n}\sum_{i \neq k} f_{\xi_i}\Big( \frac{1}{m}\sum_{j=1}^m g_{\eta_j}(x, \xi_i) \Big) + \frac{1}{n} f_{\xi'_k}\Big( \frac{1}{m}\sum_{j=1}^m g_{\eta_j}(x, \xi'_k) \Big)
$$
be the empirical function obtained by replacing the outer sample $\xi_k$ with an i.i.d. copy $\xi'_k$. Denote $\hat{x}^{(k)}_{nm} = \operatorname{argmin}_{x\in\mathcal{X}} \hat{F}^{(k)}_{nm}(x)$. Then $\mathbb{E}\mathcal{E}$ can be written as
$$
(B.1) \qquad \mathbb{E}\mathcal{E} = \mathbb{E}\bigg[ F(\hat{x}_{nm}) - \frac{1}{n}\sum_{k=1}^n f_{\xi_k}\Big( \mathbb{E}_\eta g_\eta(\hat{x}^{(k)}_{nm}, \xi_k) \Big) \bigg] + \mathbb{E}\bigg[ \frac{1}{n}\sum_{k=1}^n f_{\xi_k}\Big( \mathbb{E}_\eta g_\eta(\hat{x}^{(k)}_{nm}, \xi_k) \Big) - \frac{1}{n}\sum_{k=1}^n f_{\xi_k}\Big( \frac{1}{m}\sum_{j=1}^m g_{\eta_j}(\hat{x}^{(k)}_{nm}, \xi_k) \Big) \bigg] + \mathbb{E}\bigg[ \frac{1}{n}\sum_{k=1}^n f_{\xi_k}\Big( \frac{1}{m}\sum_{j=1}^m g_{\eta_j}(\hat{x}^{(k)}_{nm}, \xi_k) \Big) - \hat{F}_{nm}(\hat{x}_{nm}) \bigg].
$$
Since $\xi_k$ and $\xi'_k$ are i.i.d., $\hat{x}_{nm}$ and $\hat{x}^{(k)}_{nm}$ follow the same distribution, so $\mathbb{E}F(\hat{x}_{nm}) = \mathbb{E}F(\hat{x}^{(k)}_{nm})$. As $\hat{x}^{(k)}_{nm}$ is independent of $\xi_k$, by the definition of $F(x)$ we know $\mathbb{E}F(\hat{x}^{(k)}_{nm}) = \mathbb{E} f_{\xi_k}\big( \mathbb{E}_\eta g_\eta(\hat{x}^{(k)}_{nm}, \xi_k) \big)$ for any $k = 1, \cdots, n$. As a result, the first term is 0.

To analyze the second term, denote
$$
H_k(x) := f_{\xi_k}\Big( \mathbb{E}_\eta g_\eta(x, \xi_k) \Big) - f_{\xi_k}\Big( \frac{1}{m}\sum_{j=1}^m g_{\eta_j}(x, \xi_k) \Big).
$$
We pick a $\upsilon$-net $\{x_l\}_{l=1}^Q$ of the decision set $\mathcal{X}$, such that for any $x \in \mathcal{X}$ there exists $l \in \{1, \cdots, Q\}$ with $\|x - x_l\| \leq \upsilon$. Then it holds for any $s > 0$ that
$$
(B.2) \qquad \exp\big( s\, \mathbb{E} H_k(\hat{x}^{(k)}_{nm}) \big) \leq \exp\Big( s\, \mathbb{E} \max_{l=1,\cdots,Q} H_k(x_l) + 2s\upsilon L_f L_g \Big) \leq \mathbb{E} \exp\Big( s \max_{l=1,\cdots,Q} H_k(x_l) + 2s\upsilon L_f L_g \Big) = \mathbb{E} \max_{l=1,\cdots,Q} \exp\big( s H_k(x_l) + 2s\upsilon L_f L_g \big) \leq \sum_{l=1}^Q \mathbb{E} \exp\big( s H_k(x_l) + 2s\upsilon L_f L_g \big).
$$
The first inequality holds as $\hat{x}^{(k)}_{nm}$ is independent of $\xi_k$ and $f_\xi(\cdot)$ and $g_\eta(\cdot, \xi)$ are Lipschitz continuous, which implies
$$
H_k(\hat{x}^{(k)}_{nm}) \leq \sup_{x\in\mathcal{X}} H_k(x) \leq \max_{l=1,\cdots,Q} H_k(x_l) + 2\upsilon L_f L_g .
$$
The second inequality holds by Jensen's inequality. Next we show that $H_k(x_l) - \mathbb{E} H_k(x_l)$ is a sub-Gaussian random variable for any given $\xi_k$. Note that, given $\xi_k$, $H_k(x_l)$ is a function of $\{\eta_j\}_{j=1}^m$; denote $H_k(x_l) =: \tilde{H}(\eta_1, \ldots, \eta_m)$. Then for any $p \in [m]$, and given $\eta_1, \ldots, \eta_{p-1}, \eta_{p+1}, \cdots, \eta_m$, we have
$$
\sup_{\eta'_p} \tilde{H}(\eta_1, \cdots, \eta'_p, \cdots, \eta_m) - \inf_{\eta''_p} \tilde{H}(\eta_1, \cdots, \eta''_p, \cdots, \eta_m) = \sup_{\eta'_p, \eta''_p} \bigg[ f_{\xi_k}\Big( \frac{1}{m}\sum_{j\neq p} g_{\eta_j}(x_l, \xi_k) + \frac{1}{m} g_{\eta''_p}(x_l, \xi_k) \Big) - f_{\xi_k}\Big( \frac{1}{m}\sum_{j\neq p} g_{\eta_j}(x_l, \xi_k) + \frac{1}{m} g_{\eta'_p}(x_l, \xi_k) \Big) \bigg] \leq \sup_{\eta'_p, \eta''_p} \frac{L_f}{m} \Big| g_{\eta''_p}(x_l, \xi_k) - g_{\eta'_p}(x_l, \xi_k) \Big| \leq \frac{2 M_g L_f}{m},
$$
where $M_g$ is an upper bound of $|g_\eta(\cdot, \xi)|$ on $\mathcal{X}$. It implies that $H_k(x_l) = \tilde{H}(\eta_1, \cdots, \eta_m)$ has bounded differences $2M_g L_f/m$. By McDiarmid's inequality [26], for any $r > 0$,
$$
\mathbb{P}\big( H_k(x_l) - \mathbb{E}H_k(x_l) \geq r \big) \leq \exp\Big( -\frac{m r^2}{2 M_g^2 L_f^2} \Big).
$$
It implies that $H_k(x_l) - \mathbb{E}H_k(x_l)$ is a sub-Gaussian random variable with zero mean and variance proxy $M_g^2 L_f^2/m$ for any given $\xi_k$.
By definition, it yields
$$
\mathbb{E} \exp\big( s [ H_k(x_l) - \mathbb{E}H_k(x_l) ] \big) \leq \exp\Big( \frac{M_g^2 L_f^2 s^2}{2m} \Big).
$$
Since $x_l$ is independent of the random vectors $\{\eta_j\}_{j=1}^m$, by Lemma 3.1 we know $\mathbb{E} H_k(x_l) \leq L_f \sigma_g/\sqrt{m}$. It further implies
$$
\mathbb{E}\exp\big( s H_k(x_l) \big) \leq \exp\Big( \frac{M_g^2 L_f^2 s^2}{2m} + \frac{s L_f \sigma_g}{\sqrt{m}} \Big).
$$
With (B.2), we have
$$
\exp\big( s\, \mathbb{E} H_k(\hat{x}^{(k)}_{nm}) \big) \leq Q \exp\Big( \frac{M_g^2 L_f^2 s^2}{2m} + \frac{s L_f \sigma_g}{\sqrt{m}} + 2 s \upsilon L_f L_g \Big).
$$
Taking the logarithm, dividing by $s$ on each side, and minimizing over $s$ yields
$$
\mathbb{E} H_k(\hat{x}^{(k)}_{nm}) \leq \sqrt{\frac{2\log Q}{m}}\, L_f M_g + \frac{L_f \sigma_g}{\sqrt{m}} + 2 \upsilon L_f L_g .
$$
Since $Q \leq O(1)(D_{\mathcal{X}}/\upsilon)^d$, we have
$$
(B.3) \qquad \mathbb{E} H_k(\hat{x}^{(k)}_{nm}) \leq O(1)\, \frac{L_f M_g}{\sqrt{m}} \sqrt{d \log\Big(\frac{D_{\mathcal{X}}}{\upsilon}\Big)} + \frac{L_f \sigma_g}{\sqrt{m}} + 2 \upsilon L_f L_g .
$$
For the third term in (B.1), by following steps similar to (4.15)–(4.18), we obtain
$$
(B.4) \qquad \mathbb{E}\bigg[ \frac{1}{n}\sum_{k=1}^n f_{\xi_k}\Big( \frac{1}{m}\sum_{j=1}^m g_{\eta_j}(\hat{x}^{(k)}_{nm}, \xi_k) \Big) - \hat{F}_{nm}(\hat{x}_{nm}) \bigg] \leq L_f L_g \Big( \frac{L_f L_g}{\mu n} \Big)^{1/\delta} .
$$
Combining (B.1), (B.3), and (B.4),
$$
(B.5) \qquad \mathbb{E}\mathcal{E} \leq L_f L_g \Big( \frac{L_f L_g}{\mu n} \Big)^{1/\delta} + O(1)\, \frac{L_f M_g}{\sqrt{m}} \sqrt{d \log\Big(\frac{D_{\mathcal{X}}}{\upsilon}\Big)} + \frac{L_f \sigma_g}{\sqrt{m}} + 2 \upsilon L_f L_g .
$$
Similar to the steps from (4.21) to (4.22), by the optimality of $\hat{x}_{nm}$ for $\hat{F}_{nm}$ and Lemma 3.1,
$$
(B.6) \qquad \mathbb{E}\big[ \hat{F}_{nm}(\hat{x}_{nm}) - F(x^*) \big] \leq \mathbb{E}\big[ \hat{F}_{nm}(x^*) - F(x^*) \big] \leq \frac{L_f \sigma_g}{\sqrt{m}} .
$$
Finally, combining (B.5) and (B.6) with Markov's inequality, we obtain (5.7). Setting
$$
\frac{L_f L_g}{\epsilon} \Big( \frac{L_f L_g}{\mu n} \Big)^{1/\delta} \leq \frac{\alpha}{4}, \qquad O(1)\, \frac{L_f M_g}{\sqrt{m}\,\epsilon} \sqrt{d \log\Big(\frac{D_{\mathcal{X}}}{\upsilon}\Big)} \leq \frac{\alpha}{4}, \qquad \frac{2 L_f \sigma_g}{\sqrt{m}\,\epsilon} \leq \frac{\alpha}{4},
$$
together with the choice of $\upsilon$ so that $2\upsilon L_f L_g/\epsilon \leq \alpha/4$, we obtain the desired sample complexity (5.8).
Appendix C. Example of Huber Loss Minimization.
To show (5.10), denote $Y = \mathbb{E}\eta - \bar{\eta}$; then $Y \sim \mathcal{N}(0, \sigma_\eta^2/m)$. The error of SAA is
$$
(C.1) \qquad \mathbb{E} F(\hat{x}_m) - F(x^*) = \mathbb{E} H(\mathbb{E}\eta - \bar{\eta}, \gamma) + \mathbb{E}(\bar{\eta} - \mathbb{E}\eta)^2 = \int_{-\gamma}^{\gamma} \frac{y^2}{2\gamma}\, p(y)\, dy + 2\int_{\gamma}^{+\infty} \Big( y - \frac{\gamma}{2} \Big) p(y)\, dy + \mathbb{E}Y^2,
$$
where $p(y) = \frac{\sqrt{m}}{\sqrt{2\pi}\,\sigma_\eta} \exp\big( -\frac{m y^2}{2\sigma_\eta^2} \big)$ is the PDF of $Y$, and $\mathbb{E}Y^2 = \sigma_\eta^2/m$. Recall $\operatorname{erf}(x) := \frac{2}{\sqrt{\pi}}\int_0^x \exp(-t^2)\,dt$, and substitute $u := y\sqrt{m/(2\sigma_\eta^2)}$. The first term in (C.1) is
$$
\int_{-\gamma}^{\gamma} \frac{y^2}{2\gamma}\, p(y)\, dy = \frac{2\sigma_\eta^2}{m\gamma\sqrt{\pi}} \int_0^{\gamma\sqrt{m/(2\sigma_\eta^2)}} u^2 \exp(-u^2)\, du = \frac{\sigma_\eta^2}{2\gamma m} \operatorname{erf}\Big( \gamma\sqrt{\frac{m}{2\sigma_\eta^2}} \Big) - \sqrt{\frac{\sigma_\eta^2}{2\pi m}} \exp\Big( -\frac{\gamma^2 m}{2\sigma_\eta^2} \Big).
$$
Here we use the fact that
$$
\int_0^z x^2 \exp(-x^2)\, dx = \frac{\sqrt{\pi}}{4}\operatorname{erf}(z) - \frac{1}{2}\exp(-z^2)\, z .
$$
The second term in (C.1) satisfies
$$
0 \leq 2\int_{\gamma}^{+\infty} \Big( y - \frac{\gamma}{2} \Big) p(y)\, dy - \int_{\gamma}^{+\infty} y\, p(y)\, dy = \int_{\gamma}^{+\infty} (y - \gamma)\, p(y)\, dy \leq \int_{\gamma}^{+\infty} y\, p(y)\, dy = \sqrt{\frac{\sigma_\eta^2}{2\pi m}} \exp\Big( -\frac{m\gamma^2}{2\sigma_\eta^2} \Big).
$$
Combining these together, we have (5.10).
For a given $\gamma > 0$, $\operatorname{erf}\big( \gamma\sqrt{m/(2\sigma_\eta^2)} \big) \to 1$ as $m \to \infty$. By (5.10), we have
$$
\mathbb{E} F(\hat{x}_m) - F(x^*) = O\Big( \frac{1}{m} \Big).
$$
When $\gamma \to 0$, (C.1) becomes
$$
\lim_{\gamma\to 0} \mathbb{E} F(\hat{x}_m) - F(x^*) = \lim_{\gamma\to 0} \bigg[ \int_{-\gamma}^{\gamma} \frac{y^2}{2\gamma}\, p(y)\, dy + 2\int_{\gamma}^{+\infty} \Big( y - \frac{\gamma}{2} \Big) p(y)\, dy \bigg] + \frac{\sigma_\eta^2}{m} = \sqrt{\frac{2\sigma_\eta^2}{\pi m}} + \frac{\sigma_\eta^2}{m} = O\Big( \frac{1}{\sqrt{m}} \Big).
$$

Appendix D. Empirical Objectives Satisfying the Quadratic Growth Condition.

Strongly Convex Function Composed with a Linear Function. The empirical objective function is $\hat{F}_{nm}(x) = \frac{1}{n}\sum_{i=1}^n f_{\xi_i}(A_i x)$, where $f_\xi(\cdot)$ is $\mu$-strongly convex and $A_i x := \frac{1}{m}\sum_{j=1}^m g_{\eta_{ij}}(x, \xi_i)$ is the average of the linear inner functions $g_{\eta_{ij}}(x, \xi_i) := A_{\eta_{ij}} x$.

To show that $\hat{F}_{nm}(x)$ satisfies the QG condition, denote $u_i = A_i y$ and $v_i = A_i x$. Since $f_\xi(\cdot)$ is strongly convex,
$$
f_{\xi_i}(u_i) - f_{\xi_i}(v_i) - \nabla f_{\xi_i}(v_i)^\top (u_i - v_i) \geq \frac{\mu}{2}\|u_i - v_i\|^2 .
$$
Taking the average over the $n$ such inequalities, we obtain
$$
\frac{1}{n}\sum_{i=1}^n \Big[ f_{\xi_i}(u_i) - f_{\xi_i}(v_i) - \nabla f_{\xi_i}(v_i)^\top (u_i - v_i) \Big] \geq \frac{1}{n}\sum_{i=1}^n \frac{\mu}{2}\|u_i - v_i\|^2 .
$$
Replacing $u_i$, $v_i$ with $A_i y$ and $A_i x$, we have
$$
\frac{1}{n}\sum_{i=1}^n \Big[ f_{\xi_i}(A_i y) - f_{\xi_i}(A_i x) - \nabla f_{\xi_i}(A_i x)^\top A_i (y - x) \Big] \geq \frac{1}{n}\sum_{i=1}^n \frac{\mu}{2} (y-x)^\top A_i^\top A_i (y-x) .
$$
Since $\nabla \hat{F}_{nm}(x)^\top = \frac{1}{n}\sum_{i=1}^n \big( A_i^\top \nabla f_{\xi_i}(A_i x) \big)^\top = \frac{1}{n}\sum_{i=1}^n \nabla f_{\xi_i}(A_i x)^\top A_i$, we get
$$
\hat{F}_{nm}(y) - \hat{F}_{nm}(x) - \nabla \hat{F}_{nm}(x)^\top (y - x) \geq \frac{1}{n}\sum_{i=1}^n \frac{\mu}{2}\| A_i(y-x) \|^2 \geq \frac{\mu}{2}\Big\| \frac{1}{n}\sum_{i=1}^n A_i (y - x) \Big\|^2 .
$$
Letting $z$ be a point in $\mathcal{X}^*$, we have
$$
(D.1) \qquad \hat{F}_{nm}(x) - \hat{F}_{nm}(z) \geq \frac{\mu}{2}\Big\| \frac{1}{n}\sum_{i=1}^n A_i (x - z) \Big\|^2 \geq \frac{\mu\, \theta^2\big( \frac{1}{n}\sum_{i=1}^n A_i \big)}{2} \| x - z \|^2 \geq \min_{z\in\mathcal{X}^*} \frac{\mu\, \theta^2\big( \frac{1}{n}\sum_{i=1}^n A_i \big)}{2} \| x - z \|^2 .
$$
Here $\theta(A)$ denotes the smallest non-zero singular value of $A$. Thus $\hat{F}_{nm}(x)$ satisfies the quadratic growth condition for any $n$ and $m$. A special case is $n = m = 1$, i.e., a strongly convex objective composed with a linear function satisfies the QG condition.

Some Strictly Convex Functions Composed with a Linear Function on a Compact Set.
Consider Example 2, the logistic regression problem with objective
$$
F(x) = \mathbb{E}_{\xi=(a,b)} \log\big( 1 + \exp\big( -b\, \mathbb{E}_{\eta\mid\xi}[\eta]^\top x \big) \big),
$$
where $a \in \mathbb{R}^d$ is a random feature vector, $b \in \{1, -1\}$ is the label, and $\eta = a + \mathcal{N}(0, \sigma^2 I_d)$ is a perturbed noisy observation of the input feature vector $a$. Its empirical objective function $\hat{F}_{nm}(x)$ is given by
$$
\hat{F}_{nm}(x) = \frac{1}{n}\sum_{i=1}^n \log\bigg( 1 + \exp\Big( -b_i \frac{1}{m}\sum_{j=1}^m \eta_{ij}^\top x \Big) \bigg),
$$
where $\mathbb{E}\eta_{ij} = a_i$. Here $f_{\xi_i}(u) = \log(1 + \exp(-b_i u))$, which is strictly convex in $u$ for $b_i \in \{1, -1\}$, so $\hat{F}_{nm}(x) = \frac{1}{n}\sum_{i=1}^n f_{\xi_i}(u_i)$ with $u_i = \frac{1}{m}\sum_{j=1}^m \eta_{ij}^\top x$ bounded for any $x \in \mathcal{X}$ and any realization of $\eta_{ij}$. It is easy to verify that on any compact set, $f_{\xi_i}(u)$ is strongly convex, with a strong convexity parameter depending on the compact set. With (D.1), $\hat{F}_{nm}(x)$ satisfies the QG condition.

Note that the result is not necessarily true for all strictly convex functions. For instance, $\|x\|^4$ is strictly convex, but $\|Ax\|^4$ does not satisfy the quadratic growth condition on any compact set containing $x = 0$, as the numerical check below illustrates.
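A minimal numerical check, ours and for illustration only (the exponent 4 is as stated above), of why $\|Ax\|^4$ fails the quadratic growth condition near its minimizer $x = 0$: the ratio $(\hat{F}(x) - \hat{F}(0))/\|x\|^2$ vanishes as $x \to 0$, so no positive QG constant can exist.

```python
import numpy as np

A = np.eye(3)                      # any nonsingular A leads to the same conclusion
f = lambda x: np.linalg.norm(A @ x) ** 4

d = np.ones(3) / np.sqrt(3.0)      # a fixed direction; the minimizer is x = 0 with f(0) = 0
for t in [1.0, 0.1, 0.01, 0.001]:
    x = t * d
    print(t, f(x) / np.linalg.norm(x) ** 2)   # ratio ~ t^2 -> 0, violating quadratic growth
```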
Appendix E. Other Results on Regularized SAA.

Theorem 4.2 discusses the sample complexity of SAA in the strongly convex and QG cases. We now show that the result obtained in Theorem 4.2 can be used to obtain a dimension-free sample complexity for general convex objectives by adding an $\ell_2$-regularization term.
Lemma E.1 ([35]).
Consider a stochastic convex optimization problem $\min_{x\in\mathcal{X}} G(x)$, where $G(x)$ is the expectation of a convex random function. Suppose that the decision set $\mathcal{X} \subseteq \mathbb{R}^d$ has bounded diameter $D_{\mathcal{X}}$. Denote $G_\mu(x) := G(x) + \mu\|x\|^2$, where $\mu > 0$ is the strong convexity parameter. Denote by $\hat{G}(x)$ the SAA counterpart of $G(x)$, and let $x^* \in \operatorname{argmin}_{x\in\mathcal{X}} G(x)$, $\hat{x} \in \operatorname{argmin}_{x\in\mathcal{X}} \hat{G}(x)$, $x^*_\mu = \operatorname{argmin}_{x\in\mathcal{X}} G_\mu(x)$, and let $\hat{x}_\mu$ be the minimizer of the SAA of the regularized objective, namely $\hat{x}_\mu = \operatorname{argmin}_{x\in\mathcal{X}} \hat{G}_\mu(x) := \hat{G}(x) + \mu\|x\|^2$. If $\mathbb{E}[G_\mu(\hat{x}_\mu) - G_\mu(x^*_\mu)] \leq \beta(\mu)$, then
$$
\mathbb{E}[G(\hat{x}_\mu) - G(x^*)] \leq \beta(\mu) + \mu D_{\mathcal{X}}^2 .
$$
E.1.
This theorem shows that the minimum point ˆ x µ to a l -regularizedempirical function ˆ G µ could be a good solution to the original convex function G ( x ) as long as one selects µ properly. Note that ˆ x µ might not be a minimum point ofthe empirical function ˆ G ( x ) . In CSO case, according to Theorem 4.2, if F ( x ) isconvex, the expected error of SAA method for min x ∈X F ( x ) + µ || x || is bounded by β ( µ ) = L f L g µn +2∆( m ) . Then, E F (ˆ x nm ) − F ( x ∗ ) ≤ L f L g µn + µ D X +2∆( m ) . Minimizingover µ , and by Markov inequality, we obtain, P ( F (ˆ x nm ) − F ( x ∗ ) ≥ (cid:15) ) ≤ √ L f L g D X √ n(cid:15) + 2∆( m ) (cid:15) . We notice that the outer sample size, n = O (1 /(cid:15) ) , is dimensional free, while inTheorem 4.1, n = O ( d/(cid:15) ) , depends linearly in dimension; the inner sample size m is not affected. For high-dimensional problems, adding regularization is sometimesmore favorable as it lowers the sample complexity by d and also helps boosting theconvergence when solving the SAA. Acknowledgments.
Acknowledgments. We would like to acknowledge Alexander Shapiro and Lin Xiao for fruitful discussions and the reviewers for their helpful comments.
REFERENCES[1]
D. P. Bertsekas, Dynamic Programming and Optimal Control, Vol. I. [2]
D. Bertsimas, V. Gupta, and N. Kallus, Robust sample average approximation, Mathematical Programming, 171 (2017), pp. 217–282, https://doi.org/10.1007/s10107-017-1174-z. [3]
A. N. Bhagoji, D. Cullina, C. Sitawarin, and P. Mittal , Enhancing robustness of machinelearning systems via data transformations , in Information Sciences and Systems (CISS),2018 52nd Annual Conference on, IEEE, 2018, pp. 1–5, https://doi.org/10.1109/ciss.2018.8362326.[4]
J. Bolte, T. P. Nguyen, J. Peypouquet, and B. W. Suter , From error bounds to the com-plexity of first-order descent methods for convex functions , Mathematical Programming,165 (2017), pp. 471–507, https://doi.org/10.1007/s10107-016-1091-6.[5]
L. Bottou, F. Curtis, and J. Nocedal , Optimization methods for large-scale machine learn-ing , SIAM Review, 60 (2018), pp. 223–311, https://doi.org/10.1137/16m1080173.[6]
Z. Charles and D. Papailiopoulos , Stability and generalization of learning algorithms thatconverge to global optima , in Proceedings of the 35th International Conference on MachineLearning, vol. 80 of Proceedings of Machine Learning Research, PMLR, 10–15 Jul 2018,pp. 745–754, http://proceedings.mlr.press/v80/charles18a.html (accessed 2019-07-16).[7]
B. Dai, N. He, Y. Pan, B. Boots, and L. Song , Learning from conditional distributionsvia dual embeddings , in Proceedings of the 20th International Conference on ArtificialIntelligence and Statistics, vol. 54 of Proceedings of Machine Learning Research, PMLR,20–22 Apr 2017, pp. 1458–1467, http://proceedings.mlr.press/v54/dai17a.html (accessed2019-07-05).[8]
B. Dai, A. Shaw, L. Li, L. Xiao, N. He, Z. Liu, J. Chen, and L. Song , SBEED: Convergentreinforcement learning with nonlinear function approximation , in Proceedings of the 35thInternational Conference on Machine Learning, vol. 80 of Proceedings of Machine Learn-ing Research, PMLR, 10–15 Jul 2018, pp. 1125–1134, http://proceedings.mlr.press/v80/dai18c.html (accessed 2019-07-15).[9]
D. Dentcheva, S. Penev, and A. Ruszczy´nski , Statistical estimation of composite risk func-tionals and risk optimization problems , Annals of the Institute of Statistical Mathematics,69 (2016), pp. 737–760, https://doi.org/10.1007/s10463-016-0559-8.[10]
S. Diamond and S. Boyd , CVXPY: A Python-embedded modeling language for convex opti-mization , Journal of Machine Learning Research, 17 (2016), pp. 1–5, https://web.stanford.edu/ ∼ boyd/papers/pdf/cvxpy paper.pdf (accessed 2019-07-16).[11] D. Drusvyatskiy and A. S. Lewis , Error bounds, quadratic growth, and linear convergenceof proximal methods , Mathematics of Operations Research, 43 (2018), pp. 919–948, https://doi.org/10.1287/moor.2017.0889.[12]
Y. M. Ermoliev and V. I. Norkin , Sample average approximation method for compoundstochastic optimization problems , SIAM Journal on Optimization, 23 (2013), pp. 2231–2263, https://doi.org/10.1137/120863277.[13]
S. Ghadimi, A. Ruszczy´nski, and M. Wang , A single time-scale stochastic approximationmethod for nested stochastic optimization , Dec. 2018, https://arxiv.org/abs/1812.01094.[14]
P. Gong and J. Ye , Linear convergence of variance-reduced projected stochastic gradientwithout strong convexity , June 2014, https://arxiv.org/abs/1406.1102.[15]
L. J. Hong and S. Juneja , Estimating the mean of a non-linear function of conditionalexpectation , in Proceedings of the 2009 Winter Simulation Conference (WSC), IEEE, dec2009, https://doi.org/10.1109/wsc.2009.5429428.[16]
L. J. Hong, S. Juneja, and G. Liu , Kernel smoothing for nested estimation with applicationto portfolio risk measurement , Operations Research, 65 (2017), pp. 657–673, https://doi.org/10.1287/opre.2017.1591.[17]
Z. Huo, B. Gu, J. Liu, and H. Huang, Accelerated method for stochastic composition optimization with nonsmooth regularization. [18]
M. Jaskowski and S. Jaroszewicz , Uplift modeling for clinical trial data , in ICML Workshopon Clinical Data Analysis, 2012, http://people.cs.pitt.edu/ ∼ milos/icml clinicaldata 2012/Papers/Oral Jaroszewitz ICML Clinical 2012.pdf (accessed 2019-07-15).[19] H. Karimi, J. Nutini, and M. Schmidt , Linear convergence of gradient and proximal-gradientmethods under the polyak-(cid:32)lojasiewicz condition , in Joint European Conference on MachineLearning and Knowledge Discovery in Databases, Springer, 2016, pp. 795–811, https://doi.org/10.1007/978-3-319-46128-1 50.[20]
A. J. Kleywegt, A. Shapiro, and T. H. de Mello , The sample average approximationmethod for stochastic discrete optimization , SIAM Journal on Optimization, 12 (2002),pp. 479–502, https://doi.org/10.1137/s1052623499363220.[21]
E. Kubi´nska , Approximation of carath´eodory functions and multifunctions , Real Analysis Ex-change, 30 (2005), p. 351, https://doi.org/10.14321/realanalexch.30.1.0351.[22]
X. Lian, M. Wang, and J. Liu , Finite-sum Composition Optimization via Variance Reduced radient Descent , in Proceedings of the 20th International Conference on Artificial In-telligence and Statistics, vol. 54 of Proceedings of Machine Learning Research, PMLR,20–22 Apr 2017, pp. 1159–1167, http://proceedings.mlr.press/v54/lian17a.html (accessed2019-07-16).[23] H. Liu, X. Wang, T. Yao, R. Li, and Y. Ye , Sample average approximation with sparsity-inducing penalty for high-dimensional stochastic programming , Mathematical Program-ming, 178 (2018), pp. 69–108, https://doi.org/10.1007/s10107-018-1278-0.[24]
J. Liu and S. J. Wright , Asynchronous stochastic coordinate descent: Parallelism andconvergence properties , SIAM Journal on Optimization, 25 (2015), pp. 351–376, https://doi.org/10.1137/140961134.[25]
P. Massart and ´E. N´ed´elec , Risk bounds for statistical learning , The Annals of Statistics,34 (2006), pp. 2326–2366, https://doi.org/10.1214/009053606000000786.[26]
C. McDiarmid , On the method of bounded differences , in Surveys in Combinatorics, CambridgeUniversity Press, 1989, pp. 148–188, https://doi.org/10.1017/cbo9781107359949.008.[27]
K. Muandet, A. Mehrjou, S. K. Lee, and A. Raj , Dual iv: A single stage instrumentalvariable regression , arXiv preprint arXiv:1910.12358, (2019), https://arxiv.org/abs/1910.12358.[28]
P. Niyogi, F. Girosi, and T. Poggio , Incorporating prior information in machine learningby creating virtual examples , Proceedings of the IEEE, 86 (1998), pp. 2196–2209, https://doi.org/10.1109/5.726787.[29]
B. K. Pagnoncelli, S. Ahmed, and A. Shapiro , Sample average approximation methodfor chance constrained programming: Theory and applications , Journal of Opti-mization Theory and Applications, 142 (2009), pp. 399–416, https://doi.org/10.1007/s10957-009-9523-6.[30]
M. V. F. Pereira and L. M. V. G. Pinto , Multi-stage stochastic optimization applied toenergy planning , Mathematical Programming, 52 (1991), pp. 359–375, https://doi.org/10.1007/bf01582895.[31]
B. Polyak, Minimization of composite regression functions. [32]
R. T. Rockafellar and R. J.-B. Wets , Scenarios and policy aggregation in optimizationunder uncertainty , Mathematics of Operations Research, 16 (1991), pp. 119–147, https://doi.org/10.1287/moor.16.1.119.[33]
A. Ruszczy´nski , Decomposition methods in stochastic programming , Mathematical Program-ming, 79 (1997), pp. 333–353, https://doi.org/10.1007/bf02614323.[34]
S. Shalev-Shwartz and S. Ben-David , Understanding machine learning: From theory toalgorithms , Cambridge University Press, 2014, https://doi.org/10.1017/cbo9781107298019.[35]
S. Shalev-Shwartz, O. Shamir, N. Srebro, and K. Sridharan , Learnability, stability anduniform convergence , Journal of Machine Learning Research, 11 (2010), pp. 2635–2670,https://doi.org/10.1007/978-3-642-34106-9 3.[36]
A. Shapiro , On complexity of multistage stochastic programs , Operations Research Letters, 34(2006), pp. 1–8, https://doi.org/10.1016/j.orl.2005.02.003.[37]
A. Shapiro, D. Dentcheva, and A. Ruszczy´nski , Lectures on stochastic programming: mod-eling and theory , Society for Industrial and Applied Mathematics, 2014, https://doi.org/10.1137/1.9780898718751.[38]
A. Shapiro and A. Nemirovski , On complexity of stochastic programming problems , inContinuous Optimization, Springer-Verlag, 2005, pp. 111–146, https://doi.org/10.1007/0-387-26771-9 4.[39]
S. Shen, L. Xu, J. Liu, J. Guo, and Q. Ling , Asynchronous stochastic composition optimiza-tion with variance reduction , Nov. 2018, https://arxiv.org/abs/1811.06396.[40]
R. S. Sutton, H. R. Maei, and C. Szepesv´ari , A convergent o ( n ) temporal-difference algorithm for off-policy learning with linear function approxi-mation , in Advances in Neural Information Processing Systems 21, Cur-ran Associates, Inc., 2009, pp. 1609–1616, http://papers.nips.cc/paper/3626-a-convergent-on-temporal-difference-algorithm-for-off-policy-learning-with-linear-function-approximation.pdf.[41] M. Wang, E. X. Fang, and H. Liu , Stochastic compositional gradient descent: algorithmsfor minimizing compositions of expected-value functions , Mathematical Programming, 161(2017), pp. 419–449, https://doi.org/10.1007/s10107-016-1017-3.[42]
M. Wang, J. Liu, and E. X. Fang, Accelerating stochastic composition optimization, Journal of Machine Learning Research, 18 (2017), pp. 1–23, http://jmlr.org/papers/v18/16-504.html. [43]
Y. Xu, Q. Lin, and T. Yang , Accelerated stochastic subgradient methods under local errorbound condition , July 2016, https://arxiv.org/abs/1607.01027.[44]
I. Yamane, F. Yger, J. Atif, and M. Sugiyama , Uplift modeling from separate labels , inAdvances in Neural Information Processing Systems 31, Curran Associates, Inc., 2018,pp. 9927–9937, http://papers.nips.cc/paper/8198-uplift-modeling-from-separate-labels.pdf.[45]