Sample Complexity of Sample Average Approximation for Conditional Stochastic Optimization
YIFAN HU∗, XIN CHEN∗, AND NIAO HE∗

∗Department of Industrial and Enterprise Systems Engineering (ISE), University of Illinois at Urbana-Champaign (UIUC), Urbana, IL ([email protected], [email protected], [email protected]).

Abstract.
In this paper, we study a class of stochastic optimization problems, referred to as Conditional Stochastic Optimization (CSO), in the form $\min_{x\in\mathcal{X}} \mathbb{E}_{\xi}\big[f_\xi\big(\mathbb{E}_{\eta|\xi}[g_\eta(x,\xi)]\big)\big]$, which finds a wide spectrum of applications, including portfolio selection, reinforcement learning, robust learning, causal inference, and so on. Assuming availability of samples from the distribution $P(\xi)$ and samples from the conditional distribution $P(\eta|\xi)$, we establish the sample complexity of the sample average approximation (SAA) for CSO under a variety of structural assumptions, such as Lipschitz continuity, smoothness, and error bound conditions. We show that the total sample complexity improves from $O(d/\epsilon^4)$ to $O(d/\epsilon^3)$ when assuming smoothness of the outer function, and further to $O(1/\epsilon^2)$ when the empirical function satisfies the quadratic growth condition. We also establish the sample complexity of a modified SAA when $\xi$ and $\eta$ are independent. Several numerical experiments further support our theoretical findings.

Key words. stochastic optimization, sample average approximation, large deviations theory
AMS subject classifications.
1. Introduction.
Decision-making in the presence of uncertainty has been a fundamental and long-standing challenge in many fields of science and engineering. In recent years, extensive research efforts have been devoted to the design and theory of efficient algorithms for solving the classical stochastic optimization (SO) problem

(1.1)  $\min_{x\in\mathcal{X}} F(x) := \mathbb{E}_{\xi}[f(x,\xi)]$,

ranging from convex to non-convex objectives, from first-order to second-order methods, and from sub-linear to linear convergent algorithms; see, e.g., [5] and references therein for a comprehensive survey. Here $\mathcal{X}\subseteq\mathbb{R}^d$ is the decision set and $f(x,\xi)$ is some cost function dependent on the random vector $\xi$. In general, (1.1) cannot be computed analytically or solved exactly, even when the underlying distribution of the random vector $\xi$ is known, and one has to resort to Monte Carlo sampling techniques. An important Monte Carlo method, the sample average approximation (SAA, also known as empirical risk minimization in the machine learning community), is widely used to solve (1.1), assuming availability of samples from the underlying distribution. SAA works by solving an approximation of the original problem:

(1.2)  $\min_{x\in\mathcal{X}} \hat F_n(x) := \frac{1}{n}\sum_{i=1}^{n} f(x,\xi_i)$,

where $\xi_1,\dots,\xi_n$ are i.i.d. samples generated from the distribution of $\xi$. Note that $\hat F_n(x)$ converges pointwise to $F(x)$ with probability 1 as $n$ goes to infinity. Finite-sample convergence of SAA for SO has been well established. The seminal work [20] proved that for general Lipschitz continuous objectives, SAA requires a sample complexity of $O(d/\epsilon^2)$ to obtain an $\epsilon$-optimal solution to the stochastic optimization problem. [35] proved that for strongly convex and Lipschitz continuous objectives, the sample complexity of SAA is $O(1/\epsilon)$. Detailed results can be found in the books, e.g., [37] and [34].

More generally, SAA is also a popular computational tool for solving multi-stage stochastic programming (MSP) problems. In its general form, an MSP finds a sequence of decisions $\{x_t\}_{t=0}^{T}$ that minimizes the nested expectation

(1.3)  $\min_{x_0\in\mathcal{X}_0} f_0(x_0) + \mathbb{E}_{\xi_1}\Big[\inf_{x_1} f_1(x_1,\xi_1) + \mathbb{E}_{\xi_2|\xi_1}\big[\cdots + \mathbb{E}_{\xi_T|\xi_{T-1}}\big[\inf_{x_T} f_T(x_T,\xi_T)\big]\big]\Big]$,

where $T$ is the number of decision periods, $\xi_1,\dots,\xi_T$ can be considered as a random process, and the decision $x_t$ is a function of the history of the process up to time $t$. Similarly, the SAA approach works by first generating a large scenario tree with conditional sampling and then proceeding with stage-based or scenario-based decomposition methods [30, 32, 33]. When extended to the multi-stage case, the finite sample analysis indicates that the total number of samples, or scenarios, needed to achieve an $\epsilon$-optimal solution to the original problem (1.3) grows exponentially as the number of stages increases [38, 37].
In particular, for general three-stage stochastic problems, the sample complexity of SAA cannot be smaller than $O(d^2/\epsilon^4)$; this holds true even if the cost functions in all stages are linear and the random vectors are stage-wise independent, as discussed in [36].

In this paper, we study an intermediate class of problems, referred to as Conditional Stochastic Optimization (CSO), that sits in between the classical SO and the MSP. The problem of interest takes the following general form:

(1.4)  $\min_{x\in\mathcal{X}} F(x) := \mathbb{E}_{\xi}\big[f_\xi\big(\mathbb{E}_{\eta|\xi}[g_\eta(x,\xi)]\big)\big]$.

Here $\mathcal{X}$ is the domain of the decision variable $x\in\mathbb{R}^d$; $f_\xi(\cdot):\mathbb{R}^k\to\mathbb{R}$ is a continuous cost function dependent on the random vector $\xi$, and $g_\eta(\cdot,\xi):\mathbb{R}^d\to\mathbb{R}^k$ is a vector-valued continuous cost function dependent on both random vectors $\xi$ and $\eta$. The inner expectation is with respect to $\eta$ given $\xi$, and the outer expectation is with respect to $\xi$. Similar to the classical stochastic optimization, we do not assume any knowledge of the underlying distribution $P(\xi)$ nor of the conditional distribution $P(\eta|\xi)$. Instead, we assume availability of samples from the distribution $P(\xi)$ and samples from the conditional distribution $P(\eta|\xi)$ for any given $\xi$.

CSO is more general than the classical stochastic optimization as it captures dynamic randomness and involves a conditional expectation. It takes the SO as a special case when $g_\eta(x,\xi)$ is the identity function in $x$. On the other hand, it is less complicated than the MSP (in particular, the three-stage case), as it seeks a static decision and is not subject to non-anticipativity constraints.

The goal of this paper is to analyze the sample complexity of SAA for solving CSO, which can be constructed as follows based on conditional sampling:

(1.5)  $\min_{x\in\mathcal{X}} \hat F_{nm}(x) := \frac{1}{n}\sum_{i=1}^{n} f_{\xi_i}\Big(\frac{1}{m}\sum_{j=1}^{m} g_{\eta_{ij}}(x,\xi_i)\Big)$,

where $\{\xi_i\}_{i=1}^{n}$ are i.i.d. samples generated from $P(\xi)$ and $\{\eta_{ij}\}_{j=1}^{m}$ are i.i.d. samples generated from the conditional distribution $P(\eta|\xi_i)$ for a given outer sample $\xi_i$. We would like to examine the total number of samples $T = nm + n$ required for the SAA (1.5) to achieve an $\epsilon$-optimal solution to the original CSO problem (1.4).

We also consider a special case of the CSO problem (1.4), when the random vectors $\xi$ and $\eta$ are independent:

(1.6)  $\min_{x\in\mathcal{X}} F(x) := \mathbb{E}_{\xi}\big[f_\xi\big(\mathbb{E}_{\eta}[g_\eta(x,\xi)]\big)\big]$.

One could still approximate (1.6) by the SAA (1.5), mimicking the conditional sampling scheme and using a different set of samples $\{\eta_{i1},\dots,\eta_{im}\}$ from the distribution of $\eta$ for each $\xi_i$. However, since the inner expectation is no longer a conditional expectation, there is no necessity to estimate it with different realizations of $\eta$ for each $\xi_i$. Hence, an alternative way to approximate (1.6) is through a modified SAA:

(1.7)  $\min_{x\in\mathcal{X}} \hat F_{nm}(x) := \frac{1}{n}\sum_{i=1}^{n} f_{\xi_i}\Big(\frac{1}{m}\sum_{j=1}^{m} g_{\eta_{j}}(x,\xi_i)\Big)$,

where $\{\xi_i\}_{i=1}^{n}$ are i.i.d. samples generated from the distribution of $\xi$ and $\{\eta_j\}_{j=1}^{m}$ are i.i.d. samples generated from the distribution of $\eta$. As a result, the component functions $f_{\xi_i}\big(\frac{1}{m}\sum_{j=1}^{m} g_{\eta_j}(x,\xi_i)\big)$, $i=1,\dots,n$, become dependent since they share the same $\{\eta_j\}_{j=1}^{m}$, making (1.7) very different from (1.5). In this case, the total number of samples becomes $T = n + m$. We refer to this sampling scheme as independent sampling.
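To make the two sampling schemes concrete, the following minimal Python sketch builds the empirical objectives (1.5) and (1.7) for a toy instance; the quadratic outer cost, the additive inner cost, and all distributional parameters are illustrative assumptions, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 3  # decision dimension (illustrative)

# Toy cost functions (assumptions for illustration only):
# outer f_xi(u) = ||u||^2, inner g_eta(x, xi) = x + xi + eta (so k = d).
f = lambda u, xi: np.sum(u ** 2)
g = lambda x, xi, eta: x + xi + eta

def saa_conditional(x, n, m):
    """Empirical objective (1.5): for each outer sample xi_i, draw m
    conditional inner samples eta_ij ~ P(eta | xi_i)."""
    total = 0.0
    for _ in range(n):
        xi = rng.normal(size=d)                      # xi_i ~ P(xi)
        etas = xi + rng.normal(size=(m, d))          # eta_ij ~ P(eta | xi_i), a toy choice
        inner_mean = np.mean([g(x, xi, e) for e in etas], axis=0)
        total += f(inner_mean, xi)
    return total / n                                 # consumes T = n*m + n samples

def saa_independent(x, n, m):
    """Modified empirical objective (1.7): one shared pool {eta_j}
    reused for every xi_i (valid when xi and eta are independent)."""
    xis = rng.normal(size=(n, d))                    # xi_i ~ P(xi)
    etas = rng.normal(size=(m, d))                   # eta_j ~ P(eta), shared across i
    vals = []
    for xi in xis:
        inner_mean = np.mean([g(x, xi, e) for e in etas], axis=0)
        vals.append(f(inner_mean, xi))
    return np.mean(vals)                             # consumes T = n + m samples

x0 = np.zeros(d)
print(saa_conditional(x0, n=50, m=20), saa_independent(x0, n=50, m=20))
```

The only structural difference between the two estimators is whether the inner samples are redrawn per outer sample or shared, which is exactly what drives the different sample counts $T = nm + n$ versus $T = n + m$.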
Notably, CSO can be used to model a variety of applications, including portfolio selection [16], robust supervised learning [7], reinforcement learning [7, 8], personalized medical treatment [44], instrumental variable regression [27], and so on. We discuss some of these examples in detail below.
Robust Supervised Learning.
Incorporating priors on invariance and robustness into supervised learning procedures is crucial for computer vision and speech recognition [28, 3]. Taking image classification as an example, we would like to build a classifier that is both accurate and invariant to certain kinds of data transformation, such as rotation or perturbation. Let $\xi_1=(a_1,b_1),\dots,\xi_n=(a_n,b_n)$ be a set of input data, where $a_i$ is the feature vector and $b_i$ is the label. A plausible way to achieve such consistency is to consider the class of robust linear classifiers, say $f(x,x_0,\xi) = \mathbb{E}_{\eta|\xi\sim\mu(\sigma(a))}[x^\top\eta + x_0]$ for given image data $\xi$, obtained by averaging the prediction over all possible transformations $\sigma(a)$, and then to find the best fit by minimizing the regularized expected risk:

$\min_{(x,x_0)} \mathbb{E}_{\xi=(a,b)}\Big[\ell\big(b,\ \mathbb{E}_{\eta|\xi}[\eta^\top x + x_0]\big)\Big] + \nu\|x\|^2$.

Here $\ell(\cdot,\cdot)$ is some loss function, $\nu>0$ is a regularization parameter, and $\mu(\cdot)$ is a given distribution (e.g., uniform) over the transformations. Clearly, such problems belong to the category of CSO.
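As an illustration of how the inner conditional expectation arises from data augmentation, the short sketch below evaluates this robust objective with a logistic loss and small random rotations of two-dimensional features; the loss, the rotation model, and all parameters are assumptions made for this example only.

```python
import numpy as np

rng = np.random.default_rng(1)

def random_rotations(a, m, scale=0.1):
    """Draw m augmented copies of the 2-D feature vector a under small random
    rotations; this plays the role of sampling eta ~ P(eta | xi)."""
    thetas = rng.normal(0.0, scale, size=m)
    cos, sin = np.cos(thetas), np.sin(thetas)
    rots = np.stack([np.stack([cos, -sin], axis=1),
                     np.stack([sin,  cos], axis=1)], axis=1)   # shape (m, 2, 2)
    return rots @ a                                            # shape (m, 2)

def robust_risk(x, x0, data, m=32, nu=0.01):
    """SAA of the robust objective: average the linear prediction over the
    augmented copies (inner expectation), then apply a logistic loss."""
    risk = 0.0
    for a, b in data:                        # each (a_i, b_i) is one outer sample xi_i
        etas = random_rotations(a, m)        # inner conditional samples
        avg_pred = etas.mean(axis=0) @ x + x0
        risk += np.log1p(np.exp(-b * avg_pred))
    return risk / len(data) + nu * np.dot(x, x)

data = [(rng.normal(size=2), rng.choice([-1.0, 1.0])) for _ in range(100)]
print(robust_risk(np.array([1.0, -0.5]), 0.0, data))
```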
Reinforcement Learning.
Policy evaluation is a fundamental task in Markov decision processes and reinforcement learning. Consider a discounted Markov decision process characterized by the tuple $\mathcal{M} := (\mathcal{S},\mathcal{A},P,r,\gamma)$, where $\mathcal{S}$ is a finite state space, $\mathcal{A}$ is a finite action space, $P(s,a,s')$ represents the (unknown) transition probability from state $s$ to $s'$ given action $a$, $r(s,a):\mathcal{S}\times\mathcal{A}\to\mathbb{R}$ is a reward function, and
$\gamma\in(0,1)$ is a discount factor. Given a stochastic policy $\pi(a|s)$, the goal of policy evaluation is to estimate the value function $V^\pi(s) := \mathbb{E}\big[\sum_{k=0}^{\infty}\gamma^k r(s_k,a_k)\,\big|\,s_0=s\big]$ under the policy. It is well known that $V^\pi(\cdot)$ is a fixed point of the Bellman equation [1],

$V^\pi(s) = \mathbb{E}_{s'|a,s}\big[r(s,a) + \gamma V^\pi(s')\big]$.

To estimate the value function $V^\pi(s)$, one could resort to minimizing the mean squared Bellman error [40, 7], namely,

$\min_{V(\cdot):\mathcal{S}\to\mathbb{R}} \mathbb{E}_{s\sim\mu(\cdot),\,a\sim\pi(\cdot|s)}\Big[\big(r(s,a) - \mathbb{E}_{s'|a,s}[V(s) - \gamma V(s')]\big)^2\Big]$.

Here $\mu(\cdot)$ is the stationary distribution. This minimization problem can be viewed as a special case of CSO. Recently, [8] showed that finding the optimal policy can also be formulated as an optimization problem of a similar form by exploiting the smoothed Bellman optimality equation. Again, the resulting problem falls under the category of CSO.
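A minimal sketch of the resulting SAA for policy evaluation on a toy tabular MDP is given below: the outer samples are state-action pairs and, for each pair, several next states are drawn from the simulator, which plays the role of the conditional distribution $P(\eta|\xi)$. The MDP, the sample sizes, and the tabular parameterization are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
S, A, gamma = 5, 2, 0.9
P = rng.dirichlet(np.ones(S), size=(S, A))      # toy transition kernel P(s' | s, a)
r = rng.normal(size=(S, A))                     # toy reward table r(s, a)

def empirical_msbe(V, n=200, m=10):
    """SAA of the mean squared Bellman error under a uniform random policy:
    outer samples (s_i, a_i), inner conditional samples s'_{ij} ~ P(. | s_i, a_i)."""
    err = 0.0
    for _ in range(n):
        s, a = rng.integers(S), rng.integers(A)           # outer sample xi_i = (s, a)
        next_states = rng.choice(S, size=m, p=P[s, a])    # inner samples eta_ij
        inner = np.mean(V[s] - gamma * V[next_states])    # estimates E[V(s) - gamma V(s')]
        err += (r[s, a] - inner) ** 2
    return err / n

V = np.zeros(S)                                           # tabular value-function parameters
print(empirical_msbe(V))
```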
Uplift Modeling.
Uplift modeling aims at estimating individual treatment effects, and it has been widely studied in the causal inference literature and used for personalized medical treatment and targeted marketing [18, 44]. In an individual uplift model, the goal is to estimate the effect of a treatment on an individual with feature vector $x$, which can be represented by $u(x) := \mathbb{E}[y\,|\,x, t=1] - \mathbb{E}[y\,|\,x, t=-1]$, where $t\in\{\pm 1\}$ indicates whether a treatment has been given to the individual, and $y\in\mathcal{Y}\subseteq\mathbb{R}$ represents the outcome. In practice, obtaining joint labels $(y,t)$ can be difficult, whereas obtaining one label (either $t$ or $y$) of the individual is relatively easier. [44] considered an individual uplift model that assumes availability of only one label from the joint labels and estimates the unknown label with $p(y|x) = \sum_{t\in\{\pm 1\}} p(y|x,t)\,p(t|x)$. They showed that the individual uplift $u(x)$ is equivalent to the optimal solution of the following least-squares problem:

$\min_{u\in L^2(p)} \mathbb{E}_{x\sim p(x)}\Big[\big(\mathbb{E}_{w|x}[w]\cdot u(x) - \mathbb{E}_{z|x}[z]\big)^2\Big]$,

where $L^2(p) = \{f:\mathcal{X}\to\mathbb{R}\ |\ \mathbb{E}_{x\sim p(x)}[f(x)^2] < \infty\}$ is a function space, and $w$ and $z$ are two auxiliary random variables whose conditional densities are given by $p(z=\bar z\,|\,x) = p(y=\bar z\,|\,x) + p(y=-\bar z\,|\,x)$ and $p(w=\bar w\,|\,x) = p(t=\bar w\,|\,x) + p(t=-\bar w\,|\,x)$. If we further restrict $u(\cdot)$ to a finite-dimensional parameterization, then the above problem becomes a special case of CSO.

For these applications, there are many settings in which samples can be generated according to our assumptions. For instance, in robust supervised learning and uplift modeling, there are multiple samples from $P(\eta|\xi)$ available for any given $\xi$.

A closely related class of problems, called stochastic composition optimization, has been extensively studied in the literature; see, e.g., [45, 31, 12, 41], to name just a few. This class of problems takes the following form:

(1.8)  $\min_{x\in\mathcal{X}} f\circ g(x) := \mathbb{E}_{\xi}\big[f_\xi\big(\mathbb{E}_{\eta}[g_\eta(x)]\big)\big]$,

where $f(u) := \mathbb{E}_{\xi}[f_\xi(u)]$ and $g(x) := \mathbb{E}_{\eta}[g_\eta(x)]$. Although the two problems, (1.8) and (1.4), share some similarities in that both objectives are represented by nested expectations, they are fundamentally different in two aspects: (i) the inner randomness $\eta$ in (1.4) is conditionally dependent on the outer randomness $\xi$, while the inner expectation in (1.8) is taken over the marginal distribution of $\eta$; (ii) the inner random function $g_\eta(x,\xi)$ in (1.4) depends on both $\xi$ and $\eta$. As a result, unlike (1.8), the CSO problem (1.4) cannot be formulated as a composition of two deterministic functions, due to the dependence between the inner and outer randomness. Another key distinction from (1.8) is that we assume availability of samples from the distributions $P(\xi)$ and $P(\eta|\xi)$, rather than samples from the joint distribution $P(\xi,\eta)$. These two distinctions lead to a drastic difference in the SAA construction and in the sample complexity analysis of the two types of problems, as we will show in the rest of the paper.

When solving either (1.8) or (1.4), most of the existing work is devoted to developing stochastic oracle-based algorithms and their convergence analysis.
Related work includes two-timescale [31, 45, 41, 42] and single-timescale [13] stochastic approximation algorithms for solving problem (1.8), variance-reduced algorithms for solving the SAA counterpart of (1.8) [22, 17, 39], and a primal-dual functional stochastic approximation algorithm for solving problem (1.4) [7]. These methods usually require convexity of the objective in order to obtain an $\epsilon$-optimal solution (and, when solving the SAA problem itself, convexity conditions are often necessary for obtaining a global minimizer). Our work differs from the ones listed above in that we mainly focus on establishing the sample complexity of SAA itself, rather than designing efficient algorithms to solve the resulting SAA.

We point out that our paper is in the same strain as a series of papers [20, 38, 36, 29, 12, 9, 2, 23] centered on the sample average approximation approach for stochastic programs. In particular, [9] derived a central limit theorem for the SAA of the stochastic composition optimization problem (1.8), and [12] established the rate of convergence. Despite these developments, the study of the basic SAA approach and its finite sample complexity remains unexplored for the general CSO problem (1.4), and even for the special case (1.6). We aim to close this gap in this paper.

In this paper, we formally analyze the sample complexity of the corresponding SAA approach for solving CSO. Our contributions are summarized as follows and in Table 1.1.
(a) We establish the first sample complexity results of the SAA in (1.5) for the CSO problem (1.4) under several structural assumptions:
(i) both $f_\xi$ and $g_\eta$ are Lipschitz continuous;
(ii) in addition to (i), $f_\xi$ is Lipschitz smooth;
(iii) in addition to (i), the empirical function satisfies the Hölderian error bound condition;
(iv) in addition to (i), $f_\xi$ is Lipschitz smooth and the empirical function satisfies the Hölderian error bound condition.
None of these assumptions requires convexity of the underlying objective function. Note that the Hölderian error bound (HEB) condition [4], which includes the quadratic growth (QG) condition [19] as a special case, is a much weaker assumption than strong convexity and holds for many nonconvex problems in machine learning applications [6]. We show that, for general Lipschitz continuous problems, the sample complexity of SAA improves from $O(d/\epsilon^4)$ to $O(d/\epsilon^3)$ when assuming smoothness; for problems satisfying the QG condition, the sample complexity of SAA improves from $O(1/\epsilon^3)$ to $O(1/\epsilon^2)$ when assuming smoothness. This is very different from the classical results on SO and MSP, where Lipschitz smoothness plays no essential role in the sample complexity [20, 36]. Our results are built on traditional large deviations theory and stability arguments, while leveraging several bias-variance decomposition techniques, in order to fully exploit the specific structure of CSO and other structural assumptions.
Table 1.1
Sample complexity of SAA methods

Problem            | Assumption on f_ξ(·) | Assumption on F̂_n or F̂_nm | Conditional sampling | Independent sampling
SO [20]            | -                    | -                           | O(d/ε^2)             | -
SO [35]            | -                    | Strongly Convex             | O(1/ε)               | -
MSP (T = 3) [36]   | -                    | -                           | O(d^2/ε^4)           | O(d^2/ε^4)
CSO                | -                    | -                           | O(d/ε^4)             | O(d/ε^2)
CSO                | Smooth               | -                           | O(d/ε^3)             |
CSO                | -                    | Quadratic Growth            | O(1/ε^3)             | O(d/ε^2)
CSO                | Smooth               | Quadratic Growth            | O(1/ε^2)             |

(F̂_n or F̂_nm = empirical objective; ε = accuracy; d = dimension; Conditional = conditional sampling; Independent = independent sampling.)

(b) We analyze the sample complexity of the modified SAA in (1.7) for the special case (1.6), where $\xi$ and $\eta$ are independent. We show that the total sample complexity of the modified SAA is $O(d/\epsilon^2)$ for general Lipschitz continuous problems. The existence of the QG condition only improves the complexity of the outer samples from $O(d/\epsilon^2)$ to $O(1/\epsilon)$, yet the overall complexity is dominated by the complexity of the inner samples, which is $O(d/\epsilon^2)$. Our complexity result matches the asymptotic rate established in [9], even without assuming smoothness of the outer and inner functions, and is unimprovable.

(c) We conduct simulations of the SAA approach on several examples, including logistic regression, least absolute value (LAV) regression, and its smoothed counterpart, under some modifications. Our simulation results indicate that solving the nonsmooth LAV regression requires more samples than solving its smooth counterpart to achieve the same accuracy. We also observe that when the variance of the inner randomness is relatively large, for a fixed budget $T$, setting $n = O(\sqrt{T})$ outer samples seems to perform best for logistic regression, which matches our theory. Although both the conditional sampling and independent sampling schemes can be applied to the special case (1.6), with nearly matching sample complexity in situation (iv) (see the last row in Table 1.1), our simulations show that the independent sampling scheme exhibits better performance in practice.

The remainder of this paper is organized as follows. In Section 2, we introduce notations and preliminaries. In Section 3, we state the basic assumptions and analyze the mean squared error of the Monte Carlo estimation. In Section 4, we present the main results on the sample complexity of SAA for CSO under different structural assumptions. In Section 5, we provide results for the special case when $\xi$ and $\eta$ are independent. Numerical results are given in Section 6.
2. Preliminaries.
For convenience, we collect here some notation that will be used throughout the paper. We also introduce some mathematical tools and propositions that are necessary for the later discussion. For simplicity, we restrict our attention to the $\ell_2$-norm, denoted by $\|\cdot\|$. Similar results on sample complexity with respect to other norms can be obtained with minor modifications of the analysis.

Let $\mathcal{X}\subseteq\mathbb{R}^d$ be the decision set. We say $\mathcal{X}$ has a finite diameter $D_{\mathcal{X}}$ if $\|x_1-x_2\|\le D_{\mathcal{X}}$ for all $x_1,x_2\in\mathcal{X}$. For $\upsilon>0$, $\{x_l\}_{l=1}^{Q}$ is said to be a $\upsilon$-net of $\mathcal{X}$ if $x_l\in\mathcal{X}$ for all $l=1,\dots,Q$ and the following holds: for every $x\in\mathcal{X}$ there exists $l(x)\in\{1,\dots,Q\}$ such that $\|x-x_{l(x)}\|\le\upsilon$. If $\mathcal{X}$ has a finite diameter $D_{\mathcal{X}}$, then for any $\upsilon>0$ there exists a $\upsilon$-net of $\mathcal{X}$, and the size of the $\upsilon$-net is bounded by $Q\le O\big((D_{\mathcal{X}}/\upsilon)^d\big)$ [37].

A function $f:\mathcal{X}\to\mathbb{R}$ is said to be $L$-Lipschitz continuous if there exists a constant $L>0$ such that $|f(x_1)-f(x_2)|\le L\|x_1-x_2\|$ for all $x_1,x_2\in\mathcal{X}$. The function $f:\mathcal{X}\to\mathbb{R}$ is said to be $S$-Lipschitz smooth if it is continuously differentiable and its gradient is $S$-Lipschitz continuous. This also implies that for all $x_1,x_2\in\mathcal{X}$,

$|f(x_1)-f(x_2)-\nabla f(x_2)^\top(x_1-x_2)| \le \frac{S}{2}\|x_1-x_2\|^2$.

If a continuously differentiable function $f:\mathcal{X}\to\mathbb{R}$ satisfies, for all $x_1,x_2\in\mathcal{X}$,

$f(x_1)-f(x_2)-\nabla f(x_2)^\top(x_1-x_2) \ge \frac{\mu}{2}\|x_1-x_2\|^2$,

then $f$ is called $\mu$-strongly convex when $\mu>0$, convex when $\mu=0$, and $\mu$-weakly convex when $\mu<0$.

Definition 2.1. Let $f:\mathcal{X}\to\mathbb{R}$ be a function with compact domain $\mathcal{X}$ whose optimal solution set $\mathcal{X}^*$ is nonempty. $f(\cdot)$ satisfies the $(\mu,\delta)$-Hölderian error bound condition if there exist $\delta\ge 0$ and $\mu>0$ such that for all $x\in\mathcal{X}$,

$f(x) - \min_{x\in\mathcal{X}} f(x) \ge \mu \inf_{z\in\mathcal{X}^*}\|x-z\|^{1+\delta}$.

In particular, when $\delta=1$, we say $f$ satisfies the quadratic growth (QG) condition.

The Hölderian error bound condition is also known as the Łojasiewicz inequality [4]. When $\delta=1$, the condition implies quadratic growth of the function value near any local minimum. The QG condition is a weaker assumption than strong convexity, and a function satisfying it need not be convex. When $f(\cdot)$ is convex, the QG condition is also referred to as optimal strong convexity [24] and semi-strong convexity [14].

Cramér's large deviation theorem will be used frequently, so we list it as a lemma below based on the result in [20]. We extend the result to random vectors and provide the proof in Appendix Section A. Lemma 2.1.
Let X , · · · , X n be i.i.d samples of zero mean random variable X with finite variance σ . For any (cid:15) > , it holds P (cid:18) n n (cid:88) i =1 X i ≥ (cid:15) (cid:19) ≤ exp( − nI ( (cid:15) )) , where I ( (cid:15) ) := sup t ∈ R { t(cid:15) − log M ( t ) } is the rate function of random variable X , and M ( t ) := E e tX is the moment generating function of X . For any δ > , there exists (cid:15) > , for any (cid:15) ∈ (0 , (cid:15) ) , I ( (cid:15) ) ≥ (cid:15) (2+ δ ) σ . If X is a zero-mean sub-Gaussian, then P ( n (cid:80) ni =1 X i ≥ (cid:15) ) ≤ exp( − n(cid:15) σ ) , ∀ (cid:15) > . If X is a zero-mean random vector in R k such that E (cid:107) X (cid:107) = σ < ∞ , then forany δ > , there exists (cid:15) > , for any (cid:15) ∈ (0 , (cid:15) ) , P (cid:18)(cid:13)(cid:13)(cid:13)(cid:13) n n (cid:88) i =1 X i (cid:13)(cid:13)(cid:13)(cid:13) ≥ (cid:15) (cid:19) ≤ k exp (cid:18) − n(cid:15) (2 + δ ) σ (cid:19) . We will also use the simple fact that for any random variables Y and Z , if randomvariable W ≤ X := Y + Z , then for any (cid:15) > P ( W > (cid:15) ) ≤ P ( X > (cid:15) ) ≤ P ( Y > (cid:15) ) + P ( Z > (cid:15) ). Lastly, throughout the paper, we call x (cid:15) ∈ X an (cid:15) -optimal solutionto the problem min x ∈X F ( x ), if F ( x (cid:15) ) − min x ∈X F ( x ) ≤ (cid:15) . . Mean Squared Error of SAA Estimator for CSO. In this section, wemake the basic assumptions and analyze the mean squared error of the Monte Carloestimate of the function value f ( x ) at a given point. Recall the problem (1.4):min x ∈X F ( x ) := E ξ (cid:104) f ξ (cid:16) E η | ξ [ g η ( x, ξ )] (cid:17)(cid:105) , where f ξ ( · ) : R k → R , g η ( · , ξ ) : R d → R k are random functions. Recall its SAAcounterpart (1.5): min x ∈X ˆ F nm ( x ) := 1 n n (cid:88) i =1 f ξ i (cid:18) m m (cid:88) j =1 g η ij ( x, ξ i ) (cid:19) . We denote x ∗ and ˆ x nm the optimal solutions to the CSO and the SAA problems, re-spectively. We are interested in estimating the probability of ˆ x nm being an (cid:15) -optimalsolution to the CSO problem, namely P ( F (ˆ x nm ) − F ( x ∗ ) ≤ (cid:15) ), for an arbitrary accu-racy (cid:15) > P ( ξ ) and conditional distribution P ( η | ξ ) for any given ξ , and we makethe following basic assumptions: Assumption
3.1. We assume that
(a) the decision set $\mathcal{X}\subseteq\mathbb{R}^d$ has a finite diameter $D_{\mathcal{X}}>0$;
(b) $f_\xi(\cdot)$ is $L_f$-Lipschitz continuous and $g_\eta(\cdot,\xi)$ is $L_g$-Lipschitz continuous for any given $\xi$ and $\eta$;
(c) for all $x\in\mathcal{X}$, $f(x,\xi)$ is Borel measurable in $\xi$, and $g_\eta(x,\xi)$ is Borel measurable in $\eta$ for all $\xi$;
(d) $\sigma_f^2 := \max_{x\in\mathcal{X}} \mathbb{V}_\xi\big(f_\xi(\mathbb{E}_{\eta|\xi}[g_\eta(x,\xi)])\big) < \infty$;
(e) $\sigma_g^2 := \max_{x\in\mathcal{X},\,\xi} \mathbb{E}_{\eta|\xi}\|g_\eta(x,\xi) - \mathbb{E}_{\eta|\xi}\, g_\eta(x,\xi)\|^2 < \infty$;
(f) $|f_\xi(\cdot)| \le M_f$ and $\|g_\eta(\cdot,\xi)\| \le M_g$ for any $\xi$ and $\eta$.

Assumption (f) on the boundedness of function values is implied by assumptions (a) and (b). Assumptions (d) and (e) on the boundedness of variances are commonly used for sample complexity analysis in the literature. Assumptions (b) and (c) together imply that the functions $f_\xi$ and $g_\eta(x,\xi)$ are Carathéodory functions [21]. Although the parameters $L_f$, $L_g$, $\sigma_f$, and $\sigma_g$ could depend on the dimensions $d$ and $k$, we treat them as given constants throughout the paper.

In this subsection, we analyze the mean squared error (MSE) of the estimator $\hat F_{nm}(x)$, i.e., the SAA objective (or the empirical objective), for estimating the true objective function $F(x)$ at a given $x$. The MSE can be decomposed into the sum of the squared bias and the variance of the estimator:

(3.1)  $\mathrm{MSE}\big(\hat F_{nm}(x)\big) := \mathbb{E}\,|\hat F_{nm}(x) - F(x)|^2 = \big(\mathbb{E}\,\hat F_{nm}(x) - F(x)\big)^2 + \mathbb{V}\big(\hat F_{nm}(x)\big)$.

We have the following lemmas bounding the bias and the variance.
Lemma
3.1. Let $\{\eta_j\}_{j=1}^{m}$ be conditional samples from $P(\eta|\xi)$ given $\xi\sim P(\xi)$. Under Assumption 3.1, for any fixed $x\in\mathcal{X}$ that is independent of $\xi$ and $\{\eta_j\}_{j=1}^{m}$, it holds that

(3.2)  $\Big|\,\mathbb{E}_{\{\xi,\{\eta_j\}_{j=1}^{m}\}}\Big[f_\xi\Big(\frac{1}{m}\sum_{j=1}^{m} g_{\eta_j}(x,\xi)\Big) - f_\xi\big(\mathbb{E}_{\eta|\xi}\, g_\eta(x,\xi)\big)\Big]\Big| \le \frac{L_f\sigma_g}{\sqrt{m}}$.

If, additionally, $f_\xi(\cdot)$ is $S$-Lipschitz smooth, we have

(3.3)  $\Big|\,\mathbb{E}_{\{\xi,\{\eta_j\}_{j=1}^{m}\}}\Big[f_\xi\Big(\frac{1}{m}\sum_{j=1}^{m} g_{\eta_j}(x,\xi)\Big) - f_\xi\big(\mathbb{E}_{\eta|\xi}\, g_\eta(x,\xi)\big)\Big]\Big| \le \frac{S\sigma_g^2}{2m}$.
Define X j := g η j ( x, ξ ) − E η | ξ g η ( x, ξ ) and ¯ X := (cid:80) mj =1 X j /m . It follows E { η j } mj =1 | ξ [ ¯ X ] = 0 by definition, and E { η j } mj =1 | ξ [ (cid:107) ¯ X (cid:107) ] ≤ σ g /m by Assumption 3.1(d). E { η j } mj =1 | ξ ∇ f ξ (cid:0) E η | ξ g η ( x, ξ ) (cid:1) (cid:62) (cid:0) m (cid:80) mj =1 X j ( x ) (cid:1) = 0 since x is independent of { η j } mj =1 .The results then follow directly by invoking the Lipschitz continuity and smoothnessand taking expectations. Lemma
3.2. Under Assumption 3.1, it holds that

$\mathbb{V}\big(\hat F_{nm}(x)\big) \le \frac{\sigma_f^2}{n} + \frac{4M_fL_f\sigma_g}{n\sqrt{m}}$.
We first introduce ˆ F n ( x ) := n (cid:80) ni =1 f ξ i (cid:0) E η | ξ i [ g η ( x, ξ i )] (cid:1) . It follows fromthe independence among { ξ i } ni =1 that V ( ˆ F n ( x )) ≤ σ f n . By definition we have V (cid:16) ˆ F nm ( x ) (cid:17) − V (cid:16) ˆ F n ( x ) (cid:17) = 1 n (cid:104) E ( ˆ F m ( x ) ) − ( E ˆ F m ( x )) (cid:105) − n (cid:104) ( E ( ˆ F ( x ) ) − ( E ˆ F ( x )) (cid:105) = 1 n (cid:104) E ( ˆ F m ( x ) ) − E ( ˆ F ( x ) ) (cid:105) + 1 n (cid:104) ( E ˆ F ( x )) − ( E ˆ F m ( x )) (cid:105) , where ˆ F m ( x ) := f ξ (cid:0) m (cid:80) mj =1 g η j ( x, ξ ) (cid:1) and ˆ F ( x ) := f ξ (cid:0) E η | ξ g η ( x, ξ ) (cid:1) . From As-sumption 3.1(b) and Lemma 3.1, we have E ( ˆ F m ( x ) ) − E ( ˆ F ( x ) ) ≤ M f E | ˆ F m ( x ) − ˆ F ( x ) | ≤ M f L f σ g / √ m . In addition, ( E ˆ F ( x )) − ( E ˆ F m ( x )) ≤ M f L f σ g / √ m .Hence, we obtain the desired result.The following result on the mean squared error follows naturally by (3.1). Theorem
3.1. Under Assumption 3.1, we have

(3.4)  $\mathrm{MSE}\big(\hat F_{nm}(x)\big) \le \frac{L_f^2\sigma_g^2}{m} + \frac{1}{n}\Big(\sigma_f^2 + \frac{4M_fL_f\sigma_g}{\sqrt{m}}\Big)$.

If, additionally, $f_\xi(\cdot)$ is $S$-Lipschitz smooth, the mean squared error is further bounded by

(3.5)  $\mathrm{MSE}\big(\hat F_{nm}(x)\big) \le \frac{S^2\sigma_g^4}{4m^2} + \frac{1}{n}\Big(\sigma_f^2 + \frac{4M_fL_f\sigma_g}{\sqrt{m}}\Big)$.

Unlike the classical stochastic optimization, the SAA objective of CSO is no longer unbiased. The estimation error of the SAA objective therefore comes from both bias and variance. A key observation from Theorem 3.1 is that Lipschitz smoothness of $f_\xi(\cdot)$ is essential for reducing the bias and can potentially be exploited to improve the sample complexity of SAA.

We point out that [15] also considers the problem of estimating the expected value of a nonlinear function of a conditional expectation, i.e., $\mathbb{E}[f(\mathbb{E}[\zeta|\xi])]$. Their setting is slightly different from ours, as they restrict $f$ to be one-dimensional and assume that $f$ has a finite number of discontinuous or non-differentiable points and is thrice differentiable with finite derivatives at all continuous points. They provide an asymptotic bound $O(1/m^2 + 1/n)$ on the mean squared error of their nested estimator based on a Taylor expansion. Here we focus on a general continuous outer function $f_\xi(\cdot)$ and show that Lipschitz smoothness of $f_\xi(\cdot)$ is sufficient to achieve a similar error bound with finite samples.
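As a quick numerical illustration of the bias behavior in Theorem 3.1, the following sketch estimates by Monte Carlo the bias of the inner plug-in estimate at a fixed decision point, for a nonsmooth and a smooth outer function on a toy one-dimensional instance; the specific choices of $f_\xi$, the inner noise, and the sample sizes are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(3)
sigma_g = 2.0

def bias_estimate(f_outer, m, trials=200_000):
    """Monte Carlo estimate of E[f(u_bar_m)] - f(E[u]) at a fixed decision point,
    where u_bar_m is the average of m inner samples with mean 0 and variance
    sigma_g^2 (drawn directly as N(0, sigma_g^2/m))."""
    u_bar = rng.normal(0.0, sigma_g / np.sqrt(m), size=trials)
    return np.mean(f_outer(u_bar)) - f_outer(0.0)

for m in (10, 100, 1000):
    b_abs = bias_estimate(np.abs, m)     # nonsmooth outer f(u) = |u|: bias = sigma_g*sqrt(2/(pi*m))
    b_sq = bias_estimate(np.square, m)   # smooth outer f(u) = u^2:  bias = sigma_g^2/m
    print(f"m={m:5d}   bias |u|: {b_abs:.4f} (~1/sqrt(m))   bias u^2: {b_sq:.4f} (~1/m)")
```

The nonsmooth case decays at the $O(1/\sqrt m)$ rate of (3.2), while the smooth case decays at the $O(1/m)$ rate of (3.3), matching the bias terms in (3.4) and (3.5).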
4. Sample Complexity of SAA for Conditional Stochastic Optimization.
In this section, we analyze the number of samples required for the solution tothe SAA (1.5) to be (cid:15) -optimal of the CSO problem (1.4), with high probability.We consider two general cases: (i) when the objective is Lipschitz continuous and(ii) when the empirical objective satisfies the H¨olderian error bound condition. Inthe former case, we establish a uniform convergence analysis based on concentrationinequalities to bound P ( F (ˆ x nm ) − F ( x ∗ ) ≥ (cid:15) ), and in the latter case, we provide astability analysis. In both cases, we further take into account two scenarios, with andwithout the Lipschitz smoothness assumption of the outer function f ξ ( · ). We first consider the case when the objective is Lipschitz continuous and prove theuniform convergence.
Theorem
Under Assumption 3.1, for any δ > ,there exists (cid:15) > such that for (cid:15) ∈ (0 , (cid:15) ) , when m ≥ L f σ g /(cid:15) , we have P (cid:18) sup x ∈X | ˆ F nm ( x ) − F ( x ) | > (cid:15) (cid:19) ≤ O (1) (cid:18) L f L g D X (cid:15) (cid:19) d exp (cid:18) − n(cid:15) δ )( σ f + 4 M f L f σ g ) (cid:19) . (4.1) If additionally, f ξ ( · ) is S -Lipschitz smooth, then (4.1) holds as long as m ≥ Sσ g /(cid:15) .Proof. We construct a υ -net to get rid of the supreme over x and use a con-centration inequality to bound the probability. First, we pick a υ -net { x l } Ql =1 on thedecision set X , such that L f L g υ = (cid:15)/
4, thus Q ≤ O (1)( L g L f D X (cid:15) ) d . Note that { x l } Ql =1 has no randomness. By definition of υ -net, we have ∀ x ∈ X , ∃ l ( x ) ∈ { , , · · · , Q } ,s.t. || x − x l ( x ) || ≤ υ = (cid:15)/ L f L g . Invoking Lipschitz continuity of f ξ and g η , weobtain | ˆ F nm ( x ) − ˆ F nm ( x l ( x ) ) | ≤ (cid:15) , | F ( x ) − F ( x l ( x ) ) | ≤ (cid:15) . Hence, for any x ∈ X , | ˆ F nm ( x ) − F ( x ) |≤ | ˆ F nm ( x ) − ˆ F nm ( x l ( x ) ) | + | ˆ F nm ( x l ( x ) ) − F ( x l ( x ) ) | + | F ( x l ( x ) ) − F ( x ) |≤ (cid:15) | ˆ F nm ( x l ( x ) ) − F ( x l ( x ) ) | ≤ (cid:15) l ∈{ , , ··· ,Q } | ˆ F nm ( x l ) − F ( x l ) | . t follows that(4.2) P (cid:18) sup x ∈X | ˆ F nm ( x ) − F ( x ) | > (cid:15) (cid:19) ≤ P (cid:18) max l ∈{ , , ··· ,Q } | ˆ F nm ( x l ) − F ( x l ) | > (cid:15) (cid:19) ≤ Q (cid:88) l =1 P (cid:18) | ˆ F nm ( x l ) − F ( x l ) | > (cid:15) (cid:19) . Define Z i ( l ) := f ξ i ( m (cid:80) mj =1 g η ij ( x l , ξ i )) − F ( x l ), then Z ( l ) , Z ( l ) , · · · , Z n ( l ) are i.i.d.random variables. Denote their expectation as E Z ( l ). Then Z i ( l ) − E Z ( l ) is a zero-mean random variable.If max l E Z ( l ) ≤ (cid:15)/
4, by Lemma 2.1, we have(4.3) P (cid:18) ˆ F nm ( x l ) − F ( x l ) > (cid:15) (cid:19) ≤ P (cid:18) ˆ F nm ( x l ) − F ( x l ) > (cid:15) E Z ( l ) (cid:19) = P (cid:18) n n (cid:88) i =1 [ Z i ( l ) − E Z ( l )] > (cid:15) (cid:19) ≤ exp (cid:18) − n(cid:15) δ + 2) V ( Z ( l )) (cid:19) . Similarly, we could show that if max l E Z ( l ) ≥ − (cid:15)/ P (cid:18) F ( x l ) − ˆ F nm ( x l ) > (cid:15) (cid:19) ≤ exp (cid:18) − n(cid:15) δ + 2) V ( Z ( l )) (cid:19) . Based on Lemma 3.1, we have, for Lipschitz continuous f ξ ( · ), | E Z ( l ) | ≤ L f σ g / √ m , ∀ l = 1 , · · · , Q ; for Lipschitz smooth f ξ ( · ), | E Z ( l ) | ≤ Sσ g / m , ∀ l = 1 , · · · , Q . Thus,max l E Z ( l ) ≤ (cid:15)/ m is sufficiently large. By analysis of Theorem3.1, we know V ( Z ( l )) ≤ σ f + 4 M f L f σ g / √ m ≤ σ f + 4 M f L f σ g . Plugging into (4.2)with Q ≤ O (1)( L g L f D X (cid:15) ) d , we obtain the desired result.Since ˆ F nm (ˆ x nm ) − ˆ F nm ( x ∗ ) ≤
0, we have

$P\big(F(\hat x_{nm}) - F(x^*) \ge \epsilon\big) = P\big([F(\hat x_{nm}) - \hat F_{nm}(\hat x_{nm})] + [\hat F_{nm}(\hat x_{nm}) - \hat F_{nm}(x^*)] + [\hat F_{nm}(x^*) - F(x^*)] \ge \epsilon\big)$
(4.5)  $\le P\big(F(\hat x_{nm}) - \hat F_{nm}(\hat x_{nm}) \ge \epsilon/2\big) + P\big(\hat F_{nm}(x^*) - F(x^*) \ge \epsilon/2\big)$.

Invoking Theorem 4.1, we immediately have the following result.

Corollary 4.1. Under Assumption 3.1, for any $\delta>0$, there exists $\epsilon_0>0$ such that for $\epsilon\in(0,\epsilon_0)$, when $m \ge 64L_f^2\sigma_g^2/\epsilon^2$,

(4.6)  $P\big(F(\hat x_{nm}) - F(x^*) > \epsilon\big) \le O(1)\Big(\frac{L_fL_gD_{\mathcal{X}}}{\epsilon}\Big)^{d}\exp\Big(-\frac{n\epsilon^2}{64(2+\delta)(\sigma_f^2 + 4M_fL_f\sigma_g)}\Big)$.

If, additionally, $f_\xi(\cdot)$ is $S$-Lipschitz smooth, then (4.6) holds as long as $m \ge 4S\sigma_g^2/\epsilon$.

It further implies the following sample complexity result.

Corollary 4.2. With probability at least $1-\alpha$, the solution to the SAA problem is $\epsilon$-optimal for the original CSO problem if the sample sizes $n$ and $m$ satisfy

$n \ge O(1)\,\frac{\sigma_f^2 + 4M_fL_f\sigma_g}{\epsilon^2}\Big[d\log\Big(\frac{L_fL_gD_{\mathcal{X}}}{\epsilon}\Big) + \log\Big(\frac{1}{\alpha}\Big)\Big]$,

and $m \ge 64L_f^2\sigma_g^2/\epsilon^2$ under Assumption 3.1, or $m \ge 4S\sigma_g^2/\epsilon$ if $f_\xi(\cdot)$ is also Lipschitz smooth. Ignoring the log factors, under Assumption 3.1 the total sample complexity of SAA for achieving an $\epsilon$-optimal solution is $T = mn + n = O(d/\epsilon^4)$; when $f_\xi(\cdot)$ is Lipschitz smooth, the total sample complexity reduces to $T = mn + n = O(d/\epsilon^3)$.

The above result indicates that, in general, the sample complexity of SAA for the CSO problem is $O(d/\epsilon^4)$ when assuming only Lipschitz continuity of the functions $f_\xi$ and $g_\eta$. The sample complexity drops to $O(d/\epsilon^3)$ when additionally assuming Lipschitz smoothness of the outer function $f_\xi$. Notice that the complexity depends only linearly on the dimension of the decision set. This is quite different from three-stage stochastic optimization: in [36], for a three-stage stochastic program, the authors showed that the sample sizes for estimating the second and the third stages each need to be at least $O(d/\epsilon^2)$, leading to a total of $O(d^2/\epsilon^4)$ samples, to guarantee uniform convergence even for stage-wise independent random variables.

In this subsection, we consider the case when the empirical function satisfies the Hölderian error bound condition, which includes the quadratic growth condition and strong convexity as special cases. The error bound condition has been widely studied recently in the context of (stochastic) oracle-based algorithms for faster convergence; see, e.g., [19, 11, 43] and references therein. To the best of our knowledge, very few papers have exploited the Hölderian error bound condition for the SAA approach and analyzed the sample complexity under such a condition. We show that the CSO problem under the Hölderian error bound condition yields smaller orders of sample complexity for the SAA approach. We make the following two assumptions throughout this subsection.
Assumption
4.1. The empirical function $\hat F_{nm}(x)$ satisfies the $(\mu,\delta)$-Hölderian error bound condition with $\mu>0$, $\delta\ge 0$, i.e., for all $x\in\mathcal{X}$,

$\hat F_{nm}(x) - \min_{x\in\mathcal{X}}\hat F_{nm}(x) \ge \mu \inf_{z\in\mathcal{X}^*_{nm}}\|x-z\|^{1+\delta}$,

where $n,m$ are any positive integers and $\mathcal{X}^*_{nm}$ is the optimal solution set of the empirical objective function $\hat F_{nm}(x)$ over $\mathcal{X}$.

Assumption 4.2.
The empirical function ˆ F nm has a unique minimizer ˆ x nm on X , for any n and m . An interesting special case of Assumption 4.1 is the quadratic growth (QG) condi-tion when δ = 1. QG condition is actually satisfied by a wide spectrum of objectives,such as strongly convex functions, general strongly convex functions composed withpiecewise linear functions, general piecewise convex quadratic functions, etc. Thereare also many other specific examples arising in machine learning applications thatsatisfy the QG condition, including logistic loss composed with linear functions andneural networks with linear activation functions, see [6, 19], and reference therein.Another interesting case is the polyhedral error bound condition when δ = 0, which is nown to hold true for many piecewise linear loss functions [4]. For both cases, thesefunctions are not necessarily strongly convex nor convex. Relevant problems withSAA objective ˆ F nm satisfying the QG condition are discussed in Appendix Section D.Assumption 4.2 could be restricted and less straightforward to verify. In general,for a non-strictly convex empirical objective function, the optimal solution is notnecessarily unique. Yet, it is not exclusive to strictly convex functions. We illustrateone such example below. Lastly, we point out that when ˆ F nm ( x ) is strongly convex,for example, l regularized convex empirical objective, the above assumptions holdnaturally. In the following, we give some examples when ˆ F nm ( x ) satisfies the QGcondition. Example 1.
Consider the following one-dimensional function

$F(x) = \mathbb{E}_{\xi}\big[\big(\mathbb{E}_{\eta|\xi}[\eta]\,x\big)^2 + 3\sin^2\big(\mathbb{E}_{\eta|\xi}[\eta]\,x\big)\big]$,

where $\xi$ and $\eta$ can be any random vectors that satisfy $\eta\,|\,\xi \ge \sqrt{\mu}$ with probability 1. Denoting $\bar\eta_i = \frac{1}{m}\sum_{j=1}^{m}\eta_{ij}$, the empirical function is given by

$\hat F_{nm}(x) = \frac{1}{n}\sum_{i=1}^{n} \bar\eta_i^2 x^2 + \frac{3}{n}\sum_{i=1}^{n}\sin^2(\bar\eta_i x)$.

It can be easily verified that $\hat F_{nm}(x)$ satisfies the QG condition with parameter $\mu>0$ and that $\hat F_{nm}(x)$ has a unique minimizer $x^*=0$ for any $m,n$.

Example 2.
Consider the robust logistic regression problem with the objective

(4.7)  $F(x) = \mathbb{E}_{\xi=(a,b)}\big[\log\big(1+\exp(-b\,\mathbb{E}_{\eta|\xi}[\eta]^\top x)\big)\big]$,

where $a\in\mathbb{R}^d$ is a random feature vector, $b\in\{1,-1\}$ is the label, and $\eta = a + \mathcal{N}(0,\sigma^2 I_d)$ is a perturbed noisy observation of the input feature vector $a$. The empirical objective function $\hat F_{nm}(x)$ is given by

(4.8)  $\hat F_{nm}(x) = \frac{1}{n}\sum_{i=1}^{n}\log\Big(1+\exp\Big(-b_i\,\frac{1}{m}\sum_{j=1}^{m}\eta_{ij}^\top x\Big)\Big)$.

We show in Appendix Section D that $\hat F_{nm}(x)$ satisfies the QG condition on any compact convex set. Note that the minimizer of a general empirical objective function is not always unique. However, the Hessian of $\hat F_{nm}(x)$ shows that $\hat F_{nm}(x)$ is strictly convex if $\frac{1}{m}\sum_{j=1}^{m}\eta_{ij} \neq 0$ for all $i$, which is satisfied with high probability. Thus, $\hat F_{nm}(x)$ has a unique minimizer with high probability.

Next, we present our main result on the sample complexity of SAA.
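A minimal sketch of how the empirical objective (4.8) can be formed and solved with CVXPY (the solver used in the experiments of Section 6) is given below; the data-generating setup mirrors the description there, but the specific constants and the radius of the decision set are illustrative assumptions.

```python
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(4)
d, n, m, sigma_eta = 10, 200, 20, 1.0

# Conditional sampling: for each outer sample xi_i = (a_i, b_i), draw m noisy
# observations eta_ij ~ N(a_i, sigma_eta^2 I_d) and average them.
x_true = rng.normal(size=d)
A = rng.normal(size=(n, d))                                            # feature vectors a_i
b = np.sign(A @ x_true)                                                # labels b_i in {+1, -1}
Eta_bar = A + rng.normal(0, sigma_eta, size=(n, m, d)).mean(axis=1)    # (1/m) sum_j eta_ij

# Empirical objective (4.8): (1/n) sum_i log(1 + exp(-b_i * eta_bar_i^T x)),
# minimized over a norm ball used as the decision set.
x = cp.Variable(d)
objective = cp.sum(cp.logistic(cp.multiply(-b, Eta_bar @ x))) / n
problem = cp.Problem(cp.Minimize(objective), [cp.norm(x, 2) <= 100])
problem.solve()
print("optimal empirical objective:", problem.value)
```

With a fixed budget $T = nm + n$, varying the split between $n$ and $m$ in this construction reproduces the sample-allocation comparison reported in Section 6.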
Theorem 4.2. Under Assumptions 3.1, 4.1, and 4.2, for any $\epsilon>0$, we have

(4.9)  $P\big(F(\hat x_{nm}) - F(x^*) \ge \epsilon\big) \le \frac{1}{\epsilon}\Big(L_fL_g\Big(\frac{2L_fL_g}{\mu n}\Big)^{1/\delta} + \frac{2L_f\sigma_g}{\sqrt m}\Big)$.

If, additionally, $f_\xi(\cdot)$ is $S$-Lipschitz smooth, then we further have

(4.10)  $P\big(F(\hat x_{nm}) - F(x^*) \ge \epsilon\big) \le \frac{1}{\epsilon}\Big(L_fL_g\Big(\frac{2L_fL_g}{\mu n}\Big)^{1/\delta} + \frac{S\sigma_g^2}{m}\Big)$.

Different from the previous section, we use a stability argument to exploit the error bound condition. As shown in Lemma 3.1, the empirical function is a biased estimator of the original function due to the composition of $f_\xi(\cdot)$ and $g_\eta(\cdot,\xi)$. Introducing a perturbed set of samples allows us to reduce some of the dependence in the randomness. We define a bias term that will be used later in the proof:

(4.11)  $\Delta(m) := \begin{cases} \dfrac{L_f\sigma_g}{\sqrt m}, & f_\xi(\cdot)\ \text{is}\ L_f\text{-Lipschitz continuous},\\[4pt] \dfrac{S\sigma_g^2}{2m}, & f_\xi(\cdot)\ \text{is additionally}\ S\text{-Lipschitz smooth}.\end{cases}$

Below we provide the detailed proof of Theorem 4.2.
Proof.
Recall that x ∗ and ˆ x nm are the minimizers of F ( x ) and ˆ F nm ( x ), respec-tively. It’s clear that x ∗ has no randomness, and ˆ x nm is a function of { ξ i } ni =1 , { η ij } mj =1 .We decompose the error F (ˆ x nm ) − F ( x ∗ ) in three terms, and analyze each term below: F (ˆ x nm ) − F ( x ∗ ) = F (ˆ x nm ) − ˆ F nm (ˆ x nm ) (cid:124) (cid:123)(cid:122) (cid:125) := E + ˆ F nm (ˆ x nm ) − ˆ F nm ( x ∗ ) (cid:124) (cid:123)(cid:122) (cid:125) := E + ˆ F nm ( x ∗ ) − F ( x ∗ ) (cid:124) (cid:123)(cid:122) (cid:125) := E . First, we use a stability argument and Lemma 3.1 to bound E E = E [ F (ˆ x nm ) − ˆ F nm (ˆ x nm )]. Define(4.12) ˆ F ( k ) nm ( x ) := 1 n n (cid:88) i (cid:54) = k f ξ i (cid:18) m m (cid:88) j =1 g η ij ( x, ξ i ) (cid:19) + 1 n f ξ (cid:48) k (cid:18) m m (cid:88) j =1 g η (cid:48) kj ( x, ξ (cid:48) k ) (cid:19) as the empirical function by replacing the k th outer sample ξ k with another i.i.d outersample ξ (cid:48) k , and replacing the corresponding inner samples { η kj } mj =1 with { η (cid:48) kj } mj =1 ,which are sampled from the conditional distribution of P ( η | ξ (cid:48) k ) for a given sample ξ (cid:48) k .Denote ˆ x ( k ) nm := argmin x ∈X ˆ F ( k ) nm ( x ). We decompose E E = E [ F (ˆ x nm ) − ˆ F nm (ˆ x nm )]into three terms: E E = E (cid:20) n n (cid:88) k =1 F (ˆ x nm ) − n n (cid:88) k =1 f ξ k (cid:18) E η | ξ k g η (ˆ x ( k ) nm , ξ k ) (cid:19)(cid:21) + E (cid:20) n n (cid:88) k =1 f ξ k (cid:18) E η | ξ k g η (ˆ x ( k ) nm , ξ k ) (cid:19) − n n (cid:88) k =1 f ξ k (cid:18) m m (cid:88) j =1 g η kj (ˆ x ( k ) nm , ξ k ) (cid:19)(cid:21) + E (cid:20) n n (cid:88) k =1 f ξ k (cid:18) m m (cid:88) j =1 g η kj (ˆ x ( k ) nm , ξ k ) (cid:19) − ˆ F nm (ˆ x nm ) (cid:21) . (4.13)Note that E [ F (ˆ x nm )] = E [ F (ˆ x ( k ) nm )] since ξ k and ξ (cid:48) k are i.i.d, which implies that ˆ x nm and ˆ x ( k ) nm follow an identical distribution. Since ˆ x ( k ) nm is independent of ξ k , E [ F (ˆ x ( k ) nm )] = E [ f ξ k ( E η | ξ k g (ˆ x ( k ) nm , ξ k ))] for any k . Then the first term in (4.13) is 0. As ˆ x ( k ) nm isindependent of { η kj } mj =1 , the second term in (4.13) could be bounded by Lemma 3.1,it holds(4.14) E (cid:20) f ξ k (cid:18) E η | ξ k g η (ˆ x ( k ) nm , ξ k ) (cid:19) − f ξ k (cid:18) m m (cid:88) j =1 g η kj (ˆ x ( k ) nm , ξ k ) (cid:19)(cid:21) ≤ ∆( m ) . or the third term in (4.13), by definition it implies(4.15)ˆ F nm (ˆ x ( k ) nm ) − ˆ F nm (ˆ x nm ) = ˆ F ( k ) nm (ˆ x ( k ) nm ) − ˆ F ( k ) nm (ˆ x nm )+ 1 n f ξ k (cid:18) m m (cid:88) j =1 g η kj (ˆ x ( k ) nm , ξ k ) (cid:19) − n f ξ k (cid:18) m m (cid:88) j =1 g η kj (ˆ x nm , ξ k ) (cid:19) + 1 n f ξ (cid:48) k (cid:18) m m (cid:88) j =1 g η (cid:48) kj (ˆ x nm , ξ (cid:48) k ) (cid:19) − n f ξ (cid:48) k (cid:18) m m (cid:88) j =1 g η (cid:48) kj (ˆ x ( k ) nm , ξ (cid:48) k ) (cid:19) . By Lipschitz continuity of f ξ and g η and that ˆ F ( k ) nm (ˆ x ( k ) nm ) − ˆ F ( k ) nm (ˆ x nm ) ≤
0, it holds(4.16) ˆ F nm (ˆ x ( k ) nm ) − ˆ F nm (ˆ x nm ) ≤ n L f L g || ˆ x ( k ) nm − ˆ x nm || . Since ˆ x nm is the unique minimizer of ˆ F nm ( x ), and ˆ F nm ( x ) satisfies QG condition withparameter µ , we have(4.17) ˆ F nm (ˆ x ( k ) nm ) − ˆ F nm (ˆ x nm ) ≥ µ || ˆ x ( k ) nm − ˆ x nm || δ . Combining with (4.16), we obtain(4.18) || ˆ x ( k ) nm − ˆ x nm || ≤ (cid:18) L f L g µn (cid:19) /δ . By Lipschitz continuity of f ξ ( · ) and g η ( · , ξ ), and definition of ˆ F nm (ˆ x nm ), we obtain E (cid:20) n n (cid:88) k =1 f ξ k (cid:18) m m (cid:88) j =1 g η kj (ˆ x ( k ) nm , ξ k ) (cid:19) − ˆ F nm (ˆ x nm ) (cid:21) ≤ L f L g (cid:18) L f L g µn (cid:19) /δ . (4.19)Combining (4.13), (4.19), and (4.14), we obtain(4.20) E E ≤ L f L g (cid:18) L f L g µn (cid:19) /δ + ∆( m ) . Second, by optimality of ˆ x nm of ˆ F nm , we have(4.21) E E = E [ ˆ F nm (ˆ x nm ) − ˆ F nm ( x ∗ )] ≤ . Next, we bound E E . Define ˆ F n ( x ) := n (cid:80) ni =1 f ξ i (cid:0) E η | ξ i [ g η ( x, ξ i )] (cid:1) . Notice that x ∗ is independent of { η ij } mj =1 for any i = { , · · · , n } and E [ ˆ F n ( x ∗ ) − F ( x ∗ )] = 0. ByLemma 3.1, it holds(4.22) E E = E [ ˆ F nm ( x ∗ ) − ˆ F n ( x )] + E [ ˆ F n ( x ) − F ( x )] ≤ ∆( m );Combining (4.20), (4.21), (4.22), with Markov inequality, we obtain the desired re-sult.The sample complexity of SAA under the H¨olderian error bound condition followsdirectly. orollary Under Assumption 4.1 and 4.2, with probability at least − α ,the solution to the SAA problem is (cid:15) -optimal to the original CSO problem if the samplesizes n and m satisfy that n ≥ (2 L f L g ) δ +1 µ ( α(cid:15) ) δ , m ≥ (cid:40) L f σ g α (cid:15) , Under Assumption 3.1 , Sσ g α(cid:15) , f ξ ( · ) is also Lipschitz smooth.Hence, the total sample complexity of SAA for achieving an (cid:15) -optimal solution is atmost T = mn + n = O (1 /(cid:15) δ +2 ); when f ξ ( · ) is Lipschitz smooth, the total samplecomplexity reduces to T = mn + n = O (1 /(cid:15) δ +1 ) . In particular, when the empirical function is strongly convex or satisfies the QGcondition, i.e., Assumption 4.1 with δ = 1, this leads to the total sample complexity of O (1 /(cid:15) ) for Lipschitz continuous case and O (1 /(cid:15) ) for Lipschitz smooth case, respec-tively. From the above corollary, the error bound condition only affects the samplecomplexity of the outer samples, and the sample size decreases as δ decreases. As δ gets closer to zero, the sample complexity will essentially be dominated by the innersample size.A key difference between the results in Theorems 4.1 and 4.2 lies in the dependenceon the problem dimension d and confidence level α . While the sample complexityunder the H¨olderian error bound condition is dimension-free, the dependence on theconfidence level 1 − α grows from O (log(1 /α )) to O (1 /α δ ). This is similar to classicalresults on stochastic optimization for strongly convex objectives [35]. Theorem 4.2could also be used to derive a dimensional free sample complexity of l regularizedSAA for a general convex CSO problem. See Appendix Section E for more details.
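To connect these rates with the sample-allocation strategies used in the numerical experiments of Section 6, the following back-of-the-envelope calculation (not from the paper) converts the sample-size requirements above into a rule for splitting a fixed budget $T \approx nm$ between outer and inner samples under the QG condition ($\delta = 1$).

```latex
% Under QG, the corollary above requires roughly
%   n = O(1/\epsilon),  m = O(1/\epsilon^2)   (Lipschitz continuous f_\xi),
%   n = O(1/\epsilon),  m = O(1/\epsilon)     (Lipschitz smooth f_\xi).
% Eliminating \epsilon from T \approx nm gives the budget splits
\begin{align*}
\text{nonsmooth: } & T \approx nm = O(1/\epsilon^{3})
  \;\Rightarrow\; \epsilon = O(T^{-1/3}),\quad n = O(T^{1/3}),\quad m = O(T^{2/3}),\\
\text{smooth: }    & T \approx nm = O(1/\epsilon^{2})
  \;\Rightarrow\; \epsilon = O(T^{-1/2}),\quad n = O(T^{1/2}),\quad m = O(T^{1/2}).
\end{align*}
```

This is the reason why $n = O(\sqrt{T})$ is the allocation predicted, and observed, to work best for the smooth logistic-regression experiment in Section 6.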
5. Sample Complexity of SAA for CSO with Independent Random Variables.
In this section, we consider the special case of CSO when the randomvariables ξ and η are independent. The objective then simplifies to:(5.1) min x ∈X F ( x ) := E ξ [ f ξ ( E η [ g η ( x, ξ )])] . This is similar yet slightly more general than (1.8), the compositional objective con-sidered in [42, 41]. Note that the inner cost function we consider here is dependenton both ξ and η , and thus cannot be written as a composition of two deterministicfunctions.The sample complexity of SAA under the conditional sampling setting achieved inSection 4 applies to this setting since it can be viewed as a special case of the former.However, since the inner expectation is no longer a conditional expectation, we nowconsider an alternative modified SAA, using the independent sampling scheme, inwhich we use the same set of samples to estimate the inner expectation. The procedureof the independent sampling scheme for solving (5.1) works as follows: first generate n i.i.d. samples { ξ i } ni =1 from the distribution of ξ ; and m i.i.d samples { η j } mj =1 fromthe distribution of η , then solve the following approximation problem:(5.2) min x ∈X ˆ F nm ( x ) := 1 n n (cid:88) i =1 f ξ i (cid:18) m m (cid:88) j =1 g η j ( x, ξ i ) (cid:19) . As a result, the total sample complexity becomes T = m + n . In recent workby [9], the authors established a central limit theorem result for the SAA (5.2) with = n . In particular, they have shown that for Lipschitz smooth functions f ξ ( · ) and g η ( · , ξ ) = g η ( · ), the SAA estimator converges in distribution as follows: √ m (cid:18) min x ∈X ˆ F mm ( x ) − min x ∈X F ( x ) (cid:19) → Z ( W )where W ( · ) = ( W ( · ) , W ( · )) is a zero-mean Brownian process with certain covariancefunctions and Z ( · ) is a function that depends on the first order information. Thisresult only yields an asymptotic convergence rate of order O (1 / √ m ) for the SAAwith m = n . Below, we will provide a finite sample analysis for SAA and establishrefined sample complexity results based on concentration inequality techniques.In the SAA problem (5.2), the component functions f ξ i (cid:0) m (cid:80) mj =1 g η j ( x, ξ i ) (cid:1) sharethe same random vectors { η j } mj =1 and are dependent. This is distinct from the SAA(1.5) considered in the previous section. Because of this key difference, the previousanalysis will no longer apply to this modified SAA. We will resort to a differentanalysis for deriving the sample complexity. Similarly, we consider two structuralassumptions, when the empirical objective is only known to be Lipschitz continuousand when the empirical objective also satisfies the error bound condition. We firstconsider the case when the objective is Lipschitz continuous. We make the samebasic assumptions of the Lipschitz continuity of f ξ ( · ) and g η ( · , ξ ) and boundedness ofvariances as described in Assumption 3.1. Our main result is summarized below. Theorem
Under the independent sampling scheme and Assumption 3.1, forany δ > , there exists an (cid:15) > such that for any (cid:15) ∈ (0 , (cid:15) ) , it holds (5.3) P (cid:18) sup x ∈X | ˆ F nm ( x ) − F ( x ) | > (cid:15) (cid:19) ≤O (1) (cid:18) L f L g D X (cid:15) (cid:19) d (cid:18) exp (cid:18) − n(cid:15) δ + 2) σ f (cid:19) + nk exp (cid:18) − m(cid:15) δ + 2) L f σ g (cid:19)(cid:19) . Here, d is the dimension of the decision set, and k is the dimension of the range offunction g .Proof. First, we pick a υ -net { x l } Ql =1 on the decision set X , such that L f L g υ = (cid:15)/
4. Using a similar argument in the proof of Theorem 4.1, we obtain(5.4) P (cid:18) sup x ∈X | ˆ F nm ( x ) − F ( x ) | > (cid:15) (cid:19) ≤ Q (cid:88) l =1 P (cid:18) | ˆ F nm ( x l ) − F ( x l ) | > (cid:15) (cid:19) ≤ Q (cid:88) l =1 P (cid:18) | ˆ F nm ( x l ) − ˆ F n ( x l ) | > (cid:15) (cid:19) + Q (cid:88) l =1 P (cid:18) | ˆ F n ( x l ) − F ( x l ) | > (cid:15) (cid:19) . By Lipschitz continuity of f ξ ( x ) and Lemma 2.1, we have(5.5) P (cid:18) | ˆ F nm ( x l ) − ˆ F n ( x l ) | ≥ (cid:15) (cid:19) ≤ n (cid:88) i =1 P (cid:18) || m m (cid:88) j =1 g η j ( x l , ξ i ) − E η g η ( x l , ξ i ) || ≥ (cid:15) L f (cid:19) ≤ nk exp (cid:18) − m(cid:15) δ + 2) L f σ g (cid:19) . y Lemma 2.1, we obtain(5.6) P (cid:18) | ˆ F n ( x l ) − F ( x l ) | ≥ (cid:15) (cid:19) ≤ (cid:18) − n(cid:15) δ + 2) σ f (cid:19) . Combining with the fact that Q ≤ O (1)( L g L f D X (cid:15) ) d , we obtain the desired result.Invoking the relation in (4.5), the above theorem implies the following: Corollary
Under Assumption 3.1, with probability at least − α , the so-lution to the modified SAA problem (5.2) is (cid:15) -optimal to the original problem (5.1) ifthe sample sizes n and m satisfy n ≥ O (1) σ f (cid:15) (cid:20) d log (cid:18) L f L g D X (cid:15) (cid:19) + log (cid:18) α (cid:19) (cid:21) ,m ≥ O (1) L f σ g (cid:15) (cid:20) d log (cid:18) L f L g D X (cid:15) (cid:19) + log (cid:18) α (cid:19) + log ( nk ) (cid:21) . Ignoring the log factors, under Assumption 3.1, the total sample complexity of themodified SAA for achieving an (cid:15) -optimal solution is T = m + n = O ( d/(cid:15) ) . Note that this sample complexity is significantly smaller than that for the gen-eral CSO. The O ( d/(cid:15) ) sample complexity also matches the lower bounds on samplecomplexity of SAA for classical stochastic optimization with Lipschitz continuousobjectives [25]; therefore, this result is unimprovable without further assumptions. We now con-sider the case when the empirical objective satisfies Assumption 4.1 and 4.2, i.e.,the empirical objective ˆ F nm ( x ) satisfies the error bound condition and has a uniqueminimizer for any integers n, m . Our main result is summarized as follows. Theorem
Under Assumptions 3.1, 4.1, and 4.2, for any (cid:15) > and υ > ,we have (5.7) P ( F (ˆ x nm ) − F ( x ∗ ) ≥ (cid:15) ) ≤ (cid:15) (cid:18) L f L g (cid:18) L f L g µn (cid:19) /δ + O (1) L f M g (cid:112) d log( D X /υ ) √ m + L f σ g √ m + 2 υL f L g (cid:19) . The solution to the modified SAA problem (5.2) is (cid:15) -optimal to the problem (5.1) withprobability at least − α , if υ = (cid:15)α L f L g , and the sample sizes n and m satisfy that (5.8) n ≥ (2 L f L g ) δ +1 µ ( α(cid:15) ) δ , m ≥ max (cid:40)(cid:18) L f σ g α(cid:15) (cid:19) , O (1) (cid:18) L f M g α(cid:15) (cid:19) d log (cid:18) D X L f L g α(cid:15) (cid:19)(cid:41) . Similar to Theorem 4.2, the outer sample size is independent of dimension anddecreases as δ decreases. As δ gets closer to zero, the sample complexity will essentiallybe dominated by the inner sample size. In particular, when the empirical functionsatisfies the QG condition or is strongly convex, i.e., Assumption 4.1 holds with δ = 1,the outer sample size is reduced from O ( d/(cid:15) ) in the Lipschitz continuous case to O (1 /(cid:15) ). Yet, the total sample complexity remains O ( d/(cid:15) ).For a CSO problem with independent random vectors (5.1), both SAA approaches,through conditional sampling, or independent sampling, can be applied to solve the roblem. Comparing Theorem 4.2 and Theorem 5.2, when smoothness and the qua-dratic growth condition are satisfied, the sample complexities of these two SAA ap-proaches achieve the same order O (1 /(cid:15) ), except for an extra O ( d ) factor for theindependent sampling. Interestingly, for a given small dimension d and the same sam-ple budget T , the independent sampling might outperform the conditional samplingscheme since the constant factor in the sample complexity of conditional samplingis much larger. The numerical experiment on our testing cases in the next sectionfurther supports the finding.In contrast to the sample complexity established in Section 4 for the conditionalsampling setting, a notable difference here is that the Lipschitz smoothness conditiondoes not necessarily help reduce the sample complexity. This result aligns with thecentral limit theorem established in [9]. One of the reasons arises from the inter-dependence among the component functions in the modified SAA objective, leadingto extra variance. Because of that, the analysis requires sophisticated arguments tohandle the dependence and is much more involved . We defer the proof to AppendixSection B. Remark
Although the overall O (1 /(cid:15) ) sample complexity cannot be furtherimproved in general, it is worth pointing out that, for some interesting specific in-stances, the modified SAA could achieve lower sample complexity than what is de-scribed from theory. We illustrate this from the following example.Example 3. For γ >
0, consider the following problemmin x ∈X F ( x ) := H ( E η [ x + η ] , γ ) + ( E η [ x + η ]) , where η ∼ N (0 , σ η ) and H ( · , γ ) is the Huber function, i.e.,(5.9) H ( x, γ ) = | x | − γ for | x | > γ. γ x for | x | ≤ γ. Note that here f ξ ( x ) := f ( x ) = H ( x, γ ) + x is deterministic, and g η ( x, ξ ) = x + η .When γ > f ( x ) is 1 /γ -Lipschitz smooth. When γ → f ( x ) → | x | + x , whichis no longer differentiable. In this example, x ∗ = argmin x ∈X F ( x ) = − E η , F ∗ =min x ∈X F ( x ) = 0. The empirical objective becomes ˆ F m ( x ) = H ( x + ¯ η, γ ) + ( x + ¯ η ) , where ¯ η = m (cid:80) mj =1 η j . Thus, ˆ x m = argmin x ∈X ˆ F m ( x ) = − ¯ η . We show that the errorof SAA satisfies(5.10) 0 ≤ E F (ˆ x m ) − F ( x ∗ ) − (cid:18) σ η γm erf (cid:18)(cid:115) γ m σ η (cid:19) + σ η m (cid:19) ≤ (cid:114) σ η πm exp (cid:18) − mγ σ η (cid:19) , where erf( x ) := √ π (cid:82) x exp( − x ) dx . As a result, when γ → γ → E F (ˆ x m ) − F ( x ∗ ) = (cid:114) σ η πm + σ η m . For completeness, we provide detailed derivation in Appendix Section C. This exampleshows that the SAA error improves from O (1 / √ m ) to O (1 /m ) as the objective tran-sits from nonsmooth to smooth. When γ →
0, the function is no longer Lipschitz smooth, and the $O(1/\sqrt m)$ bound for this setting is indeed tight. It remains an interesting open problem to identify sufficient conditions for achieving theoretically better sample complexity under the independent sampling scheme.

[Figure 6.1. Logistic regression, conditional sampling, dimension d = 10; panels (b) and (c) correspond to σ_η/σ_ξ = 10 and σ_η/σ_ξ = 100, respectively.]
6. Numerical Experiments.
In this section, we conduct numerical experiments based on two applications, logistic regression and robust regression, to demonstrate the performance of SAA for solving CSO problems. For a fixed sample budget $T$, we adopt different sample allocation strategies for $(m, n)$ and compute the corresponding accuracy of the SAA estimators. We repeat 30 runs for each sample allocation and report the average performance. The SAA problems are solved by CVXPY 1.0.9 [10].

We first consider the robust logistic regression problem in Example 2. The problem is formulated in (4.7), and its SAA counterpart is of the form (4.8) with domain $\mathcal{X} = \{x \in \mathbb{R}^d : \|x\|_2 \leq r\}$ for a fixed radius $r$. Note that from Example 2, $f$ is Lipschitz smooth, $\hat{F}_{nm}(x)$ satisfies the QG condition on any compact convex set, and with high probability it has a unique minimizer for large $n$. Theorem 4.2 implies that the theoretically optimal sample allocation strategy is $n = O(\sqrt{T})$ and $m = O(\sqrt{T})$.

In the experiment, we set $d = 10$, and the samples of $\xi = (a, b)$ and $\eta$ are generated as follows: $a_i \sim \mathcal{N}(0, \sigma_\xi^2 I_d)$, $b_i = \pm 1$ according to the sign of $a_i^\top x^*$, and $\eta_{ij} \sim \mathcal{N}(a_i, \sigma_\eta^2 I_d)$. We set $\sigma_\xi = 1$ and consider three cases, $\sigma_\eta \in \{0.1, 10, 100\}$, corresponding to low, medium, and high variance from the inner randomness. For a range of sample budgets $T$, four sample allocation strategies of the form $n = [T^a]$ (with $m = [T/n]$) are considered. We then compute the average estimation error $F(\hat{x}_{nm}) - F^*$ over 30 runs and its standard deviation. The results are summarized in Figure 6.1, where the $x$-axis denotes the sample budget $T$ and the $y$-axis shows the estimation error. Each curve represents a sampling scheme, showing the average error and the upper confidence bound.

The trend in Figure 6.1(a)–(c) shows that when the inner variance is relatively large, setting $n = O(T^{1/2})$ consistently outperforms the other sampling strategies, which matches our analysis. The error bars suggest that a larger number of outer samples results in a smaller deviation of the estimation accuracy.

We now examine the robust regression problem, where the objective is no longer Lipschitz smooth. The problem is as follows:
$$
(6.1) \qquad \min_{x\in\mathcal{X}} F(x) = \mathbb{E}_{\xi=(a,b)} \big| \mathbb{E}_{\eta\mid\xi}\, \eta^\top x - b \big|,
$$
where $a \in \mathbb{R}^d$ is a random feature vector, $b \in \mathbb{R}$ is the label, $\eta = a + \mathcal{N}(0, \sigma_\eta^2 I_d)$ is a perturbed noisy observation of the input feature vector $a$, and the domain is $\mathcal{X} = \{x \in \mathbb{R}^d : \|x\|_2 \leq r\}$. For comparison purposes, we also consider the smoothed version of this problem based on the Huber function:
$$
\min_{x\in\mathcal{X}} F_\gamma(x) = \mathbb{E}_{\xi=(a,b)} H\big( \mathbb{E}_{\eta\mid\xi}\, \eta^\top x - b,\; \gamma \big),
$$
where $\gamma > 0$ is the smoothing parameter. The corresponding empirical objectives are
$$
\hat{F}_{nm}(x) = \frac{1}{n}\sum_{i=1}^n \Big| \frac{1}{m}\sum_{j=1}^m \eta_{ij}^\top x - b_i \Big|, \qquad \hat{F}^{\gamma}_{nm}(x) = \frac{1}{n}\sum_{i=1}^n H\Big( \frac{1}{m}\sum_{j=1}^m \eta_{ij}^\top x - b_i,\; \gamma \Big).
$$
Theorem 4.1 and Theorem 4.2 indicate that Lipschitz smoothness of the outer function $f_\xi(x)$ helps reduce the inner sample size required to achieve the same level of accuracy. For a given budget $T$, the theoretically optimal sample allocation strategies for these two problems are $n = O(T^{1/3})$ and $n = O(T^{1/2})$, respectively.

In our experiment, we set $d = 20$. Samples of $\xi = (a, b)$ and $\eta$ are generated as follows: $a_i \sim \mathcal{N}(0, \sigma_\xi^2 I_d)$, $b_i = a_i^\top x^*$, and $\eta_{ij} \sim \mathcal{N}(a_i, \sigma_\eta^2 I_d)$; a minimal code sketch of this setup is given below.
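For concreteness, the following minimal sketch, ours and not the authors' experimental code, generates data as described above and solves the SAA of (6.1) or its Huber-smoothed counterpart with CVXPY. The radius `r`, the specific allocation, and the helper names are illustrative assumptions; note that CVXPY's built-in `huber` differs from $H(\cdot,\gamma)$ in (5.9) by a factor of $2\gamma$, which is rescaled in the objective.

```python
import numpy as np
import cvxpy as cp

def generate_data(n, m, d, sigma_xi, sigma_eta, x_star, rng):
    # a_i ~ N(0, sigma_xi^2 I_d), b_i = a_i^T x*, eta_ij ~ N(a_i, sigma_eta^2 I_d).
    A = rng.normal(0.0, sigma_xi, size=(n, d))
    b = A @ x_star
    Eta_bar = A + rng.normal(0.0, sigma_eta, size=(n, m, d)).mean(axis=1)
    return Eta_bar, b  # Eta_bar[i] = (1/m) sum_j eta_ij

def solve_saa(Eta_bar, b, r, gamma=None):
    # SAA of (6.1) (absolute value loss) or of its Huber-smoothed version.
    n, d = Eta_bar.shape
    x = cp.Variable(d)
    resid = Eta_bar @ x - b
    if gamma is None:
        obj = cp.sum(cp.abs(resid)) / n
    else:
        # cvxpy's huber(y, M) equals y^2 for |y| <= M and 2M|y| - M^2 otherwise,
        # i.e. 2*gamma*H(y, gamma) in the notation of (5.9), hence the rescaling.
        obj = cp.sum(cp.huber(resid, gamma)) / (2 * gamma * n)
    prob = cp.Problem(cp.Minimize(obj), [cp.norm(x, 2) <= r])
    prob.solve()
    return x.value

rng = np.random.default_rng(0)
d, T, r = 20, 10_000, 10.0               # r is a placeholder radius for the ball constraint
x_star = rng.normal(size=d); x_star *= r / (2 * np.linalg.norm(x_star))
n = int(round(T ** (1 / 3))); m = T // n  # e.g. the n = O(T^{1/3}) allocation
Eta_bar, b = generate_data(n, m, d, sigma_xi=1.0, sigma_eta=10.0, x_star=x_star, rng=rng)
x_hat = solve_saa(Eta_bar, b, r, gamma=None)
```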
As in the previous experiment, we measure the average error and the upper confidence bound for both problems, for a range of sample budgets $T$, under four different sample allocation strategies, over 30 runs. We also consider two smoothness parameters, $\gamma \in \{0.1, 10\}$. The results are summarized in Figure 6.2.

Figure 6.2(a)–(c) shows that setting $n = O(\sqrt{T})$ indeed yields almost the best accuracy for absolute value loss minimization, which again matches our analysis. The overall performance of SAA for the original problem and for the smoothed problems behaves quite similarly in this case, yet solving the smoothed problem yields much better accuracy under the same budget. This also supports our theoretical finding that the sample complexity is lower for smooth problems.

In the final experiment, we consider a modified logistic regression example that falls into the special case with independent inner and outer randomness:
$$
\min_{x\in\mathcal{X}} F(x) = \mathbb{E}_{\xi=(a,b)} \log\big(1 + \exp\big(-b\, (\mathbb{E}_\eta \eta + a)^\top x\big)\big),
$$
where $a \sim \mathcal{N}(0, \sigma_\xi^2 I_d) \in \mathbb{R}^d$ is a random feature vector, $b \in \{\pm 1\}$, and $\eta \sim \mathcal{N}(0, \sigma_\eta^2 I_d)$ is the noise. The empirical function $\hat{F}_{nm}(x)$ for the two sampling schemes is of the form
$$
\hat{F}_{nm}(x) = \frac{1}{n}\sum_{i=1}^n \log\bigg(1 + \exp\Big(-b_i \Big(\frac{1}{m}\sum_{j=1}^m \eta_{ij} + a_i\Big)^\top x\Big)\bigg).
$$
When employing the independent sampling scheme, we generate $\{\eta_j\}_{j=1}^m$ and let $\eta_{ij} = \eta_j$ for all $i$. For a given total budget $T$, the inner sample size is set to $m = T/n$ under conditional sampling and $m = T - n$ under independent sampling (see the sketch below). In the experiment we set $\sigma_\xi = 1$ and $\sigma_\eta = 10$, consider two choices of the dimension $d$, and generate the samples accordingly. For any given sample budget $T$, we compare the performance of the two sampling schemes under different choices of the outer sample size $n$, varying from 0 to 10000.

Figure 6.3(a) illustrates the comparison when $d = 10$ and $T = 10000$. The bell shape in Figure 6.3(a) reflects a clear bias-variance tradeoff between different choices of $n$ and $m$.
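The following minimal NumPy/CVXPY sketch, ours and for illustration only, contrasts the two ways of forming the inner-sample averages used in this comparison: conditional sampling draws fresh inner samples for every outer sample, while independent sampling reuses a single batch $\{\eta_j\}$ across all outer samples. The radius, the fixed $n$, and the helper names are illustrative assumptions; the budget accounting $m = T/n$ versus $m = T - n$ follows the description above.

```python
import numpy as np
import cvxpy as cp

def logistic_saa(A, b, Eta_bar, r=10.0):
    # Minimize (1/n) sum_i log(1 + exp(-b_i (eta_bar_i + a_i)^T x)) over ||x||_2 <= r (r is a placeholder).
    n, d = A.shape
    x = cp.Variable(d)
    z = cp.multiply(b, (Eta_bar + A) @ x)
    prob = cp.Problem(cp.Minimize(cp.sum(cp.logistic(-z)) / n), [cp.norm(x, 2) <= r])
    prob.solve()
    return x.value

def inner_averages(n, m, d, sigma_eta, rng, independent):
    if independent:
        eta = rng.normal(0.0, sigma_eta, size=(m, d))                    # one shared batch {eta_j}
        return np.tile(eta.mean(axis=0), (n, 1))                         # same average for every i
    return rng.normal(0.0, sigma_eta, size=(n, m, d)).mean(axis=1)       # fresh eta_ij for each i

rng = np.random.default_rng(0)
d, T, n, sigma_xi, sigma_eta = 10, 10_000, 100, 1.0, 10.0
x_star = rng.normal(size=d)
A = rng.normal(0.0, sigma_xi, size=(n, d))
b = np.sign(A @ x_star)
x_cond = logistic_saa(A, b, inner_averages(n, T // n, d, sigma_eta, rng, independent=False))
x_ind = logistic_saa(A, b, inner_averages(n, T - n, d, sigma_eta, rng, independent=True))
```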
Fig. 6.2. Error of SAA for the absolute value loss and the Huber loss, dimension $d = 20$. Panels (a)–(c): absolute value loss with $\sigma_\eta/\sigma_\xi = 0.1, 10, 100$; panels (d)–(f): Huber loss with $\gamma = 0.1$; panels (g)–(i): Huber loss with $\gamma = 10$.

Fig. 6.3. Comparison of the conditional sampling and independent sampling schemes. Panel (a): $d = 10$, $T = 10000$, varying outer sample size $n$; panel (b): various $d$ and $T$.
In Figure 6.3(b), we report the best performance (obtained by choosing the best $n$) of the two sampling schemes for the two choices of $d$ and for $T$ ranging from 1000 to 50000. Figure 6.3(b) shows that the independent sampling scheme always achieves a smaller error for this logistic regression problem. The gap between the two schemes decreases as the dimension increases, which also matches our analysis.

7. Conclusion. In this paper, we introduce the class of conditional stochastic optimization problems and provide a sample complexity analysis of sample average approximation under different structural assumptions. Our results show that the overall sample complexity can be significantly reduced under the Lipschitz smoothness condition, which is very different from the theory of classical stochastic optimization and multi-stage stochastic programming. By exploiting error bound conditions, the sample complexity can be further reduced. To the best of our knowledge, these are the first non-asymptotic sample complexity results established in the context of conditional stochastic optimization. For future work, we will investigate stochastic approximation algorithms for solving this family of problems and establish their sample complexities.
Appendix A. Proof of Propositions.

A.1. Proof of Lemma 2.1.
Proof.
The proof for a one-dimensional random variable was given in [20] using the Chernoff bound. Based on that result, we consider the case when $X$ is a zero-mean random vector in $\mathbb{R}^k$. Denote $X_i = (X_i^1, X_i^2, \cdots, X_i^k)^\top$ for $i = 1, \cdots, n$, $\sigma_j^2 = \mathbb{V}(X^j)$, $z_j = \big(\sum_{l=1}^k \sigma_l^2\big)/\sigma_j^2$, and let $I_j(\cdot)$ be the rate function of the $j$th coordinate of the random vector $X$. We have
$$
(A.1) \qquad \mathbb{P}(\|\bar{X}\| \geq \epsilon) = \mathbb{P}\Big( \sum_{j=1}^k (\bar{X}^j - \mathbb{E}X^j)^2 \geq \epsilon^2 \Big) \leq \sum_{j=1}^k \mathbb{P}\Big( (\bar{X}^j)^2 \geq \frac{\epsilon^2}{z_j} \Big) = \sum_{j=1}^k \mathbb{P}\Big( |\bar{X}^j| \geq \frac{\epsilon}{\sqrt{z_j}} \Big) \leq \sum_{j=1}^k 2\exp\Big( -n \min\Big\{ I_j\Big(\frac{\epsilon}{\sqrt{z_j}}\Big),\; I_j\Big(-\frac{\epsilon}{\sqrt{z_j}}\Big) \Big\} \Big).
$$
By the one-dimensional case of Lemma 2.1 and the definition of $z_j$, we get
$$
\mathbb{P}(\|\bar{X}\| \geq \epsilon) \leq \sum_{j=1}^k 2\exp\Big( -\frac{n\epsilon^2}{(\delta+2)\, z_j \sigma_j^2} \Big) = 2k \exp\Big( -\frac{n\epsilon^2}{(\delta+2) \sum_{j=1}^k \sigma_j^2} \Big).
$$
Using the fact that $\sum_{j=1}^k \sigma_j^2 \leq \mathbb{E}\|X\|^2$, we obtain the desired result.

Appendix B. Proof of Theorem 5.2.
Convergence Analysis.
We follow a decomposition similar to the one used in proving Theorem 4.2 and use the same notation, such as $\hat{F}^{(k)}_{nm}(x)$ and $\hat{x}^{(k)}_{nm}$ for the perturbed empirical function and its minimizer, except that we replace all the $\eta_{kj}$ with $\eta_j$ for $k = 1, \cdots, n$ and replace the conditional expectation $\mathbb{E}_{\eta\mid\xi}$ with $\mathbb{E}_\eta$. Unfortunately, one immediately notices that Lemma 3.1 is no longer applicable for bounding the second term in (4.13),
$$
\mathbb{E}\bigg[ \frac{1}{n}\sum_{k=1}^n f_{\xi_k}\Big( \mathbb{E}_\eta g_\eta(\hat{x}^{(k)}_{nm}, \xi_k) \Big) - \frac{1}{n}\sum_{k=1}^n f_{\xi_k}\Big( \frac{1}{m}\sum_{j=1}^m g_{\eta_j}(\hat{x}^{(k)}_{nm}, \xi_k) \Big) \bigg],
$$
because the minimizer $\hat{x}^{(k)}_{nm}$ now depends on $\{\eta_j\}_{j=1}^m$. Below we provide the detailed proof of Theorem 5.2.

Proof.
Define $\mathcal{E} := F(\hat{x}_{nm}) - \hat{F}_{nm}(\hat{x}_{nm})$, and let
$$
\hat{F}^{(k)}_{nm}(x) := \frac{1}{n}\sum_{i \neq k} f_{\xi_i}\Big( \frac{1}{m}\sum_{j=1}^m g_{\eta_j}(x, \xi_i) \Big) + \frac{1}{n} f_{\xi'_k}\Big( \frac{1}{m}\sum_{j=1}^m g_{\eta_j}(x, \xi'_k) \Big)
$$
be the empirical function obtained by replacing the outer sample $\xi_k$ with an i.i.d. copy $\xi'_k$. Denote $\hat{x}^{(k)}_{nm} = \operatorname{argmin}_{x\in\mathcal{X}} \hat{F}^{(k)}_{nm}(x)$. Then $\mathbb{E}\mathcal{E}$ can be written as
$$
(B.1) \qquad \mathbb{E}\mathcal{E} = \mathbb{E}\bigg[ F(\hat{x}_{nm}) - \frac{1}{n}\sum_{k=1}^n f_{\xi_k}\Big( \mathbb{E}_\eta g_\eta(\hat{x}^{(k)}_{nm}, \xi_k) \Big) \bigg] + \mathbb{E}\bigg[ \frac{1}{n}\sum_{k=1}^n f_{\xi_k}\Big( \mathbb{E}_\eta g_\eta(\hat{x}^{(k)}_{nm}, \xi_k) \Big) - \frac{1}{n}\sum_{k=1}^n f_{\xi_k}\Big( \frac{1}{m}\sum_{j=1}^m g_{\eta_j}(\hat{x}^{(k)}_{nm}, \xi_k) \Big) \bigg] + \mathbb{E}\bigg[ \frac{1}{n}\sum_{k=1}^n f_{\xi_k}\Big( \frac{1}{m}\sum_{j=1}^m g_{\eta_j}(\hat{x}^{(k)}_{nm}, \xi_k) \Big) - \hat{F}_{nm}(\hat{x}_{nm}) \bigg].
$$
Since $\xi_k$ and $\xi'_k$ are i.i.d., $\hat{x}_{nm}$ and $\hat{x}^{(k)}_{nm}$ follow the same distribution, so $\mathbb{E}F(\hat{x}_{nm}) = \mathbb{E}F(\hat{x}^{(k)}_{nm})$. As $\hat{x}^{(k)}_{nm}$ is independent of $\xi_k$, by the definition of $F(x)$ we know $\mathbb{E}F(\hat{x}^{(k)}_{nm}) = \mathbb{E} f_{\xi_k}\big( \mathbb{E}_\eta g_\eta(\hat{x}^{(k)}_{nm}, \xi_k) \big)$ for any $k = 1, \cdots, n$. As a result, the first term is 0.

To analyze the second term, denote
$$
H_k(x) := f_{\xi_k}\Big( \mathbb{E}_\eta g_\eta(x, \xi_k) \Big) - f_{\xi_k}\Big( \frac{1}{m}\sum_{j=1}^m g_{\eta_j}(x, \xi_k) \Big).
$$
We pick a $\upsilon$-net $\{x_l\}_{l=1}^Q$ of the decision set $\mathcal{X}$, such that for any $x \in \mathcal{X}$ there exists $l \in \{1, \cdots, Q\}$ with $\|x - x_l\| \leq \upsilon$. Then it holds for any $s > 0$ that
$$
(B.2) \qquad \exp\big( s\, \mathbb{E} H_k(\hat{x}^{(k)}_{nm}) \big) \leq \exp\Big( s\, \mathbb{E} \max_{l=1,\cdots,Q} H_k(x_l) + 2s\upsilon L_f L_g \Big) \leq \mathbb{E} \exp\Big( s \max_{l=1,\cdots,Q} H_k(x_l) + 2s\upsilon L_f L_g \Big) = \mathbb{E} \max_{l=1,\cdots,Q} \exp\big( s H_k(x_l) + 2s\upsilon L_f L_g \big) \leq \sum_{l=1}^Q \mathbb{E} \exp\big( s H_k(x_l) + 2s\upsilon L_f L_g \big).
$$
The first inequality holds as $\hat{x}^{(k)}_{nm}$ is independent of $\xi_k$ and $f_\xi(\cdot)$ and $g_\eta(\cdot, \xi)$ are Lipschitz continuous, which implies
$$
H_k(\hat{x}^{(k)}_{nm}) \leq \sup_{x\in\mathcal{X}} H_k(x) \leq \max_{l=1,\cdots,Q} H_k(x_l) + 2\upsilon L_f L_g .
$$
The second inequality holds by Jensen's inequality. Next we show that $H_k(x_l) - \mathbb{E} H_k(x_l)$ is a sub-Gaussian random variable for any given $\xi_k$. Note that, given $\xi_k$, $H_k(x_l)$ is a function of $\{\eta_j\}_{j=1}^m$; denote $H_k(x_l) =: \tilde{H}(\eta_1, \ldots, \eta_m)$. Then for any $p \in [m]$, and given $\eta_1, \ldots, \eta_{p-1}, \eta_{p+1}, \cdots, \eta_m$, we have
$$
\sup_{\eta'_p} \tilde{H}(\eta_1, \cdots, \eta'_p, \cdots, \eta_m) - \inf_{\eta''_p} \tilde{H}(\eta_1, \cdots, \eta''_p, \cdots, \eta_m) = \sup_{\eta'_p, \eta''_p} \bigg[ f_{\xi_k}\Big( \frac{1}{m}\sum_{j\neq p} g_{\eta_j}(x_l, \xi_k) + \frac{1}{m} g_{\eta''_p}(x_l, \xi_k) \Big) - f_{\xi_k}\Big( \frac{1}{m}\sum_{j\neq p} g_{\eta_j}(x_l, \xi_k) + \frac{1}{m} g_{\eta'_p}(x_l, \xi_k) \Big) \bigg] \leq \sup_{\eta'_p, \eta''_p} \frac{L_f}{m} \Big| g_{\eta''_p}(x_l, \xi_k) - g_{\eta'_p}(x_l, \xi_k) \Big| \leq \frac{2 M_g L_f}{m},
$$
where $M_g$ is an upper bound of $|g_\eta(\cdot, \xi)|$ on $\mathcal{X}$. It implies that $H_k(x_l) = \tilde{H}(\eta_1, \cdots, \eta_m)$ has bounded differences $2M_g L_f/m$. By McDiarmid's inequality [26], for any $r > 0$,
$$
\mathbb{P}\big( H_k(x_l) - \mathbb{E}H_k(x_l) \geq r \big) \leq \exp\Big( -\frac{m r^2}{2 M_g^2 L_f^2} \Big).
$$
It implies that $H_k(x_l) - \mathbb{E}H_k(x_l)$ is a sub-Gaussian random variable with zero mean and variance proxy $M_g^2 L_f^2/m$ for any given $\xi_k$.
By definition, it yields
$$
\mathbb{E} \exp\big( s [ H_k(x_l) - \mathbb{E}H_k(x_l) ] \big) \leq \exp\Big( \frac{M_g^2 L_f^2 s^2}{2m} \Big).
$$
Since $x_l$ is independent of the random vectors $\{\eta_j\}_{j=1}^m$, by Lemma 3.1 we know $\mathbb{E} H_k(x_l) \leq L_f \sigma_g/\sqrt{m}$. It further implies
$$
\mathbb{E}\exp\big( s H_k(x_l) \big) \leq \exp\Big( \frac{M_g^2 L_f^2 s^2}{2m} + \frac{s L_f \sigma_g}{\sqrt{m}} \Big).
$$
With (B.2), we have
$$
\exp\big( s\, \mathbb{E} H_k(\hat{x}^{(k)}_{nm}) \big) \leq Q \exp\Big( \frac{M_g^2 L_f^2 s^2}{2m} + \frac{s L_f \sigma_g}{\sqrt{m}} + 2 s \upsilon L_f L_g \Big).
$$
Taking the logarithm, dividing by $s$ on each side, and minimizing over $s$ yields
$$
\mathbb{E} H_k(\hat{x}^{(k)}_{nm}) \leq \sqrt{\frac{2\log Q}{m}}\, L_f M_g + \frac{L_f \sigma_g}{\sqrt{m}} + 2 \upsilon L_f L_g .
$$
Since $Q \leq O(1)(D_{\mathcal{X}}/\upsilon)^d$, we have
$$
(B.3) \qquad \mathbb{E} H_k(\hat{x}^{(k)}_{nm}) \leq O(1)\, \frac{L_f M_g}{\sqrt{m}} \sqrt{d \log\Big(\frac{D_{\mathcal{X}}}{\upsilon}\Big)} + \frac{L_f \sigma_g}{\sqrt{m}} + 2 \upsilon L_f L_g .
$$
For the third term in (B.1), by following steps similar to (4.15)–(4.18), we obtain
$$
(B.4) \qquad \mathbb{E}\bigg[ \frac{1}{n}\sum_{k=1}^n f_{\xi_k}\Big( \frac{1}{m}\sum_{j=1}^m g_{\eta_j}(\hat{x}^{(k)}_{nm}, \xi_k) \Big) - \hat{F}_{nm}(\hat{x}_{nm}) \bigg] \leq L_f L_g \Big( \frac{L_f L_g}{\mu n} \Big)^{1/\delta} .
$$
Combining (B.1), (B.3), and (B.4),
$$
(B.5) \qquad \mathbb{E}\mathcal{E} \leq L_f L_g \Big( \frac{L_f L_g}{\mu n} \Big)^{1/\delta} + O(1)\, \frac{L_f M_g}{\sqrt{m}} \sqrt{d \log\Big(\frac{D_{\mathcal{X}}}{\upsilon}\Big)} + \frac{L_f \sigma_g}{\sqrt{m}} + 2 \upsilon L_f L_g .
$$
Similar to the steps from (4.21) to (4.22), by the optimality of $\hat{x}_{nm}$ for $\hat{F}_{nm}$ and Lemma 3.1,
$$
(B.6) \qquad \mathbb{E}\big[ \hat{F}_{nm}(\hat{x}_{nm}) - F(x^*) \big] \leq \mathbb{E}\big[ \hat{F}_{nm}(x^*) - F(x^*) \big] \leq \frac{L_f \sigma_g}{\sqrt{m}} .
$$
Finally, combining (B.5) and (B.6) with Markov's inequality, we obtain (5.7). Setting
$$
\frac{L_f L_g}{\epsilon} \Big( \frac{L_f L_g}{\mu n} \Big)^{1/\delta} \leq \frac{\alpha}{4}, \qquad O(1)\, \frac{L_f M_g}{\sqrt{m}\,\epsilon} \sqrt{d \log\Big(\frac{D_{\mathcal{X}}}{\upsilon}\Big)} \leq \frac{\alpha}{4}, \qquad \frac{2 L_f \sigma_g}{\sqrt{m}\,\epsilon} \leq \frac{\alpha}{4},
$$
together with the choice of $\upsilon$ so that $2\upsilon L_f L_g/\epsilon \leq \alpha/4$, we obtain the desired sample complexity (5.8).
Appendix C. Example of Huber Loss Minimization.
To show (5.10), denote $Y = \mathbb{E}\eta - \bar{\eta}$; then $Y \sim \mathcal{N}(0, \sigma_\eta^2/m)$. The error of SAA is
$$
(C.1) \qquad \mathbb{E} F(\hat{x}_m) - F(x^*) = \mathbb{E} H(\mathbb{E}\eta - \bar{\eta}, \gamma) + \mathbb{E}(\bar{\eta} - \mathbb{E}\eta)^2 = \int_{-\gamma}^{\gamma} \frac{y^2}{2\gamma}\, p(y)\, dy + 2\int_{\gamma}^{+\infty} \Big( y - \frac{\gamma}{2} \Big) p(y)\, dy + \mathbb{E}Y^2,
$$
where $p(y) = \frac{\sqrt{m}}{\sqrt{2\pi}\,\sigma_\eta} \exp\big( -\frac{m y^2}{2\sigma_\eta^2} \big)$ is the PDF of $Y$, and $\mathbb{E}Y^2 = \sigma_\eta^2/m$. Recall $\operatorname{erf}(x) := \frac{2}{\sqrt{\pi}}\int_0^x \exp(-t^2)\,dt$, and substitute $u := y\sqrt{m/(2\sigma_\eta^2)}$. The first term in (C.1) is
$$
\int_{-\gamma}^{\gamma} \frac{y^2}{2\gamma}\, p(y)\, dy = \frac{2\sigma_\eta^2}{m\gamma\sqrt{\pi}} \int_0^{\gamma\sqrt{m/(2\sigma_\eta^2)}} u^2 \exp(-u^2)\, du = \frac{\sigma_\eta^2}{2\gamma m} \operatorname{erf}\Big( \gamma\sqrt{\frac{m}{2\sigma_\eta^2}} \Big) - \sqrt{\frac{\sigma_\eta^2}{2\pi m}} \exp\Big( -\frac{\gamma^2 m}{2\sigma_\eta^2} \Big).
$$
Here we use the fact that
$$
\int_0^z x^2 \exp(-x^2)\, dx = \frac{\sqrt{\pi}}{4}\operatorname{erf}(z) - \frac{1}{2}\exp(-z^2)\, z .
$$
The second term in (C.1) satisfies
$$
0 \leq 2\int_{\gamma}^{+\infty} \Big( y - \frac{\gamma}{2} \Big) p(y)\, dy - \int_{\gamma}^{+\infty} y\, p(y)\, dy = \int_{\gamma}^{+\infty} (y - \gamma)\, p(y)\, dy \leq \int_{\gamma}^{+\infty} y\, p(y)\, dy = \sqrt{\frac{\sigma_\eta^2}{2\pi m}} \exp\Big( -\frac{m\gamma^2}{2\sigma_\eta^2} \Big).
$$
Combining these together, we have (5.10).
For a given $\gamma > 0$, $\operatorname{erf}\big( \gamma\sqrt{m/(2\sigma_\eta^2)} \big) \to 1$ as $m \to \infty$. By (5.10), we have
$$
\mathbb{E} F(\hat{x}_m) - F(x^*) = O\Big( \frac{1}{m} \Big).
$$
When $\gamma \to 0$, (C.1) becomes
$$
\lim_{\gamma\to 0} \mathbb{E} F(\hat{x}_m) - F(x^*) = \lim_{\gamma\to 0} \bigg[ \int_{-\gamma}^{\gamma} \frac{y^2}{2\gamma}\, p(y)\, dy + 2\int_{\gamma}^{+\infty} \Big( y - \frac{\gamma}{2} \Big) p(y)\, dy \bigg] + \frac{\sigma_\eta^2}{m} = \sqrt{\frac{2\sigma_\eta^2}{\pi m}} + \frac{\sigma_\eta^2}{m} = O\Big( \frac{1}{\sqrt{m}} \Big).
$$

Appendix D. Empirical Objectives Satisfying the Quadratic Growth Condition.

Strongly Convex Function Composed with a Linear Function. The empirical objective function is $\hat{F}_{nm}(x) = \frac{1}{n}\sum_{i=1}^n f_{\xi_i}(A_i x)$, where $f_\xi(\cdot)$ is $\mu$-strongly convex and $A_i x := \frac{1}{m}\sum_{j=1}^m g_{\eta_{ij}}(x, \xi_i)$ is the average of the linear inner functions $g_{\eta_{ij}}(x, \xi_i) := A_{\eta_{ij}} x$.

To show that $\hat{F}_{nm}(x)$ satisfies the QG condition, denote $u_i = A_i y$ and $v_i = A_i x$. Since $f_\xi(\cdot)$ is strongly convex,
$$
f_{\xi_i}(u_i) - f_{\xi_i}(v_i) - \nabla f_{\xi_i}(v_i)^\top (u_i - v_i) \geq \frac{\mu}{2}\|u_i - v_i\|^2 .
$$
Taking the average over the $n$ such inequalities, we obtain
$$
\frac{1}{n}\sum_{i=1}^n \Big[ f_{\xi_i}(u_i) - f_{\xi_i}(v_i) - \nabla f_{\xi_i}(v_i)^\top (u_i - v_i) \Big] \geq \frac{1}{n}\sum_{i=1}^n \frac{\mu}{2}\|u_i - v_i\|^2 .
$$
Replacing $u_i$, $v_i$ with $A_i y$ and $A_i x$, we have
$$
\frac{1}{n}\sum_{i=1}^n \Big[ f_{\xi_i}(A_i y) - f_{\xi_i}(A_i x) - \nabla f_{\xi_i}(A_i x)^\top A_i (y - x) \Big] \geq \frac{1}{n}\sum_{i=1}^n \frac{\mu}{2} (y-x)^\top A_i^\top A_i (y-x) .
$$
Since $\nabla \hat{F}_{nm}(x)^\top = \frac{1}{n}\sum_{i=1}^n \big( A_i^\top \nabla f_{\xi_i}(A_i x) \big)^\top = \frac{1}{n}\sum_{i=1}^n \nabla f_{\xi_i}(A_i x)^\top A_i$, we get
$$
\hat{F}_{nm}(y) - \hat{F}_{nm}(x) - \nabla \hat{F}_{nm}(x)^\top (y - x) \geq \frac{1}{n}\sum_{i=1}^n \frac{\mu}{2}\| A_i(y-x) \|^2 \geq \frac{\mu}{2}\Big\| \frac{1}{n}\sum_{i=1}^n A_i (y - x) \Big\|^2 .
$$
Letting $z$ be a point in $\mathcal{X}^*$, we have
$$
(D.1) \qquad \hat{F}_{nm}(x) - \hat{F}_{nm}(z) \geq \frac{\mu}{2}\Big\| \frac{1}{n}\sum_{i=1}^n A_i (x - z) \Big\|^2 \geq \frac{\mu\, \theta^2\big( \frac{1}{n}\sum_{i=1}^n A_i \big)}{2} \| x - z \|^2 \geq \min_{z\in\mathcal{X}^*} \frac{\mu\, \theta^2\big( \frac{1}{n}\sum_{i=1}^n A_i \big)}{2} \| x - z \|^2 .
$$
Here $\theta(A)$ denotes the smallest non-zero singular value of $A$. Thus $\hat{F}_{nm}(x)$ satisfies the quadratic growth condition for any $n$ and $m$. A special case is $n = m = 1$, i.e., a strongly convex objective composed with a linear function satisfies the QG condition.

Some Strictly Convex Functions Composed with a Linear Function on a Compact Set.
Consider Example 2, the logistic regression problem with objective
$$
F(x) = \mathbb{E}_{\xi=(a,b)} \log\big( 1 + \exp\big( -b\, \mathbb{E}_{\eta\mid\xi}[\eta]^\top x \big) \big),
$$
where $a \in \mathbb{R}^d$ is a random feature vector, $b \in \{1, -1\}$ is the label, and $\eta = a + \mathcal{N}(0, \sigma^2 I_d)$ is a perturbed noisy observation of the input feature vector $a$. Its empirical objective function $\hat{F}_{nm}(x)$ is given by
$$
\hat{F}_{nm}(x) = \frac{1}{n}\sum_{i=1}^n \log\bigg( 1 + \exp\Big( -b_i \frac{1}{m}\sum_{j=1}^m \eta_{ij}^\top x \Big) \bigg),
$$
where $\mathbb{E}\eta_{ij} = a_i$. Here $f_{\xi_i}(u) = \log(1 + \exp(-b_i u))$, which is strictly convex in $u$ for $b_i \in \{1, -1\}$, so $\hat{F}_{nm}(x) = \frac{1}{n}\sum_{i=1}^n f_{\xi_i}(u_i)$ with $u_i = \frac{1}{m}\sum_{j=1}^m \eta_{ij}^\top x$ bounded for any $x \in \mathcal{X}$ and any realization of $\eta_{ij}$. It is easy to verify that on any compact set, $f_{\xi_i}(u)$ is strongly convex, with a strong convexity parameter depending on the compact set. With (D.1), $\hat{F}_{nm}(x)$ satisfies the QG condition.

Note that the result is not necessarily true for all strictly convex functions. For instance, $\|x\|^4$ is strictly convex, but $\|Ax\|^4$ does not satisfy the quadratic growth condition on any compact set containing $x = 0$, as the numerical check below illustrates.
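A minimal numerical check, ours and for illustration only (the exponent 4 is as stated above), of why $\|Ax\|^4$ fails the quadratic growth condition near its minimizer $x = 0$: the ratio $(\hat{F}(x) - \hat{F}(0))/\|x\|^2$ vanishes as $x \to 0$, so no positive QG constant can exist.

```python
import numpy as np

A = np.eye(3)                      # any nonsingular A leads to the same conclusion
f = lambda x: np.linalg.norm(A @ x) ** 4

d = np.ones(3) / np.sqrt(3.0)      # a fixed direction; the minimizer is x = 0 with f(0) = 0
for t in [1.0, 0.1, 0.01, 0.001]:
    x = t * d
    print(t, f(x) / np.linalg.norm(x) ** 2)   # ratio ~ t^2 -> 0, violating quadratic growth
```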
Appendix E. Other Results on Regularized SAA.

Theorem 4.2 discusses the sample complexity of SAA in the strongly convex and QG cases. We now show that the result obtained in Theorem 4.2 can be used to obtain a dimension-free sample complexity for general convex objectives by adding an $\ell_2$-regularization term.
Lemma E.1 ([35]).
Consider a stochastic convex optimization problem $\min_{x\in\mathcal{X}} G(x)$, where $G(x)$ is the expectation of a convex random function. Suppose that the decision set $\mathcal{X} \subseteq \mathbb{R}^d$ has bounded diameter $D_{\mathcal{X}}$. Denote $G_\mu(x) := G(x) + \mu\|x\|^2$, where $\mu > 0$ is the strong convexity parameter. Denote by $\hat{G}(x)$ the SAA counterpart of $G(x)$, and let $x^* \in \operatorname{argmin}_{x\in\mathcal{X}} G(x)$, $\hat{x} \in \operatorname{argmin}_{x\in\mathcal{X}} \hat{G}(x)$, $x^*_\mu = \operatorname{argmin}_{x\in\mathcal{X}} G_\mu(x)$, and let $\hat{x}_\mu$ be the minimizer of the SAA of the regularized objective, namely $\hat{x}_\mu = \operatorname{argmin}_{x\in\mathcal{X}} \hat{G}_\mu(x) := \hat{G}(x) + \mu\|x\|^2$. If $\mathbb{E}[G_\mu(\hat{x}_\mu) - G_\mu(x^*_\mu)] \leq \beta(\mu)$, then
$$
\mathbb{E}[G(\hat{x}_\mu) - G(x^*)] \leq \beta(\mu) + \mu D_{\mathcal{X}}^2 .
$$
E.1.
This theorem shows that the minimum point ˆ x µ to a l -regularizedempirical function ˆ G µ could be a good solution to the original convex function G ( x ) as long as one selects µ properly. Note that ˆ x µ might not be a minimum point ofthe empirical function ˆ G ( x ) . In CSO case, according to Theorem 4.2, if F ( x ) isconvex, the expected error of SAA method for min x ∈X F ( x ) + µ || x || is bounded by β ( µ ) = L f L g µn +2∆( m ) . Then, E F (ˆ x nm ) − F ( x ∗ ) ≤ L f L g µn + µ D X +2∆( m ) . Minimizingover µ , and by Markov inequality, we obtain, P ( F (ˆ x nm ) − F ( x ∗ ) ≥ (cid:15) ) ≤ √ L f L g D X √ n(cid:15) + 2∆( m ) (cid:15) . We notice that the outer sample size, n = O (1 /(cid:15) ) , is dimensional free, while inTheorem 4.1, n = O ( d/(cid:15) ) , depends linearly in dimension; the inner sample size m is not affected. For high-dimensional problems, adding regularization is sometimesmore favorable as it lowers the sample complexity by d and also helps boosting theconvergence when solving the SAA. Acknowledgments.
Acknowledgments. We would like to acknowledge Alexander Shapiro and Lin Xiao for fruitful discussions and the reviewers for their helpful comments.
REFERENCES[1]
D. P. Bertsekas, Dynamic Programming and Optimal Control, Vol. I. [2]
D. Bertsimas, V. Gupta, and N. Kallus, Robust sample average approximation, Mathematical Programming, 171 (2017), pp. 217–282, https://doi.org/10.1007/s10107-017-1174-z. [3]
A. N. Bhagoji, D. Cullina, C. Sitawarin, and P. Mittal , Enhancing robustness of machinelearning systems via data transformations , in Information Sciences and Systems (CISS),2018 52nd Annual Conference on, IEEE, 2018, pp. 1–5, https://doi.org/10.1109/ciss.2018.8362326.[4]
J. Bolte, T. P. Nguyen, J. Peypouquet, and B. W. Suter , From error bounds to the com-plexity of first-order descent methods for convex functions , Mathematical Programming,165 (2017), pp. 471–507, https://doi.org/10.1007/s10107-016-1091-6.[5]
L. Bottou, F. Curtis, and J. Nocedal , Optimization methods for large-scale machine learn-ing , SIAM Review, 60 (2018), pp. 223–311, https://doi.org/10.1137/16m1080173.[6]
Z. Charles and D. Papailiopoulos , Stability and generalization of learning algorithms thatconverge to global optima , in Proceedings of the 35th International Conference on MachineLearning, vol. 80 of Proceedings of Machine Learning Research, PMLR, 10–15 Jul 2018,pp. 745–754, http://proceedings.mlr.press/v80/charles18a.html (accessed 2019-07-16).[7]
B. Dai, N. He, Y. Pan, B. Boots, and L. Song , Learning from conditional distributionsvia dual embeddings , in Proceedings of the 20th International Conference on ArtificialIntelligence and Statistics, vol. 54 of Proceedings of Machine Learning Research, PMLR,20–22 Apr 2017, pp. 1458–1467, http://proceedings.mlr.press/v54/dai17a.html (accessed2019-07-05).[8]
B. Dai, A. Shaw, L. Li, L. Xiao, N. He, Z. Liu, J. Chen, and L. Song , SBEED: Convergentreinforcement learning with nonlinear function approximation , in Proceedings of the 35thInternational Conference on Machine Learning, vol. 80 of Proceedings of Machine Learn-ing Research, PMLR, 10–15 Jul 2018, pp. 1125–1134, http://proceedings.mlr.press/v80/dai18c.html (accessed 2019-07-15).[9]
D. Dentcheva, S. Penev, and A. Ruszczy´nski , Statistical estimation of composite risk func-tionals and risk optimization problems , Annals of the Institute of Statistical Mathematics,69 (2016), pp. 737–760, https://doi.org/10.1007/s10463-016-0559-8.[10]
S. Diamond and S. Boyd , CVXPY: A Python-embedded modeling language for convex opti-mization , Journal of Machine Learning Research, 17 (2016), pp. 1–5, https://web.stanford.edu/ ∼ boyd/papers/pdf/cvxpy paper.pdf (accessed 2019-07-16).[11] D. Drusvyatskiy and A. S. Lewis , Error bounds, quadratic growth, and linear convergenceof proximal methods , Mathematics of Operations Research, 43 (2018), pp. 919–948, https://doi.org/10.1287/moor.2017.0889.[12]
Y. M. Ermoliev and V. I. Norkin , Sample average approximation method for compoundstochastic optimization problems , SIAM Journal on Optimization, 23 (2013), pp. 2231–2263, https://doi.org/10.1137/120863277.[13]
S. Ghadimi, A. Ruszczy´nski, and M. Wang , A single time-scale stochastic approximationmethod for nested stochastic optimization , Dec. 2018, https://arxiv.org/abs/1812.01094.[14]
P. Gong and J. Ye , Linear convergence of variance-reduced projected stochastic gradientwithout strong convexity , June 2014, https://arxiv.org/abs/1406.1102.[15]
L. J. Hong and S. Juneja , Estimating the mean of a non-linear function of conditionalexpectation , in Proceedings of the 2009 Winter Simulation Conference (WSC), IEEE, dec2009, https://doi.org/10.1109/wsc.2009.5429428.[16]
L. J. Hong, S. Juneja, and G. Liu , Kernel smoothing for nested estimation with applicationto portfolio risk measurement , Operations Research, 65 (2017), pp. 657–673, https://doi.org/10.1287/opre.2017.1591.[17]
Z. Huo, B. Gu, J. Liu, and H. Huang, Accelerated method for stochastic composition optimization with nonsmooth regularization. [18]
M. Jaskowski and S. Jaroszewicz , Uplift modeling for clinical trial data , in ICML Workshopon Clinical Data Analysis, 2012, http://people.cs.pitt.edu/ ∼ milos/icml clinicaldata 2012/Papers/Oral Jaroszewitz ICML Clinical 2012.pdf (accessed 2019-07-15).[19] H. Karimi, J. Nutini, and M. Schmidt , Linear convergence of gradient and proximal-gradientmethods under the polyak-(cid:32)lojasiewicz condition , in Joint European Conference on MachineLearning and Knowledge Discovery in Databases, Springer, 2016, pp. 795–811, https://doi.org/10.1007/978-3-319-46128-1 50.[20]
A. J. Kleywegt, A. Shapiro, and T. H. de Mello , The sample average approximationmethod for stochastic discrete optimization , SIAM Journal on Optimization, 12 (2002),pp. 479–502, https://doi.org/10.1137/s1052623499363220.[21]
E. Kubi´nska , Approximation of carath´eodory functions and multifunctions , Real Analysis Ex-change, 30 (2005), p. 351, https://doi.org/10.14321/realanalexch.30.1.0351.[22]
X. Lian, M. Wang, and J. Liu , Finite-sum Composition Optimization via Variance Reduced radient Descent , in Proceedings of the 20th International Conference on Artificial In-telligence and Statistics, vol. 54 of Proceedings of Machine Learning Research, PMLR,20–22 Apr 2017, pp. 1159–1167, http://proceedings.mlr.press/v54/lian17a.html (accessed2019-07-16).[23] H. Liu, X. Wang, T. Yao, R. Li, and Y. Ye , Sample average approximation with sparsity-inducing penalty for high-dimensional stochastic programming , Mathematical Program-ming, 178 (2018), pp. 69–108, https://doi.org/10.1007/s10107-018-1278-0.[24]
J. Liu and S. J. Wright , Asynchronous stochastic coordinate descent: Parallelism andconvergence properties , SIAM Journal on Optimization, 25 (2015), pp. 351–376, https://doi.org/10.1137/140961134.[25]
P. Massart and ´E. N´ed´elec , Risk bounds for statistical learning , The Annals of Statistics,34 (2006), pp. 2326–2366, https://doi.org/10.1214/009053606000000786.[26]
C. McDiarmid , On the method of bounded differences , in Surveys in Combinatorics, CambridgeUniversity Press, 1989, pp. 148–188, https://doi.org/10.1017/cbo9781107359949.008.[27]
K. Muandet, A. Mehrjou, S. K. Lee, and A. Raj , Dual iv: A single stage instrumentalvariable regression , arXiv preprint arXiv:1910.12358, (2019), https://arxiv.org/abs/1910.12358.[28]
P. Niyogi, F. Girosi, and T. Poggio , Incorporating prior information in machine learningby creating virtual examples , Proceedings of the IEEE, 86 (1998), pp. 2196–2209, https://doi.org/10.1109/5.726787.[29]
B. K. Pagnoncelli, S. Ahmed, and A. Shapiro , Sample average approximation methodfor chance constrained programming: Theory and applications , Journal of Opti-mization Theory and Applications, 142 (2009), pp. 399–416, https://doi.org/10.1007/s10957-009-9523-6.[30]
M. V. F. Pereira and L. M. V. G. Pinto , Multi-stage stochastic optimization applied toenergy planning , Mathematical Programming, 52 (1991), pp. 359–375, https://doi.org/10.1007/bf01582895.[31]
B. Polyak, Minimization of composite regression functions. [32]
R. T. Rockafellar and R. J.-B. Wets , Scenarios and policy aggregation in optimizationunder uncertainty , Mathematics of Operations Research, 16 (1991), pp. 119–147, https://doi.org/10.1287/moor.16.1.119.[33]
A. Ruszczy´nski , Decomposition methods in stochastic programming , Mathematical Program-ming, 79 (1997), pp. 333–353, https://doi.org/10.1007/bf02614323.[34]
S. Shalev-Shwartz and S. Ben-David , Understanding machine learning: From theory toalgorithms , Cambridge University Press, 2014, https://doi.org/10.1017/cbo9781107298019.[35]
S. Shalev-Shwartz, O. Shamir, N. Srebro, and K. Sridharan , Learnability, stability anduniform convergence , Journal of Machine Learning Research, 11 (2010), pp. 2635–2670,https://doi.org/10.1007/978-3-642-34106-9 3.[36]
A. Shapiro , On complexity of multistage stochastic programs , Operations Research Letters, 34(2006), pp. 1–8, https://doi.org/10.1016/j.orl.2005.02.003.[37]
A. Shapiro, D. Dentcheva, and A. Ruszczy´nski , Lectures on stochastic programming: mod-eling and theory , Society for Industrial and Applied Mathematics, 2014, https://doi.org/10.1137/1.9780898718751.[38]
A. Shapiro and A. Nemirovski , On complexity of stochastic programming problems , inContinuous Optimization, Springer-Verlag, 2005, pp. 111–146, https://doi.org/10.1007/0-387-26771-9 4.[39]
S. Shen, L. Xu, J. Liu, J. Guo, and Q. Ling , Asynchronous stochastic composition optimiza-tion with variance reduction , Nov. 2018, https://arxiv.org/abs/1811.06396.[40]
R. S. Sutton, H. R. Maei, and C. Szepesv´ari , A convergent o ( n ) temporal-difference algorithm for off-policy learning with linear function approxi-mation , in Advances in Neural Information Processing Systems 21, Cur-ran Associates, Inc., 2009, pp. 1609–1616, http://papers.nips.cc/paper/3626-a-convergent-on-temporal-difference-algorithm-for-off-policy-learning-with-linear-function-approximation.pdf.[41] M. Wang, E. X. Fang, and H. Liu , Stochastic compositional gradient descent: algorithmsfor minimizing compositions of expected-value functions , Mathematical Programming, 161(2017), pp. 419–449, https://doi.org/10.1007/s10107-016-1017-3.[42]
M. Wang, J. Liu, and E. X. Fang, Accelerating stochastic composition optimization, Journal of Machine Learning Research, 18 (2017), pp. 1–23, http://jmlr.org/papers/v18/16-504.html. [43]
Y. Xu, Q. Lin, and T. Yang , Accelerated stochastic subgradient methods under local errorbound condition , July 2016, https://arxiv.org/abs/1607.01027.[44]
I. Yamane, F. Yger, J. Atif, and M. Sugiyama , Uplift modeling from separate labels , inAdvances in Neural Information Processing Systems 31, Curran Associates, Inc., 2018,pp. 9927–9937, http://papers.nips.cc/paper/8198-uplift-modeling-from-separate-labels.pdf.[45]