Heteroscedasticity-aware residuals-based contextual stochastic optimization
Rohit Kannan, Güzin Bayraksan, and James R. Luedtke

Wisconsin Institute for Discovery, University of Wisconsin-Madison, Madison, WI, USA. E-mail: [email protected]
Department of Integrated Systems Engineering, The Ohio State University, Columbus, OH, USA. E-mail: [email protected]
Department of Industrial & Systems Engineering and Wisconsin Institute for Discovery, University of Wisconsin-Madison, Madison, WI, USA. E-mail: [email protected]
January 8, 2021
Abstract
We explore generalizations of some integrated learning and optimization frameworks for data-driven contextual stochastic optimization that can adapt to heteroscedasticity. We identify conditions on the stochastic program, data generation process, and the prediction setup under which these generalizations possess asymptotic and finite sample guarantees for a class of stochastic programs, including two-stage stochastic mixed-integer programs with continuous recourse. We verify that our assumptions hold for popular parametric and nonparametric regression methods.
Key words:
Data-driven stochastic programming, distributionally robust optimization, covariates, regression, heteroscedasticity, convergence rate, large deviations
We study data-driven stochastic programming in the presence of covariate/contextual information and examine heteroscedastic cases. Specifically, we consider the setting where we have a finite number of observations of the uncertain parameters $Y$ within an optimization model along with simultaneous observations of random covariates $X$. Given a new random observation $X = x$, our goal is to solve the conditional stochastic program
\[ \min_{z \in \mathcal{Z}} \ \mathbb{E}\big[ c(z, Y) \mid X = x \big]. \tag{SP} \]
Here, $z$ denotes the decision vector, $\mathcal{Z} \subseteq \mathbb{R}^{d_z}$ is the feasible region, and $c : \mathbb{R}^{d_z} \times \mathbb{R}^{d_y} \to \mathbb{R}$ is an extended real-valued function. An example application of this framework is production planning under demand uncertainty [5], where products' demands ($Y$) can be predicted using covariates ($X$) such as historical demands, location, and web chatter before making decisions ($z$) on production and inventory levels. Another application is grid scheduling under wind uncertainty [10], where covariates ($X$) such as weather observations, seasonality, and location can be used to predict available wind power ($Y$) before creating generator schedules ($z$). Heteroscedasticity arises, for instance, when the variability of product demands or wind power availability depends significantly on the location, seasonality, or other covariates.

Kannan et al. [17, 18] consider data-driven approaches that integrate a machine learning prediction model within a sample average approximation (SAA) or distributionally robust optimization (DRO) setup to approximate the solution to the conditional stochastic program (SP); see also [1, 26]. They first fit a statistical/machine learning model to predict $Y$ given $X$ and use this model and its residuals to construct scenarios for $Y$ given $X = x$. Then, they use these scenarios within an SAA or DRO framework to approximate the solution to (SP). We refer the readers to [e.g., 1, 5, 17, 18, 26] for a review of other data-driven approximations to (SP).

The data-driven formulations in Kannan et al. [17, 18] assume that the dependence of the random vector $Y$ on the random covariates $X$ can be modeled as $Y = f^*(X) + \varepsilon$, where $f^*(x) := \mathbb{E}[Y \mid X = x]$ is the regression function and $\varepsilon$ are zero-mean errors. These approaches crucially require the errors $\varepsilon$ to be independent of the covariates $X$. Motivated by applications where such an assumption may fail to hold, we explore generalizations of these approaches that do not require this independence assumption.

Notation.
Let $[n] := \{1, \dots, n\}$, let $\|\cdot\|$ denote the Euclidean or operator $\ell_2$-norm, let $\mathrm{proj}_S(v)$ denote the orthogonal projection of $v$ onto a nonempty closed convex set $S$, let $I$ denote an identity matrix of appropriate dimension, let $v^{\mathsf{T}}$ denote the transpose of a vector $v$, and let $A \succ 0$ signify that the matrix $A$ is positive definite. Let $\delta_v$ denote the Dirac measure centered at the point $v$. For scalars $c_1, \dots, c_l$, we write $\mathrm{diag}(c_1, \dots, c_l)$ to denote the $l \times l$ diagonal matrix with $i$th diagonal entry equal to $c_i$. For sets $A, B \subseteq \mathbb{R}^{d_z}$, let $\mathbb{D}(A, B) := \sup_{v \in A} \mathrm{dist}(v, B)$ denote the deviation of $A$ from $B$, where $\mathrm{dist}(v, B) := \inf_{w \in B} \|v - w\|$. The abbreviations 'a.e.', 'a.s.', 'LLN', 'i.i.d.', and 'r.h.s.' are shorthand for 'almost everywhere', 'almost surely', 'law of large numbers', 'independent and identically distributed', and 'right-hand side'. For a random vector $V$ with probability measure $P_V$, we write a.e. $v \in \mathcal{V}$ to denote $P_V$-a.e. $v \in \mathcal{V}$. The symbols $\xrightarrow{p}$ and $\xrightarrow{a.s.}$ denote convergence in probability and almost surely with respect to the probability measure generating the joint data on $(Y, X)$. For random sequences $\{V_n\}$ and $\{W_n\}$, we write $V_n = o_p(W_n)$ and $V_n = O_p(W_n)$ to convey that $V_n = R_n W_n$ with $\{R_n\}$ converging in probability to zero, or being bounded in probability, respectively. We write $O(1)$ for generic constants.

To handle heteroscedasticity, we assume that the random vector $Y$ is related to the random covariates $X$ as $Y = f^*(X) + Q^*(X)\varepsilon$, where $f^*$ denotes the regression function, $Q^*(X)$ is the square root of the conditional covariance matrix of the error term, and the zero-mean random errors $\varepsilon$ are independent of the covariates $X$. This type of model is common in statistics; see, e.g., [2, 6, 8, 30]. (We focus our attention on this popular model of heteroscedasticity even though our framework applies more generally, e.g., to relationships of the form $Y = m^*(X, \varepsilon)$ with the mapping $m^*(x, \cdot)$ being invertible for a.e. $x \in \mathcal{X}$ and satisfying some regularity conditions.) The functions $f^*$ and $Q^*$ are assumed to belong to known classes of functions $\mathcal{F}$ and $\mathcal{Q}$, respectively (which may be infinite-dimensional and depend on the sample size $n$). Let $\mathcal{Y} \subseteq \mathbb{R}^{d_y}$, $\mathcal{X} \subseteq \mathbb{R}^{d_x}$, and $\Xi \subseteq \mathbb{R}^{d_y}$ denote the supports of $Y$, $X$, and $\varepsilon$, respectively. Additionally, let $P_{Y|X=x}$ denote the conditional distribution of $Y$ given $X = x$, and let $P_X$ and $P_\varepsilon$ denote the distributions of $X$ and $\varepsilon$, respectively. We assume that $\mathcal{Y}$ is nonempty and convex and that $Q^*(x) \succ 0$ for a.e. $x \in \mathcal{X}$.

Under the above assumptions, the conditional stochastic program (SP) is equivalent to
\[ v^*(x) := \min_{z \in \mathcal{Z}} \Big\{ g(z; x) := \mathbb{E}\big[ c(z, f^*(x) + Q^*(x)\varepsilon) \big] \Big\}, \tag{1} \]
where the expectation above is computed with respect to the distribution $P_\varepsilon$ of $\varepsilon$. We assume that the feasible set $\mathcal{Z} \subset \mathbb{R}^{d_z}$ is nonempty and compact, $\mathbb{E}[|c(z, f^*(x) + Q^*(x)\varepsilon)|] < +\infty$ for each $z \in \mathcal{Z}$ and a.e. $x \in \mathcal{X}$, and the function $g(\cdot\,; x)$ is lower semicontinuous on $\mathcal{Z}$ for a.e. $x \in \mathcal{X}$. These assumptions ensure that problem (1) is well-defined and its set of optimal solutions $S^*(x)$ is nonempty for a.e. $x \in \mathcal{X}$.

Let $\mathcal{D}_n := \{(y^i, x^i)\}_{i=1}^n$ denote the joint observations of $(Y, X)$ and $\{\varepsilon^i\}_{i=1}^n$ denote the corresponding realizations of the errors $\varepsilon$. Note that these realizations of $\varepsilon$ satisfy
\[ \varepsilon^i = \big[Q^*(x^i)\big]^{-1}\big(y^i - f^*(x^i)\big), \quad \forall i \in [n]. \]
If we know the functions $f^*$ and $Q^*$, then we can construct the following full-information SAA (FI-SAA) to problem (1) using the data $\mathcal{D}_n$:
\[ \min_{z \in \mathcal{Z}} \bigg\{ g^*_n(z; x) := \frac{1}{n} \sum_{i=1}^n c\big(z, f^*(x) + Q^*(x)\varepsilon^i\big) \bigg\}. \tag{2} \]
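The following minimal sketch illustrates this data-generation model, the error-recovery identity above, and the FI-SAA objective (2); the specific $f^*$, $Q^*$, and the placeholder cost $c(z, y) = \|z - y\|_1$ are illustrative assumptions of ours, not choices made in [17, 18].

```python
import numpy as np

rng = np.random.default_rng(0)
d_x, d_y, n = 3, 2, 500

# Illustrative (hypothetical) ground truth: a linear f* and a diagonal,
# covariate-dependent Q*; both are unknown to the decision maker in practice.
A = rng.normal(size=(d_y, d_x))
f_star = lambda x: A @ x
Q_star = lambda x: np.diag(1.0 + 0.5 * np.abs(x[:d_y]))  # Q*(x) positive definite

# Sample D_n = {(y^i, x^i)} from Y = f*(X) + Q*(X) eps, with eps independent of X.
xs = rng.uniform(-1.0, 1.0, size=(n, d_x))
eps = rng.normal(size=(n, d_y))
ys = np.array([f_star(x) + Q_star(x) @ e for x, e in zip(xs, eps)])

# Full information recovers the error realizations exactly:
# eps^i = [Q*(x^i)]^{-1} (y^i - f*(x^i)).
eps_recovered = np.array([np.linalg.solve(Q_star(x), y - f_star(x))
                          for x, y in zip(xs, ys)])
assert np.allclose(eps_recovered, eps)

# FI-SAA objective (2) at a new covariate x for a fixed decision z, with a
# placeholder cost c(z, y) = ||z - y||_1 purely for illustration.
x_new, z = rng.uniform(-1.0, 1.0, size=d_x), np.zeros(d_y)
scenarios = f_star(x_new) + eps @ Q_star(x_new).T  # rows: f*(x) + Q*(x) eps^i
print(np.abs(scenarios - z).sum(axis=1).mean())    # g*_n(z; x)
```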
Since the functions $f^*$ and $Q^*$ are unknown, we first estimate them by $\hat{f}_n$ and $\hat{Q}_n$, respectively, using a regression method on the data $\mathcal{D}_n$ (see Section 4 for details). Assuming that the estimate $\hat{Q}_n$ is a.s. positive definite on $\mathcal{X}$ (i.e., it a.s. satisfies $\hat{Q}_n(x) \succ 0$ for a.e. $x \in \mathcal{X}$), we then use the empirical estimates
\[ \hat{\varepsilon}^i_n := \big[\hat{Q}_n(x^i)\big]^{-1}\big(y^i - \hat{f}_n(x^i)\big), \quad \forall i \in [n], \]
of $\{\varepsilon^i\}_{i=1}^n$ to construct the following empirical residuals-based SAA (ER-SAA) to problem (1) in the heteroscedastic setting (cf. [17, 18]; we can also construct similar generalizations of the jackknife-based SAAs in [17]):
\[ \hat{v}^{ER}_n(x) := \min_{z \in \mathcal{Z}} \bigg\{ \hat{g}^{ER}_n(z; x) := \frac{1}{n} \sum_{i=1}^n c\big(z, \mathrm{proj}_{\mathcal{Y}}(\hat{f}_n(x) + \hat{Q}_n(x)\hat{\varepsilon}^i_n)\big) \bigg\}. \tag{3} \]
Let $\hat{z}^{ER}_n(x)$ denote an optimal solution to problem (3) and $\hat{S}^{ER}_n(x)$ denote its optimal solution set. Additionally, let $P^*_n(x)$ and $\hat{P}^{ER}_n(x)$ denote the estimates of the conditional distribution $P_{Y|X=x}$ of $Y$ given $X = x$ corresponding to the FI-SAA problem (2) and the ER-SAA problem (3), respectively, i.e.,
\[ P^*_n(x) := \frac{1}{n} \sum_{i=1}^n \delta_{f^*(x) + Q^*(x)\varepsilon^i} \quad \text{and} \quad \hat{P}^{ER}_n(x) := \frac{1}{n} \sum_{i=1}^n \delta_{\mathrm{proj}_{\mathcal{Y}}(\hat{f}_n(x) + \hat{Q}_n(x)\hat{\varepsilon}^i_n)}. \]
When we only have a limited number of observations $n$, the following residuals-based DRO formulation provides an alternative to the ER-SAA problem (3) that can yield solutions with better out-of-sample performance (cf. [18]):
\[ \min_{z \in \mathcal{Z}} \sup_{Q \in \hat{\mathcal{P}}_n(x)} \mathbb{E}_{Y \sim Q}[c(z, Y)], \tag{4} \]
where $\hat{\mathcal{P}}_n(x)$ is an ambiguity set for $P_{Y|X=x}$. Following [18], we call problem (4) with $\hat{\mathcal{P}}_n(x)$ centered at $\hat{P}^{ER}_n(x)$ the empirical residuals-based DRO (ER-DRO) problem.
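To make the pipeline behind problem (3) concrete, here is a minimal sketch for a scalar newsvendor instance. The OLS model for $\hat{f}_n$, the log-linear variance model for $\hat{Q}_n$ (one of the parametric forms discussed in Section 4.2), and the cost parameters are all illustrative assumptions, not prescriptions from [17, 18].

```python
import numpy as np

rng = np.random.default_rng(1)
n, d_x = 400, 2

# Synthetic heteroscedastic demand data (all functional forms hypothetical):
# scalar Y = f*(X) + q*(X) * eps with the error scale q* depending on X.
xs = rng.uniform(0.5, 2.0, size=(n, d_x))
f_true = lambda X: 10.0 + X @ np.array([3.0, -2.0])
q_true = lambda X: 0.5 * np.exp(0.4 * X[:, 0])
ys = f_true(xs) + q_true(xs) * rng.normal(size=n)

# Step 1: estimate f* by OLS (an illustrative choice of regression method).
Phi = np.column_stack([np.ones(n), xs])
beta = np.linalg.lstsq(Phi, ys, rcond=None)[0]
f_hat = lambda X: np.column_stack([np.ones(len(X)), X]) @ beta

# Step 2: estimate q* via the log-linear model q(x)^2 = exp(theta' (1, x)),
# fit by regressing log squared residuals on the covariates; the rescaling
# below corrects the additive bias E[log eps^2] of the log transform.
res = ys - f_hat(xs)
theta = np.linalg.lstsq(Phi, np.log(res**2 + 1e-12), rcond=None)[0]
q_raw = lambda X: np.exp(np.column_stack([np.ones(len(X)), X]) @ theta / 2.0)
scale = np.sqrt(np.mean((res / q_raw(xs)) ** 2))
q_hat = lambda X: scale * q_raw(X)

# Step 3: empirical residuals eps_hat^i = [q_hat(x^i)]^{-1} (y^i - f_hat(x^i)).
eps_hat = res / q_hat(xs)

# Step 4: ER-SAA scenarios at a new covariate x; project onto Y = [0, inf).
x_new = np.array([[1.5, 1.0]])
scen = np.maximum(f_hat(x_new) + q_hat(x_new) * eps_hat, 0.0)

# Step 5: ER-SAA problem (3) for the newsvendor cost
# c(z, y) = cb*(y - z)_+ + ch*(z - y)_+, whose SAA minimizer is the
# cb/(cb + ch) sample quantile of the scenarios.
cb, ch = 4.0, 1.0
print(np.quantile(scen, cb / (cb + ch)))  # z_hat^ER_n(x)
```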
For the homoscedastic case, i.e., when $Q^* \equiv \hat{Q}_n \equiv I$ and so the model class $\mathcal{Q}$ comprises only the constant function $Q : x \mapsto I$, $\forall x \in \mathcal{X}$, Kannan et al. [17, 18] investigate conditions under which the optimal values of problems (3) and (4) asymptotically converge in probability to those of the true problem (1). They also identify conditions under which every accumulation point of a sequence of optimal solutions to problems (3) and (4) is in probability an optimal solution to problem (1) and outline conditions under which solutions to problems (3) and (4) possess finite sample guarantees. An integral part of this analysis is bounding a distance between the empirical distributions $\hat{P}^{ER}_n(x)$ and $P^*_n(x)$.

By the Lipschitz continuity of orthogonal projections, we have for each $x \in \mathcal{X}$
\[ \big\| \mathrm{proj}_{\mathcal{Y}}(\hat{f}_n(x) + \hat{Q}_n(x)\hat{\varepsilon}^i_n) - (f^*(x) + Q^*(x)\varepsilon^i) \big\| \le \|\tilde{\varepsilon}^i_n(x)\|, \quad \forall i \in [n], \]
where the $i$th deviation term $\tilde{\varepsilon}^i_n(x)$ is given by
\[ \tilde{\varepsilon}^i_n(x) := (\hat{f}_n(x) + \hat{Q}_n(x)\hat{\varepsilon}^i_n) - (f^*(x) + Q^*(x)\varepsilon^i). \]
The analysis in [17, 18] implies that under certain assumptions on the stochastic program (1), asymptotic and finite sample guarantees on the power mean deviation term $\big(\frac{1}{n}\sum_{i=1}^n \|\tilde{\varepsilon}^i_n(x)\|^p\big)^{1/p}$ for a suitable value of $p \ge 1$ translate to theoretical guarantees on solutions to problems (3) and (4). In particular, guarantees on this term for $p = 1$ and $p = 2$ translate to theoretical guarantees on solutions to problems (3) and (4) for a class of two-stage stochastic mixed-integer programs (MIPs) with continuous recourse and, in the ER-DRO setting, to broad families of ambiguity sets.

We now provide concrete examples of how guarantees on the mean deviation term $\frac{1}{n}\sum_{i=1}^n \|\tilde{\varepsilon}^i_n(x)\|$ (i.e., when $p = 1$) translate to guarantees on the ER-SAA problem (3). In addition to focusing on the ER-SAA problem (3) for brevity, we narrow our attention to stochastic programs (1) whose objective function satisfies the following Lipschitz condition.

Assumption 1. For each $z \in \mathcal{Z}$, the function $c(z, \cdot)$ is Lipschitz continuous on $\mathcal{Y}$ with Lipschitz constant $L(z)$ satisfying $\sup_{z \in \mathcal{Z}} L(z) < +\infty$.

As an example, Appendix EC.2 of [17] verifies that Assumption 1 holds for two-stage stochastic MIPs with continuous recourse under mild conditions. For extensions of the results below to a broader class of stochastic programs (1) and to the ER-DRO problem (4), we refer the readers to [17, 18].
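As a simple illustration of Assumption 1 (our example, not one drawn from [17]), consider the scalar newsvendor cost $c(z, y) = c_b (y - z)_+ + c_h (z - y)_+$ with backorder and holding costs $c_b, c_h > 0$. Since $u \mapsto (u)_+$ is $1$-Lipschitz,
\[ |c(z, y_1) - c(z, y_2)| \le c_b \big| (y_1 - z)_+ - (y_2 - z)_+ \big| + c_h \big| (z - y_1)_+ - (z - y_2)_+ \big| \le (c_b + c_h)\,|y_1 - y_2|, \]
so Assumption 1 holds with the constant function $L(z) \equiv c_b + c_h$.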
We now list conditions on the FI-SAA problem (2) under which consistency and asymptotic optimality, rates of convergence, and finite sample guarantees (to be defined precisely in the respective theorems below) can be achieved for the ER-SAA approximation (3) of the true problem (1) in the heteroscedastic setting. As mentioned, a key component of this analysis requires respective conditions to be satisfied by the mean deviation term; these are investigated in Section 3. Section 4 presents examples of regression/learning setups that satisfy the assumptions set forth for the heteroscedastic setting.

We begin with a uniform weak LLN assumption on the FI-SAA objective (see Assumption 3 of [17] and the surrounding discussion for conditions under which it holds). Along with suitable convergence of the mean deviation term, this assumption helps us establish uniform convergence in probability of the sequence of objective functions of the ER-SAA problem (3) to the objective function of the true problem (1) on the feasible set $\mathcal{Z}$ (see Proposition 1 of [17]). This in turn provides the building block for consistency and asymptotic optimality.

Assumption 2. For a.e. $x \in \mathcal{X}$, the sequence of sample average objective functions $\{g^*_n(\cdot\,; x)\}$ of the FI-SAA problem (2) converges in probability to the objective function $g(\cdot\,; x)$ of the true problem (1) uniformly on the set $\mathcal{Z}$.

Our first result implies that consistency of the mean deviation term $\frac{1}{n}\sum_{i=1}^n \|\tilde{\varepsilon}^i_n(x)\|$ translates to consistency and asymptotic optimality of solutions to the ER-SAA problem (3).

Theorem 1. [Consistency and asymptotic optimality] Suppose Assumptions 1 and 2 hold and the mean deviation term converges to zero in probability, i.e., $\frac{1}{n}\sum_{i=1}^n \|\tilde{\varepsilon}^i_n(x)\| \xrightarrow{p} 0$ for a.e. $x \in \mathcal{X}$. Then, for a.e. $x \in \mathcal{X}$,
\[ \hat{v}^{ER}_n(x) \xrightarrow{p} v^*(x), \quad \mathbb{D}\big(\hat{S}^{ER}_n(x), S^*(x)\big) \xrightarrow{p} 0, \quad \text{and} \quad \sup_{z \in \hat{S}^{ER}_n(x)} g(z; x) \xrightarrow{p} v^*(x). \]

Proof.
See the proofs of Proposition 1 and Theorem 1 of Kannan et al. [17].

Next, we refine Assumption 2 to assume that the sequence of objective functions of the FI-SAA problem (2) converges to the objective function of the true problem (1) at a suitable rate (see Assumption 5 of [17] and the surrounding discussion for conditions under which it holds).
Assumption 3.
The function $c$ in problem (1) and the data $\mathcal{D}_n$ satisfy the following functional central limit theorem for the FI-SAA objective: $\sqrt{n}\,\big(g^*_n(\cdot\,; x) - g(\cdot\,; x)\big) \xrightarrow{d} V(\cdot\,; x)$ for a.e. $x \in \mathcal{X}$, where $g^*_n(\cdot\,; x)$, $g(\cdot\,; x)$, and $V(\cdot\,; x)$ are (random) elements of $L^\infty(\mathcal{Z})$, the Banach space of essentially bounded functions on $\mathcal{Z}$ equipped with the supremum norm.

Our second result implies that rates of convergence of the mean deviation term $\frac{1}{n}\sum_{i=1}^n \|\tilde{\varepsilon}^i_n(x)\|$ to zero directly translate to rates of convergence of the suboptimality of ER-SAA solutions to zero.

Theorem 2. [Rate of convergence] Suppose Assumptions 1 and 3 hold and there exists a constant $r \in (0, 1]$ such that $\frac{1}{n}\sum_{i=1}^n \|\tilde{\varepsilon}^i_n(x)\| = O_p(n^{-r/2})$ for a.e. $x \in \mathcal{X}$. Then, for a.e. $x \in \mathcal{X}$,
\[ \big|\hat{v}^{ER}_n(x) - v^*(x)\big| = O_p(n^{-r/2}) \quad \text{and} \quad \big|g(\hat{z}^{ER}_n(x); x) - v^*(x)\big| = O_p(n^{-r/2}). \]

Proof.
Follows from the proof of Theorem 11 of Kannan et al. [18] (cf. Theorem 2 of [17]).

Finally, we refine Assumption 3 to assume that the sequence of objectives of the FI-SAA problem (2) possesses a finite sample guarantee (see [17, Assumption 7] and the discussion after it for conditions under which it holds).

Assumption 4.
The FI-SAA problem (2) possesses the following uniform exponential bound property: for any constant $\kappa > 0$ and a.e. $x \in \mathcal{X}$, there exist positive constants $K(\kappa, x)$ and $\beta(\kappa, x)$ such that
\[ \mathbb{P}\Big\{ \sup_{z \in \mathcal{Z}} |g^*_n(z; x) - g(z; x)| > \kappa \Big\} \le K(\kappa, x) \exp(-n\beta(\kappa, x)), \quad \forall n \in \mathbb{N}. \]

Our final result of this section implies that finite sample guarantees on the mean deviation term $\frac{1}{n}\sum_{i=1}^n \|\tilde{\varepsilon}^i_n(x)\|$ translate to finite sample guarantees on solutions to the ER-SAA problem (3).

Theorem 3. [Finite sample guarantee] Suppose Assumptions 1 and 4 hold and for any constant $\kappa > 0$ and a.e. $x \in \mathcal{X}$, there exist positive constants $\tilde{K}(\kappa, x)$ and $\tilde{\beta}(\kappa, x)$ such that
\[ \mathbb{P}\bigg\{ \frac{1}{n}\sum_{i=1}^n \|\tilde{\varepsilon}^i_n(x)\| > \kappa \bigg\} \le \tilde{K}(\kappa, x) \exp\big(-n\tilde{\beta}(\kappa, x)\big), \quad \forall n \in \mathbb{N}. \]
Then, for a.e. $x \in \mathcal{X}$ and any given constant $\eta > 0$, there exist positive constants $Q(\eta, x)$ and $\gamma(\eta, x)$ (depending on $K$, $\tilde{K}$, $\beta$, and $\tilde{\beta}$) such that
\[ \mathbb{P}\big\{ \mathrm{dist}(\hat{z}^{ER}_n(x), S^*(x)) \ge \eta \big\} \le Q(\eta, x) \exp(-n\gamma(\eta, x)), \quad \forall n \in \mathbb{N}. \]

Proof.
See Theorem 3 of Kannan et al. [17].

In the remainder of this note, we identify conditions under which the asymptotic and finite sample guarantees required by Theorems 1, 2, and 3 hold for the mean deviation term $\frac{1}{n}\sum_{i=1}^n \|\tilde{\varepsilon}^i_n(x)\|$. A similar analysis can be carried out for the root-mean-square deviation term $\big(\frac{1}{n}\sum_{i=1}^n \|\tilde{\varepsilon}^i_n(x)\|^2\big)^{1/2}$, which is required by [18] for the analysis of phi-divergence-based ER-DRO problems (4) for stochastic programs satisfying Assumption 1. We omit these details for brevity.

In this section, we investigate conditions under which the mean deviation term $\frac{1}{n}\sum_{i=1}^n \|\tilde{\varepsilon}^i_n(x)\|$ converges to zero in probability at a certain rate and possesses finite sample guarantees. We begin by bounding the mean deviation in terms of the functions $f^*$ and $Q^*$, their regression estimates $\hat{f}_n$ and $\hat{Q}_n$, and the data $\mathcal{D}_n$. Throughout, we implicitly assume that the estimate $\hat{Q}_n$ a.s. satisfies $\hat{Q}_n(x) \succ 0$ for a.e. $x \in \mathcal{X}$, which can be guaranteed by an appropriate choice of the model class $\mathcal{Q}$. We begin by noting that
\begin{align}
\frac{1}{n}\sum_{i=1}^n \|\tilde{\varepsilon}^i_n(x)\| &= \frac{1}{n}\sum_{i=1}^n \big\| (\hat{f}_n(x) + \hat{Q}_n(x)\hat{\varepsilon}^i_n) - (f^*(x) + Q^*(x)\varepsilon^i) \big\| \notag \\
&\le \|\hat{f}_n(x) - f^*(x)\| + \frac{1}{n}\sum_{i=1}^n \|\hat{Q}_n(x)\hat{\varepsilon}^i_n - Q^*(x)\varepsilon^i\|. \tag{5}
\end{align}
We now bound the second term on the r.h.s. of inequality (5). We have
\begin{align}
\frac{1}{n}\sum_{i=1}^n \|\hat{Q}_n(x)\hat{\varepsilon}^i_n - Q^*(x)\varepsilon^i\|
&= \frac{1}{n}\sum_{i=1}^n \Big\| \hat{Q}_n(x)\big[\hat{Q}_n(x^i)\big]^{-1}\big(y^i - \hat{f}_n(x^i)\big) - Q^*(x)\big[Q^*(x^i)\big]^{-1}\big(y^i - f^*(x^i)\big) \Big\| \notag \\
&\le \frac{1}{n}\sum_{i=1}^n \Big\| \Big(\hat{Q}_n(x)\big[\hat{Q}_n(x^i)\big]^{-1} - Q^*(x)\big[Q^*(x^i)\big]^{-1}\Big) Q^*(x^i)\varepsilon^i \Big\| \notag \\
&\quad + \frac{1}{n}\sum_{i=1}^n \Big\| \hat{Q}_n(x)\big[\hat{Q}_n(x^i)\big]^{-1}\big(f^*(x^i) - \hat{f}_n(x^i)\big) \Big\|, \tag{6}
\end{align}
where the final step follows by adding and subtracting $f^*(x^i)$ within $y^i - \hat{f}_n(x^i)$, applying the triangle inequality, and using the definition of $\{\varepsilon^i\}_{i=1}^n$. We have for each $i \in [n]$
\[ \hat{Q}_n(x)\big[\hat{Q}_n(x^i)\big]^{-1} - Q^*(x)\big[Q^*(x^i)\big]^{-1} = \hat{Q}_n(x)\Big(\big[\hat{Q}_n(x^i)\big]^{-1} - \big[Q^*(x^i)\big]^{-1}\Big) + \big[\hat{Q}_n(x) - Q^*(x)\big]\big[Q^*(x^i)\big]^{-1}. \]
Plugging the above equality into inequality (6), we get
\begin{align}
\frac{1}{n}\sum_{i=1}^n \|\hat{Q}_n(x)\hat{\varepsilon}^i_n - Q^*(x)\varepsilon^i\|
&\le \frac{1}{n}\sum_{i=1}^n \Big( \|\hat{Q}_n(x)\| \big\|\big[\hat{Q}_n(x^i)\big]^{-1} - \big[Q^*(x^i)\big]^{-1}\big\| \|Q^*(x^i)\| + \|\hat{Q}_n(x) - Q^*(x)\| \Big) \|\varepsilon^i\| \notag \\
&\quad + \frac{1}{n}\sum_{i=1}^n \|\hat{Q}_n(x)\| \big\|\big[\hat{Q}_n(x^i)\big]^{-1}\big\| \|f^*(x^i) - \hat{f}_n(x^i)\| \notag \\
&\le \|\hat{Q}_n(x)\| \bigg( \frac{1}{n}\sum_{i=1}^n \big\|\big[\hat{Q}_n(x^i)\big]^{-1} - \big[Q^*(x^i)\big]^{-1}\big\|^4 \bigg)^{1/4} \bigg( \frac{1}{n}\sum_{i=1}^n \|Q^*(x^i)\|^4 \bigg)^{1/4} \bigg( \frac{1}{n}\sum_{i=1}^n \|\varepsilon^i\|^2 \bigg)^{1/2} \notag \\
&\quad + \|\hat{Q}_n(x) - Q^*(x)\| \bigg( \frac{1}{n}\sum_{i=1}^n \|\varepsilon^i\| \bigg) + \|\hat{Q}_n(x)\| \bigg( \frac{1}{n}\sum_{i=1}^n \big\|\big[\hat{Q}_n(x^i)\big]^{-1}\big\|^2 \bigg)^{1/2} \bigg( \frac{1}{n}\sum_{i=1}^n \|f^*(x^i) - \hat{f}_n(x^i)\|^2 \bigg)^{1/2}, \tag{7}
\end{align}
where the last step above follows by repeated application of the Cauchy-Schwarz inequality. Finally, using inequality (7) in inequality (5), we get
\begin{align}
\frac{1}{n}\sum_{i=1}^n \|\tilde{\varepsilon}^i_n(x)\|
&\le \|\hat{f}_n(x) - f^*(x)\| + \|\hat{Q}_n(x) - Q^*(x)\| \bigg( \frac{1}{n}\sum_{i=1}^n \|\varepsilon^i\| \bigg) \notag \\
&\quad + \|\hat{Q}_n(x)\| \bigg( \frac{1}{n}\sum_{i=1}^n \big\|\big[\hat{Q}_n(x^i)\big]^{-1} - \big[Q^*(x^i)\big]^{-1}\big\|^4 \bigg)^{1/4} \bigg( \frac{1}{n}\sum_{i=1}^n \|Q^*(x^i)\|^4 \bigg)^{1/4} \bigg( \frac{1}{n}\sum_{i=1}^n \|\varepsilon^i\|^2 \bigg)^{1/2} \notag \\
&\quad + \|\hat{Q}_n(x)\| \bigg( \frac{1}{n}\sum_{i=1}^n \big\|\big[\hat{Q}_n(x^i)\big]^{-1}\big\|^2 \bigg)^{1/2} \bigg( \frac{1}{n}\sum_{i=1}^n \|f^*(x^i) - \hat{f}_n(x^i)\|^2 \bigg)^{1/2}. \tag{8}
\end{align}
In the remainder of this section, we rely on inequality (8) to identify conditions under which the mean deviation term $\frac{1}{n}\sum_{i=1}^n \|\tilde{\varepsilon}^i_n(x)\|$ possesses asymptotic and finite sample guarantees. We postpone the verification of these assumptions to Section 4.
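As a quick numerical sanity check of inequality (8), the sketch below evaluates both sides on synthetic data. The "estimates" are stand-ins obtained by perturbing the true $f^*$ and $Q^*$, and all functional forms are illustrative assumptions; the projection is omitted (i.e., $\mathcal{Y} = \mathbb{R}^{d_y}$).

```python
import numpy as np

rng = np.random.default_rng(2)
n, d_x, d_y = 300, 2, 2
op = lambda A: np.linalg.norm(A, 2)  # operator norm

# Illustrative truth and deliberately perturbed stand-ins for the estimates.
f_st = lambda x: np.array([1.0 + x[0], x[0] - x[1]])
Q_st = lambda x: np.diag([1.0 + 0.5 * x[0] ** 2, 0.8 + 0.3 * x[1] ** 2])
f_ht = lambda x: f_st(x) + 0.05 * np.array([1.0, -1.0])
Q_ht = lambda x: Q_st(x) + 0.05 * np.eye(d_y)

xs = rng.uniform(-1.0, 1.0, size=(n, d_x))
eps = rng.normal(size=(n, d_y))
ys = np.array([f_st(x) + Q_st(x) @ e for x, e in zip(xs, eps)])
eps_ht = np.array([np.linalg.solve(Q_ht(x), y - f_ht(x))
                   for x, y in zip(xs, ys)])

# Left-hand side of (8) at a fixed covariate value x0.
x0 = np.array([0.3, -0.4])
lhs = np.mean([np.linalg.norm(f_ht(x0) + Q_ht(x0) @ eh
                              - (f_st(x0) + Q_st(x0) @ e))
               for eh, e in zip(eps_ht, eps)])

# Right-hand side of (8), term by term.
dQinv = np.array([op(np.linalg.inv(Q_ht(x)) - np.linalg.inv(Q_st(x)))
                  for x in xs])
rhs = (np.linalg.norm(f_ht(x0) - f_st(x0))
       + op(Q_ht(x0) - Q_st(x0)) * np.mean([np.linalg.norm(e) for e in eps])
       + op(Q_ht(x0)) * np.mean(dQinv ** 4) ** 0.25
         * np.mean([op(Q_st(x)) ** 4 for x in xs]) ** 0.25
         * np.mean([np.linalg.norm(e) ** 2 for e in eps]) ** 0.5
       + op(Q_ht(x0))
         * np.mean([op(np.linalg.inv(Q_ht(x))) ** 2 for x in xs]) ** 0.5
         * np.mean([np.linalg.norm(f_st(x) - f_ht(x)) ** 2 for x in xs]) ** 0.5)
print(lhs, rhs, lhs <= rhs)  # the bound always holds
```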
Before we proceed, we mention alternative ways to bound the mean deviation term $\frac{1}{n}\sum_{i=1}^n \|\tilde{\varepsilon}^i_n(x)\|$ that may be easier to verify in some contexts. By slightly changing some of the steps leading to inequality (7), the third term on the r.h.s. of inequality (8) can be replaced with the term
\[ \|\hat{Q}_n(x)\| \bigg( \frac{1}{n}\sum_{i=1}^n \big\|\big[\hat{Q}_n(x^i)\big]^{-1} Q^*(x^i) - I\big\|^2 \bigg)^{1/2} \bigg( \frac{1}{n}\sum_{i=1}^n \|\varepsilon^i\|^2 \bigg)^{1/2}. \]
When the second term in the expression above possesses the requisite asymptotic and finite sample guarantees (see, e.g., [30, Section 3]), this yields an alternative form of inequality (8) that requires milder assumptions on the distribution of the errors $\varepsilon$. For another alternative, notice that the first term on the r.h.s. of inequality (6) can also be bounded from above as
\begin{align*}
\frac{1}{n}\sum_{i=1}^n \Big\| \Big(\hat{Q}_n(x)\big[\hat{Q}_n(x^i)\big]^{-1} - Q^*(x)\big[Q^*(x^i)\big]^{-1}\Big) Q^*(x^i)\varepsilon^i \Big\|
&= \frac{1}{n}\sum_{i=1}^n \Big\| \Big(\hat{Q}_n(x)\big[\hat{Q}_n(x^i)\big]^{-1} Q^*(x^i) - Q^*(x)\Big) \varepsilon^i \Big\| \\
&= \frac{1}{n}\sum_{i=1}^n \Big\| \hat{Q}_n(x)\Big(\big[\hat{Q}_n(x^i)\big]^{-1} Q^*(x^i) - I\Big)\varepsilon^i + \big(\hat{Q}_n(x) - Q^*(x)\big)\varepsilon^i \Big\| \\
&\le \|\hat{Q}_n(x)\| \Big( \sup_{\bar{x} \in \mathcal{X}} \big\|\big[\hat{Q}_n(\bar{x})\big]^{-1}\big\| \Big) \bigg( \frac{1}{n}\sum_{i=1}^n \|Q^*(x^i) - \hat{Q}_n(x^i)\|^2 \bigg)^{1/2} \bigg( \frac{1}{n}\sum_{i=1}^n \|\varepsilon^i\|^2 \bigg)^{1/2} \\
&\quad + \|\hat{Q}_n(x) - Q^*(x)\| \bigg( \frac{1}{n}\sum_{i=1}^n \|\varepsilon^i\| \bigg),
\end{align*}
where the final step follows by the Cauchy-Schwarz inequality. Additionally, the second term on the r.h.s. of inequality (6) can also be bounded from above as
\[ \frac{1}{n}\sum_{i=1}^n \Big\| \hat{Q}_n(x)\big[\hat{Q}_n(x^i)\big]^{-1}\big(f^*(x^i) - \hat{f}_n(x^i)\big) \Big\| \le \|\hat{Q}_n(x)\| \Big( \sup_{\bar{x} \in \mathcal{X}} \big\|\big[\hat{Q}_n(\bar{x})\big]^{-1}\big\| \Big) \bigg( \frac{1}{n}\sum_{i=1}^n \|f^*(x^i) - \hat{f}_n(x^i)\| \bigg). \]
Using these bounds in inequality (6), we conclude that the mean deviation $\frac{1}{n}\sum_{i=1}^n \|\tilde{\varepsilon}^i_n(x)\|$ can also be bounded from above as
\begin{align}
\frac{1}{n}\sum_{i=1}^n \|\tilde{\varepsilon}^i_n(x)\|
&\le \|\hat{f}_n(x) - f^*(x)\| + \|\hat{Q}_n(x) - Q^*(x)\| \bigg( \frac{1}{n}\sum_{i=1}^n \|\varepsilon^i\| \bigg) \notag \\
&\quad + \|\hat{Q}_n(x)\| \Big( \sup_{\bar{x} \in \mathcal{X}} \big\|\big[\hat{Q}_n(\bar{x})\big]^{-1}\big\| \Big) \bigg( \frac{1}{n}\sum_{i=1}^n \|Q^*(x^i) - \hat{Q}_n(x^i)\|^2 \bigg)^{1/2} \bigg( \frac{1}{n}\sum_{i=1}^n \|\varepsilon^i\|^2 \bigg)^{1/2} \notag \\
&\quad + \|\hat{Q}_n(x)\| \Big( \sup_{\bar{x} \in \mathcal{X}} \big\|\big[\hat{Q}_n(\bar{x})\big]^{-1}\big\| \Big) \bigg( \frac{1}{n}\sum_{i=1}^n \|f^*(x^i) - \hat{f}_n(x^i)\| \bigg). \tag{9}
\end{align}
Inequality (9) can be used to derive alternative conditions under which our asymptotic and finite sample guarantees hold. For instance, asymptotic and finite sample guarantees on the uniform convergence of the estimate $\hat{Q}_n$ to $Q^*$ on $\mathcal{X}$ directly translate to the requisite asymptotic and finite sample guarantees on the quantities involving the estimate $\hat{Q}_n$ in (9). These conditions again necessitate milder assumptions on the distribution of the errors $\varepsilon$ relative to (8); however, they require the function $Q^*$ and its regression estimate $\hat{Q}_n$ to be (asymptotically) a.s. uniformly invertible (cf. [22]), i.e., $\sup_{\bar{x} \in \mathcal{X}} \|[Q^*(\bar{x})]^{-1}\| < +\infty$ and a.s. (for $n$ large enough) $\sup_{\bar{x} \in \mathcal{X}} \|[\hat{Q}_n(\bar{x})]^{-1}\| < +\infty$. We omit these details for brevity and continue with inequality (8) for the rest of our analysis.

We begin with assumptions that guarantee that the mean deviation term $\frac{1}{n}\sum_{i=1}^n \|\tilde{\varepsilon}^i_n(x)\|$ converges to zero in probability.

Assumption 5.
The function $Q^*$ and the data $\mathcal{D}_n$ satisfy the weak LLNs
\[ \frac{1}{n}\sum_{i=1}^n \|Q^*(x^i)\|^4 \xrightarrow{p} \mathbb{E}\big[\|Q^*(X)\|^4\big] \quad \text{and} \quad \frac{1}{n}\sum_{i=1}^n \big\|\big[Q^*(x^i)\big]^{-1}\big\|^2 \xrightarrow{p} \mathbb{E}\Big[\big\|\big[Q^*(X)\big]^{-1}\big\|^2\Big]. \]

Assumption 6.
The samples $\{\varepsilon^i\}_{i=1}^n$ satisfy the weak LLNs $\frac{1}{n}\sum_{i=1}^n \|\varepsilon^i\| \xrightarrow{p} \mathbb{E}[\|\varepsilon\|]$ and $\frac{1}{n}\sum_{i=1}^n \|\varepsilon^i\|^2 \xrightarrow{p} \mathbb{E}[\|\varepsilon\|^2]$.

Assumptions 5 and 6 are mild weak LLN assumptions that hold, for instance, when the samples $\{(x^i, \varepsilon^i)\}$ are i.i.d. and the quantities $\mathbb{E}[\|Q^*(X)\|^4]$, $\mathbb{E}\big[\|[Q^*(X)]^{-1}\|^2\big]$, and $\mathbb{E}[\|\varepsilon\|^2]$ are finite. They also hold for non-i.i.d. data arising from mixing/stationary processes that satisfy suitable assumptions (see the discussion following Assumption 3 of [17]). We also require the following consistency assumption on the regression estimates $\hat{f}_n$ and $\hat{Q}_n$ (cf. Assumption 4 of [17]).

Assumption 7.
The regression estimates $\hat{f}_n$ and $\hat{Q}_n$ possess the following consistency properties:
\[ \hat{f}_n(x) \xrightarrow{p} f^*(x) \quad \text{and} \quad \hat{Q}_n(x) \xrightarrow{p} Q^*(x), \quad \text{for a.e. } x \in \mathcal{X}, \]
and
\[ \frac{1}{n}\sum_{i=1}^n \|\hat{f}_n(x^i) - f^*(x^i)\|^2 \xrightarrow{p} 0, \qquad \frac{1}{n}\sum_{i=1}^n \big\|\big[\hat{Q}_n(x^i)\big]^{-1} - \big[Q^*(x^i)\big]^{-1}\big\|^4 \xrightarrow{p} 0. \]

Lemma 4.
We have
\[ \bigg( \frac{1}{n}\sum_{i=1}^n \big\|\big[\hat{Q}_n(x^i)\big]^{-1}\big\|^2 \bigg)^{1/2} \le \bigg( \frac{1}{n}\sum_{i=1}^n \big\|\big[\hat{Q}_n(x^i)\big]^{-1} - \big[Q^*(x^i)\big]^{-1}\big\|^2 \bigg)^{1/2} + \bigg( \frac{1}{n}\sum_{i=1}^n \big\|\big[Q^*(x^i)\big]^{-1}\big\|^2 \bigg)^{1/2}. \]

Proof.
The triangle inequality for the operator norm implies
\[ \big\|\big[\hat{Q}_n(x^i)\big]^{-1}\big\| \le \big\|\big[\hat{Q}_n(x^i)\big]^{-1} - \big[Q^*(x^i)\big]^{-1}\big\| + \big\|\big[Q^*(x^i)\big]^{-1}\big\|, \quad \forall i \in [n]. \]
Therefore, the following component-wise inequality holds for the vectors of these norms:
\[ 0 \le \begin{pmatrix} \big\|\big[\hat{Q}_n(x^1)\big]^{-1}\big\| \\ \vdots \\ \big\|\big[\hat{Q}_n(x^n)\big]^{-1}\big\| \end{pmatrix} \le \begin{pmatrix} \big\|\big[\hat{Q}_n(x^1)\big]^{-1} - \big[Q^*(x^1)\big]^{-1}\big\| \\ \vdots \\ \big\|\big[\hat{Q}_n(x^n)\big]^{-1} - \big[Q^*(x^n)\big]^{-1}\big\| \end{pmatrix} + \begin{pmatrix} \big\|\big[Q^*(x^1)\big]^{-1}\big\| \\ \vdots \\ \big\|\big[Q^*(x^n)\big]^{-1}\big\| \end{pmatrix}. \]
The stated result then follows as a consequence of the triangle inequality for the $\ell_2$-norm.

Applying Assumptions 5, 6, and 7 to inequality (8) immediately yields the following result.

Theorem 5.
Suppose Assumptions 5, 6, and 7 hold. Then $\frac{1}{n}\sum_{i=1}^n \|\tilde{\varepsilon}^i_n(x)\| \xrightarrow{p} 0$ for a.e. $x \in \mathcal{X}$.

Proof.
Follows from inequality (8), Assumptions 5, 6, and 7, Lemma 4, the continuous mapping theorem, and the facts that $O_p(1)\,O_p(1) = O_p(1)$, $O_p(1)\,o_p(1) = o_p(1)$, and $o_p(1) + o_p(1) = o_p(1)$.
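The following Monte Carlo sketch illustrates Theorem 5 on a scalar example; the model $f^*(x) = 1 + 2x$, $q^*(x) = \exp(0.5x)$, the OLS estimate of $f^*$, and the rescaled log-linear fit of $q^*$ are illustrative assumptions of ours. The printed mean deviation at a fixed covariate value shrinks as $n$ grows.

```python
import numpy as np

rng = np.random.default_rng(3)

def mean_deviation(n: int) -> float:
    """One Monte Carlo replicate of (1/n) sum_i ||eps_tilde^i_n(x0)|| for the
    illustrative scalar model f*(x) = 1 + 2x, q*(x) = exp(0.5 x)."""
    x = rng.uniform(0.0, 1.0, size=n)
    eps = rng.normal(size=n)
    y = 1.0 + 2.0 * x + np.exp(0.5 * x) * eps
    Phi = np.column_stack([np.ones(n), x])
    beta = np.linalg.lstsq(Phi, y, rcond=None)[0]        # OLS estimate of f*
    r = y - Phi @ beta
    # Log-linear fit of the squared residuals, rescaled so that the empirical
    # residuals have unit second moment (corrects the log-transform bias).
    theta = np.linalg.lstsq(Phi, np.log(r**2 + 1e-12), rcond=None)[0]
    q_raw = np.exp((theta[0] + theta[1] * x) / 2.0)
    s = np.sqrt(np.mean((r / q_raw) ** 2))
    q_hat = lambda u: s * np.exp((theta[0] + theta[1] * u) / 2.0)
    x0 = 0.5
    eps_hat = r / q_hat(x)
    f0_hat, f0, q0 = beta[0] + beta[1] * x0, 1.0 + 2.0 * x0, np.exp(0.5 * x0)
    return np.mean(np.abs(f0_hat + q_hat(x0) * eps_hat - (f0 + q0 * eps)))

for n in (100, 1_000, 10_000, 100_000):
    print(n, mean_deviation(n))  # shrinks roughly like n^{-1/2}
```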
We refine Assumption 7 to obtain rates of convergence (cf. Assumption 6 of [17]).

Assumption 8.
There is a constant $0 < r \le 1$ such that the estimates $\hat{f}_n$ and $\hat{Q}_n$ satisfy the following convergence rate criteria:
\[ \|\hat{f}_n(x) - f^*(x)\| = O_p(n^{-r/2}) \quad \text{and} \quad \|\hat{Q}_n(x) - Q^*(x)\| = O_p(n^{-r/2}), \quad \text{for a.e. } x \in \mathcal{X}, \]
\[ \frac{1}{n}\sum_{i=1}^n \|\hat{f}_n(x^i) - f^*(x^i)\|^2 = O_p(n^{-r}), \qquad \frac{1}{n}\sum_{i=1}^n \big\|\big[\hat{Q}_n(x^i)\big]^{-1} - \big[Q^*(x^i)\big]^{-1}\big\|^4 = O_p(n^{-2r}). \]
(The constant $r$ is independent of $n$, but could depend on the covariate dimension $d_x$.)

Inequality (8) along with Assumptions 5, 6, and 8 readily yields the following result.
Theorem 6.
Suppose Assumptions 5, 6, and 8 hold. Then $\frac{1}{n}\sum_{i=1}^n \|\tilde{\varepsilon}^i_n(x)\| = O_p(n^{-r/2})$ for a.e. $x \in \mathcal{X}$.

Proof.
Follows by applying Assumptions 5, 6, and 8, Lemma 4, the continuous mapping theorem, and the facts that $O_p(1) + O_p(n^{-r/2}) = O_p(1)$, $O_p(1)\,O_p(n^{-r/2}) = O_p(n^{-r/2})$, and $O_p(n^{-r/2}) + O_p(n^{-r/2}) = O_p(n^{-r/2})$ to inequality (8).

We make the following additional assumptions to establish a finite sample guarantee (cf. Assumption 8 of [17]).
Assumption 9.
The regression estimates $\hat{f}_n$ and $\hat{Q}_n$ possess the following finite sample properties: for any constant $\kappa > 0$, there exist positive constants $K_f(\kappa, x)$, $\bar{K}_f(\kappa)$, $\beta_f(\kappa, x)$, $\bar{\beta}_f(\kappa)$, $K_Q(\kappa, x)$, $\bar{K}_Q(\kappa)$, $\beta_Q(\kappa, x)$, and $\bar{\beta}_Q(\kappa)$ such that for each $n \in \mathbb{N}$
\begin{align*}
&\mathbb{P}\big\{ \|f^*(x) - \hat{f}_n(x)\| > \kappa \big\} \le K_f(\kappa, x) \exp(-n\beta_f(\kappa, x)), \quad \text{for a.e. } x \in \mathcal{X}, \\
&\mathbb{P}\big\{ \|Q^*(x) - \hat{Q}_n(x)\| > \kappa \big\} \le K_Q(\kappa, x) \exp(-n\beta_Q(\kappa, x)), \quad \text{for a.e. } x \in \mathcal{X}, \\
&\mathbb{P}\bigg\{ \frac{1}{n}\sum_{i=1}^n \|f^*(x^i) - \hat{f}_n(x^i)\|^2 > \kappa \bigg\} \le \bar{K}_f(\kappa) \exp\big(-n\bar{\beta}_f(\kappa)\big), \quad \text{and} \\
&\mathbb{P}\bigg\{ \frac{1}{n}\sum_{i=1}^n \big\|\big[\hat{Q}_n(x^i)\big]^{-1} - \big[Q^*(x^i)\big]^{-1}\big\|^4 > \kappa \bigg\} \le \bar{K}_Q(\kappa) \exp\big(-n\bar{\beta}_Q(\kappa)\big).
\end{align*}

The next two assumptions strengthen Assumptions 5 and 6 to assume finite sample properties for the quantities involved.
Assumption 10.
For any constant $\kappa > 0$, there exist positive constants $\gamma_Q(\kappa)$ and $\bar{\gamma}_Q(\kappa)$ such that for each $n \in \mathbb{N}$
\begin{align*}
&\mathbb{P}\Bigg\{ \bigg( \frac{1}{n}\sum_{i=1}^n \big\|\big[Q^*(x^i)\big]^{-1}\big\|^2 \bigg)^{1/2} > \Big( \mathbb{E}\Big[\big\|\big[Q^*(X)\big]^{-1}\big\|^2\Big] \Big)^{1/2} + \kappa \Bigg\} \le \exp(-n\gamma_Q(\kappa)), \\
&\mathbb{P}\Bigg\{ \bigg( \frac{1}{n}\sum_{i=1}^n \|Q^*(x^i)\|^4 \bigg)^{1/4} > \big( \mathbb{E}\big[\|Q^*(X)\|^4\big] \big)^{1/4} + \kappa \Bigg\} \le \exp(-n\bar{\gamma}_Q(\kappa)).
\end{align*}

Assumption 11.
For any constant $\kappa > 0$, there exist positive constants $\gamma_\varepsilon(\kappa)$ and $\bar{\gamma}_\varepsilon(\kappa)$ such that for each $n \in \mathbb{N}$
\[ \mathbb{P}\bigg\{ \frac{1}{n}\sum_{i=1}^n \|\varepsilon^i\| > \mathbb{E}[\|\varepsilon\|] + \kappa \bigg\} \le \exp(-n\gamma_\varepsilon(\kappa)), \qquad \mathbb{P}\Bigg\{ \bigg( \frac{1}{n}\sum_{i=1}^n \|\varepsilon^i\|^2 \bigg)^{1/2} > \big( \mathbb{E}[\|\varepsilon\|^2] \big)^{1/2} + \kappa \Bigg\} \le \exp(-n\bar{\gamma}_\varepsilon(\kappa)). \]

The first part of Assumption 10 holds, e.g., if for each $\kappa > 0$, there is a constant $\gamma_Q(\kappa) > 0$ such that
\[ \mathbb{P}\bigg\{ \frac{1}{n}\sum_{i=1}^n \big\|\big[Q^*(x^i)\big]^{-1}\big\|^2 > \mathbb{E}\Big[\big\|\big[Q^*(X)\big]^{-1}\big\|^2\Big] + \kappa \bigg\} \le \exp(-n\gamma_Q(\kappa)). \]
The function $\gamma_Q(\cdot)$ in the inequality above is related to the so-called rate function in large deviations theory (see Section 7.2.8 of [27]). Similar conclusions hold for the probability inequalities involving the terms $\frac{1}{n}\sum_{i=1}^n \|Q^*(x^i)\|^4$ and $\frac{1}{n}\sum_{i=1}^n \|\varepsilon^i\|$ in Assumptions 10 and 11. From large deviations theory, we can also conclude that the constants $\gamma_Q(\kappa)$, $\bar{\gamma}_Q(\kappa)$, $\gamma_\varepsilon(\kappa)$, and $\bar{\gamma}_\varepsilon(\kappa)$ in Assumptions 10 and 11 are guaranteed to exist for i.i.d. data $\mathcal{D}_n$ and for each constant $\kappa > 0$ if $\mathbb{E}\big[\exp\big(\|[Q^*(X)]^{-1}\|^p\big)\big] < +\infty$ for some $p > 2$, $\mathbb{E}[\exp(\|Q^*(X)\|^p)] < +\infty$ for some $p > 4$, and $\mathbb{E}[\exp(\|\varepsilon\|^p)] < +\infty$ for some $p > 4$. The discussion following Assumption 7 of [17] provides avenues for verifying Assumptions 10 and 11 for non-i.i.d. data $\mathcal{D}_n$.

We are now ready to state our finite sample guarantee.

Theorem 7.
Suppose Assumptions 9, 10, and 11 hold. Then, for any constant $\kappa > 0$ and a.e. $x \in \mathcal{X}$, there exist positive constants $\tilde{K}(\kappa, x)$ and $\tilde{\beta}(\kappa, x)$ such that
\[ \mathbb{P}\bigg\{ \frac{1}{n}\sum_{i=1}^n \|\tilde{\varepsilon}^i_n(x)\| > \kappa \bigg\} \le \tilde{K}(\kappa, x) \exp(-n\tilde{\beta}(\kappa, x)). \]

Proof.
Using inequality (8) and the inequality $\mathbb{P}\{V + W > c_1 + c_2\} \le \mathbb{P}\{V > c_1\} + \mathbb{P}\{W > c_2\}$ for any random variables $V$, $W$ and constants $c_1$, $c_2$, we get
\begin{align}
\mathbb{P}\bigg\{ \frac{1}{n}\sum_{i=1}^n \|\tilde{\varepsilon}^i_n(x)\| > \kappa \bigg\}
&\le \mathbb{P}\Big\{ \|\hat{f}_n(x) - f^*(x)\| > \tfrac{\kappa}{4} \Big\} + \mathbb{P}\bigg\{ \bigg( \frac{1}{n}\sum_{i=1}^n \|\varepsilon^i\| \bigg) \|\hat{Q}_n(x) - Q^*(x)\| > \tfrac{\kappa}{4} \bigg\} \notag \\
&\quad + \mathbb{P}\bigg\{ \|\hat{Q}_n(x)\| \bigg( \frac{1}{n}\sum_{i=1}^n \big\|\big[\hat{Q}_n(x^i)\big]^{-1} - \big[Q^*(x^i)\big]^{-1}\big\|^4 \bigg)^{1/4} \bigg( \frac{1}{n}\sum_{i=1}^n \|Q^*(x^i)\|^4 \bigg)^{1/4} \bigg( \frac{1}{n}\sum_{i=1}^n \|\varepsilon^i\|^2 \bigg)^{1/2} > \tfrac{\kappa}{4} \bigg\} \notag \\
&\quad + \mathbb{P}\bigg\{ \|\hat{Q}_n(x)\| \bigg( \frac{1}{n}\sum_{i=1}^n \big\|\big[\hat{Q}_n(x^i)\big]^{-1}\big\|^2 \bigg)^{1/2} \bigg( \frac{1}{n}\sum_{i=1}^n \|f^*(x^i) - \hat{f}_n(x^i)\|^2 \bigg)^{1/2} > \tfrac{\kappa}{4} \bigg\}. \tag{10}
\end{align}
For a.e. $x \in \mathcal{X}$, the first term on the r.h.s. of inequality (10) can be bounded using Assumption 9 as
\[ \mathbb{P}\Big\{ \|\hat{f}_n(x) - f^*(x)\| > \tfrac{\kappa}{4} \Big\} \le K_f\big(\tfrac{\kappa}{4}, x\big) \exp\big(-n\beta_f(\tfrac{\kappa}{4}, x)\big). \]
Next, consider the second term on the r.h.s. of inequality (10). We have for a.e. $x \in \mathcal{X}$
\begin{align*}
\mathbb{P}\bigg\{ \bigg( \frac{1}{n}\sum_{i=1}^n \|\varepsilon^i\| \bigg) \|\hat{Q}_n(x) - Q^*(x)\| > \tfrac{\kappa}{4} \bigg\}
&\le \mathbb{P}\bigg\{ \frac{1}{n}\sum_{i=1}^n \|\varepsilon^i\| > \mathbb{E}[\|\varepsilon\|] + \kappa \bigg\} + \mathbb{P}\Big\{ \big(\mathbb{E}[\|\varepsilon\|] + \kappa\big) \|\hat{Q}_n(x) - Q^*(x)\| > \tfrac{\kappa}{4} \Big\} \\
&\le \exp(-n\gamma_\varepsilon(\kappa)) + \mathbb{P}\bigg\{ \|\hat{Q}_n(x) - Q^*(x)\| > \tfrac{\kappa}{4(\mathbb{E}[\|\varepsilon\|] + \kappa)} \bigg\} \\
&\le \exp(-n\gamma_\varepsilon(\kappa)) + K_Q\Big(\tfrac{\kappa}{4(\mathbb{E}[\|\varepsilon\|] + \kappa)}, x\Big) \exp\Big(-n\beta_Q\big(\tfrac{\kappa}{4(\mathbb{E}[\|\varepsilon\|] + \kappa)}, x\big)\Big),
\end{align*}
where the second inequality follows from Assumption 11 and the final step follows from Assumption 9. The third term on the r.h.s. of inequality (10) can be bounded for a.e. $x \in \mathcal{X}$ as
\begin{align*}
&\mathbb{P}\bigg\{ \|\hat{Q}_n(x)\| \bigg( \frac{1}{n}\sum_{i=1}^n \big\|\big[\hat{Q}_n(x^i)\big]^{-1} - \big[Q^*(x^i)\big]^{-1}\big\|^4 \bigg)^{1/4} \bigg( \frac{1}{n}\sum_{i=1}^n \|Q^*(x^i)\|^4 \bigg)^{1/4} \bigg( \frac{1}{n}\sum_{i=1}^n \|\varepsilon^i\|^2 \bigg)^{1/2} > \tfrac{\kappa}{4} \bigg\} \\
&\le \mathbb{P}\big\{ \|\hat{Q}_n(x)\| > \|Q^*(x)\| + \kappa \big\} + \mathbb{P}\bigg\{ \bigg( \frac{1}{n}\sum_{i=1}^n \|Q^*(x^i)\|^4 \bigg)^{1/4} > \big( \mathbb{E}[\|Q^*(X)\|^4] \big)^{1/4} + \kappa \bigg\} \\
&\quad + \mathbb{P}\bigg\{ \bigg( \frac{1}{n}\sum_{i=1}^n \|\varepsilon^i\|^2 \bigg)^{1/2} > \big( \mathbb{E}[\|\varepsilon\|^2] \big)^{1/2} + \kappa \bigg\} + \mathbb{P}\bigg\{ \bigg( \frac{1}{n}\sum_{i=1}^n \big\|\big[\hat{Q}_n(x^i)\big]^{-1} - \big[Q^*(x^i)\big]^{-1}\big\|^4 \bigg)^{1/4} > h_1(\kappa, x) \bigg\} \\
&\le K_Q(\kappa, x) \exp(-n\beta_Q(\kappa, x)) + \exp(-n\bar{\gamma}_Q(\kappa)) + \exp(-n\bar{\gamma}_\varepsilon(\kappa)) + \bar{K}_Q\big(h_1(\kappa, x)^4\big) \exp\big(-n\bar{\beta}_Q(h_1(\kappa, x)^4)\big),
\end{align*}
where the second inequality follows from Assumptions 9, 10, and 11 (note that $\|\hat{Q}_n(x)\| > \|Q^*(x)\| + \kappa$ implies $\|\hat{Q}_n(x) - Q^*(x)\| > \kappa$), the final inequality follows from Assumption 9, and
\[ h_1(\kappa, x) := \frac{\kappa}{4\big( \|Q^*(x)\| + \kappa \big)\Big( \big( \mathbb{E}[\|Q^*(X)\|^4] \big)^{1/4} + \kappa \Big)\Big( \big( \mathbb{E}[\|\varepsilon\|^2] \big)^{1/2} + \kappa \Big)}. \]
Finally, the fourth term on the r.h.s. of inequality (10) can be bounded for a.e. $x \in \mathcal{X}$ as
\begin{align*}
&\mathbb{P}\bigg\{ \|\hat{Q}_n(x)\| \bigg( \frac{1}{n}\sum_{i=1}^n \big\|\big[\hat{Q}_n(x^i)\big]^{-1}\big\|^2 \bigg)^{1/2} \bigg( \frac{1}{n}\sum_{i=1}^n \|f^*(x^i) - \hat{f}_n(x^i)\|^2 \bigg)^{1/2} > \tfrac{\kappa}{4} \bigg\} \\
&\le \mathbb{P}\big\{ \|\hat{Q}_n(x)\| > \|Q^*(x)\| + \kappa \big\} + \mathbb{P}\bigg\{ \bigg( \frac{1}{n}\sum_{i=1}^n \big\|\big[\hat{Q}_n(x^i)\big]^{-1}\big\|^2 \bigg)^{1/2} > \Big( \mathbb{E}\big[\|[Q^*(X)]^{-1}\|^2\big] \Big)^{1/2} + 2\kappa \bigg\} \\
&\quad + \mathbb{P}\bigg\{ \big( \|Q^*(x)\| + \kappa \big)\Big( \big( \mathbb{E}\big[\|[Q^*(X)]^{-1}\|^2\big] \big)^{1/2} + 2\kappa \Big) \bigg( \frac{1}{n}\sum_{i=1}^n \|f^*(x^i) - \hat{f}_n(x^i)\|^2 \bigg)^{1/2} > \tfrac{\kappa}{4} \bigg\} \\
&\le K_Q(\kappa, x) \exp(-n\beta_Q(\kappa, x)) + \mathbb{P}\bigg\{ \bigg( \frac{1}{n}\sum_{i=1}^n \big\|\big[\hat{Q}_n(x^i)\big]^{-1} - \big[Q^*(x^i)\big]^{-1}\big\|^2 \bigg)^{1/2} + \bigg( \frac{1}{n}\sum_{i=1}^n \big\|\big[Q^*(x^i)\big]^{-1}\big\|^2 \bigg)^{1/2} > \Big( \mathbb{E}\big[\|[Q^*(X)]^{-1}\|^2\big] \Big)^{1/2} + 2\kappa \bigg\} \\
&\quad + \mathbb{P}\bigg\{ \bigg( \frac{1}{n}\sum_{i=1}^n \|f^*(x^i) - \hat{f}_n(x^i)\|^2 \bigg)^{1/2} > h_2(\kappa, x) \bigg\} \\
&\le K_Q(\kappa, x) \exp(-n\beta_Q(\kappa, x)) + \bar{K}_f\big(h_2(\kappa, x)^2\big) \exp\big(-n\bar{\beta}_f(h_2(\kappa, x)^2)\big) + \bar{K}_Q(\kappa^4) \exp\big(-n\bar{\beta}_Q(\kappa^4)\big) + \exp(-n\gamma_Q(\kappa)),
\end{align*}
where the second inequality follows from Assumption 9 and Lemma 4, the final inequality follows from Assumptions 9 and 10, the probability inequality stated at the beginning of this proof, and the bound $\frac{1}{n}\sum_{i=1}^n \|A_i\|^2 \le \big( \frac{1}{n}\sum_{i=1}^n \|A_i\|^4 \big)^{1/2}$, and
\[ h_2(\kappa, x) := \frac{\kappa}{4\big( \|Q^*(x)\| + \kappa \big)\Big( \big( \mathbb{E}\big[\|[Q^*(X)]^{-1}\|^2\big] \big)^{1/2} + 2\kappa \Big)}. \]
Putting the above bounds together in inequality (10), we have for a.e. $x \in \mathcal{X}$
\begin{align}
\mathbb{P}\bigg\{ \frac{1}{n}\sum_{i=1}^n \|\tilde{\varepsilon}^i_n(x)\| > \kappa \bigg\}
&\le \exp(-n\gamma_\varepsilon(\kappa)) + \exp(-n\bar{\gamma}_\varepsilon(\kappa)) + \exp(-n\gamma_Q(\kappa)) + \exp(-n\bar{\gamma}_Q(\kappa)) \notag \\
&\quad + K_f\big(\tfrac{\kappa}{4}, x\big) \exp\big(-n\beta_f(\tfrac{\kappa}{4}, x)\big) + \bar{K}_f\big(h_2(\kappa, x)^2\big) \exp\big(-n\bar{\beta}_f(h_2(\kappa, x)^2)\big) \notag \\
&\quad + K_Q\Big(\tfrac{\kappa}{4(\mathbb{E}[\|\varepsilon\|] + \kappa)}, x\Big) \exp\Big(-n\beta_Q\big(\tfrac{\kappa}{4(\mathbb{E}[\|\varepsilon\|] + \kappa)}, x\big)\Big) + 2K_Q(\kappa, x) \exp(-n\beta_Q(\kappa, x)) \notag \\
&\quad + \bar{K}_Q(\kappa^4) \exp\big(-n\bar{\beta}_Q(\kappa^4)\big) + \bar{K}_Q\big(h_1(\kappa, x)^4\big) \exp\big(-n\bar{\beta}_Q(h_1(\kappa, x)^4)\big), \tag{11}
\end{align}
which then implies the desired result.

Suppose we make the mild assumptions that the functions $\bar{K}_f(\cdot)$, $K_Q(\cdot, x)$, and $\bar{K}_Q(\cdot)$ in Assumption 9 are monotonically nonincreasing on $\mathbb{R}_+$ and the functions $\bar{\beta}_f(\cdot)$, $\beta_Q(\cdot, x)$, and $\bar{\beta}_Q(\cdot)$ therein are monotonically nondecreasing on $\mathbb{R}_+$ (cf. Appendix EC.3 of [17]).
For a.e. $x \in \mathcal{X}$ and any tolerance $\kappa$ satisfying
\[ \kappa < \min\Big\{ \mathbb{E}[\|\varepsilon\|], \ \|Q^*(x)\|, \ \big( \mathbb{E}[\|Q^*(X)\|^4] \big)^{1/4}, \ \big( \mathbb{E}\big[\|[Q^*(X)]^{-1}\|^2\big] \big)^{1/2} \Big\}, \]
we can then use inequality (11) to derive the bound
\begin{align*}
\mathbb{P}\bigg\{ \frac{1}{n}\sum_{i=1}^n \|\tilde{\varepsilon}^i_n(x)\| > \kappa \bigg\}
&\le \exp(-n\gamma_\varepsilon(\kappa)) + \exp(-n\bar{\gamma}_\varepsilon(\kappa)) + \exp(-n\gamma_Q(\kappa)) + \exp(-n\bar{\gamma}_Q(\kappa)) \\
&\quad + K_f\big(\tfrac{\kappa}{4}, x\big) \exp\big(-n\beta_f(\tfrac{\kappa}{4}, x)\big) + \bar{K}_f\big(\bar{h}_2(\kappa, x)^2\big) \exp\big(-n\bar{\beta}_f(\bar{h}_2(\kappa, x)^2)\big) \\
&\quad + K_Q\Big(\tfrac{\kappa}{8\mathbb{E}[\|\varepsilon\|]}, x\Big) \exp\Big(-n\beta_Q\big(\tfrac{\kappa}{8\mathbb{E}[\|\varepsilon\|]}, x\big)\Big) + 2K_Q(\kappa, x) \exp(-n\beta_Q(\kappa, x)) \\
&\quad + \bar{K}_Q(\kappa^4) \exp\big(-n\bar{\beta}_Q(\kappa^4)\big) + \bar{K}_Q\big(\bar{h}_1(\kappa, x)^4\big) \exp\big(-n\bar{\beta}_Q(\bar{h}_1(\kappa, x)^4)\big),
\end{align*}
where
\[ \bar{h}_1(\kappa, x) := \frac{\kappa}{32\,\|Q^*(x)\| \big( \mathbb{E}[\|Q^*(X)\|^4] \big)^{1/4} \big( \mathbb{E}[\|\varepsilon\|^2] \big)^{1/2}}, \qquad \bar{h}_2(\kappa, x) := \frac{\kappa}{24\,\|Q^*(x)\| \big( \mathbb{E}\big[\|[Q^*(X)]^{-1}\|^2\big] \big)^{1/2}}. \]
Therefore, for a.e. $x \in \mathcal{X}$ and any such $\kappa$, we have
\[ \mathbb{P}\bigg\{ \frac{1}{n}\sum_{i=1}^n \|\tilde{\varepsilon}^i_n(x)\| > \kappa \bigg\} \le \tilde{K}(\kappa, x) \exp(-n\tilde{\beta}(\kappa, x)), \]
with
\[ \tilde{K}(\kappa, x) := 4 + K_f\big(\tfrac{\kappa}{4}, x\big) + \bar{K}_f\big(\bar{h}_2(\kappa, x)^2\big) + K_Q\Big(\tfrac{\kappa}{8\mathbb{E}[\|\varepsilon\|]}, x\Big) + 2K_Q(\kappa, x) + \bar{K}_Q(\kappa^4) + \bar{K}_Q\big(\bar{h}_1(\kappa, x)^4\big) \]
and
\[ \tilde{\beta}(\kappa, x) := \min\Big\{ \gamma_\varepsilon(\kappa), \bar{\gamma}_\varepsilon(\kappa), \gamma_Q(\kappa), \bar{\gamma}_Q(\kappa), \beta_f\big(\tfrac{\kappa}{4}, x\big), \bar{\beta}_f\big(\bar{h}_2(\kappa, x)^2\big), \beta_Q\big(\tfrac{\kappa}{8\mathbb{E}[\|\varepsilon\|]}, x\big), \beta_Q(\kappa, x), \bar{\beta}_Q(\kappa^4), \bar{\beta}_Q\big(\bar{h}_1(\kappa, x)^4\big) \Big\}. \]
Unlike the functions $h_1(\cdot, x)$ and $h_2(\cdot, x)$, the functions $\bar{h}_1(\cdot, x)$ and $\bar{h}_2(\cdot, x)$ are linear in $\kappa$. Consequently, for small enough tolerances $\kappa > 0$ and a given reliability level $\alpha \in (0, 1)$, we can derive a simpler analytical expression for the minimum sample size $n$ required for $\mathbb{P}\big\{ \frac{1}{n}\sum_{i=1}^n \|\tilde{\varepsilon}^i_n(x)\| > \kappa \big\} \le \alpha$. This can in turn enable a more interpretable estimate of the sample size $n$ required for solutions of the ER-SAA problem (3) and the ER-DRO problem (4) to be approximately optimal to the true problem (1) with probability $1 - \alpha$ (cf. Proposition 2 of [17]).
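For illustration, because the constants $\tilde{K}(\kappa, x)$ and $\tilde{\beta}(\kappa, x)$ above do not depend on $n$, solving $\tilde{K}(\kappa, x)\exp(-n\tilde{\beta}(\kappa, x)) \le \alpha$ for $n$ gives the explicit (conservative) sample size estimate
\[ n \ge \frac{1}{\tilde{\beta}(\kappa, x)} \log\bigg( \frac{\tilde{K}(\kappa, x)}{\alpha} \bigg), \]
which is the kind of interpretable expression alluded to above.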
4 Some regression setups that satisfy our assumptions

In this section, we verify that Assumptions 7, 8, and 9 hold for some regression setups. We do not attempt to be exhaustive. We first discuss methods for estimating the regression function $f^*$ and note their asymptotic and finite sample guarantees. We then list some popular models for the class of functions $\mathcal{Q}$, discuss approaches for estimating the matrix-valued function $Q^*$, and note their theoretical guarantees.

4.1 Estimating the regression function

We identify conditions under which the parts of Assumptions 7, 8, and 9 involving the regression estimate $\hat{f}_n$ hold for some prediction setups. Although these assumptions on $\hat{f}_n$ are the same as those in Assumptions 4, 6, and 8 of [17], we focus on regression setups that work in the heteroscedastic setting.

Ordinary least squares (OLS) regression.
When the regression function $f^*$ is linear, its OLS estimate $\hat{f}_n$ satisfies Assumptions 7 and 8 with constant $r = 1$ (see Proposition EC.3 of [17] for details). Furthermore, Theorem 11 and Remark 12 of [14] can be used to readily identify conditions under which the estimates $\hat{f}_n$ possess a finite sample guarantee like in Assumption 9. However, OLS regression does not yield an efficient estimator of $f^*$ in the heteroscedastic case [23]. (Here, by the term efficient estimator, we mean a minimum variance unbiased estimator.) An alternative to OLS regression is feasible weighted least squares (FWLS) regression [22, 23], which results in asymptotically efficient estimates when the estimate $\hat{Q}_n$ of $Q^*$ is consistent. The asymptotic guarantees of FWLS regression continue to hold, albeit at the expense of asymptotic efficiency, even if the estimate $\hat{Q}_n$ of $Q^*$ is inconsistent (see, e.g., Section 3.3 of [23]).
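A minimal sketch of the FWLS idea follows; the linear variance model fit via a regression of absolute residuals is an illustrative choice of ours (it estimates the error scale up to a constant factor, which does not affect the weighted least squares solution).

```python
import numpy as np

rng = np.random.default_rng(5)
n = 1000
x = rng.uniform(0.0, 2.0, size=n)
y = 1.0 + 3.0 * x + (0.2 + 0.8 * x) * rng.normal(size=n)  # scale grows with x

Phi = np.column_stack([np.ones(n), x])
beta_ols = np.linalg.lstsq(Phi, y, rcond=None)[0]   # consistent but inefficient

# FWLS: estimate the error scale from the absolute OLS residuals, then
# reweight each observation by 1/q_hat(x) and re-solve least squares.
r = y - Phi @ beta_ols
scale_coef = np.linalg.lstsq(Phi, np.abs(r), rcond=None)[0]
w = 1.0 / np.maximum(Phi @ scale_coef, 1e-3)
beta_fwls = np.linalg.lstsq(Phi * w[:, None], y * w, rcond=None)[0]
print(beta_ols, beta_fwls)  # both near (1, 3); FWLS has lower variance
```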
Sparse regression methods. Proposition EC.4 of [17] lists conditions under which the ordinary Lasso regression estimate $\hat{f}_n$ satisfies Assumptions 7 and 8 with constant $r = 1$ and a finite sample guarantee like in Assumption 9. Theorem 1 of Belloni et al. [3] outlines conditions under which similar asymptotic results hold for the heteroscedasticity-adapted Lasso. Medeiros and Mendes [19] and Ziel [31] present asymptotic analyses of the adaptive Lasso for time series data; their analyses apply to GARCH-type processes. Theorems 2 and 3 of [19] and Theorem 1 of [31] present conditions under which the estimate $\hat{f}_n$ satisfies Assumptions 7 and 8 with $r = 1$. Belloni et al. [4] present asymptotic and finite sample guarantees for the heteroscedasticity-adapted square-root Lasso. Finally, Dalalyan et al. [8] introduce a scaled heteroscedastic Dantzig selector; Theorem 5.2 therein presents large deviation bounds for both regression estimates $\hat{f}_n$ and $\hat{Q}_n$ under certain sparsity assumptions. (Although [8] consider the fixed design setting, their analysis can be modified to accommodate random designs under suitable assumptions on the distribution $P_X$ of the covariates $X$; see Section 4 of [8].)

Other M-estimators.
The conclusions for OLS regression carry over to more general M-estimators. In particular, Appendix EC.2 of [17] presents conditions under which Assumptions 7 and 8 continue to hold with $r = 1$. Similar to the special case of OLS regression, vanilla M-estimators may no longer be efficient; feasible weighted M-estimation is an asymptotically efficient alternative. Theorems 1, 3, and 5 of Sun et al. [28] and Theorem 2.1 of Zhou et al. [30] present large deviation results of the form of Assumption 9 for adaptive Huber regression when the function $f^*$ is linear. Remarkably, their results hold even for heavy-tailed error distributions. Finally, Schick [25] considers a semiparametric regression setup for $f^*$ and establishes rates of convergence of weighted least squares estimates.

kNN regression. Proposition EC.5 of [17] summarizes conditions under which the kNN regression estimate $\hat{f}_n$ of $f^*$ satisfies Assumptions 7 and 8 with constant $r = O(1)/d_x$. It also notes conditions under which $\hat{f}_n$ possesses a finite sample guarantee like in Assumption 9 (cf. Corollary 1 of [15]).
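A self-contained numpy sketch of the kNN estimate follows; the choice $k \approx n^{2/(2 + d_x)}$ is a standard illustrative bias/variance balancing rule, not one prescribed by [17].

```python
import numpy as np

rng = np.random.default_rng(6)
n, d_x = 2000, 2
xs = rng.uniform(-1.0, 1.0, size=(n, d_x))
ys = np.sin(np.pi * xs[:, 0]) + 0.5 * xs[:, 1] + 0.3 * rng.normal(size=n)

def knn_estimate(x0: np.ndarray, k: int) -> float:
    """f_hat_n(x0): average response over the k nearest covariate observations."""
    dist = np.linalg.norm(xs - x0, axis=1)
    return ys[np.argsort(dist)[:k]].mean()

k = int(n ** (2.0 / (2.0 + d_x)))  # a common bias/variance balancing choice
x0 = np.array([0.2, -0.5])
print(knn_estimate(x0, k), np.sin(0.2 * np.pi) - 0.25)  # estimate vs f*(x0)
```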
Kernel regression. Hansen [13] studies conditions under which kernel regression estimates are uniformly consistent given dependent data $\mathcal{D}_n$ satisfying mixing conditions. Theorems 1, 2, and 4 therein can be used to show that the kernel regression estimate $\hat{f}_n$ satisfies Assumptions 7 and 8 with constant $r = O(1)/d_x$. Mokkadem et al. [20] study large deviations results for some kernel regression estimates.

4.2 Estimating the conditional covariance matrix of the errors

In this section, we identify conditions under which the parts of Assumptions 7, 8, and 9 involving the regression estimate $\hat{Q}_n$ hold for some prediction setups. These assumptions for $\hat{Q}_n$ (in particular, Assumption 9) are not as well-studied in the literature as those for $\hat{f}_n$. Therefore, they are typically harder to verify than their counterparts in Section 4.1. Because deriving theoretical properties of estimators for the heteroscedastic setting and deriving finite sample properties of estimators in general are areas of topical interest, we envision that future research will enable easier verification of these assumptions.

For simplicity, we only consider function classes $\mathcal{Q}$ that comprise diagonal covariance matrices (cf. [30]), although the theoretical developments in Section 3 apply more generally. Bauwens et al. [2] review some model classes $\mathcal{Q}$ with non-diagonal covariance matrices that are popular in time series modeling.

Example 1. [Parametric models] The model class is
\[ \mathcal{Q} = \big\{ Q : \mathbb{R}^{d_x} \to \mathbb{R}^{d_y \times d_y} : Q(X) = \mathrm{diag}(q_1(X), q_2(X), \dots, q_{d_y}(X)) \big\}, \]
where $q_j : \mathbb{R}^{d_x} \to \mathbb{R}_+$ for each $j \in [d_y]$. Forms of the functions $q_j$ of interest include [21, 23]:

i. $(q_j(X))^2 = \sigma_j^2 (1 + \theta_j^{\mathsf{T}} X)^2$ for parameters $(\sigma_j, \theta_j)$,

ii. $(q_j(X))^2 = \exp(\sigma_j + \theta_j^{\mathsf{T}} X)$ for parameters $(\sigma_j, \theta_j)$,

iii. $(q_j(X))^2 = \exp\big(\sigma_j + \theta_j^{\mathsf{T}} \log(X)\big)$ for parameters $(\sigma_j, \theta_j)$, with the logarithm applied componentwise.

For the rest of the note, we absorb the parameter $\sigma_j$ into $\theta_j$ for simplicity of presentation. With this notation, the above examples can be cast in the general form $(q_j(X))^2 = h_j(\theta_j^{\mathsf{T}} g_j(X))$ for known functions $h_j$ and $g_j$ and a parameter $\theta_j$ that is to be estimated. The above setup can also accommodate cases where the parameters of the function $Q^*$ include some of the parameters of the function $f^*$.

Example 2. [Nonparametric model] The model class is
\[ \mathcal{Q} = \big\{ Q : \mathbb{R}^{d_x} \to \mathbb{R}^{d_y \times d_y} : Q(X) = \mathrm{diag}(q_1(X), q_2(X), \dots, q_{d_y}(X)) \big\}, \]
where each $q_j : \mathbb{R}^{d_x} \to \mathbb{R}_+$ is assumed to be 'sufficiently smooth'. Chapter 8 of Fan and Yao [11] presents some popular models for the functions $q_j$ in a time series context.

Suppose for ease of exposition that the covariance matrix of the errors $\varepsilon$ is the identity matrix. Then, for each $j \in [d_y]$ and any $\bar{x} \in \mathcal{X}$, we have $\mathbb{E}\big[ (Y_j - f^*_j(X))^2 \mid X = \bar{x} \big] = (q^*_j(\bar{x}))^2$ for the components $q^*_j(\bar{x})$ of $Q^*(\bar{x})$ in Examples 1 and 2. This motivates the estimation of each function $q^*_j$ by regressing the squared residuals $(y^i_j - \hat{f}_{j,n}(x^i))^2$ on the covariate observation $x^i$. For the parametric setup in Example 1, this nonlinear regression problem can often be transformed into a linear regression problem.
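As an illustration of this transformation, consider the model class (ii) in Example 1, where $(q_j(X))^2 = \exp(\sigma_j + \theta_j^{\mathsf{T}} X)$, so that the logarithm of the squared residual is linear in $X$. The rescaling step in the sketch below is an illustrative correction of ours for the intercept bias $\mathbb{E}[\log \varepsilon_j^2]$ introduced by the log transform.

```python
import numpy as np

rng = np.random.default_rng(4)
n, d_x = 2000, 3
xs = rng.uniform(-1.0, 1.0, size=(n, d_x))
theta_true = np.array([0.2, -0.4, 0.3])

# One scalar component j with (q_j(x))^2 = exp(sigma_j + theta_j' x); f* is
# taken as known here so the sketch isolates the variance-function step.
q_j = lambda X: np.exp((0.1 + X @ theta_true) / 2.0)
res = q_j(xs) * rng.normal(size=n)  # residuals y_j - f*_j(x) = q_j(x) eps_j

# Linearization: log(res^2) = sigma_j + theta_j' x + log(eps_j^2), so OLS on
# the logs recovers theta_j; the intercept absorbs the bias E[log eps_j^2],
# which the rescaling below corrects.
Phi = np.column_stack([np.ones(n), xs])
coef = np.linalg.lstsq(Phi, np.log(res**2 + 1e-12), rcond=None)[0]
q_raw = lambda X: np.exp((coef[0] + X @ coef[1:]) / 2.0)
scale = np.sqrt(np.mean((res / q_raw(xs)) ** 2))
q_hat = lambda X: scale * q_raw(X)

print(coef[1:])  # approximately theta_true
x0 = np.array([[0.5, 0.0, -0.5]])
print(q_hat(x0)[0], q_j(x0)[0])  # estimated vs true scale at x0
```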
An alternative for the parametric setup is to estimate the parameters $\theta_j$ in $\hat{Q}_n$ concurrently with the parameters of the estimate $\hat{f}_n$ using an M-estimation procedure. Section 3 of Davidian and Carroll [9] outlines several approaches for estimating the parameters in Example 1, including the methods mentioned above. Chapter 8 of Fan and Yao [11] discusses nonparametric regression methods for estimating each function $q_j$.

We now outline approaches for verifying that the estimate $\hat{Q}_n$ satisfies Assumptions 7, 8, and 9. Consider first the parametric setup in Example 1. Suppose the function $Q^*(\cdot) \equiv Q(\cdot\,; \theta^*)$ for some function $Q$ and the goal is to estimate the parameter $\theta^*$. Let $\hat{\theta}_n$ denote the estimate of $\theta^*$ corresponding to the regression estimate $\hat{Q}_n$, i.e., $\hat{Q}_n(\cdot) \equiv Q(\cdot\,; \hat{\theta}_n)$. Suppose for a.e. realization $x \in \mathcal{X}$, the function $Q(x; \cdot)$ is Lipschitz continuous with Lipschitz constant $L_Q(x)$ and its inverse $[Q(x; \cdot)]^{-1}$ is also Lipschitz continuous with Lipschitz constant $\bar{L}_Q(x)$. These assumptions hold for the model classes in Example 1 if the parameters $\theta$ therein are restricted to lie in suitable compact sets. (As noted in [17, Appendix EC.3.2], it suffices to assume that the above Lipschitz continuity assumptions hold locally for the asymptotic results.) Because
\[ \|\hat{Q}_n(x) - Q^*(x)\| \le L_Q(x) \|\hat{\theta}_n - \theta^*\| \quad \text{and} \quad \frac{1}{n}\sum_{i=1}^n \big\|\big[\hat{Q}_n(x^i)\big]^{-1} - \big[Q^*(x^i)\big]^{-1}\big\|^4 \le \bigg( \frac{1}{n}\sum_{i=1}^n \bar{L}_Q(x^i)^4 \bigg) \|\hat{\theta}_n - \theta^*\|^4, \]
asymptotic and finite sample guarantees on the estimator $\hat{\theta}_n$ of $\theta^*$ directly translate to the asymptotic and finite sample guarantees on the estimate $\hat{Q}_n$ in Assumptions 7, 8, and 9. When the functions $f^*$ and $Q^*$ are jointly estimated using M-estimators, the results listed in Appendix EC.3.2 of [17] provide conditions under which the estimator $\hat{\theta}_n$ of $\theta^*$ is consistent and Assumptions 7 and 8 hold with $r = 1$. They also present a hard-to-verify uniform exponential bound condition under which $\hat{\theta}_n$ possesses a finite sample guarantee. Carroll and Ruppert [6] consider robust M-estimators for $\theta^*$ that possess a similar rate of convergence when $f^*$ is linear. Dalalyan et al. [8] present asymptotic and finite sample guarantees for a scaled Dantzig estimator of $\theta^*$ under some sparsity assumptions. Finally, Fan et al. [12] present a quasi-maximum likelihood approach for estimating the parameters of GARCH models and investigate their asymptotic properties.

Next, consider the nonparametric setup in Example 2, and suppose the function $Q^*$ and its regression estimate $\hat{Q}_n$ are (asymptotically) a.s. uniformly invertible. (Although inequality (9) in Section 3 can yield similar guarantees under such uniform invertibility assumptions, we stick with Assumptions 7, 8, and 9 dictated by inequality (8) for simplicity.) We have
\begin{align*}
\frac{1}{n}\sum_{i=1}^n \big\|\big[\hat{Q}_n(x^i)\big]^{-1} - \big[Q^*(x^i)\big]^{-1}\big\|^4
&\le \frac{1}{n}\sum_{i=1}^n \big\|\big[Q^*(x^i)\big]^{-1}\big\|^4 \big\|\big[\hat{Q}_n(x^i)\big]^{-1}\big\|^4 \|\hat{Q}_n(x^i) - Q^*(x^i)\|^4 \\
&\le \Big( \sup_{\bar{x} \in \mathcal{X}} \big\|\big[Q^*(\bar{x})\big]^{-1}\big\|^4 \Big) \Big( \sup_{\bar{x} \in \mathcal{X}} \big\|\big[\hat{Q}_n(\bar{x})\big]^{-1}\big\|^4 \Big) \bigg( \frac{1}{n}\sum_{i=1}^n \|\hat{Q}_n(x^i) - Q^*(x^i)\|^4 \bigg).
\end{align*}
Therefore, asymptotic and finite sample guarantees for $\|\hat{Q}_n(x) - Q^*(x)\|$ and $\frac{1}{n}\sum_{i=1}^n \|\hat{Q}_n(x^i) - Q^*(x^i)\|^4$ are sufficient for verifying Assumptions 7, 8, and 9. Theorem 8.5 of Fan and Yao [11] can be used to identify conditions under which these asymptotic guarantees hold for local linear estimators on time series data when the dimension of the covariates $d_x = 1$; they also note approaches for estimating $Q^*$ when $d_x > 1$. Theorem 2 of Ruppert et al. [24] can be used to verify Assumptions 7 and 8 for local polynomial smoothers. Proposition 2.1 and Theorem 3.1 of Jin et al. [16] identify conditions under which Assumptions 7 and 8 hold for a local likelihood estimator. Van Keilegom and Wang [29] consider semiparametric models for both $f^*$ and $Q^*$; Theorems 3.1 and 3.2 therein can be used to verify Assumptions 7 and 8 for the estimates $\hat{Q}_n$. Section 3 of Zhou et al. [30] presents robust estimators of $Q^*$ when $f^*$ is linear and notes that these estimators $\hat{Q}_n$ possess asymptotic and finite sample guarantees in the form of Assumptions 7, 8, and 9. Finally, Theorem 3.1 of Chesneau et al. [7] can be used to derive asymptotic guarantees for wavelet estimators of $Q^*$.

5 Conclusion

In this note, we propose generalizations of the ER-SAA and ER-DRO frameworks in [17, 18] that can handle heteroscedastic errors, focusing mainly on the ER-SAA for brevity. We identify sufficient conditions under which solutions to these approximations possess asymptotic and finite sample guarantees for a class of two-stage stochastic MIPs with continuous recourse. Furthermore, we outline conditions under which these assumptions hold for some regression setups, including OLS, Lasso, and kNN regression. Future work includes verification of the large deviation Assumption 9 for the regression estimate $\hat{Q}_n$ for additional prediction setups, consideration of more general relationships between the random vector $Y$ and the random covariates $X$, and investigation of the computational performance of the generalizations of the ER-SAA and ER-DRO problems on a practical application involving heteroscedasticity.

Acknowledgments
We thank Prof. Erick Delage for encouraging us to investigate extensions of the ER-SAA formulation to the heteroscedastic setting. This research is supported by the Department of Energy, Office of Science, Office of Advanced Scientific Computing Research, Applied Mathematics program under Contract Number DE-AC02-06CH11357.

References

[1] G.-Y. Ban, J. Gallien, and A. J. Mersereau. Dynamic procurement of new products with covariate information: The residual tree method. Manufacturing & Service Operations Management, 21(4):798–815, 2019.
[2] L. Bauwens, S. Laurent, and J. V. Rombouts. Multivariate GARCH models: a survey. Journal of Applied Econometrics, 21(1):79–109, 2006.

[3] A. Belloni, D. Chen, V. Chernozhukov, and C. Hansen. Sparse models and methods for optimal instruments with an application to eminent domain. Econometrica, 80(6):2369–2429, 2012.

[4] A. Belloni, V. Chernozhukov, and L. Wang. Pivotal estimation via square-root lasso in nonparametric regression. The Annals of Statistics, 42(2):757–788, 2014.

[5] D. Bertsimas and N. Kallus. From predictive to prescriptive analytics. Management Science, 66(3):1025–1044, 2020.

[6] R. J. Carroll and D. Ruppert. Robust estimation in heteroscedastic linear models. The Annals of Statistics, pages 429–441, 1982.

[7] C. Chesneau, S. El Kolei, J. Kou, and F. Navarro. Nonparametric estimation in a regression model with additive and multiplicative noise. Journal of Computational and Applied Mathematics, page 112971, 2020.

[8] A. Dalalyan, M. Hebiri, K. Meziani, and J. Salmon. Learning heteroscedastic models by convex programming under group sparsity. In Proceedings of the 30th International Conference on Machine Learning, pages 379–387, 2013.

[9] M. Davidian and R. J. Carroll. Variance function estimation. Journal of the American Statistical Association, 82(400):1079–1091, 1987.

[10] P. Donti, B. Amos, and J. Z. Kolter. Task-based end-to-end model learning in stochastic optimization. In Advances in Neural Information Processing Systems, pages 5484–5494, 2017.

[11] J. Fan and Q. Yao. Nonlinear time series: nonparametric and parametric methods. Springer Science & Business Media, 2008.

[12] J. Fan, L. Qi, and D. Xiu. Quasi-maximum likelihood estimation of GARCH models with heavy-tailed likelihoods. Journal of Business & Economic Statistics, 32(2):178–191, 2014.

[13] B. E. Hansen. Uniform convergence rates for kernel estimation with dependent data. Econometric Theory, pages 726–748, 2008.

[14] D. Hsu, S. M. Kakade, and T. Zhang. Random design analysis of ridge regression. In Proceedings of the 25th Annual Conference on Learning Theory, volume 23, pages 9.1–9.24, 2012.

[15] H. Jiang. Non-asymptotic uniform rates of consistency for k-NN regression. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 3999–4006, 2019.

[16] S. Jin, L. Su, and Z. Xiao. Adaptive nonparametric regression with conditional heteroskedasticity. Econometric Theory, 31(6):1153, 2015.

[17] R. Kannan, G. Bayraksan, and J. R. Luedtke. Data-driven sample average approximation with covariate information. Optimization Online, 2020.

[18] R. Kannan, G. Bayraksan, and J. R. Luedtke. Residuals-based distributionally robust optimization with covariate information. Optimization Online, 2020.

[19] M. C. Medeiros and E. F. Mendes. $\ell_1$-regularization of high-dimensional time-series models with non-Gaussian and heteroskedastic errors. Journal of Econometrics, 191(1):255–271, 2016.

[20] A. Mokkadem, M. Pelletier, and B. Thiam. Large and moderate deviations principles for kernel estimators of the multivariate regression. Mathematical Methods of Statistics, 17(2):146–172, 2008.

[21] J. L. Powell. Models, testing, and correction of heteroskedasticity. Lecture notes, Department of Economics, University of California, Berkeley. URL: https://eml.berkeley.edu/~powell/e240b_sp10/hetnotes.pdf, 2010.

[22] P. M. Robinson. Asymptotically efficient estimation in the presence of heteroskedasticity of unknown form. Econometrica: Journal of the Econometric Society, pages 875–891, 1987.

[23] J. P. Romano and M. Wolf. Resurrecting weighted least squares. Journal of Econometrics, 197(1):1–19, 2017.

[24] D. Ruppert, M. P. Wand, U. Holst, and O. Hössjer. Local polynomial variance-function estimation. Technometrics, 39(3):262–273, 1997.

[25] A. Schick. Weighted least squares estimates in partly linear regression models. Statistics & Probability Letters, 27(3):281–287, 1996.

[26] S. Sen and Y. Deng. Learning enabled optimization: Towards a fusion of statistical learning and stochastic programming. Optimization Online, 2017.

[27] A. Shapiro, D. Dentcheva, and A. Ruszczyński. Lectures on stochastic programming: modeling and theory. SIAM, 2009.

[28] Q. Sun, W.-X. Zhou, and J. Fan. Adaptive Huber regression. Journal of the American Statistical Association, 115(529):254–265, 2020.

[29] I. Van Keilegom and L. Wang. Semiparametric modeling and estimation of heteroscedasticity in regression analysis of cross-sectional data. Electronic Journal of Statistics, 4:133–160, 2010.

[30] W.-X. Zhou, K. Bose, J. Fan, and H. Liu. A new perspective on robust M-estimation: Finite sample theory and applications to dependence-adjusted multiple testing. Annals of Statistics, 46(5):1904, 2018.

[31] F. Ziel. Iteratively reweighted adaptive Lasso for conditional heteroscedastic time series with applications to AR-ARCH type processes. Computational Statistics & Data Analysis, 100:773–793, 2016.