Ensemble Conditional Variance Estimator for Sufficient Dimension Reduction
Lukas Fertl
Institute of Statistics and Mathematical Methods in Economics
Faculty of Mathematics and Geoinformation
TU Wien, Vienna, Austria
[email protected]
Efstathia Bura
Institute of Statistics and Mathematical Methods in Economics
Faculty of Mathematics and Geoinformation
TU Wien, Vienna, Austria
[email protected]

March 1, 2021

ABSTRACT
Ensemble Conditional Variance Estimation (ECVE) is a novel sufficient dimension reduction (SDR) method in regressions with continuous response and predictors.
ECVE applies to general non-additive error regression models. It operates under the assumption that the predictors can be replaced by a lower dimensional projection without loss of information. It is a semiparametric, forward regression model based, exhaustive sufficient dimension reduction estimation method that is shown to be consistent under mild assumptions. It outperforms central subspace mean average variance estimation (csMAVE), its main competitor, under several simulation settings and in a benchmark data set analysis.

Let (Ω, F, P) be a probability space. Let Y be a univariate continuous response and X a p-variate continuous predictor, jointly distributed, with (Y, X^T)^T : Ω → R^{p+1}. We consider the linear sufficient dimension reduction model

  Y = g_cs(B^T X, ε),   (1)

where X ∈ R^p is independent of the random variable ε, B is a p × k matrix of rank k, and g_cs : R^{k+1} → R is an unknown non-constant function.

[ZZ10, Thm. 1] showed that if (Y, X^T)^T has a joint continuous distribution, (1) is equivalent to

  Y ⊥⊥ X | B^T X,   (2)

where the symbol ⊥⊥ indicates stochastic independence. The matrix B is not unique; it can be replaced by any basis of its column space, span{B}. Let S denote a subspace of R^p, and let P_S denote the orthogonal projection onto S with respect to the usual inner product. If the response Y and predictor vector X are independent conditionally on P_S X, then P_S X can replace X as the predictor in the regression of Y on X without loss of information. Such subspaces S are called dimension reduction subspaces, and their intersection, provided it satisfies the conditional independence condition (2), is called the central subspace and is denoted by S_{Y|X} [see [Coo98, p. 105], [Coo07]].

By their equivalence, under both models (1) and (2), F_{Y|X}(y) = F_{Y|B^T X}(y) and S_{Y|X} = span{B}. Since the conditional distribution of Y | X is the same as that of Y | B^T X, B^T X contains all the information in X for modeling the target variable Y, and it can replace X without any loss of information.
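To fix ideas, the following minimal R sketch (ours, not code from the paper) simulates data from a model of the form (1) with a toy non-additive link g_cs; the response depends on X only through B^T X, so here S_{Y|X} = span{B}.

    # simulate n draws from Y = g_cs(B^T X, eps) with a toy, non-additive g_cs
    set.seed(1)
    n <- 500; p <- 6; k <- 2
    B <- qr.Q(qr(matrix(rnorm(p * k), p, k)))  # orthonormal p x k basis of span{B}
    X <- matrix(rnorm(n * p), n, p)            # continuous predictors
    eps <- rnorm(n)                            # error, independent of X
    BX <- X %*% B                              # the sufficient reduction B^T X
    Y <- sin(BX[, 1]) + exp(BX[, 2]) * eps     # error enters non-additively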
If the error term in model (1) is additive with E(ε | X) = 0, (1) reduces to Y = g(B^T X) + ε. Now, E(Y | X) = E(Y | B^T X) = E(Y | P_S X), where S = span{B}. The mean subspace, denoted by S_{E(Y|X)}, is the intersection of all subspaces S such that E(Y | X) = E(Y | P_S X) [CL02]. In this case, (1) becomes the classic mean subspace model with span{B} = S_{E(Y|X)}. [CL02] showed that the mean subspace is a subset of the central subspace, S_{E(Y|X)} ⊆ S_{Y|X}. Several linear sufficient dimension reduction (SDR) methods estimate S_{E(Y|X)} consistently ([AC09, MZ13, Li18, XTLZ02]). Linear refers to the reduction being a linear transformation of the predictor vector.
Minimum Average Variance Estimation (MAVE) [XTLZ02] is the most competitive and accurate method among them.
MAVE differs from the majority of SDR methods in that it is not inverse regression based, such as, for example, the widely used
Sliced Inverse Regression (SIR, [Li91]). MAVE requires minimal assumptions on the distribution of (Y, X^T)^T and is based on estimating the gradients of the regression function E(Y | X) via local-linear smoothing [CD88].
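As a point of reference for what MAVE builds on, the following minimal R sketch (an illustration of local-linear smoothing, not the MAVE implementation) computes a local-linear estimate of E(Y | X) and its gradient at a point x0 with Gaussian weights:

    # local-linear smoothing [CD88]: weighted least squares on centered predictors
    local_linear <- function(X, y, x0, h) {
      w <- exp(-rowSums(sweep(X, 2, x0)^2) / (2 * h^2))  # Gaussian kernel weights
      Z <- cbind(1, sweep(X, 2, x0))                     # intercept + (X - x0)
      beta <- solve(crossprod(Z, w * Z), crossprod(Z, w * y))
      list(fit = beta[1], grad = beta[-1])  # estimate of E(Y|X=x0) and its gradient
    }

MAVE aggregates such local gradients, which up to estimation error lie in the mean subspace, to recover S_{E(Y|X)}.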
The central subspace mean average variance estimation (csMAVE) [WX08, WY19] is the extension of MAVE that consistently and exhaustively estimates span{B} in model (1) without restrictive assumptions limiting its applicability. csMAVE has remained the gold standard since it was proposed by [WX08]. It is based on repeatedly applying MAVE to the sliced target variables f_u(Y) = 1{s_{u−1} < Y ≤ s_u}.

[YL11] introduced ensembles as a device to extend mean subspace to central subspace SDR methods. The ensemble approach of combining mean subspaces to span the central subspace comprises two components: (a) a rich family of functions of transformations for the response, and (b) a sampling mechanism for drawing the functions from the ensemble to ascertain coverage of the central subspace. To distinguish between families of functions and ensembles, [YL11] use the term parametric ensemble, which we define next.

Definition. A family F of measurable functions from R to R is called an ensemble. If F is a family of measurable functions indexed by a set Ω_T; i.e., F = {f_t : t ∈ Ω_T}, then F is called a parametric ensemble.

Let F be an ensemble and f ∈ F, and consider f(Y) for Y following model (1). The space S_{E(f(Y)|X)} is defined to be the mean subspace of the transformed random variable f(Y) [see [Coo98] or [CL02]].

Definition. An ensemble F characterizes the central subspace S_{Y|X} if

  span{S_{E(f_t(Y)|X)} : f_t ∈ F} = S_{Y|X}.   (5)

As an example, the parametric ensemble F = {f_t : t ∈ Ω_T} = {1{z ≤ t} : t ∈ R} characterizes the central subspace S_{Y|X}, since E(f_t(Y) | X) is the conditional cumulative distribution function evaluated at t. To see this, let B ∈ S(p, k) be such that E(f_t(Y) | X) = E(f_t(Y) | B^T X) for all t. Then, F_{Y|X}(t) = E(f_t(Y) | X) = E(f_t(Y) | B^T X) = F_{Y|B^T X}(t) for all t. Varying over the parametric ensemble F, in this case over t ∈ R, recovers the conditional cumulative distribution function. This indicator ensemble fully recovers the conditional distribution of Y | X and, thus, also the central subspace S_{Y|X}:

  span{S_{E(f_t(Y)|X)} : f_t ∈ F} = span{S_{E(1{Y ≤ t}|X)} : t ∈ R} = S_{Y|X}.

We next reproduce from [YL11] a list of parametric ensembles F, with associated regularity conditions, that can characterize S_{Y|X}.

Characteristic ensemble: F = {f_t : t ∈ Ω_T} = {exp(it·) : t ∈ R}.
Indicator ensemble: F = {1{z ≤ t} : t ∈ R}, where span{S_{E(f_t(Y)|X)} : f_t ∈ F} recovers the conditional cumulative distribution function.
Kernel ensemble: F = {h^{−1} K((z − t)/h) : t ∈ R, h > 0}, where K is a kernel suitable for density estimation and span{S_{E(f_t(Y)|X)} : f_t ∈ F} recovers the conditional density.
Polynomial ensemble: F = {z^t : t = 1, 2, 3, ...}, where span{S_{E(f_t(Y)|X)} : f_t ∈ F} recovers the conditional moment generating function.
Box-Cox ensemble: F = {(z^t − 1)/t : t ≠ 0} ∪ {log(z) : t = 0} (Box-Cox transforms).
Wavelet ensemble: Haar wavelets.

The characteristic and indicator ensembles describe the conditional characteristic and distribution function of Y | X, respectively, which always exist and determine the distribution uniquely. If the conditional density function f_{Y|X} of Y | X exists, then the kernel ensemble characterizes the conditional distribution of Y | X. Further, if the conditional moment generating function exists, then the polynomial ensemble characterizes S_{Y|X}. [YL11] used the ensemble device to extend MAVE [XTLZ02], which targets the mean subspace, to its ensemble version that also estimates the central subspace S_{Y|X} consistently.
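For instance, finite sub-families of the indicator and characteristic (Fourier) ensembles, like those used in the simulations of Section 6.1, can be represented in R as lists of functions. This is a minimal sketch; the quantile-based indicator thresholds follow the choice described in Section 6.1.

    # finite indicator ensemble: f_j(z) = 1{z <= q_j}, q_j empirical quantiles of y
    make_indicator_ensemble <- function(y, m) {
      q <- quantile(y, probs = (1:m) / (m + 1))
      lapply(q, function(t) function(z) as.numeric(z <= t))
    }
    # finite Fourier ensemble: sin(jz) and cos(jz), j = 1, ..., m/2
    make_fourier_ensemble <- function(m) {
      c(lapply(seq_len(m / 2), function(j) function(z) sin(j * z)),
        lapply(seq_len(m / 2), function(j) function(z) cos(j * z)))
    }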
Theorem 1 [YL11, Thm 2.1] establishes when an ensemble F is rich enough to characterize S_{Y|X}.

Theorem 1. Let B = {1_A : A is a Borel set in supp(Y)} be the set of indicator functions on supp(Y), and let L_2(F_Y) be the set of square integrable random variables with respect to the distribution F_Y of the response Y. If F ⊆ L_2(F_Y) is dense in B ⊆ L_2(F_Y), then the ensemble F characterizes the central subspace S_{Y|X}.

In Theorem 2 we show that finitely many functions of an ensemble F suffice to characterize the central subspace S_{Y|X}.

Theorem 2. If a parametric ensemble F characterizes S_{Y|X}, then there exist finitely many functions f_t ∈ F, with t = 1, ..., m and m ∈ N, such that span{S_{E(f_t(Y)|X)} : t ∈ {1, ..., m}} = S_{Y|X}.

Proof: Let k = dim(S_{Y|X}) ≤ p. Since F characterizes S_{Y|X}, dim(S_{E(f_t(Y)|X)}) = k_t ≤ k by (5) for any t. If k_t = 0, then S_{E(f_t(Y)|X)} = {0}, so the corresponding f_t does not contribute to (5). Assume k_t ≥ 1. If there were infinitely many S_{E(f_t(Y)|X)} ≠ {0} of dimension at least 1 whose span is S_{Y|X}, then infinitely many of them are identical; otherwise the dimension of the central subspace S_{Y|X} would be infinite, contradicting dim(S_{Y|X}) = k < ∞.

The importance of Theorem 2 lies in the fact that the search to characterize the central subspace is over a finite set, even though it does not offer tools for identifying the elements of the ensemble.

Throughout the paper, we refer to the following assumptions as needed.

(E.1). Model (1), Y = g_cs(B^T X, ε), holds with Y ∈ R, g_cs : R^k × R → R non-constant in the first argument, B = (b_1, ..., b_k) ∈ S(p, k), X ∈ R^p independent of ε, the distribution of X absolutely continuous with respect to the Lebesgue measure in R^p, supp(f_X) convex, and Var(X) = Σ_x positive definite.

(E.2). The density f_X : R^p → [0, ∞) of X is twice continuously differentiable with compact support supp(f_X).

(E.3). For a parametric ensemble F, its index set Ω_T is endowed with a probability measure F_T such that, for all t ∈ Ω_T with S_{E(f_t(Y)|X)} ≠ {0},

  P_{F_T}({t̃ ∈ Ω_T : S_{E(f_t̃(Y)|X)} = S_{E(f_t(Y)|X)}}) > 0.

(E.4). For an ensemble F we assume that, for all f ∈ F, the conditional expectation E(f(Y) | X) is twice continuously differentiable in the conditioning argument. Further, for all f ∈ F, E(|f(Y)|^8) < ∞.

Assumption (E.1) assures the existence and uniqueness of S_{Y|X} = span{B}. Furthermore, it allows the mean subspace to be a proper subset of the central subspace, i.e. S_{E(Y|X)} ⊊ S_{Y|X}. In Assumption (E.2), the compactness assumption for supp(f_X) is not as restrictive as it might seem: [YLC08, Prop. 11] showed that there is a compact set K ⊂ R^p such that S_{Y|X|K} = S_{Y|X}, where X|_K = X 1{X ∈ K}. Assumption (E.3) simply states that the set of indices that characterize the central subspace S_{Y|X} is not a null set.
In practice, the choice of the probability measure F_T on the index set Ω_T of a parametric ensemble F can always guarantee that this assumption is fulfilled. If the characteristic or indicator ensemble is used, (E.4) states that the conditional characteristic or distribution function is twice continuously differentiable. In this case, the 8th moments exist since the complex exponential and indicator functions are bounded.

Definition. For q ≤ p ∈ N, f ∈ F, and any V ∈ S(p, q), we define

  L̃_F(V, s, f) = Var(f(Y) | X ∈ s + span{V}),   (6)

where s ∈ R^p is a non-random shifting point.

Definition. Let F be a parametric ensemble and F_T a cumulative distribution function (cdf) on the index set Ω_T. For q ≤ p and any V ∈ S(p, q), we define

  L_F(V) = ∫_{Ω_T} ∫_{R^p} L̃_F(V, x, f_t) dF_X(x) dF_T(t) = E_{t∼F_T}(E_X(L̃_F(V, X, f_t))) = E_{t∼F_T}(L*_F(V, f_t)),   (7)

where F_X is the cdf of X, and

  L*_F(V, f_t) = E_X(L̃_F(V, X, f_t)).   (8)

For the identity function, f_t(z) = z, (8) is the target function of the conditional variance estimation proposed in [FB21]. If the random variable t is concentrated at a single point t_0, i.e. t ∼ δ_{t_0}, then the ensemble conditional variance estimator (ECVE) coincides with the conditional variance estimator (CVE).

The following theorem will be used in establishing the main result of this paper, which obtains the exhaustive sufficient reduction of the conditional distribution of Y given the predictor vector X.

Theorem 3. Assume (E.1) and (E.2) hold; in particular, model (1) holds. Let B̃ be a basis of S_{E(f_t(Y)|X)}; i.e., span{B̃} = S_{E(f_t(Y)|X)} ⊆ S_{Y|X} = span{B}. Then, for any f ∈ F for which assumption (E.4) holds,

  f(Y) = g(B̃^T X) + ε̃,   (9)

with E(ε̃ | X) = 0 and g : R^{k_t} → R a twice continuously differentiable function, where k_t = dim(S_{E(f_t(Y)|X)}).

By Theorem 3, any response Y can be written with an additive error via the decomposition (9). The predictors and the additive error term are only required to be conditionally uncorrelated in model (9). The conditional variance estimator [FB21] also estimated B̃ in (9), but under the more restrictive condition of predictor and error independence.

Proof of Theorem 3. Write

  f(Y) = E(f(Y) | X) + f(Y) − E(f(Y) | X) = E(f(Y) | X) + ε̃ = E(f(Y) | B̃^T X) + ε̃ = g(B̃^T X) + ε̃,

where ε̃ = f(Y) − E(f(Y) | X) and g(B̃^T X) = E(f(Y) | B̃^T X). By the tower property of conditional expectation, E(ε̃ | X) = E(f(Y) | X) − E(E(f(Y) | X) | X) = E(f(Y) | X) − E(f(Y) | X) = 0. The function g is twice continuously differentiable by (E.4).

Theorem 4. Assume (E.1) and (E.2) hold. Let F be a parametric ensemble, s ∈ supp(f_X) ⊂ R^p, and V ∈ S(p, q) defined in (3).
Then, for any f ∈ F for which assumption (E.4) holds,

  L̃_F(V, s, f) = μ_2(V, s, f) − μ_1(V, s, f)² + Var(ε̃ | X ∈ s + span{V}),   (10)

where

  μ_l(V, s, f) = ∫_{R^q} g(B̃^T s + B̃^T V r)^l f_X(s + V r) dr / ∫_{R^q} f_X(s + V r) dr = t^{(l)}(V, s, f) / t^{(0)}(V, s, f),   (11)

for g given in (9), with

  t^{(l)}(V, s, f) = ∫_{R^q} g(B̃^T s + B̃^T V r)^l f_X(s + V r) dr,   (12)

and

  Var(ε̃ | X ∈ s + span{V}) = E(ε̃² | X ∈ s + span{V}) = ∫_{supp(f_X)∩R^q} h(s + V r) f_X(s + V r) dr / ∫_{R^q} f_X(s + V r) dr = h̃(V, s, f) / t^{(0)}(V, s, f),   (13)

with E(ε̃² | X = x) = h(x) and h̃(V, s, f) = ∫_{supp(f_X)∩R^q} h(s + V r) f_X(s + V r) dr. Assume further that h(·) is continuous. Then L*_F(V, f_t) in (8) is well defined and continuous,

  V_{t,q} = argmin_{V ∈ S(p,q)} L*_F(V, f_t)   (14)

is well defined, and the conditional variance estimator of the transformed response f_t(Y) identifies S_{E(f_t(Y)|X)}:

  S_{E(f_t(Y)|X)} = span{V_{t,q}}^⊥.   (15)

[FB21] assumed the model Y = g(B^T X) + ε with ε ⊥⊥ X, which implies S_{E(Y|X)} = span{B} = S_{Y|X}, and showed that the conditional variance estimator (CVE) identifies S_{E(Y|X)} at the population level. Theorem 4 extends this result: CVE identifies the mean subspace S_{E(Y|X)} also in models of the form Y = g(B^T X) + ε̃, where ε̃ is merely conditionally uncorrelated with X. This allows CVE to apply to problems where the mean subspace is a proper subset of the central subspace, i.e. S_{E(Y|X)} ⊊ S_{Y|X}.

V_{t,q} in (14) is not unique since, for every orthogonal O ∈ R^{q×q}, L*_F(V_{t,q} O, f_t) = L*_F(V_{t,q}, f_t), as L*_F(V_{t,q}, f_t) depends on V_{t,q} only through span{V_{t,q}} by (6). Nevertheless, it is a unique minimizer over the Grassmann manifold Gr(p, q) in (4). To see this, suppose V ∈ S(p, q) is an arbitrary basis of a subspace M ∈ Gr(p, q). We can identify M with the projection P_M = VV^T. By (31), we write x = s + V r_1 + U r_2. Application of the Fubini-Tonelli theorem yields

  t̃^{(l)}(P_M, s, f) = ∫_{supp(f_X)} g(B̃^T s + B̃^T P_M x)^l f_X(s + P_M x) dx = t^{(l)}(V, s, f) ∫_{supp(f_X)∩R^{p−q}} dr_2.   (16)

Therefore t̃^{(l)}(P_M, s, f)/t̃^{(0)}(P_M, s, f) = t^{(l)}(V, s, f)/t^{(0)}(V, s, f), and μ_l(·, s, f) in (11) can also be viewed as a function from Gr(p, q) to R.

Next we define the ensemble conditional variance estimator (ECVE) for a parametric ensemble F that characterizes the central subspace S_{Y|X}. Following the ensemble minimum average variance estimation formulation in [YL11], we extend the original objective function by integrating over the index random variable t ∼ F_T in (7) that indexes the ensemble F.

Definition 5. Let

  V_q = argmin_{V ∈ S(p,q)} L_F(V).   (17)

The Ensemble Conditional Variance Estimator with respect to the ensemble F is defined to be any basis B_{p−q,F} of span{V_q}^⊥.

Theorem 6. Assume (E.1), (E.2), (E.3), and (E.4) hold, and that the function h(·) defined in Theorem 4 is continuous. Let F be a parametric ensemble that characterizes S_{Y|X}, with k = dim(S_{Y|X}), and let V be an element of the Stiefel manifold S(p, q), defined in (3), with q = p − k.
Then, V_q in (17) is well defined and

  S_{Y|X} = span{V_q}^⊥.   (18)

Assume (Y_i, X_i^T)^T, i = 1, ..., n, is an i.i.d. sample from model (1), and let

  d_i(V, s) = ‖X_i − P_{s+span{V}} X_i‖² = ‖X_i − s‖² − ⟨X_i − s, VV^T(X_i − s)⟩ = ‖(I_p − VV^T)(X_i − s)‖² = ‖Q_V(X_i − s)‖²,   (19)

where ⟨·, ·⟩ is the usual inner product in R^p, P_V = VV^T and Q_V = I_p − P_V. The estimators we propose involve a variation of kernel smoothing that depends on a bandwidth h_n. In our procedure, h_n is the squared width of a slice around the subspace s + span{V}. In order to obtain pointwise convergence for the ensemble CVE, we place the following bias and variance assumptions on the bandwidth, as is typical in nonparametric estimation.

(H.1). h_n → 0 as n → ∞.

(H.2). n h_n^{(p−q)/2} → ∞ as n → ∞.

In order to obtain consistency of the proposed estimator, Assumption (H.2) will be strengthened to log(n)/(n h_n^{(p−q)/2}) → 0. We also let K, which we refer to as the kernel, be a function satisfying the following assumptions.

(K.1). K : [0, ∞) → [0, ∞) is a non-increasing and continuous function with |K(z)| ≤ M, and ∫_{R^q} K(‖r‖²) dr < ∞ for q ≤ p − 1.

(K.2). There exist positive finite constants L_1 and L_2 such that K satisfies either (1) or (2) below:
(1) K(u) = 0 for |u| > L_2, and |K(u) − K(ũ)| ≤ L_1 |u − ũ| for all u, ũ;
(2) K(u) is differentiable with |∂_u K(u)| ≤ L_1, and |∂_u K(u)| ≤ L_1 |u|^{−ν} for some ν > 1 and |u| > L_2.

The Gaussian kernel K(z) = exp(−z²/2), for example, fulfills both (K.1) and (K.2) [see [Han08]] and is used throughout the paper. For i = 1, ..., n, we let

  w_i(V, s) = K(d_i(V, s)/h_n) / Σ_{j=1}^n K(d_j(V, s)/h_n)   (20)

and

  ȳ_l(V, s, f) = Σ_{i=1}^n w_i(V, s) f(Y_i)^l   for l = 1, 2.   (21)

We estimate L̃_F(V, s, f) in (10) with

  L̃_{n,F}(V, s, f) = ȳ_2(V, s, f) − ȳ_1(V, s, f)²,   (22)

and the objective function L*_F(V, f) in (8) with

  L*_n(V, f) = (1/n) Σ_{i=1}^n L̃_{n,F}(V, X_i, f),   (23)

where each data point X_i serves as a shifting point. For a parametric ensemble F = {f_t : t ∈ Ω_T} and (t_j)_{j=1,...,m_n} an i.i.d. sample from F_T with lim_{n→∞} m_n = ∞, the final estimate of the objective function in (7) is

  L_{n,F}(V) = (1/m_n) Σ_{j=1}^{m_n} L*_n(V, f_{t_j}).   (24)

The ensemble conditional variance estimator (ECVE) is defined to be any basis of span{V̂_q}^⊥, where

  V̂_q = argmin_{V ∈ S(p,q)} L_{n,F}(V).   (25)

We use the same algorithm as in [FB21] to solve the optimization problem (25). It requires the explicit form of the gradient of (24). Theorem 7 provides the gradient when a Gaussian kernel is used.

Theorem 7. The gradient of L̃_{n,F}(V, s, f) in (22) is given by

  ∇_V L̃_{n,F}(V, s, f) = (1/h_n²) Σ_{i=1}^n (L̃_{n,F}(V, s, f) − (f(Y_i) − ȳ_1(V, s, f))²) w_i(V, s) d_i(V, s) ∇_V d_i(V, s) ∈ R^{p×q},

and the gradient of L_{n,F}(V) in (24) is

  ∇_V L_{n,F}(V) = (1/(n m_n)) Σ_{i=1}^n Σ_{j=1}^{m_n} ∇_V L̃_{n,F}(V, X_i, f_{t_j}).

In the implementation of ECVE, we follow [FB21] and set the bandwidth to

  h_n = 1.2 (tr(Σ̂_x)/p) n^{−2/(4+p−q)},   (26)

where Σ̂_x = (1/n) Σ_i (X_i − X̄)(X_i − X̄)^T and X̄ = (1/n) Σ_i X_i.
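Putting (19)-(26) together, the following minimal R sketch (ours; the CVE package implements the actual algorithm) evaluates the sample objective L_{n,F}(V) of (24) for a given V, using the squared distances (19), the Gaussian kernel, and each X_i as a shifting point. Here ens is a list of ensemble functions such as those sketched earlier; the Stiefel-manifold minimization (25) itself is left to a gradient-based optimizer using Theorem 7.

    # sample ECVE objective L_{n,F}(V) of (24)
    Ln_ecve <- function(V, X, y, ens, hn) {
      n <- nrow(X)
      QV <- diag(ncol(X)) - V %*% t(V)              # Q_V = I_p - V V^T
      mean(sapply(ens, function(f) {                # outer average over f_t, eq. (24)
        fy <- f(y)
        mean(sapply(seq_len(n), function(i) {       # shifting points s = X_i, eq. (23)
          d <- colSums((QV %*% (t(X) - X[i, ]))^2)  # squared distances, eq. (19)
          w <- exp(-(d / hn)^2 / 2)                 # Gaussian kernel K(d_j/h_n)
          w <- w / sum(w)                           # within-slice weights, eq. (20)
          sum(w * fy^2) - sum(w * fy)^2             # ybar_2 - ybar_1^2, eqs. (21)-(22)
        }))
      }))
    }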
The set of points {x ∈ R^p : ‖x − P_{s+span{V}} x‖² ≤ h_n} represents a slice in R^p about s + span{V}. In the estimation of L_F(V) two different weighting schemes are used: (a) within slices, the weights defined in (20) are used to calculate (22); (b) between slices, equal weights 1/n are used to calculate (23). Another idea for the between-slices weighting is to assign more weight to slices that contain more points. This can be realized by replacing (23) with

  L^{(w)}_n(V, f) = Σ_{i=1}^n w̃(V, X_i) L̃_{n,F}(V, X_i, f),   (27)

with

  w̃(V, X_i) = (Σ_{j=1}^n K(d_j(V, X_i)/h_n) − 1) / (Σ_{l,u=1}^n K(d_l(V, X_u)/h_n) − n) = Σ_{j=1, j≠i}^n K(d_j(V, X_i)/h_n) / Σ_{l,u=1, l≠u}^n K(d_l(V, X_u)/h_n).   (28)

The denominator in (28) guarantees that the weights w̃(V, X_i) sum to one. If (27) is used instead of (23) in (24), we refer to the method as weighted ensemble conditional variance estimation. For example, if a rectangular kernel is used, Σ_{j≠i} K(d_j(V, X_i)/h_n) is the number of points X_j (j ≠ i) in the slice corresponding to L̃_{n,F}(V, X_i, f). Therefore, each slice is assigned weight proportional to the number of points X_j in it; the more observations we use for estimating L̃_{n,F}(V, X_i, f), the better its accuracy.

The consistency of ECVE derives from the consistency of CVE [FB21], which targets a specific S_{E(f_t(Y)|X)}, and from the fact that we can recover S_{Y|X} from the S_{E(f_t(Y)|X)} across all transformations f_t ∈ F = {f_t : t ∈ Ω_T} for an ensemble that characterizes S_{Y|X}. This is achieved in sequential steps from Theorem 8, which is the main building block, to Theorem 11. The proofs are technical and lengthy and are therefore given in the Appendix.

Theorem 8. Assume conditions (E.1), (E.2), (E.4), (K.1), (K.2), (H.1) hold, a_n = log(n)/(n h_n^{(p−q)/2}) = o(1), and a_n/h_n^{(p−q)/2} = O(1). Let F be a parametric ensemble such that E(|ε̃|^l | X = x) is continuous for l = 1, ..., 4, and the second conditional moment is twice continuously differentiable, where ε̃ is given by Theorem 3. Then L*_n(V, f), defined in (23), converges uniformly in probability to L*_F(V, f) in (8) for all f ∈ F; i.e.,

  sup_{V ∈ S(p,q)} |L*_n(V, f) − L*_F(V, f)| → 0 in probability as n → ∞.

Next, Theorem 9 shows that the conditional variance estimator is consistent for S_{E(f_t(Y)|X)} for any transformation f_t.

Theorem 9. Under the same conditions as Theorem 8, the conditional variance estimator span{B̂_{t,k_t}} estimates S_{E(f_t(Y)|X)} consistently for f_t ∈ F; that is,

  ‖P_{B̂_{t,k_t}} − P_{S_{E(f_t(Y)|X)}}‖ → 0 in probability as n → ∞,

where B̂_{t,k_t} is any basis of span{V̂_{t,k_t}}^⊥ with

  V̂_{t,k_t} = argmin_{V ∈ S(p,q)} L*_n(V, f_t),

q = p − k_t, and k_t = dim(S_{E(f_t(Y)|X)}).

A straightforward application of Theorem 9 with the identity function obtains that S_{E(Y|X)} can be consistently estimated by ECVE.

Theorem 10. Assume the conditions of Theorem 8 hold. Let F be a parametric ensemble such that sup_{t ∈ Ω_T} |f_t(Y)| < M < ∞ almost surely. Then L_{n,F}(V) in (24) converges uniformly in probability to L_F(V) in (7).

Theorem 11. Assume the conditions of Theorem 8 and (E.3) hold. Let F be a parametric ensemble that characterizes S_{Y|X} and whose members satisfy sup_{t ∈ Ω_T} |f_t(Y)| < M < ∞ almost surely.
Also, assume the index random variable t ∼ F_T is independent of the data (Y_i, X_i)_{i=1,...,n}. Then the ensemble conditional variance estimator (ECVE) is a consistent estimator of S_{Y|X}. That is, for any basis B̂_{p−q,F} of span{V̂_q}^⊥, where V̂_q is defined in (25) with q = p − k and k = dim(S_{Y|X}),

  ‖P_{B̂_{p−q,F}} − P_{S_{Y|X}}‖ → 0 in probability as n → ∞,

where P_M denotes the orthogonal projection onto the range space of the matrix or linear subspace M.

Effect of m_n on ECVE

In this section we study the influence of the number of functions of the ensemble F, m_n in (24), on the accuracy of the ensemble conditional variance estimation. In Theorems 10 and 11, how fast m_n approaches ∞ is left unspecified. We consider the 2-dimensional regression model

  Y = (b_1^T X)² + (0.5 (b_2^T X))² ε,   (29)

where p = 10, k = 2, X ∼ N(0, I_p), ε ∼ N(0, 1) independent of X, b_1 = (1, 0, ..., 0)^T ∈ R^p, and b_2 = (0, 1, 0, ..., 0)^T ∈ R^p. Therefore, S_{E(Y|X)} = span{b_1} ⊊ S_{Y|X} = span{B}, with B = (b_1, b_2).

We set the sample size to n = 300 and vary m = |F| over a grid of values for the following ensembles: (a) indicator, F_{m,Indicator} = {1{x ≥ q_j} : j = 1, ..., m}, where q_j is the j/(m+1)-th empirical quantile of (Y_i)_{i=1,...,n}; (b) characteristic or Fourier, F_{m,Fourier} = {sin(jx) : j = 1, ..., m/2} ∪ {cos(jx) : j = 1, ..., m/2}; (c) monomial, F_{m,Monom} = {x^j : j = 1, ..., m}; and (d) Box-Cox, F_{m,BoxCox} = {(x^{t_j} − 1)/t_j : j = 1, ..., m − 1} ∪ {log(x)}, with exponents t_j equally spaced in (0, 0.5].

For each ensemble, we form the ensemble conditional variance estimator and its weighted version as in Section 4.1 [see also [FB21]]. The results of 100 replications for each method and each m are displayed in Figure 1. We assess the estimation accuracy with err_{j,m} = ‖B̂B̂^T − BB^T‖/(2k)^{1/2}, j = 1, ..., 100, over the considered values of m.

Figure 1: Box plots of the estimation errors over 100 replications of model (29) with n = 300, over increasing values of m = |F|, for the four ensembles.

csMAVE, ECVE's main competitor, does not vary with m; its estimate of the central subspace has median error 0.2, with a wide range from 0.1 to 0.6. The estimation accuracy of the Fourier, indicator and Box-Cox ECVE varies over m and is on par with csMAVE, or better, for some values of m. For the Fourier basis, fewer basis functions give the best performance; the indicator and Box-Cox ensembles are quite robust against varying m, whereas the errors grow rapidly with m for the monomial ensemble. The weighted version of ECVE improves the accuracy for all ensembles; the weighted Fourier, indicator and Box-Cox ECVE are on par with or more accurate than csMAVE. In sum, the simulation results support choosing a small number m of basis functions. Based on this and further unreported simulations, we set the default value of m to

  m_n = ⌈log(n)⌉ if ⌈log(n)⌉ is even, and m_n = ⌈log(n)⌉ + 1 if ⌈log(n)⌉ is odd,   (30)

for all simulations in Sections 6.2 and 6.3 and for the data analysis in Section 7.
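A direct R transcription of the default rule (30), the smallest even integer at least ⌈log(n)⌉, is for example:

    # default ensemble size m_n of (30)
    default_m <- function(n) {
      m <- ceiling(log(n))
      if (m %% 2 == 0) m else m + 1
    }
    default_m(300)  # ceiling(log(300)) = 6, already even, so m_n = 6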
We explore the consistency rate of the conditional variance estimator (CVE), the ensemble conditional variance estimator (ECVE), csMAVE, and mMAVE in model (29). Specifically, we apply seven estimation methods, the first five targeting the central subspace S_{Y|X} and the last two S_{E(Y|X)}, as follows. For S_{Y|X}, we compare ECVE for the indicator (I), Fourier (II), monomial (III) and Box-Cox (IV) ensembles, as in Section 6.1, and csMAVE (V). For S_{E(Y|X)}, we use CVE (VI) of [FB21] and mMAVE (VII) of [XTLZ02].

The simulation is performed as follows. We generate 100 i.i.d. samples (Y_i, X_i^T)_{i=1,...,n} from (29) for each of six increasing sample sizes, the smallest being n = 100. Model (29) is a two-dimensional model with S_{E(Y|X)} = span(b_1) ⊊ S_{Y|X} = span(B). For methods (I)-(V), we set k = 2 and estimate B ∈ R^{10×2}. For (VI) and (VII), we set k = 1 and estimate b_1 ∈ R^{10}. Then we calculate err_{j,n} = ‖B̂B̂^T − BB^T‖/(2k)^{1/2}, j = 1, ..., 100, for each n. Figure 2 displays the distribution of err_{j,n} for increasing n for the seven methods.

Figure 2: Estimation error distribution for model (29) over increasing sample size n for the seven methods (I-VII).

As the sample size increases, ECVE indicator, ECVE Fourier and csMAVE are on par with respect to both speed and accuracy. The accuracy of ECVE Box-Cox improves as the sample size increases, but at a slower rate. There is no improvement in the accuracy of ECVE monomial. This is not surprising, as the monomial, as well as the Box-Cox ensemble, do not satisfy the assumption sup_{t∈Ω_T} |f_t(Y)| < M < ∞ of Theorem 11, in contrast to the indicator and Fourier ensembles. The Fourier and indicator ECVE and csMAVE estimate S_{Y|X} = span{B} consistently, and the mean subspace methods, CVE and mMAVE, estimate S_{E(Y|X)} = span{b_1} consistently.

We consider seven models (M1-M7), defined in Table 1, three sample sizes n ∈ {100, 200, 400}, and three different distributions of the predictor vector X = Σ^{1/2} Z ∈ R^p, where Σ = (Σ_{ij})_{i,j=1,...,p} with Σ_{ij} = 0.5^{|i−j|}. Throughout, p = 10, B comprises the first k columns of I_p, and ε ∼ N(0, 1) independent of X. As in [WX08], we consider three distributions for Z ∈ R^p: (I) N(0, I_p); (II) the p-dimensional uniform distribution on [−√3, √3]^p, i.e. all components of Z are independent and uniformly distributed; and (III) a mixture distribution N(0, I_p) + μ, where μ = (μ_1, ..., μ_p)^T ∈ R^p with μ_j = 2 and μ_l = 0 for l ≠ j, and j uniformly distributed on {1, ..., p}.

Table 1: Models
Name  Model                                                                       S_{E(Y|X)}      S_{Y|X}              k
M1    Y = (b_1^T X)^{−1} + 0.5 ε                                                  span{b_1}       span{b_1}            1
M2    Y = cos(2 b_1^T X) + cos(b_2^T X) + 0.5 ε                                   span{b_1,b_2}   span{b_1,b_2}        2
M3    Y = (b_1^T X)² + (0.5 (b_2^T X))² ε                                         span{b_1}       span{b_1,b_2}        2
M4    Y = b_1^T X / (0.5 + (1.5 + b_2^T X)²) + (|b_1^T X| + (b_2^T X)² + 0.5) ε   span{b_1,b_2}   span{b_1,b_2}        2
M5    Y = b_1^T X + sin(b_2^T X · b_3^T X) ε                                      span{b_1}       span{b_1,b_2,b_3}    3
M6    Y = 0.5 (b_1^T X)² ε                                                        span{0}         span{b_1}            1
M7    Y = cos(b_1^T X − π) + cos(2 b_1^T X) ε                                     span{b_1}       span{b_1}            1

The simple and weighted [see Section 4.1] Fourier and indicator ensembles are used to form four ensemble conditional variance estimators (ECVE). The monomial and Box-Cox ensembles were also used but did not give satisfactory results and are not reported. The four ECVE estimators are compared against the reference method csMAVE [WX08], which is implemented in the R package MAVE. The source code for conditional variance estimation and its ensemble version is available at https://git.art-ist.cc/daniel/CVE.

We set q = p − k and generate r = 100 replicates of models M1-M7 with the specified distribution of X and sample size n, and estimate B using the four ECVE methods and csMAVE. The accuracy of the estimates is assessed using

  err = ‖P_B − P_B̂‖/√(2k) ∈ [0, 1],

where P_B = B(B^T B)^{−1} B^T is the orthogonal projection onto span{B}. The factor √(2k) normalizes the distance, with values closer to zero indicating better agreement and values closer to one indicating strong disagreement.
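This error measure can be computed as in the following minimal R sketch (for orthonormal bases, as returned by the methods, the projections simplify to B B^T):

    # subspace estimation error: ||P_B - P_Bhat||_F / sqrt(2k), in [0, 1]
    subspace_err <- function(B, Bhat) {
      PB    <- B %*% solve(crossprod(B), t(B))          # projection onto span{B}
      PBhat <- Bhat %*% solve(crossprod(Bhat), t(Bhat)) # projection onto span{Bhat}
      norm(PB - PBhat, type = "F") / sqrt(2 * ncol(B))
    }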
The results are displayed in Tables 2-8. In M1, which is taken from [WX08], the mean subspace agrees with the central subspace, i.e. S_{E(Y|X)} = S_{Y|X}, but due to the unboundedness of the link function g(x) = 1/x, most mean subspace estimation methods, such as SIR, mean MAVE and CVE, fail. In contrast, all four ensemble CVE methods and csMAVE succeed in identifying the minimal dimension reduction subspace, with ensemble CVE performing slightly better, as can be seen in Table 2; in particular, Fourier is the best performing method. M2 is a two-dimensional mean subspace model, i.e. S_{E(Y|X)} = S_{Y|X}, and in Table 3 we see that csMAVE is the best performing method. M3 is the same as model (29), and here the mean subspace is a proper subset of the central subspace. In Table 4 we see that Indicator_weighted and csMAVE are the best performers and are roughly on par. In M4, the two-dimensional mean subspace, which also determines the heteroskedasticity, agrees with the central subspace. In Table 5 we see that this model is quite challenging for all methods; only Indicator_weighted and csMAVE give satisfactory results, with Indicator_weighted the clear winner. In M5, the heteroskedasticity is induced by an interaction term, and the three-dimensional central subspace is a proper superset of the one-dimensional mean subspace. In Table 6 we see that M5 is quite challenging for all five methods, and we therefore increase the sample size n. For M5, the two weighted ensemble conditional variance estimators are the best performing methods, followed by csMAVE. M6 is a one-dimensional pure central subspace model with trivial mean subspace, S_{E(Y|X)} = {0}. In Table 7 we see that for n = 100 the two weighted ECVEs are the best performing methods, while for larger sample sizes csMAVE is slightly more accurate than the ECVE methods.
In M7, the one-dimensional mean subspace agrees with the central subspace, i.e. S_{E(Y|X)} = S_{Y|X}, and the conditional first and second moments, E(Y^l | X) for l = 1, 2, are highly nonlinear and periodic functions of the sufficient reduction. In Table 8, we see that all ensemble conditional variance estimators clearly outperform csMAVE.

Tables 2-8: Mean and standard deviation (in parentheses) of the estimation errors of M1-M7, respectively, by predictor distribution (I-III) and sample size n, for Fourier, Fourier_weighted, Indicator, Indicator_weighted and csMAVE.

We apply the ensemble conditional variance estimator and csMAVE to the Boston Housing data set. This data set has been used extensively as a benchmark for assessing regression methods [see, for example, [JWHT13]] and is available in the R package mlbench. The data contain 506 instances of 14 variables from the 1970 Boston census, 13 of which are continuous. The binary variable chas, indexing proximity to the Charles river, is omitted from the analysis since ensemble conditional variance estimation operates under the assumption of continuous predictors. The target variable is the median value of owner-occupied homes, medv, in $1,000s. The 12 predictors are crim (per capita crime rate by town), zn (proportion of residential land zoned for lots over 25,000 sq.ft.), indus (proportion of non-retail business acres per town), nox (nitric oxides concentration, in parts per 10 million), rm (average number of rooms per dwelling), age (proportion of owner-occupied units built prior to 1940), dis (weighted distances to five Boston employment centres), rad (index of accessibility to radial highways), tax (full-value property-tax rate per $10,000), ptratio (pupil-teacher ratio by town), lstat (percentage of lower status of the population), and b, which stands for 1000(B − 0.63)², where B is the proportion of blacks by town.

We analyze these data with the weighted and unweighted Fourier and indicator ensembles, and with csMAVE. We compute unbiased error estimates by leave-one-out cross-validation.
We estimate the sufficient reduction with the five methods from the standardized training set, fit the forward model on the reduced training set using mars, multivariate adaptive regression splines [Fri91], from the R package mda, and predict the target variable on the test set; a minimal sketch of this cross-validation loop is given at the end of this section. We report results for dimension k = 1. The analysis was repeated with k = 2, with similar results. Table 9 reports the first quartile, median, mean and third quartile of the out-of-sample prediction errors. The reductions estimated by the ensemble CVE methods achieve lower mean and median prediction errors than csMAVE. Also, both ensemble CVE and csMAVE are approximately on par with the variable selection methods in [JWHT13, Section 8.3.3].

Table 9: Summary statistics (25% quantile, median, mean, 75% quantile) of the out-of-sample prediction errors for the Boston Housing data obtained by leave-one-out cross-validation, for Fourier, Fourier_weighted, Indicator, Indicator_weighted and csMAVE.

Moreover, we plot the standardized response medv against the reduced Fourier and csMAVE predictors, B̂^T X, in Figure 3. The sufficient reductions are estimated using the entire data set. A particular feature of these data is that the response medv appears to be truncated, as the highest median price of exactly $50,000 is reported in 16 cases. Both methods pick up similar patterns, which is reflected in the relatively high absolute correlation of the coefficients of the two reductions, |B̂_Fourier^T B̂_csMAVE|. The coefficients of the reductions, B̂_Fourier and B̂_csMAVE, are reported in Table 10. For the Fourier ensemble, the variables rm and lstat have the highest influence on the target variable medv. This agrees with the analysis in [JWHT13, Section 8.3.4], where these two variables were found to be by far the most important, using variable selection techniques such as random forests and boosted regression trees. In contrast, the reduction estimated by csMAVE has a lower coefficient for rm and higher ones for crim and rad.

Table 10: Rounded coefficients of the estimated reductions B̂_Fourier and B̂_csMAVE from the full Boston Housing data, for crim, zn, indus, nox, rm, age, dis, rad, tax, ptratio, b and lstat.

Figure 3: Panel A: Y vs. B̂_Fourier^T X. Panel B: Y vs. B̂_csMAVE^T X.
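The following is a minimal R sketch of the leave-one-out loop described above; fit_reduction is a hypothetical placeholder for any of the five reduction estimators, and the mars/predict interface of the mda package is assumed.

    library(mda)
    # leave-one-out squared prediction errors for a reduction-then-mars pipeline
    loo_errors <- function(X, y, fit_reduction, k = 1) {
      n <- nrow(X)
      sapply(seq_len(n), function(i) {
        Xtr <- scale(X[-i, , drop = FALSE])      # standardized training set
        B   <- fit_reduction(Xtr, y[-i], k)      # p x k basis (hypothetical helper)
        fit <- mars(Xtr %*% B, y[-i])            # forward model on reduced data
        xte <- scale(X[i, , drop = FALSE],       # standardize test point alike
                     center = attr(Xtr, "scaled:center"),
                     scale  = attr(Xtr, "scaled:scale"))
        (y[i] - predict(fit, xte %*% B))^2       # squared prediction error
      })
    }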
In this paper, we extend the mean subspace conditional variance estimation (CVE) of [FB21] to the ensemble conditional variance estimation (ECVE), which exhaustively estimates the central subspace, by applying the ensemble device introduced by [YL11]. In Section 5 we showed that the new estimator is consistent for the central subspace. The regularity conditions for consistency require that the joint distribution of the target variable and predictors, (Y, X^T)^T, be sufficiently smooth. They are comparable to those under which the main competitor, csMAVE [WX08], is consistent. We analysed the estimation accuracy of ECVE in Section 6 and found that it is either on par with csMAVE or exhibits substantial performance improvement in certain models. We could not characterize the defining features of the models for which ensemble conditional variance estimation outperforms csMAVE. This is an interesting line of further research, together with establishing more theoretical results, such as the rate of convergence, the estimation of the structural dimension, and the limiting distribution of the estimator.

ECVE identifies the central subspace via the orthogonal complement and thus circumvents the estimation and inversion of the variance matrix of the predictors X. This renders the method formally applicable to settings where the sample size n is small or smaller than the number of predictors p, and leads to potential future research.

Throughout, the dimension of the central subspace, k = dim(S_{Y|X}), is assumed to be known. The derivation of asymptotic tests for dimension is technically very challenging due to the lack of a closed-form solution and the lack of independence of all quantities in the calculation. The dimension can be estimated via cross-validation, as in [WX08] and [FB21], or via information criteria.

Acknowledgements

The authors gratefully acknowledge the support of the Austrian Science Fund (FWF P 30690-N35) and thank Daniel Kapla for his programming assistance. Daniel Kapla also co-authored the CVE R package that implements the proposed method.

References

[AC09] Kofi P. Adragni and R. Dennis Cook. Sufficient dimension reduction and prediction in regression. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 367(1906):4385-4405, 2009.
[Ame85] Takeshi Amemiya. Advanced Econometrics. Harvard University Press, 1985.
[Boo02] W. M. Boothby. An Introduction to Differentiable Manifolds and Riemannian Geometry. Academic Press, 2002.
[CD88] William S. Cleveland and Susan J. Devlin. Locally weighted regression: An approach to regression analysis by local fitting. Journal of the American Statistical Association, 83(403):596-610, 1988.
[CL02] R. Dennis Cook and Bing Li. Dimension reduction for conditional mean in regression. Annals of Statistics, 30(2):455-474, 2002.
[Coo98] R. Dennis Cook. Regression Graphics: Ideas for Studying Regressions Through Graphics. Wiley, New York, 1998.
[Coo07] R. Dennis Cook. Fisher lecture: Dimension reduction in regression. Statistical Science, 22(1):1-26, 2007.
[Fad85] Arnold M. Faden. The existence of regular conditional probabilities: Necessary and sufficient conditions. The Annals of Probability, 13(1):288-298, 1985.
[FB21] Lukas Fertl and Efstathia Bura. Conditional variance estimator for sufficient dimension reduction, 2021.
[Fri91] Jerome H. Friedman. Multivariate adaptive regression splines. The Annals of Statistics, 19(1):1-67, 1991.
[GH94] Phillip Griffiths and Joseph Harris. Principles of Algebraic Geometry. Wiley Classics Library. John Wiley & Sons, Inc., New York, 1994. Reprint of the 1978 original.
[Han08] Bruce E. Hansen. Uniform convergence rates for kernel estimation with dependent data. Econometric Theory, 24:726-748, 2008.
[Heu95] H. Heuser. Analysis 2, 9. Auflage. Teubner, 1995.
[Jen69] Robert I. Jennrich. Asymptotic properties of non-linear least squares estimators. Annals of Mathematical Statistics, 40(2):633-643, 1969.
[JWHT13] Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani. An Introduction to Statistical Learning: with Applications in R. Springer, 2013.
[Kar93] Alan F. Karr. Probability. Springer Texts in Statistics. Springer-Verlag, New York, 1993.
[Li91] K. C. Li. Sliced inverse regression for dimension reduction. Journal of the American Statistical Association, 86(414):316-327, 1991.
[Li18] Bing Li. Sufficient Dimension Reduction: Methods and Applications with R. CRC Press, Taylor & Francis Group, 2018.
[LJFR04] D. Leao Jr., M. Fragoso, and P. Ruffino. Regular conditional probability, disintegration of probability and Radon spaces. Proyecciones (Antofagasta), 23:15-29, 2004.
[MMW+63] M. R. Mickey, P. B. Mundle, D. N. Walker, A. M. Glinski, Inc C-E-I-R, and Aerospace Research Laboratories (U.S.). Test Criteria for Pearson Type III Distributions. Aerospace Research Laboratories, Office of Aerospace Research, United States Air Force, 1963.
[MZ13] Yanyuan Ma and Liping Zhu. A review on dimension reduction. International Statistical Review, 81(1):134-150, 2013.
[S.N27] S. N. Bernstein. Theory of Probability. 1927.
[Tag11] Hemant D. Tagare. Notes on optimization on Stiefel manifolds, January 2011.
[WX08] Hansheng Wang and Yingcun Xia. Sliced regression for dimension reduction. Journal of the American Statistical Association, 103(482):811-821, 2008.
[WY19] Hang Weiqiang and Xia Yingcun. MAVE: Methods for Dimension Reduction, 2019. R package version 1.3.10.
[XTLZ02] Yingcun Xia, Howell Tong, W. K. Li, and Li-Xing Zhu. An adaptive estimation of dimension reduction space. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 64(3):363-410, 2002.
[YL11] Xiangrong Yin and Bing Li. Sufficient dimension reduction based on an ensemble of minimum average variance estimators. Annals of Statistics, 39(6):3392-3416, 2011.
[YLC08] Xiangrong Yin, Bing Li, and R. Dennis Cook. Successive direction extraction for estimating the central subspace in a multiple-index regression. Journal of Multivariate Analysis, 99:1733-1757, 2008.
[ZZ10] Peng Zeng and Yu Zhu. An integral transform method for estimating the central mean and central subspaces. Journal of Multivariate Analysis, 101(1):271-290, 2010.

Appendix

For any V ∈ S(p, q), defined in (3), we generically denote a basis of the orthogonal complement of its column space span{V} by U. That is, U ∈ S(p, p−q) with span{V} ⊥ span{U} and span{V} ⊕ span{U} = R^p, so that U^T V = 0 ∈ R^{(p−q)×q} and U^T U = I_{p−q}. For any x, s ∈ R^p we can always write

  x = s + P_V(x − s) + P_U(x − s) = s + V r_1 + U r_2,   (31)

where r_1 = V^T(x − s) ∈ R^q and r_2 = U^T(x − s) ∈ R^{p−q}.

Proof of Theorem 4. The density of X | X ∈ s + span{V} is given by

  f_{X | X ∈ s+span{V}}(r_1) = f_X(s + V r_1) / ∫_{R^q} f_X(s + V r_1) dr_1,   (32)

where X is the p-dimensional continuous random predictor vector with density f_X(x), s ∈ supp(f_X) ⊂ R^p, and V belongs to the Stiefel manifold S(p, q) defined in (3).
Equation (32) follows from Theorem 3.1 of [LJFR04] and the fact that (R^p, B(R^p)), where B(R^p) denotes the Borel sets of R^p, is a Polish space, which in turn guarantees the existence of the regular conditional probability of X | X ∈ s + span{V} [see also [Fad85]]. Further, the measure is concentrated on the affine subspace s + span{V} ⊂ R^p with density (32), which follows from Definition 8.38 and Theorem 8.39 of [Kar93], the orthogonal decomposition (31), and the continuity of f_X (E.2).

By assumption (E.1), Y = g_cs(B^T X, ε) with ε ⊥⊥ X. Let f ∈ F be such that assumption (E.4) holds, and let B̃ be a basis of S_{E(f_t(Y)|X)}; that is, span{B̃} = S_{E(f_t(Y)|X)} ⊆ S_{Y|X} = span{B}. By Theorem 3, f(Y) = g(B̃^T X) + ε̃ with E(ε̃ | X) = 0 and g twice continuously differentiable. Therefore,

  L̃_F(V, s, f) = Var(f(Y) | X ∈ s + span{V})
   = Var(g(B̃^T X) | X ∈ s + span{V}) + 2 cov(ε̃, g(B̃^T X) | X ∈ s + span{V}) + Var(ε̃ | X ∈ s + span{V})
   = Var(g(B̃^T X) | X ∈ s + span{V}) + Var(ε̃ | X ∈ s + span{V}).   (33)

The covariance term in (33) vanishes since

  cov(ε̃, g(B̃^T X) | X ∈ s + span{V}) = E(E(ε̃ | X) g(B̃^T X) | X ∈ s + span{V}) − E(g(B̃^T X) | X ∈ s + span{V}) E(E(ε̃ | X) | X ∈ s + span{V}) = 0,

using E(ε̃ | X) = 0 and the fact that the sigma field generated by {X ∈ s + span{V}} is a subset of that generated by X. By the same argument and using (32),

  Var(ε̃ | X ∈ s + span{V}) = E(ε̃² | X ∈ s + span{V}) = E(E(ε̃² | X) | X ∈ s + span{V}) = E(h(X) | X ∈ s + span{V})
   = ∫_{supp(f_X)∩R^q} h(s + V r) f_X(s + V r) dr / t^{(0)}(V, s, f),

where E(ε̃² | X = x) = h(x). Using (32) again for the first term in (33) obtains formulas (10) and (13).

To see that (7), (10), and (13) are well defined and continuous, let g̃(V, s, r) = g(B̃^T s + B̃^T V r)^l f_X(s + V r) for l = 1, 2, or g̃(V, s, r) = h(s + V r) f_X(s + V r) for (13); these are continuous by assumption. In consequence, the parameter-dependent integrals (12) and (13) are well defined and continuous if (1) g̃(V, s, ·) is integrable for all V ∈ S(p, q) and s ∈ supp(f_X), (2) g̃(·, ·, r) is continuous for all r, and (3) there exists an integrable dominating function of g̃ that does not depend on V and s [see [Heu95, p. 101]].

Furthermore, for some compact set K, t^{(l)}(V, s) = ∫_K g̃(V, s, r) dr, since supp(f_X) is compact by (E.2). The function g̃(V, s, r) is continuous in all arguments by the continuity of g (E.4) and of f_X (E.2), and therefore attains a maximum. In consequence, all three conditions are satisfied, so that t^{(l)}(V, s) is well defined and continuous. By the same argument, (13) is well defined and continuous. Next, μ_l(V, s) = t^{(l)}(V, s)/t^{(0)}(V, s) is continuous since t^{(0)}(V, s) > 0 for all interior points s ∈ supp(f_X), by the continuity of f_X, the convexity of the support, and Σ_x > 0.
Then L̃_F(V, s, f) in (10) is continuous, which results in (8) also being well defined and continuous, by virtue of it being a parameter-dependent integral, following the same arguments as above. Moreover, (14) exists as the minimizer of a continuous function over the compact set S(p, q). Then, (8) can be written as

  L*_F(V, f) = E_{s∼X}(μ_2(V, s, f) − μ_1(V, s, f)²) + E_{s∼X}(Var(ε̃ | X ∈ s + span{V})),   (34)

where s ∼ X signifies that s is distributed as X and the expectation is taken with respect to the distribution of s.

It now suffices to show that the second term on the right-hand side of (34) is constant with respect to V. By the law of total variance,

  Var(ε̃) = E(Var(ε̃ | X ∈ s + span{V})) + Var(E(ε̃ | X ∈ s + span{V})) = E(Var(ε̃ | X ∈ s + span{V})),   (35)

since E(ε̃ | X ∈ s + span{V}) = E(E(ε̃ | X) | X ∈ s + span{V}) = 0. Inserting (35) into (34) obtains

  L*_F(V, f_t) = E(μ_2(V, X, f_t) − μ_1(V, X, f_t)²) + Var(ε̃) = E_{s∼X}(Var(g(B̃^T X) | X ∈ s + span{V})) + Var(ε̃).   (36)

Next we show that (8), or equivalently (36), attains its minimum at V ⊥ B̃. Let s ∈ supp(f_X) and V = (v_1, ..., v_q) ∈ R^{p×q} be such that v_u ∈ span{B} for some u ∈ {1, ..., q}. Since X ∈ s + span{V} ⟺ X = s + P_V(X − s), the first term in (36) satisfies

  Var(g(B̃^T X) | X ∈ s + span{V}) = Var(g(B̃^T X) | X = s + VV^T(X − s)) = Var(g(B̃^T s + B̃^T VV^T(X − s)) | X = s + VV^T(X − s)) ≥ 0.   (37)

If (37) is positive, i.e. if B̃^T VV^T(X − s) ≠ 0 with positive probability, the lower bound is not attained. If it is zero, i.e. for V such that V and B̃ are orthogonal, then L*_F(V, f) = Var(ε̃). Since s is arbitrary yet constant, the same inequality holds for (8); that is, (8) attains its minimum for V such that V and B̃ are orthogonal. Since span{B̃} = S_{E(f_t(Y)|X)}, (14) follows.

Proof of Theorem 6. Under assumptions (E.1), (E.2), and (E.3), (7) is well defined and continuous by arguments analogous to those in the proof of Theorem 4. Therefore (17) exists as a minimizer of a continuous function over the compact set S(p, q).

To show S_{Y|X} = span{V_q}^⊥, let S̃ ≠ S_{Y|X} with dim(S̃) = dim(S_{Y|X}) = k. Also, let Z ∈ R^{p×(p−k)} be an orthonormal basis of S̃^⊥. Suppose L_F(Z) = min_{V ∈ S(p,p−k)} L_F(V). By (14) and (15) in Theorem 4, L*_F(V, f_t), considered as a function on R^{p×(p−k_t)}, is minimized by an orthonormal basis of S_{E(f_t(Y)|X)}^⊥ with p − k_t elements, where k_t = dim(S_{E(f_t(Y)|X)}) ≤ k. By (E.1), S_{E(f_t(Y)|X)} ⊆ S_{Y|X} = span{B}. As in the proof of Theorem 4, we obtain that L*_F(V, f_t), as a function on R^{p×(p−k)}, is minimized by an orthonormal basis U ∈ R^{p×(p−k)} of span{B}^⊥. Since S̃ = span{Z}^⊥ ≠ span{U}^⊥ = S_{Y|X}, we can rearrange the bases U = (U_1, U_2) and Z = (Z_1, Z_2) such that span{U_1} = span{Z_1} and span{U_2} ≠ span{Z_2}.
Since F characterizes S_{Y|X}, the set A = {t ∈ Ω_T : S_{E(f_t(Y)|X)} ⊄ S̃} is non-empty, and by (E.3), A is not a null set with respect to the probability measure F_T. For t ∈ A, Z is not orthogonal to S_{E(f_t(Y)|X)}, so L*_F(Z, f_t) > L*_F(U, f_t), while L*_F(Z, f_t) = L*_F(U, f_t) on A^c. Thus,

  min_{V ∈ S(p,p−k)} L_F(V) = L_F(Z) = E_{t∼F_T}(L*_F(Z, f_t)) = ∫_A L*_F(Z, f_t) dF_T(t) + ∫_{A^c} L*_F(Z, f_t) dF_T(t) > E_{t∼F_T}(L*_F(U, f_t)),

which contradicts our assumption that L_F(Z) = min_{V ∈ S(p,p−k)} L_F(V).

Next we introduce notation and auxiliary lemmas for the proof of Theorem 8. We suppose all assumptions of Theorem 8 hold. We generically use the letter C to denote constants.

Suppose f is an arbitrary element of F and let

  Ỹ_i = f(Y_i) = g(B̃^T X_i) + ε̃_i,   (38)

with span{B̃} = S_{E(Ỹ|X)} = S_{E(f(Y)|X)}. Condition (E.4) yields that g is twice continuously differentiable and E(|Ỹ|^8) < ∞. Since f is fixed, we suppress it in t^{(l)}(V, s, f) and h̃(V, s, f), and write

  t^{(l)}_n(V, s, f) = t^{(l)}_n(V, s) = (1/(n h_n^{(p−q)/2})) Σ_{i=1}^n K(d_i(V, s)/h_n) Ỹ_i^l,   (39)

which is the sample version of (12) for l = 0, 1, 2. Equation (21) can be expressed as

  ȳ_l(V, s) = t^{(l)}_n(V, s) / t^{(0)}_n(V, s).   (40)

Lemma 12. Assume (E.2) and (K.1) hold. For a continuous function g_1, let Z_n(V, s) = (Σ_i g_1(X_i)^l K(d_i(V, s)/h_n)) / (n h_n^{(p−q)/2}). Then,

  E(Z_n(V, s)) = ∫_{supp(f_X)∩R^{p−q}} K(‖r_2‖²) ∫_{supp(f_X)∩R^q} g̃(r_1, h_n^{1/2} r_2) dr_1 dr_2,

where g̃(r_1, r_2) = g_1(s + V r_1 + U r_2)^l f_X(s + V r_1 + U r_2) and x = s + V r_1 + U r_2 as in (31).

Proof of Lemma 12. By (31), ‖P_U(x − s)‖ = ‖U r_2‖ = ‖r_2‖. Further,

  E(Z_n(V, s)) = (1/h_n^{(p−q)/2}) ∫_{supp(f_X)} g_1(x)^l K(‖P_U(x − s)‖²/h_n) f_X(x) dx
   = (1/h_n^{(p−q)/2}) ∫_{supp(f_X)∩R^{p−q}} ∫_{supp(f_X)∩R^q} g_1(s + V r_1 + U r_2)^l K(‖r_2‖²/h_n) f_X(s + V r_1 + U r_2) dr_1 dr_2
   = ∫_{supp(f_X)∩R^{p−q}} K(‖r̃_2‖²) ∫_{supp(f_X)∩R^q} g_1(s + V r_1 + h_n^{1/2} U r̃_2)^l f_X(s + V r_1 + h_n^{1/2} U r̃_2) dr_1 dr̃_2,

where the substitution r̃_2 = r_2/h_n^{1/2}, dr_2 = h_n^{(p−q)/2} dr̃_2, was used to obtain the last equality.

Lemma 13. Assume (E.1), (E.2), (E.3), (E.4), (H.1) and (K.1) hold. Then there exists a constant C_1 > 0 such that

  Var(n h_n^{(p−q)/2} t^{(l)}_n(V, s, f)) ≤ n h_n^{(p−q)/2} C_1

for n > n*, with t^{(l)}_n(V, s), l = 0, 1, 2, in (39).

Proof of Lemma 13. Since a continuous function attains a finite maximum over a compact set, sup_{x ∈ supp(f_X)} |g(B̃^T x)| < ∞. Therefore, |Ỹ_i| ≤ |g(B̃^T X_i)| + |ε̃_i| ≤ sup_{x ∈ supp(f_X)} |g(B̃^T x)| + |ε̃_i| = C_2 + |ε̃_i| and |Ỹ_i|^l ≤ Σ_{u=0}^l (l choose u) C_2^u |ε̃_i|^{l−u}.
Lemma 13. Assume (E.1), (E.2), (E.3), (E.4), (H.1) and (K.1) hold. Then there exists a constant $\tilde C>0$ such that
\[
\operatorname{Var}\left(n h_n^{(p-q)/2}\,t^{(l)}_n(\mathbf{V},\mathbf{s},f)\right) \le n h_n^{(p-q)/2}\,\tilde C
\]
for $n>n^\star$ and $t^{(l)}_n(\mathbf{V},\mathbf{s})$, $l=0,1,2$, in (39).

Proof of Lemma 13. Since a continuous function attains a finite maximum over a compact set, $\sup_{\mathbf{x}\in\operatorname{supp}(f_{\mathbf{X}})}|g(\widetilde{\mathbf{B}}^T\mathbf{x})|<\infty$. Therefore,
\[
|\widetilde Y_i| \le |g(\widetilde{\mathbf{B}}^T\mathbf{X}_i)| + |\tilde\epsilon_i| \le \sup_{\mathbf{x}\in\operatorname{supp}(f_{\mathbf{X}})}|g(\widetilde{\mathbf{B}}^T\mathbf{x})| + |\tilde\epsilon_i| = C_1 + |\tilde\epsilon_i|
\]
and $|\widetilde Y_i|^l \le \sum_{u=0}^{l}\binom{l}{u}C_1^u|\tilde\epsilon_i|^{l-u}$. Since the $(\widetilde Y_i,\mathbf{X}_i)$ are i.i.d.,
\begin{align}
\operatorname{Var}\left(n h_n^{(p-q)/2}t^{(l)}_n(\mathbf{V},\mathbf{s},f)\right) &= n\operatorname{Var}\left(\widetilde Y_1^l K\!\left(\frac{d_1(\mathbf{V},\mathbf{s})}{h_n}\right)\right) \le n\,\mathbb{E}\left(|\widetilde Y_1|^{2l}K^2\!\left(\frac{d_1(\mathbf{V},\mathbf{s})}{h_n}\right)\right) \notag\\
&\le n\sum_{u=0}^{2l}\binom{2l}{u}C_1^u\,\mathbb{E}\left(|\tilde\epsilon_1|^{2l-u}K^2\!\left(\frac{d_1(\mathbf{V},\mathbf{s})}{h_n}\right)\right) = n\sum_{u=0}^{2l}\binom{2l}{u}C_1^u\,\mathbb{E}\left(\mathbb{E}\left(|\tilde\epsilon_1|^{2l-u}\mid\mathbf{X}_1\right)K^2\!\left(\frac{d_1(\mathbf{V},\mathbf{s})}{h_n}\right)\right) \tag{41}
\end{align}
for $l=0,1,2$. Let $\mathbb{E}(|\tilde\epsilon_i|^{2l-u}\mid\mathbf{X}_i) = g_{2l-u}(\mathbf{X}_i)$ for a continuous (by assumption) function $g_{2l-u}(\cdot)$ with finite moments for $l=0,1,2$ by the compactness of $\operatorname{supp}(f_{\mathbf{X}})$. Using Lemma 12 with
\[
Z_n(\mathbf{V},\mathbf{s}) = \frac{1}{n h_n^{(p-q)/2}}\sum_i g_{2l-u}(\mathbf{X}_i)\,K^2(d_i(\mathbf{V},\mathbf{s})/h_n),
\]
where $K^2(\cdot)$ fulfills (K.1), we calculate
\begin{align}
\mathbb{E}\left(\mathbb{E}(|\tilde\epsilon_1|^{2l-u}\mid\mathbf{X}_1)K^2\!\left(\frac{d_1(\mathbf{V},\mathbf{s})}{h_n}\right)\right) &= h_n^{(p-q)/2}\,\mathbb{E}(Z_n(\mathbf{V},\mathbf{s})) \notag\\
&= h_n^{(p-q)/2}\int_{\operatorname{supp}(f_{\mathbf{X}})\cap\mathbb{R}^{p-q}}K^2(\|\mathbf{r}_2\|^2)\int_{\operatorname{supp}(f_{\mathbf{X}})\cap\mathbb{R}^{q}} g_{2l-u}(\mathbf{s}+\mathbf{V}\mathbf{r}_1+h_n^{1/2}\mathbf{U}\mathbf{r}_2)\,f_{\mathbf{X}}(\mathbf{s}+\mathbf{V}\mathbf{r}_1+h_n^{1/2}\mathbf{U}\mathbf{r}_2)\,d\mathbf{r}_1\,d\mathbf{r}_2 \tag{42}\\
&\le h_n^{(p-q)/2}\,C_2 \notag
\end{align}
since all integrands in (42) are continuous and integrated over compact sets by (E.2) and the continuity of $g_{2l-u}(\cdot)$ and $K(\cdot)$, so that the integral can be bounded above by a finite constant $C_2$. Inserting (42) into (41) yields
\[
\operatorname{Var}\left(n h_n^{(p-q)/2}t^{(l)}_n(\mathbf{V},\mathbf{s},f)\right) \le n h_n^{(p-q)/2}\underbrace{\sum_{u=0}^{2l}\binom{2l}{u}C_1^u C_2}_{=\tilde C} = n h_n^{(p-q)/2}\,\tilde C. \tag{43}
\]
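The scaling in Lemma 13 can be checked by simulation: replications of $t^{(l)}_n$ should have variance of order $1/(n h_n^{(p-q)/2})$, so $\operatorname{Var}(t^{(l)}_n)\cdot n h_n^{(p-q)/2}$ should remain bounded as $n$ grows and $h_n$ shrinks. A rough sketch under the same illustrative data-generating assumptions as in the previous snippet:

```python
# Monte Carlo check of Lemma 13: Var(t_n^{(l)}) * n * h^{(p-q)/2} should stay
# roughly bounded across (n, h). The data, kernel, and distance are the same
# hypothetical choices used in the earlier sketch, not the paper's exact setup.
import numpy as np

rng = np.random.default_rng(2)
p, q, l = 3, 2, 1
V, _ = np.linalg.qr(rng.normal(size=(p, q)))
s = np.zeros(p)

def t_n_once(n, h):
    X = rng.uniform(-1, 1, size=(n, p))
    Ytil = X[:, 0]**2 + 0.1 * rng.normal(size=n)
    Xc = X - s
    d = (Xc**2).sum(axis=1) - ((Xc @ V)**2).sum(axis=1)
    K = np.exp(-0.5 * (d / h)**2)
    return (K * Ytil**l).sum() / (n * h**((p - q) / 2))

for n, h in [(500, 0.4), (2000, 0.4), (2000, 0.2)]:
    reps = np.array([t_n_once(n, h) for _ in range(400)])
    print(n, h, round(reps.var() * n * h**((p - q) / 2), 4))  # roughly stable
```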
In Lemma 14 we show that $d_i(\mathbf{V},\mathbf{s})$ in (19) is Lipschitz in its inputs under assumption (E.2).

Lemma 14. Under assumption (E.2) there exists a constant $0<C_3<\infty$ such that for all $\delta>0$ and $\mathbf{V},\mathbf{V}_j\in\mathcal{S}(p,q)$ with $\|P_{\mathbf{V}}-P_{\mathbf{V}_j}\|<\delta$ and for all $\mathbf{s},\mathbf{s}_j\in\operatorname{supp}(f_{\mathbf{X}})\subset\mathbb{R}^p$ with $\|\mathbf{s}-\mathbf{s}_j\|<\delta$,
\[
|d_i(\mathbf{V},\mathbf{s}) - d_i(\mathbf{V}_j,\mathbf{s}_j)| \le C_3\delta
\]
for $d_i(\mathbf{V},\mathbf{s})$ given by (19).

Proof of Lemma 14.
\[
|d_i(\mathbf{V},\mathbf{s}) - d_i(\mathbf{V}_j,\mathbf{s}_j)| \le \left|\|\mathbf{X}_i-\mathbf{s}\|^2-\|\mathbf{X}_i-\mathbf{s}_j\|^2\right| + \left|\langle\mathbf{X}_i-\mathbf{s},P_{\mathbf{V}}(\mathbf{X}_i-\mathbf{s})\rangle - \langle\mathbf{X}_i-\mathbf{s}_j,P_{\mathbf{V}_j}(\mathbf{X}_i-\mathbf{s}_j)\rangle\right| = I_1 + I_2 \tag{44}
\]
where $\langle\cdot,\cdot\rangle$ is the scalar product in $\mathbb{R}^p$. We bound the first term on the right hand side of (44) using $\|\mathbf{X}_i\|\le\sup_{\mathbf{z}\in\operatorname{supp}(f_{\mathbf{X}})}\|\mathbf{z}\| = C_4<\infty$ with probability 1 by (E.2):
\[
I_1 = \left|\|\mathbf{X}_i-\mathbf{s}\|^2-\|\mathbf{X}_i-\mathbf{s}_j\|^2\right| \le 2|\langle\mathbf{X}_i,\mathbf{s}-\mathbf{s}_j\rangle| + \left|\|\mathbf{s}\|^2-\|\mathbf{s}_j\|^2\right| \le 2\|\mathbf{X}_i\|\|\mathbf{s}-\mathbf{s}_j\| + 2C_4\|\mathbf{s}-\mathbf{s}_j\| \le 2C_4\delta + 2C_4\delta = 4C_4\delta
\]
by Cauchy-Schwarz and the reverse triangle inequality, for which $\left|\|\mathbf{s}\|^2-\|\mathbf{s}_j\|^2\right| = \left|\|\mathbf{s}\|-\|\mathbf{s}_j\|\right|\left(\|\mathbf{s}\|+\|\mathbf{s}_j\|\right) \le \|\mathbf{s}-\mathbf{s}_j\|\,2C_4$. The second term in (44) satisfies
\begin{align*}
I_2 &\le \left|\langle\mathbf{X}_i,(P_{\mathbf{V}}-P_{\mathbf{V}_j})\mathbf{X}_i\rangle\right| + 2\left|\langle\mathbf{X}_i,P_{\mathbf{V}}\mathbf{s}-P_{\mathbf{V}_j}\mathbf{s}_j\rangle\right| + \left|\langle\mathbf{s},P_{\mathbf{V}}\mathbf{s}\rangle-\langle\mathbf{s}_j,P_{\mathbf{V}_j}\mathbf{s}_j\rangle\right|\\
&\le \|\mathbf{X}_i\|^2\|P_{\mathbf{V}}-P_{\mathbf{V}_j}\| + 2\|\mathbf{X}_i\|\left\|P_{\mathbf{V}}(\mathbf{s}-\mathbf{s}_j)+(P_{\mathbf{V}}-P_{\mathbf{V}_j})\mathbf{s}_j\right\| + |\langle\mathbf{s}-\mathbf{s}_j,P_{\mathbf{V}}\mathbf{s}\rangle| + \left|\langle\mathbf{s}_j,P_{\mathbf{V}}\mathbf{s}-P_{\mathbf{V}_j}\mathbf{s}_j\rangle\right|\\
&\le C_4^2\delta + 2C_4(\delta+C_4\delta) + C_4\delta + C_4(\delta+C_4\delta) = 4C_4^2\delta + 4C_4\delta.
\end{align*}
Collecting all constants into $C_3$ (i.e. $C_3 = 8C_4 + 4C_4^2$) yields the result.

To prove Theorem 8 and Lemma 15 below, we use the Bernstein inequality [S.N27]. Let $\{Z_i, i=1,2,\ldots\}$ be an independent sequence of bounded random variables with $|Z_i|\le b$. Let $S_n=\sum_{i=1}^n Z_i$, $E_n=\mathbb{E}(S_n)$ and $V_n=\operatorname{Var}(S_n)$. Then,
\[
P(|S_n - E_n| > t) < 2\exp\left(-\frac{t^2/2}{V_n + bt/3}\right). \tag{45}
\]
Assumption (K.2) yields
\[
|K(u) - K(u')| \le K^*(u')\,\delta \tag{46}
\]
for all $u,u'$ with $|u-u'|<\delta\le L$, where $K^*(\cdot)$ is a bounded and integrable kernel function [see [Han08]]. Specifically, if condition (1) of (K.2) holds, then $K^*(u) = L\,\mathbf{1}_{\{|u|\le 2L\}}$; if condition (2) holds, then $K^*(u) = L\,\mathbf{1}_{\{|u|\le 2L\}} + \mathbf{1}_{\{|u|>2L\}}L|u-L|^{-\nu}$.
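Bernstein's inequality (45) is easy to sanity-check numerically. The sketch below compares the empirical tail of a sum of bounded, centered i.i.d. variables with the bound; the uniform distribution is an arbitrary assumption made only for this demonstration.

```python
# Numerical illustration of the Bernstein inequality (45): for bounded, centered
# i.i.d. summands, the empirical tail P(|S_n - E S_n| > t) lies below
# 2 exp(-t^2 / (2 V_n + (2/3) b t)). Distributional choices are assumptions.
import numpy as np

rng = np.random.default_rng(3)
n, reps, b = 200, 20_000, 1.0
Z = rng.uniform(-b, b, size=(reps, n))      # |Z_i| <= b, E Z_i = 0
S = Z.sum(axis=1)
V_n = n * (2 * b)**2 / 12                   # Var(S_n) for Uniform(-b, b) summands

for t in [10.0, 20.0, 30.0]:
    emp = (np.abs(S) > t).mean()
    bound = 2 * np.exp(-t**2 / (2 * V_n + 2 * b * t / 3))
    print(t, round(emp, 5), round(bound, 5))   # empirical tail <= bound
```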
Let $A = \mathcal{S}(p,q)\times\operatorname{supp}(f_{\mathbf{X}})$. In Lemmas 15 and 16 we show that (39) converges uniformly in probability to (12) by showing that the variance and bias terms vanish uniformly in probability, respectively.

Lemma 15. Under the assumptions of Theorem 8,
\[
\sup_{\mathbf{V}\times\mathbf{s}\in A}\left|t^{(l)}_n(\mathbf{V},\mathbf{s}) - \mathbb{E}\left(t^{(l)}_n(\mathbf{V},\mathbf{s})\right)\right| = O_P(a_n), \quad l=0,1,2. \tag{47}
\]

Proof of Lemma 15. The proof proceeds in three steps: (i) truncation, (ii) discretization by covering $A=\mathcal{S}(p,q)\times\operatorname{supp}(f_{\mathbf{X}})$, and (iii) application of Bernstein's inequality (45). If the function $f$ in (38) is bounded, the truncation step and the assumption $a_n/h_n^{(p-q)/2}=O(1)$ are not needed.

(i) We let $\tau_n = a_n^{-1}$ and truncate $\widetilde Y_i^l$ at $\tau_n$ as follows. We let
\[
t^{(l)}_{n,\mathrm{trc}}(\mathbf{V},\mathbf{s}) = \frac{1}{n h_n^{(p-q)/2}}\sum_i K\!\left(\|P_{\mathbf{U}}(\mathbf{X}_i-\mathbf{s})\|^2/h_n\right)\widetilde Y_i^l\,\mathbf{1}_{\{|\widetilde Y_i|^l\le\tau_n\}} \tag{48}
\]
be the truncated version of (39) and $\widetilde R^{(l)}_n = (1/(n h_n^{(p-q)/2}))\sum_i|\widetilde Y_i|^l\mathbf{1}_{\{|\widetilde Y_i|^l>\tau_n\}}$ the remainder of (39). Therefore,
\[
R^{(l)}_n(\mathbf{V},\mathbf{s}) = t^{(l)}_n(\mathbf{V},\mathbf{s}) - t^{(l)}_{n,\mathrm{trc}}(\mathbf{V},\mathbf{s}) \le M_1\widetilde R^{(l)}_n
\]
due to (K.1), and
\[
\sup_{\mathbf{V}\times\mathbf{s}\in A}\left|t^{(l)}_n(\mathbf{V},\mathbf{s}) - \mathbb{E}\left(t^{(l)}_n(\mathbf{V},\mathbf{s})\right)\right| \le M_1\left(\widetilde R^{(l)}_n + \mathbb{E}\widetilde R^{(l)}_n\right) + \sup_{\mathbf{V}\times\mathbf{s}\in A}\left|t^{(l)}_{n,\mathrm{trc}}(\mathbf{V},\mathbf{s}) - \mathbb{E}\left(t^{(l)}_{n,\mathrm{trc}}(\mathbf{V},\mathbf{s})\right)\right|. \tag{49}
\]
By Cauchy-Schwarz and the Markov inequality, $P(|Z|>t) = P(Z^2>t^2)\le\mathbb{E}(Z^2)/t^2$, we obtain
\[
\mathbb{E}\widetilde R^{(l)}_n = \frac{1}{h_n^{(p-q)/2}}\mathbb{E}\left(|\widetilde Y_i|^l\mathbf{1}_{\{|\widetilde Y_i|^l>\tau_n\}}\right) \le \frac{1}{h_n^{(p-q)/2}}\sqrt{\mathbb{E}(|\widetilde Y_i|^{2l})}\sqrt{P(|\widetilde Y_i|^l>\tau_n)} \le \frac{1}{h_n^{(p-q)/2}}\sqrt{\mathbb{E}(|\widetilde Y_i|^{2l})}\left(\mathbb{E}(|\widetilde Y_i|^{2l})\,a_n^2\right)^{1/2} = o(a_n), \tag{50}
\]
where the last equality uses the assumption $a_n/h_n^{(p-q)/2}=O(1)$ and the expectations are finite due to (E.4) for $l=0,1,2$. No truncation is needed for $l=0$ or if $\widetilde Y_i = f(Y_i)\le\sup_{f\in\mathcal{F}}|f(Y_i)| < C<\infty$. Therefore, the first two terms on the right hand side of (49) converge to 0 at rate $a_n$ by (50) and Markov's inequality. From this point on, $\widetilde Y_i$ denotes the truncated version $\widetilde Y_i\mathbf{1}_{\{|\widetilde Y_i|\le\tau_n\}}$, and we do not distinguish the truncated from the untruncated $t^{(l)}_n(\mathbf{V},\mathbf{s})$, since the truncation results in an error of magnitude $a_n$.

(ii) For the discretization step we cover the compact set $A = \mathcal{S}(p,q)\times\operatorname{supp}(f_{\mathbf{X}})$ by finitely many balls, which is possible by (E.2) and the compactness of $\mathcal{S}(p,q)$. Let $\delta_n = a_n h_n$ and let $A_j = \{\mathbf{V}:\|P_{\mathbf{V}}-P_{\mathbf{V}_j}\|\le\delta_n\}\times\{\mathbf{s}:\|\mathbf{s}-\mathbf{s}_j\|\le\delta_n\}$ be a cover of $A$ with ball centers $\mathbf{V}_j\times\mathbf{s}_j$. Then $A\subset\bigcup_{j=1}^N A_j$ and the number of balls can be bounded by $N\le C_6\delta_n^{-d}\delta_n^{-p}$ for some constant $C_6\in(0,\infty)$, where $d=\dim(\mathcal{S}(p,q)) = pq - q(q+1)/2$. Let $\mathbf{V}\times\mathbf{s}\in A_j$. Then by Lemma 14 there exists $0<C_3<\infty$ such that
\[
|d_i(\mathbf{V},\mathbf{s}) - d_i(\mathbf{V}_j,\mathbf{s}_j)| \le C_3\delta_n \tag{51}
\]
for $d_i$ in (19). Under (K.2), which implies (46), inequality (51) yields
\[
\left|K\!\left(\frac{d_i(\mathbf{V},\mathbf{s})}{h_n}\right) - K\!\left(\frac{d_i(\mathbf{V}_j,\mathbf{s}_j)}{h_n}\right)\right| \le K^*\!\left(\frac{d_i(\mathbf{V}_j,\mathbf{s}_j)}{h_n}\right)C_3 a_n \tag{52}
\]
for $\mathbf{V}\times\mathbf{s}\in A_j$ and $K^*(\cdot)$ an integrable and bounded function.

Define $r^{(l)}_n(\mathbf{V}_j,\mathbf{s}_j) = (1/(n h_n^{(p-q)/2}))\sum_{i=1}^n K^*(d_i(\mathbf{V}_j,\mathbf{s}_j)/h_n)|\widetilde Y_i|^l$. For notational convenience we drop the dependence on $l$ and $j$ in the sequel, and observe that (52) yields
\[
\left|t^{(l)}_n(\mathbf{V},\mathbf{s}) - t^{(l)}_n(\mathbf{V}_j,\mathbf{s}_j)\right| \le C_3 a_n\,r^{(l)}_n(\mathbf{V}_j,\mathbf{s}_j). \tag{53}
\]
Since $K^*$ fulfills (K.1) except for continuity, an argument analogous to the proof of Lemma 12 yields that $\mathbb{E}\left(r^{(l)}_n(\mathbf{V}_j,\mathbf{s}_j)\right)<\infty$.
By subtracting and adding $t^{(l)}_n(\mathbf{V}_j,\mathbf{s}_j)$ and $\mathbb{E}(t^{(l)}_n(\mathbf{V}_j,\mathbf{s}_j))$, the triangle inequality, (53), and the integrability of $r^{(l)}_n$, we obtain
\begin{align}
\left|t^{(l)}_n(\mathbf{V},\mathbf{s}) - \mathbb{E}\left(t^{(l)}_n(\mathbf{V},\mathbf{s})\right)\right| &\le \left|t^{(l)}_n(\mathbf{V},\mathbf{s}) - t^{(l)}_n(\mathbf{V}_j,\mathbf{s}_j)\right| + \left|\mathbb{E}\left(t^{(l)}_n(\mathbf{V}_j,\mathbf{s}_j) - t^{(l)}_n(\mathbf{V},\mathbf{s})\right)\right| + \left|t^{(l)}_n(\mathbf{V}_j,\mathbf{s}_j) - \mathbb{E}\left(t^{(l)}_n(\mathbf{V}_j,\mathbf{s}_j)\right)\right| \notag\\
&\le C_3 a_n\left(|r_n| + |\mathbb{E}(r_n)|\right) + \left|t^{(l)}_n(\mathbf{V}_j,\mathbf{s}_j) - \mathbb{E}\left(t^{(l)}_n(\mathbf{V}_j,\mathbf{s}_j)\right)\right| \notag\\
&\le C_3 a_n\left(|r_n - \mathbb{E}(r_n)| + 2|\mathbb{E}(r_n)|\right) + \left|t^{(l)}_n(\mathbf{V}_j,\mathbf{s}_j) - \mathbb{E}\left(t^{(l)}_n(\mathbf{V}_j,\mathbf{s}_j)\right)\right| \notag\\
&\le C_7 a_n + |r_n - \mathbb{E}(r_n)| + \left|t^{(l)}_n(\mathbf{V}_j,\mathbf{s}_j) - \mathbb{E}\left(t^{(l)}_n(\mathbf{V}_j,\mathbf{s}_j)\right)\right| \tag{54}
\end{align}
for any constant $C_7 > 2C_3\,\mathbb{E}(r^{(l)}_n(\mathbf{V}_j,\mathbf{s}_j))$ and $n$ such that $C_3 a_n\le 1$, which is possible since $a_n = o(1)$; this in turn yields that there exists $0<C_7<\infty$ such that (54) holds.

Since $\sup_{x\in A}f(x) = \max_{1\le j\le N}\sup_{x\in A_j}f(x) \le \sum_{j=1}^N\sup_{x\in A_j}f(x)$ for any cover of $A$ and continuous function $f$,
\begin{align}
P\left(\sup_{\mathbf{V}\times\mathbf{s}\in A}\left|t^{(l)}_n - \mathbb{E}(t^{(l)}_n)\right| > C_5 a_n\right) &\le \sum_{j=1}^N P\left(\sup_{\mathbf{V}\times\mathbf{s}\in A_j}\left|t^{(l)}_n - \mathbb{E}(t^{(l)}_n)\right| > C_5 a_n\right) \le N\max_{1\le j\le N}P\left(\sup_{\mathbf{V}\times\mathbf{s}\in A_j}\left|t^{(l)}_n - \mathbb{E}(t^{(l)}_n)\right| > C_5 a_n\right) \tag{55}\\
&\le N\left(\max_{1\le j\le N}P\left(\left|t^{(l)}_n(\mathbf{V}_j,\mathbf{s}_j) - \mathbb{E}(t^{(l)}_n(\mathbf{V}_j,\mathbf{s}_j))\right| > C_5 a_n\right) + \max_{1\le j\le N}P\left(|r_n - \mathbb{E}(r_n)| > C_5 a_n\right)\right) \notag\\
&\le C_6\delta_n^{-(d+p)}\left(\max_{1\le j\le N}P\left(\left|t^{(l)}_n(\mathbf{V}_j,\mathbf{s}_j) - \mathbb{E}(t^{(l)}_n(\mathbf{V}_j,\mathbf{s}_j))\right| > C_5 a_n\right) + \max_{1\le j\le N}P\left(|r_n - \mathbb{E}(r_n)| > C_5 a_n\right)\right) \notag
\end{align}
by the subadditivity of probability for the first inequality and (54) for the third, where the last inequality is due to $N\le C_6\delta_n^{-d}\delta_n^{-p}$ for a cover of $A$.

Finally, we bound the first and second terms in the last line of (55) by the Bernstein inequality (45). For the first term, let $Z_i = \widetilde Y_i^l K(d_i(\mathbf{V}_j,\mathbf{s}_j)/h_n)$ and $S_n = \sum_i Z_i = n h_n^{(p-q)/2}t^{(l)}_n(\mathbf{V}_j,\mathbf{s}_j)$. Then the $Z_i$ are independent with $|Z_i|\le b = M_1\tau_n = M_1/a_n$ by (K.1) and the truncation step (i). For $V_n = \operatorname{Var}(S_n)$, Lemma 13 yields $n h_n^{(p-q)/2}\tilde C\ge V_n$ with $\tilde C>0$, and we set $t = C_5 a_n n h_n^{(p-q)/2}$.
The Bernstein inequality (45) yields
\begin{align*}
P\left(\left|t^{(l)}_n(\mathbf{V}_j,\mathbf{s}_j) - \mathbb{E}\left(t^{(l)}_n(\mathbf{V}_j,\mathbf{s}_j)\right)\right| > C_5 a_n\right) &< 2\exp\left(-\frac{(1/2)C_5^2 a_n^2 n^2 h_n^{p-q}}{n h_n^{(p-q)/2}\tilde C + (1/3)M_1\tau_n C_5 a_n n h_n^{(p-q)/2}}\right)\\
&\le 2\exp\left(-\frac{(1/2)C_5^2\log(n)}{\tilde C + C_5 M_1/3}\right) = 2n^{-\gamma(C_5)}
\end{align*}
where $a_n^2 = \log(n)/(n h_n^{(p-q)/2})$ and $\gamma(C_5) = C_5^2\left(2(\tilde C + C_5 M_1/3)\right)^{-1}$ is an increasing function that can be made arbitrarily large by increasing $C_5$.

For the second term in the last line of (55), set $Z_i = \widetilde Y_i^l K^*(d_i(\mathbf{V}_j,\mathbf{s}_j)/h_n)$ in (45) and proceed analogously to obtain
\[
P\left(\left|r^{(l)}_n(\mathbf{V}_j,\mathbf{s}_j) - \mathbb{E}\left(r^{(l)}_n(\mathbf{V}_j,\mathbf{s}_j)\right)\right| > C_5 a_n\right) < 2n^{-\gamma_2(C_5)}.
\]
By (H.1), $h_n^{(p-q)/2}\le 1$ for $n$ large, and (H.2) implies $1/(n h_n^{(p-q)/2})\le 1$ for $n$ large; therefore $h_n^{-1}\le n^{2/(p-q)}\le n^2$ since $p-q\ge 1$. Then $\delta_n^{-1} = (a_n h_n)^{-1}\le n^{1/2}h_n^{(p-q)/4}h_n^{-1}\le n^{5/2}$. Therefore, (55) is smaller than $C_6\delta_n^{-(d+p)}n^{-\gamma(C_5)}\le C_6 n^{5(d+p)/2-\gamma(C_5)}$. For $C_5$ large enough, $5(d+p)/2-\gamma(C_5)<0$ and $n^{5(d+p)/2-\gamma(C_5)}\to 0$. This completes the proof.

If we assume $|\widetilde Y_i|<M<\infty$ almost surely, the requirement $a_n/h_n^{(p-q)/2}=O(1)$ on the bandwidth can be dropped and the truncation step in the proof of Lemma 15 is no longer necessary.
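A small simulation can make the statement of Lemma 15 concrete for $p=2$, $q=1$, where $\mathcal{S}(2,1)$ is parametrized by an angle. The sketch below approximates $\mathbb{E}(t^{(0)}_n)$ by averaging replications and compares the resulting sup-deviation over a grid of $(\mathbf{V},\mathbf{s})$ with $a_n = (\log n/(n h_n^{1/2}))^{1/2}$; the grid resolution, the data distribution, and the replication proxy for the expectation are all simplifying assumptions.

```python
# Rough illustration of Lemma 15 for p = 2, q = 1: the sup-deviation of
# t_n^{(0)} over a grid of (V, s) should be of the same order as
# a_n = sqrt(log n / (n h^{(p-q)/2})). E(t_n^{(0)}) is approximated by an
# average over replications; all concrete choices are assumptions.
import numpy as np

rng = np.random.default_rng(4)
p = 2
angles = np.linspace(0.0, np.pi, 20, endpoint=False)
S_grid = [np.array([a, b]) for a in (-0.4, 0.0, 0.4) for b in (-0.4, 0.0, 0.4)]

def t0_all(X, h):
    """t_n^{(0)}(V, s) on the (angle, s) grid for one sample X."""
    n = X.shape[0]
    out = np.empty((len(angles), len(S_grid)))
    for i, a in enumerate(angles):
        v = np.array([np.cos(a), np.sin(a)])      # element of S(2, 1)
        for j, s in enumerate(S_grid):
            Xc = X - s
            d = (Xc**2).sum(axis=1) - (Xc @ v)**2
            out[i, j] = np.exp(-0.5 * (d / h)**2).sum() / (n * h**0.5)
    return out

for n, h in [(500, 0.5), (4000, 0.3)]:
    draws = np.stack([t0_all(rng.uniform(-1, 1, size=(n, p)), h) for _ in range(60)])
    sup_dev = np.abs(draws[0] - draws.mean(axis=0)).max()   # one replicate vs. proxy mean
    a_n = np.sqrt(np.log(n) / (n * h**0.5))
    print(n, h, round(float(sup_dev), 4), round(float(a_n), 4))
```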
Lemma 16. Under (E.1), (E.2), (E.3), (E.4), (H.1), (K.1), and $\int_{\mathbb{R}^{p-q}}K(\|\mathbf{r}_2\|^2)\,d\mathbf{r}_2 = 1$,
\[
\sup_{\mathbf{V}\times\mathbf{s}\in A}\left|t^{(l)}(\mathbf{V},\mathbf{s}) + \mathbf{1}_{\{l=2\}}\tilde h(\mathbf{V},\mathbf{s}) - \mathbb{E}\left(t^{(l)}_n(\mathbf{V},\mathbf{s})\right)\right| = O(h_n), \quad l=0,1,2, \tag{56}
\]
where $t^{(l)}(\mathbf{V},\mathbf{s})$ and $\tilde h(\mathbf{V},\mathbf{s})$ are defined in Theorem 4.

Proof of Lemma 16. Let $\tilde g_l(\mathbf{r}_1,\mathbf{r}_2) = g(\widetilde{\mathbf{B}}^T\mathbf{s}+\widetilde{\mathbf{B}}^T\mathbf{V}\mathbf{r}_1+\widetilde{\mathbf{B}}^T\mathbf{U}\mathbf{r}_2)^l f_{\mathbf{X}}(\mathbf{s}+\mathbf{V}\mathbf{r}_1+\mathbf{U}\mathbf{r}_2)$, where $\mathbf{r}_1,\mathbf{r}_2$ satisfy the orthogonal decomposition (31). Then
\begin{align*}
\mathbb{E}\left(t^{(0)}_n(\mathbf{V},\mathbf{s})\right) &= \mathbb{E}\left(K(d_i(\mathbf{V},\mathbf{s})/h_n)\right)/h_n^{(p-q)/2}\\
\mathbb{E}\left(t^{(1)}_n(\mathbf{V},\mathbf{s})\right) &= \mathbb{E}\left(K(d_i(\mathbf{V},\mathbf{s})/h_n)\,g(\widetilde{\mathbf{B}}^T\mathbf{X}_i)\right)/h_n^{(p-q)/2} + \mathbb{E}\big(K(d_i(\mathbf{V},\mathbf{s})/h_n)\underbrace{\mathbb{E}(\tilde\epsilon_i\mid\mathbf{X}_i)}_{=0}\big)/h_n^{(p-q)/2}\\
\mathbb{E}\left(t^{(2)}_n(\mathbf{V},\mathbf{s})\right) &= \mathbb{E}\left(K(d_i(\mathbf{V},\mathbf{s})/h_n)\,g(\widetilde{\mathbf{B}}^T\mathbf{X}_i)^2\right)/h_n^{(p-q)/2} + 2\,\mathbb{E}\big(K(d_i(\mathbf{V},\mathbf{s})/h_n)\,g(\widetilde{\mathbf{B}}^T\mathbf{X}_i)\underbrace{\mathbb{E}(\tilde\epsilon_i\mid\mathbf{X}_i)}_{=0}\big)/h_n^{(p-q)/2}\\
&\quad + \mathbb{E}\big(K(d_i(\mathbf{V},\mathbf{s})/h_n)\underbrace{\mathbb{E}(\tilde\epsilon_i^2\mid\mathbf{X}_i)}_{=h_2(\mathbf{X}_i)}\big)/h_n^{(p-q)/2}.
\end{align*}
Then
\[
\mathbb{E}\left(t^{(l)}_n(\mathbf{V},\mathbf{s})\right) = \int_{\mathbb{R}^{p-q}}K(\|\mathbf{r}_2\|^2)\int_{\mathbb{R}^{q}}\tilde g_l(\mathbf{r}_1,h_n^{1/2}\mathbf{r}_2)\,d\mathbf{r}_1\,d\mathbf{r}_2 \tag{57}
\]
holds by Lemma 12 for $l=0,1$. For $l=2$, $\widetilde Y_i^2 = g_i^2 + 2g_i\tilde\epsilon_i + \tilde\epsilon_i^2$ with $g_i = g(\widetilde{\mathbf{B}}^T\mathbf{X}_i)$, and each summand can be handled as in the case $l=0,1$; the third summand gives rise to the additional term $\tilde h(\mathbf{V},\mathbf{s})$ in (56). Plugging into (57) the second order Taylor expansion, for some $\xi$ in a neighborhood of $\mathbf{0}$,
\[
\tilde g_l(\mathbf{r}_1,h_n^{1/2}\mathbf{r}_2) = \tilde g_l(\mathbf{r}_1,\mathbf{0}) + h_n^{1/2}\nabla_{\mathbf{r}_2}\tilde g_l(\mathbf{r}_1,\mathbf{0})^T\mathbf{r}_2 + \frac{h_n}{2}\mathbf{r}_2^T\nabla^2_{\mathbf{r}_2}\tilde g_l(\mathbf{r}_1,\xi)\mathbf{r}_2,
\]
yields
\begin{align*}
\mathbb{E}\left(t^{(l)}_n(\mathbf{V},\mathbf{s})\right) &= \int_{\mathbb{R}^{q}}\tilde g_l(\mathbf{r}_1,\mathbf{0})\,d\mathbf{r}_1 + \sqrt{h_n}\left(\int_{\mathbb{R}^{q}}\nabla_{\mathbf{r}_2}\tilde g_l(\mathbf{r}_1,\mathbf{0})\,d\mathbf{r}_1\right)^T\int_{\mathbb{R}^{p-q}}K(\|\mathbf{r}_2\|^2)\,\mathbf{r}_2\,d\mathbf{r}_2\\
&\quad + \frac{h_n}{2}\int_{\mathbb{R}^{p-q}}K(\|\mathbf{r}_2\|^2)\int_{\mathbb{R}^{q}}\mathbf{r}_2^T\nabla^2_{\mathbf{r}_2}\tilde g_l(\mathbf{r}_1,\xi)\mathbf{r}_2\,d\mathbf{r}_1\,d\mathbf{r}_2 = t^{(l)}(\mathbf{V},\mathbf{s}) + h_n R_2(\mathbf{V},\mathbf{s})
\end{align*}
since $\int_{\mathbb{R}^{q}}\tilde g_l(\mathbf{r}_1,\mathbf{0})\,d\mathbf{r}_1 = t^{(l)}(\mathbf{V},\mathbf{s})$ and $\int_{\mathbb{R}^{p-q}}K(\|\mathbf{r}_2\|^2)\,\mathbf{r}_2\,d\mathbf{r}_2 = \mathbf{0}\in\mathbb{R}^{p-q}$ due to $K(\|\cdot\|^2)$ being even. Here $R_2(\mathbf{V},\mathbf{s}) = \frac{1}{2}\int_{\mathbb{R}^{p-q}}K(\|\mathbf{r}_2\|^2)\int_{\mathbb{R}^{q}}\mathbf{r}_2^T\nabla^2_{\mathbf{r}_2}\tilde g_l(\mathbf{r}_1,\xi)\mathbf{r}_2\,d\mathbf{r}_1\,d\mathbf{r}_2$. By (E.4) and (E.2), $|\mathbf{r}_2^T\nabla^2_{\mathbf{r}_2}\tilde g_l(\mathbf{r}_1,\xi)\mathbf{r}_2|\le C_8\|\mathbf{r}_2\|^2$ for $C_8 = \sup_{\mathbf{x},\mathbf{y}}\|\nabla^2_{\mathbf{r}_2}\tilde g_l(\mathbf{x},\mathbf{y})\|<\infty$, since a continuous function over a compact set is bounded. Then $R_2(\mathbf{V},\mathbf{s})\le C_8 C_9\int_{\mathbb{R}^{p-q}}K(\|\mathbf{r}_2\|^2)\|\mathbf{r}_2\|^2\,d\mathbf{r}_2<\infty$ for some $C_9>0$, since the integral over $\mathbf{r}_1$ is over a compact set by (E.2).

Lemma 17 follows directly from Lemmas 15 and 16 and the triangle inequality.

Lemma 17. Suppose (E.1), (E.2), (E.3), (E.4), (K.1), (K.2), (H.1) hold. If $a_n = \left(\log(n)/(n h_n^{(p-q)/2})\right)^{1/2} = o(1)$ and $a_n/h_n^{(p-q)/2} = O(1)$, then for $l=0,1,2$,
\[
\sup_{\mathbf{V}\times\mathbf{s}\in A}\left|t^{(l)}(\mathbf{V},\mathbf{s}) + \mathbf{1}_{\{l=2\}}\tilde h(\mathbf{V},\mathbf{s}) - t^{(l)}_n(\mathbf{V},\mathbf{s})\right| = O_P(a_n + h_n).
\]
Theorem 18. Suppose (E.1), (E.2), (E.3), (E.4), (K.1), (K.2), (H.1) hold. Let $a_n = \left(\log(n)/(n h_n^{(p-q)/2})\right)^{1/2} = o(1)$ and $a_n/h_n^{(p-q)/2} = O(1)$. Then
\[
\sup_{\mathbf{V}\times\mathbf{s}\in A}\left|\bar y_l(\mathbf{V},\mathbf{s}) - \mu_l(\mathbf{V},\mathbf{s}) - \mathbf{1}_{\{l=2\}}\tilde h(\mathbf{V},\mathbf{s})\right| = o_P(1), \quad l=0,1,2,
\]
and
\[
\sup_{\mathbf{V}\times\mathbf{s}\in A}\left|\tilde L_{n,F}(\mathbf{V},\mathbf{s}) - \tilde L_F(\mathbf{V},\mathbf{s})\right| = o_P(1), \tag{58}
\]
where $\bar y_l(\mathbf{V},\mathbf{s})$, $\mu_l(\mathbf{V},\mathbf{s})$, $\tilde L_{n,F}(\mathbf{V},\mathbf{s})$ and $\tilde L_F(\mathbf{V},\mathbf{s})$ are defined in (21), (11), (22) and (10), respectively.

Proof of Theorem 18. Let $\delta^*_n = \inf_{\mathbf{V}\times\mathbf{s}\in A_n}t^{(0)}(\mathbf{V},\mathbf{s})$, where $t^{(0)}(\mathbf{V},\mathbf{s})$ is defined in (12), and $A_n = \mathcal{S}(p,q)\times\{\mathbf{x}\in\operatorname{supp}(f_{\mathbf{X}}) : |\mathbf{x}-\partial\operatorname{supp}(f_{\mathbf{X}})|\ge b_n\}$, where $\partial C$ denotes the boundary of the set $C$ and $|\mathbf{x}-C| = \inf_{\mathbf{r}\in C}|\mathbf{x}-\mathbf{r}|$, for a sequence $b_n\to 0$ chosen so that $(\delta^*_n)^{-1}(a_n+h_n)\to 0$ for any bandwidth $h_n$ satisfying the assumptions. Then,
\[
\bar y_l(\mathbf{V},\mathbf{s}) = \frac{t^{(l)}_n(\mathbf{V},\mathbf{s})}{t^{(0)}_n(\mathbf{V},\mathbf{s})} = \frac{t^{(l)}_n(\mathbf{V},\mathbf{s})/t^{(0)}(\mathbf{V},\mathbf{s})}{t^{(0)}_n(\mathbf{V},\mathbf{s})/t^{(0)}(\mathbf{V},\mathbf{s})}. \tag{59}
\]
We consider the numerator and denominator of (59) separately. By Lemma 17,
\begin{align*}
\sup_{\mathbf{V}\times\mathbf{s}\in A_n}\left|\frac{t^{(0)}_n(\mathbf{V},\mathbf{s})}{t^{(0)}(\mathbf{V},\mathbf{s})} - 1\right| &\le \frac{\sup_{A}\left|t^{(0)}_n(\mathbf{V},\mathbf{s}) - t^{(0)}(\mathbf{V},\mathbf{s})\right|}{\inf_{A_n}t^{(0)}(\mathbf{V},\mathbf{s})} = O_P\left((\delta^*_n)^{-1}(a_n+h_n)\right),\\
\sup_{\mathbf{V}\times\mathbf{s}\in A_n}\left|\frac{t^{(l)}_n(\mathbf{V},\mathbf{s})}{t^{(0)}(\mathbf{V},\mathbf{s})} - \mu_l(\mathbf{V},\mathbf{s})\right| &\le \frac{\sup_{A}\left|t^{(l)}_n(\mathbf{V},\mathbf{s}) - t^{(l)}(\mathbf{V},\mathbf{s})\right|}{\inf_{A_n}t^{(0)}(\mathbf{V},\mathbf{s})} = O_P\left((\delta^*_n)^{-1}(a_n+h_n)\right),
\end{align*}
and therefore, by $A_n\uparrow A = \mathcal{S}(p,q)\times\operatorname{supp}(f_{\mathbf{X}})$,
\[
\lim_{n\to\infty}\sup_{\mathbf{V}\times\mathbf{s}\in A_n}\left|\frac{t^{(l)}_n(\mathbf{V},\mathbf{s})}{t^{(0)}(\mathbf{V},\mathbf{s})} - \mu_l(\mathbf{V},\mathbf{s})\right| = \lim_{n\to\infty}\sup_{\mathbf{V}\times\mathbf{s}\in A}\left|\frac{t^{(l)}_n(\mathbf{V},\mathbf{s})}{t^{(0)}(\mathbf{V},\mathbf{s})} - \mu_l(\mathbf{V},\mathbf{s})\right|.
\]
Substituting in (59), we obtain
\[
\bar y_l(\mathbf{V},\mathbf{s}) = \frac{\mu_l + O_P\left((\delta^*_n)^{-1}(a_n+h_n)\right)}{1 + O_P\left((\delta^*_n)^{-1}(a_n+h_n)\right)} = \mu_l + O_P\left((\delta^*_n)^{-1}(a_n+h_n)\right).
\]
For $l=2$, $\widetilde Y_i^2 = g(\widetilde{\mathbf{B}}^T\mathbf{X}_i)^2 + 2g(\widetilde{\mathbf{B}}^T\mathbf{X}_i)\tilde\epsilon_i + \tilde\epsilon_i^2$, and (58) follows from (10).

Lemma 19. Under (E.1), (E.2), (E.4), there exists $0<C_{10}<\infty$ such that
\[
|\mu_l(\mathbf{V},\mathbf{s}) - \mu_l(\mathbf{V}_j,\mathbf{s})| \le C_{10}\|P_{\mathbf{V}}-P_{\mathbf{V}_j}\| \tag{60}
\]
for all interior points $\mathbf{s}\in\operatorname{supp}(f_{\mathbf{X}})$.

Proof. Using the representation $\tilde t^{(l)}(P_{\mathbf{V}},\mathbf{s})$ in (16) instead of $t^{(l)}(\mathbf{V},\mathbf{s})$, we consider $\mu_l(\mathbf{V},\mathbf{s}) = \mu_l(P_{\mathbf{V}},\mathbf{s})$ as a function on the Grassmann manifold, since $P_{\mathbf{V}}\in\operatorname{Gr}(p,q)$. Then,
\begin{align}
\left|\mu_l(P_{\mathbf{V}},\mathbf{s}) - \mu_l(P_{\mathbf{V}_j},\mathbf{s})\right| &= \left|\frac{\tilde t^{(l)}(P_{\mathbf{V}},\mathbf{s})}{\tilde t^{(0)}(P_{\mathbf{V}},\mathbf{s})} - \frac{\tilde t^{(l)}(P_{\mathbf{V}_j},\mathbf{s})}{\tilde t^{(0)}(P_{\mathbf{V}_j},\mathbf{s})}\right| \notag\\
&\le \frac{\sup\left|\tilde t^{(0)}(P_{\mathbf{V}},\mathbf{s})\right|}{\left(\inf\tilde t^{(0)}(P_{\mathbf{V}},\mathbf{s})\right)^2}\left|\tilde t^{(l)}(P_{\mathbf{V}},\mathbf{s}) - \tilde t^{(l)}(P_{\mathbf{V}_j},\mathbf{s})\right| + \frac{\sup\tilde t^{(l)}(P_{\mathbf{V}},\mathbf{s})}{\left(\inf\tilde t^{(0)}(P_{\mathbf{V}},\mathbf{s})\right)^2}\left|\tilde t^{(0)}(P_{\mathbf{V}},\mathbf{s}) - \tilde t^{(0)}(P_{\mathbf{V}_j},\mathbf{s})\right| \tag{61}
\end{align}
with $\sup_{P_{\mathbf{V}}\in\operatorname{Gr}(p,q)}\tilde t^{(0)}(P_{\mathbf{V}},\mathbf{s})\in(0,\infty)$ and $\inf_{P_{\mathbf{V}}\in\operatorname{Gr}(p,q)}\tilde t^{(0)}(P_{\mathbf{V}},\mathbf{s})\in(0,\infty)$, since $\tilde t^{(l)}$ is continuous, $\Sigma_{\mathbf{x}}>0$ and $\mathbf{s}\in\operatorname{supp}(f_{\mathbf{X}})$ is an interior point.

By (E.2) and (E.4), $\tilde g_l(\mathbf{x}) = g(\widetilde{\mathbf{B}}^T\mathbf{x})^l f_{\mathbf{X}}(\mathbf{x})$ is twice continuously differentiable and therefore Lipschitz continuous on compact sets. We denote its Lipschitz constant by $L_g<\infty$. Therefore,
\[
\left|\tilde t^{(l)}(P_{\mathbf{V}},\mathbf{s}) - \tilde t^{(l)}(P_{\mathbf{V}_j},\mathbf{s})\right| \le \int_{\operatorname{supp}(f_{\mathbf{X}})}\left|\tilde g_l(\mathbf{s}+P_{\mathbf{V}}\mathbf{r}) - \tilde g_l(\mathbf{s}+P_{\mathbf{V}_j}\mathbf{r})\right|d\mathbf{r} \le L_g\int\left\|(P_{\mathbf{V}}-P_{\mathbf{V}_j})\mathbf{r}\right\|d\mathbf{r} \le L_g\left(\int\|\mathbf{r}\|\,d\mathbf{r}\right)\|P_{\mathbf{V}}-P_{\mathbf{V}_j}\| \tag{62}
\]
where the last inequality is due to the sub-multiplicativity of the Frobenius norm and the integral is finite by (E.2). Plugging (62) into (61) and collecting all constants into $C_{10}$ yields (60).
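In a toy case the Lipschitz bound (60) can even be checked in closed form. Assuming $\mathbf{X}\sim N(0,\mathbf{I}_p)$ and the identity link $g(u)=u$ (assumptions made only for this illustration), $\mathbf{X}$ conditioned on the slice $\mathbf{s}+\operatorname{span}\{\mathbf{V}\}$ is $\mathbf{s}+\mathbf{V}\mathbf{r}$ with $\mathbf{r}\sim N(-\mathbf{V}^T\mathbf{s},\mathbf{I}_q)$, so $\mu_1(P_{\mathbf{V}},\mathbf{s}) = \widetilde{\mathbf{B}}^T(\mathbf{I}_p-P_{\mathbf{V}})\mathbf{s}$ and $\|\mathbf{s}\|$ serves as a Lipschitz constant:

```python
# Closed-form check of (60) in a toy case. Assumptions: X ~ N(0, I_p) and the
# identity link, so mu_1(P_V, s) = B^T (I - P_V) s and
# |mu_1(P_V, s) - mu_1(P_W, s)| <= ||s|| * ||P_V - P_W||_F (since ||B|| = 1).
import numpy as np

rng = np.random.default_rng(5)
p, q = 4, 2
B = rng.normal(size=(p, 1)); B /= np.linalg.norm(B)
s = rng.normal(size=p)

def mu1(V):
    """mu_1(P_V, s) in the Gaussian/identity-link toy case."""
    return float(B.T @ (np.eye(p) - V @ V.T) @ s)

for eps in (1e-1, 1e-2, 1e-3):
    V, _ = np.linalg.qr(rng.normal(size=(p, q)))
    W, _ = np.linalg.qr(V + eps * rng.normal(size=(p, q)))   # nearby Stiefel point
    lhs = abs(mu1(V) - mu1(W))
    rhs = np.linalg.norm(s) * np.linalg.norm(V @ V.T - W @ W.T)
    print(eps, bool(lhs <= rhs), round(lhs, 6), round(rhs, 6))  # bound always holds
```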
Proof of Theorem 8. By (23) and (7),
\[
|L^*_n(\mathbf{V},f) - L^*_F(\mathbf{V},f)| \le \left|\frac{1}{n}\sum_i\left(\tilde L_{n,F}(\mathbf{V},\mathbf{X}_i,f) - \tilde L_F(\mathbf{V},\mathbf{X}_i,f)\right)\right| + \left|\frac{1}{n}\sum_i\left(\tilde L_F(\mathbf{V},\mathbf{X}_i,f) - \mathbb{E}(\tilde L_F(\mathbf{V},\mathbf{X},f))\right)\right|. \tag{63}
\]
By Theorem 18,
\[
\left|\frac{1}{n}\sum_i\tilde L_{n,F}(\mathbf{V},\mathbf{X}_i,f) - \tilde L_F(\mathbf{V},\mathbf{X}_i,f)\right| \le \sup_{\mathbf{V}\times\mathbf{s}\in A}\left|\tilde L_{n,F}(\mathbf{V},\mathbf{s},f) - \tilde L_F(\mathbf{V},\mathbf{s},f)\right| = o_P(1). \tag{64}
\]
The second term in (63) converges to 0 almost surely for all $\mathbf{V}\in\mathcal{S}(p,q)$ by the strong law of large numbers. To show uniform convergence, the same technique as in the proof of Lemma 15 is used. Let $B_j = \{\mathbf{V}\in\mathcal{S}(p,q) : \|\mathbf{V}\mathbf{V}^T-\mathbf{V}_j\mathbf{V}_j^T\|\le\tilde a_n\}$ be a cover of $\mathcal{S}(p,q)\subset\bigcup_{j=1}^N B_j$ with $N\le C\tilde a_n^{-d} = C(n/\log(n))^{d/2}\le C n^{d/2}$, where $d = \dim(\mathcal{S}(p,q))$ is given in the proof of Lemma 15. By Lemma 19,
\[
|\mu_l(\mathbf{V},\mathbf{X}_i) - \mu_l(\mathbf{V}_j,\mathbf{X}_i)| \le C_{10}\|P_{\mathbf{V}}-P_{\mathbf{V}_j}\|. \tag{65}
\]
Let $G_n(\mathbf{V},f) = \sum_i\tilde L_F(\mathbf{V},\mathbf{X}_i,f)/n$ with $\mathbb{E}(G_n(\mathbf{V},f)) = L^*_F(\mathbf{V},f)$. Using (65) and following the same steps as in the proof of Lemma 15, we obtain
\[
|G_n(\mathbf{V},f) - L^*_F(\mathbf{V},f)| \le |G_n(\mathbf{V},f) - G_n(\mathbf{V}_j,f)| + |G_n(\mathbf{V}_j,f) - L^*_F(\mathbf{V}_j,f)| + |L^*_F(\mathbf{V},f) - L^*_F(\mathbf{V}_j,f)| \le C_{11}\tilde a_n + |G_n(\mathbf{V}_j,f) - L^*_F(\mathbf{V}_j,f)| \tag{66}
\]
for $\mathbf{V}\in B_j$ and some $C_{11}>C_{10}$. Inequality (66) leads to
\[
P\left(\sup_{\mathbf{V}\in\mathcal{S}(p,q)}|G_n(\mathbf{V},f) - L^*_F(\mathbf{V},f)| > C_{12}\tilde a_n\right) \le C N\,P\left(\sup_{\mathbf{V}\in B_j}|G_n(\mathbf{V},f) - L^*_F(\mathbf{V},f)| > C_{12}\tilde a_n\right) \le C n^{d/2}\,P\left(|G_n(\mathbf{V}_j,f) - L^*_F(\mathbf{V}_j,f)| > C_{12}\tilde a_n\right) \le C n^{d/2}\,n^{-\gamma(C_{12})}\to 0 \tag{67}
\]
where the last inequality in (67) is due to (45) with $Z_i = \tilde L_F(\mathbf{V}_j,\mathbf{X}_i,f)$, which is bounded since $\tilde L_F(\cdot,\cdot,f)$ is continuous on the compact set $A$, and $\gamma(C_{12})$ is a monotone increasing function of $C_{12}$ that can be made arbitrarily large by choosing $C_{12}$ accordingly. Therefore, $\sup_{\mathbf{V}\in\mathcal{S}(p,q)}|L^*_n(\mathbf{V},f) - L^*_F(\mathbf{V},f)| \le o_P(1) + O_P(\tilde a_n)$, which implies Theorem 8.
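The objective handled by Theorem 8 is straightforward to assemble from the earlier ingredients: $L^*_n(\mathbf{V},f)$ of (23) averages the local variance $\tilde L_{n,F}(\mathbf{V},\mathbf{X}_i,f)$ of (22) over the shifting points $\mathbf{s}=\mathbf{X}_i$. The sketch below does exactly that, using the same hypothetical kernel and slice distance as in the earlier snippets; the data-generating choices are assumptions, and the final comparison only illustrates that the objective should typically be smaller when $\mathbf{V}\perp\widetilde{\mathbf{B}}$.

```python
# Sketch of the empirical objective of Theorem 8: L*_n(V, f) of (23) is the
# average over s = X_i of the local variance L~_{n,F}(V, s) of (22). Kernel,
# distance, and data are the same hypothetical choices as in earlier snippets.
import numpy as np

def local_var(V, s, X, Ytil, h):
    """L~_{n,F}(V, s) of (22): ybar_2 - ybar_1^2 with kernel weights w_i."""
    Xc = X - s
    d = (Xc**2).sum(axis=1) - ((Xc @ V)**2).sum(axis=1)
    w = np.exp(-0.5 * (d / h)**2)
    w = w / w.sum()
    y1 = (w * Ytil).sum()
    return (w * Ytil**2).sum() - y1**2

def L_star_n(V, X, Ytil, h):
    """L*_n(V, f) of (23): average of local variances over s = X_i."""
    return np.mean([local_var(V, s, X, Ytil, h) for s in X])

rng = np.random.default_rng(6)
n, p, q = 400, 3, 2
X = rng.uniform(-1, 1, size=(n, p))
Ytil = np.sin(2 * X[:, 0]) + 0.1 * rng.normal(size=n)   # span{B-tilde} = e1
V_orth = np.eye(p)[:, 1:3]                 # orthogonal to B-tilde
V_bad, _ = np.linalg.qr(rng.normal(size=(p, q)))
print(L_star_n(V_orth, X, Ytil, h=0.1))    # near the error variance
print(L_star_n(V_bad, X, Ytil, h=0.1))     # typically larger
```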
Proof of Theorem 9. We apply [Ame85, Thm 4.1.1] to obtain consistency of the conditional variance estimator. This theorem gives three conditions that guarantee the convergence of the minimizer of a sequence of random functions $L^*_n(P_{\mathbf{V}},f_t)$ to the minimizer of the limiting function $L^*(P_{\mathbf{V}},f_t)$; i.e., $P_{\operatorname{span}\{\widehat{\mathbf{B}}_{t,k_t}\}^\perp} = \operatorname{argmin}\,L^*_n(P_{\mathbf{V}},f_t)\to P_{\operatorname{span}\{\mathbf{B}\}^\perp} = \operatorname{argmin}\,L^*(P_{\mathbf{V}},f_t)$ in probability. The three conditions are: (1) the parameter space is compact; (2) $L^*_n(P_{\mathbf{V}},f_t)$ is continuous in $P_{\mathbf{V}}$ and a measurable function of the data $(Y_i,\mathbf{X}_i^T)_{i=1,\ldots,n}$; and (3) $L^*_n(P_{\mathbf{V}},f_t)$ converges uniformly to $L^*(P_{\mathbf{V}},f_t)$, and $L^*(P_{\mathbf{V}},f_t)$ attains a unique global minimum at $\mathcal{S}^\perp_{E(f_t(Y)\mid\mathbf{X})}$. Since $L^*_n(\mathbf{V},f_t)$ depends on $\mathbf{V}$ only through $P_{\mathbf{V}} = \mathbf{V}\mathbf{V}^T$, $L^*_n(\mathbf{V},f_t)$ can be considered as a function on the Grassmann manifold, which is compact, and the same holds for $L^*(\mathbf{V},f_t)$ by (16). Further, $L^*_n(\mathbf{V},f_t)$ is by definition a measurable function of the data and is continuous in $\mathbf{V}$ if a continuous kernel, such as the Gaussian, is used. Theorem 8 yields the uniform convergence, and Theorem 4 yields that the minimizer is unique when $L(\mathbf{V})$ is minimized over the Grassmann manifold $\operatorname{Gr}(p,q)$, since $\mathcal{S}_{E(f_t(Y)\mid\mathbf{X})} = \operatorname{span}\{\widetilde{\mathbf{B}}\}$ is uniquely identifiable and so is $\operatorname{span}\{\widetilde{\mathbf{B}}\}^\perp$; i.e.,
\[
\left\|P_{\operatorname{span}\{\widehat{\mathbf{B}}_{t,k_t}\}} - P_{\operatorname{span}\{\widetilde{\mathbf{B}}\}}\right\| = \left\|\widehat{\mathbf{B}}_{t,k_t}\widehat{\mathbf{B}}_{t,k_t}^T - \widetilde{\mathbf{B}}\widetilde{\mathbf{B}}^T\right\| = \left\|\left(\mathbf{I}_p-\widetilde{\mathbf{B}}\widetilde{\mathbf{B}}^T\right) - \left(\mathbf{I}_p-\widehat{\mathbf{B}}_{t,k_t}\widehat{\mathbf{B}}_{t,k_t}^T\right)\right\| = \left\|P_{\operatorname{span}\{\widetilde{\mathbf{B}}\}^\perp} - P_{\operatorname{span}\{\widehat{\mathbf{B}}_{t,k_t}\}^\perp}\right\|.
\]
Thus, all three conditions are met and the result follows.

Proof of Theorem 10. Let $(t_j)_{j=1,\ldots,m_n}$ be an i.i.d. sample from $F_T$ and write
\[
|L_{n,F}(\mathbf{V}) - L_F(\mathbf{V})| \le \left|\frac{1}{m_n}\sum_{j=1}^{m_n}\left(L^*_n(\mathbf{V},f_{t_j}) - L^*(\mathbf{V},f_{t_j})\right)\right| + \left|\frac{1}{m_n}\sum_{j=1}^{m_n}\left(L^*(\mathbf{V},f_{t_j}) - \mathbb{E}_{t\sim F_T}(L^*(\mathbf{V},f_t))\right)\right|. \tag{68}
\]
Then $\sup_{\mathbf{V}\in\mathcal{S}(p,q)}|L^*_n(\mathbf{V},f_t) - L^*(\mathbf{V},f_t)|\le 8M^2$ by the assumption $\sup_{t\in\Omega_T}|f_t(Y)|<M<\infty$ and the triangle inequality. That is, $L^*_n(\mathbf{V},f_t)$ estimates a variance of a bounded response $f_t(Y)\in[-M,M]$ and is therefore bounded by the squared range $4M^2$ of $f_t(Y)$; the same holds for $L^*(\mathbf{V},f_t)$. Further, $8M^2$ is an integrable dominating function, so that Fatou's Lemma applies.

Consider the first term on the right hand side of (68) and let $\delta>0$. By Markov's and the triangle inequalities and Fatou's Lemma,
\begin{align*}
\limsup_n P\left(\sup_{\mathbf{V}\in\mathcal{S}(p,q)}\left|\frac{1}{m_n}\sum_{j=1}^{m_n}L^*_n(\mathbf{V},f_{t_j}) - L^*(\mathbf{V},f_{t_j})\right|>\delta\right) &\le \frac{1}{\delta}\limsup_n\mathbb{E}_{F_T}\,\mathbb{E}\left(\sup_{\mathbf{V}\in\mathcal{S}(p,q)}\left|\frac{1}{m_n}\sum_{j=1}^{m_n}L^*_n(\mathbf{V},f_{t_j}) - L^*(\mathbf{V},f_{t_j})\right|\right) && \text{(Markov inequality)}\\
&\le \frac{1}{\delta}\limsup_n\mathbb{E}_{F_T}\left(\frac{1}{m_n}\sum_{j=1}^{m_n}\mathbb{E}\left(\sup_{\mathbf{V}\in\mathcal{S}(p,q)}\left|L^*_n(\mathbf{V},f_{t_j}) - L^*(\mathbf{V},f_{t_j})\right|\right)\right)\\
&= \frac{1}{\delta}\limsup_n\mathbb{E}_{F_T}\left(\mathbb{E}\left(\sup_{\mathbf{V}\in\mathcal{S}(p,q)}\left|L^*_n(\mathbf{V},f_{t_j}) - L^*(\mathbf{V},f_{t_j})\right|\right)\right)\\
&\le \frac{1}{\delta}\mathbb{E}_{F_T}\left(\mathbb{E}\left(\limsup_n\sup_{\mathbf{V}\in\mathcal{S}(p,q)}\left|L^*_n(\mathbf{V},f_{t_j}) - L^*(\mathbf{V},f_{t_j})\right|\right)\right) = \frac{1}{\delta}\mathbb{E}_{t\sim F_T}(\mathbb{E}(0)) = 0,
\end{align*}
since by Theorem 8, $\limsup_n\sup_{\mathbf{V}\in\mathcal{S}(p,q)}|L^*_n(\mathbf{V},f_{t_j}) - L^*(\mathbf{V},f_{t_j})| = 0$.

For the second term on the right hand side of (68), we apply Theorem 2 of [Jen69] in [MMW+63, p. 40]:

Theorem 20. Let $t_j$ be an i.i.d. sample and $L^*(\mathbf{V},f_t):\Theta\times\Omega_T\to\mathbb{R}$, where $\Theta$ is a compact subset of a Euclidean space. $L^*(\mathbf{V},f_t)$ is continuous in $\mathbf{V}$ and measurable in $t$ by Theorem 4. If $L^*(\mathbf{V},f_{t_j})\le h(t_j)$, where $h(t_j)$ is integrable with respect to $F_T$, then
\[
\frac{1}{m_n}\sum_{j=1}^{m_n}L^*(\mathbf{V},f_{t_j}) \longrightarrow \mathbb{E}_{F_T}(L^*(\mathbf{V},f_t))
\]
uniformly over $\mathbf{V}\in\Theta$ almost surely as $n\to\infty$.

Here $\mathbf{V}\in\mathcal{S}(p,q) = \Theta\subseteq\mathbb{R}^{pq}$; by $\sup_{t\in\Omega_T}|f_t(Y)|<M<\infty$ and an argument analogous to that for the first term in (68), $Z_j(\mathbf{V}) = L^*(\mathbf{V},f_{t_j}) < 4M^2$. Therefore, $\mathbb{E}(\sup_{\mathbf{V}\in\mathcal{S}(p,q)}|Z_j(\mathbf{V})|)<4M^2$, which is integrable. Further, since the $t_j$ are an i.i.d. sample from $F_T$, $Z_j(\mathbf{V})$ is an i.i.d. sequence of random variables, $Z_j(\mathbf{V})$ is continuous in $\mathbf{V}$ by Theorem 4, and the parameter space $\mathcal{S}(p,q)$ is compact.
Then by Theorem 20,
\[
\sup_{\mathbf{V}\in\mathcal{S}(p,q)}\left|\frac{1}{m_n}\sum_{j=1}^{m_n}L^*(\mathbf{V},f_{t_j}) - \mathbb{E}_{t\sim F_T}(L^*(\mathbf{V},f_t))\right| \longrightarrow 0
\]
almost surely as $n\to\infty$ if $\lim_{n\to\infty}m_n = \infty$. Putting everything together, it follows that $\sup_{\mathbf{V}\in\mathcal{S}(p,q)}|L_{n,F}(\mathbf{V}) - L_F(\mathbf{V})|\to 0$ in probability as $n\to\infty$.

Proof of Theorem 11. The proof is directly analogous to the proof of Theorem 9. The uniform convergence of the target function $L_{n,F}(\mathbf{V})$ is obtained by Theorem 10. The minimizer over $\operatorname{Gr}(p,q)$ and its uniqueness derive from Theorem 6.

Proof of Theorem 7. In this proof we suppress the dependence on $f$ in the notation. The Gaussian kernel $K$ satisfies $\partial_z K(z) = -zK(z)$. From (20) and (22) we have $\tilde L_{n,F} = \bar y_2 - \bar y_1^2$, where $\bar y_l = \sum_i w_i\widetilde Y_i^l$, $l=1,2$. We let $K_j = K(d_j(\mathbf{V},\mathbf{s})/h_n)$, suppress the dependence on $\mathbf{V}$ and $\mathbf{s}$, and write $w_i = K_i/\sum_j K_j$. Then,
\[
\nabla K_i = -\frac{1}{h_n^2}K_i d_i\nabla d_i \quad\text{and}\quad \nabla w_i = -\frac{1}{h_n^2}\,\frac{K_i d_i\nabla d_i\left(\sum_j K_j\right) - K_i\sum_j K_j d_j\nabla d_j}{\left(\sum_j K_j\right)^2}.
\]
Next,
\begin{align}
\nabla\bar y_l &= -\frac{1}{h_n^2}\sum_i\widetilde Y_i^l\left(\frac{K_i d_i\nabla d_i}{\sum_j K_j} - \frac{K_i\left(\sum_j K_j d_j\nabla d_j\right)}{\left(\sum_j K_j\right)^2}\right) = -\frac{1}{h_n^2}\sum_i\widetilde Y_i^l w_i\left(d_i\nabla d_i - \sum_j w_j d_j\nabla d_j\right) \notag\\
&= -\frac{1}{h_n^2}\left(\sum_i\widetilde Y_i^l w_i d_i\nabla d_i - \sum_j\widetilde Y_j^l w_j\sum_i w_i d_i\nabla d_i\right) = -\frac{1}{h_n^2}\sum_i\left(\widetilde Y_i^l - \bar y_l\right)w_i d_i\nabla d_i. \tag{69}
\end{align}
Then $\nabla\tilde L_{n,F} = \nabla\bar y_2 - 2\bar y_1\nabla\bar y_1$, and inserting $\nabla\bar y_l$ from (69) yields
\[
\nabla\tilde L_{n,F} = -\frac{1}{h_n^2}\sum_i\left(\widetilde Y_i^2 - \bar y_2 - 2\bar y_1(\widetilde Y_i - \bar y_1)\right)w_i d_i\nabla d_i = \frac{1}{h_n^2}\sum_i\left(\tilde L_{n,F} - (\widetilde Y_i - \bar y_1)^2\right)w_i d_i\nabla d_i,
\]
since $\widetilde Y_i^2 - \bar y_2 - 2\bar y_1(\widetilde Y_i - \bar y_1) = (\widetilde Y_i - \bar y_1)^2 - \tilde L_{n,F}$.
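Since the closed-form gradient is the main product of this proof, it is worth verifying mechanically. The sketch below implements $\tilde L_{n,F}$ and the gradient formula just derived, using $\nabla_{\mathbf{V}}d_i = -2(\mathbf{X}_i-\mathbf{s})(\mathbf{X}_i-\mathbf{s})^T\mathbf{V}$ for the squared slice distance $d_i$ of (19), and compares it with central finite differences in the free matrix $\mathbf{V}\in\mathbb{R}^{p\times q}$ (the Stiefel constraint is irrelevant for checking the formula itself). All data choices are illustrative assumptions.

```python
# Finite-difference check of the gradient from the proof of Theorem 7 with the
# Gaussian kernel K(z) = exp(-z^2/2), so K'(z) = -z K(z):
#   grad L~_{n,F} = (1/h^2) sum_i (L~_{n,F} - (Ytil_i - ybar_1)^2) w_i d_i grad d_i,
# where grad_V d_i = -2 (X_i - s)(X_i - s)^T V for the squared slice distance.
import numpy as np

rng = np.random.default_rng(7)
n, p, q, h = 50, 3, 2, 0.5
X = rng.uniform(-1, 1, size=(n, p))
Ytil = np.sin(X[:, 0]) + 0.1 * rng.normal(size=n)
s = X[0]
V, _ = np.linalg.qr(rng.normal(size=(p, q)))

def parts(V):
    Xc = X - s
    d = (Xc**2).sum(axis=1) - ((Xc @ V)**2).sum(axis=1)
    K = np.exp(-0.5 * (d / h)**2)
    w = K / K.sum()
    y1 = (w * Ytil).sum()
    return Xc, d, w, y1, (w * Ytil**2).sum() - y1**2

def L_tilde(V):
    return parts(V)[4]

def grad_closed_form(V):
    Xc, d, w, y1, L = parts(V)
    coef = (L - (Ytil - y1)**2) * w * d / h**2            # scalar per observation
    grad_d = -2 * Xc[:, :, None] * (Xc @ V)[:, None, :]   # grad_V d_i, shape (n, p, q)
    return (coef[:, None, None] * grad_d).sum(axis=0)

eps, G = 1e-6, grad_closed_form(V)
G_fd = np.zeros_like(G)
for a in range(p):                                        # central finite differences
    for b in range(q):
        E = np.zeros((p, q)); E[a, b] = eps
        G_fd[a, b] = (L_tilde(V + E) - L_tilde(V - E)) / (2 * eps)
print(np.max(np.abs(G - G_fd)))                           # small (~1e-7 or below)
```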