Ensemble Conditional Variance Estimator for Sufficient Dimension Reduction
Lukas Fertl
Institute of Statistics and Mathematical Methods in Economics
Faculty of Mathematics and Geoinformation
TU Wien, Vienna, Austria
[email protected]
Efstathia Bura
Institute of Statistics and Mathematical Methods in Economics
Faculty of Mathematics and Geoinformation
TU Wien, Vienna, Austria
[email protected]

March 1, 2021

ABSTRACT
Ensemble Conditional Variance Estimation (ECVE) is a novel sufficient dimension reduction (SDR) method in regressions with continuous response and predictors.
ECVE applies to general non-additive error regression models. It operates under the assumption that the predictors can be replaced by a lower dimensional projection without loss of information. It is a semiparametric, forward regression model based, exhaustive sufficient dimension reduction estimation method that is shown to be consistent under mild assumptions. It outperforms central subspace mean average variance estimation (csMAVE), its main competitor, under several simulation settings and in a benchmark data set analysis.

Let (Ω, F, P) be a probability space. Let Y be a univariate continuous response and X a p-variate continuous predictor, jointly distributed, with (Y, X^T)^T : Ω → R^{p+1}. We consider the linear sufficient dimension reduction model

  Y = g_cs(B^T X, ε),   (1)

where X ∈ R^p is independent of the random variable ε, B is a p × k matrix of rank k, and g_cs : R^{k+1} → R is an unknown non-constant function.

[ZZ10, Thm. 1] showed that if (Y, X^T)^T has a joint continuous distribution, (1) is equivalent to

  Y ⊥⊥ X | B^T X,   (2)

where the symbol ⊥⊥ indicates stochastic independence. The matrix B is not unique; it can be replaced by any basis of its column space, span{B}. Let S denote a subspace of R^p, and let P_S denote the orthogonal projection onto S with respect to the usual inner product. If the response Y and predictor vector X are independent conditionally on P_S X, then P_S X can replace X as the predictor in the regression of Y on X without loss of information. Such subspaces S are called dimension reduction subspaces, and their intersection, provided it satisfies the conditional independence condition (2), is called the central subspace and is denoted by S_{Y|X} [see [Coo98, p. 105], [Coo07]].

By their equivalence, under both models (1) and (2), F_{Y|X}(y) = F_{Y|B^T X}(y) and S_{Y|X} = span{B}. Since the conditional distribution of Y | X is the same as that of Y | B^T X, B^T X contains all the information in X for modeling the target variable Y, and it can replace X without any loss of information.
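To fix ideas, the following minimal R sketch (ours, not code from the paper) simulates data from a model of the form (1) with a toy non-additive link g_cs; the response depends on X only through B^T X, so here S_{Y|X} = span{B}.

    # simulate n draws from Y = g_cs(B^T X, eps) with a toy, non-additive g_cs
    set.seed(1)
    n <- 500; p <- 6; k <- 2
    B <- qr.Q(qr(matrix(rnorm(p * k), p, k)))  # orthonormal p x k basis of span{B}
    X <- matrix(rnorm(n * p), n, p)            # continuous predictors
    eps <- rnorm(n)                            # error, independent of X
    BX <- X %*% B                              # the sufficient reduction B^T X
    Y <- sin(BX[, 1]) + exp(BX[, 2]) * eps     # error enters non-additively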
If the error term in model (1) is additive with E(ε | X) = 0, (1) reduces to Y = g(B^T X) + ε. Now, E(Y | X) = E(Y | B^T X) = E(Y | P_S X), where S = span{B}. The mean subspace, denoted by S_{E(Y|X)}, is the intersection of all subspaces S such that E(Y | X) = E(Y | P_S X) [CL02]. In this case, (1) becomes the classic mean subspace model with span{B} = S_{E(Y|X)}. [CL02] showed that the mean subspace is a subset of the central subspace, S_{E(Y|X)} ⊆ S_{Y|X}. Several linear sufficient dimension reduction (SDR) methods estimate S_{E(Y|X)} consistently ([AC09, MZ13, Li18, XTLZ02]). Linear refers to the reduction being a linear transformation of the predictor vector.
Minimum Average Variance Estimation (MAVE) [XTLZ02] is the most competitive and accurate method among them.
MAVE differs from the majority of SDR methods in that it is not inverse regression based, such as, for example, the widely used
Sliced Inverse Regression (SIR, [Li91]). MAVE requires minimal assumptions on the distribution of (Y, X^T)^T and is based on estimating the gradients of the regression function E(Y | X) via local-linear smoothing [CD88].
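As a point of reference for what MAVE builds on, the following minimal R sketch (an illustration of local-linear smoothing, not the MAVE implementation) computes a local-linear estimate of E(Y | X) and its gradient at a point x0 with Gaussian weights:

    # local-linear smoothing [CD88]: weighted least squares on centered predictors
    local_linear <- function(X, y, x0, h) {
      w <- exp(-rowSums(sweep(X, 2, x0)^2) / (2 * h^2))  # Gaussian kernel weights
      Z <- cbind(1, sweep(X, 2, x0))                     # intercept + (X - x0)
      beta <- solve(crossprod(Z, w * Z), crossprod(Z, w * y))
      list(fit = beta[1], grad = beta[-1])  # estimate of E(Y|X=x0) and its gradient
    }

MAVE aggregates such local gradients, which up to estimation error lie in the mean subspace, to recover S_{E(Y|X)}.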
The central subspace mean average variance estimation (csMAVE) [WX08, WY19] is the extension of MAVE that consistently and exhaustively estimates span{B} in model (1) without restrictive assumptions limiting its applicability. csMAVE has remained the gold standard since it was proposed by [WX08]. It is based on repeatedly applying MAVE to the sliced target variables f_u(Y) = 1{s_{u−1} < Y ≤ s_u}.

[YL11] introduced ensembles as a device to extend mean subspace to central subspace SDR methods. The ensemble approach of combining mean subspaces to span the central subspace comprises two components: (a) a rich family of functions of transformations for the response, and (b) a sampling mechanism for drawing the functions from the ensemble to ascertain coverage of the central subspace. To distinguish between families of functions and ensembles, [YL11] use the term parametric ensemble, which we define next.

Definition. A family F of measurable functions from R to R is called an ensemble. If F is a family of measurable functions indexed by a set Ω_T; i.e., F = {f_t : t ∈ Ω_T}, then F is called a parametric ensemble.

Let F be an ensemble and f ∈ F, and consider f(Y) for Y following model (1). The space S_{E(f(Y)|X)} is defined to be the mean subspace of the transformed random variable f(Y) [see [Coo98] or [CL02]].

Definition. An ensemble F characterizes the central subspace S_{Y|X} if

  span{S_{E(f_t(Y)|X)} : f_t ∈ F} = S_{Y|X}.   (5)

As an example, the parametric ensemble F = {f_t : t ∈ Ω_T} = {1{z ≤ t} : t ∈ R} characterizes the central subspace S_{Y|X}, since E(f_t(Y) | X) is the conditional cumulative distribution function evaluated at t. To see this, let B ∈ S(p, k) be such that E(f_t(Y) | X) = E(f_t(Y) | B^T X) for all t. Then, F_{Y|X}(t) = E(f_t(Y) | X) = E(f_t(Y) | B^T X) = F_{Y|B^T X}(t) for all t. Varying over the parametric ensemble F, in this case over t ∈ R, recovers the conditional cumulative distribution function. This indicator ensemble fully recovers the conditional distribution of Y | X and, thus, also the central subspace S_{Y|X}:

  span{S_{E(f_t(Y)|X)} : f_t ∈ F} = span{S_{E(1{Y ≤ t}|X)} : t ∈ R} = S_{Y|X}.

We next reproduce from [YL11] a list of parametric ensembles F, with associated regularity conditions, that can characterize S_{Y|X}.

Characteristic ensemble: F = {f_t : t ∈ Ω_T} = {exp(it·) : t ∈ R}.
Indicator ensemble: F = {1{z ≤ t} : t ∈ R}, where span{S_{E(f_t(Y)|X)} : f_t ∈ F} recovers the conditional cumulative distribution function.
Kernel ensemble: F = {h^{−1} K((z − t)/h) : t ∈ R, h > 0}, where K is a kernel suitable for density estimation and span{S_{E(f_t(Y)|X)} : f_t ∈ F} recovers the conditional density.
Polynomial ensemble: F = {z^t : t = 1, 2, 3, ...}, where span{S_{E(f_t(Y)|X)} : f_t ∈ F} recovers the conditional moment generating function.
Box-Cox ensemble: F = {(z^t − 1)/t : t ≠ 0} ∪ {log(z) : t = 0} (Box-Cox transforms).
Wavelet ensemble: Haar wavelets.

The characteristic and indicator ensembles describe the conditional characteristic and distribution function of Y | X, respectively, which always exist and determine the distribution uniquely. If the conditional density function f_{Y|X} of Y | X exists, then the kernel ensemble characterizes the conditional distribution of Y | X. Further, if the conditional moment generating function exists, then the polynomial ensemble characterizes S_{Y|X}. [YL11] used the ensemble device to extend MAVE [XTLZ02], which targets the mean subspace, to its ensemble version that also estimates the central subspace S_{Y|X} consistently.
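For instance, finite sub-families of the indicator and characteristic (Fourier) ensembles, like those used in the simulations of Section 6.1, can be represented in R as lists of functions. This is a minimal sketch; the quantile-based indicator thresholds follow the choice described in Section 6.1.

    # finite indicator ensemble: f_j(z) = 1{z <= q_j}, q_j empirical quantiles of y
    make_indicator_ensemble <- function(y, m) {
      q <- quantile(y, probs = (1:m) / (m + 1))
      lapply(q, function(t) function(z) as.numeric(z <= t))
    }
    # finite Fourier ensemble: sin(jz) and cos(jz), j = 1, ..., m/2
    make_fourier_ensemble <- function(m) {
      c(lapply(seq_len(m / 2), function(j) function(z) sin(j * z)),
        lapply(seq_len(m / 2), function(j) function(z) cos(j * z)))
    }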
Theorem 1 [YL11, Thm 2.1] establishes when an ensemble F is rich enough to characterize S_{Y|X}.

Theorem 1. Let B = {1_A : A is a Borel set in supp(Y)} be the set of indicator functions on supp(Y), and let L_2(F_Y) be the set of square integrable random variables with respect to the distribution F_Y of the response Y. If F ⊆ L_2(F_Y) is dense in B ⊆ L_2(F_Y), then the ensemble F characterizes the central subspace S_{Y|X}.

In Theorem 2 we show that finitely many functions of an ensemble F suffice to characterize the central subspace S_{Y|X}.

Theorem 2. If a parametric ensemble F characterizes S_{Y|X}, then there exist finitely many functions f_t ∈ F, with t = 1, ..., m and m ∈ N, such that span{S_{E(f_t(Y)|X)} : t ∈ {1, ..., m}} = S_{Y|X}.

Proof: Let k = dim(S_{Y|X}) ≤ p. Since F characterizes S_{Y|X}, dim(S_{E(f_t(Y)|X)}) = k_t ≤ k by (5) for any t. If k_t = 0, then S_{E(f_t(Y)|X)} = {0}, so the corresponding f_t does not contribute to (5). Assume k_t ≥ 1. If there were infinitely many S_{E(f_t(Y)|X)} ≠ {0} of dimension at least 1 whose span is S_{Y|X}, then infinitely many of them are identical; otherwise the dimension of the central subspace S_{Y|X} would be infinite, contradicting dim(S_{Y|X}) = k < ∞.

The importance of Theorem 2 lies in the fact that the search to characterize the central subspace is over a finite set, even though it does not offer tools for identifying the elements of the ensemble.

Throughout the paper, we refer to the following assumptions as needed.

(E.1). Model (1), Y = g_cs(B^T X, ε), holds with Y ∈ R, g_cs : R^k × R → R non-constant in the first argument, B = (b_1, ..., b_k) ∈ S(p, k), X ∈ R^p independent of ε, the distribution of X absolutely continuous with respect to the Lebesgue measure in R^p, supp(f_X) convex, and Var(X) = Σ_x positive definite.

(E.2). The density f_X : R^p → [0, ∞) of X is twice continuously differentiable with compact support supp(f_X).

(E.3). For a parametric ensemble F, its index set Ω_T is endowed with a probability measure F_T such that, for all t ∈ Ω_T with S_{E(f_t(Y)|X)} ≠ {0},

  P_{F_T}({t̃ ∈ Ω_T : S_{E(f_t̃(Y)|X)} = S_{E(f_t(Y)|X)}}) > 0.

(E.4). For an ensemble F we assume that, for all f ∈ F, the conditional expectation E(f(Y) | X) is twice continuously differentiable in the conditioning argument. Further, for all f ∈ F, E(|f(Y)|^8) < ∞.

Assumption (E.1) assures the existence and uniqueness of S_{Y|X} = span{B}. Furthermore, it allows the mean subspace to be a proper subset of the central subspace, i.e. S_{E(Y|X)} ⊊ S_{Y|X}. In Assumption (E.2), the compactness assumption for supp(f_X) is not as restrictive as it might seem: [YLC08, Prop. 11] showed that there is a compact set K ⊂ R^p such that S_{Y|X|K} = S_{Y|X}, where X|_K = X 1{X ∈ K}. Assumption (E.3) simply states that the set of indices that characterize the central subspace S_{Y|X} is not a null set.
In practice, the choice of the probability measure F_T on the index set Ω_T of a parametric ensemble F can always guarantee that this assumption is fulfilled. If the characteristic or indicator ensemble is used, (E.4) states that the conditional characteristic or distribution function is twice continuously differentiable. In this case, the 8th moments exist since the complex exponential and indicator functions are bounded.

Definition. For q ≤ p ∈ N, f ∈ F, and any V ∈ S(p, q), we define

  L̃_F(V, s, f) = Var(f(Y) | X ∈ s + span{V}),   (6)

where s ∈ R^p is a non-random shifting point.

Definition. Let F be a parametric ensemble and F_T a cumulative distribution function (cdf) on the index set Ω_T. For q ≤ p and any V ∈ S(p, q), we define

  L_F(V) = ∫_{Ω_T} ∫_{R^p} L̃_F(V, x, f_t) dF_X(x) dF_T(t) = E_{t∼F_T}(E_X(L̃_F(V, X, f_t))) = E_{t∼F_T}(L*_F(V, f_t)),   (7)

where F_X is the cdf of X, and

  L*_F(V, f_t) = E_X(L̃_F(V, X, f_t)).   (8)

For the identity function, f_t(z) = z, (8) is the target function of the conditional variance estimation proposed in [FB21]. If the random variable t is concentrated at a single point t_0, i.e. t ∼ δ_{t_0}, then the ensemble conditional variance estimator (ECVE) coincides with the conditional variance estimator (CVE).

The following theorem will be used in establishing the main result of this paper, which obtains the exhaustive sufficient reduction of the conditional distribution of Y given the predictor vector X.

Theorem 3. Assume (E.1) and (E.2) hold; in particular, model (1) holds. Let B̃ be a basis of S_{E(f_t(Y)|X)}; i.e., span{B̃} = S_{E(f_t(Y)|X)} ⊆ S_{Y|X} = span{B}. Then, for any f ∈ F for which assumption (E.4) holds,

  f(Y) = g(B̃^T X) + ε̃,   (9)

with E(ε̃ | X) = 0 and g : R^{k_t} → R a twice continuously differentiable function, where k_t = dim(S_{E(f_t(Y)|X)}).

By Theorem 3, any response Y can be written with an additive error via the decomposition (9). The predictors and the additive error term are only required to be conditionally uncorrelated in model (9). The conditional variance estimator [FB21] also estimated B̃ in (9), but under the more restrictive condition of predictor and error independence.

Proof of Theorem 3. Write

  f(Y) = E(f(Y) | X) + f(Y) − E(f(Y) | X) = E(f(Y) | X) + ε̃ = E(f(Y) | B̃^T X) + ε̃ = g(B̃^T X) + ε̃,

where ε̃ = f(Y) − E(f(Y) | X) and g(B̃^T X) = E(f(Y) | B̃^T X). By the tower property of conditional expectation, E(ε̃ | X) = E(f(Y) | X) − E(E(f(Y) | X) | X) = E(f(Y) | X) − E(f(Y) | X) = 0. The function g is twice continuously differentiable by (E.4).

Theorem 4. Assume (E.1) and (E.2) hold. Let F be a parametric ensemble, s ∈ supp(f_X) ⊂ R^p, and V ∈ S(p, q) defined in (3).
Then, for any f ∈ F for which assumption (E.4) holds,

  L̃_F(V, s, f) = μ_2(V, s, f) − μ_1(V, s, f)² + Var(ε̃ | X ∈ s + span{V}),   (10)

where

  μ_l(V, s, f) = ∫_{R^q} g(B̃^T s + B̃^T V r)^l f_X(s + V r) dr / ∫_{R^q} f_X(s + V r) dr = t^{(l)}(V, s, f) / t^{(0)}(V, s, f),   (11)

for g given in (9), with

  t^{(l)}(V, s, f) = ∫_{R^q} g(B̃^T s + B̃^T V r)^l f_X(s + V r) dr,   (12)

and

  Var(ε̃ | X ∈ s + span{V}) = E(ε̃² | X ∈ s + span{V}) = ∫_{supp(f_X)∩R^q} h(s + V r) f_X(s + V r) dr / ∫_{R^q} f_X(s + V r) dr = h̃(V, s, f) / t^{(0)}(V, s, f),   (13)

with E(ε̃² | X = x) = h(x) and h̃(V, s, f) = ∫_{supp(f_X)∩R^q} h(s + V r) f_X(s + V r) dr. Assume further that h(·) is continuous. Then L*_F(V, f_t) in (8) is well defined and continuous,

  V_{t,q} = argmin_{V ∈ S(p,q)} L*_F(V, f_t)   (14)

is well defined, and the conditional variance estimator of the transformed response f_t(Y) identifies S_{E(f_t(Y)|X)}:

  S_{E(f_t(Y)|X)} = span{V_{t,q}}^⊥.   (15)

[FB21] assumed the model Y = g(B^T X) + ε with ε ⊥⊥ X, which implies S_{E(Y|X)} = span{B} = S_{Y|X}, and showed that the conditional variance estimator (CVE) identifies S_{E(Y|X)} at the population level. Theorem 4 extends this result: CVE identifies the mean subspace S_{E(Y|X)} also in models of the form Y = g(B^T X) + ε̃, where ε̃ is merely conditionally uncorrelated with X. This allows CVE to apply to problems where the mean subspace is a proper subset of the central subspace, i.e. S_{E(Y|X)} ⊊ S_{Y|X}.

V_{t,q} in (14) is not unique since, for every orthogonal O ∈ R^{q×q}, L*_F(V_{t,q} O, f_t) = L*_F(V_{t,q}, f_t), as L*_F(V_{t,q}, f_t) depends on V_{t,q} only through span{V_{t,q}} by (6). Nevertheless, it is a unique minimizer over the Grassmann manifold Gr(p, q) in (4). To see this, suppose V ∈ S(p, q) is an arbitrary basis of a subspace M ∈ Gr(p, q). We can identify M with the projection P_M = VV^T. By (31), we write x = s + V r_1 + U r_2. Application of the Fubini-Tonelli theorem yields

  t̃^{(l)}(P_M, s, f) = ∫_{supp(f_X)} g(B̃^T s + B̃^T P_M x)^l f_X(s + P_M x) dx = t^{(l)}(V, s, f) ∫_{supp(f_X)∩R^{p−q}} dr_2.   (16)

Therefore t̃^{(l)}(P_M, s, f)/t̃^{(0)}(P_M, s, f) = t^{(l)}(V, s, f)/t^{(0)}(V, s, f), and μ_l(·, s, f) in (11) can also be viewed as a function from Gr(p, q) to R.

Next we define the ensemble conditional variance estimator (ECVE) for a parametric ensemble F that characterizes the central subspace S_{Y|X}. Following the ensemble minimum average variance estimation formulation in [YL11], we extend the original objective function by integrating over the index random variable t ∼ F_T in (7) that indexes the ensemble F.

Definition 5. Let

  V_q = argmin_{V ∈ S(p,q)} L_F(V).   (17)

The Ensemble Conditional Variance Estimator with respect to the ensemble F is defined to be any basis B_{p−q,F} of span{V_q}^⊥.

Theorem 6. Assume (E.1), (E.2), (E.3), and (E.4) hold, and that the function h(·) defined in Theorem 4 is continuous. Let F be a parametric ensemble that characterizes S_{Y|X}, with k = dim(S_{Y|X}), and let V be an element of the Stiefel manifold S(p, q), defined in (3), with q = p − k.
Then, V_q in (17) is well defined and

  S_{Y|X} = span{V_q}^⊥.   (18)

Assume (Y_i, X_i^T)^T, i = 1, ..., n, is an i.i.d. sample from model (1), and let

  d_i(V, s) = ‖X_i − P_{s+span{V}} X_i‖² = ‖X_i − s‖² − ⟨X_i − s, VV^T(X_i − s)⟩ = ‖(I_p − VV^T)(X_i − s)‖² = ‖Q_V(X_i − s)‖²,   (19)

where ⟨·, ·⟩ is the usual inner product in R^p, P_V = VV^T and Q_V = I_p − P_V. The estimators we propose involve a variation of kernel smoothing that depends on a bandwidth h_n. In our procedure, h_n is the squared width of a slice around the subspace s + span{V}. In order to obtain pointwise convergence for the ensemble CVE, we place the following bias and variance assumptions on the bandwidth, as is typical in nonparametric estimation.

(H.1). h_n → 0 as n → ∞.

(H.2). n h_n^{(p−q)/2} → ∞ as n → ∞.

In order to obtain consistency of the proposed estimator, Assumption (H.2) will be strengthened to log(n)/(n h_n^{(p−q)/2}) → 0. We also let K, which we refer to as the kernel, be a function satisfying the following assumptions.

(K.1). K : [0, ∞) → [0, ∞) is a non-increasing and continuous function with |K(z)| ≤ M, and ∫_{R^q} K(‖r‖²) dr < ∞ for q ≤ p − 1.

(K.2). There exist positive finite constants L_1 and L_2 such that K satisfies either (1) or (2) below:
(1) K(u) = 0 for |u| > L_2, and |K(u) − K(ũ)| ≤ L_1 |u − ũ| for all u, ũ;
(2) K(u) is differentiable with |∂_u K(u)| ≤ L_1, and |∂_u K(u)| ≤ L_1 |u|^{−ν} for some ν > 1 and |u| > L_2.

The Gaussian kernel K(z) = exp(−z²/2), for example, fulfills both (K.1) and (K.2) [see [Han08]] and is used throughout the paper. For i = 1, ..., n, we let

  w_i(V, s) = K(d_i(V, s)/h_n) / Σ_{j=1}^n K(d_j(V, s)/h_n)   (20)

and

  ȳ_l(V, s, f) = Σ_{i=1}^n w_i(V, s) f(Y_i)^l   for l = 1, 2.   (21)

We estimate L̃_F(V, s, f) in (10) with

  L̃_{n,F}(V, s, f) = ȳ_2(V, s, f) − ȳ_1(V, s, f)²,   (22)

and the objective function L*_F(V, f) in (8) with

  L*_n(V, f) = (1/n) Σ_{i=1}^n L̃_{n,F}(V, X_i, f),   (23)

where each data point X_i serves as a shifting point. For a parametric ensemble F = {f_t : t ∈ Ω_T} and (t_j)_{j=1,...,m_n} an i.i.d. sample from F_T with lim_{n→∞} m_n = ∞, the final estimate of the objective function in (7) is

  L_{n,F}(V) = (1/m_n) Σ_{j=1}^{m_n} L*_n(V, f_{t_j}).   (24)

The ensemble conditional variance estimator (ECVE) is defined to be any basis of span{V̂_q}^⊥, where

  V̂_q = argmin_{V ∈ S(p,q)} L_{n,F}(V).   (25)

We use the same algorithm as in [FB21] to solve the optimization problem (25). It requires the explicit form of the gradient of (24). Theorem 7 provides the gradient when a Gaussian kernel is used.

Theorem 7. The gradient of L̃_{n,F}(V, s, f) in (22) is given by

  ∇_V L̃_{n,F}(V, s, f) = (1/h_n²) Σ_{i=1}^n (L̃_{n,F}(V, s, f) − (f(Y_i) − ȳ_1(V, s, f))²) w_i(V, s) d_i(V, s) ∇_V d_i(V, s) ∈ R^{p×q},

and the gradient of L_{n,F}(V) in (24) is

  ∇_V L_{n,F}(V) = (1/(n m_n)) Σ_{i=1}^n Σ_{j=1}^{m_n} ∇_V L̃_{n,F}(V, X_i, f_{t_j}).

In the implementation of ECVE, we follow [FB21] and set the bandwidth to

  h_n = 1.2 (tr(Σ̂_x)/p) n^{−2/(4+p−q)},   (26)

where Σ̂_x = (1/n) Σ_i (X_i − X̄)(X_i − X̄)^T and X̄ = (1/n) Σ_i X_i.
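Putting (19)-(26) together, the following minimal R sketch (ours; the CVE package implements the actual algorithm) evaluates the sample objective L_{n,F}(V) of (24) for a given V, using the squared distances (19), the Gaussian kernel, and each X_i as a shifting point. Here ens is a list of ensemble functions such as those sketched earlier; the Stiefel-manifold minimization (25) itself is left to a gradient-based optimizer using Theorem 7.

    # sample ECVE objective L_{n,F}(V) of (24)
    Ln_ecve <- function(V, X, y, ens, hn) {
      n <- nrow(X)
      QV <- diag(ncol(X)) - V %*% t(V)              # Q_V = I_p - V V^T
      mean(sapply(ens, function(f) {                # outer average over f_t, eq. (24)
        fy <- f(y)
        mean(sapply(seq_len(n), function(i) {       # shifting points s = X_i, eq. (23)
          d <- colSums((QV %*% (t(X) - X[i, ]))^2)  # squared distances, eq. (19)
          w <- exp(-(d / hn)^2 / 2)                 # Gaussian kernel K(d_j/h_n)
          w <- w / sum(w)                           # within-slice weights, eq. (20)
          sum(w * fy^2) - sum(w * fy)^2             # ybar_2 - ybar_1^2, eqs. (21)-(22)
        }))
      }))
    }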
The set of points {x ∈ R^p : ‖x − P_{s+span{V}} x‖² ≤ h_n} represents a slice in R^p about s + span{V}. In the estimation of L_F(V) two different weighting schemes are used: (a) within slices, the weights defined in (20) are used to calculate (22); (b) between slices, equal weights 1/n are used to calculate (23). Another idea for the between-slices weighting is to assign more weight to slices that contain more points. This can be realized by replacing (23) with

  L^{(w)}_n(V, f) = Σ_{i=1}^n w̃(V, X_i) L̃_{n,F}(V, X_i, f),   (27)

with

  w̃(V, X_i) = (Σ_{j=1}^n K(d_j(V, X_i)/h_n) − 1) / (Σ_{l,u=1}^n K(d_l(V, X_u)/h_n) − n) = Σ_{j=1, j≠i}^n K(d_j(V, X_i)/h_n) / Σ_{l,u=1, l≠u}^n K(d_l(V, X_u)/h_n).   (28)

The denominator in (28) guarantees that the weights w̃(V, X_i) sum to one. If (27) is used instead of (23) in (24), we refer to the method as weighted ensemble conditional variance estimation. For example, if a rectangular kernel is used, Σ_{j≠i} K(d_j(V, X_i)/h_n) is the number of points X_j (j ≠ i) in the slice corresponding to L̃_{n,F}(V, X_i, f). Therefore, each slice is assigned weight proportional to the number of points X_j in it; the more observations we use for estimating L̃_{n,F}(V, X_i, f), the better its accuracy.

The consistency of ECVE derives from the consistency of CVE [FB21], which targets a specific S_{E(f_t(Y)|X)}, and from the fact that we can recover S_{Y|X} from the S_{E(f_t(Y)|X)} across all transformations f_t ∈ F = {f_t : t ∈ Ω_T} for an ensemble that characterizes S_{Y|X}. This is achieved in sequential steps from Theorem 8, which is the main building block, to Theorem 11. The proofs are technical and lengthy and are therefore given in the Appendix.

Theorem 8. Assume conditions (E.1), (E.2), (E.4), (K.1), (K.2), (H.1) hold, a_n = log(n)/(n h_n^{(p−q)/2}) = o(1), and a_n/h_n^{(p−q)/2} = O(1). Let F be a parametric ensemble such that E(|ε̃|^l | X = x) is continuous for l = 1, ..., 4, and the second conditional moment is twice continuously differentiable, where ε̃ is given by Theorem 3. Then L*_n(V, f), defined in (23), converges uniformly in probability to L*_F(V, f) in (8) for all f ∈ F; i.e.,

  sup_{V ∈ S(p,q)} |L*_n(V, f) − L*_F(V, f)| → 0 in probability as n → ∞.

Next, Theorem 9 shows that the conditional variance estimator is consistent for S_{E(f_t(Y)|X)} for any transformation f_t.

Theorem 9. Under the same conditions as Theorem 8, the conditional variance estimator span{B̂_{t,k_t}} estimates S_{E(f_t(Y)|X)} consistently for f_t ∈ F; that is,

  ‖P_{B̂_{t,k_t}} − P_{S_{E(f_t(Y)|X)}}‖ → 0 in probability as n → ∞,

where B̂_{t,k_t} is any basis of span{V̂_{t,k_t}}^⊥ with

  V̂_{t,k_t} = argmin_{V ∈ S(p,q)} L*_n(V, f_t),

q = p − k_t, and k_t = dim(S_{E(f_t(Y)|X)}).

A straightforward application of Theorem 9 with the identity function obtains that S_{E(Y|X)} can be consistently estimated by ECVE.

Theorem 10. Assume the conditions of Theorem 8 hold. Let F be a parametric ensemble such that sup_{t ∈ Ω_T} |f_t(Y)| < M < ∞ almost surely. Then L_{n,F}(V) in (24) converges uniformly in probability to L_F(V) in (7).

Theorem 11. Assume the conditions of Theorem 8 and (E.3) hold. Let F be a parametric ensemble that characterizes S_{Y|X} and whose members satisfy sup_{t ∈ Ω_T} |f_t(Y)| < M < ∞ almost surely.
Also, assume the index random variable t ∼ F_T is independent of the data (Y_i, X_i)_{i=1,...,n}. Then the ensemble conditional variance estimator (ECVE) is a consistent estimator of S_{Y|X}. That is, for any basis B̂_{p−q,F} of span{V̂_q}^⊥, where V̂_q is defined in (25) with q = p − k and k = dim(S_{Y|X}),

  ‖P_{B̂_{p−q,F}} − P_{S_{Y|X}}‖ → 0 in probability as n → ∞,

where P_M denotes the orthogonal projection onto the range space of the matrix or linear subspace M.

Effect of m_n on ECVE

In this section we study the influence of the number of functions of the ensemble F, m_n in (24), on the accuracy of the ensemble conditional variance estimation. In Theorems 10 and 11, how fast m_n approaches ∞ is left unspecified. We consider the 2-dimensional regression model

  Y = (b_1^T X)² + (0.5 (b_2^T X))² ε,   (29)

where p = 10, k = 2, X ∼ N(0, I_p), ε ∼ N(0, 1) independent of X, b_1 = (1, 0, ..., 0)^T ∈ R^p, and b_2 = (0, 1, 0, ..., 0)^T ∈ R^p. Therefore, S_{E(Y|X)} = span{b_1} ⊊ S_{Y|X} = span{B}, with B = (b_1, b_2).

We set the sample size to n = 300 and vary m = |F| over a grid of values for the following ensembles: (a) indicator, F_{m,Indicator} = {1{x ≥ q_j} : j = 1, ..., m}, where q_j is the j/(m+1)-th empirical quantile of (Y_i)_{i=1,...,n}; (b) characteristic or Fourier, F_{m,Fourier} = {sin(jx) : j = 1, ..., m/2} ∪ {cos(jx) : j = 1, ..., m/2}; (c) monomial, F_{m,Monom} = {x^j : j = 1, ..., m}; and (d) Box-Cox, F_{m,BoxCox} = {(x^{t_j} − 1)/t_j : j = 1, ..., m − 1} ∪ {log(x)}, with exponents t_j equally spaced in (0, 0.5].

For each ensemble, we form the ensemble conditional variance estimator and its weighted version as in Section 4.1 [see also [FB21]]. The results of 100 replications for each method and each m are displayed in Figure 1. We assess the estimation accuracy with err_{j,m} = ‖B̂B̂^T − BB^T‖/(2k)^{1/2}, j = 1, ..., 100, over the considered values of m.

Figure 1: Box plots of the estimation errors over 100 replications of model (29) with n = 300, over increasing values of m = |F|, for the four ensembles.

csMAVE, ECVE's main competitor, does not vary with m; its estimate of the central subspace has median error 0.2, with a wide range from 0.1 to 0.6. The estimation accuracy of the Fourier, indicator and Box-Cox ECVE varies over m and is on par with csMAVE, or better, for some values of m. For the Fourier basis, fewer basis functions give the best performance; the indicator and Box-Cox ensembles are quite robust against varying m, whereas the errors grow rapidly with m for the monomial ensemble. The weighted version of ECVE improves the accuracy for all ensembles; the weighted Fourier, indicator and Box-Cox ECVE are on par with or more accurate than csMAVE. In sum, the simulation results support choosing a small number m of basis functions. Based on this and further unreported simulations, we set the default value of m to

  m_n = ⌈log(n)⌉ if ⌈log(n)⌉ is even, and m_n = ⌈log(n)⌉ + 1 if ⌈log(n)⌉ is odd,   (30)

for all simulations in Sections 6.2 and 6.3 and for the data analysis in Section 7.
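A direct R transcription of the default rule (30), the smallest even integer at least ⌈log(n)⌉, is for example:

    # default ensemble size m_n of (30)
    default_m <- function(n) {
      m <- ceiling(log(n))
      if (m %% 2 == 0) m else m + 1
    }
    default_m(300)  # ceiling(log(300)) = 6, already even, so m_n = 6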
We explore the consistency rate of the conditional variance estimator (CVE), the ensemble conditional variance estimator (ECVE), csMAVE, and mMAVE in model (29). Specifically, we apply seven estimation methods, the first five targeting the central subspace S_{Y|X} and the last two S_{E(Y|X)}, as follows. For S_{Y|X}, we compare ECVE for the indicator (I), Fourier (II), monomial (III) and Box-Cox (IV) ensembles, as in Section 6.1, and csMAVE (V). For S_{E(Y|X)}, we use CVE (VI) of [FB21] and mMAVE (VII) of [XTLZ02].

The simulation is performed as follows. We generate 100 i.i.d. samples (Y_i, X_i^T)_{i=1,...,n} from (29) for each of six increasing sample sizes, the smallest being n = 100. Model (29) is a two-dimensional model with S_{E(Y|X)} = span(b_1) ⊊ S_{Y|X} = span(B). For methods (I)-(V), we set k = 2 and estimate B ∈ R^{10×2}. For (VI) and (VII), we set k = 1 and estimate b_1 ∈ R^{10}. Then we calculate err_{j,n} = ‖B̂B̂^T − BB^T‖/(2k)^{1/2}, j = 1, ..., 100, for each n. Figure 2 displays the distribution of err_{j,n} for increasing n for the seven methods.

Figure 2: Estimation error distribution for model (29) over increasing sample size n for the seven methods (I-VII).

As the sample size increases, ECVE indicator, ECVE Fourier and csMAVE are on par with respect to both speed and accuracy. The accuracy of ECVE Box-Cox improves as the sample size increases, but at a slower rate. There is no improvement in the accuracy of ECVE monomial. This is not surprising, as the monomial, as well as the Box-Cox ensemble, do not satisfy the assumption sup_{t∈Ω_T} |f_t(Y)| < M < ∞ of Theorem 11, in contrast to the indicator and Fourier ensembles. The Fourier and indicator ECVE and csMAVE estimate S_{Y|X} = span{B} consistently, and the mean subspace methods, CVE and mMAVE, estimate S_{E(Y|X)} = span{b_1} consistently.

We consider seven models (M1-M7), defined in Table 1, three sample sizes n ∈ {100, 200, 400}, and three different distributions of the predictor vector X = Σ^{1/2} Z ∈ R^p, where Σ = (Σ_{ij})_{i,j=1,...,p} with Σ_{ij} = 0.5^{|i−j|}. Throughout, p = 10, B comprises the first k columns of I_p, and ε ∼ N(0, 1) independent of X. As in [WX08], we consider three distributions for Z ∈ R^p: (I) N(0, I_p); (II) the p-dimensional uniform distribution on [−√3, √3]^p, i.e. all components of Z are independent and uniformly distributed; and (III) a mixture distribution N(0, I_p) + μ, where μ = (μ_1, ..., μ_p)^T ∈ R^p with μ_j = 2 and μ_l = 0 for l ≠ j, and j uniformly distributed on {1, ..., p}.

Table 1: Models
Name  Model                                                                       S_{E(Y|X)}      S_{Y|X}              k
M1    Y = (b_1^T X)^{−1} + 0.5 ε                                                  span{b_1}       span{b_1}            1
M2    Y = cos(2 b_1^T X) + cos(b_2^T X) + 0.5 ε                                   span{b_1,b_2}   span{b_1,b_2}        2
M3    Y = (b_1^T X)² + (0.5 (b_2^T X))² ε                                         span{b_1}       span{b_1,b_2}        2
M4    Y = b_1^T X / (0.5 + (1.5 + b_2^T X)²) + (|b_1^T X| + (b_2^T X)² + 0.5) ε   span{b_1,b_2}   span{b_1,b_2}        2
M5    Y = b_1^T X + sin(b_2^T X · b_3^T X) ε                                      span{b_1}       span{b_1,b_2,b_3}    3
M6    Y = 0.5 (b_1^T X)² ε                                                        span{0}         span{b_1}            1
M7    Y = cos(b_1^T X − π) + cos(2 b_1^T X) ε                                     span{b_1}       span{b_1}            1

The simple and weighted [see Section 4.1] Fourier and indicator ensembles are used to form four ensemble conditional variance estimators (ECVE). The monomial and Box-Cox ensembles were also used but did not give satisfactory results and are not reported. The four ECVE estimators are compared against the reference method csMAVE [WX08], which is implemented in the R package MAVE. The source code for conditional variance estimation and its ensemble version is available at https://git.art-ist.cc/daniel/CVE.

We set q = p − k and generate r = 100 replicates of models M1-M7 with the specified distribution of X and sample size n, and estimate B using the four ECVE methods and csMAVE. The accuracy of the estimates is assessed using

  err = ‖P_B − P_B̂‖/√(2k) ∈ [0, 1],

where P_B = B(B^T B)^{−1} B^T is the orthogonal projection onto span{B}. The factor √(2k) normalizes the distance, with values closer to zero indicating better agreement and values closer to one indicating strong disagreement.
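This error measure can be computed as in the following minimal R sketch (for orthonormal bases, as returned by the methods, the projections simplify to B B^T):

    # subspace estimation error: ||P_B - P_Bhat||_F / sqrt(2k), in [0, 1]
    subspace_err <- function(B, Bhat) {
      PB    <- B %*% solve(crossprod(B), t(B))          # projection onto span{B}
      PBhat <- Bhat %*% solve(crossprod(Bhat), t(Bhat)) # projection onto span{Bhat}
      norm(PB - PBhat, type = "F") / sqrt(2 * ncol(B))
    }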
The results are displayed in Tables 2-8. In M1, which is taken from [WX08], the mean subspace agrees with the central subspace, i.e. S_{E(Y|X)} = S_{Y|X}, but due to the unboundedness of the link function g(x) = 1/x, most mean subspace estimation methods, such as SIR, mean MAVE and CVE, fail. In contrast, all four ensemble CVE methods and csMAVE succeed in identifying the minimal dimension reduction subspace, with ensemble CVE performing slightly better, as can be seen in Table 2; in particular, Fourier is the best performing method. M2 is a two-dimensional mean subspace model, i.e. S_{E(Y|X)} = S_{Y|X}, and in Table 3 we see that csMAVE is the best performing method. M3 is the same as model (29), and here the mean subspace is a proper subset of the central subspace. In Table 4 we see that Indicator_weighted and csMAVE are the best performers and are roughly on par. In M4, the two-dimensional mean subspace, which also determines the heteroskedasticity, agrees with the central subspace. In Table 5 we see that this model is quite challenging for all methods; only Indicator_weighted and csMAVE give satisfactory results, with Indicator_weighted the clear winner. In M5, the heteroskedasticity is induced by an interaction term, and the three-dimensional central subspace is a proper superset of the one-dimensional mean subspace. In Table 6 we see that M5 is quite challenging for all five methods, and we therefore increase the sample size n. For M5, the two weighted ensemble conditional variance estimators are the best performing methods, followed by csMAVE. M6 is a one-dimensional pure central subspace model with trivial mean subspace, S_{E(Y|X)} = {0}. In Table 7 we see that for n = 100 the two weighted ECVEs are the best performing methods, while for larger sample sizes csMAVE is slightly more accurate than the ECVE methods.
In M7, the one-dimensional mean subspace agrees with the central subspace, i.e. S_{E(Y|X)} = S_{Y|X}, and the conditional first and second moments, E(Y^l | X) for l = 1, 2, are highly nonlinear and periodic functions of the sufficient reduction. In Table 8, we see that all ensemble conditional variance estimators clearly outperform csMAVE.

Tables 2-8: Mean and standard deviation (in parentheses) of the estimation errors of M1-M7, respectively, by predictor distribution (I-III) and sample size n, for Fourier, Fourier_weighted, Indicator, Indicator_weighted and csMAVE.

We apply the ensemble conditional variance estimator and csMAVE to the Boston Housing data set. This data set has been used extensively as a benchmark for assessing regression methods [see, for example, [JWHT13]] and is available in the R package mlbench. The data contain 506 instances of 14 variables from the 1970 Boston census, 13 of which are continuous. The binary variable chas, indexing proximity to the Charles river, is omitted from the analysis since ensemble conditional variance estimation operates under the assumption of continuous predictors. The target variable is the median value of owner-occupied homes, medv, in $1,000s. The 12 predictors are crim (per capita crime rate by town), zn (proportion of residential land zoned for lots over 25,000 sq.ft.), indus (proportion of non-retail business acres per town), nox (nitric oxides concentration, in parts per 10 million), rm (average number of rooms per dwelling), age (proportion of owner-occupied units built prior to 1940), dis (weighted distances to five Boston employment centres), rad (index of accessibility to radial highways), tax (full-value property-tax rate per $10,000), ptratio (pupil-teacher ratio by town), lstat (percentage of lower status of the population), and b, which stands for 1000(B − 0.63)², where B is the proportion of blacks by town.

We analyze these data with the weighted and unweighted Fourier and indicator ensembles, and with csMAVE. We compute unbiased error estimates by leave-one-out cross-validation.
We estimate the sufficient reduction with the five methods from the standardized training set, fit the forward model on the reduced training set using mars, multivariate adaptive regression splines [Fri91], from the R package mda, and predict the target variable on the test set; a minimal sketch of this cross-validation loop is given at the end of this section. We report results for dimension k = 1. The analysis was repeated with k = 2, with similar results. Table 9 reports the first quartile, median, mean and third quartile of the out-of-sample prediction errors. The reductions estimated by the ensemble CVE methods achieve lower mean and median prediction errors than csMAVE. Also, both ensemble CVE and csMAVE are approximately on par with the variable selection methods in [JWHT13, Section 8.3.3].

Table 9: Summary statistics (25% quantile, median, mean, 75% quantile) of the out-of-sample prediction errors for the Boston Housing data obtained by leave-one-out cross-validation, for Fourier, Fourier_weighted, Indicator, Indicator_weighted and csMAVE.

Moreover, we plot the standardized response medv against the reduced Fourier and csMAVE predictors, B̂^T X, in Figure 3. The sufficient reductions are estimated using the entire data set. A particular feature of these data is that the response medv appears to be truncated, as the highest median price of exactly $50,000 is reported in 16 cases. Both methods pick up similar patterns, which is reflected in the relatively high absolute correlation of the coefficients of the two reductions, |B̂_Fourier^T B̂_csMAVE|. The coefficients of the reductions, B̂_Fourier and B̂_csMAVE, are reported in Table 10. For the Fourier ensemble, the variables rm and lstat have the highest influence on the target variable medv. This agrees with the analysis in [JWHT13, Section 8.3.4], where these two variables were found to be by far the most important, using variable selection techniques such as random forests and boosted regression trees. In contrast, the reduction estimated by csMAVE has a lower coefficient for rm and higher ones for crim and rad.

Table 10: Rounded coefficients of the estimated reductions B̂_Fourier and B̂_csMAVE from the full Boston Housing data, for crim, zn, indus, nox, rm, age, dis, rad, tax, ptratio, b and lstat.

Figure 3: Panel A: Y vs. B̂_Fourier^T X. Panel B: Y vs. B̂_csMAVE^T X.
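The following is a minimal R sketch of the leave-one-out loop described above; fit_reduction is a hypothetical placeholder for any of the five reduction estimators, and the mars/predict interface of the mda package is assumed.

    library(mda)
    # leave-one-out squared prediction errors for a reduction-then-mars pipeline
    loo_errors <- function(X, y, fit_reduction, k = 1) {
      n <- nrow(X)
      sapply(seq_len(n), function(i) {
        Xtr <- scale(X[-i, , drop = FALSE])      # standardized training set
        B   <- fit_reduction(Xtr, y[-i], k)      # p x k basis (hypothetical helper)
        fit <- mars(Xtr %*% B, y[-i])            # forward model on reduced data
        xte <- scale(X[i, , drop = FALSE],       # standardize test point alike
                     center = attr(Xtr, "scaled:center"),
                     scale  = attr(Xtr, "scaled:scale"))
        (y[i] - predict(fit, xte %*% B))^2       # squared prediction error
      })
    }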
In this paper, we extend the mean subspace conditional variance estimation (CVE) of [FB21] to the ensemble conditional variance estimation (ECVE), which exhaustively estimates the central subspace, by applying the ensemble device introduced by [YL11]. In Section 5 we showed that the new estimator is consistent for the central subspace. The regularity conditions for consistency require that the joint distribution of the target variable and predictors, (Y, X^T)^T, be sufficiently smooth. They are comparable to those under which the main competitor, csMAVE [WX08], is consistent. We analysed the estimation accuracy of ECVE in Section 6 and found that it is either on par with csMAVE or exhibits substantial performance improvement in certain models. We could not characterize the defining features of the models for which ensemble conditional variance estimation outperforms csMAVE. This is an interesting line of further research, together with establishing more theoretical results, such as the rate of convergence, the estimation of the structural dimension, and the limiting distribution of the estimator.

ECVE identifies the central subspace via the orthogonal complement and thus circumvents the estimation and inversion of the variance matrix of the predictors X. This renders the method formally applicable to settings where the sample size n is small or smaller than the number of predictors p, and leads to potential future research.

Throughout, the dimension of the central subspace, k = dim(S_{Y|X}), is assumed to be known. The derivation of asymptotic tests for dimension is technically very challenging due to the lack of a closed-form solution and the lack of independence of all quantities in the calculation. The dimension can be estimated via cross-validation, as in [WX08] and [FB21], or via information criteria.

Acknowledgements

The authors gratefully acknowledge the support of the Austrian Science Fund (FWF P 30690-N35) and thank Daniel Kapla for his programming assistance. Daniel Kapla also co-authored the CVE R package that implements the proposed method.

References

[AC09] Kofi P. Adragni and R. Dennis Cook. Sufficient dimension reduction and prediction in regression. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 367(1906):4385-4405, 2009.
[Ame85] Takeshi Amemiya. Advanced Econometrics. Harvard University Press, 1985.
[Boo02] W. M. Boothby. An Introduction to Differentiable Manifolds and Riemannian Geometry. Academic Press, 2002.
[CD88] William S. Cleveland and Susan J. Devlin. Locally weighted regression: An approach to regression analysis by local fitting. Journal of the American Statistical Association, 83(403):596-610, 1988.
[CL02] R. Dennis Cook and Bing Li. Dimension reduction for conditional mean in regression. Annals of Statistics, 30(2):455-474, 2002.
[Coo98] R. Dennis Cook. Regression Graphics: Ideas for Studying Regressions Through Graphics. Wiley, New York, 1998.
[Coo07] R. Dennis Cook. Fisher lecture: Dimension reduction in regression. Statistical Science, 22(1):1-26, 2007.
[Fad85] Arnold M. Faden. The existence of regular conditional probabilities: Necessary and sufficient conditions. The Annals of Probability, 13(1):288-298, 1985.
[FB21] Lukas Fertl and Efstathia Bura. Conditional variance estimator for sufficient dimension reduction, 2021.
[Fri91] Jerome H. Friedman. Multivariate adaptive regression splines. The Annals of Statistics, 19(1):1-67, 1991.
[GH94] Phillip Griffiths and Joseph Harris. Principles of Algebraic Geometry. Wiley Classics Library. John Wiley & Sons, Inc., New York, 1994. Reprint of the 1978 original.
[Han08] Bruce E. Hansen. Uniform convergence rates for kernel estimation with dependent data. Econometric Theory, 24:726-748, 2008.
[Heu95] H. Heuser. Analysis 2, 9. Auflage. Teubner, 1995.
[Jen69] Robert I. Jennrich. Asymptotic properties of non-linear least squares estimators. Annals of Mathematical Statistics, 40(2):633-643, 1969.
[JWHT13] Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani. An Introduction to Statistical Learning: with Applications in R. Springer, 2013.
[Kar93] Alan F. Karr. Probability. Springer Texts in Statistics. Springer-Verlag, New York, 1993.
[Li91] K. C. Li. Sliced inverse regression for dimension reduction. Journal of the American Statistical Association, 86(414):316-327, 1991.
[Li18] Bing Li. Sufficient Dimension Reduction: Methods and Applications with R. CRC Press, Taylor & Francis Group, 2018.
[LJFR04] D. Leao Jr., M. Fragoso, and P. Ruffino. Regular conditional probability, disintegration of probability and Radon spaces. Proyecciones (Antofagasta), 23:15-29, 2004.
[MMW+63] M. R. Mickey, P. B. Mundle, D. N. Walker, A. M. Glinski, Inc C-E-I-R, and Aerospace Research Laboratories (U.S.). Test Criteria for Pearson Type III Distributions. Aerospace Research Laboratories, Office of Aerospace Research, United States Air Force, 1963.
[MZ13] Yanyuan Ma and Liping Zhu. A review on dimension reduction. International Statistical Review, 81(1):134-150, 2013.
[S.N27] S. N. Bernstein. Theory of Probability. 1927.
[Tag11] Hemant D. Tagare. Notes on optimization on Stiefel manifolds, January 2011.
[WX08] Hansheng Wang and Yingcun Xia. Sliced regression for dimension reduction. Journal of the American Statistical Association, 103(482):811-821, 2008.
[WY19] Hang Weiqiang and Xia Yingcun. MAVE: Methods for Dimension Reduction, 2019. R package version 1.3.10.
[XTLZ02] Yingcun Xia, Howell Tong, W. K. Li, and Li-Xing Zhu. An adaptive estimation of dimension reduction space. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 64(3):363-410, 2002.
[YL11] Xiangrong Yin and Bing Li. Sufficient dimension reduction based on an ensemble of minimum average variance estimators. Annals of Statistics, 39(6):3392-3416, 2011.
[YLC08] Xiangrong Yin, Bing Li, and R. Dennis Cook. Successive direction extraction for estimating the central subspace in a multiple-index regression. Journal of Multivariate Analysis, 99:1733-1757, 2008.
[ZZ10] Peng Zeng and Yu Zhu. An integral transform method for estimating the central mean and central subspaces. Journal of Multivariate Analysis, 101(1):271-290, 2010.

Appendix

For any V ∈ S(p, q), defined in (3), we generically denote a basis of the orthogonal complement of its column space span{V} by U. That is, U ∈ S(p, p−q) with span{V} ⊥ span{U} and span{V} ⊕ span{U} = R^p, so that U^T V = 0 ∈ R^{(p−q)×q} and U^T U = I_{p−q}. For any x, s ∈ R^p we can always write

  x = s + P_V(x − s) + P_U(x − s) = s + V r_1 + U r_2,   (31)

where r_1 = V^T(x − s) ∈ R^q and r_2 = U^T(x − s) ∈ R^{p−q}.

Proof of Theorem 4. The density of X | X ∈ s + span{V} is given by

  f_{X | X ∈ s+span{V}}(r_1) = f_X(s + V r_1) / ∫_{R^q} f_X(s + V r_1) dr_1,   (32)

where X is the p-dimensional continuous random predictor vector with density f_X(x), s ∈ supp(f_X) ⊂ R^p, and V belongs to the Stiefel manifold S(p, q) defined in (3).
Equation (32) follows from Theorem 3.1 of [LJFR04] and the fact that (R^p, B(R^p)), where B(R^p) denotes the Borel sets of R^p, is a Polish space, which in turn guarantees the existence of the regular conditional probability of X | X ∈ s + span{V} [see also [Fad85]]. Further, the measure is concentrated on the affine subspace s + span{V} ⊂ R^p with density (32), which follows from Definition 8.38 and Theorem 8.39 of [Kar93], the orthogonal decomposition (31), and the continuity of f_X (E.2).

By assumption (E.1), Y = g_cs(B^T X, ε) with ε ⊥⊥ X. Let f ∈ F be such that assumption (E.4) holds, and let B̃ be a basis of S_{E(f_t(Y)|X)}; that is, span{B̃} = S_{E(f_t(Y)|X)} ⊆ S_{Y|X} = span{B}. By Theorem 3, f(Y) = g(B̃^T X) + ε̃ with E(ε̃ | X) = 0 and g twice continuously differentiable. Therefore,

  L̃_F(V, s, f) = Var(f(Y) | X ∈ s + span{V})
   = Var(g(B̃^T X) | X ∈ s + span{V}) + 2 cov(ε̃, g(B̃^T X) | X ∈ s + span{V}) + Var(ε̃ | X ∈ s + span{V})
   = Var(g(B̃^T X) | X ∈ s + span{V}) + Var(ε̃ | X ∈ s + span{V}).   (33)

The covariance term in (33) vanishes since

  cov(ε̃, g(B̃^T X) | X ∈ s + span{V}) = E(E(ε̃ | X) g(B̃^T X) | X ∈ s + span{V}) − E(g(B̃^T X) | X ∈ s + span{V}) E(E(ε̃ | X) | X ∈ s + span{V}) = 0,

using E(ε̃ | X) = 0 and the fact that the sigma field generated by {X ∈ s + span{V}} is a subset of that generated by X. By the same argument and using (32),

  Var(ε̃ | X ∈ s + span{V}) = E(ε̃² | X ∈ s + span{V}) = E(E(ε̃² | X) | X ∈ s + span{V}) = E(h(X) | X ∈ s + span{V})
   = ∫_{supp(f_X)∩R^q} h(s + V r) f_X(s + V r) dr / t^{(0)}(V, s, f),

where E(ε̃² | X = x) = h(x). Using (32) again for the first term in (33) obtains formulas (10) and (13).

To see that (7), (10), and (13) are well defined and continuous, let g̃(V, s, r) = g(B̃^T s + B̃^T V r)^l f_X(s + V r) for l = 1, 2, or g̃(V, s, r) = h(s + V r) f_X(s + V r) for (13); these are continuous by assumption. In consequence, the parameter-dependent integrals (12) and (13) are well defined and continuous if (1) g̃(V, s, ·) is integrable for all V ∈ S(p, q) and s ∈ supp(f_X), (2) g̃(·, ·, r) is continuous for all r, and (3) there exists an integrable dominating function of g̃ that does not depend on V and s [see [Heu95, p. 101]].

Furthermore, for some compact set K, t^{(l)}(V, s) = ∫_K g̃(V, s, r) dr, since supp(f_X) is compact by (E.2). The function g̃(V, s, r) is continuous in all arguments by the continuity of g (E.4) and of f_X (E.2), and therefore attains a maximum. In consequence, all three conditions are satisfied, so that t^{(l)}(V, s) is well defined and continuous. By the same argument, (13) is well defined and continuous. Next, μ_l(V, s) = t^{(l)}(V, s)/t^{(0)}(V, s) is continuous since t^{(0)}(V, s) > 0 for all interior points s ∈ supp(f_X), by the continuity of f_X, the convexity of the support, and Σ_x > 0.
Then L̃_F(V, s, f) in (10) is continuous, which results in (8) also being well defined and continuous, by virtue of it being a parameter-dependent integral, following the same arguments as above. Moreover, (14) exists as the minimizer of a continuous function over the compact set S(p, q). Then, (8) can be written as

  L*_F(V, f) = E_{s∼X}(μ_2(V, s, f) − μ_1(V, s, f)²) + E_{s∼X}(Var(ε̃ | X ∈ s + span{V})),   (34)

where s ∼ X signifies that s is distributed as X and the expectation is taken with respect to the distribution of s.

It now suffices to show that the second term on the right-hand side of (34) is constant with respect to V. By the law of total variance,

  Var(ε̃) = E(Var(ε̃ | X ∈ s + span{V})) + Var(E(ε̃ | X ∈ s + span{V})) = E(Var(ε̃ | X ∈ s + span{V})),   (35)

since E(ε̃ | X ∈ s + span{V}) = E(E(ε̃ | X) | X ∈ s + span{V}) = 0. Inserting (35) into (34) obtains

  L*_F(V, f_t) = E(μ_2(V, X, f_t) − μ_1(V, X, f_t)²) + Var(ε̃) = E_{s∼X}(Var(g(B̃^T X) | X ∈ s + span{V})) + Var(ε̃).   (36)

Next we show that (8), or equivalently (36), attains its minimum at V ⊥ B̃. Let s ∈ supp(f_X) and V = (v_1, ..., v_q) ∈ R^{p×q} be such that v_u ∈ span{B} for some u ∈ {1, ..., q}. Since X ∈ s + span{V} ⟺ X = s + P_V(X − s), the first term in (36) satisfies

  Var(g(B̃^T X) | X ∈ s + span{V}) = Var(g(B̃^T X) | X = s + VV^T(X − s)) = Var(g(B̃^T s + B̃^T VV^T(X − s)) | X = s + VV^T(X − s)) ≥ 0.   (37)

If (37) is positive, i.e. if B̃^T VV^T(X − s) ≠ 0 with positive probability, the lower bound is not attained. If it is zero, i.e. for V such that V and B̃ are orthogonal, then L*_F(V, f) = Var(ε̃). Since s is arbitrary yet constant, the same inequality holds for (8); that is, (8) attains its minimum for V such that V and B̃ are orthogonal. Since span{B̃} = S_{E(f_t(Y)|X)}, (14) follows.

Proof of Theorem 6. Under assumptions (E.1), (E.2), and (E.3), (7) is well defined and continuous by arguments analogous to those in the proof of Theorem 4. Therefore (17) exists as a minimizer of a continuous function over the compact set S(p, q).

To show S_{Y|X} = span{V_q}^⊥, let S̃ ≠ S_{Y|X} with dim(S̃) = dim(S_{Y|X}) = k. Also, let Z ∈ R^{p×(p−k)} be an orthonormal basis of S̃^⊥. Suppose L_F(Z) = min_{V ∈ S(p,p−k)} L_F(V). By (14) and (15) in Theorem 4, L*_F(V, f_t), considered as a function on R^{p×(p−k_t)}, is minimized by an orthonormal basis of S_{E(f_t(Y)|X)}^⊥ with p − k_t elements, where k_t = dim(S_{E(f_t(Y)|X)}) ≤ k. By (E.1), S_{E(f_t(Y)|X)} ⊆ S_{Y|X} = span{B}. As in the proof of Theorem 4, we obtain that L*_F(V, f_t), as a function on R^{p×(p−k)}, is minimized by an orthonormal basis U ∈ R^{p×(p−k)} of span{B}^⊥. Since S̃ = span{Z}^⊥ ≠ span{U}^⊥ = S_{Y|X}, we can rearrange the bases U = (U_1, U_2) and Z = (Z_1, Z_2) such that span{U_1} = span{Z_1} and span{U_2} ≠ span{Z_2}.
Since F characterizes S_{Y|X}, the set A = {t ∈ Ω_T : S_{E(f_t(Y)|X)} ⊄ S̃} is non-empty, and by (E.3), A is not a null set with respect to the probability measure F_T. For t ∈ A, Z is not orthogonal to S_{E(f_t(Y)|X)}, so L*_F(Z, f_t) > L*_F(U, f_t), while L*_F(Z, f_t) = L*_F(U, f_t) on A^c. Thus,

  min_{V ∈ S(p,p−k)} L_F(V) = L_F(Z) = E_{t∼F_T}(L*_F(Z, f_t)) = ∫_A L*_F(Z, f_t) dF_T(t) + ∫_{A^c} L*_F(Z, f_t) dF_T(t) > E_{t∼F_T}(L*_F(U, f_t)),

which contradicts our assumption that L_F(Z) = min_{V ∈ S(p,p−k)} L_F(V).

Next we introduce notation and auxiliary lemmas for the proof of Theorem 8. We suppose all assumptions of Theorem 8 hold. We generically use the letter C to denote constants.

Suppose f is an arbitrary element of F and let

  Ỹ_i = f(Y_i) = g(B̃^T X_i) + ε̃_i,   (38)

with span{B̃} = S_{E(Ỹ|X)} = S_{E(f(Y)|X)}. Condition (E.4) yields that g is twice continuously differentiable and E(|Ỹ|^8) < ∞. Since f is fixed, we suppress it in t^{(l)}(V, s, f) and h̃(V, s, f), and write

  t^{(l)}_n(V, s, f) = t^{(l)}_n(V, s) = (1/(n h_n^{(p−q)/2})) Σ_{i=1}^n K(d_i(V, s)/h_n) Ỹ_i^l,   (39)

which is the sample version of (12) for l = 0, 1, 2. Equation (21) can be expressed as

  ȳ_l(V, s) = t^{(l)}_n(V, s) / t^{(0)}_n(V, s).   (40)

Lemma 12. Assume (E.2) and (K.1) hold. For a continuous function g_1, let Z_n(V, s) = (Σ_i g_1(X_i)^l K(d_i(V, s)/h_n)) / (n h_n^{(p−q)/2}). Then,

  E(Z_n(V, s)) = ∫_{supp(f_X)∩R^{p−q}} K(‖r_2‖²) ∫_{supp(f_X)∩R^q} g̃(r_1, h_n^{1/2} r_2) dr_1 dr_2,

where g̃(r_1, r_2) = g_1(s + V r_1 + U r_2)^l f_X(s + V r_1 + U r_2) and x = s + V r_1 + U r_2 as in (31).

Proof of Lemma 12. By (31), ‖P_U(x − s)‖ = ‖U r_2‖ = ‖r_2‖. Further,

  E(Z_n(V, s)) = (1/h_n^{(p−q)/2}) ∫_{supp(f_X)} g_1(x)^l K(‖P_U(x − s)‖²/h_n) f_X(x) dx
   = (1/h_n^{(p−q)/2}) ∫_{supp(f_X)∩R^{p−q}} ∫_{supp(f_X)∩R^q} g_1(s + V r_1 + U r_2)^l K(‖r_2‖²/h_n) f_X(s + V r_1 + U r_2) dr_1 dr_2
   = ∫_{supp(f_X)∩R^{p−q}} K(‖r̃_2‖²) ∫_{supp(f_X)∩R^q} g_1(s + V r_1 + h_n^{1/2} U r̃_2)^l f_X(s + V r_1 + h_n^{1/2} U r̃_2) dr_1 dr̃_2,

where the substitution r̃_2 = r_2/h_n^{1/2}, dr_2 = h_n^{(p−q)/2} dr̃_2, was used to obtain the last equality.

Lemma 13. Assume (E.1), (E.2), (E.3), (E.4), (H.1) and (K.1) hold. Then there exists a constant C_1 > 0 such that

  Var(n h_n^{(p−q)/2} t^{(l)}_n(V, s, f)) ≤ n h_n^{(p−q)/2} C_1

for n > n*, with t^{(l)}_n(V, s), l = 0, 1, 2, in (39).

Proof of Lemma 13. Since a continuous function attains a finite maximum over a compact set, sup_{x ∈ supp(f_X)} |g(B̃^T x)| < ∞. Therefore, |Ỹ_i| ≤ |g(B̃^T X_i)| + |ε̃_i| ≤ sup_{x ∈ supp(f_X)} |g(B̃^T x)| + |ε̃_i| = C_2 + |ε̃_i| and |Ỹ_i|^l ≤ Σ_{u=0}^l (l choose u) C_2^u |ε̃_i|^{l−u}.
Lemma 13. Assume (E.1), (E.2), (E.3), (E.4), (H.1) and (K.1) hold. Then there exists a constant $\tilde C>0$ such that
\[
\operatorname{Var}\left(n h_n^{(p-q)/2}\,t^{(l)}_n(\mathbf{V},\mathbf{s},f)\right) \le n h_n^{(p-q)/2}\,\tilde C
\]
for $n>n^\star$ and $t^{(l)}_n(\mathbf{V},\mathbf{s})$, $l=0,1,2$, in (39).

Proof of Lemma 13. Since a continuous function attains a finite maximum over a compact set, $\sup_{\mathbf{x}\in\operatorname{supp}(f_{\mathbf{X}})}|g(\widetilde{\mathbf{B}}^T\mathbf{x})|<\infty$. Therefore,
\[
|\widetilde Y_i| \le |g(\widetilde{\mathbf{B}}^T\mathbf{X}_i)| + |\tilde\epsilon_i| \le \sup_{\mathbf{x}\in\operatorname{supp}(f_{\mathbf{X}})}|g(\widetilde{\mathbf{B}}^T\mathbf{x})| + |\tilde\epsilon_i| = C_1 + |\tilde\epsilon_i|
\]
and $|\widetilde Y_i|^l \le \sum_{u=0}^{l}\binom{l}{u}C_1^u|\tilde\epsilon_i|^{l-u}$. Since the $(\widetilde Y_i,\mathbf{X}_i)$ are i.i.d.,
\begin{align}
\operatorname{Var}\left(n h_n^{(p-q)/2}t^{(l)}_n(\mathbf{V},\mathbf{s},f)\right) &= n\operatorname{Var}\left(\widetilde Y_1^l K\!\left(\frac{d_1(\mathbf{V},\mathbf{s})}{h_n}\right)\right) \le n\,\mathbb{E}\left(|\widetilde Y_1|^{2l}K^2\!\left(\frac{d_1(\mathbf{V},\mathbf{s})}{h_n}\right)\right) \notag\\
&\le n\sum_{u=0}^{2l}\binom{2l}{u}C_1^u\,\mathbb{E}\left(|\tilde\epsilon_1|^{2l-u}K^2\!\left(\frac{d_1(\mathbf{V},\mathbf{s})}{h_n}\right)\right) = n\sum_{u=0}^{2l}\binom{2l}{u}C_1^u\,\mathbb{E}\left(\mathbb{E}\left(|\tilde\epsilon_1|^{2l-u}\mid\mathbf{X}_1\right)K^2\!\left(\frac{d_1(\mathbf{V},\mathbf{s})}{h_n}\right)\right) \tag{41}
\end{align}
for $l=0,1,2$. Let $\mathbb{E}(|\tilde\epsilon_i|^{2l-u}\mid\mathbf{X}_i) = g_{2l-u}(\mathbf{X}_i)$ for a continuous (by assumption) function $g_{2l-u}(\cdot)$ with finite moments for $l=0,1,2$ by the compactness of $\operatorname{supp}(f_{\mathbf{X}})$. Using Lemma 12 with
\[
Z_n(\mathbf{V},\mathbf{s}) = \frac{1}{n h_n^{(p-q)/2}}\sum_i g_{2l-u}(\mathbf{X}_i)\,K^2(d_i(\mathbf{V},\mathbf{s})/h_n),
\]
where $K^2(\cdot)$ fulfills (K.1), we calculate
\begin{align}
\mathbb{E}\left(\mathbb{E}(|\tilde\epsilon_1|^{2l-u}\mid\mathbf{X}_1)K^2\!\left(\frac{d_1(\mathbf{V},\mathbf{s})}{h_n}\right)\right) &= h_n^{(p-q)/2}\,\mathbb{E}(Z_n(\mathbf{V},\mathbf{s})) \notag\\
&= h_n^{(p-q)/2}\int_{\operatorname{supp}(f_{\mathbf{X}})\cap\mathbb{R}^{p-q}}K^2(\|\mathbf{r}_2\|^2)\int_{\operatorname{supp}(f_{\mathbf{X}})\cap\mathbb{R}^{q}} g_{2l-u}(\mathbf{s}+\mathbf{V}\mathbf{r}_1+h_n^{1/2}\mathbf{U}\mathbf{r}_2)\,f_{\mathbf{X}}(\mathbf{s}+\mathbf{V}\mathbf{r}_1+h_n^{1/2}\mathbf{U}\mathbf{r}_2)\,d\mathbf{r}_1\,d\mathbf{r}_2 \tag{42}\\
&\le h_n^{(p-q)/2}\,C_2 \notag
\end{align}
since all integrands in (42) are continuous and integrated over compact sets by (E.2) and the continuity of $g_{2l-u}(\cdot)$ and $K(\cdot)$, so that the integral can be bounded above by a finite constant $C_2$. Inserting (42) into (41) yields
\[
\operatorname{Var}\left(n h_n^{(p-q)/2}t^{(l)}_n(\mathbf{V},\mathbf{s},f)\right) \le n h_n^{(p-q)/2}\underbrace{\sum_{u=0}^{2l}\binom{2l}{u}C_1^u C_2}_{=\tilde C} = n h_n^{(p-q)/2}\,\tilde C. \tag{43}
\]
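The scaling in Lemma 13 can be checked by simulation: replications of $t^{(l)}_n$ should have variance of order $1/(n h_n^{(p-q)/2})$, so $\operatorname{Var}(t^{(l)}_n)\cdot n h_n^{(p-q)/2}$ should remain bounded as $n$ grows and $h_n$ shrinks. A rough sketch under the same illustrative data-generating assumptions as in the previous snippet:

```python
# Monte Carlo check of Lemma 13: Var(t_n^{(l)}) * n * h^{(p-q)/2} should stay
# roughly bounded across (n, h). The data, kernel, and distance are the same
# hypothetical choices used in the earlier sketch, not the paper's exact setup.
import numpy as np

rng = np.random.default_rng(2)
p, q, l = 3, 2, 1
V, _ = np.linalg.qr(rng.normal(size=(p, q)))
s = np.zeros(p)

def t_n_once(n, h):
    X = rng.uniform(-1, 1, size=(n, p))
    Ytil = X[:, 0]**2 + 0.1 * rng.normal(size=n)
    Xc = X - s
    d = (Xc**2).sum(axis=1) - ((Xc @ V)**2).sum(axis=1)
    K = np.exp(-0.5 * (d / h)**2)
    return (K * Ytil**l).sum() / (n * h**((p - q) / 2))

for n, h in [(500, 0.4), (2000, 0.4), (2000, 0.2)]:
    reps = np.array([t_n_once(n, h) for _ in range(400)])
    print(n, h, round(reps.var() * n * h**((p - q) / 2), 4))  # roughly stable
```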
In Lemma 14 we show that $d_i(\mathbf{V},\mathbf{s})$ in (19) is Lipschitz in its inputs under assumption (E.2).

Lemma 14. Under assumption (E.2) there exists a constant $0<C_3<\infty$ such that for all $\delta>0$ and $\mathbf{V},\mathbf{V}_j\in\mathcal{S}(p,q)$ with $\|P_{\mathbf{V}}-P_{\mathbf{V}_j}\|<\delta$ and for all $\mathbf{s},\mathbf{s}_j\in\operatorname{supp}(f_{\mathbf{X}})\subset\mathbb{R}^p$ with $\|\mathbf{s}-\mathbf{s}_j\|<\delta$,
\[
|d_i(\mathbf{V},\mathbf{s}) - d_i(\mathbf{V}_j,\mathbf{s}_j)| \le C_3\delta
\]
for $d_i(\mathbf{V},\mathbf{s})$ given by (19).

Proof of Lemma 14.
\[
|d_i(\mathbf{V},\mathbf{s}) - d_i(\mathbf{V}_j,\mathbf{s}_j)| \le \left|\|\mathbf{X}_i-\mathbf{s}\|^2-\|\mathbf{X}_i-\mathbf{s}_j\|^2\right| + \left|\langle\mathbf{X}_i-\mathbf{s},P_{\mathbf{V}}(\mathbf{X}_i-\mathbf{s})\rangle - \langle\mathbf{X}_i-\mathbf{s}_j,P_{\mathbf{V}_j}(\mathbf{X}_i-\mathbf{s}_j)\rangle\right| = I_1 + I_2 \tag{44}
\]
where $\langle\cdot,\cdot\rangle$ is the scalar product in $\mathbb{R}^p$. We bound the first term on the right hand side of (44) using $\|\mathbf{X}_i\|\le\sup_{\mathbf{z}\in\operatorname{supp}(f_{\mathbf{X}})}\|\mathbf{z}\| = C_4<\infty$ with probability 1 by (E.2):
\[
I_1 = \left|\|\mathbf{X}_i-\mathbf{s}\|^2-\|\mathbf{X}_i-\mathbf{s}_j\|^2\right| \le 2|\langle\mathbf{X}_i,\mathbf{s}-\mathbf{s}_j\rangle| + \left|\|\mathbf{s}\|^2-\|\mathbf{s}_j\|^2\right| \le 2\|\mathbf{X}_i\|\|\mathbf{s}-\mathbf{s}_j\| + 2C_4\|\mathbf{s}-\mathbf{s}_j\| \le 2C_4\delta + 2C_4\delta = 4C_4\delta
\]
by Cauchy-Schwarz and the reverse triangle inequality, for which $\left|\|\mathbf{s}\|^2-\|\mathbf{s}_j\|^2\right| = \left|\|\mathbf{s}\|-\|\mathbf{s}_j\|\right|\left(\|\mathbf{s}\|+\|\mathbf{s}_j\|\right) \le \|\mathbf{s}-\mathbf{s}_j\|\,2C_4$. The second term in (44) satisfies
\begin{align*}
I_2 &\le \left|\langle\mathbf{X}_i,(P_{\mathbf{V}}-P_{\mathbf{V}_j})\mathbf{X}_i\rangle\right| + 2\left|\langle\mathbf{X}_i,P_{\mathbf{V}}\mathbf{s}-P_{\mathbf{V}_j}\mathbf{s}_j\rangle\right| + \left|\langle\mathbf{s},P_{\mathbf{V}}\mathbf{s}\rangle-\langle\mathbf{s}_j,P_{\mathbf{V}_j}\mathbf{s}_j\rangle\right|\\
&\le \|\mathbf{X}_i\|^2\|P_{\mathbf{V}}-P_{\mathbf{V}_j}\| + 2\|\mathbf{X}_i\|\left\|P_{\mathbf{V}}(\mathbf{s}-\mathbf{s}_j)+(P_{\mathbf{V}}-P_{\mathbf{V}_j})\mathbf{s}_j\right\| + |\langle\mathbf{s}-\mathbf{s}_j,P_{\mathbf{V}}\mathbf{s}\rangle| + \left|\langle\mathbf{s}_j,P_{\mathbf{V}}\mathbf{s}-P_{\mathbf{V}_j}\mathbf{s}_j\rangle\right|\\
&\le C_4^2\delta + 2C_4(\delta+C_4\delta) + C_4\delta + C_4(\delta+C_4\delta) = 4C_4^2\delta + 4C_4\delta.
\end{align*}
Collecting all constants into $C_3$ (i.e. $C_3 = 8C_4 + 4C_4^2$) yields the result.

To prove Theorem 8 and Lemma 15 below, we use the Bernstein inequality [S.N27]. Let $\{Z_i, i=1,2,\ldots\}$ be an independent sequence of bounded random variables with $|Z_i|\le b$. Let $S_n=\sum_{i=1}^n Z_i$, $E_n=\mathbb{E}(S_n)$ and $V_n=\operatorname{Var}(S_n)$. Then,
\[
P(|S_n - E_n| > t) < 2\exp\left(-\frac{t^2/2}{V_n + bt/3}\right). \tag{45}
\]
Assumption (K.2) yields
\[
|K(u) - K(u')| \le K^*(u')\,\delta \tag{46}
\]
for all $u,u'$ with $|u-u'|<\delta\le L$, where $K^*(\cdot)$ is a bounded and integrable kernel function [see [Han08]]. Specifically, if condition (1) of (K.2) holds, then $K^*(u) = L\,\mathbf{1}_{\{|u|\le 2L\}}$; if condition (2) holds, then $K^*(u) = L\,\mathbf{1}_{\{|u|\le 2L\}} + \mathbf{1}_{\{|u|>2L\}}L|u-L|^{-\nu}$.
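Bernstein's inequality (45) is easy to sanity-check numerically. The sketch below compares the empirical tail of a sum of bounded, centered i.i.d. variables with the bound; the uniform distribution is an arbitrary assumption made only for this demonstration.

```python
# Numerical illustration of the Bernstein inequality (45): for bounded, centered
# i.i.d. summands, the empirical tail P(|S_n - E S_n| > t) lies below
# 2 exp(-t^2 / (2 V_n + (2/3) b t)). Distributional choices are assumptions.
import numpy as np

rng = np.random.default_rng(3)
n, reps, b = 200, 20_000, 1.0
Z = rng.uniform(-b, b, size=(reps, n))      # |Z_i| <= b, E Z_i = 0
S = Z.sum(axis=1)
V_n = n * (2 * b)**2 / 12                   # Var(S_n) for Uniform(-b, b) summands

for t in [10.0, 20.0, 30.0]:
    emp = (np.abs(S) > t).mean()
    bound = 2 * np.exp(-t**2 / (2 * V_n + 2 * b * t / 3))
    print(t, round(emp, 5), round(bound, 5))   # empirical tail <= bound
```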
Let $A = \mathcal{S}(p,q)\times\operatorname{supp}(f_{\mathbf{X}})$. In Lemmas 15 and 16 we show that (39) converges uniformly in probability to (12) by showing that the variance and bias terms vanish uniformly in probability, respectively.

Lemma 15. Under the assumptions of Theorem 8,
\[
\sup_{\mathbf{V}\times\mathbf{s}\in A}\left|t^{(l)}_n(\mathbf{V},\mathbf{s}) - \mathbb{E}\left(t^{(l)}_n(\mathbf{V},\mathbf{s})\right)\right| = O_P(a_n), \quad l=0,1,2. \tag{47}
\]

Proof of Lemma 15. The proof proceeds in three steps: (i) truncation, (ii) discretization by covering $A=\mathcal{S}(p,q)\times\operatorname{supp}(f_{\mathbf{X}})$, and (iii) application of Bernstein's inequality (45). If the function $f$ in (38) is bounded, the truncation step and the assumption $a_n/h_n^{(p-q)/2}=O(1)$ are not needed.

(i) We let $\tau_n = a_n^{-1}$ and truncate $\widetilde Y_i^l$ at $\tau_n$ as follows. We let
\[
t^{(l)}_{n,\mathrm{trc}}(\mathbf{V},\mathbf{s}) = \frac{1}{n h_n^{(p-q)/2}}\sum_i K\!\left(\|P_{\mathbf{U}}(\mathbf{X}_i-\mathbf{s})\|^2/h_n\right)\widetilde Y_i^l\,\mathbf{1}_{\{|\widetilde Y_i|^l\le\tau_n\}} \tag{48}
\]
be the truncated version of (39) and $\widetilde R^{(l)}_n = (1/(n h_n^{(p-q)/2}))\sum_i|\widetilde Y_i|^l\mathbf{1}_{\{|\widetilde Y_i|^l>\tau_n\}}$ the remainder of (39). Therefore,
\[
R^{(l)}_n(\mathbf{V},\mathbf{s}) = t^{(l)}_n(\mathbf{V},\mathbf{s}) - t^{(l)}_{n,\mathrm{trc}}(\mathbf{V},\mathbf{s}) \le M_1\widetilde R^{(l)}_n
\]
due to (K.1), and
\[
\sup_{\mathbf{V}\times\mathbf{s}\in A}\left|t^{(l)}_n(\mathbf{V},\mathbf{s}) - \mathbb{E}\left(t^{(l)}_n(\mathbf{V},\mathbf{s})\right)\right| \le M_1\left(\widetilde R^{(l)}_n + \mathbb{E}\widetilde R^{(l)}_n\right) + \sup_{\mathbf{V}\times\mathbf{s}\in A}\left|t^{(l)}_{n,\mathrm{trc}}(\mathbf{V},\mathbf{s}) - \mathbb{E}\left(t^{(l)}_{n,\mathrm{trc}}(\mathbf{V},\mathbf{s})\right)\right|. \tag{49}
\]
By Cauchy-Schwarz and the Markov inequality, $P(|Z|>t) = P(Z^2>t^2)\le\mathbb{E}(Z^2)/t^2$, we obtain
\[
\mathbb{E}\widetilde R^{(l)}_n = \frac{1}{h_n^{(p-q)/2}}\mathbb{E}\left(|\widetilde Y_i|^l\mathbf{1}_{\{|\widetilde Y_i|^l>\tau_n\}}\right) \le \frac{1}{h_n^{(p-q)/2}}\sqrt{\mathbb{E}(|\widetilde Y_i|^{2l})}\sqrt{P(|\widetilde Y_i|^l>\tau_n)} \le \frac{1}{h_n^{(p-q)/2}}\sqrt{\mathbb{E}(|\widetilde Y_i|^{2l})}\left(\mathbb{E}(|\widetilde Y_i|^{2l})\,a_n^2\right)^{1/2} = o(a_n), \tag{50}
\]
where the last equality uses the assumption $a_n/h_n^{(p-q)/2}=O(1)$ and the expectations are finite due to (E.4) for $l=0,1,2$. No truncation is needed for $l=0$ or if $\widetilde Y_i = f(Y_i)\le\sup_{f\in\mathcal{F}}|f(Y_i)| < C<\infty$. Therefore, the first two terms on the right hand side of (49) converge to 0 at rate $a_n$ by (50) and Markov's inequality. From this point on, $\widetilde Y_i$ denotes the truncated version $\widetilde Y_i\mathbf{1}_{\{|\widetilde Y_i|\le\tau_n\}}$, and we do not distinguish the truncated from the untruncated $t^{(l)}_n(\mathbf{V},\mathbf{s})$, since the truncation results in an error of magnitude $a_n$.

(ii) For the discretization step we cover the compact set $A = \mathcal{S}(p,q)\times\operatorname{supp}(f_{\mathbf{X}})$ by finitely many balls, which is possible by (E.2) and the compactness of $\mathcal{S}(p,q)$. Let $\delta_n = a_n h_n$ and let $A_j = \{\mathbf{V}:\|P_{\mathbf{V}}-P_{\mathbf{V}_j}\|\le\delta_n\}\times\{\mathbf{s}:\|\mathbf{s}-\mathbf{s}_j\|\le\delta_n\}$ be a cover of $A$ with ball centers $\mathbf{V}_j\times\mathbf{s}_j$. Then $A\subset\bigcup_{j=1}^N A_j$ and the number of balls can be bounded by $N\le C_6\delta_n^{-d}\delta_n^{-p}$ for some constant $C_6\in(0,\infty)$, where $d=\dim(\mathcal{S}(p,q)) = pq - q(q+1)/2$. Let $\mathbf{V}\times\mathbf{s}\in A_j$. Then by Lemma 14 there exists $0<C_3<\infty$ such that
\[
|d_i(\mathbf{V},\mathbf{s}) - d_i(\mathbf{V}_j,\mathbf{s}_j)| \le C_3\delta_n \tag{51}
\]
for $d_i$ in (19). Under (K.2), which implies (46), inequality (51) yields
\[
\left|K\!\left(\frac{d_i(\mathbf{V},\mathbf{s})}{h_n}\right) - K\!\left(\frac{d_i(\mathbf{V}_j,\mathbf{s}_j)}{h_n}\right)\right| \le K^*\!\left(\frac{d_i(\mathbf{V}_j,\mathbf{s}_j)}{h_n}\right)C_3 a_n \tag{52}
\]
for $\mathbf{V}\times\mathbf{s}\in A_j$ and $K^*(\cdot)$ an integrable and bounded function.

Define $r^{(l)}_n(\mathbf{V}_j,\mathbf{s}_j) = (1/(n h_n^{(p-q)/2}))\sum_{i=1}^n K^*(d_i(\mathbf{V}_j,\mathbf{s}_j)/h_n)|\widetilde Y_i|^l$. For notational convenience we drop the dependence on $l$ and $j$ in the sequel, and observe that (52) yields
\[
\left|t^{(l)}_n(\mathbf{V},\mathbf{s}) - t^{(l)}_n(\mathbf{V}_j,\mathbf{s}_j)\right| \le C_3 a_n\,r^{(l)}_n(\mathbf{V}_j,\mathbf{s}_j). \tag{53}
\]
Since $K^*$ fulfills (K.1) except for continuity, an argument analogous to the proof of Lemma 12 yields that $\mathbb{E}\left(r^{(l)}_n(\mathbf{V}_j,\mathbf{s}_j)\right)<\infty$.
By subtracting and adding $t^{(l)}_n(\mathbf{V}_j,\mathbf{s}_j)$ and $\mathbb{E}(t^{(l)}_n(\mathbf{V}_j,\mathbf{s}_j))$, the triangle inequality, (53), and the integrability of $r^{(l)}_n$, we obtain
\begin{align}
\left|t^{(l)}_n(\mathbf{V},\mathbf{s}) - \mathbb{E}\left(t^{(l)}_n(\mathbf{V},\mathbf{s})\right)\right| &\le \left|t^{(l)}_n(\mathbf{V},\mathbf{s}) - t^{(l)}_n(\mathbf{V}_j,\mathbf{s}_j)\right| + \left|\mathbb{E}\left(t^{(l)}_n(\mathbf{V}_j,\mathbf{s}_j) - t^{(l)}_n(\mathbf{V},\mathbf{s})\right)\right| + \left|t^{(l)}_n(\mathbf{V}_j,\mathbf{s}_j) - \mathbb{E}\left(t^{(l)}_n(\mathbf{V}_j,\mathbf{s}_j)\right)\right| \notag\\
&\le C_3 a_n\left(|r_n| + |\mathbb{E}(r_n)|\right) + \left|t^{(l)}_n(\mathbf{V}_j,\mathbf{s}_j) - \mathbb{E}\left(t^{(l)}_n(\mathbf{V}_j,\mathbf{s}_j)\right)\right| \notag\\
&\le C_3 a_n\left(|r_n - \mathbb{E}(r_n)| + 2|\mathbb{E}(r_n)|\right) + \left|t^{(l)}_n(\mathbf{V}_j,\mathbf{s}_j) - \mathbb{E}\left(t^{(l)}_n(\mathbf{V}_j,\mathbf{s}_j)\right)\right| \notag\\
&\le C_7 a_n + |r_n - \mathbb{E}(r_n)| + \left|t^{(l)}_n(\mathbf{V}_j,\mathbf{s}_j) - \mathbb{E}\left(t^{(l)}_n(\mathbf{V}_j,\mathbf{s}_j)\right)\right| \tag{54}
\end{align}
for any constant $C_7 > 2C_3\,\mathbb{E}(r^{(l)}_n(\mathbf{V}_j,\mathbf{s}_j))$ and $n$ such that $C_3 a_n\le 1$, which is possible since $a_n = o(1)$; this in turn yields that there exists $0<C_7<\infty$ such that (54) holds.

Since $\sup_{x\in A}f(x) = \max_{1\le j\le N}\sup_{x\in A_j}f(x) \le \sum_{j=1}^N\sup_{x\in A_j}f(x)$ for any cover of $A$ and continuous function $f$,
\begin{align}
P\left(\sup_{\mathbf{V}\times\mathbf{s}\in A}\left|t^{(l)}_n - \mathbb{E}(t^{(l)}_n)\right| > C_5 a_n\right) &\le \sum_{j=1}^N P\left(\sup_{\mathbf{V}\times\mathbf{s}\in A_j}\left|t^{(l)}_n - \mathbb{E}(t^{(l)}_n)\right| > C_5 a_n\right) \le N\max_{1\le j\le N}P\left(\sup_{\mathbf{V}\times\mathbf{s}\in A_j}\left|t^{(l)}_n - \mathbb{E}(t^{(l)}_n)\right| > C_5 a_n\right) \tag{55}\\
&\le N\left(\max_{1\le j\le N}P\left(\left|t^{(l)}_n(\mathbf{V}_j,\mathbf{s}_j) - \mathbb{E}(t^{(l)}_n(\mathbf{V}_j,\mathbf{s}_j))\right| > C_5 a_n\right) + \max_{1\le j\le N}P\left(|r_n - \mathbb{E}(r_n)| > C_5 a_n\right)\right) \notag\\
&\le C_6\delta_n^{-(d+p)}\left(\max_{1\le j\le N}P\left(\left|t^{(l)}_n(\mathbf{V}_j,\mathbf{s}_j) - \mathbb{E}(t^{(l)}_n(\mathbf{V}_j,\mathbf{s}_j))\right| > C_5 a_n\right) + \max_{1\le j\le N}P\left(|r_n - \mathbb{E}(r_n)| > C_5 a_n\right)\right) \notag
\end{align}
by the subadditivity of probability for the first inequality and (54) for the third, where the last inequality is due to $N\le C_6\delta_n^{-d}\delta_n^{-p}$ for a cover of $A$.

Finally, we bound the first and second terms in the last line of (55) by the Bernstein inequality (45). For the first term, let $Z_i = \widetilde Y_i^l K(d_i(\mathbf{V}_j,\mathbf{s}_j)/h_n)$ and $S_n = \sum_i Z_i = n h_n^{(p-q)/2}t^{(l)}_n(\mathbf{V}_j,\mathbf{s}_j)$. Then the $Z_i$ are independent with $|Z_i|\le b = M_1\tau_n = M_1/a_n$ by (K.1) and the truncation step (i). For $V_n = \operatorname{Var}(S_n)$, Lemma 13 yields $n h_n^{(p-q)/2}\tilde C\ge V_n$ with $\tilde C>0$, and we set $t = C_5 a_n n h_n^{(p-q)/2}$.
The Bernstein inequality (45) yields
\begin{align*}
P\left(\left|t^{(l)}_n(\mathbf{V}_j,\mathbf{s}_j) - \mathbb{E}\left(t^{(l)}_n(\mathbf{V}_j,\mathbf{s}_j)\right)\right| > C_5 a_n\right) &< 2\exp\left(-\frac{(1/2)C_5^2 a_n^2 n^2 h_n^{p-q}}{n h_n^{(p-q)/2}\tilde C + (1/3)M_1\tau_n C_5 a_n n h_n^{(p-q)/2}}\right)\\
&\le 2\exp\left(-\frac{(1/2)C_5^2\log(n)}{\tilde C + C_5 M_1/3}\right) = 2n^{-\gamma(C_5)}
\end{align*}
where $a_n^2 = \log(n)/(n h_n^{(p-q)/2})$ and $\gamma(C_5) = C_5^2\left(2(\tilde C + C_5 M_1/3)\right)^{-1}$ is an increasing function that can be made arbitrarily large by increasing $C_5$.

For the second term in the last line of (55), set $Z_i = \widetilde Y_i^l K^*(d_i(\mathbf{V}_j,\mathbf{s}_j)/h_n)$ in (45) and proceed analogously to obtain
\[
P\left(\left|r^{(l)}_n(\mathbf{V}_j,\mathbf{s}_j) - \mathbb{E}\left(r^{(l)}_n(\mathbf{V}_j,\mathbf{s}_j)\right)\right| > C_5 a_n\right) < 2n^{-\gamma_2(C_5)}.
\]
By (H.1), $h_n^{(p-q)/2}\le 1$ for $n$ large, and (H.2) implies $1/(n h_n^{(p-q)/2})\le 1$ for $n$ large; therefore $h_n^{-1}\le n^{2/(p-q)}\le n^2$ since $p-q\ge 1$. Then $\delta_n^{-1} = (a_n h_n)^{-1}\le n^{1/2}h_n^{(p-q)/4}h_n^{-1}\le n^{5/2}$. Therefore, (55) is smaller than $C_6\delta_n^{-(d+p)}n^{-\gamma(C_5)}\le C_6 n^{5(d+p)/2-\gamma(C_5)}$. For $C_5$ large enough, $5(d+p)/2-\gamma(C_5)<0$ and $n^{5(d+p)/2-\gamma(C_5)}\to 0$. This completes the proof.

If we assume $|\widetilde Y_i|<M<\infty$ almost surely, the requirement $a_n/h_n^{(p-q)/2}=O(1)$ on the bandwidth can be dropped and the truncation step in the proof of Lemma 15 is no longer necessary.
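A small simulation can make the statement of Lemma 15 concrete for $p=2$, $q=1$, where $\mathcal{S}(2,1)$ is parametrized by an angle. The sketch below approximates $\mathbb{E}(t^{(0)}_n)$ by averaging replications and compares the resulting sup-deviation over a grid of $(\mathbf{V},\mathbf{s})$ with $a_n = (\log n/(n h_n^{1/2}))^{1/2}$; the grid resolution, the data distribution, and the replication proxy for the expectation are all simplifying assumptions.

```python
# Rough illustration of Lemma 15 for p = 2, q = 1: the sup-deviation of
# t_n^{(0)} over a grid of (V, s) should be of the same order as
# a_n = sqrt(log n / (n h^{(p-q)/2})). E(t_n^{(0)}) is approximated by an
# average over replications; all concrete choices are assumptions.
import numpy as np

rng = np.random.default_rng(4)
p = 2
angles = np.linspace(0.0, np.pi, 20, endpoint=False)
S_grid = [np.array([a, b]) for a in (-0.4, 0.0, 0.4) for b in (-0.4, 0.0, 0.4)]

def t0_all(X, h):
    """t_n^{(0)}(V, s) on the (angle, s) grid for one sample X."""
    n = X.shape[0]
    out = np.empty((len(angles), len(S_grid)))
    for i, a in enumerate(angles):
        v = np.array([np.cos(a), np.sin(a)])      # element of S(2, 1)
        for j, s in enumerate(S_grid):
            Xc = X - s
            d = (Xc**2).sum(axis=1) - (Xc @ v)**2
            out[i, j] = np.exp(-0.5 * (d / h)**2).sum() / (n * h**0.5)
    return out

for n, h in [(500, 0.5), (4000, 0.3)]:
    draws = np.stack([t0_all(rng.uniform(-1, 1, size=(n, p)), h) for _ in range(60)])
    sup_dev = np.abs(draws[0] - draws.mean(axis=0)).max()   # one replicate vs. proxy mean
    a_n = np.sqrt(np.log(n) / (n * h**0.5))
    print(n, h, round(float(sup_dev), 4), round(float(a_n), 4))
```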
Lemma 16. Under (E.1), (E.2), (E.3), (E.4), (H.1), (K.1), and $\int_{\mathbb{R}^{p-q}}K(\|\mathbf{r}_2\|^2)\,d\mathbf{r}_2 = 1$,
\[
\sup_{\mathbf{V}\times\mathbf{s}\in A}\left|t^{(l)}(\mathbf{V},\mathbf{s}) + \mathbf{1}_{\{l=2\}}\tilde h(\mathbf{V},\mathbf{s}) - \mathbb{E}\left(t^{(l)}_n(\mathbf{V},\mathbf{s})\right)\right| = O(h_n), \quad l=0,1,2, \tag{56}
\]
where $t^{(l)}(\mathbf{V},\mathbf{s})$ and $\tilde h(\mathbf{V},\mathbf{s})$ are defined in Theorem 4.

Proof of Lemma 16. Let $\tilde g_l(\mathbf{r}_1,\mathbf{r}_2) = g(\widetilde{\mathbf{B}}^T\mathbf{s}+\widetilde{\mathbf{B}}^T\mathbf{V}\mathbf{r}_1+\widetilde{\mathbf{B}}^T\mathbf{U}\mathbf{r}_2)^l f_{\mathbf{X}}(\mathbf{s}+\mathbf{V}\mathbf{r}_1+\mathbf{U}\mathbf{r}_2)$, where $\mathbf{r}_1,\mathbf{r}_2$ satisfy the orthogonal decomposition (31). Then
\begin{align*}
\mathbb{E}\left(t^{(0)}_n(\mathbf{V},\mathbf{s})\right) &= \mathbb{E}\left(K(d_i(\mathbf{V},\mathbf{s})/h_n)\right)/h_n^{(p-q)/2}\\
\mathbb{E}\left(t^{(1)}_n(\mathbf{V},\mathbf{s})\right) &= \mathbb{E}\left(K(d_i(\mathbf{V},\mathbf{s})/h_n)\,g(\widetilde{\mathbf{B}}^T\mathbf{X}_i)\right)/h_n^{(p-q)/2} + \mathbb{E}\big(K(d_i(\mathbf{V},\mathbf{s})/h_n)\underbrace{\mathbb{E}(\tilde\epsilon_i\mid\mathbf{X}_i)}_{=0}\big)/h_n^{(p-q)/2}\\
\mathbb{E}\left(t^{(2)}_n(\mathbf{V},\mathbf{s})\right) &= \mathbb{E}\left(K(d_i(\mathbf{V},\mathbf{s})/h_n)\,g(\widetilde{\mathbf{B}}^T\mathbf{X}_i)^2\right)/h_n^{(p-q)/2} + 2\,\mathbb{E}\big(K(d_i(\mathbf{V},\mathbf{s})/h_n)\,g(\widetilde{\mathbf{B}}^T\mathbf{X}_i)\underbrace{\mathbb{E}(\tilde\epsilon_i\mid\mathbf{X}_i)}_{=0}\big)/h_n^{(p-q)/2}\\
&\quad + \mathbb{E}\big(K(d_i(\mathbf{V},\mathbf{s})/h_n)\underbrace{\mathbb{E}(\tilde\epsilon_i^2\mid\mathbf{X}_i)}_{=h_2(\mathbf{X}_i)}\big)/h_n^{(p-q)/2}.
\end{align*}
Then
\[
\mathbb{E}\left(t^{(l)}_n(\mathbf{V},\mathbf{s})\right) = \int_{\mathbb{R}^{p-q}}K(\|\mathbf{r}_2\|^2)\int_{\mathbb{R}^{q}}\tilde g_l(\mathbf{r}_1,h_n^{1/2}\mathbf{r}_2)\,d\mathbf{r}_1\,d\mathbf{r}_2 \tag{57}
\]
holds by Lemma 12 for $l=0,1$. For $l=2$, $\widetilde Y_i^2 = g_i^2 + 2g_i\tilde\epsilon_i + \tilde\epsilon_i^2$ with $g_i = g(\widetilde{\mathbf{B}}^T\mathbf{X}_i)$, and each summand can be handled as in the case $l=0,1$; the third summand gives rise to the additional term $\tilde h(\mathbf{V},\mathbf{s})$ in (56). Plugging into (57) the second order Taylor expansion, for some $\xi$ in a neighborhood of $\mathbf{0}$,
\[
\tilde g_l(\mathbf{r}_1,h_n^{1/2}\mathbf{r}_2) = \tilde g_l(\mathbf{r}_1,\mathbf{0}) + h_n^{1/2}\nabla_{\mathbf{r}_2}\tilde g_l(\mathbf{r}_1,\mathbf{0})^T\mathbf{r}_2 + \frac{h_n}{2}\mathbf{r}_2^T\nabla^2_{\mathbf{r}_2}\tilde g_l(\mathbf{r}_1,\xi)\mathbf{r}_2,
\]
yields
\begin{align*}
\mathbb{E}\left(t^{(l)}_n(\mathbf{V},\mathbf{s})\right) &= \int_{\mathbb{R}^{q}}\tilde g_l(\mathbf{r}_1,\mathbf{0})\,d\mathbf{r}_1 + \sqrt{h_n}\left(\int_{\mathbb{R}^{q}}\nabla_{\mathbf{r}_2}\tilde g_l(\mathbf{r}_1,\mathbf{0})\,d\mathbf{r}_1\right)^T\int_{\mathbb{R}^{p-q}}K(\|\mathbf{r}_2\|^2)\,\mathbf{r}_2\,d\mathbf{r}_2\\
&\quad + \frac{h_n}{2}\int_{\mathbb{R}^{p-q}}K(\|\mathbf{r}_2\|^2)\int_{\mathbb{R}^{q}}\mathbf{r}_2^T\nabla^2_{\mathbf{r}_2}\tilde g_l(\mathbf{r}_1,\xi)\mathbf{r}_2\,d\mathbf{r}_1\,d\mathbf{r}_2 = t^{(l)}(\mathbf{V},\mathbf{s}) + h_n R_2(\mathbf{V},\mathbf{s})
\end{align*}
since $\int_{\mathbb{R}^{q}}\tilde g_l(\mathbf{r}_1,\mathbf{0})\,d\mathbf{r}_1 = t^{(l)}(\mathbf{V},\mathbf{s})$ and $\int_{\mathbb{R}^{p-q}}K(\|\mathbf{r}_2\|^2)\,\mathbf{r}_2\,d\mathbf{r}_2 = \mathbf{0}\in\mathbb{R}^{p-q}$ due to $K(\|\cdot\|^2)$ being even. Here $R_2(\mathbf{V},\mathbf{s}) = \frac{1}{2}\int_{\mathbb{R}^{p-q}}K(\|\mathbf{r}_2\|^2)\int_{\mathbb{R}^{q}}\mathbf{r}_2^T\nabla^2_{\mathbf{r}_2}\tilde g_l(\mathbf{r}_1,\xi)\mathbf{r}_2\,d\mathbf{r}_1\,d\mathbf{r}_2$. By (E.4) and (E.2), $|\mathbf{r}_2^T\nabla^2_{\mathbf{r}_2}\tilde g_l(\mathbf{r}_1,\xi)\mathbf{r}_2|\le C_8\|\mathbf{r}_2\|^2$ for $C_8 = \sup_{\mathbf{x},\mathbf{y}}\|\nabla^2_{\mathbf{r}_2}\tilde g_l(\mathbf{x},\mathbf{y})\|<\infty$, since a continuous function over a compact set is bounded. Then $R_2(\mathbf{V},\mathbf{s})\le C_8 C_9\int_{\mathbb{R}^{p-q}}K(\|\mathbf{r}_2\|^2)\|\mathbf{r}_2\|^2\,d\mathbf{r}_2<\infty$ for some $C_9>0$, since the integral over $\mathbf{r}_1$ is over a compact set by (E.2).

Lemma 17 follows directly from Lemmas 15 and 16 and the triangle inequality.

Lemma 17. Suppose (E.1), (E.2), (E.3), (E.4), (K.1), (K.2), (H.1) hold. If $a_n = \left(\log(n)/(n h_n^{(p-q)/2})\right)^{1/2} = o(1)$ and $a_n/h_n^{(p-q)/2} = O(1)$, then for $l=0,1,2$,
\[
\sup_{\mathbf{V}\times\mathbf{s}\in A}\left|t^{(l)}(\mathbf{V},\mathbf{s}) + \mathbf{1}_{\{l=2\}}\tilde h(\mathbf{V},\mathbf{s}) - t^{(l)}_n(\mathbf{V},\mathbf{s})\right| = O_P(a_n + h_n).
\]
Theorem 18. Suppose (E.1), (E.2), (E.3), (E.4), (K.1), (K.2), (H.1) hold. Let $a_n = \left(\log(n)/(n h_n^{(p-q)/2})\right)^{1/2} = o(1)$ and $a_n/h_n^{(p-q)/2} = O(1)$. Then
\[
\sup_{\mathbf{V}\times\mathbf{s}\in A}\left|\bar y_l(\mathbf{V},\mathbf{s}) - \mu_l(\mathbf{V},\mathbf{s}) - \mathbf{1}_{\{l=2\}}\tilde h(\mathbf{V},\mathbf{s})\right| = o_P(1), \quad l=0,1,2,
\]
and
\[
\sup_{\mathbf{V}\times\mathbf{s}\in A}\left|\tilde L_{n,F}(\mathbf{V},\mathbf{s}) - \tilde L_F(\mathbf{V},\mathbf{s})\right| = o_P(1), \tag{58}
\]
where $\bar y_l(\mathbf{V},\mathbf{s})$, $\mu_l(\mathbf{V},\mathbf{s})$, $\tilde L_{n,F}(\mathbf{V},\mathbf{s})$ and $\tilde L_F(\mathbf{V},\mathbf{s})$ are defined in (21), (11), (22) and (10), respectively.

Proof of Theorem 18. Let $\delta^*_n = \inf_{\mathbf{V}\times\mathbf{s}\in A_n}t^{(0)}(\mathbf{V},\mathbf{s})$, where $t^{(0)}(\mathbf{V},\mathbf{s})$ is defined in (12), and $A_n = \mathcal{S}(p,q)\times\{\mathbf{x}\in\operatorname{supp}(f_{\mathbf{X}}) : |\mathbf{x}-\partial\operatorname{supp}(f_{\mathbf{X}})|\ge b_n\}$, where $\partial C$ denotes the boundary of the set $C$ and $|\mathbf{x}-C| = \inf_{\mathbf{r}\in C}|\mathbf{x}-\mathbf{r}|$, for a sequence $b_n\to 0$ chosen so that $(\delta^*_n)^{-1}(a_n+h_n)\to 0$ for any bandwidth $h_n$ satisfying the assumptions. Then,
\[
\bar y_l(\mathbf{V},\mathbf{s}) = \frac{t^{(l)}_n(\mathbf{V},\mathbf{s})}{t^{(0)}_n(\mathbf{V},\mathbf{s})} = \frac{t^{(l)}_n(\mathbf{V},\mathbf{s})/t^{(0)}(\mathbf{V},\mathbf{s})}{t^{(0)}_n(\mathbf{V},\mathbf{s})/t^{(0)}(\mathbf{V},\mathbf{s})}. \tag{59}
\]
We consider the numerator and denominator of (59) separately. By Lemma 17,
\begin{align*}
\sup_{\mathbf{V}\times\mathbf{s}\in A_n}\left|\frac{t^{(0)}_n(\mathbf{V},\mathbf{s})}{t^{(0)}(\mathbf{V},\mathbf{s})} - 1\right| &\le \frac{\sup_{A}\left|t^{(0)}_n(\mathbf{V},\mathbf{s}) - t^{(0)}(\mathbf{V},\mathbf{s})\right|}{\inf_{A_n}t^{(0)}(\mathbf{V},\mathbf{s})} = O_P\left((\delta^*_n)^{-1}(a_n+h_n)\right),\\
\sup_{\mathbf{V}\times\mathbf{s}\in A_n}\left|\frac{t^{(l)}_n(\mathbf{V},\mathbf{s})}{t^{(0)}(\mathbf{V},\mathbf{s})} - \mu_l(\mathbf{V},\mathbf{s})\right| &\le \frac{\sup_{A}\left|t^{(l)}_n(\mathbf{V},\mathbf{s}) - t^{(l)}(\mathbf{V},\mathbf{s})\right|}{\inf_{A_n}t^{(0)}(\mathbf{V},\mathbf{s})} = O_P\left((\delta^*_n)^{-1}(a_n+h_n)\right),
\end{align*}
and therefore, by $A_n\uparrow A = \mathcal{S}(p,q)\times\operatorname{supp}(f_{\mathbf{X}})$,
\[
\lim_{n\to\infty}\sup_{\mathbf{V}\times\mathbf{s}\in A_n}\left|\frac{t^{(l)}_n(\mathbf{V},\mathbf{s})}{t^{(0)}(\mathbf{V},\mathbf{s})} - \mu_l(\mathbf{V},\mathbf{s})\right| = \lim_{n\to\infty}\sup_{\mathbf{V}\times\mathbf{s}\in A}\left|\frac{t^{(l)}_n(\mathbf{V},\mathbf{s})}{t^{(0)}(\mathbf{V},\mathbf{s})} - \mu_l(\mathbf{V},\mathbf{s})\right|.
\]
Substituting in (59), we obtain
\[
\bar y_l(\mathbf{V},\mathbf{s}) = \frac{\mu_l + O_P\left((\delta^*_n)^{-1}(a_n+h_n)\right)}{1 + O_P\left((\delta^*_n)^{-1}(a_n+h_n)\right)} = \mu_l + O_P\left((\delta^*_n)^{-1}(a_n+h_n)\right).
\]
For $l=2$, $\widetilde Y_i^2 = g(\widetilde{\mathbf{B}}^T\mathbf{X}_i)^2 + 2g(\widetilde{\mathbf{B}}^T\mathbf{X}_i)\tilde\epsilon_i + \tilde\epsilon_i^2$, and (58) follows from (10).

Lemma 19. Under (E.1), (E.2), (E.4), there exists $0<C_{10}<\infty$ such that
\[
|\mu_l(\mathbf{V},\mathbf{s}) - \mu_l(\mathbf{V}_j,\mathbf{s})| \le C_{10}\|P_{\mathbf{V}}-P_{\mathbf{V}_j}\| \tag{60}
\]
for all interior points $\mathbf{s}\in\operatorname{supp}(f_{\mathbf{X}})$.

Proof. Using the representation $\tilde t^{(l)}(P_{\mathbf{V}},\mathbf{s})$ in (16) instead of $t^{(l)}(\mathbf{V},\mathbf{s})$, we consider $\mu_l(\mathbf{V},\mathbf{s}) = \mu_l(P_{\mathbf{V}},\mathbf{s})$ as a function on the Grassmann manifold, since $P_{\mathbf{V}}\in\operatorname{Gr}(p,q)$. Then,
\begin{align}
\left|\mu_l(P_{\mathbf{V}},\mathbf{s}) - \mu_l(P_{\mathbf{V}_j},\mathbf{s})\right| &= \left|\frac{\tilde t^{(l)}(P_{\mathbf{V}},\mathbf{s})}{\tilde t^{(0)}(P_{\mathbf{V}},\mathbf{s})} - \frac{\tilde t^{(l)}(P_{\mathbf{V}_j},\mathbf{s})}{\tilde t^{(0)}(P_{\mathbf{V}_j},\mathbf{s})}\right| \notag\\
&\le \frac{\sup\left|\tilde t^{(0)}(P_{\mathbf{V}},\mathbf{s})\right|}{\left(\inf\tilde t^{(0)}(P_{\mathbf{V}},\mathbf{s})\right)^2}\left|\tilde t^{(l)}(P_{\mathbf{V}},\mathbf{s}) - \tilde t^{(l)}(P_{\mathbf{V}_j},\mathbf{s})\right| + \frac{\sup\tilde t^{(l)}(P_{\mathbf{V}},\mathbf{s})}{\left(\inf\tilde t^{(0)}(P_{\mathbf{V}},\mathbf{s})\right)^2}\left|\tilde t^{(0)}(P_{\mathbf{V}},\mathbf{s}) - \tilde t^{(0)}(P_{\mathbf{V}_j},\mathbf{s})\right| \tag{61}
\end{align}
with $\sup_{P_{\mathbf{V}}\in\operatorname{Gr}(p,q)}\tilde t^{(0)}(P_{\mathbf{V}},\mathbf{s})\in(0,\infty)$ and $\inf_{P_{\mathbf{V}}\in\operatorname{Gr}(p,q)}\tilde t^{(0)}(P_{\mathbf{V}},\mathbf{s})\in(0,\infty)$, since $\tilde t^{(l)}$ is continuous, $\Sigma_{\mathbf{x}}>0$ and $\mathbf{s}\in\operatorname{supp}(f_{\mathbf{X}})$ is an interior point.

By (E.2) and (E.4), $\tilde g_l(\mathbf{x}) = g(\widetilde{\mathbf{B}}^T\mathbf{x})^l f_{\mathbf{X}}(\mathbf{x})$ is twice continuously differentiable and therefore Lipschitz continuous on compact sets. We denote its Lipschitz constant by $L_g<\infty$. Therefore,
\[
\left|\tilde t^{(l)}(P_{\mathbf{V}},\mathbf{s}) - \tilde t^{(l)}(P_{\mathbf{V}_j},\mathbf{s})\right| \le \int_{\operatorname{supp}(f_{\mathbf{X}})}\left|\tilde g_l(\mathbf{s}+P_{\mathbf{V}}\mathbf{r}) - \tilde g_l(\mathbf{s}+P_{\mathbf{V}_j}\mathbf{r})\right|d\mathbf{r} \le L_g\int\left\|(P_{\mathbf{V}}-P_{\mathbf{V}_j})\mathbf{r}\right\|d\mathbf{r} \le L_g\left(\int\|\mathbf{r}\|\,d\mathbf{r}\right)\|P_{\mathbf{V}}-P_{\mathbf{V}_j}\| \tag{62}
\]
where the last inequality is due to the sub-multiplicativity of the Frobenius norm and the integral is finite by (E.2). Plugging (62) into (61) and collecting all constants into $C_{10}$ yields (60).
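In a toy case the Lipschitz bound (60) can even be checked in closed form. Assuming $\mathbf{X}\sim N(0,\mathbf{I}_p)$ and the identity link $g(u)=u$ (assumptions made only for this illustration), $\mathbf{X}$ conditioned on the slice $\mathbf{s}+\operatorname{span}\{\mathbf{V}\}$ is $\mathbf{s}+\mathbf{V}\mathbf{r}$ with $\mathbf{r}\sim N(-\mathbf{V}^T\mathbf{s},\mathbf{I}_q)$, so $\mu_1(P_{\mathbf{V}},\mathbf{s}) = \widetilde{\mathbf{B}}^T(\mathbf{I}_p-P_{\mathbf{V}})\mathbf{s}$ and $\|\mathbf{s}\|$ serves as a Lipschitz constant:

```python
# Closed-form check of (60) in a toy case. Assumptions: X ~ N(0, I_p) and the
# identity link, so mu_1(P_V, s) = B^T (I - P_V) s and
# |mu_1(P_V, s) - mu_1(P_W, s)| <= ||s|| * ||P_V - P_W||_F (since ||B|| = 1).
import numpy as np

rng = np.random.default_rng(5)
p, q = 4, 2
B = rng.normal(size=(p, 1)); B /= np.linalg.norm(B)
s = rng.normal(size=p)

def mu1(V):
    """mu_1(P_V, s) in the Gaussian/identity-link toy case."""
    return float(B.T @ (np.eye(p) - V @ V.T) @ s)

for eps in (1e-1, 1e-2, 1e-3):
    V, _ = np.linalg.qr(rng.normal(size=(p, q)))
    W, _ = np.linalg.qr(V + eps * rng.normal(size=(p, q)))   # nearby Stiefel point
    lhs = abs(mu1(V) - mu1(W))
    rhs = np.linalg.norm(s) * np.linalg.norm(V @ V.T - W @ W.T)
    print(eps, bool(lhs <= rhs), round(lhs, 6), round(rhs, 6))  # bound always holds
```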
Proof of Theorem 8. By (23) and (7),
\[
|L^*_n(\mathbf{V},f) - L^*_F(\mathbf{V},f)| \le \left|\frac{1}{n}\sum_i\left(\tilde L_{n,F}(\mathbf{V},\mathbf{X}_i,f) - \tilde L_F(\mathbf{V},\mathbf{X}_i,f)\right)\right| + \left|\frac{1}{n}\sum_i\left(\tilde L_F(\mathbf{V},\mathbf{X}_i,f) - \mathbb{E}(\tilde L_F(\mathbf{V},\mathbf{X},f))\right)\right|. \tag{63}
\]
By Theorem 18,
\[
\left|\frac{1}{n}\sum_i\tilde L_{n,F}(\mathbf{V},\mathbf{X}_i,f) - \tilde L_F(\mathbf{V},\mathbf{X}_i,f)\right| \le \sup_{\mathbf{V}\times\mathbf{s}\in A}\left|\tilde L_{n,F}(\mathbf{V},\mathbf{s},f) - \tilde L_F(\mathbf{V},\mathbf{s},f)\right| = o_P(1). \tag{64}
\]
The second term in (63) converges to 0 almost surely for all $\mathbf{V}\in\mathcal{S}(p,q)$ by the strong law of large numbers. To show uniform convergence, the same technique as in the proof of Lemma 15 is used. Let $B_j = \{\mathbf{V}\in\mathcal{S}(p,q) : \|\mathbf{V}\mathbf{V}^T-\mathbf{V}_j\mathbf{V}_j^T\|\le\tilde a_n\}$ be a cover of $\mathcal{S}(p,q)\subset\bigcup_{j=1}^N B_j$ with $N\le C\tilde a_n^{-d} = C(n/\log(n))^{d/2}\le C n^{d/2}$, where $d = \dim(\mathcal{S}(p,q))$ is given in the proof of Lemma 15. By Lemma 19,
\[
|\mu_l(\mathbf{V},\mathbf{X}_i) - \mu_l(\mathbf{V}_j,\mathbf{X}_i)| \le C_{10}\|P_{\mathbf{V}}-P_{\mathbf{V}_j}\|. \tag{65}
\]
Let $G_n(\mathbf{V},f) = \sum_i\tilde L_F(\mathbf{V},\mathbf{X}_i,f)/n$ with $\mathbb{E}(G_n(\mathbf{V},f)) = L^*_F(\mathbf{V},f)$. Using (65) and following the same steps as in the proof of Lemma 15, we obtain
\[
|G_n(\mathbf{V},f) - L^*_F(\mathbf{V},f)| \le |G_n(\mathbf{V},f) - G_n(\mathbf{V}_j,f)| + |G_n(\mathbf{V}_j,f) - L^*_F(\mathbf{V}_j,f)| + |L^*_F(\mathbf{V},f) - L^*_F(\mathbf{V}_j,f)| \le C_{11}\tilde a_n + |G_n(\mathbf{V}_j,f) - L^*_F(\mathbf{V}_j,f)| \tag{66}
\]
for $\mathbf{V}\in B_j$ and some $C_{11}>C_{10}$. Inequality (66) leads to
\[
P\left(\sup_{\mathbf{V}\in\mathcal{S}(p,q)}|G_n(\mathbf{V},f) - L^*_F(\mathbf{V},f)| > C_{12}\tilde a_n\right) \le C N\,P\left(\sup_{\mathbf{V}\in B_j}|G_n(\mathbf{V},f) - L^*_F(\mathbf{V},f)| > C_{12}\tilde a_n\right) \le C n^{d/2}\,P\left(|G_n(\mathbf{V}_j,f) - L^*_F(\mathbf{V}_j,f)| > C_{12}\tilde a_n\right) \le C n^{d/2}\,n^{-\gamma(C_{12})}\to 0 \tag{67}
\]
where the last inequality in (67) is due to (45) with $Z_i = \tilde L_F(\mathbf{V}_j,\mathbf{X}_i,f)$, which is bounded since $\tilde L_F(\cdot,\cdot,f)$ is continuous on the compact set $A$, and $\gamma(C_{12})$ is a monotone increasing function of $C_{12}$ that can be made arbitrarily large by choosing $C_{12}$ accordingly. Therefore, $\sup_{\mathbf{V}\in\mathcal{S}(p,q)}|L^*_n(\mathbf{V},f) - L^*_F(\mathbf{V},f)| \le o_P(1) + O_P(\tilde a_n)$, which implies Theorem 8.
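The objective handled by Theorem 8 is straightforward to assemble from the earlier ingredients: $L^*_n(\mathbf{V},f)$ of (23) averages the local variance $\tilde L_{n,F}(\mathbf{V},\mathbf{X}_i,f)$ of (22) over the shifting points $\mathbf{s}=\mathbf{X}_i$. The sketch below does exactly that, using the same hypothetical kernel and slice distance as in the earlier snippets; the data-generating choices are assumptions, and the final comparison only illustrates that the objective should typically be smaller when $\mathbf{V}\perp\widetilde{\mathbf{B}}$.

```python
# Sketch of the empirical objective of Theorem 8: L*_n(V, f) of (23) is the
# average over s = X_i of the local variance L~_{n,F}(V, s) of (22). Kernel,
# distance, and data are the same hypothetical choices as in earlier snippets.
import numpy as np

def local_var(V, s, X, Ytil, h):
    """L~_{n,F}(V, s) of (22): ybar_2 - ybar_1^2 with kernel weights w_i."""
    Xc = X - s
    d = (Xc**2).sum(axis=1) - ((Xc @ V)**2).sum(axis=1)
    w = np.exp(-0.5 * (d / h)**2)
    w = w / w.sum()
    y1 = (w * Ytil).sum()
    return (w * Ytil**2).sum() - y1**2

def L_star_n(V, X, Ytil, h):
    """L*_n(V, f) of (23): average of local variances over s = X_i."""
    return np.mean([local_var(V, s, X, Ytil, h) for s in X])

rng = np.random.default_rng(6)
n, p, q = 400, 3, 2
X = rng.uniform(-1, 1, size=(n, p))
Ytil = np.sin(2 * X[:, 0]) + 0.1 * rng.normal(size=n)   # span{B-tilde} = e1
V_orth = np.eye(p)[:, 1:3]                 # orthogonal to B-tilde
V_bad, _ = np.linalg.qr(rng.normal(size=(p, q)))
print(L_star_n(V_orth, X, Ytil, h=0.1))    # near the error variance
print(L_star_n(V_bad, X, Ytil, h=0.1))     # typically larger
```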
Proof of Theorem 9. We apply [Ame85, Thm 4.1.1] to obtain consistency of the conditional variance estimator. This theorem gives three conditions that guarantee the convergence of the minimizer of a sequence of random functions $L^*_n(P_{\mathbf{V}},f_t)$ to the minimizer of the limiting function $L^*(P_{\mathbf{V}},f_t)$; i.e., $P_{\operatorname{span}\{\widehat{\mathbf{B}}_{t,k_t}\}^\perp} = \operatorname{argmin}\,L^*_n(P_{\mathbf{V}},f_t)\to P_{\operatorname{span}\{\mathbf{B}\}^\perp} = \operatorname{argmin}\,L^*(P_{\mathbf{V}},f_t)$ in probability. The three conditions are: (1) the parameter space is compact; (2) $L^*_n(P_{\mathbf{V}},f_t)$ is continuous in $P_{\mathbf{V}}$ and a measurable function of the data $(Y_i,\mathbf{X}_i^T)_{i=1,\ldots,n}$; and (3) $L^*_n(P_{\mathbf{V}},f_t)$ converges uniformly to $L^*(P_{\mathbf{V}},f_t)$, and $L^*(P_{\mathbf{V}},f_t)$ attains a unique global minimum at $\mathcal{S}^\perp_{E(f_t(Y)\mid\mathbf{X})}$. Since $L^*_n(\mathbf{V},f_t)$ depends on $\mathbf{V}$ only through $P_{\mathbf{V}} = \mathbf{V}\mathbf{V}^T$, $L^*_n(\mathbf{V},f_t)$ can be considered as a function on the Grassmann manifold, which is compact, and the same holds for $L^*(\mathbf{V},f_t)$ by (16). Further, $L^*_n(\mathbf{V},f_t)$ is by definition a measurable function of the data and is continuous in $\mathbf{V}$ if a continuous kernel, such as the Gaussian, is used. Theorem 8 yields the uniform convergence, and Theorem 4 yields that the minimizer is unique when $L(\mathbf{V})$ is minimized over the Grassmann manifold $\operatorname{Gr}(p,q)$, since $\mathcal{S}_{E(f_t(Y)\mid\mathbf{X})} = \operatorname{span}\{\widetilde{\mathbf{B}}\}$ is uniquely identifiable and so is $\operatorname{span}\{\widetilde{\mathbf{B}}\}^\perp$; i.e.,
\[
\left\|P_{\operatorname{span}\{\widehat{\mathbf{B}}_{t,k_t}\}} - P_{\operatorname{span}\{\widetilde{\mathbf{B}}\}}\right\| = \left\|\widehat{\mathbf{B}}_{t,k_t}\widehat{\mathbf{B}}_{t,k_t}^T - \widetilde{\mathbf{B}}\widetilde{\mathbf{B}}^T\right\| = \left\|\left(\mathbf{I}_p-\widetilde{\mathbf{B}}\widetilde{\mathbf{B}}^T\right) - \left(\mathbf{I}_p-\widehat{\mathbf{B}}_{t,k_t}\widehat{\mathbf{B}}_{t,k_t}^T\right)\right\| = \left\|P_{\operatorname{span}\{\widetilde{\mathbf{B}}\}^\perp} - P_{\operatorname{span}\{\widehat{\mathbf{B}}_{t,k_t}\}^\perp}\right\|.
\]
Thus, all three conditions are met and the result follows.

Proof of Theorem 10. Let $(t_j)_{j=1,\ldots,m_n}$ be an i.i.d. sample from $F_T$ and write
\[
|L_{n,F}(\mathbf{V}) - L_F(\mathbf{V})| \le \left|\frac{1}{m_n}\sum_{j=1}^{m_n}\left(L^*_n(\mathbf{V},f_{t_j}) - L^*(\mathbf{V},f_{t_j})\right)\right| + \left|\frac{1}{m_n}\sum_{j=1}^{m_n}\left(L^*(\mathbf{V},f_{t_j}) - \mathbb{E}_{t\sim F_T}(L^*(\mathbf{V},f_t))\right)\right|. \tag{68}
\]
Then $\sup_{\mathbf{V}\in\mathcal{S}(p,q)}|L^*_n(\mathbf{V},f_t) - L^*(\mathbf{V},f_t)|\le 8M^2$ by the assumption $\sup_{t\in\Omega_T}|f_t(Y)|<M<\infty$ and the triangle inequality. That is, $L^*_n(\mathbf{V},f_t)$ estimates a variance of a bounded response $f_t(Y)\in[-M,M]$ and is therefore bounded by the squared range $4M^2$ of $f_t(Y)$; the same holds for $L^*(\mathbf{V},f_t)$. Further, $8M^2$ is an integrable dominating function, so that Fatou's Lemma applies.

Consider the first term on the right hand side of (68) and let $\delta>0$. By Markov's and the triangle inequalities and Fatou's Lemma,
\begin{align*}
\limsup_n P\left(\sup_{\mathbf{V}\in\mathcal{S}(p,q)}\left|\frac{1}{m_n}\sum_{j=1}^{m_n}L^*_n(\mathbf{V},f_{t_j}) - L^*(\mathbf{V},f_{t_j})\right|>\delta\right) &\le \frac{1}{\delta}\limsup_n\mathbb{E}_{F_T}\,\mathbb{E}\left(\sup_{\mathbf{V}\in\mathcal{S}(p,q)}\left|\frac{1}{m_n}\sum_{j=1}^{m_n}L^*_n(\mathbf{V},f_{t_j}) - L^*(\mathbf{V},f_{t_j})\right|\right) && \text{(Markov inequality)}\\
&\le \frac{1}{\delta}\limsup_n\mathbb{E}_{F_T}\left(\frac{1}{m_n}\sum_{j=1}^{m_n}\mathbb{E}\left(\sup_{\mathbf{V}\in\mathcal{S}(p,q)}\left|L^*_n(\mathbf{V},f_{t_j}) - L^*(\mathbf{V},f_{t_j})\right|\right)\right)\\
&= \frac{1}{\delta}\limsup_n\mathbb{E}_{F_T}\left(\mathbb{E}\left(\sup_{\mathbf{V}\in\mathcal{S}(p,q)}\left|L^*_n(\mathbf{V},f_{t_j}) - L^*(\mathbf{V},f_{t_j})\right|\right)\right)\\
&\le \frac{1}{\delta}\mathbb{E}_{F_T}\left(\mathbb{E}\left(\limsup_n\sup_{\mathbf{V}\in\mathcal{S}(p,q)}\left|L^*_n(\mathbf{V},f_{t_j}) - L^*(\mathbf{V},f_{t_j})\right|\right)\right) = \frac{1}{\delta}\mathbb{E}_{t\sim F_T}(\mathbb{E}(0)) = 0,
\end{align*}
since by Theorem 8, $\limsup_n\sup_{\mathbf{V}\in\mathcal{S}(p,q)}|L^*_n(\mathbf{V},f_{t_j}) - L^*(\mathbf{V},f_{t_j})| = 0$.

For the second term on the right hand side of (68), we apply Theorem 2 of [Jen69] in [MMW+63, p. 40]:

Theorem 20. Let $t_j$ be an i.i.d. sample and $L^*(\mathbf{V},f_t):\Theta\times\Omega_T\to\mathbb{R}$, where $\Theta$ is a compact subset of a Euclidean space. $L^*(\mathbf{V},f_t)$ is continuous in $\mathbf{V}$ and measurable in $t$ by Theorem 4. If $L^*(\mathbf{V},f_{t_j})\le h(t_j)$, where $h(t_j)$ is integrable with respect to $F_T$, then
\[
\frac{1}{m_n}\sum_{j=1}^{m_n}L^*(\mathbf{V},f_{t_j}) \longrightarrow \mathbb{E}_{F_T}(L^*(\mathbf{V},f_t))
\]
uniformly over $\mathbf{V}\in\Theta$ almost surely as $n\to\infty$.

Here $\mathbf{V}\in\mathcal{S}(p,q) = \Theta\subseteq\mathbb{R}^{pq}$; by $\sup_{t\in\Omega_T}|f_t(Y)|<M<\infty$ and an argument analogous to that for the first term in (68), $Z_j(\mathbf{V}) = L^*(\mathbf{V},f_{t_j}) < 4M^2$. Therefore, $\mathbb{E}(\sup_{\mathbf{V}\in\mathcal{S}(p,q)}|Z_j(\mathbf{V})|)<4M^2$, which is integrable. Further, since the $t_j$ are an i.i.d. sample from $F_T$, $Z_j(\mathbf{V})$ is an i.i.d. sequence of random variables, $Z_j(\mathbf{V})$ is continuous in $\mathbf{V}$ by Theorem 4, and the parameter space $\mathcal{S}(p,q)$ is compact.
Then by Theorem 20,
\[
\sup_{\mathbf{V}\in\mathcal{S}(p,q)}\left|\frac{1}{m_n}\sum_{j=1}^{m_n}L^*(\mathbf{V},f_{t_j}) - \mathbb{E}_{t\sim F_T}(L^*(\mathbf{V},f_t))\right| \longrightarrow 0
\]
almost surely as $n\to\infty$ if $\lim_{n\to\infty}m_n = \infty$. Putting everything together, it follows that $\sup_{\mathbf{V}\in\mathcal{S}(p,q)}|L_{n,F}(\mathbf{V}) - L_F(\mathbf{V})|\to 0$ in probability as $n\to\infty$.

Proof of Theorem 11. The proof is directly analogous to the proof of Theorem 9. The uniform convergence of the target function $L_{n,F}(\mathbf{V})$ is obtained by Theorem 10. The minimizer over $\operatorname{Gr}(p,q)$ and its uniqueness derive from Theorem 6.

Proof of Theorem 7. In this proof we suppress the dependence on $f$ in the notation. The Gaussian kernel $K$ satisfies $\partial_z K(z) = -zK(z)$. From (20) and (22) we have $\tilde L_{n,F} = \bar y_2 - \bar y_1^2$, where $\bar y_l = \sum_i w_i\widetilde Y_i^l$, $l=1,2$. We let $K_j = K(d_j(\mathbf{V},\mathbf{s})/h_n)$, suppress the dependence on $\mathbf{V}$ and $\mathbf{s}$, and write $w_i = K_i/\sum_j K_j$. Then,
\[
\nabla K_i = -\frac{1}{h_n^2}K_i d_i\nabla d_i \quad\text{and}\quad \nabla w_i = -\frac{1}{h_n^2}\,\frac{K_i d_i\nabla d_i\left(\sum_j K_j\right) - K_i\sum_j K_j d_j\nabla d_j}{\left(\sum_j K_j\right)^2}.
\]
Next,
\begin{align}
\nabla\bar y_l &= -\frac{1}{h_n^2}\sum_i\widetilde Y_i^l\left(\frac{K_i d_i\nabla d_i}{\sum_j K_j} - \frac{K_i\left(\sum_j K_j d_j\nabla d_j\right)}{\left(\sum_j K_j\right)^2}\right) = -\frac{1}{h_n^2}\sum_i\widetilde Y_i^l w_i\left(d_i\nabla d_i - \sum_j w_j d_j\nabla d_j\right) \notag\\
&= -\frac{1}{h_n^2}\left(\sum_i\widetilde Y_i^l w_i d_i\nabla d_i - \sum_j\widetilde Y_j^l w_j\sum_i w_i d_i\nabla d_i\right) = -\frac{1}{h_n^2}\sum_i\left(\widetilde Y_i^l - \bar y_l\right)w_i d_i\nabla d_i. \tag{69}
\end{align}
Then $\nabla\tilde L_{n,F} = \nabla\bar y_2 - 2\bar y_1\nabla\bar y_1$, and inserting $\nabla\bar y_l$ from (69) yields
\[
\nabla\tilde L_{n,F} = -\frac{1}{h_n^2}\sum_i\left(\widetilde Y_i^2 - \bar y_2 - 2\bar y_1(\widetilde Y_i - \bar y_1)\right)w_i d_i\nabla d_i = \frac{1}{h_n^2}\sum_i\left(\tilde L_{n,F} - (\widetilde Y_i - \bar y_1)^2\right)w_i d_i\nabla d_i,
\]
since $\widetilde Y_i^2 - \bar y_2 - 2\bar y_1(\widetilde Y_i - \bar y_1) = (\widetilde Y_i - \bar y_1)^2 - \tilde L_{n,F}$.
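Since the closed-form gradient is the main product of this proof, it is worth verifying mechanically. The sketch below implements $\tilde L_{n,F}$ and the gradient formula just derived, using $\nabla_{\mathbf{V}}d_i = -2(\mathbf{X}_i-\mathbf{s})(\mathbf{X}_i-\mathbf{s})^T\mathbf{V}$ for the squared slice distance $d_i$ of (19), and compares it with central finite differences in the free matrix $\mathbf{V}\in\mathbb{R}^{p\times q}$ (the Stiefel constraint is irrelevant for checking the formula itself). All data choices are illustrative assumptions.

```python
# Finite-difference check of the gradient from the proof of Theorem 7 with the
# Gaussian kernel K(z) = exp(-z^2/2), so K'(z) = -z K(z):
#   grad L~_{n,F} = (1/h^2) sum_i (L~_{n,F} - (Ytil_i - ybar_1)^2) w_i d_i grad d_i,
# where grad_V d_i = -2 (X_i - s)(X_i - s)^T V for the squared slice distance.
import numpy as np

rng = np.random.default_rng(7)
n, p, q, h = 50, 3, 2, 0.5
X = rng.uniform(-1, 1, size=(n, p))
Ytil = np.sin(X[:, 0]) + 0.1 * rng.normal(size=n)
s = X[0]
V, _ = np.linalg.qr(rng.normal(size=(p, q)))

def parts(V):
    Xc = X - s
    d = (Xc**2).sum(axis=1) - ((Xc @ V)**2).sum(axis=1)
    K = np.exp(-0.5 * (d / h)**2)
    w = K / K.sum()
    y1 = (w * Ytil).sum()
    return Xc, d, w, y1, (w * Ytil**2).sum() - y1**2

def L_tilde(V):
    return parts(V)[4]

def grad_closed_form(V):
    Xc, d, w, y1, L = parts(V)
    coef = (L - (Ytil - y1)**2) * w * d / h**2            # scalar per observation
    grad_d = -2 * Xc[:, :, None] * (Xc @ V)[:, None, :]   # grad_V d_i, shape (n, p, q)
    return (coef[:, None, None] * grad_d).sum(axis=0)

eps, G = 1e-6, grad_closed_form(V)
G_fd = np.zeros_like(G)
for a in range(p):                                        # central finite differences
    for b in range(q):
        E = np.zeros((p, q)); E[a, b] = eps
        G_fd[a, b] = (L_tilde(V + E) - L_tilde(V - E)) / (2 * eps)
print(np.max(np.abs(G - G_fd)))                           # small (~1e-7 or below)
```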