Conditional Variance Estimator for Sufficient Dimension Reduction
Lukas Fertl ([email protected])
Institute of Statistics and Mathematical Methods in Economics
Faculty of Mathematics and Geoinformation
TU Wien, Vienna, Austria
Efstathia Bura ([email protected])
Institute of Statistics and Mathematical Methods in Economics
Faculty of Mathematics and Geoinformation
TU Wien, Vienna, Austria

February 18, 2021

Abstract
Conditional Variance Estimation (CVE) is a novel sufficient dimension reduction (SDR) method for additive error regressions with continuous predictors and link function. It operates under the assumption that the predictors can be replaced by a lower dimensional projection without loss of information. In contrast to the majority of moment based sufficient dimension reduction methods, Conditional Variance Estimation is fully data driven, does not require the restrictive linearity and constant variance conditions, and is not based on inverse regression. CVE is shown to be consistent and its objective function to be uniformly convergent. CVE outperforms minimum average variance estimation (MAVE), its main competitor, in several simulation settings, remains on par under others, and always outperforms the usual inverse regression based linear SDR methods, such as Sliced Inverse Regression.
Suppose (Y, Xᵀ)ᵀ has a joint continuous distribution, where Y ∈ R denotes a univariate response and X ∈ R^p a p-dimensional covariate vector. We assume that the dependence of Y and X is modelled by

Y = g(BᵀX) + ε,   (1)

where X is independent of ε and has positive definite variance-covariance matrix Var(X) = Σ_x, ε ∈ R is a mean zero random variable with finite Var(ε) = E(ε²) = η², g is an unknown continuous non-constant function, and B = (b₁, ..., b_k) ∈ R^{p×k} has rank k ≤ p. Model (1) states that

E(Y | X) = E(Y | BᵀX)   (2)

and requires that the first conditional moment E(Y | X) = g(BᵀX) contain the entirety of the information in X about Y and be captured by BᵀX, so that F(Y | X) = F(Y | BᵀX), where F(· | ·) denotes the conditional cumulative distribution function (cdf) of the first given the second argument. That is, Y is statistically independent of X when BᵀX is given, and replacing X by BᵀX induces no loss of information for the regression of Y on X.
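To fix ideas, the following minimal R sketch generates data from a model of the form (1); the particular link function, dimensions, and noise level are illustrative choices, not taken from the paper.

```r
# Simulate (Y, X) from model (1): Y = g(B'X) + eps with eps independent of X.
set.seed(1)
n <- 200; p <- 5; k <- 1
B <- matrix(c(1, 1, 0, 0, 0), p, k) / sqrt(2)   # semi-orthogonal basis: t(B) %*% B = I_k
X <- matrix(rnorm(n * p), n, p)                  # continuous predictors, Var(X) = I_p
g <- function(z) cos(z)                          # unknown, non-constant link function
Y <- g(X %*% B) + rnorm(n, sd = 0.5)             # additive, mean-zero error
```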
Identifying the span of B, i.e., the column space of B (only span{B} is identifiable), suffices in order to identify the sufficient reduction of X for the regression of Y on X. We assume, without loss of generality, that B is semi-orthogonal, i.e., BᵀB = I_k, since a change of coordinate system by an orthogonal transformation does not alter model (2). For q ≤ p, let

S(p, q) = {V ∈ R^{p×q} : VᵀV = I_q}   (3)

denote the Stiefel manifold, which comprises all p × q matrices with orthonormal columns. S(p, q) is compact and dim(S(p, q)) = pq − q(q+1)/2 [see [4] and Section 2.1 of [31]]. Further, let

Gr(p, q) = S(p, q)/S(q, q)   (4)

denote the Grassmann manifold of all q-dimensional linear subspaces of R^p, which is the quotient space of S(p, q) with respect to the group S(q, q) of all q × q orthonormal matrices; i.e., the basis of a linear subspace is unique up to orthogonal transformations. The fact that only span{B} is identifiable can be expressed through the Grassmann manifold Gr(p, q) in (4). The goal of sufficient dimension reduction in model (1) is to find a subspace M ∈ Gr(p, k) such that any basis B ∈ S(p, k) of M fulfills (1), or equivalently (2).

Finding sufficient reductions of the predictors to replace them in regression and classification without loss of information is called sufficient dimension reduction [9]. The first split in the sufficient dimension reduction taxonomy occurs between likelihood and non-likelihood based methods. The former, which were developed more recently [11, 10, 12, 6, 5], assume knowledge either of the joint family of distributions of (Y, Xᵀ)ᵀ or of the conditional family of distributions of X | Y. The latter is the most researched branch of sufficient dimension reduction and comprises three classes of methods: inverse regression based, semi-parametric, and nonparametric. Reviews of the former two classes can be found in [1, 25, 22].

In this paper we present conditional variance estimation, which falls in the class of nonparametric methods. The estimators in this class minimize a criterion that describes the fit of the dimension reduction model (2) under (1) to the observed data. Since the criterion involves unknown distributions or regression functions, nonparametric estimation is used to recover span{B}. Statistical approaches to identify B in (2) include ordinary least squares and nonparametric multiple index models [34]. The least squares estimator, Σ_x^{-1} cov(X, Y), always falls in span{B} [22, Th. 8.3]. Principal Hessian Directions [24] was the first sufficient dimension reduction estimator to target span{B} in (2). Its main disadvantage is that it requires the so-called linearity and constant variance conditions on the marginal distribution of X. Its relaxation, Iterative Hessian Transformation [13], still requires the linearity condition in order to recover vectors in span{B}.

The most competitive nonparametric sufficient dimension reduction method up to now has been minimum average variance estimation (MAVE, [35]). It assumes model (1), a covariate density with bounded fourth derivative, and the existence of continuous bounded third derivatives of g. It uses a local first order approximation of g in (1) and minimizes the expected conditional variance of the response given BᵀX.

The conditional variance estimator also targets and recovers span{B} in models (1) and (2).
The objective function is based on the intuition that the directions in the predictor space that capture the dependence of Y on X should exhibit significantly higher variation in Y as compared with the directions along which Y exhibits markedly less variation. The conditional variance estimator is a fully data-driven estimator that performs better than, or is on par with, minimum average variance estimation in simulations. The conditional variance estimator differs from other approaches, including MAVE, in that it targets only span{B} and does not require an explicit form or estimation of the link function g. As a result, it requires weaker assumptions on its smoothness.

Let (Ω, F, P) be a probability space and X : Ω → R^p a random vector with a continuous probability density function f_X, and denote its support by supp(f_X). Throughout, ‖·‖ denotes the Frobenius norm for matrices and the Euclidean norm for vectors, and scalar product refers to the Euclidean scalar product. For any matrix M, or linear subspace M, we denote by P_M the projection matrix on the column space of the matrix or on the subspace; i.e., P_M = M(MᵀM)^{-1}Mᵀ ∈ R^{p×p} for M ∈ R^{p×q}. For any V ∈ S(p, q), defined in (3), we generically denote by U a basis of the orthogonal complement of its column space span{V}. That is, U ∈ S(p, p−q) with span{V} ⊥ span{U} and span{V} ⊕ span{U} = R^p, UᵀV = 0 ∈ R^{(p−q)×q}, UᵀU = I_{p−q}. For any x, s ∈ R^p we can always write

x = s + P_V(x − s) + P_U(x − s) = s + V r₁ + U r₂,   (5)

where r₁ = Vᵀ(x − s) ∈ R^q and r₂ = Uᵀ(x − s) ∈ R^{p−q}.
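The orthogonal decomposition (5) is easy to verify numerically; below is a small R sketch with arbitrary illustrative dimensions.

```r
# Decomposition (5): x = s + V r1 + U r2, with U an orthonormal basis of span{V}-perp.
set.seed(1)
p <- 6; q <- 2
V <- qr.Q(qr(matrix(rnorm(p * q), p, q)))         # V in S(p, q): orthonormal columns
U <- qr.Q(qr(V), complete = TRUE)[, (q + 1):p]    # orthonormal basis of the orthogonal complement
x <- rnorm(p); s <- rnorm(p)
r1 <- drop(t(V) %*% (x - s))                      # coordinates in span{V}
r2 <- drop(t(U) %*% (x - s))                      # coordinates in span{V}-perp
max(abs(x - (s + V %*% r1 + U %*% r2)))           # ~ 1e-16: (5) holds
```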
In the sequel we refer to the following assumptions as needed; the proofs of the theorems are presented in the Appendix.

(A.1).
Model Y = g(BᵀX) + ε holds with Y ∈ R, g : R^k → R non-constant in all arguments, B = (b₁, ..., b_k) ∈ R^{p×k} of rank k ≤ p, X ∈ R^p independent of ε, Var(X) = Σ_x positive definite, E(ε) = 0, and Var(ε) = η² < ∞.

(A.2). The link function g and the density f_X : R^p → [0, ∞) of X are twice continuously differentiable.

(A.3). E(|Y|⁴) < ∞.

(A.4). supp(f_X) is compact.

Remark.
Assumption (A.4) is not as restrictive as it might seem. [36] showed in Proposition 11 that there is a compact set S ⊂ R^p such that the mean subspace of model (1) is the same as the mean subspace of Y = g(BᵀX_S) + ε, where X_S = X 1{X ∈ S} and 1_A is the indicator function of A. Further, S can be assumed to be an ellipsoid, and the same assertion holds true for all S̃ ⊇ S.
Definition.
For q ≤ p ∈ N and any V ∈ S(p, q), we define

L̃(V, s) = Var(Y | X ∈ s + span{V}),   (6)

where s ∈ R^p is a shifting point.

Definition.
For V ∈ S(p, q), we define the objective function

L(V) = ∫_{R^p} L̃(V, x) f_X(x) dx = E(L̃(V, X)).   (7)

L(V) in (7) is the objective function for the estimator we propose for the span of B in (1), and Theorem 1 provides the statistical motivation for the objective function (7) of the conditional variance estimator. First we show that both population based functions (6) and (7) are well defined.

Let X be a p-dimensional continuous random vector with density f_X(x), let s ∈ supp(f_X) ⊂ R^p, and let V belong to the Stiefel manifold S(p, q) defined in (3). The function

f_{X | X ∈ s+span{V}}(r₁) = f_X(s + V r₁) / ∫_{R^q} f_X(s + V r₁) dr₁   (8)

is a proper conditional density of X concentrated on the affine subspace s + span{V}, using the concept of regular conditional probability [21] under assumption (A.2). The detailed justification is given in the Appendix, where we also show that under assumptions (A.1), (A.2) and (A.4), L̃(V, s) in (6) and L(V) in (7) are well defined and continuous. Moreover,

L̃(V, s) = µ₂(V, s) − µ₁(V, s)² + η²,   (9)

where

µ_l(V, s) = ∫_{R^q} g(Bᵀs + BᵀV r₁)^l [f_X(s + V r₁) / ∫_{R^q} f_X(s + V r₁) dr₁] dr₁ = t^{(l)}(V, s)/t^{(0)}(V, s)   (10)

with

t^{(l)}(V, s) = ∫_{R^q} g(Bᵀs + BᵀV r₁)^l f_X(s + V r₁) dr₁.   (11)

Theorem 1.
Suppose V = (v₁, ..., v_q) ∈ S(p, q) and q ∈ {1, . . . , p}. Under assumptions (A.1), (A.2) and (A.4),

(a) for all s ∈ R^p and V such that there exists u ∈ {1, ..., q} with v_u ∈ span{B}, L̃(V, s) > Var(ε) = η² and L(V) > η²;

(b) for all s ∈ R^p and span{V} ⊥ span{B}, L̃(V, s) = η² and L(V) = η².

Proof. Let s ∈ R^p and V = (v₁, ..., v_q) ∈ R^{p×q} be such that v_u ∈ span{B} for some u ∈ {1, ..., q}. To obtain (a), observe that X ∈ s + span{V} ⟺ X = s + P_V(X − s), and using (6) yields

L̃(V, s) = Var(g(BᵀX) | X = s + VVᵀ(X − s)) + Var(ε) = Var(g(Bᵀs + BᵀVVᵀ(X − s)) | X = s + VVᵀ(X − s)) + η² > η²   (12)

since BᵀVVᵀ(X − s) ≠ 0 with probability 1, and therefore the variance term in (12) is positive. For V such that V and B are orthogonal, BᵀVVᵀ(X − s) = 0 and (b) follows. Since s is arbitrary yet constant, the statements for L(V) follow.
Theorem 1 also has an intuitive geometrical interpretation for the proposed method. If X is not random, the deterministic function Y = g(BᵀX) is constant in all directions orthogonal to B and varies in all other directions. If randomness is introduced, as in model (1), then the variation in Y stems only from ε in all directions orthogonal to B. In all other directions the variation comprises the sum of the variation of ε and of g(BᵀX). In consequence, the objective function (7) captures the variation of Y as X varies in the column space of V and is minimized in the directions orthogonal to B.

We have shown in Section 2 that the objective function L(V) in (7) is well defined and continuous. Let

V_q = argmin_{V ∈ S(p,q)} L(V).   (13)

V_q is well defined as the minimizer of a continuous function over the compact set S(p, q). Nevertheless, V_q is not unique, since L(VO) = L(V) for all orthogonal O ∈ R^{q×q} with OOᵀ = I_q, as L(V) depends on V only through span{V}. It is, however, a unique minimizer over the Grassmann manifold Gr(p, q) in (4). To see this, suppose V ∈ S(p, q) is an arbitrary basis of a subspace M ∈ Gr(p, q). We can identify M through the projection P_M = VVᵀ. By (5) we write x = V r₁ + U r₂. By the Fubini–Tonelli theorem we obtain

t̃^{(l)}(P_M, s) = ∫_{supp(f_X)} g(Bᵀs + BᵀP_M x)^l f_X(s + P_M x) dx = t^{(l)}(V, s) ∫_{supp(f_X) ∩ R^{p−q}} dr₂.   (14)

Therefore t̃^{(l)}(P_M, s)/t̃^{(0)}(P_M, s) = t^{(l)}(V, s)/t^{(0)}(V, s), and µ_l(·, s) in (10) can also be viewed as a function from Gr(p, q) to R. If the optimization (13) is over Gr(p, q), the objective function (7) has a unique minimum at span{B}⊥ by Theorem 1. Therefore B is not uniquely identifiable, but span{B} is.

Corollary 2 follows directly from Theorem 1 and provides the means for identifying the linear projections of the predictors satisfying (1).

Corollary 2.
Under assumptions (A.1), (A.2), and (A.3), the solution V_q of the optimisation problem (13) is well defined. Let k = dim(span{B}) and q = p − k. Then

(a) span{V_q} = span{B}⊥,
(b) span{V_q}⊥ = span{B}.

We next define the novel estimator of the sufficient reduction space span{B} in (1), which is motivated by Theorem 1; Corollary 2(b) serves as the estimation equation for the conditional variance estimator at the population level.

Definition.
The
Conditional Variance Estimator is defined to be any basis B_{p−q} of span{V_q}⊥; that is, the CVE of B is any B_{p−q} such that

span{B_{p−q}} = span{V_q}⊥.   (15)

When q = p − k, where k = rank(B) in (1), the CVE obtains the population span{B}. Alternatively, we could target B directly by maximizing the objective function L(V). The downside of this approach is that X either needs to be standardized, or the conditioning argument needs to be changed to X = s + P_{Σ_x^{-1}}(span{V})(X − s), where P_M(span{V}) is the orthogonal projection operator with respect to the inner product ⟨x, y⟩_M = xᵀMy. In either case, the inversion of Σ_x is required. Our choice of targeting the orthogonal complement avoids the inversion of Σ_x, and the estimation algorithm in Section 4 can be applied to regressions with p > n or p ≈ n, where n denotes the sample size. Additionally, targeting the complement has computational advantages: the dimension of the search space span{V_q}⊥ is p − q, which is smaller than the dimension of the direct target space in (15) when q = p − k for small k, which is the appropriate setting in a dimension reduction context.

Assume (Y_i, X_iᵀ)ᵀ, i = 1, ..., n, is an independent, identically distributed sample from model (1). For V ∈ S(p, q) and s ∈ R^p, we define

d_i(V, s) = ‖X_i − P_{s+span{V}} X_i‖² = ‖X_i − s‖² − ⟨X_i − s, VVᵀ(X_i − s)⟩ = ‖(I_p − VVᵀ)(X_i − s)‖² = ‖P_U(X_i − s)‖²,   (16)
where ⟨·, ·⟩ is the usual inner product in R^p, P_V = VVᵀ and P_U = I_p − P_V, using the orthogonal decomposition given by (5).

Let h_n ∈ R₊ be a sequence of bandwidths, and call the set

S_{s,V} = {x ∈ R^p : ‖x − P_{s+span{V}} x‖² ≤ h_n}

a slice that depends on both the shifting point s and the matrix V. h_n represents the squared width of a slice around the subspace s + span{V} and fulfills the following assumptions.

(H.1). h_n → 0 as n → ∞.

(H.2). n h_n^{(p−q)/2} → ∞ as n → ∞.

Remark.
For obtaining the consistency of the proposed estimator, (H.2) will be strengthened to log(n)/(n h_n^{(p−q)/2}) → 0.

Let K be a function satisfying the following assumptions.

(K.1). K : [0, ∞) → [0, ∞) is a non-increasing and continuous function with |K(z)| ≤ M₁ < ∞ and ∫_{R^q} K(‖r‖²) dr < ∞ for q ≤ p − 1.

(K.2). There exist positive finite constants L₁ and L₂ such that the kernel K satisfies one of the following:
(1) K(u) = 0 for |u| > L₂ and |K(u) − K(ũ)| ≤ L₁|u − ũ| for all u, ũ;
(2) K(u) is differentiable with |∂_u K(u)| ≤ L₁ and, for some ν > 1, |∂_u K(u)| ≤ L₁|u|^{−ν} for |u| > L₂.

Examples of functions that satisfy (K.1) and (K.2) include the Gaussian, K(z) = c exp(−z²/2), the exponential, K(z) = c exp(−z), and the squared Epanechnikov kernel, K(z) = c max{(1 − z²)², 0} (i.e. polynomial kernels), where c is a constant. The rectangular kernel, K(z) = c 1(z ≤ 1), does not fulfill the assumptions but will be mentioned for intuitive explanations. A list of further kernel functions is given in [28, Table 1].

L(V) and its uniform convergence

Definition. For i = 1, . . . , n, we define

w_i(V, s) = K(d_i(V, s)/h_n) / Σ_{j=1}^n K(d_j(V, s)/h_n).   (17)

Definition.
The sample based estimate of L̃(V, s) is defined as

L̃_n(V, s) = Σ_{i=1}^n w_i(V, s)(Y_i − ȳ₁(V, s))² = ȳ₂(V, s) − ȳ₁(V, s)²,   (18)

where ȳ_l(V, s) = Σ_{i=1}^n w_i(V, s) Y_i^l, l = 1, 2.

Definition.
The estimate of the objective function L(V) in (7) is defined as

L_n(V) = (1/n) Σ_{i=1}^n L̃_n(V, X_i),   (19)

where each data point X_i is a shifting point.

To obtain insight into the choice of L̃_n(V, s) in (18), consider the rectangular kernel K(z) = 1{z ≤ 1}. In this case, L̃_n(V, s) computes the empirical variance of the Y_i's corresponding to the X_i's that are no further than √h_n away from the affine space s + span{V}, i.e., d_i(V, s) = ‖X_i − P_{s+span{V}} X_i‖² ≤ h_n. If a smooth kernel is used, such as the Gaussian in our simulation studies, then L̃_n(V, s) is also smooth, which allows the computation of the gradients required to solve the optimization problem.
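The following R sketch evaluates the sample objective in (16)–(19) with the Gaussian kernel. The function name cve_objective and the simple looped layout are illustrative choices, not the implementation that accompanies the paper.

```r
# Sample objective L_n(V) of (19), using the Gaussian kernel K(z) = exp(-z^2 / 2).
# X: n x p matrix, Y: length-n response, V: p x q semi-orthogonal matrix, h: bandwidth h_n.
cve_objective <- function(V, X, Y, h) {
  n <- nrow(X)
  Ltilde <- numeric(n)
  for (i in seq_len(n)) {                       # every X_i acts as a shifting point s
    D  <- sweep(X, 2, X[i, ])                   # rows are X_j - X_i
    d  <- rowSums(D^2) - rowSums((D %*% V)^2)   # squared distances d_j(V, X_i) as in (16)
    w  <- exp(-(d / h)^2 / 2)                   # kernel weights K(d_j / h_n)
    w  <- w / sum(w)                            # normalised weights w_j(V, X_i) of (17)
    Ltilde[i] <- sum(w * Y^2) - sum(w * Y)^2    # tilde L_n(V, X_i) of (18)
  }
  mean(Ltilde)                                  # L_n(V) of (19)
}
```

Minimizing this function over V ∈ S(p, q) with q = p − k and taking the orthogonal complement of the minimizer yields the estimate of span{B}.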
In Theorem 3 we state the conditions under which L_n(V) in (19) converges uniformly to its population counterpart in (7). This result will lead to the consistency of our estimator.

Theorem 3. Let ã_n = log(n)/n. Under (A.1), (A.2), (A.3), (A.4), (K.1), (K.2), (H.1), a_n = log(n)/(n h_n^{(p−q)/2}) = o(1), and a_n/h_n^{(p−q)/2} = O(1),

sup_{V ∈ S(p,q)} |L_n(V) − L(V)| → 0 in probability as n → ∞.   (20)
Next we define the estimator we propose for span{B} in (1). Our main theoretical result follows in Theorem 4, which establishes the consistency of our estimator.

Definition.
The sample based
Conditional Variance Estimator B̂_{p−q} is any basis of span{V̂_q}⊥, where V̂_q = argmin_{V ∈ S(p,q)} L_n(V).
Under (A.1), (A.2), (A.3), (A.4), (K.1), (K.2), (H.1), a_n = log(n)/(n h_n^{(p−q)/2}) = o(1), and a_n/h_n^{(p−q)/2} = O(1), span{B̂_k} is a consistent estimator of span{B} in model (1); i.e., ‖P_{B̂_k} − P_B‖ → 0 in probability as n → ∞.

The set of points {x ∈ R^p : ‖x − P_{s+span{V}} x‖² ≤ h_n} represents a slice in R^p about the affine subspace s + span{V}. In the estimation of L(V) two different weighting schemes are used:

(a) Within a slice. The weights are defined in (17) and are used to calculate (18).
(b)
Between slices. Equal weights 1/n are used to calculate (19).

The choice of weights can potentially be influential. In particular, the between-slice weighting scheme can be refined by assigning more weight to slices with more points. This can be realized by altering (19) to

L_n^{(w)}(V) = Σ_{i=1}^n w̃(V, X_i) L̃_n(V, X_i),   with   (21)

w̃(V, X_i) = [Σ_{j=1}^n K(d_j(V, X_i)/h_n) − 1] / [Σ_{l,u=1}^n K(d_l(V, X_u)/h_n) − n] = Σ_{j=1, j≠i}^n K(d_j(V, X_i)/h_n) / Σ_{l,u=1, l≠u}^n K(d_l(V, X_u)/h_n).   (22)

For example, if a rectangular kernel is used, Σ_{j=1, j≠i}^n K(d_j(V, X_i)/h_n) is the number of points X_j (j ≠ i) in the slice corresponding to L̃_n(V, X_i). Therefore this slice gets higher weight if the number of X_j points in it is larger; that is, the more observations we use for estimating L̃(V, X_i), the better its accuracy. The denominator in (22) guarantees that the weights w̃(V, X_i) sum up to one.
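As a small illustration of (22), the between-slice weights can be computed from the n × n matrix of kernel evaluations; the name Kmat below is an assumed placeholder for the matrix with entries K(d_j(V, X_i)/h_n), rows indexed by the shifting point i.

```r
# Between-slice weights (22) from a kernel matrix Kmat with Kmat[i, j] = K(d_j(V, X_i) / h_n).
between_weights <- function(Kmat) {
  num <- rowSums(Kmat) - diag(Kmat)       # sum over j != i for each shifting point i
  num / (sum(Kmat) - sum(diag(Kmat)))     # normalise so that the weights sum to one
}
```

With the Gaussian kernel, diag(Kmat) = K(0) = 1, so this coincides with the first expression in (22).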
The performance of conditional variance estimation depends crucially on the choice of the bandwidth sequence h_n, which controls the bias-variance trade-off if the mean squared error is used as the measure of accuracy: the smaller h_n, the lower the bias and the higher the variance, and vice versa. Furthermore, the choice of h_n depends on p, q, the sample size n, and the distribution of X. We assume throughout that the bandwidth satisfies assumptions (H.1) and (H.2). We will use Lemma 5 to derive the data-driven bandwidth we use in the computation of our estimator.

Lemma 5. Let M be a p × p positive definite matrix. Then,

tr(M)/p = argmin_{s>0} ‖M − s I_p‖².   (23)

Proof.
Let U be the p × p matrix whose columns are the eigenvectors of M corresponding to its eigenvalues λ₁ ≥ . . . ≥ λ_p > 0. Then M = U diag(λ₁, ..., λ_p) Uᵀ, which implies ‖M − s I_p‖² = ‖diag(λ₁, ..., λ_p) − s I_p‖² = Σ_{l=1}^p (λ_l − s)². Taking the derivative with respect to s, setting it to 0 and solving for s obtains (23), since Σ_{l=1}^p λ_l = tr(M).

If the predictors are multivariate normal, their joint density is approximated by that of N(µ_X, σ² I_p) by Lemma 5, with σ² = tr(Σ_x)/p. This removes the dependence of the bandwidth on V and leads to a rule for bandwidth selection, as follows. Under X ∼ N_p(µ_X, σ² I_p), X̃_i = X_i − X_j ∼ N_p(0, 2σ² I_p) for i ≠ j, where we suppress the dependence on j for notational convenience. Since all data points are used as shifting points, d_i(V, X_j) = ‖X_i − X_j‖² − (X_i − X_j)ᵀVVᵀ(X_i − X_j) = ‖X̃_i‖² − X̃_iᵀVVᵀX̃_i. Let

nObs = E(#{i ∈ {1, ..., n} : X̃_i ∈ span_h{V}}) = 1 + (n − 1) P(d₁(V, X) ≤ h) = 1 + (n − 1) P(‖X̃‖² − X̃ᵀVVᵀX̃ ≤ h),   (24)
where span_h{V} = {x ∈ R^p : ‖x − P_{span{V}} x‖² ≤ h} and X̃ = X − X*, with X* an independent copy of X. nObs is the expected number of points in a slice. Given a user specified value for nObs, h is the solution to (24).

Let x ∈ R^p. For any V ∈ S(p, q) in (3), there exists an orthonormal basis U ∈ R^{p×(p−q)} of span{V}⊥ such that x = V r₁ + U r₂, by (5). Then X̃ = V R₁ + U R₂ with R₁ = VᵀX̃ ∼ N(0, 2σ² I_q), R₂ = UᵀX̃ ∼ N(0, 2σ² I_{p−q}), X̃ᵀVVᵀX̃ = ‖R₁‖² and ‖X̃‖² = ‖R₁‖² + ‖R₂‖². Therefore,

P(‖X̃‖² − X̃ᵀVVᵀX̃ ≤ h) = P(‖R₂‖² ≤ h) = χ²_{p−q}(h/(2σ²)),   (25)

where χ²_{p−q} is the cumulative distribution function of a chi-squared random variable with p − q degrees of freedom. Plugging (25) in (24) obtains

nObs = 1 + (n − 1) χ²_{p−q}(h/(2σ²)).   (26)

Solving (26) for h and Lemma 5 yield

h_n(nObs) = χ²_{p−q}^{-1}((nObs − 1)/(n − 1)) · 2 tr(Σ̂_x)/p,   (27)

where Σ̂_x = Σ_i (X_i − X̄)(X_i − X̄)ᵀ/n and X̄ = Σ_i X_i/n.

In order to ascertain that h_n satisfies (H.1) and (H.2), a reasonable choice is to set nObs = γ(n) for a function γ(·) with γ(n) → ∞, γ(n)/n ≤ 1 and γ(n)/n → 0. For example, nObs = γ(n) = n^β with β ∈ (0, 1) can be used. Alternatively, a plug-in bandwidth based on rule-of-thumb rules of the form c s n^{−1/(4+k)}, where s is an estimate of scale and c a number close to 1, such as Silverman's (c = 1.06, s = standard deviation) or Scott's (c = 1, s = standard deviation), used in nonparametric density estimation [see [29]], is

h_n = 1.2 · (2 tr(Σ̂_x)/p) · n^{−1/(4+p−q)}.   (28)

The term 2 tr(Σ̂_X)/p can be interpreted as the variance of X_i − X_j, and p − q is the true dimension k. We use c = 1.2 based on empirical evidence from simulations. Since both (27) and (28) yield satisfactory results, we opted against cross validation for bandwidth selection because of the computational burden involved, and used the bandwidth in (28) in simulations and data analyses.

A Stiefel manifold optimization algorithm is used to obtain the solution of the sample version of the optimization problem (13). To calculate V̂_q in Section 3.2, a curvilinear search is carried out [33, 31], which is similar to gradient descent. First an arbitrary starting value V^(0) is selected by drawing a p × q matrix from the invariant measure, i.e., the distribution that corresponds to the uniform on S(p, q); see [8]. The Q-component of the QR decomposition of a p × q matrix with independent standard normal entries follows the invariant measure [7]. The step size τ > 0, the step size reduction factor γ ∈ (0, 1), and the tolerance tol > 0 are fixed at the outset.

Algorithm 1: Curvilinear search
Result: V^(end)
Initialize: V^(0), τ = 1, tol = 10⁻³, γ = 0.5, error = tol + 1, maxit = 50, count = 0;
while error > tol and count ≤ maxit do
    G = ∇_V L_n(V^(j)) ∈ R^{p×q},  W = G V^(j)ᵀ − V^(j) Gᵀ
    V^(j+1) = (I_p + τ W)^{-1} (I_p − τ W) V^(j)
    error = ‖V^(j) V^(j)ᵀ − V^(j+1) V^(j+1)ᵀ‖ / √q
    if L_n(V^(j+1)) > L_n(V^(j)) then
        V^(j+1) ← V^(j);  τ ← γτ;  error ← tol + 1
    else
        count ← count + 1;  τ ← τ/γ
    end
end
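A minimal R sketch of the core update in Algorithm 1, the Cayley-type curvilinear step, is given below; the helper name cayley_step is illustrative, the gradient G is taken as given (from Theorem 6 or a numerical approximation), and the placeholder values only demonstrate that the update stays on the Stiefel manifold.

```r
# One curvilinear (Cayley) step on the Stiefel manifold S(p, q), as in Algorithm 1.
# V: current p x q point with t(V) %*% V = I_q; G: gradient at V; tau: step size.
cayley_step <- function(V, G, tau) {
  p <- nrow(V)
  W <- G %*% t(V) - V %*% t(G)                           # skew-symmetric p x p matrix
  solve(diag(p) + tau * W, (diag(p) - tau * W) %*% V)    # (I + tau W)^{-1} (I - tau W) V
}

set.seed(1)
p <- 6; q <- 2
V  <- qr.Q(qr(matrix(rnorm(p * q), p, q)))   # random start from the invariant measure
G  <- matrix(rnorm(p * q), p, q)             # placeholder gradient for illustration
V1 <- cayley_step(V, G, tau = 0.1)
max(abs(crossprod(V1) - diag(q)))            # ~ 1e-16: columns remain orthonormal
```

Because W is skew-symmetric, (I_p + τW)^{-1}(I_p − τW) is orthogonal, so the iterates never leave S(p, q) and no re-orthogonalization is needed.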
Under mild regularity conditions on the objective function, [33] showed that the sequence generated by the algorithm converges to a stationary point if the Armijo-Wolfe conditions [27] are used for determining the step size τ. The Armijo-Wolfe conditions require the evaluation of the gradient for each potential step size until one is found that fulfills the conditions and the step is accepted; i.e., for the determination of one step size the gradient has to be evaluated multiple times. Since for the conditional variance estimator the gradient computation incurs the highest computational cost, we use simpler conditions to determine the step size. Specifically, we simply require that the step decrease the objective function; otherwise the step size τ is decreased by the factor γ ∈ (0, 1). These simplified conditions are computationally less expensive and exhibit the same behavior as the Armijo-Wolfe conditions in our simulations. Further, we capped the maximum number of steps at maxit = 50, since the algorithm converged in about 10 iterations in all our simulations.

The algorithm is repeated for m arbitrary starting values V^(0) drawn from the invariant measure on S(p, q). Among those, the value at which L_n in (19) is minimal is selected as V̂_q.

The algorithm requires the computation of the gradient of L_n(V) in (19) or (21). We compute the gradient of the objective function for the Gaussian kernel in Theorems 6 and 7. The Gaussian kernel is the default kernel we use in the implementation of the estimation algorithm in the R code that accompanies this manuscript.

Theorem 6.
Let K(z) = exp(−z²/2) be the Gaussian kernel. Then the gradient of L̃_n(V, s) in (18) is given by

∇_V L̃_n(V, s) = (1/h_n²) Σ_{i=1}^n (L̃_n(V, s) − (Y_i − ȳ₁(V, s))²) w_i d_i ∇_V d_i(V, s) ∈ R^{p×q},

and the gradient of L_n(V) in (19) is

∇_V L_n(V) = (1/n) Σ_{i=1}^n ∇_V L̃_n(V, X_i),

with w_i = w_i(V, s) in (17) and d_i = d_i(V, s) in (16).

The weighted version of conditional variance estimation in Section 3.3 is expected to increase the accuracy of the estimator for unevenly spaced data. When (21) and the gradient in (29) are used in the optimisation algorithm, we refer to the estimator as weighted conditional variance estimation. If (21) and the gradient Σ_{i=1}^n w̃(V, X_i) ∇_V L̃_n(V, X_i) is used, i.e., the first summand in (29) is dropped, we refer to it as partially weighted conditional variance estimation. For both, we replace G in Algorithm 1 with the corresponding gradient derived in Theorem 7.

Theorem 7.
Let K(z) = exp(−z²/2) be the Gaussian kernel. Then the gradient of L_n^{(w)}(V) in (21) is given by

∇_V L_n^{(w)}(V) = Σ_{i=1}^n (∇_V w̃(V, X_i) L̃_n(V, X_i) + w̃(V, X_i) ∇_V L̃_n(V, X_i)),   (29)

where ∇_V L̃_n(V, X_i) is given in Theorem 6. Furthermore,

∇_V w̃(V, X_i) = −(1/h_n²) [ Σ_j (K_{j,i} / Σ_{l,u=1}^n K_{l,u}) d_{j,i} ∇_V d_{j,i} − w̃_i Σ_{l,u=1}^n (K_{l,u} / Σ_{o,s=1}^n K_{o,s}) d_{l,u} ∇_V d_{l,u} ],

with w̃_i = w̃(V, X_i) in (22), K_{j,i} = K(d_j(V, X_i)/h_n), and d_{j,i} = d_j(V, X_i) given in (16).

We explore how accurately the sample version (19) of the objective function estimates the target subspace in an example. We consider a bivariate normal predictor vector X = (X₁, X₂)ᵀ ∼ N(0, Σ_x). We generate the response from Y = g(BᵀX) + ε = X₁ + ε, with ε ∼ N(0, η²) independent of X. In this setting, k = 1, B = (1, 0)ᵀ and g(z) = z ∈ R in model (1). With these specifications, (10) becomes

µ_l(V, s) = ∫_R (Bᵀs + BᵀV r)^l f_{X | X ∈ s+span{V}}(r) dr.   (30)
Dropping the terms that do not contain r in (8) yields

f_{X | X ∈ s+span{V}}(r) ∝ f_X(s + V r) ∝ exp(−(1/2)(s + rV)ᵀ Σ_x^{-1}(s + rV)) ∝ exp(−(1/2)(2r VᵀΣ_x^{-1}s + r² VᵀΣ_x^{-1}V)) = exp(−(1/(2σ²))(2r σ² VᵀΣ_x^{-1}s + r²)) ∝ exp(−(1/(2σ²))(r − α)²),   (31)

where σ² = 1/(VᵀΣ_x^{-1}V), α = −σ² VᵀΣ_x^{-1}s, and the symbol ∝ stands for proportional to. Letting ψ(z) denote the density of a standard normal variable, (31) obtains

f_{X | X ∈ s+span{V}}(r) = (1/σ) ψ((r − α)/σ)   (32)

for V, s ∈ R^{2×1}. Inserting (32) in (30) yields

∫_R (Bᵀs + BᵀV r)^l (1/σ) ψ((r − α)/σ) dr = Bᵀs + BᵀV α for l = 1, and (Bᵀs)² + 2(Bᵀs)(BᵀV)α + (BᵀV)²(σ² + α²) for l = 2.

Using (9), (6) and (7) yields L̃(V, s) = µ₂(V, s) − µ₁(V, s)² + η² = (BᵀV)²σ² + η², so that

L(V) = E(L̃(V, X)) = (BᵀV)²σ² + η² = (BᵀV)²/(VᵀΣ_x^{-1}V) + η².   (33)

From (33) we can easily see that L(V) attains its minimum at V ⊥ B. Also, if Σ_x = I, the maximum of L(V) is attained at V = B. To visualize the behavior of L_n(V) as the sample size increases, we parametrize V by V(θ) = (cos(θ), sin(θ))ᵀ, θ ∈ [0, π]. Since B = (1, 0)ᵀ, the minimum of L(V) is at V(π/2) = (0, 1)ᵀ, which is orthogonal to B.

The true L(V(θ)) and its estimates L_n(V(θ)) are plotted for samples of different sizes n in Figure 1. L_n(V(θ)) approximates L(V(θ)) fast and attains its minimum at the same value as L(V) even for n = 10. As an aside, we note that assumption (A.4) is violated in this example, which suggests that the proposed conditional variance estimator may apply under weaker assumptions.

Figure 1: L_n(V(θ)) versus L(V(θ)), θ ∈ [0, π], for several sample sizes n. The solid black line is L(V(θ)) = cos²(θ) + 0.25, the colored lines are L_n(V(θ)) for n = 10, 50, 100, 500, and the vertical black line is at θ = π/2.
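Curves like those in Figure 1 can be approximated with a few lines of R, reusing the cve_objective sketch from the previous section; the seed, grid, and rule-(28)-style bandwidth below are illustrative choices.

```r
# Toy model of this section: p = 2, B = (1, 0)^T, Y = X_1 + eps with Var(eps) = 0.25.
set.seed(1)
n <- 100
X <- matrix(rnorm(2 * n), n, 2)                               # Sigma_x = I_2
Y <- X[, 1] + rnorm(n, sd = 0.5)
h <- 1.2 * 2 * sum(diag(cov(X))) / ncol(X) * n^(-1 / 5)       # bandwidth in the spirit of (28), k = 1

theta <- seq(0, pi, length.out = 100)
Ln <- sapply(theta, function(t) cve_objective(matrix(c(cos(t), sin(t)), 2, 1), X, Y, h))
plot(theta, Ln, type = "l", xlab = "theta", ylab = "Ln(V(theta))")
lines(theta, cos(theta)^2 + 0.25, lty = 2)                    # population objective L(V(theta)) from (33)
abline(v = pi / 2, col = "gray")                              # minimiser: V orthogonal to B
```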
We compare the estimation accuracy of conditional variance estimation with the forward model based sufficient dimension reduction methods mean outer product gradient estimation (meanOPG), mean minimum average variance estimation (meanMAVE) [32], refined outer product gradient (rOPG), refined minimum average variance estimation (rmave) [35, 22], and principal Hessian directions (pHd) [24, 15], and with the inverse regression based methods sliced inverse regression (
SIR ) [23] and sliced average variance estimation (
SAVE) [14]. The dimension k is assumed to be known throughout.

We report results for conditional variance estimation using the "plug-in" bandwidth in (28) and three different conditional variance estimation versions, CVE, wCVE, and rCVE. CVE is obtained by using m = 10 arbitrary starting values in the optimization algorithm and optimizing (19) as described in Section 4. rCVE, or refined weighted CVE, is obtained by setting the starting value V^(0) to the optimizer of CVE and using (21) in the optimization algorithm of Section 4 with the partially weighted gradient as described in Section 3.3. wCVE, or weighted CVE, is obtained by optimizing (21) with the partially weighted gradient as described in Sections 3.3 and 4. Methods rOPG and rmave refer to the original refined outer product gradient and refined minimum average variance estimation algorithms published in [35]. They are implemented using the R code in [22] with number of iterations nit = 25, since the algorithm is seen to converge by 25. The dr package is used for the SIR, SAVE and pHd calculations, and the
MAVE package for mean outer product gradient estimation (meanOPG) and mean minimum average variance estimation (meanMAVE). The source code for conditional variance estimation can be downloaded from https://git.art-ist.cc/daniel/CVE.

Table 1 lists the seven models (M1-M7) we consider. Throughout we set p = 20, with b₁ = (1, 1, 1, 1, 1, 1, 0, ..., 0)ᵀ/√6 and b₂ = (1, −1, 1, −1, 1, −1, 0, ..., 0)ᵀ/√6 ∈ R^p for M1-M5. For M6, b₁ = e₁, b₂ = e₂ and b₃ = e_p, and for M7, b₁, b₂, b₃ are the same as in M6 and b₄ = e₃, where e_j denotes the p-vector with jth element equal to 1 and all others 0. The error term ε is independent of X in all models. In M2, M3, M4, M5 and M6, ε ∼ N(0, 1). For M1 and M7, ε has a generalized normal distribution GN(a, b, c) with density f_ε(z) = c/(2bΓ(1/c)) exp(−(|z − a|/b)^c), see [26], with location 0 and shape parameter 0.5 for M1, and shape parameter 1 for M7 (Laplace distribution). For both, the scale parameter is chosen such that Var(ε) = 0.25.

Table 1: Models

  M1: Y = cos(b₁ᵀX) + ε                          X ∼ N_p(0, Σ)             ε ∼ GN, shape 0.5 (see text)   k = 1
  M2: Y = cos(b₁ᵀX) + 0.5ε                       X ∼ λZ1_p + N_p(0, I_p)   ε ∼ N(0, 1)                    k = 1
  M3: Y = 2 log(|b₁ᵀX| + 2) + 0.5ε               X ∼ N_p(0, I_p)           ε ∼ N(0, 1)                    k = 1
  M4: Y = (b₁ᵀX)/(0.5 + (1.5 + b₂ᵀX)²) + 0.5ε    X ∼ N_p(0, Σ)             ε ∼ N(0, 1)                    k = 2
  M5: Y = cos(πb₁ᵀX)(b₂ᵀX + 1)² + 0.5ε           X ∼ U([0, 1]^p)           ε ∼ N(0, 1)                    k = 2
  M6: Y = (b₁ᵀX)² + (b₂ᵀX)² + (b₃ᵀX)² + 0.5ε     X ∼ N_p(0, I_p)           ε ∼ N(0, 1)                    k = 3
  M7: Y = (b₁ᵀX)(b₂ᵀX)² + (b₃ᵀX)(b₄ᵀX) + ε       X ∼ t₃(I_p)               ε ∼ GN, shape 1 (see text)     k = 4

The variance-covariance structure of X in models M1 and M4 satisfies Σ_{i,j} = 0.5^{|i−j|} for i, j = 1, . . . , p. In M5, X is uniform with independent entries on the p-dimensional hypercube. In M7, X is multivariate t-distributed with 3 degrees of freedom. The link functions of M4 and M7 are studied in [35], but we use p = 20 instead of 10 and a non-identity covariance structure for M4, and the t-distribution instead of the normal for M7. In M2, Z ∼ 2 Bernoulli(p_mix) − 1 ∈ {−1, 1}, 1_p = (1, 1, ..., 1)ᵀ ∈ R^p, the mixing probability p_mix ∈ [0, 1], and the dispersion parameter λ > 0. For 0 < p_mix < 1, X has a mixture normal distribution, where p_mix is the relative mode height and λ is a measure of the distance between the modes.

We set q = p − k and generate r = 100 replications of models M1-M7. We estimate B using the ten sufficient dimension reduction methods. The accuracy of the estimates is assessed using err = ‖P_B − P_B̂‖/√(2k), which lies in the interval [0, 1]. The factor √(2k) normalizes the distance, since ‖P_B − P_B̂‖² ≤ 2k, with values closer to zero indicating better agreement and values closer to one indicating strong disagreement.

In Table 2 the mean and standard deviation of err for M1-M7 are reported; in particular, for M2, p_mix = 0.3 and λ = 1. The smallest error values are boldfaced. In models M1, M2 and M3, the conditional variance estimator is the best performer, with its refined version a close second. In M4, M5 and M6, any of the four versions of MAVE performs better than CVE. For model M7 the results of rOPG and rmave are not reported because the code frequently produces an error message that a matrix is not invertible. Among the rest, the weighted version of CVE, wCVE, attains the minimum error.
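For reference, the subspace distance err used throughout this section can be evaluated directly; the helper names below are illustrative.

```r
# err = ||P_B - P_Bhat||_F / sqrt(2k) for two p x k basis matrices B and Bhat.
proj <- function(M) M %*% solve(crossprod(M), t(M))   # projection onto span(M)
subspace_err <- function(B, Bhat) {
  k <- ncol(B)
  norm(proj(B) - proj(Bhat), type = "F") / sqrt(2 * k)
}
```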
Sliced inverse regression (
SIR ) and sliced average variance estimation (
SAVE) are not competitive throughout our experiments. Sliced inverse regression (
SIR), in particular, is expected to fail in models M1-M3 and M6, since E(Y | X) is an even function.

In Figure 2, box-plots of err for all combinations of p_mix ∈ {0.3, 0.4, 0.5} and λ ∈ {0, 0.5, 1, 1.5} are presented. The reference methods are restricted to meanOPG and meanMAVE, since the others are not competitive. Conditional variance estimation performs better than all competing methods and is the only method with consistently smaller errors when the two modes are further apart (λ ≥ 1), regardless of the mixing probability p_mix. The performance of both meanOPG and meanMAVE worsens as one moves from left to right row-wise. The mixing probability p_mix has no noticeable effect on the performance of any method; i.e., the plots are very similar column-wise. In sum, meanMAVE's performance deteriorates as the bimodality of the predictor distribution becomes more distinct. In contrast, conditional variance estimation is unaffected and appears to have an advantage over meanMAVE when the predictors have mixture distributions, the link function is even about the midpoint of the two modes, and B is not orthogonal to the line connecting the two modes. Conditional variance estimation is the only method that estimates the mean subspace reliably in model M2, whereas meanMAVE misses it completely (err ≈ 1). These results indicate that conditional variance estimation is often approximately on par with, and can perform much better than, meanMAVE depending on the predictor distribution and the link function.

Figure 2: M2, p = 20, n = 100.

Table 2: Mean and standard deviation of estimation errors (rows: M1-M7; columns: CVE, wCVE, rCVE, meanOPG, rOPG, meanMAVE, rmave, pHd, SIR, SAVE).

Furthermore, we estimate the dimension k via cross-validation, following the approach in [35], with

k̂ = argmin_{l=1,...,p} CV(l) = argmin_{l=1,...,p} (1/n) Σ_i (Y_i − ĝ_{−i}(B̂_lᵀX_i))²,   (34)

where ĝ_{−i}(·) is computed from the data (Y_j, B̂_lᵀX_j), j = 1, ..., n, j ≠ i, using multivariate adaptive regression splines [17] in the R package mda, and B̂_l = V̂_{p−l}⊥ is any basis of the orthogonal complement of V̂_{p−l} = argmin_{V ∈ S(p,p−l)} L_n(V). For a given l, we calculate B̂_l from the whole data set and predict Y_i by Ŷ_{i,l} = ĝ_{−i}(B̂_lᵀX_i). For l = p, B̂_p = I_p. The results for the seven models are reported in Table 3. The CVE based dimension estimation is the most accurate in models M1, M2, M3, and M6 and differs slightly from that of MAVE in M7. MAVE performs better in M4 and M5, completely misses the true dimension in M2 and misses it most of the time in M3. Thus, the dimension estimation performance of CVE and MAVE agrees with the estimation accuracy of the true subspace in Table 2; CVE estimates the dimension more accurately even in model M6, where it exhibits worse subspace estimation performance, and overall appears to be more accurate.

Table 3: Number of times dimension k is correctly estimated in 100 replications

        M1  M2  M3  M4  M5  M6  M7
CVE     83  41  88  62  46  74  19
MAVE    67   0  14  76  60  57  21

We carried out many simulation experiments for an array of combinations of link functions, sufficient reduction matrices B and their ranks, as well as predictor and error distributions. All reported and unreported results indicate that the
difference in performance of the two methods, CVE and meanMAVE, can be attributed to both the form of the link function and the marginal predictor distribution. We observed that when the link function had a bounded first order derivative, CVE often outperformed meanMAVE across predictor distributions. In the opposite case, MAVE mostly performed better. Also, when the predictors have a bimodal distribution with well separated modes and the link function is even, regardless of whether its derivative is bounded, CVE outperforms meanMAVE. In the other settings for the generated data, both methods were roughly on par.

Three data sets are analyzed: the
Hitters data in the R package ISLR , which was also analyzed by [35], the
BostonHousing data in the R package mlbench , and the Concrete data from the
MAVE package. The reference method is meanMAVE from the
MAVE package in R and the CVE is calculated using m = 50 and maxit = 10 in the optimizationalgorithm 1 in Section 4. The estimation of the dimension is based on (34) in Section 5.Following [35], we remove 7 outliers from the Hitters data set leading to a sample size of 256. The response is Y = log( salary ) and the 16 continuous predictors are the game statistics of players in the Major League Baseballleague in the seasons 1986 and 1987. Further information can be found in .The Boston Housing data set contains 506 census tracts on 14 variables from the 1970 census. The response is medv ,the median value of owner-occupied homes in USD 1000’s. The factor variable chas is removed from the data setfor the analysis so that the response is modeled by the remaining 12 continous predictors. The description of thevariables can be found in .The
Concrete data set contains 1030 instances on 9 continuous variables. The response is concrete compressive strength. Concrete strength is very important in civil engineering and is a highly nonlinear function of age and ingredients. The
description of the variables can be found in .

For all three data sets we standardize both the predictors and the response by subtracting the mean and rescaling column-wise so that each variable has unit variance. The data sets are analyzed using 10-fold cross-validation to calculate an unbiased estimate of the prediction error [30] for our method,
CVE , and its main competitor meanMAVE using the
MAVE package. The dimension for each method is estimated with (34) on the training set; we then fit a forward regression model on the training set, replacing the original with the reduced predictors, using multivariate adaptive regression splines [17] from the R package mda, and calculate the prediction error on the test set for both methods. The dimension estimates of CVE and
MAVE mostly disagree.

The mean and standard deviation of the 10-fold cross-validation prediction errors are reported in Table 4. Since the response is standardized, the values in Table 4 are bounded between 0 and 1, with smaller values indicating better predictive performance.
CVE performs slightly worse than mean
MAVE in the
Hitters data set, slightly better in the
Boston Housing and better in the
Concrete data set analysis.

Table 4: Mean and standard deviation (in parentheses) of standardized out-of-sample prediction errors for the three data sets

Method   Hitters         Housing         Concrete
CVE      0.216 (0.101)   0.260 (0.331)   0.361 (0.206)
MAVE     0.203 (0.083)   0.299 (0.382)   0.417 (0.348)
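A hypothetical R sketch of the cross-validation protocol described above follows. The function mars from the mda package is the forward model used in the paper; estimate_B is a stand-in for any SDR estimator returning a p × k basis (e.g., the CVE of Section 3) and is not a function from an existing package.

```r
library(mda)   # for mars() and its predict method

# 10-fold CV prediction error for a dimension-reduction estimator estimate_B(X, Y, k).
cv_prediction_error <- function(X, Y, k, estimate_B, folds = 10) {
  X <- scale(X); Y <- scale(Y)                              # standardise predictors and response
  id <- sample(rep(seq_len(folds), length.out = nrow(X)))   # random fold assignment
  errs <- numeric(folds)
  for (f in seq_len(folds)) {
    tr  <- id != f
    B   <- estimate_B(X[tr, , drop = FALSE], Y[tr], k)      # p x k reduction from training data only
    fit <- mars(X[tr, , drop = FALSE] %*% B, Y[tr])          # forward model on reduced predictors
    pred <- predict(fit, X[!tr, , drop = FALSE] %*% B)
    errs[f] <- mean((Y[!tr] - pred)^2)                       # out-of-sample squared error
  }
  c(mean = mean(errs), sd = sd(errs))
}
```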
Additionally, we reconstruct the analysis of the
Hitters data in [35], which does not account for the out-of-sample prediction error as in Section 6 but uses the whole sample for the estimation of B and its rank. Only the dimension k is estimated with leave-one-out cross validation.

Table 5 reports the average cross-validation mean squared error CV(k) in (34) using the whole data set over k = 1, . . . , p. Both conditional variance estimation and mean minimum average variance estimation estimate the dimension to be 2.

Table 5: Mean cross-validation error CV(k) over k (Hitters data).

For conditional variance estimation, the fitted forward regression model is

Ŷ = 0.… + …(b̂ᵀX) − …(b̂ᵀX) + 0.…(b̂ᵀX),   (35)

with R² = 0.…, and for minimum average variance estimation

Ŷ = 0.… + …(b̂ᵀX) − …(b̂ᵀX) + 0.…(b̂ᵀX),   (36)

with R² = 0.…. Both models (35) and (36) have about the same fit as measured by R². The in-sample performance of the two methods is practically the same for the Hitters data.

Figure 3: Y against b̂₁ᵀX and b̂₂ᵀX.
In this paper the novel conditional variance estimator (CVE) for the mean subspace is introduced. We present its geometrical and theoretical foundation, show its consistency and propose an estimation algorithm with assured
convergence. CVE requires that the forward model (1), Y = g(BᵀX) + ε, holds, and weak assumptions on the response and the covariates.

Minimum average variance estimation (MAVE) [35] is the only other sufficient dimension reduction method based on the forward model (1). It estimates the sufficient dimension reduction targeting both the reduction and the link function g in (1). CVE targets only the reduction and does not require estimation of the link function, which may explain why it has an advantage over MAVE in some regression settings. For example, CVE exhibits similar performance across different link functions (cos, exp, etc.) for fixed λ, whereas the performance of MAVE is very uneven for model M2 in Section 5. CVE is more accurate than MAVE when the link function is even and the predictor distribution is bimodal throughout our simulation studies. Moreover, CVE does not require the inversion of the predictor covariance matrix and can be applied to regressions with p ≈ n or p > n.

The theoretical challenge in deriving the statistical properties of conditional variance estimation arises from the novelty of its definition, which involves random non-i.i.d. weights that depend on the parameter to be estimated.

References

[1] Kofi P. Adragni and R. Dennis Cook. Sufficient dimension reduction and prediction in regression. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 367(1906):4385–4405, 2009.
[2] Takeshi Amemiya. Advanced Econometrics. Harvard University Press, 1985.
[3] S. N. Bernstein. Theory of Probability. Moscow, 1927.
[4] W. M. Boothby. An Introduction to Differentiable Manifolds and Riemannian Geometry. Academic Press, 2002.
[5] Efstathia Bura, Sabrina Duarte, and Liliana Forzani. Sufficient reductions in regressions with exponential family inverse predictors. Journal of the American Statistical Association, 111(515):1313–1329, 2016.
[6] Efstathia Bura and Liliana Forzani. Sufficient reductions in regressions with elliptically contoured inverse predictors. Journal of the American Statistical Association, 110(509):420–434, 2015.
[7] Yasuko Chikuse. Invariant measures on Stiefel manifolds with applications to multivariate analysis. Volume 24 of Lecture Notes–Monograph Series, pages 177–193. Institute of Mathematical Statistics, Hayward, CA, 1994.
[8] Yasuko Chikuse. Statistics on Special Manifolds. Springer-Verlag New York, New York, 2003.
[9] R. Dennis Cook. Regression Graphics: Ideas for Studying Regressions Through Graphics. Wiley, New York, 1998.
[10] R. D. Cook and L. Forzani. Principal fitted components for dimension reduction in regression. Statistical Science, 23(4):485–501, 2008.
[11] R. Dennis Cook. Fisher lecture: Dimension reduction in regression. Statistical Science, 22(1):1–26, 2007.
[12] R. Dennis Cook and Liliana Forzani. Likelihood-based sufficient dimension reduction. Journal of the American Statistical Association, 104(485):197–208, 2009.
[13] R. Dennis Cook and Bing Li. Determining the dimension of iterative Hessian transformation. The Annals of Statistics, 32(6):2501–2531, 2004.
[14] R. Dennis Cook and Sanford Weisberg. Sliced inverse regression for dimension reduction: Comment. Journal of the American Statistical Association, 86(414):328–332, 1991.
[15] R. Dennis Cook and Bing Li. Dimension reduction for conditional mean in regression. The Annals of Statistics, 30(2):455–474, 2002.
[16] Arnold M. Faden. The existence of regular conditional probabilities: Necessary and sufficient conditions. The Annals of Probability, 13(1):288–298, 1985.
[17] Jerome H. Friedman. Multivariate adaptive regression splines. The Annals of Statistics, 19(1):1–67, 1991.
[18] Bruce E. Hansen. Uniform convergence rates for kernel estimation with dependent data. Econometric Theory, 24:726–748, 2008.
[19] H. Heuser. Analysis 2, 9. Auflage. Teubner, 1995.
[20] Alan F. Karr. Probability. Springer Texts in Statistics. Springer-Verlag New York, 1993.
[21] D. Leao Jr., M. Fragoso, and P. Ruffino. Regular conditional probability, disintegration of probability and Radon spaces. Proyecciones (Antofagasta), 23:15–29, 2004.
[22] Bing Li. Sufficient Dimension Reduction: Methods and Applications with R. CRC Press, Taylor & Francis Group, 2018.
[23] K. C. Li. Sliced inverse regression for dimension reduction. Journal of the American Statistical Association, 86(414):316–327, 1991.
[24] Ker-Chau Li. On principal Hessian directions for data visualization and dimension reduction: Another application of Stein's lemma. Journal of the American Statistical Association, 87(420):1025–1039, 1992.
[25] Yanyuan Ma and Liping Zhu. A review on dimension reduction. International Statistical Review, 81(1):134–150, 2013.
[26] Saralees Nadarajah. A generalized normal distribution. Journal of Applied Statistics, 32(7):685–694, 2005.
[27] J. Nocedal and S. Wright. Line Search Methods, pages 30–65. Springer New York, New York, NY, 2006.
[28] E. Parzen. On estimation of a probability density function and mode. The Annals of Mathematical Statistics, 33(3):1065–1076, 1962.
[29] B. W. Silverman. Density Estimation for Statistics and Data Analysis. Chapman & Hall, London, 1986.
[30] M. Stone. Cross-validatory choice and assessment of statistical predictions. Journal of the Royal Statistical Society: Series B (Methodological), 36(2):111–133, 1974.
[31] Hemant D. Tagare. Notes on optimization on Stiefel manifolds. January 2011.
[32] Hang Weiqiang and Xia Yingcun. MAVE: Methods for Dimension Reduction, 2019. R package version 1.3.10.
[33] Zaiwen Wen and Wotao Yin. A feasible method for optimization with orthogonality constraints. Mathematical Programming, 142:397–434, 2013.
[34] Yingcun Xia. A multiple-index model and dimension reduction. Journal of the American Statistical Association, 103(484):1631–1640, 2008.
[35] Yingcun Xia, Howell Tong, W. K. Li, and Li-Xing Zhu. An adaptive estimation of dimension reduction space. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 64(3):363–410, 2002.
[36] Xiangrong Yin, Bing Li, and R. Dennis Cook. Successive direction extraction for estimating the central subspace in a multiple-index regression. Journal of Multivariate Analysis, 99:1733–1757, 2008.
Appendix

Justification for (8): Theorem 3.1 of [21] and the fact that (R^p, B(R^p)), where B(R^p) denotes the Borel sets on R^p, is a Polish space guarantee the existence of the regular conditional probability of X | X ∈ s + span{V} [see also [16]]. Further, the measure is concentrated on the affine subspace s + span{V} ⊂ R^p and is given by (8), by Definition 8.38 and Theorem 8.39 of [20] and the orthogonal decomposition (5).

Proof of (9): Since X and ε in (1) are assumed to be independent, Var(Y | X ∈ s + span{V}) = Var(g(BᵀX) | X ∈ s + span{V}) + Var(ε). Using (8) and Var(Y | Z) = E(Y² | Z) − E(Y | Z)², we obtain (9).

We let g̃(V, s, r₁) = g(Bᵀs + BᵀV r₁)^l f_X(s + V r₁). The parameter integral (11) is well defined and continuous if (1) g̃(V, s, ·) is integrable for all V ∈ S(p, q), s ∈ supp(f_X), (2) g̃(·, ·, r₁) is continuous for all r₁, and (3) there exists an integrable dominating function of g̃ that does not depend on V and s [see [19, p. 101]]. Furthermore, t^{(l)}(V, s) = ∫_K g̃(V, s, r₁) dr₁ for some compact set K, since supp(f_X) is compact due to (A.4). The function g̃(V, s, r₁) is continuous in all inputs by the continuity of g and f_X by (A.2), and therefore it attains a maximum. In consequence, all three conditions are satisfied, so that t^{(l)}(V, s) is well defined and continuous. Next, µ_l(V, s) = t^{(l)}(V, s)/t^{(0)}(V, s) is continuous since t^{(0)}(V, s) > 0 for all s ∈ supp(f_X) by the continuity of f_X and Σ_x > 0. Then L̃(V, s) in (9) is continuous, which results in L(V) also being well defined and continuous by virtue of it being a parameter integral, following the same arguments as above.

Next we establish the consistency of the conditional variance estimator. The uniform convergence in probability of the sample objective function in (19) is a sufficient condition for the consistency of V̂_q = argmin_{V ∈ S(p,q)} L_n(V), as uniform convergence in probability of a random function implies convergence in probability of the minimizer of L_n(V) to the minimizer of the limit function. Let

t_n^{(l)}(V, s) = (1/(n h_n^{(p−q)/2})) Σ_{i=1}^n K(d_i(V, s)/h_n) Y_i^l   (37)

be the sample version of (11) for l = 0, 1, 2. The summands of L̃_n in (18) can be expressed as

ȳ_l(V, s) = t_n^{(l)}(V, s)/t_n^{(0)}(V, s).   (38)

Before we start with the proof, a few auxiliary lemmas are shown.

Lemma 8.
Assume (A.4) and (K.1) hold. Let Z_n(V, s) = (Σ_i g(X_i)^l K(d_i(V, s)/h_n))/(n h_n^{(p−q)/2}) for a continuous function g. Then,

E(Z_n(V, s)) = ∫_{supp(f_X) ∩ R^{p−q}} K(‖r₂‖²) ∫_{supp(f_X) ∩ R^q} g̃(r₁, h_n^{1/2} r₂) dr₁ dr₂,

where g̃(r₁, r₂) = g(s + V r₁ + U r₂)^l f_X(s + V r₁ + U r₂) and x = s + V r₁ + U r₂ as in (5).

Proof of Lemma 8.
By (5), ‖P_U(x − s)‖² = ‖U r₂‖² = ‖r₂‖². Further,

E(Z_n(V, s)) = (1/h_n^{(p−q)/2}) ∫_{supp(f_X)} g(x)^l K(‖P_U(x − s)‖²/h_n) f_X(x) dx
            = (1/h_n^{(p−q)/2}) ∫_{supp(f_X) ∩ R^{p−q}} ∫_{supp(f_X) ∩ R^q} g(s + V r₁ + U r₂)^l K(‖r₂‖²/h_n) f_X(s + V r₁ + U r₂) dr₁ dr₂
            = ∫_{supp(f_X) ∩ R^{p−q}} K(‖r₂‖²) ∫_{supp(f_X) ∩ R^q} g(s + V r₁ + h_n^{1/2} U r₂)^l f_X(s + V r₁ + h_n^{1/2} U r₂) dr₁ dr₂,

where the substitution r̃₂ = r₂/h_n^{1/2}, dr₂ = h_n^{(p−q)/2} dr̃₂, was used to obtain the last equality.
Lemma 9.
Assume (A.1), (A.2), (A.3), (A.4), (H.1) and (K.1) hold. For all δ > 0 there exist an n* and finite constants b̃_{u,m}, u ∈ {0, 1, 2, 3, 4} and m ∈ {1, 2}, such that

(b̃_{2l,2} − δ)/(n h_n^{(p−q)/2}) − ((b̃_{l,1})² + δ)/n ≤ Var(t_n^{(l)}(V, s)) ≤ (b̃_{2l,2} + δ)/(n h_n^{(p−q)/2}) − ((b̃_{l,1})² − δ)/n

for n > n* and t_n^{(l)}(V, s), l = 0, 1, 2, in (37).

Proof of Lemma 9.
From (1) and the binomial formula, Y_i^l = (g_i + ε_i)^l = Σ_{u=0}^l (l choose u) g_i^{l−u} ε_i^u with g_i = g(BᵀX_i). For m ∈ {1, 2} and l ∈ {0, 1, ..., 4}, using the independence of X_i from ε_i, we obtain

E(Y_i^l K^m(d_i(V, s)/h_n)) = Σ_{u=0}^l (l choose u) E(g_i^{l−u} K^m(d_i(V, s)/h_n)) E(ε_i^u).   (39)

Setting Z_n(V, s) = (1/(n h_n^{(p−q)/2})) Σ_i g(X_i)^{l−u} K̃(d_i(V, s)/h_n) in Lemma 8, where K̃(z) = K^m(z) fulfills (K.1) for m = 1, 2, we obtain E(g_i^{l−u} K^m(d_i(V, s)/h_n)) = h_n^{(p−q)/2} E(Z_n); that is, if a kernel satisfies (K.1), its square also satisfies (K.1). Since the integrals are over compact sets by (A.4), by the dominated convergence theorem and Lemma 8 it holds that

E(Z_n) = ∫_{supp(f_X) ∩ R^{p−q}} K̃(‖r₂‖²) ∫_{supp(f_X) ∩ R^q} g̃(r₁, h_n^{1/2} r₂) dr₁ dr₂ = b_{l−u,m}^n   (40)
       → b_{l−u,m} = ∫_{supp(f_X) ∩ R^{p−q}} K̃(‖r₂‖²) dr₂ ∫_{supp(f_X) ∩ R^q} g̃(r₁, 0) dr₁   as n → ∞,   (41)

using that g̃(r₁, r₂) = g(Bᵀs + BᵀV r₁ + BᵀU r₂)^{l−u} f_X(s + V r₁ + U r₂) is continuous by (A.2) and h_n → 0 by assumption (H.1).

Assumption (A.3) implies E(ε_i⁴) < ∞ for i = 1, . . . , n. From (40) and (39), we obtain

E(Y_i^l K^m(d_i(V, s)/h_n))/h_n^{(p−q)/2} = Σ_{u=0}^l (l choose u) b_{l−u,m}^n E(ε_i^u) = b̃_{l,m}^n   (42)
       → Σ_{u=0}^l (l choose u) b_{l−u,m} E(ε_i^u) = b̃_{l,m} < ∞   as n → ∞.

By (42), we have for l = 0, 1, 2

Var(t_n^{(l)}(V, s)) = (1/(n h_n^{p−q})) Var(Y^l K(d₁(V, s)/h_n)) = b̃_{2l,2}^n/(n h_n^{(p−q)/2}) − (b̃_{l,1}^n)²/n,

since (Y_i, X_iᵀ), i = 1, ..., n, are independent draws from the joint distribution of (Y, X). This completes the proof, since b̃_{u,m}^n → b̃_{u,m} < ∞ for u ∈ {0, 1, ..., 4} and m ∈ {1, 2}.

Next we show in Lemma 10 that d_i(V, s) in (16) is Lipschitz in its inputs under assumption (A.4).

Lemma 10.
Next we show in Lemma 10 that $d_i(V,s)$ in (16) is Lipschitz in its inputs under assumption (A.4).

Lemma 10. Under assumption (A.4) there exists a constant $0 < C < \infty$ such that for all $\delta > 0$ and $V, V_j \in \mathcal{S}(p,q)$ with $\|P_V - P_{V_j}\| < \delta$, and for all $s, s_j \in \operatorname{supp}(f_X) \subset \mathbb{R}^p$ with $\|s - s_j\| < \delta$,
\[
|d_i(V,s) - d_i(V_j,s_j)| \le C\delta
\]
for $d_i(V,s)$ given by (16).

Proof of Lemma 10.
\[
|d_i(V,s) - d_i(V_j,s_j)| \le \big|\,\|X_i - s\|^2 - \|X_i - s_j\|^2\big| + \big|\langle X_i - s, P_V(X_i - s)\rangle - \langle X_i - s_j, P_{V_j}(X_i - s_j)\rangle\big| = I_1 + I_2 \tag{43}
\]
where $\langle\cdot,\cdot\rangle$ is the scalar product on $\mathbb{R}^p$. For the first term on the right hand side of (43),
\[
I_1 = \big|\,\|X_i - s\|^2 - \|X_i - s_j\|^2\big| \le 2|\langle X_i, s - s_j\rangle| + \big|\,\|s\|^2 - \|s_j\|^2\big| \le 2\|X_i\|\|s - s_j\| + 2C_1\|s - s_j\| \le 2C_1\delta + 2C_1\delta = 4C_1\delta
\]
by Cauchy--Schwarz and the reverse triangle inequality (i.e. $\big|\,\|s\|^2 - \|s_j\|^2\big| = \big|\,\|s\| - \|s_j\|\big|\,(\|s\| + \|s_j\|) \le \|s - s_j\|\, 2C_1$), and $\|X_i\| \le \sup_{z\in\operatorname{supp}(f_X)}\|z\| = C_1 < \infty$ with probability 1 due to (A.4). The second term in (43) satisfies
\begin{align*}
I_2 &\le \big|\langle X_i, (P_V - P_{V_j})X_i\rangle\big| + 2\big|\langle X_i, P_V s - P_{V_j}s_j\rangle\big| + \big|\langle s, P_V s\rangle - \langle s_j, P_{V_j}s_j\rangle\big| \\
&\le \|X_i\|^2\|P_V - P_{V_j}\| + 2\|X_i\|\,\big\|P_V(s - s_j) + (P_V - P_{V_j})s_j\big\| + |\langle s - s_j, P_V s\rangle| + \big|\langle s_j, P_V s - P_{V_j}s_j\rangle\big| \\
&\le C_1^2\delta + 2C_1(\delta + C_1\delta) + C_1\delta + C_1(\delta + C_1\delta) = 4C_1^2\delta + 4C_1\delta.
\end{align*}
Collecting all constants into $C$ (i.e. $C = 8C_1 + 4C_1^2$) yields the result.

The proofs of Theorems 3 and 11 require the Bernstein inequality [3]: Let $Z_1, Z_2, \ldots$ be an independent sequence of bounded random variables, $|Z_i| \le b$. Let $S_n = \sum_{i=1}^n Z_i$, $E_n = E(S_n)$ and $V_n = \operatorname{Var}(S_n)$. Then,
\[
P(|S_n - E_n| > t) < 2\exp\left(\frac{-t^2/2}{V_n + bt/3}\right). \tag{44}
\]
Furthermore, the proof of Theorem 11 requires assumption (K.2), which obtains
\[
|K(u) - K(u')| \le K^*(u')\,\delta \tag{45}
\]
for all $u, u'$ with $|u - u'| < \delta \le L$, where $K^*(\cdot)$ is a bounded and integrable kernel function [see [18]]. Specifically, if condition (1) of (K.2) holds, then $K^*(u) = L\,1_{\{|u|\le L\}}$. If (2) holds, then $K^*(u) = L\,1_{\{|u|\le L\}} + 1_{\{|u| > L\}}|u - L|^{-\nu}$.
Let $A = \mathcal{S}(p,q)\times\operatorname{supp}(f_X)$ and, by a slight abuse of notation, we generically denote constants by $C$. In Theorems 11 and 12 we show that the variance and bias terms of (37) vanish uniformly in probability, respectively.
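Before stating Theorem 11, the following small simulation illustrates the Bernstein inequality (44) itself: for bounded, centered, independent summands the empirical tail of $S_n$ stays below the bound $2\exp\{-(t^2/2)/(V_n + bt/3)\}$. The choice of Uniform$(-b,b)$ summands is ours and serves only as an illustration; it is not the specific $Z_i$ used in the proofs.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(2)
n, b, reps = 400, 1.0, 20000
Z = rng.uniform(-b, b, size=(reps, n))   # independent, mean-zero summands with |Z_i| <= b
S = Z.sum(axis=1)                        # S_n, so E_n = 0 here
V_n = n * (2 * b)**2 / 12                # Var(S_n) for Uniform(-b, b) summands

for t in (20.0, 30.0, 40.0):
    empirical = np.mean(np.abs(S) > t)
    bound = 2 * np.exp(-(t**2 / 2) / (V_n + b * t / 3))
    print("t=%g  empirical=%.5f  Bernstein bound=%.5f" % (t, empirical, bound))
\end{verbatim}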
Theorem 11. Under (A.1), (A.2), (A.3), (A.4), (K.1), (K.2), $a_n = \big(\log(n)/(n h_n^{(p-q)/2})\big)^{1/2} = o(1)$ and $a_n/h_n^{(p-q)/2} = O(1)$,
\[
\sup_{V\times s\in A}\Big|t_n^{(l)}(V,s) - E\big(t_n^{(l)}(V,s)\big)\Big| = O_P(a_n) \quad\text{for } l = 0,1,2. \tag{46}
\]

Remark. If we assume $|Y| < M < \infty$ almost surely, the requirement $a_n/h_n^{(p-q)/2} = O(1)$ for the bandwidth can be dropped and the truncation step of the proof of Theorem 11 can be skipped.

Proof of Theorem 11. The proof is organized in 3 steps: a truncation step, a discretization step by covering $A = \mathcal{S}(p,q)\times\operatorname{supp}(f_X)$, and application of Bernstein's inequality (44).
We let $\tau_n = a_n^{-1}$ and truncate $Y_i^l$ by $\tau_n$ as follows. We let
\[
t_{n,\mathrm{trc}}^{(l)}(V,s) = \frac{1}{n h_n^{(p-q)/2}}\sum_i K\big(\|P_U(X_i - s)\|^2/h_n\big)\, Y_i^l\, 1_{\{|Y_i|^l\le\tau_n\}} \tag{47}
\]
be the truncated version of (37) and $\tilde R_n^{(l)} = (1/(n h_n^{(p-q)/2}))\sum_i |Y_i|^l 1_{\{|Y_i|^l > \tau_n\}}$ be the remainder of (37). Therefore, $R_n^{(l)}(V,s) = t_n^{(l)}(V,s) - t_{n,\mathrm{trc}}^{(l)}(V,s) \le M\tilde R_n^{(l)}$ due to (K.1), and
\[
\sup_{V\times s\in A}\Big|t_n^{(l)}(V,s) - E\big(t_n^{(l)}(V,s)\big)\Big| \le M\big(\tilde R_n^{(l)} + E\tilde R_n^{(l)}\big) + \sup_{V\times s\in A}\Big|t_{n,\mathrm{trc}}^{(l)}(V,s) - E\big(t_{n,\mathrm{trc}}^{(l)}(V,s)\big)\Big|. \tag{48}
\]
By Cauchy--Schwarz and the Markov inequality, $P(|Z| > t) = P(Z^2 > t^2) \le E(Z^2)/t^2$, we obtain
\[
E\tilde R_n^{(l)} = \frac{1}{h_n^{(p-q)/2}}E\big(|Y_i|^l 1_{\{|Y_i|^l>\tau_n\}}\big) \le \frac{1}{h_n^{(p-q)/2}}\sqrt{E(|Y_i|^{2l})}\,\sqrt{P(|Y_i|^l>\tau_n)} \le \frac{1}{h_n^{(p-q)/2}}\sqrt{E(|Y_i|^{2l})}\left(\frac{E(|Y_i|^{2l})}{\tau_n^{2}}\right)^{1/2} = o(a_n) \tag{49}
\]
where the last equality uses the assumption $a_n/h_n^{(p-q)/2} = O(1)$ and the expectations are finite due to (A.3) for $l = 0,1,2$. Obviously, no truncation is needed for $l = 0$.
Therefore, the first two terms of the right hand side of (48) converge to 0 with rate $a_n$ by (49) and Markov's inequality. From now to the end of the proof, $Y_i$ will denote the truncated version $Y_i 1_{\{|Y_i|\le\tau_n\}}$ and we do not distinguish the truncated from the untruncated $t_n^{(l)}(V,s)$ since this truncation results in an error of magnitude $a_n$.
For the discretization step we cover the compact set $A = \mathcal{S}(p,q)\times\operatorname{supp}(f_X)$ by finitely many balls, which is possible by (A.4) and the compactness of $\mathcal{S}(p,q)$. Let $\delta_n = a_n h_n$ and $A_j = \{V: \|P_V - P_{V_j}\| \le \delta_n\}\times\{s: \|s - s_j\| \le \delta_n\}$ be a cover of $A$ with ball centers $V_j\times s_j$. Then, $A \subset \bigcup_{j=1}^N A_j$ and the number of balls can be bounded by $N \le C\delta_n^{-d}\delta_n^{-p}$ for some constant $C\in(0,\infty)$, where $d = \dim(\mathcal{S}(p,q)) = pq - q(q+1)/2$. Let $V\times s\in A_j$. Then by Lemma 10 there exists $0 < C < \infty$ such that
\[
|d_i(V,s) - d_i(V_j,s_j)| \le C\delta_n \tag{50}
\]
holds for $d_i$ in (16). Under (K.2), which implies (45), inequality (50) yields
\[
\left|K\left(\frac{d_i(V,s)}{h_n}\right) - K\left(\frac{d_i(V_j,s_j)}{h_n}\right)\right| \le K^*\left(\frac{d_i(V_j,s_j)}{h_n}\right) C a_n \tag{51}
\]
for $V\times s\in A_j$ and $K^*(\cdot)$ an integrable and bounded function.
Define $r_n^{(l)}(V_j,s_j) = (1/(n h_n^{(p-q)/2}))\sum_{i=1}^n K^*(d_i(V_j,s_j)/h_n)|Y_i|^l$. For notational convenience we drop the dependence on $l$ and $j$ in the following and observe that (51) yields
\[
|t_n^{(l)}(V,s) - t_n^{(l)}(V_j,s_j)| \le C a_n\, r_n^{(l)}(V_j,s_j). \tag{52}
\]
Since $K^*$ fulfills (K.1) except for continuity, an analogous argument as in the proof of Lemma 8 yields that $E\big(r_n^{(l)}(V_j,s_j)\big) < \infty$ by (A.3). By subtracting and adding $t_n^{(l)}(V_j,s_j)$, $E(t_n^{(l)}(V_j,s_j))$, the triangle inequality, (52) and integrability of $r_n^{(l)}$, we obtain
\begin{align}
\Big|t_n^{(l)}(V,s) - E\big(t_n^{(l)}(V,s)\big)\Big| &\le \Big|t_n^{(l)}(V,s) - t_n^{(l)}(V_j,s_j)\Big| + \Big|E\big(t_n^{(l)}(V_j,s_j) - t_n^{(l)}(V,s)\big)\Big| + \Big|t_n^{(l)}(V_j,s_j) - E\big(t_n^{(l)}(V_j,s_j)\big)\Big| \notag\\
&\le C a_n\big(|r_n| + |E(r_n)|\big) + \Big|t_n^{(l)}(V_j,s_j) - E\big(t_n^{(l)}(V_j,s_j)\big)\Big| \notag\\
&\le C a_n\big(|r_n - E(r_n)| + 2|E(r_n)|\big) + \Big|t_n^{(l)}(V_j,s_j) - E\big(t_n^{(l)}(V_j,s_j)\big)\Big| \notag\\
&\le C' a_n + |r_n - E(r_n)| + \Big|t_n^{(l)}(V_j,s_j) - E\big(t_n^{(l)}(V_j,s_j)\big)\Big| \tag{53}
\end{align}
for any $C' > 2C\, E\big(r_n^{(l)}(V_j,s_j)\big)$ and $n$ such that $C a_n \le 1$, since $a_n = o(1)$.
Summarizing, there exists $0 < C < \infty$ such that (53) holds. Then, using $\sup_{x\in A}f(x) = \max_{1\le j\le N}\sup_{x\in A_j}f(x) \le \sum_{j=1}^N\sup_{x\in A_j}f(x)$ for any partition of $A$ and continuous function $f$, subadditivity of the probability for the first inequality and (53) for the third inequality below, it holds
\begin{align}
P\Big(\sup_{V\times s\in A}\big|t_n^{(l)}(V,s) - E\big(t_n^{(l)}(V,s)\big)\big| > Ca_n\Big) &\le \sum_{j=1}^N P\Big(\sup_{V\times s\in A_j}\big|t_n^{(l)}(V,s) - E\big(t_n^{(l)}(V,s)\big)\big| > Ca_n\Big) \tag{54}\\
&\le N\max_{1\le j\le N} P\Big(\sup_{V\times s\in A_j}\big|t_n^{(l)}(V,s) - E\big(t_n^{(l)}(V,s)\big)\big| > Ca_n\Big) \notag\\
&\le N\Big(\max_{1\le j\le N} P\big(\big|t_n^{(l)}(V_j,s_j) - E\big(t_n^{(l)}(V_j,s_j)\big)\big| > Ca_n\big) + \max_{1\le j\le N} P\big(|r_n - E(r_n)| > Ca_n\big)\Big) \notag\\
&\le C\delta_n^{-(d+p)}\Big(\max_{1\le j\le N} P\big(\big|t_n^{(l)}(V_j,s_j) - E\big(t_n^{(l)}(V_j,s_j)\big)\big| > Ca_n\big) + \max_{1\le j\le N} P\big(|r_n - E(r_n)| > Ca_n\big)\Big) \notag
\end{align}
where the last inequality is due to $N \le C\delta_n^{-d}\delta_n^{-p}$ for a cover of $A$.
Finally, we bound the first and second term in the last line of (54) by the Bernstein inequality (44). For the first term in the last line of (54), let $Z_i = Y_i^l K(d_i(V_j,s_j)/h_n)$ and $S_n = \sum_i Z_i = n h_n^{(p-q)/2} t_n^{(l)}(V_j,s_j)$; then the $Z_i$ are independent, $|Z_i| \le b = M\tau_n = M/a_n$ by (K.1) and the truncation step. For $V_n = \operatorname{Var}(S_n)$, Lemma 9 yields
\[
n h_n^{(p-q)/2}\Big(\tilde b_{2l,2} - \delta - h_n^{(p-q)/2}\big((\tilde b_{l,1})^2 + \delta\big)\Big) \le V_n \le n h_n^{(p-q)/2}\Big(\tilde b_{2l,2} + \delta - h_n^{(p-q)/2}\big((\tilde b_{l,1})^2 - \delta\big)\Big)
\]
for $n$ sufficiently large. We write $n h_n^{(p-q)/2} C_1 \ge V_n$ with $C_1 = \tilde b_{2l,2} + \delta$, and set $t = C a_n n h_n^{(p-q)/2}$. The Bernstein inequality (44) yields
\begin{align*}
P\Big(\big|t_n^{(l)}(V_j,s_j) - E\big(t_n^{(l)}(V_j,s_j)\big)\big| > Ca_n\Big) &< 2\exp\left(\frac{-t^2/2}{V_n + bt/3}\right) \le 2\exp\left(\frac{-(1/2)C^2 a_n^2 n^2 h_n^{p-q}}{n h_n^{(p-q)/2}C_1 + (1/3)M\tau_n C a_n n h_n^{(p-q)/2}}\right) \\
&\le 2\exp\left(\frac{-(1/2)C^2\log(n)}{C_1 + MC/3}\right) = 2 n^{-\gamma(C)}
\end{align*}
where $a_n^2 = \log(n)/(n h_n^{(p-q)/2})$ and we define $\gamma(C) = \frac{C/2}{C_1/C + M/3}$, which is an increasing function that can be made arbitrarily large by increasing $C$.
For the second term in the last line of (54), set $Z_i = Y_i^l K^*(d_i(V_j,s_j)/h_n)$ in the Bernstein inequality (44) and proceed analogously to obtain
\[
P\Big(\big|r_n^{(l)}(V_j,s_j) - E\big(r_n^{(l)}(V_j,s_j)\big)\big| > Ca_n\Big) < 2 n^{-\gamma_2(C)},
\]
with $\gamma_2(C)$ defined analogously and likewise increasing in $C$.
By (H.1), $h_n^{(p-q)/2} \le 1$ for $n$ large, so that $\delta_n^{-1} = (a_n h_n)^{-1} \le n^{1/2} h_n^{(p-q)/4} h_n^{-1} \le n^{1/2} h_n^{-1}$. Further, (H.2) implies $1/(n h_n^{(p-q)/2}) \le 1$ for $n$ large, therefore $h_n^{-1} \le n^{2/(p-q)} \le n^2$ since $p - q \ge 1$, so that $\delta_n^{-1} \le n^{5/2}$. Therefore, (54) is smaller than $C\delta_n^{-(d+p)} n^{-\gamma(C)} \le C n^{5(d+p)/2 - \gamma(C)}$. For $C$ large enough, we have $5(d+p)/2 - \gamma(C) < 0$ and $n^{5(d+p)/2 - \gamma(C)} \to 0$. This completes the proof.
Theorem 12. Under (A.1), (A.2) and (A.4), (H.1), (K.1), and $\int_{\mathbb{R}^{p-q}}K(\|r_2\|^2)\,dr_2 = 1$,
\[
\sup_{V\times s\in A}\Big|t^{(l)}(V,s) + 1_{\{l=2\}}\eta^2 t^{(0)}(V,s) - E\big(t_n^{(l)}(V,s)\big)\Big| = O(h_n), \quad l = 0,1,2. \tag{55}
\]

Proof of Theorem 12. Let $\tilde g(r_1,r_2) = g(B^Ts + B^TVr_1 + B^TUr_2)^l f_X(s + Vr_1 + Ur_2)$, where $r_1, r_2$ satisfy the orthogonal decomposition (5). Then
\[
E\big(t_n^{(l)}(V,s)\big) = \int_{\mathbb{R}^{p-q}}K(\|r_2\|^2)\int_{\mathbb{R}^{q}}\tilde g(r_1, h_n^{1/2}r_2)\,dr_1\,dr_2 + 1_{\{l=2\}}\eta^2 E\big(t_n^{(0)}(V,s)\big) \tag{56}
\]
holds by Lemma 8 for $l = 0,1$. For $l = 2$, $Y_i^2 = g_i^2 + 2g_i\epsilon_i + \epsilon_i^2$ with $g_i = g(B^TX_i)$ and can be handled as in the case of $l = 0,1$.
Plugging in (56) the second order Taylor expansion, for some $\xi$ in the neighborhood of 0,
\[
\tilde g(r_1, h_n^{1/2}r_2) = \tilde g(r_1, 0) + h_n^{1/2}\nabla_{r_2}\tilde g(r_1,0)^T r_2 + \frac{h_n}{2}\, r_2^T\nabla^2_{r_2}\tilde g(r_1,\xi)\, r_2,
\]
yields
\begin{align*}
E\big(t_n^{(l)}(V,s)\big) &= \int_{\mathbb{R}^q}\tilde g(r_1,0)\,dr_1 + \sqrt{h_n}\left(\int_{\mathbb{R}^q}\nabla_{r_2}\tilde g(r_1,0)\,dr_1\right)^T\int_{\mathbb{R}^{p-q}}K(\|r_2\|^2)\,r_2\,dr_2 \\
&\qquad + \frac{h_n}{2}\int_{\mathbb{R}^{p-q}}K(\|r_2\|^2)\int_{\mathbb{R}^{q}} r_2^T\nabla^2_{r_2}\tilde g(r_1,\xi)\, r_2\, dr_1\, dr_2 \\
&= t^{(l)}(V,s) + h_n R(V,s)
\end{align*}
since $\int_{\mathbb{R}^q}\tilde g(r_1,0)\,dr_1 = t^{(l)}(V,s)$ and $\int_{\mathbb{R}^{p-q}}K(\|r_2\|^2)\,r_2\,dr_2 = 0\in\mathbb{R}^{p-q}$ due to $K(\|\cdot\|^2)$ being even. Let $R(V,s) = \frac{1}{2}\int_{\mathbb{R}^{p-q}}K(\|r_2\|^2)\int_{\mathbb{R}^{q}} r_2^T\nabla^2_{r_2}\tilde g(r_1,\xi)\, r_2\, dr_1\, dr_2$. By (A.4) and (A.2) it holds $|r_2^T\nabla^2_{r_2}\tilde g(r_1,\xi)\, r_2| \le C_1\|r_2\|^2$ for $C_1 = \sup_{x,y}\|\nabla^2_{r_2}\tilde g(x,y)\| < \infty$, since a continuous function over a compact set is bounded. Then, $R(V,s) \le C_1 C_2\int_{\mathbb{R}^{p-q}}K(\|r_2\|^2)\|r_2\|^2\,dr_2 < \infty$ for some $C_2 > 0$ since the integral over $r_1$ is over a compact set by (A.4).
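The $O(h_n)$ bias rate in Theorem 12 can be seen numerically in a one-dimensional toy version of the argument: with the kernel applied to the squared distance $(x-s)^2$ and bandwidth $h$, the smoothing bias of the population-level local average decays roughly linearly in $h$, since the first-order Taylor term integrates to zero by symmetry. The link function, design density and grid below are our own illustrative choices.
\begin{verbatim}
import numpy as np

g = lambda x: np.sin(2.0 * x)                       # link function
f = lambda x: np.where(np.abs(x) <= 1.0, 0.5, 0.0)  # Uniform(-1, 1) design density
s = 0.3
x = np.linspace(-1.0, 1.0, 200001)                  # fine grid; Riemann sums approximate the integrals

for h in (0.1, 0.05, 0.025, 0.0125):
    K = np.exp(-(((x - s)**2) / h)**2 / 2)          # K(d/h) with d = (x - s)^2
    local_mean = np.sum(g(x) * K * f(x)) / np.sum(K * f(x))
    bias = local_mean - g(s)
    print("h=%7.4f  bias=%9.6f  bias/h=%7.4f" % (h, bias, bias / h))
\end{verbatim}
As $h$ shrinks, the ratio bias$/h$ approximately stabilizes, in line with the order-$h_n$ bound above.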
Lemma 13 follows directly from Theorems 11 and 12 and the triangle inequality.

Lemma 13. Suppose (A.1), (A.2), (A.3), (A.4), (K.1), (K.2), (H.1) hold. If $a_n = \big(\log(n)/(n h_n^{(p-q)/2})\big)^{1/2} = o(1)$ and $a_n/h_n^{(p-q)/2} = O(1)$, then for $l = 0,1,2$,
\[
\sup_{V\times s\in A}\Big|t^{(l)}(V,s) + 1_{\{l=2\}}\eta^2 t^{(0)}(V,s) - t_n^{(l)}(V,s)\Big| = O_P(a_n + h_n).
\]
Combining the results of Theorems 11 and 12 and Lemma 13 obtains Theorem 14.
Theorem 14.
Suppose (A.1), (A.2), (A.3), (A.4), (K.1), (K.2), (H.1) hold. Let $a_n = \big(\log(n)/(n h_n^{(p-q)/2})\big)^{1/2} = o(1)$, $a_n/h_n^{(p-q)/2} = O(1)$, $\delta_n = \inf_{V\times s\in A_n} t^{(0)}(V,s)$, where $t^{(0)}(V,s)$ is defined in (11), and $A_n = \mathcal{S}(p,q)\times\{x\in\operatorname{supp}(f_X): |x - \partial\operatorname{supp}(f_X)| \ge b_n\}$, where $\partial\mathcal{C}$ denotes the boundary of the set $\mathcal{C}$ and $|x - \mathcal{C}| = \inf_{r\in\mathcal{C}}|x - r|$, for a sequence $b_n\to 0$ so that $\delta_n^{-1}(a_n + h_n)\to 0$ for any bandwidth $h_n$ that satisfies the assumptions. Then,
\[
\sup_{V\times s\in A_n}\Big|\bar y_l(V,s) - \mu_l(V,s) - 1_{\{l=2\}}\eta^2\Big| = O_P\big(\delta_n^{-1}(a_n + h_n)\big), \quad l = 0,1,2,
\]
and
\[
\sup_{V\times s\in A_n}\Big|\tilde L_n(V,s) - \tilde L(V,s)\Big| = O_P\big(\delta_n^{-1}(a_n + h_n)\big) \tag{57}
\]
where $\bar y_l(V,s)$, $\mu_l(V,s)$, $\tilde L_n(V,s)$ and $\tilde L(V,s)$ are defined in (38), (10), (18) and (9), respectively.

Proof of Theorem 14.
\[
\bar y_l(V,s) = \frac{t_n^{(l)}(V,s)}{t_n^{(0)}(V,s)} = \frac{t_n^{(l)}(V,s)/t^{(0)}(V,s)}{t_n^{(0)}(V,s)/t^{(0)}(V,s)}.
\]
We consider the numerator and denominator separately. By Lemma 13,
\[
\sup_{V\times s\in A_n}\left|\frac{t_n^{(0)}(V,s)}{t^{(0)}(V,s)} - 1\right| \le \frac{\sup_A |t_n^{(0)}(V,s) - t^{(0)}(V,s)|}{\inf_{A_n} t^{(0)}(V,s)} = O_P\big(\delta_n^{-1}(a_n + h_n)\big).
\]
Next,
\[
\sup_{V\times s\in A_n}\left|\frac{t_n^{(l)}(V,s)}{t^{(0)}(V,s)} - \mu_l(V,s)\right| \le \frac{\sup_A |t_n^{(l)}(V,s) - t^{(l)}(V,s)|}{\inf_{A_n} t^{(0)}(V,s)} = O_P\big(\delta_n^{-1}(a_n + h_n)\big).
\]
Therefore, by $A_n\uparrow A = \mathcal{S}(p,q)\times\operatorname{supp}(f_X)$ we get
\[
\lim_{n\to\infty}\sup_{V\times s\in A_n}\left|\frac{t_n^{(l)}(V,s)}{t^{(0)}(V,s)} - \mu_l(V,s)\right| = \lim_{n\to\infty}\sup_{V\times s\in A}\left|\frac{t_n^{(l)}(V,s)}{t^{(0)}(V,s)} - \mu_l(V,s)\right|
\]
and in total we obtain
\[
\bar y_l(V,s) = \frac{t_n^{(l)}(V,s)/t^{(0)}(V,s)}{t_n^{(0)}(V,s)/t^{(0)}(V,s)} = \frac{\mu_l + O_P(\delta_n^{-1}(a_n + h_n))}{1 + O_P(\delta_n^{-1}(a_n + h_n))} = \mu_l + O_P\big(\delta_n^{-1}(a_n + h_n)\big).
\]
For $l = 2$, $Y_i^2 = g(B^TX_i)^2 + 2g(B^TX_i)\epsilon_i + \epsilon_i^2$, and (57) follows from (9).
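For concreteness, the quantities appearing in Theorem 14 can be computed directly from their definitions: the weights $w_i(V,s)$, the weighted moments $\bar y_1, \bar y_2$ of (38), and the local criterion $\tilde L_n(V,s) = \bar y_2 - \bar y_1^2$ of (18). The sketch below (same illustrative Gaussian kernel and squared orthogonal distance as before; names are ours) also hints at why the criterion is informative: when $\operatorname{span}\{V\} = \operatorname{span}\{B\}^{\perp}$, observations close to $s + \operatorname{span}\{V\}$ share essentially the same value of $B^TX$, so the local variance is driven mainly by the error variance, whereas a misaligned $V$ mixes different level sets of $g$.
\begin{verbatim}
import numpy as np

def local_stats(V, s, X, Y, h):
    # weights w_i(V, s), weighted moments ybar_1, ybar_2 and tilde L_n = ybar_2 - ybar_1^2
    Xc = X - s
    d = np.sum(Xc**2, axis=1) - np.sum((Xc @ V)**2, axis=1)
    K = np.exp(-(d / h)**2 / 2)
    w = K / K.sum()
    y1 = np.sum(w * Y)
    y2 = np.sum(w * Y**2)
    return y1, y2, y2 - y1**2

rng = np.random.default_rng(3)
n, p = 500, 3
B = np.array([[1.0], [0.0], [0.0]])
X = rng.uniform(-1.0, 1.0, (n, p))
Y = (X @ B).ravel()**2 + 0.1 * rng.standard_normal(n)

V_perp = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])    # spans span{B}^perp
V_rand = np.linalg.qr(rng.standard_normal((p, 2)))[0]      # an arbitrary element of S(p, 2)
s = X[0]
print("tilde L_n, V = span{B}^perp :", local_stats(V_perp, s, X, Y, h=0.2)[2])
print("tilde L_n, V arbitrary      :", local_stats(V_rand, s, X, Y, h=0.2)[2])
\end{verbatim}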
Lemma 15. Under (A.1), (A.2), (A.4), there exists $0 < C < \infty$ such that
\[
|\mu_l(V,s) - \mu_l(V_j,s)| \le C\|P_V - P_{V_j}\| \tag{58}
\]
for all $s\in\operatorname{supp}(f_X)$.
Proof. From the representation $\tilde t^{(l)}(P_V,s)$ in (14) instead of $t^{(l)}(V,s)$, we consider $\mu_l(V,s) = \mu_l(P_V,s)$ as a function on the Grassmann manifold. Then,
\begin{align}
\big|\mu_l(P_V,s) - \mu_l(P_{V_j},s)\big| &= \left|\frac{\tilde t^{(l)}(P_V,s)}{\tilde t^{(0)}(P_V,s)} - \frac{\tilde t^{(l)}(P_{V_j},s)}{\tilde t^{(0)}(P_{V_j},s)}\right| \notag\\
&\le \frac{\sup|\tilde t^{(0)}(P_V,s)|}{\big(\inf\tilde t^{(0)}(P_V,s)\big)^2}\,\Big|\tilde t^{(l)}(P_V,s) - \tilde t^{(l)}(P_{V_j},s)\Big| + \frac{\sup\tilde t^{(l)}(P_V,s)}{\big(\inf\tilde t^{(0)}(P_V,s)\big)^2}\,\Big|\tilde t^{(0)}(P_V,s) - \tilde t^{(0)}(P_{V_j},s)\Big| \tag{59}
\end{align}
with $\sup_{P_V\in Gr(p,q)}\tilde t^{(0)}(P_V,s)\in(0,\infty)$ and $\inf_{P_V\in Gr(p,q)}\tilde t^{(0)}(P_V,s)\in(0,\infty)$, since $\tilde t^{(l)}$ is continuous, $\Sigma_x > 0$ and $s\in\operatorname{supp}(f_X)$.
By (A.2), $\tilde g(x) = g(B^Tx)^l f_X(x)$ is twice continuously differentiable and therefore Lipschitz continuous on compact sets. We denote its Lipschitz constant by $L < \infty$. Therefore,
\[
\Big|\tilde t^{(l)}(P_V,s) - \tilde t^{(l)}(P_{V_j},s)\Big| \le \int_{\operatorname{supp}(f_X)}\big|\tilde g(s + P_V r) - \tilde g(s + P_{V_j} r)\big|\,dr \le L\int_{\operatorname{supp}(f_X)}\|(P_V - P_{V_j})r\|\,dr \le L\left(\int_{\operatorname{supp}(f_X)}\|r\|\,dr\right)\|P_V - P_{V_j}\| \tag{60}
\]
where the last inequality is due to the sub-multiplicativity of the Frobenius norm and the integral being finite by (A.4). Plugging (60) in (59) and collecting all constants into $C$ yields (58).
Proof of Theorem 3. By (19) and (7),
\[
|L_n(V) - L(V)| \le \left|\frac{1}{n}\sum_i\Big(\tilde L_n(V,X_i) - \tilde L(V,X_i)\Big)\right| + \left|\frac{1}{n}\sum_i\Big(\tilde L(V,X_i) - E(\tilde L(V,X))\Big)\right|. \tag{61}
\]
The first term on the right hand side of (61) goes to 0 in probability uniformly in $V$ by Theorem 14,
\[
\left|\frac{1}{n}\sum_i\Big(\tilde L_n(V,X_i) - \tilde L(V,X_i)\Big)\right| \le \sup_{V\times s\in A}\Big|\tilde L_n(V,s) - \tilde L(V,s)\Big| = O_P\big(\delta_n^{-1}(a_n + h_n)\big). \tag{62}
\]
The second term in (61) converges to 0 almost surely for all $V\in\mathcal{S}(p,q)$ by the strong law of large numbers. In order to show uniform convergence, the same technique as in the proof of Theorem 11 is used. Let $B_j = \{V\in\mathcal{S}(p,q): \|VV^T - V_jV_j^T\| \le \tilde a_n\}$ be a cover of $\mathcal{S}(p,q)$, $\mathcal{S}(p,q)\subset\bigcup_{j=1}^N B_j$, with $N \le C\tilde a_n^{-d} = C(n/\log(n))^{d/2} \le Cn^{d/2}$, where $\tilde a_n = (\log(n)/n)^{1/2}$ and $d = \dim(\mathcal{S}(p,q))$ is defined in the proof of Theorem 11. By Lemma 15,
\[
|\mu_l(V,X_i) - \mu_l(V_j,X_i)| \le C\|P_V - P_{V_j}\|. \tag{63}
\]
Let $G_n(V) = \sum_i\tilde L(V,X_i)/n$ with $E(G_n(V)) = L(V)$. Using (63) and following the same steps as in the proof of Theorem 11 we obtain
\[
|G_n(V) - L(V)| \le |G_n(V) - G_n(V_j)| + |G_n(V_j) - L(V_j)| + |L(V) - L(V_j)| \le C\tilde a_n + |G_n(V_j) - L(V_j)| \tag{64}
\]
for $V\in B_j$ and some $C > 0$. Inequality (64) leads to
\begin{align}
P\left(\sup_{V\in\mathcal{S}(p,q)}|G_n(V) - L(V)| > C\tilde a_n\right) &\le N\max_{1\le j\le N} P\left(\sup_{V\in B_j}|G_n(V) - L(V)| > C\tilde a_n\right) \notag\\
&\le Cn^{d/2}\max_{1\le j\le N} P\big(|G_n(V_j) - L(V_j)| > C\tilde a_n\big) \le Cn^{d/2}\, n^{-\gamma(C)} \to 0 \tag{65}
\end{align}
where the last inequality in (65) is due to the Bernstein inequality (44) with $Z_i = \tilde L(V_j,X_i)$, which is bounded since $\tilde L(\cdot,\cdot)$ is continuous on the compact set $A$, and $\gamma(C)$ a monotone increasing function of $C$ that can be made arbitrarily large by choosing $C$ accordingly. Therefore, $\sup_{V\in\mathcal{S}(p,q)}|L_n(V) - L(V)| = O_P\big(\delta_n^{-1}(a_n + h_n) + \tilde a_n\big)$ with $\delta_n = \inf_{V\times s\in A_n}t^{(0)}(V,s)$, where $t^{(0)}(V,s)$ is defined in (11), and $A_n = \mathcal{S}(p,q)\times\{x\in\operatorname{supp}(f_X): f_X(x)\ge b_n\}$ for a sequence $b_n\to 0$ so that $\delta_n^{-1}(a_n + h_n)\to 0$ for any bandwidth $h_n$ that satisfies the assumptions, which implies (20).
Proof of Theorem 4. We apply Theorem 4.1.1 of [2] to obtain consistency of the conditional variance estimator. This theorem requires three conditions that guarantee the convergence of the minimizer of a sequence of random functions $L_n(P_V)$ to the minimizer of the limiting function $L(P_V)$; i.e., $P_{\operatorname{span}\{\widehat B\}^\perp} = \operatorname{argmin} L_n(P_V) \to P_{\operatorname{span}\{B\}^\perp} = \operatorname{argmin} L(P_V)$ in probability. To apply the theorem, three conditions have to be met: (1) the parameter space is compact; (2) $L_n(V)$ is continuous and a measurable function of the data $(Y_i, X_i^T)_{i=1,\ldots,n}$; and (3) $L_n(V)$ converges uniformly to $L(V)$ and $L(V)$ attains a unique global minimum at $\operatorname{span}\{B\}^\perp$.
Since $L_n(V)$ depends on $V$ only through $P_V = VV^T$, $L_n(V)$ can be considered as a function on the Grassmann manifold, which is compact, and the same holds true for $L(V)$ by (14). Further, $L_n(V)$ is by definition a measurable function of the data and continuous in $V$ if a continuous kernel is used, such as the Gaussian. Theorem 3 obtains the uniform convergence and Theorem 1 that the minimizer is unique when $L(V)$ is minimized over the Grassmann manifold $Gr(p,q)$, since $\operatorname{span}\{B\}$ is uniquely identifiable and so is $\operatorname{span}\{B\}^\perp$ (i.e. $\|P_{\operatorname{span}\{\widehat B\}} - P_{\operatorname{span}\{B\}}\| = \|\widehat B\widehat B^T - BB^T\| = \|(I_p - BB^T) - (I_p - \widehat B\widehat B^T)\| = \|P_{\operatorname{span}\{\widehat B\}^\perp} - P_{\operatorname{span}\{B\}^\perp}\|$). Thus, all three conditions are met and the result is obtained.
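The consistency statement can be visualized in a toy case where the minimizer of $L_n$ is found by brute force. With $p = 2$, $k = 1$ (so $q = 1$) and $V = (\cos\theta, \sin\theta)^T$, the grid search below locates $\operatorname{argmin}_\theta L_n(V)$ and compares the minimizing direction with $\operatorname{span}\{B\}^{\perp}$. This is only an illustration of the limiting behaviour, with our own choices of $g$, kernel and bandwidth; it is not the optimization algorithm used for the conditional variance estimator.
\begin{verbatim}
import numpy as np

def L_n(theta, X, Y, h):
    # sample objective L_n(V) = (1/n) sum_i tilde L_n(V, X_i) for V = (cos theta, sin theta)^T
    V = np.array([np.cos(theta), np.sin(theta)])
    total = 0.0
    for s in X:
        Xc = X - s
        d = np.sum(Xc**2, axis=1) - (Xc @ V)**2     # squared distance to the line s + span{V}
        w = np.exp(-(d / h)**2 / 2)
        w = w / w.sum()
        y1 = np.sum(w * Y)
        total += np.sum(w * Y**2) - y1**2            # tilde L_n(V, s)
    return total / X.shape[0]

rng = np.random.default_rng(4)
n = 300
B = np.array([1.0, 0.0])                             # model (1) with p = 2, k = 1
X = rng.uniform(-1.0, 1.0, (n, 2))
Y = np.sin(2.0 * X @ B) + 0.1 * rng.standard_normal(n)

grid = np.linspace(0.0, np.pi, 181)
vals = [L_n(th, X, Y, h=0.1) for th in grid]
theta_hat = grid[int(np.argmin(vals))]
print("minimizing direction:", np.round([np.cos(theta_hat), np.sin(theta_hat)], 3))
print("span{B}^perp direction: [0, 1] (up to sign)")
\end{verbatim}
In repeated runs the minimizing direction should concentrate around $(0,1)^T$, i.e. around $\operatorname{span}\{B\}^{\perp}$, mirroring the convergence of $\operatorname{argmin} L_n(P_V)$ to $\operatorname{argmin} L(P_V)$ established above.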
Proof of Theorem 6.
The Gaussian kernel $K$ satisfies $\partial_z K(z) = -zK(z)$. From (17) and (18) we have $\tilde L_n = \bar y_2 - \bar y_1^2$, where $\bar y_l = \sum_i w_i Y_i^l$, $l = 1,2$. We let $K_j = K(d_j(V,s)/h_n)$, suppress the dependence on $V$ and $s$, and write $w_i = K_i/\sum_j K_j$. Then, $\nabla K_i = -(1/h_n^2) K_i d_i \nabla d_i$ and
\[
\nabla w_i = -\frac{K_i d_i\nabla d_i\sum_j K_j - K_i\sum_j K_j d_j\nabla d_j}{h_n^2\big(\sum_j K_j\big)^2}.
\]
Next,
\begin{align}
\nabla\bar y_l &= -\frac{1}{h_n^2}\sum_i Y_i^l\,\frac{K_i d_i\nabla d_i\sum_j K_j - K_i\sum_j K_j d_j\nabla d_j}{\big(\sum_j K_j\big)^2} \notag\\
&= -\frac{1}{h_n^2}\left(\sum_i Y_i^l w_i d_i\nabla d_i - \sum_j Y_j^l w_j\sum_i w_i d_i\nabla d_i\right) = -\frac{1}{h_n^2}\sum_i\big(Y_i^l - \bar y_l\big)\, w_i d_i\nabla d_i. \tag{66}
\end{align}
Then, $\nabla\tilde L_n = \nabla\bar y_2 - 2\bar y_1\nabla\bar y_1$, and inserting $\nabla\bar y_l$ from (66) yields
\[
\nabla\tilde L_n = -\frac{1}{h_n^2}\sum_i\big(Y_i^2 - \bar y_2 - 2\bar y_1(Y_i - \bar y_1)\big)\, w_i d_i\nabla d_i = \frac{1}{h_n^2}\sum_i\big(\tilde L_n - (Y_i - \bar y_1)^2\big)\, w_i d_i\nabla d_i,
\]
since $Y_i^2 - \bar y_2 - 2\bar y_1(Y_i - \bar y_1) = (Y_i - \bar y_1)^2 - \tilde L_n$.
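The gradient formula just derived is straightforward to verify numerically. The sketch below implements $\nabla_V\tilde L_n(V,s) = h_n^{-2}\sum_i\big(\tilde L_n - (Y_i - \bar y_1)^2\big)w_i d_i\,\nabla_V d_i$ with $\nabla_V d_i = -2(X_i - s)(X_i - s)^TV$, which follows from $d_i(V,s) = \|X_i - s\|^2 - \|V^T(X_i - s)\|^2$, and checks one entry against a central finite difference. The gradient is taken with respect to an unconstrained matrix $V$, i.e. before any projection onto the Stiefel manifold; all names are illustrative.
\begin{verbatim}
import numpy as np

def Ltilde_and_grad(V, s, X, Y, h):
    # tilde L_n(V, s) and its gradient in V for the Gaussian kernel K(z) = exp(-z^2/2)
    Xc = X - s
    d = np.sum(Xc**2, axis=1) - np.sum((Xc @ V)**2, axis=1)
    K = np.exp(-(d / h)**2 / 2)
    w = K / K.sum()
    y1 = np.sum(w * Y)
    L = np.sum(w * Y**2) - y1**2
    coef = (L - (Y - y1)**2) * w * d / h**2                 # (1/h^2)(L - (Y_i - ybar_1)^2) w_i d_i
    grad_d = -2.0 * Xc[:, :, None] * (Xc @ V)[:, None, :]   # nabla_V d_i = -2 (X_i - s)(X_i - s)^T V
    return L, np.sum(coef[:, None, None] * grad_d, axis=0)

rng = np.random.default_rng(5)
n, p, q, h = 200, 3, 2, 0.3
X = rng.uniform(-1.0, 1.0, (n, p))
Y = np.sin(2.0 * X[:, 0]) + 0.1 * rng.standard_normal(n)
V = np.linalg.qr(rng.standard_normal((p, q)))[0]
s = X[0]

L0, G = Ltilde_and_grad(V, s, X, Y, h)
eps = 1e-6
Vp = V.copy(); Vp[1, 0] += eps
Vm = V.copy(); Vm[1, 0] -= eps
fd = (Ltilde_and_grad(Vp, s, X, Y, h)[0] - Ltilde_and_grad(Vm, s, X, Y, h)[0]) / (2 * eps)
print("analytic:", G[1, 0], "  finite difference:", fd)    # the two values should agree closely
\end{verbatim}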