Conditional Variance Estimator for Sufficient Dimension Reduction
Lukas Fertl ([email protected])
Institute of Statistics and Mathematical Methods in Economics
Faculty of Mathematics and Geoinformation
TU Wien, Vienna, Austria
Efstathia Bura ([email protected])
Institute of Statistics and Mathematical Methods in Economics
Faculty of Mathematics and Geoinformation
TU Wien, Vienna, Austria

February 18, 2021

Abstract
Conditional Variance Estimation (CVE) is a novel sufficient dimension reduction (SDR) method for additive error regressions with continuous predictors and link function. It operates under the assumption that the predictors can be replaced by a lower dimensional projection without loss of information. In contrast to the majority of moment based sufficient dimension reduction methods, Conditional Variance Estimation is fully data driven, does not require the restrictive linearity and constant variance conditions, and is not based on inverse regression. CVE is shown to be consistent and its objective function to be uniformly convergent. CVE outperforms minimum average variance estimation (MAVE), its main competitor, in several simulation settings, remains on par under others, and always outperforms the usual inverse regression based linear SDR methods, such as Sliced Inverse Regression.
Suppose (Y, Xᵀ)ᵀ has a joint continuous distribution, where Y ∈ R denotes a univariate response and X ∈ R^p a p-dimensional covariate vector. We assume that the dependence of Y and X is modelled by

Y = g(BᵀX) + ε,   (1)

where X is independent of ε and has positive definite variance-covariance matrix Var(X) = Σ_x, ε ∈ R is a mean zero random variable with finite Var(ε) = E(ε²) = η², g is an unknown continuous non-constant function, and B = (b₁, ..., b_k) ∈ R^{p×k} has rank k ≤ p. Model (1) states that

E(Y | X) = E(Y | BᵀX)   (2)

and requires that the first conditional moment E(Y | X) = g(BᵀX) contain the entirety of the information in X about Y and be captured by BᵀX, so that F(Y | X) = F(Y | BᵀX), where F(· | ·) denotes the conditional cumulative distribution function (cdf) of the first given the second argument. That is, Y is statistically independent of X when BᵀX is given, and replacing X by BᵀX induces no loss of information for the regression of Y on X.
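To fix ideas, the following minimal R sketch generates data from a model of the form (1); the particular link function, dimensions, and noise level are illustrative choices, not taken from the paper.

```r
# Simulate (Y, X) from model (1): Y = g(B'X) + eps with eps independent of X.
set.seed(1)
n <- 200; p <- 5; k <- 1
B <- matrix(c(1, 1, 0, 0, 0), p, k) / sqrt(2)   # semi-orthogonal basis: t(B) %*% B = I_k
X <- matrix(rnorm(n * p), n, p)                  # continuous predictors, Var(X) = I_p
g <- function(z) cos(z)                          # unknown, non-constant link function
Y <- g(X %*% B) + rnorm(n, sd = 0.5)             # additive, mean-zero error
```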
Identifying the span of B, i.e., the column space of B (only span{B} is identifiable), suffices in order to identify the sufficient reduction of X for the regression of Y on X. We assume, without loss of generality, that B is semi-orthogonal, i.e., BᵀB = I_k, since a change of coordinate system by an orthogonal transformation does not alter model (2). For q ≤ p, let

S(p, q) = {V ∈ R^{p×q} : VᵀV = I_q}   (3)

denote the Stiefel manifold, which comprises all p × q matrices with orthonormal columns. S(p, q) is compact and dim(S(p, q)) = pq − q(q+1)/2 [see [4] and Section 2.1 of [31]]. Further, let

Gr(p, q) = S(p, q)/S(q, q)   (4)

denote the Grassmann manifold of all q-dimensional linear subspaces of R^p, which is the quotient space of S(p, q) with respect to the group S(q, q) of all q × q orthonormal matrices; i.e., the basis of a linear subspace is unique up to orthogonal transformations. The fact that only span{B} is identifiable can be expressed through the Grassmann manifold Gr(p, q) in (4). The goal of sufficient dimension reduction in model (1) is to find a subspace M ∈ Gr(p, k) such that any basis B ∈ S(p, k) of M fulfills (1), or equivalently (2).

Finding sufficient reductions of the predictors to replace them in regression and classification without loss of information is called sufficient dimension reduction [9]. The first split in the sufficient dimension reduction taxonomy occurs between likelihood and non-likelihood based methods. The former, which were developed more recently [11, 10, 12, 6, 5], assume knowledge either of the joint family of distributions of (Y, Xᵀ)ᵀ or of the conditional family of distributions of X | Y. The latter is the most researched branch of sufficient dimension reduction and comprises three classes of methods: inverse regression based, semi-parametric, and nonparametric. Reviews of the former two classes can be found in [1, 25, 22].

In this paper we present conditional variance estimation, which falls in the class of nonparametric methods. The estimators in this class minimize a criterion that describes the fit of the dimension reduction model (2) under (1) to the observed data. Since the criterion involves unknown distributions or regression functions, nonparametric estimation is used to recover span{B}. Statistical approaches to identify B in (2) include ordinary least squares and nonparametric multiple index models [34]. The least squares estimator, Σ_x^{-1} cov(X, Y), always falls in span{B} [22, Th. 8.3]. Principal Hessian Directions [24] was the first sufficient dimension reduction estimator to target span{B} in (2). Its main disadvantage is that it requires the so-called linearity and constant variance conditions on the marginal distribution of X. Its relaxation, Iterative Hessian Transformation [13], still requires the linearity condition in order to recover vectors in span{B}.

The most competitive nonparametric sufficient dimension reduction method up to now has been minimum average variance estimation (MAVE, [35]). It assumes model (1), a covariate density with bounded fourth derivative, and the existence of continuous bounded third derivatives of g. It uses a local first order approximation of g in (1) and minimizes the expected conditional variance of the response given BᵀX.

The conditional variance estimator also targets and recovers span{B} in models (1) and (2).
The objective function is based on the intuition that the directions in the predictor space that capture the dependence of Y on X should exhibit significantly higher variation in Y as compared with the directions along which Y exhibits markedly less variation. The conditional variance estimator is a fully data-driven estimator that performs better than, or is on par with, minimum average variance estimation in simulations. The conditional variance estimator differs from other approaches, including MAVE, in that it targets only span{B} and does not require an explicit form or estimation of the link function g. As a result, it requires weaker assumptions on its smoothness.

Let (Ω, F, P) be a probability space and X : Ω → R^p a random vector with a continuous probability density function f_X, and denote its support by supp(f_X). Throughout, ‖·‖ denotes the Frobenius norm for matrices and the Euclidean norm for vectors, and scalar product refers to the Euclidean scalar product. For any matrix M, or linear subspace M, we denote by P_M the projection matrix on the column space of the matrix or on the subspace; i.e., P_M = M(MᵀM)^{-1}Mᵀ ∈ R^{p×p} for M ∈ R^{p×q}. For any V ∈ S(p, q), defined in (3), we generically denote by U a basis of the orthogonal complement of its column space span{V}. That is, U ∈ S(p, p−q) with span{V} ⊥ span{U} and span{V} ⊕ span{U} = R^p, UᵀV = 0 ∈ R^{(p−q)×q}, UᵀU = I_{p−q}. For any x, s ∈ R^p we can always write

x = s + P_V(x − s) + P_U(x − s) = s + V r₁ + U r₂,   (5)

where r₁ = Vᵀ(x − s) ∈ R^q and r₂ = Uᵀ(x − s) ∈ R^{p−q}.
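The orthogonal decomposition (5) is easy to verify numerically; below is a small R sketch with arbitrary illustrative dimensions.

```r
# Decomposition (5): x = s + V r1 + U r2, with U an orthonormal basis of span{V}-perp.
set.seed(1)
p <- 6; q <- 2
V <- qr.Q(qr(matrix(rnorm(p * q), p, q)))         # V in S(p, q): orthonormal columns
U <- qr.Q(qr(V), complete = TRUE)[, (q + 1):p]    # orthonormal basis of the orthogonal complement
x <- rnorm(p); s <- rnorm(p)
r1 <- drop(t(V) %*% (x - s))                      # coordinates in span{V}
r2 <- drop(t(U) %*% (x - s))                      # coordinates in span{V}-perp
max(abs(x - (s + V %*% r1 + U %*% r2)))           # ~ 1e-16: (5) holds
```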
In the sequel we refer to the following assumptions as needed; the proofs of the theorems are presented in the Appendix.

(A.1).
Model Y = g(BᵀX) + ε holds with Y ∈ R, g : R^k → R non-constant in all arguments, B = (b₁, ..., b_k) ∈ R^{p×k} of rank k ≤ p, X ∈ R^p independent of ε, Var(X) = Σ_x positive definite, E(ε) = 0, and Var(ε) = η² < ∞.

(A.2). The link function g and the density f_X : R^p → [0, ∞) of X are twice continuously differentiable.

(A.3). E(|Y|⁴) < ∞.

(A.4). supp(f_X) is compact.

Remark.
Assumption (A.4) is not as restrictive as it might seem. [36] showed in Proposition 11 that there is a compact set S ⊂ R^p such that the mean subspace of model (1) is the same as the mean subspace of Y = g(BᵀX_S) + ε, where X_S = X 1{X ∈ S} and 1_A is the indicator function of A. Further, S can be assumed to be an ellipsoid, and the same assertion holds true for all S̃ ⊇ S.
Definition.
For q ≤ p ∈ N and any V ∈ S(p, q), we define

L̃(V, s) = Var(Y | X ∈ s + span{V}),   (6)

where s ∈ R^p is a shifting point.

Definition.
For V ∈ S(p, q), we define the objective function

L(V) = ∫_{R^p} L̃(V, x) f_X(x) dx = E(L̃(V, X)).   (7)

L(V) in (7) is the objective function for the estimator we propose for the span of B in (1), and Theorem 1 provides the statistical motivation for the objective function (7) of the conditional variance estimator. First we show that both population based functions (6) and (7) are well defined.

Let X be a p-dimensional continuous random vector with density f_X(x), let s ∈ supp(f_X) ⊂ R^p, and let V belong to the Stiefel manifold S(p, q) defined in (3). The function

f_{X | X ∈ s+span{V}}(r₁) = f_X(s + V r₁) / ∫_{R^q} f_X(s + V r₁) dr₁   (8)

is a proper conditional density of X concentrated on the affine subspace s + span{V}, using the concept of regular conditional probability [21] under assumption (A.2). The detailed justification is given in the Appendix, where we also show that under assumptions (A.1), (A.2) and (A.4), L̃(V, s) in (6) and L(V) in (7) are well defined and continuous. Moreover,

L̃(V, s) = µ₂(V, s) − µ₁(V, s)² + η²,   (9)

where

µ_l(V, s) = ∫_{R^q} g(Bᵀs + BᵀV r₁)^l [f_X(s + V r₁) / ∫_{R^q} f_X(s + V r₁) dr₁] dr₁ = t^{(l)}(V, s)/t^{(0)}(V, s)   (10)

with

t^{(l)}(V, s) = ∫_{R^q} g(Bᵀs + BᵀV r₁)^l f_X(s + V r₁) dr₁.   (11)

Theorem 1.
Suppose V = (v₁, ..., v_q) ∈ S(p, q) and q ∈ {1, . . . , p}. Under assumptions (A.1), (A.2) and (A.4),

(a) for all s ∈ R^p and V such that there exists u ∈ {1, ..., q} with v_u ∈ span{B}, L̃(V, s) > Var(ε) = η² and L(V) > η²;

(b) for all s ∈ R^p and span{V} ⊥ span{B}, L̃(V, s) = η² and L(V) = η².

Proof. Let s ∈ R^p and V = (v₁, ..., v_q) ∈ R^{p×q} be such that v_u ∈ span{B} for some u ∈ {1, ..., q}. To obtain (a), observe that X ∈ s + span{V} ⟺ X = s + P_V(X − s), and using (6) yields

L̃(V, s) = Var(g(BᵀX) | X = s + VVᵀ(X − s)) + Var(ε) = Var(g(Bᵀs + BᵀVVᵀ(X − s)) | X = s + VVᵀ(X − s)) + η² > η²   (12)

since BᵀVVᵀ(X − s) ≠ 0 with probability 1, and therefore the variance term in (12) is positive. For V such that V and B are orthogonal, BᵀVVᵀ(X − s) = 0 and (b) follows. Since s is arbitrary yet constant, the statements for L(V) follow.
Theorem 1 also has an intuitive geometrical interpretation for the proposed method. If X is not random, the deterministic function Y = g(BᵀX) is constant in all directions orthogonal to B and varies in all other directions. If randomness is introduced, as in model (1), then the variation in Y stems only from ε in all directions orthogonal to B. In all other directions the variation comprises the sum of the variation of ε and of g(BᵀX). In consequence, the objective function (7) captures the variation of Y as X varies in the column space of V and is minimized in the directions orthogonal to B.

We have shown in Section 2 that the objective function L(V) in (7) is well defined and continuous. Let

V_q = argmin_{V ∈ S(p,q)} L(V).   (13)

V_q is well defined as the minimizer of a continuous function over the compact set S(p, q). Nevertheless, V_q is not unique, since L(VO) = L(V) for all orthogonal O ∈ R^{q×q} with OOᵀ = I_q, as L(V) depends on V only through span{V}. It is, however, a unique minimizer over the Grassmann manifold Gr(p, q) in (4). To see this, suppose V ∈ S(p, q) is an arbitrary basis of a subspace M ∈ Gr(p, q). We can identify M through the projection P_M = VVᵀ. By (5) we write x = V r₁ + U r₂. By the Fubini–Tonelli theorem we obtain

t̃^{(l)}(P_M, s) = ∫_{supp(f_X)} g(Bᵀs + BᵀP_M x)^l f_X(s + P_M x) dx = t^{(l)}(V, s) ∫_{supp(f_X) ∩ R^{p−q}} dr₂.   (14)

Therefore t̃^{(l)}(P_M, s)/t̃^{(0)}(P_M, s) = t^{(l)}(V, s)/t^{(0)}(V, s), and µ_l(·, s) in (10) can also be viewed as a function from Gr(p, q) to R. If the optimization (13) is over Gr(p, q), the objective function (7) has a unique minimum at span{B}⊥ by Theorem 1. Therefore B is not uniquely identifiable, but span{B} is.

Corollary 2 follows directly from Theorem 1 and provides the means for identifying the linear projections of the predictors satisfying (1).

Corollary 2.
Under assumptions (A.1), (A.2), and (A.3), the solution V_q of the optimisation problem (13) is well defined. Let k = dim(span{B}) and q = p − k. Then

(a) span{V_q} = span{B}⊥,
(b) span{V_q}⊥ = span{B}.

We next define the novel estimator of the sufficient reduction space span{B} in (1), which is motivated by Theorem 1; Corollary 2(b) serves as the estimation equation for the conditional variance estimator at the population level.

Definition.
The
Conditional Variance Estimator is defined to be any basis B_{p−q} of span{V_q}⊥; that is, the CVE of B is any B_{p−q} such that

span{B_{p−q}} = span{V_q}⊥.   (15)

When q = p − k, where k = rank(B) in (1), the CVE obtains the population span{B}. Alternatively, we could target B directly by maximizing the objective function L(V). The downside of this approach is that X either needs to be standardized, or the conditioning argument needs to be changed to X = s + P_{Σ_x^{-1}}(span{V})(X − s), where P_M(span{V}) is the orthogonal projection operator with respect to the inner product ⟨x, y⟩_M = xᵀMy. In either case, the inversion of Σ_x is required. Our choice of targeting the orthogonal complement avoids the inversion of Σ_x, and the estimation algorithm in Section 4 can be applied to regressions with p > n or p ≈ n, where n denotes the sample size. Additionally, targeting the complement has computational advantages: the dimension of the search space span{V_q}⊥ is p − q, which is smaller than the dimension of the direct target space in (15) when q = p − k for small k, which is the appropriate setting in a dimension reduction context.

Assume (Y_i, X_iᵀ)ᵀ, i = 1, ..., n, is an independent, identically distributed sample from model (1). For V ∈ S(p, q) and s ∈ R^p, we define

d_i(V, s) = ‖X_i − P_{s+span{V}} X_i‖² = ‖X_i − s‖² − ⟨X_i − s, VVᵀ(X_i − s)⟩ = ‖(I_p − VVᵀ)(X_i − s)‖² = ‖P_U(X_i − s)‖²,   (16)
where ⟨·, ·⟩ is the usual inner product in R^p, P_V = VVᵀ and P_U = I_p − P_V, using the orthogonal decomposition given by (5).

Let h_n ∈ R₊ be a sequence of bandwidths, and call the set

S_{s,V} = {x ∈ R^p : ‖x − P_{s+span{V}} x‖² ≤ h_n}

a slice that depends on both the shifting point s and the matrix V. h_n represents the squared width of a slice around the subspace s + span{V} and fulfills the following assumptions.

(H.1). h_n → 0 as n → ∞.

(H.2). n h_n^{(p−q)/2} → ∞ as n → ∞.

Remark.
For obtaining the consistency of the proposed estimator, (H.2) will be strengthened to log(n)/(n h_n^{(p−q)/2}) → 0.

Let K be a function satisfying the following assumptions.

(K.1). K : [0, ∞) → [0, ∞) is a non-increasing and continuous function with |K(z)| ≤ M₁ < ∞ and ∫_{R^q} K(‖r‖²) dr < ∞ for q ≤ p − 1.

(K.2). There exist positive finite constants L₁ and L₂ such that the kernel K satisfies one of the following:
(1) K(u) = 0 for |u| > L₂ and |K(u) − K(ũ)| ≤ L₁|u − ũ| for all u, ũ;
(2) K(u) is differentiable with |∂_u K(u)| ≤ L₁ and, for some ν > 1, |∂_u K(u)| ≤ L₁|u|^{−ν} for |u| > L₂.

Examples of functions that satisfy (K.1) and (K.2) include the Gaussian, K(z) = c exp(−z²/2), the exponential, K(z) = c exp(−z), and the squared Epanechnikov kernel, K(z) = c max{(1 − z²)², 0} (i.e. polynomial kernels), where c is a constant. The rectangular kernel, K(z) = c 1(z ≤ 1), does not fulfill the assumptions but will be mentioned for intuitive explanations. A list of further kernel functions is given in [28, Table 1].

L(V) and its uniform convergence

Definition. For i = 1, . . . , n, we define

w_i(V, s) = K(d_i(V, s)/h_n) / Σ_{j=1}^n K(d_j(V, s)/h_n).   (17)

Definition.
The sample based estimate of L̃(V, s) is defined as

L̃_n(V, s) = Σ_{i=1}^n w_i(V, s)(Y_i − ȳ₁(V, s))² = ȳ₂(V, s) − ȳ₁(V, s)²,   (18)

where ȳ_l(V, s) = Σ_{i=1}^n w_i(V, s) Y_i^l, l = 1, 2.

Definition.
The estimate of the objective function L(V) in (7) is defined as

L_n(V) = (1/n) Σ_{i=1}^n L̃_n(V, X_i),   (19)

where each data point X_i is a shifting point.

To obtain insight into the choice of L̃_n(V, s) in (18), consider the rectangular kernel K(z) = 1{z ≤ 1}. In this case, L̃_n(V, s) computes the empirical variance of the Y_i's corresponding to the X_i's that are no further than √h_n away from the affine space s + span{V}, i.e., d_i(V, s) = ‖X_i − P_{s+span{V}} X_i‖² ≤ h_n. If a smooth kernel is used, such as the Gaussian in our simulation studies, then L̃_n(V, s) is also smooth, which allows the computation of the gradients required to solve the optimization problem.
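The following R sketch evaluates the sample objective in (16)–(19) with the Gaussian kernel. The function name cve_objective and the simple looped layout are illustrative choices, not the implementation that accompanies the paper.

```r
# Sample objective L_n(V) of (19), using the Gaussian kernel K(z) = exp(-z^2 / 2).
# X: n x p matrix, Y: length-n response, V: p x q semi-orthogonal matrix, h: bandwidth h_n.
cve_objective <- function(V, X, Y, h) {
  n <- nrow(X)
  Ltilde <- numeric(n)
  for (i in seq_len(n)) {                       # every X_i acts as a shifting point s
    D  <- sweep(X, 2, X[i, ])                   # rows are X_j - X_i
    d  <- rowSums(D^2) - rowSums((D %*% V)^2)   # squared distances d_j(V, X_i) as in (16)
    w  <- exp(-(d / h)^2 / 2)                   # kernel weights K(d_j / h_n)
    w  <- w / sum(w)                            # normalised weights w_j(V, X_i) of (17)
    Ltilde[i] <- sum(w * Y^2) - sum(w * Y)^2    # tilde L_n(V, X_i) of (18)
  }
  mean(Ltilde)                                  # L_n(V) of (19)
}
```

Minimizing this function over V ∈ S(p, q) with q = p − k and taking the orthogonal complement of the minimizer yields the estimate of span{B}.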
In Theorem 3 we state the conditions under which L_n(V) in (19) converges uniformly to its population counterpart in (7). This result will lead to the consistency of our estimator.

Theorem 3. Let ã_n = log(n)/n. Under (A.1), (A.2), (A.3), (A.4), (K.1), (K.2), (H.1), a_n = log(n)/(n h_n^{(p−q)/2}) = o(1), and a_n/h_n^{(p−q)/2} = O(1),

sup_{V ∈ S(p,q)} |L_n(V) − L(V)| → 0 in probability as n → ∞.   (20)
Next we define the estimator we propose for span{B} in (1). Our main theoretical result follows in Theorem 4, which establishes the consistency of our estimator.

Definition.
The sample based
Conditional Variance Estimator B̂_{p−q} is any basis of span{V̂_q}⊥, where V̂_q = argmin_{V ∈ S(p,q)} L_n(V).
Under (A.1), (A.2), (A.3), (A.4), (K.1), (K.2), (H.1), a_n = log(n)/(n h_n^{(p−q)/2}) = o(1), and a_n/h_n^{(p−q)/2} = O(1), span{B̂_k} is a consistent estimator of span{B} in model (1); i.e., ‖P_{B̂_k} − P_B‖ → 0 in probability as n → ∞.

The set of points {x ∈ R^p : ‖x − P_{s+span{V}} x‖² ≤ h_n} represents a slice in R^p about the affine subspace s + span{V}. In the estimation of L(V) two different weighting schemes are used:

(a) Within a slice. The weights are defined in (17) and are used to calculate (18).
(b)
Between slices. Equal weights 1/n are used to calculate (19).

The choice of weights can potentially be influential. In particular, the between-slice weighting scheme can be refined by assigning more weight to slices with more points. This can be realized by altering (19) to

L_n^{(w)}(V) = Σ_{i=1}^n w̃(V, X_i) L̃_n(V, X_i),   with   (21)

w̃(V, X_i) = [Σ_{j=1}^n K(d_j(V, X_i)/h_n) − 1] / [Σ_{l,u=1}^n K(d_l(V, X_u)/h_n) − n] = Σ_{j=1, j≠i}^n K(d_j(V, X_i)/h_n) / Σ_{l,u=1, l≠u}^n K(d_l(V, X_u)/h_n).   (22)

For example, if a rectangular kernel is used, Σ_{j=1, j≠i}^n K(d_j(V, X_i)/h_n) is the number of points X_j (j ≠ i) in the slice corresponding to L̃_n(V, X_i). Therefore this slice gets higher weight if the number of X_j points in it is larger; that is, the more observations we use for estimating L̃(V, X_i), the better its accuracy. The denominator in (22) guarantees that the weights w̃(V, X_i) sum up to one.
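As a small illustration of (22), the between-slice weights can be computed from the n × n matrix of kernel evaluations; the name Kmat below is an assumed placeholder for the matrix with entries K(d_j(V, X_i)/h_n), rows indexed by the shifting point i.

```r
# Between-slice weights (22) from a kernel matrix Kmat with Kmat[i, j] = K(d_j(V, X_i) / h_n).
between_weights <- function(Kmat) {
  num <- rowSums(Kmat) - diag(Kmat)       # sum over j != i for each shifting point i
  num / (sum(Kmat) - sum(diag(Kmat)))     # normalise so that the weights sum to one
}
```

With the Gaussian kernel, diag(Kmat) = K(0) = 1, so this coincides with the first expression in (22).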
The performance of conditional variance estimation depends crucially on the choice of the bandwidth sequence h_n, which controls the bias-variance trade-off if the mean squared error is used as the measure of accuracy: the smaller h_n, the lower the bias and the higher the variance, and vice versa. Furthermore, the choice of h_n depends on p, q, the sample size n, and the distribution of X. We assume throughout that the bandwidth satisfies assumptions (H.1) and (H.2). We will use Lemma 5 to derive the data-driven bandwidth we use in the computation of our estimator.

Lemma 5. Let M be a p × p positive definite matrix. Then,

tr(M)/p = argmin_{s>0} ‖M − s I_p‖².   (23)

Proof.
Let U be the p × p matrix whose columns are the eigenvectors of M corresponding to its eigenvalues λ₁ ≥ . . . ≥ λ_p > 0. Then M = U diag(λ₁, ..., λ_p) Uᵀ, which implies ‖M − s I_p‖² = ‖diag(λ₁, ..., λ_p) − s I_p‖² = Σ_{l=1}^p (λ_l − s)². Taking the derivative with respect to s, setting it to 0 and solving for s obtains (23), since Σ_{l=1}^p λ_l = tr(M).

If the predictors are multivariate normal, their joint density is approximated by that of N(µ_X, σ² I_p) by Lemma 5, with σ² = tr(Σ_x)/p. This removes the dependence of the bandwidth on V and leads to a rule for bandwidth selection, as follows. Under X ∼ N_p(µ_X, σ² I_p), X̃_i = X_i − X_j ∼ N_p(0, 2σ² I_p) for i ≠ j, where we suppress the dependence on j for notational convenience. Since all data points are used as shifting points, d_i(V, X_j) = ‖X_i − X_j‖² − (X_i − X_j)ᵀVVᵀ(X_i − X_j) = ‖X̃_i‖² − X̃_iᵀVVᵀX̃_i. Let

nObs = E(#{i ∈ {1, ..., n} : X̃_i ∈ span_h{V}}) = 1 + (n − 1) P(d₁(V, X) ≤ h) = 1 + (n − 1) P(‖X̃‖² − X̃ᵀVVᵀX̃ ≤ h),   (24)
where span_h{V} = {x ∈ R^p : ‖x − P_{span{V}} x‖² ≤ h} and X̃ = X − X*, with X* an independent copy of X. nObs is the expected number of points in a slice. Given a user specified value for nObs, h is the solution to (24).

Let x ∈ R^p. For any V ∈ S(p, q) in (3), there exists an orthonormal basis U ∈ R^{p×(p−q)} of span{V}⊥ such that x = V r₁ + U r₂, by (5). Then X̃ = V R₁ + U R₂ with R₁ = VᵀX̃ ∼ N(0, 2σ² I_q), R₂ = UᵀX̃ ∼ N(0, 2σ² I_{p−q}), X̃ᵀVVᵀX̃ = ‖R₁‖² and ‖X̃‖² = ‖R₁‖² + ‖R₂‖². Therefore,

P(‖X̃‖² − X̃ᵀVVᵀX̃ ≤ h) = P(‖R₂‖² ≤ h) = χ²_{p−q}(h/(2σ²)),   (25)

where χ²_{p−q} is the cumulative distribution function of a chi-squared random variable with p − q degrees of freedom. Plugging (25) in (24) obtains

nObs = 1 + (n − 1) χ²_{p−q}(h/(2σ²)).   (26)

Solving (26) for h and Lemma 5 yield

h_n(nObs) = χ²_{p−q}^{-1}((nObs − 1)/(n − 1)) · 2 tr(Σ̂_x)/p,   (27)

where Σ̂_x = Σ_i (X_i − X̄)(X_i − X̄)ᵀ/n and X̄ = Σ_i X_i/n.

In order to ascertain that h_n satisfies (H.1) and (H.2), a reasonable choice is to set nObs = γ(n) for a function γ(·) with γ(n) → ∞, γ(n)/n ≤ 1 and γ(n)/n → 0. For example, nObs = γ(n) = n^β with β ∈ (0, 1) can be used. Alternatively, a plug-in bandwidth based on rule-of-thumb rules of the form c s n^{−1/(4+k)}, where s is an estimate of scale and c a number close to 1, such as Silverman's (c = 1.06, s = standard deviation) or Scott's (c = 1, s = standard deviation), used in nonparametric density estimation [see [29]], is

h_n = 1.2 · (2 tr(Σ̂_x)/p) · n^{−1/(4+p−q)}.   (28)

The term 2 tr(Σ̂_X)/p can be interpreted as the variance of X_i − X_j, and p − q is the true dimension k. We use c = 1.2 based on empirical evidence from simulations. Since both (27) and (28) yield satisfactory results, we opted against cross validation for bandwidth selection because of the computational burden involved, and used the bandwidth in (28) in simulations and data analyses.

A Stiefel manifold optimization algorithm is used to obtain the solution of the sample version of the optimization problem (13). To calculate V̂_q in Section 3.2, a curvilinear search is carried out [33, 31], which is similar to gradient descent. First an arbitrary starting value V^(0) is selected by drawing a p × q matrix from the invariant measure, i.e., the distribution that corresponds to the uniform on S(p, q); see [8]. The Q-component of the QR decomposition of a p × q matrix with independent standard normal entries follows the invariant measure [7]. The step size τ > 0, the step size reduction factor γ ∈ (0, 1), and the tolerance tol > 0 are fixed at the outset.

Algorithm 1: Curvilinear search
Result: V^(end)
Initialize: V^(0), τ = 1, tol = 10⁻³, γ = 0.5, error = tol + 1, maxit = 50, count = 0;
while error > tol and count ≤ maxit do
    G = ∇_V L_n(V^(j)) ∈ R^{p×q},  W = G V^(j)ᵀ − V^(j) Gᵀ
    V^(j+1) = (I_p + τ W)^{-1} (I_p − τ W) V^(j)
    error = ‖V^(j) V^(j)ᵀ − V^(j+1) V^(j+1)ᵀ‖ / √q
    if L_n(V^(j+1)) > L_n(V^(j)) then
        V^(j+1) ← V^(j);  τ ← γτ;  error ← tol + 1
    else
        count ← count + 1;  τ ← τ/γ
    end
end
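A minimal R sketch of the core update in Algorithm 1, the Cayley-type curvilinear step, is given below; the helper name cayley_step is illustrative, the gradient G is taken as given (from Theorem 6 or a numerical approximation), and the placeholder values only demonstrate that the update stays on the Stiefel manifold.

```r
# One curvilinear (Cayley) step on the Stiefel manifold S(p, q), as in Algorithm 1.
# V: current p x q point with t(V) %*% V = I_q; G: gradient at V; tau: step size.
cayley_step <- function(V, G, tau) {
  p <- nrow(V)
  W <- G %*% t(V) - V %*% t(G)                           # skew-symmetric p x p matrix
  solve(diag(p) + tau * W, (diag(p) - tau * W) %*% V)    # (I + tau W)^{-1} (I - tau W) V
}

set.seed(1)
p <- 6; q <- 2
V  <- qr.Q(qr(matrix(rnorm(p * q), p, q)))   # random start from the invariant measure
G  <- matrix(rnorm(p * q), p, q)             # placeholder gradient for illustration
V1 <- cayley_step(V, G, tau = 0.1)
max(abs(crossprod(V1) - diag(q)))            # ~ 1e-16: columns remain orthonormal
```

Because W is skew-symmetric, (I_p + τW)^{-1}(I_p − τW) is orthogonal, so the iterates never leave S(p, q) and no re-orthogonalization is needed.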
Under mild regularity conditions on the objective function, [33] showed that the sequence generated by the algorithm converges to a stationary point if the Armijo-Wolfe conditions [27] are used for determining the step size τ. The Armijo-Wolfe conditions require the evaluation of the gradient for each potential step size until one is found that fulfills the conditions and the step is accepted; i.e., for the determination of one step size the gradient has to be evaluated multiple times. Since for the conditional variance estimator the gradient computation incurs the highest computational cost, we use simpler conditions to determine the step size. Specifically, we simply require that the step decrease the objective function; otherwise the step size τ is decreased by the factor γ ∈ (0, 1). These simplified conditions are computationally less expensive and exhibit the same behavior as the Armijo-Wolfe conditions in our simulations. Further, we capped the maximum number of steps at maxit = 50, since the algorithm converged in about 10 iterations in all our simulations.

The algorithm is repeated for m arbitrary starting values V^(0) drawn from the invariant measure on S(p, q). Among those, the value at which L_n in (19) is minimal is selected as V̂_q.

The algorithm requires the computation of the gradient of L_n(V) in (19) or (21). We compute the gradient of the objective function for the Gaussian kernel in Theorems 6 and 7. The Gaussian kernel is the default kernel we use in the implementation of the estimation algorithm in the R code that accompanies this manuscript.

Theorem 6.
Let K(z) = exp(−z²/2) be the Gaussian kernel. Then the gradient of L̃_n(V, s) in (18) is given by

∇_V L̃_n(V, s) = (1/h_n²) Σ_{i=1}^n (L̃_n(V, s) − (Y_i − ȳ₁(V, s))²) w_i d_i ∇_V d_i(V, s) ∈ R^{p×q},

and the gradient of L_n(V) in (19) is

∇_V L_n(V) = (1/n) Σ_{i=1}^n ∇_V L̃_n(V, X_i),

with w_i = w_i(V, s) in (17) and d_i = d_i(V, s) in (16).

The weighted version of conditional variance estimation in Section 3.3 is expected to increase the accuracy of the estimator for unevenly spaced data. When (21) and the gradient in (29) are used in the optimisation algorithm, we refer to the estimator as weighted conditional variance estimation. If (21) and the gradient Σ_{i=1}^n w̃(V, X_i) ∇_V L̃_n(V, X_i) is used, i.e., the first summand in (29) is dropped, we refer to it as partially weighted conditional variance estimation. For both, we replace G in Algorithm 1 with the corresponding gradient derived in Theorem 7.

Theorem 7.
Let K(z) = exp(−z²/2) be the Gaussian kernel. Then the gradient of L_n^{(w)}(V) in (21) is given by

∇_V L_n^{(w)}(V) = Σ_{i=1}^n (∇_V w̃(V, X_i) L̃_n(V, X_i) + w̃(V, X_i) ∇_V L̃_n(V, X_i)),   (29)

where ∇_V L̃_n(V, X_i) is given in Theorem 6. Furthermore,

∇_V w̃(V, X_i) = −(1/h_n²) [ Σ_j (K_{j,i} / Σ_{l,u=1}^n K_{l,u}) d_{j,i} ∇_V d_{j,i} − w̃_i Σ_{l,u=1}^n (K_{l,u} / Σ_{o,s=1}^n K_{o,s}) d_{l,u} ∇_V d_{l,u} ],

with w̃_i = w̃(V, X_i) in (22), K_{j,i} = K(d_j(V, X_i)/h_n), and d_{j,i} = d_j(V, X_i) given in (16).

We explore how accurately the sample version (19) of the objective function estimates the target subspace in an example. We consider a bivariate normal predictor vector X = (X₁, X₂)ᵀ ∼ N(0, Σ_x). We generate the response from Y = g(BᵀX) + ε = X₁ + ε, with ε ∼ N(0, η²) independent of X. In this setting, k = 1, B = (1, 0)ᵀ and g(z) = z ∈ R in model (1). With these specifications, (10) becomes

µ_l(V, s) = ∫_R (Bᵀs + BᵀV r)^l f_{X | X ∈ s+span{V}}(r) dr.   (30)
Dropping the terms that do not contain r in (8) yields

f_{X | X ∈ s+span{V}}(r) ∝ f_X(s + V r) ∝ exp(−(1/2)(s + rV)ᵀ Σ_x^{-1}(s + rV)) ∝ exp(−(1/2)(2r VᵀΣ_x^{-1}s + r² VᵀΣ_x^{-1}V)) = exp(−(1/(2σ²))(2r σ² VᵀΣ_x^{-1}s + r²)) ∝ exp(−(1/(2σ²))(r − α)²),   (31)

where σ² = 1/(VᵀΣ_x^{-1}V), α = −σ² VᵀΣ_x^{-1}s, and the symbol ∝ stands for proportional to. Letting ψ(z) denote the density of a standard normal variable, (31) obtains

f_{X | X ∈ s+span{V}}(r) = (1/σ) ψ((r − α)/σ)   (32)

for V, s ∈ R^{2×1}. Inserting (32) in (30) yields

∫_R (Bᵀs + BᵀV r)^l (1/σ) ψ((r − α)/σ) dr = Bᵀs + BᵀV α for l = 1, and (Bᵀs)² + 2(Bᵀs)(BᵀV)α + (BᵀV)²(σ² + α²) for l = 2.

Using (9), (6) and (7) yields L̃(V, s) = µ₂(V, s) − µ₁(V, s)² + η² = (BᵀV)²σ² + η², so that

L(V) = E(L̃(V, X)) = (BᵀV)²σ² + η² = (BᵀV)²/(VᵀΣ_x^{-1}V) + η².   (33)

From (33) we can easily see that L(V) attains its minimum at V ⊥ B. Also, if Σ_x = I, the maximum of L(V) is attained at V = B. To visualize the behavior of L_n(V) as the sample size increases, we parametrize V by V(θ) = (cos(θ), sin(θ))ᵀ, θ ∈ [0, π]. Since B = (1, 0)ᵀ, the minimum of L(V) is at V(π/2) = (0, 1)ᵀ, which is orthogonal to B.

The true L(V(θ)) and its estimates L_n(V(θ)) are plotted for samples of different sizes n in Figure 1. L_n(V(θ)) approximates L(V(θ)) fast and attains its minimum at the same value as L(V) even for n = 10. As an aside, we note that assumption (A.4) is violated in this example, which suggests that the proposed conditional variance estimator may apply under weaker assumptions.

Figure 1: L_n(V(θ)) versus L(V(θ)), θ ∈ [0, π], for several sample sizes n. The solid black line is L(V(θ)) = cos²(θ) + 0.25, the colored lines are L_n(V(θ)) for n = 10, 50, 100, 500, and the vertical black line is at θ = π/2.
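Curves like those in Figure 1 can be approximated with a few lines of R, reusing the cve_objective sketch from the previous section; the seed, grid, and rule-(28)-style bandwidth below are illustrative choices.

```r
# Toy model of this section: p = 2, B = (1, 0)^T, Y = X_1 + eps with Var(eps) = 0.25.
set.seed(1)
n <- 100
X <- matrix(rnorm(2 * n), n, 2)                               # Sigma_x = I_2
Y <- X[, 1] + rnorm(n, sd = 0.5)
h <- 1.2 * 2 * sum(diag(cov(X))) / ncol(X) * n^(-1 / 5)       # bandwidth in the spirit of (28), k = 1

theta <- seq(0, pi, length.out = 100)
Ln <- sapply(theta, function(t) cve_objective(matrix(c(cos(t), sin(t)), 2, 1), X, Y, h))
plot(theta, Ln, type = "l", xlab = "theta", ylab = "Ln(V(theta))")
lines(theta, cos(theta)^2 + 0.25, lty = 2)                    # population objective L(V(theta)) from (33)
abline(v = pi / 2, col = "gray")                              # minimiser: V orthogonal to B
```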
We compare the estimation accuracy of conditional variance estimation with the forward model based sufficient dimension reduction methods mean outer product gradient estimation (meanOPG), mean minimum average variance estimation (meanMAVE) [32], refined outer product gradient (rOPG), refined minimum average variance estimation (rmave) [35, 22], and principal Hessian directions (pHd) [24, 15], and with the inverse regression based methods sliced inverse regression (
SIR ) [23] and sliced average variance estimation (
SAVE) [14]. The dimension k is assumed to be known throughout.

We report results for conditional variance estimation using the "plug-in" bandwidth in (28) and three different conditional variance estimation versions, CVE, wCVE, and rCVE. CVE is obtained by using m = 10 arbitrary starting values in the optimization algorithm and optimizing (19) as described in Section 4. rCVE, or refined weighted CVE, is obtained by setting the starting value V^(0) to the optimizer of CVE and using (21) in the optimization algorithm of Section 4 with the partially weighted gradient as described in Section 3.3. wCVE, or weighted CVE, is obtained by optimizing (21) with the partially weighted gradient as described in Sections 3.3 and 4. Methods rOPG and rmave refer to the original refined outer product gradient and refined minimum average variance estimation algorithms published in [35]. They are implemented using the R code in [22] with number of iterations nit = 25, since the algorithm is seen to converge by 25. The dr package is used for the SIR, SAVE and pHd calculations, and the
MAVE package for mean outer product gradient estimation (meanOPG) and mean minimum average variance estimation (meanMAVE). The source code for conditional variance estimation can be downloaded from https://git.art-ist.cc/daniel/CVE.

Table 1 lists the seven models (M1-M7) we consider. Throughout we set p = 20, with b₁ = (1, 1, 1, 1, 1, 1, 0, ..., 0)ᵀ/√6 and b₂ = (1, −1, 1, −1, 1, −1, 0, ..., 0)ᵀ/√6 ∈ R^p for M1-M5. For M6, b₁ = e₁, b₂ = e₂ and b₃ = e_p, and for M7, b₁, b₂, b₃ are the same as in M6 and b₄ = e₃, where e_j denotes the p-vector with jth element equal to 1 and all others 0. The error term ε is independent of X in all models. In M2, M3, M4, M5 and M6, ε ∼ N(0, 1). For M1 and M7, ε has a generalized normal distribution GN(a, b, c) with density f_ε(z) = c/(2bΓ(1/c)) exp(−(|z − a|/b)^c), see [26], with location 0 and shape parameter 0.5 for M1, and shape parameter 1 for M7 (Laplace distribution). For both, the scale parameter is chosen such that Var(ε) = 0.25.

Table 1: Models

  M1: Y = cos(b₁ᵀX) + ε                          X ∼ N_p(0, Σ)             ε ∼ GN, shape 0.5 (see text)   k = 1
  M2: Y = cos(b₁ᵀX) + 0.5ε                       X ∼ λZ1_p + N_p(0, I_p)   ε ∼ N(0, 1)                    k = 1
  M3: Y = 2 log(|b₁ᵀX| + 2) + 0.5ε               X ∼ N_p(0, I_p)           ε ∼ N(0, 1)                    k = 1
  M4: Y = (b₁ᵀX)/(0.5 + (1.5 + b₂ᵀX)²) + 0.5ε    X ∼ N_p(0, Σ)             ε ∼ N(0, 1)                    k = 2
  M5: Y = cos(πb₁ᵀX)(b₂ᵀX + 1)² + 0.5ε           X ∼ U([0, 1]^p)           ε ∼ N(0, 1)                    k = 2
  M6: Y = (b₁ᵀX)² + (b₂ᵀX)² + (b₃ᵀX)² + 0.5ε     X ∼ N_p(0, I_p)           ε ∼ N(0, 1)                    k = 3
  M7: Y = (b₁ᵀX)(b₂ᵀX)² + (b₃ᵀX)(b₄ᵀX) + ε       X ∼ t₃(I_p)               ε ∼ GN, shape 1 (see text)     k = 4

The variance-covariance structure of X in models M1 and M4 satisfies Σ_{i,j} = 0.5^{|i−j|} for i, j = 1, . . . , p. In M5, X is uniform with independent entries on the p-dimensional hypercube. In M7, X is multivariate t-distributed with 3 degrees of freedom. The link functions of M4 and M7 are studied in [35], but we use p = 20 instead of 10 and a non-identity covariance structure for M4, and the t-distribution instead of the normal for M7. In M2, Z ∼ 2 Bernoulli(p_mix) − 1 ∈ {−1, 1}, 1_p = (1, 1, ..., 1)ᵀ ∈ R^p, the mixing probability p_mix ∈ [0, 1], and the dispersion parameter λ > 0. For 0 < p_mix < 1, X has a mixture normal distribution, where p_mix is the relative mode height and λ is a measure of the distance between the modes.

We set q = p − k and generate r = 100 replications of models M1-M7. We estimate B using the ten sufficient dimension reduction methods. The accuracy of the estimates is assessed using err = ‖P_B − P_B̂‖/√(2k), which lies in the interval [0, 1]. The factor √(2k) normalizes the distance, since ‖P_B − P_B̂‖² ≤ 2k, with values closer to zero indicating better agreement and values closer to one indicating strong disagreement.

In Table 2 the mean and standard deviation of err for M1-M7 are reported; in particular, for M2, p_mix = 0.3 and λ = 1. The smallest error values are boldfaced. In models M1, M2 and M3, the conditional variance estimator is the best performer, with its refined version a close second. In M4, M5 and M6, any of the four versions of MAVE performs better than CVE. For model M7 the results of rOPG and rmave are not reported because the code frequently produces an error message that a matrix is not invertible. Among the rest, the weighted version of CVE, wCVE, attains the minimum error.
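For reference, the subspace distance err used throughout this section can be evaluated directly; the helper names below are illustrative.

```r
# err = ||P_B - P_Bhat||_F / sqrt(2k) for two p x k basis matrices B and Bhat.
proj <- function(M) M %*% solve(crossprod(M), t(M))   # projection onto span(M)
subspace_err <- function(B, Bhat) {
  k <- ncol(B)
  norm(proj(B) - proj(Bhat), type = "F") / sqrt(2 * k)
}
```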
Sliced inverse regression (
SIR ) and sliced average variance estimation (
SAVE) are not competitive throughout our experiments. Sliced inverse regression (
SIR), in particular, is expected to fail in models M1-M3 and M6, since E(Y | X) is an even function.

In Figure 2, box-plots of err for all combinations of p_mix ∈ {0.3, 0.4, 0.5} and λ ∈ {0, 0.5, 1, 1.5} are presented. The reference methods are restricted to meanOPG and meanMAVE, since the others are not competitive. Conditional variance estimation performs better than all competing methods and is the only method with consistently smaller errors when the two modes are further apart (λ ≥ 1), regardless of the mixing probability p_mix. The performance of both meanOPG and meanMAVE worsens as one moves from left to right row-wise. The mixing probability p_mix has no noticeable effect on the performance of any method; i.e., the plots are very similar column-wise. In sum, meanMAVE's performance deteriorates as the bimodality of the predictor distribution becomes more distinct. In contrast, conditional variance estimation is unaffected and appears to have an advantage over meanMAVE when the predictors have mixture distributions, the link function is even about the midpoint of the two modes, and B is not orthogonal to the line connecting the two modes. Conditional variance estimation is the only method that estimates the mean subspace reliably in model M2, whereas meanMAVE misses it completely (err ≈ 1). These results indicate that conditional variance estimation is often approximately on par with, and can perform much better than, meanMAVE depending on the predictor distribution and the link function.

Figure 2: M2, p = 20, n = 100.

Table 2: Mean and standard deviation of estimation errors (rows: M1-M7; columns: CVE, wCVE, rCVE, meanOPG, rOPG, meanMAVE, rmave, pHd, SIR, SAVE).

Furthermore, we estimate the dimension k via cross-validation, following the approach in [35], with

k̂ = argmin_{l=1,...,p} CV(l) = argmin_{l=1,...,p} (1/n) Σ_i (Y_i − ĝ_{−i}(B̂_lᵀX_i))²,   (34)

where ĝ_{−i}(·) is computed from the data (Y_j, B̂_lᵀX_j), j = 1, ..., n, j ≠ i, using multivariate adaptive regression splines [17] in the R package mda, and B̂_l = V̂_{p−l}⊥ is any basis of the orthogonal complement of V̂_{p−l} = argmin_{V ∈ S(p,p−l)} L_n(V). For a given l, we calculate B̂_l from the whole data set and predict Y_i by Ŷ_{i,l} = ĝ_{−i}(B̂_lᵀX_i). For l = p, B̂_p = I_p. The results for the seven models are reported in Table 3. The CVE based dimension estimation is the most accurate in models M1, M2, M3, and M6 and differs slightly from that of MAVE in M7. MAVE performs better in M4 and M5, completely misses the true dimension in M2 and misses it most of the time in M3. Thus, the dimension estimation performance of CVE and MAVE agrees with the estimation accuracy of the true subspace in Table 2; CVE estimates the dimension more accurately even in model M6, where it exhibits worse subspace estimation performance, and overall appears to be more accurate.

Table 3: Number of times dimension k is correctly estimated in 100 replications

        M1  M2  M3  M4  M5  M6  M7
CVE     83  41  88  62  46  74  19
MAVE    67   0  14  76  60  57  21

We carried out many simulation experiments for an array of combinations of link functions, sufficient reduction matrices B and their ranks, as well as predictor and error distributions. All reported and unreported results indicate that the
difference in performance of the two methods, CVE and meanMAVE, can be attributed to both the form of the link function and the marginal predictor distribution. We observed that when the link function had a bounded first order derivative, CVE often outperformed meanMAVE across predictor distributions. In the opposite case, MAVE mostly performed better. Also, when the predictors have a bimodal distribution with well separated modes and the link function is even, regardless of whether its derivative is bounded, CVE outperforms meanMAVE. In the other settings for the generated data, both methods were roughly on par.

Three data sets are analyzed: the
Hitters data in the R package ISLR , which was also analyzed by [35], the
BostonHousing data in the R package mlbench , and the Concrete data from the
MAVE package. The reference method is meanMAVE from the
MAVE package in R and the CVE is calculated using m = 50 and maxit = 10 in the optimizationalgorithm 1 in Section 4. The estimation of the dimension is based on (34) in Section 5.Following [35], we remove 7 outliers from the Hitters data set leading to a sample size of 256. The response is Y = log( salary ) and the 16 continuous predictors are the game statistics of players in the Major League Baseballleague in the seasons 1986 and 1987. Further information can be found in .The Boston Housing data set contains 506 census tracts on 14 variables from the 1970 census. The response is medv ,the median value of owner-occupied homes in USD 1000’s. The factor variable chas is removed from the data setfor the analysis so that the response is modeled by the remaining 12 continous predictors. The description of thevariables can be found in .The
Concrete data set contains 1030 instances on 9 continuous variables. The response is concrete compressive strength. Concrete strength is very important in civil engineering and is a highly nonlinear function of age and ingredients. The
description of the variables can be found in .

For all three data sets we standardize both the predictors and the response by subtracting the mean and rescaling column-wise so that each variable has unit variance. The data sets are analyzed using 10-fold cross-validation to calculate an unbiased estimate of the prediction error [30] for our method,
CVE , and its main competitor meanMAVE using the
MAVE package. The dimension for each method is estimated with (34) on the training set; we then fit a forward regression model on the training set, replacing the original with the reduced predictors, using multivariate adaptive regression splines [17] from the R package mda, and calculate the prediction error on the test set for both methods. The dimension estimates of CVE and
MAVE mostly disagree.

The mean and standard deviation of the 10-fold cross-validation prediction errors are reported in Table 4. Since the response is standardized, the values in Table 4 are bounded between 0 and 1, with smaller values indicating better predictive performance.
CVE performs slightly worse than mean
MAVE in the
Hitters data set, slightly better in the
Boston Housing and better in the
Concrete data set analysis.

Table 4: Mean and standard deviation (in parentheses) of standardized out-of-sample prediction errors for the three data sets

Method   Hitters         Housing         Concrete
CVE      0.216 (0.101)   0.260 (0.331)   0.361 (0.206)
MAVE     0.203 (0.083)   0.299 (0.382)   0.417 (0.348)
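A hypothetical R sketch of the cross-validation protocol described above follows. The function mars from the mda package is the forward model used in the paper; estimate_B is a stand-in for any SDR estimator returning a p × k basis (e.g., the CVE of Section 3) and is not a function from an existing package.

```r
library(mda)   # for mars() and its predict method

# 10-fold CV prediction error for a dimension-reduction estimator estimate_B(X, Y, k).
cv_prediction_error <- function(X, Y, k, estimate_B, folds = 10) {
  X <- scale(X); Y <- scale(Y)                              # standardise predictors and response
  id <- sample(rep(seq_len(folds), length.out = nrow(X)))   # random fold assignment
  errs <- numeric(folds)
  for (f in seq_len(folds)) {
    tr  <- id != f
    B   <- estimate_B(X[tr, , drop = FALSE], Y[tr], k)      # p x k reduction from training data only
    fit <- mars(X[tr, , drop = FALSE] %*% B, Y[tr])          # forward model on reduced predictors
    pred <- predict(fit, X[!tr, , drop = FALSE] %*% B)
    errs[f] <- mean((Y[!tr] - pred)^2)                       # out-of-sample squared error
  }
  c(mean = mean(errs), sd = sd(errs))
}
```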
Additionally, we reconstruct the analysis of the
Hitters data in [35], which does not account for the out-of-sample prediction error as in Section 6 but uses the whole sample for the estimation of B and its rank. Only the dimension k is estimated with leave-one-out cross validation.

Table 5 reports the average cross-validation mean squared error CV(k) in (34) using the whole data set over k = 1, . . . , p. Both conditional variance estimation and mean minimum average variance estimation estimate the dimension to be 2.

Table 5: Mean cross-validation error CV(k) over k (Hitters data).

For conditional variance estimation, the fitted forward regression model is

Ŷ = 0.… + …(b̂ᵀX) − …(b̂ᵀX) + 0.…(b̂ᵀX),   (35)

with R² = 0.…, and for minimum average variance estimation

Ŷ = 0.… + …(b̂ᵀX) − …(b̂ᵀX) + 0.…(b̂ᵀX),   (36)

with R² = 0.…. Both models (35) and (36) have about the same fit as measured by R². The in-sample performance of the two methods is practically the same for the Hitters data.

Figure 3: Y against b̂₁ᵀX and b̂₂ᵀX.
In this paper the novel conditional variance estimator (CVE) for the mean subspace is introduced. We present its geometrical and theoretical foundation, show its consistency and propose an estimation algorithm with assured
convergence. CVE requires that the forward model (1), Y = g(BᵀX) + ε, holds, and weak assumptions on the response and the covariates.

Minimum average variance estimation (MAVE) [35] is the only other sufficient dimension reduction method based on the forward model (1). It estimates the sufficient dimension reduction targeting both the reduction and the link function g in (1). CVE targets only the reduction and does not require estimation of the link function, which may explain why it has an advantage over MAVE in some regression settings. For example, CVE exhibits similar performance across different link functions (cos, exp, etc.) for fixed λ, whereas the performance of MAVE is very uneven for model M2 in Section 5. CVE is more accurate than MAVE when the link function is even and the predictor distribution is bimodal throughout our simulation studies. Moreover, CVE does not require the inversion of the predictor covariance matrix and can be applied to regressions with p ≈ n or p > n.

The theoretical challenge in deriving the statistical properties of conditional variance estimation arises from the novelty of its definition, which involves random non-i.i.d. weights that depend on the parameter to be estimated.

References

[1] Kofi P. Adragni and R. Dennis Cook. Sufficient dimension reduction and prediction in regression. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 367(1906):4385–4405, 2009.
[2] Takeshi Amemiya. Advanced Econometrics. Harvard University Press, 1985.
[3] S. N. Bernstein. Theory of Probability. Moscow, 1927.
[4] W. M. Boothby. An Introduction to Differentiable Manifolds and Riemannian Geometry. Academic Press, 2002.
[5] Efstathia Bura, Sabrina Duarte, and Liliana Forzani. Sufficient reductions in regressions with exponential family inverse predictors. Journal of the American Statistical Association, 111(515):1313–1329, 2016.
[6] Efstathia Bura and Liliana Forzani. Sufficient reductions in regressions with elliptically contoured inverse predictors. Journal of the American Statistical Association, 110(509):420–434, 2015.
[7] Yasuko Chikuse. Invariant measures on Stiefel manifolds with applications to multivariate analysis. Volume 24 of Lecture Notes–Monograph Series, pages 177–193. Institute of Mathematical Statistics, Hayward, CA, 1994.
[8] Yasuko Chikuse. Statistics on Special Manifolds. Springer-Verlag New York, New York, 2003.
[9] R. Dennis Cook. Regression Graphics: Ideas for Studying Regressions Through Graphics. Wiley, New York, 1998.
[10] R. D. Cook and L. Forzani. Principal fitted components for dimension reduction in regression. Statistical Science, 23(4):485–501, 2008.
[11] R. Dennis Cook. Fisher lecture: Dimension reduction in regression. Statistical Science, 22(1):1–26, 2007.
[12] R. Dennis Cook and Liliana Forzani. Likelihood-based sufficient dimension reduction. Journal of the American Statistical Association, 104(485):197–208, 2009.
[13] R. Dennis Cook and Bing Li. Determining the dimension of iterative Hessian transformation. The Annals of Statistics, 32(6):2501–2531, 2004.
[14] R. Dennis Cook and Sanford Weisberg. Sliced inverse regression for dimension reduction: Comment. Journal of the American Statistical Association, 86(414):328–332, 1991.
[15] R. Dennis Cook and Bing Li. Dimension reduction for conditional mean in regression. The Annals of Statistics, 30(2):455–474, 2002.
[16] Arnold M. Faden. The existence of regular conditional probabilities: Necessary and sufficient conditions. The Annals of Probability, 13(1):288–298, 1985.
[17] Jerome H. Friedman. Multivariate adaptive regression splines. The Annals of Statistics, 19(1):1–67, 1991.
[18] Bruce E. Hansen. Uniform convergence rates for kernel estimation with dependent data. Econometric Theory, 24:726–748, 2008.
[19] H. Heuser. Analysis 2, 9. Auflage. Teubner, 1995.
[20] Alan F. Karr. Probability. Springer Texts in Statistics. Springer-Verlag New York, 1993.
[21] D. Leao Jr., M. Fragoso, and P. Ruffino. Regular conditional probability, disintegration of probability and Radon spaces. Proyecciones (Antofagasta), 23:15–29, 2004.
[22] Bing Li. Sufficient Dimension Reduction: Methods and Applications with R. CRC Press, Taylor & Francis Group, 2018.
[23] K. C. Li. Sliced inverse regression for dimension reduction. Journal of the American Statistical Association, 86(414):316–327, 1991.
[24] Ker-Chau Li. On principal Hessian directions for data visualization and dimension reduction: Another application of Stein's lemma. Journal of the American Statistical Association, 87(420):1025–1039, 1992.
[25] Yanyuan Ma and Liping Zhu. A review on dimension reduction. International Statistical Review, 81(1):134–150, 2013.
[26] Saralees Nadarajah. A generalized normal distribution. Journal of Applied Statistics, 32(7):685–694, 2005.
[27] J. Nocedal and S. Wright. Line Search Methods, pages 30–65. Springer New York, New York, NY, 2006.
[28] E. Parzen. On estimation of a probability density function and mode. The Annals of Mathematical Statistics, 33(3):1065–1076, 1962.
[29] B. W. Silverman. Density Estimation for Statistics and Data Analysis. Chapman & Hall, London, 1986.
[30] M. Stone. Cross-validatory choice and assessment of statistical predictions. Journal of the Royal Statistical Society: Series B (Methodological), 36(2):111–133, 1974.
[31] Hemant D. Tagare. Notes on optimization on Stiefel manifolds. January 2011.
[32] Hang Weiqiang and Xia Yingcun. MAVE: Methods for Dimension Reduction, 2019. R package version 1.3.10.
[33] Zaiwen Wen and Wotao Yin. A feasible method for optimization with orthogonality constraints. Mathematical Programming, 142:397–434, 2013.
[34] Yingcun Xia. A multiple-index model and dimension reduction. Journal of the American Statistical Association, 103(484):1631–1640, 2008.
[35] Yingcun Xia, Howell Tong, W. K. Li, and Li-Xing Zhu. An adaptive estimation of dimension reduction space. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 64(3):363–410, 2002.
[36] Xiangrong Yin, Bing Li, and R. Dennis Cook. Successive direction extraction for estimating the central subspace in a multiple-index regression. Journal of Multivariate Analysis, 99:1733–1757, 2008.
Appendix

Justification for (8): Theorem 3.1 of [21] and the fact that (R^p, B(R^p)), where B(R^p) denotes the Borel sets on R^p, is a Polish space guarantee the existence of the regular conditional probability of X | X ∈ s + span{V} [see also [16]]. Further, the measure is concentrated on the affine subspace s + span{V} ⊂ R^p and is given by (8), by Definition 8.38 and Theorem 8.39 of [20] and the orthogonal decomposition (5).

Proof of (9): Since X and ε in (1) are assumed to be independent, Var(Y | X ∈ s + span{V}) = Var(g(BᵀX) | X ∈ s + span{V}) + Var(ε). Using (8) and Var(Y | Z) = E(Y² | Z) − E(Y | Z)², we obtain (9).

We let g̃(V, s, r₁) = g(Bᵀs + BᵀV r₁)^l f_X(s + V r₁). The parameter integral (11) is well defined and continuous if (1) g̃(V, s, ·) is integrable for all V ∈ S(p, q), s ∈ supp(f_X), (2) g̃(·, ·, r₁) is continuous for all r₁, and (3) there exists an integrable dominating function of g̃ that does not depend on V and s [see [19, p. 101]]. Furthermore, t^{(l)}(V, s) = ∫_K g̃(V, s, r₁) dr₁ for some compact set K, since supp(f_X) is compact due to (A.4). The function g̃(V, s, r₁) is continuous in all inputs by the continuity of g and f_X by (A.2), and therefore it attains a maximum. In consequence, all three conditions are satisfied, so that t^{(l)}(V, s) is well defined and continuous. Next, µ_l(V, s) = t^{(l)}(V, s)/t^{(0)}(V, s) is continuous since t^{(0)}(V, s) > 0 for all s ∈ supp(f_X) by the continuity of f_X and Σ_x > 0. Then L̃(V, s) in (9) is continuous, which results in L(V) also being well defined and continuous by virtue of it being a parameter integral, following the same arguments as above.

Next we establish the consistency of the conditional variance estimator. The uniform convergence in probability of the sample objective function in (19) is a sufficient condition for the consistency of V̂_q = argmin_{V ∈ S(p,q)} L_n(V), as uniform convergence in probability of a random function implies convergence in probability of the minimizer of L_n(V) to the minimizer of the limit function. Let

t_n^{(l)}(V, s) = (1/(n h_n^{(p−q)/2})) Σ_{i=1}^n K(d_i(V, s)/h_n) Y_i^l   (37)

be the sample version of (11) for l = 0, 1, 2. The summands of L̃_n in (18) can be expressed as

ȳ_l(V, s) = t_n^{(l)}(V, s)/t_n^{(0)}(V, s).   (38)

Before we start with the proof, a few auxiliary lemmas are shown.

Lemma 8.
Assume (A.4) and (K.1) hold. Let Z_n(V, s) = (Σ_i g(X_i)^l K(d_i(V, s)/h_n))/(n h_n^{(p−q)/2}) for a continuous function g. Then,

E(Z_n(V, s)) = ∫_{supp(f_X) ∩ R^{p−q}} K(‖r₂‖²) ∫_{supp(f_X) ∩ R^q} g̃(r₁, h_n^{1/2} r₂) dr₁ dr₂,

where g̃(r₁, r₂) = g(s + V r₁ + U r₂)^l f_X(s + V r₁ + U r₂) and x = s + V r₁ + U r₂ as in (5).

Proof of Lemma 8.
By (5), ‖P_U(x − s)‖² = ‖U r₂‖² = ‖r₂‖². Further,

E(Z_n(V, s)) = (1/h_n^{(p−q)/2}) ∫_{supp(f_X)} g(x)^l K(‖P_U(x − s)‖²/h_n) f_X(x) dx
            = (1/h_n^{(p−q)/2}) ∫_{supp(f_X) ∩ R^{p−q}} ∫_{supp(f_X) ∩ R^q} g(s + V r₁ + U r₂)^l K(‖r₂‖²/h_n) f_X(s + V r₁ + U r₂) dr₁ dr₂
            = ∫_{supp(f_X) ∩ R^{p−q}} K(‖r₂‖²) ∫_{supp(f_X) ∩ R^q} g(s + V r₁ + h_n^{1/2} U r₂)^l f_X(s + V r₁ + h_n^{1/2} U r₂) dr₁ dr₂,

where the substitution r̃₂ = r₂/h_n^{1/2}, dr₂ = h_n^{(p−q)/2} dr̃₂, was used to obtain the last equality.
Lemma 9.
Assume (A.1), (A.2), (A.3), (A.4), (H.1) and (K.1) hold. For all δ > 0 there exist an n* and finite constants b̃_{u,m}, u ∈ {0, 1, 2, 3, 4} and m ∈ {1, 2}, such that

(b̃_{2l,2} − δ)/(n h_n^{(p−q)/2}) − ((b̃_{l,1})² + δ)/n ≤ Var(t_n^{(l)}(V, s)) ≤ (b̃_{2l,2} + δ)/(n h_n^{(p−q)/2}) − ((b̃_{l,1})² − δ)/n

for n > n* and t_n^{(l)}(V, s), l = 0, 1, 2, in (37).

Proof of Lemma 9.
From (1) and the binomial formula, Y_i^l = (g_i + ε_i)^l = Σ_{u=0}^l (l choose u) g_i^{l−u} ε_i^u with g_i = g(BᵀX_i). For m ∈ {1, 2} and l ∈ {0, 1, ..., 4}, using the independence of X_i from ε_i, we obtain

E(Y_i^l K^m(d_i(V, s)/h_n)) = Σ_{u=0}^l (l choose u) E(g_i^{l−u} K^m(d_i(V, s)/h_n)) E(ε_i^u).   (39)

Setting Z_n(V, s) = (1/(n h_n^{(p−q)/2})) Σ_i g(X_i)^{l−u} K̃(d_i(V, s)/h_n) in Lemma 8, where K̃(z) = K^m(z) fulfills (K.1) for m = 1, 2, we obtain E(g_i^{l−u} K^m(d_i(V, s)/h_n)) = h_n^{(p−q)/2} E(Z_n); that is, if a kernel satisfies (K.1), its square also satisfies (K.1). Since the integrals are over compact sets by (A.4), by the dominated convergence theorem and Lemma 8 it holds that

E(Z_n) = ∫_{supp(f_X) ∩ R^{p−q}} K̃(‖r₂‖²) ∫_{supp(f_X) ∩ R^q} g̃(r₁, h_n^{1/2} r₂) dr₁ dr₂ = b_{l−u,m}^n   (40)
       → b_{l−u,m} = ∫_{supp(f_X) ∩ R^{p−q}} K̃(‖r₂‖²) dr₂ ∫_{supp(f_X) ∩ R^q} g̃(r₁, 0) dr₁   as n → ∞,   (41)

using that g̃(r₁, r₂) = g(Bᵀs + BᵀV r₁ + BᵀU r₂)^{l−u} f_X(s + V r₁ + U r₂) is continuous by (A.2) and h_n → 0 by assumption (H.1).

Assumption (A.3) implies E(ε_i⁴) < ∞ for i = 1, . . . , n. From (40) and (39), we obtain

E(Y_i^l K^m(d_i(V, s)/h_n))/h_n^{(p−q)/2} = Σ_{u=0}^l (l choose u) b_{l−u,m}^n E(ε_i^u) = b̃_{l,m}^n   (42)
       → Σ_{u=0}^l (l choose u) b_{l−u,m} E(ε_i^u) = b̃_{l,m} < ∞   as n → ∞.

By (42), we have for l = 0, 1, 2

Var(t_n^{(l)}(V, s)) = (1/(n h_n^{p−q})) Var(Y^l K(d₁(V, s)/h_n)) = b̃_{2l,2}^n/(n h_n^{(p−q)/2}) − (b̃_{l,1}^n)²/n,

since (Y_i, X_iᵀ), i = 1, ..., n, are independent draws from the joint distribution of (Y, X). This completes the proof, since b̃_{u,m}^n → b̃_{u,m} < ∞ for u ∈ {0, 1, ..., 4} and m ∈ {1, 2}.

Next we show in Lemma 10 that d_i(V, s) in (16) is Lipschitz in its inputs under assumption (A.4).

Lemma 10.
Next we show in Lemma 10 that $d_i(V,s)$ in (16) is Lipschitz in its inputs under assumption (A.4).

Lemma 10. Under assumption (A.4) there exists a constant $0 < C < \infty$ such that for all $\delta > 0$ and $V, V_j \in \mathcal{S}(p,q)$ with $\|P_V - P_{V_j}\| < \delta$, and for all $s, s_j \in \operatorname{supp}(f_X) \subset \mathbb{R}^p$ with $\|s - s_j\| < \delta$,
\[
|d_i(V,s) - d_i(V_j,s_j)| \le C\delta
\]
for $d_i(V,s)$ given by (16).

Proof of Lemma 10.
\[
|d_i(V,s) - d_i(V_j,s_j)| \le \big|\,\|X_i - s\|^2 - \|X_i - s_j\|^2\big| + \big|\langle X_i - s, P_V(X_i - s)\rangle - \langle X_i - s_j, P_{V_j}(X_i - s_j)\rangle\big| = I_1 + I_2 \tag{43}
\]
where $\langle\cdot,\cdot\rangle$ is the scalar product on $\mathbb{R}^p$. For the first term on the right hand side of (43),
\[
I_1 = \big|\,\|X_i - s\|^2 - \|X_i - s_j\|^2\big| \le 2|\langle X_i, s - s_j\rangle| + \big|\,\|s\|^2 - \|s_j\|^2\big| \le 2\|X_i\|\|s - s_j\| + 2C_1\|s - s_j\| \le 2C_1\delta + 2C_1\delta = 4C_1\delta
\]
by Cauchy--Schwarz and the reverse triangle inequality (i.e. $\big|\,\|s\|^2 - \|s_j\|^2\big| = \big|\,\|s\| - \|s_j\|\big|\,(\|s\| + \|s_j\|) \le \|s - s_j\|\, 2C_1$), and $\|X_i\| \le \sup_{z\in\operatorname{supp}(f_X)}\|z\| = C_1 < \infty$ with probability 1 due to (A.4). The second term in (43) satisfies
\begin{align*}
I_2 &\le \big|\langle X_i, (P_V - P_{V_j})X_i\rangle\big| + 2\big|\langle X_i, P_V s - P_{V_j}s_j\rangle\big| + \big|\langle s, P_V s\rangle - \langle s_j, P_{V_j}s_j\rangle\big| \\
&\le \|X_i\|^2\|P_V - P_{V_j}\| + 2\|X_i\|\,\big\|P_V(s - s_j) + (P_V - P_{V_j})s_j\big\| + |\langle s - s_j, P_V s\rangle| + \big|\langle s_j, P_V s - P_{V_j}s_j\rangle\big| \\
&\le C_1^2\delta + 2C_1(\delta + C_1\delta) + C_1\delta + C_1(\delta + C_1\delta) = 4C_1^2\delta + 4C_1\delta.
\end{align*}
Collecting all constants into $C$ (i.e. $C = 8C_1 + 4C_1^2$) yields the result.

The proofs of Theorems 3 and 11 require the Bernstein inequality [3]: Let $Z_1, Z_2, \ldots$ be an independent sequence of bounded random variables, $|Z_i| \le b$. Let $S_n = \sum_{i=1}^n Z_i$, $E_n = E(S_n)$ and $V_n = \operatorname{Var}(S_n)$. Then,
\[
P(|S_n - E_n| > t) < 2\exp\left(\frac{-t^2/2}{V_n + bt/3}\right). \tag{44}
\]
Furthermore, the proof of Theorem 11 requires assumption (K.2), which obtains
\[
|K(u) - K(u')| \le K^*(u')\,\delta \tag{45}
\]
for all $u, u'$ with $|u - u'| < \delta \le L$, where $K^*(\cdot)$ is a bounded and integrable kernel function [see [18]]. Specifically, if condition (1) of (K.2) holds, then $K^*(u) = L\,1_{\{|u|\le L\}}$. If (2) holds, then $K^*(u) = L\,1_{\{|u|\le L\}} + 1_{\{|u| > L\}}|u - L|^{-\nu}$.
Let $A = \mathcal{S}(p,q)\times\operatorname{supp}(f_X)$ and, by a slight abuse of notation, we generically denote constants by $C$. In Theorems 11 and 12 we show that the variance and bias terms of (37) vanish uniformly in probability, respectively.
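Before stating Theorem 11, the following small simulation illustrates the Bernstein inequality (44) itself: for bounded, centered, independent summands the empirical tail of $S_n$ stays below the bound $2\exp\{-(t^2/2)/(V_n + bt/3)\}$. The choice of Uniform$(-b,b)$ summands is ours and serves only as an illustration; it is not the specific $Z_i$ used in the proofs.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(2)
n, b, reps = 400, 1.0, 20000
Z = rng.uniform(-b, b, size=(reps, n))   # independent, mean-zero summands with |Z_i| <= b
S = Z.sum(axis=1)                        # S_n, so E_n = 0 here
V_n = n * (2 * b)**2 / 12                # Var(S_n) for Uniform(-b, b) summands

for t in (20.0, 30.0, 40.0):
    empirical = np.mean(np.abs(S) > t)
    bound = 2 * np.exp(-(t**2 / 2) / (V_n + b * t / 3))
    print("t=%g  empirical=%.5f  Bernstein bound=%.5f" % (t, empirical, bound))
\end{verbatim}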
Theorem 11. Under (A.1), (A.2), (A.3), (A.4), (K.1), (K.2), $a_n = \big(\log(n)/(n h_n^{(p-q)/2})\big)^{1/2} = o(1)$ and $a_n/h_n^{(p-q)/2} = O(1)$,
\[
\sup_{V\times s\in A}\Big|t_n^{(l)}(V,s) - E\big(t_n^{(l)}(V,s)\big)\Big| = O_P(a_n) \quad\text{for } l = 0,1,2. \tag{46}
\]

Remark. If we assume $|Y| < M < \infty$ almost surely, the requirement $a_n/h_n^{(p-q)/2} = O(1)$ for the bandwidth can be dropped and the truncation step of the proof of Theorem 11 can be skipped.

Proof of Theorem 11. The proof is organized in 3 steps: a truncation step, a discretization step by covering $A = \mathcal{S}(p,q)\times\operatorname{supp}(f_X)$, and application of Bernstein's inequality (44).
We let $\tau_n = a_n^{-1}$ and truncate $Y_i^l$ by $\tau_n$ as follows. We let
\[
t_{n,\mathrm{trc}}^{(l)}(V,s) = \frac{1}{n h_n^{(p-q)/2}}\sum_i K\big(\|P_U(X_i - s)\|^2/h_n\big)\, Y_i^l\, 1_{\{|Y_i|^l\le\tau_n\}} \tag{47}
\]
be the truncated version of (37) and $\tilde R_n^{(l)} = (1/(n h_n^{(p-q)/2}))\sum_i |Y_i|^l 1_{\{|Y_i|^l > \tau_n\}}$ be the remainder of (37). Therefore, $R_n^{(l)}(V,s) = t_n^{(l)}(V,s) - t_{n,\mathrm{trc}}^{(l)}(V,s) \le M\tilde R_n^{(l)}$ due to (K.1), and
\[
\sup_{V\times s\in A}\Big|t_n^{(l)}(V,s) - E\big(t_n^{(l)}(V,s)\big)\Big| \le M\big(\tilde R_n^{(l)} + E\tilde R_n^{(l)}\big) + \sup_{V\times s\in A}\Big|t_{n,\mathrm{trc}}^{(l)}(V,s) - E\big(t_{n,\mathrm{trc}}^{(l)}(V,s)\big)\Big|. \tag{48}
\]
By Cauchy--Schwarz and the Markov inequality, $P(|Z| > t) = P(Z^2 > t^2) \le E(Z^2)/t^2$, we obtain
\[
E\tilde R_n^{(l)} = \frac{1}{h_n^{(p-q)/2}}E\big(|Y_i|^l 1_{\{|Y_i|^l>\tau_n\}}\big) \le \frac{1}{h_n^{(p-q)/2}}\sqrt{E(|Y_i|^{2l})}\,\sqrt{P(|Y_i|^l>\tau_n)} \le \frac{1}{h_n^{(p-q)/2}}\sqrt{E(|Y_i|^{2l})}\left(\frac{E(|Y_i|^{2l})}{\tau_n^{2}}\right)^{1/2} = o(a_n) \tag{49}
\]
where the last equality uses the assumption $a_n/h_n^{(p-q)/2} = O(1)$ and the expectations are finite due to (A.3) for $l = 0,1,2$. Obviously, no truncation is needed for $l = 0$.
Therefore, the first two terms of the right hand side of (48) converge to 0 with rate $a_n$ by (49) and Markov's inequality. From now to the end of the proof, $Y_i$ will denote the truncated version $Y_i 1_{\{|Y_i|\le\tau_n\}}$ and we do not distinguish the truncated from the untruncated $t_n^{(l)}(V,s)$ since this truncation results in an error of magnitude $a_n$.
For the discretization step we cover the compact set $A = \mathcal{S}(p,q)\times\operatorname{supp}(f_X)$ by finitely many balls, which is possible by (A.4) and the compactness of $\mathcal{S}(p,q)$. Let $\delta_n = a_n h_n$ and $A_j = \{V: \|P_V - P_{V_j}\| \le \delta_n\}\times\{s: \|s - s_j\| \le \delta_n\}$ be a cover of $A$ with ball centers $V_j\times s_j$. Then, $A \subset \bigcup_{j=1}^N A_j$ and the number of balls can be bounded by $N \le C\delta_n^{-d}\delta_n^{-p}$ for some constant $C\in(0,\infty)$, where $d = \dim(\mathcal{S}(p,q)) = pq - q(q+1)/2$. Let $V\times s\in A_j$. Then by Lemma 10 there exists $0 < C < \infty$ such that
\[
|d_i(V,s) - d_i(V_j,s_j)| \le C\delta_n \tag{50}
\]
holds for $d_i$ in (16). Under (K.2), which implies (45), inequality (50) yields
\[
\left|K\left(\frac{d_i(V,s)}{h_n}\right) - K\left(\frac{d_i(V_j,s_j)}{h_n}\right)\right| \le K^*\left(\frac{d_i(V_j,s_j)}{h_n}\right) C a_n \tag{51}
\]
for $V\times s\in A_j$ and $K^*(\cdot)$ an integrable and bounded function.
Define $r_n^{(l)}(V_j,s_j) = (1/(n h_n^{(p-q)/2}))\sum_{i=1}^n K^*(d_i(V_j,s_j)/h_n)|Y_i|^l$. For notational convenience we drop the dependence on $l$ and $j$ in the following and observe that (51) yields
\[
|t_n^{(l)}(V,s) - t_n^{(l)}(V_j,s_j)| \le C a_n\, r_n^{(l)}(V_j,s_j). \tag{52}
\]
Since $K^*$ fulfills (K.1) except for continuity, an analogous argument as in the proof of Lemma 8 yields that $E\big(r_n^{(l)}(V_j,s_j)\big) < \infty$ by (A.3). By subtracting and adding $t_n^{(l)}(V_j,s_j)$, $E(t_n^{(l)}(V_j,s_j))$, the triangle inequality, (52) and integrability of $r_n^{(l)}$, we obtain
\begin{align}
\Big|t_n^{(l)}(V,s) - E\big(t_n^{(l)}(V,s)\big)\Big| &\le \Big|t_n^{(l)}(V,s) - t_n^{(l)}(V_j,s_j)\Big| + \Big|E\big(t_n^{(l)}(V_j,s_j) - t_n^{(l)}(V,s)\big)\Big| + \Big|t_n^{(l)}(V_j,s_j) - E\big(t_n^{(l)}(V_j,s_j)\big)\Big| \notag\\
&\le C a_n\big(|r_n| + |E(r_n)|\big) + \Big|t_n^{(l)}(V_j,s_j) - E\big(t_n^{(l)}(V_j,s_j)\big)\Big| \notag\\
&\le C a_n\big(|r_n - E(r_n)| + 2|E(r_n)|\big) + \Big|t_n^{(l)}(V_j,s_j) - E\big(t_n^{(l)}(V_j,s_j)\big)\Big| \notag\\
&\le C' a_n + |r_n - E(r_n)| + \Big|t_n^{(l)}(V_j,s_j) - E\big(t_n^{(l)}(V_j,s_j)\big)\Big| \tag{53}
\end{align}
for any $C' > 2C\, E\big(r_n^{(l)}(V_j,s_j)\big)$ and $n$ such that $C a_n \le 1$, since $a_n = o(1)$.
Summarizing, there exists $0 < C < \infty$ such that (53) holds. Then, using $\sup_{x\in A}f(x) = \max_{1\le j\le N}\sup_{x\in A_j}f(x) \le \sum_{j=1}^N\sup_{x\in A_j}f(x)$ for any partition of $A$ and continuous function $f$, subadditivity of the probability for the first inequality and (53) for the third inequality below, it holds
\begin{align}
P\Big(\sup_{V\times s\in A}\big|t_n^{(l)}(V,s) - E\big(t_n^{(l)}(V,s)\big)\big| > Ca_n\Big) &\le \sum_{j=1}^N P\Big(\sup_{V\times s\in A_j}\big|t_n^{(l)}(V,s) - E\big(t_n^{(l)}(V,s)\big)\big| > Ca_n\Big) \tag{54}\\
&\le N\max_{1\le j\le N} P\Big(\sup_{V\times s\in A_j}\big|t_n^{(l)}(V,s) - E\big(t_n^{(l)}(V,s)\big)\big| > Ca_n\Big) \notag\\
&\le N\Big(\max_{1\le j\le N} P\big(\big|t_n^{(l)}(V_j,s_j) - E\big(t_n^{(l)}(V_j,s_j)\big)\big| > Ca_n\big) + \max_{1\le j\le N} P\big(|r_n - E(r_n)| > Ca_n\big)\Big) \notag\\
&\le C\delta_n^{-(d+p)}\Big(\max_{1\le j\le N} P\big(\big|t_n^{(l)}(V_j,s_j) - E\big(t_n^{(l)}(V_j,s_j)\big)\big| > Ca_n\big) + \max_{1\le j\le N} P\big(|r_n - E(r_n)| > Ca_n\big)\Big) \notag
\end{align}
where the last inequality is due to $N \le C\delta_n^{-d}\delta_n^{-p}$ for a cover of $A$.
Finally, we bound the first and second term in the last line of (54) by the Bernstein inequality (44). For the first term in the last line of (54), let $Z_i = Y_i^l K(d_i(V_j,s_j)/h_n)$ and $S_n = \sum_i Z_i = n h_n^{(p-q)/2} t_n^{(l)}(V_j,s_j)$; then the $Z_i$ are independent, $|Z_i| \le b = M\tau_n = M/a_n$ by (K.1) and the truncation step. For $V_n = \operatorname{Var}(S_n)$, Lemma 9 yields
\[
n h_n^{(p-q)/2}\Big(\tilde b_{2l,2} - \delta - h_n^{(p-q)/2}\big((\tilde b_{l,1})^2 + \delta\big)\Big) \le V_n \le n h_n^{(p-q)/2}\Big(\tilde b_{2l,2} + \delta - h_n^{(p-q)/2}\big((\tilde b_{l,1})^2 - \delta\big)\Big)
\]
for $n$ sufficiently large. We write $n h_n^{(p-q)/2} C_1 \ge V_n$ with $C_1 = \tilde b_{2l,2} + \delta$, and set $t = C a_n n h_n^{(p-q)/2}$. The Bernstein inequality (44) yields
\begin{align*}
P\Big(\big|t_n^{(l)}(V_j,s_j) - E\big(t_n^{(l)}(V_j,s_j)\big)\big| > Ca_n\Big) &< 2\exp\left(\frac{-t^2/2}{V_n + bt/3}\right) \le 2\exp\left(\frac{-(1/2)C^2 a_n^2 n^2 h_n^{p-q}}{n h_n^{(p-q)/2}C_1 + (1/3)M\tau_n C a_n n h_n^{(p-q)/2}}\right) \\
&\le 2\exp\left(\frac{-(1/2)C^2\log(n)}{C_1 + MC/3}\right) = 2 n^{-\gamma(C)}
\end{align*}
where $a_n^2 = \log(n)/(n h_n^{(p-q)/2})$ and we define $\gamma(C) = \frac{C/2}{C_1/C + M/3}$, which is an increasing function that can be made arbitrarily large by increasing $C$.
For the second term in the last line of (54), set $Z_i = Y_i^l K^*(d_i(V_j,s_j)/h_n)$ in the Bernstein inequality (44) and proceed analogously to obtain
\[
P\Big(\big|r_n^{(l)}(V_j,s_j) - E\big(r_n^{(l)}(V_j,s_j)\big)\big| > Ca_n\Big) < 2 n^{-\gamma_2(C)},
\]
with $\gamma_2(C)$ defined analogously and likewise increasing in $C$.
By (H.1), $h_n^{(p-q)/2} \le 1$ for $n$ large, so that $\delta_n^{-1} = (a_n h_n)^{-1} \le n^{1/2} h_n^{(p-q)/4} h_n^{-1} \le n^{1/2} h_n^{-1}$. Further, (H.2) implies $1/(n h_n^{(p-q)/2}) \le 1$ for $n$ large, therefore $h_n^{-1} \le n^{2/(p-q)} \le n^2$ since $p - q \ge 1$, so that $\delta_n^{-1} \le n^{5/2}$. Therefore, (54) is smaller than $C\delta_n^{-(d+p)} n^{-\gamma(C)} \le C n^{5(d+p)/2 - \gamma(C)}$. For $C$ large enough, we have $5(d+p)/2 - \gamma(C) < 0$ and $n^{5(d+p)/2 - \gamma(C)} \to 0$. This completes the proof.
Theorem 12. Under (A.1), (A.2) and (A.4), (H.1), (K.1), and $\int_{\mathbb{R}^{p-q}}K(\|r_2\|^2)\,dr_2 = 1$,
\[
\sup_{V\times s\in A}\Big|t^{(l)}(V,s) + 1_{\{l=2\}}\eta^2 t^{(0)}(V,s) - E\big(t_n^{(l)}(V,s)\big)\Big| = O(h_n), \quad l = 0,1,2. \tag{55}
\]

Proof of Theorem 12. Let $\tilde g(r_1,r_2) = g(B^Ts + B^TVr_1 + B^TUr_2)^l f_X(s + Vr_1 + Ur_2)$, where $r_1, r_2$ satisfy the orthogonal decomposition (5). Then
\[
E\big(t_n^{(l)}(V,s)\big) = \int_{\mathbb{R}^{p-q}}K(\|r_2\|^2)\int_{\mathbb{R}^{q}}\tilde g(r_1, h_n^{1/2}r_2)\,dr_1\,dr_2 + 1_{\{l=2\}}\eta^2 E\big(t_n^{(0)}(V,s)\big) \tag{56}
\]
holds by Lemma 8 for $l = 0,1$. For $l = 2$, $Y_i^2 = g_i^2 + 2g_i\epsilon_i + \epsilon_i^2$ with $g_i = g(B^TX_i)$ and can be handled as in the case of $l = 0,1$.
Plugging in (56) the second order Taylor expansion, for some $\xi$ in the neighborhood of 0,
\[
\tilde g(r_1, h_n^{1/2}r_2) = \tilde g(r_1, 0) + h_n^{1/2}\nabla_{r_2}\tilde g(r_1,0)^T r_2 + \frac{h_n}{2}\, r_2^T\nabla^2_{r_2}\tilde g(r_1,\xi)\, r_2,
\]
yields
\begin{align*}
E\big(t_n^{(l)}(V,s)\big) &= \int_{\mathbb{R}^q}\tilde g(r_1,0)\,dr_1 + \sqrt{h_n}\left(\int_{\mathbb{R}^q}\nabla_{r_2}\tilde g(r_1,0)\,dr_1\right)^T\int_{\mathbb{R}^{p-q}}K(\|r_2\|^2)\,r_2\,dr_2 \\
&\qquad + \frac{h_n}{2}\int_{\mathbb{R}^{p-q}}K(\|r_2\|^2)\int_{\mathbb{R}^{q}} r_2^T\nabla^2_{r_2}\tilde g(r_1,\xi)\, r_2\, dr_1\, dr_2 \\
&= t^{(l)}(V,s) + h_n R(V,s)
\end{align*}
since $\int_{\mathbb{R}^q}\tilde g(r_1,0)\,dr_1 = t^{(l)}(V,s)$ and $\int_{\mathbb{R}^{p-q}}K(\|r_2\|^2)\,r_2\,dr_2 = 0\in\mathbb{R}^{p-q}$ due to $K(\|\cdot\|^2)$ being even. Let $R(V,s) = \frac{1}{2}\int_{\mathbb{R}^{p-q}}K(\|r_2\|^2)\int_{\mathbb{R}^{q}} r_2^T\nabla^2_{r_2}\tilde g(r_1,\xi)\, r_2\, dr_1\, dr_2$. By (A.4) and (A.2) it holds $|r_2^T\nabla^2_{r_2}\tilde g(r_1,\xi)\, r_2| \le C_1\|r_2\|^2$ for $C_1 = \sup_{x,y}\|\nabla^2_{r_2}\tilde g(x,y)\| < \infty$, since a continuous function over a compact set is bounded. Then, $R(V,s) \le C_1 C_2\int_{\mathbb{R}^{p-q}}K(\|r_2\|^2)\|r_2\|^2\,dr_2 < \infty$ for some $C_2 > 0$ since the integral over $r_1$ is over a compact set by (A.4).
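The $O(h_n)$ bias rate in Theorem 12 can be seen numerically in a one-dimensional toy version of the argument: with the kernel applied to the squared distance $(x-s)^2$ and bandwidth $h$, the smoothing bias of the population-level local average decays roughly linearly in $h$, since the first-order Taylor term integrates to zero by symmetry. The link function, design density and grid below are our own illustrative choices.
\begin{verbatim}
import numpy as np

g = lambda x: np.sin(2.0 * x)                       # link function
f = lambda x: np.where(np.abs(x) <= 1.0, 0.5, 0.0)  # Uniform(-1, 1) design density
s = 0.3
x = np.linspace(-1.0, 1.0, 200001)                  # fine grid; Riemann sums approximate the integrals

for h in (0.1, 0.05, 0.025, 0.0125):
    K = np.exp(-(((x - s)**2) / h)**2 / 2)          # K(d/h) with d = (x - s)^2
    local_mean = np.sum(g(x) * K * f(x)) / np.sum(K * f(x))
    bias = local_mean - g(s)
    print("h=%7.4f  bias=%9.6f  bias/h=%7.4f" % (h, bias, bias / h))
\end{verbatim}
As $h$ shrinks, the ratio bias$/h$ approximately stabilizes, in line with the order-$h_n$ bound above.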
Lemma 13 follows directly from Theorems 11 and 12 and the triangle inequality.

Lemma 13. Suppose (A.1), (A.2), (A.3), (A.4), (K.1), (K.2), (H.1) hold. If $a_n = \big(\log(n)/(n h_n^{(p-q)/2})\big)^{1/2} = o(1)$ and $a_n/h_n^{(p-q)/2} = O(1)$, then for $l = 0,1,2$,
\[
\sup_{V\times s\in A}\Big|t^{(l)}(V,s) + 1_{\{l=2\}}\eta^2 t^{(0)}(V,s) - t_n^{(l)}(V,s)\Big| = O_P(a_n + h_n).
\]
Combining the results of Theorems 11 and 12 and Lemma 13 obtains Theorem 14.
Theorem 14.
Suppose (A.1), (A.2), (A.3), (A.4), (K.1), (K.2), (H.1) hold. Let $a_n = \big(\log(n)/(n h_n^{(p-q)/2})\big)^{1/2} = o(1)$, $a_n/h_n^{(p-q)/2} = O(1)$, $\delta_n = \inf_{V\times s\in A_n} t^{(0)}(V,s)$, where $t^{(0)}(V,s)$ is defined in (11), and $A_n = \mathcal{S}(p,q)\times\{x\in\operatorname{supp}(f_X): |x - \partial\operatorname{supp}(f_X)| \ge b_n\}$, where $\partial\mathcal{C}$ denotes the boundary of the set $\mathcal{C}$ and $|x - \mathcal{C}| = \inf_{r\in\mathcal{C}}|x - r|$, for a sequence $b_n\to 0$ so that $\delta_n^{-1}(a_n + h_n)\to 0$ for any bandwidth $h_n$ that satisfies the assumptions. Then,
\[
\sup_{V\times s\in A_n}\Big|\bar y_l(V,s) - \mu_l(V,s) - 1_{\{l=2\}}\eta^2\Big| = O_P\big(\delta_n^{-1}(a_n + h_n)\big), \quad l = 0,1,2,
\]
and
\[
\sup_{V\times s\in A_n}\Big|\tilde L_n(V,s) - \tilde L(V,s)\Big| = O_P\big(\delta_n^{-1}(a_n + h_n)\big) \tag{57}
\]
where $\bar y_l(V,s)$, $\mu_l(V,s)$, $\tilde L_n(V,s)$ and $\tilde L(V,s)$ are defined in (38), (10), (18) and (9), respectively.

Proof of Theorem 14.
\[
\bar y_l(V,s) = \frac{t_n^{(l)}(V,s)}{t_n^{(0)}(V,s)} = \frac{t_n^{(l)}(V,s)/t^{(0)}(V,s)}{t_n^{(0)}(V,s)/t^{(0)}(V,s)}.
\]
We consider the numerator and denominator separately. By Lemma 13,
\[
\sup_{V\times s\in A_n}\left|\frac{t_n^{(0)}(V,s)}{t^{(0)}(V,s)} - 1\right| \le \frac{\sup_A |t_n^{(0)}(V,s) - t^{(0)}(V,s)|}{\inf_{A_n} t^{(0)}(V,s)} = O_P\big(\delta_n^{-1}(a_n + h_n)\big).
\]
Next,
\[
\sup_{V\times s\in A_n}\left|\frac{t_n^{(l)}(V,s)}{t^{(0)}(V,s)} - \mu_l(V,s)\right| \le \frac{\sup_A |t_n^{(l)}(V,s) - t^{(l)}(V,s)|}{\inf_{A_n} t^{(0)}(V,s)} = O_P\big(\delta_n^{-1}(a_n + h_n)\big).
\]
Therefore, by $A_n\uparrow A = \mathcal{S}(p,q)\times\operatorname{supp}(f_X)$ we get
\[
\lim_{n\to\infty}\sup_{V\times s\in A_n}\left|\frac{t_n^{(l)}(V,s)}{t^{(0)}(V,s)} - \mu_l(V,s)\right| = \lim_{n\to\infty}\sup_{V\times s\in A}\left|\frac{t_n^{(l)}(V,s)}{t^{(0)}(V,s)} - \mu_l(V,s)\right|
\]
and in total we obtain
\[
\bar y_l(V,s) = \frac{t_n^{(l)}(V,s)/t^{(0)}(V,s)}{t_n^{(0)}(V,s)/t^{(0)}(V,s)} = \frac{\mu_l + O_P(\delta_n^{-1}(a_n + h_n))}{1 + O_P(\delta_n^{-1}(a_n + h_n))} = \mu_l + O_P\big(\delta_n^{-1}(a_n + h_n)\big).
\]
For $l = 2$, $Y_i^2 = g(B^TX_i)^2 + 2g(B^TX_i)\epsilon_i + \epsilon_i^2$, and (57) follows from (9).
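For concreteness, the quantities appearing in Theorem 14 can be computed directly from their definitions: the weights $w_i(V,s)$, the weighted moments $\bar y_1, \bar y_2$ of (38), and the local criterion $\tilde L_n(V,s) = \bar y_2 - \bar y_1^2$ of (18). The sketch below (same illustrative Gaussian kernel and squared orthogonal distance as before; names are ours) also hints at why the criterion is informative: when $\operatorname{span}\{V\} = \operatorname{span}\{B\}^{\perp}$, observations close to $s + \operatorname{span}\{V\}$ share essentially the same value of $B^TX$, so the local variance is driven mainly by the error variance, whereas a misaligned $V$ mixes different level sets of $g$.
\begin{verbatim}
import numpy as np

def local_stats(V, s, X, Y, h):
    # weights w_i(V, s), weighted moments ybar_1, ybar_2 and tilde L_n = ybar_2 - ybar_1^2
    Xc = X - s
    d = np.sum(Xc**2, axis=1) - np.sum((Xc @ V)**2, axis=1)
    K = np.exp(-(d / h)**2 / 2)
    w = K / K.sum()
    y1 = np.sum(w * Y)
    y2 = np.sum(w * Y**2)
    return y1, y2, y2 - y1**2

rng = np.random.default_rng(3)
n, p = 500, 3
B = np.array([[1.0], [0.0], [0.0]])
X = rng.uniform(-1.0, 1.0, (n, p))
Y = (X @ B).ravel()**2 + 0.1 * rng.standard_normal(n)

V_perp = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])    # spans span{B}^perp
V_rand = np.linalg.qr(rng.standard_normal((p, 2)))[0]      # an arbitrary element of S(p, 2)
s = X[0]
print("tilde L_n, V = span{B}^perp :", local_stats(V_perp, s, X, Y, h=0.2)[2])
print("tilde L_n, V arbitrary      :", local_stats(V_rand, s, X, Y, h=0.2)[2])
\end{verbatim}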
Lemma 15. Under (A.1), (A.2), (A.4), there exists $0 < C < \infty$ such that
\[
|\mu_l(V,s) - \mu_l(V_j,s)| \le C\|P_V - P_{V_j}\| \tag{58}
\]
for all $s\in\operatorname{supp}(f_X)$.
Proof. From the representation $\tilde t^{(l)}(P_V,s)$ in (14) instead of $t^{(l)}(V,s)$, we consider $\mu_l(V,s) = \mu_l(P_V,s)$ as a function on the Grassmann manifold. Then,
\begin{align}
\big|\mu_l(P_V,s) - \mu_l(P_{V_j},s)\big| &= \left|\frac{\tilde t^{(l)}(P_V,s)}{\tilde t^{(0)}(P_V,s)} - \frac{\tilde t^{(l)}(P_{V_j},s)}{\tilde t^{(0)}(P_{V_j},s)}\right| \notag\\
&\le \frac{\sup|\tilde t^{(0)}(P_V,s)|}{\big(\inf\tilde t^{(0)}(P_V,s)\big)^2}\,\Big|\tilde t^{(l)}(P_V,s) - \tilde t^{(l)}(P_{V_j},s)\Big| + \frac{\sup\tilde t^{(l)}(P_V,s)}{\big(\inf\tilde t^{(0)}(P_V,s)\big)^2}\,\Big|\tilde t^{(0)}(P_V,s) - \tilde t^{(0)}(P_{V_j},s)\Big| \tag{59}
\end{align}
with $\sup_{P_V\in Gr(p,q)}\tilde t^{(0)}(P_V,s)\in(0,\infty)$ and $\inf_{P_V\in Gr(p,q)}\tilde t^{(0)}(P_V,s)\in(0,\infty)$, since $\tilde t^{(l)}$ is continuous, $\Sigma_x > 0$ and $s\in\operatorname{supp}(f_X)$.
By (A.2), $\tilde g(x) = g(B^Tx)^l f_X(x)$ is twice continuously differentiable and therefore Lipschitz continuous on compact sets. We denote its Lipschitz constant by $L < \infty$. Therefore,
\[
\Big|\tilde t^{(l)}(P_V,s) - \tilde t^{(l)}(P_{V_j},s)\Big| \le \int_{\operatorname{supp}(f_X)}\big|\tilde g(s + P_V r) - \tilde g(s + P_{V_j} r)\big|\,dr \le L\int_{\operatorname{supp}(f_X)}\|(P_V - P_{V_j})r\|\,dr \le L\left(\int_{\operatorname{supp}(f_X)}\|r\|\,dr\right)\|P_V - P_{V_j}\| \tag{60}
\]
where the last inequality is due to the sub-multiplicativity of the Frobenius norm and the integral being finite by (A.4). Plugging (60) in (59) and collecting all constants into $C$ yields (58).
Proof of Theorem 3. By (19) and (7),
\[
|L_n(V) - L(V)| \le \left|\frac{1}{n}\sum_i\Big(\tilde L_n(V,X_i) - \tilde L(V,X_i)\Big)\right| + \left|\frac{1}{n}\sum_i\Big(\tilde L(V,X_i) - E(\tilde L(V,X))\Big)\right|. \tag{61}
\]
The first term on the right hand side of (61) goes to 0 in probability uniformly in $V$ by Theorem 14,
\[
\left|\frac{1}{n}\sum_i\Big(\tilde L_n(V,X_i) - \tilde L(V,X_i)\Big)\right| \le \sup_{V\times s\in A}\Big|\tilde L_n(V,s) - \tilde L(V,s)\Big| = O_P\big(\delta_n^{-1}(a_n + h_n)\big). \tag{62}
\]
The second term in (61) converges to 0 almost surely for all $V\in\mathcal{S}(p,q)$ by the strong law of large numbers. In order to show uniform convergence, the same technique as in the proof of Theorem 11 is used. Let $B_j = \{V\in\mathcal{S}(p,q): \|VV^T - V_jV_j^T\| \le \tilde a_n\}$ be a cover of $\mathcal{S}(p,q)$, $\mathcal{S}(p,q)\subset\bigcup_{j=1}^N B_j$, with $N \le C\tilde a_n^{-d} = C(n/\log(n))^{d/2} \le Cn^{d/2}$, where $\tilde a_n = (\log(n)/n)^{1/2}$ and $d = \dim(\mathcal{S}(p,q))$ is defined in the proof of Theorem 11. By Lemma 15,
\[
|\mu_l(V,X_i) - \mu_l(V_j,X_i)| \le C\|P_V - P_{V_j}\|. \tag{63}
\]
Let $G_n(V) = \sum_i\tilde L(V,X_i)/n$ with $E(G_n(V)) = L(V)$. Using (63) and following the same steps as in the proof of Theorem 11 we obtain
\[
|G_n(V) - L(V)| \le |G_n(V) - G_n(V_j)| + |G_n(V_j) - L(V_j)| + |L(V) - L(V_j)| \le C\tilde a_n + |G_n(V_j) - L(V_j)| \tag{64}
\]
for $V\in B_j$ and some $C > 0$. Inequality (64) leads to
\begin{align}
P\left(\sup_{V\in\mathcal{S}(p,q)}|G_n(V) - L(V)| > C\tilde a_n\right) &\le N\max_{1\le j\le N} P\left(\sup_{V\in B_j}|G_n(V) - L(V)| > C\tilde a_n\right) \notag\\
&\le Cn^{d/2}\max_{1\le j\le N} P\big(|G_n(V_j) - L(V_j)| > C\tilde a_n\big) \le Cn^{d/2}\, n^{-\gamma(C)} \to 0 \tag{65}
\end{align}
where the last inequality in (65) is due to the Bernstein inequality (44) with $Z_i = \tilde L(V_j,X_i)$, which is bounded since $\tilde L(\cdot,\cdot)$ is continuous on the compact set $A$, and $\gamma(C)$ a monotone increasing function of $C$ that can be made arbitrarily large by choosing $C$ accordingly. Therefore, $\sup_{V\in\mathcal{S}(p,q)}|L_n(V) - L(V)| = O_P\big(\delta_n^{-1}(a_n + h_n) + \tilde a_n\big)$ with $\delta_n = \inf_{V\times s\in A_n}t^{(0)}(V,s)$, where $t^{(0)}(V,s)$ is defined in (11), and $A_n = \mathcal{S}(p,q)\times\{x\in\operatorname{supp}(f_X): f_X(x)\ge b_n\}$ for a sequence $b_n\to 0$ so that $\delta_n^{-1}(a_n + h_n)\to 0$ for any bandwidth $h_n$ that satisfies the assumptions, which implies (20).
Proof of Theorem 4. We apply Theorem 4.1.1 of [2] to obtain consistency of the conditional variance estimator. This theorem requires three conditions that guarantee the convergence of the minimizer of a sequence of random functions $L_n(P_V)$ to the minimizer of the limiting function $L(P_V)$; i.e., $P_{\operatorname{span}\{\widehat B\}^\perp} = \operatorname{argmin} L_n(P_V) \to P_{\operatorname{span}\{B\}^\perp} = \operatorname{argmin} L(P_V)$ in probability. To apply the theorem, three conditions have to be met: (1) the parameter space is compact; (2) $L_n(V)$ is continuous and a measurable function of the data $(Y_i, X_i^T)_{i=1,\ldots,n}$; and (3) $L_n(V)$ converges uniformly to $L(V)$ and $L(V)$ attains a unique global minimum at $\operatorname{span}\{B\}^\perp$.
Since $L_n(V)$ depends on $V$ only through $P_V = VV^T$, $L_n(V)$ can be considered as a function on the Grassmann manifold, which is compact, and the same holds true for $L(V)$ by (14). Further, $L_n(V)$ is by definition a measurable function of the data and continuous in $V$ if a continuous kernel is used, such as the Gaussian. Theorem 3 obtains the uniform convergence and Theorem 1 that the minimizer is unique when $L(V)$ is minimized over the Grassmann manifold $Gr(p,q)$, since $\operatorname{span}\{B\}$ is uniquely identifiable and so is $\operatorname{span}\{B\}^\perp$ (i.e. $\|P_{\operatorname{span}\{\widehat B\}} - P_{\operatorname{span}\{B\}}\| = \|\widehat B\widehat B^T - BB^T\| = \|(I_p - BB^T) - (I_p - \widehat B\widehat B^T)\| = \|P_{\operatorname{span}\{\widehat B\}^\perp} - P_{\operatorname{span}\{B\}^\perp}\|$). Thus, all three conditions are met and the result is obtained.
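The consistency statement can be visualized in a toy case where the minimizer of $L_n$ is found by brute force. With $p = 2$, $k = 1$ (so $q = 1$) and $V = (\cos\theta, \sin\theta)^T$, the grid search below locates $\operatorname{argmin}_\theta L_n(V)$ and compares the minimizing direction with $\operatorname{span}\{B\}^{\perp}$. This is only an illustration of the limiting behaviour, with our own choices of $g$, kernel and bandwidth; it is not the optimization algorithm used for the conditional variance estimator.
\begin{verbatim}
import numpy as np

def L_n(theta, X, Y, h):
    # sample objective L_n(V) = (1/n) sum_i tilde L_n(V, X_i) for V = (cos theta, sin theta)^T
    V = np.array([np.cos(theta), np.sin(theta)])
    total = 0.0
    for s in X:
        Xc = X - s
        d = np.sum(Xc**2, axis=1) - (Xc @ V)**2     # squared distance to the line s + span{V}
        w = np.exp(-(d / h)**2 / 2)
        w = w / w.sum()
        y1 = np.sum(w * Y)
        total += np.sum(w * Y**2) - y1**2            # tilde L_n(V, s)
    return total / X.shape[0]

rng = np.random.default_rng(4)
n = 300
B = np.array([1.0, 0.0])                             # model (1) with p = 2, k = 1
X = rng.uniform(-1.0, 1.0, (n, 2))
Y = np.sin(2.0 * X @ B) + 0.1 * rng.standard_normal(n)

grid = np.linspace(0.0, np.pi, 181)
vals = [L_n(th, X, Y, h=0.1) for th in grid]
theta_hat = grid[int(np.argmin(vals))]
print("minimizing direction:", np.round([np.cos(theta_hat), np.sin(theta_hat)], 3))
print("span{B}^perp direction: [0, 1] (up to sign)")
\end{verbatim}
In repeated runs the minimizing direction should concentrate around $(0,1)^T$, i.e. around $\operatorname{span}\{B\}^{\perp}$, mirroring the convergence of $\operatorname{argmin} L_n(P_V)$ to $\operatorname{argmin} L(P_V)$ established above.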
Proof of Theorem 6.
The Gaussian kernel $K$ satisfies $\partial_z K(z) = -zK(z)$. From (17) and (18) we have $\tilde L_n = \bar y_2 - \bar y_1^2$, where $\bar y_l = \sum_i w_i Y_i^l$, $l = 1,2$. We let $K_j = K(d_j(V,s)/h_n)$, suppress the dependence on $V$ and $s$, and write $w_i = K_i/\sum_j K_j$. Then, $\nabla K_i = -(1/h_n^2) K_i d_i \nabla d_i$ and
\[
\nabla w_i = -\frac{K_i d_i\nabla d_i\sum_j K_j - K_i\sum_j K_j d_j\nabla d_j}{h_n^2\big(\sum_j K_j\big)^2}.
\]
Next,
\begin{align}
\nabla\bar y_l &= -\frac{1}{h_n^2}\sum_i Y_i^l\,\frac{K_i d_i\nabla d_i\sum_j K_j - K_i\sum_j K_j d_j\nabla d_j}{\big(\sum_j K_j\big)^2} \notag\\
&= -\frac{1}{h_n^2}\left(\sum_i Y_i^l w_i d_i\nabla d_i - \sum_j Y_j^l w_j\sum_i w_i d_i\nabla d_i\right) = -\frac{1}{h_n^2}\sum_i\big(Y_i^l - \bar y_l\big)\, w_i d_i\nabla d_i. \tag{66}
\end{align}
Then, $\nabla\tilde L_n = \nabla\bar y_2 - 2\bar y_1\nabla\bar y_1$, and inserting $\nabla\bar y_l$ from (66) yields
\[
\nabla\tilde L_n = -\frac{1}{h_n^2}\sum_i\big(Y_i^2 - \bar y_2 - 2\bar y_1(Y_i - \bar y_1)\big)\, w_i d_i\nabla d_i = \frac{1}{h_n^2}\sum_i\big(\tilde L_n - (Y_i - \bar y_1)^2\big)\, w_i d_i\nabla d_i,
\]
since $Y_i^2 - \bar y_2 - 2\bar y_1(Y_i - \bar y_1) = (Y_i - \bar y_1)^2 - \tilde L_n$.
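The gradient formula just derived is straightforward to verify numerically. The sketch below implements $\nabla_V\tilde L_n(V,s) = h_n^{-2}\sum_i\big(\tilde L_n - (Y_i - \bar y_1)^2\big)w_i d_i\,\nabla_V d_i$ with $\nabla_V d_i = -2(X_i - s)(X_i - s)^TV$, which follows from $d_i(V,s) = \|X_i - s\|^2 - \|V^T(X_i - s)\|^2$, and checks one entry against a central finite difference. The gradient is taken with respect to an unconstrained matrix $V$, i.e. before any projection onto the Stiefel manifold; all names are illustrative.
\begin{verbatim}
import numpy as np

def Ltilde_and_grad(V, s, X, Y, h):
    # tilde L_n(V, s) and its gradient in V for the Gaussian kernel K(z) = exp(-z^2/2)
    Xc = X - s
    d = np.sum(Xc**2, axis=1) - np.sum((Xc @ V)**2, axis=1)
    K = np.exp(-(d / h)**2 / 2)
    w = K / K.sum()
    y1 = np.sum(w * Y)
    L = np.sum(w * Y**2) - y1**2
    coef = (L - (Y - y1)**2) * w * d / h**2                 # (1/h^2)(L - (Y_i - ybar_1)^2) w_i d_i
    grad_d = -2.0 * Xc[:, :, None] * (Xc @ V)[:, None, :]   # nabla_V d_i = -2 (X_i - s)(X_i - s)^T V
    return L, np.sum(coef[:, None, None] * grad_d, axis=0)

rng = np.random.default_rng(5)
n, p, q, h = 200, 3, 2, 0.3
X = rng.uniform(-1.0, 1.0, (n, p))
Y = np.sin(2.0 * X[:, 0]) + 0.1 * rng.standard_normal(n)
V = np.linalg.qr(rng.standard_normal((p, q)))[0]
s = X[0]

L0, G = Ltilde_and_grad(V, s, X, Y, h)
eps = 1e-6
Vp = V.copy(); Vp[1, 0] += eps
Vm = V.copy(); Vm[1, 0] -= eps
fd = (Ltilde_and_grad(Vp, s, X, Y, h)[0] - Ltilde_and_grad(Vm, s, X, Y, h)[0]) / (2 * eps)
print("analytic:", G[1, 0], "  finite difference:", fd)    # the two values should agree closely
\end{verbatim}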